Full-Code · Agent SDKsactive

OpenClaw-RL

Type: full-code  ·  Vendor: Gen-Verse  ·  Language: Python  ·  License: Apache-2.0  ·  Status: active  ·  Status in practice: experimental  ·  First released: 2026-02-26

Links: homepage docs repo

Train personalised LLM agents by turning live multi-turn conversations into fully-asynchronous RL training signals across terminal, GUI, software-engineering, and tool-call settings.

Description. OpenClaw-RL is an Apache-2.0 reinforcement-learning framework from Gen-Verse that wraps a self-hosted model as an OpenAI-compatible API, intercepts live conversations through the OpenClaw plugin, and runs four async loops (agent serving, rollout collection, PRM/judge evaluation, policy training) that continuously optimise the policy without interrupting usage. Two paradigms are supported: Binary RL (GRPO with a Process Reward Model) and On-Policy Distillation (OPD via a judge model that emits textual hints), plus a Hybrid combination. Technical report on arXiv (2603.10165, 2026-03-10) reached #1 on HuggingFace Daily Papers.

Agent loop shape. Four independent asynchronous loops (serving, rollout, judge, training) instead of a single synchronous agent loop. Conversation traffic flowing through an OpenAI-compatible wrapper feeds the rollout collector; the trainer updates the policy in the background while serving and judging continue concurrently.

Primary use cases

  • personalising a self-hosted agent from a single user's conversational feedback
  • scaling RL training across terminal, GUI, SWE, and tool-call agent environments
  • continuously updating a deployed policy without taking the inference endpoint offline

Key concepts

  • OpenClaw plugin (docs)OpenAI-compatible wrapper around a self-hosted model that intercepts live multi-turn conversations and forwards them into the RL training pipeline.
  • PRM / Judge agent-as-judge (docs)Process Reward Model and Judge component that scores each turn asynchronously, with majority voting for robustness.
  • Binary RL (GRPO) (docs)Per-turn scalar reward from the PRM combined with GRPO advantage estimation and a PPO-style clipped surrogate loss.
  • On-Policy Distillation (OPD) (docs)Judge extracts a textual hint from next-state hindsight; the token-level log-prob gap between teacher and student becomes a directional advantage signal.
  • Fully-asynchronous 4-component architecture evaluator-optimizer (docs)Serving, rollout, judge, and training run as independent loops that do not block one another.

Patterns this full-code implements

Neighbourhood

Click any neighbour to follow the lineage. Scroll to zoom, drag to pan.

Listed as alternative by (1)

References

Provenance

  • Last analyzed:
  • Last updated:
  • Verification status: partial