OpenClaw-RL
Type: full-code · Vendor: Gen-Verse · Language: Python · License: Apache-2.0 · Status: active · Status in practice: experimental · First released: 2026-02-26
Train personalised LLM agents by turning live multi-turn conversations into fully-asynchronous RL training signals across terminal, GUI, software-engineering, and tool-call settings.
Description. OpenClaw-RL is an Apache-2.0 reinforcement-learning framework from Gen-Verse that wraps a self-hosted model as an OpenAI-compatible API, intercepts live conversations through the OpenClaw plugin, and runs four async loops (agent serving, rollout collection, PRM/judge evaluation, policy training) that continuously optimise the policy without interrupting usage. Two paradigms are supported: Binary RL (GRPO with a Process Reward Model) and On-Policy Distillation (OPD via a judge model that emits textual hints), plus a Hybrid combination of the two. The technical report on arXiv (2603.10165, 2026-03-10) reached #1 on Hugging Face Daily Papers.
Agent loop shape. Four independent asynchronous loops (serving, rollout, judge, training) instead of a single synchronous agent loop. Conversation traffic flowing through an OpenAI-compatible wrapper feeds the rollout collector; the trainer updates the policy in the background while serving and judging continue concurrently.
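The decoupling described above can be sketched with plain asyncio queues. This is a minimal illustration, not OpenClaw-RL's actual code: the function names, the queue wiring, the `None` end-of-stream sentinel, and the placeholder reward rule are all assumptions made for the example.

```python
import asyncio

async def serve(conversations, rollout_q):
    """Serving loop: answer each turn, then pass the exchange on for rollout collection."""
    for turn in conversations:
        response = f"reply to: {turn}"   # stand-in for the policy model (hypothetical)
        await rollout_q.put((turn, response))
        await asyncio.sleep(0)           # yield so the other loops can make progress
    await rollout_q.put(None)            # sentinel: no more traffic

async def collect_rollouts(rollout_q, judge_q):
    """Rollout loop: forward collected exchanges to the judge without blocking serving."""
    while (item := await rollout_q.get()) is not None:
        await judge_q.put(item)
    await judge_q.put(None)

async def judge(judge_q, train_q):
    """Judge loop: attach a per-turn reward (placeholder rule, not a real PRM)."""
    while (item := await judge_q.get()) is not None:
        turn, response = item
        reward = 1.0 if "thanks" in turn else 0.0
        await train_q.put((turn, response, reward))
    await train_q.put(None)

async def train(train_q, log):
    """Training loop: consume scored samples; a real trainer would take gradient steps."""
    while (sample := await train_q.get()) is not None:
        log.append(sample)

async def run_pipeline(conversations):
    rollout_q, judge_q, train_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    log = []
    # All four loops run concurrently; each only waits on its own input queue.
    await asyncio.gather(
        serve(conversations, rollout_q),
        collect_rollouts(rollout_q, judge_q),
        judge(judge_q, train_q),
        train(train_q, log),
    )
    return log

trained = asyncio.run(run_pipeline(["fix the bug", "thanks, that worked"]))
```

Queues are used here because they let each stage run at its own pace; a production system would more likely place these loops in separate processes or services.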
Primary use cases
- personalising a self-hosted agent from a single user's conversational feedback
- scaling RL training across terminal, GUI, SWE, and tool-call agent environments
- continuously updating a deployed policy without taking the inference endpoint offline
Key concepts
- OpenClaw plugin (docs) — OpenAI-compatible wrapper around a self-hosted model that intercepts live multi-turn conversations and forwards them into the RL training pipeline.
- PRM / Judge → agent-as-judge (docs) — Process Reward Model and Judge component that scores each turn asynchronously, with majority voting for robustness.
- Binary RL (GRPO) (docs) — Per-turn scalar reward from the PRM combined with GRPO advantage estimation and a PPO-style clipped surrogate loss.
- On-Policy Distillation (OPD) (docs) — Judge extracts a textual hint from next-state hindsight; the token-level log-prob gap between teacher and student becomes a directional advantage signal.
- Fully-asynchronous 4-component architecture → evaluator-optimizer (docs) — Serving, rollout, judge, and training run as independent loops that do not block one another.
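The two training signals named above can be illustrated numerically. The sketch below uses the commonly published forms of the GRPO group-relative advantage and the PPO clipped surrogate; whether OpenClaw-RL uses exactly this normalisation, clip range, or sign convention for the OPD gap is an assumption, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardise rewards within one group of samples."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one token (a quantity to maximise)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def opd_token_advantages(teacher_logps, student_logps):
    """OPD-style signal (assumed form): per-token log-prob gap, teacher minus student.

    Positive values mark tokens the hint-aware teacher prefers more than the student does.
    """
    return [t - s for t, s in zip(teacher_logps, student_logps)]

# Binary PRM rewards for four sampled responses to the same prompt.
advantages = grpo_advantages([1.0, 0.0, 1.0, 1.0])
objective = clipped_surrogate(logp_new=-1.0, logp_old=-1.5, advantage=1.0)
gaps = opd_token_advantages([-0.1, -2.0], [-0.5, -1.0])
```

With binary rewards the single failing response receives a large negative advantage while the three successes share a smaller positive one, and the clip keeps one update from moving the policy too far from the sampling policy.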
Patterns this full-code implements
- ★Agent-as-a-Judge
An asynchronous Process Reward Model / judge model scores each turn; majority voting is used when needed for robust scoring. Judge-emitted hints also drive the OPD training signal.
- ★★Evaluator-Optimizer
Evaluator (judge) and optimizer (trainer) are decoupled into independent async loops, so judging happens concurrently with new interactions and training runs in the background while serving continues.
- ★★Event-Driven Agent
Live multi-turn conversation traffic, intercepted through the OpenClaw OpenAI-compatible plugin, is the event source feeding rollout collection and downstream training; nothing is batch-scheduled.
- ★Process Reward Model
Per-turn scalar reward from a Process Reward Model drives the Binary RL paradigm; GRPO advantage estimation plus a PPO-style clipped surrogate loss train the policy.
- ★★Tool Use
Track 2 explicitly targets tool-call agents (alongside terminal, GUI, SWE) for real-world settings; supported as a training environment rather than a runtime feature.
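The majority voting mentioned under Agent-as-a-Judge can be sketched as follows. The verdict encoding (integers such as 1, 0, -1) and the tie-breaking rule are assumptions chosen for illustration; the source does not specify them.

```python
from collections import Counter

def majority_vote(votes, tie_value=0):
    """Return the most common judge verdict.

    Ties and empty input fall back to `tie_value`, a hypothetical
    "no reward" default chosen for this sketch.
    """
    if not votes:
        return tie_value
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_value
    return ranked[0][0]

# Three independent judge calls score the same turn.
verdict = majority_vote([1, 1, 0])
```

Sampling an odd number of judge calls avoids most ties when verdicts are binary.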