OpenClaw-RL

Type: full-code · Vendor: Gen-Verse · Language: Python · License: Apache-2.0 · Status: active · Status in practice: experimental · First released: 2026-02-26

Links: homepage docs repo

Train personalised LLM agents by turning live multi-turn conversations into fully-asynchronous RL training signals across terminal, GUI, software-engineering, and tool-call settings.

Description. OpenClaw-RL is an Apache-2.0 reinforcement-learning framework from Gen-Verse that wraps a self-hosted model as an OpenAI-compatible API, intercepts live conversations through the OpenClaw plugin, and runs four async loops (agent serving, rollout collection, PRM/judge evaluation, policy training) that continuously optimise the policy without interrupting usage. Two paradigms are supported: Binary RL (GRPO with a Process Reward Model) and On-Policy Distillation (OPD via a judge model that emits textual hints), plus a Hybrid combination. Technical report on arXiv (2603.10165, 2026-03-10) reached #1 on HuggingFace Daily Papers.

Agent loop shape. Four independent asynchronous loops (serving, rollout, judge, training) instead of a single synchronous agent loop. Conversation traffic flowing through an OpenAI-compatible wrapper feeds the rollout collector; the trainer updates the policy in the background while serving and judging continue concurrently.

Primary use cases

personalising a self-hosted agent from a single user's conversational feedback
scaling RL training across terminal, GUI, SWE, and tool-call agent environments
continuously updating a deployed policy without taking the inference endpoint offline

flowchart TD USER[User conversation] --> PLUGIN[OpenClaw plugin<br/>OpenAI-compatible wrapper] PLUGIN --> SERVE[Async loop: Agent serving] SERVE --> ROLLOUT[Async loop: Rollout collection] ROLLOUT --> JUDGE[Async loop: PRM / Judge<br/>scores each turn + majority voting] JUDGE -->|Binary RL signal| TRAINER[Async loop: Policy training] JUDGE -->|OPD textual hints| TRAINER TRAINER -->|updated policy| SERVE PLUGIN -.->|tool calls| TOOLS[(Terminal / GUI / SWE / tool-call envs)] TOOLS -.->|next-state observation| ROLLOUT

Key concepts

OpenClaw plugin (docs) — OpenAI-compatible wrapper around a self-hosted model that intercepts live multi-turn conversations and forwards them into the RL training pipeline.
PRM / Judge → agent-as-judge (docs) — Process Reward Model and Judge component that scores each turn asynchronously, with majority voting for robustness.
Binary RL (GRPO) (docs) — Per-turn scalar reward from the PRM combined with GRPO advantage estimation and a PPO-style clipped surrogate loss.
On-Policy Distillation (OPD) (docs) — Judge extracts a textual hint from next-state hindsight; the token-level log-prob gap between teacher and student becomes a directional advantage signal.
Fully-asynchronous 4-component architecture → evaluator-optimizer (docs) — Serving, rollout, judge, and training run as independent loops that do not block one another.

Patterns this full-code implements —

Neighbourhood

Click any neighbour to follow the lineage. Scroll to zoom, drag to pan.

Alternatives & relatives

full-code · framework
OpenClaw
complements
Same brand and openclaw.ai homepage; OpenClaw-RL is the RL training framework that wraps a model in the OpenClaw plugin so OpenClaw can intercept and learn from conversations.

OpenClaw-RL

Neighbourhood

Alternatives & relatives

Listed as alternative by (1)

References

Provenance