OpenClaw-RL
Train personalised LLM agents by turning live multi-turn conversations into fully asynchronous RL training signals across terminal, GUI, software-engineering, and tool-call settings.
Description
OpenClaw-RL is an Apache-2.0 reinforcement-learning framework from Gen-Verse. It wraps a self-hosted model as an OpenAI-compatible API, intercepts live conversations through the OpenClaw plugin, and runs four async loops (agent serving, rollout collection, PRM/judge evaluation, policy training) that continuously optimise the policy without interrupting usage. It supports two paradigms, Binary RL (GRPO with a Process Reward Model) and On-Policy Distillation (OPD, in which a judge model emits textual hints), plus a Hybrid mode that combines them. The technical report on arXiv (2603.10165, 2026-03-10) reached #1 on HuggingFace Daily Papers.
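As a rough illustration of the Binary RL paradigm, the sketch below shows the group-relative advantage step commonly associated with GRPO: rewards for several responses to the same prompt are normalised by the group mean and standard deviation. The function name, the `eps` parameter, and the example rewards are assumptions for illustration; how OpenClaw-RL forms groups or scores turns with its PRM is not specified here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of responses to the same prompt.

    Hypothetical helper: this is the standard GRPO-style normalisation,
    not code taken from OpenClaw-RL.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary rewards (1.0 = judged good, 0.0 = judged bad) for four sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses scored above the group mean receive a positive advantage and those below receive a negative one, so a binary reward is enough to produce a usable training signal.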
Solution
OpenClaw-RL replaces a single synchronous agent loop with four independent asynchronous loops (serving, rollout, judge, training). Conversation traffic flowing through the OpenAI-compatible wrapper feeds the rollout collector, and the trainer updates the policy in the background while serving and judging continue concurrently.
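The four-loop layout can be sketched with `asyncio` queues connecting the stages. This is a minimal toy under stated assumptions, not the framework's implementation: every function name, the queue wiring, the string "reply", and the length-based reward are hypothetical stand-ins for model inference, the PRM/judge, and a real optimiser step.

```python
import asyncio


async def serve(turns, rollout_q):
    # Serving loop: answers each user turn and forwards the exchange.
    for turn in turns:
        reply = f"reply to {turn}"  # stand-in for model inference
        await rollout_q.put((turn, reply))
        await asyncio.sleep(0)  # yield so the other loops can run
    await rollout_q.put(None)  # sentinel: no more traffic


async def collect(rollout_q, judge_q):
    # Rollout loop: packages exchanges into training samples.
    while (item := await rollout_q.get()) is not None:
        await judge_q.put({"prompt": item[0], "response": item[1]})
    await judge_q.put(None)


async def judge(judge_q, train_q):
    # Judge loop: attaches a reward (toy rule, stand-in for a PRM/judge model).
    while (sample := await judge_q.get()) is not None:
        sample["reward"] = 1.0 if len(sample["prompt"]) % 2 == 0 else 0.0
        await train_q.put(sample)
    await train_q.put(None)


async def train(train_q, state):
    # Training loop: bumps a version counter in place of a gradient step.
    while (sample := await train_q.get()) is not None:
        state["seen"].append(sample)
        state["policy_version"] += 1


async def main():
    rollout_q, judge_q, train_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    state = {"policy_version": 0, "seen": []}
    # All four loops run concurrently; none waits for another to finish.
    await asyncio.gather(
        serve(["hi", "fix the bug", "thanks"], rollout_q),
        collect(rollout_q, judge_q),
        judge(judge_q, train_q),
        train(train_q, state),
    )
    return state


state = asyncio.run(main())
```

Because each stage only reads from and writes to a queue, the serving loop never blocks on judging or training, which is the property the design relies on to keep the endpoint available during updates.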
Primary use cases
- personalising a self-hosted agent from a single user's conversational feedback
- scaling RL training across terminal, GUI, SWE, and tool-call agent environments
- continuously updating a deployed policy without taking the inference endpoint offline