Add new RL Colocate Trainer by jayhenry · Pull Request #1503 · InternLM/xtuner

jayhenry · 2026-02-28T06:33:48Z

Summary

This PR introduces the RLColocateTrainer — a complete RL training loop for the colocated (shared-GPU) setting — along with an evaluation framework and several component-level refinements to support end-to-end GRPO-style reinforcement learning.

Key Changes

RLColocateTrainer (new): A full-featured RL trainer that orchestrates the rollout → train → weight-sync → evaluate loop in a colocated placement group. It integrates AgentLoopManager for both training and evaluation data production, supports TensorBoard/JSONL experiment tracking, and handles data preparation (advantage estimation, sequence context construction, logprobs alignment) for GRPO training.
Evaluator (new): A lightweight evaluation component (Evaluator + EvaluatorConfig) that computes metrics (e.g., accuracy) from rollout samples. It does not produce evaluation data itself; that is handled by AgentLoopManager.
_DatasetSampler.__len__: Added __len__ so the evaluator can query total eval sample count for batch size computation.
Abstract Judger base class: Extracted a Judger ABC from NativeJudger so that both NativeJudger and NativeJudgerRouter share a common interface. Added RayJudger / RayJudgerProxy type aliases for better IDE support and type hints.
AgentLoopManagerConfig adjustment: The config now holds SamplerConfig, AgentLoopConfig, and ProduceStrategyConfig as members, and internally constructs the agent loop, produce strategy, and sampler.
RolloutState serialization fix: Added a field_serializer for routed_experts to gracefully skip ray.ObjectRef during Pydantic serialization, preventing PydanticSerializationError.
Design doc update: Refined design/component_rl.py with updated component APIs including Storage, Sampler, CheckpointEngine, AdvantageEstimator, and Packer abstractions, along with revised usage examples for colocate and disaggregate scenarios.
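
The advantage estimation step mentioned for RLColocateTrainer can be illustrated with the group-relative normalization commonly used in GRPO: each response's reward is centered and scaled by the statistics of the responses sampled for the same prompt. The sketch below is a minimal, standalone illustration. The function name `group_relative_advantages`, the `eps` parameter, and the choice of population standard deviation are assumptions and are not taken from this PR's AdvantageEstimator.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Center and scale the rewards of one prompt's response group.

    Hypothetical helper for illustration only; not the xtuner API.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under this scheme, a group whose rewards are all identical yields zero advantages, so it contributes no gradient signal.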

Test

Verified that the GSM8K baseline produces a correct reward curve when trained on dense 0.6B and 8B models and an MoE 30B model.

TODO

Simplify Trainer.__init__ code by using self.build_XXX() instead of XXXConfig.build (rl02)
Save train and eval trajectories (rl05)
save checkpoint (rl06)
resume
set deterministic and seed
add cli (rl01)
add ReplayBufferCfg
Improve RolloutController-related methods such as check_health, pause_generation, restart, etc.
add verl agent loop (rl03)

xtuner/v1/train/rl_colocate_trainer.py

YanhuiDua · 2026-03-02T03:53:04Z

xtuner/v1/train/rl_colocate_trainer.py

+        compute_metric_func=None,
+    )
+    # Finally, build the trainer
+    trainer = RLColocateTrainer(


In that case, is RLTrainerConfig still needed?

We can just use the current rl.py entry point, right?

Will add it later.

xtuner/v1/train/rl_colocate_trainer.py

xtuner/v1/rl/base/agent_loop_manager.py

YanhuiDua · 2026-03-02T04:03:02Z

xtuner/v1/train/rl_colocate_trainer.py

+    def __init__(
+        self,
+        *,
+        resources: AcceleratorResourcesConfig,


We may also need to pass a cpu_resource_config here; JudgerRouter.build supports a user-supplied pg.

@YanhuiDua Let's change this together with JudgerRouterConfig.

jayhenry and others added 12 commits February 28, 2026 06:18

add adv estimator and checkpoint engine

9185db1

refine design file

d5572b6

rl colocate trainer

adae731

add exp tracker and fix train loop bug

deff95e

skip routed_experts when RolloutState.dump

41b53ff

fix Sampler, ProduceStrategy, ... init methods

4a1a67b

add debug_rollout

c6813a8

[Baseline] Run successfully with correct reward curve

a61e22a

1) Introduce abstract Judger class. 2) Adjust AgentLoopManagerConfig to include judger and related configs

a21b4df

Add RL evaluation framework:

c164f88

1) Modify RolloutState reward type to support dict; 2) Introduce Evaluator class for metric computation; 3) Integrate evaluator into RLColocateTrainer for initial and periodic evaluations; 4) Add length method to DatasetSampler for evaluator usage

refine code

c981c68

add xtuner meta and work_dir

3f2ae77

jayhenry requested review from YanhuiDua and hhaAndroid March 1, 2026 06:54

YanhuiDua reviewed Mar 2, 2026

xtuner/v1/train/rl_colocate_trainer.py (outdated)

YanhuiDua reviewed Mar 2, 2026

xtuner/v1/train/rl_colocate_trainer.py

YanhuiDua reviewed Mar 2, 2026

xtuner/v1/rl/base/agent_loop_manager.py

YanhuiDua reviewed Mar 2, 2026

YanhuiDua approved these changes Mar 2, 2026

jayhenry added 3 commits March 2, 2026 06:54

fix agent loop manager unit test

07c4c65

adjust evaluator config

fb2d050

add _log_mini_batch_metrics

d1d2e06

jayhenry merged commit 9c3a122 into InternLM:rl_design Mar 2, 2026
0 of 3 checks passed

jayhenry merged 15 commits into InternLM:rl_design from jayhenry:colocate_trainer

2 participants
