Add new RL Colocate Trainer #1503

Merged
jayhenry merged 15 commits into InternLM:rl_design from jayhenry:colocate_trainer
Mar 2, 2026

@jayhenry jayhenry commented Feb 28, 2026

Summary

This PR introduces RLColocateTrainer, a complete RL training loop for the colocated (shared-GPU) setting, along with an evaluation framework and several component-level refinements that support end-to-end GRPO-style reinforcement learning.

Key Changes

  • RLColocateTrainer (new): A full-featured RL trainer that orchestrates the rollout → train → weight-sync → evaluate loop in a colocated placement group. It integrates AgentLoopManager for both training and evaluation data production, supports TensorBoard/JSONL experiment tracking, and handles data preparation (advantage estimation, sequence context construction, logprobs alignment) for GRPO training.
  • Evaluator (new): A lightweight evaluation component (Evaluator + EvaluatorConfig) that computes metrics (e.g., accuracy) from rollout samples. It does not produce evaluation data itself; that is handled by AgentLoopManager.
  • _DatasetSampler.__len__: Added __len__ so the evaluator can query total eval sample count for batch size computation.
  • Abstract Judger base class: Extracted a Judger ABC from NativeJudger so that both NativeJudger and NativeJudgerRouter share a common interface. Added RayJudger / RayJudgerProxy type aliases for better IDE support and type hints.
  • AgentLoopManagerConfig adjustment: AgentLoopManagerConfig now holds SamplerConfig, AgentLoopConfig, and ProduceStrategyConfig as members, and internally constructs the agent loop, produce strategy, and sampler.
  • RolloutState serialization fix: Added a field_serializer for routed_experts to gracefully skip ray.ObjectRef during Pydantic serialization, preventing PydanticSerializationError.
  • Design doc update: Refined design/component_rl.py with updated component APIs including Storage, Sampler, CheckpointEngine, AdvantageEstimator, and Packer abstractions, along with revised usage examples for colocate and disaggregate scenarios.
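For readers unfamiliar with the advantage-estimation step mentioned under RLColocateTrainer, the following is a minimal sketch of GRPO-style group-relative advantages. It is not the code from this PR: the function name `group_relative_advantages`, the `eps` parameter, and the use of the population standard deviation are assumptions made for illustration. It only shows the general idea of normalizing each reward against the other samples drawn for the same prompt.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards_by_prompt, eps=1e-6):
    """Normalize each reward against the other samples of the same prompt.

    rewards_by_prompt: list of groups, one group of float rewards per prompt.
    eps: small constant (assumed here) that avoids division by zero when
         every sample in a group received the same reward.
    """
    advantages = []
    for rewards in rewards_by_prompt:
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std; implementations may differ
        advantages.append([(r - mu) / (sigma + eps) for r in rewards])
    return advantages


# Two prompts, four sampled responses each (hypothetical rewards).
groups = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
advantages = group_relative_advantages(groups)
```

In the first group, correct responses get a positive advantage and incorrect ones a negative advantage; in the second group all rewards are equal, so every advantage is zero and the group contributes no learning signal.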

Test

Verified correct reward curve on GSM8K baseline trained on Dense 0.6B, 8B and MoE 30B models.
[image: reward curve]
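As an illustration of the kind of metric the Evaluator computes from rollout samples, here is a hedged sketch of an accuracy function. The `Sample` dataclass, its `reward` field, and the `threshold` parameter are hypothetical stand-ins and do not reflect the actual rollout data structures or the `compute_metric_func` signature in this PR.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for one evaluated rollout sample."""
    prompt_id: int
    reward: float  # assumed: 1.0 for a correct answer, 0.0 otherwise


def accuracy(samples, threshold=0.5):
    """Fraction of samples whose reward exceeds the threshold."""
    if not samples:
        return 0.0  # avoid dividing by zero on an empty eval set
    correct = sum(1 for s in samples if s.reward > threshold)
    return correct / len(samples)


eval_samples = [Sample(0, 1.0), Sample(1, 0.0), Sample(2, 1.0), Sample(3, 1.0)]
acc = accuracy(eval_samples)
```

A metric like this depends only on the finished samples, which matches the split described above: data production stays in AgentLoopManager and the Evaluator only scores the results.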

TODO

  • Simplify Trainer.__init__ by using self.build_XXX() instead of XXXConfig.build (rl02)
  • Save train and eval trajectories (rl05)
  • Save checkpoint (rl06)
  • Resume
  • Set deterministic mode and seed
  • Add CLI (rl01)
  • Add ReplayBufferCfg
  • Improve RolloutController-related methods such as check_health, pause_generation, restart, etc.
  • Add verl agent loop (rl03)

@jayhenry jayhenry requested review from YanhuiDua and hhaAndroid March 1, 2026 06:54
    compute_metric_func=None,
)
# Finally, build the trainer
trainer = RLColocateTrainer(

Is RLTrainerConfig still needed, then?

We could just use the current rl.py entry point directly, right?

We'll add it later.

def __init__(
    self,
    *,
    resources: AcceleratorResourcesConfig,

@YanhuiDua YanhuiDua Mar 2, 2026

We may also need to pass a cpu_resource_config here; JudgerRouter.build supports a user-supplied pg.

@YanhuiDua Let's change this together with JudgerRouterConfig.

@jayhenry jayhenry merged commit 9c3a122 into InternLM:rl_design Mar 2, 2026
0 of 3 checks passed