Add new RL Colocate Trainer #1503
Merged
jayhenry merged 15 commits into InternLM:rl_design on Mar 2, 2026
…to include judger and related configs
1) Modify RolloutState reward type to support dict; 2) Introduce Evaluator class for metric computation; 3) Integrate evaluator into RLColocateTrainer for initial and periodic evaluations; 4) Add length method to DatasetSampler for evaluator usage
YanhuiDua
reviewed
Mar 2, 2026
        compute_metric_func=None,
    )
    # Finally, build the trainer
    trainer = RLColocateTrainer(
YanhuiDua
reviewed
Mar 2, 2026
    def __init__(
        self,
        *,
        resources: AcceleratorResourcesConfig,
Collaborator
We may also need to pass a cpu_resource_config here; JudgerRouter.build supports a user-provided pg.
YanhuiDua
approved these changes
Mar 2, 2026
Summary
This PR introduces the RLColocateTrainer, a complete RL training loop for the colocated (shared-GPU) setting, along with an evaluation framework and several component-level refinements to support end-to-end GRPO-style reinforcement learning.

Key Changes
- RLColocateTrainer (new): A full-featured RL trainer that orchestrates the rollout → train → weight-sync → evaluate loop in a colocated placement group. It integrates AgentLoopManager for both training and evaluation data production, supports TensorBoard/JSONL experiment tracking, and handles data preparation (advantage estimation, sequence context construction, logprobs alignment) for GRPO training.
- Evaluator (new): A lightweight evaluation component (Evaluator + EvaluatorConfig) that computes metrics (e.g., accuracy) from rollout samples. Evaluation data production is not covered here; it is handled by AgentLoopManager.
- DatasetSampler.__len__: Added __len__ so the evaluator can query the total eval sample count for batch size computation.
- Judger base class: Extracted a Judger ABC from NativeJudger so that both NativeJudger and NativeJudgerRouter share a common interface. Added RayJudger / RayJudgerProxy type aliases for better IDE support and type hints.
- AgentLoopManagerConfig adjustment: It now has SamplerConfig, AgentLoopConfig, and ProduceStrategyConfig as members, and internally constructs the agent loop, produce strategy, and sampler.
- RolloutState serialization fix: Added a field_serializer for routed_experts to gracefully skip ray.ObjectRef during Pydantic serialization, preventing PydanticSerializationError.
- design/component_rl.py with updated component APIs including Storage, Sampler, CheckpointEngine, AdvantageEstimator, and Packer abstractions, along with revised usage examples for colocate and disaggregate scenarios.

Test
Verified correct reward curve on GSM8K baseline trained on Dense 0.6B, 8B and MoE 30B models.
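
For readers unfamiliar with how rewards feed GRPO training, the advantage-estimation step mentioned under Key Changes can be sketched as below. This is a minimal illustration and not the PR's AdvantageEstimator: the function name `grpo_advantages`, the `eps` value, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages for the responses sampled from ONE prompt.

    Hypothetical helper: each reward is centered on the group mean and scaled
    by the group standard deviation (population std is an assumption here).
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Four responses to the same GSM8K-style prompt, scored 1.0 (correct) or 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)
```

Correct responses receive positive advantages and incorrect ones negative, while a group with identical rewards contributes zero advantage everywhere.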

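The metric computation described for the Evaluator can be pictured with a small sketch. Everything here is hypothetical: `Sample` stands in for RolloutState, and the `"score"` key and `threshold` parameter are assumptions chosen only to show how a reward that may be either a float or a dict could be reduced to an accuracy value.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# A reward may be a plain float or a dict of named scores.
Reward = Union[float, Dict[str, float]]


@dataclass
class Sample:
    """Hypothetical stand-in for a rollout sample; only the reward is modeled."""
    reward: Reward


def accuracy(samples: List[Sample], key: str = "score", threshold: float = 0.0) -> float:
    """Fraction of samples whose reward exceeds the threshold."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        # Dict rewards are looked up by key; float rewards are used directly.
        value = sample.reward[key] if isinstance(sample.reward, dict) else sample.reward
        if value > threshold:
            correct += 1
    return correct / len(samples)


samples = [
    Sample({"score": 1.0}),
    Sample({"score": 0.0}),
    Sample(1.0),
    Sample({"score": 1.0}),
]
acc = accuracy(samples)
```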
TODO
- Trainer.__init__ code by replacing self.build_XXX() instead of XXXConfig.build (rl02)
- check_health, pause_generation, restart etc.
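
The first TODO item contrasts two construction styles, trainer-side `self.build_XXX()` methods and config-side `XXXConfig.build`. The sketch below illustrates only the config-side style, in which a parent config holds child configs as members and builds the components itself, similar in spirit to the AgentLoopManagerConfig adjustment. All class names here (`DemoSamplerConfig`, `DemoManagerConfig`, `DemoTrainer`, and so on) are hypothetical and do not reflect the actual XTuner classes or signatures.

```python
from dataclasses import dataclass, field
from typing import List


class DemoSampler:
    """Hypothetical sampler exposing __len__ so callers can size eval batches."""

    def __init__(self, dataset: List[str]) -> None:
        self._dataset = list(dataset)

    def __len__(self) -> int:
        return len(self._dataset)


@dataclass
class DemoSamplerConfig:
    dataset: List[str] = field(default_factory=list)

    def build(self) -> DemoSampler:
        # The config owns construction of its component.
        return DemoSampler(self.dataset)


class DemoManager:
    def __init__(self, sampler: DemoSampler) -> None:
        self.sampler = sampler


@dataclass
class DemoManagerConfig:
    # Child config held as a member; the parent builds it internally.
    sampler_config: DemoSamplerConfig

    def build(self) -> DemoManager:
        return DemoManager(sampler=self.sampler_config.build())


class DemoTrainer:
    def __init__(self, manager_config: DemoManagerConfig) -> None:
        # No trainer-side build_manager(): the config is asked to build directly.
        self.manager = manager_config.build()


trainer = DemoTrainer(DemoManagerConfig(DemoSamplerConfig(["q1", "q2"])))
eval_batch_size = len(trainer.manager.sampler)
```

With this layout the trainer constructor stays a thin composition of `build()` calls, and the sampler length is available for sizing an evaluation batch.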