RL-based data scheduler cuts LLM pretraining steps by up to 66% with minimal overhead
Researchers have developed AC-ODM, a reinforcement learning-driven data scheduling method that dynamically allocates training examples across source tasks during large language model pretraining. Tested on the Pythia-1B model, it achieved a 27.5% relative improvement in MMLU accuracy and a 2.23x higher HumanEval pass@1 score compared to competitive baselines. The system reaches optimal validation perplexity using up to 66% fewer training steps, while adding only 0.4% to per-step wall-clock time and 2% to memory usage. Unlike previous approaches that relied on static or hand-crafted data mixing schedules, AC-ODM learns an online policy that adapts in real time based on the model's training state. The study notes that results are currently limited to a 1-billion-parameter model, leaving scalability to larger architectures as an open question.
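To make the idea of an online data-mixing policy concrete, here is a minimal toy sketch in Python. It is not the AC-ODM algorithm: it swaps in a simple softmax policy-gradient update with a running baseline, and every name in it (`ToyDataScheduler`, `simulate`, the learning rates, and the per-source reward values) is a hypothetical stand-in. The reward here is simulated noise around made-up means; in a real pretraining loop it would be some measured training signal, such as the loss reduction from a batch.

```python
import math
import random


def softmax(logits):
    """Convert preference scores into sampling probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyDataScheduler:
    """Hypothetical policy-gradient scheduler over data sources (not AC-ODM itself)."""

    def __init__(self, num_sources, actor_lr=0.5, critic_lr=0.1, seed=0):
        self.logits = [0.0] * num_sources  # actor: one preference score per source
        self.baseline = 0.0                # critic stand-in: running reward estimate
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.rng = random.Random(seed)

    def probabilities(self):
        return softmax(self.logits)

    def choose_source(self):
        """Sample the source that supplies the next training batch."""
        probs = self.probabilities()
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, source, reward):
        """Raise the preference for sources whose reward beats the baseline."""
        advantage = reward - self.baseline
        probs = self.probabilities()
        for i in range(len(self.logits)):
            grad = (1.0 if i == source else 0.0) - probs[i]
            self.logits[i] += self.actor_lr * advantage * grad
        self.baseline += self.critic_lr * advantage


def simulate(steps=500):
    # Assumed, made-up mean rewards per source; stand-ins for a real training signal.
    mean_reward = [0.1, 0.5, 0.2]
    scheduler = ToyDataScheduler(num_sources=len(mean_reward))
    for _ in range(steps):
        source = scheduler.choose_source()
        reward = mean_reward[source] + scheduler.rng.gauss(0.0, 0.05)
        scheduler.update(source, reward)
    return scheduler.probabilities()


if __name__ == "__main__":
    print([round(p, 3) for p in simulate()])
```

In this toy setup the sampling probabilities drift toward the source with the highest simulated reward, which illustrates the contrast with a static mixing schedule: the mixture is an output of training feedback, not a fixed input.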
This is an AI-generated summary. ShortSingh links to the original source for the complete article.