Support Flexible Scheduler#142
Conversation
Summary of Changes
Hello @benji-cannot-code, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the flexibility of learning rate scheduling within the training framework. Previously, the learning rate scheduler for the FSDP strategy was hard-coded. This change introduces a configurable option, allowing users to select different learning rate schedules via the configuration file. It achieves this by integrating Hugging Face's get_scheduler utility, providing access to a broader array of scheduling strategies. Additionally, the PR includes minor cleanups by removing trailing whitespaces in documentation and configuration files, improving overall code hygiene.
Highlights
- **Configurable Learning Rate Scheduler:** The learning rate scheduler for the FSDP strategy is no longer hard-coded to `constant_with_warmup`. It can now be specified via the `optimizer_config.scheduler` parameter in the configuration.
- **Integration of Hugging Face Schedulers:** The implementation now leverages the `transformers.trainer.get_scheduler` utility, allowing users to choose from a wider range of pre-defined learning rate schedules provided by Hugging Face.
- **Configuration and Documentation Updates:** The `ppo_base_config.yaml` file has been updated to include the new `scheduler` parameter in the optimizer configuration, and the `config.rst` documentation has been revised to reflect this change and explain its usage.
- **Whitespace Cleanup:** Various trailing whitespaces were removed from documentation and configuration files, contributing to cleaner and more consistent code formatting.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces support for flexible learning rate schedulers, a valuable enhancement. The changes are well-implemented for the FSDP strategy and the DeepSpeed policy worker, and the configuration and documentation are updated accordingly. To make this feature more robust, consider two improvements: 1) Pass num_training_steps to get_scheduler to support a wider range of schedulers (e.g., 'linear', 'cosine'). 2) Update the DeepSpeed critic worker (DeepSpeedCriticWorkerBase) to use the configurable scheduler for consistency, as it appears to remain hardcoded. The removal of trailing whitespaces is also a nice cleanup.
> - ``optimizer_config.max_grad_norm``: Gradient clipping parameter. The total L2 norm of the model gradients will be scaled to this value during training.
> - ``optimizer_config.offload_after_step``: Whether to offload optimizer state to CPU after step if colocated. When generation and training workers are colocated, we recommend using the default setting of ``true``. In some cases with non-colocation, it can be desirable to leave optimizer state on GPU memory to avoid offloading costs as well as additional CPU memory usage.
> - ``optimizer_config.num_warmup_steps``: Number of warmup steps for the learning rate scheduler.
> - ``optimizer_config.scheduler``: Which learning rate scheduler to use.
To improve clarity for users, consider expanding the description for optimizer_config.scheduler. You could mention that it accepts scheduler types from Hugging Face's transformers library and provide a link to the documentation for available options.
- ``optimizer_config.scheduler``: Which learning rate scheduler to use. See `transformers.SchedulerType <https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.SchedulerType>`_ for available options.
Probably don't need to leave a link, but it would be good to mention that this config param is intended to align with the schedulers in `transformers.SchedulerType`.
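As an illustration, selecting a scheduler in the optimizer config might look like the following sketch (the `lr` and `num_warmup_steps` values are hypothetical; only the `scheduler` key is the new addition from this PR):

```yaml
optimizer_config:
  lr: 1.0e-6            # hypothetical value, for illustration only
  num_warmup_steps: 20  # warmup steps consumed by the scheduler
  scheduler: cosine     # any transformers.SchedulerType value, e.g. linear, cosine
```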
```python
actor_lr_scheduler = get_scheduler(
    optim_config.scheduler,
    actor_optimizer,
    num_warmup_steps=optim_config.num_warmup_steps,
)
```
Using get_scheduler from Transformers is a great step for flexibility. However, to fully support schedulers like linear, cosine, and polynomial, the num_training_steps argument is required. Without it, these schedulers won't function correctly as they need to know the total duration of training to decay the learning rate.
Consider refactoring to calculate and pass num_training_steps during scheduler initialization. This will make the implementation more robust and truly flexible.
This makes sense to me. We'll need to know the number of training steps for many schedulers. This information is known in the trainer, so it can/should be plumbed down.
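A minimal sketch of what that plumbing could compute (the function name and batch math here are illustrative, not the actual trainer API):

```python
import math

def compute_num_training_steps(num_samples: int, global_batch_size: int, num_epochs: int) -> int:
    # Total optimizer steps for the whole run: batches per epoch times epochs.
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    return steps_per_epoch * num_epochs

# e.g. 1000 samples, global batch 64, 3 epochs -> ceil(15.625) * 3 = 48 steps
num_training_steps = compute_num_training_steps(1000, 64, 3)
print(num_training_steps)  # 48
```

The result would then be passed as `num_training_steps=...` in the `get_scheduler(...)` call so decaying schedules know the total run length.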
```diff
 actor_scheduler = get_scheduler(
-    "constant_with_warmup",
+    self.cfg.trainer.policy.optimizer_config.scheduler,
     actor_optim,
     num_warmup_steps=self.cfg.trainer.policy.optimizer_config.num_warmup_steps,
 )
```
Similar to the FSDP strategy, the get_scheduler call here is missing the num_training_steps argument. This will prevent schedulers that decay over the course of training (e.g., 'linear', 'cosine') from working as expected. To enable full support for all Hugging Face schedulers, num_training_steps should be provided.
My editor automatically trimmed trailing whitespace in the doc and the config. Is this fine?
**tyler-griggs** left a comment
Hey @benji-cannot-code , this is great, thank you!
Only two comments:
- Please plumb the `num_training_steps` argument into `get_scheduler`; otherwise we can't actually use many of the schedulers.
- Please test this with a new scheduler (i.e., not `constant_with_warmup`). I don't believe we have a test to cover this, so consider adding one or launching short training runs with ~2 new schedulers and validating that they are working.
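One way to validate a decaying schedule without a full training run: transformers' `linear` schedule multiplies the base LR by a warmup-then-linear-decay factor, so a quick shape check (a sketch, not an existing test in this repo) could look like:

```python
def linear_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    # Mirrors the shape of the "linear" schedule: ramp up during warmup,
    # then decay linearly to zero at num_training_steps.
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

factors = [linear_warmup_factor(s, 10, 100) for s in range(101)]
assert factors[0] == 0.0      # starts at zero
assert factors[10] == 1.0     # peak at end of warmup
assert factors[55] == 0.5     # halfway through the decay phase
assert factors[100] == 0.0    # fully decayed by the final step
```

A `constant_with_warmup` run would instead stay flat at 1.0 after warmup, which is exactly why the decaying schedulers need `num_training_steps`.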
Done. Tested both FSDP and DeepSpeed with linear and cosine (requires training steps).
Awesome, thanks @benji-cannot-code. Can you paste a screenshot of the wandb log for the FSDP and DeepSpeed tests, especially the learning rate plot? When we run mini training run tests, we typically share screenshots to keep track of what the curves look like.
Policy graphs (RL graphs match expectations):
/gemini review |
Code Review
This pull request introduces support for flexible learning rate schedulers from Hugging Face Transformers, which is a great improvement over the hardcoded scheduler. The implementation is mostly correct, with changes to the configuration, documentation, and both FSDP and DeepSpeed strategies to accommodate the new flexible scheduler. I've identified one issue in the DeepSpeed critic worker where the scheduler is still hardcoded, preventing the new feature from working correctly in that specific case. Once that is addressed, this PR will be in excellent shape.
```diff
@@ -265,6 +266,7 @@ def init_model(self, model_id_or_path):
     "constant_with_warmup",
```
The critic scheduler is hardcoded to "constant_with_warmup". To support flexible schedulers for the critic as well, this should use the value from the configuration, similar to how it's done for the policy scheduler.
```diff
-    "constant_with_warmup",
+    self.cfg.trainer.critic.optimizer_config.scheduler,
```
Nice spot, fixed!
All looks good to me! Once you get a chance to confirm that checkpointing correctly saves and resumes the scheduler state (which it should), we can merge.
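The property to confirm can be illustrated with a toy scheduler. Real code would round-trip `scheduler.state_dict()` / `load_state_dict()` from PyTorch; this sketch only shows the save-and-resume invariant, with all class and attribute names invented for illustration:

```python
class ToyScheduler:
    """Toy linear-decay scheduler tracking only its step counter."""

    def __init__(self, base_lr: float, total_steps: int):
        self.base_lr = base_lr
        self.total_steps = total_steps
        self.last_step = 0

    def step(self) -> None:
        self.last_step += 1

    def get_lr(self) -> float:
        # LR decays linearly from base_lr to zero over total_steps.
        return self.base_lr * max(0.0, 1 - self.last_step / self.total_steps)

    def state_dict(self) -> dict:
        return {"last_step": self.last_step}

    def load_state_dict(self, state: dict) -> None:
        self.last_step = state["last_step"]

sched = ToyScheduler(1e-3, 100)
for _ in range(30):
    sched.step()
ckpt = sched.state_dict()        # checkpoint saved at step 30

resumed = ToyScheduler(1e-3, 100)
resumed.load_state_dict(ckpt)    # on resume, the LR curve continues rather than restarting
assert resumed.get_lr() == sched.get_lr()
```

If the scheduler state were not saved, the resumed run would restart the warmup/decay curve from step 0, which is what the checkpoint round-trip check should rule out.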
**GitHub Issue:** [Support more learning rate schedulers](NovaSky-AI#140)

**Changes:**

1. Changed the scheduler for the FSDP strategy from hard-coded `constant_with_warmup` to a flexible HF scheduler.
2. Added the scheduler to the config.
3. Removed trailing whitespace in the config and docs.

**Testing:**

1. Ran `uv run --isolated --extra dev pytest tests/cpu -q` sanity tests
2. vLLM smoke test

```
python - << 'PY'
from vllm import LLM, SamplingParams
m="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
llm=LLM(model=m, tensor_parallel_size=1, trust_remote_code=True)
print(llm.generate(["hiiii im running test"], SamplingParams(max_tokens=16))[0].outputs[0].text.strip())
PY
```

3. E2E training test

```
uv run --isolated -m skyrl_train.entrypoints.main_base \
  --config examples/gsm8k/minimal.yaml \
  data.gsm8k_path=/skyrl-data/datasets/gsm8k \
  model.name="TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  generator.max_new_tokens=32 \
  trainer.steps=10 \
  trainer.global_batch_size=2 \
  generator.rollouts_per_iter=4 \
  ray.num_cpus=2 ray.num_gpus=1 \
  logging.output_dir=/skyrl-data/logs/gsm8k-smoke \
  checkpoint.dir=/skyrl-data/checkpoints/gsm8k-smoke
```



