[train] Fix issue with unset pad_token_id (#1232)
Conversation
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
With this PR, it is still not possible to successfully train with … The issue is that the repo for … If I manually set the env var in the workers, I am able to run SFT with …
Code Review
This pull request introduces a centralized get_tokenizer utility to consistently handle tokenizer instantiation and fix an issue with unset pad_token_id. While the refactoring is well-executed, a critical security concern was identified: the use of trust_remote_code=True when loading tokenizers from user-supplied model paths can lead to Remote Code Execution (RCE) if a malicious model is loaded. It is recommended to disable trust_remote_code by default and only allow it if explicitly requested with appropriate warnings. Furthermore, to improve robustness, the new utility should handle cases where a tokenizer lacks both a pad_token_id and an eos_token_id to prevent potential runtime errors.
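For context, the core responsibility of such a utility — falling back to `eos_token_id` when `pad_token_id` is unset — might be sketched roughly as follows. This is an illustrative helper, not the PR's actual API; the function name `ensure_pad_token` and the stub tokenizer are assumptions made for the example.

```python
def ensure_pad_token(tokenizer):
    """Fall back to the EOS token when no pad token is configured.

    `tokenizer` is any object exposing the Hugging Face tokenizer
    attributes pad_token_id, pad_token, eos_token_id, eos_token.
    """
    if tokenizer.pad_token_id is None:
        if tokenizer.eos_token_id is None:
            # Neither token is set: downstream padding code would break,
            # so fail loudly instead of propagating None.
            raise ValueError(
                "Tokenizer has neither pad_token_id nor eos_token_id; "
                "set pad_token_id explicitly to enable padding."
            )
        tokenizer.pad_token_id = tokenizer.eos_token_id
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Minimal stand-in for a real tokenizer, used only to demonstrate the logic.
class _StubTokenizer:
    pad_token_id = None
    pad_token = None
    eos_token_id = 2
    eos_token = "</s>"


tok = ensure_pad_token(_StubTokenizer())
```

After the call, `tok.pad_token_id` mirrors the EOS id, which is the same fallback the PR applies when loading real tokenizers.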
```python
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token
```
The current implementation to set pad_token_id from eos_token_id is not fully robust. If a tokenizer has neither a pad_token_id nor an eos_token_id, pad_token_id will be set to None. This can lead to runtime errors in downstream code that expects an integer pad_token_id for padding (e.g., in skyrl.backends.skyrl_train_backend._to_training_batch).
To prevent this, you should verify that eos_token_id is available before the assignment. If it's not, raising an explicit ValueError would provide a clear error message to the user, indicating that the tokenizer configuration is incomplete for the required padding operations.
Suggested change:

```diff
-tokenizer.pad_token_id = tokenizer.eos_token_id
-tokenizer.pad_token = tokenizer.eos_token
+if tokenizer.eos_token_id is not None:
+    tokenizer.pad_token_id = tokenizer.eos_token_id
+    tokenizer.pad_token = tokenizer.eos_token
+else:
+    raise ValueError(
+        f"Tokenizer for '{model_name_or_path}' has no `pad_token_id` and no `eos_token_id`. "
+        "Please set `pad_token_id` for this model to ensure correct padding."
+    )
```
```diff
     Initialize the Megatron-Bridge bridge and provider objects + hf_config and tokenizer
     """
-    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+    tokenizer = get_tokenizer(model_path, trust_remote_code=True)
```
The use of trust_remote_code=True when loading a tokenizer from a potentially untrusted model_path (which can be supplied via command-line arguments) poses a significant security risk. If an attacker provides a path to a malicious model, arbitrary code contained within the model's configuration or tokenizer files could be executed on the system. It is highly recommended to set trust_remote_code=False by default and only enable it if the user explicitly opts in through a configuration flag, ideally with a warning about the risks involved.
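One way to implement this opt-in pattern is a small gate that defaults to `False` and warns loudly when enabled. This is a minimal sketch under stated assumptions — the helper name `resolve_trust_remote_code` is hypothetical and not part of the repo:

```python
import warnings


def resolve_trust_remote_code(user_opt_in: bool = False) -> bool:
    """Return the trust_remote_code flag to pass to from_pretrained.

    Defaults to False. When the caller explicitly opts in, emit a
    warning, since code from a model repository runs with full local
    privileges.
    """
    if user_opt_in:
        warnings.warn(
            "trust_remote_code=True executes code from the model "
            "repository; only enable this for trusted sources.",
            stacklevel=2,
        )
    return user_opt_in
```

The loader would then pass `trust_remote_code=resolve_trust_remote_code(flag_from_config)` instead of a hard-coded `True`.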
```python
self.tokenizer = get_tokenizer(
    self.cfg.trainer.policy.model.path,
    trust_remote_code=True,
    use_fast=not self.cfg.trainer.disable_fast_tokenizer,
    padding_side="left",
)
```
Similar to the finding in megatron_worker.py, this call to get_tokenizer explicitly enables trust_remote_code for a model path derived from user-controlled configuration (self.cfg.trainer.policy.model.path). This creates a vector for Remote Code Execution (RCE) if the model path points to a malicious repository or local directory. Consider making this an optional user-controlled setting that defaults to False.
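Concretely, the flag could be read defensively from the trainer config so that older configs without the field still get the safe default. The config field name `trust_remote_code` below is an assumption for illustration, not an existing option in the repo:

```python
from types import SimpleNamespace

# Hypothetical config shape mirroring cfg.trainer.* used in the diff above.
cfg = SimpleNamespace(
    trainer=SimpleNamespace(
        trust_remote_code=False,  # safe default; user must opt in explicitly
        policy=SimpleNamespace(model=SimpleNamespace(path="org/model")),
    )
)

# getattr with a default keeps configs lacking the field on the safe path.
trust_remote_code = getattr(cfg.trainer, "trust_remote_code", False)
```

This value would then be forwarded to `get_tokenizer` in place of the hard-coded `trust_remote_code=True`.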
What does this PR do?
Fixes #1231