[train] Fix issue with unset pad_token_id (#1232)

Merged: SumanthRH merged 4 commits into main from sumanthrh/fix-none-pad-token-id on Feb 27, 2026.
Conversation

@SumanthRH (Member) commented on Feb 27, 2026:

What does this PR do?

Fixes #1231


@SumanthRH
Copy link
Copy Markdown
Member Author

With this PR, it is still not possible to train meta-llama/Llama-3.2-1B with --fsdp.

The issue is that the meta-llama/Llama-3.2-1B repo on Hugging Face is gated, so the HF_TOKEN env var needs to be propagated to the FSDP workers. However, the currently recommended path for doing so, --env-file .env, does not work. I've opened another issue: #1234.

If I manually set the env var in the workers, I am able to run SFT with meta-llama/Llama-3.2-1B.
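As a manual workaround until #1234 is resolved, the token can be parsed out of the .env file and exported inside each worker's setup. A minimal sketch (the helper names below are illustrative, not SkyRL's actual API):

```python
import os

def load_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file, skipping blanks and comments."""
    env: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def apply_env_in_worker(env: dict[str, str]) -> None:
    """Call inside each FSDP worker before loading gated models such as Llama-3.2-1B."""
    os.environ.update(env)
```

This only covers flat KEY=VALUE files; quoting and variable expansion are out of scope for the sketch.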

@devin-ai-integration bot (Contributor) left a comment:
✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.


SumanthRH merged commit be7ee34 into main on Feb 27, 2026 (5 of 6 checks passed).
@gemini-code-assist bot (Contributor) left a comment:
Code Review

This pull request introduces a centralized get_tokenizer utility to consistently handle tokenizer instantiation and fix an issue with unset pad_token_id. While the refactoring is well-executed, a critical security concern was identified: the use of trust_remote_code=True when loading tokenizers from user-supplied model paths can lead to Remote Code Execution (RCE) if a malicious model is loaded. It is recommended to disable trust_remote_code by default and only allow it if explicitly requested with appropriate warnings. Furthermore, to improve robustness, the new utility should handle cases where a tokenizer lacks both a pad_token_id and an eos_token_id to prevent potential runtime errors.

Comment on lines +12 to +13:

```python
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token
```

Severity: critical

The current implementation to set pad_token_id from eos_token_id is not fully robust. If a tokenizer has neither a pad_token_id nor an eos_token_id, pad_token_id will be set to None. This can lead to runtime errors in downstream code that expects an integer pad_token_id for padding (e.g., in skyrl.backends.skyrl_train_backend._to_training_batch).

To prevent this, you should verify that eos_token_id is available before the assignment. If it's not, raising an explicit ValueError would provide a clear error message to the user, indicating that the tokenizer configuration is incomplete for the required padding operations.

Suggested change:

```diff
-tokenizer.pad_token_id = tokenizer.eos_token_id
-tokenizer.pad_token = tokenizer.eos_token
+if tokenizer.eos_token_id is not None:
+    tokenizer.pad_token_id = tokenizer.eos_token_id
+    tokenizer.pad_token = tokenizer.eos_token
+else:
+    raise ValueError(
+        f"Tokenizer for '{model_name_or_path}' has no `pad_token_id` and no `eos_token_id`. "
+        "Please set `pad_token_id` for this model to ensure correct padding."
+    )
```

In megatron_worker.py:

```diff
     Initialize the Megatron-Bridge bridge and provider objects + hf_config and tokenizer
     """
-    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+    tokenizer = get_tokenizer(model_path, trust_remote_code=True)
```

Severity: high (security)

The use of trust_remote_code=True when loading a tokenizer from a potentially untrusted model_path (which can be supplied via command-line arguments) poses a significant security risk. If an attacker provides a path to a malicious model, arbitrary code contained within the model's configuration or tokenizer files could be executed on the system. It is highly recommended to set trust_remote_code=False by default and only enable it if the user explicitly opts in through a configuration flag, ideally with a warning about the risks involved.
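One way to implement this suggestion is to gate the flag behind an explicit opt-in that defaults to False and warns when enabled. A sketch only, with an illustrative helper name rather than the repo's actual code:

```python
import warnings

def resolve_trust_remote_code(opt_in: bool, model_path: str) -> bool:
    """Return the trust_remote_code value to pass to the tokenizer loader.

    Defaults to False; enabling it requires an explicit opt-in and emits a
    warning, since code shipped in the model repo would then be executed.
    """
    if opt_in:
        warnings.warn(
            f"trust_remote_code enabled for '{model_path}': code from this "
            "repository may execute locally. Only enable for trusted sources.",
            UserWarning,
        )
        return True
    return False
```

The returned value would then be forwarded to AutoTokenizer.from_pretrained (or the new get_tokenizer wrapper) instead of a hard-coded True.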

Comment on lines +126 to +131:

```python
self.tokenizer = get_tokenizer(
    self.cfg.trainer.policy.model.path,
    trust_remote_code=True,
    use_fast=not self.cfg.trainer.disable_fast_tokenizer,
    padding_side="left",
)
```

Severity: high (security)

Similar to the finding in megatron_worker.py, this call to get_tokenizer explicitly enables trust_remote_code for a model path derived from user-controlled configuration (self.cfg.trainer.policy.model.path). This creates a vector for Remote Code Execution (RCE) if the model path points to a malicious repository or local directory. Consider making this an optional user-controlled setting that defaults to False.
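Concretely, the flag could be threaded through the trainer config instead of being hard-coded at the call site. A hypothetical sketch — the ModelConfig shape below is illustrative, and SkyRL's actual config schema may differ:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    path: str
    trust_remote_code: bool = False  # safe default; the user must opt in

def tokenizer_kwargs(model: ModelConfig, disable_fast_tokenizer: bool = False) -> dict:
    """Build the keyword arguments for the tokenizer loader from config."""
    return {
        "trust_remote_code": model.trust_remote_code,
        "use_fast": not disable_fast_tokenizer,
        "padding_side": "left",
    }
```

With this shape, the call site becomes get_tokenizer(cfg.trainer.policy.model.path, **tokenizer_kwargs(cfg.trainer.policy.model, ...)) and the risky behavior is off unless explicitly requested.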


Development

Successfully merging this pull request may close these issues.

SkyRLTrainBackend pads with None → crash in _to_training_batch (meta-llama/Llama-3.2-1B, --backend fsdp)
