
[tx] Add experimental SkyRL-train backend that supports SFT#871

Merged
pcmoritz merged 51 commits into NovaSky-AI:main from pcmoritz:tx-skyrl-train-backend
Jan 28, 2026

Conversation

@pcmoritz
Collaborator

@pcmoritz pcmoritz commented Jan 13, 2026

The engine can, for example, be started with

uv run --extra gpu --extra tinker -m tx.tinker.api --base-model Qwen/Qwen3-4B --backend "skyrl_train"

and then you can run, for example,

uv run --with wandb --with tinker sl_loop.py base_url=http://localhost:8000 model_name=Qwen/Qwen3-4B lora_rank=1

@pcmoritz pcmoritz added the tx label Jan 13, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new SkyRL-train backend for supervised training. The changes include updating project dependencies in pyproject.toml and adding the new backend implementation in skyrl-tx/tx/tinker/backends/skyrl_train.py. While this is a good starting point for the new backend, my review has identified several issues that need to be addressed. The most critical issue is in the forward_backward method, which is currently a stub and does not perform a backward pass or return actual losses, preventing any training from occurring. Other significant issues include the use of hardcoded paths and hyperparameters, potentially incorrect token padding, and breaking encapsulation by accessing private members of a library class. Addressing these points will be crucial for the backend to be functional and maintainable.
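For context on the "does not return actual losses" point: whatever the final forward_backward implementation looks like, it has to reduce per-token losses under the token weights into the scalar it reports. A minimal pure-Python sketch of that reduction (function and argument names here are hypothetical illustrations, not the actual backend code):

```python
def weighted_mean_loss(token_nlls, token_weights):
    """Aggregate per-token negative log-likelihoods into one scalar,
    weighting each token by its mask/weight and normalizing by the
    total weight so fully-masked examples do not divide by zero."""
    total = sum(w * nll for w, nll in zip(token_weights, token_nlls))
    denom = sum(token_weights)
    return total / denom if denom else 0.0
```

For instance, with per-token losses `[2.0, 1.0, 4.0]` and weights `[1.0, 0.0, 1.0]`, only the first and third tokens contribute.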

Comment on lines +192 to +197
ray.get([actor.save_checkpoint.remote(output_path) for actor in self._actor_group._actor_handlers])

def load_checkpoint(self, checkpoint_path, model_id: str) -> None:
if model_id != self._model_id:
raise ValueError(f"Model {model_id} not found")
ray.get([actor.load_checkpoint.remote(Path(checkpoint_path)) for actor in self._actor_group._actor_handlers])
Contributor


medium

Accessing the private member _actor_handlers of PPORayActorGroup breaks encapsulation and makes the code dependent on the internal implementation of the skyrl-train library. This could lead to breakages if the library is updated. It would be more robust to use a public API from PPORayActorGroup for this purpose, or request one if it doesn't exist.
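A public fan-out helper on the actor group is one way to keep callers away from the private handle list. A toy sketch of that pattern (the class, the `run_on_all_actors` name, and the plain method calls standing in for Ray remotes are all assumptions for illustration, not the actual skyrl-train API):

```python
class ActorGroup:
    """Stand-in for PPORayActorGroup, illustrating a public fan-out API
    so callers never touch the private actor-handle list directly."""

    def __init__(self, actors):
        self._actor_handlers = actors  # private: internal implementation detail

    def run_on_all_actors(self, method_name, *args, **kwargs):
        # Public API: invoke `method_name` on every actor and collect results.
        # With Ray this would be ray.get([getattr(a, method_name).remote(...)]).
        return [getattr(a, method_name)(*args, **kwargs) for a in self._actor_handlers]


class FakeActor:
    """Minimal actor stub for demonstrating the helper."""

    def save_checkpoint(self, path):
        return f"saved:{path}"
```

With such a helper, the backend's save/load paths could call `group.run_on_all_actors("save_checkpoint", output_path)` instead of iterating `_actor_handlers` itself.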

@vercel

vercel bot commented Jan 25, 2026

@pcmoritz is attempting to deploy a commit to the Tyler's projects Team on Vercel.

A member of the Team first needs to authorize it.

@pcmoritz pcmoritz changed the title from "[tx] [WIP] Add SkyRL-train backend" to "[tx] Add SkyRL-train backend" on Jan 25, 2026
@pcmoritz
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an experimental SkyRL-train backend for supervised fine-tuning (SFT). The changes include adding the necessary dependencies and configuration in pyproject.toml, integrating the new backend into the engine and API, and implementing the backend logic in a new skyrl_train.py file.

My review focuses on improving the robustness and maintainability of this new backend. I've identified a few issues:

  • A configuration issue in pyproject.toml that unnecessarily restricts the backend to a specific Python version.
  • A potential resource leak in the new backend's delete_model implementation.
  • Use of a hardcoded padding token ID, which could lead to incorrect behavior with different models.
  • Some minor maintainability concerns regarding duplicated dependency configuration and a hardcoded value.

Overall, this is a great addition. Addressing these points will make the new backend more robust and easier to maintain.

@pcmoritz
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an experimental skyrl_train backend to support supervised fine-tuning (SFT). However, the accompanying API implementation has several critical security flaws that must be addressed:

  • The API lacks authentication, exposing sensitive model weights and training controls.
  • The checkpoint download mechanism loads large files into memory, making it vulnerable to denial of service (OOM).
  • Sensitive API keys are exposed via command-line arguments.

Beyond these critical security concerns, there are also suggestions to improve maintainability and clarity, including refining a comment in pyproject.toml for git dependency accuracy and resolving naming inconsistencies for action_mask/response_mask within the new backend code.
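The standard mitigation for the OOM concern is to stream the checkpoint in fixed-size chunks rather than reading the whole file into memory. A self-contained sketch of the chunked read a streaming response would wrap (function name and chunk size are assumptions; a real FastAPI endpoint would hand this generator to a StreamingResponse):

```python
import tempfile


def iter_file_chunks(path, chunk_size=1024 * 1024):
    """Yield a file in fixed-size chunks so a download endpoint can stream a
    large checkpoint instead of materializing it in memory at once."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


# tiny demo: write a small "checkpoint" and stream it back together
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"checkpoint-bytes" * 3)
    tmp_path = tmp.name

reassembled = b"".join(iter_file_chunks(tmp_path, chunk_size=7))
```

Peak memory then stays bounded by `chunk_size`, independent of checkpoint size.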

Comment on lines +105 to +107
# For now, just always use the current main branch, later it will be better to pin it to a released version. For development, you
# can set it to your own development branch.
skyrl-train = { git = "https://github.com/pcmoritz/SkyRL", rev = "6d117dd65139fa335d87efa31f056208b123280d", subdirectory = "skyrl-train" }
Contributor


medium

The comment on line 105 is slightly inaccurate. It says "just always use the current main branch", but rev is pinned to a specific commit hash, which is good practice for stability. I suggest updating the comment to reflect this. Also, as the comment mentions, it's a good idea to track an issue to update this to a released version of skyrl-train in the future to improve stability and reproducibility.

Suggested change
# For now, just always use the current main branch, later it will be better to pin it to a released version. For development, you
# can set it to your own development branch.
skyrl-train = { git = "https://github.com/pcmoritz/SkyRL", rev = "6d117dd65139fa335d87efa31f056208b123280d", subdirectory = "skyrl-train" }
# For now, we pin to a specific commit on the main branch.
# TODO: Update to a released version of skyrl-train once available.
# For development, you can set it to your own development branch.
skyrl-train = { git = "https://github.com/pcmoritz/SkyRL", rev = "3cf9c781c5e733a5bb035efc6b4d2f83cac7cc8c", subdirectory = "skyrl-train" }

Collaborator Author


I'll change the branch before merging

Comment on lines +126 to +147
        sequences, attention_masks, loss_masks, response_masks = [], [], [], []

        for seq, weights in zip(full_sequences, prepared_batch.all_token_weights):
            pad_len = max_seq_len - len(seq)
            sequences.append([self._tokenizer.pad_token_id] * pad_len + list(seq))
            attention_masks.append([0] * pad_len + [1] * len(seq))
            action_pad = max_response_len - len(weights)
            loss_masks.append([0.0] * action_pad + [float(w) for w in weights])
            response_masks.append([0] * action_pad + [1] * len(weights))

        sequences_tensor = torch.tensor(sequences, dtype=torch.long)
        attention_mask_tensor = torch.tensor(attention_masks, dtype=torch.long)
        loss_mask_tensor = torch.tensor(loss_masks, dtype=torch.float32)
        response_mask_tensor = torch.tensor(response_masks, dtype=torch.long)

        batch = TrainingInputBatch(
            {
                "sequences": sequences_tensor,
                "attention_mask": attention_mask_tensor,
                "loss_mask": loss_mask_tensor,
                "response_mask": response_mask_tensor,
            }
Contributor


medium

There's a naming inconsistency for the mask. Here, you are creating a response_mask, but in skyrl-train/skyrl_train/workers/worker.py the code expects an action_mask. This inconsistency can be confusing. For better clarity and maintainability, I recommend using a consistent name across the components. Since worker.py is being updated to use action_mask, it seems to be the intended name.

This change would also require updating skyrl-train/training_batch.py to use action_mask in the TrainingInput TypedDict.

        sequences, attention_masks, loss_masks, action_masks = [], [], [], []

        for seq, weights in zip(full_sequences, prepared_batch.all_token_weights):
            pad_len = max_seq_len - len(seq)
            sequences.append([self._tokenizer.pad_token_id] * pad_len + list(seq))
            attention_masks.append([0] * pad_len + [1] * len(seq))
            action_pad = max_response_len - len(weights)
            loss_masks.append([0.0] * action_pad + [float(w) for w in weights])
            action_masks.append([0] * action_pad + [1] * len(weights))

        sequences_tensor = torch.tensor(sequences, dtype=torch.long)
        attention_mask_tensor = torch.tensor(attention_masks, dtype=torch.long)
        loss_mask_tensor = torch.tensor(loss_masks, dtype=torch.float32)
        action_mask_tensor = torch.tensor(action_masks, dtype=torch.long)

        batch = TrainingInputBatch(
            {
                "sequences": sequences_tensor,
                "attention_mask": attention_mask_tensor,
                "loss_mask": loss_mask_tensor,
                "action_mask": action_mask_tensor,
            }

Collaborator

@erictang000 erictang000 left a comment


looks good to me, just a couple minor nits (can be fixed later too)

num_gpus = self._cfg.trainer.placement.policy_num_gpus_per_node

pg = placement_group([{"GPU": num_gpus, "CPU": 1}], strategy="PACK")
get_ray_pg_ready_with_timeout(pg, timeout=30)
Collaborator


small nit - this is sometimes not long enough, since installing dependencies on all Ray workers the first time can be expensive

we have this env var:

from skyrl_train.env_vars import SKYRL_RAY_PG_TIMEOUT_IN_S

that we set to 180 by default
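For illustration, an environment-variable-backed default like this is typically read as sketched below (this reimplements the idea for clarity; the real constant lives in skyrl_train.env_vars, and the helper name here is an assumption):

```python
import os


def pg_timeout_seconds(default=180):
    """Read the placement-group readiness timeout from the environment,
    mirroring how SKYRL_RAY_PG_TIMEOUT_IN_S defaults to 180 seconds."""
    return int(os.environ.get("SKYRL_RAY_PG_TIMEOUT_IN_S", default))
```

The call site would then become `get_ray_pg_ready_with_timeout(pg, timeout=pg_timeout_seconds())`, letting deployments with slow first-time dependency installs raise the limit without a code change.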

num_gpus_per_node=num_gpus,
ray_actor_type=PolicyWorker,
pg=pg,
num_gpus_per_actor=0.2 if pg else 1,
Collaborator


this can probably just be hardcoded to 1 for now, since the pg is always created above, but we should revisit this if colocation is going to be supported

@pcmoritz pcmoritz merged commit 642a6a4 into NovaSky-AI:main Jan 28, 2026
4 of 6 checks passed