Commit 9608a8a

[train] Increase default timeout for placement groups to 180s (NovaSky-AI#525)
# What does this PR do?

Increases the default timeout for placement groups from 60 seconds to 180 seconds (3 minutes).

I've noticed that in multi-node clusters the initial worker startup can easily take about 1 minute, because it takes time to download all the required packages with uv. Thus, we need to account for additional time at startup.

The timeout could be increased even more, but 180 seconds should work okay for small clusters for now. The tradeoff is that with a longer timeout you might wait too long even when the placement group is not satisfiable.

---------

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
1 parent 35a32b6 commit 9608a8a

2 files changed: +2 -2 lines changed

skyrl-train/docs/troubleshooting/troubleshooting.rst

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@ Placement Group Timeouts
 -------------------------
 
 In SkyRL, we use Ray placement groups to request resources for different actors. In Ray clusters that autoscale with KubeRay, placement group creation can take a long time since the cluster might have to add a new node, pull the relevant image and start the container, etc.
-You can use the ``SKYRL_RAY_PG_TIMEOUT_IN_S`` environment variable (Used in the ``.env`` file passed to the ``uv run`` command with ``--env-file``) to increase the timeout for placement group creation (By default, this is 60 seconds)
+You can use the ``SKYRL_RAY_PG_TIMEOUT_IN_S`` environment variable (Used in the ``.env`` file passed to the ``uv run`` command with ``--env-file``) to increase the timeout for placement group creation (By default, this is 180 seconds)
 
 Multi-node Training
 -------------------

skyrl-train/skyrl_train/utils/constants.py

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 import os
 
 #
-SKYRL_RAY_PG_TIMEOUT_IN_S = int(os.environ.get("SKYRL_RAY_PG_TIMEOUT_IN_S", 60))
+SKYRL_RAY_PG_TIMEOUT_IN_S = int(os.environ.get("SKYRL_RAY_PG_TIMEOUT_IN_S", 180))
 """
 Timeout for allocating the placement group for different actors in SkyRL
 """

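To make the effect of the change in ``constants.py`` concrete, here is a minimal sketch of how the lookup resolves with and without an override. The helper ``resolve_pg_timeout`` and the name ``DEFAULT_PG_TIMEOUT_S`` are hypothetical and exist only for illustration; the actual module does the lookup inline at import time.

```python
import os
from typing import Mapping

# New default from this commit (previously 60).
DEFAULT_PG_TIMEOUT_S = 180


def resolve_pg_timeout(environ: Mapping[str, str] = os.environ) -> int:
    """Hypothetical helper that mirrors the lookup in constants.py.

    Accepts any mapping so the behavior can be checked without
    modifying the real process environment.
    """
    # Environment values are strings, so int() parses them.
    # The integer default passes through int() unchanged.
    return int(environ.get("SKYRL_RAY_PG_TIMEOUT_IN_S", DEFAULT_PG_TIMEOUT_S))


if __name__ == "__main__":
    # No override set: falls back to the new default.
    print(resolve_pg_timeout({}))
    # Override as it would arrive from a .env file: parsed to an int.
    print(resolve_pg_timeout({"SKYRL_RAY_PG_TIMEOUT_IN_S": "600"}))
```

Two consequences follow from the one-line lookup: a non-integer value such as ``3m`` raises ``ValueError``, and because the constant is assigned at module level, the variable has to be set before ``constants.py`` is imported for the override to take effect.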