Commit 9608a8a
[train] Increase default timeout for placement groups to 180s (NovaSky-AI#525)
# What does this PR do?
Increases the default timeout for placement groups to 3 minutes (180s).
I've noticed that in multi-node clusters the initial worker startup can
easily take about 1 minute, because downloading all required packages
with uv takes time. We therefore need to allow for additional time at
startup.
This could be increased even more, but 180s should work okay for small
clusters for now. The tradeoff is that with a longer timeout you might
wait too long before failing even though the placement group (pg) is not
satisfiable.
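The tradeoff described above can be illustrated with a minimal sketch. It is not the code changed in this commit: the environment variable name `PG_TIMEOUT_S`, the helper names, and the polling loop are all hypothetical, and `is_ready` stands in for whatever call reports that the placement group has been scheduled. It only shows how a 180s default plus a deadline-bounded wait would behave.

```python
import os
import time
from typing import Callable, Mapping

# New default from this commit's description: 3 minutes.
DEFAULT_PG_TIMEOUT_S = 180


def pg_timeout_seconds(env: Mapping[str, str] = os.environ) -> int:
    """Return the placement group timeout in seconds.

    "PG_TIMEOUT_S" is a hypothetical override name used only for this
    illustration; it is not taken from the project.
    """
    return int(env.get("PG_TIMEOUT_S", DEFAULT_PG_TIMEOUT_S))


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the resource became ready in time, False otherwise.
    clock and sleep are injectable so the loop can be tested without
    actually waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_s, max(0.0, deadline - clock())))
```

With a simulated clock in which workers need about 70 seconds to start, a 60-second timeout gives up too early while the 180-second default succeeds. The cost is the one noted above: if the placement group can never be satisfied, the caller waits the full 180 seconds before finding out.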
---------
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
1 parent 35a32b6 commit 9608a8a
File tree
2 files changed: +2 -2 lines changed
- skyrl-train
  - docs/troubleshooting
  - skyrl_train/utils

| Changed file | Context lines shown | Line replaced |
|---|---|---|
| 1 | 6-12 | 9 |
| 2 | 1-7 | 4 |