
[train] Increase default timeout for placement groups to 180s #525

Merged
SumanthRH merged 2 commits into NovaSky-AI:main from SumanthRH:increase_timeout
Oct 20, 2025

SumanthRH commented Oct 20, 2025

What does this PR do?

Increases the default timeout for placement groups to 3 minutes (180s). I've noticed that in multi-node clusters the initial worker startup can easily take ~1 min, because it takes time to download all required packages with uv. We therefore need to allow for additional time at startup.

This could be increased even more, but 3 minutes should work okay for small clusters for now. The tradeoff is that you might wait too long even when the placement group (pg) is not satisfiable.
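To illustrate the tradeoff, here is a minimal sketch of a wait loop with a 180s default. It is not the code changed in this PR: `wait_for_placement_group`, `DEFAULT_PG_TIMEOUT_S`, and the `is_ready` callable are hypothetical names, and `is_ready` stands in for whatever readiness check the scheduler exposes.

```python
import time
from typing import Callable

# 180s is the default from this PR; the constant name is hypothetical.
DEFAULT_PG_TIMEOUT_S = 180.0


def wait_for_placement_group(
    is_ready: Callable[[], bool],
    timeout_s: float = DEFAULT_PG_TIMEOUT_S,
    poll_interval_s: float = 1.0,
) -> float:
    """Poll `is_ready` until it returns True or `timeout_s` elapses.

    Returns the number of seconds waited. Raises TimeoutError if the
    placement group is still not ready when the timeout is reached.
    `is_ready` is a hypothetical stand-in for the real readiness check.
    """
    start = time.monotonic()
    while True:
        if is_ready():
            return time.monotonic() - start
        elapsed = time.monotonic() - start
        if elapsed >= timeout_s:
            # An unsatisfiable placement group only surfaces here, so a
            # larger timeout means a longer wait before this error.
            raise TimeoutError(
                f"placement group not ready after {timeout_s:.0f}s"
            )
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, timeout_s - elapsed))
```

With this shape, a slow startup (for example, ~1 min of package downloads) still fits comfortably inside the default, while an unsatisfiable request fails only after the full 180s.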

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
SumanthRH (Member, Author) commented

The test failures are not related to this change.

SumanthRH marked this pull request as ready for review October 20, 2025 22:53
x
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
SumanthRH merged commit bfe2a32 into NovaSky-AI:main Oct 20, 2025
3 checks passed
atemaguer pushed a commit to atemaguer/SkyRL that referenced this pull request Oct 24, 2025
…y-AI#525)

li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
…y-AI#525)

dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…y-AI#525)
