Skip to content

Handshake is setting DialTimeout instead of context deadline#1709

Open
NamanMahor wants to merge 14 commits intoClickHouse:mainfrom
NamanMahor:handshake-timeout-fix
Open

Handshake is setting DialTimeout instead of context deadline#1709
NamanMahor wants to merge 14 commits intoClickHouse:mainfrom
NamanMahor:handshake-timeout-fix

Conversation

@NamanMahor
Copy link

Summary

If a user provides a context.Context with a timeout, it is ignored here in conn_handshake.go even though comment is correct to set context level deadline override any read deadline but instead of deadline we are setting Dialtimeout.

Made change to handle context deadline as we are already doing in here in conn_ping.go

@CLAassistant
Copy link

CLAassistant commented Nov 11, 2025

CLA assistant check
All committers have signed the CLA.

@kavirajk
Copy link
Contributor

@NamanMahor thanks for the PR :). I know it's a simple change. can you please add tests for this to avoid any regression in the future?

@NamanMahor
Copy link
Author

@kavirajk sure will add the test. I have one question which could also be bug. we are overriding the ctx with timeout/deadline here https://github.com/ClickHouse/clickhouse-go/blob/main/clickhouse.go#L304 which is being used by ch.dial(ctx) down in the code and later in ch.dial(ctx) will call handshake with same context so effectively we still using the Dial timeout in handshake even after my PR.

@kavirajk
Copy link
Contributor

have one question which could also be bug. we are overriding the ctx with timeout/deadline here https://github.com/ClickHouse/clickhouse-go/blob/main/clickhouse.go#L304 which is being used by ch.dial(ctx) down in the code and later in ch.dial(ctx) will call handshake with same context so effectively we still using the Dial timeout in handshake even after my PR.

The way I see it, we have ctx on public APIs like Query(ctx), Exec(ctx) which are use-passed context. And when acquiring connection from the pool, we create "new" context by setting DialTimeout and call the dial. The handshake is happened to be inside the dial method. Whatever the timeout you passed via public APIs would still be in effect after dial to be used in real queries.

i'm curious what is your use case here?. You trying to set this timeout more dynamically only on the handsake of the whole dial method? What is blocking you to use just dialTimeout for the whole dial call? including handsake?

@NamanMahor
Copy link
Author

@kavirajk sorry about late response totally miss this one. I have added the test.

@NamanMahor
Copy link
Author

@kavirajk can you please have a look i have added the tests.

@NamanMahor
Copy link
Author

have one question which could also be bug. we are overriding the ctx with timeout/deadline here https://github.com/ClickHouse/clickhouse-go/blob/main/clickhouse.go#L304 which is being used by ch.dial(ctx) down in the code and later in ch.dial(ctx) will call handshake with same context so effectively we still using the Dial timeout in handshake even after my PR.

The way I see it, we have ctx on public APIs like Query(ctx), Exec(ctx) which are use-passed context. And when acquiring connection from the pool, we create "new" context by setting DialTimeout and call the dial. The handshake is happened to be inside the dial method. Whatever the timeout you passed via public APIs would still be in effect after dial to be used in real queries.

i'm curious what is your use case here?. You trying to set this timeout more dynamically only on the handsake of the whole dial method? What is blocking you to use just dialTimeout for the whole dial call? including handsake?

@kavirajk

At Rill, users configure ClickHouse connections through a UI where they provide the host, port, and other connection details. We commonly see two scenarios:

The user enters a hostname that is valid but not actually running ClickHouse.

The ClickHouse cluster is in a hibernated state and takes time to wake up.

To handle the hibernation case, we need to increase the read timeout so the handshake can wait long enough for the cluster to wake up. However, if we also increase the dial timeout, then the first scenario (wrong host) ends up waiting for the full dial timeout before failing, which leads to a poor user experience.

Ideally, we want the TCP dial to fail fast when the host is incorrect, while still allowing the handshake to wait longer when the cluster is waking up.

If the dial context isn’t reused for the handshake, the handshake can use the user-provided context (which has the longer timeout), and we don’t have to increase the dial timeout. This allows wrong-host cases to fail quickly while still supporting clusters that take longer to respond during handshake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants