tariff: fix HTTP 429 treated as permanent + goroutine leak on startup failure#28194
tariff: fix HTTP 429 treated as permanent + goroutine leak on startup failure#28194GrimmiMeloni wants to merge 6 commits intoevcc-io:masterfrom
Conversation
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- In
runOrError, the context is only cancelled on the error path; consider deferringcancel()so the context is also cleaned up on the success path and doesn’t outlive the initialisation unnecessarily. - The repeated
select { case <-ctx.Done(): return default: continue }pattern after each error in the variousrunimplementations could be simplified (e.g. by factoring into a small helper or checkingctx.Done()once per loop) to reduce duplication and make the cancellation behavior easier to reason about.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `runOrError`, the context is only cancelled on the error path; consider deferring `cancel()` so the context is also cleaned up on the success path and doesn’t outlive the initialisation unnecessarily.
- The repeated `select { case <-ctx.Done(): return default: continue }` pattern after each error in the various `run` implementations could be simplified (e.g. by factoring into a small helper or checking `ctx.Done()` once per loop) to reduce duplication and make the cancellation behavior easier to reason about.
## Individual Comments
### Comment 1
<location path="tariff/helper_test.go" line_range="98-101" />
<code_context>
+ // plenty of time to arrive. We replicate this with a short timer: if ctx is
+ // cancelled quickly (fix is in place) we exit cleanly; if ctx is never
+ // cancelled (bug is still present) the timer fires and we signal the leak.
+ select {
+ case <-ctx.Done():
+ return // correctly stopped by cancel() — no leak
+ case <-time.After(100 * time.Millisecond):
+ close(r.running) // cancel() never came — goroutine leak
+ <-r.stop
</code_context>
<issue_to_address>
**issue (testing):** Goroutine-leak test cannot actually detect the leak due to mismatched timeouts
In `persistingRunner.run`, the leak is only signaled when `r.running` is closed after `time.After(100 * time.Millisecond)`, but `TestRunOrError_DoesNotLeakGoroutineOnInitialFailure` waits only `50 * time.Millisecond` on `r.running` before considering the goroutine stopped. In the leak case (ctx never cancelled), the test will usually pass because it stops observing before the leak signal occurs. Please adjust the test to wait longer than the leak timer (e.g. ~150ms), or change the runner to signal completion/leak immediately (e.g. via a dedicated `stopped` channel that the test can assert closes within a deadline), so the test reliably catches regressions.
</issue_to_address>
### Comment 2
<location path="tariff/stekker.go" line_range="96" />
<code_context>
t.log.ERROR.Println(err)
- continue
+ select {
+ case <-ctx.Done():
+ return
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the repeated context cancellation `select` logic into a small helper function and calling it in each error branch to simplify the control flow.
You can keep the new cancellation behavior while reducing duplication by extracting the repeated `select` into a small helper and using it in each error branch.
For example:
```go
func ctxCanceled(ctx context.Context) bool {
select {
case <-ctx.Done():
return true
default:
return false
}
}
```
Then simplify each error path:
```go
resp, err := client.Get(url)
if err != nil {
once.Do(func() { done <- err })
t.log.ERROR.Println("http error:", err)
if ctxCanceled(ctx) {
return
}
continue
}
if resp.StatusCode != http.StatusOK {
once.Do(func() { done <- fmt.Errorf("http status %d", resp.StatusCode) })
t.log.ERROR.Printf("http status %d", resp.StatusCode)
resp.Body.Close()
if ctxCanceled(ctx) {
return
}
continue
}
// ... and similarly for the other error branches
```
This preserves the current semantics (immediate cancel check after each error) while removing the repeated `select` blocks, reducing branching noise and making the function easier to read and maintain.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| once.Do(func() { done <- err }) | ||
| t.log.ERROR.Println("http error:", err) | ||
| continue | ||
| select { |
There was a problem hiding this comment.
issue (complexity): Consider extracting the repeated context cancellation select logic into a small helper function and calling it in each error branch to simplify the control flow.
You can keep the new cancellation behavior while reducing duplication by extracting the repeated select into a small helper and using it in each error branch.
For example:
func ctxCanceled(ctx context.Context) bool {
select {
case <-ctx.Done():
return true
default:
return false
}
}Then simplify each error path:
resp, err := client.Get(url)
if err != nil {
once.Do(func() { done <- err })
t.log.ERROR.Println("http error:", err)
if ctxCanceled(ctx) {
return
}
continue
}
if resp.StatusCode != http.StatusOK {
once.Do(func() { done <- fmt.Errorf("http status %d", resp.StatusCode) })
t.log.ERROR.Printf("http status %d", resp.StatusCode)
resp.Body.Close()
if ctxCanceled(ctx) {
return
}
continue
}
// ... and similarly for the other error branchesThis preserves the current semantics (immediate cancel check after each error) while removing the repeated select blocks, reducing branching noise and making the function easier to read and maintain.
…up failure When a tariff's first API call returned HTTP 429 (Too Many Requests), backoffPermanentError wrapped it as a permanent error, causing backoff.Retry to abort immediately. runOrError then received the error and discarded the tariff — but the background goroutine was never signalled to stop, so it kept looping (blocked on its hourly tick) indefinitely, continuing to hit the API and preventing rate-limit recovery even after evcc appeared to disable the tariff. Fix 1: exclude HTTP 429 from permanent-error treatment in backoffPermanentError so backoff.Retry keeps retrying on rate limits. Fix 2: thread context.Context through the runnable interface so runOrError can cancel the goroutine via cancel() when startup fails, and all run() loops check ctx.Done() instead of blindly calling continue on error. Fixes evcc-io#26654 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
b1b9d13 to
f2e5578
Compare
…ervation window The persistingRunner was closing r.running after 100ms, but the test only waited 50ms for that signal. In the bug-present case the test would time out at 50ms and incorrectly conclude "no leak". Reduce the leak timer to 20ms so it fires well within the 50ms observation window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
As for stopping the background routine we already have the context. |
| once.Do(func() { done <- err }) | ||
|
|
||
| t.log.ERROR.Println(err) | ||
| continue |
There was a problem hiding this comment.
Imho the fix for the 429 would be to continue in case of 429 and assume that a follow-up call (in an hour) might work and return an early success. That is a bit dangerous though since the 429 might typically be done by a load balancer before credentials etc are even verified.
That said: it‘s unclear to me what this PR really fixes in relation to the original issue.
There was a problem hiding this comment.
You're right about the 429 — I have reverted that change. The 429 is correctly permanent within bo(), and the outer tick loop is the right recovery mechanism.
That said: it‘s unclear to me what this PR really fixes in relation to the original issue.
What this PR fixes is the goroutine leak when runOrError fails on startup (regardless of error type — 429, DNS failure, auth error, etc.):
- runOrError starts go
t.run(done)and waits on<-done - run() sends the error via
once.Do(func() { done <- err }) - runOrError receives the error, discards the tariff, returns nil, err
- But the goroutine is still alive — it hits continue, blocks on
<-tickfor one hour, then retries the API call. This repeats forever.
The goroutine outlives the tariff object. For GSI with tight rate limits, this is especially bad: evcc shows the tariff as disabled, but the orphaned goroutine keeps hitting the API every hour in the background. Restarting evcc creates a new leaking goroutine (just replacing the old one), so the provider never sees the traffic stop and the rate limit is never lifted.
The fix adds a stop channel to runOrError that gets closed when startup fails. Each run() method checks <-stop in its error path — if closed, it returns instead of calling continue. On the success path, stop is never closed, so the goroutine behaves exactly as before (runtime errors still retry via default: continue).
I initially tried using context.WithCancel for this, but it triggered a govet/lostcancel error because cancel() is intentionally not called on the success path. defer cancel() doesn't work either — it fires on both return paths, so it would cancel the context immediately on successful startup, causing the goroutine to exit permanently on its first runtime error instead of retrying. The stop channel avoids this: receiving from a closed channel returns immediately (goroutine exits), while receiving from an open channel blocks forever (goroutine keeps retrying via default: continue — identical to the pre-fix behaviour).
There was a problem hiding this comment.
tbo, I don't care much about the leak. This is not a mass problem (though we could still fix it). I'd really care about figuring out the root cause of #26654, which I couldn't sofar.
There was a problem hiding this comment.
It is my understanding that the leak is causing this. GSI is picky about their rate limits. And the leak that's being addressed is causing "silent" requests to keep going. The leak fix doesn't change the immediate outcome (tariff still fails on startup 429), but it makes the situation recoverable — evcc stops making things worse by continuing to hit a rate-limited API in the background.
HTTP 429 within the inner backoff.Retry loop is correctly permanent: it stops immediate retries and defers to the outer hourly tick loop, which is the right way to respect rate limiting. Keeping the goroutine leak fix is sufficient to address issue evcc-io#26654. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Fixes #26654 — Grünstromindex goroutine leak when startup fails with HTTP 429.
When a tariff's first API call fails (e.g. HTTP 429 at startup),
runOrErrordiscards thetariff and returns an error — but the background
run()goroutine had no way to be told aboutthis. It kept looping (blocked on its hourly tick) indefinitely, hitting the API in the
background even though evcc had seemingly disabled the tariff. For GSI this is particularly
harmful: the orphaned goroutine burns rate-limit quota, so the provider never sees evcc backing
off and the rate limit is never lifted. Restarting evcc creates a new leaking goroutine on top
of the old one.
Change
tariff/helper.go: add astopchannel torunOrError; close it when startup fails so the goroutine exits cleanly instead of continuing to make API calls in the background.run()implementations updated to acceptstop <-chan struct{}and check<-stopin error paths instead of unconditionally callingcontinue.tariff/helper_test.go: new testTestRunOrError_DoesNotLeakGoroutineOnInitialFailureasserts the goroutine exits after startup failure.Why a stop channel rather than
context.WithCancel?context.WithCanceltriggers agovet/lostcancellinter error becausecancel()isintentionally not called on the success path. The obvious workaround —
defer cancel()— isincorrect:
deferfires on both return paths, so it would cancel the context immediately onsuccessful startup too. That would cause the goroutine to exit permanently on its first runtime
error instead of retrying, silently breaking recovery for tariffs that hit a transient error
after days of normal operation.
A
stopchannel has the right asymmetry: a closed channel always unblocks a receive; an openchannel never does. In the success path
stopis left open forever, so<-stopin thegoroutine's error select is permanently blocked and
default: continuealways wins — identicalto the pre-fix behaviour.
Test Plan
go test ./tariff/...passesTestRunOrError_DoesNotLeakGoroutineOnInitialFailureis green