
tariff: fix HTTP 429 treated as permanent + goroutine leak on startup failure #28194

Draft
GrimmiMeloni wants to merge 6 commits into evcc-io:master from GrimmiMeloni:fix/tariff-429-goroutine-leak

Conversation

@GrimmiMeloni
Collaborator

@GrimmiMeloni GrimmiMeloni commented Mar 13, 2026

Summary

Fixes #26654 — Grünstromindex goroutine leak when startup fails with HTTP 429.

When a tariff's first API call fails (e.g. HTTP 429 at startup), runOrError discards the
tariff and returns an error — but the background run() goroutine had no way to be told about
this. It kept looping (blocked on its hourly tick) indefinitely, hitting the API in the
background even though evcc had seemingly disabled the tariff. For GSI this is particularly
harmful: the orphaned goroutine burns rate-limit quota, so the provider never sees evcc backing
off and the rate limit is never lifted. Restarting evcc creates a new leaking goroutine on top
of the old one.

Change

  • tariff/helper.go: add a stop channel to runOrError; close it when startup fails so the goroutine exits cleanly instead of continuing to make API calls in the background.
  • All run() implementations updated to accept stop <-chan struct{} and check <-stop in error paths instead of unconditionally calling continue.
  • tariff/helper_test.go: new test TestRunOrError_DoesNotLeakGoroutineOnInitialFailure asserts the goroutine exits after startup failure.

Why a stop channel rather than context.WithCancel?

context.WithCancel triggers a govet/lostcancel linter error because cancel() is
intentionally not called on the success path. The obvious workaround — defer cancel() — is
incorrect: defer fires on both return paths, so it would cancel the context immediately on
successful startup too. That would cause the goroutine to exit permanently on its first runtime
error instead of retrying, silently breaking recovery for tariffs that hit a transient error
after days of normal operation.

A stop channel has the right asymmetry: a closed channel always unblocks a receive; an open
channel never does. In the success path stop is left open forever, so <-stop in the
goroutine's error select is permanently blocked and default: continue always wins — identical
to the pre-fix behaviour.

Test Plan

  • go test ./tariff/... passes
  • TestRunOrError_DoesNotLeakGoroutineOnInitialFailure is green
  • Deploy with a GSI tariff and verify a 429 at startup no longer leaks a background goroutine

@GrimmiMeloni GrimmiMeloni requested a review from andig March 13, 2026 23:22
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues and left some high-level feedback:

  • In runOrError, the context is only cancelled on the error path; consider deferring cancel() so the context is also cleaned up on the success path and doesn’t outlive the initialisation unnecessarily.
  • The repeated select { case <-ctx.Done(): return default: continue } pattern after each error in the various run implementations could be simplified (e.g. by factoring into a small helper or checking ctx.Done() once per loop) to reduce duplication and make the cancellation behavior easier to reason about.

## Individual Comments

### Comment 1
<location path="tariff/helper_test.go" line_range="98-101" />
<code_context>
+	// plenty of time to arrive. We replicate this with a short timer: if ctx is
+	// cancelled quickly (fix is in place) we exit cleanly; if ctx is never
+	// cancelled (bug is still present) the timer fires and we signal the leak.
+	select {
+	case <-ctx.Done():
+		return // correctly stopped by cancel() — no leak
+	case <-time.After(100 * time.Millisecond):
+		close(r.running) // cancel() never came — goroutine leak
+		<-r.stop
</code_context>
<issue_to_address>
**issue (testing):** Goroutine-leak test cannot actually detect the leak due to mismatched timeouts

In `persistingRunner.run`, the leak is only signaled when `r.running` is closed after `time.After(100 * time.Millisecond)`, but `TestRunOrError_DoesNotLeakGoroutineOnInitialFailure` waits only `50 * time.Millisecond` on `r.running` before considering the goroutine stopped. In the leak case (ctx never cancelled), the test will usually pass because it stops observing before the leak signal occurs. Please adjust the test to wait longer than the leak timer (e.g. ~150ms), or change the runner to signal completion/leak immediately (e.g. via a dedicated `stopped` channel that the test can assert closes within a deadline), so the test reliably catches regressions.
</issue_to_address>

### Comment 2
<location path="tariff/stekker.go" line_range="96" />
<code_context>

 			t.log.ERROR.Println(err)
-			continue
+			select {
+			case <-ctx.Done():
+				return
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the repeated context cancellation `select` logic into a small helper function and calling it in each error branch to simplify the control flow.

You can keep the new cancellation behavior while reducing duplication by extracting the repeated `select` into a small helper and using it in each error branch.

For example:

```go
func ctxCanceled(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}
```

Then simplify each error path:

```go
resp, err := client.Get(url)
if err != nil {
	once.Do(func() { done <- err })
	t.log.ERROR.Println("http error:", err)

	if ctxCanceled(ctx) {
		return
	}
	continue
}

if resp.StatusCode != http.StatusOK {
	once.Do(func() { done <- fmt.Errorf("http status %d", resp.StatusCode) })
	t.log.ERROR.Printf("http status %d", resp.StatusCode)
	resp.Body.Close()

	if ctxCanceled(ctx) {
		return
	}
	continue
}

// ... and similarly for the other error branches
```

This preserves the current semantics (immediate cancel check after each error) while removing the repeated `select` blocks, reducing branching noise and making the function easier to read and maintain.
</issue_to_address>



…up failure

When a tariff's first API call returned HTTP 429 (Too Many Requests),
backoffPermanentError wrapped it as a permanent error, causing backoff.Retry
to abort immediately. runOrError then received the error and discarded the
tariff — but the background goroutine was never signalled to stop, so it kept
looping (blocked on its hourly tick) indefinitely, continuing to hit the API
and preventing rate-limit recovery even after evcc appeared to disable the
tariff.

Fix 1: exclude HTTP 429 from permanent-error treatment in backoffPermanentError
so backoff.Retry keeps retrying on rate limits.

Fix 2: thread context.Context through the runnable interface so runOrError can
cancel the goroutine via cancel() when startup fails, and all run() loops check
ctx.Done() instead of blindly calling continue on error.

Fixes evcc-io#26654

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@GrimmiMeloni GrimmiMeloni force-pushed the fix/tariff-429-goroutine-leak branch from b1b9d13 to f2e5578 on March 13, 2026 23:47
…ervation window

The persistingRunner was closing r.running after 100ms, but the test
only waited 50ms for that signal. In the bug-present case the test would
time out at 50ms and incorrectly conclude "no leak". Reduce the leak
timer to 20ms so it fires well within the 50ms observation window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andig
Copy link
Member

andig commented Mar 14, 2026

As for stopping the background routine we already have the context.

@andig andig marked this pull request as draft March 14, 2026 08:42
@andig andig added the bug Something isn't working label Mar 14, 2026
once.Do(func() { done <- err })

t.log.ERROR.Println(err)
continue
Member


Imho the fix for the 429 would be to continue in case of a 429 and assume that a follow-up call (in an hour) might work, returning an early success. That is a bit dangerous though, since the 429 is typically issued by a load balancer before credentials etc. are even verified.

That said: it's unclear to me what this PR really fixes in relation to the original issue.

Collaborator Author


You're right about the 429 — I have reverted that change. The 429 is correctly permanent within bo(), and the outer tick loop is the right recovery mechanism.

That said: it's unclear to me what this PR really fixes in relation to the original issue.

What this PR fixes is the goroutine leak when runOrError fails on startup (regardless of error type — 429, DNS failure, auth error, etc.):

  1. runOrError starts go t.run(done) and waits on <-done
  2. run() sends the error via once.Do(func() { done <- err })
  3. runOrError receives the error, discards the tariff, returns nil, err
  4. But the goroutine is still alive — it hits continue, blocks on <-tick for one hour, then retries the API call. This repeats forever.

The goroutine outlives the tariff object. For GSI with tight rate limits, this is especially bad: evcc shows the tariff as disabled, but the orphaned goroutine keeps hitting the API every hour in the background. Restarting evcc creates a new leaking goroutine (just replacing the old one), so the provider never sees the traffic stop and the rate limit is never lifted.

The fix adds a stop channel to runOrError that gets closed when startup fails. Each run() method checks <-stop in its error path — if closed, it returns instead of calling continue. On the success path, stop is never closed, so the goroutine behaves exactly as before (runtime errors still retry via default: continue).

I initially tried using context.WithCancel for this, but it triggered a govet/lostcancel error because cancel() is intentionally not called on the success path. defer cancel() doesn't work either — it fires on both return paths, so it would cancel the context immediately on successful startup, causing the goroutine to exit permanently on its first runtime error instead of retrying. The stop channel avoids this: receiving from a closed channel returns immediately (goroutine exits), while receiving from an open channel blocks forever (goroutine keeps retrying via default: continue — identical to the pre-fix behaviour).

Member


tbh, I don't care much about the leak. This is not a mass problem (though we could still fix it). I'd really care about figuring out the root cause of #26654, which I couldn't so far.

Collaborator Author


It is my understanding that the leak is causing this. GSI is picky about their rate limits. And the leak that's being addressed is causing "silent" requests to keep going. The leak fix doesn't change the immediate outcome (tariff still fails on startup 429), but it makes the situation recoverable — evcc stops making things worse by continuing to hit a rate-limited API in the background.

GrimmiMeloni and others added 2 commits March 14, 2026 14:19
HTTP 429 within the inner backoff.Retry loop is correctly permanent:
it stops immediate retries and defers to the outer hourly tick loop,
which is the right way to respect rate limiting. Keeping the goroutine
leak fix is sufficient to address issue evcc-io#26654.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Successfully merging this pull request may close these issues.

Grünstromindex: unexpected status: 429 (Too Many Requests)

2 participants