
tariff: fix HTTP 429 treated as permanent + goroutine leak on startup failure #28194

Draft
GrimmiMeloni wants to merge 6 commits into evcc-io:master from GrimmiMeloni:fix/tariff-429-goroutine-leak

Conversation

@GrimmiMeloni
Collaborator

@GrimmiMeloni GrimmiMeloni commented Mar 13, 2026

Summary

Fixes #26654 — Grünstromindex goroutine leak when startup fails with HTTP 429.

When a tariff's first API call fails (e.g. HTTP 429 at startup), runOrError discards the
tariff and returns an error — but the background run() goroutine had no way to be told about
this. It kept looping (blocked on its hourly tick) indefinitely, hitting the API in the
background even though evcc had seemingly disabled the tariff. For GSI this is particularly
harmful: the orphaned goroutine burns rate-limit quota, so the provider never sees evcc backing
off and the rate limit is never lifted. Restarting evcc creates a new leaking goroutine on top
of the old one.

Change

  • tariff/helper.go: add a stop channel to runOrError; close it when startup fails so the goroutine exits cleanly instead of continuing to make API calls in the background.
  • All run() implementations updated to accept stop <-chan struct{} and check <-stop in error paths instead of unconditionally calling continue.
  • tariff/helper_test.go: new test TestRunOrError_DoesNotLeakGoroutineOnInitialFailure asserts the goroutine exits after startup failure.

Why a stop channel rather than context.WithCancel?

context.WithCancel triggers a govet/lostcancel linter error because cancel() is
intentionally not called on the success path. The obvious workaround — defer cancel() — is
incorrect: defer fires on both return paths, so it would cancel the context immediately on
successful startup too. That would cause the goroutine to exit permanently on its first runtime
error instead of retrying, silently breaking recovery for tariffs that hit a transient error
after days of normal operation.

A stop channel has the right asymmetry: a closed channel always unblocks a receive; an open
channel never does. In the success path stop is left open forever, so <-stop in the
goroutine's error select is permanently blocked and default: continue always wins — identical
to the pre-fix behaviour.

Test Plan

  • go test ./tariff/... passes
  • TestRunOrError_DoesNotLeakGoroutineOnInitialFailure is green
  • Deploy with a GSI tariff and verify a 429 at startup no longer leaks a background goroutine

@GrimmiMeloni GrimmiMeloni requested a review from andig March 13, 2026 23:22
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues and left some high-level feedback:

  • In runOrError, the context is only cancelled on the error path; consider deferring cancel() so the context is also cleaned up on the success path and doesn’t outlive the initialisation unnecessarily.
  • The repeated select { case <-ctx.Done(): return default: continue } pattern after each error in the various run implementations could be simplified (e.g. by factoring into a small helper or checking ctx.Done() once per loop) to reduce duplication and make the cancellation behavior easier to reason about.

## Individual Comments

### Comment 1
<location path="tariff/helper_test.go" line_range="98-101" />
<code_context>
+	// plenty of time to arrive. We replicate this with a short timer: if ctx is
+	// cancelled quickly (fix is in place) we exit cleanly; if ctx is never
+	// cancelled (bug is still present) the timer fires and we signal the leak.
+	select {
+	case <-ctx.Done():
+		return // correctly stopped by cancel() — no leak
+	case <-time.After(100 * time.Millisecond):
+		close(r.running) // cancel() never came — goroutine leak
+		<-r.stop
</code_context>
<issue_to_address>
**issue (testing):** Goroutine-leak test cannot actually detect the leak due to mismatched timeouts

In `persistingRunner.run`, the leak is only signaled when `r.running` is closed after `time.After(100 * time.Millisecond)`, but `TestRunOrError_DoesNotLeakGoroutineOnInitialFailure` waits only `50 * time.Millisecond` on `r.running` before considering the goroutine stopped. In the leak case (ctx never cancelled), the test will usually pass because it stops observing before the leak signal occurs. Please adjust the test to wait longer than the leak timer (e.g. ~150ms), or change the runner to signal completion/leak immediately (e.g. via a dedicated `stopped` channel that the test can assert closes within a deadline), so the test reliably catches regressions.
</issue_to_address>

### Comment 2
<location path="tariff/stekker.go" line_range="96" />
<code_context>

 			t.log.ERROR.Println(err)
-			continue
+			select {
+			case <-ctx.Done():
+				return
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the repeated context cancellation `select` logic into a small helper function and calling it in each error branch to simplify the control flow.

You can keep the new cancellation behavior while reducing duplication by extracting the repeated `select` into a small helper and using it in each error branch.

For example:

```go
func ctxCanceled(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}
```

Then simplify each error path:

```go
resp, err := client.Get(url)
if err != nil {
	once.Do(func() { done <- err })
	t.log.ERROR.Println("http error:", err)

	if ctxCanceled(ctx) {
		return
	}
	continue
}

if resp.StatusCode != http.StatusOK {
	once.Do(func() { done <- fmt.Errorf("http status %d", resp.StatusCode) })
	t.log.ERROR.Printf("http status %d", resp.StatusCode)
	resp.Body.Close()

	if ctxCanceled(ctx) {
		return
	}
	continue
}

// ... and similarly for the other error branches
```

This preserves the current semantics (immediate cancel check after each error) while removing the repeated `select` blocks, reducing branching noise and making the function easier to read and maintain.
</issue_to_address>



…up failure

When a tariff's first API call returned HTTP 429 (Too Many Requests),
backoffPermanentError wrapped it as a permanent error, causing backoff.Retry
to abort immediately. runOrError then received the error and discarded the
tariff — but the background goroutine was never signalled to stop, so it kept
looping (blocked on its hourly tick) indefinitely, continuing to hit the API
and preventing rate-limit recovery even after evcc appeared to disable the
tariff.

Fix 1: exclude HTTP 429 from permanent-error treatment in backoffPermanentError
so backoff.Retry keeps retrying on rate limits.

Fix 2: thread context.Context through the runnable interface so runOrError can
cancel the goroutine via cancel() when startup fails, and all run() loops check
ctx.Done() instead of blindly calling continue on error.

Fixes evcc-io#26654

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@GrimmiMeloni GrimmiMeloni force-pushed the fix/tariff-429-goroutine-leak branch from b1b9d13 to f2e5578 on March 13, 2026 23:47
…ervation window

The persistingRunner was closing r.running after 100ms, but the test
only waited 50ms for that signal. In the bug-present case the test would
time out at 50ms and incorrectly conclude "no leak". Reduce the leak
timer to 20ms so it fires well within the 50ms observation window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andig
Copy link
Member

andig commented Mar 14, 2026

As for stopping the background routine we already have the context.

@andig andig marked this pull request as draft March 14, 2026 08:42
@andig andig added the bug Something isn't working label Mar 14, 2026
once.Do(func() { done <- err })

t.log.ERROR.Println(err)
continue
Member


Imho the fix for the 429 would be to continue in case of a 429 and assume that a follow-up call (in an hour) might work, returning an early success. That is a bit dangerous though, since the 429 is typically issued by a load balancer before credentials etc. are even verified.

That said: it's unclear to me what this PR really fixes in relation to the original issue.

Collaborator Author


You're right about the 429 — I have reverted that change. The 429 is correctly permanent within bo(), and the outer tick loop is the right recovery mechanism.

That said: it's unclear to me what this PR really fixes in relation to the original issue.

What this PR fixes is the goroutine leak when runOrError fails on startup (regardless of error type — 429, DNS failure, auth error, etc.):

  1. runOrError starts go t.run(done) and waits on <-done
  2. run() sends the error via once.Do(func() { done <- err })
  3. runOrError receives the error, discards the tariff, returns nil, err
  4. But the goroutine is still alive — it hits continue, blocks on <-tick for one hour, then retries the API call. This repeats forever.

The goroutine outlives the tariff object. For GSI with tight rate limits, this is especially bad: evcc shows the tariff as disabled, but the orphaned goroutine keeps hitting the API every hour in the background. Restarting evcc creates a new leaking goroutine (just replacing the old one), so the provider never sees the traffic stop and the rate limit is never lifted.

The fix adds a stop channel to runOrError that gets closed when startup fails. Each run() method checks <-stop in its error path — if closed, it returns instead of calling continue. On the success path, stop is never closed, so the goroutine behaves exactly as before (runtime errors still retry via default: continue).

I initially tried using context.WithCancel for this, but it triggered a govet/lostcancel error because cancel() is intentionally not called on the success path. defer cancel() doesn't work either — it fires on both return paths, so it would cancel the context immediately on successful startup, causing the goroutine to exit permanently on its first runtime error instead of retrying. The stop channel avoids this: receiving from a closed channel returns immediately (goroutine exits), while receiving from an open channel blocks forever (goroutine keeps retrying via default: continue — identical to the pre-fix behaviour).

Member


tbh, I don't care much about the leak. This is not a mass problem (though we could still fix it). I'd really care about figuring out the root cause of #26654, which I couldn't so far.

Collaborator Author


It is my understanding that the leak is causing this. GSI is picky about their rate limits. And the leak that's being addressed is causing "silent" requests to keep going. The leak fix doesn't change the immediate outcome (tariff still fails on startup 429), but it makes the situation recoverable — evcc stops making things worse by continuing to hit a rate-limited API in the background.

GrimmiMeloni and others added 2 commits March 14, 2026 14:19
HTTP 429 within the inner backoff.Retry loop is correctly permanent:
it stops immediate retries and defers to the outer hourly tick loop,
which is the right way to respect rate limiting. Keeping the goroutine
leak fix is sufficient to address issue evcc-io#26654.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Successfully merging this pull request may close these issues.

Grünstromindex: unexpected status: 429 (Too Many Requests)

2 participants