
Fix reference count bug in partition batcher #14444

Merged
bogdandrutu merged 3 commits into open-telemetry:main from aditya-systems-hub:fix/partition-batcher-refcount
Jan 19, 2026


@aditya-systems-hub
Contributor


Summary

This PR fixes an off-by-one error in the partition batcher’s reference counting logic that could cause exporter errors to be silently dropped under specific error conditions.

When a batch is split into multiple requests and MergeSplit() returns an error, the reference counter was initialized with an incorrect value due to a copy-paste mistake. This could lead to the done callback firing too early, before all flush operations completed.


Problem Description

In Consume, the number of references (numRefs) is intentionally incremented when mergeSplitErr is non-nil to account for the additional error callback. However, the reference counter was initialized using len(reqList) instead of numRefs.

As a result, whenever mergeSplitErr was non-nil, the reference count was one lower than the number of callbacks that would actually be invoked.

Buggy behavior (simplified)

numRefs := len(reqList)
if mergeSplitErr != nil {
    numRefs++
}

done = newRefCountDone(done, int64(len(reqList))) // incorrect

This mismatch causes the underlying done callback to be triggered prematurely.
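
To make the premature firing concrete, here is a minimal, self-contained sketch. It does not reproduce the collector's actual code: the `refCountDone` type, its `OnDone` method, the `simulate` helper, and the choice to ignore calls that arrive after the count reaches zero are all assumptions made for illustration. It shows that a counter initialized one too low completes after the second callback and never sees the error carried by the third.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errExport is a hypothetical error used only for this illustration.
var errExport = errors.New("export failed")

// refCountDone is a hypothetical stand-in for a ref-counted done callback:
// it forwards to onDone once `refs` calls have arrived.
type refCountDone struct {
	mu     sync.Mutex
	refs   int64
	errs   []error
	onDone func(error)
}

func newRefCountDone(onDone func(error), refs int64) *refCountDone {
	return &refCountDone{refs: refs, onDone: onDone}
}

// OnDone records err and fires the wrapped callback when the count hits zero.
// Calls that arrive after that point are ignored (an assumption of this sketch).
func (r *refCountDone) OnDone(err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs <= 0 {
		return // late call: its error never reaches the caller
	}
	if err != nil {
		r.errs = append(r.errs, err)
	}
	r.refs--
	if r.refs == 0 {
		r.onDone(errors.Join(r.errs...))
	}
}

// simulate calls OnDone once per entry in results against a counter
// initialized with refs, and reports what the caller observed.
func simulate(refs int64, results []error) (fired bool, got error) {
	d := newRefCountDone(func(err error) { fired, got = true, err }, refs)
	for _, res := range results {
		d.OnDone(res)
	}
	return fired, got
}

func main() {
	// Two split requests succeed, then a third callback carries an error.
	results := []error{nil, nil, errExport}

	_, got := simulate(2, results) // undercounted, as in the buggy line
	fmt.Println("refs=2:", got)    // prints "refs=2: <nil>"

	_, got = simulate(3, results) // counted correctly
	fmt.Println("refs=3:", got)   // prints "refs=3: export failed"
}
```

With the undercounted value the caller is told the work finished successfully; with the correct value the same sequence of callbacks surfaces the error.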


Impact

Before this fix

  • Exporter errors from final flush operations could be silently lost
  • done could be invoked before all export operations completed
  • Error aggregation reported back to callers could be incomplete
  • In queue-based exporters using waitForResult, callers could observe success even when exports failed
  • Silent data loss in production telemetry pipelines was possible

After this fix

  • All flush operations are correctly tracked
  • Errors from all callbacks are properly aggregated
  • done is invoked only after all operations complete
  • Completion signaling is consistent and reliable

Steps to Reproduce

This issue occurs when all of the following conditions are met:

  1. A batch is split into multiple requests (len(reqList) >= 2)
  2. MergeSplit() returns a non-nil error
  3. The partition batcher initializes the reference counter

In this case, the extra error callback increases the true number of references, but the counter was initialized with a lower value, causing premature completion.

This is an edge case and does not crash or panic, which makes it difficult to detect without careful inspection or targeted testing.


Fix

The fix ensures the reference counter is initialized with the correct number of references (numRefs) so that all callbacks are properly accounted for.

Correct behavior

numRefs := len(reqList)
if mergeSplitErr != nil {
    numRefs++
}

done = newRefCountDone(done, int64(numRefs)) // correct

This aligns the logic with the already-correct implementation used earlier in the same file and restores correct lifecycle handling.
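
One way to keep the count and its use from drifting apart again is to derive the value in a single place and pass only that value to the counter. The sketch below is illustrative rather than part of this PR: the `request` type, the `numCallbacks` helper, and `errMergeSplit` are hypothetical names, and only the counting rule (one reference per split request, plus one when MergeSplit fails) comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// errMergeSplit is a hypothetical error used only for this illustration.
var errMergeSplit = errors.New("merge split failed")

// request is a hypothetical placeholder for one split export request.
type request struct{ items int }

// numCallbacks returns how many times the done callback will be invoked:
// once per split request, plus once more to report a MergeSplit error.
func numCallbacks(reqList []request, mergeSplitErr error) int64 {
	n := int64(len(reqList))
	if mergeSplitErr != nil {
		n++
	}
	return n
}

func main() {
	reqs := []request{{items: 10}, {items: 4}}
	fmt.Println(numCallbacks(reqs, nil))           // prints 2
	fmt.Println(numCallbacks(reqs, errMergeSplit)) // prints 3
}
```

Because the helper already returns an `int64`, a call site has no reason to reach for `len(reqList)` again when initializing the counter.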


Why This Is Important

This bug is particularly hard to detect because:

  • No crash or panic occurs
  • The failure mode is silent
  • Most exports still succeed
  • The issue only affects specific edge cases involving split batches and merge errors

However, when it does occur, it can lead to silent telemetry data loss and misleading success signals in production systems.


@aditya-systems-hub aditya-systems-hub requested a review from a team as a code owner January 17, 2026 23:02
@linux-foundation-easycla

linux-foundation-easycla bot commented Jan 17, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@github-actions github-actions bot requested a review from bogdandrutu January 17, 2026 23:02
@aditya-systems-hub
Contributor Author

Hi @dmitryax, @bogdandrutu,
I’ve addressed the issue and added a fix for the reference counting bug in the partition batcher. I’ve also included a detailed explanation in the PR description covering the impact, reproduction scenario, and rationale for the change.

When you have time, I’d appreciate a review or any feedback you may have. Please let me know if you’d like additional tests or adjustments.

Thank you for your time and guidance.

@bogdandrutu bogdandrutu enabled auto-merge January 18, 2026 16:47
@codecov

codecov bot commented Jan 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.80%. Comparing base (7c31dd5) to head (6d459ee).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14444      +/-   ##
==========================================
- Coverage   91.81%   91.80%   -0.02%     
==========================================
  Files         677      677              
  Lines       42677    42677              
==========================================
- Hits        39184    39179       -5     
- Misses       2433     2436       +3     
- Partials     1060     1062       +2     


@bogdandrutu
Member

Please add a changelog entry

@aditya-systems-hub
Contributor Author

Thank you, @bogdandrutu, for pointing this out. I will add a changelog entry under /.chloggen to document this change.

auto-merge was automatically disabled January 18, 2026 19:13

Head branch was pushed to by a user without write access

Signed-off-by: aditya4044656 <adityakuchekar0077@gmail.com>
@aditya-systems-hub aditya-systems-hub force-pushed the fix/partition-batcher-refcount branch from d0ee32a to 6e48cb4 Compare January 18, 2026 20:12
@bogdandrutu bogdandrutu enabled auto-merge January 19, 2026 01:12
Update component from exporter/exporterhelper to pkg/exporterhelper
auto-merge was automatically disabled January 19, 2026 02:51

Head branch was pushed to by a user without write access

@bogdandrutu bogdandrutu enabled auto-merge January 19, 2026 03:04
@bogdandrutu bogdandrutu added this pull request to the merge queue Jan 19, 2026
Merged via the queue into open-telemetry:main with commit b32538d Jan 19, 2026
61 checks passed
@otelbot
Contributor

otelbot bot commented Jan 19, 2026

Thank you for your contribution @aditya4044656! 🎉 We would like to hear from you about your experience contributing to OpenTelemetry by taking a few minutes to fill out this survey.

TimoBehrendt pushed a commit to TimoBehrendt/tracebasedlogsampler that referenced this pull request Feb 9, 2026
This PR contains the following updates:

| Package | Type | Update | Change | Pending |
|---|---|---|---|---|
| [go.opentelemetry.io/collector/component](https://github.com/open-telemetry/opentelemetry-collector) | require | minor | `v1.45.0` → `v1.50.0` | `v1.51.0` |
| [go.opentelemetry.io/collector/component/componenttest](https://github.com/open-telemetry/opentelemetry-collector) | require | minor | `v0.139.0` → `v0.144.0` | `v0.145.0` |
| [go.opentelemetry.io/collector/confmap](https://github.com/open-telemetry/opentelemetry-collector) | require | minor | `v1.45.0` → `v1.50.0` | `v1.51.0` |
| [go.opentelemetry.io/collector/consumer](https://github.com/open-telemetry/opentelemetry-collector) | require | minor | `v1.45.0` → `v1.50.0` | `v1.51.0` |
| [go.opentelemetry.io/collector/consumer/consumertest](https://github.com/open-telemetry/opentelemetry-collector) | require | minor | `v0.139.0` → `v0.144.0` | `v0.145.0` |
| [go.opentelemetry.io/collector/pdata](https://github.com/open-telemetry/opentelemetry-collector) | require | minor | `v1.45.0` → `v1.50.0` | `v1.51.0` |
| [go.opentelemetry.io/collector/processor](https://github.com/open-telemetry/opentelemetry-collector) | require | minor | `v1.45.0` → `v1.50.0` | `v1.51.0` |
| [go.opentelemetry.io/collector/processor/processortest](https://github.com/open-telemetry/opentelemetry-collector) | require | minor | `v0.139.0` → `v0.144.0` | `v0.145.0` |
| [go.uber.org/zap](https://github.com/uber-go/zap) | require | patch | `v1.27.0` → `v1.27.1` |  |

---

### Release Notes

<details>
<summary>open-telemetry/opentelemetry-collector (go.opentelemetry.io/collector/component)</summary>

### [`v1.50.0`](https://github.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v1500v01440)

##### 🛑 Breaking changes 🛑

- `pkg/exporterhelper`: Change verbosity level for otelcol\_exporter\_queue\_batch\_send\_size metric to detailed. ([#&#8203;14278](open-telemetry/opentelemetry-collector#14278))
- `pkg/service`: Remove deprecated `telemetry.disableHighCardinalityMetrics` feature gate. ([#&#8203;14373](open-telemetry/opentelemetry-collector#14373))
- `pkg/service`: Remove deprecated `service.noopTracerProvider` feature gate. ([#&#8203;14374](open-telemetry/opentelemetry-collector#14374))

##### 🚩 Deprecations 🚩

- `exporter/otlp_grpc`: Rename `otlp` exporter to `otlp_grpc` exporter and add deprecated alias `otlp`. ([#&#8203;14403](open-telemetry/opentelemetry-collector#14403))
- `exporter/otlp_http`: Rename `otlphttp` exporter to `otlp_http` exporter and add deprecated alias `otlphttp`. ([#&#8203;14396](open-telemetry/opentelemetry-collector#14396))

##### 💡 Enhancements 💡

- `cmd/builder`: Avoid duplicate CLI error logging in generated collector binaries by relying on cobra's error handling. ([#&#8203;14317](open-telemetry/opentelemetry-collector#14317))

- `cmd/mdatagen`: Add the ability to disable attributes at the metric level and re-aggregate data points based off of these new dimensions ([#&#8203;10726](open-telemetry/opentelemetry-collector#10726))

- `cmd/mdatagen`: Add optional `display_name` and `description` fields to metadata.yaml for human-readable component names ([#&#8203;14114](open-telemetry/opentelemetry-collector#14114))
  The `display_name` field allows components to specify a human-readable name in metadata.yaml.
  When provided, this name is used as the title in generated README files.
  The `description` field allows components to include a brief description in generated README files.

- `cmd/mdatagen`: Validate stability level for entities ([#&#8203;14425](open-telemetry/opentelemetry-collector#14425))

- `pkg/xexporterhelper`: Reenable batching for profiles ([#&#8203;14313](open-telemetry/opentelemetry-collector#14313))

- `receiver/nop`: add profiles signal support ([#&#8203;14253](open-telemetry/opentelemetry-collector#14253))

##### 🧰 Bug fixes 🧰

- `pkg/exporterhelper`: Fix reference count bug in partition batcher ([#&#8203;14444](open-telemetry/opentelemetry-collector#14444))


### [`v1.49.0`](https://github.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v1490v01430)

##### 💡 Enhancements 💡

- `all`: Update semconv import to 1.38.0 ([#&#8203;14305](open-telemetry/opentelemetry-collector#14305))
- `exporter/nop`: Add profiles support to nop exporter ([#&#8203;14331](open-telemetry/opentelemetry-collector#14331))
- `pkg/pdata`: Optimize the size and pointer bytes for pdata structs ([#&#8203;14339](open-telemetry/opentelemetry-collector#14339))
- `pkg/pdata`: Avoid using interfaces/oneof like style for optional fields ([#&#8203;14333](open-telemetry/opentelemetry-collector#14333))


### [`v1.48.0`](https://github.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v1480v01420)

##### 💡 Enhancements 💡

- `exporter/debug`: Add logging of dropped attributes, events, and links counts in detailed verbosity ([#&#8203;14202](open-telemetry/opentelemetry-collector#14202))

- `extension/memory_limiter`: The memorylimiter extension can be used as an HTTP/GRPC middleware. ([#&#8203;14081](open-telemetry/opentelemetry-collector#14081))

- `pkg/config/configgrpc`: Statically validate gRPC endpoint ([#&#8203;10451](open-telemetry/opentelemetry-collector#10451))
  This validation was already done in the OTLP exporter. It will now be applied to any gRPC client.

- `pkg/service`: Add support to disabling adding resource attributes as zap fields in internal logging ([#&#8203;13869](open-telemetry/opentelemetry-collector#13869))
  Note that this does not affect logs exported through OTLP.


### [`v1.47.0`](https://github.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v1470v01410)

##### 🛑 Breaking changes 🛑

- `pkg/config/confighttp`: Use configoptional.Optional for confighttp.ClientConfig.Cookies field ([#&#8203;14021](open-telemetry/opentelemetry-collector#14021))

##### 💡 Enhancements 💡

- `pkg/config/confighttp`: Setting `compression_algorithms` to an empty list now disables automatic decompression, ignoring Content-Encoding ([#&#8203;14131](open-telemetry/opentelemetry-collector#14131))
- `pkg/service`: Update semantic conventions from internal telemetry to v1.37.0 ([#&#8203;14232](open-telemetry/opentelemetry-collector#14232))
- `pkg/xscraper`: Implement xscraper for Profiles. ([#&#8203;13915](open-telemetry/opentelemetry-collector#13915))

##### 🧰 Bug fixes 🧰

- `pkg/config/configoptional`: Ensure that configoptional.None values resulting from unmarshaling are equivalent to configoptional.Optional zero value. ([#&#8203;14218](open-telemetry/opentelemetry-collector#14218))


### [`v1.46.0`](https://github.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v1460v01400)

##### 💡 Enhancements 💡

- `cmd/mdatagen`: `metadata.yaml` now supports an optional `entities` section to organize resource attributes into logical entities with identity and description attributes ([#&#8203;14051](open-telemetry/opentelemetry-collector#14051))
  When entities are defined, mdatagen generates `AssociateWith{EntityType}()` methods on ResourceBuilder
  that associate resources with entity types using the entity refs API. The entities section is backward
  compatible - existing metadata.yaml files without entities continue to work as before.

- `cmd/mdatagen`: Add semconv reference for metrics ([#&#8203;13920](open-telemetry/opentelemetry-collector#13920))

- `connector/forward`: Add support for Profiles to Profiles ([#&#8203;14092](open-telemetry/opentelemetry-collector#14092))

- `exporter/debug`: Disable sending queue by default ([#&#8203;14138](open-telemetry/opentelemetry-collector#14138))
  The recently added sending queue configuration in Debug exporter was enabled by default and had a problematic default size of 1.
  This change disables the sending queue by default.
  Users can enable and configure the sending queue if needed.

- `pkg/config/configoptional`: Mark `configoptional.AddEnabledField` as beta ([#&#8203;14021](open-telemetry/opentelemetry-collector#14021))

- `pkg/otelcol`: This feature has been improved and tested; secure-by-default redacts configopaque values ([#&#8203;12369](open-telemetry/opentelemetry-collector#12369))

##### 🧰 Bug fixes 🧰

- `all`: Ensure service service.instance.id is the same for all the signals when it is autogenerated. ([#&#8203;14140](open-telemetry/opentelemetry-collector#14140))


</details>

<details>
<summary>uber-go/zap (go.uber.org/zap)</summary>

### [`v1.27.1`](https://github.com/uber-go/zap/releases/tag/v1.27.1)

[Compare Source](uber-go/zap@v1.27.0...v1.27.1)

Enhancements:

- [#&#8203;1501][]: prevent `Object` from panicking on nils
- [#&#8203;1511][]: Fix a race condition in `WithLazy`.

Thanks to [@&#8203;rabbbit](https://github.com/rabbbit), [@&#8203;alshopov](https://github.com/alshopov), [@&#8203;jquirke](https://github.com/jquirke), [@&#8203;arukiidou](https://github.com/arukiidou) for their contributions to this release.

[#&#8203;1501]: uber-go/zap#1501

[#&#8203;1511]: uber-go/zap#1511

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

👻 **Immortal**: This PR will be recreated if closed unmerged. Get [config help](https://github.com/renovatebot/renovate/discussions) if that's undesired.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://gitea.t000-n.de/t.behrendt/tracebasedlogsampler/pulls/25
Reviewed-by: t.behrendt <t.behrendt@noreply.localhost>
Co-authored-by: Renovate Bot <renovate@t00n.de>
Co-committed-by: Renovate Bot <renovate@t00n.de>