EW-9372 EW-9455 [o11y] Prepare reporting SpanOpen event earlier #5370

Merged
fhanau merged 1 commit into main from felix/102125-stw-cleanup
Mar 18, 2026

Conversation

@fhanau
Contributor

@fhanau fhanau commented Oct 21, 2025

Purpose:
This PR serves to perform two long-standing cleanup tasks in the STW implementation:

  1. Sending the SpanOpen event as soon as a span is opened instead of when it closes
  2. Getting rid of the CompleteSpan struct, which represents a full span but is something that won't be needed once SpanOpen is handled separately.

To implement this in a backwards-compatible way, we need to land it in two parts so that the old code path and the new code path are both supported until we have phased out the old version which doesn't have the APIs for handling SpanOpen separately.

For code that is workerd-only and thus never involved in RPC, or that is solely on the RPC server side, we can already decompose function calls so that we don't need to implement the same functionality twice. This needs to land alongside a downstream PR. A follow-up PR will actually invoke the code path to send SpanOpen first, get rid of the CompleteSpan struct and perform a bunch of cleanup – see #6051.

Note that:

  • The internal tracing system will not be affected by these changes – we still propagate completed spans there. In the final version, this differentiation is implemented through differences in the SpanObserver implementations.
  • Some functions that are being added here won't actually be called just yet. That will change in the follow-up, and in some cases they are already necessary for backwards compatibility.
  • Commit history still needs to be cleaned up

@fhanau fhanau requested review from a team as code owners October 21, 2025 20:36
@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch 2 times, most recently from 20b779f to b6452a3 Compare October 22, 2025 20:01
@fhanau fhanau requested a review from a team as a code owner October 22, 2025 20:01
@codspeed-hq

codspeed-hq bot commented Oct 22, 2025

Merging this PR will not alter performance

✅ 70 untouched benchmarks
⏩ 129 skipped benchmarks [1]


Comparing felix/102125-stw-cleanup (a0e61e3) with main (ebeeae2)

Footnotes

  1. 129 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@mar-cf
Contributor

mar-cf commented Oct 28, 2025

Tests fail, but that might just be from being out of sync or on top of something old?

A short PR description would help.

@fhanau
Contributor Author

fhanau commented Dec 24, 2025

Closing this for now – delivering SpanOpen earlier would make it more difficult to implement renaming spans, which we may support in the future.

@fhanau fhanau closed this Dec 24, 2025
@danlapid
Collaborator

We should definitely reopen this and send SpanOpen events when the spans open and not when they close.
That is a key goal of Streaming Tail Workers compared to Buffered Tail Workers, we should not lose sight of that.
The rename does not relate to this.
OTEL officially supports an UpdateName message (https://opentelemetry.io/docs/specs/otel/trace/api/#updatename) which can be emitted at any point between the SpanOpen and the SpanClose to rename the span.
That's what we should also build into the Streaming Tail Workers protocol.
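
To make the rename idea concrete, here is a minimal sketch in standard C++ of a consumer that replays a stream containing open, rename and close events. The `SpanEvent` struct, its `Kind` values and the `finalNames()` helper are assumptions made up for this illustration; they are not the Streaming Tail Workers protocol or any workerd API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical event shape; the real Streaming Tail Workers protocol differs.
struct SpanEvent {
  enum class Kind { Open, UpdateName, Close };
  Kind kind;
  uint64_t spanId;
  std::string name;  // used by Open and UpdateName, ignored for Close
};

// Replays a stream and returns the names spans had when they closed, in close
// order. An UpdateName for a span that is not currently open is ignored.
std::vector<std::string> finalNames(const std::vector<SpanEvent>& stream) {
  std::map<uint64_t, std::string> open;
  std::vector<std::string> closed;
  for (const SpanEvent& ev : stream) {
    switch (ev.kind) {
      case SpanEvent::Kind::Open:
        open[ev.spanId] = ev.name;
        break;
      case SpanEvent::Kind::UpdateName: {
        auto it = open.find(ev.spanId);
        if (it != open.end()) it->second = ev.name;
        break;
      }
      case SpanEvent::Kind::Close: {
        auto it = open.find(ev.spanId);
        if (it != open.end()) {
          closed.push_back(it->second);
          open.erase(it);
        }
        break;
      }
    }
  }
  return closed;
}
```

Under these assumptions, sending the open event early does not prevent renaming: the consumer simply keeps the latest name until the close event arrives.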

@fhanau fhanau reopened this Dec 26, 2025
@fhanau fhanau marked this pull request as draft December 26, 2025 15:24
@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch 3 times, most recently from 7db22e8 to c73e84d Compare February 9, 2026 19:19
@fhanau fhanau changed the title [o11y] Report SpanOpen event earlier EW-9372 EW-9455 [o11y] Report SpanOpen event earlier Feb 10, 2026
@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch from 6d91a06 to 538c76c Compare February 10, 2026 15:07
@github-actions

github-actions bot commented Feb 10, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@codecov-commenter

codecov-commenter commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 31.37255% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.69%. Comparing base (ebeeae2) to head (a0e61e3).
⚠️ Report is 1 commits behind head on main.

Files with missing lines         Patch %   Lines
src/workerd/io/tracer.c++        37.09%    30 Missing and 9 partials ⚠️
src/workerd/io/trace.c++         6.89%     27 Missing ⚠️
src/workerd/server/server.c++    63.63%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5370      +/-   ##
==========================================
- Coverage   70.74%   70.69%   -0.05%     
==========================================
  Files         420      420              
  Lines      112938   113018      +80     
  Branches    18517    18533      +16     
==========================================
+ Hits        79894    79902       +8     
- Misses      22007    22075      +68     
- Partials    11037    11041       +4     

@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch from 538c76c to c5c817f Compare February 10, 2026 23:14
@fhanau
Contributor Author

fhanau commented Feb 10, 2026

> Codecov Report
>
> ❌ Patch coverage is 39.39394% with 80 lines in your changes missing coverage. Please review. ✅ Project coverage is 70.29%. Comparing base (d2c9058) to head (538c76c). ⚠️ Report is 1 commits behind head on main.

The coverage percentage appears lower than it should be here as some functions are only used upstream/not used at all until the follow-up PR. I'm convinced that our coverage doesn't actually get worse when taking that into account.

@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch from c5c817f to cc51938 Compare February 10, 2026 23:35
@fhanau fhanau marked this pull request as ready for review February 10, 2026 23:59
@fhanau fhanau requested a review from mar-cf February 10, 2026 23:59
@mar-cf
Contributor

mar-cf commented Feb 14, 2026

In the downstream changes, I see some refused implementations which suggests the interface is wrong. I think moving to an event driven interface could help.

I'm assuming most of the test changes are quirks of the testing harnesses, ordering and attribute association, rather than actually functional changes.

@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch from cc51938 to c302bb5 Compare February 17, 2026 21:51
@fhanau
Contributor Author

fhanau commented Feb 17, 2026

> In the downstream changes, I see some refused implementations which suggests the interface is wrong. I think moving to an event driven interface could help.

I believe what you're referring to are KJ_UNIMPLEMENTED or failing asserts being used for addSpan() and submitSpan() implementations downstream and in the follow-up PR? These functions are marked as such for different reasons; let's look at them individually:

  • submitSpan() is intended to be used only when complete spans are being submitted (as with the internal tracing system), while submitSpanEnd() is to be used with user tracing. Since we still need to support the internal format and submitSpanEnd() takes almost the same parameters as submitSpan(), we can just have submitSpan() for both formats; the SpanSubmitter implementation will handle both of these appropriately. The only wrinkle is that the user tracing submitSpan() implementation then has an unused parameter, but that's still cleaner than having unused functions.
    TL;DR I agree that this can be improved, pushed changes to this and the other PRs so that we don't create submitSpanEnd(). submitSpan() continues to be used and doesn't end up being deprecated anywhere
  • For addSpan(), we are getting rid of that function everywhere except for the RPC server class in the downstream PR. In this case, we need to have separate addSpan() and addSpanEnd() calls and can't merge them (as is the case with submitSpan()) since the RPC stream needs to be backwards compatible and we accordingly need to support adding a full span and just the span end information at the same time. After that change has rolled out and addSpanEnd() is supported everywhere, we can simplify the code so that addSpan() is no longer needed. Since addSpan() will still be in the capnp implementation, it will continue to exist in the RPC server class indefinitely. While this function should never be called, I'm choosing to still implement it instead of relying on the virtual function in the capnp base class so that we will get a descriptive error message in case someone accidentally calls it. Note that outside of that class, we're already getting rid of addSpan().
    TL;DR we can't replace or get rid of one instance of addSpan() due to RPC backwards compatibility requirements, I'm explicitly implementing it instead of relying on the capnp-generated version to get a more descriptive error message.
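
The "explicitly implement it for a better error message" point can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: `GeneratedServerBase` is a made-up stand-in for a capnp-generated server base class, `SketchRpcServer` and `addSpanError()` are hypothetical names, and the string parameters replace the real span types. It models the end state after the rollout, when addSpan() should no longer be called.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a generated server base class: every method has a
// default body that fails with a generic message.
class GeneratedServerBase {
public:
  virtual ~GeneratedServerBase() = default;
  virtual void addSpan(const std::string& /*completeSpan*/) {
    throw std::logic_error("method not implemented");
  }
  virtual void addSpanEnd(const std::string& /*spanEnd*/) {
    throw std::logic_error("method not implemented");
  }
};

class SketchRpcServer final : public GeneratedServerBase {
public:
  // Explicit override purely so that an accidental call explains itself.
  void addSpan(const std::string&) override {
    throw std::logic_error(
        "addSpan() is legacy; send SpanOpen early and call addSpanEnd() instead");
  }
  void addSpanEnd(const std::string& spanEnd) override {
    lastSpanEnd = spanEnd;
  }
  std::string lastSpanEnd;
};

// Calls addSpan() and returns the resulting error text ("" if it succeeded).
std::string addSpanError(GeneratedServerBase& server) {
  try {
    server.addSpan("span");
  } catch (const std::logic_error& e) {
    return e.what();
  }
  return "";
}
```

With the override in place, a stray call reports which method to use instead, whereas relying on the base class would only yield the generic message.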

> I'm assuming most of the test changes are quirks of the testing harnesses, ordering and attribute association, rather than actually functional changes.

Yup, the user-facing change here is that the SpanOpen event is reported earlier, which results in a different order of spans in some cases.
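
The reordering can be illustrated with a small sketch in standard C++. `ScopedSpan`, `EventLog` and `runNested()` are hypothetical names invented for this example, not workerd types. For a parent span with one nested child, recording the open event at open time lists the parent first, while deferring it to close time lists the child first, because children close before their parents.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event stream; not a workerd type.
using EventLog = std::vector<std::string>;

// RAII span that records events into an EventLog. With eager=true the "open"
// event is recorded in the constructor; otherwise it is deferred and recorded
// right before the "close" event in the destructor.
class ScopedSpan {
public:
  ScopedSpan(EventLog& sink, std::string spanName, bool eager)
      : out(sink), name(std::move(spanName)), eagerOpen(eager) {
    if (eagerOpen) out.push_back("open:" + name);
  }
  ~ScopedSpan() {
    if (!eagerOpen) out.push_back("open:" + name);
    out.push_back("close:" + name);
  }
  ScopedSpan(const ScopedSpan&) = delete;
  ScopedSpan& operator=(const ScopedSpan&) = delete;

private:
  EventLog& out;
  std::string name;
  bool eagerOpen;
};

// Opens a parent span with one nested child and returns the recorded events.
EventLog runNested(bool eager) {
  EventLog events;
  {
    ScopedSpan parent(events, "fetch", eager);
    {
      ScopedSpan child(events, "kv_get", eager);
    }
  }
  return events;
}
```

`runNested(true)` yields open:fetch, open:kv_get, close:kv_get, close:fetch, whereas `runNested(false)` yields open:kv_get, close:kv_get, open:fetch, close:fetch.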

@mar-cf
Contributor

mar-cf commented Feb 17, 2026

The recent push I think helped. It still feels we're working around an issue rather than fixing it

     class SpanObserver {
       virtual Own<SpanObserver> newChild() = 0;
       virtual void onOpen(ConstString operationName, Date startTime) = 0;
       virtual void onClose(Date endTime, TagMap& tags) = 0;
       virtual void onUpdateName(ConstString newName) {}
       virtual Date getTime();
     };

Any reason we couldn't do something like that? I haven't considered all the details, but maybe it helps clean up the span time handling and the duplicate/dead code issues. Worth a shot or no?

@fhanau
Contributor Author

fhanau commented Feb 18, 2026

> The recent push I think helped. It still feels we're working around an issue rather than fixing it
>
>      class SpanObserver {
>        virtual Own<SpanObserver> newChild() = 0;
>        virtual void onOpen(ConstString operationName, Date startTime) = 0;
>        virtual void onClose(Date endTime, TagMap& tags) = 0;
>        virtual void onUpdateName(ConstString newName) {}
>        virtual Date getTime();
>      };
>
> Any reason we couldn't do something like that? I haven't considered all the details, but maybe it helps clean up the span time handling and the duplicate/dead code issues. Worth a shot or no?

  1. We're already doing that for the most part: reportStart has the same signature as the onOpen you suggest and is also transmitted at span open time, report corresponds to onClose (unfortunately we can't use the simpler parameters here based on point 2), onUpdateName is not needed just yet, getTime and newChild already exist.
  2. SpanObserver also needs to support the internal tracing system. That is why we still need to pass in a full span in report/onClose, in the internal tracing system the whole span needs to be available at span closing time.
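
A minimal sketch in standard C++ of the arrangement described in these two points: one observer interface, where the user tracing implementation reacts to both calls and the internal tracing implementation treats the start call as a no-op because it receives the full span at close time. `FullSpan`, `SketchSpanObserver`, both observer classes and `runOneSpan()` are simplified, hypothetical stand-ins; the real signatures in workerd differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplified span; the real span type carries more fields.
struct FullSpan {
  std::string name;
  int64_t startNs;
  int64_t endNs;
};

class SketchSpanObserver {
public:
  virtual ~SketchSpanObserver() = default;
  virtual void reportStart(const std::string& name, int64_t startNs) = 0;
  virtual void report(const FullSpan& span) = 0;
};

// User tracing: stream an open event immediately, then an end event.
class UserTracingObserver final : public SketchSpanObserver {
public:
  explicit UserTracingObserver(std::vector<std::string>& sink) : out(sink) {}
  void reportStart(const std::string& name, int64_t) override {
    out.push_back("SpanOpen " + name);
  }
  void report(const FullSpan& span) override {
    out.push_back("SpanEnd " + span.name);
  }

private:
  std::vector<std::string>& out;
};

// Internal tracing: only complete spans are propagated.
class InternalTracingObserver final : public SketchSpanObserver {
public:
  explicit InternalTracingObserver(std::vector<FullSpan>& sink) : out(sink) {}
  // Intentionally a no-op: report() receives the whole span at close time.
  void reportStart(const std::string&, int64_t) override {}
  void report(const FullSpan& span) override { out.push_back(span); }

private:
  std::vector<FullSpan>& out;
};

// Drives one span through an observer the way a span builder might.
void runOneSpan(SketchSpanObserver& observer) {
  observer.reportStart("fetch", 100);
  observer.report(FullSpan{"fetch", 100, 250});
}
```

In this sketch the internal observer stores nothing while the span is open, at the cost of one method that does nothing.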

@mar-cf
Contributor

mar-cf commented Feb 18, 2026

>   1. We're already doing that for the most part: reportStart has the same signature as the onOpen you suggest and is also transmitted at span open time, report corresponds to onClose (unfortunately we can't use the simpler parameters here based on point 2), onUpdateName is not needed just yet, getTime and newChild already exist.
>   2. SpanObserver also needs to support the internal tracing system. That is why we still need to pass in a full span in report/onClose, in the internal tracing system the whole span needs to be available at span closing time.

Internal tracing needs the full span != we must pass the full span. The observer could assemble the span.

onOpen would actually be used by both. newChild is unchanged but included in the sketch as we're not deleting it. onUpdateName was included to show how we'd extend the interface to support it. getTime isn't currently used on both sides of the span time measurement.
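
The "observer could assemble the span" idea can be sketched in standard C++ as below. This is an illustration only: `AssembledSpan`, `SketchObserver` and `AssemblingObserver` are hypothetical names, `std::string`, `int64_t` and `std::map` stand in for ConstString, Date and TagMap, and newChild(), onUpdateName() and getTime() are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using TagMap = std::map<std::string, std::string>;  // stand-in for the real TagMap

// Hypothetical full-span record as an internal tracing backend might want it.
struct AssembledSpan {
  std::string name;
  int64_t startNs;
  int64_t endNs;
  TagMap tags;
};

class SketchObserver {
public:
  virtual ~SketchObserver() = default;
  virtual void onOpen(std::string operationName, int64_t startNs) = 0;
  virtual void onClose(int64_t endNs, TagMap& tags) = 0;
};

// Buffers the open-time data and only produces a span once onClose() arrives.
class AssemblingObserver final : public SketchObserver {
public:
  explicit AssemblingObserver(std::vector<AssembledSpan>& out) : sink(out) {}

  void onOpen(std::string operationName, int64_t startTimeNs) override {
    name = std::move(operationName);
    startNs = startTimeNs;
    opened = true;
  }

  void onClose(int64_t endNs, TagMap& tags) override {
    assert(opened && "onClose() without a preceding onOpen()");
    // Moves the tags out of the caller's map to avoid a copy.
    sink.push_back(AssembledSpan{std::move(name), startNs, endNs, std::move(tags)});
    opened = false;
  }

private:
  std::vector<AssembledSpan>& sink;
  std::string name;
  int64_t startNs = 0;
  bool opened = false;
};
```

The sketch also shows the cost of this design: the observer has to hold the name and start time for every span that is currently open.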

@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch from 5bc70a9 to 30bc3e2 Compare February 20, 2026 04:06
@mar-cf
Contributor

mar-cf commented Feb 20, 2026

If the interface changes are not feasible, could you explain why?

@fhanau
Contributor Author

fhanau commented Feb 23, 2026

> >   1. We're already doing that for the most part: reportStart has the same signature as the onOpen you suggest and is also transmitted at span open time, report corresponds to onClose (unfortunately we can't use the simpler parameters here based on point 2), onUpdateName is not needed just yet, getTime and newChild already exist.
> >   2. SpanObserver also needs to support the internal tracing system. That is why we still need to pass in a full span in report/onClose, in the internal tracing system the whole span needs to be available at span closing time.
>
> Internal tracing needs the full span != we must pass the full span. The observer could assemble the span.
>
> onOpen would actually be used by both. newChild is unchanged but included in the sketch as we're not deleting it. onUpdateName was included to show how we'd extend the interface to support it. getTime isn't currently used on both sides of the span time measurement.

I'm sorry, but after implementing this I still don't see the upside of that approach. See the changes at felix/102125-stw-cleanup-mar-api (in workerd and downstream):

  • Tests are passing for this change, I added the methods you suggested. The only difference is that onClose() continues to take a full span – since we already have that available in SpanBuilder::end(), it doesn't make sense to pass in only the endTime, tags and span logs (for internal tracing) since then we'd be extracting data from the span only to assemble a new span with the same information. It would be possible to implement onClose() as suggested, but it would only increase code complexity.
  • Even with the simpler onClose(), we're not really gaining anything here as you can see on the branches. Nothing is being simplified or gets a cleaner API interface, we merely have more data that needs to be stored for spans that are currently open. I think it's better to have a function that's a no-op (and explaining why) than having the function do something and store data in its class if we can get the same information for free later on.
  • We could make it so that all span submitter implementations also take data at start and end time, instead of taking a whole span. That way, onClose() implementations with the API you suggested could pass through their arguments without having to reassemble a span. For internal tracing, however, we inevitably need to reassemble the full span at some point (preferably before passing the RPC interface, unless we want to add more RPC methods), so the problem just moves to another function.
  • The renaming part (report -> onClose, reportStart -> onOpen) appears more agreeable, but it would cause some confusion as the report() function must remain available over RPC in the downstream project, we'd be calling reportRequest() from onClose() which makes things less consistent.

I'm happy to add additional code comments to explain why the existing approach appears to be right to me (if we only had to support user tracing and not internal tracing here it would be a different story). If I'm missing something here, feel free to edit the branches to show the advantages of the interface you're proposing.

@fhanau
Contributor Author

fhanau commented Feb 23, 2026

Note that I still have to consider the suggested adjustSpanTime change – my impression was that it would not reduce complexity and merely move it around but it'll be easier to evaluate once implemented.

@jasnell
Collaborator

jasnell commented Feb 23, 2026

Overall, I think on this it's likely good to go with the current approach as long as there are no obvious logical flaws. The approach can be iterated on if necessary in follow up PRs.

@mar-cf
Contributor

mar-cf commented Feb 24, 2026

> Tests are passing for this change, I added the methods you suggested. The only difference is that onClose() continues to take a full span – since we already have that available in SpanBuilder::end(), it doesn't make sense to pass in only the endTime, tags and span logs (for internal tracing) since then we'd be extracting data from the span only to assemble a new span with the same information. It would be possible to implement onClose() as suggested, but it would only increase code complexity.

That branch is mostly cosmetic: it renames reportStart -> onOpen and report -> onClose. What I'm proposing is also not "form the span, send part of the span, and form it again". The main substantive changes are to not send the span, assembling it in the observer instead, and to use getTime properly. Can we try that instead?

@mar-cf
Contributor

mar-cf commented Feb 24, 2026

We're missing the forest for the trees. The core issue isn't renaming methods. I think the right path is to narrow onClose to end only data, properly call getTime and have the internal observer assemble spans.

@jasnell I'd rather we properly explore this than keep building around the current interface. The workarounds are expanding, the proposed change isn't large and follow up PRs rarely happen.

@fhanau
Contributor Author

fhanau commented Feb 24, 2026

I stand by what I said earlier, happy to discuss on Wednesday if needed.
As mentioned, that branch does not include the getTime() changes which I have now implemented at felix/102125-stw-cleanup-mar-getTime. That commit clocks in at 77 insertions, 128 deletions, but if we take into account that we're removing both adjustSpanTime() methods now instead of keeping one and removing one in the follow-up, we'd be looking at +2 lines in the final version if my math is right. I think it's debatable whether that version is cleaner than the current version, both have aspects that are rather ugly. There's probably other ways to do this too, lmk if you can think of ways to make this cleaner. One issue with this approach is that we'd also have to implement getTime() in the internal span submitter class for user tracing, which would be 50 additional lines unless we find a way to share the code (likely ugly again).
Overall I don't think this is making things much better or worse, happy to discuss later.

@fhanau
Contributor Author

fhanau commented Mar 4, 2026

The follow-up PRs have been updated to include changes from mar/spanapi, including the event-based API. That should unblock reviews for the initial step.

@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch from 30bc3e2 to e326304 Compare March 9, 2026 03:17
@mar-cf
Contributor

mar-cf commented Mar 18, 2026

Looking at the end state in felix/102125-stw-cleanup-p2:

Both systems use the interface but each only uses half of it - they haven't been updated to use the same methods. What actually changed here?

Either split the interface into two separate abstractions or update both to use the same one. I'd recommend the latter.

Contributor

@mar-cf mar-cf left a comment

Apologies, I was looking at a stale commit. I see it was updated now.

LGTM!

This PR serves to perform two long-standing cleanup tasks in the STW
implementation:
1) Sending the SpanOpen event as soon as a span is opened instead of when it
   closes
2) Getting rid of the CompleteSpan struct, which represents a full span but is
   something that won't be needed once SpanOpen is handled separately.

To implement this in a backwards-compatible way, we need to land it in two parts
so that the old code path and the new code path are both supported until we have
phased out the old version which doesn't have the APIs for handling SpanOpen
separately.
@fhanau fhanau force-pushed the felix/102125-stw-cleanup branch from e326304 to a0e61e3 Compare March 18, 2026 18:26
@fhanau fhanau changed the title EW-9372 EW-9455 [o11y] Report SpanOpen event earlier EW-9372 EW-9455 [o11y] Prepare reporting SpanOpen event earlier Mar 18, 2026
@fhanau fhanau merged commit 71cce94 into main Mar 18, 2026
22 of 23 checks passed
@fhanau fhanau deleted the felix/102125-stw-cleanup branch March 18, 2026 19:36