🐛 Resource Timings are inaccurate when ServiceWorker is involved #2566

huwyca · 2024-01-11T17:24:09Z

Describe the bug

NOTE: This was discussed and shown at length in a call with support on 12/12/2023 with traeger.meyer@datadoghq.com and shayne.sebro@datadoghq.com. There should be a recording available of that call. Datadog 1345744 // 1963 - RUM Resource Timing - Customer call

Service worker based resource network requests are being reported improperly by browser-sdk rum resourceCollection. This causes the duration to be incorrect as well as causing other fields to be missing from the RUM entry.

Browser: Chrome (Firefox and Safari don't seem to support workerStart yet)
BrowserSDK versions affected: Seems to affect multiple major versions including v5 and v4

This seems to happen mainly due to requestStart being set to workerStart and those falling before the connectionStart/connectionEnd timings when a ServiceWorker. I notice it most commonly when a service worker is intercepting a request over http2 re-using an existing TCP connection to a domain.

When a service worker is involved in a network request it changes the fields of the PerformanceResourceTiming.
In addition to setting the workerStart property to be non-zero it also changes the requestStart time to be set to the workerStart property. This causes a logic check to fail in the resourceUtils.ts#toValidEntry function to fail due to the entry.requestStart time falling before the other timings in the areInOrder check.

browser-sdk/packages/rum-core/src/domain/resource/resourceUtils.ts

Lines 138 to 157 in 43c4456

    
           export function toValidEntry(entry: RumPerformanceResourceTiming) { 
        
             // Ensure timings are in the right order. On top of filtering out potential invalid 
        
             // RumPerformanceResourceTiming, it will ignore entries from requests where timings cannot be 
        
             // collected, for example cross origin requests without a "Timing-Allow-Origin" header allowing 
        
             // it. 
        
             if ( 
        
               !areInOrder( 
        
                 entry.startTime, 
        
                 entry.fetchStart, 
        
                 entry.domainLookupStart, 
        
                 entry.domainLookupEnd, 
        
                 entry.connectStart, 
        
                 entry.connectEnd, 
        
                 entry.requestStart, 
        
                 entry.responseStart, 
        
                 entry.responseEnd 
        
               ) 
        
             ) { 
        
               return undefined 
        
             }

This causes the matchRequestTiming function to not find a correct matching timing due to it filtering out the timing due to filter(toValidEntry) removing the timings from service workers.

browser-sdk/packages/rum-core/src/domain/resource/matchRequestTiming.ts

Lines 24 to 37 in 43c4456

    
           export function matchRequestTiming(request: RequestCompleteEvent) { 
        
             if (!performance || !('getEntriesByName' in performance)) { 
        
               return 
        
             } 
        
             const sameNameEntries = performance.getEntriesByName(request.url, 'resource') 
        
             if (!sameNameEntries.length || !('toJSON' in sameNameEntries[0])) { 
        
               return 
        
             } 
        
             const candidates = sameNameEntries 
        
               .map((entry) => entry.toJSON() as RumPerformanceResourceTiming) 
        
               .filter(toValidEntry) 
        
               .filter((entry) =>

The matchingTiming being set to undefined from matchRequestTiming then results in not calculating any correspondingTimingOverrides with computePerformanceEntryMetrics. This has a couple of knock on effects:

Due to not having the correct entry metrics the duration being reported in the RumEvent is the timestamp of when the datadog rum XHR onLoad handler fires instead of when the resource finished.

browser-sdk/packages/rum-core/src/domain/resource/resourceCollection.ts

Lines 76 to 78 in 43c4456

    
           const correspondingTimingOverrides = matchingTiming ? computePerformanceEntryMetrics(matchingTiming) : undefined 
        
           const duration = computeRequestDuration(pageStateHistory, startClocks, request.duration)

This makes the duration be susceptible to delays on the event loop which can cause a large difference in the reported time from the actual when the event loop is busy and a network resource is waiting to fire its onLoad handlers.

This also results in the various fields from computePerformanceResourceDetails not being included in the RumEvent due to not being called from computePerformanceEntryMetrics.

browser-sdk/packages/rum-core/src/domain/resource/resourceUtils.ts

Lines 88 to 136 in 43c4456

    
           export function computePerformanceResourceDetails( 
        
             entry: RumPerformanceResourceTiming 
        
           ): PerformanceResourceDetails | undefined { 
        
             const validEntry = toValidEntry(entry) 
        
             if (!validEntry) { 
        
               return undefined 
        
             } 
        
             const { 
        
               startTime, 
        
               fetchStart, 
        
               redirectStart, 
        
               redirectEnd, 
        
               domainLookupStart, 
        
               domainLookupEnd, 
        
               connectStart, 
        
               secureConnectionStart, 
        
               connectEnd, 
        
               requestStart, 
        
               responseStart, 
        
               responseEnd, 
        
             } = validEntry 
        
             const details: PerformanceResourceDetails = { 
        
               download: formatTiming(startTime, responseStart, responseEnd), 
        
               first_byte: formatTiming(startTime, requestStart, responseStart), 
        
             } 
        
             // Make sure a connection occurred 
        
             if (connectEnd !== fetchStart) { 
        
               details.connect = formatTiming(startTime, connectStart, connectEnd) 
        
               // Make sure a secure connection occurred 
        
               if (areInOrder(connectStart, secureConnectionStart, connectEnd)) { 
        
                 details.ssl = formatTiming(startTime, secureConnectionStart, connectEnd) 
        
               } 
        
             } 
        
             // Make sure a domain lookup occurred 
        
             if (domainLookupEnd !== fetchStart) { 
        
               details.dns = formatTiming(startTime, domainLookupStart, domainLookupEnd) 
        
             } 
        
             if (hasRedirection(entry)) { 
        
               details.redirect = formatTiming(startTime, redirectStart, redirectEnd) 
        
             } 
        
             return details 
        
           }

To Reproduce
Steps to reproduce the behavior:

I did a reproduction on a call with the support team that was recorded. Included below is a basic reproduction case but can be slightly annoying to setup and show from scratch. I'm happy to jump on a call showing a reproduction.

Open Chrome (firefox and safari don't have workerStart support yet) to a page with DD browser sdk running and that has a service worker registered for resources.
Turn on CPU throttling in the Devtools > Performance > Settings (click the gear icon on the Performance tab)
Take a profile while doing something that makes network requests for resources). This step is just to show that the reported duration will line up with the XHR onLoad handler time. You can view the network panel and check for the gear icon beside the network resource name for a request to see if a service worker was responsible.
If you log the events in the beforeSend callback you can view them easily. Otherwise you can view them in the DD rum explorer. If the workerStart is set the requestStart has been set to its value. Might take a few tries but should be able to see that the RumEvents are missing computePerformanceResourceDetails fields and have an incorrect duration.

Expected behavior
RumEvents should be reported correctly when a service worker is involved. To fix this you can make a change to the toValidEntry function to check if the entry.workerStart field is non-zero. If so then use a different conditional check to verify fields are valid and in proper order.

Related useful information / links:

Spec Links:

ResourceTiming - https://w3c.github.io/resource-timing/#dom-performanceresourcetiming-requeststart
NavigationTiming - https://www.w3.org/TR/navigation-timing-2/#PerformanceResourceTiming
ServiceWorker - https://www.w3.org/TR/service-workers/#service-worker-timing
Fetch - https://fetch.spec.whatwg.org/#http-fetch

W3C workerStart spec questions

w3c PR where they talk about how workerStart is interacting with requestStart w3c/resource-timing#119
workerStart and redirects - w3c/navigation-timing#131
whatwg/fetch#1413

WPT Updates

Fetch, ResourceTiming and ServiceWorkers - web-platform-tests/wpt#31229

The text was updated successfully, but these errors were encountered:

BenoitZugmeyer · 2024-01-24T10:27:24Z

Thank you for your thorough report! We are looking into it. We'll keep you updated.

BenoitZugmeyer · 2024-02-05T14:47:00Z

We are still figuring how to best handle this case. In the meantime I opened a Chromium issue: https://issues.chromium.org/issues/323703325

Before this commit, we dropped any `PerformanceResourceEntry` that had inconsistent timings. With this commit, when the experimental feature "tolerant_resource_timings" is enabled, the various timings are validated independently. This should improve the situation on Chrome when a Service Worker is involved. In this case, most of the time, connection timings are unexpectedly set after the `requestStart` timing (see Chromium issue[1]). Before this change, the whole `PerformanceResourceEntry` would be ignored, and no timings would be included. With this change (and the flag enabled), relevant timings will be correctly defined. Related issue: #2566 [1]: https://issues.chromium.org/issues/323703325

huwyca added the bug Something isn't working label Jan 11, 2024

BenoitZugmeyer mentioned this issue Apr 3, 2024

🐛 [RUM-2782] remove buggy redirect timing estimation based on fetchStart #2683

Merged

4 tasks

BenoitZugmeyer closed this as completed in #2683 Apr 4, 2024

BenoitZugmeyer reopened this Apr 4, 2024

BenoitZugmeyer mentioned this issue Apr 4, 2024

⚗ [RUM-2782] validate resource timings more granularly #2688

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Resource Timings are inaccurate when ServiceWorker is involved #2566

🐛 Resource Timings are inaccurate when ServiceWorker is involved #2566

huwyca commented Jan 11, 2024 •

edited

BenoitZugmeyer commented Jan 24, 2024

BenoitZugmeyer commented Feb 5, 2024

🐛 Resource Timings are inaccurate when ServiceWorker is involved #2566

🐛 Resource Timings are inaccurate when ServiceWorker is involved #2566

Comments

huwyca commented Jan 11, 2024 • edited

Related useful information / links:

Spec Links:

W3C workerStart spec questions

WPT Updates

BenoitZugmeyer commented Jan 24, 2024

BenoitZugmeyer commented Feb 5, 2024

huwyca commented Jan 11, 2024 •

edited