
Summary data in all.requests issues #257

Open

tunetheweb opened this issue Mar 25, 2024 · 1 comment

tunetheweb (Member)

Found this out while looking at the combined pipeline issues.

The all pipeline has the following issues:

  • The summary data is null for 404s and other errors, even though those requests have summary data in the legacy summary_requests table.
  • It doesn't set firstReq and firstHtml correctly (they are always set to true).

This is because we call the summary code per request here:

    summary_request = None
    try:
        status_info = HarJsonToSummary.initialize_status_info(file_name, page)
        summary_request, _, _, _ = HarJsonToSummary.summarize_entry(
            request, "", "", 0, status_info
        )

That code was intended to be called in one pass over all of a page's requests, since it tracks first_url and first_html_url across calls:

    first_req = False
    first_html = False
    if not first_url:
        if (400 <= status <= 599) or 12000 <= status:
            logging.warning(
                f"The first request ({url}) failed with status {status}. status_info={status_info}"
            )
            return None, None, None, None
        # This is the first URL found associated with the page - assume it's the base URL.
        first_req = True
        first_url = url
    if not first_html_url:
        # This is the first URL found associated with the page that's HTML.
        first_html = True
        first_html_url = url
    ret_request.update({"firstReq": first_req, "firstHtml": first_html})
    return ret_request, first_url, first_html_url, entry_number

You basically need to generate the whole page and all of its requests in one go, and then look up this summary_requests array for each request:

    try:
        _, requests = HarJsonToSummary.generate_pages(file_name, har)
    except Exception:
        logging.exception(
            f"Unable to unpack HAR, check previous logs for detailed errors. "
            f"{file_name=}, {har=}"
        )
        return None

    summary_requests = []

    for request in requests:
        try:
            # Trim each request down to the fields in the summary_requests schema.
            wanted_summary_fields = [
                field["name"]
                for field in constants.BIGQUERY["schemas"]["summary_requests"]["fields"]
            ]

            request = utils.dict_subset(request, wanted_summary_fields)
        except Exception:
            logging.exception(
                f"Unable to extract summary fields for a request. "
                f"{file_name=}, {har=}"
            )
            continue

        if request:
            summary_requests.append(request)
pmeenan (Member) commented May 21, 2024

This should be fixed in the streaming writes from the agent for the next crawl: HTTPArchive/wptagent@53189db
