
Summary data in all.requests issues #257

Open

tunetheweb opened this issue Mar 25, 2024 · 1 comment

tunetheweb (Member)

Found this out while looking at the combined pipeline issues.

The all pipeline has the following issues:

  • The summary data is null for 404s and other errors, even though those requests have summary data in the legacy summary_requests table.
  • It doesn't set firstReq and firstHtml correctly (they are always set to true).

This is because we call the summary code per request here:

    summary_request = None
    try:
        status_info = HarJsonToSummary.initialize_status_info(file_name, page)
        summary_request, _, _, _ = HarJsonToSummary.summarize_entry(
            request, "", "", 0, status_info
        )

That code was intended to be called in one pass over all of a page's requests, since it tracks first_url and first_html_url across calls:

    first_req = False
    first_html = False
    if not first_url:
        if (400 <= status <= 599) or 12000 <= status:
            logging.warning(
                f"The first request ({url}) failed with status {status}. status_info={status_info}"
            )
            return None, None, None, None
        # This is the first URL found associated with the page - assume it's the base URL.
        first_req = True
        first_url = url
    if not first_html_url:
        # This is the first URL found associated with the page that's HTML.
        first_html = True
        first_html_url = url
    ret_request.update({"firstReq": first_req, "firstHtml": first_html})
    return ret_request, first_url, first_html_url, entry_number

You basically need to generate the whole page and all of its requests in one go, and then look up this summary_requests array for each request:

    try:
        _, requests = HarJsonToSummary.generate_pages(file_name, har)
    except Exception:
        logging.exception(
            f"Unable to unpack HAR, check previous logs for detailed errors. "
            f"{file_name=}, {har=}"
        )
        return None

    summary_requests = []

    for request in requests:
        try:
            # Trim each request down to the fields in the summary_requests schema.
            wanted_summary_fields = [
                field["name"]
                for field in constants.BIGQUERY["schemas"]["summary_requests"]["fields"]
            ]

            request = utils.dict_subset(request, wanted_summary_fields)
        except Exception:
            logging.exception(
                f"Unable to extract summary fields for a request. "
                f"{file_name=}, {har=}"
            )
            continue

        if request:
            summary_requests.append(request)
pmeenan (Member) commented May 21, 2024

This should be fixed in the streaming writes from the agent for the next crawl: HTTPArchive/wptagent@53189db
