Urls without www are not handled by extractors where domains have www in url #744

Open
hwo411 opened this issue May 31, 2023 · 0 comments

hwo411 commented May 31, 2023

Hi everyone, we recently discovered an issue in our system: some URLs were parsed without www, and therefore the custom extractor for that source wasn't used. To fix this, we either need to submit all custom extractors without www or allow searching for www extractors in addition to hostname and base hostname extractors.

Expected Behavior

Commands

postlight-parser https://www.newyorker.com/culture/annals-of-inquiry/the-case-for-free-range-lab-mice

and

postlight-parser https://newyorker.com/culture/annals-of-inquiry/the-case-for-free-range-lab-mice

should produce the same result.

Current Behavior

In case of

postlight-parser https://newyorker.com/culture/annals-of-inquiry/the-case-for-free-range-lab-mice

the custom extractor is not used and the body has only 1949 words instead of 3950.

Steps to Reproduce

Run postlight-parser https://newyorker.com/culture/annals-of-inquiry/the-case-for-free-range-lab-mice and inspect the content and word_count fields.

Detailed Description

Because the custom extractor is not used, the parser returns an incomplete body.

Possible Solution

Either rename all extractor folders to drop the www. prefix and set their domains without www., or allow getExtractor to also check for extractors under www. + hostname and www. + base hostname.

I'm not sure which option is better for the parser. I'd rather go with the first one, though it might be error-prone; the second one is less error-prone.
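
To make the second option concrete, here is a rough sketch of what the fallback lookup could look like. It is not the parser's actual code: the registry, candidateHosts and findExtractor are hypothetical names made up for illustration, and the base hostname is approximated as the last two labels of the hostname, which is an assumption that breaks for suffixes like co.uk.

```javascript
// Hypothetical registry keyed by domain, standing in for the parser's
// custom extractors. The single entry is illustrative only.
const extractors = new Map([
  ['www.newyorker.com', { domain: 'www.newyorker.com' }],
]);

// List the registry keys to try, in order: exact hostname, base hostname,
// then the www. variants of both.
function candidateHosts(hostname) {
  // Assumption: base hostname = last two labels (wrong for e.g. co.uk).
  const baseDomain = hostname.split('.').slice(-2).join('.');
  const hosts = [hostname, baseDomain];
  for (const host of [hostname, baseDomain]) {
    if (!host.startsWith('www.')) hosts.push(`www.${host}`);
  }
  return [...new Set(hosts)]; // drop duplicates, keep insertion order
}

// Return the first registered extractor among the candidates, or null so
// the caller can fall back to generic extraction.
function findExtractor(url, registry = extractors) {
  const { hostname } = new URL(url);
  for (const host of candidateHosts(hostname)) {
    if (registry.has(host)) return registry.get(host);
  }
  return null;
}
```

With a lookup like this, both the www and the non-www New Yorker URLs from the commands above would resolve to the same extractor, without renaming any existing extractor folders.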

hwo411 changed the title from "Remove www from custom extractor domains or treat domain without www as www" to "Urls without www are not handled by extractors where domains have www in url" on May 31, 2023