Merge articles spread on multiple pages #8

Open
fhamborg opened this issue Dec 18, 2016 · 2 comments

Comments

@fhamborg

Example: http://www.zeit.de/2016/18/ttip-barack-obama-hannover-usa-widerstand The given URL shows only the first part of the article. A human reader can either follow a link to the second page or click "Auf einer Seite lesen" ("Read on one page") to see the whole article on a single page.

What does the current workflow output in this case? Ideally, the multiple pages would be identified and crawled as a single article. However, since this requires actually processing the article, I expect the system to crawl this article as two articles.
If so, is there an easy way to identify (e.g., during the article extraction performed by the km4 team) that two or more crawled articles actually belong to one?

Answer:

It depends on the crawler:

The sitemap and RSS crawlers only find pages that are listed in the corresponding files. Thus, they only find the listed URLs, which might be the first page, all pages, the single-page version of the entire article, or a combination of these.

The recursive crawlers, on the other hand, will find all pages as well as the single-page version of the entire article and, if the heuristics work for them, will save all of them.

For the latter, a possible way to identify whether articles belong together is to search for common text parts, since every page should be contained in the entire article.
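The common-text idea could be sketched roughly as follows. This is only an illustration, not code from the project: the function names, the 40-character minimum paragraph length and the 0.8 threshold are all assumptions, and it presumes the article text has already been extracted as plain text with blank lines between paragraphs.

```python
import re


def paragraphs(text, min_length=40):
    """Split text into normalized paragraphs, dropping very short ones.

    min_length is an arbitrary cut-off (assumption) meant to ignore
    bylines, captions and similar fragments.
    """
    parts = re.split(r"\n\s*\n", text)
    normalized = (" ".join(part.split()).lower() for part in parts)
    return {p for p in normalized if len(p) >= min_length}


def overlap_ratio(page_text, full_text):
    """Fraction of the page's paragraphs that also occur in full_text."""
    page = paragraphs(page_text)
    if not page:
        return 0.0
    return len(page & paragraphs(full_text)) / len(page)


def is_part_of(page_text, full_text, threshold=0.8):
    """Guess whether page_text is one page of the article in full_text."""
    return overlap_ratio(page_text, full_text) >= threshold
```

Exact paragraph matching is the simplest option; if the single-page version differs slightly from the paginated pages (e.g., inserted ads or teasers), a fuzzy comparison would probably be needed instead.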

For both crawler types, it would be possible to extract URLs that are marked with keywords such as "continue reading" or "page x".
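A minimal sketch of the keyword approach, using only the Python standard library. It assumes the keywords appear in the visible link text; the keyword pattern, class name and function name are made up for illustration and would need tuning per site and language.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Illustrative keyword list (assumption); extend per site and language.
PAGINATION_PATTERN = re.compile(
    r"continue reading|next page|page\s*\d+|auf einer seite lesen",
    re.IGNORECASE,
)


class PaginationLinkParser(HTMLParser):
    """Collect <a> elements whose link text matches a pagination keyword."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []    # list of (link text, absolute URL)
        self._href = None  # href of the <a> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split())
            if PAGINATION_PATTERN.search(label):
                self.links.append((label, urljoin(self.base_url, self._href)))
            self._href = None


def find_pagination_links(html, base_url):
    """Return (link text, absolute URL) pairs that look like pagination."""
    parser = PaginationLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Keyword matching is cheap but noisy: a phrase like "page 3" can also occur in unrelated navigation, so the resulting URLs are candidates rather than confirmed continuation pages.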

@fhamborg

SEO-compliant pages implement link rel="next" and rel="prev"; see https://support.google.com/webmasters/answer/1663744

@fhamborg

Do nothing. Paginated content is very common, and Google does a good job returning the most relevant results to users, regardless of whether content is divided into multiple pages.
Specify a View All page. Searchers commonly prefer to view a whole article or category on a single page. Therefore, if we think this is what the searcher is looking for, we try to show the View All page in search results. You can also add a rel="canonical" link to the component pages to tell Google that the View All version is the version you want to appear in search results.
Use rel="next" and rel="prev" links to indicate the relationship between component URLs. This markup provides a strong hint to Google that you would like us to treat these pages as a logical sequence, thus consolidating their linking properties and usually sending searchers to the first page.
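For pages that do carry this markup, reading it out could look roughly like the sketch below. It is an assumption-laden illustration, not part of the crawler: the class and function names are hypothetical, and it only looks at <link> elements with rel values "next", "prev" or "canonical".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# rel values of interest for pagination (assumption for this sketch).
PAGINATION_RELS = ("next", "prev", "canonical")


class RelLinkParser(HTMLParser):
    """Collect the first rel="next", "prev" and "canonical" <link> targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rel_links = {}  # rel value -> absolute URL

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated values.
        for rel in (attributes.get("rel") or "").lower().split():
            if rel in PAGINATION_RELS:
                self.rel_links.setdefault(rel, urljoin(self.base_url, href))


def find_rel_links(html, base_url):
    """Return a dict such as {"next": url, "canonical": url} for one page."""
    parser = RelLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.rel_links
```

Following the "next" links from the first page would yield the page sequence, while a "canonical" link shared by all component pages would point to the View All version, which could then be stored as the single article.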
