Merge articles spread on multiple pages #8

fhamborg · 2016-12-18T17:28:17Z

Example: http://www.zeit.de/2016/18/ttip-barack-obama-hannover-usa-widerstand At the given URL, only the first part of the article is shown. A (human) reader can either follow a link to the second page or click "Auf einer Seite lesen" ("Read on one page") to view the whole article on a single page.

What will the current workflow output? Ideally, of course, multiple pages should be identified and crawled as a single article. However, since this requires actually processing the article, I expect the system to crawl this article as two articles.
If so, is there an easy way to identify (e.g., during the article extraction performed by the km4 team) that two or more crawled articles actually belong to one?

Answer:

It depends on the crawler:

The sitemap and RSS crawlers only find pages that are listed in the corresponding files. Thus, they only find the listed version of the article, which might be the first page, all pages, the single-page version of the entire article, or a combination of these.

The recursive crawlers, on the other hand, will find all individual pages as well as the single-page version of the entire article and, if the heuristics work for those, will save all of them.

For the latter, a possible way to identify whether articles belong together is to search for common text passages, since every individual page should be contained in the single-page version of the entire article.
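
The common-text idea could be sketched roughly as follows. This is only an illustration, not part of the current crawler: the function names, the split on blank lines, and the 0.8 threshold are all assumptions, and it presumes the article text has already been extracted as plain text.

```python
def paragraphs(text):
    """Split plain text on blank lines and normalize whitespace (assumed format)."""
    return {" ".join(p.split()) for p in text.split("\n\n") if p.strip()}


def page_coverage(page_text, full_text):
    """Fraction of the page's paragraphs that also occur in the full article."""
    page = paragraphs(page_text)
    if not page:
        return 0.0
    return len(page & paragraphs(full_text)) / len(page)


def belongs_to(page_text, full_text, threshold=0.8):
    """Hypothetical check: treat the page as part of the full article
    if most of its paragraphs reappear there. The threshold is a guess."""
    return page_coverage(page_text, full_text) >= threshold
```

Exact paragraph matching is deliberately strict; if the single-page version differs slightly in markup or wording, a fuzzy comparison would probably be needed instead.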

For both, it would be possible to extract the URLs of links labelled with keywords such as "continue reading" or "page x".
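
A minimal sketch of the keyword approach, using only the Python standard library. The keyword pattern, the class and function names, and the example.com URLs are assumptions made for illustration; a real list of keywords would have to be collected per language and site.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed keyword list; extend as needed.
PAGINATION_PATTERN = re.compile(
    r"continue reading|next page|page \d+|auf einer seite lesen",
    re.IGNORECASE,
)


class AnchorCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def find_pagination_links(html, base_url):
    """Return absolute URLs of links whose text matches a pagination keyword."""
    collector = AnchorCollector()
    collector.feed(html)
    return [
        urljoin(base_url, href)
        for href, text in collector.links
        if PAGINATION_PATTERN.search(text)
    ]
```

For example, with made-up markup:

```python
html = '<a href="/a/page-2">Page 2</a> <a href="/about">About us</a>'
find_pagination_links(html, "http://example.com/a")
```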

fhamborg · 2017-07-16T12:48:43Z

SEO-compliant pages implement link rel="next" and rel="prev"; see https://support.google.com/webmasters/answer/1663744

fhamborg · 2017-10-23T13:35:04Z

Do nothing. Paginated content is very common, and Google does a good job returning the most relevant results to users, regardless of whether content is divided into multiple pages.
Specify a View All page. Searchers commonly prefer to view a whole article or category on a single page. Therefore, if we think this is what the searcher is looking for, we try to show the View All page in search results. You can also add a rel="canonical" link to the component pages to tell Google that the View All version is the version you want to appear in search results.
Use rel="next" and rel="prev" links to indicate the relationship between component URLs. This markup provides a strong hint to Google that you would like us to treat these pages as a logical sequence, thus consolidating their linking properties and usually sending searchers to the first page.
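
If a site does provide this markup, the hints could be read from the page head roughly like this. It is a sketch under assumptions: the class and function names are hypothetical, only the Python standard library is used, and the example.com URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelLinkCollector(HTMLParser):
    """Collects the href of <link> elements with rel next, prev or canonical."""

    WANTED = {"next", "prev", "canonical"}

    def __init__(self):
        super().__init__()
        self.rels = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several space-separated values.
        for rel in (attributes.get("rel") or "").lower().split():
            if rel in self.WANTED and href:
                # Keep the first occurrence of each relation.
                self.rels.setdefault(rel, href)


def pagination_hints(html, base_url):
    """Return a dict such as {"prev": ..., "next": ...} with absolute URLs."""
    collector = RelLinkCollector()
    collector.feed(html)
    return {rel: urljoin(base_url, href) for rel, href in collector.rels.items()}
```

A crawler could follow "next" until it is absent to collect all component pages, or prefer the "canonical" target when it points to a View All page. Whether a given news site actually sets these links would have to be checked per site.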

fhamborg added the help wanted label Mar 31, 2019
