Lakshya Singh edited this page Apr 6, 2021 · 1 revision

Document files contain a series of Wikipedia articles, each represented by an XML <tt>doc</tt> element: ... ... ... ...

The element <tt>doc</tt> has the following attributes:

  • <tt>id</tt>, which identifies the document by means of a unique serial number.
  • <tt>url</tt>, which provides the URL of the original Wikipedia page.

The content of a <tt>doc</tt> element consists of pure text, one sentence per line.

Here is an example of a <tt>doc</tt> element:

<doc id="..." url="...">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>

Notice that because of Wikipedia conventions, the first sentence is the title of the article.
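
The format described above can be read with a few lines of Python. The following is only a minimal sketch, not part of Wikipedia Extractor itself: the function name `iter_docs`, the assumption that `id` appears before `url` in the opening tag, and the sample `id` and `url` values are all hypothetical. It scans line by line rather than using an XML parser, on the assumption that article text may contain characters that are not valid XML.

```python
import re
from typing import Dict, Iterable, Iterator

# Assumed shape of the opening tag: <doc id="..." url="..."> (id first, then url).
DOC_OPEN = re.compile(r'<doc\s+id="(?P<id>[^"]*)"\s+url="(?P<url>[^"]*)"[^>]*>')


def iter_docs(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per <doc> element with id, url, title and sentences."""
    current = None
    for raw in lines:
        line = raw.strip()
        match = DOC_OPEN.match(line)
        if match:
            # Start of a new article.
            current = {"id": match.group("id"), "url": match.group("url")}
            body = []
        elif line == "</doc>" and current is not None:
            # The first sentence is the title; the rest is the article text.
            current["title"] = body[0].rstrip(".") if body else ""
            current["sentences"] = body[1:]
            yield current
            current = None
        elif current is not None and line:
            body.append(line)


# Hypothetical sample: the id and url values are placeholders, not real data.
sample = """<doc id="1" url="https://example.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
</doc>
"""

for doc in iter_docs(sample.splitlines()):
    print(doc["id"], doc["title"], len(doc["sentences"]))
```

For a real document file, the same function can be given an open file object, since it only needs an iterable of lines.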

Such documents are produced by Wikipedia Extractor.
