Gutenberg Book Normalize

Normalize Project Gutenberg books into a format that is easier for statistical models and machine learning to consume.

Installation

git clone git@github.com:ChristianMurphy/gutenberg-book-normalize.git
cd gutenberg-book-normalize
npm install

Usage

Download books

Download all English-language Project Gutenberg books in HTML format.

Follows the recommendations in Project Gutenberg's official robot access guide: https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages

⚠️ The download is over 75 gigabytes and can take 24 hours or more.

npm run gutenberg-download
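
For orientation, here is a minimal sketch of how a harvest request for English HTML books might be addressed. The endpoint path and the `filetypes[]` / `langs[]` parameter names are assumptions based on the robot access guide, and `buildHarvestUrl` is a hypothetical helper; the script behind `npm run gutenberg-download` may work differently.

```javascript
// Sketch only: builds a harvest URL of the kind described in the robot access guide.
// The endpoint and parameter names are assumptions, not taken from this repository.
function buildHarvestUrl(filetypes, langs) {
  const url = new URL('https://www.gutenberg.org/robot/harvest');
  // Each requested file type and language becomes a repeated query parameter.
  for (const type of filetypes) url.searchParams.append('filetypes[]', type);
  for (const lang of langs) url.searchParams.append('langs[]', lang);
  return url;
}

const harvestUrl = buildHarvestUrl(['html'], ['en']);
console.log(harvestUrl.toString());
```

Any crawler pointed at such a URL should also wait between requests, as the guide asks automated clients to do.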

Extract books

Unzips the downloaded archives into files and folders.

npm run gutenberg-extract

Normalize books

Normalizes the HTML content into an easier-to-process JSON format.

npm run gutenberg-normalize

Example output:

{
  "type": "book",
  "title": "lorem ipsum",
  "author": "lorem ipsum",
  "children": [
    {
      "type": "chapter",
      "title": "lorem ipsum",
      "level": "h2",
      "children": [
        {
          "type": "paragraph",
          "value": "lorem ipsum"
        }
      ]
    }
  ]
}
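
One way to read the nesting in the example output is that each heading opens a chapter and the paragraphs that follow belong to it. The sketch below illustrates that reading only; it is not the repository's implementation. The flat `{ tag, text }` input shape and the `groupIntoBook` helper are hypothetical, and it assumes chapters do not nest inside one another.

```javascript
// Hypothetical input: a flat list of elements already pulled from a book's HTML.
// Each item is { tag, text }; this shape is an assumption made for illustration.
function groupIntoBook(title, author, elements) {
  const book = { type: 'book', title, author, children: [] };
  let chapter = null;
  for (const { tag, text } of elements) {
    if (/^h[1-6]$/.test(tag)) {
      // A heading opens a new chapter and records its heading level.
      chapter = { type: 'chapter', title: text, level: tag, children: [] };
      book.children.push(chapter);
    } else if (tag === 'p' && chapter) {
      // Paragraphs attach to the most recent chapter.
      // Text before the first heading is dropped in this sketch.
      chapter.children.push({ type: 'paragraph', value: text });
    }
  }
  return book;
}

const exampleBook = groupIntoBook('lorem ipsum', 'lorem ipsum', [
  { tag: 'h2', text: 'lorem ipsum' },
  { tag: 'p', text: 'lorem ipsum' },
]);
console.log(JSON.stringify(exampleBook, null, 2));
```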

📓 The format conforms to unist, so any of the unist utilities can be used to process the content further.
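
Because every node carries a `type` and, where relevant, a `children` array, the tree can be walked with a few lines of code. The following is a minimal sketch that collects paragraph text; `visit` here is a local helper written for this example, not the `unist-util-visit` package, and the sample tree simply mirrors the example output.

```javascript
// Minimal depth-first walk over a unist-style tree.
// `visit` is a local helper for this sketch, not the unist-util-visit package.
function visit(node, type, visitor) {
  if (node.type === type) visitor(node);
  for (const child of node.children || []) visit(child, type, visitor);
}

// Sample tree mirroring the example output.
const tree = {
  type: 'book',
  title: 'lorem ipsum',
  author: 'lorem ipsum',
  children: [
    {
      type: 'chapter',
      title: 'lorem ipsum',
      level: 'h2',
      children: [{ type: 'paragraph', value: 'lorem ipsum' }],
    },
  ],
};

// Collect the text of every paragraph, in document order.
const paragraphs = [];
visit(tree, 'paragraph', (node) => paragraphs.push(node.value));
console.log(paragraphs);
```

In a real pipeline the tree would come from `JSON.parse` on one of the normalized files, and the published unist utilities offer the same kind of traversal with more options.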
