Skip to content

jumbojett/extract-content-javascript

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ExtractContentJS

Text extraction JavaScript library

You can do it

  • Text extraction

  • Tag Recommended

File

Basically, moving it to read the following in this order:

Lib / lib.js

thing in common

lib/extract-content.js

本文抽出

The repository of route make package Then extract-content-all.js that is the concatenation of these are generated.

When you see you want to detail the actual use:

sketch/extract-content.test.js

本文抽出テスト

Lib / scoring-words.js

scoring tag (sample)

How to use

Text extraction interface

Use if you want to specify that you want / handler only text extraction.

ExtractContentJS.LayeredExtractor

var ex = new ExtractContentJS.LayeredExtractor();
//ex.addHandler( ex.factory.getHandler('Description') );
//ex.addHandler( ex.factory.getHandler('Scraper'));
//ex.addHandler( ex.factory.getHandler('GoogleAdsence') );
ex.addHandler( ex.factory.getHandler('Heuristics') );
var res = ex.extract(document);

if (res.isSuccess) {
    res.url;   // URL string
    res.title; // title string
    res.engine; // handler itself used for extraction
    res.content; // content class of an instance (see below)
}

Handler is far Heuristics only been implemented.

Content class

Returns an array of // body's determined to be leaves class instance that contains the leaf node (see below); content.asLeaves ()
content.asNode (); // return the things of the deepest of the common ancestor of all of the leaf nodes
content.asTextFragment (); return a concatenation of the text of the node that is included in the // asLeaves ()
Return the textContent of // asNode (); content.toString ()

Leaf class

leaf.node; // leaf node
leaf.depth; // depth from the body of the node

AUTHOR

Ina Lintro

Copryright

Copyright © 2009 INA Lintaro / Hatena. All rights reserved.

Copyright of the original implementation

Copyright © 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.

LICENCE

MIT License

Releases

No releases published

Packages

No packages published

Languages

  • JavaScript 99.7%
  • Makefile 0.3%