This repository has been archived by the owner on Jul 10, 2019. It is now read-only.

behemoth-1.1

@jnioche jnioche released this 10 Feb 12:04
· 29 commits to master since this release
  • core classes are unpacked in the job archives so that the core comman…
  • changed version of the job files in behemoth script
  • TikaDriver displays help if missing options
  • Fixed UIMA processor so that annotations of type Annotations (and not…
  • UIMAMapper iterates on AnnotationFS instead of casting to AnnotationImpl
  • Index file created by ContentExtractor points to the right part number
  • Tika 1.3 + copyright year on license
  • behemoth script calls jobs relative to itself
  • example custom config uses block compression
  • IO : moved lemur code to original package + skip parsing of http resp…
  • SparseVectorsFromBehemoth dumps the usage if the input or output is m…
  • bugfix toString() BehemothDocument (ArrayOutOfBoundsException) + avoi…
  • Applied code formatting + added timings to MapReduce jobs
  • CorpusGenerator has timings + more compression and archive formats re…
  • Compatibility with CDH 4.1
  • Merge pull request #44 from mumrah/master
  • Upgrade to Solr 4.3 (thanks to LucidWorks)
  • Updated version of Javadoc plugin
  • Upgraded to Tika 1.4
  • Upgraded version of commons-compress to 1.5
  • Can specify AS name for input to GATE doc
  • Bugfix NPE when using the GATECorpusGenerator
  • updating gate version to 7.1
  • Nutch converter takes dir as input + prints out timings
  • POM sign artefacts when releasing
  • Upgrade hadoop to 1.2.1 and add override method to upgrade
  • Add metadata fields as solr dynamic fields if dynamic.fields param is…
  • Use prefixes for dynamic fields on annotations and metadata
  • Merge pull request #47 from kiranchitturi/master
  • Corpusreader uses the filesystem specified in the input path before r…
  • Update LICENSE.txt
  • WarcFileRecordReader can read from S3 + WARCConverterJob stores http …
  • Merge branch 'master' of github.com:DigitalPebble/behemoth
  • Get IP address from WARC metadata and store in MD
  • WARCConverterJob uses filters
  • exclude asm dependency as it breaks builds
  • GATEDriver returns -1 on error
  • GATE documents generated from plain text are marked as not markup awa…
  • bugfix httpresponse content length skipped when empty
  • Added option to force reparse with Tika