preprocess

Pipelines for preprocessing corpora.

Compile with cmake:

mkdir build
cd build
cmake ..
make -j4

Paths are relative to the build directory.

bin/text.sh $language $lower

does all the tokenization and normalization for plain text that has already been extracted and sentence split.
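
For example (a sketch; the exact convention for $lower, e.g. 1 to lowercase and 0 to keep case, should be checked in the script):

bin/text.sh en 1 < input.en > input.tok.en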

bin/gigaword_unwrap

takes Gigaword XML files on stdin and outputs text with P tags, intended as input to the sentence splitter. It also removes or normalizes many ad-hoc parenthesized expressions like (UNDERLINE), and removes consecutive duplicate lines.

moses/ems/support/split-sentences.perl -l $language

is the Moses/Europarl sentence splitter with a bugfix to also split sentences separated by two spaces.

bin/resplit.sh $language

preserves existing line breaks and introduces additional breaks when multiple sentences appear in the same line. This is useful when you want to use the target side of parallel corpora for language modeling.

bin/gigaword_extract.sh $language

combines the unwrap and sentence split steps.
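
For example (a sketch, assuming the script reads Gigaword XML on stdin and writes split text to stdout like the steps it combines):

bin/gigaword_extract.sh en < nyt_eng.xml > nyt_eng.txt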

bin/dedupe

deduplicates text at the line level.
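
For example, deduplicating an xz-compressed corpus on stdin/stdout:

xzcat corpus.en.xz | bin/dedupe | xz > corpus.en.deduped.xz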

bin/cache slow_program slow_program_args...

Wraps a deterministic line-based program with a deduplicating cache. The program should always write exactly one line for each line it reads. Example input:

Repeated line
Some text
Repeated line
More text

The slow program will only see unique lines in its input:

Repeated line
Some text
More text

Suppose the slow program translates to French:

Ligne répétée
Du texte
Plus de texte

Then cache reinserts the cached results to produce the final output:

Ligne répétée
Du texte
Ligne répétée
Plus de texte
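
A concrete sketch, where translate.sh and its flag are hypothetical stand-ins for your slow deterministic line-based program:

bin/cache ./translate.sh --model fr.model < input.txt > output.txt
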
bin/shard $prefix $shard_count

Shards stdin into multiple files named prefix0, prefix1, prefix2, etc. This is useful when the deduper above runs out of memory.
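
For example, a sketch that shards, deduplicates each shard, and concatenates the results. It assumes lines are assigned to shards by hash, which is what makes per-shard deduplication equivalent to global deduplication:

bin/shard part 4 < corpus.txt    # writes part0, part1, part2, part3
for i in 0 1 2 3; do bin/dedupe < part$i; done > corpus.deduped.txt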

bin/remove_long_lines $length_limit

removes lines longer than the specified length in bytes. The default is 2000 bytes.
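
For example, to raise the limit to 5000 bytes:

bin/remove_long_lines 5000 < corpus.txt > trimmed.txt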

bin/remove_invalid_utf8

removes lines containing invalid UTF-8.

bin/simple_cleaning

Removes lines with:

  • Invalid UTF-8
  • Control characters (except tab and newline)
  • Fewer than --min-chars characters, measured in codepoints
  • Runs of --character-run or more consecutive identical non-space characters
  • Too many characters in the Common and Inherited Unicode scripts (such as numbers)
  • Too much or too little punctuation
  • Too little text in the expected script
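
A sketch invocation; the flag values are illustrative, not documented defaults:

bin/simple_cleaning --min-chars 4 --character-run 20 < raw.txt > clean.txt
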
bin/process_unicode -l $language [--flatten] [--normalize] [--lower]

Processes UTF-8 text.

  • --lower lowercases
  • --normalize applies the ICU normalization function
  • --flatten applies a set of substitutions for punctuation
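
For example, to lowercase French text and flatten punctuation:

bin/process_unicode -l fr --lower --flatten < input.fr > output.fr
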
bin/heuristics.perl -l $language

A collection of substitution heuristics from various people.

moses/tokenizer/tokenizer.perl -l $language

The Moses tokenizer.

bin/truecase --model $model

is a fast reimplementation of the Moses truecase.perl script. It does not support factors.
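
For example, with a truecasing model (file names illustrative; presumably the same model format Moses' train-truecaser.perl produces):

bin/truecase --model truecase.model < tokenized.txt > truecased.txt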

xzcat $language.*.raw.xz | commoncrawl_dedupe /dev/null | xz > $language.deduped.xz

processes the CommonCrawl n-grams raw files into the deduped files:

  • Remove lines beginning with df6fa1abb58549287111ba8d776733e9 (these mark document boundaries)
  • Strip leading and trailing whitespace
  • Deduplicate, preserving the first instance of the line
  • Remove any lines with invalid UTF-8
zcat sentences.gz | b64filter program program-args | gzip -c > processed.gz

Wraps a program that cannot ingest base64-encoded documents but can process plain lines of text, provided it always outputs exactly as many lines as it reads, for example an MT system or a tokenizer.

< long_sentences.txt foldfilter -w 1000 translate.sh > long_english_sentences.txt

Think of it as a wrapper version of fold.

It wraps an MT system (or similar) that does not like long sentences: long lines are temporarily chopped (assuming each line is a single sentence) into multiple shorter lines, and the corresponding output lines are rejoined afterwards. It uses heuristics to determine where to break a line, but if it cannot find a good break point it will still just chop words in half.
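
For example, a sketch that limits chunks to 500 bytes, breaks only on spaces, and keeps the delimiters out of the wrapped command's input:

foldfilter -w 500 -d ' ' -s translate.sh < long_sentences.txt > translated.txt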

Usage: foldfilter [ -w INT ] [ -d DELIMITERS ] [ -s ] command [ args ... ]

Arguments:
  -w INT         Split input lines into lines of at most INT bytes. Default: 80
  -d DELIMITERS  Use these delimiters instead of the default set `:, -./`. Note
                 that the order specified here is also the order of preference
                 when breaking a line: the first delimiter in this set that
                 breaks the line into a width smaller than or equal to the
                 wanted width will be selected, not the last. I.e. with the
                 default set it will prefer to break on whitespace rather than
                 break up a hyphenated word. The default set assumes the input
                 is already split into a single sentence per line; `.` and `/`
                 are mainly there to break up URIs.
  -s             Skip sending the delimiters to the wrapped command. If given,
                 delimiters around the breaking point (both before and after)
                 will be passed directly to the output instead of to the
                 wrapped command. Useful if you do not trust the wrapped
                 program not to trim them off. Delimiters inside lines, i.e.
                 those that are not at the beginning or end of a chopped line,
                 are always sent.

The program's exit code is that of the wrapped command, or 1 if the arguments could not be interpreted or an error occurred, e.g. invalid UTF-8 was passed in.

UTF-8 safe: this tool won't break up Unicode characters, and you can use Unicode characters as delimiters.

docenc -d $INDICES sentences.gz

A bit like b64filter, but split into separate decoding and encoding steps. It can be helpful for constructing or debugging base64-encoded document collections. Accepts gzip-compressed input.

Usage: docenc [ -d ] [ -0 ] [ -q | -v ] [ -n ] [ index ... ] [ file ... ]

Indices can be specified as a single index or a range in the form INT-INT.
You can specify multiple indices or ranges at once. The remaining arguments
are interpreted as files to encode/decode.

Indices:
  N    Single index, starting with 1
M-N  Index range, e.g. 1-3 expands to 1 2 3

Arguments:
  -d      Decode mode: convert base64 into plain text. If not specified it will
          default to encode mode.
  -0      Use null byte instead of double newline as document separator.
  -q      Do not complain if the document separator is encountered in the output
          while decoding.
  -v      Print how many documents were encoded/decoded to stderr.
  -n      When decoding, prefix each line with the document index.

Modes:
  encode  Interpret the input as plain text documents that need to be base64
          encoded. The document separator (double newline) will cause a new
          document to be generated.
  decode  Interpret each line as a base64 encoded document that needs to be
          converted to plain text. The plain text will have double newlines
          (or null bytes) to indicate a document has ended and the next one
          begins.

E.g. these two pipelines behave similarly:

docenc -d plain_text.gz | tr '[:lower:]' '[:upper:]' | docenc | gzip -c > loud_text.gz

and

gzip -cd plain_text.gz | b64filter tr '[:lower:]' '[:upper:]' | docenc | gzip -c > loud_text.gz

This is pretty similar to what can be achieved with something like the following on a Linux machine using GNU base64 (note: macOS base64 will not accept multiple lines):

zcat $@ | head -nX | tail -nY | base64 -d

Mostly useful for debugging, but it can also be used to convert plain text into base64-encoded documents, e.g. converting single-line HTML into multi-line documents:

cat lines_of_html.tsv \
  | cut -d$'\t' -f2 \
  | sed -r 's/$/\x0/g' \
  | sed -r 's/<br\/>|<\/p><p>/\n/g' \
  | sed -r 's/<\/?p>//g' \
  | docenc -0 \
  | gzip -c \
  > sentences.gz
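
For example, to inspect documents 1 through 3 of a collection, prefixing each line with its document index:

docenc -d -n 1-3 sentences.gz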
