The-Orizon/nlputils

nlputils

Utility scripts or libraries for various Natural Language Processing tasks.

List

  • charfreq.awk: calculate character frequency.
  • convcat.py: cat files with different encodings together.
  • csvcol.py: get specified columns of CSV files.
  • csvsql.py: convert a CSV file to an SQL definition.
  • dbsort.tcl: sort SQLite tables in place.
  • detokenizer.py: detokenize Chinese text.
  • dump2db.py: make a database from leaked password dumps.
  • epubzhconv.py: Chinese variant conversion for epub books.
  • filtermd5.py: remove md5s not in known list.
  • findbadlines.py: find encoding errors in stdin.
  • gbk_pua.py: convert PUA codes in GBK to Unicode.
  • getautodesk.py: get Moses format parallel text from Autodesk corpus.
  • gettxtcollection.py: merge a txt file collection to one large corpus.
  • haodoo: crawl and download all books from haodoo.net.
  • iconv.py: implements iconv.
  • iso639.json, iso639-to-calibre.py: get ISO639 codes from Wikipedia and convert to calibre po file.
  • jiebazhc: tokenize Classical Chinese using jieba.
  • libpinyin_bopomofo.py: decorator for python-pinyin that converts Pinyin to Bopomofo (now obsolete).
  • ngramfreq.awk: calculate n-gram character frequency.
  • num2chinese.py: convert numbers to Chinese numbers.
  • phrasecombine.py: combine split words into larger phrases given a dictionary.
  • pwdsort.js, zxcvbn.js: print out password strength according to zxcvbn.
  • pgexplaindot.py: output a GraphViz dot file for EXPLAIN (FORMAT JSON).
  • pgviewdep.tcl: output a GraphViz dot file representing view dependencies in a PostgreSQL database.
  • rmdup.c: remove duplicate lines without sorting (compile with make, needs libxxhash-dev).
  • simpdump.py: try to find username, email, password and hash from leaked password dumps.
  • splitrecutfilter.py: read stdin, filter out non-Chinese sentences, and cut the text into sentences and words.
  • tatoeba: convert tatoeba dumps to a SQLite3 database.
  • wordfreq.awk: calculate word frequency.
  • WWStarClone.py: clone of WWStar, an ancient Classical Chinese translator.
  • zhutil.py: misc. utils for processing Chinese.
  • modelzh.json: model to detect Classical/Modern Chinese.
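
To illustrate what "n-gram character frequency" means for the frequency scripts in the list above, here is a minimal Python sketch. It is not the code of charfreq.awk or ngramfreq.awk, which is not reproduced here. The function name `ngram_freq` and the choice to count overlapping n-grams within each line (never across line breaks) are assumptions made for this example.

```python
from collections import Counter


def ngram_freq(lines, n=1):
    """Count overlapping character n-grams in an iterable of lines.

    Assumption for this sketch: n-grams never span a line break,
    and the trailing newline of each line is ignored.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    counts = Counter()
    for line in lines:
        text = line.rstrip("\n")
        # Slide a window of width n over the line; lines shorter
        # than n produce an empty range and contribute nothing.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Example: bigram counts for two short lines of Chinese text.
sample = ["天地玄黃\n", "天地人\n"]
for gram, count in ngram_freq(sample, n=2).most_common():
    print(f"{gram}\t{count}")
```

With `n=1` the same function gives plain character frequency, so one routine covers both cases in this sketch.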

License

Unless otherwise noted in a file, all files are licensed under the MIT License.
