nlputils

Utility scripts or libraries for various Natural Language Processing tasks.

List

charfreq.awk: calculate character frequency.
convcat.py: cat files with different encodings together.
csvcol.py: get specified columns of csv files.
csvsql.py: convert csv file to sql definition.
dbsort.tcl: sort SQLite tables in place.
detokenizer.py: detokenize Chinese text.
dump2db.py: make a database from leaked password dumps.
epubzhconv.py: Chinese variant conversion for epub books.
filtermd5.py: remove md5s not in known list.
findbadlines.py: find encoding errors in stdin.
gbk_pua.py: convert PUA codes in GBK to Unicode.
getautodesk.py: get Moses format parallel text from Autodesk corpus.
gettxtcollection.py: merge a txt file collection to one large corpus.
haodoo: crawl and download all books from haodoo.net.
iconv.py: implements iconv.
iso639.json, iso639-to-calibre.py: get ISO639 codes from Wikipedia and convert to calibre po file.
jiebazhc: tokenize Classical Chinese using jieba.
libpinyin_bopomofo.py: decorator to use with python-pinyin, to convert Pinyin to Bopomofo. (now obsolete)
ngramfreq.awk: calculate n-gram character frequency.
num2chinese.py: convert numbers to Chinese numbers.
phrasecombine.py: combine split words into larger phrases given a dictionary.
pwdsort.js, zxcvbn.js: print out password strength according to zxcvbn.
pgexplaindot.py: output a GraphViz dot file for EXPLAIN (FORMAT JSON).
pgviewdep.tcl: output a GraphViz dot file representing view dependencies in a PostgreSQL database.
rmdup.cpp: remove duplicate lines without sorting (compile with make, needs libxxhash-dev).
simpdump.py: try to find username, email, password and hash from leaked password dumps.
splitrecutfilter.py: reads stdin, filters out non-Chinese sentences, and cuts sentences and words.
tatoeba: convert tatoeba dumps to a SQLite3 database.
wordfreq.awk: calculate word frequency.
WWStarClone.py: clone of WWStar, an ancient Classical Chinese translator.
zhutil.py: misc. utils for processing Chinese.
modelzh.json: model to detect Classical/Modern Chinese.
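
As a rough illustration of two ideas from the list above (character n-gram frequency, and removing duplicate lines without sorting), here is a minimal Python sketch. It is not the code from ngramfreq.awk or rmdup; the function names `ngram_freq` and `remove_duplicates` are hypothetical, and the standard-library BLAKE2b hash stands in for xxhash purely to keep the example self-contained.

```python
import hashlib
from collections import Counter


def ngram_freq(text, n=1):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def remove_duplicates(lines):
    """Yield each distinct line once, in first-seen order, without sorting.

    Only a short digest of every line is kept in memory. BLAKE2b is an
    assumption made for this sketch; a hash collision would drop a
    distinct line, which is very unlikely with a 64-bit digest on
    moderately sized inputs.
    """
    seen = set()
    for line in lines:
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


if __name__ == "__main__":
    print(ngram_freq("abab", 2).most_common())
    print(list(remove_duplicates(["a", "b", "a", "c", "b"])))
```

Storing digests rather than whole lines keeps memory use small for long lines, at the cost of a tiny collision risk; storing the lines themselves in the set would be exact.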

License

Unless otherwise noted in a file, all files are licensed under the MIT License.
