Corpus readers extract the structured information in a corpus into dictionaries. These dictionaries can then be passed to other transformers.
All readers conform to the following function specification:
reader(pathtofile, fields)
where:
pathtofile
denotes the path to the file containing the corpus
fields
is a tuple denoting which fields to extract
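To make the specification above concrete, here is a minimal sketch of a function with the same two-argument shape. It is not wordkit's implementation: `sketch_reader`, the temporary CSV file, and its column names are all hypothetical, and it uses only the standard library. It returns a list of dictionaries with string values, which is an assumption of this sketch and may differ from what the wordkit readers return.

```python
import csv
import os
import tempfile


def sketch_reader(pathtofile, fields):
    """Hypothetical stand-in, NOT wordkit's reader.

    Reads a CSV file and keeps only the columns named in `fields`.
    """
    with open(pathtofile, newline="", encoding="utf-8") as f:
        rows = csv.DictReader(f)
        # Fail early if a requested field is not a column in the file.
        missing = set(fields) - set(rows.fieldnames or [])
        if missing:
            raise ValueError(f"fields not in file: {sorted(missing)}")
        # One dictionary per row, restricted to the requested fields.
        return [{field: row[field] for field in fields} for row in rows]


# Write a tiny made-up corpus so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("orthography,frequency,length\ncat,120,3\ndog,95,3\n")
    path = tmp.name

words = sketch_reader(path, fields=("orthography", "frequency"))
os.remove(path)
print(words)
```

Only the columns named in `fields` survive; the `length` column in the made-up file is dropped, and asking for a column the file does not have raises an error instead of silently returning partial data.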
Each corpus reader expects a different file format, so make sure the file you pass matches the format of the corpus you are reading.
A list of supported corpora can be found here
All corpus functions return a pandas DataFrame.
from wordkit.corpora import subtlex
w = subtlex("path", fields=("frequency", "orthography"))
Generic corpora (in either csv or xls(x) format) can be read using the reader function.
# Assume some CSV
from wordkit.corpora import reader
w = reader("path_to_weird_csv", fields=("my_frequency_1", "wordness12"))
# w is now a pandas DataFrame.