Releases: interrogator/corpkit

Fixed memory problems

09 May 13:58

There were some issues with large XML file processing that have now been resolved.

Have fun!

Fast, efficient, documented, factored

29 Apr 17:42
  • Speed increases, especially for feature counting
  • Multiprocessing for parsing, very useful when you have access to a big machine
  • Improved searching for CoreNLP (looking in all paths), automating download and installation
  • Simpler backend implementation of keywords and ngrams
  • Better documentation, especially at ReadTheDocs
  • Code has been refactored and made largely PEP8 compliant, aiding collaboration
  • Can now sort by subcorpus name in interrogation.edit() method

Very little difference to the API, however!

Major release

20 Feb 23:49

In this major release, stability and performance have been improved in dozens of ways:

  • Python 2/3 compatibility
  • Smart multiprocessing
  • Useful documentation, ReadTheDocs site generation
  • Much smaller repository size
  • Compatible with multiple versions of CoreNLP
  • Increased object orientation generally
  • Nose tests
  • Travis CI integration
  • Faster save/load via cPickle
  • Countless bugfixes
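
Two items above interact: cPickle exists only on Python 2, so code that runs on both versions needs a fallback. A minimal sketch of the common import idiom follows. The helper names save_result and load_result are hypothetical and chosen for illustration; they are not taken from corpkit's code.

```python
import os
import tempfile

try:
    import cPickle as pickle  # Python 2: the C-accelerated pickler
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically


def save_result(obj, path):
    """Serialise obj to path using the highest available protocol."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_result(path):
    """Read back an object written by save_result."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round-trip a small, made-up result table through a temporary file.
counts = {"think": 12, "feel": 7, "want": 3}
tmp_path = os.path.join(tempfile.mkdtemp(), "counts.p")
save_result(counts, tmp_path)
restored = load_result(tmp_path)
```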

Levels of abstraction have been added beyond Corpus (Corpora) and Interrogation (Interrodict), with useful methods attached to each. Interrogation and concordancing have become two sides of the same coin, rather than separate tasks, helping to build computational workflows that investigate functional linguistic notions of probabilistic grammar and lexis as delicate grammar.

One emerging part of corpkit is the configurations() method, which automatically analyses the behaviour of a lexical item or items in the corpus. This will be very useful in automated workflows that seek to identify key participants and processes, and then to generate an overview of how each behaves. A little more work is still needed here, however. Also on the horizon are multilingual support and the use of spaCy ... but perhaps some of this needs to wait until I've made peace with my thesis.

corpkit plus ReadTheDocs

31 Jan 23:18

The main change in this release is some decent docstrings, which allow for some decent documentation via http://corpkit.readthedocs.org/en/latest/. Since the last release, things have also become more stable. The Corpus class and its subclasses are working really nicely: it's now easy to search particular subcorpora, multiprocess, or treat files as subcorpora. The interrogate method has also improved a lot, and conc has been subsumed within interrogate. All is well.

Classes, methods, improved concordancing

16 Jan 17:12

This release marks a transition to a class-and-method structure, rather than a collection of functions. Users now instantiate a Corpus object with methods for parsing, interrogating and concordancing. Interrogations output Interrogation objects, which have methods for editing, plotting, saving, etc.

Another major update is that the concordance() method takes the same core arguments as the interrogate() method. This means that users can quickly check that their interrogation is counting what they think it is.
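
One way to picture why shared arguments help is a toy model in which counting and concordancing sit on top of the same matching routine. The sketch below is an illustration only: ToyCorpus, its single regex argument, and its plain-text tokens are assumptions made for the example, and corpkit's real Corpus class works on parsed data with a richer set of arguments.

```python
import re
from collections import Counter


class ToyCorpus:
    """A stand-in for a corpus: just a list of tokenised sentences."""

    def __init__(self, sentences):
        self.sentences = [s.split() for s in sentences]

    def _matches(self, pattern):
        """Yield (sentence, index) for every token matching pattern."""
        regex = re.compile(pattern)
        for sent in self.sentences:
            for i, word in enumerate(sent):
                if regex.search(word):
                    yield sent, i

    def interrogate(self, pattern):
        """Count the matching tokens."""
        return Counter(sent[i] for sent, i in self._matches(pattern))

    def concordance(self, pattern):
        """Return (left context, match, right context) for the same matches."""
        return [(" ".join(sent[:i]), sent[i], " ".join(sent[i + 1:]))
                for sent, i in self._matches(pattern)]


corpus = ToyCorpus(["I think so", "she thinks not", "we want more"])
counts = corpus.interrogate(r"^think")
lines = corpus.concordance(r"^think")
```

Because both methods call the same matcher with the same argument, every counted token has exactly one concordance line, which is what makes the concordance usable as a check on the counts.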

There have also been some bugfixes, documentation updates, and that kind of usual stuff.

New interrogation options

21 Nov 19:33

This release reflects a change from purpose-built interrogator() search functions to the search and show arguments, which are much more powerful. Users can construct a dict object with one or more dependency criteria to match, and elect to match any criterion or all criteria with searchmode = 'any'/'all'.

>>> criteria = {'lemma': ['think', 'feel', 'want'],
...             'pos': r'^V',
...             'function': 'root'}

>>> r = interrogator(corpus, search = criteria, show = ['word'], searchmode = 'all')
>>> list(r.results.columns)[:5]

might return:

['think', 'thinking', 'want', 'wants', 'feel']

Passing in a longer list for the show argument will set what is given in the output, as well as its order:

>>> r = interrogator(corpus, search = criteria, show = ['f', 'p', 'l'], searchmode = 'all')
>>> list(r.results.columns)[:3]

will produce column names with concatenated function, pos and lemma:

['root/vbp/think', 'root/vbg/thinking', 'root/vb/want']
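
The way show shapes the output can be sketched as joining the requested attributes in the order given. This is a guess at the behaviour based on the output above, not corpkit's implementation: the token dict, the format_token helper, and the lowercasing step are all assumptions made for illustration.

```python
# Hypothetical token record; corpkit's internal representation may differ.
token = {"w": "wants", "l": "want", "p": "VBZ", "f": "root"}


def format_token(token, show):
    """Join the requested attributes, in the order given, with slashes."""
    return "/".join(token[key].lower() for key in show)


full = format_token(token, ["f", "p", "l"])   # function, pos, lemma
short = format_token(token, ["w"])            # word form only
```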

Another improvement is the exclude argument, which takes the place of blacklist, function_filter and pos_filter. Alongside excludemode = 'any'/'all', it operates just like search, allowing the user to exclude results matching one or more criteria:

>>> excs = {'pos': r'^V', 'word': r'ing$'}
>>> r = interrogator(corpus, search = criteria, show = ['f', 'p', 'l'],
...     searchmode = 'all', exclude = excs, excludemode = 'all')

would remove any verbal token ending in ing. Changing excludemode to 'any' would remove all verbs and all words ending in ing.
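
The any/all logic for search and exclude can be sketched with plain dicts and regular expressions. The helper names matches and keep, and the token dicts, are hypothetical and exist only for this example; the sketch assumes a list value means "one of these strings" and a string value is a regex, as in the criteria above.

```python
import re


def matches(token, criteria, mode="all"):
    """Test a token against a dict of {attribute: regex or list of strings}."""
    if not criteria:
        return False

    def hit(attr, wanted):
        value = token.get(attr, "")
        if isinstance(wanted, list):
            return value in wanted
        return re.search(wanted, value) is not None

    results = (hit(attr, wanted) for attr, wanted in criteria.items())
    return all(results) if mode == "all" else any(results)


def keep(token, search, exclude, searchmode="all", excludemode="any"):
    """Keep a token that satisfies search and is not caught by exclude."""
    return (matches(token, search, searchmode)
            and not matches(token, exclude, excludemode))


criteria = {"lemma": ["think", "feel", "want"], "pos": r"^V", "function": "root"}
excs = {"pos": r"^V", "word": r"ing$"}

thinking = {"word": "thinking", "lemma": "think", "pos": "VBG", "function": "root"}
thinks = {"word": "thinks", "lemma": "think", "pos": "VBZ", "function": "root"}
```

With excludemode 'all', only thinking is dropped, because it is both a verb and ends in ing. With 'any', thinks is dropped as well, because being a verb is enough.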

The release also includes various bugfixes, code cleanup, and some miscellaneous bits and pieces, such as a function for turning results into pandas MultiIndex DataFrames. Full API documentation is forthcoming.

corpkit user interface

19 Aug 06:20

This release contains a beta of corpkit packaged as an OS X .app.

First proper release

02 Jul 05:04

I'm releasing corpkit today as 1.0 mostly so that it can get a DOI and be cited.

The toolkit's interrogator(), editor(), plotter(), conc() and keywords() functions are now in a fairly usable state, though documentation of some options may still be lacking. I also haven't really tested the toolkit on single subcorpora and plain text files, because the main aim is to work with parsed and structured corpora.

A major issue at present is that dependency querying is quite slow, though I think it could be sped up by multiprocessing, and by parsing CoreNLP output with lxml/corenlp_xml. Because Knuth warns against premature optimisation, and because I have a thesis to finish, I'm going to try not to spend too much time on this issue yet.
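
For a sense of what parsing CoreNLP's XML output with a proper XML parser looks like, here is a minimal sketch using the standard library's xml.etree.ElementTree, whose API lxml.etree largely mirrors. The SAMPLE string is a simplified, hand-written fragment modelled on CoreNLP's XML layout, and the dependencies function is a hypothetical helper, not part of corpkit or corenlp_xml.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment in the style of CoreNLP XML output.
SAMPLE = """<root><document><sentences><sentence id="1">
  <tokens>
    <token id="1"><word>She</word><lemma>she</lemma><POS>PRP</POS></token>
    <token id="2"><word>thinks</word><lemma>think</lemma><POS>VBZ</POS></token>
  </tokens>
  <dependencies type="basic-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="2">thinks</dependent></dep>
    <dep type="nsubj"><governor idx="2">thinks</governor><dependent idx="1">She</dependent></dep>
  </dependencies>
</sentence></sentences></document></root>"""


def dependencies(xml_text, dep_type="basic-dependencies"):
    """Return (relation, governor, dependent) triples from every sentence."""
    root = ET.fromstring(xml_text)
    triples = []
    for sent in root.iter("sentence"):
        for deps in sent.findall("dependencies"):
            if deps.get("type") != dep_type:
                continue  # skip other dependency representations
            for dep in deps.findall("dep"):
                triples.append((dep.get("type"),
                                dep.find("governor").text,
                                dep.find("dependent").text))
    return triples


triples = dependencies(SAMPLE)
```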

This release also marks the start of my transition toward developing:

  • Tools for getting data parsed and structured
  • Tools for connecting concordance lines to HTML

Once these are done, I'd ideally like to wrap everything up as some kind of web service or application. These future goals, however, score me very few points for my thesis, so I'm not going to be developing them as furiously as I'd like to be.

Be in touch if you have any questions or comments!