Releases · interrogator/corpkit

09 May 13:58

interrogator

v2.0.13

bddb505

Fixed memory problems

Memory issues that occurred when processing large XML files have now been resolved.

Have fun!

29 Apr 17:42

interrogator

v.2.0.8

64c9e30

Fast, efficient, documented, factored

Speed increases, especially for feature counting
Multiprocessing for parsing, very useful when you have access to a big machine
Improved searching for CoreNLP (looking in all paths), automating download and installation
Simpler backend implementation of keywords and ngrams
Better documentation, especially at ReadTheDocs
Code has been refactored and made largely PEP8 compliant, aiding collaboration
Can now sort by subcorpus name in interrogation.edit() method

Very little difference to the API, however!

20 Feb 23:49

interrogator

2.0.0

2e69ab1

Major release

In this major release, stability and performance have been improved in dozens of ways:

Python 2/3 compatibility
Smart multiprocessing
Useful documentation, ReadTheDocs site generation
Much smaller repository size
Compatible with multiple versions of CoreNLP
Increased object orientation generally
Nose tests
Travis CI integration
Faster save/load via cPickle
Countless bugfixes
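
The Python 2/3 compatibility and cPickle items in the list above can be illustrated together. What follows is only a minimal sketch of the common import-fallback idiom, not corpkit's actual save/load code; the function names `save_result` and `load_result` and the file name are hypothetical.

```python
import os
import tempfile

try:
    import cPickle as pickle  # Python 2: the C-accelerated pickle module
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically


def save_result(obj, path):
    """Serialise obj to path with the highest protocol available."""
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_result(path):
    """Read back an object written by save_result."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Round-trip a small, made-up frequency table through a temporary file.
counts = {'think': 12, 'feel': 7}
path = os.path.join(tempfile.mkdtemp(), 'counts.p')
save_result(counts, path)
restored = load_result(path)
```

Opening the file in binary mode matters on both Python versions, since the higher pickle protocols are binary formats.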

Levels of abstraction have been added beyond Corpus (Corpora) and Interrogation (Interrodict), with useful methods attached to each. Interrogation and concordancing have become two sides of the same coin, rather than separate tasks, helping to build computational workflows that investigate functional linguistic notions of probabilistic grammar and lexis as delicate grammar.

One emerging part of corpkit is the configurations() method, which automatically analyses the behaviour of a lexical item or items in the corpus. This will be very useful in automated workflows that seek to identify key participants and processes, and then to generate an overview of how each behaves. A little more work is still needed here, however. Also on the horizon are multilingual support and the use of spaCy ... but perhaps some of this needs to wait until I've made peace with my thesis.

31 Jan 23:18

interrogator

1.87

5cbe176

corpkit plus ReadTheDocs

The main change is the addition of decent docstrings, which allow for decent documentation via http://corpkit.readthedocs.org/en/latest/. Since the last release, things have also become more stable. The Corpus class and its subclasses are working really nicely: it's now easy to search particular subcorpora, multiprocess, or treat files as subcorpora. The interrogate method has also improved a lot, and conc has been subsumed within interrogate. All is well.

16 Jan 17:12

interrogator

1.82

c7d9309

Classes, methods, improved concordancing

This release marks a transition to a class-and-method structure, rather than a collection of functions. Users now instantiate a Corpus object with methods for parsing, interrogating and concordancing. Interrogations output Interrogation objects, which have methods for editing, plotting, saving, etc.

Another major update is that the concordance() method takes the same core arguments as the interrogate() method. This means that users can quickly check that their interrogation is counting what they think it is.
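
To see why shared arguments make this check easy, here is a toy sketch in plain Python. It does not use corpkit at all: the data, the `find_matches` helper and the `count` and `concordance` functions are hypothetical stand-ins. The point is only that when counting and concordancing are driven by the same matcher, every counted hit has a corresponding concordance line.

```python
import re

# Toy corpus: each sentence is a list of (word, pos) pairs. Made-up data.
SENTENCES = [
    [('I', 'PRP'), ('think', 'VBP'), ('so', 'RB')],
    [('She', 'PRP'), ('wants', 'VBZ'), ('tea', 'NN')],
    [('Tea', 'NN'), ('is', 'VBZ'), ('nice', 'JJ')],
]


def find_matches(sentences, pos_pattern):
    """Yield (sentence_index, token_index) for tokens whose POS matches."""
    regex = re.compile(pos_pattern)
    for s_idx, sent in enumerate(sentences):
        for t_idx, (_word, pos) in enumerate(sent):
            if regex.search(pos):
                yield s_idx, t_idx


def count(sentences, pos_pattern):
    """Return the frequency of each matched word."""
    freqs = {}
    for s_idx, t_idx in find_matches(sentences, pos_pattern):
        word = sentences[s_idx][t_idx][0]
        freqs[word] = freqs.get(word, 0) + 1
    return freqs


def concordance(sentences, pos_pattern):
    """Return (left context, match, right context) for each matched token."""
    lines = []
    for s_idx, t_idx in find_matches(sentences, pos_pattern):
        words = [w for w, _pos in sentences[s_idx]]
        left = ' '.join(words[:t_idx])
        right = ' '.join(words[t_idx + 1:])
        lines.append((left, words[t_idx], right))
    return lines


# The same query drives both views of the data.
freqs = count(SENTENCES, r'^V')
lines = concordance(SENTENCES, r'^V')
```

Here the total of `freqs` equals the number of concordance lines, so reading the lines is a direct check on what was counted.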

There have also been some bugfixes, documentation updates, and the usual sort of thing.

21 Nov 19:33

interrogator

1.76

dee181e

New interrogation options

This release reflects a change from purpose-built interrogator() search functions to the search and show arguments, which are much more powerful. Users can construct a dict object with one or more dependency criteria to match, and choose whether to match all criteria or any one of them with searchmode = 'all' or searchmode = 'any'.

>>> criteria = {'lemma': ['think', 'feel', 'want'],
...             'pos': r'^V',
...             'function': 'root'}

>>> r = interrogator(corpus, search = criteria, show = ['word'], searchmode = 'all')
>>> list(r.results.columns)[:5]

might return:

['think', 'thinking', 'want', 'wants', 'feel']

Passing in a longer list for the show argument determines which attributes appear in the output, and in what order:

>>> r = interrogator(corpus, search = criteria, show = ['f', 'p', 'l'], searchmode = 'all')
>>> list(r.results.columns)[:3]

will produce column names with concatenated function, pos and lemma:

['root/vbp/think', 'root/vbg/thinking', 'root/vb/want']

Another improvement is the exclude argument, which takes the place of blacklist, function_filter and pos_filter. Alongside excludemode = 'any'/'all', it operates just like search, allowing the user to exclude results matching one or more criteria:

>>> excs = {'pos': r'^V', 'word': r'ing$'}
>>> r = interrogator(corpus, search = criteria, show = ['f', 'p', 'l'],
...     searchmode = 'all', exclude = excs, excludemode = 'all')

would remove any verbal token ending in ing. Changing excludemode to 'any' would remove all verbs and all words ending in ing.
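
The any/all logic described above can be sketched in a few lines of plain Python. This is an illustration of the semantics only, not corpkit's implementation: tokens are modelled as simple dicts, the `select` function and its helpers are hypothetical, and treating a list as "any of these exact values" while treating a string as a regular expression is an assumption made for the sketch.

```python
import re

# Each token is a plain dict; the data and field names are illustrative.
TOKENS = [
    {'word': 'thinking', 'lemma': 'think', 'pos': 'VBG', 'function': 'root'},
    {'word': 'wants', 'lemma': 'want', 'pos': 'VBZ', 'function': 'root'},
    {'word': 'feeling', 'lemma': 'feeling', 'pos': 'NN', 'function': 'dobj'},
    {'word': 'tea', 'lemma': 'tea', 'pos': 'NN', 'function': 'dobj'},
]


def criterion_matches(token, field, pattern):
    """Assumption: a list means exact membership, a string is a regex."""
    value = token[field]
    if isinstance(pattern, (list, tuple, set)):
        return value in pattern
    return re.search(pattern, value) is not None


def matches(token, criteria, mode='all'):
    """Combine the per-criterion results with all() or any()."""
    if mode not in ('all', 'any'):
        raise ValueError("mode must be 'all' or 'any'")
    combine = all if mode == 'all' else any
    return combine(criterion_matches(token, field, pattern)
                   for field, pattern in criteria.items())


def select(tokens, search, searchmode='all', exclude=None, excludemode='any'):
    """Keep words that satisfy search and do not satisfy exclude."""
    kept = []
    for token in tokens:
        if not matches(token, search, searchmode):
            continue
        if exclude and matches(token, exclude, excludemode):
            continue
        kept.append(token['word'])
    return kept


search = {'pos': r'^(V|N)'}
excs = {'pos': r'^V', 'word': r'ing$'}
excl_all = select(TOKENS, search, exclude=excs, excludemode='all')
excl_any = select(TOKENS, search, exclude=excs, excludemode='any')
```

With excludemode set to 'all', only the verbal token ending in ing ("thinking") is dropped; with 'any', every verb and every word ending in ing is dropped, leaving only "tea".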

The release has various other bugfixes, code cleanup, and some miscellaneous bits and pieces, such as a function for turning results into Pandas Multi Index DataFrames. Full API documentation is forthcoming.

19 Aug 06:20

interrogator

1.26

8cc60b8

corpkit user interface

This release contains a beta of corpkit packaged as an OS X .app.

02 Jul 05:04

interrogator

1.0

69a76fb

First proper release

I'm releasing corpkit today as 1.0 mostly so that it can get a DOI and be cited.

The toolkit's interrogator(), editor(), plotter(), conc() and keywords() functions are now in a fairly usable state, though documentation of some options may still be lacking. I also haven't really tested the toolkit on single subcorpora and plain text files, because the main aim is to work with parsed and structured corpora.

A major issue at present is that dependency querying is quite slow, though I think it could be sped up by multiprocessing, and by parsing CoreNLP output with lxml/corenlp_xml. Because Knuth warns against premature optimisation, and because I have a thesis to finish, I'm going to try not to spend too much time on this issue yet.
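
As a rough idea of what parsing CoreNLP output with an XML library could look like, here is a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml or corenlp_xml (lxml offers a similar iterparse), the sample document is a heavily simplified stand-in for real CoreNLP XML, and the `iter_sentences` helper is hypothetical rather than part of corpkit.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for CoreNLP XML output; real files carry more fields.
SAMPLE = b"""<root><document><sentences>
<sentence id="1"><tokens>
<token id="1"><word>I</word><lemma>I</lemma><POS>PRP</POS></token>
<token id="2"><word>think</word><lemma>think</lemma><POS>VBP</POS></token>
</tokens></sentence>
<sentence id="2"><tokens>
<token id="1"><word>Dogs</word><lemma>dog</lemma><POS>NNS</POS></token>
</tokens></sentence>
</sentences></document></root>"""


def iter_sentences(source):
    """Yield one list of (word, lemma, pos) tuples per sentence."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'sentence':
            yield [(tok.findtext('word'),
                    tok.findtext('lemma'),
                    tok.findtext('POS'))
                   for tok in elem.iter('token')]
            # Discard the finished sentence's subtree to limit memory use.
            elem.clear()


parsed = list(iter_sentences(io.BytesIO(SAMPLE)))
```

Parsing incrementally and clearing each sentence once it has been read keeps memory use roughly flat for large files, which is one reason this style is often preferred over loading the whole tree.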

This release also marks the start of my transition toward developing:

Tools for getting data parsed and structured
Tools for connecting concordance lines to HTML

Once these are done, I'd ideally like to wrap everything up as some kind of web service or application. These future goals, however, score me very few points for my thesis, so I'm not going to be developing them as furiously as I'd like to be.

Be in touch if you have any questions or comments!
