
Releases: Bookworm-project/BookwormAPI

DuBois v.2

03 Oct 18:13

Two changes, plus pulling master to be up-to-date with the pandas branch now that it's proved its worth in local production. Renaming Pandas to DuBois, because he's an author and I just taught him.

Also removing "alpha" and "beta" from all future pre-v1.0 releases; instability should be assumed.

1. Adding a new syntactic option to drop groups from the comparison set.

So ordinarily a query like {"groups":["year","library"],"counttype":["TextPercent"]} will give, for each intersection of year and library, the percentage of texts that come from that particular library in that year. That's not interesting. (By definition, it will always be 100%.)

On the other hand,

  • {"groups":["year","*library"],"counttype":["TextPercent"]} will drop the library grouping on the superset and give the percentage of all texts for that year that come from the library, so each column will sum to 100%;
  • {"groups":["*year","library"],"counttype":["TextPercent"]} will drop the year superset and give the percentage of all texts for that library that come from that year and library.
  • * {"groups":["*year","*library"],"counttype":["TextPercent"]} will drop both and give the percentage of all texts for the library defined by search_limits or constrain_limits contained in each cell: the sum of all the TextPercent cells in the entire return set should be 100. (Though it may not be if year or library is undefined for some items).

Combining this syntax with the syntax for defining a separate compare_limits will produce some pretty nonsensical queries, so it's generally better to use just one or the other.

2. Support for the topic-model extension.

Allows really fine-grained analysis of Mallet topic models at the token level. Blog post forthcoming, hopefully.

First Pandas release.

27 Aug 17:56

This is a major architectural update to allow further development; it includes some important performance changes and new features.

It introduces two new Python package dependencies:

  • pandas
  • numexpr (for convenience; this dependency may be dropped eventually).

Both should be easily available through easy_install or whatever else you use.

It also sets aside, for the time being, the need for temporary tables and the bookworm_scratch database described in v0.4, though they may still be revived.

Architecture changes

Parts of the core functionality of the API have been abstracted out of the SQL-generating code.

All the different counttypes have been boiled down to two core types, "WordCount" and "TextCount"; each single API call now separately constructs two corpora and computes the ratios ("Words per million" and so forth) in Python instead of SQL (see the sketch after the list below).

This is done for two reasons:

  1. It allows better performance on MySQL (see below).
  2. The SQL construction engine is considerably less complicated, so re-implementing it on top of other platforms is easier.
    • Solr will be somewhat easier and can use more existing code, but will still need a few methods.
    • The meta-bookworm (an implementation that dispatches calls to other API nodes, rather than directly to a database) should be quite easy to write for most methods, although ordering search results presents issues.
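As a rough illustration of that in-Python ratio step: the values and column layout below are invented for the example, though the two core count types are the real ones.

    import pandas as pd

    # The search set and its comparison superset, keyed by group.
    # Counts here are made up for illustration.
    search = pd.DataFrame({"year": [1900, 1901],
                           "WordCount": [120, 80],
                           "TextCount": [10, 8]}).set_index("year")
    superset = pd.DataFrame({"year": [1900, 1901],
                             "WordCount": [100000, 90000],
                             "TextCount": [400, 350]}).set_index("year")

    # Ratios are computed in pandas rather than in SQL:
    result = pd.DataFrame({
        "WordsPerMillion": search["WordCount"] / superset["WordCount"] * 1e6,
        "TextPercent": search["TextCount"] / superset["TextCount"] * 100.0,
    })
    print(result)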

Most of the API handling code has been bundled into a module, with the cgi-bin bits now taking up minimal space. This should make local (non-Apache-interfacing) connections slightly easier.
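A local connection might then look something like the following. The module and class names here are hypothetical stand-ins, not the module's actual exports; the point is only that no Apache/cgi-bin layer is involved.

    # Hypothetical names throughout; consult the module for its real exports.
    from bookworm_api import APICall

    query = {"database": "mybookworm",
             "groups": ["year"],
             "counttype": ["WordCount"]}
    print(APICall(query).execute())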

Performance changes

Corpus creation queries are now usually cached: for large (5M+ document) bookworms, this can frequently speed up queries by 5-6x, bringing most normal queries back near a second.

Additions

The new dispatching makes new methods built on top of the API much easier to write: as an example, I've added "Average Text Length" and TF-IDF as core summary statistics. These may be removed at a later point.
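For a sense of what such a summary statistic involves, here is one standard TF-IDF formulation built from the two core count types. The numbers are invented and the API's actual formula may differ.

    import numpy as np
    import pandas as pd

    # Counts for a single word across three groups (invented values).
    df = pd.DataFrame({
        "WordCount": [120, 3, 45],   # occurrences of the word in each group
        "TextCount": [30, 2, 15],    # texts in each group containing the word
    })
    total_texts = 1000.0             # assumed size of the comparison set

    # One common variant: raw term frequency times log inverse doc frequency.
    df["TFIDF"] = df["WordCount"] * np.log(total_texts / df["TextCount"])
    print(df)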

This also allows a non-canonical return method: a cPickled pandas DataFrame rather than just JSON or TSV. That should be great news for anyone looking to do analysis directly in Python.
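Consuming that from Python 2 might look like the sketch below; the endpoint, the queryTerms parameter, and the method value requesting a pickled frame are stand-ins for whatever your installation actually exposes.

    import json
    import urllib
    import urllib2
    import cPickle as pickle  # Python 2, matching the cPickled return

    query = {"database": "mybookworm",
             "groups": ["year"],
             "counttype": ["WordCount"],
             "method": "return_pickle"}   # stand-in method name
    url = ("http://localhost/cgi-bin/dbbindings.py?queryTerms=" +
           urllib.quote(json.dumps(query)))
    df = pickle.loads(urllib2.urlopen(url).read())  # a pandas DataFrame
    print(df.head())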

Sterne bugfix

31 Jul 19:07

Minor but important bugfix: caches to memory, not disk, by default.

Sterne, take 2

30 Jul 20:23

Note: As described below, this version will not work without creating a new database called bookworm_scratch and configuring it properly. You can do this automatically by typing "python OneClick.py doctor" in a fresh clone of the Presidio repository with a bookworm.cnf file already defined.

For that reason, it's being left on dev for the time being.

0.4-alpha

This version makes a major change to the underlying architecture of queries: instead of using derived tables, all intermediate queries are stored as temporary tables. This may have some cost in RAM, but is dramatically faster for most queries on very large databases. (For instance, with a 6M-document Chronicling America db, some queries that were previously taking about 5-10 seconds now take 0.5 seconds.)

These gains come primarily through better caching: the component parts of subqueries were not previously being cached, but now they are. (There are also some gains on very large results from indexes.) So the improvements won't show up the first time you define a corpus for a query, but should for subsequent ones.
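Roughly, the pattern is the one sketched below; the catalog table and column names are stand-ins, not the API's actual SQL.

    import MySQLdb  # the driver a bookworm installation already uses

    db = MySQLdb.connect(db="bookworm_scratch")  # stand-in connection details
    cursor = db.cursor()

    # Old style: the corpus was a derived table inlined into every query,
    # so its component SELECT was never cached:
    #   SELECT ... FROM (SELECT bookid FROM catalog
    #                    WHERE year = 1900) AS corpus JOIN ...

    # New style: materialize the intermediate result once as a temporary
    # table; later queries against the same corpus definition can reuse it.
    cursor.execute("""CREATE TEMPORARY TABLE corpus
                      AS SELECT bookid FROM catalog WHERE year = 1900""")
    cursor.execute("ALTER TABLE corpus ADD PRIMARY KEY (bookid)")
    cursor.execute("SELECT COUNT(*) FROM corpus")
    print(cursor.fetchone())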

To work, existing bookworm installations will need to change two things:

  1. You need to create a new database called bookworm_scratch, with read and write privileges for the non-admin user. This scratch DB is used instead of the bookworm's own db to keep the edits sandboxed from the main bookworm installation. It can be done in a single command, python OneClick.py doctor, from any bookworm installation with the latest version of Presidio installed.
  2. You need to make sure your query cache is working properly; MySQL 5.6 changed its defaults from 5.5 so that the cache is generally off. The automatic setup script in Presidio in /etc/mysqlSetup will handle this, or you can do it by hand; some decent values are below. As always, restarting a server takes some overhead in recreating the memory tables.
query_cache_limit = 1M
query_cache_size = 32M
query_cache_type = 1
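To confirm the cache is actually enabled, a quick check from Python (or any MySQL client):

    import os
    import MySQLdb

    db = MySQLdb.connect(read_default_file=os.path.expanduser("~/.my.cnf"))
    cursor = db.cursor()
    cursor.execute("SHOW VARIABLES LIKE 'query_cache%'")
    for name, value in cursor.fetchall():
        # query_cache_type should be ON/1 and query_cache_size nonzero.
        print("%s = %s" % (name, value))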

This shift also allows something we've been discussing for years: a 'hasword' query in a key. It's not fully up to the new spec, but will be in the next release.

Sterne

20 Jun 19:49

API release to enable all features of the 0.3-alpha Presidio build.

Includes some unnecessary legacy code to be dropped in later releases.

First stab at a modern API release

20 Jun 19:47
Pre-release
Merge pull request #8 from bmschmidt/dev

Fixing Quotation Marks

The real original master

20 Jun 19:45
Pre-release

Real location of original master: not to be used except at Rice installations.