Optimizing evidence representation #998

bgyori · 2019-11-05T16:30:18Z

This PR implements two optimizations to the representation of Evidence that significantly decrease memory usage when manipulating large sets of INDRA Statements. The bulk of the memory used by INDRA Statements is attributable to the Evidence objects (including evidence text) attached to them. The first optimization defines the __slots__ attribute of Evidence so that its set of attributes is predefined (rather than variable via a __dict__ attribute). On its own, this seemed to make only a minor difference in memory usage. Much larger savings are achieved by the second optimization: the list of Evidences attached to a Statement is stored in a serialized, compressed form, and is decompressed and deserialized only when accessed. Based on some experiments, a Statement with 100 pieces of Evidence uses 75% less memory with this PR. On some large assembled corpora that I tried, which contain Statements with varying numbers of Evidences, 80% lower memory usage is typical.
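A minimal sketch of the two ideas described above, assuming gzip-compressed JSON as the storage format (the commit titles mention JSON/Gzip). ToyEvidence, ToyStatement, their methods, and the three attribute names are hypothetical stand-ins chosen for illustration, not INDRA's actual classes or the code in this PR:

```python
import gzip
import json


class ToyEvidence:
    """Hypothetical stand-in for an Evidence class (not INDRA's)."""
    # __slots__ fixes the attribute set, so instances carry no per-object __dict__.
    __slots__ = ('source_api', 'pmid', 'text')

    def __init__(self, source_api=None, pmid=None, text=None):
        self.source_api = source_api
        self.pmid = pmid
        self.text = text

    def to_json(self):
        return {name: getattr(self, name) for name in self.__slots__}

    @classmethod
    def from_json(cls, data):
        return cls(**data)


class ToyStatement:
    """Hypothetical stand-in for a Statement class (not INDRA's)."""

    def __init__(self, evidence=None):
        self.evidence = evidence or []  # goes through the setter below

    @property
    def evidence(self):
        # Decompress and deserialize on every access; the result is a new list.
        raw = gzip.decompress(self._evidence)
        return [ToyEvidence.from_json(d) for d in json.loads(raw.decode('utf-8'))]

    @evidence.setter
    def evidence(self, evidence_list):
        # Serialize to JSON and keep only the compressed bytes.
        raw = json.dumps([ev.to_json() for ev in evidence_list])
        self._evidence = gzip.compress(raw.encode('utf-8'))

    def stored_evidence_bytes(self):
        return len(self._evidence)


# 100 evidences with made-up, repetitive text (illustrative data only).
evs = [ToyEvidence('reach', str(i), 'MEK phosphorylates ERK. ' * 5)
       for i in range(100)]
stmt = ToyStatement(evs)
uncompressed = len(json.dumps([ev.to_json() for ev in evs]).encode('utf-8'))
print(stmt.stored_evidence_bytes(), uncompressed)
```

The printed byte counts only compare the compressed payload with its uncompressed JSON form for this toy data; they say nothing about the memory figures reported in the description, which depend on real Evidence content.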

Not much of this affects the way INDRA Statements are used. However, there is one important difference: when accessing a Statement's evidence (i.e., stmt.evidence), one gets a view of the list of Evidences (deserialized on access) rather than a reference to the stored list. Directly manipulating stmt.evidence therefore will not result in persistent changes to the Statement. Rather, one has to do something like:

evs = stmt.evidence  # Get a deserialized view of the evidence list
for ev in evs:
    # Make some changes to each ev object
    ...
stmt.evidence = evs  # Assign back so that the changes persist

to make changes to a Statement's list of Evidences. Some specialized code dealing with Evidence manipulation, as well as some tests, needed to be updated. I am still ambivalent about whether this change will cause confusion later, and am therefore not yet sure whether this PR should be merged.

cthoyt · 2019-11-12T17:48:47Z

From my point of view, this new API is pretty confusing. It's unclear why saving to a variable solves this problem.

bgyori · 2019-11-12T18:20:46Z

Well, users of INDRA would never really notice any change. It's only during internal development (e.g., of pre-assembly algorithms or input processors) that one could make a mistake by attempting to change a view of a list of Evidences rather than the actual evidence attribute of a Statement. Saving into a variable is not really necessary; the key is to always set evidences as stmt.evidence = [...], which updates the actual evidence list attribute, rather than to iterate over and manipulate stmt.evidence[idx] directly, which with this change would only modify a view of the evidences. I agree it is somewhat confusing, hence my ambivalence about the change.
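To make the difference concrete, here is a small self-contained sketch of the behavior discussed in this thread. It assumes a getter that rebuilds the list from compressed bytes on each access; ToyStatement is a hypothetical stand-in (not INDRA's Statement), and plain dicts stand in for Evidence objects:

```python
import gzip
import json


class ToyStatement:
    """Hypothetical stand-in, not INDRA's Statement class."""

    def __init__(self, evidence):
        self.evidence = evidence  # goes through the setter below

    @property
    def evidence(self):
        # Every access rebuilds a new list from the compressed bytes.
        return json.loads(gzip.decompress(self._blob).decode('utf-8'))

    @evidence.setter
    def evidence(self, evidence_list):
        self._blob = gzip.compress(json.dumps(evidence_list).encode('utf-8'))


stmt = ToyStatement([{'text': 'a'}, {'text': 'b'}])

# Lost update: this mutates a temporary list that is discarded right away.
stmt.evidence[0]['text'] = 'changed'
lost = stmt.evidence[0]['text']  # still 'a'

# Persistent update: assign a full list back through the setter.
stmt.evidence = [dict(ev, text=ev['text'].upper()) for ev in stmt.evidence]
kept = [ev['text'] for ev in stmt.evidence]  # ['A', 'B']
```

Under these assumptions, holding the list in a variable matters only because it keeps the modified objects alive until they are assigned back; any expression that ends in an assignment to stmt.evidence has the same effect.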

bgyori force-pushed the slots branch from 0f85f12 to 9483079 on November 5, 2019 16:31

bgyori force-pushed the slots branch from 7a8e2a3 to fade96b on November 12, 2019 20:17

bgyori added 12 commits December 31, 2019 10:48

Implement slots for evidence

b568639

Implement setter and getter for evidence

d2df398

Implement JSON/Gzip for Evidences

9a3c504

Fix evidence attr in make generic copy

dd93a78

Add exception for evidence in unicode_strs

6ef017c

Add exception for object decode

de97135

Implement add evidence

4293d8e

Fix modifying evidence epistemics

3afe3b1

Further fixes for evidence list references

cd7b8be

Fix evidence flattening references

35a8d53

Fix agent coordinates tests

69d0332

Simplify getter/setter code and add comments

5ce681f

bgyori force-pushed the slots branch from fade96b to 5ce681f on December 31, 2019 15:48
