RunReachCLI gets stuck with specific nxml files #783

Open
guerrerosimonl opened this issue Feb 22, 2023 · 6 comments

@guerrerosimonl

I'm wondering if this is a bug. I have been running a batch of 1240 nxml files with the standard procedure described at https://github.com/clulab/reach/wiki/Running-Reach. 1224 of the papers run successfully (I re-tested a couple of them), but the remaining 16 keep running for hours and never finish, even when tried individually.

The 16 files are here: https://we.tl/t-1DHGfBjKc9

@kwalcock (Member)

Someone saw this and will try to replicate it.

@kwalcock (Member)

I took the smallest of the files (PMC8589633.nxml), divided it up into even smaller parts, and ran them. They all finished eventually, after something like 24 hours. That particular file did not get stuck in an infinite loop or anything; it's just slow. That's partly because around 7425 mentions are found, but more because of some inefficient code for the serial-json output format. If you don't happen to need that format, removing it from the list is an easy solution. If that's not possible and you have time, you might wait it out. Be sure to allow Java lots of memory, probably more than 10GB, so that it doesn't spend too much time garbage collecting. A third alternative is to wait for some more efficient code, which is what will be discussed below.
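
For anyone following along, here is a minimal sketch of how the extra memory might be supplied when Reach is launched through sbt as the wiki describes. The heap size and the use of SBT_OPTS are assumptions; adjust for your setup.

```
# Give the JVM a ~12 GB heap before launching Reach via sbt (value is illustrative).
export SBT_OPTS="-Xmx12g"
sbt "runMain org.clulab.reach.RunReachCLI"
```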

@kwalcock (Member)

@enoriega, the main problem seems to be in org/clulab/reach/mentions/serialization/json/package.scala, where in the process of producing IDs, things like BioTextBoundMentionOps run TextBoundMentionOps(tb).jsonAST, which calls into processors code to calculate document.equivalenceHash. A hash for an entire Document is a major undertaking. The processors code also calculates the equivalenceHash of the Mention itself, which in turn calls document.equivalenceHash again. Then BioTextBoundMentionOps replaces the id in the json with yet another calculation of the Mention's ID, which again calculates the document's equivalenceHash. That means the same hash value is computed at least 3 times per Mention, and more if that Mention is related to other mentions as triggers or arguments. There are around 7500 Mentions in the Document, so this can take a very long time.
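
To make the shape of the problem concrete, here is a toy sketch of the pattern. This is not the actual reach/processors code; Document, Mention, and the hash below are simplified stand-ins.

```scala
object HashCostSketch {
  // Toy stand-ins, not the real org.clulab classes.
  case class Document(sentences: Seq[String])
  case class Mention(doc: Document, label: String)

  // Stand-in for document.equivalenceHash: it touches the whole Document on
  // every call, which is what makes it expensive for a large paper.
  def documentEquivalenceHash(doc: Document): Int =
    doc.sentences.foldLeft(17)((h, s) => h * 31 + s.hashCode)

  // Serializing one mention hits the document hash three separate times,
  // mirroring the call chain described above.
  def mentionId(m: Mention): String = {
    val docHash     = documentEquivalenceHash(m.doc)                       // call 1: document's own id
    val mentionHash = 31 * m.label.hashCode + documentEquivalenceHash(m.doc) // call 2: mention's equivalenceHash
    s"T:$mentionHash:${documentEquivalenceHash(m.doc)}"                    // call 3: the replaced id
  }
}
```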

Processors can't in general know that the Document hasn't changed between serializations of Mentions, so the recalculation is partially justified. Reach, however, knows that all the mentions are being serialized at the end of processing, with no further changes expected to the Document. I believe that a cache of document equivalenceHashes can be stored there so that values can be reused. Some code may have to be copied over from processors in order to achieve this. (Maybe not, if some related changes to processors go through.) I'll assign this to myself if nobody objects.
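
A minimal sketch of such a cache, continuing the toy definitions from the previous comment and assuming the Document does not change during serialization. It is keyed by object identity, so the expensive hash is computed once per Document instance.

```scala
import java.util.IdentityHashMap
import HashCostSketch.{Document, documentEquivalenceHash}

object DocumentHashCache {
  private val cache = new IdentityHashMap[Document, Integer]

  // Returns the cached hash if this exact Document instance was seen before;
  // otherwise computes it once and stores it for reuse.
  def equivalenceHash(doc: Document): Int = synchronized {
    val cached = cache.get(doc)
    if (cached != null) cached.intValue
    else {
      val h = documentEquivalenceHash(doc) // the expensive walk, done only once
      cache.put(doc, h)
      h
    }
  }
}
```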

FYI @MihaiSurdeanu

@kwalcock (Member)

Some of the larger files, if they don't hang, will eventually crash because they generate serialized strings over 2GB in length, which exceeds what a single JVM String can hold (arrays, and therefore Strings, are indexed by Int). That is being looked into. A work-around is to divide the input files into smaller documents.

@bgyori (Contributor) commented Mar 12, 2023

Thanks @kwalcock for working on this! I just wanted to chime in and say that, based on my prior interactions with @guerrerosimonl, I suspect only the fries output is needed, so @kwalcock's remark that "if you don't happen to need that format, removing it from the list is an easy solution" applies here. The list in question is this one specifically: https://github.com/clulab/reach/blob/master/main/src/main/resources/application.conf#L40.
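
For later readers, the edit would look roughly like the following in application.conf. The key name and the surviving entry below are assumptions, so check the linked line rather than copying this verbatim.

```hocon
# Hypothetical excerpt of main/src/main/resources/application.conf:
# keep only the output formats you need; dropping "serial-json" avoids the slow path.
outputTypes = ["fries"]
```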

@kwalcock (Member)

Thanks for the tip @bgyori. If the fries output suffices, that's the more expedient solution. Nevertheless, I hope to have faster json output before too long.
