RunReachCLI gets stuck with specific nxml files #783

Open
guerrerosimonl opened this issue Feb 22, 2023 · 6 comments

@guerrerosimonl

I'm wondering if this is a bug. I have been running a batch of 1240 nxml files with the standard procedure described at https://github.com/clulab/reach/wiki/Running-Reach. 1224 of the papers run successfully (I re-tested a couple of them), but the remaining 16 keep running for hours and never finish, even when tried individually.

The 16 files are here: https://we.tl/t-1DHGfBjKc9

@kwalcock (Member)

Someone saw this and will try to replicate it.

@kwalcock (Member)

I took the smallest of the files (PMC8589633.nxml), divided it up into even smaller parts, and ran them. They all finished eventually, after something like 24 hours. That particular file did not get stuck in an infinite loop or anything; it's just slow. That's partly because around 7425 mentions are found, but more because of some inefficient code for the serial-json output format. If you don't happen to need that format, removing it from the list is an easy solution. If that's not possible and you have time, you might wait it out. Be sure to allow Java lots of memory, probably more than 10GB, so that it doesn't spend too much time garbage collecting. A third alternative is to wait for some more efficient code, which is what will be discussed below.
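
For anyone following along, here is a minimal sketch of how the extra memory might be supplied when Reach is launched through sbt as the wiki describes. The heap size and the use of SBT_OPTS are assumptions; adjust for your setup.

```
# Give the JVM a ~12 GB heap before launching Reach via sbt (value is illustrative).
export SBT_OPTS="-Xmx12g"
sbt "runMain org.clulab.reach.RunReachCLI"
```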

@kwalcock (Member)

@enoriega, the main problem seems to be in org/clulab/reach/mentions/serialization/json/package.scala, where in the process of producing IDs, things like BioTextBoundMentionOps run TextBoundMentionOps(tb).jsonAST, which calls into processors code to calculate document.equivalenceHash. A hash for an entire Document is a major undertaking. The processors code also calculates the equivalenceHash of the Mention itself, which in turn calls document.equivalenceHash again. Then BioTextBoundMentionOps replaces the id in the json with yet another calculation of the Mention's ID, which again calculates the document's equivalenceHash. That means the same hash value is computed at least 3 times per Mention, and more if that Mention is related to other mentions as triggers or arguments. There are around 7500 Mentions in the Document, so this can take a very long time.
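
To make the shape of the problem concrete, here is a toy sketch of the pattern. This is not the actual reach/processors code; Document, Mention, and the hash below are simplified stand-ins.

```scala
object HashCostSketch {
  // Toy stand-ins, not the real org.clulab classes.
  case class Document(sentences: Seq[String])
  case class Mention(doc: Document, label: String)

  // Stand-in for document.equivalenceHash: it touches the whole Document on
  // every call, which is what makes it expensive for a large paper.
  def documentEquivalenceHash(doc: Document): Int =
    doc.sentences.foldLeft(17)((h, s) => h * 31 + s.hashCode)

  // Serializing one mention hits the document hash three separate times,
  // mirroring the call chain described above.
  def mentionId(m: Mention): String = {
    val docHash     = documentEquivalenceHash(m.doc)                       // call 1: document's own id
    val mentionHash = 31 * m.label.hashCode + documentEquivalenceHash(m.doc) // call 2: mention's equivalenceHash
    s"T:$mentionHash:${documentEquivalenceHash(m.doc)}"                    // call 3: the replaced id
  }
}
```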

Processors can't in general know that the Document hasn't changed between serializations of Mentions, so the recalculation is partially justified. Reach, however, knows that all the mentions are being serialized at the end of processing, with no further changes expected to the Document. I believe that a cache of document equivalenceHashes can be stored there so that values can be reused. Some code may have to be copied over from processors in order to achieve this. (Maybe not, if some related changes to processors go through.) I'll assign this to myself if nobody objects.
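
A minimal sketch of such a cache, continuing the toy definitions from the previous comment and assuming the Document does not change during serialization. It is keyed by object identity, so the expensive hash is computed once per Document instance.

```scala
import java.util.IdentityHashMap
import HashCostSketch.{Document, documentEquivalenceHash}

object DocumentHashCache {
  private val cache = new IdentityHashMap[Document, Integer]

  // Returns the cached hash if this exact Document instance was seen before;
  // otherwise computes it once and stores it for reuse.
  def equivalenceHash(doc: Document): Int = synchronized {
    val cached = cache.get(doc)
    if (cached != null) cached.intValue
    else {
      val h = documentEquivalenceHash(doc) // the expensive walk, done only once
      cache.put(doc, h)
      h
    }
  }
}
```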

FYI @MihaiSurdeanu

@kwalcock (Member)

Some of the larger files, if they don't hang, will eventually crash because they generate serialized strings over 2GB in length, which exceeds what a single JVM String can hold (arrays, and therefore Strings, are indexed by Int). That is being looked into. A work-around is to divide the input files into smaller documents.

@bgyori (Contributor) commented Mar 12, 2023

Thanks @kwalcock for working on this! I just wanted to chime in and say that, based on my prior interactions with @guerrerosimonl, I suspect only the fries output is needed, so @kwalcock's remark that "if you don't happen to need that format, removing it from the list is an easy solution" applies here. The list in question is this one specifically: https://github.com/clulab/reach/blob/master/main/src/main/resources/application.conf#L40.
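
For later readers, the edit would look roughly like the following in application.conf. The key name and the surviving entry below are assumptions, so check the linked line rather than copying this verbatim.

```hocon
# Hypothetical excerpt of main/src/main/resources/application.conf:
# keep only the output formats you need; dropping "serial-json" avoids the slow path.
outputTypes = ["fries"]
```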

@kwalcock (Member)

Thanks for the tip @bgyori. If the fries output suffices, that's the more expedient solution. Nevertheless, I hope to have faster json output before too long.
