Performance of backend API requests using resolve[] #2993

Open · h4l opened this issue May 2, 2023 · 3 comments

h4l commented May 2, 2023

I don't normally work with ArchivesSpace, so I'm new to this, but I've been helping investigate a performance issue with Cambridge University's ArchivesSpace deployment. It's been intermittently becoming unresponsive, requiring a restart to recover.

I tracked it down to the resource edit page of a particular item in the staff frontend. Analysing a JVM heap dump taken at the point it became unresponsive, I found that the /repositories/:repo_id/resources/:id endpoint was returning a ~130MB JSON response, which resulted in ~1.3GB of reserved memory when parsed as a JRuby hash.

The problem is that the resource in question has ~1000 events that reference it, so when the resolve[] param is used to inline referenced URLs, the root resource is duplicated ~1000 times. (There are also 2 other objects that each get resolved ~1000 times.) So it's a bit of a pathological case for resolve[]: presumably it wasn't intended to resolve so many items, nor so many duplicates of the same URL!
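
To illustrate the shape of the problem, here's a much-simplified sketch of what the resolved response looks like (URIs and fields are invented for illustration, not taken from our data). Each resolved event contains a ref back to the resource, and each of those refs gets its own full inline copy of the resource record:

```json
{
  "uri": "/repositories/2/resources/1",
  "linked_events": [
    {
      "ref": "/repositories/2/events/1",
      "_resolved": {
        "uri": "/repositories/2/events/1",
        "linked_records": [
          {
            "ref": "/repositories/2/resources/1",
            "_resolved": { "uri": "/repositories/2/resources/1", "notes": "…full copy of the resource record…" }
          }
        ]
      }
    }
  ]
}
```

Multiply that by ~1000 events and the response ends up carrying ~1000 full copies of the resource.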

In looking at the code, I noticed there's already a workaround in 3.3.1 which mitigates this by not resolving events in resources with > 100 events (#2635). We've not yet upgraded to 3.3.1, so as a first step we'll do that to benefit from this workaround.

What I'm wondering is whether you'd be interested in going further to improve the behaviour when resolving items with large numbers of relationships. While looking at the implementation, I noticed that it doesn't currently de-duplicate URL resolution, so in our case the DB is queried ~3000 times with what are only 3 distinct queries. The memory used by duplicating the objects in the JSON tree is the main problem, though.
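
To make the de-duplication idea concrete, here's a minimal sketch (not the actual ArchivesSpace resolver code; DedupingResolver and db_lookup are invented names) of memoising resolution per URI:

```ruby
# Hypothetical sketch only -- not the actual ArchivesSpace resolver.
# Memoise resolution per URI so identical refs cost one DB query in total.
class DedupingResolver
  def initialize(&fetch)
    @fetch = fetch  # block that loads the record for a URI (one DB query)
    @cache = {}     # uri => resolved record, shared by every ref to it
  end

  def resolve(uri)
    @cache[uri] ||= @fetch.call(uri)
  end
end

# ~3000 refs spread over 3 distinct URIs now cost 3 queries, not ~3000.
resolver = DedupingResolver.new { |uri| db_lookup(uri) }  # db_lookup is hypothetical
```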

We could contribute a change to improve the situation, and I think we could avoid the need for the 100-event limit from #2635. But it seems like this would require some level of change to the REST API, so it's not something we can just fix and send a PR for.

One option I considered would be to add a parameter to the API to have it resolve only the first instance of each URL it finds, have the frontend's fetch_resolved() use that param, and then populate the "_resolved" entries for each matching URL after parsing the JSON, so that duplicate objects would be references to the same in-memory object rather than copies (sketched below).
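
A minimal sketch of the client-side half of that option (assuming the hypothetical resolve-first-instance-only param; each_hash and relink_resolved! are invented names): after parsing the JSON, walk the tree twice, first collecting the single inline copy of each record, then pointing every other matching ref at that same shared object:

```ruby
# Hypothetical sketch, assuming the server resolved each URI only once.
def each_hash(node, &block)
  case node
  when Hash
    block.call(node)
    node.each_value { |v| each_hash(v, &block) }
  when Array
    node.each { |v| each_hash(v, &block) }
  end
end

def relink_resolved!(tree)
  seen = {}  # uri => the one resolved record the server inlined
  each_hash(tree) { |n| seen[n['ref']] = n['_resolved'] if n['ref'] && n['_resolved'] }
  each_hash(tree) do |n|
    n['_resolved'] = seen[n['ref']] if n['ref'] && !n['_resolved'] && seen.key?(n['ref'])
  end
  tree
end
```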

Looking at the implementation, it seems like there would be some complications with that approach: it could prevent the resolver from finding references when walking paths that would have contained a copy to walk into and resolve further, but no longer do because the copy was pruned.

An alternative variation would be to include at the top level one JSON object that maps URLs to resolved JSON objects, and not insert _resolved properties into the tree at all. The API client would then be expected to look up "ref" URLs in that top-level object.
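
Sketching that response shape (the top-level "resolved"/"record" key names are invented here, purely for illustration):

```json
{
  "resolved": {
    "/repositories/2/events/1": { "uri": "/repositories/2/events/1", "event_type": "…" },
    "/agents/people/7": { "uri": "/agents/people/7", "title": "…" }
  },
  "record": {
    "uri": "/repositories/2/resources/1",
    "linked_events": [ { "ref": "/repositories/2/events/1" } ]
  }
}
```

Each URL would then appear resolved exactly once, however many refs point at it.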

h4l (Author) commented Jul 25, 2023

We're keen to resolve this and are planning to implement a fix for our deployment; assuming we're successful, we'll open a PR with the fix. My plan is to take the approach of returning resolved objects in a top-level object rather than inlining them into the tree, making this behaviour opt-in to maintain backwards compatibility while updating internal use of the API to the new method.

Does anyone have any thoughts/input on this?

jambun (Contributor) commented Jul 25, 2023

Hi Hal,

Just writing to offer our support for this venture.

My hunch is that it would be less problematic to leave the resolved records inline, but to only include the first copy, and then leave a clue for the client about where to find it in subsequent refs. Something like:

   "ref": "/blah/blah",
   "_resolved_at": ...

Since the uri in the ref is unique, the clue isn't strictly necessary - the client has already seen the resolved record and could map it to the uri as it goes. Maybe it just becomes a reassuring boolean "_is_resolved": true.
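
For concreteness, a hypothetical sketch of the two cases side by side (first occurrence resolved inline, a later ref to the same uri carrying only the marker):

```json
[
  { "ref": "/agents/people/7", "_resolved": { "uri": "/agents/people/7", "title": "…" } },
  { "ref": "/agents/people/7", "_is_resolved": true }
]
```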

As you've guessed, just doing this blindly for all resolves will break a bunch of stuff. It would be great to work it through, but that is obviously a much bigger project. Maybe your work can be a pathfinder for that effort.

Anyway, good luck!

Feel free to get in touch if you'd like a chat.

Cheers,
James

h4l (Author) commented Jul 26, 2023

Hey @jambun, thanks very much for your input, that's really useful. Glad you're supportive of this. The approach you describe sounds good to me, I'll start digging into it and will update with progress/questions.
