Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OOM when working with large files because of huge DocumentSymbols response #815

Open
sebthom opened this issue Feb 29, 2024 · 10 comments
Open

Comments

@sebthom
Copy link
Contributor

sebthom commented Feb 29, 2024

While working on #703 LSSymbolsContentProvider populating the outline view freezes UI for minutes for large files I am now facing the issue that when loading large JSON on log files (in my case 23MB with 11k lines) my Eclipse IDE with 3GB heap exits with an OOM while rendering the outline.

I traced the issue down to the fact that the JSON language server responds to the DocumentSymbols request with a whooping 160MB large JSON response which is then processed by the lsp4j's ConcurrentMessageProcessor + StreamMessageProducer. Unfortunately the StreamMessageProducer isn't really stream processing but tries to deserialize the JSON string DOM-style like into a huge object with countless of GSON LinkedTreeMap instances which eventually trigger the OOM.

A first optimization I see is to avoid this string allocation and instead pass an InputStreamReader instance at:

String content = new String(buffer, headers.charset);
try {
Message message = jsonHandler.parseMessage(content);

In my case that would avoid the temporary blocking 160MB of heap and make it available for the JSON processing.

However ultimately this will not be enough. The main issue here is that the DocumentSymbols request does not support batching microsoft/language-server-protocol#1533

So I was wondering if it wouldn't make sense in LSP4J to not process incoming messages of a certain size and handle them gracefully with a RuntimeException that is properly logged.


Btw. I did not encounter these OOMs half a year ago when I opened the eclipse/lsp4e#703. At that time I was using Eclipse 2023-06 and the respective lsp4j version. Can it be that lsp4j allocates more memory in the latest release?

@jonahgraham
Copy link
Contributor

(I am quoting this part of your OP message because it was an edit people only following email notifications won't see it)

Btw. I did not encounter these OOMs half a year ago when I opened the eclipse/lsp4e#703. At that time I was using Eclipse 2023-06 and the respective lsp4j version. Can it be that lsp4j allocates more memory in the latest release?

The only LSP4J change I can think of that would have a small (tiny?) effect is that messages have a new field MessageJsonHandler - see #772. The other change may be if the wired up gson version changed?

As to your main question, I don't have much thoughts on it. Having a configurable max message size seems ok, but not sure where that should be configured. Very large JSON messages can translate into enormous Objects, or relatively small ones (for example if most of the JSON message is simply long strings such as file contents). I don't think LSP4J's jsonrpc should have a default max size.

I wonder if there is more efficient way (space wise) to process the JSON input that would avoid all the intermediary objects. i.e equivalent to sax vs dom xml parsing. Perhaps jackson? It would expose how much of the LSP4J API leaks out GSON API I guess.

@sebthom
Copy link
Contributor Author

sebthom commented Feb 29, 2024

When thinking about replacing GSON I recommend to check out https://github.com/fabienrenaud/java-json-benchmark esp. the deserialization performance graphs.

@jonahgraham
Copy link
Contributor

Useful link. Thanks for sharing.

with payloads of 1, 10, 100 and 1000 KB size

We need to adjust benchmark payload to 100× that and look at memory usage too.

@pisv
Copy link
Contributor

pisv commented Mar 1, 2024

However ultimately this will not be enough. The main issue here is that the DocumentSymbols request does not support batching microsoft/language-server-protocol#1533

I completely agree. When the number of nodes in the input message is huge, there is only so much LSP4J can do, GSON or not -- the resulting deserialized object graph would still be huge.

However, LSP supports partial results streaming for such use cases, which the DocumentSymbols request also supports:

partial result: DocumentSymbol[] | SymbolInformation[].

So, it does support batching. It is just that both the server and the client need to support this option.

@sebthom
Copy link
Contributor Author

sebthom commented Mar 1, 2024

@pisv does LSP4J support partial results? I am wondering how I can convert ls.getTextDocumentService().documentSymbol() used here accordingly.

final var params = new DocumentSymbolParams(LSPEclipseUtils.toTextDocumentIdentifier(documentURI));
symbols = outlineViewerInput.wrapper.execute(ls -> ls.getTextDocumentService().documentSymbol(params));

@pisv
Copy link
Contributor

pisv commented Mar 1, 2024

Yes, LSP4J supports partial results, i.e. it provides mapping of all the necessary LSP structures to Java. For a usage example on the client-side, you can have a look at

https://github.com/lxtk-org/lxtk/blob/5f7fb05350299903e0a4f5ae26c56decdacb589b/org.lxtk.lx4e.ui/src/org/lxtk/lx4e/ui/symbols/WorkspaceSymbolSelectionDialog.java#L145-L177

See also AbstractPartialResultProgress and related classes in LXTK.

@sebthom
Copy link
Contributor Author

sebthom commented Mar 2, 2024

Yes, LSP4J supports partial results, i.e. it provides mapping of all the necessary LSP structures to Java. For a usage example on the client-side, you can have a look at

lxtk-org/lxtk@5f7fb05/org.lxtk.lx4e.ui/src/org/lxtk/lx4e/ui/symbols/WorkspaceSymbolSelectionDialog.java#L145-L177

See also AbstractPartialResultProgress and related classes in LXTK.

I had a look at the project. To be honest I don't really grasp it, there are so many layers of abstractions/indirection that I got lost in the code. It looks like you needed to implement a whole infrastructure on top of lsp4j just to get partial results working.

So far I understand I need to set partialResultToken and workDoneToken on the request params and then can send the request with the same tokens multiple times.
The server signals the last result by returning the workDoneToken.

A very naive implementation of my current understanding is:

var params = new DocumentSymbolParams(new TextDocumentIdentifier(documentURI));
params.setPartialResultToken(UUID.randomUUID().toString());
params.setWorkDoneToken(UUID.randomUUID().toString());
do {
  List<Either<SymbolInformation,DocumentSymbol>> result = languageServer.getTextDocumentService().documentSymbol(params).get());
} while ( /* when to exit ?? */ );

What I don't understand is, how to know when all results are returned. documentSymbol() request only returns the list of symbols but no progress or workdone info.

@pisv
Copy link
Contributor

pisv commented Mar 3, 2024

So far I understand I need to set partialResultToken and workDoneToken on the request params and then can send the request with the same tokens multiple times. The server signals the last result by returning the workDoneToken.

A very naive implementation of my current understanding is:

var params = new DocumentSymbolParams(new TextDocumentIdentifier(documentURI));
params.setPartialResultToken(UUID.randomUUID().toString());
params.setWorkDoneToken(UUID.randomUUID().toString());
do {
  List<Either<SymbolInformation,DocumentSymbol>> result = languageServer.getTextDocumentService().documentSymbol(params).get());
} while ( /* when to exit ?? */ );

No, it does not work so.

First, work done progress is independent of partial result progress, so you do not actually need to set a work done token if all you need is partial results.

Second, you only need to send the request once; partial results, if the server supports them for this request, will be sent by the server via one or more $progress notifications (with the same partial result token) before the response is sent (in which case, a successful response will be effectively empty, because all results will be sent by the progress notifications).

So, the key here is that partial results need to be actually supported by the server for a given request.

Then, you need to ensure that $progress notifications with partial results are properly handled by the client, and the results are accepted/accumulated somehow. It can get a bit complicated, but basically, you need to implement LSP4J's LanguageClient.notifyProgress.

I'd suggest reading the following sections of the specification for more information:

(The latter describes client-initiated work done progress, but the same principles apply equally to partial result progress.)

@ghentschke
Copy link

@sebthom are you working currently on a PR for this issue here or in LSP4E?

@sebthom
Copy link
Contributor Author

sebthom commented Mar 11, 2024

@ghentschke no I am not. go for it :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants