
graphql-java, Virtual Threads and a new engine mechanism #3331

Open
bbakerman opened this issue Sep 13, 2023 · 9 comments
Labels
keep-open Tells Stale Bot to keep PRs and issues open

Comments

@bbakerman
Member

bbakerman commented Sep 13, 2023

A treatise on Virtual Threads and how they might apply to graphql-java.

graphql-java and Virtual Threads

JDK 21 is bringing in the long-awaited virtual threads (hereafter VTs). They allow for a more natural way to return asynchronous values.

If a virtual thread performs a blocking IO operation, the VT is suspended and moved off the carrier thread and that carrier thread can be used for other purposes.

This of course has implications for graphql-java.

Todays graphql engine

Today the graphql-java engine is powered by CompletableFutures (hereafter CFs) - every value returned for a graphql field is encapsulated in a CF and composed together to make the final Map<String,Object> that is placed into the ExecutionResult data element.

So given a query like this

query q {
    films {
        name
        releaseDate
        director {
            name
            born
        }
        cast {
            name
            relatedFilms {
                name
            }
        }
    }
}

the 10 fields (and associated lists) would be wrapped in CFs, and composed down to a final result.
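To make that composition concrete, here is a minimal sketch (illustrative only, not the actual engine code; the class and method names are invented for this example) of how per-field CFs can be composed into the final data map:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only (not the real engine code): each field value is
// a CF, and the per-field CFs are composed into one CF of the data map.
class CfComposition {

    static CompletableFuture<Map<String, Object>> composeFields(Map<String, CompletableFuture<Object>> fieldCfs) {
        CompletableFuture<Void> all =
                CompletableFuture.allOf(fieldCfs.values().toArray(new CompletableFuture[0]));
        return all.thenApply(v -> {
            Map<String, Object> data = new LinkedHashMap<>();
            fieldCfs.forEach((name, cf) -> data.put(name, cf.join())); // all complete by now
            return data;
        });
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<Object>> fields = new LinkedHashMap<>();
        fields.put("name", CompletableFuture.completedFuture("Alien"));
        fields.put("releaseDate", CompletableFuture.supplyAsync(() -> "1979"));
        System.out.println(composeFields(fields).join()); // {name=Alien, releaseDate=1979}
    }
}
```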

The graphql engine itself today NEVER blocks because it uses CFs for everything. It also never starts a thread, but rather relies on the outside calling framework to control threading.

If a field fetch computation, i.e. done by the DataFetcher, wants to perform a blocking IO operation, it is responsible for setting up the thread code to do that. It can return a CF like this to ensure the engine running the query stays asynchronous.

DataFetcher dataFetcher = dfe -> {
    return CompletableFuture.supplyAsync(() -> someCodeThatMayBlock(dfe), executor);
};

So the graphql engine never starts a thread but rather the callback code is responsible for that.

If a DataFetcher returns a CF, it will be used. Any other value is wrapped in a CF and then passed back into the engine processing.

So far CFs have proven to be quite fast. While there is some memory pressure involved in wrapping sync values with CFs, it's not overwhelming, and the Treiber stack based mechanism (https://en.wikipedia.org/wiki/Treiber_stack) in CF has proven to be fast.

In fact some early testing from others like Quarkus has shown that in a request/response situation, CFs are marginally faster than the new VTs. Not by much, but they are certainly just as fast, if not faster.

All benchmarks are lies, so don't read too much into them other than to say VTs will NOT be significantly faster than CFs.

With virtual threads

The graphql engine could in theory be simplified by embracing virtual threads. Instead of a DataFetcher having to return a CF to be async, it could itself start a VT and just return an object.

This means that the engine would not need to compose together CFs to get values. A list of N fields would just be a materialised list of N values, and they could be returned internally as such.
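A minimal sketch of that value-based approach, assuming JDK 21 and using plain Callables as stand-ins for field fetches (this is not real engine internals, just the shape of the idea):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch (assuming JDK 21): fork each field fetch onto a virtual thread and
// materialise a plain Map of values. No CF composition is needed; the
// blocking get() only suspends the calling (virtual) thread.
class VtFieldFetch {

    static Map<String, Object> fetchFields(Map<String, Callable<Object>> fetchers) {
        try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
            Map<String, Future<Object>> futures = new LinkedHashMap<>();
            fetchers.forEach((name, fetcher) -> futures.put(name, vts.submit(fetcher)));
            Map<String, Object> data = new LinkedHashMap<>();
            futures.forEach((name, future) -> {
                try {
                    data.put(name, future.get()); // suspends only this thread, not a carrier
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            return data;
        } // close() waits for all tasks, then shuts the executor down
    }
}
```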

In theory this could be done virally in all code right to the top, where CompletableFuture<ExecutionResult> executeAsync(ExecutionInput executionInput) becomes ExecutionResult executeAsync(ExecutionInput executionInput)

A DataFetcher in a VT world could be the following


DataFetcher dataFetcher = dfe -> {
    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
        Supplier<Object> value = scope.fork(() -> someCodeThatMayBlock(dfe));
        scope.join()
             .throwIfFailed();  // ... and propagate errors
        return value.get();
    }
};

Note the above uses the proposed Structured Concurrency API, also in JDK 21 but in preview.

This is actually a return to the past. The original Java 6 based engine worked in this manner, where all returned values appeared synchronous.

The "every DataFetcher in a VT" idea

In theory we could start a VT for every field fetch; there can be millions of VTs in theory. However, VTs are cheap but not free, so this does not strike me as a sensible idea.

The non trivial DataFetcher idea

One idea is that the graphql engine could start a virtual thread per non-trivial data fetcher.

Today graphql-java has TrivialDataFetcher, which PropertyDataFetcher uses when it reads POJO values out of an in-memory object. A VT per trivial field is overkill and not needed.

But a VT per non-TrivialDataFetcher might be used so that someone can write blocking code, and the engine running it would stay async and continue to process other fields as it operates.

There are two ways to do this.

The engine itself could start VTs before executing a non-trivial data fetcher.

Or it could provide helper functions that make a DataFetcher start in a VT, and then it would be up to the consumer to choose to use it.

  DataFetcher dataFetcher = VirtualThreadDataFetcher.create(dfe -> someCodeThatMayBlock(dfe));
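One possible shape for such a helper, assuming JDK 21. VirtualThreadDataFetcher is hypothetical (it does not exist in graphql-java today), and the DataFetcher interface below is a simplified stand-in for the real graphql-java one (which also declares throws Exception):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Hypothetical helper sketch: run the possibly-blocking fetch on a virtual
// thread and hand the existing CF-based engine a CompletableFuture, so the
// engine itself stays non-blocking.
class VirtualThreadDataFetcher {

    // Simplified stand-in for graphql-java's DataFetcher interface
    interface DataFetcher { Object get(Object env); }

    // One shared virtual-thread-per-task executor (JDK 21)
    private static final Executor VT_EXECUTOR = Executors.newVirtualThreadPerTaskExecutor();

    static DataFetcher create(DataFetcher blockingFetcher) {
        // The returned fetcher hands the engine a CF; the blocking work
        // happens on a freshly started virtual thread.
        return env -> CompletableFuture.supplyAsync(() -> blockingFetcher.get(env), VT_EXECUTOR);
    }
}
```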

At this stage I really like the fact that graphql-java never starts a thread, virtual or not, and leaves it to consumers. That way the threading decisions are left to consumers and frameworks
such as Spring or DGS.

VTs are much much cheaper than old school threads but they are not free or without consequences.

The migration

graphql-java is based on Java 11, having only just updated. We are deliberately slow in base JDK upgrades because all sorts of people use graphql-java, and we won't require them to use, say, JDK 21 for many years.

We want graphql-java to be widely useable on many versions of the JDK.

So we have two cohorts of users: those who might be able to run JDK 21 and VTs, and those who cannot.

A CF based engine on JDK21

The current engine can certainly run async in a JDK 21 VT environment. If a DataFetcher did the following, it would fit straight into the current engine:

DataFetcher dataFetcher = dfe -> {
    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
        Supplier<Object> value = scope.fork(() -> someCodeThatMayBlock(dfe));
        scope.join()
             .throwIfFailed();  // ... and propagate errors
        return value.get();
    }
};

The above would be wrapped in a CF (an already completed one) by the engine, and it would work as it does today.

An object based engine on JDK 21

Another option is to rewrite a new engine for those who can exclusively run on JDK 21 and VTs. It would be materialised object based and not use CFs.

DataFetchers would not be affected, because they already return Object; we smart wrap them in CFs today and would not need to in this future engine.

Flow on effects to the engine

This has flow-on effects, however, to other parts of the callback code, namely Instrumentation.

Today it has methods like

default InstrumentationContext<Object> beginFieldFetch(InstrumentationFieldFetchParameters parameters, InstrumentationState state) {


public interface InstrumentationContext<T> {

    /**
     * This is invoked when the instrumentation step is initially dispatched
     *
     * @param result the result of the step as a completable future
     */
    void onDispatched(CompletableFuture<T> result);

    /**
     * This is invoked when the instrumentation step is fully completed
     *
     * @param result the result of the step (which may be null)
     * @param t      this exception will be non null if an exception was thrown during the step
     */
    void onCompleted(T result, Throwable t);

}

The CFs here are strongly present in the Instrumentation API. We could leave it as is and allocate a dummy CF, but that's pointless really if we are writing a new engine.

What this means is that a new engine requires a new peer Instrumentation. We also know this to be true from other work we have done where we investigated a new algorithm of fetching and hence the Instrumentation callbacks become specific to the engine algorithm.
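As a purely hypothetical sketch of what such a peer Instrumentation callback might look like, the CF could simply disappear from the dispatch method, since a value-based engine has no future to hand over (the name below is invented, not a graphql-java API):

```java
// Hypothetical sketch only: a peer InstrumentationContext for a value-based
// engine. onDispatched no longer carries a CompletableFuture because there
// is no future to pass; completion delivers the plain value or the error.
interface ValueInstrumentationContext<T> {

    /**
     * This is invoked when the instrumentation step is initially dispatched
     */
    void onDispatched();

    /**
     * This is invoked when the instrumentation step is fully completed
     *
     * @param result the result of the step (which may be null)
     * @param t      this exception will be non null if an exception was thrown during the step
     */
    void onCompleted(T result, Throwable t);
}
```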

Also, today the engine used is called a graphql.execution.ExecutionStrategy, and it's specified when you build the GraphQL instance. These have CFs in their signatures like the following (and more):

public abstract CompletableFuture<ExecutionResult> execute(ExecutionContext executionContext, ExecutionStrategyParameters parameters) throws NonNullableFieldWasNullException;

protected CompletableFuture<ExecutionResult> completeValueForObject(ExecutionContext executionContext, ExecutionStrategyParameters parameters, GraphQLObjectType resolvedObjectType, Object result) {

The use of a specific ExecutionStrategy and Instrumentation leaks up into higher layers via these being used when you build a GraphQL instance:

        GraphQL graphQL = GraphQL.newGraphQL(schema)
                .queryExecutionStrategy(new AsyncExecutionStrategy())
                .instrumentation(new TracingInstrumentation())
                .build();

We need a marker-like interface so that you can build a GraphQL instance but specify the key engine bits in a manner that is not so tied to their shape, e.g. rather this:

public Builder queryExecutionStrategy(ExecutionStrategyMarkerInterface executionStrategy) {


public Builder instrumentation(InstrumentationMarkerInterface instrumentation) {

or better yet invent something like a GraphqlEngine that holds these runtime concerns in it and this is then set into the GraphQL instance. Something like

        GraphQLEngine engine = GraphQLEngine.newEngine()
                .queryStrategy(new AsyncExecutionStrategy())
                .instrumentation(new TracingInstrumentation())
                .build();

        GraphQL graphQL = GraphQL.newGraphQL(schema)
                .engine(engine)
                .build();

This would then allow say

        GraphQLEngine engine = GraphQLEngine.newEngine()
                .queryEngine(new Jdk21VirtualThreadEngine())
                .instrumentation(new Jdk21TracingInstrumentation())
                .build();

or more likely, since the engine parts actually need to come together as one, there would be a specific class to construct them:

        GraphQLEngine engine = Jdk21VirtualThreadEngine.newEngine().build();
        GraphQL graphQL = GraphQL.newGraphQL(schema)
                .engine(engine)
                .build();

Wrinkles in current code

graphql.execution.preparsed.PreparsedDocumentProvider uses CFs; however, I think in a future VT world it's OK to
expect it to return a completed CF wrapper of the PreparsedDocumentEntry. It's not worth the effort to change this.

DataFetcherExceptionHandler has a CF in its signature via

public interface DataFetcherExceptionHandler {
    CompletableFuture<DataFetcherExceptionHandlerResult> handleException(DataFetcherExceptionHandlerParameters handlerParameters);
}

however it's really tied to the engine instance and hence it could be engine specific. The defaulting that happens now via graphql.GraphQL.Builder#defaultDataFetcherExceptionHandler would need to be removed.

The defaulting of graphql.execution.instrumentation.dataloader.DataLoaderDispatcherInstrumentation needs to be taken out of the GraphQL class and moved into, say, the engine factory.

In Summary

  • graphql-java will happily work in a VT world as it is today
  • graphql-java has to be able to operate in a pre VT JDK world and a post JDK VT world
  • VT support would make for cleaner engine code, in that values are returned, not CFs
  • There may be some performance enhancements for simple object values
  • There are not likely to be performance enhancements for truly async object values
  • A new Engine + Instrumentation mechanism is needed

None of the above is a promise to do anything - but rather an enumeration of ideas.

@rstoyanchev

The fact that DataFetchers can be asynchronous by returning CF means that GraphQL Java can start fetching multiple fields in parallel. If a DataFetcher however uses a VT and returns a plain value (i.e. no async wrapper), it would need to block for the value, and that in turn means GraphQL Java will be blocked on each individual field.

Today CFs enable GraphQL Java to act as an orchestration layer for concurrent fetching. In order to do the same with virtual threads and structured concurrency, GraphQL Java would need to spawn those VTs, and that could be another ExecutionStrategy implementation.

Note that in Spring for GraphQL we enable controllers to return Callable, which we then put on an Executor thread and return a CF for that. So you can configure an Executor for VTs today and that would work without any changes. Controllers can also return Flux or Mono which we also adapt to CF. It would be good if all those options continue to be first class citizens as I expect many applications to continue to use that.
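The Callable-to-CF adaptation described above can be sketched like this (illustrative only, not Spring for GraphQL's actual code; the class and method names are invented):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

// Sketch: a Callable is submitted to a configurable Executor and exposed as
// a CF, so configuring Executors.newVirtualThreadPerTaskExecutor() makes the
// fetch run on a virtual thread with no change to the CF-based engine.
class CallableAdapter {

    static <T> CompletableFuture<T> adapt(Callable<T> callable, Executor executor) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return callable.call();
            } catch (Exception e) {
                throw new CompletionException(e); // surfaces via the CF's exceptional path
            }
        }, executor);
    }
}
```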

@paulbakker

Thanks for the great summary @bbakerman. I haven't made up my mind yet what would be the best strategy, but I do have a few thoughts.

  1. There is a lot of existing user code that returns CF from datafetchers because that (until now) was the way to get parallel behavior. It would be very nice if that code wouldn't have to change. I don't think it's a blocker though - if it's clearly better to come up with a cleaner engine, it's an option.

  2. In a Servlet based setup it's common to store request related info (SSO etc) on ThreadLocal. This is something users currently have to deal with when using CF, by passing in the correct Executor as an argument when creating the CF. It's also a common source of bugs when users forget to do this.
    If we would keep CF as the main construct, could VT be leveraged by passing in Executors.newVirtualThreadPerTaskExecutor()? That way graphql-java wouldn't need to change its engine, but we would still get the benefits, I think? The executor passed in could be one provided by DGS/Spring where context propagation is also taken care of.

  3. Manually having to wrap blocking work into a CF does feel like boilerplate, especially once we also have structured concurrency. For this reason I do like your idea to have something like VirtualThreadDataFetcher.create, which we could very easily leverage in DGS/Spring. From a framework perspective it could be an extra annotation to mark a datafetcher as parallel, or something like that. Since this would hide CF from users anyway, there isn't that much reason to rewrite the engine itself? For this to work we would need a mechanism to provide a default Executor in graphql-java that is configured for context propagation. That executor would be based on Executors.newVirtualThreadPerTaskExecutor(), but do context propagation to deal with ThreadLocal, and could be provided by DGS/Spring.
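The context-propagating executor idea in points 2 and 3 could be sketched as follows. The names here are illustrative, not a real DGS/Spring API; a real implementation would propagate whatever request-scoped ThreadLocals the framework manages:

```java
import java.util.concurrent.Executor;

// Sketch: a decorator Executor that captures a ThreadLocal-bound request
// context on the submitting thread and restores it on the worker thread.
// The delegate could be Executors.newVirtualThreadPerTaskExecutor() on JDK 21.
class ContextPropagatingExecutor implements Executor {

    // Stand-in for framework-managed request context (e.g. SSO info)
    static final ThreadLocal<String> REQUEST_CONTEXT = new ThreadLocal<>();

    private final Executor delegate;

    ContextPropagatingExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable task) {
        String captured = REQUEST_CONTEXT.get(); // captured on the submitting thread
        delegate.execute(() -> {
            String previous = REQUEST_CONTEXT.get();
            REQUEST_CONTEXT.set(captured);        // restored on the worker thread
            try {
                task.run();
            } finally {
                REQUEST_CONTEXT.set(previous);    // avoid leaking context between tasks
            }
        });
    }
}
```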

@rrva

rrva commented Sep 28, 2023

I am leaning towards having an execution strategy that supports pure objects and uses structured concurrency natively, without the need for wrapping everything in CompletableFutures. This is especially an overhead for implementations where graphql-java is mostly fetching data that is already in process memory, so no async fetch is needed. We have a great deal of fields which can be resolved this way. For large responses in high-traffic scenarios, this wrapping contributes to memory pressure and overall overhead. CFs are by far the most commonly created object. Also, the programming model in data fetchers is a bit easier with virtual threads and pure objects. We have a large number of async fields as well, but the majority are not.

I do however understand the need of a backwards-compatible execution strategy or at least a good migration path for CompletableFuture-based implementations.

@rrva

rrva commented Sep 28, 2023

VTs ought to make performance analysis a bit easier, both for user code and especially when it comes to parts inside the graphql engine. Right now, a flamegraph can sometimes be quite hard to spot things in with all the CF resolving going on.

@paulbakker

I've added (experimental) support to the DGS framework to automatically wrap non-CompletableFuture data fetchers in CompletableFuture, running on Virtual Threads. We'll run canaries etc. with this which should provide some good input for this discussion.
Netflix/dgs-framework#1662

@bbakerman
Member Author

@paulbakker - this auto wrapping of "non CF" data fetchers - this is only for "business logic" @dgs methods, right? E.g. some code that someone has written.

It's not going to happen for simple POJO field fetchers that PropertyDataFetcher might use, say?

@paulbakker

> @paulbakker - this auto wrapping of "non CF" data fetchers - this is only for "business logic" @dgs methods, right? E.g. some code that someone has written.

Correct, this is only for user defined data fetchers. In many cases there will be some blocking work there, unless it's only just calling a dataloader.

Dataloaders are another place I want to do something similar, but haven't quite figured out the best approach yet. CompletableFuture is part of the Dataloader API, so should that be abstracted from the user with a "nicer" API, or is that just going to be more confusing?
One very common user error is to not pass in an Executor as the second argument when creating a CF in a dataloader, which makes it run on the fork/join pool, and that's something that would be really good to avoid, but I'm not sure how to hook into that without providing some sort of API on top of dataloaders.

@bbakerman
Member Author

On DataLoaders - the real work is done in the backing BatchLoader function

https://github.com/graphql-java/java-dataloader/blob/master/src/main/java/org/dataloader/BatchLoader.java

public interface BatchLoader<K, V> {
    CompletionStage<List<V>> load(List<K> keys);
}

So if you want them to be virtual thread based, you need to auto wrap them when the BatchLoader is created. With the current code design it's too late after the DataLoader instance is created. But since DGS is responsible for wiring everything together, I suspect you will have an interception point.
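Such an interception point could look roughly like this. The BatchLoader shape below mirrors the java-dataloader interface, but wrapBlocking is a hypothetical helper and the executor choice is illustrative (a virtual-thread-per-task executor on JDK 21, for instance):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Executor;
import java.util.function.Function;

// Sketch: lift a blocking batch function into the CompletionStage-returning
// BatchLoader shape at creation time, running it on the supplied Executor.
class BatchLoaderWrapping {

    // Mirrors org.dataloader.BatchLoader, reproduced here for a self-contained sketch
    interface BatchLoader<K, V> {
        CompletionStage<List<V>> load(List<K> keys);
    }

    static <K, V> BatchLoader<K, V> wrapBlocking(Function<List<K>, List<V>> blockingBatchFn,
                                                 Executor executor) {
        return keys -> CompletableFuture.supplyAsync(() -> blockingBatchFn.apply(keys), executor);
    }
}
```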


Hello, this issue has been inactive for 60 days, so we're marking it as stale. If you would like to continue this discussion, please comment within the next 30 days or we'll close the issue.
