Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Opensearch embedding store implementation #213

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

sebastienblanc
Copy link
Contributor

Maybe @sboeckelmann can also review this since you are the author of the opensearch extension (and also added support in langchain4j)

formatting and doc
@sebastienblanc sebastienblanc requested a review from a team as a code owner January 9, 2024 07:33
@geoand geoand changed the title adding opensearch as embedding store Add Opensearch embedding store implementation Jan 9, 2024
Comment on lines +52 to +53
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really have to use these clients?

I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.

Copy link
Collaborator

@geoand geoand Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be the JDK's HttpClient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I need to dive a bit into this, tbh, this class is just a copy/paste from langchain4j, I just replaced the logger for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except that the JDK one can leak connection - so I would rather use the quarkus one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is to connect to an AWS OpenSearch server, the opensearch extension also uses those https://github.com/quarkiverse/quarkus-opensearch/blob/main/opensearch-java-client/runtime/src/main/java/io/quarkiverse/opensearch/client/runtime/OpenSearchTransportHelper.java , this is where it would be nice to have @sboeckelmann feedback

Copy link

@yrodiere yrodiere Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really have to use these clients?

I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.

If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code.

FWIW in Hibernate Search we just use the Elasticsearch low-level client (a thin wrapper on top of Apache HTTP client -- the async one) to connect to either Elasticsearch and OpenSearch, and have a custom Apache HTTP plugin based on the AWS SDK to support signing. That way we handle it all with a single impl: Elasticsearch or OpenSearch, AWS or non-AWS. Though I'll admit it feels a bit messy. Another advantage is that AWS request signing in that case is well-tested (it's harder than you'd think, especially with chunked requests).

I would recommend that you rely on the existing quarkus-elasticsearch-rest-client extension but... then you won't get AWS signing (since the only support we have right now is specific to Hibernate Search).

Eventually we'll need a more consolidated approach for the Elasticsearch low-level client. quarkusio/quarkus#26991 will get us one step closer, as it will force me to move AWS signing to something not specific to Hibernate Search.

We could also consider dropping the Elasticsearch low-level client and shipping a low-level client of our own based on the Quarkus HTTP stack, but I really would only do this if we can agree to use it on all extensions related to Elasticearch/OpenSearch, and if we can afford the API break.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code

I'm more thinking about the impact loading 100s of classes of yet another HTTP library, than the impact of another thread pool which is smaller (but of course ideally would not exist)

@geoand
Copy link
Collaborator

geoand commented Jan 9, 2024

@yrodiere might also be interested in this :)

docs/modules/ROOT/pages/opensearch-store.adoc Outdated Show resolved Hide resolved
docs/modules/ROOT/pages/opensearch-store.adoc Outdated Show resolved Hide resolved
docs/modules/ROOT/pages/opensearch-store.adoc Outdated Show resolved Hide resolved
opensearch/deployment/pom.xml Outdated Show resolved Hide resolved
opensearch/runtime/pom.xml Outdated Show resolved Hide resolved
Comment on lines +52 to +53
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except that the JDK one can leak connection - so I would rather use the quarkus one.

docs/modules/ROOT/pages/opensearch-store.adoc Outdated Show resolved Hide resolved
Co-authored-by: Clement Escoffier <clement.escoffier@gmail.com>
Copy link

@yrodiere yrodiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pinging me @geoand, here are my two cents.

As explained below, consolidating our approach to connecting to OpenSearch/Elasticsearch is certainly going to be an issue.

In the long run (not possible right now), I wonder if we wouldn't want to use Hibernate Search in standalone mode (not yet supported in Quarkus, see quarkusio/quarkus#26182 ) for this store and the Elasticsearch one?

Hibernate Search supports connecting to multiple versions of Elasticsearch/OpenSearch transparently, and it has built-in support for mass indexing, search (full-text, spatial, ... and vector in 7.1). Perhaps more interesting, it supports keeping the Elasticsearch/OpenSearch index in sync with a primary source of truth (relational database) without reindexing everything all the time. What it does not (and will not) support is transforming content into embeddings (vectors), so that part would certainly still be langchain4j's job.

Or perhaps we'd just want to keep this embedding store "pristine", and simply have the Hibernate Search extension provide an integration with Lang4j to transform. I.e. expect people to either use the langchain4j API directly without Hibernate Search, or the Hibernate Search API which would delegate some things to langchain4j? At this point I'm not sure what's best, we'd need to discuss it.

Comment on lines +52 to +53
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
Copy link

@yrodiere yrodiere Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really have to use these clients?

I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.

If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code.

FWIW in Hibernate Search we just use the Elasticsearch low-level client (a thin wrapper on top of Apache HTTP client -- the async one) to connect to either Elasticsearch and OpenSearch, and have a custom Apache HTTP plugin based on the AWS SDK to support signing. That way we handle it all with a single impl: Elasticsearch or OpenSearch, AWS or non-AWS. Though I'll admit it feels a bit messy. Another advantage is that AWS request signing in that case is well-tested (it's harder than you'd think, especially with chunked requests).

I would recommend that you rely on the existing quarkus-elasticsearch-rest-client extension but... then you won't get AWS signing (since the only support we have right now is specific to Hibernate Search).

Eventually we'll need a more consolidated approach for the Elasticsearch low-level client. quarkusio/quarkus#26991 will get us one step closer, as it will force me to move AWS signing to something not specific to Hibernate Search.

We could also consider dropping the Elasticsearch low-level client and shipping a low-level client of our own based on the Quarkus HTTP stack, but I really would only do this if we can agree to use it on all extensions related to Elasticearch/OpenSearch, and if we can afford the API break.

@sboeckelmann
Copy link

don't use the old rest-client, those are deprecated.
The AWS2 SDK unfortunately needs to have the Apache HTTP Client stack being setup. Use the new Async Java Client

import dev.langchain4j.store.embedding.EmbeddingStore;
import io.quarkus.logging.Log;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, it just occurred to me that by bypassing the existing quarkus-elasticsearch-rest-client extension, you're also bypassing the dev service support for OpenSearch... Might be a bit annoying?

Copy link
Collaborator

@geoand geoand Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, very good point!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah IMO OpenSearch dev service is a "must have" (and it's also used by the integration test)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And @sboeckelmann mentioned that the rest client is deprecated.
So, what is the conclusion for this part ? Do we keep it as is for now ?

Copy link

@yrodiere yrodiere Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And @sboeckelmann mentioned that the rest client is deprecated.

Only the one being used in this PR. There's another one with async/reactive support in Apache HTTP Client, and I'd be surprised if the AWS SDK didn't provide a wrapper around that, too.

Anyway, here's my suggestion: I'd recommend relying on quarkus-elasticsearch-rest-client. It is quite low level (you'll have to write JSON), but it's consistent with what we expose in other Elasticsearch/OpenSearch extensions, and using it will get you dev services for (almost) free. Almost, because you'll probably still need to use a build item somewhere to let the Dev Services know you target OpenSearch, not Elasticsearch: see DevservicesElasticsearchBuildItem.

Users would have to add AWS request signing manually for now, though. We can add built-in support to that client, though ideally we'd put that in https://github.com/quarkiverse/quarkus-amazon-services somewhere.

FWIW, you can find a relatively simple implementation of that stuff here (the entry point is ElasticsearchAwsHttpClientConfigurer); it's LGPL, but I'm the author, and I'm hereby licensing this code to anyone interested as ASL2.

I'd personally suggest dropping AWS request signing for now and adding separately, but it's your call.

If nobody works on AWS request signing for quarkus-elasticsearch-rest-client, I'll eventually have to do it when dealing with quarkusio/quarkus#26991, but I can't guarantee this will happen soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yrodiere Thanks for all the pointers ! TBH, I have currently no free cycles to refactor this to be based on quarkus-elasticsearch-rest-client , if anyone else wants to pick this up. In the mean time, this current impl is working so maybe we can open a issue to refactor this later? wdyt @geoand ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that if we do leave it as is, we'll never get around to improving it :).

List<String> ids = embeddings.stream()
.map(ignored -> randomUUID())
.collect(toList());
addAllInternal(ids, embeddings, embedded);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not already handled by the caller, you might want to limit the number of documents you add to your bulk request.

I have no idea of the size and number of documents involved here, but I know that for more than a few hundred large documents on a low-RAM install of OpenSearch (e.g. what we provide with Dev Services, ~1GB heap), the request buffer of OpenSearch/Elasticsearch can easily get overloaded and then OpenSearch/Elasticsearch will start rejecting requests.

In Hibernate Search we have a configurable amount of parallel queues, each sending at most one bulk request at a time with a maximum number of documents in each bulk request (see https://github.com/quarkusio/search.quarkus.io/blob/8accfb39e06372805ce41d108a927e00828b1e25/src/main/resources/application.properties#L52C1-L56). This helps users tune their set up to avoid problems.

That being said, if you know your documents are small and the number of documents is reasonable (i.e. not millions of documents), you might be able to do away with such complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the time it should not be that many documents and their size should be pretty small but we can add maybe a "nice to have" for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants