Add Opensearch embedding store implementation #213

sebastienblanc · 2024-01-09T07:33:12Z

Maybe @sboeckelmann can also review this, since you are the author of the OpenSearch extension (and also added support in langchain4j).

formatting and doc

geoand · 2024-01-09T07:37:17Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+import software.amazon.awssdk.http.SdkHttpClient;
+import software.amazon.awssdk.http.apache.ApacheHttpClient;


Do we really have to use these clients?

I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.

Another option would be the JDK's HttpClient

Yeah, I need to dig into this a bit. TBH, this class is just a copy/paste from langchain4j; I only replaced the logger for now.

Except that the JDK one can leak connections, so I would rather use the Quarkus one.

So this is to connect to an AWS OpenSearch server. The opensearch extension also uses those clients: https://github.com/quarkiverse/quarkus-opensearch/blob/main/opensearch-java-client/runtime/src/main/java/io/quarkiverse/opensearch/client/runtime/OpenSearchTransportHelper.java. This is where it would be nice to have @sboeckelmann's feedback.

> Do we really have to use these clients?
>
> I would much rather use the Quarkus HTTP stack since it leads to lower resource usage while also guaranteeing things will work in native mode.

If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code.

FWIW in Hibernate Search we just use the Elasticsearch low-level client (a thin wrapper on top of Apache HTTP client -- the async one) to connect to either Elasticsearch or OpenSearch, and have a custom Apache HTTP plugin based on the AWS SDK to support signing. That way we handle it all with a single impl: Elasticsearch or OpenSearch, AWS or non-AWS. Though I'll admit it feels a bit messy. Another advantage is that AWS request signing in that case is well-tested (it's harder than you'd think, especially with chunked requests).

I would recommend that you rely on the existing quarkus-elasticsearch-rest-client extension but... then you won't get AWS signing (since the only support we have right now is specific to Hibernate Search).

Eventually we'll need a more consolidated approach for the Elasticsearch low-level client. quarkusio/quarkus#26991 will get us one step closer, as it will force me to move AWS signing to something not specific to Hibernate Search.

We could also consider dropping the Elasticsearch low-level client and shipping a low-level client of our own based on the Quarkus HTTP stack, but I really would only do this if we can agree to use it on all extensions related to Elasticsearch/OpenSearch, and if we can afford the API break.

> If you mean it would lead to not duplicating thread pools, sure, but I'd like to mention Apache HTTP clients have an async/reactive implementation as well, so you don't necessarily need to rely on blocking code

I'm thinking more about the impact of loading hundreds of classes from yet another HTTP library than about the impact of another thread pool, which is smaller (but of course ideally would not exist).

geoand · 2024-01-09T07:38:09Z

@yrodiere might also be interested in this :)

docs/modules/ROOT/pages/opensearch-store.adoc

opensearch/deployment/pom.xml

opensearch/runtime/pom.xml

cescoffier · 2024-01-09T07:39:34Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+import software.amazon.awssdk.http.SdkHttpClient;
+import software.amazon.awssdk.http.apache.ApacheHttpClient;


Except that the JDK one can leak connections, so I would rather use the Quarkus one.

docs/modules/ROOT/pages/opensearch-store.adoc

Co-authored-by: Clement Escoffier <clement.escoffier@gmail.com>

yrodiere

Thanks for pinging me @geoand, here are my two cents.

As explained below, consolidating our approach to connecting to OpenSearch/Elasticsearch is certainly going to be an issue.

In the long run (not possible right now), I wonder if we wouldn't want to use Hibernate Search in standalone mode (not yet supported in Quarkus, see quarkusio/quarkus#26182 ) for this store and the Elasticsearch one?

Hibernate Search supports connecting to multiple versions of Elasticsearch/OpenSearch transparently, and it has built-in support for mass indexing, search (full-text, spatial, ... and vector in 7.1). Perhaps more interesting, it supports keeping the Elasticsearch/OpenSearch index in sync with a primary source of truth (relational database) without reindexing everything all the time. What it does not (and will not) support is transforming content into embeddings (vectors), so that part would certainly still be langchain4j's job.

Or perhaps we'd just want to keep this embedding store "pristine", and simply have the Hibernate Search extension provide an integration with langchain4j to transform content into embeddings. I.e. expect people to use either the langchain4j API directly without Hibernate Search, or the Hibernate Search API, which would delegate some things to langchain4j? At this point I'm not sure what's best; we'd need to discuss it.

yrodiere · 2024-01-09T09:30:24Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+import software.amazon.awssdk.http.SdkHttpClient;
+import software.amazon.awssdk.http.apache.ApacheHttpClient;


sboeckelmann · 2024-01-09T09:47:52Z

Don't use the old rest-client; those are deprecated.
Unfortunately, the AWS2 SDK needs the Apache HTTP Client stack to be set up. Use the new Async Java Client.

yrodiere · 2024-01-09T09:50:31Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+import dev.langchain4j.store.embedding.EmbeddingStore;
+import io.quarkus.logging.Log;
+import software.amazon.awssdk.http.SdkHttpClient;
+import software.amazon.awssdk.http.apache.ApacheHttpClient;


Ha, it just occurred to me that by bypassing the existing quarkus-elasticsearch-rest-client extension, you're also bypassing the dev service support for OpenSearch... Might be a bit annoying?

Indeed, very good point!

Yeah, IMO the OpenSearch dev service is a "must have" (and it's also used by the integration test).

And @sboeckelmann mentioned that the rest client is deprecated.
So, what is the conclusion for this part? Do we keep it as is for now?

> And @sboeckelmann mentioned that the rest client is deprecated.

Only the one being used in this PR. There's another one with async/reactive support in Apache HTTP Client, and I'd be surprised if the AWS SDK didn't provide a wrapper around that, too.

Anyway, here's my suggestion: I'd recommend relying on quarkus-elasticsearch-rest-client. It is quite low level (you'll have to write JSON), but it's consistent with what we expose in other Elasticsearch/OpenSearch extensions, and using it will get you dev services for (almost) free. Almost, because you'll probably still need to use a build item somewhere to let the Dev Services know you target OpenSearch, not Elasticsearch: see DevservicesElasticsearchBuildItem.

Users would have to add AWS request signing manually for now, though. We can add built-in support to that client, though ideally we'd put that in https://github.com/quarkiverse/quarkus-amazon-services somewhere.

FWIW, you can find a relatively simple implementation of that stuff here (the entry point is ElasticsearchAwsHttpClientConfigurer); it's LGPL, but I'm the author, and I'm hereby licensing this code to anyone interested as ASL2.

I'd personally suggest dropping AWS request signing for now and adding it separately, but it's your call.

If nobody works on AWS request signing for quarkus-elasticsearch-rest-client, I'll eventually have to do it when dealing with quarkusio/quarkus#26991, but I can't guarantee this will happen soon.

@yrodiere Thanks for all the pointers! TBH, I currently have no free cycles to refactor this to be based on quarkus-elasticsearch-rest-client, so if anyone else wants to pick this up, feel free. In the meantime, the current impl is working, so maybe we can open an issue to refactor this later? wdyt @geoand?

I'm afraid that if we do leave it as is, we'll never get around to improving it :).

yrodiere · 2024-01-09T09:53:35Z

...ch/runtime/src/main/java/io/quarkiverse/langchain4j/opensearch/OpenSearchEmbeddingStore.java

+        List<String> ids = embeddings.stream()
+                .map(ignored -> randomUUID())
+                .collect(toList());
+        addAllInternal(ids, embeddings, embedded);


If it's not already handled by the caller, you might want to limit the number of documents you add to your bulk request.

I have no idea of the size and number of documents involved here, but I know that for more than a few hundred large documents on a low-RAM install of OpenSearch (e.g. what we provide with Dev Services, ~1GB heap), the request buffer of OpenSearch/Elasticsearch can easily get overloaded and then OpenSearch/Elasticsearch will start rejecting requests.

In Hibernate Search we have a configurable number of parallel queues, each sending at most one bulk request at a time with a maximum number of documents in each bulk request (see https://github.com/quarkusio/search.quarkus.io/blob/8accfb39e06372805ce41d108a927e00828b1e25/src/main/resources/application.properties#L52C1-L56). This helps users tune their setup to avoid problems.

That being said, if you know your documents are small and the number of documents is reasonable (i.e. not millions of documents), you might be able to do away with such complexity.

Most of the time there should not be that many documents, and they should be pretty small, but we can maybe add this as a "nice to have".
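For reference, capping the size of each bulk request could be as simple as partitioning the input before sending it. Here is a minimal sketch of that idea, not what this PR implements: `BulkBatches`, `partition` and `maxBatchSize` are hypothetical names, and it leaves out the parallel queues that Hibernate Search uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of this PR): splits a list into batches
// so that each bulk request carries at most maxBatchSize documents.
final class BulkBatches {

    private BulkBatches() {
    }

    // Returns consecutive, order-preserving batches of at most maxBatchSize elements.
    static <T> List<List<T>> partition(List<T> items, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, items.size());
            // subList returns a view of the source list; copy it so each batch is independent
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

In `addAll`, the `ids`, `embeddings` and `embedded` lists would have to be partitioned with the same batch size so that the entries at a given index still line up, with one `addAllInternal` call per batch. The batch size would presumably come from a config property.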

adding opensearch

2e88bf5

formatting and doc

sebastienblanc requested a review from a team as a code owner January 9, 2024 07:33

geoand changed the title ~~adding opensearch as embedding store~~ Add Opensearch embedding store implementation Jan 9, 2024

geoand reviewed Jan 9, 2024

cescoffier requested changes Jan 9, 2024

Apply suggestions from code review

c4dd60e

Co-authored-by: Clement Escoffier <clement.escoffier@gmail.com>

yrodiere reviewed Jan 9, 2024

extracting to property, doc update

b5afbd2

sebastienblanc mentioned this pull request Mar 10, 2024

Elasticsearch integration #357

Open
