Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #3971: Check how to integrate vector databases via rest APIs #4059

Open
wants to merge 7 commits into
base: dev
Choose a base branch
from

Conversation

vga91
Copy link
Collaborator

@vga91 vga91 commented May 2, 2024

Fixes #3971

Changes

Emulate the https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/ commands.

Neo4j Vector Index Vector database correspondent
CREATE VECTOR INDEX apoc.vectordb.qdrant.createCollection
DROP VECTOR INDEX apoc.vectordb.qdrant.deleteCollection
add vector node/rel apoc.vectordb.qdrant.upsert
CALL db.index.vector.queryNodes / CALL db.index.vector.queryRelationships apoc.vectordb.qdrant.get and apoc.vectordb.qdrant.query
Delete vector node/rel apoc.vectordb.qdrant.delete
  • the same for the ChromaDb procedures.
  • the same for the Weaviate procedures

NOTE: Like the apoc.ml ones, the chroma, qdrand and weaviate procedures are implemented in such a way that they have the same signature, even though under the hood they have different bodies/methods/etc.

  • Added 2 custom procedures apoc.vectordb.qdrant.get and apoc.vectordb.custom to handle other vector databases (like Pinecone tested in PineconeTest).

Using the apoc.vectordb.*.get and apoc.vectordb.*.query procedures, we can auto-create neo4j vector indexes and entities, using the mapping config.

NOTE: by default, with the apoc.vectordb.*get and apoc.vectordb.*query only score, metatada and entity are retrieved, to get also other results, we have to set the config allResults: true.

To evaluate

  • apoc.vectordb.custom could be changed to a more generic naming, e.g. apoc.restapi.custom(<conf>), since it could be used with other rest APIs
  • move RestAPIConfig to util package

Additional notes (after PR merge)

  • Open a follow-up issue:

  • Test / custom procedures with other databases (like Pinecone)

  • Added trello Core card: problem with Pinecone, create a PR after neo4j-contrib PR creation...
    We cannot execute Pinecone fetch API with method: "", due to these 2 pieces of apoc core codes:

    • setDoOutput(true)
    • http.setChunkedStreamingMode(1024 * 1024);
      In both cases, we receive a 200OK, but with no results.

@Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception {
var config = new HashMap<>(configuration);

String qdrantUrl = getChromaUrl(hostOrKey);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy & paste typo - not qdrant :)

@Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception {
var config = new HashMap<>(configuration);

String qdrantUrl = getChromaUrl(hostOrKey);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

public URLAccessChecker urlAccessChecker;

@Procedure("apoc.vectordb.chroma.createCollection")
@Description("apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a bit better descriptions (for all the procedures), not just the signature again? otherwise the apoc.help output is not really informative if it shows the same content twice without a human description

}

private static Entity handleMappingNode(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) {
String query = "CREATE CONSTRAINT IF NOT EXISTS FOR (n:%s) REQUIRE n.%s IS UNIQUE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you test that you can run both the constraint as well as the data creation operation in the same tx?

shouldn't we leave that to the user to create the constraint, otherwise it would do it for every entity

transaction.commit();
}

String setVectorQuery = "CALL db.create.setNodeVectorProperty($entity, $key, $vector)";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can set the property to a float array ourselves, no need to call cypher here.


private static Entity handleMappingRel(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) {
try {
String query = "CREATE CONSTRAINT IF NOT EXISTS FOR ()-[r:%s]-() REQUIRE (r.%s) IS UNIQUE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above I don't think we need to do that

// in this case we cannot auto-create the rel, since we should have to define start and end node as well
Relationship rel;
try (Transaction transaction = db.beginTx()) {
Object propValue = metaProps.remove(mapping.getId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we really remove the mapping-id ? if we later return the metadata that's missing?

try (Transaction transaction = db.beginTx()) {
Object propValue = metaProps.remove(mapping.getId());
rel = transaction.findRelationship(RelationshipType.withName(mapping.getType()), mapping.getProp(), propValue);
if (rel != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this not only happen when "create: true" is set?

transaction.commit();
}

String setVectorQuery = "CALL db.create.setRelationshipVectorProperty($entity, $key, $vector)";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can set the float array property in the same tx above

node = transaction.createNode(Label.label(mapping.getLabel()));
node.setProperty(mapping.getProp(), propValue);
}
if (node != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why to we write properties if create is not set to true? then we should just return the found node

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only populate a node when create is true
alternatively we could have 3 modes (create / update / read) with read the default

try {
Node node;
try (Transaction transaction = db.beginTx()) {
Object propValue = metaProps.remove(mapping.getId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as below we should not remove the mapping id from the metadata

}

db.executeTransactionally(setVectorQuery,
Map.of("entity", Util.rebind(tx, entity), "key", mapping.getEmbeddingProp(), "vector", embedding));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure to turn the double list into a float array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and just set the float array directly as property


APOC provides these set of procedures, which leverages the Rest APIs, to interact with Vector Databases:

- `apoc.vectordb.qdrant.*` (to interact with https://qdrant.tech/documentation/overview/[Qdrant])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add pinecone to docs

* @param entity we cannot declare entity with class Entity,
* as an error `cannot be converted to a Neo4j type: Don't know how to map `org.neo4j.graphdb.Entity` to the Neo4j Type` would be thrown
*/
public record EmbeddingResult(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we have two fields one for Node and one for Relationship
where one or the other is null?

otherwise Cypher cannot do anything with that Object result and you have to first call convert.toNode which would be really annoying.

enum Type {
CHROMA(new ChromaEmbeddingType()),
QDRANT(new QdrantEmbeddingType()),
WEAVIATE(new WeaviateEmbeddingType());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinecone?

import static org.junit.Assert.assertTrue;

/**
* It leverages `apoc.vectordb.custom*` procedures
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we have a dedicated pinecone procedures set?

Copy link
Member

@jexp jexp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments

@vga91 vga91 marked this pull request as draft May 17, 2024 16:32
@vga91 vga91 force-pushed the issue-3971 branch 3 times, most recently from 9ff02b4 to 9a8f108 Compare May 20, 2024 07:13
- `apoc.vectordb.store` (to store host, credentials and mapping into the system database)

All the procedures, except the `apoc.vectordb.store` one, can have, as a final parameter,
a configuration map with these possible parameters:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible -> optional

| headers | additional HTTP headers
| method | HTTP method
| endpoint | endpoint key,
can be used to override the default endpoint created via the 1st parameter of the `apoc.vectordb.qdrant.*` and `apoc.vectordb.qdrant.*`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

can be used to override the default endpoint created via the 1st parameter of the `apoc.vectordb.qdrant.*` and `apoc.vectordb.qdrant.*`,
to handle potential endpoint changes.
| body | body HTTP request
| jsonPath | To customize https://github.com/json-path/JsonPath[JSONPath] of the response. The default is `null`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSONPath parsing of the response

See examples below.
|===

include::./qdrand.adoc[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qdrant file name typo?


include::./custom.adoc[]

== Store Vector db info (i.e. `apoc.vectordb.store`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Store is an implementation detail
What it's about for the user is configure?

We can save some info in the System Database to be reused later, that is the host, login credentials, and mapping,
to be used in `*.get` and `.*query` procedures, except for the `apoc.vectordb.custom.get` one.

Therefore, to store the vector info, we can execute the `CALL apoc.vectordb.store(vectorName, host, credentialsValue, mapping)`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't I be able to use multiple vector names for the same provider?
like "qdrant" + "books" or "qdrant" + "papers" ? we should not limit it to one only.
so that instead of host+key I would use "books" or "papers"

Therefore, to store the vector info, we can execute the `CALL apoc.vectordb.store(vectorName, host, credentialsValue, mapping)`,
where `vectorName` can be "QDRANT", "CHROMA" or "WEAVIATE",
that indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*` and `apoc.vectordb.weaviate.*`.
Then `host` is the host base name, `credentialsValue` is the API key and `mapping` is a map that can be used instead of the homonym `embeddingConfig` parameter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

homonym
we should have mapping throughout when it's about the mapping
embeddingConfig is a different topic, no?

that indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*` and `apoc.vectordb.weaviate.*`.
Then `host` is the host base name, `credentialsValue` is the API key and `mapping` is a map that can be used instead of the homonym `embeddingConfig` parameter.

NOTE:: this procedure is only executable by a user with admin permissions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also needs to be routed to the systemdb leader? or does that happen now automatically???

[source,cypher]
----
CALL apoc.vectordb.store('QDRANT', 'custom-host-name', '<apiKey>',
{embeddingProp: "vect", label: "Test", prop: "myId", id: "foo"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure about these names, they are not that obvious:

what would be sensible and easily understandable? (throughout)

Perhaps:

  • embeddingKey
  • metadataKey
  • nodeLabel
  • nodeKey

.Create a collection (it leverages https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection[this API])
[source,cypher]
----
CALL apoc.vectordb.qdrant.createCollection($host, 'test_collection', 'Cosine', 4, {<optional config>})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use $hostOrKey here too


== Weaviate

Here is a list of all available Qdrant procedures:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo Qdrant -> Weaviate

.map(MapResult::new);
}

@Procedure(value = "apoc.vectordb.chroma.delete", mode = Mode.SCHEMA)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these really mode=SCHEMA?

.map(ListResult::new);
}

@Procedure(value = "apoc.vectordb.chroma.get", mode = Mode.SCHEMA)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be mode=WRITE if we keep the update behavior ?

I think if we should move the write behavior into a separate method, like queryAndUpdate or so? or updateGraphFromQuery ? and keep the query method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.


@Procedure(value = "apoc.vectordb.chroma.query", mode = Mode.SCHEMA)
@Description("apoc.vectordb.chroma.query(hostOrKey, collection, vector, filter, limit, $configuration) - Retrieve closest vectors the the defined `vector`, `limit` of results, in the collection with the name specified in the 2nd parameter")
public Stream<EmbeddingResult> query(@Name("hostOrKey") String hostOrKey,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, that comment for the query procedure was meant for here:

I think if we should move the write behavior into a separate method, like queryAndUpdate or so? or updateGraphFromQuery ? and keep the query method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed Mode.SCHEMA, I had accidentally left it in that initially the procedure also auto-created the vector indexes in neo4j, I removed it now.
And added procedures queryAndUpdate with WRITE mode

v -> listOfListsToMap((Map) v).stream());
}

private Map<String, Object> getVectorDbInfo(String hostOrKey, String collection, Map<String, Object> configuration, String templateUrl) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great if we could expose this getVectorDbInfo in a procedure call for each of the databases to get an overview what's in there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added procedures with names apoc.vectordb.<type>.info

* and mapping data to auto-create neo4j vector indexes and properties
*/
@Procedure(value = "apoc.vectordb.custom.get", mode = Mode.SCHEMA)
@Description("apoc.vectordb.custom.get(host, $configuration) - Customizable get / query procedure")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

little bit more detail in the description?

throw new RuntimeException(embeddingErrMsg);
}

entity.setProperty(mapping.getEmbeddingProp(), embedding.stream()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just do a utility method that creates a float array of the list size and uses a for loop over the list to set the values. then the JVM can also optimize that to SIMD. I don't think that the streams are efficient here.

// -- implementations
//

class QdrantEmbeddingHandler implements VectorEmbeddingHandler {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should move these implementations closer to where the vector databases are? either into the procedures file or an associated file? Otherwise we have to update this file whenever we add a new db?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved

@vga91 vga91 force-pushed the issue-3971 branch 4 times, most recently from b4af8cb to ae0152a Compare May 24, 2024 23:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants