New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #3971: Check how to integrate vector databases via rest APIs #4059
base: dev
Are you sure you want to change the base?
Conversation
6d57a89
to
4291f7a
Compare
@Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception { | ||
var config = new HashMap<>(configuration); | ||
|
||
String qdrantUrl = getChromaUrl(hostOrKey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy & paste typo - not qdrant :)
@Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception { | ||
var config = new HashMap<>(configuration); | ||
|
||
String qdrantUrl = getChromaUrl(hostOrKey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
public URLAccessChecker urlAccessChecker; | ||
|
||
@Procedure("apoc.vectordb.chroma.createCollection") | ||
@Description("apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a bit better descriptions (for all the procedures), not just the signature again? otherwise the apoc.help output is not really informative if it shows the same content twice without a human description
} | ||
|
||
private static Entity handleMappingNode(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) { | ||
String query = "CREATE CONSTRAINT IF NOT EXISTS FOR (n:%s) REQUIRE n.%s IS UNIQUE" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you test that you can run both the constraint as well as the data creation operation in the same tx?
shouldn't we leave that to the user to create the constraint, otherwise it would do it for every entity
transaction.commit(); | ||
} | ||
|
||
String setVectorQuery = "CALL db.create.setNodeVectorProperty($entity, $key, $vector)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can set the property to a float array ourselves, no need to call cypher here.
|
||
private static Entity handleMappingRel(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) { | ||
try { | ||
String query = "CREATE CONSTRAINT IF NOT EXISTS FOR ()-[r:%s]-() REQUIRE (r.%s) IS UNIQUE" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above I don't think we need to do that
// in this case we cannot auto-create the rel, since we should have to define start and end node as well | ||
Relationship rel; | ||
try (Transaction transaction = db.beginTx()) { | ||
Object propValue = metaProps.remove(mapping.getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we really remove the mapping-id ? if we later return the metadata that's missing?
try (Transaction transaction = db.beginTx()) { | ||
Object propValue = metaProps.remove(mapping.getId()); | ||
rel = transaction.findRelationship(RelationshipType.withName(mapping.getType()), mapping.getProp(), propValue); | ||
if (rel != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this not only happen when "create: true" is set?
transaction.commit(); | ||
} | ||
|
||
String setVectorQuery = "CALL db.create.setRelationshipVectorProperty($entity, $key, $vector)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can set the float array property in the same tx above
node = transaction.createNode(Label.label(mapping.getLabel())); | ||
node.setProperty(mapping.getProp(), propValue); | ||
} | ||
if (node != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why to we write properties if create is not set to true? then we should just return the found node
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only populate a node when create is true
alternatively we could have 3 modes (create / update / read) with read the default
try { | ||
Node node; | ||
try (Transaction transaction = db.beginTx()) { | ||
Object propValue = metaProps.remove(mapping.getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as below we should not remove the mapping id from the metadata
} | ||
|
||
db.executeTransactionally(setVectorQuery, | ||
Map.of("entity", Util.rebind(tx, entity), "key", mapping.getEmbeddingProp(), "vector", embedding)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to turn the double list into a float array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and just set the float array directly as property
|
||
APOC provides these set of procedures, which leverages the Rest APIs, to interact with Vector Databases: | ||
|
||
- `apoc.vectordb.qdrant.*` (to interact with https://qdrant.tech/documentation/overview/[Qdrant]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add pinecone to docs
* @param entity we cannot declare entity with class Entity, | ||
* as an error `cannot be converted to a Neo4j type: Don't know how to map `org.neo4j.graphdb.Entity` to the Neo4j Type` would be thrown | ||
*/ | ||
public record EmbeddingResult( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we have two fields one for Node and one for Relationship
where one or the other is null?
otherwise Cypher cannot do anything with that Object result and you have to first call convert.toNode which would be really annoying.
enum Type { | ||
CHROMA(new ChromaEmbeddingType()), | ||
QDRANT(new QdrantEmbeddingType()), | ||
WEAVIATE(new WeaviateEmbeddingType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinecone?
import static org.junit.Assert.assertTrue; | ||
|
||
/** | ||
* It leverages `apoc.vectordb.custom*` procedures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we have a dedicated pinecone procedures set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comments
9ff02b4
to
9a8f108
Compare
- `apoc.vectordb.store` (to store host, credentials and mapping into the system database) | ||
|
||
All the procedures, except the `apoc.vectordb.store` one, can have, as a final parameter, | ||
a configuration map with these possible parameters: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible -> optional
| headers | additional HTTP headers | ||
| method | HTTP method | ||
| endpoint | endpoint key, | ||
can be used to override the default endpoint created via the 1st parameter of the `apoc.vectordb.qdrant.*` and `apoc.vectordb.qdrant.*`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
??
can be used to override the default endpoint created via the 1st parameter of the `apoc.vectordb.qdrant.*` and `apoc.vectordb.qdrant.*`, | ||
to handle potential endpoint changes. | ||
| body | body HTTP request | ||
| jsonPath | To customize https://github.com/json-path/JsonPath[JSONPath] of the response. The default is `null`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JSONPath parsing of the response
See examples below. | ||
|=== | ||
|
||
include::./qdrand.adoc[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qdrant file name typo?
|
||
include::./custom.adoc[] | ||
|
||
== Store Vector db info (i.e. `apoc.vectordb.store`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Store is an implementation detail
What it's about for the user is configure?
We can save some info in the System Database to be reused later, that is the host, login credentials, and mapping, | ||
to be used in `*.get` and `.*query` procedures, except for the `apoc.vectordb.custom.get` one. | ||
|
||
Therefore, to store the vector info, we can execute the `CALL apoc.vectordb.store(vectorName, host, credentialsValue, mapping)`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't I be able to use multiple vector names for the same provider?
like "qdrant" + "books" or "qdrant" + "papers" ? we should not limit it to one only.
so that instead of host+key I would use "books" or "papers"
Therefore, to store the vector info, we can execute the `CALL apoc.vectordb.store(vectorName, host, credentialsValue, mapping)`, | ||
where `vectorName` can be "QDRANT", "CHROMA" or "WEAVIATE", | ||
that indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*` and `apoc.vectordb.weaviate.*`. | ||
Then `host` is the host base name, `credentialsValue` is the API key and `mapping` is a map that can be used instead of the homonym `embeddingConfig` parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
homonym
we should have mapping throughout when it's about the mapping
embeddingConfig is a different topic, no?
that indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*` and `apoc.vectordb.weaviate.*`. | ||
Then `host` is the host base name, `credentialsValue` is the API key and `mapping` is a map that can be used instead of the homonym `embeddingConfig` parameter. | ||
|
||
NOTE:: this procedure is only executable by a user with admin permissions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also needs to be routed to the systemdb leader? or does that happen now automatically???
[source,cypher] | ||
---- | ||
CALL apoc.vectordb.store('QDRANT', 'custom-host-name', '<apiKey>', | ||
{embeddingProp: "vect", label: "Test", prop: "myId", id: "foo"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure about these names, they are not that obvious:
what would be sensible and easily understandable? (throughout)
Perhaps:
- embeddingKey
- metadataKey
- nodeLabel
- nodeKey
.Create a collection (it leverages https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection[this API]) | ||
[source,cypher] | ||
---- | ||
CALL apoc.vectordb.qdrant.createCollection($host, 'test_collection', 'Cosine', 4, {<optional config>}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use $hostOrKey here too
|
||
== Weaviate | ||
|
||
Here is a list of all available Qdrant procedures: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo Qdrant -> Weaviate
.map(MapResult::new); | ||
} | ||
|
||
@Procedure(value = "apoc.vectordb.chroma.delete", mode = Mode.SCHEMA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these really mode=SCHEMA?
.map(ListResult::new); | ||
} | ||
|
||
@Procedure(value = "apoc.vectordb.chroma.get", mode = Mode.SCHEMA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be mode=WRITE if we keep the update behavior ?
I think if we should move the write behavior into a separate method, like queryAndUpdate
or so? or updateGraphFromQuery
? and keep the query
method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.
|
||
@Procedure(value = "apoc.vectordb.chroma.query", mode = Mode.SCHEMA) | ||
@Description("apoc.vectordb.chroma.query(hostOrKey, collection, vector, filter, limit, $configuration) - Retrieve closest vectors the the defined `vector`, `limit` of results, in the collection with the name specified in the 2nd parameter") | ||
public Stream<EmbeddingResult> query(@Name("hostOrKey") String hostOrKey, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, that comment for the query procedure was meant for here:
I think if we should move the write behavior into a separate method, like queryAndUpdate
or so? or updateGraphFromQuery
? and keep the query method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed Mode.SCHEMA, I had accidentally left it in that initially the procedure also auto-created the vector indexes in neo4j, I removed it now.
And added procedures queryAndUpdate
with WRITE mode
v -> listOfListsToMap((Map) v).stream()); | ||
} | ||
|
||
private Map<String, Object> getVectorDbInfo(String hostOrKey, String collection, Map<String, Object> configuration, String templateUrl) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be great if we could expose this getVectorDbInfo
in a procedure call for each of the databases to get an overview what's in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added procedures with names apoc.vectordb.<type>.info
* and mapping data to auto-create neo4j vector indexes and properties | ||
*/ | ||
@Procedure(value = "apoc.vectordb.custom.get", mode = Mode.SCHEMA) | ||
@Description("apoc.vectordb.custom.get(host, $configuration) - Customizable get / query procedure") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
little bit more detail in the description?
throw new RuntimeException(embeddingErrMsg); | ||
} | ||
|
||
entity.setProperty(mapping.getEmbeddingProp(), embedding.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just do a utility method that creates a float array of the list size and uses a for loop over the list to set the values. then the JVM can also optimize that to SIMD. I don't think that the streams are efficient here.
// -- implementations | ||
// | ||
|
||
class QdrantEmbeddingHandler implements VectorEmbeddingHandler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should move these implementations closer to where the vector databases are? either into the procedures file or an associated file? Otherwise we have to update this file whenever we add a new db?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved
b4af8cb
to
ae0152a
Compare
…d vector as a default result
Fixes #3971
Changes
Emulate the https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/ commands.
CREATE VECTOR INDEX
apoc.vectordb.qdrant.createCollection
DROP VECTOR INDEX
apoc.vectordb.qdrant.deleteCollection
apoc.vectordb.qdrant.upsert
CALL db.index.vector.queryNodes
/CALL db.index.vector.queryRelationships
apoc.vectordb.qdrant.get
andapoc.vectordb.qdrant.query
apoc.vectordb.qdrant.delete
NOTE: Like the
apoc.ml
ones, the chroma, qdrand and weaviate procedures are implemented in such a way that they have the same signature, even though under the hood they have different bodies/methods/etc.apoc.vectordb.qdrant.get
andapoc.vectordb.custom
to handle other vector databases (like Pinecone tested in PineconeTest).Using the
apoc.vectordb.*.get
andapoc.vectordb.*.query
procedures, we can auto-create neo4j vector indexes and entities, using the mapping config.NOTE: by default, with the
apoc.vectordb.*get
andapoc.vectordb.*query
only score, metatada and entity are retrieved, to get also other results, we have to set the configallResults: true
.To evaluate
apoc.vectordb.custom
could be changed to a more generic naming, e.g.apoc.restapi.custom(<conf>)
, since it could be used with other rest APIsRestAPIConfig
to util packageAdditional notes (after PR merge)
Open a follow-up issue:
Test / custom procedures with other databases (like Pinecone)
Added trello Core card: problem with Pinecone, create a PR after neo4j-contrib PR creation...
We cannot execute Pinecone fetch API with method: "", due to these 2 pieces of apoc core codes:
In both cases, we receive a 200OK, but with no results.