Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a typesafe and fluent way to return only a subset of data and to use script_fields #8074

Open
erik-kallen opened this issue Apr 3, 2024 · 7 comments
Labels
8.x Relates to 8.x client version Area: Helpers Category: Feature

Comments

@erik-kallen
Copy link

erik-kallen commented Apr 3, 2024

Is your feature request related to a problem? Please describe.

I can't find a good way to return only a subset of the source in a typesafe way. I also can't find a (reasonably) typesafe way to use script_fields.

Describe the solution you'd like

I would like to be able to do this somehow. One option would be a new overload to SearchAsync:

Task<SearchResponse<TResult>> SearchAsync<TDocument, TResult>(
    Action<SearchRequestDescriptor<TDocument>> configureRequest,
    Expression<Func<TDocument, TResult>> resultFactory
);

The intent is that this could be invoked like this:

var result = await client.SearchAsync<MyEntity, MyResultEntity>(
    req => req.Index(...).Query(...),
    doc => new MyResultEntity
    {
        Prop1 = doc.Prop1,
        Prop2 = doc.Prop2ByLanguage[currentLanguage],
        Prop3 = Script<int>(@"emit(0)")
    }
);

And this should set the _source_includes and script_fields based on what the factory method wants.

Describe alternatives you've considered

One option is to remove all required from my indexed entity and then transform each element in result.Documents.

another would be to use SearchAsync<JsonObject>, but that doesn't work, either. Trying to use

var result = await client.SearchAsync<JsonObject>(
    req => req.Query(q => q.Term(o => o["prop1], theValue))

);

results in an exception with this stack trace:

# FailureReason: Unrecoverable/Unexpected BadRequest while attempting POST on http://localhost:9200/my-index/_search
 - [1] BadRequest: Node: http://localhost:9200/ Exception: ArgumentOutOfRangeException Took: 00:00:00.2872455
# Audit exception in step 1 BadRequest:
System.ArgumentOutOfRangeException: capacity ('-1') must be a non-negative value. (Parameter 'capacity')
Actual value was -1.
   at System.ArgumentOutOfRangeException.ThrowNegative[T](T value, String paramName)
   at System.Text.StringBuilder..ctor(Int32 capacity, Int32 maxCapacity)
   at Elastic.Clients.Elasticsearch.FieldExpressionVisitor.Resolve(Expression expression, Boolean toLastToken) in /_/src/Elastic.Clients.Elasticsearch.Shared/Core/Infer/Field/FieldExpressionVisitor.cs:line 33
   at Elastic.Clients.Elasticsearch.FieldResolver.Resolve(Expression expression, MemberInfo member, Boolean toLastToken) in /_/src/Elastic.Clients.Elasticsearch.Shared/Core/Infer/Field/FieldResolver.cs:line 75
   at Elastic.Clients.Elasticsearch.FieldResolver.ResolveFieldName(Field field) in /_/src/Elastic.Clients.Elasticsearch.Shared/Core/Infer/Field/FieldResolver.cs:line 49
   at Elastic.Clients.Elasticsearch.FieldResolver.Resolve(Field field) in /_/src/Elastic.Clients.Elasticsearch.Shared/Core/Infer/Field/FieldResolver.cs:line 31
   at Elastic.Clients.Elasticsearch.Inferrer.Field(Field field) in /_/src/Elastic.Clients.Elasticsearch.Shared/Core/Infer/Inferrer.cs:line 52
   at Elastic.Clients.Elasticsearch.QueryDsl.TermQueryDescriptor`1.Serialize(Utf8JsonWriter writer, JsonSerializerOptions options, IElasticsearchClientSettings settings) in /_/src/Elastic.Clients.Elasticsearch/_Generated/Types/QueryDsl/TermQuery.g.cs:line 217
<--snip

However, using await client.SearchAsync<Dictionary<string, JsonNode>>(...) seems to work but is rather unergonomic.

@erik-kallen
Copy link
Author

I can help building this feature, if you want.

@erik-kallen erik-kallen changed the title Provide a typesafe and fluent way to return only a subset of data and to use runtime_mappings Provide a typesafe and fluent way to return only a subset of data and to use script_fields Apr 4, 2024
@flobernd flobernd added 8.x Relates to 8.x client version Area: Helpers labels Apr 4, 2024
@flobernd
Copy link
Member

flobernd commented Apr 4, 2024

Hi @erik-kallen,

I have to think about this a little bit more. Currently I only see 1 main problem which would have to be solved in the code-generator:

  • The search APIs should have a generic argument TDocument for the query builder part, etc. and a different generic argument TPartialDocument for just the deserialization.

If that would be the case, you could easily deserialize into a partial type that only has the requested properties marked with required.

This does not solve the problem for meta-fields like the script_fields in your example, but these are only available after deserialization of the complete response (including all documents) which means we definitely have to materialize the request and perform a transformation afterwards.

For that transformation part, I'm not sure if we really need a helper, but I'm not completely against adding one.

Regarding type safety:

  • The SourceIncludes property is already of type Fields which allows to define fields using lambda expression trees
  • The ScriptFields are currently defined as IDictionary<string, Elastic.Clients.Elasticsearch.ScriptField> which does not allow usage of lambda expressions. A hand-crafted class ScriptFields similar to Fields could be created to improve usability and type-safety

What do you think?

@erik-kallen
Copy link
Author

The problem with the TPartialDocument approach is that I need to make sure that properties are named the same in both types, which is something I very much want to avoid. Another problem that would be good to have solved is that whenever a property would be added to this partial document, the SourceIncludes list needs to be manually maintained.

Given that we already have the typesafe ability to generate a Field from an Expression, I don't think it would be a super hard problem to:

  • Use the ExpressionVisitor to traverse an Expression<Func<TDocument, TTarget>> to find all access chains to properties of the source and collect them to a Fields object, and to collect all ScriptFields.
  • Use the ExpressionVisitor to transform the Expression<Func<TDocument, TTarget>> to an `Expression<Func<Hit, Hit>>
  • Compile the expression.
  • Deserialize the document into a JsonObject instead of TDocument (this might be hard, but it would be required also for the TPartialDocument solution).
  • Use the compiled expression to transform the documents.

The only thing I'm a little worried about is that this means that there is a real risk of the user defining the lambda inline, which would lead to a lot of compilations. To solve this, we could implement expression comparisons and reuse the compiled expressions if the expression trees are identical. Or we could have a limit of how many expressions are allowed to be compiled (which will alert the user to the problem). Or we could not compile the expressions but instead evaluate them dynamically.

Actually, I realize that my idea to have the script inline probably won't play well with this cache. Those values probably need to be referenced by some syntax like FieldValue<string>("theFieldName") instead.

@flobernd
Copy link
Member

flobernd commented Apr 4, 2024

The problem with the TPartialDocument approach is that I need to make sure that properties are named the same in both types

This sounds like a pretty unique problem to be honest. I fully understand the argument about the required properties, but renaming properties during retrieval is probably not a very common thing to do.

Another problem that would be good to have solved is that whenever a property would be added to this partial document, the SourceIncludes list needs to be manually maintained.

The Fields type already provides an implicit conversion operator from PropertyInfo[] which means you could initialize it with a list of properties for the TPartialDocument that is previously obtained by reflection.

Given that we already have the typesafe ability to generate a Field from an Expression, I don't think it would be a super hard problem to:

From the technical side, this is definitely doable and the approach you describe sounds reasonable.

However, in the context of this library I'm concerned about using compiled expressions. Besides the actual compilation overhead that you already mentioned, they are interpreted instead of compiled on several platforms like Mono, Xamarin, etc. Even when using AOT compilation in regular .NET framework, expression trees can no longer be compiled. This introduces a significant performance impact on these code pathes.

To solve this, we could implement expression comparisons and reuse the compiled expressions if the expression trees are identical

I've played with expression tree comparison in the past to be able to cache generic EF query results and from my experience, the comparison takes a significant amount of time itself (my code can probably be optimized on a few places) so that it was not worth using for fast queries at all.


I can definitely see the value in such helper function and don't want to demotivate you, but the problem you want to solve is really specific in my opinion. I would rather keep the library more generic and don't increase the scope too much. As I'm the only person working on this project, I always have to keep an eye on maintainability as well. This as well was one major reason for switching from hand-crafted code to the code generation approach. Besides that, there are the technical difficulties with the expression trees.

If you want to work on this anyways, I would love to see the results, but I don't expect it to be merged upstream at the moment.

What I'll definitely add to my list, is to come up with a way to deserialize search query results into a TPartialDocument. While this does not solve your specific issue, this still is something that I can see to have value for a lot of other users 🙂

@erik-kallen
Copy link
Author

The problem with the TPartialDocument approach is that I need to make sure that properties are named the same in both types

This sounds like a pretty unique problem to be honest. I fully understand the argument about the required properties, but renaming properties during retrieval is probably not a very common thing to do.

I don't understand what you mean. The problem I am referring to is that if I have a class IndexedEntity with all properties required, and another class PartialIndexedEntity with all properties optional, I need to make sure that the properties have the same name in both classes, or things will break silently.

The Fields type already provides an implicit conversion operator from PropertyInfo[] which means you could initialize it with a list of properties for the TPartialDocument that is previously obtained by reflection.

Yes, but as mentioned above, I don't think the TPartialData approach is better than what I currently have.

However, in the context of this library I'm concerned about using compiled expressions

This is a valid concern. The overload would have to be annotated with RequiresDynamicCode. Or perhaps do whatever EF does (which I don't know the details of, but the problem is very similar).

I can definitely see the value in such helper function and don't want to demotivate you, but the problem you want to solve is really specific in my opinion.

I don't think it's a very specific problem. Of course, I don't know how other people use the library, but I imagine it not being uncommon for people to do client.Search<TDocument>(...).Documents.Select(d => new SomeOtherEntity(...)). Anyone who does that could potentially benefit from this solution.

As I'm the only person working on this project, I always have to keep an eye on maintainability as well.

I understand that, thank you for your work!

If you want to work on this anyways, I would love to see the results, but I don't expect it to be merged upstream at the moment.

I wouldn't mind working on it, but I won't do it if it won't be merged. Or, if it becomes possible to deserialize to a different type than the query builder type, I might implement it as part of my own project (and I can share the code if I manage to get it working nicely).

What I'll definitely add to my list, is to come up with a way to deserialize search query results into a TPartialDocument. While this does not solve your specific issue, this still is something that I can see to have value for a lot of other users 🙂

While I don't see myself using the TPartialData approach, i think I can probably implement what I want externally if TPartialData is allowed to be JsonElement (or JsonNode).

@flobernd
Copy link
Member

flobernd commented Apr 5, 2024

The problem I am referring to is that if I have a class IndexedEntity with all properties required, and another class PartialIndexedEntity with all properties optional, I need to make sure that the properties have the same name in both classes, or things will break silently.

I understand that concern 🙂 Just the way you want to solve this, is very specific (in my opinon and without knowing all of your requirements).

The way I personally would solve this (again, given that we would have a version of the API that allows to specify TDocument and TPartialDocument), would just look like this:

public record Partial
{
    public required int A { get; init; }
    public required int B { get; init; }
}

public record Full : Partial
{
    public required int C { get; init; }
}

// In the next step, get `Property[]` of `Partial` by reflection and use it to initialize 
// the `Fields` instance of the `SourceInclude` property in the `SearchRequest`

If this is a suitable class design, of course depends on your needs.

And keep in mind, that even with your approach, you would still be forced to update the property names on your already indexed documents, if you rename something.

A solution to this would be to use explicit JsonName attributes on the properties.

I don't know how other people use the library, but I imagine it not being uncommon for people to do client.Search<TDocument>(...).Documents.Select(d => new SomeOtherEntity(...)).

That definitely could be the case. It's just that from my experience:

  • people either use the inheritance based approach
  • people just quickly map the properties by hand if the amount of properties is small
  • large projects are using defacto-standard libraries to do the mapping for them (e.g. AutoMapper)

While I don't see myself using the TPartialData approach, i think I can probably implement what I want externally if TPartialData is allowed to be JsonElement (or JsonNode).

Yes, this should be possible in this case. Deserialization to JsonObject works fine in theory. For you it was crashing because of accessing the indexer (o => o["prop1]) which is probably not supported at the moment by the field name inference expression visitor.

@erik-kallen
Copy link
Author

Inheritance won't work for me because different queries return different subsets.

I don't know how other people use the library, but I imagine it not being uncommon for people to do client.Search(...).Documents.Select(d => new SomeOtherEntity(...)).

That definitely could be the case. It's just that from my experience:

  • people either use the inheritance based approach
  • people just quickly map the properties by hand if the amount of properties is small
  • large projects are using defacto-standard libraries to do the mapping for them (e.g. AutoMapper)

Yes, but the main thing I want to do doesn't really have anything to do with the mapping part. It's about reducing the amount of data fetched from elastic, and AutoMapper won't help with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8.x Relates to 8.x client version Area: Helpers Category: Feature
Projects
None yet
Development

No branches or pull requests

2 participants