Unable to query fields indexed by nouveau #4997

Open
GMishx opened this issue Feb 29, 2024 · 26 comments · Fixed by #5018

Comments

@GMishx
Contributor

GMishx commented Feb 29, 2024

Description

I compiled the latest CouchDB with ./configure --enable-nouveau and it is running fine. I also started the nouveau server with the generated ./rel/couchdb/nouveau/bin/nouveau server.

Now, when I try to query the indexes, it works only for the email field and not for any other field.

Steps to Reproduce

I have a sw360users database with documents like the following:

{
  "_id": "2b1086cef0a14b7eaeb6f0204b58b8cf",
  "_rev": "1-bb5f36a88a804eeba165b704090476b7",
  "type": "user",
  "email": "clearing_admin1@sw360.org",
  "userGroup": "CLEARING_ADMIN",
  "fullname": "Test Clearing1",
  "givenname": "Test",
  "lastname": "Clearing1"
}

On this DB, I created a ddoc for nouveau with the following document:

{
  "_id": "_design/nouveau_user",
  "nouveau": {
    "users": {
      "index": "function(doc) {\n  if (doc.type == 'user' ) {\n    if (typeof(doc.givenname) == 'string') {\n        index(\"string\", \"givenname\", doc.givenname, {\"store\": true});\n    }\n    if (typeof(doc.email) == 'string') {\n        index(\"string\", \"email\", doc.email, {\"store\": true});\n    }\n    if (typeof(doc.lastname) == 'string') {\n        index(\"string\", \"lastname\", doc.lastname, {\"store\": true});\n    }\n  }\n}",
      "default_analyzer": "english",
      "field_analyzers": {
        "email": "email"
      }
    }
  }
}

Here, I am indexing 3 fields: givenname, lastname and email. I tried various configurations, changing the order of the index() calls in the function and using different types of analyzers for creating the index.

I see no error in the nouveau logs or in the couchdb logs after the creation of ddoc. Thus, I relaxed :-)

Note: Responses are trimmed for brevity.

Now, when I query all records with q=*:*, I get 10 hits since I have 10 users:
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "*:*"}'

{"total_hits_relation":"EQUAL_TO","total_hits":10,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"setup@sw360.org"}},{"order":[{"value":1.0,"@type":"float"},{"value":"2a7cedcf38e24ebbade7a23f3f07f793","@type":"string"}],"id":"2a7cedcf38e24ebbade7a23f3f07f793","fields":{"lastname":"Clearing2","givenname":"Test","email":"clearing_admin2@sw360.org"}}],"counts":null,"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiZWIzM2U2ZGI1YTE1NDAxNjgxMDg4OWQ4ZTU0NWZmODIiLCJAdHlwZSI6InN0cmluZyJ9XSxbeyJ2YWx1ZSI6MS4wLCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImVmMjMxYjQ5NTk3ZDRiZDViMmI4OThkNjcxODIwY2U3IiwiQHR5cGUiOiJzdHJpbmcifV1d"}

If I query on the email field, I get the expected response:
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "email:setup*"}'

{"total_hits_relation":"EQUAL_TO","total_hits":1,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"setup@sw360.org"}}],"counts":null,"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiMTFhN2MyOWRlZjJjNDMwNGE5N2RiODEyNTIxYmQ4MmMiLCJAdHlwZSI6InN0cmluZyJ9XV0="}

But with the lastname field, I get nothing:
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "lastname:Administrator"}'

{"total_hits_relation":"EQUAL_TO","total_hits":0,"ranges":null,"hits":[],"counts":null,"bookmark":"W10="}

I tried multiple times with lastname:admin*, lastname:administrator and lastname:Administrator, but got no hits, even with different analyzers. The behavior is the same for the other field, givenname. Querying only works for email, with different Lucene syntax.

Expected Behaviour

Expected to be able to query the index on the other fields as well.

Your Environment

$ curl --user "admin:admin" 'http://localhost:5984'

{"couchdb":"Welcome","version":"3.3.3-29db2df","git_sha":"29db2df","uuid":"8722f4f42d4f2d566be241e6035df095","features":["nouveau","access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}

Nouveau is also configured in default.ini using default ./rel/couchdb/etc/nouveau.yaml:

[nouveau]
enable = true
url = http://127.0.0.1:5987
  • CouchDB version used: 3.3.3-29db2df
  • Browser name and version: curl 8.2.1 (x86_64-conda-linux-gnu) libcurl/8.2.1 OpenSSL/3.0.10 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.52.0
  • Operating system and version: Ubuntu 22.04.4 LTS

Additional Context

Using counts to aggregate index values works just as expected.
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "*:*", "counts": ["lastname"]}'

{"total_hits_relation":"EQUAL_TO","total_hits":10,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"setup@sw360.org"}},{"order":[{"value":1.0,"@type":"float"},{"value":"2a7cedcf38e24ebbade7a23f3f07f793","@type":"string"}],"id":"2a7cedcf38e24ebbade7a23f3f07f793","fields":{"lastname":"Clearing2","givenname":"Test","email":"clearing_admin2@sw360.org"}}],"counts":{"lastname":{"User2":1,"User1":1,"User":1,"Clearing2":1,"Clearing1":1,"Clearing":1,"Administrator":1,"Admin2":1,"Admin1":1,"Admin":1}},"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiZWIzM2U2ZGI1YTE1NDAxNjgxMDg4OWQ4ZTU0NWZmODIiLCJAdHlwZSI6InN0cmluZyJ9XSxbeyJ2YWx1ZSI6MS4wLCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImVmMjMxYjQ5NTk3ZDRiZDViMmI4OThkNjcxODIwY2U3IiwiQHR5cGUiOiJzdHJpbmcifV1d"}

$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau_info/users'

{"name":"_design/nouveau_user/users","search_index":{"update_seq":628,"purge_seq":0,"num_docs":10,"disk_size":8521}}
@rnewson
Member

rnewson commented Feb 29, 2024

thank you for the detailed report, I will look into it.

My first thought is that the query parser is transforming your "Administrator" to "administrator", but as it was indexed as "string" and not "text" it is held as "Administrator" in the index itself, and thus doesn't match.

Assuming that's it, I agree that the query parser should not do this for string fields and I will make a fix.

@GMishx
Contributor Author

GMishx commented Feb 29, 2024

But I tried to query with "admin", "administrator" and "Administrator" without any luck. The same holds for other values like "User". So I suspect that the index-time and query-time analysis are not producing the same value. Email is a special case: the analyzer does not change the value, so it matches.

@rnewson
Member

rnewson commented Feb 29, 2024

Yes, I mean in the index it is "Administrator" but you are not able to query with the "A" as the query parser converts it with the standard analyzer. You might say 'q=foo:Administrator' but the query parser is making it a term query on "administrator".

@rnewson
Member

rnewson commented Feb 29, 2024

E.g., if you specified the "keyword" analyzer for the lastname field, the query parser won't lowercase it for you and it should then match.
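
To make the mismatch concrete, here is a toy Java model of it. This is only a sketch under simplifying assumptions: it is not Lucene or nouveau code, the class and method names are made up for illustration, and it ignores tokenization and stemming. A field is modelled as a set of stored terms, and a term query matches only if the analyzed query term equals a stored term exactly.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy model only (hypothetical names): a field is just the set of exact terms stored for it.
final class StringFieldDemo {
    // Stand-in for an analyzer that lowercases, as most Lucene analyzers do.
    static final UnaryOperator<String> LOWERCASING = s -> s.toLowerCase(Locale.ROOT);
    // Stand-in for the "keyword" analyzer: the input stays one unchanged token.
    static final UnaryOperator<String> KEYWORD = s -> s;

    // A term query matches only if the analyzed query term equals a stored term exactly.
    static boolean matches(Set<String> storedTerms, String queryTerm,
                           UnaryOperator<String> queryAnalyzer) {
        return storedTerms.contains(queryAnalyzer.apply(queryTerm));
    }

    public static void main(String[] args) {
        // "string" type: the value is stored verbatim, with no analysis at index time.
        Set<String> asString = Set.of("Administrator");
        // "text" type: the value is analyzed (here: only lowercased) at index time.
        Set<String> asText = Set.of(LOWERCASING.apply("Administrator"));

        System.out.println(matches(asString, "Administrator", LOWERCASING)); // false
        System.out.println(matches(asString, "Administrator", KEYWORD));     // true
        System.out.println(matches(asText, "Administrator", LOWERCASING));   // true
    }
}
```

The first line is the situation in this report, the second is the "keyword" analyzer workaround, and the third is what indexing with the "text" type would give.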

@GMishx
Contributor Author

GMishx commented Mar 1, 2024

> Yes, I mean in the index it is "Administrator" but you are not able query with the "A" as the query parser converts with the standard analyzer. you might say 'q=foo:Administrator' but the query parser is making it a term query on "administrator".

Can confirm that's the case. I changed the doc to have "lastname": "administrator", and then the query q=lastname:admin* gave back the expected result.

@GMishx
Contributor Author

GMishx commented Mar 1, 2024

https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/document/StringField.html vs https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/document/TextField.html btw

I also changed the index to use type text rather than string, as in index("text", "lastname", doc.lastname, {"store": true});, and queries started working even for uppercase "Admin1".

If that's the expected behavior, I can update the index creation docs.

@rnewson
Member

rnewson commented Mar 1, 2024

"text" type means the value is analyzed. the Lucene analyzers typically force to lower case among other effects, which explains the new success.

@GMishx
Contributor Author

GMishx commented Mar 1, 2024

Yeah, I got to understand that from the (archived) couchdb-lucene project

@rnewson, would you still like to work on the lower-casing of string queries (I can also try to look into the issue)?
Or should I close the issue, since I was using the wrong field type?

@rnewson
Member

rnewson commented Mar 1, 2024

I still intend to make an enhancement. Assuming I'm right in my first comment, you did nothing wrong, and I would like nouveau to do the right thing.

We know that "string" fields will not be analyzed, so we need to tell the query parser to also not analyze the query string for "string" fields (nouveau knows the index definition, so it does know which fields are "string" or "text", etc.).

rnewson added a commit that referenced this issue Mar 25, 2024
rnewson added a commit that referenced this issue Mar 25, 2024
rnewson added a commit that referenced this issue Mar 25, 2024
rnewson added a commit that referenced this issue Mar 26, 2024
@rnewson
Member

rnewson commented Mar 26, 2024

rehi (I've been out on vacation the last few weeks).

I've mocked up a few approaches to this locally and I don't like any of them: they all either have a non-trivial overhead or other odd side-effects more surprising than what you've encountered.

I think the right move is to clarify that if you index with type "string" and you intend to search on that field (as opposed to only sorting on it, for example), then you need to specify the "keyword" analyzer for that field in the index definition. If you do that, everything works out nicely.

In your case I think you actually do want the "text" type for "lastname", so that you can search case-insensitively, but only you know for sure.

@GMishx
Contributor Author

GMishx commented Mar 26, 2024

Hey, thanks for the updates. The documentation makes the type field more clear in #5018.

@rnewson
Member

rnewson commented Mar 26, 2024

no problem!

@GMishx
Contributor Author

GMishx commented Mar 28, 2024

@rnewson Even with the suggestions from #5018, a different case still fails.

I have a document with the fields name and version. Version is stored as a string in CouchDB and sample values are "1", "2", "4.2.0". With the default analyzer (simple_asciifolding), these values were getting lost:

$ curl --user "admin:admin" 'http://localhost:5984/_nouveau_analyze' -X POST -H 'Content-Type: application/json' -d '{"analyzer": "simple_asciifolding", "text": "4.2.0"}' | jq
{
  "tokens": []
}
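
The empty token list is what you would expect if simple_asciifolding tokenizes on letters only, as Lucene's SimpleAnalyzer does. That is an assumption here, not something confirmed in this thread. A minimal Java sketch of such a letter-only, lowercasing tokenizer (hypothetical class and method names, ASCII folding left out):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a letter-only tokenizer (hypothetical; not the nouveau or Lucene implementation).
final class LetterTokenizerDemo {
    // Emit each maximal run of letters as one lowercased token; everything else is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("4.2.0"));          // []
        System.out.println(tokenize("Test Clearing1")); // [test, clearing]
    }
}
```

Digits and dots are all separators in this model, so a version string made only of those produces no tokens at all and nothing is left to search on.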

Thus I created the index as follows:

{
  "_id": "_design/lucene",
  "_rev": "238-6e02d3801cc64311f5244cb242855e82",
  "nouveau": {
    "projects": {
      "default_analyzer": "keyword",
      "field_analyzers": {
        "version": "keyword"
      },
      "index": "function(doc) {\nif(doc.version !== undefined && doc.version != null && doc.version.length >0) {\n      index('text', 'version', doc.version, {'store': true});\n    }\n}"
    }
  }
}

Notice I set the field_analyzers entry for the version field to keyword, as #5018 suggested. I tried the field type string as well as text. But in all cases, I cannot find the document when querying with "version:4.2.0" or "version:1".

@rnewson
Member

rnewson commented Mar 28, 2024

You specify the keyword analyzer for all fields. For input "4.2.0" that tokenizes to "4.2.0".

Can you show the result of querying the index with ?q=version:4.2.0, ?q=version:"4.2.0" and finally a ?q=_id: with the doc id of the doc with a version of 4.2.0 (e.g., ?q=_id:doc1)?

@rnewson rnewson reopened this Mar 28, 2024
@GMishx
Contributor Author

GMishx commented Mar 28, 2024

Here are the outputs as requested. I am getting the same results for GET and POST queries.

Output of analyze:

$ curl --user "admin:admin" --silent 'http://localhost:5984/_nouveau_analyze' -X POST -H 'Content-Type: application/json' -d '{"analyzer": "keyword", "text": "4.2.0"}' | jq
{
  "tokens": [
    "4.2.0"
  ]
}

Output of version:4.2.0:

$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "version:4.2.0"}' | jq
{
  "total_hits_relation": "EQUAL_TO",
  "total_hits": 0,
  "ranges": null,
  "hits": [],
  "counts": null,
  "bookmark": "W10="
}

Output of version:"4.2.0":

$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "version:\"4.2.0\""}' | jq
{
  "total_hits_relation": "EQUAL_TO",
  "total_hits": 0,
  "ranges": null,
  "hits": [],
  "counts": null,
  "bookmark": "W10="
}

Output with the doc match:

$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "_id:c26ccf0179c14f22bf6f9d0d55acd2dc"}' | jq
{
  "total_hits_relation": "EQUAL_TO",
  "total_hits": 1,
  "ranges": null,
  "hits": [
    {
      "order": [
        {
          "value": 0.44583148,
          "@type": "float"
        },
        {
          "value": "c26ccf0179c14f22bf6f9d0d55acd2dc",
          "@type": "string"
        }
      ],
      "id": "c26ccf0179c14f22bf6f9d0d55acd2dc",
      "fields": {
        "version": "4.2.0",
        "state": "ACTIVE",
        "projectType": "CUSTOMER",
        "name": "fossology",
        "clearingState": "IN_PROGRESS",
        "businessUnit": "DEPARTMENT"
      }
    }
  ],
  "counts": null,
  "bookmark": "W1t7InZhbHVlIjowLjQ0NTgzMTQ4LCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImMyNmNjZjAxNzljMTRmMjJiZjZmOWQwZDU1YWNkMmRjIiwiQHR5cGUiOiJzdHJpbmcifV1d"
}

@rnewson
Member

rnewson commented Mar 28, 2024

thanks.

@rnewson
Member

rnewson commented Mar 28, 2024

ok, the short answer is that the (nouveau-specific) query parser interprets "4.2.0" as a number and performs a numeric query, not a text/string query. I'm surprised by that, but obviously the same would be true for "4", etc.

This is a very helpful thread btw, these are exactly the issues I want to confront before removing the 'experimental' label from nouveau.

@rnewson
Member

rnewson commented Mar 28, 2024

jshell> var nf = NumberFormat.getInstance(Locale.getDefault());
nf ==> java.text.DecimalFormat@674dc

jshell> nf.parse("4");
$5 ==> 4

jshell> nf.parse("4.2");
$6 ==> 4.2

jshell> nf.parse("4.2.0");
$7 ==> 4.2

That's core Java behaviour.
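
The same behaviour can be shown outside jshell. NumberFormat.parse(String) parses from the start of the string and does not have to consume all of it, so "4.2.0" yields 4.2 with ".0" left over. The sketch below uses the ParsePosition overload to see where parsing stopped and to reject inputs with leftovers. It only illustrates the JDK behaviour; it is not how nouveau was changed, and the class and method names are made up.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

// Illustration of lenient vs. whole-input number parsing (hypothetical helper names).
final class StrictNumberCheck {
    // Returns the parsed number only if the entire input was consumed, otherwise null.
    static Number parseWholeInput(String text) {
        // Locale.US is fixed here so the decimal separator is always '.'.
        NumberFormat nf = NumberFormat.getInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        Number n = nf.parse(text, pos);
        return (n != null && pos.getIndex() == text.length()) ? n : null;
    }

    public static void main(String[] args) {
        NumberFormat nf = NumberFormat.getInstance(Locale.US);
        ParsePosition pos = new ParsePosition(0);
        Number lenient = nf.parse("4.2.0", pos);
        // Parsing stops at the second '.', so only "4.2" is consumed.
        System.out.println(lenient + " parsed, stopped at index " + pos.getIndex());

        System.out.println(parseWholeInput("4.2"));   // 4.2
        System.out.println(parseWholeInput("4.2.0")); // null
    }
}
```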

@GMishx
Contributor Author

GMishx commented Mar 28, 2024

BTW for context, I am migrating the sw360 project, which currently uses couchdb-lucene, to nouveau.

@rnewson
Member

rnewson commented Mar 28, 2024

that's helpful to know, thanks. I'm looking at changing the "magical" nature of numeric queries. I extended/altered the basic lucene query syntax to auto-detect numbers but it has always been a bit awkward (as you've re-discovered).

so I'm looking at a syntax extension that lets you tell nouveau that you intend to look for "2" as a string or as a number, explicitly.

@rnewson
Member

rnewson commented Mar 29, 2024

posted a draft PR that addresses this, with some extensive prose on whether it's a good idea or not.

@GMishx
Contributor Author

GMishx commented Apr 1, 2024

Would it make sense to use the field type from the index? We already have the types double, string and text to differentiate the kinds of values. (just guessing)

@rnewson
Member

rnewson commented Apr 10, 2024

@GMishx I merged a fix for this, but note that I had to change how some things work (you can see the documentation diff in #5021). Essentially you don't need to put a type indicator at the end of the field name when sorting.

what should now happen is you can index a field as a number or a string and the right kind of query will be used. Please give it a try.
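
As a rough illustration of that idea (choosing the query kind from the field's declared type instead of guessing from the query text), here is a small Java sketch. It is not the code merged for this issue: the schema map, the "size" field and all names are hypothetical, and the method only returns a description of the query it would build.

```java
import java.util.Map;

// Hypothetical sketch: pick the query kind from the declared field type.
final class QueryKindDemo {
    enum FieldType { STRING, TEXT, DOUBLE }

    // Assumed schema: field name -> type used in the index() call. "size" is made up.
    static final Map<String, FieldType> SCHEMA = Map.of(
            "version", FieldType.STRING,
            "name", FieldType.TEXT,
            "size", FieldType.DOUBLE);

    // Describe the query that would be built for field:value.
    static String describeQuery(String field, String value) {
        FieldType type = SCHEMA.get(field);
        if (type == null) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        switch (type) {
            case DOUBLE:
                // Only numeric fields have their value parsed as a number.
                return "numeric query on " + field + " = " + Double.parseDouble(value);
            case STRING:
                // String fields keep the value verbatim, even if it looks numeric.
                return "exact term query on " + field + " = " + value;
            default:
                return "analyzed text query on " + field + " = " + value;
        }
    }

    public static void main(String[] args) {
        System.out.println(describeQuery("version", "4.2.0")); // exact term query on version = 4.2.0
        System.out.println(describeQuery("size", "2"));        // numeric query on size = 2.0
    }
}
```

With a lookup like this, "4.2.0" or "1" on a string field is never reinterpreted as a number.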

@GMishx
Contributor Author

GMishx commented Apr 15, 2024

I can confirm the indexing is now working as expected for the mentioned issue. Thanks for the quick fix, @rnewson. I can index and query the values "4.2.0", "1" and "2".

I will test it further with other values as well and update here.

@rnewson
Member

rnewson commented Apr 15, 2024

thanks for the confirmation, I like this change and your issue was the nudge I needed to make this improvement, so thank you.
