Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape is not working for HTTP and SQL #2131

Open
8 tasks
scruel opened this issue May 4, 2024 · 4 comments
Open
8 tasks

Escape is not working for HTTP and SQL #2131

scruel opened this issue May 4, 2024 · 4 comments
Labels
waiting Waiting for the original poster (in most cases) or something else

Comments

@scruel
Copy link

scruel commented May 4, 2024

Bug Description:

For index as following:

#DROP TABLE testx;
MySQL [(none)]> CREATE TABLE testx(content text) morphology='stem_en';
MySQL [(none)]> INSERT INTO testx(content) VALUES('the next test gen');
MySQL [(none)]> INSERT INTO testx(content) VALUES('the next-gen test gen');
MySQL [(none)]> INSERT INTO testx(content) VALUES('the next-generation test gen');
MySQL [(none)]> INSERT INTO testx(content) VALUES('the next');
MySQL [(none)]> INSERT INTO testx(content) VALUES('the -gen');
MySQL [(none)]> INSERT INTO testx(content) VALUES('the gen');

Unexpected behavior from SQL side:

MySQL [(none)]> SELECT * FROM testx WHERE MATCH('@content "next-gen"');
+---------------------+-----------------------+
| id                  | content               |
+---------------------+-----------------------+
| 7784835685113921561 | the next-gen test gen |
+---------------------+-----------------------+
MySQL [(none)]> SELECT * FROM testx WHERE MATCH('@content next-gen');
+---------------------+------------------------------+
| id                  | content                      |
+---------------------+------------------------------+
| 7784835685113921561 | the next-gen test gen        |
| 7784835685113921562 | the next-generation test gen |
| 7784835685113921560 | the next test gen            |
+---------------------+------------------------------+
MySQL [(none)]> SELECT * FROM testx WHERE MATCH('@content next\-gen');
+---------------------+------------------------------+
| id                  | content                      |
+---------------------+------------------------------+
| 7784835685113921561 | the next-gen test gen        |
| 7784835685113921562 | the next-generation test gen |
| 7784835685113921560 | the next test gen            |
+---------------------+------------------------------+
MySQL [(none)]> SELECT * FROM testx WHERE MATCH('@content next\\-gen');
+---------------------+------------------------------+
| id                  | content                      |
+---------------------+------------------------------+
| 7784835685113921561 | the next-gen test gen        |
| 7784835685113921562 | the next-generation test gen |
| 7784835685113921560 | the next test gen            |
+---------------------+------------------------------+

Why next\\-gen keyword gives same result as next-gen?

Unexpected behavior from HTTP side:

>>> search_req = SearchRequest(
>>>     index='testx',
>>>     query={
>>>     "match": {
>>>       "content": {"query":"next-gen"},
>>>     }}
>>> )
>>> print([x['_source']['content'] for x in searchApi.search(search_req).hits.hits])
['the next-gen test gen', 'the next', 'the gen', 'the -gen', 'the next-generation test gen', 'the next test gen']
>>> search_req = SearchRequest(
>>>     index='testx',
>>>     query={
>>>     "match": {
>>>       "content": {"query":"next\\-gen"},
>>>     }}
>>> )
>>> print([x['_source']['content'] for x in searchApi.search(search_req).hits.hits])
['the next-gen test gen', 'the next', 'the gen', 'the -gen', 'the next-generation test gen', 'the next test gen']
>>> search_req = SearchRequest(
>>>     index='testx',
>>>     query={
>>>     "match": {
>>>       "content": {"query":"next\\\\-gen"},
>>>     }}
>>> )
>>> print([x['_source']['content'] for x in searchApi.search(search_req).hits.hits])
['the next-gen test gen', 'the next', 'the gen', 'the -gen', 'the next-generation test gen', 'the next test gen']

Reproduce HTTP side returned result from SQL side:

MySQL [(none)]> SELECT * FROM testx WHERE MATCH('@content next|"-gen"');
+---------------------+------------------------------+
| id                  | content                      |
+---------------------+------------------------------+
| 7784835685113921561 | the next-gen test gen        |
| 7784835685113921563 | the next                     |
| 7784835685113921565 | the gen                      |
| 7784835685113921564 | the -gen                     |
| 7784835685113921562 | the next-generation test gen |
| 7784835685113921560 | the next test gen            |
+---------------------+------------------------------+

Why next-gen keyword gives same result as next|"-gen"?

Manticore Search Version:

Manticore 6.2.12 dc5144d@230822

Operating System Version:

Linux 6.5.0-14-generic 22.04.1-Ubuntu 2 x86_64 GNU/Linux

Have you tried the latest development version?

  • Yes

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

  • Task estimated
  • Specification created, reviewed, and approved
  • Implementation completed
  • Tests developed
  • Documentation updated
  • Documentation proofread
  • Changelog updated
@scruel scruel added the bug label May 4, 2024
@sanikolaev
Copy link
Collaborator

Why next\-gen keyword gives same result as next-gen?

Because - is not in charset_table.

Why next-gen keyword gives same result as next|"-gen"?

Because - is not in charset_table, therefore it's a separator, same as . or , etc. So next-gen is tokenized as next gen. next|"-gen" is tokenized the same way.

@sanikolaev sanikolaev added waiting Waiting for the original poster (in most cases) or something else and removed bug labels May 6, 2024
@scruel
Copy link
Author

scruel commented May 6, 2024

According to your document, we should be able to escape -:

The following characters should be escaped using a backslash ():
! " $ ' ( ) - / < @ \ ^ | ~

But is {"match": {"content": {"query":"next-gen"}}} matching next|"-gen" excepted?

@JubilantJerry
Copy link

JubilantJerry commented May 6, 2024

I'm not a dev so correct me if I'm wrong, but my understanding is that
SELECT * FROM testx WHERE MATCH('@content "next-gen"')
behaves identically to all of these
SELECT * FROM testx WHERE MATCH('@content "next gen"')
SELECT * FROM testx WHERE MATCH('@content "next -gen"')
SELECT * FROM testx WHERE MATCH('@content "next\-gen"')
SELECT * FROM testx WHERE MATCH('@content "next \-gen"')

But
SELECT * FROM testx WHERE MATCH('@content next-gen')
will be different from
SELECT * FROM testx WHERE MATCH('@content next -gen')
where the former is identical to
SELECT * FROM testx WHERE MATCH('@content next \-gen')

charset_table defaults to treating - (and in fact all the other characters that need escaping) as a separator between keywords, but charset_table can be modified to treat these characters as actual text.

Regardless, without escaping the characters, ! " $ ' ( ) - / < @ \ ^ | ~ will be interpreted as search operators which can cause unexpected changes to the search query. With the default charset_table, the desired behavior is for these characters to be treated the same as whitespace, not as search operators.

@sanikolaev
Copy link
Collaborator

According to your document, we should be able to escape -

You can escape it, but it has to be in the charset_table for the escaping to take effect, e.g.:

w/o - in charset_table:

mysql> drop table if exists t; create table t(f text); insert into t values(1, 'next-gen'),(2, 'next gen'),(3, 'next.gen'); select * from t where match('next-gen'); select * from t where match('next\\-gen');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t values(1, 'next-gen'),(2, 'next gen'),(3, 'next.gen')
--------------

Query OK, 3 rows affected (0.00 sec)

--------------
select * from t where match('next-gen')
--------------

+------+----------+
| id   | f        |
+------+----------+
|    1 | next-gen |
|    2 | next gen |
|    3 | next.gen |
+------+----------+
3 rows in set (0.00 sec)
--- 3 out of 3 results in 0ms ---

--------------
select * from t where match('next\\-gen')
--------------

+------+----------+
| id   | f        |
+------+----------+
|    1 | next-gen |
|    2 | next gen |
|    3 | next.gen |
+------+----------+
3 rows in set (0.00 sec)
--- 3 out of 3 results in 0ms ---

with - in charset_table

mysql> drop table if exists t; create table t(f text) charset_table='non_cjk, -'; insert into t values(1, 'next-gen'),(2, 'next gen'),(3, 'next.gen'); select * from t where match('next-gen'); select * from t where match('next\\-
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text) charset_table='non_cjk, -'
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t values(1, 'next-gen'),(2, 'next gen'),(3, 'next.gen')
--------------

Query OK, 3 rows affected (0.00 sec)

--------------
select * from t where match('next-gen')
--------------

+------+----------+
| id   | f        |
+------+----------+
|    1 | next-gen |
+------+----------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 1ms ---

--------------
select * from t where match('next\\-gen')
--------------

+------+----------+
| id   | f        |
+------+----------+
|    1 | next-gen |
+------+----------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---

Note, in the above the escaping didn't change anything, but it's important when - can be considered the negation operation:

mysql> insert into t values(4, '-gen');
--------------
insert into t values(4, '-gen')
--------------

Query OK, 1 row affected (0.01 sec)

mysql> select * from t where match('-gen');
--------------
select * from t where match('-gen')
--------------

ERROR 1064 (42000): table t: query error: query is non-computable (single NOT operator)

mysql> select * from t where match('\\-gen');
--------------
select * from t where match('\\-gen')
--------------

+------+------+
| id   | f    |
+------+------+
|    4 | -gen |
+------+------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting Waiting for the original poster (in most cases) or something else
Projects
None yet
Development

No branches or pull requests

3 participants