Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search is case sensitive for non-ASCII messages #5052

Open
link2xt opened this issue Nov 27, 2023 · 4 comments · May be fixed by #5321
Open

Search is case sensitive for non-ASCII messages #5052

link2xt opened this issue Nov 27, 2023 · 4 comments · May be fixed by #5321
Assignees
Labels
enhancement New feature or request

Comments

@link2xt
Copy link
Collaborator

link2xt commented Nov 27, 2023

In SQLite search with LIKE is case-insensitive:

AND txt LIKE ?

This however does not work for non-english letters. Using functions like UPPER also does not help as they only work for ASCII and using a function on txt column will prevent using indexes anyway.

It is a well-known problem:
https://shallowdepth.online/posts/2022/01/5-ways-to-implement-case-insensitive-search-in-sqlite-with-full-unicode-support/

The solution is to create a new column txt_normalized defaulting to NULL (so we do not bump up the database size in a migration) and storing lowercased/normalized text there when the row is created. When doing a search, search over IFNULL(txt_normalized, txt).

@link2xt link2xt added the enhancement New feature or request label Nov 27, 2023
@r10s
Copy link
Member

r10s commented Nov 27, 2023

that would double the space needed for stored text, maybe okay when compared to what blobs need.

what would be another consideration for having an extra field for searching is that we could also add other things there - as names of files or webxdc apps - currently, you cannot search for that. appending these information to txt_normalized can make that accessible without degrading performance

that said, there is also FULLTEXT and ICU extensions, that may be alternatives (ICU is mentioned in the linked post to be complicated, however, not sure if that also applied to rust). for FULLTEXT, it is still astonishing how far and fast we can go without

@link2xt
Copy link
Collaborator Author

link2xt commented Nov 27, 2023

ICU is mentioned in the linked post to be complicated, however, not sure if that also applied to rust

It is complicated, we currently just use vendored SQLCipher and managing our own version of SQLite with SQLCipher and additional extensions is going to make compiling the core more difficult for everyone. There is an issue in rusqlite repo: rusqlite/rusqlite#531

@link2xt
Copy link
Collaborator Author

link2xt commented Dec 17, 2023

This migration passes by the way:

+    if dbversion < 106 {
+        sql.execute_migration(
+            r"CREATE VIRTUAL TABLE search USING fts5(text)",
+            106
+        ).await?;
+    }

So we can use https://www.sqlite.org/fts5.html

@iequidoo iequidoo self-assigned this Mar 4, 2024
iequidoo added a commit that referenced this issue Mar 4, 2024
@iequidoo
Copy link
Collaborator

iequidoo commented Mar 4, 2024

Tried FTS5, but this approach has its own problems. We should use MATCH instead of LIKE to perform a case-insensitive search, but then we lose a capability of matching against parts of words (if we use the unicode61 tokeniser). We can use the trigram tokeniser then for substring matching, but anyway MATCH should be used for case-insensitive search (it's not stated explicitly in the docs, but i failed to get LIKE working). That means we should escape the search pattern correctly (when using LIKE we just pass %pattern% f.e.). And also queries of fewer than 3 chars wouldn't work:

Substrings consisting of fewer than 3 unicode characters do not match any rows when used with a full-text query.

I think the most simple way that doesn't break the current UX also is what @link2xt suggested initially:

The solution is to create a new column txt_normalized defaulting to NULL (so we do not bump up the database size in a migration) and storing lowercased/normalized text there when the row is created. When doing a search, search over IFNULL(txt_normalized, txt).

Moreover, if a normalised text doesn't differ from the source one or contains only ASCII, we can omit storing it thus reducing the db size.

EDIT: Note that matching against parts of words is critical for CJK langs, i don't know if there are any tokenisers for them.

iequidoo added a commit that referenced this issue Mar 5, 2024
SQLite search with `LIKE` is case-insensitive only for ASCII chars. To make it case-insensitive for
all messages, create a new column `msgs.txt_normalized` defaulting to `NULL` (so we do not bump up
the database size in a migration) and storing lowercased/normalized text there when the row is
created/updated. When doing a search, search over `IFNULL(txt_normalized, txt)`.
iequidoo added a commit that referenced this issue May 20, 2024
SQLite search with `LIKE` is case-insensitive only for ASCII chars. To make it case-insensitive for
all messages, create a new column `msgs.txt_normalized` defaulting to `NULL` (so we do not bump up
the database size in a migration) and storing lowercased/normalized text there when the row is
created/updated. When doing a search, search over `IFNULL(txt_normalized, txt)`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants