Support tokenize function #45145

Open
dujijun007 opened this issue May 6, 2024 · 0 comments · May be fixed by #45119
Labels
type/enhancement Make an enhancement to StarRocks


dujijun007 commented May 6, 2024

Enhancement

Why I'm doing:

StarRocks has implemented GIN (Generalized Inverted Index), which tokenizes field values into individual tokens and builds a dictionary from them. Users can then perform different semantic searches against this dictionary. However, different tokenizers produce different tokens from the same text, so it is not obvious to users how the original field text is split into specific tokens.

What I'm doing:

Support a tokenize function that lets users apply a specific tokenizer to a piece of text and easily inspect the resulting tokens.

Description

Function definition

function tokenize(tokenizer_name: string, content: string) -> list of strings

Input and output

// input 
tokenizer_name: must be one of the existing tokenizers. For now, only chinese, english, and standard are supported.
content: the text to tokenize. Note that the result is only as expected when the language of the text matches the specified tokenizer.

// output 
tokens: the content, split and analyzed by the tokenizer

Example

// tokenize with english
mysql> SELECT tokenize('english', 'Today is saturday');
+------------------------------------------+
| tokenize('english', 'Today is saturday') |
+------------------------------------------+
| ["today","is","saturday"]                |
+------------------------------------------+
1 row in set (0.00 sec)

// count word frequency
mysql> select unnest, count(*) as count
    -> from t_tokenized_table, unnest(tokenize('english', english_text)) as unnest
    -> group by unnest order by count;
+----------+-------+
| unnest   | count |
+----------+-------+
| world    |     1 |
| comes    |     1 |
| tap      |     1 |
| the      |     1 |
| from     |     1 |
| sea      |     1 |
| shanghai |     1 |
| water    |     1 |
| hello    |     2 |
+----------+-------+
9 rows in set (0.06 sec)
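
For readers unfamiliar with the unnest pattern, the word-frequency query above can be mirrored outside the database. The following is a minimal Python sketch, not StarRocks code. `simple_english_tokenize` is a hypothetical stand-in that only lowercases the text and extracts runs of letters and digits (the real english tokenizer may split or normalize differently), and the sample rows are made up for illustration.

```python
import re
from collections import Counter


def simple_english_tokenize(content: str) -> list[str]:
    """Hypothetical stand-in for tokenize('english', content).

    Lowercases the text and extracts runs of ASCII letters and digits.
    The real StarRocks tokenizer may split or normalize differently.
    """
    return re.findall(r"[a-z0-9]+", content.lower())


def word_frequency(rows: list[str]) -> list[tuple[str, int]]:
    """Mirror of: unnest(tokenize(...)), GROUP BY token, ORDER BY count."""
    counts = Counter()
    for text in rows:  # one iteration per table row
        counts.update(simple_english_tokenize(text))  # "unnest" the tokens
    # Sort by ascending count, then by token so ties have a stable order.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Made-up rows standing in for t_tokenized_table.english_text.
sample_rows = ["Hello world", "Hello Shanghai"]
frequencies = word_frequency(sample_rows)
print(frequencies)
```

Each row contributes one count per token it contains, so a token that appears in several rows (here `hello`) ends up with a higher count, just as in the query result above.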

Notice

This function works independently and does not require a GIN to be built on a column. However, calling it at query time to tokenize massive data and construct a dictionary is not advisable because of the poor performance. Nor does the user need to call the tokenize function explicitly to build a dictionary at write time. Because the function and the index tokenize text in the same way, the function is best suited to troubleshooting search results that are difficult to understand.

@dujijun007 dujijun007 added the type/enhancement Make an enhancement to StarRocks label May 6, 2024
@dujijun007 dujijun007 linked a pull request May 6, 2024 that will close this issue