Support tokenize function #45145

dujijun007 · 2024-05-06T15:10:46Z

Enhancement

Why I'm doing:

StarRocks has implemented GIN (Generalized Inverted Index), which works by tokenizing field values into individual tokens and building a dictionary from them. Users can then run different kinds of searches against this dictionary. However, different tokenizers produce different results, so it is not obvious to users how a field's original text is split into specific tokens.

What I'm doing:

Support a tokenize function that lets users run a specific tokenizer on a piece of text and inspect the resulting tokens easily.

Description

Function definition

function tokenize(tokenizer_name: string, content: string) -> list of strings

Input and output

// input
tokenizer_name: must be one of the existing tokenizers. For now, only chinese, english, and standard are supported.
content: the text to tokenize. Note that the result is only as expected when the language of the text matches the specified tokenizer.

// output
tokens: the list of strings produced by splitting and analyzing the content with the tokenizer

Example

// tokenize with english
mysql> SELECT tokenize('english', 'Today is saturday');
+------------------------------------------+
| tokenize('english', 'Today is saturday') |
+------------------------------------------+
| ["today","is","saturday"]                |
+------------------------------------------+
1 row in set (0.00 sec)

// count word frequency
mysql> select unnest, count(*) as count
    -> from t_tokenized_table, unnest(tokenize('english', english_text)) as unnest
    -> group by unnest order by count;
+----------+-------+
| unnest   | count |
+----------+-------+
| world    |     1 |
| comes    |     1 |
| tap      |     1 |
| the      |     1 |
| from     |     1 |
| sea      |     1 |
| shanghai |     1 |
| water    |     1 |
| hello    |     2 |
+----------+-------+
9 rows in set (0.06 sec)
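
The word-frequency query above assumes a table t_tokenized_table with a text column english_text, neither of which is defined in this issue. Below is a minimal sketch of one setup that would be consistent with the counts shown. The id column, the key and distribution clauses, and the three sample rows are assumptions chosen for illustration, not the data actually used:

```sql
-- Hypothetical table: only the names t_tokenized_table and english_text
-- come from the query above; everything else is assumed.
CREATE TABLE t_tokenized_table (
    id BIGINT NOT NULL,
    english_text STRING
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH (id);

-- Hypothetical rows: 10 words in total, 9 distinct, "hello" appearing twice.
INSERT INTO t_tokenized_table VALUES
    (1, 'hello world'),
    (2, 'hello shanghai'),
    (3, 'tap water comes from the sea');
```

On a single-node test cluster, adding PROPERTIES ("replication_num" = "1") to the CREATE TABLE statement may also be needed.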

Notice

This function works independently and does not require a GIN index on the column. However, calling it at query time to tokenize massive data and build a dictionary is not advisable because of the poor performance. Nor does the user need to call the tokenize function explicitly to build a dictionary at write time. Because the function tokenizes text the same way the index does, it is best suited to troubleshooting search results that are difficult to understand.
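
For troubleshooting, one way to use the function is to run the same text through each supported tokenizer and compare the results side by side. This is a sketch that uses only the signature and tokenizer names given above. The sample text is arbitrary, and the output is omitted because it depends on each tokenizer's rules:

```sql
-- Compare how each supported tokenizer splits the same mixed-language text.
SELECT
    tokenize('english',  'StarRocks 支持全文检索 full-text search') AS english_tokens,
    tokenize('chinese',  'StarRocks 支持全文检索 full-text search') AS chinese_tokens,
    tokenize('standard', 'StarRocks 支持全文检索 full-text search') AS standard_tokens;
```

If a search term does not appear in the token list produced by the tokenizer the column's index uses, that would explain why a search for that term returns no rows.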

dujijun007 added the type/enhancement Make an enhancement to StarRocks label May 6, 2024

dujijun007 linked a pull request May 6, 2024 that will close this issue

[Enhancement] Support tokenize function #45119

Open

24 tasks
