Added a module to split Japanese words #3158

miku0 · 2023-08-19T08:48:26Z

This code divides Japanese addresses into three parts based on administrative divisions: cities, municipalities, and everything below.
Nominatim applies ICU (International Components for Unicode) transliteration to user-entered addresses and splits them into meaningful words. The debug output below shows an example; the split produces many candidates.

Fig. 1 Example debug output.

To make this division more accurate, when the string contains large administrative divisions (prefecture and city), the algorithm pre-separates them and puts "," markers between the split words.
This "," is mapped to BreakType.SOFT_PHRASE in the program, and words that span such a node are penalized with a lower search priority.
The node relationship is as follows:
(1)--da->(2)--ban->(3)--shi->(4)--da->(5)--ban->(6)
||                          ^^                    ||
|+------大阪市--------------+ +-------大阪--------+|
+-------------------大阪市大阪---------------------+
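
The pre-separation step described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the function name `presplit_japanese_address`, the regular expression, and the `marker` parameter are all assumptions made for the example. The marker defaults to "," as in the description; the diff in this PR defines `SOFT_PHRASE` as ':', so the character is left configurable.

```python
import re

# Hypothetical pattern (not taken from the PR):
#   prefecture: 東京都, 北海道, 大阪府/京都府, or 2-3 characters followed by 県
#   city:       shortest run ending in 市, 区, 町 or 村
#   rest:       everything that follows
_ADDRESS_PATTERN = re.compile(
    r'^(?P<prefecture>東京都|北海道|(?:大阪|京都)府|.{2,3}県)?'
    r'(?P<city>.+?[市区町村])?'
    r'(?P<rest>.*)$'
)


def presplit_japanese_address(text: str, marker: str = ',') -> str:
    """Insert `marker` between prefecture, city and the remainder.

    Text without a recognizable prefecture or city is returned unchanged.
    """
    match = _ADDRESS_PATTERN.match(text)
    if match is None:
        return text
    # Keep only the parts that actually matched something.
    parts = [p for p in match.group('prefecture', 'city', 'rest') if p]
    return marker.join(parts)


print(presplit_japanese_address('大阪府大阪市大阪'))
print(presplit_japanese_address('大阪市大阪'))
```

With this sketch, the city name and the part below it end up on opposite sides of a marker, which is where the soft phrase break would later be placed.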

As a result of this change, "大阪市大阪", which spans a SOFT_PHRASE break, receives a larger penalty and therefore a lower search priority than "大阪市", the name of a city (the fifth value from the left is the penalty value).
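
The effect on ranking can be illustrated with a toy model. This is a sketch under stated assumptions, not Nominatim's scoring: the break labels, the penalty value of 0.5, and the function `span_penalty` are hypothetical. The idea is that a candidate word is penalized once for every soft phrase break that lies strictly inside it.

```python
from typing import Sequence

# Hypothetical labels for the break that follows each syllable of
# "da ban shi da ban"; the soft phrase break sits after "shi".
BREAKS = ('token', 'token', 'soft_phrase', 'token', 'end')

SOFT_PHRASE_PENALTY = 0.5  # assumed value, for illustration only


def span_penalty(breaks: Sequence[str], start: int, end: int) -> float:
    """Penalty for a candidate word covering syllables start..end-1.

    Only breaks strictly inside the span count; the break that
    closes the span does not.
    """
    inner = breaks[start:end - 1]
    return SOFT_PHRASE_PENALTY * sum(1 for b in inner if b == 'soft_phrase')


# Candidate words as (start, end) syllable ranges.
candidates = {
    '大阪市': (0, 3),
    '大阪': (3, 5),
    '大阪市大阪': (0, 5),
}
penalties = {word: span_penalty(BREAKS, s, e)
             for word, (s, e) in candidates.items()}
# Lower penalty means higher search priority.
ranked = sorted(penalties, key=penalties.get)
print(ranked)
```

In this model only "大阪市大阪" crosses the soft phrase break, so it is the only candidate that picks up a penalty and it sorts last.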

Fig. 2 Before the change.

Fig. 3 After the change.

lonvia · 2023-08-20T08:39:21Z

The failing tests may not be related to your code. I see the same failures in an unrelated change. I'm investigating.

lonvia · 2023-08-20T18:06:35Z

Can you please rebase your code on master? This should make the CI errors go away.

lonvia

Looks mostly good. Just two minor comments from my side.

lonvia · 2023-08-20T18:08:18Z

nominatim/api/search/icu_tokenizer.py

+                                  for p in phrases)))
+        return normalized
+
+    def split_key_japanese_phrases(


You probably forgot to delete the old code here.

lonvia · 2023-08-20T18:09:41Z

nominatim/api/search/query.py

@@ -29,6 +29,7 @@ class BreakType(enum.Enum):
    """ Break created as a result of tokenization.
        This may happen in languages without spaces between words.
    """
+    SOFT_PHRASE = ':'


Can you please add documentation for this new type, just as is done in the lines above?
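
For illustration, documentation in the style of the surrounding lines might look like the sketch below. The class `DemoBreakType` is a stand-in written for this example, not the actual `BreakType` class; the member name `TOKEN`, its value, and the docstring wording for `SOFT_PHRASE` are assumptions based on the PR description. Only the value ':' and the tokenization docstring come from the diff.

```python
import enum


class DemoBreakType(enum.Enum):
    """ Stand-in enum for illustration only (not the Nominatim class).
    """
    TOKEN = '`'  # placeholder name and value, assumed for this sketch
    """ Break created as a result of tokenization.
        This may happen in languages without spaces between words.
    """
    SOFT_PHRASE = ':'
    """ Break between parts of an address that were pre-separated,
        for example between a city name and what follows it.
        Words spanning this break receive a penalty.
    """


print(DemoBreakType(':'))
```

The bare string placed under each member is ignored by `enum` at runtime and serves only as documentation, which matches the convention visible in the diff.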

miku0 · 2023-08-21

Thank you so much for your help and comments.
I added the documentation.

lonvia reviewed Aug 20, 2023


made a module to split Japanese words

dfbacf4

miku0 force-pushed the soft_phrase-final branch from 8f43956 to dfbacf4 Compare August 21, 2023 01:06

Modification of comments

c283235

lonvia mentioned this pull request Sep 7, 2023

Configurable preprocessing for queries #3193

Open
