How do I use url_match with Japanese? #12864
ryanheise
started this conversation in
Language Support
-
Most languages in spaCy (the ones that use whitespace) use the rule-based tokenizer. You can check with:

```python
nlp = spacy.blank("en")
print(type(nlp.tokenizer))  # `spacy.tokenizer.Tokenizer`
nlp = spacy.blank("ja")
print(type(nlp.tokenizer))  # `spacy.lang.ja.JapaneseTokenizer`
```

There's more information in the "Language support" sections starting here: https://spacy.io/usage/models#chinese

For having URLs as single tokens in Japanese, my initial suggestion would be to postprocess the sudachipy tokenization: match URLs in the text using the […]
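As a concrete illustration of that postprocessing idea, here is a minimal sketch. The URL regex and the `url_char_spans` helper are assumptions for illustration, not spaCy's exact `URL_MATCH` rule (spaCy's pattern in `spacy.lang.tokenizer_exceptions` is anchored to whole tokens, so it can't scan raw text directly); the commented merge step uses spaCy's real `Doc.retokenize()` / `retokenizer.merge()` API.

```python
import re

# Simplified URL pattern for illustration only -- an assumption,
# not spaCy's own URL_MATCH rule.
URL_RE = re.compile(r"https?://[^\s]+")

def url_char_spans(text):
    """Return (start, end) character offsets of URLs found in `text`."""
    return [m.span() for m in URL_RE.finditer(text)]

text = "詳しくは https://spacy.io/usage/models#japanese を参照してください。"
spans = url_char_spans(text)

# Each character span can then be mapped back onto the sudachipy-produced
# tokens and merged into a single token in one pass:
#
#     with doc.retokenize() as retokenizer:
#         for start, end in spans:
#             span = doc.char_span(start, end, alignment_mode="expand")
#             if span is not None:
#                 retokenizer.merge(span)
```

Using `alignment_mode="expand"` keeps the merge robust when the URL's character boundaries fall inside a sudachipy token rather than exactly on one.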
-
It seems that `url_match` is not set up by default for Japanese, while it is for English: `spacy.load('en_core_web_lg').tokenizer.url_match` is there, but `spacy.load('ja_core_news_lg').tokenizer.url_match` is not. I tried setting it with `nlp = spacy.load('ja_core_news_lg'); nlp.tokenizer.url_match = spacy.lang.tokenizer_exceptions.URL_MATCH`, but it had no effect on the results. Would anyone be able to advise on how to set this?
Also, is there any reason why it is not already set? And is there information on which models do or don't handle it by default?