How do I use url_match with Japanese? #12864
ryanheise
started this conversation in
Language Support
-
Most languages in spaCy (the ones that use whitespace) use the rule-based tokenizer. You can check with:

```python
nlp = spacy.blank("en")
print(type(nlp.tokenizer))  # `spacy.tokenizer.Tokenizer`
nlp = spacy.blank("ja")
print(type(nlp.tokenizer))  # `spacy.lang.ja.JapaneseTokenizer`
```

There's more information in the "Language support" sections starting here: https://spacy.io/usage/models#chinese

For having URLs as single tokens in Japanese, my initial suggestion would be to postprocess the sudachipy tokenization: match URLs in the text using the […]
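As a concrete illustration of that postprocessing idea, here is a minimal sketch. The URL regex and the `url_char_spans` helper are assumptions for illustration, not spaCy's exact `URL_MATCH` rule (spaCy's pattern in `spacy.lang.tokenizer_exceptions` is anchored to whole tokens, so it can't scan raw text directly); the commented merge step uses spaCy's real `Doc.retokenize()` / `retokenizer.merge()` API.

```python
import re

# Simplified URL pattern for illustration only -- an assumption,
# not spaCy's own URL_MATCH rule.
URL_RE = re.compile(r"https?://[^\s]+")

def url_char_spans(text):
    """Return (start, end) character offsets of URLs found in `text`."""
    return [m.span() for m in URL_RE.finditer(text)]

text = "詳しくは https://spacy.io/usage/models#japanese を参照してください。"
spans = url_char_spans(text)

# Each character span can then be mapped back onto the sudachipy-produced
# tokens and merged into a single token in one pass:
#
#     with doc.retokenize() as retokenizer:
#         for start, end in spans:
#             span = doc.char_span(start, end, alignment_mode="expand")
#             if span is not None:
#                 retokenizer.merge(span)
```

Using `alignment_mode="expand"` keeps the merge robust when the URL's character boundaries fall inside a sudachipy token rather than exactly on one.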
-
It seems that `url_match` is not set up by default for Japanese, while it is for English: `spacy.load('en_core_web_lg').tokenizer.url_match` is there, but `spacy.load('ja_core_news_lg').tokenizer.url_match` is not. I tried setting it with `nlp = spacy.load('ja_core_news_lg'); nlp.tokenizer.url_match = spacy.lang.tokenizer_exceptions.URL_MATCH`, but it had no effect on the results. Would anyone be able to advise on how to set this?
Also, is there any reason why it is not already set? And is there information on which models do or don't handle it by default?