UTF-8 Tokenization #6

sajov · 2018-04-03T22:19:48Z

Hi,

I have issues with German umlauts, so I guess it depends on the tokenizer imported here:
const { WordTokenizer } = require('natural/lib/natural/tokenizers/regexp_tokenizer');

See the classifier response:

{
  "text": "what's the weather in münchen on the weekend",
  "tokens": [ "what", "s", "the", "weather", "in", "m", "nchen", "on", "the", "weekend" ],
  "stems": [ "what", "s", "the", "weather", "in", "m", "nchen", "on", "the", "weekend" ],
  "_id": "f382b318585747e8f72a6c11af37e88e",
  "action": {
    "type": "weather",
    "tags": { "location": [ 5, 6 ], "time": null },
    "text": "what's the weather in münchen on the weekend",
    "_id": 2
  },
  "confidence": 0.8600388765335083,
  "extracted": { "location": "m nchen", "time": null },
  "reference": {
    "type": "weather",
    "tags": { "location": [ 5, 5 ], "time": [ 6, 8 ] },
    "text": "what's the weather in vancouver on the weekend",
    "_id": 2
  }
}
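
The split of "münchen" into "m" and "nchen" is what one would expect from a tokenizer whose separator pattern only recognizes ASCII letters as word characters. The following is a minimal sketch of that behavior, not the library's actual source: `asciiSplit`, `unicodeSplit`, and `tokenize` are hypothetical names, and the Unicode-aware pattern assumes a Node.js version that supports Unicode property escapes (`\p{L}`, `\p{N}` with the `u` flag).

```javascript
// Hypothetical patterns for illustration; these are assumptions, not natural's code.
// Separator: anything that is not an ASCII letter, digit, or underscore.
const asciiSplit = /[^A-Za-z0-9_]+/;
// Separator: anything that is not a Unicode letter, Unicode digit, or underscore.
const unicodeSplit = /[^\p{L}\p{N}_]+/u;

// Split text on the given separator pattern and drop empty tokens.
// NFC normalization turns "u" + combining diaeresis into a single "ü",
// so decomposed input is not split at the combining mark.
function tokenize(text, pattern) {
  return text
    .normalize('NFC')
    .split(pattern)
    .filter((token) => token.length > 0);
}

const text = "what's the weather in m\u00fcnchen on the weekend";

const asciiTokens = tokenize(text, asciiSplit);
const unicodeTokens = tokenize(text, unicodeSplit);

console.log(asciiTokens);   // "ü" acts as a separator here
console.log(unicodeTokens); // "münchen" stays a single token here
```

With the ASCII-only pattern the output reproduces the token list in the response above, including the shifted `location` span. Whether swapping the pattern is the right fix for this project is an open question; it only shows where the split could come from.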

I have not compared other tokenizers.
Any suggestions?
Thanks


daffl · 2018-04-03T22:27:10Z

Very odd, I thought I tested it in my mother tongue and it worked. I'll write a test and dig a little more into why that would be an issue.

sajov · 2018-04-03T23:37:10Z

great to hear :-)
thanks
