Word extractions #428

Open
flazouh opened this issue Jul 8, 2023 · 0 comments
flazouh commented Jul 8, 2023

Hi, I came across your great package and have a question.
When extracting words from "안녕하세요. 어떻게 지내세요?", I get:

{
  "words": [
    [
      [
        "안녕",
        "NNG"
      ],
      [
        "",
        "XSV"
      ],
      [
        "세요",
        "EFN"
      ],
      [
        ".",
        "SF"
      ]
    ],
    [
      [
        "어떻",
        "VA"
      ],
      [
        "",
        "ECD"
      ]
    ],
    [
      [
        "지내",
        "VV"
      ],
      [
        "세요",
        "EFN"
      ],
      [
        "?",
        "SF"
      ]
    ]
  ]
}

With:

def get_tokens(text):
    tokens = kkma.pos(text)
    return {"words": tokens}

But I don't understand how that is extracting words. Isn't the Korean phrase "안녕하세요" typically split into two main tokens?

"안녕" (Annyeong) - This is a noun that means "peace" or "well-being".
"하세요" (Haseyo) - This is the honorific form of the verb "하다" (hada), which means "to do". When used after a noun in this way, it turns the noun into a polite question.
So, when combined, "안녕하세요" literally translates to something like "Are you peaceful?" or "Is peace being done?". However, it is commonly used to mean "Hello" in English.
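
To make the expected split concrete, here is a minimal sketch in plain Python that regroups each space-separated word into a leading stem plus everything that follows it. The function name `split_stem_and_ending` is hypothetical and not part of the package. The sketch also assumes that the first morpheme of each word is a literal prefix of that word and that symbol tags start with "S" (as "SF" does in the output above). The analysed data is copied from that output as posted.

```python
def split_stem_and_ending(text, analysed):
    """Pair each space-separated word with a (stem, ending) tuple.

    `analysed` is assumed to hold one list of [morpheme, tag] pairs per word,
    in the same order as the words in `text`.
    """
    result = []
    for surface, morphemes in zip(text.split(), analysed):
        # Strip trailing symbol morphemes (assumed: tags starting with "S").
        for form, tag in reversed(morphemes):
            if tag.startswith("S") and form and surface.endswith(form):
                surface = surface[: -len(form)]
            else:
                break
        stem = morphemes[0][0]
        if stem and surface.startswith(stem):
            result.append((stem, surface[len(stem):]))
        else:
            # Stem is not a literal prefix of the word: keep the word whole.
            result.append((surface, ""))
    return result


text = "안녕하세요. 어떻게 지내세요?"
analysed = [
    [["안녕", "NNG"], ["", "XSV"], ["세요", "EFN"], [".", "SF"]],
    [["어떻", "VA"], ["", "ECD"]],
    [["지내", "VV"], ["세요", "EFN"], ["?", "SF"]],
]
pairs = split_stem_and_ending(text, analysed)
print(pairs)
```

Because the ending is cut from the original text rather than rebuilt from the morpheme list, the empty strings in the output above do not affect the result.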

flazouh changed the title from "Processing time" to "Word extractions" on Jul 8, 2023