Unable to tokenize sentences that start with an at sign ("@") #60

Open
cromz22 opened this issue Oct 12, 2022 · 0 comments

Tokenization with pyknp fails with an IndexError when a sentence starts with "@".
The problem does not seem to occur when jumanpp is called through subprocess or rhoknp.

To reproduce:

import subprocess
from pyknp import Juman
import rhoknp


def subprocess_tokenize(ja_sent):
    command = f"echo {ja_sent} | jumanpp --segment"
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout.strip()


def pyknp_tokenize(ja_sent):
    jumanpp = Juman()
    return " ".join([mrph.midasi for mrph in jumanpp.analysis(ja_sent).mrph_list()])


def rhoknp_tokenize(ja_sent):
    jumanpp = rhoknp.Jumanpp()
    return " ".join([str(morph) for morph in jumanpp.apply(ja_sent).morphemes])


def main():
    ja_sent = "@@抄録@@移動アドホックネットワークのための適応経路チューニングスキーム「OR2」を提案した。"
    ja_sent = "@ECOVISIONの導入効果は以下の通りである。"  # overrides the sentence above

    ja_sent = pyknp_tokenize(ja_sent)
    print(ja_sent)


if __name__ == "__main__":
    main()
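
As a side note on the subprocess variant: building the command with `echo {ja_sent}` under `shell=True` lets the shell interpret quotes, `$`, backticks and similar characters in the sentence. A minimal sketch that avoids the shell by sending the sentence on stdin is shown here. The function name `segment_via_stdin` and its `command` parameter are hypothetical and not part of the original script; only `jumanpp --segment` comes from the report.

```python
import subprocess


def segment_via_stdin(sentence, command=("jumanpp", "--segment")):
    """Send one sentence to the segmenter on stdin and return its stripped stdout.

    `command` is a hypothetical parameter added for illustration; the default
    mirrors the `jumanpp --segment` call used in the reproduction script.
    """
    completed = subprocess.run(
        list(command),          # argument list, so no shell is involved
        input=sentence + "\n",  # same effect as `echo sentence | command`
        capture_output=True,
        encoding="utf-8",       # decode/encode explicitly instead of relying on the locale
        check=True,             # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()
```

With jumanpp installed, `segment_via_stdin("@ECOVISIONの導入効果は以下の通りである。")` should behave like `subprocess_tokenize` above, without passing the sentence through a shell.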

Standard error:

Traceback (most recent call last):
  File "error.py", line 30, in <module>
    main()
  File "error.py", line 25, in main
    ja_sent = pyknp_tokenize(ja_sent)
  File "error.py", line 13, in pyknp_tokenize
    return " ".join([mrph.midasi for mrph in jumanpp.analysis(ja_sent).mrph_list()])
  File "/path_to_venv/lib/python3.8/site-packages/pyknp/juman/juman.py", line 98, in analysis
    return self.juman(input_str, juman_format)
  File "/path_to_venv/venv/lib/python3.8/site-packages/pyknp/juman/juman.py", line 85, in juman
    result = MList(self.juman_lines(input_str), juman_format)
  File "/path_to_venv/venv/lib/python3.8/site-packages/pyknp/juman/mlist.py", line 26, in __init__
    self._mrph[-1].push_doukei(Morpheme(line[2:], mid, juman_format))
IndexError: list index out of range
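
A possible reading of the traceback: the failing line in mlist.py calls `self._mrph[-1].push_doukei(Morpheme(line[2:], ...))`, so the first output line seems to be treated as an alternative ("doukei") analysis of a previous morpheme while the morpheme list is still empty. In JUMAN-style output, alternative analyses are marked by a leading "@ ", which a morpheme whose surface form is itself "@" would also produce. The sketch below reproduces that failure mode with a toy parser. It is not pyknp's code: `parse_lines`, the `guard` flag, and the simplified four-column sample lines are all assumptions made for illustration.

```python
# Hypothetical, simplified lines (surface, reading, base form, POS); not real jumanpp output.
SAMPLE = [
    "@ @ @ 特殊",
    "ECOVISION ECOVISION ECOVISION 名詞",
    "EOS",
]


def parse_lines(lines, guard=False):
    """Toy stand-in for a JUMAN-format parser (not pyknp's actual implementation)."""
    morphemes = []  # each entry: {"fields": [...], "alternatives": [...]}
    for line in lines:
        if line == "EOS":
            break
        # A leading "@ " marks an alternative analysis of the previous morpheme.
        is_alternative = line.startswith("@ ")
        if guard:
            # Assumed fix: an alternative can only follow an existing morpheme.
            is_alternative = is_alternative and bool(morphemes)
        if is_alternative:
            # With an empty list this raises IndexError, as in the traceback.
            morphemes[-1]["alternatives"].append(line[2:].split(" "))
        else:
            morphemes.append({"fields": line.split(" "), "alternatives": []})
    return morphemes


if __name__ == "__main__":
    try:
        parse_lines(SAMPLE)
    except IndexError as exc:
        print("unguarded parser failed:", exc)
    print([m["fields"][0] for m in parse_lines(SAMPLE, guard=True)])
```

The guard only covers "@" at the start of a sentence. An "@" in the middle of a sentence would still look like an alternative line to this toy parser, so a real fix would likely need to tell the two cases apart from the line contents.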