Unbalanced/hanging double quotes allows weird midasi #18

Open
jamesohortle opened this issue Aug 29, 2019 · 1 comment

Comments

@jamesohortle

from pyknp import Juman
juman = Juman(jumanpp=False)

result = juman.analysis('"test').mrph_list()
print(" :: ".join(r.midasi for r in result)) # --> '"test "test'

result = juman.analysis('"test "').mrph_list()
print(" :: ".join(r.midasi for r in result)) # --> '"test "test' :: '\\ ' :: '"'

Something like

if sum(1 for c in input_str if c == '"') % 2 != 0:
    input_str = "".join(
        reversed("".join(reversed(input_str)).replace('"', "", 1))
    )

could be used to strip the last unpaired quote from a string before calling Juman.analysis(), but I don't find it satisfactory. It may just be a case of "garbage in, garbage out".
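For reference, the same clean-up can be written without the double reversal. This is only a sketch of the workaround, not a fix for the parser: strip_unpaired_quote is a hypothetical helper name, not part of pyknp, and like the snippet above it assumes the last double quote is the unpaired one.

```python
def strip_unpaired_quote(input_str: str) -> str:
    """Drop the last double quote if the string contains an odd number of them.

    Hypothetical helper (not part of pyknp). Strings with an even number
    of double quotes are returned unchanged.
    """
    if input_str.count('"') % 2 == 0:
        return input_str
    # rfind locates the last quote, which is assumed to be the unpaired one.
    last = input_str.rfind('"')
    return input_str[:last] + input_str[last + 1:]


print(strip_unpaired_quote('"test'))    # test
print(strip_unpaired_quote('"test "'))  # "test " (two quotes, left alone)
```

The cleaned string would then be passed to juman.analysis() in place of the raw input. Note that this changes the text being analysed, which is part of why it is not a satisfactory fix.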

The problem is in morpheme.py: Morpheme._parse_spec() where quotes are handled in these nested if-statements (starting line 125):

                # If "\"" proceeds " ", it would be not inside_quotes, but "\"".
                if inside_quotes and char == " " and part == '"':
                    inside_quotes = False

                if part != "" and char == " " and not inside_quotes:
                    if part.startswith('"') and part.endswith('"') and len(part) > 1:
                        print(f"APPENDING PART0 {part}")
                        parts.append(part[1:-1])
                    else:
                        print(f"APPENDING PART1 {part}")
                        parts.append(part)
                    part = ""
                else:
                    print(f"ADDING CHAR TO PART {part} + {char}")
                    part += char

The expected behaviour (according only to me) should be '"test' for the first input and '"test' :: '\\ ' :: '"' for the second.
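To see how an unpaired quote lets a field swallow the following space, here is a stand-alone sketch of the splitting loop. It rests on two assumptions that are not confirmed by the excerpt above: that a bare double quote toggles inside_quotes just before the quoted if-statements (that part of the loop is not shown), and that the spec line looks like the made-up string below, which is not real JUMAN output. split_spec is a hypothetical name.

```python
def split_spec(spec: str) -> list[str]:
    """Split a space-separated spec line, keeping quoted fields together.

    Hypothetical stand-in for the loop in Morpheme._parse_spec().
    """
    parts = []
    part = ""
    inside_quotes = False
    for char in spec:
        # ASSUMPTION: a double quote toggles the quoted state.
        if char == '"':
            inside_quotes = not inside_quotes
        # From the excerpt: a lone '"' followed by a space is a literal quote.
        if inside_quotes and char == " " and part == '"':
            inside_quotes = False
        if part != "" and char == " " and not inside_quotes:
            # A fully quoted field loses its surrounding quotes.
            if part.startswith('"') and part.endswith('"') and len(part) > 1:
                parts.append(part[1:-1])
            else:
                parts.append(part)
            part = ""
        else:
            part += char
    if part:
        parts.append(part)
    return parts


# Balanced quotes: the quoted field is kept together and unquoted.
print(split_spec('a a "b c" X'))          # ['a', 'a', 'b c', 'X']
# Unpaired quotes (made-up spec line): the first two fields are merged.
print(split_spec('"test "test "test X'))  # ['"test "test', '"test X']
```

Under these assumptions, the leading quote of '"test' opens a quoted span that is only closed by the leading quote of the next field, so the space between the two fields is kept and the first part comes out as '"test "test', matching the midasi reported above.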

@jamesohortle
Author

Had another look: if we use juman = Juman() instead, then we get "test.
