Does Word Segmentation give position of the vocabularies? #42

Open
shivanraptor opened this issue Oct 3, 2023 · 1 comment
@shivanraptor

Feature you are interested in and your specific question(s):
I'm studying PyCantonese's word segmentation (https://pycantonese.org/word_segmentation.html). Does the function also return the start and end positions of each word?

What you are trying to accomplish with this feature or functionality:
This is the code I am running:

import pycantonese
from pycantonese.word_segmentation import Segmenter
segmenter = Segmenter()
result = pycantonese.segment("廣東話容唔容易學?", cls=segmenter)
print(result)

Current result:

['廣東話', '容', '唔', '容易', '學', '?']

I would like to get the following result instead, with each word's start and end position:

[('廣東話', 0, 3), ('容', 3, 4), ('唔', 4, 5), ('容易', 5, 7), ('學', 7, 8), ('?', 8, 9)]

Thanks.

@shivanraptor

Something like:

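import pycantonese
from pycantonese.word_segmentation import Segmenter

def custom_segment(input_str: str):
    segmenter = Segmenter()  # possible attributes: disallow, allow, max_word_length
    words = pycantonese.segment(input_str, cls=segmenter)
    tokenized = []
    cursor = 0  # search from the end of the previous word so repeated words get distinct offsets
    for word in words:
        start = input_str.find(word, cursor)
        end = start + len(word)
        tokenized.append((word, start, end))
        cursor = end
    return tokenized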
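
A self-contained sketch of the offset computation on its own, separated from the segmenter so it can be run without PyCantonese installed. The helper name `with_offsets` is hypothetical and not part of PyCantonese; the word list is the segmentation output quoted earlier in this issue. It searches from the end of the previous match, so a word that occurs more than once in the sentence gets its own offsets rather than those of the first occurrence.

```python
from typing import Iterable, List, Tuple


def with_offsets(text: str, words: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Attach (start, end) character offsets to already-segmented words.

    Hypothetical helper, not a PyCantonese API. Assumes the words appear
    in `text` in order; characters skipped between words are tolerated.
    """
    spans = []
    cursor = 0  # position just past the previous word
    for word in words:
        start = text.find(word, cursor)
        if start == -1:
            raise ValueError(f"{word!r} not found in text at or after index {cursor}")
        end = start + len(word)
        spans.append((word, start, end))
        cursor = end
    return spans


# Segmentation output quoted from this issue
words = ['廣東話', '容', '唔', '容易', '學', '?']
spans = with_offsets("廣東話容唔容易學?", words)
print(spans)
# [('廣東話', 0, 3), ('容', 3, 4), ('唔', 4, 5), ('容易', 5, 7), ('學', 7, 8), ('?', 8, 9)]
```

Offsets are Python string indices (code points), with `end` exclusive, so `text[start:end] == word` holds for every tuple.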
