Does Word Segmentation give position of the vocabularies? #42

shivanraptor · 2023-10-03T07:58:09Z

Feature you are interested in and your specific question(s):
I'm studying PyCantonese's word segmentation (https://pycantonese.org/word_segmentation.html). Does the function also return the start and end position of each word?

What you are trying to accomplish with this feature or functionality:
Given this code:

import pycantonese
from pycantonese.word_segmentation import Segmenter
segmenter = Segmenter()
result = pycantonese.segment("廣東話容唔容易學？", cls=segmenter)
print(result)

Current result:

['廣東話', '容', '唔', '容易', '學', '？']

I would like to get the following result instead (with the start and end position of each word):

[('廣東話', 0, 3), ('容', 3, 4), ('唔', 4, 5), ('容易', 5, 7), ('學', 7, 8), ('？', 8, 9)]

Thanks.


shivanraptor · 2023-10-03T08:18:15Z

Something like:

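import pycantonese
from pycantonese.word_segmentation import Segmenter

def custom_segment(input_str: str):
    segmenter = Segmenter()  # possible attributes: disallow, allow, max_word_length
    pyseg = pycantonese.segment(input_str, cls=segmenter)
    tokenized = []
    cursor = 0  # resume each search where the previous word ended
    for word in pyseg:
        # Searching from the cursor keeps a repeated word from matching its first occurrence again.
        start = input_str.find(word, cursor)
        end = start + len(word)
        tokenized.append((word, start, end))
        cursor = end
    return tokenized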
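
Here is a minimal sketch of just the offset bookkeeping, kept separate from the segmenter so it can be checked without PyCantonese installed. It looks for each word starting from the end of the previous one, so a word that occurs more than once gets its own offsets. `attach_offsets` is a hypothetical helper name, not part of PyCantonese, and it assumes the segmented words appear in the input string unchanged and in order.

```python
from typing import List, Tuple


def attach_offsets(text: str, words: List[str]) -> List[Tuple[str, int, int]]:
    """Pair each word with its (start, end) character offsets in text.

    Hypothetical helper: assumes the words occur in text in order and unmodified.
    """
    result = []
    cursor = 0  # end offset of the previous word
    for word in words:
        start = text.find(word, cursor)
        if start == -1:
            # The word was not found at or after the cursor, so the
            # assumption above does not hold for this input.
            raise ValueError(f"{word!r} not found in text from offset {cursor}")
        end = start + len(word)
        result.append((word, start, end))
        cursor = end
    return result


# Using the segmentation result quoted in this issue:
words = ['廣東話', '容', '唔', '容易', '學', '？']
print(attach_offsets("廣東話容唔容易學？", words))
# [('廣東話', 0, 3), ('容', 3, 4), ('唔', 4, 5), ('容易', 5, 7), ('學', 7, 8), ('？', 8, 9)]
```

The end offset is exclusive, so `text[start:end]` gives back the word.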

shivanraptor added the question label Oct 3, 2023
