Extract both sentences and words from publication content? #395

Answered by mickael-menu
domkm asked this question in Q&A
tokenizer($0)[0] is because ContentTokenizer always returns a [any ContentElement] of length 1.

I see. I forgot that the text tokenizers split the segments of a TextContentElement instead of returning additional TextContentElement objects. But this is an implementation detail; you should assume that a tokenizer might return more than one element.

That's also why you didn't get the individual word locators: you need to check the segments. You can try this version:

func contentPairs(publication: Publication) throws -> [(sentence: Locator, words: [Locator])] {
    guard let content = publication.content() else {
        return []
    }

    let wordTokenizer = makeTextContentTokenizer(
        defaultLanguage: publication.
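
The point about segments can be illustrated independently of the toolkit. The sketch below uses stand-in types (`FakeLocator`, `FakeSegment`, `FakeTextElement`) and a hypothetical `tokenizeWords` function; none of these are the Readium API, they only mimic the shape described above, where a tokenizer returns elements whose segments each carry their own locator. It collects locators from every segment of every returned element instead of indexing `[0]`.

```swift
// Stand-in types for illustration only; they are NOT the Readium API.
struct FakeLocator: Equatable {
    let text: String
}

struct FakeSegment {
    let locator: FakeLocator
}

struct FakeTextElement {
    // A tokenizer may split one element into several segments.
    let segments: [FakeSegment]
}

// Hypothetical word tokenizer: returns a single element whose
// segments are the individual words of the sentence.
func tokenizeWords(_ sentence: String) -> [FakeTextElement] {
    let words = sentence.split(separator: " ").map { String($0) }
    let segments = words.map { FakeSegment(locator: FakeLocator(text: $0)) }
    return [FakeTextElement(segments: segments)]
}

// Collect one locator per segment across every returned element,
// rather than assuming the result has exactly one element.
func wordLocators(in sentence: String) -> [FakeLocator] {
    return tokenizeWords(sentence).flatMap { element in
        element.segments.map { $0.locator }
    }
}

let locators = wordLocators(in: "Call me Ishmael")
print(locators.map { $0.text })
```

Using `flatMap` over the elements keeps the code correct whether the tokenizer returns one element with many segments or many elements, which is the assumption recommended above.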

Answer selected by domkm