NLP edge case - hyphen separated words aren't highlighted as one word (fifty- two) #9

Open
jsms90 opened this issue May 11, 2018 · 1 comment

jsms90 commented May 11, 2018

For some reason, there is often a space between the hyphen and the second word, and sometimes a space on both sides (?).

Sometimes the first word is picked up by compromise (the NLP library), sometimes just the second. Sometimes both, but separately from each other.

hyphenated-word

With a two-digit number word like fifty-two, users can click on both parts, because clicking 50 and then clicking 2 gives a combined total of 52 👍 But this doesn't work well for three-digit numbers.
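One way to deal with the stray spaces around the hyphen described above is to normalise them before the text reaches the NLP step. This is only a minimal sketch: `normalizeHyphenSpacing` is a hypothetical helper, not part of compromise or of this project, and it assumes that a hyphen sitting between two letters always belongs to a hyphenated word (so a spaced hyphen used as a dash in prose would be collapsed too).

```javascript
// Hypothetical helper (assumption: not part of compromise or the project code).
// Collapses whitespace around a hyphen that sits between two letters,
// e.g. "fifty- two" or "fifty - two" becomes "fifty-two".
// Hyphens next to digits (e.g. "5 - 3") are left untouched.
function normalizeHyphenSpacing(text) {
  // Lookbehind/lookahead keep the surrounding letters out of the match,
  // so chained hyphenations are handled in a single pass.
  return text.replace(/(?<=[a-zA-Z])\s*-\s*(?=[a-zA-Z])/g, "-");
}

console.log(normalizeHyphenSpacing("Fifty- two subjects and event - related tasks"));
// prints "Fifty-two subjects and event-related tasks"
```

This only fixes the spacing; whether compromise then treats the result as a single term is a separate question.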

jsms90 commented May 11, 2018

First attempt:

if (this.abstract) {
  // Insert missing spaces so numbers fused to words or punctuation are tokenised separately.
  const abstractWithWhitespaces = this.abstract
    .replace(/([a-zA-Z]+[,.)}\]!?]+)(\d)/g, "$1 $2")
    .replace(/(\d+[,.)}\]!?]+)([a-zA-Z]+)/g, "$1 $2")
    .replace(/([a-zA-Z]+)\s?=\s?(\d)/g, "$1 = $2");
  // Text of every term that compromise tags as a #Value.
  let nlpExtractedNumbers = nlp(abstractWithWhitespaces).match("#Value").out("text");
  let parsedAbstract = abstractWithWhitespaces;
  // Matches "word- word" or "word - word"; whitespace after the hyphen is required.
  const dashedWords = /([a-zA-Z]+)\s?-\s([a-zA-Z]+)/;
  let wordsWithDashes;
  while ((wordsWithDashes = dashedWords.exec(nlpExtractedNumbers)) !== null) {
    const joined = wordsWithDashes[1] + wordsWithDashes[2];
    // Accumulate on parsedAbstract; replacing on abstractWithWhitespaces
    // each time would keep only the last replacement.
    parsedAbstract = parsedAbstract.replace(wordsWithDashes[0], joined);
    nlpExtractedNumbers = nlpExtractedNumbers.replace(wordsWithDashes[0], joined);
  }
  return nlp(parsedAbstract).out("html");
}

But they still show up as separate words:

ninety-five

Narrowing it down, parsedAbstract gives ninety-five as one word:

' Event- related  spectral  perturbations  (ERSPs;  event- related  mean  power  spectral
changes)  and  inter- trial  coherence  (ITCs;  event- related  consistency  of  spectral  phase)
reveal  a  more  comprehensive  overview  of  EEG  activity.  Ninety-five  subjects  (56  MS
patients,  39  controls)  completed  visual  and  auditory  two- stimulus  P3b  event- related
potential  tasks  and  the  PASAT. '

But nlp(parsedAbstract).out("html") treats ninety and five as 2 separate words and puts them in separate spans with different classes, hence the separate styling:

<span
    class="nl-TitleCase nl-Hyphenated nl-Cardinal nl-Value nl-TextValue"
>Ninety</span>-
<span
    class="nl-Hyphenated nl-Cardinal nl-Value nl-TextValue"
>five</span>
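Since the split happens in the HTML output, one possible workaround is to merge the two spans after the fact. The following is a rough sketch only: `mergeHyphenatedSpans` is a hypothetical helper, and it assumes the output always has the shape shown above (two adjacent spans that both carry `nl-Hyphenated`, separated by a hyphen and optional whitespace). It works on the HTML string with a regular expression, which is fragile if the markup changes.

```javascript
// Hypothetical post-processing helper (assumption: not a compromise API).
// Merges  <span class="... nl-Hyphenated ...">A</span>-<span class="... nl-Hyphenated ...">B</span>
// into a single span containing "A-B" with the union of both class lists.
function mergeHyphenatedSpans(html) {
  const span = '<span\\s+class="([^"]*\\bnl-Hyphenated\\b[^"]*)"\\s*>([^<]*)</span>';
  const pair = new RegExp(span + "\\s*-\\s*" + span, "g");
  return html.replace(pair, (match, classesA, textA, classesB, textB) => {
    // A Set keeps the first occurrence of each class name and drops repeats.
    const classes = [...new Set((classesA + " " + classesB).split(/\s+/).filter(Boolean))];
    return '<span class="' + classes.join(" ") + '">' + textA + "-" + textB + "</span>";
  });
}

const sample = [
  "<span",
  '    class="nl-TitleCase nl-Hyphenated nl-Cardinal nl-Value nl-TextValue"',
  ">Ninety</span>-",
  "<span",
  '    class="nl-Hyphenated nl-Cardinal nl-Value nl-TextValue"',
  ">five</span>",
].join("\n");

console.log(mergeHyphenatedSpans(sample));
```

With the sample above, both halves end up in one span, so they would share the same classes and styling. A single pass merges only one pair at a time, so a three-part hyphenation would need the function applied again.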
