Added control string for sentence breaks #676

Draft
wants to merge 23 commits into base: master

Conversation

MihaiSurdeanu
Contributor

What do you think, @kwalcock, @myedibleenso?
See the unit test for the expected behavior.

Member

kwalcock left a comment

I think the code is fine. I'm a little worried about ramifications, but don't let that hold it back.

WHITESPACE=22
SEQ_OF_UNICODES=23
ErrorCharacter=24
'[SB]'=8
Member

This is odd, but I assume it is correct.

Member

I.e., line 25.

Contributor Author

It is what Antlr generated, so I am assuming it is correct.

@@ -119,6 +119,16 @@ abstract class RuleBasedSentenceSplitter extends SentenceSplitter {
endPositions += lastPosition + raw.head.length
}

// found the control string that enforces sentence breaks
// note that this token is NOT added to the sentences produced
else if(crt.word == SENTENCE_BREAK_CONTROL_STRING) {
Member

I would be reassured if this were also dependent on this.useControlStrings or SentenceSplitter.useControlStrings, which defaults to false. Those who want to use the feature can turn it on if necessary.
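
A minimal sketch of the suggested guard, assuming a flag named useControlStrings that defaults to false. The object below is a simplified, hypothetical stand-in that works on plain token strings; it is not the actual RuleBasedSentenceSplitter code, and the name ControlStringSketch is invented for illustration.

```scala
// Hypothetical, simplified stand-in for the splitter logic in this PR.
object ControlStringSketch {
  val SENTENCE_BREAK_CONTROL_STRING = "[SB]"

  /** Splits tokens into sentences at the control string.
    * The control token is never added to a sentence. When useControlStrings
    * is false (the assumed default), "[SB]" is kept as an ordinary token.
    */
  def split(tokens: Seq[String], useControlStrings: Boolean = false): Seq[Seq[String]] = {
    val sentences = Seq.newBuilder[Seq[String]]
    var current = Vector.empty[String]
    for (token <- tokens) {
      if (useControlStrings && token == SENTENCE_BREAK_CONTROL_STRING) {
        // Close the current sentence; skip empty ones caused by repeated breaks.
        if (current.nonEmpty) sentences += current
        current = Vector.empty
      } else {
        current :+= token
      }
    }
    if (current.nonEmpty) sentences += current
    sentences.result()
  }
}
```

With the flag off, existing callers see no change in behavior, which is the point of making the feature opt-in.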

Contributor Author

Good idea. Can you please add it?

Member

Can this be set without needing to create a custom Processor?

Contributor Author

No... Because we need to adjust the corresponding antlr grammar. Unless we come up with a generic format for the control string, e.g., anything between square brackets? Or, anything between double square brackets, e.g., [[SB]]? Then we can let people set the string to whatever values they want.
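
To make the generic-format idea concrete, here is a rough sketch of the Scala side only, assuming the format is an uppercase name inside double square brackets. The object name, the regex, and the controlName helper are hypothetical; the Antlr grammar would still need a matching rule, which is not shown.

```scala
import scala.util.matching.Regex

// Hypothetical sketch: recognize any control string of the form [[NAME]].
object GenericControlString {
  // Assumed format: one or more uppercase letters in double square brackets.
  val ControlString: Regex = """\[\[([A-Z]+)\]\]""".r

  /** Returns the control name if the whole token is a control string. */
  def controlName(token: String): Option[String] = token match {
    case ControlString(name) => Some(name)
    case _                   => None
  }
}
```

For example, GenericControlString.controlName("[[SB]]") yields Some("SB"), while a token with single brackets such as "[SB]" yields None.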

@@ -186,6 +196,10 @@ class SpanishSentenceSplitter extends RuleBasedSentenceSplitter {
object SentenceSplitter {
val EOS: Regex = """^[\.!\?\s]+$""".r

// Control string that enforces a sentence break
// If you change this value, change also the SENTENCEBREAK in OpenDomainLexer.g to the same value (and recompile the Antlr grammar)
val SENTENCE_BREAK_CONTROL_STRING = "[SB]"
Member

So the procedure would be to figure out before tokenization (that is, before Processor.mkDocument) where the control strings should go, such as wherever there is a <br>, and change them to [SB]. These two strings happen to be the same length, so one could take the resulting Document and substitute the old text for the new text in order to preserve the original. If the strings are different lengths, all the offsets would be off and the substitution won't work. We would lose (easy) access to the original document text. Will that be a problem?
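
A small sketch of that preprocessing step, assuming the marker is exactly <br> and that the replacement happens on the raw text before it is handed to the processor. The object and method names are hypothetical.

```scala
// Hypothetical preprocessing helper; runs on raw text before tokenization.
object OffsetPreservingSubstitution {
  val Marker = "<br>"
  val ControlString = "[SB]"

  /** Replaces each marker with the control string.
    * Character offsets are preserved only because both strings have the
    * same length, so we check that explicitly.
    */
  def insertControlStrings(text: String): String = {
    require(Marker.length == ControlString.length,
      "marker and control string differ in length; offsets would shift")
    text.replace(Marker, ControlString)
  }
}
```

Because the lengths match, an offset computed on the processed text points at the same character position in the original text. With a marker of a different length, such as a newline, this no longer holds.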

Contributor Author

I think it is likely that this will change offsets (e.g., when replacing newlines with '[SB]'). Users need to be aware of this.

Member

I'm not sure we need to worry about supporting reinsertion of the original token in this case of sentence boundaries (at least not at this stage).

That said, I think that is something we should think about supporting for cases where a user wants to preserve unrecognized tokens (e.g., through re-insertion).

Member

@kwalcock, would you feel more comfortable using a control string with higher entropy (ex. <[*^[SB]^*]>)?

Member

I'm happy with it just being off by default. If someone turns it on (a simple SentenceSplitter.useControlString = true) and it's important, they can be responsible for making sure that the control string is not already in their text and, if necessary, escaping it before and unescaping it after, etc.

We do in general have cases in which provenance is important and the original text needs to be preserved. This new feature is still useful and can be used when the original is not so important, though.
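
One possible way a caller could do the escaping and unescaping, sketched under the assumption that a Unicode private-use character passes through tokenization unchanged (not verified here). All names below are hypothetical. Note that this changes offsets, since a four-character string is swapped for a single character.

```scala
// Hypothetical caller-side helper; not part of the library.
object ControlStringEscaping {
  val ControlString = "[SB]"

  /** Escaped text plus the placeholder needed to undo the escaping. */
  final case class Escaped(text: String, placeholder: Char)

  /** Replaces literal occurrences of the control string with a
    * private-use character that does not occur in the text.
    */
  def escape(text: String): Escaped = {
    val placeholder = ('\uE000' to '\uF8FF')
      .find(c => text.indexOf(c) < 0)
      .getOrElse(throw new IllegalArgumentException(
        "no unused private-use character available"))
    Escaped(text.replace(ControlString, placeholder.toString), placeholder)
  }

  /** Restores the literal control string after processing. */
  def unescape(escaped: Escaped): String =
    escaped.text.replace(escaped.placeholder.toString, ControlString)
}
```

The caller would escape first, then insert real control strings where breaks are wanted, and apply the reverse replacement to the resulting words afterwards.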

@myedibleenso
Member

@MihaiSurdeanu @kwalcock, I had need of this again today and it got me thinking: It would be helpful to have a test related to what we expect the value of a Document's text to be when the control string is used:

c/o B.A.Z. Bub[SB]
Morning Star Industries, Ltd.[SB]
666 Ring of Fire Circle[SB]
Lake of Fire, AZ 85666[SB]
signs you might be living in a simulation (recognize these warning signs)...[SB]
  - the earliest sound you remember from your childhood is the Windows startup theme[SB]
  - ....[SB]

@kwalcock
Member

@myedibleenso, can you check TestMkCombinedDocuments?


3 participants