Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEST: Doc.ents as SpanGroup #12380

Closed
wants to merge 3 commits into from

Conversation

adrianeboyd
Copy link
Contributor

Description

Overview of required changes to support SpanGroup rather than Tuple[Span]:

  • implement slice for SpanGroup
  • return SpanGroup for SpanGroup + x or x + SpanGroup rather than refusing to concatenate (currently without good error handling)

Side effects:

  • if appending to Doc.ents, only Iterable[Span] is supported rather than other formats like raw entity tuples from matcher results, but you can still assign mixed data in any of the currently supported formats to Doc.ents
    • for the Matcher case, as_spans provides a good alternative

Types of change

?

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Overview of required changes to support `SpanGroup` rather than
`Tuple[Span]`:

* implement slice for `SpanGroup`
* return `SpanGroup` for `SpanGroup + x` or `x + SpanGroup` rather than
refusing to concatenate (currently without good error handling)

Side effects:

* if appending to `Doc.ents`, only `Iterable[Span]` is supported rather
than other formats like raw entity tuples from matcher results, but you
can still assign mixed data in any of the currently supported formats to
`Doc.ents`
  * for the `Matcher` case, `as_spans` provides a good alternative
@adrianeboyd adrianeboyd added the 🔜 v4.0 Related to upcoming v4.0 label Mar 7, 2023
@adrianeboyd
Copy link
Contributor Author

I'm surprised that no one has complained about slicing span groups.

Comment on lines +109 to +112
if isinstance(i, slice):
start, stop = util.normalize_slice(len(self), i.start, i.stop, i.step)
spans = [self[i] for i in range(start, stop)]
return SpanGroup(self.doc, spans=spans)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not the fastest way to do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I suppose we could support step here, but I'm not sure why you'd want to do this.

@adrianeboyd
Copy link
Contributor Author

And if we did this, we'd want to do it for all the Tuple[Span] cases from #12288.

@adrianeboyd adrianeboyd marked this pull request as draft March 8, 2023 07:36
@adrianeboyd
Copy link
Contributor Author

So the current SpanGroup design with a weakref for the doc is basically a deal breaker for this proposal:

import spacy
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "A", "pattern": "the"}])
print(nlp("the cat").ents)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/span_group.pyx", line 62, in spacy.tokens.span_group.SpanGroup.__repr__
    return str(list(self))
  File "spacy/tokens/span_group.pyx", line 179, in __iter__
    yield self[i]
  File "spacy/tokens/span_group.pyx", line 114, in spacy.tokens.span_group.SpanGroup.__getitem__
    return Span.cinit(self.doc, self.c[i])
  File "spacy/tokens/span_group.pyx", line 73, in spacy.tokens.span_group.SpanGroup.doc.__get__
    raise RuntimeError(Errors.E865)
RuntimeError: [E865] A SpanGroup is not functional after the corresponding Doc has been garbage collected. To keep using the spans, make sure that the corresponding Doc object is still available in the scope of your function.

I like it otherwise, but this (by itself) is definitely too breaking.

@svlandeg svlandeg added the feat / doc Feature: Doc, Span and Token objects label Jun 10, 2023
@honnibal
Copy link
Member

I see the thinking here.

Is it too awkward to make SpanGroup take an argument that tells us whether to hold a real ref or a weakref?

@svlandeg svlandeg deleted the branch explosion:v5 January 29, 2024 09:15
@svlandeg svlandeg closed this Jan 29, 2024
@svlandeg svlandeg reopened this Jan 29, 2024
@svlandeg svlandeg changed the base branch from v4 to v5 January 29, 2024 09:30
@svlandeg svlandeg added 🔜 v5.0 Related to upcoming v5.0 and removed 🔜 v4.0 Related to upcoming v4.0 labels Jan 29, 2024
@svlandeg
Copy link
Member

Temporarily closing this PR as we currently don't have the bandwidth to finish this.

@svlandeg svlandeg closed this Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat / doc Feature: Doc, Span and Token objects 🔜 v5.0 Related to upcoming v5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants