Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRAFT: Allow lex_attr_getters to take a vocab argument #12244

Closed

Conversation

adrianeboyd
Copy link
Contributor

Description

Allow methods in lex_attr_getters to optionally take a vocab argument in order to pass vocab.lookups and the new less-structure vocab.lex_attr_data in order to provide arbitrary data (like stop words) to the getters in a way that it can be serialized with the pipeline.

Types of change

?

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd adrianeboyd added the 🔜 v4.0 Related to upcoming v4.0 label Feb 7, 2023
@adrianeboyd adrianeboyd marked this pull request as draft February 7, 2023 14:15

def _get_lex_attr_value(vocab, func, string):
try:
return func(vocab, string)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no obvious speed hit with try/except in the test suite (unlike for inspect) and it would be easy to add this with the order swapped in v3. (Some more performance checks would still be necessary.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry a bit about collateral catches here though. It would be pretty unpleasant to have us catch a TypeError the user's callback legitimately raised, because we'll then lose the original traceback.

If we go this route, maybe we could mitigate the problem by doing something like this?

def _get_lex_attr_value(vocab, func, string):
    try:
        return func(vocab, string)
    except TypeError as first_error:
        # TODO: some kind of setting or deprecation warning
        try:
            return func(string)
        except TypeError:
            # If the second signature doesn't work either, raise the first
            # error, as that will be the one with the user's traceback
            # (and don't raise it from the second error; that's not helpful)
            raise first_error from None

I think doing the inspection when the callback is assigned would be more reliable though, and it'd remove the question of the runtime performance. It also lets us notify for deprecation when the callback is assigned, rather than when it is invoked, which also seems better to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean about the error traceback.

I agree in theory about tracking the signature of each getter, but modifying the vocab (in a way that doesn't break all existing code and examples) to have some notion of "assigned" is going to be non-trivial.

Copy link
Member

@honnibal honnibal Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's a backport for v3, so it's definitely not a problem to just do it this way. For reference, here's the idea I referred to on Slack for the proxying dict. I could've sworn we did this someplace else but I can't find it.

import weakref


class SetterCallbackDict(dict):
    """A dict class that makes callbacks on __setitem__.
    The intended use-case is to replace a dictionary attribute
    of a class, if some logic should occur when items are
    assigned. The parent argument is also passed to the callback,
    and stored as a weakref to avoid cycles.
    """

    def __init__(self, parent, setter) -> None:
        self._parent = weakref.ref(parent)
        self._setter = setter

    def __setitem__(self, key, value):
        parent = self._parent()
        return self._setter(parent, key, value)


class Vocab:
    def __init__(self, lex_attrs):
        self._lex_attrs = SetterCallbackDict(self, set_lex_attr)
        for key, value in lex_attrs.items():
            self._lex_attrs[key] = value

    @property
    def lex_attrs(self):
        return self._lex_attrs

    @lex_attrs.setter
    def lex_attrs(self, new_dict):
        self._lex_attrs = SetterCallbackDict(self, set_lex_attr)
        for key, value in new_dict.items():
            self._lex_attrs[key] = value


def set_lex_attr(vocab, key, value):
    if key == "foo":
        print("key was foo")
    dict.__setitem__(vocab.lex_attrs, key, value)

@svlandeg svlandeg added the feat / doc Feature: Doc, Span and Token objects label Jun 10, 2023
@adrianeboyd adrianeboyd mentioned this pull request Aug 2, 2023
3 tasks
@svlandeg svlandeg deleted the branch explosion:v5 January 29, 2024 09:15
@svlandeg svlandeg closed this Jan 29, 2024
@svlandeg svlandeg reopened this Jan 29, 2024
@svlandeg svlandeg added 🔜 v5.0 Related to upcoming v5.0 and removed 🔜 v4.0 Related to upcoming v4.0 labels Jan 29, 2024
@svlandeg svlandeg changed the base branch from v4 to v5 January 29, 2024 09:43
@svlandeg
Copy link
Member

Temporarily closing this PR as we currently don't have the bandwidth to wrap this up properly.

@svlandeg svlandeg closed this Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat / doc Feature: Doc, Span and Token objects 🔜 v5.0 Related to upcoming v5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants