Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider configuring "index_predicates" flag in variable definition #1052

Open
NickCrews opened this issue Jun 9, 2022 · 2 comments
Open

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Jun 9, 2022

Currently, you choose whether or not to use index predicates by passing the index_predicates flag in prepare_training().

This has some drawbacks

  • Indexing happens regardless, in a previous step. Slow. (This is my motivation)
  • You get all or nothing, perhaps you only want to index certain variables
  • The code is complex, because you have to pass this flag between function calls.
  • Things like Fingerprinter have to know the details of predicates. eg currently it does stuff like if hasattr(predicate, "index"). It would be better if the Fingerprinter never even received this predicate in the first place. Similar thing in DataModel. It also looks like the BlockLearners know about canopy vs non-canopy predicates, eg when it calls self.data_model.predicates(canopies=False), but this might be a separate topic. I don't understand what a canopy index is and why you sometimes would want to use or not use one.

What if instead we defined this flag as part of the variable definition? eg {"field": "name", "type": "ShortString", "index": False}. Then, the variable itself is responsible for adding or omitting the index predicates, and everything else just accepts this at face value.

@fgregg
Copy link
Contributor

fgregg commented Jun 11, 2022

i like this idea. some considerations:

  1. what's the default? i think it makes sense that you have to toggle on index predicates
  2. not all variable types have index predicates, we would need to provide some reasonable warning when that flag is a no-op
  3. what does the transition story look like. there are a few cases
    1. in train, index_predicates=False: disable all index predicates regardless of what the variable definition is
    2. in train, index_predicates=True do whatever the variable definitions say.
    3. in train, index_predicates is not explicitly set. If there are variable definitions that ask for index predicates, then i think we should do what the variable definitions ask
    4. in train, index_predicates is not explicitly set. if there are no variables that ask for index_predicates (current situation), then probably raise an warning?
  4. would be want to preserve a flag to disable all index_predicates in the train method... maybe.

@NickCrews
Copy link
Contributor Author

what's the default? i think it makes sense that you have to toggle on index predicates

I agree, by default don't use index predicates

not all variable types have index predicates, we would need to provide some reasonable warning when that flag is a no-op

I would throw an error. Plenty of other variables have params and settings that are specific to them, I view this change as lumping this option in with the other ones.

would be want to preserve a flag to disable all index_predicates in the train method... maybe.

I think the end goal should be to remove the index_predicates argument from train(). The issue as you say is the transition period. I would think that should keep past behavior:

    def train(
        self, recall: float = 1.00, **kwargs
    ) -> None:  # pragma: no cover
        index_predicates = kwargs.get("index_predicates", None)
        if index_predicates is not None:
            warnings.warn("the arg deprecated")
            if index_predicates:
                # override variable config, pretend "index":True was used in all definitions
            else:
                # override variable config, pretend "index":False was used in all definitions
        # continue onwards
  

After some time, we remove the **kwargs and this logic.

This agrees with your suggestions i and iii, but disagrees with ii. If we follow ii and have variable indexes off by default, then this is going to change behavior for people who don't update their code. I don't think I understand the situation you are talking about for iv?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants