pre-loaded feature dictionary in VW #4500

Open
olgavrou opened this issue Feb 15, 2023 · 1 comment
Labels
Feature Request New feature requested in system

Comments

@olgavrou
Collaborator

olgavrou commented Feb 15, 2023

Short description

The motivating idea is an action dictionary/catalog for Vowpal Wabbit.

Feature dictionaries can be loaded up front as a map from an id to parsed features (only the features matter, not the labels). Incoming examples can then reference an id instead of including the entire feature string, which avoids parsing the same example strings many times.

Possible solution/implementation details

The idea is to

  • expose an API in VW that loads pre-parsed VW::example_features* objects, each referenced by a unique, user-defined id
  • extend examples so they can reference an id instead of holding the full feature string
  • add a reduction that:
    • holds a reference to the loaded dictionary (loading and access should be thread-safe in case an external thread, possibly the parser, reloads the dictionary)
    • checks each incoming example for an id and, if one is present, swaps the example's features with the features from the dictionary (and swaps them back on the way out of the reduction)
    • from a library-usage POV, loading the dictionary is up to the API caller, since VW expects a map from id to VW::example_features*
    • TBD: from a CLI POV, we need to decide on a format that VW can parse and load during setup
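
To make the swap-in/swap-out step and the thread-safe reload concrete, here is a minimal sketch. It is not VW code: `example_features`, `example`, `feature_dictionary`, and `swap_guard` are simplified, hypothetical stand-ins, and the dictionary entry is copied into the example rather than truly swapped, to keep the sketch short.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed VW::example_features.
struct example_features
{
  std::vector<std::string> features;
};

// Hypothetical stand-in for an example that may reference a dictionary id.
struct example
{
  std::string dict_id;  // empty means "no dictionary reference"
  example_features feats;
};

using feature_map = std::unordered_map<std::string, example_features>;

// Holds the current dictionary; reload() may be called from another thread.
class feature_dictionary
{
public:
  void reload(feature_map m)
  {
    auto fresh = std::make_shared<const feature_map>(std::move(m));
    std::lock_guard<std::mutex> lock(_mutex);
    _current = std::move(fresh);
  }

  // Returns a snapshot that stays valid even if reload() runs afterwards.
  std::shared_ptr<const feature_map> snapshot() const
  {
    std::lock_guard<std::mutex> lock(_mutex);
    return _current;
  }

private:
  mutable std::mutex _mutex;
  std::shared_ptr<const feature_map> _current = std::make_shared<const feature_map>();
};

// RAII guard: installs the dictionary features on construction and restores
// the example's own features on destruction (the "way out" of the reduction).
class swap_guard
{
public:
  swap_guard(example& ex, const feature_dictionary& dict) : _ex(ex)
  {
    if (_ex.dict_id.empty()) { return; }
    auto snap = dict.snapshot();
    auto it = snap->find(_ex.dict_id);
    if (it == snap->end()) { throw std::runtime_error("unknown feature dictionary id: " + _ex.dict_id); }
    _saved = std::move(_ex.feats);
    _ex.feats = it->second;  // copy for simplicity; a real reduction would avoid this
    _swapped = true;
  }

  ~swap_guard()
  {
    if (_swapped) { _ex.feats = std::move(_saved); }
  }

  swap_guard(const swap_guard&) = delete;
  swap_guard& operator=(const swap_guard&) = delete;

private:
  example& _ex;
  example_features _saved;
  bool _swapped = false;
};
```

A reduction's learn/predict body would construct a `swap_guard` before calling the base learner, so the original features are restored even if the base learner throws. Handing out `shared_ptr` snapshots is one possible design: a reload never invalidates a dictionary that an in-flight example is still reading.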

*VW::example_features is a new struct that holds the VW::v_array<namespace_index> indices and the std::array<features, NUM_NAMESPACES> feature_space, which together carry the full information about an example's features, plus potentially other information needed for feature counting.
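
As a rough picture of the struct's shape, the sketch below compiles without VW headers. The `namespace_index`, `features`, and `NUM_NAMESPACES` definitions here are simplified stand-ins chosen for illustration (a `std::vector` replaces `VW::v_array`), and `count_features` is a hypothetical helper, not an existing VW function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins so the sketch is self-contained; real code would use VW's own types.
using namespace_index = unsigned char;
constexpr std::size_t NUM_NAMESPACES = 256;

struct features
{
  std::vector<float> values;
  std::vector<std::uint64_t> indices;
  std::size_t size() const { return values.size(); }
};

// Sketch of the proposed struct: which namespaces are present, and the
// features stored under each namespace.
struct example_features
{
  std::vector<namespace_index> indices;
  std::array<features, NUM_NAMESPACES> feature_space;
};

// Hypothetical helper: total feature count over the namespaces listed in indices.
inline std::size_t count_features(const example_features& ef)
{
  std::size_t total = 0;
  for (namespace_index ns : ef.indices) { total += ef.feature_space[ns].size(); }
  return total;
}
```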

Other things to consider

  • All parsers that want to support this feature need to accept a reference id (the JSON parser already supports this)
  • Cache: the cache format needs to be extended to hold the example id, and when a cache is used (as with any other parser) the dictionary needs to be available
  • What do we do if an example sets an id and also includes features?
    • the parser processing that example could reject it, OR
    • ignore the extra features, OR
    • add them to the dictionary example after the swap (and remove them from the dictionary example before exiting the reduction)
  • If the features loaded into the dictionary are populated with audit information, then audit output is complete; otherwise it is incomplete. This is up to the caller of the API
  • Failure mode:
    • if an example references a nonexistent id, we throw; this is a non-recoverable error
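
The three options for an example that carries both an id and inline features, plus the missing-id failure mode, could be expressed as a policy switch. This is only a sketch: `inline_feature_policy`, `resolve_features`, and the string-based `feature_list` are hypothetical names invented for illustration, and the choice of exception types is an assumption.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

using feature_list = std::vector<std::string>;
using feature_map = std::unordered_map<std::string, feature_list>;

// The three candidate behaviours when an example has an id and inline features.
enum class inline_feature_policy
{
  reject,  // treat the example as invalid
  ignore,  // use only the dictionary features
  append   // use the dictionary features followed by the inline ones
};

// Returns the features the learner should see for an example that references
// `id`. Throws std::out_of_range if the id is not in the dictionary.
inline feature_list resolve_features(const feature_map& dict, const std::string& id,
    const feature_list& inline_features, inline_feature_policy policy)
{
  auto it = dict.find(id);
  if (it == dict.end()) { throw std::out_of_range("unknown feature dictionary id: " + id); }

  if (!inline_features.empty() && policy == inline_feature_policy::reject)
  {
    throw std::invalid_argument("example sets both a dictionary id and inline features: " + id);
  }

  // Work on a copy so the dictionary entry itself is never modified.
  feature_list result = it->second;
  if (policy == inline_feature_policy::append)
  {
    result.insert(result.end(), inline_features.begin(), inline_features.end());
  }
  return result;
}
```

Because the sketch appends to a copy, nothing has to be removed from the dictionary entry afterwards. The in-place variant described in the list above would avoid the copy but needs the matching removal step on the way out.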
@olgavrou olgavrou added the Feature Request New feature requested in system label Feb 15, 2023
@olgavrou olgavrou changed the title pre-loaded example dictionary in VW pre-loaded feature dictionary in VW Feb 15, 2023
@RohitRathore1

@olgavrou Is this issue open for anyone to pick up, or is your team going to work on it?
