Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Introducing DBM for storage #1297

Closed
wants to merge 10 commits into from

Conversation

NikolayOG
Copy link
Contributor

@NikolayOG NikolayOG commented Mar 29, 2023

Summary

Draft commit, please focus on DMBFetchedStateStore class.

Test Plan

  1. Delete the pdq index
  2. Run
threatexchange match photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
No data to match against
  1. Run threatexchange fetch
  2. Run threatexchange match photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
    pdq 16 (Sample Signals) INVESTIGATION_SEED

Now we are finding matches

def get_for_api(self, api: t.Type[SignalExchangeAPI], storage=None) -> fetch_state.FetchedStateStoreBase:
print("PASSED STORAGE??", storage)
# if storage == "dbm":
return DMBFetchedStateStore(api)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this for now,

Next steps would be to propagate storage flag from the cli terminal and select the correct storage interface class

raise NotImplementedError


class DMBFetchedStateStore(fetch_state.FetchedStateStoreBase):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please focus on this class, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DMB or DBM?

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

Thanks for filling out a test plan.

Based on your work here, I think we can actually simplify the base interface to make it easier to implement as an extension. Catch me at the hackathon and we can talk through it!

raise NotImplementedError


class DMBFetchedStateStore(fetch_state.FetchedStateStoreBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DMB or DBM?

data = {}
while k:
print(k, db[k])
_k, _v = map(pickle.loads, [k, db[k]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: underscores usually denote that the variable is unused (not that it's private as in other language conventions), so k, v is probably preferred.

@github-actions github-actions bot added the python-threatexchange Items related to the threatexchange python tool / library label Mar 31, 2023
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome progress! I think this is starting to narrow in on what I think will solve the GIFCT's issues and give us some flexibility in future designs.

Toplevel thoughts:

  1. Let's continue iterating on the GenericStorage interface - this feel like it will play out in the long term more than specializing for the whole
  2. Can you add some docstrings to the GenericStorage index?
  3. (Doesn't yet need to be solved / future PR) Consider also how it might be possible to port the old pickle + dict storage forward. We make a compatibility guarantee for users of the CLI


T = t.TypeVar('T')

class GenericStorage(ABC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Can you add more comments on the expectations for this class?

Comment on lines +58 to +62
k = pickle.dumps(key)
return pickle.loads(self.db[k]) if k in self.db else None

def set(self, key: T, val: T):
k, v = map(pickle.dumps, [key, val])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Why pickle the key?

Comment on lines +68 to +79
def __iter__(self):
self.next_key = self.db.firstkey()
return self

def __next__(self):
if (not self.next_key):
raise StopIteration
curr_key = self.next_key
self.next_key = self.db.nextkey(curr_key)
k, v = map(pickle.loads, [curr_key, self.db[curr_key]])
return k, v

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Why provide magic functions?

self.file = None
self.next_key = None

def connect(self, file: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why connect explicitly instead of as needed?

class DBMFetchedStateStore(fetch_state.FetchedStateStoreBase):
def __init__(
self,
api_cls: TSignalExchangeAPICls,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the final version of this will have a different constructor, but this is fine for a proof of concept.

raise NotImplementedError


class DBMFetchedStateStore(fetch_state.FetchedStateStoreBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Instead of making this DBMFetchStateStore, can you make this GenericStorageStore and pass in an instance of GenericState?

@@ -293,3 +293,10 @@ def get_for_signal_type(
sense.
"""
raise NotImplementedError

@abstractmethod
def exists(self, collab: CollaborationConfigBase) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like it gets called anywhere in the code - where do you plan on using it?

import typing as t
import pickle

T = t.TypeVar('T')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to constrain to the same keyspace as fetch store, you can copy the type signature there (union of str and int)

Comment on lines +13 to +17
def connect(self, *args, **kwargs):
raise NotImplementedError

@abstractmethod
def exists(self, *args, **kwargs):
Copy link
Contributor

@Dcallies Dcallies Apr 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Having *args and **kwargs makes these functions fairly unconstrained. This means that they'll be hard to try and hard to predict. Can you narrow this down further?

class GenericStorage(ABC):

@abstractmethod
def connect(self, *args, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we require that a constructing an instance of this object make it functional, do we need this method?

@Dcallies
Copy link
Contributor

Dcallies commented Mar 1, 2024

Hey @NikolayOG, I'm going to close this PR for inactivity, feel free to re-open if you still want to try and get it merged!

@Dcallies Dcallies closed this Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed python-threatexchange Items related to the threatexchange python tool / library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants