-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Introducing DBM for storage #1297
Conversation
def get_for_api(self, api: t.Type[SignalExchangeAPI], storage=None) -> fetch_state.FetchedStateStoreBase: | ||
print("PASSED STORAGE??", storage) | ||
# if storage == "dbm": | ||
return DMBFetchedStateStore(api) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore this for now,
Next steps would be to propagate storage flag from the cli terminal and select the correct storage interface class
raise NotImplementedError | ||
|
||
|
||
class DMBFetchedStateStore(fetch_state.FetchedStateStoreBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please focus on this class, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DMB or DBM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work!
Thanks for filling out a test plan.
Based on your work here, I think we can actually simplify the base interface to make it easier to implement as an extension. Catch me at the hackathon and we can talk through it!
raise NotImplementedError | ||
|
||
|
||
class DMBFetchedStateStore(fetch_state.FetchedStateStoreBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DMB or DBM?
data = {} | ||
while k: | ||
print(k, db[k]) | ||
_k, _v = map(pickle.loads, [k, db[k]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: underscores usually denote that the variable is unused (not that it's private as in other language conventions), so k
, v
is probably preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome progress! I think this is starting to narrow in on what I think will solve the GIFCT's issues and give us some flexibility in future designs.
Toplevel thoughts:
- Let's continue iterating on the GenericStorage interface - this feel like it will play out in the long term more than specializing for the whole
- Can you add some docstrings to the GenericStorage index?
- (Doesn't yet need to be solved / future PR) Consider also how it might be possible to port the old pickle + dict storage forward. We make a compatibility guarantee for users of the CLI
|
||
T = t.TypeVar('T') | ||
|
||
class GenericStorage(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Can you add more comments on the expectations for this class?
k = pickle.dumps(key) | ||
return pickle.loads(self.db[k]) if k in self.db else None | ||
|
||
def set(self, key: T, val: T): | ||
k, v = map(pickle.dumps, [key, val]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Why pickle the key?
def __iter__(self): | ||
self.next_key = self.db.firstkey() | ||
return self | ||
|
||
def __next__(self): | ||
if (not self.next_key): | ||
raise StopIteration | ||
curr_key = self.next_key | ||
self.next_key = self.db.nextkey(curr_key) | ||
k, v = map(pickle.loads, [curr_key, self.db[curr_key]]) | ||
return k, v | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Why provide magic functions?
self.file = None | ||
self.next_key = None | ||
|
||
def connect(self, file: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why connect explicitly instead of as needed?
class DBMFetchedStateStore(fetch_state.FetchedStateStoreBase): | ||
def __init__( | ||
self, | ||
api_cls: TSignalExchangeAPICls, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the final version of this will have a different constructor, but this is fine for a proof of concept.
raise NotImplementedError | ||
|
||
|
||
class DBMFetchedStateStore(fetch_state.FetchedStateStoreBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Instead of making this DBMFetchStateStore, can you make this GenericStorageStore
and pass in an instance of GenericState?
@@ -293,3 +293,10 @@ def get_for_signal_type( | |||
sense. | |||
""" | |||
raise NotImplementedError | |||
|
|||
@abstractmethod | |||
def exists(self, collab: CollaborationConfigBase) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look like it gets called anywhere in the code - where do you plan on using it?
import typing as t | ||
import pickle | ||
|
||
T = t.TypeVar('T') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to constrain to the same keyspace as fetch store, you can copy the type signature there (union of str and int)
def connect(self, *args, **kwargs): | ||
raise NotImplementedError | ||
|
||
@abstractmethod | ||
def exists(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Having *args and **kwargs makes these functions fairly unconstrained. This means that they'll be hard to try and hard to predict. Can you narrow this down further?
class GenericStorage(ABC): | ||
|
||
@abstractmethod | ||
def connect(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we require that a constructing an instance of this object make it functional, do we need this method?
Hey @NikolayOG, I'm going to close this PR for inactivity, feel free to re-open if you still want to try and get it merged! |
Summary
Draft commit, please focus on
DMBFetchedStateStore
class.Test Plan
threatexchange fetch
threatexchange match photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
pdq 16 (Sample Signals) INVESTIGATION_SEED
Now we are finding matches