New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[py-tx] Align tx match --hash
format with the output of tx hash
#1289
Conversation
… produced by the hash command to be used in the match command (i.e. prefixed with 'pdq'). In addition, allows for multiple hashes in a single file of varying signal types. Will throw exceptions when you specify (-S) a signal type, and provide hashes of another type, as well as when there are multiple signal type prefixes with a mixture of no prefixes -- i.e. when we are unable to 'infer' the signal type from the hash
… produced by the hash command to be used in the match command (i.e. prefixed with 'pdq'). In addition, allows for multiple hashes in a single file of varying signal types. Will throw exceptions when you specify (-S) a signal type, and provide hashes of another type, as well as when there are multiple signal type prefixes with a mixture of no prefixes -- i.e. when we are unable to 'infer' the signal type from the hash
tx match --hash
format with the output of tx hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work! I think end result of this change (allowing the commands to pipe into each other) is a huge improvement in the sanity of the UI.
blocking:
- I think you can simplify your group by and validation - put some comments inline, but can also catch you offline
- I think we can reduce this down to three error cases:
- You included a signal without a prefix and it's ambiguous (fix: -S)
- You included a signal not valid for your arguments (fix: remove or change args)
- You included a signal that didn't validate (fix: remove or update hash)
hash = hash.strip() | ||
if not hash: | ||
continue | ||
try: | ||
# Need to keep this final validation as we are yet to have validated the hashes without a prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, forgot to remove this additional validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there! Couple of blocking comments for bits I'm not sure about
@@ -215,6 +221,90 @@ def execute(self, settings: CLISettings) -> None: | |||
fetched_data, | |||
) | |||
|
|||
def validate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This is still only called with --hashes
, not in all calls will use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, will update -- should we also be validating the non --hashes
calls as well? I can update either in this pr or a follow up one.
Takes the hashes grouped by optional string prefix, performs all required validations. | ||
Command line arguments, SignalType, hashes, and ambiguous SignalTypes are validated. | ||
Returns a dict of validated SignalType, and validated hashes of each SignalType. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding a docstring!
One detail (similar to above comment) is that this validates when the input is hashes
): | ||
raise CommandError( | ||
f"Error: SignalType '{self.only_signal} was specified, but attempting to query with additional/different SignalTypes" | ||
f"Please remove or change your arguments, or remove the additional SignalTypes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: f-string not needed here
|
||
# There is a signal without a prefix and it's ambiguous (fix: -S) | ||
if None in hashes_grouped_by_prefix: | ||
if len(hashes_grouped_by_prefix) > 2 or len(signal_types) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignorable: I think this is being overly specific about the input. This will throw if you do
cat 'pdq facefaceface...
facefaceface...' > file.txt
$ tx match -S pdq photo file.txt
but I think it's unlikely that you'll get a non-prefixed value here that will pass the signal type validation on L247
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohhh, this is a good catch. I see what you mean. Thanks!
if self.only_signal and ( | ||
self.only_signal not in hashes_grouped_by_signal | ||
or len(hashes_grouped_by_signal) > 1 | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this block if you pass in the expected signal types into 252 (rather than all settings)
# Validate the hashes | ||
for hash in hashes: | ||
try: | ||
hash = signal_type.validate_signal_str(hash.strip()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may actually change the string for some types (i.e. for some binary types we accept octal, hex, etc). You probably want to save this back.
@@ -225,23 +315,48 @@ def _match_file( | |||
return index.query(s_type.hash_from_file(path)) | |||
|
|||
|
|||
def _group_hashes_by_prefix( | |||
path: pathlib.Path, | |||
settings: CLISettings, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unused - also, I prefer to avoid passing this around because it is somewhat of a god object
line = line.strip() | ||
if not line: | ||
continue | ||
components = line.split() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm torn over what to do when len(component) > 2 - this must mean that there are spaces in the hash, which some future type could allow, but it means our naive parsing here will fail oddly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agree, and had a similar thought -- on re-evaluation I'm thinking for when len(component) > 1 it's probably better to assume that [0] is the prefix and concat [1:] into the hash.
The only issue here is that if there is a future hash with a space, without a prefix the parsing will fail still. I'll have a think on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your intuition is right - if there are hashes with a space, providing the prefix allows for a non-ambiguous parse. Check out lpartition
or split(limit=2)
signal_type, hash = components | ||
else: | ||
# Assume it doesn't have a prefix and is a raw hash | ||
hash = components[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: What about len(component) > 2?
hashes = hashes_grouped_by_prefix.get(signal_type, set()) | ||
hashes.add(hash) | ||
hashes_grouped_by_prefix[signal_type] = hashes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use setdefault again here to simplify (some folks also like defaultdict, but I have burned before!)
hashes_grouped_by_prefix.setdefault(signal_type, set()).add(hash) # you are done
Hi @Sam-Freeman! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
Hey @Sam-Freeman, I'm going to close this PR for inactivity, feel free to re-open if you still want to try and get it merged! |
Summary
Fixes #1188
This pr allows for the usage of the hash format produced by the hash command to be used in the match command (i.e. prefixed with 'pdq'). In addition, allows for multiple hashes in a single file of varying signal types.
Will throw exceptions when:
Test Plan
Wrote unit tests and they all pass. Also ran a few manual tests, and piped the command from hash to match which works.
tx hash photo <photo location> | tx match --hashes -S pdq photo -