Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[py-tx] Align tx match --hash format with the output of tx hash #1289

Closed
wants to merge 5 commits into from

Conversation

Sam-Freeman
Copy link

@Sam-Freeman Sam-Freeman commented Mar 28, 2023

Summary

Fixes #1188
This pr allows for the usage of the hash format produced by the hash command to be used in the match command (i.e. prefixed with 'pdq'). In addition, allows for multiple hashes in a single file of varying signal types.

Will throw exceptions when:

  • You specify (-S) a signal type, and provide hashes of another type
  • When there are multiple signal type prefixes with a mixture of no prefixes -- i.e. we are unable to 'infer' the signal type from the hash
  • Provide an invalid signal type
  • Provide an invalid hash for the signal type

Test Plan

Wrote unit tests and they all pass. Also ran a few manual tests, and piped the command from hash to match which works.
tx hash photo <photo location> | tx match --hashes -S pdq photo -

… produced by the hash command to be used in the match command (i.e. prefixed with 'pdq'). In addition, allows for multiple hashes in a single file of varying signal types. Will throw exceptions when you specify (-S) a signal type, and provide hashes of another type, as well as when there are multiple signal type prefixes with a mixture of no prefixes -- i.e. when we are unable to 'infer' the signal type from the hash
@github-actions github-actions bot added the python-threatexchange Items related to the threatexchange python tool / library label Mar 28, 2023
Sam Freeman added 3 commits March 28, 2023 11:22
… produced by the hash command to be used in the match command (i.e. prefixed with 'pdq'). In addition, allows for multiple hashes in a single file of varying signal types. Will throw exceptions when you specify (-S) a signal type, and provide hashes of another type, as well as when there are multiple signal type prefixes with a mixture of no prefixes -- i.e. when we are unable to 'infer' the signal type from the hash
@Dcallies Dcallies changed the title Fix for Issue 1188. [py-tx] Align tx match --hash format with the output of tx hash Mar 28, 2023
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! I think end result of this change (allowing the commands to pipe into each other) is a huge improvement in the sanity of the UI.

blocking:

  1. I think you can simplify your group by and validation - put some comments inline, but can also catch you offline
  2. I think we can reduce this down to three error cases:
    1. You included a signal without a prefix and it's ambiguous (fix: -S)
    2. You included a signal not valid for your arguments (fix: remove or change args)
    3. You included a signal that didn't validate (fix: remove or update hash)

python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
python-threatexchange/threatexchange/cli/match_cmd.py Outdated Show resolved Hide resolved
hash = hash.strip()
if not hash:
continue
try:
# Need to keep this final validation as we are yet to have validated the hashes without a prefix
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, forgot to remove this additional validation

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there! Couple of blocking comments for bits I'm not sure about

@@ -215,6 +221,90 @@ def execute(self, settings: CLISettings) -> None:
fetched_data,
)

def validate(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This is still only called with --hashes, not in all calls will use it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, will update -- should we also be validating the non --hashes calls as well? I can update either in this pr or a follow up one.

Comment on lines +231 to +233
Takes the hashes grouped by optional string prefix, performs all required validations.
Command line arguments, SignalType, hashes, and ambiguous SignalTypes are validated.
Returns a dict of validated SignalType, and validated hashes of each SignalType.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a docstring!

One detail (similar to above comment) is that this validates when the input is hashes

):
raise CommandError(
f"Error: SignalType '{self.only_signal} was specified, but attempting to query with additional/different SignalTypes"
f"Please remove or change your arguments, or remove the additional SignalTypes"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: f-string not needed here


# There is a signal without a prefix and it's ambiguous (fix: -S)
if None in hashes_grouped_by_prefix:
if len(hashes_grouped_by_prefix) > 2 or len(signal_types) > 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignorable: I think this is being overly specific about the input. This will throw if you do

cat 'pdq facefaceface...
facefaceface...' > file.txt
$ tx match -S pdq photo file.txt

but I think it's unlikely that you'll get a non-prefixed value here that will pass the signal type validation on L247

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhh, this is a good catch. I see what you mean. Thanks!

Comment on lines +257 to +260
if self.only_signal and (
self.only_signal not in hashes_grouped_by_signal
or len(hashes_grouped_by_signal) > 1
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this block if you pass in the expected signal types into 252 (rather than all settings)

# Validate the hashes
for hash in hashes:
try:
hash = signal_type.validate_signal_str(hash.strip())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may actually change the string for some types (i.e. for some binary types we accept octal, hex, etc). You probably want to save this back.

@@ -225,23 +315,48 @@ def _match_file(
return index.query(s_type.hash_from_file(path))


def _group_hashes_by_prefix(
path: pathlib.Path,
settings: CLISettings,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unused - also, I prefer to avoid passing this around because it is somewhat of a god object

line = line.strip()
if not line:
continue
components = line.split()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn over what to do when len(component) > 2 - this must mean that there are spaces in the hash, which some future type could allow, but it means our naive parsing here will fail oddly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree, and had a similar thought -- on re-evaluation I'm thinking for when len(component) > 1 it's probably better to assume that [0] is the prefix and concat [1:] into the hash.

The only issue here is that if there is a future hash with a space, without a prefix the parsing will fail still. I'll have a think on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your intuition is right - if there are hashes with a space, providing the prefix allows for a non-ambiguous parse. Check out lpartition or split(limit=2)

signal_type, hash = components
else:
# Assume it doesn't have a prefix and is a raw hash
hash = components[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: What about len(component) > 2?

Comment on lines +335 to +337
hashes = hashes_grouped_by_prefix.get(signal_type, set())
hashes.add(hash)
hashes_grouped_by_prefix[signal_type] = hashes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use setdefault again here to simplify (some folks also like defaultdict, but I have burned before!)

hashes_grouped_by_prefix.setdefault(signal_type, set()).add(hash)  # you are done

@facebook-github-bot
Copy link
Contributor

Hi @Sam-Freeman!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@Dcallies
Copy link
Contributor

Dcallies commented Mar 1, 2024

Hey @Sam-Freeman, I'm going to close this PR for inactivity, feel free to re-open if you still want to try and get it merged!

@Dcallies Dcallies closed this Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed python-threatexchange Items related to the threatexchange python tool / library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[py-tx] inconsistency in hash and match --hashes format
3 participants