Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[py-tx] inconsistency in hash and match --hashes format #1188

Open
master-hax opened this issue Sep 15, 2022 · 4 comments
Open

[py-tx] inconsistency in hash and match --hashes format #1188

master-hax opened this issue Sep 15, 2022 · 4 comments
Assignees
Labels
do-not-reap help wanted python-threatexchange Items related to the threatexchange python tool / library

Comments

@master-hax
Copy link

threatexchange hash photo <path-to-image-file> outputs hashes in the form

pdq f5b44e4f1673f4e44e57fd6d021201effd8b40d6b54906d3dc302df672160708

however

threatexchange match --hashes photo <path-to-hash-file> expects a single file without a "pdq " prefix

this causes an inconsistency where hashes generated by threatexchange hash can't be directly used in threatexchange match --hashes

@Dcallies
Copy link
Contributor

Dcallies commented Sep 15, 2022

Hey @master-hax, thanks for reaching out!

This is currently intended behavior, however, providing consistency between the file collab type and --hashes might be useful.

--hashes should be requiring you to also do -S to pick the type.

The way I've done it is using cut -d ' ' -f 2 to get only the hashes.

Another trick might be to do something like dataset -p does, which is not print columns that you are explicitly selecting (since otherwise | grep would have worked).

@master-hax some questions:

  1. Would your expectation be that hash and match be directly compatible?
  2. What would the behavior be if someone did threatexchange match --hashes -S pdq photo -- text is not pdq

@Dcallies Dcallies added do-not-reap python-threatexchange Items related to the threatexchange python tool / library labels Sep 15, 2022
@master-hax
Copy link
Author

master-hax commented Sep 15, 2022

Hey @Dcallies, thanks for the fast response

responses:

  1. I expect hash and match --hashes commands to be compatible even if they are not compatible by default. e.g. maybe we could add a flag to hash like hash --raw or hash --no-prefix that outputs the hashes in a match compatible format? (without the "pdq " prefix)
  2. Due to the -S pdq args, i think threatexchange should attempt to interpret the string "text is not pdq" as a pdq hash. and then possibly fail due to the input not being in hexadecimal or not being the correct length

@Dcallies
Copy link
Contributor

Dcallies commented Sep 16, 2022

After thinking about it for a bit, I think it's reasonable to expect that match --hashes and hash will be compatible.

However, definitely want to limit the number of arguments on these commands, especially if there are equivalents in existing bash tools.

--hashes requiring -S has bothered me for a while, I propose we do the following:

  1. Remove the requirement that match --hashes requires -S
  2. --hashes will default to looking for the signal type prefix and will interpret it in order to select the right index
  3. -S with only one option will strip the appropriate prefix (i.e. 'pdq ') if it finds it, and accept a naked hash

This solution provides:

  1. backwards compatibility
  2. requested functionality

But has the downside of potential ambiguity if there is a future signal type which the hash can contain the prefix of its signal type. This is possible with 'raw_text' which a string that started with 'raw_text' would be discovered if using -S with only one option. This seems okay since raw_text is an innocuous string, and has a simple workaround (no -S).

To handle this, we can update the document strings and the error message for missing prefix on hashes.

@Dcallies Dcallies changed the title inconsistency in hash format [py-tx] inconsistency in hash and match --hashes format Sep 16, 2022
@Dcallies
Copy link
Contributor

Working with Sam on this as a ramp-up task for the upcoming hackathon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-reap help wanted python-threatexchange Items related to the threatexchange python tool / library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants