Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDQ: Expected hashes of regression tests differ between reference implementations #929

Open
eklann opened this issue Feb 18, 2022 · 5 comments
Labels
do-not-reap pdq Items related to the pdq libraries or reference implementations

Comments

@eklann
Copy link

eklann commented Feb 18, 2022

In the Java reference implementation, there is a test that asserts that the hash of the sample image "bridge-1-original.jpg" is "f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22". See: https://github.com/facebook/ThreatExchange/blob/main/pdq/java/src/test/java/hasher/pdqhashing/PDQHasherTest.java

In the C++ readme, the hash of the same image is said to be "f8f8f0cce0f4e84d0e370a22028f67f0b36e2ed596623e1d33e6339c4e9c9b22". See: https://github.com/facebook/ThreatExchange/tree/main/pdq/cpp/bin

I can't find a mention in the PHP code that specifically tests this file, but the hash "f8f8f0cce0f4e84d0e370a22028f67f0b36e2ed596623e1d33e6339c4e9c9b22" (same as for C++) is used as an example here: https://github.com/facebook/ThreatExchange/blob/main/pdq/php/pdqhashtest.php

In the python code (is this an official reference implementation?) it seems that the expected hash is yet another hash: "d8f8f0cce0f4a84f0e370a22028f67f0b36e2ed596623e1d33e6b39c4e9c9b22". See: https://github.com/facebook/ThreatExchange/blob/main/pdq/python/pdqhashing/tests/pdq_test.py

In the SQL examples, the hash "f8f8f0cce0f4e84d0e370a22028f67f0b36e2ed596623e1d33e6339c4e9c9b22" (same as C++) is used as an example. See: https://github.com/facebook/ThreatExchange/tree/main/pdq/sql

In short: It seems that the Java and Python implementations differs in expectation about the hash value of that image compared to C++. This seems problematic. Why are they not the same? Are there bugs in the reference implementations? Am I missing something?

@BigPanda97
Copy link

Maybe reading the documentation would help you: https://raw.githubusercontent.com/facebook/ThreatExchange/main/hashing/hashing.pdf

For this reason we have chosen a tradeoff: (a) Use implementation-specific downsampler to get to 512x512; (b) use the two-pass tent filter to go from 512x512 to 64x64. This is a balance between efficient computation, and a few bits (0/2/4/6) of hamming distance between different implementations given the same image.

The downsizing of the image in different implementations leads to slightly different hashes...

@eklann
Copy link
Author

eklann commented Feb 20, 2022

Thanks for the reference!

After examining the C++/Java/Python implementation a bit more closely, it seems like:

  • C++ uses CImg, which in turn defaults to nearest neighbor.
  • The Java implementation does not have any logic for downsampling before the hash computation is done, so hashing is done on the full resolution image.
  • Python uses PIL, which in turn defaults to bicubic.

So if I want some sample image along with an expected hash as a regression test for a new implementation, it seems I should use the hash from the Java implementation to verify that the algorithm is implemented correctly.

I could not find any guideline in the documentation about what type of downsamling to use. I suspect that this might impact the precision. Any insight to share about this?

@Dcallies
Copy link
Contributor

@BarrettOlson was looking into the exact same problem for verifying cross-languages. The downsampling problem makes it a pain to verify correctness for sure. Barrett, did you have any luck with pre-downsampled inputs, or trying to intercept the bits post-resizing?

@Dcallies Dcallies added do-not-reap pdq Items related to the pdq libraries or reference implementations labels Feb 22, 2022
@BarrettOlson
Copy link
Contributor

BarrettOlson commented Feb 22, 2022

Yes, I believe downsampling is creating these differences. The original authors intent was to have users replace CImg.h with their own (potentially proprietary and more efficient) image processing library with the expectation that a few bits difference (0/2/4/6) was acceptable (given a common distance threshold being closer to 31).

I am looking into better (more exact) verification options such as, having an intermediate representation or images already sized 512x512 as input but don’t have much to add to this thread yet. Also open to suggestions! :)

@eklann
Copy link
Author

eklann commented Feb 23, 2022

Thanks for the info!

My personal suggestion would be that the library used to hash an image does not care about resolution and downsampling, and then the user of that lib takes care of any preprocessing on the image. This would make it much easier to test as a black-box, plus the user can actually choose to have the hash computed on the full resolution if that is important for the use case.

Having regression tests on images below 512x512px seems like it would solve the issue of verification cross-language for now. But if the downsampling logic is in the hashing code, the tests will be sensitive to future modifications to this 512px limit.

I'll use the hash from the Java tests, since it does not do any downsampling from what I can see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-reap pdq Items related to the pdq libraries or reference implementations
Projects
None yet
Development

No branches or pull requests

4 participants