Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--reuse-media creates an unacceptably high chance of a collision #1231

Open
4 of 6 tasks
Lymia opened this issue May 12, 2024 · 1 comment · May be fixed by #1232
Open
4 of 6 tasks

--reuse-media creates an unacceptably high chance of a collision #1231

Lymia opened this issue May 12, 2024 · 1 comment · May be fixed by #1232
Labels

Comments

@Lymia
Copy link

Lymia commented May 12, 2024

Version

v2.42.8

Flavor

GUI (Graphical User Interface), CLI (Command-Line Interface)

Platform

Linux

Export format

HTML, TXT, JSON, CSV

Steps to reproduce

Export servers where large numbers of videos, images and especially screenshots are often posted. Observe that files named image.png or unknown.png are extremely common due to copy-paste of images directly into Discord, often leaving no file name available.

Do math to confirm the birthday problem is at play and will be causing you problems.

Details

The algorithm for making file names for downloaded media only stores 20 bytes of the hash of the URL. This is insufficient.

Despite there being 1048576 possible hashes, it only takes 1200 images for there to be a 50% chance of a collision and 500 images before there is a 10% chance of a collision due to the birthday problem. The original reasoning in the issue where this code was added (#395) is incorrect due to the commonness of unknown.png, image.png, and other filenames.

A list of all filenames I've seen that could reach that 10% chance of a collision in just my own exports of servers I'm in:

    508 hqdefault.jpg
    541 image0.png
    681 sddefault.jpg
   1269 image0.jpg
   1696 tenor.png
   5798 maxresdefault.jpg
   6637 image.png
  11314 unknown.png

As can clearly be seen, file names are not evenly distributed. Notably, maxresdefault, sddefault and hqdefault are not being posted intentionally by users at all, and are instead the result of youtube thumbnails. Similarly, tenor happens because the tenor gif button is pressed in Discord's own UI.

There is a near 100% chance that there are collisions here between unknown.png, image.png, and maxresdefault.jpg. You can check with this calculator: https://www.bdayprob.com/. Solve P(D,N) with D = 1048576 and N = 6637.

10 characters provides a better margin of safety: 15000 images with the same name before there is even a 0.01% chance, which would have successfully handled my use case.

10 characters of base-32 rather than base-16 would provide 600000 images before there is the same chance of collision. 8 characters of base-32 would provide the same margin as the 10 characters of base-16.

Checklist

  • I have looked through existing issues to make sure that this bug has not been reported before
  • I have provided a descriptive title for this issue
  • I have made sure that this bug is reproducible on the latest version of the application
  • I have provided all the information needed to reproduce this bug as efficiently as possible
  • I have sponsored this project
  • I have not read any of the above and just checked all the boxes to submit the issue
@Lymia Lymia added the bug label May 12, 2024
@Lymia Lymia changed the title --reuse-media creates an unacceptably high chance of a --reuse-media creates an unacceptably high chance of a collision May 12, 2024
@96-LB
Copy link
Sponsor Contributor

96-LB commented May 12, 2024

An alternative suggestion to lengthening the hash would be to just put the attachment ID in the filename. I feel like at a certain point, a bunch of random base-32 characters doesn't really provide much utility over 19 digits (which would be guaranteed to never collide).

Lymia added a commit to Lymia/DiscordChatExporter that referenced this issue May 13, 2024
@Tyrrrz Tyrrrz linked a pull request May 13, 2024 that will close this issue
PrEDRiSing added a commit to PrEDRiSing/DiscordChatExporter that referenced this issue Jun 2, 2024
Improve filenames for downloaded assets. Fixes Tyrrrz#1231
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants