
Avoid duplicate imports #513

Open

EvanHahn opened this issue Mar 11, 2024 · 2 comments

@EvanHahn Contributor
Description

Normally we use a random ID for new records. This can lead to duplicates when importing data: the same data can be imported twice, resulting in duplicate records with different IDs but the same information.

This could be done in a few ways, but we think the best approach is something like this:

  1. Hash the object into a stable ID.
  2. Use that ID instead of the random one, so importing the same data twice doesn't create a second record.

Here's some sample code for hashing an object:

import stableStringify from 'json-stable-stringify'
import { createHash } from 'node:crypto'

// Serialize the object with stable key ordering, then hash the result,
// so the same data always produces the same ID.
function hashObject(obj) {
  return createHash('sha256').update(stableStringify(obj)).digest()
}
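
To make the intent concrete, here's a rough sketch of how that hash might be used; the deriveDocId name and the record shape are made up for illustration, not the actual schema:

// Hypothetical: derive the doc ID from the content rather than generating
// a random one, so importing the same data twice maps to the same ID.
function deriveDocId(obj) {
  return hashObject(obj).toString('hex')
}

const preset = { name: 'River', color: '#0000ff' }
// json-stable-stringify sorts keys, so key order doesn't change the ID:
deriveDocId(preset) === deriveDocId({ color: '#0000ff', name: 'River' }) // true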

We discussed doing this more generally in #507 but decided to do this only for config imports for now.

Tasks

  • Add tests for duplicate imports, ensuring that only one record is created (a sketch of the core property follows this list)
  • Make those tests pass
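
Not the real test (that would go through the actual config importer), but the core property it needs to check, as a standalone sketch:

import test from 'node:test'
import assert from 'node:assert/strict'
// hashObject is the function from the snippet above

test('importing the same data twice yields the same ID', () => {
  const obj = { name: 'River', color: '#0000ff' }
  const first = hashObject(obj).toString('hex')
  const second = hashObject({ ...obj }).toString('hex')
  // With content-derived IDs, a duplicate import resolves to the same ID,
  // so only one record should end up being created.
  assert.equal(first, second)
})
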
@tomasciccola Contributor

I'm wondering if this should apply to icons. I don't know about performance, but we're embedding the icon blobs into the doc itself, so wouldn't hashing be slow for icons?

@EvanHahn Contributor Author

tl;dr: I think it will be fine from a performance perspective.


I wrote this unoptimized script that hashes a 1GiB file:

import { createHash } from 'node:crypto'
import { readFileSync } from 'node:fs'

// Generated with `head -c 1073741824 /dev/random > big_file`
const input = readFileSync('./big_file')

console.time('computing hash')
const hash = createHash('sha256')
hash.update(input)
console.log(hash.digest().toString('hex'))
console.timeEnd('computing hash')

On my slower laptop (with a ~2.3GHz 4-core Intel processor), this takes about 3 seconds. On my faster machine (a 3.5GHz 16-core AMD processor), this takes about 550 milliseconds.

This is a bit slow, but (1) this is a giant icon file and (2) my code is unoptimized. A more realistic (but still large) 1MiB file finishes in just 10ms on my slower machine, and about 4ms on my faster one.
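
For anyone who wants to reproduce the 1MiB case, a quick in-memory variant (not the exact script used for those numbers) looks roughly like this:

import { createHash, randomBytes } from 'node:crypto'

// 1 MiB of random data, generated in memory instead of read from a file
const input = randomBytes(1024 * 1024)

console.time('computing hash (1MiB)')
createHash('sha256').update(input).digest()
console.timeEnd('computing hash (1MiB)')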

Personally, I think that's acceptable. Does that seem okay to you?
