Should TrailDB deduplicate events? #85

Open
ckuethe opened this issue Jul 8, 2016 · 4 comments

ckuethe commented Jul 8, 2016

For discussion: it would be nice if TrailDB could deduplicate events. Below is a simple script that inserts some records twice. Appending the exact same database twice is a little silly, but I could plausibly end up with duplicate events when merging several different log types for a given time period.

from traildb import TrailDB, TrailDBConstructor
from uuid import uuid4

fields = ['text']

# Build a small database: 2 trails with 5 events each.
cons = TrailDBConstructor('/tmp/test1', fields)
for x in range(2):
    uid = uuid4().hex
    for ts in range(5):
        cons.add(uid, ts, ['trail {}, time {}'.format(uid, ts)])

tdb = cons.finalize()
print('{} fields, {} trails, {} events'.format(tdb.num_fields, tdb.num_trails, tdb.num_events))

# Append the same database twice; every event ends up stored twice.
cons = TrailDBConstructor('/tmp/test2', fields)
cons.append(tdb)
cons.append(tdb)

tdb = cons.finalize()
print('{} fields, {} trails, {} events'.format(tdb.num_fields, tdb.num_trails, tdb.num_events))

prints

2 fields, 2 trails, 10 events
2 fields, 2 trails, 20 events

gregn-adroll (Contributor) commented

What did you have in mind for the semantics of deduplication? Are you picturing something like a flag passed to the constructor that makes it silently drop exact duplicates of events it has already handled?

ckuethe commented Jul 8, 2016

Yes, a flag to the constructor to silently drop duplicates would be great. That would allow me to backfill logs and still have unique events.
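
To make the request concrete, here is a minimal sketch of what such a flag could do, written as a wrapper outside TrailDB. Everything in it is an assumption for illustration: `DedupingSink`, its `dedup` argument, and the `ListSink` stand-in are hypothetical names, not part of the TrailDB API. The wrapper only assumes the wrapped object has an `add(uuid, ts, values)` method like the constructor used in the script above.

```python
class DedupingSink(object):
    """Hypothetical wrapper: forwards add() calls, skipping exact repeats."""

    def __init__(self, cons, dedup=True):
        self.cons = cons      # any object with add(uuid, ts, values)
        self.dedup = dedup    # the "flag": False forwards everything
        self.seen = set()     # keys of events already forwarded
        self.dropped = 0      # number of duplicates skipped so far

    def add(self, uuid, ts, values):
        # An event is identified by uuid, timestamp and all field values.
        key = (uuid, ts, tuple(values))
        if self.dedup:
            if key in self.seen:
                self.dropped += 1
                return False
            self.seen.add(key)
        self.cons.add(uuid, ts, values)
        return True


class ListSink(object):
    """Stand-in for a real constructor so the sketch runs without TrailDB."""

    def __init__(self):
        self.events = []

    def add(self, uuid, ts, values):
        self.events.append((uuid, ts, list(values)))


def backfill(sink, n_passes=2):
    """Feed the same five events into the sink n_passes times."""
    for _ in range(n_passes):
        for ts in range(5):
            sink.add('a' * 32, ts, ['time {}'.format(ts)])
```

With `dedup=True`, running `backfill` on a wrapped `ListSink` leaves five events in the sink and counts five dropped duplicates. The `seen` set grows with the number of unique events, so a real implementation would need to consider memory use on large inputs.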

tuulos commented Jul 8, 2016

A duplicate in this context means that all fields are equal, including the timestamp and the uuid? Implementing dedup logic like this should be quite doable.

ckuethe commented Jul 8, 2016

Yes, all the fields, including timestamp and uuid, would have to be equal for an event to be considered a duplicate.

  • Different UUID? Lightning struck Alice instead of Bob. Log it.
  • Different timestamp? Bob got hit by lightning again. Log it.
  • Alice and Carol both telling me that Bob got hit by lightning at noon? If deduplication is active, I don't care who told me, only that I have a record of the event. (The logged event may or may not have a source host field, as appropriate.)
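
The three cases above can be checked with a small sketch in plain Python. It is an illustration only, not TrailDB code: the `dedup` function, the `NOON` constant, and the sample events are hypothetical. An event is treated as a `(uuid, timestamp, values)` tuple, and two events are duplicates only when all three parts match.

```python
def dedup(events):
    """Yield each (uuid, timestamp, values) event once, in first-seen order."""
    seen = set()
    for uuid, ts, values in events:
        key = (uuid, ts, tuple(values))
        if key not in seen:
            seen.add(key)
            yield (uuid, ts, values)


NOON = 12 * 60 * 60  # noon as seconds after midnight, for illustration

# No source host field: Alice's and Carol's reports are identical.
reports = [
    ('bob', NOON, ['lightning']),        # reported by Alice
    ('bob', NOON, ['lightning']),        # reported by Carol: duplicate
    ('alice', NOON, ['lightning']),      # different uuid: kept
    ('bob', NOON + 60, ['lightning']),   # different timestamp: kept
]
unique = list(dedup(reports))

# With a source host field, the same two reports differ and both survive.
reports_with_source = [
    ('bob', NOON, ['lightning', 'alice-host']),
    ('bob', NOON, ['lightning', 'carol-host']),
]
unique_with_source = list(dedup(reports_with_source))
```

Here `unique` holds three events and `unique_with_source` holds two, which shows why the choice of whether to log a source host field decides whether the Alice and Carol reports collapse into one.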
