
Support point-in-time backups #52

Closed
akamensky opened this issue Apr 8, 2020 · 13 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@akamensky

This is a one-off tool (meaning it does not need to keep running in the background after the backup is done), so the reliance on a background daemon process is odd. There is no need to run kafka-connect as a daemon at all.

@itadventurer
Owner

You are right regarding the restore procedure. Restoring is a one-off activity.
The backup, however, is a continuously running activity. There is no "I finished doing a backup" in Kafka, as Kafka data is a stream and there is no end to it. Sure, you can assume that you are "done" if you did not get any new data for x seconds, but you cannot generalize that.
Have a look at #46 and #54 for more.

@itadventurer itadventurer added the duplicate This issue or pull request already exists label Apr 11, 2020
@itadventurer
Owner

#56 🎉

@akamensky
Author

akamensky commented Apr 14, 2020

@itadventurer

The backup is a continuously running activity.

This assumes a continuous stream of data 24x7x365, which does not apply to all cases. In our case the stream runs for only X hours per day; the backup happens only after that and is actually intended as a daily backup/snapshot of the data.

I think there should be a way to (internally) detect that there have been no new messages for X amount of time (a possibly configurable interval), after which the backup process would gracefully exit, thus terminating the process.

Another (possibly simpler) alternative would be to back up only messages up to the timestamp at which the backup was started. I am not sure how this would play together with backing up offsets. Maybe back up the offsets first; then we know the timestamp at which the offsets were backed up and can back up messages up to that timestamp.
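The timestamp-cutoff alternative could look roughly like this (a hypothetical sketch; `snapshot_filter` and the record tuples are illustrative, not kafka-backup's API):

```python
def snapshot_filter(records, snapshot_start_ms):
    """Keep only records whose timestamp is at or before the moment the
    backup started; anything produced later is left out of the snapshot.

    `records` is an iterable of (timestamp_ms, payload) pairs.
    """
    return [(ts, payload) for ts, payload in records if ts <= snapshot_start_ms]

# A snapshot started at t=1000 ms excludes the record written at t=1500 ms.
records = [(900, "a"), (1000, "b"), (1500, "c")]
print(snapshot_filter(records, snapshot_start_ms=1000))
# [(900, 'a'), (1000, 'b')]
```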

@itadventurer itadventurer changed the title Any way to do this without running daemon/docker in background? Support point-in-time backups Apr 14, 2020
@itadventurer
Owner

I see your point. Yeah, it would probably be nice to have a way to do point-in-time backups 🤔
Though, this is not trivial, as there is no easy way to decide whether a stream is "finished".

What you can do in your case:

  • Leave Kafka Backup running in the background
  • Kafka Backup writes data continuously to the file system
  • kill -9 Kafka Backup as soon as it is "finished", i.e. it has finished writing your data. This should be promptly after you have finished producing data
  • Move the Kafka Backup data to your new destination

I understand that this is quite a common use case, and I will provide more documentation for it with #2. For v0.1, documentation is the last big issue, so hopefully this should happen soonish ;)


I see the following approach:

  • Standalone kafka-backup tool #54 introduces a new standalone CLI tool. The CLI tool should support this.
  • We add a new flag --snapshot to the CLI tool (or add a new tool called backup-snapshot.sh)

How to detect when a backup is "finished" (only applicable if the --snapshot flag is set):

  • We remember the time when the backup is started. All records that have a newer timestamp are ignored during the backup
  • When a partition does not produce any new data for some time (e.g. 20s) we assume that there is no new data

What do you think?
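The idle-timeout detection above could be sketched roughly like this (a hypothetical `IdleDetector` with an injected clock so the logic is testable without sleeping; not kafka-backup's actual code):

```python
class IdleDetector:
    """Marks the backup as finished once every partition has produced no
    new records for at least `idle_timeout_s` seconds.

    The clock is injected so the logic can be tested without sleeping.
    Hypothetical sketch, not kafka-backup's actual API.
    """

    def __init__(self, partitions, idle_timeout_s, clock):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock
        # Treat "backup started" as the last activity for each partition.
        self.last_seen = {p: clock() for p in partitions}

    def record_arrived(self, partition):
        """Call whenever a record is consumed from `partition`."""
        self.last_seen[partition] = self.clock()

    def all_idle(self):
        """True once every partition has been quiet for the full timeout."""
        now = self.clock()
        return all(now - t >= self.idle_timeout_s for t in self.last_seen.values())

# Simulated timeline: partition 0 sees a record at t=10s, partition 1 never does.
now = [0.0]
detector = IdleDetector(partitions=[0, 1], idle_timeout_s=20, clock=lambda: now[0])
now[0] = 10.0
detector.record_arrived(0)
now[0] = 25.0
print(detector.all_idle())  # False: partition 0 was active only 15s ago
now[0] = 31.0
print(detector.all_idle())  # True: both partitions idle for >= 20s
```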

@itadventurer itadventurer reopened this Apr 14, 2020
@itadventurer itadventurer added question Further information is requested and removed duplicate This issue or pull request already exists labels Apr 14, 2020
@akamensky
Author

akamensky commented Apr 14, 2020

Let Kafka Backup running in the background

The issue is exactly with this step. We cannot keep it running in the background. We only have a specific window in which we can take the snapshot. It is not up to us to decide when we can do the backup; it is an external regulatory requirement.

We remember the time when the backup is started. All records that have a newer timestamp are ignored during the backup

Yes, that is exactly what I meant, and I think this would remove the requirement of having it run in the background (and of trying to catch the moment when all producers are done).

When a partition does not produce any new data for some time (e.g. 20s) we assume that there is no new data

I think this option is mutually exclusive with the other one. And I think the first one is better, as it gives a specific reference point and does not rely on finding a window in which there are no messages.

@itadventurer
Owner

Actually I wanted to write that this is nearly impossible with Kafka, but while writing I got an idea for a solution:

The kafka-consumer-groups tool returns the current position of the consumer in the partition but, more interestingly, it also returns the current end offset of each particular partition. This means there is a way to get the latest offset for a partition at a certain point in time. I currently have no idea how this is achieved (I need to check the code).

So now there is a clear path for doing a (more-or-less) point-in-time backup:

  1. Get the end-of-partition offset for every partition to be backed up (somewhere here: https://github.com/itadventurer/kafka-backup/blob/master/src/main/java/de/azapps/kafkabackup/sink/BackupSinkTask.java#L81)
  2. Consume every partition
  3. As soon as a consumed record has an offset >= the saved one for that partition, remember this and ignore all further records for it in the backup (see https://github.com/itadventurer/kafka-backup/blob/master/src/main/java/de/azapps/kafkabackup/sink/BackupSinkTask.java#L63)
  4. As soon as all partitions are up to date, print a message to STDOUT
  5. Use the wrapper script to detect this message and kill Kafka Connect gracefully, similar to how it is solved during restore: https://github.com/itadventurer/kafka-backup/blob/master/bin/restore-standalone.sh#L232-L252
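The steps above could be sketched as plain logic, leaving out the Connect plumbing (names are hypothetical; in kafka-backup the cutoff would have to be wired into BackupSinkTask):

```python
class SnapshotTracker:
    """Per-partition cutoff for a point-in-time backup.

    `end_offsets` maps each partition to its log-end offset (the offset the
    next produced record would get), captured once when the backup starts.
    A partition is complete when consumption reaches that offset; once all
    partitions are complete, a wrapper script can stop Kafka Connect.
    Hypothetical sketch, not kafka-backup's actual classes.
    """

    def __init__(self, end_offsets):
        self.end_offsets = end_offsets
        # Empty partitions are complete from the start.
        self.done = {p: off == 0 for p, off in end_offsets.items()}

    def should_backup(self, partition, offset):
        """True if the record belongs in the snapshot; records at or past
        the saved end offset are ignored and mark the partition done."""
        if offset >= self.end_offsets[partition]:
            self.done[partition] = True
            return False
        if offset == self.end_offsets[partition] - 1:
            self.done[partition] = True  # last record of the snapshot
        return True

    def finished(self):
        return all(self.done.values())

# Partition 0 had two records when the backup started; partition 1 was empty.
tracker = SnapshotTracker({0: 2, 1: 0})
print(tracker.should_backup(0, 0))  # True: part of the snapshot
print(tracker.should_backup(0, 1))  # True: last snapshot record, marks 0 done
print(tracker.should_backup(0, 2))  # False: produced after the backup started
print(tracker.finished())           # True: safe to signal the wrapper script
```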

You see that this is really not that trivial.

My current focus is to improve the test suite and stabilize Kafka Backup for a first release (see https://github.com/itadventurer/kafka-backup/milestone/1). I cannot give you an ETA for this feature. I would be more than happy to review a PR for it (and I am also searching for additional maintainers ;) ). I am happy to help if there are any questions.

@akamensky
Author

You see that this is really not that trivial.

I am more on the operations side of things (setting up and monitoring Kafka clusters, etc.), so I trust you on this part. My point is that, from my side of the work, this is something I (and I am pretty sure many others) do need.

I would be more than happy to review a PR for that (and I am also searching for additional maintainers ;) )

I am not that great with Java/Scala, so I would not be of much help here. If it were Python, C/C++, or at the very least Go, I could help :P

@itadventurer itadventurer added enhancement New feature or request help wanted Extra attention is needed and removed question Further information is requested labels Apr 17, 2020
@FloMko

FloMko commented Jun 16, 2020

Hello!
First of all, I am happy to have found your solution, because I need to back up Kafka topic data.
Second, unfortunately I cannot write anything in Java/Scala, so I have prepared a Python 'wrapper' around your 'backup-standalone.sh' as a full backup solution:
https://gist.github.com/FloMko/7adf2e00cd80fe7cc88bb587cde999ce
It would be nice to see updates on built-in support for point-in-time backups.

@itadventurer
Owner

Hey,
Thank you for your work! As a temporary workaround, I could imagine adding this as an additional script to this repo and replacing it later with a built-in solution. Feel free to add it as a pull request :) (to ./bin/kafka-backup-point-in-time.py or something else ;) )

@akamensky
Author

I am about to publish a completely separate implementation, written in Go, that does not rely on the Connect API. Just FYI. We are already using it in our production environment.

@FloMko

FloMko commented Jun 18, 2020

@akamensky could you share your solution? As long as you have tested it, it should be fine.

@akamensky
Author

akamensky commented Jun 23, 2020

@FloMko we just published it. You can find it (as well as a prebuilt binary) here

@itadventurer
Owner

Thank you @WesselVS for your PR #99! I have just merged it to master. I will do a release with this enhancement and some other fixes soonish.

@akamensky Cool! Great to see some more work regarding Kafka Backups ;)

3 participants