
Support point-in-time backups #52

Closed
akamensky opened this issue Apr 8, 2020 · 13 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@akamensky

This is a one-off tool (meaning it does not need to keep running in the background after the backup is done), so the reliance on a background daemon process is odd. There is no need to run kafka-connect as a daemon at all.

@itadventurer
Owner

You are right regarding the restore procedure. Restoring is a one-off activity.
The backup, however, is a continuously running activity. There is no "I finished doing a backup" in Kafka, as Kafka data is a stream and there is no end to it. Sure, you can assume that you are "done" if you did not get any new data for x seconds, but you cannot generalize that.
Have a look at #46 and #54 for more.

@itadventurer itadventurer added the duplicate This issue or pull request already exists label Apr 11, 2020
@itadventurer
Owner

#56 🎉

@akamensky
Author

akamensky commented Apr 14, 2020

@itadventurer

The backup is a continuously running activity.

This assumes a continuous stream of data 24x7x365, which does not apply to all cases. In our case the stream runs for only X hours per day; the backup happens only after that and is actually intended as a daily backup/snapshot of the data.

I think there should be a way to (internally) detect that there have been no new messages for X amount of time (a possibly configurable interval), after which the backup process would gracefully exit, thus terminating the process.

Another (possibly simpler) alternative would be to back up only messages up to the timestamp at which the backup was started. I am not sure how this would play together with backing up offsets. Maybe back up the offsets first; then we know the timestamp at which the offsets were backed up and can back up messages up to that timestamp.
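The timestamp-cutoff alternative could look roughly like this (a hypothetical sketch; `snapshot_filter` and the record tuples are illustrative, not kafka-backup's API):

```python
def snapshot_filter(records, snapshot_start_ms):
    """Keep only records whose timestamp is at or before the moment the
    backup started; anything produced later is left out of the snapshot.

    `records` is an iterable of (timestamp_ms, payload) pairs.
    """
    return [(ts, payload) for ts, payload in records if ts <= snapshot_start_ms]

# A snapshot started at t=1000 ms excludes the record written at t=1500 ms.
records = [(900, "a"), (1000, "b"), (1500, "c")]
print(snapshot_filter(records, snapshot_start_ms=1000))
# [(900, 'a'), (1000, 'b')]
```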

@itadventurer itadventurer changed the title Any way to do this without running daemon/docker in background? Support point-in-time backups Apr 14, 2020
@itadventurer
Owner

I see your point. Yeah, it would probably be nice to have a way to do point-in-time backups 🤔
Though, this is not trivial, as there is no easy way to decide whether a stream is "finished".

What you can do in your case:

  • Leave Kafka Backup running in the background
  • Kafka Backup writes data continuously to the file system
  • kill -9 Kafka Backup as soon as it is "finished", i.e. it has finished writing your data. This should be promptly after you have finished producing data
  • Move the Kafka Backup data to your new destination

I understand that this is quite a common use case, and I will provide more documentation for it with #2. For v0.1, documentation is the last big issue, so hopefully this should happen soonish ;)


I see the following approach:

  • Standalone kafka-backup tool #54 introduces a new standalone CLI tool. The CLI tool should support this.
  • We add a new flag --snapshot to the CLI tool (or add a new tool called backup-snapshot.sh)

How to detect when a backup is "finished" (only applicable if the --snapshot flag is set):

  • We remember the time when the backup is started. All records that have a newer timestamp are ignored during the backup
  • When a partition does not produce any new data for some time (e.g. 20s) we assume that there is no new data

What do you think?
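The idle-timeout detection above could be sketched roughly like this (a hypothetical `IdleDetector` with an injected clock so the logic is testable without sleeping; not kafka-backup's actual code):

```python
class IdleDetector:
    """Marks the backup as finished once every partition has produced no
    new records for at least `idle_timeout_s` seconds.

    The clock is injected so the logic can be tested without sleeping.
    Hypothetical sketch, not kafka-backup's actual API.
    """

    def __init__(self, partitions, idle_timeout_s, clock):
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock
        # Treat "backup started" as the last activity for each partition.
        self.last_seen = {p: clock() for p in partitions}

    def record_arrived(self, partition):
        """Call whenever a record is consumed from `partition`."""
        self.last_seen[partition] = self.clock()

    def all_idle(self):
        """True once every partition has been quiet for the full timeout."""
        now = self.clock()
        return all(now - t >= self.idle_timeout_s for t in self.last_seen.values())

# Simulated timeline: partition 0 sees a record at t=10s, partition 1 never does.
now = [0.0]
detector = IdleDetector(partitions=[0, 1], idle_timeout_s=20, clock=lambda: now[0])
now[0] = 10.0
detector.record_arrived(0)
now[0] = 25.0
print(detector.all_idle())  # False: partition 0 was active only 15s ago
now[0] = 31.0
print(detector.all_idle())  # True: both partitions idle for >= 20s
```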

@itadventurer itadventurer reopened this Apr 14, 2020
@itadventurer itadventurer added question Further information is requested and removed duplicate This issue or pull request already exists labels Apr 14, 2020
@akamensky
Author

akamensky commented Apr 14, 2020

Let Kafka Backup running in the background

The issue is exactly with this step. We cannot keep it running in the background. We only have a specific window in which we can take the snapshot. It is not up to us to decide when we can do the backup; it is an external regulatory requirement.

We remember the time when the backup is started. All records that have a newer timestamp are ignored during the backup

Yes, that is exactly what I meant, and I think this would remove the requirement of having it run in the background (and of trying to catch the moment when all producers are done).

When a partition does not produce any new data for some time (e.g. 20s) we assume that there is no new data

I think this option is mutually exclusive with the other one. And I think the first one is better, as it gives a specific reference point and does not rely on finding a window in which there are no messages.

@itadventurer
Owner

Actually I wanted to write that this is nearly impossible with Kafka, but while writing I got an idea for a solution:

The kafka-consumer-groups tool returns the current position of the consumer in the partition but, more interestingly, it also returns the current end offset of each particular partition. This means there is a way to get the latest offset for a partition at a certain point in time. I currently have no idea how this is achieved (I need to check the code).

So now there is a clear path for doing a (more-or-less) point-in-time backup:

  1. Get the end-of-partition offset for every partition to be backed up (somewhere here: https://github.com/itadventurer/kafka-backup/blob/master/src/main/java/de/azapps/kafkabackup/sink/BackupSinkTask.java#L81)
  2. Consume every partition
  3. As soon as a consumed record has an offset >= the saved one for that partition, remember this and ignore all further records for it in the backup (see https://github.com/itadventurer/kafka-backup/blob/master/src/main/java/de/azapps/kafkabackup/sink/BackupSinkTask.java#L63)
  4. As soon as all partitions are up to date, print a message to STDOUT
  5. Use the wrapper script to detect this message and kill Kafka Connect gracefully, similar to how it is solved during restore: https://github.com/itadventurer/kafka-backup/blob/master/bin/restore-standalone.sh#L232-L252
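The steps above could be sketched as plain logic, leaving out the Connect plumbing (names are hypothetical; in kafka-backup the cutoff would have to be wired into BackupSinkTask):

```python
class SnapshotTracker:
    """Per-partition cutoff for a point-in-time backup.

    `end_offsets` maps each partition to its log-end offset (the offset the
    next produced record would get), captured once when the backup starts.
    A partition is complete when consumption reaches that offset; once all
    partitions are complete, a wrapper script can stop Kafka Connect.
    Hypothetical sketch, not kafka-backup's actual classes.
    """

    def __init__(self, end_offsets):
        self.end_offsets = end_offsets
        # Empty partitions are complete from the start.
        self.done = {p: off == 0 for p, off in end_offsets.items()}

    def should_backup(self, partition, offset):
        """True if the record belongs in the snapshot; records at or past
        the saved end offset are ignored and mark the partition done."""
        if offset >= self.end_offsets[partition]:
            self.done[partition] = True
            return False
        if offset == self.end_offsets[partition] - 1:
            self.done[partition] = True  # last record of the snapshot
        return True

    def finished(self):
        return all(self.done.values())

# Partition 0 had two records when the backup started; partition 1 was empty.
tracker = SnapshotTracker({0: 2, 1: 0})
print(tracker.should_backup(0, 0))  # True: part of the snapshot
print(tracker.should_backup(0, 1))  # True: last snapshot record, marks 0 done
print(tracker.should_backup(0, 2))  # False: produced after the backup started
print(tracker.finished())           # True: safe to signal the wrapper script
```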

You see that this is really not that trivial.

My current focus is to improve the test suite and stabilize Kafka Backup for a first release (see https://github.com/itadventurer/kafka-backup/milestone/1). I cannot give you an ETA for this feature. I would be more than happy to review a PR for it (and I am also searching for additional maintainers ;) ). I am happy to help if there are any questions.

@akamensky
Author

You see that this is really not that trivial.

I am more on the operations side of things (setting up and monitoring Kafka clusters, etc.), so I trust you on this part. My point is that, from my side of the work, this is something I (and I am pretty sure many others) do need.

I would be more than happy to review a PR for that (and I am also searching for additional maintainers ;) )

I am not that great with Java/Scala, so I would not be of much help here. If it were Python, C/C++, or at the very least Go, I could help :P

@itadventurer itadventurer added enhancement New feature or request help wanted Extra attention is needed and removed question Further information is requested labels Apr 17, 2020
@FloMko

FloMko commented Jun 16, 2020

Hello!
First of all, I am happy to have found your solution, because I need to back up Kafka topic data.
Second, unfortunately I cannot write anything in Java/Scala, so I have prepared a Python 'wrapper' around your 'backup-standalone.sh' as a full backup solution:
https://gist.github.com/FloMko/7adf2e00cd80fe7cc88bb587cde999ce
It would be nice to see updates on built-in support for point-in-time backups.

@itadventurer
Owner

Hey,
Thank you for your work! As a temporary workaround, I could imagine adding this as an additional script to this repo and replacing it later with a built-in solution. Feel free to add it as a pull request :) (to ./bin/kafka-backup-point-in-time.py or something else ;) )

@akamensky
Author

I am about to publish a completely separate implementation, written in Go, that does not rely on the Connect API. Just FYI. We are already using it in our production environment.

@FloMko

FloMko commented Jun 18, 2020

@akamensky could you share your solution? As long as you have tested it, it should be fine.

@akamensky
Author

akamensky commented Jun 23, 2020

@FloMko we just published it. You can find it (as well as a prebuilt binary) here

@itadventurer
Owner

Thank you @WesselVS for your PR #99! I have just merged it to master. I will do a release with this enhancement and some other fixes soonish.

@akamensky Cool! Great to see some more work regarding Kafka Backups ;)

3 participants