tokenrove/punchy-the-log

This is a demonstration of the fallocate(FALLOC_FL_PUNCH_HOLE, ...) technique for keeping an "infinite scroll" journal of manageable size, as well as a kind of persistent pipe. The code is intentionally simple and avoids many performance optimizations.
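
The core call, for reference, is just a hole punch over the already-consumed prefix of the file; a minimal sketch of mine (error handling omitted, not the repository's actual code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Deallocate the already-consumed prefix of the log.  FALLOC_FL_PUNCH_HOLE
 * must be combined with FALLOC_FL_KEEP_SIZE: the file keeps its logical
 * size, reads of the punched range return zeros, and the blocks are given
 * back to the filesystem. */
static int punch_consumed(int fd, off_t consumed)
{
    if (consumed <= 0)
        return 0;
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     0, consumed);
}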

The log directory contains programs that work on a log where the offset of the next available message is written at the beginning of the file, while pipe contains programs that work on a sort of "persistent pipe" as described by Carlo Alberto Ferrari, where we use SEEK_DATA to find the next message, and trim the logical size with FALLOC_FL_COLLAPSE_RANGE.
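
The collapse side of that looks roughly like this (a sketch of mine, not the repository's code; the offset and length must be multiples of the filesystem block size, and only some filesystems, notably ext4 and XFS, support it):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Remove the first `len` bytes of the file entirely, shrinking its logical
 * size; the remaining data shifts toward offset 0.  `len` must be a
 * multiple of the filesystem block size. */
static int collapse_prefix(int fd, off_t len)
{
    if (len <= 0)
        return 0;
    return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len);
}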

In both cases, each invocation of producer writes a length-prefixed message into the log, and consumer reads one out, trimming the log as it goes. With -f, consumer will consume continuously. For example (requires pv):

$ (yes | while IFS= read -r x; do echo "$x" | ./producer ./loggy.log; done) &
$ ./consumer -f ./loggy.log | pv >/dev/null
$ kill $!

This is extremely Linux-specific. Portability patches would be interesting.

Sparse files are useful for all kinds of things (an LWN comment gives an example of using this for rewinding live TV), and maybe aren't as well known as they should be. Many modern filesystems (ext4, XFS, and Btrfs among them) support hole punching.

Depending on how often data is produced, at EOF you may want the consumer to spin (repeatedly read) or use inotify to get notified of a change. The former will tend to give lower latency, but burns a lot of CPU (maybe yield between reads?); the latter is friendly to other processes but introduces significant latency. In this implementation, we spin a few times and then block.
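
A sketch of that spin-then-block wait, with an inotify fallback (the retry count and the IN_MODIFY mask are my choices, not necessarily what this repository does):

#include <sched.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Called when a read hits EOF.  Spin (yield and let the caller re-read) a
 * bounded number of times for low latency, then block on an inotify
 * IN_MODIFY event for the log file.  A real consumer would keep one
 * inotify instance around and re-check for new data after adding the
 * watch, to avoid a lost wakeup. */
static void wait_for_more(const char *path, int *tries)
{
    if ((*tries)++ < 100) {
        sched_yield();
        return;
    }
    int ifd = inotify_init();
    inotify_add_watch(ifd, path, IN_MODIFY);
    char ev[sizeof(struct inotify_event) + 256];
    read(ifd, ev, sizeof(ev));      /* blocks until the file changes */
    close(ifd);
    *tries = 0;
}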

This consumer trims on every message, but it would be much faster to trim only every so often, provided you have a way to deal with re-reading duplicates after a crash.
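
For instance (my sketch, not this consumer's behaviour), trimming could be gated on a threshold of consumed bytes:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

#define TRIM_THRESHOLD (1 << 20)    /* punch only once 1 MiB has piled up */

static off_t last_trimmed;          /* absolute offset already punched away */

/* Cheap check after each message: most messages cost no extra fallocate()
 * call, and the price of a crash is re-reading whatever was consumed since
 * the last trim, hence the at-least-once caveat above.  `consumed` is the
 * absolute offset of the first unread byte. */
static void maybe_trim(int fd, off_t consumed)
{
    if (consumed - last_trimmed < TRIM_THRESHOLD)
        return;
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
    last_trimmed = consumed;
}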

In the multiple-consumer case, you probably want a separate log per consumer, although you could use some other synchronization mechanism. One I've used before is a separate trim process run from cron, in a case where the data had timestamps and there were known freshness constraints. That doesn't look as cool as this implementation, though.

In the multiple-producer case, you want to make sure you write your whole message in one write call, and if you're really paranoid, keep each message under PIPE_BUF bytes.
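
A producer append along those lines might look like this (my illustration; the 4-byte framing and the O_APPEND open are assumptions, not necessarily how this repository's producer works):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Append one length-prefixed message in a single write().  O_APPEND makes
 * the kernel pick the offset atomically, so concurrent producers get
 * distinct regions; keeping the whole framed message in one write() (and,
 * if paranoid, under PIPE_BUF bytes) keeps it contiguous on disk. */
static int append_message(const char *path, const void *msg, uint32_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    uint8_t buf[sizeof len + len];          /* length prefix, then payload */
    memcpy(buf, &len, sizeof len);
    memcpy(buf + sizeof len, msg, len);
    ssize_t n = write(fd, buf, sizeof buf);
    close(fd);
    return n == (ssize_t)sizeof buf ? 0 : -1;
}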

Particularly if you're okay with at-least-once consuming, you could avoid the offset at the beginning by using lseek(..., SEEK_DATA, ...) in the consumer, and starting the file with a hole. This is the approach the pipe consumer takes.
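
A sketch of that lookup (mine, not the repository's code):

#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* The consumed prefix is a hole, so SEEK_DATA from offset 0 lands on the
 * first byte of the next unread message.  ENXIO means there is no data at
 * or after the offset, i.e. nothing left to consume. */
static off_t next_message_offset(int fd)
{
    off_t off = lseek(fd, 0, SEEK_DATA);
    if (off == (off_t)-1 && errno == ENXIO)
        return -1;
    return off;
}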
