
Replica gets flooded with WAL segments without writes #422

Open
pbedat opened this issue Sep 5, 2022 · 21 comments · May be fixed by #573

Comments

@pbedat

pbedat commented Sep 5, 2022

First of all: thank you for making Litestream. It really helped to get our product off the ground with a very simple setup 🙂

Unfortunately I'm starting to see some super weird behavior in Litestream. Disclaimer: our application connects to ~18 DBs, all replicated.
It started when the db client reported "DB Timeout" errors when users tried to log in (or perform any write operation). In a panic, I restarted the app container. The new container tried to restore the databases, but got stuck.
Checking S3 showed me that some WAL directories had thousands of WAL files of only 91 B each, which were created by Litestream every second. I stopped the container, cleaned up all the "empty" WAL files, started the container again, and everything went back to normal.

What I also saw was that DBs that did not receive writes while in this condition did not have those empty WAL files. Only after I tried to write something to those DBs did Litestream start generating the empty WAL files. This also ruled out my speculation about some loop that would constantly write to the DB.

The container had been running for 28 days, so I thought that might be one reason for the behavior, but now it has happened again on a staging container that was created 2 days ago. Right now I have no idea how to reproduce the issue.

My current litestream version is 0.3.8 and I will upgrade to 0.3.9 now.

Is this a known bug? Can I provide you with more information that would help you investigate this?

Here is my litestream config:

access-key-id: XXXXXXXXXXXXXXX
secret-access-key: XXXXXXXXXXXXXXXX

addr: :9090
dbs:
- path: /XXXXXXX.db
  snapshot-interval: 2h
  retention: 2d
  replicas:
    - url: s3://XXXXX

# there are 18 more dbs with the same snapshot options in this config
@benbjohnson
Owner

Checking S3 showed me that some WAL directories had thousands of WAL files of only 91 B each, which were created by Litestream every second.

Is the 91 bytes the compressed or uncompressed size of the WAL files? It can also be useful to run hexdump -C on the uncompressed version of those WAL segments to see what is being updated. Sometimes checking the difference between two contiguous segments can show what changed.

Are you able to replicate this with a single DB in a different environment or only in your application's environment? If you can replicate it on a smaller scale and can share that then it shouldn't be hard to find the issue.

@pbedat
Author

pbedat commented Sep 7, 2022

Is the 91 bytes the compressed or uncompressed size of the WAL files?

The compressed size

It can also be useful to run hexdump -C on the uncompressed version of those WAL segments to see what is being updated. Sometimes checking the difference between two contiguous segments can show what changed.

Thanks for the advice, I will check the diffs. I should have some of those 91 B files in my backups, and I still hope the problem is on our side. On the other hand, I can't explain the flooding that started on untouched dev DBs after the first write while the condition occurred.

Are you able to replicate this with a single DB in a different environment or only in your application's environment? If you can replicate it on a smaller scale and can share that then it shouldn't be hard to find the issue.

Unfortunately this problem "just occurs". I had been running the container that went nuts for 28 days without a problem. The number of DBs and the retention options did not change significantly during the last three months.

@pbedat
Author

pbedat commented Sep 7, 2022

Here is a diff of two subsequent wal frames:

diff <(hexdump -C 00000262_054d07d0.wal) <(hexdump -C 00000262_054d17e8.wal)
c2
< 00000010  db 11 6c 25 89 2f 28 3c  0d 00 00 00 01 0f f9 00  |..l%./(<........|
---
> 00000010  74 87 bc ab 48 7a 9b 3c  0d 00 00 00 01 0f f9 00  |t...Hz.<........|
6c6
< 00001010  00 05 01 03 00 02 6a cb                           |......j.|
---
> 00001010  00 05 01 03 00 02 6a cc                           |......j.|

I have not learned much about the inner structure of WAL frames (yet), but it looks pretty empty. Could it be that the first 32 bytes are just headers?

The frames have an unpacked size of 4096 B
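
For reference: a SQLite WAL frame is a 24-byte big-endian header (page number, commit size, two salt words, two checksum words) followed by one raw database page, so the differing bytes at offset 0x10 above fall inside the frame checksum, and the incrementing byte near the end sits in the page payload (later identified in this thread as the Litestream sequence). A minimal Go sketch for decoding that header, assuming the segment has already been lz4-decompressed and holds raw WAL frames (a hypothetical inspection helper, not part of Litestream):

// walframe.go - print the header fields of the first frame in an uncompressed WAL segment
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	// Argument: an already decompressed segment, e.g. 00000262_054d07d0.wal
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	if len(data) < 24 {
		panic("segment shorter than one frame header")
	}
	// SQLite WAL frame header (24 bytes, big-endian):
	//   bytes  0-3   page number
	//   bytes  4-7   db size in pages after commit (0 for non-commit frames)
	//   bytes  8-15  salt-1, salt-2 (copied from the WAL header)
	//   bytes 16-23  checksum-1, checksum-2
	h := data[:24]
	fmt.Println("page number:          ", binary.BigEndian.Uint32(h[0:4]))
	fmt.Println("db size after commit: ", binary.BigEndian.Uint32(h[4:8]))
	fmt.Println("salt:                 ", binary.BigEndian.Uint32(h[8:12]), binary.BigEndian.Uint32(h[12:16]))
	fmt.Println("checksum:             ", binary.BigEndian.Uint32(h[16:20]), binary.BigEndian.Uint32(h[20:24]))
	fmt.Println("remaining page bytes: ", len(data)-24)
}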

@Bluebugs

Bluebugs commented Oct 24, 2022

The same thing happened to me today. Thousands of 91 B wal.lz4 files in the S3 bucket, causing restores to get stuck forever. Let me know if any additional information would be useful; I have kept a few hundred of them.

@pbedat
Author

pbedat commented Oct 27, 2022

@Bluebugs under what conditions did it happen to you? Since it happened again after a service restart, I believe there might be a correlation. I'm not 100% sure whether Docker Swarm ensures a container is stopped before it provisions a new version of it, so maybe an unclean shutdown or overlapping Litestream instances cause the issue. I'm planning to investigate this tomorrow.

@Bluebugs

@pbedat I am using Kubernetes, and there is definitely a problem with overlapping instances (that is why it gets stuck, as it can pretty much never catch up, I think), but the trigger for emitting all those 91 B files seems to come from before any upgrade and after startup. I am not able to replicate it on demand (it only happened once in about 6 months) :-(

@pbedat
Author

pbedat commented Oct 27, 2022

@Bluebugs when Litestream enters this faulty state (whatever the reason), the flooding seems to start only after the first write to the DB. So it might enter this state on restart, but the flooding might start later, after the upgrade. What was the timeline of your incident? Was the service running for a long time and then suddenly began to flood, or was it related to a service upgrade/restart?

@Bluebugs

It took us time to notice and understand what was happening. We had an upgrade in the 24 hours before we addressed the issue and got the DB back to normal operation. The storm of small files might have started (judging from the math of one file per second) maybe 3 hours after the upgrade, and the DB usually has very low traffic. So it could really have been the first write that triggered the issue.

@Bluebugs

We are getting this problem more frequently on just one of our uses of Litestream. It is one with heavy reads that can happen at the time the service is restarted. Would that explain the problem? I was expecting read load not to have any effect on Litestream, but it seems that was an incorrect assumption. If you think that could be the case, we will figure out a solution that I think would be better anyway, but confirming this possibility would be great.

@hifi
Collaborator

hifi commented Jan 8, 2023

@pbedat @Bluebugs are your applications Java and using xerial/sqlite-jdbc by any chance?

I think I accidentally replicated this while debugging Java-related issues with Litestream; it would be interesting to know whether it's a specific SQLite driver/implementation that causes this.

@pbedat
Author

pbedat commented Jan 8, 2023

@hifi I was able to track down the situation that caused the flooding.

TL;DR: two Litestream instances replicating the same DB file in parallel.

First, some context:

We are using Docker Swarm to host our application. With Docker Swarm you create services that maintain a set of (optionally) replicated tasks (Docker containers). Our app service replication is set to 1 because of SQLite. With this setup we can't host Litestream as a sidecar, because we have to make sure the application and Litestream run on the same Docker Swarm node. Otherwise Litestream would run on server A and the app on server B, and Litestream would not be able to access the volume of the app container. (You can't configure Docker Swarm to force two services to always run on the same node.)
Because of this, we run Litestream directly in the entrypoint of the app container.

The flooding started when a dev decided to restart the app Docker container manually (e.g. in the Portainer GUI). The problem is that Docker does not care whether a container is a service task or not. When a container goes down (even through a restart), Docker Swarm tries to heal the service and launches a new task to maintain the replication of "1". Now two Litestream instances may be running on the same node, both trying to replicate a single database. This leads to the flooding.

I will try to prevent this situation by adding a lockfile next to the DB while Litestream is running (see the sketch below). @benbjohnson I think this issue could be closed, but maybe another "Caveat" under https://litestream.io/tips/#multiple-applications-replicating-into-location-can-corrupt might be helpful.
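
A minimal sketch of what such a lockfile guard could look like in the container entrypoint, assuming a flock-based lock on a local volume (the lock path and all names here are hypothetical; this is not something Litestream provides):

package main

import (
	"log"
	"os"
	"syscall"
)

// acquireLock takes an exclusive, non-blocking flock on a sentinel file next
// to the database. If another process already holds it, this fails instead of
// starting a second replication process.
func acquireLock(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	// LOCK_NB makes the call fail immediately instead of blocking.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	lock, err := acquireLock("/XXXXXXX.db.litestream-lock") // hypothetical path next to the DB
	if err != nil {
		log.Fatal("another replication process seems to be running: ", err)
	}
	defer lock.Close() // the flock is released when the process exits

	// ... start litestream / the app from here ...
}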

@hifi
Collaborator

hifi commented Jan 8, 2023

Ah, I see, I figured out an easy repro.

#!/bin/sh
# gentest.sh: emits a SQL script with a table create, 1000 inserts, and an
# uncommitted deferred read transaction at the end
echo "CREATE TABLE IF NOT EXISTS test (a INT);"
for i in `seq 1 1000`; do echo "INSERT INTO test VALUES($i);"; done
echo "BEGIN;"
echo "SELECT * FROM test LIMIT 1;"

# separately, generate the SQL file and prep the db once:
./gentest.sh > test.sql
sqlite3 test.db < test.sql # prep db once

Start Litestream in another window (this is a downstream patched 0.3.9 so mind the difference in output):

$ ./litestream replicate -config litestream.yml
litestream (development build)
initialized db: /home/hifi/.../test.db
replicating to: name="file" type="file" path="/home/hifi/.../tmp/test.db"
test.db(file): write snapshot af16ec43fd684c26/00000001
test.db(file): snapshot written af16ec43fd684c26/00000001 elapsed=4.91183ms sz=9102
test.db(file): write wal segment af16ec43fd684c26/00000001:00000000
test.db(file): wal segment written af16ec43fd684c26/00000001:00000000 elapsed=1.289839ms sz=4152
... waits here for the insert dump ...

Then run sqlite3 cli without committing the deferred read transaction:

$ sqlite3 test.db
SQLite version 3.38.5 2022-05-06 15:25:27
Enter ".help" for usage hints.
sqlite> .read test.sql
1
sqlite>

Meanwhile in Litestream land:

test.db: checkpoint: mode=PASSIVE (0,120,120)
test.db(file): write wal segment af16ec43fd684c26/00000001:00001038
test.db(file): wal segment written af16ec43fd684c26/00000001:00001038 elapsed=2.450436ms sz=490280
test.db: checkpoint: mode=PASSIVE (0,1006,1005)
test.db(file): write wal segment af16ec43fd684c26/00000001:00078b60
test.db(file): wal segment written af16ec43fd684c26/00000001:00078b60 elapsed=9.160072ms sz=3650320
test.db: checkpoint: mode=PASSIVE (0,1007,1005)
test.db(file): write wal segment af16ec43fd684c26/00000001:003f3e70
test.db(file): wal segment written af16ec43fd684c26/00000001:003f3e70 elapsed=1.205721ms sz=4120
test.db: checkpoint: mode=PASSIVE (0,1008,1005)
test.db(file): write wal segment af16ec43fd684c26/00000001:003f4e88
test.db(file): wal segment written af16ec43fd684c26/00000001:003f4e88 elapsed=1.381763ms sz=4120
test.db: checkpoint: mode=PASSIVE (0,1009,1005)
test.db(file): write wal segment af16ec43fd684c26/00000001:003f5ea0
test.db(file): wal segment written af16ec43fd684c26/00000001:003f5ea0 elapsed=1.058914ms sz=4120
test.db: checkpoint: mode=PASSIVE (0,1010,1005)
test.db(file): write wal segment af16ec43fd684c26/00000001:003f6eb8
test.db(file): wal segment written af16ec43fd684c26/00000001:003f6eb8 elapsed=1.109299ms sz=4120
test.db: checkpoint: mode=PASSIVE (0,1011,1005)
test.db(file): write wal segment af16ec43fd684c26/00000001:003f7ed0
test.db(file): wal segment written af16ec43fd684c26/00000001:003f7ed0 elapsed=1.192757ms sz=4120
test.db: checkpoint: mode=PASSIVE (0,1012,1005)
test.db(file): write wal segment af16ec43fd684c26/00000001:003f8ee8
test.db(file): wal segment written af16ec43fd684c26/00000001:003f8ee8 elapsed=1.48662ms sz=4120
... continues this once per second ...

I've noted that it doesn't get stuck like this if the writes are within a single transaction - only a big WAL update will cause the loop; a 100-line insert doesn't trigger this either, it needs to be big enough (something to do with pages?). Committing the read transaction within the sqlite3 CLI makes Litestream stop this.

@pbedat
Author

pbedat commented Jan 8, 2023

@hifi the "long running read transaction" is actually the mechanism that allows Litestream to do its job :)

https://litestream.io/how-it-works/#the-shadow-wal

That might also explain why two parallel Litestream instances cause the flooding (see my previous comment).
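
To illustrate the point, a small Go sketch (assuming the mattn/go-sqlite3 driver; the demo.db file and table are made up) showing how a connection holding a long-lived read transaction keeps wal_checkpoint from fully catching up, which is essentially what a second Litestream process or a stuck application reader does:

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3" // assumed driver
)

func main() {
	// Throwaway example database in WAL mode.
	db, err := sql.Open("sqlite3", "demo.db?_journal_mode=WAL")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if _, err := db.Exec("CREATE TABLE IF NOT EXISTS t (x INT)"); err != nil {
		panic(err)
	}

	// Connection 1: start a read transaction and keep it open.
	reader, err := db.Begin()
	if err != nil {
		panic(err)
	}
	var n int
	if err := reader.QueryRow("SELECT COUNT(*) FROM t").Scan(&n); err != nil {
		panic(err)
	}

	// Connection 2 (from the pool): write rows, growing the WAL.
	for i := 0; i < 1000; i++ {
		if _, err := db.Exec("INSERT INTO t VALUES (?)", i); err != nil {
			panic(err)
		}
	}

	// A checkpoint can only copy frames up to the reader's snapshot,
	// so checkpointed lags behind log_frames while the reader stays open.
	var busy, logFrames, checkpointed int
	if err := db.QueryRow("PRAGMA wal_checkpoint(PASSIVE)").Scan(&busy, &logFrames, &checkpointed); err != nil {
		panic(err)
	}
	fmt.Println("before rollback:", busy, logFrames, checkpointed)

	reader.Rollback() // release the snapshot; a later checkpoint can now catch up
}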

@hifi
Collaborator

hifi commented Jan 8, 2023

@pbedat Interesting, these could be two different ways to trigger this behavior, but nevertheless there's something funky going on with Litestream as well, and there's an easy repro for it now.

@pbedat
Author

pbedat commented Jan 8, 2023

@hifi Thanks for your repro. It's actually the same trigger: long-running read transactions, either caused by user code or by another Litestream process. This should definitely go into the "Caveats" section of the docs.

@hifi
Collaborator

hifi commented Jan 8, 2023

It doesn't happen if the inserts are within a single transaction, so it's not just long-running read transactions but many consecutive write transactions followed by a single long read. That seems like a bug rather than a caveat.

@hifi
Collaborator

hifi commented Jan 8, 2023

I'll take back the note about the amount of writes; it's purely the size of the WAL.

In current main it would hit this, on 0.3.9 it's slightly different:
https://github.com/benbjohnson/litestream/blob/main/db.go#L941

This will force a passive checkpoint to happen immediately after the write, which is a no-op because there's an open read transaction, but the insert into _litestream_seq will succeed, so no error is thrown since writers don't block readers. This will in fact grow the WAL file as well. Litestream will then happily continue to notify replicas about (shadow?) WAL changes that are still locked behind the read transaction.

I haven't dug deeper yet, but this should be possible to prevent. Ideally Litestream would just stay in a checkpoint loop until it succeeds, but not notify replicas, because there's nothing to replicate yet. That is the correct behavior when a checkpoint is hit during a long read transaction.

I also think that at some point it should give up on passive checkpoints if none succeed, try full checkpoints until one succeeds, and then revert back to passive. This could be configurable, to stay true to the original goal of not interfering with the writer. Technically our downstream patch kind of does this when the WAL file grows too large (it does a truncate checkpoint), but there should be a time/retry-based middle ground that tries a full checkpoint first.

I'll probably write a patch for this if I can at least make the passive loop work correctly.
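
To make the proposal above concrete, here is a rough Go sketch of such an escalation loop (assumptions: the mattn/go-sqlite3 driver, a made-up threshold and interval, and a hypothetical app.db path; this is not Litestream's actual implementation):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3" // assumed driver
)

// checkpointLoop issues a PASSIVE checkpoint once per second; after too many
// consecutive failures to fully catch up, it escalates to a FULL checkpoint
// once and then drops back to PASSIVE.
func checkpointLoop(db *sql.DB) {
	const maxPassiveFailures = 30 // made-up threshold
	failures := 0
	for range time.Tick(time.Second) {
		mode := "PASSIVE"
		if failures >= maxPassiveFailures {
			// FULL blocks via the busy handler until all frames are copied,
			// so it interferes more with the application; use sparingly.
			mode = "FULL"
		}
		// wal_checkpoint returns (busy, log_frames, checkpointed_frames).
		var busy, logFrames, checkpointed int
		err := db.QueryRow("PRAGMA wal_checkpoint(" + mode + ")").Scan(&busy, &logFrames, &checkpointed)
		if err != nil {
			log.Println("checkpoint error:", err)
			continue
		}
		if busy != 0 || checkpointed < logFrames {
			failures++ // a long read transaction is still pinning WAL frames
		} else {
			failures = 0
		}
	}
}

func main() {
	db, err := sql.Open("sqlite3", "app.db") // hypothetical database path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	checkpointLoop(db)
}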

@hifi
Collaborator

hifi commented Jan 8, 2023

The WAL frames aren't actually empty; they just contain the Litestream sequence updates, so once it gets into that loop it keeps updating the sequence once per second, causing those "unnecessary" WAL segments to be written out and replicated.

I think what people are seeing is that Litestream checkpointing gets locked out by persistent read transactions and never gets the chance to flush properly, as @pbedat pointed out. Forcing a full checkpoint eventually is still a possible solution, as not replicating at all is worse than locking out the application.

Edit: There's actually a failsafe at 10k pages that forces a restart checkpoint; it will try that once per second with a one-second busy wait, so it will try its very best to lock out other users. It can't do anything if a read transaction is always open when it goes into the busy wait and never finishes during it. One option would be to make the busy wait longer if the issue persists, so that a read that takes over a second can't block checkpointing forever; it would catch up eventually by waiting longer than a single read transaction takes. I suspect a badly indexed read load could block Litestream forever.

@simonw

simonw commented Mar 1, 2024

We were seeing this bug... but it turns out to have been a problem with our configuration:

{
    "dbs": [
        {
            "path": "/mnt/internal.db",
            "replicas": [
                {
                    "url": "s3://datasette-cloud-backups/team-164/internal"
                }
            ]
        },
        {
            "path": "/mnt/internal.db",
            "replicas": [
                {
                    "url": "s3://datasette-cloud-backups/team-164/internal.db"
                }
            ]
        },
        {
            "path": "/mnt/data.db",
            "replicas": [
                {
                    "url": "s3://datasette-cloud-backups/team-164/data.db"
                }
            ]
        }
    ]
}

We had accidentally set the same internal.db database to be backed up to two locations - one with a .db extension and one without.

@benbjohnson
Owner

I think I may have a fix for this (#573). The _litestream_seq write is only meant to force the existence of a WAL file, but there was one instance of it inserting whether there was a WAL file or not. If anyone can give this a test to double-check the fix, I would appreciate it.
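
To illustrate the described idea in rough Go (hypothetical; the path and callback are stand-ins, and the real change lives in #573):

package main

import (
	"log"
	"os"
)

// ensureWAL sketches the guard described above: the sequence bump only exists
// to force SQLite to create a -wal file, so skip it entirely when that file is
// already present.
func ensureWAL(walPath string, bumpSeq func() error) error {
	if _, err := os.Stat(walPath); err == nil {
		return nil // WAL file already exists; nothing to force
	} else if !os.IsNotExist(err) {
		return err
	}
	return bumpSeq() // e.g. the _litestream_seq update inside Litestream
}

func main() {
	// Usage with a stand-in for the real sequence write.
	if err := ensureWAL("app.db-wal", func() error { return nil }); err != nil {
		log.Fatal(err)
	}
}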

@pbedat
Author

pbedat commented Mar 6, 2024

@benbjohnson it fixes the situation where two Litestream instances replicating the same DB cause the flooding. Very nice!

Here is my very simple repro:

func TestFlood(t *testing.T) {
	defer os.RemoveAll("./tmp/")

	db, err := setupDB()
	if err != nil {
		t.Fatal(err)
	}

	litestreamPath, err := exec.LookPath("litestream")
	//litestreamPath := "/home/pbedat/projects/litestream/.bin/litestream"
	if err != nil {
		t.Fatal("litestream not in PATH: ", err)
	}

        // start two instances of litestream with the same config

	cmd1 := exec.Command(litestreamPath, "replicate", "-config", "litestream.config")
	cmd1.Stdout = os.Stdout
	cmd1.Stderr = os.Stderr

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		log.Print(cmd1.Run())
		wg.Done()
	}()

	cmd2 := exec.Command(litestreamPath, "replicate", "-config", "litestream.config")
	cmd2.Stdout = os.Stdout
	cmd2.Stderr = os.Stderr

	wg.Add(1)
	go func() {
		time.Sleep(200 * time.Millisecond)
		log.Print(cmd2.Run())
		wg.Done()
	}()

        // wait for litestream to initialize
	time.Sleep(1 * time.Second)

        // simulate db write load
	for i := 0; i < 1000; i++ {
		_, err = db.Exec("INSERT INTO foo(id) VALUES(?);", i)
		if err != nil {
			t.Fatal(err)
		}
		time.Sleep(10 * time.Millisecond)
	}

        // wait for litestream to catch up
	time.Sleep(1 * time.Second)

        // count all files in the backup dir, wait some time and count again, to see if litestream is flooding the dir with empty wal files
	count1 := 0
	fs.WalkDir(os.DirFS("./tmp/backup"), ".", func(path string, d fs.DirEntry, err error) error {
		count1++
		return nil
	})

	time.Sleep(3 * time.Second)

	count2 := 0
	fs.WalkDir(os.DirFS("./tmp/backup"), ".", func(path string, d fs.DirEntry, err error) error {
		count2++
		return nil
	})

	cmd1.Process.Signal(syscall.SIGTERM)
	cmd2.Process.Signal(syscall.SIGTERM)
	wg.Wait()

	if count2 > count1 {
		t.Fatal("should not happen", count2, count1)
	}
}

Wouldn't it be safer to let Litestream fail when another instance is already replicating a DB? Or is that hard to detect?
