Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]: Unexplained restart in standalone mode #1017

Open
NetForces opened this issue Apr 7, 2024 · 18 comments
Open

[bug]: Unexplained restart in standalone mode #1017

NetForces opened this issue Apr 7, 2024 · 18 comments
Assignees
Labels
bug Something isn't working

Comments

@NetForces
Copy link

Describe the bug
I am running the latest LinDB on Ubuntu 20.10 as a service with the following unit file

# /etc/systemd/system/lindb.service
# This file is managed by ansible, don't make changes here, last modified on 2024-04-06 16:38:57 by courchea from courchea-x1-gen-10

[Unit]
Description=LinDB daemon
After=network.target

[Service]
ExecStart=/opt/lind standalone run
Restart=on-failure
User=root
Group=root

[Install]
WantedBy=multi-user.target

I see that it restarts multiple time per day:

# journalctl -u lindb.service --since today --no-page | grep restart
Apr 07 00:43:31 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 4.
Apr 07 01:16:02 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 5.
Apr 07 02:50:33 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 6.
Apr 07 03:29:03 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 7.
Apr 07 05:36:02 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 8.
Apr 07 06:45:07 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 9.
Apr 07 07:17:31 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 10.
Apr 07 08:23:07 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 11.
Apr 07 09:30:54 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 12.
Apr 07 10:33:31 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 13.

But the logs don't show much, except that it did restart:

2024-04-07 10:33:00.874	INFO	replica/wal.go:218	[    Replica] [WriteAheadLog]: check write ahead log if expire	{"path": "data/storage/wal/co2click/0/20240407060000/1", "expire": false}
2024-04-07 10:33:00.874	INFO	replica/wal.go:218	[    Replica] [WriteAheadLog]: check write ahead log if expire	{"path": "data/storage/wal/co2click/0/20240406230000/1", "expire": false}
2024-04-07 10:33:00.874	INFO	replica/wal.go:218	[    Replica] [WriteAheadLog]: check write ahead log if expire	{"path": "data/storage/wal/co2click/0/20240407050000/1", "expire": false}
2024-04-07 10:33:00.874	INFO	replica/wal.go:218	[    Replica] [WriteAheadLog]: check write ahead log if expire	{"path": "data/storage/wal/co2click/0/20240406200000/1", "expire": false}
2024-04-07 10:33:00.874	INFO	replica/wal.go:218	[    Replica] [WriteAheadLog]: check write ahead log if expire	{"path": "data/storage/wal/co2click/0/20240407100000/1", "expire": false}
2024-04-07 10:33:00.874	INFO	replica/wal.go:231	[    Replica] [WriteAheadLog]: write ahead log is expire, need destroy it	{"path": "data/storage/wal/co2click/0/20240407090000/1"}
2024-04-07 10:33:31.478	INFO	lind/runner.go:41	[        CMD] [Main]: Lind running as standalone with PID: 2219769 (pprof: false)
2024-04-07 10:33:31.482	INFO	lind/runner.go:49	[        CMD] [Main]: maxprocs: Leaving GOMAXPROCS=[1]: CPU quota undefined
2024-04-07 10:33:32.999	INFO	standalone/runtime.go:215	[ Standalone] [Runtime]: etcd server is ready
2024-04-07 10:33:33.013	INFO	state/etcd.go:73	[ standalone] [ETCD]: new etcd client successfully	{"endpoints": ["http://localhost:2379"]}
2024-04-07 10:33:33.019	INFO	state/etcd.go:73	[ standalone] [ETCD]: new etcd client successfully	{"endpoints": ["http://localhost:2379"]}

To Reproduce (Required)
Steps to reproduce the behavior:

  • Run LinDB as a service
  • Have a GO API client sending data entries

Expected behavior (Required)
The service should not restart

What's your LinDB version? (Required)

# /opt/lind version
LinDB: v0.3.0, BuildDate: 2023-08-29T10:41:33+0000
GOOS="linux"
GOARCH="amd64"
GOVERSION="go1.19.12"

Screenshots
None

Additional context
Pretty much a vanilla install. I simply untared the binaries in /opt installed the unit file, started the service and create a DB from the WebUI.
Then I have a Go client that listens for sensor data and commits it to the LinDB database.

@NetForces NetForces added the bug Something isn't working label Apr 7, 2024
@stone1100
Copy link
Member

  1. Are there any other logs when LinDB Service is restarted?(journalctl)
  2. What is the log before startup in standalone.log?

@NetForces
Copy link
Author

Ok so the last restart happened at 11h30 today:

# journalctl -u lindb.service --since today --no-page | grep restart
Apr 08 02:43:30 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 3.
Apr 08 03:17:59 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 4.
Apr 08 04:56:31 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 5.
Apr 08 05:29:33 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 6.
Apr 08 06:34:38 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 7.
Apr 08 07:39:33 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 8.
Apr 08 08:43:31 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 9.
Apr 08 09:47:28 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 10.
Apr 08 10:20:05 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 11.
Apr 08 11:30:01 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 12.

I don't have a standalone.log file as it's piped in journalctl so here are the logs from there at the time of the last crash:

log.txt

@stone1100
Copy link
Member

panic when writing wal. I'll try to reproduce it.

@NetForces
Copy link
Author

@stone1100 Awesome. Let me know if you need anything or need me to test something.

@stone1100
Copy link
Member

@NetForces If write some data before current time?

@NetForces
Copy link
Author

@stone1100 Yes this happens in my scenario as if a sensor was for some reason not able to send a data point (Wi-Fi not available or turned-off), it will save that data point so that it is transmitted later on with the timestamp in the past.

For this, I use the x method. For example:

		w.AddPoint(context.TODO(), api.NewPoint("co2").
			AddTag("sensorUUID", t.UUID).
			AddField(api.NewLast("co2", float64(t.CO2))).SetTimestamp(sensorDatapointTimestamp))

Where sensorDatapointTimestamp is the timestamp sent by the sensor for that specific data point.

@NetForces
Copy link
Author

NetForces commented Apr 12, 2024

However, I just looked at the logs and this condition happened multiple times so far today and does not match a time when the lindb restarted.

Forget that, looks like my node crashed completely... The multiple restart probably crashed the datastore. I'll start from scratch with a fresh datastore.

@NetForces
Copy link
Author

@stone1100 ok, the timeframe of the restart seems to match some of the points where I write data with a timestamp in the past. I need to add more debug log on the service that received data and writes it to LinDB to make sure and see if I can find a pattern for you.

Give me a couple of days and I should be able to provide that information.

@NetForces
Copy link
Author

@stone1100 Ok I was able to capture many restarts this afternoon and each have a corresponding to write where I use a time in the past. These are indicated by the Offline write to LinDB in the below logs.

The date/time in the first [] is the timestamp I am sending LindDB.

Apr 17 14:34:16 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 1.
====
Apr 17 14:34:12 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 14:34:12.642121687 -0400 EDT m=+2329.410152742][30:C6:F7:25:DD:1C][590][20.29999924][42][55][0.099999979][1.473333597][2.279999733][2.486666441][2.543334007][0][2.590000153][2.590000153][1022.011133][4.15500021][4.15500021][95.47000122]
Apr 17 14:34:13 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 14:34:13.63926905 -0400 EDT m=+2330.407300108][D4:D4:DA:85:16:C0][0][10.71000004][41.66113281][0][0][0][0][6.355772495][8.958575249][0][10.80299091][11.72089672][1015.835449][0][0][0]
Apr 17 14:34:14 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 13:34:12 -0400 EDT][E8:9F:6D:08:EB:60][741][21.5][32][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 14:34:15 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 14:34:15.113259358 -0400 EDT m=+2331.881290430][4C:75:25:97:92:98][420][19.10000038][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 14:34:15 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 14:34:15.19865659 -0400 EDT m=+2331.966687645][64:B7:08:B8:EA:D4][404][15.19999981][42][0][0][0][0][0.488716066][0.536054492][0][0.551926672][0.559826255][0][4.079999924][4.079999924][0]
Apr 17 14:34:16 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 14:34:16.873207314 -0400 EDT m=+2333.641238385][94:B9:7E:AC:EB:1C][876][23.70000076][36][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 15:44:29 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 2.
=====
Apr 17 15:44:24 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 15:44:24.560412339 -0400 EDT m=+6541.328443402][64:B7:08:B8:EB:50][915][26.5][38][0][0][0][0][0.63328445][0.669674575][0][0.669674754][0.669674575][0][4.230000019][4.230000019][0]
Apr 17 15:44:24 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 15:44:24.89369252 -0400 EDT m=+6541.661723592][E8:9F:6D:08:EB:4C][682][23.60000038][46][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 15:44:26 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 14:44:41 -0400 EDT][E8:9F:6D:08:E9:EC][624][19.7][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 15:44:28 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 15:44:28.735606916 -0400 EDT m=+6545.503637985][30:C6:F7:25:DA:28][839][13.60000038][48][77][0][0][0][0.61727643][0.652746916][0][0.652746916][0.652747393][1021.144358][4.143000126][4.143000126][94.26000214]
Apr 17 15:44:29 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 15:44:29.44057616 -0400 EDT m=+6546.208607227][D4:D4:DA:85:5A:70][758][21.79999924][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 17:36:02 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 3.
====
Apr 17 17:35:59 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 17:35:59.95144682 -0400 EDT m=+13236.719477893][94:B9:7E:93:0F:04][415][20.10000038][32][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 17:36:00 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 16:35:25 -0400 EDT][4C:75:25:97:9C:48][459][12.3][50][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 17:36:01 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 17:36:01.342312533 -0400 EDT m=+13238.110343601][4C:75:25:BF:A6:A8][815][19.79999924][35][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 17:36:01 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 17:36:01.83428109 -0400 EDT m=+13238.602312154][E8:9F:6D:08:EC:30][730][21][41][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 17:36:02 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 17:36:02.429532975 -0400 EDT m=+13239.197564048][30:C6:F7:25:CD:94][652][16.10000038][44][69][0][0][0][0.193659842][0.298699349][0][0.376466572][0.414518327][1007.513972][4.162000179][4.162000179][96.23999786]
Apr 17 17:36:02 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 17:36:02.635256986 -0400 EDT m=+13239.403288052][E8:9F:6D:08:ED:BC][540][18.10000038][43][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:16:48 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 4.
=====
Apr 17 18:16:43 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 18:16:43.541750399 -0400 EDT m=+15680.309781456][D4:D4:DA:85:5B:04][601][21.20000076][43][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:16:44 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 18:16:44.079132548 -0400 EDT m=+15680.847163599][D4:D4:DA:84:B8:80][0][11.79000092][57.54980469][0][0][0][0][0.556801617][0.588796854][0][0.588797033][0.588797033][977.7913818][0][0][0]
Apr 17 18:16:44 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 18:16:44.990709885 -0400 EDT m=+15681.758740952][E8:9F:6D:08:EC:A4][434][18.60000038][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:16:45 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 18:16:45.236114492 -0400 EDT m=+15682.004145547][64:B7:08:B8:EA:D4][408][15.80000019][44][0][0][0][0][0.862787604][0.912365556][0][0.912365556][0.912366033][0][4.079999924][4.079999924][0]
Apr 17 18:16:45 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:16:00 -0400 EDT][E8:9F:6D:08:F1:14][581][19.3][40][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:16:48 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 18:16:48.324920279 -0400 EDT m=+15685.092951351][30:C6:F7:25:DE:78][481][17.79999924][36][95][0][0][0][0.211804718][0.223975584][0][0.223975584][0.223975763][923.7051348][4.06400013][4.06400013][86.44999695]
Apr 17 19:19:58 environment-monitor-01 systemd[1]: lindb.service: Scheduled restart job, restart counter is at 5.
======
Apr 17 19:19:55 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 19:19:55.676444916 -0400 EDT m=+19472.444475988][E8:9F:6D:08:E9:EC][415][19.20000076][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 19:19:55 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 19:19:55.836940428 -0400 EDT m=+19472.604971488][E8:9F:6D:08:EA:C0][1043][20.29999924][35][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 19:19:56 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 18:00:44 -0400 EDT][4C:75:25:AD:A9:44][556][20.8][39][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 19:19:56 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 19:19:56.762437775 -0400 EDT m=+19473.530468829][08:B6:1F:88:64:30][814][20.79999924][44][165][0][0][0][2.357426167][2.492889881][0][2.492888212][2.492892027][1093.854956][4.133999825][4.133999825][93.37999725]
Apr 17 19:19:57 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 19:19:57.802805337 -0400 EDT m=+19474.570836404][D4:D4:DA:85:5D:E4][719][20.89999962][39][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 19:19:58 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 19:19:58.446656268 -0400 EDT m=+19475.214687324][E8:9F:6D:08:EB:88][1145][21.10000038][38][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 19:19:58 environment-monitor-01 api[2738367]: Live write to LinDB: [2024-04-17 19:19:58.492952602 -0400 EDT m=+19475.260983699][94:B9:7E:93:0B:30][542][19.70000076][31][0][0][0][0][0][0][0][0][0][0][0][0][0]

So for a test I will right now disable these Offline write to LinDB writes in my API code and see if the restart stops. If they do that we know they are the cause. But what is strange is that not all the Offline write to LinDB cause a LinDB restart. For example between the last 2 restart many of these were done:

...
...
Apr 17 18:19:02 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:17:06 -0400 EDT][4C:75:25:AD:A9:44][894][21.5][49][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:20:56 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 18:20:12 -0400 EDT][E8:9F:6D:08:EA:80][787][23.4][41][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:30:46 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 18:30:47 -0400 EDT][4C:75:25:97:96:CC][415][20.8][34][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:31:53 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:31:12 -0400 EDT][E8:9F:6D:08:F1:14][639][19.4][40][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:33:33 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:17:06 -0400 EDT][4C:75:25:AD:A9:44][894][21.5][49][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:34:01 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:30:55 -0400 EDT][4C:75:25:AD:A9:44][907][21.4][49][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:34:30 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:34:42 -0400 EDT][E8:9F:6D:08:E9:EC][458][19.2][34][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:35:21 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:17:06 -0400 EDT][4C:75:25:AD:A9:44][894][21.5][49][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:35:52 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:30:55 -0400 EDT][4C:75:25:AD:A9:44][907][21.4][49][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:38:58 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:37:42 -0400 EDT][E8:9F:6D:08:E9:EC][452][19.3][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:38:58 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:38:43 -0400 EDT][E8:9F:6D:08:E9:EC][451][19.2][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:41:59 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:41:42 -0400 EDT][E8:9F:6D:08:E9:EC][449][19.3][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:43:28 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:43:42 -0400 EDT][E8:9F:6D:08:E9:EC][447][19.4][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 18:48:28 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 17:48:42 -0400 EDT][E8:9F:6D:08:E9:EC][443][19.5][33][0][0][0][0][0][0][0][0][0][0][0][0][0]
Apr 17 19:05:21 environment-monitor-01 api[2738367]: Offline write to LinDB: [2024-04-17 18:00:44 -0400 EDT][4C:75:25:AD:A9:44][556][20.8][39][0][0][0][0][0][0][0][0][0][0][0][0][0]
...
...

Anyway I'll let you know if the LinDB process survives the night without these.

@stone1100
Copy link
Member

I know the reason.

  1. Broker does not filter data that out of acceptable time range(config when create database) when writing, and the connection(broker->storage) is active.
  2. Storage caches the wal log page in this connection, but it will do wal expiration operation based on acceptable time range. After wal expired, the related wal log pages will be closed, but the wal log page still exists in this connection.
  3. If broker writes data that out of acceptable time range, server will panic.

It will be fixed next version this week.

Thank you for your report and test.

@NetForces
Copy link
Author

Oh, so I could increase this when creating the database as a possible work-around ?

image

@stone1100
Copy link
Member

Oh, so I could increase this when creating the database as a possible work-around ?

image

Yes

@NetForces
Copy link
Author

Ok, I recreated the database for a 30d behead and 30d ahead. Will see how that holds.

@NetForces
Copy link
Author

So with the 30d behead and 30d ahead the service did not restart, but this morning my node was showing as dead:

image

Do you think it is related ? If not, I'll open a different ticket.

@NetForces
Copy link
Author

I created a separate issue: #1019

@stone1100
Copy link
Member

fixed(v0.3.1)

@NetForces
Copy link
Author

Awesome. I have just updated my system. Will let you know how it goes.

@NetForces
Copy link
Author

FYI service did not restart through the night which is a good thing. However #1019

@stone1100 stone1100 self-assigned this Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants