
Poor performance after server running for a few days #1282

Open
Terrance opened this issue Feb 26, 2023 · 13 comments

@Terrance

I'm seeing certain requests taking durations into the minutes to complete. For example, loading my calendars in InfCloud:

Method    Path                                                       Duration  Note
PROPFIND  /                                                          16 ms
REPORT    /user/                                                     11 ms
PROPFIND  /user/                                                     13 ms     Reading server settings
PROPFIND  /user/                                                     4.64 min  Reading calendar / address book list
REPORT    /user/1a518a80-23b6-abd1-cef8-f0e25e6063ef/                2.43 min  Reading events
PUT       /user/1a518a80-23b6-abd1-cef8-f0e25e6063ef/cca...892.ics   1.22 min  Creating an event

I've bumped my nginx proxy timeout to keep the requests working, though it makes things difficult to use with the long delays.
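As a side note, the slow PROPFIND can also be timed directly against Radicale with curl, taking the web client and nginx out of the picture; a minimal sketch, where localhost:5232 and the user/password placeholders are assumptions about the local setup:

# Time the collection-listing PROPFIND directly, bypassing the proxy.
curl -s -o /dev/null \
     -u user:password \
     -X PROPFIND -H 'Depth: 1' \
     -w 'HTTP %{http_code} in %{time_total}s\n' \
     http://localhost:5232/user/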

Performance seems to vary somewhat (e.g. give or take a few minutes for a scan of all collections) from day to day despite no obvious difference in server load. Restarting Radicale helps (e.g. reducing the scan time to about 30 seconds initially) for a day or two, but the slowness creeps back in after that.

I'm not sure if I'm at the point where my calendars are just too hefty to work with as files, or if there's another bottleneck here.

Debug log snippet: radicale-2023-02-26.log (NB. different day to the table)

Collection stats

For context, as I'm not sure what constitutes a "large" collection, here are some stats on the collection root (I'm the only user):

Collection                            Type          # items  Total size
1a518a80-23b6-abd1-cef8-f0e25e6063ef  VCALENDAR       12836  651M
1fc7a94b-a1b1-bb0a-5e9e-c3756f597434  VCALENDAR         921  14M
260ceee4-135a-69a8-d36c-8bfe3937ed2a  VADDRESSBOOK      155  6.0M
29917d2c-60e7-0a32-178b-079c5bb33fe8  VCALENDAR        3060  46M
31f4fd59-7748-b1de-426c-ee1bee515ef4  VCALENDAR        3052  47M
4c899125-a332-8e3d-ce5b-6aabb3db886a  VCALENDAR          25  348K
513bbed6-56f0-cd92-4188-5c3c076745a5  VCALENDAR         167  2.3M
53f15099-79b5-d366-a91f-462bbc0435f5  VCALENDAR        8393  108M
68643525-214e-bc87-dd32-773f40887307  VCALENDAR        2848  70M
6c9d17e5-7012-f6c9-29c3-2d93898f0c0d  VCALENDAR         706  11M
883076b9-7570-78be-a719-bdd417592df5  VCALENDAR        6480  171M
9d9d7ce5-54f0-eae2-ef0f-704b6493b8f4  VCALENDAR           2  56K
a5797b88-7d73-9751-c7e7-2227c5a55f07  VADDRESSBOOK       75  2.3M
b31241e3-f303-9ffa-aca2-03734debda04  VADDRESSBOOK       49  2.6M
c730a9ad-243f-5bf3-1c14-580fc1f623ca  VCALENDAR        1194  20M
cb17d5b1-5693-4d59-4fdf-b8a62e705465  VCALENDAR        2445  57M
eb6cc48b-f577-609f-f929-56b6ddd37792  VCALENDAR        1287  28M

Totals: 43695 items, 1.2GB

Server: Radicale 3.1.8 on Arch Linux
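For anyone wanting to reproduce these numbers, a sketch of how they can be gathered; the storage root path is an assumption for a default Arch setup, so adjust it to your filesystem_folder:

# Item count and size per collection (path is an assumption).
cd /var/lib/radicale/collections/collection-root/user
for d in */; do
    items=$(find "$d" -maxdepth 1 -name '*.ics' -o -name '*.vcf' | wc -l)
    size=$(du -sh "$d" | cut -f1)
    printf '%s %s items %s\n' "$d" "$items" "$size"
done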

@pbiering
Collaborator

can you check your system regarding

  • consumed memory of the process, e.g.

ps -F -C radicale

  • system swap state, e.g.

free

  • current load on the system

cat /proc/loadavg

  • check system behavior during long-running calls

vmstat 1
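A combined sketch of the checks above, sampled once per second while a slow request is reproduced (the log file name is only an example):

# Snapshot load, memory and the radicale process every second; stop with Ctrl-C.
while sleep 1; do
    {
        date
        cat /proc/loadavg
        free
        ps -F -C radicale
        echo
    } >> /tmp/radicale-monitor.log
done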

@Terrance
Author

Initial stats (after a service restart and leaving it for a bit, just DAVx5 doing mobile syncing):

$ cat /proc/loadavg
0.04 0.11 0.96 1/484 4124315

$ free
               total        used        free      shared  buff/cache   available
Mem:         8038544     2372640     3151612      160288     2514292     5199280
Swap:              0           0           0

$ ps -F -C radicale
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
radicale 4123175 2725173  0 82592 28652   1 14:21 pts/16   00:00:03 /usr/bin/python /usr/bin/radicale --debug

After 5 minutes whilst InfCloud is waiting for a scanning PROPFIND request:

$ cat /proc/loadavg
3.07 1.71 1.37 2/490 4124649

$ free
               total        used        free      shared  buff/cache   available
Mem:         8038544     2400064     2933564      160288     2704916     5171860
Swap:              0           0           0

$ ps -F -C radicale
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
radicale 4123175 2725173  1 83505 31808   1 14:21 pts/16   00:00:20 /usr/bin/python /usr/bin/radicale --debug

After 10 minutes with calendars loaded in InfCloud:

$ cat /proc/loadavg
1.18 1.46 1.36 2/485 4124885

$ free
               total        used        free      shared  buff/cache   available
Mem:         8038544     2430644     2498560      160288     3109340     5141324
Swap:              0           0           0

$ ps -F -C radicale
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
radicale 4123175 2725173  3 83709 62012   1 14:21 pts/16   00:00:51 /usr/bin/python /usr/bin/radicale --debug

Stats during those requests: radicale-2023-02-06-vmstat.log

@pbiering
Collaborator

Looks like your system has an I/O and/or CPU issue: the "idle" column drops from the usual ~99 down to 0 while the "wait" column climbs from 0 to nearly 99 -> the process/OS is waiting on I/O.

Potential reasons

  • overloaded underlying virtual infrastructure (in case of a VM)
  • bottleneck in underlying/connected IO
  • too few CPU cores for parallel work

what is the output of

  • virt-what
  • cat /sys/block/*/device/{vendor,model}
  • cat /proc/cpuinfo | grep processor | wc -l

Also monitor using top while a long-running request is in progress.
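Alongside top, per-device I/O can be watched with iostat from the sysstat package (assuming it is installed); the PID lookup below assumes the radicale command line matches "bin/radicale":

# In one terminal: the radicale process itself.
top -d 1 -p "$(pgrep -f 'bin/radicale' | head -n1)"

# In another: per-device stats; %util near 100 with long await times
# during the slow request would point at the disk.
iostat -x 1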

@Terrance
Author

$ cat /sys/block/*/device/{vendor,model}
ATA
WDC WD20EFRX-68E

$ cat /proc/cpuinfo | grep processor | wc -l
2

This is a bare-metal home server, so no virtualisation here. CPU is a 2-core Intel(R) Pentium(R) CPU G4560 @ 3.50GHz, HDD is a 2TB WD Red circa 2012. I should say Radicale's performance has likely been in a bit of a decline on my machine for a while, though I'm not sure when I first noticed.

Since the run above, I've rebooted after some other upgrades, and we're back to ~10 seconds for that scanning PROPFIND. It's possible the reboot just cleared some in-memory cruft, though I hadn't noticed any other software performing poorly beforehand. I'll keep tabs on how the performance goes over time, but I'd welcome any other places to check.

@Terrance
Author

Terrance commented Apr 2, 2023

So we're a couple of reboots later, and it seems that after about 7-10 days of uptime I start seeing noticeably slow or timing-out requests with Radicale.

I've taken down a couple of "heavy" services to try to rule out system performance issues, though the server as a whole still seems fine, and besides some automated backups there isn't really anything else here with any significant disk usage.

From the potential reasons listed before:

  • too few CPU cores for parallel work

I noted it's a 2-core system before -- is there an expected/ballpark figure for the number of cores required (be it fixed, linear in the number of calendars, etc.)? Would reducing max_connections in the config help?
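For reference, max_connections lives in the [server] section of the Radicale config; a sketch of capping it, where the path /etc/radicale/config is an assumption for the Arch package and 2 is only an example value rather than a recommendation:

# Edit the existing [server] section of the Radicale config, e.g.:
#
#   [server]
#   max_connections = 2
#
# then restart the service (unit name is an assumption):
systemctl restart radicale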

@Terrance
Author

Terrance commented Nov 9, 2023

Just wanted to check in again as the performance remains a problem, and is also now blocking my ability to sync on mobile: DAVx5 is consistently timing out during the sync (not sure if they've recently implemented a timeout, or the delays have crept up and passed its tolerance).

I'm at the point where I'm considering alternative servers that don't use plain .ics files on disk, as I assume that's ultimately the bottleneck here, though being able to interact with the storage is very useful and I suspect any database-backed servers will be harder to work with for ad-hoc tasks.

Any answers to the open questions above (re. CPU cores, max_connections, and whether this collection is just considered "too large") would still be appreciated, along with any suggestions for other data I can provide. One other data point, from looking at system disk usage: Radicale clocked up ~22.5GB of disk reads over the course of ~4 days, which seems excessive when the whole collection is a bit over 1GB (only a fraction of which is within DAVx5's sync range).
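One cheap way to attribute those reads to the Radicale process itself is the kernel's per-process I/O accounting; a sketch (run with appropriate privileges, and note the counters reset when the process restarts):

# Bytes actually read from / written to storage by radicale since it started;
# sample twice a few hours apart to get a rate.
pid=$(pgrep -f 'bin/radicale' | head -n1)
grep -E '^(read_bytes|write_bytes)' "/proc/$pid/io"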

@pbiering
Collaborator

Did this problem disappear since the last report?

@Terrance
Author

No change, still seeing problematic disk usage. (I assume I shouldn't expect any changes on Radicale's side? The latest release is still 3.1.8 and the Arch package hasn't changed since last year.)

I have switched from InfCloud to AgenDAV as my web client, which seems to handle the slowness a bit better in general, at the cost of some edits seemingly not going through if they time out.

@pbiering
Collaborator

which seems excessive when the whole collection is a bit over 1GB

Your collection has a size of over 1 GByte on disk? That's huge, and potentially an explanation for the high "wait" value caused by long-running I/O.

For testing purposes, can you insert an SSD and move the collection?

Another way to check the speed would be to temporarily copy the collection to /dev/shm/<subdirectory> and run the test "memory-only".
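A sketch of that memory-only test, with the assumptions that the collections live under /var/lib/radicale/collections and that Radicale accepts its config options as command-line flags (adjust to your setup):

# Copy the collections into tmpfs and run a throwaway Radicale instance
# against the copy on a separate port.
mkdir -p /dev/shm/radicale-test
rsync -a /var/lib/radicale/collections/ /dev/shm/radicale-test/
/usr/bin/radicale --debug \
    --storage-filesystem-folder /dev/shm/radicale-test \
    --server-hosts 127.0.0.1:5233
# Then repeat the slow PROPFIND against port 5233 and compare durations.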

@Terrance
Author

I suspected the collection was on the large side, which is why I put the collection stats in the original issue and queried it a few times. Unfortunately I don't have a spare SSD I can hook up, but I can try running it in-memory for a bit and see how that compares -- I just came across Profile-sync-daemon elsewhere, which might work for this.

@pbiering
Collaborator

What filesystem are you using? If "ext*" and there is a huge number of files in a single directory, this can also be a cause. In that case, try creating an additional partition with "xfs", copy the collection over, and check the behavior again.
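A sketch of that test on a spare block device, where /dev/sdX1 is a placeholder (mkfs wipes the device) and the source path is an assumption:

mkfs.xfs /dev/sdX1                      # WARNING: destroys existing data on the device
mkdir -p /mnt/radicale-xfs
mount /dev/sdX1 /mnt/radicale-xfs
rsync -a /var/lib/radicale/collections/ /mnt/radicale-xfs/collections/
# Point Radicale's [storage] filesystem_folder at /mnt/radicale-xfs/collections and re-test.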

@Terrance
Author

I've currently got the largest calendar symlinked into /dev/shm (with some manual rsyncs to keep it backed up) as an experiment, but so far it doesn't feel like much of a gain.

It is otherwise on ext4, so with the large calendar in the tens of thousands of items and others in the thousands, I'm not sure how much of a hit is taken listing the collection directories (naive ls -l doesn't seem to struggle though). I think repartitioning this server is a bit out of scope for now -- I'll leave it on the temporary partition for a bit and see how it goes.

Would tidying up the cache help? Or perhaps putting all the caches in /dev/shm? Stats for the large calendar:

Path                        Size  Files
.Radicale.cache/history      74M  18325
.Radicale.cache/item         97M  18319
.Radicale.cache/sync-token  617M    294

@pbiering
Collaborator

Hmm, the sync token directory has a large size. You can stop "radicale", remove the ".Radicale.cache" directory completely, recreate it at /dev/shm/collection-root/$COLLECTION/.Radicale.cache, and create a symlink to it in the original collection directory.

Besides that, I would assume that without using Python's profiling capabilities it's not easy to detect what consumes the time.

But you can work with a copy on a different system to dig further. I would start with the 180-second call from the log you posted above, and potentially dig into radicale/app/propfind.py and add some debug log entries, especially before and after any for loop, to narrow down where the time is consumed.
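A sketch of the cache relocation described above, plus a profiled run to narrow down where the time goes; the paths, the chosen collection, and the service name are assumptions, and cProfile's -m form needs Python 3.7+:

# 1) Relocate the per-collection cache to tmpfs (run with appropriate
#    privileges; the UUID is the large calendar from the stats above, and
#    Radicale rebuilds the cache as needed if the tmpfs copy is lost).
systemctl stop radicale
COLLECTION=/var/lib/radicale/collections/collection-root/user/1a518a80-23b6-abd1-cef8-f0e25e6063ef
rm -rf "$COLLECTION/.Radicale.cache"
mkdir -p "/dev/shm/collection-root/$(basename "$COLLECTION")/.Radicale.cache"
ln -s "/dev/shm/collection-root/$(basename "$COLLECTION")/.Radicale.cache" "$COLLECTION/.Radicale.cache"
systemctl start radicale

# 2) Separately (with the service stopped), run the server under cProfile,
#    reproduce one slow request, stop with Ctrl-C (the profile is written on
#    exit), then inspect the hottest call paths:
python -m cProfile -o /tmp/radicale.prof -m radicale --debug
python -c 'import pstats; pstats.Stats("/tmp/radicale.prof").sort_stats("cumulative").print_stats(30)'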
