
[Tracking] Reduce default query database retention time to 3 months #1372

Open
yubiuser opened this issue Jul 3, 2022 · 13 comments
@yubiuser
Member

yubiuser commented Jul 3, 2022

So far we save historical data by default for up to one year (https://docs.pi-hole.net/ftldns/configfile/#maxdbdays). This seems a bit over the top.

@jpgpi250

jpgpi250 commented Jul 14, 2022

This has been discussed several times on the forum. If I remember correctly, DL6ER's argument against implementing this was that shrinking existing 365-day databases would take a long time on less powerful devices (read: Pi Zero).
Personally, I've been using MAXDBDAYS=8 for quite some time now and have never had a problem tracking down an issue with an unexpectedly blocked domain.
My argument for using MAXDBDAYS=8:
Gravity (by default) updates every Sunday (every 7 days), so the set of blocked domains changes all the time. Without some additional data, it's really hard to find out what happened (a domain suddenly being blocked). It's even harder to detect a domain that used to be blocked and is now allowed (letting through tracking, ads, ... that used to be blocked).
Remember that a long time ago I contributed code to gravity.sh to keep the old database available.
I've been using that old database to store the results of some sqlite3 queries, using Grafana to display them.

query to detect new gravity domains:

	exec 3>&1 1>"${importfile}"
	pihole-FTL sqlite3 "${gravitydb}" << EOSQL
	ATTACH '${olddb}' AS olddb;
	SELECT domain FROM gravity WHERE domain NOT IN (SELECT domain FROM olddb.gravity);
EOSQL
	exec 1>&3 3>&-
	sudo sqlite3 "${olddb}" ".mode csv" ".separator '|'" ".import '${importfile}' new"

query to detect removed gravity domains:

	exec 3>&1 1>"${importfile}"
	pihole-FTL sqlite3 "${gravitydb}" << EOSQL
	ATTACH '${olddb}' AS olddb;
	SELECT DISTINCT domain, adlist_id FROM olddb.gravity WHERE domain NOT IN (SELECT domain FROM gravity) ORDER BY domain ASC;
EOSQL
	exec 1>&3 3>&-
	sudo sqlite3 "${olddb}" ".mode csv" ".separator '|'" ".import '${importfile}' removed"
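For readers without a Pi-hole at hand, here is a hypothetical, self-contained rerun of the two ATTACH-based diffs above using Python's stdlib sqlite3 (all paths and domain names are made up for illustration):

```python
import os
import sqlite3
import tempfile

# Build a tiny stand-in for the *old* gravity database (hypothetical data).
tmp = tempfile.mkdtemp()
old_path = os.path.join(tmp, "gravity_old.db")
old = sqlite3.connect(old_path)
old.execute("CREATE TABLE gravity (domain TEXT)")
old.executemany("INSERT INTO gravity VALUES (?)",
                [("ads.example.com",), ("gone.example.com",)])
old.commit()
old.close()

# Current gravity database with one new and one dropped domain.
con = sqlite3.connect(os.path.join(tmp, "gravity.db"))
con.execute("CREATE TABLE gravity (domain TEXT)")
con.executemany("INSERT INTO gravity VALUES (?)",
                [("ads.example.com",), ("new.example.org",)])
con.commit()  # ATTACH is not allowed inside an open transaction

con.execute("ATTACH ? AS olddb", (old_path,))
new = [r[0] for r in con.execute(
    "SELECT domain FROM gravity "
    "WHERE domain NOT IN (SELECT domain FROM olddb.gravity)")]
removed = [r[0] for r in con.execute(
    "SELECT DISTINCT domain FROM olddb.gravity "
    "WHERE domain NOT IN (SELECT domain FROM gravity) ORDER BY domain ASC")]
print(new, removed)  # ['new.example.org'] ['gone.example.com']
```

The same NOT IN anti-join drives both directions of the diff; only which database sits on the left side changes.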

domains that used to be allowed, now blocked:

exec 3>&1 1>"${importfile}"
pihole-FTL sqlite3 "${FTLdb}" << EOSQL
	ATTACH '${olddb}' AS olddb;
	SELECT timestamp, domain, status from queries WHERE 
		timestamp < '${timestamp}'
		AND status IN (2, 3, 12, 13, 14)
		AND domain IN (SELECT domain FROM olddb.new) GROUP BY domain;
EOSQL
exec 1>&3 3>&-
sudo sqlite3 "${olddb}" "DELETE FROM blocked;"
sudo sqlite3 "${olddb}" ".mode csv" ".separator '|'" ".import '${importfile}' blocked"

domains that used to be blocked, now allowed:

exec 3>&1 1>"${importfile}"
pihole-FTL sqlite3 "${FTLdb}" << EOSQL
	ATTACH '${olddb}' AS olddb;
	SELECT timestamp, domain, status from queries WHERE
		timestamp < '${timestamp}'
		AND status in ( 1, 4, 5, 6, 7, 8, 9, 10, 11 )
		AND domain IN (SELECT domain FROM olddb.removed) GROUP BY domain;
EOSQL
exec 1>&3 3>&-
sudo sqlite3 "${olddb}" "DELETE FROM allowed;"
sudo sqlite3 "${olddb}" ".mode csv" ".separator '|'" ".import '${importfile}' allowed"

It would be really great (useful) if these results were available in the Pi-hole menu; you'd be surprised by the results I see after the weekly gravity run (depending on the lists you are using).

I run the script by simply adding a few lines at the end of the gravity script. With my DB size (4576 KB) there is hardly any additional delay (MAXDBDAYS=8), and I store the results of the queries in the old gravity database (the results don't need changing until the next gravity run).

Conclusion: with a few additional tables (old, new, blocked, allowed) populated after gravity has run, it is perfectly possible to analyze unexpected blocks / ads being displayed, without a massive database (MAXDBDAYS=365).

I would suggest adding a whiptail dialog to the installation process, giving the user a retention choice (365 / 90 / 31 / 15 / 8) if pihole-FTL doesn't already contain a MAXDBDAYS setting, warning the user this may take some time...

@jfb-pihole
Sponsor Member

jfb-pihole commented Jul 14, 2022

"Gravity (default) updates every sunday (7 days), so the domains that are being blocked change all the time"

Very few domains change on a weekly basis relative to the total number of blocked domains. If you look at all the domains your Pi-hole has blocked in its recent history (which depends not only on your blocklists but primarily on your client requests), you will find the same domains appearing consistently.

"I would suggest to add a whiptail to the installation process, giving the user a retention choice (365 / 90 / 31 / 15 / 8),"

I suspect that a new user (who has never seen the query log or any other operational aspect of Pi-hole at work) will have no insight on which to base an informed decision. They will either pick randomly or pick the longest (more is better).

We should have a rational default (I would propose 90 days) and offer the user an option to make changes in the settings menu of the web admin GUI.

@jpgpi250

jpgpi250 commented Jul 14, 2022

This week, after the install of v5.11.4 (10 Jul 2022):
new: 18426
removed: 10130

"Very few domains change on a weekly basis relative to the total number of blocked domains."

  • true, not so much compared to the total number of gravity entries, but still significant.
  • I use a large number of lists (firebog) to ensure Pi-hole anomalies are noticed; they are not always noticed by users who use only the default list (or a small number of lists).

"you will find the same domains appear consistently."

  • true, but the domains that pop up in the allowed and blocked tables I created are usually very surprising, as in WTF, "why is this blocked" OR "why isn't this blocked anymore"
  • why is this blocked? -> requires investigation, to see if pages still load normally...
  • why isn't this blocked anymore? -> since pages did load before, it might unlock some feature OR blocking is still useful, thus add it to an RPZ file and/or a locally maintained adlist (both automated on my system).

"I suspect that new user (who has never seen the query log or any other operation aspect of Pi-hole at work), will have no insight which will lead them to an informed decision. They will either randomly pick or pick the longest (more is better).

We should have a rational default (I would propose 90 days) and offer the user an option to make changes in the settings menu of the web admin GUI."

  • whiptail can suggest a recommended value, selected by default
  • the settings menu would trigger the delay (e.g. on a Pi Zero), which might not be expected / desired by the user at that time. The whiptail choice ensures this only happens during the installation / update process.

@jfb-pihole
Sponsor Member

"not so much compared to the total number of gravity entries, but still significant."

These are only significant if you actually query any of the domains that are added.

As a single example from a Pi-hole that serves my desktop Mac only (MAXDBDAYS=90):

Domains in gravity:

sudo sqlite3 /etc/pihole/gravity.db "SELECT COUNT(*) FROM gravity"
530766

The total number of domains blocked by gravity (including CNAMEs) for the duration of the long-term database is trivially small compared to the number of domains in gravity, or compared to the number that may have changed last Sunday morning.

sqlite3 /etc/pihole/pihole-FTL.db "select domain from queries where status in (1,9) group by domain order by domain" | wc -l
650

@jpgpi250

You're missing the point.
I don't care about the number of gravity entries that actually cause a block.
I do care about the new entries that will cause a block, given my browsing habits, and removed entries that will no longer cause a block.

I already have the info that meets my needs; I just wanted to share the method I use to provide additional useful information with a limited DB size (MAXDBDAYS=8) when using Pi-hole, so ...

@DL6ER
Member

DL6ER commented Jul 14, 2022

Simply reducing MAXDBDAYS will cause FTL to delete several thousand - possibly millions - of rows from the database. Even though this is a whole lot, it should be doable even on low-end devices - the key is to not run VACUUM. The most important issue I'm seeing here is that no queries will be written to disk while the deletion is ongoing. All queries are retained in memory until this task is finished. This is safe and nothing will happen unless a user decides to restart FTL during this massive deletion process. Even though an interruption shouldn't cause any harm to the database itself (no corruption), the queries that could not be saved would be lost.

To avoid this, we could use a gradual process that never deletes more than, say, 1% of the database at once¹, like:

DELETE FROM query_storage WHERE id IN (SELECT id FROM query_storage WHERE timestamp <= %i LIMIT (SELECT COUNT(*)/100 FROM query_storage))

instead of the currently used:

DELETE FROM query_storage WHERE timestamp <= %i

if(dbquery(db, "DELETE FROM query_storage WHERE timestamp <= %i", timestamp) != SQLITE_OK)

The database deletion happens at the same interval as garbage collection (every 10 minutes). That means that deleting three quarters of the database will take a little more than 12 hours. The proposed algorithm above is both fast and adaptive, making it suitable for large-scale Pi-hole deployments as well as for the Pi Zero at home with only one client.

Footnotes

  1. We cannot use SQLite3's DELETE ... LIMIT ... here as this is a special extension not included in the regular SQLite3 source code.
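A minimal sketch of how this gradual deletion behaves, run against an in-memory stand-in for query_storage using Python's stdlib sqlite3 (the table layout and row counts here are made up; FTL's actual schema differs):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE query_storage (id INTEGER PRIMARY KEY, timestamp INTEGER)")
# 1,000 queries: the first 750 are "old" (timestamp 0), the rest recent.
con.executemany("INSERT INTO query_storage (timestamp) VALUES (?)",
                [(0 if i < 750 else 100,) for i in range(1000)])
con.commit()

cutoff, passes = 50, 0
while True:
    # Delete at most 1% of the *current* table per pass, as proposed above.
    # (With fewer than 100 rows total the LIMIT computes to 0 and the loop stops.)
    cur = con.execute(
        "DELETE FROM query_storage WHERE id IN "
        "(SELECT id FROM query_storage WHERE timestamp <= ? "
        "LIMIT (SELECT COUNT(*)/100 FROM query_storage))", (cutoff,))
    con.commit()
    if cur.rowcount == 0:  # nothing old left to delete
        break
    passes += 1

remaining = con.execute("SELECT COUNT(*) FROM query_storage").fetchone()[0]
print(remaining, passes)  # remaining is 250; the pass count depends on the shrinking 1% limit
```

Because the limit is 1% of the current (shrinking) row count, each pass deletes slightly fewer rows than the previous one, which is exactly what makes the load adaptive.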

@yubiuser
Member Author

yubiuser commented Jul 14, 2022

Or we don't touch existing installations but only new ones.

@DL6ER DL6ER removed the Bug label Aug 21, 2022
@rpe1313

rpe1313 commented Aug 24, 2022

Just going to throw this in the mix - why base the retention of the data on time? Would it not be better to base the retention on size of the schema table?

Get table size (assuming 48 bytes per row):

 SELECT COUNT(*) * (48) / (1024) / (1024) from query_storage

Read from the config the size you want it kept under, say no bigger than 80 MB (small for the example), and use a loop to delete small chunks at a time, sleep, and continue until the table is at that size. Not only are you not taxing, say, a Pi Zero, but you are also allowing updates to continue during the function (fork it).

I assumed 48 bytes total for each row of the table, just for the example. To disable the function, the user would just set -1 in the config file.

int x = dbquery(db, "SELECT COUNT(*) * 48 / 1024 / 1024 FROM query_storage");
if(config_size >= 0 && x > config_size){
  while(x > config_size){
    dbquery(db, "DELETE FROM query_storage WHERE id IN (SELECT id FROM query_storage ORDER BY id ASC LIMIT (SELECT COUNT(*)/100 FROM query_storage))");
    x = dbquery(db, "SELECT COUNT(*) * 48 / 1024 / 1024 FROM query_storage");
    sleep(120);
  }
}

The query will delete from the top of the table and ignore the remainder, since the index is an auto-increment. You can also remove the subquery limit and just hard-code how many rows to delete on each iteration:

int x = dbquery(db, "SELECT COUNT(*) * 48 / 1024 / 1024 FROM query_storage");
if(config_size >= 0 && x > config_size){
  while(x > config_size){
    dbquery(db, "DELETE FROM query_storage WHERE id IN (SELECT id FROM query_storage ORDER BY id ASC LIMIT 100)");
    x = dbquery(db, "SELECT COUNT(*) * 48 / 1024 / 1024 FROM query_storage");
    sleep(120);
  }
}
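A runnable version of this size-capped idea, sketched with Python's stdlib sqlite3 under the same 48-bytes-per-row assumption (the function names and numbers are invented for the example):

```python
import sqlite3

ROW_BYTES = 48  # assumed average row size, as in the example above

def estimated_mb(con):
    """Estimate table size in MB from the row count (not the real file size)."""
    rows = con.execute("SELECT COUNT(*) FROM query_storage").fetchone()[0]
    return rows * ROW_BYTES / (1024 * 1024)

def enforce_cap(con, cap_mb, chunk=100):
    """Delete the oldest rows (lowest auto-increment id) in small chunks
    until the estimated size is at or below cap_mb; cap_mb < 0 disables."""
    if cap_mb < 0:
        return
    while estimated_mb(con) > cap_mb:
        con.execute("DELETE FROM query_storage WHERE id IN "
                    "(SELECT id FROM query_storage ORDER BY id ASC LIMIT ?)",
                    (chunk,))
        con.commit()
        # a real implementation would sleep here between chunks

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE query_storage (id INTEGER PRIMARY KEY, timestamp INTEGER)")
con.executemany("INSERT INTO query_storage (timestamp) VALUES (?)",
                [(i,) for i in range(50000)])
con.commit()
enforce_cap(con, cap_mb=1.0)  # keep the estimated row data under ~1 MB
left = con.execute("SELECT COUNT(*) FROM query_storage").fetchone()[0]
print(left)
```

Note that this caps the estimated row count, not the actual file size on disk, which (as discussed below in this thread) does not shrink on DELETE.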

@github-actions

This issue is stale because it has been open 30 days with no activity. Please comment or update this issue or it will be closed in 5 days.

@github-actions github-actions bot added the stale label Sep 23, 2022
@yubiuser yubiuser added internal and removed stale labels Sep 23, 2022
@DL6ER
Member

DL6ER commented Nov 6, 2022

Just going to throw this in the mix - why base the retention of the data on time? Would it not be better to base the retention on size of the schema table?

No, because size is not a valid measure with this database - it keeps memory once allocated to it. Say the database grew to 90 MB and you delete all rows - the database will still be 90 MB in file size, as deleted entries merely make room for new records. This matters because the deleted records will typically not be at the end of the file but rather poke holes in the middle. To actually free the space, you have to rewrite the entire database. This does not only involve moving all the rows into the created holes, but also rewriting all the indices (what is stored where), which takes a really long time for typical Pi-hole databases with query counts on the order of millions of entries. Reorganizing such a table takes hours to days (!) on the lowest-end hardware such as the Raspberry Pi Zeros still used for Pi-holes these days. Not only does this cause additional wear on the SD card, any power loss during this time poses a risk for the entire database. Furthermore, as the database stays locked throughout the entire reorganization, you would definitely lose all the history between the start of the operation and any power loss, as FTL would be unable to write any more queries into the database.
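The file-size behaviour described here is easy to verify with a small demonstration against a throw-away database using Python's stdlib sqlite3 (paths and row counts are made up):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE query_storage (id INTEGER PRIMARY KEY, payload TEXT)")
con.executemany("INSERT INTO query_storage (payload) VALUES (?)",
                [("x" * 100,) for _ in range(20000)])
con.commit()
size_full = os.path.getsize(path)

con.execute("DELETE FROM query_storage")
con.commit()
# The file has NOT shrunk: freed pages are only recycled internally.
size_after_delete = os.path.getsize(path)

con.execute("VACUUM")  # rewrites the whole database - the expensive step
size_after_vacuum = os.path.getsize(path)
print(size_full, size_after_delete, size_after_vacuum)
```

The first two sizes come out identical even though every row is gone; only the VACUUM rewrite reclaims the space, which is exactly the operation being avoided on low-end hardware.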

Hence, I think it's much more useful and doable to delete based on a counted quantity: either the number of queries in the database or a certain age after which we start deleting. While the former may provide a simpler measure of how large the file will get, I still believe that defining a time (like "I can compare today's activity with the one from one month ago") is more widely applicable and easier to explain to the average user who doesn't care about the database details. In today's world, where buying an SD card with less than 16 GB is becoming complicated, I don't think we need to worry about databases growing to something like 500 MB (for a full year; maybe 125 MB for a quarter).

@DL6ER
Member

DL6ER commented Nov 6, 2022

Some experiments on a Raspberry Pi 4 (the only one I still have myself):

Make a copy of the database

$ cp /etc/pihole/pihole-FTL.db ~/pihole-FTL_test.db

$ ls -lh ~/pihole-FTL_test.db
-rw-rw-r-- 1 pi pi 690M Nov  6 14:45 /home/pi/pihole-FTL_test.db

What is the oldest timestamp? (1 year ago)

$ date -d @$(pihole-FTL sqlite3 ~/pihole-FTL_test.db "SELECT MIN(timestamp) FROM queries")
Sa 6. Nov 14:40:01 CET 2021

Get current number of queries in the database

$ printf "%'d\n" $(pihole-FTL sqlite3 ~/pihole-FTL_test.db "SELECT COUNT(*) FROM queries")
9.538.985

Delete all queries older than 3 months

$ time pihole-FTL sqlite3 ~/pihole-FTL_test.db "DELETE FROM query_storage WHERE timestamp < $(date --date='-3 month' +%s)"
real	0m14,713s
user	0m9,039s
sys	0m2,594s

What is the oldest timestamp? (now three months ago)

$ date -d @$(pihole-FTL sqlite3 ~/pihole-FTL_test.db "SELECT MIN(timestamp) FROM queries")
Sa 6. Aug 15:48:01 CEST 2022

Get current number of queries in the database

$ printf "%'d\n" $(pihole-FTL sqlite3 ~/pihole-FTL_test.db "SELECT COUNT(*) FROM queries")
2.101.652

$ ls -lh ~/pihole-FTL_test.db
-rw-rw-r-- 1 pi pi 690M Nov  6 14:48 /home/pi/pihole-FTL_test.db

15 seconds is a lot less than I'd have assumed. It may translate to something like a minute on lower-end hardware for a similar database (about 25,000 queries per day).

@DL6ER
Member

DL6ER commented Nov 6, 2022

In comparison (sorry for the many individual posts, but this helps structure the output better):

$ time pihole-FTL sqlite3 ~/pihole-FTL_test.db "DELETE FROM query_storage WHERE id IN (SELECT id FROM query_storage WHERE timestamp <= $(date --date='-3 month' +%s) LIMIT (SELECT COUNT(*)/100 FROM query_storage))"
real	0m1,844s
user	0m0,711s
sys	0m0,307s

takes around 0.8 - 2.5 seconds on every iteration on a freshly copied database. There is clearly some overhead but it seems acceptable. The number of queries goes down as expected (9443158, 9348727, 9255240, ...).

@DL6ER
Member

DL6ER commented Mar 13, 2023
