
[Tracking] Reduce default query database retention time to 3 months #1372

Open
yubiuser opened this issue Jul 3, 2022 · 13 comments
@yubiuser
Member

yubiuser commented Jul 3, 2022

So far we save historical data by default for up to one year (https://docs.pi-hole.net/ftldns/configfile/#maxdbdays). This seems a bit over the top.

@jpgpi250

jpgpi250 commented Jul 14, 2022

This has been discussed several times on the forum. If I remember correctly, DL6ER's argument against implementing this was that shrinking existing 365-day databases would take a long time on less powerful devices (read: Pi Zero).
Personally, I've been using MAXDBDAYS=8 for quite some time now and have never had a problem tracking down an issue with an unexpectedly blocked domain.
My argument for using MAXDBDAYS=8:
Gravity (by default) updates every Sunday (every 7 days), so the set of blocked domains changes all the time. Without some additional data, it's really hard to find out what happened (a domain suddenly being blocked). It's even harder to detect a domain that used to be blocked and is now allowed (letting through tracking, ads, ... that used to be blocked).
Remember that a long time ago I contributed code to gravity.sh to keep the old database available.
I've been using that old database to store the results of some sqlite3 queries, using Grafana to display them.

query to detect new gravity domains:

	exec 3>&1 1>"${importfile}"
	pihole-FTL sqlite3 "${gravitydb}" << EOSQL
	ATTACH '${olddb}' AS olddb;
	SELECT domain FROM gravity WHERE domain NOT IN (SELECT domain FROM olddb.gravity);
EOSQL
	exec 1>&3 3>&-
	sudo sqlite3 "${olddb}" ".mode csv" ".separator '|'" ".import '${importfile}' new"

query to detect removed gravity domains:

	exec 3>&1 1>"${importfile}"
	pihole-FTL sqlite3 "${gravitydb}" << EOSQL
	ATTACH '${olddb}' AS olddb;
	SELECT DISTINCT domain, adlist_id FROM olddb.gravity WHERE domain NOT IN (SELECT domain FROM gravity) ORDER BY domain ASC;
EOSQL
	exec 1>&3 3>&-
	sudo sqlite3 "${olddb}" ".mode csv" ".separator '|'" ".import '${importfile}' removed"
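For readers without a Pi-hole at hand, here is a hypothetical, self-contained rerun of the two ATTACH-based diffs above using Python's stdlib sqlite3 (all paths and domain names are made up for illustration):

```python
import os
import sqlite3
import tempfile

# Build a tiny stand-in for the *old* gravity database (hypothetical data).
tmp = tempfile.mkdtemp()
old_path = os.path.join(tmp, "gravity_old.db")
old = sqlite3.connect(old_path)
old.execute("CREATE TABLE gravity (domain TEXT)")
old.executemany("INSERT INTO gravity VALUES (?)",
                [("ads.example.com",), ("gone.example.com",)])
old.commit()
old.close()

# Current gravity database with one new and one dropped domain.
con = sqlite3.connect(os.path.join(tmp, "gravity.db"))
con.execute("CREATE TABLE gravity (domain TEXT)")
con.executemany("INSERT INTO gravity VALUES (?)",
                [("ads.example.com",), ("new.example.org",)])
con.commit()  # ATTACH is not allowed inside an open transaction

con.execute("ATTACH ? AS olddb", (old_path,))
new = [r[0] for r in con.execute(
    "SELECT domain FROM gravity "
    "WHERE domain NOT IN (SELECT domain FROM olddb.gravity)")]
removed = [r[0] for r in con.execute(
    "SELECT DISTINCT domain FROM olddb.gravity "
    "WHERE domain NOT IN (SELECT domain FROM gravity) ORDER BY domain ASC")]
print(new, removed)  # ['new.example.org'] ['gone.example.com']
```

The same NOT IN anti-join drives both directions of the diff; only which database sits on the left side changes.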

domains that used to be allowed, now blocked:

exec 3>&1 1>"${importfile}"
pihole-FTL sqlite3 "${FTLdb}" << EOSQL
	ATTACH '${olddb}' AS olddb;
	SELECT timestamp, domain, status from queries WHERE 
		timestamp < '${timestamp}'
		AND status IN (2, 3, 12, 13, 14)
		AND domain IN (SELECT domain FROM olddb.new) GROUP BY domain;
EOSQL
exec 1>&3 3>&-
sudo sqlite3 "${olddb}" "DELETE FROM blocked;"
sudo sqlite3 "${olddb}" ".mode csv" ".separator '|'" ".import '${importfile}' blocked"

domains that used to be blocked, now allowed:

exec 3>&1 1>"${importfile}"
pihole-FTL sqlite3 "${FTLdb}" << EOSQL
	ATTACH '${olddb}' AS olddb;
	SELECT timestamp, domain, status from queries WHERE
		timestamp < '${timestamp}'
		AND status in ( 1, 4, 5, 6, 7, 8, 9, 10, 11 )
		AND domain IN (SELECT domain FROM olddb.removed) GROUP BY domain;
EOSQL
exec 1>&3 3>&-
sudo sqlite3 "${olddb}" "DELETE FROM allowed;"
sudo sqlite3 "${olddb}" ".mode csv" ".separator '|'" ".import '${importfile}' allowed"

It would be really great (useful) if these results were available in the Pi-hole menu; you'd be surprised by the results I see after the weekly gravity run (depending on the lists you are using).

I run the script by simply adding a few lines at the end of the gravity script. With my DB size (4576 KB) there is hardly any additional delay (MAXDBDAYS=8), and I store the results of the queries in the old gravity database (the results don't need changing until the next gravity run).

Conclusion: with a few additional tables (old, new, blocked, allowed) populated after gravity has run, it is perfectly possible to analyze unexpected blocks / ads being displayed, without a massive database (MAXDBDAYS=365).

I would suggest adding a whiptail dialog to the installation process, giving the user a retention choice (365 / 90 / 31 / 15 / 8) if pihole-FTL doesn't already contain a MAXDBDAYS setting, warning the user this may take some time...

@jfb-pihole
Sponsor Member

jfb-pihole commented Jul 14, 2022

"Gravity (default) updates every sunday (7 days), so the domains that are being blocked change all the time"

Very few domains change on a weekly basis relative to the total number of blocked domains. If you look at all the domains your Pi-hole has blocked in its recent history (which depends not only on your blocklists but primarily on your client requests), you will find the same domains appearing consistently.

"I would suggest to add a whiptail to the installation process, giving the user a retention choice (365 / 90 / 31 / 15 / 8),"

I suspect that a new user (who has never seen the query log or any other operational aspect of Pi-hole at work) will have no insight on which to base an informed decision. They will either pick randomly or pick the longest (more is better).

We should have a rational default (I would propose 90 days) and offer the user an option to make changes in the settings menu of the web admin GUI.

@jpgpi250

jpgpi250 commented Jul 14, 2022

This week, after the install of v5.11.4 (10 Jul 2022):
new: 18426
removed: 10130

"Very few domains change on a weekly basis relative to the total number of blocked domains."

  • true, not so much compared to the total number of gravity entries, but still significant.
  • I use a large number of lists (firebog) to ensure Pi-hole anomalies are noticed; they are not always noticed by users who use only the default list (or a small number of lists).

"you will find the same domains appear consistently."

  • true, but the domains that pop up in the allowed and blocked tables I created are usually very surprising, as in WTF, "why is this blocked" OR "why isn't this blocked anymore"
  • why is this blocked? -> requires investigation, to see if pages still load normally...
  • why isn't this blocked anymore? -> since pages did load before, it might unlock some feature OR blocking is still useful, thus add it to an RPZ file and/or a locally maintained adlist (both automated on my system).

"I suspect that new user (who has never seen the query log or any other operation aspect of Pi-hole at work), will have no insight which will lead them to an informed decision. They will either randomly pick or pick the longest (more is better).

We should have a rational default (I would propose 90 days) and offer the user an option to make changes in the settings menu of the web admin GUI."

  • whiptail can suggest a recommended value, selected by default
  • the settings menu would trigger the delay (e.g. on a Pi Zero), which might not be expected / desired by the user at that time. The whiptail choice ensures this only happens during the installation / update process.

@jfb-pihole
Sponsor Member

"not so much compared to the total number of gravity entries, but still significant."

These are only significant if you actually query any of the domains that are added.

As a single example from a Pi-hole that serves my desktop Mac only (MAXDBDAYS=90):

Domains in gravity:

sudo sqlite3 /etc/pihole/gravity.db "SELECT COUNT(*) FROM gravity"
530766

The total number of domains blocked by gravity (including CNAMEs) for the duration of the long-term database is trivially small compared to the number of domains in gravity, or compared to the number that may have changed last Sunday morning.

sqlite3 /etc/pihole/pihole-FTL.db "select domain from queries where status in (1,9) group by domain order by domain" | wc -l
650

@jpgpi250

You're missing the point.
I don't care about the number of gravity entries that actually cause a block.
I do care about the new entries that will cause a block, given my browsing habits, and removed entries that will no longer cause a block.

I already have the info that meets my needs; I just wanted to share the method I use to provide additional useful information with a limited DB size (MAXDBDAYS=8) when using Pi-hole, so ...

@DL6ER
Member

DL6ER commented Jul 14, 2022

Simply reducing MAXDBDAYS will cause FTL to delete several thousand - possibly millions - of rows from the database. Even though this is a whole lot, it should be doable even on low-end devices - the key is to not run VACUUM. The most important issue I'm seeing here is that no queries will be written to disk while the deletion is ongoing. All queries are retained in memory until this task is finished. This is safe and nothing will happen unless a user decides to restart FTL during this massive deletion process. Even though an interruption shouldn't cause any harm to the database itself (no corruption), the queries that could not be saved would be lost.

To avoid this, we could use a gradual process that never deletes more than, say, 1% of the database at once¹, like:

DELETE FROM query_storage WHERE id IN (SELECT id FROM query_storage WHERE timestamp <= %i LIMIT (SELECT COUNT(*)/100 FROM query_storage))

instead of the currently used:

DELETE FROM query_storage WHERE timestamp <= %i

if(dbquery(db, "DELETE FROM query_storage WHERE timestamp <= %i", timestamp) != SQLITE_OK)

The database deletion happens at the same interval as garbage collection (every 10 minutes). That means that deleting three quarters of the database will take a little more than 12 hours. The proposed algorithm above is both fast and adaptive, making it suitable for large-scale Pi-hole deployments as well as for the Pi Zero at home with only one client.

Footnotes

  1. We cannot use SQLite3's DELETE ... LIMIT ... here as this is a special extension not included in the regular SQLite3 source code.
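A minimal sketch of how this gradual deletion behaves, run against an in-memory stand-in for query_storage using Python's stdlib sqlite3 (the table layout and row counts here are made up; FTL's actual schema differs):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE query_storage (id INTEGER PRIMARY KEY, timestamp INTEGER)")
# 1,000 queries: the first 750 are "old" (timestamp 0), the rest recent.
con.executemany("INSERT INTO query_storage (timestamp) VALUES (?)",
                [(0 if i < 750 else 100,) for i in range(1000)])
con.commit()

cutoff, passes = 50, 0
while True:
    # Delete at most 1% of the *current* table per pass, as proposed above.
    # (With fewer than 100 rows total the LIMIT computes to 0 and the loop stops.)
    cur = con.execute(
        "DELETE FROM query_storage WHERE id IN "
        "(SELECT id FROM query_storage WHERE timestamp <= ? "
        "LIMIT (SELECT COUNT(*)/100 FROM query_storage))", (cutoff,))
    con.commit()
    if cur.rowcount == 0:  # nothing old left to delete
        break
    passes += 1

remaining = con.execute("SELECT COUNT(*) FROM query_storage").fetchone()[0]
print(remaining, passes)  # remaining is 250; the pass count depends on the shrinking 1% limit
```

Because the limit is 1% of the current (shrinking) row count, each pass deletes slightly fewer rows than the previous one, which is exactly what makes the load adaptive.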

@yubiuser
Member Author

yubiuser commented Jul 14, 2022

Or we don't touch existing installations but only new ones.

@DL6ER DL6ER removed the Bug label Aug 21, 2022
@rpe1313

rpe1313 commented Aug 24, 2022

Just going to throw this in the mix - why base the retention of the data on time? Would it not be better to base the retention on size of the schema table?

Get table size (assuming 48 bytes per row):

 SELECT COUNT(*) * (48) / (1024) / (1024) from query_storage

Read from the config the size you want it kept under, say no bigger than 80 MB (small for the example), and use a loop to delete small chunks at a time, sleep, and continue until the table is at that size. Not only are you not taxing, say, a Pi Zero, but you are also allowing updates to continue during the function (fork it).

I assumed 48 bytes total for each row of the table, just for the example. To disable the function, the user would just set -1 in the config file.

int x = dbquery(db, "SELECT COUNT(*) * 48 / 1024 / 1024 FROM query_storage");
if(config_size >= 0 && x > config_size){
  while(x > config_size){
    dbquery(db, "DELETE FROM query_storage WHERE id IN (SELECT id FROM query_storage ORDER BY id ASC LIMIT (SELECT COUNT(*)/100 FROM query_storage))");
    x = dbquery(db, "SELECT COUNT(*) * 48 / 1024 / 1024 FROM query_storage");
    sleep(120);
  }
}

The query will delete from the top of the table and ignore the remainder, since the index is an auto-increment. You can also remove the subquery limit and just hard-code how many rows to delete on each iteration:

int x = dbquery(db, "SELECT COUNT(*) * 48 / 1024 / 1024 FROM query_storage");
if(config_size >= 0 && x > config_size){
  while(x > config_size){
    dbquery(db, "DELETE FROM query_storage WHERE id IN (SELECT id FROM query_storage ORDER BY id ASC LIMIT 100)");
    x = dbquery(db, "SELECT COUNT(*) * 48 / 1024 / 1024 FROM query_storage");
    sleep(120);
  }
}
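A runnable version of this size-capped idea, sketched with Python's stdlib sqlite3 under the same 48-bytes-per-row assumption (the function names and numbers are invented for the example):

```python
import sqlite3

ROW_BYTES = 48  # assumed average row size, as in the example above

def estimated_mb(con):
    """Estimate table size in MB from the row count (not the real file size)."""
    rows = con.execute("SELECT COUNT(*) FROM query_storage").fetchone()[0]
    return rows * ROW_BYTES / (1024 * 1024)

def enforce_cap(con, cap_mb, chunk=100):
    """Delete the oldest rows (lowest auto-increment id) in small chunks
    until the estimated size is at or below cap_mb; cap_mb < 0 disables."""
    if cap_mb < 0:
        return
    while estimated_mb(con) > cap_mb:
        con.execute("DELETE FROM query_storage WHERE id IN "
                    "(SELECT id FROM query_storage ORDER BY id ASC LIMIT ?)",
                    (chunk,))
        con.commit()
        # a real implementation would sleep here between chunks

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE query_storage (id INTEGER PRIMARY KEY, timestamp INTEGER)")
con.executemany("INSERT INTO query_storage (timestamp) VALUES (?)",
                [(i,) for i in range(50000)])
con.commit()
enforce_cap(con, cap_mb=1.0)  # keep the estimated row data under ~1 MB
left = con.execute("SELECT COUNT(*) FROM query_storage").fetchone()[0]
print(left)
```

Note that this caps the estimated row count, not the actual file size on disk, which (as discussed below in this thread) does not shrink on DELETE.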

@github-actions

This issue is stale because it has been open 30 days with no activity. Please comment or update this issue or it will be closed in 5 days.

@github-actions github-actions bot added the stale label Sep 23, 2022
@yubiuser yubiuser added internal and removed stale labels Sep 23, 2022
@DL6ER
Member

DL6ER commented Nov 6, 2022

Just going to throw this in the mix - why base the retention of the data on time? Would it not be better to base the retention on size of the schema table?

No, because size is not a valid measure with this database - it keeps memory once allocated to it. Say the database grew to 90 MB and you delete all rows - the database will still be 90 MB in file size, as deleted entries merely make room for new records. This matters because the deleted records will typically not be at the end of the file but rather poke holes in the middle. To actually free the space, you have to rewrite the entire database. This does not only involve moving all the rows into the created holes, but also rewriting all the indices (what is stored where), which takes a really long time for typical Pi-hole databases with query counts on the order of millions of entries. Reorganizing such a table takes hours to days (!) on the lowest-end hardware such as the Raspberry Pi Zeros still used for Pi-holes these days. Not only does this cause additional wear on the SD card, any power loss during this time poses a risk for the entire database. Furthermore, as the database stays locked throughout the entire reorganization, you would definitely lose all the history between the start of the operation and any power loss, as FTL would be unable to write any more queries into the database.
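The file-size behaviour described here is easy to verify with a small demonstration against a throw-away database using Python's stdlib sqlite3 (paths and row counts are made up):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE query_storage (id INTEGER PRIMARY KEY, payload TEXT)")
con.executemany("INSERT INTO query_storage (payload) VALUES (?)",
                [("x" * 100,) for _ in range(20000)])
con.commit()
size_full = os.path.getsize(path)

con.execute("DELETE FROM query_storage")
con.commit()
# The file has NOT shrunk: freed pages are only recycled internally.
size_after_delete = os.path.getsize(path)

con.execute("VACUUM")  # rewrites the whole database - the expensive step
size_after_vacuum = os.path.getsize(path)
print(size_full, size_after_delete, size_after_vacuum)
```

The first two sizes come out identical even though every row is gone; only the VACUUM rewrite reclaims the space, which is exactly the operation being avoided on low-end hardware.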

Hence, I think it's much more useful and doable to delete based on a counted quantity: either the number of queries in the database or a certain age after which we start deleting. While the former may provide a simpler measure of how large the file will get, I still believe that defining a time (like "I can compare today's activity with the one from one month ago") is more widely applicable and easier to explain to the average user who doesn't care about the database details. In today's world, where buying an SD card with less than 16 GB is becoming complicated, I don't think we need to worry about databases growing to something like 500 MB (for a full year; maybe 125 MB for a quarter).

@DL6ER
Member

DL6ER commented Nov 6, 2022

Some experiments on a Raspberry Pi 4 (the only one I still have myself):

Make a copy of the database

$ cp /etc/pihole/pihole-FTL.db ~/pihole-FTL_test.db

$ ls -lh ~/pihole-FTL_test.db
-rw-rw-r-- 1 pi pi 690M Nov  6 14:45 /home/pi/pihole-FTL_test.db

What is the oldest timestamp? (1 year ago)

$ date -d @$(pihole-FTL sqlite3 ~/pihole-FTL_test.db "SELECT MIN(timestamp) FROM queries")
Sa 6. Nov 14:40:01 CET 2021

Get current number of queries in the database

$ printf "%'d\n" $(pihole-FTL sqlite3 ~/pihole-FTL_test.db "SELECT COUNT(*) FROM queries")
9.538.985

Delete all queries older than 3 months

$ time pihole-FTL sqlite3 ~/pihole-FTL_test.db "DELETE FROM query_storage WHERE timestamp < $(date --date='-3 month' +%s)"
real	0m14,713s
user	0m9,039s
sys	0m2,594s

What is the oldest timestamp? (now three months ago)

$ date -d @$(pihole-FTL sqlite3 ~/pihole-FTL_test.db "SELECT MIN(timestamp) FROM queries")
Sa 6. Aug 15:48:01 CEST 2022

Get current number of queries in the database

$ printf "%'d\n" $(pihole-FTL sqlite3 ~/pihole-FTL_test.db "SELECT COUNT(*) FROM queries")
2.101.652

$ ls -lh ~/pihole-FTL_test.db
-rw-rw-r-- 1 pi pi 690M Nov  6 14:48 /home/pi/pihole-FTL_test.db

15 seconds is a lot less than I'd have assumed. It may translate to something like a minute on lower-end hardware for a similar database (about 25,000 queries per day).

@DL6ER
Member

DL6ER commented Nov 6, 2022

In comparison (sorry for the many individual posts, but this helps structure the output better):

$ time pihole-FTL sqlite3 ~/pihole-FTL_test.db "DELETE FROM query_storage WHERE id IN (SELECT id FROM query_storage WHERE timestamp <= $(date --date='-3 month' +%s) LIMIT (SELECT COUNT(*)/100 FROM query_storage))"
real	0m1,844s
user	0m0,711s
sys	0m0,307s

takes around 0.8 - 2.5 seconds on every iteration on a freshly copied database. There is clearly some overhead but it seems acceptable. The number of queries goes down as expected (9443158, 9348727, 9255240, ...).

@DL6ER
Member

DL6ER commented Mar 13, 2023
