Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After using ansible redis roles, pmproxy.log filled our log volume #1536

Open
portante opened this issue Jan 31, 2022 · 12 comments
Open

After using ansible redis roles, pmproxy.log filled our log volume #1536

portante opened this issue Jan 31, 2022 · 12 comments

Comments

@portante
Copy link
Contributor

26GB pmproxy.log file on disk.

Filled with lines like:

[Sun Jan 30 16:35:33] pmproxy(2388513) Error: SADD pcp:series:label.X.value:Y
Redis reply error: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
@portante
Copy link
Contributor Author

Why are PCP processes logging to disk in separate files? Why isn't it using syslog logging to land in journald?

@andreasgerstmayr
Copy link
Member

Why are PCP processes logging to disk in separate files? Why isn't it using syslog logging to land in journald?

We've an open ticket for that: #1040 but it didn't get much traction so far.

Redis reply error: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.

Did Redis run out of disk space while snapshotting, could you check the Redis logs?

@portante
Copy link
Contributor Author

We used the redis role to configure it. I does not appear to have any files on disk open, but has created a /var/lib/redis/dump.rdb file.

The errors in the redis logs say:

4216:M 30 Jan 2022 13:00:36.034 * 10000 changes in 60 seconds. Saving...
4216:M 30 Jan 2022 13:00:36.633 * Background saving started by pid 1839362
1839362:C 30 Jan 2022 13:26:34.699 * DB saved on disk
1839362:C 30 Jan 2022 13:26:35.440 * RDB: 528 MB of memory used by copy-on-write
4216:M 30 Jan 2022 13:26:36.430 * Background saving terminated with success
4216:M 30 Jan 2022 13:27:37.050 * 10000 changes in 60 seconds. Saving...
4216:M 30 Jan 2022 13:27:37.681 * Background saving started by pid 2271458
2271458:C 30 Jan 2022 13:53:44.483 # Write error saving DB on disk: No space left on device
4216:M 30 Jan 2022 13:54:07.907 # Background saving error

But the only thing using substantial disk space is the pmproxy.log file.

After expanding the volume, the redis service is back to normal.

@andreasgerstmayr
Copy link
Member

Redis by default is snapshotting its entire database periodically on disk. When you restart Redis, it will load the contents of dump.rdb into memory (see https://redis.io/topics/persistence for more info). There's another related issue open regarding Redis persistence: performancecopilot/ansible-pcp#22 - we'll need to look into tuning the Redis persistence settings.

But the only thing using substantial disk space is the pmproxy.log file.

Looks like Redis stops all write requests if it can't snapshot the database to the disk successfully. Makes sense from a durability point of view, otherwise you'll loose data when Redis gets restarted.

In PCP we log all errors coming from Redis. Not sure what's the best solution in this case - keep track of logged errors and only log each Redis error type once? We have this implemented for Redis reconnect (to not flood the pmproxy log with "trying to connect to redis..." messages), but I'm not 100% convinced that it makes sense to apply this for every single error type. What do you think @natoscott?

@portante
Copy link
Contributor Author

This is why I asked about PCP logging to files. If it logged to journald either vis syslog or stdout/stderr from systemd process creation, then the disk would not have filled up.

@andreasgerstmayr
Copy link
Member

This is why I asked about PCP logging to files. If it logged to journald either vis syslog or stdout/stderr from systemd process creation, then the disk would not have filled up.

Ah, because journald automatically manages log rotation. Yep, I'm also very much in favor of supporting journald.

afaics we have the following options
a) drop the /usr/libexec/pcp/lib/pmproxy shell script, move this logic into the pmproxy binary and start the pmproxy binary directly, logging to stdout
b) call /usr/libexec/pcp/lib/pmproxy in ExecStartPre= and start the pmproxy binary in ExecStart=
c) or change the internal log routines to use (optionally) journald

I think option a) is the cleanest but the most effort, alternatively option b) is also fine imho.
What we need to keep in mind is that PCP is also supported on non-systemd platforms.

@natoscott
Copy link
Member

@andreasgerstmayr I believe this is one of the areas @goodwinos is planning on tackling for PCP v6.

Another option (e.g. non-systemd and/or pre v6) would be to install logrotate(8) files. I don't think we can reliably attempt to throttle logging of all possible Redis issues in pmproxy, and logrotate is a generic solution we could apply to other PCP daemon logs as well.

@portante
Copy link
Contributor Author

portante commented Feb 1, 2022

See this P99 talk on the problem with logging. Until we properly solved that, for system administrators, journald helps the most, as logrotate based solutions can't react fast enough in all cases.

@portante
Copy link
Contributor Author

portante commented Feb 1, 2022

Just to be sure, here is a series of grep commands I used to filter the entire contents of the 26 GB pmproxy.log file before removing it due to size issues:

[root@intlabproxy-002 pmproxy]# grep -vF "not able to persist on disk" pmproxy.log | grep -vF "status mismatch at time" | grep -vF "HMSET:" | grep -vF "SADD pcp:" | grep -vF "HSET: string mapping script" | grep -vF "SADD: mapping" | grep -vF "SADD: map series" | grep -vF "SADD map metric" | grep -vF "GEOADD: mapping"
Log for pmproxy on intlabproxy-002 started Sun Jan 30 00:26:01 2022

pmproxy: PID = 2388513, PDU version = 2, user = pcp (991)
pmproxy request port(s):
  sts fd   port  family address
  === ==== ===== ====== =======
  ok    10       unix   /run/pcp/pmproxy.socket
  ok    11 44322 inet   INADDR_ANY
  ok    12 44322 ipv6   INADDR_ANY
  ok    13 44323 inet   INADDR_ANY
  ok    14 44323 ipv6   INADDR_ANY
[Sun Jan 30 00:26:01] pmproxy(2388513) Info: OpenSSL 1.1.1g FIPS  21 Apr 2020 - no certificates found
[Sun Jan 30 00:26:01] pmproxy(2388513) Info: Redis slots, command keys, schema version setup
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: MISCONF

@natoscott
Copy link
Member

We have this implemented for Redis reconnect (to not flood the pmproxy log with "trying to connect to redis..." messages), but I'm not 100% convinced that it makes sense to apply this for every single error type. What do you think @natoscott?

It's not a bad idea IMO. I can see no gain from continually re-logging the same error repeatedly for continually failing operations - culling or rate limiting to perhaps once per day makes more sense to me.

@natoscott
Copy link
Member

To summarize - it looks like all actions to take here are in PCP itself. Is there any default redis.conf change we should make in the role, or shall I close this one out and we can continue discussions in the PCP issue trackers? Thanks.

@andreasgerstmayr
Copy link
Member

Afaics there is no change to redis.conf required here, the log (spam) is produced by PCP.

shall I close this one out and we can continue discussions in the PCP issue trackers

+1, I'll move this ticket over to the pcp repo.

@andreasgerstmayr andreasgerstmayr transferred this issue from performancecopilot/ansible-pcp Mar 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants