This repository has been archived by the owner on Apr 5, 2024. It is now read-only.

Users reporting 4GB Pi 4 still reboots #47

Open
chrisys opened this issue May 6, 2020 · 10 comments · Fixed by #48

@chrisys
Member

chrisys commented May 6, 2020

Users are still reporting that 4GB Pi 4s reset when working on 4 tasks. Reports say 3 tasks are OK. 2GB devices are now running stably with 1 task.

Reduce the number of Pi 4 allocated tasks to 3 by setting CPU usage percent to 75.
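The arithmetic behind that setting can be sketched as below. This assumes the client turns the CPU usage percent into a whole number of usable cores by rounding down, with a minimum of one; that rounding rule is an assumption here, not something confirmed in this thread, and tasks_for_percent is a hypothetical helper name.

```shell
# Hypothetical helper: how many single-core tasks a CPU usage percent allows.
# Assumption: the result is rounded down and never drops below 1.
tasks_for_percent() {
  cores=$1
  percent=$2
  tasks=$(( cores * percent / 100 ))
  if [ "$tasks" -lt 1 ]; then
    tasks=1
  fi
  echo "$tasks"
}

tasks_for_percent 4 75   # a 4-core Pi 4 at 75 percent
```

Under that assumption, 75 percent of 4 cores yields 3 tasks, which matches the target in this issue.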

@chrisys chrisys self-assigned this May 6, 2020
@chrisys chrisys linked a pull request May 6, 2020 that will close this issue
@ptrm
Contributor

ptrm commented May 6, 2020

Just an observation about reliability versus simultaneous task count: my 4GB Raspberry Pi had been running stably for several days with all 4 cores occupied, but when I started fiddling with overclocking yesterday, it began rebooting by itself. That might just mean the tasks I got on the previous days were less demanding of the CPU, and one device is surely too little to judge. Still, it made me wonder whether underclocking, instead of limiting the core count, would improve stability enough to make the overall task count higher than with the current solution. Another option would be setting a kernel cgroup limit so that only a certain percentage of the CPU is used (available as a compose config field).

@chrisys
Member Author

chrisys commented May 8, 2020

@ptrm that's definitely something we should investigate. If you think about it from a fleet level we have the ability to deploy to thousands of devices and gather metrics on how they perform in order to figure out what the settings are that result in most work units being completed.

@chrisys chrisys reopened this May 8, 2020
@chrisys
Member Author

chrisys commented May 8, 2020

@ptrm on a fleet level we're still seeing a lot of Pi 4 reboots, which do seem to be on 4GB boards, even when limited to 3 tasks.

@ptrm
Contributor

ptrm commented May 8, 2020

There are certainly many more indicators than the ones I write about below, but here is what I did to get my two rpi4s stable under the current load settings (1 core for the 2GB rpi, 3 cores for the 4GB one).

One thing that turned out to be rebooting my devices was undervoltage and underpowering. It's a common problem, especially for the rpi4. Basically, Raspberry Pis, and the rpi4 most of all, require the voltage to be stable and as close to 5V as possible, whereas most general-purpose and even high-current chargers provide ~4.9V or less under zero load, and even less as the current rises (which is OK for charging 3.7V Li-ion/Li-Po batteries).

I came up with this snippet as a helpful tool to paste into the balenaOS shell (balenaOS for the rpi3 seems not to have the vc tools installed):

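# Refresh once per second; paste into the balenaOS host shell.
while true; do
  sleep 1
  clear
  date --iso-8601=s
  # Throttle flags, converted from hex to binary (bc needs uppercase hex digits).
  echo -ne 'vcgencmd get_throttled:\t\t'
  echo "obase=2;ibase=16;$(vcgencmd get_throttled | sed -E 's/^[^=]+=0x//' | tr 'a-f' 'A-F')" | bc
  # Current ARM core clock in Hz.
  echo -ne 'vcgencmd measure_clock arm:\t'; vcgencmd measure_clock arm
done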

If get_throttled outputs anything other than zero, a flag has been raised. In my case it was usually undervoltage, and it correlated with reboots of my device. See the docs under get_throttled. There are separate flags for frequency capping, undervoltage, and excess temperature, each for both the current moment and the past.
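The individual flags can also be decoded by name instead of reading the binary string by eye. This is a minimal sketch, assuming the bit positions given in the Raspberry Pi documentation for get_throttled (bits 0 to 3 for the current state, bits 16 to 19 for events that have occurred). decode_throttled is a hypothetical helper name, and 0x50000 is an illustrative value rather than output captured from a device.

```shell
# Hypothetical helper: print the meaning of each flag set in a
# `vcgencmd get_throttled` value.
decode_throttled() {
  # Accept either "throttled=0x50000" or a bare "0x50000".
  value=$(( ${1#*=} ))
  if [ "$value" -eq 0 ]; then
    echo "no flags set"
    return 0
  fi
  # Each line of the table below is "<bit number> <description>".
  while read -r bit label; do
    if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
      echo "bit $bit: $label"
    fi
  done <<'EOF'
0 under-voltage detected now
1 ARM frequency capped now
2 currently throttled
3 soft temperature limit active now
16 under-voltage has occurred
17 ARM frequency capping has occurred
18 throttling has occurred
19 soft temperature limit has occurred
EOF
}

# Illustrative value; on a device this could be "$(vcgencmd get_throttled)".
decode_throttled "throttled=0x50000"
```

For the illustrative value, the sketch reports bits 16 and 18, meaning the board has seen under-voltage and throttling at some point but neither is active right now.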

image

Here is my properly powered rpi4, for example (overclocked to 1.7GHz). If it had ever been underpowered since the last reboot, the get_throttled value would look something like 01010000000000000000 and the clock value might indicate around 600MHz. In edge cases, my overclocked 4GB rpi4 rebooted without any visible change in the get_throttled output: at 1800MHz, for example, everything looked good, but it would reboot every ~30min. So either something else related to overclocking caused the reboots, or the events above happen too suddenly to show up.

@ptrm
Contributor

ptrm commented May 8, 2020

And fleetwise, it might be good to write something on the project's webpage about using a good (or official) power supply.

Plus, I now remember that after my first deployments to balena, the device-level variable RESIN_HOST_CONFIG_avoid_warnings was set to 1 by default, which hides the warning icons overlaid on top of the screen contents. Those icons might be a helpful indicator, but then I guess few users attach displays to their Pis in this use case.

@ptrm
Contributor

ptrm commented May 9, 2020

image

The fun fact is, I can get my 2GB rpi4 to run at 2.1GHz with one task, but it failed to run at standard clock settings with 2 tasks on the same decent power source :/

@ptrm
Contributor

ptrm commented May 9, 2020

> on a fleet level we're still seeing a lot of Pi 4 reboots, which do seem to be on 4GB boards, even when limited to 3 tasks.

How can reboots be distinguished from the "last online" status, btw? Does the HTTP API provide more options? I have a machine that's said to have been online for 2 hours, but its uptime in balenaOS is 23:19, which indicates no reboots at all :o
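The comparison described here can be sketched as a small check: if the kernel uptime is longer than the time since the device came back online, the reconnect was not caused by a reboot. classify_reconnect is a hypothetical helper name, both inputs are assumed to be in seconds, and the 23:19 uptime is read as 23 hours 19 minutes, which is an interpretation of the figure above.

```shell
# Hypothetical helper: decide whether a device that recently came back online
# actually rebooted, by comparing kernel uptime with the time since it
# reconnected. Both arguments are in seconds.
classify_reconnect() {
  uptime_s=$1
  online_s=$2
  if [ "$uptime_s" -gt "$online_s" ]; then
    echo "no reboot: the device was already up before it reconnected"
  else
    echo "possible reboot: uptime does not predate the reconnect"
  fi
}

# Uptime of 23 h 19 min, online for 2 hours.
classify_reconnect $(( 23 * 3600 + 19 * 60 )) $(( 2 * 3600 ))

# On a device, the uptime in whole seconds could be read with:
#   cut -d. -f1 /proc/uptime
```

With those figures the check reports no reboot, which suggests the "online for 2 hours" status reflects a connectivity drop rather than a restart.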

@chrisys
Member Author

chrisys commented May 12, 2020

@ptrm that's a good point you make and something I hadn't considered. Initially when we were looking at this issue, reboots were definitely occurring and resetting the device uptime as expected. However now I'm looking at a sample of devices from the fleet that have been online for a few minutes, and their uptimes are all measured in days. Perhaps the limitation to 3 tasks had a more substantial effect than I first thought.

We did see a marked jump in output after the fleet was updated on Friday morning: https://www.boincstats.com/stats/14/team/detail/18832/charts

The balenaCloud dashboard does have a per-device diagnostics facility which checks for undercurrent/underpower events (see here), but there's no way to run this on an entire fleet and correlate results at the moment.

@chrisys
Member Author

chrisys commented May 12, 2020

Added issue regarding missing vcgencmd here: balena-os/balena-raspberrypi#485

@ptrm
Contributor

ptrm commented May 12, 2020

Yeah, I noticed it can be checked here as well: https://dashboard.balena-cloud.com/devices/<device id>/diagnostics. It's marked as experimental, and indeed running the whole diagnostics, even on an idle rpi4, is lengthy.

Glad it's opensourced, though, the scripts look very useful.

EDIT: it would be good to be able to run them separately. Also, maybe there's a way to tag a machine from the supervisor level, so that the fleet view shows a flag for having ever been underpowered? (Seeing that tags can have values, I assume even underpower counts might come into play.)

And yeah, the chart looks impressive
