HDF5ExtError in daily update #273

Open
tomkooij opened this issue May 14, 2019 · 6 comments

@tomkooij
Member

We have been experiencing frequent (more than once a week) HDF5 errors that break the daily update:

HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 509, in H5Fopen
    unable to open file
  File "H5Fint.c", line 1400, in H5F__open
    unable to open file
  File "H5Fint.c", line 1615, in H5F_open
    unable to lock the file
  File "H5FD.c", line 1640, in H5FD_lock
    driver lock request failed
  File "H5FDsec2.c", line 941, in H5FD_sec2_lock
    unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'

End of HDF5 error back trace

Unable to open/create file '/databases/frome/2019/5/2019_5_13.h5'

This happens while opening an ESD file: https://github.com/HiSPARC/publicdb/blob/master/publicdb/histograms/esd.py#L121
(it happens for singles, weather and events)

My analysis is that this is an HDF5 1.10 issue (we upgraded some time ago, now that HDF5 1.10 is the default in Anaconda). The problem is in file locking: HDF5 1.10 supports SWMR (single writer, multiple readers), which comes with a special file locking mechanism.

Fix: export HDF5_USE_FILE_LOCKING="FALSE"
I have to figure out how to do that in django jobs.

https://stackoverflow.com/a/51735764/4965175
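One way to apply this inside a Python process such as a Django job, rather than in the shell, might be to set the variable from Python before PyTables is imported. This is a minimal sketch, assuming the HDF5 library picks the value up from the process environment when it opens files; the helper name `disable_hdf5_file_locking` is hypothetical and not part of publicdb:

```python
import os


def disable_hdf5_file_locking(environ=os.environ):
    """Set HDF5_USE_FILE_LOCKING=FALSE in the given environment mapping.

    Hypothetical helper. Call it before `import tables` (or h5py) so the
    HDF5 library sees the value when it opens files.
    """
    environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
    return environ


disable_hdf5_file_locking()
# import tables  # import PyTables only after the variable is set
```

Note that this only disables the lock request; it does not by itself make it safe to read a file that another process is still writing.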

This is not #150 (update fails when a station writes to a raw datastore file while we are reading it)

@tomkooij
Member Author

This turns out to be #150 (reading raw data while the raw datastore file is being simultaneously written into by the writer on frome) after all.

I turned off HDF5 1.10 file locking and the update crashed with:

HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 509, in H5Fopen
    unable to open file
  File "H5Fint.c", line 1400, in H5F__open
    unable to open file
  File "H5Fint.c", line 1700, in H5F_open
    unable to read superblock
  File "H5Fsuper.c", line 623, in H5F__super_read
    truncated file: eof = 3725070164, sblock->base_addr = 0, stored_eof = 3725086548

End of HDF5 error back trace

Unable to open/create file '/databases/frome/2019/5/2019_5_14.h5'

This is most certainly caused by the writer on frome writing while we are trying to open the file for reading.

The daily update starts at 4am local time (servers use local time) which is 2am UTC during summer/DST. I have changed this to 5am local time (3am UTC) to reduce the chance of frome writing to the datastore while the update runs.
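The local-to-UTC arithmetic can be checked with a short sketch. It assumes the servers run on Central European time (UTC+2 in summer, UTC+1 in winter), which the thread does not state explicitly; the function name is illustrative only:

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets for Central European time; not stated in the thread.
CEST = timezone(timedelta(hours=2))  # summer / DST
CET = timezone(timedelta(hours=1))   # winter


def local_hour_to_utc(hour, tz):
    """Return the UTC hour corresponding to a local wall-clock hour."""
    local = datetime(2019, 5, 14, hour, 0, tzinfo=tz)
    return local.astimezone(timezone.utc).hour


print(local_hour_to_utc(4, CEST))  # old start time during DST
print(local_hour_to_utc(5, CEST))  # new start time during DST
print(local_hour_to_utc(5, CET))   # new start time in winter
```

Under these assumptions, a cron job pinned to local time shifts by one hour in UTC when DST ends, so the gap to the writer's activity changes with the season.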

I will investigate if supervisord can start/stop the writer on frome during the update.

@davidfokkema
Member

> I will investigate if supervisord can start/stop the writer on frome during the update.

Clever idea!

@tomkooij
Member Author

It might be as easy as two cronjobs: supervisorctl stop datastore-writer at 3:30am and supervisorctl start datastore-writer at 6:00am.

But I'd like to add some kind of mail/alert for when supervisord fails to (re)start a service. Perhaps superlance: https://serverfault.com/a/244733

@davidfokkema
Member

Or, the update process consists of stopping the writer, running the update process, and finally starting the writer. But then we definitely need alerts.
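The stop / update / start sequence could be sketched as a context manager that always restarts the writer, even when the update fails. This is only an illustration: the names `run_supervisorctl`, `writer_stopped` and `update_with_writer_stopped` are hypothetical, and it assumes `supervisorctl` can be reached from the machine running the update. Where the writer runs on a different host, the `run` callable would have to be replaced by some remote call.

```python
import subprocess
from contextlib import contextmanager


def run_supervisorctl(action, service):
    """Run `supervisorctl <action> <service>`; raise if the command fails."""
    subprocess.run(["supervisorctl", action, service], check=True)


@contextmanager
def writer_stopped(service="datastore-writer", run=run_supervisorctl):
    """Stop the writer for the duration of the block, then restart it."""
    run("stop", service)
    try:
        yield
    finally:
        # Restart even if the update raised, so the writer is not left down.
        run("start", service)


def update_with_writer_stopped(update, run=run_supervisorctl):
    """Call `update()` while the writer is stopped and return its result."""
    with writer_stopped(run=run):
        return update()
```

Because `check=True` raises when `supervisorctl` exits with an error, a failed restart surfaces as an exception in the update job, which an existing error-reporting hook could pick up.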

@tomkooij
Member Author

> Or, the update process consists of stopping the writer, running the update process, and finally starting the writer. But then we definitely need alerts.

Yes, I thought about that... The communication between pique and frome (start/stop writer) puts me off. I don't fancy doing that with xml-rpc (but we can). We do have alerts (sentry.io) on pique/daily update, so we do get an email if the daily update fails.

@tomkooij
Member Author

This has been fixed (temporarily) for a couple of months by turning the writer on frome on/off during the daily update using a cronjob.

TODO: Add to ansible provisioning.

@tomkooij tomkooij self-assigned this Oct 29, 2019

2 participants