HDF5 Files are missing xml info #141

Open
1 task done
lsawade opened this issue Jul 6, 2023 · 8 comments
Labels
bug-unconfirmed Something isn't working - not yet confirmed

Comments

@lsawade
Collaborator

lsawade commented Jul 6, 2023

Avoid duplicates

  • I searched existing issues

Bug Summary

Frustratingly, something went wrong while downloading the data: the response files weren't stored in the HDF5 files. Only about 1/4 of the files have responses, which makes the error all the more absurd. I'm also thinking about a quick fix, because the StationXML files have been downloaded to the station directory, just not added to the HDF5 files. Grrr.

Code to Reproduce

r = Request(... format='hdf5')
r.download()
r.preprocess()

Error Traceback

ERROR:pyglimer.request.preprocess:Could not find station inventory for Station XO.KSN4
Traceback (most recent call last):
  File "/scratch/gpfs/lsawade/PyGLImER/src/pyglimer/waveform/preprocessh5.py", line 197, in _preprocessh5_single
    inv = rdb.get_response(net, stat)
  File "/scratch/gpfs/lsawade/PyGLImER/src/pyglimer/database/raw.py", line 234, in get_response
    with BytesIO(np.array(self[path], dtype=np.dtype('byte'))) as b:
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/lsawade/.conda/envs/PyGLImER/lib/python3.10/site-packages/h5py/_hl/group.py", line 328, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (component not found)'

PyGLImER Version?


Operating System?

linux

Python Version?

3.10.6

Installation Method?

developer installation / from source / git checkout

@lsawade lsawade added the bug-unconfirmed Something isn't working - not yet confirmed label Jul 6, 2023
@lsawade
Collaborator Author

lsawade commented Jul 6, 2023

Alright, after inspection, I see that save_statxml is explicitly set to False in the download_waveforms() function, but in the preprocess function there is nothing to catch this and load the response from the stations directory. Is this a feature or an oversight?

mseed_to_hdf5(rawloc, False)
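
If it is an oversight, one possible catch in the preprocessing step could look like the sketch below. It is only an illustration: the function name `load_response_with_fallback`, the plain mapping standing in for the HDF5 file, and the `<net>.<stat>.xml` file naming are all assumptions, not PyGLImER's actual API or directory layout.

```python
from pathlib import Path
from typing import Mapping, Optional


def load_response_with_fallback(
    h5_store: Mapping[str, bytes], statloc: Path, net: str, stat: str
) -> Optional[bytes]:
    """Return raw StationXML bytes for net.stat, or None if unavailable.

    h5_store is a plain mapping standing in for the HDF5 file (assumption).
    The file name pattern '<net>.<stat>.xml' is also an assumption.
    """
    key = f"{net}.{stat}"
    try:
        # Preferred source: the response stored alongside the waveforms.
        return h5_store[key]
    except KeyError:
        # Fallback: the StationXML that was downloaded to the station dir.
        xml_file = Path(statloc) / f"{key}.xml"
        if xml_file.is_file():
            return xml_file.read_bytes()
        return None
```

With a fallback like this, a station whose response never made it into the HDF5 file would still be preprocessed as long as its StationXML exists on disk.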

@lsawade
Collaborator Author

lsawade commented Jul 6, 2023

As a quick fix, I made a script:

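#!/usr/bin/env python3
# Usage: <script> <rawloc> <statloc>
import sys

from pyglimer.database.raw import statxml_to_hdf5

# Add the already-downloaded StationXML files to the existing HDF5 files.
statxml_to_hdf5(sys.argv[1], sys.argv[2])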

I have a feeling, however, that the inventory is supposed to be written earlier. I will continue to investigate.

@lsawade
Collaborator Author

lsawade commented Jul 6, 2023

I believe I have found the issue: the code simply doesn't take all stations into account at the end of the event loop.

When mseed_to_hdf5() is called here

mseed_to_hdf5(rawloc, False)

it removes a large chunk of the mseeds without ever adding the stationxml. At the end of the event loop, when

mseed_to_hdf5(rawloc, save_statxml=True, statloc=statloc)

is called, only a fraction of the available stations actually have mseeds left. This means that the recursive function mseed_to_hdf5(), which depends on the availability of mseeds, will only add stationxml to the station files that actually have mseeds remaining.

I think a fix would be to simply run

statxml_to_hdf5(rawloc, statloc)

after mseed_to_hdf5(), and it should fix the problem. Compared to the time needed for downloading, the time it takes to add the stationxml to the HDF5 files is peanuts.
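
To make the mechanism concrete, here is a toy reproduction. It does not call PyGLImER: `mseed_to_hdf5_toy` and `statxml_to_hdf5_toy` are hypothetical stand-ins that use dicts instead of files, and the station codes other than XO.KSN4 are made up for illustration.

```python
def mseed_to_hdf5_toy(mseeds, h5, statxml=None):
    """Toy stand-in for the conversion step (not the real function).

    Moves every pending mseed into the per-station entry in h5. A response
    is only attached to stations that still have pending mseeds.
    """
    for station in list(mseeds):
        entry = h5.setdefault(station, {"waveforms": 0, "response": False})
        entry["waveforms"] += mseeds.pop(station)
        if statxml is not None and station in statxml:
            entry["response"] = True


def statxml_to_hdf5_toy(h5, statxml):
    """Toy stand-in for the separate pass: driven by the StationXML files."""
    for station in statxml:
        if station in h5:
            h5[station]["response"] = True


statxml = {"XO.KSN1", "XO.KSN2", "XO.KSN3", "XO.KSN4"}
h5 = {}

# Intermediate call inside the event loop: mseeds are consumed, no response.
mseed_to_hdf5_toy({"XO.KSN1": 5, "XO.KSN2": 3, "XO.KSN4": 7}, h5)
# Final call: only XO.KSN3 still has mseeds, so only it gets a response.
mseed_to_hdf5_toy({"XO.KSN3": 2}, h5, statxml=statxml)
missing_before = sorted(s for s, e in h5.items() if not e["response"])

# Proposed fix: one extra pass that does not depend on leftover mseeds.
statxml_to_hdf5_toy(h5, statxml)
missing_after = sorted(s for s, e in h5.items() if not e["response"])

print(missing_before)  # ['XO.KSN1', 'XO.KSN2', 'XO.KSN4']
print(missing_after)   # []
```

In the toy run, three of the four stations end up without a response until the extra pass is applied, which mirrors the roughly 1/4 coverage seen in the real files.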

Please check this, @PeterMakus. If you agree, I'll push a fix and merge, and then I should be able to get on with CCP stacking.

@PeterMakus
Collaborator

Hi @lsawade ,

I guess we could do that. I am wondering though whether it will fix the problem. As I see it, all response information is downloaded and added to the hdf5 files here:

mseed_to_hdf5(rawloc, save_statxml=True, statloc=statloc)

after executing

download_full_inventory(statloc, clients)

The reason for the latter line was to make sure that really all response information is added and not just a subset. For example, if station XY was active from 1970-2010, but the sensor was changed in 1990, it might download information for only one sensor if we only requested data from e.g. 1995 onwards.
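
The sensor-change example can be sketched as follows. This is a toy model, not the FDSN or ObsPy API: `ResponseEpoch`, `select_epochs`, and the sensor names are hypothetical, and years stand in for full timestamps.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResponseEpoch:
    sensor: str
    start: int  # year the sensor was installed
    end: int    # year the sensor was removed


def select_epochs(
    epochs: List[ResponseEpoch],
    starttime: Optional[int] = None,
    endtime: Optional[int] = None,
) -> List[ResponseEpoch]:
    """Keep the epochs overlapping [starttime, endtime]; None is unbounded."""
    return [
        e for e in epochs
        if (starttime is None or e.end > starttime)
        and (endtime is None or e.start < endtime)
    ]


# Hypothetical station XY: active 1970-2010, sensor swapped in 1990.
station_xy = [
    ResponseEpoch("sensor A", 1970, 1990),
    ResponseEpoch("sensor B", 1990, 2010),
]

windowed = select_epochs(station_xy, starttime=1995)
full = select_epochs(station_xy)
print([e.sensor for e in windowed])  # ['sensor B']
print([e.sensor for e in full])      # ['sensor A', 'sensor B']
```

A request restricted to 1995 onwards only returns the second sensor's response, whereas the unrestricted request returns both, which is why the full inventory is downloaded.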

Have you tried changing this line and checked whether it actually fixes the problem?

@lsawade
Collaborator Author

lsawade commented Jul 7, 2023

Hi,

The full inventory is definitely downloaded and all the stationxmls are definitely there, but they are not added to the HDF5 files because this problem remains:

"only a fraction of the available stations actually have mseeds left. This means that the recursive function mseed_to_hdf5(), which depends on the availability of mseeds, will only add stationxml to the station files that actually have mseeds remaining."

mseed_to_hdf5() depends on the availability of mseeds and not on the availability of stationxml. If the last request does not have mseeds for certain stations, those stations won't get a stationxml at the end.


Question: do you always download the entire station.xml, no matter what?

@PeterMakus
Collaborator

Yes, the entire station.xml is always downloaded.

I guess the safest bet would be to alter mseed_to_hdf5 so that if save_statxml=True it will also check the availability of stationxmls and not only mseeds. What do you think?
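
A minimal sketch of that check, assuming a flat file layout: the helper name `stations_to_update` and the file naming (`<net>.<stat>.<anything>.mseed` in rawloc, `<net>.<stat>.xml` in statloc) are hypothetical and may not match how PyGLImER actually organises its directories.

```python
from pathlib import Path
from typing import Set


def stations_to_update(rawloc: Path, statloc: Path) -> Set[str]:
    """Stations that need an HDF5 update: leftover mseeds OR a StationXML.

    Assumed (hypothetical) layout: '<net>.<stat>.<anything>.mseed' files in
    rawloc and '<net>.<stat>.xml' files in statloc.
    """
    with_mseed = {
        ".".join(p.name.split(".")[:2]) for p in Path(rawloc).glob("*.mseed")
    }
    with_xml = {p.stem for p in Path(statloc).glob("*.xml")}
    # Iterating over the union means a station with no mseeds left in the
    # last chunk still gets its response written.
    return with_mseed | with_xml
```

Driving the loop from the union rather than from the mseeds alone is the design point: the set of stations to touch no longer shrinks as waveform files are consumed.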

@PeterMakus
Collaborator

OK,
it turns out the only thing we have to do is add an overwrite flag to RawDatabase.add_response, because otherwise the HDF5 files might contain outdated response files.
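
The intended behaviour of such a flag might look like the sketch below. It is a toy with a dict as the store, not the real RawDatabase.add_response; the signature, the return value, and the placeholder byte strings are assumptions made for illustration.

```python
from typing import MutableMapping


def add_response(
    store: MutableMapping[str, bytes],
    key: str,
    xml: bytes,
    overwrite: bool = False,
) -> bool:
    """Write xml under key; return True if the store was modified.

    Toy stand-in with a dict as the store. It is not
    RawDatabase.add_response.
    """
    if key in store and not overwrite:
        # Keep the existing entry, even if it is an outdated partial response.
        return False
    store[key] = xml
    return True


store = {"XO.KSN4": b"<partial inventory>"}
kept = add_response(store, "XO.KSN4", b"<full inventory>")
replaced = add_response(store, "XO.KSN4", b"<full inventory>", overwrite=True)
```

Without the flag the first call leaves the partial inventory in place; with `overwrite=True` the full inventory replaces it.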

@lsawade
Collaborator Author

lsawade commented Jul 11, 2023

Testing!
