segy: add the option to skip corrupt blocks during reading #3259
base: master
Conversation
…corrupt segy blocks rather than having the segy reader simply raise the SegyFileError exception
except ValueError:
    #skip the remaining corrupt traces
    if(not skip_corrupt_traces):raise ValueError
    break
this should probably be
except ValueError as e:
    if not skip_corrupt_traces:
        raise e
I don't understand the break; isn't the intent of the PR that reading continues after a corrupt trace?
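For illustration, the skip-and-continue behavior this comment describes could look roughly like the following. This is a minimal, self-contained sketch: `decode_block`, `read_traces`, and the list of blocks are invented stand-ins, not ObsPy's actual reader internals.

```python
def decode_block(block):
    # Toy decoder standing in for trace decoding: an int block is
    # "good", anything else raises as a corrupt trace would.
    if not isinstance(block, int):
        raise ValueError("corrupt trace block")
    return block

def read_traces(trace_blocks, skip_corrupt_traces=False):
    traces = []
    for block in trace_blocks:
        try:
            traces.append(decode_block(block))
        except ValueError:
            if not skip_corrupt_traces:
                raise
            continue  # skip this corrupt trace and keep reading
    return traces
```

With `skip_corrupt_traces=True`, `read_traces([1, "bad", 2], skip_corrupt_traces=True)` drops the bad block and keeps going; without it, the exception propagates as before.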
Good catch, that is a mistake left over from earlier testing (that break causes the reader to stop, so you get a stream object with all of the good traces up until the first corrupt trace).
I ended up leaving this as is, because a lot of the corruption I see in SEGY files is the result of an entirely new byte being inserted into the bad header. This reliably breaks the offset used to decode the next SEGY header, guaranteeing that all subsequent blocks will be decoded incorrectly.
Since this is going into the next feature release, I do not see a problem with having the default
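The failure mode described above, where one inserted byte shifts every later fixed-offset read, can be demonstrated in miniature. This toy format uses 4-byte length-prefixed blocks rather than real 240-byte SEGY trace headers; it is illustrative only.

```python
import struct

def pack_traces(traces):
    # Toy format: 4-byte big-endian length prefix, then the data.
    out = b""
    for data in traces:
        out += struct.pack(">I", len(data)) + data
    return out

def read_fixed_offset(buf):
    # Decode by trusting the length field, exactly as a fixed-offset
    # reader trusts npts from the trace header.
    traces, pos = [], 0
    while pos + 4 <= len(buf):
        npts = struct.unpack_from(">I", buf, pos)[0]
        pos += 4
        traces.append(buf[pos:pos + npts])
        pos += npts
    return traces

good = pack_traces([b"aaa", b"bbbb"])
assert read_fixed_offset(good) == [b"aaa", b"bbbb"]

# One stray byte inserted before the first header misaligns every
# subsequent read: nothing after the insertion decodes correctly.
assert read_fixed_offset(b"\x00" + good) != [b"aaa", b"bbbb"]
```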
I don't understand that part; is it not possible to skip ahead to the start of the next header? Do later headers rely on information from earlier blocks?
Yes please, make a new test file with only a handful of blocks to keep it small.
This file here and the new test file should go into the
There are two places where this can/should be mentioned:
@ThomasLecocq you're working with SEGY to some extent, right? Maybe you can watch this one too?
yes, indeed! I'll work on it this week, cheers
Thank you both for the feedback. I'll go ahead and make the sample bad SEGY file and add it to the tests. I'll also remove that break, change it as megies suggested, and add the test and documentation. I'll try to get this in early next week.

The only thing I disagree with is the default: I think it should remain False, because the new behavior can result in a stream object being returned with bad traces (for example, if the timestamp happened to be the thing that was corrupt, there will be a trace with a bad start or end time). That is fine if the user invoked the option and expects that some of the data might be bad; if it defaults to True and this happens, I imagine you will get lots of issues about "the segy reader not working", as users probably won't realize that the issue is in the file, not the reader.

ThomasLecocq, if you want to take a look at it, I am attaching a copy of the sample bad SEGY file which I plan to add to the tests. It is the same as the good SEGY file except that I added two bytes on line 137536 using a hex editor to simulate the kind of header corruption I am describing. The code I have pasted below plots this (using the change in this PR).
Fine with me if you think it should be like that. I still don't fully understand how these blocks get corrupted, though. If just a bad starttime were in there, would our reader not read it as a valid block anyway?

It would be interesting to know how that actually happens. If the reader complains about corruption of some sort and you just throw away the affected part, look for the next good header (I'm assuming that these SEGY "traces" each have their own header), and continue from there, how could this approach result in bad data being returned?
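The "throw away the affected part and look for the next good header" idea could be sketched as a byte-by-byte resynchronization scan. Everything here is an invented stand-in: SEGY trace headers carry no magic bytes, so a real validator could only sanity-check header fields (number of samples, sample interval, etc.) for plausible values.

```python
def resync(buf, pos, looks_like_header):
    """Advance byte-by-byte from `pos` until `looks_like_header`
    accepts the bytes at that offset; return the offset found, or
    None if the buffer is exhausted."""
    while pos < len(buf):
        if looks_like_header(buf, pos):
            return pos
        pos += 1
    return None

# Toy validator for the sketch: a "header" starts with the 2-byte
# magic b"HD" (real SEGY headers have no such magic).
def magic_at(buf, pos):
    return buf[pos:pos + 2] == b"HD"

assert resync(b"xxxxHDdata", 0, magic_at) == 4
```

Because the validation is only heuristic, a scan like this can resynchronize on bytes that merely look like a header, which is one way "bad data" could still be returned.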
If you go to the "files changed" tab, you can see what the linter is complaining about.
…unit tests for the option to skip_corrupt_blocks. Also added a description of this option to the documentation
…stop at the corrupt trace and only return the traces which have been successfully decoded so far
OK, so there are a few common types of SEGY file corruption, which have different effects on the reader:
I thought that the PR I originally submitted allowed you to simply skip all corrupt blocks in each of these cases. That holds for the first kind of corruption, as long as the bad byte doesn't damage the information about the number of samples in the trace. More testing showed me that this isn't quite what happens with the second kind of file corruption: it is only possible to get the data up to the corrupt block, because all subsequent data is incorrectly offset.

A more robust fix would require modifying SEGYTrace._read_trace() to scan until the next trace is reached, instead of offsetting based on the npts = self.header.number_of_samples_in_this_trace calculated from the trace header. That is probably a lot more work than what I have done, and I am not sure my bosses will care enough about recovering the remaining data to have me implement it in the near term.

In summary, I still think this PR is a big improvement, because you get much more data out of a corrupt SEGY file without having to carefully specify time ranges; but instead of giving you all non-corrupt blocks, it simply returns data up until the first corrupt block occurs.
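The behavior the PR settles on, returning everything up to the first corrupt block because later offsets cannot be trusted, can be sketched as follows. `decode` and the list of blocks are invented stand-ins for the reader internals.

```python
def decode(block):
    # Toy stand-in for trace decoding: None marks a corrupt block.
    if block is None:
        raise ValueError("corrupt trace block")
    return block

def read_until_corrupt(blocks, skip_corrupt_traces=False):
    traces = []
    for block in blocks:
        try:
            traces.append(decode(block))
        except ValueError:
            if not skip_corrupt_traces:
                raise
            break  # later offsets are unreliable, so stop here
    return traces

# With the option set, you get the good traces read so far:
assert read_until_corrupt([1, 2, None, 3], skip_corrupt_traces=True) == [1, 2]
```

The `break` (rather than `continue`) is what encodes "data up until the first corrupt block": block 3 is discarded even though it is intact, because its position in the file can no longer be located reliably.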
What does this PR do?
This makes a small change to the ObsPy SEGY reader, adding an option to skip corrupt SEGY blocks rather than stopping and raising the SEGYTraceReadingError exception.
SEGY files segment the data: there is a SEGY header followed by the data associated with that header, followed by another SEGY header, and so on until the end of the file is reached.
Currently, if a corrupt header is encountered, the ObsPy reader simply raises a SEGYTraceReadingError exception. This means that if a large SEGY file has any corrupt data in it, the user must either specify a time range that avoids the corrupt data (which is laborious to calculate) or be unable to use any of the good data in the file.
This change adds an option, "skip_corrupt_traces" (defaulting to False), which causes the SEGY reader to simply skip any corrupt blocks. Using the option is a little risky, since it can cause the reader to return incorrect traces if multiple sequential blocks are corrupt. However, I think it is a good user option, since large SEGY files will often have a few corrupt blocks, and this is an easier way to deal with them than having to carefully specify start and end times which avoid them.
Why was it initiated? Any relevant Issues?
I made these changes to facilitate my own workflows with large segy files
PR Checklist
I tried to follow the pull request instructions, although I got stuck in a few places:
I ran pytest and all of the tests passed. However, I need help adding a specific case for the issue of a corrupt SEGY file (my SEGY files are very large, but I could very easily create a small one which demonstrates the problem that this addresses). If someone could point me to the part of the test module that currently tests SEGY files, that would be very helpful.
I don’t understand how to add the new option to the documentation. I would just need to add:
“
skip_corrupt_traces: Large SEGY files often have corrupt traces; setting this parameter to True
causes them to be skipped instead of raising a SEGYTraceReadingError.
Defaults to False so as not to upset any existing workflows which use this.
”
to
https://docs.obspy.org/master/packages/autogen/obspy.io.segy.core._read_segy.html#obspy.io.segy.core._read_segy
Thanks!