
segy: add the option to skip corrupt blocks during reading #3259

Open
wants to merge 4 commits into master
Conversation

firstkingofrome

What does this PR do?

This makes a small change to the obspy segy reader to add an option to skip corrupt segy blocks rather than stopping and raising the SEGYTraceReadingError exception.

Segy files segment the data into blocks: each segy header is followed by the data associated with that header, then another segy header, and so on until the end of the file is reached.
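For illustration, here is a rough sketch of how a reader walks these header/data blocks. This is not the actual obspy implementation; the 240-byte header size, the npts field at bytes 115-116, and 4-byte samples are assumptions based on the SEG-Y rev 1 layout, and `iter_trace_blocks` is a made-up name:

```python
import struct

TRACE_HEADER_SIZE = 240  # bytes per trace header in SEG-Y rev 1

def iter_trace_blocks(buf, sample_size=4):
    """Yield (header, data) byte pairs from a raw SEG-Y trace section."""
    pos = 0
    while pos + TRACE_HEADER_SIZE <= len(buf):
        header = buf[pos:pos + TRACE_HEADER_SIZE]
        # number of samples in this trace: 2-byte big-endian int at bytes 115-116
        npts = struct.unpack('>h', header[114:116])[0]
        data_start = pos + TRACE_HEADER_SIZE
        data = buf[data_start:data_start + npts * sample_size]
        yield header, data
        # the offset of the next header is derived from npts, which is why
        # a corrupt npts value derails every later block
        pos = data_start + npts * sample_size
```

The key point for this PR is the last line: every block's position depends on the npts read from the previous header.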

Currently, if a corrupt header is encountered, the obspy reader simply raises a SEGYTraceReadingError exception. This means that if a large segy file has any corrupt data in it, the user must either specify a time range that avoids the corrupt data (which is laborious to calculate) or forgo all of the good data in the file.

This change adds an option “skip_corrupt_traces” (defaulting to False) which causes the segy reader to simply skip any corrupt blocks. Using the option is a little risky, since it can cause the reader to return incorrect traces if multiple sequential blocks are corrupt. However, I think it is a good user option: large segy files often have a few corrupt blocks, and this is an easier way to deal with them than carefully specifying start and end times that avoid them.

Why was it initiated? Any relevant Issues?

I made these changes to facilitate my own workflows with large segy files.

PR Checklist

I tried to follow the pull request instructions, although I got stuck in a few places:

  1. I ran pytest and all of the tests passed. However, I need help adding a specific test case for the issue of a corrupt segy file (my segy files are very large, but I could very easily create a small one which demonstrates the problem that this addresses). If someone could point me to the part of the test module which currently tests segy files, that would be very helpful.

  2. I don’t understand how to add the new option to the documentation. I would just need to add:

    skip_corrupt_traces: Large segy files often have corrupt traces; setting this parameter to True
    causes them to be skipped instead of raising a SEGYTraceReadingError.
    Defaults to False so as not to upset any existing workflows.

to
https://docs.obspy.org/master/packages/autogen/obspy.io.segy.core._read_segy.html#obspy.io.segy.core._read_segy

Thanks!

…corrupt segy blocks rather than having the segy reader simply raise the SegyFileError exception
except ValueError:
    # skip the remaining corrupt traces
    if not skip_corrupt_traces:
        raise ValueError
    break
Member

this should probably be

except ValueError as e:
    if not skip_corrupt_traces:
        raise e

I don't understand the break, isn't the intent of the PR that reading continues after a corrupt trace?

Author

Good catch, that is a mistake left over from earlier testing (that break causes it to stop so you get a stream object with all of the good traces up until a corrupt trace)

Author

I ended up leaving this as is, because a lot of the corruption I see in segy files results from an entirely new byte being inserted into the bad header. This reliably breaks the offset used to decode the next segy header, guaranteeing that subsequent blocks will be decoded incorrectly.
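To make that concrete, here is a toy sketch (made-up 8-byte headers with the npts field at offset 0 and one byte per sample, so not real segy sizes) showing how one inserted stray byte derails every subsequent header read when the reader offsets by npts:

```python
import struct

HEADER = 8    # toy header size (real SEG-Y trace headers are 240 bytes)
NPTS_OFF = 0  # toy offset of the 2-byte big-endian npts field

def make_block(npts):
    """Build a toy header followed by npts one-byte samples."""
    return struct.pack('>h', npts) + b'\x00' * (HEADER - 2) + b'\xaa' * npts

def read_npts_sequence(buf):
    """Walk header+data blocks, returning the npts decoded from each header."""
    out, pos = [], 0
    while 0 <= pos and pos + HEADER <= len(buf):
        npts = struct.unpack('>h', buf[pos + NPTS_OFF:pos + NPTS_OFF + 2])[0]
        out.append(npts)
        pos += HEADER + npts  # the next header offset is derived from npts
    return out

good = make_block(4) * 3
bad = make_block(4) + b'\xff' + make_block(4) * 2  # one stray byte inserted

print(read_npts_sequence(good))  # [4, 4, 4]
print(read_npts_sequence(bad))   # [4, -256] (the walk derails after the stray byte)
```

After the stray byte, the npts field is read from misaligned bytes, so the decoded value is nonsense and every later block boundary is wrong.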

@megies
Member

megies commented Jan 24, 2023

option “skip_corrupt_traces” (defaulted to False)

Since this is going into the next feature release, I do not see a problem with making the default True, or even not making it a switch at all and just always trying to skip corrupt blocks. I would just show a warning when blocks were skipped.

Using the option is little risky since it can cause the reader to return incorrect traces if multiple sequential blocks are corrupt

I don't understand that part. Is it not possible to just skip ahead to the next header start? Do later headers rely on information from earlier blocks?

  1. files are very large but I could very easily create a small one which demonstrates the problem that this address

Yes please, make a new test file with only a handful of blocks to keep it small.

  1. point me to the part of the test module which currently tests segy files

This file here and the new test file should go into the data folder in the same directory, then you can access your data as testdata['filename.segy'] in the test code, like in some other test routines in that file.

2. I don’t understand how to add the new option to the documentation.

There are two places where this can/should be mentioned:

@megies
Member

megies commented Jan 24, 2023

@ThomasLecocq you're working with segy to some extent right? Maybe you can watch this one too?

@megies megies added this to the 1.5.0 milestone Jan 24, 2023
@ThomasLecocq
Contributor

@ThomasLecocq you're working with segy to some extent right? Maybe you can watch this one too?

yes, indeed! I'll work on it this week, cheers

@megies megies changed the title made some small changes to the segy reader to add the option to skip corrupt segy blocks segy: add the option to skip corrupt blocks during reading Jan 24, 2023
@firstkingofrome
Author

firstkingofrome commented Jan 24, 2023

Thank you both for the feedback. I'll go ahead and make the sample bad segy file and add it to the tests. I'll also remove that break, change it as megies suggested, and add the test and documentation. I'll try to get this in early next week.

The only thing that I disagree with is that I think it should remain defaulted to False, because the new behavior can result in a stream object being returned with bad traces (for example, if the timestamp happened to be the corrupt field, there will be a trace with a bad start or end time). This is fine if the user invoked the option and expects that some of the data might be bad; if it defaults to True and this happens, I imagine you will get lots of issues about "the segy reader not working", as users probably won't realize that the issue is in the file, not the reader.

@ThomasLecocq If you want to take a look at it, I am attaching a copy of the sample bad segy file which I plan to add to the tests. It's the same as the good segy file, except that I added two bytes on line 137536 using a hex editor to simulate the kind of header corruption I am complaining about. The code I have pasted below plots this (using the change in this PR).

import obspy as op

segy = op.core.stream.read("good.segy")
segy.plot()
# fails with SEGYTraceReadingError
badSegy = op.core.stream.read("corrupt.segy", unpack_trace_headers=True)
# succeeds
badSegy = op.core.stream.read("corrupt.segy", unpack_trace_headers=True, skip_corrupt_traces=True)
badSegy.plot()

segyFiles.zip

@megies
Member

megies commented Jan 25, 2023

The only thing that I disagree with is that I think that it should remain defaulted to False, because the new behavior can result in a stream object being returned with bad traces (for example if the time stamp happened to be the thing which was corrupt, there will be a trace with a bad start or end time).

Fine with me if you think it should be like that. I still don't fully understand how these blocks are corrupted, though. If just a bad starttime were to be in there would our reader not just read it as a valid block anyway?

new behavior can result in a stream object being returned with bad traces

Would be interesting how that actually happens. If the reader complains about some corruption of some sort and you just throw away the affected part and look for a good next header (I'm assuming that these segy "traces" each have their own header) and continue from there, how could this approach result in bad data being returned?

@megies
Member

megies commented Jan 25, 2023

If you go to the "files changed" tab, you can see what the linter is complaining about.

Eric Eckert added 3 commits January 30, 2023 12:01
…unit tests for the option to skip_corrupt_blocks. Also added a description of this option to the documentation
…stop at the corrupt trace and only return the traces which have been successfully decoded so far
@firstkingofrome
Author

OK, so there are a few common types of segy file corruption, which have different effects on the reader:

  1. A changed byte value in a segy block header typically doesn't break the current segy reader, unless it is in the npts field or results in a non-physical time or sample rate, in which case a ValueError or SEGYTraceReadingError typically results and no data is returned.
  2. Any kind of corruption which results in an entirely new byte being inserted into the segy header (I am not sure how common this is; I have some data from battery-powered instruments, and it seems to be fairly common when the instrument is subjected to cold temperatures that cause it to temporarily lose power). This breaks the current segy reader because it causes the remainder of the file to be offset by that byte, and the segy reader uses the npts value from each preceding segy header as an offset.

I thought that the PR I originally submitted let you simply skip all corrupt blocks in each of these cases. This seems to be true for the first kind of corruption, as long as the bad byte doesn't damage the information about the number of samples in the trace.

More testing showed me that this isn't quite what happens with the second kind of corruption. It is only possible to get the data up to the corrupt block, because all subsequent data is incorrectly offset. A more robust fix would require modifying SEGYTrace._read_trace() to scan until the next trace is reached, instead of offsetting based on the npts = self.header.number_of_samples_in_this_trace value from the trace header. This is probably a lot more work than what I have done, and I am not sure my bosses will care enough about recovering the remaining data to have me implement this in the near term.
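For reference, the scanning idea could look roughly like this. It is a sketch only, assuming a 240-byte trace header, a 2-byte big-endian npts field at bytes 115-116, and 4-byte samples; it is not obspy code, and `plausible_npts`/`resync` are hypothetical helper names:

```python
import struct

HEADER = 240  # SEG-Y trace header size in bytes
SAMPLE = 4    # assumed bytes per sample

def plausible_npts(buf, pos):
    """Return the npts at offset pos if it looks like a valid header, else None."""
    if pos + HEADER > len(buf):
        return None
    npts = struct.unpack('>h', buf[pos + 114:pos + 116])[0]
    return npts if 0 < npts < 32768 else None

def resync(buf, pos):
    """Scan forward from pos for the next offset that looks like a trace start."""
    while pos + HEADER <= len(buf):
        npts = plausible_npts(buf, pos)
        if npts is not None:
            nxt = pos + HEADER + npts * SAMPLE
            # accept only if the implied next header is also plausible,
            # or the implied block exactly ends the file
            if nxt == len(buf) or plausible_npts(buf, nxt) is not None:
                return pos
        pos += 1
    return None
```

The idea is cheap validation: a candidate offset is accepted only if its npts is sane and the block it implies either ends the file or lands on another sane header.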

In summary, I still think this PR is a big improvement, because you get much more data out of a corrupt segy file without having to carefully specify time ranges; but instead of giving you all non-corrupt blocks, it simply returns the data up until the first corrupt block.
