
segy: add the option to skip corrupt blocks during reading #3259

Open
wants to merge 4 commits into master
Conversation

firstkingofrome

What does this PR do?

This makes a small change to the obspy segy reader to add an option to skip corrupt segy blocks rather than stopping and raising the SEGYTraceReadingError exception.

Segy files segment the data into blocks: each segy header is followed by the data associated with that header, then another segy header, and so on until the end of the file is reached.
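For illustration, here is a rough sketch of how a reader walks these header/data blocks. This is not the actual obspy implementation; the 240-byte header size, the npts field at bytes 115-116, and 4-byte samples are assumptions based on the SEG-Y rev 1 layout, and `iter_trace_blocks` is a made-up name:

```python
import struct

TRACE_HEADER_SIZE = 240  # bytes per trace header in SEG-Y rev 1

def iter_trace_blocks(buf, sample_size=4):
    """Yield (header, data) byte pairs from a raw SEG-Y trace section."""
    pos = 0
    while pos + TRACE_HEADER_SIZE <= len(buf):
        header = buf[pos:pos + TRACE_HEADER_SIZE]
        # number of samples in this trace: 2-byte big-endian int at bytes 115-116
        npts = struct.unpack('>h', header[114:116])[0]
        data_start = pos + TRACE_HEADER_SIZE
        data = buf[data_start:data_start + npts * sample_size]
        yield header, data
        # the offset of the next header is derived from npts, which is why
        # a corrupt npts value derails every later block
        pos = data_start + npts * sample_size
```

The key point for this PR is the last line: every block's position depends on the npts read from the previous header.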

Currently, if a corrupt header is encountered, the obspy reader simply raises a SEGYTraceReadingError exception. This means that if a large segy file has any corrupt data in it, the user must either specify a time range that avoids the corrupt data (which is laborious to calculate) or forgo all of the good data in the file.

This change adds an option “skip_corrupt_traces” (defaulting to False) which causes the segy reader to simply skip any corrupt blocks. Using the option is a little risky, since it can cause the reader to return incorrect traces if multiple sequential blocks are corrupt. However, I think it is a good user option: large segy files often have a few corrupt blocks, and this is an easier way to deal with them than carefully specifying start and end times that avoid them.

Why was it initiated? Any relevant Issues?

I made these changes to facilitate my own workflows with large segy files.

PR Checklist

I tried to follow the pull request instructions, although I got stuck in a few places:

  1. I ran pytest and all of the tests passed. However, I need help adding a specific test case for the issue of a corrupt segy file (my segy files are very large, but I could very easily create a small one which demonstrates the problem that this addresses). If someone could point me to the part of the test module which currently tests segy files, that would be very helpful.

  2. I don’t understand how to add the new option to the documentation. I would just need to add:

    skip_corrupt_traces: Large segy files often have corrupt traces; setting this parameter to True
    causes them to be skipped instead of raising a SEGYTraceReadingError.
    Defaults to False so as not to upset any existing workflows.

to
https://docs.obspy.org/master/packages/autogen/obspy.io.segy.core._read_segy.html#obspy.io.segy.core._read_segy

Thanks!

…corrupt segy blocks rather than having the segy reader simply raise the SegyFileError exception
except ValueError:
    # skip the remaining corrupt traces
    if not skip_corrupt_traces:
        raise ValueError
    break
Member

this should probably be

except ValueError as e:
    if not skip_corrupt_traces:
        raise e

I don't understand the break, isn't the intent of the PR that reading continues after a corrupt trace?

Author

Good catch, that is a mistake left over from earlier testing (that break causes it to stop so you get a stream object with all of the good traces up until a corrupt trace)

Author

I ended up leaving this as is, because a lot of the corruption I see in segy files results from an entirely new byte being inserted into the bad header. This reliably breaks the offset used to decode the next segy header, guaranteeing that subsequent blocks will be decoded incorrectly.
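To make that concrete, here is a toy sketch (made-up 8-byte headers with the npts field at offset 0 and one byte per sample, so not real segy sizes) showing how one inserted stray byte derails every subsequent header read when the reader offsets by npts:

```python
import struct

HEADER = 8    # toy header size (real SEG-Y trace headers are 240 bytes)
NPTS_OFF = 0  # toy offset of the 2-byte big-endian npts field

def make_block(npts):
    """Build a toy header followed by npts one-byte samples."""
    return struct.pack('>h', npts) + b'\x00' * (HEADER - 2) + b'\xaa' * npts

def read_npts_sequence(buf):
    """Walk header+data blocks, returning the npts decoded from each header."""
    out, pos = [], 0
    while 0 <= pos and pos + HEADER <= len(buf):
        npts = struct.unpack('>h', buf[pos + NPTS_OFF:pos + NPTS_OFF + 2])[0]
        out.append(npts)
        pos += HEADER + npts  # the next header offset is derived from npts
    return out

good = make_block(4) * 3
bad = make_block(4) + b'\xff' + make_block(4) * 2  # one stray byte inserted

print(read_npts_sequence(good))  # [4, 4, 4]
print(read_npts_sequence(bad))   # [4, -256] (the walk derails after the stray byte)
```

After the stray byte, the npts field is read from misaligned bytes, so the decoded value is nonsense and every later block boundary is wrong.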

@megies
Member

megies commented Jan 24, 2023

option “skip_corrupt_traces” (defaulted to False)

Since this is going into the next feature release, I do not see a problem with making the default True, or even not making it a switch at all and just always trying to skip corrupt blocks. I would just show a warning when blocks were skipped.

Using the option is little risky since it can cause the reader to return incorrect traces if multiple sequential blocks are corrupt

I don't understand that part. Is it not possible to just skip ahead to the next header start? Do later headers rely on information from earlier blocks?

  1. files are very large but I could very easily create a small one which demonstrates the problem that this address

Yes please, make a new test file with only a handful of blocks to keep it small.

  1. point me to the part of the test module which currently tests segy files

This file here and the new test file should go into the data folder in the same directory, then you can access your data as testdata['filename.segy'] in the test code, like in some other test routines in that file.

2. I don’t understand how to add the new option to the documentation.

There are two places where this can/should be mentioned:

@megies
Member

megies commented Jan 24, 2023

@ThomasLecocq you're working with segy to some extent right? Maybe you can watch this one too?

@megies megies added this to the 1.5.0 milestone Jan 24, 2023
@ThomasLecocq
Contributor

@ThomasLecocq you're working with segy to some extent right? Maybe you can watch this one too?

yes, indeed! I'll work on it this week, cheers

@megies megies changed the title made some small changes to the segy reader to add the option to skip corrupt segy blocks segy: add the option to skip corrupt blocks during reading Jan 24, 2023
@firstkingofrome
Author

firstkingofrome commented Jan 24, 2023

Thank you both for the feedback. I'll go ahead and make the sample bad segy file and add it to the tests. I'll also remove that break, change it as megies suggested, and add the test and documentation. I'll try to get this in early next week.

The only thing that I disagree with is that I think it should remain defaulted to False, because the new behavior can result in a stream object being returned with bad traces (for example, if the timestamp happened to be the corrupt field, there will be a trace with a bad start or end time). This is fine if the user invoked the option and expects that some of the data might be bad; if it defaults to True and this happens, I imagine you will get lots of issues about "the segy reader not working", as users probably won't realize that the issue is in the file, not the reader.

@ThomasLecocq If you want to take a look at it, I am attaching a copy of the sample bad segy file which I plan to add to the tests. It's the same as the good segy file, except that I added two bytes on line 137536 using a hex editor to simulate the kind of header corruption I am complaining about. The code I have pasted below plots this (using the change in this PR).

import obspy as op

segy = op.core.stream.read("good.segy")
segy.plot()
# fails with SEGYTraceReadingError
badSegy = op.core.stream.read("corrupt.segy", unpack_trace_headers=True)
# succeeds
badSegy = op.core.stream.read("corrupt.segy", unpack_trace_headers=True, skip_corrupt_traces=True)
badSegy.plot()

segyFiles.zip

@megies
Member

megies commented Jan 25, 2023

The only thing that I disagree with is that I think that it should remain defaulted to False, because the new behavior can result in a stream object being returned with bad traces (for example if the time stamp happened to be the thing which was corrupt, there will be a trace with a bad start or end time).

Fine with me if you think it should be like that. I still don't fully understand how these blocks are corrupted, though. If just a bad starttime were to be in there would our reader not just read it as a valid block anyway?

new behavior can result in a stream object being returned with bad traces

Would be interesting how that actually happens. If the reader complains about some corruption of some sort and you just throw away the affected part and look for a good next header (I'm assuming that these segy "traces" each have their own header) and continue from there, how could this approach result in bad data being returned?

@megies
Member

megies commented Jan 25, 2023

If you go to the "files changed" tab, you can see what the linter is complaining about.

Eric Eckert added 3 commits January 30, 2023 12:01
…unit tests for the option to skip_corrupt_blocks. Also added a description of this option to the documentation
…stop at the corrupt trace and only return the traces which have been successfully decoded so far
@firstkingofrome
Author

OK, so there are a few common types of segy file corruption, which have different effects on the reader:

  1. A changed byte value in a segy block header typically doesn't break the current segy reader, unless it is in the npts field or results in a non-physical time or sample rate, in which case a ValueError or SEGYTraceReadingError typically results and no data is returned.
  2. Any kind of corruption which results in an entirely new byte being inserted into the segy header (I am not sure how common this is; I have some data from battery-powered instruments, and it seems to be fairly common when the instrument is subjected to cold temperatures that cause it to temporarily lose power). This breaks the current segy reader because it causes the remainder of the file to be offset by that byte, and the segy reader uses the npts value from each preceding segy header as an offset.

I thought that the PR I originally submitted let you simply skip all corrupt blocks in each of these cases. This seems to be true for the first kind of corruption, as long as the bad byte doesn't damage the information about the number of samples in the trace.

More testing showed me that this isn't quite what happens with the second kind of corruption. It is only possible to get the data up to the corrupt block, because all subsequent data is incorrectly offset. A more robust fix would require modifying SEGYTrace._read_trace() to scan until the next trace is reached, instead of offsetting based on the npts = self.header.number_of_samples_in_this_trace value from the trace header. This is probably a lot more work than what I have done, and I am not sure my bosses will care enough about recovering the remaining data to have me implement this in the near term.
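For reference, the scanning idea could look roughly like this. It is a sketch only, assuming a 240-byte trace header, a 2-byte big-endian npts field at bytes 115-116, and 4-byte samples; it is not obspy code, and `plausible_npts`/`resync` are hypothetical helper names:

```python
import struct

HEADER = 240  # SEG-Y trace header size in bytes
SAMPLE = 4    # assumed bytes per sample

def plausible_npts(buf, pos):
    """Return the npts at offset pos if it looks like a valid header, else None."""
    if pos + HEADER > len(buf):
        return None
    npts = struct.unpack('>h', buf[pos + 114:pos + 116])[0]
    return npts if 0 < npts < 32768 else None

def resync(buf, pos):
    """Scan forward from pos for the next offset that looks like a trace start."""
    while pos + HEADER <= len(buf):
        npts = plausible_npts(buf, pos)
        if npts is not None:
            nxt = pos + HEADER + npts * SAMPLE
            # accept only if the implied next header is also plausible,
            # or the implied block exactly ends the file
            if nxt == len(buf) or plausible_npts(buf, nxt) is not None:
                return pos
        pos += 1
    return None
```

The idea is cheap validation: a candidate offset is accepted only if its npts is sane and the block it implies either ends the file or lands on another sane header.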

In summary, I still think this PR is a big improvement, because you get much more data out of a corrupt segy file without having to carefully specify time ranges; but instead of giving you all non-corrupt blocks, it simply returns the data up until the first corrupt block.
