Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfault when testing a bad WARC ending in gzip header (10 bytes) and no data on Mac #99

Open
ikreymer opened this issue Mar 20, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@ikreymer
Copy link

Tried out the latest release on Mac, and am getting this segfault with Browsertrix Crawler WARCs:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x68 pc=0x16932fa]

goroutine 16 [running]:
github.com/nlnwa/warchaeology/cmd/validate.validateFile({0x1912600, 0x1e30c20}, {0xc001b2c780, 0x5f})
	/home/runner/work/warchaeology/warchaeology/cmd/validate/validate.go:149 +0x37a
github.com/nlnwa/warchaeology/internal/filewalker.(*filewalker).Walk.func2.1()
	/home/runner/work/warchaeology/warchaeology/internal/filewalker/filewalker.go:224 +0x183
github.com/nlnwa/warchaeology/internal/workerpool.New.func1(0x0?)
	/home/runner/work/warchaeology/warchaeology/internal/workerpool/workerpool.go:42 +0xa5
created by github.com/nlnwa/warchaeology/internal/workerpool.New in goroutine 1
	/home/runner/work/warchaeology/warchaeology/internal/workerpool/workerpool.go:37 +0x89

Have not had time to isolate which WARC is causing it exactly, but could later, if that would be helpful.
Tried both with -r and with a --source-file-list with default concurrency.

@trym-b
Copy link
Contributor

trym-b commented Mar 21, 2024

Hi Ilya

Thank you for reporting this issue. Unfortunately, it seems like we have some cross-platform support problems that we would like to fix, which your issue highlights.

One thing that would help us greatly would be to find the .warc file that triggers this error, so that we could create a reproducer of this issue and also create a regression test. Without the offending .warc file it is a bit more difficult to figure out the underlying error.

@ikreymer
Copy link
Author

@trym-b yes, I've managed to isolate the specific WARC, it seems to be an invalid WARC with last record truncated (which of course should be an error), and not anything to do with recursive/directory checking.
The WARC is attached:
rec-33318048d933-20240317162652059-0.warc.gz
Was testing the latest Darwin release.

@ikreymer
Copy link
Author

ikreymer commented Mar 21, 2024

I believe the issue with that file is that at the end, it has a complete record and the 10 bytes for the next gzip header for next record, and no data after that.

@ikreymer ikreymer changed the title segfault when running recursively on a directory of WARCs segfault when testing a bad WARC ending in gzip header (10 bytes) and no data on Mac Mar 21, 2024
@trym-b
Copy link
Contributor

trym-b commented Mar 22, 2024

I am happy that you were able to root cause it so quickly. After downloading and opening the file with the following command, I also get a segfault.

go run main.go validate ~/Downloads/rec-33318048d933-20240317162652059-0.warc.gz 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x68 pc=0xa9d65a]

goroutine 21 [running]:
github.com/nlnwa/warchaeology/cmd/validate.validateFile({0xd20a60, 0x1243000}, {0x7ffdfa1f89bb, 0x41})
	/home/trym/repos/work/warchaeology/cmd/validate/validate.go:137 +0x37a
github.com/nlnwa/warchaeology/internal/filewalker.(*filewalker).Walk.func2.1()
	/home/trym/repos/work/warchaeology/internal/filewalker/filewalker.go:225 +0x183
github.com/nlnwa/warchaeology/internal/workerpool.New.func1(0x0?)
	/home/trym/repos/work/warchaeology/internal/workerpool/workerpool.go:26 +0xa5
created by github.com/nlnwa/warchaeology/internal/workerpool.New in goroutine 1
	/home/trym/repos/work/warchaeology/internal/workerpool/workerpool.go:21 +0x89
exit status 2

We have done some code changes since the release, so it is likely that the version I tested on (latest main) returned the same stack-trace but with slightly different line numbers.

I agree that a malformed warc file should not in general result in a panic/segfault. I will look into fixing this error.

I hope I can use the warc you provided as test data, if you don't mind?

@trym-b trym-b added the bug Something isn't working label Mar 22, 2024
@ikreymer
Copy link
Author

Of course, feel free to use this as test data

trym-b added a commit that referenced this issue Mar 25, 2024
# Motivation

Many top level commands are missing tests. This is
a first step to add a test for `ls` with proper
test data.

# Changes

This commit adds a very simple test for `ls` that
checks that it does not crash when reading a file.

# Future work

Add more checks to the test so that it is even
better at avoiding regressions.

# Acknowledgements

Thanks to Ilya Kreymer for providing the test data
through issue
#99
trym-b added a commit that referenced this issue Mar 26, 2024
# Motivation

Many top level commands are missing tests. This is
a first step to add a test for `ls` with proper
test data.

# Changes

This commit adds a very simple test for `ls` that
checks that it does not crash when reading a file.

Also added `lfs` checkout to every workflow so
that any lfs file is fetched wrongly.

# Future work

Add more checks to the test so that it is even
better at avoiding regressions.

# Acknowledgements

Thanks to Ilya Kreymer for providing the test data
through issue
#99
trym-b added a commit that referenced this issue Apr 2, 2024
# Motivation

Many top level commands are missing tests. This is
a first step to add a test for `ls` with proper
test data.

# Changes

This commit adds a very simple test for `ls` that
checks that it does not crash when reading a file.

# Future work

Add more checks to the test so that it is even
better at avoiding regressions.

# Acknowledgements

Thanks to Ilya Kreymer for providing the test data
through issue
#99
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants