
DetectReader might OOM #479

Open
ivanjaros opened this issue Feb 7, 2024 · 4 comments

Comments

@ivanjaros

ivanjaros commented Feb 7, 2024

If for some reason the readLimit gets set to 0, DetectReader will read the entire input into memory. That could mean gigabytes of data held in memory for no good reason.

There should be a hard ceiling, for example 1 MB, that can never be exceeded regardless of what readLimit is set to (a caller-side workaround along these lines is sketched after the snippet below). I would personally also read in chunks up to that limit and run detection on the data read so far after each chunk, instead of ingesting the entire source []byte at once, because you might need only 100 bytes while reading 3072.

func DetectReader(r io.Reader) (*MIME, error) {
	var in []byte
	var err error

	// Using atomic because readLimit can be written at the same time in other goroutine.
	l := atomic.LoadUint32(&readLimit)
	if l == 0 {
		in, err = ioutil.ReadAll(r) // BAD: unbounded read, the whole input ends up in memory
		if err != nil {
			return errMIME, err
		}
	} else {
		var n int
		in = make([]byte, l)
		// io.ErrUnexpectedEOF means len(r) < len(in). It is not an error in this case,
		// it just means the input file is smaller than the allocated bytes slice.
		n, err = io.ReadFull(r, in)
		if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
			return errMIME, err
		}
		in = in[:n]
	}

	mu.RLock()
	defer mu.RUnlock()
	return root.match(in, l), nil
}
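
For anyone hitting this today, a caller-side workaround sketch, not part of the library (the file name and the 1 MB cap are assumptions): wrapping the source in an io.LimitReader bounds how much DetectReader can ever consume, no matter how readLimit is configured.

package main

import (
	"fmt"
	"io"
	"os"

	"github.com/gabriel-vasile/mimetype"
)

func main() {
	f, err := os.Open("big-file.bin") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Hard cap: detection can never read more than 1 MB from the source,
	// even if readLimit was set to 0 somewhere else in the program.
	const maxDetectBytes = 1 << 20
	m, err := mimetype.DetectReader(io.LimitReader(f, maxDetectBytes))
	if err != nil {
		panic(err)
	}
	fmt.Println(m.String(), m.Extension())
}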

Thanks for this great library though! Something like this should be built-in. Good job.

@gabriel-vasile
Owner

If for some reason the readLimit gets set to 0

readLimit does not get set to 0 "for some reason"; it must be intentionally set by the programmer. The docs for SetLimit state what happens when 0 is passed.
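
For illustration, a minimal sketch of the documented behavior (the import path is the library's usual one; the file name is an assumption): passing 0 to SetLimit removes the cap, so DetectReader consumes the whole input.

package main

import (
	"fmt"
	"os"

	"github.com/gabriel-vasile/mimetype"
)

func main() {
	f, err := os.Open("input.bin") // hypothetical file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	mimetype.SetLimit(0) // 0 is documented as "no limit": the entire reader is read
	m, err := mimetype.DetectReader(f)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.String())

	mimetype.SetLimit(3072) // restore a bounded limit (3072 is the value mentioned above)
}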

I would personally also read in chunks up to that limit and run detection on the data read so far after each chunk, instead of ingesting the entire source []byte at once, because you might need only 100 bytes while reading 3072.

This argument can go both ways: if detection on the first chunk fails but detection on the first+second chunks succeeds, then we performed two detections where one would have done. So which operation brings more overhead, reading one big chunk or performing multiple detections? I don't know; I never benchmarked it or even thought about it, but I am curious which is more efficient. I suspect it depends a lot on the reader: a net.Conn might be much slower than a bytes.Buffer.
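
One way to get a rough answer would be a benchmark along these lines (the sample data, chunk size, and package layout are assumptions; it only compares one detection over the full chunk against repeated detections on growing prefixes):

// In a file such as detect_bench_test.go inside a package named detectbench.
package detectbench

import (
	"bytes"
	"testing"

	"github.com/gabriel-vasile/mimetype"
)

// Roughly 3 KB of made-up input; real inputs would matter a lot here.
var sample = bytes.Repeat([]byte("some sample payload bytes "), 128)

func BenchmarkSingleDetection(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = mimetype.Detect(sample) // one detection over the whole chunk
	}
}

func BenchmarkChunkedDetection(b *testing.B) {
	const chunk = 512
	for i := 0; i < b.N; i++ {
		// Detect repeatedly on growing prefixes, as a chunked reader would.
		for end := chunk; end <= len(sample); end += chunk {
			_ = mimetype.Detect(sample[:end])
		}
	}
}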

There is also the problem of figuring out when to stop reading chunks. Take, for example, the .docx format, which is a .zip archive containing some "special" files. These "special" files can appear anywhere inside the zip. Telling that an input is a zip takes only the first 4 bytes; finding the "special" files inside the zip might mean searching until the end of the file.

I agree that reading a full readLimit chunk is not ideal, especially when the input is very big or very small, but I think it's a necessary compromise to keep the code simple and fast.

@ivanjaros
Author

There is also the problem of figuring out when to stop reading chunks. Take, for example, the .docx format, which is a .zip archive containing some "special" files. These "special" files can appear anywhere inside the zip. Telling that an input is a zip takes only the first 4 bytes; finding the "special" files inside the zip might mean searching until the end of the file.

Exactly why support for streams (#480) is needed.

readLimit does not get set to 0 "for some reason"; it must be intentionally set by the programmer

Anything can happen, and I do not see how it is not bad code to even have the option of bringing the entire system down.

@gabriel-vasile
Owner

I understand the point of streaming/buffering the input, and I will look into how it could work with the existing code and whether it's worth the effort. This library started out as an alternative to http.DetectContentType, just with support for more formats.
Doing it as suggested in #480 will not happen, because performing repeated detections on increasingly bigger byte slices does not help with OOM.

Anything can happen, and I do not see how it is not bad code to even have the option of bringing the entire system down.

There are many ways to bring your system down: you can access out-of-bounds array indexes, or call os.ReadFile on a file that does not fit in memory. For now, I will update the docs to make it clear that the input is never buffered and is read in a single chunk.

@ivanjaros
Author

ivanjaros commented Feb 13, 2024

Doing it as suggested in #480 will not happen, because performing repeated detections on increasingly bigger byte slices does not help with OOM.

I did not say you should do that. The point was that, as you said yourself, you might not know where the magic bytes are located, so you keep reading into the buffer until you detect some and then proceed from there. Whether a buffer is needed at all is beside the point. It is more about cases where the magic bytes are not at the beginning of the file, and possibly not even at the end but spread out: you might detect a MIME type right from the start, yet still need more information from the end of the file to get the full picture, and that does not mean the entire file has to be scanned into a buffer. Essentially, this library should behave like a hashing function, which keeps a fixed amount of state and at the end produces a hash of all the data, without ever loading the entire stream into memory.

So the internal implementation is beside the point; you know better what needs to be done there. What matters is adding io.Reader- and io.Writer-based MIME detectors to support streams and getting rid of the fixed max-read limit.

Essentially, the io.Reader/io.Writer variants should be the primary methods and Detect([]byte) should only wrap them, the same way encoders such as encoding/json do internally. So, exactly the opposite of the current design.
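
For concreteness, a purely hypothetical sketch of what a Write-based detector could look like; nothing like this exists in the library, and the package, type, and method names are invented. It keeps only a bounded prefix of the stream, similar to how a hash.Hash keeps fixed internal state, and runs detection on demand.

package streamdetect

import "github.com/gabriel-vasile/mimetype"

// Detector buffers at most max bytes of a stream and can report the MIME
// type detected from that prefix.
type Detector struct {
	buf []byte
	max int
}

func New(max int) *Detector { return &Detector{max: max} }

// Write implements io.Writer, so the detector can sit behind an
// io.MultiWriter while the stream is copied to its real destination.
func (d *Detector) Write(p []byte) (int, error) {
	if room := d.max - len(d.buf); room > 0 {
		if len(p) < room {
			room = len(p)
		}
		d.buf = append(d.buf, p[:room]...)
	}
	return len(p), nil // never block the stream; bytes past the cap are ignored
}

// MIME runs detection on whatever has been buffered so far.
func (d *Detector) MIME() *mimetype.MIME { return mimetype.Detect(d.buf) }

Used as one target of io.MultiWriter, something like this would give streaming callers a MIME type without ever holding more than max bytes in memory.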
