New Reader classes, v3 #250

kwhopper · 2020-05-03T18:07:11Z

RandomAccessStream + ReaderInfo

This PR is the third attempt at simplifying and enhancing the reading capabilities of MetadataExtractor. Very few code changes were made to the initial classes. The two main goals are to show how few changes are needed to bring the core classes into the library, and provide some Test cases.

The actual operation of these classes was implemented in #130 but has been omitted here. I want it to be easier for you to review.

In this PR:

Add RandomAccessStream and ReaderInfo classes to project
Minor input change to BufferBoundsException
Some initial test cases for Ras and ReaderInfo

The main advantages:

All sequential, byte, and indexed readers have been replaced by a single RandomAccessStream that handles all core reading operations. It uses a capturing-like structure to buffer data as requested.
The ReaderInfo class is the reading layer on top of an underlying RandomAccessStream. It helps simplify calls into RAS and can support Sequential and Indexed operations. Objects know their own global position and describe their own length, byte order, and local position purely by index; few to no bytes are buffered out of the underlying RAS.
Instances of ReaderInfo can be cloned at any time against the same RandomAccessStream. This is crucial to maintaining a flow down through the various reading classes without creating multiple copies of byte arrays. Any point in the current ReaderInfo can become the starting point of a new clone.
Virtually all "Get" operations were copied from existing methods in the other reading classes, so they have largely been tested in the current live code. Some modifications were needed to support Indexed and Sequential in the same class.
Seekable (e.g. FileStream) and Non-seekable (e.g. HttpClient) streams are supported.

closes #130

@drewnoakes The area in this PR can be used for further discussion - including whether I should continue to pursue this. I still believe in the general idea and this is a stable first step. Many other things can be added using nothing but the two main classes in this PR, including things like Span under the hood, easily writing out thumbnail byte arrays to disk, a basic catalog of the Exif data byte locations (needed for writing Exif), and selectively buffering large byte arrays to disk (like ICC data) through wrapping with another RandomAccessStream.

To reiterate: most of the code here was taken from existing reader code -- it has simply been consolidated into a couple of classes to support a new way of maintaining position.

drewnoakes

Thanks for this PR. It's much easier to see how this might work.

Have given it a quick look through and added my first wave of questions.

MetadataExtractor/IO/RandomAccessStream.cs

drewnoakes · 2020-05-04T05:18:27Z

MetadataExtractor/IO/RandomAccessStream.cs

+        /// <param name="isSequential">flag indicating if caller is using sequential access</param>
+        /// <exception cref="BufferBoundsException"/>
+        /// <exception cref="IOException"/>
+        public ushort GetUInt16(long index, bool IsMotorolaByteOrder, bool isSequential)


Would be good to move most (if not all) of these Get* methods to extensions.

Casing of IsMotorolaByteOrder needs tweaking (ditto elsewhere).

MetadataExtractor/IO/RandomAccessStream.cs

drewnoakes · 2020-05-04T05:25:30Z

MetadataExtractor/IO/ReaderInfo.cs

+{
+    /// <author>Kevin Mott https://github.com/kwhopper</author>
+    /// <author>Drew Noakes https://drewnoakes.com</author>
+    public class ReaderInfo


Can we give this a better name than ReaderInfo. To me, the current name implies it's some immutable bit of data rather than a stateful actionable object.

Should this type be internal?

It could potentially be internal if the bits we need from it (like global position) are recorded with metadata items. Then the instances can disappear in GC.

This was one of the early class names and I never changed it. Maybe something like "RegionReader" or "DataRegion", etc. I'm open to whatever.

I'm also coming up short for name suggestions. Will have a think. If it's internal then it's easier to change later as no one will have taken a dependency on the name.

drewnoakes · 2020-05-04T05:27:03Z

MetadataExtractor/IO/ReaderInfo.cs

+        /// Skips forward or backward in the sequence. If the sequence ends, an <see cref="IOException"/> is thrown.
+        /// </remarks>
+        /// <param name="offset">the number of bytes to seek, in either direction.</param>
+        /// <param name="isSequential">optional for testing to indicate whether sequential access was used.</param>


I'm unclear on why isSequential is needed for these Skip methods.

SequentialReader classes throw an IOException when too many bytes are requested. IndexedReader classes throw a BufferBoundsException. It was added very recently as a flag to help throw different exceptions to match your original readers, and make Test methods match better.

It isn't necessary at all if you want it to throw a common exception.

Thanks. I think throwing BufferBoundsException everywhere would be fine. It derives from IOException anyway, and includes more information. This parameter seems out of place somehow, and it gets passed around a lot which will have some perf impact.

It looks like it's controlling some logic in ReadAtGlobal though.

I'm a bit unclear on how ReaderInfo functions in both sequential and indexed access. Can you switch between them for the same instance at leisure? What is the rationale for having a single type versus two different types?

A ReaderInfo class can be read indexed or sequentially at the same time. You are free to make indexed calls at any time and only sequential calls will advance LocalPosition.

The reasons for this design are: a) both types access data from a pre-captured byte array source so in essence they are the 'same' anyway; b) it seemed unnecessary to re-implement all the Get* methods simply to separate the method of reading; and c) it removes ambiguity down the parsing chain so you don't have to inspect the parent ReaderInfo of a Clone and compare its original parsing method.

In the last PR where these classes were actually implemented and working throughout the library, ReaderInfo's are cloned constantly (to achieve the benefit of fewer byte array copies) and passed into other reading/parsing classes outside the parent's control. It's not possible to know down the line which access method a parser will want, and the method can actually change many times depending on the metadata segments. So, it's a better design to have only one type of reader and avoid an extra inspection or clone type change. It also allows for the rare case when indexed and sequential are desirable in a parse.

MetadataExtractor/IO/RandomAccessStream.cs

kwhopper · 2020-05-06T15:07:36Z

Attempted to address some of your comments in the last commit. Let me know what you think.

drewnoakes · 2020-05-07T10:09:42Z

MetadataExtractor/IO/RandomAccessStream.cs

+        /// <param name="validateIndex">allows for skipping validation if already done by the caller</param>
+        /// <exception cref="BufferBoundsException"/>
+        /// <exception cref="IOException"/>
+        private byte GetByte(long index, bool isSequential, bool validateIndex)


Given how often we will call this method I think you could make this method GetByteNoValidation and move the validation back to GetByte, then remove the validateIndex parameter.

Implemented, with some other changes

drewnoakes · 2020-05-07T10:12:09Z

MetadataExtractor/IO/RandomAccessStream.cs

-                // Motorola - MSB first (big endian)
+                // Motorola - MSB first


Why remove "big endian" here (and "little endian" below)?

Oh I see we don't have it elsewhere.

MetadataExtractor/IO/RandomAccessStream.cs

drewnoakes · 2020-05-07T10:20:49Z

MetadataExtractor/IO/RandomAccessStream.cs

+                var fromChunkIndex = fromOffset / p_chunkLength;     // chunk integer key
+                var fromInnerIndex = fromOffset % p_chunkLength;     // index inside the chunk to start reading


In the case where we read from multiple chunks, can we avoid the div/mod for all but the first iteration? Would it work if these were initialised before the loop, then at the end of the loop incremented and set to zero respectively?

I think fromOffset would then not need to be updated either.

Also could remaining be removed and the loop condition be toIndex < count?

drewnoakes · 2020-05-07T11:37:17Z

MetadataExtractor/IO/RandomAccessStream.cs

+            var chunkIndex = index / p_chunkLength;
+            var innerIndex = index % p_chunkLength;
+
+            if (p_chunks.ContainsKey(chunkIndex))
+                return p_chunks[chunkIndex][innerIndex];


I am a little concerned that these operations (div, mod, dictionary lookup, array lookup), when repeated for every byte consumed, will degrade performance. Rather than speculate, and while we have both the new/old implementations on the same branch, can we add some benchmarks that compare before/after?

I would recommend using BenchmarkDotNet for this. It's a really amazing library. I can help set that up if you'd like.

For scenarios, I'd look at sequential non-indexed access, sequential indexed access, random access.

This loop is an almost verbatim copy of similar methods in IndexedCapturingReader. The only difference really is they get called in all cases instead of being implicitly requested by the caller (through using that reader class).

I can try to setup some benchmarks if you want. I'll probably need some guidance on BenchmarkDotNet if that's recommended.

drewnoakes · 2020-05-07T11:37:54Z

MetadataExtractor/IO/RandomAccessStream.cs

+            if (p_chunks.ContainsKey(chunkIndex))
+                return p_chunks[chunkIndex][innerIndex];
+            else
+                return unchecked((byte)-1);


This will be 255 which is indistinguishable by the caller from a real value. Should this just throw?

There were some specific metadata cases where GetByte was expected to return this value... but now I can't remember how that went. This one might have to be tested against the metadata repo again to figure out -- but that's down the line.

It might have had something to do with replicating the Stream.ReadByte() return value and behavior. When it can't read any more data, it returns -1. That isn't an error condition to the Stream, but SequentialStreamReader throws an IOException on that value. I could maybe try to throw the same exception and see if it works.

I went ahead and changed it to throw for now. Will need to see how it works in more extensive testing.

I think GetByte used to return int, which covers the domain of byte (which is unsigned) while allowing the special -1 value to indicate something else. Casting -1 to byte here pushed the value into the range of byte which means the caller can no longer tell that it was meant to be special. If this is going to return byte, then the exception is that special value.

drewnoakes · 2020-05-07T11:40:27Z

MetadataExtractor/IO/RandomAccessStream.cs

+
+        private const int DefaultChunkLength = 4 * 1024;
+        private readonly int p_chunkLength;
+        public Dictionary<long, byte[]> p_chunks = new Dictionary<long, byte[]>();


Why Dictionary<,> and not List<>?

Because chunks aren't necessarily read and stored in a linear sequence. There could be gaps so a Dictionary keyed on the chunk starting position is ideal. If the stream is seekable, note that BytesAvailable will first seek the stream to the desired 'chunkable' position and only read that chunk into the Dictionary; the chunk index becomes the key. This optimization ignores caching all the bytes in between, and is the main reason this implementation is as good or better than the current one on a seekable stream.

For example, suppose you have a FileStream (which is seekable) and the first chunk of the file has everything you need at the moment to start parsing. The Dictionary will have a key of "0" with a byte array the size of the 0-th chunk of the file. Maybe it tells you that the next IFD is 300,000 bytes further into the file. BytesAvailable will Seek to the closest chunk that contains this new position and read that one. Now there will be only two entries in the Dictionary - "0" and "73" if the chunk length is 4096 (assuming my math is right).

So, this avoids caching ~300,000 bytes of unneeded data in the Dictionary. The only way to know whether this chunk is already there is to key it by chunk index relative to the entire stream -- and we're then back to why a keyed structure is better. We might eventually read some or all of those skipped bytes, but at this particular point they aren't needed.

This structure also works just fine if you're reading the whole thing linearly, such as with a NetworkStream. Non-seekable streams will simply read all intervening chunks into the Dictionary instead.

Thanks. Using your example, indexes 1 through 72 in a List<byte[]> could be null which would take up a trivial amount of space. This would then mean that lookups in the list would be faster (simple pointer arithmetic) than a dictionary lookup.

This kind of thing is why I think it's time to get some benchmarks set up. Performance is a very important feature of this library, particularly in the IO area.

BenchmarkDotNet is an amazing tool and very easy to use. If you haven't tried it before, I think you'll really enjoy it. This page has some copy/paste-able code to start from:

https://benchmarkdotnet.org/articles/guides/getting-started.html

I'd also add [MemoryDiagnoser] to the benchmark class, which will cause memory usage/GC stats to be included in the output.

It's probably best to create a new project in the solution for these benchmarks. I am happy to help with this.

kwhopper · 2020-05-09T13:22:38Z

More review changes pushed. I'll comment on the others, and we'll see if they also need to be addressed

kwhopper · 2020-05-23T21:43:55Z

@drewnoakes - I added a List<byte[]> implementation of RAS. Please let me know what you think. To keep some code around, I also made an interface and a copy of the Dictionary version.

You're probably correct that a List will perform better overall and is a little easier to tune and control. But constructing tests for Dictionary vs. List by themselves isn't straightforward or very valuable outside of the (various) ways this library reads data. If this PR makes it into the library, I'm happy to do tests later using the full MetadataExtractor library code which will put the usage in more proper context.

drewnoakes · 2020-05-24T09:19:43Z

Thanks @kwhopper.

In order to merge this I would want to see those benchmarks. They could be removed prior to merging (though kept in the history for posterity).

If you don't have the time to add such benchmarks then I will endeavour to do so but am not sure when that will be. I don't think they'll take long to add. Exploring ideas generated by their results is where the time would likely go.

kwhopper · 2020-05-24T14:56:34Z

I got hung up on the best way to approach it without using the library itself. What's a good way to construct the benchmarks? Maybe random reads against a large file - a few megabytes in size?

It could be deterministic reads also, like defined sections of the file in both scenarios.

kwhopper · 2020-05-24T15:59:58Z

Here's a very basic run with deterministic offsets:

RASBenchmark.zip

Results indicate the List version is ~35-40% "faster":

BenchmarkDotNet=v0.10.12, OS=Windows 10.0.18363
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical cores and 6 physical cores
  [Host]     : .NET Framework 4.6.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4180.0
  DefaultJob : .NET Framework 4.6.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.4180.0


                            Method |     Mean |     Error |    StdDev | Scaled |
---------------------------------- |---------:|----------:|----------:|-------:|
            RASDictionaryBenchmark | 26.80 us | 0.1253 us | 0.1172 us |   1.00 |
                  RASListBenchmark | 16.41 us | 0.1079 us | 0.1010 us |   0.61 |
 RASDictionaryNonseekableBenchmark | 27.95 us | 0.1190 us | 0.1113 us |   1.04 |
       RASListNonseekableBenchmark | 17.26 us | 0.0910 us | 0.0851 us |   0.64 |

Are there settings to get other data, like memory usage, GC, etc? Thanks

kwhopper · 2020-05-25T02:52:56Z

Pushed a few code changes to address some benchmark differences. A few may be reverted but they are at least documented in a commit.

kwhopper · 2020-05-25T03:05:21Z

I updated BenchmarkDotNet to 0.12.1 in the next tests. Here's a new benchmark file that includes the List version of RAS, and similar calls for IndexedCapturingReader and IndexedSeekingReader:

RASBenchmark.zip

With those files and the last commit, here are the benchmarks:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT


|                          Method |     Mean |   Error |  StdDev | Ratio |
|-------------------------------- |---------:|--------:|--------:|------:|
|                RASListBenchmark | 376.3 us | 1.56 us | 1.38 us |  1.00 |
| IndexedCapturingReaderBenchmark | 202.1 us | 0.88 us | 0.78 us |  0.54 |
|   IndexedSeekingReaderBenchmark | 240.0 us | 1.33 us | 1.18 us |  0.64 |

The time differences are negligible at this scale and due almost entirely to the use of "long" for index params to some RandomAccessStream methods. For some reason, the 64-bit calculations can be relatively slow. You can view the 'hack' parts in the commit where it's listed as a "micro-optimization" for benchmarks. If I remove those, here are the numbers:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT


|                          Method |     Mean |   Error |  StdDev | Ratio |
|-------------------------------- |---------:|--------:|--------:|------:|
|                RASListBenchmark | 689.4 us | 3.75 us | 3.32 us |  1.00 |
| IndexedCapturingReaderBenchmark | 202.4 us | 1.03 us | 0.91 us |  0.29 |
|   IndexedSeekingReaderBenchmark | 251.2 us | 0.62 us | 0.48 us |  0.36 |

We're still talking micro seconds, so even this result isn't terrible.

Using "int" index params instead of "long" explains virtually all of the difference. I did some earlier partial tests where those are all bypassed and RAS performed even better on time against IndexedCapturingReader. So, we can either support files larger than an int32 in size and take a small hit (on some architectures), or change it back to int's in RAS classes and probably close the time gap.

I think some users have wanted long support in files so I put it in there from the start.

drewnoakes · 2020-05-25T03:21:48Z

This looks great.

I'll have a play around with the benchmarks later today hopefully. One thing I'll try (in case you want to try it first) is to move the construction of the RAS and ReaderInfo out of the [Benchmark] methods (they can be fields which are reused across runs) and add [MemoryDiagnoser] to the benchmark.

I'll probably also generate random bytes and try reading them all sequentially as various data types, comparing both RAS and the existing sequential readers, to add more sequential coverage.

I don't know what's best around int/long indexes. We should probably support long.

Feel free to push the benchmark code to this branch.

kwhopper · 2020-05-25T04:13:47Z

Pushed and added some of what you mentioned. The memory allocations are intriguing but not sure what to make of them without more testing:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT


|                          Method |     Mean |   Error |  StdDev | Ratio |   Gen 0 |  Gen 1 | Gen 2 | Allocated |
|-------------------------------- |---------:|--------:|--------:|------:|--------:|-------:|------:|----------:|
|                RASListBenchmark | 685.5 us | 2.06 us | 1.92 us |  1.00 |       - |      - |     - |   2.98 KB |
| IndexedCapturingReaderBenchmark | 216.0 us | 0.73 us | 0.68 us |  0.32 | 10.2539 | 1.9531 |     - |  63.91 KB |
|   IndexedSeekingReaderBenchmark | 238.2 us | 0.71 us | 0.63 us |  0.35 |  0.4883 |      - |     - |   3.03 KB |

kwhopper · 2020-05-25T04:22:33Z

Here's the memory when RAS is always instantiated during a benchmark run:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  DefaultJob : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT


|                          Method |     Mean |   Error |  StdDev | Ratio |   Gen 0 |  Gen 1 | Gen 2 | Allocated |
|-------------------------------- |---------:|--------:|--------:|------:|--------:|-------:|------:|----------:|
|                RASListBenchmark | 689.2 us | 3.54 us | 3.31 us |  1.00 |  7.8125 | 0.9766 |     - |  51.85 KB |
| IndexedCapturingReaderBenchmark | 201.4 us | 1.10 us | 1.03 us |  0.29 | 10.2539 | 1.9531 |     - |  63.91 KB |
|   IndexedSeekingReaderBenchmark | 260.7 us | 0.74 us | 0.62 us |  0.38 |  0.4883 |      - |     - |   3.03 KB |

kwhopper · 2020-05-25T04:37:10Z

To sum up a little as you look at this...

We know that byte arrays will always be read and then sliced and diced as metadata is parsed. The goal here is to read these arrays in reasonable chunks from the source but assign a pointer (a ReaderInfo instance) that specifies location and size and use that until actual bytes are really needed. This instance can be used as-is to read bytes, or cloned into another pointer instance when a subsection is desired in another parsing class.

The behavior and output you see so far, then, is generally expected and part of the design to reuse existing byte arrays as much as possible. It's a single reader class as well that avoids almost all the byte copying done in the current implementation down the chain. IndexedSeekingReader doesn't allocate a lot of memory in this test but I don't think that's very representative of reality; most of the time many chunks of a file will be read during parsing and RAS expects that from the start.

drewnoakes · 2020-05-25T23:23:54Z

Thanks @kwhopper. I agree with your points and can mostly see the bigger picture when it comes to memory copying. Without implementing parallel versions of the library in terms of old/new it won't be easy to see a more realistic set of values. Still it's useful to ensure there are no major issues, and track changes over time.

I will pull this branch down and play with it to gain a better understanding.

kwhopper · 2020-05-26T00:51:58Z

Thanks @drewnoakes. I've hounded you plenty the last few years so it's enough that you're still open to the idea. Parallel libraries isn't out of the question entirely to benchmark. It's certainly not an overnight proposition, but I already did it once to demo the first attempt.

There are many potential advantages of a separate pointer object for reading that always knows its own local and global positions. Here are a few that drove this idea in the first place... and my persistence in lobbying for it:

ReaderInfo's made from an RAS "CreateReader" call could easily be tracked in that RAS instance. This gives you the structure of what was parsed and provides the most basic thing the library always needed to have any hope of writing metadata someday -- some map of the metadata bytes and a way to easily tell what was read and what was skipped.
The ReaderInfo class could be given "type" properties for special cases. For example, maybe it's a configuration setting in the future to not read (and buffer) embedded thumbnail bytes directly but only return the ReaderInfo pointer. The user (or even a special method in the code) could use that pointer to go back later and save the thumbnail bytes to disk.
NonSeekable streams work fine, although for now it will buffer the entire thing in memory. In the future, the bytes could be buffered to disk in some RAS implementation without changing ReaderInfo, and then the other things become possible and more efficient (saving a thumbnail from above, etc)
Large metadata objects like XML and compressed ICC are buffered into byte arrays twice in many cases. Similar to the second item above, small changes to RAS/ReaderInfo could make handling them configurable. Maybe it can say don't keep the parsed bytes in the buffer permanently and write directly to disk or just give me back the ReaderInfo pointer of these bytes and my custom code will read the bytes later.
A new RAS instance can be created at anytime, even during a parse. This could help with things like side-car XML files, or buffer larger parses into a local FileStream at some maximum List<byte[]> size and keep only the new RAS object (all handled internally).
If we get to a context object, "DefaultChunkLength" is easily override-able by the user
"WithShiftedBaseOffset" becomes obsolete as ReaderInfo always knows its global position. It felt like a moderate hack anyway to fix the loss of a global position due to secondary byte array usage.

I'm sure there are other things, but that's enough for now. Thanks

kwhopper · 2020-08-20T16:58:24Z

Without implementing parallel versions of the library in terms of old/new it won't be easy to see a more realistic set of values.

I'm happy to do this if it will help -- implement a test of the entire library using these classes. Most of the work was already done in the first version. I'll need to do fixes for any new additions or changes to the core library.

drewnoakes · 2020-08-21T01:25:11Z

I'm happy to do this if it will help

The issue from my side is free time. Where I live we have a total lock down which means I have to spend a large chunk of each day looking after my child while also trying to perform my job, and I am feeling burnt out. I have only occasional bandwidth for small simple tasks like small PR reviews and fixing simple bugs. This PR requires several hours of deep thought. Integrating it across the whole library, while a generous offer, will increase the time requirement.

I haven't forgotten about this, and I am hopeful we can get this in without many changes. I just need to take the time to think through the implications, and that requires finding the time. Once done we can plan out integrating the changes across the library.

drewnoakes · 2020-11-08T22:18:32Z

Thanks for fixing the merge conflicts @kwhopper.

- Minor input change to BufferBoundsException - Some initial test cases for Ras and ReaderInfo

…y done by the caller - use GetByte instead of Read for Get* methods (similar to old readers) - add the constant RandomAccessStream.UnknownLengthValue - mark NonSeekableStream class as internal (for testing only) - document the RandomAccessStream.Length property - mark RandomAccessStream.ValidateIndex method as internal - document the return value of the RandomAccessStream.ValidateIndex method

…y report a Length - Throw BufferBoundsException instead of IOException in ValidateIndex - Remove "IsSequential" inputs to most RandomAccessStream classes - fix isMotorolaByteOrder casing in RandomAccessStream

- rename overload of GetByte to GetByteNoValidation and remove validateIndex param

…rtial is true - remove allowPartial param from ValidateRange

…t in the Dictionary instead of returning -1

…k buffer - add IRandomAccessStream interface - keep dictionary version of RAS class

- remove nullable designator from RandomAccessStream.p_inputStream (optimization) - convert long index values to int for benchmarks

- leave support for long Stream indexes in RandomAccessStream (and take the mild read hit) - add RASBenchmark class

drewnoakes reviewed May 4, 2020

View reviewed changes

drewnoakes reviewed May 7, 2020

View reviewed changes

MetadataExtractor/IO/RandomAccessStream.cs Outdated Show resolved Hide resolved

drewnoakes reviewed May 7, 2020

View reviewed changes

drewnoakes mentioned this pull request May 14, 2020

Add async support #256

Open

kwhopper mentioned this pull request Apr 2, 2021

Huge allocations #283

Open

kwhopper mentioned this pull request May 5, 2021

Avoid reading unused bytes from PNG streams drewnoakes/metadata-extractor#535

Merged

kwhopper added 2 commits June 13, 2021 15:18

- Add RandomAccessStream and ReaderInfo

a3c2e86

- Minor input change to BufferBoundsException - Some initial test cases for Ras and ReaderInfo

kwhopper added 8 commits June 13, 2021 15:18

- If Stream.CanSeek is true, make sure it is also able to successfull…

a5281d5

…y report a Length - Throw BufferBoundsException instead of IOException in ValidateIndex - Remove "IsSequential" inputs to most RandomAccessStream classes - fix isMotorolaByteOrder casing in RandomAccessStream

- perform ValidateIndex call directly in GetByte

b547216

- rename overload of GetByte to GetByteNoValidation and remove validateIndex param

Rename ValidateIndex to ValidateRange

28daa1a

- skip validation and directly call BytesAvailable in Read if allowPa…

c16d02e

…rtial is true - remove allowPartial param from ValidateRange

NEEDS TESTING - throw an IOException when the chunkIndex doesn't exis…

5a16c6a

…t in the Dictionary instead of returning -1

- use a List<byte[]> instead of Dictionary<long, byte[]> for RAS chun…

3a00e07

…k buffer - add IRandomAccessStream interface - keep dictionary version of RAS class

- change ValidateRange to return int instead of long

f95f556

- remove nullable designator from RandomAccessStream.p_inputStream (optimization) - convert long index values to int for benchmarks

- BenchmarkDotNet 0.10.12 -> 0.12.1

03bcbc7

- leave support for long Stream indexes in RandomAccessStream (and take the mild read hit) - add RASBenchmark class

kwhopper force-pushed the feature/new-reader-v3 branch from f757004 to 03bcbc7 Compare June 13, 2021 20:36

drewnoakes mentioned this pull request May 8, 2023

Fix Exif thumbnail offset #333

Merged

		var fromChunkIndex = fromOffset / p_chunkLength; // chunk integer key
		var fromInnerIndex = fromOffset % p_chunkLength; // index inside the chunk to start reading

New Reader classes, v3 #250

Are you sure you want to change the base?

New Reader classes, v3 #250

Conversation

kwhopper commented May 3, 2020 • edited

drewnoakes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

drewnoakes May 7, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kwhopper commented May 6, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

drewnoakes May 7, 2020 • edited

Choose a reason for hiding this comment

drewnoakes May 7, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kwhopper May 9, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kwhopper commented May 9, 2020

kwhopper commented May 23, 2020

drewnoakes commented May 24, 2020

kwhopper commented May 24, 2020

kwhopper commented May 24, 2020

kwhopper commented May 25, 2020

kwhopper commented May 25, 2020

drewnoakes commented May 25, 2020

kwhopper commented May 25, 2020

kwhopper commented May 25, 2020

kwhopper commented May 25, 2020 • edited

drewnoakes commented May 25, 2020

kwhopper commented May 26, 2020

kwhopper commented Aug 20, 2020

drewnoakes commented Aug 21, 2020

drewnoakes commented Nov 8, 2020

kwhopper commented May 3, 2020 •

edited

drewnoakes May 7, 2020 •

edited

drewnoakes May 7, 2020 •

edited

drewnoakes May 7, 2020 •

edited

kwhopper May 9, 2020 •

edited

kwhopper commented May 25, 2020 •

edited