Add some format documentation.

unknownbrackets · Nov 1, 2014 · e6631c0
1 parent cc55a61

Showing 4 changed files with 143 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -22,7 +22,7 @@ Features
   * Processes multiple files in one command.
   * Can take a CSO or DAX file as a source.
   * Able to output at larger block sizes.
-  * Support for experimental cso formats using [lz4][] (faster decompression)
+  * Support for experimental [CSO v2][] and [ZSO][] formats using [lz4][] (faster decompression.)
   * Tuning of deflate or lz4 compression threshold.
 
 
@@ -97,7 +97,7 @@ Platforms
 
 maxcso has only been tested on Windows so far.  The code was written to be portable, however.
 If you'd like to port it to another platform, pull requests are accepted.  It may just compile
-out of the box with a Makefile or similar.
+out of the box with a Makefile or similar, but 7-zip is probably the biggest problem.
 
 ### Windows
 
@@ -116,6 +116,7 @@ libraries.  Licensing is as follows:
  * [Zopfli][] is licensed under Apache 2.0.
  * [libuv][] is licensed under MIT.
  * [zlib][] is licensed under zlib.
+ * [lz4][] is licensed under BSD.
 
 
 Other tools
@@ -136,4 +137,6 @@ Other tools
 [CisoMC]: http://wololo.net/talk/viewtopic.php?f=20&t=32659
 [ciso]: http://sourceforge.net/projects/ciso/
 [ciso-python]: http://virtuousflame.blog.163.com/blog/static/177177172201111833413485/
-[lz4]: https://code.google.com/p/lz4/
+[lz4]: https://code.google.com/p/lz4/
+[CSO v2]: README_CSO.md
+[ZSO]: README_ZSO.md
diff --git a/README_CSO.md b/README_CSO.md
@@ -0,0 +1,83 @@
+CSO format
+===========
+
+The original CSO format was created by BOOSTER.
+
+This document includes an experimental v2 format of CSO, proposed by Unknown W. Brackets.
+
+
+Overview
+===========
+
+A CSO file consists of a file header, index section, and data section.
+
+Typically, the file extension .cso is used.
+
+
+Format (version 1)
+===========
+
+The header is as follows (little endian):
+
+    char[4]  magic;             // Always "CISO".
+    uint32_t header_size;       // Does not always contain a reliable value.
+    uint64_t uncompressed_size; // Total size of original ISO.
+    uint32_t block_size;        // Size of each block, usually 2048.
+    uint8_t  version;           // May be 0 or 1.
+    uint8_t  index_shift;       // Indicates left shift of index values.
+    uint8_t  unused[2];         // May contain any values.
+
+Following that are index entries, which are each a uint32_t (little endian).  The number of
+index entries can be found by taking `ceil(uncompressed_size / block_size) + 1`.
+
+The lower 31 bits of each index entry, when shifted left by `index_shift`, indicate the
+position within the file of the block's compressed data.  The length of the block is the
+difference between this entry's offset and the following index entry's offset value.
+
+Note that this size may be larger than the compressed or uncompressed data, if `index_shift` is
+greater than 0.  The space between blocks may be padded with any byte, but NUL is recommended.
+
+Note also that this means index entries must be incrementing.  Reordering or deduplication of
+blocks is not supported.
+
+The high bit of the index entry indicates whether the block is uncompressed.
+
+When compressed, blocks are compressed using the raw [deflate][] algorithm, with window size
+being 15 (when using zlib, pass a `windowBits` of -15 to omit the zlib header.)
+
+The final index entry indicates the end of the data segment and normally EOF.
+
+
+Format (version 2)
+===========
+
+The header is more strictly defined:
+
+    char[4]  magic;             // Always "CISO".
+    uint32_t header_size;       // Must always be 0x18.
+    uint64_t uncompressed_size; // Total size of original ISO.
+    uint32_t block_size;        // Size of each block.
+    uint8_t  version;           // Must be 2.
+    uint8_t  index_shift;       // Indicates left shift of index values.
+    uint8_t  unused[2];         // Must be 0.
+
+The index data follows the same format as version 1, but the interpretation of the size and high
+bit is handled differently.
+
+In version 2, when the length of a compressed block (that is, the difference between two index
+entry offset values) is >= `block_size`, the block must not be compressed.
+
+Note again that when `index_shift` is greater than 0, the size may include additional padding.
+If the compressed size plus this padding would result in `block_size` or more bytes, the data
+must not be compressed (or decompressed.)  This won't result in any observed file size
+difference, because the padding would have been wasted bytes anyway.
+
+When the size of the compressed block is less than `block_size`, the data is always compressed.
+The high bit of the index entry indicates which compression method has been used.  When it is
+set, the data is compressed with [lz4][], otherwise it is compressed with [deflate][].
+
+The final index entry must not have the high bit set.
+
+
+[lz4]: https://code.google.com/p/lz4/
+[deflate]: https://www.ietf.org/rfc/rfc1951.txt
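Reviewer note (not part of the commit): the index rules that README_CSO.md describes for versions 1 and 2 can be sketched roughly as follows. This is only an illustration of the documented rules, not code from maxcso. The names `BlockKind`, `BlockInfo`, `IndexEntryCount`, and `DescribeBlock` are hypothetical, and the sketch assumes the index entries have already been read as host-order `uint32_t` values and are incrementing, as the format requires.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types; these names are illustrative and not from maxcso.
enum class BlockKind { Raw, Deflate, LZ4 };

struct BlockInfo {
    uint64_t offset;  // Position of the block's data within the file.
    uint64_t length;  // Stored length, which may include padding if index_shift > 0.
    BlockKind kind;   // How the block's data should be interpreted.
};

// Number of index entries: ceil(uncompressed_size / block_size) + 1.
inline uint64_t IndexEntryCount(uint64_t uncompressed_size, uint32_t block_size) {
    assert(block_size > 0);
    const uint64_t partial = (uncompressed_size % block_size != 0) ? 1 : 0;
    return uncompressed_size / block_size + partial + 1;
}

// Interprets one index entry together with the entry that follows it.
// Assumes next_entry's offset is not smaller than entry's offset.
inline BlockInfo DescribeBlock(uint32_t entry, uint32_t next_entry, uint8_t index_shift,
                               uint32_t block_size, uint8_t version) {
    const uint32_t kHighBit = 0x80000000u;
    BlockInfo info;
    // The lower 31 bits, shifted left by index_shift, give the file position.
    info.offset = static_cast<uint64_t>(entry & ~kHighBit) << index_shift;
    const uint64_t next = static_cast<uint64_t>(next_entry & ~kHighBit) << index_shift;
    info.length = next - info.offset;

    const bool high = (entry & kHighBit) != 0;
    if (version >= 2) {
        // v2: a stored length of block_size or more means the block is not compressed;
        // otherwise the high bit selects lz4 (set) or deflate (clear).
        if (info.length >= block_size) {
            info.kind = BlockKind::Raw;
        } else {
            info.kind = high ? BlockKind::LZ4 : BlockKind::Deflate;
        }
    } else {
        // v1: the high bit marks an uncompressed block.
        info.kind = high ? BlockKind::Raw : BlockKind::Deflate;
    }
    return info;
}
```

A reader would call `DescribeBlock(index[i], index[i + 1], ...)` for each block `i`, which is why the index holds one more entry than there are blocks.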
diff --git a/README_ZSO.md b/README_ZSO.md
@@ -0,0 +1,53 @@
+ZSO format
+===========
+
+Please note that this format is not final, and is experimental.
+
+This format has been proposed by [codestation][] in a patch to [procfw][].
+
+
+Overview
+===========
+
+The general format is the same as the CSO v1 format.  It consists of a file header, index
+section, and data section.
+
+Unlike the original CSO format, blocks are compressed using [lz4][] rather than [deflate][].
+Additionally, the magic bytes differ (ZISO), and the preferred extension is "zso".
+
+
+Format
+===========
+
+The header is as follows (little endian):
+
+    char[4]  magic;             // Always "ZISO".
+    uint32_t header_size;       // Always 0x18.
+    uint64_t uncompressed_size; // Total size of original ISO.
+    uint32_t block_size;        // Size of each block, usually 2048.
+    uint8_t  version;           // Always 1.
+    uint8_t  index_shift;       // Indicates left shift of index values.
+    uint8_t  unused[2];         // Always 0.
+
+Following that are index entries, which are each a uint32_t (little endian).  The number of
+index entries can be found by taking `ceil(uncompressed_size / block_size) + 1`.
+
+The lower 31 bits of each index entry, when shifted left by `index_shift`, indicate the
+position within the file of the block's compressed data.  The length of the block is the
+difference between this entry's offset and the following index entry's offset value.
+
+Note that this size may be larger than the compressed or uncompressed data, if `index_shift` is
+greater than 0.  The space between blocks may be padded with any byte, but NUL is recommended.
+
+Note also that this means index entries must be incrementing.  Reordering or deduplication of
+blocks is not supported.
+
+The high bit of the index entry indicates whether the block is uncompressed.
+
+The final index entry indicates the end of the data segment and normally EOF.
+
+
+[codestation]: https://github.com/codestation
+[procfw]: https://code.google.com/p/procfw/
+[lz4]: https://code.google.com/p/lz4/
+[deflate]: https://www.ietf.org/rfc/rfc1951.txt
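Reviewer note (not part of the commit): the 24-byte little-endian header described in README_ZSO.md could be read along these lines. This is a minimal sketch under stated assumptions, not maxcso's parser. `ZsoHeader`, `ReadLE32`, `ReadLE64`, and `ParseZsoHeader` are hypothetical names, and the sketch only checks the magic bytes and a non-zero block size; a stricter reader might also reject a `header_size` other than 0x18 or a `version` other than 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical in-memory form of the documented header (the unused bytes are dropped).
struct ZsoHeader {
    uint32_t header_size;
    uint64_t uncompressed_size;
    uint32_t block_size;
    uint8_t version;
    uint8_t index_shift;
};

// Reads little-endian integers byte by byte, so host endianness does not matter.
inline uint32_t ReadLE32(const uint8_t *p) {
    return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

inline uint64_t ReadLE64(const uint8_t *p) {
    return static_cast<uint64_t>(ReadLE32(p)) |
           (static_cast<uint64_t>(ReadLE32(p + 4)) << 32);
}

// Returns false if the buffer is too short, the magic is not "ZISO",
// or block_size is zero (which would make the index count undefined).
inline bool ParseZsoHeader(const uint8_t *data, size_t size, ZsoHeader &out) {
    if (size < 0x18) {
        return false;
    }
    if (std::memcmp(data, "ZISO", 4) != 0) {
        return false;
    }
    out.header_size = ReadLE32(data + 4);
    out.uncompressed_size = ReadLE64(data + 8);
    out.block_size = ReadLE32(data + 16);
    out.version = data[20];
    out.index_shift = data[21];
    // Bytes 22 and 23 are the unused field and are ignored here.
    return out.block_size != 0;
}
```

The index entries begin immediately after these 24 bytes, so a reader would continue from offset 0x18 once this succeeds.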
diff --git a/src/input.cpp b/src/input.cpp
@@ -375,6 +375,7 @@ bool Input::DecompressSectorDeflate(uint8_t *dst, const uint8_t *src, unsigned i
 }
 
 bool Input::DecompressSectorLZ4(uint8_t *dst, const uint8_t *src, int dstSize, std::string &err) {
+	// Must use LZ4_decompress_fast (which takes the output size), because we don't know the exact size of the input data.  It could include padding.
 	if (LZ4_decompress_fast(reinterpret_cast<const char *>(src), reinterpret_cast<char *>(dst), dstSize) < 0) {
 		err = "LZ4 decompression failed.";
 		return false;