Symbolize literal integer constants #1

jreiser · 2022-06-07T20:01:52Z

While the small size is remarkable, there is much work to be done before some other project might be willing to use or adapt this code.

Looking at the current HEAD 9fd3a26 at Tue Jun 7 10:14:56 2022 +0700, in file lzmadec.x86_64.asm there are many many bare literal integer constants that do not have symbolic names. In contrast, the reference implementation by Igor Pavlov in LZMA SDK 4.40 and successors uses dozens of symbolic names with designated inter-relationships. This makes it hard to compare the two versions. Also, the micro-lzmadec code lacks documentation of strategy, explanation of coding tricks, and comments in general. Which specializations of parameter values has micro-lzmadec assumed? What relationships do the numeric constants in micro-lzmadec have to each other? If it becomes necessary or desirable to change one value, then how are the others affected?

The text was updated successfully, but these errors were encountered:

ilyakurdyukov · 2022-06-08T03:44:40Z

Oh no, I didn't plan to write a book explaining all the tricks. Because for me it would take more time than writing this code. You can ask me to explain particular places.

The most common trick is to use cdq to clear edx when I know that eax is non-negative (after calling rc_bit() for example). The "state" is stored on top of the stack to pop/push it using one byte instruction.

For x86_64, I avoid frequent use of the r8..r15 registers, because using them wastes an extra byte on the REX prefix. The same with using 64-bit registers.

Which specializations of parameter values has micro-lzmadec assumed?

I didn't understand the question, what are the "parameter values"?

uses dozens of symbolic names with designated inter-relationships

The offsets in the probability model tables have been rearranged (to save some bytes in the code) as follows:

Name: offset, size

original:

IsMatch: 0, 192
IsRep: 192, 12
IsRepG0: 204, 12
IsRepG1: 216, 12
IsRepG2: 228, 12
IsRep0Long: 240, 192
PosSlot: 432, 256
SpecPos: 688, 114
Align: 802, 16
LenCoder: 818, 514
RepLenCoder: 1332, 514
Literal: 1846, 0

micro-lzmadec:

IsMatch: 0, 192
IsRep0Long: 192, 192
IsRep: 384, 12
IsRepG0: 396, 12
IsRepG1: 408, 12
IsRepG2: 420, 12
PosSlot: 432, 256 + 16 (padding)
SpecPos: 704, 114
Align: 818, 16 + 30 (padding)
LenCoder: 864, 514 + 8 (padding)
RepLenCoder: 1386, 514 + 148 (padding)
Literal: 2048, 0

If it becomes necessary or desirable to change one value, then how are the others affected?

Just follow the code, it's not that bad if you already know LZMA decompression algorithm. And there are many hints around.

Many of these offsets in the table are bound to each other, but I don't see why anyone would change them unless that person understands the code and has found a way to make it even smaller.

So, here 192/2 is used to make three constants:
IsRep0Long = 192/2 * 2, IsRep = 192/2 * 4, LenCoder = 192/2 * 9
Also the RepLenCoder offset is made using a one byte constant multiplied by 9 (154 * 9 = 1386), same multiplier as LenCoder.

	cdq
...
	mov	dl, 192/2
	lea	ebx, [rsi+rdx*2]	; IsRep0Long, 192
	lea	esi, [rax+rdx*4]	; IsRep, 384 + state
...
	jmp	_case_len
...
	mov	dl, 154		; 154*9 = 1332+54
...
_case_len:
	lea	esi, [rdx*8+rdx]

I think that if necessary, this code can be made a hundred bytes more, but much faster.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Symbolize literal integer constants #1

Symbolize literal integer constants #1

jreiser commented Jun 7, 2022

ilyakurdyukov commented Jun 8, 2022

Symbolize literal integer constants #1

Symbolize literal integer constants #1

Comments

jreiser commented Jun 7, 2022

ilyakurdyukov commented Jun 8, 2022

original:

micro-lzmadec: