Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symbolize literal integer constants #1

Open
jreiser opened this issue Jun 7, 2022 · 1 comment
Open

Symbolize literal integer constants #1

jreiser opened this issue Jun 7, 2022 · 1 comment

Comments

@jreiser
Copy link

jreiser commented Jun 7, 2022

While the small size is remarkable, there is much work to be done before some other project might be willing to use or adapt this code.

Looking at the current HEAD 9fd3a26 at Tue Jun 7 10:14:56 2022 +0700, in file lzmadec.x86_64.asm there are many many bare literal integer constants that do not have symbolic names. In contrast, the reference implementation by Igor Pavlov in LZMA SDK 4.40 and successors uses dozens of symbolic names with designated inter-relationships. This makes it hard to compare the two versions. Also, the micro-lzmadec code lacks documentation of strategy, explanation of coding tricks, and comments in general. Which specializations of parameter values has micro-lzmadec assumed? What relationships do the numeric constants in micro-lzmadec have to each other? If it becomes necessary or desirable to change one value, then how are the others affected?

@ilyakurdyukov
Copy link
Owner

Oh no, I didn't plan to write a book explaining all the tricks. Because for me it would take more time than writing this code. You can ask me to explain particular places.

The most common trick is to use cdq to clear edx when I know that eax is non-negative (after calling rc_bit() for example). The "state" is stored on top of the stack to pop/push it using one byte instruction.

For x86_64, I avoid frequent use of the r8..r15 registers, because using them wastes an extra byte on the REX prefix. The same with using 64-bit registers.

Which specializations of parameter values has micro-lzmadec assumed?

I didn't understand the question, what are the "parameter values"?

uses dozens of symbolic names with designated inter-relationships

The offsets in the probability model tables have been rearranged (to save some bytes in the code) as follows:

Name: offset, size

original:

IsMatch: 0, 192
IsRep: 192, 12
IsRepG0: 204, 12
IsRepG1: 216, 12
IsRepG2: 228, 12
IsRep0Long: 240, 192
PosSlot: 432, 256
SpecPos: 688, 114
Align: 802, 16
LenCoder: 818, 514
RepLenCoder: 1332, 514
Literal: 1846, 0

micro-lzmadec:

IsMatch: 0, 192
IsRep0Long: 192, 192
IsRep: 384, 12
IsRepG0: 396, 12
IsRepG1: 408, 12
IsRepG2: 420, 12
PosSlot: 432, 256 + 16 (padding)
SpecPos: 704, 114
Align: 818, 16 + 30 (padding)
LenCoder: 864, 514 + 8 (padding)
RepLenCoder: 1386, 514 + 148 (padding)
Literal: 2048, 0

If it becomes necessary or desirable to change one value, then how are the others affected?

Just follow the code, it's not that bad if you already know LZMA decompression algorithm. And there are many hints around.

Many of these offsets in the table are bound to each other, but I don't see why anyone would change them unless that person understands the code and has found a way to make it even smaller.

So, here 192/2 is used to make three constants:
IsRep0Long = 192/2 * 2, IsRep = 192/2 * 4, LenCoder = 192/2 * 9
Also the RepLenCoder offset is made using a one byte constant multiplied by 9 (154 * 9 = 1386), same multiplier as LenCoder.

	cdq
...
	mov	dl, 192/2
	lea	ebx, [rsi+rdx*2]	; IsRep0Long, 192
	lea	esi, [rax+rdx*4]	; IsRep, 384 + state
...
	jmp	_case_len
...
	mov	dl, 154		; 154*9 = 1332+54
...
_case_len:
	lea	esi, [rdx*8+rdx]

I think that if necessary, this code can be made a hundred bytes more, but much faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants