modules/dbe - lz4 encoder and decoder [DSLX code only] #1092

Open · @rdob-ant wants to merge 7 commits into main from 47181-lz4-dslx
Conversation

@rdob-ant (Contributor) commented Aug 3, 2023:

This is an implementation of an LZ4 encoder and decoder in DSLX, without the reference Python & cocotb test code that was present in #1079.

The source branch additionally includes changes from #1078 that are needed to run DSLX tests of procs using 1R1W single-partition RAM. These will be dropped by rebasing when we get that PR merged.

@tmichalak commented:
CC @proppy @hongted

std::convert_to_bools_lsb0(read_req.mask))
let init_parts = mem_initialized[read_req.addr];
let requested_is_init = for (idx, requested_is_init): (u32, bool) in
range(u32:0, NUM_PARTITIONS) {

Collaborator:

Add 1 level of indentation.

Contributor Author:

Hi @hongted. Sure, will do that. To keep things clean, I propose discussing the ram.x changes in the PR dedicated to that bugfix: #1078. I will address your comments directly there.

This change is present in the current PR only because the LZ4 DSLX tests depend on the RAM bugfix, and I'm not aware of a way to submit the LZ4 changes for review without making commits from a dependency PR visible here.

Comment on lines 253 to 262
let requested_is_init = for (idx, requested_is_init): (u32, bool) in
        range(u32:0, NUM_PARTITIONS) {
    if read_req.mask[idx+:bool] {
        if !init_parts[idx] {
            trace_fmt!("reading partitions {} but part={} is not init: {}",
                std::convert_to_bools_lsb0(read_req.mask), idx, init_parts);
        } else { () };
        requested_is_init && init_parts[idx]
    } else { requested_is_init }
}(true);

Collaborator:

This seems general enough to suggest a separate function and unit test.
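
For illustration, a minimal sketch of how the check could be factored out (the function and test names are hypothetical; the mask[idx+:bool] slicing follows the code above):

    // Returns true iff every partition selected by `mask` is initialized.
    pub fn all_requested_init<NUM_PARTITIONS: u32>(
            mask: bits[NUM_PARTITIONS], init_parts: bool[NUM_PARTITIONS]) -> bool {
        for (idx, acc): (u32, bool) in range(u32:0, NUM_PARTITIONS) {
            if mask[idx+:bool] { acc && init_parts[idx] } else { acc }
        }(true)
    }

    #[test]
    fn all_requested_init_test() {
        assert_eq(all_requested_init(u4:0b0101, bool[4]:[true, false, true, false]), true);
        assert_eq(all_requested_init(u4:0b0011, bool[4]:[true, false, true, false]), false);
    }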

Collaborator:

Also it seems that this can be simplified to just:

    let mem_initialized_as_bits = std::convert_to_bits_lsb0(mem_initialized[read_req.addr]);
    assert_eq(mem_initialized_as_bits, mem_initialized_as_bits | read_req.mask)

Contributor Author:

Fixed in #1078

@hongted (Collaborator) left a comment:

Can a round-trip test be added? I.e.:

  • Create a round-trip test proc that spawns an encoder and decoder back to back.
  • Give it some arbitrary plain text and assert that what goes in comes out.
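
The general shape of such a test might look like this (a self-contained sketch in current DSLX proc syntax; a trivial pass-through proc stands in for the encoder/decoder pair, whereas the real test would spawn encoder_base and decoder_base together with their RAM models):

    // Stand-in for the encoder/decoder pair: forwards symbols unchanged.
    proc passthrough {
        rx: chan<u8> in;
        tx: chan<u8> out;

        init { () }

        config(rx: chan<u8> in, tx: chan<u8> out) { (rx, tx) }

        next(state: ()) {
            let (tok, sym) = recv(join(), rx);
            send(tok, tx, sym);
        }
    }

    #[test_proc]
    proc round_trip_test {
        terminator: chan<bool> out;
        data_in: chan<u8> out;
        data_out: chan<u8> in;

        init { () }

        config(terminator: chan<bool> out) {
            let (in_s, in_r) = chan<u8>("enc_in");
            let (mid_s, mid_r) = chan<u8>("tokens");
            let (out_s, out_r) = chan<u8>("dec_out");
            // "Encoder" and "decoder" spawned back to back.
            spawn passthrough(in_r, mid_s);
            spawn passthrough(mid_r, out_s);
            (terminator, in_s, out_r)
        }

        next(state: ()) {
            // Feed arbitrary plain text and assert it comes back unchanged.
            let text = "abcabcab";
            let tok = for (i, tok): (u32, token) in range(u32:0, u32:8) {
                send(tok, data_in, text[i])
            }(join());
            let tok = for (i, tok): (u32, token) in range(u32:0, u32:8) {
                let (tok, sym) = recv(tok, data_out);
                assert_eq(sym, text[i]);
                tok
            }(tok);
            send(tok, terminator, true);
        }
    }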

name = "dbe_common_dslx",
srcs = [
"common.x",
"common_test.x"

Collaborator:

Move common_test into an xls_dslx_test rule.

Contributor Author:

Hm.. The constructs from common_test.x are imported in both lz4_decoder.x and lz4_encoder.x (since the test procs are defined in the same files where the real procs are implemented, which looks like a common pattern in DSLX). All imports must be resolvable at all times, not only when running DSLX tests. Without moving those tests to separate files, I don't think it's possible to factor out common_test.x and import it only when running the tests 🤔

Collaborator:

Personally I'd factor out the tests into their own DSLX file -- that keeps the interface cleaner.

Contributor Author:

Will do in the upcoming update.

Contributor Author:

I've factored out the tests & added a round-trip test.

@hongted (Collaborator) reviewed:

name = "dbe_lz4_decoder_opt_ir",
src = "dbe_lz4_decoder_ir.ir",
top = "__lz4_decoder__decoder__decoder_base_0__16_16_8_next",
)

Collaborator:

As the DSLX is using an abstract RAM model, I suggest adding the transforms to a real RAM model in both the opt_ir stage and then in codegen -- following https://github.com/google/xls/blob/main/xls/contrib/xlscc/examples/BUILD#L152

Contributor Author:

Sure, I will look into this.

Contributor Author:

@hongted Hm, I'm a bit confused here. I see that code targeting XLScc (which I hadn't looked into at all until today) indeed represents RAM as an abstract object with array-like semantics, which then has to be rewritten to map onto a real-world 1RW or 1R1W RAM core.

In contrast, my DSLX code, as well as existing DSLX example code in e.g. xls/examples/delay.x, is written with a specific RAM core in mind from the beginning - there is a protocol for a 1R1W core (4 channels: 2x output for rd & wr requests, 2x input for rd & wr responses) and a protocol for a 1RW core (2 channels: 1x output for a combined rd or wr request, 1x input for a combined rd or wr response).

If there is a way to write DSLX code that targets an 'abstract' RAM and then rewrite it at the opt_ir stage as in the XLScc case, do you know if there's example code that shows this, or at least some file with DSLX struct definitions for the abstract protocol?

Collaborator:

While the example is XLSCC, both XLSCC and DSLX are translated to a common IR.

The idea here is to target an abstract RAM model so that we can easily retarget the implementation to whatever RAM is available, whether it's 1RW, 1R1W, etc.
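
For reference, the abstract RAM protocol in xls/examples/ram.x is built around request/response structs along these lines (a sketch from memory, not a verbatim quote; the RAM-rewrite pass maps these onto a concrete RAM):

    pub struct ReadReq<ADDR_WIDTH: u32, NUM_PARTITIONS: u32> {
        addr: bits[ADDR_WIDTH],
        mask: bits[NUM_PARTITIONS],
    }

    pub struct ReadResp<DATA_WIDTH: u32> {
        data: bits[DATA_WIDTH],
    }

    pub struct WriteReq<ADDR_WIDTH: u32, DATA_WIDTH: u32, NUM_PARTITIONS: u32> {
        addr: bits[ADDR_WIDTH],
        data: bits[DATA_WIDTH],
        mask: bits[NUM_PARTITIONS],
    }

    pub struct WriteResp {}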

Contributor Author:

Forgot to reply: this was addressed; there's now a configuration for the IR optimizer to rewrite the RAM interface.

"reset": "rst",
"use_system_verilog": "false",
"streaming_channel_data_suffix": "_data",
"io_constraints": ",".join([

Collaborator:

Following the above comment, add an additional ram_config -- per https://github.com/google/xls/blob/main/xls/contrib/xlscc/examples/BUILD#L273

This makes switching between different RAM specifications easier.

Contributor Author:

From what I understood, I'll first have to modify the DSLX proc to use the abstract RAM protocol (see my reply above). Please let me know if I got it wrong.

Collaborator:

Correct -- it should be a minimal change.

MK = 2, // Control marker
}

pub struct Token<SYM_WIDTH: u32, PTR_WIDTH: u32, CNT_WIDTH: u32> {

Collaborator:

This doesn't seem to follow the LZ4 spec -- https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md

Contributor Author:

This is an "internal" token object used by encoder & decoder algorithms. There are other procs (not included in this PR) needed to convert between a stream of these _Token_s and the actual sequence/block format used by Lz4 files.

I chose this design because an LZ4 block is not just a sequence of flattened & serialized tokens - to optimize the file footprint, the format groups LITERAL tokens and has a special way of encoding COPY_POINTERs (e.g. part of CP_CNT is stored in the first byte of the sequence, and the rest gets appended at the very end of the sequence). The encoder & decoder algorithms shouldn't really deal with such specifics of file encoding; they only care about raw LITERAL and COPY_POINTER tokens.
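
As an illustration (hypothetical values), a string like "abcabc" would travel between these procs as a stream of raw tokens along the lines of:

    LITERAL('a')  LITERAL('b')  LITERAL('c')  COPY_POINTER(ptr=3 back, cnt=3)

rather than as LZ4's packed sequence encoding.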

Collaborator:

Understood. Add a TODO if the plan is to address the comment in a follow-up change.

However, I think the current design should be such that it's possible to implement it in the future with minimal changes. One thing I don't see here is how to handle the variable literal lengths available in LZ4.

What is the plan to address that?

@rdob-ant (Contributor Author) commented Sep 7, 2023:

As for variable literal lengths in the LZ4 block format:

A single Token represents either a single literal (symbol) or a single match (copy) operation. If I understand it right, the "variable literal length" that you're referring to is the way to specify the number of literals present in an LZ4 sequence, where a single LZ4 sequence is defined as N literals followed by a single match operation (except for the last sequence in the block, which has only literals and no match operation). An important detail of the format is that you have to first specify the number of literals that the sequence contains (using a variable number of bytes that depends on how many literals you have), and only then the literals themselves. The extreme case is an LZ4 block that contains only literals, without any match operation - there the number of literals in the sequence roughly corresponds to the block size, which IIRC can be as large as 4MB.
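
For reference, the rough layout of one LZ4 sequence per the block format spec (field lengths of 15 or more spill into extra bytes):

    token byte (4 bits: literal count | 4 bits: match length)
    [extra literal-length bytes] [literals...] offset (2 bytes) [extra match-length bytes]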

W.r.t. a hardware implementation, the LZ4 algorithm itself can be laid out rather nicely, as it's stream-oriented and doesn't need any extra buffering besides a 64KB history buffer and an 8~32KB hash table.

However, if one is to implement an IP core that emits data in the standard LZ4 block format, then - because the LZ4 compressor emits tokens sequentially, with their type and count not known beforehand, and because an LZ4 block specifies the number of literals before the literals themselves - the tokens first need to be buffered (with the buffer size limited by the max block size), and only then can one emit an LZ4 sequence header followed by the literals. It can all be done within a single buffer as well, but you still need a block buffer :)

There are alternative encoding formats that try to mitigate this issue e.g. MLZ4, but they are not compatible with existing LZ4 software.

In the scope of this PR, the idea was to implement an LZ4 encoder core that accepts a stream of raw bytes on the input and outputs a stream of tokens (literals or matches), without targeting a specific block encoding format. Packing of tokens into blocks (using LZ4, MLZ4, or any other format) is thus offloaded to an external proc, which is out of scope for this PR. However, the encoder core should of course emit enough data to allow such packing - that's why I recently added e.g. the "MATCHED_SYMBOL" token, to allow easily rewriting matches (needed to address some "special" LZ4 requirements w.r.t. minimal match length and the minimal literal sequence length at the end of the block).

Collaborator:

I think my concern here is that it is not possible to output the LZ4 block encoding from this token stream. As you mention, an implementation would need to store the entire token stream, and then process that.

Sounds like what you are saying is that all implementations have that limitation? If so, it sounds like LZ4 isn't a suitable compression algorithm for streaming hardware acceleration.

@proppy -- thoughts?

Contributor Author:

The algorithm itself is stream-oriented and plays rather well with the HW, but the encoding of tokens into byte sequences does not. You can check e.g. the aforementioned MLZ4 for the encoding part - they've modified it a bit so that it's more hardware-friendly and does not require excessive buffering. They've also improved the algorithm itself a bit, but that is not significant in this context.

}

pub enum TokenKind : u2 {
LT = 0, // Literal

Collaborator:

Instead of a comment, just have TokenKind use the unabbreviated name, i.e. LITERAL instead of LT.

This is a common comment. While there is no formal DSLX standard yet, we prefer longer, more meaningful names. Refer to https://google.github.io/styleguide/cppguide.html#General_Naming_Rules

Contributor Author:

Fixed.

mark: Mark
}

pub fn zero_token<SYM_WIDTH: u32, PTR_WIDTH: u32, CNT_WIDTH: u32>()

Collaborator:

Suggest using the builtin zero!<>. This is a common comment.
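
For instance, the zero_token helper above could reduce to something like this (a sketch, assuming Token's parametrics are bound at the call site):

    let tok = zero!<Token<SYM_WIDTH, PTR_WIDTH, CNT_WIDTH>>();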

Contributor Author:

Fixed.

// Check whether rd_ptr points to an unwritten HB cell
let rd_ptr_valid = (rd_ptr < state.wr_ptr) || state.hb_all_valid;

// HB READ - read the symbol from history buffer

Collaborator:

Does this handle the "overlap match" scenario referenced in https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md when the symbol from the history buffer depends on the match copy?

Collaborator:

I'd leave this to @proppy to decide whether or not the DSLX tests should be more comprehensive.

// HB WRITE - write emitted symbol to the history buffer
let do_hb_write = data_valid && !data.is_mark;
let tok = send_if(tok, o_ram_hb_wr_req, do_hb_write,
ram::WriteWordReq<u32:1>(state.wr_ptr, data.data));

Collaborator:

LZ4 is byte-oriented compression, right? Should there be some use of masks to handle a word-sized write request?

Contributor Author:

It is. This RAM's data bus has the width of a single symbol, so the "word size" is 8 bits in the case of standard LZ4, and masks are not needed.

Collaborator:

I think a TODO is warranted here then. Most RAMs we'll have available for implementation will have a data bus wider than 8 bits.

Contributor Author:

I can add a TODO. But do we want to make the DSLX code aware of the port size of the specific RAM IP used in the hardware? E.g. in FPGAs you'd frequently rely on the RAM block being auto-inferred from the Verilog code, so the synthesis tool takes care of any mismatch between the data/address bus widths of the available hard-IP core and the expectations set by the RTL design.

Collaborator:

I think it should be an input parameter. SRAMs take a non-trivial amount of time to generate collateral for and incorporate during an ASIC design flow.

hb_all_valid: bool,
}

pub proc decoder_base<SYM_WIDTH: u32, PTR_WIDTH: u32, CNT_WIDTH: u32> {

Collaborator:

In terms of general architecture, consider how to support the following:

  1. Using single-port RAM
  2. Whether multiple procs accessing a single RAM could make the behavior clearer/cleaner

Contributor Author:

I'm not sure what the policy is w.r.t. single/dual-port RAMs.

  • In case we'd always prefer a single-port RAM, this block can be rewritten to do that.
  • In case we want both options available, a proc that uses 1R1W RAM is more versatile: it can either use the 1R1W directly, or it can be bridged to a 1RW RAM via some kind of 1R1W-to-1RW adapter proc (this last claim needs some thinking, though - I reckon there may be some corner cases where this won't work).

Collaborator:

See the comment above -- if abstract channels are used, then the RAM architecture comes for free.

However, #2 remains. Would it make sense to break apart the proc into separate communicating procs using the same SRAM?

Contributor Author:

Hm, even if we're just considering splitting the current monolithic LZ4 compressor into multiple procs, having a few procs communicating with the same RAM may make things easier. But is this already supported in DSLX? Can you perhaps point me to such an example? I thought that a single RAM can be used by at most two procs (one reading, one writing) 🤔

Collaborator:

Yes -- I don't think we've contemplated more than two procs per SRAM. But it's possible to write an SRAM-interface proc that will handle the arbitration.
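
A minimal sketch of such an arbiter (fixed priority; names are illustrative, and plain u8 address/data channels stand in for the real ram.x request structs):

    // Fixed-priority arbiter: forwards addresses from two clients to a
    // single RAM read port and returns data to the client that asked.
    proc ram_read_arbiter {
        a_addr: chan<u8> in;
        a_data: chan<u8> out;
        b_addr: chan<u8> in;
        b_data: chan<u8> out;
        ram_addr: chan<u8> out;
        ram_data: chan<u8> in;

        init { () }

        config(a_addr: chan<u8> in, a_data: chan<u8> out,
               b_addr: chan<u8> in, b_data: chan<u8> out,
               ram_addr: chan<u8> out, ram_data: chan<u8> in) {
            (a_addr, a_data, b_addr, b_data, ram_addr, ram_data)
        }

        next(state: ()) {
            // Client A has fixed priority; B is only polled when A is idle.
            let (tok, a, a_valid) = recv_non_blocking(join(), a_addr, u8:0);
            let (tok, b, b_valid) = recv_if_non_blocking(tok, b_addr, !a_valid, u8:0);
            let any = a_valid || b_valid;
            let addr = if a_valid { a } else { b };
            // Issue at most one RAM request per tick and route the response
            // back to whichever client was served.
            let tok = send_if(tok, ram_addr, any, addr);
            let (tok, data) = recv_if(tok, ram_data, any, u8:0);
            let tok = send_if(tok, a_data, a_valid, data);
            send_if(tok, b_data, b_valid, data);
        }
    }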


// Send decoded data symbol or Mark
let zero_data = dbe::zero_plain_data<SYM_WIDTH>();
let (data_valid, data) = if rx_reset {

Collaborator:

rx_reset is more about clearing the state to handle a new block, right? If so, consider changing the name to reflect that.

Contributor Author:

In general it's equivalent to a complete state reset: it both prepares the proc to handle a new block of data and clears any sticky error state that the proc (and other procs in the chain) may be in.

Collaborator:

It seems the common case is handling a new block of data, no? I'd suggest naming it for that rather than for the less common case of resetting the error state.

Contributor Author:

For now I've kept the "reset" name, as the idea of the "RESET" token was to reset a full network of processing blocks (e.g. an LZ4 block encoder or anything else that may sit alongside this single core), and it just happens that in the encoder this function coincides with the operation needed when you want to start processing a new independent block. I can rename it to "restart" or something like that, or even add a new marker code, but that would be redundant (in the sense that if you're starting to encode an independent block of data, you don't want to preserve any previous state, so why not just completely reset the circuit?).

Collaborator:

I'd say the issue here is that the reset signal is usually special -- and may be considered a multi-cycle signal.

It sounds like this signal here is expected to be a mission-mode signal that should be timed as such -- so it's more a personal preference?

];

#[test_proc]
proc test_simple {

Collaborator:

Suggest adding a couple more tests to target edge conditions:

  1. "Overlap match"
  2. History buffer wraparound
  3. Offset zero
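
(For context, an "overlap match" is a copy whose count exceeds its offset, e.g. a hypothetical LITERAL('a') followed by COPY_POINTER(ptr back 1, cnt=4) must decode to "aaaaa", with each copied symbol depending on the symbol just written.)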

Contributor Author:

These tests, except for "offset zero" (as that belongs to the lz4_block_reader proc that parses the LZ4 block format and is not part of the current PR), are implemented using the Python cocotb framework; they were part of the original large PR #1079:

https://github.com/antmicro/xls/blob/6b324c35883898f77933c7355eb1d7f5daee6339/xls/modules/dbe/lz4_decoder_test.py

Collaborator:

Ack.

}
}

pub proc encoder_base<

Collaborator:

Add a comment on how matches are found using the hash table (i.e. the algorithm).
Comment on how hash collisions ("aliases") are detected and prevented from causing a bad encoding.

Contributor Author:

I'm going to prepare a document that explains how the algorithm works - this will be explained there.

Member:

@rdob-ant curious which format you plan to use? (a separate markdown file, comments in the header of the impl, or Google Docs?)

Contributor Author:

@proppy I wrote a plain markdown file.

FsmSt::ERROR
} else if rx_reset {
FsmSt::RESET
} else if fsm == FsmSt::RESET {

Collaborator:

Nit: usually we prefer a match statement for FSMs in the DSL (so you can demonstrate that you exhaustively handled all cases in a declarative fashion).
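
For instance, the priority structure above could be written as a match along these lines (a sketch - the rx_err condition and the FsmSt::IDLE arm are illustrative stand-ins for the parts not visible in this excerpt):

    let fsm_next = match (rx_err, rx_reset, fsm) {
        (true, _, _) => FsmSt::ERROR,
        (_, true, _) => FsmSt::RESET,
        (_, _, FsmSt::RESET) => FsmSt::IDLE,
        (_, _, _) => fsm,
    };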

Contributor Author:

Fixed.

@proppy (Member) commented Aug 24, 2023:

xls/examples/ram.x

Looks like there is a conflict, can we rebase?

@rdob-ant (Contributor Author):
@proppy will rebase when pushing updated code

@rdob-ant force-pushed the 47181-lz4-dslx branch 2 times, most recently from c20d324 to c7992e1 on August 29, 2023
@rdob-ant force-pushed the 47181-lz4-dslx branch 2 times, most recently from 6d42979 to 2f30860 on September 1, 2023
@@ -0,0 +1,289 @@
# Dictionary coder implementation in DSLX

Collaborator:

A couple of generic comments:

  1. Given the long FSM backedge -- what's the expected throughput of this implementation?
  2. Is the throughput/area tunable?
  3. Perhaps a multi-proc implementation would be preferred to improve this.

Contributor Author:

  1. As this is example code, we haven't really considered throughput while writing it. However, as far as I remember, when I was simulating the generated Verilog code I got something on the order of 10~12 clock cycles per byte (with 1-cycle single-port RAMs). Even though a dual-port RAM would make it somewhat better, the number is still not good enough for a practical implementation, which should be targeting something like 1 byte per cycle on average (at most).
  2. Not directly. We can tune the area/compression ratio by changing the hash table size, and using single-port vs. dual-port RAM, or RAMs with different latencies, will result in different throughput/area ratios, but there is no parameter to control this directly in a smooth way (and tbh I can't think of any way this could be done in a scalar LZ4).
  3. Probably yes. I don't see how it can be improved without splitting the design.

* *data* communicates a symbol whenever *is_marker* is not set.
* *mark* communicates a control mark whenever *is_marker* is set.

#### Token

Collaborator:

I'd suggest writing another section on how these tokens can be converted into a stream compatible with the LZ4 block format (https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md)

Notably the following constraints:

  1. That there is a minimum match length of 4 bytes
  2. That after each literal section there MUST be a minimum match
  3. That the end of the sequence must have 5 literal bytes
  4. That the last match must have an offset >= 12

Contributor Author:

I will do this. In short, the idea is that you can use the data conveyed within "MATCHED_SYMBOLs" to rewrite "MATCHes": e.g. if your encoding format requires a match of at least 4 symbols and the encoder emits a match of 2 symbols, there will be two MATCHED_SYMBOL tokens accompanying the MATCH, so the postprocessor can remove the match and convert the MATCHED_SYMBOLs into UNMATCHED_SYMBOLs (i.e. literals).

This way it is possible to remove or shorten matches, and I think that should be enough to cover all the requirements imposed by the LZ4 standard.
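
As a small worked example (token names as in this PR; the stream layout is illustrative): with a minimum match length of 4, an emitted stream of

    MATCH(cnt=2)  MATCHED_SYMBOL('a')  MATCHED_SYMBOL('b')

can be rewritten by the postprocessor into

    UNMATCHED_SYMBOL('a')  UNMATCHED_SYMBOL('b')

dropping the too-short match while preserving the decoded output.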

Collaborator:

Thanks -- I think the discussion on token format is relevant here as well. Is it even possible to do such a thing without needing to buffer the entire block?

Contributor Author:

I've added a description to the doc that shows how matches can be rewritten to fulfill these requirements.

One note, though, about requirement 2 ("after each literal section there MUST be a minimum match"): a match must be there, except at the end of the block. The end of the block must contain only literals and no match. Thus, a typical LZ4 block looks like this:

literals - match - literals - match - literals - END

The other requirement is about the minimal match size, and that's easy to implement with only 4 (or even 3) bytes of FIFO memory. The rest of the requirements specify how exactly that last sequence of literals has to look. Those can be fulfilled by making sure that there is no match within the last 12 symbols of the block - which is as simple as buffering up to 12 symbols and, if there are any matches among them, replacing those matches with literals. The tricky part is when the block ends in a long match that produces more than those 12 symbols. In this case, the match has to be split - and it's possible to do this without needing more than a 12-symbol buffer (or, as the doc says, a 15-token buffer, since in reality we also want to buffer match tokens, and there can be up to 3 of those per 12 symbols).
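
For instance (illustrative counts): if the block would otherwise end in MATCH(cnt=20), the buffered tail can be rewritten so that the last 12 symbols are emitted as literals - e.g. MATCH(cnt=8) followed by 12 UNMATCHED_SYMBOLs - keeping the decoded output identical.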

grow, try to start a new matching string:
* __(7)__ Compute hash of the data contained in the FIFO.
* __(8)__ Load *candidate_ptr* pointer from the HT.
* __(9)__ Store *current_ptr* to the HT.

@hongted (Collaborator) commented Sep 6, 2023:

Seems the hash table is only updated when starting a new match, so that even assuming no collisions, not all sequences will be present in the hash table?

That is, during a match, the intermediate sequences will not be updated with the most recent offset.

Is this intended, and does it match the reference implementation?

Contributor Author:

I'll check this, but IIRC this behavior matches the reference code 🤔

Contributor Author:

@hongted See this line in the Rust LZ4 implementation: https://github.com/PSeitz/lz4_flex/blob/main/src/block/compress.rs#L488

The code above that line looks for the beginning of the match and uses the hash table to do so (lines 423~425). Once the initial matching sequence of 4 bytes has been found, however, it tries to grow it (line 488), this time without updating the hash table.

Example implementation of Lz4 - a dictionary-based data
compression algorithm.

Signed-off-by: Roman Dobrodii <rdobrodii@antmicro.com>
Switches Lz4 encoder proc to use abstract (1R1W) RamModel
and adds RAM interface rewrites at xls_ir_opt_ir and
xls_ir_verilog stages.

Signed-off-by: Roman Dobrodii <rdobrodii@antmicro.com>
Make SimultaneousReadWriteBehavior enum public as it can be
used from other modules.

Signed-off-by: Roman Dobrodii <rdobrodii@antmicro.com>
- Simplify Lz4 encoder by removing corner-case handling required by the Lz4
  block format. This allows removing 1 FIFO, 2 FSM states, and a bunch
  of convoluted logic that can easily be moved to an external module.
- Hash function is factored out into a separate function
- Variable naming is improved

Signed-off-by: Roman Dobrodii <rdobrodii@antmicro.com>
Add a Markdown readme file and a couple of diagrams that
explain how the encoder works.

Signed-off-by: Roman Dobrodii <rdobrodii@antmicro.com>
Factored out the tests into separate DSLX files. Added a round-trip
test for encoder & decoder.

Signed-off-by: Roman Dobrodii <rdobrodii@antmicro.com>
Add information on match token rewriting.

Signed-off-by: Roman Dobrodii <rdobrodii@antmicro.com>
@cdleary added the "app" label (Application level functionality (examples, uses of XLS stack)) on Mar 27, 2024
@proppy (Member) commented Apr 3, 2024:

Should we close this in favor of the work on #1211?
