Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multisymbol RLE implementation #1052

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

mtdudek
Copy link
Contributor

@mtdudek mtdudek commented Jul 3, 2023

This PR adds more advanced implementation of RunLengthEncoder that is capable of ingesting
multiple symbols at once and produce multiple pairs in output.

This PR depends on #1006
It updates implementations of simple encoder and simple decoder to work with new
channel description.

Main differentiating factor is that original RLE allows for symbol width to be adjusted.
This implementation adds ability to process multiple symbols at once.
Let's take a base RLE and multisybol RLE with input that takes up to 4 symbols and outputs up to 2 pairs
and following sequence:AABCCCCB

  • base RLE has to ingest symbols one by one. It takes 8 channel transactions A, A, B, C, C, C, C, B.
    Generating output takes require 5 transactions: (2,A), (1,B), (3,C), (1,C), (1,B)
  • multisymbol RLE will ingest symbols in 2 transactions as AABC, CCCB.
    Generating output takes only 3 transactions: [(2,A), (1,B)], [(3,C), (1,C)], [(1,B)]

Multisybol RL encoder should have higher throughput but wt will use more logic to implement.

@proppy proppy changed the title Add mutlisymbol RLE implementation Add multisymbol RLE implementation Jul 19, 2023
@mtdudek mtdudek force-pushed the MultiSymbolRLE branch 2 times, most recently from 708cfaa to 7d9f08b Compare July 20, 2023 08:55
@mtdudek
Copy link
Contributor Author

mtdudek commented Jul 20, 2023

@proppy I've updated this PR to depend on #1073.

@proppy
Copy link
Member

proppy commented Jul 24, 2023

can you rebase?

@mtdudek
Copy link
Contributor Author

mtdudek commented Jul 25, 2023

@proppy I've rebased it on to the newest main

@proppy
Copy link
Member

proppy commented Jul 25, 2023

implementation of RunLengthEncoder that is capable of ingesting multiple symbols at once and produce multiple pairs in output

Can you detail a little a bit more in the PR description (or the .x header) on how this differ from the original implementation, maybe with examples of input / output; the original implementation did support variable symbol width: so it'd be good to highligh how "multisymbol" differs from "larger symbol".

@mtdudek
Copy link
Contributor Author

mtdudek commented Jul 31, 2023

I'll update .x file with similar explanation and waveform.
I've update PR description.

@proppy
Copy link
Member

proppy commented Aug 24, 2023

[(2,A), (1,B)], [(3,C), (1,C)], [(1,B)]

Curious why there is three transaction here, and two separate packets for C?


// This encoder is implemented as a net of 4 processes.
// 1. Reduce stage - this process takes incoming symbols and symbol_valid
// and reduces them into symbol count pairs. This stage is stateless.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the stage is stateless, does it need to be a proc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand DSLX if I want to perform channel operations I have to use proc's.

// This encoder is implemented as a net of 4 processes.
// 1. Reduce stage - this process takes incoming symbols and symbol_valid
// and reduces them into symbol count pairs. This stage is stateless.
// 2. Realign stage - this process moves pairs emitted from previous stage
Copy link
Member

@proppy proppy Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a note describing the state for each "stage" when they have one?


// This encoder is implemented as a net of 4 processes.
// 1. Reduce stage - this process takes incoming symbols and symbol_valid
// and reduces them into symbol count pairs. This stage is stateless.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this stage equivalent to the current rle_enc can the implementation be shared?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see any obvious code that can be shared between reduce step and rle_enc.
Main difference is that rle_enc code has to take into account state and update it f necessary,
while reduce step doesn't have one.

// 1. Reduce stage - this process takes incoming symbols and symbol_valid
// and reduces them into symbol count pairs. This stage is stateless.
// 2. Realign stage - this process moves pairs emitted from previous stage
// so that they are align to the left, it also calculates propagation distance
Copy link
Member

@proppy proppy Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give an example to show 'align to the left` refer to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let the input to 1 step be [A, A, B, B].
Output from the first step will be [..., (A,2), ..., (B,2)].
2 step will transform it to [(A,2), (B,2), ..., ...].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the comment below can you also explain what ... refers to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here ... was used for an invalid data symbol, counter pair.

// 1. Reduce stage - this process takes incoming symbols and symbol_valid
// and reduces them into symbol count pairs. This stage is stateless.
// 2. Realign stage - this process moves pairs emitted from previous stage
// so that they are align to the left, it also calculates propagation distance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give an example of a propagation distance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume that count is 2 bit wide and input can ingest 4 symbols at once.
Let the input be [A,A,A,A], so after the second stage it will look like:
[(A,3), (A,1), ..., ...] and propagation distance will be 1.
For input [A,B,B,B] second step will emit [(A,1), (B,3), ..., ...] with propagation distance 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the comment below, can you update the comment with a textual explanation of what the propagation distance refer to? (here is seems that it mainly indicate with a packet has been broken or not)

// 2. Realign stage - this process moves pairs emitted from previous stage
// so that they are align to the left, it also calculates propagation distance
// for the first pair.
// 3. Core stage - this stage is stateful. It takes align pairs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document what the state is?

// so that they are align to the left, it also calculates propagation distance
// for the first pair.
// 3. Core stage - this stage is stateful. It takes align pairs,
// and combines them with its state.It outputs multiple symbol/count pairs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give an example of the output to show how it different from the received input?

// for the first pair.
// 3. Core stage - this stage is stateful. It takes align pairs,
// and combines them with its state.It outputs multiple symbol/count pairs.
// 4 - Adjust Width stage - this stage takes output from the core stage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/4 - /4./

// This encoder is implemented as a net of 4 processes.
// 1. Reduce stage - this process takes incoming symbols and symbol_valid
// and reduces them into symbol count pairs. This stage is stateless.
// 2. Realign stage - this process moves pairs emitted from previous stage
Copy link
Member

@proppy proppy Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it takes input from the preceding "stage"? if yes, maybe we should state it explicitly.

// 2. Realign stage - this process moves pairs emitted from previous stage
// so that they are align to the left, it also calculates propagation distance
// for the first pair.
// 3. Core stage - this stage is stateful. It takes align pairs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it takes input from the preceding "stage"? if yes, maybe we should state it explicitly.

// 3. Core stage - this stage is stateful. It takes align pairs,
// and combines them with its state.It outputs multiple symbol/count pairs.
// 4 - Adjust Width stage - this stage takes output from the core stage.
// If output can handle more or equal number of pairs as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the "stage" know that the "output can handle more or equal the number of pairs", can you give an example

Copy link
Contributor Author

@mtdudek mtdudek Aug 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output and input widths are parametrized INPUT_WIDTH and OUTPUT_WIDTH.
If OUTPUT_WIDTH is greater or equal to INPUT_WIDTH, adjust width step has nothing to adjust.

// If output can handle more or equal number of pairs as
// input number of symbols. This stage does nothing.
// If the output is narrower than the input,
// this stage will serialize symbol counter pairs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this stage will serialize symbol counter pairs.

as it will split them to match the size of the desired output? can you give an example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let input width be 4 and output with is 2. Core step can emit the following : [(A,1), (B,1) , (A,1), (B,1)].
Adjust Width step will serialize it to [(A,1(, (B,1)], [(A,1), (B,1)]

@mtdudek
Copy link
Contributor Author

mtdudek commented Aug 25, 2023

[(2,A), (1,B)], [(3,C), (1,C)], [(1,B)]

Curious why there is three transaction here, and two separate packets for C?

This was with an assumption that RLE counter is 2 bit wide, so it can only hold at most 3 symbols, while C was repeated 4 times.

@@ -63,8 +63,8 @@ struct RunLengthDecoderState<SYMBOL_WIDTH: u32, COUNT_WIDTH: u32> {
}
// RLE decoder implementation
pub proc RunLengthDecoder<SYMBOL_WIDTH: u32, COUNT_WIDTH: u32> {
input_r: chan<DecInData<SYMBOL_WIDTH, COUNT_WIDTH>> in;
output_s: chan<DecOutData<SYMBOL_WIDTH>> out;
input_r: chan<DecInData<SYMBOL_WIDTH, COUNT_WIDTH, 1>> in;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be more friendly to parameterize with max_count and use https://google.github.io/xls/dslx_std/#stdclog2 + 1 to represent the width?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I remember there are parts of code that assume max_count to be 2^k-1.
I'm not sure when user would like to set max_count to be something other than power of 2 -1.

// The behavior of the encoder is presented on the waveform below:

// This encoder is implemented as a net of 4 proc.
// 1. Reduce proc - this process takes incoming symbols and symbol_valid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 1. effectively equivalent to the existing RLE enc implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not exactly the same thing as existing RLE enc. Existing RLE has state that is combined with incoming data,
this stage doesn't have state, it only works on data visible in a single channel payload and doesn't include stream history.

// distance for the first pair.
// Example behaviours:
// 1)
// input: [.., (A, 2), .., (B, 2)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does .. means here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was to mark invalid data and data, count pairs.

// and reduces them into symbol count pairs. This step is stateless.
// 2. Realign proc - this process moves pairs emitted from the reduce step
// so that they are aligned to the left, it also calculates propagation
// distance for the first pair.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also explain what "propagation distance" mean here? (it's not clear, at least to me, from just reading the example)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here propagation distance is the number of consecutive pairs, starting form first pair, that have the same symbol value as the first one and have their counter maxed out. This way 'Core proc' can only check "propagation distance"th pair value and counter.

  • If value differs from the internal state, then all pairs from [0:"propagation distance") will differ and core should combine payload with its state.
  • If value is equal internal state, then "Core proc" only checks if internal state counter + "propagation distance"th pair counter overflow.

// output: [(A, 2), (B, 2), .., ..]
// propagation distance: 0
// 2)
// input: [.., .., (A, 3), (A, 1)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(A, 3), (A, 1) can this actually happen? wouldn't the previous step simply emit (A, 4)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the examples I was assuming counter 2 bit wide, so that's why there is (A, 3), (A, 1) and not (A, 4).

// 3. Core proc - this step is stateful. It takes align pairs from
// the realign step, and combines them with its state to create multiple
// symbol/count pairs output. State is represented by following tuple
// `<symbol, count, last>`. It contains symbol and count from last pair
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it effectively means that we're counting again? is that right to assume that if each transaction only include one symbol this does nothing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first proc with a state that holds stream history. If a stream would only have one data value in all symbols.
First processing state would reduce incoming data stream into (data_value, input_count), this stage would combine this pair with one that preceded it, and only emit output pair when either internal counter overflows or stream ends.

|||
|-----|-------|
|input|output |
|[(A, True), (A, True), (A, True), (A, True)]|[.., .., (A, 3), (A, 1)]|
Copy link
Member

@proppy proppy Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the boolean after the symbol in the example corresponding to !last?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catch this. I've forgotten to put last in this example.
This boolean value next to the symbol is data valid value.
I'll update this example to include last value

repeating symbols to the output stream that contains the symbols and
the number of its consecutive occurrences in the input stream.

Overall, we can break down the data processing into four stages: reduction, alignment, compression, and output generation. The division of responsibility allowed the specialized blocks to efficiently process data and made it possible test each functionality separately.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we measure the overhead of splitting the process into different proc in term of additional registers and wires, compared to a single proc implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked how much more resource does multi process implementation uses over single process implementation, and how max frequency is affected by this.


### Process
1. Reduce step - this process takes incoming symbols and symbol_valid
and reduces them into symbol count pairs. This step is stateless.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't stateless proc be in theory implemented as fn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All state less processes are effectively a fns, but as far as I know you can't put recv/send operations into fns.
Using fn would potentially reduce fmax, as first proc would have to do all the processing before it would be able to update it's state, or we would reduce pipeline throughput when using II>1 to improve fmax.

## Encoder processing pipeline detailed breakdown.

### Initial conditions
- input width is 4 symbols wide,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this effectively assume that the input bit-width is always a multiple of the symbols size? i.e: you couldn't have block with a u9 input bus that process repeating u2 symbol?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, implementation assumes that the input is always x symbols wide.
Handling u9 to u2 conversion should be done by additional proc.
As it will know how to handle that extra one bit,
if that bit is a part of symbol that was split on u9 boundary, or it has some other meaning.

This commit adds valid signals to PlainData interface.
CompressedData interface uses `count > 0` to define
valid symbol count pair. This is in preparation for a
multisymbol RLE encoder implementation.

Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
This encoder is capable of ingesting multiple symbols
and produces multiple compressed pairs.
It should offer faster compression in exchange for area
used.

Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
Signed-off-by: Maciej Dudek <mdudek@antmicro.com>
@cdleary cdleary added the app Application level functionality (examples, uses of XLS stack) label Mar 27, 2024
@proppy
Copy link
Member

proppy commented Apr 3, 2024

Should we close this in favor of the work on #1211 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
app Application level functionality (examples, uses of XLS stack)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants