Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: compressed layout of enum #3166

Open
frank-king opened this issue Aug 23, 2021 · 7 comments
Open

Proposal: compressed layout of enum #3166

frank-king opened this issue Aug 23, 2021 · 7 comments

Comments

@frank-king
Copy link

frank-king commented Aug 23, 2021

I have an enum like this:

enum Format {
    UnsignedInt8 = 0,
    SignedInt8 = 1,
    UnsignedInt16Le = 2,
    UnsignedInt16Be = 3,
    SignedInt16Le = 4,
    SignedInt16Be = 5,
    UnsignedInt32Le = 6,
    UnsignedInt32Be = 7,
    SignedInt32Le = 8,
    SignedInt32Be = 9,
    UnsignedInt64Le = 10,
    UnsignedInt64Be = 11,
    SignedInt64Le = 12,
    SignedInt64Be = 13,
    Ieee754FloatLe = 14,
    Ieee754FloatBe = 15,
    Ieee754DoubleLe = 16,
    Ieee754DoubleBe = 17,
    Utf16Le = 18,
    Utf16Be = 19,
    Utf32Le = 20,
    Utf32Be = 21
}

It's kind of verbose and needs more codes in pattern matching, so I rewrite it like this:

enum Format2 {
    Int8 { signed: bool },                    // 0, 1
    Int16 { signed: bool, big_endian: bool }, // 2, 3, 4, 5
    Int32 { signed: bool, big_endian: bool }, // 6, 7, 8, 9
    Int64 { signed: bool, big_endian: bool }, // 10, 11, 12, 13
    Ieee754Float { big_endian: bool },        // 14, 15
    Ieee754Double { big_endian: bool },       // 16, 17
    Utf16 { big_endian: bool },               // 18, 19
    Utf32 { big_endian: bool },               // 20, 21
}

I implemented TryFrom<u8> and Into<u8> for it, to make sure Format2 has the same mapping to u8 as Format.

Everything works fine, except that the rewritten Format2 occupies 3 bytes, instead of 1.

Is there a possibility to add something like #[repr(compressed)] to make Format2's memory representation like Format?

Currently, rustc can optimize the enum to occupy only 1 byte if I comment out all the fields in the discriminants except the first one:

enum Format3 {
    Int8  { signed: bool },                    // 0, 1
    Int16 , // { signed: bool, big_endian: bool }, // 2, 3, 4, 5
    Int32 , // { signed: bool, big_endian: bool }, // 6, 7, 8, 9
    Int64 , // { signed: bool, big_endian: bool }, // 10, 11, 12, 13
    Ieee754Float  , // { big_endian: bool },        // 14, 15
    Ieee754Double , // { big_endian: bool },       // 16, 17
    Utf16 , // { big_endian: bool },               // 18, 19
    Utf32 , // { big_endian: bool },               // 20, 21
}

I'm not very familiar with rustc, but I can describe a possible algorithm to achieve this:

  1. Scan each discriminants of the enum, and calculate the number of all possible values it could have. Also records the offset start of each discriminant based on a prefix sum. If there are discriminants that:
    1. have multiple fields, the number of all possible values is the product of that of all fields;
    2. have enum, tuple or struct fields, do this recursively;
      3.have non-Copy fields, stop this algorithm.
  2. Check the sum of all possible values for the whole enum, if it is:
    1. in range 0..256, use 1 byte to represent it;
    2. in range 256..65536, use 2 bytes to represent it;
    3. ...
  3. Now each field of each discriminants has its value range (offset..offset + num_of_values). During pattern match, the discriminants are matched by its value range, and the fields are matched by its subrange. It might not be contiguous, then it can be matched by modulo or bit operations. (The value range can be extended to fit exponent of 2 for faster matching by bit operations, but the user should be able to control it.)

For example, for Int8 { signed: bool }, there are 2 possible values, and it has offset 0, so its range is 0..2; for Int16 { signed: bool, big_endian: bool }, there are 2 × 2 = 4 possible values, and it has offset 2, so its range is 2..6. If there is a match

match format {
    Format::Int16 { signed: true, big_endian } => {/* ... */}
    Format::Int32 { signed, big_endian: false } => {/* ... */}
    // ...
}

It can be translated to

let format_u8 = format as u8; // maybe by transmute
match format_u8 {
    4..6 => {
        let big_endian = format_u8 % 2 != 0;
    }
    6..10 if format_u8 % 2 == 0 {
        let signed = format_u8 - 6 >= 2;
    }
    // ...
}
@nagisa
Copy link
Member

nagisa commented Aug 23, 2021

Currently Rust allows code to take references to inner enum fields, so bool must be represented as 0 or 1 in its own byte no matter what.

If there's only one bool, it is able to allocate 0 and 1 for the bool of the first variant (thus making references to the field still valid) and then the other byte values correspond to other variants.

@frank-king
Copy link
Author

frank-king commented Aug 23, 2021

What if using anonymous enum instead of bool? Since the discriminants of the anonymous enum is only used in the outer enum, there is no problem to take it by (immutable) reference and do sub-matching.

enum Format3 {
    Int8( enum { Unsigned, Signed } ),                    // 0, 1
    Int16( enum { Unsigned, Signed }, enum { LittleEndian, BigEndian } ), // 2, 3, 4, 5
    // ...
}

// match by reference (however, matching by mutable reference is ill-formed)
match &format {
    Format3::Int16 (ref signed, ref big_endian) => {
        if mathces!(signed, Format3::Int16::Signed) {
            // ...
        }
    }
    // ...
}

In alternative, if #[repr(compressed)] is enabled, how about make it taken-by-value only? And automatically impl Copy for it.

@Kimundi
Copy link
Member

Kimundi commented Aug 26, 2021

Similar things have been discussed in the past. It basically boils down to that annotation needing to prevent taking references to data in the enum. Which would be fine! We already do something similar with #[repr(packed)], where taking references is marked as unsafe.

@kennytm
Copy link
Member

kennytm commented Aug 30, 2021

this would be one step away from native bitfields (not necessarily compatible with C) in rust 🙂

#[repr(bit_packed)]  // = #[repr(compressed)]
struct A {
   a: bool,
   b: bool,
}

assert_eq!(std::mem::size_of::<A>(), 1);

#[repr(bit_packed)] 
enum Format2 {
    // 5 bits to represent the whole type:
    Int8 { signed: bool },                    // 000|b_ (0, 2)
    Int16 { signed: bool, big_endian: bool }, // 001|bb (4, 5, 6, 7)
    Int32 { signed: bool, big_endian: bool }, // 010|bb (8, 9, 10, 11)
    Int64 { signed: bool, big_endian: bool }, // 011|bb (12, 13, 14, 15)
    Ieee754Float { big_endian: bool },        // 100|b_ (16, 18)
    Ieee754Double { big_endian: bool },       // 101|b_ (20, 22)
    Utf16 { big_endian: bool },               // 110|b_ (24, 26)
    Utf32 { big_endian: bool },               // 111|b_ (28, 30)
}

assert_eq!(std::mem::size_of::<Format2>(), 1);

@frank-king
Copy link
Author

@kennytm Cool! But it seems not working in nightly rustc yet?
Have there been any related RFCs?

@kennytm
Copy link
Member

kennytm commented Aug 30, 2021

@whjpji 😅 i mean using what Kimundi suggested (prohibiting taking references to a field), someone can file an RFC for the ultra-packed representation, then if it is accepted (which is hard as #205, #314, #1449, #3064 and #3113 will definitely get pulled into the debate), someone else can implement it.

@kpreid
Copy link

kpreid commented Oct 17, 2022

Adding cross-reference: #311 is an earlier discussion of the same idea as here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants