Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposals for improvements #1055

Open
denballakh opened this issue Dec 10, 2023 · 27 comments
Open

Proposals for improvements #1055

denballakh opened this issue Dec 10, 2023 · 27 comments

Comments

@denballakh
Copy link

denballakh commented Dec 10, 2023

Hi! Your project is awesome! I made very similar thing myself for my project, and then found construct.
I have several proposals and questions (they are at the end).
Here they are:

New features:

  • (solved) add support for recursive constructs. Right now I am doing this:
Var = c.Struct(
    'name' / WStr,
    'type' / c.Enum(
        u8,
        unknown=0,
        int=1,
        dword=2,
        float=3,
        str=4,
        array=9,
    ),
    'value' / c.Switch(
        c.this.type,
        {
            'unknown': c.Computed(None),
            'int': i32,
            'dword': u32,
            'float': f64,
            'str': WStr,
            'array': List(u32),  # will be replaced later
        },
    ),
)
Var.value.cases['array'] = List(Var)

But this doesn't feel like a good way. Also, such constructs are not compile'able

  • import'able compiled parsers that don't require construct at runtime (or do, but doesn't require recompiling every time)
  • current integer naming is reasonable, but it creates too much noise when there is a lot of different integer types being used. I propose shortening their names to something like u8/i32/f64 for uint8/int32/double. But there is a problem: now there is no way to specify endiannes, so making new namespace for them might be good: construct.le.u32, construct.be.f64, from construct.le import *
  • add a helper function to convert .parse result into ordinary python dict/list's. Right now they are not json-serializable because of ._io field

Micro-optimizations:

  • use f-strings more
  • add __slots__ to reduce memory usage and (possibly) speed up some hot attribute lookups
  • make most function arguments positional-only, this speeds up function calls a little
  • passing arguments is not free. (stream, context, path) are passed everywhere, but are rarely accessed. Maybe we can store some of them in some kind of global storage (probably thread-local)?

Python versions:

  • drop 3.6 and 3.7 support, they reached EOL. 3.8 will reach EOL soon
  • add official support for 3.12
  • remove py2-related hacks

Tooling:

  • black'ify code
  • incorporate type annotations into code, add some typechecker in CI

Documentation:

  • documentation is a bit out of sync with existing implementation
  • improve examples in docs (s = X or Y is confusing)
  • add proper API reference that lists all available methods on all objects in the library
  • improve docstrings (fix punctuation, improve readability, ...)

Other:

  • there are several names that don't follow pep8, maybe we can pep8-ify them

Questions:

  • what is the point of .flagbuildnone? I can't figure out from source code what it does.
  • (solved) what is the common way to make construct that does nothing to stream and returns None? I found out that Computed(None) does exactly that, but something like construct.Nothing would be nice to have.
@denballakh
Copy link
Author

By reading documentation one more time I found solutions for some of my problems:

  • Pass does nothing, this is exactly what I wanted
  • LazyBound allows us to make recursive constructs, it is also compile'able

@denballakh
Copy link
Author

I have a problem.
I have a file that contains something like this:

  • some data...
  • offset where X starts as uint32
  • some other data...
  • X

How do I parse/build that structure?
It should work like this:
On parse:

  • parse offset of X and save it
  • when you are about to parse X, seek to saved offset and then continue parsing

On build:

  • save current offset, skip place where offset of X will be written later (seek 4 bytes forward)
  • when you are about to build X, save current offset, seek to first saved offset, build there second saved offset, seek back to first saved offset

Pseudocode:

def parse():
    ...
    offset1 = u32.parse()
    ...
    seek(offset1)
    x = X.parse()

def build():
    ...
    offset1 = tell()
    seek(offset1 + u32.sizeof())
    ...
    offset2 = tell()
    seek(offset1)
    u32.build(offset2)
    seek(offset2)
    X.build(x)

I tried to do it like this, it parses as intended, but builds incorrectly:

c.Struct(
    '__X_offset_offset' / c.Tell,
    '__X_offset' / c.Default(u32, 0),
    'somedata' / somedata, # variable length
    c.Seek(c.this.__X_offset),
    c.Pointer(c.this.__X_offset_offset, c.Default(u32, c.this.__X_offset)),
    'x' / X,
    ...
)

Is it possible to implement that behaviour without making custom class?
If I have to implement custom class myself, I am thinking of something like this:

c.Struct(
    'X_offset' / Offset(u32),
    'somedata' / somedata,
    OffsetApply('X_offset', c.Tell),
    'x' / X,
)

@arekbulski
Copy link
Member

@denballakh
Copy link
Author

I know about Pointer, but I don't think it solves my problem.
It will work during parsing, but during building how would it know how and where to store offset of X?

@arekbulski
Copy link
Member

add a helper function to convert .parse result into ordinary python dict/list's. Right now they are not json-serializable because of ._io field

That is an interesting point. I will give it some thought.

@denballakh
Copy link
Author

What can you say about things I proposed? Which ones do you like and which ones don't?
How can I help you? I think I can do some of these things myself, but I need your approval for some decisions.

@denballakh
Copy link
Author

denballakh commented Dec 11, 2023

Here is a library I was working on:
https://github.com/denballakh/ranger-tools/blob/master/rangers/std/dataclass.py
It does very similar thing to construct, but is a lot simpler and lacks a lot of functionality.
Naming is very different, but you can find a lot of similar constructs such as Struct, Sequence, Computed, Restreamed, Greedy*, Switch, Terminated, Array, Const, ...

It can be used like this:
https://github.com/denballakh/ranger-tools/blob/master/rangers/sav.py

I decided to switch to construct, and now I am rewriting some structs to construct (it is trivial in 90% of cases).
The problem I mentioned several messages before arises from there:
https://github.com/denballakh/ranger-tools/blob/master/rangers/scr.py#L261
https://github.com/denballakh/ranger-tools/blob/master/rangers/scr.py#L264-L272
I am not able to do it correctly with construct.

@arekbulski
Copy link
Member

I am busy atm. I will look at the links and tell you how you can contribute later today. 👍

@arekbulski
Copy link
Member

Here is what I would like to work on myself:

  • add a helper function to convert .parse result into ordinary python dict/list's. Right now they are not json-serializable because of ._io field

Here is what I would like you to work on:

  • documentation is a bit out of sync with existing implementation

The .flagbuildnone is a way for a subcon parsing class to tell its wrapper parsing class that it does not require a build value (the thing you pass into .build method). If you do .build(None) it promises to build just fine. Look at this... if the subcon has a buildfromnone set, it is given either a dict entry or a None, but otherwise its a straight error.
https://github.com/construct/construct/blob/master/construct/core.py#L2252-L2255

Line 261, Converted looks a lot like Const parsing class.
Line 264, CustomCallable looks a lot like a custom parsing class (not an adapter).

@denballakh
Copy link
Author

There is a couple of places where example show that list is returned, but actually ListContainer is returned.
Would you like to update examples in the documentation like this?:

 >>> st.parse(b"\x03\x01\x02\x03")
-Container(count=3, items=[1, 2, 3])
+Container(count=3, items=ListContainer([1, 2, 3]))

 >>> seq.parse(b"\x00\x80lalalaland\x00\x00\x00\x00\x00")
-[128, 'lalalaland', b'\x00\x00\x00\x00']
+ListContainer([128, 'lalalaland', b'\x00\x00\x00\x00'])

 >>> d.parse(b"dsadhsaui")
-[100, 115, 97, 100, 104, 115, 97, 117, 105]
+ListContainer([100, 115, 97, 100, 104, 115, 97, 117, 105])

Also, in the example about .sizeof() showed error message looks like this: construct.core.SizeofError: cannot calculate size.
But actually it looks like this: construct.core.SizeofError: Error in path (sizeof).
I guess it is a regression from old behaviour, I will keep example as is.

Btw, can I reach you on some other platform? Maybe discord? That would be nice.

@arekbulski
Copy link
Member

Btw, can I reach you on some other platform? Maybe discord? That would be nice.

No, I am using Issues only, for reasons I dont want to discuss in public. If you really have a need then my email is publicly available. But please dont use it unless its really private matters.

Thank you for bringing those 2 issues (ListContainer and sizeof) to my attention. If you would submit a PR with corrections I would be more than happy to merge it.

@denballakh
Copy link
Author

denballakh commented Dec 12, 2023

  • Update docs #1056 - typos + a lot of backticks + a bit of bugs + some rewordings
  • [docs bug or possible regression] in "adapters" page VersionNumber.build(88) actually raises this:
>>> VersionNumber.build(88)
Traceback (most recent call last):
    ...
    raise ValidationError("object failed validation: %s" % (obj,), path=path)
construct.core.ValidationError: Error in path (building)
object failed validation: 88

and not just ValidationError: object failed validation: 88, as indicated in the documentation

applies also to d.build(88) example a bit later

  • [bug?] there is an inconsistency with result returned from regular construct and compiled construct:
>>> s = Struct('a' / Flag, 'b' / Flag)
>>> dict(s.parse(b'\1\1'))
{'_io': <_io.BytesIO object at 0x000001BC0EFB1670>, 'a': True, 'b': True}
>>> dict(s.compile().parse(b'\1\1'))
{'a': True, 'b': True}

_io is included only in result from normal construct, because compiled ones are storing result in different container (i guess)

  • example code in compilation page is outdated, but i'm not sure if we should change it. It is there only for demonstration purposes to explain why and what is happening. It should not be used in any other way

  • while reading compilation page, I noticed that you mentioned that container.x is slower than container['x']. I dont think this is true (at least in cpython, i dont know any internals or speed characteristics of pypy). You can do something like this:

class Container(dict):
    def __new__(cls, *args, **kwargs):
        self = super().__new__(cls, *args, **kwargs)
        self.__dict__ = self # <- this is the most important part
        return self

x = Container()
x.a = 42
assert x['a'] == 42

x['b'] = 24
assert x.b == 24

try:
    x.t
except AttributeError:
    pass # this should happen

try:
    x['t']
except KeyError:
    pass # this should happen

# you can do this:
x[5] = 6
# but this will not be accessible through attribute, obviously

This reduces the overhead of attribute access virtually to zero, now they work just like with any other object attribute access. There is no need in running __getattr__ and re-handling exceptions from it to convert them from KeyError to AttributeError.
Also this probably can be heavily optimized by specializing adaptive interpreter, which reduces overhead even more.

Also, it looks like Container.update method does nothing new that dict.update cannot do. It just adds overhead and removes ability to pass keyword arguments

@denballakh
Copy link
Author

  • i don't think Debugger works. At least on windows. Exception is not being propagated, and i dont think assigning to self.retval has any effect

  • there is a bug in formatting of EnumIntegerString inside of Container's:

>>> Container(x=EnumIntegerString.new(5, 'a'))
Container(x=uEnumIntegerString.new(5, 'a'))
#           ^ notice 'u' prefix here

I suggest removing u prefixes from everywhere. They are no longer useful, but confusing.
Also, it might be a good idea to remove EnumIntegerString.__repr__ method to make reprs a bit nicer. I think this is a backwards-compatible change.

  • one of the examples is broken (on cpython3.12):
>>> RepeatUntil(list_[-1] == 0, Byte).parse(b"aioweqnjkscs\x00")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\github\construct\construct\core.py", line 404, in parse
    return self.parse_stream(io.BytesIO(data), **contextkw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\core.py", line 416, in parse_stream
    return self._parsereport(stream, context, "(parsing)")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\core.py", line 428, in _parsereport
    obj = self._parse(stream, context, path)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\core.py", line 2678, in _parse
    if predicate(e, obj, context):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\expr.py", line 160, in __call__
    lhs = self.lhs(obj) if callable(self.lhs) else self.lhs
          ^^^^^^^^^^^^^
  File "d:\github\construct\construct\expr.py", line 217, in __call__
    return self.__parent(*args)[self.__index]
           ^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\expr.py", line 215, in __call__
    return args[1]
           ~~~~^^^
IndexError: tuple index out of range

not sure what happened, i guess list_ indeed isn't working

  • Check also raises different exception:
# at runtime:
construct.core.CheckError: Error in path (parsing) -> integrity1
check failed during parsing
# in docs:
ValidationError: check failed during parsing
# i fixed it to this fow now:
CheckError: check failed during parsing

i guess everything that was raising ExcName: pretty message now raises ExcName: Error in path ...\npretty message.

you might find exception notes interesting in that regard: https://docs.python.org/3.12/library/exceptions.html?highlight=__notes__#BaseException.add_note

  • bug
>>> d = FocusedSeq(1 or "num",
...     Const(b"MZ"),
...     "num" / Byte,
...     Terminated,
... )
>>> d.parse(b"MZ\xff")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\github\construct\construct\core.py", line 404, in parse
    return self.parse_stream(io.BytesIO(data), **contextkw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\core.py", line 416, in parse_stream
    return self._parsereport(stream, context, "(parsing)")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\core.py", line 428, in _parsereport
    obj = self._parse(stream, context, path)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\core.py", line 3238, in _parse
    return finalret
           ^^^^^^^^
UnboundLocalError: cannot access local variable 'finalret' where it is not associated with a value
>>> d.build(255)
Traceback (most recent call last):
  File "d:\github\construct\construct\core.py", line 1165, in _build
    data = struct.pack(self.fmtstr, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
struct.error: required argument is not an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\github\construct\construct\core.py", line 452, in build
    self.build_stream(obj, stream, **contextkw)
  File "d:\github\construct\construct\core.py", line 464, in build_stream
    self._build(obj, stream, context, "(building)")
  File "d:\github\construct\construct\core.py", line 3246, in _build
    buildret = sc._build(obj if sc.name == parsebuildfrom else None, stream, context, path)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\core.py", line 2774, in _build
    return self.subcon._build(obj, stream, context, path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\github\construct\construct\core.py", line 1167, in _build
    raise FormatFieldError("struct %r error during building, given value %r" % (self.fmtstr, obj), path=path)
construct.core.FormatFieldError: Error in path (building) -> num
struct '>B' error during building, given value None
  • something weird happened to BitsInteger example:
# example:
>>> d = BitsInteger(16, swapped=this.swapped)
>>> d.build(1, swapped=True)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
#   ^^

# my result:
>>> d = BitsInteger(16, swapped=this.swapped)
>>> d.build(1, swapped=True)
b'\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00'
#                               ^^
>>> d.build(1, swapped=False)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
#                                                               ^^

@arekbulski
Copy link
Member

Holy cow, you certainly dug up some dirt there. It will take me some time to get back to you on this, sorry. 😞 Feel free to post even more if you want. Just... (whispers) dont overdo it...

@arekbulski
Copy link
Member

I am going through your changes and adding lots of mine, so please do NOT add more commits, not unless you want to merge them yourself later. It is going to take me 1-2 more days to sift through everything you committed.

@denballakh
Copy link
Author

denballakh commented Dec 14, 2023

Do you like the idea of making construct.be/le namespaces that contain aliases for numbers like u32/f64 ?
This is pretty easy to implement.

P.S: if you merge two open PRs, i can continue working without worrying about possible merge conflicts.

@arekbulski
Copy link
Member

No, namespaces are a no go.

I will get on top of your PRs later today.

@arekbulski
Copy link
Member

something weird happened to BitsInteger example

I found this in 2.10 changelog:

BitsInteger swapped semantic was fixed

@denballakh
Copy link
Author

something weird happened to BitsInteger example

I found this in 2.10 changelog:

BitsInteger swapped semantic was fixed

so example is just wrong, ok, it should be updated

@denballakh
Copy link
Author

I was reading your changes in the documentation and i have several little notes:

>>> from construct import *
>>> c = Computed(42)
>>> s = Struct('a' / c, 'b' / c) + (Struct('c' / c, 'd' / c) + Struct('e' / c, 'f' / c))
>>> s.parse(b'')
Container(a=42, b=42, c=42, d=42, e=42, f=42) # <- no nesting

Everything else looks fine.

@arekbulski
Copy link
Member

3rd point, YES you right of course, its not a valid syntax. I will correct it.

@franzhaas
Copy link

franzhaas commented Jan 17, 2024

@denballakh the __slots__ idea is currently worked on here.:
#1063

I would love to know if it improves your use case or breaks it...

@denballakh
Copy link
Author

slots in new versions are no longer faster than regular attrs, so there would be no perfomance increase
and memory usage is not a big concern, because i doubt you have millions of constructs

@franzhaas
Copy link

I dont have a break down on the speed effect of each and every optimisation on this pull request. But on my workload it improved about 3x (running on pypy), in real world application...

There are user visible changes though....

@denballakh
Copy link
Author

Oh right. I didn't consider pypy at all, i was talking about cpython3.12.
Awesome!

@kohlrabi
Copy link

kohlrabi commented Feb 6, 2024

slots in new versions are no longer faster than regular attrs, so there would be no perfomance increase

@denballakh Do you have a source for that statement. That's news to me.

@franzhaas
Copy link

franzhaas commented Feb 25, 2024

After thinking about it for a while, i came up with this optimisation.:

05fd679

Instead of going __ slots __ (which has problems with names like 1, a-b etc...) it goes dict, and only converts to Container when necessary...

This achieves a significant speed gain but is much less intrusive than using __ slots __.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants