
[Feature Request] Filtering through Python wrapper #222

Open
JiriBakker opened this issue Jan 15, 2024 · 4 comments

Comments

@JiriBakker

Hi,

First of all, thanks so much for making this tool publicly available. At the Royal Netherlands Meteorological Institute (KNMI) we are making good use of it, so we're very happy to be able to do so.

Currently we're trying to improve the performance of our overall pipeline that is using the Asterix decoding through the Python wrapper. One of the options we would like to explore is to see if we can reduce the processing time of the Asterix file by filtering out data items of categories that we are not interested in.

It seems like the CLI already has this option (-LF), but since we're using the Python wrapper this functionality doesn't seem to be available yet. Are we overlooking a feature that might help us filter the categories, or is this feature yet to be added? Even though we're not that well versed in C++, we might attempt adding the functionality ourselves (and submit a pull request), but we would appreciate some advice on whether you think it's feasible and how much work it would likely require.

Thanks in advance!

@ifsnop
Contributor

ifsnop commented Jan 15, 2024 via email

Hi Jiri, it's a joy seeing KNMI here :) Just for info, I'm using the CLI with the -LF option and forwarding the filtered output to a message queue (MQTT, RabbitMQ) so it can be offloaded and distributed to several hosts that process in realtime. It scales really well. I haven't explored the Python wrapper. Kind regards,

@JiriBakker
Author

Thanks for sharing, Diego! I've been in contact with @dsalantic as well, and he suggested doing the file reading/seeking in Python and then passing only the relevant bytes to the Asterix parser. I've been working on implementing this, and preliminary results look pretty good so far.

I will update this thread when we've finalized our solution.

@zoranbosnjak
Contributor

Hi @JiriBakker,
there are several methods you can apply to speed up the processing, for example:

  • check (profile) where the processing time is spent most, and optimize that
  • avoid copying big chunks of data when not necessary
  • pre-compute as much as possible at program initialization, to reduce the processing time spent on each datagram
  • skip non-required categories (this is easy with asterix, since the category number and the size are at the head of every datablock)
  • use a low-level language such as C/C++ for the critical parts, instead of a high-level one like Python
  • use multiple cores
  • ...
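The "skip non-required categories" point relies on the fixed 3-byte datablock header: a 1-byte category followed by a 2-byte big-endian length covering the whole block. A minimal sketch of decoding that header (the `datablock_header` name and the sample bytes are mine, for illustration only):

```python
def datablock_header(buf: bytes) -> tuple[int, int]:
    # An ASTERIX datablock starts with a 1-byte category followed by a
    # 2-byte big-endian length that includes the header itself.
    category = buf[0]
    length = int.from_bytes(buf[1:3], "big")
    return category, length

# A hypothetical CAT048 datablock of 10 bytes total
category, length = datablock_header(bytes([48, 0, 10]))
print(category, length)  # 48 10
```

Once the header is decoded, a reader can seek forward `length - 3` bytes to skip a block it does not care about, without touching the body.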

However, good asterix processing performance can be achieved in Python too. Could you please share a short sample file (input), together with the expected result (output)? I would be interested in comparing a pure Python implementation with your existing processing pipeline, or with the optimized implementation that you are working on.

Zoran

@JiriBakker
Author

JiriBakker commented Feb 1, 2024

@zoranbosnjak Thanks for the input!

For now we've implemented the optimization of filtering per category. Below is a sample of how we implemented this:

from pathlib import Path
from typing import Any, Generator

import asterix

def generate_data_item_stream(path: Path, allowed_categories: list[int]) -> Generator[dict[str, Any], None, None]:
    with open(path, "rb") as file:
        while (first_three_bytes := file.read(3)):
            if len(first_three_bytes) < 3:
                break  # truncated trailing bytes

            category: int = first_three_bytes[0]
            length: int = first_three_bytes[1] * 256 + first_three_bytes[2]  # big-endian length of the whole data block

            if category not in allowed_categories:
                file.seek(length - 3, 1)  # skip the rest of this data block
                continue

            data_block: bytes = first_three_bytes + file.read(length - 3)

            # Because of backwards compatibility with older Asterix formats (v2.1 and earlier) a single data block
            # can contain one or more data items. This is the reason why `asterix.parse()` returns a
            # list instead of a single item. Note that within a data block all data items are guaranteed to have
            # the same category, so we can do the category filtering at the data block level.
            data_items: list[dict[str, Any]] = asterix.parse(data=data_block, verbose=False)

            for data_item in data_items:
                yield data_item
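The seek-based skipping above can be exercised on its own, without the asterix module, against synthetic data blocks. A self-contained sketch (the `skip_filter` name and the byte values are made up for illustration, not part of any library):

```python
import io

def skip_filter(stream, allowed_categories: list[int]):
    """Yield raw data blocks whose category is allowed, seeking past the rest."""
    while (header := stream.read(3)):
        if len(header) < 3:
            break  # truncated trailing bytes
        category = header[0]
        length = int.from_bytes(header[1:3], "big")  # length of the whole block
        if category in allowed_categories:
            yield header + stream.read(length - 3)
        else:
            stream.seek(length - 3, 1)  # whence=1: seek relative to current position

# Two synthetic blocks: a 6-byte CAT048 block followed by a 5-byte CAT062 block
data = bytes([48, 0, 6, 1, 2, 3]) + bytes([62, 0, 5, 9, 9])
print(list(skip_filter(io.BytesIO(data), [62])))  # only the CAT062 block survives
```

The same generator works over a real file object, since it only uses `read()` and relative `seek()`.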

So far this works fairly well for us. Files that contain multiple categories, some of which we are not interested in, are processed faster than previously. We'll monitor the performance once we start processing larger amounts of data. If we have any additional findings we'll be sure to share them here.

@dsalantic I'll leave it up to you whether or not you want to close this issue. For now, the above solution is sufficient for us. Thanks again for the assistance!
