
[Feature Request] Filtering through Python wrapper #222

Open
JiriBakker opened this issue Jan 15, 2024 · 4 comments

Comments

@JiriBakker

Hi,

First of all, thanks so much for making this tool publicly available. At the Royal Netherlands Meteorological Institute (KNMI) we are making good use of it, so we're very happy to be able to do so.

Currently we're trying to improve the performance of our overall pipeline that is using the Asterix decoding through the Python wrapper. One of the options we would like to explore is to see if we can reduce the processing time of the Asterix file by filtering out data items of categories that we are not interested in.

It seems like the CLI already has this option (-LF), but since we're using the Python wrapper this functionality doesn't seem to be available yet. Are we overlooking a feature that might help us filter the categories, or is this feature yet to be added? Even though we're not that well versed in C++, we might attempt adding the functionality ourselves (and submit a pull request), but we would appreciate some advice on whether you think it's feasible and how much work it would likely require.

Thanks in advance!

@ifsnop
Contributor

ifsnop commented Jan 15, 2024 via email

Hi Jiri, it's a joy seeing KNMI here :) Just for info, I'm using the CLI with the -LF option and forwarding the filtered output to a message queue (MQTT, RabbitMQ) so it can be offloaded and distributed to several hosts that process in realtime. It scales really well. I haven't explored the Python wrapper. Kind regards,

@JiriBakker
Author

Thanks for sharing, Diego! I've been in contact with @dsalantic as well, and he suggested doing the file reading/seeking in Python and then passing only the relevant bytes to the Asterix parser. I've been working on implementing this, and preliminary results look pretty good so far.

I will update this thread when we've finalized our solution.

@zoranbosnjak
Contributor

Hi @JiriBakker,
there are several methods you can apply to speed up the processing, for example:

  • check (profile) where the processing time is spent most, and optimize that
  • avoid copying big chunks of data when not necessary
  • pre-compute as much as possible at program initialization, to reduce the processing time spent on each datagram
  • skip non-required categories (this is easy with asterix, since the category number and the size are at the head of every datablock)
  • use a low-level language such as C/C++ for the critical parts, instead of a high-level one like Python
  • use multiple cores
  • ...
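The "skip non-required categories" point relies on the fixed 3-byte datablock header: a 1-byte category followed by a 2-byte big-endian length covering the whole block. A minimal sketch of decoding that header (the `datablock_header` name and the sample bytes are mine, for illustration only):

```python
def datablock_header(buf: bytes) -> tuple[int, int]:
    # An ASTERIX datablock starts with a 1-byte category followed by a
    # 2-byte big-endian length that includes the header itself.
    category = buf[0]
    length = int.from_bytes(buf[1:3], "big")
    return category, length

# A hypothetical CAT048 datablock of 10 bytes total
category, length = datablock_header(bytes([48, 0, 10]))
print(category, length)  # 48 10
```

Once the header is decoded, a reader can seek forward `length - 3` bytes to skip a block it does not care about, without touching the body.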

However, good asterix processing performance can be achieved in Python too. Could you please share a short sample file (input), together with the expected result (output)? I would be interested in comparing a pure Python implementation with your existing processing pipeline, or with the optimized implementation that you are working on.

Zoran

@JiriBakker
Author

JiriBakker commented Feb 1, 2024

@zoranbosnjak Thanks for the input!

For now we've implemented the optimization of filtering per category. Below is a sample of how we implemented this:

from pathlib import Path
from typing import Any, Generator

import asterix

def generate_data_item_stream(path: Path, allowed_categories: list[int]) -> Generator[dict[str, Any], None, None]:
    with open(path, "rb") as file:
        while (first_three_bytes := file.read(3)):
            if len(first_three_bytes) < 3:
                break  # truncated trailing bytes

            category: int = first_three_bytes[0]
            length: int = first_three_bytes[1] * 256 + first_three_bytes[2]  # big-endian length of the whole data block

            if category not in allowed_categories:
                file.seek(length - 3, 1)  # skip the rest of this data block
                continue

            data_block: bytes = first_three_bytes + file.read(length - 3)

            # Because of backwards compatibility with older Asterix formats (v2.1 and earlier) a single data block
            # can contain one or more data items. This is the reason why `asterix.parse()` returns a
            # list instead of a single item. Note that within a data block all data items are guaranteed to have
            # the same category, so we can do the category filtering at the data block level.
            data_items: list[dict[str, Any]] = asterix.parse(data=data_block, verbose=False)

            for data_item in data_items:
                yield data_item
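The seek-based skipping above can be exercised on its own, without the asterix module, against synthetic data blocks. A self-contained sketch (the `skip_filter` name and the byte values are made up for illustration, not part of any library):

```python
import io

def skip_filter(stream, allowed_categories: list[int]):
    """Yield raw data blocks whose category is allowed, seeking past the rest."""
    while (header := stream.read(3)):
        if len(header) < 3:
            break  # truncated trailing bytes
        category = header[0]
        length = int.from_bytes(header[1:3], "big")  # length of the whole block
        if category in allowed_categories:
            yield header + stream.read(length - 3)
        else:
            stream.seek(length - 3, 1)  # whence=1: seek relative to current position

# Two synthetic blocks: a 6-byte CAT048 block followed by a 5-byte CAT062 block
data = bytes([48, 0, 6, 1, 2, 3]) + bytes([62, 0, 5, 9, 9])
print(list(skip_filter(io.BytesIO(data), [62])))  # only the CAT062 block survives
```

The same generator works over a real file object, since it only uses `read()` and relative `seek()`.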

So far this works fairly well for us. Files that contain multiple categories, some of which we are not interested in, are processed faster than previously. We'll monitor the performance once we start processing larger amounts of data. If we have any additional findings we'll be sure to share them here.

@dsalantic I'll leave it up to you whether or not you want to close this issue. For now, the above solution is sufficient for us. Thanks again for the assistance!
