Document performance tuning #225

Open
sjperkins opened this issue Jul 4, 2022 · 0 comments

sjperkins commented Jul 4, 2022

Description

Several parameters affect performance across the supported formats. It would be useful to document them.

CASA

  1. Choice of storage manager. Tiled Storage Managers have better performance than Standard Storage Managers. Data can be tiled (chunked) on multiple dimensions.

  2. Grouping columns, indexing columns and TAQL WHERE queries can produce non-contiguous row orderings.

  3. Chunk Sizes specified in xds_from_*(chunks={"row": 10000}) affect performance:

    • Larger chunk sizes generally perform better as requests for larger quantities of data can be issued in one call (although this is affected by (2)).
    • Functions that process larger chunks can drop the GIL for longer (especially if written in numba or C++). This is useful in a dask environment, where other threads can then run in parallel.
    • Choosing row chunks that are a multiple of the TiledStorageManager tile size would probably also improve performance.
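
A minimal sketch of points (2) and (3), which could seed the documentation. The helper names `contiguous_runs`, `aligned_row_chunk` and `open_ms` are hypothetical, as is the tile size of 1024 rows; a real tile shape would have to be read from the table's data manager information. The sketch also assumes that each contiguous run of rows maps to roughly one read request, which is a simplification.

```python
def contiguous_runs(rows):
    """Collapse a row ordering into (start, stop) runs of consecutive rows.

    Assumption: each run corresponds to roughly one read request, so
    fewer runs means fewer, larger reads.
    """
    runs = []
    for r in rows:
        if runs and runs[-1][1] == r:
            runs[-1][1] = r + 1      # extend the current run
        else:
            runs.append([r, r + 1])  # start a new run
    return [tuple(run) for run in runs]


def aligned_row_chunk(target_rows, tile_rows):
    """Round target_rows down to a multiple of tile_rows (at least one tile)."""
    if target_rows <= 0 or tile_rows <= 0:
        raise ValueError("target_rows and tile_rows must be positive")
    return max(tile_rows, (target_rows // tile_rows) * tile_rows)


def open_ms(path, target_rows=10000, tile_rows=1024):
    """Open a Measurement Set with row chunks aligned to a hypothetical tile size."""
    # Imported lazily so the helpers above work without dask-ms installed.
    from daskms import xds_from_ms

    row_chunk = aligned_row_chunk(target_rows, tile_rows)
    return xds_from_ms(path, chunks={"row": row_chunk})
```

With these assumptions, a contiguous ordering such as `[0, 1, 2, 3]` collapses to a single run, while an interleaved ordering such as `[0, 2, 4, 1, 3]` produces five runs, one per row.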

Zarr

  1. On-disk chunk sizes can be specified, similar to TiledStorageManager tiles.
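
When picking an on-disk chunk shape, it may help to check how many chunks it produces and how large each one is. The sketch below is plain Python and does not call Zarr itself; the `(row, chan, corr)` shape, the chunk shape and the 8-byte item size (as for complex64) are illustrative assumptions, and the helper names are hypothetical.

```python
from math import ceil, prod


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension for a given chunk shape."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def chunk_nbytes(chunks, itemsize):
    """Uncompressed size in bytes of a single full chunk."""
    return prod(chunks) * itemsize


# Hypothetical visibility array with dimensions (row, chan, corr)
shape = (100_000, 64, 4)
chunks = (10_000, 64, 4)

grid = chunk_grid(shape, chunks)
nbytes = chunk_nbytes(chunks, itemsize=8)  # 8 bytes per complex64 value
```

Here the array is split into 10 chunks along row only, each about 20 MB uncompressed.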

Arrow/Parquet

  1. Arrow datasets are composed of multiple Arrow tables, partitioned by row. Thus, data can only be chunked on row.
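
Row-only partitioning can be pictured as splitting the row range into consecutive blocks, one per table. This is a plain-Python sketch of that idea, not the Arrow API; `row_partitions` is a hypothetical helper.

```python
def row_partitions(nrows, chunk_rows):
    """Split [0, nrows) into consecutive (start, stop) row blocks."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    return [
        (start, min(start + chunk_rows, nrows))
        for start in range(0, nrows, chunk_rows)
    ]


# 25 rows in blocks of 10: the last block is shorter
parts = row_partitions(25, 10)
```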