feat: introduce eager loading functions #147

Draft
wants to merge 26 commits into
base: main

lukapeschke
Collaborator

@lukapeschke lukapeschke commented Dec 22, 2023

What

This introduces eager loading functions that make use of calamine's new DataTypeRef.

This prevents some allocations, resulting in a lower memory footprint.
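
As a rough analogy in Python (not how fastexcel or calamine implement it), the difference between an owned copy and a borrowed reference can be seen by comparing a slice of a `bytearray`, which allocates a new buffer, with a `memoryview` slice, which points into the existing one. The variable names and sample data below are illustrative only.

```python
# Analogy only: owned copy vs. borrowed view of the same buffer.
data = bytearray(b"cell-1;cell-2;cell-3")

# Slicing a bytearray allocates a new object holding a copy of the bytes.
owned = data[0:6]

# A memoryview slice borrows from `data` without copying.
borrowed = memoryview(data)[0:6]

# Mutating the source is visible through the view, but not through the copy.
data[0] = ord("C")

owned_bytes = bytes(owned)        # unchanged copy
borrowed_bytes = bytes(borrowed)  # reflects the mutation
```

The borrowed view avoids the extra allocation, at the cost of being tied to the lifetime and contents of the source buffer.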

Caveats

Gains

While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. This means that we're almost as good as pandas memory-wise 🥳 (they still beat us by a few MBs), while being about 10 times faster.

Before

[image: before]

After

[image: after]

Pandas

[image: pandas]
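
For reference, a relative reduction such as the one quoted above is computed as (before - after) / before. A minimal sketch follows; the function name is hypothetical and the peak values are made-up placeholders, not measurements read from the graphs in this PR.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional decrease from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Placeholder numbers for illustration only (not taken from the graphs).
example = relative_reduction(before=400.0, after=300.0)
print(f"{example:.0%}")
```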

Signed-off-by: Luka Peschke <luka.peschke@toucantoco.com>
@lukapeschke lukapeschke added enhancement New feature or request 🔨 WIP 🔧 🦀 rust 🦀 Pull requests that edit Rust code labels Dec 22, 2023
@lukapeschke lukapeschke self-assigned this Dec 22, 2023
@lukapeschke lukapeschke added the 🐍 python 🐍 Pull requests that edit Python code label Dec 22, 2023
@lukapeschke lukapeschke added this to the v1.0.0 milestone Feb 9, 2024
@PrettyWood PrettyWood modified the milestones: v1.0.0, v0.10.0 Feb 14, 2024
@lukapeschke
Collaborator Author

Some work is still required in calamine: tafia/calamine#409

@lukapeschke
Collaborator Author

Okay, I just noticed that the API changed, so we actually need to use worksheet_range_ref when Sheets is the Xlsx variant.

@lukapeschke lukapeschke modified the milestones: v0.10.0, v1.0.0 Feb 27, 2024
@PrettyWood
Member

Glad to see tafia/calamine#409 has been merged. Hopefully we get a new release soon 👍

@lukapeschke
Collaborator Author

new data

main

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

[image: main]

this branch

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_eager(sheet_name)
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

[image: branch]
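
The two scripts differ only in the call used to load each sheet. A small helper like the following could let one script exercise both paths. This is a sketch: `load_sheet` and its `eager` flag are hypothetical names, and it assumes only that the reader exposes `load_sheet_eager` and `load_sheet_by_name(...).to_arrow()` exactly as called in the scripts above.

```python
from typing import Any, Optional, Sequence


def load_sheet(
    reader: Any,
    sheet_name: str,
    *,
    eager: bool = False,
    use_columns: Optional[Sequence[str]] = None,
) -> Any:
    """Load one sheet through either code path used in the scripts above.

    `reader` is assumed to expose `load_sheet_eager` and `load_sheet_by_name`
    as in the two scripts; nothing else about its API is assumed.
    """
    if eager:
        # Eager path, as called in the "this branch" script.
        return reader.load_sheet_eager(sheet_name)
    # Lazy path, as called in the "main" script.
    return reader.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
```

With this, the loop body becomes `arrow_data = load_sheet(excel_file, sheet_name, eager=True)`, and the flag could be driven by an extra command-line argument.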

@PrettyWood
Member

New benchmark looks great 😃

@lukapeschke
Collaborator Author

Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out 🥳

Benchmarks with the latest version:

| iterations | owned | by ref |
|---|---|---|
| 1 | [image: lazy] | [image: eager] |
| 20 | [image: lazy_20] | [image: eager_20] |
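
How the runs above were driven is not shown in this thread. As one possible sketch, a loader can be repeated for a given number of iterations with a small stdlib-only harness. `time_iterations` is a hypothetical helper, and it records wall-clock time only, not memory.

```python
import time
from typing import Callable, List


def time_iterations(func: Callable[[], object], iterations: int) -> List[float]:
    """Call `func` `iterations` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations
```

For example, `time_iterations(lambda: excel_file.load_sheet_eager(sheet_name), 20)` would repeat the eager load 20 times, reusing the names from the scripts above.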

@lukapeschke
Collaborator Author

calamine 0.25.0 should be released soon, meaning I should finally be able to finish this 🙂 tafia/calamine#435
