Performance improvement #20

Open
vtoupet opened this issue Jul 22, 2021 · 4 comments

Comments

@vtoupet
Contributor

vtoupet commented Jul 22, 2021

I am using your library with Pandas. Performance is not that good (it takes 1-2 seconds to process a full year).
The reasons for this are:

  • operations are performed sequentially, although they could be partially vectorised.
  • everything is decoded, even though you usually don't need every field.

The way I see things:

  • use pandas.read_fwf for the mandatory (fixed-width) sections
  • use the apply method for the remaining part of the string (additional fields + remarks).
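To make the two steps above concrete, here is a minimal sketch. The record layout (column offsets, field names) is a toy one invented for illustration and is not the real ISH layout; `decode` and `split_tail` are hypothetical helpers, not part of the library. Only `pandas.read_fwf`, `pandas.to_datetime` and `Series.apply` are real pandas calls.

```python
import io

import pandas as pd

# Toy fixed-width records (hypothetical layout, NOT the real ISH layout):
#   [0:6]   station id
#   [6:14]  date, YYYYMMDD
#   [14:18] time, HHMM
#   [18:23] air temperature, signed, tenths of a degree C
#   [23:]   variable-length tail: "ADD" + additional fields, "REM" + remarks
RAW = (
    "725053202107221200+0251ADDAA101000095REMSYN004\n"
    "725053202107221300-0012ADDGF199REMMET010\n"
)
FIXED_WIDTH = 23


def split_tail(tail: str) -> dict:
    """Split the variable-length tail into additional fields and remarks."""
    additional, _, remarks = tail.partition("REM")
    if additional.startswith("ADD"):
        additional = additional[3:]
    return {"additional": additional, "remarks": remarks}


def decode(raw: str) -> pd.DataFrame:
    # Step 1: the fixed-width part is parsed in one call for all rows.
    fixed = pd.read_fwf(
        io.StringIO(raw),
        colspecs=[(0, 6), (6, 14), (14, 18), (18, 23)],
        names=["station", "date", "time", "temp_raw"],
        header=None,
        dtype=str,
    )
    # Column-wise conversions instead of a per-record loop.
    fixed["timestamp"] = pd.to_datetime(
        fixed["date"] + fixed["time"], format="%Y%m%d%H%M"
    )
    fixed["temp_c"] = fixed["temp_raw"].astype("int64") / 10.0

    # Step 2: the variable-length tail still needs a per-row function.
    tail = pd.Series(raw.splitlines()).str[FIXED_WIDTH:]
    tail_df = pd.DataFrame(tail.apply(split_tail).tolist())

    return pd.concat([fixed[["station", "timestamp", "temp_c"]], tail_df], axis=1)


if __name__ == "__main__":
    print(decode(RAW))
```

The point of the split is that only the tail pays the per-row Python cost; everything at a fixed offset is handled column by column.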

Usually, you know what information you are trying to get (and probably not every field that is present).
The idea would be to let the caller provide a list of desired fields. Based on that list, we could perform only the necessary decoding and return a pandas DataFrame (or a list of records).
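A rough sketch of what "decode only the requested fields" could look like, returning a list of records. The `FIELDS` registry, the offsets and the `decode_fields` function are all hypothetical and only illustrate the idea; they are not the library's API.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical registry: field name -> (slice into the record, converter).
# Offsets are a toy layout, not the real ISH layout.
FIELDS = {
    "station": (slice(0, 6), str),
    "date": (slice(6, 14), str),
    "time": (slice(14, 18), str),
    "temp_c": (slice(18, 23), lambda s: int(s) / 10.0),
}


def decode_fields(lines: Iterable[str], fields: List[str]) -> List[Dict[str, Any]]:
    """Decode only the requested fields from each record."""
    unknown = [name for name in fields if name not in FIELDS]
    if unknown:
        raise KeyError(f"unknown field(s): {unknown}")
    # Resolve the slices and converters once, outside the per-line loop.
    selected = [(name,) + FIELDS[name] for name in fields]
    return [
        {name: convert(line[where]) for name, where, convert in selected}
        for line in lines
    ]


if __name__ == "__main__":
    records = decode_fields(
        ["725053202107221200+0251", "725053202107221300-0012"],
        fields=["station", "temp_c"],
    )
    print(records)
```

Fields that are not requested are never sliced or converted, so the cost scales with the number of requested fields rather than with the size of the record.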

That would increase speed a lot.

Would you be interested in such a change to your library?

Thanks,
Vincent

@haydenth
Owner

i am 100% interested in this :)

@haydenth
Owner

OK, sat and thought about this for a few mins.

  • vectorized == parallelized or threaded?
  • love the idea of requesting only specific fields; that would speed it up dramatically

@vtoupet
Contributor Author

vtoupet commented Jul 23, 2021

Vectorized means "not scalar". Instead of applying a function to one scalar at a time while iterating over a list of scalars, we apply the same function to a whole vector (of dimension 1 x n) at once. This is the main principle of NumPy and pandas, and it is much quicker.

I'll try to get something started by September.
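A small illustration of the difference, assuming NumPy is available. The values and function names are made up for the example. Note that the vectorized version is neither threaded nor parallel here: it simply pushes the loop down into NumPy's compiled code.

```python
import numpy as np


def to_fahrenheit_scalar(tenths_c):
    """Scalar style: a Python-level loop, one value at a time."""
    return [t / 10.0 * 9.0 / 5.0 + 32.0 for t in tenths_c]


def to_fahrenheit_vectorized(tenths_c: np.ndarray) -> np.ndarray:
    """Vectorized style: one expression applied to the whole array."""
    return tenths_c / 10.0 * 9.0 / 5.0 + 32.0


if __name__ == "__main__":
    # Temperatures in tenths of a degree Celsius (made-up values).
    raw = np.array([251, -12, 187, 0])
    print(to_fahrenheit_scalar(raw.tolist()))
    print(to_fahrenheit_vectorized(raw))
```

Both functions compute the same result; the gain comes from avoiding the per-element Python overhead, which matters once there are many thousands of records.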

@haydenth
Owner

Oh, I see what you are saying. I don't think it would be crazy hard to make a layer above ish_report that vectorizes the individual ish_report objects so they can be used with a library like that.

2 participants