
Very Slow to Load/Import #37

Open
whatamithinking opened this issue Jan 11, 2019 · 4 comments

@whatamithinking

Python 3.6

Every time I import the library, it is extremely slow (~50 seconds).
I don't have time to review the codebase right now for a solution, so I am going to look for other libraries, but this really is a huge problem when you just want to rattle off some code quickly in the shell.

I tried with Python 3.7 from the Anaconda distribution and a standard Python 3.6, but both were exceedingly slow. I also tried calling cpi.update(), then closing and re-opening the shell and importing again, but it was still extremely slow.

@palewire
Owner

palewire commented Jan 11, 2019

I agree this is the major weakness of the library. The issue is that the CPI database from the BLS is quite large. It is packaged here as a SQLite file that ships with the code.

One alternative design approach would be a lazy-loading on-demand system, but due to the structure of the source files from the government that's not an easy option. Some kind of API would likely need to be launched and supported, or a drastically restructured version of the data would need to be shipped, perhaps one small CSV file per series.
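A minimal sketch of the one-small-CSV-per-series idea: ship one file per series id and read it only the first time that series is requested. The class, file layout, and column names here are hypothetical, not cpi's actual API.

```python
import csv
import os


class LazySeriesStore:
    """Hypothetical store: one CSV per series, parsed only on first access."""

    def __init__(self, data_dir):
        self.data_dir = data_dir
        self._cache = {}  # series_id -> {(year, period): value}

    def get(self, series_id):
        # Only pay the parse cost for series the user actually asks for.
        if series_id not in self._cache:
            path = os.path.join(self.data_dir, f"{series_id}.csv")
            with open(path, newline="") as fh:
                self._cache[series_id] = {
                    (int(row["year"]), row["period"]): float(row["value"])
                    for row in csv.DictReader(fh)
                }
        return self._cache[series_id]
```

With this layout, importing the package costs nothing; the first lookup of a series pays for that one file only.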

Another strategy would be to target users who are looking to make only the most rudimentary CPI calculations using the CPI-U series. I have no evidence to back this up, but I suspect this would represent the lion's share of use cases. Were that to be a driving design principle, a carved-out, slimmed-down version containing only that data series could be loaded at import, with the larger, more complete database loaded only when requested.
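That "small by default, everything on request" split can be sketched with a lazy attribute. In a real module you could do the same with PEP 562's module-level `__getattr__`; the loader functions and data below are placeholders, not cpi's internals.

```python
class CPIModule:
    """Sketch: CPI-U loads eagerly at import, the full database on first use."""

    def __init__(self):
        self.cpi_u = self._load_cpi_u()  # cheap default path
        self._full_db = None             # expensive path, deferred

    def _load_cpi_u(self):
        # Hypothetical fast path: parse only a slimmed-down CPI-U file.
        return {(2019, "M01"): 251.712}

    def _load_full_db(self):
        # Hypothetical slow path: stands in for parsing all 21 BLS files.
        return {"CUUR0000SA0": self.cpi_u}

    @property
    def full_db(self):
        # First access pays the full parse cost; later accesses are free.
        if self._full_db is None:
            self._full_db = self._load_full_db()
        return self._full_db
```

Users doing plain CPI-U math never touch `full_db` and never pay the 50-second parse.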

In the meantime, if you are working on a data science style research project, I would recommend using our library in a computational notebook, like Project Jupyter, where the import can be done a single time and saved into the environment.

@whatamithinking
Author

I am just playing around with my retirement funds, trying to model different indices to figure out which one to pick and I want to adjust the returns for inflation before I dump them into my model.

I am not quite sure I am following you, though. Is this module downloading the data every time it is imported? From a very brief look at the source code, it looked like it would only do that when the data was extremely out of date. If it is downloading every time, that seems unnecessary.

I think you could get away with lazy loading as you suggest, where you download maybe just the last few years and then, as the function is called for different dates, deal with it then. I am guessing there isn't a sweet little government API that could be queried on the fly to download just the values needed?
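There is, in fact, a public BLS JSON API (`https://api.bls.gov/publicAPI/v2/timeseries/data/`; unregistered requests are rate-limited), so on-the-fly fetching is at least plausible. Here is a hedged sketch of fetch-only-what-you-need with per-year caching; the payload shape is from memory of the BLS docs, and the cache class is hypothetical.

```python
import json
import urllib.request

BLS_API = "https://api.bls.gov/publicAPI/v2/timeseries/data/"


def fetch_series(series_id, start_year, end_year):
    """Fetch one series from the public BLS API (shape assumed from the docs)."""
    payload = json.dumps({
        "seriesid": [series_id],
        "startyear": str(start_year),
        "endyear": str(end_year),
    }).encode()
    req = urllib.request.Request(
        BLS_API, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return {
        (int(item["year"]), item["period"]): float(item["value"])
        for item in body["Results"]["series"][0]["data"]
    }


class OnDemandCPI:
    """Cache values per (series, year); hit the fetcher only for unseen years."""

    def __init__(self, fetcher=fetch_series):
        self._fetch = fetcher
        self._cache = {}

    def value(self, series_id, year, period):
        key = (series_id, year)
        if key not in self._cache:
            self._cache[key] = self._fetch(series_id, year, year)
        return self._cache[key][(year, period)]
```

A shell session would then only pay network cost for years it actually touches, instead of a 50-second up-front parse.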

@wrwetzel

I'm having the same problem. I'm working only with "Housing" and just in New Jersey. Would it be possible to partition the data by "items" and/or "area" so as to reduce the load time?
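One low-tech version of that partitioning is to filter rows while parsing, keeping only series whose id matches the wanted area and item. BLS CPI series ids pack survey, seasonality, periodicity, a four-character area code, and an item code together (e.g. CUUR0000SAH = CPI-U, U.S. city average, Housing); the codes and the function below are illustrative, not cpi's actual parser.

```python
def parse_filtered(rows, area_code="0000", item_prefix="SAH"):
    """Keep only rows whose series id matches the wanted area and item codes."""
    kept = {}
    for row in rows:
        sid = row["series_id"].strip()
        # Positions assumed from the BLS series-id layout:
        # CU (survey) + U/S (seasonality) + R/S (periodicity) + area(4) + item.
        area, item = sid[4:8], sid[8:]
        if area == area_code and item.startswith(item_prefix):
            kept.setdefault(sid, {})[(int(row["year"]), row["period"])] = (
                float(row["value"])
            )
    return kept
```

Everything outside the requested partition is skipped before any objects are built, which is where the profile below shows most of the time going.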

@gschivley

Another strategy would be to target users who are looking to make only the most rudimentary CPI calculations using the CPI-U series. I have no evidence to back this up, but I suspect this would represent the lion's share of use cases. Were that to be a driving design principle, a carved-out, slimmed-down version containing only that data series could be loaded at import, with the larger, more complete database loaded only when requested.

How can I help implement this? I use cpi in a project that needs to put data from several sources on a consistent $-year. Not having to write and maintain code for this is worth the ~40-50 seconds needed to load cpi, but it would be great to cut down the import time.

Here is the lprun output for parsing all 21 files in parsers.FILE_LIST, which happens on import and accounts for nearly all of the import time.

Total time: 46.1686 s
File: /Users/greg/opt/miniconda3/envs/powergenome/lib/python3.7/site-packages/cpi/parsers.py
Function: parse_indexes at line 180

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   180                                               def parse_indexes(self):
   181         1          7.0      7.0      0.0          logger.debug("Parsing index files")
   182                                                   # Loop through all the files ...
   183        22         72.0      3.3      0.0          for file in self.FILE_LIST:
   184                                                       # ... and for each file ...
   185   2607983    8193080.0      3.1     17.7              for row in self.get_file(file):
   186                                                           # Get the series
   187   2607962    8400675.0      3.2     18.2                  series = self.series_list.get_by_id(row['series_id'])
   188                                           
   189                                                           # Create an object
   190   2607962    1090450.0      0.4      2.4                  index = Index(
   191   2607962     885840.0      0.3      1.9                      series,
   192   2607962    1695671.0      0.7      3.7                      int(row['year']),
   193   2607962    2739748.0      1.1      5.9                      self.periods.get_by_id(row['period']),
   194   2607962    5248789.0      2.0     11.4                      float(row['value'])
   195                                                           )
   196                                           
   197                                                           # If the value has already been loaded ...
   198   2607962    8241058.0      3.2     17.8                  if index.date in series._indexes[index.period.type]:
   199                                                               # ... verify this value matches what we have ...
   200   1151078    5225930.0      4.5     11.3                      assert index == series._indexes[index.period.type][index.date]
   201                                                           else:
   202                                                               # ... and if the series doesn't have the index yet, add it.
   203   1456884    4447276.0      3.1      9.6                      series._indexes[index.period.type][index.date] = index

If only the cu.data.1.AllItems file is parsed, import time might drop to under 2 seconds:

Total time: 1.46556 s
File: /Users/greg/opt/miniconda3/envs/powergenome/lib/python3.7/site-packages/cpi/parsers.py
Function: parse at line 151

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   151                                               def parse(self):
   152         1     208850.0 208850.0     14.3          self.series_list = self.parse_series()
   153         1    1256705.0 1256705.0     85.7          self.parse_indexes()
   154         1          1.0      1.0      0.0          return self.series_list

Total time: 1.08048 s
File: /Users/greg/opt/miniconda3/envs/powergenome/lib/python3.7/site-packages/cpi/parsers.py
Function: parse_indexes at line 180

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   180                                               def parse_indexes(self):
   181         1          6.0      6.0      0.0          logger.debug("Parsing index files")
   182                                                   # Loop through all the files ...
   183         2          3.0      1.5      0.0          for file in self.FILE_LIST:
   184                                                       # ... and for each file ...
   185     57618     241452.0      4.2     22.3              for row in self.get_file(file):
   186                                                           # Get the series
   187     57617     252653.0      4.4     23.4                  series = self.series_list.get_by_id(row['series_id'])
   188                                           
   189                                                           # Create an object
   190     57617      21515.0      0.4      2.0                  index = Index(
   191     57617      18628.0      0.3      1.7                      series,
   192     57617      34797.0      0.6      3.2                      int(row['year']),
   193     57617      55336.0      1.0      5.1                      self.periods.get_by_id(row['period']),
   194     57617      99604.0      1.7      9.2                      float(row['value'])
   195                                                           )
   196                                           
   197                                                           # If the value has already been loaded ...
   198     57617     177766.0      3.1     16.5                  if index.date in series._indexes[index.period.type]:
   199                                                               # ... verify this value matches what we have ...
   200                                                               assert index == series._indexes[index.period.type][index.date]
   201                                                           else:
   202                                                               # ... and if the series doesn't have the index yet, add it.
   203     57617     178725.0      3.1     16.5                      series._indexes[index.period.type][index.date] = index
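The numbers above suggest making the parsed file list a parameter, so a slimmed-down default (one file) and the full 21-file parse share the same code path. The class below is a generic sketch of that shape; the real parser's class name and row handling in cpi may differ.

```python
class IndexParser:
    """Sketch: file list as a constructor argument, defaulting to one file."""

    DEFAULT_FILES = ("cu.data.1.AllItems",)

    def __init__(self, file_list=None, reader=None):
        self.file_list = list(file_list or self.DEFAULT_FILES)
        self._read = reader  # callable: filename -> iterable of row dicts

    def parse(self):
        indexes = {}
        for name in self.file_list:
            for row in self._read(name):
                key = (row["series_id"].strip(), int(row["year"]), row["period"])
                value = float(row["value"])
                # Mirror the duplicate check the profiled code does with assert.
                if key in indexes:
                    assert indexes[key] == value
                else:
                    indexes[key] = value
        return indexes
```

Callers wanting the full database would pass the complete file list explicitly; everyone else gets the fast single-file default at import.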
