Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converter for route_id #62

Open
TheFrok opened this issue Feb 13, 2020 · 10 comments
Open

Converter for route_id #62

TheFrok opened this issue Feb 13, 2020 · 10 comments

Comments

@TheFrok
Copy link

TheFrok commented Feb 13, 2020

  • partridge version: 0.11.0 (but also happens on 1.1.1)
  • Python version: 3.8
  • Operating System: Win 10

Description

I tried to change the types of the _id columns (i.e. route_id) in some table from dtype object to numeric, to lower the memory usage. I did that by adding a converter to the default config.
It went fine at, but the DataFrames came back empty. I looked into that a little bit and I think it is because the read_file method does the prune part before the type conversion, causing the comparison of object column (the column in the current table) with numeric column (from the dependency table that is type converted).
I'm not sure what would be the right solution for that, maybe changing both columns to object before comparison.

What I Did

In[1]: import partridge as ptg
In[2]: conf = ptg.config.default_config()
In[3]: ptg.load_feed(path, config=conf)
Out[3]:
  route_id agency_id route_short_name  ... route_desc route_type  route_color
0        1        25                1  ...  67001-1-#          3          NaN
1        2        25                1  ...  67001-2-#          3          NaN
2        3        25                2  ...  56002-1-#          3          NaN
[3 rows x 7 columns]

In[4]: import pandas as pd
In[5]: conf.nodes['trips.txt']['converters']['route_id'] = pd.to_numeric
In[6]: conf.nodes['routes.txt']['converters']['route_id'] = pd.to_numeric
In[7]: ptg.load_feed(path, config=conf).routes
Out[7]:
Empty DataFrame
Columns: [route_id, agency_id, route_short_name, route_long_name, route_desc, route_type, route_color]
Index: []
@invisiblefunnel
Copy link
Contributor

Thanks for reporting the issue. This does appear to be a bug. I will take a closer look in the next week or so.

I'm not sure what would be the right solution for that, maybe changing both columns to object before comparison.

You seem to have a good understanding of the internals of partridge. If you would like to implement your suggestion I'd gladly review it and work with you to get it merged.

@invisiblefunnel
Copy link
Contributor

An intermediate step could be to detect this case and raise a descriptive error. @TheFrok what do you think about that as a next step?

@TheFrok
Copy link
Author

TheFrok commented Feb 14, 2020

If you would like to implement your suggestion I'd gladly review it and work with you to get it merged.

I started working on it. After I gave it some thought I decided that since conversion function can't always be reversed it might not be a option to "reconvert" the column back to object so I want to convert both column and check the dependency after conversion.

I tried to push my changes but I don't have the right permission, so if you could please help me with that :)

@invisiblefunnel
Copy link
Contributor

invisiblefunnel commented Feb 16, 2020

@TheFrok I recommend using the workflow described here: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request-from-a-fork. You can fork this repo into your account, push a branch to your repo, and create a pull request into this repo.

@TheFrok
Copy link
Author

TheFrok commented Feb 16, 2020

Cool, I'll do that.
@invisiblefunnel I noticed that in _load_feed you create some "dummy" feed as a source for the real one, but delete the converters configuration from that feed. Can you explain why is it necessary, it is causing me a bit of trouble in the change I'm trying to make.

@invisiblefunnel
Copy link
Contributor

I need to refresh my memory on some of the codebase. Let me take a closer look this week and make some recommendations. Thanks for your patience.

@invisiblefunnel
Copy link
Contributor

invisiblefunnel commented Feb 16, 2020

Testing some ideas on a new branch. It's a work in progress.

https://github.com/remix/partridge/compare/dw/fix-pruning-dtype-mismatch

@TheFrok
Copy link
Author

TheFrok commented Feb 17, 2020

You can take a look at this - master...TheFrok:id-column-converter .
I thought you would like to avoid converting types before pruning,
and also because of the "dummy" feed the pruning is run twice, which caused me a bit of trouble.

@TheFrok
Copy link
Author

TheFrok commented Feb 17, 2020

https://github.com/remix/partridge/compare/dw/fix-pruning-dtype-mismatch

@invisiblefunnel I tried to use that, and it worked for me except for the fact that I got an error when tried to prune int8 with int16. But that suppose simpler to fix :)

@invisiblefunnel
Copy link
Contributor

invisiblefunnel commented Feb 17, 2020

@TheFrok I prefer your approach because it doesn't interfere with filtering/pruning! Can you make a pull request? I will review and leave comments later this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants