Investigate alternatives to xarray to handle ProcessedVariable computations #3913

Open
agriyakhetarpal opened this issue Mar 20, 2024 · 4 comments
Labels
difficulty: medium Will take a few days priority: medium To be resolved if time allows

Comments

@agriyakhetarpal
Member

agriyakhetarpal commented Mar 20, 2024

Recently, #3892 highlighted that pandas was being installed as an implicit required dependency for PyBaMM, because it was a required dependency for one of our required dependencies (xarray). pandas was otherwise listed as an optional dependency with the [pandas] extra and is currently used only for handling CSV files.

This dependence on xarray is particularly concerning, because:

  1. If pandas acts upon PDEP-10 with v3 and makes pyarrow a required dependency, it would drastically increase the download size for PyBaMM (pyarrow wheels across platforms are at least 120 megabytes in size).
  2. This would cause complications if things like Pyodide support are considered, where running PyBaMM in the browser would require excess bandwidth and slow down usage workflows. It would also affect regular users a bit in Google Colab, where Python virtual environments and dependencies are not saved or cached.
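
As a rough way to see which packages a distribution pulls in unconditionally (as opposed to through an extra), the installed metadata can be inspected with the standard library. This is only a sketch: the helper names are made up for illustration, and the marker check is a simple heuristic rather than a full requirement parser.

```python
import re
from importlib import metadata


def split_requirement(req: str) -> tuple[str, str]:
    """Split a requirement string into (lowercased package name, marker)."""
    spec, _, marker = req.partition(";")
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", spec)
    name = match.group(1).lower() if match else spec.strip().lower()
    return name, marker.strip()


def hard_requirements(requirements: list[str]) -> list[str]:
    """Return names of requirements that are not gated behind an extra.

    Heuristic: a requirement is treated as optional if its environment
    marker mentions "extra" at all.
    """
    names = []
    for req in requirements:
        name, marker = split_requirement(req)
        if "extra" not in marker:
            names.append(name)
    return names


def installed_hard_requirements(dist_name: str) -> list[str]:
    """Hard requirements of an installed distribution, or [] if not installed."""
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []
    return hard_requirements(reqs)


# Illustrative requirement strings (made up, not copied from any real package).
example = ["numpy>=1.23", "pandas>=1.5", 'netCDF4; extra == "io"']
print(hard_requirements(example))
```

Calling `installed_hard_requirements("xarray")` in an environment where xarray is installed should then list pandas among the results, if the situation described in this issue applies to the installed version.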

Prior to the use of xarray (see #2366) as a backend for the ProcessedVariable and the ProcessedVariableComputed classes, the scipy.interpolate module was being used – which could be an option to return to.
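
To make the scipy.interpolate option a bit more concrete, here is a minimal sketch of interpolating a variable stored on a (time, space) grid without xarray. It is not how ProcessedVariable is or was implemented; the grid, the data, and the `evaluate` helper are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical solver output: a variable sampled on a (time, space) grid.
t = np.linspace(0.0, 10.0, 11)  # time points
x = np.linspace(0.0, 1.0, 5)    # spatial nodes
# Linear test data, so linear interpolation reproduces it exactly.
values = t[:, None] + 2.0 * x[None, :]  # shape (len(t), len(x))

# Return NaN outside the grid instead of raising an error.
interp = RegularGridInterpolator(
    (t, x), values, method="linear", bounds_error=False, fill_value=np.nan
)


def evaluate(t_query, x_query):
    """Evaluate the variable at paired (t, x) query points."""
    t_arr, x_arr = np.broadcast_arrays(
        np.atleast_1d(np.asarray(t_query, dtype=float)),
        np.atleast_1d(np.asarray(x_query, dtype=float)),
    )
    points = np.stack([t_arr.ravel(), x_arr.ravel()], axis=-1)
    return interp(points).reshape(t_arr.shape)


print(evaluate(2.5, 0.25))  # value of t + 2x at t=2.5, x=0.25
```

This only needs numpy and scipy, which is the point of the suggestion; whether it covers everything the xarray-based backend does would need to be checked against the actual classes.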

There is time until pandas decides on this and also until we release v24.5, so we can take into account some of the developments around this area as they arise (as discussed in the technical roadmap meeting on 18/03/2024).

@agriyakhetarpal agriyakhetarpal added difficulty: medium Will take a few days priority: medium To be resolved if time allows labels Mar 20, 2024
@kratman
Contributor

kratman commented Mar 20, 2024

What is pyodide being used for if it is an issue?

I have used pyarrow and pandas in a lot of web-based apps without issue. Both pandas and pyarrow are pretty common in data science, so I know these get used in web/notebook applications on a regular basis.

@agriyakhetarpal
Member Author

> What is pyodide being used for if it is an issue?

It's not being used by us currently, but as a part of my work assignment I am extending support for it across a lot of PyData projects and across the Scientific Python ecosystem (please see Quansight-Labs/czi-scientific-python-mgmt#18 and Quansight-Labs/czi-scientific-python-mgmt#19). PyBaMM isn't quite there yet, because we have CasADi as a dependency, which is tricky to compile to WASM; if it becomes optional, we could move things forward on that (see #3826). The best and most stable example of where you can see Pyodide currently is on any of the usage examples in the scikit-learn documentation, where you can run the interactive docs via client-side JupyterLite notebooks.

> I have used pyarrow and pandas in a lot of web-based apps without issue. Both pandas and pyarrow are pretty common in data science, so I know these get used in web/notebook applications on a regular basis.

There's no issue as such if you do so locally for any data science workflows, because the pyarrow backend is extremely fast, but (1) those with unstable connections can have issues running such notebooks online, and (2) having a heavy (required) dependency graph in general isn't good for any library (packaging and distribution, for example, are among the affected areas). But this is a smaller part of the picture; some of the responses on pandas-dev/pandas#54466 are quite insightful in this regard.

@kratman
Contributor

kratman commented Mar 20, 2024

Yeah, if we are going to drop xarray then using scipy or numpy native features would be good. However, it looks like we use pandas directly in a bunch of files, so it is not just due to xarray. I think if you want to make pandas optional, then you would need to remove pandas from a bunch of places (notebooks, tests, etc.) and not just remove xarray.

Pandas can be useful for analysis and plotting, so we should probably think about whether it is useful on the whole to include it, and check whether the download size is actually a concern for our users. Realistically, optional dependencies just make things more complicated. Unless we have fully optional modules, we should try to just remove problematic libraries altogether.

@agriyakhetarpal
Member Author

We did have pandas as an optional dependency before #3892, didn't we? I imagine it should not be a lot of work to make it fully optional again with the import_optional_dependency wrapper. Or are we using it in a notebook where we haven't installed it in the introductory code cell?
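
For readers unfamiliar with the pattern, a wrapper of this kind could look roughly like the sketch below. The signature, error message, and the `load_csv` example are assumptions for illustration and may differ from PyBaMM's own helper.

```python
import importlib
from typing import Any, Optional


def import_optional_dependency(module_name: str, attribute: Optional[str] = None) -> Any:
    """Import a module only when it is needed, with a clearer error if missing.

    Returns the module itself, or one attribute of it if `attribute` is given.
    """
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        raise ModuleNotFoundError(
            f"Optional dependency '{module_name}' is not available. "
            "Install it to use this feature."
        ) from error
    if attribute is None:
        return module
    return getattr(module, attribute)


def load_csv(path: str):
    """Hypothetical CSV loader: pandas is imported only when this is called."""
    pd = import_optional_dependency("pandas")
    return pd.read_csv(path)
```

With this approach, importing the package never touches pandas; only code paths that actually read CSV files would require it to be installed.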

A lot of the plotting dependencies (for example, matplotlib) were set as optional so that you were not forced to use them, and therefore you could use libraries like holoviz, pyvista, altair, seaborn, or any others of your choice offering a plotting backend and a graphics module. It is still optional at this time, but in PyBaMM's history before v23.5 it was one of the "truly" optional dependencies (though we didn't have a list of optional dependencies back then).

@agriyakhetarpal agriyakhetarpal changed the title Investigate alternatives for xarray to handle ProcessedVariable computations Investigate alternatives to xarray to handle ProcessedVariable computations Mar 21, 2024