
Lazy computation for integration with xarray / dask #56

Open
gmacgilchrist opened this issue Aug 19, 2021 · 8 comments
Labels
enhancement New feature or request

@gmacgilchrist

I am attempting to do computations of the carbonate system in model output. I use xarray to load and analyze these data with lazy computations, meaning that data are only loaded into memory as a final step, and then only the data required for the computation. This makes parallelization of the computation easy when working with large datasets.

This functionality doesn't seem to play nicely with PyCO2SYS (or vice versa!), which instead loads all the data into memory when pyco2.sys is called. I'm not sure how to put together a good minimal working example here, because the issue only shows up when loading a real dataset. I can try to put something together if the issue is not clear.

I'm wondering if there is a plan or interest to integrate the functionality of PyCO2SYS with that of xarray? That is, to allow the calculations inherent in PyCO2SYS to be performed lazily. I would be happy to help implement this, although I am no expert. On the other hand, I would be happy to hear if there is a workaround that I am missing.

Tagging @ognancy4life and @jbusecke as folk who might be interested in this problem.

P.S. Thanks so much for implementing this package - it's a great resource!

@jbusecke

Is the calculation applied elementwise (i.e. could I loop through all positions of an array and get the same output as if I fed in the full array)?
If so, I would recommend xarray's map_blocks. This would apply PyCO2SYS to each chunk separately (and keep it lazy until actually computed).
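For concreteness, here is a minimal sketch of the map_blocks approach with a dummy elementwise function standing in for a PyCO2SYS wrapper (the function and data are illustrative only; dask must be installed for chunking):

```python
import numpy as np
import xarray as xr

def add_one(block):
    # Dummy elementwise stand-in for a PyCO2SYS wrapper: it just needs to
    # return an object with the same shape/dims as the template.
    return block + 1.0

da = xr.DataArray(np.arange(6.0), dims="x").chunk({"x": 3})  # two chunks
lazy = da.map_blocks(add_one, template=da)  # still lazy at this point
out = lazy.compute()  # each chunk is processed separately here
```

Until .compute() is called, no data are loaded and no arithmetic runs, which is what keeps the workflow memory-friendly for large datasets.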

@gmacgilchrist

Thanks @jbusecke. My guess is that yes, it is applied element-wise, but will need @mvdh7 to confirm. I will try to implement map_blocks and see if it brings any improvement.

@gmacgilchrist

This is a lovely interim solution, @jbusecke, thanks!

Here's a short example with a simple wrapper function for PyCO2SYS, where ds is my xarray dataset with all of the required arguments and I am interested in getting only the pCO2_out variable from the results:

def pyco2sys_wrapper(ds):
    # `conversion` (unit conversion factor) and `pyco2` (PyCO2SYS) are
    # assumed to be defined in the surrounding scope.
    args = {"par1": ds["talk"] * conversion,
            "par2": ds["dissic"] * conversion,
            "par1_type": 1,
            "par2_type": 2,
            "pressure_out": 0,
            "temperature_out": ds["thetao"],
            "pressure": ds["p"],
            "temperature": ds["insitutemp"]}
    results = pyco2.sys(**args)
    return ds["talk"].copy(data=results["pCO2_out"])

ds["PpCO2"] = xr.map_blocks(pyco2sys_wrapper, ds, template=ds["talk"])

This could be generalized to output the full "results" dictionary as an xarray Dataset; it would just need a bit of work to carry over names, dimensions, etc.
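As a rough sketch of that generalization (the results dict below is a dummy stand-in for pyco2.sys output, and the keep-only-matching-shapes rule is an assumption about which entries belong in the Dataset):

```python
import numpy as np
import xarray as xr

def results_to_dataset(results, template):
    # Wrap a PyCO2SYS-style results dict into an xarray Dataset, copying
    # dims/coords from a template DataArray. Entries whose shape does not
    # match the template (scalar settings, etc.) are skipped.
    ds_out = xr.Dataset()
    for name, value in results.items():
        value = np.asarray(value)
        if value.shape == template.shape:
            ds_out[name] = template.copy(data=value)
    return ds_out

# Dummy stand-in for a pyco2.sys results dict:
template = xr.DataArray(np.zeros((2, 3)), dims=("y", "x"))
results = {"pCO2_out": np.full((2, 3), 400.0), "par1_type": 1}
ds_out = results_to_dataset(results, template)
```

The same helper could be returned from a wrapper passed to xr.map_blocks, with a Dataset (rather than a single DataArray) as the template.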

@mvdh7

mvdh7 commented Sep 7, 2021

Hey @gmacgilchrist, sorry for the delay. I'm a fairly novice xarray user and don't have a better solution than what you and @jbusecke have already come up with. I think it would be a great thing to add to the package. I understand your example code above well enough to generalise it to the full results dict but am lacking in time to implement this at the moment, so if you do have time to get it set up or started that would be very welcome. Otherwise, I'll get round to it... eventually!

@mvdh7 mvdh7 added the enhancement New feature or request label Sep 7, 2021
@lukegre

lukegre commented Sep 16, 2021

I'm happy to help out here! @gmacgilchrist, have you made any headway on this?

@gmacgilchrist

I haven't revisited this yet, and don't expect that I will get to it in the next couple of weeks. But I would definitely value your help @lukegre when I get back to it.

Do you have any sense for how many people are using this package to work with model data? I'm interested to know in general if other folk have come up against memory/computation issues. I know that @ognancy4life came up against some similar issues, but I'd like to get a sense for how broadly useful it would be to make the code workable with big data, or if ~99% of users are working with observations and would likely never come up against such problems.

In the end, even though I got the computation to run lazily (based on @jbusecke's solution) so that it played nicely with xarray, I still ran into fatal memory issues when finally performing the calculation. I think these were related to the fact that the package attempts to calculate all of the variables at once, leading to the creation of at least 100 numpy arrays per computation. In the end, we did some temporal averaging of the data prior to doing the CO2SYS calculations, just to be able to move forward.

@ognancy4life

Hi all, thanks for following up on this! My understanding is that most CO2 system calculations in Earth system models use an incomplete or approximated CO2 system for this very reason. I hope we can arrive at a solution for running PyCO2SYS on xarray data arrays, because I do like having the full system and all the bells and whistles like uncertainty estimates, but a full CO2 system calculation may not be necessary in most cases. Have you checked whether any of the other "lighter" CO2 system calculators, like mocsy, might be available in Python and/or more easily adapted to play nicely with xarray? I know there are others I've run across when reading ocean BGC model documentation papers, but their names escape me.

@mvdh7

mvdh7 commented Dec 12, 2021

Hi @gmacgilchrist and others,

> Do you have any sense for how many people are using this package to work with model data? I'm interested to know in general if other folk have come up against memory/computation issues. I know that @ognancy4life came up against some similar issues, but I'd like to get a sense for how broadly useful it would be to make the code workable with big data, or if ~99% of users are working with observations and would likely never come up against such problems.

I still don't know the answer to this.

> In the end, even though I got the computation to run lazily (based on @jbusecke's solution) so that it played nicely with xarray, I still came up against terminal memory issues when finally performing the calculation. I think these were related to the fact that the package attempts to calculate all of the variables at once, leading to the creating of at least 100 numpy arrays at each computation. In the end, we did some temporal averaging of the data prior to doing the CO2SYS calculations, just to be able to move forward.

Although you can cut the number of variables calculated by only returning the outputs you're interested in, probably at least half of the calculated variables are essential for most calculations (e.g. total salt contents and equilibrium constants) and so unavoidable. Could the fact that these intermediate variables are stored within dicts be contributing to the issue?

Anyway, if you are in fact only interested in one or two output variables, it will probably be most efficient to write new functions that only do the bare minimum calculation and can be used with xarray's apply_ufunc or map_blocks. For the former approach I have just created a first example. If you can try it out on your dataset and let me know if it does what you need then I will look into generalising it for more outputs. For now, you'd need to clone the develop branch of this repo and then use that to run something like:

pH = xr.apply_ufunc(
    pyco2.minimal.pH_from_alkalinity_dic,
    ds.alkalinity,
    ds.dic,
    kwargs={"temperature": ds.temperature, "salinity": ds.salinity},
    dask="parallelized",
)

The function pyco2.minimal.pH_from_alkalinity_dic accepts only a subset of the pyco2.sys kwargs; see its signature to see what can be used. Input/output conditions aren't possible (nor are they necessary) in this example.

PS: you can roughly halve the computation time for your example script (from 23 Aug) by using temperature=ds['thetao'] and taking only the pCO2 result instead of pCO2_out - you do not need to provide both temperature and temperature_out in this example.
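To illustrate how this apply_ufunc pattern stays lazy, here is a self-contained sketch with a made-up stand-in function (the formula and values are purely illustrative; pyco2.minimal.pH_from_alkalinity_dic from the develop branch would slot in the same way, and dask must be installed for the chunking):

```python
import numpy as np
import xarray as xr

def fake_ph(alkalinity, dic):
    # Made-up elementwise formula standing in for the real pH calculation.
    return 8.1 - 0.001 * (dic - alkalinity)

ds = xr.Dataset(
    {
        "alkalinity": ("x", np.full(4, 2300.0)),
        "dic": ("x", np.full(4, 2100.0)),
    }
).chunk({"x": 2})  # dask-backed, two chunks

pH = xr.apply_ufunc(
    fake_ph,
    ds.alkalinity,
    ds.dic,
    dask="parallelized",
    output_dtypes=[float],
)
# pH is still a lazy, dask-backed DataArray; nothing runs until:
result = pH.compute()
```

With dask="parallelized", apply_ufunc applies the function chunk by chunk, so only one chunk's worth of intermediate arrays needs to be in memory at a time.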
