
Optimize make_geocube() when converting a GeoDataFrame with multiple columns #56

Open
openSourcerer9000 opened this issue Feb 10, 2021 · 2 comments
Labels
proposal Idea for a new feature.

Comments

@openSourcerer9000

Hello there, I stumbled on this library from a stackoverflow answer you posted and I have to say I really dig it.

I'm working on a project where I have a GDF that needs to get rasterized, and this turned out to be the perfect solution. However, the GDFs coming through the pipeline will be kind of big (150K rows x 700 columns). Right now the rasterization step is the bottleneck: it takes a little over an hour, while the other operations finish in minutes. We can cut down the resolution of some of this data on our end, but it seems like there could be room to optimize the function.

For example, one column with the 150K shapely Point features rasterizes in about 5 seconds using the 'nearest' interpolation method. I believe it should be possible to run the 5-second algorithm that aligns the outgoing grid with the 'nearest' vector features just once, and then apply the same pattern across the other n columns of data, so that as n scales up, the time to execute the function doesn't scale with it.

For the 'nearest' option, imagine rasterizing a numbered index associated with the geometry features, and then simply mapping the remaining columns to the new pattern (something akin to pandas' take()). I'm not sure what's going on exactly under the hood, but I imagine something analogous could be done for the other interpolation options.
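To make the idea concrete, here is a toy numpy sketch of that take()-style mapping — this is not how geocube works internally, and the arrays below are made up, standing in for the rasterized index grid and one data column:

```python
import numpy as np

# Hypothetical: 'index_raster' is the geometry row index rasterized once
# (the expensive step), and 'values' is one column of the GeoDataFrame.
index_raster = np.array([[0, 0, 1],
                         [2, 2, 1]])
values = np.array([10.0, 20.0, 30.0])

# Mapping any remaining column onto the grid is then a single gather,
# equivalent to np.take(values, index_raster):
col_raster = values[index_raster]
```

Each extra column only costs one of these gathers, so the per-column cost no longer includes the geometry alignment.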

@openSourcerer9000 openSourcerer9000 added the proposal Idea for a new feature. label Feb 10, 2021
@snowman2
Member

700 columns

That is a lot of columns.

For the 'nearest' option, imagine rasterizing a numbered index associated with the geometry features, and then simply mapping the remaining columns to the new pattern (something akin to pandas' take()).

Sounds like a definite possibility. One issue you will run into is NULL values: you would likely have to append a dummy row filled with NaN to the end of your dataframe, and then fill the NULL cells in the index raster with -1 or something, so those cells map to the NaN row.
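A minimal numpy sketch of that NaN-dummy-row trick, assuming the NULL cells of the index raster have already been filled with -1 (all arrays here are invented for illustration):

```python
import numpy as np

# Hypothetical grid of rasterized row indices; -1 marks cells with no feature.
index_raster = np.array([[0, -1],
                         [1,  1]])
values = np.array([10.0, 20.0])

# Append a NaN "dummy row" so that index -1 (numpy's last element)
# maps to NaN instead of a real value.
values_padded = np.append(values, np.nan)
col_raster = values_padded[index_raster]
```

This works because -1 indexes the last element in numpy, which is exactly the appended NaN slot.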

I'm not sure what's going on exactly under the hood, but I imagine something analogous could be done for the other interpolation options.

This would get very complex very quickly. You would have to keep track of the weight of each row in the dataframe for each cell in the raster.
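For what it's worth, a toy sketch of what that per-cell bookkeeping might look like for a weighted interpolation scheme — the contributing row indices and weights below are invented for illustration, not derived from geocube:

```python
import numpy as np

# Hypothetical per-cell bookkeeping: each raster cell stores the indices
# of the dataframe rows that contribute to it, and their weights.
cell_indices = np.array([[0, 1],
                         [1, 2]])          # contributing row indices per cell
cell_weights = np.array([[0.75, 0.25],
                         [0.5,  0.5]])     # weight of each contribution
values = np.array([10.0, 20.0, 30.0])      # one column of the dataframe

# Reusing the stored weights for another column is a weighted gather:
cell_value = (values[cell_indices] * cell_weights).sum(axis=1)
```

The gather itself is cheap, but storing indices and weights for every cell is where the complexity (and memory cost) comes in.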

@openSourcerer9000
Author

openSourcerer9000 commented Mar 13, 2021

So this worked for my purposes. The rasterization went from ~75min to under 2min.


from geocube.api.core import make_geocube
import xarray as xr
from datetime import datetime

def rastarize(gdf, gridsize=0.0003):
    '''
    vector -> raster

    gdf: single gpd geodataframe
    gridsize: resolution in degrees of output grid
    '''
    nowe = datetime.now()

    # rasterize only the index against the geometry (the expensive step, done once)
    justdex = gdf[['geometry']].reset_index()

    res = (-gridsize, gridsize)
    coorstr = str(justdex.crs)
    dexray = make_geocube(justdex, output_crs=coorstr,
                          interpolate_na_method='nearest', resolution=res)

    # xarray of fields related to index
    colLookup = gdf.drop(columns='geometry').to_xarray()

    # merge on the index to populate corresponding columns to geocube,
    # without having to interpolate all of them
    mergd = colLookup.sel({'index': dexray.to_array()})
    # drop extra coords/dims:
    mergd = mergd.drop_vars('index')
    assert len(mergd['variable'].values) == 1
    ds = mergd.sel(variable=mergd['variable'].values[0]).drop_vars('variable')

    timd = datetime.now() - nowe
    print(f'Finished rasterizing in {timd}')
    return ds

And then a test that the output is the same as the unoptimized make_geocube. This could be beefed up to account for NULL values.


import numpy as np
import geopandas as gpd
import xarray as xr
from shapely.geometry import Point
from geocube.api.core import make_geocube

def test_rastarize():
    gridsize=0.0003
    n = 100
    np.random.seed(7)
    xrand = np.random.rand(n)*17
    np.random.seed(330)
    yrand = np.random.rand(n)*10
    gdf = [Point(-95-x*0.5*gridsize,30+y*0.5*gridsize) for x,y in zip(xrand,yrand) ]
    gdf = gpd.GeoDataFrame(geometry = gdf, crs = 'EPSG:4326')
    gdf['a'] = np.random.uniform(range(len(gdf))).round(2)
    gdf['b'] = gdf['a']*2
    gdf.index = np.round(gdf.index*.7,2)

    optimizedGeocube = rastarize(gdf,gridsize)

    res=(-gridsize,gridsize)
    coorstr = str(gdf.crs)
    testfull = make_geocube(gdf,output_crs=coorstr,interpolate_na_method='nearest',resolution=res)

    xr.testing.assert_equal(testfull, optimizedGeocube)
