
Optimize make_geocube() when converting a GeoDataFrame with multiple columns #56

Open
openSourcerer9000 opened this issue Feb 10, 2021 · 2 comments
Labels
proposal Idea for a new feature.

Comments

@openSourcerer9000

Hello there, I stumbled on this library from a stackoverflow answer you posted and I have to say I really dig it.

I'm working on a project where I have a GDF that needs to get rasterized, and this turned out to be the perfect solution. However, the GDFs coming through the pipeline will be kind of big (150K rows x 700 columns). Right now the rasterization step is the bottleneck: it takes a little over an hour, while the other operations finish in minutes. We can cut down the resolution of some of this data on our end, but it seems like there could be room to optimize the function.

For example, one column with the 150K shapely Point features rasterizes in about 5 seconds using the 'nearest' interpolation method. I believe it should be possible to run the 5-second algorithm that aligns the outgoing grid with the 'nearest' vector features just once, and then apply the same pattern across the other n columns of data, so that as n scales up, the time to execute the function doesn't scale with it.

For the 'nearest' option, imagine rasterizing a numbered index associated with the geometry features, and then simply mapping the remaining columns to the new pattern (something akin to pandas' take()). I'm not sure what's going on exactly under the hood, but I imagine something analogous could be done for the other interpolation options.
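To make the idea concrete, here is a toy numpy sketch of that take()-style mapping — this is not how geocube works internally, and the arrays below are made up, standing in for the rasterized index grid and one data column:

```python
import numpy as np

# Hypothetical: 'index_raster' is the geometry row index rasterized once
# (the expensive step), and 'values' is one column of the GeoDataFrame.
index_raster = np.array([[0, 0, 1],
                         [2, 2, 1]])
values = np.array([10.0, 20.0, 30.0])

# Mapping any remaining column onto the grid is then a single gather,
# equivalent to np.take(values, index_raster):
col_raster = values[index_raster]
```

Each extra column only costs one of these gathers, so the per-column cost no longer includes the geometry alignment.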

@openSourcerer9000 openSourcerer9000 added the proposal Idea for a new feature. label Feb 10, 2021
@snowman2
Member

700 columns

That is a lot of columns.

For the 'nearest' option, imagine rasterizing a numbered index associated with the geometry features, and then simply mapping the remaining columns to the new pattern (something akin to pandas' take()).

Sounds like a definite possibility. One issue you will run into is NULL values: you would likely have to append a dummy row filled with NaN to the end of your dataframe, and then fill the NULL cells in the index raster with -1 or something, so those cells map to the NaN row.
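A minimal numpy sketch of that NaN-dummy-row trick, assuming the NULL cells of the index raster have already been filled with -1 (all arrays here are invented for illustration):

```python
import numpy as np

# Hypothetical grid of rasterized row indices; -1 marks cells with no feature.
index_raster = np.array([[0, -1],
                         [1,  1]])
values = np.array([10.0, 20.0])

# Append a NaN "dummy row" so that index -1 (numpy's last element)
# maps to NaN instead of a real value.
values_padded = np.append(values, np.nan)
col_raster = values_padded[index_raster]
```

This works because -1 indexes the last element in numpy, which is exactly the appended NaN slot.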

I'm not sure what's going on exactly under the hood, but I imagine something analogous could be done for the other interpolation options.

This would get very complex very quickly. You would have to keep track of the weight of each row in the dataframe for each cell in the raster.
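For what it's worth, a toy sketch of what that per-cell bookkeeping might look like for a weighted interpolation scheme — the contributing row indices and weights below are invented for illustration, not derived from geocube:

```python
import numpy as np

# Hypothetical per-cell bookkeeping: each raster cell stores the indices
# of the dataframe rows that contribute to it, and their weights.
cell_indices = np.array([[0, 1],
                         [1, 2]])          # contributing row indices per cell
cell_weights = np.array([[0.75, 0.25],
                         [0.5,  0.5]])     # weight of each contribution
values = np.array([10.0, 20.0, 30.0])      # one column of the dataframe

# Reusing the stored weights for another column is a weighted gather:
cell_value = (values[cell_indices] * cell_weights).sum(axis=1)
```

The gather itself is cheap, but storing indices and weights for every cell is where the complexity (and memory cost) comes in.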

@openSourcerer9000
Author

openSourcerer9000 commented Mar 13, 2021

So this worked for my purposes. The rasterization went from ~75min to under 2min.


from geocube.api.core import make_geocube
import xarray as xr
from datetime import datetime

def rastarize(gdf, gridsize=0.0003):
    '''
    vector -> raster

    gdf: single gpd geodataframe
    gridsize: resolution in degrees of output grid
    '''
    nowe = datetime.now()

    # rasterize only the index against the geometry (the expensive step, done once)
    justdex = gdf[['geometry']].reset_index()

    res = (-gridsize, gridsize)
    coorstr = str(justdex.crs)
    dexray = make_geocube(justdex, output_crs=coorstr,
                          interpolate_na_method='nearest', resolution=res)

    # xarray of fields related to index
    colLookup = gdf.drop(columns='geometry').to_xarray()

    # merge on the index to populate corresponding columns to geocube,
    # without having to interpolate all of them
    mergd = colLookup.sel({'index': dexray.to_array()})
    # drop extra coords/dims:
    mergd = mergd.drop_vars('index')
    assert len(mergd['variable'].values) == 1
    ds = mergd.sel(variable=mergd['variable'].values[0]).drop_vars('variable')

    timd = datetime.now() - nowe
    print(f'Finished rasterizing in {timd}')
    return ds

And then a test that the output is the same as the unoptimized make_geocube. This could be beefed up to account for NULL values.


import numpy as np
import geopandas as gpd
import xarray as xr
from shapely.geometry import Point
from geocube.api.core import make_geocube

def test_rastarize():
    gridsize=0.0003
    n = 100
    np.random.seed(7)
    xrand = np.random.rand(n)*17
    np.random.seed(330)
    yrand = np.random.rand(n)*10
    gdf = [Point(-95-x*0.5*gridsize,30+y*0.5*gridsize) for x,y in zip(xrand,yrand) ]
    gdf = gpd.GeoDataFrame(geometry = gdf, crs = 'EPSG:4326')
    gdf['a'] = np.random.uniform(range(len(gdf))).round(2)
    gdf['b'] = gdf['a']*2
    gdf.index = np.round(gdf.index*.7,2)

    optimizedGeocube = rastarize(gdf,gridsize)

    res=(-gridsize,gridsize)
    coorstr = str(gdf.crs)
    testfull = make_geocube(gdf,output_crs=coorstr,interpolate_na_method='nearest',resolution=res)

    xr.testing.assert_equal(testfull, optimizedGeocube)
