Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

polygon_to_array returns shorter y dimension than source rasters #311

Open
WY-CGhilardi opened this issue Apr 26, 2024 · 26 comments · Fixed by #310
Open

polygon_to_array returns shorter y dimension than source rasters #311

WY-CGhilardi opened this issue Apr 26, 2024 · 26 comments · Fixed by #310

Comments

@WY-CGhilardi
Copy link

Version info:

$ pip list | grep geowombat
geowombat                    2.1.19

Originally ran into this issue when following this example in the docs

I have a time series of 5 banded imagery

with gw.open(
    filenames,
    band_names=['band1','band2','band3','band4','band5']
) as src:
	print(src)

This returns the following array with dimensions 20596 by 16400

<xarray.DataArray (time: 12, band: 5, y: 16400, x: 20596)>
dask.array<concatenate, shape=(12, 5, 16400, 20596), dtype=uint16, chunksize=(1, 5, 1, 20596), chunktype=numpy.ndarray>
Coordinates:
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * time     (time) int64 1 2 3 4 5 6 7 8 9 10 11 12
  * band     (band) <U5 'band1' 'band2' 'band3' 'band4' 'band5'
Attributes: (12/14)
    transform:           (3.3217508595230054, 0.0, 455302.1731317809, 0.0, -3...
    crs:                 6339
    res:                 (3.3217508595230054, 3.321750859522998)
    is_tiled:            1
    nodatavals:          (0.0, 0.0, 0.0, 0.0, 0.0)
    _FillValue:          0.0
    ...                  ...
    filename:            ['1_2023_6339.tif', '2_2023_6339.tif', '3_2023_6339....
    resampling:          nearest
    AREA_OR_POINT:       Area
    COLORINTERP:         Blue
    _data_are_separate:  1
    _data_are_stacked:   1

I have a geodataframe I am loading from a geopackage

training_polys = gpd.read_file('training.gpkg', layer='polys')

training_polys.head()

   OBJECTID   class      geometry
0         1      2   MULTIPOLYGON (((473555.422 5079549.614, 473559...  
1         2      2   MULTIPOLYGON (((494340.857 5099531.053, 494388...  
2         3      2   MULTIPOLYGON (((478086.790 5063840.854, 478218...  
3         4      2   MULTIPOLYGON (((474459.725 5093186.530, 474563...
4         5      2   MULTIPOLYGON (((498999.456 5085437.285, 499009... 

When I try to fit a model, the resulting rasterized array is short on the y axis by one row

X, Xy, clf = fit(data = src, clf = pl, labels = training_polys, col='class')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 X, Xy, clf = fit(data = src, clf = pl, labels = training_polys, col='class')

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/ml/classifiers.py:269, in Classifiers.fit(self, data, clf, labels, col, targ_name, targ_dim_name)
    266 else:
    267     data = self._add_time_dim(data)
--> 269     data, labels = self._prepare_labels(data, labels, col, targ_name)
    270     X, Xna = self._prepare_predictors(data, targ_name)
    271     clf = self._prepare_classifiers(clf)

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/ml/classifiers.py:77, in ClassifiersMixin._prepare_labels(self, data, labels, col, targ_name)
     71     labels = polygon_to_array(labels, col=col, data=data)
     73 labels = xr.concat([labels] * data.gw.ntime, dim="band").assign_coords(
     74     coords={"band": data.time.values.tolist()}
     75 )
---> 77 data.coords[targ_name] = (["time", "y", "x"], labels.values)
     79 return data, labels

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:528, in Coordinates.__setitem__(self, key, value)
    527 def __setitem__(self, key: Hashable, value: Any) -> None:
--> 528     self.update({key: value})

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:554, in Coordinates.update(self, other)
    546 # Discard original indexed coordinates prior to merge allows to:
    547 # - fail early if the new coordinates don't preserve the integrity of existing
    548 #   multi-coordinate indexes
    549 # - drop & replace coordinates without alignment (note: we must keep indexed
    550 #   coordinates extracted from the DataArray objects passed as values to
    551 #   `other` - if any - as those are still used for aligning the old/new coordinates)
    552 coords_to_align = drop_indexed_coords(set(other_coords) & set(other), self)
--> 554 coords, indexes = merge_coords(
    555     [coords_to_align, other_coords],
    556     priority_arg=1,
    557     indexes=coords_to_align.xindexes,
    558 )
    560 # special case for PandasMultiIndex: updating only its dimension coordinate
    561 # is still allowed but depreciated.
    562 # It is the only case where we need to actually drop coordinates here (multi-index levels)
    563 # TODO: remove when removing PandasMultiIndex's dimension coordinate.
    564 self._drop_coords(self._names - coords_to_align._names)

File /env/lib/python3.10/site-packages/xarray/core/merge.py:556, in merge_coords(objects, compat, join, priority_arg, indexes, fill_value)
    554 _assert_compat_valid(compat)
    555 coerced = coerce_pandas_values(objects)
--> 556 aligned = deep_align(
    557     coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
    558 )
    559 collected = collect_variables_and_indexes(aligned, indexes=indexes)
    560 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:952, in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
    949     else:
    950         out.append(variables)
--> 952 aligned = align(
    953     *targets,
    954     join=join,
    955     copy=copy,
    956     indexes=indexes,
    957     exclude=exclude,
    958     fill_value=fill_value,
    959 )
    961 for position, key, aligned_obj in zip(positions, keys, aligned):
    962     if key is no_key:

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:888, in align(join, copy, indexes, exclude, fill_value, *objects)
    692 """
    693 Given any number of Dataset and/or DataArray objects, returns new
    694 objects with aligned indexes and dimension sizes.
   (...)
    878 
    879 """
    880 aligner = Aligner(
    881     objects,
    882     join=join,
   (...)
    886     fill_value=fill_value,
    887 )
--> 888 aligner.align()
    889 return aligner.results

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:575, in Aligner.align(self)
    573 self.assert_no_index_conflict()
    574 self.align_indexes()
--> 575 self.assert_unindexed_dim_sizes_equal()
    577 if self.join == "override":
    578     self.override_indexes()

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:476, in Aligner.assert_unindexed_dim_sizes_equal(self)
    474     add_err_msg = ""
    475 if len(sizes) > 1:
--> 476     raise ValueError(
    477         f"cannot reindex or align along dimension {dim!r} "
    478         f"because of conflicting dimension sizes: {sizes!r}" + add_err_msg
    479     )

ValueError: cannot reindex or align along dimension 'y' because of conflicting dimension sizes: {16400, 16399} (note: an index is found along that dimension with size=16400)

Looking through function definitions I was able to reproduce the mismatch with the polygon_to_array function. This returns a 20596 by 16399 dimension array

labelss = polygon_to_array( training_polys, 'class', src)

<xarray.DataArray 'array-7aee3d45b28f735a4a0493d1fc553903' (band: 1, y: 16399,
                                                            x: 20596)>
dask.array<array, shape=(1, 16399, 20596), dtype=uint8, chunksize=(1, 1, 20596), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
Attributes:
    transform:  (3.3217508595230054, 0.0, 455302.1731317809, 0.0, -3.32175085...
    crs:        EPSG:6339
    res:        (3.3217508595230054, 3.321750859522998)
    is_tiled:   1
	

I realize I am not explicitly setting the ref_res config parameter like in the example, but it seems to be trying to just match src parameters, which seems like perfectly reasonable behavior. I am not sure if this just fails because of the non-square resolution.

@jgrss
Copy link
Owner

jgrss commented Apr 26, 2024

Hi @WY-CGhilardi thanks for posting.

The default behavior of polygon_to_array is to return the bounding box of the intersection between the array and polygon. But, we might not be passing all the needed variables through fit (Cc @mmann1123).

To start, could you try the following with your data?

import geowombat as gw

with gw.open(
    filenames,
    band_names=['band1','band2','band3','band4','band5']
) as src:
    poly_array = gw.polygon_to_array(training_polys, col="class", data=src, bounds_by="union")
    print(poly_array)

Are you able to share your data?

@jgrss
Copy link
Owner

jgrss commented Apr 26, 2024

Here's another test to see if the polygons are properly converted using a config context.

import geowombat as gw
from geowombat.backends.rasterio_ import get_file_bounds

aoi_bounds = get_file_bounds(filenames, return_bounds=True, bounds_by='union')
with gw.config.update(ref_bounds=aoi_bounds):
    with gw.open(
        filenames,
        band_names=['band1','band2','band3','band4','band5']
    ) as src:
        X, Xy, clf = fit(data=src, clf=pl, labels=training_polys, col='class')

@mmann1123
Copy link
Collaborator

mmann1123 commented Apr 27, 2024

Currently ml functions don't work on time series stacks, this would also require time series training data. Instead you need to set stack_dim = 'band'. This will treat all bands over all time periods a columns of X.

with gw.open(
    filenames,
    band_names=['band1','band2','band3','band4','band5'], 
    stack_dim='band'
) as src:

X, Xy, clf = fit(data = src, clf = pl, labels = training_polys, col='class')

mmann1123 added a commit that referenced this issue Apr 27, 2024
@WY-CGhilardi
Copy link
Author

WY-CGhilardi commented Apr 27, 2024

I cannot share the data but it sounds like this is a issue with how the series is stacked.

For completeness here are the two configs requested above.

The first returns the same results as my previous call to polygon_to_array

with gw.open(
    filenames,
    band_names=['band1','band2','band3','band4','band5']
) as src:
poly_array = gw.polygon_to_array(training_polys, col="class", data=src, bounds_by="union")
print(poly_array)

<xarray.DataArray 'array-74bf0873a2d08172017317f9392aa551' (band: 1, y: 16399,
                                                            x: 20596)>
dask.array<array, shape=(1, 16399, 20596), dtype=uint8, chunksize=(1, 1, 20596), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
Attributes:
    transform:  (3.3217508595230054, 0.0, 455302.1731317809, 0.0, -3.32175085...
    crs:        EPSG:6339
    res:        (3.3217508595230054, 3.321750859522998)
    is_tiled:   1
	

The second returns the same error message as before but now with one extra row on the y axis (16401) rather than one less

from geowombat.backends.rasterio_ import get_file_bounds

aoi_bounds = get_file_bounds(filenames, return_bounds=True, bounds_by='union')
with gw.config.update(ref_bounds=aoi_bounds):
    with gw.open(
        filenames,
        band_names=['band1','band2','band3','band4','band5']
    ) as src:
        X, Xy, clf = fit(data=src, clf=pl, labels=training_polys, col='class')	
		

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 9
      4 with gw.config.update(ref_bounds=aoi_bounds):
      5     with gw.open(
      6         filenames,
      7         band_names=['band1','band2','band3','band4','band5']
      8     ) as src:
----> 9         X, Xy, clf = fit(data=src, clf=pl, labels=training_polys, col='class')

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/ml/classifiers.py:269, in Classifiers.fit(self, data, clf, labels, col, targ_name, targ_dim_name)
    266 else:
    267     data = self._add_time_dim(data)
--> 269     data, labels = self._prepare_labels(data, labels, col, targ_name)
    270     X, Xna = self._prepare_predictors(data, targ_name)
    271     clf = self._prepare_classifiers(clf)

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/ml/classifiers.py:77, in ClassifiersMixin._prepare_labels(self, data, labels, col, targ_name)
     71     labels = polygon_to_array(labels, col=col, data=data)
     73 labels = xr.concat([labels] * data.gw.ntime, dim="band").assign_coords(
     74     coords={"band": data.time.values.tolist()}
     75 )
---> 77 data.coords[targ_name] = (["time", "y", "x"], labels.values)
     79 return data, labels

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:528, in Coordinates.__setitem__(self, key, value)
    527 def __setitem__(self, key: Hashable, value: Any) -> None:
--> 528     self.update({key: value})

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:554, in Coordinates.update(self, other)
    546 # Discard original indexed coordinates prior to merge allows to:
    547 # - fail early if the new coordinates don't preserve the integrity of existing
    548 #   multi-coordinate indexes
    549 # - drop & replace coordinates without alignment (note: we must keep indexed
    550 #   coordinates extracted from the DataArray objects passed as values to
    551 #   `other` - if any - as those are still used for aligning the old/new coordinates)
    552 coords_to_align = drop_indexed_coords(set(other_coords) & set(other), self)
--> 554 coords, indexes = merge_coords(
    555     [coords_to_align, other_coords],
    556     priority_arg=1,
    557     indexes=coords_to_align.xindexes,
    558 )
    560 # special case for PandasMultiIndex: updating only its dimension coordinate
    561 # is still allowed but depreciated.
    562 # It is the only case where we need to actually drop coordinates here (multi-index levels)
    563 # TODO: remove when removing PandasMultiIndex's dimension coordinate.
    564 self._drop_coords(self._names - coords_to_align._names)

File /env/lib/python3.10/site-packages/xarray/core/merge.py:556, in merge_coords(objects, compat, join, priority_arg, indexes, fill_value)
    554 _assert_compat_valid(compat)
    555 coerced = coerce_pandas_values(objects)
--> 556 aligned = deep_align(
    557     coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
    558 )
    559 collected = collect_variables_and_indexes(aligned, indexes=indexes)
    560 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:952, in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
    949     else:
    950         out.append(variables)
--> 952 aligned = align(
    953     *targets,
    954     join=join,
    955     copy=copy,
    956     indexes=indexes,
    957     exclude=exclude,
    958     fill_value=fill_value,
    959 )
    961 for position, key, aligned_obj in zip(positions, keys, aligned):
    962     if key is no_key:

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:888, in align(join, copy, indexes, exclude, fill_value, *objects)
    692 """
    693 Given any number of Dataset and/or DataArray objects, returns new
    694 objects with aligned indexes and dimension sizes.
   (...)
    878 
    879 """
    880 aligner = Aligner(
    881     objects,
    882     join=join,
   (...)
    886     fill_value=fill_value,
    887 )
--> 888 aligner.align()
    889 return aligner.results

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:575, in Aligner.align(self)
    573 self.assert_no_index_conflict()
    574 self.align_indexes()
--> 575 self.assert_unindexed_dim_sizes_equal()
    577 if self.join == "override":
    578     self.override_indexes()

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:476, in Aligner.assert_unindexed_dim_sizes_equal(self)
    474     add_err_msg = ""
    475 if len(sizes) > 1:
--> 476     raise ValueError(
    477         f"cannot reindex or align along dimension {dim!r} "
    478         f"because of conflicting dimension sizes: {sizes!r}" + add_err_msg
    479     )

ValueError: cannot reindex or align along dimension 'y' because of conflicting dimension sizes: {16400, 16401} (note: an index is found along that dimension with size=16401)

@WY-CGhilardi
Copy link
Author

I did also try adding the stack_dim parameter to the open call but that threw a different error before it made it to the fit

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 with gw.open(
      2     filenames,
      3     band_names=['band1','band2','band3','band4','band5'],
      4     stack_dim='band'
      5 ) as src:
      6     src

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/core/api.py:531, in open.__init__(self, filename, band_names, time_names, stack_dim, bounds, bounds_by, resampling, persist_filenames, netcdf_vars, mosaic, overlap, nodata, scale_factor, offset, dtype, scale_data, num_workers, **kwargs)
    518     self.data = gw_mosaic(
    519         filename,
    520         overlap=overlap,
   (...)
    526         **kwargs,
    527     )
    529 else:
    530     # Stack images along the 'time' axis
--> 531     self.data = gw_concat(
    532         filename,
    533         stack_dim=stack_dim,
    534         bounds_by=bounds_by,
    535         resampling=resampling,
    536         time_names=time_names,
    537         band_names=band_names,
    538         nodata=nodata,
    539         overlap=overlap,
    540         dtype=dtype,
    541         netcdf_vars=netcdf_vars,
    542         **kwargs,
    543     )
    544     self.__data_are_stacked = True
    546 self.__data_are_separate = True

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/backends/xarray_.py:738, in concat(filenames, stack_dim, bounds_by, resampling, time_names, band_names, nodata, dtype, netcdf_vars, overlap, warp_mem_limit, num_threads, tap, **kwargs)
    735             src.coords['time'] = parse_filename_dates(filenames)
    737 if band_names:
--> 738     src.coords['band'] = band_names
    739 else:
    740     if src.gw.sensor:

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:528, in Coordinates.__setitem__(self, key, value)
    527 def __setitem__(self, key: Hashable, value: Any) -> None:
--> 528     self.update({key: value})

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:566, in Coordinates.update(self, other)
    560 # special case for PandasMultiIndex: updating only its dimension coordinate
    561 # is still allowed but depreciated.
    562 # It is the only case where we need to actually drop coordinates here (multi-index levels)
    563 # TODO: remove when removing PandasMultiIndex's dimension coordinate.
    564 self._drop_coords(self._names - coords_to_align._names)
--> 566 self._update_coords(coords, indexes)

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:834, in DataArrayCoordinates._update_coords(self, coords, indexes)
    832 coords_plus_data = coords.copy()
    833 coords_plus_data[_THIS_ARRAY] = self._data.variable
--> 834 dims = calculate_dimensions(coords_plus_data)
    835 if not set(dims) <= set(self.dims):
    836     raise ValueError(
    837         "cannot add coordinates with new dimensions to a DataArray"
    838     )

File /env/lib/python3.10/site-packages/xarray/core/variable.py:2997, in calculate_dimensions(variables)
   2995             last_used[dim] = k
   2996         elif dims[dim] != size:
-> 2997             raise ValueError(
   2998                 f"conflicting sizes for dimension {dim!r}: "
   2999                 f"length {size} on {k!r} and length {dims[dim]} on {last_used!r}"
   3000             )
   3001 return dims

ValueError: conflicting sizes for dimension 'band': length 60 on <this-array> and length 5 on {'x': 'x', 'y': 'y', 'band': 'band'}

@mmann1123
Copy link
Collaborator

Comment out band_names, it should be the same length as the total number of bands in all images

@WY-CGhilardi
Copy link
Author

removing band_names with stack_dim set to bands returns the original off by one I was running into above

with gw.open(
    filenames,
    stack_dim='band'
) as src:

    X, Xy, clf = fit(data = src, clf = pl, labels = training_polys, col='class')
ValueError                                Traceback (most recent call last)
Cell In[7], line 6
      1 with gw.open(
      2     filenames,
      3     stack_dim='band'
      4 ) as src:
----> 6     X, Xy, clf = fit(data = src, clf = pl, labels = training_polys, col='class')

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/ml/classifiers.py:269, in Classifiers.fit(self, data, clf, labels, col, targ_name, targ_dim_name)
    266 else:
    267     data = self._add_time_dim(data)
--> 269     data, labels = self._prepare_labels(data, labels, col, targ_name)
    270     X, Xna = self._prepare_predictors(data, targ_name)
    271     clf = self._prepare_classifiers(clf)

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/ml/classifiers.py:77, in ClassifiersMixin._prepare_labels(self, data, labels, col, targ_name)
     71     labels = polygon_to_array(labels, col=col, data=data)
     73 labels = xr.concat([labels] * data.gw.ntime, dim="band").assign_coords(
     74     coords={"band": data.time.values.tolist()}
     75 )
---> 77 data.coords[targ_name] = (["time", "y", "x"], labels.values)
     79 return data, labels

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:528, in Coordinates.__setitem__(self, key, value)
    527 def __setitem__(self, key: Hashable, value: Any) -> None:
--> 528     self.update({key: value})

File /env/lib/python3.10/site-packages/xarray/core/coordinates.py:554, in Coordinates.update(self, other)
    546 # Discard original indexed coordinates prior to merge allows to:
    547 # - fail early if the new coordinates don't preserve the integrity of existing
    548 #   multi-coordinate indexes
    549 # - drop & replace coordinates without alignment (note: we must keep indexed
    550 #   coordinates extracted from the DataArray objects passed as values to
    551 #   `other` - if any - as those are still used for aligning the old/new coordinates)
    552 coords_to_align = drop_indexed_coords(set(other_coords) & set(other), self)
--> 554 coords, indexes = merge_coords(
    555     [coords_to_align, other_coords],
    556     priority_arg=1,
    557     indexes=coords_to_align.xindexes,
    558 )
    560 # special case for PandasMultiIndex: updating only its dimension coordinate
    561 # is still allowed but depreciated.
    562 # It is the only case where we need to actually drop coordinates here (multi-index levels)
    563 # TODO: remove when removing PandasMultiIndex's dimension coordinate.
    564 self._drop_coords(self._names - coords_to_align._names)

File /env/lib/python3.10/site-packages/xarray/core/merge.py:556, in merge_coords(objects, compat, join, priority_arg, indexes, fill_value)
    554 _assert_compat_valid(compat)
    555 coerced = coerce_pandas_values(objects)
--> 556 aligned = deep_align(
    557     coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
    558 )
    559 collected = collect_variables_and_indexes(aligned, indexes=indexes)
    560 prioritized = _get_priority_vars_and_indexes(aligned, priority_arg, compat=compat)

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:952, in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
    949     else:
    950         out.append(variables)
--> 952 aligned = align(
    953     *targets,
    954     join=join,
    955     copy=copy,
    956     indexes=indexes,
    957     exclude=exclude,
    958     fill_value=fill_value,
    959 )
    961 for position, key, aligned_obj in zip(positions, keys, aligned):
    962     if key is no_key:

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:888, in align(join, copy, indexes, exclude, fill_value, *objects)
    692 """
    693 Given any number of Dataset and/or DataArray objects, returns new
    694 objects with aligned indexes and dimension sizes.
   (...)
    878 
    879 """
    880 aligner = Aligner(
    881     objects,
    882     join=join,
   (...)
    886     fill_value=fill_value,
    887 )
--> 888 aligner.align()
    889 return aligner.results

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:575, in Aligner.align(self)
    573 self.assert_no_index_conflict()
    574 self.align_indexes()
--> 575 self.assert_unindexed_dim_sizes_equal()
    577 if self.join == "override":
    578     self.override_indexes()

File /env/lib/python3.10/site-packages/xarray/core/alignment.py:476, in Aligner.assert_unindexed_dim_sizes_equal(self)
    474     add_err_msg = ""
    475 if len(sizes) > 1:
--> 476     raise ValueError(
    477         f"cannot reindex or align along dimension {dim!r} "
    478         f"because of conflicting dimension sizes: {sizes!r}" + add_err_msg
    479     )

ValueError: cannot reindex or align along dimension 'y' because of conflicting dimension sizes: {16400, 16399} (note: an index is found along that dimension with size=16400)

@mmann1123
Copy link
Collaborator

Do you have polygons that exceed the bounds of the images?

@mmann1123
Copy link
Collaborator

Do you have polygons that exceed the bounds of the images? Or that contain missing data?

@WY-CGhilardi
Copy link
Author

Do you have polygons that exceed the bounds of the images?

No, the polygons were explicitly drawn within this AOI. Doubling checking the bounding boxes for both the geodataframe and xarray stack check out as well

# minx, miny, maxx, maxy
training_polys.total_bounds
array([ 472449.7271, 5063777.276 ,  510789.9898, 5106120.4848])


## hacky but does the job
# minx, miny, maxx, maxy
(float(src.x.min().values), float(src.y.min().values), float(src.x.max().values), float(src.y.max().values))
(455303.8340072106, 5061403.560293189, 523715.29295908695, 5115876.952638507)

or contain missing data

No, the dataframe is fully populated and all have geometry

training_polys.isnull().values.any()

False

@mmann1123
Copy link
Collaborator

This is a weird one. I've never run into it and can't reproduce without the data.

Can you try seeing 'ref_image' to the first image in your list?

https://geowombat.readthedocs.io/en/latest/tutorial-config.html#reference-settings-image

@jgrss
Copy link
Owner

jgrss commented Apr 27, 2024

@WY-CGhilardi could you post the full transform (print(src.gw.transform)) of your image stack? I can see the left bound from your printout but not the top. And if you can also post the total bounds (print(df.total_bounds))) of your polygon geodataframe, we might be able to reproduce this issue with fake data.

@jgrss
Copy link
Owner

jgrss commented Apr 27, 2024

Apologies, I see that you already did what I requested. I'll look into this and see if I can reproduce.

@jgrss
Copy link
Owner

jgrss commented Apr 27, 2024

I was able to reproduce with fake data, so I should have what I need to diagnose.

import geowombat as gw
import dask.array as da
import numpy as np
import geopandas as gpd
import xarray as xr
from affine import Affine
from shapely.geometry import box

num_time = 12
num_bands = 5
num_rows = 16400
num_cols = 20596
res = (3.3217508595230054, 3.321750859522998)

left, bottom, right, top = (455303.8340072106, 5061403.560293189, 523715.29295908695, 5115876.952638507)

x = np.linspace(left, right, num_cols)
y = np.linspace(top, bottom, num_rows)

data = xr.DataArray(
    da.random.random(
        (num_time, num_bands, num_rows, num_cols), chunks=(-1, -1, 128, 128)
    ),
    dims=('time', 'band', 'y', 'x'),
    coords={
        'time': range(1, num_time + 1), 
        'band': [f"band{b}" for b in range(1, num_bands + 1)], 
        'y': y, 
        'x': x,
    },
    attrs={
        'transform': list(Affine(
            res[0], 0.0, left, 0.0, -res[1], top
        )),
        'crs': 'EPSG:6339',
        'res': res,
        'nodatavals': (0.0, 0.0, 0.0, 0.0, 0.0),
        '_FillValue': 0.0,
    },
)

poly_left, poly_bottom, poly_right, poly_top = 472449.7271, 5063777.276, 510789.9898, 5106120.4848
poly1 = box(poly_left, poly_top - 1000, poly_left + 1000, poly_top)
poly2 = box(poly_right - 1000, poly_bottom, poly_right, poly_bottom + 1000)

training_polys  = gpd.GeoDataFrame(
    data=[0, 1],
    columns=['class'],
    geometry=[poly1, poly2],
    crs='EPSG:6339',
)
poly_array = gw.polygon_to_array(
    training_polys, 
    col="class", 
    data=data, 
    bounds_by="union",
)
print(data)
xarray.DataArray 'random_sample-b383b64724d8ff4e461e9131e8739640'
time: 12 band: 5 y: 16400 x: 20596

print(poly_array)
xarray.DataArray 'array-c8fefabbbbd90f2375e1a36a7114dc8c'
band: 1 y: 16399 x: 20596

@jgrss jgrss mentioned this issue Apr 27, 2024
@jgrss
Copy link
Owner

jgrss commented Apr 27, 2024

I think this might just be a precision issue. The adjusted height was 16399.9999987, which was converted to 16399 instead of 16400 with int(). See if #312 works for you. I still need to add additional tests.

@WY-CGhilardi
Copy link
Author

I didn't see that geowombat exposed a bounds function! I ran that and I got slightly different values from my "manual parsing" version. It sounds like that isn't maybe the full issue but I thought worth calling out.

src.gw.bounds
(455302.1731317809, 5061401.89941776, 523716.9538345167, 5115878.613513936)

##as compared with (455303.8340072106, 5061403.560293189, 523715.29295908695, 5115876.952638507) from above

I was not able to successfully get the reference image config working correctly. The files I am accessing are remote on a cloud provider so not sure if that is the issue or something else.

with gw.config.update(ref_image=filenames[0]):
    with gw.open(filenames, ) as src:
        print(src.gw.bounds)

23:21:49:WARNING:94:xarray_._check_config_globals:  The reference image does not exist
23:21:49:WARNING:94:xarray_._check_config_globals:  The reference image does not exist
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 2
      1 with gw.config.update(ref_image=filenames[0]):
----> 2     with gw.open(filenames, ) as src:
      3         print(src.gw.bounds)

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/core/api.py:531, in open.__init__(self, filename, band_names, time_names, stack_dim, bounds, bounds_by, resampling, persist_filenames, netcdf_vars, mosaic, overlap, nodata, scale_factor, offset, dtype, scale_data, num_workers, **kwargs)
    518     self.data = gw_mosaic(
    519         filename,
    520         overlap=overlap,
   (...)
    526         **kwargs,
    527     )
    529 else:
    530     # Stack images along the 'time' axis
--> 531     self.data = gw_concat(
    532         filename,
    533         stack_dim=stack_dim,
    534         bounds_by=bounds_by,
    535         resampling=resampling,
    536         time_names=time_names,
    537         band_names=band_names,
    538         nodata=nodata,
    539         overlap=overlap,
    540         dtype=dtype,
    541         netcdf_vars=netcdf_vars,
    542         **kwargs,
    543     )
    544     self.__data_are_stacked = True
    546 self.__data_are_separate = True

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/backends/xarray_.py:619, in concat(filenames, stack_dim, bounds_by, resampling, time_names, band_names, nodata, dtype, netcdf_vars, overlap, warp_mem_limit, num_threads, tap, **kwargs)
    616 with rio_open(filenames[0]) as src_:
    617     tags = src_.tags()
--> 619 src_ = warp_open(
    620     f'{filenames[0]}:{netcdf_vars[0]}' if netcdf_vars else filenames[0],
    621     resampling=resampling,
    622     band_names=[netcdf_vars[0]] if netcdf_vars else band_names,
    623     nodata=nodata,
    624     warp_mem_limit=warp_mem_limit,
    625     num_threads=num_threads,
    626     **kwargs,
    627 )
    629 attrs = src_.attrs.copy()
    630 src_.close()

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/backends/xarray_.py:314, in warp_open(filename, band_names, resampling, dtype, netcdf_vars, nodata, return_windows, warp_mem_limit, num_threads, tap, **kwargs)
    297     warped_objects = warp_images(
    298         filenames, resampling=resampling, **ref_kwargs_netcdf_stack
    299     )
    301     yield xr.concat(
    302         (
    303             open_rasterio(
   (...)
    310         dim='band',
    311     )
    313 with open_rasterio(
--> 314     warp(filename, resampling=resampling, **ref_kwargs),
    315     nodata=ref_kwargs['nodata'],
    316     **kwargs,
    317 ) if not filenames else warp_netcdf_vars() as src:
    318     if band_names:
    319         if len(band_names) > src.gw.nbands:

File ~/venvs/geowombat-env/lib/python3.10/site-packages/geowombat/backends/rasterio_.py:783, in warp(filename, resampling, bounds, crs, res, nodata, warp_mem_limit, num_threads, tap, tac)
    779 dst_height, dst_width = get_dims_from_bounds(dst_bounds, dst_res)
    781 # Do all the key metadata match the reference information?
    782 if (
--> 783     (tuple(src_info.src_bounds) == tuple(bounds))
    784     and (src_info.src_res == dst_res)
    785     and (src_crs == dst_crs)
    786     and (src_info.src_width == dst_width)
    787     and (src_info.src_height == dst_height)
    788     and ('.nc' not in filename.lower())
    789 ):
    790     vrt_options = {
    791         'resampling': getattr(Resampling, resampling),
    792         'src_crs': src_crs,
   (...)
    803         },
    804     }
    806 else:

TypeError: 'NoneType' object is not iterable

@WY-CGhilardi
Copy link
Author

I am still seeing 16399 after installing issue_310 branch

#pip install git+https://github.com/jgrss/geowombat@jgrss/issue_310

<xarray.DataArray 'array-74bf0873a2d08172017317f9392aa551' (band: 1, y: 16399,
                                                            x: 20596)>
dask.array<array, shape=(1, 16399, 20596), dtype=uint8, chunksize=(1, 1, 20596), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
Attributes:
    transform:  (3.3217508595230054, 0.0, 455302.1731317809, 0.0, -3.32175085...
    crs:        EPSG:6339
    res:        (3.3217508595230054, 3.321750859522998)
    is_tiled:   1

@jgrss
Copy link
Owner

jgrss commented Apr 27, 2024

I was not able to successfully get the reference image config working correctly. The files I am accessing are remote on a cloud provider so not sure if that is the issue or something else.

Hmm, geowombat open() is basically just wrapped rasterio. Are you opening a virtual file following their protocol?

@jgrss
Copy link
Owner

jgrss commented Apr 27, 2024

Could you run your polygon conversion with the following and let me know what you get?

with gw.config.update(ref_res=3.2175):

and

with gw.config.update(ref_res=3.22):

@WY-CGhilardi
Copy link
Author

Let me start off with a big thank you to both of you for working through this with me, I really appreciate it

with gw.config.update(ref_res=3.2175):
    with gw.open(
        filenames,
        band_names=['band1','band2','band3','band4','band5']
    ) as src:
        src

labelss = polygon_to_array( training_polys, 'class', src)

labelss

<xarray.DataArray 'array-eb576767189c8f0f6be73fcd77ffb4ee' (band: 1, y: 16930,
                                                            x: 21263)>
dask.array<array, shape=(1, 16930, 21263), dtype=uint8, chunksize=(1, 1, 20597), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
Attributes:
    transform:  (3.2175, 0.0, 455302.1731317809, 0.0, -3.2175, 5115878.613513...
    crs:        EPSG:6339
    res:        (3.2175, 3.2175)
    is_tiled:   1
with gw.config.update(ref_res=3.22):
    with gw.open(
        filenames,
        band_names=['band1','band2','band3','band4','band5']
    ) as src:
        src
		
labelss = polygon_to_array( training_polys, 'class', src)

labelss
<xarray.DataArray 'array-e7a6b86caea789cacaa30e07821d1756' (band: 1, y: 16917,
                                                            x: 21247)>
dask.array<array, shape=(1, 16917, 21247), dtype=uint8, chunksize=(1, 1, 20597), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
Attributes:
    transform:  (3.22, 0.0, 455302.1731317809, 0.0, -3.22, 5115878.613513936)
    crs:        EPSG:6339
    res:        (3.22, 3.22)
    is_tiled:   1

Hmm, geowombat open() is basically just wrapped rasterio. Are you opening a virtual file following their protocol?

Yes, the filenames is functionally just ['s3://somebucket/somprefix/somefile1.tif', 's3://somebucket/somprefix/somefile2.tif'] Indexing seems to work fine as part of the normal open call so I don't see why that wouldn't work as part of the ref_img. I am not too concerned with this behavior now since we have narrowed it down to precision issue. 🙃

with gw.open(filenames[0]) as src:
    print(src.gw.bounds)

(455302.1731317809, 5061401.89941776, 523716.9538345167, 5115878.613513936)

@jgrss
Copy link
Owner

jgrss commented Apr 29, 2024

Using the fake data generated above, the array size is

print(data.shape)
print(data.gw.bounds)

(12, 5, 16400, 20596)
(455300.5122563511, 5061400.23854233, 523718.61470994644, 5115880.274389366)

and the training polygon bounding box is

print(training_polys.total_bounds.tolist())

[472449.7271, 5063777.276, 510789.9898, 5106120.4848]

Those numbers should match what you have, as I simply copied what you posted.

Using the latest (geowombat=2.1.19 from main) on Python 3.9.16, I get:

poly_array = gw.polygon_to_array(
    training_polys, 
    col="class", 
    data=data, 
)
print(poly_array.shape)

(1, 16400, 20596)

@jgrss
Copy link
Owner

jgrss commented Apr 29, 2024

I'm not sure if it's a formatting/syntax error, but in your last post it didn't look like you wrapped the polygon conversion within the context. Are you running the following test?

with gw.config.update(ref_res=3.2175):
    with gw.open(
        filenames,
        band_names=['band1','band2','band3','band4','band5']
    ) as src:
        # This needs to be indented here in order to use the config resolution
        labels = polygon_to_array(training_polys, col='class', data=src)
        print(src)
        print(labels)

If that prints aligned shapes, can you try

with gw.config.update(ref_res=3.2175):
    with gw.open(
        filenames,
        band_names=['band1','band2','band3','band4','band5']
    ) as src:
        X, Xy, clf = fit(data=src, clf=pl, labels=training_polys, col='class')

I don't have the same raster file you're using, but if I resample the array then the polygon array shape aligns.

data_resamp = data.gw.transform_crs(dst_res=(3.2175, 3.2175))

poly_array = gw.polygon_to_array(
    training_polys, 
    col="class", 
    data=data_resamp , 
)

print(data_resamp.shape)
(5, 16933, 21265)

print(poly_array.shape)
(1, 16933, 21265)

@WY-CGhilardi
Copy link
Author

That is just a formatting issue, I can confirm I still return mis-aligned arrays with the configs

Using 3.2175

with gw.config.update(ref_res=3.2175):
    with gw.open(
        filenames,
        band_names=['band1','band2','band3','band4','band5']
    ) as src:
        # This needs to be indented here in order to use the config resolution
        labels = polygon_to_array(training_polys, col='class', data=src)
        print(src) #16931 x 21264
        print(labels) #16930 x 21263
<xarray.DataArray (time: 12, band: 5, y: 16931, x: 21264)>
dask.array<concatenate, shape=(12, 5, 16931, 21264), dtype=uint16, chunksize=(1, 5, 1, 20597), chunktype=numpy.ndarray>
Coordinates:
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * time     (time) int64 1 2 3 4 5 6 7 8 9 10 11 12
  * band     (band) <U5 'band1' 'band2' 'band3' 'band4' 'band5'
Attributes: (12/14)
    transform:           (3.2175, 0.0, 455302.1731317809, 0.0, -3.2175, 51158...
    crs:                 6339
    res:                 (3.2175, 3.2175)
    is_tiled:            1
    nodatavals:          (0.0, 0.0, 0.0, 0.0, 0.0)
    _FillValue:          0.0
    ...                  ...
    filename:            ['1_2023_6339.tif', '2_2023_6339.tif', '3_2023_6339....
    resampling:          nearest
    AREA_OR_POINT:       Area
    COLORINTERP:         Blue
    _data_are_separate:  1
    _data_are_stacked:   1
<xarray.DataArray 'array-eb576767189c8f0f6be73fcd77ffb4ee' (band: 1, y: 16930,
                                                            x: 21263)>
dask.array<array, shape=(1, 16930, 21263), dtype=uint8, chunksize=(1, 1, 20597), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
Attributes:
    transform:  (3.2175, 0.0, 455302.1731317809, 0.0, -3.2175, 5115878.613513...
    crs:        EPSG:6339
    res:        (3.2175, 3.2175)
    is_tiled:   1

Using 3.22

with gw.config.update(ref_res=3.22):
    with gw.open(
        filenames,
        band_names=['band1','band2','band3','band4','band5']
    ) as src:
        # This needs to be indented here in order to use the config resolution
        labels = polygon_to_array(training_polys, col='class', data=src)
        print(src) #16918 x 21247
        print(labels) #16917 x 21247
<xarray.DataArray (time: 12, band: 5, y: 16918, x: 21247)>
dask.array<concatenate, shape=(12, 5, 16918, 21247), dtype=uint16, chunksize=(1, 5, 1, 20597), chunktype=numpy.ndarray>
Coordinates:
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * time     (time) int64 1 2 3 4 5 6 7 8 9 10 11 12
  * band     (band) <U5 'band1' 'band2' 'band3' 'band4' 'band5'
Attributes: (12/14)
    transform:           (3.22, 0.0, 455302.1731317809, 0.0, -3.22, 5115878.6...
    crs:                 6339
    res:                 (3.22, 3.22)
    is_tiled:            1
    nodatavals:          (0.0, 0.0, 0.0, 0.0, 0.0)
    _FillValue:          0.0
    ...                  ...
    filename:            ['1_2023_6339.tif', '2_2023_6339.tif', '3_2023_6339....
    resampling:          nearest
    AREA_OR_POINT:       Area
    COLORINTERP:         Blue
    _data_are_separate:  1
    _data_are_stacked:   1
<xarray.DataArray 'array-e7a6b86caea789cacaa30e07821d1756' (band: 1, y: 16917,
                                                            x: 21247)>
dask.array<array, shape=(1, 16917, 21247), dtype=uint8, chunksize=(1, 1, 20597), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 1
  * y        (y) float64 5.116e+06 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * x        (x) float64 4.553e+05 4.553e+05 4.553e+05 ... 5.237e+05 5.237e+05
Attributes:
    transform:  (3.22, 0.0, 455302.1731317809, 0.0, -3.22, 5115878.613513936)
    crs:        EPSG:6339
    res:        (3.22, 3.22)
    is_tiled:   1

It seems that resampling does work to align the array. Note I just did this on a single image here, I was running into OOM errors with the full stack.

with gw.open(filenames[0]) as src:
    data_resamp = src.gw.transform_crs(dst_res=(3.2175, 3.2175))

    poly_array = gw.polygon_to_array(
        training_polys, 
        col="class", 
        data=data_resamp , 
    )

    print(data_resamp.shape)
    print(poly_array.shape)
(5, 16932, 21264)
(1, 16932, 21264)

@jgrss jgrss closed this as completed in #310 May 1, 2024
jgrss added a commit that referenced this issue May 1, 2024
* simplify mosaic procedure

* adding mosiac and save unit tests

* fix path tifs

* resolve ml install issues

* handle 0s in X[targ_name]

* add todo

* reverting to main

* pin scikit fix

* add tests

* throw error if stacked by time
resolves #311

* droping test - meaningless output

---------

Co-authored-by: jgrss <jbgraesser@gmail.com>
@jgrss jgrss reopened this May 1, 2024
@WY-CGhilardi
Copy link
Author

WY-CGhilardi commented May 1, 2024

I have been poking around this some more locally and noticed something this morning. For a given image, the returned xarray is a slightly different shape on read than what is reported from QGIS/GDAL even before anything involving the polygons.

NOTE this is a slightly clipped down version of the files discussed above for ease of working on laptop

gdalinfo myfile1_clipped.tif
Driver: GTiff/GeoTIFF
Size is 16462, 16376
..... truncated
Full gdalinfo output if helpful ```bash gdalinfo myfile1_clipped.tif Driver: GTiff/GeoTIFF Size is 16462, 16376 Coordinate System is: PROJCRS["NAD83(2011) / UTM zone 10N", BASEGEOGCRS["NAD83(2011)", DATUM["NAD83 (National Spatial Reference System 2011)", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",6318]], CONVERSION["UTM zone 10N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-123, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1]], USAGE[ SCOPE["Engineering survey, topographic mapping."], AREA["United States (USA) - between 126┬░W and 120┬░W onshore and offshore - California; Oregon; Washington."], BBOX[30.54,-126,49.09,-119.99]], ID["EPSG",6339]] Data axis to CRS axis mapping: 1,2 Origin = (468984.716852614830714,5115798.602993652224541) Pixel Size = (3.321750859523004,-3.321750859523010) Metadata: AREA_OR_POINT=Area COLORINTERP=Blue Image Structure Metadata: INTERLEAVE=PIXEL Corner Coordinates: Upper Left ( 468984.717, 5115798.603) (123d24' 7.06"W, 46d11'42.21"N) Lower Left ( 468984.717, 5061401.611) (123d23'54.38"W, 45d42'19.70"N) Upper Right ( 523667.380, 5115798.603) (122d41'35.76"W, 46d11'43.27"N) Lower Right ( 523667.380, 5061401.611) (122d41'45.44"W, 45d42'20.75"N) Center ( 496326.048, 5088600.107) (123d 2'50.66"W, 45d57' 3.46"N) Band 1 Block=16462x1 Type=UInt16, ColorInterp=Gray Min=1.000 Max=10000.000 Minimum=1.000, Maximum=10000.000, Mean=311.924, StdDev=500.680 NoData Value=0 Metadata: STATISTICS_APPROXIMATE=YES STATISTICS_MAXIMUM=10000 STATISTICS_MEAN=311.92363571779 STATISTICS_MINIMUM=1 STATISTICS_STDDEV=500.68043708223 STATISTICS_VALID_PERCENT=80.34 Band 2 Block=16462x1 Type=UInt16, ColorInterp=Undefined Min=1.000 Max=10000.000 Minimum=1.000, Maximum=10000.000, Mean=373.399, StdDev=495.399 NoData Value=0 Metadata: STATISTICS_APPROXIMATE=YES STATISTICS_MAXIMUM=10000 STATISTICS_MEAN=373.39868300658 STATISTICS_MINIMUM=1 STATISTICS_STDDEV=495.39942964762 STATISTICS_VALID_PERCENT=80.34 Band 3 Block=16462x1 Type=UInt16, ColorInterp=Undefined Min=1.000 Max=9961.000 Minimum=1.000, Maximum=9961.000, Mean=366.853, StdDev=518.729 NoData Value=0 Metadata: STATISTICS_APPROXIMATE=YES STATISTICS_MAXIMUM=9961 STATISTICS_MEAN=366.8531565228 STATISTICS_MINIMUM=1 STATISTICS_STDDEV=518.72915463185 STATISTICS_VALID_PERCENT=80.34 Band 4 Block=16462x1 Type=UInt16, ColorInterp=Undefined NoData Value=0 Band 5 Block=16462x1 Type=UInt16, ColorInterp=Undefined NoData Value=0 ```

QGIS
image

with gw.open(
    myfile1_clipped.tif
) as src:
    print(src.shape)
​
(5, 16375, 16462)

EDIT: I tried a vanilla rasterio read and that agrees with the GDAL outputs

#check metadata

import rasterio
with rasterio.open(filenames[0]) as rast_src:
    print(rast_src.meta)
{'driver': 'GTiff', 'dtype': 'uint16', 'nodata': 0.0, 'width': 16462, 'height': 16376, 'count': 5, 'crs': CRS.from_epsg(6339), 'transform': Affine(3.321750859523004, 0.0, 468984.71685261483,
       0.0, -3.3217508595230103, 5115798.602993652)}

#actually read a band and report array shape

with rasterio.open(filenames[0]) as rast_src:
    arr = rast_src.read(1)
    print(arr.shape)
	
(16376, 16462)

EDIT 2: rioxarray matches with raw rasterio

da = rioxarray.open_rasterio(filenames[0] )
da.shape
(5, 16376, 16462)

@jgrss jgrss mentioned this issue May 18, 2024
@jgrss
Copy link
Owner

jgrss commented May 18, 2024

Sorry for the delay @WY-CGhilardi.

I have a random data clone of your image, with gdalinfo test.tif resulting in:

Driver: GTiff/GeoTIFF
Files: test.tif
Size is 16462, 16376
Coordinate System is:
PROJCRS["WGS 84 / UTM zone 10N",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["UTM zone 10N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-123,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Between 126°W and 120°W, northern hemisphere between equator and 84°N, onshore and offshore. Canada - British Columbia (BC); Northwest Territories (NWT); Nunavut; Yukon. United States (USA) - Alaska (AK)."],
        BBOX[0,-126,84,-120]],
    ID["EPSG",32610]]
Data axis to CRS axis mapping: 1,2
Origin = (468984.717000000004191,5115798.603000000119209)
Pixel Size = (3.321750859523004,-3.321750859523010)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  COMPRESSION=LZW
  INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left  (  468984.717, 5115798.603) (123d24' 7.06"W, 46d11'42.21"N)
Lower Left  (  468984.717, 5061401.611) (123d23'54.38"W, 45d42'19.70"N)
Upper Right (  523667.380, 5115798.603) (122d41'35.76"W, 46d11'43.27"N)
Lower Right (  523667.380, 5061401.611) (122d41'45.44"W, 45d42'20.75"N)
Center      (  496326.048, 5088600.107) (123d 2'50.66"W, 45d57' 3.46"N)
Band 1 Block=256x256 Type=Byte, ColorInterp=Gray
  NoData Value=0
Band 2 Block=256x256 Type=Byte, ColorInterp=Undefined
  NoData Value=0
Band 3 Block=256x256 Type=Byte, ColorInterp=Undefined
  NoData Value=0
Band 4 Block=256x256 Type=Byte, ColorInterp=Undefined
  NoData Value=0
Band 5 Block=256x256 Type=Byte, ColorInterp=Undefined
  NoData Value=0

With changes in #319, the opened array results in:

with gw.open("test.tif") as src:
    print(src)

<xarray.DataArray (band: 5, y: 16376, x: 16462)> Size: 1GB
dask.array<open_rasterio-fa4e67f0ae8d81c28193cae47ef688b7<this-array>, shape=(5, 16376, 16462), dtype=uint8, chunksize=(5, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 40B 1 2 3 4 5
  * x        (x) float64 132kB 4.69e+05 4.69e+05 ... 5.237e+05 5.237e+05
  * y        (y) float64 131kB 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
Attributes: (12/13)
    transform:           (3.321750859523004, 0.0, 468984.717, 0.0, -3.3217508...
    crs:                 32610
    res:                 (3.321750859523004, 3.32175085952301)
    is_tiled:            1
    nodatavals:          (0.0, 0.0, 0.0, 0.0, 0.0)
    _FillValue:          0.0
    ...                  ...
    offsets:             (0.0, 0.0, 0.0, 0.0, 0.0)
    filename:            test.tif
    resampling:          nearest
    AREA_OR_POINT:       Area
    _data_are_separate:  0
    _data_are_stacked:   0

@WY-CGhilardi
Copy link
Author

I pulled down the /reorg branch and it looks like that does correct the initial image read dimensions, now matching rasterio/GDAL/...

with gw.open(
    filenames,
) as src:
    print(src)

<xarray.DataArray (time: 12, band: 5, y: 16376, x: 16462)> Size: 32GB
dask.array<concatenate, shape=(12, 5, 16376, 16462), dtype=uint16, chunksize=(1, 5, 1, 16462), chunktype=numpy.ndarray>
Coordinates:
  * x        (x) float64 132kB 4.69e+05 4.69e+05 ... 5.237e+05 5.237e+05
  * y        (y) float64 131kB 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * time     (time) int64 96B 1 2 3 4 5 6 7 8 9 10 11 12
  * band     (band) int64 40B 1 2 3 4 5
Attributes: (12/14)
    transform:           (3.321750859523004, 0.0, 468984.71685261483, 0.0, -3...
    crs:                 6339
    res:                 (3.321750859523004, 3.3217508595230103)
    is_tiled:            1
    nodatavals:          (0.0, 0.0, 0.0, 0.0, 0.0)
    _FillValue:          0.0
    ...                  ...
    filename:            ['myfile1.tif', 'myfile2.tif...
    resampling:          nearest
    AREA_OR_POINT:       Area
    COLORINTERP:         Blue
    _data_are_separate:  1
    _data_are_stacked:   1

strangely, the resulting polygon_to_array is still unaligned? I think that branch is probably still worthwhile merging at some point

labels = polygon_to_array( training_polys,'class', src)
print(labels)


<xarray.DataArray 'array-010aa44b3977375e822586d17ce8e9a8' (band: 1, y: 16375,
                                                            x: 16462)> Size: 270MB
dask.array<array, shape=(1, 16375, 16462), dtype=uint8, chunksize=(1, 1, 16462), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) int64 8B 1
  * y        (y) float64 131kB 5.116e+06 5.116e+06 ... 5.061e+06 5.061e+06
  * x        (x) float64 132kB 4.69e+05 4.69e+05 ... 5.237e+05 5.237e+05
Attributes:
    transform:  (3.321750859523004, 0.0, 468984.71685261483, 0.0, -3.32175085...
    crs:        EPSG:6339
    res:        (3.321750859523004, 3.3217508595230103)
    is_tiled:   1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants