[Justice Counts] Convert all column names to lowercase during spreadsheet upload. (Recidiviz/recidiviz-data#29389)

## Description of the change

Convert all column names to lowercase during spreadsheet upload.

It turns out that the reason we were running into this
mysterious/misleading unexpected error for Carrol County (see Recidiviz/recidiviz-data#29303) is
that we convert columns to lowercase when looking for
unexpected column names, but then do NOT use case-insensitive logic when
actually parsing the columns later on. This puts us in a weird state
where we don't catch the unexpected "Year" column during sheet
validation, but the later parsing steps don't recognize the "Year"
column and throw an unexpected error.
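
To make the mismatch concrete, here is a minimal sketch of the failure mode in plain Python. It is not the actual uploader code: `EXPECTED_COLUMNS`, `find_unexpected_columns`, and `parse_year` are hypothetical stand-ins for the validation and parsing steps, and the rows are plain dicts rather than a pandas DataFrame.

```python
from typing import Any, Dict, List, Set

# Hypothetical set of column names the sheet is allowed to contain.
EXPECTED_COLUMNS: Set[str] = {"year", "month", "value"}


def find_unexpected_columns(rows: List[Dict[str, Any]]) -> Set[str]:
    """Validation step: compares lowercased names, so "Year" looks fine."""
    actual = {col.lower() for row in rows for col in row}
    return actual - EXPECTED_COLUMNS


def parse_year(row: Dict[str, Any]) -> int:
    """Parsing step: exact-match lookup, so "Year" is not found."""
    return int(row["year"])


rows = [{"Year": 2023, "month": 1, "value": 10}]

# Validation reports nothing unexpected...
unexpected = find_unexpected_columns(rows)

# ...but parsing still fails on the capitalized column name.
try:
    parse_year(rows[0])
    parse_failed = False
except KeyError:
    parse_failed = True
```

The two steps disagree about what counts as the same column, which is why the sheet passes validation and then fails later with an unrelated-looking error.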

One solution here would be to make the parsing logic case-insensitive;
however, this will lead to issues later if we miss spots or forget
to make new parsing code case-insensitive down the road.

Instead, let's convert all column names to lowercase as an initial step
during workbook upload.

We have a place in the workbook uploader code where we drop all rows
that contain NaNs. Let's add another spreadsheet cleaning step there and
convert all the column names to lowercase.
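
As a rough illustration of normalizing once, up front, the sketch below lowercases column names on plain dict rows before any later step looks at them. The helper `lowercase_column_names` is hypothetical and is not part of the uploader; the guard for non-string keys is an assumption added for the sketch.

```python
from typing import Any, Dict, List


def lowercase_column_names(rows: List[Dict[Any, Any]]) -> List[Dict[Any, Any]]:
    """Return copies of the rows with string column names lowercased.

    Non-string keys are kept unchanged (an assumption made for this sketch).
    """
    return [
        {
            (col.lower() if isinstance(col, str) else col): value
            for col, value in row.items()
        }
        for row in rows
    ]


rows = [{"Year": 2023, "Month": 1, "value": 10}]
normalized = lowercase_column_names(rows)

# Every later step can now rely on exact, lowercase lookups.
year = int(normalized[0]["year"])
```

With the names normalized in one place, neither validation nor parsing needs its own case handling.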

## Testing

I tested this change locally by submitting a bulk upload sheet with a
"Year" column instead of the lowercase "year" column.

Here are the errors that we get _without_ any changes.
<img width="816" alt="Screenshot 2024-04-26 at 10 49 36 AM"
src="https://github.com/Recidiviz/recidiviz-data/assets/130382407/997b5040-c81d-43bb-b4fd-ffcfeddc0c8b">

Here is the improved error once we fix the expected-columns check (by
not making that process case-insensitive).
<img width="773" alt="Screenshot 2024-04-26 at 10 49 58 AM"
src="https://github.com/Recidiviz/recidiviz-data/assets/130382407/94de377b-5271-4736-86e0-b133031eb6b9">

And here is the final fix - when we convert all column names to
lowercase, making the uploaded spreadsheet a valid one.

<img width="1401" alt="Screenshot 2024-04-26 at 10 50 39 AM"
src="https://github.com/Recidiviz/recidiviz-data/assets/130382407/1835ff84-751e-4fe5-ac85-d9adc2c745d6">
GitOrigin-RevId: c5401416da9877603636ff45e451129b603fc5aa
brandon-hills authored and Helper Bot committed May 11, 2024
1 parent 2f86156 commit d8d0d74
Showing 3 changed files with 6 additions and 6 deletions.
6 changes: 2 additions & 4 deletions recidiviz/justice_counts/bulk_upload/spreadsheet_uploader.py
@@ -427,7 +427,7 @@ def _upload_rows_for_metricfile(
 # we are filtering out 'Unnamed: 0' because this is the column name of the index column
 # the index column is produced when the excel file is converted to a pandas df
 column_names = pd.DataFrame(rows).dropna(axis=1).columns
-actual_columns = {col.lower() for col in column_names if col != "Unnamed: 0"}
+actual_columns = {col for col in column_names if col != "unnamed: 0"}
 metric_key_to_errors = self._check_expected_columns(
 metricfile=metricfile,
 actual_columns=actual_columns,
@@ -735,9 +735,7 @@ def get_agency_name(row: Dict[str, Any]) -> str:
 system=system if system is not None else self.system,
 )
 if agency_name is None:
-actual_columns = {
-col.lower() for col in row.keys() if col != "Unnamed: 0"
-}
+actual_columns = {col for col in row.keys() if col != "unnamed: 0"}
 description = (
 f'We expected to see a column named "agency". '
 f"Only the following columns were found in the sheet: "
4 changes: 3 additions & 1 deletion recidiviz/justice_counts/bulk_upload/workbook_uploader.py
@@ -201,9 +201,11 @@ def upload_workbook(
 for sheet_name in actual_sheet_names:
 logging.info("Uploading %s", sheet_name)
 df = sheet_name_to_df[sheet_name]
-# Drop any rows that contain any NaN values
+# Drop any rows that contain any NaN values and make all column names lowercase.
 try:
 df = df.dropna(axis=0, how="any", subset=["value"])
+a = df.columns
+df.columns = [col.lower() for col in df.columns]
 except (KeyError, TypeError):
 # We will be in this case if the value column is missing,
 # and it's safe to ignore the error because we'll raise
@@ -1054,7 +1054,7 @@ def test_unexpected_column_name(
 def test_breakdown_sum_warning(
 self,
 ) -> None:
-"""Checks that we warn the user when a the sum of values in a breakdown sheet is uploaded
+"""Checks that we warn the user when the sum of values in a breakdown sheet is uploaded
 and does not equal the sum of values in the aggregate sheet.
 """
 with SessionFactory.using_database(self.database_key) as session:
