[Justice Counts] Convert all column names to lowercase during spreadsheet upload. (Recidiviz/recidiviz-data#29389)

## Description of the change

Convert all column names to lowercase during spreadsheet upload.

The reason we were running into the mysterious, misleading unexpected error for Carrol County (see Recidiviz/recidiviz-data#29303) is that we convert column names to lowercase when looking for unexpected column names, but then do NOT use case-insensitive logic when actually parsing the columns later on. This puts us in a weird state: sheet validation does not catch the unexpected "Year" column, but the later parsing steps don't recognize the "Year" column and throw an unexpected error.

One solution would be to make the parsing logic case-insensitive. However, that would cause problems down the road if we miss a spot or forget to make newly added parsing code case-insensitive.

Instead, let's convert all column names to lowercase as an initial step during workbook upload. The workbook uploader code already has a place where we drop all rows that contain NaN values. Let's add another spreadsheet cleaning step there and convert all the column names to lowercase.

## Testing

I tested this change locally by submitting a bulk upload sheet with a "Year" column instead of the lowercase "year" column.

Here are the errors that we get _without_ any changes.

<img width="816" alt="Screenshot 2024-04-26 at 10 49 36 AM" src="https://github.com/Recidiviz/recidiviz-data/assets/130382407/997b5040-c81d-43bb-b4fd-ffcfeddc0c8b">

Here is the improved error once we fix the expected-columns check (by not making that process case-insensitive).

<img width="773" alt="Screenshot 2024-04-26 at 10 49 58 AM" src="https://github.com/Recidiviz/recidiviz-data/assets/130382407/94de377b-5271-4736-86e0-b133031eb6b9">

And here is the final fix: once we convert all column names to lowercase, the uploaded spreadsheet is a valid one.

<img width="1401" alt="Screenshot 2024-04-26 at 10 50 39 AM" src="https://github.com/Recidiviz/recidiviz-data/assets/130382407/1835ff84-751e-4fe5-ac85-d9adc2c745d6">

GitOrigin-RevId: c5401416da9877603636ff45e451129b603fc5aa
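
The mismatch described in the commit message can be reproduced in isolation. The following is a minimal sketch, not the actual uploader code: `EXPECTED_COLUMNS`, `unexpected_columns`, `parse_year`, and `lowercase_columns` are hypothetical names introduced for illustration, and rows are modeled as plain dictionaries.

```python
from typing import Any, Dict, List, Set

# Hypothetical expected column set, for illustration only.
EXPECTED_COLUMNS = {"year", "month", "value"}


def unexpected_columns(row: Dict[str, Any]) -> Set[str]:
    """Case-insensitive check, similar in spirit to the pre-fix validation."""
    return {col.lower() for col in row} - EXPECTED_COLUMNS


def parse_year(row: Dict[str, Any]) -> int:
    """Case-sensitive lookup, similar in spirit to the later parsing steps."""
    return row["year"]


def lowercase_columns(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return new rows whose keys are all lowercase."""
    return [{str(col).lower(): val for col, val in row.items()} for row in rows]


rows = [{"Year": 2024, "month": 4, "value": 10}]

# Validation passes because it lowercases names before comparing...
assert unexpected_columns(rows[0]) == set()

# ...but parsing fails because it looks up the exact key "year".
try:
    parse_year(rows[0])
except KeyError as err:
    print(f"parsing failed: missing column {err}")

# After normalizing once up front, validation and parsing agree.
clean_rows = lowercase_columns(rows)
print(parse_year(clean_rows[0]))
```

Normalizing once at the entry point means every later step can keep using exact, lowercase keys, which is the design choice the commit message argues for.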
Recidiviz · May 11, 2024 · d8d0d74 · d8d0d74
1 parent 2f86156
commit d8d0d74
Showing 3 changed files with 6 additions and 6 deletions.
diff --git a/recidiviz/justice_counts/bulk_upload/spreadsheet_uploader.py b/recidiviz/justice_counts/bulk_upload/spreadsheet_uploader.py
@@ -427,7 +427,7 @@ def _upload_rows_for_metricfile(
         # we are filtering out 'Unnamed: 0' because this is the column name of the index column
         # the index column is produced when the excel file is converted to a pandas df
         column_names = pd.DataFrame(rows).dropna(axis=1).columns
-        actual_columns = {col.lower() for col in column_names if col != "Unnamed: 0"}
+        actual_columns = {col for col in column_names if col != "unnamed: 0"}
         metric_key_to_errors = self._check_expected_columns(
             metricfile=metricfile,
             actual_columns=actual_columns,
@@ -735,9 +735,7 @@ def get_agency_name(row: Dict[str, Any]) -> str:
                 system=system if system is not None else self.system,
             )
             if agency_name is None:
-                actual_columns = {
-                    col.lower() for col in row.keys() if col != "Unnamed: 0"
-                }
+                actual_columns = {col for col in row.keys() if col != "unnamed: 0"}
                 description = (
                     f'We expected to see a column named "agency". '
                     f"Only the following columns were found in the sheet: "

diff --git a/recidiviz/justice_counts/bulk_upload/workbook_uploader.py b/recidiviz/justice_counts/bulk_upload/workbook_uploader.py
@@ -201,9 +201,11 @@ def upload_workbook(
         for sheet_name in actual_sheet_names:
             logging.info("Uploading %s", sheet_name)
             df = sheet_name_to_df[sheet_name]
-            # Drop any rows that contain any NaN values
+            # Drop any rows that contain any NaN values and make all column names lowercase.
             try:
                 df = df.dropna(axis=0, how="any", subset=["value"])
+                a = df.columns
+                df.columns = [col.lower() for col in df.columns]
             except (KeyError, TypeError):
                 # We will be in this case if the value column is missing,
                 # and it's safe to ignore the error because we'll raise
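
One ordering detail in the hunk above may be worth checking: `dropna(subset=["value"])` runs before the column names are lowercased, so a sheet whose header is spelled "Value" would raise `KeyError` inside the `try` block and skip the lowercase step. The following is a hedged sketch of an alternative ordering, not the repository's code; `clean_sheet` is a hypothetical helper, and the `str()` call is an added assumption to tolerate non-string headers.

```python
import pandas as pd


def clean_sheet(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase column names first, then drop rows with a missing value."""
    # str() guards against non-string headers such as integers.
    df = df.rename(columns=lambda col: str(col).lower())
    # Only drop rows when the column exists; a missing "value" column
    # is left for later validation to report.
    if "value" in df.columns:
        df = df.dropna(axis=0, how="any", subset=["value"])
    return df


sheet = pd.DataFrame({"Year": [2023, 2024], "Value": [10.0, None]})
cleaned = clean_sheet(sheet)
print(list(cleaned.columns))
print(len(cleaned))
```

Because `rename` returns a new DataFrame, the caller's original sheet is left unmodified.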

diff --git a/recidiviz/tests/justice_counts/bulk_upload/bulk_upload_test.py b/recidiviz/tests/justice_counts/bulk_upload/bulk_upload_test.py
@@ -1054,7 +1054,7 @@ def test_unexpected_column_name(
     def test_breakdown_sum_warning(
         self,
     ) -> None:
-        """Checks that we warn the user when a the sum of values in a breakdown sheet is uploaded
+        """Checks that we warn the user when the sum of values in a breakdown sheet is uploaded
         and does not equal the sum of values in the aggregate sheet.
         """
         with SessionFactory.using_database(self.database_key) as session: