Support per-folder, file-based field mapping and config #4

Open
cawarren opened this issue May 8, 2020 · 1 comment
Comments

@cawarren (Member) commented May 8, 2020

Currently all field mappings are handled centrally, in constants.py. Additionally, the current implementation concatenates multiple files directly below one another, which requires the order of headings to be consistent from file to file.

However, we have several data sources, and many of them have changed both their header titles and column order over the past month or so (some multiple times).

Based on a convo with Jon earlier, I'd suggest the script:

  • Take in all subfolders in a particular directory (/data is fine),
  • Parse a mapping file in each subfolder that provides the data source name to use in the compilations column, and the mapping from that subfolder's source file headers to our output headers (sketched below)

This would make it easy for non-developers to respond each time a partner's file format changes on us (headers move around, get renamed, etc.) without editing the code directly. Note that in this approach, the script must support multiple subdirectories having the same compilation name.
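To make that concrete, here's a rough sketch of how a per-folder mapping file could be parsed. The mapping.json filename, its source_name / field_map keys, and the helper names are placeholders I'm making up for illustration, not anything in the repo yet:

import json
import os

def load_folder_mapping(folder_path):
  # Hypothetical per-folder config, e.g.:
  # {"source_name": "Partner A", "field_map": {"Facility Name": "name", "ZIP": "zip_code"}}
  with open(os.path.join(folder_path, "mapping.json"), "r") as f:
    mapping = json.load(f)
  return mapping["source_name"], mapping["field_map"]

def remap_row(row, field_map):
  # Rename only the headers listed in the mapping; anything unmapped is dropped.
  return {out: row[src] for src, out in field_map.items() if src in row}

The main script could then walk each subfolder of /data, load its mapping, apply remap_row to every row, and write the returned source_name into the compilations column.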

@cawarren (Member, Author) commented May 13, 2020

I've updated my local copy to merge based on headers rather than just concatenating whole files, so while the approach described in this issue is still preferred, its priority is lower now.

Here's the new, kludgy implementation of data.combine_all_csvs in case it's helpful later:

import csv
import os

def combine_all_csvs(dir_path):
  field_names = []
  files = []
  combined_rows = []

  # Get filenames for CSV files in the provided dir
  for filename in os.listdir(dir_path):
    if filename.lower().endswith('.csv'):
      files.append(os.path.join(dir_path, filename))

  # Read in headers for each CSV file in the directory
  for f in files:
    with open(f, "r", newline="") as f_in:
      reader = csv.reader(f_in)
      headers = next(reader)
      for h in headers:
        if h not in field_names:
          field_names.append(h)
          
  # Kludgy, I know, but we write a temp merged CSV file so the csv lib can handle the header mapping for us
  with open(os.path.join(dir_path, "temp_merge.csv"), "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=field_names)
    writer.writeheader()
    for f in files:
      with open(f, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # Uses the field names in this file
        for line in reader:
          writer.writerow(line)
          
  # Read the merged file back in as our dataset
  rows = load_csv(os.path.join(dir_path, "temp_merge.csv"), as_dicts=True)
  combined_rows += rows[1:]
    
  # Delete the temp merged file and return the values from it
  os.remove(os.path.join(dir_path, "temp_merge.csv")) # Note: if this removal is skipped (e.g. while testing), repeat runs will pick up temp_merge.csv as an input and duplicate rows.
  return combined_rows

With a little elbow grease the temp CSV file wouldn't be necessary and this could all be done in memory, but I haven't had cycles to think that through.
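For reference, a rough, untested sketch of what the in-memory version might look like (same union-of-headers logic, no temp file; the function name is just a placeholder):

import csv
import os

def combine_all_csvs_in_memory(dir_path):
  field_names = []
  combined_rows = []
  files = [os.path.join(dir_path, f) for f in os.listdir(dir_path) if f.lower().endswith('.csv')]

  # Collect the union of headers across all files, preserving first-seen order
  for path in files:
    with open(path, "r", newline="") as f_in:
      for h in next(csv.reader(f_in)):
        if h not in field_names:
          field_names.append(h)

  # Read each file as dicts and fill any missing columns with empty strings
  for path in files:
    with open(path, "r", newline="") as f_in:
      for row in csv.DictReader(f_in):
        combined_rows.append({name: row.get(name, "") for name in field_names})

  return combined_rows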
