Support per-folder, file-based field mapping and config #4

Open
cawarren opened this issue May 8, 2020 · 1 comment
Comments

@cawarren (Member) commented May 8, 2020

Currently all field mappings are handled centrally, in constants.py. Additionally, the current implementation concatenates multiple files directly below one another, which requires the order of headings to be consistent from file to file.

However, we have several data sources, and many of them have changed both their header titles and column order over the past month or so (some multiple times).

Based on a convo with Jon earlier, I'd suggest the script:

  • Take in all subfolders in a particular directory (/data is fine),
  • Parse a mapping file in each subfolder that provides the data source name to use in the compilations column, and the mapping from that subfolder's source file headers to our output headers (sketched below)

This would make it easy for non-developers to respond each time a partner's file format changes on us (headers move around, get renamed, etc.) without editing the code directly. Note that in this approach, the script must support multiple subdirectories having the same compilation name.
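To make that concrete, here's a rough sketch of how a per-folder mapping file could be parsed. The mapping.json filename, its source_name / field_map keys, and the helper names are placeholders I'm making up for illustration, not anything in the repo yet:

import json
import os

def load_folder_mapping(folder_path):
  # Hypothetical per-folder config, e.g.:
  # {"source_name": "Partner A", "field_map": {"Facility Name": "name", "ZIP": "zip_code"}}
  with open(os.path.join(folder_path, "mapping.json"), "r") as f:
    mapping = json.load(f)
  return mapping["source_name"], mapping["field_map"]

def remap_row(row, field_map):
  # Rename only the headers listed in the mapping; anything unmapped is dropped.
  return {out: row[src] for src, out in field_map.items() if src in row}

The main script could then walk each subfolder of /data, load its mapping, apply remap_row to every row, and write the returned source_name into the compilations column.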

@cawarren (Member, Author) commented May 13, 2020

I've updated my local copy to merge based on headers rather than just concatenating whole files, so while the approach described in this issue is still preferred, its priority is lower now.

Here's the new, kludgy implementation of data.combine_all_csvs in case it's helpful later:

import csv
import os

def combine_all_csvs(dir_path):
  field_names = []
  files = []
  combined_rows = []

  # Get filenames for CSV files in the provided dir
  for filename in os.listdir(dir_path):
    if filename.lower().endswith('.csv'):
      files.append(os.path.join(dir_path, filename))

  # Read in headers for each CSV file in the directory
  for f in files:
    with open(f, "r", newline="") as f_in:
      reader = csv.reader(f_in)
      headers = next(reader)
      for h in headers:
        if h not in field_names:
          field_names.append(h)
          
  # Kludgy, I know, but we write a temp merged CSV file so the csv lib can handle the header mapping for us
  with open(os.path.join(dir_path, "temp_merge.csv"), "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=field_names)
    writer.writeheader()
    for f in files:
      with open(f, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # Uses the field names in this file
        for line in reader:
          writer.writerow(line)
          
  # Read the merged file back in as our dataset
  rows = load_csv(os.path.join(dir_path, "temp_merge.csv"), as_dicts=True)
  combined_rows += rows[1:]
    
  # Delete the temp merged file and return the values from it
  os.remove(os.path.join(dir_path, "temp_merge.csv")) # Note: if this removal is skipped (e.g. while testing), repeat runs will pick up temp_merge.csv as an input and duplicate rows.
  return combined_rows

With a little elbow grease the temp CSV file wouldn't be necessary and this could all be done in memory, but I haven't had cycles to think that through.
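For reference, a rough, untested sketch of what the in-memory version might look like (same union-of-headers logic, no temp file; the function name is just a placeholder):

import csv
import os

def combine_all_csvs_in_memory(dir_path):
  field_names = []
  combined_rows = []
  files = [os.path.join(dir_path, f) for f in os.listdir(dir_path) if f.lower().endswith('.csv')]

  # Collect the union of headers across all files, preserving first-seen order
  for path in files:
    with open(path, "r", newline="") as f_in:
      for h in next(csv.reader(f_in)):
        if h not in field_names:
          field_names.append(h)

  # Read each file as dicts and fill any missing columns with empty strings
  for path in files:
    with open(path, "r", newline="") as f_in:
      for row in csv.DictReader(f_in):
        combined_rows.append({name: row.get(name, "") for name in field_names})

  return combined_rows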
