Currently all field mappings are handled centrally, in constants.py. Additionally, the current implementation concatenates multiple files directly below one another, which requires the order of headings to be consistent from file to file.
However, we have several data sources, and many of them have changed both header titles and column order over the past month or so (some multiple times).
Based on a convo with Jon earlier, I'd suggest the script:
- Take in all subfolders in a particular directory (`/data` is fine),
- Parse a mapping file in each subfolder that provides the data source name to use in the `compilations` column, and the mappings from source-file headers in that subfolder to our output headers.
This would make it easy for non-developers to respond whenever a partner file format changes on us (headers move around, get renamed, etc.) without editing the code directly. Note that in this approach, the script must support multiple subdirectories having the same compilation name.
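As a rough sketch of what that per-source mapping file could look like, here's one possible shape. The filename (`mapping.json`), the JSON format, and the `source_name`/`headers` keys are all illustrative assumptions, not a decided design:

```python
import json
import os

def load_mapping(subdir):
    """Load a per-source mapping file from a /data subfolder.

    Assumes a hypothetical JSON file named "mapping.json"; the filename
    and schema here are illustrative only.
    """
    with open(os.path.join(subdir, "mapping.json")) as f:
        spec = json.load(f)
    # spec["source_name"] feeds the compilations column;
    # spec["headers"] maps source-file headers to our output headers.
    return spec["source_name"], spec["headers"]

def remap_row(row, header_map):
    # Rename only the headers the mapping knows about; drop the rest.
    return {out: row[src] for src, out in header_map.items() if src in row}
```

When a partner renames or reorders columns, only the `headers` dict in that subfolder's mapping file would need updating.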
I've updated my local copy to merge based on headers rather than just concatenate whole files, so while the approach in this issue is still preferred, its priority is lessened now.
Here's the new, kludgy implementation of `data.combine_all_csvs` in case it's helpful later:

```python
import csv
import os

def combine_all_csvs(dir_path):
    field_names = []
    files = []
    combined_rows = []
    # Get paths for CSV files in the provided dir
    for filename in os.listdir(dir_path):
        if filename.lower().endswith('.csv'):
            files.append(os.path.join(dir_path, filename))
    # Read in headers for each CSV file in the directory, building the
    # union of all headers seen, in first-seen order
    for f in files:
        with open(f, "r", newline="") as f_in:
            reader = csv.reader(f_in)
            headers = next(reader)
            for h in headers:
                if h not in field_names:
                    field_names.append(h)
    # Kludgy I know, but we write a temp merged CSV file to use the
    # headers management of the csv lib
    temp_path = os.path.join(dir_path, "temp_merge.csv")
    with open(temp_path, "w", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=field_names)
        writer.writeheader()
        for f in files:
            with open(f, "r", newline="") as f_in:
                reader = csv.DictReader(f_in)  # Uses the field names in this file
                for line in reader:
                    writer.writerow(line)
    # Read the merged file back in as our dataset
    rows = load_csv(temp_path, as_dicts=True)
    combined_rows += rows[1:]
    # Delete the temp merged file and return the values from it
    os.remove(temp_path)  # Note: If removed for testing, will cause race condition on repeat-runs.
    return combined_rows
```
With a little elbow grease the temp CSV file wouldn't be necessary and this could all be done in memory, but I haven't had the cycles to think that through.
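For what it's worth, one way the in-memory variant could look: same union-of-headers merge, but rows are collected as dicts directly and backfilled at the end, skipping the temp file and the `load_csv` round-trip. A sketch only, not a drop-in replacement:

```python
import csv
import os

def combine_all_csvs_in_memory(dir_path):
    """Merge all CSVs in dir_path on the union of their headers, in memory."""
    files = [os.path.join(dir_path, n) for n in os.listdir(dir_path)
             if n.lower().endswith(".csv")]
    field_names = []
    combined_rows = []
    for path in files:
        with open(path, "r", newline="") as f_in:
            reader = csv.DictReader(f_in)
            # Accumulate the union of headers, in first-seen order
            for h in reader.fieldnames or []:
                if h not in field_names:
                    field_names.append(h)
            combined_rows.extend(dict(row) for row in reader)
    # Give every row the full set of columns, blank where a file lacked one
    return [{h: row.get(h, "") for h in field_names} for row in combined_rows]
```

This also sidesteps the repeat-run race condition noted in the comment above, since nothing is written to disk.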