How to specify source files and validate data when using process_results and json2csv #1516

connornelle · 2024-05-06T17:36:18Z

Hello,

I have been running a PCV workflow on an HPC cluster. It currently writes results files the same way local parallelization would (however, for the Slurm array I am running a single-image workflow over the whole array of images). The results files are named, for example, "IMG_2_IGP0001_results.txt" and "IMG_2_IGP0002_results.txt". There are 212 images. I run the following code over the results directory.

```python
import os
import sys, traceback
from plantcv import plantcv as pcv
from plantcv.parallel import WorkflowInputs
from plantcv import parallel
import matplotlib
from matplotlib import pyplot as plt

parallel.process_results(job_dir="/home/m18c364/ondemand/data/sys/myjobs/projects/default/19/results", json_file="combined_output.txt")
```

This does output a very long file. I think this is where I am losing the individual image markers, because after I run json2csv locally in the terminal, the wide-format CSV file has only one row, called "default_1". This is probably just a simple thing I have to add to the function, but I haven't found it in the documentation. I have attached the files too. Thank you!
combined_output.txt
output.csv-multi-value-traits.csv
output.csv-single-value-traits.csv

HaleySchuhl · 2024-05-08T15:03:34Z

Hi @connornelle, thanks for opening this issue. Since you are parallelizing without using `plantcv-run-workflow`, it looks like your outputs have no metadata. When json2csv creates the CSV file, there is no metadata (such as the image file name) to use as a unique data frame key, so every result ends up under the same key. To resolve this, I believe you can use the `pcv.outputs.add_metadata` method that we recently added to store the image filename under the term `filepath`.
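A minimal sketch of that suggestion, assuming `pcv.outputs.add_metadata` takes a term, a datatype, and a value in that order (worth checking against the docs for your installed PlantCV version). The helper name `record_source`, the stand-in class, and the image name `IMG_2_IGP0001.jpg` are hypothetical and only here for illustration; the stand-in exists so the snippet runs without PlantCV installed.

```python
import os


def record_source(outputs, image_path, term="filepath"):
    """Store the image path as metadata on a PlantCV-style outputs object.

    Assumption: `outputs` exposes add_metadata(term, datatype, value),
    as pcv.outputs is described as doing in the reply above.
    """
    value = os.path.abspath(image_path)
    # Positional arguments, in the assumed order: term, datatype, value
    outputs.add_metadata(term, str, value)
    return value


class RecordingOutputs:
    """Hypothetical stand-in for pcv.outputs, used only for this demo."""

    def __init__(self):
        self.metadata = {}

    def add_metadata(self, term, datatype, value):
        self.metadata[term] = {"datatype": str(datatype), "value": value}


# Demo with the stand-in; in a real workflow pass pcv.outputs instead.
demo = RecordingOutputs()
record_source(demo, "IMG_2_IGP0001.jpg")  # hypothetical image name
print(demo.metadata["filepath"]["value"])
```

In the single-image workflow this would be called as `record_source(pcv.outputs, <path to the image for this array task>)` before the results file is saved, so each `*_results.txt` file carries its own source image.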

connornelle · 2024-05-08T19:19:33Z

Thank you! This appears to be working well. I found that running the single-image workflow in parallel on our cluster was much easier than trying to use the built-in parallelization. Is there a resource on using the built-in version? I didn't know where to start to get that running.
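For anyone wanting to confirm the fix before converting to CSV, here is a rough check of the combined file. It assumes the file written by `process_results` is JSON with a top-level `entities` list, where each entity has a `metadata` dictionary whose entries hold a `value` field; that layout is an assumption, so inspect your own `combined_output.txt` first. The function names and sample data are hypothetical.

```python
import json
from collections import Counter


def load_results(path):
    """Read a combined results file (assumed to be JSON)."""
    with open(path) as handle:
        return json.load(handle)


def check_entities(data, term="filepath"):
    """Return (missing, duplicated) for one metadata term.

    missing: indices of entities with no usable value for the term.
    duplicated: JSON-encoded values shared by more than one entity.
    """
    missing = []
    seen = Counter()
    for index, entity in enumerate(data.get("entities", [])):
        record = entity.get("metadata", {}).get(term)
        value = record.get("value") if isinstance(record, dict) else None
        if value in (None, "", []):
            missing.append(index)
        else:
            # Encode so list-valued metadata can be counted too
            seen[json.dumps(value, sort_keys=True)] += 1
    duplicated = sorted(key for key, count in seen.items() if count > 1)
    return missing, duplicated


# Hypothetical sample in the assumed layout: the second entity has no metadata.
sample = {
    "variables": {},
    "entities": [
        {"metadata": {"filepath": {"value": "IMG_2_IGP0001.jpg"}}, "observations": {}},
        {"metadata": {}, "observations": {}},
    ],
}
missing, duplicated = check_entities(sample)
print("entities missing filepath:", missing)
print("duplicated filepath values:", duplicated)
```

On real data, `check_entities(load_results("combined_output.txt"))` should return two empty lists if every one of the 212 images has its own `filepath`.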

HaleySchuhl · 2024-05-09T14:45:35Z

Awesome!

I believe the best documentation page for our parallelization, which details how to set up a configuration file, is here. We'll likely add a Scribe doc page as well, since we are finding their formatting really useful for processes that involve switching between multiple applications.
