How to specify source files and validate data when using process_results and json2csv #1516

Open
connornelle opened this issue May 6, 2024 · 3 comments

@connornelle

Hello,

I have been running a PlantCV workflow on an HPC cluster. It currently writes results files the same way local parallelization would, except that I run a single-image workflow across the whole set of images as a Slurm array. The results files are named, for example, "IMG_2_IGP0001_results.txt" and "IMG_2_IGP0002_results.txt". There are 212 images. I run the following code over the results directory.

```python
import os
import sys, traceback
from plantcv import plantcv as pcv
from plantcv.parallel import WorkflowInputs
from plantcv import parallel
import matplotlib
from matplotlib import pyplot as plt

parallel.process_results(job_dir="/home/m18c364/ondemand/data/sys/myjobs/projects/default/19/results", json_file="combined_output.txt")
```

This does output a very long file. I think this is where I am losing the individual image markers, because after I run json2csv locally in the terminal, the wide-format CSV file has only one row, called "default_1". This is probably something simple that I need to add to the function call, but I haven't found it in the documentation. I have attached the files as well. Thank you!
combined_output.txt
output.csv-multi-value-traits.csv
output.csv-single-value-traits.csv
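
One way to test the suspicion above is to check whether each per-image results file carries any metadata before the files are combined. The sketch below is only an illustration: it assumes each results file is JSON with a top-level "metadata" key, and the function name `files_missing_metadata` and the default glob pattern are hypothetical.

```python
import json
from pathlib import Path


def files_missing_metadata(results_dir, pattern="*_results.txt"):
    """Return the names of results files whose metadata section is absent or empty.

    Assumption: each results file is a JSON document with a top-level
    "metadata" key. Adjust the key if your files are laid out differently.
    """
    missing = []
    for path in sorted(Path(results_dir).glob(pattern)):
        with open(path) as fh:
            data = json.load(fh)
        # An absent key and an empty dict are both treated as "no metadata".
        if not data.get("metadata"):
            missing.append(path.name)
    return missing
```

If this returns every file in the directory, the combined output has nothing per-image to key on, which would be consistent with a single row in the CSV.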

@HaleySchuhl
Contributor

Hi @connornelle, thanks for opening this issue. Since you are parallelizing without using plantcv-run-workflow, it looks like your outputs have no metadata. When json2csv creates the CSV file, there is no metadata (such as the image file name) to use as a unique data frame key, which is likely why everything ends up in a single row. To resolve this, I believe you can use the pcv.outputs.add_metadata method that we recently added to store the image filename under the term filepath.
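
A minimal sketch of that suggestion, placed at the end of a single-image workflow. It assumes a PlantCV version where `pcv.outputs.add_metadata(term, datatype, value)` is available. The helper names `results_name_for` and `save_with_filepath`, and the `.jpg` extension in the usage line, are hypothetical and only mirror the file naming described in the question.

```python
from pathlib import Path


def results_name_for(image_path):
    """Build the per-image results file name, e.g. IMG_2_IGP0001_results.txt."""
    return f"{Path(image_path).stem}_results.txt"


def save_with_filepath(image_path, results_dir):
    """Record the image path as metadata, then save this image's results.

    PlantCV is imported here (not at module level) so that the naming helper
    above can be used on machines where PlantCV is not installed.
    """
    from plantcv import plantcv as pcv

    # Store the image path under the term "filepath" so json2csv has a
    # per-image value to key rows on.
    pcv.outputs.add_metadata(term="filepath", datatype=str, value=str(image_path))

    out = Path(results_dir) / results_name_for(image_path)
    pcv.outputs.save_results(filename=str(out), outformat="json")
    return out
```

In a Slurm array task, this could be called once per image after the analysis steps, for example `save_with_filepath("IMG_2_IGP0001.jpg", "results")`, before running `process_results` over the results directory.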

@connornelle
Author

Thank you! This appears to be working well. I found that running the single-image workflow in parallel on our cluster was much easier than trying to use the built-in parallelization. Is there a resource on using the built-in version? I didn't know where to start to get that running.

@HaleySchuhl
Contributor

Awesome!

I believe the best documentation page for our parallelization, which details how to set up a configuration file, is here. We'll likely add a Scribe doc page as well, since we are finding their format really useful for processes that involve switching between multiple applications.
