Potential bug in image driver handling of parameter file, and/or tonic #787

Open
tbohn opened this issue Mar 7, 2018 · 12 comments · May be fixed by #943
Comments

@tbohn (Contributor) commented Mar 7, 2018

  • Version: VIC.5.0.1 develop branch
  • Compiler: gcc 5.4.0
  • Operating system: ubuntu 16.04.9
  • Model settings: image driver
  • Problem summary:
    When using parameter files that I have generated from scratch, various states and fluxes take on garbage values that fairly quickly lead to NaNs. This doesn't happen with parameter files that I've generated with tonic from the Livneh et al. (2015) ASCII soil/veg/snowband parameter files.

I believe that the problem is caused by a combination of how I created my parameter files and some potentially missing/bad logic in VIC (and tonic). But when I try to fix my parameter files to make them more amenable to VIC, VIC still has trouble. I am not sure whether that's because I haven't succeeded in fixing my parameter files, or whether some other problem is occurring. I could use some advice and/or help brainstorming solutions here...

More detail:
My new parameter files generated from scratch differ from the tonic-generated files in a few ways:

  • for "vegetation library" parameters and root zone parameters, tonic sets them to "valid" values (the same constant value) for all classes in all grid cells, regardless of whether those classes are actually present in the grid cell (i.e., regardless of whether Cv[v] > 0). In my parameters, I only defined them where Cv[v] > 0, and set them to missing where the vegetation class is not present (Cv[v] == 0).
  • tonic creates an extra bare soil tile in cells where the vegetated tile areas sum to less than 1 (Cv_sum < 1), but does NOT increment Nveg to reflect the extra tile (i.e., Nveg < Nnonzero, where Nnonzero is the number of tiles for which Cv[v] > 0). In my parameters, the tile areas always reflect exactly what is present in the land cover map; Cv_sum is always == 1 (except for numerical error); and Nveg == Nnonzero. (A quick way to check both of these properties of a parameter file is sketched just after this list.)
  • tonic parameter files define all floating point variables as type double. I defined mine as float to save space, since I didn't need the precision.
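
For reference, here is a quick way to check both of those properties in a parameter file (a sketch using the netCDF4 Python package; it assumes the usual image-driver variable names Cv and Nveg with a (veg_class, lat, lon) layout, and "params.nc" is a placeholder path):

import numpy as np
from netCDF4 import Dataset

# Sketch only: count tiles with Cv > 0 per cell and compare to Nveg,
# and check that Cv sums to 1 over the veg_class dimension.
with Dataset("params.nc") as nc:
    cv = nc.variables["Cv"][:]      # (veg_class, lat, lon), masked array
    nveg = nc.variables["Nveg"][:]  # (lat, lon)

active = ~np.ma.getmaskarray(nveg)              # cells with a defined Nveg
cv = np.ma.filled(cv, 0.0).astype(np.float64)   # treat missing Cv as 0
nonzero = (cv > 0.0).sum(axis=0)                # the same Cv > 0 test described below
cv_sum = cv.sum(axis=0)

print("cells where count(Cv > 0) != Nveg:",
      int((active & (nonzero != np.ma.filled(nveg, -1))).sum()))
print("cells where |sum(Cv) - 1| > 1e-6:",
      int((active & (np.abs(cv_sum - 1.0) > 1e-6)).sum()))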

In vic_run(), I saw that bad values of veg parameters such as trunk_ratio (they looked a lot like the missing value I'd given them in my parameter file) were being passed to CalcAerodynamic(). This could certainly explain why the tonic parameter files didn't cause problems (since even invalid tiles have valid parameter values in tonic-generated parameter files). To verify that missing values from invalid tiles were getting passed, I replaced those values in the parameter file with an arbitrary value (again, only where Cv[v] == 0), and these values showed up being passed to CalcAerodynamic().

As for why these garbage values were being used, I hypothesized that VIC was somehow mistaking zero-area tiles for valid tiles. Looking at the code: in get_global_domain(), VIC sets nv_active to Nveg+1, where Nveg is the value from the parameter file; then vic_alloc() allocates nv_active tiles for veg_con and calls make_all_vars() with nv_active-1, which allocates nv_active-1+1 tiles for veg_var, cell, etc. But then vic_init() does its mapping between the all-classes structure veg_con_map and the only-nonzero-classes structure veg_con by checking whether Cv > 0; Nveg is NOT involved in this mapping. My theory at the moment is that my single-precision Cv values might not look like 0 to VIC when it interprets them as doubles, causing the mapping to map invalid tiles into the veg_con structure.
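
To make the suspected failure mode concrete, here is a schematic paraphrase of that mapping (in Python, purely illustrative; it is not the actual C code in vic_init()):

# Schematic only: Nveg never enters the test, so a Cv that is merely
# "almost zero" still produces a tile whose other parameters may be
# fill values.
def active_tiles(cv_all_classes):
    """Return the class indices that would get a veg_con tile."""
    return [v for v, cv in enumerate(cv_all_classes) if cv > 0.0]

# Class 3 should be absent, but single-precision round-off left it > 0:
print(active_tiles([0.6, 0.4, 0.0, 1.2e-8]))   # -> [0, 1, 3]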

However, I've tried changing Cv in my parameter file to double, and I've tried rounding all values to 6 digits of precision to ensure that 0s look like 0s, but neither fixes the problem. So if I haven't properly zeroed out the 0s this way, I could definitely use help getting that to work correctly (my Python skills are still not at the level of the UW lab's).
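
Roughly, the clean-up I have in mind looks like this (a sketch with xarray; the file names, the "veg_class" dimension name, and the 1e-6 tolerance are placeholders, not anything VIC or tonic prescribes):

import xarray as xr

ds = xr.open_dataset("params.nc")

# Cast to double and force near-zero fractions to exactly 0,
# then renormalize so the remaining fractions sum to 1 per cell.
cv = ds["Cv"].astype("float64")
cv = cv.where(cv > 1e-6, 0.0)
total = cv.sum("veg_class")
cv = cv / total.where(total > 0, 1.0)   # leave empty (ocean) cells at 0
ds["Cv"] = cv

ds.to_netcdf("params_cleaned.nc")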

But I don't know for sure if that's what happened - maybe I did succeed in zeroing out the 0s (it looked that way in a Python session), and maybe something else is causing the problem in VIC.

Regardless, I can also see that VIC and tonic might benefit from improved code to counter these problems. Tonic's inconsistent incrementing of Nveg seems dangerous. In VIC, we could easily implement a check on whether the total number of tiles with Cv[v] > 0 actually equals Nveg - if not, it could give the user a helpful error message so that the user knows exactly what the problem is.

Also, I think tonic's assignment of valid parameter values to cell/class combinations where the classes are not present is a bad policy, since if VIC for some other reason were to accidentally assign invalid tiles, the error would not be caught (the parameters would be physically reasonable, just wrong).

Any thoughts/advice on this would be greatly appreciated. I need to get my simulations going ASAP!

@tbohn (Contributor, Author) commented Mar 8, 2018

Another potential culprit has appeared: vic_run_veg_lib seems to be wrong.

@tbohn (Contributor, Author) commented Mar 8, 2018

Another potential culprit: parallelization. I've noticed that if I insert print statements into anything within the loop over cells in vic_image_run(), or any function below that such as vic_run(), the order of grid cells gets all jumbled up. I've attempted to run with a single processor via mpirun -n 1, but maybe I'm doing it wrong? Regardless, running with mpirun -n 1, or omitting it and just running vic_image.exe -g .... by itself, both result in the various print statements executing in a jumbled time order, to the point that the output becomes very difficult to interpret.

I'm not saying that it's wrong to have parallelization per se; I'm saying that if parallelization is occurring even when I tell it to use a single processor, perhaps that could be causing a mismatch between the i of veg_lib[i] (which is carried indirectly through vic_run_veg_lib) and the i of veg_con[i], such that the veg_class called for by veg_con gets the veg_lib properties of invalid tiles from other cells. I.e., perhaps we're asking for the properties of a veg_class in the current cell, but we're presented with the veg_lib of a different cell for which that class is not present; thus we get the garbage values. This would be consistent with things working OK for the tonic-generated param files, since, again, they define the complete veg_lib for all classes in all cells. But that would be getting the right answer for the wrong reason... Perhaps there's an issue with the mpich library on my machine...

@tbohn (Contributor, Author) commented Mar 8, 2018

Now, the primary suspect is OpenMPI (at least the version on my machine, which might not be the right version). I deleted the line in vic_image_run() immediately before the loop over grid cells, which contains a pragma instructing OpenMPI to run the for loop using multiple threads. Suddenly my simulations ran fine.

The questions now are:

  1. Is there something wrong with my OpenMPI? (And if so, how do I fix it so that it either handles the pragma correctly or ignores the pragma?)
  2. How easily could this happen to other users? Is it happening at UW but not being detected because all grid cells contain redundant copies of the veg_lib?
  3. Regardless, is it a good idea for VIC param files to define veg_lib parameters for zero-area veg tiles (in terms of space usage, the possibility of corrupt values where validation doesn't take place, etc.)?

@jhamman (Member) commented Mar 10, 2018

@tbohn - thanks for the report.

  • For clarity, the pragma statement is for OpenMP (not OpenMPI).
  • If you put the pragma statement back and set the OMP_NUM_THREADS environment variable to 1, do you still see the buggy behavior?
    export OMP_NUM_THREADS=1
    
  • If you compile without OpenMP but with the pragma statement in place, do you see the buggy behavior? For that, you'd need to remove the following line from the makefile.
    -fopenmp \
  • Finally, are you still building against anaconda libraries? I'm suspicious of that configuration, so it would be good to know whether this is repeatable with a different set of libraries.

tbohn mentioned this issue Mar 10, 2018
@tbohn (Contributor, Author) commented Mar 11, 2018

To answer your questions:

  • yes, export OMP_NUM_THREADS=1 fixes the problem.
  • yes, commenting out the -fopenmp line fixes the problem.
  • yes, I'm still building against the anaconda libraries.

@jhamman (Member) commented Mar 11, 2018

@tbohn - how are you executing VIC? Do you get the odd behavior if you don't use MPI?

Can you run a test with:

export OMP_NUM_THREADS=8  # or some number of threads > 1
./vic_image -g your_param.txt  # omit mpirun 

@tbohn (Contributor, Author) commented Mar 11, 2018

The above test reproduces the bug.

That test is identical to one of the several ways that I ran VIC and encountered the bug (i.e., I ran VIC both with and without mpirun), except that I wasn't explicitly setting OMP_NUM_THREADS to anything. So this implies that my environment uses multiple threads by default (presumably because, when OMP_NUM_THREADS is unset, OpenMP typically defaults to one thread per available core).

@jhamman (Member) commented May 24, 2018

I think that in order to close this issue we should add a test that checks for bit-for-bit equivalence between the single-core image driver and the OpenMP-enabled image driver. That should probably fall on me, but if someone else is interested and gets to it first, go for it. A sketch of the comparison step is below.
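
Something along these lines (a sketch of the comparison only; the two runs would be produced by the existing test harness, and the file names are placeholders):

import numpy as np
import xarray as xr

# mask_and_scale=False keeps raw values so fill values compare bit for bit.
single = xr.open_dataset("fluxes_omp1.nc", mask_and_scale=False)  # OMP_NUM_THREADS=1
multi = xr.open_dataset("fluxes_omp8.nc", mask_and_scale=False)   # OMP_NUM_THREADS=8

for name in single.data_vars:
    if not np.array_equal(single[name].values, multi[name].values):
        raise AssertionError("variable %s differs between runs" % name)
print("single-thread and multi-thread outputs match bit for bit")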

@dwoodson-usbr commented

@jhamman or @tbohn - is there any update on this issue? I am running into a similar issue with NA values in my VIC output when running with OpenMPI, VIC 5.1.0, and the calibrated VIC parameter file and domain from Currier et al. (2023). I get many NAs throughout the CRB domain.

I have tried a variety of the suggestions in this and other threads and in the VIC image driver documentation page (MX_RCACHE = 2, removing the -fopenmp flag, etc.), as well as trying different forcings, etc. We did not build against any conda libraries.

I am able to run all of the VIC sample datasets without this issue.

The issue presents either as large chunks of the domain being NA immediately:
[image: map with large contiguous regions of NA cells]

Or as NAs becoming more prevalent over time:
[image: map with NA cells that spread as the simulation progresses]

@tbohn (Contributor, Author) commented Dec 26, 2023 via email

@dwoodson-usbr commented

@tbohn thank you for your response!

-David

@dwoodson-usbr commented

The NA values in my VIC output were caused by two issues:

  1. Not setting any value for OMP_NUM_THREADS before running the image driver with OpenMPI. This caused a phenomenon where only a few cells were NA at the simulation start, but the number of NA cells grew over time; it appears to be a conflict between OpenMP and OpenMPI. export OMP_NUM_THREADS=1 solved this issue.
  2. When generating my VIC forcings, I used MetSim. However, I made my MetSim domain file using the R terra package, and doing so imparted something into the VIC forcings that caused large zones of NA cells in the VIC output, even though the VIC forcings looked fine upon inspection and analysis. Creating the MetSim domain with Python code did not cause this issue, even though the MetSim domains from R and Python were essentially identical in terms of mask and elevation. So it is perhaps something with the drivers used by the R terra package. (A sketch of this kind of file comparison is just after this list.)
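
For anyone hitting the same thing, a comparison along these lines can surface the kind of hidden differences I'm describing (a sketch; "domain_terra.nc" and "domain_python.nc" are placeholder names for the R-terra and Python versions):

import numpy as np
import xarray as xr

# mask_and_scale=False keeps raw values and leaves _FillValue in .attrs,
# so dtype and fill-value differences between the two files show up directly.
r_dom = xr.open_dataset("domain_terra.nc", mask_and_scale=False)
py_dom = xr.open_dataset("domain_python.nc", mask_and_scale=False)

for name in sorted(set(r_dom.data_vars) | set(py_dom.data_vars)):
    if name not in r_dom.data_vars or name not in py_dom.data_vars:
        print(name, ": present in only one file")
        continue
    a, b = r_dom[name], py_dom[name]
    if a.dtype != b.dtype:
        print(name, ": dtype differs:", a.dtype, "vs", b.dtype)
    if dict(a.attrs) != dict(b.attrs):
        print(name, ": attributes differ (check _FillValue, scale_factor, add_offset)")
    if a.shape == b.shape and not np.allclose(a.values.astype("float64"),
                                              b.values.astype("float64"),
                                              equal_nan=True):
        print(name, ": values differ")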

dwoodson-usbr added a commit to dwoodson-usbr/VIC that referenced this issue on Jan 11, 2024: "Added description of work around for conflicts between OpenMP and OpenMPI based on UW-Hydro#787"
dwoodson-usbr linked a pull request on Jan 11, 2024 that will close this issue.