[BUG/ISSUE] Significant performance degradation on multiple platforms affecting GEOS-Chem Classic and GCHP #57

Open
LiamBindle opened this issue Oct 16, 2020 · 14 comments
Labels: category: Bug · never stale · topic: Performance

Comments

@LiamBindle (Contributor) commented Oct 16, 2020

There appears to be a significant performance hit in GEOS-Chem Classic and GCHP on some platforms, particularly those using gfortran, stemming from somewhere in HEMCO. This issue has been observed by @lizziel, @jimmielin, @WilliamDowns, and myself.

The four of us just finished a call discussing the issue and how to proceed. Some notes from that meeting are in the collapsed block below.

Zoom call notes
  • HEMCO slowdown call [2020-10-16 Fri 10:00]
    • Will has seen the slowdown in the calc-emissions (Get Current Emissions) code
    • Haipeng had a similar problem with get emissions
      • An OpenMP loop becomes a serial loop
      • Removing the subroutine call (copy-pasting its code directly into the get-emissions loop) fixed the issue
      • His slowdown was in boundary layer mixing (~100x)
      • Potentially an OpenMP issue
    • GCHP still uses OpenMP; only problematic at >1000 cores
    • Troubleshooting idea: check whether performance differs when HEMCO within GCHP is built without OpenMP
    • Will saw the slowdown in the Get Current Emissions code
      • Some (but not all) emissions containers hang for a while
      • Haipeng has seen some containers hang for a while too
    • Look at gfortran flags that could give warnings about performance issues
    • Can it be reproduced with the HEMCO standalone?
      • Lizzie is going to look into this
    • Liam will open an issue on HEMCO
    • Liam will check whether building HEMCO without OpenMP speeds things up
    • Will will continue running timing tests and narrowing down which emissions collections are problematic
    • Haipeng will try GEOS-Chem Classic with gfortran and some different compiler settings

Observations of the issue

Lizzie's internal benchmarks show that HEMCO's performance in GEOS-Chem Classic with gfortran has been deteriorating. Wall times for HEMCO in GEOS-Chem Classic:

| Version | gfortran | ifort |
|---------|----------|-------|
| alpha.4 | 396      | 232   |
| alpha.5 | 976      | 216   |
| alpha.9 | 1732     | 376   |

In some GCHP timing tests that Lizzie and I ran a few weeks ago, I observed a very significant drop-off in GCHP's scaling (see figure below). Note that the solid lines are Lizzie's tests with ifort and the dashed lines are my tests with gfortran. The drop-off in performance was dominated by a big slowdown in HCOI_GC_RUN(...).

[Figure: GCHP throughput vs. number of cores; solid lines are ifort tests, dashed lines are gfortran tests]

Further information

  • In GCHP timing tests, @WilliamDowns has seen a significant slowdown in the Get Current Emissions code. He has found that certain (but not all) emissions containers exhibit significant slowdowns.
  • @jimmielin saw a performance hit in GC-Classic due to a bad loop in "Get Emissions". He managed to resolve the slowdown by replacing a subroutine call in the bad loop with the subroutine's code (i.e., an inline copy-paste). He suspects his issue was OpenMP-related (a minimal sketch of this pattern is shown below).
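Below is a minimal, self-contained sketch of the loop pattern from the second bullet: an OpenMP parallel loop whose body is a per-iteration subroutine call, followed by the same work inlined. The names (emis_loop_demo, add_one) are illustrative placeholders, not HEMCO routines, and the sketch only shows the structure Haipeng described; it does not by itself reproduce the slowdown.

```fortran
! Minimal sketch of the pattern described above: an OpenMP parallel loop
! whose body is a call to a tiny subroutine, followed by the same work
! inlined.  All names here are placeholders, not HEMCO routines.
program emis_loop_demo
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n)
  integer :: i

  a = 0.0d0

  ! Variant 1: per-iteration subroutine call inside the parallel loop
  ! (the form that was observed to run as if serial).
  !$OMP PARALLEL DO DEFAULT( SHARED ) PRIVATE( i )
  do i = 1, n
     call add_one( a(i) )
  end do
  !$OMP END PARALLEL DO

  ! Variant 2: the subroutine body copy-pasted into the loop
  ! (the workaround that restored parallel speed).
  !$OMP PARALLEL DO DEFAULT( SHARED ) PRIVATE( i )
  do i = 1, n
     a(i) = a(i) + 1.0d0
  end do
  !$OMP END PARALLEL DO

  print *, 'a(1) =', a(1), ' (expect 2.0)'

contains

  subroutine add_one( x )
    real(8), intent(inout) :: x
    x = x + 1.0d0
  end subroutine add_one

end program emis_loop_demo
```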

Action items

  • @WilliamDowns will continue to run timing tests to narrow down the specific calls and containers that are slow
  • @lizziel will see if the issue can be reproduced with the HEMCO standalone
  • @LiamBindle will see if building HEMCO in GCHPctm without OpenMP improves anything
  • @jimmielin is going to try some different gfortran compiler options in GC-Classic

We will continue discussing in this thread.

@LiamBindle added the category: Bug label on Oct 16, 2020
@WilliamDowns commented Oct 16, 2020

Moving the entire code of get_current_emissions into the body of hco_calcemis has not resulted in a speed-up (my slowdown continues to occur somewhere in that function's work). I will continue timing tests to pin down where in there it is occurring.

@LiamBindle (Contributor, Author) commented Oct 16, 2020

As @jimmielin suggested, it definitely appears to be related to OpenMP.

Here are my timing test results for GCHP alpha.10 with OpenMP enabled and disabled. These were 2-week timing tests at C48 with 192 cores and gfortran 8.3.

| OpenMP   | Model throughput |
|----------|------------------|
| Enabled  | 113 days/day     |
| Disabled | 163 days/day     |

Here is the same figure I showed above, now including this new timing test with OpenMP disabled, which is marked by the blue X. Note that the blue X is the same timing test as the solid red point below it.

[Figure: GCHP throughput vs. number of cores, with the OpenMP-disabled test marked by a blue X]

There is a 40-90% increase in speed if OpenMP is disabled. I think this confirms it's related to OpenMP!
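As a side note, a quick way to confirm which build you are actually running is the standard _OPENMP preprocessor macro. The snippet below is a generic sketch, not part of GCHP, and assumes the file is compiled with preprocessing enabled (e.g. a .F90 extension or gfortran -cpp).

```fortran
! Generic sketch for verifying whether OpenMP was compiled in.
! Compile with preprocessing enabled, e.g.:
!   gfortran -cpp -fopenmp check_omp.F90   (OpenMP build)
!   gfortran -cpp          check_omp.F90   (no-OpenMP build)
program check_omp
#ifdef _OPENMP
  use omp_lib
#endif
  implicit none
#ifdef _OPENMP
  print *, 'OpenMP enabled, max threads =', omp_get_max_threads()
#else
  print *, 'OpenMP not compiled in'
#endif
end program check_omp
```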

@lizziel (Contributor) commented Oct 16, 2020

Interesting. I'll do my HEMCO standalone tests for both OpenMP enabled and disabled.

@lizziel (Contributor) commented Oct 16, 2020

I am seeing significant differences in run time in the HEMCO standalone due to OpenMP. The differences across compilers are negligible in comparison. For all compilers, disabling OpenMP reduces the run time.

Wall times for a 1-month run:

| Compiler     | OMP=y (24 threads) | OMP=n    |
|--------------|--------------------|----------|
| gfortran 8.3 | 00:19:37           | 00:15:04 |
| gfortran 9.3 | 00:21:32           | 00:14:59 |
| ifort 18     | 00:21:15           | 00:17:10 |
| ifort 19     | 00:21:41           | 00:16:10 |

All tests used HEMCO branch dev/GC_13.0.0. For the run directory I used GC-Classic fullchem HEMCO_Config.rc and HEMCO_Diagn.rc from GEOS-Chem branch dev/13.0.0.

@WilliamDowns:
For a 6-hour simulation in GCHPctm with 30 cores on 1 node, using gfortran 9.3 and Spack-built OpenMPI 4.0.5, disabling OpenMP yields a total run time of 14m14s. With OpenMP enabled, the same run has not finished after 1h40m (it is 3h40m into the run). Based on cursory glances at the pace of the run outside of the very slow emissions timesteps, my issue may not be unique to HEMCO. I'll have to rerun for more details, as Cannon is about to shut down.

@yantosca (Contributor) commented Mar 2, 2021

Has there been any more work on this issue? I am wondering if a combination of the COLLAPSE() and SCHEDULE( DYNAMIC, n ) clauses on the !$OMP parallel loops might help here. I can try to look into this, as I have been doing other profiling tests.
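For reference, here is a minimal sketch of what those clauses look like on a doubly nested (I,J) loop. The array name, grid dimensions, and chunk size are arbitrary placeholders, not HEMCO code.

```fortran
! Sketch of the COLLAPSE and SCHEDULE( DYNAMIC, n ) clauses on a nested
! (I,J) loop.  The loop body and array are placeholders only.
program collapse_dynamic_demo
  implicit none
  integer, parameter :: nx = 144, ny = 91
  real(8) :: flux(nx, ny)
  integer :: i, j

  !$OMP PARALLEL DO            &
  !$OMP DEFAULT( SHARED )      &
  !$OMP PRIVATE( i, j )        &
  !$OMP COLLAPSE( 2 )          &
  !$OMP SCHEDULE( DYNAMIC, 4 )
  do j = 1, ny
     do i = 1, nx
        ! Placeholder for per-cell emissions work
        flux(i,j) = real( i + j, 8 )
     end do
  end do
  !$OMP END PARALLEL DO

  print *, 'sum of flux =', sum( flux )
end program collapse_dynamic_demo
```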

@yantosca (Contributor) commented Mar 3, 2021

I have been looking into this. It seems that testing whether the pointers LevDct1 and LevDct2 are associated within routine GetVertIndx can carry a big performance hit. The pointers are associated in GET_CURRENT_EMISSIONS, so they can be tested there, and logical arguments can be passed to GetVertIndx indicating whether the pointers are null or not.

Also, a similar check on whether the PBLHEIGHT and BXHEIGHT fields are in the HcoState can be moved out of routine GetIdx, which is called for each (I,J). I am not sure if this breaks anything, though.

I will do more testing/profiling. At first glance, the COLLAPSE does not bring as big a benefit as I thought, but cleaning up these tests of whether pointers are associated seems to be the way to get more performance here.
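Here is a minimal sketch of that refactoring. The routine names GetVertIndx and GET_CURRENT_EMISSIONS come from the description above, but the argument lists, types, and loop body are illustrative stand-ins rather than HEMCO's actual interfaces.

```fortran
module vert_indx_sketch
  implicit none
contains
  ! The ASSOCIATED() tests have been replaced by two cheap logical
  ! arguments that the caller evaluates once, outside the (I,J) loop.
  subroutine GetVertIndx( isLevDct1, isLevDct2, k )
    logical, intent(in)  :: isLevDct1, isLevDct2
    integer, intent(out) :: k
    if ( isLevDct1 .and. isLevDct2 ) then
       k = 2
    else
       k = 1
    end if
  end subroutine GetVertIndx
end module vert_indx_sketch

program get_current_emissions_sketch
  use vert_indx_sketch
  implicit none
  integer, parameter :: nx = 10, ny = 10
  real(8), pointer   :: LevDct1(:) => null()
  real(8), pointer   :: LevDct2(:) => null()
  logical            :: isLevDct1, isLevDct2
  integer            :: i, j
  integer            :: kidx(nx,ny)

  allocate( LevDct1(5) )    ! LevDct2 is left unassociated on purpose

  ! Hoisted: evaluated once per call, not once per grid cell
  isLevDct1 = associated( LevDct1 )
  isLevDct2 = associated( LevDct2 )

  !$OMP PARALLEL DO DEFAULT( SHARED ) PRIVATE( i, j )
  do j = 1, ny
     do i = 1, nx
        call GetVertIndx( isLevDct1, isLevDct2, kidx(i,j) )
     end do
  end do
  !$OMP END PARALLEL DO

  print *, 'kidx(1,1) =', kidx(1,1)
  deallocate( LevDct1 )
end program get_current_emissions_sketch
```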

stale bot commented Apr 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

The stale bot added the stale label on Apr 2, 2021. @jimmielin added the never stale label and removed the stale label on Apr 3, 2021.
@lizziel (Contributor) commented Apr 9, 2021

From comment above: "Also a similar check on whether the PBLHEIGHT and BXHEIGHT fields are in the HcoState can be moved out of routine GetIdx, which is called on each (I,J). I am not sure if this breaks anything though."

It turns out this does break things in the new HEMCO intermediate grid update I am bringing in (GEOS-Chem PR 681). Initialization of State_Met%LandTypeFrac previously used HCO_GetPtr. With the update it instead uses HCO_GC_EvalFld, a new wrapper around HCO_EvalFld. With this method the data is retrieved via Get_Current_Emissions where the new checks are outside of the IJ loop.

The problem with this is that during initialization Emissions_Run has only been called with Phase=0. Phase 0 skips the call to GridEdge_Set, and by extension HCO_SetPBLm, resulting in HcoState%Grid%PBLHEIGHT%Val not yet being associated. I don't think the code that uses PBLHEIGHT would ever be reached for the 2D Olson land map, but the new code still checks that it is associated.

I'm considering two different possible fixes:
(1) Wrap the BXHEIGHT%Val and PBLHEIGHT%Val associated checks in Get_Current_Emissions so they are only done if isLevDct1 or isLevDct2 is true (a sketch of this guard is shown below). For this solution I'm not sure if the error handling would still cover all cases.
(2) Change the Olson data retrieval back to getting a pointer (@jimmielin, could you comment on why you made this change?)
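A sketch of option (1), using stand-in derived types: the field names (BXHEIGHT, PBLHEIGHT, %Val) and the isLevDct1/isLevDct2 flags follow the discussion above, but the surrounding structure is illustrative only, not HEMCO's actual HcoState.

```fortran
! Sketch of option (1): only test that BXHEIGHT%Val / PBLHEIGHT%Val are
! associated when a vertical level descriptor is actually in use.  The
! derived type here is a stand-in, not HEMCO's HcoState.
program guarded_assoc_check
  implicit none

  type :: FieldSketch
     real(8), pointer :: Val(:,:) => null()
  end type FieldSketch

  type(FieldSketch) :: BXHEIGHT, PBLHEIGHT
  logical :: isLevDct1, isLevDct2, ok

  ! During Phase 0 initialization PBLHEIGHT%Val has not been set yet,
  ! and no level descriptor is in use for a 2D field like the land map.
  isLevDct1 = .false.
  isLevDct2 = .false.

  ok = .true.
  if ( isLevDct1 .or. isLevDct2 ) then
     ! Only when the vertical dimension is needed is a missing
     ! met field an error.
     if ( .not. associated( BXHEIGHT%Val  ) ) ok = .false.
     if ( .not. associated( PBLHEIGHT%Val ) ) ok = .false.
  end if

  if ( ok ) then
     print *, 'checks passed (or not needed)'
  else
     print *, 'ERROR: required met field not associated'
  end if
end program guarded_assoc_check
```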

@jimmielin (Collaborator):
Hi Lizzie - I think we can just use HCO_GC_GetPtr instead. I changed it to EvalFld because we were planning to change the GetPtrs to EvalFlds, but it would be unphysical to scale land type fractions anyway. One thing to note with the new intermediate grid updates is that the data has to be copied out of the pointer.

I think simply replacing

CALL HCO_GC_EvalFld( Input_Opt, State_Grid, Name, State_Met%LandTypeFrac(:,:,T), RC, FOUND=FND )

with a call to CALL HCO_GC_GetPtr( ... Ptr2D ... ) and then copying the data, e.g. State_Met%LandTypeFrac(:,:,T) = Ptr2D(:,:), should resolve this issue.
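A minimal sketch of that get-pointer-then-copy pattern follows. Because the full HCO_GC_GetPtr argument list is not shown above, the routine here (get_ptr_2d) is a hypothetical stand-in; only the final copy line mirrors the suggested State_Met%LandTypeFrac(:,:,T) = Ptr2D(:,:) assignment.

```fortran
! Hypothetical stand-in for the get-pointer-then-copy pattern; not the
! real HCO_GC_GetPtr interface.  Only the LandTypeFrac copy line mirrors
! the suggestion above.
module get_ptr_sketch
  implicit none
  real(8), allocatable, target :: stored_field(:,:)   ! data owned elsewhere
contains
  subroutine get_ptr_2d( name, Ptr2D, RC )
    character(len=*), intent(in)  :: name
    real(8), pointer, intent(out) :: Ptr2D(:,:)
    integer,          intent(out) :: RC
    ! Point at the stored field (a stand-in for HEMCO's data container)
    Ptr2D => stored_field
    RC    = 0
  end subroutine get_ptr_2d
end module get_ptr_sketch

program landtypefrac_copy_sketch
  use get_ptr_sketch
  implicit none
  integer, parameter :: nx = 4, ny = 3, T = 1
  real(8), allocatable :: LandTypeFrac(:,:,:)   ! stand-in for State_Met%LandTypeFrac
  real(8), pointer     :: Ptr2D(:,:)
  integer :: RC

  allocate( stored_field(nx,ny) )
  stored_field = 0.25d0
  allocate( LandTypeFrac(nx,ny,T) )

  ! Get a pointer to the field, then copy the data out of the pointer
  call get_ptr_2d( 'LANDTYPE01', Ptr2D, RC )
  LandTypeFrac(:,:,T) = Ptr2D(:,:)

  print *, 'LandTypeFrac(1,1,T) =', LandTypeFrac(1,1,T)
end program landtypefrac_copy_sketch
```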

@lizziel (Contributor) commented Apr 10, 2021

Okay, I will put it back to GetPtr then. I previously kept it as GetPtr rather than changing it to EvalFld because, as you say, it doesn't make sense to scale/mask the land mask.

I'm still wondering if we could move those PBLHEIGHT checks elsewhere. It should only need to be checked once during a run.

@lizziel (Contributor) commented Aug 3, 2021

Does anyone know if this OpenMP problem in HEMCO is still an issue in GEOS-Chem 13? @jimmielin, @WilliamDowns, @LiamBindle

@LiamBindle (Contributor, Author):
Sorry, I don't. I'm not aware of any updates on it though so I suspect it still exists.

@yantosca added the topic: Performance label on Mar 23, 2022
@bbakernoaa:
Has there been any update on this? I'm noticing, when running MEGAN offline, that it doesn't scale at all with the number of CPUs.
