Added GPU-CA On time tracking & idle time reaction #2841

lhlawson · 2023-02-08T16:56:13Z

Signed-off-by: lhlawson lowren.h.lawson@intel.com

Relates to the story "As a user of GPU hardware I would like an agent to target energy savings during periods of low utilization" (#2839) from GitHub issues.
Fixes the change request "Update GPU-CA to provide long idle handling" (#2840) from GitHub issues.

Provides a report entry for tracking the GPU energy usage when the GPU is above a utilization threshold.
Provides an algorithm update to target long idle periods.

dannosliwcd · 2023-03-22T21:43:25Z

src/GPUActivityAgent.cpp

+                    }
+                }
+                else {
+                    m_gpu_idle_timer.at(domain_idx) = 10;


The width of the idle filter should probably be calculated, or at least #defined near the agent's wait time, in case the wait time becomes variable in the future.

That comment assumes this is supposed to approximate a time-based filter. If it is actually meant to be 10 samples regardless of time, then also add a comment explaining why.

Updated to a define of 10 samples regardless of time requested, with a comment explaining why.

dannosliwcd · 2023-09-22T18:29:11Z

src/GPUActivityAgent.cpp

@@ -138,6 +150,8 @@ namespace geopm
            m_gpu_freq_max_control.push_back(m_control{m_platform_io.push_control("GPU_CORE_FREQUENCY_MAX_CONTROL",
                                                       m_agent_domain,
                                                       domain_idx), NAN});
+            m_gpu_idle_timer.push_back(10);


Suggested change

m_gpu_idle_timer.push_back(10);

m_gpu_idle_timer.push_back(IDLE_SAMPLE_COUNT);

dannosliwcd · 2023-09-22T18:35:19Z

src/GPUActivityAgent.cpp

+// IDLE SAMPLE COUNT of 10 is based upon a study of the idle behavior of CORAL-2
+// workloads of interest assuming the default 20ms sample rate (200ms idle).
+// We could use 200ms as the default for the agent, but this does not provide a
+// mechanism for user control of the idle period.  Using a count provides partial
+// user control in that the idleness period is defined by the requested agent
+// control loop time.


This makes it sound like the goal is actually time-based rather than sample-based. Should IDLE_SAMPLE_COUNT be computed from environment().period(M_WAIT_SEC) and a 200ms target instead?

The ideal solution would be to provide a control for both the control loop time (via the waiter) and the idle time before the agent takes action. The agent is not currently built with this in mind, so making the idle time a function of the user input we do have seemed preferable, because a workload could have the behavior
... N milliseconds active, 200 ms idle, N milliseconds active ...
We can either:

1. Make it sample based, giving the user some control
2. Make it time based, and risk the pathological case described above
3. Add the idle time/sample count to the policy

Of these I like either 1 or 2

Why a #define instead of a const like M_CPU_ACTIVITY_CUTOFF and set in initializer list?

If we expect this is something the user would modify, it should be a policy parameter.
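For comparison, here is a minimal sketch of the time-based option (2) raised in this thread: deriving the sample count from a target idle window and the control loop period. The function name `idle_sample_count` and its millisecond parameters are hypothetical and do not come from the PR; the merged change keeps a fixed count instead.

```cpp
#include <cstdint>

// Hypothetical helper, not part of GEOPM: convert a target idle window into
// a whole number of control-loop samples. Integer milliseconds are used so
// the ceiling division is exact. Invalid inputs fall back to one sample.
int idle_sample_count(std::int64_t target_idle_ms, std::int64_t wait_ms)
{
    if (target_idle_ms <= 0 || wait_ms <= 0) {
        return 1;
    }
    // Round up so the idle window is never shorter than the target.
    return static_cast<int>((target_idle_ms + wait_ms - 1) / wait_ms);
}
```

With the 200 ms target and the default 20 ms period mentioned in the code comment, this yields the same 10 samples as the fixed define. With a longer period the count shrinks, which is the behavior change the sample-based option (1) avoids.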


lhlawson · 2023-09-28T18:33:36Z

Prior to merge, the GPU-CA should be re-run with all review fixes on the target hardware with one workload of interest.

bgeltz

Integration test(s)?

bgeltz · 2023-09-28T20:38:51Z

service/docs/source/geopm_agent_gpu_activity.7.rst

+If a ``phi`` value of 0.5 or greater is used and a long idle period, defined as
+10 samples with a ``GPU_UTILIZATION`` of 0, occurs the agent will request the
+minimum frequency for GPU as specified by the ``GPU_CORE_FREQUENCY_MIN_AVAIL``
+signal.
+


I feel like this paragraph belongs up with the other discussion of the phi policy parameter (one paragraph up).

bgeltz · 2023-09-28T20:39:35Z

src/GPUActivityAgent.cpp

@@ -23,6 +23,14 @@
 #include "Waiter.hpp"
 #include "Environment.hpp"

+// IDLE SAMPLE COUNT of 10 is based upon a study of the idle behavior of CORAL-2


Suggested change

// IDLE SAMPLE COUNT of 10 is based upon a study of the idle behavior of CORAL-2

// IDLE_SAMPLE_COUNT of 10 is based upon a study of the idle behavior of CORAL-2

bgeltz · 2023-09-28T20:56:58Z

src/GPUActivityAgent.cpp

+                if (!std::isnan(gpu_utilization) &&
+                    gpu_utilization == 0) {
+                    if (m_gpu_idle_timer.at(domain_idx) > 0) {
+                        m_gpu_idle_timer.at(domain_idx) = m_gpu_idle_timer.at(domain_idx) - 1;


Suggested change

m_gpu_idle_timer.at(domain_idx) = m_gpu_idle_timer.at(domain_idx) - 1;

m_gpu_idle_timer.at(domain_idx) -= 1;

bgeltz · 2023-09-28T21:00:24Z

src/GPUActivityAgent.cpp

+                    m_gpu_idle_timer.at(domain_idx) = IDLE_SAMPLE_COUNT;
+                }
+
+                if (m_gpu_idle_timer.at(domain_idx) <= 0) {


Suggested change

if (m_gpu_idle_timer.at(domain_idx) <= 0) {

// If no activity has been observed for IDLE_SAMPLE_COUNT samples,

// we assume it is safe to reduce the frequency to a minimum value.

if (m_gpu_idle_timer.at(domain_idx) <= 0) {

bgeltz · 2023-09-28T21:01:04Z

src/GPUActivityAgent.cpp

+                    // If no activity has been observed for a number of samples
+                    // IDLE_SAMPLE_COUNT we assume it is safe to reduce the frequency
+                    // to a minimum value.


Suggested change

// If no activity has been observed for a number of samples

// IDLE_SAMPLE_COUNT we assume it is safe to reduce the frequency

// to a minimum value.

// If activity is observed or is NAN, reset the IDLE_SAMPLE_COUNT tracking.

bgeltz · 2023-09-28T21:06:10Z

src/GPUActivityAgent.cpp

+                }
+
+                if (m_gpu_idle_timer.at(domain_idx) <= 0) {
+                    f_request = m_freq_gpu_min;


I think all of the code in this method that has anything to do with calculating f_request could be refactored into its own helper method. I started thinking about this because I'm seeing repeated code already:

if (!std::isnan(gpu_utilization) && gpu_utilization == 0) {

I think this new code can be combined into the logic that starts on line 318 (new code), but further I think all of that can have an extract method refactor done on it to clean up adjust_platform() considerably, i.e. if the idle tracking time has expired and we're about to set the freq to min, we don't need to do any of the other frequency calculation above.
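One possible shape for that extract-method refactor is sketched below. It is an illustration only: `IdleTracker`, `request_frequency`, and the callable parameter are hypothetical names, not code from GPUActivityAgent.cpp, and the sketch leaves out the ``phi`` condition described in the documentation. The countdown itself mirrors the logic in the diff: zero utilization decrements the timer, while activity or NAN resets it.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for the agent's per-domain idle state.
struct IdleTracker {
    int sample_count;        // consecutive idle samples required
    std::vector<int> timer;  // per-domain countdown, starts at sample_count

    IdleTracker(int num_domain, int count)
        : sample_count(count), timer(num_domain, count) {}

    // Returns true once the domain has reported zero utilization for
    // sample_count consecutive samples. Activity or NAN resets the countdown.
    bool update(int domain_idx, double gpu_utilization)
    {
        int &remaining = timer.at(domain_idx);
        if (!std::isnan(gpu_utilization) && gpu_utilization == 0) {
            if (remaining > 0) {
                remaining -= 1;
            }
        }
        else {
            remaining = sample_count;
        }
        return remaining <= 0;
    }
};

// Hypothetical helper: when the idle countdown has expired, return the
// minimum frequency and skip the regular frequency calculation entirely.
template <typename ComputeFn>
double request_frequency(IdleTracker &tracker, int domain_idx,
                         double gpu_utilization, double freq_min,
                         ComputeFn compute_active_freq)
{
    if (tracker.update(domain_idx, gpu_utilization)) {
        return freq_min;
    }
    return compute_active_freq();
}
```

Passing the regular calculation as a callable keeps the utilization check in one place and makes the early exit explicit, which is the simplification suggested for adjust_platform().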

bgeltz · 2023-09-28T21:08:54Z

src/GPUActivityAgent.cpp

+// IDLE SAMPLE COUNT of 10 is based upon a study of the idle behavior of CORAL-2
+// workloads of interest assuming the default 20ms sample rate (200ms idle).
+// We could use 200ms as the default for the agent, but this does not provide a
+// mechanism for user control of the idle period.  Using a count provides partial
+// user control in that the idleness period is defined by the requested agent
+// control loop time.


Why a #define instead of a const like M_CPU_ACTIVITY_CUTOFF and set in initializer list?

If we expect this is something the user would modify, it should be a policy parameter.

bgeltz · 2023-09-28T21:29:10Z

src/GPUActivityAgent.cpp

+                    // TODO: handle roll-over more gracefully than dropping a sample
+                    if (m_gpu_energy.at(domain_idx).value > m_prev_gpu_energy.at(domain_idx)) {
+                        m_gpu_on_energy.at(domain_idx) += m_gpu_energy.at(domain_idx).value - m_prev_gpu_energy.at(domain_idx);
+                    }


I'm pretty surprised that this is necessary here and not handled by the IOGroup.
AFAICT, the GPU_ENERGY signal and the CPU_ENERGY signal both have this same problem. The agents should not have to track this, and the IOGroups should be updated to deal with this.

The code in this agent could then be simplified, i.e. you don't need m_prev_gpu_energy at all if this was handled by the IOGroup.

I agree this likely exists for CPU_ENERGY as well, but I have confirmed that the NVML and L0 IOGroups report the monotonic energy provided by the API (translated to SI units).
We could add a "GPU_ENERGY_DELTA" signal (final naming TBD) using the geopm DifferenceSignal class that provides the difference in the last sample and current sample and handles the rollover.
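As a rough illustration of what rollover handling in such a delta could look like, the sketch below computes the difference between two raw counter readings. It is not GEOPM's DifferenceSignal: the function name `counter_delta` and the `wrap_value` parameter are hypothetical, the counter is assumed to take values in [0, wrap_value), and at most one wrap is assumed between samples.

```cpp
#include <cstdint>

// Hypothetical helper: difference between two readings of a counter that
// wraps back to zero after reaching wrap_value - 1.
// Assumes no more than one wrap occurred between prev and curr.
std::uint64_t counter_delta(std::uint64_t prev, std::uint64_t curr,
                            std::uint64_t wrap_value)
{
    if (curr >= prev) {
        return curr - prev;
    }
    // The counter wrapped: count up to the wrap point, then from zero.
    return (wrap_value - prev) + curr;
}
```

With a helper like this behind the signal, the agent could add each delta to its running total without keeping a previous sample or dropping the sample that spans a rollover.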

bgeltz · 2023-09-28T21:53:22Z

test/GPUActivityAgentTest.cpp

+        EXPECT_EQ(expected_header.at(i).first, report_header.at(i).first);
+        if (expected_header.at(i).first != "Agent Domain") {
+            EXPECT_EQ(std::stod(expected_header.at(i).second), std::stod(report_header.at(i).second));
+        };


Suggested change

};

}

bgeltz · 2023-09-28T21:55:04Z

test/GPUActivityAgentTest.cpp

+        EXPECT_EQ(expected_header.at(i).first, report_header.at(i).first);
+        if (expected_header.at(i).first != "Agent Domain") {
+            EXPECT_EQ(std::stod(expected_header.at(i).second), std::stod(report_header.at(i).second));
+        };


Suggested change

};

}

Also have a look at this, it may simplify this check:
https://stackoverflow.com/a/12340578


lhlawson requested a review from avilcheslopez February 8, 2023 16:56

lhlawson force-pushed the public-lhlawson-gpu-ca-idle branch from 80f6f4e to 387fc38 Compare February 10, 2023 02:47

lhlawson requested review from bgeltz and dannosliwcd March 20, 2023 20:29

dannosliwcd reviewed Mar 22, 2023


lhlawson added the 3.0 label Jul 28, 2023

lhlawson force-pushed the public-lhlawson-gpu-ca-idle branch from a551257 to bae474a Compare September 11, 2023 22:15

lhlawson force-pushed the public-lhlawson-gpu-ca-idle branch from e0f8255 to 621fc24 Compare September 21, 2023 20:59

lhlawson marked this pull request as ready for review September 21, 2023 23:31

dannosliwcd reviewed Sep 22, 2023


lhlawson force-pushed the public-lhlawson-gpu-ca-idle branch from 156bdf8 to b821e4d Compare September 27, 2023 15:14

lhlawson and others added 10 commits September 28, 2023 10:14

Added GPU-CA On time tracking & idle time reaction

0ddc775

Signed-off-by: lhlawson <lowren.h.lawson@intel.com>

Removed comment

7b65d7a

Co-authored-by: Daniel Wilson <daniel.wilsonboy@gmail.com>

Updated GPU CA tracking logic to handle startup time & prevent negati…

98535b3

…ve time Signed-off-by: lhlawson <lowren.h.lawson@intel.com>

Handle case where GPU-CA tracking doesn't see end of region during run

7194401

Signed-off-by: lhlawson <lowren.h.lawson@intel.com>

Fixed incorrect roll-over logic check in GPU-CA tracking

b623e6e

Signed-off-by: lhlawson <lowren.h.lawson@intel.com>

Added define for IDLE_SAMPLE_COUNT

19aa5f5

Signed-off-by: lhlawson <lowren.h.lawson@intel.com>

Added documentation of GPU-CA idle feature

c97443f

Signed-off-by: lhlawson <lowren.h.lawson@intel.com>

Added test for GPU-CA long idle time

b6d94ca

Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>

Added tests for GPU-CA headers and additional corner case handling

297120e

Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>

Addressed review feedback: swap hardcoded value with define

13be118

Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>

lhlawson force-pushed the public-lhlawson-gpu-ca-idle branch from b821e4d to 70c219e Compare September 28, 2023 17:27

Address Review Feedback: Added GPU-CA Header entries around idle time…

6509325

… requirements Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>

lhlawson force-pushed the public-lhlawson-gpu-ca-idle branch from 70c219e to 6509325 Compare September 28, 2023 18:14

bgeltz requested changes Sep 28, 2023


This was referenced Oct 3, 2023

Invalid 'on time' and 'active time' values for the GPU-CA report header #3172

Closed

Removed initial region tracking logic for GPU-CA #3173

Merged

Addressed review feedback: misc cleanup and unit test updates for Con…

0ddbf83

…tainerEq Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>

lhlawson added 2 commits October 4, 2023 13:28

Updated GPU-CA doc ordering for discussion of phi

b4f2b84

Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>

Refactor of adjust platform, update to tests

7e160c5

Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>

cmcantalupo force-pushed the dev branch 2 times, most recently from 2cd441d to 0859ce3 Compare October 19, 2023 00:41

cmcantalupo force-pushed the dev branch from 54d8906 to 3a36fcd Compare October 26, 2023 02:54
