-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added GPU-CA On time tracking & idle time reaction #2841
base: dev
Are you sure you want to change the base?
Conversation
80f6f4e
to
387fc38
Compare
src/GPUActivityAgent.cpp
Outdated
} | ||
} | ||
else { | ||
m_gpu_idle_timer.at(domain_idx) = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The width of the idle filter should probably be calculated or at least #defined close to the agent's wait time in case wait time becomes variable in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That comment assumes this is supposed to approximate a time-based filter. If it is actually meant to be 10 samples regardless of time, then also add a comment explaining why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to a define of 10 samples regardless of time requested, with a comment explaining why.
a551257
to
bae474a
Compare
e0f8255
to
621fc24
Compare
src/GPUActivityAgent.cpp
Outdated
@@ -138,6 +150,8 @@ namespace geopm | |||
m_gpu_freq_max_control.push_back(m_control{m_platform_io.push_control("GPU_CORE_FREQUENCY_MAX_CONTROL", | |||
m_agent_domain, | |||
domain_idx), NAN}); | |||
m_gpu_idle_timer.push_back(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m_gpu_idle_timer.push_back(10); | |
m_gpu_idle_timer.push_back(IDLE_SAMPLE_COUNT); |
src/GPUActivityAgent.cpp
Outdated
// IDLE SAMPLE COUNT of 10 is based upon a study of the idle behavior of CORAL-2 | ||
// workloads of interest assuming the default 20ms sample rate (200ms idle). | ||
// We could use 200ms as the default for the agent, but this does not provide a | ||
// mechanism for user control of the idle period. Using a count provides partial | ||
// user control in that the idleness period is defined by the requested agent | ||
// control loop time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it sound like the goal is actually time-based rather than sample-based. Should IDLE_SAMPLE_COUNT
be computed from environment().period(M_WAIT_SEC)
and a 200ms
target instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ideal solution would be to provide a control for both the control loop time (via waiter) and the idle time until the agent takes action. We don't currently have the agent built with this in mind, so having the idle time be a function of the user input we do have seemed preferable as there could be a workload that has the behavior
... N milliseconds active, 200 ms idle, N milliseconds active ...
We can either:
- Make it sample based, giving the user some control
- Make it time based, and risk a pathological case described above
- Add the idle time/sample count to the policy
Of these I like either 1 or 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a #define
instead of a const like M_CPU_ACTIVITY_CUTOFF and set in initializer list?
If we expect this is something the user would modify, it should be a policy parameter.
156bdf8
to
b821e4d
Compare
Signed-off-by: lhlawson <lowren.h.lawson@intel.com>
Co-authored-by: Daniel Wilson <daniel.wilsonboy@gmail.com>
…ve time Signed-off-by: lhlawson <lowren.h.lawson@intel.com>
Signed-off-by: lhlawson <lowren.h.lawson@intel.com>
Signed-off-by: lhlawson <lowren.h.lawson@intel.com>
Signed-off-by: lhlawson <lowren.h.lawson@intel.com>
Signed-off-by: lhlawson <lowren.h.lawson@intel.com>
Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>
Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>
Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>
b821e4d
to
70c219e
Compare
… requirements Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>
70c219e
to
6509325
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integration test(s)?
If a ``phi`` value of 0.5 or greater is used and a long idle period, defined as | ||
10 samples with a ``GPU_UTILIZATION`` of 0, occurs the agent will request the | ||
minimum frequency for GPU as specified by the ``GPU_CORE_FREQUENCY_MIN_AVAIL`` | ||
signal. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this paragraph belongs up with the other discussion of the phi policy parameter (one paragraph up).
src/GPUActivityAgent.cpp
Outdated
@@ -23,6 +23,14 @@ | |||
#include "Waiter.hpp" | |||
#include "Environment.hpp" | |||
|
|||
// IDLE SAMPLE COUNT of 10 is based upon a study of the idle behavior of CORAL-2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// IDLE SAMPLE COUNT of 10 is based upon a study of the idle behavior of CORAL-2 | |
// IDLE_SAMPLE_COUNT of 10 is based upon a study of the idle behavior of CORAL-2 |
src/GPUActivityAgent.cpp
Outdated
if (!std::isnan(gpu_utilization) && | ||
gpu_utilization == 0) { | ||
if (m_gpu_idle_timer.at(domain_idx) > 0) { | ||
m_gpu_idle_timer.at(domain_idx) = m_gpu_idle_timer.at(domain_idx) - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m_gpu_idle_timer.at(domain_idx) = m_gpu_idle_timer.at(domain_idx) - 1; | |
m_gpu_idle_timer.at(domain_idx) -= 1; |
src/GPUActivityAgent.cpp
Outdated
m_gpu_idle_timer.at(domain_idx) = IDLE_SAMPLE_COUNT; | ||
} | ||
|
||
if (m_gpu_idle_timer.at(domain_idx) <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (m_gpu_idle_timer.at(domain_idx) <= 0) { | |
// If no activity has been observed for IDLE_SAMPLE_COUNT samples, | |
// we assume it is safe to reduce the frequency to a minimum value. | |
if (m_gpu_idle_timer.at(domain_idx) <= 0) { |
src/GPUActivityAgent.cpp
Outdated
// If no activity has been observed for a number of samples | ||
// IDLE_SAMPLE_COUNT we assume it is safe to reduce the frequency | ||
// to a minimum value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If no activity has been observed for a number of samples | |
// IDLE_SAMPLE_COUNT we assume it is safe to reduce the frequency | |
// to a minimum value. | |
// If activity is observed or is NAN, reset the IDLE_SAMPLE_COUNT tracking. |
src/GPUActivityAgent.cpp
Outdated
} | ||
|
||
if (m_gpu_idle_timer.at(domain_idx) <= 0) { | ||
f_request = m_freq_gpu_min; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all of the code in this method that has anything to do with calculating f_request could be refactored into it's own helper method. I started thinking about this because I'm seeing repeated code already:
if (!std::isnan(gpu_utilization) &&
gpu_utilization == 0) {
I think this new code can be combined into the logic that starts on line 318 (new code), but further I think all of that can have an extract method refactor done on it to clean up adjust_platform()
considerably, i.e. if the idle tracking time has expired and we're about to set the freq to min, we don't need to do any of the other frequency calculation above.
src/GPUActivityAgent.cpp
Outdated
// IDLE SAMPLE COUNT of 10 is based upon a study of the idle behavior of CORAL-2 | ||
// workloads of interest assuming the default 20ms sample rate (200ms idle). | ||
// We could use 200ms as the default for the agent, but this does not provide a | ||
// mechanism for user control of the idle period. Using a count provides partial | ||
// user control in that the idleness period is defined by the requested agent | ||
// control loop time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a #define
instead of a const like M_CPU_ACTIVITY_CUTOFF and set in initializer list?
If we expect this is something the user would modify, it should be a policy parameter.
// TODO: handle roll-over more gracefully than dropping a sample | ||
if (m_gpu_energy.at(domain_idx).value > m_prev_gpu_energy.at(domain_idx)) { | ||
m_gpu_on_energy.at(domain_idx) += m_gpu_energy.at(domain_idx).value - m_prev_gpu_energy.at(domain_idx); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty surprised that this is necessary here and not handled by the IOGroup.
AFAICT, the GPU_ENERGY signal and the CPU_ENERGY signal both have this same problem. The agents should not have to track this, and the IOGroups should be updated to deal with this.
The code in this agent could then be simplified, i.e. you don't need m_prev_gpu_energy at all if this was handled by the IOGroup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this likely exists for CPU_ENERGY as well, but I have confirmed that the NVML and L0 IOGroups report the monotonic energy provided by the API (translated to SI units).
We could add a "GPU_ENERGY_DELTA" signal (final naming TBD) using the geopm DifferenceSignal class that provides the difference in the last sample and current sample and handles the rollover.
test/GPUActivityAgentTest.cpp
Outdated
EXPECT_EQ(expected_header.at(i).first, report_header.at(i).first); | ||
if (expected_header.at(i).first != "Agent Domain") { | ||
EXPECT_EQ(std::stod(expected_header.at(i).second), std::stod(report_header.at(i).second)); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
}; | |
} |
test/GPUActivityAgentTest.cpp
Outdated
EXPECT_EQ(expected_header.at(i).first, report_header.at(i).first); | ||
if (expected_header.at(i).first != "Agent Domain") { | ||
EXPECT_EQ(std::stod(expected_header.at(i).second), std::stod(report_header.at(i).second)); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
}; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also have a look at this, it may simplify this check:
https://stackoverflow.com/a/12340578
test/GPUActivityAgentTest.cpp
Outdated
EXPECT_EQ(expected_header.at(i).first, report_header.at(i).first); | ||
if (expected_header.at(i).first != "Agent Domain") { | ||
EXPECT_EQ(std::stod(expected_header.at(i).second), std::stod(report_header.at(i).second)); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
}; | |
} |
…tainerEq Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>
Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>
Signed-off-by: Lowren Lawson <lowren.h.lawson@intel.com>
2cd441d
to
0859ce3
Compare
Signed-off-by: lhlawson lowren.h.lawson@intel.com
Provides a report entry for tracking the GPU energy usage when the GPU is above a utilization threshold.
Provides an algorithm update to target long idle periods.