Indirect memory fix #1098

Open · anooprac wants to merge 2 commits into base: stable
Conversation

@anooprac commented May 3, 2024

I re-read the IMP paper and saw in 3.2.3 Indirect Prefetching that, along with prefetching B[i], B[i + delta] should also be prefetched. Because of this, I removed the for loop that repeatedly adds the same address. Instead, I only add (base address + (index << shift)) and (base address + ((index + distance) << shift)), which the paper states are the only two prefetches that should be done, both in the intro and in 3.2.3.

fix #1050
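A minimal sketch of the change described above, using the variable names from the prefetcher code discussed below (this is an illustration of the idea, not the exact diff; the code excerpt that follows is the current code under review):

// Sketch only: push exactly the two prefetch addresses described in the
// PR text, B[i] and B[i + distance] mapped through the table entry.
Addr pf_addr1 = pt_entry->baseAddr +
    (pt_entry->index << pt_entry->shift);
Addr pf_addr2 = pt_entry->baseAddr +
    ((pt_entry->index + distance) << pt_entry->shift);
addresses.push_back(AddrPriority(pf_addr1, 0));
addresses.push_back(AddrPriority(pf_addr2, 0));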

Addr pf_addr = pt_entry->baseAddr +
    (pt_entry->index << pt_entry->shift);
addresses.push_back(AddrPriority(pf_addr, 0));
}
Contributor
I agree that the code is wrong (here we were sending the same address to be prefetched multiple times), but I don't think the fix is correct.

From the text:
-> "A[B[i]] ) = (B[i] << shift) + BaseAddr", where "BaseAddr is the address of A[0]", and shift is the number of bits of each element in A
-> "When software accesses B[i], IMP will prefetch and read the value of B[i + ∆]. It then uses predicted values of Coeff and BaseAddr to calculate the memory address of A[B[i + ∆]] ... and prefetch that line".

What I don't understand from the text:
-> "IMP issues one or more prefetches for that pattern" - How can one issue one prefetch if the premise is that we always prefetch at least 2 (B[i + ∆] and A[B[i + ∆]])?
-> "The prefetch distance, the distance in the access stream between the current access and the one we prefetch, is initially small and linearly increased as more hits happen to the same pattern." - I could not find any sentence explaining how the distance is calculated. "initially small" is not a value. "linearly increased" is not a delta value. I will assume that the way we currently do it suffices.

Therefore, my understanding is (using variables from the code):

  • addr is i
  • pt_entry->index is the prediction of the value of B[i+∆]
  • Address used to prefetch B[i+∆]: addr + distance
  • Address used to prefetch A[B[i+∆]]: pt_entry->baseAddr + (pt_entry->index << pt_entry->shift)

BTW, there are likely many more bugs in this code, as I don't think it was tested. For example, checkAddressMatch seems to be the place that checks MissAddr1, but it does not check MissAddr2.

Author
Thank you for the clarification!

I am also still confused by the first part that you don't understand from the text. From the introduction of the paper, I think the distance might start at 0, based on the calcSaturation() function, but I haven't looked into its implementation.

I see that there are other bugs in the code, but for this specific issue of eliminating repeated additions of the same address to the queue, would it be fine to remove the for loop and add pt_entry->baseAddr + (pt_entry->index << pt_entry->shift)? I am still a little confused by that equation, though, because it never takes distance into account, yet the paper does mention distance being used to index into the B array. Is that just an unnecessary calculation then, and should we be fine just using pt_entry->index?

Also, just so my understanding is correct: the distance is used to find the correct entry in the indirect table, but from there we prefetch only pt_entry->baseAddr + (pt_entry->index << pt_entry->shift)? The paper does state multiple prefetches can be made, but it doesn't explain how.

Contributor
Here is my understanding from skimming through the paper:

  • There are two prefetchers in the paper: one for the index-array (B) addresses, and one for the main-array (A) addresses.
  • The index-array prefetcher uses a simple algorithm (possibly Stride): it uses a prefetching distance to determine the next index-array address to be fetched. I don't really understand why the distance is increased linearly, but this is likely irrelevant for the paper as a whole, and this algorithm can probably be replaced by another one. This prefetcher generates the first prefetch address. Since the algorithm can probably be replaced, and since they seem to assume that there is a degree (check gem5's implementation of the stride prefetcher), that is probably why they mention possibly having multiple prefetch addresses.
  • The main contribution of the paper is the main prefetcher: it uses those tables to predict that an access's address is within an index array, and therefore we should try to predict the address of the main array by using those pt_entry members. The main prefetcher does not use distance, as it is not part of its algorithm. This prefetcher generates the second prefetch address.

This is how I would fix this block given the conclusions in my previous comment:

Replace

for (int delta = 1; delta < distance; delta += 1) {
    Addr pf_addr = pt_entry->baseAddr + (pt_entry->index << pt_entry->shift);
    addresses.push_back(AddrPriority(pf_addr, 0));
}

By

addresses.push_back(AddrPriority(addr + distance, 0));
addresses.push_back(AddrPriority(
    pt_entry->baseAddr + (pt_entry->index << pt_entry->shift), 0));

Alternatively, if we want to take the "increase linearly" part into account, this is how I think it could play out. However, I don't think this is correct, because there is no point in prefetching multiple Bs if they don't have a corresponding A prediction, i.e., we would need to use multiple pt_entry here.

// Predict Bs
for (unsigned degree = 1; degree < distance; ++degree) {
    addresses.push_back(AddrPriority(addr + degree, 0));
}
// Predict corresponding A
addresses.push_back(AddrPriority(
    pt_entry->baseAddr + (pt_entry->index << pt_entry->shift), 0));

The ideal scenario would be to contact the original authors and have them help review and fix this code, because I am sure many changes would be needed, and we don't fully understand the paper.

(Make sure not to exceed the line character limit: 79 characters. If you haven't already, make sure to read CONTRIBUTING.md.)

Author

I am also re-reading the paper and am trying to understand what different addresses are added to the queue.

A part of 3.1 Key Idea says "That is, for some contiguous values of i, they access B[i] then A[B[i]]." I think this can mean that the enqueued addresses are A[B[i]] through A[B[i + distance]], linearly. So it seems like we would increment index by one until distance.

For a specific index i, the address to be enqueued can be calculated by ADDR( A[B[i]] ) = Coeff × B[i] + BaseAddr. Coeff is determined by shift in the gem5 code.
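To make the Coeff/shift relationship concrete: if Coeff is the element size of A in bytes and is a power of two, the multiplication can be done as a shift, which is presumably why the gem5 code stores shift instead of Coeff. A tiny illustration (not gem5 code; the 8-byte element size is just an example):

#include <cstdint>

// Toy illustration: if A has 8-byte elements, then Coeff = 8 and
// shift = 3, so
//   ADDR(A[B[i]]) = BaseAddr + Coeff * B[i] = BaseAddr + (B[i] << 3).
uint64_t
indirectAddr(uint64_t baseAddr, uint64_t bValue, unsigned shift)
{
    return baseAddr + (bValue << shift);  // == baseAddr + Coeff * bValue
}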

In terms of there being two different types of prefetching, one that is stream and one that is indirect, the code already accounts for this starting at line 92:
if (pt_entry->streamCounter >= streamCounterThreshold) {
    int64_t delta = addr - pt_entry->address;
    for (unsigned int i = 1; i <= streamingDistance; i += 1) {
        addresses.push_back(AddrPriority(addr + delta * i, 0));
    }
}
pt_entry->ad

I don't think the fix you provided would be correct, but I'm also not sure what the correct solution is. The PC is used to get the index, which is correct according to the paper, because it finds the correct entry in the prefetch table according to the diagrams. But that only gets you the index, which you use to index into the B array, going from that index to (index + distance). Distance is also calculated in this section of code, but the only part that confuses me is how to get access to the B array. Once we have that, we use index to index into B, shift it by shift, and add it to baseAddr. Then increment index until it is equal to index + distance.

Does this make sense in context of the paper? From there, how do we get B?

Contributor

> I am also re-reading the paper and am trying to understand what different addresses are added to the queue.
>
> A part of 3.1 Key Idea says "That is, for some contiguous values of i, they access B[i] then A[B[i]]." I think this can mean that the enqueued addresses are A[B[i]] through A[B[i + distance]], linearly. So it seems like we would increment index by one until distance.
>
> For a specific index i, the address to be enqueued can be calculated by ADDR( A[B[i]] ) = Coeff × B[i] + BaseAddr. Coeff is determined by shift in the gem5 code.
>
> In terms of there being two different types of prefetching, one that is stream and one that is indirect, the code already accounts for this starting at line 92: if (pt_entry->streamCounter >= streamCounterThreshold) { int64_t delta = addr - pt_entry->address; for (unsigned int i = 1; i <= streamingDistance; i += 1) { addresses.push_back(AddrPriority(addr + delta * i, 0)); } } pt_entry->ad

Indeed, the stream prefetcher is already there, so there is no need to add it again.

> I don't think the fix you provided would be correct, but I'm also not sure what the correct solution is. The PC is used to get the index, which is correct according to the paper, because it finds the correct entry in the prefetch table according to the diagrams. But that only gets you the index, which you use to index into the B array, going from that index to (index + distance). Distance is also calculated in this section of code, but the only part that confuses me is how to get access to the B array. Once we have that, we use index to index into B, shift it by shift, and add it to baseAddr. Then increment index until it is equal to index + distance.
>
> Does this make sense in context of the paper? From there, how do we get B?

"which you use to index into the B array". The "index" is not and index into B. "index" is the a prediction of what Bs value is. i.e., Bs are predicted using pt_entry->index. Therefore, the getting a B value is trivial: just use the current PC to get the corresponding pt_entry: B==pt_entry->index.

However, I couldn't understand how to generate multiple B values, since each PT entry contains a single index. My best guess is that instead of indexing the table using only the PC, we also need to use both the access address and the generated stream addresses (a rough sketch follows this list):

  • Use the current PC-addr pair to access the stream table (prefetchTable.findEntry(hash(pc))) and generate multiple stream prefetch addresses (SPAs) using a distance to generate more than one (as we currently do)
  • Use both the current PC-addr pair and each PC-SPA pair to access the prefetch table to get multiple pt_entries and generate multiple indirect prefetch addresses using the other formula (pt_entry->baseAddr + (pt_entry->index << pt_entry->shift))
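A rough sketch of that two-step lookup, assuming the table could be probed once per generated stream address (the hash(pc, spa) lookup and the PrefetchTableEntry/findEntry usage here are hypothetical, not gem5's current API):

// Hypothetical sketch only; names follow the discussion above.
std::vector<Addr> streamAddrs;
for (unsigned d = 1; d <= distance; ++d) {
    Addr spa = addr + d;  // stream prefetch address (SPA)
    streamAddrs.push_back(spa);
    addresses.push_back(AddrPriority(spa, 0));
}
// Probe the prefetch table once per PC-SPA pair and, when an entry is
// found, generate the corresponding indirect (A-array) prefetch.
for (Addr spa : streamAddrs) {
    PrefetchTableEntry *entry = prefetchTable.findEntry(hash(pc, spa));
    if (entry == nullptr)
        continue;
    Addr pf_addr = entry->baseAddr + (entry->index << entry->shift);
    addresses.push_back(AddrPriority(pf_addr, 0));
}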

@ivanaamit (Contributor)

Hi @anooprac,

There are a few things that we require all contributions to gem5 to have:

Fix the failed change-id test:

To automatically insert a change-id, run the following:

f=.git/hooks/commit-msg
mkdir -p $(dirname $f)
curl -Lo $f https://gerrit-review.googlesource.com/tools/hooks/commit-msg
chmod +x $f

Then, amend the commit with git commit --amend --no-edit and update your pull request.

Fix commit messages:

Visit this webpage, https://www.gem5.org/contributing, and search for "Committing". There you will find guidelines for creating commit messages.

Thank you!
