Indirect memory fix #1098

Open · anooprac wants to merge 2 commits into base: stable
Conversation

@anooprac commented May 3, 2024

I re-read the IMP paper and saw in 3.2.3 Indirect Prefetching that, along with prefetching B[i], B[i + delta] should also be prefetched. Because of this, I removed the for loop that repeatedly adds the same address. Instead, I only add (base address + (index << shift)) and (base address + ((index + distance) << shift)), which the paper states are the only two prefetches that should be done, both in the intro and in 3.2.3.

fix #1050
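A minimal sketch of the change described above, using the variable names from the prefetcher code discussed below (this is an illustration of the idea, not the exact diff; the code excerpt that follows is the current code under review):

// Sketch only: push exactly the two prefetch addresses described in the
// PR text, B[i] and B[i + distance] mapped through the table entry.
Addr pf_addr1 = pt_entry->baseAddr +
    (pt_entry->index << pt_entry->shift);
Addr pf_addr2 = pt_entry->baseAddr +
    ((pt_entry->index + distance) << pt_entry->shift);
addresses.push_back(AddrPriority(pf_addr1, 0));
addresses.push_back(AddrPriority(pf_addr2, 0));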

Addr pf_addr = pt_entry->baseAddr +
    (pt_entry->index << pt_entry->shift);
addresses.push_back(AddrPriority(pf_addr, 0));
}
Contributor
I agree that the code is wrong (here we were sending the same address to be prefetched multiple times), but I don't think the fix is correct.

From the text:
-> "A[B[i]] ) = (B[i] << shift) + BaseAddr", where "BaseAddr is the address of A[0]", and shift is the number of bits of each element in A
-> "When software accesses B[i], IMP will prefetch and read the value of B[i + ∆]. It then uses predicted values of Coeff and BaseAddr to calculate the memory address of A[B[i + ∆]] ... and prefetch that line".

What I don't understand from the text:
-> "IMP issues one or more prefetches for that pattern" - How can one issue one prefetch if the premise is that we always prefetch at least 2 (B[i + ∆] and A[B[i + ∆]])?
-> "The prefetch distance, the distance in the access stream between the current access and the one we prefetch, is initially small and linearly increased as more hits happen to the same pattern." - I could not find any sentence explaining how the distance is calculated. "initially small" is not a value. "linearly increased" is not a delta value. I will assume that the way we currently do it suffices.

Therefore, my understanding is (using variables from the code):

  • addr is i
  • pt_entry->index is the prediction of the value of B[i+∆]
  • Address used to prefetch B[i+∆]: addr + distance
  • Address used to prefetch A[B[i+∆]]: pt_entry->baseAddr + (pt_entry->index << pt_entry->shift)

BTW, there are likely many more bugs in this code, as I don't think it was tested. For example, checkAddressMatch seems to be the place that checks MissAddr1, but it does not check MissAddr2.

Author
Thank you for the clarification!

I am also still confused by the first part that you don't understand from the text. From the introduction of the paper, I think the distance might start at 0, based on the calcSaturation() function, but I haven't looked into its implementation.

I see that there are other bugs in the code, but for this specific issue of eliminating repeated additions of the same address to the queue, would it be fine to remove the for loop and add pt_entry->baseAddr + (pt_entry->index << pt_entry->shift)? I am still a little confused by that equation, though, because it never takes distance into account, yet the paper does mention distance being used to index into the B array. Is that just an unnecessary calculation then, and should we be fine just using pt_entry->index?

Also, just so my understanding is correct: the distance is used to find the correct entry in the indirect table, but from there we prefetch only pt_entry->baseAddr + (pt_entry->index << pt_entry->shift)? The paper does state multiple prefetches can be made, but it doesn't explain how.

Contributor
Here is my understanding from skimming through the paper:

  • There are two prefetchers in the paper: one for the index-array (B) addresses, and one for the main-array (A) addresses.
  • The index-array prefetcher uses a simple algorithm (possibly Stride): it uses a prefetching distance to determine the next index-array address to be fetched. I don't really understand why the distance is increased linearly, but this is likely irrelevant for the paper as a whole, and this algorithm can probably be replaced by another one. This prefetcher generates the first prefetch address. Since the algorithm can probably be replaced, and since they seem to assume that there is a degree (check gem5's implementation of the stride prefetcher), that is probably why they mention possibly having multiple prefetch addresses.
  • The main contribution of the paper is the main prefetcher: it uses those tables to predict that an access's address is within an index array, and therefore we should try to predict the address of the main array by using those pt_entry members. The main prefetcher does not use distance, as it is not part of its algorithm. This prefetcher generates the second prefetch address.

This is how I would fix this block given the conclusions in my previous comment:

Replace

for (int delta = 1; delta < distance; delta += 1) {
    Addr pf_addr = pt_entry->baseAddr + (pt_entry->index << pt_entry->shift);
    addresses.push_back(AddrPriority(pf_addr, 0));
}

By

addresses.push_back(AddrPriority(addr + distance, 0));
addresses.push_back(AddrPriority(
    pt_entry->baseAddr + (pt_entry->index << pt_entry->shift), 0));

Alternatively, if we want to take the "increase linearly" part into account, this is how I think it could play out. However, I don't think this is correct, because there is no point in prefetching multiple Bs if they don't have a corresponding A prediction, i.e., we would need to use multiple pt_entry here.

// Predict Bs
for (unsigned degree = 1; degree < distance; ++degree) {
    addresses.push_back(AddrPriority(addr + degree, 0));
}
// Predict corresponding A
addresses.push_back(AddrPriority(
    pt_entry->baseAddr + (pt_entry->index << pt_entry->shift), 0));

The ideal scenario would be to contact the original authors and have them help review and fix this code, because I am sure many changes would be needed, and we don't fully understand the paper.

(Make sure not to exceed the line character limit: 79 characters. If you haven't already, make sure to read CONTRIBUTING.md.)

Author

I am also re-reading the paper and am trying to understand what different addresses are added to the queue.

A part of 3.1 Key Idea says "That is, for some contiguous values of i, they access B[i] then A[B[i]]." I think this can mean that the enqueued addresses are A[B[i]] through A[B[i + distance]], linearly. So it seems like we would increment index by one until distance.

For a specific index i, the address to be enqueued can be calculated by ADDR( A[B[i]] ) = Coeff × B[i] + BaseAddr. Coeff is determined by shift in the gem5 code.
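To make the Coeff/shift relationship concrete: if Coeff is the element size of A in bytes and is a power of two, the multiplication can be done as a shift, which is presumably why the gem5 code stores shift instead of Coeff. A tiny illustration (not gem5 code; the 8-byte element size is just an example):

#include <cstdint>

// Toy illustration: if A has 8-byte elements, then Coeff = 8 and
// shift = 3, so
//   ADDR(A[B[i]]) = BaseAddr + Coeff * B[i] = BaseAddr + (B[i] << 3).
uint64_t
indirectAddr(uint64_t baseAddr, uint64_t bValue, unsigned shift)
{
    return baseAddr + (bValue << shift);  // == baseAddr + Coeff * bValue
}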

In terms of there being two different types of prefetching, one that is stream and one that is indirect, the code already accounts for this starting at line 92:
if (pt_entry->streamCounter >= streamCounterThreshold) {
    int64_t delta = addr - pt_entry->address;
    for (unsigned int i = 1; i <= streamingDistance; i += 1) {
        addresses.push_back(AddrPriority(addr + delta * i, 0));
    }
}
pt_entry->ad

I don't think the fix you provided would be correct, but I'm also not sure what the correct solution is. The PC is used to get the index, which is correct according to the paper, because it finds the correct entry in the prefetch table according to the diagrams. But that only gets you the index, which you use to index into the B array, going from that index to (index + distance). Distance is also calculated in this section of code, but the only part that confuses me is how to get access to the B array. Once we have that, we use index to index into B, shift it by shift, and add it to baseAddr. Then increment index until it is equal to index + distance.

Does this make sense in context of the paper? From there, how do we get B?

Contributor

> I am also re-reading the paper and am trying to understand what different addresses are added to the queue.
>
> A part of 3.1 Key Idea says "That is, for some contiguous values of i, they access B[i] then A[B[i]]." I think this can mean that the enqueued addresses are A[B[i]] through A[B[i + distance]], linearly. So it seems like we would increment index by one until distance.
>
> For a specific index i, the address to be enqueued can be calculated by ADDR( A[B[i]] ) = Coeff × B[i] + BaseAddr. Coeff is determined by shift in the gem5 code.
>
> In terms of there being two different types of prefetching, one that is stream and one that is indirect, the code already accounts for this starting at line 92: if (pt_entry->streamCounter >= streamCounterThreshold) { int64_t delta = addr - pt_entry->address; for (unsigned int i = 1; i <= streamingDistance; i += 1) { addresses.push_back(AddrPriority(addr + delta * i, 0)); } } pt_entry->ad

Indeed, the stream prefetcher is already there, so there is no need to add it again.

> I don't think the fix you provided would be correct, but I'm also not sure what the correct solution is. The PC is used to get the index, which is correct according to the paper, because it finds the correct entry in the prefetch table according to the diagrams. But that only gets you the index, which you use to index into the B array, going from that index to (index + distance). Distance is also calculated in this section of code, but the only part that confuses me is how to get access to the B array. Once we have that, we use index to index into B, shift it by shift, and add it to baseAddr. Then increment index until it is equal to index + distance.
>
> Does this make sense in context of the paper? From there, how do we get B?

"which you use to index into the B array". The "index" is not and index into B. "index" is the a prediction of what Bs value is. i.e., Bs are predicted using pt_entry->index. Therefore, the getting a B value is trivial: just use the current PC to get the corresponding pt_entry: B==pt_entry->index.

However, I couldn't understand how to generate multiple B values, since each PT entry contains a single index. My best guess is that instead of indexing the table using only the PC, we also need to use both the access address and the generated stream addresses (a rough sketch follows this list):

  • Use the current PC-addr pair to access the stream table (prefetchTable.findEntry(hash(pc))) and generate multiple stream prefetch addresses (SPAs) using a distance to generate more than one (as we currently do)
  • Use both the current PC-addr pair and each PC-SPA pair to access the prefetch table to get multiple pt_entries and generate multiple indirect prefetch addresses using the other formula (pt_entry->baseAddr + (pt_entry->index << pt_entry->shift))
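A rough sketch of that two-step lookup, assuming the table could be probed once per generated stream address (the hash(pc, spa) lookup and the PrefetchTableEntry/findEntry usage here are hypothetical, not gem5's current API):

// Hypothetical sketch only; names follow the discussion above.
std::vector<Addr> streamAddrs;
for (unsigned d = 1; d <= distance; ++d) {
    Addr spa = addr + d;  // stream prefetch address (SPA)
    streamAddrs.push_back(spa);
    addresses.push_back(AddrPriority(spa, 0));
}
// Probe the prefetch table once per PC-SPA pair and, when an entry is
// found, generate the corresponding indirect (A-array) prefetch.
for (Addr spa : streamAddrs) {
    PrefetchTableEntry *entry = prefetchTable.findEntry(hash(pc, spa));
    if (entry == nullptr)
        continue;
    Addr pf_addr = entry->baseAddr + (entry->index << entry->shift);
    addresses.push_back(AddrPriority(pf_addr, 0));
}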

@ivanaamit (Contributor)

Hi @anooprac,

There are a few things that we require all contributions to gem5 to have:

Fix the failed change-id test:

To automatically insert a change-id, run the following:

f=.git/hooks/commit-msg
mkdir -p $(dirname $f)
curl -Lo $f https://gerrit-review.googlesource.com/tools/hooks/commit-msg
chmod +x $f

Then, amend the commit with git commit --amend --no-edit and update your pull request.

Fix commit messages:

Visit this webpage, https://www.gem5.org/contributing, and search for "Committing". There you will find guidelines for creating commit messages.

Thank you!
