Delay slot writeback happens to early #55

nekronos · 2016-06-03T11:30:08Z

Due to the instruction pipeline in the R4300 the delay slot writeback will happen after the next instruction has done a register fetch, and before the next instruction`s writeback.

(from 4.1 in the NEC VR4300 datasheet)

This is also the case on MIPS R3000, which is the cpu that I am familiar with. However, reading the datasheet for R4300 it can actually insert delay cycles if an instruction depends on the writeback, please have a look at 4.3 Load Delay in the NEC VR4300 datasheet

Gravewish · 2017-01-14T02:32:03Z

Yes, as someone somewhat familiar with the MIPS architectural details, the current implementation of delay slots in the emulator is wrong, although it will only really be noticeable once exceptions come into play.

In any processor, branches do not have an immediate effect, they have an effect a couple cycles later. Due to pipe-lining, many instructions could be issued during this time, before we actually know where the next instruction should issue from. These cycles where the executing address is unknown are known as "Control Hazards". To solve these, simple processors will just waste a few cycles "waiting", until the target address of the branch is known.

More complex architectures will implement branch prediction algorithms to attempt and solve these hazards, by speculatively executing instructions where they think we will jump to (i.e. "assume we will not jump", "assume we will jump", and various flavors of "choose the most likely situation and assume it is true"). If the assumption turns out incorrect, all speculative instructions are invalidated, and execution resets to the state immediately after the branch, at the now known target address. These algorithms vary from simple to very complex, but even the simplest of them requires quite a bit of extra complexity. Sadly, this extra complexity is typically necessary, as you might have dozens of wasted cycles if you wait until the target address is know.

However, due the simplicity of the MIPS architecture, the delay between issuing a branch instruction and the target address being known is exactly 1 cycle. This is because the branch calculation finishes during EX, and there is an optimized path for branches, such that the branch target instruction can be issued immediately on that same cycle (see 4.2). As such, complex branch prediction algorithms are unnecessary, and delay slots are used instead.

Delay slots are effectively completely normal instructions which will be issued no matter if the branch condition is true or not, before the branch reaches the execution stage.
It is important to note that they still see/have the old PC reg, because they are issued before the jump takes effect. As such, changing the value of pc_reg immediately after "execute_instruction" is called on a branch is incorrect. Any instruction in the branch delay slot is effectively still executing with pc_reg = delay_slot_pc. Only after the delay slot is issued will the branch actually touch the pc_reg.

Functionally, "delay slots" are not themselves delayed. Instead, the effect of branches is what is delayed (until after the delay slot). This is a very important distinction!

This distinction is especially important when dealing with exceptions/interrupts (when these are implemented in the future). Particularly, if a delay slot causes an exception to be raised (or an interrupt fires on a cycle with a delay slot), the branch itself is also invalidated. Additionally, the address reported to the exception handler is not the branch delay slot instruction, but the branch itself. This makes sense, since if execution is resumed, the branch needs to be re-evaluated (as it was invalidated previously).

PS: Just for reference, "Likely" branches are branches where the delay slot only executes if the branch condition is true. This is effectively implemented in the hardware by issuing the delay slot instruction like normal, but invalidating it in the pipeline before it takes effect, as soon as the condition fails. Typically, this "invalidation" is implemented in hardware by replacing the instruction traveling the pipeline with a NOP.

"Likely" branches exist in order to help the compiler (or assembly programmer), specifically for situations where finding an instruction that could fit in the delay slot for both branch condition "true" and "false" cases is impossible (or very difficult). This way, the delay slot can have an instruction for the condition "true" case only, typically one that increments a loop counter.

As a result these branches will incur a "1 wasted cycle" any time that the condition is false, and the programmer/compiler should make all efforts to find an instruction that can be used as a "normal" delay slot in cases where it knows that the branch condition will likely fail.

Thus, this instruction ends up indicating a branch that is more likely to branch than not, which is where the name comes from. (Note that some non-MIPS architectures use similar naming "(un)likely branches", not for branches with delay slots, but in order to assist the branch prediction algorithm).

PS2: It is illegal for a branch instruction (including "eret") to be in a delay slot of another branch, and this results in undefined behavior (see 16.3), so it would make sense to panic in this situation. However some other instructions like "break" and "syscall" are legal because they do not branch, instead they raise exceptions.

yupferris · 2017-01-14T12:16:12Z

Wow, thanks for the very detailed writeup! Awesome pieces of info here.

I know @nekronos has attempted (in his R3000 emulator awhile back) to implement delayed writebacks to simulate this behavior, but (afair, and could be missing details here) that may not have covered the case where an interrupt fires or an exception is raised in the delay slot and the original branch should've been invalidated.

It sounds like explicitly implementing the pipelining would definitely simplify covering these cases, as what you've described here is actually quite simple in that context (which is definitely a good thing). The trade-off, ofc, is that implementing pipelined instructions might mean we have to split up instruction code not only by opcode, but by stage as well. Perhaps this isn't as daunting as it sounds, though. One would hope we'd be able to find an approximation of this behavior that would allow us to properly delay/invalidate instructions or parts of instructions (reg updates, other io) so that we wouldn't have to split up the instruction code itself too much, but if that ends up being sufficiently more complex/confusing than actual pipelining that's definitely a problem (and getting this wrong is likely a symptom of that being the case).

I must admit the code as it is now is a bit distant to me at this point (and will likely be until I pick it up again), but at that time I'll have to internalize it again and decide to make such a tradeoff. Any additional suggestions/insight on the matter is ofc very welcome :)

Gravewish · 2017-01-14T16:48:23Z

Yes, there is also the trade-off between simplicity+speed and accuracy. Obviously, to have 100% timing-accurate emulation, the pipeline stages will need to be modeled in the emulator.

However, in the context of a functional emulator (i.e. an emulator that tries to emulate the functionality only, not necessarily the micro-architectural details like pipelining), I don't think it will be difficult to approximate this behavior.

This is a special case even within the pipeline. Theoretically, in MIPS instructions will only take effect at the beginning of the Write-back stage (which "writes" the actual results "back" to registers), 5 cycles after an instruction is issued. An instruction accesses the register in the "Register Fetch" stage (the stage responsible for "fetching registers"), meaning that you have to wait 3 cycles before an instruction can use a result of any previous instruction (nr. of cycles between RF and WB)! Obviously, this is a huge waste of time.

However, most architectures have extra optimized paths for the most common situations that "skip the pipeline" in order to resolve dependencies ("hazards" in computer architecture lingo) between instructions a few cycles earlier. Of special importance in MIPS are the branch instructions and memory accesses.

I've spoken about the branch instructions before (and delay slots). Memory accesses also have load delay slots. For MIPS, the best case situation is that the memory address is in the cache, which results in a 1-cycle hazard due to an optimized path between DC ("Data-cache" stage) and RF. As such, if the instructions after a memory access depend on that memory access, the processor will stall (wait) until the memory access finishes, before continuing (see 4.3). This means that the programmer should always make sure there is at least 1 independent instruction after each load, otherwise the processor will insert a NOP automatically.

There are a few situations where MIPS does not automatically resolve hazards. This includes writes/reads to CPn registers, which need to be manually cleared by using the "ehb" instruction (this instruction stalls until the processor knows that all "execution hazards" have been cleared), and some other cases (there's a whole family of instructions that deal with cache invalidation, as well as another family for TLB invalidation). These situations might also be important for a fully accurate emulator when dealing with buggy applications. A MFCn that comes just a few instructions after a MTCn instruction for the same register has an undefined result, unless a EHB instruction is executed in-between. In many cases, it might still read the old value of the CPn register.

It is important to note that the processor guarantees that (other than branch/memory delay slots), user mode does not see any hazards, by introducing stalls if necessary. As such, only privileged instructions are allowed to cause hazards that might need to be cleared by using "ehb", "sync", etc.

While all of this is important for full accuracy, I don't think a lot of this needs to be modelled in a functional emulator like rustendo64 (at least so far). Due to the processor automatically handling most common hazards (by introducing stalls if necessary), most of these special cases won't affect the actual instruction flow. It could be important in order to model very particular bugs in applications, however I'm not sure how relevant that is here.

In particular, I think it might be enough to model branches as a special case situations, by implementing a mechanism that allows you to delay the effect of branch instructions by one instruction, instead of running the delay slot instruction as a special case itself.

Personally, I would save the "target" address of the branch instruction in a variable (self.jump_target?), as well as the address of the branch (self.branch_pc?). If self.jump_target is set after executing any instruction (and the current PC is not self.branch_pc), you jump to that address and clear both self.jump_target and self.branch_pc. Branches whose condition fail would simply place the instruction after the delay slot (self.pc_reg + 8) in the jump_target variable.

Likely branches could be implemented in two ways:

Less accurate, if the branch condition fails you jump immediately to self.pc_reg + 8 and don't activate any of the delay slot mechanisms.
More accurate, have a third variable which indicates if a likely branch condition failed (self.likely_branch_condition_failed?). If when an instruction is issued, this variable is set, you execute nothing (NOP) instead. This could be important to model the situation where an interrupt fires during the cycle after the branch, when the delay slot would be executing (even if it would later be invalidated). However I'm not sure if this will ever really be a relevant situation.

Exceptions/interrupts, when fired, could check if self.branch_pc is set and is not the current PC, in which case they know for a fact we are currently executing a delay slot, and also know which is the address of the branch.

Of course, there might be other situations where the pipeline is important to have an accurate functional emulator, but other than the CPn situation I don't remember any situation in particular.

Also, note that "eret" and "deret" are effectivelly branches as well (with some special behaviour) which means it is illegal to use them in delay slots themselves, but they do NOT have a delay slot.

PS: If you (or anyone else) want a book that goes through a simplified architecture inspired by MIPS and explains a huge portion of the thought process behind designing such an architecture, the best one I know (other than "See MIPS Run", targeted at people programming MIPS) is "Computer Architecture: A Quantitative Approach" by David Patterson and John L. Hennessy. This is one of my favorite computer architecture books, usually called the "Bible of Computer Architecture" and taught in many computer architecture courses. In fact, one of the authors of this book was actually one of the founders of MIPS Computer Systems.

yupferris · 2017-02-20T22:09:27Z

@Gravewish just wanted to say I ended up ordering "Computer Architecture: A Quantitative Approach" (fifth edition), and it arrived today! Just skimming through it now; looks like this is a fantastic source of info indeed! I think I'll need to study/read up a bit to wrap my head around this before I'm able to decide how it would be best to proceed for this emu specifically, but your comments have been very insightful and informative. Thank you very much for the recommendation and helpful info; I look forward to being able to parse it more thoroughly after filling my head with some more context :)

Gravewish · 2017-02-20T22:36:21Z

No problem! Enjoy the book, it's certainly the best "Computer Architecture" book I know, and super interesting if you like the topic. The three appendices are especially good, summarizing everything you need to know regarding Instruction Sets, Pipelining, and Memory, and I see myself referencing them all the time (I actually used them to refresh my mind on a couple of details while writing my previous answers)

It might be a bit heavy at times (it is focused at hardware engineers who will be actually designing the thing) so don't feel obligated to understand everything. Hell, I do this for a living, and still have a hard time sometimes with how complex it can get!

I look forward to seeing your progress on rustendo64! Sadly my schedule makes it quite difficult to catch your streams live :( , but I'll keep trying

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delay slot writeback happens to early #55

Delay slot writeback happens to early #55

nekronos commented Jun 3, 2016

Gravewish commented Jan 14, 2017 •

edited

yupferris commented Jan 14, 2017 •

edited

Gravewish commented Jan 14, 2017 •

edited

yupferris commented Feb 20, 2017

Gravewish commented Feb 20, 2017

Delay slot writeback happens to early #55

Delay slot writeback happens to early #55

Comments

nekronos commented Jun 3, 2016

Gravewish commented Jan 14, 2017 • edited

yupferris commented Jan 14, 2017 • edited

Gravewish commented Jan 14, 2017 • edited

yupferris commented Feb 20, 2017

Gravewish commented Feb 20, 2017

Gravewish commented Jan 14, 2017 •

edited

yupferris commented Jan 14, 2017 •

edited

Gravewish commented Jan 14, 2017 •

edited