Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dv] Integrity error generation in memory busses #1811

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

ctopal
Copy link
Contributor

@ctopal ctopal commented Sep 20, 2022

1st commit changes the old memory_error_seq so that it uses the new sequence library.

2nd commit is a bug-fix guaranteeing not injecting an error while handshaking.

3rd commit allows us to stop/start InfiniteRuns type of sequences. Necessary for stopping injecting errors while we are in an IRQ handler.

4th commit improves memory interface to have an integrity error response on demand, also changes the memory_error_seq to inject integrity errors as well as bus errors. Also introduces a new seperate test named riscv_mem_intg_err_test that randomly picks between having an integrity error or a bus error on a memory response to a request.

5th commit integrates internal NMI handling to our cosim environment by driving nmi_int in Spike and set internal NMI traps as synchronous.

6th commit updates CI to have latest version of ibex_cosim branch of Spike.

7th commit is a simple fix for tying up pc_wdata of RVFI interface

8th commit is an RTL change about allowing the register write in the case of an integrity error. Reason being, if we don't let this instruction execute, this means the interrupt caused by this instruction does need to happen pretty much instantly. And letting Spike know about that behaviour (we retired a faulty instruction that caused an interrupt) was challenging. Instead, after the load/store we now flag the next RVFI package (with the jump instruction) with int_nmi causing Spike to match with Ibex.

For more details please check commit messages.

Resolves #1731
Resolves #1732

@ctopal ctopal added Type:Enhancement Feature requests, enhancements Component:DV Design verification (DV) or testing issue labels Sep 20, 2022
@ctopal
Copy link
Contributor Author

ctopal commented Sep 22, 2022

This does not take into account the type of memory request. That means we won't differentiate between seeing an integrity error on the memory interface while doing a store or a load (either way if rdata integrity bits are faulty, we're seeing an internal NMI -is this okay? WDYT @GregAC ). I believe this might not be ideal and we should only send an integrity error while we are doing a read. Should I change it?

@GregAC
Copy link
Collaborator

GregAC commented Sep 23, 2022

That means we won't differentiate between seeing an integrity error on the memory interface while doing a store or a load (either way if rdata integrity bits are faulty, we're seeing an internal NMI -is this okay? WDYT @GregAC ). I believe this might not be ideal and we should only send an integrity error while we are doing a read. Should I change it?

With the current Ibex RTL integrity errors on rdata for writes are ignored. However we're planning on going back to checking it (see lowRISC/opentitan#14611) so the current behaviour is fine.

Copy link
Collaborator

@GregAC GregAC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good @ctopal just a few tweaks til it's ready to go.

dv/uvm/core_ibex/tests/core_ibex_test_lib.sv Show resolved Hide resolved
dv/uvm/core_ibex/tests/core_ibex_new_seq_lib.sv Outdated Show resolved Hide resolved
@@ -1266,6 +1266,7 @@ class core_ibex_mem_error_test extends core_ibex_directed_test;
memory_error_seq_h = memory_error_seq::type_id::create("memory_error_seq_h", this);
`uvm_info(`gfn, "Running core_ibex_mem_error_test", UVM_LOW)
memory_error_seq_h.vseq = vseq;
`DV_ASSERT_CTRL_REQ("NoAlertsTriggered", 1'b0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a little too blunt. We wouldn't want alerts to occur on bus errors for example.

Perhaps randomly choose whether or not a test will generate integrity errors at all and then only disable this assertion if we'll be generating integrity errors?

@GregAC
Copy link
Collaborator

GregAC commented Sep 23, 2022

Oh and we do need a check that an integrity error triggers an alert. This can be done as an assert for the data side. Instruction side requires a little more thought (a fetch may not actually get executed so the integrity error doesn't get consumed and doesn't trigger an alert). Though strictly that's a V2S task. So for this PR at least we don't need to worry about it, but we must not forget about it (best to create an issue to track).

@ctopal ctopal force-pushed the intg_err branch 3 times, most recently from 1a49c7c to 1e3b1fb Compare October 14, 2022 01:50
@ctopal
Copy link
Contributor Author

ctopal commented Oct 14, 2022

While simulating with this version I found a minor mismatch regarding two back to back integrity errors (with load instructions). RTL side does not take the faulty data from memory into the register, Spike does.

I might need help about how to tell Spike to not go into a fault but also not process the retired instruction.

@ctopal
Copy link
Contributor Author

ctopal commented Oct 14, 2022

Test pass rate for this memory test is now around 60%. We hit internal NMI coverpoints though. I suspect almost all failures are related with the issue that I mentioned in the previous message.

@GregAC
Copy link
Collaborator

GregAC commented Oct 14, 2022

As discussed in the Ibex DV Sync I think we should avoid generating integrity errors in the riscv_mem_error_test and introduce a new riscv_mem_error_intg_test that does have integrity error. Just add a config option for core_ibex_mem_error_test that enables generating integrity errors and a new entry to the testlist that turns that on.

With that done this can pass CI and we can get it merged. If the new test pass rate is 60% that should do for our V2 targets and we can fix up remaining issues in V3.

@ctopal ctopal force-pushed the intg_err branch 4 times, most recently from 36a9048 to 5435edf Compare October 17, 2022 08:40
@hcallahan-lowrisc
Copy link
Contributor

I think it might be cleaner to split out the cosim/NMI changes into a separate PR? Those changes are quite significant on their own. And could you add some more detail to the PR comment, the messages for 7222d4a and 66f3e2d are a good start but I think some more context would be really useful in the future.

@ctopal ctopal force-pushed the intg_err branch 7 times, most recently from 55ab9bc to 2bd646b Compare October 17, 2022 13:30
@ctopal ctopal force-pushed the intg_err branch 3 times, most recently from e75daed to bc16e84 Compare October 17, 2022 15:04
@ctopal
Copy link
Contributor Author

ctopal commented Oct 17, 2022

Thanks @hcallahan-lowrisc for having a look, I updated the commit messages for almost all of them and dropped that one unnecessary commit about catching external RVFI signals -that one should belong to another PR I agree. Latest changes should make the mem_test pass in CI (as can be seen in Ibex CI + Private CI).

Might worth just another quick look @GregAC since now it includes cosim integration too.

PS Failures on the CI says cancelled,I'm not sure why

dv/cosim/cosim.h Outdated
// Set the state of the internal NMI (non-maskable interrupt) line.
// Behaviour wise this is almost as same as external NMI case explained at
// set_nmi method. Main difference to consider is that this interrupt is
// synchronous because it comes from the integrity checking logic inside the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internal interrupt is effectively synchronous as currently implemented but it should be consider asynchronous (in particular we don't make any guarantees about when it'll occur relative to a memory access). So I think just drop the final statement here or tweak the language so it doesn't call it 'synchronous'

@@ -200,7 +200,10 @@ bool SpikeCosim::step(uint32_t write_reg, uint32_t write_reg_data,

if (processor->get_state()->last_inst_pc == PC_INVALID) {
if (!(processor->get_state()->mcause->read() & 0x80000000) ||
processor->get_state()->debug_mode) { // (Async-Traps are disabled in debug mode)
(processor->get_state()->mcause->read() ==
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal NMIs should be handled as an async trap here, had you encountered difficulties using internal NMI with the sync trap here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested removing it now, it didn't changed the pass rate. Possibly it was some other problem that lead me to believe this was the correct behaviour. I'll change it to behave just as another NMI now.

@ctopal ctopal force-pushed the intg_err branch 2 times, most recently from 14633ae to 8fc2e59 Compare October 18, 2022 09:04
@ctopal
Copy link
Contributor Author

ctopal commented Oct 18, 2022

I think this PR increases our test time for mem_error_test (after changing it with the new sequence we probably are injecting way more memory errors) and tips it over 30 mins. What should we do about it?

@GregAC
Copy link
Collaborator

GregAC commented Oct 19, 2022

I think this PR increases our test time for mem_error_test (after changing it with the new sequence we probably are injecting way more memory errors) and tips it over 30 mins. What should we do about it?

Let's try dropping the frequency of memory errors and see what that does to the run time.

Signed-off-by: Canberk Topal <ctopal@lowrisc.org>
This will prevent seeing mismatches right at the end of our test.
Before this change, mem_error_test could inject error at the store
instruction in which we finish up the test, resulting with mismatches
with Spike and Ibex on the instructions after finishing.

Also do the same prevention for signature_addr as well, since we also
don't want to corrupt that memory transaction too.

Signed-off-by: Canberk Topal <ctopal@lowrisc.org>
Signed-off-by: Canberk Topal <ctopal@lowrisc.org>
This test picks between inserting an integrity error or a bus error
to the response in the case of a memory request from Ibex. Introduces
a control knob `enable_mem_intg_err` which can control the rate of
having integrity errors per request.

This commit also disables checking for double fault alerts in the
scoreboard because they're expected to be seen while simulating and
they don't cause infinite loop problems because every time a memory
response is requested the error causing part is just randomized.
That means Ibex trying to execute same instruction again would have
a chance to succeed this time.

Signed-off-by: Canberk Topal <ctopal@lowrisc.org>
This commit is mainly an extension to cosim environment to drive the newly
introduced state variable `nmi_int` in Spike.

This commit
 - Extends RVFI interface by a single bit (ext_nmi_int)
 - Configures cosim to set nmi_int inside Spike

Signed-off-by: Canberk Topal <ctopal@lowrisc.org>
Signed-off-by: Canberk Topal <ctopal@lowrisc.org>
Signed-off-by: Canberk Topal <ctopal@lowrisc.org>
Signed-off-by: Canberk Topal <ctopal@lowrisc.org>
@ctopal
Copy link
Contributor Author

ctopal commented Oct 20, 2022

@GregAC can you have a look at 66d8d0f to see if this makes sense?

@GregAC
Copy link
Collaborator

GregAC commented Oct 21, 2022

@GregAC can you have a look at 66d8d0f to see if this makes sense?

It does make sense, though having reviewed our discussion I think we want to leave the RTL behaviour as is. This means we need to instead alter cosim to not do the register writes on ECC failures. I will take a look at this.

@GregAC
Copy link
Collaborator

GregAC commented Oct 24, 2022

@GregAC can you have a look at 66d8d0f to see if this makes sense?

It does make sense, though having reviewed our discussion I think we want to leave the RTL behaviour as is. This means we need to instead alter cosim to not do the register writes on ECC failures. I will take a look at this.

I've worked out how I'm going to do this, just need to implement it. I'm going to use a method that doesn't require spike changes (as otherwise these would be quite invasive to suppress a register write on a load).

@GregAC
Copy link
Collaborator

GregAC commented Nov 4, 2022

I've added some stuff and reworked commits here to give #1811 which supersedes this PR and should get the memory integrity test running and passing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component:DV Design verification (DV) or testing issue Type:Enhancement Feature requests, enhancements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[dv,cosim] Implement internal interrupts in co-sim [dv] Add ecc error generation to memory agent
3 participants