Read/write test error #2

Open
Tolar626 opened this issue Nov 30, 2022 · 11 comments

Comments

@Tolar626

Hello, I am very interested in this project, and I have run into some problems while studying it.

I have instantiated the ddr3_x16_phy_cust and ddr3_rdcal modules in your Arty S7-50 project, and programmed the app module to generate the rdcal_start signal and control the data and address input of the ddr3_rdcal module.

The parameters as a whole follow your configuration.

The DDR interface frequency is 300 MHz, and ISERDES_16B, 32B, and 48B are all FALSE. The IDELAYCTRL frequency is 200 MHz. RD_DELAY is set to 6 or 10 (same behavior with both), and the design is deployed on a Zynq 7030.

The board level tests are as follows:

After the w_rdcal_done signal goes high, a single read/write test passes. The data written to and read from address 0x10 are both 0xaaaa_aaaa_aaaa_aaaa. For details, see the following figure:

single_rw_pass

However, a problem occurs when data is read back after consecutive writes. All-A data is written to address 0x0, all-B to address 0x8, and all-C to address 0x10. The data read from address 0x0 is all C, which differs from the all A that was written. For details, see the following figure:

mulit_rw_fail

According to the waveform, the read data vector is updated with the write data whenever a write operation is active. As a result, the first read returns the data from the last write, regardless of the address.

I haven't studied your code in depth yet; I wanted to get it running first and then study it further. In your experience, what could the problem be?

Looking forward to your reply and guidance, thank you!

@someone755
Owner

someone755 commented Dec 1, 2022

Hi, thanks for your interest in my work :)

The read data vector does change during write operations. This is due to the way OSERDES and ISERDES are wired to the IOBUF. Here is a sketch I used in my thesis:
image
(Note the data output pin -- controlled by OSERDES -- is wired directly into IDELAY and then ISERDES.)

This is why you should only really consider the read data output as useful when the rd_data_valid flag is high. It is only high for one clock period per read operation.

You do not mention toying around with p_RD_DELAY and p_ISERDES_32B_SHIFT, so my first guess would be that this is where the issue lies. To read data successfully, you are required to find proper values for these two parameters. This will depend on the board and the FPGA and memory chips in use. Using 16 different bytes of data will help you determine the proper values for these two parameters, but sadly it can only be done via trial and error.*

This project's sister repository houses a working example for the Arty S7 board, but the top Verilog module should still be useful as an example of a working instantiation of this interface. Note that to use the read calibration module, IDELAY_TYPE should be set to "VAR_LOAD". (The values of p_IDELAY_INIT_DQS and p_IDELAY_INIT_DQ are ignored and assumed to be 0 in "VAR_LOAD" mode. See UG471 Table 2-5: IDELAY Attribute Summary.)

The sister repository also includes a Python script that makes it very easy to test whether you've calibrated the two read parameters correctly. The written and read data is saved to disk as two binary files, and it is possible to read them with a hex viewer/editor. It is then easy to see by how much the read data must be delayed. (The parameters allow for multiples of 64b increments, and a single switch for a 32b shift.)
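As a rough illustration of that comparison (this is not the script from the sister repository), the following Python sketch searches for the byte offset at which the read-back data lines up with the written data. The function name `find_byte_shift` and the `max_shift` limit are made up for this example, and the two blobs are assumed to be already loaded as `bytes`.

```python
from typing import Optional


def find_byte_shift(tx: bytes, rx: bytes, max_shift: int = 64) -> Optional[int]:
    """Return s such that rx[i] == tx[i + s] over the overlapping region.

    A positive s means the read data appears shifted left (earlier) by s
    bytes relative to the written data. Returns None if no shift within
    +/- max_shift makes the overlap match exactly.
    """
    n = min(len(tx), len(rx))
    # Try small shifts first so that s == 0 wins when the data is correct.
    for s in sorted(range(-max_shift, max_shift + 1), key=abs):
        lo = max(0, -s)
        hi = min(n, n - s)
        if hi <= lo:
            continue  # no overlap left for this shift
        if rx[lo:hi] == tx[lo + s:hi + s]:
            return s
    return None
```

With a repeating pattern such as all 0xAA bytes, every candidate shift matches and the function simply returns 0, which is one practical reason to prefer unique test bytes when hunting for the right parameter values.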

*If you do use the Python script, you will see why I recommend unique strings instead of repeating strings of 4 bits.

Let me know how it goes, or if you need any help.

@Tolar626
Author

Tolar626 commented Dec 1, 2022

Wow, thanks for your patient reply!

Based on the connection topology of OSERDES, ISERDES, and IOBUF that you described (the drawings are fantastic), combined with my previous tests, my guess is that the DRAM is likely not working, so that during read operations the stored OSERDES output is fed back into ISERDES.

For p_RD_DELAY and p_ISERDES_32B_SHIFT, I kept p_ISERDES_32B_SHIFT at FALSE the whole time and tried p_RD_DELAY at 6 and 10. I have a question here: p_ISERDES_32B_SHIFT can simply be tried both ways (FALSE or TRUE), but p_RD_DELAY is the number of clock cycles by which the read-valid signal is delayed. Is there a valid range for it?

I will try to determine the values of the above parameters according to the top-level module of your demo combined with the Python script. I will communicate with you if there is any progress. Thanks again!

By the way, in the demo top layer of your Arty S7 board, the input DDR interface clock lp_DDR_FREQ is 325, while the output w_clk_ddr of MMCM is 300M. Is this a mistake? Here I first changed the lp_DDR_FREQ in the ddr3_x16_cust_top module to 300, and the ddrFreq in the Python script to 300000000, please point out any errors.

@someone755
Owner

I'm afraid I don't understand your first sentence. OSERDES is only active during writes. At all other times, the IOBUF output is disabled (Z). Even if I implemented back-to-back reads and writes (in the current implementation, the bank is precharged when switching from writes to reads), OSERDES does not interfere with read data.

The read mechanism is really very simple. The ISERDES data output width is doubled (from 64 to 128 bits), and the read data valid flag is just a shift register. Take a look at the 12 lines and you'll see what I mean. p_RD_DELAY is what defines the width of the shift register:

reg	[p_RD_DELAY:0]	rn_rd_op_delayed = 'b0;	// r_rd_op pipe (delay)
...
assign o_mem_rddata_valid = rn_rd_op_delayed[p_RD_DELAY];

This is how I decided to control for CL and PCB delays etc. I guess the next step would be automating it, but it works well enough once you figure out the combination you need.
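To make the effect of p_RD_DELAY concrete, here is a small behavioral model in Python (not the actual Verilog). It assumes that r_rd_op is shifted into bit 0 of the register on every clock edge, which the excerpt above does not show; the class and function names are invented for this sketch.

```python
from typing import List


class ReadValidPipe:
    """Behavioral model of a (rd_delay + 1)-bit shift register."""

    def __init__(self, rd_delay: int) -> None:
        self.width = rd_delay + 1  # mirrors reg [p_RD_DELAY:0]
        self.bits = 0

    def clock(self, rd_op: bool) -> bool:
        """Advance one clock; return the value of the top bit (valid flag)."""
        mask = (1 << self.width) - 1
        # Assumption: the new rd_op value enters at bit 0 and older
        # values move toward the MSB.
        self.bits = ((self.bits << 1) | int(rd_op)) & mask
        return bool((self.bits >> (self.width - 1)) & 1)


def valid_cycles(rd_delay: int, n_cycles: int = 32) -> List[int]:
    """Issue one read at cycle 0 and list the cycles where valid is high."""
    pipe = ReadValidPipe(rd_delay)
    return [c for c in range(n_cycles) if pipe.clock(rd_op=(c == 0))]
```

In this model a single read produces exactly one valid pulse, and increasing `rd_delay` by one moves that pulse one clock later, which is why each step selects a different slice of the ISERDES output.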

The arty_s7_playground repository is very very messy. Locally, my MMCM is correct, but I've decided not to bother uploading every variation of the IP files. I used to do that in the beginning, where the commits are absolutely huge, but nowadays I just hand select the files I edit. Vivado is anything but version control friendly. I assume people will mostly be interested in the verilog files, as the MMCM and FIFO configuration is well documented in the readme file in this repo.

The ddrFreq variable in Python is really only there to calculate memory throughput. What's important is that the UART baud rate is correct. The last two lines are really just for show. Here's what a successful run looks like:
image

Keep us updated!

@Tolar626
Author

Tolar626 commented Dec 2, 2022

Thank you for your quick reply, OSERDES does not interfere with reading data, that's true.

Anyway, I've already started the verification process by changing the lp_ISERDES_32B_SHIFT and lp_RD_DELAY parameters and using your python script to expand the test.

The UART baud rate setting of 3_000_000 does not work with my hardware, so I replaced it with my own UART baud rate of 9600. Please ignore my low-speed UART.

The first test
condition: ddrFreq = 300MHz, lp_ISERDES_32B_SHIFT = FALSE, lp_RD_DELAY = 10
result: Fail

Compare TX data with RX data. RX data is the result of TX data shifted 12 bytes to the left, as shown in the figure below.

TXDATA
dalay10_txdata
RXDATA
delay10_rxdata

The second test
condition: lp_ISERDES_32B_SHIFT = FALSE, lp_RD_DELAY = 9
result: Fail

RX data is the result of TX data shifted 8 bytes to the left, as shown in the figure below.

TXDATA
delay9_txdata
RXDATA
delay9_rxdata

The most recent test
condition: lp_ISERDES_32B_SHIFT = FALSE, lp_RD_DELAY = 8
result: Fail

For a data size of 1024, the test sometimes passes and sometimes fails. For larger data sizes, it almost always fails.

One error tends to recur: the faulty RX data is the result of the previous 12-byte left shift of the RX data, as shown in the figure below.

TXDATA
err_rd_delay8_txdata
RXDATA
err_rd_delay8_rxdata
WAVEFORM
err_rd_delay8_wave

Another type of error occurs when the data size is larger, and no pattern can be found for RX data errors, as shown in the figure below.

TXDATA
err2_delay8_txdata
RXDATA
err2_delay8_rxdata
WAVEFORM
err2_delay8_wave

Next, I will try reducing lp_RD_DELAY further. Can you tell what causes the behavior above? If so, please point it out. Thank you!

@someone755
Owner

This is really great error reporting, thank you for the effort.

Hopefully after testing you understand how the lp_RD_DELAY parameter influences the rddata_valid flag. Effectively what you see is the read data getting shifted by 64 bits for each +/-1 change in lp_RD_DELAY. If you wish to also observe the ISERDES parallel output (64 bits), that is available via on_iserdes_par. If lp_ISERDES_32B_SHIFT is "TRUE" then the 64-bit on_oserdes_shifted signal may also be of interest. In your case, it seems that a value of 8 and "FALSE" is enough.

I'll be honest and admit I've never seen any such errors appear in my testing. The recurring first error makes me believe that the read command for address 'hC8 (at txdata blob location 'h190) is simply ignored. The data present at 'h190 is simply whatever is left on the ISERDES parallel output from the read that was issued to address 'hC0. One guess would be that the timing constraints of your memory module differ from the one I've used -- Have you checked that the timing parameters I've left in my code are fine for your memory chip? Also, though standardized, perhaps the mode register options could be different for your chip, as well?

As for the second error, which seems sporadic, I genuinely have no insight to share. Either the write or the read cycle fails for the data at blob location 'h0F0, and then at 'h100 again the recurring first error repeats.

If it turns out that it is the writes that are failing then something is very wrong. The write part of the PHY is made as well as I could manage (I'd argue even: as well as is possible with SERDES in memory mode), and without high speed probes placed onto the DQ lines there really is no way to even diagnose the error.

The reads might fail due to bad read calibration, but then my instincts would tell me you would see more than just sporadic 128-bit bursts corrupted. Still, you could observe the IDELAY tap values via on_dq_idelay_cnt[9:5] & on_dq_idelay_cnt[4:0] and on_dqs_idelay_cnt[9:5] & on_dqs_idelay_cnt[4:0]. In my testing it usually ends up at something like 17 and 10 taps.
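For reference, a short Python sketch of how such a 10-bit count could be split into its two 5-bit tap values and converted to an approximate delay. The tap resolution formula 1/(32 x 2 x F_REF) is the one UG471 gives for 7-series IDELAY, and the 200 MHz default matches the IDELAYCTRL frequency mentioned earlier in this thread. The function names are illustrative only, and the example makes no claim about which signals each 5-bit field belongs to.

```python
from typing import Tuple


def split_idelay_cnt(cnt10: int) -> Tuple[int, int]:
    """Split a 10-bit count into (bits [9:5], bits [4:0])."""
    upper = (cnt10 >> 5) & 0x1F
    lower = cnt10 & 0x1F
    return upper, lower


def tap_delay_ps(taps: int, refclk_hz: float = 200e6) -> float:
    """Approximate IDELAY delay in picoseconds for a given tap count.

    One tap is 1 / (32 * 2 * F_REF), i.e. about 78 ps at 200 MHz.
    """
    return taps * 1e12 / (32 * 2 * refclk_hz)
```

Under these assumptions, tap values of 17 and 10 correspond to roughly 1.33 ns and 0.78 ns of added input delay.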

As an aside, I was told by somebody that the Zynq chips have memory routed to the PS side of the chip, where a static DDR3 controller resides. I was led to believe UG933 applies to Zynq (where only routing to the PS is discussed). It's interesting that your board has memory routed to PL despite being a Zynq chip. Though the information forwarded to me might have been incorrect, or I might have misunderstood it.

Another side note, the Arty S7 board uses a FT 2232H, which can work at up to 12 Mbaud (or 6M or 3M etc). If your chip doesn't support it, that should be fine for the purposes of this test. (I used the high baud rate to confirm that the entirety of my memory was accessible. Transferring 2 Gbit of data, both ways, over a 3 Mbaud connection is slow!)
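To put a number on "slow", the arithmetic can be sketched in Python. This assumes 8N1 framing (10 bits on the wire per data byte) and ignores protocol overhead and host latency, so real transfers will take longer. The function name is hypothetical.

```python
def uart_transfer_seconds(payload_bits: int, baud: int,
                          wire_bits_per_byte: int = 10) -> float:
    """Lower bound on one-way transfer time over a UART.

    wire_bits_per_byte = 10 assumes 8N1 framing (start + 8 data + stop).
    """
    payload_bytes = payload_bits // 8
    return payload_bytes * wire_bits_per_byte / baud


# 2 Gbit of memory contents, one direction, at 3 Mbaud and at 9600 baud.
full_chip_3m = uart_transfer_seconds(2**31, 3_000_000)
full_chip_9600 = uart_transfer_seconds(2**31, 9600)
```

Under these assumptions a full 2 Gbit pass takes about 15 minutes per direction at 3 Mbaud, and on the order of days at 9600 baud, so low baud rates are only practical for small address ranges.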

@Tolar626
Author

Tolar626 commented Dec 6, 2022

Thank you very much for your help. I have benefited a lot.

My TTL-RS232 module is an SP3232, which supports a maximum baud rate of 235 kbps, so I can only run low-speed tests, and the load on the DRAM is low.

Yes, the PS side of the Zynq does have the ARM DDR memory controller, but on my board the PS and PL each have their own external DRAM. I did not enable the PS side; in fact, I use the Zynq as if it were a K7 FPGA, so no DRAM routed to the PS side is involved.

Your guess was right: the core timing parameters were not suitable. In your design, tRCD and tRP are 13.5 ns, while the minimum tRCD and tRP for DDR-800 (5-5-5) is 16.5 ns. I think this was the problem; after the change, both 300 MHz and 400 MHz pass. You are great!

However, there was a problem when I tried 466 MHz. I used the DDR-1066 (7-7-7) core timing parameters, with lp_ISERDES_32B_SHIFT set to TRUE and lp_RD_DELAY set to 8. Occasionally a bit error occurs in the RX data, in the form of a single-bit flip. See the following figure for details.

TXDATA
466M_txdata
RXDATA
466M_rxdata

I suspect that, because ODT is always low, the unterminated line causes far-end reflections and write errors at high speed. I therefore turned ODT on in the WR state and turned it off when leaving the PRE state. However, the test results did not improve significantly, and adjusting lp_ISERDES_32B_SHIFT and lp_RD_DELAY only makes things worse. I am considering relaxing CL to 8. Could the errors be caused by the strict timing of 1066?

Looking forward to your reply and best regards!

@someone755
Owner

Your explanation of Zynq's memory interface configuration is much appreciated. Ignoring the PS is one way to get a good FPGA haha

At the risk of stating the obvious, you can just default to always using the fastest speed bin timings for your memory chip. I.e. if your chip is rated for operation at 1600, the timing values for that speed bin should be valid even when running at lower frequencies (for most timing parameters that should hold true even in DLL off mode at <125 MHz). That is to say, there is no need to look at different speed bin tables when running the memory at 300-400 MHz or 466 MHz, you can default to the fastest timings supported by the memory chip.
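The reason a single set of nanosecond timings can be reused is that only the cycle count changes with frequency. A minimal Python sketch of that conversion follows, using the 13.5 ns and 16.5 ns values quoted earlier in this thread. It assumes the counts are expressed in memory clock cycles; the thread does not say which clock the interface's timing counters actually run on, and the function name is made up.

```python
PS_PER_S = 10**12


def ps_to_cycles(t_ps: int, f_hz: int) -> int:
    """Round a minimum timing (in picoseconds) up to whole clock cycles.

    Integer arithmetic avoids floating-point rounding at exact boundaries.
    """
    return (t_ps * f_hz + PS_PER_S - 1) // PS_PER_S


# Cycle counts for a 13.5 ns constraint at the frequencies tried in the thread.
cycles_13p5 = {f: ps_to_cycles(13_500, f)
               for f in (300_000_000, 400_000_000, 466_000_000)}
```

Rounding up is the conservative choice for minimum timings such as tRCD and tRP: waiting an extra cycle is safe, while waiting one too few violates the constraint.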

@robinsonb5 reported similar issues with sporadic bit shifts at higher frequencies as you do. If he has anything to add, his opinion is more than welcome in public as well (if not, I apologize for the tag).

It is true that the errors might be a consequence of my improper understanding of ODT. This is partly revealed in Issue #1, which @TheAnimatrix and I discussed further in private. My understanding is explained in that issue, but to summarize, I interpreted the following table (sourced from the Micron MT41K128M16 2Gbit DDR3L datasheet) to mean "If R_(TT,nom) is enabled in MR1, then the R_(TT,nom) value is in effect regardless of the ODT pin."

image

The ODT chapter in that datasheet further claims that "[write] accesses use RTT,nom if dynamic ODT (RTT(WR)) is disabled," which is done via MR2 by default in this interface.

That is to say, I interpreted the ODT pin to control when the impedance switches from R_(TT,nom) (set in MR1) to R_(TT,WR) (set in MR2). According to this interpretation, since R_(TT,WR) is disabled in MR2 and the ODT ball is kept low, the termination impedance should always be R_(TT,nom).

I'm still not sure which interpretation is correct. To further the confusion, Micron's datasheet lists an additional mode where the ODT pin may be wired high permanently (supposedly via a current limiting resistor) that JESD79-3 doesn't include at all! I also think (but am not 100% sure) that Xilinx's MIG keeps the ODT ball low.

Micron's TN-41-04 states: "When the module is being accessed during a WRITE operation, greater termination impedance is desired, for example, 60Ω or 120Ω." Perhaps raising R_(TT,WR) to 60 Ohm would be beneficial at these high frequencies. Sadly I cannot test this hypothesis because my Spartan FPGA can be clocked at max 464 MHz (plus my testing top module fails timing at ~330 MHz; I never anticipated I would be able to test at such high frequencies, so none of it is really optimized). You could set M[9,6,2] in MR1 to {0,0,1} (sets R_(TT,nom) to 60 Ohm) and see if it helps (assuming the data bus is terminated to R_(TT,nom) when the ODT ball is tied to 1'b0). Perhaps full ODT functionality would be needed -- If somebody reading this wants to try and implement that, you are more than welcome to contribute to this project.
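As a sketch of the bit manipulation involved, the following Python helper rewrites the three R_(TT,nom) bits of an MR1 value while leaving all other bits alone. The bit positions (9, 6, 2) and the {0,0,1} code for 60 Ohm are taken from the paragraph above and should be checked against the datasheet of the memory chip in use; the function name is invented.

```python
from typing import Tuple

RTT_NOM_BITS = (9, 6, 2)  # MR1 bit positions, most significant first


def set_rtt_nom(mr1: int, code: Tuple[int, int, int]) -> int:
    """Return mr1 with bits M[9], M[6], M[2] replaced by the given code."""
    for bit, value in zip(RTT_NOM_BITS, code):
        mr1 &= ~(1 << bit)           # clear the old bit
        mr1 |= (value & 1) << bit    # write the new bit
    return mr1


# Code {0,0,1}, which the thread associates with 60 Ohm.
mr1_60_ohm = set_rtt_nom(0, (0, 0, 1))
```

Starting from an all-zero register this yields 0x004, and any unrelated MR1 bits that were already set are preserved.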

Another problem could be due to the lack of control between the clock, DQ, and DQS signals. ODELAY elements are not available in HR banks, and the PHASER primitives are conveniently left undocumented by Xilinx. I've thought of changing the clock signal phase using the MMCM, but with a resolution of 45° that would be futile.

One solution presented to me could be adding error correction outside of the memory controller. With my knowledge of DDR3 SDRAM and insight into this interface in its current form (including testing on an Arty S7-50), I can only conclude that such sporadic errors are unavoidable at high frequencies.

The CL value has no bearing on the quality of the read data. Only the delay with which the memory delivers the read data after a read command is issued is changed. For the memory interface, the CL value changes nothing (no logic controls for or counts the CL clock cycles, but it could mean that lp_RD_DELAY and lp_ISERDES_32B_SHIFT would need to be adjusted when changing CL).

@Tolar626
Author

Wow, thank you so much for sharing so much really helpful information. Sorry for the late reply.

I will try to configure R_(TT,nom) to 60 ohm, but before I do, I have an interesting phenomenon to share with you.

As you know, with the existing Python script, regardless of the UART baud rate (12M/6M/3M; I only used 9600), the load on the DRAM is too low from an engineering point of view, so I added my own read/write control logic.

The control process is simple. After a write pass over the full array is complete, a read pass over the full array is performed and the read data is checked. The written data changes byte by byte; for example, the first burst writes 128'h1f_1e_1d_1c_1b_1a_19_18_17_16_15_14_13_12_11_10, the second burst writes 128'h2f_2e_2d_2c_2b_2a_29_28_27_26_25_24_23_22_21_20, and so on. (Please excuse my generated data; it is less stressful than a pseudo-random data test.) During this process, I insert a write or read command whenever the command FIFO has free space.
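The data pattern described here can be modeled in a few lines of Python, which is handy for checking captured read data offline. This is only a sketch of the two example words given above: how the pattern continues past the 15th burst is not stated, so wrapping the upper nibble modulo 16 is an assumption, and both function names are made up.

```python
from typing import Iterable, Optional


def burst_word(n: int) -> int:
    """128-bit word for burst n: byte i (LSB first) is (n << 4) | i.

    Assumption: the upper nibble wraps modulo 16 after burst 15.
    """
    hi = n & 0xF
    word = 0
    for i in range(16):
        word |= ((hi << 4) | i) << (8 * i)
    return word


def first_mismatch(read_words: Iterable[int],
                   first_index: int = 1) -> Optional[int]:
    """Return the offset of the first word that differs from the pattern."""
    for offset, word in enumerate(read_words):
        if word != burst_word(first_index + offset):
            return offset
    return None
```

Because every byte within a word is unique, a shifted or partially stale read word is easy to spot, although the pattern repeats across words far sooner than pseudo-random data would.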

As a result, errors appeared when reading data. At first, I thought the problem was in my read/write control logic, so I considered how it differs from the Python script: the difference lies in the spacing between commands. With the Python script at 9600 baud, the interval between two UART write commands is about 4_800_000 cycles, and the interval between two read commands is about 2_500_000 cycles. In my read/write control logic, however, the interval between two write commands is essentially 0, and the interval between two read commands is periodic (less than 50 cycles), namely the time from the read command being accepted until the read data is valid. Based on this guess, I added configurable interval logic after the read and write commands. The errors still occurred, which indicates that the read/write control logic itself is correct. That is the background.

My guess is that the earlier test coverage with the Python script was insufficient. The DRAM address is 27 bits wide (3 bits for bank, 14 bits for row, 10 bits for column), but I had only verified up to 18 bits when I recorded a pass. So I took your Python script and increased the address range it traverses. Do errors occur as the coverage increases?

condition: ddrFreq = 300MHz, start_addr = 0
result:

| End_addr    | Result         |
| ----------- | -------------- |
| (1<<12) - 8 | Pass           |
| (1<<16) - 8 | Pass           |
| (1<<20) - 8 | Pass           |
| (1<<21) - 8 | Pass           |
| (1<<22) - 8 | 1 Fail, 2 Pass |
| (1<<23) - 8 | Fail           |
| (1<<24) - 8 | Fail           |

The experiment above shows that errors begin to occur as the coverage increases. For example, with an end address of (1<<22) - 8, the test was run for 3 rounds, with 1 fail and 2 passes, and errors become more likely when the coverage is increased further.
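To relate these end addresses to how much of the chip is actually exercised, here is a small Python sketch. It assumes each address selects one 16-bit word (an x16 device), that one burst covers 8 consecutive addresses (128 bits, matching the address step of 8 used in this thread), and that the full address space is the 27 bits mentioned above. The names are illustrative.

```python
from typing import Tuple

ADDR_BITS = 27     # 3 bank + 14 row + 10 column bits
WORD_BITS = 16     # assumption: x16 device, one word per address
BURST_ADDRS = 8    # assumption: one burst spans 8 addresses (128 bits)


def coverage(end_addr: int, start_addr: int = 0) -> Tuple[int, int, float]:
    """Return (bursts, bytes, fraction of the full address space)."""
    bursts = (end_addr - start_addr) // BURST_ADDRS + 1
    n_bytes = bursts * BURST_ADDRS * WORD_BITS // 8
    fraction = bursts * BURST_ADDRS / (1 << ADDR_BITS)
    return bursts, n_bytes, fraction
```

Under these assumptions the first failing range, ending at (1<<22) - 8, covers 8 MiB, or 1/32 of the device, and the largest range in the table still covers only one eighth of it.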

The following figure shows the errors with the end address (1<<24) - 8. The error is confined to 128 bits, and no obvious error pattern can be seen.
end_addr_24_err

Based on your experience, what could cause this? Have you tried a full-array traversal test? Of course, it will take a long time. Looking forward to your reply, and thank you again!

@someone755
Owner

Hey, sorry for the late response. I could've sworn I replied already. Happy New Year, I guess!

I've never had anything like this happen, not with my Python script and not in fast sequential access. I wrote a module that does sequential writes across the entire memory, then sequentially reads from the entire memory, and compares the write and read data patterns. In this scenario, I have never gotten a read fail.

I have no clue why some data would be corrupted in your case. First guess would be that maybe some cells leak charge too quickly (Are you in a warm environment, or does the FPGA or memory chip warm up significantly?), so by increasing the address range, you prolong the time the offending cells have to corrupt data. (For this to be true, I think the corrupted address should always be the same one, but I'm not sure. This is all just speculation.) You could test if the refresh period is too short, somehow, though in all the simulations I've done it's always been around 7.8 us. Maybe write the entire chip, wait a while, then read back from it? Or just decrease the p_REFI parameter and see if refreshing more often helps (but I don't recommend going below 3.9 us).

Other than this, I really have no clue what could be going on. In the meantime I've also lost access to the development board I was using so I can't contribute with any testing of my own.

@Tolar626
Author

I finally have your reply. Thank you very much for your analysis, and happy New Year!

In fact, I still haven't figured out why this is happening. I have checked that the FPGA and DRAM do not heat up at room temperature, so the DRAM temperature should be within the normal operating range (0 to 85℃). See JESD79-3E (DDR3), Table 21, Temperature Range: if the temperature exceeds 85℃, refresh commands must be issued more often, with tREFI reduced from 7.8 us to 3.9 us.
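The two tREFI values follow from spreading 8192 refresh commands over a 64 ms window (halved above 85℃). The following Python sketch turns that into a maximum cycle count between refreshes. It is only an illustration: the thread does not state the units of p_REFI or which clock the refresh counter runs on, so the frequency argument and the function name are assumptions.

```python
REFRESH_WINDOW_MS = 64    # retention window in the normal temperature range
REFRESH_COMMANDS = 8192   # refresh commands per window


def trefi_cycles(f_hz: int, extended_temp: bool = False) -> int:
    """Maximum whole clock cycles between refresh commands.

    Rounds down so the average refresh interval is never exceeded.
    extended_temp=True halves the interval (about 3.9 us instead of 7.8 us).
    """
    divisor = 1000 * REFRESH_COMMANDS * (2 if extended_temp else 1)
    return (REFRESH_WINDOW_MS * f_hz) // divisor
```

Rounding down here is the opposite of the usual rounding for minimum timings, because tREFI is an upper bound on the average spacing rather than a minimum wait.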

To rule out the hardware environment, I will run the same test with MIG under the same conditions to see whether the same thing happens. My guess is that it is unrelated to the hardware environment. At the same time, I will also try reducing tREFI to see whether refreshing more often helps, and then I will share the test results with you. I am very worried about getting stuck on this problem with no progress.

In the meantime, I tried another implementation approach, using the IDDR/ODDR primitives to implement read and write access to an x64 UDIMM (4 x16 components). It passes at 100 and 200 MHz. At 300 MHz, when I burst write 16'ha0de, 16'ha0df, ..., 16'ha0e5, the data read back from Chip1 through IDDR is wrong; see the figure below:

1674007897701

During writes, I enabled ODT to reduce reflections between the DRAM components, so why does this error occur? If it is a write error, why are the other chips fine while Chip1's error can be reproduced reliably? 300 MHz should not be a high-speed stress for IDDR/ODDR. Do you have any ideas about this problem?

Looking forward to your reply and best regards!

@someone755
Owner

When implementing support for multiple DDR3 chips on the same ck/cmd/addr bus (like on a DIMM), you need to account for the fly-by topology effects. The physical datapath traces on the PCB might be equal between the DRAM chips, but because of the fly-by topology, the clock, address, and command bus signals arrive at the chips with different delays. If you do not account for this, then data will obviously get corrupted. I'm just hypothesizing here that you didn't use DDR3's write leveling function, because these lower end chips don't provide ODELAY functionality. Corrupted write data is an inevitability in this scenario.

This illustration of the effect of fly-by topology on datapath timing is taken from Micron TN-41-13:

image

That's just a guess though since you don't mention any delay primitives being used.

As for my own interface, I'm afraid I don't have any more input to give with regards to your issue. Maybe a complete set of IO constraints might expose some as-yet unexposed design flaw. Maybe it's a corner case I didn't account for.
