After feedback on our design review presentation on Monday, we decided that I should determine an upper bound on SDRAM access time from the FPGA. For context:
- The CPU uses an SDRAM controller to schedule and arrange simultaneous SDRAM requests. Since there are multiple input read/write command ports (from the FPGA and the CPU) and only a single output read/write command port (to the SDRAM), the SDRAM is a contested resource.
- Since the SDRAM is a contested resource and the order of requests is essentially non-deterministic, we must assume the worst-case access time for our FPGA so we can design our hardware to meet HDMI timing constraints.
- Unfortunately, few details on the SDRAM controller IP are provided by Intel. This means some assumptions have to be made regarding the SDRAM controller’s internal delays.
- We can, however, read the datasheet for the actual SDRAM chip – which gives us ideal CAS timings. The CAS latency is the time between the SDRAM controller issuing a read command and receiving the data back from the SDRAM. The CAS latency in the datasheet is only accurate for accesses within the same (already-open) row; an access that targets a different row must first activate that row, adding the RAS-to-CAS delay on top. This makes it important to utilize burst reads, so that most accesses hit the open row and achieve the nominal CAS latency.
In my notes, I make some assumptions about the timings introduced by Qsys interconnects and the SDRAM controller. See my notes below:
http://course.ece.cmu.edu/~ece500/projects/s21-teamc1/wp-content/uploads/sites/133/2021/03/Upper-Bound-on-SDRAM-Read.pdf
To summarize the findings:
- The CAS latency is 7 cycles on a 400 MHz clock (17.5 ns), which is less than one period of our 50 MHz clock (20 ns).
- The RAS-to-CAS latency is about one clock cycle on our 50 MHz clock.
- Up to 10 commands can sit in the SDRAM controller's command FIFO. If ours is serviced last (the worst case), we have to wait the equivalent of 10 RAS-to-CAS latencies plus 10 CAS latencies.
- I’ve assumed interconnect latencies adding up to 3 clock cycles.
- Our worst case is a single row-miss read (accessing a different row) queued behind 9 other row-miss reads. Together with the assumed interconnect latency, this adds up to 23 cycles of our 50 MHz clock.
- The actual timings can be made better by doing burst reads or pipelining reads.
- We will need to be careful about how much data we transfer. Transferring all of the PPU data over DRAM is infeasible. Transferring only the data needed by a scan line may be more feasible, but still difficult.
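The worst-case figure above is straightforward to reproduce as a back-of-the-envelope calculation. A minimal sketch follows; the FIFO depth, the 3-cycle interconnect estimate, and the rounding of each latency up to a whole 50 MHz cycle are assumptions from my notes, not figures from Intel documentation:

```python
import math

# All figures below are assumptions from my notes, not Intel documentation.
FPGA_PERIOD_NS = 1e9 / 50e6              # 50 MHz FPGA clock -> 20 ns period

cas_ns = 7 * (1e9 / 400e6)               # 7 cycles at 400 MHz -> 17.5 ns
cas_cycles = math.ceil(cas_ns / FPGA_PERIOD_NS)  # rounds up to 1 FPGA cycle
ras_to_cas_cycles = 1                    # ~1 cycle at 50 MHz (assumed)
fifo_depth = 10                          # command FIFO depth; ours serviced last
interconnect_cycles = 3                  # Qsys interconnect overhead (assumed)

# Each of the 10 queued reads is a row miss: activate (RAS-to-CAS) + CAS.
worst_case = fifo_depth * (ras_to_cas_cycles + cas_cycles) + interconnect_cycles
print(worst_case)  # 23
```

At 20 ns per cycle, those 23 cycles are 460 ns of worst-case read latency, which is the budget the PPU's HDMI timing has to absorb.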
On Saturday, I brought this information, along with a few PPU design ideas, to an internal “Design Review” with Andrew. We came up with an alternative design using M10K memory. Its main advantages over the original idea are less overall data transfer and a safe timing-failure condition: if the CPU somehow cannot finish its writes to the PPU’s VRAM before the frame must be rendered, the frame is dropped and the previous frame is displayed (essentially dropping to 30 FPS if this pattern continues).
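The frame-drop behavior can be sketched as a simple double-buffering scheme. The names and the done-flag handshake below are hypothetical illustrations of the idea; the real logic would live in the FPGA fabric, not software:

```python
# Sketch of the safe timing-failure scheme (hypothetical names):
# the CPU writes into a back buffer in M10K VRAM; at vsync, the PPU
# swaps buffers only if the CPU has finished its writes. Otherwise the
# previous frame is re-displayed (a dropped frame; ~30 FPS if repeated).
class FrameBuffers:
    def __init__(self):
        self.front = 0           # buffer index currently being displayed
        self.back = 1            # buffer index the CPU is writing
        self.cpu_done = False    # set by the CPU when its VRAM writes finish
        self.dropped_frames = 0

    def on_vsync(self):
        if self.cpu_done:
            # Safe to present the new frame: swap front and back buffers.
            self.front, self.back = self.back, self.front
            self.cpu_done = False
        else:
            # CPU missed the deadline: keep showing the old frame.
            self.dropped_frames += 1

fb = FrameBuffers()
fb.on_vsync()                    # CPU not done in time: frame dropped
fb.cpu_done = True
fb.on_vsync()                    # CPU done: buffers swap
print(fb.front, fb.dropped_frames)  # 1 1
```

The key property is that a missed deadline degrades the frame rate instead of corrupting the displayed image, which is why we consider this failure mode safe.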
My original goal for this week was to implement a tile engine prototype that accesses SDRAM for some tile data and displays it on screen. Unfortunately, while I have made progress toward a full PPU design, I have not implemented this yet, so I will have to complete the simple Tile Engine entirely next week. I am behind this week, but now that we’ve decided to move the PPU’s VRAM from SDRAM to M10K, the actual PPU design should be a little easier. I should be able to catch up (written design report time permitting) with the Tile Engine implementation by the end of next week.