Team Status Report for 04/10

This week’s tasks included APU implementation, PPU bug-fixing, and redesign of the PPU-CPU communication.

For context on the PPU-CPU communication redesign: a team design-review meeting was held last Saturday. During the design phase, Joseph and Andrew had differing ideas on how the PPU-CPU communication and the eventual PPU driver should work. Joseph wrote his design into the design-review report: the CPU sends video data to the PPU over the AXI bus, accessible via MMIO, following strict timings. Last weekend, however, the team decided to switch to Andrew’s design: the CPU sends an address to the PPU, which the PPU then uses to DMA-copy video data stored in SDRAM. The main benefits are less tightly coupled timings between the CPU and PPU, as well as more intuitive PPU software (with a lot of reusable code from the APU kernel module). Joseph will be implementing this next week, so some scheduling changes needed to be made.
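
As a rough illustration of the new flow, here is a minimal C model of Andrew’s design. All names and sizes are invented for the sketch, and the real transfer happens in hardware through the SDRAM controller rather than with memcpy — the point is only that the CPU hands over an address, not the data itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SDRAM_SIZE 4096
#define VRAM_SIZE  1024

/* Hypothetical model: shared SDRAM and the PPU's private VRAM. */
static uint8_t sdram[SDRAM_SIZE];
static uint8_t ppu_vram[VRAM_SIZE];

/* CPU side: stage a frame's video data in SDRAM, return its base address. */
static size_t cpu_stage_frame(const uint8_t *data, size_t len, size_t base)
{
    memcpy(&sdram[base], data, len);
    return base;  /* this address is all the CPU sends to the PPU */
}

/* PPU side: given only the base address, DMA-copy the data into VRAM. */
static void ppu_dma_copy(size_t base, size_t len)
{
    memcpy(ppu_vram, &sdram[base], len);
}
```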

Here are the scheduling changes and brief reasons for their occurrence:

  • Moved the task “Row Buffers, VRAMs, CPU->VRAM Interface”. Joseph needed to modify the design of the CPU-facing VRAM interface to be more compatible with the SDRAM. This task is mostly done, but needs to be verified along with the SDRAM DMA next week.
  • Added a new task for implementing PPU-SDRAM DMA. The goal is to accomplish this next week.
  • Joseph used 1 Slack Time for the Tile-Engine implementation. The Tile-Engine was difficult, and required setup and debugging of a few other interdependent systems.
  • Created a new APU bug task. Andrew will be working on this next week.

Andy’s Status Report for 04/10

The plan for this week had been for me to finish up the APU and then work on the sprite engine. Unfortunately, I hit a hard wall with communication between the APU and the CPU. Some progress has been made on that front: I’m now able to send a 1 kHz sine wave from a C program through the APU driver to the APU. However, corruption issues are preventing non-static data from being sent through.

Still, this does mean that the APU is fully written, just not fully debugged. The user-space library for the APU is complete and tested; sending signals to a user process as a kind of user-mode interrupt works great. The kernel module is mostly written, save for the aforementioned corruption issues. The hardware, after some intense debugging and scrutiny from both Joseph and myself, seems to be fully operational.

I suspect that the issue is with the kernel module, not the user-space library or test program. It seems likely that I haven’t set up the APU kernel buffer for DMA correctly, so that is where I will investigate next. The running theory is that our level 1 cache isn’t being flushed, so only some of the samples make it from the kernel buffer into the APU, which would explain the corruption.
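
To illustrate why a missed cache flush would look like corruption, here is a toy C model (not our driver code): CPU writes land in a write-back “cache”, while the DMA engine reads the backing “memory” directly, so any un-flushed line delivers stale samples:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE  8   /* bytes per cache line (toy value) */
#define LINES 4

static uint8_t memory[LINE * LINES];   /* what the DMA engine sees    */
static uint8_t cache[LINE * LINES];    /* what the CPU actually wrote */
static int     dirty[LINES];

static void cpu_write(size_t addr, uint8_t val)
{
    cache[addr] = val;
    dirty[addr / LINE] = 1;            /* line modified, not yet flushed */
}

static void flush_line(size_t line)
{
    if (dirty[line]) {
        memcpy(&memory[line * LINE], &cache[line * LINE], LINE);
        dirty[line] = 0;
    }
}

static uint8_t dma_read(size_t addr)
{
    return memory[addr];               /* bypasses the cache entirely */
}
```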

Thanks to the slack time we allocated for the end of this semester, there’s still a good chance that we’ll be able to get everything done on time. The plan for this coming week is for me to fix the APU and then deal with the sprite engine in any remaining time I have. After that, I’ll have a week to finish the sprite engine and two to work on the test game.

Joseph’s Status Report 4/3/21

Early this week I integrated Andrew’s I/O Subsystem into the main project. This was incorporated into a work-in-progress scrolling demo on Saturday.

The majority of my week was spent on implementing the Tile-Engine and Pixel-Mixer.

The Tile-Engine is almost finished; there are just a few bugs remaining in the mirror and scroll features. Andrew helped me find the source of some of these, and I will be implementing the fixes tomorrow (Sunday 4/4/21).

The Pixel-Mixer module currently only has a background Tile-Engine attached to it, along with a dummy Sprite-Engine and a dummy foreground Tile-Engine, which simply output transparent tiles.

On Saturday, I began to put together a hardware-only demo involving controller input and scrolling of the background layer as a Minimum-Viable Interim Demo. The background tile data, pattern data, and color-palette data are planned out using Tiled (https://www.mapeditor.org/), drawn using Piskel (https://www.piskelapp.com/), and then converted into .mif files via a few Python scripts I wrote.
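
The actual converters are Python, but the output step boils down to emitting Quartus .mif (Memory Initialization File) syntax. A hedged C equivalent for an 8-bit-wide memory might look like this (radices and width are choices, not requirements of the format):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Emit a Quartus .mif for an 8-bit-wide memory of the given depth. */
static void write_mif(FILE *f, const uint8_t *data, unsigned depth)
{
    fprintf(f, "DEPTH = %u;\n", depth);
    fprintf(f, "WIDTH = 8;\n");
    fprintf(f, "ADDRESS_RADIX = HEX;\n");
    fprintf(f, "DATA_RADIX = HEX;\n\n");
    fprintf(f, "CONTENT BEGIN\n");
    for (unsigned i = 0; i < depth; i++)
        fprintf(f, "    %X : %02X;\n", i, data[i]);
    fprintf(f, "END;\n");
}
```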

Next week I will primarily work on the video demo, fix the remaining bugs in the Tile-Engine, and implement either the DMA or the Sprite-Engine. Unfortunately, some of the time spent on the CPU-VRAM interface last week was misguided (my fault). The CPU interface I designed uses MMIO to write data to the PPU, which, after a short discussion with Andrew, I learned was terribly inefficient. I will be doing some research to see if I can reuse one of Intel/Altera’s DMA IP blocks to copy VRAM data from DRAM.

Team Status Report for 04/03/2021

We’re in a slightly better spot than we were last week, as we made a lot of progress this week, and we’re both optimistic about the future. Since we currently consider ourselves only roughly two days behind last week’s schedule, there are no official scheduling changes.

By the end of next week, we should have the controller and audio modules fully finished and most of a working hardware implementation for the PPU. A video and audio demo will also be prepared for the interim demo.

Andy’s Status Report for 04/03/2021

This week, I finished up the controller module (which now fully works) and made significant progress toward a finished audio implementation. Work on the sprite engine hasn’t begun yet, so we are still behind (arguably more so), but we’ve both increased the amount of time we’re spending on this project. All things considered, I’d still say this week went well and I’m optimistic that we’ll be able to catch up.

As I projected in my previous report, I was able to finish the controller kernel module over the weekend. The changes necessary to get the controller module working were largely uninteresting: I simply had to build a new Linux kernel and swap it in for the one the provided Linux image used. Then there were some bug fixes for the kernel module and the user-space library functions. All of this was completed over the weekend, and the controller library works as intended.

For audio, the actual device is so far fully described in Verilog and has undergone a reasonable amount of testing; I was able to play a 1 kHz sine wave using the APU. With the hardware itself working, my main focus now is bringing up the software side of the APU. The communication between the CPU and the APU is set up, but untested. Around half the kernel module is done, and the user-space portion of the APU library is specified but unwritten. All told, I’ve got ~400 lines of C code to look forward to, but I’m no stranger to that :). Hopefully I can get most of it done over the weekend and dedicate some real time to bringing up the sprite engine next week. I’ve got some grading work to do for OS over the weekend, though, so that will probably eat up quite a bit of my time, unfortunately.

I don’t have anything super interesting to share this week. Next week, I should have a video of a working audio demo (that isn’t just a test sine wave). For now, here are the Verilog files for the APU. Note that I wound up not using their premade I2S module because it was trash; I threw together something with a much nicer interface instead.

APU + I2S files: https://drive.google.com/file/d/14EN3CveBKY3m0okfWgywr5-sRoCuts16/view?usp=sharing


Joseph’s Status Report for 3/27/21

Since the 13th, I have made some major changes to the PPU task schedule. Due to some design decisions made in the design review report, and most notably our decision to replace our DRAM interface with an M10K VRAM, I have decided to combine the Tile-Engine implementation tasks into a single task, and add several new tasks relating to the PPU FSM.

Firstly, due to the new PPU FSM design, I was required to re-implement the HDMI Video Timing. This effectively combines the HDMI Video Output and Video Timing Generator blocks into one custom block. The reason was to gain better access to important signal timings such as DISPLAY and VBLANK, as well as to read and swap the new row buffer.

The row buffer currently initializes with a test pattern (shown in the image above). The new HDMI Video Output block reads this buffer for every row. The goal (due by Wednesday next week) is for two row buffers to be swapped in and out, with the PPU logic (Tile-Engine, Sprite-Engine, and Pixel-Mixer) filling the buffer not in use. The important progress made here is that the HDMI Video Output automatically fetches pixel data in the row, extracts from that pixel data the address of a color in the Palette RAM, and then reads from Palette RAM to obtain the final color. This sequence occurs 320 times per row, with each result held for an additional 25 MHz clock cycle to upscale to the 640px resolution.
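
The per-row sequence can be sketched in C as follows. Widths, palette size, and color format are placeholders, and the real logic is Verilog running at pixel rate; the sketch only shows the lookup-then-hold structure:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROW_PIXELS 320  /* row-buffer entries; doubled to 640 on output */

static void render_row(const uint8_t row_buf[ROW_PIXELS],
                       const uint32_t palette[256],
                       uint32_t out[2 * ROW_PIXELS])
{
    for (size_t x = 0; x < ROW_PIXELS; x++) {
        uint32_t color = palette[row_buf[x]];  /* Palette RAM lookup  */
        out[2 * x]     = color;                /* result held for an  */
        out[2 * x + 1] = color;                /* extra cycle: x2     */
    }
}
```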

The next task I completed was the Double-Buffered VRAM implementation. Our VRAM consists of two identical VRAMs: one used by the PPU during the DISPLAY period, and one the CPU can write to during the DISPLAY period. The VRAMs must be synchronized at the start of the BLANK period so that the CPU writes to a VRAM which accurately reflects the changes it has made since the previous frame. The Double-VRAM was implemented using a SystemVerilog interface to manage all 36 signals per RAM. The reason there are so many signals is that we use the native M10K dual-port RAM configuration, which doubles every port signal, and our VRAM is split into 4 segments (each with its own controls). The VRAM sync is implemented in a new block called the VRAM Sync Writer, which controls all ports of each dual-port VRAM in order to speed up the copying.
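
Functionally, the sync boils down to copying the CPU-side VRAM over the PPU-side VRAM during BLANK. A conceptual C model (sizes are placeholders, and the real Sync Writer copies through all RAM ports in parallel rather than sequentially):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VRAM_WORDS 1024  /* placeholder size */

static uint16_t vram_ppu[VRAM_WORDS];  /* read by PPU during DISPLAY    */
static uint16_t vram_cpu[VRAM_WORDS];  /* written by CPU during DISPLAY */

static void cpu_write_vram(size_t addr, uint16_t val)
{
    vram_cpu[addr] = val;
}

/* VRAM Sync Writer, modeled as a plain copy at the start of BLANK. */
static void vram_sync(void)
{
    memcpy(vram_ppu, vram_cpu, sizeof vram_ppu);
}
```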

Test-benches and simulations with RAM models provided by Altera were used to verify that the synchronization works. I instantiated RAM modules with pre-initialized data, sent the start signal to VRAM Sync Writer, and compared the resulting RAMs using ModelSim’s interface.

Lastly, I’ve implemented the PPU FSM and CPU write. No physical hardware tests have been done for the CPU write, but a dummy CPU module was used to write values to each VRAM location, and the results were confirmed via ModelSim. I’ll hopefully finish this after this report is due tonight and get started on the Tile-Engine tomorrow.

I am almost on track with the progress expected on the schedule. After communicating with Andrew, I’ve decided to push back the final PPU kernel module until after the interim demo and focus on a user-space (non-kernel-module) software driver instead.

Team Status Report for 03/27/2021

We are behind, lord help us, we are behind.


Since our last status report on the 13th, we have completed a significant portion of the video implementation, with only the Pixel-Engines (Tile and Sprite Engine) and Software PPU-Driver remaining. More details are included in Joseph’s status report for video. We are running into difficulties building Kernel Drivers needed for Input (and later PPU and APU). Andrew discusses these details in his status report.


We are planning on meeting Wednesday 31st to decide on the allocation of critical tasks (Sprite Engine and Remaining Audio Implementation). In the time between this status report and Wednesday, Joseph will attempt to finish the Tile-Engine, and Andrew will attempt to finish the Kernel Module for Input and the Audio Implementation.


Our new schedule can be found in the following link:

http://course.ece.cmu.edu/~ece500/projects/s21-teamc1/wp-content/uploads/sites/133/2021/03/Project-Schedule-Gantt-Chart-3-27-2021.pdf


The scheduling changes are summarized below:

  • Pixel-Engine tasks were combined and pushed back a week. In their place is a PPU FSM implementation task. These changes were necessary given the additions to the design made in the Design Review Report.
  • Pushed back PPU Driver to after the Interim Demo. Joseph is going to be focused on implementing a user-space video demo instead.
  • A new Video Demo task has been added to clarify what Joseph will be doing during the week of the Interim Demo.
  • Pushed Audio Implementation to the week of the 29th. Andrew will be attempting to complete this before the Wednesday lab.

Andy’s Status Report for 03/27/2021

Over the past few weeks, I’ve focused on understanding and implementing a kernel driver for our system. In our original schedule, this task was supposed to take around a week. That has turned out not to be the case, due both to my running a bit behind and to complications with the implementation of the kernel driver.


What I do have is a user-space front end to the controller kernel driver and a kernel-space driver file that has not been compiled yet. Theoretically, this should work fine, but I have been unable to compile the kernel module due to complications with building against the pre-built kernel provided by Terasic. As far as I can tell, they do not provide the tools necessary to build against the provided kernel, so a new one must be built from scratch and supplied to our board. I’ve begun this process, and hope to have the kernel built and booting by the end of today. If all goes well, I’ll be able to jump straight into testing the controller module tonight/tomorrow.
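
For context, an out-of-tree module build needs the configured source tree the target kernel was built from, which is exactly what the prebuilt image lacks. A typical module Makefile looks like the following (module name and tree path are hypothetical; ARCH and CROSS_COMPILE match our arm-none-linux-gnueabihf toolchain):

```makefile
# Standard out-of-tree kernel module Makefile. KDIR must point at the
# configured source tree the target kernel was built from.
obj-m += controller.o

KDIR ?= $(HOME)/linux-socfpga

all:
	$(MAKE) -C $(KDIR) M=$(PWD) ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabihf- modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean
```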


Due to the excessive and frustrating amount of time it has taken to write my first kernel module, audio has been pushed back to the body of this coming week. I don’t anticipate audio taking much time, as Joseph has a firm understanding of what would be the hardest part (communication with DDR3 and the CPU). Aside from this, it will be a relatively simple FSM that reads from memory and sends data to I2S. I believe the driver will be simple as well, considering I’ve learned some useful tools while reading up for the controller driver (e.g. I can create a device file and arrange it so that the write system call sends samples to the kernel driver).
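
The write-system-call approach would make the user-space side very small. A hedged sketch of what that could look like — the device path is hypothetical, and O_CREAT is only there so the sketch also runs against a regular file:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Send n 16-bit samples to the (hypothetical) APU device file with one
 * write(); the kernel driver would forward them to the hardware. */
static ssize_t send_samples(const char *dev_path, const int16_t *buf, size_t n)
{
    int fd = open(dev_path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, buf, n * sizeof buf[0]);
    close(fd);
    return written;
}
```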


Once audio has been finished, I’ll be working on the sprite engine.


Drafts of the user space and kernel space implementation of the controller driver are available here:

https://drive.google.com/file/d/1LZ3EGkWE5TbSmbO2-qg8U7c6oalinqTu/view?usp=sharing

Joseph’s Status Report for 3/13/21

After feedback on our design review presentation on Monday, it was decided that I should look into an upper-bound access time on SDRAM from the FPGA. For context:

  • The CPU uses an SDRAM controller to schedule and arrange simultaneous SDRAM requests. Since there are multiple input read/write command ports (from the FPGA and CPU) and only a single output read/write command port (to SDRAM), the SDRAM is a contested resource.
  • Since the SDRAM is a contested resource and the order of requests is essentially non-deterministic, we must assume the worst-case access time for our FPGA so we can design our hardware to meet HDMI timing constraints.
  • Unfortunately, few details on the SDRAM controller IP are provided by Intel. This means some assumptions have to be made regarding the SDRAM controller’s internal delays.
  • We can, however, read the datasheet on the actual SDRAM chip – which gives us ideal CAS timings. The CAS latency is the time between a read command being sent by the SDRAM controller and the data being received by the SDRAM controller from the SDRAM. The CAS latency provided by the datasheet is only accurate for accesses in the same row. Actual latency increases if memory accesses are far enough from each other. This makes it important to utilize burst reads to achieve the nominal CAS latency.

In my notes, I make some assumptions about the timings introduced by Qsys interconnects and the SDRAM controller. See my notes below:
http://course.ece.cmu.edu/~ece500/projects/s21-teamc1/wp-content/uploads/sites/133/2021/03/Upper-Bound-on-SDRAM-Read.pdf

To summarize the findings:

  • The CAS latency is 7 cycles on a 400 MHz clock. This is less than one clock cycle on our 50 MHz clock.
  • The RAS-to-CAS latency is about one clock cycle on our 50 MHz clock.
  • 10 commands can exist in the command FIFO in the SDRAM controller. Assuming ours is picked last (the worst case), we have to wait the equivalent of 10 RAS-to-CAS latencies + 10 CAS latencies.
  • I’ve assumed interconnect latencies adding up to 3 clock cycles.
  • A single row-miss read (accessing a different row), queued behind 9 other row-miss reads, is our worst case. This adds up to a latency of 23 clock cycles at 50 MHz.
  • The actual timings can be made better by doing burst reads or pipelining reads.
  • We will need to be careful about how much data we transfer. Transferring all of the PPU data over DRAM is infeasible. Transferring only the data needed by a scan line may be more feasible, but still difficult.
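
The arithmetic behind the 23-cycle figure can be written out directly; all inputs are the assumptions listed above (each RAS-to-CAS and each CAS delay rounded up to one 50 MHz cycle):

```c
#include <assert.h>

/* Worst-case SDRAM read latency from the FPGA, in 50 MHz cycles. */
static int worst_case_read_cycles(void)
{
    int fifo_depth        = 10; /* our command picked last of 10      */
    int ras_to_cas_cycles = 1;  /* ~1 cycle at 50 MHz                 */
    int cas_cycles        = 1;  /* 7 cycles @ 400 MHz = 17.5 ns, so
                                   rounded up to one 20 ns cycle      */
    int interconnect      = 3;  /* assumed Qsys interconnect latency  */

    return fifo_depth * (ras_to_cas_cycles + cas_cycles) + interconnect;
}
```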

On Saturday, I brought this information along with a few PPU design ideas to an internal “Design Review” with Andrew. We came up with an alternative design using M10K memory. The main advantages over the original idea are less overall data transfer and a safe timing-failure condition: if the CPU somehow cannot finish its writes to the PPU’s VRAM before the frame must be rendered, the frame is dropped and the previous frame is displayed (essentially dropping to 30 FPS if this pattern continues).
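
The frame-drop fallback is simple to state in code. A toy model of the commit decision (not the actual FSM, which tracks this in hardware):

```c
#include <assert.h>
#include <stdbool.h>

/* If the CPU finished its VRAM writes in time, commit the pending frame;
 * otherwise keep showing the previous frame (worst case: 30 FPS). */
static int current_frame = 0;   /* last fully committed frame id */

static int render(bool cpu_writes_done, int pending_frame)
{
    if (cpu_writes_done)
        current_frame = pending_frame;  /* commit the new frame  */
    return current_frame;               /* else: previous frame  */
}
```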

My original goal for this week was to implement a tile-engine prototype which accesses SDRAM for some tile data and displays it on the screen. Unfortunately, while I have made progress toward a full PPU design, I have not implemented this yet, which means I will have to complete the simple Tile-Engine next week. I am behind this week, but now that we’ve decided to move the PPU’s VRAM from SDRAM to M10K, the actual PPU design should be a little easier. I should be able to catch up (written design report time permitting) with the Tile-Engine implementation by the end of next week.

Team Status Report for 3/13/2021

This week was system bring-up week. Our goal was to get Controller Input, HDMI Video, and Linux running and communicating with each other. Additionally, this would include setup of our development environment, including Quartus, Platform Designer (Qsys), and the arm-none-linux-gnueabihf toolchain.


Of these goals, only Controller-Input and Linux can talk to each other. HDMI Video was tested via a demo two weeks ago, but the PPU and System-Interconnect design itself isn’t finalized yet, so Linux cannot control the PPU and HDMI output yet. Specifically, this requires DRAM-fetch (Joseph’s task for this week) and the Tile-Engine (Joseph’s task for next week) to be completed first.


The results of Joseph’s DRAM latency research led us to conclude that our original idea of using DRAM as our VRAM was infeasible. The risk of the PPU repeatedly missing pixels turned out to be much larger than we had anticipated. As such, we have slightly reorganized our internal design for the PPU in a way that won’t require us to change the MMIO interface but will still allow us to use the vast majority of our design as we originally specified it. Under the new design, the memory locations specified by MMIO will instead be transferred to an internal VRAM buffer at the beginning of each frame. The VRAM will be implemented in M10K, which we can access once per cycle. At the start of each VBLANK, the VRAM buffer will be committed to actual VRAM, which the PPU will then render from.


There are no schedule changes this week.