Status Report (3/24 – 3/30)

Team

Changes to schedule:

Ilan is extending his memory interface work by at least 1 week.

Brandon is extending his work on sending the video stream over sockets by at least 1 week.

Edric is pushing back the edge detection implementation by at least 1 week.

Major project changes:

Our Wi-Fi quality issues have posed a problem that we intend to temporarily circumvent by lowering the video stream bitrate. Once we have more of the project’s functionality working, we’ll revisit the Wi-Fi quality issues so we can raise the bitrate again.

On the compute side, we have essentially decided to move forward with Vivado’s HLS tool.

Brandon

For the sixth week of work on the project, I was able to successfully boot up and configure the Pis! It took a decent amount of extra work outside of lab, but after receiving the correct USB cables, I was able to boot into the Pis and connect to the CMU DEVICE network. To simplify usage, we’re navigating the Pis with a monitor over mini HDMI rather than SSHing in. After finishing the initial setup, I moved on to camera functionality and networking over UDP. I was able to display a video feed from the camera and convert a video frame to an RGB array and then to a grayscale array, but I ran into some issues when I began working on the networking portion of the project. The biggest issue is that we are achieving significantly lower bandwidth than expected for both the Pis (~5 Mbps) and the ARM core (~20 Mbps). Thus, we decided to revert to my original plan of using H.264 compression so the stream fits within the Pis’ available bandwidth. Unfortunately, we haven’t yet been able to send the video over the network using UDP, but we plan on working through the weekend to hopefully be ready for our interim demo on Monday.
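As a rough illustration of that plan (not our actual code), a minimal forwarder could read the H.264 bytes the Pi camera encoder writes to stdout and push them out in datagram-sized chunks; the receiver address, port, and chunk size below are placeholder assumptions:

    // Minimal sketch: read H.264 bytes piped from the Pi camera encoder on stdin
    // and forward them over UDP in MTU-sized chunks. Address, port, and chunk
    // size are placeholders, not our final values.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main() {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        sockaddr_in dst{};
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5000);                         // placeholder port
        inet_pton(AF_INET, "192.168.0.10", &dst.sin_addr);    // placeholder server IP

        // Example producer:  raspivid -t 0 -w 1280 -h 720 -b 1000000 -o - | ./udp_send
        unsigned char buf[1400];                              // stay under a typical MTU
        ssize_t n;
        while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
            sendto(sock, buf, n, 0,
                   reinterpret_cast<sockaddr*>(&dst), sizeof(dst));
        }

        close(sock);
        return 0;
    }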

The Pi setup completion was a big step in the right direction in terms of our scheduling, but this new bandwidth issue that’s preventing us from sending the video is worrying. However, if we’re able to successfully send the video stream over the network by the demo, we will be right on schedule, if not ahead.

Ilan

Personal accomplishments this week:

  • Did some testing of Wi-Fi on ARM core
    • Had to configure Wi-Fi to re-enable on boot, since it kept turning off. Also saw some slowness and freezing over SSH, which is a concern once we start using the ARM core for more intense processing.
    • Found that current bandwidth is ~20 Mbps, which is too low for what we need. Initially we’re going to lower the bitrate as a temporary workaround so we can keep moving forward; later we’ll try changing the driver, looking into other tweaks, or possibly ordering an antenna to get better performance.
  • Continued to work on the memory interface, but wasn’t able to finalize the full setup. I’m going to work on this more tomorrow (3/31) to have something for the demo, but starting Wednesday I focused on helping Brandon and Edric so we have more tangible and visual results for our demo.
    • Brandon and I worked on getting the Pis up and running, and I helped him out with some of the initial camera setup. I also looked into how to get a lower bitrate out of the camera (so we can still send video over Wi-Fi) and how to pipe it into UDP connections in Python, and set him up with a starting point for both.
    • I helped Edric set up HLS and get started on the actual implementation of the Gaussian filter. We were able to get an implementation working, and Edric is going to do more tweaking to improve performance. Tomorrow (3/31), he and I are going to try to connect the Gaussian and intensity gradient blocks (which we plan to implement beforehand), and then I’ll continue working on the memory interface.
    • The memory interface’s PL-side input is defined by the final input requirements of Edric’s Gaussian filter, so my work will change a bit; that’s why I’ve reprioritized helping him finalize it first.

Progress on schedule:

  • I’m a little behind where I would like to be, and the Wi-Fi issues we’ve experienced on both the ARM core and the Pis have been a bit of a setback. My goal for the second half of the week was to help Brandon and Edric so we can have more of the functional part of the system ready for the demo. I’ll likely be extending my schedule by at least 1 week to finalize the memory interface between the PS and PL.

Deliverables next week:

  • Memory interface prototype using unit test to verify functionality.

 

Edric

After researching different possibilities for implementing the Canny algorithm, I’ve decided to go forward with Vivado’s High-Level Synthesis (HLS) tools. The motivation for this decision is that while the initial stages (simple 2D convolutions for the Gaussian filter) aren’t particularly demanding to write in raw Verilog, the later steps involving trigonometry will prove to be more complex. HLS will allow us to keep the actual algorithm code simple, yet customizable enough via HLS’s pragmas.

So far I have an implementation of the Gaussian blur which both simulates and synthesizes to a Zynq IP block. Preliminary analysis shows that the latency is quite high, but DSP slice usage is quite minimal. More tweaking will have to be done to lower the latency; however, since current testing is done on 1080p images, moving down to the target 720p should by itself make up the majority of the needed speedup.
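For reference, a stripped-down sketch of what this kind of kernel looks like in HLS C++ is below. The 3×3 kernel, the fixed 720p frame size, and the pragma choices are illustrative assumptions, not the actual implementation:

    // Illustrative line-buffered Gaussian blur in HLS-style C++ (a sketch, not
    // our real kernel). Frame size, kernel, and pragmas are placeholder choices.
    #define WIDTH  1280
    #define HEIGHT 720

    // 3x3 Gaussian kernel; coefficients sum to 16 so the divide is a shift.
    static const int COEFF[3][3] = {
        {1, 2, 1},
        {2, 4, 2},
        {1, 2, 1}
    };

    void gaussian_blur(const unsigned char in[HEIGHT][WIDTH],
                       unsigned char out[HEIGHT][WIDTH]) {
    #pragma HLS INTERFACE ap_memory port=in
    #pragma HLS INTERFACE ap_memory port=out

        // Two line buffers hold the previous rows so each pixel is read only once.
        unsigned char line_buf[2][WIDTH];
    #pragma HLS ARRAY_PARTITION variable=line_buf complete dim=1
        unsigned char window[3][3];
    #pragma HLS ARRAY_PARTITION variable=window complete dim=0

        for (int row = 0; row < HEIGHT; row++) {
            for (int col = 0; col < WIDTH; col++) {
    #pragma HLS PIPELINE II=1
                unsigned char px = in[row][col];

                // Shift the 3x3 window left by one column...
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 2; j++)
                        window[i][j] = window[i][j + 1];
                // ...and append the new column from the line buffers plus the new pixel.
                window[0][2] = line_buf[0][col];
                window[1][2] = line_buf[1][col];
                window[2][2] = px;

                // Update the line buffers for the next row.
                line_buf[0][col] = line_buf[1][col];
                line_buf[1][col] = px;

                // Multiply-accumulate over the window. The value written at
                // (row, col) is centered at (row-1, col-1); borders are passed
                // through unfiltered for brevity.
                int acc = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        acc += COEFF[i][j] * window[i][j];

                out[row][col] = (row < 2 || col < 2)
                                    ? px
                                    : (unsigned char)(acc >> 4);
            }
        }
    }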

For the demo, I aim to implement the next two stages of Canny (applying the Sobel filter in both the X and Y directions, then combining the two). Along with this, I’d like to see if I can get a software benchmark to compare the HLS output against (ideally something built with a library like OpenCV). Thankfully, using HLS gives us access to a simulator which we can use to compare images.
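As a sketch of what that software reference might look like (hypothetical file names, and OpenCV’s C++ API as just one possible choice), the first two stages reduce to a few library calls:

    // Hypothetical OpenCV reference for the first Canny stages, to diff against
    // the HLS C-simulation output. File names are placeholders.
    #include <opencv2/opencv.hpp>

    int main() {
        cv::Mat gray = cv::imread("frame_720p.png", cv::IMREAD_GRAYSCALE);
        if (gray.empty()) return 1;

        // Stage 1: Gaussian smoothing (same kernel size the HLS block uses).
        cv::Mat blurred;
        cv::GaussianBlur(gray, blurred, cv::Size(3, 3), 0);

        // Stage 2: intensity gradient via Sobel in X and Y, then magnitude.
        cv::Mat gx, gy, mag;
        cv::Sobel(blurred, gx, CV_32F, 1, 0);
        cv::Sobel(blurred, gy, CV_32F, 0, 1);
        cv::magnitude(gx, gy, mag);

        cv::Mat mag8u;
        mag.convertTo(mag8u, CV_8U);
        cv::imwrite("gradient_reference.png", mag8u);

        // The HLS C-simulation output can then be compared pixel-by-pixel
        // (e.g. with cv::absdiff) against gradient_reference.png.
        return 0;
    }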

I’m a little behind with regards to the actual implementation of Canny, but now that HLS is (kind of) working, the implementation in terms of code should be fairly straightforward. The difficult part will be configuring the pragmas to get the compute blocks to meet our requirements.

Status Report (2/24-3/2)

Team Report

Changes to schedule:

We’re catching up and, for the most part, maintaining the pace that we set over the past few weeks. We had already accounted for a reasonable amount of time spent this week on the design reviews, so we don’t have any current changes to our schedule.

 

Major project changes:

At this point we don’t have any major project changes since we’ve just finalized our project design for the most part. We still have some concerns about the DSP pipeline mapping correctly onto the DSP slices, and that’s something we’ll keep in mind and re-evaluate after implementing the first stage of the pipeline.

 

Brandon:

2/24-3/2

For the third week of work on the project, we mainly focused on the design aspect, as we had to give a design presentation as well as write a design document. Since I was presenting, I mostly had to focus on that process rather than spend a lot of time working on the actual project, so I didn’t make as much progress as I was hoping to make this week on video streaming functionality. However, I was able to get OpenCV working, so I’m now at about 50% completion on the video streaming tests to run before we get the actual hardware. Speaking of the hardware, I also submitted the order form for three Raspberry Pi W with Camera Packs (see below), which we will be able to start working with once we receive them. Some technical challenges I was able to overcome included some weird UDP behavior across multiple machines and simply installing and working with OpenCV. The steps I took to accomplish this were, again, a lot of online research and various rounds of testing.

I’m still behind schedule, since I devoted most of my time this week to the design aspect of the class, but I should be okay: I’m planning on staying in Pittsburgh over spring break, and since I currently don’t have anything scheduled on the Gantt chart for that time, it’ll be an opportunity to catch up on anything I don’t finish. The deliverable I hope to achieve this next week is still getting video streaming/sending functionality working completely.

Ilan:

Personal accomplishments this week:

  • Started working on bringing up FPGA and ARM core. Still working on finalizing infrastructure and toolchain so it works for all 3 of us.
    • Had to work through temporary obstacle of powering board since we didn’t have a power supply, so we ordered one for ourselves as well as one for another team that wanted one.
    • Future work involves finishing bring-up, pushing infrastructure up to GitHub, and documenting toolchain for Brandon and Edric.
  • Continued researching the steps of Canny edge detection in more depth with Edric to prepare for the design review, but we weren’t able to finalize the DSP slice allocation for each stage. This was brought up as a flaw in our design review documentation, so we put some time toward it during the second half of the week, and we will be finalizing that, as well as (hopefully) a clock frequency target for the design, to include in our design report. We’re still trying to work through the algorithm and improve our understanding, which has been a bit of a challenge.
    • Future work will be finalizing the DSP slice allocation and determining target clock frequency.
  • No progress yet on implementing interface functionality, but that’s scheduled for the upcoming 2 or so weeks, so that’s fine.

Progress on schedule:

  • Edric and I continued to make progress on understanding the algorithm and designing the pipeline. We’ll be finalizing this over the rest of the weekend and the implementation will start over the next week or so.

Deliverables next week:

  • Finish enabling ARM core and FPGA functionality, push the toolchain up to GitHub, and document the setup.
  • Finalize the DSP slice allocation and determine a target clock frequency.

Status Report 2/17-2/23

Team report:

Changes to schedule:

We’re slightly behind on the hardware side of things because we only acquired our board late this week, but our design has been simplified by using an ARM core instead of a MicroBlaze and by the Wi-Fi module being supported by the board rather than by us. This should cut some time out of interface bring-up and allow Ilan to help Edric a bit more with the pipeline design and implementation details.

As for the software side, Brandon’s slowly catching up after being very behind last week. He’s added the benchmarking to the Gantt chart for this week along with starting video frame/feed functionality.

Major project changes:

Possibility of full edge detection not being implemented in hardware due to the limited number of DSP slices on the board. This would most likely mean we’d implement the first few stages in hardware, which are also the more computationally intense ones, and then move the data back to software. Once we implement the first stage and see how many DSP slices are actually used (as opposed to our theoretical calculations), we’ll know whether this change will happen or not.

 

Brandon:

For the second week of work on the project, we were able to clarify our project significantly from the prior week. We’ve settled on a security system implementation with Canny edge detection, which means that most of our communication protocol design will stay the same. Thus, I was able to spend my time this week actually working on relevant code for the UDP protocol. I drafted a test server/client to ensure that I could get basic UDP functionality working, which I was able to, as shown in the pictures below. Some technical challenges I was able to overcome include bugs in the server code involving some MSG flags I wasn’t setting properly, along with some sizing issues with the char array that I was trying to send. The steps I took to accomplish this were a lot of online research and talking with some peers. Once I got this working, I was supposed to work on reconfiguring my code to accommodate video streams, but instead, since we have our design presentation this next week, I’m currently trying to benchmark some latency numbers for sending a 1280×720 video frame, so I’m designing a chunking/packing algorithm and trying to time it.
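A rough sketch of that chunking/timing experiment is below; the loopback address, port, chunk size, and grayscale frame are placeholder assumptions rather than the actual benchmark code:

    // Sketch: split one 1280x720 (grayscale, placeholder) frame into UDP-sized
    // chunks and time the send loop. Address, port, and chunk size are placeholders.
    #include <algorithm>
    #include <arpa/inet.h>
    #include <chrono>
    #include <cstdio>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <vector>

    int main() {
        constexpr size_t FRAME_BYTES = 1280 * 720;   // one grayscale frame
        constexpr size_t CHUNK_BYTES = 1400;         // stay under a typical MTU
        std::vector<unsigned char> frame(FRAME_BYTES, 0x7f);

        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        sockaddr_in dst{};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(9000);                      // placeholder port
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);  // placeholder receiver

        auto start = std::chrono::steady_clock::now();
        for (size_t off = 0; off < FRAME_BYTES; off += CHUNK_BYTES) {
            size_t len = std::min(CHUNK_BYTES, FRAME_BYTES - off);
            sendto(sock, frame.data() + off, len, 0,
                   reinterpret_cast<sockaddr*>(&dst), sizeof(dst));
        }
        auto end = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::printf("sent %zu bytes in %lld us\n", FRAME_BYTES, us);

        close(sock);
        return 0;
    }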

With this new task, I am now slightly behind schedule, but not as much as I was last week. I’ve caught up on the UDP functionality, but haven’t started video streaming functionality yet, which I was supposed to do this week. In order to catch up, I’m going to try to finish both benchmarking and start video streaming functionality this week. These are the deliverables I hope to complete in the next week.

 

Ilan

Personal accomplishments this week:

  • Decided on and acquired FPGA hardware. We’ll be using the Ultra96 with an ARM core and FPGA.
    • No significant challenges here
    • Steps mainly involved narrowing down between the Zynq and the Ultra96, since both provide an ARM core, but the Zynq was unavailable
    • Future work involves board bring-up and interface development
  • Researched the steps of Canny edge detection in depth with Edric and determined that it may not be computationally feasible to fully compute 2 simultaneous 720p streams on the FPGA due to the limited number of DSP slices (360 slices total, so if 2 separate compute blocks were used that would mean 180 slices/stream), or that it may be quite tight against the 100 ms latency target. Back-of-the-envelope math (see the worked numbers after this list) shows it would take ~27 ms to do 3 convolutions (just in the first 2 of the 5 stages of the edge detection algorithm) in a pixel-by-pixel fashion, with each convolution using 9 DSP slices. If we want a pipelined design, each DSP slice will be dedicated to a single stage, so that alone allocates 27 of the 180 slices for a single stream. This is something we’ll nail down once we implement the Gaussian filter, since that will require a convolution and will heavily inform how we implement the intensity gradient calculation (another 2 convolutions). At that point, we’ll have a definite answer as to under what timing conditions we can fit the whole algorithm in the FPGA fabric.
    • Technical challenges met were lack of familiarity with the algorithm, and some gaps in understanding specific stages
    • Steps involved focusing first on the initial 2 stages, since these seem to account for a significant portion of the algorithm’s computation time. We broke down the computation performed by a convolution in terms of the DSP slices on the FPGA and conservatively determined how we would use the DSP slices to estimate frequency, computation time, etc.
    • Future work will come when Edric implements the first stage and we see how the convolution actually consumes DSP slices
  • Finalized the interfacing design between the ARM core and FPGA with Edric. We discussed all of the interfaces we’ll need and how they will allow the computation to be offloaded from the ARM core to the FPGA and read back once the computation has finished. We’ll section off a portion of DRAM for each stream, and use GPIO pins between the ARM core and the PL to communicate status and control signals for the edge detection start/end. Since computing a matrix convolution efficiently means not overwriting the current data, we came up with 2 strategies for moving data between stages of the edge detection pipeline. Our main strategy is to allocate separate chunks in DRAM for each stage so we can pipeline the design; this incurs more memory overhead, but based on our calculations it is feasible.
    • No significant technical challenges here
    • Steps were determining which interfaces we could use and which best suited the application
    • Future work will be for me to implement these interfaces
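To make the back-of-the-envelope numbers above concrete (assuming a 100 MHz fabric clock purely as a placeholder, since we haven’t fixed a frequency target yet): a 1280×720 frame is 921,600 pixels, so at one pixel per cycle a single 3×3 convolution takes roughly 9.2 ms, and 3 convolutions done back-to-back take ~27.6 ms; at 9 DSP slices per convolution, those 3 convolutions consume 27 of the 180 slices budgeted per stream.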

Progress on schedule:

  • Edric and I made good progress on the edge detection pipeline design and interfacing, which is approximately on schedule.
  • We didn’t have the power cable for the Ultra96 and couldn’t find a matching one in Hamerschlag, so I couldn’t do any quick testing of the board, which leaves me slightly behind where I wanted to be. However, our previous schedule and architecture were based on using a MicroBlaze core, and after adjusting for our finalized board decision, the schedule hasn’t been affected.

Deliverables next week:

  • Enabling basic ARM core and FPGA functionality
    • Unblocks Brandon’s development and testing on the server-side
    • Unblocks Edric to start flashing bitstreams onto the FPGA (not necessary for a while though; most designs will be simulated and only synthesized for timing and resource usage)
    • No expected risk/challenge here; these tasks mainly focus on getting very basic functionality working and making sure everything is usable and set up for later, when things become more fast-paced

Edric

This week, because we managed to decide on and get ahold of our FPGA, we could begin making some estimates. On the hardware end, no code has been written yet, but we’ve managed to flesh out a few aspects of our design:

  • Data coming from video streams will be placed in DRAM by the ARM core at a specified address
    • DRAM address space is split into segments, where each stream is allocated a chunk of the space
  • Once in memory, the ARM core will communicate to the fabric (compute blocks) that there is a frame ready for processing
    • Simple producer-consumer protocol: the core will ping the fabric that a frame is ready, along with the address where the frame starts
  • When the frame is processed (and put back into DRAM), then the fabric will ping the ARM core that it is ready
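A rough ARM-side sketch of this handshake is below; the physical address, the grayscale frame size, and the GPIO helpers are placeholder assumptions, not a finalized memory map or control path:

    // ARM-side sketch of the producer-consumer handshake described above.
    // The reserved DRAM address, frame size, and GPIO helpers are placeholders.
    #include <cstdint>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    constexpr off_t  FRAME_BUF_PHYS = 0x40000000;   // assumed reserved DRAM chunk
    constexpr size_t FRAME_BYTES    = 1280 * 720;   // one 720p grayscale frame (placeholder)

    int main() {
        // Map the reserved DRAM region the fabric will read from.
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) return 1;
        uint8_t* frame = static_cast<uint8_t*>(
            mmap(nullptr, FRAME_BYTES, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, FRAME_BUF_PHYS));
        if (frame == MAP_FAILED) return 1;

        // 1. Producer side: copy the incoming frame into the shared region
        //    (in the real system this comes from the network receive path).
        std::memset(frame, 0, FRAME_BYTES);

        // 2. Signal "frame ready" to the PL, e.g. via an AXI GPIO register
        //    (hypothetical helper -- the real control path is still being designed).
        // set_gpio(FRAME_READY_PIN, 1);

        // 3. Wait for the fabric's "frame done" signal, then read the results
        //    back from that stage's output chunk.
        // while (!get_gpio(FRAME_DONE_PIN)) { usleep(100); }

        munmap(frame, FRAME_BYTES);
        close(fd);
        return 0;
    }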

Regarding the Canny algorithm itself, we’ve come across a few issues with the actual implementation. It seems we’ll need to do more work to understand exactly which operations are necessary, although focusing on the Gaussian filter for the time being seems reasonable.

We have, however, designed the basic flow of a frame being processed (and how each step translates to its Canny algorithm step). This can be illustrated with the following diagram:

Each block represents a chunk of DRAM holding a copy of the frame at that step. Unfortunately we can’t really edit the frame in place, so this is how we’ll do it (for now).

Some foreseen challenges:

  • Still need to figure out how the Canny algorithm works
  • When there is both a frame done and a frame pending for processing, we’ll need to figure out a way to prevent deadlock between the producer (FPGA fabric) and consumer (ARM core)
    • Perhaps a FIFO is enough. Will need to give it more thought.

For the most part, we’re a bit behind schedule, but definitely in a better place than last week. The next steps are to get flashing lights on the Ultra96, look into the Canny algorithm more, and perhaps solidify our testing suite for examining the output from the FPGA.