fpgaConvNet: CNN Model to Vivado IP Example¶

This tutorial demonstrates the end-to-end flow from an ONNX model to generated hardware. The LeNet network is used as an example. Let's start by looking at the baseline architecture through the lens of fpgaconvnet-model, the modelling part of the project.

In [ ]:
from IPython.display import Image
from fpgaconvnet.models.network import Network

# load network
net = Network("single_layer", "models/single_layer.onnx")
net.visualise("baseline.png", mode="png")

# display to jupyter notebook
im = Image('baseline.png')
display(im)

The baseline network is predicted to have poor performance, as there is no parallelism amongst the modules. We can see the expected performance for a ZedBoard.

In [ ]:
# load the zedboard platform details
net.update_platform("platforms/zedboard.json")

# show latency, throughput and resource predictions
print(f"predicted latency (us): {net.get_latency()*1000000}")
print(f"predicted throughput (img/s): {net.get_throughput()} (batch size={net.batch_size})")
print(f"predicted resource usage: {net.partitions[0].get_resource_usage()}")

To increase performance, the SAMO design-space exploration tool can be used. This tool generalises the CNN-to-FPGA mapping problem for streaming architectures. Its command-line interface can be used to optimise a design; here we use the rule-based optimiser with a latency objective.

In [ ]:
!mkdir -p outputs
import samo.cli

# invoking the CLI from python
samo.cli.main([
    "--model", "models/single_layer.onnx",
    "--platform", "platforms/zedboard.json",
    "--output-path", "outputs/single_layer_opt.json",
    "--backend", "fpgaconvnet",
    "--optimiser", "rule",
    "--objective", "latency"
])
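
The optimiser writes the chosen configuration to outputs/single_layer_opt.json. As a quick sanity check, we can peek at the file with plain json; the top-level structure is whatever SAMO's fpgaconvnet backend emits, so no particular keys are assumed here:

In [ ]:
import json

# inspect the optimised configuration produced by SAMO
with open("outputs/single_layer_opt.json", "r") as f:
    config = json.load(f)
print(list(config.keys()))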

We can see the performance and resource predictions for the optimised configuration.

In [ ]:
# load the optimised network
net.load_network("outputs/single_layer_opt.json")
net.update_partitions()

# print the performance and resource predictions
print(f"predicted latency (us): {net.get_latency()*1000000}")
print(f"predicted throughput (img/s): {net.get_throughput()} (batch size={net.batch_size})")
print(f"predicted resource usage: {net.partitions[0].get_resource_usage()}")

We can also see a visualisation of the optimised model. It's clear how SAMO exploits parallelism to generate higher-performance designs.

In [ ]:
# create network visualisation
net.visualise("optimised.png", mode="png")

# display to jupyter notebook
im = Image('optimised.png')
display(im)

Once we are happy with the design, we can generate the hardware using the fpgaconvnet-hls project. This part of the tool acts as a backend to the modelling, implementing the hardware building blocks in HLS. To start, we instantiate the HLS code generator for this hardware configuration and generate all of the HLS hardware.

In [ ]:
from fpgaconvnet.hls.generate.network import GenerateNetwork

# create instance of the network
gen_net = GenerateNetwork("single_layer", "outputs/single_layer_opt.json", "models/single_layer.onnx")

# generate hardware and create HLS project
gen_net.create_partition_project(0)

For testing the hardware, we will need a test input as well as a golden output to verify against.

In [ ]:
import numpy as np
import PIL
import requests

# download test mnist data
img_data = requests.get("https://storage.googleapis.com/kagglesdsdata/datasets/1272/2280/testSample/testSample/img_1.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20221109%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20221109T152127Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=b24923ea6357bd171c890de37ce2f542e285977c505b1be8b550f64e5e69d4828d821880879cd61f5053e07f3784f5a317e900c11bbf458da2be2b16aec78f455046184b0f757fdcd63133ae67adbaf0746005c8fd24d5968519e9a39e0b2b4b59ced37fc6338076a5f7dc48a94e425d9cda00b00be1b2585d1fc37119ff0d2510143fc1b37e51e05a8876bb3a350157cdb9c775daa6cc7e78eab455b4521afe96f1b374a43288385426b4eacdd0e135fe279156ee23c444e6adfdd35666dbb443f4182953d5fd1d632251a2e6779093660eaed9f8cf64c812c5d92e7368dbb1fd7608a27037d737b761baea34a350c90a24f2f4349ab1c68044098708786103").content
with open('mnist_example.jpg', 'wb') as handler:
    handler.write(img_data)

# show test image
im = Image('mnist_example.jpg')
display(im)

# load test data
input_image = PIL.Image.open("mnist_example.jpg") # load file
input_image = np.array(input_image, dtype=np.float32) # convert to numpy
input_image = np.expand_dims(input_image, axis=0) # add channel dimension
input_image = input_image/np.linalg.norm(input_image) # normalise
input_image = np.stack([input_image for _ in range(1)], axis=0) # add batch dimension (batch size 1 here)
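
Before handing the image to the testbench, a quick sanity check on shape and dtype is worthwhile. The (batch, channels, height, width) layout is an assumption based on the preprocessing above:

In [ ]:
# sanity-check the preprocessed input: expect (1, 1, 28, 28) float32
print(input_image.shape, input_image.dtype)
assert input_image.dtype == np.float32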

Next we can test the hardware. First, let's run C simulation, which is a functional check of the generated hardware, using the test image prepared above.

In [ ]:
# run the partition's testbench
gen_net.run_testbench(0, input_image)

After the C-based simulation passes, the hardware can be generated. C synthesis is the process of converting the high-level C description to RTL. We will generate the hardware for the single partition, based on the outputs/single_layer_opt.json configuration.

It's encouraged to look through the HLS log to make sure there are no issues with pipelining or predicted timing.

In [ ]:
# generate hardware
gen_net.generate_partition_hardware(0)
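
As suggested above, it's worth scanning the synthesis log for pipelining or timing warnings. A minimal sketch, assuming a log location; the actual path depends on where the HLS project was created:

In [ ]:
# scan the HLS log for warnings (the path below is hypothetical)
log_path = "outputs/single_layer/partition_0/vivado_hls.log"

with open(log_path) as f:
    for line in f:
        if "WARNING" in line:
            print(line.strip())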

Finally, we can run co-simulation. This runs an RTL-level simulation of the generated circuit and checks it against the C simulation result.

In [ ]:
# run co-simulation
gen_net.run_cosimulation(0, input_image)