This tutorial demonstrates the end-to-end flow from an ONNX model to generated hardware, using the LeNet network as an example. Let's start by looking at the baseline architecture through the lens of fpgaconvnet-model, the modelling part of the project.
from IPython.display import Image
from fpgaconvnet.models.network import Network
# load network
net = Network("single_layer", "models/single_layer.onnx")
net.visualise("baseline.png", mode="png")
# display to jupyter notebook
im = Image('baseline.png')
display(im)
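As a quick structural check, we can also confirm how the network has been split into partitions (a minimal sketch; the `partitions` attribute is also used later for resource predictions):
# number of partitions the network has been split into
print(f"number of partitions: {len(net.partitions)}")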
The baseline network is predicted to perform poorly, as there is no parallelism amongst the modules. We can see the expected performance for a Zedboard platform.
# load the zedboard platform details
net.update_platform("platforms/zedboard.json")
# show latency, throughput and resource predictions
print(f"predicted latency (us): {net.get_latency()*1000000}")
print(f"predicted throughput (img/s): {net.get_throughput()} (batch size={net.batch_size})")
print(f"predicted resource usage: {net.partitions[0].get_resource_usage()}")
To increase performance, the SAMO design-space exploration tool can be used. This tool generalises the CNN-to-FPGA mapping problem for Streaming Architectures. A command-line interface is available for optimising a design; here we will use the rule-based optimiser with a latency objective.
!mkdir -p outputs
import samo.cli
# invoking the CLI from python
samo.cli.main([
    "--model", "models/single_layer.onnx",
    "--platform", "platforms/zedboard.json",
    "--output-path", "outputs/single_layer_opt.json",
    "--backend", "fpgaconvnet",
    "--optimiser", "rule",
    "--objective", "latency"
])
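The same optimisation can also be run from the shell instead of Python. A sketch of the equivalent invocation, assuming the package installs a `samo` console entry point (check your installation; the flags match those passed above):
!samo --model models/single_layer.onnx --platform platforms/zedboard.json --output-path outputs/single_layer_opt.json --backend fpgaconvnet --optimiser rule --objective latency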
We can see the performance and resource predictions for the optimised configuration.
# load the optimised network
net.load_network("outputs/single_layer_opt.json")
net.update_partitions()
# print the performance and resource predictions
print(f"predicted latency (us): {net.get_latency()*1000000}")
print(f"predicted throughput (img/s): {net.get_throughput()} (batch size={net.batch_size})")
print(f"predicted resource usage: {net.partitions[0].get_resource_usage()}")
We can also visualise the optimised model. It's clear how SAMO exploits parallelism to generate higher-performance designs.
# create network visualisation
net.visualise("optimised.png", mode="png")
# display to jupyter notebook
im = Image('optimised.png')
display(im)
Once we are happy with the design, we can generate the hardware using the fpgaconvnet-hls project. This part of the tool acts as a backend to the modelling, implementing the hardware building blocks in HLS. To start, we instantiate the HLS code generator for this hardware configuration and generate all the HLS hardware.
from fpgaconvnet.hls.generate.network import GenerateNetwork
# create instance of the network
gen_net = GenerateNetwork("single_layer", "outputs/single_layer_opt.json", "models/single_layer.onnx")
# generate hardware and create HLS project
gen_net.create_partition_project(0)
To verify the hardware, we need to generate a test input as well as a golden output.
import numpy as np
import PIL
import requests
# download test mnist data
img_data = requests.get("https://storage.googleapis.com/kagglesdsdata/datasets/1272/2280/testSample/testSample/img_1.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20221109%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20221109T152127Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=b24923ea6357bd171c890de37ce2f542e285977c505b1be8b550f64e5e69d4828d821880879cd61f5053e07f3784f5a317e900c11bbf458da2be2b16aec78f455046184b0f757fdcd63133ae67adbaf0746005c8fd24d5968519e9a39e0b2b4b59ced37fc6338076a5f7dc48a94e425d9cda00b00be1b2585d1fc37119ff0d2510143fc1b37e51e05a8876bb3a350157cdb9c775daa6cc7e78eab455b4521afe96f1b374a43288385426b4eacdd0e135fe279156ee23c444e6adfdd35666dbb443f4182953d5fd1d632251a2e6779093660eaed9f8cf64c812c5d92e7368dbb1fd7608a27037d737b761baea34a350c90a24f2f4349ab1c68044098708786103").content
with open('mnist_example.jpg', 'wb') as handler:
    handler.write(img_data)
# show test image
im = Image('mnist_example.jpg')
display(im)
# load test data
input_image = PIL.Image.open("mnist_example.jpg") # load file
input_image = np.array(input_image, dtype=np.float32) # convert to numpy
input_image = np.expand_dims(input_image, axis=0) # add channel dimension
input_image = input_image/np.linalg.norm(input_image) # normalise
input_image = np.stack([input_image for _ in range(1)], axis=0) # add batch dimension (batch size 1)
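A quick sanity check on the prepared tensor helps catch shape mistakes before simulation. Assuming a 28x28 grayscale MNIST image, the expected layout is (batch, channels, height, width):
# expect shape (1, 1, 28, 28) and dtype float32
print(f"input shape: {input_image.shape}, dtype: {input_image.dtype}")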
Next we can test the hardware. First, let's run C simulation, which is a functional check of the generated hardware using the test image we just created.
# run the partition's testbench
gen_net.run_testbench(0, input_image)
After the C-based simulation passes, the hardware can be generated. C synthesis is the process of converting the high-level C description to RTL. We will generate the hardware for the single partition based on the outputs/single_layer_opt.json configuration.
You are encouraged to look through the HLS logs to check for issues with pipelining or predicted timing; one way of scanning them is sketched below.
# generate hardware
gen_net.generate_partition_hardware(0)
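A simple grep can scan the log without opening it. The path below is hypothetical and depends on where fpgaconvnet-hls places the partition project and which HLS tool is in use:
# hypothetical log path; adjust to your generated project layout
!grep -inE "violation|timing" partitions/partition_0/vivado_hls.log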
Finally, we can run co-simulation. This runs an RTL-level simulation of the generated circuit and checks it against the C-simulation result.
# run co-simulation
gen_net.run_cosimulation(0, input_image)