Micropipeline
Ivan Sutherland introduced the term "micropipeline" in his 1988 Turing Award Lecture. Sutherland is widely regarded as the father of computer graphics, and over his prolific career he has received numerous honors and prizes. In recent years, he has devoted himself to developing VLSI processing architectures, with a special emphasis on asynchronous logic.
To visualize the behavior of a micropipeline versus a standard clocked pipeline, we will use the metaphor of the bucket brigade -- a method for transporting items by passing them from one stationary person to the next. The bucket brigade resembles the structure of a static data pipeline, in which data is passed from one register to the next.
- A conventional synchronous pipeline is like a bucket brigade in which each member follows the beat of a clock. When the clock ticks, each person pushes a bucket forward. When the clock tocks, each person catches the bucket pushed by the preceding person.
- An asynchronous micropipeline behaves just as a real-world bucket brigade does. Each person who holds a bucket can pass it down the line as soon as the next person's hands are free. If there is no bucket available, the person waits for one.
Demo Description
The demo consists of a variable-length FIFO that passes data through 32-bit registers. The first register implements a binary counter, so a 32-bit count is generated and propagated to the last register in the FIFO.
In a synchronous pipeline, we would include all of the registers in a single clock domain. With the asynchronous micropipeline approach, each of the 32-bit registers has its own local clock domain, acting as a trivial GALS (Globally Asynchronous, Locally Synchronous) design.
Each synchronous 32-bit register and its clock domain in every stage of the asynchronous micropipeline is paired with a simple asynchronous rendezvous cell module. Each cell communicates with the cells of the neighboring stages to generate an ordered, interlocked sequence of clock pulses in each of the independent clock domains.
These are the three basic types of rendezvous modules used in the micropipeline demo:
- AsyncArt Source Rendezvous Cell: the rendezvous module associated with the first data register, which acts as the origin of the data. Having no previous stage, it fires its local clock and passes new data into the micropipeline as soon as the next stage is ready.
- AsyncArt Register Rendezvous Cell: the rendezvous module associated with the intermediate data registers. It fires its local clock and loads new data when the previous stage has new data available and the next stage is empty.
- AsyncArt Sink Rendezvous Cell: the rendezvous module associated with the last data register, the one that exposes the data to the outside after it has crossed the micropipeline. Having no following stage, it fires its local clock and loads new data into the output as soon as it arrives from the previous stage.
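The interlocking of the three cell types can be sketched with a small software model. This is purely illustrative and entirely our own (the function name, the stepped loop and the list-of-stages representation are not part of the demo sources); the real cells are event-driven hardware handshakes, not sequential code:

```python
# Behavioral sketch of the three rendezvous rules (illustration only).
def simulate(depth, steps):
    stages = [None] * depth  # None == stage empty ("hands free")
    counter = 0              # the source stage generates a count
    output = []
    for _ in range(steps):
        # Sink rule: the last stage exposes its data as soon as it holds one.
        if stages[-1] is not None:
            output.append(stages[-1])
            stages[-1] = None
        # Register rule: fire when the previous stage is full and this one empty.
        for i in range(depth - 1, 0, -1):
            if stages[i] is None and stages[i - 1] is not None:
                stages[i] = stages[i - 1]
                stages[i - 1] = None
        # Source rule: fire as soon as the first stage is empty.
        if stages[0] is None:
            stages[0] = counter
            counter = (counter + 1) % (1 << 32)  # 32-bit wrap, as in the demo
    return output

print(simulate(depth=4, steps=12))  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

Note how no stage ever overwrites data the next stage has not yet accepted: the count always arrives at the sink in order, regardless of the depth.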
To allow for easier integration and testing on several devices, we provide the micropipeline demo in both VHDL and Verilog versions:
- https://www.ohwr.org/project/asyncart/blob/master/hdl/vhdl/asyncart_demo.vhd
- https://www.ohwr.org/project/asyncart/blob/master/hdl/verilog/asyncart_demo.v
Both HDL versions share the same input/output mapping:
- RST: an asynchronous reset signal that initializes the asynchronous rendezvous modules; be sure to generate a reset pulse before activating the micropipeline.
- ACT: an asynchronous activation signal that starts the clock firing sequence when it is driven and held at its true level.
- DATA: the received data as it comes out of the micropipeline. By measuring the period of the different bits of the implemented counter, we can estimate the throughput of the micropipeline:
- e.g. (2^32) / (MSB Bit Period) = (Peak Micropipeline Throughput)
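As a hypothetical worked example (the 28.6 s period below is an assumed measurement, not a documented result), the arithmetic looks like this:

```python
# The counter MSB (bit 31) completes a full cycle every 2^32 data items,
# so its measured period gives the peak throughput directly.
msb_period_s = 28.6                      # assumed measured MSB period, in seconds
throughput = (1 << 32) / msb_period_s    # data items per second
print(f"{throughput / 1e6:.0f} MDI/s")   # → 150 MDI/s
```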
Run on the iCEstick
The demo targets the Lattice iCEstick Evaluation Kit and requires Project IceStorm, Yosys and nextpnr to build.
In order to build and run the demo, we need to clone the AsyncArt source code:
git clone https://ohwr.org/project/asyncart.git
Basically, the demo includes two different blocks:
- A UART core connected to the second channel of the iCEstick FTDI chip.
- An instance of the Verilog version of the micropipeline demo.
In this demo, the 8-bit "characters" received by the FPGA from the UART are driven to an internal register. The two lower bits of this register are internally assigned to the micropipeline control signals, so that we can control the demo from the host development PC:
- Bit 0 from UART is assigned to RST
- Bit 1 from UART is assigned to ACT
Finally, we drive the following signals from the micropipeline output to the iCEstick LEDs so that we can measure the throughput by simple visual inspection, without expensive scopes:
- Data bit 31 to LED 4 (counter MSB)
- Data bit 30 to LED 3
- Data bit 29 to LED 2
- Data bit 28 to LED 1
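Counter bit n toggles with a full period of 2^(n+1) data items, which is why the top bits are slow enough to watch. Assuming a throughput of roughly 150 MDI/s (an assumed round figure, of the same order as the depths we measured), the expected blink cycles can be estimated as:

```python
# Estimate how long each LED's full on/off cycle lasts, assuming a
# micropipeline throughput of ~150 MDI/s (an assumed round figure).
throughput = 150e6  # data items per second (assumption)
for bit, led in [(31, 4), (30, 3), (29, 2), (28, 1)]:
    # A full cycle of counter bit n spans 2^(n+1) data items.
    period_s = (1 << (bit + 1)) / throughput
    print(f"LED {led} (counter bit {bit}): {period_s:.1f} s per blink cycle")
```

At that rate LED 4 completes a cycle roughly every half minute and LED 1 every few seconds, so a wristwatch is enough to estimate the throughput.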
Hardware
Get into the hardware demo folder and run make to synthesize the design binary for the iCEstick.
cd demo/hardware
make
Then, you can flash the design to a connected iCEstick board by issuing the following command:
make flash
Software
Once the iCEstick is programmed, we need to initialize the design (we have verified that this already happens after reset) and activate the micropipeline.
To do this, connect to the secondary UART of the FTDI chip (usually /dev/ttyUSB1), configured as a 115200 baud, 8-bit serial link, and use it to drive the micropipeline control signals.
A simple option is to open a serial console (e.g. minicom, microcom...) on that UART and use the matching ASCII codes to send the initialization sequence, e.g.:
- ASCII character for 0 (0x30): RST and ACT are driven to False, so the micropipeline remains inactive.
- ASCII character for 1 (0x31): RST is True while ACT is False, so asynchronous logic is initialized.
- ASCII character for 2 (0x32): RST is False while ACT is True, so the micropipeline starts to work.
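The mapping from ASCII codes to control bits can be checked with a few lines of Python (the helper name `control_bits` is ours, purely for illustration):

```python
# Decode an ASCII character into the demo's two control signals,
# following the register mapping described above.
def control_bits(char):
    value = ord(char)
    rst = bool(value & 0b01)  # bit 0 -> RST
    act = bool(value & 0b10)  # bit 1 -> ACT
    return rst, act

print(control_bits('0'))  # → (False, False): micropipeline inactive
print(control_bits('1'))  # → (True, False):  asynchronous logic initialized
print(control_bits('2'))  # → (False, True):  micropipeline running
```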
If you are not comfortable with serial consoles, you can build the simple program that we provide to write to the 8-bit control register using single UART accesses.
cd demo/software
make
Then, we can issue the following commands to replicate the start-up sequence:
sudo ./micropipeline-control /dev/ttyUSB1 0
sudo ./micropipeline-control /dev/ttyUSB1 1
sudo ./micropipeline-control /dev/ttyUSB1 2
Data Throughput
The throughput will vary depending on the depth of the micropipeline. This is because, when working at full speed, the complete micropipeline can only be as fast as its slowest asynchronous handshaking stage. The following table includes the throughput for several micropipeline depth values.
To change the micropipeline depth, modify the value of the micropipeline_depth parameter in the asyncart_demo instance included in the top HDL file.
The throughput is measured in Mega-Data Items per Second (MDI/s), an equivalent figure of merit to frequency in MHz for synchronous FIFOs. In addition, we include the FPGA resource consumption for several pipeline depths:
Depth | Throughput [MDI/s] | LCs (of 1280) | IOs (of 112) | GBs (of 8) |
---|---|---|---|---|
2 | 192 | 222 | 8 | 3 |
3 | 187 | 225 | 8 | 4 |
4 | 146 | 232 | 8 | 5 |
5 | 163 | 240 | 8 | 6 |
6 | 154 | 244 | 8 | 7 |
7 | 150 | 247 | 8 | 8 |
8 | 158 | 257 | 8 | 8 |
16 | 136 | 305 | 8 | 8 |
32 | 138 | 401 | 8 | 8 |
48 | 148 | 494 | 8 | 8 |
57 | 136 | 549 | 8 | 8 |
NOTES:
- These are the toolchain versions used for the documented tests:
- nextpnr-ice40 -- Next Generation Place and Route (git sha1 12aca15)
- Yosys 0.8+55 (git sha1 47c89d6, gcc 5.4.0-6ubuntu1~16.04.10 -fPIC -Os)
- Values for micropipeline_depth lower than 2 make no sense: you need at least a sender and a receiver.
- Values for micropipeline_depth higher than 57 are not properly synthesized.
- The logic consumption shows that the registers have been optimized (not all of the register bits drive a used signal).
- For each clock domain in the micropipeline, the toolchain instantiates a global buffer. When there are no more GBs available, it starts to use conventional routing resources.
Physical Effects
By playing with the micropipeline, you can check how fast your asynchronous design can actually run and how physical and environmental parameters affect the speed of the CMOS circuitry, e.g.:
- The higher the supply voltage, the faster the circuit runs: you can hack the power supply to raise VCC and automatically overclock your micropipeline.
- The lower the die temperature, the faster the circuit runs: you can use a hair dryer to heat the iCE40 chip and watch the micropipeline automatically slow down.