Designing a RISC-V CPU, Part 1: Learning hardware design as a software engineer

Tue 16 February 2021

I have no experience in digital logic design. That is, I didn't until I recently decided that I would like to try designing my own CPU and running it on an FPGA! If you too are a software engineer with a vague interest in hardware design, I hope this series of posts about what I've learnt will be helpful and interesting. In this first installment, I hope to answer these questions:

What is digital logic design?
How do I get started, and what tools might I use?

In future installments, I will go into more detail about my CPU design and the RISC-V architecture, as well as hopefully answering these questions:

What about digital logic design is fundamentally different from software design?
What about digital logic design is similar to software design?

You can see the code for my CPU at the time of writing here or an up to date version here.

What is digital logic design?

Digital logic design is designing logic circuits that operate on binary values. The elementary components are logic gates: an AND gate, for example, has two inputs and one output. The output is 1 iff¹ both inputs are 1.

Typically, we design synchronous circuits which use flip-flops to store state, and thereby synchronise the operation of the circuit to a common clock. Flip-flops are composed of logic gates.

Analogue circuit design is concerned with the electronic components that make up logic gates, like transistors and diodes. This level of abstraction is often needed for applications dealing directly with signals derived from analogue sensors, like radio receivers. When designing a CPU, this level of abstraction would not be feasible: modern CPUs can have billions of transistors!

Instead, we use tools that can translate our digital logic design into different useful formats: the configuration of an FPGA (see below); a simulation; silicon layout.

What is an FPGA and why are they used?

We noted above that the same digital logic design tools can be used whether we are creating a custom ASIC to be made into silicon, or configuring an FPGA. A Field-Programmable Gate Array is an integrated circuit containing an array of programmable logic blocks. You could imagine it is as a big array of logic gates that can be connected together in various ways.

Making a custom chip generally costs millions, and of course once your chip is manufactured it cannot be changed. Thus, generally FPGAs are used when:

You cannot afford to create a custom ASIC due to lack of capital (e.g. if you're just some hacker like me and not ARM or Intel)
You cannot afford to create a custom ASIC because your volume is too low to make it worth the high one-off costs (e.g. if you are making a small quantity of MRI machines with custom data acquisition hardware)
You need the flexibility

The downsides? FPGAs have a much higher per-chip cost, and they are generally much slower as a consequence of being able to connect logic blocks together in very flexible ways. In contrast, a custom design can be reduced to the minimum number of transistors, with no concern for flexibility.

I think it's helpful context to compare the custom ASIC design process against that of an FPGA design:

Logic design: just like we'd do for an FPGA, the logic design of an ASIC is done in a hardware description language.
Verification: FPGA designs may well be verified, but you might expect the process for an ASIC design to be more rigorous – after all, the design can't be changed once manufactured! Often verification will involve formally verifying² parts of the design.
Synthesis: This creates a netlist: a list of logic blocks and their connections. The connections are called nets, and the blocks are called cells. For both FPGAs and ASICs, the cells are vendor-specific.
Placement and routing (P&R): for an FPGA, this involves mapping the logic blocks described in the netlist to actual blocks in the FPGA. The resulting binary is often called a bitstream. For an ASIC, this involves deciding where to place the cells on the silicon, and how to connect them up. Both applications generally use automated optimisation tools for this.

What tools do I need?

A hardware description language: I am using nMigen³

You may have heard of Verilog or VHDL: both popular hardware description languages (HDLs). I use "popular" here to mean widely used, not widely loved.

I won't pretend to know much about these tools: I only know that smarter people than me with vast logic design experience have a lot of hate for them. Due to the problems with Verilog and other similar tools, there have been various attempts at making more useful and friendlier alternatives. nMigen is one such project, which creates a domain-specific language in Python. In their own words:

Despite being faster than schematics entry, hardware design with Verilog and VHDL remains tedious and inefficient for several reasons. The event-driven model introduces issues and manual coding that are unnecessary for synchronous circuits, which represent the lion's share of today's logic designs. Counterintuitive arithmetic rules result in steeper learning curves and provide a fertile ground for subtle bugs in designs. Finally, support for procedural generation of logic (metaprogramming) through "generate" statements is very limited and restricts the ways code can be made generic, reused and organized.

To address those issues, we have developed the nMigen FHDL, a library that replaces the event-driven paradigm with the notions of combinatorial and synchronous statements, has arithmetic rules that make integers always behave like mathematical integers, and most importantly allows the design's logic to be constructed by a Python program. This last point enables hardware designers to take advantage of the richness of the Python language—object oriented programming, function parameters, generators, operator overloading, libraries, etc.—to build well organized, reusable and elegant designs.

If, like me, you've never used Verilog, then not all of this will have more than abstract meaning to you. But it certainly sounds promising, and I can attest that it has been very straightforward to get started with logic design without the reportedly large barrier of grappling with Verilog. I would recommend it, particularly if you are already familiar with Python!

The only downside I can think of is that nMigen is still in development, and in particular the documentation is not complete. There is a helpful community at #nmigen on chat.freenode.net.

A wave viewer for inspecting simulations: I am using GTKWave

nMigen provides simulation tooling: I use it in my tests, written using pytest. I record the signals during these tests and view them in a wave viewer to help debug.

gtkwave

Optional: An FPGA dev board. I am using a myStorm BlackIce II

You don't need an FPGA dev board to create your own CPU. You could do everything in simulation! The fun of having a board to work with, for me, is being able to flash LEDs and see my design in action.

Of course, if you were creating something more useful than my very basic CPU, then you would probably want some hardware to run it on, and this would be less "optional"!

Getting started with nMigen

Rather than immediately trying to design a CPU, I started by making an Arithmetic Logic Unit (ALU) in nMigen. The ALU is a key piece of any CPU design that I have seen: it performs arithmetic operations.

Why start with this? I knew I would need an ALU for my CPU; I knew I could make a simple one; I knew that the feeling of making something is an important motivator when starting a new project!

My design looked something like this:

 1 """Arithmetic Logic Unit"""
 2 import enum
 3 
 4 import nmigen as nm
 5 
 6 
 7 class ALUOp(enum.IntEnum):
 8     """Operations for the ALU"""
 9     ADD = 0
10     SUB = 1
11 
12 
13 class ALU(nm.Elaboratable):
14     """
15     Arithmetic Logic Unit
16 
17     * op (in): the opcode
18     * a (in): the first operand
19     * b (in): the second operand
20 
21     * o (out): the output
22     """
23 
24     def __init__(self, width):
25         """
26         Initialiser
27 
28         Args:
29             width (int): data width
30         """
31         self.op = nm.Signal()
32         self.a = nm.Signal(width)
33         self.b = nm.Signal(width)
34         self.o = nm.Signal(width)
35 
36     def elaborate(self, _):
37         m = nm.Module()
38 
39         with m.Switch(self.op):
40             with m.Case(ALUOp.ADD):
41                 m.d.comb += self.o.eq(self.a + self.b)
42             with m.Case(ALUOp.SUB):
43                 m.d.comb += self.o.eq(self.a - self.b)
44         return m

As you can see, we've created a lot of nMigen Signal instances to represent well...the signals that define the interface to our ALU! But what is this elaborate method? My understanding is that "elaboration" is the name for the first step in synthesising the netlist (see above). The idea in the nMigen code above is that we've created some elaboratable structure (by inheriting from nm.Elaboratable), i.e. something that describes digital logic we want to synthesise. The elaborate method describes that digital logic. It has to return an nMigen Module.

Let's have a closer look at the contents of the elaborate method. The Switch will create some kind of decision logic in the synthesised design. But what is m.d.comb? nMigen has the concept of synchronous (m.d.sync) and combinatorial⁴ (m.d.comb) control domains. From the nMigen docs:

A control domain is a named group of signals that change their value in identical conditions.

All designs have a single predefined combinatorial domain, containing all signals that change immediately when any value used to compute them changes. The name comb is reserved for the combinatorial domain.

A design can also have any amount of user-defined synchronous domains, also called clock domains, containing signals that change when a specific edge occurs on the domain’s clock signal or, for domains with asynchronous reset, on the domain’s reset signal. Most modules only use a single synchronous domain[...]

The behavior of assignments differs for signals in combinatorial and synchronous domains. Collectively, signals in synchronous domains contain the state of a design, whereas signals in the combinatorial domain cannot form feedback loops or hold state.

Let's think about a shift register as an example piece of logic that we wish to design. Let's say our shift register has 8 bits, and every clock cycle the bit values are shifted one bit along (with the left-most value coming from an input signal). This is necessarily synchronous: you couldn't create this functionality by simply wiring the bits together, which in nMigen is what assigning them in the combinatorial domain would represent.

In the next installment of this blog series, I'll have more detail on my CPU design. As it stands at the moment, I'm trying to retire one instruction per cycle with no pipelining – this is unusual, but my hope was that it would make various aspects of the CPU simpler. A consequence of this is that much of my logic is combinatorial, rather than synchronous, as I have very little state to maintain between clock cycles. At the moment, something is wrong with my register file design, and there's a chance I'll have to reassess my "no pipelining" idea in order to fix it.

Writing tests

I like using pytest for Python tests, but you can of course use whatever framework appeals to you. Here are my tests for the ALU code above:

 1 """ALU tests"""
 2 import nmigen.sim
 3 import pytest
 4 
 5 from riscy_boi import alu
 6 
 7 
 8 @pytest.mark.parametrize(
 9         "op, a, b, o", [
10             (alu.ALUOp.ADD, 1, 1, 2),
11             (alu.ALUOp.ADD, 1, 2, 3),
12             (alu.ALUOp.ADD, 2, 1, 3),
13             (alu.ALUOp.ADD, 258, 203, 461),
14             (alu.ALUOp.ADD, 5, 0, 5),
15             (alu.ALUOp.ADD, 0, 5, 5),
16             (alu.ALUOp.ADD, 2**32 - 1, 1, 0),
17             (alu.ALUOp.SUB, 1, 1, 0),
18             (alu.ALUOp.SUB, 4942, 0, 4942),
19             (alu.ALUOp.SUB, 1, 2, 2**32 - 1)])
20 def test_alu(comb_sim, op, a, b, o):
21     alu_inst = alu.ALU(32)
22 
23     def testbench():
24         yield alu_inst.op.eq(op)
25         yield alu_inst.a.eq(a)
26         yield alu_inst.b.eq(b)
27         yield nmigen.sim.Settle()
28         assert (yield alu_inst.o) == o
29 
30     comb_sim(alu_inst, testbench)

and my conftest.py:

 1 """Test configuration"""
 2 import os
 3 import shutil
 4 
 5 import nmigen.sim
 6 import pytest
 7 
 8 
 9 VCD_TOP_DIR = os.path.join(
10         os.path.dirname(os.path.realpath(__file__)),
11         "tests",
12         "vcd")
13 
14 
15 def vcd_path(node):
16     directory = os.path.join(VCD_TOP_DIR, node.fspath.basename.split(".")[0])
17     os.makedirs(directory, exist_ok=True)
18     return os.path.join(directory, node.name + ".vcd")
19 
20 
21 @pytest.fixture(scope="session", autouse=True)
22 def clear_vcd_directory():
23     shutil.rmtree(VCD_TOP_DIR, ignore_errors=True)
24 
25 
26 @pytest.fixture
27 def comb_sim(request):
28 
29     def run(fragment, process):
30         sim = nmigen.sim.Simulator(fragment)
31         sim.add_process(process)
32         with sim.write_vcd(vcd_path(request.node)):
33             sim.run_until(100e-6)
34 
35     return run
36 
37 
38 @pytest.fixture
39 def sync_sim(request):
40 
41     def run(fragment, process):
42         sim = nmigen.sim.Simulator(fragment)
43         sim.add_sync_process(process)
44         sim.add_clock(1 / 10e6)
45         with sim.write_vcd(vcd_path(request.node)):
46             sim.run()
47 
48     return run

Every test generates a vcd file for me to view in a wave viewer, like GTKWave, for debugging purposes. You'll notice that the combinatorial simulation fixture runs for an arbitrary small time period, whereas the synchronous simulation feature runs for a defined number of clock cycles.

Yielding a signal in the test function requests its current value from the simulator. For combinatorial logic, we yield nmigen.sim.Settle() to ask the simulation to complete.

For synchronous logic, you can also yield without an argument to start a new clock cycle.

Designing a CPU

Once I'd gotten familiar with nMigen, I started trying to draw a block diagram for my CPU. I will go into much more detail on this on the next installment of this blog series, but I will briefly say that I started by drawing out the logic required for one instruction, then drew out the logic for another instruction, then figured out how to combine them. Here's the first messy sketch:

sketch

This block diagram step was extremely valuable in figuring out what the interfaces of different components needed to be, but I wouldn't have wanted to do it before playing around in nMigen first and learning a bit about digital logic design in the process. The jazzed up block diagram currently looks like this:

block

Stay tuned for the next installment where I actually delve into RISC-V and CPU design. I expect there to be a third installment of me reworking my design and getting it working on the entire instruction set (RV32I) that I am implementing :)

"if and only if" ↩
If you're a software engineer who has worked on high assurance software, such as that for a high risk medical device, you might think "formal verification" refers to any formalised verification process. Here I use formal verification to refer to mathematically proving correctness, see https://en.wikipedia.org/wiki/Formal_verification ↩
Note that you want the nmigen project under the nmigen organisation on GitHub. You do not want the one under the m-labs organisation. The situation is a bit unfortunate and complicated, but just know that the former is only one under active development. It's possible that the active project might change name soon – I'll do my best to update this blog post should that happen. ↩
also known as combinational ↩

lochsh