Chapter 1
How does a CPU go faster?
- The browser
- The network
- Software and languages
- The CPU
- Memory
- Arithmetic
- Logic gates
- The switch
The processor of Chapter ?? runs one instruction all the way through, fetch then decode then execute, before it starts the next. But each of those stages uses different hardware: fetch touches memory, decode the instruction logic, execute the ALU. While one instruction is in the ALU, the next could already be decoding, and the one after that fetching. Overlapping the stages this way is a pipeline, and it is how every fast processor works. This chapter schedules instructions through a pipeline, then meets the two things that break the flow: a value not computed yet, and a branch not yet decided.
1.1 Overlapping the stages
If a new instruction starts each cycle and every instruction takes three stages, then once the pipeline fills, three instructions are always in flight and one finishes every cycle. Four instructions that would take twelve cycles run one at a time finish in six. Figure 1.2 lays the stages out on a timeline.
Scheduling the pipeline is bookkeeping over cycle numbers: place each instruction one cycle after the last, unless it must wait.
Example 1-1. Scheduling instructions through the pipeline
def schedule(program: list) -> tuple: """Place each instruction's three stages on a timeline, in order, stalling an instruction until every register it reads has been written back. Returns the per-instruction stage cycles and the total number of cycles.""" execute = [] # the cycle each instruction reaches its execute stage written = {} # register -> index of the last instruction that writes it for index, ins in enumerate(program): earliest = execute[index - 1] + 1 if index else _STAGES - 11 for src in ins.srcs: producer = written.get(src) if producer is not None: earliest = max(earliest, execute[producer] + 2)2 execute.append(earliest) written[ins.dst] = index stages = [{"fetch": e - 2, "decode": e - 1, "execute": e} for e in execute] total = execute[-1] + 1 if execute else 03 return stages, total
- in order, one instruction reaches the execute stage each cycle, so normally it follows a single cycle behind the one before it
- unless it reads a register an earlier instruction is still computing: then it waits two cycles past that instruction, a stall
- the run takes as many cycles as the last instruction’s execute, plus one to write its result back
1.2 When an instruction must wait
The stall in callout 2 is the price of a data hazard. If an instruction needs a value the instruction just ahead of it has not finished computing, it cannot proceed; it has to wait until that value exists. A bubble opens in the pipeline, and everything behind it slips a cycle. Figure 1.3 shows one.
A pipeline is only allowed to overlap instructions because the result is the same as running them one at a time. Here the schedule models only the timing; the computed result is simply the sequential one, so the program does not prove the equivalence. What it does check is the hazard rule, which is exactly what lets a real overlapped pipeline keep that same result: no instruction reads an operand before the instruction that produces it has computed it.
Example 1-2. Checking the schedule is safe
def hazard_safe(program: list, stages: list) -> bool: """True if every instruction reads its registers only after the instruction that writes each one has reached its execute stage, so no operand is read before it exists.""" written = {} for index, ins in enumerate(program): for src in ins.srcs: producer = written.get(src) if producer is not None and \ stages[index]["decode"] <= stages[producer]["execute"]:1 return False written[ins.dst] = index return True
- a schedule is unsafe if any instruction reads a register at or before the cycle the instruction writing it reaches execute; a safe schedule keeps every read strictly later
A real pipeline softens this with forwarding: the instant an instruction computes a value in its execute stage, that value is routed straight to the instruction behind it, rather than making it wait for the write-back. Many stalls disappear, but the shape of the hazard, a value needed before it exists, stays the same.
1.3 Guessing the next instruction
A data hazard makes the pipeline wait; a branch makes it guess. A jump does not know where it goes until it executes, but by then the pipeline has already fetched the instructions sitting behind it. So the processor bets on the outcome and fetches that way. A branch predictor makes the bet from past behaviour: a two-bit counter that leans towards taken or not-taken and only changes its mind after two wrong guesses in a row. On a loop that runs many times and leaves once, it is right almost always, missing only on the way out.
Example 1-3. Predicting a branch
def predict(outcomes: list) -> int: """Predict each branch with a two-bit saturating counter and count the misses. The counter leans towards taken or not-taken and needs two wrong guesses in a row to flip, so a loop taken many times then falling through is missed just once, on the way out.""" state = 0 # 0,1 predict not-taken; 2,3 taken misses = 0 for taken in outcomes: if (state >= 2) != taken:1 misses += 1 state = min(state + 1, 3) if taken else max(state - 1, 0)2 return misses
- the guess is taken once the counter has leaned that way, states two or three; a miss is a guess that did not match what happened
- each outcome nudges the counter one step towards it, so a steady branch saturates fast and then stays right
When the guess is wrong, the instructions fetched behind the branch were the wrong ones and are thrown away. Figure 1.4 shows the cost.
1.4 The pipeline, run
Running ftb.pipeline schedules two programs, one with no hazards and one a dependent chain, then predicts the branches of a loop.
independent: 4 instructions, 6 cycles (one at a time: 12)
r1 = f('base',) F D E
r2 = f('base',) F D E
r3 = f('base',) F D E
r4 = f('base',) F D E
hazard-safe: True; result {'base': 5, 'r1': 6, 'r2': 6, 'r3': 6, 'r4': 6}
dependent chain: 4 instructions, 9 cycles (one at a time: 12)
r1 = f('r0',) F D E
r2 = f('r1',) F D E
r3 = f('r2',) F D E
r4 = f('r3',) F D E
hazard-safe: True; result {'r0': 0, 'r1': 1, 'r2': 2, 'r3': 3, 'r4': 4}
The accumulator CPU of Chapter 5 has a single register, so every
instruction reads what the last one wrote: a chain, the worst case for
a pipeline. Many registers are what let independent work overlap.
Predicting a loop's 50 branches:
no predictor: 45 misses (10% right), 90 cycles flushed
2-bit counter: 7 misses (86% right), 14 cycles flushed
The independent program staircases cleanly and finishes in six cycles. The chain stalls at every step, because each instruction needs the last one’s result, and finishes in nine. Both still compute exactly what running them one at a time would. The chain is also the shape of arithmetic on the accumulator machine of Chapter ??: with a single register, most instructions read what the last one wrote, which is why real processors have many registers, so that independent work can be found and overlapped.
The last lines predict a loop’s fifty branches. Guessing not-taken misses every jump back to the top; the two-bit counter misses only the warm-up and each exit, turning ninety flushed cycles into fourteen. Overlap, forwarding, and a good guess are how one core does more each cycle.
The core is faster now, but it still waits on every trip to memory. The next chapter asks why those trips are so slow, and how a cache hides them.
Exercises
- Add a fourth independent instruction. How many cycles does the schedule take, and why only one more?
- Write a program where each instruction needs the result of the one before it. How many stalls appear?
- Predict a branch that flips taken, not-taken, taken, not-taken every time. How often is the two-bit counter right, and why is this its worst case?