How does a CPU go faster? · From Transistor to Browser

Chapter 1
How does a CPU go faster?

The browser
The network
Software and languages
The CPU
Memory
Arithmetic
Logic gates
The switch

Figure 1.1: Where we are in the climb from transistor to browser: still on the processor. This chapter asks how the same machine is made faster, by overlapping the steps of consecutive instructions.

The processor of Chapter ?? runs one instruction all the way through, fetch then decode then execute, before it starts the next. But each of those stages uses different hardware: fetch touches memory, decode the instruction logic, execute the ALU. While one instruction is in the ALU, the next could already be decoding, and the one after that fetching. Overlapping the stages this way is a pipeline, and it is how every fast processor works. This chapter schedules instructions through a pipeline, then meets the two things that break the flow: a value not computed yet, and a branch not yet decided.

1.1 Overlapping the stages

If a new instruction starts each cycle and every instruction takes three stages, then once the pipeline fills, three instructions are always in flight and one finishes every cycle. Four instructions that would take twelve cycles run one at a time finish in six. Figure 1.2 lays the stages out on a timeline.

pipeline-gantt — Figure 1.2: Four independent instructions in the pipeline. Each one’s fetch, decode, and execute run a cycle behind the instruction above it, so the stages overlap. At cycle three, three instructions are in flight at once.

Scheduling the pipeline is bookkeeping over cycle numbers: place each instruction one cycle after the last, unless it must wait.

Example 1-1. Scheduling instructions through the pipeline

def schedule(program: list) -> tuple:
    """Place each instruction's three stages on a timeline, in order, stalling an
    instruction until every register it reads has been written back. Returns the
    per-instruction stage cycles and the total number of cycles."""
    execute = []          # the cycle each instruction reaches its execute stage
    written = {}          # register -> index of the last instruction that writes it
    for index, ins in enumerate(program):
        earliest = execute[index - 1] + 1 if index else _STAGES - 11
        for src in ins.srcs:
            producer = written.get(src)
            if producer is not None:
                earliest = max(earliest, execute[producer] + 2)2
        execute.append(earliest)
        written[ins.dst] = index
    stages = [{"fetch": e - 2, "decode": e - 1, "execute": e} for e in execute]
    total = execute[-1] + 1 if execute else 03
    return stages, total

in order, one instruction reaches the execute stage each cycle, so normally it follows a single cycle behind the one before it
unless it reads a register an earlier instruction is still computing: then it waits two cycles past that instruction, a stall
the run takes as many cycles as the last instruction’s execute, plus one to write its result back

1.2 When an instruction must wait

The stall in callout 2 is the price of a data hazard. If an instruction needs a value the instruction just ahead of it has not finished computing, it cannot proceed; it has to wait until that value exists. A bubble opens in the pipeline, and everything behind it slips a cycle. Figure 1.3 shows one.

A pipeline is only allowed to overlap instructions because the result is the same as running them one at a time. Here the schedule models only the timing; the computed result is simply the sequential one, so the program does not prove the equivalence. What it does check is the hazard rule, which is exactly what lets a real overlapped pipeline keep that same result: no instruction reads an operand before the instruction that produces it has computed it.

Example 1-2. Checking the schedule is safe

def hazard_safe(program: list, stages: list) -> bool:
    """True if every instruction reads its registers only after the instruction
    that writes each one has reached its execute stage, so no operand is read
    before it exists."""
    written = {}
    for index, ins in enumerate(program):
        for src in ins.srcs:
            producer = written.get(src)
            if producer is not None and \
                    stages[index]["decode"] <= stages[producer]["execute"]:1
                return False
        written[ins.dst] = index
    return True

a schedule is unsafe if any instruction reads a register at or before the cycle the instruction writing it reaches execute; a safe schedule keeps every read strictly later

A real pipeline softens this with forwarding: the instant an instruction computes a value in its execute stage, that value is routed straight to the instruction behind it, rather than making it wait for the write-back. Many stalls disappear, but the shape of the hazard, a value needed before it exists, stays the same.

1.3 Guessing the next instruction

A data hazard makes the pipeline wait; a branch makes it guess. A jump does not know where it goes until it executes, but by then the pipeline has already fetched the instructions sitting behind it. So the processor bets on the outcome and fetches that way. A branch predictor makes the bet from past behaviour: a two-bit counter that leans towards taken or not-taken and only changes its mind after two wrong guesses in a row. On a loop that runs many times and leaves once, it is right almost always, missing only on the way out.

Example 1-3. Predicting a branch

def predict(outcomes: list) -> int:
    """Predict each branch with a two-bit saturating counter and count the misses.
    The counter leans towards taken or not-taken and needs two wrong guesses in a
    row to flip, so a loop taken many times then falling through is missed just
    once, on the way out."""
    state = 0                                # 0,1 predict not-taken; 2,3 taken
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:1
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)2
    return misses

the guess is taken once the counter has leaned that way, states two or three; a miss is a guess that did not match what happened
each outcome nudges the counter one step towards it, so a steady branch saturates fast and then stays right

When the guess is wrong, the instructions fetched behind the branch were the wrong ones and are thrown away. Figure 1.4 shows the cost.

misprediction — Figure 1.4: A mispredicted branch. The guess sent the pipeline down the wrong path, so the two instructions fetched behind the branch are **flushed**, cycles of work thrown away. Deeper pipelines fetch further ahead, so a wrong guess costs even more.

1.4 The pipeline, run

Running ftb.pipeline schedules two programs, one with no hazards and one a dependent chain, then predicts the branches of a loop.

independent: 4 instructions, 6 cycles (one at a time: 12)
  r1 = f('base',)  F D E
  r2 = f('base',)    F D E
  r3 = f('base',)      F D E
  r4 = f('base',)        F D E
  hazard-safe: True; result {'base': 5, 'r1': 6, 'r2': 6, 'r3': 6, 'r4': 6}

dependent chain: 4 instructions, 9 cycles (one at a time: 12)
  r1 = f('r0',)    F D E
  r2 = f('r1',)        F D E
  r3 = f('r2',)            F D E
  r4 = f('r3',)                F D E
  hazard-safe: True; result {'r0': 0, 'r1': 1, 'r2': 2, 'r3': 3, 'r4': 4}

The accumulator CPU of Chapter 5 has a single register, so every
instruction reads what the last one wrote: a chain, the worst case for
a pipeline. Many registers are what let independent work overlap.

Predicting a loop's 50 branches:
  no predictor:  45 misses (10% right), 90 cycles flushed
  2-bit counter:  7 misses (86% right), 14 cycles flushed

The independent program staircases cleanly and finishes in six cycles. The chain stalls at every step, because each instruction needs the last one’s result, and finishes in nine. Both still compute exactly what running them one at a time would. The chain is also the shape of arithmetic on the accumulator machine of Chapter ??: with a single register, most instructions read what the last one wrote, which is why real processors have many registers, so that independent work can be found and overlapped.

The last lines predict a loop’s fifty branches. Guessing not-taken misses every jump back to the top; the two-bit counter misses only the warm-up and each exit, turning ninety flushed cycles into fourteen. Overlap, forwarding, and a good guess are how one core does more each cycle.

The core is faster now, but it still waits on every trip to memory. The next chapter asks why those trips are so slow, and how a cache hides them.

Exercises

Add a fourth independent instruction. How many cycles does the schedule take, and why only one more?
Write a program where each instruction needs the result of the one before it. How many stalls appear?
Predict a branch that flips taken, not-taken, taken, not-taken every time. How often is the two-bit counter right, and why is this its worst case?