Architecture by Alvin Lebeck 1 .Reference: "Readings in Computer Architecture" by Hill, Jouppi, Sohi .X-scale are Transmeta are representative embedded architectures //Concorde (airplane corporation) //Montreal (essentials) .Occam's Toothbrush: optimize simple things 2 //fabs = fabrication .ISA (Instruction Set Architecture) .Want CPI as small as possible .SPEC (Standard Performance Evaluation Corporation) *http://www.spec.org: many benchmarks #CPU2000 = 12 int + 14 FP //gap = "group theoretic enumerations //pitfall: trap .Little's Law: average # of jobs in system = arrival rate * mean holding time //wine cellar (Little's Law) //bribe ."Bandwidth can be solved with money. Latency ... you can't bribe God." - David Clark, MIT //yield .SPARC: Sun's instruction set .Java bytecodes //tenet //gadget 3 .SIMD: a kind of vector architecture .stack machine: popular in 1970; now: JVM .accumulator: CRAY in pre-1960 .memory-memory: VAX .memory-register: like extended accumulators .register-register: RISC like Alpha, MIPS, Power PC, CRAY (later) //replicate //primitives .exception handling is important (Wulf) 4 .An instruction w/o pipeline: I Fetch | I Decode | Exe | Mem | Write Back .N-stage pipeline: throughput = 1/max(t_stage) <= N/T (T=\sum t_stage) .Pipeline w/ overhead V: throughput = 1/(max(t_s)+V) <= min(N/T,1/V) .Hazard types *structural: use same H/W *data: access same data #RAW, (WAR, WAW) *control: sequence matters .Hazards cause stalls of S cycle and degrade throughput to N^2/T(N+S) .Deepen pipeline downgrades IPC .s* denotes structural stalling. It is usually avoided by more investment on hardware. .d* (data hazard stalling) stalls younger process, which is always safe since modifying value from old instruction is permitted. .MIPS implementation *MIPS1 uses no hazard detection - no-op need in compiler stage *MIPS2 uses hazard detection to speed up execution .delayed branches: instruction before branch is placed into the delayed slot .speculative 5 .speculation .branch history table //hysteresis .branch history shift register .branch target buffer (BTB) contains info about branching PC (address) .branch predictor is NOT an architectural buffer, nor visible state by program //tenet .a solution to WAW hazard: discard previous value *problem: if latter instruction encounters exception, the former value is used *Just stalling is better .coerced interrupt 6 .Flynn bottleneck and superscalar .Instruction level parallelism (ILP) is property of the SOFTware .statically scheduled (in-order) superscalar: SUN UltraSPARC, Alpha 21164 .wide decode-> RAW hazard -> stall the latter instruction .bypass is extended: rigid pipe vs fluid pipe (permit following instruction to be scheduled even during stall) *N^2 bypasses (Complexity paper) -> clustering: steer instruction .WB problem: wire from WB to register may be too long for a cycle .dynamic scheduling: execution not in-order (not von-Neumann structure) .As to scheduling, hardware does better than software .insturction buffer/queue/windows/scheduling buffer mean the same thing .detect dependency in dispatch stage, go ahead to issue if dependency resolved *Scoreboard: don't solve WAR/WAW; no bypassing; centralized *WAR/WAR is solved by the renaming technique *Need to trace some examples! 7 .register renaming *can have more locations than names *dynamically map names to locations .Tomasulo's algorithm: rows of reservation stations (RS) *cannot catch memory RAW hazard in simple Tomasulo .dynamic data flow graph: Link all RAW. The height of graph is the (optimal) execution order in unlimited resources //redux //bite the bullet .Instruction buffer -> Re-ordered buffer (ROB) .architectual status (out-of-order problem) *split WB to CM(complete) and CT(commit)/RT(retire)/graduate *solve the "precise state" problem in exceptions by cleaning all tables that are not retired and starting over *+ denotes the ready-in-ROB bit .every stage can produce a new value -> ROB size >= N*(#hit-L2-cache) 8 .another implementation: MIPS R10K *no copies -> free a register only if another write occurs or RAW diminish *knowing old mappings of registers (T_old) *precise state problem -> re-read the old mapping -> undo all following info buffers (ex. read at tail of ROB, free RS,...) .load scheduling problem *memory not rename *old memory address not recorded .memory disambiguation: speculation like loop //pathetic //culprit 9 .Static Scheduling (Ch.4) .Very Long Instruction Word (VLIW): A group of independent instructions *can identify resources early *dynamic translation (ex: Transmeta) .Explicitly Parallel Instruction Computing (EPIC): a compromise of VLIW *Profiling: by running the program once *Unrolling loops *Software pipelining: Translate unrolling sequence to pipelines *Trace scheduling deals with non-loop. -Blocks=single entry, single exit *to avoid memory conflict, use advanced load instruction *store is hard to undo //guise, messal *speculative instruction: put a flag to exception register -propagate flag to following instructions until store... *set-predicate instruction:sltip(?) 10 .REVIEW of MIDTERM: H&P Ch.1-4, App. A .Amdahl's law (memorize) .CPU time = I#/program * CPI * Clock_latency .CPI_overall = \sum_i CPI_i * F%_i (sum over instruction types) .Different mean values .Little's law: total # = arrival rate * average holding time .Transmeta approach for instruction set (code morphing) .5 stages of a Load .3 types of hazards .branch hazard approaches *stall *predict taken/not-taken (may w/ speculation) *delayed branch (w/o speculation) .branch fold issue .Dynamic branch prediction: 1 bit, 2 bits, correlating predictors *corr: use k-bit global branch history, 2^k total buffer space .Branch Target Buffer (BTB): store taken ins PC and prediction PC (and ins) .hybrid/competitive/selective/tournament branch predictor: associate confidence to each predictor on the fly. If current predictor is wrong and others right -> change to others *resource intensive .context switch/interrupt will add penalty, but 4ms/1interrupt is small comparing to ~G instructions/sec .rearrange instructions .unroll loop -> need increase # of registers .memory disambiguation *conservative: wait for every stores *opportunistic *perfect: no memory ambiguity happens .software pipelining: an artificial loop consisting of reversed order in combining different iterations .scoreboarding *no issue window *everything is stalled if structural or WAW hazard is detected in Dispatch .Tomasulo: resolve race condition by writing in first half of CDB broadcasting cycle and reading in second half *register renaming *#reservation station >= #physical ALU *dispatch: put instruction into FP operation queue *issue: put instruction into RS .speculation: getting more ILP *H/W undo squash *with commit to maintain precise interrupts *separate bypassing between real instructions and speculative instructions .H/W support for more ILP: re-ordered buffer *load-store queue use memory order buffer *separated from memory order buffer because not every instruction use memory and because memory order buffer is put closer to real memory units .Register Update Unit: combine ROB with RS .P6-style separates ROB from RS .R10K-style separates physical register file from ROB *must maintain a logical-to-physical map .superscalar: multiple PCs .VLIW: multiple operations for each PC *trace scheduling (for non-loop): fix-up instructions for every possible wrong branch prediction *trace cache: store traces with the highest probable branch series. Enable fetch past next branch (no PC increase needed) and branch folding .predicator/conditional instructions *still take a clock even if "annulled" *stall if condition evaluated late .Power as design target .reliability (Diva) 11 .Storage hierarchy: higher bandwidth, lower capacity/latency //BGvN'46 = Bulk, Goldstein, von Neumann .local miss ratio = miss/(total access on hierarchy x) .global miss ratio = miss/(total access on and above hierarchy x) .average (effective) access time = t_{hit} +(miss%*t_{miss}) *not perfect formula because misses may superpose .locality: one of the most important ideas in architecture .SRAM: high speed; DRAM: high density *SRAM: 6 transistors per bit, store data in flip-flop .balance: Amdahl's rule 1 MIPS<=>1 MB memory<=>1Mbit I/O .block frames grouped into sets *associativity = number of frames (ways) in each set *direct-mapped = one frame per set .physically separate tag comparator from data comparator *enable data speculation .set-associative (SA) is slower because tag affects data .static scheduling processor tends to use direct-mapped cache .dynamic scheduling processor tends to use set-associative *the additional access time is compensated by better overlapping/pipelining .The 4 Questions *placement *identification *replacement (LRU) (random) (NMRU) *write policy 12 .Cache: read NUCA .direct-mapping (DM): replacement issue depend on philosophy .write policies *write-allocate usually combines with write-back (today) *no-write-allocate (write-around) usually combines with write-through (earlier) .ABC's for caches: associativity, block size, capacity .write buffer: pipelined write->buffer can be bigger->good for write-through .evaluation methods *trace-driven simulation *(popular) full processor simulation #simple-scalar .miss classification *compulsory: cold start (first access) *capacity: too many different R/W; program solution: do similar R/W on same subroutine *conflict: too few associativity; solution: change alignment (or padding) *coherence: multi-thread I/O .skewed associativity (no test on this topic) .START eliminating miss rate .victim buffer (Jouppi): add a small fully-associative buffer for kicked-out data .software method *nested loop interchange (i,j->j,i) *loop blocking (i=1 to N -> j=1 to N/B, i=B*j to B*(j+1)) *loop fusion 13 .prefetching *stream buffer: pre-fetch into a buffer, not into cache -very effective in I$, less so for D$ (since access time increases) -for D$, fixed with multiple-way stream buffer .locality is very powerful (90/10 rule) .address prediction for linked lists *layout list elements serially -> malloc beforehand *correlated predictor: (like speculations) *dependence based pre-fetching (Intel future) *jump pointers (Intel future) .START eliminating miss penalty .critical word first (need memory return the specific word first, then the block) .when seen the word, immediately return and fetch others in parallel .sub-block: manage sub-blocks (tag, valid, ...) in a cache block .lock-up free caches (aka "non-blocking cache") *doing something in processor when miss *MSHR = miss status holding register *handle hit, dynamic scheduling .L2 cache (don't use unfiltered formula) *different concerns on L2 against L1: "avoid going to memory"->need low miss rate .multi-level inclusion? (L2 \superset L1) .START hit latency reduction (enlarge bandwidth) .true multi-porting (wire length N^2) .pipeline a single port (not scalable beyond 2 ports) .multibanking (interleaving): also used in main memory 14 .Main memory (rest of H&R Ch. 5) .DRAM *1 transistor + 1 capacitor *destructive read *refresh every 2ms *sophisticated (complicated) access; bank interleaving -assume A=2 cycles, T=1 cycle, B=2 cycles, interleaving looks like bank 0 1 2 3 cycle0 A A A A cycle1 A A A A cycle2 T/B B B B cycle3 B T/B B B cycle4 T s* cycle5 T *problem: power of 2 strides (very common in programming) //r.o.t. = rule of thumb *nibble mode - 4 bytes .SRAM *6 transistors *4X dense *1/4-1/8 access time *synchronous: send processor clock to DRAM *get rid of row/column //RAMBUS (RDRAM) = virtual channel DRAM *multiple data on a single wire -delay -> clock skew problem .virtual memory (Java, etc. get rid of addressing) *business reason: common software on wide product line *protection reason *memory-mapped files, networks,... .block(cache) -> pages(VM) .physical address(PA) -> frames(VM) .same 4 Q's *fully (or very highly) set-associative: low miss rate (no going to disk) *identification: mapping *replacement: sophisticated, can be done in software *write: always write-back, write-allocate (no going to disk again) 15 .VM review .pages size usually large (4-16KB) to amortize reads .one virtual page table per process .64-bit virtual address (VA) = 4 petabytes pages (assume 4K page size) .multi-level page table (Alpha) .inverted page table (not used widely) .MMU (memory management unit) .TB: physical index/tag .TLB: virtual index/physical tag //synonym 16 .I/O: often boring, but important - key of latency and throughput .H&P Ch. 7, RAID paper .IOPS = I/O per second .disk structure head------ ---+--- platter; O = track, (,) = sector | spindle .t_{disk} = t_{seek} + t_{rotation} + t_{transfer} + t_{controller} + t_{queuing} * t_{seek} nonlinear (because acceleration/slow down) //SSD = DRAM + battery .queuing theory (H&P) .micro electro-mechanics system (MEMS) .disk array: store in fine/coarse-grain style, striping .programmed I/O, memory mapping, DMA .BUS *synchronized: fast *asynchronized: hand-shaking, slow *switching *atomic *split-transaction (pipelined) *arbitration .rDMA: skip OS 17 .Parallel processing //CMP = chip multi-processor .cost and effectiveness: enough space to make multi-thread processors/MP *smooth upgrade *fault tolerance *power-effective .multithreading *coarse-grained: switch on long stalls (e.g. cache miss) *fine-grained: multi-issue (wide) command from different threads #P4: 2 threads .from Amdahl's law, speedup highly depends on portion of parallelism *find suitable domain, or we won't get much speedup .SISD(single instruction single data), SIMD(vector, MMX), MISD(not used), MIMD(now) .uniform memory access (UMA): small # processors vs. NUMA (large-scale) .routing technique: store-and-forward, worm-hole 18 .multithread .NUMA cluster (by interconnection like Ethernet) .memory coherence (single address) *cache coherence protocol -invalidate vs. update (indeed write-through) vs. bus-based (snooping) *upgrade/downgrade (hit level) *(centered) directories - home memory *coherence miss .consistency (access order) *sequential consistency: every load/store in order *weaker: only locks in order .synchronization (lock) *lock scheme: test-and-set instruction 19 .Multiscalar paper *sequencer: need multiple program counters, register files, function units, branch predictors, and processing elements *create mask *forward value for data dependency (by compiler) *ring structure *to deal with incorrect prediction (speculation), maintain a shared load/store queue (address resolution buffer) *problem: bandwidth *release value: squash based on control flow .simultaneous multiple thread (also Ch. 6): fill out issue slots by threads *enlarge register space (n*(#logical registers) + (ROB size) -1) *selective squashing (undoing) *implementation -round-robin (simple) -better: count the rate of commits in ROB, pick the most efficient (not-blocking) to execute first