Computer Architecture Lecture 0:...

35
Computer Architecture Lecture 0: Introduction ChihWei Liu 劉志尉 National Chiao Tung University [email protected]

Transcript of Computer Architecture Lecture 0:...

  • Computer ArchitectureLecture 0: Introduction

    Chih‐Wei Liu 劉志尉National Chiao Tung [email protected]

  • Course Information• Lecture:

    – Chih‐Wei Liu 劉志尉 [email protected]– 5731685, ED618

    • Teaching Assistant:– 楊承彥 張鈞豪– 54225, ED412

    • Course website: http://twins.ee.nctu.edu.tw• Textbook: J. L. Hennessy and D.A. Patterson, Computer Architecture: A 

    Quantitative Approach, 5th Edition, Morgan Kaufmann Publishers, 2012• Prerequisites:

    – Computer organization– Computer programming 

    CA-Lec0 [email protected] 2

  • This Course Has 5 Modules• Module 1

    – Instruction set architecture (ISA)– Pipelining and Hazards

    • Module 2– Memory hierarchy – Caches and virtual memory

    • Module 3– Instruction‐level Parallelism – Branch prediction– Speculation and Reorder buffer

    • Module 4– Thread‐level Parallelism– Cache coherent protocols

    • Module 5– Data‐level Parallelism– Vector machines

    CA-Lec0 [email protected] 3

    Midterm exam.

    Final exam.

  • Course Grade (tentative)

    • Lectures, Homework, and Quizzes: 25%

    – Adapted from Prof. David Patterson’s class notes– Please avoid arriving late or leaving early

    – One problem sets with respect to each chapter

    – One‐page reading report with respect to each lecture 

    – Homework should be handed in on time

    • LAB & Project: 25%

    • Midterm and Final Exams.: 50%

    • (Extra points 5%)

    CA-Lec0 [email protected] 4

  • Computer ? 

    CA-Lec0 [email protected] 5

    Building Hardwarethat Computes

  • Computing Devices Then…

    EDSAC, University of Cambridge, UK, 1949CA-Lec0 [email protected] 6

  • Computing Systems Today …

    • Computers & Microprocessors in everything– Vast infrastructure behind them

    CA-Lec0 [email protected] 7

    Robots

  • Overview of Computer Systems

    CA-Lec0 [email protected] 8

    Clusters

    Massive Cluster

    Gigabit Ethernet

    Cloud

  • “Mea

    ley

    Mac

    hine

    ”“M

    oore

    Mac

    hine

    Computer is a State‐Controlled Machine:Implementation as Comb logic + Latch

    Alpha/00

    Delta/10

    Beta/01

    0/0

    1/0

    1/1

    0/1

    0/0

    1/1La

    tch

    Com

    bina

    tion

    alLo

    gic

    Input Stateold Statenew Div

    0 0 0

    00 01 10

    00 10 01

    0 0 1

    1 1 1

    00 01 10

    01 00 10

    0 1 1

    CA-Lec0 [email protected] 9

    Finite state machine

  • Computer is a Microprogrammed Controller

    • State machine in which part of state is a “micro‐pc”.– Explicit circuitry for incrementing or changing PC 

    • Includes a ROM with “microinstructions”.– Controlled logic implements at least branches and jumps

    ROM

    (Instructions)

    Addr

    BranchPC

    + 1

    MUX

    Control

    0: forw 35 xxx1: b_no_obstacles 0002: back 10 xxx3: rotate 90 xxx4: goto 001

    Instruction Branch

    Com

    bina

    tion

    al L

    ogic

    /Co

    ntro

    lled

    Mac

    hine

    State w/ Address

    CA-Lec0 [email protected] 10

  • Instruction Execution Cycle

    CA-Lec0 [email protected] 11

    InstructionFetch

    InstructionDecode

    OperandFetch

    Execute

    ResultStore

    NextInstruction

    Obtain instruction from program storage

    Determine required actions and instruction size

    Locate and obtain operand data

    Compute result value or status

    Deposit results in storage for later use

    Determine successor instruction

    Processor

    regs

    F.U.s

    Memory

    program

    Data

    von Neumanbottleneck

  • Computer Architecture?

    CA-Lec0 [email protected] 12

    Application

    Physics

    Gap too large to bridge in one step

    In its general definition, computer architecture is the design of theabstraction layers that allow us to implement information processing applications efficiently using available manufacturing technologies.

  • Abstraction Layers in Modern Systems

    CA-Lec0 [email protected] 13

    Algorithm

    Gates/Register-Transfer Level (RTL)

    Application

    Instruction Set Architecture (ISA)

    Operating System/Virtual Machine

    Microarchitecture

    Devices

    Programming Language

    Circuits

    Physics

    Software

    Hardware

  • Example: PC

    CA-Lec0 [email protected] 14

    Algorithm: Linked list

    Gates/Register-Transfer Level (RTL)

    Application: Microsoft Powerpoint

    Instruction Set Architecture (ISA): x86

    Operating System/Virtual Machine: Windows

    Microarchitecture: 6-core, L3 Cache

    Devices: 22nm Tri-gate Transistors

    Programming Language: Microsoft SDK

    Circuits

    Physics

  • Example: Smart Phone

    CA-Lec0 [email protected] 15

    Algorithm: Trajectory

    Gates/Register-Transfer Level (RTL)

    Application: Angry Birds

    Instruction Set Architecture (ISA): ARM

    Operating System/Virtual Machine: Android/Dalvik

    Microarchitecture: Quad-core, L2 Cache

    Devices: 32nm CMOS

    Programming Language: Android SDK

    Circuits

    Physics

  • Computer Architecture Course

    CA-Lec0 [email protected] 16

    Algorithm

    Gates/Register-Transfer Level (RTL)

    Application

    Instruction Set Architecture (ISA)

    Operating System/Virtual Machine

    Microarchitecture

    Devices

    Programming Language

    Circuits

    Physics

    Original domain of

    the computer architect

    (‘50s-’80s)

    Domain of recent

    computer architecture

    (‘90s)Reliability, power, …

    Parallel computing, security, …

    Reinvigoration of computer architecture,

    mid-2000s onward.

  • Concluding Remarks• The abstraction layers provide

    – Environmental stability within generation– Environmental stability across generation– Consistency across a large number of units

    • Modern computer architecture cannot avoid paying attention to software and compilation issues.

    • Computer architecture is engineering design under constraints– Average case & worst case, peak power & energy per operation, soft errors & hard errors, hardware cost & software cost, ….

    CA-Lec0 [email protected] 17

  • “Bell’s Law” – new class per decade

    year

    log

    (peo

    ple

    per c

    ompu

    ter)

    streaming informationto/from physical world

    Number CrunchingData Storage

    productivityinteractive

    • Enabled by technological opportunities• Smaller, more numerous and more intimately connected• Brings in a new kind of application• Used in many ways not previously imagined

    CA-Lec0 [email protected] 18

  • Uniprocessor Performance

    CA-Lec0 [email protected] 19

    1

    10

    100

    1000

    10000

    1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006

    Perfo

    rman

    ce (v

    s. V

    AX-1

    1/78

    0)

    25%/year

    52%/year

    ??%/year

    From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006

    • VAX : 25%/year 1978 to 1986• RISC + x86: 52%/year 1986 to 2002• RISC + x86: ??%/year 2002 to present

    Pipelining,Data locality,Parallelism processing

  • Driven Technology: Moore’s Law

    • “Cramming More Components onto Integrated Circuits”– Gordon Moore, Electronics, 1965

    • # of transistors on cost‐effective integrated circuit double every 18 months

    CA-Lec0 [email protected] 20

  • CISC (Complex Instruction Set Computer)

    CA-Lec0 [email protected] 21

  • RISC (Reduced Instruction Set Computer)

    CA-Lec0 [email protected] 22

  • UniProcessor Performance• Early 1970s, mainframes and minicomputers

    – 25%~30% growth per year in performance• Late 1970, microprocessor

    – 35% growth per year in performance• Early 1980s, Reduced Instruction Set Computer (RISC) architectures

    – 2 critical performance techniques• ILP (initially through pipelining and later through multiple instruction issue) • Cache

    – 50% growth per year in performance• 1998~2000, relative performance

    – By technology: 1.35per year– By technology + architecture, overall: 1.58 per yearNote: 1.581.35(1+15%), the architecture improvement factor is 15%

    CA-Lec0 [email protected] 1-23

  • Pipelined Instruction Execution

    Instr.

    Order

    Time (clock cycles)

    Reg ALU DMemIfetch Reg

    Reg ALU DMemIfetch Reg

    Reg ALU DMemIfetch Reg

    Reg ALU DMemIfetch Reg

    Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5

    CA-Lec0 [email protected] 24

  • Limiting Force: Power Density

    CA-Lec0 [email protected] 25

  • µProc60%/yr.(2X/1.5yr)

    DRAM9%/yr.(2X/10 yrs)1

    10

    100

    10001980

    1981

    1983

    1984

    1985

    1986

    1987

    1988

    1989

    1990

    1991

    1992

    1993

    1994

    1995

    1996

    1997

    1998

    1999

    2000

    DRAM

    CPU

    1982

    Processor-MemoryPerformance Gap:(grows 50% / year)

    Perf

    orm

    ance

    Time

    Limiting Force:Processor‐DRAM Memory Gap (latency)

    CA-Lec0 [email protected] 26

  • Limiting Force: Clock Speed and ILP

    • Chip density is continuing increase ~2x every 2 years– Clock speed is not– # processors/chip (cores) 

    may double instead• There is little or no more 

    Instruction Level Parallelism (ILP) to be found– Can no longer allow 

    programmer to think in terms of a serial programming model

    • Conclusion:Parallelism must be exposed to software!

    Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

    CA-Lec0 [email protected] 27

  • • Old Conventional Wisdom: Power is free, Transistors expensive• New Conventional Wisdom: “Power wall” Power expensive, Xtors free 

    (Can put more on chip than can afford to turn on)• Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, 

    innovation (Out‐of‐order, speculation, VLIW, …)• New CW: “ILP wall” law of diminishing returns on more HW for ILP • Old CW: Multiplies are slow, Memory access is fast• New CW: “Memory wall” Memory slow, multiplies fast 

    (200 clock cycles to DRAM memory, 4 clocks for multiply)• Old CW: Uniprocessor performance 2X / 1.5 yrs• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall

    – Uniprocessor performance now 2X / 5(?) yrs Sea change in chip design: multiple “cores” 

    (2X processors per chip / ~ 2 years)• More power efficient to use a large number of simpler processors rather than a small number of complex processors

    Crossroads in Computer Architectue

    CA-Lec0 [email protected] 28

  • Sea Change in Chip Design• Intel 4004 (1971): 

    – 4‐bit processor,– 2312 transistors, 0.4 MHz, – 10 m PMOS, 11 mm2 chip

    • RISC II (1983): – 32‐bit, 5 stage – pipeline, 40,760 transistors, 3 MHz, – 3 m NMOS, 60 mm2 chip

    • 125 mm2 chip, 65 nm CMOS = 2312 RISC II+FPU+Icache+Dcache– RISC II shrinks to ~ 0.02 mm2 at 65 nm– Caches via DRAM or 1 transistor SRAM (www.t‐ram.com) ?– Proximity Communication via capacitive coupling at > 1 TB/s ?

    (Ivan Sutherland @ Sun / Berkeley)

    • Processor is the new transistor?

    CA-Lec0 [email protected] 29

  • ManyCore Chips: The trend is here

    • “ManyCore” refers to many processors/chip– 64?  128?  Hard to say exact boundary

    • How to program these?– Use 2 CPUs for video/audio– Use 1 for word processor, 1 for browser– 76 for virus checking???

    • Something new is clearly needed here…

    • Intel 80-core multicore chip (Feb 2007)– 80 simple cores– Two FP-engines / core– Mesh-like network– 100 million transistors– 65nm feature size

    • Intel Single-Chip Cloud Computer (August 2010)– 24 “tiles” with two IA

    cores per tile – 24-router mesh network

    with 256 GB/s bisection– 4 integrated DDR3 memory controllers– Hardware support for message-passing

    CA-Lec0 [email protected] 30

  • Problems with Sea Change

    • Algorithms, programming languages, compilers, operating systems, architectures, libraries, … not ready for 1000 CPUs / chip– Thread‐level parallelism (TLP)– Data‐level parallelism (DLP)

    • Four architectures exploit DLP and TLP– Instruction‐level parallelism – chapter 3– Vector architectures and GPUs – chapter 4– Thread‐level parallelism – chapter 5– Request‐level parallelism – chapter 6

    CA-Lec0 [email protected] 31

  • Course Information

    • Lecture:– Chih‐Wei Liu 劉志尉 [email protected]– 5731685, ED618

    • Teaching Assistants:– 楊承彥, 張鈞豪– 54225, ED412

    • Course website: http://twins.ee.nctu.edu.tw• Prerequisites:

    – Computer organization– Computer programming 

    CA-Lec0 [email protected] 32

  • Computer Architecture Course

    CA-Lec0 [email protected] 33

    Algorithm

    Gates/Register-Transfer Level (RTL)

    Application

    Instruction Set Architecture (ISA)

    Operating System/Virtual Machine

    Microarchitecture

    Devices

    Programming Language

    Circuits

    Physics

    Original domain of

    the computer architect

    (‘50s-’80s)

    Domain of recent

    computer architecture

    (‘90s)Reliability, power, …

    Parallel computing, security, …

    Reinvigoration of computer architecture,

    mid-2000s onward.

  • This Course Has 5 Modules• Module 1

    – Instruction set architecture (ISA)– Pipelining and Hazards

    • Module 2– Memory hierarchy – Caches and virtual memory

    • Module 3– Instruction‐level Parallelism – Branch prediction– Speculation and Reorder buffer

    • Module 4– Thread‐level Parallelism– Cache coherent protocols

    • Module 5– Data‐level Parallelism– Vector machines

    CA-Lec0 [email protected] 34

    Midterm exam.

    Final exam.

  • Course Grade (tentative)

    • Lectures, Homework, and Quizzes: 25%

    – Adapted from Prof. David Patterson’s class notes– Please avoid arriving late or leaving early

    – One problem sets with respect to each chapter

    – One‐page reading report with respect to each lecture 

    – Homework should be handed in on time

    • LAB & Project: 25%

    • Midterm and Final Exams.: 50%

    • (Extra points 5%)

    CA-Lec0 [email protected] 35