Processors selection
Processing Elements and their selection
By Pradeep Shankhwar
Presentation layout
• Computing elements
• Processor architectures
• Processor
– Microcontroller
– PowerPC
– ARM
– MIPS
– DSPs
– GPU
• Selection
• Conclusion
Computing Elements
• Microprocessors
– ARM, Intel, AMD, PPC, Motorola, MIPS, etc.
• Microcontrollers
– ARM, Intel, Atmel, Motorola, etc.
• Digital Signal Processor (DSP)
– ADI DSPs and TI DSPs
• Graphics Processing Unit (GPU)
– Nvidia and ATI GPUs
• System on Chip (SoC)
– Freescale i.MX51/53, TI DaVinci platform
• Application Specific IC (ASIC)
– Crypto elements, Ethernet controller, USB controller, serial controller, etc.
• FPGA
Computing Element: Architecture
• Architecture is concerned with
– the internal structure of the processor and the interconnections between its ALU, control unit, address generator and instruction decoder, and with the pipelined execution of instructions
Architecture defining parameters
• Number of ALUs/FPUs
• Number of memory units
• On-chip resources
• External I/O interfaces
• Number of cores
• Clock of chip
• Power requirement
• Endianness (big/little)
• Instruction set requirements
• Memory handling architecture: stack, register-memory, accumulator, load/store
• Complex?
• DSP capability: multiply/accumulate?
• Addressing modes and address space supported
• Width of machine?
• Instruction pipelining support
• Computing pipelining support
• Cache size, levels
Kind of Architectures
Von Neumann
• Named after the mathematician and computer scientist John von Neumann.
• The computer has a single storage memory shared by data and program.
• An instruction fetch and a data access cannot use the memory in the same cycle, so an instruction that touches memory needs at least two memory cycles.
• Pipelining is limited by this shared memory path (the von Neumann bottleneck) unless caches or extra memory ports are added.
• This is the older of the two organisations; high-performance designs have moved towards Harvard-style separate instruction and data paths.
Harvard
• Named after the "Harvard Mark I", an early relay-based computer.
• The computer has two separate memories for storing data and program.
• The processor can complete an instruction in one cycle if appropriate pipelining strategies are implemented.
• Most modern computing architectures are based on the Harvard architecture, typically in a modified form with separate instruction and data caches over a common main memory. The number of stages in the pipeline varies from system to system.
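[Figure: Harvard architecture block diagram. The CPU (with its PC) connects to the data memory and to the program memory, each through its own address and data lines.]
So where is the input/output? Here, on the CPU buses.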
Code Sequence C = A + B for Four Instruction Sets
Stack            Accumulator          Register (register-memory)   Register (load-store)
Push A           Load A               Load R1, A                   Load R1, A
Push B           Add B                Add R1, B                    Load R2, B
Add              Store C              Store C, R1                  Add R3, R1, R2
Pop C                                                              Store C, R3
memory memory    acc = acc + mem[C]   R1 = R1 + mem[C]             R3 = R1 + R2
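As a rough illustration of the table above, the sketch below interprets the stack and load-store sequences on a toy machine. The opcode names follow the slide; the interpreter functions, the dictionary used as memory and the sample values are assumptions made only for this example.

```python
# Toy interpreters for two of the four instruction-set styles.
# Memory is modelled as a dict from variable name to value (an assumption).

def run_stack(program, mem):
    """Stack machine: Add takes its operands implicitly from the stack."""
    stack = []
    for op, *args in program:
        if op == "Push":
            stack.append(mem[args[0]])
        elif op == "Add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "Pop":
            mem[args[0]] = stack.pop()
    return mem


def run_load_store(program, mem):
    """Load-store machine: only Load and Store touch memory."""
    regs = {}
    for op, *args in program:
        if op == "Load":
            regs[args[0]] = mem[args[1]]
        elif op == "Add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "Store":
            mem[args[0]] = regs[args[1]]
    return mem


stack_prog = [("Push", "A"), ("Push", "B"), ("Add",), ("Pop", "C")]
ls_prog = [("Load", "R1", "A"), ("Load", "R2", "B"),
           ("Add", "R3", "R1", "R2"), ("Store", "C", "R3")]

print(run_stack(stack_prog, {"A": 2, "B": 3, "C": 0})["C"])
print(run_load_store(ls_prog, {"A": 2, "B": 3, "C": 0})["C"])
```

Both programs leave the same value in C; the difference is where the operands live while the Add executes.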
Addressing Modes
Addressing mode        Example               Action
1. Register direct     Add R4, R3            R4 <- R4 + R3
2. Immediate           Add R4, #3            R4 <- R4 + 3
3. Displacement        Add R4, 100(R1)       R4 <- R4 + M[100 + R1]
4. Register indirect   Add R4, (R1)          R4 <- R4 + M[R1]
5. Indexed             Add R4, (R1 + R2)     R4 <- R4 + M[R1 + R2]
6. Direct              Add R4, (1000)        R4 <- R4 + M[1000]
7. Memory indirect     Add R4, @(R3)         R4 <- R4 + M[M[R3]]
8. Autoincrement       Add R4, (R2)+         R4 <- R4 + M[R2]; R2 <- R2 + d
9. Autodecrement       Add R4, (R2)-         R4 <- R4 + M[R2]; R2 <- R2 - d
10. Scaled             Add R4, 100(R2)[R3]   R4 <- R4 + M[100 + R2 + R3*d]
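The effective-address arithmetic in the table can be sketched as a small function. The function name, the mode strings and the sample register and memory contents are hypothetical; d is the operand size in bytes, assumed to be 4 here.

```python
def effective_address(mode, regs, mem, base=None, index=None, disp=0, d=4):
    """Return the memory address an operand refers to for a given mode."""
    if mode == "displacement":        # 100(R1)     -> 100 + R1
        return disp + regs[base]
    if mode == "register_indirect":   # (R1)        -> R1
        return regs[base]
    if mode == "indexed":             # (R1 + R2)   -> R1 + R2
        return regs[base] + regs[index]
    if mode == "direct":              # (1000)      -> 1000
        return disp
    if mode == "memory_indirect":     # @(R3)       -> M[R3]
        return mem[regs[base]]
    if mode == "scaled":              # 100(R2)[R3] -> 100 + R2 + R3*d
        return disp + regs[base] + regs[index] * d
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"R1": 8, "R2": 16, "R3": 2}   # illustrative register contents
mem = {8: 40}                          # illustrative memory contents

print(effective_address("displacement", regs, mem, base="R1", disp=100))
print(effective_address("scaled", regs, mem, base="R2", index="R3", disp=100))
```

Register direct and immediate operands are left out because they do not produce a memory address.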
What is CISC?
• CISC (Complex Instruction Set Computer)
• Instructions which require multiple clock cycles to execute.
• Variable-length instructions, where the length often varies according to the addressing mode
• A small number of general purpose registers
• Chips that are easy to program and which make efficient use of memory. Since the earliest machines were programmed in assembly language and memory was slow and expensive, the CISC philosophy made sense.
• CISC was developed to make compiler development simpler. It shifts most of the burden of generating machine instructions to the processor.
CISC contd…
• Several special purpose registers. Many CISC designs set aside special registers for the stack pointer, interrupt handling, and so on. This can simplify the hardware design somewhat, at the expense of making the instruction set more complex.
• But recent changes in software and hardware technology have forced a re-examination of CISC, and many modern CISC processors are hybrids, implementing many RISC principles.
• Most common microprocessor designs, such as the Intel 80x86 and Motorola 68K series, followed the CISC philosophy.
• CISC was also implemented in such large computers as the PDP-11 and the DECsystem 10 and 20 machines.
• E.g. the Pentium is considered a modern CISC processor.
CISC Disadvantage
• The instruction set and chip hardware become more complex with each generation of computers.
• Many specialized instructions aren't used frequently enough to justify their existence: approximately 20% of the available instructions are used in a typical program.
• CISC instructions typically set the condition codes as a side effect. Not only does setting the condition codes take time, but programmers have to remember to examine the condition code bits before a subsequent instruction changes them.
What is RISC?
• RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture that utilizes a small and highly optimized set of instructions.
• RISC processors aim for a CPI (cycles per instruction) of one.
• Pipelining: a technique that allows for simultaneous execution of parts, or stages, of instructions to process instructions more efficiently.
• Large number of registers: the RISC design philosophy generally incorporates a larger number of registers to prevent large amounts of interaction with memory.
• Early RISC designs: the IBM 801, Stanford MIPS, and Berkeley RISC I and II.
RISC contd…
• Less complex, simple instructions.
• Hardwired control unit and machine instructions.
• Few addressing schemes for memory operands, with only two basic instructions, LOAD and STORE
• Many symmetric registers which are organised into a register file.
Big & Little Endian
• Little endian machines store the least significant byte first (at the lowest address), followed by the more significant bytes.
• Big endian machines store the most significant byte first (at the lower address).
• As an example, suppose we have the hexadecimal number 12345678.
• The big endian and little endian arrangements of the bytes, from the lowest address upwards, are:
– Big endian: 12 34 56 78
– Little endian: 78 56 34 12
• Big endian:
– Is more natural.
– The sign of the number can be determined by looking at the byte at address offset 0.
– Strings and integers are stored in the same order.
• Little endian:
– Makes it easier to place values on non-word boundaries.
– Conversion from a 16-bit integer address to a 32-bit integer address does not require any arithmetic.
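A quick way to check the two byte orders for the example value is Python's built-in `int.to_bytes`; the helper name `as_hex` is just for this sketch.

```python
value = 0x12345678

# Serialise the same 32-bit value in both byte orders.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")


def as_hex(data):
    """Format bytes as space-separated hex, lowest address first."""
    return " ".join(f"{b:02x}" for b in data)


print("big endian:   ", as_hex(big))      # 12 34 56 78
print("little endian:", as_hex(little))   # 78 56 34 12

# Reading the bytes back with the matching byte order recovers the value.
assert int.from_bytes(little, byteorder="little") == value
```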
80x86 Instruction Frequency
Rank   Instruction     Frequency
1      load            22%
2      branch          20%
3      compare         16%
4      store           12%
5      add             8%
6      and             6%
7      sub             5%
8      register move   4%
9      call            1%
10     return          1%
Total                  96%
Micro-controller (uC)
• Program (ROM) and data memory (RAM)
• Provision for extension of memory
• Simple modes of addressing: direct/indirect addressing
• Special Function Registers
Microcontroller architecture
• In addition to the processor:
– On-chip memory (RAM, ROM)
– Clocking
– I/O pins
– Interrupts
– Timers
– Peripherals
– Serial I/Os
– ADC inputs
– DAC outputs
– PWM outputs
• Meant for low computation tasks
– Can handle industrial control applications
– Can also work as a supporting chip to the main processor
– All peripherals are made of microcontrollers: Ethernet, USB, Serial, Wi-Fi, FireWire, Bluetooth, etc.
Power Architecture
• POWER: Performance Optimization With Enhanced RISC
• IBM came first with the RISC System/6000 (RS/6000)
• The Power architecture incorporated many RISC attributes:
– fixed-length instructions
– register-to-register architecture
– simple addressing modes
– large general register file
– three-operand instruction format
• More characteristics from complex ISAs
• Designed to be superscalar
• Compound instructions
• The AIM alliance was formed, which resulted in PowerPC
PowerPC Architecture
o In order to maintain RS/6000 software compatibility, the PowerPC adapted the POWER architecture, and many enhancements were added to provide a low-cost, single-chip, superscalar, multiprocessor-capable, 64-bit processor
• Support for operation in both big-endian and little-endian modes
• Single and double precision floating-point arithmetic
• 64-bit architecture, backward compatible to 32-bit
• Complex string instructions were left out, consistent with the RISC philosophy
• Several bit/field instructions that use three source operands were eliminated to avoid the need for extra register ports.
• Instructions whose operation was dependent on the value of a source operand were eliminated.
• Precision shifts, integer multiplies, and divide-with-remainder instructions were omitted.
PowerPC family
o PowerPC 601:
• includes a more sophisticated branch unit
• capable of dispatching three “out-of-order” instructions per cycle
• up to 8 instructions per cycle can be fetched directly into an eight-entry instruction queue (IQ), where they're decoded before being dispatched to the execution core
• medium-sized, medium-performance processor
• Branch folding: the instruction queue is used for detecting and dealing with branches. The branch unit scans the bottom four entries of the queue, identifying branch instructions and determining what type they are (conditional, unconditional).
o PowerPC 603:
• smaller die size than the 601
• smaller cache
• capable of dispatching three “out-of-order” instructions per cycle
o The 604 and 620 microprocessors were developed later in the PowerPC production line. Both aimed for higher performance. The 604 was based on the 32-bit architecture, while the 620 is a 64-bit architecture.
PowerPC family
– PowerPC e200: 32-bit Power Architecture microprocessor, speeds ranging up to 600 MHz, ideal for embedded applications.
– PowerPC e300: similar to the e200, with speeds up to 667 MHz.
– PowerPC e600: speeds up to 2 GHz, ideal for high-performance routing and telecommunications applications.
– POWER5: IBM, dual-core μP
– POWER6: IBM, dual-core μP. A notable difference from POWER5 is that the POWER6 executes instructions in-order instead of out-of-order.
– PowerPC G3: Apple Macintosh computers such as the PowerBook G3, the multicolored iMacs, iBooks and several desktops, including both the Beige and Blue and White Power Macintosh G3s.
– PowerPC G4: a designation used by Apple Computer to describe a fourth generation of 32-bit PowerPC microprocessors.
– PowerPC G5: 64-bit Power Architecture processors
– Xenon: based on IBM's PowerPC ISA, Xbox 360 game console.
– Broadway: based on IBM's PowerPC ISA, Nintendo Wii gaming console
– Blue Gene/L: dual-core PowerPC 440, 700 MHz, 2004
– Blue Gene/P: quad-core PowerPC 450, 850 MHz, 2007
PowerPC G4e Pipelining
• Seven-stage pipeline
• Superscalar microprocessor: allows multiple instructions to be executed in parallel.
• Nine execution units
– BPU: Branch Processing Unit
– VPU: Vector Permute Unit
– VIU: Vector Integer Unit
– VCIU: Vector Complex Integer Unit
– VFPU: Vector Floating Point Unit
– FPU: Floating Point Unit
– IU: Integer Unit
– CIU: Complex Integer Unit
– LSU: Load/Store Unit
Pros and Cons
• Instruction set
– 200 machine instructions
  • More complex than most RISC machines
  • e.g. floating-point “multiply and add” instructions that take three input operands
  • e.g. load and store instructions may automatically update the index register to contain the just-computed target address
– Pipelined execution
  • More sophisticated than SPARC
• Input and output
– Two different modes
  • Direct-store segment: map virtual address space to an external address space
  • Normal virtual memory access
• Permits a range of implementations, from low-cost controllers through high-performance processors.
ARM (Advanced RISC Machine)
• ARM is a leading IP provider of high-performance, low-cost, power-efficient processors, peripherals and SoCs through involvement with the Virtual Socket Interface Alliance (VSIA) and Virtual Component Exchange (VCX)
• Four major OS platforms supported
– Embedded CE, Linux, Symbian and Palm OS
• Does not manufacture chips; it provides services to 40 licensed partners and finally validates test chips
• ARM's Global Technology Partner Network is the largest in the industry
ARM’s solution
• ARM does not only present hardened macros and synthesizable CPUs to the industry
• It provides the ASIC infrastructure in the form of AMBA, the PrimeCell peripherals, and models and modeling tools for the cores
• There is also the need for ARM to pursue ports for RTOSs, develop debug hardware and software development tools, and, of course, embedded software for "off-the-shelf" integration
• ARM is a full-solutions provider, supporting a broad range of applications
ARM architecture
• Many SoCs are built around ARM
– Apple's A4/A5/A5X, Nvidia's Tegra
– Samsung's Exynos, TI's OMAP and DaVinci platforms, Freescale's i.MX51/53, etc.
– Qualcomm's Snapdragon series, etc.
ARM architecture
• The ARM uses a modified Harvard, load/store architecture, i.e.,
– Only a 32-bit data bus for both instructions and data.
– Only the load/store instructions (and SWP) access memory.
• Memory is addressed as a 32-bit address space
• Most ARMs implement two instruction sets
– 32-bit ARM instruction set
– 16-bit Thumb instruction set
• Jazelle cores can also execute Java bytecode
• Execution modes
– When the processor is executing in ARM state (32)
– When the processor is executing in Thumb state (16)
– When the processor is executing in Jazelle state (8)
• DSP instructions (multiply-accumulate)
ARM block diagram
[Figure: ARM-based system block diagram showing the ARM core, on-chip RAM, TIC, arbiter, decoder, external bus interface (external ROM and external RAM), bridge, timer, interrupt controller, remap/pause and reset blocks on a system bus (AHB or ASB) and a peripheral bus (APB).]
• AMBA
– Advanced Microcontroller Bus Architecture
• ADK
– Complete AMBA Design Kit
• ACT
– AMBA Compliance Testbench
• PrimeCell
– ARM’s AMBA compliant peripherals
Thumb
• Thumb is a 16-bit instruction set
– Optimised for code density from C code (~65% of ARM code size)
– Improved performance from narrow memory
– Subset of the functionality of the ARM instruction set
• Core has an additional execution state: Thumb
– Switch between ARM and Thumb using the BX instruction
[Figure: the 32-bit ARM instruction ADDS r2,r2,#1 (bits 31 to 0) next to the 16-bit Thumb instruction ADD r2,#1 (bits 15 to 0).]
• For most instructions generated by the compiler:
– Conditional execution is not used
– Source and destination registers are identical
– Only low registers are used
– Constants are of limited size
Microprocessor Without Interlocked Pipeline Stages (MIPS)
• Main memory used for composite data– Arrays, structures, dynamic data
• Memory is byte addressed– Each address identifies an 8-bit byte
• Words are aligned in memory– Address must be a multiple of 4
• MIPS is Big Endian
• Reg 0 is the Constant Zero ($zero)
• The R10000 has three pipelines: a five-stage pipeline for integer instructions, a seven-stage pipeline for floating-point instructions, and a six-stage pipeline for LOAD/STORE instructions.
• In all MIPS ISAs, only the LOAD and STORE instructions can access memory
• The ISA uses only base addressing mode
• MIPS Instruction sets MIPS1/2/3/4/5, MIPS32, MIPS64
• R2000/3000/4000 to R16000 etc
MIPS
• The stored-program concept:
– Instructions are represented as numbers
– Programs can be stored in memory to be read or written just like data
• MIPS
– ISA developed in the early 80’s (RISC)
– Similar to other RISC architectures developed since the 1980's
– Almost 100 million MIPS processors manufactured in 2002
– Used by NEC, Nintendo, Cisco, Silicon Graphics, Sony, …
– Regular (32-bit instructions, small number of instruction formats)
– Relatively small number of instructions
– Register architecture (all instructions operate on registers)
– Load/store architecture (memory accessed only with load/store instructions, with few addressing modes)
– All arithmetic instructions have 3 operands
– Operand order is fixed
Design Principles for MIPS
• Simplicity favors regularity
– All instructions 32 bits
– All instructions have 3 operands
• Smaller is faster
– Only 32 registers
• Good design demands good compromises
– All instructions are the same length
– Limited number of instruction formats: R, I, J
• Make common cases fast
– 16-bit immediate constant
– Only two branch instructions
– Every ISA designed after 1980 uses a load-store ISA (i.e. RISC, to simplify CPU design)
MIPS contribution
[Chart: yearly figures for 1998 to 2002 on a vertical scale of 0 to 1400, broken down by architecture: ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, Other.]
• Cable modems 94%
• DSL modems 40%
• VDSL modems 93%
• IDTV 40%
• Cable STBs 76%
• DVD recorders 75%
• Game consoles 76%
• Office automation 48%
• Color laser printers 62%
• Commercial color copiers 73%
• Source: Website of MIPS Technologies, Inc., 2004.
Java Virtual Machine (JVM)
• Java runs on the JVM
• A JVM is written in a native language for a wide array of processors, including MIPS and Intel
• Like a real machine, the JVM has an ISA all of its own, called bytecode. This ISA was designed to be compatible with the architecture of any machine on which the JVM is running
• Java bytecode is a stack-based language.
• Most instructions are zero address instructions.
• The JVM has four registers that provide access to five regions of main memory.
• All references to memory are offsets from these registers. Java uses no pointers or absolute memory references.
• Java was designed for platform interoperability, not performance!
General DSP Architecture
• Hard to find a good definition: changing or analyzing information which is measured as discrete sequences of numbers
• Most share common features:
– They use a lot of maths (multiplying and adding signals)
– They deal with signals that come from the real world
– They require a response in a certain time
DSP
• DSP support for parallel moves
– Need to fetch the next coefficient and the next stored value at each step in the filter
– DSPs generally support a parallel move or fetch operation while the MAC is computed
– This design avoids idle ALU and data buses
• DSP algorithms often have “multiply-accumulate” requirements: coef[n] * data[n], where two operands must be fetched
• A simple FIR filter is given by
$$y[n] = \sum_{i=0}^{N-1} b_i \, x[n-i]$$
• Digital filters require accumulated sum-of-products
• Multiple address generators to handle separate memory spaces
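The FIR sum above maps directly onto a multiply-accumulate loop. The sketch below is plain Python, not DSP code; the function name `fir`, the zero initial condition (x[k] = 0 for k < 0) and the sample coefficients are assumptions for illustration.

```python
def fir(b, x):
    """Compute y[n] = sum over i of b[i] * x[n - i], taking x[k] = 0 for k < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, coef in enumerate(b):
            if n - i >= 0:
                # One multiply-accumulate (MAC): two operand fetches per step,
                # which is why DSPs pair the MAC with parallel moves.
                acc += coef * x[n - i]
        y.append(acc)
    return y


# Two-tap moving average as a minimal example.
print(fir([0.5, 0.5], [1.0, 3.0, 5.0, 7.0]))
```

Each output sample costs N MACs, so a DSP with a single-cycle MAC and parallel operand fetch can produce one tap per clock.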
DSP performance comparison
Architectural overview
• Harvard architecture
• On-chip memory
• ALU
• Multiplier
• On-chip I/Os
• Separate address spaces for program memory, data memory, and I/O
• Pipelined operations
• Single-cycle multiply-accumulate capability
• Specialized addressing modes
• Specialized execution control
• Irregular instruction sets
• Support for complex instructions
• Multiple computing units to support data handling in parallel
• More registers for faster data access
• Higher bus bandwidth
Irregular Instruction Sets
Unlike general microprocessors, DSP instructions allow arithmetic operations to be carried out in parallel with data moves.
Four instructions in an execution set:
DALU Instr          DALU Instr    AGU Instr             AGU Instr
MACR -D0, D1, D7    AND D4, D5    MOVE.L (R0)+N0, R6    ADDA R2, R3
Specialized execution control
DSP processors provide a loop instruction for fast nesting of repetitive operations. This is usually done in hardware to increase the speed.
Direct comparison
Processor     MHz   MIPS   DSP Benchmarks   ISR Latency   Power           Price   Dimensions (in)
Pentium MMX   233   233    49               1.38 us       4.25 W          $213    5.5 x 2.47 x .647
Pentium MMX   266   266    56               1.38 us       4.85 W          $348    5.5 x 2.47 x .647
TMS320C62     120   960    62               0.09 us       1.14 W (est.)   $25     1.3 x 1.3 x .07
TMS320C62     200   1600   103              0.09 us       1.9 W           $96     1.3 x 1.3 x .07
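One way to read the comparison is to normalise the DSP benchmark score by power and by price. The sketch below uses only the figures from the table; the derived metrics (score per watt, score per dollar) and the function names are assumptions added for illustration, not part of the original comparison.

```python
# (label, DSP benchmark score, power in watts, price in dollars), from the table.
rows = [
    ("Pentium MMX 233", 49, 4.25, 213),
    ("Pentium MMX 266", 56, 4.85, 348),
    ("TMS320C62 120", 62, 1.14, 25),   # power figure is an estimate in the source
    ("TMS320C62 200", 103, 1.9, 96),
]


def per_watt(row):
    """DSP benchmark score per watt."""
    return row[1] / row[2]


def per_dollar(row):
    """DSP benchmark score per dollar."""
    return row[1] / row[3]


for row in rows:
    print(f"{row[0]:<18}{per_watt(row):8.1f} /W {per_dollar(row):8.2f} /$")
```

On these numbers the DSP parts come out several times ahead of the general purpose processors on both derived metrics.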
GPU
• In 1999, Nvidia introduced the GeForce 256, marketed as the first GPU, a fixed-function device
• ATI & 3dfx also made their devices
• General architecture of the Nvidia 8800
• 8 thread processing clusters (TPC)
• Each TPC has two streaming multiprocessors (SM)
• Each SM has 8 scalar processors (SP)
• Each SP is equipped with its own ALU and FPU
GPU Architecture
• GPUs focus on increasing raw compute power, so that more primitives (vertices, triangles, pixels) can be processed
• GPUs keep moving to smaller transistor sizes to dramatically increase the number of processors, aiming at ever-larger data throughput
• CPUs, rather, focus on instruction-level parallelism and reducing latency
• A GPU contains more ALUs than a CPU, which implies higher arithmetic throughput, with less emphasis on cache and control unit
• Many parallel arithmetic ops, means same ops on huge data set
• Graphics is best example for parallel rendering of pixels
• However programmer has to parallelize app suitably
• Sqrt of array of numbers taken on quad core Xeon (2.33 GHz) and NVIDIA® Tesla C870 (1.35 GHz), GPU emerged as ~ 400 times faster
• Restrictive memory access compared to CPU
GPU architecture example
Fermi architecture of GPU
Elaborated view of SM
Softcore processor
• They are utilized in the FPGA design flow
• They are utilized in SoC development
• They are available in various flavors
– PicoBlaze, MicroBlaze, ARM, Nios II, LEON3/4, CPU86, TSK3000A, TSK51/52, Cortex-M1, OpenRISC
• They can be programmed like a normal CPU
• CPU footprint is under user control
• Multiple instances can be created
• Ideal when the application demands both an embedded and an FPGA approach
Requirement analysis
• Study of dataset
– Is there parallelism?
• Timing requirement of application
– Soft real-time, hard real-time
• I/O bound or CPU bound applications
• Algorithmic complexity
• Multitasking or non-multitasking solution
– Scheduler based application
– Monolithic application
• Heterogeneous tasking solution
– Single card or multi card solution
– Bus based data sharing, or through dedicated I/Os or interface
Requirement analysis
• Time to market
– Buying for R&D/learning purpose
– To be used in field application
• Availability of part in extended temperature range or MIL grade
• Overall cost of development
– In-house efforts
– Cost of customization
• Availability of development tools
– Open source supported
– Only proprietary tools
General Purpose Hardware
• PC based hardware is often called general purpose hardware
– Day-to-day documentation & presentation
– Offline data analysis
– Simulation of activities
– Gaming, database, multimedia applications
– Internet based applications
  • Mail, browsing, e-transactions and online database applications
• No pressure of time
• More of sequential processing
• When you need more interaction with the system
• Sometimes, it works as a console for many systems
• As a development host
• A PC has powerful hardware, but it is highly underutilized as a PC
– E.g. Intel or AMD processor based PC
Hardware for Multimedia App
• Video encoding, video decoding & image compression
– Possible with DSPs like C64xx, C67xx from TI
– DaVinci devices like DM365, DM368, DM6467t, DM642, etc.
– Freescale i.MX51, i.MX53, etc.
• Application
– Video transmission:
  • LAN, WAN, Internet, surveillance purpose, CCTV coverage
– Recording:
  • CD, DVD, in-built recording in defence equipment
  • Handheld cameras and camcorders
  • DTH services, IPTV service
Hardware for video processing in defence equipments
• Single video processing
– DSPs are preferred
– OEM supported image/video processing APIs are provided as development framework
– Convenient to use (single front end)
• Multi-video processing
– FPGAs are preferred
– GPUs can also be used
– Developer has to develop every module
– May take advantage of IP cores for complex processing modules
– More compact solution is possible
Can we live with open source solution?
• Open source h/w architecture
– ARM
• Open source mobile platform kernel
– Android, a big example
• Open source development tools
– Linux, Mozilla, Thunderbird, Java, MySQL, Tomcat server, Apache server, Qt, etc.
• Open source API for dedicated purpose
– OpenCV, OpenGL, OpenCL, Live555, FFmpeg, etc.
• Yes: we can definitely live with it
RTOSes
• pSOS from Integrated Systems, Inc.
• VxWorks from Wind River
• Integrity from Green Hills
• QNX
• RTLinux
• Pico Linux
• MontaVista Linux
• Embedded NT
• Etc.
Conclusion
• Identifying the computing and I/O needs comes first
• Find the availability of prototyping tools and hardware
• ………………………..• ………………………..• ………………………..
Thank you
Relative Frequency of Control Instructions
Operation     SPECint92   SPECfp92
Call/Return   13%         11%
Jumps         6%          4%
Branches      81%         87%
• Design hardware to handle branches quickly, since these occur most frequently
University of Pittsburgh: MIPS Instruction Set Architecture
MIPS Architecture
• Design “philosophies” for ISAs: RISC vs. CISC
• Execution time =
– instructions per program * cycles per instruction * seconds per cycle
• MIPS is an implementation of a RISC architecture
• MIPS R2000 ISA
– Designed for use with high-level programming languages
  • small set of instructions and addressing modes, easy for compilers
– Minimize/balance amount of work (computation and data flow) per instruction
  • allows for parallel execution
– Load-store machine
  • large register set, minimize main memory access
– Fixed instruction width (32 bits), small set of uniform instruction encodings
  • minimize control complexity, allow for more registers
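The execution time relation above can be worked through with a short calculation. The function name and the sample workload (one billion instructions, CPI of 1.5, 500 MHz clock) are made-up values for illustration only.

```python
def execution_time(instructions, cpi, clock_hz):
    """Execution time = instructions * cycles per instruction * seconds per cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle


# Hypothetical workload: 1e9 instructions at CPI 1.5 on a 500 MHz clock.
baseline = execution_time(1_000_000_000, 1.5, 500e6)
# Same program and clock, but with CPI brought down to 1.0.
improved = execution_time(1_000_000_000, 1.0, 500e6)

print(f"baseline: {baseline:.2f} s, improved: {improved:.2f} s")
```

The three factors trade off against each other: a simpler instruction set may raise the instruction count while lowering CPI and cycle time.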
MIPS Instructions
• MIPS instructions fall into 5 classes:
– Arithmetic/logical/shift/comparison
– Control instructions (branch and jump)
– Load/store
– Other (exception, register movement to/from GP registers, etc.)
• Three instruction encoding formats:
– R-type (6-bit opcode, 5-bit rs, 5-bit rt, 5-bit rd, 5-bit shamt, 6-bit function code)
– I-type (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate)
– J-type (6-bit opcode, 26-bit pseudo-direct address)
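The field widths above are enough to pack an instruction word by hand. The sketch below encodes one R-type and one I-type instruction; the helper names are hypothetical, and the register numbers ($t0 = 8, $s1 = 17, $s2 = 18), the add function code (0x20) and the addi opcode (8) are the standard MIPS32 values.

```python
def encode_r(rs, rt, rd, shamt, funct, opcode=0):
    """R-type: opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(opcode, rs, rt, imm):
    """I-type: opcode(6) | rs(5) | rt(5) | immediate(16, two's complement)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# add $t0, $s1, $s2
add_word = encode_r(rs=17, rt=18, rd=8, shamt=0, funct=0x20)
# addi $t0, $s1, -1
addi_word = encode_i(opcode=8, rs=17, rt=8, imm=-1)

print(f"{add_word:#010x}")    # 0x02324020
print(f"{addi_word:#010x}")   # 0x2228ffff
```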
MIPS ISA
• MIPS pipeline stages
– Fetch (F)
  • read next instruction from memory, increment address counter
  • assume 1 cycle to access memory
– Decode (D)
  • read register operands, resolve instruction in control signals, compute branch target
– Execute (E)
  • execute arithmetic/resolve branches
– Memory (M)
  • perform load/store accesses to memory, take branches
  • assume 1 cycle to access memory
– Write back (W)
  • write arithmetic results to register file
Pipeline Implementation
• Idea:
– Goal of MIPS: CPI <= 1
– Some instructions take longer to execute than others
– Don’t want cycle time to depend on slowest instruction
– Want 100% hardware utilization
– Split execution of each instruction into several balanced “stages”
– Each stage is a block of combinational logic
– Latency of each stage fits within 1 clock cycle
– Insert registers between each pipeline stage to hold intermediate results
– Execute each of these steps in parallel for a sequence of instructions
– “Assembly line”
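The "assembly line" effect can be put into numbers with an idealised cycle count. The sketch assumes every stage takes exactly one cycle and that no hazards stall the pipeline; the function names and the sample sizes are illustrative.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Fill the pipeline once, then retire one instruction per cycle."""
    return n_stages + (n_instructions - 1)


n, stages = 1000, 5
slow = unpipelined_cycles(n, stages)
fast = pipelined_cycles(n, stages)

print(f"unpipelined: {slow} cycles, pipelined: {fast} cycles")
print(f"speedup: {slow / fast:.2f}, effective CPI: {fast / n:.3f}")
```

For long instruction sequences the speedup approaches the number of stages and the effective CPI approaches one, which is the ideal that hazards then erode.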
Hazards
• Hazards are data flow problems that arise as a result of pipelining
– They limit the amount of parallelism and sometimes induce “penalties” that prevent one instruction per clock cycle
– Structural hazards
  • Two operations require a single piece of hardware
  • Structural hazards can be overcome by adding additional hardware
– Control hazards
  • Conditional control instructions are not resolved until late in the pipeline, requiring subsequent instruction fetches to be predicted
    – Flushed if prediction does not hold (make sure no state change)
  • Branch hazards can use dynamic prediction/speculation, branch delay slot
– Data hazards
  • An instruction in one pipeline stage is dependent on data computed in another pipeline stage
Terminology
• Hyper-Threading (HT)
• Turbo Boost/Turbo Core
• QuickPath Interconnect (QPI)/HyperTransport
• Tri-Gate (3D) transistor
• Cool'n'Quiet
• CoolCore
• Vector processing
• Superscalar architecture
• VLIW architecture
Technical point of view: RTOS vs OS
OS                                        RTOS
Multitasking and multiuser                Multitasking but not multiuser
Kernel size in 10s of MB                  Kernel size from a few KB to 2 MB
All features are bundled                  Scalable feature set
Native GUI support                        3rd party app is needed to support GUI
User has no control over context switch   Context switch time is very short
Preemption is not guaranteed              Guaranteed preemption of tasks
Computing Hardware
• Dedicated & timed task
– DSP or dedicated SoC or general CPU
• Parallelism in dataset? Use parallel hardware like FPGA, GPGPU
– Image & video processing
– Weather forecasting
– Stock market prediction
– Bio-inspired computation