экосистема для CPU/GPU/DSP · Илья Перминов Тимур Палташев...
Transcript of экосистема для CPU/GPU/DSP · Илья Перминов Тимур Палташев...
Илья Перминов
Тимур Палташев
23.10.2013
Гетерогенная архитектура HSA:
экосистема для CPU/GPU/DSP
НИУ ИТМО
2 МСКФ’2013 | October 2013 |
ЧТО ТАКОЕ HSA
CPU Memory (Coherent)
CPU
1
CPU
N …
GPU Memory
GPU CPU
2
PCIe
Heterogeneous System Architecture
= HSA
3 МСКФ’2013 | October 2013 |
ЧТО ТАКОЕ HSA
CPU Memory (Coherent)
CPU
1
CPU
N …
GPU Memory
GPU CPU
2
PCIe
GPGPU = ? Гетерогенные
вычисления
4 МСКФ’2013 | October 2013 |
ЧТО ТАКОЕ HSA
GPGPU = ? Гетерогенные
вычисления
5 МСКФ’2013 | October 2013 |
ЧТО ТАКОЕ HSA
GPGPU = ? Гетерогенные
вычисления
6 МСКФ’2013 | October 2013 |
ЧТО ТАКОЕ HSA
GPGPU = ? Гетерогенные
вычисления
7 МСКФ’2013 | October 2013 |
HSA: HETEROGENEOUS SYSTEM ARCHITECTURE
ПАРАЛЛЕЛЬНАЯ
НАГРУЗКА
(ОДНО)ПОТОЧНАЯ
НАГРУЗКА
hUMA (MEMORY)
8 МСКФ’2013 | October 2013 |
HSA HIGH LEVEL ARCHITECTURE
CPU
GPU
Audio
Process
or
Video
Hardware
DSP
Image
Signal
Processi
ng
Fixed
Function
Acctr
Encode
Decode
Sh
are
d M
em
ory
Co
here
ncy,
User
Mo
de
Qu
eu
es
9 МСКФ’2013 | October 2013 |
СОСТАВ HSA FOUNDATION - 2013
Founders
Promoters
Supporters
Contributors
Academic
10 МСКФ’2013 | October 2013 |
PHYSICAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
Различные вирутальные адресные пространства
CPU0 GPU
VIRTUAL MEMORY1
PHYSICAL MEMORY
VA1->PA1 VA2->PA1
VIRTUAL MEMORY2
11 МСКФ’2013 | October 2013 |
ЧТО МЫ ИМЕЕМ СЕГОДНЯ
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
12 МСКФ’2013 | October 2013 |
ЧТО МЫ ИМЕЕМ СЕГОДНЯ
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
13 МСКФ’2013 | October 2013 |
ЧТО МЫ ИМЕЕМ СЕГОДНЯ
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
14 МСКФ’2013 | October 2013 |
ОБЩАЯ КОГЕРЕНТНАЯ ПАМЯТЬ (HUMA)
15 МСКФ’2013 | October 23, 2013 |
Physical Memory
GPU
HW Coherency
Virtual Memory
C
P
U
Entire memory space:
Both CPU and GPU can access and
allocate any location in the system’s
virtual memory space
Cache Cache
Coherent Memory: Ensures CPU and
GPU
caches both see
an up-to-date view
of data Pageable memory:
The GPU can seamlessly
access virtual memory
addresses that are not
(yet)
present in physical
memory
HSA KEY FEATURES
16 МСКФ’2013 | October 23, 2013 |
GPU
CPU
CPU Memory GPU Memory
| |
| |
|
| |
| |
|
| |
| |
|
| |
| |
|
CPU explicitly copies data to GPU memory
GPU completes computation
CPU explicitly copies result back to CPU memory
Без HSA
17 МСКФ’2013 | October 23, 2013 |
CPU / GPU Uniform Memory
| |
| |
|
| |
| |
|
HSA CPU simply passes a pointer to GPU
GPU complete computation
CPU can read the result directly – no copying needed!
GPU
CPU
18 МСКФ’2013 | October 2013 |
ОБЩАЯ ПАМЯТЬ
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
19 МСКФ’2013 | October 2013 |
ОБЩАЯ КОГЕРЕНТНАЯ ПАМЯТЬ
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
20 МСКФ’2013 | October 2013 |
УПРАВЛЕНИЕ ОЧЕРЕДЯМИ ЗАДАЧ
21 МСКФ’2013 | October 2013 |
ДИСПЕТЧЕРИЗАЦИЯ В РЕЖИМЕ ПОЛЬЗОВАТЕЛЯ
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
22 МСКФ’2013 | October 2013 |
HSA
Application OS GPU
Queue Job
Start Job
Finish Job
23 МСКФ’2013 | October 2013 |
ПОДДЕРЖКА ТЕХНОЛОГИЙ ПРОГРАММИРОВАНИЯ
C++ AMP
C++
C#
OpenCL
OpenMP
Java
Python
…
24 МСКФ’2013 | October 2013 |
LANGUAGE SUPPORT ALLOWS HSA TO HAVE
MULTIPLE SOFTWARE EXECUTION MODELS
Kernel Fusion
Driver (KFD)
HSA Core
Runtime HSA Finalizer
HSA Helper
Libraries
OpenCL App
OpenCL
Runtime
Java
Sumatra App
Java JVM
Runtime
Python
Fabric
App
Fabric Engine
RT
HSAIL
25 МСКФ’2013 | October 2013 |
JAVA ENABLEMENT BY APARAPI
Developer creates Java™ source
Source compiled to class files (bytecode) using standard compiler
Aparapi = Runtime capable of converting Java™ bytecode to OpenCL™
For execution on any
OpenCL™ 1.1+ capable device
OR execute via a thread pool if OpenCL™ is not available
26 МСКФ’2013 | October 2013 |
ПЛАНЫ ПО ПОДДЕРЖКЕ JAVA
CPU ISA GPU ISA
JVM
Application
APARAPI
GPU CPU
OpenCL™
CPU ISA GPU ISA
JVM
Application
APARAPI
HSA CPU HSA CPU
HSA Finalizer
HSAIL
CPU ISA GPU ISA
JVM
Application
APARAPI
HSA CPU HSA CPU
HSA Finalizer
HSAIL
HSA Runtime
LLVM Optimizer
IR
CPU ISA GPU ISA
Sumatra Enabled JVM
Application
HSA CPU HSA CPU
HSA Finalizer
HSAIL
27 МСКФ’2013 | October 2013 |
LOOKING FOR FACES IN ALL THE RIGHT PLACES
Quick HD Calculations
Search square = 21 x 21
Pixels = 1920 x 1080 = 2,073,600
Search squares = 1900 x 1060 = ~2 Million
28 МСКФ’2013 | October 2013 |
LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME
29 МСКФ’2013 | October 2013 |
LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME
More HD Calculations
70% scaling in H and V
Total Pixels = 4.07 Million
Search squares = 3.8 Million
30 МСКФ’2013 | October 2013 |
HAAR CASCADE STAGES
Feature l
Feature m
Feature p
Feature r
Feature q
Feature k
Stage N
Stage N+1
Face still possible? Yes
No
REJECT FRAME
31 МСКФ’2013 | October 2013 |
22 CASCADE STAGES, EARLY OUT BETWEEN EACH
STAGE 22 STAGE 21 STAGE 2 STAGE 1
NO FACE
FACE CONFIRMED
Final HD Calculations
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs
Calculation Rate
30 frames/sec = 1.4TCalcs/second
60 frames/sec = 2.8TCalcs/second
…and this only gets front-facing faces
32 МСКФ’2013 | October 2013 |
CASCADE DEPTH ANALYSIS
0
5
10
15
20
25Cascade Depth
20-25 15-20 10-15 5-10 0-5
33 МСКФ’2013 | October 2013 |
UNBALANCING DUE TO EXITS IN EARLIER
CASCADE STAGES
When running on the GPU, we run each search rectangle on a separate work item
Early out algorithms, like HAAR, exhibit divergence between work items
Some work items exit early
Their neighbors continue
SIMD packing suffers as a result
Live
Dead
34 МСКФ’2013 | October 2013 |
0
10
20
30
40
50
60
70
80
90
100
1 2 3 4 5 6 7 8 9-22
Tim
e (
ms)
Cascade Stage
A10-4600M (6CU@497Mhz, 4 cores@2700Mhz)
GPU CPU
PROCESSING TIME/STAGE
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
35 МСКФ’2013 | October 2013 |
ПРОИЗВОДИТЕЛЬНОСТЬ CPU-VS-GPU
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 MHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
0
2
4
6
8
10
12
0 1 2 3 4 5 6 7 8 22
Imag
es/S
ec
Number of Cascade Stages on GPU
AMD A10-4600M APU (6CU@497Mhz, 4 cores@2700Mhz)
CPU HSA GPU
36 МСКФ’2013 | October 2013 |
HAAR SOLUTION RUN DIFFERENT CASCADES ON GPU AND CPU
By seamlessly sharing data between CPU and GPU,
HSA allows the right processor to handle its appropriate workload
+2.5x
-60%
INCREASED
PERFORMANCE DECREASED ENERGY
PER FRAME
37 МСКФ’2013 | October 2013 |
LINES-OF-CODE AND PERFORMANCE FOR
DIFFERENT PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
0
50
100
150
200
250
300
350
LO
C
Copy-back Algorithm Launch Copy Compile Init Performance
Serial CPU TBB Intrinsics+TBB OpenCL™-C OpenCL™ -C++ C++ AMP HSA Bolt
Pe
rform
an
ce
35.00
30.00
25.00
20.00
15.00
10.00
5.00
0 Copy-back
Algorithm
Launch
Copy
Compile
Init.
Copy-back
Algorithm
Launch
Copy
Compile
Copy-back
Algorithm
Launch
Algorithm
Launch
Algorithm
Launch
Algorithm
Launch
Algorithm
Launch
(Exemplary ISV “Hessian” Kernel)
(“Hessian” kernel)
38 МСКФ’2013 | October 2013 |
MORE INFO AT
http://hsafoundation.com
39 МСКФ’2013 | October 2013 |
HSA РЕШЕНИЯ ОТ ARM (CORTEX-A15 MALI-T600)
• Google Chromebook
• Google Nexus 10
• InSignal Arndale Community Board
40 МСКФ’2013 | October 2013 |
AMD “KAVERI” 2.0 PLATFORM DETAILS
APU Features
New “SteamrollerB” CPU Core with up to 20% performance* increase over “Richland” – Up to 4 Steamroller cores and 4 MB total L2 cache
– Temperature Smart Turbo Core
New Power Optimized Graphics Core Next with up to 30% performance* increase over “Richland” – Multiple DirectX® 11.1 GPU configurations
– Dual Graphics support with “Crystal” Series
New AMD fixed function acceleration – UVD 4.2 Universal Video Decode Engine
– VCE 2.0 Video Compression Engine
– New ACP (Audio Co-Processor)
– SAMU 2.1 Secure Asset Management Unit
New Display and I/O Features – New PCIe Gen3 x16 for discrete GPU expansion
– New Dedicated PCIe SSD interface
– PCIe Gen2 1 x4, 4 x1, 1x4 UMI
– 4096 x 2160 resolution per display output
– 16K x 16K Max Eyefinity SLS resolution
– “Lightning Bolt” Docking Solution
Power Management and Battery Life – 35W to 15W TDPs
– Targeting ~11 Hours Battery Life MM07* 62WHr (~8.5Hr 45WHr) 14” 1366x768 eDP panel
– New AMD Start Now 3.0 with smart sleep
*Pre-silicon projection and not based on actual measured data
QUESTIONS AND
ANSWERS