Opening & Plenary Session #1 Hardware
Transcript of Opening & Plenary Session #1 Hardware
November 17, 2020
1
Prof. Shouyi Yin, Department of Micro- and Nano-electronics
Tsinghua University
Deploying AI Everywhere: mW/μW Level Neural Network Processor
2
[Figure: cloud AI services (AI healthcare, image recognition, translation, speech-to-text, smart home) provided by Google, Baidu, IBM, Microsoft, and Alibaba, all running in the AI cloud]
Today's AI Services: Almost All in the Cloud
3
From AI in Cloud to AI Everywhere
[Figure: device pyramid; Things (~10^12), WiFi clients and mobile devices (~10^10), Cloud (~10^8); Google Brain @ISSCC'18]
Little ML: micro-chips; Medium ML: edge, mobile; Big ML: data center [Rob Aitken @Arm]
4
AI Chipset Revenue, 2016-2025 (from a Tractica report)
5
Computation Requirements vs. Power Constraints
[Figure: computation (ops, from 100 G through 1 T and 2 T up to 20 T) versus power (1 mW, 100 mW, 2 W, 100 W) across application classes: always-on smartphone features, wearable devices, and smart sensors at the low end; smartphones, smart home, smart toys, smart glasses, and WiFi cameras; home appliances; video surveillance, industry, and agriculture; automobiles and data centers at the high end]
6
Challenges for AI Chips
① Programmability
   Example 1: LeNet for Handwriting Recognition
   Example 2: AlexNet for Image Classification
   Example 3: LRCN for Image Captioning
② Neural Computing & General Computing
③ High Energy Efficiency for Edge Computing (~TOPS/W @ mW or μW level)
Face Detection and Recognition
[Figure: pipeline interleaving network computation (matrix multiplies) with general computation: face detection (NN) → resize → NMS → landmark (NN) → facial alignment → resize → face recognition (NN) → Euclidean distance → identity ("Leonardo DiCaprio")]
Neural Computing: convolutional networks, fully connected networks, recurrent networks, …
General Computing: image signal processing, visual processing, sound signal processing, …
7
Dual Trends for Energy Efficient NN Computing
Trend 1: Compact NN Model, to make algorithms more flexible
  Pruning (2016.2) → BWN (2016.8) → TWN (2016.11) → Low-bit Training: DoReFa-Net (2018.2) → Low-bit Adaptive Quantization: LQ-Nets (2018.9)
  Towards compact parallel computing: network pruning, compression, quantization, low-bit, …
Trend 2: Domain Specific Architecture, to make hardware more busy
  ASIP: Diannao (2014) → RS Dataflow: MIT Eyeriss (2016) → Systolic Array: Google TPU (2017.1) → Sparsity-aware: NVIDIA SCNN (2017.6) → Flexible Bitwidth: KAIST UNPU (2018)
  Towards massive basic computation: data granularity, programming/storage model, …
Algorithm design in conjunction with hardware design: less latency, more power efficient, more general!
Algorithm & hardware co-design, co-optimization, co-verification!
8
Binary & Ternary Weight Neural Networks
10
Training Techniques for Extreme Low-bit NN
• Reduce memory footprint: various weight quantization techniques
• Increase representational capability: shortcut connections
• Reduce variation: weight approximation, batch normalization
Top-1 classification accuracy of ResNet-18 on ImageNet (weights of all networks are binary/ternary):
  Full precision: 69.3%
  BinaryNet (2016): 42.2%
  XNOR-Net (2016): 51.2%
  DoReFa-Net (2016): 59.2%
  BWN (2016): 60.8%
  ABC-Net (2017): 65.0%
  TTQ (2017): 66.6%
  LQ-Nets (2018): 68.0%
11
What Kind of Computing Architecture?
Programmability vs. Energy Efficiency vs. Compact Model
ASIP: Cambricon; Systolic Array: TPU; RS Dataflow: Eyeriss; Sparsity: SCNN; Flexible Bit: UNPU. What is next?
13
Coarse-Grained Reconfigurable Architecture
9.8 EMERGING COMPUTING ARCHITECTURE
14
Software Defined Hardware
Build runtime-reconfigurable hardware and software that enables near-ASIC performance without sacrificing programmability for data-intensive algorithms.
• To build a processor that is reconfigurable at runtime (reconfiguration times: 300 - 1,000 ns).
• To build programming languages and compilers that optimize software and hardware at runtime.
15
Energy Efficiency Comparison
[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 10,000) for 16 chips, grouped by platform: CPU, multi-core CPUs, GPU, CPU+GPU, FPGA, and dedicated chips. Efficiency rises as programmability falls, from more programmable (CPU) through less programmable (FPGA) to not programmable (dedicated). Software Defined Hardware (CGRA) targets dedicated-level efficiency with programmability.]
[Stanford University, Prof. Kunle Olukotun, ISCA 2018, keynote speech]
16
CGRA History
[Timeline phases: Exploration, Expansion, High-Speed Expansion, New Start]
PADDI (1990), PADDI-2 (1993), DP-FPGA (1994), Matrix (1996), RaPiD (1996), Pleiades (1997), Garp (1997), RAW (1997), REMARC (1998), PipeRench (1998), CHESS (1999), MorphoSys (1999), DReAM (2000), MorphICs (2000), ADRES (2003), XPP (2003), Zippy (2003), EGRA (2009), DySER (2012)
Origin: Prof. Gerald Estrin, "Organization of Computer Systems: The Fixed Plus Variable Structure Computer," Proc. Western Joint Computer Conf., New York, 1960, pp. 33-40.
17
Reconfigurable Computing Research at Tsinghua Univ.
Prof. Shaojun Wei, IEEE Fellow, Director of IME, THU
[Timeline, 2006 / 2010 / 2015 / 2018 / 2020:
  Theory and basic architecture: NSFC: Reconfigurable Architecture; 863 Program: Reconfigurable Multimedia Processor
  Domain-specific reconfigurable processors: 863 Program: General Purpose Reconfigurable Computing; NSFC: Reconfigurable Vision Processor; NSFC: Reconfigurable Networks-on-Chip; Major S&T Program: Reconfigurable Graphic Processor; China-UK: Reconfigurable Cloud Computing
  NN processors: Thinker: Reconfigurable NN Processor; NSFC: Reconfigurable Hybrid AI Processor]
18
Architecture Overview
[Figure: 4×4 PE array with context memory, data memory (scratchpad), and host controller; each PE contains an ALU, registers (REGs), and memory (MEM) with dual supply rails VddH/VddL; over time, execution alternates between configuration and execution phases]
Arithmetic and logic operations supported by a PE:
  A + B, A - B, A & B, A | B, A ^ B, A ~^ B, ~A, A << B, A >> B
  A + (B >> C), A + (B << C), A - (B >> C), A - (B << C), (A >> C) - B, (A << C) - B
  (A + B) >> C, (A + B) << C, (A - B) >> C, |A - B|
  A > B, A == B, A < B, A >= B, A <= B, A != B
  A×B_L (low half), A×B_H (high half), Clip(A, -B, B), Clip(A, 0, B), C ? A : B
• Spatial architecture, distributed memory, no instructions, flexible bit-width
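To make the PE operation table concrete: a minimal C sketch (my illustration, not Thinker's RTL; the opcode names are invented) of how one opcode-selected datapath can cover both plain and fused operations such as (A+B)>>C and Clip(A, 0, B):

    #include <stdint.h>

    typedef enum { OP_ADD, OP_SUB, OP_ADDSHR, OP_ABSDIFF, OP_CLIP0, OP_SEL } alu_op;

    static int32_t clip(int32_t x, int32_t lo, int32_t hi) {
        return x < lo ? lo : (x > hi ? hi : x);
    }

    int32_t pe_alu(alu_op op, int32_t a, int32_t b, int32_t c) {
        switch (op) {
        case OP_ADD:     return a + b;
        case OP_SUB:     return a - b;
        case OP_ADDSHR:  return (a + b) >> c;          /* fused add-then-shift */
        case OP_ABSDIFF: return a > b ? a - b : b - a; /* |A - B|              */
        case OP_CLIP0:   return clip(a, 0, b);         /* Clip(A, 0, B)        */
        case OP_SEL:     return c ? a : b;             /* C ? A : B            */
        default:         return 0;
        }
    }

In hardware the opcode comes from the context memory, so switching operations is a reconfiguration rather than an instruction fetch.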
19
Software defined Datapath
20
Programming Model
Compilation
21
Compiling Applications onto CGRA
A loop:
    for (i = 0; i < N; i++) {
        S1: a[i] = b[i-1] + 1;
        S2: b[i] = a[i] / 2;
        S3: c[i] = b[i] + 3;
        S4: d[i] = c[i];
    }
DDG with (dif, min) edge labels: S1 → S2 (0,1), S2 → S3 (0,1), S3 → S4 (0,1), plus a loop-carried dependence S2 → S1 (1,1).
Loop pipelining overlaps iterations: after a prologue, a steady-state kernel issues S1-S4 on the time-extended PE array once every II cycles (II: initiation interval); for example, at time T, PE1/PE2 execute S3/S4 of one iteration while at T+1 they execute S1/S2 of the next.
Key problems:
1. Exploiting operator-level parallelism
2. Finding a better OP-PE binding according to the CGRA's architectural features
"A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code." (Computer Architecture: A Quantitative Approach)
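The loop-carried edge S2 → S1 closes a cycle in the DDG, and that cycle bounds how often new iterations can start. A toy C calculation (my own sketch, not the compiler's code) of the recurrence-constrained minimum initiation interval for this example:

    #include <stdio.h>

    int main(void) {
        /* Cycle S1 -> S2 -> S1 with (dif, min) labels (0,1) and (1,1) */
        int latency  = 1 + 1; /* sum of "min" latencies around the cycle */
        int distance = 0 + 1; /* sum of "dif" iteration distances        */
        int rec_mii  = (latency + distance - 1) / distance; /* ceiling   */
        printf("RecMII = %d\n", rec_mii); /* 2: a new iteration can start */
        return 0;                         /* at most every 2 time steps   */
    }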
22
Loops Classification
Loops fall into many categories; we mainly focus on four categories of loop mapping to CGRAs, which cover most real-life applications.
Classification axes: iteration domain (affine / non-affine, converted to load/store), dependence (uniform / non-uniform), structure (single-level / multi-level innermost, perfect / imperfect, with / without branch).
Four focus categories: 1. w/o branch; 2. branch; 3. perfect; 4. imperfect.
Example (multi-level, perfect, affine, non-uniform):
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            A[i][j] = A[2i+3j][j+1] + ...
Example (multi-level, imperfect, affine, uniform, dependence distance [1,2]):
    for (i = 0; i < N; i++) {
        a[i][0] = ...
        for (j = 0; j < M; j++)
            A[i][j] = A[i+1][j+2] + ...
    }
Example (single-level, branch):
    for (i = 0; i < N; i++) {
        a = b + c;
        if (a > 0) { d = a + c; }
    }
24
Compilation Flow
[Flow: loops go through auto-parallelization, with a decision tree (Nested? → Perfect? → Branch?) selecting PolyMap (DAC'13/TVLSI'14), DualPL (TPDS'16), or TRMap (ICCAD'15/TVLSI'15); then memory partitioning & mapping (MEMMap, TVLSI'15) and low-power optimization (DualVdd, TCAD'16), producing the configuration context and CTRL]
• PolyMap: Polyhedral model based mapping
• DualPL: Multi-level loop pipelining
• TRMap: Trigger-aware mapping
• MEMMap: Memory partitioning & mapping
• DualVdd: Voltage scheduling
25
Thinker: Reconfigurable AI Computing Architecture
Features:
1. Heterogeneous PE arrays supporting data reuse
2. Two types of reconfigurable PE providing programmability
3. Bitwidth adaptive MAC unit exploiting computing power for low bit NN
General PE
Super PE
26
Three levels of Reconfigurability
27
Three Levels of Reconfigurability: 1. Reconfigurable MAC Unit; 2. Reconfigurable PE; 3. Reconfigurable PE Array
28
Reconfigurable MAC Unit: Bit-width Adaptive
• Supports flexible bit-width neural networks
• Fully exploits the computing power of the PE for low-bit neural networks
Two modes: bit-serial computation and subword-parallel computation
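As a rough behavioral sketch of the subword-parallel mode (my illustration with unsigned 4-bit lanes and an invented packing; not the actual MAC circuit), one wide datapath is split into four independent low-bit lanes so a low-bit network uses the full multiplier width instead of leaving it idle:

    #include <stdint.h>

    /* Four 4-bit activation/weight pairs packed into 16-bit words. */
    int32_t mac_4x4bit(uint16_t packed_a, uint16_t packed_w, int32_t acc) {
        for (int lane = 0; lane < 4; lane++) {
            uint32_t a = (packed_a >> (4 * lane)) & 0xF; /* 4-bit activation */
            uint32_t w = (packed_w >> (4 * lane)) & 0xF; /* 4-bit weight     */
            acc += (int32_t)(a * w);  /* four products per datapath pass     */
        }
        return acc;
    }

In bit-serial mode the same idea runs the other way: operands stream one bit per cycle, so arbitrary bit-widths map onto the same hardware.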
29
Reconfigurable PE: Supports Different OPs in NNs
30
Reconfigurable PE Array: On-demand Partitioning
[Figure: ① CNN, FCN, and LSTM execute sequentially on the whole array; ②③ the array is partitioned so CNN and FCN+LSTM execute in parallel]
31
Recognition: 2017 ACM/IEEE ISLPED Design Contest Award; 2017 VLSI Symposia on Technology and Circuits (VLSIC); 2017 IEEE TVLSI Most Popular Article; 2018 IEEE Journal of Solid-State Circuits (JSSC); 2018 MIT Technology Review
Thinker-I: Watt-level AI processor
Technology: TSMC 65nm LP
Supply voltage: 0.67 V ~ 1.29 V
Area: 4.4 mm × 4.4 mm
SRAM: 348 KB
Frequency: 10 ~ 200 MHz
Power: 4 mW ~ 447 mW
Energy efficiency: 1.06 ~ 5.09 TOPS/W
Thinker I: • General-purpose NN • Heterogeneous PEs • Supports CNN/FCN/RNN and hybrid NNs
32
The Requirement of mW-level AI Processors
[Figure: the computation-vs-power chart from slide 6, highlighting the tens-of-mW region (always-on smartphone features, wearables, smart sensors, smart home devices)]
33
Dual Trends for Energy Efficient NN Computing (recap of slide 8)
Low-bit NN & HW co-design
34
Binary/Ternary Weight Neural Networks (BTNNs)
• No multiplication  • Low memory footprint and capacity  • Satisfactory accuracy → Hardware friendly
Accuracy loss vs. full precision (%):
  Network                               Weight bits  Activation bits  MNIST  CIFAR-10  ImageNet
  Binary weight neural networks (BWNNs):
    Binary Connect                      1            32               0.7    2.78      19.20
    Binary Weight Network               1            32               -      0.76      0.8
    Binary Neural Network               1            1                0.37   3.03      29.8
    XNOR-Net                            1            1                -      3.05      11
  Ternary weight neural networks (TWNNs):
    Ternary Connect                     2            32               0.56   4.89      -
    Ternary Weight Network              2            32               0.06   0.32      0.8
    Trained Ternary Quantization        2            32               -      -         0.6
    Ternary Neural Network              2            2                1.08   4.99      -
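The "no multiplication" property is easy to see in code: with weights restricted to {-1, 0, +1}, a dot product needs only adds, subtracts, and skips. A minimal C sketch (my illustration):

    #include <stdint.h>

    int32_t ternary_dot(const int8_t *w, const int16_t *x, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; i++) {
            if      (w[i] ==  1) acc += x[i]; /* +1: add instead of multiply */
            else if (w[i] == -1) acc -= x[i]; /* -1: subtract                */
            /* 0: contributes nothing and can be skipped entirely            */
        }
        return acc;
    }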
35
New Opportunity to Optimize B/T Weight Convolutions
With only 2 or 3 distinct weight values, the kernels in a kernel group compute many shared partial sums over the same input feature maps; these redundant operations (ROPs) cost extra power. Removing ROPs for different kernel groups computes each shared sub-expression once.
[Example: inputs (1, 2, 3, 4); kernels (1,1,-1,0), (0,1,-1,1), (1,-1,1,0), (-1,1,0,1) compute 1+2-3+0, 0+2-3+4, 1-2+3+0, -1+2+0+4. Factoring shared terms gives 1+(2-3)+0, 0+(2-3)+4, (1-2)+3+0, -(1-2)+0+4, so (2-3) and (1-2) are each computed only once]
36
Special Optimization for B/T Weight NNs: Kernel Transformation + Feature Reconstruction (KTFR)
Standard convolution (64 OPs):
    Ifmap:  1  2  3  4     Original kernels:  K1 =  1  1 -1    K2 = -1  1  1
            5  6  7  8                              -1  1 -1          1  1  1
            9 10 11 12                               1  1  1         -1 -1  1
           13 14 15 16
    Ofmaps: O1 = 24 27     O2 = 14 17
                 36 39          26 29
KTFR (36 OPs): transform the kernels, convolve, then reconstruct the features:
    K1' = (K1 + K2) / 2 =  0 1 0     K2' = (K1 - K2) / 2 =  1  0 -1
                           0 1 0                           -1  0 -1
                           0 0 1                            1  1  0
    O1' = 19 22    O2' = 5 5     O1 = O1' + O2',  O2 = O1' - O2'
          31 34          5 5
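The KTFR identity can be checked directly on the example above. A self-contained C sketch (my code, following the slide's numbers) that convolves with both the original and the transformed kernels and verifies O1 = O1' + O2' and O2 = O1' - O2':

    #include <stdio.h>

    static void conv3x3(const int in[4][4], const int k[3][3], int out[2][2]) {
        for (int r = 0; r < 2; r++)
            for (int c = 0; c < 2; c++) {
                int s = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        s += in[r + i][c + j] * k[i][j];
                out[r][c] = s;
            }
    }

    int main(void) {
        int ifmap[4][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
        int k1[3][3] = {{1,1,-1},{-1,1,-1},{1,1,1}};
        int k2[3][3] = {{-1,1,1},{1,1,1},{-1,-1,1}};
        int k1t[3][3], k2t[3][3], o1t[2][2], o2t[2][2], o1[2][2], o2[2][2];

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) {
                k1t[i][j] = (k1[i][j] + k2[i][j]) / 2; /* kernel transform */
                k2t[i][j] = (k1[i][j] - k2[i][j]) / 2;
            }
        conv3x3(ifmap, k1t, o1t); /* K1' is mostly zeros: fewer OPs  */
        conv3x3(ifmap, k2t, o2t);
        conv3x3(ifmap, k1, o1);   /* reference: standard convolution */
        conv3x3(ifmap, k2, o2);

        for (int r = 0; r < 2; r++)
            for (int c = 0; c < 2; c++)   /* feature reconstruction */
                printf("O1 %d == %d, O2 %d == %d\n",
                       o1[r][c], o1t[r][c] + o2t[r][c],
                       o2[r][c], o1t[r][c] - o2t[r][c]);
        return 0;
    }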
38
Special Optimization for B/T Weight NNs: FIBC
[Figure: for the 3×3 ifmap (1 2 3 / 2 1 2 / 3 3 4) and four binary 2×2 kernels, standard convolution takes 24 OPs. The FIBC path instead performs ① integral calculation (KBWI) over the ifmap with integral fusion, ② the reduced remaining convolution (RRC) with FIBC kernels, and ③ feature summation of the KBWI and RRC results (including a -2× scaling term), producing the same 2×2 ofmaps in 16 OPs]
39
Recognition: 2018 IEEE ISSCC SRP; 2018 International Symposium on Computer Architecture (ISCA); 2018 VLSI Symposia on Technology and Circuits (VLSIC); 2019 IEEE Journal of Solid-State Circuits (JSSC)
Thinker-II: mW-level AI processor
Technology: TSMC 28nm HPC
Supply voltage: 0.58 V ~ 0.9 V
Area: 1.7 mm × 2.7 mm
SRAM: 225 KB
Frequency: 20 ~ 400 MHz
Power: < 100 mW
Energy efficiency: 20 TOPS/W @ binary AlexNet
Thinker II: • Ultra-low power • Load balancing and scheduling • Low bit-width weights
[Die photo: data memories, weight memory, PLL, controller, processing engine]
40
The Requirement of µW-level AI Processors
[Figure: the computation-vs-power chart from slide 6, highlighting the < 1 mW region (always-on wearables and smart sensors)]
41
Prevailing Human-Machine Speech Interfaces
○ Mobile phones, wearable devices, IoT devices, … (Apple Siri, Google Now, Microsoft Cortana; smart earphones, wall switches)
○ General speech recognition procedure: Voice Activity Detection → Feature Extraction → Acoustic Model → Decoding
42
Binary Convolutional Neural Networks (BCNN)
○ Activations and weights quantized to 1 bit: saves memory footprint
○ Expensive multiplications replaced by XNORs: saves power and area
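The XNOR substitution is the standard binarized inner product (a common BCNN formulation, not necessarily the exact Thinker-S datapath): encode +1 as bit 1 and -1 as bit 0; then x XNOR w is 1 exactly where the signed product is +1, and a 32-element dot product reduces to one XOR plus a population count (__builtin_popcount is a GCC/Clang builtin):

    #include <stdint.h>

    int32_t binary_dot32(uint32_t x_bits, uint32_t w_bits) {
        int mismatches = __builtin_popcount(x_bits ^ w_bits); /* products = -1 */
        return 32 - 2 * mismatches;  /* (+1 matches) - (-1 mismatches)         */
    }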
43
Frame-Level Reuse in BCNN
□ Exploit temporal data locality to eliminate redundant computing.
[Figure: speech feature maps are formed from 11 consecutive frames of 1×40 features each, so feature maps 1, 2, 3, … (frames 1-11, 2-12, 3-13, …) overlap in 10 frames. Their 3×3-kernel convolution results therefore largely coincide: overlapped results are held in a buffer and reused across consecutive feature maps, only the newest frame's results are computed, and the stored results are updated before passing to the next layer]
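A schematic C sketch of the reuse mechanism (my illustration; FRAMES matches the slide's 11-frame maps, but COLS and the per-frame computation are invented stand-ins): a ring buffer holds per-frame convolution results, so each new frame costs one column of fresh work while ten columns are reused:

    #include <string.h>

    #define FRAMES 11  /* frames per feature map, per the slide    */
    #define COLS   64  /* illustrative size of one frame's results */

    static float ring[FRAMES][COLS]; /* results for the last 11 frames */
    static int   head = 0;           /* slot holding the oldest frame  */

    /* Stand-in for the real per-frame convolution work. */
    static void compute_column(const float *frame40, float *col) {
        for (int i = 0; i < COLS; i++)
            col[i] = frame40[i % 40]; /* 1x40 features per frame */
    }

    void push_frame(const float *frame40, float out[FRAMES][COLS]) {
        compute_column(frame40, ring[head]); /* overwrite the oldest slot */
        head = (head + 1) % FRAMES;          /* newest is now at head-1   */
        for (int f = 0; f < FRAMES; f++)     /* emit oldest to newest     */
            memcpy(out[f], ring[(head + f) % FRAMES], sizeof(ring[0]));
    }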
44
Bit-Level BCNN Weight Regularization
□ Regularize and compress the bits in BCNN 3-D weight matrices.
[Figure: after pruning, 32-bit channel words are regularized so most words contain long zero runs; zero-completed words are stored in banks by zero count (16-zero, 12-zero, 8-zero, full), and a flag table of 2-bit flags (00/01/10/11) drives a 2-4 decoder and address generator to select the bank on each access]
Bank distribution after regularization: 16-zero 5% / 12-zero 72% / 8-zero 4% / full 19% (case 1); 5% / 80% / 5% / 10% (case 2).
Results: storage reduction 24.25% and 27.50%, memory access reduction 24.25% and 27.50%, with 1% and 2% precision loss respectively.
45
Approximate Adder
□ Additions in BCNN are dominated by "+1": 95.9% of addition operations are 1-bit adds and only 4.1% are 16-bit adds.
□ Truncate the high-bit carry chain to shorten the critical path.
[Figure: the 16-bit adder splits into a lowest 5-bit exact addition (A4:0 + B4:0, l = 5; e.g., 11111 + 00001 = 100000, flagging results ≥ 32), 4-bit RCA additions for A8:5 + B8:5 and A12:9 + B12:9, and a 3-bit RCA addition for A15:13 + B15:13, each fed by its own sub-carry unit instead of the full carry chain, producing Sum4:0, Sum8:5, Sum12:9, Sum15:13]
46
Approximate Adder
□ The lowest 5-bit adder circuit reduces delay by 49.28% and power-delay product (PDP) by 48.08%.
□ Keeping the lowest 5 bits exact reduces power while guaranteeing correctness of incremental addition (A4:0 + B4:0; 11111 + 00001 = 100000, l = 5, detecting results ≥ 32).
Carry chain: c0 = g0; c1 = p1·g0; c2 = p2·p1·g0; c3 = p3·p2·p1·g0; c4 = p4·p3·p2·p1·g0 → Sum4:0
Evaluation: TSMC 28nm, 0.9 V → DC → netlist → HSPICE → delay/power. Benchmarks: TIDIGIT, TIMIT, RM, WSJ, Spoken Number.
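A behavioral C model (my approximation of the structure above, not the verified netlist): each segment performs an exact ripple-carry addition internally, but its carry-in is speculated from only the segment just below it instead of being propagated through the full 16-bit chain, which is what truncating the carry chain means:

    #include <stdint.h>

    /* Carry out of bits [lo..hi] of a+b assuming carry-in 0: the
     * "sub carry" speculation that replaces the global carry chain. */
    static unsigned spec_carry(uint16_t a, uint16_t b, int lo, int hi) {
        unsigned c = 0;
        for (int i = lo; i <= hi; i++) {
            unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
            c = (ai & bi) | ((ai ^ bi) & c); /* generate | propagate */
        }
        return c;
    }

    uint16_t approx_add16(uint16_t a, uint16_t b) {
        const int lo[4] = {0, 5, 9, 13};  /* segments 4:0, 8:5, */
        const int hi[4] = {4, 8, 12, 15}; /* 12:9, 15:13        */
        uint16_t sum = 0;
        unsigned cin = 0;                 /* exact carry-in to bits 4:0 */
        for (int s = 0; s < 4; s++) {
            unsigned c = cin;
            for (int i = lo[s]; i <= hi[s]; i++) { /* exact RCA inside */
                unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
                sum |= (uint16_t)((ai ^ bi ^ c) << i);
                c = (ai & bi) | ((ai ^ bi) & c);
            }
            cin = spec_carry(a, b, lo[s], hi[s]); /* speculated carry-in  */
        }                                         /* for the next segment */
        return sum;
    }

Since 95.9% of BCNN additions are "+1", any carry is almost always absorbed within the exact low 5 bits, so the truncated result is usually identical to the exact one.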
47
Recognition: 2018 VLSI Symposia on Technology and Circuits (VLSIC); 2018 Design Automation Conference (DAC); 2019 IEEE Transactions on Circuits and Systems I: Regular Papers
Thinker-S: µW-level AI processor
Technology: TSMC 28nm HPC
Supply voltage: 0.52 V ~ 0.9 V
Area: 1.74 mm × 0.74 mm
SRAM: 27 KB
Frequency: 2 ~ 50 MHz
Power: 0.2 ~ 5 mW
Energy efficiency: 304 nJ/frame
Thinker S: • Ultra-low power for speech • Always-on & real-time • Wake-up, command, and speech recognition
[Die photo: decoder unit, controller, data memories, weight memory, BCNN unit, pre-processing]
48
The Next Step of AI Chips
[Figure: energy efficiency (TOPS/W) of published AI chips, 2016-2020: MIT Eyeriss (RS flow) 0.25; Google TPU (systolic) 2.3; ICT/Cambricon Cambricon-X (sparsity) 0.54; KU Leuven Envision (DVAFS) 2.75; KAIST UNPU (reconfigurable) 11.6; THU Thinker-II (reconfigurable) 19.9; MIT CONV-RAM (7×1 bit) 28.1; SEU Sandwich (8×1 bit) 55]
Phase I: √ innovation in architecture drives energy efficiency; √ reconfigurable architecture has good programmability; × digital architecture faces the "memory wall" bottleneck.
Phase II: √ in-memory computing shows great potential; × only supports the basic MAC operations.
Phase III: reconfigurable architecture + in-memory computing. Next?
49
Thank you for your attention
Sponsors
Premier Sponsor & tinyML Strategic Partner
Gold Sponsor
Silver Sponsor
© 2020 Arm Limited (or its affiliates)
Arm: The Software and Hardware Foundation for tinyML
Optimized models for embedded:
  Application
  Runtime (e.g. TensorFlow Lite Micro)
  Optimized low-level NN libraries (i.e. CMSIS-NN)
  Arm Cortex-M CPUs and microNPUs
1. Connect to high-level frameworks
2. Supported by end-to-end tooling: profiling and debugging tooling such as Arm Keil MDK
3. Connect to runtime: RTOS such as Mbed OS
AI Ecosystem Partners
Resources: developer.arm.com/solutions/machine-learning-on-arm
Stay Connected: @ArmSoftwareDevelopers, @ArmSoftwareDev
The Next Generation of AI Processor for the Embedded Edge
Dynamic Neural Accelerator™
• Tight coupling between AI software & hardware with automated co-design
• 10x more compute with a single DNA engine
• More than 20x better energy efficiency
• Ultra-low latency
• Fully programmable with INT 8-bit support
www.edgecortix.com
© 2020 EDGECORTIX. ALL RIGHTS RESERVED
Target markets: Automotive, Robotics, Drones, Smart Cities, Industry 4.0
SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile and edge devices. We design systems for real-time always-on smart sensing, for audio, vision, IMUs, bio-signals and more.
https://SynSense.ai
Partners
Conference Partner
Media Partners
Questions?
Or to join the tinyML WeChat Group:
Please add a staff member to join our official tinyML WeChat Group (note "tinyML")
Copyright Notice
This presentation in this publication was presented at tinyML® Asia 2020. The content reflects the
opinion of the author(s) and their respective companies. The inclusion of presentations in this
publication does not constitute an endorsement by tinyML Foundation or the sponsors.
There is no copyright protection claimed by this publication. However, each presentation is the work of
the authors and their respective companies and may contain copyrighted material. As such, it is strongly
encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions
regarding the use of any materials presented should be directed to the author(s) or their companies.
tinyML is a registered trademark of the tinyML Foundation.
www.tinyML.org