AI Systems (System for AI): Lecture Introduction · 2021-01-22


  • Compute scheduling optimization

    Operator-expression IR

    Computation-graph IR (DAG): y = x * w + b
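    To make the two IR levels concrete, here is my own paraphrase of the
    slide's example (assuming * denotes MatMul; the exact IR syntax below is
    an assumption, not the course's notation):

    // Computation-graph IR (DAG): nodes are whole operators.
    //   t = MatMul(x, w)
    //   y = Add(t, b)
    // Operator-expression IR: each node expands into an index expression.
    //   T[m, n] = reduce_sum(X[m, k] * W[k, n], axis=k)
    //   Y[m, n] = T[m, n] + b[n]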

  • 2. Memory optimization

    3. Kernel optimization

    4. Scheduling optimization

  • Operators: Add, Sub, Mul, Div, Relu, Tanh, Exp, Floor, Log, MatMul, Conv,
    BatchNorm, Loss, Transpose, Concatenate, Sigmoid, While, Merge, BroadCast,
    Reduce, Map, Reshape, Select, …

  • [Figure: two small computation graphs, each computing y = x * w + b]

  • Fused element-wise expression: p1 / (p0 + (p3 - p4)); a CUDA sketch of the
    corresponding single kernel follows.
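    Such an expression can be lowered to one kernel that reads each input once
    and applies the whole formula per element, instead of launching a separate
    kernel per operator. A minimal sketch, assuming p0, p1, p3, p4 are
    same-length float arrays (only the expression itself comes from the slide):

    __global__ void fused_elementwise(const float* p0, const float* p1,
                                      const float* p3, const float* p4,
                                      float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // One memory pass instead of three kernels (Sub, Add, Div).
            out[i] = p1[i] / (p0[i] + (p3[i] - p4[i]));
        }
    }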

  • Batch same-type operators to leverage GPU massive parallelism

    [Figure: data-flow graph of a GRU cell, in which xt and ht-1 each feed
    three MatMuls (weights Wf, Wo, Wz and Rf, Ro, Rz) whose sums pass through
    σ, σ and tanh gates that combine into ht]

  • Merging the input tensors into one large tensor lets identical operators
    be merged into a single larger operator, which makes better use of the
    hardware's parallelism (see the sketch after this slide).

    [Figure: data-flow graph of a GRU cell before and after batching; the
    three MatMuls over Wf, Wo, Wz and the three over Rf, Ro, Rz collapse into
    single MatMuls over concatenated weight matrices W and R]
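    To make the batching concrete, here is a toy CUDA sketch of the idea (my
    own illustration, not the course's code; the naive GEMM and all shapes and
    names are assumptions): the three gate projections over Wf, Wo and Wz
    become one launch over a column-wise concatenated weight matrix.

    __global__ void gemm_naive(const float* A, const float* B, float* C,
                               int M, int K, int N) {
        // C[M, N] = A[M, K] * B[K, N], row-major, one thread per output.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < M && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[row * K + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }

    // Unbatched: three small launches, each under-utilizing the GPU.
    void gates_unbatched(const float* x, const float* Wf, const float* Wo,
                         const float* Wz, float* f, float* o, float* z,
                         int B, int D, int H) {
        dim3 blk(16, 16), grid((H + 15) / 16, (B + 15) / 16);
        gemm_naive<<<grid, blk>>>(x, Wf, f, B, D, H);
        gemm_naive<<<grid, blk>>>(x, Wo, o, B, D, H);
        gemm_naive<<<grid, blk>>>(x, Wz, z, B, D, H);
    }

    // Batched: one launch over W = [Wf | Wo | Wz] of shape [D, 3H]; the
    // output is the concatenation [f | o | z] of shape [B, 3H].
    void gates_batched(const float* x, const float* W, float* foz,
                       int B, int D, int H) {
        dim3 blk(16, 16), grid((3 * H + 15) / 16, (B + 15) / 16);
        gemm_naive<<<grid, blk>>>(x, W, foz, B, D, 3 * H);
    }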


  • 1. Computation graph optimization

    2. Memory optimization

    3. Kernel optimization

    4. Scheduling optimization

  • AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using
    Integer Linear Programming


  • 1. Computation graph optimization

    2. Memory optimization

    3. Kernel optimization

    4. Scheduling optimization

  • Computation-graph IR (DAG): y = x * w + b

    Problem: every backend platform needs at least one separately implemented
    kernel for every operator, which must take into account the programming
    model, data layout, thread model, cache sizes, and so on.

  • Computation-graph IR (DAG): y = x * w + b

    Tensor computation expression: C[i] = A[i] + B[i]
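    On a CUDA backend, this one-line tensor expression lowers to roughly the
    following kernel (hand-written here as a sketch; a tensor compiler would
    generate the equivalent automatically):

    __global__ void add_kernel(const float* A, const float* B, float* C, int n) {
        // Each thread produces one element of C[i] = A[i] + B[i].
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }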


  • 1. Computation graph optimization

    2. Memory optimization

    3. Kernel optimization

    4. Scheduling optimization

  • Computation graph optimization: XLA, TVM, nGraph

    Operator generation: AutoTVM, TC, Halide, Tiramisu, etc.

    MatMul: C[m, n] = reduce_sum(A[m, k] * B[k, n], axis=k)

    C[i] = A[i] + B[i]
    D[i] = max(C[i], 0)

    Fused: D[i] = max(A[i] + B[i], 0)

    __global__ void add_relu_kernel(args…) {
        //.. Code
    }
    void launch_kernel(add_relu_kernel);

    Fusing the add and relu nodes yields one add_relu kernel that runs across
    the GPU's SMs; a complete version is sketched below.

    Compute scheduling optimization

    Operator-expression IR

    Computation-graph IR (DAG): y = x * w + b

    How is each operator's computation finally mapped onto the hardware?
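    A minimal runnable version of the fused kernel outlined above (my own
    sketch; the slide elides the argument list as args…, so these parameter
    names are assumptions):

    __global__ void add_relu_kernel(const float* A, const float* B,
                                    float* D, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float c = A[i] + B[i];   // C[i] = A[i] + B[i]
            D[i] = fmaxf(c, 0.0f);   // D[i] = max(C[i], 0)
        }
    }

    // Launch with one thread per element:
    //   add_relu_kernel<<<(n + 255) / 256, 256>>>(A, B, D, n);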

  • Task-parallel graph

    BlockFusionOptimizer

    NNFusion: global compute scheduling optimization

    Compute scheduling optimization

    Operator-expression IR

    Computation-graph IR (DAG): y = x * w + b

  • APPEND

    GROUP_SYNC

    Support task-level scheduling by introducing new scheduling primitives; a
    toy sketch of the two primitives follows this slide.
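    The following is a toy C++ illustration of how such primitives might
    compose a task-level schedule (the primitive names come from the slide;
    everything else, including the data structures, is my own assumption and
    not NNFusion's actual interface):

    #include <string>
    #include <vector>

    struct Step { std::string op; bool is_sync; };

    struct Schedule {
        std::vector<std::vector<Step>> queues;  // one task queue per executor
        explicit Schedule(int n_executors) : queues(n_executors) {}

        // APPEND: put an operator task at the tail of one executor's queue.
        void APPEND(int executor, const std::string& op) {
            queues[executor].push_back({op, false});
        }
        // GROUP_SYNC: a point every executor must reach before any proceeds.
        void GROUP_SYNC() {
            for (auto& q : queues) q.push_back({"sync", true});
        }
    };

    int main() {
        Schedule s(2);
        s.APPEND(0, "MatMul");   // MatMul and Conv may run concurrently...
        s.APPEND(1, "Conv");
        s.GROUP_SYNC();          // ...but both must finish before Sigmoid.
        s.APPEND(0, "Sigmoid");
    }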

  • Challenge: scheduling operator tasks onto the GPU. Co-scheduling operators
    naively can introduce incorrect dependencies and deadlock.

    [Figure: Matmul and Conv thread blocks (block0 through block10)
    co-scheduled across the GPU's SMs with a barrier between them; blocks
    waiting for the barrier are unable to execute, the remaining Conv blocks
    cannot be scheduled, and the GPU deadlocks]

  • Mapping of dependencies

    [Figure: the MatMul and Conv are broken into vblocks (vblock0 through
    vblock10) and mapped onto block executors BE 0 through BE 5 across the
    GPU's SMs, replacing the barrier; a toy sketch of this block-executor
    style is given below]
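    A deliberately simplified CUDA sketch of the block-executor idea (my own
    illustration; NNFusion's generated code differs): every thread block acts
    as a block executor that runs its assigned vblocks in a fixed order, so a
    producer/consumer dependency can be enforced by ordering within a BE
    instead of by a device-wide barrier between kernel launches.

    __device__ void conv_vblock(int vb)   { /* compute one Conv tile */ }
    __device__ void matmul_vblock(int vb) { /* compute one MatMul tile */ }

    __global__ void fused_mega_kernel() {
        int be = blockIdx.x;  // this thread block is block executor `be`
        // Static schedule baked in at compile time: each BE first runs its
        // Conv vblock, then the MatMul vblock that consumes its output tile.
        conv_vblock(be);
        __syncthreads();      // intra-BE ordering only; no global barrier
        matmul_vblock(be);
    }

    Dependencies that cross block executors need extra machinery (for example,
    spinning on flags in global memory); this sketch only covers the same-BE
    case.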

  • Mapping of parallelism

    [Figure: a σ, a MatMul ([64, 512] × [512, 128] → [64, 128]) and a Conv
    scheduled as vblocks onto BE 0 through BE 5; the coarse Level 1 mapping
    under-utilizes GPU resources, while the finer Level 2 mapping gives better
    utilization and lower latency. A worked example of the trade-off follows.]
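    A rough worked example of the utilization trade-off (the tile sizes are my
    own assumption, not from the slide):

    // MatMul output: [64, 128], with 6 block executors available.
    // Level 1: 64x64 tiles -> (64/64) * (128/64) = 2 vblocks -> 2 of 6 BEs busy
    // Level 2: 32x32 tiles -> (64/32) * (128/32) = 8 vblocks -> all 6 BEs busy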
