Leone Sig Graph 2011

  • 7/29/2019 Leone Sig Graph 2011

    1/37

    Native shader compilation

    with

    Ma

  • 7/29/2019 Leone Sig Graph 2011

    2/37

    Why compile shaders?

    RenderMan's SIMD interpreter is hard to beat.

    Amortizes interpretive overhead over batches of points.

    Shading is dominated by floating point calculations.

  • 7/29/2019 Leone Sig Graph 2011

    3/37

    SIMD interpreter

    For each instruction in shader:
        Decode and dispatch instruction.
        For each point in batch:
            If runflag is on:
                Load operands.
                Compute.
                Store result.
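
    The loop structure above can be sketched in C roughly as follows. This is a minimal illustration, not RenderMan's implementation: the `Instr` layout, the two opcodes, the `BATCH` size, and `interp_run` are all hypothetical, and every register is assumed to be a varying float with one slot per point.

```c
#include <stddef.h>

enum { BATCH = 8 };                      /* hypothetical batch size */
typedef enum { OP_ADD, OP_MUL } Opcode;  /* hypothetical opcodes */

/* One instruction names its destination and operands by register index. */
typedef struct { Opcode op; int dst, a, b; } Instr;

/* regs[r][p] holds register r at point p, so every intermediate
   result lives in memory rather than in a machine register. */
void interp_run(const Instr *code, size_t ninstr,
                float regs[][BATCH], const unsigned char *runflag)
{
    for (size_t i = 0; i < ninstr; ++i) {
        const Instr *in = &code[i];      /* decode once per batch */
        float *dst = regs[in->dst];
        const float *a = regs[in->a];
        const float *b = regs[in->b];
        switch (in->op) {                /* dispatch once per batch */
        case OP_ADD:
            for (int p = 0; p < BATCH; ++p)
                if (runflag[p]) dst[p] = a[p] + b[p];
            break;
        case OP_MUL:
            for (int p = 0; p < BATCH; ++p)
                if (runflag[p]) dst[p] = a[p] * b[p];
            break;
        }
    }
}
```

    The decode and dispatch cost is paid once per instruction, while the inner loops do a load, one arithmetic operation, and a store per point.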

  • 7/29/2019 Leone Sig Graph 2011

    4/37

    SIMD interpreter: example inner loop

  • 7/29/2019 Leone Sig Graph 2011

    5/37

    SIMD interpreter: benefits

    Interpretive overhead is amortized (if batch is large).

    Uniform operations can be executed once per batch.

    Derivatives are easy: neighboring values are always ready.

  • 7/29/2019 Leone Sig Graph 2011

    8/37

    SIMD interpreter: drawbacks

    Low compute density, poor instruction-level parallelism

    Load, compute, store, repeat.

    Poor locality, high memory traffic

    Intermediate results are stored in memory, not registers.

    High overhead for small batches

    Difficult to vectorize (pointers and conditionals).

  • 7/29/2019 Leone Sig Graph 2011

    9/37

    Compiled shader execution

    For each point in batch:
        Load inputs.
        For each instruction in shader:
            Compute.
        Store outputs.
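
    For contrast with the interpreter, here is a minimal sketch of what compiled code for one shader might look like. The shader expression `out = (a * b + c) * a` and the function name `shade_compiled` are made up for illustration; the point is only that the loop over points is now outermost.

```c
#include <stddef.h>

/* Hypothetical compiled form of the shader expression
   out = (a * b + c) * a. Because the outer loop is over points,
   the intermediates t0 and t1 can stay in machine registers. */
void shade_compiled(const float *a, const float *b, const float *c,
                    float *out, size_t npoints)
{
    for (size_t p = 0; p < npoints; ++p) {
        float x = a[p], y = b[p], z = c[p];   /* load inputs */
        float t0 = x * y;                     /* compute */
        float t1 = t0 + z;
        out[p] = t1 * x;                      /* store outputs */
    }
}
```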

  • 7/29/2019 Leone Sig Graph 2011

    10/37

    Benefits of native compilation

    Eliminates interpretive overhead. Good for small batches.

    Good locality and register utilization.

    Intermediate results are stored in registers, not memory.

    Good instruction-level parallelism.

    Instruction scheduling avoids pipeline stalls.

    Vectorizes easily.

  • 7/29/2019 Leone Sig Graph 2011

    11/37

    Issues: batch shading

    Use vectorized shaders on small batches.

    Uniform operations: once per grid, not once per point.

    Some are very expensive (e.g. plugin calls).

    Derivatives: need "previously" computed values from neighboring points.

    RSL permits derivatives of arbitrary expressions.
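
    To see why derivatives force a synchronization point, consider a simple finite-difference scheme. The slides do not say how the renderer actually computes derivatives; the sketch below assumes one-sided differences along one row of a shading grid, and `deriv_u` is a hypothetical helper.

```c
#include <stddef.h>

/* Forward difference along one row of a shading grid; the last point
   falls back to a backward difference. Requires n >= 2 and du != 0.
   Every f[i] in the row must already be computed before this runs,
   which is why a per-point compiled loop cannot evaluate it inline. */
void deriv_u(const float *f, float *dfdu, size_t n, float du)
{
    for (size_t i = 0; i + 1 < n; ++i)
        dfdu[i] = (f[i + 1] - f[i]) / du;
    dfdu[n - 1] = (f[n - 1] - f[n - 2]) / du;
}
```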

  • 7/29/2019 Leone Sig Graph 2011

    12/37

    Why vectorize?

    Consider batch execution of a compiled shader:

    For each point in batch:
        Load inputs.
        For each instruction in shader:
            Compute.
        Store outputs.

  • 7/29/2019 Leone Sig Graph 2011

    13/37

    Why vectorize?

    Consider batch execution of a vectorized shader:

    For each block of 4 or 8 points in batch:
        Load inputs.
        For each instruction in shader:
            Compute on vector registers (with mask).
        Store outputs.
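
    The blocked loop can be sketched in portable C by letting each short inner loop stand in for one 4-wide vector instruction. The shader (`out = in * in + 1`), the name `shade_blocks`, and the byte-per-point mask are assumptions made for this example; real generated code would use vector registers directly.

```c
#include <stddef.h>

enum { VW = 4 };   /* vector width: four floats, as in SSE */

/* Hypothetical shader out = in * in + 1, run on blocks of VW points.
   npoints is assumed to be a multiple of VW. */
void shade_blocks(const float *in, float *out,
                  const unsigned char *mask, size_t npoints)
{
    for (size_t base = 0; base < npoints; base += VW) {
        float r0[VW], r1[VW];
        for (int i = 0; i < VW; ++i) r0[i] = in[base + i];    /* load4 */
        for (int i = 0; i < VW; ++i) r1[i] = r0[i] * r0[i];   /* mult4 */
        for (int i = 0; i < VW; ++i) r1[i] = r1[i] + 1.0f;    /* add4  */
        for (int i = 0; i < VW; ++i)                          /* masked store */
            if (mask[base + i]) out[base + i] = r1[i];
    }
}
```

    All lanes are computed unconditionally; the mask only decides which results are written back.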

  • 7/29/2019 Leone Sig Graph 2011

    14/37

    Simple vector code generation

    Consider using SSE instructions only for vectors and matrices:

    float dot(vector v1, vector v2)
    {
        vector v0 = v1 * v2;
        return v0.x + v0.y + v0.z;
    }

    load4 r1, [v1]
    load4 r2, [v2]
    mult4 r3, r1, r2
    move  r0, r3.x
    add   r0, r3.y
    add   r0, r3.z

  • 7/29/2019 Leone Sig Graph 2011

    15/37

    Shader vectorization

    To vectorize, first scalarize:

    float dot(vector v1, vector v2)
    {
        vector v0 = v1 * v2;
        return v0.x + v0.y + v0.z;
    }

    becomes

    float dot(vector v1, vector v2)
    {
        float x = v1.x * v2.x;
        float y = v1.y * v2.y;
        float z = v1.z * v2.z;
        return x + y + z;
    }

  • 7/29/2019 Leone Sig Graph 2011

    16/37

    Scalar code generation

    Next, generate ordinary scalar code:

    float dot(vector v1, vector v2)
    {
        float x = v1.x * v2.x;
        float y = v1.y * v2.y;
        float z = v1.z * v2.z;
        return x + y + z;
    }

    load r1, [v1.x]
    load r2, [v2.x]
    mult r0, r1, r2

    load r1, [v1.y]
    load r2, [v2.y]
    mult r3, r1, r2

    load r1, [v1.z]
    load r2, [v2.z]
    mult r4, r1, r2

    add r0, r0, r3
    add r0, r0, r4

  • 7/29/2019 Leone Sig Graph 2011

    17/37

    Vectorize for batch of four

    Finally, widen each instruction for a batch size of four:

    float dot(vector v1, vector v2)
    {
        float x = v1.x * v2.x;
        float y = v1.y * v2.y;
        float z = v1.z * v2.z;
        return x + y + z;
    }

    load4 r1, [v1.x]
    load4 r2, [v2.x]
    mult4 r0, r1, r2

    load4 r1, [v1.y]
    load4 r2, [v2.y]
    mult4 r3, r1, r2

    load4 r1, [v1.z]
    load4 r2, [v2.z]
    mult4 r4, r1, r2

    add4 r0, r0, r3
    add4 r0, r0, r4
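
    The widened dot product can be written out in plain C, with each 4-iteration loop standing in for one widened instruction. The `Vec3x4` type and the `dot4` name are hypothetical, and the component-per-array layout is an assumption of this sketch.

```c
enum { WIDTH = 4 };   /* lanes per vector register (SSE holds four floats) */

/* Each array holds one component for four different points. */
typedef struct { float x[WIDTH], y[WIDTH], z[WIDTH]; } Vec3x4;

/* Computes four independent dot products, one per lane. */
void dot4(const Vec3x4 *v1, const Vec3x4 *v2, float out[WIDTH])
{
    float r0[WIDTH], r3[WIDTH], r4[WIDTH];
    for (int i = 0; i < WIDTH; ++i) r0[i] = v1->x[i] * v2->x[i];  /* mult4 */
    for (int i = 0; i < WIDTH; ++i) r3[i] = v1->y[i] * v2->y[i];  /* mult4 */
    for (int i = 0; i < WIDTH; ++i) r4[i] = v1->z[i] * v2->z[i];  /* mult4 */
    for (int i = 0; i < WIDTH; ++i) r0[i] = r0[i] + r3[i];        /* add4  */
    for (int i = 0; i < WIDTH; ++i) out[i] = r0[i] + r4[i];       /* add4  */
}
```

    No lane ever reads another lane's data, so no horizontal adds or shuffles are needed.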

  • 7/29/2019 Leone Sig Graph 2011

    18/37

    Struct of arrays (SOA)

    Normally a batch of vectors is an array of structs (AOS):

    x y z x y z x y z x y z . . .

    Vector load instructions (in SSE) require contiguous data.

    Store batch of vectors as a struct of arrays (SOA):

    x x x x . . . y y y y . . . z z z z . . .
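
    A minimal sketch of the layout change, assuming a plain three-float `Vec3` struct for the AOS form and three separate arrays for the SOA form (both names, and `aos_to_soa`, are made up for this example):

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;   /* one AOS element */

/* Transpose n vectors from array-of-structs into three contiguous
   component arrays, so that four consecutive x values (or y, or z)
   can be fetched with a single vector load. */
void aos_to_soa(const Vec3 *aos, size_t n,
                float *xs, float *ys, float *zs)
{
    for (size_t i = 0; i < n; ++i) {
        xs[i] = aos[i].x;
        ys[i] = aos[i].y;
        zs[i] = aos[i].z;
    }
}
```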

  • 7/29/2019 Leone Sig Graph 2011

    19/37

    Masking / blending

    Use a mask to avoid clobbering components of registers used by the other branch.

    No masking in SSE. Use variable blend in SSE4:

    blend(a, b, mask)
    {
        return (a & mask) | (b & ~mask)
    }

    No need to blend each instruction.

    Blend at basic block boundaries (at phi nodes in SSA).
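
    A bitwise blend at the join of an if/else can be sketched in portable C on raw 32-bit lane values. This is an illustration only: `blend_lane` and `blend4` are hypothetical names, and each mask lane is assumed to be either all ones or all zeros.

```c
#include <stdint.h>

enum { LANES = 4 };

/* Per-lane select: where mask is all ones take a, where it is all
   zeros take b. */
uint32_t blend_lane(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Merge at the join point of an if/else. Both branch results were
   computed for all lanes; mask records which lanes took "then". */
void blend4(const uint32_t then_val[LANES], const uint32_t else_val[LANES],
            const uint32_t mask[LANES], uint32_t out[LANES])
{
    for (int i = 0; i < LANES; ++i)
        out[i] = blend_lane(then_val[i], else_val[i], mask[i]);
}
```

    Blending once per variable at the join, rather than after every instruction, keeps the extra work proportional to the number of values that are live across the branch.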

  • 7/29/2019 Leone Sig Graph 2011

    20/37

    Vectorization: recent work

    ispc: Intel SPMD program compiler (Matt Pharr)

    Beyond Programmable Shading course, SIGGRAPH 201

    Open source: ispc.github.com

    Whole function vectorization in AnySL (Karrenberg et al.)

    Code Generation and Optimization 2011

    http://www.cdl.uni-saarland.de/projects/wfv/
    http://www.cdl.uni-saarland.de/projects/anysl/
    http://ispc.github.com/

  • 7/29/2019 Leone Sig Graph 2011

    21/37

    Film shading on GPUs

    Previous work

    LightSpeed (Ragan-Kelley et al. SIGGRAPH 2007)

    RenderAnts (Zhou et al. SIGGRAPH Asia 2009)

    Code generation is easier now (thanks CUDA, OpenCL)

    PTX, AMD IL

    LLVM and Clang

  • 7/29/2019 Leone Sig Graph 2011

    22/37

    GPU code generation with LLVM

    NVIDIA's LLVM to PTX code generator (Grover)

    Not to be confused with PTX to LLVM front end (PL

    Incomplete PTX support in llvm-trunk (Chiou)

    Google Summer of Code project (Holewinski)

    Experimental PTX back end for AnySL (Rhodin)

    LLVM to AMD IL (Villmow)

    http://sourceforge.net/projects/llvmptxbackend/
    https://sites.google.com/site/justinholewinski/projects/gsoc/llvm-ptx-back-end-2011
    http://llvm.org/devmtg/2010-11

  • 7/29/2019 Leone Sig Graph 2011

    23/37

    Issues: GPU code generation

    Film shaders interoperate with the renderer.

    File I/O: textures, pointclouds, etc. (out of core).

    Shader plugins (DSOs).

    Sampling, ray tracing.

    Answer: multi-pass partitioning (Riffel et al. GH 2004)

  • 7/29/2019 Leone Sig Graph 2011

    24/37

    Partitioning

  • 7/29/2019 Leone Sig Graph 2011

    25/37

    Multi-pass partitioning for CPU

    Synchronize for GPU calls, uniform operations, derivatives.

    Does not require hardware threads or locks.

    A thread yields by returning (to a scheduler).

    Intermediate data is stored in a cactus stack (Cilk) or continuation closures (CPS).

    Data management and scheduling is a key problem (Budge et al. Eurographics 2009)
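
    The idea of yielding by returning can be sketched with a shader split into two passes around one synchronization point. Everything here is hypothetical: the `Closure` fields, the `pass1`/`pass2` split, and `fake_texture`, which stands in for whatever service the renderer performs between passes. A real partitioner would derive the passes and the saved state from the shader itself.

```c
#include <stddef.h>

typedef enum { PASS_DONE, PASS_NEED_TEXTURE } Status;

/* Continuation closure: the values that must survive across the
   synchronization point, one closure per point. */
typedef struct {
    float s, t;       /* inputs */
    float tex;        /* filled in by the scheduler between passes */
    float result;     /* output */
    Status status;
} Closure;

/* Pass 1: compute texture coordinates, then yield by returning. */
Status pass1(Closure *c) { c->s *= 0.5f; c->t *= 0.5f; return PASS_NEED_TEXTURE; }

/* Pass 2: resume with the fetched value and finish. */
Status pass2(Closure *c) { c->result = c->tex * 2.0f; return PASS_DONE; }

/* Stand-in for a renderer service such as a texture lookup. */
float fake_texture(float s, float t) { return s + t; }

/* Scheduler: no hardware threads or locks, just loops over closures. */
void run_batch(Closure *batch, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        batch[i].status = pass1(&batch[i]);
    for (size_t i = 0; i < n; ++i)          /* synchronization point */
        if (batch[i].status == PASS_NEED_TEXTURE)
            batch[i].tex = fake_texture(batch[i].s, batch[i].t);
    for (size_t i = 0; i < n; ++i)
        if (batch[i].status == PASS_NEED_TEXTURE)
            batch[i].status = pass2(&batch[i]);
}
```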

  • 7/29/2019 Leone Sig Graph 2011

    26/37

    Issues: summary

    CPU code generation (perhaps JIT)

    Vectorization

    GPU code generation

    Multi-pass partitioning

  • 7/29/2019 Leone Sig Graph 2011

    27/37

    Introduction to LLVM

    Mid-level intermediate representation (IR)

    High-level types: structs, arrays, vectors, functions.

    Control-flow graph: basic blocks with branches

    Many modular analysis and optimization passes.

    Code generation for x86, x64, ARM, ...

    Just-in-time (JIT) compiler too.

  • 7/29/2019 Leone Sig Graph 2011

    29/37

    Example: from C to LLVM IR

    define float @sqrt(float %f) {
    entry:
      %0 = fcmp ogt float %f, 0.0
      br i1 %0, label %bb1, label %bb2
    bb1:
      %1 = call float @fabsf(float %f)
      ret float %1
    bb2:
      ret float 0.0
    }
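
    The C side of this example did not survive in the transcript. C source along the following lines would plausibly lower to IR of that shape; this is a reconstruction from the IR, not the original listing, and the function is renamed `my_sqrt` here so it does not collide with the C library's `sqrt`.

```c
#include <math.h>

/* Reconstructed from the IR: one comparison, one conditional branch,
   and a return in each successor block. */
float my_sqrt(float f)
{
    if (f > 0.0f)            /* entry: fcmp ogt, br */
        return fabsf(f);     /* bb1: call @fabsf, ret */
    return 0.0f;             /* bb2: ret 0.0 */
}
```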

  • 7/29/2019 Leone Sig Graph 2011

    30/37

    Example: from C to LLVM IR

    define void @foo(i32 %x, i32 %y) {
      %z = alloca i32
      %1 = add i32 %y, %x
      store i32 %1, i32* %z
      ...
    }

  • 7/29/2019 Leone Sig Graph 2011

    31/37

    Writing a simple code generator

  • 7/29/2019 Leone Sig Graph 2011

    33/37

    Advantages of LLVM

    Well designed intermediate representation (IR).

    Wide range of optimizations (configurable).

    JIT code generation.

    Interoperability.

  • 7/29/2019 Leone Sig Graph 2011

    34/37

    Interoperability

    Shaders can call out to renderer via C ABI.

    We can inline library code into compiled shaders.

    Compile C++ to LLVM IR with Clang.

    This greatly simplifies code generation.

  • 7/29/2019 Leone Sig Graph 2011

    35/37

    Weaknesses of LLVM

    No automatic vectorization.

    Poor support for vector-oriented code generation.

    No predication.

    Few vector instructions, must resort to SSE/AVX intrinsics.

  • 7/29/2019 Leone Sig Graph 2011

    36/37

    LLVM resources

    www.llvm.org/docs

    Language Reference Manual: http://www.llvm.org/docs/LangRef.html

    Getting Started Guide: http://www.llvm.org/docs/GettingStarted.html

    LLVM Tutorial (section 3): http://www.llvm.org/docs/tutorial/LangImpl3.html

    Relevant open source projects

    ispc.github.com

    github.com/MarkLeone/PostHaste

  • 7/29/2019 Leone Sig Graph 2011

    37/37

    Questions?

    Mark Leone