HEVC Encoder
Kwangwoon University (KWU)
Donggyu Sim ([email protected])
Contents
Overview of HEVC
Encoding issues for HEVC test model (HM)
Complexity analysis of HEVC encoder
Fast encoding algorithms and performances
Issues of parallel processing
Conclusion
OVERVIEW OF HEVC
Block diagram of HEVC standard
Typical block-based hybrid codec structure + additional enhanced tools
  Inter prediction: ME, MC, DCT-IF, AMVP, Merge
  Intra prediction: reference sample padding, Planar, DC, 33 angular modes, MDIS
  Transform: TU size 32x32 ~ 4x4, residual quadtree
  Quantization: delta QP, RDOQ
  Entropy coding: CABAC
  Loop filter: deblocking filter, sample adaptive offset
  Inverse quantization and inverse transform in the reconstruction path
  Picture buffer holding the reference pictures
FIGURE. Block diagram of HEVC encoder
Block structure in HEVC
Three block structures are defined in HEVC
  Coding unit (CU)
  Prediction unit (PU)
  Transform unit (TU)
FIGURE. An example of CU, PU, and TU partition in HEVC (64x64 CTUs split into 32x32, 16x16, and 8x8 CUs; PU partitions 2Nx2N, 2NxN, Nx2N, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD; TU depths 0 to 2)
ENCODING STRUCTURES OF HEVC
Decision level for HEVC encoder
Sequence level
  Coding structure (all intra, low delay, random access)
  Profile, tier, level
  Max/min CTU size, CU depth
  Max/min TU size, TU depth
  Tool on/off (SAO, deblocking, WPP, tile)
Picture level
  Number of reference frames, rate control
  Tile, slice
Slice or tile level
  Reference frames
  Deblocking filter parameters
CTU level
  CU partitioning
  Sample adaptive offset parameters
CU level
  PU and TU partitioning
PU & TU level
  Prediction modes, motion vectors
  cbf, coefficients
FIGURE. Decision hierarchy: Sequence, Picture, Slice or Tile, CTU, CU, PU & TU
Temporal prediction structure (1/3)
All intra (AI)
  Every picture is coded as an instantaneous decoding refresh (IDR) picture
  No temporal prediction is allowed
FIGURE. All-intra structure: pictures 0 to 7 are IDR pictures, each coded with QPI; coding order equals POC
Temporal prediction structure (2/3)
Low delay (LD)
  The first picture shall be coded as an IDR picture
  Generalized P and B (GPB) pictures shall be used for the other successive pictures
  A GPB picture shall use only reference pictures whose POC is smaller than that of the current picture (all reference pictures in List_0 and List_1 shall be temporally previous in display order relative to the current picture)
  The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
FIGURE. Low-delay structure: picture 0 is an IDR or intra picture (QPI) and pictures 1 to 8 are GPB pictures; coding order equals POC; QPBL1 = QPI + 1, QPBL2 = QPI + 2, QPBL3 = QPI + 3; legend: depth 0, depth 1, depth 2
Temporal prediction structure (3/3)
Random access (RA)
  A hierarchical B structure shall be used for coding
  An IDR intra picture or a clean random access (CRA) picture shall be inserted cyclically, about once per second, as a random access point
  The QP of each inter-coded picture shall be derived by adding an offset to the QP of the intra-coded picture, depending on the temporal layer
FIGURE. Random-access structure over POC 0 to 8 with a separate coding order: IDR or intra picture, GPB picture, referenced B pictures, and non-referenced B pictures; QPBL1 = QPI + 1, QPBL2 = QPI + 2, QPBL3 = QPI + 3, QPBL4 = QPI + 4; legend: depth 0, depth 1, depth 2, depth 3
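The layer-dependent QP rule can be sketched in a few lines. This is a minimal illustration, not HM code: it assumes a dyadic GOP of 8 pictures, a hypothetical `intra_period` parameter that marks which POCs are intra pictures, and an offset of `depth + 1` per temporal layer (QPBL1 = QPI + 1 at depth 0 up to QPBL4 = QPI + 4 at depth 3). The function names are illustrative.

```python
def temporal_depth(poc, gop_size=8):
    """Temporal depth of a picture inside a dyadic hierarchical-B GOP."""
    pos = poc % gop_size
    if pos == 0:
        return 0
    depth = gop_size.bit_length() - 1  # log2(gop_size), 3 for a GOP of 8
    while pos % 2 == 0:                # each factor of two moves one layer up
        pos //= 2
        depth -= 1
    return depth


def picture_qp(poc, qp_intra, gop_size=8, intra_period=32):
    """QP of a picture: QPI for intra pictures, QPI + depth + 1 otherwise.

    intra_period is a hypothetical parameter for this sketch.
    """
    if poc % intra_period == 0:
        return qp_intra
    return qp_intra + temporal_depth(poc, gop_size) + 1
```

For example, with `qp_intra = 32` the pictures at POC 8, 4, 2, and 1 receive QP 33, 34, 35, and 36 under these assumptions.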
Picture partitioning
Picture: a picture contains an array of luma samples in monochrome format, or an array of luma samples and two corresponding arrays of chroma samples in 4:2:0, 4:2:2, or 4:4:4 color format
  The coding order of coding tree units (CTUs) is raster scan order
CTU: an NxN block of luma samples together with the two corresponding blocks of chroma samples
  Analogous to the macroblock in previous standards
  The maximum allowed size of the luma block in a CTU is specified to be 64x64 in the Main profile
* CTU & CTB: the CTU consists of a luma coding tree block (CTB), the corresponding chroma CTBs, and syntax elements
FIGURE. Example of a picture divided into CTUs (30 columns by 17 rows)
Example) Class B (1920x1080) BQTerrace, CTU size 64x64: 30x17 CTU partition
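The 30x17 figure follows from rounding the picture dimensions up to whole CTUs. A minimal sketch, with an illustrative function name:

```python
def ctu_grid(width, height, ctu_size=64):
    """Number of CTU columns and rows needed to cover a picture.

    Partial CTUs at the right and bottom edges still count as one CTU,
    hence the ceiling division.
    """
    cols = (width + ctu_size - 1) // ctu_size
    rows = (height + ctu_size - 1) // ctu_size
    return cols, rows
```

For 1920x1080 with 64x64 CTUs, 1080 / 64 = 16.875, so the last CTU row covers only part of a CTU and the grid is 30 by 17.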
Picture partitioning
A slice is a sequence of coding tree units (CTUs)
Unlike slices, tiles are always rectangular and always contain an integer number of coding tree units in coding tree unit raster scan
At least one of the following conditions should be true for each slice and tile in a picture
  All CTBs in a slice belong to the same tile
  All CTBs in a tile belong to the same slice
FIGURE. A picture with 30x17 coding tree units that is partitioned into three slices
FIGURE. A picture with 30x17 coding tree units that is partitioned into three tiles
Coding unit (CU) and coding tree structure
Coding unit (CU): the leaf node of a quadtree structure
  Square blocks
  Size: from 8x8 up to the size of the CTU (8x8 ~ 64x64)
  The size of the CTU is specified in the sequence parameter set (SPS)
The quadtree partitioning structure allows recursive splitting into four equally sized nodes
TABLE. Syntax for size of CTU in SPS
seq_parameter_set_rbsp() {                              Descriptor
  log2_min_coding_block_size_minus3                     ue(v)
  log2_diff_max_min_coding_block_size_minus2            ue(v)
}
FIGURE. Example of coding tree structure (a CTU split into 32x32, 16x16, and 8x8 CUs)
Example of CU quadtree structure
Coding unit quadtree structure
  Starting from the CTU, each CU can be split into 4 smaller CUs
  64 CU: split_coding_unit_flag (1)
    32 CU: split_coding_unit_flag (0)
    32 CU: split_coding_unit_flag (1)
      16 CU: split_coding_unit_flag (0)
      16 CU: split_coding_unit_flag (0)
      16 CU: split_coding_unit_flag (1)
FIGURE. Example of CU quadtree structure
TABLE. Syntax for CU split flag in coding tree
coding_tree(x0, y0, log2CbSize, ctDepth) {              Descriptor
if(x0+(1 log2MinCbSize && NumPCMBlock == 0 )split_coding_unit_flag[x0][y0] ae(v)
}
Coding unit (CU) decision
Coding unit quadtree structure
  Starting from the CTU, each CU can be split into 4 smaller CUs
  The best CU RD cost is calculated at each CU level
  The best CU competes with its sub-partitioned CUs
FIGURE. CU sizes 64x64, 32x32, 16x16, and 8x8 annotated with processing-order indices
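The competition between a CU and its four sub-CUs is naturally recursive. The sketch below is a simplified illustration rather than the HM CompressCU routine: `rd_cost` is a hypothetical callback that returns the best RD cost of coding one block without splitting, and split-flag bits are ignored.

```python
def decide_cu(x, y, size, rd_cost, min_size=8):
    """Return (best_cost, leaf_cus) for the CU at (x, y).

    rd_cost(x, y, size) is a hypothetical callback giving the best
    unsplit RD cost of the block; leaf_cus is a list of (x, y, size).
    """
    cost_here = rd_cost(x, y, size)
    if size <= min_size:
        return cost_here, [(x, y, size)]

    half = size // 2
    split_cost = 0
    split_leaves = []
    for dy in (0, half):          # visit the four sub-CUs in z-order
        for dx in (0, half):
            cost, leaves = decide_cu(x + dx, y + dy, half, rd_cost, min_size)
            split_cost += cost
            split_leaves += leaves

    # The unsplit CU competes with the sum of its best sub-partitioned CUs.
    if split_cost < cost_here:
        return split_cost, split_leaves
    return cost_here, [(x, y, size)]
```

Calling `decide_cu(0, 0, 64, rd_cost)` evaluates every CU size from 64x64 down to 8x8, which is why the exhaustive search is expensive and why the fast algorithms discussed later prune this tree.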
Prediction unit (PU) types
Prediction unit (PU): a region used for carrying the information related to the prediction processes
2 PU types for intra prediction
  2Nx2N (smallest CU: additionally NxN)
8 PU types for inter prediction
  Smallest CU
    8x8: 2Nx2N, Nx2N, 2NxN
    Others: 2Nx2N, Nx2N, 2NxN, NxN
  Others: 2Nx2N, Nx2N, 2NxN, nLx2N, nRx2N, 2NxnU, 2NxnD
FIGURE. PU partitions in HEVC (2Nx2N, Nx2N, 2NxN, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD)
Prediction unit (PU) types
FIGURE. Flowchart of the PU types tested for a CU, branching on the current CU size relative to the SCU size, on whether the current CU size is 8x8, and on the AMP enable flag. Candidate sets shown: Intra 2Nx2N, Intra NxN, Inter 2Nx2N, Inter 2NxN, Inter Nx2N, Inter NxN, Inter AMP
Transform unit (TU) and transform tree structure
Transform unit (TU): a region sharing the transform and quantization processes
  Square shape
  Size: from 4x4 up to 32x32 (4x4 ~ 32x32)
  Available transform block sizes and the maximum transform hierarchy depth are specified in the SPS
The root of the TU quadtree is the CU to which the TU belongs
FIGURE. TU quadtree structure in HEVC (32x32 root)
TABLE. Syntax for size of TU in SPS
seq_parameter_set_rbsp() {                              Descriptor
  log2_min_transform_block_size_minus_2                 ue(v)
  log2_diff_max_min_transform_block_size                ue(v)
  max_transform_hierarchy_depth_inter                   ue(v)
  max_transform_hierarchy_depth_intra                   ue(v)
}
INTER/INTRA PREDICTION AND PU/TU DECISION
Overall HM encoding process
Sequence, Picture, CTU decisions in a slice or a tile, deblocking filter, SAO, entropy coding
CU partitioning decision
  CompressCU is called on the 64x64 CU and then on its 32x32, 16x16, and 8x8 sub-CUs until the CU size reaches the SCU
PU & TU partitioning decision (RDO process to decide PU & TU)
  Merge skip
  Inter 2Nx2N, Inter NxN, Inter Nx2N, Inter 2NxN, Inter AMP
  Intra 2Nx2N, Intra NxN
  Intra PCM
Intra prediction flow
Prediction modes
  Luma (35 modes): Planar, DC, angular prediction (33 directions)
  Chroma (5 modes): Planar, DC, Vertical, Horizontal, DM
Filtering
  MDIS (mode-dependent intra smoothing)
  DC filtering, Ver/Hor filtering
3 MPM
FIGURE. Intra prediction flow for a 2Nx2N PU: reference sample padding, MDIS, intra prediction, RD cost per intra mode, loop over the modes (Y/N), best mode decision
Fast intra prediction & TU decision in HM
Intra prediction steps in HM
1) Rough prediction mode decision
  35 predictions, select N prediction modes
  Cost: distortion (SATD) + lambda * mode bits
  Number of candidate prediction modes: N modes + MPM (3)
2) Best intra prediction mode decision with transform
  Transform (RQT depth = 1), 1 best intra mode decision
3) Best RQT decision with RD costs
  RQT depth = 3
FIGURE. 35 modes, N modes + MPM, 1 best mode, best mode RD cost
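Step 1 can be sketched as a ranking by SATD + lambda * mode bits followed by a merge with the MPM list. The snippet assumes the SATD values and mode bit counts have already been computed; the function and parameter names are illustrative and not taken from HM.

```python
def rough_mode_candidates(satd, mode_bits, lam, n, mpm):
    """Select the candidate intra modes for the full RD check.

    satd      : dict mode -> SATD of the prediction residual (precomputed)
    mode_bits : dict mode -> estimated bits to signal the mode
    lam       : Lagrange multiplier
    n         : number of modes kept from the rough ranking
    mpm       : list of most probable modes, always added to the candidates
    """
    cost = {m: satd[m] + lam * mode_bits[m] for m in satd}
    ranked = sorted(cost, key=lambda m: (cost[m], m))  # ties broken by mode index
    candidates = ranked[:n]
    for m in mpm:
        if m not in candidates:
            candidates.append(m)
    return candidates
```

Only the returned candidates go through the transform-based RD evaluation of steps 2 and 3, instead of all 35 modes.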
Inter prediction
Skip: merge skip
Non-skip
  Uni-directional prediction
  Bi-directional prediction
  Half-pel / quarter-pel motion refinement
  DCT-IF (8-tap / 4-tap)
  Merge
FIGURE. Flowchart of inter prediction: for the current CU, merge skip is tested first, then Inter 2Nx2N, Inter 2NxN, Inter Nx2N, and AMP (nLx2N, nRx2N, 2NxnU, 2NxnD); each PU mode runs uni-directional prediction, bi-directional prediction, and merge (spatial, temporal, and additional candidate derivation), followed by RD cost calculation and best mode decision (RD cost, best mode)
Inter prediction
Inter coding modes
Merge skip mode (CU level)
  skip_flag = 1 and merge_idx
  No reference index
  No motion vector
  No residual
Merge mode (PU level)
  skip_flag = 0, pred_mode_flag, and part_mode
  merge_flag = 1 and merge_idx
  No reference index and no motion vector
  no_residual_syntax_flag: residual is encoded or not
General PU modes
  skip_flag = 0, pred_mode_flag, and part_mode
  merge_flag = 0
  ref_idx_lx and mvp_lx_flag based on AMVP (x = 0 or 1)
  MVD is encoded
  no_residual_syntax_flag: residual is encoded or not
Inter prediction flow
BEGIN input : current PU part mode for a CU
  FOR PU partition
    FOR List = 0 to 1 DO
      FOR 0 to refidx DO
        Motion estimation (diamond search, SR : 64)
        Decide best RD-cost for uni-prediction
      ENDFOR
    ENDFOR
    IF bi-directional prediction THEN
      FOR iteration = 0 to 3 DO
        FOR 0 to refidx DO
          Motion estimation (full search, SR : 4)
          Decide best RD-cost for bi-prediction
        ENDFOR
      ENDFOR
    ENDIF
  ENDFOR
  Merge
  RD-cost competition among uni/bi-prediction and merge
END output : inter prediction syntax
FIGURE. Pseudo code - Inter prediction flow
Fast encoder decision (FEN)
  Subsampled SAD for integer ME: use subsampled SAD when rows > 8 for integer ME
  Only 1 iteration for bi-predictive motion search (default number: 4)
Fast decision for merge RD cost (FDM)
  After merge with merge idx X, if all cbf are zero, then the merge process is terminated
FIGURE. Uni-prediction and bi-prediction of the current PU from LIST_0 and LIST_1 reference pictures along the time axis
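Bi-prediction
Search the P0 and P1 which produce the minimum error with respect to the original block O
  R = O - P, where P = (P0 + P1)/2
Practical bi-predictive search
1) With P0 fixed, search the P1 which produces the minimum 2R against the target (2O - P0)
  R = O - (P0 + P1)/2, so 2R = (2O - P0) - P1
2) With P1 fixed, search the P0 which produces the minimum error against the target (2O - P1)
  R = O - (P0 + P1)/2, so 2R = (2O - P1) - P0
BipredSearchRange: 4; FEN: 1 (iteration: 1)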
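The alternating search can be illustrated with a one-dimensional toy. The sketch assumes 1-D sample lists instead of 2-D blocks, SAD as the error measure, integer positions only, and starting positions supplied by a previous uni-directional search; all names are illustrative and this is not the HM implementation.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def refine(target, ref, center, n, search_range):
    """Best position in ref within +/- search_range of center for the target."""
    lo = max(0, center - search_range)
    hi = min(len(ref) - n, center + search_range)
    best_pos, best_cost = None, None
    for pos in range(lo, hi + 1):
        cost = sad(target, ref[pos:pos + n])
        if best_cost is None or cost < best_cost:
            best_pos, best_cost = pos, cost
    return best_pos, best_cost


def bipred_search(orig, ref0, ref1, pos0, pos1, search_range=4, iterations=4):
    """Alternately refine P1 against 2O - P0 and P0 against 2O - P1."""
    n = len(orig)
    cost = None
    for _ in range(iterations):
        p0 = ref0[pos0:pos0 + n]
        target = [2 * o - p for o, p in zip(orig, p0)]   # 2O - P0
        pos1, cost = refine(target, ref1, pos1, n, search_range)
        p1 = ref1[pos1:pos1 + n]
        target = [2 * o - p for o, p in zip(orig, p1)]   # 2O - P1
        pos0, cost = refine(target, ref0, pos0, n, search_range)
    return pos0, pos1, cost   # cost is the SAD of 2R
```

Because each step searches only one reference against a fixed target, the cost per iteration is that of a small uni-directional search rather than a joint search over both references.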
FIGURE. Example of bi-prediction: uni-directional prediction first finds P0 and P1 in the List 0 and List 1 references (search range: 64); bi-directional prediction then alternates refinements of P1 against 2O - P0 and of P0 against 2O - P1 over iterations 1 to 4 (BipredSearchRange: 4)
Motion estimation (integer-pel)
Practical motion estimation (diamond search)
First search & early termination
  At most 3 (default) more rounds after the most recent best match
Raster refinement search
  If the integer-pel distance is bigger than 5, then conduct the raster refinement search
Star refinement search & early termination
  Diamond search centered on the best match from the earlier two steps
  At most 2 rounds after the best match
FIGURE. Raster refinement search
FIGURE. First search & star refinement (diamond pattern with distances 0 to 3 around the search center)
Motion estimation (sub-pel refinement)
Integer-pel motion search
  Cost function: SAD
Sub-pel motion refinement
  Cost function: SATD
  Half-pel refinement
  Quarter-pel refinement
FIGURE. Integer-pel motion search (within the search range)
FIGURE. Half-pel motion search
FIGURE. Quarter-pel motion search
Interpolation
DCT-IF in HEVC
  Fixed 8-tap (7-tap) and 4-tap interpolation filters based on the DCT
  2-D separable filter
    8 * horizontal 1-D filter + 1 * vertical 1-D filter
TABLE. Interpolation filter coefficients
Component  Position  Filter coefficients
Luma       1/4       {-1, 4, -10, 58, 17, -5, 1, 0}
Luma       1/2       {-1, 4, -11, 40, 40, -11, 4, -1}
Chroma     1/8       {-2, 58, 10, -2}
Chroma     1/4       {-4, 54, 16, -2}
Chroma     3/8       {-6, 46, 28, -4}
Chroma     1/2       {-4, 36, 36, -4}
FIGURE. Integer and fractional sample positions for luma and chroma interpolation
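A one-dimensional sketch shows how the luma filters are applied. The taps are the signed HEVC luma coefficients, each set summing to 64. The snippet assumes a single horizontal pass with rounding and a 6-bit shift, and simple clamping at the row ends as padding; the two-stage intermediate precision of an actual codec is not modeled, and the function name is illustrative.

```python
LUMA_QUARTER = (-1, 4, -10, 58, 17, -5, 1, 0)
LUMA_HALF = (-1, 4, -11, 40, 40, -11, 4, -1)


def interpolate(samples, i, taps, bit_depth=8):
    """Fractional sample located between samples[i] and samples[i + 1].

    The 8 taps cover samples[i - 3] .. samples[i + 4]; indices outside the
    row are clamped, which acts as a simple edge padding (an assumption of
    this sketch).
    """
    last = len(samples) - 1
    acc = 0
    for k, coeff in enumerate(taps):
        idx = min(max(i - 3 + k, 0), last)
        acc += coeff * samples[idx]
    value = (acc + 32) >> 6                      # taps sum to 64
    return min(max(value, 0), (1 << bit_depth) - 1)
```

On a flat row the filter returns the input value unchanged, and on a linear ramp the half-pel output lies midway between the two neighboring integer samples, which is a quick sanity check for the coefficient signs.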
Example of PU decision
Example
No TU decision, no reconstruction
  Uni-prediction RD cost = SAD/SATD + lambda * Bmode = 12000
  Merge RD cost = SAD/SATD + lambda * Bmode = 11000
  Bi-prediction RD cost = SAD/SATD + lambda * Bmode = 9000
vs.
TU decision, reconstruction
  Bi-prediction RD cost = SSE + lambda * Bmode = 8500
TU decision flow (Inter)
Residual quadtree
  Residual = original - predictor, for each PU partition (2Nx2N, Nx2N, 2NxN, NxN, nLx2N, nRx2N, 2NxnU, 2NxnD)
  For TU depth 0, 1, and 2: T/Q, IT/IQ (reconstruction), RD cost (SSE + lambda * Bmode)
TU decision flow (Intra)
Example) intra_pred_mode = 10 (vertical mode)
  Intra prediction using reference samples, then T/Q, IT/IQ, and RD cost (SSE + lambda * Bmode)
  At TU depth N + 1, the reference samples are taken after the above block is reconstructed
FIGURE. Reference samples, prediction direction, and residual at TU depth N and TU depth N + 1
Transform
Implementation of transform in HEVC
Matrix multiplication
  Straightforward, few lines of code
  Huge number of operations, but SIMD friendly
Partial butterfly implementation
  Utilizes the symmetry/anti-symmetry properties of the basis vectors
  Fewer multiplications/additions
  More lines of code
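The two implementation styles can be compared on the 4-point core transform. The sketch uses the HEVC 4x4 integer DCT basis and leaves out the scaling shifts between the transform stages; it only shows that the butterfly reproduces the matrix product with fewer multiplications. Function names are illustrative.

```python
DCT4 = (
    (64, 64, 64, 64),
    (83, 36, -36, -83),
    (64, -64, -64, 64),
    (36, -83, 83, -36),
)


def dct4_matrix(x):
    """4-point forward transform by direct matrix multiplication (16 multiplies)."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]


def dct4_butterfly(x):
    """Same transform using the even/odd symmetry of the basis vectors."""
    e0, e1 = x[0] + x[3], x[1] + x[2]   # even part: symmetric rows 0 and 2
    o0, o1 = x[0] - x[3], x[1] - x[2]   # odd part: anti-symmetric rows 1 and 3
    return [
        64 * e0 + 64 * e1,
        83 * o0 + 36 * o1,
        64 * e0 - 64 * e1,
        36 * o0 - 83 * o1,
    ]
```

A 2-D transform applies the same 1-D routine to the rows and then to the columns; for larger sizes the even part is itself a smaller transform, which is where the additional code lines come from.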
Partitioning syntax for a CTU
Syntax
CU split flags
  64 CU: split_coding_unit_flag (1)
    32 CU: split_coding_unit_flag (0)
    32 CU: split_coding_unit_flag (1)
      16 CU: split_coding_unit_flag (0)
      16 CU: split_coding_unit_flag (0)
      16 CU: split_coding_unit_flag (1)
For each CU: PU partition & pred_mode info, then TU split flags & coefficients
  SKIP flag (merge idx)
  Prediction mode flag (intra or inter)
  PU part size (2Nx2N, 2NxN, Nx2N, NxN, AMP)
  Prediction info. (intra mode, or mv and ref. idx., merge idx, AMVP idx)
TU split flags
  32x32 TU: split flag (1)
    16x16 TU: split flag (0)
    16x16 TU: split flag (1)
      8x8 TU: split flag (0)
      8x8 TU: split flag (0)
      8x8 TU: split flag (0)
      8x8 TU: split flag (0)
    16x16 TU: split flag (1)
      8x8 TU: split flag (0)
      8x8 TU: split flag (1)
        4x4 TU: split flag (0)
        4x4 TU: split flag (0)
        4x4 TU: split flag (0)
        4x4 TU: split flag (0)
FIGURE. Example of CU quadtree structure
FIGURE. Example of TU quadtree structure
ENCODING PROCESS OF LOOP FILTER
In-loop filter
In HEVC, two processing steps, a deblocking filter (DBF) and a sample adaptive offset (SAO) operation, are applied
  DBF: similar to the DBF of the H.264/AVC standard
  SAO: applied adaptively to all samples satisfying certain conditions (while the DBF is only applied to the samples located at block boundaries)
On/off syntaxes for in-loop filters
  1. slice_disable_deblocking_filter_flag: slice-level on/off
  2. sample_adaptive_offset_enabled_flag: slice-level on/off
Deblocking filter (DBF)
Basically, the deblocking filter of HEVC is similar to that of H.264/AVC
In-loop filtering
  Coding performance for inter frames
  Frame-based filtering
  On/off control is provided
Adaptive filtering: boundary strength
Filtering on the block boundaries: transform and prediction boundaries
Sequential filtering for vertical and horizontal edges
  Sample values modified during filtering of vertical edges are used as input for the filtering of the horizontal edges
Deblocking filter (DBF)
Features of the HEVC deblocking filter compared to H.264/AVC
  For the TUs and PUs with edges of less than 8 samples in either the vertical or horizontal direction, only the edges lying on the 8x8 sample grid are filtered
  1. Vertical edges > horizontal filtering
  2. Horizontal edges > vertical filtering
FIGURE. Derivation process for the boundary filter strength in AVC and HEVC: (a) H.264/AVC, (b) HEVC [e.g. 16x16 coding unit]
Processing flow of DBF
Boundary decision
  Three kinds of boundaries are involved in the filtering: CU, TU, and PU boundaries
  CU boundaries are always involved in the filtering
  TU boundaries on the 8x8 block grid and PU boundaries between the PUs inside a CU are involved in the filtering
  [Exception] If a PU boundary is inside a TU, the boundary shall not be filtered
Bs calculation
  Bs is calculated on a 4x4 block basis and then remapped to the 8x8 grid
  Two Bs values belong to the 8 pixels forming a line in the 4x4 grid; the maximum Bs is selected as the Bs for the boundaries in the 8x8 grid
FIGURE. Overall processing flow of the deblocking filter process: boundary decision, Bs calculation (4x4 > 8x8), beta and tc decision, filter on/off decision, strong/weak filter selection, strong filtering or weak filtering
Overview of sample adaptive offset (1/2)
Artifacts
  Blocking artifacts, ringing artifacts, color biases, and blurring artifacts
  A larger transform could introduce more artifacts (HEVC: 4x4 ~ 32x32 transform)
  Artifacts exist at medium and low bit rates
  A large number of interpolation taps can also lead to more serious ringing artifacts (HEVC: 8-tap for luma, 4-tap for chroma)
Sample adaptive offset
  Reduces the sample distortion (reconstructed pixels - original pixels)
  Average 3.5% BD-rate reduction (with 1% encoding time increase and 2.5% decoding time increase)
  SAO is located after the DF and also belongs to in-loop filtering
Overview of sample adaptive offset (2/2)
SAO features
  Each color component may have its own SAO parameters
  Two SAO types
    Edge offset (EO; 4 EO classes)
    Band offset (BO; 1 BO class)
  SAO merging (left CTU or above CTU)
    SAO merge information is shared by the three color components
SAO objective and subjective results
FIGURE. SAO is enabled (QP = 32) vs. SAO is disabled (QP = 32)
Anchor: disabling SAO; Test: enabling SAO
CTU size in luma: 64x64; CTU boundary: option 1
TABLE. Y BD-rate
                 All intra (AI)  Random access (RA)  Low delay B (LB)  Low delay P (LP)
Class summary
Class A          0.6%            2.3%                -                 -
Class B          0.5%            2.1%                2.0%              11.1%
Class C          0.5%            1.1%                1.8%              7.1%
Class D          0.4%            0.3%                0.7%              4.4%
Class E          0.6%            -                   2.3%              11.0%
Class F          1.5%            2.6%                5.7%              12.3%
Overall summary
All              0.7%            1.7%                2.5%              9.2%
Enc. time (%)    101%            100%                100%              100%
Dec. time (%)    103%            103%                102%              102%
Edge offset of SAO
Four 1-D directional patterns: horizontal, vertical, 135-degree diagonal, 45-degree diagonal
Only one EO class can be selected for each CTB for which EO is enabled
Each sample inside the CTB is classified into one of five categories
  One edge offset is encoded for each category (4 offsets are transmitted in the case of EO)
  No information is sent for the classification into the five categories (the encoder and the decoder use the same rules)
FIGURE. Four 1-D directional patterns for EO sample classification (c is the current sample; a and b are its two neighbors along the pattern)
TABLE. Sample classification rules for edge offset
Category  Condition
1         c < a and c < b
2         (c < a and c == b) or (c == a and c < b)
3         (c > a and c == b) or (c == a and c > b)
4         c > a and c > b
0         None of the above (SAO is not applied)
FIGURE. Pixel level vs. pixel index (x - 1, x, x + 1) for categories 1 to 4: positive edge offset for categories 1 and 2, negative edge offset for categories 3 and 4
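The edge offset classification compares the current sample c with its two neighbors a and b along the selected direction: category 1 is a local minimum, category 4 a local maximum, and categories 2 and 3 are concave and convex corners. The sketch below applies the horizontal pattern to one row; the function names and the offset dictionary layout are illustrative assumptions.

```python
def eo_category(a, c, b):
    """Edge offset category of sample c with neighbors a and b."""
    if c < a and c < b:
        return 1
    if (c < a and c == b) or (c == a and c < b):
        return 2
    if (c > a and c == b) or (c == a and c > b):
        return 3
    if c > a and c > b:
        return 4
    return 0


def apply_eo_row(row, offsets, bit_depth=8):
    """Apply the horizontal EO class to one row of samples.

    offsets maps category 1..4 to its offset. Classification always uses
    the unmodified input row; the first and last samples have no
    horizontal neighbor inside the row and are left untouched.
    """
    max_val = (1 << bit_depth) - 1
    out = list(row)
    for x in range(1, len(row) - 1):
        cat = eo_category(row[x - 1], row[x], row[x + 1])
        if cat:
            out[x] = min(max(row[x] + offsets[cat], 0), max_val)
    return out
```

Because the decoder derives the category from the same reconstructed samples, only the four offsets and the EO class need to be transmitted.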
Band offset of SAO
BO implies that one offset is added to all samples of the same band
  The sample value range is equally divided into 32 bands
  For 8-bit samples ranging from 0 to 255, the width of a band is 8
Only the offsets of four consecutive bands and the starting band position are signaled to the decoder
  The average difference between the original samples and the reconstructed samples in a band is signaled to the decoder
  Four offsets are transmitted in the case of BO
FIGURE. Sample range from 0 to max divided into bands, marking the first band for which an offset is transmitted and the four consecutive bands that carry offsets
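With 32 equal bands, the band index of a sample is simply its five most significant bits. A minimal sketch of applying the four signaled offsets, with illustrative names:

```python
def apply_band_offset(samples, start_band, offsets, bit_depth=8):
    """Add the signaled offsets to samples in the consecutive bands.

    start_band : index (0..31) of the first band that carries an offset
    offsets    : offsets for the consecutive bands, four in the BO case
    """
    shift = bit_depth - 5                 # 32 bands: band index = top 5 bits
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        k = (s >> shift) - start_band
        if 0 <= k < len(offsets):
            s = min(max(s + offsets[k], 0), max_val)
        out.append(s)
    return out
```

For 8-bit video the shift is 3, so band 8 covers the sample values 64 to 71, which matches the band width of 8 stated above.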
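A fast distortion estimation for SAO
Distortions have to be calculated many times
Let k, s(k), and x(k) be the sample positions, original samples, and pre-SAO samples, respectively, and let C be the set of sample positions that share one offset
Distortion between original samples and pre-SAO samples
$$D_{pre} = \sum_{k \in C} (s(k) - x(k))^2$$
Distortion between original samples and post-SAO samples
$$D_{post} = \sum_{k \in C} (s(k) - (x(k) + h))^2$$
h is the offset for the sample set and N is the number of samples in the set; the delta distortion is defined as (N and E can be calculated only once)
$$\Delta D = D_{post} - D_{pre} = \sum_{k \in C} (h^2 - 2h(s(k) - x(k))) = N h^2 - 2 h E$$
$$E = \sum_{k \in C} (s(k) - x(k))$$
$$\Delta J = \Delta D + \lambda R$$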
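The benefit of the closed form is that N and E are accumulated once per sample set, after which the delta distortion of any candidate offset h costs a constant number of operations. The sketch below checks N h^2 - 2 h E against the direct sample-by-sample computation; function names are illustrative.

```python
def sao_stats(orig, pre):
    """Return (N, E): sample count and sum of (original - pre-SAO) differences."""
    n = len(orig)
    e = sum(s - x for s, x in zip(orig, pre))
    return n, e


def delta_distortion(n, e, h):
    """Fast estimate: D_post - D_pre = N*h^2 - 2*h*E."""
    return n * h * h - 2 * h * e


def direct_delta(orig, pre, h):
    """Reference computation that revisits every sample."""
    d_pre = sum((s - x) ** 2 for s, x in zip(orig, pre))
    d_post = sum((s - (x + h)) ** 2 for s, x in zip(orig, pre))
    return d_post - d_pre
```

A negative delta distortion means the offset brings the reconstructed samples closer to the original samples.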
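Offset refinement
The initial offset value h is E/N
All the integers between zero and the initial offset are tested in the offset refinement process
FIGURE. Offset refinement from the initial offset toward zero (offset axis 0 to 6, two examples)
$$E = \sum_{k \in C} (s(k) - x(k))$$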
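The refinement step can be sketched as a scan from the initial offset E/N toward zero, keeping the offset with the lowest delta RD cost. This is an illustrative sketch under stated assumptions: `rate_bits` is a hypothetical callback that estimates the bits for an offset, the initial offset is rounded to the nearest integer, and the delta distortion is the fast estimate N h^2 - 2 h E.

```python
def refine_offset(n, e, lam, rate_bits):
    """Return (best_offset, best_delta_cost) for one SAO sample set.

    n, e      : sample count and sum of differences of the set
    lam       : Lagrange multiplier
    rate_bits : hypothetical callback, offset -> estimated bits
    """
    if n == 0:
        return 0, lam * rate_bits(0)
    # Initial offset E/N, rounded to the nearest integer (assumption).
    h0 = int(e / n + 0.5) if e >= 0 else -int(-e / n + 0.5)
    step = 1 if h0 > 0 else -1
    candidates = list(range(h0, 0, -step)) + [0]   # h0 down to 0 inclusive

    def delta_cost(h):
        return n * h * h - 2 * h * e + lam * rate_bits(h)

    best = min(candidates, key=delta_cost)
    return best, delta_cost(best)
```

A larger lambda pushes the selected offset toward zero, because small offsets cost fewer bits.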
Encoding flow of SAO in HM
CTU-based processing
1) Calculate SAO statistics (E: sum of differences, N: pixel count)
  BO: 32 bands
  EO class 0, class 1, class 2, class 3: per category
2) Calculate SAO RD cost (a fast distortion estimation, offset refinement)
  EO class 0: rdcost0 = distortion + rate
  EO class 1: rdcost1 = distortion + rate
  EO class 2: rdcost2 = distortion + rate
  EO class 3: rdcost3 = distortion + rate
  BO: band position, rdcostBO = distortion + rate
  RD cost comparison among the types (BO, EO class 0, EO class 1, EO class 2, EO class 3)
3) Merge left or up
  Left merge, up merge RD cost
FIGURE. Flowchart - Sample adaptive offset: compress slice, deblocking filter (DBF), sample adaptive offset (SAO; RDO of SAO, process SAO), encode slice
Slice-level on/off control of SAO
Hierarchical quantization parameter (QP) settings for each group of pictures
A slice-level on/off decision algorithm
  For depth = 0 pictures, SAO is always enabled in the slice header
  Other depths: if the previous picture (the last picture of depth N - 1 in decoding order) disables SAO for more than 75% of its CTUs, the current picture terminates the SAO encoding process early and disables SAO in all slice headers
FIGURE. Hierarchical GOP of 8 pictures: depth 0: 8k; depth 1: 8k + 4; depth 2: 8k + 2, 8k + 6; depth 3: 8k + 1, 8k + 3, 8k + 5, 8k + 7; a higher depth uses a higher QP
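The early-termination rule amounts to a single comparison per picture. A minimal sketch, assuming the encoder keeps the number of CTUs with SAO disabled in the last picture of the previous depth; names are illustrative.

```python
def slice_sao_enabled(depth, prev_disabled_ctus, prev_total_ctus):
    """Decide whether SAO is enabled in the slice headers of the current picture.

    prev_disabled_ctus / prev_total_ctus describe the last picture of
    depth - 1 in decoding order.
    """
    if depth == 0 or prev_total_ctus == 0:
        return True
    disabled_ratio = prev_disabled_ctus / prev_total_ctus
    return disabled_ratio <= 0.75   # more than 75% disabled: skip SAO
```

When the function returns False, the encoder can skip the statistics collection and RD search for the whole picture, which is where the time saving comes from.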
CTU-based encoding issues about SAO
Since SAO is after the DF, the SAO parameters cannot be precisely estimated until the deblocked samples are available
In a CTU-based encoder, the deblocked samples of the right columns and the bottom rows in the current CTU may be unavailable
Two practical CTU-based SAO decisions
  Case 1. Avoid using the bottom rows and right columns (current HM)
  Case 2. Use non-deblock-filtered pixels for the bottom rows and right columns (JCTVC-J0139)
FIGURE. Deblock-filtered pixels and non-deblock-filtered pixels in a CTU
TABLE. Average BD-rates of enabling SAO versus disabling SAO for different CTU sizes
Option 1: skip the right and bottom samples in the CTU during parameter estimation
Option 2: use pre-deblocked samples near the right and bottom boundaries in the CTU during parameter estimation
CTU size in luma   Option 1                Option 2
                   Y      Cb     Cr        Y      Cb     Cr
64x64              3.5%   4.8%   5.8%      3.3%   5.3%   6.6%
32x32              2.0%   1.1%   1.5%      2.5%   2.0%   2.7%
16x16              0.0%   0.3%   0.3%      0.8%   0.4%   0.1%
COMPLEXITY ANALYSIS OF HEVC ENCODER
Complexity analysis of HM encoder
Test sequences
  Class B (1920x1080): Kimono, ParkScene, Cactus, BasketballDrive, BQTerrace
  Class C (832x480): BasketballDrill, BQMall, PartyScene, RaceHorses
  QP: 22, 27, 32, 37
  Main profile
  Random access, low delay
Test environment
  HM 7.0 software
  Intel Core(TM) [email protected]
  4 GB memory
  Windows 7 (64-bit)
  Analysis tool: Intel VTune(TM) Amplifier XE
FIGURE. Class B, BasketballDrive
FIGURE. Class C, BQMall
Profiling result of HEVC encoder
TABLE. Complexity ratio of HM 7.0 encoder (RA)
Class  Module       QP 22   QP 27   QP 32   QP 37
B      Entropy      6.6     3.4     1.0     0.9
B      Intra        3.3     2.2     2.1     1.4
B      Inter        68.4    78.1    83.9    85.7
B      TR+Q         20.4    15.2    11.7    10.6
B      Loop filter  0.2     0.2     0.2     0.1
B      etc          1.2     1.1     1.3     1.5
C      Entropy      6.5     3.9     2.8     1.3
C      Intra        2.9     2.7     2.2     1.8
C      Inter        68.8    74.9    79.8    83.3
C      TR+Q         20.7    17.0    13.9    12.4
C      Loop filter  0.2     0.2     0.2     0.1
C      etc          1.0     1.5     1.4     1.2
TABLE. Complexity ratio of HM 7.0 encoder (LD)
Class  Module       QP 22   QP 27   QP 32   QP 37
B      Entropy      6.1     2.8     0.4     0.3
B      Intra        3.4     2.0     1.2     1.2
B      Inter        71.3    81.2    87.3    89.1
B      TR+Q         18.6    13.0    9.9     8.5
B      Loop filter  0.2     0.2     0.2     0.1
B      etc          0.8     1.2     0.8     0.9
C      Entropy      5.3     3.1     1.1     0.4
C      Intra        3.0     2.5     1.8     1.5
C      Inter        72.6    79.1    83.5    87.2
C      TR+Q         18.2    14.9    12.1    10.1
C      Loop filter  0.2     0.2     0.2     0.1
C      etc          1.1     0.6     1.6     1.0
Summary
  Inter prediction: 77~81%
  Tr + Q: 14~16%
  Entropy coding: 2~4%
  Intra prediction: 1~2%
  Loop filter: 0.1~0.2%
Complexity portions of HM encoder
FIGURE. HEVC encoder block diagram and profiling result (inter prediction, transform + Q, intra prediction, loop filter, entropy coding, etc.)
Complexity portions for CU sizes and modes
FIGURE. Example of CU quadtree structure
TABLE. Complexity portions for CU sizes and modes
Size    Mode    RA (%)   LD (%)   Average (%)
64x64   Intra   2.1      1.0      1.6
64x64   Inter   19.0     31.9     25.5
64x64   Skip    3.9      3.4      3.7
32x32   Intra   1.9      0.7      1.3
32x32   Inter   25.0     27.4     26.2
32x32   Skip    4.5      3.2      3.9
16x16   Intra   2.3      0.2      1.3
16x16   Inter   17.0     12.5     14.8
16x16   Skip    3.2      1.7      2.5
8x8     Intra   2.4      0.4      1.4
8x8     Inter   8.7      4.9      6.8
8x8     Skip    1.7      0.6      1.2
Selected ratios of CU, PU and TU
TABLE. Selected ratio of CU size and PU mode
                        Class B (QP)               Class C (QP)
CU size  PU mode        22    27    32    37       22    27    32    37
64x64    Merge skip     10.6  26.6  43.3  55.2     11.7  20.6  30.6  39.5
64x64    Inter 2Nx2N    4.5   7.1   7.2   6.0      5.8   7.5   6.7   5.5
64x64    Inter Nx2N     1.4   2.2   1.8   1.3      1.6   1.8   1.7   1.7
64x64    Inter 2NxN     1.5   1.9   1.3   0.9      1.2   1.0   0.8   0.7
64x64    Inter AMP      1.2   1.4   1.0   0.7      1.0   1.1   1.0   1.1
64x64    Intra 2Nx2N    0.3   0.4   0.6   1.0      0.0   0.0   0.0   0.1
32x32    Merge skip     9.9   12.4  19.9  8.4      12.2  13.5  15.2  16.8
32x32    Inter 2Nx2N    8.1   6.9   4.6   3.1      9.1   7.2   5.4   4.3
32x32    Inter Nx2N     1.8   1.4   0.9   0.4      2.2   1.9   1.9   1.7
32x32    Inter 2NxN     1.7   1.3   0.7   1.0      1.4   1.0   0.9   0.8
32x32    Inter AMP      4.4   2.9   1.6   0.6      4.2   3.5   3.1   2.6
32x32    Intra 2Nx2N    2.3   2.3   2.6   2.6      0.2   0.4   0.7   1.1
16x16    Merge skip     6.8   5.6   3.9   2.9      8.0   7.7   7.3   6.1
16x16    Inter 2Nx2N    9.1   3.7   1.7   0.8      6.9   4.8   3.1   2.0
16x16    Inter Nx2N     1.6   0.7   0.3   0.1      2.0   1.4   1.0   0.6
16x16    Inter 2NxN     1.7   0.6   0.2   0.1      1.2   0.8   0.5   0.3
16x16    Inter AMP      4.1   1.4   0.5   0.2      4.1   2.7   1.7   0.9
16x16    Intra 2Nx2N    2.6   2.1   1.7   1.4      1.2   1.6   1.8   1.7
8x8      Merge skip     2.8   1.9   1.2   0.9      3.9   3.3   2.3   1.4
8x8      Inter 2Nx2N    5.8   1.3   0.4   0.1      4.9   2.5   1.1   0.4
8x8      Inter Nx2N     0.3   0.2   0.1   0.0      1.2   0.7   0.3   0.1
8x8      Inter 2NxN     0.4   0.2   0.1   0.0      0.7   0.4   0.2   0.1
8x8      Intra 2Nx2N    2.9   1.2   0.1   0.5      2.1   1.7   1.2   0.8
8x8      Intra NxN      0.8   0.6   0.7   0.2      1.9   1.1   0.6   0.3
TABLE. Selected ratio of TU
Class  Size    QP 22   QP 27   QP 32   QP 37
B      32x32   33.5    55.0    63.0    65.7
B      16x16   19.8    20.9    20.1    19.7
B      8x8     36.2    15.5    10.7    10.0
B      4x4     10.5    8.5     6.2     4.5
C      32x32   35.7    43.4    49.2    52.2
C      16x16   27.7    27.7    27.5    29.0
C      8x8     21.7    18.1    15.8    13.9
C      4x4     14.8    10.8    7.5     4.9
BDBR vs. encoding time depending on CTU size
CTU size: 32x32
  3.3~3.4% BD-bitrate
  78~79% encoding time
CTU size: 16x16
  15.4~17.5% BD-bitrate
  50~54% encoding time
FIGURE. BD-bitrate vs. encoding time relative to CTU size 64x64 (reference), two plots
  CTU size 16x16: Enc T 50.8%, BD-bitrate 17.53%
  CTU size 32x32: Enc T 79.22%, BD-bitrate 3.31%
  CTU size 16x16: Enc T 54.7%, BD-bitrate 15.43%
  CTU size 32x32: Enc T 78.92%, BD-bitrate 3.43%
SW: HM 7.1; Seq: Class B; cfg: random access & low delay
BDBR vs. encoding time depending on TU size
Transform size
16x16 to 4x4 on case
  3.2~3.5% BD-bitrate
  96% encoding time
8x8 to 4x4 on case
  10.2~11.2% BD-bitrate
  91~92% encoding time
FIGURE. BD-bitrate vs. encoding time relative to max TU size 32x32, quadtree max depth 3 (reference), two plots
  Max TU size 8x8, quadtree max depth 1: Enc T 92.4%, BD-bitrate 11.2%
  Max TU size 8x8, quadtree max depth 1: Enc T 91.4%, BD-bitrate 10.24%
  Max TU size 16x16, quadtree max depth 2: Enc T 96.8%, BD-bitrate 3.2%
  Max TU size 16x16, quadtree max depth 2: Enc T 96.5%, BD-bitrate 3.5%
SW: HM 7.1; Seq: Class B; cfg: random access & low delay
Tool on/off test
Fast encoding algorithms in HM software
TABLE. Fast encoding algorithms in HM software
Contents                                                   Note                                                         Level
Fast Encoding Setting: FEN, JCTVC-A0124                    Early CU termination; subsampled SAD operation;
                                                           simple bi-prediction (number of iterations 4 > 1)
Fast Decision for Merge RD Cost: FDM, JCTVC-H178           2Nx2N merge, CBF, early termination                          PU level
Rough Mode Decision (for intra): RMD, JCTVC-C311/D283      35 intra modes, SATD, RD, full RQT                           PU level
AMP Speed-up: AMPS, JCTVC-E316                             AMP, ME or merge                                             PU level
CBF Fast Mode Setting: CFM, JCTVC-F045                     PU, CBF 0, PU ME                                             PU level
Early CU Setting: ECU, JCTVC-F092                          CU skip, CU                                                  CU level
Early Skip Detection Setting: ESD, JCTVC-G543              Inter 2Nx2N, early skip detection                            CU level
HM encoder for FHD (BQTerrace.seq)
CPU: Intel i7 CPU, 2.x GHz
Compress Slice
  - Interpolation filter (IF)
  - Motion estimation (ME)
  - Transform-Quantization (TR-Q)
  - Intra prediction
  - MV derivation
  - Mode decision
  - Entropy encoding (CABAC update)
DBF
SAO
Encode Slice
  - Entropy encoding
One frame: 57930 ms (for real-time: 33.33 ms)
  IF: 21548.62 ms
  RDOQ: 2645.55 ms
  TR: 1687.37 ms
  ITR: 653.2829 ms
  DBF: 9.42 ms
  SAO: 77.33 ms
KWHEVC encoder
ANSI C HEVC encoder software based on the HM encoder
  Clean-up of functions and variables
  Non-recursive function calls
Minimum memory allocation and bandwidth
  Explicit minimum memory allocations (using static memory)
  Removal of code related to duplicate variables and structures to avoid redundant memory copies
  Removal of unnecessary memory allocation
Software optimization
  SIMD implementation (cost function, transform, interpolation, deblocking, ...)
  Frame-level interpolation filter
Parallel processing
  Slice-level parallel processing using OpenMP
  Motion estimation using CUDA
Performance of KWHEVC
1) C converting: 18% ATS gain (without any BD-BR or BD-PSNR loss)
2) + SIMD + frame-level IF: 2x speed-up (without any BD-BR or BD-PSNR loss)
3) + Fast mode decision: 5x speed-up (1~2% BD-BR loss)
4) + Slice-level parallel: 20x speed-up (4~6% BD-BR loss)
5) + CUDA ME & MD (low delay P, adjusted config.): 200x speed-up (15~20% BD-BR loss)
  {Intel i7 (3.3 GHz), GeForce 660} => 10 fps
FIGURE. Encoding speed in terms of the development steps
TABLE. Encoding speed of KWHEVC (FPS)
Class  Sequence         Frames  QP 22   QP 27   QP 32   QP 37
B      Kimono           240     5.74    7.25    8.38    9.40
B      ParkScene        240     5.51    7.52    8.87    10.03
B      Cactus           500     5.19    7.70    9.09    10.09
B      BasketballDrive  500     4.80    6.71    8.09    9.18
B      BQTerrace        600     4.14    7.68    9.60    10.62
C      BasketballDrill  500     14.86   19.07   23.60   28.12
C      BQMall           600     14.81   19.88   24.91   29.20
C      PartyScene       500     11.09   16.46   22.03   27.60
C      RaceHorses       300     10.48   14.60   19.46   24.49
Comparison of decoder complexity
HM 10.0 (C++) vs. KWHEVC decoder (C89)
  C conversion
  Software optimization
TABLE. Decoding performance
Sequence                           HM 10.0 (sec)  FPS     KWHEVC (sec)  FPS     Ratio
BQTerrace_1920x1080_60_qp22.bin    98.271         6.11    71.007        8.45    1.38
BQTerrace_1920x1080_60_qp27.bin    46.531         12.89   30.778        19.49   1.51
BQTerrace_1920x1080_60_qp32.bin    32.737         18.33   19.234        31.19   1.70
BQTerrace_1920x1080_60_qp37.bin    28.189         21.28   15.912        37.71   1.77
Cactus_1920x1080_50_qp22.bin       51.355         9.74    36.270        13.79   1.42
Cactus_1920x1080_50_qp27.bin       31.371         15.94   20.155        24.81   1.56
Cactus_1920x1080_50_qp32.bin       25.506         19.60   15.381        32.51   1.66
Cactus_1920x1080_50_qp37.bin       21.933         22.80   12.792        39.09   1.71
Parallelism and SIMD processing
Parallelism
  A decoder cannot expect the tile or slice partitioning of pictures
  A decoder should consider worst-case bitstreams
  The entropy decoder cannot be parallelized
  CTU-based 2-D wavefront parallel processing is a promising way to achieve parallelism
  The deblocking filter and SAO are better suited to parallelism
    Less data dependency
SIMD processing
  Inverse transform (X = A^T Y A)
  Motion compensation
    About 40% of decoder complexity
    8-tap and 4-tap filters
Performance of the optimized KWHEVC decoder
SIMD and parallelization
  Pixel reconstruction, interpolation (partial)
  Task-level parallelism (entropy, pixel decoding)
  Data-level parallelism (deblocking filter)
2.934.98
2.28Mbps
Conclusion
Overview of HEVC
Encoding parameters for HEVC test model (HM)
Complexity analysis of HEVC encoder
Fast encoding algorithms and performances
Issues of parallel processing