Aalto University publication series
DOCTORAL DISSERTATIONS 195/2016

Microphone array processing for parametric spatial audio techniques

Archontis Politis

A doctoral dissertation completed for the degree of Doctor of Science (Technology), to be defended, with the permission of the Aalto University School of Electrical Engineering, at a public examination held at the lecture hall S4 of the school on 4 November 2016 at noon.

Aalto University
School of Electrical Engineering
Department of Signal Processing and Acoustics
Communication Acoustics

Supervising professor
Professor Ville Pulkki, Aalto University, Finland

Thesis advisor
Professor Ville Pulkki, Aalto University, Finland

Preliminary examiners
Doctor Franz Zotter, University of Music and Performing Arts, Graz, Austria
Doctor Nils Peters, Qualcomm, San Diego, CA, USA

Opponent
Doctor Gary Elko, mh acoustics, New Jersey, USA

Aalto University publication series
DOCTORAL DISSERTATIONS 195/2016

© Archontis Politis

ISBN 978-952-60-7038-4 (printed)
ISBN 978-952-60-7037-7 (pdf)
ISSN-L 1799-4934
ISSN 1799-4934 (printed)
ISSN 1799-4942 (pdf)

http://urn.fi/URN:ISBN:978-952-60-7037-7

Unigrafia Oy
Helsinki 2016
Finland
Abstract
Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi

Author Archontis Politis
Name of the doctoral dissertation Microphone array processing for parametric spatial audio techniques
Publisher School of Electrical Engineering
Unit Department of Signal Processing and Acoustics
Series Aalto University publication series DOCTORAL DISSERTATIONS 195/2016
Field of research Acoustics and Signal Processing
Manuscript submitted 19 May 2016
Date of the defence 4 November 2016
Permission to publish granted (date) 9 August 2016
Language English
Monograph / Article dissertation / Essay dissertation

Abstract
Reproduction of spatial properties of recorded sound scenes is increasingly recognised as a crucial element of all emerging immersive applications, with domestic or cinema-oriented audiovisual reproduction for entertainment, telepresence and immersive teleconferencing, and augmented and virtual reality being key examples. Such applications benefit from a general spatial audio processing framework, being able to exploit spatial information from a variety of recording formats in order to reproduce the original sound scene in a perceptually transparent way. Directional Audio Coding (DirAC) is a recent parametric spatial sound reproduction method that fulfils many of the requirements of such a framework. It is based on a universal 3D audio format known as B-format and achieves flexible and effective perceptual reproduction for loudspeakers or headphones.

Part of this work focuses on the model of DirAC and aims to extend it. Firstly, it is shown that by taking into account information of the four-channel recording array that generates the B-format signals, it is possible to improve both analysis of the sound scene and reproduction. Secondly, these findings are generalised for various recording configurations.

A further generalisation of DirAC is attempted in a spatial transform domain, the spherical harmonic domain (SHD), with higher-order B-format signals. Formulating the DirAC model in the SHD combines the perceptual effectiveness of DirAC with the increased resolution of higher-order B-format and overcomes most limitations of traditional DirAC.

Some novel applications of parametric processing of spatial sound are demonstrated for sound and music engineering. The first shows the potential of modifying the spatial information in the recording for creative manipulation of sound scenes, while the second shows improvement of music reproduction captured with established surround recording techniques.

The effectiveness of parametric techniques in conveying distance and externalisation cues over headphones led to research in controlling the perceived distance using loudspeakers in a room. This is achieved by manipulating the direct-to-reverberant energy ratio using a compact loudspeaker array with a variable directivity pattern.

Lastly, apart from reproduction of recorded sound scenes, auralisation of the spatial properties of acoustical spaces is of interest. We demonstrate that this problem is well-suited to parametric spatial analysis. The nature of room impulse responses captured with a large microphone array allows very high-resolution approaches, and such approaches for detection and localisation of multiple reflections in a single short observation window are applied and compared.

Keywords spatial audio, microphone arrays, sound field analysis, sound reproduction

ISBN (printed) 978-952-60-7038-4
ISBN (pdf) 978-952-60-7037-7
ISSN-L 1799-4934
ISSN (printed) 1799-4934
ISSN (pdf) 1799-4942
Location of publisher Helsinki
Location of printing Helsinki
Year 2016
Pages 215
urn http://urn.fi/URN:ISBN:978-952-60-7037-7
Preface
This thesis work was carried out at the Department of Signal Process-
ing and Acoustics, School of Electrical Engineering, Aalto University in
Espoo, Finland. The research leading to these results has received fund-
ing from the European Research Council under the European Community
Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement
no [240453].
First and foremost, I would like to express my gratitude to my supervisor Prof. Ville Pulkki, for his insight into the research pertaining to this thesis
and his endless patience with my constant diversions. His unconventional
approach to attacking research problems has been very inspiring, as were
his dance moves.
I am also thankful to the pre-examiners of the thesis manuscript, Dr.
Franz Zotter and Dr. Nils Peters, and their prompt reviews during sum-
mertime. I sincerely hope it was not an ordeal.
The Communication Acoustics research group, to which I belonged, made
our workplace a very special and stimulating environment. My gratitude
extends to all the people who passed through it during my stay. I would
like to thank especially Dr. Mikko-Ville Laitinen, Tapani Pihlajamäki,
Dr. Juha Vilkamo, Symeon Delikaris-Manias, Alessandro Altoè, Javier
Gómez Bolaños, Ilkka Huhtakallio, Dr. Olli Santala, Dr. Marko Takanen,
Dr. Jukka Ahonen, Dr. Marko Hiipakka, Dr. Catarina Hiipakka, Olli
Rummukainen, Teemu Koski, Julia Turku, Juhani Paasonen, and Henri
Pöntynen.
The audio research community at Aalto University is large and strong
and the amount of knowledge and enthusiasm on everything audio- and
acoustics-related is staggering. I have learned a lot from my friends and
collaborators Dr. Sakari Tervo, Dr. Roberto Pugliese, and Dr. Julian
Parker. Many thanks also to Dr. Hannes Gamper, Dr. Stefano D’Angelo,
Sofoklis Kakouros, Fabian Esqueda, Sami Oksanen, and all the rest of the
great people I met from the Audio Signal Processing, Virtual Acoustics,
and Speech Technology groups.
I would additionally like to thank my hosts, Prof. Ramani Duraiswami
and Dr. Dmitry Zotkin, for their hospitality during my research visit at
UMIACS, University of Maryland, MD, USA.
My parents and my sister supported me all these years and for that I
am very grateful to them, especially considering that they believe I am
still some kind of a student (and they are not totally wrong). And lastly,
I am most grateful to my partner Henna Tahvanainen who stood by my
side throughout this work. Her presence helped me through all the ebbs
and flows of research, and made this thesis a reality.
Helsinki, September 17, 2016,
Archontis Politis
Contents

Preface
Contents
List of Publications
Author’s Contribution
List of Figures
1. Introduction
2. Physical Modelling of Spatial Sound Scenes and Array Processing
2.1 Physical properties of sound fields
2.1.1 Statistical properties of sound field quantities
2.1.2 Diffuse sound field
2.1.3 Microphones and array processing
2.1.4 DoA estimation and acoustic localisation in complex scenes
2.1.5 Spherical harmonic representation of sound field quantities and acoustical spherical processing
3. Perception and Technology of Spatial Sound
3.1 Spatial sound perception
3.1.1 Time-frequency resolution
3.1.2 Localisation cues
3.1.3 Dynamic binaural cues
3.1.4 Distance cues
3.1.5 Precedence effect and summing localisation
3.1.6 Apparent source width, spaciousness and interaural coherence
3.2 Spatial sound technologies
3.2.1 Binaural technology
3.2.2 Loudspeaker-based reproduction
3.2.3 Spatial sound recording
3.3 Evaluation of spatial sound reproduction
3.3.1 Technical evaluation
3.3.2 Listening tests
4. Non-parametric Methods for Spatial Sound Recording and Reproduction
4.1 Ambisonics
4.1.1 Recording and encoding
4.1.2 Sound-scene manipulation
4.1.3 Decoding and panning
4.1.4 Binaural decoding
4.1.5 Discussion
4.2 Other non-parametric approaches
5. Parametric Methods for Spatial Sound Recording and Reproduction
5.1 Spatial audio coding and upmixing
5.1.1 Spatial audio coding
5.1.2 Upmixing
5.1.3 Binaural reproduction
5.1.4 Discussion
5.2 Parametric reproduction of recorded sound scenes
5.2.1 Assumptions on array input and output setup properties
5.2.2 Assumptions on the sound-field model and estimation of parameters
5.3 Directional Audio Coding
5.3.1 Analysis
5.3.2 Synthesis
5.3.3 Sound scene synthesis and manipulation
5.3.4 Binauralisation, transcoding and upmixing
5.3.5 Discussion
5.4 Analysis and synthesis of room acoustics
6. Contributions
6.1 Perceptual distance control
6.2 Improving parametric reproduction of B-format recordings using array information
6.3 Extending high-frequency estimation of DoA and diffuseness
6.4 Parametric sound-field manipulation
6.5 Enhancement of spaced microphone array recordings
6.6 Higher-order DirAC
6.7 Analysis of spatial room impulse responses
References
Publications
List of Publications
This thesis consists of an overview and of the following publications which
are referred to in the text by their Roman numerals.
I Archontis Politis, Ville Pulkki. Broadband Analysis and Synthesis
for Directional Audio Coding using A-format Input Signals. In 131st
Convention of the Audio Engineering Society, New York, NY, USA,
October 2011.
II Archontis Politis, Tapani Pihlajamäki, Ville Pulkki. Parametric Spa-
tial Audio Effects. In 15th International Conference on Digital Audio
Effects (DAFx-12), York, UK, September 2012.
III Archontis Politis, Mikko-Ville Laitinen, Jukka Ahonen, Ville Pulkki.
Parametric Spatial Audio Processing of Spaced Microphone Array
Recordings for Multichannel Reproduction. Journal of the Audio En-
gineering Society, 63, 4, 216–227, April 2015.
IV Archontis Politis, Symeon Delikaris-Manias, Ville Pulkki. Direction-
of-Arrival and Diffuseness Estimation Above Spatial Aliasing for
Symmetrical Directional Microphone Arrays. In IEEE International
Conference on Audio, Speech and Signal Processing, Brisbane, Aus-
tralia, April 2015.
V Archontis Politis, Juha Vilkamo, Ville Pulkki. Sector-Based Para-
metric Sound Field Reproduction in the Spherical Harmonic Do-
main. IEEE Journal of Selected Topics in Signal Processing, 9, 5,
852–866, August 2015.
VI Mikko-Ville Laitinen, Archontis Politis, Ilkka Huhtakallio, Ville Pulkki.
Controlling the Perceived Distance of an Auditory Object by Manip-
ulation of Loudspeaker Directivity. JASA Express Letters, 137, 6,
EL462–EL468, June 2015.
VII Sakari Tervo, Archontis Politis. Direction of Arrival Estimation of
Reflections from Room Impulse Responses Using a Spherical Micro-
phone Array. IEEE/ACM Transactions on Audio, Speech, and Lan-
guage Processing, 23, 10, 1539–1551, October 2015.
Author’s Contribution
Publication I: “Broadband Analysis and Synthesis for Directional Audio Coding using A-format Input Signals”
The author developed the methods, conducted the measurements and
wrote the article, under the supervision of the second author.
Publication II: “Parametric Spatial Audio Effects”
The present author developed and implemented the methods related to di-
rectional and reverberance modifications of recorded sound scenes, apart
from the acoustical zooming and translation effects. The remainder of
the methods were conceived and implemented by the second author. The
present author wrote the article, except section 3.3 which was written by
the second author.
Publication III: “Parametric Spatial Audio Processing of Spaced Microphone Array Recordings for Multichannel Reproduction”
The present author implemented the methods in the paper, conducted the
objective evaluation and wrote most of the article. The analysis formula-
tion was developed by the present author, while the synthesis formulation
was developed by the second author. Additionally, the second author de-
signed the listening test and analyzed the results.
Publication IV: “Direction-of-Arrival and Diffuseness Estimation Above Spatial Aliasing for Symmetrical Directional Microphone Arrays”
The present author developed and implemented the methods presented in
the paper and wrote most of the article. The evaluation of the method and
presentation of results was shared between the present and the second
author.
Publication V: “Sector-Based Parametric Sound Field Reproduction in the Spherical Harmonic Domain”
The present author developed the methods of the paper with the exception
of the optimal mixing stage of the synthesis, which is based on previous
work by the second author. The acoustic analysis in a sector was originally
conceived by the third author of the paper, who also supervised the work.
The implementation, listening tests, and evaluation responsibilities were
shared between the present and second author. The present author wrote
most of the work, with the exception of sections V and III.E.
Publication VI: “Controlling the Perceived Distance of an Auditory Object by Manipulation of Loudspeaker Directivity”
The present author designed the beamforming operations for control of
the directivity pattern and formulated the theoretical model followed through-
out the work. The writing work was shared between the first and the
present author.
Publication VII: “Direction of Arrival Estimation of Reflections from Room Impulse Responses Using a Spherical Microphone Array”
The author formulated the measurement-based array response interpo-
lation and transform. The measurements were shared between the first
and the present author.
List of Figures

2.1 Directional patterns for a) common directional microphones, b) omnidirectional microphone mounted on a hard cylinder of 5 cm radius, and c) omnidirectional microphone mounted on a sphere of 5 cm radius.

2.2 Spherical harmonic functions up to order N = 3 of (a) real form, and (b) complex form. The surfaces correspond to the magnitude of the SHs. Blue and red indicate positive and negative values for real SHs, while the colour map indicates the phase for the complex SHs.

3.1 (a) Measured head-related impulse responses of the author in the horizontal plane for the left ear, and (b) extracted and smoothed ITD surfaces from the measured responses as seen from above.

3.2 Amplitude panning patterns for a 7.0 loudspeaker setup, using a) VBAP, b) VBIP, c) MDAP with a spread of 45 degrees, and d) ambisonic panning with naive mode-matching decoding. The black dashed line indicates the energy preservation capabilities of the panning curves as the sum of the squared gains for each direction.

3.3 WFS sound pressure rendering of a virtual point source (red cross) using a linear array of 21 secondary sources (black circles), with an inter-source spacing of 15 cm. Two frequencies are plotted: a) below the spatial aliasing limit and b) above the spatial aliasing limit.

3.4 Main surround array designs (first row) and rear surround array designs (second row) for surround music recording. The dimensions are indicative and adapted from Theile (2001).

3.5 Generation of the reference and rendering through room simulation for the evaluation of perceived distance from the reference.

4.1 General non-parametric mapping of microphone array signals to the reproduction setup. The bold arrows indicate signal flow while the dashed ones indicate other parameters.

4.2 Schematic of ambisonic encoding and decoding. The bold arrows indicate signal flow, while the dashed ones indicate other parameters.

4.3 Depiction of major ambisonic transformations of a sound scene. In (c) directional compression is shown with yellow and expansion with light blue. In (d) the modified boundary curve shows the directional energy modification compared to the original sound scene.

5.1 Block diagrams of basic signal and parameters flow in SAC and upmixing examples. Signal flow is indicated with solid arrows, and parameter flow with dashed arrows. The TFT/FB blocks indicate time-frequency transform or filter bank operations. In SAC, inter-channel level differences (ICLD), phase differences (ICPD), and the inter-channel coherence (ICC) are computed and transmitted. In upmixing these channel dependencies are used to estimate parameters for a primary–ambient (or direct–diffuse) decomposition.

5.2 Block diagrams of exemplary implementations for parametric reproduction based on a) spatial filtering/beamforming and b) blind source separation. Signal flow is indicated with solid arrows and parameter flow with dashed arrows. The TFT/FB blocks indicate time-frequency transform or filter bank operations. In (a) a number of dominant sources and their directions are estimated and extracted by adaptive beamforming, while in (b) independent components are detected in the mixture and clustered into meaningful source signals based on information of the array response.

5.3 Block diagram of DirAC analysis. Signal flow is indicated with solid arrows and parameter flow with dashed arrows. The TFT/FB blocks indicate time-frequency transform or filter bank operations. The notation of signals and parameters follows the one in the text.

5.4 Block diagram of VM-DirAC synthesis. Signal flow is indicated with solid arrows and parameter flow with dashed arrows. The TFT/FB blocks indicate time-frequency transform or filter bank operations. The notation of signals and parameters follows the one in the text.
List of Abbreviations
ASW apparent source width
BCC Binaural Cue Coding
BSS blind source separation
CSD cross-spectral density
DDR direct-to-diffuse ratio
DirAC Directional Audio Coding
DoA direction-of-arrival
DRR direct-to-reverberant ratio
ERB equivalent rectangular bandwidth
HOA higher-order Ambisonics
HRTF head-related transfer function
IACC interaural cross correlation
IC interaural coherence
ICA independent component analysis
ICC inter-channel coherence
ICLD inter-channel level difference
ICPD inter-channel phase difference
ILD interaural level difference
ISHT inverse spherical harmonic transform
ITD interaural time difference
LEV listener envelopment
MDAP multiple direction amplitude panning
ML maximum likelihood
MPEG Moving Pictures Expert Group
MUSHRA multi-stimulus with hidden reference and anchor test
NMF non-negative matrix factorization
PCA principal component analysis
PS Parametric Stereo
PSD power-spectral density
QMF quadrature mirror filter bank
S3AC Spatially Squeezed Surround Audio Coding
SAC Spatial Audio Coding
SASC Spatial Audio Scene Coding
SDM Spatial Decomposition Method
SH spherical harmonic
SHD spherical harmonic domain
SHT spherical harmonic transform
SIRR Spatial Impulse Response Rendering
SNR signal-to-noise ratio
SRP steered response power
TDoA time-difference of arrival
VBAP vector-base amplitude panning
VBIP vector-base intensity panning
WFS wave field synthesis
WNG white noise gain
List of Symbols
a directional wave-amplitude density
a vector of sound field coefficients in the SHD
bn modified radial sound field expansion terms due to sensor directivity or scatterer
b pressure and pressure-gradient (B-format) signal vector
c speed of sound
d directional response of single sensor
D decorrelation matrix
f frequency
g vector of loudspeaker amplitude panning gains
h directional response of array channel at the origin
h array response (steering) vector
H matrix of array response vectors at multiple directions
i sound intensity
I identity matrix
jn radial sound field expansion terms (spherical Bessel functions)
k wave number
M mixing matrix
p sound pressure
Pn Legendre polynomial of degree n
Pnm associated Legendre functions of degree n and order m
Q directivity factor of microphone or beamformer
r coordinate vector
s vector of output signals for reproduction setup
Sxy cross-spectral density of x and y
t time
u sound particle velocity
v pressure-gradient signal vector
w beamforming weight vector
x vector of recorded signals
y vector of spherical harmonics
Ynm spherical harmonic of order n and degree m
Y matrix of SHs at multiple directions
Z0 characteristic acoustic impedance of air
α angle between two directions
β reactivity index
γ unit vector in the direction of (θ, φ)
Γ direct-to-diffuse ratio
Γdiff diffuse field coherence matrix of microphone array
Δf frequency bandwidth
θ inclination (or polar) angle
κ directivity coefficient of directional microphone capsules
λ regularization term for least-squares inversion
μ recursive smoothing coefficient
ρ ambient density of air
τμ recursive smoothing time constant
φ azimuthal angle
ψ diffuseness index
ω angular frequency
1. Introduction
Spatial sound technologies, the methods to capture, transmit, and repro-
duce the spatial aspects of a sound scene, were in a sense fully devel-
oped with the invention of stereophony, in 1931, by Alan Blumlein in his
famous patent. Stereophony opened up a new listening dimension and
constituted a huge leap in listening experience compared to the previous
monophonic sound transmission. To this day, a comparably large leap in the spatial qualities of sound production has not occurred.
Stereophony impacted all aspects of sound production and added spatial
sound technologies to the field of audio engineering. More importantly,
these improvements were achieved in a highly efficient way, using the
minimum number of two channels and bringing a well-defined and effec-
tive set of tools that made the most of this pair of channels for recording,
transmission and playback of spatial sound content.
Production of content in stereophony lies traditionally somewhere between engineering and a creative craft. Music producers rarely surrender their work to purely automated procedures; every production step goes
through a listening evaluation loop, until the result satisfies its creator.
In this sense, creating the spatial aspects of audio content is a purely
creative process; there are no restrictions on what can be done by the pro-
ducer or the tonmeister, and the result does not necessarily correspond
to the original sound scene. So far, no automated or computational pro-
cedure is able to beat the results of such creative tuning of the content.
The above process also encompasses the art of recording; good recording
engineers will do their own evaluation and they will deliver to the mix-
ing engineer, or tonmeister, subjectively well-balanced components of the
sound scene to be then mixed creatively.
On the other hand, music production is also an engineering task, some-
thing that needs to be achieved in a short time span and for which the
above creative process is not always entirely possible. The produced con-
tent should ideally sound at least plausible and natural to the ears of the
listener, and a good performance should in any case be captured in a way that preserves its qualities. Hence, along with and in aid of the creative process, various engineering techniques were developed for effective recording, modification, and reproduction of spatial sound recordings.
The major issue of the stereophonic approach is its scalability. What is
possible with the two channels of stereophony becomes much more convo-
luted in a surround system of five or more channels. Recording the sound
scene with more than two microphones and aiming to preserve its spa-
tial properties becomes a daunting task. That became apparent during
the short life of quadraphony in the 1970s, which was a first attempt to
go one step beyond stereophony without success. However, even though
a failed attempt, quadraphony revealed the problem of handling the spa-
tial sound information effectively and in a rational manner. Going one
step further, certain researchers from that time envisioned already more
holistic methods beyond a fixed discrete number of channels that could
potentially deliver a large leap in experience from stereo.
Today’s emerging technologies pose additional challenges to spatial sound
production which the present discrete multichannel technologies hardly
meet. The experience leap from stereo to surround has failed to reach a
wide audience, mainly due to the restrictive requirements of a fixed layout
that becomes harder to align for more than two channels and that would
rarely work effectively in a domestic setup. We can envision that in the smart home of the future, where the location of the user may be tracked, accommodating and aligning a wide range of layouts should be possible.
ers and smart phones, headphone audio has overshadowed loudspeaker
listening in every day life, making headphone and binaural audio produc-
tion necessary. Finally, as content generation nowadays is shared between
the professionals of the past and everyday creators, due to wider access to
affordable tools, it is a matter of time before a large section of hobbyists
and creators require versatile tools that deliver high spatial fidelity with
minimal tuning.
The above processes consider primarily aesthetically critical applica-
tions, such as music production, where there are resources to involve cre-
ative tuning. There are other applications of spatial sound where this is
either not possible or not desired. In telepresence applications, for exam-
ple, quality requirements are not as stringent as for music production.
However, realism, in terms of spatial properties of the transmitted sound
scene, is a desired requirement. Similarly, virtual and augmented audio
reality require a seamless and effective way to combine automatically real
recordings with interactive synthetic sound scenes. Auralisation of acous-
tic spaces, the process of making audible the acoustic characteristics of ar-
chitectural or natural spaces of interest, also demands reproduction that
is as realistic as possible.
All these recent emerging needs and applications indicate that a mod-
ern spatial audio processing pipeline should be able to a) separate the
recording process from the reproduction setup, b) use whatever spatial information exists in the recordings, with some knowledge of the recording method or setup but without being bound to any, and c) distribute
the sound in the best way possible to loudspeakers or headphones, inde-
pendently of their arrangement. This introduction focuses on past and
present approaches to achieve this task. A distinction is made through-
out the thesis between non-parametric and parametric methods, with the
former considering only the properties of the recording and reproduction
setup and the latter considering additionally relationships between the
recorded signals.
The research leading to this thesis belongs to the second class, the para-
metric methods, with the aim of presenting practical strategies for achiev-
ing both high perceptual quality and realism in a single spatial sound
processing framework. More specifically, the research focuses on both
recording and reproduction rather than producing solely synthetic sound
scenes. The issues that are tackled are those of meaningful analysis of the
recordings, flexible handling of various recording and input setups, practi-
cal applications to sound engineering, and auralisation of room acoustics.
Furthermore, even though the techniques that are presented rely on cer-
tain design choices, we believe that the outcomes can be useful for most
parametric spatial audio processing methods.
The thesis is organised as follows. The second section presents briefly
the physical foundations of sound field modelling, acoustic analysis, and
recording and array processing that are relevant to the research of this
thesis. The third section outlines the most important psychoacoustical
aspects of spatial sound perception and the key technological components
of spatial sound reproduction. In the fourth and fifth sections, the princi-
ples, the strengths, and the limitations of non-parametric and parametric
methods are outlined. Finally, the sixth section presents a summary of the main contributions of this thesis.
2. Physical Modelling of Spatial Sound Scenes and Array Processing
In this section we present the physical fundamentals of how a complex
sound scene is modelled and captured. The primary quantities of interest
are introduced, and a brief summary of the associated estimation and
processing is presented.
2.1 Physical properties of sound fields
Complex sound scenes comprise multiple physical sound sources and their
acoustic interaction with the surrounding environment, which together result in a
certain sound field. In spatial sound recording and reproduction problems,
we commonly aim to capture or reproduce sound fields in a source-free
region, with propagation dictated by the homogeneous wave equation

(∇² − (1/c²) ∂²/∂t²) p(r, t) = 0,  (2.1)
where ∇2 is the Laplacian, c is the speed of sound, and p is the sound
pressure at point r and at time t. In its frequency domain counterpart,
the sound field satisfies the Helmholtz equation

(∇² + k²) p(r, ω) = 0,  (2.2)
where k = ω/c is the wave number at angular frequency ω. We can further
assume that the region of interest to capture or reproduce sound is in the
far field of sources, an assumption that is usually fulfilled in practice. In
this case, any sound field can be expressed as a continuous superposition
of plane waves coming from all directions with a plane-wave amplitude
density a(γ). The unit vector γ denotes the direction-of-arrival (DoA) of
the plane wave at inclination θ and azimuth φ, with components
γ = [sin θ cos φ, sin θ sin φ, cos θ]ᵀ.  (2.3)
The sound pressure due to a single plane wave incident from γ at point r
is the exponential function

p(r, γ, ω) = a(γ, ω) e^{jkγ·r},  (2.4)

with j² = −1 the imaginary unit, while the total field pressure at the same
point is

p(r, ω) = ∫_{γ∈S²} a(γ, ω) e^{jkγ·r} dγ,  (2.5)

integrated over the surface of the unit sphere S²,

∫_{γ∈S²} dγ = ∫₀^π ∫₋π^π sin θ dφ dθ.  (2.6)
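The plane-wave relations (2.3)–(2.4) are straightforward to evaluate numerically. A minimal sketch in Python/NumPy (the function names and example values here are illustrative, not from the thesis):

```python
import numpy as np

def doa_unit_vector(theta, phi):
    """Unit DoA vector of (2.3) for inclination theta and azimuth phi (radians)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def plane_wave_pressure(a, gamma, r, k):
    """Pressure of (2.4): a * exp(j k gamma . r), for amplitude a, DoA gamma,
    field point r, and wavenumber k = omega / c."""
    return a * np.exp(1j * k * np.dot(gamma, r))

# Example: a unit-amplitude 1 kHz plane wave from theta = 90 deg, phi = 0,
# observed 10 cm from the origin, assuming c = 343 m/s
c, f = 343.0, 1000.0
k = 2 * np.pi * f / c
gamma = doa_unit_vector(np.pi / 2, 0.0)
p = plane_wave_pressure(1.0, gamma, r=np.array([0.1, 0.0, 0.0]), k=k)
# only the phase varies with position; the magnitude stays at unity
```
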
The same amplitude density results in a certain acoustic particle velocity
at any point r. Due to the plane-wave assumption, the velocity u for a
single direction of incidence γ at point r is

u(r, γ, ω) = −(1/Z₀) a(γ, ω) e^{jkγ·r} γ,  (2.7)

and the total field velocity at that point is

u(r, ω) = −(1/Z₀) ∫ a(γ, ω) e^{jkγ·r} γ dγ,  (2.8)
where Z0 = cρ0 is the characteristic acoustic impedance of air and ρ0 is
the ambient air density.
Apart from the pressure and velocity, the energetic properties of the
sound field are of interest. The energy flow per unit surface at the field
point r is given by the acoustic intensity vector, which for a single-frequency
(monochromatic) field is given by

i(r, ω) = ½ p(r, ω) u*(r, ω).  (2.9)

The real part of this complex acoustic intensity, i_a = ℜ{i}, is termed the active
intensity, and it expresses the propagating part of the energy flow, while
the imaginary part, i_r = ℑ{i}, is called the reactive intensity, and it expresses
the energy that locally oscillates at the field point with zero net energy
transfer (Fahy, 2002; Jacobsen, 2007). The total energy density at the
field point is

E(r, ω) = (1/(4cZ₀)) [Z₀² ||u(r, ω)||² + |p(r, ω)|²],  (2.10)
where || · || is the l2-norm of a vector and | · | the magnitude of a scalar.
It is possible to define various sound field indicators (Jacobsen, 1990)
based on the above energetic quantities, useful for various tasks such as
source power measurements, intensity measurements and noise identifi-
cation, and sound field characterisation. Sound field characterisation (Ja-
cobsen, 1990, 1989; Gauthier et al., 2014; Scharrer and Vorlander, 2013)
refers to distinguishing between different conditions that result in a cer-
tain sound field. These conditions may be, for instance, the presence of
multiple sources, standing waves or strong reactive components, and re-
verberant behaviour. A sound field indicator of special interest in this
work is the reactivity index (Jacobsen, 1990, 1989; Vigran, 1988; Schiffrer
and Stanzial, 1994)
β(r, ω) = ||i_a(r, ω)|| / (c E(r, ω)).  (2.11)
In a basic propagating field, such as that of a single plane wave or a
monopole source in the far field, this quantity is unity since ||ia|| = cE.
This indicator evidently vanishes in a purely reactive field, such as in the
presence of a standing wave or a strong evanescent wave very close to a
vibrating surface, because the active intensity approaches zero.
2.1.1 Statistical properties of sound field quantities
In practice, and for spatial sound processing, steady-state pure-tone fields
are of limited interest. It is of greater practical importance to consider a
statistical description of field quantities. Such a statistical model of a
sound scene can be expressed by the sound amplitude density of (2.4),
but now seen as a random variable with certain spatial and temporal cor-
relations that depend on the acoustic components of the scene. Spatial
statistics can be defined between the signals carried by two plane waves
from directions γ and γ ′ as
S_{γγ′}(ω) = E[a(ω, γ) a*(ω, γ′)],  (2.12)
where E [·] denotes statistical expectation. The power-spectral densities
(PSDs) of pressure and velocity are then
S_pp(ω) = E[|p(ω)|²] = E[∫ a(γ, ω) dγ ∫ a*(γ′, ω) dγ′] = ∫∫ S_{γγ′}(ω) dγ dγ′  (2.13)
and
S_uu(ω) = E[||u(ω)||²] = (1/Z₀²) E[(∫ a(γ, ω) γ dγ) · (∫ a*(γ′, ω) γ′ dγ′)]
        = (1/Z₀²) ∫∫ S_{γγ′}(ω) γ · γ′ dγ dγ′
        = (1/Z₀²) ∫∫ S_{γγ′}(ω) cos α dγ dγ′,  (2.14)
where α is the angle between directions γ and γ ′. It is also useful to define
a vector cross-spectral density (CSD) between pressure and the velocity
components as
s_pu = E[p(ω) u*(ω)] = −(1/Z₀) ∫∫ S_{γγ′}(ω) γ′ dγ dγ′.  (2.15)
Now it is possible to reformulate the active intensity, energy density,
and reactivity index of (2.9–2.11) in terms of pressure and velocity power
and cross-spectra as
î_a(ω) = ½ ℜ{s_pu(ω)},  (2.16)

Ê(ω) = (1/(4cZ₀)) [Z₀² S_uu(ω) + S_pp(ω)],  (2.17)

and

β̂(ω) = ||î_a(ω)|| / (c Ê(ω)) = 2Z₀ ||ℜ{s_pu(ω)}|| / (Z₀² S_uu(ω) + S_pp(ω)),  (2.18)

where the hat (ˆ) denotes the statistical version of the energetic quantities.
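Estimating (2.16)–(2.18) from data amounts to replacing the expectations with averages over short-time spectral frames. A sketch of this, assuming c = 343 m/s and ρ₀ ≈ 1.2 kg/m³, with illustrative function and variable names:

```python
import numpy as np

c, rho0 = 343.0, 1.2          # assumed speed of sound and air density
Z0 = c * rho0                 # characteristic impedance of air

def energetic_quantities(p, u):
    """Estimate i_a, E and beta of (2.16)-(2.18) from T complex spectral
    snapshots at one frequency: p has shape (T,), u has shape (T, 3).
    Expectations are replaced by averages over the T snapshots."""
    Spp = np.mean(np.abs(p) ** 2)                      # (2.13), estimated
    Suu = np.mean(np.sum(np.abs(u) ** 2, axis=1))      # (2.14), estimated
    spu = np.mean(p[:, None] * np.conj(u), axis=0)     # (2.15), estimated
    ia = 0.5 * np.real(spu)                            # (2.16)
    E = (Z0 ** 2 * Suu + Spp) / (4 * c * Z0)           # (2.17)
    beta = 2 * Z0 * np.linalg.norm(np.real(spu)) / (Z0 ** 2 * Suu + Spp)  # (2.18)
    return ia, E, beta

# Sanity check: a single plane wave, p = a and u = -(a / Z0) gamma per (2.7),
# gives beta = 1 regardless of the signal statistics
rng = np.random.default_rng(0)
a = rng.normal(size=256) + 1j * rng.normal(size=256)
gamma = np.array([0.0, 1.0, 0.0])
ia, E, beta = energetic_quantities(a, -(1.0 / Z0) * a[:, None] * gamma)
```

Note that the estimated intensity points opposite to the DoA, in line with the sign convention of (2.7).
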
2.1.2 Diffuse sound field
Apart from indicating reactive conditions in a pure-tone field, which is of
limited practical importance for spatial sound processing, the latter in-
dex β acquires a more useful and intuitive meaning in its statistical ver-
sion. In the case of a single plane wave carrying a random signal, β will
still equal unity, although the index now will be less than one in the case
of multiple uncorrelated sounds incident from various directions. This
makes it an indicator of how closely the sound field resembles diffuse con-
ditions, where a diffuse field results from multiple plane waves incident
from any direction with equal probability and random amplitudes. When
the incident sound power for each direction is constant, the diffuse field
is termed isotropic (Jacobsen, 1979; Del Galdo et al., 2012). The isotropic
diffuse field approximates the reverberant conditions in common rooms
quite well, and hence it is an important model for various applications,
such as adaptive beamforming, dereverberation and speech enhancement
(Benesty et al., 2008; Brandstein and Ward, 2013), spatial audio coding
(Faller, 2008; Pulkki, 2007), and inference of the microphone array geom-
etry (McCowan et al., 2008), among others.
Seeing what happens to the previously defined quantities in the case
of an isotropic diffuse field is instructive. Firstly, due to the isotropy, we
have
S_{γγ′}(ω) = E[a(ω, γ) a*(ω, γ′)] = (σ²_df(ω)/(4π)) δ_{γ−γ′},  (2.19)
where σ2df is the total power (or variance) of the diffuse field and δγ−γ′ is
an angular delta function. The pressure and velocity PSDs are then
S_pp(ω) = ∫∫ S_{γγ′}(ω) dγ dγ′ = σ²_df(ω)  (2.20)

and

S_uu(ω) = (1/Z₀²) ∫∫ S_{γγ′}(ω) cos α dγ dγ′ = (1/Z₀²) σ²_df(ω),  (2.21)
where cosα reduces to unity due to the delta function. On the other hand,
the CSD between pressure and velocity is
s_pu = −(1/(4πZ₀)) σ²_df(ω) ∫ γ dγ = 0,  (2.22)
meaning that in a purely diffuse field the velocity and pressure at a cer-
tain point have zero correlation. Plugging these results into the energetic
quantities of interest, we have
î_a(ω) = ½ ℜ{s_pu(ω)} = 0,  (2.23)

Ê(ω) = (1/(4cZ₀)) [Z₀² S_uu(ω) + S_pp(ω)] = (1/(2cZ₀)) σ²_df(ω),  (2.24)

β̂(ω) = 2Z₀ ||ℜ{s_pu(ω)}|| / (Z₀² S_uu(ω) + S_pp(ω)) = 0.  (2.25)
The last relation is the basis for a diffuseness index, which is defined for
the remainder of this thesis as
ψ(ω) = 1 − β̂(ω).  (2.26)
The diffuseness of (2.26) is unity in a purely diffuse field and becomes zero
for a single plane wave. This diffuseness index has been used extensively
in a variety of spatial audio processing tasks (Merimaa and Pulkki, 2005;
Pulkki, 2007). Additionally, in a simple model of a mixture of a single
plane wave and an ideal diffuse field, diffuseness is directly related to the
direct-to-diffuse power ratio (DDR), and one can be used interchangeably
with the other. The DDR is commonly estimated in speech enhancement
applications based on adaptive filtering, as it can be used to construct
a Wiener filter to suppress diffuse reverberant sound and enhance the
signal (Simmer et al., 2001). It is simply given by

Γ(ω) = σ²_pw(ω) / σ²_df(ω),  (2.27)

where σ²_pw is the PSD of the plane-wave signal. Expressing the ideal
diffuseness for the same mixture as ψ = σ²_df / (σ²_df + σ²_pw) gives the relation

ψ(ω) = 1 / (1 + Γ(ω)).  (2.28)
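The mapping of (2.28) between diffuseness and DDR is a one-liner in either direction; a small sketch with illustrative function names:

```python
def diffuseness_from_ddr(ddr):
    """Relation (2.28): psi = 1 / (1 + Gamma)."""
    return 1.0 / (1.0 + ddr)

def ddr_from_diffuseness(psi):
    """Inverse of (2.28): Gamma = (1 - psi) / psi."""
    return (1.0 - psi) / psi

# Gamma = 1 (equal direct and diffuse power) gives psi = 0.5; a dominant
# plane wave (large Gamma) drives psi towards 0, while a dominant diffuse
# field (Gamma -> 0) drives psi towards 1.
psi_equal = diffuseness_from_ddr(1.0)
```

In the speech-enhancement context cited above, a single-channel Wiener gain for suppressing the diffuse part would then take the form Γ/(1 + Γ) = 1 − ψ.
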
The diffuseness of (2.26) is not the only relation that can give a measure
of how close to diffuse the sound field is (Abdou and Guy, 1994; Del Galdo
et al., 2012). It is possible to construct a diffuseness index based on de-
viations from the ideal correlations of spaced or directional microphone
signals in a diffuse field (Bodlund, 1976; Thiergart et al., 2012; Gover
et al., 2002; Gauthier et al., 2014; Jarrett et al., 2012) or on deviations of
the mean squared pressure at different points (Nélisse and Nicolas, 1997).
Additionally, Ahonen and Pulkki (2009) and Publication IV define diffuseness
with respect to the normalised average of short-term estimates of
intensity or DoA vectors,

ψ(ω) = √(1 − ||E[î_a(ω)]|| / E[||î_a(ω)||]),  (2.29)

where the averaged vector vanishes in a purely diffuse field, whose DoAs
are uniformly distributed. This last estimator is used extensively in this
thesis due to its efficiency and practicality. Del Galdo et al. (2012) give a
comparison between intensity- and energy-based estimators. More complex definitions
of diffuseness can also be determined by using an orthonormal decomposi-
tion of the sound field and through rank analysis of the covariance matrix
of the resulting coefficients (Kennedy et al., 2007).
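The averaging estimator (2.29) is simple enough to sketch directly; the frame counts and directions below are illustrative:

```python
import numpy as np

def diffuseness_intensity_avg(ia_frames):
    """Estimator (2.29): psi = sqrt(1 - ||E[i_a]|| / E[||i_a||]), with the
    expectations replaced by averages over short-term active-intensity
    vectors. ia_frames has shape (T, 3)."""
    norm_of_mean = np.linalg.norm(np.mean(ia_frames, axis=0))
    mean_of_norms = np.mean(np.linalg.norm(ia_frames, axis=1))
    return np.sqrt(max(0.0, 1.0 - norm_of_mean / mean_of_norms))

rng = np.random.default_rng(1)
# A fixed intensity direction (single plane wave) gives psi = 0
psi_pw = diffuseness_intensity_avg(np.tile([0.0, 0.0, -1.0], (100, 1)))
# Uniformly random intensity directions (diffuse-like) give psi close to 1
dirs = rng.normal(size=(100000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
psi_df = diffuseness_intensity_avg(dirs)
```

The clamping to non-negative values guards against rounding when the field is very close to a single plane wave.
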
One must note that non-isotropic diffuse fields can be defined, where
waves are uncorrelated but the directional power exhibits a certain distri-
bution. Such more advanced models are useful, for example, to describe
late reverberation in enclosures with very irregular dimensions (Blake
and Waterhouse, 1977), or to describe the ambient noise in underwater
acoustics (Cox, 1973). Another related concept is that of a diffuse spa-
tially distributed source or reflection, such as the reflection from a rough
surface when the wavelength is less than the Rayleigh limit of the scat-
tering surface (Valaee et al., 1995). In the non-isotropic case, the diffuse-
ness will always be less than unity due to the directional concentration of
sound power.
2.1.3 Microphones and array processing
In the case that the sound scene is recorded directly with a multichannel
recording, a number of microphones are positioned in order to capture the
spatial properties of the sound scene in a way suitable for reproduction
in a target system. Microphones can be seen as sensors that spatially
sample the sound field. Assuming that we have Q microphones at positions
r_q, with q = 1, ..., Q, let us denote the microphone signal vector as
x = [x₁, ..., x_Q]ᵀ. Considering the directional sensitivity that individual
sensors may exhibit, we denote the frequency-dependent directional
response of a single sensor as d(γ, ω).
The overall directional response of the array to a unit amplitude plane
wave incident from direction γ is termed the steering vector of the array,
and it is denoted here by h(γ, ω) = [h₁(γ, ω), ..., h_Q(γ, ω)]ᵀ, where

h_q(γ, ω) = d_q(γ, ω) e^{jkγ·r_q},  (2.30)
so that, for a sound scene with amplitude density a(γ, ω), the array signals
are given by
x(ω) = ∫ h(γ, ω) a(γ, ω) dγ.  (2.31)
For an omnidirectional microphone, the signal is proportional to the acous-
tic pressure at that point: xq(ω) = p(rq, ω). In spatial sound recording,
directional microphones are often preferred since, due to their inherent
directionality, their arrangement can be tuned to directly provide spatial
cues during reproduction. Additionally, they are more appropriate for ad-
vanced parametric and non-parametric processing suitable for arbitrary
speaker setups. Directional microphones can be physically realised by a
combination of vents on the capsule, resulting in directionality that is a
combination of an omnidirectional and a pressure-gradient response
d(γ) = κ + (1 − κ) γ · γ_q,  (2.32)

where γ_q is the orientation of the capsule and γ · γ_q = cos α_q is the cosine
of the angle between the DoA and the orientation of the capsule. The coefficient
κ ∈ [0, 1] determines the portion of the omnidirectional (pressure) and the
dipole (pressure-gradient) component and depends on the construction.
Example polar plots of directional patterns for common values of κ are
shown in Fig. 2.1.
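The first-order pattern (2.32) and the free-field steering vector (2.30) combine directly. A rough numerical illustration, with a hypothetical array geometry and example values:

```python
import numpy as np

def first_order_pattern(kappa, cos_alpha):
    """d = kappa + (1 - kappa) cos(alpha) of (2.32): kappa = 1 gives an omni,
    kappa = 1/2 a cardioid, and kappa = 0 a dipole pattern."""
    return kappa + (1.0 - kappa) * cos_alpha

def steering_vector(mic_pos, mic_dirs, kappa, gamma, k):
    """Free-field steering vector (2.30), h_q = d_q(gamma) exp(j k gamma . r_q).
    mic_pos: (Q, 3) positions r_q; mic_dirs: (Q, 3) capsule orientations;
    gamma: (3,) DoA unit vector; k: wavenumber omega / c."""
    d = first_order_pattern(kappa, mic_dirs @ gamma)
    return d * np.exp(1j * k * (mic_pos @ gamma))

# Hypothetical example: four outward-facing cardioids 5 cm from the origin,
# for a 1 kHz plane wave arriving from the +x direction
pos = 0.05 * np.array([[1, 0, 0], [0, 1, 0], [-1, 0, 0], [0, -1, 0]], dtype=float)
dirs = pos / np.linalg.norm(pos, axis=1, keepdims=True)
k = 2 * np.pi * 1000.0 / 343.0
h = steering_vector(pos, dirs, 0.5, np.array([1.0, 0.0, 0.0]), k)
# |h[0]| = 1 (capsule facing the wave), |h[2]| = 0 (rear capsule at its null)
```
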
Recently, microphone arrays based on microphones mounted on some
acoustically hard baffle are gaining popularity (Meyer and Elko, 2004;
Abhayapala and Ward, 2002; Moreau et al., 2006; Rafaely, 2015; Li and
Duraiswami, 2007; Teutsch, 2007; Jin et al., 2014), mainly for the follow-
ing reasons. Firstly, the baffle enforces directivity on otherwise omnidi-
rectional capsules, which are less costly than directional ones and easier
to construct. Secondly, it provides a natural casing for the assortment of
electronic components of the array resulting in a compact portable setup.
Thirdly, basic constructions, such as spherical baffles (for 3D acoustic
analysis and recording) or cylindrical baffles (for 2D analysis and recording),
have symmetry properties that enable a modal analysis of the sound
field properties, as will be detailed in the next section.

Figure 2.1. Directional patterns for (a) common directional microphones, with κ = 1 (omni), κ = 1/2 (cardioid), κ = 1/4 (hypercardioid), κ = (√3 − 1)/2 (supercardioid), and κ = 0 (dipole); (b) an omnidirectional microphone mounted on a hard cylinder of 5 cm radius; and (c) an omnidirectional microphone mounted on a sphere of 5 cm radius, both shown at 100 Hz, 500 Hz, 1 kHz, 4 kHz, and 12 kHz.
The array response of microphones on a baffle does not obey equation
(2.30) due to the presence of the scattered field. Theoretical expressions
exist for the fundamental cases of sensors on the surface of a sphere or an
infinitely long cylinder of radius R:
h_q^cyl(γ, ω) = B₀(kR) + 2 Σ_{n=1}^∞ jⁿ B_n(kR) cos(n α_q)  (2.33)

and

h_q^sph(γ, ω) = Σ_{n=0}^∞ jⁿ (2n + 1) b_n(kR) P_n(cos α_q),  (2.34)
where B_n and b_n are combinations of cylindrical and spherical Bessel-family
functions, respectively (Rafaely, 2015; Teutsch, 2007), expressing incoming
and outgoing circular or spherical waves. Computational routines that
generate steering vectors for arbitrary arrays of omnidirectional or direc-
tional microphones in free field or around a cylindrical or spherical scat-
terer are presented by the author (Politis, 2015b). Example polar plots for
a microphone on the surface of a rigid cylinder and sphere are shown in
Fig. 2.1.
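The radial terms b_n of (2.34) can be evaluated with standard Bessel routines. The sketch below uses the rigid-sphere boundary condition of zero radial velocity on the surface; the choice of the Hankel function of the second kind (h_n = j_n − j y_n) is an assumption tied to the time convention, and the truncation order and kR value are illustrative:

```python
import numpy as np
from numpy.polynomial.legendre import legval
from scipy.special import spherical_jn, spherical_yn

def rigid_sphere_bn(n, kR):
    """Radial term b_n(kR) for a sensor on a rigid sphere:
    b_n = j_n - (j_n' / h_n') h_n, with h_n = j_n - 1j * y_n a spherical
    Hankel function (second kind here; convention-dependent)."""
    jn = spherical_jn(n, kR)
    djn = spherical_jn(n, kR, derivative=True)
    hn = jn - 1j * spherical_yn(n, kR)
    dhn = djn - 1j * spherical_yn(n, kR, derivative=True)
    return jn - (djn / dhn) * hn

def rigid_sphere_response(alpha, kR, N):
    """Truncated series (2.34): sum_n j^n (2n+1) b_n(kR) P_n(cos alpha)."""
    out = np.zeros_like(np.asarray(alpha, dtype=float), dtype=complex)
    for n in range(N + 1):
        pn = legval(np.cos(alpha), np.eye(N + 1)[n])   # Legendre P_n
        out = out + (1j ** n) * (2 * n + 1) * rigid_sphere_bn(n, kR) * pn
    return out

# Pattern of a sensor on a 5 cm rigid sphere at kR = 2 (roughly 2.2 kHz)
alpha = np.linspace(0.0, np.pi, 181)
h = rigid_sphere_response(alpha, 2.0, N=20)
```

A useful self-check is the Wronskian identity |b_n(x)| = 1/(x² |h_n′(x)|), which the rigid-sphere terms must satisfy.
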
A fundamental operation in array processing is beamforming, where the
array signals are combined with appropriate gain factors to achieve a de-
sired directionality at the output. These beamforming weights are de-
signed based on the application at hand, ranging from source-signal sepa-
ration and estimation to dereverberation and speech enhancement, localisation of sources and tracking, and analysis of room acoustics. Beamforming-
related research is vast and ubiquitous in the field of array processing;
the interested reader is referred to the literature (Van Veen and Buckley,
1988; Johnson and Dudgeon, 1992; Van Trees, 2004; Benesty et al., 2008;
Brandstein and Ward, 2013).
In terms of spatial sound processing, beamforming is one way to dis-
tribute multichannel recordings to the speakers, by generating speaker
signals that correspond to appropriate beams that form the desired sound
scene. Beamforming is also essential for acoustic analysis and parameter
estimation of the sound field quantities. A few basic cases of interest are
highlighted below.
In its most general form, beamforming is simply expressed as

y(ω) = wᴴ(ω) x(ω),  (2.35)

or, by using the amplitude density formalism, as

y(ω) = ∫ wᴴ(ω) h(γ, ω) a(γ, ω) dγ,  (2.36)
with w = [w1, ..., wQ]T being the vector of complex weights applied to each
microphone signal for a certain frequency. The weights naturally depend
on the array geometry, or equivalently on the steering vector. The most
basic beamforming operation is phase-aligning the received signals in a
certain direction so that they sum coherently while other directions are
attenuated due to partially incoherent summation. If the microphones
are omnidirectional, the weights correspond to the well-known delay-and-
sum beamformer. For the general case, this operation corresponds to
a plane-wave decomposition operation in the beamforming direction γ0,
with weights given by
w_pw(ω) = h(γ₀, ω) / ||h(γ₀, ω)||².  (2.37)
The denominator forces unity amplitude in the look direction γ0. This de-
sign maximises the improvement of signal-to-noise ratio (SNR) compared
to a single microphone of the array, known as white noise gain (WNG)
(Brandstein and Ward, 2013).
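The matched design (2.37) is a one-line computation; a sketch with an illustrative random steering vector:

```python
import numpy as np

def pwd_weights(h0):
    """Matched (plane-wave decomposition) weights of (2.37):
    w = h(gamma_0) / ||h(gamma_0)||^2, unit gain in the look direction."""
    return h0 / np.vdot(h0, h0).real

# Distortionless response: w^H h(gamma_0) = 1 for any steering vector
rng = np.random.default_rng(2)
h0 = rng.normal(size=8) + 1j * rng.normal(size=8)
w = pwd_weights(h0)
look_gain = np.vdot(w, h0)   # equals 1
# For Q free-field omnidirectional sensors, |h_q| = 1, so the white noise
# gain |w^H h0|^2 / (w^H w) reduces to ||h0||^2 = Q, the number of sensors
```
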
The plane-wave decomposition beamformer of (2.37) is optimal to re-
duce microphone noise, but not to suppress diffuse sound. Such a design
requires minimisation of the power output of the array in all directions
except the steering direction. The weights of such a design correspond to
a maximum-directivity, or superdirective, beamformer and are given by
(Brandstein and Ward, 2013)
w_di(ω) = Γ_diff⁻¹ h(γ₀, ω) / (hᴴ(γ₀, ω) Γ_diff⁻¹ h(γ₀, ω)),  (2.38)

where Γ_diff is the coherence matrix of the array in the presence of the isotropic
diffuse field of (2.19), given by

[Γ_diff]_ij = ∫ h_i(γ, ω) h_j*(γ, ω) dγ / √(∫ |h_i(γ, ω)|² dγ ∫ |h_j(γ, ω)|² dγ),  with i, j = 1, ..., Q.  (2.39)
The plane-wave decomposition beamformer of (2.37) can be obtained from
(2.38) if the noise is assumed spatially white, Γdiff = IQ. It is evident
that the above designs depend only on the array properties and a basic
model of a noise field, and are thus non-adaptive. In contrast, many
beamforming methods adapt the weights based on signal
statistics and are thus more effective when the interfering noise deviates
from the simple models above (Benesty et al., 2008; Brandstein and Ward,
2013).
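A sketch of the superdirective design (2.38), using the well-known closed form of the isotropic diffuse coherence for free-field omnidirectional sensors, sin(kd)/(kd); the diagonal loading term is an addition beyond the formula above, and the geometry and frequency are illustrative:

```python
import numpy as np

def diffuse_coherence_omni(positions, k):
    """Diffuse-field coherence (2.39) for free-field omnidirectional sensors,
    which reduces to sin(kd)/(kd) for inter-sensor distance d = ||r_i - r_j||."""
    d = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    return np.sinc(k * d / np.pi)        # np.sinc(x) = sin(pi x)/(pi x)

def superdirective_weights(h0, Gamma, mu=0.0):
    """Maximum-directivity weights of (2.38); mu is an optional diagonal
    loading that trades directivity against white-noise amplification."""
    G = Gamma + mu * np.eye(len(h0))
    g = np.linalg.solve(G, h0)
    return g / np.vdot(h0, g)

# Two omnis 2 cm apart at 500 Hz: the diffuse noise is strongly coherent
pos = np.array([[0.0, 0.0, 0.0], [0.02, 0.0, 0.0]])
k = 2 * np.pi * 500.0 / 343.0
gamma0 = np.array([1.0, 0.0, 0.0])                 # endfire look direction
h0 = np.exp(1j * k * (pos @ gamma0))
w = superdirective_weights(h0, diffuse_coherence_omni(pos, k), mu=1e-4)
```

The normalisation in (2.38) keeps the response distortionless in the look direction, which the assertions below check.
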
A second beamforming example of importance to this thesis is the gen-
eral problem of approximating a target pattern with an arbitrary array
in a least-squares error sense. This is especially useful to spatial sound
processing as it provides means to directly obtain signals that deliver the
appropriate spatial cues during reproduction. The target patterns can
be, for example, amplitude panning gains for loudspeaker reproduction
(Backman, 2003) or head-related transfer functions (HRTFs) for head-
phone reproduction (Chen et al., 1992). The least-squares approach to
beamforming design has been useful in this work both for its flexibility
and for the fact that it can take into account nonidealities in the array re-
sponse that are generally not captured by simple theoretical models, such
as the ones in (2.32–2.34). The beamforming weights are obtained in the
following way. Let us assume that the target pattern is given by the direc-
tional function d(γ) and the squared error between the beamformer and
the target in a certain direction is given by
e_d(γ) = |wᴴ h(γ) − d(γ)|²,  (2.40)

so that its mean integrated over all directions is

⟨e_d(γ)⟩ = ∫ |wᴴ h(γ) − d(γ)|² dγ.  (2.41)
We are looking for the weight vector that minimises the mean squared error
of (2.41). In practice, we can assume that the array response and the
target pattern are known, either through an analytical formula or through
measurements in K discrete directions. We can then formulate the solution to
(2.41) as a linear problem
w = arg min_w ||wᴴ H − d||² + λ² ||w||²,  (2.42)

where λ is a regularisation term, d = [d(γ₁), ..., d(γ_K)] is the vector of
target values, and H = [h(γ₁), ..., h(γ_K)] is the matrix of steering vectors
for the specified directions. The system of (2.42) admits the closed-form
solution

wᴴ = d Hᴴ (H Hᴴ + λ² I_Q)⁻¹.  (2.43)
The regularisation term λ ensures that the weights remain at sensible
values when the inversion of H Hᴴ is ill-conditioned.
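The regularised solution (2.43) maps directly onto a linear solve. A sketch with an illustrative random design grid; when the target pattern is exactly achievable by the array, the unregularised solution recovers it:

```python
import numpy as np

def ls_beamformer(H, d, lam=0.0):
    """Regularised least-squares weights of (2.42)-(2.43):
    w^H = d H^H (H H^H + lambda^2 I)^(-1), returned as w.
    H: (Q, K) steering vectors on a K-point design grid; d: (K,) target."""
    Q = H.shape[0]
    A = H @ H.conj().T + (lam ** 2) * np.eye(Q)
    return np.linalg.solve(A, H @ np.conj(d))   # A is Hermitian, so A^-H = A^-1

# If the target is exactly achievable, the unregularised fit recovers it
rng = np.random.default_rng(3)
H = rng.normal(size=(4, 64)) + 1j * rng.normal(size=(4, 64))
w_true = rng.normal(size=4) + 1j * rng.normal(size=4)
d = np.conj(w_true) @ H                          # d = w_true^H H
w = ls_beamformer(H, d)
```
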
A fundamental beamforming-related operation in this work is the mea-
surement of the acoustic particle velocity of (2.8). Considering that the
measurement is at the origin r = 0, the notional centre of the array, the
components of the particle velocity vector correspond to the integrated
product of the sound field density and the direction cosines of the DoA, or
equivalently, three orthogonal dipole patterns in acoustic terminology. As
a result, the acoustic velocity can be measured with the dipole patterns
produced by the specific beamforming weights
W_uᴴ h(γ) = [sin θ cos φ, sin θ sin φ, cos θ]ᵀ = γ,  (2.44)
where Wu = [wx,wy,wz] is the matrix of the weight vectors for each of
the three components of the velocity vector. The weight matrix for this ve-
locity beamforming can be derived from analytical array models (Cazzo-
lato and Ghan, 2005; Hacihabiboglu, 2014; Sondergaard and Wille, 2015)
or from the least-squares approach of (2.43). Alternatively, the measurement
can be performed by three equalised pressure-gradient microphones
placed in close proximity, with the drawback that the estimates of the
individual components will not be exactly coincident.
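Fitting the three dipole targets of (2.44) with the least-squares weights of (2.43) can be sketched as follows; the coincident tetrahedral cardioid array is a hypothetical, idealised example (for coincident capsules the phase term of (2.30) is unity, so only the directional responses enter):

```python
import numpy as np

def velocity_weights(H, grid_dirs, lam=0.0):
    """Least-squares weight matrix W_u for (2.44), W_u^H h(gamma) ~ gamma,
    fitting the three dipole targets over a design grid via (2.43).
    H: (Q, K) steering vectors; grid_dirs: (K, 3) unit DoA vectors."""
    Q = H.shape[0]
    A = H @ H.conj().T + (lam ** 2) * np.eye(Q)
    return np.linalg.solve(A, H @ np.conj(grid_dirs))   # (Q, 3)

# Four coincident cardioids in tetrahedral directions span the omni plus the
# three dipoles, so the fit is exact for this idealised B-format-style array
tet = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3)
rng = np.random.default_rng(4)
grid = rng.normal(size=(200, 3))
grid /= np.linalg.norm(grid, axis=1, keepdims=True)
H = 0.5 + 0.5 * tet @ grid.T                # cardioid responses, kappa = 1/2
Wu = velocity_weights(H, grid)
g_test = np.array([0.0, 0.0, 1.0])
vel_pattern = Wu.conj().T @ (0.5 + 0.5 * tet @ g_test)   # ~ g_test
```
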
2.1.4 DoA estimation and acoustic localisation in complex scenes
Parametric approaches to reproduction of recorded sound scenes extract
information on the DoA of directional sound components to be used in the
synthesis stage for spatialisation of such components. DoA estimation,
similar to beamforming, is a wide field of research. Three large families of
solutions are mentioned here. Steered-response power (SRP) approaches
generate a power map from the output of a beamformer steered towards
multiple directions, with peaks corresponding to dominant source direc-
tions. Time-difference-of-arrival (TDoA) approaches infer the DoA from
inter-sensor delays obtained through correlations and the geometry of the
array. Spectral or subspace approaches are able to infer multiple source
directions at a single frequency through subspace decomposition of the
array signal covariance matrix, such as the MUSIC (Schmidt, 1986) and
ESPRIT (Roy and Kailath, 1989) methods. For an overview of approaches,
the reader is referred to Benesty et al. (2008) and Brandstein and Ward
(2013).
For the parametric sound reproduction methods of interest in this work,
and contrary to speech enhancement applications, many of the above ap-
proaches can be either unusable in the complex acoustic scenarios of a
variety of interesting sound scenes or too complex and computationally
demanding for a perceptually motivated reproduction of spatial sound.
Alternatively, we focus on DoA information obtained through the acoustic
active intensity vector. It is trivial to see from (2.9) that the acoustic intensity
in the case of a single far-field source points opposite to the DoA.
It is also straightforward to deduce, from the vanishing of the active intensity
in a diffuse reverberant field, that with adequate time averaging the intensity
still provides a meaningful estimate for a source in the presence of reverberation. Tervo
(2009), Levin et al. (2010), and Günel (2013) have studied the statistics of
the intensity vector. These statistics have been exploited for DoA estima-
tion (Hickling et al., 1993; Tervo, 2009; Thiergart et al., 2009; Günel and
Hacihabiboglu, 2011; Levin et al., 2012; Jarrett et al., 2010; Moore et al.,
2015; Wu et al., 2015) and spatial audio processing (Merimaa and Pulkki,
2005; Hurtado-Huyssen and Polack, 2005; Pulkki, 2007).
Utilising the acoustic intensity offers some advantages in the repro-
duction task. Firstly, if the acoustic particle velocity is measured, esti-
mating the active intensity is a very efficient operation suitable for real-
time applications, and it avoids the directional scanning and peak finding
of many SRP and subspace approaches. Secondly, for multiple sources
or extended source distributions at a single frequency, correlated or un-
correlated, the intensity gives an estimate that always lies between the
directions of the sources. This property may seem a drawback from an
acoustical estimation or source localisation point of view, but it has some
perceptual relevance useful for sound reproduction, as the intensity will
essentially point towards the direction that most of the acoustic energy
propagates. There is evidence that the auditory system cannot localise
separately multiple directional components in a narrow frequency band
(Faller and Merimaa, 2004) and is instead dominated by a single direc-
tional cue related to the source distribution, with the exception of extreme
synthetic cases where the overall image does not get fused (Tahvanainen
et al., 2011). The active intensity seems to give information that is aligned
with this dominant directional perception.
Two sound field cases of interest are presented to highlight the be-
haviour of intensity for distributed sound sources. The first is a source
with a deterministic, possibly complex, spatial distribution g(γ, ω), which
can be modelled as (Valaee et al., 1995; Lee et al., 1997)
a(γ, ω) = g(γ, ω)s(ω) (2.45)
carrying the random source signal s(ω). This model includes cases in
which sound incident from some direction is a delayed and scaled copy
of the source signal. Examples include scattering from a curved surface
focusing sound towards the array or extended vibrating sound sources
with correlated point-to-point surface vibrations. The spatial correlation
of (2.45) is
Sγγ′(ω) = g(γ, ω)g∗(γ ′, ω)Sss(ω), (2.46)
with Sss being the PSD of the signal s. By inserting (2.46) into (2.15), the
active intensity vector results in

i_a = −(S_ss(ω)/(2Z₀)) ∫∫ ℜ{g(γ, ω) g*(γ′, ω)} γ′ dγ dγ′.  (2.47)
The second case considers a diffusely distributed source, where the source
signal is uncorrelated between different directions. Examples include
the diffuse reflections and the non-isotropic diffuse fields mentioned in
Sec. 2.1.2. Such a distribution can be known only through its directional
power g2(γ, ω) as
Sγγ′(ω) = g2(γ, ω)δγ−γ′ . (2.48)
The resulting intensity, after inserting (2.48) in (2.15), is
i_a = −(1/(2Z₀)) ∫ g²(γ, ω) γ dγ.  (2.49)
It is obvious that (2.49) expresses an average of all directions γ, weighted
with the power distribution g2(γ, ω). Non-continuous versions of these re-
lations for a discrete number of point sources, relevant for loudspeaker re-
production systems, have been presented earlier (Merimaa, 2007; De Sena
et al., 2013). A pair of normalised vector quantities termed velocity and
energy vectors (Makita, 1962; Gerzon, 1992a), special cases of (2.47) and
(2.49), have been applied extensively for the evaluation of panning and
non-parametric multichannel sound reproduction (Gerzon, 1992c,b; Ger-
zon and Barton, 1992; Jot et al., 1999; Daniel et al., 1999; Poletti, 2000;
Lee et al., 2004; Zotter et al., 2012; Zotter and Frank, 2012; Epain et al.,
2014).
Intensity is not a suitable choice when detailed DoA estimates of mul-
tiple sources or reflections are required; detailed analysis and auralisa-
tion of spatial room impulse responses is such an application. In Publi-
cation VII, the DoAs of reflections from the early part of room impulse
responses are estimated with high-resolution subspace- and maximum-
likelihood (ML) based methods.
2.1.5 Spherical harmonic representation of sound field quantities and acoustical spherical processing
The previous sections defined all sound field quantities with respect to a
plane-wave amplitude density. Furthermore, the effect of the array sam-
pling on the sound field, beamforming, and measurement of the acous-
tic velocity were all defined as linear integral operators on that sound
field density, computed over the unit sphere S2. It is advantageous to
study these linear operators in terms of an orthogonal expansion. This is
accomplished by means of the Spherical Harmonic Transform (SHT), or
Spherical Fourier Transform, in which spherical functions are projected
onto harmonic functions that constitute an orthonormal basis over the
unit sphere. For a detailed analysis on the SHT and an overview of op-
erations on the spherical harmonic, or spectral, domain (SHD) the reader
is referred to Driscoll and Healy (1994). We highlight some fundamental
properties of the SHT with applications to acoustical processing.
The vector of angular spectrum coefficients f of a square-integrable function
f(γ) on the unit sphere S² is given by the SHT as

f = SHT{f(γ)} = ∫_{γ∈S²} f(γ) y*(γ) dγ,  (2.50)

where the infinite-dimensional basis vector y(γ) has as its entries the
spherical harmonics (SHs) Y_nm of integer order n ≥ 0 and degree m ∈ [−n, n]:

[y(γ)]_q = Y_nm(γ), with q = n² + n + m + 1,  (2.51)
and [y∗(γ)]q = Y ∗nm(γ) denotes its complex conjugate. Conventions of
spherical harmonics vary between different scientific fields. A common
orthonormal complex form in most fields is¹

Y_nm(θ, φ) = (−1)^m √((2n + 1)/(4π) · (n − m)!/(n + m)!) P_nm(cos θ) e^{jmφ},  (2.52)
where P_nm are unnormalised associated Legendre functions of order n and degree m.
Another form, commonly used in audio, is the real SHs defined as

Y_nm(θ, φ) = √((2 − δ_m0) (2n + 1)/(4π) · (n − |m|)!/(n + |m|)!) P_n|m|(cos θ) y_m(φ),  (2.53)

with

y_m(φ) = sin |m|φ for m < 0,  1 for m = 0,  and cos mφ for m > 0,  (2.54)
and δm0 the Kronecker delta. Using the real form, no conjugation occurs in
(2.50). One can change between the real and complex SH basis by means
of deterministic unitary matrices; example routines for this conversion
are contributed by the author (Politis, 2015d). Examples of real and
complex SHs up to order N = 3 are displayed in Fig. 2.2.
The inverse SHT is given by

f(γ) = ISHT{f} = fᵀ y(γ).  (2.55)

The orthonormality of the SHs results in

∫ y(γ) yᴴ(γ) dγ = I,  (2.56)

where I is the identity matrix. Due to this orthonormality, Parseval's
theorem for the SHT states that

∫ f(γ) g*(γ) dγ = gᴴ f  and  ∫ |f(γ)|² dγ = ||f||².  (2.57)
These last two relations show the usefulness of the SHT from a practi-
cal viewpoint, with directional integrals collapsing into vector products
between spectral coefficients.
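The SHT of (2.50) and the orthonormality (2.56) can be verified numerically with a quadrature grid. A sketch, assuming SciPy's `lpmv` (whose associated Legendre functions include the Condon-Shortley phase, absorbing the explicit (−1)^m of (2.52)); the grid sizes here are the minimal ones that make the quadrature exact for order-N functions:

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def csh(n, m, theta, phi):
    """Complex SH of (2.52); for m < 0 the symmetry
    Y_{n,-m} = (-1)^m conj(Y_{n,m}) is used."""
    if m < 0:
        return (-1.0) ** (-m) * np.conj(csh(n, -m, theta, phi))
    norm = np.sqrt((2 * n + 1) / (4 * np.pi) * factorial(n - m) / factorial(n + m))
    return norm * lpmv(m, n, np.cos(theta)) * np.exp(1j * m * phi)

def sh_matrix(N, theta, phi):
    """SHs up to order N stacked per the indexing (2.51), q = n^2 + n + m + 1."""
    return np.stack([csh(n, m, theta, phi)
                     for n in range(N + 1) for m in range(-n, n + 1)], axis=-1)

# Gauss-Legendre nodes in cos(theta) and uniform phi sampling: the discrete
# SHT is then exact for functions band-limited to order N
N = 3
u, gl_w = np.polynomial.legendre.leggauss(N + 1)
n_phi = 2 * N + 2
phi = np.arange(n_phi) * 2 * np.pi / n_phi
TH, PH = np.meshgrid(np.arccos(u), phi, indexing="ij")
W = np.repeat(gl_w * 2 * np.pi / n_phi, n_phi)        # quadrature weights
Y = sh_matrix(N, TH.ravel(), PH.ravel())
gram = (Y.conj().T * W) @ Y    # discrete version of (2.56): identity matrix
```

The Gram matrix reproducing the identity is exactly the statement that directional integrals collapse into vector products between coefficients.
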
In most practical cases, order-limited (or band-limited) functions are
considered, meaning that there is no energy in terms above some order N ,
so that |f_q|² = 0 for q > (N + 1)². We denote the respective (N + 1)²-sized
coefficient vector as f_N. For two functions f(γ) and g(γ) band-limited to
orders L and M respectively, (2.57) is limited accordingly by the smaller
order of the two.

¹ Note that here it is assumed that the Condon-Shortley phase term (−1)^m is not included in the definition of the associated Legendre functions P_nm.

Figure 2.2. Spherical harmonic functions up to order N = 3 in (a) real form and (b) complex form. The surfaces correspond to the magnitude of the SHs. Blue and red indicate positive and negative values for the real SHs, while the colour map indicates the phase for the complex SHs.
Let us now assume that we have a finite-order representation of the
sound scene amplitude density aN = SHT {a(γ)}. It can be shown, by
solving the homogenous Helmholtz equation of (2.2) in spherical coordi-
nates (Williams, 1999; Ziomek, 1994), that the pressure around the origin
at position r due to the amplitude density a(γ) can be expressed as
$$p(r,\boldsymbol{\gamma}_r,\omega) = 4\pi \sum_{n=0}^{N} j_n(kr) \sum_{m=-n}^{n} a_{nm}(\omega)\, Y_{nm}(\boldsymbol{\gamma}_r), \qquad \text{(2.58)}$$
where r = ||r||, γr = r/||r||, and jn are the spherical Bessel functions
of order n. Note that if the pressure is sampled around a scatterer as
in (2.34) or directional sensors are used, then jn should be replaced by
the appropriate radial functions bn. See, for example, Teutsch (2007) for
details. Taking the SHT of the pressure over a sphere of radius r and
inserting (2.58),
$$\begin{aligned} p_{nm}(r,\omega) &= \int p(r,\boldsymbol{\gamma}_r,\omega)\, Y_{nm}^{*}(\boldsymbol{\gamma}_r)\,\mathrm{d}\boldsymbol{\gamma}_r \\ &= 4\pi \sum_{n'=0}^{N} j_{n'}(kr) \sum_{m'=-n'}^{n'} a_{n'm'}(\omega) \int Y_{n'm'}(\boldsymbol{\gamma}_r)\, Y_{nm}^{*}(\boldsymbol{\gamma}_r)\,\mathrm{d}\boldsymbol{\gamma}_r \\ &= 4\pi\, j_n(kr)\, a_{nm}(\omega). \end{aligned} \qquad \text{(2.59)}$$
For the above equation to hold with negligible error inside a spherical
region of radius R, the truncation order N depends on the maximum
wavenumber of interest k_max. It is given by Kennedy et al. (2007) as
$$N = \left\lceil \frac{e\, k_{\max} R}{2} \right\rceil, \qquad \text{(2.60)}$$
with ⌈·⌉ denoting the integer ceiling function. Relation (2.59) is important
for two reasons. First, it shows that we can encode any sound scene
within a distance R from the origin with a finite dimensional vector of
sound field coefficients aN . Second, it demonstrates a practical way to
capture the sound field coefficients aN through measurement of the pres-
sure over a sphere and by its subsequent SHT. These two reasons consti-
tute the basis of spherical array processing for recording, acoustic anal-
ysis and beamforming (Williams, 1999; Meyer and Elko, 2004; Teutsch,
2007; Rafaely, 2015), and spherical acoustic holophony or higher-order
Ambisonics (HOA) (Gerzon, 1973; Daniel, 2000; Poletti, 2005; Zotter, 2009a).
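The rule of thumb of (2.60) is easy to apply in practice; the following sketch (with `truncation_order` a hypothetical helper and c = 343 m/s assumed) computes the required order for a given region radius and maximum frequency:

```python
import numpy as np

def truncation_order(f_max, R, c=343.0):
    """Truncation order N of (2.60) for negligible error inside a
    spherical region of radius R (m) up to frequency f_max (Hz)."""
    k_max = 2 * np.pi * f_max / c   # maximum wavenumber of interest
    return int(np.ceil(np.e * k_max * R / 2))
```

For example, a region of radius R = 0.042 m (a typical compact-array radius, assumed here for illustration) requires N = 9 at 8 kHz.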
If the sound field coefficients are obtained, many spatial processing op-
erations simplify considerably. For example, beamforming design can be
performed directly in the spherical domain. If we define a beamforming
pattern w(γ) band-limited to order N with coefficients w_N = SHT{w(γ)},
Parseval's theorem (2.57) dictates that the beamformer output is the dot
product of the beampattern's coefficients and those of the sound field:
$$y(\omega) = \int w^{*}(\boldsymbol{\gamma})\, a(\boldsymbol{\gamma},\omega)\,\mathrm{d}\boldsymbol{\gamma} = \mathbf{w}_{N}^{\mathrm{H}}\,\mathbf{a}_{N}. \qquad \text{(2.61)}$$
Similarly to (2.37), and by substituting h(γ) = y∗(γ), the plane wave
decomposition beamformer weights in the SHD can be derived:
$$\mathbf{w}_{\mathrm{pw}}(\boldsymbol{\gamma}_0) = \mathbf{w}_{\mathrm{di}}(\boldsymbol{\gamma}_0) = \frac{4\pi}{(N+1)^2}\,\mathbf{y}_{N}^{*}(\boldsymbol{\gamma}_0), \qquad \text{(2.62)}$$
where the relation ‖y_N(γ)‖² = (N + 1)²/(4π) is used in the denominator.
In the SHD, the plane-wave decomposition beamformer is also the
maximum-directivity beamformer of (2.38). This is due to the fact that the diffuse field
coherence Γ_diff of the sound field coefficients in the diffuse field of (2.19) is
the identity matrix:
$$\boldsymbol{\Gamma}_{\mathrm{diff}} = \frac{4\pi}{\sigma_d^2}\, \mathrm{E}\!\left[\mathbf{a}_N\,\mathbf{a}_N^{\mathrm{H}}\right] = \int \mathbf{y}_N(\boldsymbol{\gamma})\,\mathbf{y}_N^{\mathrm{H}}(\boldsymbol{\gamma})\,\mathrm{d}\boldsymbol{\gamma} = \mathbf{I}_N. \qquad \text{(2.63)}$$
A comprehensive set of routines for beamforming design, localisation of
sources and adaptive beamforming in the SHD is contributed by the au-
thor (Politis, 2016).
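To illustrate (2.61)–(2.62), the sketch below evaluates a first-order (N = 1) plane-wave decomposition beamformer; `sh1_real` and `pwd_output` are illustrative helpers, not the routines of (Politis, 2016), and the ACN channel ordering is an assumption. For a unit-amplitude plane wave from direction γ_s, the real-SH coefficients are a_N = y(γ_s), and steering the beamformer at γ_s returns unit gain.

```python
import numpy as np

def sh1_real(gamma):
    """Real SHs up to order N = 1 (ACN ordering, an assumed channel
    order) for a unit direction vector gamma = (x, y, z)."""
    x, y, z = gamma
    c = np.sqrt(3.0 / (4 * np.pi))
    return np.array([1 / np.sqrt(4 * np.pi), c * y, c * z, c * x])

def pwd_output(gamma0, a_N):
    """Plane-wave decomposition beamformer of (2.61)-(2.62):
    w = 4*pi/(N+1)^2 y(gamma0); real SHs, so no conjugation needed."""
    y0 = sh1_real(gamma0)
    w = 4 * np.pi / len(y0) * y0   # (N+1)^2 = len(y0)
    return w @ a_N
```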
The acoustic particle velocity is also directly related to the sound field
coefficients of the first order a1. More specifically, if we define a signal
vector of pressure p(ω) and equalised pressure-gradient signals v(ω) =
−Z0u(ω) as
$$\mathbf{b} = \begin{bmatrix} p(\omega) \\ \mathbf{v}(\omega) \end{bmatrix} = \begin{bmatrix} p(\omega) \\ v_x(\omega) \\ v_y(\omega) \\ v_z(\omega) \end{bmatrix}, \qquad \text{(2.64)}$$
we can directly relate it to the field coefficients through the transforma-
tion matrix Ma→pu as
$$\mathbf{b} = \mathbf{M}_{a\to pu}\,\mathbf{a}_1. \qquad \text{(2.65)}$$
The pressure and pressure-gradient signal vector b is known in the spa-
tial sound literature as B-format2. The transformation matrix is given for
the real and complex SH conventions presented above as
$$\mathbf{M}^{\mathrm{real}}_{a\to pu} = \sqrt{4\pi} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/\sqrt{3} \\ 0 & 1/\sqrt{3} & 0 & 0 \\ 0 & 0 & 1/\sqrt{3} & 0 \end{bmatrix} \quad \text{and} \quad \mathbf{M}^{\mathrm{complex}}_{a\to pu} = \sqrt{4\pi} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1/\sqrt{6} & 0 & -1/\sqrt{6} \\ 0 & -j/\sqrt{6} & 0 & -j/\sqrt{6} \\ 0 & 0 & 1/\sqrt{3} & 0 \end{bmatrix}. \qquad \text{(2.66)}$$
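The real-SH branch of (2.65)–(2.66) can be sketched as follows, assuming the ACN coefficient ordering [a00, a1,−1, a10, a11] for a1 (an assumption made here; it matches the column order of the real matrix in (2.66)). The traditional 1/√2 pressure scaling mentioned in the footnote is omitted, as in the text.

```python
import numpy as np

s3 = 1 / np.sqrt(3)
# M_a->pu for the real-SH convention of (2.66); rows: p, vx, vy, vz
M_real_a2pu = np.sqrt(4 * np.pi) * np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, s3 ],
    [0.0, s3,  0.0, 0.0],
    [0.0, 0.0, s3,  0.0],
])

def a1_to_bformat(a1):
    """B-format vector b = [p, vx, vy, vz] of (2.64) from first-order
    real-SH sound-field coefficients a1, per (2.65)."""
    return M_real_a2pu @ a1
```

A unit-amplitude plane wave from the +x direction, with a1 = [1/√(4π), 0, 0, √(3/(4π))], maps to b = [1, 1, 0, 0].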
The spherical spectral formulation of acoustical quantities has naturally
many applications in analysis and modelling of directional patterns of
sources or receivers, interpolation of discretely measured spherical func-
tions, such as HRTFs, spatial sound processing, and acoustic analysis and
beamforming. In terms of spatial sound capture and reproduction, the
²Traditionally, B-format has been defined with an additional scaling factor of 1/√2 in the pressure signal.
spherical harmonic framework was first introduced in the 1970s, formu-
lated mostly by Gerzon (1973) under the name Ambisonics, with consid-
erable development towards a usable first-order implementation (Gerzon,
1975b; Farrar, 1979; Fellgett, 1975; Gerzon, 1975a).
In practice, severe limitations are imposed on acquiring the
sound field coefficient vector of (2.60) up to an order higher than the
first few for a compact microphone array and on reproducing the sound
scene on a small volume with a reasonable number of loudspeakers. Two
major issues are encountered in practice. The first has to do with dis-
crete sampling limitations of the acoustic pressure, so that spatial alias-
ing occurs above some limiting frequency and higher-order components
contaminate lower-order ones. For a discussion on spherical sampling
schemes and conditions the reader is referred to Rafaely et al. (2007b),
Zotter (2009b), and Rafaely (2015). The second is the vanishing of higher-
order coefficients at lower frequencies at some finite radius R from the
centre of the array, expressed in (2.59) through the radial term jn or bn.
The radial terms in the expansion decay rapidly for kR < N , making
their equalisation, and consequently higher-order recording, impossible at
large wavelengths (Moreau et al., 2006; Rettberg and Spors, 2014; Baum-
gartner et al., 2011; Lösler and Zotter, 2015). In summary, recording of
high-order coefficients is practical only in some frequency range having
an upper bound imposed by spatial aliasing and a lower bound by usable
equalisation of the radial functions.
3. Perception and Technology of Spatial Sound
In this section we first highlight the major components of spatial sound
perception. This summary broadly mentions cues that play an important
role in perceptual reproduction of spatial sound scenes, rather than
exhaustively covering spatial sound psychoacoustics. For a detailed coverage of
spatial sound psychoacoustics the reader is referred to Moore (1995) and
Blauert (1996). In the latter part of this section, the major technological
components of sound spatialisation systems are presented.
3.1 Spatial sound perception
3.1.1 Time-frequency resolution
The spatial sound processing methods developed in this thesis operate in
the time-frequency domain, imposing a certain temporal and frequency
resolution on the analysis and synthesis of the recorded sound. Since the
goal is perceptual rather than physical reproduction of the sound scene,
this resolution needs to resemble somewhat the time-frequency resolution
of the auditory system. It has been shown that the frequency-resolution
of the basilar membrane can be approximated sufficiently by a bank of
overlapping band-pass filters, known as auditory filters (Fletcher, 1940;
Moore and Glasberg, 1983). Furthermore, the width of the auditory filters
can be, in turn, approximated with an equivalent rectangular bandwidth
(ERB) formula with respect to the centre frequency fc, given as (Moore
and Glasberg, 1983)
$$\Delta f_{\mathrm{ERB}} = 0.108\, f_c + 24.7. \qquad \text{(3.1)}$$
These results can serve as guidelines for the design of time-frequency
transforms or filter banks for perceptually-motivated analysis and
synthesis of sound. Examples include gammatone filter banks (Patterson et al.,
1995; Hohmann, 2002), constant-Q transforms (Schörkhuber and Klapuri,
2010), the ERBlet transform (Necciari et al., 2013), and filter banks for
perceptual coding and spatial sound coding, such as quadrature mirror
filter (QMF) banks used by Schuijers et al. (2004), Herre et al. (2008), and
in Publication V.
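The ERB approximation of (3.1) is a one-liner; the helper name below is hypothetical:

```python
def erb_bandwidth(fc):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    centred at fc (Hz), per the approximation of (3.1)."""
    return 0.108 * fc + 24.7
```

At fc = 1 kHz this gives a bandwidth of 132.7 Hz.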
3.1.2 Localisation cues
Sound localisation is dominated by two major cues. The first is the TDoA
of a wavefront incident from a certain angle at the two ears, termed the
interaural time difference (ITD). ITDs translate to interaural phase differences in
auditory processing. However, this translation is unambiguous only at
frequencies whose half-wavelength exceeds the interaural distance.
Above roughly 1.5 kHz (Moore, 1995), ITDs are not effective through de-
tection of phase differences, and they seem to contribute only through
time differences between the envelopes of the ear signals (Henning, 1974;
Salminen et al., 2015). The second major cue is the level difference be-
tween the two ear signals due to scattering effects of the head that at-
tenuate the signal at the contralateral ear with respect to the direction
of arrival, termed interaural level difference (ILD). Diffraction effects are
strong at low frequencies, where little head shadowing occurs, and hence
ILDs are most effective above about 1 kHz. Even though it has been found that ILDs can also contribute to low
frequency localisation, this basic model of ITD dominance at low frequen-
cies and ILD dominance at mid-high frequencies, known as the duplex
theory of localisation (Rayleigh, 1907), is prevalent in technological appli-
cations of spatial sound.
ITDs and ILDs provide information on lateralisation, localisation across
the left-right direction. Due to the strong left-right symmetry of the head
and its almost ellipsoidal shape, it is possible to define cones of directions
centred at the interaural axis, at which ITDs and ILDs change weakly,
with the most extreme example being the median plane. Such a trajectory
is termed a cone of confusion. Localisation of elevated sources or of
front-back sources on a cone of confusion does not rely on ITDs and ILDs, but on
monaural spectral cues that occur mostly due to the effect of the pinna,
head shape, and shoulders (Musicant and Butler, 1984). Since the pinna
dimensions are within the range of only a few centimetres, pinna cues
are naturally high-frequency ones (Moore, 1995). Pinna-related notches
and peaks are deemed especially significant for the perception of elevation
(Roffler and Butler, 1968; Blauert, 1969; Macpherson and Sabin, 2013).
3.1.3 Dynamic binaural cues
Localisation of an unknown sound source in anechoic conditions, espe-
cially across cones of confusion, can be a daunting task in the absence of
other assisting cues if the listener and the source are immobilised. Free-
field localisation can improve considerably through relative motions be-
tween the listener and the source. Small head rotations especially can
jointly modify all major localisation cues and resolve front-back confu-
sions or ambiguous elevation effectively (Wallach, 1940; Thurlow et al.,
1967; Perrett and Noble, 1997; Wightman and Kistler, 1999).
3.1.4 Distance cues
Perception of the distance of a sound source relies on many factors and
is strongly dependent on the familiarity of the listener with the source
sound itself. In the range of distances encountered in regular rooms, the
inverse-distance law of sound pressure level attenuation is probably the
most prominent distance cue. However, level attenuation with distance is
a relative cue and possible only if the listener has an auditory reference
distance for the source.
In the absence of a reference, the auditory system seems able to infer
distance from a variety of additional cues. For close ranges below about 1 m,
ITD and ILD become distance-dependent, due to near-field effects of the
sound source that are otherwise negligible at longer ranges. Significant
ILDs occur even at low frequencies at such distances (Brungart and Rabi-
nowitz, 1999). These supplementary binaural cues seem to be utilised by
the auditory system for nearby sources (Brungart and Rabinowitz, 1999;
Brungart et al., 1999).
Otherwise, at regular distances an important factor in distance per-
ception is the direct-to-reverberant ratio (DRR) of the source power over
the reverberant sound power reflected and diffused by the room (Zahorik
et al., 2005). Reverberation in itself is not a primary cue, but it affects
jointly a number of distance cues. Recent research shows that externali-
sation and distance perception may originate from short-term statistics of
ILD (Catic et al., 2013, 2015) or ITD fluctuations (Bronkhorst, 2002) that
occur in reverberant conditions. Furthermore, there is evidence of monau-
ral cues that depend on reverberation (Lounsbury and Butler, 1979), such
as information due to absorption and filtering of the reflected sound with
regards to the direct sound. Phase coherence of partials over frequency
and the effect of reverberation on them has also been proposed to con-
tribute to distance perception (Laitinen et al., 2013).
Exploitation of the DRR to control distance perception was tested in
Publication VI by modifying the directionality of a loudspeaker array in
order to affect the DRR at the listener position. That study inspired fur-
ther work by Wendt et al. (2016). Distance control using multiple loud-
speakers, targeting control of either DRR or of the related binaural cues,
has been also proposed by de Bruijn et al. (2008), Jeon et al. (2015) and
Laitinen et al. (2015).
At long distances, beyond a few tens of metres, propagation effects become
perceptible, mainly due to the kinetic energy of the wave dissipating as
thermal energy in air, reducing high frequency content (Little et al., 1992).
Another common dynamic cue for distance perception is the Doppler ef-
fect of a moving source or listener at high speeds, with its characteristic
frequency shift (Lutfi and Wang, 1999).
3.1.5 Precedence effect and summing localisation
The precedence effect (Wallach et al., 1949; Haas, 1972), or the law of first
wavefront (Blauert, 1996), refers to the phenomenon of a leading sound
completely dominating localisation if the lagging sound, coming from a
different direction, follows within a short interval ranging from about 5 ms to
40 ms (Moore, 1995). This effect occurs for onsets, transient and discon-
tinuous sounds, and it is responsible for our ability to suppress partially
the effect of echoes on localisation in rooms and enclosures. However,
even though information on the direction of an echo is lost, the listener is
still able to hear spectral differences due to the room, important for per-
ception of the source distance and the environment. A review of related
psychoacoustical research is given by Litovsky et al. (1999).
The precedence effect is important for sound reproduction because it
determines temporal limits after which panning and spatialisation opera-
tions, based on summing localisation (Blauert, 1996), are effective. Sum-
ming localisation refers to the phenomenon of two fully or partially cor-
related sounds, arriving at the listener from different directions, being
perceived as a single coherent spatial image. For amplitude and time-
delay panning techniques to be effective, the listener should be positioned
so that wavefronts from the speakers arrive with time differences smaller
than the precedence-effect limit; otherwise, the sound is completely pulled
to the closest speaker.
3.1.6 Apparent source width, spaciousness and interaural coherence
Perception of a spatially extended source, perceived as such either due
to its physical extent or due to the contributing room acoustics, has been
termed in the literature as apparent source width (ASW). In room acous-
tics research, ASW is usually associated with early lateral reflections hav-
ing a widening effect around the direct sound (Morimoto and Maekawa,
1989; Barron, 1971; Griesinger, 1997). Another related concept is the per-
ception of spaciousness and listener envelopment (LEV), usually associ-
ated with the later reverberant diffuse sound, where no clear direction-
ality can be perceived (Bradley and Soulodre, 1995; Griesinger, 1997).
Recent psychoacoustical studies investigate additionally the effect of the
spatial distribution itself, and how it affects perception (Hirvonen and
Pulkki, 2006, 2008; Santala and Pulkki, 2011).
Both LEV and ASW are important in sound reproduction for two rea-
sons. First, the sound reproduction method should be able to recreate
them appropriately, since they are fundamental parts of a natural and
engaging listening experience. Second, understanding them is important
for the design of sound engineering tools and techniques that are able
to control these two attributes. In general such spatialisation effects are
introduced through emulation of room acoustics.
Understanding the cues that contribute to the ASW and LEV is still
under research. A commonly mentioned quantity is the interaural cross
correlation (IACC), the normalised short-term cross-correlation between
the two binaural signals (Hidaka et al., 1995). IACC is, however, a broad-
band index which can be misleading as an indicator of ASW (Mason et al.,
2005; Käsbach et al., 2013). A more appropriate index is its frequency-
domain counterpart, the interaural coherence (IC) (Blauert and Linde-
mann, 1986). Currently, the focus is moving from these signal statistics
to the short-term statistics of the binaural localisation cues themselves.
Mason et al. (2004) and Käsbach et al. (2014) study the fluctuations of
ITDs as a cue for ASW.
3.2 Spatial sound technologies
3.2.1 Binaural technology
ITDs, ILDs, and spectral cues for a source in the free field are encoded
into the acoustic transfer function from the source to the ears of the lis-
tener. These transfer functions can be measured in anechoic conditions
for multiple directions around the listener and stored as digital filters,
known as head-related transfer functions (HRTFs). HRTFs are crucial
for headphone-based binaural reproduction of spatial sound since source
signals can be spatialised by their filtering with the respective HRTFs
for the desired direction. Synthetic room reverberation can be generated
for headphones based on similar principles and, more specifically, on the
diffuse-field coherence of the HRTFs (Borß and Martin, 2009; Vilkamo
et al., 2012).
Properly measured HRTFs for the specific individual will contain all the
free-field localisation cues mentioned above except the dynamic
ones. Dynamic cues can be added by tracking the head of the listener (Al-
gazi et al., 2004) with an adequate temporal resolution (Laitinen et al.,
2012b), which resolves localisation confusion (Begault et al., 2001), and
more importantly, preserves the orientation of the sound scene with re-
spect to the listener’s movement, which is a requirement for a natural and
immersive experience. Individualisation of HRTFs is a difficult task due
to the lengthy measurement process and the demanding requirements on
anechoic conditions and on measurement apparatus, making it unsuitable
for general deployment of binaural spatial sound applications. Hence, ei-
ther HRTFs based on a generic ear model are used, or parametric HRTFs
or HRTFs from a database are individualised based on external infor-
mation, such as morphological features (Jin et al., 2000; Algazi et al.,
2001; Zotkin et al., 2003; Hu et al., 2006; Xu et al., 2009), head scans
(Guillon et al., 2008; Gamper et al., 2015; Politis et al., 2016), or active
tuning through listening tasks (Tan and Gan, 1998; Runkle et al., 2000;
Silzle, 2002; Seeber and Fastl, 2003; Iwaya, 2006; Härmä et al., 2012). All
aspects related to HRTFs have been studied extensively in the last two
decades, including rapid measurement (Enzner, 2009; Zotkin et al., 2006),
efficient storage and compression (Kistler and Wightman, 1992; Larcher
et al., 2000; Romigh et al., 2015), efficient filter modelling (Kulkarni and
Colburn, 2004; Huopaniemi et al., 1999), interpolation (Gamper, 2013;
Carlile et al., 2000; Ajdler et al., 2008; Evans et al., 1998; Zotkin et al.,
2009; Pollow et al., 2012; Romigh et al., 2015), physical modelling (Katz,
2001; Xiao and Liu, 2003; Kahana and Nelson, 2007; Huttunen et al.,
2007; Mokhtari et al., 2008; Kreuzer et al., 2009; Huttunen et al., 2014;
Meshram et al., 2014), individualised effects (Wenzel et al., 1993;
Middlebrooks, 1999; Iwaya and Suzuki, 2008; Fels and Vorländer, 2009), and
headphone effects (Hiipakka et al., 2012). For a review of HRTF processing
and binaural technology the reader is referred to Nicol (2010) and Xie
(2013). A visual example of measured head-related impulse responses on
the horizontal plane, and their respective ITDs, are shown in Fig. 3.1.

Figure 3.1. (a) Measured head-related impulse responses of the author in the horizontal plane for the left ear, and (b) extracted and smoothed ITD surfaces (in μs) from the measured responses as seen from above.
3.2.2 Loudspeaker-based reproduction
Amplitude panning
Panning refers to the family of techniques that distribute the same sound
signal to a number of loudspeakers and, by adjusting the level or time
differences between them, generate the auditory impression of a source in
a direction between the actual loudspeakers. This impression is termed
a virtual source, and its cause is the binaural effect of summing locali-
sation (Blauert, 1996). Panning by small time-differences between two
loudspeaker signals, time-delay panning, is seldom used in practice, as
it introduces variable image shifts at different frequencies resulting in
a blurred and unstable virtual source. It is, however, for the same rea-
son, an effective method for broadening a virtual source (Orban, 1970;
Zotter et al., 2011), known as pseudo-stereo. Amplitude panning (Blum-
lein, 1958) on the other hand is based on frequency-independent gains
applied to the loudspeaker channels. The two best known stereophonic
panning laws are the sine law (Blumlein, 1958; Bauer, 1961) and the tangent law
(Bernfeld, 1973). The tangent law has been shown to be more accurate if
the listeners turn their head towards the virtual source (Bernfeld, 1973),
and even for a fixed head orientation at low frequencies (Bennett et al.,
1985). Additionally, the tangent law seems to have a stronger physical
interpretation of sound reproduction, as it results in an intensity vector
at the sweet spot collinear with the actual one produced by a real far-field
source in the panning direction (De Sena et al., 2013). It is expressed as
$$\frac{\tan\theta}{\tan\theta_0} = \frac{g_1 - g_2}{g_1 + g_2}, \qquad \text{(3.2)}$$
where θ is the panning direction, θ0 is the loudspeaker angle from the cen-
tre line, and g1 and g2 are the resulting amplitude panning gains to be ap-
plied to the stereo pair. Extensions of stereophonic panning to consecutive
pairs of loudspeakers for horizontal surround setups are straightforward.
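The tangent law (3.2) together with an energy normalisation g1² + g2² = 1 determines the stereo gains uniquely; a minimal sketch (`tangent_law_gains` is a hypothetical helper, angles in radians):

```python
import numpy as np

def tangent_law_gains(theta, theta0):
    """Energy-normalised stereo gains [g1, g2] satisfying the tangent
    law (3.2); theta: panning angle, theta0: loudspeaker base half-angle."""
    t = np.tan(theta) / np.tan(theta0)
    g = np.array([1.0 + t, 1.0 - t])   # gives (g1 - g2)/(g1 + g2) = t
    return g / np.linalg.norm(g)
```

Panning to the centre (θ = 0) yields equal gains 1/√2, while panning onto a loudspeaker (θ = θ0) routes the full signal to that loudspeaker.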
A generalisation of the tangent law of amplitude panning for arbitrary
setups, extended to three dimensions, was introduced by Pulkki (1997),
called vector-base amplitude panning (VBAP). In the three-dimensional
case, for a panning direction γp the gains g′ = [g′1, g′2, g′3]T for a triplet of
loudspeakers with directions L123 = [γ1,γ2,γ3] are given by
$$\mathbf{g}' = \mathbf{L}_{123}^{-1}\,\boldsymbol{\gamma}_p. \qquad \text{(3.3)}$$
The panning direction is inside or on the border of the triplet if g′_i ≥ 0
for i ∈ {1, 2, 3}; otherwise, some gains are negative and the triplet is discarded.
An energy-preserving normalisation of the gains is used in practice, with
g = g′/‖g′‖, even though Laitinen et al. (2014) show that a frequency-dependent normalisation
$$\mathbf{g}(f) = \frac{\mathbf{g}'}{\left(\sum_k \left(g'_k\right)^{p(f)}\right)^{1/p(f)}} \qquad \text{(3.4)}$$
between preserving amplitude (p = 1) and preserving energy (p = 2) achieves
a more constant loudness in various room conditions.
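For a single known triplet, (3.3) with the energy normalisation reduces to a few lines; `vbap_gains` is an illustrative helper, not the author's published routine (Politis, 2015a), and the search over all triplets of a full VBAP implementation is omitted:

```python
import numpy as np

def vbap_gains(gamma_p, L123):
    """Energy-normalised VBAP gains of (3.3). L123: 3x3 matrix whose
    columns are the unit direction vectors of the loudspeaker triplet;
    gamma_p: unit panning direction. Returns None when the direction
    lies outside the triplet (some gain negative)."""
    g = np.linalg.solve(L123, gamma_p)   # g' = L123^{-1} gamma_p
    if np.any(g < -1e-12):
        return None
    return g / np.linalg.norm(g)
```

Panning exactly onto one loudspeaker of the triplet activates only that loudspeaker, consistent with the minimal-spread property discussed below.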
VBAP has proved to be a robust spatialisation tool and, along with am-
bisonic panning, remains the most practical panning approach in azimuth
and elevation. VBAP, contrary to ambisonic panning, has maximal energy
concentration and minimum perceptual spread of the virtual source, as
it uses the minimum number of loudspeakers for any direction, which is at
most three. VBAP correctly reconstructs the acoustic velocity vector of
the virtual source and maximises the normalised Makita velocity vector
(Makita, 1962) used in ambisonic and surround setup evaluation. Due to
this concentration of energy, the virtual source can appear to have a vari-
able spread or extent depending on how many loudspeakers are used and
how far apart they are. To counteract this effect for applications where
it is undesirable, Pulkki (1999) proposed the use of secondary virtual
sources around the main panning direction to maintain a constant per-
ceptual spread at the expense of sharpness. This technique is known as
multiple-direction amplitude panning (MDAP). Another variant of VBAP
that maximises the Gerzon energy vector in the panning direction was
proposed by Pernaux et al. (1998), termed VBIP. Multiple-direction pan-
ning using VBAP or VBIP has found use in the design of robust ambisonic
decoders for irregular setups (Epain et al., 2014).
Ambisonics
Since Ambisonics comprises a unified approach to the reproduction of
recorded sound scenes, it will be discussed in detail in the next section.
Some comments, though, are included on Ambisonics as a panning method
and its comparison to VBAP. The gains in ambisonic panning constitute
global solutions that consider the whole loudspeaker setup rather than
a specific triplet, as with VBAP. Due to this, panning gains tend to fol-
low a broader distribution and activate multiple loudspeakers even when
the source coincides with a loudspeaker direction. Ambisonic panning in-
herently spreads the virtual source, and the spreading is at best equal
to the maximum spreading of VBAP. For a uniformly arranged setup, the
spreading is independent of direction. Similar to MDAP, this may be de-
sirable in certain applications, at the expense of directional sharpness.
In practice, ambisonic panning requires a better understanding of the
method than VBAP to set up correctly. It is also sensitive to irregular
loudspeaker layouts without careful tuning of the panning gains, making
it less flexible than VBAP. However, Ambisonics treats panning and repro-
duction of spatial recordings in the same unified domain, and provides a
consistent solution for mixing synthetic sound scenes with recorded ones.
For more details on ambisonic panning, the reader is referred to Zotter
and Frank (2012). Differences in gain patterns between VBAP, VBIP,
MDAP, and basic ambisonic panning are presented in Fig. 3.2 for an ITU
7.0 loudspeaker configuration. Routines that compute panning gains for
all the aforementioned amplitude panning methods have been published
by the author (Politis, 2015a).

Figure 3.2. Amplitude panning patterns for a 7.0 loudspeaker setup, using (a) VBAP, (b) VBIP, (c) MDAP with a spread of 45 degrees, and (d) ambisonic panning with naive mode-matching decoding. The black dashed line indicates the energy preservation capabilities of the panning curves as the sum of the squared gains for each direction.
Wave-field synthesis
Wave-field synthesis (WFS) is a spatialisation method based on the Kirchhoff-
Helmholtz integral theorem, which states that the acoustic field inside a
source-free region can be determined completely by the values of the pres-
sure and pressure-gradient on a surface that encloses it (Berkhout et al.,
1993; Boone et al., 1995; Spors et al., 2008). By reducing the problem to
two dimensions and with a few simplifying steps, WFS approximates an
acoustic field over a large area with a dense distribution of loudspeakers
on a boundary which can be either completely surrounding or covering
only partially the area. The method derives sets of digital filters for the
loudspeaker array that can recreate plane waves and point sources over
the valid region of reconstruction.

Figure 3.3. WFS sound pressure rendering of a virtual point source (red cross) using a linear array of 21 secondary sources (black circles), with an inter-source spacing of 15 cm. Two frequencies are plotted: (a) f = 1 kHz, below the spatial aliasing limit, and (b) f = 2 kHz, above the spatial aliasing limit.

WFS, like certain variants of high-order ambisonic panning, physically
recreates the sound field, thus delivering appropriate cues to all listeners
inside that region. Complex scenes
can be created by superimposing multiple sources. Additionally, it is pos-
sible to generate spherical wavefronts with their virtual origin inside the
reproduction region, an effect known as focused source.
WFS systems have been limited to large permanent installations in ded-
icated spaces due to their extreme requirements in hardware and multi-
channel audio processing. The operating frequency range for which the
sound field reconstruction is valid is limited above by the practical spacing
of the loudspeaker distribution. For wavelengths below twice that spacing,
spatial aliasing occurs, rendering reconstruction inaccurate and perceptu-
ally incorrect (Spors and Rabenstein, 2006). In practice, high frequencies
are treated with a separate spatialisation method, such as amplitude pan-
ning. A spatial aliasing example is depicted in Fig. 3.3. Limitations of
practical WFS systems and their perceptual effects are studied and pre-
sented in detail by Wierstorf (2014).
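The aliasing limit implied by the loudspeaker spacing can be estimated directly; a sketch (`wfs_aliasing_frequency` is a hypothetical helper, c = 343 m/s assumed):

```python
def wfs_aliasing_frequency(dx, c=343.0):
    """Approximate spatial-aliasing limit (Hz) of a WFS array with
    secondary-source spacing dx (m): reconstruction degrades for
    wavelengths below about 2*dx."""
    return c / (2.0 * dx)
```

For the 15 cm spacing of Fig. 3.3 this gives roughly 1.14 kHz, consistent with the behaviour of the two plotted frequencies.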
There has been considerable research in reproducing recorded sound
scenes with WFS (de Vries and Boone, 1999; Hulsebos et al., 2002). How-
ever, recording remains challenging and WFS has been limited to spatial-
isation of synthetic sound scenes. This is the reason that WFS is consid-
ered in this thesis as mostly a sound spatialisation method rather than a
complete recording-to-reproduction approach. Parametric approaches re-
lated to this work have the potential to bridge this gap; see for example
Cobos and Lopez (2009) and Cobos et al. (2012).
3.2.3 Spatial sound recording
Conventional surround recording
Spatial sound recording relies on using a small number of microphones,
and then either reproducing the recorded signals directly on the speaker
setup or mixing them in order to provide appropriate spatial cues at re-
production for a specific setup. The first approach, including the first
stereophonic recording techniques (Blumlein, 1958), is called here con-
ventional surround recording. Surround recording can be classified into
two major approaches, coincident and spaced (Lipshitz, 1986). Coincident
techniques use directional microphones placed very closely together, and
hence they introduce minimal time differences between the two channels.
Directional cues are delivered mainly by the directivity gain differences
between the two microphones oriented towards different directions. Such
an example is the XY, or Blumlein, method which employs two figure-
of-eight (pressure-gradient) microphones oriented orthogonally. Spaced
techniques, on the other hand, with omnidirectional or directional micro-
phones that can be metres apart, rely mainly on time differences between
the two microphones. The major difference between coincident and spaced
methods is the cues they rely on at reproduction. The former achieve
mostly frequency-independent inter-channel differences, resulting in sta-
ble localisation cues. However, due to the small spacing of the capsules,
the two signals can be highly correlated, which may result in a lack of
spaciousness even for highly reverberant or diffuse sound scenes. In contrast,
spaced setups can generate ambiguous localisation cues (Lipshitz, 1986;
Pulkki, 2002) due to the frequency-dependent phase differences of the two
channels, but they may be more suitable for reproduction of ambient and
diffuse sound, which is mostly uncorrelated in the two channels (Theile,
1991). A popular example of a method that can modify inter-channel level
differences by mixing the two signals is the mid-side method (Blumlein,
1958), a coincident arrangement of a dipole microphone oriented sideways
and a cardioid looking forward (Julstrom, 1990).
With the introduction of surround systems, stereophonic recording tech-
niques were extended to multichannel ones based on similar principles
(Kassier et al., 2005). Most multichannel recording arrangements follow
the surround layout, and they try to optimise recording of the sound stage
with localisation cues only for the frontal loudspeaker triangle (Theile,
2001; Williams, 2004; Kassier et al., 2005). The recording channels for the
Figure 3.4. Main surround array designs (first row): (a) INA 5, (b) OCT surround, (c) Fukada tree; and rear surround array designs (second row): (d) IRT cross, (e) Hamasaki square, for surround music recording. The dimensions are indicative and adapted from Theile (2001).
surround speakers are treated as reverberant channels. Frontal surround
recording methods, such as ORTF, the Decca-tree or INA-3, are near-
coincident, and they rely on both microphone directionality and time-
differences to deliver cues. Other arrangements, such as INA-5, OCT-
Surround, or the Fukada-tree, incorporate frontal and surround micro-
phones in a single specification, while arrays such as the IRT-cross or the
Hamasaki square are meant to capture only diffuse sound which should
be subsequently mixed with the frontal channels. A presentation of vari-
ous arrangements and their properties is given by Theile (2001) and some
popular ones are shown in Fig. 3.4. Array design for surround recording
in this way relies strongly on the preference of the recording engineer, and
it is optimised accordingly. Perceptual cues are determined solely by the
recording configuration, suitable only for a certain loudspeaker setup and
generally hard to modify during post-processing by simple means. Percep-
tual issues of reproduction of surround recordings with the conventional
techniques described here are treated by parametric methods presented
in Politis et al. (2013a), Politis et al. (2013b) and Publication III.
Advanced sound scene recording
An alternative proposal for surround recording comes in the form of B-
format recording (Gerzon, 1975b), which consists of an omnidirectional
and three dipole components oriented orthogonally in a coincident ar-
rangement. B-format can be viewed in a few different ways. From a sound
engineering perspective, B-format is an extension of the stereo coincident
methods and of Blumlein recording, and any other coincident technique
can be generated using the B-format signals. From an Ambisonics point
of view, it constitutes the first-order approximation of the spatial sound
field around the recording point (see also Sec. 2.1.5). From an acoustical
point of view, B-format captures the acoustic pressure and acoustic parti-
cle velocity at the recording point, as shown in (2.64). This information is
exploited in the parametric methods developed in this thesis.
Since it is impossible to position an omnidirectional and three dipole
capsules at the same point in space, the B-format is generated by spheri-
cal arrays of either directional or omnidirectional microphones, the best-
known example being the Soundfield microphones (Farrar, 1979). Sound-
field microphones are based on tetrahedral arrangements of directional
microphones, usually sub-cardioids or cardioids (Faller and Kolundžija,
2009). The microphone signals from a tetrahedral array used to generate
the B-format is known in ambisonic literature as A-format. Additionally,
it is possible to generate B-format signals from arbitrary arrays using the
least-squares solution of (2.43).
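As an illustration of the tetrahedral A-format to B-format relation described above, the following sketch applies the classic idealised conversion matrix. The capsule ordering, the W scaling, and the omission of the frequency-dependent correction filters used in actual Soundfield processing are simplifying assumptions for this example, not details taken from the thesis.

```python
import numpy as np

# Idealised, frequency-independent A-to-B conversion for a tetrahedral
# array with capsules pointing left-front-up (LFU), right-front-down
# (RFD), left-back-down (LBD) and right-back-up (RBU). Real Soundfield
# processing additionally compensates the finite capsule spacing with
# filters, which are omitted in this sketch.
A_TO_B = 0.5 * np.array([[1,  1,  1,  1],   # W: omnidirectional (pressure)
                         [1,  1, -1, -1],   # X: front-back dipole
                         [1, -1,  1, -1],   # Y: left-right dipole
                         [1, -1, -1,  1]])  # Z: up-down dipole

def a_to_b(a_format):
    """a_format: (4, nsamples) capsule signals in LFU, RFD, LBD, RBU order.
    Returns (4, nsamples) B-format signals in W, X, Y, Z order."""
    return A_TO_B @ np.asarray(a_format)
```

Each B-format component is a signed sum of the capsules lying on the positive and negative side of the corresponding axis, which is why the matrix contains only ±1 entries up to a scale.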
Recently, the concepts of higher-order microphones for acoustic analy-
sis and scene recording have been studied actively. Higher-order micro-
phones follow the principles of capturing the sound field coefficients, as de-
scribed in Sec. 2.1.5. Instead of capturing only the first-order components
of the scene, or the B-format, the processing of the microphone signals
performs the SHT and delivers the higher-order components. Arbitrary
arrays (Laborie et al., 2003; Poletti, 2005; Teutsch, 2007) can be used for
this task, but normally spherical ones are preferred due to their conve-
nient properties. Recording with higher-order microphones offers greater
flexibility and improved performance during reproduction for both non-
parametric and parametric methods. Some of the properties of higher-
order recording and their use will be mentioned in the following sec-
tions. Implementation examples and additional details can be found in
the works of Meyer (2001), Laborie et al. (2003), Meyer and Elko (2004),
Li and Duraiswami (2005), Moreau et al. (2006), Teutsch (2007), Epain
and Daniel (2008), Melchior et al. (2009), Jin et al. (2014), Lecomte et al.
(2015), and Bernschütz (2016).
3.3 Evaluation of spatial sound reproduction
3.3.1 Technical evaluation
Technical and instrumental evaluation of spatial sound attributes in sound
reproduction is lacking in generality, since the target of spatial sound is
the creation of an impression in the listener. Hence, as a strictly
perceptual effect, it depends on various subjective factors that can only be
isolated and studied appropriately in controlled psychoacoustical tests.
However, technical descriptors of some perceptual attributes are desired
for practical purposes, so that rough estimates of the performance of a
spatial sound technology can be drawn without lengthy and costly listen-
ing tests.
The best-known technical measures for multichannel systems are lim-
ited to localisation attributes, with additional ones having potential links
to perceived source extent and localisation ambiguity. All of these mea-
sures seem to be related to the complex intensity vector (Daniel et al.,
1999; Poletti, 2000; Pulkki, 2007; De Sena et al., 2013; Franck et al.,
2015), with the active part generally taken as indicating the opposite
of perceived direction and the imaginary part contributing to localisa-
tion ambiguity, or phasiness (Gerzon, 1992a; Daniel et al., 1999; Poletti,
2000). The Makita velocity vector, a normalised acoustic velocity-related
quantity, is also proportional to the intensity vector when all loudspeaker
signals are correlated (Merimaa, 2007), as in amplitude panning. Ger-
zon (1992a), observing that source summation at the ears becomes mostly
incoherent at higher frequencies, introduced the Gerzon energy vector,
which is equivalent to the intensity vector when the source signals are
uncorrelated (Merimaa, 2007). The magnitude of the energy vector is also
believed to correspond to a perceived angular spread of a sound source
(Frank, 2013b). Both vectors are fundamental in ambisonic research, and
they constitute the guidelines for designing ambisonic decoders and pan-
ners (Gerzon, 1992b; Heller et al., 2008; Zotter et al., 2012; Zotter and
Frank, 2012; Epain et al., 2014). Parametric methods such as Directional
Audio Coding (DirAC) (Pulkki, 2007), rely on the assumption of percep-
tual relevance of the intensity vector and diffuseness, while others rely
on the velocity-energy vectors (Goodwin and Jot, 2008). The diffuseness
of (2.26) is connected to phasiness, as well as to the ASW and IC. Note that,
even though these measures are used for evaluation and design of spatial
sound reproduction methods, their psychoacoustical validation is lacking
at present. An exception is the work of Frank (2014, 2013b,a) on ampli-
tude panning and source extent perception, and recent work by Stitt et al.
(2016).
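The velocity and energy vectors discussed above have simple closed forms, rV = Σᵢ gᵢ uᵢ / Σᵢ gᵢ and rE = Σᵢ gᵢ² uᵢ / Σᵢ gᵢ², which the following sketch computes for a horizontal loudspeaker layout. The function and variable names are illustrative, not taken from the thesis or its publications.

```python
import numpy as np

def gerzon_vectors(gains, azimuths_deg):
    """Makita velocity vector rV and Gerzon energy vector rE for a
    horizontal loudspeaker layout.
    gains:        per-loudspeaker amplitude gains.
    azimuths_deg: loudspeaker azimuths in degrees (0 = front)."""
    g = np.asarray(gains, float)
    az = np.radians(np.asarray(azimuths_deg, float))
    u = np.stack([np.cos(az), np.sin(az)], axis=-1)   # unit direction vectors
    rV = (g[:, None] * u).sum(axis=0) / g.sum()        # amplitude-weighted
    rE = (g[:, None]**2 * u).sum(axis=0) / (g**2).sum()  # energy-weighted
    return rV, rE
```

For a centred phantom source between two loudspeakers at ±30° with equal gains, both vectors point straight ahead with magnitude cos 30° ≈ 0.87, illustrating the spread predicted for stereophonic panning.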
An alternative to simple physical metrics such as the above is evalu-
ation through auditory modelling. Here binaural signals are generated
for the method under study, and they are subsequently fed to a computa-
tional auditory model. This approach was introduced in studies on percep-
tion of virtual sources (Macpherson, 1989; Mac Cabe and Furlong, 1994;
Pulkki et al., 1999) and has been recently extended to complex spatial
sound scenes by Takanen and Lorho (2012), Takanen et al. (2013, 2014),
and Härmä et al. (2014). Most auditory models consider only binaural
non-individualised cues, and hence they are limited to evaluation of hori-
zontal reproduction systems. A model incorporating monaural cues from
individualised HRTFs for elevation perception, with applications to eval-
uation of 3D sound systems, has been proposed by Baumgartner et al.
(2014).
3.3.2 Listening tests
When simple instrumental or computational evaluation is inadequate, lis-
tening tests should be performed. For an overview on listening test de-
sign for perceptual audio evaluation, the reader is referred to Bech and
Zacharov (2007). Usually aspects of spatial sound are tested with care-
fully designed psychoacoustical experiments in highly controlled
conditions and with stimuli that isolate the studied percept from other factors.
For the evaluation of the parametric reproduction methods developed
in this work, listening test design brings up two difficult issues. First,
since the reproduction of natural scenes is to be studied, the stimuli are
inherently complex with spatial, temporal, and spectral cues occurring si-
multaneously, containing also semantic and other higher-level cues that
occur in every natural listening situation. Hence, evaluating how well
one method performs compared to another is studied as an audio coding
problem; different methods are considered as codecs, and they are com-
pared by estimating how well they manage to reconstruct perceptually
the original sound scene, the reference. For the results presented in Pub-
lication III and Publication V, mean opinion score (MOS) listening tests
were designed, based on the multi-stimulus with hidden reference and an-
chor (MUSHRA) specification (International Telecommunications Union
(ITU), 2001).
The second difficulty arises from the fact that, contrary to audio codecs or
upmixing systems, with a recorded sound scene there is no true reference
that can be repeated. Due to the lack of established tests for reproduction
of recorded spatial sound, a new approach was followed based on gener-
ation of complex synthetic reference scenes. The motivation behind this
approach is based on the following assumptions:
a) It is adequate to use as a reference synthetically generated sound
scenes that include multiple sources and natural reverberation based
on physical modelling. The choice of the modelling method can be
based on geometrical or wave-based numerical acoustics and it is
not critical, as long as it is able to generate a perceptually natural
sounding result without audible artefacts.
b) The modelled scene should be discretised and distributed to an ar-
ray of loudspeakers, including the full reverberation with its spa-
tial distribution. The discretisation should be done in such a way
as to exclude any kind of reproduction method such as Ambisonics
or VBAP. It is not an issue if the exact acoustics of the model are
modified by the discretisation process, since the aim is to create a
physically-inspired natural sounding reference and not to auralise
the properties of the model.
c) The reference can be captured by the preferred recording system, ei-
ther through simulations or by actual recordings inside the listening
room. In the case of an actual recording, the room should be close to
anechoic to have a negligible effect on the result.
d) The recorded references can serve as input signals to the reproduc-
tion methods that are to be tested. Re-rendering them to the full
loudspeaker setup or a subset of it permits direct comparison be-
tween the reference and the renderings.
Using this approach, acoustic scenarios of arbitrary complexity can be
generated and evaluated. For the listening tests of Publication III and
Publication V artificial reverberation was introduced using the image source
method (Allen and Berkley, 1979; Peterson, 1986) with natural-sounding
simulation parameters. Direct sound components, early reflections
and late reverberation were distributed to their closest loudspeakers of
the listening room setup. The process is depicted in Fig. 3.5. The same
procedure has been used to evaluate parametric spatial sound reproduc-
tion by Vilkamo et al. (2009) and Laitinen et al. (2011).
Figure 3.5. Generation of the reference and rendering through room simulation for the evaluation of perceived distance from the reference. (Block diagram: room & source simulator; reference reproduction; simulated or real recording; rendering method; comparison.)
4. Non-parametric Methods for Spatial Sound Recording and Reproduction
This section highlights the recent trends in reproducing recorded sound
scenes with a microphone array through multiple loudspeakers or head-
phones, in a flexible manner and with high perceptual quality. The vari-
ous methods are categorised into two large families: parametric and non-
parametric.
Non-parametric approaches are determined solely by the recording ar-
ray properties and the properties of the reproduction setup. Signal-inde-
pendent filters or mixing gains are derived based on certain objective cri-
teria that need to be met in order to provide effective spatial cues during
reproduction. These static filters or mixing matrices are applied to the
input signals directly to generate the output signals. In the general case,
non-parametric methods can be expressed by the simple relation
sL(ω) = Mrep(ω)xQ(ω), (4.1)
where sL = [s1, ..., sL]T is the vector of the signals for the L output chan-
nels, xQ = [x1, ..., xQ]T is the vector of input signals from an array of Q
microphones, and Mrep is the L×Q mixing matrix from the inputs to the
outputs that is derived according to the application and approach. The
mixing matrix Mrep can be frequency-dependent, hence defining a ma-
trix of filters. The process is presented schematically in Fig. 4.1. Of all
non-parametric methods, Ambisonics is detailed more extensively in this
category due to its powerful formulation in the spatial transform domain
and its separation of the above mixing into an encoding part related to
recording and a decoding part related to reproduction.
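As a minimal sketch of relation (4.1), the mixing can be applied independently per frequency bin of a filter-bank representation. The array shapes and names below are illustrative assumptions, not an implementation from the thesis.

```python
import numpy as np

def render_nonparametric(X, M):
    """Apply a frequency-dependent mixing matrix to array signals,
    s_L(w) = M_rep(w) x_Q(w), one matrix product per frequency bin.
    X: (nbins, Q)    array signal spectra for one STFT frame.
    M: (nbins, L, Q) mixing matrices, one per frequency bin.
    Returns (nbins, L) output channel spectra."""
    return np.einsum('flq,fq->fl', M, X)
```

The einsum contraction is equivalent to looping over bins and multiplying each L×Q matrix with the corresponding Q-dimensional input vector; a frequency-independent method simply uses the same matrix in every bin.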
4.1 Ambisonics
Ambisonics was inspired by pioneering work on hierarchical and gen-
eralised recording and panning for quadraphonic systems (Cooper and
Figure 4.1. General non-parametric mapping of the microphone array signals xQ to the output signals sL of the reproduction setup through the filter matrix Mrep. The bold arrows indicate signal flow while the dashed ones indicate other parameters (array and target setup information).
Shiga, 1972; Gibson et al., 1972) and fully conceptualised and introduced
by Gerzon (1973, 1975a) in the 1970s. Ambisonic techniques rely heavily on the
concepts of the spectral representation of the spatial sound distribution
a(γ) in the scene, the SHT, and the order N of the respective sound field
coefficients aN , introduced in Sec. 2.1.5. In ambisonic literature, the sound
field coefficients aN are alternatively termed ambisonic signals, and N the
ambisonic order. Traditionally, due to practical limitations, Ambisonics was
introduced for processing of first-order sound scene coefficients a1, or the
B-format signals (see (2.64)). More recently, ambisonic processing has
been extended to higher orders, with practical realisations at present lim-
ited to between third and seventh order. We make no distinction here be-
tween first-order and higher-order Ambisonics, as the presentation below
is valid for both cases.
As mentioned in Sec. 2.1.5, Ambisonics refers to a collection of tech-
niques for
(a) recording real sound scenes, or encoding synthetic ones,
(b) manipulating these sound scenes spatially, and
(c) reproducing or decoding them to various loudspeaker setups or head-
phones,
with the common element that all operations are defined in the spatial
transform domain outlined in Sec. 2.1.5, the SHD.
Ambisonics divides the mapping from the input signals to the output
signals of (4.1) into two stages, the encoding and the decoding stage, as
shown in Fig. 4.2. This separation enables flexible and scalable imple-
mentations. For example, recordings from various microphone arrays can
be encoded into the ambisonic signals through the mathematical tools of
(2.58–2.59). Additionally, virtual sources can be encoded in a new set of
ambisonic signals of the same order through ambisonic encoding. The two
sound scenes, the recorded and synthetic, can be mixed by simple summa-
tion of the signals without regard to the recording setup. The final sound
scene can then be reproduced to a target system by means of an adequate
mixing matrix, or ambisonic decoder. This decoding matrix is designed ac-
counting for the reproduction setup properties and the order of the input
ambisonic signals.
Figure 4.2. Schematic of ambisonic encoding and decoding: the array signals xQ are encoded by the encoding matrix Menc into ambisonic signals aN, transmitted and possibly manipulated, and finally decoded by the decoding matrix Mdec into the output signals sL. The bold arrows indicate signal flow, while the dashed ones indicate other parameters (array and target setup information).
4.1.1 Recording and encoding
Ambisonic recording is no different from obtaining the sound field coeffi-
cients for general spherical acoustic processing, as detailed in Sec. 2.1.5
and Sec. 3.2.3. First-order ambisonic systems record in the B-format con-
vention, commonly employing tetrahedral microphone arrays of directional
capsules known as Soundfield microphones (Gerzon, 1975b). Higher-order
microphones, with only a few available commercially at present, follow
similar construction guidelines, replacing, however, the directional capsules
with omnidirectional ones on a spherical rigid baffle. The maximum order
that the array can deliver is determined by the number of microphones
and their arrangement; for Q nearly uniformly distributed microphones
N ≤ ⌊√Q − 1⌋, with ⌊·⌋ denoting the integer floor.
The encoding matrix Menc from the microphone signals to the ambisonic
signals
aN(ω) = Menc(ω) xQ(ω) (4.2)
can be derived in terms of the least-squares solution of (2.43) as (Moreau
et al., 2006)
Menc = YN C H^H (H C H^H + λ² IQ)^(−1), (4.3)
where YN = [yN (γ1), ...,yN (γK)] is the matrix of real spherical harmon-
ics up to order N for a dense grid of K directions covering the sphere.
The matrix H = [h(γ1), ...,h(γK)] is the matrix of array steering vectors
in the same directions, which can be known either through an analyti-
cal model of the array response or through measurements. A weighted
least-squares formulation is defined in (4.3), where the K × K diagonal
matrix of weights [C]kk = ck is used to appropriately weight the sampling
directions in case they are not uniformly distributed. As mentioned in
Sec. 2.1.5, due to vanishing higher-order components at low frequencies,
the inversion should be limited to avoid excess amplification. This can
be done through the regularisation value λ which limits the filters to a
maximum amplification gdB as
λ = 1 / (2 · 10^(gdB/20)). (4.4)
Note that the same inversion can be defined equivalently in the SHD (Jin
et al., 2014).
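A direct transcription of (4.3) and (4.4) into code could look as follows. The matrix shapes and grid weighting follow the definitions above, but the function itself is an illustrative sketch: a practical implementation would evaluate frequency-dependent steering vectors H(ω) and use a numerically safer solve than an explicit inverse.

```python
import numpy as np

def encoding_matrix(H, Y, c, g_dB):
    """Regularised weighted least-squares ambisonic encoder after (4.3)-(4.4).
    H:    (Q, K) array steering vectors at K grid directions (one frequency).
    Y:    ((N+1)^2, K) real spherical harmonics at the same directions.
    c:    (K,) quadrature weights for the grid directions.
    g_dB: maximum allowed filter amplification in dB."""
    C = np.diag(np.asarray(c, float))
    lam = 1.0 / (2.0 * 10.0**(g_dB / 20.0))        # regularisation (4.4)
    Q = H.shape[0]
    A = H @ C @ H.conj().T + lam**2 * np.eye(Q)    # regularised Gram matrix
    return Y @ C @ H.conj().T @ np.linalg.inv(A)   # M_enc of (4.3)
```

With a generous gain limit the regularisation term vanishes and the encoder approaches the plain weighted least-squares solution; lowering g_dB progressively suppresses the poorly conditioned (typically low-frequency, high-order) components.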
Encoding of plane waves or point sources for synthetic sound scenes can
be easily performed. The ambisonic signals up to order N , for a plane
wave incident from γ0 carrying a signal s and following the conventions
presented in Sec. 2.1.5, are given by1
aN (t) = s(t)yN (γ0), (4.5)
so that multiple signals for K sources can be encoded as
aN(t) = Σ_{k=1}^{K} sk(t) yN(γk) = YN sK(t), (4.6)
with sK = [s1, ..., sK]^T, the vector of source signals. Expressions (4.5–4.6)
are expressed in the time domain to highlight that encoding is frequency-
independent in this case. Relations are also readily available for point
sources at some distance from the origin, even though in this case encod-
ing involves frequency-dependent functions and can be realised with a set
of filters (Daniel, 2003).
1 Here we follow the real SH convention commonly used in Ambisonics; for complex SHs the vector yN should be conjugated.
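Frequency-independent encoding of the form (4.6) reduces to a single matrix product per block of samples. The sketch below uses first-order real SHs in the ACN/N3D convention as an assumed example; the thesis defines its own conventions in Sec. 2.1.5, which may differ in ordering and normalisation.

```python
import numpy as np

def encode_first_order(signals, azimuths, elevations):
    """Frequency-independent first-order encoding, a_N(t) = Y_N s_K(t).
    Real SHs in the ACN/N3D convention (W, Y, Z, X) are assumed here.
    signals: (K, nsamples) source signals; angles in radians.
    Returns (4, nsamples) ambisonic signals."""
    az = np.asarray(azimuths, float)
    el = np.asarray(elevations, float)
    Y = np.stack([np.ones_like(az),                       # W (order 0)
                  np.sqrt(3) * np.sin(az) * np.cos(el),   # Y dipole
                  np.sqrt(3) * np.sin(el),                # Z dipole
                  np.sqrt(3) * np.cos(az) * np.cos(el)])  # X dipole
    return Y @ np.asarray(signals)                        # (4, K) @ (K, n)
```

Mixing a synthetic source into a recorded scene then amounts to summing the two sets of ambisonic signals, as described in the text.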
4.1.2 Sound-scene manipulation
Expressing the sound scene on the spatial-spectral representation of the
SHD allows for a series of useful spatial transformations of the scene di-
rectly on the coefficients. The most significant examples are
• arbitrary rotations of the scene (Zotter, 2009a),
• mirroring of the scene across any plane,
• warping the sound distribution towards regions of angular compres-
sion or expansion (Pomberger and Zotter, 2011; Kronlachner and
Zotter, 2014),
• directional smoothing of a scene, and
• directional amplitude manipulation (Kronlachner and Zotter, 2014).
All operations can be defined through transformation matrices Mtr as
aN (t) = MtraN (t), (4.7)
where aN are the new set of coefficients for the transformed sound scene.
Some operations (like rotation, mirroring, or smoothing) preserve the or-
der of the signals, while others (such as warping or amplitude manipu-
lation) can result in a higher order of coefficients than the original one.
Some of these transformations are depicted in Fig. 4.3.
Of these operations, rotation has been the most important due to its
efficiency in rotating the sound scene in ambisonic reproduction for head-
phones when combined with head-tracking. Rotation of the sound scene
in the SHD can be readily performed with rotation matrices (Driscoll and
Healy, 1994; Choi et al., 1999). Note that rotation of B-format signals
can be done with standard Cartesian rotation matrices. Mirroring simi-
larly exploits symmetries of SHs, and it reduces to inverting the polarity
of certain ambisonic signals. Warping is a more complicated operation,
and originally defined for first-order signals under the name dominance
(Gerzon and Barton, 1992). For a practical derivation of warping matri-
ces for higher-orders see Kronlachner and Zotter (2014). The directional
smoothing of sound scenes refers essentially to spherical convolution with
a spherical smoothing kernel, an operation that simplifies considerably in
the SHD (Driscoll and Healy, 1994). Finally, directional amplitude manip-
ulation, or applying a directional envelope on the sound field and obtain-
ing the ambisonic signals for the resulting field, requires the use of cou-
pling coefficients between SHs, known as Gaunt coefficients (Sébilleau,
Figure 4.3. Depiction of major ambisonic transformations of a sound scene: (a) original, (b) rotation, (c) warping, (d) amplitude manipulation. In (c) directional compression is shown with yellow and expansion with light blue. In (d) the modified boundary curve shows the directional energy modification compared to the original sound scene.
1998). Their use has a special significance on the method of Publication V
for the acoustic analysis of the sound scene on angular sectors. Routines
for the computation of Gaunt coefficients are contributed by the author
(Politis, 2015d). Amplitude manipulation of the sound field results in a
higher-order representation than the non-modified original. A practical
approach to such modification that avoids the use of Gaunt coefficients is
presented by Kronlachner and Zotter (2014).
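As an example of the scene transformations (4.7), rotation of the B-format dipole channels with a standard Cartesian rotation matrix, as noted above, can be sketched as follows. The channel ordering and the sign convention of the rotation are assumptions for this example.

```python
import numpy as np

def rotate_bformat_yaw(b, yaw):
    """Rotate a B-format scene about the vertical axis by `yaw` radians by
    applying a standard 3x3 Cartesian rotation to the dipole channels;
    the omnidirectional channel W is rotation-invariant.
    b: (4, nsamples) B-format signals in W, X, Y, Z order (assumed)."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])   # rotation of (X, Y, Z) about Z
    out = np.array(b, float, copy=True)
    out[1:4] = R @ out[1:4]
    return out
```

Higher-order signals require the full SH rotation matrices of Driscoll and Healy (1994) and Choi et al. (1999); only the first-order dipoles transform like Cartesian vectors.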
4.1.3 Decoding and panning
Reproduction of ambisonic signals to a target setup is performed through
an L × (N+1)² decoding matrix Mdec as
sL(t) = MdecaN (t). (4.8)
Usually ambisonic decoding uses frequency-independent mixing matri-
ces. Some frequency-dependency in first-order ambisonic decoding is com-
monly introduced with either shelf-filters and a single decoding matrix or
two frequency ranges with one decoding matrix for each (Heller et al.,
2008). The aim of these designs is to preserve constant amplitude at low
frequencies and optimise the Makita velocity vector, while preserving
energy at high frequencies and optimising the Gerzon energy vector. Recent
higher-order decoding designs drop the frequency-dependency and focus
primarily on optimisation of the energy vector. An overview of traditional
and recent decoder designs can be found in Zotter et al. (2012) and Zotter
and Frank (2012). Decoding to irregular speaker layouts or partial ones
can also be challenging. Recent approaches involve a combination of am-
bisonic decoding and VBAP derived gains (Batke and Keiler, 2010; Zotter
and Frank, 2012; Epain et al., 2014), modified SH basis functions (Zotter
et al., 2012), and iterative numerical optimization of the decoding gains
to achieve some target performance (Wiggins, 2007; Moore and Wakefield,
2007; Scaini and Arteaga, 2014). Example routines for the derivation of
decoding matrices with some of the above techniques are contributed by
the author (Politis, 2015c).
Ambisonic panning is essentially the combination of the decoding (4.8)
and encoding stage (4.5):
sL(t) = MdecyN (γ0)s(t). (4.9)
Panning of a virtual point source that recreates a spherical wave from its
virtual origin is also possible, similar to wave-field synthesis. Contrary
to the above plane wave formulations, its realisation treats both the vir-
tual source and the loudspeaker setup as point sources at certain radii
and expands them in the SHD. This approach has been termed near-field
compensation, and it requires careful design of the corresponding filters
due to the extreme amplification for higher-orders and low frequencies
dictated by theory (Daniel, 2003).
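To make the decoding and panning relations (4.8)–(4.9) concrete, the sketch below builds a simple "sampling" decoder for a uniform first-order layout and evaluates the panning gains for a source direction. This is one elementary decoder design under an assumed ACN/N3D convention, not one of the optimised designs cited above.

```python
import numpy as np

def y_first_order(az, el):
    """Real first-order SHs (ACN/N3D, an assumed convention) for directions
    in radians; returns a (4, K) matrix of SH values."""
    az, el = np.asarray(az, float), np.asarray(el, float)
    return np.stack([np.ones_like(az),
                     np.sqrt(3) * np.sin(az) * np.cos(el),
                     np.sqrt(3) * np.sin(el),
                     np.sqrt(3) * np.cos(az) * np.cos(el)])

def sampling_decoder(spk_az, spk_el):
    """'Sampling' decoder for a (near-)uniform layout: each loudspeaker
    samples the scene in its own direction, M_dec = Y_spk^T / L
    (the overall scaling differs between formulations)."""
    Y_spk = y_first_order(spk_az, spk_el)
    return Y_spk.T / Y_spk.shape[1]

def ambisonic_pan_gains(M_dec, src_az, src_el):
    """Ambisonic panning gains of (4.9): g = M_dec y_N(gamma_0)."""
    return M_dec @ y_first_order([src_az], [src_el])[:, 0]
```

For a source direction coinciding with one loudspeaker of a uniform layout, that loudspeaker receives the largest gain, while distant loudspeakers receive small (and, at low orders, slightly negative) gains, illustrating the directional spreading discussed in Sec. 4.1.5.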
4.1.4 Binaural decoding
Ambisonic reproduction for headphones can be formulated very efficiently
in the SHD through Parseval’s theorem in (2.57). Assuming that the
HRTFs of the listener are given by hbin(γ, ω) = [hl(γ, ω), hr(γ, ω)]T, and
assuming real SHs, the binaural signals sbin = [sl, sr]T are given by
sbin(ω) = ∫ hbin(γ, ω) a(γ, ω) dγ = Hbin^T(ω) aN(ω), (4.10)
where Hbin = [SHT{hl(γ)}, SHT{hr(γ)}] = [hN^(l), hN^(r)] is the matrix of the
HRTFs expanded into SH coefficients. Hence the decoding matrix for am-
bisonic binaural reproduction is given directly by the expansion of the
HRTFs into SHs, and the order of the expansion is limited by the order of
the available ambisonic signals2. Binaural rendering of ambisonic signals
has been studied by Noisternig et al. (2003), Duraiswami et al. (2005),
Atkins (2011), Melchior et al. (2009), Shabtai and Rafaely (2014) and
Bernschütz (2016). Since the expanded HRTFs are truncated to the lower
order of the ambisonic signals, perceptible colouration occurs at high fre-
quencies. An approximate mean correction to this effect is proposed by
Sheaffer et al. (2014).
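Relation (4.10) amounts to one small matrix product per frequency bin once the HRTFs are available as SH coefficients. A minimal sketch follows; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def binaural_decode(A, H_sh):
    """Ambisonic binaural rendering following (4.10), one matrix product
    per frequency bin: s_bin(w) = H_bin^T(w) a_N(w).
    A:    (nbins, (N+1)^2) ambisonic signal spectra.
    H_sh: (nbins, (N+1)^2, 2) SH-expanded HRTFs; [..., 0] left, [..., 1] right.
    Returns (nbins, 2) left/right ear spectra."""
    return np.einsum('fnc,fn->fc', H_sh, A)
```

Because the HRTF coefficients are fixed per listener, the whole decoder reduces to 2·(N+1)² static filters, which is what makes SH-domain binaural rendering attractive for head-tracked reproduction combined with the rotation matrices of Sec. 4.1.2.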
4.1.5 Discussion
Ambisonics is a significant leap towards many of the goals of a modern
reproduction method outlined in the introduction of this thesis. However,
it poses a number of limitations. As was mentioned before, recording of
the ambisonic signals is limited to the first few orders. A low-order rep-
resentation of a fully directional sound, such as a plane wave or a point
source, will be directionally spread and its binaural cues will deviate from
the actual ones, especially at high frequencies.
Similarly, reproduction of diffuse sound scenes can suffer from high cor-
relation between the output channels for the same reason; a low-order de-
coding is incapable of achieving the incoherent properties of a diffuse field.
Perceptual effects of these properties of ambisonic reproduction can be a
loss of brightness, increased localisation blur, loss of envelopment, in-head
localisation effects, and strong comb-filtering effects around the sweet
spot. Many of these effects were recognised in early ambisonic research
2 We assume here compact recording arrays generally encountered in practice; in this case the order of the captured sound field will be smaller than the expansion order of the HRTFs, which is determined by the number of HRTF measurements.
and were collectively put under the term phasiness (Gerzon, 1975c).
The task of a well-designed ambisonic decoder is to minimise these ef-
fects as much as possible. However, there are inherent limits on how much
the decoder design can alleviate the above problems, and these limits are
set by the order of the ambisonic signals. Higher-order processing natu-
rally improves on these issues as spreading of virtual sources is compacted
and correlation between output channels in complex scenes is decreased.
Perceptual effects of the order of ambisonic reproduction with virtual or
recorded sources are analysed by Braun and Frank (2011), Kearney et al.
(2012), Bertet et al. (2013), Avni et al. (2013), Stitt et al. (2014), Yang and
Xie (2014), and Bernschütz et al. (2014). Spectral effects of using more
loudspeakers than the minimum that the available ambisonic order dic-
tates are discussed by Solvang (2008) and Benjamin and Heller (2014),
while positive perceptual effects of reverberation on low-order reproduc-
tion are presented by Santala et al. (2009).
Even if direct decoding of ambisonic signals at low-orders can be inad-
equate for transparent reproduction of sound scenes, it can serve as a
very efficient and compact format for capturing and mixing them. This
is mainly due to the detachment of the format from the recording pro-
cess and the unified acoustical processing that can be performed on the
sound field coefficients described in Sec. 2.1.5. We believe that one of the
outcomes of this thesis is that low-order ambisonic reproduction can be
the basis for the more elaborate parametric methods detailed herein, and
that this approach can achieve close to transparent reproduction using the
same low-order ambisonic signals as input. The methods of Publication
I, Publication II and Publication V start from an ambisonic-like decod-
ing of B-format or higher-order signals, and they enhance their reproduc-
tion through a parametric perceptually-motivated analysis. Parametric
methods, however, come at an increased computational cost
compared to the simple matrixing operation of ambisonic decoding.
4.2 Other non-parametric approaches
Non-parametric methods that mix input signals, captured with directional
patterns or with different propagation delays, to the output channels
correspond invariably to some form of beamforming, one beamformer per
loudspeaker; this includes Ambisonics. Based on this view, beamformers that satisfy criteria other
than the decoding principles of Ambisonics can be used for multichannel
reproduction of array recordings. One approach is to generate a set of
beamformers covering the region of interest and then to pan their sig-
nals as virtual sources from the beamforming directions at reproduction.
The method by Farina et al. (2013) does that using a set of higher-order
cardioid beamformers up to the maximum allowed order of the array, cov-
ering uniformly the sphere. A similar method is proposed in the non-
parametric part of the work by Hur et al. (2011), posed as a synthetic
array reconfiguration problem.
A second approach is to approximate the panning curves for the tar-
get setup with beamforming, obtaining directly the output signals. This
approach is conceptualised by Backman (2003) and is formulated for a
linear microphone array by Abel et al. (2010). Similarly, Ono et al. (2013)
attempt beamforming to obtain the signals for the standardised NHK
22.2 setup. For arbitrary arrays, such beam patterns can be generated
through the least squares solution (2.43).
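As a sketch of this panning-curve approach, the following derives least-squares beamforming weights whose directional response approximates a target panning curve over a grid of directions. All array parameters below are invented for illustration, the free-field steering model stands in for measured array responses, and the clipped-cosine target is only a stand-in for a real panning law:

```python
import numpy as np

# All array parameters below are invented for illustration.
c = 343.0                                  # speed of sound (m/s)
f = 1000.0                                 # analysis frequency (Hz)
k = 2 * np.pi * f / c                      # wavenumber
R = 0.05                                   # radius of a 4-mic circular array (m)
mic_az = np.arange(4) * np.pi / 2          # microphone azimuths (rad)

# Dense grid of candidate plane-wave directions in the horizontal plane.
grid_az = np.linspace(0, 2 * np.pi, 72, endpoint=False)

# Free-field steering matrix A: one row per grid direction, one column per mic.
delays = R * np.cos(grid_az[:, None] - mic_az[None, :])
A = np.exp(1j * k * delays)

# Target panning curve of one loudspeaker at 45 deg: a clipped cosine lobe
# (an illustrative stand-in for a real panning law such as VBAP).
spk_az = np.pi / 4
g = np.cos(grid_az - spk_az)
g[g < np.cos(np.pi / 4)] = 0.0

# Least-squares beamforming weights whose directional response
# approximates the panning curve: w = argmin ||A w - g||^2.
w, *_ = np.linalg.lstsq(A, g.astype(complex), rcond=None)

response = A @ w                           # achieved directional response
err = np.linalg.norm(response - g) / np.linalg.norm(g)
```

The relative error `err` indicates how well this low-order array can match the panning curve; as the text notes, narrow or discontinuous panning curves are poorly approximated by arrays with few microphones.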
Panning curves can be discontinuous and asymmetrical for irregular se-
tups and of very narrow directionality for dense ones, and hence they
can be poorly approximated by low-order beam patterns. The work by
Hacihabiboglu and Cvetkovic (2009) proposes a set of small differential
arrays to achieve the higher-order directivity that is required. This ap-
proach is generalised further into a framework for deriving beamformers
that reconstruct optimally the acoustic intensity in the reproduction re-
gion (De Sena et al., 2013).
Binaural reproduction using non-parametric methods is also possible
by forming a pair of beamformers that mimic the HRTFs (Chen et al.,
1992; Algazi et al., 2004; Li and Duraiswami, 2006; Sakamoto et al., 2010;
Rasumow et al., 2016). Unless hundreds of microphones are used, as
by Sakamoto et al. (2010), the high-frequency spatial variability of the
HRTFs cannot be approximated fully by small compact arrays, affecting
mostly high-frequency and individualised cues, such as ILDs and spectral
or pinna cues. These performance issues of binaural beamforming are
of the same nature as low-order effects in ambisonic reproduction over
headphones.
5. Parametric Methods for Spatial Sound Recording and Reproduction
Parametric approaches, contrary to the non-parametric methods men-
tioned above, apart from the array properties and the properties of the
reproduction system, also take into account the statistics or dependencies
of the recorded signals themselves. By making certain assumptions on
how the sound scene is composed, they are able to overcome many of the
limitations of the non-parametric approaches. In the sections below, we
discuss the relation of parametric methods to spatial audio coding (SAC)
and upmixing methods, and then we give a summary of recent trends in
the reproduction of natural sound scenes obtained from a microphone ar-
ray. The DirAC method is presented in more detail as it constitutes the
backbone of a large part of this thesis. Finally, applications of parametric
methods to auralisation of room acoustics are discussed.
5.1 Spatial audio coding and upmixing
5.1.1 Spatial audio coding
SAC refers to the methods that aim to compress multichannel recordings
using a parametric approach in order to reduce bandwidth and process-
ing requirements. A SAC system should be able to recreate the multichannel
content without loss of perceived quality, ideally with no perceptual
differences at all (transparency). In contrast to SAC, upmixing
refers to decomposing the audio channels of a multichannel mix in order to
redistribute them to a different target system; a classic example is stereo-
to-surround upmixing, or stereo-to-binaural conversion. Even though at
first glance the aims are different, both SAC and upmixing strive for the
same goal: process the recording in such a way that when recreated at
the target system it preserves the perceptual properties of the original.
[Block diagrams omitted: (a) multichannel SAC example; (b) stereo to
multichannel or binaural upmixing example.]

Figure 5.1. Block diagrams of basic signal and parameter flow in SAC and
upmixing examples. Signal flow is indicated with solid arrows, and parameter
flow with dashed arrows. The TFT/FB blocks indicate time-frequency transform
or filter bank operations. In SAC, inter-channel level differences (ICLD),
phase differences (ICPD), and the inter-channel coherence (ICC) are computed
and transmitted. In upmixing these channel dependencies are used to estimate
parameters for a primary-ambient (or direct-diffuse) decomposition.
Similarities and differences between SAC and upmixing are outlined in
Fig. 5.1. Certain SAC approaches focus mostly on compression and ef-
ficiency, whereas others extract perceptual spatial parameters that are
suitable also for the upmixing task. The best-known SAC methods are
the MPEG Surround codec specification (Breebaart et al., 2005a; Herre
et al., 2008), Binaural Cue Coding (BCC) (Baumgarte and Faller, 2003),
and Parametric Stereo (PS) (Breebaart et al., 2005b). A comprehensive
overview of these approaches is presented by Breebaart and Faller (2007).
An alternative approach to these methods, termed Spatially Squeezed
Surround Audio Coding (S3AC) (Cheng et al., 2007, 2013) omits meta-
data and aims to compress directional components of the sound scene into
the stereophonic angle-span, with the potential to remap them to other
setups in the synthesis stage.
Coding of ambisonic signals is an emerging research topic. The recent
MPEG-H 3D-audio standard (Herre et al., 2015), apart from channel-
based and object-based audio coding, incorporates HOA processing. Fur-
ther work on coding and transmission of HOA signals has been presented
by Peters et al. (2015) and Sen et al. (2016). The DirAC-based schemes of
Pulkki et al. (2013) and Publication V can be also viewed as SAC methods
for 2D and 3D HOA signals respectively.
5.1.2 Upmixing
Upmixing methods are traditionally defined for channel-based content,
with the first proposals focusing on conversion of stereo-to-surround con-
tent (Irwan and Aarts, 2002; Avendano and Jot, 2004; Faller, 2006). They
vary on how they separate the directional and diffuse components, an
approach also called direct-diffuse decomposition, or primary-ambient ex-
traction. In general, coherent signals between channels are treated as
directional, while incoherent signals are treated as ambient or diffuse.
With additional assumptions on the energies and correlations of the com-
ponents, it is possible to find estimates of their signals. A well-established
approach to direct-diffuse decomposition is the one presented by Faller
(2006) and Walther and Faller (2011), which formulates the decomposi-
tion as a multichannel Wiener filtering problem with a least-squares solu-
tion. Similar approaches solve a linear system of equations describing the
input signal and decomposed signal statistics (Avendano and Jot, 2004;
Merimaa et al., 2007; Thompson et al., 2012). A popular alternative has
been the principal component analysis (PCA) of the input signals. PCA
detects uncorrelated components and achieves effective primary-ambient
decomposition under certain conditions, as has been detailed by Irwan
and Aarts (2002), Driessen and Li (2005), Briand et al. (2006), Merimaa
et al. (2007), Baek et al. (2012), and He et al. (2014). PCA also seems
closely related to the unified-domain representation of Short et al. (2007).
No generalisation of its use beyond stereo input has been proposed, however.
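The PCA-based primary-ambient decomposition described above can be sketched for stereo input as follows; the synthetic panned source, gain values, and ambience level are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 48000
# Synthetic stereo mixture (invented for illustration): one amplitude-panned
# primary source plus weak uncorrelated ambience in each channel.
s = rng.standard_normal(n)
gains = np.array([0.8, 0.6])                  # panning gains of the source
ambience = 0.1 * rng.standard_normal((2, n))
x = gains[:, None] * s[None, :] + ambience    # 2 x n stereo signal

# PCA: eigen-decomposition of the 2x2 channel covariance matrix.
C = (x @ x.T) / n
evals, evecs = np.linalg.eigh(C)              # eigenvalues in ascending order
u = evecs[:, -1]                              # dominant eigenvector

# Primary = projection onto the dominant component; ambience = residual.
primary = np.outer(u, u @ x)
ambient = x - primary

# Under this model, the dominant eigenvector recovers the panning direction.
alignment = abs(u @ gains) / np.linalg.norm(gains)
```

Under the assumed model of one dominant panned source, the dominant eigenvector aligns with the panning gains and the residual after projection serves as the ambience estimate, illustrating why PCA is effective only under such conditions.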
Recent proposals either refine the direct-diffuse decomposition by com-
bining the approaches above (He et al., 2014), or apply blind source sep-
aration methods to estimate independent components in the channel sig-
nals (Nikunen et al., 2011). DirAC itself has found applications as an up-
mixing method (Pulkki, 2006; Laitinen and Pulkki, 2011; Laitinen, 2014),
with the difference that instead of inter-channel dependencies, it com-
putes global sound scene descriptors, such as the intensity vector and the
diffuseness at each subband. A method closely related to DirAC upmix-
ing is the Spatial Audio Scene Coding (SASC) based on the descriptors of
velocity and energy vectors instead of intensity and diffuseness (Goodwin
and Jot, 2008, 2007).
5.1.3 Binaural reproduction
Binaural reproduction is a major application of upmixing systems, due to
the realisation that rendering a virtual loudspeaker setup through bin-
aural panning is not an effective solution, since it suffers from coloura-
tion of the material, in-head localisation, and loss of envelopment. The
directional-diffuse decomposition has been found very effective for this
task, rendering the directional component using HRTFs and decorrelat-
ing the ambient component appropriately at the two ears, fixing most of
these issues (Faller and Breebaart, 2011). This approach has been fol-
lowed by most of the major methods mentioned above, such as MPEG
Surround (Breebaart et al., 2006), BCC/PS-related reproduction (Faller
and Breebaart, 2011), S3AC (Cheng et al., 2008), SASC (Goodwin and
Jot, 2007) and DirAC (Laitinen and Pulkki, 2009).
5.1.4 Discussion
SAC and upmixing share many similarities with the reproduction meth-
ods developed in this thesis, the major difference stemming from the fact
that upmixing and SAC focus on certain channel-based formats, while
the methods studied herein focus on array recordings and reproduction of
real sound scenes. This distinction is, however, somewhat artificial, since
recording-based methods still operate with a compact finite representa-
tion of the sound scene, the array signals. Capturing the sound field still
imposes certain mixing models for directional sounds and certain correla-
tions between channels for ambient or diffuse sounds, similar to the pan-
ning and reverberation tools for mixing and creating channel-based audio
content. However, the complexity of many natural sound scenes and the
variation in recording practices and approaches make the sound repro-
duction problem more challenging than SAC and upmixing, and justify
many of the more complicated approaches presented in the next section.
5.2 Parametric reproduction of recorded sound scenes
This section gives an overview of parametric methods that perform an
analysis of a sound scene captured with some microphone array and its
respective reproduction. There is great variability in the approaches and
their assumptions on which sound field models, source signal properties,
and array and reproduction system properties they are based on. In the
following, an attempt is made to present the various trends with respect
to these choices.
5.2.1 Assumptions on array input and output setup properties.
Even though most methods have general principles and can be applied
to arbitrary arrays, a large part of the research focuses on certain ar-
rangements dictated by the application, the target performance, cost, pro-
cessing requirements, and so on. For example, many applications con-
sider a two-dimensional model of the sound-field and hence linear, planar,
or circular arrays are adequate for its analysis. Linear arrays, popular
in array processing applications due to their simplicity in analysis and
beamforming, are suitable, for example, for two-way communication sys-
tems integrated with a flat screen, and have been used in that context
in sound scene acquisition and rendering (Beracoechea et al., 2006; Thier-
gart et al., 2011). Similarly, compact circular (Ahonen et al., 2007; Alexan-
dridis et al., 2013) and cylindrical arrays (Ahonen et al., 2012) have been
proposed for high-quality teleconferencing applications and general 2D
sound scene recording and reproduction. 2D arrays with directional mi-
crophones following the stereophonic principles of Sec. 3.2.3, popular in
music recording, have been studied for parametric recording and repro-
duction. Stereophonic arrangements were used by Ahonen et al. (2009)
and Faller (2008), and surround recording arrays have been utilised by
the author (Politis et al., 2013a,b) and Publication III. For 3D analysis
and synthesis, a minimum of four omnidirectional microphones is neces-
sary; such minimal 3D arrays are used as implementation examples by
Cobos et al. (2010) and Nikunen and Virtanen (2014). Large omnidirec-
tional arrays, on the other hand, can be useful in certain applications,
such as reproduction of sound scenes with the possibility of translation
inside the recording (Gallo et al., 2007; Niwa et al., 2008; Thiergart et al.,
2013), or auralisation of industrial environments (Verron et al., 2013).
A class of methods, instead of operating on the array signals, works in
their spatial transform domain, more specifically in the SHD detailed in
Sec. 2.1.5. The advantages are apparent: the SHD provides a standardised
format for the spatial sound scene that is independent of the array, and it
bridges the distinction between recorded and synthetic sound scenes. Recorded
sound scenes encoded into SH or ambisonic signals suffer from the frequency-
dependency issues mentioned in Sec. 2.1.5. This fact needs to be taken
into account in parametric analysis and synthesis (see Publication V),
making processing of the array signals advantageous in certain tasks,
as shown in Publication I. The most compact format for 3D analysis is
the set of first-order SH signals, or equivalently the B-format. B-format
signals can be obtained with a broadband response using practical com-
pact arrays. DirAC has been traditionally defined with B-format input,
and also the methods by Hurtado-Huyssen and Polack (2005), Berge and
Barrett (2010b), and Günel et al. (2008). Higher-order ambisonic signals
encode the sound scene with a higher spatial resolution than B-format,
when available. This increased resolution is exploited in Publication V.
Due to the decomposition of the sound scene into its spatial components,
arbitrary rendering of the recordings is possible either using suitable pan-
ning functions for loudspeaker reproduction or HRTFs for binaural repro-
duction. Many of the methods mentioned here are applied to both targets
(Pulkki, 2007; Laitinen and Pulkki, 2009; Berge and Barrett, 2010b,a; Co-
bos et al., 2012, 2010; Nikunen and Virtanen, 2014; Nikunen et al., 2015;
Verron et al., 2013, 2011; Vilkamo and Delikaris-Manias, 2015; Delikaris-
Manias et al., 2015). Targeting binaural reproduction, compared to loud-
speakers, imposes requirements on real-time processing with interactive
update rates, since modern systems consider head-tracking an essential
component for a realistic listening experience. Practical recording for ren-
dering on WFS systems also seems possible with parametric techniques
(Beracoechea et al., 2006; Cobos et al., 2012; Thiergart et al., 2013; Bai
et al., 2015).
5.2.2 Assumptions on the sound-field model and estimation of parameters.
Direct-diffuse decomposition
In terms of the sound-field model or assumptions on the spatial properties
of the sound scene, most methods follow the dominant directional/primary
plus diffuse/ambient component found in SAC and upmixing. This decom-
position, which has been shown to be perceptually effective in the above
applications, is followed by DirAC and the methods by Hurtado-Huyssen
and Polack (2005), Gallo et al. (2007), Faller (2008), and Thiergart et al.
(2013). The active intensity vector and its statistics are exploited for this
task in DirAC and by Hurtado-Huyssen and Polack (2005) and Günel
et al. (2008) due to its effectiveness and robustness. The method by Hurtado-
Huyssen and Polack (2005) attempts additionally to reproduce the energy
polarisation in the recording as a distributed sound source, while Günel
et al. (2008) uses intensity vector statistics as spatial filters to separate
the directional sounds.
Certain methods aim to extract multiple directional components per
time-frequency tile, and some drop the dependency on the diffuse/ambient
component. Cobos et al. (2010) assume just a single directional compo-
nent, while the HARPEX method (Berge and Barrett, 2010b) extracts two
plane waves from a B-format signal. This latter model is augmented with
a diffuse component by Thiergart and Habets (2012). Multiple directional
signals are extracted also by Günel et al. (2008), Hur et al. (2011), Alexan-
dridis et al. (2013), Thiergart et al. (2014), and Nikunen and Virtanen
(2014). A method that operates on SH signals and extracts multiple di-
rectional and diffuse components is presented in Publication V.
Relation to spatial filtering and enhancement
A large family of methods notices the similarity between the spatial sound
capture and reproduction problem and the more traditional array process-
ing applications of source detection, tracking and signal estimation, with
spatial filtering operations. This approach is very closely related to mul-
tichannel speech enhancement applications (see, for example, the refer-
ences in Benesty et al. (2008) and Brandstein and Ward (2013)) with the
difference that instead of trying to get a clean directional speech signal
and suppress interferers and ambient noise, all perceptual components of
interest should be estimated and preserved for transparent spatial sound
reproduction. In a sense, enhancement usually targets highly effective
suppression and quality constraints on the individual enhanced signal,
while spatial reproduction relaxes the requirements on suppression be-
tween components and targets quality constraints for their recombination
at synthesis. Both tasks have their own challenges but share many simi-
larities. An array-processing oriented approach has been applied to DirAC
by Thiergart et al. (2011). This approach has been recently generalised
by Thiergart et al. (2014) and Thiergart and Habets (2014) into a unified
framework for adaptive estimation of multiple source signals of interest
plus diffuse sound, similar to Fig. 5.2a. This work uses trade-off beam-
formers (Habets and Benesty, 2013) that can switch between maximum
suppression or low signal distortion, suitable for both enhancement and
[Block diagrams omitted: (a) parametric reproduction based on adaptive
spatial filtering; (b) parametric reproduction based on BSS.]

Figure 5.2. Block diagrams of exemplary implementations for parametric
reproduction based on (a) spatial filtering/beamforming and (b) blind source
separation. Signal flow is indicated with solid arrows and parameter flow
with dashed arrows. The TFT/FB blocks indicate time-frequency transform or
filter bank operations. In (a) a number of dominant sources and their
directions are estimated and extracted by adaptive beamforming, while in (b)
independent components are detected in the mixture and clustered into
meaningful source signals based on information of the array response.
sound reproduction. Similar approaches are proposed in the literature
using different beamforming variants (Beracoechea et al., 2006; Alexan-
dridis et al., 2013; Hur et al., 2011; Coleman et al., 2015). A comparison
between different beamformers for the separation of the sound scene com-
ponents is presented by Coleman et al. (2015).
Relation to blind source separation
An alternative approach to spatial filtering for the decomposition of the
sound scene is the blind source separation (BSS) approach. BSS aims to
estimate components inside a mixture of source signals that are statis-
tically independent, and then group them together into meaningful en-
tities. BSS for reproduction is well justified from a perceptual point of
view, since it extracts components that most likely correspond to different
sound generating objects instead of spatially filtering sounds from certain
directions. Plain BSS methods can suffer from reverberation and permu-
tation problems, which at first glance makes them unsuitable for high-
quality sound scene rendering. However, combining BSS with additional
array processing stages and a directional/diffuse separation seems to han-
dle these issues effectively; see, for example, Asano et al. (2003) and Niwa
et al. (2008). For the task of sound scene enhancement and reproduction,
the work by Niwa et al. (2008) applies independent component analysis
(ICA) on recordings of distributed arrays of omnidirectional microphones
and clusters the results into virtual sources. The work of Nikunen and
Virtanen (2014) and Nikunen et al. (2015) applies non-negative matrix
factorization (NMF) to the source separation task and estimates addi-
tionally the array spatial covariance due to each independent source sig-
nal, which determines the spatial properties of each source. An exam-
ple of BSS processing for scene analysis and reproduction is outlined in
Fig. 5.2b.
Other models
Certain recent proposals omit a sound field model and aim to utilise di-
rectly the observed spatial statistics of the sound scene for reproduction.
One such example is the spatial covariance matrix method introduced by
Hagiwara et al. (2013), which, however, requires an additional reference
recording at the reproduction setup, apart from the scene recording itself,
in order to match the statistics of the two adaptively. A more efficient
approach is presented by Vilkamo and Delikaris-Manias (2015), where
both scene recordings and desired spatial statistics for the target setup
are captured with non-parametric beamformers, and they are further en-
hanced by the adaptive mixing solution in (Vilkamo et al., 2013). A related
technique is used for synthesis of recordings with a sinusoidal-plus-noise
sound-field model by Verron et al. (2013).
Sparse and compressed sensing approaches to acoustic array process-
ing are highly active research areas at present. In the task of sound
scene analysis and synthesis, the work of Wabnitz et al. (2011) performs
a sparse plane wave decomposition that is used to upscale low-order am-
bisonic signals into higher-order ones. Even though not strictly a para-
metric approach, the method is signal-dependent and, combined with some
direct-diffuse decomposition (Epain and Jin, 2013), it seems suitable for
high resolution rendering of sound scenes.
5.3 Directional Audio Coding
Directional Audio Coding (DirAC) is a spatial audio coding, upmixing
and reproduction method for surround audio content and recorded sound
scenes based on a perceptually-motivated analysis and synthesis frame-
work (Pulkki, 2007; Vilkamo et al., 2009). The predecessor of DirAC,
the Spatial Impulse Response Rendering (SIRR) method (Merimaa and
Pulkki, 2005; Pulkki and Merimaa, 2006), was used for spatial reproduc-
tion of room impulse responses, and many of its perceptual assumptions
and technological innovations were extended in DirAC. The perceptual
motivation comes from the fact that DirAC, instead of a physically in-
spired reproduction as in WFS or certain ambisonic decoding variants,
aims at correct and stable interaural cues and preservation of timbral and
spectral qualities in the recording. This is attempted with a direct-diffuse
decomposition of the sound scene, using the sound field parameters of the
active intensity vector and diffuseness.
5.3.1 Analysis
The original DirAC analysis operates on the B-format signal vector b =
[p, vx, vy, vz]^T, as defined in (2.64)¹. All operations are defined in a
suitable time-frequency transform with centre frequency index k and temporal
index l. As the vector of equalised pressure-gradient components in b is
proportional to the acoustic velocity vector with v = -Z_0 u, we get an
estimate of the active intensity vector (2.16) and energy density (2.17) at
the origin of the array as

\mathbf{i}_{kl} = -\frac{1}{2 Z_0} \Re\{ p_{kl} \mathbf{v}_{kl}^{*} \}, \qquad (5.1)

E_{kl} = \frac{1}{4 c Z_0} \left[ \| \mathbf{v}_{kl} \|^2 + | p_{kl} |^2 \right]. \qquad (5.2)

Combining (5.1) and (5.2) according to (2.18) and (2.26), we get the
diffuseness estimate

\psi_{kl} = 1 - \frac{2 \| \Re\{ p_{kl} \mathbf{v}_{kl}^{*} \} \|}{\| \mathbf{v}_{kl} \|^2 + | p_{kl} |^2}. \qquad (5.3)
Diffuseness here is defined based on the spectral densities of a single win-
dowed frame. Most publications on DirAC include an additional averag-
ing operation that extends diffuseness estimation along successive time
frames

\psi_{kl} = 1 - \frac{2 \| \langle \Re\{ p_{kl} \mathbf{v}_{kl}^{*} \} \rangle \|}{\langle \| \mathbf{v}_{kl} \|^2 + | p_{kl} |^2 \rangle}, \qquad (5.4)

¹We omit here the √2 scaling commonly found in the definition of the B-format
signals.
where 〈·〉 denotes temporal averaging. This formulation can approximate
ideal diffuseness values in stationary cases, and it is of practical impor-
tance since it controls how responsive the separation between direct and
diffuse components is in dynamic or static scenes. In general, longer tem-
poral averaging will result in higher and less rapidly changing diffuseness
values, and more sound will be allocated to the diffuse stream. This ef-
fect has the potential to mask musical noise due to rapid fluctuations in
the non-diffuse stream at the expense of directional sharpness. Temporal
averaging is performed either across a fixed number of frames, or more
commonly by recursive smoothing (Laitinen et al., 2011) as
S_{xy}(k, l) = \mu S_{xy}(k, l-1) + (1 - \mu)\, x_{kl} y_{kl}^{*}, \qquad (5.5)

where S_xy is the cross-spectral density between two acoustic quantities
x, y (for example, the pressure and velocity), μ = e^{-K/(τ_μ f_s)}, with K the hop
size of the windowed transform or the decimation factor of the filter bank,
τ_μ the time constant of the smoothing, and f_s the sample rate. Optimal
smoothing values can be determined either according to the application
(Laitinen et al., 2011), or estimated adaptively from the signals (Del Galdo
et al., 2010).
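The recursive smoothing of (5.5) amounts to a one-pole filter per time-frequency bin; a minimal sketch, with example values for the hop size and time constant (the cross-spectral track itself is synthetic noise for illustration):

```python
import numpy as np

# Example values; the time constant in particular is application-dependent.
fs = 48000        # sample rate (Hz)
K = 512           # hop size of the time-frequency transform (samples)
tau = 0.05        # smoothing time constant (s)

# One-pole coefficient of the recursive update (5.5).
mu = np.exp(-K / (tau * fs))

# Recursive smoothing of a cross-spectral density track in one frequency bin.
rng = np.random.default_rng(1)
x = rng.standard_normal(200) + 1j * rng.standard_normal(200)
y = rng.standard_normal(200) + 1j * rng.standard_normal(200)

S = 0.0 + 0.0j
track = []
for l in range(200):
    S = mu * S + (1 - mu) * x[l] * np.conj(y[l])
    track.append(S)
```

Longer time constants give mu closer to one, i.e. slower, smoother parameter tracks, matching the trade-off between musical-noise masking and directional sharpness described above.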
The direction opposite to the net energy flow, expressed through the in-
tensity vector in (5.1), comprises the DoA parameter used in DirAC syn-
thesis:

\begin{bmatrix} \cos\vartheta_{kl}\cos\varphi_{kl} \\ \cos\vartheta_{kl}\sin\varphi_{kl} \\ \sin\vartheta_{kl} \end{bmatrix} = -\frac{\mathbf{i}_{kl}}{\|\mathbf{i}_{kl}\|} = \frac{\Re\{ p_{kl} \mathbf{v}_{kl}^{*} \}}{\| \Re\{ p_{kl} \mathbf{v}_{kl}^{*} \} \|}, \qquad (5.6)
with ϑkl and ϕkl being the analysed elevation and azimuth, respectively.
Finally, instead of the energy density in (5.2), an alternative energy pa-
rameter is used:
\varepsilon_{kl} = \frac{1}{2} \left[ \| \mathbf{v}_{kl} \|^2 + | p_{kl} |^2 \right]. \qquad (5.7)
This practical parameter is equal to the power of the omnidirectional sig-
nal for a plane-wave or a diffuse field. The parameters [ϑkl, ϕkl, ψkl, εkl]
comprise the analysed spatial metadata for the time-frequency tile k, l
and are used in the synthesis stage to recreate the sound scene at the
target setup. The DirAC analysis is depicted schematically in Fig. 5.3.
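For concreteness, the analysis equations (5.1)-(5.7) can be sketched for a single time-frequency tile as below. The single-frame diffuseness (5.3) is used instead of the averaged (5.4), and the impedance value and the test plane wave (whose gradient signals satisfy v = p·n under this convention, n being the DoA) are assumptions for illustration:

```python
import numpy as np

Z0 = 413.3   # characteristic impedance of air (example value, in rayl)
c = 343.0    # speed of sound (m/s)

def dirac_analysis(p, v):
    """DirAC parameters for one time-frequency tile, following (5.1)-(5.7).
    p: complex pressure spectrum value; v: complex 3-vector of equalised
    gradient signals [vx, vy, vz], with v = -Z0 u."""
    re_pv = np.real(p * np.conj(v))                            # Re{p v*}
    i_vec = -re_pv / (2.0 * Z0)                                # intensity, (5.1)
    E = (np.sum(np.abs(v)**2) + abs(p)**2) / (4.0 * c * Z0)    # energy, (5.2)
    # Single-frame diffuseness, (5.3):
    psi = 1.0 - 2.0 * np.linalg.norm(re_pv) / (np.sum(np.abs(v)**2) + abs(p)**2)
    doa = re_pv / np.linalg.norm(re_pv)                        # -i/||i||, (5.6)
    azi = np.arctan2(doa[1], doa[0])
    ele = np.arcsin(np.clip(doa[2], -1.0, 1.0))
    eps = 0.5 * (np.sum(np.abs(v)**2) + abs(p)**2)             # energy, (5.7)
    return azi, ele, psi, eps

# Sanity check: a single plane wave arriving from azimuth 30 deg, elevation 0,
# which should yield the correct DoA and zero diffuseness.
p = 1.0 + 0.5j
n = np.array([np.cos(np.pi / 6), np.sin(np.pi / 6), 0.0])
azi, ele, psi, eps = dirac_analysis(p, p * n)
```

The plane-wave check also motivates the form of (5.3): for a plane wave, ||v|| = |p| and Re{p v*} is fully aligned with the DoA, so the fraction equals one and the diffuseness correctly vanishes.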
[Block diagram omitted.]

Figure 5.3. Block diagram of DirAC analysis. Signal flow is indicated with
solid arrows and parameter flow with dashed arrows. The TFT/FB blocks
indicate time-frequency transform or filter bank operations. The notation of
signals and parameters follows the one in the text.
5.3.2 Synthesis
DirAC synthesis has been defined in two ways. The first, also termed
low-bitrate DirAC (LB-DirAC), uses only the pressure signal from the
B-format and the metadata for the synthesis at the target setup. This
version is low in both processing and transmission requirements and is
suitable for applications where efficiency is favoured over quality or trans-
parency, such as in teleconferencing (Ahonen et al., 2007). The second
version uses all the channels of the B-format and is termed virtual micro-
phone DirAC (VM-DirAC) (Vilkamo et al., 2009). This version targets
quality-critical applications such as music recording and reproduction,
and binaural rendering and upmixing. In VM-DirAC, the processing is
done not on the pressure signal, but on the output signals of an initial am-
bisonic decoding. The virtual microphones term refers to the first-order
beamformers towards the loudspeakers that comprise the ambisonic de-
coding. Since this thesis considers high-resolution techniques, we focus
on VM-DirAC exclusively.
Assuming a preferred ambisonic decoding matrix Mdec for the target
loudspeaker setup, the output signals before enhancement zkl are given
by
\mathbf{z}_{kl} = \mathbf{M}_{\mathrm{dec}} \mathbf{b}_{kl}. \qquad (5.8)
The decoding matrix Mdec(k) can also be frequency-dependent. The direct-
diffuse decomposition is performed directly on the output signal vector
z. A separation filter (or time-frequency mask) is constructed using the
diffuseness parameter for the direct and diffuse stream as
\mathbf{z}^{\mathrm{dir}}_{kl} = \sqrt{1 - \psi_{kl}}\, \mathbf{z}_{kl} \qquad (5.9)
and
\mathbf{z}^{\mathrm{diff}}_{kl} = \sqrt{\psi_{kl}}\, \mathbf{z}_{kl}. \qquad (5.10)
The direct stream is spatialised to the target setup using amplitude pan-
ning gains gkl for the analysed direction (ϑkl, ϕkl) that concentrate it spa-
tially to the specific loudspeaker pair or triplet. The result is the output
direct stream
\mathbf{y}^{\mathrm{dir}}_{kl} = \mathbf{c}_{\mathrm{dir}} \circ \mathbf{g}_{kl} \circ \mathbf{z}^{\mathrm{dir}}_{kl}, \qquad (5.11)
with ◦ denoting element-wise multiplication and cdir being a vector that
compensates for the mean energy loss of a plane wave due to the effect of
the combined amplitude panning gains and the ambisonic decoding. The
output diffuse stream is given by
\mathbf{y}^{\mathrm{diff}}_{kl} = \mathbf{c}_{\mathrm{diff}} \circ \left[ \mathbf{D}_{k} \mathbf{z}^{\mathrm{diff}}_{kl} \right], \qquad (5.12)
where cdiff are compensating gains for the loss of diffuse power due to the
virtual microphones, with entries ci = 1/√Qi, where Qi is the directivity
factor of the ith virtual microphone. The matrix Dk is a matrix of decor-
relating factors, used in order to achieve diffuse field properties during
reproduction of the diffuse stream. The decorrelating filters can be de-
signed in various ways (Pulkki and Merimaa, 2006). A common choice in
most DirAC publications is random frequency-dependent delays within a
temporal range of approximately 10 to 100 milliseconds (Laitinen et al.,
2011). For more details on the compensating factors cdir and cdiff , the
reader is referred to Vilkamo et al. (2009) and Publication I. The DirAC
synthesis is presented schematically in Fig. 5.4.
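A simplified single-tile sketch of the VM-DirAC synthesis path (5.8)-(5.12) follows. The decoding matrix, panning gains, diffuseness value, compensation gains, and the random-phase "decorrelator" are all placeholders for illustration, not the actual DirAC implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed first-order decoding matrix Mdec for a square 4-loudspeaker setup:
# a cardioid "virtual microphone" towards each loudspeaker, for B-format
# channels ordered [p, vx, vy, vz] (placeholder for a preferred decoder).
spk_az = np.radians([45.0, 135.0, 225.0, 315.0])
Mdec = 0.5 * np.column_stack([np.ones(4), np.cos(spk_az),
                              np.sin(spk_az), np.zeros(4)])

b = rng.standard_normal(4) + 1j * rng.standard_normal(4)   # B-format tile
z = Mdec @ b                                               # (5.8)

psi = 0.3                                # analysed diffuseness (example value)
z_dir = np.sqrt(1.0 - psi) * z           # direct stream, (5.9)
z_diff = np.sqrt(psi) * z                # diffuse stream, (5.10)

# Direct stream: concentrate the analysed direction with panning gains
# (a trivial one-loudspeaker gain vector stands in for VBAP here).
g = np.array([1.0, 0.0, 0.0, 0.0])
c_dir = np.ones(4)                       # energy compensation (placeholder)
y_dir = c_dir * g * z_dir                # (5.11)

# Diffuse stream: decorrelate and compensate for the cardioid directivity
# factor Q = 2; random phases stand in for decorrelating filters on one tile.
D = np.diag(np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, 4)))
c_diff = np.full(4, 1.0 / np.sqrt(2.0))
y_diff = c_diff * (D @ z_diff)           # (5.12)

y = y_dir + y_diff                       # loudspeaker signals for this tile
```

Note that the masks in (5.9) and (5.10) are energy-complementary, so the direct and diffuse streams together preserve the energy of the decoded signals before the panning and decorrelation stages modify their spatial distribution.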
The reason for using the full B-format signals in the synthesis stage
comes from the realisation that a static ambisonic-like decoding of the
B-format signals can already achieve some of the desired reproduction
aims. The output signals already follow the spatial distribution of the
recorded sound scene for directional sounds, and they are partially decor-
related for diffuse sounds. Applying the DirAC synthesis on the outputs of
this non-parametric decoding reduces processing artefacts and improves
the quality of both the single channel signals and their combined repro-
duction (Laitinen et al., 2011) compared to the LB-DirAC. This approach
also justifies the characterisation of DirAC as an ambisonic-enhancement
method.
The previous relations present a simplified DirAC synthesis. In prac-
tice, the method involves additional stages for maximum quality, such as
[Block diagram omitted.]

Figure 5.4. Block diagram of VM-DirAC synthesis. Signal flow is indicated
with solid arrows and parameter flow with dashed arrows. The TFT/FB blocks
indicate time-frequency transform or filter bank operations. The notation of
signals and parameters follows the one in the text.
a multi-resolution time-frequency transform (Laitinen et al., 2011),
bypassing transients from the decorrelating filters (Laitinen et al., 2011;
Vilkamo and Pulkki, 2015), and splitting the diffuse stream into a
reverberant and a non-reverberant part (Laitinen and Pulkki, 2012). Using
these improvements, the perceived quality of VM-DirAC has been found to
be high in listening tests (Vilkamo et al., 2009; Laitinen et al., 2011). The
major drawback of the VM-DirAC formulation is that it is not optimal in
achieving the energy and coherence properties that the spatial parame-
ters indicate. The reason for this is that it does not fully take into account
what is achieved already by the static decoding stage. Based on this ob-
servation, the work by Vilkamo and Pulkki (2013) proposes an alternative
formulation of DirAC synthesis which uses the optimal mixing framework
of Vilkamo et al. (2013). A mixing matrix from the B-format to the
loudspeakers is dynamically computed for each time-frequency tile that
optimally achieves the target energy and channel dependencies. The
formulation additionally minimises the amount of decorrelated energy
added to the output, since it exploits the decorrelation that can be achieved
by mixing the inputs. The higher-order DirAC formulation of Publication
V takes full advantage of this framework.
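A minimal, real-valued, square sketch of the covariance-domain idea of Vilkamo et al. (2013) follows: find a mixing matrix M that maps an input covariance Cx exactly onto a target covariance Cy while staying as close as possible to a prototype matrix Q (e.g. a static decoder). The full framework additionally handles non-square cases and injects decorrelated energy for the residual that mixing alone cannot reach; the function name and the square full-rank setting are simplifying assumptions here.

```python
import numpy as np

def optimal_mixing_matrix(Cx, Cy, Q):
    """Return M with M Cx M^T = Cy, as close as possible (least squares)
    to the prototype matrix Q. Real-valued, square, full-rank sketch."""
    Kx = np.linalg.cholesky(Cx)            # Cx = Kx Kx^T
    Ky = np.linalg.cholesky(Cy)            # Cy = Ky Ky^T
    # Any orthogonal P gives (Ky P Kx^-1) Cx (.)^T = Cy; choose the P that
    # minimises ||Ky P - Q Kx||_F (orthogonal Procrustes problem).
    U, _, Vt = np.linalg.svd(Ky.T @ Q @ Kx)
    return Ky @ (U @ Vt) @ np.linalg.inv(Kx)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); Cx = A @ A.T + np.eye(4)   # input covariance
B = rng.standard_normal((4, 4)); Cy = B @ B.T + np.eye(4)   # target covariance
M = optimal_mixing_matrix(Cx, Cy, np.eye(4))
```

By construction M reproduces the target covariance exactly, which is the property the static decoder alone cannot guarantee.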
5.3.3 Sound scene synthesis and manipulation
The spatial parameters of DirAC can be used directly to create synthetic
sound scenes and spatial effects. This approach has been found to be
effective in the creation of spatial sound objects with a prescribed spa-
tial extent through panning of individual subbands of a monophonic sig-
nal over the angular region that determines that extent (Laitinen et al.,
2012a; Pihlajamäki et al., 2014). Similar approaches have been studied
earlier (Kendall, 1995; Potard and Burnett, 2004) and have been realised
recently in the methods of Verron et al. (2010) and Franck et al. (2015).
An alternative for ambisonic signals is proposed by Zotter et al. (2014).
Apart from source extent, efficient perceptual reverberation can be added
to the scene by encoding the monophonic sound into a B-format set with
added reverberation and prescribing the desired diffuseness values for the
synthesis (Laitinen et al., 2012a).
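The subband-panning idea for source extent can be caricatured as follows: each frequency bin of a mono signal is panned to a pseudo-random direction within the desired arc. Stereo sine-law panning and per-bin (rather than perceptual-band) assignment are simplifications; the cited methods use subbands and VBAP over full loudspeaker setups.

```python
import numpy as np

def widen_mono(x, extent_deg=60.0, seed=0):
    """Spread a mono frame over an azimuth arc of width extent_deg by
    panning each frequency bin to a pseudo-random direction in the arc,
    here with stereo sine-law gains (gL^2 + gR^2 = 1 per bin)."""
    rng = np.random.default_rng(seed)
    X = np.fft.rfft(x)
    az = rng.uniform(-extent_deg / 2.0, extent_deg / 2.0, size=X.shape)
    theta = np.pi / 4.0 + np.deg2rad(az)   # panning angle in [0, pi/2]
    left = np.fft.irfft(X * np.cos(theta), n=len(x))
    right = np.fft.irfft(X * np.sin(theta), n=len(x))
    return np.stack([left, right])
```

Because the per-bin gains satisfy gL² + gR² = 1, the total energy of the widened output equals that of the mono input.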
Manipulation of real recorded sound scenes is possible by modulating
the analysed parameters before the synthesis stage. This modulation of
parameters is performed based on some meaningful spatial transforma-
tion, depending on the application. Examples include a zooming effect
towards an arbitrary direction (Schultz-Amling et al., 2010) and a re-
lated transformation that projects the recorded scene onto some virtual
surfaces and allows the user to move inside the recording (Pihlajamaki
and Pulkki, 2015). Besides directional transformations, perceptual rever-
beration control can be achieved through modification of the diffuseness
parameter. Parametric scene manipulation in the time-frequency domain
allows flexible and intuitive design of spatial effects, since it involves mod-
ifying properties such as the DoA of sound and the DDR. Publication II
presents a series of such examples.
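As an illustration of such a directional transformation, the hypothetical helper below remaps analysed DoA vectors for a virtual listener translation, in the spirit of the zooming effect of Schultz-Amling et al. (2010). The reference source distance is a free parameter of the effect, and the accompanying gain and diffuseness modifications are omitted.

```python
import numpy as np

def zoom_doa(doa, translation, ref_distance=2.0):
    """Remap analysed DoA unit vectors for a virtual listener translation
    (acoustic zoom). Sources are placed at an assumed ref_distance, the
    listener moves by `translation`, and directions are re-normalised."""
    doa = np.atleast_2d(doa).astype(float)
    shifted = ref_distance * doa - np.asarray(translation, dtype=float)
    return shifted / np.linalg.norm(shifted, axis=-1, keepdims=True)
```

Zooming 1 m toward the front (with a 2 m reference distance) leaves a frontal source in place but pushes a lateral source toward the side and rear, which is the intended change of perspective.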
5.3.4 Binauralisation, transcoding and upmixing
Binaural reproduction with DirAC is very similar to loudspeaker rendering.
When head tracking is utilised, the method must adapt the analysis
and synthesis to the rotating coordinate frame of the user's head.
Two approaches are possible for binaural rendering. The one proposed
by Laitinen and Pulkki (2009) performs the synthesis for a fixed number
of virtual loudspeakers, which are then convolved with the appropriate
HRTFs. From the potential combinations that can perform the rotation
of the recording, the most efficient is deemed to be counter-rotation of
the analysed DoAs during analysis and counter-rotation of the ambisonic
decoding at synthesis. The method has been found to produce very con-
vincing results with sharp spatial images and effective externalisation
(Laitinen and Pulkki, 2009). The second approach, currently under devel-
opment, omits the virtual loudspeaker setup and replaces the amplitude
panning gains with HRTF-derived complex gains that correspond to the
time-frequency transform under use. The diffuse stream is additionally
rendered directly to binaural signals, taking into account the binaural co-
herence of diffuse sound, in a manner similar to that by Faller and Bree-
baart (2011), Borß and Martin (2009), and Vilkamo et al. (2012). The
method can additionally exploit the optimal mixing solution of Vilkamo
et al. (2013) for maximum perceived quality.
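The counter-rotation mentioned above amounts to rotating the analysed DoA vectors by the inverse of the tracked head orientation. A yaw-only sketch (a full implementation uses yaw, pitch and roll, and also rotates the ambisonic decoding):

```python
import numpy as np

def yaw_matrix(yaw):
    """Rotation about the vertical (z) axis by `yaw` radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def counter_rotate_doas(doas, head_yaw):
    """Express room-fixed DoA vectors in the listener's head frame by
    applying the inverse head rotation (row-vector convention)."""
    return np.atleast_2d(doas) @ yaw_matrix(head_yaw)
```

For example, a listener turning 90 degrees to the left (head_yaw = pi/2) hears a room-fixed frontal source move to the right ear, keeping the rendered scene stable in the room.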
Upmixing using DirAC has been applied to stereo (Laitinen, 2014) and
surround content (Laitinen and Pulkki, 2011). It involves first encoding
the original signals into a B-format set and then processing the result
with DirAC. Some consideration should be given to the encoding process
in order to take into account the spatial non-uniformity of the input chan-
nels (Laitinen and Pulkki, 2011). An interesting application of DirAC
upmixing is re-encoding of the first-order B-format into a higher-order
ambisonic signal set, a process which has also been termed ambisonic
upscaling (Wabnitz et al., 2011). The higher-order signals can better preserve
the spatial quality and sharpness of DirAC reproduction. Furthermore,
after upmixing to some reference order, material of various orders
can be mixed directly and decoded with a single optimised decoding ma-
trix for the target setup. A trivial way to perform this ambisonic upmix-
ing is to render the B-format into a uniform setup of virtual loudspeakers,
with the number and arrangement determined by the target order; see,
for example, spherical t-designs in Hardin and Sloane (1996), Gräf and
Potts (2011) and Publication V. The output signals can then be encoded
into the higher-order signals as plane waves from the virtual loudspeaker
directions using (4.6).
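A sketch of this re-encoding step for a target order of 2: real spherical harmonics (ACN ordering and N3D normalisation are assumed here) are evaluated at the virtual loudspeaker directions, and the loudspeaker feeds are encoded as plane waves from those directions, i.e. the operation the text refers to as (4.6).

```python
import numpy as np

def real_sh_order2(u):
    """Real spherical harmonics up to order 2 (ACN ordering, N3D
    normalisation) evaluated at unit direction(s) u = (x, y, z)."""
    x, y, z = np.atleast_2d(u).T
    return np.stack([
        np.ones_like(x),                      # (0, 0)
        np.sqrt(3.0) * y,                     # (1,-1)
        np.sqrt(3.0) * z,                     # (1, 0)
        np.sqrt(3.0) * x,                     # (1, 1)
        np.sqrt(15.0) * x * y,                # (2,-2)
        np.sqrt(15.0) * y * z,                # (2,-1)
        np.sqrt(5.0) / 2.0 * (3 * z ** 2 - 1),  # (2, 0)
        np.sqrt(15.0) * x * z,                # (2, 1)
        np.sqrt(15.0) / 2.0 * (x ** 2 - y ** 2),  # (2, 2)
    ], axis=-1)

def encode_plane_waves(signals, directions):
    """Encode J signals (J x T) arriving from unit directions (J x 3) into
    second-order ambisonic channels (9 x T): the virtual-loudspeaker
    feeds of the upmixing scheme become plane waves."""
    Y = real_sh_order2(directions)            # J x 9
    return Y.T @ np.atleast_2d(signals)       # 9 x T
```

A single feed from the zenith, for instance, excites only the (0,0), (1,0) and (2,0) harmonics, as expected from the axial symmetry of that direction.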
With reproduction for loudspeakers and headphones and upmixing for
channel-based content and recorded sound, along with spatialisation and
diffusion of synthetic spatial audio objects, DirAC shows the potential
of parametric techniques as universal spatial audio processors. Content
from various input formats can be parameterised into a common representation,
and spatialised or converted to a variety of target systems or
alternative formats. The methods also have the potential to objectify recorded
sound scenes by splitting them into perceptually meaningful components
along with their estimated signals and spatial parameters (Herre et al.,
2012).
5.3.5 Discussion
DirAC resembles many of the SAC and upmixing methods mentioned in
Sec. 5.1. Using a direct-diffuse decomposition, DirAC achieves its perceptual
target better than what can be done with first-order ambisonic
decoding (Vilkamo et al., 2009). Since DirAC was originally defined with
B-format input, it can also be seen as a method that addresses many of
the problems of first-order reproduction mentioned in Sec. 4.1.5.
DirAC can be seen differently from various points of view:
(a) From a SAC perspective: DirAC is a SAC method that is able to
transmit a monophonic downmix and metadata, which can be used
for efficient reproduction of the sound scene. This perspective is ex-
ploited mostly in the teleconferencing applications of DirAC (Ahonen
et al., 2007).
(b) From an upmixing perspective: DirAC is an upmixing method that
utilises global spatial parameters and a direct-diffuse decomposition
to upmix four channels to any loudspeaker setup or headphones with
high perceptual quality (Vilkamo et al., 2009; Laitinen et al., 2011).
(c) From an ambisonic perspective: DirAC is a first-order ambisonic
decoder that is active, meaning that instead of a fixed decoding matrix,
a signal-dependent one is adapted that tries to decode the B-format
optimally in accordance with the analysed ambisonic measures of
energy vectors and phasiness (which relate to intensity and diffuseness,
as mentioned in Sec. 3.3.1). From a purist ambisonic perspective,
DirAC can be seen as an ambisonic enhancement method.
(d) From an audio engineering perspective: DirAC is an audio post-
recording tool that combines the stable localisation properties of
coincident recording techniques with the envelopment qualities of
spaced microphone techniques (Pulkki, 2007), suitable also to ma-
nipulate the sound scene.
(e) From an array processing perspective: DirAC is a method that anal-
yses the DoA of the dominant source in the recording using the
acoustic intensity, estimates the DDR using the diffuseness, and
constructs a Wiener filter to separate the two streams. The sep-
arate streams can then be used for enhancement or spatialisation
with appropriate sets of synthesis filters (Thiergart et al., 2009).
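Perspective (e) can be condensed into a few lines: the active intensity gives the DoA, its temporal stability relative to the energy gives the diffuseness, and square-root Wiener gains split the streams. The sketch assumes a B-format scaling in which a unit plane wave from DoA u yields w = s and dipoles u·s (so the intensity points toward the source); physical constants are dropped.

```python
import numpy as np

def dirac_analysis(w, xyz):
    """Energetic analysis on a block of B-format samples (or STFT bins):
    active intensity, mean DoA, diffuseness, and square-root Wiener
    gains for a direct/diffuse split."""
    intensity = np.real(np.conj(w)[None, :] * xyz)            # 3 x T
    energy = 0.5 * (np.abs(w) ** 2 + np.sum(np.abs(xyz) ** 2, axis=0))
    i_mean = intensity.mean(axis=1)
    doa = i_mean / np.linalg.norm(i_mean)                     # mean DoA
    psi = np.clip(1.0 - np.linalg.norm(i_mean) / energy.mean(), 0.0, 1.0)
    g_dir, g_diff = np.sqrt(1.0 - psi), np.sqrt(psi)          # Wiener-style gains
    return doa, psi, g_dir * w, g_diff * w
```

For a single plane wave the mean intensity aligns with the DoA and the diffuseness tends to zero, so the whole signal is routed to the direct stream; for an ideal diffuse field the mean intensity vanishes and the diffuseness tends to one.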
Even though DirAC was originally defined for B-format input, it
has been extended to stereophonic recordings (Ahonen et al., 2009), linear
arrays (Thiergart et al., 2011), circular arrays (Ahonen et al., 2012;
Vilkamo and Pulkki, 2013), and spaced surround-recording arrays (Politis
et al., 2013a,b; Publication III). Furthermore, through the higher-order
formulation of Pulkki et al. (2013) and Publication V, the resolution
of arbitrary 2D and 3D arrays can be exploited.
The analysis and synthesis model of DirAC has proved quite efficient
and effective in a variety of tasks. Besides spatial sound reproduction,
its principles have been applied to efficient perceptual synthesis for vir-
tual reality and game audio (Laitinen et al., 2012a; Pihlajamäki et al.,
2013), augmented audio reality (Pugliese et al., 2015), localisation and
tracking of sound sources (Thiergart et al., 2009), parametric spatial fil-
tering (Kallinger et al., 2009), perceptual studies on audiovisual interac-
tion (Rummukainen et al., 2014), and natural reproduction for audiologi-
cal research (Koski et al., 2013).
DirAC forms the backbone of most of the development in this thesis,
and Publication I, Publication II, Publication III and Publication V aim
to improve, extend and generalise it. However, even though a significant
part of the work operates within a DirAC context, we believe that many
of the research insights gained are useful in the more general context of
parametric spatial sound processing and reproduction.
5.4 Analysis and synthesis of room acoustics
Auralisation of measured impulse responses refers to the method of record-
ing the acoustic impulse response of an enclosure of acoustical interest,
and then listening to the result of its convolution with a test signal. This
technique is of major interest when studying auditorium acoustics and
their perceptual effects (Vorländer, 2007; Lokki et al., 2012). Aside from
the time-domain and frequency-domain effects, the complex spatial ef-
fects of the auralised space on the sound should be preserved as much as
possible. Appropriately distributing the captured room impulse response
from a set of microphones to loudspeakers or headphones is termed here
as rendering of spatial impulse responses.
Even though non-parametric methods, such as Ambisonics, can be used
for rendering (Gerzon, 1975d; Farina and Ayalon, 2003), this task is well
suited to a parametric analysis-synthesis framework, since the estima-
tion of reflection properties is a well-studied problem in array process-
ing. Many of the techniques that were mentioned on recording and repro-
duction can be applied without modifications to recorded room impulse
responses. However, room impulse responses exhibit a much better-defined
structure, and better results should be expected if this knowledge
is taken into account. For example, early reflections are expected to be
broadband pulses, highly concentrated in time and temporally sparse. It
is also known that the late reverberation part exhibits a spatially diffuse
behaviour with different decay rates at different frequencies.
Even though there is a wealth of literature on the analysis of room
impulse responses, mainly on DoA estimation and visualisation of early
reflections (Merimaa et al., 2001; Rafaely et al., 2007a; Donovan et al.,
2008; Tervo et al., 2011; Sun et al., 2012; Mignot et al., 2013; Morgen-
stern et al., 2015), there have been few proposals for complete analysis-
synthesis frameworks. The best-known ones are the SIRR method (Meri-
maa and Pulkki, 2005; Pulkki and Merimaa, 2006) and the recent Spatial
Decomposition method (SDM) (Tervo et al., 2013). A precursor to both
methods seems to be the work by Endoh et al. (1986) and Yamasaki and
Itow (1989) using inter-microphone correlations or the acoustic intensity
to infer the properties of early reflections, even though that work focused
solely on visualisation and not auralisation. SIRR, similarly to DirAC,
follows the principle of direct-diffuse decomposition of the captured re-
sponse in a B-format signal, with its respective spatialisation using pan-
ning and decorrelation tools. SIRR has been used both for visualisation
(Merimaa et al., 2001) and for a variety of auralisation tasks (Pätynen
and Lokki, 2011; Laird et al., 2014). SDM is a more recent proposal that
utilises cross-correlation and the array geometry to infer reflection prop-
erties. SDM drops the frequency-dependent analysis of SIRR, assuming
broadband reflections, and hence enables a higher temporal resolution
than SIRR. A single microphone signal is then spatialised to the respec-
tive DoA. SDM has been proposed as a visualisation tool of the spatial
response and its temporal development (Pätynen et al., 2013). Addition-
ally, it has been used to auralise a variety of environments for studies
on perceptual attributes of room acoustics, such as concert halls (Kuusi-
nen and Lokki, 2015; Tahvanainen et al., 2015) or music studios (Tervo
et al., 2014). Generation of binaural room impulse responses using B-
format and HRTFs, and following similar principles to SIRR and SDM, is
proposed by Menzer and Faller (2008).
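The geometric core of SDM can be sketched as a least-squares solve: with microphone positions m_j relative to the array centre and per-microphone delays tau_j estimated by cross-correlation within a short sliding window, a broadband plane-wave reflection satisfies tau_j ≈ -(m_j · u)/c. A minimal version (the cross-correlation step itself is assumed already done):

```python
import numpy as np

def doa_from_tdoas(mic_pos, tdoas, c=343.0):
    """Solve the plane-wave model tau_j = -(m_j . u) / c for the unit DoA u
    in the least-squares sense, given microphone positions (J x 3, relative
    to the array centre) and time differences of arrival (J,)."""
    u, *_ = np.linalg.lstsq(np.asarray(mic_pos, dtype=float),
                            -c * np.asarray(tdoas, dtype=float), rcond=None)
    return u / np.linalg.norm(u)
```

With at least three non-coplanar microphone offsets the system is overdetermined in 3D and the fit also averages out small TDOA estimation errors.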
Alternative methods focus exclusively on rendering the diffuse portion
of the spatial response. The method by Romblom et al. (2016) presents
a comprehensive framework for diffuse response analysis and synthesis
from B-format signals, while Menzer and Faller (2010) generate diffuse
binaural room impulse responses using decaying noise sequences matched
to the binaural signal statistics of actual responses. The method by Oksa-
nen et al. (2013) spatialises diffuse reverberation in tunnel environments
using B-format recordings and efficient reverberators.
6. Contributions
6.1 Perceptual distance control
As mentioned in Sec. 3.1.4, perception of distance relies on a variety of
factors, with some cues being relative and others absolute and operat-
ing in different ranges. For typical enclosures and loudspeaker setups,
distance control relies mostly on manipulating the DRR in the sound ma-
terial, creating an effect of the sound source being further than the ac-
tual loudspeakers. Distances closer than the loudspeaker radius cannot
be achieved in this way, as the reverberation of the listening room itself
will determine the minimum DRR that can be achieved by playing a dry
sound from a loudspeaker. Publication VI explores the novel idea of
controlling the DRR of the room itself using a loudspeaker unit that
comprises an array of multiple drivers. By applying beamforming operations
to the drivers to modify the unit's directivity, it is possible to control the received
DRR at the listening position, reducing it or amplifying it compared to
an omnidirectional reference case. Listening test results indicate that it
is possible to perceive sources significantly closer or further than the ar-
ray radius using the above method, and with a single loudspeaker unit.
Effective control of both distance and angle can be achieved using mul-
tiple such units, and by combining the directivity control with standard
panning methods.
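The link between source directivity and received DRR can be illustrated with the textbook statistical room model (direct energy ~ Q/(4πr²), reverberant energy ~ 4/R with room constant R = Sᾱ/(1-ᾱ)); the numbers below are purely illustrative and are not those of Publication VI.

```python
import numpy as np

def received_drr_db(distance, directivity_factor, surface_area, mean_absorption):
    """Direct-to-reverberant ratio at `distance` from a source with on-axis
    directivity factor Q, in the statistical (diffuse-field) room model."""
    room_constant = surface_area * mean_absorption / (1.0 - mean_absorption)
    direct = directivity_factor / (4.0 * np.pi * distance ** 2)
    return 10.0 * np.log10(direct * room_constant / 4.0)

# Omnidirectional source (Q = 1) versus a beam with Q = 8, both at 2 m in a
# room with 100 m^2 of surface at a mean absorption coefficient of 0.3:
drr_omni = received_drr_db(2.0, 1.0, 100.0, 0.3)
drr_beam = received_drr_db(2.0, 8.0, 100.0, 0.3)
```

In this model, raising the on-axis directivity factor from 1 to 8 increases the received DRR by 10·log10(8) ≈ 9 dB, which under an inverse-square reading corresponds to an apparent source roughly √8 ≈ 2.8 times closer.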
6.2 Improving parametric reproduction of B-format recordingsusing array information
First-order DirAC utilises the B-format signal set, which provides the ap-
propriate spatial parameters based on acoustic energy density and inten-
sity, as detailed in Sec. 5.3.1. Since the B-format is generated by beam-
forming filters on a compact spherical microphone array, such as the Sound-
field microphone, there is a frequency range where the beamforming pat-
terns correspond to the ideal and frequency-independent monopole and
dipole B-format components. Around the spatial aliasing limit, the pat-
terns become corrupted by higher-order harmonic components. These
deviations affect both the analysis stage, with biased DoA estimation
and overestimation of diffuseness, and the synthesis stage, with distorted
beamformers towards the loudspeaker positions.
The aim of Publication I is to show that both of these issues can be
alleviated if there is knowledge of the array geometry that produced the
B-format signals, and access to the microphone signals before conversion.
More generally, Publication I indicates that spatial parameter estimation
for parametric reproduction is possible using array processing techniques,
even when estimation with the B-format fails.
6.3 Extending high-frequency estimation of DoA and diffuseness
Following Publication I, the work of Publication IV generalises the estimation
of the mean DoA and the DDR, or diffuseness, to more general
spherical arrays with certain symmetry properties above the spatial alias-
ing frequency. Instead of more elaborate and computationally demanding
methods that can achieve similar results, such as subspace methods, the
focus here is on a simple and light-weight energetic estimator that can
replace the intensity-based estimation of DirAC in the frequency range
where it fails. This estimator is further used in the higher-order DirAC of
Publication V.
6.4 Parametric sound-field manipulation
The work of Publication II presents the potential of parametric repro-
duction methods for creative use in an audio engineering context. This is
achieved by manipulating the analysed spatial parameters in some mean-
ingful way in order to create a certain spatial effect. A series of examples
is presented that can modify either the directional properties of the sound
scene or the balance between the directional and ambience components.
6.5 Enhancement of spaced microphone array recordings
Multichannel recording of music performances commonly uses carefully
optimised spaced microphone arrays of directional microphones intended
to capture both the sound stage and the reverberant sound and deliver
recordings that can be played directly through a surround loudspeaker
setup with one-to-one channel mapping. Even though these recording
techniques provide a pleasant listening experience, they suffer from a
number of issues described in Sec. 3.2.3, mainly directional ambiguities.
Publication III shows that by applying parametric analysis and synthe-
sis on the recordings, and by considering the array properties, the overall
quality of reproduction can be improved.
6.6 Higher-order DirAC
The introduction of the thesis has presented key advantages and limita-
tions of Ambisonics and of parametric methods such as DirAC in spatial
sound reproduction. We consider that the contributions of Publication V
are multiple:
(a) it presents a way for direct-diffuse decomposition of multiple direc-
tional and diffuse components, using the effective principle of DirAC
analysis,
(b) it generalises DirAC to input of higher order than the B-format,
and
(c) it indicates the fertile research ground of linking higher-order
ambisonics with parametric methods, combining the advantages of
the two.
In addition to these contributions, the sector-based analysis principle of
estimating intensity and diffuseness for an angularly-limited region has
potential uses in acoustic analysis applications other than reproduction.
This is not, however, detailed in Publication V and is left for
future research.
6.7 Analysis of spatial room impulse responses
As mentioned in Sec. 5.4, capturing and reproduction of spatial room
impulse responses is well suited for parametric analysis and synthesis.
Current methods rely on the assumption of a single narrowband or a single
broadband reflection component in each analysis window. This assumption
is easily violated when more than one reflection arrives in the same
analysis window.
The contribution of Publication VII is twofold. Firstly, it shows the possibility
of detecting and localising more than one reflection in the time-frequency
domain. Secondly, it shows that for room impulse responses,
where processing happens offline and computational resources are not an
issue, this task can be achieved using full maximum-likelihood (ML) estimation
of the reflection parameters. Additionally, a comparison between the ML
estimation and the more commonly used subspace- and beamforming-based
analysis is presented, with the ML approach outperforming the rest.
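To make the ML idea concrete, a deterministic (concentrated) ML grid search for two simultaneous narrowband arrivals on a uniform linear array is sketched below. Publication VII operates on broadband reflections and general arrays, so this is only a toy analogue; all names and parameters are illustrative.

```python
import numpy as np

def steering(angles, n_mics=8, spacing=0.5):
    """Narrowband steering vectors of a uniform linear array
    (element spacing given in wavelengths)."""
    m = np.arange(n_mics)[:, None]
    return np.exp(2j * np.pi * spacing * m * np.sin(np.atleast_1d(angles))[None, :])

def ml_two_reflections(x, grid):
    """Concentrated ML for two simultaneous arrivals: grid search over
    angle pairs, maximising the energy of the snapshot x projected onto
    the span of the two steering vectors (||P_A x||^2)."""
    best, best_pair = -np.inf, None
    for i, a1 in enumerate(grid):
        for a2 in grid[i + 1:]:
            A = steering(np.array([a1, a2]))
            s, *_ = np.linalg.lstsq(A, x, rcond=None)   # joint LS amplitudes
            score = np.linalg.norm(A @ s) ** 2
            if score > best:
                best, best_pair = score, (a1, a2)
    return best_pair
```

Unlike a conventional beamformer scanned one direction at a time, the joint fit accounts for the interference between the two arrivals, which is why ML remains consistent when reflections overlap in the same window.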
References
A. Abdou and R. W. Guy. A review of objective descriptors for sound diffuseness. Canadian Acoustics, 22(3):43–44, 1994.
J. S. Abel, Y. Hur, Y.-c. Park, and D. H. Youn. A set of microphone array beamformers implementing a constant-amplitude panning law. In 129th Convention of the AES, San Francisco, CA, USA, 2010.
T. D. Abhayapala and D. B. Ward. Theory and design of high-order sound field microphones using spherical microphone array. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, FL, USA, 2002.
J. Ahonen and V. Pulkki. Diffuseness estimation using temporal variation of intensity vectors. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2009.
J. Ahonen, V. Pulkki, and T. Lokki. Teleconference application and B-format microphone array for directional audio coding. In 30th Int. Conf. of the AES: Intelligent Audio Environments, Saariselkä, Finland, 2007.
J. Ahonen, V. Pulkki, F. Kuech, G. Del Galdo, M. Kallinger, and R. Schultz-Amling. Directional audio coding with stereo microphone input. In 126th Convention of the AES, Munich, Germany, 2009.
J. Ahonen, G. Del Galdo, F. Kuech, and V. Pulkki. Directional analysis with microphone array mounted on rigid cylinder for directional audio coding. Journal of the Audio Engineering Society, 60(5):311–324, 2012.
T. Ajdler, C. Faller, L. Sbaiz, and M. Vetterli. Sound field analysis along a circle and its applications to HRTF interpolation. Journal of the Audio Engineering Society, 56(3):156–175, 2008.
A. Alexandridis, A. Griffin, and A. Mouchtaris. Capturing and reproducing spatial audio based on a circular microphone array. Journal of Electrical and Computer Engineering, 2013:1–16, 2013.
V. R. Algazi, C. Avendano, and R. O. Duda. Estimation of a spherical-head model from anthropometry. Journal of the Audio Engineering Society, 49(6):472–479, 2001.
V. R. Algazi, R. O. Duda, and D. M. Thompson. Motion-tracked binaural sound. Journal of the Audio Engineering Society, 52(11):1142–1156, 2004.
J. B. Allen and D. A. Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.
F. Asano, S. Ikeda, M. Ogawa, H. Asoh, and N. Kitawaki. Combined approach of array processing and independent component analysis for blind separation of acoustic signals. IEEE Transactions on Speech and Audio Processing, 11(3):204–215, 2003.
J. Atkins. Robust beamforming and steering of arbitrary beam patterns using spherical arrays. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2011.
C. Avendano and J.-M. Jot. A frequency-domain approach to multichannel upmix. Journal of the Audio Engineering Society, 52(7/8):740–749, 2004.
A. Avni, J. Ahrens, M. Geier, S. Spors, H. Wierstorf, and B. Rafaely. Spatial perception of sound fields recorded by spherical microphone arrays with varying spatial resolution. The Journal of the Acoustical Society of America, 133(5):2711–2721, 2013.
J. Backman. Microphone array beam forming for multichannel recording. In 114th Convention of the AES, Amsterdam, The Netherlands, 2003.
Y.-H. Baek, S.-W. Jeon, Y.-c. Park, and S. Lee. Efficient primary-ambient decomposition algorithm for audio upmix. In 133rd Convention of the AES, San Francisco, CA, USA, 2012.
M. R. Bai, Y.-H. Hua, C.-H. Kuo, and Y.-H. Hsieh. An integrated analysis-synthesis array system for spatial sound fields. The Journal of the Acoustical Society of America, 137(3):1366–1376, 2015.
M. Barron. The subjective effects of first reflections in concert halls—the need for lateral reflections. Journal of Sound and Vibration, 15(4):475–494, 1971.
J.-M. Batke and F. Keiler. Using VBAP-derived panning functions for 3D ambisonics decoding. In 2nd Ambisonics Symposium, Paris, France, 2010.
B. B. Bauer. Phasor analysis of some stereophonic phenomena. The Journal of the Acoustical Society of America, 33(11):1536–1539, 1961.
F. Baumgarte and C. Faller. Binaural cue coding - part I: Psychoacoustic fundamentals and design principles. IEEE Transactions on Speech and Audio Processing, 11(6):509–519, 2003.
R. Baumgartner, H. Pomberger, and M. Frank. Practical implementation of radial filters for ambisonic recordings. In 1st Int. Conf. on Spatial Audio (ICSA), Detmold, Germany, 2011.
R. Baumgartner, P. Majdak, and B. Laback. Modeling sound-source localization in sagittal planes for human listeners. The Journal of the Acoustical Society of America, 136(2):791–802, 2014.
S. Bech and N. Zacharov. Perceptual Audio Evaluation – Theory, Method and Application. John Wiley & Sons, 2007.
D. R. Begault, E. M. Wenzel, and M. R. Anderson. Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source. Journal of the Audio Engineering Society, 49(10):904–916, 2001.
J. Benesty, J. Chen, and Y. Huang. Microphone array signal processing. Springer Topics in Signal Processing. Springer, 2008.
E. M. Benjamin and A. Heller. Assessment of ambisonic system performance using binaural measurements. In 137th Convention of the AES, Los Angeles, CA, USA, 2014.
J. C. Bennett, K. Barker, and F. O. Edeko. A new approach to the assessment of stereophonic sound system performance. Journal of the Audio Engineering Society, 33(5):314–321, 1985.
J. A. Beracoechea, J. Casajus, L. García, L. Ortiz, and S. Torres-Guijarro. Implementation of immersive audio applications using robust adaptive beamforming and wave field synthesis. In 120th Convention of the AES, Paris, France, 2006.
S. Berge and N. Barrett. A new method for B-format to binaural transcoding. In 40th Int. Conf. of the AES: Spatial Audio, Tokyo, Japan, 2010a.
S. Berge and N. Barrett. High angular resolution planewave expansion. In 2nd Int. Symposium on Ambisonics and Spherical Acoustics, Paris, France, 2010b.
A. J. Berkhout, D. de Vries, and P. Vogel. Acoustic control by wave field synthesis. The Journal of the Acoustical Society of America, 93(5):2764–2778, 1993.
B. Bernfeld. Attempts for better understanding of the directional stereophonic listening mechanism. In 44th Convention of the AES, Rotterdam, The Netherlands, 1973.
B. Bernschütz. Microphone arrays and sound field decomposition for dynamic binaural recording. PhD thesis, Technische Universität Berlin, Berlin, Germany, 2016.
B. Bernschütz, A. V. Giner, C. Pörschmann, and J. Arend. Binaural reproduction of plane waves with reduced modal order. Acta Acustica united with Acustica, 100(5):972–983, 2014.
S. Bertet, J. Daniel, E. Parizet, and O. Warusfel. Investigation on localisation accuracy for first and higher order Ambisonics reproduced sound sources. Acta Acustica united with Acustica, 99(4):642–657, 2013.
W. K. Blake and R. V. Waterhouse. The use of cross-spectral density measurements in partially reverberant sound fields. Journal of Sound and Vibration, 54(4):589–599, 1977.
J. Blauert. Sound localization in the median plane. Acta Acustica united with Acustica, 22(4):205–213, 1969.
J. Blauert. Spatial hearing: the psychophysics of human sound localization. MIT Press, 1996.
J. Blauert and W. Lindemann. Spatial mapping of intracranial auditory events for various degrees of interaural coherence. The Journal of the Acoustical Society of America, 79(3):806–813, 1986.
A. D. Blumlein. British patent specification 394,325 (improvements in and relating to sound-transmission, sound-recording and sound-reproducing systems). Journal of the Audio Engineering Society, 6(2):91–130, 1958.
K. Bodlund. A new quantity for comparative measurements concerning the diffusion of stationary sound fields. Journal of Sound and Vibration, 44(2):191–207, 1976.
M. M. Boone, E. N. Verheijen, and P. F. Van Tol. Spatial sound-field reproduction by wave-field synthesis. Journal of the Audio Engineering Society, 43(12):1003–1012, 1995.
C. Borß and R. Martin. An improved parametric model for perception-based design of virtual acoustics. In 35th Int. Conf. of the AES: Audio for Games, London, UK, 2009.
J. S. Bradley and G. A. Soulodre. Objective measures of listener envelopment. The Journal of the Acoustical Society of America, 98(5):2590–2597, 1995.
M. Brandstein and D. Ward. Microphone arrays: Signal processing techniques and applications. Springer, 2013.
S. Braun and M. Frank. Localization of 3D ambisonic recordings and ambisonic virtual sources. In 1st Int. Conf. on Spatial Audio (ICSA), Detmold, Germany, 2011.
J. Breebaart and C. Faller. Spatial Audio Processing: MPEG Surround and Other Applications. Wiley, 2007.
J. Breebaart, S. Disch, C. Faller, J. Herre, G. Hotho, K. Kjörling, F. Myburg, M. Neusinger, W. Oomen, H. Purnhagen, et al. MPEG spatial audio coding/MPEG surround: Overview and current status. In 119th Convention of the AES, New York, NY, USA, 2005a.
J. Breebaart, S. van de Par, A. Kohlrausch, and E. Schuijers. Parametric coding of stereo audio. EURASIP Journal on Applied Signal Processing, 2005:1305–1322, 2005b.
J. Breebaart, J. Herre, C. Jin, K. Kjörling, J. Koppens, J. Plogsties, and L. Villemoes. Multi-channel goes mobile: MPEG surround binaural rendering. In 29th Int. Conf. of the AES: Audio for Mobile and Handheld Devices, Seoul, Korea, 2006.
M. Briand, N. Martin, and D. Virette. Parametric representation of multichannel audio based on principal component analysis. In 120th Convention of the AES, Paris, France, 2006.
A. W. Bronkhorst. Modeling auditory distance perception in rooms. In EAA Forum Acusticum, Sevilla, Spain, 2002.
D. S. Brungart and W. M. Rabinowitz. Auditory localization of nearby sources. Head-related transfer functions. The Journal of the Acoustical Society of America, 106(3):1465–1479, 1999.
D. S. Brungart, N. I. Durlach, and W. M. Rabinowitz. Auditory localization of nearby sources. II. Localization of a broadband source. The Journal of the Acoustical Society of America, 106(4):1956–1968, 1999.
S. Carlile, C. Jin, and V. Van Raad. Continuous virtual auditory space using HRTF interpolation: Acoustic and psychophysical errors. In 1st IEEE Pacific-Rim Conf. on Multimedia, Sydney, Australia, 2000.
J. Catic, S. Santurette, J. M. Buchholz, F. Gran, and T. Dau. The effect of interaural-level-difference fluctuations on the externalization of sound. The Journal of the Acoustical Society of America, 134(2):1232–1241, 2013.
J. Catic, S. Santurette, and T. Dau. The role of reverberation-related binaural cues in the externalization of speech. The Journal of the Acoustical Society of America, 138(2):1154–1167, 2015.
B. S. Cazzolato and J. Ghan. Frequency domain expressions for the estimation of time-averaged acoustic energy density. The Journal of the Acoustical Society of America, 117(6):3750–3756, 2005.
J. Chen, B. D. Van Veen, and K. E. Hecox. External ear transfer function modeling: A beamforming approach. The Journal of the Acoustical Society of America, 92(4):1933–1944, 1992.
B. Cheng, C. Ritz, and I. Burnett. Principles and analysis of the squeezing approach to low bit rate spatial audio coding. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, 2007.
B. Cheng, C. Ritz, and I. Burnett. Binaural reproduction of spatially squeezed surround audio. In 9th Int. Conf. on Signal Processing (ICSP), Beijing, China, 2008.
B. Cheng, C. Ritz, I. Burnett, and X. Zheng. A general compression approach to multi-channel three-dimensional audio. IEEE Transactions on Audio, Speech, and Language Processing, 21(8):1676–1688, 2013.
C. H. Choi, J. Ivanic, M. S. Gordon, and K. Ruedenberg. Rapid and stable determination of rotation matrices between spherical harmonics by direct recursion. Journal of Chemical Physics, 111(19):8825–8831, 1999.
M. Cobos and J. J. Lopez. Resynthesis of sound scenes on wave-field synthesis from stereo mixtures using sound source separation algorithms. Journal of the Audio Engineering Society, 57(3):91–110, 2009.
M. Cobos, J. J. Lopez, and S. Spors. A sparsity-based approach to 3D binaural sound synthesis using time-frequency array processing. EURASIP Journal on Advances in Signal Processing, 2010(1):1–13, 2010.
M. Cobos, S. Spors, J. Ahrens, and J. J. Lopez. On the use of small microphone arrays for wave field synthesis auralization. In 45th Int. Conf. of the AES: Applications of Time-Frequency Processing in Audio, Helsinki, Finland, 2012.
P. Coleman, P. Jackson, and J. Francombe. Audio object separation using microphone array beamforming. In 138th Convention of the AES, Warsaw, Poland, 2015.
D. H. Cooper and T. Shiga. Discrete-matrix multichannel stereo. Journal of the Audio Engineering Society, 20(5):346–360, 1972.
H. Cox. Spatial correlation in arbitrary noise fields with application to ambient sea noise. The Journal of the Acoustical Society of America, 54(5):1289–1301, 1973.
J. Daniel. Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia [Representation of acoustic fields, with application to the transmission and reproduction of complex sound scenes in a multimedia context]. PhD thesis, University of Paris VI, Paris, France, 2000.
J. Daniel. Spatial sound encoding including near field effect: Introducing distance coding filters and a viable, new ambisonic format. In 23rd Int. Conf. of the AES: Signal Processing in Audio Recording and Reproduction, Helsingør, Denmark, 2003.
J. Daniel, J.-B. Rault, and J.-D. Polack. Acoustic properties and perceptive implications of stereophonic phenomena. In 16th Int. Conf. of the AES: Spatial Sound Reproduction, Rovaniemi, Finland, 1999.
W. de Bruijn, A. Härmä, and S. van de Par. On the use of directional loudspeakers to create a sound source close to the listener. In 124th Convention of the AES, Amsterdam, The Netherlands, 2008.
E. De Sena, H. Hacihabiboglu, and Z. Cvetkovic. Analysis and design of multichannel systems for perceptual sound field reconstruction. IEEE Transactions on Audio, Speech, and Language Processing, 21(8):1653–1665, 2013.
D. de Vries and M. M. Boone. Wave field synthesis and analysis using array technology. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 1999.
G. Del Galdo, F. Kuech, M. Prus, and O. Thiergart. Three-dimensional sound field analysis with Directional Audio Coding based on signal adaptive parameter estimators. In 40th Int. Conf. of the AES: Spatial Audio, Tokyo, Japan, 2010.
G. Del Galdo, M. Taseska, O. Thiergart, J. Ahonen, and V. Pulkki. The diffuse sound field in energetic analysis. The Journal of the Acoustical Society of America, 131(3):2141–2151, 2012.
S. Delikaris-Manias, J. Vilkamo, and V. Pulkki. Parametric binaural rendering utilizing compact microphone arrays. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Sydney, Australia, 2015.
A. O'Donovan, R. Duraiswami, and D. Zotkin. Imaging concert hall acoustics using visual and audio cameras. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, NV, USA, 2008.
P. F. Driessen and Y. Li. An unsupervised adaptive filtering approach of 2-to-5 channel upmix. In 119th Convention of the AES, New York, NY, USA, 2005.
J. R. Driscoll and D. M. Healy. Computing Fourier transforms and convolutions on the 2-sphere. Advances in Applied Mathematics, 15(2):202–250, 1994.
R. Duraiswami, D. N. Zotkin, Z. Li, E. Grassi, N. A. Gumerov, and L. S. Davis. System for capturing of high order spatial audio using spherical microphone array and binaural head-tracked playback over headphones with head related transfer function cue. In 119th Convention of the AES, New York, NY, USA, 2005.
K. Endoh, Y. Yamasaki, and T. Itow. Grasp and development of spatial informations in a room by closely located four-point microphone method. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Tokyo, Japan, 1986.
G. Enzner. 3D-continuous-azimuth acquisition of head-related impulse responses using multi-channel adaptive filtering. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2009.
N. Epain and J. Daniel. Improving spherical microphone arrays. In 124th Convention of the AES, Amsterdam, The Netherlands, 2008.
N. Epain and C. T. Jin. Super-resolution sound field imaging with sub-space pre-processing. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
N. Epain, C. T. Jin, and F. Zotter. Ambisonic decoding with constant angular spread. Acta Acustica united with Acustica, 100(5):928–936, 2014.
M. J. Evans, J. A. S. Angus, and A. I. Tew. Analyzing head-related transfer function measurements using surface spherical harmonics. The Journal of the Acoustical Society of America, 104(4):2400–2411, 1998.
F. Fahy. Sound Intensity. CRC Press, 2002.
C. Faller. Multiple-loudspeaker playback of stereo signals. Journal of the Audio Engineering Society, 54(11):1051–1064, 2006.
C. Faller. Microphone front-ends for spatial audio coders. In 125th Convention of the AES, San Francisco, CA, USA, 2008.
C. Faller and J. Breebaart. Binaural reproduction of stereo signals using upmixing and diffuse rendering. In 131st Convention of the AES, New York, NY, USA, 2011.
C. Faller and M. Kolundžija. Design and limitations of non-coincidence correction filters for soundfield microphones. In 126th Convention of the AES, Munich, Germany, 2009.
C. Faller and J. Merimaa. Source localization in complex listening situations: Selection of binaural cues based on interaural coherence. The Journal of the Acoustical Society of America, 116(5):3075–3089, 2004.
A. Farina and R. Ayalon. Recording concert hall acoustics for posterity. In 24th Int. Conf. of the AES: Multichannel Audio, The New Reality, Banff, Canada, 2003.
A. Farina, A. Amendola, L. Chiesi, A. Capra, and S. Campanini. Spatial PCM sampling: A new method for sound recording and playback. In 52nd Int. Conf. of the AES: Sound Field Control, Guildford, UK, 2013.
K. Farrar. Soundfield microphone. Wireless World, 85(1526):48–50, 1979.
P. Fellgett. Ambisonics. Part one: General system description. Studio Sound, 17(8):20–22, 1975.
J. Fels and M. Vorländer. Anthropometric parameters influencing HRTFs. Acta Acustica united with Acustica, 95(2):331–342, 2009.
H. Fletcher. Auditory patterns. Reviews of Modern Physics, 12(1):47, 1940.
A. Franck, F. M. Fazi, and F. Melchior. Optimization-based reproduction of diffuse audio objects. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2015.
M. Frank. Phantom sources using multiple loudspeakers in the horizontal plane. PhD thesis, Institute of Electronic Music and Acoustics (IEM), University of Music and Performing Arts, Graz, Austria, 2013a.
M. Frank. Source width of frontal phantom sources: Perception, measurement, and modeling. Archives of Acoustics, 38(3):311–319, 2013b.
M. Frank. Localization using different amplitude-panning methods in the frontal horizontal plane. In EAA Joint Symposium on Auralization and Ambisonics, Berlin, Germany, 2014.
E. Gallo, N. Tsingos, and G. Lemaitre. 3D-audio matting, postediting, and rerendering from field recordings. EURASIP Journal on Advances in Signal Processing, 2007(1):1–16, 2007.
H. Gamper. Head-related transfer function interpolation in azimuth, elevation, and distance. The Journal of the Acoustical Society of America, 134(6):EL547–EL553, 2013.
H. Gamper, M. R. P. Thomas, and I. J. Tashev. Anthropometric parameterisation of a spherical scatterer ITD model with arbitrary ear angles. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2015.
P.-A. Gauthier, É. Chambatte, C. Camier, Y. Pasco, and A. Berry. Beamforming regularization, scaling matrices, and inverse problems for sound field extrapolation and characterization: Part I — Theory. Journal of the Audio Engineering Society, 62(3):77–98, 2014.
M. A. Gerzon. Periphony: With-height sound reproduction. Journal of the Audio Engineering Society, 21(1):2–10, 1973.
M. A. Gerzon. Ambisonics. Part two: Studio techniques. Studio Sound, 17(8):24–26, 1975a.
M. A. Gerzon. The design of precisely coincident microphone arrays for stereo and surround sound. In 50th Convention of the AES, London, UK, 1975b.
M. A. Gerzon. A geometric model for two-channel four-speaker matrix stereo system. Journal of the Audio Engineering Society, 23(2):98–106, 1975c.
M. A. Gerzon. Recording concert hall acoustics for posterity. Journal of the Audio Engineering Society, 23(7):569–571, 1975d.
M. A. Gerzon. General metatheory of auditory localisation. In 92nd Convention of the AES, Vienna, Austria, 1992a.
M. A. Gerzon. Panpot laws for multispeaker stereo. In 92nd Convention of the AES, Vienna, Austria, 1992b.
M. A. Gerzon. Psychoacoustic decoders for multispeaker stereo and surround sound. In 93rd Convention of the AES, San Francisco, CA, USA, 1992c.
M. A. Gerzon and G. J. Barton. Ambisonic decoders for HDTV. In 92nd Convention of the AES, Vienna, Austria, 1992.
J. J. Gibson, R. M. Christensen, and A. L. R. Limberg. Compatible FM broadcasting of panoramic sound. Journal of the Audio Engineering Society, 20(10):816–822, 1972.
M. Goodwin and J.-M. Jot. Spatial Audio Scene Coding. In 125th Convention of the AES, San Francisco, CA, USA, 2008.
M. M. Goodwin and J.-M. Jot. Binaural 3-D audio rendering based on Spatial Audio Scene Coding. In 123rd Convention of the AES, New York, NY, USA, 2007.
B. N. Gover, J. G. Ryan, and M. R. Stinson. Microphone array measurement system for analysis of directional and spatial variations of sound fields. The Journal of the Acoustical Society of America, 112(5):1980–1991, 2002.
M. Gräf and D. Potts. On the computation of spherical designs by a new optimization approach based on fast spherical Fourier transforms. Numerische Mathematik, 119(4):699–724, 2011.
D. Griesinger. The psychoacoustics of apparent source width, spaciousness and envelopment in performance spaces. Acta Acustica united with Acustica, 83(4):721–731, 1997.
P. Guillon, T. Guignard, and R. Nicol. HRTF customization by frequency scaling and rotation shift based on a new morphological matching method. In 125th Convention of the AES, San Francisco, CA, USA, 2008.
B. Günel. On the statistical distributions of active intensity directions. Journal of Sound and Vibration, 332(20):5207–5216, 2013.
B. Günel and H. Hacihabiboglu. Sound source localization: Conventional methods and intensity vector direction exploitation. In W. Wang, editor, Machine Audition: Principles, Algorithms and Systems, chapter 6, pages 126–161. IGI Global, 2011.
B. Günel, H. Hacihabiboglu, and A. M. Kondoz. Acoustic source separation of convolutive mixtures based on intensity vector statistics. IEEE Transactions on Audio, Speech, and Language Processing, 16(4):748–756, 2008.
H. Haas. The influence of a single echo on the audibility of speech. Journal of the Audio Engineering Society, 20(2):146–159, 1972.
E. A. Habets and J. Benesty. A two-stage beamforming approach for noise reduction and dereverberation. IEEE Transactions on Audio, Speech, and Language Processing, 21(5):945–958, 2013.
H. Hacihabiboglu. Theoretical analysis of open spherical microphone arrays for acoustic intensity measurements. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):465–476, 2014.
H. Hacihabiboglu and Z. Cvetkovic. Panoramic recording and reproduction of multichannel audio using a circular microphone array. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2009.
H. Hagiwara, Y. Takahashi, and K. Miyoshi. Wave front reconstruction using the spatial covariance matrices method. Journal of the Audio Engineering Society, 60(12):1038–1050, 2013.
R. H. Hardin and N. J. Sloane. McLaren's improved snub cube and other new spherical designs in three dimensions. Discrete & Computational Geometry, 15(4):429–441, 1996.
A. Härmä, R. Van Dinther, T. Svedström, M. Park, and J. Koppens. Personalization of headphone spatialization based on the relative localization error in an auditory gaming interface. In 132nd Convention of the AES, Budapest, Hungary, 2012.
A. Härmä, M. Park, and A. Kohlrausch. Data-driven modeling of the spatial sound experience. In 136th Convention of the AES, Berlin, Germany, 2014.
J. He, E.-L. Tan, and W.-S. Gan. Linear estimation based primary-ambient extraction for stereo audio signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):505–517, 2014.
A. Heller, R. Lee, and E. Benjamin. Is my decoder ambisonic? In 125th Convention of the AES, San Francisco, CA, USA, 2008.
G. B. Henning. Detectability of interaural delay in high-frequency complex waveforms. The Journal of the Acoustical Society of America, 55(1):84–90, 1974.
J. Herre, K. Kjörling, J. Breebaart, C. Faller, S. Disch, H. Purnhagen, J. Koppens, J. Hilpert, J. Rödén, W. Oomen, et al. MPEG surround – the ISO/MPEG standard for efficient and compatible multichannel audio coding. Journal of the Audio Engineering Society, 56(11):932–955, 2008.
J. Herre, C. Falch, D. Mahne, G. Del Galdo, M. Kallinger, and O. Thiergart. Interactive teleconferencing combining Spatial Audio Object Coding and DirAC technology. Journal of the Audio Engineering Society, 59(12):924–935, 2012.
J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties. MPEG-H 3D Audio—the new standard for coding of immersive spatial audio. IEEE Journal of Selected Topics in Signal Processing, 9(5):770–779, 2015.
R. Hickling, W. Wei, and R. Raspet. Finding the direction of a sound source using a vector sound-intensity probe. The Journal of the Acoustical Society of America, 94(4):2408–2412, 1993.
T. Hidaka, L. L. Beranek, and T. Okano. Interaural cross-correlation, lateral fraction, and low- and high-frequency sound levels as measures of acoustical quality in concert halls. The Journal of the Acoustical Society of America, 98(2):988–1007, 1995.
M. Hiipakka, M. Takanen, S. Delikaris-Manias, A. Politis, and V. Pulkki. Localization in binaural reproduction with insert headphones. In 132nd Convention of the AES, Budapest, Hungary, 2012.
T. Hirvonen and V. Pulkki. Center and spatial extent of auditory events as caused by multiple sound sources in frequency-dependent directions. Acta Acustica united with Acustica, 92(2):320–330, 2006.
T. Hirvonen and V. Pulkki. Perceived spatial distribution and width of horizontal ensemble of independent noise signals as function of waveform and sample length. In 124th Convention of the AES, Amsterdam, The Netherlands, 2008.
V. Hohmann. Frequency analysis and synthesis using a Gammatone filterbank. Acta Acustica united with Acustica, 88(3):433–442, 2002.
H. Hu, L. Zhou, J. Zhang, H. Ma, and Z. Wu. HRTF personalization based on multiple regression analysis. In Int. Conf. on Computational Intelligence and Security (CIS), Guangzhou, China, 2006.
E. Hulsebos, D. de Vries, and E. Bourdillat. Improved microphone array configurations for auralization of sound fields by wave-field synthesis. Journal of the Audio Engineering Society, 50(10):779–790, 2002.
J. Huopaniemi, N. Zacharov, and M. Karjalainen. Objective and subjective evaluation of head-related transfer function filter design. Journal of the Audio Engineering Society, 47(4):218–239, 1999.
Y. Hur, J. S. Abel, Y.-C. Park, and D. H. Youn. Techniques for synthetic reconfiguration of microphone arrays. Journal of the Audio Engineering Society, 59(6):404–418, 2011.
A. Hurtado-Huyssen and J.-D. Polack. Acoustic intensity in multichannel rendering systems. In 119th Convention of the AES, New York, NY, USA, 2005.
T. Huttunen, E. Seppälä, O. Kirkeby, A. Kärkkäinen, and L. Kärkkäinen. Simulation of the transfer function for a head-and-torso model over the entire audible frequency range. Journal of Computational Acoustics, 15(4):429–448, 2007.
T. Huttunen, A. Vanne, S. Harder, R. R. Paulsen, S. King, L. Perry-Smith, and L. Kärkkäinen. Rapid generation of personalized HRTFs. In 55th Int. Conf. of the AES: Spatial Audio, Helsinki, Finland, 2014.
International Telecommunications Union (ITU). Recommendation ITU-R BS.1534-1: Method for the subjective assessment of intermediate quality level of coding systems (MUSHRA), 2001.
R. Irwan and R. M. Aarts. Two-to-five channel sound processing. Journal of the Audio Engineering Society, 50(11):914–926, 2002.
Y. Iwaya. Individualization of HRTFs with tournament-style listening test: Listening with other's ears. Acoustical Science and Technology, 27(6):340–343, 2006.
Y. Iwaya and Y. Suzuki. Numerical analysis of the effects of pinna shape and position on the characteristics of head-related transfer functions. The Journal of the Acoustical Society of America, 123(5):3297–3297, 2008.
F. Jacobsen. The diffuse sound field. Technical Report 27, The Acoustics Laboratory, Technical University of Denmark (DTU), Lyngby, Denmark, 1979.
F. Jacobsen. Active and reactive, coherent and incoherent sound fields. Journal of Sound and Vibration, 130(3):493–507, 1989.
F. Jacobsen. Sound field indicators: Useful tools. Noise Control Engineering Journal, 35(1):37–46, 1990.
F. Jacobsen. Sound Intensity. In T. D. Rossing, editor, Springer Handbook of Acoustics, chapter 25, pages 1053–1075. Springer, 2007.
D. P. Jarrett, E. A. Habets, and P. A. Naylor. 3D source localization in the spherical harmonic domain using a pseudointensity vector. In 18th European Signal Processing Conference (EUSIPCO), Aalborg, Denmark, 2010.
D. P. Jarrett, O. Thiergart, E. A. Habets, and P. A. Naylor. Coherence-based diffuseness estimation in the spherical harmonic domain. In IEEE 27th Convention of Electrical & Electronics Engineers in Israel (IEEEI), Eilat, Israel, 2012.
S.-W. Jeon, Y.-C. Park, and D. H. Youn. Auditory distance rendering based on ICPD control for stereophonic 3D audio system. IEEE Signal Processing Letters, 22(5):529–533, 2015.
C. Jin, P. Leong, J. Leung, A. Corderoy, and S. Carlile. Enabling individualized virtual auditory space using morphological measurements. In 1st IEEE Pacific-Rim Conf. on Multimedia (PCM), Sydney, Australia, 2000.
C. T. Jin, N. Epain, and A. Parthy. Design, optimization and evaluation of a dual-radius spherical microphone array. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):193–204, 2014.
D. H. Johnson and D. E. Dudgeon. Array Signal Processing: Concepts and Techniques. Simon & Schuster, 1992.
J.-M. Jot, V. Larcher, and J.-M. Pernaux. A comparative study of 3-D audio encoding and rendering techniques. In 16th Int. Conf. of the AES: Spatial Sound Reproduction, Rovaniemi, Finland, 1999.
S. Julstrom. An intuitive view of coincident stereo microphones. In 89th Convention of the AES, Los Angeles, CA, USA, 1990.
Y. Kahana and P. Nelson. Boundary element simulations of the transfer function of human heads and baffled pinnae using accurate geometric models. Journal of Sound and Vibration, 300(3):552–579, 2007.
M. Kallinger, G. Del Galdo, F. Kuech, D. Mahne, and R. Schultz-Amling. Spatial filtering using Directional Audio Coding parameters. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, 2009.
J. Käsbach, M. Marschall, B. Epp, and T. Dau. The relation between perceived apparent source width and interaural cross-correlation in sound reproduction spaces with low reverberation. In 40th Italian Annual Conf. on Acoustics (AIA) and 39th German Annual Conf. on Acoustics (DAGA), Merano, Italy, 2013.
J. Käsbach, T. May, G. Oskarsdottir, C.-H. Jeong, and J. Chang. The effect of interaural-time-difference fluctuations on apparent source width. In Forum Acusticum, Krakow, Poland, 2014.
R. Kassier, H.-K. Lee, T. Brookes, and F. Rumsey. An informal comparison between surround-sound microphone techniques. In 118th Convention of the AES, Barcelona, Spain, 2005.
B. Katz. Boundary element method calculation of individual HRTF: I. Rigid model calculation. The Journal of the Acoustical Society of America, 110(5):2440–2448, 2001.
G. Kearney, M. Gorzel, H. Rice, and F. Boland. Distance perception in interactive virtual acoustic environments using first and higher order ambisonic sound fields. Acta Acustica united with Acustica, 98(1):61–71, 2012.
G. S. Kendall. The decorrelation of audio signals and its impact on spatial imagery. Computer Music Journal, 19(4):71–87, 1995.
R. A. Kennedy, P. Sadeghi, T. D. Abhayapala, and H. M. Jones. Intrinsic limits of dimensionality and richness in random multipath fields. IEEE Transactions on Signal Processing, 55(6):2542–2556, 2007.
D. J. Kistler and F. L. Wightman. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. The Journal of the Acoustical Society of America, 91(3):1637–1647, 1992.
T. Koski, V. Sivonen, and V. Pulkki. Measuring speech intelligibility in noisy environments reproduced with parametric spatial audio. In 135th Convention of the AES, New York, NY, USA, 2013.
W. Kreuzer, P. Majdak, and Z. Chen. Fast multipole boundary element method to calculate HRTFs for a wide frequency range. The Journal of the Acoustical Society of America, 126(3):1280–1290, 2009.
M. Kronlachner and F. Zotter. Spatial transformations for the enhancement of ambisonic recordings. In 2nd Int. Conf. on Spatial Audio (ICSA), Erlangen, Germany, 2014.
A. Kulkarni and H. S. Colburn. Infinite-impulse-response models of the head-related transfer function. The Journal of the Acoustical Society of America, 115(4):1714–1728, 2004.
A. Kuusinen and T. Lokki. Investigation of auditory distance perception and preferences in concert halls by using virtual acoustics. The Journal of the Acoustical Society of America, 138(5):3148–3159, 2015.
A. Laborie, R. Bruno, and S. Montoya. A new comprehensive approach of surround sound recording. In 114th Convention of the AES, Amsterdam, The Netherlands, 2003.
I. Laird, D. Murphy, and P. Chapman. Comparison of spatial audio techniques for use in stage acoustic laboratory experiments. In EAA Joint Symposium on Auralization and Ambisonics, Berlin, Germany, 2014.
M.-V. Laitinen. Converting two-channel stereo signals to B-format for Directional Audio Coding reproduction. In 137th Convention of the AES, Los Angeles, CA, USA, 2014.
M.-V. Laitinen and V. Pulkki. Binaural reproduction for Directional Audio Coding. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2009.
M.-V. Laitinen and V. Pulkki. Converting 5.1 audio recordings to B-format for Directional Audio Coding reproduction. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 2011.
M.-V. Laitinen and V. Pulkki. Utilizing instantaneous direct-to-reverberant ratio in parametric spatial audio coding. In 133rd Convention of the AES, San Francisco, CA, USA, 2012.
M.-V. Laitinen, F. Kuech, S. Disch, and V. Pulkki. Reproducing applause-type signals with Directional Audio Coding. Journal of the Audio Engineering Society, 59(1/2):29–43, 2011.
M.-V. Laitinen, T. Pihlajamäki, C. Erkut, and V. Pulkki. Parametric time-frequency representation of spatial sound in virtual worlds. ACM Transactions on Applied Perception, 9(2):8:1–20, 2012a.
M.-V. Laitinen, T. Pihlajamäki, S. Lösler, and V. Pulkki. Influence of resolution of head tracking in synthesis of binaural audio. In 132nd Convention of the AES, Budapest, Hungary, 2012b.
M.-V. Laitinen, S. Disch, and V. Pulkki. Sensitivity of human hearing to changes in phase spectrum. Journal of the Audio Engineering Society, 61(11):860–877, 2013.
M.-V. Laitinen, J. Vilkamo, K. Jussila, A. Politis, and V. Pulkki. Gain normalization in amplitude panning as a function of frequency and room reverberance. In 55th Int. Conf. of the AES: Spatial Audio, Helsinki, Finland, 2014.
M.-V. Laitinen, A. Walther, J. Plogsties, and V. Pulkki. Auditory distance rendering using a standard 5.1 loudspeaker layout. In 139th Convention of the AES, New York, NY, USA, 2015.
V. Larcher, O. Warusfel, J.-M. Jot, and J. Guyard. Study and comparison of efficient methods for 3-D audio spatialization based on linear decomposition of HRTF data. In 108th Convention of the AES, Paris, France, 2000.
P. Lecomte, P.-A. Gauthier, C. Langrenne, A. Garcia, and A. Berry. On the use of a Lebedev grid for Ambisonics. In 139th Convention of the AES, New York, NY, USA, 2015.
S.-L. Lee, K.-Y. Han, S.-R. Lee, and K.-M. Sung. Reduction of sound localization error for surround sound system using enhanced constant power panning law. IEEE Transactions on Consumer Electronics, 50(3):941–944, 2004.
Y. U. Lee, J. Choi, I. Song, and S. R. Lee. Distributed source modeling and direction-of-arrival estimation techniques. IEEE Transactions on Signal Processing, 45(4):960–969, 1997.
D. Levin, E. A. Habets, and S. Gannot. On the angular error of intensity vector based direction of arrival estimation in reverberant sound fields. The Journal of the Acoustical Society of America, 128(4):1800–1811, 2010.
D. Levin, E. A. Habets, and S. Gannot. Maximum likelihood estimation of direction of arrival using an acoustic vector-sensor. The Journal of the Acoustical Society of America, 131(2):1240–1248, 2012.
Z. Li and R. Duraiswami. Headphone-based reproduction of 3D auditory scenes captured by spherical/hemispherical microphone arrays. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, 2006.
Z. Li and R. Duraiswami. Flexible and optimal design of spherical microphone arrays for beamforming. IEEE Transactions on Audio, Speech, and Language Processing, 15(2):702–714, 2007.
Z. Li and R. Duraiswami. Hemispherical microphone arrays for sound capture and beamforming. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2005.
S. P. Lipshitz. Stereo microphone techniques: Are the purists wrong? Journal of the Audio Engineering Society, 34(9):716–744, 1986.
R. Y. Litovsky, H. S. Colburn, W. A. Yost, and S. J. Guzman. The precedence effect. The Journal of the Acoustical Society of America, 106(4):1633–1654, 1999.
A. D. Little, D. H. Mershon, and P. H. Cox. Spectral content as a cue to perceived auditory distance. Perception, 21(3):405–416, 1992.
T. Lokki, J. Pätynen, A. Kuusinen, and S. Tervo. Disentangling preference ratings of concert hall acoustics using subjective sensory profiles. The Journal of the Acoustical Society of America, 132(5):3148–3161, 2012.
S. Lösler and F. Zotter. Comprehensive radial filter design for practical higher-order ambisonic recording. In 41st German Annual Conf. on Acoustics (DAGA), Nuremberg, Germany, 2015.
B. F. Lounsbury and R. A. Butler. Estimation of distances of recorded sounds presented through headphones. Scandinavian Audiology, 8(3):145–149, 1979.
R. A. Lutfi and W. Wang. Correlational analysis of acoustic cues for the discrimination of auditory motion. The Journal of the Acoustical Society of America, 106(2):919–928, 1999.
C. Mac Cabe and D. J. Furlong. Virtual imaging capabilities of surround sound systems. Journal of the Audio Engineering Society, 42(1/2):38–49, 1994.
E. A. Macpherson. A computer model of binaural localization for stereo-imaging measurement. In 87th Convention of the AES, New York, NY, USA, 1989.
E. A. Macpherson and A. T. Sabin. Vertical-plane sound localization with distorted spectral cues. Hearing Research, 306:76–92, 2013.
Y. Makita. On the directional localization of sound in the stereophonic sound field. EBU Review, 73(Part A):1536–1539, 1962.
R. Mason, T. Brookes, and F. Rumsey. Integration of measurements of interaural cross-correlation coefficient and interaural time difference within a single model of perceived source width. In 117th Convention of the AES, San Francisco, CA, USA, 2004.
R. Mason, T. Brookes, and F. Rumsey. Frequency dependency of the relationship between perceived auditory source width and the interaural cross-correlation coefficient for time-invariant stimuli. The Journal of the Acoustical Society of America, 117(3):1337–1350, 2005.
I. McCowan, M. Lincoln, and I. Himawan. Microphone array shape calibration in diffuse noise fields. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):666–670, 2008.
F. Melchior, O. Thiergart, G. Del Galdo, D. de Vries, and S. Brix. Dual radius spherical cardioid microphone arrays for binaural auralization. In 127th Convention of the AES, New York, NY, USA, 2009.
F. Menzer and C. Faller. Obtaining binaural room impulse responses from B-format impulse responses. In 125th Convention of the AES, San Francisco, CA, USA, 2008.
F. Menzer and C. Faller. Investigations on an early-reflection-free model for BRIRs. Journal of the Audio Engineering Society, 58(9):709–723, 2010.
J. Merimaa. Energetic sound field analysis of stereo and multichannel loudspeaker reproduction. In 123rd Convention of the AES, New York, NY, USA, 2007.
J. Merimaa and V. Pulkki. Spatial impulse response rendering I: Analysis and synthesis. Journal of the Audio Engineering Society, 53(12):1115–1127, 2005.
J. Merimaa, T. Lokki, T. Peltonen, and M. Karjalainen. Measurements, analysis, and visualization of directional room responses. In 111th Convention of the AES, New York, NY, USA, 2001.
J. Merimaa, M. M. Goodwin, and J.-M. Jot. Correlation-based ambience extraction from stereo recordings. In 123rd Convention of the AES, New York, NY, USA, 2007.
A. Meshram, R. Mehra, and D. Manocha. Efficient HRTF computation using adaptive rectangular decomposition. In 55th Int. Conf. of the AES: Spatial Audio, Helsinki, Finland, 2014.
J. Meyer. Beamforming for a circular microphone array mounted on spherically shaped objects. The Journal of the Acoustical Society of America, 109(1):185–193, 2001.
J. Meyer and G. W. Elko. Spherical microphone arrays for 3D sound recording. In Y. Huang and J. Benesty, editors, Audio Signal Processing for Next-Generation Multimedia Communication Systems, chapter 3, pages 67–89. Kluwer Academic Publishers, 2004.
J. Middlebrooks. Individual differences in external-ear transfer functions reduced by scaling in frequency. The Journal of the Acoustical Society of America, 106(3):1480–1492, 1999.
R. Mignot, L. Daudet, and F. Ollivier. Room reverberation reconstruction: Interpolation of the early part using compressed sensing. IEEE Transactions on Audio, Speech, and Language Processing, 21(11):2301–2312, 2013.
P. Mokhtari, H. Takemoto, R. Nishimura, and H. Kato. Computer simulation of HRTFs for personalization of 3D audio. In 2nd Int. Symposium on Universal Communication (ISUC), Osaka, Japan, 2008.
A. H. Moore, C. Evers, P. A. Naylor, D. L. Alon, and B. Rafaely. Direction of arrival estimation using pseudo-intensity vectors with direct-path dominance test. In 23rd European Signal Processing Conf. (EUSIPCO), Nice, Italy, 2015.
B. C. J. Moore. Hearing. Academic Press, 1995.
B. C. J. Moore and B. R. Glasberg. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. The Journal of the Acoustical Society of America, 74(3):750–753, 1983.
D. Moore and J. Wakefield. The design of improved first order ambisonic decoders by the application of range removal and importance in a heuristic search algorithm. In 31st Int. Conf. of the AES: New Directions in High Resolution Audio, London, UK, 2007.
S. Moreau, S. Bertet, and J. Daniel. 3D sound field recording with higher order Ambisonics – objective measurements and validation of spherical microphone. In 120th Convention of the AES, Paris, France, 2006.
H. Morgenstern, B. Rafaely, and F. Zotter. Theory and investigation of acoustic multiple-input multiple-output systems based on spherical arrays in a room. The Journal of the Acoustical Society of America, 138(5):2998–3009, 2015.
M. Morimoto and Z. Maekawa. Auditory spaciousness and envelopment. In 13th Int. Congress on Acoustics (ICA), Belgrade, Serbia, 1989.
A. D. Musicant and R. A. Butler. The influence of pinnae-based spectral cues on sound localization. The Journal of the Acoustical Society of America, 75(4):1195–1200, 1984.
T. Necciari, P. Balazs, N. Holighaus, and P. L. Sondergaard. The ERBlet transform: An auditory-based time-frequency representation with perfect reconstruction. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
H. Nélisse and J. Nicolas. Characterization of a diffuse field in a reverberant room. The Journal of the Acoustical Society of America, 101(6):3517–3524, 1997.
R. Nicol. Binaural technology. Audio Engineering Society, 2010.
J. Nikunen and T. Virtanen. Direction of arrival based spatial covariance model for blind sound source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(3):727–739, 2014.
J. Nikunen, T. Virtanen, and M. Vilermo. Multichannel audio upmixing based on non-negative tensor factorization representation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2011.
J. Nikunen, A. Diment, T. Virtanen, and M. Vilermo. Binaural rendering of microphone array captures based on source separation. Speech Communication, 2015.
K. Niwa, T. Nishino, and K. Takeda. Encoding large array signals into a 3D sound field representation for selective listening point audio based on blind source separation. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, NV, USA, 2008.
M. Noisternig, T. Musil, A. Sontacchi, and R. Höldrich. 3D binaural sound reproduction using a virtual ambisonic approach. In IEEE Int. Symposium on Virtual Environments, Human-Computer Interfaces and Measurement Systems (VECIMS), Lugano, Switzerland, 2003.
S. Oksanen, J. Parker, A. Politis, and V. Välimäki. A directional diffuse reverberation model for excavated tunnels in rock. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
K. Ono, T. Nishiguchi, K. Matsui, and K. Hamasaki. Portable spherical microphone for Super Hi-Vision 22.2 multichannel audio. In 135th Convention of the AES, New York, NY, USA, 2013.
R. Orban. A rational technique for synthesizing pseudo-stereo from monophonic sources. Journal of the Audio Engineering Society, 18(2):157–164, 1970.
R. D. Patterson, M. H. Allerhand, and C. Giguere. Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform. The Journal of the Acoustical Society of America, 98(4):1890–1894, 1995.
J. Pätynen and T. Lokki. Evaluation of concert hall auralization with virtual symphony orchestra. Building Acoustics, 18(3/4):349–366, 2011.
J. Pätynen, S. Tervo, and T. Lokki. Analysis of concert hall acoustics via visualizations of time-frequency and spatiotemporal responses. The Journal of the Acoustical Society of America, 133(2):842–857, 2013.
J.-M. Pernaux, P. Boussard, and J.-M. Jot. Virtual sound source positioning and mixing in 5.1 implementation on the real-time system Genesis. In 1st Int. Conf. on Digital Audio Effects (DAFx-98), Barcelona, Spain, 1998.
S. Perrett and W. Noble. The effect of head rotations on vertical plane sound localization. The Journal of the Acoustical Society of America, 102(4):2325–2332, 1997.
N. Peters, D. Sen, M.-Y. Kim, O. Wuebbolt, and S. M. Weiss. Scene-based audio implemented with Higher Order Ambisonics (HOA). In SMPTE Annual Technical Conf. and Exhibition, Hollywood, CA, USA, 2015.
P. M. Peterson. Simulating the response of multiple microphones to a single acoustic source in a reverberant room. The Journal of the Acoustical Society of America, 80(5):1527–1529, 1986.
T. Pihlajamäki and V. Pulkki. Synthesis of complex sound scenes with transformation of recorded spatial sound in virtual reality. Journal of the Audio Engineering Society, 63(7/8):542–551, 2015.
T. Pihlajamäki, M.-V. Laitinen, and V. Pulkki. Modular architecture for virtual-world parametric spatial audio synthesis. In 49th Int. Conf. of the AES: Audio for Games, London, UK, 2013.
T. Pihlajamäki, O. Santala, and V. Pulkki. Synthesis of spatially extended virtual source with time-frequency decomposition of mono signals. Journal of the Audio Engineering Society, 62(7/8):467–484, 2014.
M. A. Poletti. A unified theory of horizontal holographic sound systems. Journal of the Audio Engineering Society, 48(12):1155–1182, 2000.
M. A. Poletti. Three-dimensional surround sound systems based on spherical harmonics. Journal of the Audio Engineering Society, 53(11):1004–1025, 2005.
A. Politis. Vector-Base Amplitude Panning library. https://github.com/polarch/Vector-Base-Amplitude-Panning, 2015a.
A. Politis. Array Response Simulator. https://github.com/polarch/Array-Response-Simulator, 2015b.
A. Politis. Higher Order Ambisonics library. https://github.com/polarch/Higher-Order-Ambisonics, 2015c.
A. Politis. Spherical Harmonic Transform library. https://github.com/polarch/Spherical-Harmonic-Transform, 2015d.
A. Politis. Spherical Array Processing library. https://github.com/polarch/Spherical-Array-Processing, 2016.
A. Politis, M.-V. Laitinen, J. Ahonen, and V. Pulkki. Parametric spatial audio coding for spaced microphone array recordings. In 134th Convention of the AES, Rome, Italy, 2013a.
A. Politis, M.-V. Laitinen, and V. Pulkki. Spatial audio coding with spaced microphone arrays for music recording and reproduction. In 21st Int. Congress on Acoustics (ICA), Montreal, Canada, 2013b.
A. Politis, M. R. P. Thomas, H. Gamper, and I. J. Tashev. Applications of 3D spherical transforms to personalization of HRTFs. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Shanghai, China, 2016.
M. Pollow, K.-V. Nguyen, O. Warusfel, T. Carpentier, M. Müller-Trapet, M. Vorländer, and M. Noisternig. Calculation of head-related transfer functions for arbitrary field points using spherical harmonics decomposition. Acta Acustica united with Acustica, 98(1):72–82, 2012.
H. Pomberger and F. Zotter. Warping of 3D ambisonic recordings. In 3rd Int. Symposium on Ambisonics and Spherical Acoustics, Lexington, KY, USA, 2011.
G. Potard and I. Burnett. Decorrelation techniques for the rendering of apparent sound source width in 3D audio displays. In 7th Int. Conf. on Digital Audio Effects (DAFx-04), Naples, Italy, 2004.
R. Pugliese, A. Politis, and T. Takala. ATSI: Augmented and Tangible Sonic Interaction. In 9th Int. Conf. on Tangible, Embedded, and Embodied Interaction (TEI), Stanford, CA, USA, 2015.
V. Pulkki. Virtual sound source positioning using vector base amplitude panning. Journal of the Audio Engineering Society, 45(6):456–466, 1997.
V. Pulkki. Uniform spreading of amplitude panned virtual sources. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 1999.
V. Pulkki. Microphone techniques and directional quality of sound reproduction. In 112th Convention of the AES, Munich, Germany, 2002.
V. Pulkki. Directional Audio Coding in spatial sound reproduction and stereo upmixing. In 28th Int. Conf. of the AES: The Future of Audio Technology – Surround and Beyond, Piteå, Sweden, 2006.
V. Pulkki. Spatial sound reproduction with Directional Audio Coding. Journal of the Audio Engineering Society, 55(6):503–516, 2007.
V. Pulkki and J. Merimaa. Spatial impulse response rendering II: Reproduction of diffuse sound and listening tests. Journal of the Audio Engineering Society, 54(1/2):3–20, 2006.
V. Pulkki, M. Karjalainen, and J. Huopaniemi. Analyzing virtual sound source attributes using a binaural auditory model. Journal of the Audio Engineering Society, 47(4):203–217, 1999.
V. Pulkki, A. Politis, G. Del Galdo, and A. Kuntz. Parametric spatial audio reproduction with higher-order B-format microphone input. In 134th Convention of the AES, Rome, Italy, 2013.
B. Rafaely. Fundamentals of Spherical Array Processing. Springer, 2015.
B. Rafaely, I. Balmages, and L. Eger. High-resolution plane-wave decomposition in an auditorium using a dual-radius scanning spherical microphone array. The Journal of the Acoustical Society of America, 122(5):2661–2668, 2007a.
B. Rafaely, B. Weiss, and E. Bachmat. Spatial aliasing in spherical microphone arrays. IEEE Transactions on Signal Processing, 55(3):1003–1010, 2007b.
E. Rasumow, M. Hansen, S. van de Par, D. Püschel, V. Mellert, S. Doclo, and M. Blau. Regularization approaches for synthesizing HRTF directivity patterns. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(2):215–225, 2016.
L. Rayleigh. On our perception of sound direction. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 13(74):214–232, 1907.
T. Rettberg and S. Spors. Time-domain behaviour of spherical microphone arrays at high orders. In 40th German Annual Conf. on Acoustics (DAGA), Oldenburg, Germany, 2014.
S. K. Roffler and R. A. Butler. Factors that influence the localization of sound in the vertical plane. The Journal of the Acoustical Society of America, 43(6):1255–1259, 1968.
D. Romblom, P. Depalle, C. Guastavino, and R. King. Diffuse field modeling using physically-inspired decorrelation filters and B-format microphones: Part I Algorithm. Journal of the Audio Engineering Society, 64(4):177–193, 2016.
G. D. Romigh, D. S. Brungart, R. M. Stern, and B. D. Simpson. Efficient real spherical harmonic representation of head-related transfer functions. IEEE Journal of Selected Topics in Signal Processing, 9(5):921–930, 2015.
R. Roy and T. Kailath. ESPRIT – estimation of signal parameters via rotational invariance techniques. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(7):984–995, 1989.
O. Rummukainen, J. Radun, T. Virtanen, and V. Pulkki. Categorization of natural dynamic audiovisual scenes. PLoS ONE, 9(5):e95848, 2014.
P. Runkle, A. Yendiki, and G. Wakefield. Active sensory tuning for immersive spatialized audio. In Int. Conf. on Auditory Display (ICAD), Atlanta, GA, USA, 2000.
S. Sakamoto, J. Kodama, S. Hongo, T. Okamoto, Y. Iwaya, and Y. Suzuki. A 3D sound-space recording system using spherical microphone array with 252ch microphones. In 20th Int. Congress on Acoustics (ICA), Sydney, Australia, 2010.
N. H. Salminen, A. Altoe, M. Takanen, O. Santala, and V. Pulkki. Human cortical sensitivity to interaural time difference in high-frequency sounds. Hearing Research, 323:99–106, 2015.
O. Santala and V. Pulkki. Directional perception of distributed sound sources. The Journal of the Acoustical Society of America, 129(3):1522–1530, 2011.
O. Santala, H. Vertanen, J. Pekonen, J. Oksanen, and V. Pulkki. Effect of listening room on audio quality in Ambisonics reproduction. In 126th Convention of the AES, Munich, Germany, 2009.
D. Scaini and D. Arteaga. Decoding of higher order Ambisonics to irregular periphonic loudspeaker arrays. In 55th Int. Conf. of the AES: Spatial Audio, Helsinki, Finland, 2014.
R. Scharrer and M. Vorländer. Sound field classification in small microphone arrays using spatial coherences. IEEE Transactions on Audio, Speech, and Language Processing, 21(9):1891–1899, 2013.
G. Schiffrer and D. Stanzial. Energetic properties of acoustic fields. The Journal of the Acoustical Society of America, 96(6):3645–3653, 1994.
R. O. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276–280, 1986.
C. Schörkhuber and A. Klapuri. Constant-Q transform toolbox for music processing. In 7th Sound and Music Computing Conference (SMC), Barcelona, Spain, 2010.
E. Schuijers, J. Breebaart, H. Purnhagen, and J. Engdegard. Low complexity parametric stereo coding. In 116th Convention of the AES, Berlin, Germany, 2004.
R. Schultz-Amling, F. Kuech, O. Thiergart, and M. Kallinger. Acoustical zooming based on a parametric sound field representation. In 128th Convention of the AES, London, UK, 2010.
D. Sébilleau. On the computation of the integrated products of three spherical harmonics. Journal of Physics A: Mathematical and General, 31(34):7157–7168, 1998.
B. Seeber and H. Fastl. Subjective selection of non-individual HRTFs. In Int. Conf. on Auditory Display (ICAD), Boston, MA, USA, 2003.
D. Sen, N. Peters, M.-Y. Kim, and M. Morrell. Efficient compression and transportation of scene based audio for television broadcast. In Int. Conf. of the AES: Sound Field Control, Guildford, UK, 2016.
N. R. Shabtai and B. Rafaely. Generalized spherical array beamforming for binaural speech reproduction. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1):238–247, 2014.
J. Sheaffer, S. Villeval, and B. Rafaely. Rendering binaural room impulse responses from spherical microphone array recordings using timbre correction. In EAA Joint Symposium on Auralization and Ambisonics, Berlin, Germany, 2014.
K. M. Short, R. A. Garcia, and M. L. Daniels. Multichannel audio processing using a unified-domain representation. Journal of the Audio Engineering Society, 55(3):156–165, 2007.
A. Silzle. Selection and tuning of HRTFs. In 112th Convention of the AES, Munich, Germany, 2002.
K. U. Simmer, J. Bitzer, and C. Marro. Post-filtering techniques. In M. Brandstein and D. Ward, editors, Microphone Arrays, chapter 2, pages 39–60. Springer, 2001.
A. Solvang. Spectral impairment of two-dimensional higher order Ambisonics. Journal of the Audio Engineering Society, 56(4):267–279, 2008.
T. Sondergaard and M. Wille. Optimized vector sound intensity measurements with a tetrahedral arrangement of microphones in a spherical shell. The Journal of the Acoustical Society of America, 138(5):2820–2828, 2015.
S. Spors and R. Rabenstein. Spatial aliasing artifacts produced by linear and circular loudspeaker arrays used for wave field synthesis. In 120th Convention of the AES, Paris, France, 2006.
S. Spors, R. Rabenstein, and J. Ahrens. The theory of wave field synthesis revisited. In 124th Convention of the AES, Amsterdam, The Netherlands, 2008.
P. Stitt, S. Bertet, and M. van Walstijn. Off-centre localisation performance of Ambisonics and HOA for large and small loudspeaker array radii. Acta Acustica united with Acustica, 100(5):937–944, 2014.
P. Stitt, S. Bertet, and M. van Walstijn. Extended energy vector prediction of Ambisonically reproduced image direction at off-center listening positions. Journal of the Audio Engineering Society, 64(5):299–310, 2016.
H. Sun, E. Mabande, K. Kowalczyk, and W. Kellermann. Localization of distinct reflections in rooms using spherical microphone array eigenbeam processing. The Journal of the Acoustical Society of America, 131(4):2828–2840, 2012.
H. Tahvanainen, A. Politis, T. Koski, J.-M. Koljonen, M. Takanen, and V. Pulkki. Perceived lateral distribution of anti-phasic pure tones. In Forum Acusticum, Aalborg, Denmark, 2011.
H. Tahvanainen, J. Pätynen, and T. Lokki. Studies on the perception of bass in four concert halls. Psychomusicology: Music, Mind, and Brain, 25(3):294–305, 2015.
M. Takanen and G. Lorho. A binaural auditory model for the evaluation of reproduced stereophonic sound. In 45th Int. Conf. of the AES: Applications of Time-Frequency Processing in Audio, Helsinki, Finland, 2012.
M. Takanen, O. Santala, and V. Pulkki. Binaural assessment of parametrically coded spatial audio signals. In J. Blauert, editor, The technology of binaural listening, chapter 13, pages 333–358. Springer, 2013.
M. Takanen, H. Wierstorf, V. Pulkki, and A. Raake. Evaluation of sound field synthesis techniques with a binaural auditory model. In 55th Int. Conf. of the AES: Spatial Audio, Helsinki, Finland, 2014.
C.-J. Tan and W.-S. Gan. User-defined spectral manipulation of HRTF for improved localisation in 3D sound systems. Electronics Letters, 34(25):2387–2389, 1998.
S. Tervo. Direction estimation based on sound intensity vectors. In 17th European Signal Processing Conference (EUSIPCO), pages 700–704, Glasgow, UK, 2009.
S. Tervo, T. Korhonen, and T. Lokki. Estimation of reflections from impulse responses. Building Acoustics, 18(1-2):159–174, 2011.
S. Tervo, J. Pätynen, A. Kuusinen, and T. Lokki. Spatial decomposition method for room impulse responses. Journal of the Audio Engineering Society, 61(1/2):17–28, 2013.
S. Tervo, P. Laukkanen, J. Pätynen, and T. Lokki. Preferences of critical listening environments among sound engineers. Journal of the Audio Engineering Society, 62(5):300–314, 2014.
H. Teutsch. Modal array signal processing: Principles and applications of acoustic wavefield decomposition. Springer, 2007.
G. Theile. On the naturalness of two-channel stereo sound. In 9th Int. Conf. of the AES: Television Sound Today and Tomorrow, Detroit, MI, USA, 1991.
G. Theile. Natural 5.1 music recording based on psychoacoustic principles. In 19th Int. Conf. of the AES: Surround Sound – Techniques, Technology, and Perception, Bavaria, Germany, 2001.
O. Thiergart and E. A. P. Habets. Robust direction-of-arrival estimation of two simultaneous plane waves from a B-format signal. In IEEE 27th Convention of Electrical & Electronics Engineers in Israel (IEEEI), Eilat, Israel, 2012.
O. Thiergart and E. A. P. Habets. Extracting reverberant sound using a linearly constrained minimum variance spatial filter. IEEE Signal Processing Letters, 21(5):630–634, 2014.
O. Thiergart, R. Schultz-Amling, G. Del Galdo, D. Mahne, and F. Kuech. Localization of sound sources in reverberant environments based on Directional Audio Coding parameters. In 127th Convention of the AES, New York, NY, USA, 2009.
O. Thiergart, M. Kallinger, G. Del Galdo, and F. Kuech. Parametric spatial sound processing using linear microphone arrays. In A. Heuberger, G. Elst, and R. Hanke, editors, Microelectronic Systems: Circuits, Systems and Applications, chapter 30, pages 321–329. Springer, 2011.
O. Thiergart, G. Del Galdo, and E. A. P. Habets. On the spatial coherence in mixed sound fields and its application to signal-to-diffuse ratio estimation. The Journal of the Acoustical Society of America, 132(4):2337–2346, 2012.
O. Thiergart, G. Del Galdo, M. Taseska, and E. A. P. Habets. Geometry-based spatial sound acquisition using distributed microphone arrays. IEEE Transactions on Audio, Speech, and Language Processing, 21(12):2583–2594, 2013.
O. Thiergart, M. Taseska, and E. A. P. Habets. An informed parametric spatial filter based on instantaneous direction-of-arrival estimates. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2182–2196, 2014.
J. Thompson, B. Smith, A. Warner, and J.-M. Jot. Direct-diffuse decomposition of multichannel signals using a system of pairwise correlations. In 133rd Convention of the AES, San Francisco, CA, USA, 2012.
W. R. Thurlow, J. W. Mangels, and P. S. Runge. Head movements during sound localization. The Journal of the Acoustical Society of America, 42(2):489–493, 1967.
S. Valaee, B. Champagne, and P. Kabal. Parametric localization of distributed sources. IEEE Transactions on Signal Processing, 43(9):2144–2153, 1995.
H. L. Van Trees. Detection, estimation, and modulation theory. John Wiley & Sons, 2004.
B. D. Van Veen and K. M. Buckley. Beamforming: A versatile approach to spatial filtering. IEEE ASSP Magazine, 5(2):4–24, 1988.
C. Verron, M. Aramaki, R. Kronland-Martinet, and G. Pallone. A 3-D immersive synthesizer for environmental sounds. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1550–1561, 2010.
C. Verron, P.-A. Gauthier, J. Langlois, and C. Guastavino. Binaural analysis/synthesis of interior aircraft sounds. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2011.
C. Verron, P.-A. Gauthier, J. Langlois, and C. Guastavino. Spectral and spatial multichannel analysis/synthesis of interior aircraft sounds. IEEE Transactions on Audio, Speech, and Language Processing, 21(7):1317–1329, 2013.
T. E. Vigran. Acoustic intensity-energy density ratio: an index for detecting deviations from ideal field conditions. Journal of Sound and Vibration, 127(2):343–351, 1988.
J. Vilkamo and S. Delikaris-Manias. Perceptual reproduction of spatial sound using loudspeaker-signal-domain parametrization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(10):1660–1669, 2015.
J. Vilkamo and V. Pulkki. Minimization of decorrelator artifacts in Directional Audio Coding by covariance domain rendering. Journal of the Audio Engineering Society, 61(9):637–646, 2013.
J. Vilkamo and V. Pulkki. Adaptive optimization of interchannel coherence with stereo and surround audio content. Journal of the Audio Engineering Society, 62(12):861–869, 2015.
J. Vilkamo, T. Lokki, and V. Pulkki. Directional audio coding: Virtual microphone-based synthesis and subjective evaluation. Journal of the Audio Engineering Society, 57(9):709–724, 2009.
J. Vilkamo, B. Neugebauer, and J. Plogsties. Sparse frequency-domain reverberator. Journal of the Audio Engineering Society, 59(12):936–943, 2012.
J. Vilkamo, T. Bäckström, and A. Kuntz. Optimized covariance domain framework for time–frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6):403–411, 2013.
M. Vorländer. Auralization: Fundamentals of acoustics, modelling, simulation, algorithms and acoustic virtual reality. Springer, 2007.
A. Wabnitz, N. Epain, A. McEwan, and C. Jin. Upscaling ambisonic sound scenes using compressed sensing techniques. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2011.
H. Wallach. The role of head movements and vestibular and visual cues in sound localization. Journal of Experimental Psychology, 27(4):339, 1940.
H. Wallach, E. B. Newman, and M. R. Rosenzweig. A precedence effect in sound localization. The Journal of the Acoustical Society of America, 21(4):468–468, 1949.
A. Walther and C. Faller. Direct-ambient decomposition and upmix of surround signals. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2011.
F. Wendt, M. Frank, F. Zotter, and R. Höldrich. Influence of directivity pattern order on perceived distance. In 42nd German Annual Conf. on Acoustics (DAGA), Aachen, Germany, 2016.
E. M. Wenzel, M. Arruda, D. J. Kistler, and F. L. Wightman. Localization using nonindividualized head-related transfer functions. The Journal of the Acoustical Society of America, 94(1):111–123, 1993.
H. Wierstorf. Perceptual assessment of sound field synthesis. PhD thesis, Technische Universität Berlin, Berlin, Germany, 2014.
B. Wiggins. The generation of panning laws for irregular speaker arrays using heuristic methods. In 31st Int. Conf. of the AES: New Directions in High Resolution Audio, London, UK, 2007.
F. L. Wightman and D. J. Kistler. Resolution of front–back ambiguity in spatial hearing by listener and source movement. The Journal of the Acoustical Society of America, 105(5):2841–2853, 1999.
E. G. Williams. Fourier acoustics: Sound radiation and nearfield acoustical holography. Academic Press, 1999.
M. Williams. Multichannel sound recording using 3, 4 and 5 channel arrays for front sound stage coverage. In 117th Convention of the AES, San Francisco, CA, USA, 2004.
K. Wu, V. G. Reju, and A. W. H. Khong. Multi-source direction-of-arrival estimation in a reverberant environment using single acoustic vector sensor. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane, Australia, 2015.
T. Xiao and Q. H. Liu. Finite difference computation of HRTF for human hearing. The Journal of the Acoustical Society of America, 113(5):2434–2441, 2003.
B. Xie. Head-related transfer function and virtual auditory display. J. Ross Publishing, 2013.
S. Xu, Z. Li, and G. Salvendy. Identification of anthropometric measurements for individualization of HRTFs. Acta Acustica united with Acustica, 95(1):168–177, 2009.
Y. Yamasaki and T. Itow. Measurement of spatial information in sound fields by closely located four point microphone method. Journal of the Acoustical Society of Japan, 10(2):101–110, 1989.
L. Yang and B. Xie. Subjective evaluation on the timbre of horizontal ambisonics reproduction. In Int. Conf. on Audio, Language and Image Processing (ICALIP), Shanghai, China, 2014.
P. Zahorik, D. S. Brungart, and A. W. Bronkhorst. Auditory distance perception in humans: A summary of past and present research. Acta Acustica united with Acustica, 91(3):409–420, 2005.
L. Ziomek. Fundamentals of acoustic field theory and space-time signal processing. CRC Press, 1994.
D. Zotkin, J. Hwang, R. Duraiswami, and L. Davis. HRTF personalization using anthropometric measurements. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2003.
D. N. Zotkin, R. Duraiswami, E. Grassi, and N. A. Gumerov. Fast head-related transfer function measurement via reciprocity. The Journal of the Acoustical Society of America, 120(4):2202–2215, 2006.
D. N. Zotkin, R. Duraiswami, and N. A. Gumerov. Regularized HRTF fitting using spherical harmonics. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2009.
F. Zotter. Analysis and synthesis of sound-radiation with spherical arrays. PhD thesis, Institute of Electronic Music and Acoustics (IEM), University of Music and Performing Arts, Graz, Austria, 2009a.
F. Zotter. Sampling strategies for acoustic holography/holophony on the sphere. In Int. Conf. on Acoustics (NAG/DAGA) and 35th German Annual Conf. on Acoustics (DAGA), Rotterdam, The Netherlands, 2009b.
F. Zotter and M. Frank. All-round ambisonic panning and decoding. Journal of the Audio Engineering Society, 60(10):807–820, 2012.
F. Zotter, M. Frank, G. Marentakis, and A. Sontacchi. Phantom source widening with deterministic frequency dependent time delays. In 14th Int. Conf. on Digital Audio Effects (DAFx-11), Paris, France, 2011.
F. Zotter, H. Pomberger, and M. Noisternig. Energy-preserving ambisonic decoding. Acta Acustica united with Acustica, 98(1):37–47, 2012.
F. Zotter, M. Frank, M. Kronlachner, and J.-W. Choi. Efficient phantom source widening and diffuseness in ambisonics. In EAA Joint Symposium on Auralization and Ambisonics, Berlin, Germany, 2014.