Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of...

49
UPTEC F 17024 Examensarbete 30 hp Juni 2017 Bayesian Parameterization in the spread of Diseases Robin Eriksson

Transcript of Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of...

Page 1: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

UPTEC F 17024

Examensarbete 30 hpJuni 2017

Bayesian Parameterization in the spread of Diseases

Robin Eriksson

Page 2: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

Teknisk- naturvetenskaplig fakultet UTH-enheten Besöksadress: Ångströmlaboratoriet Lägerhyddsvägen 1 Hus 4, Plan 0 Postadress: Box 536 751 21 Uppsala Telefon: 018 – 471 30 03 Telefax: 018 – 471 30 00 Hemsida: http://www.teknat.uu.se/student

Abstract

Bayesian Parameterization in the spread of Diseases

Robin Eriksson

Mathematical and computational epidemiological models are important tools in effortsto combat the spread of infectious diseases. The models can be used to predictfurther progression of an epidemic and for assessing potential countermeasures tocontrol disease spread. In the proposal of models (when data is available), one needsparameter estimation methods. In this thesis, likelihood-less Bayesian inferencemethods are concerned. The data and the model originate from the spread of averotoxigenic Escherichia coli in the Swedish cattle population. In using the SISE3model, which is an extension of the susceptible-infected-susceptible model with addedenvironmental pressure and three age categories, two different methods wereemployed to give an estimated posterior: Approximate Bayesian Computations andSynthetic Likelihood Markov chain Monte Carlo. The mean values of the resultingposteriors were close to the previously performed point estimates, which gives theconclusion that Bayesian inference on a nation scaled SIS-like network is conceivable.

ISSN: 1401-5757, UPTEC F 17024Examinator: Tomas NybergÄmnesgranskare: Per LötstedtHandledare: Stefan Engblom

Page 3: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

Popularvetenskaplig sammanfattning

Forskare och industri anvander idag datormodeller i stor utstrackning. Anled-ningen ar av bade ekonomiska och fysiska skal. Det ar mer kostnadseffektivtatt gora experiment i en datoriserad miljo an i verkligheten. I den virtuellamiljon ar det aven enklare att utfora tester. Utforaren far beroende pa modell,kontroll over sma detaljer som hastigheten pa vinden i en flygplanssimuleringeller antalet liter vatten som strommar genom ett reningsverk. I dessa modellerfinns parametrar och det ar dessa som finjusterar de namnda kvantiteterna.Motsatsen till att simulera modeller ar att fran data forsoka anpassa den modellsom beskriver data bast. For att anpassa data till modeller behover man hittavilka parametrar som gor att felet mellan simulering, via modell, och verklighetblir sa liten som mojligt. Detta kallas parametrisering.

Vid studier av hur sjukdomar sprider sig anvands idag ofta stokastiska,slumptalsberoende modeller. En vanlig modell for ett basfall ar SIS-modellen:individer gar mellan tva stadier: S – susceptible (mottaglig) och I – infected (in-fekterad). Modellen innehaller tva parametrar, smitthastigheten och aterhamt-ningshastigheten. Detta examensarbete behandlar Bayesianska parametriser-ingsmetoder dar parametrar behandlas som sannolikhetsfordelningar och intesom punktvarden.

Ett hinder for en popular Bayesiansk parametriseringsmetod, Metropolis-Hasting Markov chain Monte Carlo (MCMC), ar att man pa forhand behoverveta vad som kallas likelihood som ar ett matt pa sannolikheten att datagenererats med foreslagna parametrar. Nar man inte langre vet fordelningenbehover man anvanda sig av andra metoder som pa olika satt forsoker kringgakravet.

I detta examensarbete studeras en SIS-liknande modell, som ar ett falldar likelihood ar okant. Istallet for MCMC anvands Approximate BayesianComputations och Synthetic Likelihood Markov chain Monte Carlo som badaar metoder drivna av simuleringar vilket gor dem valdigt berakningstunga. Deresulterande parameterfordelningarna fran dessa metoder forvantas befinnasig kring nagra pa forhand bestamda punktvarden, vilket de aven gjorde.Slutsatsen ar att det erhallna resultaten visar att metoderna ar anvandbaraoch att fortsatt utveckling av dem rekommenderas.

Page 4: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

Contents

Contents 1

1 Introduction 3

2 Disease modeling 72.1 Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 SIS-model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.3 SISE3-model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.4 Local and global dynamics . . . . . . . . . . . . . . . . . . . . . 112.5 Simulation software . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Available data 133.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2 Clustering of the data . . . . . . . . . . . . . . . . . . . . . . . 143.3 Point estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.4 Prior observation . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Bayesian Methods 214.1 Summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . 224.2 Approximate Bayesian computations . . . . . . . . . . . . . . . 234.3 Synthetic Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 234.4 SLMCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 Results 275.1 ABC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.2 SLMCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6 Discussion 41

7 Conclusion 43

Bibliography 45

1

Page 5: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,
Page 6: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

1 Introduction

With modern mathematics and computers, modeling of the spread of infectiousdisease is made possible [2]. Examples of infectious diseases are virus outbreaks,like influenza, or bacterial spreads as antimicrobial resistance (AMR). AMRis a global public health threat [12]. Antibiotics are crucial for the treatmentof many bacterial infections as well as for the success of major surgery andcancer chemotherapy. However, with evolution and misuse, bacteria develop anatural resistance to these medications. The research is thus under time- andsocietal pressure to find new, not yet repelled, antimicrobials. A contributor toAMR is the food producing animal industry [14]. The animals are subjected toantibiotics for three reasons: growth promoting (banned in the EU), group, andindividual treatment of infections. In the case of AMR spread, the two formerare of major concern [10, 14].The resistance gene spreads from the farms tobacteria in contact with the human population, and without anti-infective drugs,the number of infections without effective treatment will increase. Researchin epidemiology will aid in finding how to anticipate and stop the spread ofdiseases, such as AMR.

Three processes can be planned for the supervision of diseases: measure-ments, forecasts, and countermeasures. Measurements can be samples fromlivestock, food, or hospitals. The goal of the measurements is to monitor andbetter understand the spread. Forecasts are needed to tell how the disease willspread. The evaluation of possible countermeasures determines what methodhinders the disease most efficiently. To perform the latter two processes, oneneeds mathematical-/computational models, which are the targets for thisthesis.

The mathematical modeling of infectious diseases is an area that in lateryears, in the age of information technology and computers, has seen an increasein academic papers. In 1957, Bailey released a book covering the subject inwhich the number of articles on the subject was in the hundreds [2]. At thattime computer modeling was in its infancy compared to today. Bailey reliedon rigorous mathematics and transformation techniques. Today much work iscarried out by simulations and numerical methods. State of the art tools, suchas SimInf [18], allows one to compute years of simulated time in a couple ofseconds on a personal desktop computer.

3

Page 7: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

4 CHAPTER 1. INTRODUCTION

The specific modeling in this thesis is done using a state space modelingtechnique called continuous time discrete state Markov chain (CTMC). Anentity, say a cow, can be in one state at a time. The entity will with a specifiedrate shift to another possible state. The transition between states is executed incontinuous time, therefore the name CTMC. When modeling a disease spread,a simple, comprehensible candidate model is the SIS-process [2]. The Markovchain has two states, Susceptible (S) and Infected (I), and the rates are theinfection rate, how quickly the infection spreads to susceptible entities, and therecovery rate, how fast an infected entity recovers and re-enters the susceptiblestate. When one assumes the CTMC model, stochasticity in the model is alsoassumed. At each time step, the system evaluates the possible state transitionwhere the outcome is dependent on a random variable.

To adjust models, one introduces parameters, which allow for fine tuningof the characteristics of the output. When data becomes available, one canextract enough information to find the correct parameters that fit the modeloutput to the received data. The procedure of finding a good enough parameterfit is called estimation. In a deterministic setting, the estimation of parametersoften involves finding parameters under a present noise. In the stochastic case,one will need to rely on previously developed methods and theorems. Todaythere are two schools of statistics, frequentist, and Bayesian. Briefly described,frequentists believe that variable (parameter) values hold one true value andone only observes these values with a specified accuracy, P-values. Bayesianssay that the variable (parameter) have a probability distribution and that theobserved values are drawn from this distribution. In this thesis, we assumethe Bayesian point of view. A popular method today for parameter estimationis Markov chain Monte Carlo (MCMC). The MCMC method incrementallytakes steps in the parameter space and finds the posterior of the parameter.One limitation of the method is that it requires a defined likelihood function,which for complex probability models might be infeasible to obtain. This thesisaddresses that limitation and implements newer methods that do not need thelikelihood for parameter estimations.

Data is increasingly available today. A requirement for new solutions isthat they need to be able to react as fast as possible to new data. The methodsused in this thesis utilize the data gathered to estimate parameters. In usingMCMC with a defined likelihood function, one computes the probability ofthe sampled dataset being from the same distribution as the examined data.Another type of methods is built around received data, e.g. Kalman filters [9],newer filtering techniques, as the particle filter, ensemble Kalman filters [16],or the use of path wise information theory [15]. The key difference betweenthe filtering techniques and the posterior inference methods, can be explainedby analogy. An artist is completing a painting, and the task is to name thepainter from just observing the painting. The dynamical, data-centric methods(filters), starts with equal weights, the certainty of the answer, as the canvasis empty. The methods recompute the weights with every nth stroke of the

Page 8: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

5

brush, and converge to an answer. The posterior inference methods, observethe finished painting and give a solution based on prior knowledge of a pool ofpainters. This thesis concerns posterior inference methods, but as an outlook,one could explore the filtering methods.

When constructing methods and solutions for complex problems, is acrossroad: either one goes straight ahead and tries to solve the problem in itsfull complexity, or one starts with a simple toy model and later increases thecomplexity to full rank. The latter methodology has the advantage that if oneincrementally raises the complexity of the problem and the method fails ina stage, then one knows that the method will not solve the full problem asit cannot solve the simpler alternative. One way of constructing a simplifiedproblem is by the inverse crime, inheriting its name from optimization theory.An example where using the inverse crime would be beneficial is in financialstock price modeling with parameter estimation. As a first step, consider acandidate model. Generate a time series with a known parameter set usingthe model. If and only if the estimation method successfully infers the correctparameters from the generated data, continue to the next step. The followinglevels should consider perturbed data with added noise. Again, if and only ifthe methods determines the parameters from the information given, continueto add complexity. The increments carry on until one confidently can feed thereal data to the method. Finalizing each step often requires more computingtime, because of the growing intricacy of the problem. Thus being able todiscard the method at the early steps is more time efficient than startingwith full complexity. This thesis is the third phase in the inverse crime forsolving the problem of parameter estimation on a national scale disease spreadnetwork. One explored and proved the feasibility of a parameter estimationunder synthetic data [3]. Then one achieved point estimates for actual data [19].The aim of this thesis is to determine a Bayesian posterior for said parametersusing a synthetic dataset closely resembling real data.

Page 9: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,
Page 10: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

2 Disease modeling

Data is becoming available, and the area of computational epidemiology isnow of interest as the required data, and computational power are finally here.This chapter defines Markov chains, SIS-processes, and network models.

2.1 Markov Chain

A Markov process is a group of stochastic processes that also satisfies theMarkov property. With this process, one can make predictions about itsfuture solely dependent on its present state. Consider the stochastic processYn, n = 0, 1, 2, 3, . . . that takes on a finite number of values. The process issaid to be in the state i at the time n if Yn = i. Suppose that when instate i there is a fixed probability Pij that the next state the process will bein is j. Figure 2.1 gives a visual interpretation of the example, where theprocess can be in either state i or j with the transition probabilities definedas Pii, Pij , Pjj , and Pji. The example process is a Markov process. Timedependence of Markov processes, are defined as discrete or continuous in time.In this thesis, we will refer to Markov chains as continuous in time, CTMC forshort.

2.2 SIS-model

To simulate disease spread, one needs to decide on a mathematical model. Abasic example is the SIS-model. The population can be in one of two states;susceptible, S, or infected, I. One writes a representation of the model as

S+Iυ−→ 2I

Iγ−→ S

}, (2.1)

where υ is the transmission rate and γ the recovery rate [2].Figure 2.2 shows a simulation of a SIS system over 1000 days. One sees

how after ≈ 200 days the outbreak takes on and the proportion of individualsinfected rises to ≈ 60% and later reaches its peak and then die off as individualsreturn to the susceptible state and the number of infected individuals vanishes

7

Page 11: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

8 SISE3-model

Figure 2.1: A Markov chain with two defined states, i and j, and the transitionprobabilities: Pii, Pij , Pjj , and Pji.

after 400 days. Figure 2.2 is just one simulation with a set of parameters.Another run, with a different random number seed, might give a differentpattern; the percentage of infected individuals might never surpass 10% andthen goes straight to zero, or the system stays in the oscillating state formultiple periods before returning to zero. Figure 2.3 depicts is a simulationwith the same parameters as in Figure 2.2, but with different random numbers,showcasing the latter mentioned phenomena.

2.3 SISE3-model

The SIS-model can be adjusted and extended in a number of ways. Whenmodeling a bacterial infection, one extension is adding an environmental,infectious pressure ϕ(t) as the infected individuals shed the disease/bacteria tothe environment [3]. Modeling the continues variable ϕ(t) is done as

dt= α

I

N− β(t)ϕ, (2.2)

where α is the average shedding rate, N is the sum of susceptible and infectedindividuals, and β(t) is the decay of bacteria in the environment at time t. β(t)is a seasonally dependent constant. Four seasons translates into a set of fourconstants, where only one is observable and active at a time. When includingϕ(t) in (2.1) one gets

S+Iυϕ−−→ 2I

Iγ−→ S

}. (2.3)

Page 12: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

9

0.00

0.25

0.50

0.75

1.00

0 250 500 750 1000

Time

valu

e

variable S I

Figure 2.2: Evolution of a SIS system over 1000 days. The red curve representsthe proportion of individuals that are in the susceptible state, S, and the blueare in the infected state, I. The y-axis gives the proportion and the x-axis theelapsed time in days. After 400 days the proportion of infected individualsgoes to zero. The used parameter values were υ = 0.017 and γ = 0.1.

The transition between states is modeled as a CTMC over i = 1, 2, . . . , Nspatial nodes and j = 1, 2, . . . ,M age groups. Further extending (2.3) into

Si,j+Ii,jυjϕi−−−→ 2Ii,j

Ii,jγj−→ Si,j

, (2.4)

with υj being the age group dependent transmission rate and ϕi the environ-mental, infectious pressure at node i defined as

dϕidt

= α

∑j Ii,j

Ni− β(t)ϕi, (2.5)

where Ni is the sum of all age groups of individuals in the node as Ni =∑j Ii,j + Si,j . As an extension to (2.5), adding a rate of proximal spread

between nearby nodes captures unregistered connections in the network and

Page 13: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

10 SISE3-model

0.00

0.25

0.50

0.75

1.00

0 250 500 750 1000

Time

valu

e

variable S I

Figure 2.3: Evolution of the same SIS system as in Figure 2.2 over 1000 days,but with a different random seed. The red curve represents the proportionof individuals that are in the susceptible state, S, and the blue are in theinfected state, I. The y-axis gives the proportion and the x-axis the elapsedtime in days. As the random seed is different the systems evolution over timeis different. Compared to Figure 2.2 the proportion of infected individualsnever reaches zero in the time frame of 1000 days.

increases model to data explanation [19]. Let D denote the rate of the proximalspread, and let dik denote the distance between the two nodes i and k. Defininga new ϕi as

dϕidt

= α

∑j Ii,j

Ni+∑k

ϕk(t)Nk(t)− ϕi(t)Ni(t)

Ni(t)· Ddik− β(t)ϕi. (2.6)

The distance between nodes, the sum over k, is evaluated for all nodes exceptfor the trivial case of k = i.

Figure 2.4 illustrates how a SISE3 system can behave over 1 000 days with100 connected nodes. Seasonality is observable in the figure, and the infectiondecays to zero over time and is almost non-existent after the 1 000 days.

Page 14: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

11

0.00

0.25

0.50

0.75

1.00

0 250 500 750 1000

Time

valu

e

variableS1

S2

S3

I1

I2

I3

Figure 2.4: Evolution of a SISE3 system over 1000 days and 100 networkconnected nodes. The three age categories are represented with differentdashing of the curves. The red, yellow, and green curves give the proportionof susceptible individuals and the blue, purple, and magenta curves give theproportion of infected individuals. Seasonality is observable as in the periodicityof multiple local minima in the dashed green line, which represents S3. They-axis gives the proportion and the x-axis the elapsed time in days. The valuesof the parameters used were υ1 = 0.0357, υ2 = 0.0357, υ3 = 0.00935, γ1 = 0.1,γ2 = 0.1, γ3 = 0.1, α = 1, βt1 = 0.19, βt2 = 0.085, βt3 = 0.075, $betat4 = 0.185,and D = 0.02.

2.4 Local and global dynamics

The system at hand connects multiple nodes, forming a network, and thedynamics needs to be defined. A stochastic differential equation (SDE) withjumps describes the local dynamics as

dXt = Sµ(dt), (2.7)

where µ(dt) = [µ1(dt), . . . , µNcomp(dt)]T and Ncomp is the number of compart-ments or age groups. An example of the transitions between compartments isthe SIS-model (2.1). The state vector consists of two compartments X = [S, I],

Page 15: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

12 Simulation software

with the transition matrix

S =

[−1 1

1 −1

], (2.8)

and intensity vectorR(x) = [υx1x2, γx2]

T (2.9)

In the case of a transition, the state vector is modified by the transition matrixas

Xt = Xt− + Sk (2.10)

when transition k occurred at time t.The local dynamics extends to a network model of Nnodes nodes, by a state

matrix X ∈ ZNcomp×Nnodes+ and (2.7) extends to

dX(i)t = Sµ(i)(dt), (2.11)

with node index i ∈ {1, . . . , Nnodes}. Then consider an undirected graph G withof Nn nodes as its vertexes and interactions defined by the counting measuresνi,j and νj,i. Where νi,j represents the change in state due to an influx ofindividuals from the node i to j, and the reverse for νj,i. Node j (i) is assumedto be in the connected component C(i) (C(j)) of node i (j). These connectionsare stored in matrix C. The network dynamic is then

dX(i)t = −

∑j∈C(i)

Cνi,j(dt) +∑

j;i∈C(j)

Cνj,i(dt), (2.12)

which in combination with (2.11) leads to the overall dynamics

dX(i)t = Sµ(i)(dt)−

∑j∈C(i)

Cνi,j(dt) +∑

j;i∈C(j)

Cνj,i(dt). (2.13)

2.5 Simulation software

The CTMC modeling in this thesis is performed using SimInf [18], a generalsoftware framework for data-driven modeling and simulation of stochasticdisease spreading in a complex network of connected nodes. SimInf is anextension to R and utilizes compiled C code with R matrices, in an objectoriented programming framework to define objects in logical layers connected bywell-defined interfaces. The computations are done in parallel using OpenMP,allowing efficient use of systems with multiple computational threads available.

Page 16: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

3 Available data

The thesis use data from an observational study tracing the verotoxigenicEscherichia coli O157 (VTEC O157) status in Swedish cattle herds from 2009to 2013 [20]. The data includes births, deaths, and movements on an animallevel from 37 221 farms, referred to as nodes. However, the study conductedmeasurements for the bacteria only on 126 of the nodes due to some factors.Illustrated in Figure 3.1 is the spatial system that connects the Swedish cattlefarms (or nodes) through some of the animal transfers.

3.1 Data Collection

The data collection process was such that over a period of four years, everyother month, a measurement was taken at some specified nodes. The testerswore overshoe-socks and collected samples by walking around the animals withthe socks on. The socks were then sent to a laboratory in which bacterialcultures were grown from each sock. If the verotoxigenic bacterium was foundin one of the sample from a node, that node was marked as infected. Theprocedure resulted in a binary time series of ones and zeros.

In working under the inverse crime principle, it is not desirable to startworking with the data directly. Instead, we use a generated dataset thatmimics the observed data. The new dataset includes the same number of datapoints, the same nodes, and the same date of accumulation, as the real data,to reproduce the same situation as with the actual data but with controlledcomplexity. The dataset can also be altered with different parameters, enablingthe use of data-driven estimation methods.

As the actual data from the study is an observation of the underlying state,to complete the constructed dataset one also needs to model the detections ofthe bacteria. Therefore the procedure is modeled to emulate the laboratorydetection, creating simulated observed data similar to the real dataset. Forthe simulation of the sock-to-binary result, the result used is from a priorstudy [1] in which they examined the infection detection probability. In thisstudy, tests were taken at a node with high probability of finding the infection,in animal-pools of three individuals in each pool. From the resulting samples

13

Page 17: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

14 Clustering of the data

a sensitivity function was determined to be [0, 0.28, 0.86, 0.99] for the numberof infected individuals in the range (0− 3) in each pool.

In summary there are thus two variance building steps. First, the overar-ching simulation of the underlying state, and secondly, the swabbing of thenodes, uncovering the observed state.

3.2 Clustering of the data

Spatial aggregation of the data is needed for the analysis. A possible aggregationis to sum all nodes, and view the data as one node as

Nj =∑

i∈[1,Nnodes]

Ni,j (3.1)

where i is the node number from 1 to Nnodes, and j the time of observation.Another example is to group into counties with the argument that Swedenis a latitudinally dependent country where the southern climate is differentfrom the northern one. To group nodes into counties, one needs to combinethe location of the node with the boundaries of the counties. Another similar,more robust way, is to group the nodes with a clustering algorithm such aK-means or hierarchical clustering, which only requires either the geographicalpoint of the node or the distance between the nodes. Figure 3.2 illustratesthe dendrogram that one gets from hierarchically clustering the dataset. Theheight defines the distance between clusters and each bifurcation representsthe division of the data into another cluster. The number of clusters that thedata is divided into is therefore directly related to what level of bifurcationsthat one chooses. Level one has one cluster, two has two clusters, three hasthree clusters, and so on. To evaluate what level to choose in a hierarchicalcluster, one can create a heat map, as in Figure 3.3. In the heat map, thelighter the color (white to yellow to red) the further the nodes are apart, andthere are five distinct darker (red) squares. The number of darker squaresgives the number of possible clusters. In Figure 3.3, one can argue that thereshould be three, four, five, or six number of groupings. However, one shouldaim towards a good balance between precision and generality. Therefore fiveclusters were chosen.

Using hierarchical clusters, the spatial aggregation of the data becomes

Nk,j =∑

i∈[k1,kM ]

Ni,j (3.2)

where k is the index of the cluster which holds aggregated information aboutnodes i = k1 to kM , which are defined in this thesis from the dendrogram inFigure 3.2.

In the dataset, one can see that sampling on the same date was not alwayspossible. By aggregating the data in time, one accounts for the mismatch

Page 18: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

15

in date. The decision was made to aggregate in yearly quarters; Jan-Mar,Apr-Jun, Jul-Sep, and Oct-Dec.

The values in Table 3.1 represents the data received after one simulatedand detected trajectory. Each row is a quarter for which the study ran, and thecolumns show the number of infected holdings in the cluster in that quarter.This quantification definition, of the infection spread, is later named prevalence.

Table 3.1: Aggregated sampled data from a simulated SISE3-system of fiveclusters, P.1 to P.5, over the number of infected nodes, prevalence, in eachcluster for the quarters, from 2009 Q4 to 2012 Q4. The table is from anexample run and is what is used for the analysis.

P.1 P.2 P.3 P.4 P.5

2009 Q4 1 0 2 6 42010 Q1 0 0 4 3 32010 Q2 2 0 5 4 32010 Q3 3 1 1 1 22010 Q4 4 1 6 9 102011 Q1 1 1 1 4 62011 Q2 3 2 2 6 82011 Q3 3 1 3 4 52011 Q4 7 2 5 3 92012 Q2 2 5 4 1 72012 Q3 4 3 2 2 22012 Q4 4 5 5 7 5

3.3 Point estimates

In a study of point estimates for the parameters in the SISe3 model, θ =(υ1, υ2, υ3, γ1, γ2, γ3, α, β1, β2, β3, β4, D)(2.4), an optimization problem definedas

argminθ

G(θ) (3.3)

tries to solve for the best fit of single values of the parameters, θ. Here G is adistance measure determined by G1, the number of infected holdings, and G2,the incidence cases of infected holding, as G(θ) = G1(θ) +G2(θ). In turn, G1

and G2 are defined as

G1(θ) =∑

tn∈y,q,c(Yin(θ)− Y ∗in)2, (3.4)

G2(θ) =∑

tn∈y,q,c(Xin(θ)−X∗in)2, (3.5)

Page 19: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

16 Prior observation

where we aggregate data over time slots of quarters, q, in years, y, for clusters,c. Y ∗in denotes the nth observed status, 1-positive and 0-negative, in node i attime tn. Similarly Yin(θ) denotes the simulated status corresponding to Y ∗in.X∗in denotes the first occurrence of a positive status in node i at time tn, suchthat X∗in = 1 at the first occasion Y ∗in = 1 and X∗in = 0 all other occasions.The optimization problem (3.3) was then solved by using the Nelder-Meadmethod [19] and the resulting parameter values are given in Table 3.2.

Table 3.2: The resulting parameters values from point estimations conductedin [19]. Values accompanied with a “=” sign were fixed and not estimated. υ1and υ2 were fixed to the same value as it was determined that the two youngercategories were similar compared to the adults, group 3.

Parameter Value

υ1 0.0184

υ2 0.0184

υ3 0.00980

γ1 = 0.100

γ2 = 0.100

γ3 = 0.100

α = 1.00

β1 0.109

β2 0.103

β3 0.114

β4 0.125

D 0.100

3.4 Prior observation

Cray et al. [5] conducted a study on the difference between transmission ratesυ for the different age groups. The study determines that there is a significantdifference between the young (age group 1 and 2) and the adult population(age group 3) in cattle. From the produced point estimates [19] one observesthat υ3 is approximately half the size of υ1. One can employ this knowledge ofthe relationship between the parameters into Bayesian parameter inference bya prior. This thesis implements a fixed scalar prior where υ3 is half the valueof υ1. υ2 is set to the same value as υ1 as it is seen as young cattle.

Page 20: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

17

Figure 3.1: Visualization of cattle movements in the data. The connectionsshown are a random subset of the complete dataset of ∼ 108 recorded events.The source of the data is the national cattle register at the Swedish Board ofAgriculture [Reproduced from [3]].

Page 21: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

18 Prior observation

Figure 3.2: Cluster dendrogram of the hierarchical cluster constructed from thedistance matrix of the nodes in the data. The height represents the distancebetween two nodes. The bottom of the tree, at zero height, the identificationnumber of the nodes in the dataset are given. Each bifurcation represents alevel at which the data is divided into one more cluster. At the top level allthe nodes are in one cluster and at the bottom the number of clusters is thesame as the number of nodes.

Page 22: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

19

15301

1014

15363

15518

15556

18398

18395

18188

29211

18366

16732

16773

18363

18872

15724

15947

16050

16123

15945

16130

16365

16408

16499

16456

23517

16294

16231

4757

11392

12136

11669

12002

11990

22487

12192

14680

15046

14655

15001

36

15027

15099

14609

14666

10117

6444

30363

5574

5587

2529

5633

4966

2550

2934

2584

2931

4849

2553

2986

2590

6837

4360

20360

4356

2945

2808

3205

2560

2401

3037

245

15900

3539

15184

15168

13600

13577

14718

13581

13604

13603

13569

13563

19905

14768

15110

15182

14735

14755

13579

13602

22957

13559

13539

20061

13536

13586

13537

2186

1625

1903

1902

1908

1907

1613

6821

1873

1874

1870

1943

1345

1030

1631

1649

1650

1686

1694

1691

1663

24404

1679

1680

1674

1678

1683

1677

15301101415363155181555618398183951818829211183661673216773183631887215724159471605016123159451613016365164081649916456235171629416231475711392121361166912002119902248712192146801504614655150013615027150991460914666101176444303635574558725295633496625502934258429314849255329862590683743602036043562945280832052560240130372451590035391518415168136001357714718135811360413603135691356319905147681511015182147351475513579136022295713559135392006113536135861353721861625190319021908190716136821187318741870194313451030163116491650168616941691166324404167916801674167816831677

Figure 3.3: Visualization of the heat map result from the hierarchical clusterconstructed from the distance matrix of the nodes in the data. More distinctred color represent closer distances between nodes. The right and bottomsides list the identification numbers of the nodes. The top and left sides givedendrogram as seen in Figure 3.2 which is the underlying base of the heat map.

Page 23: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,
Page 24: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

4 Bayesian Methods

Estimating parameters of stochastic models using available data can be donein multiple ways [8]. In Bayesian inference, the parameters are described bytheir posterior distribution. As a general setting, imagine data D generatedfrom a model M with parameters θ, and the prior density P (θ). The posteriordistribution of the parameters is then given by

P (θ|D) =P (D|θ)P (θ)

P (D), (4.1)

referred to as Bayes’ theorem. The critical point about the Bayesian inferenceis that it provides a means of combining new evidence, data, with prior beliefsin the priors. A simplistic approach to computing the posterior, P (θ|D) forthe parameter θ, is a plain rejection method:

1 Propose a θ from the prior P (θ).

2 Accept the proposal with probability h = P (D|θ); return to 1.

In the above method, one assumes that the likelihood P (D|θ) is known for themodel M, which is not true for all models. More involved Bayesian inferencemethods exist, such as MCMC [6], given in Algorithm 1. To use MCMC, oneneeds to have a problem with an analytically tractable likelihood, L(θ). As onesteps away from this space, one needs to use other methods. These methodswill inherently be worse off, as the likelihood defines the probability of thevalues. This section will cover two methods that try to mimic the same resultas MCMC without using a defined likelihood, ABC and SLMCMC.

21

Page 25: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

22 Summary statistics

Algorithm 1 MCMC Metropolis [6]

1: Initial guess θ2: Compute a likelihood L(θ)3: while n < N do4: Propose θ′ from a proposal distribution g(θ′|θ)5: Compute a likelihood L(θ′)6: Draw a uniformly distributed number r ∼ U [0, 1]

7: if r < min[1, L(θ

′)g(θ′|θ)L(θ)g(θ|θ′)

]then

8: Set θ = θ′

9: Set L(θ) = L(θ′)10: Set n = n+ 1

11: end if12: end while

The likelihood function L(θ) defines the probability of the outcome datagiven the parameter value θ, and is used in the algorithm to determine whetherto accept or reject a proposed parameter. The proposal distribution functiong(θ′|θ) returns a value θ′ dependent on the argument θ, interpreted as a newparameter set θ′ from the old set θ in the algorithm. The proposal function isreversible, in the sense that if one steps to θ′, you can use g to return to θ. Fora symmetric proposal function, the fraction in Step 7 of the algorithm reduces

to L(θ′)L(θ) . A possible choice is to let g be a Gaussian distribution centered

around θ. Points close to θ are thus more common than distant ones. In thisthesis a log-normal distribution is chosen as none of the parameters are allowedto be negative.

4.1 Summary statistics

A data set of length n is said to have n dimensions. Comparing two data setsof length n means comparing two sets of n dimensions, which is ineffective forlarge n. Therefore, a more efficient approach is to summarize the data into alower dimension m, where m� n. The comparison is then between two datasets of dimension m instead of n. The m summary statistics (SS) provide thenew dimension m.

A SS can often be categorized into one of four types: location (e.g. meanor median), spread (e.g. standard deviation or range), shape (e.g. skewnessor kurtosis), and dependence (e.g. Pearson product-moment correlation co-efficient). Summarizing the data with one SS from each of the mentionedcategories could give a good set of summarizing data points for the entire dataset.

If the SS summarizes the data perfectly, they are called sufficient statisticsand are a mapping from n to m dimensions without loss of any information [4].

Page 26: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

23

The goal for choosing SS is to mimic sufficient statistics but is often notreachable.

Finding what SS to adopt is mostly based on heuristics for each appliedproblem [11, 21]. However, Nunes et al. [13] studies a quantification of theproblem in search for a non-heuristic solution of finding SS.

The choice of SS in this thesis is selected to be the prevalence, the number ofinfected holdings, as defined in equation (3.4), but without the square difference.Defining the prevalence defined as

P =∑

tn∈y,q,kYin(θ), (4.2)

which aggregates data over time slots of quarters, q, in years, y, for clusters, k.Y ∗in denotes the nth observed status, 1-positive and 0-negative, in node i attime tn.

4.2 Approximate Bayesian computations

In recent years, as the accessibility of computational power, ABC has become aviable option for Bayesian inference. The advantage of using ABC over MCMCis the possibility of using SS when a likelihood is not available [17]. ABC iscomputationally heavy, compared to MCMC, because simulated trajectoriesare evaluated through SS, which hold less information than a likelihood, andone proposes new parameters without knowledge of the previously simulatedtrajectory; ABC trades memory for generality. The pseudo-algorithm for ABCis given in Algorithm 2.

Algorithm 2 ABC algorithm [17]

1: Compute summary statistic S∗ of observed data D∗2: Draw θi where i = 1, 2, . . . , N from a prior distribution3: Simulate one set of data Di for each parameter θi4: Compute summary statistics Si of Di5: Calculate the distance between the summary statistics, ρi = ||Si − S∗||6: Option 1: Accept θi for which ρi < ε.7: Option 2: Sort ρi in ascending order and keep highest ranking θi.

For a sufficiently small tolerance ε, the posterior P (θ|D) is approximated[17].

4.3 Synthetic Likelihood

Wood [21] proposed the idea to approximate summary statistics for the data as

s ∼ N (µθ,Σθ), (4.3)

Page 27: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

24 SLMCMC

where µθ is the multivariate mean and Σθ the covariance of a multivariatenormal variable. The validity of the approximation made in (4.3) can bechecked per SS using e.g. the Jarque-Bera test [7,21]. One writes the syntheticlog-likelihood as

ls(θ) = −1

2(s− µθ)T Σ−1θ (s− µθ)− 1

2log |Σθ|, (4.4)

with µθ and Σθ being estimates of µθ and Σθ over Nr simulated observationsof the proposed parameter θ. These estimates are defines in (4.10) and (4.13)As Nr increases, ls starts to behave like a conventional log-likelihood.

4.4 SLMCMC

As in ABC, instead of using an analytically tractable likelihood, SLMCMCaims to use SS to measure a distance between the data generated by theproposed parameters and the reference data. The difference between ABC andSLMCMC is that SLMCMC tries to replicate the MCMC method by using aSS based synthetic likelihood.

This thesis tests two cases of the application of the synthetic likelihood, aheuristic urn-model and the synthetic likelihood defined in (4.4).

A binomial approximation

When replicating the synthetic likelihood in the event of a binomially distributedSS, some simplifications are possible. However, to correctly use the binomialapproximation the measurements should be independent of one another. Thefollowing arguments assume this property.

Assume that for each prevalence measurement P (i) is drawn from the ithurn, where i = 1, 2, . . . , Nm and Nm is the number of prevalence measurements.The heuristic urn model is described by a binomial distribution

P (i) ∼ Bin(N (i), p(i)), (4.5)

where N (i) is the number of observations and p(i) the binomial probability forurn i. By estimating p(i), one can define the likelihood of two measurementsbeing drawn from the same urn.

From the data Dθ∗ generated by the sought after parameter set θ∗, one canestimate p(i) by the frequentist approach,

p(i) =E[P (i)]

N (i), (4.6)

where E[P (i)] is the expected value of P (i). For a proposed parameter set, θ, themodelM(θ) is simulated and P =

(P (1), P (2), . . . , P (Nn)

)is extracted from the

generated data Dθ. The probability mass function of a binomially distributed

Page 28: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

25

variable defines the probability that ˜P (i) is from the same distribution as P (i),where the probability mass function is

P(P (i) = P (i)) =

(N (i)

P (i)

)pP

(i)(1− p)N(i)−P (i)

. (4.7)

The synthetic likelihood, L(θ|Dθ∗), is (4.7) or if multiple P s are extracted fromthe data, e.g. a time series, the product of the probabilities as,

L(θ|Dθ∗) =∏j

P(P (i) = P (i)). (4.8)

By the approximations in (4.5),(4.7), and (4.8), one can construct a searchalgorithm similar to MCMC, and determine an approximated posterior for θ.A comment about the synthetic likelihood extracted in (4.8) is that because ofthe assumption of the binomial distribution of P , each observation is assumedto be discrete. Which is a correct assumption because of the binary result fromthe actual measurements, a farm cannot be half infected. But as we alwayscompare these discrete observations we do not put any more effort into arguingfor continuity when comparing to other approximations of the likelihood.

A normality approximation

For multiple measurements (simulations), the prevalence, P , can be approxi-mated to be normally distributed, as Wood proposed about general SS [21].We therefore assume

P ∼ N (µθ,Σθ) (4.9)

with the mean vector µθ and covariance matrix Σθ. One aims to estimate µθand Σθ in the following manner

µ(i)θ =

∑j

s(i)j /Nr (4.10)

where s is a vector consisting of the SS from multiple simulations with thesame parameters θ, this case the prevalence P as

s =(s(1), s(2), . . . , s(Nm)

)=

=

((P

(1)1 , P

(1)2 , . . . , P

(1)Nr

),

(P

(2)1 , P

(2)2 , . . . , P

(2)Nr

),

. . . ,

(P

(Nm)1 , P

(Nm)2 , . . . , P

(Nm)Nr

)), (4.11)

where i = 1, 2, . . . , Nm is the number of prevalence measures and j = 1, 2, . . . , Nris the number of simulations from the parameter θ. To estimate Σθ, the correla-tion between the SS, one first computes the difference between each simulation

Page 29: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

26 SLMCMC

vector s and the vector of mean values, µθ, as

S = (s(1) − µ(1)θ , s(2) − µ(2)θ , . . . , s(Nm) − µ(Nm)θ ), (4.12)

and then Σθ by the sample covariance matrix,

Σθ = SST /(Nr − 1). (4.13)

(4.4) then gives the synthetic log-likelihood.Note that the number of simulations, Nr, used for estimates µθ and Σθ

needs to be larger or equal to Nm in order to correctly determine the covarianceΣθ. Otherwise, Σθ will have zero eigenvalues, and the determinant becomeszero. A computational trick that can be used to circumvent the problem is tonudge the diagonal of the matrix with a small value δ. However, this will givea slightly biased result.

Page 30: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

5 Results

The parameter estimation setting first simulates a dataset with the parametersgiven in Table 3.2, and then by using two inference methods, ABC andSLMCMC, to see if a posterior is possible to achieve. The aim is to estimateθ = (υ12, υ3), where υ1 = υ2 = υ12 and the remaining parameters are viewedas fixed values previously observed and determined. The true parameterset θ∗, that the methods tries to estimate is as in Table 3.2: (υ12, υ3) =(0.0184, 0.00980).

5.1 ABC

The results covers two cases in ABC. The most general one is a uniform prior,while the second one was a scalar dependence between the two parameters,effectively creating one parameter, υ12 = 2 · υ3.

For 5000 runs, with two separate υ for the age groups (υ1 = υ2 6= υ3) theresulting posterior is given in Figure 5.1. In the figure we see four differentsymbols, representing different top percentages of the data; + - 10%, � - 5%,4 - 2.5%, © - 1.24%. Each higher level includes the lower level points as well.Implying that all points presented are inside the 10% region, and so on.

To further investigate the posterior, the mean and standard deviationare extracted, Figure 5.2 and Figure 5.3 gives the histogram for the top 100values for the same ABC. Presented in both figures are straight lines for themean values and dashed lines for the actual values both added for illustrativepurposes. The moment values are for υ12 mean: 0.019025 and standarddeviation: 0.010493, as seen in Figure 5.2. The mean is close to the true value,but the total range is large (0, 0.04). For υ3 the mean is 0.006740 and standarddeviation is 0.005190. Compared to υ12 the range is smaller (0, 0.2), the truevalue is still inside of one standard deviation, but the histogram looks heavyon the left side.

The histogram in Figure 5.4 gives the resulting posterior with a scalar priorfixated υ3 fixed at the half of υ12. υ12 has a mean: 0.016977, and standarddeviation: 0.001491. Compared to the case without the prior knowledge applied,the mean is not closer to the true value. However, the standard deviation isone order smaller for the same amount of accepted values. One needs fewer

27

Page 31: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

28 SLMCMC

Figure 5.1: Posterior for θ = (υ12, υ3) using ABC with a total of N = 5000trajectories with a decreasing acceptance from 10% to 1.24%. The differentaccuracies are represented by different colors and symbols. The finer accuraciesalso lie inside of the courser ones. The 10% values are also include the 1.25%region.

simulations with the prior applied for the same amount of accuracy as in thecase without it.

5.2 SLMCMC

In testing the two SLMCMC methods, first a synthetic likelihood map iscomputed, then a prior condition is applied which results in a 1-dimensionalparameter space for which the likelihood also is presented. If this map andspace are well defined without singularities, and coupling a maximum is feasible,

Page 32: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

29

0

5

10

0.00 0.01 0.02 0.03 0.04

υ12

Co

un

t

variable Mean True

Figure 5.2: Histogram over υ12 using ABC for the top 100 values of a totalof 5000 trajectories. In addition to the bars, the solid red line illustrates themean value and the dashed blue line marks the true value of the parameter.

then we apply the search algorithm SLMCMC.

The SLMCMC algorithm, with both synthetic likelihood approximations,was used to extract posteriors. The algorithms ran until 100 values wereaccepted. From these values every nth value was kept in order to remove asmuch autocorrelation as possible, and the nth value was the first value forwhich the autocorrelation was lower than 20%, referred to as burn-in period.

The full space coordinate grid is constructed from υ3 = [0, 0.02] andυ12 = [0, 0.04], and the prior later applied is υ12 = 2υ3.

Page 33: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

30 SLMCMC

0

10

20

0.00 0.01 0.02 0.03 0.04

υ3

Co

un

t

variable Mean True

Figure 5.3: Histogram over υ3 using ABC for the top 100 values of a totalof 5000 trajectories. In addition to the bars, the solid red line illustrates themean value and the dashed blue line marks the true value of the parameter.

Binomial distributed prevalence

When estimating p, the original dataset was consisted of 10 simulated trajec-tories. Then for each proposed parameter only one trajectory was used forevaluation.

Figure 5.5 shows the log-likelihood map over the mentioned grid. One seesa similar pattern as in Figure 5.1. The lighter shade of blue represents a higherlikelihood of the prevalence generated from the proposed θ. Contour lines arealso added to the figure to emphasize the change in likelihood as one travelsfrom the darker regions from the upper right to the lower left of the grid.

After applying the prior condition to θ, the log-likelihood can be presented ina one-dimensional parameter space as shown in Figure 5.6. The blue triangles

Page 34: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

31

0

5

10

15

0.00 0.01 0.02 0.03 0.04

υ12

Co

un

t

variable Mean True

Figure 5.4: Histogram over υ12 using ABC for the top 100 values of a total of5000 trajectories when applying the prior: υ12 = 2υ3. In addition to the bars,the solid red line illustrates the mean value and the dashed blue line marksthe true value of the parameter.

are computed values of the log-likelihood and the red circle indicates themaximum log-likelihood which is υ12 = 0.018, a close estimate of the genuinevalue, observable in Figure 5.7 which is the likelihood and not log-likelihood.

Figure 5.8 illustrates the SLMCMC generated posterior for υ12 with theprior applied on υ12 to be the scalar 2υ3. The resulting mean is υ12 = 0.018694and standard deviation is υ12 = 1.805509 · 10−4. The burn-in was reached after4 values.

Page 35: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

32 SLMCMC

0.000

0.005

0.010

0.015

0.020

0.00 0.01 0.02 0.03 0.04

υ12

υ3

−4000

−3000

−2000

−1000

L

Figure 5.5: The log-likelihood computed over θ-pairs using the binomialapproximation of the prevalence. The lighter the color the higher the likelihood.The white lines are to contour the level changes in the log-likelihood.

Normal distributed prevalence

In the computation of the synthetic likelihood, 10 simulations were made foreach proposed parameter, requiring the diagonal of the estimated covariancematrix to be perturbed by δ = 10−3, in order to have non-zero eigenvalues.

In Figure 5.9 one sees the log-likelihood map from the normally distributedapproximation of the prevalence. The transitions are not as smooth as inFigure 5.5, but the major contour lines are there. There is a difference betweenthe maps also in magnitude as seen in the level scale on the right-hand side ofthe figures.

Applying the same prior to θ in the log-likelihood map as in Figure 5.6one gets the log-likelihood versus υ12 in Figure 5.10 and Figure 5.11 for the

Page 36: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

33

−4000

−3000

−2000

−1000

0

0.01 0.02 0.03 0.04

υ12

log

−lik

elih

oo

d

variable max up12

Figure 5.6: A splice of the log-likelihood over θ-pairs using the binomialapproximation under prior υ12 = 2υ3. The maximum likelihood is marked witha red circle on υ12 = 0.018.

likelihood without the logarithm. The likelihood in Figure 5.11 was scaled bya factor 1/100 to emphasize the difference in likelihood between points. Thered circles the maximum of the log-likelihood υ12 = 0.018. The right-handside of the maximum is not as smooth as in Figure 5.6, which is caused by thenon-smoothness present in Figure 5.9.

As for the binomial approximation, an SLMCMC algorithm is deployedand the resulting posterior is visualized in Figure 5.12. The mean value is0.015765, standard deviation is 0.000541, and the burn-in was reached after 4values. The result is similar to that of the binomial one, but with the meanoutside the desired area.

Page 37: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

34 SLMCMC

−4.103678e−56

1.846655e−55

4.103678e−55

6.360701e−55

8.617723e−55

0.01 0.02 0.03 0.04

υ12

like

lih

od

Figure 5.7: The splice presented in Figure 5.6 but without the logarithm.

Page 38: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

35

0

5

10

15

0.00 0.01 0.02 0.03 0.04

υ12

Co

un

t

variable Mean True

Figure 5.8: The posterior from using SLMCMC with the binomial approximatedprevalence subjected to the prior υ12 = 2υ3, with the acceptance rate 0.9599%.In addition to the bars, the solid red line illustrates the mean value and thedashed blue line marks the true value of the parameter.

Page 39: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

36 SLMCMC

0.000

0.005

0.010

0.015

0.020

0.00 0.01 0.02 0.03 0.04

υ12

υ3

−3e+05

−2e+05

−1e+05

L

Figure 5.9: The log-likelihood computed over θ-pairs using the normal approx-imation of the prevalence. The lighter the color higher the likelihood. Thewhite lines are the contours of the levels in the log-likelihood.

Page 40: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

37

−3e+05

−2e+05

−1e+05

0e+00

0.01 0.02 0.03 0.04

υ12

log

−lik

elih

oo

d

variable max up12

Figure 5.10: A splice of the log-likelihood over θ-pairs using the normalapproximation under priorυ12 = 2υ3. The maximum likelihood is marked witha black circle.

Page 41: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

38 SLMCMC

0.0e+00

5.0e−08

1.0e−07

1.5e−07

0.01 0.02 0.03 0.04

υ12

e(l

og

−lik

elih

od

10

0)

Figure 5.11: The splice presented in Figure 5.10 but without the logarithm.The log-likelihood was scaled by a factor 1/100 before removing the logarithm.

Page 42: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

39

0

5

10

0.00 0.01 0.02 0.03 0.04

υ12

Co

un

t

variable Mean True

Figure 5.12: The posterior from using SLMCMC with the normal approximatedprevalence subjected to the prior υ12 = 2υ3, with the acceptance rate 35.97122%.In addition to the bars, the solid red line illustrates the mean value and thedashed blue line marks the true value of the parameter.

Page 43: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,
Page 44: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

6 Discussion

Variance building sources have been prevalent in the performed simulations.The system simulation and the detection swiping were the two contributors.One way of combating the induced variance is to do multiple runs for the sameparameters and take the average outcome, both for the trajectory and thedetection. The problem with that solution is the added computational time.To compute, and evaluate the data, one needs to pay in computation time. Asingle trajectory simulation for the dataset takes 20 seconds and the detectiontake 1 second both on a dual core Intel Core i5 CPU 750 @ 2.67GHz, utilizingOpenMP for parallelization. In a perspective to achieve an acceptable posteriorusing ABC of 100 (top 2%) runs one needs to compute 5000 trajectories intotal, resulting in 29 hours for one run and one swipe for each parameter set.Taking the average of multiple trajectories should give a smoother distancesurface, but at the price of increasing the total computational time with almostN times the total time for N simulations.

The choice of SS was something briefly debated in the start of the project.When handling spatial and time dependent data and one wants to quantify itsstatistical properties, one could just compare the data directly, but as arguedin Section 4.1 comparing data in its full dimension is not always desirable.First, we used prevalence and incidence. Incidence added information aboutspace dependence, forcing the disease to spread to multiple nodes. However,comparing data ordered by only prevalence with data ordered by prevalenceand incidence, resulted in almost equal posteriors using ABC. The alternativeto incidence was clustering of the data. Instead of having the data aggregatedinto one time series, the cluster analysis gave five distinct groups. Therefore thedata was after that aggregated into five trajectories. The evaluation platformthen became five prevalence trajectories.

As each simulation of a trajectory is expensive in computational time,ABC is not the most efficient method. ABC works well as a method forexploring the posterior space. Other more refined investigation methodslike the SLMCMC algorithm are efficiently deployed after knowing that theposterior is discoverable. From the inverse crime point of view, ABC is a goodfirst step. If ABC is not able to approximate the posterior, then a finer methodwould not either.

41

Page 45: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

42 CHAPTER 6. DISCUSSION

Based on the fact that spatially the clusters are relatively distant from eachother and are seen as independent the assumption that the prevalence has abinomial distribution is acceptable. The measurements are also distant in timeas they are performed on average one time per six weeks and are viewed seenas independent.

The posteriors produced by the two SLMCMC methods are relativelysimilar. The likelihoods are different in magnitude, but they show a similarpattern. In a sense, this confirms that it is possible to find a traceable structure,or likelihood, in the SS prevalence. However, the mean from the normallydistributed approximation is off from the actual value. The reason is mostlikely due to the adjustment of the diagonal elements by δ.

As an outlook, the normal distribution approach is the most desirable,if computational time is not constraining, because it is more general andnot as instance specific as the binomial approximation. A future designimplementation to decrease the elapsed time is to first evaluate and reject ona pseudo-model level and if accepted do an evaluation of the full model.

Page 46: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

7 Conclusion

In modeling of disease spread, parameter estimation is a serious concern whendata becomes available. This thesis uses observational data from a longitudinalstudy of VTEC O157 in the Swedish cattle population [20]. The infectiousdisease model used was SISE3, which is an extension of the regular SIS-model,but with an added environmental, infectious pressure and three age categories.

Parameter estimation is performed on a synthetic dataset, meant to emulatethe real data with known parameters. Two different estimation algorithmswere used: ABC and SLMCMC. The simpler method, ABC, gave a coarseposterior, not as accurate as desirable in the given time frame. The othermethod, SLMCMC was tested in two flavors, assuming that the SS, prevalence,is either binomially or normally distributed. The former gave a precise andwell defined posterior around the actual value. The latter gave a posterior witha mean not centered around the actual value, showing a biased result becausethe estimated covariance matrix in the synthetic likelihood was perturbed bya small value in order to estimate with a small number of simulations. Thetotal result encourages further development of the SLMCMC method.

43

Page 47: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,
Page 48: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

Bibliography

[1] Mark E Arnold, Johanne Ellis-Iversen, Alasdair JC Cook, Robert HDavies, Ian M McLaren, Anthony CS Kay, and Geoff C Pritchard. In-vestigation into the effectiveness of pooled fecal samples for detectionof verocytotoxin-producing Escherichia coli O157 in cattle. Journal ofVeterinary Diagnostic Investigation, 20(1):21–27, 2008.

[2] Norman TJ Bailey. The mathematical theory of infectious diseases and itsapplications. Charles Griffin & Company Ltd, 5a Crendon Street, HighWycombe, Bucks HP13 6LE., 1975.

[3] Pavol Bauer, Stefan Engblom, and Stefan Widgren. Fast event-basedepidemiological simulations on national scales. The International Journalof High Performance Computing Applications, 30(4):438–453, 2016.

[4] David Roxbee Cox. Principles of Statistical Inference. Cambridge Univer-sity Press, 2006.

[5] William C Cray and Harley W Moon. Experimental infection of calves andadult cattle with Escherichia coli O157: H7. Applied and EnvironmentalMicrobiology, 61(4):1586–1590, 1995.

[6] W Keith Hastings. Monte Carlo sampling methods using Markov chainsand their applications. Biometrika, 57(1):97–109, 1970.

[7] Carlos M Jarque and Anil K Bera. Efficient tests for normality, ho-moscedasticity and serial independence of regression residuals. Economicsletters, 6(3):255–259, 1980.

[8] Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverseproblems, volume 160. Springer Science & Business Media, 2006.

[9] Rudolph Emil Kalman. A new approach to linear filtering and predictionproblems. Journal of Basic Engineering, 82(1):35–45, 1960.

[10] Arti Kapi. The evolving threat of antimicrobial resistance: Options foraction. Indian Journal of Medical Research, 139(1):182, 2014.

45

Page 49: Bayesian Parameterization in the spread of Diseases1127474/FULLTEXT01.pdf · In the proposal of models (when data is available), one needs parameter estimation methods. In this thesis,

46 BIBLIOGRAPHY

[11] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavare. Markovchain Monte Carlo without likelihoods. Proceedings of the NationalAcademy of Sciences, 100(26):15324–15328, 2003.

[12] Harold C Neu. The crisis in antibiotic resistance. Science, 257(5073):1064–1074, 1992.

[13] Matthew A Nunes and David J Balding. On optimal selection of summarystatistics for approximate Bayesian computation. Statistical Applicationsin Genetics and Molecular Biology, 9(1):34, 2010.

[14] Jim O’Neill. Tackling drug-resistant infections globally: final report andrecommendations. The review on antimicrobial resistance, 2016.

[15] Yannis Pantazis, Markos A Katsoulakis, and Dionisios G Vlachos. Para-metric sensitivity analysis for biochemical reaction networks based onpathwise information theory. BMC Bioinformatics, 14(1):311, 2013.

[16] Sebastian Reich and Colin Cotter. Probabilistic Forecasting and Bayesiandata Assimilation. Cambridge University Press, 2015.

[17] Mikael Sunnaker, Alberto Giovanni Busetto, Elina Numminen, JukkaCorander, Matthieu Foll, and Christophe Dessimoz. Approximate Bayesiancomputation. PLoS Comput Biol, 9(1):e1002803, 2013.

[18] Stefan Widgren, Pavol Bauer, and Stefan Engblom. Siminf: An R packagefor data-driven stochastic disease spread simulations. arXiv preprintarXiv:1605.01421, 2016.

[19] Stefan Widgren, Stefan Engblom, and Ulf Emanuelson. Spatio-temporalmodelling of verotoxigenic Escherichia coli O157 in cattle in Sweden:Exploring options for control. manuscript, 2017.

[20] Stefan Widgren, Erik Eriksson, Anna Aspan, Ulf Emanuelson, StefanAlenius, and Ann Lindberg. Environmental sampling for evaluatingverotoxigenic Escherichia coli O157: H7 status in dairy cattle herds.Journal of Veterinary Diagnostic Investigation, 25(2):189–198, 2013.

[21] Simon N Wood. Statistical inference for noisy nonlinear ecological dynamicsystems. Nature, 466(7310):1102–1104, 2010.