Event-based Failure Prediction - hu-berlin.de · gang mit Fehlern: Im Anschluss an die Vorhersage...

Event-based Failure PredictionAn Extended Hidden Markov Model Approach

DISSERTATION

zur Erlangung des akademischen GradesDoktor-Ingenieur (Dr.-Ing.)

im Fach Informatik

eingereicht an derMathematisch-Naturwissenschaftlichen Fakultät II

Humboldt-Universität zu Berlin

vonHerrn Dipl.-Ing. Felix Salfner

geboren am 27.04.1974 in Düsseldorf

Präsident der Humboldt-Universität zu BerlinProf. Dr. Christoph Markschies

Dekan der Mathematisch-Naturwissenschaftlichen Fakultät IIProf. Dr. Wolfgang Coy

Gutachter:

1. Prof. Dr. M. Malek2. Prof. Dr. Dr. h.c. G. Hommel3. Prof. Dr. A. Reinefeld

Tag der mündlichen Prüfung: 6.2.2008

To Gesine, Anton Linus, Henry, and

Fabienne.

iii

Acknowledgments

First of all, I would like to thank my doctoral advisor Miroslaw Malek for his ongoingsupport and advice —I have benefitted greatly from his broad experience. I am alsovery grateful to Katinka Wolter, who has led me to the fascinating beauty of stochasticprocesses and who has repeatedly helped me to review, rethink, and revise my ideas.

A part of this work was carried out as a member of the Graduate School “StochastischeModellierung und quantitative Analyse großer Systeme in den Ingenieurwissenschaften”(MAGSI), which has provided an inspiring scientific environment. I would like to thankthe members of MAGSI for discussions and for giving feedback on my work from themost diverse viewpoints. In particular, I would like to acknowledge the effort of GünterHommel and Armin Zimmermann (Technical University Berlin) organizing and providinga forum for stimulating scientific exchange, and of Tobias Harks (Technical UniversityBerlin) who kept a watchful eye on the mathematical aspects of this work. This work wasalso greatly improved by fruitful discussions with my colleagues, especially with GüntherHoffmann, Maren Lenk, and Peter Ibach, and by the great support from Jan Richlingand Steffen Tschirpke, whom I hereby thank. I am also grateful for discussions, help,and comments from Alexander Schliep (Max Planck Institute for Molecular Genetics,Berlin), Tobias Scheffer and Ulf Brefeld (Max Planck Institute for Computer Science),Aad van Moorsel (School of Computer Science, Newcastle University), who have givenmany impulses to my work, and I would like to express my thanks to my old friend PatrickStiegeler, who was an open-minded reviewer of my thesis.

Besides the working life, I am very grateful to my parents taking good care of meespecially during writing of the first half of the thesis and for improving many of thefigures found in this dissertation. Finally, I want to extend my most heartfelt thanks to mywonderful wife Fabienne and our children, without whose support and consideration thiswork would not have come into existence.

This work was supported also by Deutsche Forschungsgemeinschaft (German Re-search Foundation) project “Failure Prediction in Critical Infrastructures” and Intel Cor-poration.

v

Abstract

Human lives and organizations are increasingly dependent on the correct functioning ofcomputer systems and their failure might cause personal as well as economic damage.There are two non-exclusive approaches to minimize the risk of such hazards: (a) fault-intolerance tries to eliminate design and manufacturing faults for hardware and softwarebefore a system is put into service. (b) fault-tolerance techniques deal with faults that oc-cur during service trying to avert that faults turn into failures. Since faults, in most cases,cannot be ruled out, we focus on the second approach. Traditionally, fault tolerance hasfollowed a reactive scheme of fault detection, location and subsequent recovery by redun-dancy either in space or time. However, in recent years the focus has changed from thesereactive methods towards more proactive schemes that try to evaluate the current situationof a running system in order to start acting even before the failure occurs. Once a fail-ure is predicted, it may either be prevented or the outage may be shifted from unplannedto planned downtime, which can both improve significantly the system’s reliability. Thefirst step in this approach, online failure prediction, is the main focus of this thesis. Theobjective of the online failure prediction is to predict the occurrence of failures in the nearfuture based on the current state of the system as it is observed by runtime monitoring.

A new failure prediction method that builds on the evaluation of error events is in-troduced in this dissertation. More specifically, it treats the occurrence of errors as anevent-driven temporal sequence and applies a pattern recognition technique in order topredict upcoming failures. Hidden Markov models have successfully solved many pat-tern recognition tasks. However, standard hidden Markov models are not well-suited toprocessing sequences in continuous time and existing augmentations do not account ade-quately for the event-driven character of error sequences. Hence, an extension of hiddenMarkov models has been developed that employs a semi-Markov process to state traver-sals providing the flexibility to model a great variety of temporal characteristics of theunderlying stochastic process.

The proposed hidden semi-Markov model has been applied to industrial data of acommercial telecommunication platform. The case study showed significantly improvedfailure prediction capabilities in comparison to well-known existing approaches. The casestudy also demonstrated that hidden semi-Markov models perform significantly betterthan standard hidden Markov models.

In order to assess the impact of failure prediction and subsequent actions, a reliabil-ity model has been developed that enables to compute steady-state system availability,reliability and hazard rate. Based on the model, it is shown that such approaches cansignificantly improve system dependability.

Keywords:Event-based failure prediction, Hidden semi-Markov model, Proactive faultmanagement, Autonomic Computing

vii

Zusammenfassung

Es gibt kaum mehr einen Bereich in unserer Gesellschaft, der nicht an ein korrektes undfehlerfreies Funktionieren von zum Teil hochkomplexen Computersystemen gebunden ist.So kann nicht nur das Überleben ganzer Unternehmen davon abhängen, sondern auch dasLeben von Menschen. Es gibt zwei grundlegende Ansätze mit diesem Risiko umzugehen:(a) man versucht, Fehlerursachen während der Entwurfs- und Herstellungsphase, alsonoch bevor das System in Betrieb geht, zu eliminieren (Fehler-Intoleranz) und / oder (b)man versucht, um einen Ausfall des Systems zu verhindern, ein System zu bauen, dasmit Fehlern —die trotz ausgefeilter Fehler-Intoleranz Verfahren in der Produktionsphaseauftreten können— umgehen kann (Fehlertoleranz). Die vorliegende Arbeit konzentriertsich auf letzteren Ansatz.

Traditionell haben Fehlertoleranz-Verfahren auf Fehler lediglich reagiert und ver-sucht, Ausfälle des Gesamtsystems durch räumliche oder zeitliche Redundanz zu ver-hindern. In den letzten Jahren hat sich der Fokus der Forschung jedoch von diesen eherstatischen Verfahren zu dynamischeren Ansätzen verschoben, die versuchen, bereits vordem Auftreten eines Fehlers einzugreifen. Dazu wird der Zustand des laufenden Systemsüberwacht und analysiert, um einen möglichen Ausfall vorherzusagen. Bei einem dro-henden Ausfall wird dann entweder versucht, den Ausfall zu verhindern, oder sich aufihn vorzubereiten, um die Reparaturzeit zu verringern. Beides kann die Zuverlässigkeitdes Systems erheblich verbessern.

Die vorliegende Arbeit beschäftigt sich vorwiegend mit der Vorhersage von Ausfällenund verfolgt dazu einen Ansatz, der auf der Erkennung von Mustern in Sequenzen vonFehlerereignissen basiert. Das entwickelte Vorhersageverfahren ist das erste, das sowohldie Art von Fehlerereignissen, als auch deren Auftrittszeitpunkt erfolgreich integriert unddas ein Mustererkennungsverfahren anwendet, um zu entscheiden, ob eine im System be-obachtete Sequenz von Fehlern symptomatisch für einen drohenden Ausfall ist oder nicht.Das Mustererkennungsverfahren basiert auf zu “hidden semi-Markov Modellen” erwei-terten “hidden Markov Modellen,” die dem ereignisgesteuerten Charakter von Fehlernbesser gerecht werden.

Das Ausfallvorhersageverfahren wurde auf Daten einer kommerziellen Telekommu-nikationsplattform angewandt und evaluiert. Sowohl im Vergleich zu den bekanntestenexistierenden Verfahren als auch im Vergleich zu herkömmlichen zeitdiskreten “hiddenMarkov Modellen” wird eine signifikant bessere Vorhersagegüte erreicht.

Eine Ausfallvorhersage ist lediglich der erste wichtige Schritt für einen aktiven Um-gang mit Fehlern: Im Anschluss an die Vorhersage müssen Aktionen ausgeführt werden,um einen drohenden Ausfall zu vermeiden beziehungsweise seine Folgen zu minimieren.In der Arbeit wird ein Zuverlässigkeitsmodell vorgestellt, mit dem stationäre Verfügbar-keit, Zuverlässigkeit und Hazard-Rate von Systemen mit Ausfallvorhersage und anschlie-ßenden Maßnahmen berechnet werden können. Mit Hilfe dieses Modells kann gezeigtwerden, dass die Kombination aus Ausfallvorhersage und sich anschließsenden Aktionendie Systemzuverlässigkeit erheblich verbessern kann.

Schlagwörter:Ereignisgesteuerte Ausfallvorhersage, Hidden Semi-Markov Modell, PräventiveFehlertoleranz, Autonomic Computing

ix

Contents

List of Figures xvii

List of Tables xxi

Mathematical Notation xxiii

Preface xxv

I Introduction, Problem Statement, and Related Work 1

1 Introduction, Motivation and Main Contributions 31.1 From Fault Tolerance to

Proactive Fault Management . . . . . . . . . . . . . . . . . . . . . . . 41.2 Origins and Background . . . . . . . . . . . . . . . . . . . . . . . . . 51.3 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Problem Statement, Key Properties, and Approach to Solution 92.1 A Definition of Online Failure Prediction . . . . . . . . . . . . . . . . . 9

2.1.1 Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.1.2 Online Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 The Objective of the Case Study . . . . . . . . . . . . . . . . . . . . . 122.3 Key Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.5 Analysis of the Approach . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5.1 Identifiable Types of Failures . . . . . . . . . . . . . . . . . . . 192.5.2 Identifiable Types of Faults . . . . . . . . . . . . . . . . . . . . 202.5.3 Relation to Other Research Areas and Issues . . . . . . . . . . . 24

2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3 A Survey of Online Failure Prediction Methods 293.1 A Taxonomy and Survey of Online Failure Prediction Methods . . . . . 293.2 Methods Used for Comparison . . . . . . . . . . . . . . . . . . . . . . 45

3.2.1 Dispersion Frame Technique . . . . . . . . . . . . . . . . . . . 463.2.2 Eventset Method . . . . . . . . . . . . . . . . . . . . . . . . . 483.2.3 SVD-SVM Method . . . . . . . . . . . . . . . . . . . . . . . . 503.2.4 Periodic Prediction . . . . . . . . . . . . . . . . . . . . . . . . 53

3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

xi

4 Introduction to Hidden Markov Models and Related Work 554.1 An Introduction to Hidden Markov Models . . . . . . . . . . . . . . . . 55

4.1.1 The Forward-Backward Algorithm . . . . . . . . . . . . . . . . 584.1.2 Training: The Baum-Welch Algorithm . . . . . . . . . . . . . . 60

4.2 Sequences in Continuous Time . . . . . . . . . . . . . . . . . . . . . . 634.2.1 Four Approaches to Incorporate Continuous Time . . . . . . . . 64

4.3 Related Work on Time-Varying Hidden Markov Models . . . . . . . . . 674.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

II Modeling 73

5 Data Preprocessing 755.1 From Logfiles to Sequences . . . . . . . . . . . . . . . . . . . . . . . . 75

5.1.1 From Messages to Error-IDs . . . . . . . . . . . . . . . . . . . 755.1.2 Tupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765.1.3 Extracting Sequences . . . . . . . . . . . . . . . . . . . . . . . 79

5.2 Clustering of Failure Sequences . . . . . . . . . . . . . . . . . . . . . 795.2.1 Obtaining the Dissimilarity Matrix . . . . . . . . . . . . . . . . 805.2.2 Grouping Failure Sequences . . . . . . . . . . . . . . . . . . . 815.2.3 Determining the Number of Groups . . . . . . . . . . . . . . . 825.2.4 Additional Notes on Clustering . . . . . . . . . . . . . . . . . . 83

5.3 Filtering the Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835.4 Improving Logfiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.4.1 Event Type and Event Source . . . . . . . . . . . . . . . . . . . 865.4.2 Hierarchical Numbering . . . . . . . . . . . . . . . . . . . . . 875.4.3 Logfile Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 895.4.4 Existing Solutions . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6 The Model 956.1 The Hidden Semi-Markov Model . . . . . . . . . . . . . . . . . . . . . 95

6.1.1 Wrap-up of Semi-Markov Processes . . . . . . . . . . . . . . . 956.1.2 Combining Semi-Markov Processes with HMMs . . . . . . . . 97

6.2 Sequence Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 996.2.1 Recognition of Temporal Sequences: The Forward Algorithm . 996.2.2 Sequence Prediction . . . . . . . . . . . . . . . . . . . . . . . 102

6.3 Training Hidden Semi-Markov Models . . . . . . . . . . . . . . . . . . 1056.3.1 Beta, Gamma and Xi . . . . . . . . . . . . . . . . . . . . . . . 1056.3.2 Reestimation Formulas . . . . . . . . . . . . . . . . . . . . . . 1066.3.3 A Summary of the Training Algorithm . . . . . . . . . . . . . . 109

6.4 Difference Between the Approach and other HSMMs . . . . . . . . . . 1126.5 Proving Convergence of the Training Algorithm . . . . . . . . . . . . . 116

6.5.1 A Proof of Convergence Framework . . . . . . . . . . . . . . . 1166.5.2 The Proof for HSMMs . . . . . . . . . . . . . . . . . . . . . . 119

6.6 HSMMs for Failure Prediction . . . . . . . . . . . . . . . . . . . . . . 1256.7 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . 1286.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

xii

7 Classification 1337.1 Bayes Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 133

7.1.1 Simple Classification . . . . . . . . . . . . . . . . . . . . . . . 1347.1.2 Classification with Costs . . . . . . . . . . . . . . . . . . . . . 1357.1.3 Rejection Thresholds . . . . . . . . . . . . . . . . . . . . . . . 136

7.2 Classifiers for Failure Prediction . . . . . . . . . . . . . . . . . . . . . 1367.2.1 Threshold on Sequence Likelihood . . . . . . . . . . . . . . . . 1367.2.2 Threshold on Likelihood Ratio . . . . . . . . . . . . . . . . . . 1367.2.3 Using Log-likelihood . . . . . . . . . . . . . . . . . . . . . . . 1377.2.4 Multi-class Classification Using Log-Likelihood . . . . . . . . 138

7.3 Bias and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1387.3.1 Bias and Variance for Regression . . . . . . . . . . . . . . . . . 1387.3.2 Bias and Variance for Classification . . . . . . . . . . . . . . . 1407.3.3 Conclusions for Failure Prediction . . . . . . . . . . . . . . . . 143

7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

III Applications of the Model 147

8 Evaluation Metrics 1498.1 Evaluation of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 149

8.1.1 Dendrograms . . . . . . . . . . . . . . . . . . . . . . . . . . . 1498.1.2 Banner Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1518.1.3 Agglomerative and Divisive Coefficient . . . . . . . . . . . . . 151

8.2 Metrics for Prediction Quality . . . . . . . . . . . . . . . . . . . . . . 1528.2.1 Contingency Table . . . . . . . . . . . . . . . . . . . . . . . . 1538.2.2 Metrics Obtained from Contingency Tables . . . . . . . . . . . 1548.2.3 Plots of Contingency Table Measures . . . . . . . . . . . . . . 1578.2.4 Cost Impact of Failure Prediction . . . . . . . . . . . . . . . . . 1608.2.5 Other Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

8.3 Evaluation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1668.3.1 Setting of Parameters . . . . . . . . . . . . . . . . . . . . . . . 1668.3.2 Three Types of Data Sets . . . . . . . . . . . . . . . . . . . . . 1678.3.3 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . 168

8.4 Statistical Confidence . . . . . . . . . . . . . . . . . . . . . . . . . . . 1688.4.1 Theoretical Assessment of Accuracy . . . . . . . . . . . . . . . 1688.4.2 Confidence Intervals by Assuming Normal Distributions . . . . 1698.4.3 Jackknife . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1708.4.4 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 1708.4.5 Bootstrapping with Cross-validation . . . . . . . . . . . . . . . 1718.4.6 Confidence Intervals for Plots . . . . . . . . . . . . . . . . . . 172

8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

9 Experiments and Results Based on Industrial Data 1759.1 Description of the Case Study . . . . . . . . . . . . . . . . . . . . . . . 1759.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

9.2.1 Making Logfiles Machine-Processable . . . . . . . . . . . . . . 1779.2.2 Error-ID Assignment . . . . . . . . . . . . . . . . . . . . . . . 178

xiii

9.2.3 Tupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1799.2.4 Extracting Sequences . . . . . . . . . . . . . . . . . . . . . . . 1809.2.5 Grouping (Clustering) of Failure Sequences . . . . . . . . . . . 1829.2.6 Noise Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . 188

9.3 Properties of the Preprocessed Dataset . . . . . . . . . . . . . . . . . . 1919.3.1 Error Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . 1929.3.2 Distribution of Delays . . . . . . . . . . . . . . . . . . . . . . 1929.3.3 Distribution of Failures . . . . . . . . . . . . . . . . . . . . . . 1939.3.4 Distribution of Sequence Lengths . . . . . . . . . . . . . . . . 197

9.4 Training HSMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1989.4.1 Parameter Space . . . . . . . . . . . . . . . . . . . . . . . . . 1989.4.2 Results for Parameter Investigation . . . . . . . . . . . . . . . . 199

9.5 Detailed Analysis of Failure Prediction Quality . . . . . . . . . . . . . 2059.5.1 Precision, Recall, and F-measure . . . . . . . . . . . . . . . . . 2059.5.2 ROC and AUC . . . . . . . . . . . . . . . . . . . . . . . . . . 2059.5.3 Accumulated Runtime Cost . . . . . . . . . . . . . . . . . . . . 206

9.6 Dependence on Application Specific Parameters . . . . . . . . . . . . . 2079.6.1 Lead-Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2079.6.2 Data Window Size . . . . . . . . . . . . . . . . . . . . . . . . 208

9.7 Dependence on Data Specific Issues . . . . . . . . . . . . . . . . . . . 2099.7.1 Size of the Training Data Set . . . . . . . . . . . . . . . . . . . 2109.7.2 System Configuration and Model Aging . . . . . . . . . . . . . 211

9.8 Failure Sequence Grouping and Filtering . . . . . . . . . . . . . . . . . 2129.8.1 Failure Grouping . . . . . . . . . . . . . . . . . . . . . . . . . 2129.8.2 Sequence Filtering . . . . . . . . . . . . . . . . . . . . . . . . 213

9.9 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 2139.9.1 Dispersion Frame Technique (DFT) . . . . . . . . . . . . . . . 2149.9.2 Eventset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2159.9.3 SVD-SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2169.9.4 Periodic Prediction Based on MTBF . . . . . . . . . . . . . . . 2179.9.5 Comparison with Standard HMMs . . . . . . . . . . . . . . . . 2179.9.6 Comparison with Random Predictor . . . . . . . . . . . . . . . 2189.9.7 Comparison with UBF . . . . . . . . . . . . . . . . . . . . . . 2199.9.8 Discussion and Summary of Comparative Approaches . . . . . 219

9.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

IV Improving Dependability, Conclusions, and Outlook 225

10 Assessing the Effect on Dependability 22710.1 Proactive Fault Management . . . . . . . . . . . . . . . . . . . . . . . 227

10.1.1 Downtime Avoidance . . . . . . . . . . . . . . . . . . . . . . . 22910.1.2 Downtime Minimization . . . . . . . . . . . . . . . . . . . . . 229

10.2 Related Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23110.3 The Availability Model . . . . . . . . . . . . . . . . . . . . . . . . . . 233

10.3.1 The Original Model for Software Rejuvenation by Huang et al. . 23310.3.2 Availability Model for Proactive Fault Management . . . . . . . 234

10.4 Computing the Rates of the Model . . . . . . . . . . . . . . . . . . . . 236

xiv

10.4.1 The Parameters in Detail . . . . . . . . . . . . . . . . . . . . . 23710.4.2 Computing the Rates from Parameters . . . . . . . . . . . . . . 239

10.5 Computing Availability . . . . . . . . . . . . . . . . . . . . . . . . . . 24310.6 Computing Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

10.6.1 The Reliability Model . . . . . . . . . . . . . . . . . . . . . . . 24410.6.2 Reliability and Hazard Rate . . . . . . . . . . . . . . . . . . . . 245

10.7 How to Estimate the Parameters from Experiments . . . . . . . . . . . 24610.7.1 Failure Prediction Accuracy . . . . . . . . . . . . . . . . . . . 24610.7.2 Failure Probabilities PTP , PFP , and PTN . . . . . . . . . . . . . 24710.7.3 Repair Time Improvement k . . . . . . . . . . . . . . . . . . . 25110.7.4 Summary of the Estimation Procedure . . . . . . . . . . . . . . 252

10.8 A Case Study and an Example . . . . . . . . . . . . . . . . . . . . . . 25210.8.1 Experiment Description . . . . . . . . . . . . . . . . . . . . . . 25210.8.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25410.8.3 An Advanced Example . . . . . . . . . . . . . . . . . . . . . . 258

10.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258

11 Summary and Conclusions 26311.1 Phase I: Problem Statement, Key Properties and Related Work . . . . . 26311.2 Phase II: Data Preprocessing, the Model, and Classification . . . . . . . 265

11.2.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 26511.2.2 The Hidden Semi-Markov Model . . . . . . . . . . . . . . . . . 26611.2.3 Sequence Classification . . . . . . . . . . . . . . . . . . . . . . 268

11.3 Phase III: Evaluation Methods and Results for Industrial Data . . . . . . 26811.3.1 Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . 26811.3.2 Results for the Telecommunication System Case Study . . . . . 270

11.4 Phase IV: Dependability Improvement . . . . . . . . . . . . . . . . . . 27311.4.1 Proactive Fault Management . . . . . . . . . . . . . . . . . . . 27311.4.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27311.4.3 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . 27411.4.4 Case Study and an Advanced Example . . . . . . . . . . . . . . 274

11.5 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27411.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

12 Outlook 27712.1 Further Development of Prediction Models . . . . . . . . . . . . . . . . 277

12.1.1 Improving the Hidden Semi-Markov Model . . . . . . . . . . . 27712.1.2 Bias and Variance . . . . . . . . . . . . . . . . . . . . . . . . . 27812.1.3 Online Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 27812.1.4 Further Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 27812.1.5 Further Application Domains for HSMMs . . . . . . . . . . . . 280

12.2 Proactive Fault Management . . . . . . . . . . . . . . . . . . . . . . . 280

V Appendix 283

Derivatives with respect to Parameters for Selected Distributions 285

xv

Erklärung 289

Acronyms 291

Index 295

Bibliography 301

xvi

List of Figures

1.1 Predict-react cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2 The engineering cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1 Definitions and interrelations of faults, errors and failures . . . . . . . . . 102.2 Four stages where faults can become visible . . . . . . . . . . . . . . . . 112.3 Distinction between root cause analysis and failure prediction . . . . . . . 112.4 Time relations in online failure prediction . . . . . . . . . . . . . . . . . 122.5 Failure definition for the case study . . . . . . . . . . . . . . . . . . . . . 132.6 Data acquisition setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.7 Two phase machine learning approach . . . . . . . . . . . . . . . . . . . 162.8 Dependencies among components lead to a temporal sequence of errors . 172.9 Overview of the training procedure . . . . . . . . . . . . . . . . . . . . . 192.10 Overview of the online failure prediction approach . . . . . . . . . . . . 202.11 Permanent, intermittent and transient faults (Siewiorek & Swarz [241]). . 212.12 Fault model based on Barborak et al. [23] . . . . . . . . . . . . . . . . . 22

3.1 A taxonomy for online failure prediction approaches . . . . . . . . . . . 313.2 Failure prediction by function approximation . . . . . . . . . . . . . . . 343.3 Failure prediction using signal processing techniques . . . . . . . . . . . 393.4 Failure prediction based on the occurrence of errors . . . . . . . . . . . . 403.5 Failure prediction by recognition of failure-prone error patterns . . . . . . 433.6 Dispersion Frame Technique . . . . . . . . . . . . . . . . . . . . . . . . 463.7 The eventset method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.8 Bag-of-words representation of error sequences . . . . . . . . . . . . . . 513.9 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . . . 513.10 Maximum margin classification using support vector machines . . . . . . 52

4.1 Discrete Time Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . 564.2 A discrete-time hidden Markov model . . . . . . . . . . . . . . . . . . . 574.3 A trellis to visualize the forward algorithm . . . . . . . . . . . . . . . . . 594.4 A trellis visualizing the computation of ξt(i, j) . . . . . . . . . . . . . . . 624.5 Notations for event-driven temporal sequences . . . . . . . . . . . . . . . 634.6 Incorporating continuous time by time slotting . . . . . . . . . . . . . . . 644.7 Duration modeling by a discrete-time HMM with self-transitions. . . . . . 644.8 Representing time by delay symbols . . . . . . . . . . . . . . . . . . . . 654.9 Delay representation by two-dimensional output probability distributions . 664.10 Duration modeling by explicit modeling of state durations . . . . . . . . . 684.11 Topology of an Expanded State HMM . . . . . . . . . . . . . . . . . . . 69

xvii

5.1 From faults to error messages . . . . . . . . . . . . . . . . . . . . . . . . 765.2 Truncation and collision in tupling . . . . . . . . . . . . . . . . . . . . . 785.3 Plotting the number of tuples over time window size ε . . . . . . . . . . . 785.4 Extracting sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795.5 For each failure sequence F i, a separate HSMM M i is trained. . . . . . . 805.6 Matrix of logarithmic sequence likelihoods . . . . . . . . . . . . . . . . 815.7 Inter-cluster distance rules . . . . . . . . . . . . . . . . . . . . . . . . . 835.8 Noise filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845.9 Three different sequence sets to compute symbol prior probabilities . . . 865.10 Hierarchical error numbering with SHIP . . . . . . . . . . . . . . . . . . 875.11 An inherent problem of hard classification approaches . . . . . . . . . . . 885.12 Sets of required information and given information of a log record . . . . 895.13 A plot of log entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915.14 Principle structure of a Common Base Event . . . . . . . . . . . . . . . . 91

6.1 A semi-Markov process . . . . . . . . . . . . . . . . . . . . . . . . . . . 966.2 A sample hidden semi-Markov model . . . . . . . . . . . . . . . . . . . 986.3 Notation for temporal sequences . . . . . . . . . . . . . . . . . . . . . . 996.4 Summary of the complete training algorithm for HSMMs. . . . . . . . . . 1116.5 A simplified sketch of phoneme assignment to a speech signal. . . . . . . 1136.6 Assigning states to observations in speech processing . . . . . . . . . . . 1146.7 Trellis structure for the forward algorithm with duration modeling . . . . 1156.8 Lower bound optimization . . . . . . . . . . . . . . . . . . . . . . . . . 1176.9 Gradient vector projection . . . . . . . . . . . . . . . . . . . . . . . . . 1246.10 Failure prediction model structure used for training . . . . . . . . . . . . 1266.11 Model with intermediate states . . . . . . . . . . . . . . . . . . . . . . . 127

7.1 Classification by maximum posterior for a two-class example . . . . . . . 1347.2 Error in regression problems . . . . . . . . . . . . . . . . . . . . . . . . 1397.3 True and estimated posterior probabilities . . . . . . . . . . . . . . . . . 1417.4 Distribution of estimated posterior . . . . . . . . . . . . . . . . . . . . . 1427.5 Boundary error plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1427.6 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

8.1 Dendrograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1508.2 Banner plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1518.3 Sample precision/recall-plot for two failure predictors . . . . . . . . . . . 1588.4 Sample ROC plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1588.5 Relation between ROC plots and precision and recall . . . . . . . . . . . 1598.6 Detection error trade-off plot . . . . . . . . . . . . . . . . . . . . . . . . 1608.7 Iso-cost lines in ROC space . . . . . . . . . . . . . . . . . . . . . . . . . 1618.8 Determining minimum cost from ROC . . . . . . . . . . . . . . . . . . . 1628.9 Cost curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1628.10 Exemplary accumulated runtime cost . . . . . . . . . . . . . . . . . . . . 1638.11 AUC can be misleading . . . . . . . . . . . . . . . . . . . . . . . . . . . 1658.12 Cross-validation and bootstrapping . . . . . . . . . . . . . . . . . . . . . 1728.13 Averaging ROC curves . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

9.1 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

xviii

9.2 Typical error log record . . . . . . . . . . . . . . . . . . . . . . . . . . . 1779.3 Levenshtein similarity plot . . . . . . . . . . . . . . . . . . . . . . . . . 1799.4 Effect of tupling window size for cluster-wide logfile . . . . . . . . . . . 1809.5 Effect of tupling window size. . . . . . . . . . . . . . . . . . . . . . . . 1819.6 HSMM toplogy for failure sequence grouping . . . . . . . . . . . . . . . 1829.7 Effect of clustering methods. . . . . . . . . . . . . . . . . . . . . . . . . 1849.8 Effect of number of states . . . . . . . . . . . . . . . . . . . . . . . . . . 1869.9 Effect of background distribution weight . . . . . . . . . . . . . . . . . . 1879.10 Values of Xi for noise filtering: Cluster prior . . . . . . . . . . . . . . . . 1899.11 Values of Xi for noise filtering: Cluster failure sequences . . . . . . . . . 1899.12 Values of Xi for noise filtering: all sequences . . . . . . . . . . . . . . . 1909.13 Mean sequence length depending on filtering threshold . . . . . . . . . . 1919.14 Number of errors per five minutes . . . . . . . . . . . . . . . . . . . . . 1929.15 Histogram and QQ-plots of delays between errors . . . . . . . . . . . . . 1949.16 Analysis of time between failure . . . . . . . . . . . . . . . . . . . . . . 1959.17 Normalized autocorrelation of failure occurrence . . . . . . . . . . . . . 1969.18 Histogram and ECDF for the length of sequences . . . . . . . . . . . . . 1979.19 Average negative training sequence log-likelihood . . . . . . . . . . . . . 2019.20 Mean training time for number of states and maximum span of shortcuts . 2039.21 Computation times for testing . . . . . . . . . . . . . . . . . . . . . . . . 2049.22 Upper bounds for mean testing times . . . . . . . . . . . . . . . . . . . . 2049.23 Precision/recall and F-measure plot for industrial data . . . . . . . . . . . 2069.24 ROC plot for industrial data . . . . . . . . . . . . . . . . . . . . . . . . . 2079.25 Accumulated runtime cost for industrial data . . . . . . . . . . . . . . . . 2089.26 Failure prediction performance for various lead-times . . . . . . . . . . . 2099.27 Effects of data window size ∆td . . . . . . . . . . . . . . . . . . . . . . 2109.28 Data sets for experiments investigating size of the data set. . . . . . . . . 2109.29 F-measure and Training time as function of size of training data set . . . . 2119.30 Data sets for experiments investigating system configuration . . . . . . . 2129.31 Prediction quality as function of train-test gap. . . . . . . . . . . . . . . . 2139.32 Precision / recall plot and ROC plot for single failure group model . . . . 2149.33 Histograms of time-between-errors for DFT . . . . . . . . . . . . . . . . 2159.34 Precision/recall and ROC plot for the SVD-SVM prediction algorithm . . 2179.35 Summary of prediction results for comparative approaches . . . . . . . . 220

10.1 Principle approach of proactive fault management . . . . . . . . . . . . . 22810.2 Improved TTR for prediction-driven repair schemes . . . . . . . . . . . . 23010.3 The original rejuvenation model . . . . . . . . . . . . . . . . . . . . . . 23410.4 Availability model for proactive fault management . . . . . . . . . . . . . 23510.5 Four cases of prediction including lead-time and prediction-period. . . . . 23810.6 Time relations for prediction . . . . . . . . . . . . . . . . . . . . . . . . 24210.7 CTMC model for reliability . . . . . . . . . . . . . . . . . . . . . . . . . 24510.8 Four situations in failure prediction experiments . . . . . . . . . . . . . . 24710.9 Cases with fault injection . . . . . . . . . . . . . . . . . . . . . . . . . . 25010.10 Summary of the procedure to estimate model parameters . . . . . . . . . 25310.11 Overview of the case study . . . . . . . . . . . . . . . . . . . . . . . . . 25410.12 Reliability for the case study . . . . . . . . . . . . . . . . . . . . . . . . 25610.13 Hazard rate for the case study . . . . . . . . . . . . . . . . . . . . . . . . 257

xix

10.14 Reliability for the more sophisticated example . . . . . . . . . . . . . . . 25910.15 Hazard rate for the more sophisticated example. . . . . . . . . . . . . . . 259

11.1 Trade-off between predictive power and complexity . . . . . . . . . . . . 276

12.1 Steps of proactive fault management . . . . . . . . . . . . . . . . . . . . 281

xx

List of Tables

8.1 Contingency table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1538.2 Metrics obtained from contingency table . . . . . . . . . . . . . . . . . . . 154

9.1 Number of different log messages . . . . . . . . . . . . . . . . . . . . . . 1789.2 Experiment settings for detailed analysis. . . . . . . . . . . . . . . . . . . 2059.3 Contingency table for a random predictor. . . . . . . . . . . . . . . . . . . 2189.4 Contingency table for the UBF failure prediction approach. . . . . . . . . . 2199.5 Summary of computation times for comparative approaches . . . . . . . . . 221

10.1 Actions performed after prediction . . . . . . . . . . . . . . . . . . . . . . 22810.2 Parameters used for modeling . . . . . . . . . . . . . . . . . . . . . . . . . 23710.3 Simplified contingency table . . . . . . . . . . . . . . . . . . . . . . . . . 23810.4 Solution to the steady-state equations for availability . . . . . . . . . . . . 24410.5 Mapping of cases to situations . . . . . . . . . . . . . . . . . . . . . . . . 24810.6 Estimation results for the case study . . . . . . . . . . . . . . . . . . . . . 25510.7 Relative amount of the four types of prediction . . . . . . . . . . . . . . . 25710.8 Parameters assumed for the sophisticated example . . . . . . . . . . . . . . 258

xxi

Mathematical Notation

• Vectors are typeset in bold lower-case letters or square brackets such asπ = [π1, . . . , πN ]

• Matrices are typeset in bold capital letters such asB = [bij] or as a vector of vectorssuch asA = [ai]

• Sets are indicated by curly brackets such as E = {x, y, z}

• Random Variables are denoted by capital letters such as X . If a random variable isfixed to some value, the notation X = x is used

• Observation symbols are denoted by lower-case letters oi ∈ O , where O denotesthe alphabet of size M . The alphabet is simply a set of observation symbols: O ={o1, . . . , oM}

• Sequences of observations are denoted by a sequence of random variables O with-out separating commas such as O1O2 . . . OL. For a specific, given sequence ofobservations vector notation o = [Ot] is used

• The notation Oi = ok expresses that the i-th element in an observation sequence isequal to symbol ok

• States are denoted by lower-case letters si ∈ S , where S denotes the set of all Nstates. Similar to observations, random variables denoting states use capital S andsequences of states are defined equivalently to observation sequences

• Observation probabilities in hidden Markov models are either denoted in a matrixformB = [bij] or in a functional form bij = bi(oj)

xxiii

Preface

There are no faults, only lessons.— freely adapted from Dr. Chérie Carter-Scott

Knowing the future has always been ingrained into desires of mankind —and it hasbeen fascinating ever since. Think, for example, of the oracle at Delphi during the clas-sical period of Greece or the priests of the oracle at Siwa. Their supposed ability toforesee the future created an aura and reputation lasting already for more than 2000 years.Stonehenge, as a second example, has probably been an equinox predictor. Today, pre-dictions are used in a manifold of areas. There are methods to forecast wars, the weather,and winds.1 Financial markets, healthcare and insurance heavily use predictions, as well.Turning to physics and engineering, prediction strategies are, for example, applied to pre-dict the path of meteorites, or the future development of a signal in signal processing.Even in computer science, prediction methods are quite frequently used: In micropro-cessors, branch prediction tries to prefetch instructions that are most likely executed andmemory or cache prediction tries to forecast what data might be required next.2 In thisdissertation, prediction techniques are used to forecast the occurrence of system failures.

Today, human lives and organizations are increasingly dependent on the correct func-tioning of computer systems. Train control systems, emergency systems, stock tradingsoftware, and enterprise resource planning systems are only a few examples. A failure inany of these systems may cause huge personal as well as economic damage. However,computer systems have reached a level of complexity that precludes the development of acompletely correct system. Therefore, the occurrence of failures cannot be fully ruled outbut the likelihood of their occurrence should be minimized. This dissertation contributesto an approach called proactive fault management, which tries to deal with faults evenbefore the failure has occurred. These methods can be applied most efficiently, if it isknown whether a failure is imminent in the system or not. This is called online failureprediction and it is the main topic of this thesis.

Turning back to historic oracles, it was the search for structures3 and interrelationsand the ability to identify fundamental influencing factors which was essential in their“modus operandi”. Based on this knowledge they have been able to analyze the presentsituation and to infer future developments. These two principles are also the key to thechallenge of online failure prediction in complex computer systems. Particularly, the

1To interested readers, specific references on war forcasting (Moll & Luebbert [186]), weather forecasting(Pielke [204]), and wind forecasting (Marzbana & Stumpf [178]) can be found.

2Specific references can be found on signal processing (Kalman & Bucy [140]), instruction prefetching(Jiménez & Lin [134]), and cache prediction (Joseph & Grunwald [136])

3For example, Jacob Burckhardt [41] reports in his book about Greek culture that in ancient times priestshoped to forecast the future by examining viscera of sacrificial animals

xxv

approach proposed in this dissertation investigates interrelations between system compo-nents by identifying symptomatic error patterns.

One of the key problems in prediction is that future is in principle not fully predictable.Hence, any prediction needs to handle uncertainty. In case of the historic oracles, theirreplies have intentionally been cryptic and ambiguous, as can be seen from one of thebest-known replies, which is the one given to Croesus: When Croesus asked the oracleat Delphi whether he should go to war with the Persians, the oracle responded: “If Croe-sus attacks the Persians, he will destroy a mighty empire”.4 However, it was Croesus’esmighty empire that was destroyed, not the Persian —nevertheless, the oracle’s reply re-mained to be true. The prediction method proposed here takes on a different approach tohandle uncertainty: it strictly follows a probabilistic approach.

Due to the size and complexity of contemporary computer systems machine learningtechniques have been applied in order to reveal symptomatic patterns from observed fail-ures that have occurred in the past —which is a fundamental difference to the task theancient predictors were confronted with: oracles had to evaluate singular events while infailure prediction, there is a chance to gain experience. Hence the problem that is solvedin this dissertation is incomparably easier than the job of the venerable Greek oracles.

4Herodot [118]

xxvi

Part I

Introduction, Problem Statement, andRelated Work

1

Chapter 1

Introduction, Motivation and MainContributions

A manifold of domains in today’s life and organizations are becoming increasingly de-pendent on the correct functioning of computer systems. Automotive assistant systems,medical imaging devices, banking systems, and production planning and control systemsare only a few examples. Hence dependability, which is about preventing personal aswell as economic damage, is rendered a crucial issue. However, computer systems havereached a level of complexity that precludes the development of a completely correct sys-tem. Being built of commercial-off-the-shelf components with millions of transistors andmillions of lines of code, the occurrence of failures cannot be fully ruled out but the like-lihood of their occurrence should be minimized. Considering availability, another aspectcan be observed: striving for high availability in most cases implies extremely short repairtimes. For example, a five-nines availability1 implies that a system must on average notbe down for more than 5.26 minutes per year. It is almost impossible for a human be-ing to analyze, diagnose and repair a complex system within such a short time interval.2

Hence, systems need to react to failures more or less automatically. But even if reactionis automated, it might in some cases be rather difficult to even restart the system withinfive minutes. One way out of this dilemma is to follow a more proactive approach thatstarts acting even before the failure occurs. This requires some short-term anticipationof upcoming failures based on an evaluation of the current runtime state of the system,followed by some proactive mechanisms that either try to avoid the upcoming failure ortry to minimize its effects (see Figure 1.1). This thesis focuses on online failure predic-tion for centralized complex computer systems, which is the first step towards an efficientproactive fault management.

The need for accurate short-term failure prediction methods for computer systems hasrecently been demonstrated by Liang et al. [165]. The authors mention that checkpoint-ing3 is one of the most efficient ways to improve dependability in large scale computers.However, in parallel computing, the overhead of checkpointing is immense and can evennullify the gain in dependability due to the fact that failures occur irregularly. Failure

1I.e., the ratio of uptime over lifetime equals at least 0.999992Even if a failure occurs only every three years, it seems rather difficult to repair the system within 15.768minutes

3Checkpointing denotes the strategy to regularly save the entire state of a system such that this consistentstate can be restored when a failure has occurred

3

4 1. Introduction, Motivation and Main Contributions

Figure 1.1: Predict-react cycle

prediction methods are needed to differentiate between periods with few failures and pe-riods with many and to adapt checkpointing to these situations. Oliner & Sahoo [197]carry out experiments showing that failure prediction-driven checkpointing4 can boostboth performance and reliability of large-scale systems.

1.1 From Fault Tolerance toProactive Fault Management

Online failure prediction belongs to the research discipline called fault tolerance whichdates back to the pioneers of computing (c.f., e.g, Hamming [113], or von Neumann[192]). The methods developed at that time mainly concerned ways to deal with incredi-bly unreliable hardware components such as relays and vacuum tubes. As complexity ofcomputing systems increased over the years, the main interest in reliable computing alsogradually turned over to a system wide view (Esary & Proschan [92]). Along with thisdevelopment, fault tolerance methods became more dynamic. One well-known exampleis the Self-Testing And Repairing (STAR) computer, developed by Avizienis et al. [15].Various variants of fault tolerance mechanisms employing static and dynamic fault toler-ance techniques (hybrid approaches) have been developed (see, e.g., Siewiorek & Swarz[241] for an introduction). At the same time, software became more and more complexand software fault tolerance techniques such as recovery blocks (Randell [212]) and N-version programming (Avizienis [14], Kelly et al. [143]) have been developed. This wasin part a reaction to the fact that the relative amount of software-related failures becamepredominant (see, e.g., Sullivan & Chillarege [252]). However, fault tolerance techniquesdeveloped until the 1990s were reactive, passive and still static in nature: They weretriggered after a problem had been detected and the type of reactions had to be prespec-ified during system design. In 1995, Huang et al. [126] proposed a new approach thathas become well-known under the term rejuvenation. Rejuvenation is a technique that

4The authors call it cooperative checkpointing

1.2 Origins and Background 5

restarts parts of a system even if no fault has occurred. It has proven to be a successfulconcept to deal with problems of software-aging (Parnas [198]) such as accumulating nu-merical rounding errors, corruption of data, exhaustion of resources, memory leaks, etc.All the while system complexity has not stopped to grow, and traditional fault tolerancemechanisms could not keep pace with the dynamics and flexibility of new computingarchitectures and paradigms. Both industry and academia set off the search for new con-cepts in fault tolerance and other dependability issues like security as can be seen frominitiatives and research efforts on autonomic computing (Horn [123]), trustworthy com-puting (Mundie et al. [188]), adaptive enterprise (Coleman & Thompson [63]), recovery-oriented computing (Brown & Patterson [40]), responsive computing (e.g., Malek [173]),rejuvenation (e.g., Garg et al. [101]) and various conferences on self-*properties (see,e.g., Babaoglu et al. [19]) where the asterisk can be replaced by any of “configuration”,“healing”, “optimization”, or “protection”. Throughout this dissertation, the term proac-tive fault management will be used.

In parallel to computer fault tolerance, research in mechanical engineering developedthe concept of preventive maintenance. Preventive maintenance tries to improve systemreliability by replacement of components (c.f., e.g., Gertsbakh [105] for an overview).Several replacement strategies exist ranging from simple lifetime distribution models tomore complex models including prediction-based preventive maintenance incorporatingmonitoring data (c.f., e.g., Williams et al. [278]). However, due to the fact that the actionstriggered for mechanical machines differ significantly from those for computing systemsand since the observation-based methods seem not to be able to account for the complexityof contemporary large computer systems, the two research communities have not merged(except for some rare approaches such as Albin & Chao [4]).

1.2 Origins and BackgroundInitial point for the work described in this dissertation was the challenge to develop failureprediction algorithms based on data collected from an industrial telecommunication sys-tem. At the Computer Architecture and Communication group at Humboldt UniversityBerlin, three different approaches have been proposed: Steffen Tschirpke has introducedan adaptive fault dictionary, Günther Hoffmann has developed a method based on datafrom continuous system monitoring (Hoffmann [120]) and this thesis focuses on a predic-tion method based on error event patterns. However, the prediction method described inthe following chapters is not the first attempt to master the challenge. Previously, a ratherstraightforward solution has been developed that builds on a semi-Markov process andclustering of similar error events. This method has been named Similar Events Prediction(see Salfner et al. [226] for details). However, it has two major drawbacks:

1. Computing overhead for predictions longer than three minutes in advance resultedin unacceptable computation times due to exponentially growing complexity of thealgorithms.

2. Although results seemed promising, prediction quality dropped to a low level if testdata differed only slightly (e.g., caused by a different configuration of the systemunder investigation) from the data that had been used to build the model. The ex-planation for this behavior is called overfitting, which means that the model is too


specifically tailored to the data analyzed: If an observed pattern under investiga-tion varied only slightly from the patterns observed in the training data, it was notrecognized anymore and hence no failure was predicted.

Having learned the lessons, the task of failure prediction for the commercial telecom-munication system has been analyzed from scratch in a structured, traditional engineeringfashion (see Figure 1.2): First, key properties of the system have been identified and byabstraction, a precise problem statement has been formulated. Then, a methodology hasbeen developed that is specifically targeted to the key properties of the problem. Havingdeveloped a methodology, it has been implemented and tested with the industrial data ofthe telecommunication system in order to assess how well the solution solves the problem.In the last phase of the engineering cycle, the solution is usually applied to improve thesystem. However, failure prediction per se does not improve system dependability unlesscoupled with proactive actions, which is beyond the scope of this dissertation. Therefore,only a theoretical assessment of the effects on dependability has been performed.

Figure 1.2: The engineering cycle.

1.3 Outline of the ThesisFollowing the engineering approach depicted in Figure 1.2, this thesis is divided into fourparts:

• Part I The first step —abstraction and identification of key properties— is de-scribed in Chapter 2: a problem statement is given and the principle approachtaken in this dissertation is motivated, introduced, and discussed. Before devel-oping a new solution, any engineer should review and investigate existing ones. InChapter 3, a survey of failure prediction methods is provided. This includes a tax-onomy in order to categorize existing methods and to classify the approach takenin this thesis. Furthermore, some approaches are described in more detail sincethese methods are used for comparison in the experiments carried out in Part III.Due to the fact that the prediction method presented here builds on hidden Markov

1.4 Main Contributions 7

models (HMMs), related work on HMMs and their extension to continuous timeare described in Chapter 4.

• Part II The second step of the engineering cycle, which is concerned with the de-velopment of a methodology, is covered by Chapters 5 to 7. In Chapter 5, some con-cepts of data preprocessing are described including issues related to error logfiles,a clustering method to identify failure mechanisms and an approach to tackle theproblem of noisy data. In Chapter 6, the hidden semi-Markov model used for fail-ure prediction is presented. As for failure prediction the output of hidden Markovmodels are probabilistic likelihoods, subsequent classification is necessary in orderto decide whether the current runtime state is failure-prone or not. Classification isdiscussed in Chapter 7.

• Part III The third step of the engineering cycle involves experiments in order toverify that the assumptions made during modeling match the original problem andto investigate how well the developed methodology performs. Prediction perfor-mance is gauged by several measures, which are introduced in Chapter 8. Then themodel is applied to industrial data of the commercial telecommunication system inChapter 9. This includes a detailed analysis of the data, data preprocessing, predic-tion performance and a comparative analysis with the most well-known predictionapproaches in that area.

• Part IV In order to close the engineering cycle, dependability improvement capa-bilities are assessed in Chapter 10, in which a model is developed in order to the-oretically assess the effect of failure prediction-driven fault tolerance mechanisms(proactive fault management) on availability, reliability and hazard rate. The chap-ter also includes results of a case study where such mechanisms have been appliedto a demo web-shop application.

Main results are summarized and an outlook to future research topics is provided inChapters 11 and 12.

The main contributions of each chapter are presented in chapter summaries.

1.4 Main ContributionsThe overall contribution of this dissertation is the development of a novel approach to er-ror event-based failure prediction. Experiments on industrial data of an industrial telecom-munication system have shown superior prediction performance in comparison with themost well-known prediction algorithms in that area. In addition to that several advance-ments to the state-of-the-art are presented:

• A novel extension of Hidden Markov Models to incorporate continuous time. Incontrast to previous extensions that have been developed mainly in the area ofspeech recognition, the model developed in this thesis is specifically tailored toevent-driven temporal sequences.

• To our knowledge the first taxonomy and survey on computer failure predictionapproaches including indication of promising areas for further research. The tax-onomy is based on the fundamental relationship among faults, errors, and failures.


Symptoms, which reflect side-effects of faults, have been added to this basic con-cept.

• To our knowledge the first model to assess dependability of prediction-driven faulttolerance techniques (proactive fault management). The model incorporates correctand false predictions, downtime avoidance as well as downtime minimization tech-niques and cases where failures are induced by the fault management techniquesthemselves.

• A novel methodology to group failure sequences. Although only used for data pre-processing, the approach may as well contribute to diagnosis.

• To our knowledge the first measure to quantify quality the of logfiles: logfile en-tropy combines Shannon’s information entropy with specific requirements for com-prehensive logfiles.

All in all this comprehensive approach to online failure prediction proposed in this the-sis, if combined with preventive actions, has a potential of increasing computer systemsavailability by an order of magnitude.

Chapter 2

Problem Statement, Key Properties,and Approach to Solution

The first step in any scientific as well as any engineering project should be a proper state-ment of the problem to be solved. The challenge that had to be solved in the course ofthis work is online failure prediction, which is defined in Section 2.1. The motivating casestudy that lead to the selection of this topic is an industrial telecommunication system ofwhich we had given the chance to collect data. In Section 2.2, the prediction objectiveis clearly specified for the concrete scenario of the telecommunication system. The casestudy is introduced at this early point of the thesis in order to identify key properties ofsystems for which the failure prediction method proposed in this thesis is designed. Thekey properties are discussed in Section 2.3. From these key properties, the principle ap-proach to the solution is presented in Section 2.4 and its general properties are analyzedin Section 2.5.

2.1 A Definition of Online Failure PredictionThe aim of online failure prediction is to predict the occurrence of failures during runtimebased on the current system state. For a more precise definition, the terms “failure” and“online prediction” are defined separately.

2.1.1 FailuresFailures are commonly defined as follows (Avizienis & Laprie [16]):

A system failure occurs when the delivered service deviates from the spec-ified service, where the service specification is an agreed description of theexpected service.

Similar definitions can be found, e.g., in Melliar-Smith & Randell [180], Laprie & Ka-noun [155], Avižienis et al. [17]. The main point here is that a failure refers to misbehaviorthat can be observed by the user, which can either be a human or a computer componentusing another component. Things may go wrong inside the system, but as long as it does

9

10 2. Problem Statement, Key Properties, and Approach to Solution

Figure 2.1: Definitions and interrelations of faults, errors and failures

not result in corrupted output,1 there is no failure. More specifically, a failure is an event:It is the point in time when a system terminates to fulfill its intended function [64].

Faults are the root cause of failures and are defined to be a defective (incorrect) state[64]. In most cases faults remain undetected for some time. Once a fault has become visi-ble it is called an error. That is why errors are called “manifestation” of faults. Figure 2.1,which is a modified version of a figure by Siewiorek & Swarz [241], visualizes the rela-tionships. The key aspect to note here is that faults are unobserved defective states. Fourstages exist at which faults can become visible (see Figure 2.2):

1. The system can be audited in order to actively search for faults, e.g., by testing onchecksums of data structures, etc.

2. System parameters such as memory usage, number of processes, workload, etc.,can be monitored in order to identify side-effects of the faults. These side-effectsare called symptoms. For example, the side-effect of a memory leak (the fault) isthat the amount of free memory decreases over time.

3. If a fault is activated and detected (observed), it turns into an error.

4. If the fault is not detected by fault detection mechanisms, it might directly turn intoa failure which can be observed from outside the system or component.

A good example for this are faults on disk drives: Consider the fault of a defective disksector. Until no read/write operations have been performed trying to access the sector,the fault remains unobserved. Auditing would make it visible by, e.g., reading the entiredisk (not for data but for testing purposes). Symptoms of a (not yet completely faileddisk) could be observed by monitoring, e.g., wobbling of the disk. Once the sector iscompletely damaged and data shall be read from it, an error is detected. In a single diskenvironment, this is usually equivalent to the occurrence of a failure. However, if thedefective disk is, e.g., part of a redundant array of independent disks (RAID), the desiredservice of data delivery can still be fulfilled and hence no failure occurs.

1including the case that there is no output at all

2.1 A Definition of Online Failure Prediction 11

Figure 2.2: Faults can become visible at four stages: by auditing, by monitoring of systemparameters such as workload, memory usage, etc., to capture symptoms of faults,by detecting manifestation of faults (errors), or by a failure that can be observedfrom outside the system or component

Figure 2.3: Distinction between root cause analysis and failure prediction

Another key aspect for a precise definition of failure prediction methods is that usuallythere is no one-to-one mapping among faults and errors: Several faults may result in onesingle error or one fault may result in several errors. The same holds for errors andfailures: Some errors result in a failure some errors do not, and even more complicatedare cases where some errors only result in a failure under special conditions, and somefaults may cause failures directly. Moreover, some faults remain inactive for the entiresystem lifetime. For this reason, two distinct research directions have evolved: root causeanalysis and failure prediction. Having observed some misbehavior by one of the meansshown in Figure 2.2, root cause analysis tries to identify the fault that caused an erroror failure, while failure prediction tries to assess the risk that the misbehavior will resultin future failure (see Figure 2.3). For example, if it is observed that a database is notavailable, root cause analysis tries to identify what the reason for unavailability is: abroken network connection, or a changed configuration, etc. Failure prediction on theother hand tries to assess whether this situation bears the risk that the system cannotdeliver its expected results, which depends on the system and the current situation: isthere a backup database or some other fault tolerance mechanism available? What is thecurrent load of the system?

2.1.2 Online PredictionThe term “failure prediction” is widely used, e.g., for reliability prediction where the goalis to assess future reliability of a system from its design or specification (see, e.g., Musaet al. [189], Bowles [35], Denson [77], Blischke & Murthy [32]). However in contrast,


the topic of online failure prediction is to identify during runtime whether afailure will occur in the near future based on an assessment of the monitoredcurrent system state.

Although architectural properties such as interdependencies play a crucial role in someonline failure prediction methods, online failure prediction is concerned with a short-termassessment that allows to decide, whether there will be a failure, e.g., five minutes aheador not. Reliability prediction, however, is concerned with long-term predictions based oninput data such as architectural properties or the number of bugs that have been fixed.More precisely, for the case of online failure prediction, four different times need to bedefined (see Figure 2.4):

• Lead-time ∆tl defines how far from present time failures are predicted in the future.

• Minimal warning-time ∆tw defines the minimum lead-time such that failure pre-diction is of any use. If lead-time were shorter than the warning time, there wouldnot be enough time to perform any preparatory or preventive actions.

• Prediction-period ∆tp is the time for which a prediction holds. Increasing ∆tpincreases the probability that a failure is predicted correctly.2 On the other hand, if∆tp is too large, the prediction is of little use since it is not clear when exactly thefailure will occur.

• Data window size ∆td defines the amount of data that is taken into account forfailure prediction. Even if online failure prediction algorithms take the currentsystem state into account, many algorithms additionally investigate what happenedshortly before present time. However, in some approaches the amount of data is notdetermined by a time window but other measures such as, e.g., a fixed number oferror events. In this case ∆td is also defined, but may vary with each prediction.

Figure 2.4: Time relations in online failure prediction. Present time is denoted by t. Failuresare predicted with lead-time ∆tl, which must be greater than minimal warning-time ∆tw. A prediction is assumed to be valid for some time period, namedprediction-period, ∆tp. In order to perform the prediction, some data up to a timehorizon of ∆td are used. ∆td is called data window size.

2.2 The Objective of the Case StudyData of an industrial telecommunication system serves as a gauge of the extent to whichthe online failure prediction algorithm is able to predict the occurrence of failures. Al-

2For ∆tp →∞, simply predicting that a failure will occur would always be 100% correct!

2.2 The Objective of the Case Study 13

though it is a case study, it demonstrates the type of systems and environments in whichthe developed online failure prediction method is intended to be applied, and that is whythe concrete objective of the case study is described at this early point of the thesis. Insubsequent sections, the case study serves to identify key properties that are typical forthe problem domain.

The main purpose of the telecommunication system under investigation is to realizea so-called Service Control Point (SCP) in an Intelligent Network (IN) [171]. An SCPprovides services3 to handle communication related management data such as billing,number translations or prepaid functionality for various services of mobile communica-tion: Mobile Originated Calls (MOC), Short Message Service (SMS), or General PacketRadio Service (GPRS). The fact that the system is an SCP implies that the system co-operates closely with other telecommunication systems in the Global System for MobileCommunication (GSM). Note that the system does not switch calls itself. It rather hasto respond to a large variety of different service requests regarding accounts, billing, etc.submitted to the system over various protocols such as Remote Authentication Dial InUser Interface (RADIUS), Signaling System Number 7 (SS7), or Internet Protocol (IP).

The system’s architecture is very complex and cannot be reproduced here for confiden-tiality reasons. However, two key facts are that it has a multi-tier architecture employinga component based software design. At the time when data were collected, the systemconsisted of more than 1.6 million lines of code, approximately 200 components realizedby more than 2000 classes, running simultaneously in several containers, each replicatedfor fault tolerance.

Typically, one of the most complicated parts in reliability-related projects is the cleardefinition of what a failure is. As defined before, failures are the event when a systemceases to fulfill its specification. Specification for the telecommunication system requiresthat within successive, non-overlapping five minutes intervals, the fraction of calls havingresponse time longer than 250 milliseconds must not exceed 0.01%, as shown in Fig-ure 2.5.

Figure 2.5: If within a five minutes interval, the fraction of calls having response time > 250msexceeds 0.01%, a failure has occurred

This definition is equivalent to a required four-nines interval service availability:

Ai = no. of service requests within 5 min having response time ≤ 250mstotal no. of service requests within 5 min

!≥ 0.9999 . (2.1)

3so-called Service Control Functions (SCF)


Various classifications of failures have been published, one of which is Cristian et al.[69], extended by Laranjeira et al. [156], who classify failures by the following cate-gories:

• crash failure the service stops operating and does not resume operationuntil repair

• omission failure the service does not respond to a request

• performance failure the service responds too late (given a threshold)

• timing failure the service too early or too late (given two thresholds)

• computation failure the service’s response shows wrong results

• arbitrary failure the service’s response shows an arbitrary failure

where each failure class is included in the following classes. According to this definition,the objective of this thesis is to predict performance failures of the telecommunicationsystem. Using the terminology of Laprie & Kanoun [155], these failures are consistenttiming failures. However, it is not possible (for us) to assess the consequences on theenvironment since these are top-level failures and no information is available on howother parts of a telecommunication network rely on the service of the system analyzedhere.

Figure 2.6: Data acquisition setup. Error logs have been collected from the telecommuni-cation system while a failure log has been obtained from an entity that trackedresponse times of calls.

Field data was collected including various workloads. Request response times havebeen measured and all failed requests (i.e., having response times of more than 250 mil-liseconds) have been written into a failure log. The second source of data are error logs,which have been collected from the telecommunication system (see Figure 2.6). Bothfailure and error logs have been collected for 200 days containing a total of 1560 failures.

2.3 Key PropertiesBy analyzing the telecommunication case study, key properties have been identified yield-ing the assumptions on which the failure prediction approach developed in this thesis isbased. In particular, the key properties are:

1. Only very little knowledge about the system internals is available. For the reasonthat we did not have full access to the system internals a thorough analysis of the

2.3 Key Properties 15

system’s structure has not been possible. Moreover, such an analysis seems infea-sible due to the sheer size of the system.

2. A lot of data is available. The error logs of the 200 days of testing contained anoverall amount of 26,991,314 log records, which corresponds to an average loggingactivity of 43 log records per minute on one node and 51 log records per minute onthe second node, respectively. Investigations have shown that only a small fractionof error records gives a notice of upcoming failures.

3. Failures occur rarely. This leads to an imbalance of failure and non-failure data.

4. The telecommunication system is built of software components. Software compo-nents are more or less isolated subsystems that are executed in so-called containers,providing additional functionality such as data persistency, replication, logging, etc.The system serves requests by invoking one or more components, which in turn in-voke other components to fulfill the job. This leads to interdependencies within thesoftware. Usually, interdependencies set up a forest in terms of graph theory, butcycles cannot be excluded in general.

5. Fine-grained fault detection and error reporting is built into the system. For exam-ple, each component is continuously observing its state and is checking the inputreceived from other components. Additionally, there might be several steps of es-calation that can assign different levels of severity to error events.

6. The system is running multiple tasks and processes in parallel. For this reason,several concurrent tasks can send messages to the error logging back-end. Such be-havior can be interpreted as noise in the error logs. A second effect of this propertyis that the order of events can be interchanged if several events occur more or lessconcurrently.

7. Error logs have at least two dimensions: a timestamp and a type specifying whathas happened.4 It is assumed that both dimensions contribute information that canbe exploited for failure prediction.

8. Due to the property of being event-triggered and showing values of a finite count-able set, error logs form a temporal sequence.

9. The telecommunication system can serve requests for several protocols such asGPRS, SMS, MOC, etc. Data of two groups of protocols have been recorded sep-arately. Furthermore, interval service availability requirements must be fulfilledseparately for both groups. In general, it must be assumed that contemporary sys-tems can show failures of various types and different failure definitions may existfor each of them.

10. In a system of such complexity, it must be assumed that several failure mecha-nisms exist for each failure type. A failure mechanism denotes the relation of faultsand system states to a failure with focus on the process how the faults lead to thefailure. This is closely related to the term failure modes as defined by Laprie &

4In many cases such as the telecommunication system investigated here, the type is only implicitly specifiedby an error message in natural language. The task of message type assignment is addressed in Chapter 5


Kanoun [155], but the term failure mechanism is used here in order to emphasizethe temporal aspect.

11. The telecommunication system is highly configurable: more than 2000 parame-ters can be adjusted. Configurability also adds to system complexity, e.g., byparametrization of interrelations within the system.

12. Systems are subject to updates which can alter system behavior significantly. Hence,the process to adapt failure predictors to new system specifics should require as lit-tle effort as possible. At least from that perspective, algorithmic solutions seempreferable in comparison to human analysis.

13. The system is non-distributed. Although the data is collected from two machines in-terconnected by a dedicated high-speed local network and running on synchronizedclocks, the data is merged into one single error log and no computing node-specificaspects are used throughout this thesis.

2.4 ApproachDue to the property that only limited analytical knowledge but a large amount of data isavailable, a machine learning approach has been chosen. It infers symptoms of upcomingfailures from measurements (training data) rather than from an analytical analysis of thesystem. Machine learning, as applied here, consists of two steps (see Figure 2.7): first

Figure 2.7: Machine learning approach: First a model is built from training data (a). Aftertraining, the model is used to predict the occurrence of failures during runtime (b)

a model is built from recorded data using some training algorithm, which means thatmodel parameters are adjusted such that some objective function is optimized. Specifi-cally, training data consists of error-log files and failure logs, which are used to identifywhether a failure occurred or not. Having trained a model, the model is used to predictfailures online during runtime. However, as we do not have access to the running system,this thesis must do without real-time testing. Rather, the data set is divided into a trainingand a test dataset such that prediction quality must be estimated from samples that werenot available in training.

2.4 Approach 17

The key notion of the approach is that dependencies in the component-based systemlead to error patterns, as shown in Figure 2.8. Assume that component “C3” is faulty.

Figure 2.8: Dependencies among components lead to a temporal sequence of errors

Once the fault is detected, an error message “C” is generated and written to the error log.Some time later, component “C1” needs some functionality of “C3”, but due to the factthat “C3” is faulty, “C1” also has a problem and reports an error of type “A”. Due tocomponent internal mechanisms / dependencies (see, e.g., Hansen & Siewiorek [114]),the component writes a second error message “B”. After some time, the same happens to“C2”: when functionality of “C2” is requested but cannot be delivered, an error messageof type “D” is generated. As can be seen from the bottom time line of Figure 2.8, thisbehavior leads to an event-triggered temporal sequence of error events.

The telecommunication system under study is a fault-tolerant system. Hence the chainof dependencies as shown in the figure is not necessarily traversed for a single request.For example, if component “C3” has problems connecting to the database, which resultsin error message “C”, this problem may be handled by another component5 or it maylead to a single failed call request. But a single failed request does not make a failure,yet. However, if component “C3” is faulty for a while, there are some conditions underwhich other components start to have problems, too, which is component “C1” in thefigure. This may still be fine, but in some situations, even “C2” gets a problem, whichfinally leads to a failure since too many components are having problems and hence toomany call requests fail. These effects give rise to the central idea of the failure predictionapproach investigated in this thesis:

⇒ Dependencies in the system lead to error patterns, as shown in Figure 2.8

⇒ There are error patterns that lead to failures, others do not, depending on conditionswhich are not observable from outside

⇒ Apply pattern recognition techniques to identify those patterns that have lead tofailures.

⇒ Analyze error patterns that have been previously recorded to train the pattern rec-ognizer using machine learning techniques.

5which is a component failover


Hidden Markov Models (HMMs) have been shown to be successful pattern recog-nition tools in a large variety of recognition tasks ranging from speech recognition tointrusion detection in computer systems. This being the first reason for the choice to useHMMs for failure prediction, there is a second rationale referring to the very basic dis-tinction between faults, errors and failures: Faults are by definition unobserved. Oncethey manifest, they turn into errors, which are observable. This insight can be transferredanalogously to HMMs: the states of an HMM are hidden, i.e., unobservable, generat-ing observation symbols. Hence, a close match exists between “hidden units”, faults andthe states of HMMs, and between their manifestations, which are errors and observationsymbols, respectively. As the occurrence of failures represents some final state (at leastin non-repairable systems) failures are represented by an absorbing final state producinga dedicated failure symbol.

However, standard hidden Markov models are not well-suited to represent event-triggered temporal sequences (as is discussed in Section 4.2). For this reason, an extensionof HMMs has been developed that permits to model time behavior of error sequences byuse of a continuous-time semi Markov process.

The training procedure. The goal of training is to adjust HMM parameters to errorpatterns that are indicative of upcoming failures. To account for the imbalance of failureversus non-failure data (class skewness), HMMs are trained with failure-prone sequencesonly. Since it is assumed that several failure mechanisms exist in the system and henceare present in the data, a separate HMM is trained for each. The term failure mecha-nism denotes the principle process how specific faults, states and circumstances lead toa specific failure. In order to separate failure sequences in the training data to group se-quences of the same failure mechanism, clustering of failure sequences is accomplished(see Section 5.2). In order to distinguish failure-prone from non-failure sequences inthe prediction phase, a separate model targeted to non-failure sequences is needed. It istrained from a selection of non failure-prone sequences in the training data. Althoughgrouping of non-failure sequences would in principle be possible, it is not applied sincethe non-failure sequence model only serves as a reference for classification. Furthermore,sequence clustering would not be applicable due to the large number of non-failure se-quences in the data set.

Due to the fact that logfiles are noisy and that sometimes there is too much logginggoing on for a prediction method to be successful, data needs to be preprocessed. Datapreprocessing involves filtering mechanisms and statistical testing. An overview of thetraining procedure is provided by Figure 2.9.

Online prediction. Given an error event sequence observed at runtime, online failureprediction is performed by computing the similarity of the observed sequence to the se-quences of the training data. This is done by computing sequence likelihood for eachmodel including the model targeted to non-failure sequences. Sequence likelihood canbe interpreted as a probabilistic measure of similarity between the given sequence andsequence characteristics as represented by the hidden Markov model. In order to cometo a decision whether the current situation is failure-prone or not, multi-class classifica-

6The letter u is used here since letters i to n, which are commonly used to indicate integer numbers, occurfrequently in later chapters and have fixed connotations in this thesis.

2.5 Analysis of the Approach 19

Figure 2.9: An overview of the training procedure. Model 0 is trained with non-failure se-quences. Failure sequences are grouped by means of clustering. A separatemodel is then trained for each of the u groups.6

tion, based on Bayes decision theory, is performed. As was the case for training, datapreprocessing including failure group specific filtering has to be applied prior to sequencelikelihood computation. An overview of the procedure for online failure prediction isdepicted in Figure 2.10.

2.5 Analysis of the ApproachIn order to show principle properties and limitations of the approach, various aspects arediscussed in the following sections. The intention is to position the approach with respectto existing failure and fault models, and to relate it to other research areas.

2.5.1 Identifiable Types of FailuresA classification of failures has already been given in Section 2.2 from which it has beenconcluded that the objective for the telecommunication system is to predict performancefailures. However, the prediction algorithm can be applied to other systems as well. Sincethe algorithm is data-driven, it is clear that it can only learn to predict failures whoseunderlying failure mechanism is similar to the mechanisms contained in the training data.Furthermore, the machine learning approach focuses on general principles in the datawhich means that very rare special cases are more or less ignored. The conclusion fromthis discussion is that the proposed prediction approach can only predict failures that occurmore or less frequently —it is not appropriate for predicting really rare failure events. This


Figure 2.10: An overview of the online failure prediction approach. In order to investigate anobserved error sequence, sequence likelihood is computed for each of the mod-els including the model targeted to non-failure sequences (Model 0). Sequencelikelihood is a probabilistic measure for similarity of the observed error sequenceto sequences of the training data. Failure prediction is then performed by subse-quent classification whether the current situation is failure-prone or not. In orderto prepare the sequence for this process, data preprocessing including failuregroup specific filtering has to be applied.

may seem insufficient from a researcher’s viewpoint, but it is useful from an engineer’sperspective. For example, in [60], Chillarege et al. show that the distribution of failuresresembles a Pareto distribution, from which follows that a few failures contribute to themajority of outages. Levy & Chillarege [162] state that from an economic viewpointit is most efficient to first address those failures that occur most frequently in order toachieve the largest impact on overall system availability. Furthermore, Lee & Iyer [159]report in a study about the Tandem GUARDIAN system that over two-thirds of reportedsoftware failures are recurrences of previously reported faults. The authors concluded that“in addition to reducing the number of software faults, software dependability in Tandemsystems can be enhanced by reducing the recurrence rate”.

2.5.2 Identifiable Types of FaultsResearch on dependable computing has put much effort on analyzing and categorizing thethings that can go wrong in computer systems. Classification of different types of faultsare called fault models, which can be helpful, e.g., to determine the potentials and limitsof a fault tolerance technique.


Design–Runtime fault model. A fundamental distinction of faults addresses the devel-opment phase from which the fault originates. Design faults originate from bad systemdesign, e.g., use of an algorithm that does not converge in some situations and hencemight cause an “infinite loop.” Opposed to this are runtime faults that occur during theproduction phase of a system.

Permanent–Intermittent–Transient fault model. Another well-known classificationfocuses on the duration of faults, as shown in Figure 2.11.

Figure 2.11: Permanent, intermittent and transient faults (Siewiorek & Swarz [241]).

The figure introduces three types of faults:

• permanent faults. which are defects that stay active until the fault is removed byrepair. A typical example is a damaged sector on a hard disk.

• intermittent faults. which are temporary defects that result from system internalflaws.

• transient faults. which are temporary defects that trace back to environmental ex-ceptions such as a hit by an alpha particle, etc.

As might have become visible, this categorization is focused on hardware issues. Al-though the concept can in principle be transferred to software, there are some difficulties.For example, due to the fact that a software fault (a bug) can only be removed by repair,software faults should be classified as permanent. However, some studies have shown thattheir occurrence resembles transient faults (see, e.g., Gray [107]) due to the fact that theiractivation patterns are dependent on many conditions in the system.

Bohr–Mandel–Heisen–Schrödingbugs fault model. This fault model is tailored tosoftware faults and explores an analogy between software bugs and well-known physi-cists and mathematicians. It focuses on the bugs’ type in terms of observability / tangibil-ity. Gray & Reuter [109] classify software bugs into “Bohrbugs” and “Heisenbugs”. Thisconcept has been extended, as, for example, in Candea [42]:


• Bohrbugs. According to the rather simple and deterministic atom model of NielsBohr,7 Bohrbugs are deterministic bugs that can be reproduced most easily. MostBohrbugs are identified by testing and eliminated in a thorough software engineer-ing process.

• Mandelbugs. According to the mathematician Benoît B. Mandelbrot, who is oneof the founders of chaos theory, Mandelbugs are bugs that appear chaotic due tomanifold and complex dependencies.

• Heisenbugs. According to Werner Heisenberg’s uncertainty principle, Heisenbugsdisappear or change behavior when being investigated. For example race conditionscan disappear when a program is run in a debugger since the debugger changes thetiming behavior of the program.

• Schrödingbugs. According to Schrödinger’s cat thought-experiment in quantumphysics, Schrödingbugs do not manifest until, e.g., someone reading source codenotices it and the program stops working for everybody until fixed. An example forsuch a bug might be a security breach that is exploited rapidly after being identifiedso that the program becomes unusable until the bug is fixed.

Fail-stop–to–Byzantine fault model. It characterizes faults with respect to their “haz-ardousness” or “behavior”. The fault model presented here is taken from Barborak et al.[23], which is an extended version of Laranjeira et al. [156], who themselves extended amodel introduced by Cristian et al. [68] (see Figure 2.12). One of the beautiful propertiesof the model is that inner fault classes are proper subsets of outer fault classes. The fartheroutside a fault resides in the picture, the more difficult it is to detect and hence the morecomplex are the resulting failure scenarios.

Figure 2.12: Fault model based on Barborak et al. [23]

The types of faults can be described as follows:

7terming the model “simple” is not intended to belittle the merits of Niels Bohr —remember that he pro-posed this model already in 1904!


• Fail stop A faulty processing entity ceases operation and sig-nals this to other processors.

• Crash fault The processor simply halts (crashes).

• Omission fault The processor omits to react to some tasks

• Timing fault The processor reacts to tasks, but too early or toolate.

• Incorrect computation fault The processor responds to all requests in time, butthe result is corrupted.

• Authenticated Byzantine fault An arbitrary or even malicious fault that cannotcorrupt authenticated messages (sender or receivercan detect corruption)

• Byzantine fault Every fault / malicious action possible.

Software–Hardware–Human fault model. While the presented classifications of faultclasses reflects mainly design and operational faults, there is also a number of faults thatcan be attributed to human operators. One way to incorporate operator faults is to clas-sify according to their origin: hardware, software, or human. Several variants of thisdistinction exist that basically refer to the same concepts. For example, Scott [232] usesthe terms “technology and disasters”, “application failure” and “operator error”, and in theSHIP model (Malek [174]), the concept is extended by incorporation of “interoperability”faults.

Discussion of fault models. Unfortunately, none of the presented fault models providesa tight boundary that allows to completely describe all faults leading to failures that canbe predicted by the presented approach. Nonetheless, each fault model provides a frame-work to discuss potentials and limits of the failure prediction approach presented in thisdissertation.

1. Design–Runtime fault model. Design faults are the target of fault intolerance tech-niques (Avižienis [13]) which attempt to eliminate flaws by elaborate engineeringsuch as formal specification, design reviews, and thorough testing. If —despiteof all efforts to build a flaw-free system— something goes wrong, runtime faultsare addressed by fault tolerance techniques which try to handle the situation suchthat no catastrophic failure occurs. Online failure prediction is a fault tolerancetechnique and is hence targeted at runtime faults. However, the boundary betweendesign and runtime faults is sometimes blurred. If, for example, a design fault al-ways results in similar misbehavior that is clearly identifiable by patterns of errorevents, the proposed failure prediction method can anticipate runtime faults as well.

2. Permanent–Intermittent–Transient fault model. The failure prediction approach ofthis thesis identifies faults that trigger failure mechanisms known from training data.This is most likely the case for permanent faults. Albeit the fact that this fault modelis of limited use for software faults, also failures caused by transient or intermittentfaults can be predicted, if triggering has been observed often enough in the train-ing data. This seems rather unlikely for faults such as the hit by an alpha particle.


However, as the failure prediction approach is targeted at identifying failure trig-gering conditions, it fits the transient behavior of software faults as observed bycondition-based activation patterns.

3. Bohr–Mandel–Heisen–Schrödingbugs fault model. Online failure prediction willmost likely be performed on fault-tolerant systems that have undergone thoroughcode revision, testing, etc. For this reason, it can be assumed that most Bohrbugshave been eliminated. Schrödingbugs are a construct that is very unlikely to occur,but as all programs stop until fixing of the bug, there is no need for online failureprediction. Mandelbugs and Heisenbugs are the typical bugs for which failure pre-diction is relevant. Both are triggered under complex conditions and the differencebetween both is more related to root cause analysis rather than failure prediction.

4. Fail-stop–to–Byzantine fault model. Since this fault model has the property thatmore “friendly” fault classes are proper subsets of more general fault classes, it issufficient to determine an upper bound. Due to the fact that Byzantine faults canbehave arbitrarily they can trigger failure mechanisms that have not been present inthe training data and can hence not be predicted. The same holds for authenticatedByzantine faults. Incorrect computation faults can be predicted, as long as they leadto errors that are detected within components. Nevertheless, it should be pointedout that there is no 100% coverage, even not for fail-stop faults.8

5. Software–Hardware–Human fault model. The failure prediction approach operateson errors that have been logged by some software. From this follows that hardwarefaults can only be detected if they result in an error at the software level. If, forexample, it is never detected until system failure that some hard disk controllerdelivers corrupted data, this failure cannot be predicted. However, several studies oncauses of failures such as Gray [107], Gray [108], and Scott [232] have documenteda trend towards software caused failures. The most astonishing study is Lee & Iyer[159] who have investigated the Tandem GUARDIAN system and have identifiedthat 89.5% of reported failures have been identified to be caused by software.

2.5.3 Relation to Other Research Areas and IssuesIn the following, relations to other research areas are briefly discussed. A comprehensiveclassification of the proposed failure prediction algorithm with respect to other predictionapproaches is given in Chapter 3.

Fault diagnosis. According to Marciniak & Korbicz [176], there are three differentapproaches to pattern recognition for fault diagnosis:

• Minimal distance methods. Classification is achieved by assigning data under in-vestigation to the nearest class as determined by a distance metric in feature space.In failure prediction, error sequences would have to be analyzed in order to extractfeatures like frequency of error occurrence, etc.

8Although fail-stop faults will very unlikely evolve into a system failure due to the fault-tolerant design ofthe system.


• Statistical methods. The goal is to estimate the probability of a class given the datapoint under investigation: P (c|x). In failure prediction, classes refer to failure-prone or not-failure-prone and x refers to an error sequence.

• Approximation approach. The class membership function F (x) is approximated bya function. In the case of failure prediction, F (x) would determine whether errorsequence x belongs to the class of failure-prone sequences.

With respect to this classification, the approach of this thesis is a statistical method sincethe outcome of the HMM forward algorithm is sequence likelihood P (x|c), which isturned into P (c|x) by the subsequent Bayesian classification step.

Temporal sequence processing. It has been stated that the approach is related to tem-poral sequence processing. According to Sun [253], temporal sequence processing istypically accomplished if one of four problems is addressed:

1. Sequence generation. Having specified a model, generate samples of time series.

2. Sequence recognition. Does some given sequence belong to the typical behavior ofthe underlying stochastic process or not? More precisely: What is the probabilityof it?

3. Sequence prediction. Given the beginning of a sequence, assess the probability ofthe next observation (or state) of the time series.

4. Sequential decision making. Select a sequence of actions in order to achieve somegoal or to optimize some cost function.

Failure prediction, as introduced here, clearly refers to sequence recognition. However,Section 12.1.1 in the outlook sketches a variant of failure prediction that makes use ofsequence prediction. Since the majority of models for temporal sequence processing dealwith series whose values occur equidistantly (see, e.g., Box et al. [36] for an overview),it seems infeasible to compare the HMM approach to other temporal sequence modelingtechniques.

Machine learning. The solution presented here clearly belongs to the group of super-vised learning algorithms. Supervised learning refers to the property that training data islabeled with a target value. In terms of failure prediction, this means that for every errorevent sequence in the training data set it is known whether it is a failure or non-failuresequence. Furthermore, the presented approach employs batch learning,9 which denotesthat the approach consists of two phases: a training phase and an application phase (seeFigure 2.7). Such approach is valid as long as dynamics of the system more or less staythe same. Due to configurability of the system and updates, this assumption only holdspartly, as is investigated in Section 9.7.2. A solution to this problem can be online learn-ing where the model is adapted continuously during runtime.

The No Free Lunch Theorem10 of machine learning proves that on the criterion of gen-eralization performance, there is no single modeling technique that is superior to all other

9also called offline learning10see, e.g., Wolpert [280]


techniques on all problems. However, this does not imply that for a given problem allapproaches are equal. In fact, it is the topic of this thesis to design, test and verify superi-ority of one specific modeling technique for the concrete task of online failure predictionfrom error events.

Data-driven approaches. The approach presented here is clearly a measurement data-driven approach. Such approaches can, —despite of their generalization capabilities—only learn interrelations that are present in the training data. Hamerly & Elkan [112] andPetsche et al. [202] argue that one escape from the dilemma is to build anomaly detectors,which inverts the problem: The focus of modeling is not the abnormal failure behaviorbut the way the system behaves when it is running well. However, this approach also failsif normal behavior is very diverse, which can be assumed for systems of such complexityas the telecommunication system. In the outlook (Chapter 12), a new approach to thisdilemma is proposed: The HSMM developed in this thesis may be augmented manuallyto account for failure mechanisms that are not contained in the training data.

Class Skewness. Failure prediction approaches usually have to deal with extreme classskewness: measurements for failures —even performance failures— occur much moreseldom than measurements for non-failures. As can be seen from Figure 2.2 on Page 11,errors occur late in the process from faults to failures: an error is only reported if somemisbehavior in the system has been detected. Hence, in comparison to failure predictionapproaches operating on periodically measured symptom monitoring, the ratio of failureand non-failure data is more balanced and the problem of class skewness is mitigated.Nevertheless, both classes are far from being equally distributed and hence failure modelsare trained on failure data only.

2.6 SummaryThis chapter has defined the objective of this thesis: online failure prediction. In terms ofthe telecommunication system case-study, the failures that are to be predicted are perfor-mance failures, which are defined to be a drop below a four-nines threshold on five minuteinterval call availability. Key properties of the objective have been identified and the ap-proach pursued in this thesis has been outlined. The last section of the chapter includeda brief description of one failure and four fault models and has discussed potentials andlimits of the described approach.

The following list summarizes the line of arguments that lead to the approach to onlinefailure prediction followed in this thesis:

• Dependencies within systems lead to error sequences.

• In fault-tolerant systems, not every occurrence of errors leads to a failure.

• Fault-tolerant systems fail only under some conditions.

• Error pattern recognition is applied to distinguish between error sequences that arefailure-prone and those that are not.

2.6 Summary 27

• It is assumed that both dimensions of error sequences, time of event occurrenceand type of the event are equally important. Hence, error sequences are treated astemporal sequences.

• Extended hidden Markov models are used as pattern recognition toolkit. The ex-tension allows to model the temporal behavior of error patterns by use of a semi-Markov process.

• Several failure mechanisms are assumed to be present in a system. In order toseparate failure mechanisms, failure sequences in the training data are grouped byclustering.

• Since error logs are a noisy data source, data preprocessing has to be applied to thedata

• In order to address the problem of class skewness, failure models are trained usingfailure sequences only.

• The approach is a batch learning supervised machine learning task.

• By use of a model targeted to non-failure sequences, Bayes decision theory is ap-plied for online prediction in order to classify the current situation of a runningsystem as failure-prone or not.

Contributions of this chapter. This chapter has discussed the stages at which faultscan be observed. It turned out that the classical distinction between faults, errors, andfailures is not sufficient as it is missing side-effects of faults, which are called symptoms.Hence one contribution is the extension of this differentiation.

The second contribution is a novel view on the task of online failure prediction. To thebest of our knowledge, this work is the first to treat the problem as a pattern recognitiontask of temporal sequences.

Relation to other chapters. This chapter has formally defined the objective of the thesisand has presented an overview of the approach. The next two chapters provide somebackground on related approaches. The reason why there are two chapters on relatedwork is that in this thesis, an existing modeling technique —hidden Markov models—has been extended and applied to the area of online failure prediction. Hence Chapter 3provides an overview of other approaches to online failure prediction, while Chapter 4covers related work on hidden Markov models.

Chapter 3

A Survey of Online Failure PredictionMethods

As mentioned in Section 2.1, online failure prediction denotes only a small area in thebroad field of prediction techniques. However, even in that limited sense, a wide spec-trum of approaches have been published. This chapter provides a survey on methods thathave been published and points to techniques that might in future be applied to onlinefailure prediction. In order to structure the spectrum, a taxonomy is introduced in Sec-tion 3.1. Major concepts are briefly explained and related work is referenced. As it isnot possible to implement all techniques without a huge team of researchers, in this thesisonly the most promising approaches that are closely related to the approach presented inthis thesis have been selected for comparative analysis in the case study. These methodsare explained in more detail in Section 3.2.

3.1 A Taxonomy and Survey of Online Failure PredictionMethods

A significant body of work has been published in the area of online failure prediction. Thissection introduces a taxonomy that structures the manifold of approaches (see Figure 3.1).

The most fundamental differentiation of failure prediction approaches refers to theability to evaluate the current state. Since the current state can only be considered if somemonitoring of the system is used as input data, these methods are also called monitoring-based methods. However, to be complete, failure prediction mechanisms exist that are,e.g., only based on lifetime probability distributions, the system’s architecture, or otherstatic properties of the system (Branch 2 in the taxonomy). Reliability models and mostmethods known from preventive maintenance fall into this category. The book by Lyu[170], and especially the chapters Farr [94] and Brocklehurst & Littlewood [38], providea good overview, while the book by Musa et al. [189] covers the topic comprehensively.

The category of methods that evaluate the current system state (branches starting with1 in the taxonomy), can be further divided into four categories by analyzing at whichstage of failure evolution, observations are taken. Referring to Figure 2.2 on Page 11,

29

30 3. A Survey of Online Failure Prediction Methods

faults can be observed at four stages: By audits, by monitoring of symptoms, detection oferrors or observation of failures. However, since audit-based methods are mainly offlineprocedures,1 they are not included in the taxonomy.

Failure Observation (1.1)The basic idea of failure prediction based on previous failure occurrence is to draw con-clusions about the probability distribution of future failure occurrence. The frameworkfor these conclusions can be quite formal as is the case with Bayesian classifiers or ratherheuristic as in the case of counting and thresholding.

Bayesian Predictors (1.1.1)

The key notion of Bayesian failure prediction is to estimate the probability distribution ofthe next time to failure by benefiting from the knowledge obtained from previous failureoccurrences in a Bayesian framework. In Csenki [72], such a Bayesian predictive ap-proach [3] is applied to the Jelinski-Moranda software reliability model [132] in order toyield an improved estimate of the next time to failure probability distribution.

Non-parametric Methods (1.1.2)

It has been observed that the failure process can be non-stationary and hence the prob-ability distribution of time-between-failures (TBF) varies. Reasons for non-stationarityare manifold, since the fixing of bugs, changes in configuration or even varying utiliza-tion patterns can affect the failure process. In these cases, techniques such as histogramsresult in poor estimations since stationarity2 is inherently assumed. For these reasons,the non-parametric method of Pfefferman & Cernuschi-Frias [203] assumes the failureprocess to be a Bernoulli-experiment where a failure of type k occurs at time n with prob-ability pk(n). From this assumption follows that the probability distribution of TBF forfailure type k is geometric since only the n-th outcome is a failure of type k and hencethe probability is:

Pr{TBFk(n) = m | failure of type k at n

}= pk(n)

(1− pk(n)

)m−1. (3.1)

The authors propose a method to estimate pk(n) using an autoregressive averaging filterwith a “window size” depending on the probability of the failure type k.

Counting / Thresholding (1.1.3)

It has been observed several times, that failures occur in clusters in a temporal as well asin a spatial sense. Liang et al. [165] choose such an approach to predict failures of IBM’sBlueGene/L from event logs containing reliability, availability and serviceability data.The key to their approach is data preprocessing employing first a categorization and thentemporal and spatial compression: Temporal compression combines all events at a singlelocation occurring with inter-event times lower than some threshold, and spatial compres-sion combines all messages that refer to the same location within some time window.

1We have not found any publication investigating audit-based online failure prediction2at least within a time window

3.1 A Taxonomy and Survey of Online Failure Prediction Methods 31

Figure 3.1: A taxonomy for online failure prediction approaches


Prediction methods are rather straightforward: Using data from temporal compression, ifa failure of type application I/O or network appears, it is very likely that a next failure willfollow shortly. If spatial compression suggests that some components have reported moreevents than others, it is very likely that additional failures will occur at that location. Apaper by Fu & Xu [99] formalizes the concept by introducing a measure of temporal andspatial correlation.

Symptom Monitoring (1.2)Some types of faults affect the system gradually, which is also known as service degra-dation. A prominent example for such types of faults are memory leaks. If some partof a system has a memory leak, more and more system memory is consumed over time,but, as long as there is still memory available, neither an error nor a failure is observed.When memory is getting scarce, the computer may first slow down3 and only if there is nomemory left an error occurs, which may then result in a failure. The key notion of failureprediction based on monitoring data is that faults like memory leaks can be grasped bytheir side-effects on the system such as exceptional memory usage, CPU load, or diskI/O. These side-effects are called symptoms. Four principle approaches have been iden-tified: Failure prediction based on a system model, function approximation techniques,classifiers, and time series analysis.

System Models (1.2.1)

The foundation of these failure prediction methods is a model of system behavior, whichis in most cases built from previously recorded training data.

Stochastic models (1.2.1.1): Vaidyanathan & Trivedi [263] construct a semi-Markovreward model in the following way: Several system parameter measurements are period-ically taken from a running system including the number of process context switchesand the number of page-in and page-out operations. Clustering training data yieldedeleven clusters. The authors assume that these clusters represent eleven different workloadstates. A semi-Markov reward model was built where each of the clusters corresponds toone state in the Markov model. State transition probabilities were estimated from themeasurement dataset and sojourn-time distributions were obtained by fitting two-stage-hyperexponential or two-stage-hypoexponential distributions to the training data. Then, aresource consumption “reward” rate for each workload state is estimated from the data:Depending on the workload state the system is in, the state reward defines at what ratethe modeled resource is changing. The rate was estimated by fitting a linear function tothe data using the method of Sen [233]. The authors modeled two resources: the amountof swap-space used and the amount of free real memory. Failure prediction is accom-plished by estimating the time until resource exhaustion. This is achieved by computingthe expected reward rate at steady state from the semi-Markov reward model.

Berenji et al. [27] build a system model in a hierarchical two step approach: First, theybuild component simulation models that try to mimic the input / output behavior of systemcomponents. These models are used to train component diagnostic models by combininginput data with component outputs obtained from the component simulation models. The

3e.g., due to memory swapping


target output values of the diagnostic models are binary where a value of one correspondsto faulty component behavior and zero to non-faulty behavior. The same approach is thenapplied on the next hierarchical level to obtain a system-wide diagnostic models. Theauthors use a clustering method to obtain a radial basis function rule base.

A more theoretic approach that could in principle be applied to online failure pre-diction is to abstract system behavior by a queuing model that incorporates additionalknowledge about the current state of the system. Failure prediction can be performedby computing the input value dependent expected response time of the system. Ward& Whitt [272] show how to compute estimated response times of an M/G/I processor-sharing queue based on measurable input data such as number of jobs in the system attime of arrival using a numerical approximation of the inverse Laplace transform.

Anomaly detectors (1.2.1.2): One of the most intuitive methods of failure prediction isto build a model that captures key aspects of system behavior and to check during runtime,whether the actual system behavior deviates from normal behavior. For example, Elbaumet al. [89] describe an experiment where function calls, changes in the configuration,module loading, etc. of the email client “pine” had been recorded. The authors haveproposed three types of failure prediction among which sequence-based checking wasmost successful: a failure was predicted if two successive events occurring in “pine”during runtime do not belong to any of the event transitions observed in the training data.

Candea et al. [45] describe a dependable system consisting of several parts such as thepinpoint problem determination approach [53] or automatic failure path inference [44].Even though the methods are only used in the context of recovery-oriented computing[40] the methods could easily be extended to detect deviation from usual behavior duringruntime in order to predict upcoming failures. The same holds for a failure diagnosissystem that employs a decision tree evaluating runtime properties of requests to a largeInternet site [54]. In [144], a χ2 goodness-of-fit test is used to determine, whether theproportion of runtime paths between a component instance and other component classesdeviates from a fault-free behavior.

Control theory (1.2.1.3): It is common in control theory to have an abstraction of thecontrolled system estimating the internal state of the system and its progression over timeby some mathematical equations, such as linear equation systems, differential equationsystems, Kalman filters, etc. (see, e.g., Lunze [169]). These methods are widely usedfor fault diagnosis (see, e.g., Korbicz et al. [147]) but have only rarely been used forfailure prediction. However, many of the methods inherently include the possibility topredict future behavior of the system and hence have the ability to predict failures. Forexample, Neville [193] describes in his Ph.D. thesis the prediction of failures in largescale engineering plants. Another example is Discenzo et al. [78] who mention that suchmethods have been used to predict failures of an intelligent motor using the standard IEEEmotor model. Limiting the scope to failure prediction in computer systems, only a fewexamples exist, one of which is Yang [282] who uses Kalman filters to predict future statesin combination with an “early failure detection and isolation arrangement” (EFDIA) PetriNet.

Another approach has been published by Singer et al. [243] who propose the Multi-variate State Estimation Technique (MSET) to detect system disturbances by a compari-son of the estimated and measured system state. More precisely, a matrix of measurement


Figure 3.2: Function approximation tries to mimic an unknown target function by the use ofmeasurements taken from a system at runtime

data of normal operation is collected. This training data is further processed such thatan expressive subset of training data is selected. In the operational phase, a combinationof selected data vectors weighted by similarity to the current (runtime) observations isused to compute a state estimate. The difference between observed and estimated stateconstitutes a residual that is checked for significant deviation by a sequential probabilityratio test (SPRT). In Gross et al. [110], the authors have applied the method to detect soft-ware aging [198] in an experiment where a memory-leak fault injector consumed systemmemory at an adjustable rate. MSET and SPRT have been used to detect whether the faultinjector was active and if so, at what rate it was operating. By this, time to memory con-sumption can be estimated. MSET has also been applied to online transaction processingservers in order to detect software aging (Cassidy et al. [48]).

Function Approximation (1.2.2)

Function approximation techniques try to mimic target values, which are assumed to bethe outcome of an unknown function of input data. Target functions include, e.g., theprobability of failure occurrence or the true long-term progression of resource consump-tion. Due to the fact that neither the function is known nor can the faults, which are partof the input to the unknown function, be observed, the target function can only be esti-mated from measurements (see Figure 3.2). Function approximation is a broad researcharea, and various approaches have been published to address this type of problems, amongwhich some are listed here that are related to failure prediction.

Prediction of failures can be achieved with function approximation techniques in twoways:

1. The target function is the probability of failure occurrence. In these cases, the targetvalue in the training dataset is boolean. This case is depicted in Figure 3.2.

2. The target function is some computing resource and failure prediction is accom-plished by estimating the time until resource exhaustion.

However, since most of the work presented below follows the second approach, cate-gorization distinguishes between function approximation methods rather than the targetfunction.

Curve fitting (1.2.2.1): In this category of techniques, the target function is the true,long-term progression of some system resource, e.g., system memory. However, if free


system memory is measured periodically during runtime, measurements vary heavilysince it is natural that memory is allocated and freed during normal system operation.Curve fitting techniques4 adapt parameters of a function such that the curve best fits themeasurement data, e.g., by minimizing mean square error. The simplest form of curvefitting is regression with a linear function. Garg et al. [100] have presented work whereafter data smoothing a statistical test (seasonal Kendall test) is applied in order to identifywhether a trend is present and if so, a non-parametric trend estimation procedure [233]is applied. Failure prediction is then accomplished by computing the estimated time toresource exhaustion. Castelli et al. [49] mention that IBM has implemented a curve fittingalgorithm for the xSeries Software Rejuvenation Agent. Several types of curves are fit tothe measurement data and a model-selection criterion is applied in order to choose thebest curve. Prediction is again accomplished by extrapolating the curve.

Cheng et al. [57] present a framework for high availability cluster systems. Failureprediction is accomplished in two stages: first, a health index ∈ [0, 1] is established basedon measurement data employing fuzzy logic and then trend analysis is applied in order toestimate the mean time to next failure.

Andrzejak & Silva [10] apply deterministic function approximation techniques such assplines to characterize the functional relationships between the target function5 and “workmetrics” such as the work that has been accomplished since the last restart of the system.Deterministic modeling offers a simple and concise description of system behavior withfew parameters. Additionally, using work-based input variables offers the advantage thatthe function is not depending on absolute time anymore: For example, if there is onlylittle load on a server, aging factors accumulate slowly and so does accomplished workwhereas in case of high load, both accumulate more quickly.

Genetic programming (1.2.2.2): In the paper by Abraham & Grosan [1] the targetfunction is the so-called stressor-susceptibility-interaction (SSI), which basically denotesfailure probability as function of external stressors such as environment temperature orpower supply voltage. The overall failure probability can be computed by integration ofsingle SSIs. The paper presents an approach where genetic programming has been usedto generate code representing the overall SSI function by learning from training data.Although the paper mainly focuses on electronic devices, the approach might be adoptedfor failure prediction in complex computer systems. However, this is difficult to tell sinceonly few results are presented in the paper.

Machine learning (1.2.2.3): One of the predominant applications of machine learningis function approximation. It seems natural that various techniques have a long traditionin failure prediction, as can also be seen from various patents in that area. In 1990,Troudet et al. have proposed to use neural networks for failure prediction of mechanicalparts and Wong et al. [281] use neural networks to approximate the impedance of passivecomponents of power systems. The authors have used an RLC-Π model where faults havebeen simulated to generate the training data. Neville [193] has described how standardneural networks can be used for failure prediction in large scale engineering plants.

4which are also called regression techniques5the authors use the term “aging indicator”


Turning to publications regarding failure prediction in large scale computer systems,various techniques have been applied there, too. Ning et al. [194] have modeled resourceconsumption time series by fuzzy wavelet networks (FWN). They use fuzzy logic in-ference to predict software aging in application servers based on performance parame-ters. Turnbull & Alldrin [259] use Radial Basis Functions (RBF) to predict server failuresbased on hardware sensors on motherboards. In his dissertation [120], Günther Hoffmannhas developed a failure prediction approach based on universal basis functions (UBF),which are an extension to RBFs that use a weighted convex combination of two kernelfunctions instead of a single kernel. He has applied the method to predict failures ofthe same telecommunication system used as case study in this thesis. However, UBFprimarily builds on equidistantly monitored data to identify symptoms while the methodproposed in this dissertation focuses on event-driven error sequences. In [122], Hoffmannet al. have conducted a comparative study of several modeling techniques with the goalto predict resource consumption of the Apache webserver. The study showed that UBFturned out to yield the best results for free physical memory prediction, while server re-sponse times could be predicted best by support vector machines (SVM). However, theauthors point out that the issue of choosing a good subset of input variables has a muchgreater influence on prediction accuracy than the choice of modeling technology. Thismeans that the result might be better if, for example, only workload and free physicalmemory are taken into account and other measurements such as used swap space are ig-nored. Variable selection6 is concerned with finding the optimal subset of measurements.Typical examples of variable selection algorithms are principle component analysis (PCA,see Hotelling [124]) as used in Ning et al. [194] or Forward Stepwise Selection (see, e.g.,Hastie et al. [115]), which has been used in Turnbull & Alldrin [259]. Günther Hoffmannhas also developed a new algorithm called probabilistic wrapper approach (PWA), whichcombines probabilistic techniques with forward selection or backward elimination.

Instance-based learning methods store the entire training dataset including input andtarget values and predict by finding similar matches in the stored database of training data(eventually combining them). Kapadia et al. [141] have applied three learning algorithms(k-nearest-neighbors, weighted average and weighted polynomial regression) to predictCPU-time of semiconductor simulation software based on input data such as number ofgrid points, or number of etch steps of the simulated semiconductor.

Classifiers (1.2.3)

In contrast to function approximation, classification approaches do not strive to mimicsome target function but try to directly come to a decision about criticality of the system’sstate. For this reason, training data for classification approaches has discrete (and inmost cases binary) target labels. However, the input data to classification approaches canconsist of discrete as well as continuous measurements. For example, for hard disk failureprediction based on SMART7 values, input data may consist of the number of reallocatedsectors (discrete value) and the drive’s temperature (theoretically a continuous variable).Target values are not a continuous values but a binary classification whether the drive isfailure-prone or not.

6some authors also use the term feature selection7Self-Monitoring And Reporting Technology


Statistical Tests (1.2.3.1): Ward et al. [271] estimate time-dependent mean and vari-ance of the number of TCP connections in various states from a web proxy server inorder to identify Internet service performance failures. If actual measurements deviatesignificantly from the mean of training data, a failure is predicted.

A more robust statistical test has been applied to hard disk failure prediction in Hugheset al. [127]. The authors employ a rank sum hypothesis test to identify failure prone harddisks. The basic idea is to collect SMART values from fault-free drives and store them asreference data set. Then, during runtime SMART values of the monitored drive are testedthe following way: The combined data set consisting of the reference data and the valuesobserved at runtime is sorted and the ranks of the observed measurements are computed8.The ranks are summed up and compared to a threshold. If the drive is not fault-free, thedistribution of observed values are skewed and the sum of ranks tends to be greater orsmaller than for fault-free drives.

Bayesian Classifier (1.2.3.2): In [112], Hamerly & Elkan describe two Bayesian failureprediction approaches. The first Bayesian classifier proposed by the authors is abbreviatedby NBEM expressing that a specific Naïve Bayes model is trained with the ExpectationMaximization algorithm based on a real data set of SMART values of Quantum Inc. diskdrives. Specifically, a mixture model is proposed where each naïve Bayes submodel mis weighted by a model prior P (m) and an expectation maximization algorithm is usedto iteratively adjust model priors as well as submodel probabilities. Second, a standardnaïve Bayes classifier is trained from the same input data set. More precisely, SMARTvariables xi such as read soft error rate or calibration retries are divided into bins andconditional probabilities for class k ∈ {Failure, Non-failure} are computed. The termnaïve derives from the fact that all attributes xi are assumed to be independent and hencethe joint probability can simply be computed as the product of single attribute probabilitiesP (xi | k). The authors report that both models outperform the rank sum hypothesis testfailure prediction algorithm of Hughes et al. [127].9

Pizza et al. [205] propose a Bayesian method to distinguish between transient andpermanent faults on the basis of diagnosis results. In this case the measured symptomsare obtained by monitoring and evaluation of modules or components. Although notmentioned in the paper, this method could be used for failure prediction by issuing afailure warning once a permanent fault has been detected.

Other approaches (1.2.3.3): Failures of computer systems can be predicted by apply-ing a clustering method directly to system measurement data: After collection of a labeledtraining data set indicating whether measurements are failure-prone or not, a clusteringmethod can be used, e.g., to identify centroids of failure-free and failure-prone regions.During runtime, actual measurements can be classified by assessing proximity to failure-prone and failure-free centroids. Sfetsos [234] describes that clustering has been usedtogether with function approximation techniques for load-forecasting of power systems.Additionally, clustering is part of the training procedure in Berenji et al. [27], which hasbeen described in category 1.2.1.1.

8which in fact involves nothing more than simple counting9The rank sum test was announced and submitted to the journal in 2000, but appeared after the publicationof the NBEM algorithm in the year 2002.


Cheng et al. [57] apply a fuzzy logic soft classifier to compute a health index in highavailability cluster systems (see category 1.2.2.1).

Daidone et al. [73] have proposed to use a hidden Markov model approach to inferwhether the true state of a monitored component is healthy or not. The use of hiddenMarkov models is motivated by the fact that the true state of the monitored componentcannot be observed. However, the state can be estimated from a sequence of monitoringresults by the so-called forward algorithm of hidden Markov models. Additionally, mis-takes in the component specific defect detection mechanism10 are included in the model.Since this method is based on concurrent monitoring the method could also be used forfailure prediction: If a component is detected to be faulty, a failure is likely to occur.

Chen et al. [52] and Kiciman & Fox [144], which are related publications, apply aprobabilistic context free grammar (PCFG)11 to evaluate call paths collected from a Java2 Enterprise Edition (J2EE) demo application, an industrial enterprise voice applicationnetwork, and from eBayTM servers. Although the approach is designed to identify failuresquickly, the approach could also be used to predict upcoming failures: if the probability ofthe beginning of a call path is very low, it is likely that the system is not behaving normallyand there is an increased probability that a failure will occur in the further course of therequest.

Time Series Analysis (1.2.4)

Failure predictions belonging to this category directly measure the target function andanalyze it in order to determine whether a failure is imminent or not. Feature analysiscomputes a residual of the measurement series, while time series prediction models try topredict the future progression of the target function from the series’ values itself (withoutusing other measurements as input data). Finally, also signal processing techniques canbe used for time series analysis.

Feature analysis (1.2.4.1): Crowell et al. [71] have discovered that memory related sys-tem parameters such as kernel memory or system cache resident bytes show multifractalcharacteristics in the case of software aging. The authors used the Hölder exponent toidentify fractality, which is a residual expressing the amount of fractality in the time se-ries. In a later paper [238], the same authors extended this concept and built a failureprediction system by applying the Shewhart change detection algorithm [24] to the resid-ual time series of Hölder exponents. A failure warning is issued after detection of thesecond change point.

Time Series Prediction (1.2.4.2): In Hellerstein et al. [117], the authors describe anapproach to predict if a target function will violate a threshold. In order to achieve this,several time series models are employed to model stationary as well as non-stationaryeffects. For example, the model accounts for the influence of the day-of-the-week, ortime-of-the-day, etc. Experiments have been carried out on prediction of HTTP operationsper second of a production webserver. A similar approach has been described in Vilaltaet al. [266].

10The authors use the term “deviation detection mechanism”.11For more details on PCFGs, see category 1.3.3.1


Li et al. [163] collect various parameters from a web-server and build autoregressivemodel with auxiliary input (ARX) to predict further progression of system resources uti-lization. Failures are predicted by estimating resource exhaustion times.

A similar approach has been proposed by Sahoo et al. [220] who applied various timeseries models to data of a 350-node cluster system to predict parameters like percentageof system utilization, idle time and network IO.

Signal Processing (1.2.4.3): Signal processing techniques are of course related to meth-ods that have already been described (e.g., Kalman filters in category 1.2.1.3). However,in contrast to the methods presented above, techniques of this category neither rely on anyother input data nor do they require an abstract model of system behavior or a concept of(hidden) system states. Algorithms that fall into this category use signal processing tech-niques such as low-pass or noise filtering to obtain a clean estimate of a system resourcemeasurement. For example, if free system memory is measured, observations will varygreatly due to allocation and freeing of memory. Such measurement series can be seenas a noisy signal where noise filtering techniques can be applied in order to obtain the“true” behavior of free system memory: If it is a continuously decreasing function, soft-ware aging is likely in progress and the amount of free memory can be estimated for thenear-future by means of signal processing prediction methods (see Figure 3.3). However,to the best of our knowledge, signal processing techniques such as frequency transforma-tions have only been used for data preprocessing so far.

Figure 3.3: Failure prediction using signal processing techniques on measurement data canfor example be achieved by noise filtering

Manifestation of Faults – Errors (1.3)As already mentioned, the third major group of failure prediction methods that incorporatethe current state of the system analyzes the occurrence of error events in order to assess thecurrent situation with regard to upcoming failures. One of the major differences betweenerrors and symptom monitoring is that errors always denote an event while symptoms arein most cases detected by periodic system observations. Furthermore, symptoms are inmost cases values out of a continuous range while error events are mostly characterizedby discrete, categorical data such as event IDs, component IDs, etc. (see Figure 3.4).

Frequency of Occurrence (1.3.1)

One assumption that is very common in failure prediction approaches is the notion that thefrequency of error occurrence increases before a failure occurs. Several methods building


Figure 3.4: Failure prediction based on the occurrence of errors (A,B,C). The goal is to assessthe risk of failure at some point in future (indicated by the question mark). In orderto perform the prediction, some data that have occurred shortly before presenttime are taken into account (data window).

on this assumption have been proposed over the decades.According to Siewiorek & Swarz [241], Nassar & Andrews [190] were the first to pro-

pose two ways of failure prediction based on the occurrence of errors. The first approachinvestigates the distribution of error types. If the distribution of error types changes sys-tematically (i.e., one type of error occurs more frequently) a failure is supposed to beimminent. The second approach investigates error distributions for all error types ob-tained for intervals between crashes. If the error generation rate increases significantly,a failure is looming. Both approaches resulted in computation of threshold values uponwhich a failure warning can be issued.

Iyer et al. [131] apply a hierarchical aggregation method to error occurrences in orderto filter out so-called symptoms:12 First, errors of equal type reported by one machineform so-called clusters. Second, subsequent clusters that occur within some specifiedtime interval are combined to form so-called error groups. Third, error groups that occurwithin a 24h interval and that share at least two error records are called “events”. Afterdata aggregation, Iyer et al. estimate singleton and joint probabilities to test for statisticaldependence.13 A symptom of an event is formed by records that are common to mostof the groups in an event. Although originally used for automatic identification of theroot cause of permanent faults, the detection of a symptom could as well be used for theprediction of upcoming failures (see also Iyer et al. [130]).

The dispersion frame technique (DFT) developed by Lin & Siewiorek [167] uses a setof heuristic rules on the time of occurrence of consecutive error events to identify loomingpermanent failures. Since this method is used for comparison with the model presentedin this thesis, DFT is further explained in Section 3.2.1.

Lal & Choi [153] show plots and histograms of errors occurring in a UNIX Server.The authors propose to aggregate errors in an approach similar to tupling (c.f., Tsao &Siewiorek [258]) and state that the frequency of clustered error occurrence indicates anupcoming failure. Furthermore, they showed histograms of error occurrence frequencyover time before failure.

More recently, Leangsuksun et al. [157] have presented a study where hardware sen-sors measurements such as fan speed, temperature, etc. are aggregated using severalthresholds to generate error events with several levels of criticality. These events are an-

12Not to be confused with side-effects of faults as used in this thesis13For independent random variables A and B, the following equation holds: P (A,B) = P (A) ∗ P (B). If

not, A and B are not independent and are likely to occur together.


alyzed in order to eventually generate a failure warning that can be processed by othermodules. The study was carried out on data of a high availability high performance Linuxcluster.

In the paper presented by Levy & Chillarege [162], the authors derive three principles,two of which fall into this category: principle one (“counts tell”) again emphasizes theproperty that the number of errors14 per time unit increases before a failure. Principlenumber three (“clusters form early”) basically states the same by putting more emphasison the fact that for common failures the effect is even more apparent if errors are clusteredinto groups.

Another link to this relationship between errors and failures is provided by Liang et al.[165]: The authors have analyzed jobs of an IBM BlueGene/L supercomputer and supportthe thesis: “On average, we observe that if a job experiences two or more non-fatal eventsafter filtering, then there is a 21.33% chance that a fatal failure will follow. For jobs thatonly have one non-fatal event, this probability drops to 4.7%”.

Rule-based Systems (1.3.2)

The essence of rule-based failure prediction is that the occurrence of a failure is predictedonce at least one of a set of conditions is met. Hence rule-based failure prediction has theform

IF <condition1> THEN <failure warning>

IF <condition2> THEN <failure warning>

. . .

Since in most computer systems the set of conditions cannot be set up manually, the goalof failure prediction algorithms in this category is to identify conditions algorithmicallyfrom a set of training data. The art is to find a set of rules that is general enough to captureas many failures as possible but that is also specific enough not to generate too many falsefailure warnings.

Data mining (1.3.2.1): To our knowledge, the first data mining approach to failure pre-diction has been published by Hätönen et al. [116]. The authors describe that a rule minerwas set up by manually specifying certain characteristics of episode rules. For example,the maximum length of the data window, types of error messages15 and ordering require-ments had to be specified. However, the algorithm returned too many rules such thatthey had to be presented to human operators with system knowledge in order to filter outinformative ones.

Weiss [275] introduces a failure prediction technique called “timeweaver” that isbased on a genetic training algorithm. In contrast to searching and selecting patternsthat exist in the database, rules are generated “from scratch” by use of a simple language:error events are connected with three types of ordering primitives. The genetic algorithm

14Since the paper is about a telecommunication system, the authors use the term alarm for what is termedan error, here.

15As this work has also been published in the telecommunication community, the authors use the term alarminstead of errors.


starts with an initial set of rules16 and repetitively applies crossing and mutation opera-tions to generate new rules. Quality of the obtained candidates is assessed using a specialfitness function that incorporates both prediction quality17 as well as diversity of the ruleset. After generating a rule set with the genetic algorithm, the rule set is pruned in orderto remove redundant patterns. Results are compared to three standard machine learn-ing algorithms: C4.5rules [209], RIPPER [61] and FOIL [208]. Although timeweaveroutperforms these algorithms, standard learning algorithms might work well for failureprediction in other applications.

Vilalta & Ma [268] describe a data-mining approach that is tailored to short-termprediction of boolean data. Since the approach builds on a concept termed “eventsets”, thefailure prediction algorithm is referenced here as eventset method. The method searchesfor predictive subsets of events occurring before a target event. In the terminology usedhere, events refer to errors and target events to failures. The first major concept of themethod addresses class skewness (see Section 2.5.3). The solution is —similar to thesolution used in this thesis— to first consider only error sequences preceding a failurewithin some time window, and to incorporate non-failure data only to remove unwantedpatterns in a later step. The eventset method is used for comparative analysis and is henceexplained in more details in Section 3.2.2. The eventset method has also been applied forfailure prediction in a 350-node cluster system, as described in [220].

As indicated by its name, the eventset method operates on sets of errors and doesnot take the ordering of errors into account while the timeweaver method includes partialordering. However, there are other data-mining methods having the potential to achievegood results, which have not yet been applied to the problem of failure prediction. Forexample a lot of research has been published in the field of sequential pattern mining.As an example, Srikant & Agrawal [249] introduce the concept of ontologies that wouldenable to incorporate relationships between error messages, which is closely related tohierarchical fault models. A second area of research having developed methods that couldas well be applied to failure prediction is concerned with the analysis of path traversalpatterns. For example, Chen et al. [55] generate a tree structure of path traversals toidentify frequent paths and to isolate those paths that set up a basis (so-called “maximalreference sequences”). However, since the method assumes a dedicated start of all paths18

application of the method to failure prediction is limited to areas where some dedicatedstarting points exist such as in transaction-based systems.

Fault trees (1.3.2.2): Fault trees have been developed in the 1960’s and have become astandard reliability modeling technique. A comprehensive treatment of fault trees is, forexample, given by Vesely et al. [265]. The purpose of fault trees is to model conditionsunder which failures can occur using logical expressions. Expressions are arranged inform of a tree, and probabilities are assigned to the leaf nodes, facilitating to compute theoverall failure probability.

Fault tree analysis is a static analysis that does not take the current system status intoaccount. However, if the leaf nodes are combined with online fault detectors, and logicalexpressions are transformed into a set of rules, they can be used as online failure predictor.

16the so-called initial population17based on a variant of the F-Measure, that allows to adjust the relative weight of precision and recall18which is the root node of the tree


Although such approach has been applied to chemical process failure prediction [260]and power systems [216], we have not found such approach being applied to computersystems.

Other approaches (1.3.2.3): In the area of machine learning, a broad spectrum of meth-ods are available that could in principle be used for online failure prediction. This para-graph only lists a few techniques that either have been applied for failure prediction orthat seem at least promising.

A relatively new technique on the rise is the so-called “rough set theory” [199]. Chi-ang & Braun [58] propose a combination of rough set theory with neural networks topredict failures in computer networks based on network events. Rough set theory has alsobeen applied to aircraft component failure prediction (c.f., e.g., Pena et al. [200]).

Bai et al. [20] employ a Markov Bayesian Network for reliability prediction but asimilar approach might work for online failure prediction, as well. The same holds fordecision tree methods: upcoming failures can be predicted if error events are classifiedusing a decision tree approach similar to Chen et al. [54], which has been described inSection 1.2.1.2.

Pattern recognition (1.3.3)

Sequences of errors form error patterns. The principle of pattern recognition in this cat-egory is to assign a ranking value to an observed sequence of error events expressingsimilarity with learned patterns that are known to lead to system failures. Failure pre-diction is then accomplished by classification based on pattern similarity rankings (seeFigure 3.5).

Figure 3.5: Failure prediction by recognition of failure-prone error patterns

Probabilistic context-free grammars – PCFG (1.3.3.1): This modeling technique hasbeen developed in the area of statistical natural language processing (see, e.g., Manning& Schütze [175]). A probabilistic context free grammar consists of a set of rules of acontext-free grammar Ni → X where Ni is a nonterminal symbol and X is a sequenceof terminals and nonterminals. Furthermore, PCFGs associate a probability with eachrule such that:

∀i :∑j

P (Ni →Xj) = 1 .

Given a sentence, which is a sequence of terminal symbols, i.e., the words, the sentence’probability can be computed by finding and summing all possible parse trees having thegiven sentence as leaf nodes. The probability of each tree is defined as the product of ruleprobabilities that were used to generate the parse tree. Algorithms have been developed to


perform these computations efficiently in a dynamic programming manner. Furthermore,algorithms have been developed to learn rule probabilities from a given set of trainingsentences.

Failure prediction could be realized with PCFGs by learning the grammar of errorevent sequences that have lead to a failure in the training dataset. Following the approachdepicted in Figure 3.5, failures can be predicted during runtime by computing the proba-bility of the sequence of error events that have occurred in a time window before presenttime. To our knowledge, such an approach has not been implemented for online failureprediction. The only failure-related publications that use PCFGs are Chen et al. [52] andKiciman & Fox [144]. However, these papers analyze runtime-paths, which are symptomsrather than errors —hence this approach has been described in category 1.2.3.3.

A further well-known stochastic speech modeling technique are n-gram models [175].N -grams represent sentences by conditional probabilities taking into account a context ofup to n words in order to compute the probability of a given sentence.19 Conditionaldensities are estimated from training data. Transferring this concept to failure prediction,error events correspond to words and error sequences to sentences. If the probabilities (the“grammar”) of an n-gram model were estimated from failure sequences, high sequenceprobabilities would translate into “failure-prone” and low probabilities into “not failure-prone”.

Markov models (1.3.3.2): Similarity of error sequences to failure-prone patterns ex-tracted from training data can be computed with Markov models in two different ways,depending whether a Markov chain or a hidden Markov model (HMM) is used.

In case of Markov chains, each error event corresponds to a state in the chain. Se-quence similarity is hence computed by the product of state traversal probabilities. Simi-lar events prediction (SEP), which is the prequel of the prediction technique developed inthis thesis, was built on this concept (see [226] for a description).

The failure prediction approach described in this thesis also belongs to this category.The first ideas have been published in Salfner [223], but an implementation has shownthat the concept needed to be developed further, which resulted in the approach presentedhere.

Pairwise alignment (1.3.3.3): Computing similarity between sequences is one of thekey tasks in biological sequence analysis [86]. Various algorithms have been developedsuch as the Needleman-Wunsch algorithm [191], Smith-Waterman algorithm [244] or theBLAST algorithm [8]. The outcome of such algorithms is usually a score evaluatingthe alignment of two sequences. If used as a similarity measure between the sequenceunder investigation and known failure sequences, failure prediction can be accomplishedas depicted in Figure 3.5. One of the advantages of alignment algorithms is that they buildon a substitution matrix providing scores for the substitution of symbols. In terms of errorevent sequences this technique has the potential to define a score for one error eventbeing “replaced” by another event giving rise to use a hierarchical grouping of errors asis defined in Section 5.4. However, to our knowledge, no failure prediction approachesapplying pairwise alignment algorithms have been published, at this time.

19Although, in most applications of statistical natural language processing the goal is to predict the nextword using P (wn|w1, . . . , wn−1), the two problems are connected via the theorem of conditional proba-bilities.

3.2 Methods Used for Comparison 45

Other Methods (1.3.4)

Statistical tests (1.3.4.1): Principle number two (“the mix changes”) in Levy &Chillarege [162] delineates the discovery that the order of subsystems sorted by errorgeneration frequency changes prior to a failure. According to the paper, relative errorgeneration frequencies of subsystems follow a Pareto distribution: Most errors are gen-erated by only a few subsystems while most subsystems generate only very few errors.20

The proposed failure prediction algorithm monitors the order of subsystems and predictsa failure if it changes significantly, which basically is a statistical test.

Classifier (1.3.4.2): Classifiers usually associate an input vector with a class label. Incategory 1.3, input data consists of one or more error events that have to be represented bya vector in order to be processed by a classification algorithm. A straightforward solutionwould be to use the error type of the first event in a sequence as value of the first inputvector component, the second type as second component, and so on. However, it turns outthat such a solution does not work: If the sequence is only shifted one step, the sequencevector is orthogonally rotated in the input space and most classifiers will not judge thetwo vectors as similar. One solution to this problem has been proposed by Domeniconiet al. [81]: SVD-SVM21 borrows a technique known from information retrieval: the so-called “bag-of-words” representation of texts [175]. In the bag of words representation,there is a dimension for each word of the language. Each text is a point in this high-dimensional space where the magnitude along each dimension is defined by the numberof occurrences of the specific word in the text.22 SVD-SVM applies the same techniqueto represent error event sequences. Since SVD-SVM is used for comparative analysis, itis described in more detail in the next section.

3.2 Methods Used for ComparisonIn order to compare the prediction method presented in this thesis to the state-of-the-art,other prediction methods have been implemented and applied to the data of the case study.The selection of approaches is primarily based on the type of input data: the best-knownand most promising error-based approaches have been chosen, which are:

• Dispersion Frame Technique (DFT) developed by Lin [166], which is an error-frequency based approach (category 1.3.1)

• Eventset Method developed by Vilalta & Ma [268], which is a data-mining ap-proach (category 1.3.2)

• SVD-SVM developed by Domeniconi et al. [81], which is a classification approach(category 1.3.4)

Together with the pattern recognition approach presented in this dissertation, all cate-gories of error-based failure prediction are covered.

20This is also known as Zipf’s law [285]21Singular-Value-Decomposition and Support-Vector-Machine22There are more sophisticated representations incorporating term weighting such as tf.idf, but this has not

been used for SVD-SVM


In addition to that, a periodic prediction of failures based on mean-time-between-failures (MTBF), which belongs to category 1.1, has been applied in order to show theprediction results that can be achieved with almost no effort.

Comparing the data that is taken into account by the various prediction methods, onecan conclude:

• DFT only makes use of the time of error occurrence

• Eventset only makes use of the type of error occurrence

• SVD-SVM makes use of the type of error events. Using a bag-of-words represen-tation, also the number of error occurrences can be incorporated. Using a specialrepresentation to incorporate time of error occurrence has not been successful forthe case study.

• MTBF only takes the occurrence of failures into account.

In this regard, the novelty of the approach presented here is that it is the first to analyzeerror events as event-triggered temporal sequence.

3.2.1 Dispersion Frame TechniqueLin [166] has developed a technique called Dispersion Frame Technique (DFT) that eval-uates the time of error occurrence and is therefore classified to category 1.3.1. It is basedon the notion that errors occur more frequently before a failure occurs. It is a well-known heuristic to analyze error occurrence frequencies and has been shown to be supe-rior to classic statistical approaches like fitting of Weibull distribution shape parameters[167, 12]. The technique was developed for data of the Andrews distributed File Sys-tem at Carnegie-Mellon University. The following paragraphs describe DFT as originallypublished and notes about its application to the case study in this thesis are provided atthe end.

Figure 3.6: Dispersion Frame Technique. Diamond u i denotes the last error that has oc-curred, i − 1 the predecessor error of the same type. DF denotes a dispersionframe and EDI the error dispersion index. W denotes a failure warning that isissued at the end of DF 1 centered around error i− 2.

The first step of DFT prediction is to separate all error events pertinent to one device.Then the time of error occurrence for each device is analyzed. A Dispersion Frame (DF)


is the interval time between successive error events of the same type. In Figure 3.6, twoDFs are shown, DF1 is the time interval between errors i − 4 and i − 3 whereas DF2 isthe interval between errors i − 3 and i − 2. Each DF is shifted such that it is centeredaround the next and next but one error. The Error Dispersion Index (EDI) is defined tobe the number of error occurrences in the later half of a DF. If it is observed that a DF isless than 168 hours, a heuristic is activated, which predicts a failure if at least one of thefollowing rules are met:

1. when two consecutive EDIs from successive application of the same DF exhibit anEDI of at least three. In Figure 3.6 this is true for DF1 centered around i − 3 andi− 2

2. when two consecutive EDIs from two successive DFs exhibit an EDI of at leastthree.

3. when a dispersion frame is less than one hour,

4. when four error events occur within a 24-hour frame,

5. when there are four monotonically decreasing DFs and at least one DF is half thesize of its previous DF. This rule is also met in Figure 3.6

The failure warning is issued at the end of the data frame, as shown in the figure.As might have become clear, the rules are heuristic and account for several types

of system behavior. For example, rules three and four put absolute thresholds on error-occurrence frequencies, whereas rules one and two on window-averaged occurrence fre-quencies. Finally, rule five is determined to detect trends in error occurrence frequencies.

It should be noted that DFT was developed for data of the Andrews distributed FileSystem (AFS). In this dissertation, the approach has been transferred to the prediction offailures of a component-based industrial telecommunication system. Therefore, the DFTmethod had to be adapted slightly:

1. AFS is a physically distributed campus-wide system and error messages could beassigned easily to field replaceable units (FRUs), which are also strong fault con-tainment regions. The data used for the case study in this thesis derives from anon-distributed system built from software components. However, in the case ofAFS, error detection took place within each FRU, while in the case-study consid-ered here, software components are much weaker fault containment regions anderror detection frequently took place in other parts of the system. Moreover, com-ponents were sometimes even not identifiable in the data. Hence, software contain-ers, which execute the components, have been considered as the entity equivalentto FRUs.

2. There are several parameters in the ruleset that are problem specific. For example,the activation threshold of 168 hours is the time above which faults are consideredto be unrelated. Since the goal of the case-study used here is to predict service avail-ability failures on a five-minutes timescale, ruleset parameters had to be adapted.To do this, each parameter has been “optimized” separately by varying parametervalues. Each choice has been evaluated with respect to the ability to predict fail-ures. If two choices for a parameter were almost equal in precision and recall (seeChapter 8), the one with less false positives has been chosen.


3. There is no notion of warning-time ∆tw in the method. Since warning-time is theminimum time for any failure prediction to be useful, failure warnings issued for theinterval (t, t+ ∆tw] are removed. Indeed, by design DFT can only predict failuresat most half the length of a dispersion frame ahead. This resulted in removal ofquite a lot of warnings due to the short inter-error-event times occurring in the data.

3.2.2 Eventset MethodThe prediction approach published by Vilalta & Ma [268] is based on data-mining tech-niques. The basic concept of the method are so-called eventsets. As the name indicates,an eventset E = {Xi} is a set of error events that indicates an upcoming failure. Thefailure predictor consists of a set of eventsets. The goal of the training procedure is to finda good set of eventsets such that as many failures as possible can be captured with as fewfalse warnings as possible.

In order to deal with the imbalance of class distributions (failures are rare events), themethod first considers only failure data and uses non-failure data in a second validationstep. Failure data consists of all error events that have occurred within a time windowof length ∆td before each failure in the training dataset. These windows are termedfailure windows, here. The original approach does not consider lead-time ∆tl. It hasbeen incorporated by shifting the failure window, as depicted in Figure 3.7.

Figure 3.7: The eventset method builds a database of sets of errors occurring within a timewindow before failures (indicated by t). The database is then reduced in severalsteps to yield a better predictor. In some of these steps, data occurring in non-failure windows are used. ∆td denotes the length of the data window and ∆tllead-time

An initial database consisting of all subsets of events that have occurred in the eventwindows is set up. This initial database of eventsets is then reduced in three steps:

1. Keep only frequent eventsets. An eventset is said to be frequent if it has support


greater than a user-defined threshold. Support is defined to be the relative frequencyof occurrence in the event windows:

support(E) = number of failure windows containing Etotal number of failure windows

. (3.2)

In the example, eventsets {A}, {B}, and {A,B} have support 100% and eventsets{C}, {A,C}, {B,C}, and {A,B,C} have support 50%. Assuming a threshold of,say 70%, only the first eventsets remain in the database.

2. Keep only accurate eventsets. In the example, the eventA also occurs between thetwo failures which leads to the conclusion that the occurrence of A is not indicatingan upcoming failure. Confidence takes this into account: Confidence is definedto be the relative frequency of occurrence of the eventset with respect to all timewindows (including those that do not precede a failure event):

confidence(E) = number of failure windows containing Enumber of all windows containing E

. (3.3)

An eventset is said to be accurate if it has confidence greater than a user-definedthreshold. In the example, eventsets {B}, and {A,B} have confidence 100%while {A} has confidence 2

3 . Assuming a confidence threshold of, say, 70%, onlyeventsets {B}, and {A,B} remain in the database.

Due to the fact that putting a threshold on confidence does not check for neg-ative correlations, an additional statistical test is performed testing for the null-hypothesis:

H0 : P (E|failure windows) ≤ P (E|non-failure windows). (3.4)

Only eventsets E for which H0 can be rejected (with a certain confidence level)stay in the database.

3. Remove eventsets that are too general. Remaining eventsets are ordered by con-fidence in the first place and subsequently by support and finally by specificity: Aneventset E1 is more specific than E2 if E2 ⊂ E1. Going through the sorted list ofeventsets, the algorithm removes eventsets that are less specific. In the example, thesorted list of eventsets consists of

[{A,B}, {B}

]. Since {B} ⊂ {A,B}, {B} is

removed and the only remaining eventset is {A,B}. This means that events A andB must occur together in order to indicate an upcoming failure.

Failure prediction is performed by checking, whether any eventset of the database is asubset of the currently observed set of error events. For example, if —during runtime—errors A, C, and B occur within a time window spanning an interval of length ∆td, afailure is predicted since the eventset {A,B} ⊂ {A,B,C}.

As might have become clear, the initial database of eventsets has cardinality of thepower set, which would tag the algorithm infeasible in real applications. Therefore thefirst step of support filtering is incorporated into the generation of the initial eventsetdatabase by use of the a-priori algorithm (Agrawal et al. [2]), which also applies branchand bound techniques.


3.2.3 SVD-SVM MethodLatent semantic indexing (LSI) is a technique developed in information retrieval that en-ables to find related text documents even if they do not share search terms. LSI is based onthe notion of co-occurrence of terms and provides a method to identify “latent” semanticconcepts in texts (see, e.g., [175]). Domeniconi et al. [81] have applied this technique tothe problem of failure prediction and assume that co-occurrence of error events indicatesthe “latent” state of the system.23 More specifically, the approach consists of three steps.

1. Error sequences are represented in a so-called bag-of-words representation, whichis frequently used in natural language processing: for text documents, there’s adimension for each word of the language and the magnitude along each dimen-sion is, for example, simply the number of times the word occurs in the document.In the case of error event sequences, there’s a dimension for each event type andthe magnitude along the dimension (i.e., the distance from origin) represents how“prominent” an error type is in the sequence. The authors describe three ways ofassigning a value to “prominence”:

• existence: one if an event occurs in the sequence, zero if not

• count: the number of occurrences in the sequence

• temporal: partitioning the sequence into time-slots and assigning a one toa binary digit if the event occurs within the corresponding timeslot.

The key notion of the bag-of-words representation is that each event sequence rep-resents a point in a high-dimensional space and hence the entire training data setcomprises a multidimensional point cloud.

The process of turning error log data into sequences is similar to the Eventsetmethod: All errors occurring within a time window of length ∆td preceding a fail-ure by lead-time ∆tl constitute a failure sequence which is translated into a positive(failure-prone) bag-of-words data point. Errors occurring in data windows betweenfailures constitute negative examples (see Figure 3.8).

2. Semantic concepts, which refer to the latent states of the system, are identified bymeans of singular value decomposition (SVD). The result of SVD is then used toreduce the number of dimensions in the data. More precisely, co-occurring eventsin the space of event types are mapped onto the same dimensions in the space oflatent states by a least-squares method to decompose the matrix of training event-sequences into a product of square and diagonal matrices. Assuming that there are ntraining sequences, andm event types, then the matrix of training dataD is am×nmatrix with each column corresponding to an event sequence. SVD decomposesDinto

D = U S V T , (3.5)

where S is a diagonal matrix with ordered singular values on the main diagonalindicating the amount of variation for each dimension. SVD has the property that

23The authors call it “pattern context information”


Figure 3.8: Bag-of-words representation of error sequences occurring prior to failures (s).Each time window defines an event sequence. By assuming that there are onlytwo types of error messages (A and B), each sequence can be mapped to a pointin two-dimensional event-type space where the magnitude along each dimensionis determined by the number of times the event occurs in the sequence. Se-quences from windows preceding a failure are positive examples (black bullets),sequence from windows between failures constitute negative examples (white bul-let). ∆td denotes length of the window and ∆tl lead-time.

projecting data onto the the first k dimensions yields a least-square optimal pro-jection. The projection matrix is defined by the first k columns of matrix U andprojection can simply be performed by matrix multiplication.

An example is shown in Figure 3.9: assuming that there are only two differenterrors A and B, the training data set can be represented in two-dimensional space.Figure 3.9-a shows an example using the count encoding, black bullets to indicatefailure and white bullets to indicate non-failure sequences. The training data definesa 2×11 dimensional matrixD. SVD computes new dimensions x1 and x2 as shownin (b), such that projection (c) results in a least-square overall error. The projecteddata set has only one dimension.

Figure 3.9: Singular value decomposition (SVD). (a): Bag-of-words representation of trainingdata set. (b): Rotated dimensions found by SVD. (c): Projection onto the newdimension x1.

3. A classifier is trained in order to distinguish between failure and non-failure se-quences. The input data to classification are the projected event sequences (obtainedfrom step two). The classification technique used are Support Vector Machines


(SVMs), which have been developed at the beginning of the 1990’s by Vapnik[264].24 Support vector machines are linear maximum margin classifiers. Linearmeans that the decision boundary corresponds to a straight line in two-dimensionalspace and to a hyper-plane in higher dimensions. However, such an approach canonly classify linearly separable problems appropriately, which is not the case formost real-world classification problems. To remedy this problem, a second trans-formation into a high-dimensional feature space including non-linear features isperformed, which can turn complex classification problems into linear problems infeature space. Figure 3.10 depicts such a transformation denoted by ϕ. Althoughthe additional transformation seems to introduce extra computation complexity, itis in fact one of the reasons for the computational efficiency of SVMs: The trick isthat transformations exist for which the distance measure can be computed muchmore efficiently. The second important feature of SVMs is that they belong to theclass of maximum margin classifiers, which means that the decision boundary ischosen such that the margin25 is maximal. It has been proven that this results inmost robust classification (see, e.g., [237]).

Figure 3.10: Maximum margin classification in feature space. On the left-hand side datapoints in the original space cannot be separated linearly. By transformation ϕdata points are transformed into a feature space, where a linear separation ispossible. The decision boundary (indicated by the dashed line) is chosen suchthat the margin (solid lines) is maximal.

After training, online failure prediction is performed in three steps:

1. all error events that occurred within a time window of length ∆td before presenttime are represented as a bag-of-words.

2. Singular value decomposition needs not to be performed for online prediction. In-stead, the bag of words is transformed into reduced semantic space by multiplica-tion with the projection matrix.

3. the resulting k-dimensional vector is classified using a support vector machine,which includes a further transformation using ϕ.

24An introduction can be found in Cristianini & Shawe-Taylor [70]25which is the distance to the closest datapoints

3.3 Summary 53

According to the authors of SVD-SVM, failure patterns show similar properties astext classification tasks. For example, the frequency distribution for error events followsZipf’s law [285], which inspired them to apply text processing techniques.

3.2.4 Periodic PredictionThe failure prediction method used to estimate some sort of lower bound can be deriveddirectly from reliability theory, since the probability of failure occurrence up to time t issimply:

F (t) = 1−R(t) , (3.6)

where R(t) is reliability.Assuming a Poisson failure process, reliability turns out to be an exponential distribu-

tion (see, e.g., Musa et al. [189]) and failure probability is:

F (t) = 1− e−λt . (3.7)

The distribution parameter λ is fit to the data by setting

λ = 1MTBF

, (3.8)

where MTBF denotes mean-time-between-failure of the training data set.26

Using this model, a failure is predicted according to the median of the failure distri-bution:

Tp = 1λln(2) . (3.9)

3.3 SummaryThis chapter has introduced a taxonomy of online failure prediction approaches for com-plex computer systems and has provided a comprehensive survey of online failure predic-tion methods. Furthermore, the survey points to research areas that provide a toolbox ofmethods that could most promisingly be applied to the task of online failure prediction.From this it can be concluded that the technique presented in this thesis is the first to applytemporal sequence pattern recognition methods to the task of online failure prediction.

The second major goal of this chapter was to describe in detail four existing failureprediction approaches that are used for comparative analysis in this thesis, namely disper-sion frame technique, eventset method, SVD-SVM and a periodic prediction based on areliability model.

Contributions of this chapter. To the best of our knowledge, this chapter provides thefirst taxonomy and the first survey of online failure prediction approaches.

26Some works use MTTF instead of MTBF, but since in the case study performance failures are predicted,repair time is not an issue, here.


Relation to other chapters. This chapter has presented related work with respect toonline failure prediction approaches. Since the failure prediction method presented in thisthesis is based on an extension to hidden Markov models, related work on hidden Markovmodels is presented in the next chapter. However, in order to explain the various models,first, an introduction to the theory of hidden Markov models is provided.

Chapter 4

Introduction to Hidden Markov Modelsand Related Work

As a result of their capabilities, hidden Markov models (HMMs) are becoming more andmore frequently used in modeling. Examples include, e.g., the detection of intrusioninto computer systems [273], fault diagnosis [73], network traffic modeling [229, 274],estimation and control [90], speech recognition [125], part-of-speech tagging [175], andgenetic sequence analysis applications [86]. In this work, HMMs are used for onlinefailure prediction following a pattern recognition approach. For this reason this chaptergives an introduction to the theory of HMMs (Section 4.1). The approach taken in thisthesis builds on the assumption that time and type of error occurrence are crucial foraccurate failure prediction. However, standard HMMs are not appropriate models forprocessing temporal sequences. In Section 4.2, three principle approaches are providedhow temporal sequences can be handled by HMMs, followed by related work on time-varying HMMs, which is provided in Section 4.3.

4.1 An Introduction to Hidden Markov ModelsHMMs are based on discrete-time Markov chains (DTMCs), which consist of a set S ={si} of N states, a square matrix A = [aij] defining transition probabilities between thestates, and a vector of initial state probabilities π = [πi] (see Figure 4.1). A is a stochasticmatrix, which means that all row sums equal one:

∀i :N∑j=1

aij = 1 . (4.1)

Additionally, the vector of initial state probabilities π must define a discrete probabilitydistribution such that

N∑i=1

πi = 1 . (4.2)

The stochastic process defined by a DTMC can be described as follows: An initialstate is chosen according to the probability distribution π. Starting from the initial state,

55

56 4. Introduction to Hidden Markov Models and Related Work

Figure 4.1: Discrete Time Markov Chain

the process transits from state to state according to the transition probabilities defined byA: being in state i, the successor state j is chosen according to the probability distributionaij . Such a process shows the so-called Markov assumptions or properties:

1. The process is memoryless: a transition’s destination is dependent only on the cur-rent state irrespective of the states that have been visited previously.

2. The process is time-homogeneous: transition probabilitiesA stay the same regard-less of the time that has already elapsed (A is not depending on time t)

More formally, both assumptions can be expressed by the following equation:

P (St+1 = sj | St = si, . . . , S0) = P (S1 = sj | S0 = si) . (4.3)

Loss of memory is expressed by the fact that all previous states S0, . . . , St−1 are ignoredon the right-hand side of Equation 4.3, and time-homogeneity is reflected by the factthat the transition probabilities for time t → t+ 1 are equal to the probabilities for time0 → 1.

Hidden Markov Models extend the concept of DTMCs in that at each time step anoutput (or observation) is generated according to a probability distribution. The key notionis that this output probability distribution depends on the state the stochastic process is in.Two types of HMMs can be distinguished regarding the types of their outputs:

• If the output is continuous, e.g, a vector of real numbers, the model is called acontinuous HMM.1

• If the output is chosen from some finite countable set, outputs are called symbols.Such models are called discrete HMMs. Due to the fact that error message IDs arefinite and countable only discrete HMMs are considered.

In order to formalize this, HMMs additionally define a finite countable set of symbols O ={oi} of M different symbols, which is called the alphabet of the HMM. A matrix B =[bij] of observation probabilities is defined where each row i of B defines a probabilitydistribution for state si such that bij is the probability for emitting symbol oj given thatthe stochastic process is in state si:

bij = P (Ot = oj|St = si) , (4.4)

1Not to be confused with continuous-time HMMs, as explained later

4.1 An Introduction to Hidden Markov Models 57

where Ot denotes the random variable for the observation at time t. Hence,B has dimen-sions N ×M and is a stochastic matrix such that

∀i :M∑j=1

bij = 1 . (4.5)

Note that for readability reasons, bij will sometimes be denoted by bsi(oj). Figure 4.2shows a simple discrete-time HMM.

Figure 4.2: A discrete-time HMM with N = 4 states and M = 2 observation symbols

The reason why HMMs are called “hidden” stems from the perspective that only theoutputs can be observed from outside and the actual state si the stochastic process residesin is hidden from the observer. From this notion, three basic problems arise for whichalgorithms have been developed:

1. Given a sequence of observations and a hidden Markov model, but having no clueabout the states the process has passed to generate the sequence, what is the overallprobability that the given sequence can be generated? This probability is calledsequence likelihood. The Forward algorithm provides an efficient solution to thisproblem.

2. Given a sequence and a model as above: What is the most probable sequence ofstates the process has traveled through while producing the given observation se-quence? The Forward-Backward and Viterbi algorithms provide solutions to thisproblem.

3. Given a set of observation sequences: What are optimal HMM parameters A, B,and π such that the likelihood of the sequence set is maximal? The Baum-Welchtraining algorithm yields a solution by iteratively converging to at least a local max-imum.

The following sections will introduce the three algorithms. Although the algorithms canbe found in many textbooks or in Rabiner [210], they are described here for reasonsof comparison: In Chapter 6, these algorithms are adapted for the hidden semi-Markovmodel introduced in this thesis.


4.1.1 The Forward-Backward AlgorithmAs the name might suggest, the Forward-Backward algorithm consists of a forward anda backward part. The forward part alone provides a solution to the first problem: thecomputation of sequence likelihood. The likelihood of a given observation sequence o =[Ot] is the probability that a given HMM with parameters λ = (A,B,π) has generatedthe sequence, which is denoted by P (o|λ). In order to compute this probability, firstassume that the sequence of hidden states s = [St] was known. The likelihood could thenbe computed by:

P (o, s|λ) = πS0 bS0(O0)L∏t=1

aSt−1 St bSt(Ot) , (4.6)

where L is the length of the sequence. As only o is known, all possible state sequences shave to be considered and summed up:

P (o|λ) =∑s

πS0 bS0(O0)L∏t=1

aSt−1 St bSt(Ot) . (4.7)

However, such approach results in intractable complexity since there are NL+1 differentsequences. An efficient reformulation has been found exploiting the Markov assump-tion that transition probabilities are time homogeneous and only dependent on the currentstate. Using this property, Equation 4.7 can be rearranged such that repetitive computa-tions can be grouped together. From this rearrangement it is only a small step to a recur-sive formulation, which is also known as dynamic programming. The resulting algorithmis called Forward algorithm.

Forward algorithm. The algorithm is based on a forward variable αt(i) denoting theprobability for sub-sequence O0 . . . Ot under the assumption that the stochastic process isin state i at time t:

αt(i) = P (O0O1 . . . Ot, St = si|λ) . (4.8)

αt(i) can be computed by the following recursive computation scheme:

α0(i) = πi bsi(O0)

αt(j) =N∑i=1

αt−1(i) aij bsj(Ot); 1 ≤ t ≤ L .(4.9)

The algorithm can be visualized by a trellis structure as shown in Figure 4.3. Each noderepresents one αt(i) while edges visualize the terms of the sum in Equation 4.9. Thetrellis can be computed from left to right, from which the name “forward algorithm” isderived.

As αL(i) is the probability of the entire sequence together with the fact that thestochastic process is in state i at the end of the sequence, sequence likelihood P (o|λ)can be computed by summing over all states in the rightmost column of the trellis:

P (o|λ) =N∑i=1

αL(i) , (4.10)

which is the solution to the first problem.


Figure 4.3: A trellis to visualize the forward algorithm. Bold edges indicate the terms that haveto be summed up in order to compute αt(i)

Backward Algorithm. A backward variable βt(i) can be defined in a similar way,denoting the probability of the rest of the sequence Ot+1 . . . OL given the fact that thestochastic process is in state i at time t:

βt(i) = P (Ot+1 . . . OL|St = si, λ) . (4.11)

βt(i) can be computed in a similar recursive way by:

βL(i) = 1 (4.12)

βt(i) =N∑j=1

aij bsj(Ot+1) βt+1(j); 0 ≤ t ≤ L− 1 . (4.13)

Forward-backward algorithm. Combining both αt(j) and βt(i) leads to the estima-tion of the probability that the process is in state si at time t given an observation sequenceo. This probability is denoted by

γt(i) = P (St = si|o, λ) . (4.14)

Some computations yield:

P (St = si|O0 . . . OL, λ) = P (St = si, O0 . . . OtOt+1 . . . OL |λ)P (O0 . . . OL |λ) (4.15)

= P (St = si, O0 . . . Ot |λ) P (Ot+1 . . . OL |St = si, λ)P (O0 . . . OL |λ)

(4.16)

= αt(i) βt(i)P (O0 . . . OL |λ) (4.17)

and hence γt(i) can be computed by:

γt(i) = αt(i) βt(i)P (o |λ) = αt(i) βt(i)∑N

i=1 αt(i) βt(i). (4.18)

Viterbi algorithm. The forward-backward variable γt(i) does not yet solve the secondproblem completely, since γt(i) solves for the most probable state at one point in time but


the task is to find the most probable sequence of states. A straightforward solution wouldbe to select the most probable state at each time step t:

Smax(t) = arg maxi

γt(i) . (4.19)

However, it turns out that models exist for which some transitions from Smax(t) toSmax(t+ 1) are not possible (i.e., the transition probability aij equals zero). This is dueto the fact that α and β both combine all possible paths through the states of the DTMC—and γ is only the product of α and β.

One solution to this problem is called Viterbi algorithm. Very similar to αt(i), letδt(i) denote the probability of the most probable state sequence for the sub-sequence ofobservations O0 . . . Ot that ends in state si:

δt(i) = maxS0 ... St−1

P (O0 . . . Ot, S0 . . . St−1, St = si|λ) . (4.20)

δt(i) can be computed by a slight modification of the forward algorithm using the maxi-mum operator instead of the sum over all states:

δ0(i) = πi bsi(O0) (4.21)δt(j) = max

1≤i≤Nδt−1(i) aij bsj(Ot); 1 ≤ t ≤ L . (4.22)

In order to identify the states that contributed to the most probable sequence, each stateselected by the maximum operator has to be stored in a separate array. The sequencecan then be reconstructed by tracing backwards through the array starting from statearg maxi δL(i).

4.1.2 Training: The Baum-Welch AlgorithmIn the forward-backward algorithm, the HMM parameters λ were assumed to be fixedand known. However, in the majority of applications, λ cannot be inferred analyticallybut need to be estimated from recorded sample data. In the machine learning community,such a procedure is called training. Several algorithms exist for HMM training, of whichthe Baum-Welch algorithm is most prominent.

In terms of HMMs, the goal of training is to maximize sequence likelihood for trainingsequences. More precisely, the parameters π, A, and B have to be set such that Equa-tion 4.10 is maximized. For convenience, only a single training sequence is considered,here and the case of multiple sequences is discussed later.

The algorithm can be understood most easily by first considering a simpler case wherethe sequence of “hidden” states is known. This occurs, e.g., in part-of-speech tagging2

applications. In this case, the parameters of the HMM can be optimized by maximumlikelihood estimates:

• Initial state probabilities πi are determined by the relative frequency of sequencesstarting in state si:

πi = number of sequences starting in sitotal number of sequences

. (4.23)

2See, e.g., Manning & Schütze [175]


• Transition probabilities aij are determined by the number of times the process wentfrom state si to state sj divided by the number of times, the process left state si toanywhere:

aij = number of transitions si → sjnumber of transitions si → ?

. (4.24)

• Emission probabilities bsi(oj) are determined by the number of times the processhas generated symbol oj in state si compared to the number of times the processhas been in state si:

bi(oj) = number of times symbol oj has been emitted in state sinumber of times the process has been in state si

. (4.25)

However, in many applications the sequence of states is not known. The solutionfound by Baum and Welch introduced expectation values for the unknown quantities.The algorithm belongs to the class of Expectation-Maximization (EM) algorithms.3 Thealgorithm consists of two major steps:

1. Expectation step: Compute estimates for unknown data (state probabilities) usingthe current set of model parameters.

2. Maximization step: Adjust model parameters to maximize data likelihood using theestimates for the unknown data of the expectation step.

This scheme is repeated until sequence likelihood converges. It can be proven (see Sec-tion 6.5) that at least a local maximum is found. In the following paragraphs, both stepsare described in more detail.

Expectation-Step. Let Xt(i, j) denote the binary random variable indicating whetherthe transition taking place at time t passes from state si to sj or not. The expected value forXt(i, j) is equal to the probability that Xt(i, j) is one.4 Let ξt(i, j) denote this probability(given an observation sequence o):

ξt(i, j) = P (St = si, St+1 = sj|o, λ) . (4.26)

ξt(i, j) can be computed similarly to Equations 4.15–4.17 by interposing the transitionfrom si to sj between α and β:

ξt(i, j) =αt(i) aij bsj(Ot+1) βt+1(j)

P (o|λ) (4.27)

=αt(i) aij bsj(Ot+1) βt+1(j)∑N

i=1∑Nj=1 αt(i) aij bsj(Ot+1) βt+1(j)

. (4.28)

This approach can also be visualized in a trellis as shown in Figure 4.4.While ξt(i, j) is the expected value that a transition i → j takes place at time t, theexpected value for the total number of transitions from si to sj is

∑t

E [Xt(i, j)] =L−1∑t=0

ξt(i, j) . (4.29)

3A more detailed discussion is given along with the proof of convergence for HSMMs, see Section 6.54E [X] =

∑X P (X) = 0 ∗ P (X = 0) + 1 ∗ P (X = 1) = P (X = 1)


Figure 4.4: A trellis visualizing the computation of ξt(i, j)

Note that summing up ξt(i, j) over all destination states sj yields the probability forthe source state si at time t:

N∑j=1

ξt(i, j) = γt(i) . (4.30)

The expectation step requires knowledge of model parameters λ which are either knownfrom (random) initialization or a previous iteration of the EM algorithm.

Maximization-Step. The second step of the Baum-Welch algorithm is a maximum like-lihood optimization of parameters λ based on the expected values estimated in the firststep:

πi ≡expected number of sequences starting in state si

total number of sequences≡ γ0(i) (4.31)

aij ≡expected number of transitions si → sj

expected number of transitions si → ?≡

L−1∑t=0

ξt(i, j)

L−1∑t=0

γt(i)(4.32)

bi(k) ≡expected number of times observing ok in state si

expected number of times in state si≡

L−1∑t=0

s.t. Ot=ok

γt(i)

L−1∑t=0

γt(i).(4.33)

Notes on the Baum-Welch algorithm. Initial point for the Baum-Welch algorithm isa completely initialized HMM. This means that the number of states, number of obser-vation symbols, transition probabilities, initial probabilities and observation probabili-ties need to be defined. The algorithm then iteratively improves the model’s parametersλ = (A,B,π) until a (local) maximum in sequence likelihood is reached.5 In each

5For implementation, a maximum number of iterations is often used as an additional stopping criterion.

4.2 Sequences in Continuous Time 63

M-step, the expectation values of the previous E-step are used and vice versa. Severalproperties of the algorithm can be derived from that:

• The number of states and the size of the alphabet are not changed by the algorithm.

• The model structure is not altered during the training process: if there is no tran-sition from state si to sj (aij = 0), the Baum-Welch algorithm will never changethis.

• Initialization should exploit as much a-priori knowledge as possible. If this is notpossible, simply random initialization can be used.

Training with Multiple Sequences. The formulas presented here have only consideredone single observation sequence, although in most applications, there is a large set oftraining sequences. The main idea of multiple sequence training is that nominators anddenominators of Equations 4.31 to 4.33 are transformed into a sum over sequences ok,each scaled by 1

P (ok |λ) computed along with the E-step of the algorithm.

4.2 Sequences in Continuous TimeError events occur on a continuous time scale, but there has been no notion of time inHMMs, yet. In this section four approaches how time can be incorporated into HMMsare introduced followed by a review of approaches that have been published on that topic.

An observation sequence is assumed to be an event-driven sequence consisting ofsymbols which are element of a finite countable set. Such sequences are called temporalsequences. Sequences of length L + 1 are considered where the first symbol occurs attime t0 and the last at time tL, as shown in Figure 4.5. In order to clarify notations, let

Figure 4.5: Notations for an event-driven temporal sequence. The sequence consists of sym-bols A, B and C that occur at time t0, . . . , tL. The delay between two successivesymbols is denoted by dk.

• {o1, . . . , oM} denote the set of symbols that can potentially occur, which is{A, B, C} in the example.

• Ok denotes the symbol that has occurred at time tk and

• dk denotes the length of the time interval tk − tk−1 .


4.2.1 Four Approaches to Incorporate Continuous TimeThe notion of continuous time can be incorporated into HMMs in four ways:

1. Time can be divided into equidistant time slots

2. Delays can be represented by delay symbols

3. Events and delays can be represented by two-dimensional outputs

4. A time-varying stochastic process can be used.

The following paragraphs investigate each solution and discuss their properties.

Time slots. Time is divided into non-overlapping intervals of equal length, as shownin Figure 4.6. Due to the reason that hidden Markov models generate a symbol in each

Figure 4.6: Incorporating continuous time by division of time into slots of equal length

time step, time slots not containing any error symbols need to be “filled” by a specialobservation indicating “silence”. Performing this procedure on the temporal sequenceshown in Figure. 4.6 results in an observation sequence “AC B S S S A” where S denotesthe symbol indicating silence.

The simplest way to incorporate time-slotting into HMMs is to introduce state self-transitions: In each time step, there is some probability that the stochastic process transitsto itself and hence stays in the state (see Figure 4.7).

Figure 4.7: Duration modeling by a discrete-time HMM with self-transitions.

This approach leads to a geometric distribution for state sojourn times, since the proba-bility to stay in state si for d time-steps equals

Pi(D = d) = ad−1ii (1− aii) . (4.34)

Time-slotting has the following characteristics:

+ Standard HMMs can be used.

+ There is almost no increase in computational complexity.

4.2 Sequences in Continuous Time 65

– Time slot size is critical. If it is too small, long delays must be represented byrepetitions of the silence symbol as can be seen in the example. The geometricdelay distribution leads to poor modeling quality in most cases.6 On the other hand,if time slot size is too large, more than one event will probably occur within a timeslot. There are several solutions to this issue including the definition of additionalsymbols representing combined events, dropping of events or assignment to thenext “free” slot. However, all of these solutions have their problems. In general, ifthe length of inter-symbol intervals varies greatly, time slots cannot represent thetemporal behavior of event sequences appropriately.

– Time resolution is reduced to the size of time slots since it is no longer known whenexactly an event has occurred within the time slot. This is true especially for thecase of long time slot intervals.

For these reasons, time slotting does not appear appropriate for online failure prediction.

Delay symbols. A second approach to incorporating inter-event delays is to define a setof delay symbols representing delays of various lengths. The sequence shown in Fig. 4.6could then be represented by, e.g., “AS1C S1B S3A”. An evaluation of this approachshows that:

+ In comparison to time slotting, representation of time is improved since “chains”of silence symbols are avoided. If delays are represented on a logarithmic scale, awide range of inter-symbol delays can be handled.

+ The approach can be implemented using a completely discrete environment suchthat standard implementations of HMMs can be used.

– The structure of HMMs must be adapted. Due to the fact that events and delaysymbols alternate, there must be two distinct sets of states one of which generatesevent symbols while the other generates delay symbols. This results in increasedcomputational complexity (see Figure 4.8)

Figure 4.8: Representing time by delay symbols. States Ei generate error observation sym-bols and states Di generate delay symbols

– The internal (hidden) stochastic process does not represent the properties of thestochastic process that originally generated the observation sequence.

– Time resolution is even worse than for time slotting due to the fact that one symbolaccounts for long time intervals.

6The effect can be reduced by introducing silence sub-models, which are out of scope here


Figure 4.9: Delay representation by two-dimensional output probability distributions

Two dimensional output symbols. Hidden Markov models permit usage of multidi-mensional output symbols. Hence the temporal sequence can be represented by tuplesconsisting of the event type and the delay to the previous event. The example sequenceof Figure 4.6 would then be represented as (A, d0) (C, d1) (B, d2) (A, d3) where d0 is notrelevant. For such a representation, observation probabilities are two-dimensional: onedimension is discrete representing event symbols while the second dimension is continu-ous representing inter-event delays as shown in Figure 4.9. Output probabilities have toobey

∀i :M∑j=1

∫ ∞0

bsi(oj, dτ ) dτ = 1 . (4.35)

An assessment of the method yields:

+ Lossless representation of the temporal sequence with in principle unlimited timeresolution.

– The internal (hidden) stochastic process does not represent temporal properties ofthe stochastic process that originally generated the observation sequence. This is aproblem especially when future behavior of the stochastic process is to be predicted.

– Public implementations do —to the best of our knowledge— only exist for merelydiscrete and merely continuous outputs. Hence an implementation would requirethe development / adaption of a new toolkit.

Time-varying internal process. The fourth approach is to incorporate the temporal be-havior of the stochastic process that originally generated the observation sequence directlyinto the stochastic process of hidden state transitions. For example, a straightforward solu-tion is to replace the internal DTMC by a continuous-time Markov chain (CTMC), whichis able to handle transitions of arbitrary durations since transition probabilities are definedby exponential probability distributions P (t). Such an approach results in:

+ Lossless representation of the temporal sequence

+ The internal stochastic process can (at least in part) mimic the stochastic processthat originally generated the observation sequence

– Although various extensions to time-varying processes have been published (seenext section), to our knowledge no publicly available toolkit exists.

4.3 Related Work on Time-Varying Hidden Markov Models 67

Summary. Error event sequences are temporal sequences. Four approaches have beendescribed how continuous time can be incorporated into HMMs. From the discussionfollows that the most promising approach is to incorporate time variation directly intothe hidden stochastic process, which is the approach taken in this thesis. Since varioussolutions to an incorporation of time variation into the stochastic process exist, relatedwork with such focus is presented in the following.

4.3 Related Work on Time-VaryingHidden Markov Models

A few decades ago, application of standard discrete HMMs was the only way to get toa feasible (i.e., real-time) solution, even in application domains where temporal behavioris important. One such domain is speech recognition, where, e.g., phoneme durationsvary statistically. Since it was observed quickly that continuous time models can im-prove modeling performance significantly (see, e.g., Russell & Cook [218]), and due toincreasing available computing power, more and more time varying models have beenpublished. The development was mainly driven by the speech recognition research com-munity but time-varying models have also been applied to other domains such as web-workload modeling [284]. The following sections give an overview of the various classesof time-varying HMMs.

Continuous Time Hidden Markov ModelsIncorporating time-variance into HMMs by replacing the internal (hidden) DTMC processby a continuous time Markov chain (CTMC) has been described in Wei et al. [274]. Theresulting model is abbreviated by CT-HMM and should not be confused with continuousHMMs (CHMMs), which are discrete-time HMMs with continuous output probabilitydensities. CTMCs are determined by an initial distribution equivalent to DTMCs, but thetransition matrix A is replaced by an infinitesimal generator matrix Q. Determinationof the infinitesimal generator matrix Q follows a two-step approach: First, a transitionmatrix P (∆) and the initial distribution are estimated by Baum-Welch training from thetraining data. Then Q is obtained by Taylor expansion of the equation

Q = 1∆ ln(P ) , (4.36)

which can be derived directly from Kolmogorov’s equations (see, e.g., Cox & Miller[67]). ∆ denotes some minimal delay (a time step).

Hidden Semi-Markov ModelsModels such as CT-HMMs imply strong assumptions about the underlying stochastic pro-cess since CTMCs are based on exponential distributions, which are time-homogeneousand memoryless. A more powerful approach towards continuous-time HMMs is to sub-stitute the underlying DTMC by a semi-Markov process (SMP), which allows to use ar-bitrary probability distributions for specification of each transition’s duration.7 Resulting

7The only requirement is that it depends solely on the two states of the transition. A precise definition isgiven in Chapter 6.


models are called Hidden Semi-Markov Models (HSMMs).

Figure 4.10: Duration modeling by explicit modeling of state durations

A first approach to HSMMs is to substitute the self-transitions as in Figure 4.7 bystate durations that follow a state-specific probability distribution pi(d) as depicted inFigure 4.10. Several solutions have been developed to explicitly specify and determinepi(d) from training data along with the Baum-Welch algorithm.

Ferguson’s model. One of the first approaches to explicit state duration modeling wasproposed by Ferguson [96] in the year 19808. The idea was to use a discrete probabil-ity distribution for pi(d). While the approach was very flexible, it showed three disad-vantages: first, it is a discrete-time model requiring the definition of a time step ∆ anda maximum delay D, second, convergence of the training algorithm was insufficientlyslow, and third, much more training data was needed for training. The last two drawbacksresult from a dramatically increased number of parameters that have to be estimated fromthe training data: The number of parameters increases from N self-transitions to N ×Dduration probabilities. Mitchell et al. [182] extend the approach to transition durationsand propose a training algorithm with reduced complexity.

HSMMs with Poisson-distributed durations. In order to reduce the number of param-eters, Ferguson already proposed to use parametric distributions instead of discrete ones.So have done Russell & Moore [219], who have used Poisson distributions. A compari-son of both models showed that the Poisson-distributed model performs better in the casewhen an insufficient amount of training data available [218].

HSMMs with gamma-distributed durations. Levinson [161] provided a maximumlikelihood estimation for parameters of gamma-distributed durations. As it is the casewith most maximum likelihood procedures, optimal parameters are obtained by derivationof the likelihood function. However, this derivative cannot be computed explicitly andnumerical approximation has to be applied. Azimi et al. [18] apply HSMMs with gamma-distributed durations to signal processing but adjust duration parameters from estimatedmean and variance of durations in the training data set.

HSMMs with durations from the exponential family. Mitchell & Jamieson [183] ex-tended the spectrum of available distributions for explicit duration modeling to all distri-butions of the exponential family, which includes gamma distributions. Their work is alsofounded on a direct computation of maximum likelihood involving numerical approxima-tion of the maximum.

8A crisp overview can be found in Rabiner [210].

4.3 Related Work on Time-Varying Hidden Markov Models 69

HSMMs with Viterbi path constrained uniform distributions. Kim et al. [145]present an approach where transition durations are assumed to be uniformly distributed.Their key idea is that first parameters π, A and B are obtained by the discrete-timestandard HMM reestimation procedure as explained in Section 4.1. A subsequent stepinvolves computation of Viterbi paths for the training data in order to identify minimumand maximum durations for each transition: this defines a uniform duration distributionfor each transition.

Expanded State HMMs (ESHMMs). In parallel to the development of HSMMs withparametrized probability distributions, it has been found that Ferguson’s model can beimplemented in a much easier way by a series-parallel topology of the hidden states (Cook& Russell [65]). To be precise, each state of the HMM is replaced by a DTMC sharing thesame emission probability distribution. State durations are then expressed by transitionprobabilities of the DTMC. Figure 4.11 shows a small example for a HMM with left-to-right topology. Those models are named by Expanded State HMMs(ESHMMs).

Figure 4.11: Topology of an Expanded State HMM (ESHMM). The model represents discretestate duration probabilities pi(d) by discrete-time Markov chains. Emission prob-abilities bsi(oj) have been omitted.

The benefit of ESHMMs is that they can be implemented using standard discrete-timeHMM toolkits. Furthermore, the idea to represent state durations by state chains led toseveral variants extending Ferguson’s model. For example, the duration Markov chainmay have self-transitions that allow to model durations of arbitrary length instead of afixed maximum duration D. Some structures have been proposed by Noll & Ney [195]and Pylkkönen [206] and a comparison of two extended structures is provided by Russell& Cook [218]. More elaborate training algorithms for ESHMMs have been proposed byWang [270] and Bonafonte et al. [33].

Segmental HMMs. Segmental HMMs are used to model sequences whose behaviorchanges in epochs. It is assumed that there is some outer stochastic process determiningthe “type” of the segment. Some discrete duration is chosen specifying the length of theepoch. Once the type and the duration of the epoch are fixed, an inner stochastic processdetermines the behavior for the segment. Examples for such models can be found in Ge[102] and Russell [217].

Hidden Semi-Markov Event Sequence Model (HSMESM). In [93], Faisan et al. havepresented a hidden semi-Markov model for modeling of functional magnetic resonance


imaging (fMRI) sequences.9 The key idea with respect to temporal modeling is that dis-crete duration probabilities are stored for each transition rather than state durations. How-ever, the model is specifically targeted to fMRI.

Inhomogeneous HMMs (IHMMs). Ramesh & Wilpon [211] have developed anothervariant of HMMs, called Inhomogeneous HMM (IHMM) . Time homogeneity of stochas-tic processes refers to the property that the behavior (i.e., the probability distributions) donot change over time. In terms of Markov chains, this means that the transition proba-bilities aij are constant and not a function of time. However, the authors abandon thisassumption and define:

aij(d) = P (St+1 = j|St = i, dt(i) = d); 1 ≤ d ≤ D , (4.37)

which is the transition probability from state si to state sj given that the duration dt(i) instate si at time t equals d. In order to define a proper stochastic process, the transitionprobabilities must satisfy:

∀d ∈ {1, . . . , D} :N∑j=1

aij(d) = 1 . (4.38)

As can be seen from the formulas, Ramesh & Wilpon also assume discretized time and amaximum state duration D.

4.4 SummaryThe approach to online failure prediction taken in this thesis is to use HMMs as patternrecognition tool for error sequences, which are event-driven sequences in continuous timewith symbols from a finite countable set. Such sequences are called temporal sequences.This chapter has introduced the theory of standard HMMs and has identified four wayshow sequences in continuous time can be handled by HMMs. From this discussion fol-lowed that the most promising solution is to turn the stochastic process of hidden statetraversals into a time-varying process. Since this idea is not new, related work on previ-ous extensions has been presented.

Most domains standard hidden Markov models have been applied to are characterizedby

• equidistant / periodic occurrence of observation symbols caused by sampling. Thisdefines a minimum time step size such that all temporal aspects can be expressed ininteger multiples of the sampling interval.

• a maximum duration. For example in speech recognition, phonemes, syllables, etc.can well be assumed to have limited duration.

However, these assumptions do not hold for online failure prediction based on errorevents: Observation symbols can occur on a continuous time scale and delays betweenerrors can range from very short to very long time intervals. Therefore, none of thecontinuous-time extensions presented in this chapter seems appropriate for failure predic-tion. The extended hidden semi-Markov model proposed in this dissertation differs fromexisting solutions in the following aspects:

9More details on the model can be found in Thoraval [255]

4.4 Summary 71

1. The model operates on true continuous time instead of multiples of a minimum timestep size. This feature circumvents the problems associated with time-slotting andis advantageous if sequences show a great variability of inter-event delays as is thecase for the log data used in the case study.

2. There is no maximum duration D. The model can handle very long inter-errordelays with the same computational overhead as short delays.

3. The model allows to use a great variety of parametric transition probability distri-butions. More specifically, every parametric continuous distribution for which thedensity’s gradient with respect to the parameters can be computed are applicable.This includes well-known distributions such as Gaussian, exponential, gamma, etc.The advantage of this feature is that transition duration distributions can be adaptedto the delays occurring in the system rather than to assume some distribution a-priori. Furthermore, the model allows to use background distributions which helpsto deal with noise in the data.

4. The model allows to specify transition durations rather than state durations. Widelyused state durations are a special case where all transitions are equally distributed.This feature alleviates the Markov restriction that the process is only dependent onthe current state.

Although some of the models presented in this chapter share some of these features, theproposed model is the first to provide the combination of all four properties, which, as itwill be seen later, proved to be beneficial.

Contributions of this chapter. First, four ways to incorporate continuous time intoHMMs have been identified and discussed. Second, the chapter seems to be the firstwork to present a summary and state-of-the-art for continuous time extensions of hiddenMarkov models.

Relation to other chapters. This chapter concludes the first phase of the engineeringcycle, which has focused on a problem statement, identification of key properties andrelated work. The second phase focuses on a proper formalization of the approach thathas been sketched in Figure 2.9 on Page 19 and Figure 2.10 on Page 20, respectively.More specifically, formalization of the approach includes data preprocessing (Chapter 5),the hidden semi-Markov model (Chapter 6), and classification (Chapter 7).

Part II

Modeling

73

Chapter 5

Data Preprocessing

The overall approach to online failure prediction consists of several steps of which datapreprocessing is the first. It is applied for training, i.e., estimation of model parameters,and for online prediction. In Section 5.1 some known concepts of error-log preprocess-ing are described. A novel approach to separate failure mechanisms is introduced inSection 5.2 and a statistical method to filter noise is explained in Section 5.3. Finally,in Section 5.4, logfile formatting is discussed and a novel concept of logfile entropy isintroduced.

5.1 From Logfiles to SequencesError logfiles are a natural source of information if something goes wrong in the system,and they are frequently used both for diagnosis and online failure prediction.1 This sectiondescribes the necessary steps to get from raw error logs to temporal event sequences usedas input data for the hidden semi-Markov models.

5.1.1 From Messages to Error-IDsOne of the major handicaps with error logfiles is that they are commonly not designed forautomatic processing. Their main purpose is to convey information to human operators tosupport quick identification of problems. Hence error logs frequently do not contain anyerror-ID. Instead, they consist of error messages in natural language. This also holds forthe error logs of the telecommunication system and hence methods had to be developedto turn natural language messages into error IDs. The method described here has beendeveloped together with Steffen Tschirpke.2

The key idea of translating natural language messages into an error ID is to applya similarity measure known from text editing to yield a similarity matrix, to cluster thematrix and to assign an error ID to each cluster. However, even if dedicated log data

1See category 1.3 in the taxonomy2In fact, he was the one who implemented it and who solved all the real problems regarding this issue.

75

76 5. Data Preprocessing

such as timestamps, etc. are ignored, almost every log message is unique. This is due tonumbers and log-record specific data in messages. For example, the log message:

process 1534: end of buffer reached

will most probably occur only very rarely in an error log since it happens infrequentlythat exactly the process with number 1534 will have the same problem. For this reason,the mapping from error messages to error IDs consists of three steps:

1. All numbers, and log-record specific data such as IP addresses, etc. are replaced byplaceholders. For example, the message shown above is translated into:

process nn: end of buffer reached

2. A 100% complete replacement of all record-specific data is infeasible. Furthermore,there are even typos in the error messages themselves. Hence, dissimilarities be-tween all pairs of log messages are computed using the Levenshtein distance metric[11], which measures the number of deletions, insertions and substitutions requiredto transform one string into the other.

3. Log messages are grouped by a simple threshold on dissimilarity: All messageshaving dissimilarity below the threshold are assigned to one message ID.

The goal of the method described is to assign an ID to textual messages. The ID thenforms the so-called error type or symbol. If additional information from log messagessuch as, e.g., thread IDs, shall be used, various numbers have to be combined into a singleerror type.

5.1.2 TuplingAs Iyer & Rosetti [129] have noted, repetitive log records that occur more or less at thesame time are frequently multiple reports of the same fault. Hansen & Siewiorek analyzedthis property further and presented an illustrative figure, which is reproduced for conve-nience in Figure 5.1. Please note that terms have been adapted in order to be consistentwith other chapters. The figure depicts the process from a fault to corresponding events

Figure 5.1: A fault, once activated, can result in various misbehavior. Some misbehaviors arenot detected, some are detected several times and sometimes, several misbehav-iors are caught by one single detection. Due to, e.g., a system crash, not everyerror may occur as message in the error log [114].

in the error log. Once activated, a fault may lead to various misbehaviors in the system.There are four possibilities how such misbehavior can be detected:

5.1 From Logfiles to Sequences 77

1. unusual behavior is detected leading to one error

2. unusual behavior is not detected and hence no error occurs

3. unusual behavior is detected by several fault detectors leading to several errors

4. one fault detector detects several misbehaviors resulting in one single error

However, not each error finds its way to the error log. For example, if the fault causes thelogging process or the entire system to crash, the error cannot be written to the logfile.

In order to increase expressiveness of logfiles, Tsao & Siewiorek [258] introduceda procedure called tupling, which basically refers to grouping of error events that oc-cur within some time interval or that refer to the same location. However, equating thelocation reported in an error message with the true location of the fault only works forsystems with strong fault containment regions. Since this assumption does not hold forthe telecommunication system under consideration, spatial tupling is not considered anyfurther, here.There are two principle approaches to grouping of errors in the temporal domain:

1. after some pause, all errors that occur within a fixed interval starting from the firsterror are grouped, as proposed by Iyer et al. [131]

2. All errors showing an inter-arrival time less than a threshold ε are grouped, as pro-posed by Tsao & Siewiorek [258]3

Further considerations only refer to the second grouping method. Two problems can arisewhen tupling is applied (see Figure 5.2):

1. error messages might be combined that refer to several (unrelated) faults. Accord-ing to the paper this case is called a collision

2. If an inter-arrival time > ε occurs within the error pattern of one single fault, thispattern is divided into more than one tuple. This effect is called truncation

Both the number of collisions and truncations depend on ε. If ε is large, truncationhappens rarely and collision will occur very likely. If ε is small the effect is vice versa. Inorder to analyze the relationship, Hansen & Siewiorek [114] have derived a formula forthe probability of collision. Assuming that faults are exponentially distributed, collisionprobability can be computed by

Pc(ε) = 1− e−λF ε∑

j

pj e−λF lj

, (5.1)

where λF is the fault rate, and pj denotes the discrete distribution of tuples of length ljestimated from the logfile. However, the fault rate λF is unobservable and the authorssuggest to estimate it by the tuple rate λT . The authors have checked their results usingtwo machine years of data from a Tandem TNS II system and showed that the formulacan provide a rough estimate.

3In Tsao & Siewiorek [258], there is a second, larger threshold to add later events if they are similar, butthis is not further considered, here


Figure 5.2: For demonstration purposes, the first and second time line depict error patternsfor two faults separately. The bottom line shows what is observed in the error log.Error logs are grouped if inter-arrival time is less than ε. Each group defines atuple (shaded areas). Truncation occurs, if the inter-arrival time for one fault is> ε. However, large ε lead to collisions, if events of other faults occur earlier thanε [114].

As stated above, reducing the number of collisions by lowering ε increases the numberof truncations. However, truncation is much more complicated to identify since it ismostly difficult to tell whether some error occurring much later (> ε) belongs to the samefault or to another. Therefore, the authors suggest the following strategy: Plotting thenumber of tuples over ε yields an L-shaped curve as shown in Figure 5.3. If ε equals

Figure 5.3: Plotting the number of tuples over time window size ε yields an L-shaped curve[114].

zero, the number of tuples equals the number of error events in the logfile. While ε isincreased, the number first drops quickly. At some point, the curve flattens suddenly.Choosing ε slightly above this point seems optimal. The rational behind this procedureis the assumption that —on average— there is a small gap between the errors of differentfaults: If ε is large enough to capture all errors belonging to one fault, the number ofresulting tuples decreases slower if ε is further increased.

Current research aims at quantifying temporal and spatial tupling. For example, Fu &Xu [99] introduce a correlation measure for this purpose but since research is at an earlystage, such measure has not been applied. For the rest of this chapter, it is assumed thatboth error-ID assignment and tupling have been applied.

5.2 Clustering of Failure Sequences 79

5.1.3 Extracting SequencesThe hidden Markov models are trained using either failure or non-failure sequences, asshown in Figure 2.9 on Page 19. A failure sequence is defined as a temporal sequence oferror events preceding a failure (see Figure 5.4). Its maximum duration is determined bythe data window size ∆td, as defined in Section 2.1 (see Figure 2.4 on Page 12). The timeof failure occurrence is usually not reported in the error logs themselves but in documentssuch as operator repair reports, logs of stress generators, service trackers, etc. Non-failure

Figure 5.4: Extracting sequences from an error log. Sequences are extracted from a timewindow of duration ∆td. Sequences preceeding a failure (denoted by t) by lead-time ∆tl form failure sequences F i. Sequences occurring between failures (withsome margin ∆tm) set up non-failure sequences NF i

sequences denote sequences that have occurred between failures. In order to be relativelysure that the system is healthy and no failure is imminent, non-failure sequences must notoccur within some margin ∆tm before or after any failure. Non-failure sequences can begenerated with overlapping or non-overlapping windows or by random sampling.

5.2 Clustering of Failure SequencesA failure mechanism, as used in this thesis, denotes a principle chain of actions or con-ditions that leads to a system failure. It is assumed that various failure mechanisms existin complex computer systems such as the telecommunication system. Different failuremechanisms can show completely different behavior in the error event logs, which makesit very difficult for the learning algorithm to extract the inherent “principle” of failurebehavior in a given training data set. For this reason, a novel approach to an identifica-tion and separation of failure mechanisms has been developed. The key notion of theapproach is that failure sequences of the same underlying failure mechanism are moresimilar to each other than to failure sequences of other failure mechanisms. Groupingcan be achieved by clustering algorithms, however, the challenge is to define a similaritymeasure between any pair of error event sequences. Since there is no “natural” distancesuch as Euclidean norm for error event sequences, sequence likelihoods from small hid-den semi-Markov models are used for this purpose.4 The approach is related to Smyth

4The same hidden semi-Markov models are used as developed in the next chapter. However, since thisthesis follows the order of tasks: preprocessing → modeling → classification, details on the model arepresented in Chapter 6. For the time being, it is sufficient to remember that HSMMs are hidden Markovmodels tailored to temporal sequences.


[246] but it yields separate specialized models instead of one mixture model.

5.2.1 Obtaining the Dissimilarity MatrixSince most clustering algorithms require dissimilarities among data points as input data, adissimilarity matrix D is computed from the set of failure sequences F i. More precisely,D(i, j) denotes the dissimilarity between failure sequence F i and F j .

In order to compute D(i, j), first, a small HSMM M i is trained for each failure se-quence F i, as shown in Figure 5.5.

Figure 5.5: For each failure sequence F i, a separate HSMM M i is trained.

Second, sequence likelihood is computed for each sequence F i using each model M j .However, since sequence likelihood takes on very small numbers for longer sequences,it cannot be represented properly even by double precision floating point numbers andthe logarithm of the likelihood (log-likelihood) is used here.5 Sequence likelihood ofall sequences F i computed with all HSMMs M j defines a matrix where each element(i, j) is the logarithm of the probability that model i can generate failure sequence j:log

[P (F i|M j)

]∈ (−∞, 0]. In other words, the logarithmic sequence likelihood is close

to zero if the sequence fits the model very well and is significantly smaller if it does notreally fit. Since model M j has been adjusted to the specifics of failure sequence F j in thefirst step, P (F i|M j) expresses some sort of proximity between the two failure sequencesF i and F j . An exemplary resulting matrix of log-likelihoods is shown in Figure 5.6.

Unfortunately, the matrix is not yet a dissimilarity matrix, since first, values are ≤ 0and second, sequence likelihoods are not symmetric: P (F i|M j) 6= P (F j|M i). This issolved by taking the arithmetic mean of both likelihoods and using the absolute value.Hence D(i, j) is defined as:

D(i, j) =

∣∣∣∣∣∣log

[P (F i |M j)

]+ log

[P (F j |M i)

]2

∣∣∣∣∣∣ . (5.2)

Still, matrix D is not a proper dissimilarity matrix since a proper metric requires thatD(i, j) = 0, if F i = F j . There is no solution to this problem since from D(j, j) = 0follows that P (F j|M j) = 1. However, if M j would assign a probability of one to F j

it would assign a probability of zero to all other sequences F i 6= F j , which would beuseless for clustering. Nevertheless, D(j, j) is close to zero since it denotes log-sequence

5In fact, many HMM implementations only return the log-likelihood

5.2 Clustering of Failure Sequences 81

Figure 5.6: Matrix of logarithmic sequence likelihoods. Each element (i, j) in the matrix islogarithmic sequence likelihood log

[P (F i|M j)

]for sequence F i and model

M j .

likelihood for the sequence, model M j has been trained with. For this reason, matrix Dis used as defined above.

Regarding the topology of models M i, the purpose of each model is to get a roughnotion of proximity between failure sequences. In contrast to the models used for failureprediction (c.f., Section 6.6, the purpose is not to clearly identify sequences that are verysimilar to the training data set and to judge other sequences as “completely different”.Therefore, models M i have only a few states and have the structure of a clique, whichmeans that there is a transition from every state to every other state.6 In order to furtheravoid too specific models, so-called background distributions are applied (c.f., Page 112).The effects of the number of states and background distributions are further investigatedalong with the case study.

5.2.2 Grouping Failure SequencesIn order to group similar failure sequences, a clustering algorithm is applied. Two groupsof clustering algorithms exist (c.f., Kaufman & Rousseeuw [142]): Partitioning tech-niques divide the data into u different clusters (partitions), and u is a fixed number thatneeds to be specified in advance. Hierarchical clustering approaches do not rely on aprespecification of the number of clusters. They either divide the data into more andmore sub groups (divisive approach), or start with each data point as separate cluster andrepetitively merge smaller clusters into bigger ones (agglomerative approach). In general,partitioning approaches yield better results for a single u, while hierarchical algorithmsare much quicker than repetitively partitioning for different values of u. Due to the factthat u cannot be determined upfront, hierarchical clustering is used for the grouping offailure sequences.

The output of hierarchical clustering algorithms is a grouping function gF (u) that

6These models are also called ergodic


partitions the set of failure sequences F = {F i} into u groups:

gF (u) ={Gl

}; 1 ≤ l ≤ u; ∀ l : Gl ⊂ F ;

⋃l

Gl = F,⋂l

Gl = ∅ , (5.3)

where Gl denotes the set of failure sequences that belong to group l.

5.2.3 Determining the Number of GroupsHierarchical clustering yields the function gF (u), determining for each number of groupsu which sequences belong to which group. Therefore, the number of groups u needs tobe determined in order to separate the failure sequences in training data. In principle, ushould be as small as possible, since a separate model needs to be trained for each group,which affects computation time both for training and online prediction. Moreover, themore groups the less failure sequences remain in the training data set of each group whichresults in worse models. On the other hand, if u is too small, there is no clear separation offailure mechanisms and the resulting failure prediction models have difficulties to learnthe structure of failure sequences. Several ways have been proposed to determine thenumber of groups u:

• Visual inspection is a very robust technique if data is presented adequately. Bannerplots (see Section 8.1.2) have shown to be an adequate representation for this pur-pose. However, visual inspection works only if the number of failure sequences isnot too large.

• Evaluation of inter-cluster distances. Such approaches investigate the distance levelat which clusters are merged or divided. The basic idea is that if there is a large gapin cluster distance (one that deviates significantly from the others) some fundamen-tal difference must be present in the data. Such approaches are sometimes calledstopping rules (see, e.g., Mojena [185], Lance & Williams [154], Salvador & Chan[228]).

• Elbow criterion. The percentage of variance explained7 is plotted for each numberof groups. The point at which adding a new cluster does not add sufficient infor-mation can be observed by an elbow in the plot (see, e.g., Aldenderfer & Blashfield[5])

• Bayesian framework. Using Bayes’ theorem, maximum probability for the numberof groups given the data arg maxu P (u|D) can be computed from the probability ofdata given the number of groups P (D|u). However, this requires to try all values ofu, ranging from one to the number of sequences F , and each trial requires to trainu HSMMs. Hence, F (F−1)

2 = F 2+F2 Baum-Welch training procedures would have

to be performed which is not feasible in reasonable time.

Due to the fact that the number of failure sequences in the case study are still manageable,and visual inspection is a very simple but robust technique, it is the method of choice inthis thesis.

7This is the ratio of within-group variance to total variance

5.3 Filtering the Noise 83

Figure 5.7: Inter-cluster distance rules: (a) nearest neighbor, (b) Furthest neighbor, and (c)unweighted pair-group average method

5.2.4 Additional Notes on ClusteringMatrix D defines some sort of distance between single failure sequences. However, forclustering some measure is needed to evaluate the distance between clusters, which canhave a decisive impact on the result of clustering. The three predominant techniques foragglomerative clustering are (see Figure 5.7):

• Nearest neighbor. The shortest connection between two clusters is considered. Thisapproach tends to yield elongated clusters due to the so-called chaining effect: Iftwo clusters get close only in point, the two clusters are merged. For this reason,the nearest neighbor rule is also called single linkage rule.

• Furthest neighbor. The maximum distance of any two points in two clusters areconsidered. This approach tends to yield compact clusters that are not necessarilywell separated. This rule is also called complete linkage rule.

• Unweighted pair-group average method (UPGMA). The distance of two clusters iscomputed by the average of distances from all points of one group to all points ofthe other. This approach results in ball-shaped clusters that are in most cases wellseparated.

In addition to these inter-cluster distances measures, Ward’s method generates clusters byminimizing the squared Euclidean distance to the center mean.

Each method has its advantages and disadvantages and it is difficult to determine up-front, which is best-suited for a given data set. Therefore, for all methods have beenapplied to data of the case study (c.f., Section 9.2.5). Despite of failure sequence group-ing for data preprocessing, the clustering method presented here can possibly be used toenhance diagnosis, as is discussed in the outlook.

5.3 Filtering the NoiseThe objective of the previous clustering step was to group failure sequences that are tracesof the same failure mechanism. Hence it can be expected that failure sequences of onegroup are more or less similar. However, experiments have shown that this is not thecase. The reason for this is that error logfiles contain noise, which results mainly fromparallelism within the system (see Section 2.3). Therefore, some filtering is necessary toeliminate noise and to mine the events in the sequences that make up the true pattern.

The filtering applied in this thesis is based on the notion that at certain times withinfailure sequences of the same failure mechanism, indicative events occur more frequently


Figure 5.8: After grouping similar failure sequences by means of clustering, filtering is appliedto each group in order to remove noise from the training data set. For failure groupu the blow-up shows that sequences are aligned at the time of failure occurrence(t). For each time window (vertical shaded bars) each error symbol (A,B,C)is checked whether it occurs significantly more frequent than expected. Thosesymbols that do not pass the filter (crossed-out symbols) are removed from thetraining sequence

than within all other sequences. The precise definition of “more frequently” is based onthe χ2 test of goodness of fit.

The filtering process is depicted in the blow-up of Figure 5.8 and performs the follow-ing steps:

1. Prior probabilities are estimated for each symbol. Priors express the “general” prob-ability that a given symbol occurs.

2. All sequences of one group (which are similar and are expected to represent onefailure mechanism), are aligned such that the failure occurs at time t = 0. In thefigure, sequences F 1, F 2, and F 4 are aligned and the dashed line indicates time offailure occurrence.

3. Time windows are defined that reach backwards in time. The length of the timewindow is fixed and time windows may overlap. Time windows are indicated byshaded vertical bars in the figure.

4. The test is performed for each time window separately, taking into account all errorevents that have occurred within the time window in all failure sequences of thegroup.

5. Only error events that occur significantly more frequently in the time window thantheir prior probability stay in the training sequences. All other error events withinthe time window are removed, since these are assumed to be noise. In the figure,removed error events are crossed out.

6. Filtering rules are stored for each time window specifying error symbols that passthe filter. The filter rules are used for online failure prediction where new sequences

5.3 Filtering the Noise 85

have to be processed in order to classify the current state of the system as failure-prone or not. Each incoming error sequence is filtered before sequence likelihoodis computed. Each failure group has a separate set of filter rules and no filtering isapplied for the non-failure sequence model. That is why there is a group-specificpart of the preprocessing block in Figure 2.10 on Page 20.

In order to formalize the test, let p0i denote the estimated prior probability of error event

type (symbol) i, which is the null hypothesis. The set of failure sequences under consid-eration is obtained from clustering. Assume the l-th group is to be filtered, then the set offiltering sequences Gl consisting of sequences Gj

l is defined by:

Gl ={Gjl

}=[gF (u)

]l, (5.4)

where gF (u) is defined by Equation 5.3. Let S denote the set of symbols that occur in allfailure sequences in Gl within the time window (t−∆t, t]:

S =⋃j

{s ∈ Gj

l | s occurs within (t−∆t, t]}. (5.5)

Each symbol si ∈ S is checked for significant deviation from the prior p0i by a test variable

known from χ-grams, which are a non-squared version of the testing variable of the χ2

goodness of fit test (see, e.g., Schlittgen [230]). The testing variable Xi is defined as thenon-squared standardized difference:

Xi = ni − n p0i√

n p0i

, (5.6)

where ni denotes the number of occurrences of symbol si and n is the total number ofsymbols in the time window. Disregarding estimation effects, properties of the testingvariable Xi can be assessed by assuming that ni is binomially distributed, so that from thePoisson approximation follows8 for expectation value and variance:

E [ni] ≈ n p0i (5.7)

V [ni] ≈ n p0i , (5.8)

where E [] denotes expectation value and V [] variance. Hence,

E [Xi] = Eni − n p0

i√n p0

i

≈ 0 (5.9)

V [Xi] = V

ni − n p0i√

n p0i

≈ 1 . (5.10)

From this analysis follows that all Xi are standardized and can be compared to a thresh-old c: Filtering eliminates all symbols si from S within time window (t−∆t, t], for whichXi < c. Hence, the set of remaining symbols for the time window is:

S ′ = {si ∈ S | Xi ≥ c} . (5.11)

8p0i can be assumed to be rather small


Figure 5.9: Three different sequence sets can be used to compute symbol prior probabilities:the set of all training sequences, the set of failure training sequences, and the setof failure training sequences belonging to the same group (indicated by Gi). Inreality, grouped sequence sets Gi cover (i.e. partition) the set of failure trainingsequences.

The set of filtered training sequencesG′l = {Gj′

l } is finally obtained by removing all sym-bols from each sequence that do not occur in any of the filtered symbol sets S ′ coveringthe time at which the symbol occurs in the sequence. G′l is then used to train the model forthe l-th failure mechanism / group (see Section 6.6). For online prediction, the sequenceunder investigation is filtered the same way before sequence likelihood is computed.

Three variants regarding the computation of priors pi0 have been investigated in thisthesis (see Figure 5.9):

1. p0i are estimated from all training sequences (failure and non-failure). Xi compares

the frequency of occurrence of symbol si to the frequency of occurrence within theentire training data.

2. p0i are estimated from all failure sequences (irrespective of the groups obtained from

clustering). Xi compares the frequency of occurrence of symbol si to all failuresequences (irrespective of the group).

3. p0i are estimated separately for each group of failure sequences from all errors within

the group. For each symbol si the testing variable Xi compares the occurrencewithin one time window to the entire group of failure sequences.

All variants have been applied to the data of the case study. An analysis is provided inSection 9.2.6.

5.4 Improving LogfilesLife could have been easier if logfiles would have been written in a format that is suitedfor automatic processing. From the experience of working with logfiles, a paper has beenwritten (Salfner et al. [227]) discussing several issues relevant for logging. The majorconcepts are described in shorter form, here. At the end of the section, a comparison tothe Common Base Event format is added, which is not included in the paper.

5.4.1 Event Type and Event SourceWhile data like timestamps and process identifiers are given more or less explicitly inlogfiles, the logged event itself is in most cases represented in natural language. Analyzing

5.4 Improving Logfiles 87

messages like “Could not get connection to service X” reveals that the textual descriptionmerges two different pieces of information, that should be represented distinguishably:

• what has happened: some connection could not be established. This information iscalled the event type

• what resource the problem arose with, which is “service X” in the example. Thisinformation is called the source the event is associated with. Note that the source isin general not identical to the detector issuing the error report.

Event type and source are related to orthogonal defect classification (Chillarege et al.[59]) where the type correlates with the defect type and the source with the defect trigger.However, since in our scheme error events and not the root cause of defects are considered,the event type must not necessarily coincide with the defect type. The source is only asuspected trigger entity, while the defect trigger, as defined by Chillarege et al., describesthe entire state in which the defect occurred.

Of course a natural language sentence is able to carry more information than only“event type” and “source”. To prevent the additional information from being lost, it shouldbe spelled out by additional fields in the log record.

5.4.2 Hierarchical NumberingIn Section 5.1.1 a method has been described to map natural language error messages toevent IDs. This step could have been avoided if message IDs would have been writtendirectly into the log. Furthermore, if error message IDs are chosen in a systematic way,such approach can be superior to natural language error messages, as can be shown forthe numbering scheme described in the following.

The numbering scheme is based on a hierarchical classification of errors, representedby a tree. The topmost classification is based on the SHIP fault model (c.f. Section 2.5).The software subtree has been further developed introducing 62 categories, of which anexcerpt is shown in Figure 5.10. Error message identifiers are simply constructed of

Figure 5.10: Hierarchical numbering scheme for error event types with the SHIP model. Theexample only shows a sub-classification for software errors

the labels along the path from the root to the leaf node separated by a dot. This num-bering scheme originates from Dewey [51] and has become popular, e.g., with LDAP


(Lightweight Directory Access Protocol)[269]. In cases where an error matches severalleaves of the tree, all possible identifiers should be written into the log. On the other hand,if an event cannot be resolved down to a leaf category, the most-detailed identifiable cat-egorization should be used. Furthermore, the error classification scheme can be extendedeasily.

In comparison with freely chosen error IDs, as they occur with methods such as theone presented in Section 5.1.1, the numbering scheme provides two advantages:

1. It provides an ordering that can be exploited to derive a notion of similarity betweenerror event types

2. It provides means to present error data with multiple levels of detail.

A distance metric. The numbering scheme gives rise to a measure of similarity thatcould be used, e.g., in clustering algorithms to group error messages. For example, failureprediction algorithms could benefit from a notion of error proximity, or clusters can beanalyzed in order to diagnose an apparent problem. The distance metric proposed here isdefined as follows:

d(id1, id2) := length of path between id1 and id2. (5.12)

which has properties:

d(id1, id2) = 0 ⇔ id1 = id2 (5.13)d(id1, id2) = d(id2, id1) (5.14)d(id1, id3) ≤ d(id1, id2) + d(id2, id3) (5.15)

from which follows that d(id1, id2) is a proper metric. It can be efficiently computed bysimply comparing the individual parts of id1 and id2 from left to right and calculatingd(id1, id2) directly from the position in which the two identifiers differ.9

Due to the lack of system knowledge, it has not been possible to apply hierarchicalnumbering to the data of the telecommunication system, and hence the distance metrichas not applied to industrial data. Therefore, one potential conceptual problem knownfrom decision trees could not be investigated using real data: the proposed metric canassign a large distance to objects that are closely related in reality but reside in differentsubtrees (see Figure 5.11)

Figure 5.11: An inherent problem of hard classification approaches such as decision trees:the two highlighted points are assigned a long distance (thick lines), althoughthey are close in reality

9This algorithm does not even require knowledge of the error classification tree


Multiple levels of detail. The proposed numbering scheme supports views of diversegranularity on the data, which enables to present the log data at multiple levels of detail.For example, a failure prediction tool will need more fine-grained information than anadministrator who is only observing whether the system is running well. Presentation atvarious levels of granularity can simply be achieved by truncating the error numbers.

5.4.3 Logfile EntropyGaining experience from working with logfiles of various programs, one gets a notionof what makes a good logfile. In order to assess the quality of logfiles quantitatively, ametric has been developed. Due to its affinity to Shannon’s definition [235] it is calledinformation entropy of logfiles.

Starting from Shannon’s work, information entropy is defined as:

H(Xi) = log2

(1

P (Xi)

), (5.16)

where Xi is a symbol of a signal source, and P (Xi) is the probability that Xi occurs. Interms of error logs, Xi corresponds to the type of a log record. If P (Xi) = 1, the logfilewill consist only of messages of type Xi. According to Shannon, as the occurrence ofsuch a log record is fully predictable, it does not convey any new information and theentropy is zero.

However, the frequency of occurrence is only one part of what makes a good logrecord. A metric must also comprise the information that is given within the record. Tomeasure this, log records are taken to be a set where the elements relate to pieces ofinformation such as timestamp, process ID, etc. Let Ri be the set of information requiredto fully describe the event that log record Xi is reporting on, and let Gi denote the set ofinformation that is actually given within log record Xi. As can be seen from Figure 5.12,

Figure 5.12: Sets of required information (Ri) and given information (Gi) of a log record

the intersectionRi∩Gi is the required information that is actually present in the log recordand (Ri ∪Gi) \ (Ri ∩Gi) is the set of information that is either missing or irrelevant. Thebigger the intersection, and the smaller the rest, the better the log record. This is expressedby the integrity function I(Xi), where the notation ](·) denotes cardinality of a set:

I(Xi) = ](Ri ∩Gi)](Ri ∪Gi)

− ]((Ri ∪Gi) \ (Ri ∩Gi))](Ri ∪Gi)

. (5.17)

The first term is a Jaccard score [106] for the similarity between given and required infor-mation and the second evaluates the amount of missing and irrelevant information. To see


how integrity is measured by I(Xi), consider the following two extreme cases: If a logrecord contains exactly the information that is required and nothing more, Ri ∩Gi equalsRi ∪ Gi. Hence I(Xi) equals one. If a log record contains none of the required infor-mation (all given information is irrelevant), Ri ∩ Gi = ∅ and the result is -1. Therefore,I(Xi) can take any values of the range [−1, 1].

Not only the fraction of given and required information but also the absolute numberof statements contained in a log message has impact on information density. The numberof reasonable statements in the record is

S(Xi) = ](Ri ∩Gi) . (5.18)

Combined with a linear transformation of I(Xi) to the range [0, 1], the quality Q(Xi)of a log record is measured by

Q(Xi) = S(Xi)I(Xi) + 1

2 . (5.19)

Finally, entropy for one log record is the product of the quality Q(Xi) and Shannon’slogarithmic quantity measure given by Equation 5.16:

H(Xi) = Q(Xi) log2

(1

P (Xi)

). (5.20)

In order to compute entropy of entire logfiles, the expected value over all log records iscomputed analogously to Shannon:

H(X) =m∑i=1

P (Xi)H(Xi) . (5.21)

Properties of logfile entropy. Q(Xi) contributes to H(Xi) on linear scale, while thequantity function has a logarithmic scale. Not surprisingly, for a high qualityQ(Xi) whichoccurs with very small probability P , the entropy takes on very high values (see Fig-ure 5.13). In order to compute the maximum entropy, integrity I(Xi) = 1 and P (Xi) = 1

m

is assumed for all m log records. Then, maximum entropy is [103]:

Hmax(X) = ]R log2m , (5.22)

where ]R denotes mean number of required statements per log record.The set Ri is defined to contain all the information needed to comprehensively de-

scribe the event that caused writing of the log record, but nothing more. Analyzing Ri foreach error type is a laborious task. In Salfner et al. [227] an example is provided and it isshown how an intuitively better logfile results in an increased entropy.

5.4.4 Existing SolutionsWhen IBM started its autonomic computing initiative, it has been found out quicklythat an automatic processing of logfiles is crucial —Bridgewater [37] called them “anervous system for computer infrastructures”. Against the background of multi-vendorcommercial-off-the-shelf systems, a log standard had to be developed. Together withother companies such as HP, Oracle and SAP, the Common Base Event [196], which is


0.0

0.2

0.4

0.6

0.8

1.0

probability P

0.2

0.4

0.6

0.8

1.0

0

1

2

3

entr

opy

H

norm

aliz

ed q

ualit

y#R

G

Q

Figure 5.13: A surface plot of the entropy H(Xi). Quality Q has been normalized to therange [0, 1]

part of the “Web Services Distributed Management” standard (WSDM 1.0), has beendeveloped to enable standardized logging.

A Common Base Event (CBE) is a specification of one event, which has been calledlog record, so far. A CBE consists of three major parts:

1. The component reporting a particular situation

2. The component affected by the situation

3. The situation itself

Each of the three parts is further specified by several, fixed attributes. The specificationcontains an UML description of the CBE format, of which Figure 5.14 is a simplifiedversion to visualize the concept.

Common Base Event

creationTime

severity

message

...

Situation

categoryName

StartSituation ConfigureSituation DependencySituation ConnectSituation

ComponentIdentification

application

componentID

componentType

processID

...

reporter

source

...

...

Figure 5.14: Principle structure of a Common Base Event, depicted as an UML model


Evaluating CBE with respect to the issues raised in the previous sections, the follow-ing conclusions can be drawn:

• CBE also separates event type and source: “Situation” specifies the event typeand the “source ComponentIdentification” contains a specification of the failed re-source.

• The “reporter ComponentIdentification” corresponds to parts of the log records thathave not been addressed in the previous sections. For example, in application log-files, the reporter is in most cases the application that wrote the log.

• Instead of a hierarchical numbering scheme specifying the event, eleven “situation-Names” have been defined. Valid situation identifiers include “START”, “CON-NECT”, or “CONFIGURE”. For this reason, the proposed distance metric cannotbe applied to CBE.

Error logs of Sun Microsystem’s Solaris 10 operating system, which has been releasedin 2004, also separates event type and event source. Solaris 10 error reportings supportthree levels of detail by structuring each report into an outline, error details and error IDdetails. However, instead of a hierarchical numbering scheme, unique event-IDs are usedthat can be looked up on the Internet in order to obtain further details.

5.5 SummaryThis chapter has covered the steps that are applied to yield a set of training sequencesfrom error logfiles. Specifically, this process involves:

• Mapping natural language error messages to event IDs using the Levenshtein editdistance.

• Removal of repetitive reportings of the same cause by means of tupling.

• Extraction of sequences from the filtered logfiles.

• Grouping of failure sequences that belong to the same failure mechanism. Toachieve this, a hierarchical clustering based on a dissimilarity matrix computed byusing small HSMMs is applied.

• Filtering the noise that is present in the data by means of a statistical test related tothe χ2 goodness of fit test.

The last section of the chapter addressed the topic of logfiles in general. It has beenproposed that event type and source should be separated and a hierarchical numberingscheme should be applied to assign IDs to error events. The numbering scheme allows todefine a distance metric and to present logfiles at various levels of detail. A measure forthe quality of logfiles has been developed. The measure is based on Shannon’s definitionof information entropy and is hence called “logfile entropy”. Finally, these principles of“educated logging” have been compared to an existing logging standard from autonomiccomputing, the Common Base Event.

5.5 Summary 93

Contributions of this chapter. This chapter has introduced

• a novel approach to identify failure mechanisms in the system by means of failuresequence clustering. This may also be helpful for failure diagnosis.

• a novel approach to noise reduction in failure sequences by means of the χ2 relatedstatistical test

• a novel way to represent error events using hierarchical numbering, which givesrise to a definition of a distance between error event IDs

• a novel way to assess the quality of logfiles by means of an entropy measure forlogfiles.

Relation to other chapters. Having covered data preprocessing in this chapter, the ex-tended hidden Markov model, which is the heart of the failure prediction approach pre-sented in this dissertation, is described in the next chapter.

Chapter 6

The Model

This chapter describes the essence of the proposed approach to failure prediction: Thehidden semi-Markov model that is used for pattern recognition. In Section 6.1, the modelis defined. Subsequently, in Section 6.2 it is described how the model is used to processtemporal sequences, followed by a delineation of the training procedure in Section 6.3.A proof of convergence for the training procedure is given in Section 6.5 and model-ing issues that are specific to failure prediction are discussed in Section 6.6. Finally, inSection 6.7, computational complexity is analyzed.

6.1 The Hidden Semi-Markov ModelSimilar to the way standard hidden Markov models (HMMs) are an extension of dis-crete time Markov chains (DTMCs), hidden semi-Markov models (HSMMs) extend semi-Markov processes (SMPs). For this reason, SMPs are defined first followed by their ex-tension to HSMMs.

6.1.1 Wrap-up of Semi-Markov ProcessesSMPs are continuous-time stochastic processes that allow to specify probability distribu-tions for the duration of transitions from one state to the next. Several definitions exist,which all lead to the same properties. In this dissertation, the approach of Kulkarni [149]is adopted. Semi-Markov processes are a continuous-time extension of Markov renewalsequences, which are defined as follows:A sequence of bivariate random variables {(Yn, Tn)} is called a Markov renewal sequenceif

1. T0 = 0, Tn+1 ≥ Tn; Yn ∈ S , and (6.1)

2. P (Yn+1 = j, Tn+1 − Tn ≤ t|Yn = i, Tn, . . . , Y0, T0) = P (Y1 = j, T1 ≤ t|Y0 = i) (6.2)

∀n ≥ 0 .

Here, S denotes the set of states, and the random variables Yn and Tn denote the stateand time of the n-th element in the Markov renewal sequence. Note that Tn refers topoints in time on a continuous time scale and t is the length of the interval between Tn

95

96 6. The Model

and Tn+1. Similarly to Equation 4.3 on Page 56, Equation 6.2 expresses that Markovrenewal sequences are memoryless and time-homogeneous: As the transition probabilitiesonly depend on the immediate predecessor, it has no memory of previous states, andsince transition probabilities at time n are equal to the probabilities at time 0, the processis time-homogeneous.

Let gij(t) denote the conditional probability that state sj follows si after time t asdefined by Equation 6.2. Then the matrix G(t) := [gij(t)] is called the kernel of theMarkov renewal sequence. Note that gij(t) has all properties of a cumulative probabilitydistribution except that the limiting probability pij must be equal to or less than one:

pij := limt→∞

gij(t) = P (Y1 = j|Y0 = i) ≤ 1 . (6.3)

Even if Markov renewal sequences are defined on a continuous time scale, they form adiscrete sequence of points. If the gaps between the points of a Markov renewal sequenceare “filled”, a Semi-Markov process (SMP) is obtained. More formally:

A continuous-time stochastic process {X(t), t ≥ 0} with countable state space S is saidto be a semi-Markov process if

1. it has piecewise constant, right continuous sample paths, and

2. {(Yn, Tn), n ≥ 0} is a Markov renewal sequence, where Tn is the n-th jump epochand Yn = X(Tn+) .

Yn = X(Tn+) denotes that the state X of the SMP is defined by the state Yn of theMarkov renewal sequence at any time t. The notation Tn+ indicates that the sample pathis right continuous, and n is determined such that it is the largest index for which Tn ≤ t(see Figure 6.1).

Figure 6.1: A semi-Markov process X(t) defined by a Markov renewal sequence {(Yn, Tn)}

An SMP is called regular, if it only performs a finite number of transitions in a finiteamount of time. As only regular SMPs are considered in this thesis, the term “regular”will be omitted from now on.

As can be seen from Equation 6.3, the limiting probabilities pij “eliminate” temporalbehavior. Hence, they define a DTMC that is said to be embedded in the SMP. From thisanalogy it is clear that the following property holds for each transient state si:

∀i :N∑j=1

pij = 1 , (6.4)

6.1 The Hidden Semi-Markov Model 97

expressing the fact that it is sure that the SMP leaves state si if time t approaches infinity.In addition to the notion of the embedded DTMC, the limiting probabilities pij can be

used to define a quantity that helps to understand the way how SMPs operate. Let dij(t)denote a probability distribution for the duration of a transition from state si to state sj:

dij(t) = P (T1 ≤ t | Y0 = i, Y1 = j) . (6.5)

Using the limiting probabilities pij , durations dij(t) can be computed from gij(t) the fol-lowing way:

dij(t) =

gij(t)pij

if pij > 01 if pij = 0 ,

(6.6)

and therefore, gij(t) can be split into a transition probability and a transition durationdistribution:

gij(t) = pij dij(t) , (6.7)

which leads to an intuitive description of the behavior of SMPs: Assume that at time 0the system enters state i. Then, it chooses the next state to be j according to probabilitypij . Having decided upon the next state to be j, it stays in state i for a random amount oftime sampled from distribution dij(t) before it enters state j. Once the SMP enters state jit looses all memory of the history and behaves as before, starting from state j. Note thatthe theory of SMPs allows pii 6= 0, i.e., the SMP may return to state i immediately afterleaving it. However, for simplicity reasons, it will be assumed from now on that pii = 0.This description of SMPs also shows why they are called semi-Markov processes: thechoice of the successor state is a Markov process, but the duration probability is dependingboth on the current as well as on the successor state and is therefore non-Markovian.Hence the name semi-Markov.

Finally, it should be noted that SMPs are fully specified by two quantities:

1. the initial distribution π = [πi] = [P (X(0) = i)]

2. the kernel G(t) of the underlying Markov renewal sequence. From Equation 6.7follows that G(t) can alternatively be specified by P = [pij], which is a transi-tion matrix for the embedded DTMC, and D(t) = [dij(t)] defining the probabilitydistributions for the duration of each transition from si to sj .

Be aware that Equation 6.7 only holds for each gij(t) separately and hence matrices PandD(t) can only be multiplied element-wise.

6.1.2 Combining Semi-Markov Processes with Hidden Markov Mod-els

HSMMs extend SMPs in the same way that HMMs extend DTMCs. Hence, once thestochastic process of state traversals enters a state si, an observation oj is produced ac-cording to the probability distribution bsi(oj). Due to the fact that error event-based failureprediction evaluates temporal sequences with discrete symbols, only discrete distributionsbsi(oj) are considered, here. Nevertheless, the approach could be extended easily to con-tinuous, multimodal outputs.1 An example is shown in Figure 6.2.

1See, e.g., Liporace [168], Juang et al. [137], Rabiner [210] for a summary how it is done for discrete-timeHMMs

98 6. The Model

Figure 6.2: Similar to the HMM shown in Figure 4.2, a HSMM consists of a semi-Markov pro-cess of (hidden) state traversals defined by gij(t) and output probabilities bsi(oj)

According to Equation 6.7, gij(t) is the product of limiting probabilities pij and du-rations dij(t). Durations dij(t) can in general be arbitrary time-continuous cumulativedistributions, which even need not to be differentiable. For example, dij(t) can be a piece-wise constant non-decreasing function. In this thesis, however, a convex combination ofparametrized probability distributions is assumed:

dij(t) =R∑r=0

wij,r κij,r(t|θij,r) (6.8)

s.t.R∑r=0

wij,r = 1 , wij,r ≥ 0 . (6.9)

Each duration distribution dij(t) is a sum of R cumulative probability distributionsκij, r(t|θij,r) with a specific set of parameters θij,r, weighted by wij,r. The weights sumup to one so that a proper probability distribution is obtained. The single distributionsκij, r are called kernels. For example, if κij, r is a Gaussian kernel, parameters θij,r con-sists of mean µij,r and variance σ2

ij,r. Additionally as stated above, it is assumed, thatpii = 0, expressing the fact that there are no self-transitions in the model. In the literature,such convex combination is sometimes termed a mixture of probability distributions, eventhough the term is mathematically less precise.

In summary, an HSMM is completely defined by

• The set of states S = {s1, . . . , sN}

• The set of observation symbols O = {o1, . . . , oM}

• The N -dimensional initial state probability vector π

• The N ×M matrix of emission probabilitiesB

• The N ×N matrix of limiting transition probabilities P

• The N ×N matrix of cumulative transition duration distribution functionsD(t)For better readability of formulas, let λ = {π, B, P , D(t)} denote the set of parameters.Taking Equation 6.7 into account, sometimes also the notation λ = {π, B, G(t)} is used.S and O are not included since O is determined by the application and S is not alteredby the training procedure, as is explained in Section 6.3.

6.2 Sequence Processing 99

6.2 Sequence ProcessingIn machine learning, usually a training procedure is applied first in order to adjust modelparameters to a training data set, and after that the resulting model is applied. For failureprediction with HSMMs, this translates into determining model parameters λ and then toprocess error sequences observed during system runtime. Nevertheless description of thetwo steps is reversed here for simplicity reasons since sequence processing is better suitedto explain the hidden semi-Markov model. Training is then covered in the next section.

6.2.1 Recognition of Temporal Sequences: The Forward AlgorithmOnline failure prediction with HSMMs consists of the three stages preprocessing, se-quence recognition resulting in sequence likelihood, and subsequent classification (c.f.Figure 2.10 on Page 20). This section covers the second stage and provides the algorithmto compute sequence likelihood from a given observation sequence, which is the forwardalgorithm.

Figure 6.3 illustrates some notations that are used throughout this chapter. The nota-tion Oi = ok is used to describe observation sequences, expressing that the i− th symbolin a sequence is symbol ok ∈ O . The notation is adopted from literature on randomvariables such as Cox & Miller [67], where capital letters denote the variables and smallletters the realization of the variable. In the figure, ok is either “A”, or “B”. The eventsoccur at times t0 to t2. However, if relative distances between events are relevant, time isrepresented by delays di = ti − ti−1. The sequence of hidden states that are traversed togenerate the observations is denoted as a sequence of random variables Si = sj , wheresj ∈ S .

Figure 6.3: Notations used for temporal sequences in this chapter. Capital letters denoterandom variables, small letters realizations (actual values). [Oi] denotes the se-quence of observation symbols and [Si] the sequence of hidden states. Time isexpressed as delay di between observations at time ti and ti−1.

The forward algorithm for HSMMs is derived from the discrete-time equivalent asdefined by Equations 4.9 on Page 58. The fact that sequences in continuous time areconsidered leads to a change in time indexing: instead of t denoting an equidistant timestep, tk denotes the time when the k-th symbol has occurred.

As can be seen from comparing Figure 6.2 with Figure 4.2 on Page 57, transition prob-abilities aij are replaced by gij(t) in HSMMs. However a strict one-to-one replacement isnot sufficient, as can be seen from the following considerations:

100 6. The Model

1. Assume that at time tk−1 the stochastic process has just entered state si and hasemitted observation symbol ol : Sk−1 = si, Ok−1 = ol.

2. Assume that there is a state transition when the next observation occurs. Hence, theduration of the transition is dk := tk − tk−1.

3. Knowing dk, transition probabilities to successor states sh can be computed bygih (dk). Assume that the successor state is sj : Sk = sj .

4. The subsequent symbol Ok = om is then emitted by state sj with probabilitybsj(om).

5. However, the inequalityN∑h=1

gih(dk) ≤ 1 (6.10)

holds and equality is only reached for dk → ∞ (c.f., Equations 6.3 and 6.4 andkeeping in mind that gii (t) ≡ 0). Hence, for dk < ∞, the sum is less than one,which means that some fraction of the probability mass is not distributed amongsuccessor states. The explanation for this is as follows: there is a non-zero proba-bility that the stochastic process still resides in state si when time dk has elapsed.In this case, state si generates symbol om, and the probability for this is

1−N∑h=1

gih(dk) . (6.11)

6. Applying the Markov assumptions, the stochastic process looses all memory andconsiderations for the next observation start from 1.

In order to formalize these considerations, a probability vij(dk) is defined as follows:

vij(dk) = P (Sk = sj, dk = tk − tk−1 | Sk−1 = si) (6.12)

=

gij(dk) if j 6= i

1−N∑h=1h6=i

gih(dk) if j = i (6.13)

with the property that

∀ i, d :N∑j=1

vij(d) = 1 . (6.14)

One of the advantageous characteristics of this approach is that it can handle the sit-uation when the order of errors occurring closely together is changed, which happensfrequently in systems where several components send error messages to a central loggingcomponent (c.f., Property 6 on Page 15). More technically, if two symbols O1 = oa andO2 = ob occur at the same time (d = 0), the resulting sequence likelihood is identicalregardless of the order, since for d = 0 the process stays in the state si with probabilityone and the resulting (part of) the sequence likelihood is bsi(oa) bsi(ob) = bsi(ob) bsi(oa).


The forward algorithm. Similar to the case of discrete-time HMMs (c.f. Section 4.1),the forward variable α for HSMMs equals the probability of the sequence up to time tkfor all state sequences that end in state si (at time tk):

αk(i) = P (O0O1 . . . Ok, Sk = si|λ) . (6.15)

By replacing aij by vij(t) and changing time indexing, the following recursive computa-tion scheme for αk(i) is derived from Equation 4.9 on Page 58:

α0(i) = πi bsi(O0)

αk(j) =N∑i=1

αk−1(i) vij(tk − tk−1) bsj(Ok); 1 ≤ k ≤ L .(6.16)

The forward algorithm can also be visualized by a trellis structure as shown in Figure 4.3on Page 59.

Sequence likelihood. In the context of online failure prediction with HSMMs, sequencelikelihood is a probabilistic measure for the similarity of the observed error sequence tothe sequences in the training data set. More specifically, sequence likelihood is denotedas P (o |λ), which is the probability that a HSMM with parameter set λ can generateobservation sequence o. Equivalent to standard HMMs, this probability can be computedby the sum over the last column of the trellis structure for the forward variable α:

P (o |λ) =N∑i=1

αL(i) . (6.17)

When executing the forward algorithm on computers, probabilities quickly approachthe limit of computational accuracy, even with double precision floating point numbers.Therefore, a technique called scaling is applied (see, e.g., Rabiner [210]). The values ofcolumn k in the trellis for α are scaled to one by a scaling factor ck:

ck := 1∑i αk(i)

⇒∑i

ck αk(i) =∑i

α′k(i) = 1 . (6.18)

Instead of the sequence likelihood, which also gets too small very quickly, the logarithmof the sequence likelihood is used. It can be shown that the so-called log-likelihoodlog

[P (o |λ)

]can be computed easily by summing up the logarithms of the scaling fac-

tors:

log[P (o |λ)

]= −

L∑k=1

log ck . (6.19)

Finding the most probable sequence of states: Viterbi algorithm. The forward algo-rithm incorporates all possible state sequences. In some applications, however, this is notdesired and only the most probable sequence of states is of interest. This is computed bythe Viterbi algorithm.

102 6. The Model

In analogy to discrete-time HMMs,2 the Viterbi algorithm is derived from the forwardalgorithm by replacing the sum over all previous states by the maximum operator:

δk(i) = maxS0 S1 ... Sk−1

P (O0O1 . . . Ok, S0, S1, . . . , Sk−1, Sk = si | λ) (6.20)

δ0(i) = πi bsi(O0) (6.21)

δk(j) = max1≤ i≤N

δk−1(i) vij(tk − tk−1) bsj(Ok) . (6.22)

Hence, maxi δL(i) is the maximum probability of a single state sequence generating ob-servation sequence o. The sequence of states itself can be obtained by storing which statewas selected by the maximum operator and then tracing back through the array startingfrom state arg maxi δL(i).

6.2.2 Sequence PredictionSequence prediction deals with the estimation of the future behavior of a temporal se-quence. Although not used for failure prediction in this thesis, other application areasexist that take advantage of anticipating the further evolvement of a given sequence.

Given a model and the beginning of a temporal sequence, sequence prediction ad-dresses the question, how the sequence will evolve in the near future based on the char-acteristics expressed by the underlying model. More precisely, two different types ofsequence prediction can be distinguished:

1. What is the probability for the next observation of the sequence?

2. What is the probability that the underlying stochastic process will reach a certaindistinguished state within some time interval?

Probability of the Next Observation. In order to estimate the probability of next ob-servations, the following probability is defined:

ηt(ok) = P (OL+1 = ok, T ≤ t | tL, O0 . . . OL, λ); t ≥ tL . (6.23)

Here, ηt(ok) is the probability that the next emitted observation symbol is ok occurring attime T ≤ t, given a HSMM λ, the beginning of an observation sequence o = O0 . . . OL,and the time of occurrence of the last symbol tL. ηt(ok) can be computed as follows:

ηt(ok) =N∑j=1

P (SL+1 = sj, OL+1 = ok, T ≤ t | tL, o, λ) (6.24)

=N∑j=1

[P (OL+1 = ok | SL+1 = sj, T ≤ t, tL, o, λ)×

P (SL+1 = sj, T ≤ t | tL, o, λ)].

(6.25)

The first probability of Equation 6.25 is simply the observation probability for state sj:

P (OL+1 = ok | SL+1 = sj, T ≤ t, tL, o, λ) = bsj(ok) (6.26)

2c.f., Equation 4.20–4.22


whereas the second probability in Equation 6.25 can be split up further:

P (SL+1 = sj, T ≤ t | tL, o, λ) (6.27)

=N∑i=1

P (SL+1 = sj, SL = si, T ≤ t | tL, o, λ) (6.28)

=N∑i=1

P (SL+1 = sj, T ≤ t | SL = si, tL, o, λ) P (SL = si | tL, o, λ) . (6.29)

The first term of the product in Equation 6.29 is the probability that the state process is instate sj at time T ≤ t given that it was in state si at time tL. This equals the cumulativeprobability distribution vij(d):

P (SL+1 = sj, T ≤ t | SL = si, tL, o, λ) = vij(t− tL) . (6.30)

The second term of the product in Equation 6.29 is the probability that the state processresides in state si at the end of the observation sequence. This can be computed by use ofthe forward algorithm:

P (SL = si | tL, o, λ) = P (o, SL = si | tL, λ)P (o | λ) = αL(i)

P (o | λ) = αL(i)N∑j=1

αL(j). (6.31)

Summarizing the results, the probability that observation symbol ok will occur up totime t in the future can be computed by

ηt(ok) =N∑j=1

bsj(ok)N∑i=1

vij(t− tL)αL(i)P (o | λ) . (6.32)

Probability to Reach a Distinguished State. Computing probabilities for the next ob-servation symbol involved one single state transition (see Equation 6.30). However, ifthe next observation symbol is not of interest but the probability distribution to reach adistinguished state, computation of the first-step successor is not sufficient. Moreover, thegeneral probability to reach the distinguished state sd for the first time by time t irrespec-tive of the number of hops is desired:

P (Sd = sd, Td ≤ t | o, λ); Td = min( t : St = sd) . (6.33)

The procedure to compute this probability involves two steps:

1. Based on the given observation sequence o and the model λ, compute the proba-bility distribution for the last hidden state in the sequence P (SL = si | o, λ) usingEquation 6.31.

2. Use P (SL = si | o, λ) as the starting point to estimate the future behavior of thesystem. The objective is the probability defined in Equation 6.33, which is calledfirst passage time distribution.

104 6. The Model

In principle, an estimation of future behavior should take into account both the processof hidden state traversals and generated observation symbols. Taking into account ob-servation symbols results in a sum over all symbols for each state. However, only thesemi-Markov process of hidden state transitions has to be analyzed, since observationprobabilities can be omitted due to

∑Mk=1 bsi(ok) = 1.

In order to compute the first passage time distribution, the so-called first step analysis(Kulkarni [149]) is applied. The essence of first step analysis can be summarized asfollows:

In order to reach the designated state, the first step of the stochastic processeither reaches the state directly or the process transits to an intermediate state.In the latter case, the designated state is then reached directly from the inter-mediate state or via another intermediate state. This establishes a recursivecomputation scheme.

As in Equation 6.33, let Td denote the time to first reach the designated state sd and letFid(t) = P (Td ≤ t|SL = si) denote the probability to reach sd by time t given that theprocess is in state si at the end of the observation sequence, then

Fid(t) = gid(t) +∑j 6=d

∫ t

0d gij(τ)Fjd(t− τ) , (6.34)

where gid(t) is the cumulative probability distribution as defined in Section 6.1.1 and∫ t

0d gij(τ)Fjd(t− τ)

denotes the Lebesgue-Stieltjes integral (see, e.g., Saks [221]). Equation 6.34 is derivedfrom first step analysis: either state sd is reached directly within t —for which the prob-ability is gid(t)— or via some intermediate state sj 6= sd. In this case the transition tosj takes time τ and state sj is then reached within time t− τ . As might have becomeclear from the formula, this is a recursive problem, since starting from sj the destinationstate may either be reached directly or via yet another intermediate state. However, theduration of the transition from si to the intermediate state sj is not known. Therefore, allpossible values for τ have to be considered which results in the integral with bounds 0and t.

In order to solve the equation system defined by Equation 6.34, a recursive schemecan be defined:

F(0)id (t) = 0

F(n+1)id (t) = gid(t) +

∑j 6=d

∫ t

0d gij(τ)F (n)

jd (t− τ) . (6.35)

Kulkarni [149] showed that this recursion has the approximation property:

sup0≤τ≤t

∣∣∣F (n)id (τ)− Fid(τ)

∣∣∣ ≤ µ[nr ] , (6.36)

where µ and r are derived from a result on regular Markov renewal processes stating thatfor any fixed t ≥ 0, an integer r and real number 0 < µ < 1 exist such that:∑

j

g∗rij (t) ≤ µ (6.37)

6.3 Training Hidden Semi-Markov Models 105

and g∗rij (t) denotes the r-th convolution of gij(t) with itself.Since Fid(t) assumes the stochastic process to be initially in state si, the sum over

all states has to be computed where the probability for each state is determined by Equa-tion 6.31. Hence, in summary the probability to reach state sd within time t is givenby:

P (Sd = sd, Td ≤ t | o, λ) =∑i

Fid(t) P (SL = si | o, λ) . (6.38)

Computation of Equation 6.35 can be quite costly, depending on n, which is the max-imum number of transitions up to time t that are considered in the approximation. Ad-ditionally, each step involves a solution of the Lebesgue-Stieltjes integral which must inmany cases be solved numerically as there are many distributions for which there is no an-alytical representation (e.g., the cumulative distribution of a Gaussian random variable).However, computational complexity can be limited since the maximum number of tran-sitions is commonly limited by the application (in most applications, there is a minimumdelay between successive observations). Furthermore, a minimum delay between obser-vations also limits the number of points in time for which the Lebesgue-Stieltjes integralhas to be approximated.

It should also be noted that Fid(t) depends on the parameters of the HSMM but not onthe observation sequence: hence, the complex computations including integrations canbe precomputed. An online evaluation of Equation 6.38 only involves computation ofEquation 6.31 for each state, multiplication with precomputed Fid(t) and summing up theproducts.

6.3 Training Hidden Semi-Markov ModelsIn previous sections it has been assumed that the parameters λ of a HSMM are given. Thissection deals with the task to estimate the parameters from training sequences. For thispurpose, the Baum-Welch algorithm for standard HMMs (see Section 4.1.2) is adapted tohidden semi-Markov models.

6.3.1 Beta, Gamma and XiIn addition to the forward variable αk(i), reestimation formulas for standard HMMs arebased on a backward variable βt(i), a state probability γt(i), and a transition probabilityξt(i, j). The same applies to reestimation for HSMMs, which uses equivalent variablesβk(i), γk(i) and ξk(i, j).

Analogously to standard HMMs, the backward variable βk(i) denotes the probabilityof the rest of the observation sequence Ok+1 . . . OL given that the process is in state si attime tk:

βk(i) = P (Ok+1 . . . OL | Sk = si, λ) . (6.39)

βk(i) is computed backwards starting from time tL:

βL(i) = 1

βk(i) =N∑j=1

vij(dk) bsj(Ok+1) βk+1(j) .(6.40)

106 6. The Model

γk(i) denotes the probability that the stochastic process is in state si at the time whenthe k-th observation occurs. It can be computed from αk(i) and βk(i) following the samescheme as presented in Section 4.1.1:

γk(i) = αk(i) βk(i)∑Ni=1 αk(i) βk(i)

. (6.41)

ξk(i, j) is the probability that the stochastic process is in state si at time tk and is instate sj at time tk+1:

ξk(i, j) = P (Sk = si, Sk+1 = sj | o, λ) (6.42)

ξk(i, j) =αk(i) vij(dk+1) bsj(Ok+1) βk+1(j)∑N

i=1∑Nj=1 αk(i) vij(dk+1) bsj(Ot+1) βk+1(j)

. (6.43)

As was the case for standard HMMs, the expected number of transitions from state sito state sj is the sum over time

L−1∑k=0

ξk(i, j) . (6.44)

6.3.2 Reestimation FormulasAs has been described in Section 4.1.2, the most common training procedure of standardHMMs is the Baum-Welch algorithm, which is an iterative procedure. Similar to standardHMMs, the “expectation” step comprises computation of α, β, and subsequently γ andξ. Then, the “maximization” step is performed where model parameters are adjustedusing the values computed in the expectation step. This section provides the formulasfor the maximization step, which are derived in the course of the proof of convergence inSection 6.5.

Initial probabilities π

πi ≡expected number of series starting in state si

total number of sequence≡ γ0(i) . (6.45)

Emission probabilities bsi(oj)

bi(oj) ≡expected number of times observing oj in state si

expected number of times in state si≡

L∑k=0

s.t. Ok=oj

γk(i)

L∑k=0

γk(i). (6.46)

Except for a different notation of time, the formulas are the same as for standard HMMs.


Transition parameters. Since the stochastic process underlying state traversals ischanged from a discrete time Markov chain for standard HMMs to a semi-Markov pro-cess in the case of HSMMs, maximization of transition parameters is quite different fromstandard HMMs. The key difficulty is that parameters of all outgoing transitions from sito sj occur at two places: once for the transition si → sj ; j 6= i and once in the com-putation of the probability that the process has stayed in state si. This can be seen fromthe definition of vij(d) (see Equation 6.13), which is reiterated in an extended form forconvenience, here:

vij(dk) =

pij dij(dk) if j 6= i

1−N∑h=1h6=i

pih dih(dk) if j = i . (6.47)

The fact that pij occurs in both cases of the equation prohibits to apply similar formulas asfor standard HMMs. Instead, a gradient-based iterative optimization is used to maximizelikelihood of the training sequence with respect to the transition parameters, which arespecifically:

• limiting transition probabilities pij

• kernel parameters θij,r for each transition duration dij(dk) (c.f., Equation 6.8)

• kernel weights wij,r for each transition duration dij(dk) (c.f., Equation 6.9)

As is derived in detail in Section 6.5, optimization is performed for each state si by max-imizing the objective function Qv

i :

Qvi =

∑k

∑j 6=i

{ξk(i, j) log

[pij dij(dk)

]}+ ξk(i, i) log

[1−

∑h6=i

pih dih(dk)] . (6.48)

The gradient comprises partial derivatives of the objective function with respect to HSMMtransition parameters pij , wij,r, θij,r. The derivative for pij is given by:

∂

∂ pijQvi =

∑k

ξk(i, j) 1pij− ξk(i, i)

dij(dk)1−

∑h6=i

pih dih(dk)

; i 6= j . (6.49)

For i = j, the derivative is equal to zero.Derivation of Qv

i with respect to wij,r can be computed as follows:

∂ Qvi

∂ wij,r= ∂ Qv

i

∂ dij(dk)∂ dij(dk)∂ wij,r

, (6.50)

where∂ dij(dk)∂ wij,r

= κij,r(dk | θij,r) (6.51)

and

∂ Qvi

∂ dij(dk)=∑k

ξk(i, j) 1dij(dk)

− ξk(i, i)pij

1−∑h6=i

pih dih(dk)

; i 6= j . (6.52)

108 6. The Model

Again, for i = j, the derivative is equal to zero.Derivation of Qv

i with respect to θij,r is determined by:

∂ Qvi

∂ θij,r= ∂ Qv

i

∂ dij(dk)∂ dij(dk)∂ κij,r

∂ κij,r∂ θij,r

, (6.53)

where ∂ Qvi∂ dij(dk)

is as given by Equation 6.52,

∂ dij(dk)∂ κij,r

= wij,r , (6.54)

and ∂ κij,r∂ θij,r

is depending on the type of probability distribution. For example, if an expo-nential distribution is used, the derivative is given by:


= ∂

∂ λij,r

(1− e−λij,r dk

)= dk e

−λij,r dk . (6.55)

Gradient-based optimization techniques are usually iterative using a search directions(n), which is at least in part based on the gradient. The algorithms perform an updateof length η in direction of s(n). Various techniques exist to estimate η, including linesearch, and the Goldstein Armijo rule (see, e.g., Dennis & Moré [76]). The next point ofevaluation in the parameter space λ is determined by:

λ(n+1) = λ(n) + η s(n) . (6.56)

The search direction s(n) is given by:

s(n) = ∇Qvi |λ(n) , (6.57)

where ∇Qvi |λ(n) denotes the gradient vector of Qv

i with respect to the parametersevaluated at the point λ(n). A slight modification is used by conjugate gradient ap-proaches (Hestenes & Stiefel [119]), where the next search direction is obtained by:

s(n) = ∇Qvi |λ(n) + ζ s(n−1) , (6.58)

where ζ is a scalar that can be computed from the gradient.3

Several equality constraints apply to the optimization problem for a fixed state si suchas:

1.∑j

pij!= 1 (6.59)

2. ∀j :∑r

wij,r!= 1 , (6.60)

which results in a restricted search space for the gradient method. Equality constraintsof this form can be incorporated by projecting the search direction onto the hyperplanedefined by the constraints. The following example explains this procedure. Assume thatstate si has J outgoing transitions, and each duration distribution dij(t) consists of exactlyone Gaussian distribution having two parameters µij and σij . Then, the gradient vector

3See, e.g., Shewchuk [239]


∇Qvi has 3J components.4 The constraint on limiting transition probabilities pij defines

the hyperplane given by ∑j

pij − 1 = 0 . (6.61)

Let M denote the 3J × J matrix5 of orthonormal base vectors for the hyperplane trans-lated to cross the origin of parameter space. The new gradient vector (∇Qv

i )′, whichobeys equality constraints is obtained by projecting it onto the hyperplane by matrix mul-tiplication:

(∇Qvi )′ = (MMT )∇Qv

i . (6.62)

If several equality constraints apply to the optimization problem, M is the matrix of or-thonormal base vectors for the intersection of all hyperplanes induced by the constraints.Moreover, equality constraints are also obeyed for conjugate gradient approaches, sinceboth (∇Qv

i )′ and s(n) lie within the constraint hyperplanes, and hence a linear combina-

tion of the two vectors also results in a search direction within the hyperplane.Variables also have to satisfy inequality constraints. For example, probabilities pij can

only take values within the interval [0, 1]. In order to account for this, η must be restrictedsuch that the optimization algorithm cannot leave the space of feasible parameter values.This can be achieved by checking, whether λ(n+1) is in the range of feasible values. Ifnot, η must be made smaller, which can either be done by computing the intersection ofthe line λ(n) + a s(n) with the bordering hyperplane6 or by other heuristics.

6.3.3 A Summary of the Training AlgorithmThe goal of the training procedure is to adjust the model parameters λ such that the like-lihood of a given training sequence o is maximized. The training algorithm does onlyaffect π,B, P , andD(t), but not the structure of the HSMM. The structure consists of

• the set of states S = {s1, . . . , sN},

• the set of symbols O = {o1, . . . , oM}, which is also called the alphabet

• the topology of the model. It defines, which of the N states can be initial states,which of the potentially N × N transitions can be traversed by the stochastic pro-cess, and which of the potentially N × M emissions are available in each state.Technically, a transition si → sj is “removed” by setting pij = 0. The same holdsfor the initial state distribution π and the emission probabilities: if bsi(ok) is set tozero, state si cannot generate observation symbol ok. Since the training algorithmcan never assign a non-zero value to probabilities that are initialized by zero, it doesnot change the structure of the HSMM.

• specification of the transition durations D(t). This includes the number R andtypes of kernels κij,r for each existing transition. The structure may also comprisespecification of additional parameters that are not adjusted by the training procedure

4Since there is only one duration distribution, no weights wij,r are needed. Hence each of the J outgoingtransitions is determined by µij , σij , and pij

5Equation 6.61 defines a J dimensional hyperplane in 3J-dimensional space6E.g., defined by pij = 0

110 6. The Model

such as upper and lower bounds for uniform background distributions, which needto be set up before training starts.

Having specified the model structure, the training algorithm performs the steps shownin Figure 6.4 in order to adjust the parameters λ such that sequence likelihood of P (o |λ)reaches at least a local maximum.

Some notes on the training procedure: Gradient-based maximization within an EMalgorithm has been used to train standard HMMs, e.g., in Wilson & Bobick [279]. Suchapproach is called Generalized Expectation Maximization algorithm. If a conjugate gra-dient approach is applied, the resulting HMM learning algorithm is called ExpectationConjugate Gradient (ECG). Under certain conditions, ECG performs even better than theoriginal Baum-Welch algorithm (Salakhutdinov et al. [222]), but computational complex-ity is increased. However, complexity can be limited:

• The number of parameters that have to be estimated depends heavily on the numberof outgoing transitions (J). These, in turn, depend on the topology of the model: If,for example, the topology is a simple chain, then each state, despite of the last one,has only one outgoing transition. In case of an ergodic topology, where every stateis connected to every other, the number equals N − 1.

• The kernel weights wij,r do not necessarily need to be optimized. If the number ofparameters is too large, the weights of the convex combination can simply be fixed,which reduces the number of parameters by J × R, where R denotes the averagenumber of kernels per outgoing transition.

• The number of kernels may be reduced when some duration background distribu-tion is used. If specified a-priori, background distributions do not increase the sizeof the parameter vector.

• As is shown in Section 6.5, the overall EM algorithm also converges if sequencelikelihood is only increased by a sufficiently large amount. Therefore, gradient-based optimization can be stopped after a few iterations.

• Since the optimization algorithm is based on the gradient, only cumulative distri-butions dij(dk) can be used for which the derivative with respect to its parametersare available. However, this is the case for many widespread distributions. SeeAppendix V for some examples.

The former notes have mainly addressed the embedded gradient-based optimization ofQvi . Regarding the entire training procedure, the following notes should be kept in mind

when applying HSMMs as a modeling technique:

• Equivalently to standard HMMs, scaling factors sk used to scale αk(i) (c.f., Equa-tion 6.18) can also be used to scale βk. ξ and γ can then be computed on the basisof scaled α′ and β′.

• It has been shown that for large models, results and speed of convergence can beimproved if prior knowledge is incorporated into parameter initialization. For ex-ample, the length of failure sequences used for training of one model show a cer-tain distribution with respect to the number of observations. This can be exploited


1. Initialize the model by assigning values to π, B, and G(t) for all entries thatexist in the structure. This constitutes λold

2. Compute αk(i) by Equation 6.16, βk(i) by Equation 6.40, γk(i) by Equa-tion 6.41, and ξk(i, j) by Equation 6.43 using λold and observation sequenceo

3. Compute sequence likelihood P (o |λold) by Equation 6.17.

4. Adjust π by Equation 6.45, and B by Equation 6.46, resulting in λnewπ andλnewB

5. Reestimate the parameters of G by the embedded optimization algorithm. Foreach state si, perform:

(a) Compute the gradient vector g(n) of Qvi with respect to the parameters of

G at λ(n)Gi

, which is either initialized by λoldGi or obtained from a previousiteration

g(n) = (∇Qvi )∣∣∣∣λ

(n)Gi

=[∂ Qv

i

∂ pij, . . . ,

∂ Qvi

∂ wij,r, . . . ,

∂ Qvi

∂ κij,r, . . .

]λ

(n)Gi

,

where Qvi is given by Equation 6.48

(b) Project the gradient onto the hyperplane of feasible solutions for equalityequations. This is achieved by matrix multiplication:

g′(n) = (MMT ) g(n) ,

where M denotes the matrix of orthonormal base vectors for hyperplanesdefined by equality constraints such as that the sum of probabilities shouldequal one.

(c) Determine a search direction s(n) from g′(n) and eventually s(n−1) and astep size η, e.g., by line search. Assure that the search vector does notcross the boundaries induced by inequality equations such as the condi-tion that probabilities must lie between [0, 1]. The next point in searchspace is obtained by:

λ(n+1)Gi

= λ(n)Gi

+ η s(n) .

(d) Repeat from Step 5a until step size is less than some bound or a maximumnumber of steps is reached. The result constitutes to λnewGi

6. Set λold := λnew and repeat Steps 2 to 6 until the difference in observationsequence likelihood P (o |λnew)− P (o |λold) is less than some bound.

Figure 6.4: Summary of the complete training algorithm for HSMMs.

112 6. The Model

to come up with a better guess for initial probabilities π. Additionally, initializa-tion of observation probabilities can be improved by taking the prior distribution ofsymbols into account. Other techniques first apply the Viterbi algorithm to comeup with an initial assignment of states to observations, as described, e.g., in Juang& Rabiner [138]. Similar techniques can also be used to obtain an initial guess fortransition durations.

• It has also been shown that results can be improved by setting all observation prob-abilities bi(ok) to zero that are less than some threshold (Rabiner [210]).

• The training procedure improves model parameters until some local maximum isreached, which can be significantly lower than the global maximum. Therefore, inthis thesis the training procedure is performed several times with different (random)parameter initializations. Other approaches are discussed in the outlook (Chap-ter 12).

• Gradient-based optimization could be applied to sequence likelihood directly (andnot to the Q-function, which is a lower bound for likelihood). However, first, di-mensionality of the optimization parameter space would be dramatically increased,and second, the efficiency that parts of the optimization problem can be solved an-alytically would be lost.

• The training procedure described only considered one single training sequence. Anextension to multiple sequences is similar to standard HMMs with the slight differ-ence that the gradient takes all training sequences into account. However, vectorsλ(n) → λ(n+1) for single sequences can be linearly combined exploiting the fact thatlog-likelihood for multiple sequences is the sum of single sequence log-likelihoods.

• Background distributions for observation probabilitiesB can be applied to HSMMsin the same way as to standard HMMs. They are frequently used to circumventone of the major drawbacks of the Baum-Welch algorithm: Observation probabil-ities bsi(oj) are computed from the number of occurrences of observation oj (c.f.,Equation 6.46). If one specific symbol oc has not occurred in the training data,bsi(oc) is set to zero for all states si in the first iteration of the training algorithm.Hence, in the forward algorithm, any observation sequence containing oc is assigneda sequence likelihood of zero (c.f., Equation 6.16). Background probabilities rem-edy this problem by substituting bsi(oj) with b′i(oc) > 0 as defined in the following.Let Pb(oj) denote a discrete probability distribution over all observation symbols.Observation probabilities of a hidden Markov model become a convex combinationof the original observation probabilities bsi(oj) and Pb(oj):

b′ij = b′si(oj) = ρi Pb(oj) + (1− ρi) bsi(oj); 0 ≤ ρi ≤ 1 , (6.63)

where ρi is a state-dependent weighting factor.

6.4 Difference Between the Approach and other HSMMsThe term “hidden semi-Markov model” has been used for various models, since “semi”simply indicates that a model employs some probability distribution for representation of

6.4 Difference Between the Approach and other HSMMs 113

time. Due to the fact that the models have been developed in the area of speech recognitionand signal processing, almost all models assume input data to be an equidistant timeseries, which leads to the simplification that a minimum time step exists and durationscan be handled by multiples of the time step.

Speech recognition. In order to better explain the differences between the approachpresented here and previously published work, the task of phoneme7 assignment to aspeech signal is taken as an example. A plethora of work exists on this topic8 introducingvarious methods and techniques to improve speech recognition quality —however, thefocus here is on duration modeling and only the basic principles are explained.

The process of phoneme recognition is sketched in Figure 6.5. Starting from the top of

Figure 6.5: A simplified sketch of phoneme assignment to a speech signal.

the figure, the analog sound signal is sampled and converted into a digital signal. Portionsof the sampled signal are then analyzed in order to extract features of the signal. Featureextraction involves, e.g., a short-time Fourier transform and various other computations.Since in this thesis only discrete emissions are considered, assume that the result of featureextraction is one symbol out of a discrete set, denoted by “A” and “B”.9 Subsequently,the sequence of features is analyzed by several HMMs: Each HMM is modeling one

7A phoneme is the smallest unit of speech that distinguishes meaning.8For an overview, see, e.g., Cole et al. [62]9Usually, it is a feature vector containing both discrete and continuous values

114 6. The Model

phoneme and sequence likelihood is computed for each HMM using the forward or Viterbialgorithm. In order to assign a phoneme to the sequence of features, some classificationis performed.

As has been pointed out by several authors (see, e.g., Russell & Cook [218]), the qual-ity of assignment can be improved by introducing the notion of state duration: Rather thantraversing to the next state each time an observation symbol (i.e., a feature) occurs, thestochastic process may reside in one state for a certain time generating several subse-quent observation symbols before traversing to the next state. Figure 6.6 (a) shows the

Figure 6.6: Assigning states si to observations (A or B). (a) shows the case where a statetransition takes place each time an observation symbols occurs. If state durationsare introduced the process may reside in one state accounting for several subse-quent observations. However, several state sequences are possible, of which afew are shown by (b)-(d)

case where the occurrence of each feature symbol corresponds to a state transition. Intro-ducing the notion of state duration, the process of state transitions is decoupled from theoccurrence of observation symbols. However, this flexibility results in several potentialstate sequences, as can be seen from Figures 6.6 (b) to (d). Considering all potential statesequences increases the complexity to compute sequence likelihood since all possiblestate paths have to be summed up. To be precise, the number of potential paths increasesfrom NL where N denotes the number of states and L the length of the sequence (c.f.,Equation 4.7 on Page 58) to

L−1∑k=0

(L− 1k

)N (N − 1)k , (6.64)

where k is the number of state transitions that take place.10 The major drawback of thisis that dynamic programming approach such as the forward algorithm cannot be applied.

10It is assumed that(x0)

= 1

6.4 Difference Between the Approach and other HSMMs 115

Figure 6.7: The trellis structure for the forward algorithm with duration modeling. A maximumduration of D = 2 is used. Thick lines highlight terms involved in computation ofα3(1)

This is due to the fact that the Markov assumptions do not apply: the condition that all theinformation needed to compute αt(j) is included in α’s of the previous time step is notfulfilled for variable state durations.

Concrete models that were used in speech recognition have typically applied one re-striction in order to come up with a feasible algorithm: They included an upper bound forstate durations (denoted by D). This leads to the following forward-like algorithm (see,e.g., Mitchell & Jamieson [183]):

αt(j) =N∑i=1

min(D,t)∑τ=1

αt−τ (i) aij dj(τ)τ−1∏m=0

bsj(Ot−m) , (6.65)

where αt(j) denotes the probability of the observation sequence for all state sequencesfor which state sj ends at time t. The algorithm includes an additional sum over τ , whichis the duration how long the process stays in state sj , and dj(τ) specifies the probabilitydistribution for the duration. The product over bsj(·) results from the fact that duringits stay, state sj has to produce all the emission symbols Ot−τ+1 . . . Ot. Similar to thestandard forward algorithm, the approach can be visualized by a trellis structure, as shownin Figure 6.7. As can be seen from the figure, the major drawback of the algorithm is itscomputational complexity: according to Ramesh & Wilpon [211], it increases by a factorof D2

2 . Various modifications to this approach have been proposed of which the majorcategories have been described in Chapter 4.

Temporal sequences. The essential difference between speech recognition and tempo-ral sequence processing is that symbols occur equidistantly in the first case, which doesnot apply to the latter case. Periodicity in speech recognition is caused by the underlyingsampling of an analogous signal, whereas in temporal sequences such as error sequences,the occurrence of symbols is event-triggered. This difference leads to the following con-clusions:

• Using discrete timesteps of fixed size is appropriate for speech signals but not fortemporal sequences due to the reasons given in the discussion of time-slotting (c.f.Section 4.2.1).

• In event-driven temporal sequences, temporal variability is already included in theobservation sequence itself. Therefore, a tight relation between hidden state tran-sitions and occurrence of observation symbols can be assumed. Specifically, themodel presented in this thesis assumes a one-to-one mapping.

116 6. The Model

• The one-to-one mapping between state transitions and observation symbol occur-rence has two advantages:

1. It enforces the Markov assumption, which leads to an efficient forward algo-rithm that is very similar to the standard forward algorithm of discrete-timeHMMs. Specifically, the sum over durations τ in Equation 6.65 is avoidedso that the algorithm belongs to the same complexity class as the standardforward algorithm, as is shown in Section 6.7.

2. Durations can be assigned to transitions rather than to states, which increasesmodeling flexibility and expressiveness. Obviously, state durations are a spe-cial case of transition durations.11

Considering Equation 6.13, the approach is related to inhomogeneous HMMs (IHMMs).However, the process must still be called homogeneous since probabilities vij(d) stay thesame regardless of the time when the transition takes place, i.e. at the beginning or endingof the sequence. Furthermore, in contrast to IHMMs, continuous duration distributionsrather than discrete ones are used in this thesis.

6.5 Proving Convergence of the Training AlgorithmThe objective of the training procedure is to find a set of parameters λopt that maximizessequence likelihood of the training data:

λopt = arg maxλ

P (o |λ) . (6.66)

The training procedure described here is an Expectation-Maximization (EM) algorithm(Dempster et al. [75]). It improves sequence likelihood until at least some local maximumis reached. The algorithm described here is closely related to the Baum-Welch algorithm,whose convergence was originally proven by Baum & Sell [25] without the frameworkof EM algorithms. However, the framework of EM algorithms provides a view on theproblem that allows for simpler proofs. Such approach is adapted to prove convergenceof the training algorithm presented here. In the following, first a general proof of conver-gence for EM algorithms by Minka [181] is presented, which is subsequently adapted tothe specifics of HSMMs.

6.5.1 A Proof of Convergence FrameworkEM algorithms are maximum-a-posteriori (MAP) estimators and hence rely on the pres-ence of some data that has been observed, which in this case refers to the observationsequence o forming dataset O. The goal is to maximize data likelihood P (o|λ).

The potential and wide range of application of EM algorithms stems from two prop-erties:

1. EM algorithms build on lower bound optimization (Minka [181]). Instead of op-timizing a complex objective function directly, some simpler lower bound is opti-mized.

2. EM algorithms can handle incomplete / unobservable data.

11In this case, all outgoing transitions have the same duration distribution

6.5 Proving Convergence of the Training Algorithm 117

Lower bound optimization. In lower bound optimization, which is also called theprimal-dual method (Bazaraa & Shetty [26]), a computationally intractable objectivefunction is optimized by repetitive maximization of some lower bound that is easier tocompute. More specifically, if o(λ) denotes the objective function, a simpler lower boundb(λ) that equals o(λ) at the current estimation of λ is maximized (see Figure 6.8). Max-imization of b(λ) yields a new estimate for λ, for which the objective o(λ) is increased(except for the case when the derivative of the objective equals zero, which is a local op-timum). If the objective function is continuous and bounded, as is the case for HSMMs,iteratively increasing the lower bound converges to at least a local optimum of the objec-tive function.

Figure 6.8: Lower bound optimization. Starting from the current estimate of parameter λ,a lower bound b(λ) to the objective function o(λ) is determined that is easier tomaximize than the objective function. If the lower bound equals the objective func-tion at the current estimate of λ, maximization of the lower bound leads to a newestimate of λ for which the value of the objective function is increased. Performingthis procedure iteratively yields at least a local maximum of the objective function.

From this, the following iterative optimization scheme can be derived:

1. Determine a lower bound simpler than the objective function that equals the objec-tive function at the current estimate of parameter λ.

2. Determine the maximum of the lower bound yielding the next estimation of λ.

3. Repeat until the increase of the objective function is below some threshold.

Compared to this, gradient-based optimization approaches approximate the objectivefunction by the tangent to the objective function at the current estimate for λ and movealong that line for some distance to obtain the new estimate.

Handling of unobservable data. Unobservable data describes the situation where somequantity used in modeling cannot be observed by measurements. In the case of HMMsand its variants, this refers to the fact that the sequence of hidden states s, which thestochastic process has traversed, cannot be observed. Analogously to O, let S = {s}denote the set of state sequences s of the training data set. Two data sets must be dis-tinguished: the complete dataset Z = (O,S) includes both observed and unknown data,

118 6. The Model

while the incomplete dataset only consists of observed data. The objective is to opti-mize data likelihood of the (observable) incomplete dataset P (o|λ). EM algorithms dealwith this problem by assuming the incomplete data likelihood to be the marginal of thecomplete data set. Hence,

P (o|λ) =∫sP (o, s|λ) ds . (6.67)

The Q-Function. In order to determine a lower bound to data likelihood, Jensen’s in-equality [133] can be used:∑

j

g(j) aj ≥∏j

gajj (j); aj ≥ 0,

∑j

aj = 1, g(j) ≥ 0 (6.68)

stating that the arithmetic mean is greater or equal to the geometric mean. Application toEquation 6.67 requires extension by some arbitrary function q(s) as follows (see Minka[181]):

P (o|λ) =∫s

P (o, s|λ)q(s) q(s) ds ≥

∏s

(P (o, s|λ)q(s)

)q(s) ds= f(λ, q(s)) , (6.69)

where ∫sq(s) ds != 1 . (6.70)

f(λ, q(s)) is the lower bound and q(s) is some arbitrary probability density over s.The arbitrary function f needs to be chosen such that it touches the objective function

at the current estimate of parameters λold (see Figure 6.8). It can be shown that setting

q(s) = P (s |o, λold) (6.71)

fulfills the requirement (see Minka [181]).Maximization of the lower bound is performed by maximizing its logarithm. Loga-

rithmizing yields

log[f(λ, q(s))

]=∫sq(s) log

[P (o, s|λ)

]ds−

∫sq(s) log

[q(s)

]ds . (6.72)

Substituting Equation 6.71 into Equation 6.72 and dropping terms that are not dependingon λ yields the so-called Q-function:

Q(λ, λold) =∫slog

[P (o, s|λ)

]P (s |o, λold) ds , (6.73)

which in fact is the expected value over the unknown data s of the log-likelihood of thecomplete data set. Since likelihood of the complete data set is in many cases easier tooptimize than the one of the incomplete data set, EM algorithms can solve more complexoptimization problems.

EM algorithms. With the notation just developed, the procedure of EM algorithms canbe refined as follows:

• E-step: Compute the Q-function based on parameters λold obtained from initializa-tion or the previous M-step.


• M-step: Compute the next estimation for λ by maximizing the Q-function:

λnew = arg maxλ

Q(λ, λold) . (6.74)

• Repeat until increase in data likelihood P (o|λ) is less than some threshold.

Convergence of the procedure is guaranteed, since for the objective function holds0 ≤ Q(λ, λold) ≤ P (o|λ) ≤ 1 and the lower bound Q is not decreasing in any iteration.

In case of HMMs, a local maximum of Q is usually found by partial derivation of theQ-function and solving the equation

∂ Q

∂λ!= 0 . (6.75)

and usage of Lagrange multipliers to account for additional constraints on parameters(e.g., the sum of outgoing probabilities has to be equal to one). Another way to optimizeQ is to apply an iterative approximation technique. If the optimum of Q is not foundexactly, the algorithm converges still, if only some new parameter values λ are foundfor which the lower bound is sufficiently greater than for λold. Such approach is calledGeneralized EM algorithm (e.g, Wilson & Bobick [279]).

6.5.2 The Proof for HSMMsFor HSMMs, the complete dataset Z = (O,S) consists of the observation sequence oand the sequence of hidden states s that the stochastic process has traversed. If both thesequence of hidden states and observation sequence are known, (complete) data likeli-hood is computed by alternately multiplying state transition probabilities and observationprobabilities along the path of states s:

P (o, s |λ) = πs0 bs0(O0)L∏k=1

vsk−1 sk(dk) bsk(Ok) (6.76)

= πs0

L∏k=0

bsk(Ok)L∏k=1

vsk−1 sk(dk) (6.77)

and hence the Q-function is (c.f., Equation 6.73):

Q(λ, λold) =∑s∈S

log[P (o, s|λ)

]P (s |o, λold) (6.78)

=∑s∈S

log[πs0

]P (s |o, λold) (6.79)

+∑s∈S

L∑k=0

log[bsk(Ok)

]P (s |o, λold) (6.80)

+∑s∈S

L∑k=1

log[vsk−1 sk(dk)

]P (s |o, λold) (6.81)

= Qπ(π, λold) +Qb(B, λold) +Qv(G, λold) , (6.82)

120 6. The Model

where S denotes the set of all possible state sequences s.Some papers (e.g., Bilmes [29]) use P (s, o |λold) instead of P (s |o, λold). However,

this difference does not matter since

P (s, o |λold) = P (s |o, λold) P (o |λold) , (6.83)

and since P (o |λold) is independent of λ it does not affect the arg max operator used todetermine λnew (c.f., Equation 6.74).

The important feature of Equation 6.82 is that the terms Qπ, Qb, and Qv are indepen-dent of each other with respect to π, B, and G. Due to partial derivation involved inmaximization, Qπ, Qb, and Qv can be maximized separately.

MaximizingQπ. Qπ can be further simplified:

Qπ(π, λold) =∑s∈S

log[πs0

]P (s |o, λold) =

N∑i=1

log[πi]P (S0 = si |o, λold) , (6.84)

since for each s ∈ S, only the first state s0 is of importance. The second term on the rightof the Equation, P (S0 = si |o, λold), subsumes all state sequences starting with state siand hence the sum over all state sequences s can be turned into a sum over all states.

In order to determine λopt with respect to π, the following constrained maximizationproblem has to be solved:

πopt = arg maxπi;

i=1,...,N

Qπ(π, λold); s.t.N∑i=1

πi = 1 . (6.85)

This can be accomplished by a Lagrange multiplier ϕ. Note that derivation is performedfor one specific πi out of the sum of πi’s:

∂

∂πi

(N∑i=1

log[πi]P (S0 = si |o, λold)− ϕ

( N∑i=1

πi − 1))

!= 0 (6.86)

⇔ 1πiP (S0 = si |o, λold)− ϕ = 0 . (6.87)

The Lagrangian multiplier ϕ can be determined by substituting Equation 6.87 into theside condition:

N∑i=1

P (S0 = si |o, λold)ϕ

!= 1 (6.88)

⇔ ϕ =N∑i=1

P (S0 = si |o, λold) (6.89)

⇔ ϕ = 1 , (6.90)

since it is sure that the stochastic process is in one state at the beginning of the sequence.Using this result, Equation 6.87 can be solved to obtain the reestimation formula given inEquation 6.45:

πi = P (S0 = si |o, λold) , = γ0(i) (6.91)

as can be seen from the definition of γi(t) (c.f., Equation 4.14 on Page 59).


Maximizing Ob. In order to maximize the second term of the Q-function it is sim-plified first. The “row-wise” collection along state sequences s is exchanged by a“column-wise” collection for each time step k. Therefore, P (s |o, λold) is exchangedby P (Sk = si |o, λold) and sums are adapted adequately:

Qb(B, λold) =∑s∈S

L∑k=0

log[bsk(Ok)

]P (s |o, λold) =

N∑i=1

L∑k=0

log[bsi(Ok)

]P (Sk = si |o, λold) . (6.92)

For readability reasons, bsi(oj) is denoted by bij in the following. The maximizationproblem is:

Bopt = arg maxbij

i=1,...,Nj=1,...,M

Qb(B, λold); s.t. ∀ i :M∑j=1

bij!= 1 (6.93)

leading to

∂

∂bij

N∑i=1

L∑k=0

log[bi(Ok)

]P (Sk = si |o, λold)−

N∑i=1

ϕi( M∑j=1

bij − 1) = 0 (6.94)

⇔L∑

k=0;Ok=oj

1bij

P (Sk = si |o, λold)− ϕi = 0 (6.95)

⇔ bij =

L∑k=0;Ok=oj

P (Sk = si |o, λold)

ϕi; ϕi 6= 0 . (6.96)

Substitution into side-constraints yields:

M∑j=1

L∑k=0;Ok=oj


ϕi

!= 1 (6.97)

⇔ ϕi =M∑j=1

L∑k=0;Ok=oj

P (Sk = si |o, λold) =L∑k=0

P (Sk = si |o, λold) . (6.98)

The condition ϕi 6= 0 is fulfilled, if state si is reachable with the given sequence. Finally,

bij =

L∑k=0;Ok=oj


L∑k=0

P (Sk = si |o, λold)=

L∑k=0;Ok=oj

γk(i)

L∑k=0

γk(i). (6.99)

122 6. The Model

MaximizingQv. In order to maximize the transition part of theQ function for HSMMs,the sums are again rearranged. This time, the grouping collects all transitions fromSk−1 = si to Sk = sj as follows:

Qv(G, λold) =∑s∈S

L∑k=1

log[vsk−1 sk(dk |G)

]P (s |o, λold) =

N∑i=1

N∑j=1

L∑k=1

log[vij(dk |G)

]P (Sk−1 = si, Sk = sj |o, λold) . (6.100)

In contrast to the maximization of π and B, and in contrast to standard HMMs, maxi-mization cannot be performed analytically. The reason for this can be traced back to thedefinition of vij(dk), which actually is a function of parameters P andD:12

vij(dk |G) = vij(dk |P , D(d)

)=

pij dij(dk) if j 6= i

1−N∑h=1h6=i

pih dih(dk) if j = i . (6.101)

The problem is that pij and dij(dk) appear twice in vij(dk), once in each case, whichcomplicates computations as can be seen from a derivation with respect to pij . In order toshorten notations, it can be seen from the definition given in Equation 6.42 that

P (Sk−1 = si, Sk = sj |o, λold) = ξk(i, j) . (6.102)

Incorporating the side conditions given in Equation 6.4, the Lagrangian L for Equa-tion 6.100 is:

L =N∑i=1

N∑j=1

L∑k=1

log[vij(dk)

]ξk(i, j)−

N∑i=1

ϕi( N∑j=1j 6=i

pij − 1)

(6.103)

=L∑k=1

N∑i=1

[ N∑j=1j 6=i

(log

[pij dij(dk)

]ξk(i, j)

)+ log

[1−

∑h=1h6=i

pih dih(dk)]ξk(i, i)

]−

N∑i=1

ϕi( N∑j=1j 6=i

pij − 1). (6.104)

12Although it has been assumed that pii ≡ 0, notations include h 6= i to highlight that no self-transitionsare incorporated.


Setting the partial derivative to zero yields:

∂

∂ pijL = 0 (6.105)

⇔L∑k=1

dij(dk)pij dij(dk)

ξk(i, j) + −dij(dk)1−

∑h6=i

pih dih(dk)ξk(i, i)

− ϕi = 0 (6.106)

⇔L∑k=1

1pij

ξk(i, j)−dij(dk)

1−∑h6=i,j

pih dih(dk)− pij dij(dk)ξk(i, i)

− ϕi = 0 . (6.107)

Although a solution for pij exists, we are not able to analytically solve for ϕi. For thisreason, a gradient-based approximation technique is applied. It can be seen from Equa-tion 6.107 that the derivatives are independent for each state si. Therefore, parameterscan be optimized separately for each state and the objective function Qv can be split asfollows:

Qv(P ,D(d), λold) =N∑i=1

Qvi (P i,Di(d), λold) . (6.108)

This reduces complexity of the optimization procedure since the number of parametersis much smaller. The objective comprises all outgoing transitions of state si, the objectivefunction is denoted by Qv

i :

Qvi (P i,Di(d), λold) =L∑k=1

[ N∑j=1j 6=i

(log

[pij dij(dk)

]ξk(i, j)

)

+ log[1−

∑h=1h6=i,j

pih dih(dk)− pij dij(dk)]ξk(i, i)

].

(6.109)

Let∇Qvi denote the gradient vector, which components are obtained by partial deriva-

tion of Equation 6.109 with respect to the parameters. Derivations with respect to kernelweights wij,r and kernel parameters θij,r are obtained by

(∇Qvi )wij,r = ∂ Qv

i

∂ dij(dk)∂ dij(dk)∂ wij,r

(∇Qvi )θij,r = ∂ Qv

i

∂ dij(dk)∂ dij(dk)∂ κij,r


. (6.110)

The dimension of∇Qvi equals

dim(∇Qvi ) = J

(1 + R

(1 + θ

) ), (6.111)

where J is the number of outgoing transitions, R the average number of kernels per tran-sition and θ the average number of kernel parameters θij,r per kernel.

However, the optimization procedure has to obey several restrictions. The first hasalready been expressed by the Lagrangian in Equation 6.103: the sum over pij for alloutgoing transitions has to equal one. Rearranging this restriction yields:

J∑j=1

pij − 1 = 0 , (6.112)

124 6. The Model

which is the defining equation of a J-dimensional hyperplane in the space of all optimiza-tion parameters. The interpretation for this is that all feasible solutions to the optimizationproblem have to be points within the hyperplane. However, the vector∇Qv

i does not nec-essarily point to a direction parallel to the hyperplane such that an unrestricted gradientascent would leave the hyperplane of feasible solutions. In order to avoid this, the gradientvector is projected onto the hyperplane, which results in the direction of steepest ascentwithin the subspace of feasible solutions (see Figure 6.9).

Figure 6.9: Projecting the gradient vector g into the plane of values for which p1 + p2 = 1.The result is denoted by g′. Θ denotes an arbitrary third parameter

Projection onto a hyperplane can be achieved by simple matrix multiplication:

(∇Qvi )′ = (MMT )∇Qv

i , (6.113)

where (∇Qvi )′ denotes the projected gradient vector andM is the matrix of orthonormal

base vectors of the hyperplane translated such that it crosses the origin of parameter space.Note thatM is constant such that the projection matrix can be precomputed.

In most applications, duration distributions will be applied that are a convex combi-nation of two or more kernels. In this case, the requirement that kernel weights sum upto one (c.f., Equation 6.9 on Page 98) constitutes an additional hyperplane restricting thesubspace of feasible solutions similar to Equation 6.112. In a geometric interpretation, thesubspace of feasible solutions is then defined by the intersection of all constraining hyper-planes. For example in Figure 6.9, if parameter Θ had to be equal to zero, the subspaceof feasible solutions would only consist of the intersection of the shaded hyperplane withthe p1 p2 plane, as indicated by the bold line. MatrixM consists of the orthonormal basevectors of the intersection of all restricting hyperplanes, which can also be precomputed.

However, there are further restrictions. For example, pij denote probabilities, whichcan hence only take values of range [0, 1]. Another example is that the parameter λ of anexponential distribution must be greater than zero. The solution to this problem is that thestepsize along the projected gradient vector needs to be restricted such that the optimiza-

6.6 HSMMs for Failure Prediction 125

tion cannot leave the admissible range. In the geometric interpretation, this correspondsto clipping the projected gradient vector at boundary hyperplanes such as λ = 0.

Summary of the proof of convergence. The goal of this section was to prove conver-gence of the training algorithm. The strategy of EM algorithms is to iteratively maximizea lower bound to reach a maximum of the objective function. The lower bound of EMalgorithms is the so-called Q function, which is the expected training data likelihood overall combinations of unknown data. In the case of HSMMs, the Q function is the expectedobservation sequence likelihood over all sequences of hidden states. Similar to standardHMMs, the Q function for HSMMs can be separated into a sum of three independentparts such that maximization of Q can be achieved by individual maximization. Themaximum for initial probabilities π and observation probabilities B has been computedanalytically using the method of Lagrangian multipliers resulting in reestimation formu-las similar to the ones of the Baum-Welch algorithm for standard HMMs. However, ananalytical solution is not available for transition parameters. Therefore, a gradient-basediterative maximization procedure is used for this part of the Q function. The fact that theQ function is increased leads to an increased value of the objective function. Since theobjective function is continuous and bounded, a repetitive increase converges to a localmaximum.

6.6 HSMMs for Failure PredictionThere is a principle interrelation between the number of free parameters and the amountof training data needed to estimate the parameters: The more parameters that need to beestimated the more training sequences are required to yield reliable estimates. Since infailure prediction, the models are trained from failure data, the amount of training data isnaturally limited. Hence the number of free parameters must be kept small. The numberof free model parameters is mainly determined by the number of states and the topology,which determines the connections among states.

The most wide-spread topology for HMMs is a chain like structure, since first, the no-tion of a sequence has some “left-to-right” connotation and second, it has the least numberof transitions. The model’s topology used for online failure prediction are no exceptionin that respect. However, there are some particularities that need to be explained.

It is a principle and unavoidable characteristic of supervised machine learning ap-proaches that the desired specifics are extracted from training data, which can never cap-ture all properties of the true underlying interrelations. More specifically, this results fromthe fact that

1. Training data is a finite sample of data, from which follows that samples only con-tain/reveal a subset of the true characteristics

2. Measurement data is subject to noise. In the case of error sequences, e.g., it iscommon that error messages that are not related to the failure mechanism occur inthe training data (noise filtering can alleviate the problem but cannot completelyremove noise from the data)

In order to account for these two properties, a strict left-right model is extended in twosteps:

126 6. The Model

1. Jumps are introduced such that states can be left out, as is shown in Figure 6.10.This addresses missing error events in training sequences, which is related to thefirst particularity listed above.

2. After training, intermediate states are introduced (see Figure 6.11), addressing thesecond particularity.

Training is performed between the two steps for the reason to keep the number of param-eters as small as possible.

Chain model with shortcuts. The model topology to which training is applied is shownin Figure 6.10. Since this structure is rather sparse, training computation times remainacceptable. Note that only shortcuts bypassing one state have been included in the figure.The models used for the telecommunication system case study also included shortcuts oflarger maximum span.

Figure 6.10: Failure prediction model structure used for training. Only shortcuts bypassingone state are shown. In implementations, also shortcuts having a larger spanhave been used.

Transition parameters, and prior probabilities are initialized randomly. Observationprobabilities are also initialized randomly, with one restriction: failure symbols can onlybe generated by the last absorbing failure state and error event IDs only by the transientstates. Since the number of states N is not altered by the training procedure, it mustbe prespecified, although the optimal number of states cannot be identified upfront: Ifthere are too few states, there are not enough transitions to represent all symbols in thesequences. If there are too many, the number of parameters is too large to be reliablyestimated from the limited amount of training data. Furthermore, the model might overfitthe training data. For this reason, several values of N are tried and the most appropriatemodel is selected. Please also note that training sequences have been filtered in the processof data preprocessing, as described in Section 5.3.

Background distributions. As has been pointed out in Section 6.3.3, the Baum-Welchalgorithm sets observation probabilities to zero for all observation symbols that do not oc-cur in the training data set and hence every sequence containing one of those symbols isassigned a sequence likelihood of zero. This is not appropriate for failure prediction sincethe subsequent classification step builds on a continuous measure for similarity. Further-more, training data is incomplete: During online prediction, there might be failure-pronesequences that are very similar containing some symbol that has not been present in the

6.6 HSMMs for Failure Prediction 127

(filtered) training data. Assigning a sequence likelihood (i.e., a similarity) of zero is obvi-ously not appropriate. Hence, after training the chain model with shortcuts, backgrounddistributions have to be applied to observation sequences.

Intermediate states. For each transition of the model, a fixed number of intermediatestates are added such that the sum of mean transition times equals the mean transitiontime of the original transition (see Figure 6.11). More precisely, for any pair of states

Figure 6.11: Adding intermediate states for each transition. Bold arcs visualize transitionsfrom the model shown in Figure 6.10. µij denotes the mean duration of thetransition from state si to state sj . Observation probability distributions bsi(oj)for states 1, 2, and 3 have been omitted

si and sj of the model obtained from training (c.f., Figure 6.10), v intermediate statessij,1, . . . , sij,v are added such that the mean time of transition duration via intermediatestates equals duration mean time of the direct transition si → sj .13 Limiting transitionprobabilities pij are adapted by distributing a fixed, prespecified amount of probabilitymass equally to the intermediate states. For example in Figure 6.11, if it is specifiedupfront, that 10% of probability mass should be assigned to intermediates, then p12 andp13 are scaled by 0.9 and the probabilities from state s1 to all the intermediates equals 0.1

4 .Observation probabilities of intermediate states are not subject to training and hence priorprobabilities P (oj) estimated from the entire training data set are used.

13That is, e.g., if the mean transition time from state s1 to s2 is µ12 = 12s and there are two intermediatestates s12,1 and s12,2, mean durations from s1 to s12,1, from s12,1 to s12,22, and from s12,2 to s2 are allfour seconds, but the transition from s1 to s12,2 is eight seconds

128 6. The Model

6.7 Computational ComplexityAn assessment of computational complexity for most machine learning techniques hasto consider two cases: training and online application. Training is performed offline andcomputing time is hence less critical than the application of the model, which is in thiscase the online prediction of upcoming failures. Both cases are investigated separately.

Application complexity. The approach to failure prediction presented here involvescomputation of the forward algorithm for each sequence. The forward algorithm of stan-dard HMMs is of the order O(N2 L) as can be seen from the trellis shown in Figure 4.3on Page 59: for each of the L + 1 symbols of the sequence, a sum over N terms has tobe computed for each of the N states. However, this is only true if really all predecessorsare taken into account. If the implementation uses adjacency lists, this assessment appliesonly to ergodic (fully connected) model structures. In case of frequently used left-to-rightstructures complexity goes down to O(NL).

Complexity of the Viterbi algorithm is the same since the sum of the forward algo-rithm is simply replaced by a maximum operator, which also has to investigate all Npredecessors in order to select the maximum value.

Complexity of the Backward algorithm is also equal to the forward algorithm, al-though multiplication of bsi(Ot) cannot be factored out —but since constant factors donot change the class of complexity in the O-calculus, the same class results.

Turning to HSMMs, the algorithms belong to the same class of complexity, since theonly difference between the algorithms is that aij is replaced by vij(dk). More precisely:

aij ⇔ pijR∑r=0

wij,r κij, r(d|θij,r) for i 6= j . (6.114)

κij,r(d) are cumulative probability distributions that have to be evaluated for delayd. Depending on the type of distribution this might involve more or less computationssince for, e.g., Gaussian distributions, there is no formula for the cumulative distribution.However, since R is constant (and most likely less than five) irrespective of N and L, itis a constant factor and complexity in terms of the O-calculus is the same as for standardHMMs. For the case that the process has stayed in state si (j = i), computations are evenless costly if the products pij dij(d); j 6= i are summed up “on the fly”.

Training complexity. Estimating overall complexity of the Baum-Welch algorithm is adifficult task since the number of iterations is depending on many factors such as

• model initialization, which is in many cases random

• quality and quantity of the training data, which includes the number of trainingsequences

• appropriateness of the HMM assumptions

• appropriateness of model topology

• number of parameters of the model. In case of a standard HMM, the number isdetermined byN values for π, up toN2 transition probabilities aij in case of a fully

6.7 Computational Complexity 129

connected HMM, and NM observation probabilitiesB. Since M is determined bythe application, it is assumed to be constant. Hence, the number of parameters isO(N2).

Some approaches have been published that try to predict computation time (e.g., Hoff-mann [120]) but since these models are based on measurements, they do not help toderive an O-calculus assessment. Due to the number of parameters being in the order ofO(N2) it is assumed here that also the number of iterations is ∈ O(N2), which in realityis a quite loose upper bound. In fact convergence can be much better if a large amount ofconsistent training data is available. Furthermore, in real applications, a constant upperbound for the number of iterations is used. Note that this does not guarantee that the train-ing procedure is close to a local maximum. However, since usually training is repeatedseveral times with different random initializations, this drawback is relatively small.

Complexity of one reestimation step can be determined: The E-Step of the EM al-gorithm involves execution of the forward-backward algorithm of complexity O(N2L).Then, to accomplish the M-step, reestimation of

π requires O(N) steps

B requires O(NL) steps

A requires O(N2L) steps

for each sequence. Hence the overall training procedure also has complexity O(N2L).Putting this together with the number of iterations, overall training complexity is of theorder of O(N4L). Similar to model application, complexity of models used in real appli-cations (e.g., left-to-right topology) is less.

Turning to HSMMs, reestimation of π and B remains the same while reestimationof A is replaced by an iterative approximation procedure, which leads to an increasedcomplexity of HSMMs:

• The optimization algorithm has to be run for each of the N states.

• For a fully connected model, the number of parameters that have to be estimatedincreases by const ∗ (N − 1), which is in the order of O(N).

• Computing the gradient involves the sum over all training data, which is O(L).

• Since a few gradient-based optimization steps are sufficient and assuming constantcomplexity to determine the step size, the number of iterations can be limited toO(1).

The resulting complexity is:

N ∗ O(N) ∗ O(L) ∗ O(1) = O(N2L) . (6.115)

Assuming the number of iterations of the outer EM algorithm to beO(N2) as before, thisagain yields an overall complexity of O(N4L). Again, in real applications such as onlinefailure prediction, a left-to-right structure is used, which also limits training complexity ofeach iteration toO(NL). Additionally, a constant upper bound on the number of iterationscan be applied, showing the same drawback as for standard HMMs. In general, thisanalysis shows the sometimes misleading over-simplification of theO-calculus: althoughbelonging to the same complexity class, it should be noted that HSMMs are clearly morecomplex than standard HMMs. However, as experiments along with the case study willshow, computation times are still acceptable (see Sections 9.4.2, 9.7.1, and 9.9.5).

130 6. The Model

6.8 SummaryHidden Semi-Markov Models (HSMMs) are a combination of semi-Markov processes(SMPs) and standard hidden Markov models (HMMs): Standard HMMs employ a dis-crete time Markov chain for the stochastic process of hidden state traversals, which isreplaced by a continuous time SMP in the case of HSMMs. Although it is not the firsttime that such a combination has been proposed, previous approaches were limited todiscrete time steps of length ∆t and/or have used state duration distributions instead oftransition durations and / or were limited to a maximum duration.

The forward, backward, and Viterbi algorithm have been derived yielding algorithmsthat are of the same complexity class14 as the algorithms of standard HMMs. This hasbeen achieved by a strict application of the Markov property and the assumption that astate transition takes place each time an observation occurs. Although this might soundtoo simplistic, a comparison of event-triggered temporal sequence processing with thesituation encountered in speech recognition reveals why this assumption is appropriate fortemporal sequence processing: temporal properties of the process appear at the surfaceand are expressed by the time when events occur whereas speech recognition operates onperiodic (i.e., equidistant) sampling, and hence the underlying temporal properties do notappear in observation data.

The forward or Viterbi algorithm are used for sequence recognition. Sequence pre-diction aims to forecast the further development of the stochastic process. There are twodifferent types of prediction: first, it might be of interest what the next observation sym-bol at a certain time in the future will be, and second, the probability that the stochasticprocess reaches a distinguished state up to some time t in the future can be computed.Solutions to both goals have been derived.

Training of HSMMs is accomplished in a similar way to standard HMMs: Based onthe forward and backward algorithm, an expectation maximization algorithm is employed.However, the formulas known from standard HMMs can only be adopted for initial stateand observation distributions π and bsi(oj), respectively. Limiting transition distributionspij and transition durations dij(dk) need to be optimized by an embedded gradient-basedoptimization procedure. The entire training procedure has been summarized on Page 111.A proof to show that the training procedure converges to a local maximum of sequencelikelihood, has been presented. It is based on the notion that EM algorithms performlower-bound optimization from which a so-called Q-function can be derived. This deriva-tion has been applied to the case of HSMMs yielding three terms that can be optimizedindependently. The proof investigates all terms and derives the training formulas.

Topics that are relevant to the application of HSMMs to online failure prediction havebeen covered, including a two step model construction process. Together with the appli-cation of background distributions, this process increases model bias and lowers variance,as is shown in Section 7.3.

Finally, complexity of the derived algorithms has been assessed using theO-calculus.For a fully connected (ergodic) model, both the algorithms for standard HMMs andHSMMs are of complexity O(N4L), assuming the number of outer EM iterations tobe of O(N2). However, the constant factors, which are hidden by the O-calculus, aresignificant for HSMMs. Furthermore, for many applications complexity is reduced toO(NL).

14in terms of the O-calculus

6.8 Summary 131

Contributions of this chapter. HSMMs, as proposed in this chapter, follow a novel ap-proach to extend hidden Markov models to continuous time. The fundamental differencebetween periodically sampled input data of applications such as speech recognition andevent-triggered temporal sequences is that temporal aspects of the underlying stochasticprocess is revealed at the level of observations. By exploiting this difference, a hiddensemi-Markov model has been proposed that operates on true continuous time rather thandiscrete time steps. It is able to model transition durations rather than state sojourn times,and does not require specification of a maximum duration. Furthermore, the model pro-vides great flexibility in terms of the distributions used and offers the possibility to incor-porate background distributions for transition durations. Moreover, the algorithm is of thesame complexity class as standard hidden Markov models.

Relation to other chapters. In online failure prediction, similarity of an error sequencethat has been observed in the running system is compared to failure-prone sequences ofthe training dataset by computing sequence likelihood. Since at least two HSMMs areused —one for similarity to failure and one for non-failure sequences— a classificationstep is needed in order to come to a final evaluation of the current system status. Severalapproaches to classification are presented in the next chapter.

Chapter 7

Classification

Classification is the last stage of the failure prediction process (see Figure 2.10 onPage 20). Classification facilitates a decision whether the current status of the system, asexpressed by the observed error sequence, is failure-prone or not. This chapter discussesissues related to that topic. More specifically, in Section 7.1 Bayes decision theory is intro-duced, while topics directly related to failure prediction are discussed in Section 7.2. Asthe outcome of the classifier is a decision that can be either right or wrong, classificationerror is analyzed in more detail in Section 7.3. This includes the bias-variance-dilemmaand approaches how the trade-off between bias and variance can be controlled.

7.1 Bayes Decision TheoryClassification, in its principal sense, denotes the assignment of some class label ci, i ∈{0, . . . , u}, to an input feature vector s. It seems not surprising that a decision theorybearing the name of Revd. Thomas Bayes is a stochastic formal foundation to derive andevaluate rules for class label assignment based on the Bayes rule. The principal approachof Bayesian decision is that class label assignment is based on the probability for classci after having observed feature vector s, which is the so-called posterior probabilitydistribution P (ci | s). Applying Bayes’ rule, the posterior can be computed by:

P (ci | s) = p(s | ci)P (ci)p(s) = p(s | ci)P (ci)∑

l p(s | cl)P (cl), (7.1)

where p(s | ci) is called the likelihood, and P (ci) is called the prior. The likelihoodexpresses that certain features occur with different probabilities depending on the trueclass. The prior accounts for the fact that classes ci are not equally frequent. Due to thefact that classification theory has mainly been developed for continuous feature vectors,there are infinitely many s’es and likelihood p(s | ci) is a probability density, which isdenoted by a small letter “p”.

133

134 7. Classification

7.1.1 Simple ClassificationThe simplest classification rule is to assign an observed feature vector s to the class withmaximum posterior probability

class(s) = arg maxci

P (ci|s) (7.2)

= arg maxci

p(s|ci) P (ci)∑l p(s|cl) P (cl)

(7.3)

= arg maxci

p(s|ci) P (ci) . (7.4)

The last step from Equation 7.3 to Equation 7.4 can be performed since the denominatoris independent of ci and hence does not influence the arg max operator.

This classification rule seems intuitively correct, and it can be shown that it minimizesthe missclassification error (see Bishop [30]). For the sake of simplicity, let us assumethat there are only two classes c1 and c2. Let Ri denote a decision region, which is a notnecessarily contiguous partition of feature space: if a data point s occurs within regionRi, class label ci is assigned to s. The total probability of misclassification, i.e., the error,is given by:

P (error) = P (s ∈ R2, c1) + P (s ∈ R1, c2) (7.5)= P (s ∈ R2 | c1)P (c1) + P (s ∈ R1 | c2)P (c2) (7.6)

=∫

R2p(s | c1)P (c1) ds+

∫R1p(s | c2)P (c2) ds . (7.7)

The boundaries between decision regions are known as decision surfaces or decisionboundaries. Figure 7.1 visualizes Equation 7.7 for a one-dimensional feature space sand two continuous regions defining a single decision boundary θ. It can be seen from the

Figure 7.1: Classification by maximum posterior for a two-class example. The curves showp(s|ci)P (ci) and hatched areas indicate the error. R1 and R2 are decision re-gions: Every s within R1 is classified as c1 and within R2 as c2. It can be seenthat the error is minimal if the decision boundary θ equals the point where the twoprobabilities cross.

figure that the total probability of an error (i.e., the hatched area in the figure) is minimal

7.1 Bayes Decision Theory 135

if θ is chosen to be the value of s for which p(s | c1)P (c1) = p(s | c2)P (c2). From thisfollows that the decision rule given in Equation 7.4 results in minimum probability ofmisclassification for two classes. The resulting minimum error for this boundary is calledthe Bayes error rate.

In the case of more classes, it is easier to compute the probability of correct classifi-cation

P (correct) =∑ci

∫Rip(s | ci)P (ci) ds . (7.8)

Choosing decision regions such that the probability of correct classification is maximizedleads to Equation 7.4 in its general form for multiple classes. In summary, the Bayesclassifier chooses decision regions such that the probability of correct classification ismaximized. No other partitioning can yield a smaller probability of error (Duda & Hart[84]).

7.1.2 Classification with CostsThe classification rule derived above has not considered any cost or risks involved withclassification. However, cost can influence classification significantly. For instance inthe case of medical screening, classifying an image of a tumor as normal is much worsethan the reverse. The same might hold for failure prediction, too: Not predicting anupcoming failure might cause much higher cost than spuriously predicting a failure whenthe system is actually running well. In order to account for cost, a cost or risk matrix isintroduced.1 Each element of the risk matrix rta defines the cost / risk associated withassigning a pattern s to class ca when in reality it belongs to class ct. Although the term“risk” might not seem appropriate for cases where the correct class label is assigned, theterm is used here. Instead of minimizing the probability of error, an optimal cost-basedclassification minimizes expected risk. To derive a formula, first the expected risk ofassigning a sequence s to class ca is considered:

Ra(s) =∑t

rta P (ct | s) . (7.9)

Since class ca is assigned to all s ∈ Ra, the average cost of assignment to class ca is:

Ra =∫

Ra

∑t

rta P (ct | s) p(s) ds =∫

Ra

∑t

rtap(s | ct)P (ct)

p(s) p(s) ds (7.10)

and the total expected risk equals

R =∑a

Ra =∑a

∫Ra

∑t

rta p(s | ct)P (ct) ds . (7.11)

Risk is minimized if the integrand is minimized for each sequence s, which is achievedby choosing the decision region for assignment to class ca such that s ∈ Ra if∑

t

rta p(s | ct)P (ct) <∑t

rti p(s | ct)P (ct) ∀ i 6= a (7.12)

resulting in a Bayes decision rule where minimum loss across all assignments for se-quence s is chosen. If two assignments have equal loss, any tie-breaking rule can beused.

1In classification, the matrix is also called loss matrix


7.1.3 Rejection ThresholdsBishop [30] mentions that classification can also yield the result that a given instancecannot be classified with enough confidence. The idea is to classify a sequence s only ifthe maximum posterior is above some threshold θ ∈ [0, 1] (c.f., Equation 7.2):

class(s) =

ck = arg maxci P (ci | s) if P (ck | s) ≥ θ

∅ else .(7.13)

Rejection thresholds might be useful for online failure prediction if there is a human oper-ator who can be alerted if a sequence cannot be classified in order to further investigate orobserve the system’s status. However, since experiments carried out in this work are onlybased on a data set (there has been no operator to alert), rejection thresholds have not beenapplied, here. A second application of rejection thresholds is concerned with improvingcomputing performance of classifiers: In a first step, simple classifiers can be used toclassify the non-ambiguous cases. More complex situations, for which the simple classi-fications do not exceed the rejection thresholds, more sophisticated but computationallymore expensive methods can be applied to further analyze the situation. However, sinceoptimization of computing performance is not the purpose of this dissertation, such ap-proach has also not been applied in this thesis.

7.2 Classifiers for Failure PredictionBayesian decision theory provides the basic framework for classification. In this section,failure prediction specific as well as practical issues are discussed. Note that now proba-bility p(s | ci) denotes likelihood of sequence s, which has been observed during runtime.In case of hidden Markov models, sequence likelihood is computed by the forward algo-rithm.

7.2.1 Threshold on Sequence LikelihoodThe simplest classification rule is to have only one single HSMM trained on all failuresequences irrespective of the failure mechanism, and to apply a threshold θ ∈ [0, 1] tosequence likelihood p(s |λF ) where λF denotes a model that has been trained on failuredata, only. The problem is that observation sequences s are delimited by a time window∆td (c.f., Figure 5.4 on Page 79) resulting in a varying number of symbols in observationsequences. Sequence likelihood decreases monotonically with the number of observationsymbols and hence threshold θ has to depend on the number of observation symbols. Fur-thermore, experiments have shown that such approach does not result in decisive models.For these reasons, the method of simple thresholding is not used in this thesis.

7.2.2 Threshold on Likelihood RatioOne way to circumvent the problem of varying length of observation sequences is to useexactly two models —λF for failure and λF for non-failure sequences— and to computethe ratio of sequence likelihoods. A failure is predicted if the ratio is above some threshold

7.2 Classifiers for Failure Prediction 137

θ ∈ [0,∞). More formally, a failure is predicted if

P (s |λF )P (s |λF ) > θ . (7.14)

In order to analyze this approach it is cast into the framework of Bayes decision theory.However, to simplify affairs, formulas of Bayes decision theory become more handy ifrephrased for the two-class case. From Equation 7.12 follows that the classifier shouldopt for a failure if

rFF p(s | cF )P (cF ) + rFF p(s | cF )P (cF ) <

rFF p(s | cF )P (cF ) + rF F p(s | cF )P (cF )(7.15)

⇔ (rFF − rFF ) p(s | cF )P (cF ) < (rF F − rFF ) p(s | cF )P (cF ) . (7.16)

Under the reasonable assumption that rF F < rFF , which means that the cost associatedwith classifying a non-failure-prone situation correctly as o.k. are less than cost associatedwith falsely classifying it as failure-prone, equations can be transformed as follows:

⇔ rFF − rFFrF F − rFF

p(s | cF )P (cF ) > p(s | cF )P (cF ) (7.17)

⇔ p(s | cF )p(s | cF ) >

(rFF − rF F )P (cF )(rFF − rFF )P (cF ) . (7.18)

Identifying the likelihoods p(s | cF ) with estimated sequence likelihoods P (s |λF ) ob-tained from the model, it can be seen that classification by threshold on sequence likeli-hoods is optimal if for threshold θ holds:

θ = (rFF − rF F )P (cF )(rFF − rFF )P (cF ) . (7.19)

7.2.3 Using Log-likelihoodIn many real applications and models such as hidden semi-Markov models, sequencelikelihoods P (s |λt) get too small to be computed and hence the log-likelihood is used(c.f., Equation 6.19 on Page 101). However, this does not rule out the Bayes classifica-tion to be used since the logarithm is a strictly monotonic increasing function and henceEquation 7.18 can be transformed into

log[p(s |λF )

]− log

[p(s |λF )

]> log

[rFF − rF FrFF − rFF

]︸︷︷︸

∈(−∞;∞)

+ log[P (cF )P (cF )

]︸︷︷︸

const.

. (7.20)

Usefulness of the formula can be seen more easily if only cost for misclassification aretaken into account, which means rFF = rF F = 0. Hence,

θ = log[rFFrFF

]+ c . (7.21)


Equation 7.21 approaches−∞ if rFF and rFF → 0. In other words, if cost for incorrectlyraising a failure warning approaches zero, the threshold θ gets infinitely small and con-sequently classifying every event sequence as failure-prone results in minimal cost. Onthe other hand, if cost for such misclassification is high, then it must be quite evident thatthe current status is failure-prone, i.e., big difference in sequence log-likelihoods, until afailure warning is raised. In terms of rFF the situation is inverse.

7.2.4 Multi-class Classification Using Log-LikelihoodAs can be seen from Figure 2.10 on Page 20, in the approach presented here, one non-failure model and u failure models —one for each failure mechanism— are used to pre-dict a failure, which naturally leads to a multi-class classification problem. If sequencelikelihoods P (s |λt) would be available, then Equation 7.12 had to be used for classifi-cation. However, in real applications only log-likelihoods logP (s |λF ) are available butEquation 7.12 cannot be solved to include singleton logP (s |λt) terms. Therefore, themulti-class classification problem is turned into a two-class one by selecting maximumlog sequence likelihood of failure models and comparing it to log sequence likelihood ofthe non-failure model:

class(s) = F ⇔ umaxi=1

[logP (s |λi)

]− logP (s |λ0) > log θ , (7.22)

where θ is as in Equation 7.19. The motivation for the approach is as follows: Failuremodels are related since they all indicate an upcoming failure. If the system encounters anupcoming failure, the observed error sequence is the outcome of exactly one underlyingfailure mechanism. Hence the failure model that is targeted to this failure mechanismshould recognize the error sequence as most similar, which is expressed by maximumsequence log-likelihood. An additional advantage of the approach is that the cost matrixdefining θ only has four elements, which can be overseen and determined more easily.

7.3 Bias and VarianceBayes decision theory has been based on minimizing classification error for each singleobservation sequence (c.f., Equation 7.5). However, the classifier is trained from somefinite training data set. Analyzing dependence on training data yields fundamental insightsinto machine learning, which in turn lead to improved modeling techniques. In order todescribe the concept, bias and variance are first derived for regression, as it has beendeveloped by Geman et al. [104]. Having the concept in mind, the work of Friedman [98]is described, who has proposed an analysis of bias and variance for classification. Thepurpose of presenting this material is to provide the background for a discussion of biasand variance in the context of failure prediction and for an overview of known techniquesto control the trade-off between bias and variance. For further details, please refer totextbooks such as Bishop [30] or Duda et al. [85].

7.3.1 Bias and Variance for RegressionMachine learning techniques usually try to estimate unknown mechanisms / interrelationsfrom training samples, which leads to different resulting models depending on the data

7.3 Bias and Variance 139

Figure 7.2: Mean square error in regression problems. Dots in each figure indicate two differ-ent training datasets D1, and D2 from which (in this case linear) models y(s;Di)have been trained. Mean square error is determined by (y(s;D)− t(s))2, wheret(s) is the target value at point s.

present in the training data set. This is due to the fact that training data is a finite sampleand the system under investigation might also be stochastic. The following considerationsassess the dependence on the choice of training data, resulting in an analysis of bias andvariance. A common way to explain the two terms is to first investigate mean square errorE for regression: The error is measured by square of the difference between y(s;D),which is the output value for input data point s of some model that has been trainedfrom training data set D of fixed size n, and the target value t(s) (see Figure 7.2). Sincetraining data is a finite sample, resulting models may vary with every different trainingdataset. The expected error over all training datasets for one data point s is computed anddecomposed as follows: (c.f., e.g., Alpaydin [6]):

E = ED[(y(s;D)− t(s)

)2]

(7.23)

= ED[y2]− 2 t ED [y] + t2 (7.24)

= ED[y2]− 2 t ED [y] + t2 + ED [y]2 − ED [y]2 (7.25)

= ED [y]2 − 2 t ED [y] + t2 + ED[y2]− ED [y]2 (7.26)

=(ED [y(s;D)]− t(s)

)2

︸︷︷︸Bias2

+ ED[y(s;D)2

]− ED

[y(s;D)

]2︸︷︷︸

V ariance

. (7.27)

Equation 7.27 indicates that the mean squared deviation from the true target data of anymachine learning method consists of two parts:

1. ability to mimic the training data set (bias)

2. sensitivity of the training method to variations in the selection of the training dataset (variance)

The relation can be understood best if two extreme cases are considered:


• Assume a machine learning technique that memorizes all training data points. Suchtechnique has a bias of zero. However, the resulting model is strongly different fordifferent selections of the training data set resulting in high variance.

• Assume a “learning” technique that does not adapt to the training data at all (e.g., afixed straight line), then the resulting model is the same irrespective of the data set(zero variance). However, deviation from the target values is quite high resulting ina high bias.

The key insight of Equation 7.27 is that in order to obtain a model with small averageerror on s, both bias and variance must be reduced. A good model achieves a balance ofunderfitting (high bias, low variance) and overfitting (low bias, high variance), which isalso known as the bias and variance!dilemma.

7.3.2 Bias and Variance for ClassificationThe above derivations investigated mean square error for regression problems. Turningto classification, the situation is different. In two-class classification, there are only twotarget values t ∈ {0, 1}. Mean squared error (y(s;D)− t)2 could be used to measureproximity of the model output to binary target data as well, but this is not a proper ap-proach. Regard, for example, a classifier that yields output y(s) = 0.51 for t = 1 andy(s) = 0.49 for t = 0 for all s. This is a perfect classifier since with a threshold of 0.5 alls would be classified correctly. However, with the mean square error, the classifier wouldreceive a high bias. Friedman [98] was one of the first to investigate this problem and toderive an assessment of bias and variance for classification problems. Although otherssuch as Shi & Manduchi [240], Domingos [82] have developed the topic further, only thebasic findings of Friedman are presented here.

The regression problem of the previous section involved the notation y(s;D) to de-note the output of the model. In terms of classifiers, classification is based on modeledposterior class probability (c.f., Equation 7.1):

f(s;D) = P (c = 1|s) = 1− P (c = 0|s) , (7.28)

which is an estimate of the true posterior probability f(s) = P (c1|s). The posterior esti-mate f(s;D) is used to classify intput s in a Bayes classifier. In a two-class classificationproblem, the assigned class label is determined by:

c(s;D) = IA

(f(s;D) ≥ r01

r01 + r10

), (7.29)

where IA(x) is the standard indicator function and rta denote classification risk as inEquation 7.12. Correspondingly, the optimal classification is based on the true posterior:

cB(s) = IA

(f(s) ≥ r01

r01 + r10

), (7.30)

which results in cost minimal (Bayes) classification. In order to simplify notations, equalcost r01 = r10 is assumed such that the decision level is set to 1/2. Figure 7.3 shows thesituation.

Similar to derivation of bias and variance for regression, the estimated posteriorf(s;D) is a random variable depending on the training data set D. For one training


Figure 7.3: True posterior probability f(s), and estimated posterior f(s;D) obtained fromtraining using dataset D. In regions of s where f(s) and f(s;D) are on thesame side of the Bayesian decision boundary 1/2, a correct classification resultsand classification error rate is minimal (regions R2 and R4). If not, the classifierbased on f(s;D) assigns the wrong class label resulting in maximal classificationcost (for s in that region)

data set, f(s;D) may be on the correct side of the decision boundary (for s), for anotherdata set not. In order to handle this dependency on training data, again the expected valueED is used to assess the average misclassification rate

P(c(s) 6= c(s)

)= ED

[P(c(s;D) 6= c(s)

)], (7.31)

where c(s) is the true class of input s. It can be shown that Equation 7.31 can be separatedinto the minimal Bayes error P

(cB(s) 6= c(s)

)and a term that is linearly dependent on

the so-called boundary error P(c(s) 6= cB(s)

). Since the Bayesian error does not depend

on the classifier, only the boundary error needs to be investigated.For further assessment, Friedman assumes that the estimated posterior f(s;D) is dis-

tributed —for varying datasets D— according to p(f(s)

), which is unknown in general.

However, since many machine learning algorithms (including Baum-Welch) employ av-eraging, p

(f(s)

)can be approximated by a normal distribution:

p(f(s)

)= N

(ED[f(s;D)]; Var[f(s;D)]

). (7.32)

In order to compute the boundary error P(c(s) 6= cB(s)

), the desired quantity is the

probability that f(s) and f(s) are on opposite sides of the decision boundary 1/2, whichyields (see Figure 7.4):

P(c(s) 6= cB(s)

)=

∫∞

1/2 p(f(s)

)df if f(s) < 1/2

∫ 1/2−∞ p

(f(s)

)df if f(s) ≥ 1/2 .

(7.33)

The two cases can be turned into one using the sign function:

P(c(s) 6= cB(s)

)= Φ

(sign

(f(s)− 1/2

) (ED[f(s;D)]− 1/2

)︸︷︷︸boundary bias

(Var[f(s;D)]︸︷︷︸

variance

)−1/2)

(7.34)


Figure 7.4: Distribution of estimated posterior

where Φ is the upper tail integral of the normal distribution.2 Plots of the boundary erroras a function of f and ED[f ] are provided for two values of Var[f ] in Figure 7.5.

Ef

−0.5

0.0

0.5

1.0

1.5

V[f^]

0.0

0.5

1.0

1.5

boundary error

0.0

0.2

0.4

0.6

0.8

1.0

Boundary Error, f = −0.25

f

0.0

0.2

0.4

0.6

0.8

1.0

E[f^]

−0.5

0.0

0.5

1.0

1.5

boundary error

0.0

0.2

0.4

0.6

0.8

1.0

Boundary Error, var[f^] = 0.05

(a) (b)

Figure 7.5: Boundary error P(c(s) 6= cB(s)

). Plot (a) shows dependence on ED[f(s;D)]

and Var[f(s;D)] for a given true posterior of f(s) = −0.25. Plot (b)shows dependence on true posterior f(s) and expected value ED[f(s;D)] forVar[f(s;D)] = 0.05. Note that depending on the modeling technique estimatesf(s;D) may exceed the range [0, 1]. This is not a problem since classification isperformed by comparing f(s) to the decision boundary 1/2

Several key insights into the nature of classification error can be gained from this:

1. From Equation 7.34 it can be seen that bias and variance affect each other in a mul-tiplicative way rather than additive as in the case of regression (c.f., Equation 7.27).This results in the complex relationship seen in Figure 7.5-a.

2. Small classification errors can only be achieved if variance Var[f(s;D)] is low.

2Hence Φ(·) = 1− erf(·)


However, this is only true if boundary bias is positive, i.e., f(s) and f(s;D) are onthe same side of the decision boundary 1/2. If it is negative, a very large classifica-tion error results.

3. Except for the special case of zero variance, the error rate is depending on thedistance of ED[f(s;D)] from the decision boundary 1/2. For this reason, bias isexpressed as boundary bias.

4. The error rate of the classifier is not depending on distance between f(s) and thedecision boundary, as long as f(s) and ED[f(s;D)] are on the same side. In Fig-ure 7.5-b, this can be seen that for fixed ED[f(s;D)] the boundary error is the samefor all f < 1/2 and f ≥ 1/2, respectively.

From this discussion follows that optimal classification3 is achieved for small variance(resulting models are more or less equal regardless of the selection of training data) pro-vided the fact that the training algorithm on average yields an estimation of the posteriorprobability that is on the correct side of the Bayes decision boundary.

Note that all the formulas derived above have evaluated only one single s. If theoverall error rate is to be assessed, a further integral is needed:

P (c 6= c) =∫ ∞−∞

P(c(s) 6= c(s)

)p(s) ds . (7.35)

7.3.3 Conclusions for Failure PredictionThe detailed analysis of classification error with respect to bias and variance has shownthat first, there is a trade-off between underfitting and overfitting, and second, in the caseof classification, small variance is more important than small bias. For this reason, biasand variance have to be controlled in order to achieve a robust classifier. A manifold oftechniques exist among which a few are shortly described here including a discussionwhether they can be used for online failure prediction with HSMMs, or not.

• The most intuitive golden rule for machine learning approaches is to increase theamount of training data. However, in most real application the amount of availabletraining data is limited, either since cost for data acquisition is too high or, as inthe case of failure prediction, data acquisition simply takes too long. Since in mostapplications one part of available data is used for training and the other is used toassess the generalization / prediction quality of the models, it is suggested to usetechniques such as m-fold cross validation to make full use of the limited data. Thistechnique has also been used in this thesis (see Section 8.3.3).

• Training with noise. In the case that not enough training data is available, noisecan be synthetically added to the training data in order to divert the training proce-dure and to avoid memorizing of training data points (overfitting), hence increasingbias and lowering variance. In the case of regression, “noise” refers to a simplezero mean stochastic process being added to measurement data. However, it is notclear how this concept translates into failure sequences. While a zero mean randomnumber could be added to the delay between error events, it seems hazardous to

3Remember that the overall error rate is the sum of Bayes error and a term linear in P(c(s) 6= cB(s)

)


interchange the event type, which is a nominal, i.e., non-ordinal, variable.4 Hence,this technique could not be applied in this thesis.

• Early stopping. Many machine learning techniques apply an iterative estimation al-gorithm to stepwise adapt model parameters to the training data. This correspondsto a stepwise transition from under- to overfitting. The idea of early stopping isto evaluate generalization performance with a data set that is not used for train-ing and to halt the training procedure once the validation error begins to rise (seeFigure 7.6). Experiments have shown that early stopping does not seem to be an ap-

Figure 7.6: Early stopping. Error for the training data decreases with every training step ap-proaching some minimum error. Evaluating generalization performance using aseparate validation data set shows an increasing error after some number of train-ing steps due to the fact that the model is overfitting the training data. Early stop-ping interrupts the training procedure once validation error begins to rise

propriate technique for hidden semi-Markov models. The reason for this is that theBaum-Welch estimation procedure sets all observation probabilities of symbols thatdo not occur in the training data set to zero in the first iteration. As early stoppingcan only halt at integer steps, the first possible stop is already “too late”. It has alsobeen tried to combine early stopping with background distributions but this did notresult in significant improvement in comparison to the application of backgrounddistributions alone.

• Growing and pruning. One of the major factors influencing the trade-off betweenbias and variance is the number of free parameters of the model: Provided that thereis enough training data, the greater the number of free parameters, the better a modelcan memorize training data points resulting in a low bias but high variance. The ideaof growing or pruning algorithms is to iteratively increase / decrease the number offree parameters until an optimal solution is found. In hidden Markov models, thenumber of parameters is mainly determined by the number of states and transitionsand hence algorithms try to add / delete edges or nodes / states following somemostly heuristic rule. Bicego et al. [28] have proposed several pruning algorithms.However, these methods can only be applied to models with recurrent states, whichis not the case for the models used for online failure prediction.

• Model order selection. As discussed above, growing and pruning are not applica-

4If a numbering scheme for event IDs similar to the one proposed in Section 5.4.2 is used, adding noisecould be applied. However, data of the telecommunication platform did not provide such numbering.


ble within an automatic rule-based approach. However, “growing and pruning” isachieved by simple trial and error for some range of model parameters such as thenumber of states, number of intermediate states or maximum span of shortcuts. Inthis approach, the most appropriate model is selected applying techniques such ascross-validation.

• Parameter tying. The number of free parameters of some model classes such asneural networks and hidden Markov models can be reduced if several parametersare “grouped”. In the case of hidden semi-Markov models, for example, transitionparameters pij of several transitions can be forced to be equal, which reduces thenumber of free parameters. However, in order to apply tying wisely, not blindly,strong assumptions and hence detailed knowledge about the modeled process arenecessary, which is not the case for the problem addressed in this dissertation.

• Background distributions, intermediate states, and shortcuts. Observation proba-bilities of hidden Markov models can be mixed with so-called background distri-butions (c.f. Page 112). Background distributions “blur” the output probabilities ofthe HMM which results in an increased training bias but reduced variance. If ob-servation probabilities are trained using the Baum-Welch algorithm (as is the casefor this thesis) the application of background distributions is especially importantto circumvent the problem of zero probability for observation symbols not occur-ring in the training data set. Intermediate states and shortcuts added to the modeltopology (see Section 6.6) have a similar effect on specificity of state transitions.Furthermore, HSMMs also allow to incorporate background distributions into tran-sition durations. Due to the fact that (a) observation background distributions havebeen available for the HMM toolkit on which the implementation of HSMMs isbased, (b) transition background distributions are within the core of HSMMs, and(c) intermediate states and shortcuts can easily be incorporated by modifying themodel structure, these techniques have primarily been used in this thesis.

• Regularization. The techniques described so far have left the core of the trainingprocedure untouched. Regularization methods modify the training procedure itselfin that the objective function of training is modified such that model complexityis penalized. Many regularization techniques exist for neural networks (see, e.g.,Bishop [30]). However, for hidden Markov models there are less. Hence, regular-ization has been left over for future work.

• Aggregated models. Another group of techniques does not build on one singlemodel but rather on a population of component models that are aggregated to forma big one. One of the predominant techniques is called arcing5 among which bag-ging and boosting are most well-known. Bagging trains various component modelsby randomly chosen subsets of training data. The output of the aggregated modelis simply a majority vote among component models. Boosting, among which Ad-aBoost6 is most well-known, first trains a component model from a subset of train-ing data, and then subsequently trains further component models from data sets that

5Adaptive Reweighting and CombinING6“Adaptive Boosting”


consist half of input data that is correctly classified by the previous component mod-els and half of incorrectly classified training samples. By this method, subsequentcomponents models are somewhat complementary to their predecessors. See Dudaet al. [85] for an overview of these methods. In this thesis, aggregated models havenot been used. Nonetheless, the concepts could be applied without restrictions.

7.4 SummaryIn this chapter the theory of the last step of online failure prediction using a pattern recog-nition approach such as hidden semi-Markov models has been covered: the final classifi-cation whether the current status of the system, as expressed by the observed error eventsequence, is failure-prone or not.

In order to found the classification process on a theoretical framework, Bayes deci-sion theory has been introduced. It has been shown why the overall error rate of anyclassifier is minimal if decision boundaries are chosen at the points where posterior prob-ability distributions cross. This concept has been extended to multi-class classification,minimum cost classification and the use of rejection thresholds. Based on the framework,other straightforward classification schemes have been analyzed. Since in real applica-tions, log-likelihood is most commonly used, classification based on log-likelihood hasbeen investigated leading to the conclusion that only two-class classification can be used.Since the modeling approach of this thesis employs a model for each failure mechanism,all failure-related models are combined using the maximum operator, which is then com-pared to the sequence log-likelihood of the non-failure model.

The framework of Bayes decision theory is also the foundation for a detailed analysisof classifier error rate in terms of bias and variance. The so-called bias-variance dilemmahas been introduced by the simpler case of regression. Subsequently, an analysis forclassification has been presented. The main purpose of this excursion was to explain thenecessity to control the bias-variance trade-off of the modeling approach. Hence finally,a collection of well-known techniques has been described and each has been discussed inthe light of online failure prediction with hidden semi-Markov models.

Contributions of this chapter. The overview of main methods to control the trade-offbetween bias and variance is a collection of the techniques found in several textbookson machine learning and pattern recognition. Additionally, it is —to the best of ourknowledge— the first time the aspect of log-likelihood for multiclass classification isconsidered. Furthermore, some new figures and plots have been developed in the hope tomake Friedman’s theory more understandable.

Relation to other chapters. This chapter has covered the third stage of the comprehen-sive approach to online failure prediction pursued in this thesis: after data preprocessingand HSMM modeling, it has described the step of coming to a conclusion about the cur-rent status of the system.

This chapter also concludes the modeling part of the thesis. Being equipped withthe principal solution to the problem of online failure prediction, the next part turns to thethird phase of the engineering cycle: The application of the principal solution to industrialdata of a commercial telecommunication system.

Part III

Applications of the Model

147

Chapter 8

Evaluation Metrics

Having presented in detail the approach to online failure prediction, this third part of thethesis is concerned with the experimental evaluation of the approach. Experiments havebeen performed on data of an industrial telecommunication system. Before presentingexperimental results, this chapter introduces the metrics used for evaluation. Specifically,in Section 8.1 metrics related to failure sequence clustering are presented, and in Sec-tion 8.2, metrics to evaluate accuracy / quality of failure predictions are covered. Theevaluation process is described in Section 8.3 including the topic how statistical signifi-cance is assessed.

8.1 Evaluation of ClusteringData preprocessing includes clustering at two levels: first, when message IDs are assignedto log records and second, when failure sequences are grouped in order to separate failuremechanisms in the training data (c.f., Sections 5.1.1 and 5.2). Several aspects must beconsidered in the process of clustering: a (hierarchical) clustering algorithm must be cho-sen (i.e., agglomerative or divisive clustering), and in case of agglomerative clustering, theinter-cluster distance metric needs to be defined (i.e., nearest neighbor, furthest neighbor,unweighted pair-group average, or Ward’s method). Using dendrograms and banner plots,the choice of methods can be visually investigated in order to see whether the clusteringtechnique results in a clear and reasonable division. A more formal analysis is providedby the agglomerative and divisive coefficient that try to express “clusterability” as a realnumber between zero and one.

After clustering, the number of groups needs to be determined into which the data ispartitioned. This topic has been covered in Section 5.2.3, one of which is visual inspec-tion. For visual inspection, dendrograms or banner plots can be used as well.

8.1.1 DendrogramsDendrograms are tree-like charts that indicate which data points have successively beenmerged / divided in the course of agglomerative / divisive hierarchical clustering. InFigure 8.1, dendrograms for a simple six point example clustered with three different

149

150 8. Evaluation Metrics

clustering methods are shown. The tree structure indicates which data points are merged /

5 10 15 20

510

1520

data points

x1

x2

AB

C

D

E F

(a)A B

C

D

E F05

1020

divisive clustering

Divisive Coefficient = 0.78dissimilarities

Hei

ght

(b)

A B

C

D E F

02

46

810

agglomerative single linkage clustering

Agglomerative Coefficient = 0.63dissimilarities

Hei

ght

(c)

A B

C D

E F05

1020

agglomerative complete linkage clustering

Agglomerative Coefficient = 0.84dissimilarities

Hei

ght

(d)

Figure 8.1: Dendrograms for a six point example. (a) shows the data points to be clustered.(b) shows the result of divisive clustering, (c) agglomerative clustering with the sin-gle linkage distance metric, and (d) agglomerative clustering using the completelinkage distance metric

divided and the height of the connecting horizontal bar indicates the corresponding levelof the distance metric termed “height”. It can be seen that different clustering algorithmscan result in different groupings. In the example depicted in Figure 8.1, divisive andsingle linkage clustering suggest a division into two groups {A,B} and {C,D,E, F}while complete linkage clustering suggests three groups {A,B}, {C,D}, and {E,F}.

8.1 Evaluation of Clustering 151

8.1.2 Banner PlotsAlthough dendrograms provide an intuitive way to present the result of clustering, theyget overly complicated if the number of data points is increased. Rousseeuw [215] hasintroduced banner plots, which are more suited to large data sets. Therefore, in this dis-sertation, banner plots are used to visualize clustering results.

A banner plot is a horizontal plot that connects data points by a colored bar of lengthaccording to the level of division / merge. As is the case for dendrograms, this sometimesrequires reordering of data points. Figure 8.2 shows corresponding banner plots for thedendrograms shown in Figure 8.1-b and 8.1-d. Note that banner plots for divisive and

Height

divisive clustering

Divisive Coefficient = 0.78

26.9 24 20 16 12 8 6 4 2 0

F

E

D

C

B

A

(a)

Height

agglomerative complete linkage clustering

Agglomerative Coefficient = 0.84

0 2 4 6 8 12 16 20 24

F

E

D

C

B

A

(b)

Figure 8.2: Banner plots for divisive clustering (a) and agglomerative clustering based oncomplete linkage (b). The plots correspond to dendrograms (b) and (d) of Fig-ure 8.1

agglomerative clustering are reversed, since banner plots document the “operation” of theclustering algorithm, i.e., division and merging, from left to right.

8.1.3 Agglomerative and Divisive CoefficientDendrograms and banner plots visually give a notion of the data set’s “clusterability”.A formal metric addressing this aspect are divisive or agglomerative coefficient. For thedivisive algorithms, let d(i) denote the diameter of the last cluster to which observationi belongs (before being split off as a single observation) divided by the diameter of thewhole dataset. For agglomerative algorithms, let m(i) denote dissimilarity of observationi to the first cluster it is merged with, divided by the dissimilarity of the merger in the finalstep of the algorithm. Then divisive coefficient DC and agglomerative coefficient AC are


defined as follows:

DC = 1n

n∑i=1

1− d(i) ∈ [0, 1] (8.1)

AC = 1n

n∑i=1

1−m(i) ∈ [0, 1] . (8.2)

Both coefficients can be interpreted as average width of the banner plot, which is also ameasure for the “filling” of the banner plot. Since the banner plot is scaled such that thefirst split / last merger determines one border of the plot, the larger the filled area, theclearer is the structure in the data. Hence, AC and DC can be interpreted as an indicatorfor the strength of clustering structure in the data. However, with increasing number ofobservations n, both AC and DC grow and should therefore not be used to compare datasets of very different sizes.

8.2 Metrics for Prediction QualityThe output of online failure prediction is a binary decision whether the current statusof the system is failure-prone or not. Evaluating these binary decisions results in a so-called contingency table from which a variety of metrics can be inferred. The advantageof these metrics is that an intuitive interpretation of classification results exists. On theother hand, as explained in Chapter 7, decisions are subject to various parameters such asclassification cost and prior distributions. While prior distributions can be estimated fromthe data set, an assignment of classification cost is quite application specific and is notan easy task. Indeed, by choice of classification cost a comparison of failure predictionmethods can easily be tuned in favor of one or another failure prediction method. For thisreason, classification independent metrics are also used to evaluate the predictive powerof online failure prediction approaches.

The purpose of this section is to provide a comprehensive overview of the variousevaluation metrics for failure prediction algorithms. However, only

• precision, recall, true positive rate, false positive rate

• F-measure,

• precision-recall plot,

• ROC plot,

• AUC, and

• accumulated runtime cost

are used in this dissertation.

8.2 Metrics for Prediction Quality 153

8.2.1 Contingency TableObviously, the goal of any failure prediction is to predict a failure if and only if the systemreally is failure-prone. However, it can be doubted that any prediction algorithm will everreach such one-to-one match between failure predictions and true situation of the system.In fact, two types of mispredictions can occur:

• The failure prediction algorithm may predict an upcoming failure but in reality thesystem is running well so no failure is about to occur. This is called a false positive,or Type I error. In failure prediction, a positive prediction is also called a failurewarning and hence the misprediction is a false warning.

• The failure prediction algorithm may suggest that the system is in a correct, notfailure-prone state but this is not true. Such misprediction is called a false negativeor Type II error. Since there is no warning about the upcoming failure, this situationis also called a missing warning

Similarly, there are two cases for correct predictions:

• If the system is correctly identified as failure-prone, the prediction is a true positiveor correct warning

• if the system is correctly identified as non failure-prone, the prediction is a truenegative or correct no-warning

If for an experiment each prediction is assigned to one of the four cases and the numberof occurrence of each case is stored, a so-called contingency table is obtained, as shownin Table 8.1. The table is sometimes also called the confusion matrix (e.g., in Kohavi &

True Failure True Non-failure Sum

Prediction: Failure true positive (TP ) false positive (FP ) positives(failure warning) (correct warning) (false warning) (POS)Prediction: No failure false negative (FN ) true negative (TN ) negatives(no failure warning) (missing warning) (correctly no warning) (NEG)

Sum failures (F ) non-failures (NF ) total (N )

Table 8.1: Contingency table. Any failure prediction belongs to one out of four cases: if theprediction algorithm decides in favor of an upcoming failure, the prediction is calleda positive resulting in raising of a failure warning. This decision can be right orwrong. If in truth the system is in a failure-prone state, the prediction is a truepositive. If not, a false positive. Analogously, in case the prediction decides thatthe system is running well (a negative prediction) this prediction may be right (truenegative) or wrong (false negative)

Provost [146]), and it depends on lead-time ∆tl, prediction-period ∆tp and data windowsize ∆td (c.f., Figure 2.4 on Page 12).

.


8.2.2 Metrics Obtained from Contingency TablesVarious metrics have been proposed in different research communities that express var-ious aspects of the contingency table. Table 8.2 summarizes the metrics. Although thetable already lists the metrics that are used in this thesis, they are discussed shortly in thenext paragraphs. Please note further that the terms “precision” and “accuracy” are useddifferently than for measurements, where they refer to the mean deviation from the truevalue and spread of measurements. Moreover, there are at least seven more meanings of“precision”.

Name of the metric Symbol Formula Other names

Precision p TPTP+FP = TP

POS

ConfidencePositive predictive val.

RecallTrue positive rate

rtpr

TPTP+FN = TP

F

SupportSensitivityStatistical power

False positive rate fpr FPFP+TN = FP

NF Fall-out

Specificity 1− fpr TNTN+FP = TN

NF True negative rate

False negative rate 1− r FNTP+FN = FN

F

Negative predictive val. npv TNTN+FN = TN

NEG

False positive error rate 1− p FPFP+TP = FP

POS

Accuracy acc TP+TNTP+TN+FP+FN

Odds ratio OR TP ·TNFP ·FN

Table 8.2: Metrics obtained from contingency table (c.f., Table 8.1). Different names for thesame measures have been used in various research areas (rightmost column).Specificy, false negative rate, negative predictive value, and false positive errorrate are listed for completeness, they are not further discussed in this thesis asthey do not add a fundamentally different view on the contingency table.

Precision and recall. The terms precision and recall have originally been introducedfor information retrieval by van Rijsbergen [214]. Precision is defined as the ratio ofcorrectly identified failures to the number of all failure predictions. Recall is the ratio of


correctly predicted failures to the number of true failures:

Precision p = true positivestrue positives + false positives

= correct warningsfailure warnings

∈ [0, 1] (8.3)

Recall r = true positivestrue positives + false negatives

= correct warningsfailures

∈ [0, 1] . (8.4)

Consider the following two examples for clarification: First, a perfect failure predic-tor would achieve precision and recall of 1.0. Second, a real prediction algorithm thatachieves precision of 0.8, generates correct failure warnings (referring to true failures) in80% of all cases and false positives in 20% of all cases. A recall of 0.9 expresses that90% of all true failures are predicted and 10% are missed.

Since information retrieval has to cope with extreme class imbalance1 precision andrecall are also well-suited for the evaluation of failure prediction tasks: failures are usuallymuch more rare than non-failures. There are two boundary cases for which, precision, andrecall are not defined:

• Precision is not defined if there are no positive predictions at all. Since the num-ber of true positives equals the number of all positives, a precision of 1 is used.The same result is obtained, if a threshold is involved in classification (c.f., Sec-tion 7.2.2): with increasing threshold, the prediction algorithm must be “more sure”about an upcoming failure to issue a warning. Hence precision increases. At somepoint the threshold is so high that not a single prediction is positive and precision ishence set to one.

• Recall is not defined if the number of failures in the experiment is zero. However,since testing a failure predictor without any failures in the test data set is not usefulthis case is not further considered.

Weiss & Hirsh [277] argue that in real applications of failure prediction, first, the samefailure might be predicted several times and second, false positives occurring in burstsshould not be counted equally to false positives occurring separately. Therefore, the au-thors introduce a modified version of precision and recall:

p′ = predicted failurespredicted failures + discounted false warnings

(8.5)

r′ = predicted failurestotal number of failures

, (8.6)

where discounted false warnings refer to the number of complete, non-overlapping pre-diction periods ∆tp associated with a false prediction.

F-measure. Improving precision, i.e., reducing the number of false positives, often re-sults in worse recall, i.e., increasing the number of false negatives, at the same time. Tointegrate the trade-off between precision and recall the F-measure can be used (Makhoul

1Usually the number of relevant documents is much smaller than the total number of documents


et al. [172]). The F-measure is the weighted harmonic mean of precision and recall, whereprecision is weighted by α ∈ [0, 1]:

Fα = 1αp

+ 1−αr

= p · r(1− α) p+ α r

∈ [0, 1] . (8.7)

A special case is F0.5 where precision and recall are weighted equally:

F0.5 = 2 · p · rp+ r

. (8.8)

If precision and recall both equal zero, the F-measure is not defined, but the disconti-nuity can be removed such that the F-measure equals 0 in this case.2

False positive rate and true positive rate. The false positive rate is defined as the ratioof incorrect failure warnings to the number of all non-failures:

false positive rate fpr = false positivesfalse positives + true negatives

= false warningsnon-failures

. (8.9)

The definition of true positive rate tpr is equivalent to recall. However, in combinationwith false positive rate, the term true positive rate is used.

Accuracy. All evaluation metrics are concerned with the “accuracy” of failure predic-tion approaches in a general meaning of the word. Confusingly, one such measure isactually called accuracy, which is defined as the ratio of correct predictions in compari-son to all predictions performed:

accuracy acc = true positives + true negativestrue positives + false positives + false negatives + true negatives

. (8.10)

However, accuracy is not an appropriate measure for failure prediction. This is dueto the fact that failures are rare events. Consider, for example, a predictor that alwaysclassifies the system to be non-failure-prone. Since the vast majority of predictions referto non-failure prone situations, the predictor achieves excellent accuracy since it is rightin most of the cases. Instead, precision and recall measure the percentage of correctfailure warnings and percentage of correctly predicted failures, respectively. Hence, thesemetrics are more appropriate to assess the quality of failure prediction algorithms.

Odds ratio. Although mainly used in medical research, odds ratio can be applied toassess failure prediction algorithms. In statistics, odds are a way to describe probabilitiesin a p : q manner. More specifically, the odds O of an event E is defined as:

O(E) = P (E)1− P (E) . (8.11)

For example, if 60% of all cats are black, odds for a cat to be black are 60:40 = 1.5.

2To prove lim(p,r)→(0,0) F (p, r) = 0, it has to be shown that ∀ ε > 0 : ∃ δ > 0 such that ∀ (p, r); p, r > 0;|(p, r)− (0, 0)| < δ:

∣∣∣ 2p rp+r ∣∣∣ < ε (c.f., e.g., Bronstein et al. [39]). The existence of δ can be proven byletting p, r = ε

2 from which follows that δ = ε√2 . �


The odds ratio is defined as the ratio of the odds of an event occurring in one group tothe odds of it occurring in another group:

OR(E) = O1(E)O2(E) . (8.12)

For example, if the odds for mice to be black is 1:10 = 0.1, the odds ratio is 1.50.1 = 15� 1

expressing that cats are much more likely to be black than mice. Due to the fact thatOR(E) can take values from [0,∞), the odds ratio is skewed. However, taking the log-arithm turns it into a measure with values in (−∞,∞), which additionally is normallydistributed such that standard error and hence confidence intervals can be computed (see,e.g., Bland & Altman [31]).

In the case of failure prediction evaluation, the odds ratio is

OR(W ) = TP · TNFP · FN

, (8.13)

expressing the “odds” that a failure warning occurs in the case of a true failure than inthe case of a true non-failure. However, odds ratio is equivalent to tpr

1−tpr ·1−fpr

fpr and acomparison with ROC-plots, which also relate tpr and fpr (see below), has shown thatROC plots are much more meaningful (Pepe et al. [201]). Therefore, the odds ratio is notused explicitly in this dissertation.

8.2.3 Plots of Contingency Table MeasuresThe various measures obtained from a contingency table are singleton values that sharetwo restrictions:

1. They evaluate binary decisions. As derived in Chapter 7, binary decisions resultfrom comparison with a threshold θ. Hence contingency table-based metrics aredependent on θ.

2. They represent average behavior over the entire evaluation data set

If either of the two restrictions are released, a curve rather than a singleton value results.By inspection of these curves more insight into a predictor’s characteristics can be gained.On the other hand, comparability between failure prediction methods is worse.

Precision-recall curves. To visualize the inverse relationship between precision andrecall —improving recall by more frequently warning about an upcoming failure oftenresults in worse precision and vice versa— values of precision and recall can be plotted forvarious threshold levels. The resulting graph is called a precision-recall curve. Figure 8.3shows an exemplary plot.

Note that neither precision nor recall incorporate the number of true negative pre-dictions. Receiver operating characteristics employ false positive rate, which indirectlyincludes the number of true negatives.


Figure 8.3: Sample precision/recall-plot for two failure predictors A and B. Each point on acurve corresponds to one classification threshold θ. Predictor A shows relativelygood precision for most recall values but then drops quickly. In the limiting casethat all sequences are classified as failure prone, a recall of one and correspond-ing precision of F/N is achieved. The opposite case, where no sequence is classi-fied as failure prone, recall is zero and precision equals one.

Receiver Operating Characteristics (ROC). ROC curves (see, e.g., Egan [88]) areone of the most versatile plots used in machine learning. They plot true positive rate overfalse positive rate. Since a perfect classifier achieves a false positive rate fpr = 0 andtrue positive rate tpr = 1, the closer a curve gets to the upper left corner, the better theclassifier. If applicable, points for various thresholds are drawn and linearly interpolatedresulting in a curve.3 As has been shown in Chapter 7, in case of Bayes classification,θ depends on skewness as well as on the cost involved with the four cases of classifica-tion. Figure 8.4 shows ROC curves for three threshold-based predictors / classifiers and aperfect classifier.

Figure 8.4: ROC plot. True positive rate is plotted over false positive rate for varying classifi-cation threshold θ. Predictor A shows better performance than B, while predictorC corresponds to random guessing. A perfect predictor would achieve fpr = 0and tpr = 1.

3Other methods, such as decision trees (e.g., C4.5) apply a “fixed” classification and hence result in a singlepoint in ROC space


In order to relate ROC plots to precision and recall, consider the following equivalentformula for precision:

p = TP

TP + FP=

TPF

TPF

+ FPF

=TPF

TPF

+ NFF· FPNF

= tpr

tpr + NFF· fpr

, (8.14)

which is a function of tpr and fpr . NFF

denotes the ratio of non-failure over failure se-quences, which is class skewness. It can be shown that iso-precision curves in ROC spaceare concentric lines originating from point (0,0) (c.f., Flach [97]). Keeping in mind thattrue positive rate equals recall, each point on the ROC curve can be associated with avalue for precision and recall as shown in Figure 8.5.

Figure 8.5: Relation between ROC plots and precision and recall. Each point on the ROCcurve is associated with a precision / recall pair. Iso-precision lines are concentricat (0,0). In the graph, precision p1 > p2 > p3. Since recall equals true positiverate, corresponding recall values r1, r2 and r3 can be read off directly.

ROC plots, as well as precision-recall plots, account for all possible values of θ, whichis one of their major advantages. However, in the special case of failure prediction oneproblem occurs with ROC plots. In failure prediction, there is usually non negligible classskewness since failures are encountered less frequently than non-failure cases. Therefore,low false positive rates are easily obtained and hence only a small fraction of ROC spaceis of “interest”. In other words, in many failure prediction approaches, and especially inthose evaluating periodic measured data, true negative predictions dominate which resultsin a small fpr . Flach [97] has analyzed effects of class skewness on ROC plots and hasdefined skew insensitive variants for accuracy, precision, and F-measure. However, theseare not considered in this thesis since experiments are carried out on one single data setand hence class skewness is the same for all experiments.

Detection error trade-off (DET). Another way to compensate for class skewness is touse DET curves (Martin et al. [177]). DET curves differ from ROC plots in two ways:

1. Instead of true positive rate, the y-axis plots false negative rate fnr = 1− tpr. Thisgives uniform treatment to both types of mispredictions: false positives and falsenegatives.


2. Both axes are plotted on normal deviate scale. This leads to a linear curve in thecase of normal class distributions.

Figure 8.6 shows an example.

Figure 8.6: Detection error trade-off (DET) plot. In comparison to ROC plots, DET plot falsenegative rate fnr = 1− tpr instead of tpr over false positive rate. Both axeshave normal deviate scale. Curve B corresponds to random prediction while pre-dictor A is better than random.

The drawback of DET curves is that there is no graphical way to determine minimumcost, such as for ROC plots (see below). Additionally, DET curves have not yet beenestablished as standard plot for classification performance evaluation and no failure pre-diction related publication has been found that uses them. Hence, DET curves are notfurther considered.

8.2.4 Cost Impact of Failure PredictionIn Section 7.1.2, a cost or risk matrix was introduced, where rta denotes the cost for as-signing class label a to a sequence which in reality belongs to class t, e.g., rFF denotes thecost for falsely classifying a failure-prone sequence as non failure-prone. If true positiverate and false positive rate of a failure prediction algorithm are known, its expected costcan be determined as follows:

cost = F

N

((1− tpr) rFF + tpr rFF

)+ NF

N

((1− fpr) rF F + fpr rFF

). (8.15)

The equation distinguishes between all four cases: true and false, positive and negativepredictions. F/N determines the fraction of failure and NF/N the fraction of non-failuresequences. The true positive rate (tpr) indicates the fraction of failure sequences that arepredicted4 and hence cost rFF are assigned to this case. The same argumentation appliesto the remaining three cases. Given a cost / risk matrix, the overall goal is to find a failurepredictor with minimum expected cost.

4“caught” by the failure predictor


When analyzing contour lines of classification cost in ROC space, it can be shownthat iso-cost lines are straight lines having slope

d cost

d fpr = NF

F

rFF − rF FrFF − rFF

, (8.16)

which is only dependent on class skewness NFF

and the classification cost matrix rij .Figure 8.7 shows iso-cost lines for two values for class skewness. As expected, lower cost

iso−cost lines

false positive rate

true

pos

itive

rat

e

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

NF:F

40:14:1

Figure 8.7: Iso-cost lines. Contours of equal cost are plotted for two class distributions. Solidlines correspond to a ratio of NF : F = 40 : 1, while dashed lines correspond toa ratio of 4 : 1. Classification cost has been assumed to be rF F = 1, rFF = 10,rFF = 100, rFF = 1000 (c.f., Section 7.1.2).

is achieved near the top-left corner of the ROC plot.Since the slope of iso-cost lines is only dependent on variables that are determined

by the application and not by the classifier, minimum achievable cost can be assessed byidentifying the iso-cost line that is a tangent to the ROC curve (see Figure 8.8)

Cost graphs of Drummond & Holte. In [83], Drummond & Holte propose a way toturn ROC plots into a graph that explicitly shows cost. They define a so-called probabilitycost function

PCF =FNrFF

FNrFF + NF

NrFF

(8.17)

expressing the ratio of misclassifying a failure-prone sequence as non-failure prone( FNrFF ) and maximum expected cost, which is the sum of both types of misclassifica-

tion. Note that PCF only consists of application-specific parameters.Normalized expected cost NE is defined as expected cost divided by maximum cost.

It can be shown thatNE = (1− tpr − fpr)PCF + fpr , (8.18)


Figure 8.8: Determining minimum achievable cost from ROC. Three iso-cost lines c1 < c2 <c3 of slope determined by Equation 8.16 are drawn in the figure. Minimum achiev-able cost can be determined by the tangent to the ROC curve.

which is a linear function in PCF with bounding values fpr and 1− tpr. Hence, foreach point in the ROC curve (i.e., tpr and fpr for a given threshold θ), there is a tpr /fpr pair defining a straight line in the cost graph. If this line is plotted for various ROCpoints / thresholds, a convex hull results (see Figure 8.9). The convex hull can be usedto identify the optimal threshold, resulting in minimal normalized expected cost, for each(application specific) value of PCF. Furthermore, the intersection of the convex hull withthe lines for always positive and always negative predictions defines the range of operationin terms of PCF for a given predictor.

Figure 8.9: Cost curves. Varying the classification threshold θ = {θi} for one predictor resultsin set of corresponding pairs (tpri, fpr i). Each pair defines a straight line showingnormalized expected cost (NE) as a function of the probability cost function (PCF).The diagonals correspond to the two trivial predictors that always predict a failureF or non-failure F . If for every value on the PCF-axis the minimum value ischosen, a convex hull results (thick line). It can be seen that for some values ofPCF, expected cost is greater than for a trivial predictor. This defines the operatingrange (in terms of PCF) of the predictor.

However, as can be seen from Equation 8.17, the plot only takes misclassification cost


rFF and rFF into account5 but cost for correct classification are not involved. Due to thisrestriction and due to the fact that cost is difficult to estimate for the telecommunicationsystem, cost graphs of Drummond & Holte are left over for future investigations.

Accumulated runtime cost. All of the above metrics and graphs build on average val-ues for the entire data set. However, it makes a difference if a failure predictor runs verywell for most of the time except for short periods showing bursts of mispredictions or ifthe same number of wrong predictions occur all over the training data set. Accumulatedruntime cost graphs yield exactly this insight by adding cost rij for each prediction andshowing the step function of accumulating cost over runtime of the test (see Figure 8.10for an example). They have initially been developed together with Dr. Günther Hoffmann(see, e.g., Salfner et al. [224]) and have been extended in this dissertation. An accumu-lated runtime cost curve can be drawn either for several predictors or varying thresholdsθ for one predictor.

Figure 8.10: Exemplary accumulated runtime cost. Cost for all four types of prediction (true /false positive / negative) is plotted as it accumulates over time for two predictorsA and B. In the figure, a cost setting of rF F : rFF : rFF : rFF = 1 : 2 : 4 : 8has been assumed. Shaded areas indicate cost boundaries: maximum cost(each prediction is wrong), cost without failure prediction (failures are missed),cost for a perfect predictor (each prediction correct), and cost for oracle predic-tions (rFF for each failure). Diamonds (u) on the time line indicate the time offailure occurrence and circles (•) the time of predictions between failures.

A further advantage of accumulated runtime cost is that cost boundaries can be visu-alized:

• An oracle, which is of course not existing, would need no evaluation of measure-ment data. It would just know when a failure is about to occur. Hence accumulating

5Hence, the cost/risk matrix would have zeros at the main diagonal.


cost would only consist of cost for correct failure predictions rFF occurring eachtime a true failure is observed.

• In contrast to the oracle, real predictors need to evaluate measurements from therunning system. As each evaluation incurs some cost, real predictors result in higheraccumulated cost. However, the perfect predictor, which only performs correctpredictions, indicates the minimum cost for any predictor operating at times ofmeasurements. More specifically, cost of rFF occurs at times of failure and costof rF F at times of non-failure predictions / measurements. Nevertheless, it mustbe pointed out that this only determines minimal achievable cost for one class ofpredictors. If, for example, measurements and hence predictions are performedmuch more rarely, lower cumulative cost can result even for non-perfect predictors.One typical example for this is the distinction whether prediction is performed onerror events or on periodic measurements of system parameters such as workload:in most systems errors occur more seldom than periodic measurements.

• Cost if no predictor is in place can be determined in the following way: At eachoccurrence of a failure, cost of rFF − rF F occurs, which means that all failuresare missed and no predictions are performed in between. The reason why rFF isdecreased by rF F is that rFF also includes cost for performing a prediction. Costfor a prediction without action can be approximated by true negative predictionsand hence rF F is subtracted.

• Maximum cost can be determined by assuming all predictions to be wrong. Hence,each non-failure prediction receives cost of rFF and each prediction at the time offailure occurrence receives rFF . This also applies only to one class of predictors.

Of course, as is the case for all plots assuming fixed cost, the graph can look significantlydifferent if the ratio of cost rij is changed. Furthermore, the difficulty to estimate the cost/ risk matrix for real systems also applies to accumulated cost graphs. Nevertheless, sinceaccumulated cost graphs do not build on average values, they provide an insight into thetemporal behavior of a failure prediction algorithm and are for this reason used in thisdissertation.

8.2.5 Other MetricsDespite of measures obtained from the contingency table (see Table 8.1) and the plotsshown here, some other measures should be mentioned.

Area under ROC curve (AUC). The integral of a ROC curve,

AUC =∫ 1

0tpr(fpr) d fpr ∈ [0, 1] , (8.19)

is a wide-spread measure for classification accuracy. AUC can also be interpreted as theprobability that a randomly chosen failure-prone sequence receives a higher rating than arandomly chosen non-failure sequence.

AUC turns the ROC curve into a single real number which, in contrast to ROC plots,enables numeric comparison of classifiers. Obviously, a perfect predictor achieves AUC


equal to one and a purely random classifier receives an AUC of 0.5.6 AUC is thresholdindependent, which is the major difference to contingency table based metrics. However,AUC has its problems, too:

• AUC equally incorporates all possible threshold values regardless of class skewness(c.f., the discussion of the ROC curve).

• Interpretation of the AUC is not as intuitive as of contingency table base methods

• For a given cost setting and class skewness, AUC can be misleading: a classifierwith larger AUC might result in a higher cost impact of failure prediction, hence,even though AUC is better than for other predictors, the predictor results in worsecost incurred. For example in Figure 8.11, AUC for predictor B is larger than forpredictor A. However, minimal achievable cost for B is C2 which is larger than C1for predictor A.

Figure 8.11: AUC can be misleading: predictorB (dashed line) has better AUC than predictorA (solid line). However, for a given cost setting and NF/F ratio, the cost incurredby prediction are higher for B than for A, since C2 > C1.

Precision-recall-break-even One special point in precision-recall curves is the pointwhere the precision-recall curve crosses the ascending diagonal. At this point, precisionand recall are equal resulting in a scalar measure that can be used for comparison. How-ever, if precision and recall are not equally significant for the application, this approachseems not convincing and is hence not further considered in this thesis.

Further metrics. Many other metrics have been proposed in various scientific disci-plines such as data mining or machine learning with decision trees. These measuresinclude more recently introduced measures such as the G-measure (Flach [97]), weightedrelative accuracy (Todorovski et al. [256]), and SAR (Squared error, Accuracy, and ROC

6Note that the inverse inference is not valid: An AUC of 0.5 does not necessarily imply a random classifier!


area, see Caruana & Niculescu-Mizil [46]), as well as well-known metrics such as Ginicoefficient, lift, Piatetsky-Shapiro, φ-coefficient, etc. The measures could be applied tofailure prediction as well, however, to the best of our knowledge, this has not been inves-tigated, so far.

One exception is the κ-statistic, which has been used by Elbaum et al. [89] to builda detector for anomalous software events within the email client “pine”. The interestingthing about κ-statistics is that it allows for a “soft” evaluation of prediction performancebased on the κ value (see, Altman [7] for details).

8.3 Evaluation ProcessIn previous sections the metrics by which the potential to predict failures is assessedhave been discussed. In this section, the focus is on the evaluation process, i.e., how themetrics are obtained. The ultimate goal of evaluation is, of course, to identify the potentialto predict failures of a failure prediction approach given some data set. The evaluationprocess consists of several parts:

1. Many modeling approaches such as the one described in this thesis involve param-eters that need to be adjusted, which is also called training.

2. In machine learning, training is based on data. However, this data should not beused for evaluation. Hence the data set needs to be split.

3. In application domains such as failure prediction the amount of data available forevaluation is limited. Hence a technique called cross-validation is applied.

The following sections discuss each issue separately.

8.3.1 Setting of ParametersIdeally, parameters involved in modeling should be adjusted such that optimal failureprediction performance is achieved. However, “performance”, as has been discussed, canbe assessed by various metrics. Having decided upon one optimization criterion (e.g., F-measure), theoretically, each parameter in the modeling process should be analyzed withrespect to its effect on final failure prediction performance, which implies that each valueof each parameter must be tested in combination with each value of each other parameterof the entire modeling process. Not surprisingly, this is hardly feasible if more than15 parameters are involved in HSMM failure prediction. For this reason, the evaluationprocess consists of a mixture of “greedy” and “non-greedy” steps:

• greedy: Parameters that can be set rather “robustly” by some local optimiza-tion criterion or heuristic. Local optimization in this context meansthat not overall prediction performance needs to be evaluated butsome criterion that can be computed without a fully trained predic-tion model. “Robust” in this context means that there is sufficientbackground knowledge about the effect of the parameter on final pre-diction performance. “Greedy” also implies that —once adjusted—a parameter is not changed in later stages of the modeling process.An example of a parameter that can be set greedily is the length of

8.3 Evaluation Process 167

the tupling interval ε. The local optimization criterion is here thenumber of resulting tuples (c.f., Section 5.1.2).

• non-greedy: Parameters for which no local optimization criterion exists, or uponwhich little is known with respect to the effects on final failure pre-diction performance, need to be tested in combination with all otherparameters that cannot be determined greedily. In order to reducecomplexity, the range of values needs to be limited. Additionally,not each single value of the range needs to be explored, if it is ex-pected that final prediction performance is a smooth function of theparameter. An example for such parameter setting is the adjustmentof the maximum span of shortcuts in the structure of the HSMM(c.f., Section 6.6). For each combination of parameters, the full mod-eling process needs to be performed and prediction performance isassessed with respect to the selected evaluation metric.

Since greedy parameter optimizations include only one local optimization, increasingtheir number drastically reduces the overall training effort. Chapter 9 will provide de-tails how each parameter for modeling of the industrial telecommunication system hasbeen set.

8.3.2 Three Types of Data SetsAs described in Section 2.4, a typical two phase batch learning approach is applied in thisthesis: first, a model is trained from previously recorded data, and is subsequently appliedto the running system in order to predict failures online. However, the project from whichthe data has been acquired did not allow to apply the failure prediction approach to arunning system for evaluation. For this reason, failure prediction performance must beevaluated from the data set itself. However, assessing prediction of failures that have beenknown in the training phase does not yield a realistic estimation of prediction performance—hence the data needs to be separated into disjoint training and test data set data set.

However, training involves non-greedy estimation of model parameters, from whichfollows that parameters have to be adjusted with respect to the final prediction perfor-mance metric. For this reason, the training data set needs to be further subdivided to yielda so-called validation data set. Hence three types of data sets result:

1. Training data set: The data on which the training algorithm is run.

2. Validation data set: Parameters for which no local optimization criterion is avail-able need to be optimized with respect to final prediction performance (non-greedyestimation). Validation data is used to assess prediction performance of each setting.

3. Test data set: Final evaluation of failure prediction performance is carried out oncompletely new data, which is test data. By this, generalization performance of themodel is assessed which is taken as an indication to how well the failure predic-tor would predict future upcoming failures in a running system. Since evaluationis performed on data that has not been available for training and validation, suchevaluation is also called out-of-sample.


8.3.3 Cross-validationIn many machine learning applications, much data is available such that it cannot beprocessed entirely. In this case, the issue is to determine the minimum size of data sets thatis needed to assure some statistical significance. In the case of online failure prediction,the situation is different: Failure data is always scarce and all available data must be used.It is even so scarce that after splitting data into training, validation and test data, datasets get too small to yield statistically reliable results. To remedy this situation m-foldcross validation,7 which exploits the limited amount of data available by cyclic reuse,each time holding out another portion of the data for validation / testing of performance,can be used. More precisely, for m-fold cross validation data is split randomly into mdisjoint sets of equal size n/m, where n is the size of the data set. The training and testingprocedure is repeated m times, each time holding out a different subset for testing. Theremaining portion, which is of size n− n/m is subsequently split further into training andvalidation data.

A special form of cross validation uses stratification, which means that distribution ofclasses NF and F remain the same in each subset. However, stratification can only beapplied to validation since it is one of the main characteristics of the training procedure toseparate failure from non-failure sequences in order to deal with class imbalance. Hence,stratification has not been applied, here.

A further validation variant is Monte-Carlo cross validation (Shao [236]) where thedata set is repeatedly divided into a fraction β for testing and (1−β) for training. Althoughthe procedure has been shown to yield more stable results for selecting the number ofkernels in a Gaussian mixture modeling problem (Smyth [245]). However, since firstit is not clear upfront, that Monte-Carlo cross validation also performs better for failureprediction, and second it adds another parameter (β) that needs to be determined, onlystandard m-fold cross validation has been applied in this dissertation.

8.4 Statistical ConfidenceIn order to gain trust in the assessment of failure prediction quality, each evaluation metricshould be accompanied by confidence intervals. For the accuracy evaluation metric atheoretical analysis is available. A second theoretical analysis for other metrics builds onthe assumption of a normal sampling distribution, which cannot be guaranteed. For thisreason, confidence intervals are obtained from a well-known resampling strategy called“bootstrapping”.

8.4.1 Theoretical Assessment of AccuracyMitchell [184] provides an analysis of confidence intervals for the mean error rate ob-served from an experiment

Es = ES[P(c(s) 6= c(s)

)]= 1n

∑s∈S

(1− δc(s) c(s)) , (8.20)

7According to Duda et al. [85], cross validation has been invented by Cover [66]. However, Yu [283]claims that cross-validation has first been invented by Kurtz [151] and has been developed to multi-crossvalidation by Krus & Fuller [148]. Even more confusingly, Bishop [30] mentions Stone [251] as itsinventor.

8.4 Statistical Confidence 169

where n denotes the size of the experiment’s data set S = {s}, c(s) is the true value fors, c(s) the estimated value, and δij is the Kronecker delta. Es is also called the sampleerror rate.

Confidence intervals can be obtained from the fact that counting misclassificationswithin a test data set of size n is a Bernoulli experiment and the probability to encounterk misclassifications in the test data set is

P (k) =(n

k

)pk (1− p)n−k , (8.21)

where p is the true yet unknown error rate. p can be estimated from the number of misclas-sified sequences in the test data set, which is k, since the maximum likelihood estimation

p ≈ k

n= Es (8.22)

is an unbiased estimator given that the samples of the test data set had been drawn ac-cording to the prior distribution P (s). From the fact that p is estimated as mean value andfrom the fact that k is binomially distributed follows that standard deviation of the errorrate is approximately

σEs ≈√Es (1− Es)

n. (8.23)

For nEs (1− Es) ≥ 5, the binomial distribution can be well approximated by a normaldistribution8 and confidence intervals can be obtained:

CN(Es) =Es − zN

√Es (1− Es)

n, Es + zN

√Es (1− Es)

n

, (8.24)

where zN is the width of the smallest interval about the mean that includes N% of thetotal probability mass.

Finally, a confidence interval for accuracy can be obtained from the relation

acc = 1− Es . (8.25)

However, Duda et al. [85] show that unless n is fairly large, the maximum likelihoodestimation of p must be interpreted with caution. Furthermore, the analysis is only appli-cable to error rate / accuracy but confidence intervals are needed for all of the evaluationmetrics presented. Hence this approach is not applied in this thesis.

8.4.2 Confidence Intervals by Assuming Normal DistributionsThe central limit theorem states that any sum of independent and identically distributedrandom variables tends towards the normal distribution. From this follows that statisticssuch as the mean, which is defined by a sum also tends to be normally distributed andhence confidence intervals can be obtained by

C =[x− s√

n, x+ s√

n

], (8.26)

8otherwise, the cumulative binomial distribution must computed directly


where x denotes the mean of values observed in the test data set, s denotes the standarddeviation, and n denotes sample size.

However, this parametric way to determine confidence intervals only works for statis-tics that yield normal sampling distributions, which is a strong assumption that cannot beapplied to all statistics. Furthermore, there is no way to correct for bias or skew of theestimator. For this reason, this approach is also not applied in this thesis.

8.4.3 JackknifeQuenouille [207] invented an estimation procedure that is applicable to any statistic es-timator θ. The principle idea of the method was to compute the statistic for a data setfrom which one single data point has been removed. This is repeated by removing eachdata point once and the overall value of the statistic is finally obtained by the so-calledleave-one-out mean:

θ = 1n

n∑i=1

θ(i) , (8.27)

where θ denotes the estimate of statistic θ and θ(i) is the statistic for the data set fromwhich data point i has been removed.

The major benefit of this method is that bias and variance of the statistic can be esti-mated, even for statistics that resist theoretical analysis such as the mode or median. Forthis versatility, the method became also known under the term jacknife.

Although this method can in principle be applied to this thesis, the major problem withthe jackknife method is that it processes exactly n subsets. Computing complexity can belimited by leaving out more than one single sequence (similar to m-fold cross validation),but this on the other hand deteriorates the quality of statistic estimation.

8.4.4 BootstrappingBootstrapping (Efron [87]) adds more flexibility to the estimation process and is cur-rently seen as state-of-the art (at least in engineering disciplines). According to Moore &McCabe [187], bootstrapping should be applied when the sampling distribution is non-normal, biased or skewed, or to the estimation of statistics for which parametric estima-tions of confidence intervals are not available (such as for the well-known outlier resistant25% trimmed mean).

The basic idea of bootstrapping is that based on one original sample, many so-calledresamples are generated by randomly selecting n instances from the original sample withreplacement. Similar to the jackknife, the desired statistic is computed for each resample.However, the number of resamplings can be chosen arbitrarily and the same data pointmay occur several times in one resample. One of the explanations why this method worksis that the resulting bootstrapping distribution, which is the distribution of the statisticamong resamples, can be shown to approximate the true sampling distribution if the orig-inal sample represents reality rather well.

The statistic’s bias can be estimated by:

bias = 1B

B∑b=1

θ(b) − θ , (8.28)

8.4 Statistical Confidence 171

where B denotes the number of resamples, θ(b) the statistic θ computed from the b-thresample and θ is the statistic computed from the original sample. This estimate of biascan be used to yield more reliable confidence intervals even for biased and skewed sam-pling distributions. In this thesis bootstrap bias corrected accelerated confidence intervals(BCa) have been used, which require that the number of artificial resamples B has to beset to at least 5000.

8.4.5 Bootstrapping with Cross-validationAs stated before, failure data is scarce and cross-validation needs to be applied to fullyexploit available data. Each step in cross-validation could be analyzed separately andresults could be combined afterwards. However, bootstrapping cannot compensate forsmall sized original samples! Even if the resampling process is run many thousand times,resamples only consist of the few data points available in the original sample. For thisreason, a combination of cross-validation and bootstrapping has been applied in this thesis(see Figure 8.12)

• The complete dataset is randomly divided into ten groups for 10-fold-cross valida-tion.

• Each group is used once as test group

– The remaining nine data groups form the training / validation dataset.

– A model is trained / validated.

– The resulting model is applied to data of the test group.

– Model outputs are stored in a test result dataset.

• After performing this for all ten groups, the evaluation metric (statistic) is computedfrom the (combined) test result dataset.

• Bootstrapping is applied to test results, which means that test results are resampled5000 times in order to yield BCa confidence intervals for the evaluation metric.

Ten-fold-validation has been described above, which implies that ten complete mod-eling procedures have to be performed. However, this procedure can be adapted to thecomputing power available: The number of folds can be increased up to n, which wouldresult in the jackknife method with subsequent bootstrapping results. Note that the boot-strapping procedure only operates on the result of the training and testing procedure,which is incomparably less laborious than training 5000 models. In summary, the numberof data points in the result dataset from which the statistic is estimated is always n and thebootstrapping tries to compensate for the reduced number of trainings. Please also notethat

• cross-validation simulates the variability in selecting the training and test data

• bootstrapping simulates the sampling process in order to mimic the sampling dis-tribution

although the two are related, they are not exactly the same.


Figure 8.12: Cross-validation and bootstrapping. The dataset contains three failure se-quences (hatched boxes at the top) and seven non-failure sequences (shadedboxes at the top). All sequences are randomly divided into ten groups. Eachgroup is used once as test data set. For each test group the remaining ninegroups are used as training / validation dataset. After training / validation, se-quences of the test group are fed into the model and results are stored in thetest result dataset. The evaluation statistic is computed at the end from all testresults. In order to estimate confidence intervals, bootstrapping with 5000 re-samples is applied.

8.4.6 Confidence Intervals for PlotsThe estimation procedure shown above can be applied directly to contingency table-basedmetrics such as precision, recall, etc. However, plots, such as ROC, precision / recall,have two equally important dimensions. Fawcett [95] discusses the topic extensively forROC curves and proposes to compute contingency intervals in both directions by fixingthe threshold (see Figure 8.13). The same concept applies to precision / recall curves.Confidence intervals are not investigated for accumulated runtime cost since this graphdepends on one specific excerpt of the data (times of predictions and failures are shownon the x-axis). As AUC integrates over all threshold values, no threshold-based averagingcan be applied, either. Instead, confidence intervals can be computed by the bootstrappingprocedure directly.

8.5 SummaryIn this chapter the process of evaluating failure prediction methods has been discussed.Starting from an evaluation of clustering results, which is only relevant for the approachin this dissertation, in subsequent sections failure prediction metrics have been discussed.There are two principal groups:

1. Contingency table-based measures such as precision, recall, F-measure, false pos-itive and true positive rate, or accuracy. These measures evaluate binary decisionsand are hence dependent on one specific decision threshold

8.5 Summary 173

Figure 8.13: Averaging ROC curves. For each value of the threshold (A, B, C, D, and E),confidence intervals are computed separately for false positive and true positiverate. The ROC curve is then plotted through average values

2. Plots that account for various thresholds. In this thesis Precision / recall plots, ROCplots, detection error trade-off (DET), cost curves and accumulated runtime costgraphs have been presented.

AUC has a special place as it is a single value and does not depend on a threshold valuesince it is obtained from ROC plots.

The subsequent topic addressed by this chapter has been a description of the evaluationprocess. Three topics have been discussed: greedy vs. non-greedy parameter optimiza-tion, the distinction between training, validation, and test data set, and cross-validation.

Evaluating failure prediction by the use of data naturally raises the question of statis-tical confidence. Several approaches to an assessment of confidence estimation have beendiscussed and it has been argued why they cannot be applied to the case of online fail-ure prediction. The discussion concludes that bootstrapping is applied in this thesis anda combination of cross-validation and bootstrapping has been proposed. Finally, it hasbeen described how confidence intervals can be generated for plots having two equallyimportant variables.

Contributions of this chapter.

• To the best of our knowledge, the first comprehensive overview on failure predictionevaluation metrics has been presented.

• A novel evaluation plot —accumulated runtime cost graphs— has been introduced.In comparison to other evaluation techniques, the graph can reveal whether a pre-dictor operates very well for most of the time but fails for a short period or whetherfalse predictions occur equally distributed over time. Furthermore, the graph allowsto compare cost incurred by a predictor with cost for an oracle predictor, perfector worst predictor, and with cost for a system without failure prediction in place.However, these comparisons only hold for predictors of the same class. A furtherdrawback is that the graph is highly sensible to the assignment of cost for true andfalse positive and negative predictions.


• To the best of our knowledge, this thesis presents a novel combination of m-foldcross validation and bootstrapping: The computationally much more expensive taskof model training is reduced and at least partly compensated by bootstrapping with alarge number of resamplings. Furthermore, this approach allows to fully exploit thelimited amount of data and to take advantage of state-of-the art confidence intervalestimation offered by the bootstrap.

Relation to other chapters. This chapter has been the first of the third phase of theengineering cycle in which the modeling methodology is applied to industrial data ofthe real system. Having defined the measures for evaluation as well as the procedurehow these measures are obtained, the whole approach will be applied to real data of theindustrial telecommunication system in the next chapter.

Chapter 9

Experiments and Results Based onIndustrial Data

The failure prediction approach proposed in this dissertation has been applied to indus-trial data of a commercial telecommunication system. In this chapter, detailed results areprovided. The chapter is organized along the process of modeling: starting with the in-troduction of the case study (Section 9.1) and data preprocessing (Section 9.2), propertiesof the data set are presented in Section 9.3, and training of HSMMs is discussed in Sec-tion 9.4. The resulting failure predictor is analyzed in detail (Section 9.5) and dependenceon various parameters is investigated in Sections 9.6 and 9.7. Furthermore, a comparativeanalysis is provided by applying several different prediction techniques to the same data.Note that for readability reasons, in this chapter, the term “model” is not only used todenote the class of hidden semi-Markov models, but also a concrete HSMM parametriza-tion, such as a “model with 50” states, which denotes an instance of a HSMM that has 50states.

9.1 Description of the Case StudyAlthough the telecommunication system has been briefly introduced in Section 2.2, thedescription is repeated here for convenience. The main purpose of the telecommunica-tion system is to realize a Service Control Point (SCP) in an Intelligent Network (IN),providing Service Control Functions (SCF) for communication related management suchas billing, number translations or prepaid functionality. Services are offered for MobileOriginated Calls (MOC), Short Message Service (SMS), or General Packet Radio Service(GPRS). Service requests are transmitted to the system using various communication pro-tocols such as Remote Authentication Dial In User Interface (RADIUS), Signaling Sys-tem Number 7 (SS7), or Internet Protocol (IP). Since the system is a SCP, it cooperatesclosely with other telecommunication systems in the Global System for Mobile Com-munication (GSM), however, it does not switch calls itself. The system is realized asmulti-tier architecture employing a component-based software design. At the time whenmeasurements were taken the system consisted of more than 1.6 million lines of code, ap-proximately 200 components realized by more than 2000 classes, running simultaneously

175

176 9. Experiments and Results Based on Industrial Data

in several containers, each replicated for fault tolerance.Specification for the telecommunication system requires that within successive, non-

overlapping five minute intervals, the fraction of calls having response time longer than250ms must not exceed 0.01%. This definition is equivalent to a required four-ninesinterval service availability (c.f., Equation 2.1 on Page 13). Hence the failures predictedin this work are performance failures.

The setup from which data has been collected is depicted in Figure 9.1. A call trackerkept trace of request response times and logged each request that showed a response timeexceeding 250ms. Furthermore, the call tracker provided information in five-minute in-tervals whether call availability dropped below 99.99%. More specifically, the exact timeof failure has been determined to be the first failed request that caused interval availabilityto drop below the threshold. The telecommunication system consisted of two nodes thatare connected with a high-speed local network. Error logs have been collected separatelyfrom both nodes and have been combined to form a system-wide logfile by merging bothlogs into one based on timestamps (the system runs with synchronized clocks) treatingthe system as whole.

Figure 9.1: Experiment setup. Call response times have been tracked from outside the sys-tem in order to identify failures. The telecommunication system consisted of twocomputing nodes from which error logs have been collected.

We had access to data collected at 200 non-consecutive days spanning a period of 273days. The entire dataset consists of error logs of two machines including 12,377,877 +14,613,437 = 26,991,314 log records including 1,560 failures of two types: The first type(885 instances) relates to GPRS and the second (675 instances) to SMS and MOC servicesbut due to limited human resources, only the first failure type has been investigated.

Some notes on the procedure. As has been stated in Section 8.3, there are two strate-gies how parameters are set: greedy and non-greedy. Obviously, the best parameter settingwould be found by trying all combinations of parameters and to evaluate them with respectto failure prediction. However, such approach is not feasible and a different approach hasbeen taken for experiments: As long as there is a reasonable way to set parameters directlybased on some local criterion or observation, parameters are set by this heuristic. This im-plies that once a parameter has been set by a “local” criterion or heuristic, its effect onoverall failure prediction quality is not checked later, and hence it cannot be determinedwhether even better prediction results may be achievable with the method. However, sincethe results achieved by this strictly forward approach are already convincing, there is noneed to do so —at least from an engineering point of view. For this reason the followingsections go through the entire data preprocessing and modeling process from the start andinvestigate each step one after another.

9.2 Data Preprocessing 177

Implementation of the HSMM approach has been accomplished by modifying theGeneral Hidden Markov Model (GHMM) [179] library developed by the Algorithmicsgroup lead by Dr. Alexander Schliep at the Max Planck Institute for Molecular Genet-ics, Berlin, Germany. The GHMM library and hence its modifications are written in C,wrapped by Python classes which in turn are controlled by shell scripts. Clustering, eval-uation and plotting has been performed using R statistical language (see, e.g., Dalgaard[74]).

9.2 Data PreprocessingAs explained in Chapter 2, modeling first involves data preprocessing, which consists ofseveral steps. The following investigations will explain and analyze each step separatelyin the order they have been performed on the data.

9.2.1 Making Logfiles Machine-ProcessableSystem logfiles contain events of all architectural layers above the cluster managementlayer including 55 different, partially non-numeric variables. Figure 9.2 shows one(anonymized) log record consisting of three lines in the error log. In order to obtain

2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-AGOMP#020200034000060|

020101044430000|000000000000-020234f43301e000-2.0.1|020200003200060|00000001

2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-NOT: src=ERROR_APPLICATION

sev=SEVERITY_MINOR id=020d02222083730a

2004/04/09-19:26:13.634089-29846-00010-LIB_ABC_USER-unknown nature of address

value specified

Figure 9.2: Typical error log record consisting of three log lines (anonymized).

machine-readable logfiles, many steps had to be performed, and the tremendous effortby Steffen Tschirpke, who has done most of the programming for these steps, should beacknowledged at this point.

The major steps of logfile preprocessing include:

1. Eliminating logfile rotation. Many large systems perform logfile rotation, whichmeans that logfiles are limited either in size or time span (or both) and once a log-file has reached the limit, logging is redirected to the next file. After logging tothe n-th file, logging starts from the first file in a ring-buffer fashion. This behav-ior lead to duplicated log messages. Data has been reorganized to form one largechronologically ordered logfile for each computing node.

2. Identifying borders between messages. While error messages “travel” throughvarious modules and levels of the system, more and more information is accumu-lated until the resulting log-record is written into the logfile. In our case, variousdelimiters between the pieces of information were used and one log record couldeven span several lines in the logfile, sometimes quoting the error message several


times. For this reason, the logfile had to be parsed in order to generate a log whereeach line corresponds to one log record, to employ usage of a unique delimiter andto assign pieces of information to fixed positions (columns) within the line.

3. Converting time. Timestamps in the original logfiles were tailored to be “pro-cessed” by humans and were of the form 2004/04/09-19:26:13.634089stating that the log message occurred at 7 pm, 26 minutes and 13.634089 secondson April, 9th in the year 2004. In order to be able to, e.g., compute the length of thetime interval between two successive error messages, time had to be transformedinto a format that can be processed by computers. Real-valued UTC has been usedfor this purpose, which roughly relates to seconds since Jan. 1st, 1970.

9.2.2 Error-ID AssignmentAfter preprocessing, the next step involved assignment of an error ID to each message asdescribed in Section 5.1.1. In case of the telecommunication data, there were originally1,695,160 different log-messages. By replacing numbers, etc., the number of differentmessages has been reduced to 12,533. By application of the Levenshtein distance metricto each pair (resulting in 157,063,556 distances) the log messages could be assigned to1,435 groups by application of a constant similarity threshold. Table 9.1 summarizes thenumbers.

Data No of different messages Reduction in %Original 1,695,160 n/aWithout numbers 12,533 99.26%Levenshtein clustering 1,435 88.55% / 99.92% (original)

Table 9.1: Number of different log messages in the original data, after substitution of numbersby placeholders, and after clustering by the Levenshtein distance metric.

In principle, the task of message grouping is a clustering problem. However, group-ing 12,533 data points using a full-blown clustering algorithm is a considerably complextask. Furthermore, application of such complex algorithms is not necessary. Figure 9.3provides a plot where the gray value of each point indicates distance of the correspondingmessage pair. Except for a few blocks in the middle of the plot, there are dark steps alongthe main descending diagonal and the rest of the plot is rather light-colored. The plot hasbeen created by putting sequences next to each other if their Levenshtein distance wasbelow some fixed threshold. Since plotting similarities is not possible for all messages,Figure 9.3 has been generated from a subset of the data. The figure indicates that strongsimilarity is only present among groups of log messages and not to other message types.Hence a rather robust grouping can be achieved by one of the simplest clustering methods:grouping by a threshold on dissimilarity. The reason why this simple method works ratherrobustly is that (after replacement of numbers by a placeholder), messages with more orless the same text agree in most parts and other messages are significantly different.

Note that each error message type corresponds to one error symbol (indicated byA,B,or C in previous chapters). Together with the number of failure types (which are at most


Figure 9.3: Levenshtein similarity plot for a subset of message types. Each point representsLevenshtein distance of one pair of error messages. Dark dots indicate similarmessages (small distance) while lighter dots indicate a larger Levenshtein dis-tance. Messages have been arranged such that sequences are next to eachother if their Levenshtein distance is below some fixed threshold.

two in our case study) the number of different error messages defines the size of theHSMM alphabet. Therefore, experiments in this case study had alphabets of size 1, 436(1.435 errors plus one failure) since only one failure type has been investigated at a time.Please also note that memory consumption of observation symbol matrix B is determinedby the number of states times the size of the alphabet. For these reasons, reducing thenumber of error messages is an important step in the failure prediction approach describedin this thesis.

9.2.3 TuplingAs described in Section 5.1.2, tupling is a technique that combines several occurrences ofthe same event in order to account for multiple reporting of the same problem. In order todetermine the optimal time window size ε, the heuristic shown in Figure 5.3 on Page 78has been applied to the data. The size of the optimal time window is identified graphicallyby plotting the number of resulting tuples over various values for ε. Figure 9.4 shows theplot for a subset of one million log records for the cluster logfile (which has been receivedby merging error logs of both machines). The graph strongly supports the claim of Iyer &Rosetti that a change point value for ε can be identified above which the number of tuplesdecreases much slower. According to the heuristic, ε is chosen slightly above the changepoint.

In order to show that properties related to tupling do not change by merging the twoerror logs, tupling analysis has also been performed for each machine error log separately.As shown in Figure 9.5, the change point for both machines occurs at roughly the samepoint. The most striking difference between Figure 9.5 and Figure 9.4 is that the numberof resulting tuples is smaller for single machine logfiles. This can be traced back tothe merging process: Tupling only lumps bursts of the same message —if a different


Figure 9.4: Effect of tupling window size for the cluster-wide logfile. The graph shows resultingnumber of tuples depending on tupling time window size ε (in seconds)

message from the second machine is woven into the burst, the burst results in at least twoseparate tuples. However, the main point of the analysis is that a change point exists, and,furthermore, that it occurs roughly at the same value for ε in single machine logfiles.

Based on this analysis, a value of ε = 0.015s has been used for experiments.

9.2.4 Extracting SequencesAfter tupling, sequences are extracted from error logs (c.f., Section 5.1.3 and especiallyFigure 5.4 on Page 79). In order to decide whether a sequence is a failure sequence or not,the failure log, which has been written by the call tracker, has been analyzed, to extracttimestamps and types of failure occurrences. Three time intervals determine the processof sequence extraction:

1. Lead-time ∆tl. If not specified explicitly, a lead-time of five minutes has been used,although it is shown in Section 9.6.1 that prediction performance is comparablygood for even longer lead-times. However, since lead-time experiments have beencarried out relatively late, previous experiments have not been carried out again andresults are reported for five minutes lead time. For large and complex computersystems, it is assumed that proactive fault handling actions such as restart, garbagecollection or checkpointing can be performed within five minutes, i.e., warning-time ∆tw is shorter than five minutes.

2. Data window size ∆td. Analyses presented in the next section are based on a datawindow size of five minutes. An explicit analysis of ∆td is carried out in Sec-tion 9.6.2.

3. Margins for non-failure sequences ∆tm. This value is used to determine time in-tervals when no failure is imminent in the system. Since it cannot be measured,whether the system really is fault-free, a value of 20 minutes has been chosen arbi-trarily. According to an analysis of failure data, it has been observed that failuresoften occur in bursts which are interpreted to be caused by the same instability


Figure 9.5: Effect of tupling window size for each individual machine


(fault). Employing a margin of 20 minutes seems to yield a stable separation. Forother systems that show long-range failure behavior (e.g., in the order of hours),this value might be too small.

Non-failure sequences have been generated using overlapping time windows, which sim-ulates the case that failure prediction is performed each time an error occurs.

9.2.5 Grouping (Clustering) of Failure SequencesThe goal of failure sequence clustering is to identify failure mechanisms contained in thetraining data set (c.f., Section 5.2). The approach builds on ergodic (fully connected)HSMMs to determine the dissimilarity matrix that is subsequently analyzed by a cluster-ing method. Clustering has been performed using the cluster library of the statisticalprogramming language R.1

The approach implies several parameters such as the size of the HSMMs. This sectionexplores their influence on sequence clustering. In order to explore the influence, manycombinations of parameters have been tried. Although it is not possible to include allplots here, key results are presented and visualized by plots. In order to support clarityof the plots, a data excerpt from five successive days including 40 failure sequences hasbeen used.

The HSMMs used to compute sequence likelihoods had a topology as shown in Fig-ure 9.6 and used exponential duration distributions mixed with a uniform background.

Figure 9.6: Topology of HSMMs used for computation of the dissimilarity matrix. The modelshown here has five states and an additional absorbing failure state

Results are presented for one failure type only. However, conclusions drawn from theanalysis also apply to the second failure type.

Clustering method. As explained in Section 5.2, several hierarchical clustering meth-ods exist. In this thesis, one divisive and four agglomerative approaches have been appliedto the same data: The DIANA algorithm described in Kaufman & Rousseeuw [142] fordivisive clustering and agglomerative clustering using single linkage, average linkage,complete linkage and Ward’s procedure. The agglomerative clustering method is called

1see http://www.r-project.org, or Dalgaard [74]


“AGNES”, hence the name is also used in the plots to indicate agglomerative clustering.Figure 9.7 shows banner plots (c.f., Section 8.1.2) for all methods using a dissimilaritymatrix that has been generated using a HSMM with 20 states and a background level of0.25. As will be shown next, the choice of the number of states and background levelhave only very little impact on clustering results. Therefore, results look very similar ifthe clustering methods are applied to dissimilarity matrices computed with another modelconfiguration. The plotting software could not include sequence labels on the y-axis inthe plots. However, checking the grouping by hand for some instances yielded similargroupings.

Regarding first single linkage clustering (second row, left), the typical chaining effectcan be observed. Since single linkage merges two clusters if they get close at only onepoint, elongated clusters result. Although beneficial for some applications, this behav-ior does not result in a good separation of failure sequences yielding an agglomerativecoefficient of only 0.45. Hence single linkage is not appropriate for the purpose.

Complete linkage (first row, right) performs better resulting in a clear separation oftwo groups and an agglomerative coefficient of 0.72. Not surprisingly, average linkage(first row, left) resembles some mixture of single and complete linkage clustering. Theresult is not convincing with two single sequences left over. As was the case for completelinkage, it cannot be clearly stated how many groups are in the data. Hence averagelinkage also does not seem appropriate.

Divisive clustering (bottom row, left) divides data into three groups at the beginningbut does not look consistent since groups are split up further rather quickly. The resultingagglomerative coefficient is 0.69. Finally, agglomerative clustering using Ward’s method(second row, right) results in the clearest separation achieving an agglomerative coeffi-cient of 0.85.

Considering other parameter settings, the picture always is the same: single linkagefails and Ward’s method results in the clearest separation. For this reason, Ward’s methodis considered to be the most robust and most appropriate for failure sequence clusteringand has been used in all further experiments conducted in this dissertation. Neverthe-less, there are other parameters to failure clustering, such as the number of states of theHSMMs, which are investigated in the following.

Number of states. Since it is not clear a priori, how many states the HSMMs shouldhave, experiments have been conducted with model sizes ranging from five to 50 states.Results for clustering using Ward’s procedure are shown in Figure 9.8. It can be observedfrom the figure that the order in which clusters are merged is very similar for 20, 35, and50 states, but is different for five states. Although not provable, the effect might be at-tributed to the number of the model’s transitions. Let N denote the number of states2 thenthe number of transitions equalsN ·(N−1)+N = N2. Considering the empirical cumu-lative distribution function (ECDF) of the length of failure sequences (c.f., Figure 9.18-bon Page 197), it can be observed that for N = 5 (i.e., 25 transitions) more than 60%of failure sequences have more symbols than there are transitions in the model, whereasfor N = 20 (i.e., 400 transitions) there is no failure sequence for which this is the case.Although the number of transitions is not directly proportional to a model’s recognitionability it gives an indication. Note that the ergodic models used here can in principle rec-

2Not including the absorbing failure state F


Height

agnes average 20 states bg = 0.25


0 20 40 60 80 100 120 140

Height

agnes complete 20 states bg = 0.25


0 20 40 60 80 120 160 200 234

Height

agnes single 20 states bg = 0.25


0 10 20 30 40 50 60 70 80 90

Height

agnes ward 20 states bg = 0.25


0 50 100 150 200 250 300 350 400

Height

diana standard 20 states bg = 0.25

Divisive Coefficient = 0.69

234 200 160 120 80 60 40 20 0

Figure 9.7: Effect of clustering methods. Five different clustering methods are applied to thesame dissimilarity matrix, which has been generated by a 20-state HSMM with0.25 background weight. The agglomerative clustering algorithm is called “agnes”and the divisive algorithm “diana”. For agglomerative clustering, average linkage,complete linkage, single linkage and Ward’s procedure have been used.


ognize sequences of arbitrary length but if transitions have to be “reused”, probabilitiesget blurred and the model looses discriminative power. Similar observations can be madeif clustering methods other than Ward’s procedure are used..

As a rule of thumb, the number of states for HSMMs used for failure sequence clus-tering should be chosen such that N >

√L for the majority of failure sequences, where

N denotes the number of of states and L the length of the sequence.

Weight of background distributions. It has already been mentioned in Section 6.6 thatbackground distributions must be used with HSMMs since observation probabilities forerrors that do not occur in the (single) training sequence are set to zero by the Baum-Welchtraining algorithm. Hence each failure sequence that contains at least one error messagenot contained in the training sequence would receive a sequence likelihood of zero (or−∞ in the case of log-likelihood) and no useful dissimilarity matrix would be obtained.Using background distributions, a small probability is assigned to all observation symbolsresulting in non-zero sequence likelihoods. In the experiments, a uniform distribution ofall error symbols occurring in the entire set of failure sequences has been used. The effectof background distributions on sequence clustering has been investigated by varying thebackground distribution weighting factor ρi, which has been equal for all states i of theHSMM (c.f., Equation 6.63 on Page 112). Figure 9.9 shows results for clustering with aHSMM with 20 states using Ward’s method.

As can be seen from the plots, varying the background weight does only slightly affectgrouping. In fact, with increasing background weight more “chaining-effects” can be ob-served and the agglomerative coefficient is decreasing. The explanation for this behavioris that the single sequence HSMMs become “more equal” with increasing ρi due to thefact that the uniform background distribution supersedes the specialized output probabil-ities obtained from training. The more similar the models, the more equal are sequencelikelihoods resulting in less structure in the dissimilarity matrix. Nevertheless, all back-ground values result in a grouping that is similar to the ones obtained by the majorityof clustering approaches. Analysis is based on Ward’s procedure here, but the same ef-fect can be observed for other clustering methods, as well. For some of the proceduresclustering is affected if the background distribution weight gets too large. A plot for abackground weight of zero has not been included since it could not be used for clusteringdue to sequence log-likelihoods of −∞. Hence, the conclusions from this analysis is thatthe background weight has not much influence on clustering but should neither be toosmall nor too large. For the case study, a value 0.1 has been used.

Summary of failure sequence grouping. From the experiments the following conclu-sions (regarding failure sequence clustering) can be drawn:

• Agglomerative clustering using Ward’s procedure yields the most robust and mostclear grouping

• The number of states of the HSMMs used to compute sequence likelihoods is notcritical, however, it should be chosen such that the number of transitions is largerthan the number of error symbols of the majority of failure sequences, hence thenumber of states should be roughly equal to

√L.


Height



0 50 150 250 350 450 541

Height



0 50 150 250 350 450 540

Height



0 50 150 250 350 450 540

Height



0 50 150 250 350 450 540

Figure 9.8: Effect of number of states. The plots show clustering results using agglomerativeclustering with Ward’s procedure for dissimilarity matrices computed by HSMMswith 5, 20, 35, and 50 hidden states.


Height



0 50 150 250 350 450 540

Height



0 50 100 200 300 400

Height



0 50 100 150 200 250 300 350

Figure 9.9: Effect of background distribution weight. One HSMM with 20 states has beentrained and dissimilarity matrices have been computed using three different valuesof the background distribution weight ρi (denoted by “bg” in the plots). The bannerplots show results of agglomerative clustering using Ward’s procedure.


• Background distributions are necessary in order to yield useful dissimilarity ma-trices but the actual value is not very decisive. A value of 0.1 is used in the casestudy.

9.2.6 Noise FilteringThe goal of the statistical test involved with noise filtering (c.f., Section 5.3) is to elim-inate error messages that are not indicative of failure sequences. The idea is to consideronly error messages that occur significantly more frequent in the failure sequences incomparison to the expected number of occurrences in a given time frame. The decisionis based on a testing variable Xi (c.f., Equation 5.6 on Page 85), which involves the priorprobability p0

i .As described in Section 5.3, three variants exist to compute priors p0

i :

1. p0i are estimated separately for each group of failure sequences.

2. p0i are estimated from all failure sequences —irrespective of the groups.

3. p0i are estimated from all sequences, containing failure and non-failure sequences.

Noise filtering has been implemented such thatXi values are stored for each symbol inorder to allow for filtering with various thresholds c. Experiments have been performedon the dataset used previously for clustering analysis and six non-overlapping filteringtime windows of length 50 seconds have been analyzed.

Figures 9.10-9.12 show bar plots of Xi values for each symbol and time window. Fig-ure 9.10 has been generated using group-based priors, Figure 9.11 using failure sequence-based priors and Figure 9.12 using a prior computed from the entire training dataset. Eachfigure shows two plots: one for each group of failure sequences. The three figures are or-dered by specificy of priors: the group-wise prior is computed from the failure symbolsitself (but without windowing) resulting in rather small values of Xi since the distributionof failures in the time window is very close to the expected distribution. More generalpriors result in larger values of Xi, as can be seen in Figures 9.10, 9.11, and 9.12.3

Regarding Figure 9.10, it can be observed that the distribution of symbols dependson time before failure. The prior has been computed without time windows which canbe seen as the average over the entire length of failure sequences. Xi values mark thedifference to the prior for each time window. The figure shows that deviation from priorsis different for each window. This is an important finding: It is a further evidence for oneof the most principle assumptions of this thesis: The assumption that timing information—at least time-before-failure— cannot be neglected in online failure prediction. By theway, Figure 9.10 supports the second principle mentioned by Levy & Chillarege in [162],stating that the mix of errors changes prior to a failure.

Due to the fact that the prior is computed for each group separately, the sum of Xi

values over all time windows should be equal to zero. Although this is the case for mostof the symbols, some violate this equality. The explanation for this is that sequences oflength up to 300 seconds have been used, but only time windows up to 250s have beenplotted for readability reasons.

3Note that y-axes have been scaled to fit all Xi values.


−225 −175 −125 −75 −25

group 1, group prior

filtering interval centers [seconds]

Xi

−5

05

10

−225 −175 −125 −75 −25

group 2, group prior


Xi

−5

05

10

Figure 9.10: Values of Xi for noise filtering with a prior computed from each cluster of failuresequences. The upper plot is for the first group of failure sequences and thelower for the second group. Within each plot, each group corresponds to onetime window. Within each group, each bar corresponds to one error symbol andthe y-axis displays the value of the testing variable Xi. Numbers below eachgroup denote the center of the time interval in seconds before failure occurrence.

−225 −175 −125 −75 −25

group 1, fseq prior


Xi

−2

02

46

8

−225 −175 −125 −75 −25

group 2, fseq prior


Xi

−6

−2

24

68

Figure 9.11: Values of Xi for noise filtering with a prior computed from failure sequences.


−225 −175 −125 −75 −25

group 1, all seq prior


Xi

010

2030

40

−225 −175 −125 −75 −25

group 2, all seq prior


Xi

−20

010

20

Figure 9.12: Values of Xi for noise filtering with a prior computed from all sequences.

Regarding Figure 9.11, it can be observed that the distributions of Xi values are quitedifferent in the two groups. This is due to the fact, that the prior has been computed fromall failure sequences (regardless of the group), which can be interpreted as an indicationthat failure sequence grouping supports failure pattern recognition since separate modelscan be trained that are tailored towards the distributions in each group.

The third figure (Figure 9.12), which is based on a prior from failure and non-failuresequences, supports the third principle described by Levy & Chillarege in [162] called“clusters form early”: It can be observed especially in the lower plot that a few error sym-bols outnumber their expected value heavily. Furthermore, the effect becomes strongerthe closer the time window is to the occurrence of failures (the further right in the plot).

In order to investigate the effect of filtering on sequences the number of symbolswithin each sequence has been analyzed. Figure 9.13 plots the average number of symbolsin one group of failure sequences after filtering out all symbols with Xi < c for variousvalues of c. Again, all three types of priors have been investigated.

Regarding first the “global” prior computed from all sequences (solid line), the result-ing curve can be characterized as follows: For very small thresholds, all symbols pass thefilter and the average number of symbols in failure sequences equals the average numberwithout filtering. At some value of c the length of sequences starts dropping quickly untilsome point where sequence lengths stabilize for some range of c. With further increasingc average sequence length drops again until finally not a single symbol passes filtering.

The supposed explanation for this behavior is that the first drop results from filteringreal noise. The middle plateau indicates some sort of a “gap”, which may result from somesignificant difference in the data: this is the filtering range where error symbols relevantto failure sequences still get through but background noise is eliminated. At some point cbecomes too large even for relevant error symbols to get through and the average number

9.3 Properties of the Preprocessed Dataset 191

Figure 9.13: For each filtering threshold value c, mean sequence length has been plotted.The solid line shows values for a prior computed from all sequences, the dashedline for a prior computed from all failure sequences and the dotted line for priorscomputed individually for each group/cluster of failure sequences.

of symbols in failure sequences drops to zero (the plateau around c = 40 is interpreted toresult from outliers).

Comparing the “global” prior with the two other priors, it can be observed that thecurve for a cluster-based prior drops most quickly and the curve for a “global” prior dropsmost slowly. The reason for this is again specificy of the priors. Plateaus are at least notas obvious as for the global prior.

Summary of noise filtering. From this analysis follows that a “global” prior computedfrom all sequences (failure and non-failure) seems most appropriate. Therefore, furtherexperiments are based on filtered data using such prior. Similar to the tupling heuristicproposed by Tsao & Siewiorek [258], the filtering threshold c has been chosen such thatit is slightly above the beginning of the middle plateau.

9.3 Properties of the Preprocessed DatasetBefore going into details of the modeling process, the preprocessed data has been ana-lyzed. Later sections will then refer back to the properties described here. Additionally,data analysis helps to understand better the system under investigation and may also helpothers to judge whether results presented in this thesis can be transferred to their systems.


9.3.1 Error FrequencyOne of the most straightforward methods for online failure prediction is to look at thefrequency of error occurrence and to warn about an upcoming failure once the frequencystarts to rise significantly. However, as Figure 9.14 shows, such simple approach is noteffective when applied to the commercial telecommunication system.

0 200 400 600 800 1000

050

100

150

time[min]

num

ber

of e

rror

s pe

r 5

min

utes

Figure 9.14: Number of errors per five minutes in preprocessed data. Diamonds (u) indicatethe occurrence of failures.

More specifically, the figure shows the number of errors per five minute time intervals.The plot has been generated of data obtained after tupling. As can be seen from the plot,the number of log records varies greatly ranging from zero to 153 log records within fiveminutes. Note that Figure 9.14 is only an excerpt of the data. The peak value observedin the data of five successive days (the same data that has also been used in the previousanalyses) even reaches 267 log records within five minutes. Performing the same analysiswith time intervals of length of one second reveals that there are up to eight messages persecond.

The figure shows that a straightforward counting method would not work well sincethe pure number of errors seems quite unrelated to the occurrence of failures: Failuresoccur at times with many and with few errors per time interval, and in sections where thenumber of errors increases as well as decreases. There are time intervals with heavy errorreporting but only a few failures and time intervals with few errors but several failures.

9.3.2 Distribution of DelaysThe model for online failure prediction proposed in this dissertation builds on the timingbetween error occurrence and hence uses probability distributions to handle time betweensuccessive errors (delays). This section provides an analysis of delays in error sequences.

Theory of HSMMs allows to define a unique convex combination of distributions for


each transition. However, it is not possible to determine upfront which transition shouldhave what type of distribution, and, it is not practical for real applications. Therefore,the same combination of distributions has been used for all transitions: Each transition,for example, consists of a convex combination of exponential and uniform distribution.Note that this does not imply that distributions are equal: the parameters of the distribu-tions (e.g., rate λ of exponential distributions, the combining weight, etc.) are initializedrandomly and then further changed by the Baum-Welch algorithm.

In order to get a picture of delay distributions, delays occurring in the entire datasethave been analyzed. More precisely, a histogram and quantile-quantile-plots (QQ-plots)are provided in Figure 9.15.

The dataset used for analysis comprised 24,787 delays spanning a range from zero4 to29.39 seconds with a mean of 1.404 seconds. The histogram shown at the top left of thefigure plots relative frequency of delays with a resolution of 1 second. The distribution ofdata seems to resemble an exponential distribution except for the peak at 12-13 seconds.It might be supposed that the peak results from some outliers. However, 1,048 delaysfall into this category and hence the peak results more likely from some system inherentproperty. In order to further investigate which parametric distribution fits best the data,QQ-plots have been generated plotting quantiles of the observed delay distribution againstthe parametric ones: The normal distribution (middle row, left) obviously fits very badly.This is due to its property that the distribution can take on negative values, which isinappropriate for delays. Exponential (top row, right) and lognormal (middle row, right)fit much better. However, both distributions show a quite bad match for higher quantiles.As HSMMs provide the possibility to mix distributions, the exponential and log-normaldistributions have been mixed with a uniform distribution resulting in an improved fit(except for very large delays), with the exponential being slightly better than the log-normal. However, further investigations have revealed that very long delays (> 12s) occuronly in 0.41% of all cases and a worse fit of the distribution can be accepted.

Based on this analysis, experiments have been performed using a convex combinationof exponential and uniform distribution.

9.3.3 Distribution of FailuresAssumptions on the distribution of failures (with respect to their time of occurrence) areused in various areas of dependable computing research. For example, preventive main-tenance, reliability engineering, and reliability modeling make use of it. As has beendescribed in Chapter 3, there are also online failure prediction approaches exploiting thetime of failure occurrence. Therefore, an analysis of the distribution of time-between-failure (TBF) has been performed. However, since failures are not as common as errors,the entire dataset of 200 days has been analyzed. More precisely, the dataset consisted of885 timestamps of failures of one type. Figure 9.16 summarizes the results.

Similar to the analysis of inter-error-delays, a histogram is provided at the top leftof the figure. Note that the histogram might not fully represent reality for the first twoslots since failures occurring earlier than 20 minutes after a previous failure have beenconsidered as related to the previous one and have been eliminated from the dataset during

4A delay of zero means that two log records occur with the same timestamp in the log. Technically, thismeans that the two records have a delay lower than the minimum time resolution of the system, which isabout a millisecond in the telecommunication system.


histogram

delays [seconds]

Den

sity

0 5 10 15 20 25 30

0.0

0.1

0.2

0.3

0.4

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++

++

++ + + +

+

0 2 4 6 8 10

05

1015

2025

30

exponential

distribution quantiles

data

qua

ntile

s

+ + + ++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++

++++++++++

++++

+++

++ + +

+

−5 0 5 10

05

1015

2025

30

normal


data

qua

ntile

s

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++

+++++

++

+ + + +

+

0 5 10 15 20 25 30

05

1015

2025

30

lognormal


data

qua

ntile

s

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++ + + +

++

+++++

+

0 2 4 6 8 10 12

05

1015

2025

30

exponential mixed with uniform


data

qua

ntile

s

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

+++++++++++++ ++

++

+ +++

+

0 2 4 6 8 10 12

05

1015

2025

30

lognormal mixed with uniform


data

qua

ntile

s

Figure 9.15: Histogram and QQ-diagrams of delays between errors. QQ-plots plot the distri-bution of delays observed in the dataset versus several parametric distributions:exponential, normal, log-normal, exponential mixed with uniform and log-normalmixed with uniform. The straight line indicates a perfect match of quantiles. Pa-rameters of parametric distributions have been estimated from the data (e.g.,mean of the normal distribution has been set to the mean of the data)


TBF [min]

Den

sity

0 50 100 150 200 250 300

0.00

00.

002

0.00

40.

006

0.00

80.

010

0.01

2

+++++++++++++

++++++++++++++++

+++++++++++++++

+++++++++++++++

+++++++++

+++++++

+++++++

++++++++++ + +

+ ++

+

+

+

0 100 200 300 400

5010

015

020

025

030

0

exponential


data

qua

ntile

s

+ + + +++++++++++++++++++++++

+++++++++++++++

+++++++++++++++

++++++++++

++++++++

+++++++

++++++

++++

+++++

++

+

+

0 50 100 150 200

5010

015

020

025

030

0

normal


data

qua

ntile

s

++++++++++++++++++

++++++++++++++++

+++++++++++++++

++++++++++++

++++++++++

+++++++

+++++++

++++++

++ ++ +

++

+

+

50 100 150 200 250

5010

015

020

025

030

0

lognormal


data

qua

ntile

s

++++++++++++++++++++

+++++++++++++++

+++++++++++++++

++++++++++++

++++++++++

+++++++

+++++++

+++++

++++ +

++

+

+

50 100 150 200

5010

015

020

025

030

0

Gamma


data

qua

ntile

s

++++++++++++++++++++

+++++++++++++++

+++++++++++++++

++++++++++++

++++++++++

+++++++

+++++++

+++++

++++ +

++

+

+

50 100 150 200

5010

015

020

025

030

0

Weibull


data

qua

ntile

s

Figure 9.16: Analysis of time-between-failures (TBF). The top left plot shows a histogram.The five other graphs plot quantiles of the observed data against quantiles ofexponential, normal, log-normal, gamma, and Weibull distribution (QQ-plots).


data preprocessing. In addition to the histogram, QQ-plots are provided for the mostfrequently used distributions in reliability theory. Parameters for the gamma and Weibulldistributions have been estimated by maximum likelihood. The interesting observationhere is that the frequently used exponential distribution yields a relatively bad fit. Butalso other frequently used distributions such as the gamma or Weibull do not really fit thedata. The best approximation is obtained by a lognormal distribution.

Results of a second analysis are provided in Figure 9.17. In order to investigate,whether some periodicity is present in the data, the normalized autocorrelation of failureoccurrence has been plotted. More specifically, the data has been divided into bucketsof five minute intervals and the autocorrelation has been computed for lags of up to 240minutes. The observation is that there is almost no periodicity in failure occurrence, whichis the reason why periodic prediction does not work for the case study (see Section 9.9.4).

0 50 100 150 200

0.0

0.2

0.4

0.6

0.8

1.0

Lag [min]

AC

F

Figure 9.17: Normalized autocorrelation of failure occurrence. Failure data has been groupedinto buckets of five minute intervals and autocorrelation has been computed forlags of up to 240 minutes.


9.3.4 Distribution of Sequence LengthsError sequences are delimited by time ∆td and hence an analysis of the length of se-quences in terms of the number of errors is provided here. For the test dataset, a histogramof the number of symbols is shown in Figure 9.18-a. Taking only failure sequences intoaccount, Figure 9.18-b plots the empirical cumulative distribution function.

Histogram of length of all sequences

sequence length [no of symbols]

Den

sity

0 50 100 150 200 250 300

0.00

00.

005

0.01

00.

015

0 50 100 150 200

0.0

0.2

0.4

0.6

0.8

1.0

ECDF of failure sequence length

failure sequence length [number of symbols]

Fn(

x)

(a) (b)

Figure 9.18: (a) Histogram of length of all sequences. (b) Empirical cumulative distributionfunction (ECDF) for the length of failure sequences.

The histogram of all sequences (Figure 9.18-a) shows two peaks, one around 50, theother around 225 symbols. This means that a large amount of sequences either havearound 50 or 225 symbols, although most of the sequences span a time interval of fiveminutes. An explanation for this phenomenon is that the system writes either a great manyerror log records or only a few, depending on varying call-load on the system present inthe rather small excerpt of data (as can also be observed in Figure 9.14). An analysis ofthe entire data set showed a more equal distribution. In order to be consistent with theother analyses presented in this section, the distribution has been plotted as is.

Regarding failure sequences (Figure 9.18-b), the empirical cumulative distributionfunction is presented since it is the appropriate visualization for the argumentation usedin Section 9.2.5. The reason why the maximum length of failure sequences is smallerthan for all sequences is simply random variability: A separate investigation has shownthat there are also failure sequences with more than 200 symbols. Again, for the reason ofconsistency, the plot is provided for the same data that has been used in investigations ofprevious sections. Comparing Figure 9.18-b to Figure 9.13, it might look surprising howan average length of 25 can result from the ECDF shown in Figure 9.18. The explanationfor this is that Figure 9.13 only plots average length of sequences belonging to one failuregroup. The second group has an average length of 75.2 without noise filtering.


9.4 Training HSMMsIn previous sections, data preprocessing has been explored, which is not necessarily fo-cused on HSMM failure prediction. This section describes and analyzes the steps involvedin training HSMMs for failure prediction. Note that for reasons of legibility the previousanalysis was based on a small excerpt of data. In order to yield more reliable results, alarger data set has been used for the experiments described in the following sections.

9.4.1 Parameter SpaceA lot of parameters are involved in modeling. Although most of them have already beenmentioned and / or explained in previous chapters, an overview is provided, here. More-over, the number of parameters and their possible values is too large to compare all com-binations. Hence some parameters have been explored in detail while reasonable values,based on an “educated guess” has been assumed for others (this approach has been termedgreedy versus non-greedy in Section 8.3.1).

Parameters that have been set heuristically. No experiments have been performedfor the following parameters. Instead, values have been chosen according to the reasonsdescribed.

• Intermediate probability mass and distribution. In the experiments, 10% of theprobability mass of each transition has been distributed among intermediate states(c.f., Section 6.6). The transitions itself have been chosen to be normal distributionssince they are centered around the mean, which is useful for the requirement that thesum of mean intermediate durations should equal the mean duration between origi-nal states that are extended.5 Since for uncorrelated random variables the followingproperty holds:

Var(∑

i

Xi

)=∑i

(VarXi) , (9.1)

variance of the intermediate distributions have been set to the variance of inter-error-delays divided by the number of intermediate states plus one. The assumptionthat two successive delays are uncorrelated might not hold,6 however it lead toreasonable good prediction results.

• Number of tries in optimization. As stated before, the Baum-Welch algorithm con-verges into a local optimum starting from a random initialization. The problem isthat it cannot be determined whether the local optimum is close to the global oneor not. Ignoring more sophisticated techniques such as evolutionary strategies, theBaum Welch algorithm has simply been performed 20 times and the best solutionin terms maximum overall training sequence likelihood has been chosen.

• Type of background distributions. In principle, the concept of background distri-butions for observation probabilities allows to use arbitrary distributions. In this

5Due to the central limit theorem, if there are many intermediate distributions having finite variance, thesum approximates a normal distribution, anyway

6E.g., due to the bursty behavior described in Section 9.3.4

9.4 Training HSMMs 199

thesis, the distribution of symbols estimated from the entire training data set hasbeen used since this reflects the overall frequency of error occurrence.

Parameters that have been varied. Several experiments have been performed in orderto determine the effects of the following parameters. Results of these experiments areprovided in the next section.

• Number of states. As can be seen from the figures on the principal prediction ap-proach (Figure 2.9 on Page 19 and Figure 2.10 on Page 20) u + 1 HSMMs areinvolved, where u is the number of groups obtained from failure sequence cluster-ing. Each model consists of N states. The question is how the number of statesaffects the modeling process. Since the prediction models have a strict left-to-rightstructure (c.f., Figure 6.10 on Page 126), the maximum number of transitions isN − 1. From this one might conclude that the models should have as many statesas there are symbols in the sequences. On the other hand, the larger the model,the more model-parameters have to be estimated from the same limited amount oftraining data resulting in worse estimations. Therefore, a better solution might beobtained if an HSMM with fewer states is used and some very long sequences areignored.

• Maximum span of shortcuts. Figure 6.10 on Page 126 shows that there are shortcuttransitions in the model bypassing several states. Increasing the maximum spanof shortcuts increases flexibility of the models but almost doubles, triples, etc. thenumber of transitions and hence the number of transition parameters.

• Number of intermediate states. After training, intermediate states are added to themodel (c.f., Figure 6.11 on Page 127). The number of states added between eachpair of states affects generality of the model. If there are no intermediate states, themodels might be overfitted. If there are too many, the model is too general.

• Amount of background weight. Background distributions are an important way toreduce variance of hidden Markov models. The weight ρi by which backgrounddistributions are mixed with observation distributions obtained from training alsoaffects the bias-variance trade-off.

9.4.2 Results for Parameter InvestigationFour parameters have been listed that need to be explored with respect to failure predictionperformance. One way to investigate their effect would be to perform a separate experi-ment for each parameter. However, such approach has two problems: First, the approachneglects coherence among parameters and second, while testing one parameter, it is notclear what (fixed) values should be assumed for the others. However, an investigationreveals that there are two effects how parameters influence the model:

1. Number of states and maximum span of shortcuts determine the number of param-eters (i.e. the degree of freedom) of the HSMM that need to be optimized from afixed and finite amount of training data. The trade-off is that a higher degree offreedom in principle allows the model to better adapt to the data specifics – how-ever, since more parameters need to be estimated from the same amount of data, theestimates get worse resulting in worse adaption to data specifics.


2. The number of intermediate states and amount of background weight affect gener-ality of the models after training. More general models can account for a largervariety of input data. On the other hand, too general models yield blurred sequencelikelihoods which in turn can result in worse classification results.

Therefore, the parameters have been investigated in two groups. First, models are trainedfor various combinations of number of states and maximum span of shortcuts. In a sec-ond step, each resulting model is altered by adding intermediate states and applying someamount of background weight. Tests are performed in order to evaluate dependence offailure prediction quality on all four parameters. Additionally, failure prediction is de-pendent on the final classification threshold θ. In order to eliminate dependence on θ, foreach combination of the four parameters, various values of θ have been investigated andmaximum F-measure has been used to compare prediction results.

Training with varying number of states and maximum span of shortcuts. The num-ber of states and maximum span of shortcuts are integer variables. However, a completeenumeration of values is not possible. If, for example, the maximum span of shortcutsshould be varied from zero to five and the number of states from 20 to 500, 2886 combina-tions of model parameters would have to be tested, which is not possible since preparationof the data, set up of models, training, testing and evaluation of prediction results wouldbe too time consuming. Hence, only some values for maximum span of shortcuts and thenumber of states have been selected and all their combinations have been tested. Morespecifically, HSMMs with 20, 50, 100, and 200 states have been investigated. Larger mod-els could not be considered due to requirements both in terms of memory and computingtime. The maximum span of shortcuts has been varied from zero to three. This selectionis based on the following reasons: Shortcuts are introduced to account for missing errorsin failure sequences (e.g., if a symptomatic pattern is B-A-A-B but one example sequencewould only consist of B-A-B, shortcuts enable to align both sequences). By limiting themaximum span of shortcuts to three, it is assumed that not more than three successive er-rors are missing. However, even if this case occurs, the sequence with missing errors canbe aligned from the next but one state on. Furthermore, this limitation is sufficient sincebest failure prediction results are achieved with a shorter maximum span of shortcuts, asis shown in the next paragraphs. Also note that shortcuts are not necessarily required tohandle short sequences: due to the initial probabilities πi, a short sequence may start “inthe middle” of the model.

In order to visualize the tradeoff average training sequence log-likelihood is plotted.However, for legibility reasons, the negative of the sequence log-likelihood is shown inFigure 9.19. That means, the higher the bar, the worse the training result, which could beseen as some sort of training error. The dataset used for these experiments consisted of3650 sequences, among which are 278 failure sequences.

Looking at training likelihoods for a maximum shortcut span of zero (first column inFigure 9.19), it can be observed that adaptation to training data increases for an increasingnumber of states up to a model with 100 states, but gets worse for a model with 200 states.Regarding the effect of the maximum span of shortcuts it can be seen that for 20 and 50states, incorporation of shortcuts spanning one to three states deteriorates models with 20and 50 states and improves model training for models with 100 or 200 states. Overall,the best training result is achieved using a model with 100 states and a maximum shortcutspan of one. The following conclusions can be drawn from these observations:


Figure 9.19: Average negative training sequence log-likelihood for several combinations ofthe number of states and maximum span of shortcuts.

1. Models with 20 and 50 states seem to be inappropriately small since the number ofstates determines the maximum length of sequences that can be handled. Sinceshortcuts do not remedy this problem but only introduce additional parameters,training results get worse due to worse probability estimates.

2. As can be seen from experiments without shortcuts, models with 200 states aretoo large. In case of infinite training one would expect that average training log-likelihood is smaller than for 100 states since the model has more degrees of free-dom and can hence better adapt to the training data. Therefore, the reason whytraining likelihood is worse than for models with 100 states is attributed to worseparameter estimation from the limited amount of training data. Furthermore, sincethe Baum-Welch algorithm assigns some small fraction of the probability mass toall transitions, results get also blurred if there are too many.

3. Considering only models without shortcuts, models with 100 states achieve mini-mum negative log-likelihood. However, by the introduction of shortcuts of lengthone, results can be further improved. The fact that negative training log-likelihoodincreases if shortcuts spanning more states are included can be explained by thesame effects as in 1.

Note that these investigations do not automatically allow for the conclusion that modelswith 100 states and shortcuts spanning one state should be used for online failure predic-tion since such models could be overfitted to the training data, as will be discussed in thenext section.

Number of intermediate states and amount of background weight. Intermediatestates and background distributions are applied after training to control the trained model’sgeneralization capabilities. However, overfitting can be reduced either by using fewerstates and more background weight or vice versa. This is the principle reason why non-greedy parameter selection is necessary: a model with worse training sequence likelihoodmight after introduction of intermediate states and application of background distributionsresult in better failure prediction performance than the model with best training results(see discussion of bias and variance in Section 7.3). Hence, all 16 combinations of num-ber of states and shortcuts have been combined with zero to three intermediate states per


transition, and with five levels of background distribution weight. This selection is basedon the following considerations: Similar to the introduction of shortcuts, the introductionof intermediate states aims at alignment of sequences with additional errors in betweensymptomatic ones. And for similar reasons, the introduction of up to three intermediatestates between each transition is sufficient. Background weight is a real-valued parameterand hence five values had been selected spanning a range from zero to 0.2.

In order to evaluate each combination, the maximum achievable F-measure with re-spect to out-of-sample prediction of validation sequences has been determined. Out-of-sample means that validation sequences have not been available for training. Since itis not possible to present results of all 320 combinations here, the three most importantfindings are described in the following:

1. Application of background distributions can increase failure prediction perfor-mance for all combinations. However, this is only true if the background distri-bution weight is rather small. Too large values for background distribution weightquickly result in “random” models resulting in worse prediction performance thanmodels without background distribution. Hence, in later experiments, a backgroundweight ρi of 0.05 has been applied.7

2. A similar effect can be observed from the introduction of intermediate states. Over-all, the effect of adding intermediate states to the models did not meet expecta-tions: Failure prediction performance could only be improved slightly when oneintermediate state per transition was added. This setting has been used for furtherexperiments.

3. One setting for a model with 50 states and no shortcuts achieved roughly the samefailure prediction quality as the model with 100 states and maximum shortcut spanof one, which gives evidence to the described problem that a model with best train-ing likelihood does not guarantee optimal prediction performance on test data. Onthe other hand, the model with best training likelihood belongs to the set of modelswith best failure prediction performance. Therefore, the model with 100 states hasbeen used for further experiments since it can account for longer sequences.

Computation times. A theoretical analysis of the algorithm’s complexity has been pro-vided in Chapter 6, but the analysis was rather coarse grained and did only take the num-ber of states and length of the sequence into account. Although the effect of the fourparameters investigated in this section could in principle be traced down to the numberof states and the number of edges, or even further to the number of multiplications andadditions, such full-fledged analysis is not provided here. Instead, the time needed to trainthe models and to classify test data has been measured several times on one and the samemachine, which allows at least a guess on the effect of parameters in some relative way.

Training time is affected only by the number of states and maximum span of shortcuts,and testing time is additionally influenced by the number of intermediate states. Theamount of background weight has no influence on testing times since output probabilitiesbi(Oj) are altered before testing starts. Figures 9.20 to 9.22 show the results.

In Figure 9.20, mean training time for all 16 combinations of parameters is shown.Training time is determined by the time needed to train one model. Not surprisingly,

7c.f., Equation 6.63 on Page 112


Figure 9.20: Mean training time depending on the number of states and maximum span ofshortcuts.

training time increases both with the number of states and the maximum span of shortcutssince both increase the number of parameters that need to be estimated from the trainingdata set. The figure suggests that the number of states has a stronger influence than themaximum span of shortcuts. One reason is that the maximum span of shortcuts onlyincreases the number of transitions parameters, which are only a subset of all parametersthat need to be determined. For the configuration used in further experiments (100 states,maximum shortcut span of one), a mean training time of 1365 seconds resulted.

With respect to testing, 75% trimmed mean testing times are plotted in Figure 9.21.Testing time is determined by the mean of the time needed to perform a prediction on onesingle sequence. In Figure 9.21-a, processing time is plotted in dependence on the numberof states and maximum span of shortcuts. The the number of states clearly dominatestesting time, which can again be explained by the fact that the maximum span of shortcutsonly increases the number of transitions (and no state-dependent parameters) and henceonly has effects in the most inner core loops of the algorithm. In addition to the numberof states and maximum span of shortcuts, testing time is also influenced by the number ofintermediate states. Figure 9.21-b shows 75% trimmed mean testing times in dependenceon the number of states and the number of intermediate states for a maximum shortcutspan of one. Surprisingly, computation time decreases by the introduction of intermediatestates. An analysis has revealed that this is due to the fact that with intermediate statesprobabilities decrease more quickly in the forward and backward algorithm such thatimplemented shortcuts in the algorithm are executed if probabilities are below a certainthreshold for some sequences.

Performing online failure prediction is a real-time application. However, no full-fledged real-time analysis can be presented here. Instead, Figure 9.22 shows upper limitsof 95% confidence intervals on mean testing time. This is obviously no guarantee thatthe algorithm can always be performed in real time. However, two things should also betaken into consideration: First, the algorithm operates on a lead time that is much larger(e.g., five minutes), hence there is some space for “buffering”. Second, errors occur inshort bursts with longer time intervals with only very few errors. This means that thereis some chance for the algorithm to catch up. If not, the algorithm could simply ignore


(a) (b)

Figure 9.21: Computation time needed for testing a single sequence dependending on num-ber of states and maximum span of shortcuts for one intermediate state (a) anddependending on number of states and number of intermediate states for a max-imum shortcut span of one (b).

(a) (b)

Figure 9.22: 95% upper confidence interval limits for mean testing times corresponding toFigure 9.21.

some sequences.8

In the experiments, one non-failure model and two failure models have been used.With respect to testing, the effect of the number of groups is linear. However, with re-spect to training, the effect is more complex since with an increasing number of groupsthere are fewer training sequences in each group partly compensating for the overhead oftraining more models. The number of groups is expected to reflect the number of failuremechanisms in the system and is determined during data preprocessing. Nevertheless, theeffect has been analyzed: the same data has been processed with only one failure group.It has been found that the total time of training a model with only one non-failure andone failure model is increased by approximately 20% since more iterations on a largertraining data set have to be performed.

8Note that processing times shown in Figures 9.21 and 9.22 refer to an entire sequence.

9.5 Detailed Analysis of Failure Prediction Quality 205

9.5 Detailed Analysis of Failure Prediction QualityIn the previous sections the parameters involved in setting up an HSMM-based failurepredictor have been investigated. Although some model parameters have been assessedwith respect to failure prediction, only the maximum F-measure has been used. In thissection, the quality of failure prediction is assessed in more detail. Specifically, in Sec-tion 9.5.1, the focus is on precision, recall, and F-measure. In Section 9.5.2, ROC curvesand related metrics are provided while in Section 9.5.3 evaluation deals with cost-basedmetrics. All experiments shown here have been performed using parameter settings aslisted in Table 9.2.

lead time data window length no. of states max. span no. of inter- background∆tl ∆td N of shortcuts mediate states weight

5 min 5 min 100 1 1 0.05

Table 9.2: Experiment settings for detailed analysis.

With respect to data sets, the experiments performed in previous sections have beenevaluated using out-of-sample validation data, while results reported in this section re-fer to out-of-sample test data (c.f. Section 8.3.2). 95% confidence intervals have beenestimated by the procedure described in Section 8.4.5.

9.5.1 Precision, Recall, and F-measurePrecision, recall, and F-measure have been defined in Section 8.2.2. As they have beendeveloped for information retrieval evaluation, their focus is on imbalanced class distri-butions as is the case for failure prediction. However, precision, recall, and F-measure aredependent on the classification threshold θ (c.f., Equation 7.20 on Page 137) and henceprecision/recall plots and a plot of the F-measure for a selection of eleven thresholds rang-ing from −∞ to∞ are provided. At each of the eleven classification threshold levels θ,95% confidence intervals have been computed.

At the threshold level for the maximum F-measure of 0.66, the corresponding valuesof precision and recall are 0.70 and 0.62, respectively, which means that failure warningsare correct in 70% of all cases and almost two third of all failures are caught by the predic-tion algorithm. Both values can be increased to reach 1.0 by adjusting the classificationthreshold θ. It depends on the methods and actions triggered by the prediction algorithm,whether high precision or high recall is more important.

9.5.2 ROC and AUCTaking true negative predictions into account, ROC curves plot true positive rate versusfalse positive rate (c.f., Section 8.2.2) and AUC is the area under the resulting curve asestimated by integrating the piecewise linearly interpolated ROC curve.

Figure 9.24 shows ROC for HSMM failure prediction. Choosing the threshold yield-


0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

confidence level: 0.95Recall

Pre

cisi

on

0 10 20 30 40 50

0.0

0.1

0.2

0.3

0.4

0.5

0.6

decision threshold

F m

easu

re

(a) (b)

Figure 9.23: Precision/Recall plot (a) and corresponding values of F-measure (b) for theHSMM failure prediction model. A selection of eleven thresholds ranging from−∞ to∞ has been plotted including 95% confidence intervals for precision andrecall.

ing the maximum F-measure, a false positive rate of 0.016 and true positive rate (whichis equal to the recall) of 0.62 results. Area under the ROC curve (AUC) equals 0.873.

9.5.3 Accumulated Runtime CostThe plot showing accumulated runtime cost (c.f., Section 8.2.4) is dependent on the as-signment of cost to each of the four cases that can occur in failure prediction:

• A true negative prediction has cost rF F . Since it is a negative prediction, no subse-quent actions are performed. Furthermore, since it is a correct decision, rF F shouldbe the smallest value. A value of 1 has been chosen arbitrarily.

• A true positive prediction has cost rFF . Since the occurrence of a failure is pre-dicted, some actions are performed in order to deal with the upcoming failure re-sulting in higher cost. However, it is a correct prediction and hence cost should notbe too high. Hence, a value of 10 has been chosen.

• A false positive prediction has cost rFF . A failure is predicted and actions areperformed as in the previous case —however, these actions are unnecessary sincein truth no failure is imminent. Hence a value of 20 has been chosen.

• A false negative prediction has cost rFF . From the point of view of computationalworkload, cost should equal rF F . However, this is the worst case since an upcomingfailure is not predicted and nothing would be done about it. The system fails whichimplies highest cost. Therefore, cost of 1000 have been assigned to this case.

Figure 9.25 shows accumulated runtime cost for a simulated run of 31.5 days. The figureincludes boundary cost for:

• oracle predictor: this predictor issues only a true positive failure warning at thetime of failure occurrence setting the lower bound of overall achievable cost.

9.6 Dependence on Application Specific Parameters 207

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

confidence level: 0.95false positive rate

true

pos

itive

rat

e

Figure 9.24: ROC plot for the HSMM failure prediction model applied to telecommunicationsystem data. A selection of 11 thresholds ranging from −∞ to ∞ has beenplotted. At each threshold level, 95% confidence intervals for true and falsepositive rate are provided.

• perfect predictor: performs a prediction at each time instant, an error message oc-curs. However, each prediction is correct, i.e., only true positive or true negativeprediction occur.

• no prediction: if no predictor was in place, cost for false negative prediction occurseach time, a failure occurs

• maximum cost: A prediction is performed each time an error message occurs. How-ever, each prediction is wrong and hence only false positives and false negativepredictions are performed.

As it can be seen from the plot, many failures occurred at the beginning of the run,followed by some “silent” period. However, due to lack of plotting resolution, it cannotbe seen that some failures occurred quite close in time resulting in a total of 232 failures.By use of the HSMM failure predictor, accumulated runtime cost can be cut down toapproximately one fifth of the cost without a failure predictor.

9.6 Dependence on Application Specific ParametersExperiments conducted so far have analyzed the parameters involved in data preprocess-ing and modeling. Parameters were not specific for the application domain for whichfailure prediction should be performed. This section investigates application specific fac-tors, which refers to restrictions or properties imposed by the application domain or thesystem.

9.6.1 Lead-TimeIn Section 2.1, or more specifically in Figure 2.4 on Page 12, it is shown that lead-time∆tl has a lower bound called warning-time ∆tw, which is determined by the time neededto perform some action upon failure warning. In the experiments carried out so far alead-time ∆tl of five minutes has been used. In order to evaluate the effect of lead-time,


0 500000 1000000 1500000 2000000 2500000

050

000

1500

0025

0000

time

cost

0 500000 1000000 1500000 2000000 2500000

050

000

1500

0025

0000

0 500000 1000000 1500000 2000000 2500000

050

000

1500

0025

0000

0 500000 1000000 1500000 2000000 2500000

050

000

1500

0025

0000

0 500000 1000000 1500000 2000000 2500000

050

000

1500

0025

0000

0 500000 1000000 1500000 2000000 2500000

050

000

1500

0025

0000

HSMM predictor

no prediction

maximum cost

oracle predictorperfect predictor

Figure 9.25: Accumulated runtime cost for the HSMM failure prediction model. A test runof 31.5 days has been plotted. A cost ratio of rF F : rFF : rFF : rFF =1 : 10 : 20 : 1000 has been used. The plot also includes boundary cost foran oracle predictor, perfect predictor, a system without prediction and maximumcost. Triangles at the bottom indicate times of failure occurrence.

experiments with a lead-time ranging from ∆tl = 5 minutes to ∆tl = 30 minutes havebeen performed. Figure 9.26 summarizes the results in terms of maximum F-measure with95% confidence intervals determined from out-of-sample test data. Although one couldexpect a rather linear decrease of failure prediction performance, experiments indicate thatfailure prediction performance stays more or less constant until a lead-time of up to 20minutes, then the F-measure drops quickly. The rather sharp drop observed in the figureindicates that symptomatic manifestations of an upcoming failure are only observable upto 20 minutes9 before failure occurrence. Taking into account that errors occur late inthe process from faults to failures, it can be concluded that the fine-grained detectionmechanism in the telecommunication system is able to grasp the first misbehaviors up to20 minutes ahead before failure.

9.6.2 Data Window SizeTraining of HSMMs as well as of other failure prediction models is based on error se-quences. The length of each sequence is determined by data window size ∆td. Although∆td is a data preprocessing parameter (c.f., Section 9.2.4), it is analyzed here since effectswith respect to failure prediction quality have not been investigated in Section 9.2.4.

As is the case with many parameters, the effects of ∆td on failure prediction qual-ity are manifold and can hardly be assessed analytically. In principle, longer sequencesshould result in a more precise classification. On the other hand, the farther sequencesreach back into past, the more likely it becomes that failure-unrelated errors are includedin failure sequences, which deteriorates failure prediction. Figure 9.27 plots maximumF-value for five values of ∆td: data windows of length 1 minute, 5 minutes, 10 minutes,15 minutes and 20 minutes. Figure 9.27-a shows failure prediction quality in terms of

9plus length of the data window ∆td

9.7 Dependence on Data Specific Issues 209

5 10 15 20 25 30

0.0

0.2

0.4

0.6

lead time [min]

F m

easu

re

Figure 9.26: Failure prediction performance for various lead-times ∆tl. The plot shows F-measure with 95% confidence intervals.

maximum F-value and 95% confidence intervals. As can be seen from the figure, despiteof a data window size of ten minutes, failure prediction quality improves with larger datawindow sizes. The reason for the exception at ∆td = 10min might be caused by randomeffects and the fact that the Baum-Welch algorithm only converges to a local maximumrather than a global one, even if it is repeated 20 times.

Improved prediction comes at the price of memory consumption and processing time.Figure 9.27-b shows mean processing time per sequence in seconds. Processing timeincreases heavily with increasing ∆td: With twenty minutes data frames, the average pro-cessing time reaches 2.34 seconds per sequence. This increase is caused by two effects:(a) length of the sequences (L) increases with ∆td, and (b) HSMMs also need to havemore states (N ) in order to represent longer sequences. The reason why confidence in-tervals for processing time get wider with increasing ∆td is that the number of errorsin sequences vary more: time windows of five minute length are “filled with errors” inmost cases whereas in time windows of 20 minute length, there sometimes are larger gapsresulting in a sequence with fewer errors.

9.7 Dependence on Data Specific IssuesBuilding a failure prediction model following a data-driven machine learning approach isalways dependent on quality and quantity of the data and —of course— on the systemitself. This section investigates sensitivity of failure prediction quality with respect todata-specific issues.


5 10 15 20

0.0

0.2

0.4

0.6

0.8

data window size [min]

F m

easu

re

5 10 15 20

01

23

data window size [min]

mea

n pr

oces

sing

tim

e [s

]

(a) (b)

Figure 9.27: Experiments for various data window sizes ∆td. (a) Failure prediction perfor-mance reported as maximum F-value. (b) Mean processing times per sequencein minutes. 95% confidence intervals are shown in both plots.

9.7.1 Size of the Training Data SetThe objective of machine learning is to identify unobservable relationships from measureddata, which is usually blurred by noise. Hence one of the rules of thumb for machinelearning is to use as many datapoints as available. Where in many cases, the maximumsize of the training dataset is the limiting factor, time needed for training may also becritical. Additionally, if very old data is included in the training data set, it might notrepresent precisely the relationships as they are present in the running system. In orderto investigate the effect of the size of the data set, parts of increasing size of training datahave been selected to train a model and have been tested on the same test data set (seeFigure 9.28). More precisely, in the experiments the relationship between the amount of

Figure 9.28: Selection of training data sets for experiments on the effect of the amount oftraining data

available training data and resulting failure prediction quality as well as the time neededfor training has been investigated. In order to visualize the effect, two plots are presented:Figure 9.29-a plots maximum F-measure for the three data sets of different size and Fig-ure 9.29-b shows the time needed to train the models. In failure prediction, usually thenumber of failures in the data set are the limiting factor. In the experiments a small data

9.7 Dependence on Data Specific Issues 211

1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

size of data set

F m

easu

re

1 2 3

020

040

060

080

010

0012

0014

00

size of data set

mea

n tr

aini

ng ti

me

[s]

(a) (b)

Figure 9.29: Effects of the size of the training data set. (a) shows maximum F-measure and(b) the time needed to train models. Data set one contained 72, data set two134 and data set three 278 failure sequences.

set with only 72 failure sequences, a medium data set with 134 failure sequences anda large data set with 278 failure sequences were used, which had been obtained by re-ducing the data set that also used in previous experiments. Regarding Figure 9.29-a, theF-measure is roughly similar for the large data set (3) and the middle sized data set (2).When the data set is further reduced (data set 1) the F-measure drops down significantly.This is due to the fact that there are too less examples to learn from, or, more precisely,to robustly estimate all the parameters of the HSMMs. Figure 9.29-b shows the expecteddependence of time needed to train the models on the size of the data set.

9.7.2 System Configuration and Model AgingComplex computer systems involve a manifold of configuration parameters and are sub-ject to patches and updates. In the case of the telecommunication system investigated inthis thesis, the number of configuration parameters have been estimated by system expertsto exceed 2.000. A separate configuration database is installed on the system and the sys-tem is as flexible as different versions or implementations of a component can be usedjust by updating one value in the configuration database. Hence a single change in theconfiguration may alter system failure behavior significantly. It is the goal of this sectionto investigate sensitivity of the trained HSMM failure prediction models to changes insystem configuration. However, the problem is that we neither had access to the config-uration database nor to any logs indicating configuration changes. Therefore, sensitivitycan only be investigated in terms of the temporal gap ∆tg between the training data setand the test data set (see Figure 9.30).

Figure 9.31 presents results from training a failure prediction model with five differentgaps ∆tg. More precisely, experiments have been performed with a gap of 13 days, 42days, 91 days, 125 days and 152 days between the end of the training and beginningof the test data. Since we had no access to the configuration database, these numbershave been chosen from an in-depth analysis of the entire data set, which revealed, e.g.,


Figure 9.30: Selection of test data sets for experiments on the effect of changing systemconfiguration. ∆tg indicates the gap between the end of training and start of thetest data set.

changes in the log format. Two conclusions can be drawn from the figure: Predictionquality decreases with an increasing temporal distance between training and applicationof the failure predictor. 95% confidence intervals get larger with increasing gap size. Bothcharacteristics can be interpreted with the background of continuous partial updates andpatches: If only parts of the system are changed, some failure indicating error-sequencesare still the same, while others are changed. The HSMM recognizes known (old) errorsequences well while it fails on new sequences. The increasing diversity of sequences isreflected in wider confidence intervals obtained by the bootstrapping procedure.

Besides of the aspect of configuration, plotting failure prediction quality as a functionof ∆tg brings up another aspect of machine learning. The training procedure applied inthis thesis is called supervised offline batch learning, which means that first a batch ofdata is collected which is used entirely to train a model. In this context offline means thattraining is performed not during operation but in the two-phase approach indicated by Fig-ure 2.7 on Page 16. There are other machine learning approaches that try to continuouslyadapt the model in order to keep it up-to-date, however, in order to keep the approachsimple such techniques have not been investigated in this dissertation (see Chapter 12).The important thing to note here is that —assuming an ever changing real system— themodel is always outdated, even right after training. Hence, the question is how quicklykey properties of the system change with respect to the prediction of upcoming failures.Gap ∆tg is one way to expresses “age” of a model.

9.8 Failure Sequence Grouping and FilteringIn this dissertation two data preprocessing techniques have been proposed and used with-out scrutinizing. The experiments described in this section have done so and have inves-tigated the effect of failure sequence grouping as well as noise filtering.

9.8.1 Failure GroupingIn order to obtain more consistent training datasets, failure sequence grouping intendsto separate failure mechanisms in the data. However, this also decreases the number ofsequences available for training of each model. In order to investigate the effects of failuregrouping, prediction performance has been evaluated for a predictor with only one HSMMfailure group model.10 Figure 9.32 presents results, which are intended to be comparedto Figure 9.23 and 9.24, respectively. Results show that failure prediction performancewithout separating failure sequences into groups is worse and a maximum F-measure of

10And, of course, a non-failure model

9.9 Comparative Analysis 213

0 20 40 60 80 100 120 140 160

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

gap size [days]

F m

easu

re

Figure 9.31: Prediction quality (expressed as maximum F-measure) depending on the tem-poral gap ∆tg between training and test data. The gap is expressed in days.

0.5097 and AUC of 0.7700 are achieved. This indicates that failure sequences are toodiverse to be represented by one single HSMM. Since by means of clustering similarsequences are grouped and a separate model is trained for each group, the models canbetter adopt to the specifics of error sequences indicating an upcoming failure.

9.8.2 Sequence FilteringIn order to remove noise from failure sequences, a statistical filter technique has been ap-plied. To investigate the effects, a model with the same parameters as used in Section 9.5has been trained and evaluated using unfiltered data. A maximum F-measure of 0.3601 re-sulting from a precision of 0.670 and a recall of 0.246 with a false positive rate of 0.0095have been achieved. Hence sequence filtering improves failure prediction performanceslightly, at least for the parameter settings used previously.11 Additionally, filtering re-moves symbols from sequences, which in turn has a positive effect on computation times:the average processing time for the prediction of a sequence without filtering is increasedby 16.9%.

9.9 Comparative AnalysisIn order to be able to judge the results presented in previous sections, the HSMM-basedfailure prediction approach has been compared to several published failure predictionapproaches. As already explained in Section 3.2, most promising and well-known ap-proaches to error-driven failure prediction, as they have been identified as subbranches ofCategory 1.3 in the failure prediction taxonomy (c.f., Figure 3.1 on Page 31) have been

11However, it cannot be excluded that other model parametrizations exist that achieve better predictionperformance


0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0


Pre

cisi

on

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0


true

pos

itive

rat

e

(a) (b)

Figure 9.32: Precision / recall plot (a) and ROC plot (b) for prediction with a single failuregroup model

selected. Additionally, results are provided for the simplest prediction method: Predic-tion based on mean-time-to-failures. Since the HSMM-based failure prediction approachpresented in this thesis extends standard HMMs, results for standard HMMs are provided,and comparison to a random predictor and the UBF approach proposed by Hoffmann isgiven. Even though the approaches have already been described in Section 3.2, their keyidea is repeated here for convenience. All experiments have been carried out on the samedataset that has been used in Section 9.5 with a lead-time ∆tl of five minutes. Each modelis discussed separately and a results are summarized at the end of the section, includingcomputation times.

9.9.1 Dispersion Frame Technique (DFT)DFT (c.f., Section 3.2.1) investigates the time of error occurrence by defining dispersionframes (DF) and computing the error dispersion index (EDI). A failure is predicted atthe end of the DF if at least one out of five heuristic rules matches. In addition to theoriginal method, predictions that are closer to present time than warning-time ∆tw ofthree minutes have not been considered.

Initial results for DFT using a parameter setting as in the paper by Lin & Siewiorek[167] resulted in poor prediction performance. Parameters such as thresholds for tuplingand for the rules have been modified in order to improve prediction. However, even withinvestigating 540 different combinations of parameters the best achievable result onlyobtained an F-measure score of 0.115, resulting from a precision of 0.597 but only arecall of 0.063. False positive rate equals 0.00352.

Comparing the results of DFT with the original work by Lin and Siewiorek, achievedprediction performance is worse. The main reason for that seems to be the difference ofinvestigated systems: While the original paper investigated failures in the Andrews dis-tributed file system based on the occurrence of host errors, our study applied the techniqueto errors that had been reported by software components in order to predict upcomingperformance failures. In our study, intervals between errors of the same type are much


shorter. As software container IDs have been chosen as the entity corresponding to fieldreplaceable units (FRUs) Figure 9.33 shows histograms of the time between errors forthree different software containers. As can be seen from the histograms, for the leftmost

delay [s]

Fre

quen

cy

0 5 10 15 20 25

020

000

6000

010

0000

1400

00

delay [s]

Fre

quen

cy

0 10000 20000 30000 40000 500000

200

400

600

800

1000

1200

delay [s]

Fre

quen

cy

0 20 40 60 80 100

010

000

2000

030

000

4000

050

000

Figure 9.33: Histogram of time-between-errors for the dispersion frame technique. Since con-tainer IDs have been chosen to be the FRU equivalent, error messages of threecontainers have been analyzed. In order to obtain histograms in which any de-tails can be seen, after tupling only delays up to the 99% quantile has been usedin order to remove very rare but extremely large values

container ID, the vast majority of delays is below five seconds. Since DFT can at mostpredict a failure half of the delay ahead, most of the failure predictions from this containerare dropped since they are closer to present time than the warning period of 100 seconds.The same holds for the rightmost container ID. The fact that most of the predictions havebeen dropped results in the low recall. However, if a failure warning is issued, it is correctin almost 60% of all cases.

9.9.2 EventsetThe eventset method (c.f., Section 3.2.2) is based on data mining techniques identifyingsets of error event types that are indicative for upcoming failures, which set up a ruledatabase. Construction of the rule database includes the choice of four parameters:

• length of the data window

• level of minimum support

• level of confidence

• significance level for the statistical test

The training algorithm has been run for 64 combinations of various values for the param-eters and the best combination with respect to F-measure has been selected. Since thefirst part of the algorithm potentially needs to investigate the power set of all 1435 errorsymbols, which is approximately 9.5 · 10430, a branch and bound algorithm called “apri-ori” has been used as indicated in the paper by Vilalta & Ma [268].12 Best results have

12More specifically, the implementation of Christian Borgelt (see [34])


been achieved for a window length of five minutes, confidence of 10%, support of 25%and significance level of 5% yielding a precision of 0.465, recall of 0.327, F-measure of0.3841, and false positive rate of 0.0422.

9.9.3 SVD-SVMSupport Vector Machines (SVMs) are state-of-the art classifiers showing various desirableproperties such as convexity of the optimization criterion, etc. The major problem whenusing SVM classifiers for failure prediction is representation of error data. Domeniconiet al. [81] have used a bag-of-word representation together with latent semantic indexingtechniques to solve this problem resulting in the failure prediction approach describedin Section 3.2.3. 90 different configurations have been tested and the configuration withmaximum F-measure has been selected. In particular, configurations have been definedby the following parameters:

• length of the data window ∆td

• type of kernel function: linear, polynomial, and radial basis functions (c.f., e.g.,Chen et al. [56])

• parameters controlling the kernels, such as γ for radial basis function kernel

• trade-off between training error and margin (parameter C, as in, e.g., Schölkopfet al. [231])

• feature encoding: either existence, count, or temporal (c.f., Section 3.2.3)

The approach has been implemented using R and the free SVM toolkit “SVMlight”[135]. However, there is one difference to the algorithm as originally published inDomeniconi et al. [81]: Since the output of SVMlight is not only a class label but adistance from the decision boundary, a precision / recall and ROC plot can be drawn.The idea is to classify a sequence as failure prone only if the SVM output is above somecustomizable threshold. Classification performance of the original algorithm hence cor-responds to a threshold equal to zero. Figure 9.34 presents the results.

Best results have been achieved using a radial basis function kernel with γ = 0.6, error/ margin tradeoff c = 10, and count feature encoding. Using this setting, a maximum F-measure of 0.226, precision of 0.182, recall of 0.299, and false positive rate of 0.1103have been achieved.

The fact that encoding error messages by the count scheme rather than the temporalscheme might seem contradictory to one of the principal assumptions in this dissertationthat taking both type and time of error messages into account should improve failureprediction. However, this is not the case, since the way, time is represented in the temporalscheme has a fundamental flaw: By representing the occurrence of each error type as abinary number, the temporal scheme investigates absolute time of error occurrence in thesequence rather than relative, and discretizes time rather than treating it continuously (c.f.,Section 4.2.1). As an example, assume that there is only one occurrence of one specificerror message type in a sequence. If the error messages appears only a little earlier suchthat it falls into the next time slot, the magnitude along the error dimension is doubled inthe bag-of-words representation.


0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0


Pre

cisi

on

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0


true

pos

itive

rat

e

(a) (b)

Figure 9.34: Failure prediction results for the SVD-SVM failure prediction algorithm. Resultsare reported as Precision/recall plot (a) and ROC plot (b)

9.9.4 Periodic Prediction Based on MTBFThe reliability model-based failure prediction approach is rather simple and has beenincluded to show the prediction performance that can be achieved with the most straight-forward prediction method. Not surprisingly, prediction performance is low: Precisionequals 0.054, and also recall equals 0.054 yielding an F-measure of 0.0541. Since the ap-proach only issues failure warnings (there are no non-failure-predictions), false positiverate cannot be determined, here.

The reason why this prediction method does not really work for the case study is thatthe prediction method is periodic: the next failure is predicted to occur at the medianof a distribution that is not adapted during runtime. As can be seen from the histogramof time-between-failures (Figure 9.16) the distribution of time-between-failures is wide-spread and from the auto-correlation of failure occurrence (Figure 9.17) that there is noperiodicity evident in the failure data set.

9.9.5 Comparison with Standard HMMsThis dissertation is the first to apply hidden Markov models to the task of online failureprediction. In Section 4.2 it has been argued that standard HMMs are not well-suited forrepresenting error logs due to insufficient capabilities to represent time (c.f., Figure 4.7on Page 64). While this has just been a claim based on theoretical analysis, this sectionprovides experimental evidence for it.

Three experiments have been performed in which

1. no timing information is used

2. time-slotting (c.f., Section 4.2.1) is used.

3. the model described in Salfner [223] has been used.


The third case did not work out due to the fact that the process is forced to always traversethe same limited set of states and hence looses its pattern recognition potential. Dueto this theoretic flaw, this approach was not further investigated. In the first two cases,the structure of the HMM was similar to the structure of the failure prediction HSMM.In case of the time-slotting model, an extra observation symbol representing silence hasbeen added. Theoretically, the time slot size should be set equal to the minimum delaybetween the errors, which is determined by the tupling parameter ε = 0.015s. However,this would lead to huge models since:

]states = 5 min. × 600.015 slots/min. = 20000 slots = 20000 states . (9.2)

However, 20000 states are far to many to be trained from limited data within reasonabletime. Hence a larger time-slotting interval of ε = 0.2s has been used resulting in a modelwith 1500 states. If two error symbols occurred within one time slot, one symbol has beenchosen randomly, which treats such cases as noise. Opposed to that, the model withouttiming had 100 states.

The model without timing achieved a prediction performance of precision = 0.230,recall = 0.176 and hence an F-measure of 0.1996. False positive rate was equal to 0.049.The model with time-slotting achieved a prediction performance of precision = 0.079,recall = 0.129 and hence an F-measure of 0.0982 with a false positive rate of 0.124.

Similar to the SVD-SVM method, this experiment also shows that the sheer incorpo-ration of time information does not automatically lead to good failure prediction results.Rather, in the case considered here, the incorporation of time by time-slotting did renderthe prediction approach almost unusable.

9.9.6 Comparison with Random PredictorThe term “random predictor” denotes a predictor that at each time a prediction is to beperformed, a failure warning is issued with probability of 0.5. Applied to the case study,such predictor would result in a contingency table as shown in Table 9.3. From the table,

True Failure True Non-failure SumPrediction: Failure 139 1686 1825

Prediction: No Failure 139 1686 1825Sum 278 3372 3650

Table 9.3: Contingency table for a random predictor.

the following values for precision, recall, and false positive rate can be computed:

• precision = 1391825 ≈ 0.076

• recall = 139278 = 0.5

• fpr = 16863372 = 0.5


which results in an F-measure of approximately 0.1322. One might conclude from thisconsiderations that any predictor with recall less than 50% is useless. However, this is nottrue: As precision and recall are in most cases inversely linked, many prediction methodstrade recall for precision, and are hence useful even though recall is below 50%.

9.9.7 Comparison with UBFHSMM-based failure prediction operates on event-triggered input data and comparativeapproaches have been selected from this class, too. However, Günther Hoffmann hasproposed a failure prediction technique called “Universal Basis Functions” (UBF) andhas applied it to symptom monitoring data such as workload and memory consumption13

of the same telecommunication system (Hoffmann & Malek [121]). Therefore, results areoutlined here for comparison. In Hoffmann [120], values for true/false positives/negativeshave been published, shown here as contingency Table 9.4. From this table, a precision

True Failure True Non-failure SumPrediction: Failure 4 49 53

Prediction: No Failure 2 192 194Sum 6 241 247

Table 9.4: Contingency table for the UBF failure prediction approach.

of 0.076, recall of 0.667 and false positive rate of 0.2033 are computed. This yields anF-measure of 0.13559. AUC is reported to be 0.846. It should be noted that the abovevalues are derived from a rather small data set containing only 247 predictions, amongwhich are only six failures.

Looking at precision and F-measure, it seems as if UBF is not much better than arandom predictor. However, this is not true, since UBF operates on different data: A ran-dom predictor applied to the UBF data would only achieve a precision of 0.024 resultingin an F-measure of 0.0463. Furthermore, UBF achieves an AUC value that is similar toAUC of the HSMM approach –but at considerably lower computational cost: In order toperform a UBF prediction, each kernel has to be evaluated only once14 and results arelinearly combined. Therefore, in scenarios where, e.g., false positive alarms do not incurhigh cost, UBF can achieve similar results with lower cost. However, if high precision isa requirement, HSMM outperforms UBF significantly.

9.9.8 Discussion and Summary of Comparative ApproachesIn this section, HSMM-based failure prediction has been compared to other well-knownprediction techniques: Dispersion Frame Technique (DFT), Eventset method and SVD-

13More precisely, a variable selection technique called PWA has been applied yielding the number ofsemaphore operations per second and the amount of allocated kernel memory as most descriptive variables

14in case embedding dimension is zero, which has been shown to yield best results for UBF


SVM. Additionally, periodic prediction has been investigated in order to show perfor-mance of the most straightforward failure prediction approach.

Failure prediction quality has been expressed as precision, recall, F-measure andfalse positive rate (FPR). Figure 9.35 summarizes results for event-based failure predic-tors including 95% BCa confidence intervals obtained from bootstrapping. In summary,

0.0

0.2

0.4

0.6

periodicDFTEventsetSVD−SVMHSMM

recall F−measureprecision false positive rate

Figure 9.35: Summary of prediction results for comparative approaches. Results are reportedas mean values and 95% confidence intervals.

HSMM-based failure prediction outperforms the other techniques significantly in most ofthe metrics. The second best technique is Eventset, which has been developed at IBM andhas been used for failure prediction in large scale parallel systems. However, improvedprediction is not for free: HSMM is by far the most complex failure prediction algorithmwith respect to both time and memory consumption. More specifically, Table 9.5 liststraining times and the time needed to perform one prediction along with lower and upperbounds of 95% confidence intervals. It can be observed that with respect to training,the HSMM approach takes approximately 2.4 times as long as SVD-SVM and almost 60times as long as Eventset, which is the second-best prediction algorithm in this compar-ative analysis. Nevertheless, training has no tight real-time constraints and can still beperformed within reasonable timescales. With respect to online prediction, HSMM takesmuch longer in comparison to the other techniques. However, computation times are stillsufficiently small if seen in the context of a lead time of at least five minutes.

HSMM-based failure prediction has also been compared to standard HMMs sinceHSMMs are derived from HMMs. Prediction performance of a random predictor has beencomputed for comparative reasons, and, finally, HSMM-based failure prediction has been

15Time variances have been below system time resolution, hence no confidence intervals can be provided,here

9.10 Summary 221

Prediction technique Training Online Prediction

Reliability-based periodic n/a n/a

DFT n/a 4.9767e-5 / 1.2857e-4 / 2.0728e-4

Eventset 17.22 / 22.90 / 28.58 0.000715

SVD-SVM (max. F-measure) 572.62 / 572.73 / 572.83 5.7566e-4 / 6.7781e-4 / 7.7995e-4

HSMM (max. F-measure) 1295.90 / 1365.00 / 1434.10 5.7761e-2 / 0.15715 / 0.25655

Table 9.5: Summary of average computation times for comparative approaches with 95% con-fidence intervals. Time is reported in seconds.

compared to Universal Basis Functions (UBF), which is a very good failure predictiontechnique for the analysis of periodic measurements such as system workload or memoryconsumption.

9.10 SummaryIn this chapter, the theory developed in previous chapters has been applied to industrialdata of a commercial telecommunication system in order to investigate how well failuresof a complex computing system can be predicted. All the steps from data preprocessingto an evaluation of failure prediction quality have been thoroughly investigated, whichmeans that the effects of the various parameters involved have been assessed. In moredetail, the following issues have been covered:

• Data preprocessing consists of assignment of error-IDs to error messages, tupling,failure sequence clustering, and noise filtering.

• Properties of the resulting data set have been investigated. This involved an analysisof error frequency, the distribution of inter-error delays, the distribution of failures,and length of the resulting error sequences.

• Modeling. Parameters involved in HSMM modeling include the number of states,maximum span of shortcuts, number of intermediate states, intermediate proba-bility mass and distribution, distribution type and amount of background weight,and number of tries for the Baum-Welch algorithm. Some of these parametershave been set heuristically while others, for which no values could be determinedupfront, have been investigated with respect to failure prediction performance onout-of-sample validation data.

• For the given setting of parameters, failure prediction quality has been investigatedin more detail using out-of-sample test data: precision, recall, F-measure, falsepositive rate, precision-recall plot, ROC plot, AUC and accumulated runtime costhave been reported.

• Application specific parameters lead-time ∆tl, and data window size ∆td have beenexplored in order to determine their effect on failure prediction performance.


• A Data-specific issues have been investigated in order to determine how failureprediction depends on the size of the training data set and the temporal distancebetween training and test dataset, which can be taken to give an indication of modelaging due to system configuration changes and updates.

• The effect of failure sequence clustering and noise filtering have been investigated.

• In order to show that the theory developed in this thesis really improves failureprediction quality, a comparative analysis has been performed. The selection ofcomparative approaches includes the best-known approaches to error-driven failureprediction, as they have been identified as subbranches of Category 1.3 in the fail-ure prediction taxonomy (c.f., Figure 3.1 on Page 31). Specifically, HSMM-basedfailure prediction has been compared to Dispersion Frame Technique (DFT) devel-oped by Lin & Siewiorek [167], Eventset method developed by Vilalta & Ma [268]at IBM and Singular value decomposition – support vector machines (SVD-SVM)developed by Domeniconi et al. [81]. In order to provide some rough estimate oneffortless prediction, periodic prediction on the basis of MTBF has also been ap-plied to the same data. In a further experiment, HSMM-based prediction has beencompared to standard hidden Markov models, a random predictor and UniversalBasis Functions (UBF) developed by Hoffmann [120].

In summary, it has been shown that for industrial data of the commercial telecommu-nication system, HSMM-based prediction is superior to all failure prediction approachesit has been compared with. Supposedly, the main reasons for this are first the approachto efficiently exploit both time and type of error messages by treating them as tempo-ral sequence, and second the modeling flexibility provided by HSMMs. For example,one characteristic is that HSMMs can handle permutations of error symbols occurringtogether within a short time interval (c.f., Page 100). This property is relevant for errorsequence-based failure prediction since ordering of error events occurring closely in timecannot be guaranteed in complex environments such as the telecommunication system.On the other hand, it must be conceded that modeling flexibility comes at the price of aconsiderable number of parameters that need to be adjusted. Hence, applying HSMMsas modeling technique requires substantial investigations and experience. Additionally,computational effort is increased: In comparison with, e.g., SVD-SVM, HSMM-basedtraining consumes approximately 2.4 times as much time for training and 232 times aslong for online prediction. However, this is still not prohibitive against the backgroundthat HSMM-based failure prediction is able to reliably predict the occurrence of failureswith a lead-time of up to 20 minutes.

Contributions of this chapter. Contributions of this chapter are two-fold. From anengineering point of view the chapter has shown in detail how an industrial system can bemodelled and how the various parameters can be investigated and adjusted to the specificsof a system. From a scientific point of view, the main contributions of this chapter arean in-depth performance evaluation of the HSMM method and a comparative analysisof the approach to other well-known prediction approaches. Furthermore, it has beenshown that extending standard HMMs to HSMMs is worth the effort since predictionquality is significantly improved and that the proposed preprocessing techniques, i.e.,failure sequence clustering and noise filtering, improve failure prediction results.

9.10 Summary 223

Relation to other chapters. It has been shown in this chapter that a prediction of up-coming failures is possible in complex computer systems. However, prediction alone doesnot improve system dependability! Hence the following fourth part of the thesis addressesthe issue what to do about a failure that has been predicted. In terms of the engineeringcycle, the third phase has been completed and a solution for failure prediction has beenobtained. The next part will cover the fourth and last phase of the engineering cycle,which focuses on system improvement.

Part IV

Improving Dependability, Conclusions,and Outlook

225

Chapter 10

Assessing the Effect on Dependability

The last phase of the engineering cycle named “improvement”, closes the loop: The goalis to use the failure prediction solution developed in previous chapters in order to improvethe system with respect to system dependability. However, dependability improvement isnot the primary goal of this dissertation —it focuses mainly on failure prediction. Thatis why this last part is shorter and investigations are not as detailed as in previous chap-ters. More specifically, in Section 10.1 proactive fault management is introduced, whichdenotes the combination of online failure prediction and actions to improve system de-pendability. Related work on previous approaches to model proactive fault managementis provided in Section 10.2. In Sections 10.3 to 10.6, an availability model and a simpli-fied reliability model are proposed, and closed form solutions for availability, reliabilityand hazard rate are derived. The issue of parameter estimation from experimental data iscovered by Section 10.7 and some experiments that have been performed in the course ofa diploma thesis primarily supervised by the author are presented in Section 10.8.

10.1 Proactive Fault ManagementSystem dependability cannot be improved solely by predicting failures —some actionsare necessary in order to do something about the failure that has been predicted. Asshown in Figure 1.1 on Page 4, online failure prediction and actions form a cycle where arunning system is continuously monitored in order to obtain data on the current status ofthe system, and a prediction algorithm is performed resulting in a classification whetherthe current situation is failure-prone or not. If so, it raises a failure warning and actionsare performed in order to do something about the failure. This might include diagnosis toinvestigate the root cause of the imminent problem and a decision which technique willbe most effective (see Chapter 12). However, there are two different classes of actionsthat can be performed upon failure prediction (see Figure 10.1):

• Downtime avoidance (or failure avoidance) aims at circumventing the occurrenceof the failure such that the system continues to operate without interruption

• Downtime minimization (minimization of the impact of failures) involves down-time, but the goal is to reduce downtime by preparation for true upcoming failures

227

228 10. Assessing the Effect on Dependability

Figure 10.1: Proactive fault management combines failure prediction and proactive actions.Actions either try to avoid or to minimize downtime

or by intentionally bringing the system down in order to shift it from unplanned toforced downtime.

Although several systems combining failure prediction with actions have been de-scribed in the literature, there is no unified name for this approach. Following Castelliet al. [49], the name proactive fault management (PFM) is used in this thesis.

Several examples for such systems employing PFM have been described in the lit-erature. For example, Castelli et al. [49] describe that a resource consumption trend es-timation technique has been implemented into IBM Director Management Software forxSeries servers that can restart parts of the system. In Cheng et al. [57] a framework calledapplication cluster service is described that facilitates failover (both preventive and aftera failure) and state recovery services. Li & Lan [164] propose FT-Pro, which is a fail-ure prediction-driven adaptive fault management system. It uses false positive error rateand false negative rate of a failure predictor together with cost and expected downtime tochoose among the options to migrate processes, to trigger checkpointing or to do nothing.

The behavior of PFM can be described in more detail as follows: If the failure predic-tor’s analysis suggests that the system is running well and hence no failure is anticipated inthe near future (which is a negative prediction), no action occurs. If a failure is predicted(a positive prediction), either downtime avoidance actions or downtime minimization ac-tions are performed, or both. However, it is obvious that any failure predictor can makewrong decisions: the predictor might forecast an upcoming failure even if this is not thecase, which is called a false positive, or the predictor might miss to predict a failure thatis imminent in the system, which is called a false negative (c.f., Table 8.1 on Page 153for an overview of all four cases that may occur). From this follows that in case of afalse positive prediction (FP) actions are performed unnecessarily while in case of a falsenegative prediction (FN), nothing is done about the failure that is imminent in the system.Table 10.1 summarizes these cases.

Prediction Downtime avoidance Downtime minimization

True positive Try to prevent failure Prepare repair (recovery)

False positive Unnecessary action Unnecessary preparation

True negative No action No action

False negative No action Standard (unprepared) repair (recovery)

Table 10.1: Actions performed after prediction. For a definition of true/false positives / nega-tives, see Table 8.1 on Page 153

10.1 Proactive Fault Management 229

Especially in event-based failure prediction, there are situations where a failure occursand no prediction has been performed at all, since there was no triggering event prior tothe failure. However, this case can be easily incorporated by treating it as a false negativeprediction.

All mechanisms that can benefit from the knowledge about an upcoming failure canbe used within PFM. It is not the focus of this thesis to provide a detailed analysis ofall kinds of actions falling into this category, and hence only some major concepts aredescribed in the following.

10.1.1 Downtime AvoidanceDowntime avoidance actions are triggered by a failure predictor in order to prevent theoccurrence of a failure that seems to be imminent in the system but has not yet occurred.Three categories of mechanisms can be identified:

• State clean-up tries to avoid failures by cleaning up resources. Examples includegarbage collection, clearance of queues, correction of corrupt data or elimination of“hung” processes.

• Preventive failover techniques perform a preventive switch to some spare hardwareor software unit. Several variants of this technique exist. One of which is failureprediction-driven load balancing accomplishing gradual “failover” from a failure-prone to failure-free component. For example, Chakravorty et al. [50] describe amultiprocessor environment that is able to migrate processes in case of an imminentfailure.

• Lowering the load is a common way to prevent failures. For example, web-serversreject connection requests in order not to become overloaded. Within proactive faultmanagement, the number of allowed connections is adaptive and would depend onthe risk of failure.

10.1.2 Downtime MinimizationRepairing the system after failure occurrence is the classical way of failure handling.Detection mechanisms such as coding checks, replication checks, timing checks or plau-sibility checks trigger the recovery. Within PFM, these actions still incur downtime, butits occurrence is either anticipated or even intended in order to reduce time-to-repair.

More specifically, there are two types of downtime minimization methods:

1. techniques that react to the occurrence of failures, and the goal is to reduce time-to-repair by preparation for the failure. This is called reactive downtime minimization

2. techniques that intentionally bring the system down in order to cause less downtimein comparison to downtime associated with unplanned failure occurrence. Thisclass of techniques is termed proactive downtime minimization


Figure 10.2: Improved time-to-repair for prediction-driven repair schemes. (a) sketches clas-sical recovery and (b) improved recovery in case of preparation for an upcomingfailure. “Checkpoint” denotes the last checkpoint before failure, “Failure” the timeof failure occurrence, “Reconfigured” the time when reconfiguration has finishedand “Up” the time when the system is up again. In (b) time-to-repair is improvedsince reconfiguration can start after prediction of an upcoming failure and theprediction-triggered checkpoint is closer to the occurrence of the failure, whichresults in less computation that needs to be recomputed after reconfiguration.

Reactive downtime minimization. The goal of such techniques can be summarizedthat the system shall be brought into a consistent fault-free state. If this state is a pre-vious one (a so-called checkpoint), the action applies a roll-backward scheme (see, e.g.,Elnozahy et al. [91] for a survey of roll-back recovery in message passing systems). Inthis case, all computation from the last checkpoint up to the time of failure occurrencehas to be recomputed. Typical examples are recovery from a checkpoint or the recoveryblock scheme introduced by Siewiorek & Swarz [241]. In case of a roll-forward scheme,the system is moved forward to a consistent state by either dropping or approximating thecomputations that have failed (see, e.g., Randell et al. [213]).

Both schemes may comprise reconfiguration such as switching to a hardware spare oranother version of a software program, changing network routing, etc. Reconfigurationtakes place before computations are redone or approximated.

In traditional fault-tolerant computing without PFM, checkpoints are saved indepen-dently of upcoming failures, e.g., periodically. When a failure occurs, first reconfigurationtakes place until the system is ready for recomputation / approximation and then all thecomputations from the last checkpoint up to the time of failure occurrence are redone.Time-to-repair (TTR) is determined by two factors: time needed for reconfiguration andthe time needed for recomputation or approximation of lost computations. In the caseof roll-backward strategies, recomputation time is determined by the length of the timeinterval between the checkpoint and the time of failure occurrence (see Figure 10.2-a).In some cases recomputation may take less time than originally but the implication stillholds. Note that not all types of repair actions exhibit both factors contributing to TTR.

A large variety of repair actions exist that can benefit from failure prediction. Inprinciple, coupling with a failure predictor can reduce both factors contributing to TTR(see Figure 10.2-b):

• Time needed for reconfiguration can be reduced since reconfiguration can be pre-pared for an upcoming failure. Think, for example, of a cold spare: Booting thespare machine can be started right after an upcoming failure has been predicted(and hence before failure occurrence) such that reconfiguration is almost finishedwhen the failure occurs.

10.2 Related Models 231

• Checkpoints may be saved upon failure prediction close to the failure, which re-duces the amount of computation that needs to be repeated. This minimizes timeconsumed by recomputation. On the other hand, it might not be wise to save acheckpoint at a time when a failure can be anticipated since the system state mightalready be corrupted. The question whether such scheme is applicable depends onfault isolation between the system that is going to observe the failure and the statethat is included in the checkpoint. For example, if the amount of free memory ismonitored for failure prediction but the checkpoint comprises database tables of aseparate database server, it might be sure to rely on the correctness of the databasecheckpoint. Additionally, an adaptive checkpointing scheme similar to the one de-scribed in Oliner & Sahoo [197] could be applied.

Leangsuksun et al. [157] describe that they have implemented predictive checkpointingfor a high-availability high performance Linux cluster.

Proactive downtime minimization. Parnas [198] reported on an effect that he calledsoftware aging, being a name for effects such as memory leaks, unreleased file locks,file descriptor leaking, or memory corruption. Based on these observations, Huang et al.introduced a concept that the authors termed rejuvenation. The idea of rejuvenation isto deal with problems related to software aging by restarting the system (or at least partsof it). By this approach, unplanned / unscheduled / unprepared downtime incurred bynon-anticipated failures is replaced by forced / scheduled / anticipated downtime. Theauthors have shown that —under certain assumptions— overall downtime and downtimecost can be reduced by this approach. In Candea et al. [43] the approach is extendedby introducing recovery-oriented computing (see, e.g., Brown & Patterson [40]), whererestarting is organized recursively until the problem is solved.

10.2 Related ModelsThe objective of this chapter is a theoretical assessment of proactive fault managementwith respect to system dependability, or more precisely steady-state system availability,reliability and hazard rate. As is common in reliability theory, a model expressing therelevant interrelations is used.

Proactive fault management is rooted in preventive maintenance, that has been a re-search issue for several decades (an overview can be found, e.g., in Gertsbakh [105]).More specifically, proactive fault management belongs to the category of condition-basedpreventive maintenance (c.f., e.g., Starr [250]). However, the majority of work has beenfocused on industrial production systems such as heavy assembly line machines and morerecently to computing hardware. With respect to software, preventive maintenance hasfocused more on long-term software product aging such as software versions and updatesrather than short-term execution aging. The only exception is software rejuvenation whichhas been investigated heavily (c.f., e.g., Kajko-Mattson [139]).

Starting from general preventive maintenance theory, Kumar & Westberg [150] com-pute reliability of condition-based preventive maintenance. However, their approach isbased on a graphical analysis of so-called total time on test plots of singleton observationvariables such as temperature, etc. rendering the approach not appropriate for applicationto automatic proactive fault management in software systems. An approach better suited


to software has been presented by Amari & McLaughlin [9]. They use a continuous-timeMarkov chain (CTMC) to model system deterioration, periodic inspection, preventivemaintenance and repair. However, one of the major disadvantages of their approach isthat they assume perfect periodic inspection, which does not reflect failure predictionreality, as has been shown along with the case study presented in Chapter 9.

A significant body of work has been published addressing software rejuvenation. Ini-tially, Huang et al. [126] have used a CTMC in order to compute steady-state availabilityand expected downtime cost. In order to overcome various limitations of the model, e.g.,that constant transition rates are not well-suited to model software aging, several varia-tions to the original model of Huang et al. have been published over the years, some ofwhich are briefly discussed here. Dohi et al. have extended the model to a semi-Markovprocess to deal more appropriately with the deterministic behavior of periodic restarting.Furthermore, they have slightly altered topology of the model since they assume that thereare cases where a repair does not result in a clean state and restart (rejuvenation) has to beperformed after repair. The authors have computed steady-state availability (Dohi et al.[80]) and cost (Dohi et al. [79]) using this model. Cassady et al. [47] propose a slightlydifferent model and use Weibull distributions to characterize state transitions. However,due to this choice, the model cannot be solved analytically and an approximate solutionfrom simulated data is presented.

Garg et al. [101] have used a three state discrete time Markov chain (DTMC) withtwo subordinated non-homogeneous CTMCs to model rejuvenation in transaction pro-cessing systems. One subordinated CTMC models queuing behavior of transaction pro-cessing and the second models preventive maintenance. The authors compute steady-stateavailability, probability of loosing a transaction, and an upper bound on response timefor periodic rejuvenation. They model a more complex scheme that starts rejuvenationwhen the processing queue is empty. The same three-state macro-model has been used inVaidyanathan & Trivedi [262], but here, time-to failure is estimated using a monitoring-based subordinated semi-Markov reward model. However, for model solution, the authorsapproximate time-to-failure with an increasing failure rate distribution.

Leangsuksun et al. [158] have presented a detailed stochastic reward net model of ahigh availability cluster system in order to model availability. The model differentiatesbetween servers, clients and network. Furthermore, it distinguishes permanent as wellas intermittent failures that are either covered (i.e., eliminated by reconfiguration) or un-covered (i.e., eliminated by rebooting the cluster). Again, the model is too complex tobe analyzed analytically and hence simulations are performed. An analytical solution forcomputing the optimal rejuvenation schedule is provided by Andrzejak & Silva [10] whouse deterministic function approximation techniques to characterize the relationship be-tween aging factors and work metrics. The optimal rejuvenation schedule can then befound by an analytical solution to an optimization problem.

The key property of PFM is that it operates upon failure predictions rather than apurely time-triggered execution of fault-tolerance mechanisms. One of the first papersto address this issue is Vaidyanathan et al. [261]. The authors propose several stochasticreward nets (SRN), one of which explicitly models prediction-based rejuvenation. How-ever, there are two limitations to this model: first, only one type of wrong predictionsis covered, and second, the model is tailored to rejuvenation —downtime avoidance orreactive downtime minimization are not included. Furthermore, due to the complexityof the model, no analytical solution for availability is presented. Focussing on service

10.3 The Availability Model 233

degradation, Bao et al. [21] propose a CTMC that includes the number of service requestsin the system plus the amount of leaked memory. An adaptive rejuvenation scheme isanalyzed that is based on estimated resource consumption. Later, the model has beencombined with the three-state macro model in order to compute availability (Bao et al.[22]). However, the model does also not investigate the effect of mispredictions.

Last but not least, the model presented in this dissertation is not the first attempt to as-sess the effects of proactive fault management. In Salfner & Malek [225], an approach hasbeen published that directly extends the well-known formula for steady-state availability:

A = MTTF

MTTF +MTTR. (10.1)

However, the approach proposed in Salfner & Malek [225] had three limitations:

1. It did not clearly distinguish between true and false positive and negative predic-tions. This flaw resulted in an inappropriate handling of prevented and inducedfailures.

2. Only steady-state availability could be estimated. Other dependability metrics suchas reliability and hazard rate could not be computed.

3. The model is implicit but not transparent to help better understand the behavior ofproactive fault management.

In summary, to the best of our knowledge, no work has been published that capturesboth downtime avoidance as well as reactive and proactive downtime minimization, andthat is incorporating all four cases of failure prediction: true and false positives as well asnegatives.

10.3 The Availability ModelAs is the case for many of the rejuvenation models mentioned before, the model developedhere is based on the CTMC originally published by Huang et al. [126]. First, the originalmodel is briefly presented and then the new model is introduced.

10.3.1 The Original Model for Software Rejuvenationby Huang et al.

As described by Parnas, software aging can be observed in long-running software. How-ever, software aging does not cause the software to crash immediately but increases therisk of failure. For example, if a memory leak is present, the amount of available memoryis continuously decreasing (in long-term behavior). Assuming that each service requestrequires some (stochastically distributed) amount of memory, the risk that some servicerequest fails due to insufficient free memory is increasing over time. However, if themaximum number of concurrent service requests and the maximum amount of memoryconsumption of each service request are limited, software aging does not affect serviceavailability as long as the amount of free memory is above some threshold. This obser-vation is one of the key concepts in the model for rejuvenation proposed by Huang et al.[126]: Some state exists, where a running system enters a failure probable state Sp (see


Figure 10.3). In the example, the system transits into this state when the amount of freememory drops below the described threshold. Rejuvenation is performed periodically inorder to clean up the system and to bring it back into the fault free state S0.

The occurrence of forced downtime (e.g., incurred by rejuvenation) is known whilefailures occur stochastically (unplanned downtime). The key notion of software rejuvena-tion is that both downtime and the associated downtime cost are less for forced downtimethan for unplanned. Therefore, the model has two different down-states: One for rejuve-nation (SR) and one for failures (SF ). Since the periodically triggered restarting processis different from repair after failure, two transition rates r1 and r3 are used.

Figure 10.3: The original CTMC model as used by Huang et al. [126] to compute availabilityof a system with rejuvenation. S0 denotes the state when everything is up andrunning, SP the failure probable state, SR the rejuvenation state and SF thefailed state with appropriate transition rates as used in the original paper

10.3.2 Availability Model for Proactive Fault ManagementIn order to develop an availability model for proactive fault management, three key dif-ferences are taken into account:

• In addition to rejuvenation, proactive fault management involves downtime avoid-ance techniques. In terms of the model, this means that there needs to be some wayto get from the failure probable state back to the S0 state without an intermediatedown state.

• Proactive fault management actions operate upon failure prediction rather than pe-riodically. However, predictions can be correct or false. Moreover, it makes adifference whether there really is a failure imminent in the system or not. Hence,the single failure probable state SP in Figure 10.3 needs to be split up into a morefine-grained analysis: According to the four cases of prediction, there is a statefor true positive predictions (STP ), false positive predictions (SFP ), true negativepredictions (STN ) and false negative predictions (SFN ).

• Besides rejuvenation, which is a proactive downtime minimization technique,proactive fault management also comprises reactive downtime minimization ac-tions. However, both types of actions can be assessed in terms of their effect ontime-to-repair. Hence, it is sufficient to keep up two down states: one for prepared /forced downtime (SR) and one for unprepared / unplanned downtime (SF ).

10.3 The Availability Model 235

The resulting CTMC is shown in Figure 10.4.

Figure 10.4: Availability CTMC for proactive fault management. State S0 is the fault-free state.States STP , SFP , STN and SFN are failure-probable states corresponding tothe four cases of failure prediction correctness. States 5 and 6 are “down” stateswhere SR accounts for forced downtime caused by scheduled restart or pre-pared repair, and SF accounts for the unplanned counterpart.

In order to better explain the model, consider the following scenario: Starting from theup-state S0 a failure prediction is performed at some point in time. If the predictor comesto the conclusion that a failure is imminent, the prediction is a positive and a failure warn-ing is raised. If this is true (something is really going wrong in the system) the predictionis a true positive and a transition into STP takes place. Due to the warning, some actionsare performed in order to either prevent the failure from occurring (downtime avoidance),or to prepare for some forced downtime (downtime minimization). Assuming first thatsome preventive actions are performed, let

PTP := P (failure | true positive prediction) (10.2)

denote the probability that the failure occurs despite of preventive actions. Hence,with probability PTP a transition into failure state SR takes place, and with probability(1 − PTP ) the failure can be avoided and the system returns to state S0. Due to the factthat a failure warning was raised (the prediction was a positive one), preparatory actionshave been performed and repair is quicker (on average), such that state S0 is entered withrate rR.

If the failure warning is wrong (in truth the system is doing well) the prediction isa false positive (state SFP ). In this case actions are performed unnecessarily. However,although no failure was imminent in the system, there is some risk that a failure is causedby the additional workload for failure prediction and subsequent actions. Hence, let

PFP := P (failure | false positive prediction) (10.3)

denote that an additional failure is induced. Since there was a failure warning, preparationfor an upcoming failure has been carried out and hence the system transits into state SR.


In case of a negative prediction (no failure warning is issued) no action is performed.If the judgment of the current situation to be non failure-prone is correct (there is nofailure imminent), the prediction is a true negative (state STN ). In this case, one wouldexpect that nothing happens since no failure is imminent. However, depending on thesystem, even failure prediction (without subsequent actions) may put additional load ontothe system which can lead to a failure although no failure was imminent at the time whenthe prediction started. Hence there is also some small probability of failure occurrence inthe case of a true negative prediction:

PTN := P (failure | true negative prediction) . (10.4)

Since no failure warning has been issued, the system is not prepared for the failure andhence a transition to state SF rather than SR, takes place. This implies that the transitionback to the fault-free state S0 occurs at rate rF , which takes longer (on average). Ifno additional failure is induced, the system returns to state S0 directly with probability(1− PTN).

If the predictor does not recognize that something goes wrong in the system and a fail-ure comes up, the prediction is a false negative (state SFN ). Since nothing is done aboutthe failure that comes up there is no transition back to the up-state and the model transitsto the failure state SF without any preparation. The reason why there is an intermediatestate SFN originates from the way transition rates are computed, as explained in the nextsection.

10.4 Computing the Rates of the ModelReliability modeling is typically performed to investigate new techniques for systems thatare under design1 in order to determine their potential effect on system parameters suchas availability. The model shown in Figure 10.4 comprises the following parameters:

• PTP , PFP , PTN denote the probability of failure occurrence given a true positive,false positive, or true negative prediction.

• rTP , rFP , rTN , and rFN denote the rate of true/false positive and negative predic-tions

• rA denotes the action rate, which is determined by the average time from start ofthe prediction to downtime or to return to the fault-free state.

• rR denotes repair rate for forced / prepared downtime

• rF denotes repair rate for unplanned downtime

However, some of these parameters are difficult to determine. Therefore, more intuitiveparameters are used. from which the rates of the CTMC model are computed. Usually,there are two groups of parameters:

1. fixed parameters that are estimated / measured from a given system or determinedby the application area

1As already mentioned in the case of this dissertation, it was not possible to try the methods on the com-mercial system

10.4 Computing the Rates of the Model 237

2. parameters that shall be investigated / optimized in order to assess their effect ontarget metrics.

In the case considered here, it is assumed that a system without proactive fault man-agement shall be extended by PFM and the effect of PFM with respect to availability,reliability and hazard rate shall be investigated. More specifically, it is assumed that fixedparameters comprise mean-time-to-failure (MTTF), mean-time-to-repair (MTTR), lead-time ∆tl, and prediction-period ∆tp. The second group of parameters (those that shall beinvestigated) includes parameters evaluating accuracy of failure prediction and parame-ters investigating the efficiency of actions. Table 10.2 summarizes the specific parametersthat are used in the following. Note that in contrast to the definition in Section 8.2.2, forreadability reasons the single letter “f” is used to denote false positive rate in this chapter.

Parameter Symbol Fixed InvestigatedMean time to failure (system w/o PFM) MTTF XMean time to repair (system w/o PFM) MTTR XLead-time ∆tl XPrediction-period ∆tp XPrecision p XRecall r XFalse positive rate f XFailure probability given TP prediction PTP XFailure probability given FP prediction PFP XFailure probability given TN prediction PTN XRepair time improvement k X

Table 10.2: Parameters used for modeling

In summary, it is intuitively clear that any proactive fault management techniqueshould strive to achieve the following parameter values in order to minimize downtime:

1. Failure prediction should be as accurate as possible. This translates into high preci-sion, high recall and low false positive rate.

2. Failure occurrence probabilities PTP , PFP and PTN should be as close to zero aspossible.

3. Time to repair for forced downtime / prepared repair should be as small as possiblein comparison to repair time for unplanned / accidental downtime.

10.4.1 The Parameters in DetailParameters can be divided into three groups (see Table 10.2):

1. Precision, recall and false positive rate specify failure prediction accuracy2.

2in a general sense, not as strict as in Definition 8.10 on Page 156


2. Failure probabilities PTP , PFP , PTN assess effectiveness of downtime avoidanceand the risk of additional failures that are induced by the additional workload ofprediction and actions.

3. Repair time improvement factor k determines effectiveness of downtime minimiza-tion.

Failure prediction accuracy. Figure 10.5 visualizes all four cases of failure predictioncorrectness including lead-time ∆tl and prediction-period ∆tp. The case that a failureoccurs without any failure prediction being performed3 is mapped to a missing failurewarning, which is a false negative prediction. Although contingency table, precision,recall and false positive rate have been defined in Chapter 8 (c.f., Table 8.1 on Page 153,Equations 8.3 and 8.4 on Page 155, and Equation 8.9 on Page 156), they are repeated herefor convenience. Also, notation is slightly changed in order to emphasize that the metricsare defined by numbers that have been counted during an experiment: For example, nTPdenotes the number of true positive predictions within one experiment with a total of npredictions.

Figure 10.5: A timeline showing failures (t) and all four types of predictions (P): true positive,false positive, false negative, and true negative. A failure is counted as predictedif it occurs within prediction-period of length ∆tp, which starts lead-time ∆tl afterbeginning of prediction

Table 10.3 shows the modified version of the contingency table. Using this notation,

True Failure True Non-failure SumPrediction: Failure nTP nFP nPOS

Prediction: No failure nFN nTN nNEG

Sum nF nNF n

Table 10.3: This contingency table is a simplified version of Table 8.1 on Page 153. It empha-sizes that the fields consist of the number of true positives (nTP ), false positives(nFP ), etc. predictions from an experiment with a total of n predictions.

3E.g., in error-based prediction, if no error occurs prior to a failure


precision, recall and false positive rate are defined as follows:

Precision p = nTPnTP + nFP

= nTPnPOS

(10.5)

Recall r = nTPnTP + nFN

= nTPnF

(10.6)

False positive rate f = nFPnFP + nTN

= nFPnNF

. (10.7)

Effectiveness of downtime avoidance and risk of induced failures. Preventive actionsare applied in order to avoid an imminent failure which affects time-to-failure (TTF).However, the opposite effect may also happen: due to additional load generated by failureprediction and actions, failures can be provoked that would not have occurred if no PFMhad been in place. In order to account for this effect, the model uses three probabilitiescorresponding to the types of failure prediction correctness:

PTP is the probability that a failure occurs in case of a correct warning. This is theprobability that the preventive action is not successful.

PFP is the probability of failure occurrence in case of a false positive warning. Since nofailure is imminent at the time of prediction, it corresponds to the probability that afailure is provoked by the extra load of failure prediction and subsequent actions.

PTN is the probability that an extra failure is provoked by prediction alone: since it isa true negative prediction, a failure occurs although no failure is imminent in thesystem and no actions are performed.

There is no need to define a probability for false negative predictions since nothing is doneabout the failure that will occur. The probability of failure occurrence is hence equal toone.

Effectiveness of downtime minimization. Effects of forced downtime / prepared repairon availability, reliability, and hazard rate are gauged by time-to-repair. More specifically,the effect is expressed by mean relative improvement, how much faster the system is up incase of forced downtime / prepared repair in comparison to MTTR after an unanticipatedfailure:

k = MTTR

MTTRp

, (10.8)

which is the ratio of MTTR without preparation to MTTR for the forced / prepared case.Obviously, one would expect that preparation for upcoming failures improves MTTR,thus k > 1, but the definition also allows k < 1 corresponding to a change for the worse.

10.4.2 Computing the Rates from ParametersCTMC models express temporal behavior using exponential distributions for timing in thestate before transitioning. Exponential distributions are determined by a single parameter:the transition rate. In this dissertation, only constant transition rates are considered whichare determined by the inverse of mean time.


It is the objective of this section to relate the model’s rates rTP , rFP , rTN , rFN , rA,rR, and rF to the more intuitive parameters listed in Table 10.2. Therefore, using theformulas developed in the following, the rates of CTMC can be computed from the moreintuitive parameters. The text follows a bottom-up approach such that basic relationshipsand equations are developed first which are subsequently used to derive equations for theCTMC rates given by Equations 10.30 to 10.33.

Starting point for computations is to determine the distribution of predictions amongtrue and false positives and negatives. This can be obtained using the prediction-relatedmetrics precision, recall and false positive rate. The distribution is expressed by the num-ber of, e.g., true positive predictions divided by the total number of predictions. By refer-ence to Table 10.3 and the definitions given by Equations 10.5 to 10.7, it can be derivedthat:

n = nF + nNF (10.9)

= nTPr

+ nFPf

(10.10)

= nTPr

+ nPOS − nTPf

(10.11)

= nTPr

+ nTPp f− nTP

f(10.12)

⇒ nTPn

= 11r

+ 1p f− 1

f

, (10.13)

which is an equation to compute the fraction of true positive predictions nTP (in compari-son to the total number of predictions n) from the prediction-related parameters precision,recall and false positive rate.

In order to compute the fraction of false positive, false negative, and true negativepredictions, it is necessary to determine:

nPOSn

= 1p

nTPn

(10.14)

nFn

= 1r

nTPn

, (10.15)

which leads tonFPn

= nPOSn− nTP

n(10.16)

nFNn

= nFn− nTP

n(10.17)

nTNn

= nNFn− nFP

n= 1− nF

n− nFP

n. (10.18)

Now, as the relative distribution among true and false positive and negative predic-tions is known, the corresponding transition rates rTP , rFP , rTN , rFN can be computed.The approach is to first compute the overall prediction rate rp, which determines tim-ing of the process once it has entered state S0. The mean time is determined by mean-time-to-prediction (MTTP), which is computed in two steps: first, mean-time-between-predictions (MTBP) is computed from the temporal parameters MTTF, MTTR, lead time


∆tl, and prediction period ∆tp. MTTF and MTTR are assumed to be known from a sys-tem without PFM —that is why they are fixed parameters. Then in a second step, MTTPis obtained from MTBP by subtracting time needed for prediction, etc.

The principal notion to compute MTBP is that there are x-times as many predictionsas true failures. By assuming the number of predictions to be n and the number of truefailures to be nF , x can be determined by expressing n in terms of nF , as is shown in thefollowing:

n = nF + nNF (10.19)

= nF + nFPf

(10.20)

= nF + nPOSf− nTP

f(10.21)

= nF + nTPp f− nTP

f(10.22)

= nF + nTP

(1p f− 1f

)(10.23)

= nF + nF r

(1p f− 1f

)(10.24)

= nF

(1 + r

1− pp f

). (10.25)

This means that there are 1 + r 1−pp f

as many predictions as failures. Hence it can beconcluded that for the mean times holds:

MTBP = 11 + r 1−p

p f

MTBF , (10.26)

where MTBF denotes “mean-time-between-failures” for a system without proactive faultmanagement, which can be computed from MTTF and MTTR by the standard formula

MTBF = MTTF +MTTR . (10.27)

As can be seen in Figure 10.6 MTTP can be computed from MTBP by subtracting lead-time ∆tl, and repair time R. Additionally, half of the prediction-period has to be sub-tracted since a failure may occur at any time within the prediction-period and hence onaverage failures occur at half of the prediction-period.4 However, for a system with PFM,repair time R is not equal to MTTR, since there are two different repair times: One forprepared repair (or forced downtime) and one for the unprepared / unplanned case. But,as we only consider mean values, mean repair time R is a combination of both cases andthe mixture is determined by the fraction of positive predictions in comparison to negative

4To be precise, a symmetric distribution centered around the middle of the prediction-period is assumed,i.e., a distribution with zero skewness and median equal to ∆tp/2


Figure 10.6: Time relations for prediction. Failures are indicated by t, predictions by P andrepair by R

predictions. More specifically, MTTP is given by:

MTTP = MTBP −∆tl −∆tp2

−(nTP + nFP

nMTTRp

)−(nTN + nFN

nMTTR

), (10.28)

where MTTRp is mean-time-to-repair for the case of forced / prepared downtime. Itis related to MTTR of unplanned downtime by repair time improvement factor k (c.f.,Equation 10.8). Finally, prediction rate rp is computed by:

rp =MTTF +MTTR

1 + r 1−pp f

−∆tl −∆tp2

−(nTP + nFP

k n+ nTN + nFN

n

)MTTR

]−1. (10.29)

As already mentioned, transition rates from S0 to STP , SFP , STN , and SFN , are de-termined by distributing rp among true / false positive / negative predictions:

rij = nijn∗ rp where i ∈ {T, F} and j ∈ {P, N} , (10.30)

where nijn

denotes the fractions given by Equations 10.13 to 10.18.The three remaining rates are action rate (rA), repair rate for forced downtime / pre-

pared repair (rR) and repair rate for an unprepared failure (rF ). rA is characterized byaverage time from the beginning of the prediction to the occurrence of downtime or itsprevention and can hence be computed from lead-time ∆tl and prediction-period ∆tp:

rA = 1∆tl + 1/2∆tp

. (10.31)

Repair rate rF is determined by the repair rate of a system without PFM, which is MTTR:

rF = 1MTTR

(10.32)

and repair rate for forced downtime / prepared repair is determined by MTTR and k:

rR = 1MTTRp

= k

MTTR= k rF . (10.33)

10.5 Computing Availability 243

10.5 Computing AvailabilitySteady-state availability is defined as the portion of uptime versus lifetime, which is equiv-alent to the portion of time, the system is up. In terms of our CTMC model, this quantitycan be determined by the equilibrium state distribution: It is the portion of probabilitymass in steady-state assigned to the non-failed states, which are S0, STP , SFP , STN , andSFN . In order to simplify representation, numbers 0 to 6 —as indicated in Figure 10.4—are used to identify the states of the CTMC.

The infinitesimal generator Matrix Q of the CTMC model is:

Q =

−rp rTP rFP rTN rFN 0 0(1−PTP ) rA −rA 0 0 0 PTP rA 0(1−PFP ) rA 0 −rA 0 0 PFP rA 0(1−PTN) rA 0 0 −rA 0 0 PTN rA

0 0 0 0 −rA 0 rA

rR 0 0 0 0 −rR 0rF 0 0 0 0 0 −rF

(10.34)

The equilibrium state distribution of a CTMC can be determined by solving the globalbalance equations. This is equivalent to a solution of the following linear equation system(see, e.g., Kulkarni [149]):

πQ = 0 (10.35)

s.t.6∑i=0

πi = 1 . (10.36)

The way to a solution is based on the following observation: If π is a solution toEquation 10.35 then each scaling of π is also a solution and hence, an infinite number ofsolutions exist, one of which fulfills Equation 10.36. Therefore, π6 is arbitrarily set to oneand the inhomogeneous equation system π ′Q′ = b is solved by Gaussian eliminationyielding a single solution π ′ whereQ′ is

Q′ =

−rp rTP rFP rTN rFN 0(1−PTP )rA −rA 0 0 0 PTP rA

(1−PFP )rA 0 −rA 0 0 PFP rA

(1−PTN)rA 0 0 −rA 0 00 0 0 0 −rA 0rR 0 0 0 0 −rR

(10.37)

andb =

(−rF 0 0 0 0 0 .

)(10.38)

The final solution π is obtained by scaling of π′i such that the sum equals one (c.f., Equa-tion 10.36):

πi = π′i(∑5i=0 π

′i

)+ 1

i ∈ {0 . . . 5}

π6 = 1(∑5i=0 π

′i

)+ 1

. (10.39)


By exploiting that rR = k rF , results can be further simplified. Equations for πi areprovided by Table 10.4.

πi Solution

π0rF k rA

rF k (rA + rp) + rA(PFP rFP + PTP rTP + k PTN rTN + k rFN)

π1rF k rTP


π2rF k rFP


π3rF k rTN


π4rF k rFN


π5rA (PFP rFP + PTP rTP )

rF k (rA + rp) + rA (PFP rFP + PTP rTP + k PTN rTN + k rFN)

π6k rA (PTN rTN + rFN)


Table 10.4: Solution to equation system defined by Equations 10.35 and 10.36. πi’s are equi-librium (steady-state) probabilities for the states in the availability model

Steady-state availability is determined by the portion of time the stochastic processstays in one of the up-states 0 to 4:

A =4∑i=0

πi = 1− π5 − π6

A = (rA + rp)k rFk rF (rA + rp) + rA(PFP rFP + PTP rTP + k PTN rTN + k rFN) , (10.40)

yielding a closed-form solution for steady-state availability of systems with PFM.

10.6 Computing ReliabilityReliability R(t) is defined as the probability of failure occurrence up to time t given thatthe system is fully operational at t = 0. In terms of CTMC modeling this is equivalent toa non-repairable system and computation of the first passage time into the down-state.

10.6.1 The Reliability ModelSince a non-repairable system is to be modeled, the distinction between two down-states(SR and SF ) is not required anymore. Furthermore, there’s no transition back to the up-

10.6 Computing Reliability 245

state. That is why a simpler topology can be used where the two failure states are mergedinto one absorbing state SF as shown in Figure 10.7.

Figure 10.7: CTMC model for reliability. Failure states 5 and 6 of Figure 10.4 have beenmerged into one absorbing state SF

The generator matrix for this model has the form:

Q =(T t00 0

), (10.41)

where T equals:

T =

−rp rTP rFP rTN rFN

(1−PTP ) rA −rA 0 0 0(1−PFP ) rA 0 −rA 0 0(1−PTN) rA 0 0 −rA 0

0 0 0 0 −rA

, (10.42)

and t0 equals:t0 = [ 0 PTP rA PFP rA PTN rA rA ]T . (10.43)

10.6.2 Reliability and Hazard RateThe distribution of the probability to first reach the down-state SF yields the cumulativedistribution of time-to-failure. In terms of CTMCs this quantity is called first-passage-time distribution F (t). Reliability R(t) and hazard rate h(t) can be computed from F (t)in the following way:

R(t) = 1− F (t) (10.44)

h(t) = f(t)1− F (t) , (10.45)

where f(t) denotes the corresponding probability density of F (t). F (t) and f(t) are thecumulative and density of a phase-type exponential distribution defined by T and t0 (see,


e.g., Kulkarni [149]):

F (t) = 1−α exp (tT ) e (10.46)f(t) = α exp (tT ) t0 , (10.47)

where e is a vector with all ones, exp (tT ) denotes the matrix exponential, and α is theinitial state probability distribution. It can be determined from the fact that reliability isdefined such that the system is fully operational at time t = 0. Hence:

α = [1 0 0 0 0] . (10.48)

Closed form expressions exist and can be computed using a symbolic computer algebratool. However, the solution would fill several pages5 and will hence not be provided here.

10.7 How to Estimate the Parameters from ExperimentsThe previous sections described how availability and reliability for systems with PFM canbe determined as a function of eleven parameters: MTTF , MTTR, ∆tl, ∆tp, p, r, f ,PTP , PFP , PTN and k (c.f., Table 10.2). MTTF and MTTR were assumed to be knownfrom a system without proactive fault management and can hence be estimated from anexisting system; and ∆tl and ∆tp have been assumed to be application specific. The re-maining seven parameters refer to proactive fault management. If reliable estimates forthese parameters are available from similar systems, the derived formulas can be applieddirectly. However if not, it seems impossible to derive them analytically from systemspecifications. Therefore, they must be estimated by experiment in an environment sim-ilar to the production environment. In this section, an estimation procedure is describedthat separates the mutual influence of failure prediction and reaction schemes in order todetermine all seven parameters.

10.7.1 Failure Prediction AccuracyDuring the first experiment, only those parameters characterizing failure prediction(namely p, r, and f ) are investigated with as little feedback onto the system as possi-ble. This can either be accomplished by performing predictions offline working withpreviously recorded logfiles (as it has been done in Chapter 9 for the telecommunicationsystem) or performing predictions on a separate machine. Side-effects such as additionalworkload are incorporated in later experiments. The outcome of the failure predictionexperiment yields a sequence of predictions (either positive or negative) and a sequenceof failures, by which predictions can be classified as true or false. Figure 10.8 shows allfour cases that can occur. The figure is almost the same as Figure 10.5, however, it assignssituation IDs ¬ to ¯, which are needed in later steps of the estimation procedure.

Starting from a timeline as in Figure 10.8, predictions can be assigned to be truepositive (situation ¬), false positive (situation ), false negative (situation ®) or truenegative (situation ¯). From this assignment p, r, and f can be computed according to

5The solution found by MapleTMcontains approximately 3000 terms.

10.7 How to Estimate the Parameters from Experiments 247

Figure 10.8: A timeline obtained from an experiment showing true failures (t) and predic-tion results. “!” indicate positive predictions (failure warnings) and “Ø” negativepredictions. Four situations can occur as indicated by ¬ to ¯

the definitions given in Equations 10.5 to 10.7

p = count(¬)count(¬) + count() (10.49)

r = count(¬)count(¬) + count(®) (10.50)

f = count()count() + count(¯) , (10.51)

where count(x) denotes the number of times, situation x has occurred in the experiment.Using p, r, and f , the expected ratio of true and false positive and negative predictionsnTPn

, nFPn

, nTNn

, and nFNn

can be computed using Equations 10.13 to 10.18, where in thefollowing N denotes the total number of predictions in the experiment’s trace. From nowon, nTP

n, nFP

n, nTN

n, and nFN

nare assumed to be known. They are used in later experiments.

10.7.2 Failure Probabilities PTP , PFP , and PTNThe goal of the second experiment is to assess the capability of downtime avoidancemechanisms. Since these mechanisms are run only in case of a positive prediction, and afailure can only be avoided if such is imminent, failure avoidance capability is gauged byprobability PTP . More precisely, PTP is the probability that a failure occurs even thoughan upcoming true failure has been predicted and downtime avoidance mechanisms havebeen performed. To estimate it, failure predictions and downtime avoidance mechanismshave to be performed together on a test system that mimics key features of the modeledsystem as close as possible. The outcome of the experiment is again a timeline as shown inFigure 10.8. However, the simple assignment of cases to true / false positives / negativesis not possible any more due to the following observations:

• situation ¬ (co-occurrence of failure warning and failure). This situation mightbe traced back to two scenarios: (a) the prediction was a true positiveand the triggered downtime avoidance action could not prevent theoccurrence of the failure, or (b) the prediction was a false positivethat successively lead to a failure induced by the prediction algorithmand / or the triggered action (e.g., due to additional load).

• situation (failure warning without occurrence of a failure). This situation might


be traced back to (a) a false positive prediction or (b) to a true positiveprediction with successful avoidance of the failure.

• situation ® (occurrence of failure only). This situation can be caused by (a) afalse negative prediction or (b) by a true negative prediction wherethe execution of the failure prediction algorithm (there are no actionsperformed upon negative predictions) caused the failure.

Table 10.5 provides a complete list of all these cases. As is indicated by the horizontal

Situation Comment Prediction Failure Prob. of occur.

¬ action not successful TP F nTPnPTP

¬ failure caused by PFM FP F nFPnPFP

failure prevented TP NF nTPn

(1− PTP ) false positive prediction FP NF nFP

n(1− PFP )

® failure caused by prediction TN F nTNnPTN

® false negative prediction FN F nFNn

¯ correctly no warning TN NF nTNn

(1− PTN)

Table 10.5: Mapping of cases to situations. Although only four different situations can beobserved in the experiment’s output (c.f., Figure 10.8), they can be traced back toseven different cases if downtime avoidance techniques are applied

line, there are two groups of non-overlapping parameters and situations: The first groupcomprises parameters PTP , PFP and situations ¬ and , while the second group com-prises parameter PTN and situations ® and ¯. Since handling of the second group iseasier, this group is discussed first.

Estimation of PTN . By combining rows referring to the same situation in the secondgroup of Table 10.5, the following linear equation system can be set up:

nTNn

PTN + nFNn

= count(®)N

(10.52)

nTNn

(1− PTN) = count(¯)N

(10.53)

Since there are two equations for one parameter (PTN ), it can be shown that a solutiononly exists, if

count(®)N

+ count(¯)N

= nTNn

+ nFNn

, (10.54)

expressing that the observed fraction of negative predictions (left-hand-side) is equal tothe expected fraction computed from precision, recall and false positive rate, which havebeen estimated before (right-hand-side of the equation). Assuming that this is the case,one of Equations 10.52 or 10.53 can be chosen. Since situation ¯ is expected to occur


more frequently, estimation error is expected to be lower and hence Equation 10.53 issolved for PTN :

PTN = 1−count(¯)

NnTNn

, (10.55)

which has an intuitive interpretation: if count(¯)N

equals nTNn

(expressing that all true nega-tive predictions appear in situation ¯), there are no true negative predictions that resultedin situation ®, which means that no failures are induced by the prediction algorithm,which in turn is consistent with PTN being equal to zero.

Estimation of PTP and PFP . The same procedure is applied to the first group in Ta-ble 10.5. The linear equation system is

nTPn

PTP + nFPn

PFP = count(¬)N

(10.56)

nTPn

(1− PTP ) + nFPn

(1− PFP ) = count()N

, (10.57)

which are two equations for two variables. However, Equations 10.56 and 10.57 are notindependent. Similar to the estimation of PTN , a solution only exists, if

count(¬)N

+ count()N

= nTPn

+ nFPn

. (10.58)

Since there’s only one (independent) equation containing two variables, an additional,independent equation involving PTP or PFP has be formed. The following options areavailable:

1. Since PFP denotes the risk of failure induced by execution of failure predictionalgorithms and subsequent (unnecessary) actions, PFP could be set a-priori yielding

PFP = const. (10.59)

2. In case of a true positive prediction, a failure may occur due to two reasons: Theaction was not able to avoid the failure, or the action would have avoided the failurethat had been predicted, but due to additional load, another failure occurs. However,the risk of inducing an additional failure is PFP (see above), and hence one couldassume that

PTP = P (failure cannot be avoided) + PFP . (10.60)

So the difficulty is to determine the probability that a failure cannot be avoided.

3. A fixed ratio of PTP : PFP could be assumed. For example, a ratio of 10:1 wouldexpress that the risk of failure occurrence after issuing of a failure warning is tentimes as high if the warning is correct as if it is a false warning. In general, thisleads to

PFP = c PTP , (10.61)

where c is a constant (ten in the example).


4. Either PTP or PFP can be determined in a separate experiment. This also results in

PFP = const. or PTP = const. (10.62)

Solutions one to three involve assumptions that are vague and difficult to support by mea-surements. In contrast, solution four is based on experimental evidence. It might seemthat it does not make a difference whether PTP or PFP is estimated, but this is not true: Inorder to estimate PFP or PTP it must be known when a prediction is a false or true pos-itive. In the false positive case, it must be proven that a failure would not have occurredif failure prediction and actions had not been in place, which seems infeasible. However,in the second case, it must be assured that a positive prediction is a true positive, whichmeans that a failure is really imminent. This can be achieved by fault injection (see, e.g.,Silva & Madeira [242] for an introduction), as is explained in the following.

Once again, PTP is the probability of failure occurrence given a true positive predic-tion. Applying a maximum likelihood estimator yields:

PTP = P (F |TP ) = count(F ∧ TP )count(TP )

= count(F ∧ TP )count(F ∧ TP ) + count(NF ∧ TP )

, (10.63)

where count(F ∧TP ) denotes the number of true positive predictions where (despite allpreventive actions) a failure has occurred, and count(NF ∧ TP ) denotes the number ofcases where a failure warning is raised correctly that is not followed by a failure. Faultinjection is applied in order to know when a failure really is imminent in the system andhence any positive prediction (failure warning) occurring within some time interval afterfault injection is a true positive. The case that a true positive prediction is followed bya failure (F ∧ TP ) can be identified directly in the log of a fault-injection experiment(c.f., situation ° in Figure 10.9). Identification of the case that no failure occurs after a

Figure 10.9: Identifying true positive predictions by fault injection. °: If a failure (t) occurswithin a given time-interval after fault injection and the failure is preceded by afailure warning (exclamation mark), the situation is assumed to be a true positiveprediction where the failure could not be prevented. ±: If no failure but a failurewarning are observed after fault injection, this corresponds either to a false pos-itive prediction if fault injection was not successful or to a true positive predictionwhere the failure has been prevented

true positive prediction (NF ∧ TP ) is more complicated. The reason for this is that theinjection of a fault does not always lead to a failure. Hence situation ± in Figure 10.9can either be a true positive where the failure has been prevented (this is the case neededfor Equation 10.63) or a false positive prediction in the case that fault injection did notsucceed. However, these two cases can be distinguished by the relative frequencies oftrue positive and false positive predictions, which is determined by precision. But sincea fault injector can in some cases change system behavior significantly, precision has


to be estimated separately for the fault injection experiments following the same offlineprocedure as described in the previous section (p′ is used to indicate precision in thiscase). This leads to the following formula for maximum likelihood estimation of PTP :

PTP = count(°)count(°) + p′ count(±) . (10.64)

It should be noted that fault injection is a difficult issue and care should be taken that abroad range of faults are injected such that failures of different types occur. If downtimeavoidance techniques are only able to compensate for upcoming failures of certain classes,PTP equals one for failure types that are not taken care of. If the distribution of failuretypes is known, the estimate given in Equation 10.64 can be improved.

By substituting the solution for PTP (Equation 10.64) either into Equation 10.56or 10.57, PFP can be computed. Using the first equation yields

PFP =count(¬)

N− nTP

nPTP

nFPn

. (10.65)

Measuring deviation. Since all experiments are finite samples, and since, if actions areperformed, failure prediction accuracy might deviate slightly from the values determinedby offline estimation, exact equalities in Equation 10.54 and 10.58 will be observed ratherrarely. If deviation is sufficiently small, the equations can be used nonetheless. If not,experiments have to be repeated with an increased sample size or with more similar envi-ronments and conditions. The amount of deviation can be determined by Equations 10.54and 10.58. It can be observed that the deviation is symmetric: if the observed fractionof negative predictions is larger than expected (left-hand side > right-hand side in Equa-tion 10.54), the observed fraction of positive predictions is smaller than expected (left-hand side< right-hand side in Equation 10.58) and vice versa. Therefore, one of both canbe used to determine deviation from expectations. Since there usually are more negativethan positive predictions, the estimate is more reliable if negative predictions are used anddeviation is defined as follows (c.f., Equation 10.54):

dev =∣∣∣∣∣count(®)

N+ count(¯)

N− nTN

n− nFN

n

∣∣∣∣∣ . (10.66)

10.7.3 Repair Time Improvement kIn order to estimate the repair time improvement factor k, an experimental trace such asFig. 10.8 that additionally includes time-to-repair is needed. As k is the ratio of MTTR forunplanned /unprepared downtime and MTTR for forced /prepared downtime (c.f., Equa-tion 10.8), mean values for both cases need to be computed. The distinction betweenboth types of downtime is based on failure prediction: in case of a failure warning (situ-ation ¬ in Figure 10.8) time to repair contributes to forced /prepared downtime, in caseof no failure warning (situation ® in Figure 10.8), it contributes to unplanned /unpre-pared downtime. Comparing the value of MTTR for the unpredicted case to the fixedvalue known from a system without PFM yields a further indication how representativethe estimate is.


10.7.4 Summary of the Estimation ProcedureSince the estimation procedure is quite complex involving several experiments, a briefsummary of the procedure is provided in Figure 10.10.

10.8 A Case Study and an ExampleIn his diploma thesis [254], Olaf Tauber has set up an experimental environment in orderto explore the effects of proactive fault management onto a real system. The case studyhas been performed by extending the .NET Pet Shop application6 from Microsoft. Thissection summarizes the work and highlights main results. Since results have not beenconvincing, a more advanced example is also presented.

10.8.1 Experiment Description.NET is a runtime environment developed by Microsoft that is able to execute softwarecomponents written in various programming languages. Furthermore, it provides ready-to-use functionality to handle a multitude of tasks ranging from multi-threading to graph-ical user interfaces. “Pet Shop” is a small, open source web-shop demo application thathas been built in order to demonstrate superiority to the Java-based “PetStore” demo ap-plication developed by Sun Microsystems.

Running the Pet Shop application requires at least two additional components: a web-server that handles http requests from web-browsers (clients) and a database to store thedata in. In order to create an experimental environment for testing proactive fault man-agement techniques, several modules had to be added to the system (see Figure 10.11):

• Stressors. In order to simulate a real scenario, workload must be put onto the sys-tem. An existing load generator called JMeter had been adapted to simulate a va-riety of actions associated with shopping (e.g., logging in, browsing the catalog,viewing and changing the shopping cart, payment, etc.). Activity patterns havebeen executed randomly obeying several boundary conditions such as that usershave to log in prior to payment, etc. Furthermore, stressors have been replicatedsimulating a total of 70 users shopping concurrently. The second important partof stressors is response analysis: Each response has been analyzed with respect toresponse times and correctness of the returned web-page. For performance reasons,relevant data has only be stored during runtime and has been analyzed offline aftereach test run.

• Monitoring. Since proactive fault management is about acting upon an analysis ofthe current state, runtime monitoring is necessary. In this case, a .NET componenthas been used to report system-wide Windows performance counters such as thenumber of active database transactions, size of the swap file, etc.

• Failure Prediction. Monitoring values have been transmitted over a network socketto a failure predictor. It must be pointed out that not the failure prediction algorithmproposed in this dissertation has been used. Instead, Olaf Tauber has developed a

6See http://msdn2.microsoft.com/en-us/library/ms978487.aspx

10.8 A Case Study and an Example 253

1. Experiment 1: without feedback onto the system: Either write logfiles or exe-cute predictions on separate computer.

(a) Classify the resulting data into situations ¬ to ¯ (c.f., Figure 10.8).

(b) Compute precision p, recall r, and false positive rate f using Equa-tions 10.49, 10.50, and 10.51.

(c) Using p, r, and f , compute expected ratios of true and false positives andnegatives nTP

n, nFP

n, nTN

n, and nFN

nusing Equations 10.13 to 10.18.

2. Experiment 2: with failure prediction and actions similar to a production sys-tem.

(a) Classify the resulting data into situations ¬ to ¯ (c.f., Figure 10.8).

(b) Determine the relative amount of negative predictions, which is count(®)N

+count(¯)

Nwhere N is the total number of predictions. Compute deviation

dev by Equation 10.66 using nTNn

, and nFNn

from Experiment 1.

(c) If deviation is significant,a experiments 1 and 2 have to be repeated eitherwith more samples to reduce sampling effects or such that computingenvironments and conditions are more similar.

(d) Estimate PTN using Equation 10.55.

(e) From repair times occurring in the experiment, estimate MTTR andMTTRp and compute repair time improvement factor k as described inSection 10.7.3.

(f) Compute overall prediction rate rp using Equation 10.29

(g) Using rp, compute prediction rates rTP , rTP , rTP , rTP from Equa-tion 10.30 using values nTP

n, nFP

n, nTN

n, and nFN

nfrom Experiment 1-c,

above.

(h) compute rates rA, rF , and rR using Equations 10.31 to 10.33.

3. Experiment 3: with fault injection but without feedback onto the system.

(a) Identify occurrences of situations ¬ and for fault injection experiment(c.f., Figure 10.8).

(b) Estimate p′ using Equation 10.49

4. Experiment 4: with fault injection, prediction and actions in place. By analyz-ing situations ° and ± (c.f., Figure 10.9), estimate PTP using Equation 10.64.

5. Estimate PFP by Equation 10.65 using data of Experiment 2.

athe threshold is application specific and cannot be provided, here

Figure 10.10: Summary of the procedure to estimate model parameters


Figure 10.11: Overview of the case study

simple prediction algorithm that is based on weighted events generated from thresh-old violations. The reason for this is that implementation of HSMM-based failureprediction has not been finished at the time Olaf Tauber carried out the experiments.

• Action. If a failure is predicted, some action is triggered. One downtime avoidanceand one downtime minimization technique have been implemented:

– Load lowering has been chosen for downtime avoidance. More specifically,lowering of the load was achieved by displaying a web page stating that theserver is temporarily overloaded and clients should retry in a few seconds.

– A two-level hierarchical reboot strategy was used for downtime minimization.The reboot strategy was able to either reboot the application layer in the .NETruntime or to reboot the entire system.

• Fault Injection. One of the most effective fault injection techniques is to limit avail-able ressources. Olaf Tauber has opted for allocating memory such that the rest ofthe system (including Pet Shop application, webserver and the database) have tocope with a reduced amount of free memory. Specifically, fault injection has beenimplemented by a multi-threaded process controlled7 from outside the system.

10.8.2 ResultsAt the time when Olaf Tauber carried out his experiments, the model proposed in thisChapter had not been developed, and hence he had used the formulas and estimationtechnique proposed in Salfner & Malek [225]. Fortunately, the supplemental DVD tothe diploma thesis contained the complete recordings collected during experiments andthe data could be analyzed with the estimation procedure described in Section 10.7. Incontrast to this procedure, the one that Olaf Tauber has applied consisted of only twophases, but since he has applied fault injection in his experiments, data could be split

7this means specification of start, duration and amount of memory allocation


further in time intervals with and without failure prediction resulting in data for fourexperiments. In order to clearly separate the parts, some time period after the end of eachfault injection has been removed from considerations.

Two proactive fault management techniques have been investigated by Olaf Tauber:employing downtime minimization by restart and employing downtime avoidance by pre-senting a static page saying “server is busy”. Since the only type of failures observed weresingleton runtime failures, each only affecting a few requests, the restarting approach wasnot at all successful: Even in the case of application-level restarting, eleven times as manyservice requests got lost during restart than by the failure itself. For this reason, only thedowntime avoidance technique has been analyzed in the following.

Analysis has been performed with a lead-time ∆tl of 60s, and prediction-period ∆tp offive minutes. Table 10.6 shows parameter values for the resulting model. Unfortunately,the limited amount of data is not sufficient to yield a statistically reliable assessment ofthe parameters. Hence, the results need to be interpreted with care. Deviation, as definedby Equation 10.66 has been equal to 0.0164.

Fixed Value [s] Estimated Value Resulting Value [1s]

parameters parameters ratesMTTF 25711 p 0.167 rTP 1.178169e-05MTTR 2.00 r 0.25 rFP 5.876737e-05

∆tl 60 f 0.0617284 rTN 0.000893264∆tp 300 PTP 0.5 rFN 3.534508e-05

PFP 0.1463768 rA 0.004761905PTN 0.04366895 rF 0.5k 1.5625 rR 0.78125

Table 10.6: Resulting values for model parameters as estimated from data of the case study.Fixed parameters refer to the parameters not depending on PFM. Estimated pa-rameters are those that are estimated from experiments as described in Sec-tion 10.7. The rightmost column lists the resulting transition rates computed fromestimated parameters.

It might look surprising, that k is not equal to one since showing a “server is busy”page aims at downtime avoidance rather than downtime minimization. The explanationfor this behavior is that MTTR as well as MTTRp are determined by the first successfulrequest after a failed one. If only a static page is displayed, the first successful responsecan be delivered earlier,8 and hence MTTR is reduced.

Using the estimated values for model parameters, steady-state availability, reliabilityand hazard rate can be computed and plotted, respectively. In particular, steady-state avail-ability of the system without proactive fault management was equal to A = 0.9999222and of the system with PFM APFM = 0.9998618. This is a dramatic decrease! Moreprecisely

1− APFM1− A ≈ 1.78 , (10.67)

8To be precise, after 1.28 seconds rather than 2.00 seconds resulting in k = 1.5625


0 10000 20000 30000 40000 50000

0.0

0.2

0.4

0.6

0.8

1.0

time [s]

R(t

)

0 10000 20000 30000 40000 50000

0.0

0.2

0.4

0.6

0.8

1.0

0 10000 20000 30000 40000 50000

0.0

0.2

0.4

0.6

0.8

1.0

w/o PFMwith PFM

(a) Reliability

0 100 200 300 400 500

0.98

00.

985

0.99

00.

995

1.00

0

time [s]

R(t

)

0 100 200 300 400 500

0.98

00.

985

0.99

00.

995

1.00

0

0 100 200 300 400 500

0.98

00.

985

0.99

00.

995

1.00

0

w/o PFMwith PFM

(b) Blow-up of the first 500s of (a)

Figure 10.12: Reliability for the case study. The blow-up (b) of the first 500 seconds showsthe phase-type character of the reliability model

which indicates that unavailability is approximately doubled. Regarding reliability, asimilar picture is observed: In Figure 10.12-a reliability of the system with and withoutproactive fault management are plotted. It can be observed that reliability of the PetShopsystem without PFM shows better reliability than the altered PetShop system. A morefine-grained analysis of the first few hundred seconds reveals that reliability of the casewith PFM is slightly higher within the first 300 seconds (see Figure 10.12-b). However,this results most likely from the simple model used to compute reliability of the systemwithout PFM, which employs an exponential distribution of time-to-failure:

R(t) = 1− F (t) = 1−(1− e− 1

MTTFt). (10.68)

Nevertheless, the fine-grained analysis reveals the phase-type character of reliability as aconsequence of the modeling approach.

Hazard rates are shown in Figure 10.13. From the usage of a single exponential dis-tribution for the system without PFM (c.f., Equation 10.68) results a constant hazard rate:

h(t) = λ e−λ t

1− (1− e−λ t) = λ = 1MTTF

. (10.69)

Regarding hazard rate of the system with PFM, the characteristic that the hazard rateis zero for t = 0 results from the fact that there is no direct transition from the initialup-state to a failure. It can also be observed from Figure 10.13 that for t → ∞ hazardrate approaches a constant, which results from the CTMC settling into steady-state. Ascould be expected from worse steady-state availability and reliability, the constant valueis higher than for the case without PFM.

Looking at Table 10.6, the bad performance of the proactive fault management canbe traced back both to low values for precision and recall and the inefficiency of thedowntime avoidance technique:

• Low precision and recall express that the used simplistic threshold-based failureprediction method is not able to achieve sufficiently accurate failure prediction: A


0 200 400 600 800 10000e

+00

2e−

054e

−05

6e−

058e

−05

time [s]

h(t)

0 200 400 600 800 10000e

+00

2e−

054e

−05

6e−

058e

−05

with PFMw/o PFM

Figure 10.13: Hazard rate for the case study

precision of 0.167 implies that about 83% of all failure warnings are false. Or-thogonal to that, only 25% of failures are caught by the prediction algorithm andthree fourth are missed. As a side-remark, these values are a good example to showthat accuracy —as defined by Equation 8.10 on Page 156— is not an appropriatemetric to evaluate failure prediction: accuracy equals 90.59%! The explanation forthis discrepancy is that most of the predictions are true negatives, as can be seenfrom Table 10.7 listing the relative distribution among predictions as obtained fromEquations 10.13 to 10.18.

Type of prediction Relative amountTrue positives 1.18%False positives 5.88%True negatives 89.40%False negatives 3.54%

Table 10.7: Relative amount of the four types of prediction

• Poor ability to prevent failures: the probability that a failure occurs even if it ispredicted is PTP = 0.5. The number is that even since only a total of 36 predictionscontaining five failures are available from the data. The value of PTP to be onehalf indicates that each second true positive prediction, a failure occurs. Althoughdowntime is smaller for predicted outages (k > 1) it cannot compensate for the factthat there are 1 + r 1−p

p f= 21.25 as many predictions as occurrences of failures in

the Petshop without proactive fault management. Even that would be no problem ifmost of the predictions were true negatives and PTN was sufficiently small, whichis also not the case in this example.

In summary, the experiment has shown that the application of proactive fault managementcan make a system worse. However, the applied failure prediction method was too simple


Parameter Valuep 0.70r 0.62f 0.016PTP 0.25PFP 0.1PTN 0.001k 2

Table 10.8: Parameters assumed for the sophisticated example

and downtime avoidance was far from being effective. The next section will demonstratethe effects in a more sophisticated setting.

10.8.3 An Advanced ExampleIn order to show that proactive fault management can indeed improve steady-state sys-tem availability, calculations have been carried out assuming parameter values from abetter failure predictor and more effective actions. More specifically, the values that havebeen observed for HSMM-based failure prediction for the telecommunication system (c.f.,Chapter 9) have been used, which are precision equal to 0.70, recall equal to 0.62, and afalse positive rate of 0.016. Also with respect to effectiveness of actions and risk inducedfailures, slightly better values have been assumed. Exact values for PTP , PFP , PTN , andk are listed in Table 10.8. Values for MTTF and MTTR are the same as in the casestudy by Olaf Tauber.

Using these values, a steady-state availability has been computed showing a value ofAPFM = 0.999962. Availability of a system without PFM is the same as in the previousexperiment. This results in cutting down unavailability approximately by half:

1− APFM1− A ≈ 0.488 . (10.70)

Reliability and hazard rate are also improved, as can be seen from Figures 10.14and 10.15. This time, the constant limiting hazard rate is below the hazard rate of asystem without proactive fault management.

10.9 SummaryIn this chapter, a model has been introduced in order to assess the effect of proactive faultmanagement, which denotes the approach to combine proactive techniques with a failurepredictor: each time, an imminent failure is predicted, actions are triggered that eithertry to avoid or to minimize the downtime incurred by failure occurrence. Examples havebeen given for both types of actions.

The model presented is based on the well-known continuous-time Markov chainmodel used by Huang et al. [126] to model software rejuvenation, which is a specialcase of downtime minimization by periodic restarting. The model replaces the failure-probable state of the original rejuvenation CTMC, which is one its major drawbacks, by

10.9 Summary 259

0 10000 20000 30000 40000 50000

0.0

0.2

0.4

0.6

0.8

1.0

time [s]

R(t

)

0 10000 20000 30000 40000 50000

0.0

0.2

0.4

0.6

0.8

1.0

0 10000 20000 30000 40000 50000

0.0

0.2

0.4

0.6

0.8

1.0

w/o PFMwith PFM

(a) Reliability

0 100 200 300 400 500

0.98

00.

985

0.99

00.

995

1.00

0

time [s]

R(t

)

0 100 200 300 400 500

0.98

00.

985

0.99

00.

995

1.00

0

0 100 200 300 400 500

0.98

00.

985

0.99

00.

995

1.00

0

w/o PFMwith PFM

(b) Magnification of the first 500s of (a)

Figure 10.14: Reliability for the sophisticated example. Similar to Figure 10.12, the first 500seconds are magnified showing the phase-type character of the underlying dis-tribution.

0 200 400 600 800 1000

0e+

002e

−05

4e−

056e

−05

8e−

05

time [s]

h(t)

0 200 400 600 800 1000

0e+

002e

−05

4e−

056e

−05

8e−

05

with PFMw/o PFM

Figure 10.15: Hazard rate for the more sophisticated example.

four states representing correctness of failure predictions. The model is based on elevenparameters, of which four are determined by the boundary conditions of the system andthe remaining seven parameters characterize efficiency of proactive fault management:

• precision, recall and false positive rate are used for assessment of failure predictionaccuracy

• probability of failure occurrence in case of true positive, false positive or true neg-ative predictions are used to assess success of downtime avoidance techniques aswell as to capture probability of failures that are induced by failure prediction andactions themselves.

• a repair time improvement factor accounts for the effect of improved repair times


in case of forced versus unplanned downtime.

Closed-form solutions for steady state availability, reliability, and hazard rate havebeen developed, and a procedure, how these seven parameters can be estimated fromexperimental data has been described.

Finally, a case study has been presented, where the Microsoft .NET demo web-shopcalled “Pet Shop” has been extended in order to facilitate testing of simple proactive faultmanagement techniques. The case study is based on data gathered by Olaf Tauber in thecourse of his diploma thesis, which has primarily been supervised by the author. However,neither the failure prediction algorithm —which is not the HSMM-based algorithm de-scribed in this thesis— nor the applied downtime minimization and avoidance techniqueshave been convincing such that availability, reliability and hazard-rate get worse if thetechniques are applied. For this reason, a second, more advanced example has been pre-sented using values for precision, recall, and false positive rate that have been achieved byHSMM-based failure prediction for the telecommunication system case study. Regardingefficiency of methods, slightly better values than the ones estimated from Olaf Tauber’sexperiments have been used. In this setting reliability was significantly improved, andunavailability was cut down by half. However, it should be noted that HSMM-based pre-diction, if applied to the Pet Shop system, would not have reached as good results as forthe telecommunication case study. The reason for this is that no fine-grained fault detec-tion is built into the Pet Shop. Therefore, only very few indicative symptomatic errors arereported prior to a failure.

One of the major limitations of the model is that it operates only on mean times, whichis a direct consequence of using continuous-time Markov chains. Other models such asstochastic activity networks (SAN) can model more details. On the other hand, findingclosed-form solutions is rather difficult for these models.

A further limitation of the model presented here is that diagnosis and scheduling ofactions (see Chapter 12) are not explicitly modeled: If a PFM system comprises severaldifferent actions, a decision is necessary about which action to trigger in a given situa-tion. This decision can as well be correct or wrong. Although decision accuracy of thedispatcher is inherently contained in probabilities PTP , PFP , PTN , and k,9 a more detailedmodeling would be desirable. On the other hand, introduction of even more states and pa-rameters makes the model more difficult to understand and results in more parameters thatneed to be estimated from experimental data.

Contributions of this chapter. Main contribution of this chapter is the proposal of aCTMC model to assess the effect of proactive fault management on availability, relia-bility, and hazard rate. A brief survey of existing models that try to evaluate the effectof proactive fault management has revealed that —to the best of our knowledge— theproposed CTMC model is the first to

• clearly distinguish between all four types of failure predictions: true positives, truenegatives, false positives and false negatives,

• handle both downtime minimization as well as downtime avoidance techniques,

9Think, for example, of the case that the dispatcher chooses to prepare a repair action instead of triggeringa preventive action then the probability of failure occurrence is increased while k is improved

10.9 Summary 261

• incorporate the case that failure prediction plus triggered actions can induce fail-ures, i.e., due to additional load caused by prediction or actions, a failure occursthat would not have occurred if no proactive fault management was in place.

From a practical point of view, the three main contributions of the model are:

• It can help to decide if application of proactive fault management is useful for agiven system. In order to do so, MTTF and MTTR must be determined from thecurrent version of the system. The remaining parameters must be estimated fromexperiments in similar environments, such as done in Chapter 9 for assessment offailure prediction effectiveness.

• When analyzing a system that already employs proactive fault management tech-niques, partial derivations of the availability / reliability formulas may give an in-dication which of the seven parameters would be most effective to increase avail-ability. For example, if a system’s engineer had $100,000 to spend on improvedproactive fault management, a derivation of the formulas derived in this chapter in-dicates whether it is, e.g., more effective to spend the money on improved failureprediction methods or on a reduction of MTTR for forced / prepared outages.

• It can be used to determine the optimal trade-off between precision, recall, and falsepositive rate. In order to do so, all parameters except for precision, recall, and falsepositive rate must be assumed to be fixed. Then by Equation 10.40, availabilitybecomes a function of these three parameters. Hence, an availability value can beassigned to each point of the trajectory through the space of precision/recall/falsepositive rate and the optimal combination can be chosen.

Relation to other chapters. This chapter is the first of the fourth phase of the engineer-ing cycle —and since the main focus of this thesis is on failure prediction, it is also thelast. The remaining chapters will conclude the thesis and will provide an outlook ontofurther research.

Chapter 11

Summary and Conclusions

The initial spark that has lit up the fire providing energy to write this dissertation has beenthe challenge to predict the failures of a commercial telecommunication platform fromerrors that occur. In this chapter the essentials are summarized, major contributions arepointed out, and remaining issues are discussed.

Beginning with the aim to improve a given system, a typical engineering approach canbe divided into four phases forming the “engineering cycle” (c.f., Figure 1.2 on Page 6).The thesis has been structured along this concept and so is its summary.

11.1 Phase I: Problem Statement, Key Properties andRelated Work

The ultimate goal addressed in this dissertation is to improve computer system depend-ability by means of a proactive management of faults. However, the thesis has been fo-cused on the prerequisite first step, which is online failure prediction where the objectiveis to predict the occurrence of failures in the near future based on the current state of thesystem as it is observed by runtime monitoring. As a case study, failures of a commercialtelecommunication platform of which industrial data has been available were to be pre-dicted. A detailed analysis of the surrounding conditions of the case study has revealedseveral key properties, for which the proposed approach to failure prediction has beendesigned:

• The size of the system is so immense that detailed knowledge of complex relation-ships is rare —it has at least not been overt to us. However, with ever-growingcomplexity of systems and increasing use of commercial-off-the-shelf components,this assumption might also be valid for the companies themselves. For this reason,a black-box approach has been applied. However, as is discussed in the outlook,the model can be augmented by analytical knowledge, which would turn it into agray-box approach.

• A huge amount of data is available. Therefore, a data-driven approach from ma-chine learning has been chosen that aims at filtering out relevant interrelationsfrom data rather than building on an analytical approach where interrelations areextracted manually. This approach has a major consequence: Only those types offailures can be predicted that have occurred (frequently enough) in the training data.

263

264 11. Summary and Conclusions

Events that are really rare are not the focus, here. However, as Levy & Chillarege[162] have pointed out, failure types follow Zipf’s law and targetting at frequentfailures first, results in the biggest impact. Regarding the telecommunication sys-tem case study, the goal was to predict performance failures a few minutes ahead.

• Faults can become visible at four stages: by auditing, which means actively search-ing for faults, by symptom monitoring, by error detection or by observing a failure.In the case of this thesis, errors have been used as input data. There are reasons infavor and against this choice. The most important are:

– Errors occur late in the process from faults to failures. In order to be able topredict failures with reasonable lead time, fine-grained fault detection must bein place that is able to capture misbehavior early enough.

+ Due to the property of occurring late, input occurs only when something isgoing wrong in the system. This alleviates the problem of class skewness: theratio of failure and non-failure data is more even than in symptom monitoring-based approaches.

+ Since error reporting is inherently built into the majority of systems, error-based prediction techniques are expected to have less effects onto the pro-duction system than monitoring-based approaches: As the case study in thediploma thesis of Olaf Tauber has shown, system response times are dramati-cally influenced by the amount as well as the frequency of collected monitor-ing data.

+ Quite a lot of symptom monitoring-based approaches have been publishedwhile the area of error event-based methods is not well explored. From ascientific point of view, it has been alluring to explore this white spot in pre-diction methods.

Experiments conducted in this thesis have been performed using previouslyrecorded logfiles. It should be noted that for real application to a running system adirect interface to error event reporting should be used.

• Component-based software architectures are common in large systems. The clearstructure of encapsulated entities advocates an approach that builds on interrela-tionships and dependencies among components. From this follows that the orderof error events is relevant. Moreover, an analysis has revealed that not only theorder but the temporal delay between errors is even more decisive. Since errorsoccur non-equidistantly and the type of each error belongs to a finite countable set,temporal sequences are the input data for the failure prediction algorithm.

• Fault-tolerant systems can cope with many erroneous situations but fail under someconditions. The principal assumption in this thesis is, that the distinction betweenerroneous situations leading to failure and erroneous situations that do not lead tofailure can be distinguished by identifying patterns in temporal error sequences.For this reason, a pattern recognition approach has been applied.

• It is a non-distributed system. Although the approach might in principle be appli-cable to distributed systems as well, such aspects have not been considered in thisthesis.

11.2 Phase II: Data Preprocessing, the Model, and Classification 265

The resulting approach is divided into two major steps: first, models are adjusted to sys-tem specifics from previously recorded training data. After training, error sequences oc-curring at runtime are analyzed in order to classify the current status of the system asfailure-prone or not. In machine learning, such procedure is called an supervised offlinebatch learning approach.

In order to review existing approaches to online failure prediction, a taxonomy hasbeen developed and a comprehensive survey of major publications has been presented.Additionally, related work on extending hidden Markov models to continuous time hasbeen presented.

11.2 Phase II: Data Preprocessing, the Model, and Clas-sification

The second phase of the engineering cycle aims at synthesizing a problem-specificmethodology. In many cases including this thesis, existing approaches need to be adaptedor a new model needs to be developed.

Online failure prediction is performed in three steps:

1. Error messages that have occurred within a given time window before present timeform an error sequence. The sequence is preprocessed, which includes assignmentof symbols, tupling and noise filtering. In the case of training, failure sequences areadditionally grouped by clustering.

2. Using extended hidden Markov models, similarity to failure and non-failure se-quences is computed. Sequence likelihood is used as a measure for similarity be-tween the observed sequence under investigation and the sequences of the trainingdata.

3. Applying Bayes decision theory, a final decision is made whether the current situa-tion is failure-prone or not.

11.2.1 Data PreprocessingData preprocessing consists of several steps, of which the assignment of error IDs, thetupling technique by Iyer & Rosetti, and sequence extraction are more of technical ratherthan conceptual nature and are hence not summarized, here.

Failure sequence clustering. Due to the complexity of the system, it must be assumedthat several failure mechanisms exist and are hence present in the data. The term failuremechanism is used to denote specific relations of faults and system states to a failure. Inthis thesis, a technique has been developed that separates failure mechanisms by meansof clustering. The basic notion of failure sequence clustering is that a dissimilarity ma-trix is formed by training a small hidden semi-Markov model for each sequence and bycomputing sequence likelihoods with each model for all failure sequences. Then, a stan-dard clustering technique can be applied to identify groups of failure sequences that are“close” in the sense of large mutual sequence likelihoods. An analysis using the telecom-munication data has revealed that agglomerative clustering using Ward’s procedure yields


the most stable results. Other clustering parameters such as the number of states of themodels and the level of background distributions are not very decisive and default valueshave been derived.

Noise filtering. The purpose of noise filtering is to remove non failure-related errorsfrom failure sequences in the training data as well as for online prediction. Noise filteringis based on a statistical test derived from the well-known χ2 test of goodness of fit. Theprinciple notion is that only symbols that are outstanding, i.e., occur more frequently thanexpected at a given time, are considered. At least for the data of the case study, an analysishas shown that computing the symbols’ expected probabilities from all sequences showsa rather clear separation from signal to noise.

Improved logfiles. Although not applied to the data of the case study, a principle in-vestigation of logfiles has resulted in two proposals how logfiles can be improved forautomatic processing:

1. Event type and event source should be clearly separated

2. A hierarchical numbering scheme should be used, which supports data investigationby providing multiple levels of detail. Furthermore, a distance metric can be definedthat would facilitate clustering of error message types.

In order to quantify the quality of logfiles, logfile entropy has been defined. It is based onShannon’s information entropy but additionally incorporates the overlap of required andgiven information in logfiles.

11.2.2 The Hidden Semi-Markov ModelIn this thesis, a pattern recognition approach is applied to the task of online failure pre-diction. Hidden Markov models (HMMs) have been chosen as modeling formalism sincefirst, HMMs have successfully been used in many advanced pattern recognition tasks,and second, there is an appealing match of concepts from faults to hidden states and fromerrors to observation symbols. However, temporal sequences, which are sequences in con-tinuous time are used as input data but standard HMMs are not designed for continuoustime. Four ways how standard HMMs can be used / extended to process continuous-timesequences have been discussed. An extension of the stochastic process of hidden statetraversals seemed most promising due to a lossless representation of time and the powerto mimic the temporal behavior of the underlying stochastic process. In order to achievethis, a new model has been proposed in this dissertation. Its key concepts and propertiesare summarized in the following:

• HMMs have been combined with a semi-Markov process resulting in a hiddensemi-Markov model (HSMM). HSMMs combine the mature formalism and well-understood properties and algorithms of standard HMMs with a great flexibility tospecify the duration of transitions from one hidden state to the next.

• For sequence recognition the efficient forward algorithm has been adapted toHSMMs. By this, sequence likelihood can be computed which is a probabilis-tic measure of similarity between the sequence under investigation and the set of

11.2 Phase II: Data Preprocessing, the Model, and Classification 267

sequences the HSMM has been trained with. In order to find the most probable se-quence of hidden states, the Viterbi algorithm has been adapted, as well. However,this has not been of major concern for this thesis, although it might be of interestfor diagnosis.

• The HSMM can also be used for sequence prediction, which is also not used foronline failure prediction, here. However, this technique might be of interest fordiagnosis or other applications of the model.

• In order to train the model, the Baum-Welch algorithm used for standard HMMshas been adapted to HSMMs. It belongs to the class of generalized expectationmaximization algorithms combining techniques from maximum likelihood estima-tion and gradient-based methods for optimization of transition duration distributionparameters.

• Convergence of the training procedure has been proven based on the rather univer-sal theory of EM algorithms, which employs lower bound optimization resulting ina so-called Q-function. The specific Q-function for HSMMs has been derived andby partial differentiation and application of Lagrange multipliers it has been shownthat the algorithm converges at least to a local maximum of training sequence like-lihood.

• For the specific task of online failure prediction, a dedicated topology of HSMMsis used. Failure prediction models employ a chain-like, or left-to-right structure.However, in order to deal with missing errors in training sequences, shortcuts areincluded in the model. In order to deal with additional error messages (noise) thathas not been present in training data, intermediate states are added after completionof the training procedure: By this, model flexibility is increased without affectingcomplexity of the training procedure. Experiments with the telecommunication datahave shown that one intermediate state per transition is most effective, although thebenefit lags behind expectations.

• In order to assess complexity of the algorithm, two cases must be distinguished:application of HSMMs for online failure prediction during runtime, and trainingof HSMM parameters. Application complexity of general HSMMs belongs to theclass O(N2 L), where N denotes the number of states and L the number of sym-bols in the error sequence. However, due to the left-to-right structure used foronline failure prediction, complexity actually belongs to class O(NL). Theoreti-cally, training complexity is O(N4 L). However, the chain-like structure reducescomplexity to O(N3 L). In order to remedy the problem of convergence to localmaxima, the entire training procedure is repeated 20 times with varying randominitialization.

• The forward algorithm of the HSMM developed in this thesis is much more effi-cient than previous extensions to continuous time. The main reason for this is thatprevious extensions have mainly been developed in the area of speech recognition.An in-depth comparison of the task of failure prediction with speech recognitionhas revealed that for failure prediction a one-to-one mapping between states andobservation symbols can be assumed and temporal properties are mainly includedin the stochastic process of hidden state traversals. This analysis allows a strict


enforcement of the Markov assumption, which results in a forward algorithm thatis almost as efficient as its discrete-time counterpart. Furthermore, this approachallows to model time as transition durations rather than state sojourn times whichoffers more modeling flexibility.

11.2.3 Sequence ClassificationThe final step in a pattern recognition approach to online failure prediction is to classifywhether the current runtime state is failure-prone or not. Bayes decision theory has beenused in order to derive classification rules. More specifically,

• An introduction to Bayes decision theory has been given including the proof thatclassification error rate is minimal if each sequence is classified according to maxi-mum posterior probabilities, and a minimum cost classification rule.

• Since for real applications of hidden Markov models1 only logarithmic sequencelikelihood can be used, the Bayesian decision rule has been extended to a multi-class classification rule for log-likelihoods.

• By introducing the bias-variance dilemma, it has been shown why it is importantto control the trade-off between bias (which is how closely training can adapt tothe training data) and variance (which is how much the resulting model is depen-dent on the selection of training data). Several techniques have been discussed withrespect to their applicability to online failure prediction with HSMMs. In this dis-sertation, model order selection, and background distributions, in combination withmaximum amount of training data have been applied.

11.3 Phase III: Evaluation Methods and Results for In-dustrial Data

Having developed the theoretical methodology, the third phase of the engineering cycleis concerned with implementing it and to perform experiments with data. This leads to asolution that can be applied to a running system.

11.3.1 Evaluation MethodsMany different metrics exist that capture various aspects of failure prediction. The com-prehensive overview and discussion of characteristics is one of this thesis’ contributions.

Metrics for prediction quality. Many metrics for the evaluation of prediction are basedon the contingency table, which classifies each prediction to be either a true positive, falsepositive, true negative, or false negative. A table has been presented listing a great varietyof metrics and their synonymous names. In this thesis, precision, recall, false positiveand true positive rate have been used. Additionally, the F-measure is used in order to turnprecision and recall into a single real number.

1As well as of their hidden semi-Markov extensions

11.3 Phase III: Evaluation Methods and Results for Industrial Data 269

One of the major drawbacks of contingency table-based methods is that they are de-fined on a binary basis: a prediction is either positive (a failure warning) or negative (nowarning). However, many prediction methods such as the HSMM approach employ acustomizable threshold upon which the decision is based, and each threshold value mayresult in a different contingency table and subsequently in different values for the associ-ated metrics. Several plots address this problem: Precision / recall curves plot precisionover recall for various values of the decision threshold. In addition to the F-measure, thepoint where precision and recall are equal can be used to turn them into a single num-ber. A second well-known plot are receiver operating characteristics (ROC) where truepositive rate is plotted versus false positive rate. In order to turn this graph into a singlenumber, the integral under the ROC curve is used, which is called “area under curve”(AUC).

A new type of graph has been introduced in this thesis: accumulated runtime costgraphs plot prediction cost as they accumulate over runtime. In contrast to contingencytable-based metrics, which imply mean values, accumulated runtime cost graphs reveala temporal aspect of prediction since it can be seen when, e.g., false positive predictionshave occurred. Furthermore, any predictor can be compared to an oracle predictor, aperfect measurement-based predictor, a system without predictor, and maximum cost.

In summary, it should be pointed out that there is no single perfect evaluation metric.For example, precision and recall do not account for true negative predictions. AUCweights all threshold values equally which results in cases where a predictor with betterAUC incurs higher cost. Accumulated runtime cost graphs are sensitive to the relativedistribution of cost, which can be chosen such that the graph is altered significantly.

Evaluation process and statistical confidence. One of the major problems with manymachine learning approaches is that a lot of parameters are involved that are not directlyoptimized by the training procedure. For example, the length of the data time window isusually assumed to be fixed, but it is not clear what size of the window results in optimalprediction quality. Moreover, several parameters are dependent so that each combinationof all values for all parameters would have to be tested and evaluated with respect tofinal prediction performance. Since more than 15 parameters are involved, such approachwould result in tremendous computation times. For this reason, a mixed approach hasbeen applied in this thesis: Parameters that could be set by separate experiments havebeen optimized separately (greedy approach) while other parameters have been optimizedin combination with dependent ones that cannot be determined in a greedy way.

Three types of data sets have been used in the experiments:

• Training data is used as input data for the training procedure.

• Validation data is used to assess and control overfitting.

• Test data is used for final out-of-sample assessment of failure prediction perfor-mance.

Even though a lot of data has been available for the telecommunication system, theamount of failure data is still limited. A fixed division into three data sets of equal sizewould not result in sufficient estimation of real prediction performance. The standardsolution to this kind of problem is called m-fold cross validation where data is dividedinto m parts, and 1−m parts are used for training / validation while the remaining part is


used for test. This procedure is repeated m times such that each of the m parts is used fortesting once. Training data is then further divided into training and validation data in thesame way.

In order to get to an estimation of confidence intervals, cross validation is combinedwith a technique called bootstrapping. Other confidence interval estimation techniqueshave been investigated, too, but are either not applicable (such as assuming a Bernoulli-Experiment or normal distributions), or are not flexible enough to be applied to largedatasets (such as the Jacknife).

11.3.2 Results for the Telecommunication System Case StudyIndustrial data of a commercial telecommunication system has been analyzed in orderto assess the potential to predict failures of the system. The entire modeling procedurehas been described and analyzed in detail from the first steps of data preprocessing to adetailed analysis of the modeling parameters on final prediction quality. The most relevantresults are provided, here.

Data preprocessing. The main goal was to investigate whether the assumptions madein the theoretical development of the methodology fit reality as observed in the industrialdata. In particular, findings included

1. The proposed procedure to assign error IDs to error messages is relatively robust.The majority of assignments is non-ambiguous. The procedure reduced the numberof different message types from 1,695,160 to 1,435.

2. It seems safe to determine tupling window size by the procedure proposed by Iyer& Rosetti. The expected bend in the number of resulting tuples can be identifiedclearly.

3. Agglomerative clustering with Ward’s method should be used to group failure se-quences. The number of states for the HSMMs used to compute sequence likeli-hoods should be chosen to be approximately

√L where L denotes the maximum

length of the majority of failure sequences. The weight assigned to backgrounddistributions should be chosen rather small —a value of 0.1 has been used in exper-iments.

4. Noise filtering works best if a global prior estimated from the entire training dataset is used. Experiments indicate that the proposed noise filtering mechanism candistinguish between signal and noise. The filtering threshold should be chosen tobe slightly above a plateau in average sequence length that has been observable indata of the case study. Furthermore, experiments support two principles observedby Levy & Chillarege: prior to a failure, the mix of errors changes and a few errorsoutnumber their expected value heavily.

Analysis of the preprocessed dataset. After preprocessing, the resulting dataset hasbeen investigated with respect to the following characteristics:

• Error frequency varies heavily in the data set. However, no correlation betweenthe number of errors per time unit and the occurrence of failures can be observed.

11.3 Phase III: Evaluation Methods and Results for Industrial Data 271

Hence straightforward counting and thresholding techniques do not seem appropri-ate.

• Delays between errors can be approximated best by a mixed probability distributionconsisting of an exponential and a uniform distribution.

• An analysis of the distribution of time-between-failures revealed that frequentlyused distributions such as exponential, gamma or Weibull do not fit the data verywell. For this reason, failure prediction or preventive maintenance techniques thatsimply rely on lifetime distributions are most likely deemed to fail. Furthermore,an autocorrelation analysis shows that there is no periodicity in the occurrence offailures. That is why periodic techniques cannot achieve good prediction results.

Model parameters. Quite a few parameters are involved in the modeling step. Param-eters have been divided into two groups:

• Parameters that can be fixed heuristically in a greedy manner. This group includesprobability mass and distribution of intermediate states, the number of iterations theBaum-Welch algorithm is performed, and the type of the background distribution.

• Parameters that can only be evaluated by training a model and testing predictionperformance on a test data. This group includes the number of states of the HSMM,the maximum number of states that are skipped by shortcuts, the number of interme-diate states that are added to the model after training, and the amount of backgroundweight applied after training.

Parameters of the second group have been investigated with respect to F-Measure, andtheir effect on computation times. Best results have been achieved with a model of 100states, shortcuts bypassing one state, one intermediate state per transition and a back-ground weight of 0.05.

The optimal set of parameters has then been investigated further and precision / recallplots, ROC curves, AUC, and cost curves have been provided. At the threshold value ofmaximum F-measure (0.66), precision of 0.70, recall of 0.62, and false positive rate of0.016 have been achieved. AUC was equal to 0.873.

Application specific parameters. Two modeling parameters depend on the applicationrather than the model itself: lead-time (i.e., the time how far in the future failures arepredicted), and data window size (how much data is used for prediction). An analysisof these two parameters has shown that lead time approximately stays at the same levelfor predictions up to 20 minutes ahead and drops quickly for predictions with longer leadtime. With respect to the size of the data window, model quality in general becomes betterif longer sequences are taken into account. However, mean processing time increasesheavily for longer sequences putting a limit on the size of the data window.

Sensitivity analysis. Large-scale computer systems such as the telecommunication sys-tem are highly configurable and undergo repetitive updates. In order to assess sensitivityto these issues, the approach has been tested in two ways:


• Dependence of prediction quality on the size of training dataset. Many stochasticestimators such as the mean yield unreliable results if the number of data points isdecreased. A similar effect was observed in the case study. By reducing the size ofthe training data set in two steps, results remained stable for the first step but failureprediction quality broke down after the second reduction. Not surprisingly, meantraining time was also reduced for smaller training data sets.

• Dependence on changing system configurations and model aging. Since with offlinebatch learning, parameters of the HSMM are trained once, behavior of the runningsystem will be increasingly different from system behavior at training time withevery change to configuration and every update. This effect has been simulated byan increasing time gap between selected training and test data. Experiments haveshown that mean maximum F-measure decreases almost linearly with increasingsize of the gap. Additionally, it has been observed, that confidence intervals ob-tained from bootstrapping get wider which can be explained that with increasinggap size more and more sequences are significantly different from training data.

• Grouping of failure sequences has been applied in order to separate failure mech-anisms. However, partitioning the set of failure sequences results in less trainingsequences for each model, which in turn may deteriorate HSMM parameter estima-tion involved in the training procedure. In order to check whether this is the case,a HSMM failure predictor with only one failure group model has been trained. Re-sults for this model have been significantly worse supporting the assumption that theHSMMs can adopt better to the training sequences if failure sequences are groupedaccording to their similarity.

Comparative analysis. The HSMM-based failure prediction approach has been com-pared to the most promising and well-known failure prediction approaches of that area,which are dispersion frame technique (DFT) by Lin & Siewiorek, the Eventset methodby Vilalta & Ma, and SVD-SVM by Domeniconi et al.. DFT only evaluates the time oferror occurrence, while Eventset and SVD-SVM only investigate the type of errors thatoccur.2 In contrast, HSMM-based failure prediction investigates both time of error occur-rence and their type —it treats input data as a temporal sequence. In order to provide acomparison with a very simple prediction method, periodic prediction based on MTBFhas also been included in the comparison.

Standard, discrete-time HMMs can be used for failure prediction, too. In order toassess the gain in prediction performance achieved by introducing a semi-Markov process,prediction performance of standard HMMs have been tested, too. Additionally, HSMM-based failure prediction has been compared to failure prediction based on universal basisfunctions (UBF) developed by Günther Hoffmann, although UBF prediction belongs toa different class of prediction algorithms operating on equidistant monitoring of systemvariables.

In summary, it can be concluded from the comparative analysis that HSMM-basedfailure prediction outperforms other failure prediction approaches significantly. However,

2Although SVD-SVM can in principle incorporate both time and type of error messages, prediction actuallydeteriorates if time is included.

11.4 Phase IV: Dependability Improvement 273

improved failure prediction comes at the price of computational complexity: Model train-ing consumes 2.38 times and online prediction 224.5 times as much time as the slowestcomparative approach. Nevertheless, the approach demonstrates what prediction perfor-mance is achievable with error-event triggered online failure prediction.

11.4 Phase IV: Dependability ImprovementFailure prediction is not worth the effort if it does not help to improve system dependabil-ity. In order to improve dependability, failure prediction must be coupled with subsequentactions that are performed once an upcoming failure has been predicted. This is calledproactive fault management. However, the focus of this thesis is on failure predictionand therefore, only a theoretical analysis of the effect of proactive fault management onsystem dependability has been provided.

11.4.1 Proactive Fault ManagementTwo strategies exist how system dependability can be improved in case of a predictedupcoming failure:

• Downtime avoidance techniques try to prevent the failure. Their goal is to achievecontinuous operation. Three groups of downtime avoidance techniques have beenidentified: state clean-up, preventive failover, and load lowering.

• Downtime minimization techniques can be further divided into two subgroups: re-active techniques let the predicted failure happen, however, the system is preparedfor its occurrence such that time-to-repair is reduced. This is achieved by either oneor both of two effects: (a) reconfiguration time can be shortened if an upcomingfailure is anticipated, and (b) the time needed for recomputation can be reduced.On the other hand, proactive techniques actively trigger repair actions such as arestart, turning unplanned downtime into planned downtime, which is expected tobe shorter or incurring less cost.

Several examples for all types of techniques have been given.

11.4.2 ModelsBased on the continuous Markov-chain model (CTMC) for software rejuvenation (i.e.,preventive restart of components or the entire system) introduced by Huang et al., twoCTMC models have been developed: The first model is used to compute steady-state sys-tem availability, while the second simplified model is used to compute system reliabilityand hazard rate.

It has been shown how the rates of the model can be computed from eleven parameters,of which four are application specific and hence assumed to be fixed. The remaining sevenmodeling parameters are: precision, recall, false positive rate, failure probability given atrue positive, false positive, and true negative prediction, and repair time improvementfactor. Using these parameters, closed-form solutions for steady-state system availability,reliability and hazard rate have been derived.


11.4.3 Parameter EstimationA procedure has been described, how the several parameters can be estimated from exper-iments. The procedure consists of four experiments, two of which include fault injectionin order to assure that the prediction of a failure is a true positive.

11.4.4 Case Study and an Advanced ExampleA diploma thesis has set up an experimental environment where simple proactive faultmanagement techniques have been applied to an open web-shop demo application onthe basis of Microsoft .NET. Specifically, preventive restart on application as well as onsystem level have been used as downtime minimization technique, and delivering a pagestating that the server is temporarily busy has been used as technique to avoid downtimeby system relieving.

The parameter estimation procedure has been applied to the data recorded in the ex-periments. However, neither of the two proactive techniques has been able to improvesystem availability, reliability or hazard rate (in the long term). The main reasons forthis is that the implemented failure prediction algorithm (which was not HSMM-basedprediction but a simple threshold-based method) has not been able to provide sufficientlygood predictions. Furthermore, both types of actions have not proven to be successful:Instead of a reduced downtime, restarting took eleven times as long as downtime incurredby a failure, and at every second true positive prediction, a failure occurred even thoughsystem relieving was in place.

Since the experiments have not resulted in improved system dependability, a moresophisticated example has been provided. In this second example, the values estimatedfrom the telecommunication case study have been used for prediction quality. Addition-ally, better but still realistic values have been assumed for the other parameters. Thisscenario resulted in a considerable improvement in dependability: unavailability was cutby half and reliability as well as hazard rate have been significantly improved.

11.5 Main ContributionsIn summary a novel failure prediction approach has been developed that has strong foun-dations in stochastic pattern recognition rather than heuristics, and that outperforms well-known prediction techniques if applied to industrial data of a commercial telecommuni-cation system of considerable size. On the way to this result, several contributions to thestate-of-the art have been achieved:

• In the fundamental relationship between faults, errors, and failures, side-effects offaults are missing. In this dissertation, side-effects of faults are termed symptomsand the fundamental concept has been extended.

• A comprehensive taxonomy of online failure prediction methods has been intro-duced. Based on the taxonomy, an in-depth survey of online failure prediction tech-niques has been presented. including research areas that have not been explored forthe objective of failure prediction.

• The failure prediction method developed in this thesis is the first to apply patternrecognition to error event-driven time sequences (temporal sequences).

11.6 Conclusions 275

• A novel extension of hidden Markov models to incorporate continuous time hasbeen developed. Since previous extensions to continuous time have focused onequidistant time series, the extension presented here is the first to specifically ad-dress temporal sequences as input data.

• A novel model to theoretically assess dependability of proactive fault management,which is prediction-driven fault tolerance, has been introduced. To our knowledge,it is the first to incorporate correct and false predictions, as well as downtime avoid-ance and downtime minimization techniques. In addition to that, the model incor-porates failures that are induced by proactive fault management itself, e.g., by theadditional load that is put onto the system.

Although not directly related to the failure prediction model, several other contributionsto the state-of-the-art have been made:

• To the best of our knowledge, this thesis is the first to collect and discuss the variousevaluation metrics for prediction tasks.

• A novel methodology to identify failure mechanisms and to group failure sequenceshas been developed. Although only used for data preprocessing in this thesis, theapproach might also be useful for diagnosis, as well.

• To our knowledge the first measure to quantify the quality of logfiles has beenintroduced. Due to its roots in information theory, the measure is called logfileentropy.

11.6 ConclusionsIn this dissertation, an effective online failure prediction approach has been proposed thatbuilds on the recognition of symptomatic patterns of error sequences. A novel continuous-time extension of hidden Markov models has been developed and the approach has beenapplied to industrial data of a commercial telecommunication system. In comparisonto the best-known error-based failure prediction approaches, the proposed methodologyshowed superior prediction accuracy. However, accuracy comes at the prize of compu-tational complexity. Although this is intuitively comprehensible, Legg [160] has inves-tigated performance and complexity of prediction algorithms in a principle way. Basedon a universal formal theory for sequence prediction by Solomonoff [247, 248], which isnot computable in general, Legg has proven that predictors of a given predictive powerrequire some minimum computational complexity (see Figure 11.1). Another importantresult of Legg’s work is that, although very powerful predictors exist for computable se-quences, they are not provable due to Gödel incompleteness problems. In other words,for provable algorithms, an upper bound with respect to predictive power exists. Hence,maximum achievable predictive accuracy for the telecommunication case study might beworse than 100% precision and 100% recall, and HSMM-based failure prediction is evencloser to the optimum than it appears.

Starting point of the model’s development was an analysis of key properties of com-plex, component-based, non-distributed software systems and the failure prediction ap-proach has been designed with these properties in mind. Hence HSMMs should also


Figure 11.1: Trade-off between predictive power and complexity. It can be shown that for agiven complexity, there is an upper bound on predictive power. Hence, there isalso an upper bound on predictive power achievable by algorithms with provablecomplexity. However, it can also be shown that algorithms with better predictivepower exist but their complexity is unprovable. HSMM-based prediction algo-rithm lies within the hatched area (Legg [160]).

show very good prediction results if applied to other systems sharing the same proper-ties. Additionally, HSMMs can be adapted to different situations by adjusting the variousparameters involved in modeling. Furthermore, since they are a general contribution toevent-driven temporal sequence processing, HSMMs might prove to achieve similarlyoutstanding results in other application domains beyond failure prediction, as well.

Chapter 12

Outlook

As is the case with most projects, there is always room for further investigations andimprovements. In this chapter, some potential and promising directions are highlighted.Starting from technical issues how the proposed hidden semi-Markov model (HSMM)could be further improved the scope is widened successively.

12.1 Further Development of Prediction ModelsThe survey of online prediction models (see Chapter 3) has shown that quite a few predic-tion models have been developed in the past, but also that there are several areas that seempromising to be explored. The discussion along the branches of the taxonomy is not bereiterated, here but rather, the focus is on more sophisticated machine learning techniques.

12.1.1 Improving the Hidden Semi-Markov ModelMore sophisticated optimization techniques than the gradient-based could be used forestimation of transition duration parameters in the Baum-Welch algorithm for HSMMs.For example, second order optimization algorithms such as Newton’s method or quasi-Newton methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) and Davidon-Fletcher-Powell (DFP) could be applied. In this thesis, the problem of local maxima hasbeen addressed by simply running the Baum-Welch algorithm several times. A more so-phisticated solution would, for example, apply an evolutionary optimization strategy. Ad-ditionally, the EM training algorithm used in this dissertation does not alter the structureof the HSMM. Extended algorithms such as state pruning, which also alter the topologyof an HSMM, may be investigated.

The black-box approach can actually be turned into a gray-box approach by addingfailure group models that are constructed manually. More specifically, if it is known fromsystem design that activation of a special failure mechanism results in a unique sequenceof errors, an additional failure group model can be built that is specifically targeted tothis sequence. Variations and uncertainties in time as well as in error symbols can bemodeled by transition and observation probabilities. The resulting additional failure groupmodel can be seamlessly integrated with the models obtained from data-driven machinelearning. Referring to Figures 2.9 and 2.10 on Pages 19 and 20, respectively, the hand-built model would simply be added as model u + 1. By this procedure, the purely data-driven modeling approach described in this thesis is turned into a machine-learning /

277

278 12. Outlook

analytic hybrid modeling approach.

12.1.2 Bias and VarianceControlling bias and variance means to control the trade-off between under- and overfit-ting. As has been mentioned in Chapter 7, algorithms such as bagging and boosting canbe applied to HSMMs as well.

A further technique controlling the bias-variance tradeoff is called regularization (see,e.g., Bishop [30]). Regularization usually denotes a technique where the optimization ob-jective is augmented by a term putting a penalty on model complexity or specificy, suchas curvature in regression problems. Regularization can in principle also be applied toHSMMs. In order to do so, the Baum-Welch algorithm would have to be changed suchthat the optimization objective, which is training sequence likelihood, is augmented by acomplexity / specificy term. For example, a penalty could be put on setting transition orobservation probabilities to zero. Another approach is to introduce a prior probability dis-tribution over the values of model parameters, as has been introduced by Hughey & Krogh[128]. However, regularization changes the model rather deep in its core and similar re-sults can very likely be achieved by other techniques such as background distributions,which have been used in this thesis.

It is common knowledge that every single modeling technique is well-suited for someproblems but performs worse on others. This is called the inductive bias of a modelingtechnique. Meta-learning (see, e.g., Vilalta & Drissi [267]) makes use of different induc-tive bias of several modeling techniques. For example, one technique of meta-learningassigns a new problem to the base-learner with the most appropriate inductive bias. Thishas been shown to improve failure prediction significantly in Gujrati et al. [111] eventhough a very simple meta-learning algorithm has been applied.

12.1.3 Online LearningSystems are undergoing permanent updates and configuration changes. With each suchstep, failure behavior of the system might be changed. The consequence is that the modelsobtained from training are getting more and more outdated. One solution to this problemis online learning. In online learning, the model is permanently updated such that it adaptsto the changes in the system. A straightforward solution to online learning for HSMMswould be to collect new failure and non-failure sequences at runtime and to periodicallytrain new models in the background. However, most likely more sophisticated approachescan be applied.

12.1.4 Further IssuesPrediction in continuous time. In this dissertation, failures have been predicted witha fixed lead-time ∆tl. However, if the error sequence under investigation is assumed tobe the start of a temporal sequence, sequence prediction techniques (c.f., Section 6.2.2)can be used to determine the continuous cumulative probability of failure occurrence overtime. Such approach is beneficial if several proactive actions are available in a system thatimply different warning times (i.e., minimum lead-time): failure prediction would haveto be performed only once rather than once for every lead-time.

12.1 Further Development of Prediction Models 279

Conditional random fields. Markov models in general are subject to the so-called labelbias problem (see, e.g., Lafferty et al. [152]). The problem is that the entire probabilitymass is distributed among successor states. Hence, if a state has only one successor, thestochastic process transits with probability one to the next state. If there are two succes-sors and both successors are equally likely, they proceed with probability of roughly ahalf. From this follows that sequence likelihood depends on the number of outgoing tran-sitions. Even if this problem is not that urgent for HSMMs, since first, the model topologyis rather symmetric (most states have the same number of successor states) and further-more, transition probability is also determined by the duration of the transition, the prin-ciple restriction still applies. In recent years, new stochastic models have been developed,among which conditional random fields (CRF) are promising candidates. These modelshave a second important advantage: The objective function is convex, which guaranteesthat the training procedure converges to a global rather than a local maximum. However,these models are rather new and experience is limited —that is why they have not beenconsidered in this thesis.

Input variables. In this dissertation, error events have served as input data to the hiddensemi-Markov model. However, as the title of the thesis indicates, any event-based datasource may be used, too. For example, by defining a threshold, any (equidistant) moni-toring of system variables such as memory consumption or workload can be turned intoan event-based data source. Although this has not been applied in this thesis, it might bea valuable solution for systems that do not have fine-grained fault detection installed.

In many machine learning applications, the problem of variable selection is an im-portant issue. For online failure prediction based on symptom monitoring, Hoffmannet al. [122] have shown that a good selection of variables can be even more decisive thana sophisticated choice of modeling technique. In the course of this thesis, some exper-iments with different sets of error-message variables have been performed. However,results could not be improved. The main reason for this is that —in contrast to symptommonitoring— not all variables are available all the time: each error message may containa different set of variables. Hence, in order to successfully apply variable selection tech-niques, extra care must be taken to missing variables. Since this is not the case for mostexisting variable selection algorithms, further research is needed.

Mining rare events. A further issue is related to the problem that failure sequencesare rare in comparison to non-failure sequences. Weiss [276] has comprehensively in-vestigated this topic, even though the main focus has been on data mining. Many of theproposed techniques, e.g., training failure models on the rare class only, and using rareclass robust evaluation metrics such as precision and recall have been applied in this the-sis. However, other techniques, such as advanced sampling methods could be additionallyapplied.

Distributed systems. This dissertation has focused on centralized systems, only. How-ever, distributed systems are also important and should be considered. According to itsfeatures to flexibly incorporate timing behavior, to model interdependencies of more orless isolated entities, and to handle missing events or permutations in their order, HSMM-based failure prediction seems to be a good candidate for failure prediction in distributedsystems.

280 12. Outlook

Design for predictability. it has been assumed throughout this thesis that the system isfixed and given and failure prediction algorithms have to adapt to its specifics. However infuture, it may also be the other way round: “designing for predictability” may be consid-ered from the very beginning of the software development process. At the current stage,it is not yet clear what characteristics of a software design makes failures predictable andfurther research is needed. However, it can be concluded from this dissertation that iferror event-based failure prediction is to be applied, fine-grained fault detection has to beembedded throughout the system.

12.1.5 Further Application Domains for HSMMsThe hidden semi-Markov model developed in this dissertation has been designed for theprocessing of event-triggered temporal sequences. Therefore, HSMMs can be applied toother problem domains as well. The prerequisite key characteristic is that observations(input data) must occur in an event-driven way and input values must belong to a finitecountable set (observation symbols). There are supposedly many areas where HSMMscan be applied, among which are

• Web user profiling. The click stream of a web user navigating through a site formsa temporal sequence: each click is an event and, e.g., the requested URL is theobservation symbol. HSMMs might be used to distinguish between various typesof users (sequence recognition) or to predict the most probable URL that the webuser will click next (sequence prediction). Both could be used to dynamically adjustweb pages to the users needs and preferences.

• Shopping tour prediction. In a retail store, each time a customer puts an item intothe (technically enhanced) cart, an event is generated. The type of the event isdefined, e.g., by the id of the item, its location, etc. Temporal sequence processingbased on HSMMs might be used to, e.g., display context-sensitive advertisements.In contrast to existing data-mining approaches, not only the set of items is relevantbut also the time when the customer has put the item into the cart, which wouldenable to present advertisements that are along the customer’s anticipated routethrough the shop or to enable a predictive planning of cash counter personnel.

• Failure prediction in critical infrastructures. Many infrastructures that are usedeveryday (such as electricity, telecommunication, water supply, food transport) can,in case of a failure, impose drastic restrictions in daily life or even pose a severethreat to the health of many people. Failure prediction may be used to predictinfrastructure failures such that appropriate actions can be undertaken to preventor at least to alleviate them. HSMM-based failure prediction might prove to beespecially successful for infrastructures where only critical events but no continuousmonitoring is available.

12.2 Proactive Fault ManagementThe essence of proactive fault management is to act proactively rather than reactively tosystem failures. In the context of this dissertation, techniques are considered that rely on

12.2 Proactive Fault Management 281

the prediction of upcoming failures. Event though there are techniques such as check-pointing that can be triggered directly by a failure prediction algorithm, subsequent di-agnosis is required in order to investigate what is going wrong in the system, i.e. whatcaused the failure that is imminent. Based on failure prediction and diagnosis results, adecision needs to be made which of the implemented downtime avoidance or downtimeminimization techniques should be applied and when it should be executed in order toremedy the problem (see Figure 12.1).

Figure 12.1: The steps involved in proactive fault management. After prediction of an upcom-ing failure, diagnosis is needed in order to find the fault that causes the upcomingfailure. Failure prediction and diagnosis results are used to decide upon whichproactive method to apply and to schedule their execution.

Both diagnosis and choice / scheduling of actions are complex problems that need tobe solved for proactive fault management to be most effective. Nevertheless, the followingparagraphs will discuss some issues that are related to HSMM-based failure prediction asproposed in this dissertation.

Diagnosis. The objective of diagnosis is to find out where the fault is located (e.g., atwhich component) and sometimes to find out what caused it. Note that in contrast to tradi-tional diagnosis, in proactive fault management diagnosis is invoked by failure prediction,i.e., when the failure is imminent but has not yet occurred. One idea how diagnosis couldbe accomplished is to analyze the hidden semi-Markov model used for failure prediction:Since the HSMM approach makes use of several HSMM instances (one for non-failureand several other for failure sequences), and each failure group model is targeted to afailure mechanism, sequence likelihoods of the failure group models can be comparedin order to analyze which failure mechanism might be active in the system. The faultmight then be determined by an analysis of characteristic error messages, which mightalso include identification of the most probable sequence of hidden states by applying theViterbi algorithm. Some parts of this analysis could be even precomputed after clusteringof training failure sequences.

Scheduling of actions. The investigation of dependability enhancement presented inChapter 10 has been based on a binary classification whether a failure is imminent or not.However, in general, the decision which proactive technique to apply should be based onan objective function taking cost of actions, confidence in the prediction, effectiveness andcomplexity of actions into account in order determine the optimal trade-off. For example,to trigger a rather costly technique such as system restart, the scheduler should be almostsure about an upcoming failure, whereas for a less expensive action such as writing asupplemental checkpoint, less confidence in the correctness of the failure prediction isrequired. In contrast to many other failure prediction approaches, HSMM-based failure

282 12. Outlook

prediction can support the scheduler by reporting the posterior probability (c.f., Equa-tion 7.1) rather than the binary decision whether a failure is coming up or not.

Both topics, diagnosis and scheduling are challenges each worth a separate dissertationraising a manifold of scientific questions. The crucial issue, however, is to bring proactivefault management into practical applications in order to prove that system dependabilitycan be boosted by up to an order of magnitude by the proactive fault handling toolbox,which is a combination of effective downtime avoidance and downtime minimizationtechniques, diagnosis, action scheduling, and last but not least accurate online failureprediction.

Part V

Appendix

283

Derivatives with Respect to Parametersfor Selected Distributions

In order to compute the gradient used in hidden semi-Markov model training (c.f., Sec-tion 6.3.2) partial derivatives with respect to parameters of the transition duration distri-bution are needed. Derivatives for some commonly used distributions are provided in thefollowing. Note that cumulative parametric probability distributions are used to specify ahidden semi-Markov model’s transition durations.

Exponential distribution. The cumulative distribution is given by

κij,r = 1− e−λij,r dk .

The derivative with respect to λij,r is hence:

∂

∂ λij,r

(1− e−λij,r dk

)= dk e

−λij,r dk .

Normal distribution. No closed-form representation for the cumulative normal distri-bution Φµ,σ(t) is known. However, it can be expressed using the so-called error functionerf(t):

erf(t) = 2π

∫ t

0e−τ

2dτ

∂ erf∂t

= 2√πe−t

2

The cumulative normal distribution is then given by:

Φµ,σ(t) = 12

[1 + erf

(t− µ√

2σ

)].

In order to compute the partial derivative of Φ let:

fµ,σ(t) := t− µ√2σ

∂ f

∂ µ= − 1√

2σ

∂ f

∂ σ= µ− t√

2σ2

285

286

and hence,

∂Φ∂µ

= 12

2√πe−f

2(− 1√

2σ

)

= − 1√2πσ

exp(−t

2 − 2tµ+ µ2

2σ2

)

∂Φ∂σ

= µ− t√2πσ2

exp(−t

2 − 2tµ+ µ2

2σ2

)

Log-normal distribution. Similar to the normal distribution, the cumulative log-normal distribution can be expressed using the error function:

Ψµ,σ(t) = 12

[1 + erf

(ln(t)− µ√

2σ

)].

Therefore, derivations are derived similar to the normal distribution:

∂Ψ∂µ

= − 1√2πσ

exp(− ln(t)2 − 2 ln(t)µ+ µ2

2σ2

)

∂Ψ∂σ

= µ− ln(t)√2πσ2

exp(− ln(t)2 − 2 ln(t)µ+ µ2

2σ2

)

Pareto distribution. The cumulative distribution is determined by the location param-eter (tm), which determines the minimum value for t, and a shape parameter k:

Ptm,k(t) := 1−(tmt

)kUsing MapleTM, the derivatives with respect to tm can be determined yielding

∂P

∂tm= −

(tmt

)kk

tm

and the derivative with respect to k is given by:

∂P

∂k= −

(tmt

)kln(tmt

)

287

Gamma distribution. The density of the gamma distribution is defined to be:

gk,θ(t) = tk−1 e−tθ

θk Γ(k)

where Γ(k) denotes the gamma function. The cumulative distribution is given by:

Gk,θ(t) =γ(k; t

θ)

Γ(k)

where γ denotes the incomplete gamma function. Derivation of Gk,θ(t) with respect to kas well as to θ is possible, however, the result comprises many different terms which arehence not displayed here. Rather, the result can be seen by evaluating the following fourlines using MapleTM:

gam := int(t^(a-1)*exp(-t), t=0..x);CDFgam := subs(a=k,x=x/theta,gam) / GAMMA(k);diff(CDFgam,k);diff(CDFgam,theta);

Erklärung

Ich erkläre hiermit, dass

• ich die vorliegende Dissertationsschrift “Event-based Failure Prediction: An Ex-tended Hidden Markov Model Approach” selbstständig und ohne unerlaubte Hilfeangefertigt habe;

• ich mich nicht bereits anderwärts um einen Doktorgrad beworben habe oder einensolchen besitze;

• mir die Promotionsordnung der Mathematisch-Naturwissenschaftlichen Fakultät IIder Humboldt-Universität zu Berlin (veröffentlicht im Amtl. Mitteilungsblatt Nr.34 / 2006) bekannt ist.

289

Acronyms

AC Agglomerative coefficientAFS Andrews file systemAGNES Agglomerative NestingARMA Auto-regressive moving averageARX Auto-regressive model with auxiliary inputAUC Area under (ROC) curveBCa Bias corrected accelerated confidence intervalsBFGS Broyden-Fletcher-Goldfarb-ShannoBLAST Basic local alignment search toolCBE Common base eventCDF Cumulative distribution functionCHMM Continuous hidden Markov modelCPU Central processing unitCRF Conditional random fieldsCTMC Continuous time Markov chainCT-HMM Continuous time hidden Markov modelDC Divisive coefficientDET Detection error tradeoffDF Dispersion frameDFT Dispersion frame technqueDIANA Divisive analysis clusteringDTMC Discrete time Markov chainDVD Digital versatile diskECDF Empirical cumulative distribution functionECG Expectation conjugate gradientEDI Error dispersion indexEFDIA Early failure detection and isolation arrangementEM Expectation maximizationESHMM Expanded State HMMfMRI functional magnetic resonance imagingFN False negativeFOIL First order ??? language (not spelled out in the paper)FP False positiveFPR False positive rateFRU Field replacable unitFWN Fuzzy wavelet networkGHMM General hidden Markov model library

291

292

GPRS General packet radio serviceGSM Global system for mobile communicationHMM Hidden Markov modelHP Hewlett-PackardHSMM Hidden semi-Markov modelHSMESM Hidden semi-Markov event sequence modelHTTP Hypertext transport protocolIBM International business machinesID IdentifierIHMM Inhomogeneous hidden Markov modelIN Intelligent networkIO Input outputIP Internet protocolLDAP Lightweight directory access protocolLSI Latent semantic indexingMAP Maximum a-posterioriML Maximum likelihoodMOC Mobile originated callMSET Multivariate state estimation techniqueMTBF Mean time between failuresMTBP Mean time between predictionsMTTF Mean time to failureMTTP Mean time to predictionMTTR Mean time to repairNBEM Naive Bayes expectation maximizationNF Non-failureOR Odds ratioPCA Principal component analysisPCF Probability cost functionPCFG Probabilistic context-free grammarPFM Proactive fault managementPR Precision-recallPWA Probabilistic wrapper approachQQ Quantile quantileRADIUS Remote authentication dial in user interfaceRAID Redundant array of independent disksRBF Radial basis functionRLC Resistor inductor capacitorROC Receiver operating characteristicSAN Stochastic activity networkSAP Systems, applications, productsSAR System activity reporterSCF Service control functionSCP Service control pointSEP Similar events predictionSHIP Software hardware interoperability peopleSMART Self-monitoring, analysis and reporting technology

293

SMP Semi-Markov processSMS Short message serviceSPRT Sequential probability ratio testSRN Stochastic reward netSSI Stressor susceptibility interactionSTAR Self-testing and repairingSVD Singular value decompositionSVM Support vector machineTBF Time between failuresTCP Transmission control protocolTN True negativeTP True positiveTTF Time to failureTTP Time to predictionTTR Time to repairUBF Universal basis functionUML Universal modeling languageUPGMA Unweighted pair-group average methodURL Uniform resource locatorUTC Universal time coordinatedWSDM Web services distributed management

Index

φ-coefficient, 166

a-priori algorithm, 49abstraction, 6accumulated runtime cost, 163accuracy, 156AdaBoost, 145adaptive enterprise, 5agglomerative coefficient, 151aggregated models, 145alarm, 41alphabet, 56, 109amount of background weight, 199anomaly detectors, 33approximation approach, 25arcing, 145area under curve (AUC), 164autonomic computing, 5

background distribution, 81, 126, 145backward algorithm

HMM, 59HSMM, 105

bag-of-words, 50bagging, 145, 278banner plot, 151Baum-Welch algorithm

HMM, 60HSMM, 106

Bayes error rate, 135Bayes prediction, 30Bayesian prediction, 37BCa, 171bias and variance, 138

bias, 139classification, 140regression, 138variance, 139

bias-variance dilemma, 140boosting, 145, 278bootstrapping, 170boundary bias, 143

boundary error, 141bug

Bohrbugs, 22Heisenbugs, 22Mandelbugs, 22Schrödingbugs, 22

chaining effect, 83checkpoint, 230class skewness, 26classification, 19

cost, 135cost-based, 135failure prediction, 136likelihood ratio, 136log-likelihood, 137loss matrix, 135multiclass log-likelihood, 138rejection thresholds, 136risk, 135sequence likelihood, 136

clustering, 37agglomerative, 81complete linkage, 83divisive, 81failure sequences, 18hierarchical, 81nearest neighbor, 83partitioning, 81stopping rules, 82unweighted pair-group average, 83Ward’s method, 83

clusters, 83collision, 77common base event (CBE), 90conditional random fields, 279confidence, 49confusion matrix, 153containers, 15contingency table, 152, 153continuous output probability densities, 67

295

296 Index

continuous time sequences, 63convex combination, 98cooperative checkpointing, 4correct no-warning, 153correct warning, 153count encoding, 216counting and thresholding prediction, 30

data mining, 41data sets, 167

test, 167training, 167validation, 167

data window size ∆td, 136, 208data window size ∆td., 180decision

boundaries, 134region, 134surfaces, 134

defect trigger, 87defect type, 87delay symbols, 65dendrogram, 149detection error trade-off (DET), 159diagnosis, 281discrete time Markov chain (DTMC), 55dispersion frame technique (DFT), 46dissimilarity matrix, 80distributed system, 16divisive coefficient, 151downtime avoidance, 227downtime minimization, 227duration, 97

early stopping, 144engineering cycle, 6entropy, 90equilibrium state distribution, 243ergodic topology, 81, 110error, 10, 41error function, 285error patterns, 17error type, 76error-based failure prediction, 39

classifier, 45frequency, 39pattern recognition, 43rule-based, 41statistical tests, 45

event, 39event type, 87

event-triggered temporal sequence, 46eventset, 48

accurate, 49frequent, 48method, 42, 48

expectation conjugate gradient (ECG), 110expectation maximization (EM), 61, 116

generalized, 110, 119expected risk, 135

F-measure, 155failure, 10

arbitrary, 14computation, 14crash, 14omission, 14performance, 14, 19timing, 14

failure avoidance, 227failure mechanism, 15, 18, 79failure modes, 15failure prediction, 11

online, 12failure probable, 233failure sequence, 79

clustering, 79, 182, 212grouping, 79, 182, 212

failure warning, 153failure windows, 48false negative, 153, 228false positive, 153, 228false positive rate, 156false warning, 153fault, 10

auditing, 10design, 21detection, 10intermittent, 21monitoring, 10permanent, 21runtime, 21transient, 21

fault injection, 250fault intolerance, 23fault model, 20fault tolerance, 23fault trees, 42feature analysis, 38feature selection, 36first passage time distribution, 103first step analysis, 104

Index 297

forced downtime, 228forward algorithm

HMM, 58HSMM, 101

frequency of error occurrence, 39function approximation-based prediction, 34

curve fitting, 34genetic programming, 35machine learning, 35

furthest neighbor, 83

G-measure, 165generalized EM, 110, 119Gini coefficient, 166growing and pruning, 144

hidden Markov model (HMM), 56basic problems of, 57continuous (CHMM), 56, 67continuous time (CT-HMM), 67discrete, 56

hidden semi-Markov model (HSMM), 68, 95complexity, 128event sequence model (HSMESM), 69expanded state (ESHMM), 69exponentially-distributed durations, 68Ferguson’s model, 68gamma-distributed durations, 68inhomogeneous (IHMM), 70, 116Poisson-distributed durations, 68proof of convergence, 116reestimation formulas, 106segmental, 69structure of, 109topology of, 109Viterbi path constrained durations, 69

hierarchical numbering, 87hybrid modeling approach, 278

incomplete data, 117inductive bias, 278information entropy of logfiles, 89interdependencies, 15intermediate states, 127, 145

jacknife, 170

kernels, 96, 98

label bias problem, 279lead-time ∆tl, 180, 207learning

batch, 25offline, 25online, 25supervised, 25

leave-one-out, 170lift, 166likelihood, 133

logarithmic, 80, 101load lowering, 229logfile

entropy, 89error ID assignment, 178hierarchical numbering, 87tupling, 179type and source, 86

lower bound optimization, 117

m-fold cross validation, 143, 168machine learning, 16marginal, 118margins for non-failure sequences, 180Markov

assumptions, 56properties, 56, 96

Markov renewal sequence, 95kernel, 96

maximum cost, 164maximum span of shortcuts, 199median, 170meta-learning, 278minimal distance methods, 24missing warning, 153mixture of distributions, 98mode, 170model order selection, 144monitoring-based classifiers, 29, 36

Bayesian classifier, 37clustering, 37statistical tests, 37

n-version programming, 4no free lunch theorem, 25noise filtering, 83, 188non-failure sequences, 79non-parametric prediction, 30number of intermediates, 199number of states, 199, 200number of tries in each optimization step, 198

observation probabilities, 56odds ratio, 157

298 Index

online learning, 278oracle, 163out-of-sample, 167, 202overfitting, 5, 140

pairwise alignment, 44parameter setting

greedy, 166non-greedy, 167

parameter tying, 145pattern recognition-based prediction, 43

Markov models, 44pairwise alignment, 44probabilistic context-free grammar, 43

perfect predictor, 164periodic prediction, 53, 217Piatetsky-Shapiro, 166positive, 153posterior probability distribution, 133precision, 154precision recall break-even, 165precision recall curves, 157prediction

overview, 18preparation, 227preventive failover, 229primal-dual method, 117prior, 133proactive downtime minimization, 229proactive fault management, 5, 228probabilistic context-free grammars (PCFG), 43probabilistic wrapper approach (PWA), 36properties of the data set, 221

Q-function, 118quality of logfiles, 89

reactive downtime minimization, 229recall, 154receiver operating characteristics (ROC), 158recovery oriented computing, 5reestimation step, 129regularization, 145, 278rejuvenation, 4, 231reliability model, 244resamples, 170responsive computing, 5roll-backward scheme, 230roll-forward scheme, 230root cause, 10

analysis, 11

rule-based prediction, 41data mining, 41fault trees, 42

sample error rate, 169SAR, 165scaling, 101self-* properties, 5self-testing and repairing computer (STAR), 4self-transitions, 64semi-Markov process (SMP), 67, 95sequence

generation, 25likelihood, 18prediction, 25recognition, 25

sequence extraction, 79sequence likelihood, 57

HSMM, 101sequence prediction, 102sequential decision making, 25sequential pattern mining, 42service degradation, 32SHIP fault model, 23shortcuts, 126, 145signal processing, 39similar events prediction (SEP), 5single linkage, 83singular value decomposition (SVD), 50software aging, 5, 231software components, 15source, 87speech recognition, 113state clean-up, 229state duration, 114statistical confidence, 168statistical methods, 25steady-state availability, 243, 244stratification, 168structure, 109supervised offline batch learning, 212support, 49support vector machines (SVM), 51SVD-SVM, 50symbol, 56, 76symptoms, 10, 40system configuration, 211system model-based prediction, 32

anomaly detectors, 33control theory, 33stochastic, 32

Index 299

temporal encoding, 216temporal output, 66temporal sequence, 15, 63, 115temporal sequence pattern recognition, 53test data, 167test data set, 167time series analysis, 38

feature analysis, 38signal processing, 39time series prediction, 38

time slotting, 64time-varying internal process, 66topology, 109training

overview, 18training data set, 167training with noise, 143transition

duration, 97probability, 97

true negative, 153true positive, 153true positive rate, 156truncation, 77trustworthy computing, 5tupling, 76, 77two dimensional output, 66Type I error, 153Type II error, 153type of background distributions, 198

underfitting, 140universal basis functions (UBF), 219unobservable data, 117

validation, 167validation data set, 167variable selection, 36, 279Viterbi algorithm

HMM, 59HSMM, 101

weighted relative accuracy, 165

Bibliography

[1] Abraham, A. & Grosan, C. Genetic programming approach for fault modeling of elec-tronic hardware. In IEEE Proceedings Congress on Evolutionary Computation (CEC’05),,volume 2, 1563–1569. Edinburgh, UK, 2005

[2] Agrawal, R., Imielinski, T., & Swami, A. Mining association rules between sets of itemsin large databases. In Proceedings of the 1993 ACM SIGMOD international conference onManagement of data (SIGMOD 93), 207–216. ACM Press, 1993

[3] Aitchison, J. & Dunsmore, I. R. Statistical Prediction Analysis. Cambridge UniversityPress, 1975

[4] Albin, S. & Chao, S. Preventive replacement in systems with dependent components. IEEETransactions on Reliability, volume 41(2): 230–238, 1992

[5] Aldenderfer, M. & Blashfield, R. Cluster Analysis. Sage Publications, Inc., Newbury Park(CA,USA), 1984

[6] Alpaydin, E. Introduction To Machine Learning. MIT Press, 2004

[7] Altman, D. G. Practical Statistics for Medical Research. Chapman-Hall, 1991

[8] Altschul, S., Gish, W., Miller, W., Myers, E., & Lipman, D. Basic local alignment searchtool. Journal of Molecular Biology, volume 215(3): 403–410, 1990

[9] Amari, S. & McLaughlin, L. Optimal design of a condition-based maintenance model. InIEEE Proceedings of Reliability and Maintainability Symposium (RAMS), 528–533. 2004

[10] Andrzejak, A. & Silva, L. Deterministic Models of Software Aging and Optimal Reju-venation Schedules. In 10th IEEE/IFIP International Symposium on Integrated NetworkManagement (IM ’07), 159–168. 2007

[11] Apostolico, A. E. D. & Galil, Z. Pattern Matching Algorithms. Oxford University Press,1997

[12] Ascher, H. E., Lin, T.-T. Y., & Siewiorek, D. P. Modification of: Error Log Analysis:Statistical Modeling and Heuristic Trend Analysis. IEEE Transactions on Reliability, vol-ume 41(4): 599–601, 1992

[13] Avižienis, A. Fault-tolerance and fault-intolerance: Complementary approaches to reliablecomputing. In Proceedings of the international conference on Reliable software, 458–464.ACM Press, New York, NY, USA, 1975

[14] Avizienis, A. The N-Version Approach to Fault-Tolerant Software. IEEE Transactions onSoftware Engineering, volume SE-11(12): 1491–1501, 1985

301

302 Bibliography

[15] Avizienis, A., Gilley, G., Mathur, F., Rennels, D., Rohr, J., & Rubin, D. The STAR (Self-Testing And Repairing) Computer: An Investigation of the Theory and Practice of Fault-Tolerant Computer Design. IEEE Transactions on Computers, volume C-20(11): 1312–1321, 1971

[16] Avizienis, A. & Laprie, J.-C. Dependable computing: From concepts to design diversity.Proceedings of the IEEE, volume 74(5): 629–638, 1986

[17] Avižienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. Basic concepts and taxonomy ofdependable and secure computing. IEEE Transactions on Dependable and Secure Comput-ing, volume 1(1): 11–33, 2004

[18] Azimi, M., Nasiopoulos, P., & Ward, R. K. Offline and Online Identification of HiddenSemi-Markov Models. IEEE Transactions on Signal Processing, volume 53(8): 2658–2663,2005

[19] Babaoglu, O., Jelasity, M., Montresor, A., Fetzer, C., Leonardi, S., van Moorsel A., & vanSteen, M. (eds.). Self-Star Properties in Complex Information Systems, Lecture Notes inComputer Science, volume 3460. Springer-Verlag, 2005

[20] Bai, C. G., Hu, Q. P., Xie, M., & Ng, S. H. Software failure prediction based on a MarkovBayesian network model. Journal of Systems and Software, volume 74(3): 275–282, 2005

[21] Bao, Y., Sun, X., & Trivedi, K. Adaptive Software Rejuvenation: Degradation Model andRejuvenation Scheme. In Proceedings of the 2003 International Conference on DependableSystems and Networks (DSN’2003). IEEE Computer Society, 2003

[22] Bao, Y., Sun, X., & Trivedi, K. A workload-based analysis of software aging, and rejuve-nation. IEEE Transactions on Reliability, volume 54(3): 541–548, 2005

[23] Barborak, M., Dahbura, A., & Malek, M. The consensus problem in fault-tolerant comput-ing. ACM Computing Surveys, volume 25(2): 171–220, 1993

[24] Basseville, M. & Nikiforov, I. Detection of abrupt changes: theory and application. Pren-tice Hall, 1993

[25] Baum, L. E. & Sell, G. R. Growth Transformations for Functions on Manifolds. PacificJournal of Mathematics, volume 27(2): 211–227, 1968

[26] Bazaraa, M. S. & Shetty, C. M. Nonlinear Programming. John Wiley and Sons, New York,1979

[27] Berenji, H., Ametha, J., & Vengerov, D. Inductive learning for fault diagnosis. In IEEEProceedings of 12th International Conference on Fuzzy Systems (FUZZ’03), volume 1.2003

[28] Bicego, M., Murino, V., & Figueiredo, M. A. T. A sequential pruning strategy for theselection of the number of states in hidden Markov models. Pattern Recognition Letters,volume 24(9–10): 1395–1407, 2003

[29] Bilmes, J. A. A Gentle Tutorial on the EM Algorithm and its Application to ParameterEstimation for Gaussian Mixture and Hidden Markov Models. Tech. report ICSI-TR-97-021, U.C. Berkeley, International Computer Science Institute, Berkeley, CA, 1998

[30] Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995

Bibliography 303

[31] Bland, J. M. & Altman, D. G. The odds ratio. British Medical Journal, volume 320(7247):1468, 2000

[32] Blischke, W. R. & Murthy, D. N. P. Reliability: Modeling, Prediction, and Optimization.Probability and Statistics. John Wiley and Sons, 2000

[33] Bonafonte, A., Vidal, J., & Nogueiras, A. Duration modeling with expanded HMM appliedto speech recognition. In IEEE Proceedings of the Fourth International Conference onSpoken Language (ICSLP 96), volume 2, 1097–1100. 1996

[34] Borgelt, C. & Kruse, R. Induction of Association Rules: Apriori Implementation. In Pro-ceedings of 15th Conference on Computational Statistics (Compstat 2002). Physica Verlag,Heidelberg, Germany, 2002

[35] Bowles, J. A survey of reliability-prediction procedures for microelectronic devices. IEEETransactions on Reliability, volume 41(1): 2–12, 1992

[36] Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. Time Series Analysis: Forecasting andControl. Prentice Hall, Englewood Cliffs, New Jersey, 3rd edition, 1994

[37] Bridgewater, D. Standardize Messages with the Common Base Event Model. 2004. URLwww-106.ibm.com/developerworks/autonomic/library/ac-cbe1/

[38] Brocklehurst, S. & Littlewood, B. Techniques for Prediction Analysis and Recalibration.In Lyu, M. R. (ed.), Handbook of software reliability engineering, chapter 4, 119–166.McGraw-Hill, 1996

[39] Bronstein, I. N., Semendjajew, K. A., Musiol, G., & Mühlig, H. Taschenbuch der Mathe-matik. Harri Deutsch, Frankfurt am Main, Germany, 6th edition, 2005

[40] Brown, A. & Patterson, D. Embracing Failure: A Case for Recovery-Oriented Computing(ROC). In High Performance Transaction Processing Symposium. 2001

[41] Burckhardt, J. Griechische Kultur. Safari Verlag, Berlin, Germany, 1958

[42] Candea, G. The Enemies of Dependability I: Software. Technical Report CS444a, StanfordUniversity, CA, 2003

[43] Candea, G., Cutler, J., & Fox, A. Improving Availability with Recursive Microreboots: ASoft-State System Case Study. Performance Evaluation Journal, volume 56(1-3), 2004

[44] Candea, G., Delgado, M., Chen, M., & Fox, A. Automatic Failure-Path Inference: AGeneric Introspection Technique for Internet Applications. In Proceedings of the 3rd IEEEWorkshop on Internet Applications (WIAPP). San Jose, CA, 2003

[45] Candea, G., Kiciman, E., Zhang, S., Keyani, P., & Fox, A. JAGR: An Autonomous Self-Recovering Application Server. In Proceedings of the 5th International Workshop on ActiveMiddleware Services. Seattle, WA, USA, 2003

[46] Caruana, R. & Niculescu-Mizil, A. Data mining in metric space: an empirical analysisof supervised learning performance criteria. In Proceedings of the tenth ACM SIGKDDinternational conference on Knowledge discovery and data mining (KDD 04), 69–78. ACMPress, New York, NY, USA, 2004

www-106.ibm.com/developerworks/autonomic/library/ac-cbe1/

304 Bibliography

[47] Cassady, C., Maillart, L., Bowden, R., & Smith, B. Characterization of optimal age-replacement policies. In IEEE Proceedings of Reliability and Maintainability Symposium,170–175. 1998

[48] Cassidy, K. J., Gross, K. C., & Malekpour, A. Advanced Pattern Recognition for Detec-tion of Complex Software Aging Phenomena in Online Transaction Processing Servers. InProceedings of Dependable Systems and Networks (DSN), 478–482. 2002

[49] Castelli, V., Harper, R., P., H., Hunter, S., Trivedi, K., Vaidyanathan, K., & Zeggert, W.Proactive management of software aging. IBM Journal of Research and Development,volume 45(2): 311–332, 2001

[50] Chakravorty, S., Mendes, C., & Kale, L. Proactive fault tolerance in large systems. InHPCRI Workshop in conjunction with HPCA 2005. 2005

[51] Chan, L. M., Comaromi, J. P., Mitchell, J. S., & Satija, M. Dewey Decimal Classification:A Practical Guide. OCLC Forest Press, Albany, N.Y., 2nd edition, 1996

[52] Chen, M., Accardi, A., Lloyd, J., Kiciman, E., Fox, A., Patterson, D., & Brewer, E. Path-based Failure and Evolution Management. In Proceedings of USENIX/ACM Symposium onNetworked Systems Design and Implementation (NSDI). San Francisco, CA, 2004

[53] Chen, M., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E. Pinpoint: Problem Determina-tion in Large, Dynamic Internet Services. In Proceedings of 2002 International Conferenceon Dependable Systems and Networks (DSN), IPDS track, 595–604. IEEE Computer Soci-ety, 2002

[54] Chen, M., Zheng, A., Lloyd, J., Jordan, M., & Brewer, E. Failure diagnosis using decisiontrees. In IEEE Proceedings of International Conference on Autonomic Computing, 36–43.2004

[55] Chen, M.-S., Park, J. S., & Yu, P. S. Efficient Data Mining for Path Traversal Patterns.IEEE Transactions on Knowledge and Data Engineering, volume 10(2): 209–221, 1998.URL citeseer.nj.nec.com/article/chen98efficient.html

[56] Chen, P., Lin, C. J., & Schoelkopf, B. A tutorial on ν-Support Vector Machines. AppliedStochastic Models in Business and Industry, volume 21(2): 111–136, 2005

[57] Cheng, F., Wu, S., Tsai, P., Chung, Y., & Yang, H. Application Cluster Service Scheme forNear-Zero-Downtime Services. In IEEE Proceedings of the International Conference onRobotics and Automation, 4062–4067. 2005

[58] Chiang, F. & Braun, R. Intelligent Network Failure Domain Prediction in ComplexTelecommunication Systems with Hybrid Neural Rough Nets. In The Second InternationalSymposium on Neural Networks (ISNN 2005). Chongqing, China, 2005

[59] Chillarege, R., Bhandari, S., Chaar, J. K., Halliday, M. J., Moebus, D. S., Ray, B. K., &Wong, M.-Y. Orthogonal Defect Classification - A Concept for In-Process Measurements.IEEE Transactions on Software Engineering, volume 18(11): 943–955, 1992

[60] Chillarege, R., Biyani, S., & Rosenthal, J. Measurement of Failure Rate in Widely Dis-tributed Software. In FTCS ’95: Proceedings of the Twenty-Fifth International Symposiumon Fault-Tolerant Computing, 424–432. IEEE Computer Society, 1995

citeseer.nj.nec.com/article/chen98efficient.html

Bibliography 305

[61] Cohen, W. W. Fast effective rule induction. In Proceedings of the Twelfth InternationalConference on Machine Learning, 115–123. 1995

[62] Cole, R., Mariani, J., Uszkoreit, H., Varile, G. B., Zaenen, A., Zampolli, A., & Zue, V.(eds.). Survey of the State of the Art in Human Language Technology. Cambridge UniversityPress and Giardini, 1997

[63] Coleman, D. & Thompson, C. Model Based Automation and Management for the Adap-tive Enterprise. In Proceedings of the 12th Annual Workshop of HP OpenView UniversityAssociation, 171–184. 2005

[64] Comission, I. I. T. (ed.). Dependability and Quality of Service, chapter 191. IEC, 2ndedition, 2002

[65] Cook, A. E. & Russell, M. J. Improved duration modeling in hidden Markov models usingseries-parallel configurations of states. Proc. Inst. Acoust., volume 8: 299–306, 1986

[66] Cover, T. M. Learning in pattern recognition. In Watanabe, S. (ed.), Methodologies ofPattern Recognition, 111–132. Academic Press, 1968

[67] Cox, D. R. & Miller, H. D. The Theory of Stochastic Processes. Chapman and Hall,London, UK, 1st edition, 1965

[68] Cristian, F., Aghili, H., Strong, R., & Dolev, D. Atomic Broadcast: From Simple MessageDiffusion to Byzantine Agreement. In IEEE Proceedings of 15th International Symposiumon Fault Tolerant Computing (FTCS). 1985

[69] Cristian, F., Dancey, B., & Dehn, J. Fault-tolerance in the Advanced Automation System. InIEEE Proceedings of 20th International Symposium on Fault-Tolerant Computing (FTCS-20), 6–17. 1990

[70] Cristianini, N. & Shawe-Taylor, J. An introduction to Support Vector Machines and otherkernel-based learning methods. Cambridge University Press, 2000

[71] Crowell, J., Shereshevsky, M., & Cukic, B. Using fractal analysis to model software aging.Technical report, West Virginia University, Lane Department of CSEE, Morgantown, WV,2002

[72] Csenki, A. Bayes Predictive Analysis of a Fundamental Software Reliability Model. IEEETransactions on Reliability, volume 39(2): 177–183, 1990

[73] Daidone, A., Di Giandomenico, F., Bondavalli, A., & Chiaradonna, S. Hidden MarkovModels as a Support for Diagnosis: Formalization of the Problem and Synthesis of theSolution. In IEEE Proceedings of the 25th Symposium on Reliable Distributed Systems(SRDS 2006). Leeds, UK, 2006

[74] Dalgaard, P. Introductory Statistics with R. Springer, 2002

[75] Dempster, A., Laird, N., & Rubin, D. Maximum-Likelihood from incomplete data via theEM algorithm. Journal of the Royal Statistical Society, volume 39(1): 1–38, 1977

[76] Dennis, J. E. J. & Moré, J. J. Quasi-Newton Methods, Motivation and Theory. SIAMReview, volume 19(1): 46–89, 1977

[77] Denson, W. The history of reliability prediction. IEEE Transactions on Reliability, vol-ume 47(3): 321–328, 1998

306 Bibliography

[78] Discenzo, F., Unsworth, P., Loparo, K., & Marcy, H. Self-diagnosing intelligent motors: akey enabler for nextgeneration manufacturing systems. In IEE Colloquium on Intelligentand Self-Validating Sensors. 1999

[79] Dohi, T., Goseva-Popstojanova, K., & Trivedi, K. S. Analysis of Software Cost Modelswith Rejuvenation. In Proceedings of IEEE Intl. Symposium on High Assurance SystemsEngineering, HASE 2000. 2000

[80] Dohi, T., Goseva-Popstojanova, K., & Trivedi, K. S. Statistical Non-Parametric Algorihmsto Estimate the Optimal Software Rejuvenation Schedule. In Proceedings of the Pacific RimInternational Symposium on Dependable Computing (PRDC 2000). 2000

[81] Domeniconi, C., Perng, C.-S., Vilalta, R., & Ma, S. A Classification Approach for Pre-diction of Target Events in Temporal Sequences. In Elomaa, T., Mannila, H., & Toivonen,H. (eds.), Proceedings of the 6th European Conference on Principles of Data Mining andKnowledge Discovery (PKDD’02), LNAI, volume 2431, 125–137. Springer-Verlag, Heidel-berg, 2002

[82] Domingos, P. A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. InProceedings of the Seventeenth National Conference on Artificial Intelligence, 564–569.2000

[83] Drummond, C. & Holte, R. C. Explicitly representing expected cost: an alternative toROC representation. In Proceedings of the sixth ACM SIGKDD international conferenceon Knowledge discovery and data mining (KDD’00), 198–207. ACM Press, New York, NY,USA, 2000

[84] Duda, R. O. & Hart, P. E. Pattern classification and scene analysis. John Wiley and Sons,New York, London, Sydney, Toronto, 1973

[85] Duda, R. O., Hart, P. E., & Stork, D. G. Pattern Classification. Wiley-Interscience, 2ndedition, 2000

[86] Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. Biological sequence analysis: proba-bilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK,1998

[87] Efron, B. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics,volume 7(1): 1–26, 1979

[88] Egan, J. P. Signal detection theory and ROC analysis. Academic Press New York, 1975

[89] Elbaum, S., Kanduri, S., & Amschler, A. Anomalies as precursors of field failures. InIEEE Proceedings of the 14th International Symposium on Software Reliability Engineering(ISSRE 2003), 108–118. 2003

[90] Elliott, R. J., Aggoun, L., & Moore, J. B. Hidden Markov Models: Estimation and Control,Stochastic Modelling and Applied Probability, volume 29. Springer Verlag, 1st edition,1995

[91] Elnozahy, E. N., Alvisi, L., Wang, Y., & Johnson, D. A survey of rollback-recovery pro-tocols in message-passing systems. ACM Computing Surveys, volume 34(3): 375–408,2002

Bibliography 307

[92] Esary, J. D. & Proschan, F. The Reliability of Coherent Systems. In Wilcox & Mann (eds.),Redundancy Techniques for Computing Systems, 47–61. Spartan Books, Washington, DC,1962

[93] Faisan, S., Thoraval, L., Armspach, J., & Heitz, F. Unsupervised Learning and Mappingof Brain fMRI Signals Based on Hidden Semi-Markov Event Sequence Models. In Goos,G., Hartmanis, J., & van Leeuwen, J. (eds.), Medical Image Computing and Computer-Assisted Intervention (MICCAI 2003), Lecture Notes in Computer Science, volume 2879,75–82. Springer, 2003

[94] Farr, W. Software Reliability Modeling Survey. In Lyu, M. R. (ed.), Handbook of softwarereliability engineering, chapter 3, 71–117. McGraw-Hill, 1996

[95] Fawcett, T. ROC graphs: notes and practical considerations for data mining researchers.Technical Report 2003-4, HP Laboratories, Palo Alto, CA, USA, 2003

[96] Ferguson, J. Variable duration models for speech. In Proceedings of the Symposium on theApplication of HMMs to Text and Speech, 143–179. 1980

[97] Flach, P. The geometry of ROC space: understanding machine learning metrics throughROC isometrics. In Proceedings of the 20th International Conference on Machine Learning(ICML’03), 194–201. AAAI Press, 2003

[98] Friedman, J. H. On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Data Min-ing and Knowledge Discovery, volume 1(1): 55–77, 1997

[99] Fu, S. & Xu, C.-Z. Quantifying Temporal and Spatial Fault Event Correlation for ProactiveFailure Management. In IEEE Proceedings of Symposium on Reliable and DistributedSystems (SRDS 07). 2007

[100] Garg, S., van Moorsel, A., Vaidyanathan, K., & Trivedi, K. S. A Methodology for Detectionand Estimation of Software Aging. In Proceedings of the 9th International Symposium onSoftware Reliability Engineering, ISSRE 1998. 1998

[101] Garg, S., Puliafito, A., Telek, M., & Trivedi, K. Analysis of Preventive Maintenance inTransactions Based Software Systems. IEEE Trans. Comput., volume 47(1): 96–107, 1998

[102] Ge, X. Segmental semi-Markov models and applications to sequence analysis. Ph.D. thesis,University of California, Irvine, 2002. Chair-Padhraic Smyth

[103] Gellert, W., Küstner, H., Hellwig, M., & Kästner, H. (eds.). Kleine Enzyklopädie Mathe-matik. VEB Bibliographisches Institut, Leipzig, Germany, 1965

[104] Geman, S., Bienenstock, E., & Doursat, R. Neural networks and the bias/variance dilemma.Neural Computation, volume 4(1): 1–58, 1992

[105] Gertsbakh, I. Reliability Theory: with Applications to Preventive Maintenance. Springer-Verlag, Berlin, Germany, 2000

[106] Goldberg, D. E. Genetic Algorithms in Search, Optimization, and Machine Learning. Ad-dison Wesley, 1989

[107] Gray, J. Why do computers stop and what can be done about it? In Proceedings ofSymposium on Reliability in Distributed Software and Database Systems (SRDS-5), 3–12.IEEE CS Press, Los Angeles, CA, 1986

308 Bibliography

[108] Gray, J. A census of tandem system availability between 1985 and 1990. IEEE Transactionson Reliability, volume 39(4): 409–418, 1990

[109] Gray, J. & Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kauf-mann Publishers Inc., San Francisco, CA, USA, 1992

[110] Gross, K. C., Bhardwaj, V., & Bickford, R. Proactive Detection of Software Aging Mech-anisms in Performance Critical Computers. In SEW ’02: Proceedings of the 27th AnnualNASA Goddard Software Engineering Workshop (SEW-27’02). IEEE Computer Society,Washington, DC, USA, 2002

[111] Gujrati, P., Li, Y., Lan, Z., Thakur, R., & White, J. A Meta-Learning Failure Predictorfor Blue Gene/L Systems. In IEEE proceedings of International Conference on ParallelProcessing (ICPP 2007). 2007

[112] Hamerly, G. & Elkan, C. Bayesian approaches to failure prediction for disk drives. InProceedings of the Eighteenth International Conference on Machine Learning, 202–209.Morgan Kaufmann Publishers Inc., 2001

[113] Hamming, W. R. Error Detecting and Error Correcting Codes. Bell Systems TechnicalJournal, volume 29(2): 147–160, 1950

[114] Hansen, J. & Siewiorek, D. Models for time coalescence in event logs. In IEEE Proceedingsof International Symposium on Fault-Tolerant Computing (FTCS-22), 221–227. 1992

[115] Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning: DataMining, Inference, and Prediction. Springer Series in Statistics. Springer Verlag, 2001

[116] Hätönen, K., Klemettinen, M., Mannila, H., Ronkainen, P., & Toivonen, H. TASA: Telecom-munication Alarm Sequence Analyzer, or: How to enjoy faults in your network. In IEEEProceedings of Network Operations and Management Symposium, volume 2, 520 – 529.Kyoto, Japan, 1996

[117] Hellerstein, J. L., Zhang, F., & Shahabuddin, P. An approach to predictive detection forservice management. In IEEE Proceedings of Sixth International Symposium on IntegratedNetwork Management, 309–322. 1999

[118] Herodot. Historien. Kröner Verlag, Stuttgart, Germany, 1971

[119] Hestenes, M. R. & Stiefel, E. Methods of conjugate gradients for solving linear systems.Journal. Research of the National Bureau of Standards, volume 49(6): 409–436, 1952

[120] Hoffmann, G. A. Failure Prediction in Complex Computer Systems: A Probabilistic Ap-proach. Shaker Verlag, 2006

[121] Hoffmann, G. A. & Malek, M. Call Availability Prediction in a Telecommunication Sys-tem: A Data Driven Empirical Approach. In Proceedings of the 25th IEEE Symposium onReliable Distributed Systems (SRDS 2006). Leeds, United Kingdom, 2006

[122] Hoffmann, G. A., Trivedi, K. S., & Malek, M. A Best Practice Guide to Resource Fore-casting for Computing Systems. IEEE Transactions on Reliability, volume 56(4): 615–628,2007

[123] Horn, P. Autonomic Computing: IBM’s perspective on the State of Information Technol-ogy. 2001. URL http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf

http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf

http://www.research.ibm.com/autonomic/manifesto/autonomic_computing.pdf

Bibliography 309

[124] Hotelling, H. Analysis of a complex of statistical variables into principal components.Journal of Educational Psychology, volume 24: 417–441, 1933

[125] Huang, X., Acero, A., & Hon, H.-W. Spoken Language Processing: A Guide to Theory,Algorithm, and System Development. Prentice Hall, Upper Saddle River, NJ, USA, 2001

[126] Huang, Y., Kintala, C., Kolettis, N., & Fulton, N. Software Rejuvenation: Analysis, Moduleand Applications. In Proceedings of IEEE Intl. Symposium on Fault Tolerant Computing,FTCS 25. 1995

[127] Hughes, G., Murray, J., Kreutz-Delgado, K., & Elkan, C. Improved disk-drive failurewarnings. IEEE Transactions on Reliability, volume 51(3): 350–357, 2002

[128] Hughey, R. & Krogh, A. Hidden Markov models for sequence analysis: extension andanalysis of the basic method. CABIOS, volume 12(2): 95–107, 1996

[129] Iyer, R. & Rosetti, D. A statistical load dependency of CPU errors at SLAC. In IEEEProceedings of 12th International Symposium on Fault Tolerant Computing (FTCS-12).1982

[130] Iyer, R. K., Young, L. T., & Iyer, P. K. Automatic Recognition of Intermittent Failures:An Experimental Study of Field Data. IEEE Transactions on Computers, volume 39(4):525–537, 1990

[131] Iyer, R. K., Young, L. T., & Sridhar, V. Recognition of error symptoms in large systems.In Proceedings of 1986 ACM Fall joint computer conference, 797–806. IEEE ComputerSociety Press, Los Alamitos, CA, USA, 1986

[132] Jelinski, Z. & Moranda, P. Software reliability research. In Freiberger, W. (ed.), Statisticalcomputer performance evaluation. Academic Press, 1972

[133] Jensen, J. L. W. V. Sur les fonctions convexes et les inégalités entre les valeurs moyennes.Acta Mathematica, volume 30(1): 175–193, 1906

[134] Jiménez, D. A. & Lin, C. Neural methods for dynamic branch prediction. ACM Transac-tions on Computer Systems, volume 20(4): 369–397, 2002

[135] Joachims, T. Making large-scale SVM Learning Practical. In Schölkopf, B., Burges, C., &A., S. (eds.), Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999

[136] Joseph, D. & Grunwald, D. Prefetching Using Markov Predictors. IEEE Transactions onComputers, volume 48(2): 121–133, 1999

[137] Juang, B. H., Levinson, S. E., & Sondhi, M. M. Maximum Likelihood Estimation forMultivariate Mixture Observations of Markov Chains. IEEE Transactions on InformationTheory, volume 32(2): 307–309, 1986

[138] Juang, B.-H. & Rabiner, L. The segmental K-means algorithm for estimating parameters ofhidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing,volume 38(9): 1639–1641, 1990

[139] Kajko-Mattson, M. Can We Learn Anything from Hardware Preventive Maintenance? InICECCS ’01: Proceedings of the Seventh International Conference on Engineering of Com-plex Computer Systems, 106–111. IEEE Computer Society, 2001

310 Bibliography

[140] Kalman, R. E. & Bucy, R. S. New results in linear filtering and prediction theory. Transac-tions of the ASME, Series D, Journal of Basic Engineering, volume 83: 95–107, 1961

[141] Kapadia, N. H., Fortes, J. A. B., & Brodley, C. E. Predictive application-performance mod-eling in a computational gridenvironment. In IEEE Procedings of the eighth InternationalSymposium on High Performance Distributed Computing, 47–54. 1999

[142] Kaufman, L. & Rousseeuw, P. J. Finding Groups in Data. John Wiley and Sons, New York,1990

[143] Kelly, J. P. J., Avizienis, A., Ulery, B. T., Swain, B. J., Lyu, M. R., Tai, A., & Tso, K. S.Multi-Version Software Development. In Proceedings IFAC Workshop SAFECOMP’86,43–49. Sarlat, France, 1986

[144] Kiciman, E. & Fox, A. Detecting application-level failures in component-based Internetservices. IEEE Transactions on Neural Networks, volume 16(5): 1027–1041, 2005

[145] Kim, W.-G., Choi, J.-Y., & Youn, D. H. HMM with global path constraint in Viterbi de-coding for isolatedword recognition. In IEEE Proceedings of International Conference onAcoustics, Speech, and Signal Processing (ICASSP-94), volume 1, 605–608. 1994

[146] Kohavi, R. & Provost, F. Glossary of terms. Machine Learning, volume 30(2/3): 271–274,1998

[147] Korbicz, J., Koscielny, J. M., Kowalczuk, Z., & Cholewa, W. (eds.). Fault Diagnosis:Models, Artificial Intelligence, Applications. Springer Verlag, 2004

[148] Krus, D. J. & Fuller, E. A. Computer Assisted Multicrossvalidation in Regression Analysis.Educational and Psychological Measurement, volume 42(1): 187–193, 1982

[149] Kulkarni, V. G. Modeling and Analysis of Stochastic Systems. Chapman and Hall, London,UK, 1st edition, 1995

[150] Kumar, D. & Westberg, U. Maintenance scheduling under age replacement policy usingproportional hazards model and TTT-plotting. European Journal of Operational Research,volume 99(3): 507–515, 1997

[151] Kurtz, A. K. A research test of Rorschach test. Personnel Psychology, volume 1: 41–53,1948

[152] Lafferty, J., McCallum, A., & Pereira, F. Conditional Random Fields: Probabilistic Modelsfor Segmenting and Labeling Sequence Data. In Proc. 18th International Conf. on MachineLearning, 282–289. Morgan Kaufmann, San Francisco, CA, 2001. URL citeseer.ist.psu.edu/article/lafferty01conditional.html

[153] Lal, R. & Choi, G. Error and Failure Analysis of a UNIX Server. In IEEE Proceedingsof third International High-Assurance Systems Engineering Symposium (HASE), 232–239.IEEE Computer Society Washington, DC, USA, 1998

[154] Lance, G. N. & Williams, W. T. A general theory of classificatory sorting strategies, 1.Hierarchical Systems. The Computer Journal, volume 9(4): 373–380, 1967

[155] Laprie, J.-C. & Kanoun, K. Software Reliability and System Reliability. In Lyu, M. R. (ed.),Handbook of software reliability engineering, chapter 2, 27–69. McGraw-Hill, 1996

citeseer.ist.psu.edu/article/lafferty01conditional.html

citeseer.ist.psu.edu/article/lafferty01conditional.html

Bibliography 311

[156] Laranjeira, L., Malek, M., & Jenevein, R. On tolerating faults in naturally redundantalgorithms. In IEEE Proceedings of Tenth Symposium on Reliable Distributed Systems(SRDS),, 118–127. 1991

[157] Leangsuksun, C., Liu, T., Rao, T., Scott, S., & Libby, R. A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster. In The5th LCI International Conference on Linux Clusters: The HPC Revolution, 18–20. 2004

[158] Leangsuksun, C., Shen, L., Liu, T., Song, H., & Scott, S. Availability prediction and mod-eling of high mobility OSCAR cluster. In IEEE Proceedings of International Conference onCluster Computing, 380–386. 2003

[159] Lee, I. & Iyer, R. K. Software dependability in the Tandem GUARDIAN system. IEEETransactions on Software Engineering, volume 21(5): 455–467, 1995

[160] Legg, S. Is There an Elegant Universal Theory of Prediction? In Algorithmic LearningTheory, Lecture Notes in Computer Science, volume 4264, 274–287. Springer Verlag, 2006

[161] Levinson, S. E. Continuously variable duration hidden Markov models for automaticspeech recognition. Computer Speech and Language, volume 1(1): 29–45, 1986

[162] Levy, D. & Chillarege, R. Early Warning of Failures through Alarm Analysis - A CaseStudy in Telecom Voice Mail Systems. In ISSRE ’03: Proceedings of the 14th InternationalSymposium on Software Reliability Engineering. IEEE Computer Society, Washington, DC,USA, 2003

[163] Li, L., Vaidyanathan, K., & Trivedi, K. S. An Approach for Estimation of Software Aging ina Web Server. In Proceedings of the Intl. Symposium on Empirical Software Engineering,ISESE 2002. Nara, Japan, 2002

[164] Li, Y. & Lan, Z. Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Com-puting. In IEEE Proceedings of the Sixth International Symposium on Cluster Computingand the Grid (CCGRID’ 06), 531–538. IEEE Computer Society, Los Alamitos, CA, USA,2006

[165] Liang, Y., Zhang, Y., Sivasubramaniam, A., Jette, M., & Sahoo, R. BlueGene/L FailureAnalysis and Prediction Models. In IEEE Proceedings of the International Conference onDependable Systems and Networks (DSN 2006), 425–434. 2006

[166] Lin, T.-T. Y. Design and evaluation of an on-line predictive diagnostic system. Ph.D.thesis, Department of Electrical and Computer Engineering, Carnegie-Mellon University,Pittsburgh, PA, 1988

[167] Lin, T.-T. Y. & Siewiorek, D. P. Error log analysis: statistical modeling and heuristic trendanalysis. IEEE Transactions on Reliability, volume 39(4): 419–432, 1990

[168] Liporace, L. A. Maximum Likelihood Estimation for Multivariate Observations of MarkovSources. IEEE Transactions on Information Theory, volume 28(5): 729–734, 1982

[169] Lunze, J. Automatisierungstechnik. Oldenbourg, 1st edition, 2003

[170] Lyu, M. R. (ed.). Handbook of Software Reliability Engineering. McGraw-Hill, 1996

[171] Magedanz, T. & Popescu-Zeletin, R. Intelligent networks: basic technology, standards andevolution. Internat. Thomson Computer Press, London, UK, 1996

312 Bibliography

[172] Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. Performance Measures for In-formation Extraction. In Proceedings of DARPA Broadcast News Workshop. Herndon, VA,1999

[173] Malek, M. Responsive Systems: The challenge for the nineties. Microprocessing andMicroprogramming, volume 30: 9–16, 1990

[174] Malek, M. Personal communication. 2007

[175] Manning, C. D. & Schütze, H. Foundations of Statistical Natural Language Processing.The MIT Press, Cambridge, Massachusetts, 1999

[176] Marciniak, A. & Korbicz, J. Pattern Recognition Approach to Fault Diagnostics. In Kor-bicz, J., Koscielny, J. M., Kowalczuk, Z., & Cholewa, W. (eds.), Fault Diagnosis: Models,Artificial Intelligence, Applications, chapter 14, 557–590. Springer Verlag, 2004

[177] Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. The DET curve inassessment of detection task performance. In Proceedings of the 5th European Conferenceon Speech Communication and Technology, volume 4, 1895–1898. 1997

[178] Marzbana, C. & Stumpf, G. J. A Neural Network for Damaging Wind Prediction. Weatherand Forecasting, volume 13(1): 151–163, 1998

[179] Max Planck Institute for Molecular Genetics. General Hidden Markov Model library. 2007.URL http://www.ghmm.org, date: 06-12-07

[180] Melliar-Smith, P. M. & Randell, B. Software reliability: The role of programmed exceptionhandling. SIGPLAN Not., volume 12(3): 95–100, 1977

[181] Minka, T. Expectation-Maximization as lower bound maximization. Tutorial publishedon the web at http://research.microsoft.com/users/minka/papers/minka-em-tut.ps.gz, 1998

[182] Mitchell, C., Harper, M., & Jamieson, L. On the Complexity of Explicit Duration HMM’s.IEEE Transactions on Speech and Audio Processing, volume 3(3): 213–217, 1995

[183] Mitchell, C. & Jamieson, L. Modeling duration in a hidden Markov model with the expo-nential family. In IEEE Proceedings of the International Conference on Acoustics, Speech,and Signal Processing (ICASSP-93), volume 2, 331–334. 1993

[184] Mitchell, T. M. Machine Learning. McGraw-Hill, international edition 1997 edition, 1997

[185] Mojena, R. Hierarchical grouping methods and stopping rules: An evaluation. The Com-puter Journal, volume 20(4): 359–363, 1977

[186] Moll, K. D. & Luebbert, G. M. Arms Race and Military Expenditure Models: A Review.The Journal of Conflict Resolution, volume 24(1): 153–185, 1980

[187] Moore, D. S. & McCabe, G. P. Introduction to the Practice of Statistics. W. H. Freeman &Co., New York, NY, USA, 5th edition, 2006

[188] Mundie, C., de Vries, P., Haynes, P., & Corwine, M. Trustworthy Computing. Technicalreport, Microsoft Corp., 2002. URL http://www.microsoft.com/mscorp/twc/twc_whitepaper.mspx

http://www.ghmm.org

http://research.microsoft.com/users/minka/papers/minka-em-tut.ps.gz

http://research.microsoft.com/users/minka/papers/minka-em-tut.ps.gz

http://www.microsoft.com/mscorp/twc/twc_whitepaper.mspx

http://www.microsoft.com/mscorp/twc/twc_whitepaper.mspx

Bibliography 313

[189] Musa, J. D., Iannino, A., & Okumoto, K. Software Reliability: Measurement, Prediction,Application. McGraw-Hill, 1987

[190] Nassar, F. A. & Andrews, D. M. A Methodology for Analysis of Failure Prediction Data.In IEEE Real-Time Systems Symposium, 160–166. 1985

[191] Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for sim-ilarities in the amino acid sequence of two proteins. Journal of Molecular Biology, vol-ume 48(3): 443–53, 1970

[192] von Neumann, J. Probabilistic Logics and the Synthesis of Reliable Organisms from Un-reliable Components. In Shannon, C. & McCarthy, J. (eds.), Automata Studies, 43–98.Princeton University Press, Princeton, 1956

[193] Neville, S. W. Approaches for Early Fault Detection in Large Scale Engineering Plants.Ph.D. thesis, University of Victoria, 1998

[194] Ning, M. H., Yong, Q., Di, H., Ying, C., & Zhong, Z. J. Software Aging Prediction ModelBased on Fuzzy Wavelet Network with Adaptive Genetic Algorithm. In 18th IEEE Inter-national Conference on Tools with Artificial Intelligence (ICTAI’06), 659–666. IEEE Com-puter Society, Los Alamitos, CA, USA, 2006

[195] Noll, A. & Ney, H. Training of phoneme models in a sentence recognition system. InIEEE Proceedings of International Conference on Acoustics, Speech, and Signal Process-ing (ICASSP ’87), volume 12, 1277–1280. 1987

[196] Ogle, D., Kreger, H., Salahshour, A., Cornpropst, J., Labadie, E., Chessell, M., Horn,B., & Gerken, J. Canonical Situation Data Format: The Common Base Event. IBMSpecification ACAB.BO0301.1.1, 2003. URL http://xml.coverpages.org/IBMCommonBaseEventV111.pdf

[197] Oliner, A. & Sahoo, R. Evaluating cooperative checkpointing for supercomputing systems.In IEEE Proceedings of 20th International Parallel and Distributed Processing Symposium(IPDPS 2006). 2006

[198] Parnas, D. L. Software aging. In IEEE Proceedings of the 16th international conference onSoftware engineering (ICSE ’94), 279–287. IEEE Computer Society Press, Los Alamitos,CA, USA, 1994

[199] Pawlak, Z., Wong, S. K. M., & Ziarko, W. Rough sets: Probabilistic versus deterministicapproach. International Journal of Man-Machine Studies, volume 29: 81–95, 1988

[200] Pena, J. M., Létourneau, S., & Famili, F. Application of Rough Sets Algorithms to Predic-tion of Aircraft Component Failure. In Advances in Intelligent Data Analysis: Third In-ternational Symposium (IDA-99), LNCS, volume 1642. Springer Verlag, Amsterdam, TheNetherlands, 1999

[201] Pepe, M. S., Janes, H., Longton, G., Leisenring, W., & Newcomb, P. Limitations of theOdds Ratio in Gauging the Performance of a Diagnostic, Prognostic, or Screening Marker.American Journal of Epidemiology, volume 159(9): 882–890, 2004

[202] Petsche, T., Marcantonio, A., Darken, C., Hanson, S. J., Kuhn, G. M., & Santoso, I. A Neu-ral Network Autoassociator for Induction Motor Failure Prediction. In Touretzky, D. S.,

http://xml.coverpages.org/IBMCommonBaseEventV111.pdf

http://xml.coverpages.org/IBMCommonBaseEventV111.pdf

314 Bibliography

Mozer, M. C., & Hasselmo, M. E. (eds.), Advances in Neural Information Processing Sys-tems, volume 8, 924–930. The MIT Press, 1996. URL citeseer.ist.psu.edu/petsche96neural.html

[203] Pfefferman, J. & Cernuschi-Frias, B. A nonparametric nonstationary procedure for failureprediction. IEEE Transactions on Reliability, volume 51(4): 434–442, 2002

[204] Pielke, R. Mesoscale Meteorological Modeling, International Geophysics, volume 78. El-sevier, 2nd edition, 2001

[205] Pizza, M., Strigini, L., Bondavalli, A., & Di Giandomenico, F. Optimal Discriminationbetween Transient and Permanent Faults. In IEEE Proceedings of Third International High-Assurance Systems Engineering Symposium (HASE’98), 214–223. IEEE Computer Society,Los Alamitos, CA, USA, 1998

[206] Pylkkönen, J. Phone Duration Modeling Techniques in Continuous Speech Recognition.Master’s thesis, Helsinki University of Technology, Department of Computer Science andEngineering, Laboratory of Computer and Information Science, 2004

[207] Quenouille, M. H. Notes on Bias in Estimation. Biometrika, volume 43(3/4): 353–360,1956

[208] Quinlan, J. Learning logical definitions from relations. Machine Learning, volume 5(3):239–266, 1990

[209] Quinlan, J. C4. 5: Programs for Machine Learning. Morgan Kaufmann, 1993

[210] Rabiner, L. R. A Tutorial on Hidden Markov Models and Selected Applications in SpeechRecognition. Proceedings of the IEEE, volume 77(2): 257–286, 1989

[211] Ramesh, P. & Wilpon, J. G. Modeling state durations in hidden Markov models for au-tomatic speech recognition. In IEEE International Conference on Acoustics, Speech, andSignal Processing (ICASSP-92), volume 1, 381–384. 1992

[212] Randell, B. System structure for software fault tolerance. IEEE Transactions on SoftwareEngineering, volume 1(2): 220–232, 1975

[213] Randell, B., Lee, P., & Treleaven, P. C. Reliability Issues in Computing System Design.ACM Computing Survey, volume 10(2): 123–165, 1978

[214] van Rijsbergen, C. J. Information Retrieval. Butterworth, London, 2nd edition, 1979

[215] Rousseeuw, P. J. A visual display for hierarchical classification. In Diday, E., Escoufier, Y.,Lebart, L., Pagès, J., Schektman, Y., & Tomassone, R. (eds.), Data Analysis and InformaticsIV, 743–748. North-Holland, Amsterdam, 1986

[216] Rovnyak, S., Kretsinger, S., Thorp, J., & Brown, D. Decision trees for real-time transientstability prediction. IEEE Transactions on Power Systems, volume 9(3): 1417–1426, 1994

[217] Russell, M. A segmental HMM for speech pattern modelling. In IEEE Proceedings of Inter-national Conference on Acoustics, Speech, and Signal Processing (ICASSP-93), volume 2,499–502. 1993

[218] Russell, M. & Cook, A. Experimental evaluation of duration modelling techniques for au-tomatic speech recognition. In IEEE Proceedings of International Conference on Acoustics,Speech, and Signal Processing (ICASSP ’87), volume 12, 2376–2379. 1987

citeseer.ist.psu.edu/petsche96neural.html

citeseer.ist.psu.edu/petsche96neural.html

Bibliography 315

[219] Russell, M. J. & Moore, R. K. Explicit Modelling of State Occupancy in Hidden MarkovModels for Automatic Speech Recognition. In IEEE Proceedings of Int. Conf. on Acoustics,Speech and Signal Processing, 5–8. 1985

[220] Sahoo, R. K., Oliner, A. J., Rish, I., Gupta, M., Moreira, J. E., Ma, S., Vilalta, R., &Sivasubramaniam, A. Critical Event Prediction for Proactive Management in Large-scaleComputer Clusters. In Proceedings of the ninth ACM SIGKDD international conference onKnowledge discovery and data mining (KDD ’03), 426–435. ACM Press, 2003

[221] Saks, S. Theory of the Integral. G. E. Stechert & Co, New York, USA, 1937

[222] Salakhutdinov, R., Roweis, S., & Ghahramani, Z. Expectation-Conjugate Gradient: AnAlternative to EM. IEEE Signal Processing Letters, volume 11(7), 2004

[223] Salfner, F. Predicting Failures with Hidden Markov Models. In Proceedings of 5th Eu-ropean Dependable Computing Conference (EDCC-5), 41–46. Budapest, Hungary, 2005.Student forum volume

[224] Salfner, F., Hoffmann, G. A., & Malek, M. Prediction-Based Software Availability En-hancement. In Babaoglu, O., Jelasity, M., Montresor, A., Fetzer, C., Leonardi, S., vanMoorsel A., & van Steen, M. (eds.), Self-Star Properties in Complex Information Systems,Lecture Notes in Computer Science, volume 3460. Springer-Verlag, 2005

[225] Salfner, F. & Malek, M. Proactive Fault Handling for System Availability Enhancement. InIEEE Proceedings of the 19th International Parallel and Distributed Processing Symposium(IPDPS’05) - Workshop 16 IEEE Proceedings, DPDNS Workshop. Denver, CO, 2005

[226] Salfner, F., Schieschke, M., & Malek, M. Predicting Failures of Computer Systems: ACase Study for a Telecommunication System. In Proceedings of IEEE International Paralleland Distributed Processing Symposium (IPDPS 2006), DPDNS workshop. Rhodes Island,Greece, 2006

[227] Salfner, F., Tschirpke, S., & Malek, M. Comprehensive Logfiles for Autonomic Sys-tems. In IEEE Proceedings of International Parallel and Distributed Processing Symposium(IPDPS), Workshop on Fault-Tolerant Parallel, Distributed and Network-Centric Systems(FTPDS). IEEE Computer Society, Santa Fe, New Mexico, USA, 2004

[228] Salvador, S. & Chan, P. Determining the number of clusters/segments in hierarchical clus-tering/segmentation algorithms. In IEEE Proceedings of 16th International Conference onTools with Artificial Intelligence (ICTAI 2004), 576–584. 2004

[229] Salvo Rossi, P., Romano, G., Palmieri, F., & Iannello, G. A hidden Markov model forInternet channels. In IEEE Proceedings of the 3rd International Symposium on SignalProcessing and Information Technology (ISSPIT 2003), 50–53. 2003

[230] Schlittgen, R. Einführung in die Statistik: Analyse und Modellierung von Daten.Oldenbourg-Wissenschaftsverlag, München, Wien, 9th edition, 2000

[231] Schölkopf, B., Smola, A. J., Williamson, R. C., & Bartlett, P. L. New Support VectorAlgorithms. Neural Computation, volume 12(5): 1207–1245, 2000

[232] Scott, D. Making Smart Investments to Reduce Unplanned Downtime. Technical ReportTactical Guidelines, TG-07-4033, GartnerGroup RAS Services, 1999

316 Bibliography

[233] Sen, P. K. Estimates of the Regression Coefficient Based on Kendall’s Tau. Journal of theAmerican Statistical Association, volume 63(324): 1379–1389, 1968

[234] Sfetsos, A. Short-term load forecasting with a hybrid clustering algorithm. IEE Proceed-ings of Generation, Transmission and Distribution, volume 150(3): 257–262, 2003

[235] Shannon, C. A Mathematical Theory of Communication. The Bell System Technical Jour-nal, volume 27: 379–423,623–656, 1948

[236] Shao, J. Linear Model Selection by Cross-Validation. Journal of the American StatisticalAssociation, volume 88(422): 486–494, 1993

[237] Shawe-Taylor, J. & Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge Uni-versity Press, 2004

[238] Shereshevsky, M., Crowell, J., Cukic, B., Gandikota, V., & Liu, Y. Software aging andmultifractality of memory resources. In Proceedings of the International Conference onDependable Systems and Networks (DSN 2003), 721–730. IEEE Computer Society, SanFrancisco, CA, USA, 2003

[239] Shewchuk, J. An introduction to the conjugate gradient method without the agonizing pain.Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh PA,USA, 1994

[240] Shi, X. & Manduchi, R. Invariant operators, small samples, and the bias-variance dilemma.In IEEE Proceedings of the Conference on Computer Vision and Pattern Recognition(CVPR 2004), volume 2. 2004

[241] Siewiorek, D. P. & Swarz, R. S. Reliable Computer Systems. Digital Press, Bedford, MA,2nd edition, 1992

[242] Silva, J. G. & Madeira, H. Experimental Dependability Evaluation. In Diab, H. B. &Zomaya, A. Y. (eds.), Dependable Computing Systems, chapter 12, 327–355. John Wiley &Sons, 2005

[243] Singer, R. M., Gross, K. C., Herzog, J. P., King, R. W., & Wegerich, S. Model-Based Nu-clear Power Plant Monitoring and Fault Detection: Theoretical Foundations. In Proceed-ings of Intelligent System Application to Power Systems (ISAP 97), 60–65. Seoul, Korea,1997

[244] Smith, T. & Waterman, M. Identification of Common Molecular Subsequences. Journal ofMolecular Biology, volume 147: 195–197, 1981

[245] Smyth, P. Clustering Using Monte Carlo Cross-Validation. In ACM proceedings of Knowl-edge Discovery and Data Mining (KDD 1996), 126–133. 1996

[246] Smyth, P. Clustering Sequences with Hidden Markov Models. In Mozer, M. C., Jordan,M. I., & Petsche, T. (eds.), Advances in Neural Information Processing Systems, volume 9,648. The MIT Press, 1997

[247] Solomonoff, R. J. A Formal Theory of Inductive Inference, Part 1. Information and Control,volume 7(1): 1–22, 1964

[248] Solomonoff, R. J. A Formal Theory of Inductive Inference, Part 2. Information and Control,volume 7(2): 224–254, 1964

Bibliography 317

[249] Srikant, R. & Agrawal, R. Mining Sequential Patterns: Generalizations and PerformanceImprovements. In Apers, P. M. G., Bouzeghoub, M., & Gardarin, G. (eds.), Proc. 5th Int.Conf. Extending Database Technology, EDBT, volume 1057, 3–17. Springer-Verlag, 1996.URL citeseer.nj.nec.com/article/srikant96mining.html

[250] Starr, A. A structured approach to the selection of condition based maintenance. In IEEProceedings of Fifth International Conference on Factory 2000 - The Technology Exploita-tion Process. 1997

[251] Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictions. Journal ofthe Royal Statistical Society, volume 36(2): 111–147, 1974

[252] Sullivan, M. & Chillarege, R. Software defects and their impact on system availability - astudy of field failures in operating systems. 21st Int. Symp. on Fault-Tolerant Computing(FTCS-21), 2–9, 1991. URL citeseer.ist.psu.edu/sullivan91software.html

[253] Sun, R. Introduction to Sequence Learning. In Sun, R. & Giles, C. L. (eds.), SequenceLearning: Paradigms, Algorithms, and Applications, Lecture Notes in Computer Science,volume 1828, 1–11. Springer, Berlin / Heidelberg, 2001

[254] Tauber, O. Einfluss vorhersagegesteuerter Restarts auf die Verfügbarkeit. Master’s thesis,Humboldt-Universität zu Berlin, Berlin, Germany, 2006

[255] Thoraval, L. Hidden Semi-Markov Event Sequence Models. Technical report, UniversitéLouis Pasteur Strasbourg, France, 2002

[256] Todorovski, L., Flach, P., & Lavrac, N. Predictive performance of weighted relative ac-curacy. In Zighed, D. A., Komorowski, J., & Zytkow, J. (eds.), Proceedings of the FourthEuropean Conference on Principles of Data Mining and Knowledge Discovery (PKDD2000), Lecture Notes in Artificial Intelligence, volume 1910, 255–264. Springer, 2000

[257] Troudet, T., Merrill, W., Center, N., & Cleveland, O. A real time neural net estimatorof fatigue life. In IEEE Proceedings of International Joint Conference on Neural Net-works(IJCNN 90), 59–64. 1990

[258] Tsao, M. M. & Siewiorek, D. P. Trend Analysis on System Error Files. In Proc. 13thInternational Symposium on Fault-Tolerant Computing, 116–119. Milano, Italy, 1983

[259] Turnbull, D. & Alldrin, N. Failure Prediction in Hardware Systems. Technical report,University of California, San Diego, 2003. Available at http://www.cs.ucsd.edu/~dturnbul/Papers/ServerPrediction.pdf

[260] Ulerich, N. & Powers, G. On-line hazard aversion and fault diagnosis in chemical pro-cesses: the digraph+fault-tree method. IEEE Transactions on Reliability, volume 37(2):171–177, 1988

[261] Vaidyanathan, K., Harper, R. E., Hunter, S. W., & Trivedi, K. S. Analysis and implementa-tion of software rejuvenation in cluster systems. In Proceedings of the 2001 ACM SIGMET-RICS international conference on Measurement and modeling of computer systems, 62–71.ACM Press, 2001

[262] Vaidyanathan, K. & Trivedi, K. A comprehensive model for software rejuvenation. IEEETransactions on Dependable and Secure Computing, volume 2: 124–137, 2005

citeseer.nj.nec.com/article/srikant96mining.html

citeseer.ist.psu.edu/sullivan91software.html

citeseer.ist.psu.edu/sullivan91software.html

http://www.cs.ucsd.edu/~dturnbul/Papers/ServerPrediction.pdf

http://www.cs.ucsd.edu/~dturnbul/Papers/ServerPrediction.pdf

318 Bibliography

[263] Vaidyanathan, K. & Trivedi, K. S. A Measurement-Based Model for Estimation of Re-source Exhaustion in Operational Software Systems. In Proceedings of the InternationalSymposium on Software Reliability Engineering (ISSRE). 1999

[264] Vapnik, V. N. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995

[265] Vesely, W., Goldberg, F. F., Roberts, N. H., & Haasl, D. F. Fault Tree Handbook. TechnicalReport NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, DC, 1981

[266] Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. Predictive algorithmsin the management of computer systems. IBM Systems Journal, volume 41(3): 461–474,2002

[267] Vilalta, R. & Drissi, Y. A perspective view and survey of meta-learning. Artificial Intelli-gence Review, volume 18(2): 77–95, 2002

[268] Vilalta, R. & Ma, S. Predicting Rare Events In Temporal Domains. In Proceedings of the2002 IEEE International Conference on Data Mining (ICDM’02), 474–482. IEEE Com-puter Society, Washington, DC, USA, 2002

[269] Wahl, M., Howes, T., & Kille, S. Lightweight Directory Access Protocol (v3). RFC 2251,1997. http://www.ietf.org/rfc/rfc2251.txt

[270] Wang, X. Durationally constrained training of HMM without explicit state durational PDF.In Proceedings of the Institute of Phonetic Sciences, University of Amsterdam, volume 18,111–130. 1994

[271] Ward, A., Glynn, P., & Richardson, K. Internet service performance failure detection.SIGMETRICS Performance Evaluation Review, volume 26(3): 38–43, 1998

[272] Ward, A. & Whitt, W. Predicting response times in processor-sharing queues. In Glynn,P. W., MacDonald, D. J., & Turner, S. J. (eds.), Proc. of the Fields Institute Conf. on Comm.Networks. 2000

[273] Warrender, C., Forrest, S., & Pearlmutter, B. Detecting intrusions using system calls: alter-native data models. In IEEE Proceedings of the 1999 Symposium on Security and Privacy,133–145. 1999

[274] Wei, W., Wang, B., & Towsley, D. Continuous-time hidden Markov models for networkperformance evaluation. Performance Evaluation, volume 49(1-4): 129–146, 2002

[275] Weiss, G. Timeweaver: A Genetic Algorithm for Identifying Predictive Patterns in Se-quences of Events. In Proceedings of the Genetic and Evolutionary Computation Confer-ence, 718–725. Morgan Kaufmann, San Francisco, CA, 1999

[276] Weiss, G. M. Mining with rarity: a unifying framework. SIGKDD Explor. Newsl., vol-ume 6(1): 7–19, 2004

[277] Weiss, G. M. & Hirsh, H. Learning to Predict Rare Events in Event Sequences. InR. Agrawal, P. S. & Piatetsky-Shapiro, G. (eds.), Proceedings of the Fourth InternationalConference on Knowledge Discovery and Data Mining, 359–363. AAAI Press, Menlo Park,California, 1998

[278] Williams, J., Davies, A., & Drake, P. (eds.). Condition-based Maintenance and MachineDiagnostics. Springer Verlag, 1994

http://www.ietf.org/rfc/rfc2251.txt

Bibliography 319

[279] Wilson, A. D. & Bobick, A. F. Recognition and interpretation of parametric gesture. InIEEE Proceedings of Sixth International Conference on Computer Vision, 329–336. 1998

[280] Wolpert, D. H. The Mathematics of Generalization. Addison-Wesley, Reading, MA, 1995

[281] Wong, K. C. P., Ryan, H., & Tindle, J. Early Warning Fault Detection Using ArtificialIntelligent Methods. In Proceedings of the Universities Power Engineering Conference.1996. URL citeseer.nj.nec.com/217993.html

[282] Yang, S. A condition-based failure-prediction and processing-scheme for preventive main-tenance. IEEE Transactions on Reliability, volume 52(3): 373–383, 2003

[283] Yu, C. H. Resampling methods: concepts, applications, and justification. Practical Assess-ment, Research and Evaluation, volume 8(19), 2003

[284] Yu, S.-Z., Liu, Z., Squillante, M. S., Xia, C., & Zhang, L. A hidden semi-Markov modelfor web workload self-similarity. In IEEE Proceedings of 21st International Performance,Computing, and Communications Conference, 65–72. 2002

[285] Zipf, G. K. Human Behavior and the Principle of Least Effort: An Introduction to HumanEcology. Addison-Wesley Press, Cambridge, Mass, 1949

citeseer.nj.nec.com/217993.html

Event-based Failure Prediction - hu-berlin.de · gang mit Fehlern: Im Anschluss an die Vorhersage...

Documents

Transcript of Event-based Failure Prediction - hu-berlin.de · gang mit Fehlern: Im Anschluss an die Vorhersage...