
Transcript of Christopher M. Bishop

Page 1: Christopher M. Bishop

• Christopher M. Bishop

PATTERN RECOGNITION AND MACHINE LEARNING

CHAPTER 1: INTRODUCTION

• Lecturer: Xiaopeng Hong

• These slides follow closely the course textbook “Pattern Recognition and Machine Learning” by Christopher Bishop and the slides “Machine Learning and Music” by Prof. Douglas Eck

Page 2: Christopher M. Bishop

Disclaimer

Page 3: Christopher M. Bishop
Page 4: Christopher M. Bishop

Contents

[Concept-map figure relating the topics of the course: image processing (IP), signal processing (SP), computer vision (CVPR), feature extraction, pattern classification, statistical and symbolic machine learning (ML), probability theory, information theory, and mathematical logic.]

Page 5: Christopher M. Bishop

PR & ML

• Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years.

Page 6: Christopher M. Bishop

Learning

• Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more effectively the next time. — by H. Simon

• If a system can improve its performance by executing some process, that is learning. — by Prof. Lu Ruqian (陆汝钤)

Page 7: Christopher M. Bishop
Page 8: Christopher M. Bishop
Page 9: Christopher M. Bishop

Related Publications

• Conferences: ICML, KDD, NIPS, IJCNN, AIML, IJCAI, COLT, CVPR, ICCV, ECCV, …

• Journals: Machine Learning (ML), Journal of Machine Learning Research, Annals of Statistics, Data Mining and Knowledge Discovery, IEEE-KDE, IEEE-PAMI, Artificial Intelligence, Journal of Artificial Intelligence Research, Computational Intelligence, Neural Computation, IEEE-NN, Research, Information and Computation, …

Page 10: Christopher M. Bishop
Page 11: Christopher M. Bishop
Page 12: Christopher M. Bishop
Page 13: Christopher M. Bishop
Page 14: Christopher M. Bishop

History

• Classical statistical methods

• Division in feature space

• PAC

• Generalization

• Ensemble learning

F. Rosenblatt: Perceptron

M. Minsky: “Perceptrons”

BP network

Page 15: Christopher M. Bishop
Page 16: Christopher M. Bishop

Leslie Gabriel Valiant

• He introduced the "probably approximately correct" (PAC) model of machine learning that has helped the field of computational learning theory grow. — by Wikipedia

• Computational complexity is taken as a factor that must be considered: the algorithm's complexity must be polynomial, and to achieve this, model accuracy may be sacrificed.

• "For any ε > 0 and 0 ≤ δ < 1, |F(x) − f(x)| ≤ ε holds with probability greater than 1 − δ."

• Traditional statisticians found this idea hard to accept.

by Prof. Wang Jue (王珏)

Page 17: Christopher M. Bishop

Vladimir N. Vapnik

• "Estimating a probability density, an even harder problem, must not be used as an intermediate step in solving machine-learning classification or regression problems. He therefore turned the problem directly into one of linear discrimination; in essence this gives up the interpretability of the learned model with respect to the underlying natural model."

• "Generalization" and "finite-sample statistics":
– generalization as the core problem of machine learning
– algorithms designed in a linear feature space
– generalization via the maximum margin

• "Rather than end up with nothing, settle for second best" — something the statistical tradition could not accept.

by Prof. Wang Jue (王珏)

Page 18: Christopher M. Bishop

Robert Schapire

• "For any ε > 0 and 0 ≤ δ < 1, |F(x) − f(x)| ≤ ε holds with probability greater than 1/2 + δ."

• Gave a constructive proof that PAC weak learnability is a necessary and sufficient condition for PAC strong learnability.

• Ensemble learning has two important characteristics:
– it uses many weak models in place of one strong model;
– decisions are made by having the weak models vote, with the majority determining the answer.

by Prof. Wang Jue (王珏)

Page 19: Christopher M. Bishop

Example

Handwritten Digit Recognition: each digit is a 28 × 28 pixel image, so the input dimensionality is d = 784.

Pre-processing / feature extraction:
1. reduce variability;
2. speed up computation

Page 20: Christopher M. Bishop
Page 21: Christopher M. Bishop

Polynomial Curve Fitting
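The model on this slide, an image lost in transcription, restored from the textbook: a polynomial of order $M$,

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j,$$

nonlinear in $x$ but linear in the coefficients $\mathbf{w}$.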

Page 22: Christopher M. Bishop

Sum-of-Squares Error Function
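The error function on this slide, restored from the textbook:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2,$$

half the sum of squared deviations between the predictions and the targets $t_n$.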

Page 23: Christopher M. Bishop

0th Order Polynomial

Page 24: Christopher M. Bishop

1st Order Polynomial

Page 25: Christopher M. Bishop

3rd Order Polynomial

Page 26: Christopher M. Bishop

9th Order Polynomial

Page 27: Christopher M. Bishop

Over-fitting

Root-Mean-Square (RMS) Error:
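The definition on this slide, restored from the textbook:

$$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{\star}) / N},$$

which is measured on the same scale as $t$ and allows comparison across data sets of different size.

Beyond the slides, a minimal NumPy sketch of this experiment; the setup (10 training points from $\sin(2\pi x)$ with noise of standard deviation 0.3, a 100-point test set) and the helper names are illustrative assumptions, not taken from the transcript:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.3):
    # Targets are sin(2*pi*x) plus Gaussian noise, as in the book's toy example.
    x = rng.uniform(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, n)
    return x, t

def fit_poly(x, t, m):
    # Design matrix with columns x^0 ... x^m; least squares minimizes
    # the sum-of-squares error E(w).
    A = np.vander(x, m + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def rms_error(w, x, t):
    # sqrt(mean squared residual) equals E_RMS = sqrt(2 E(w) / N).
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)
for m in (0, 1, 3, 9):
    w = fit_poly(x_train, t_train, m)
    print(f"M={m}: train E_RMS={rms_error(w, x_train, t_train):.3f}, "
          f"test E_RMS={rms_error(w, x_test, t_test):.3f}")
```

As on the slides, the M = 9 fit drives the training error to nearly zero while the test error grows: over-fitting.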

Page 28: Christopher M. Bishop

Polynomial Coefficients

Page 29: Christopher M. Bishop

Data Set Size: N = 15

9th Order Polynomial

Page 30: Christopher M. Bishop

Data Set Size: N = 100

9th Order Polynomial

Page 31: Christopher M. Bishop

Regularization

• Penalize large coefficient values
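The regularized error function on this slide, restored from the textbook:

$$\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2,$$

where the coefficient $\lambda$ governs the trade-off between fitting the data and keeping the weights small.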

Page 32: Christopher M. Bishop

Regularization: $\ln \lambda = -18$

Page 33: Christopher M. Bishop

Regularization: $\ln \lambda = 0$

Page 34: Christopher M. Bishop

Regularization: $E_{\mathrm{RMS}}$ vs. $\ln \lambda$

Page 35: Christopher M. Bishop

Polynomial Coefficients

Page 36: Christopher M. Bishop

Probability Theory

Statistical machine learning / pattern classification problems can be formulated within the Bayesian framework.

ML, MAP, Bayesian

We now seek a more principled approach to solving problems in pattern recognition by turning to a discussion of probability theory, which also provides the foundation for nearly all of the subsequent developments in this book.

Page 37: Christopher M. Bishop

Probability Theory

Apples and Oranges

Page 38: Christopher M. Bishop

Probability Theory

• Marginal Probability

• Conditional Probability

Joint Probability

Page 39: Christopher M. Bishop

Probability Theory

• Sum Rule

Product Rule

Page 40: Christopher M. Bishop

The Rules of Probability

• Sum Rule

• Product Rule
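The two rules on this slide, restored from the textbook:

$$\text{sum rule:} \quad p(X) = \sum_{Y} p(X, Y), \qquad \text{product rule:} \quad p(X, Y) = p(Y \mid X)\, p(X).$$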

Page 41: Christopher M. Bishop

Bayes’ Theorem

posterior $\propto$ likelihood × prior
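Written out, as in the textbook:

$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_{Y} p(X \mid Y)\, p(Y).$$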

Page 42: Christopher M. Bishop

Probability Densities

Page 43: Christopher M. Bishop

Transformed Densities

[Note by Markus Svensén: this figure was taken from Solution 1.4 in the web edition of the PRML solutions manual, available at http://research.microsoft.com/~cmbishop/PRML; a fuller explanation of the figure is given in the text of that solution.]
Page 44: Christopher M. Bishop

Expectations

Conditional Expectation (discrete)

Approximate Expectation (discrete and continuous)
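The definitions on this slide, restored from the textbook:

$$\mathbb{E}[f] = \sum_{x} p(x) f(x), \qquad \mathbb{E}[f] = \int p(x) f(x)\, \mathrm{d}x,$$

$$\mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y) f(x), \qquad \mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n),$$

the last being the approximation from $N$ points drawn from $p(x)$.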

Page 45: Christopher M. Bishop

Variances and Covariances
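Restored from the textbook:

$$\operatorname{var}[f] = \mathbb{E}\big[ (f(x) - \mathbb{E}[f(x)])^2 \big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2,$$

$$\operatorname{cov}[x, y] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\, \mathbb{E}[y].$$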

Page 46: Christopher M. Bishop
Page 47: Christopher M. Bishop

The Gaussian Distribution
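The density on this slide, restored from the textbook:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\},$$

with mean $\mu$, variance $\sigma^2$, and precision $\beta = 1/\sigma^2$.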

Page 48: Christopher M. Bishop

Gaussian Mean and Variance
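Restored from the textbook:

$$\mathbb{E}[x] = \mu, \qquad \mathbb{E}[x^2] = \mu^2 + \sigma^2, \qquad \operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2.$$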

Page 49: Christopher M. Bishop

The Multivariate Gaussian
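The $D$-dimensional density on this slide, restored from the textbook:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\},$$

with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.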

Page 50: Christopher M. Bishop

Gaussian Parameter Estimation

Likelihood function
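For i.i.d. observations $\mathbf{x} = (x_1, \dots, x_N)^{\mathrm{T}}$, the likelihood shown on this slide is, from the textbook:

$$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2),$$

viewed as a function of the parameters $\mu$ and $\sigma^2$.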

Page 51: Christopher M. Bishop

Maximum (Log) Likelihood
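Restored from the textbook:

$$\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi),$$

maximized by the sample statistics

$$\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2.$$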

Page 52: Christopher M. Bishop

Properties of $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$
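Restored from the textbook:

$$\mathbb{E}[\mu_{\mathrm{ML}}] = \mu, \qquad \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \left( \frac{N-1}{N} \right) \sigma^2,$$

so the maximum-likelihood variance is biased low; the unbiased estimate is $\widetilde{\sigma}^2 = \frac{N}{N-1}\, \sigma^2_{\mathrm{ML}}$.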

Page 53: Christopher M. Bishop

Curve Fitting Re-visited

precision parameter $\beta = 1/\sigma^2$
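The model on this slide, restored from the textbook:

$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big( t \mid y(x, \mathbf{w}),\, \beta^{-1} \big),$$

i.e. the target is the polynomial prediction corrupted by Gaussian noise of precision $\beta$.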

Page 54: Christopher M. Bishop

Maximum Likelihood

Determine $\mathbf{w}_{\mathrm{ML}}$ by minimizing the sum-of-squares error $E(\mathbf{w})$.

1. mean: $\mathbf{w}_{\mathrm{ML}}$

2. precision: $\beta_{\mathrm{ML}}$
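Maximizing the likelihood with respect to $\beta$ gives, from the textbook:

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \}^2.$$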

Page 55: Christopher M. Bishop

Predictive Distribution
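Restored from the textbook: plugging the maximum-likelihood parameters back in gives

$$p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\big( t \mid y(x, \mathbf{w}_{\mathrm{ML}}),\, \beta_{\mathrm{ML}}^{-1} \big).$$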

Page 56: Christopher M. Bishop

MAP: A Step towards Bayes

Determine $\mathbf{w}_{\mathrm{MAP}}$ by minimizing the regularized sum-of-squares error $\widetilde{E}(\mathbf{w})$.
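With a Gaussian prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$, maximizing the posterior is, per the textbook, equivalent to minimizing

$$\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w},$$

i.e. regularized least squares with $\lambda = \alpha / \beta$.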

Page 57: Christopher M. Bishop

Bayesian Curve Fitting

fully Bayesian approach
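Restored from the textbook: rather than committing to a point estimate of $\mathbf{w}$, we marginalize,

$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}.$$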

Section 3.3

Page 58: Christopher M. Bishop

Bayesian Predictive Distribution

Page 59: Christopher M. Bishop

1.3 Model Selection

Many parameters… We need to determine the values of such parameters, and the principal objective in doing so is usually to achieve the best predictive performance on new data.

Section 3.3

We may wish to consider a range of different types of model in order to find the best one for our particular application.

Page 60: Christopher M. Bishop

Model Selection

• Cross-Validation

Information criteria that penalize model complexity tend to favour overly simple models.

Section 3.4 & 4.4.1
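Beyond the slide, a minimal sketch of S-fold cross-validation (here S = 5) used to choose the polynomial order; the 30-point data set and the cv_rms helper are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 30)

def cv_rms(x, t, m, folds=5):
    # S-fold CV: train on S-1 folds, evaluate on the held-out fold, average.
    idx = np.array_split(rng.permutation(len(x)), folds)
    errs = []
    for k in range(folds):
        val = idx[k]
        trn = np.concatenate([idx[j] for j in range(folds) if j != k])
        A = np.vander(x[trn], m + 1, increasing=True)
        w, *_ = np.linalg.lstsq(A, t[trn], rcond=None)
        y = np.vander(x[val], m + 1, increasing=True) @ w
        errs.append(np.sqrt(np.mean((y - t[val]) ** 2)))
    return np.mean(errs)

best_m = min(range(10), key=lambda m: cv_rms(x, t, m))
print("polynomial order selected by 5-fold CV:", best_m)
```

The held-out error penalizes over-fitting automatically, at the price of S training runs per candidate model.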

Page 61: Christopher M. Bishop
Page 62: Christopher M. Bishop

1.4 Curse of Dimensionality

We will have to deal with spaces of high dimensionality comprising many input variables.

This poses some serious challenges and is an important factor influencing the design of pattern recognition techniques.

Page 63: Christopher M. Bishop

Curse of Dimensionality

Page 64: Christopher M. Bishop

Curse of Dimensionality

Polynomial curve fitting, M = 3
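Following the textbook's argument: for $D$ input variables, a general polynomial of order $M$ has a number of independent coefficients that grows like $D^M$; with $M = 3$ that is already $O(D^3)$, so this naive approach becomes unwieldy in high dimensions.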

Page 65: Christopher M. Bishop

1.5 Decision Theory

Determination of p(x, t) from a set of training data is an example of inference and is typically a very difficult problem whose solution forms the subject of much of this book.

In a practical application, we often make a specific prediction for the value of t, or more generally take a specific action based on our understanding of the values t is likely to take, and this aspect is the subject of decision theory.

Page 66: Christopher M. Bishop

Decision Theory

• Inference step: determine either $p(\mathbf{x}, \mathcal{C}_k)$ or $p(\mathcal{C}_k \mid \mathbf{x})$.

• Decision step: for given $\mathbf{x}$, determine the optimal $t$.

Page 67: Christopher M. Bishop

Minimum Misclassification Rate

If our aim is to minimize the chance of assigning x to the wrong class, then intuitively we would choose the class having the higher posterior probability.

decision regions
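Restored from the textbook, for two classes with decision regions $\mathcal{R}_1, \mathcal{R}_2$:

$$p(\text{mistake}) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, \mathrm{d}\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, \mathrm{d}\mathbf{x},$$

which is minimized by assigning each $\mathbf{x}$ to the class with the larger posterior $p(\mathcal{C}_k \mid \mathbf{x})$.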

Page 68: Christopher M. Bishop

Minimum Expected Loss

• Example: classify medical images as ‘cancer’ or ‘normal’.

[Loss-matrix figure, with rows labelled by the truth and columns by the decision.]

loss function / cost function

Page 69: Christopher M. Bishop

Minimum Expected Loss

Regions are chosen to minimize
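Restored from the textbook:

$$\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, \mathrm{d}\mathbf{x},$$

minimized by assigning each $\mathbf{x}$ to the class $j$ that minimizes $\sum_k L_{kj}\, p(\mathcal{C}_k \mid \mathbf{x})$.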

Page 70: Christopher M. Bishop

Reject Option

Page 71: Christopher M. Bishop

Why Separate Inference and Decision?

• Minimizing risk (loss matrix may change over time)
• Reject option
• Unbalanced class priors
• Combining models

Page 72: Christopher M. Bishop

Decision Theory for Regression

• Inference step: determine $p(\mathbf{x}, t)$.

• Decision step: for given $\mathbf{x}$, make an optimal prediction, $y(\mathbf{x})$, for $t$.

• Loss function:
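For squared loss this is, from the textbook:

$$\mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t,$$

minimized by the conditional mean $y(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}]$.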

Page 73: Christopher M. Bishop

Generative vs Discriminative

• Generative approach: model $p(\mathbf{x} \mid \mathcal{C}_k)$ and $p(\mathcal{C}_k)$, then use Bayes’ theorem to obtain $p(\mathcal{C}_k \mid \mathbf{x})$.

• Discriminative approach: model $p(\mathcal{C}_k \mid \mathbf{x})$ directly.

• Discriminant function: map $\mathbf{x}$ directly onto a class label.

Page 74: Christopher M. Bishop

1.6 Information Theory

Information theory will also prove useful in our development of pattern recognition and machine learning techniques

Consider a discrete random variable x: how much information is received when we observe a specific value of this variable? The amount of information can be viewed as the ‘degree of surprise’ on learning the value of x.

Page 75: Christopher M. Bishop

Entropy

Important quantity in• coding theory• statistical physics• machine learning

Page 76: Christopher M. Bishop

Entropy

• Coding theory: x discrete with 8 possible states; how many bits to transmit the state of x?

• All states equally likely
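Worked out as in the textbook: with entropy $H[x] = -\sum_x p(x) \log_2 p(x)$ and 8 equally likely states,

$$H[x] = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3 \text{ bits},$$

so 3 bits suffice to transmit the state; non-uniform distributions can be coded with fewer bits on average.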

Page 77: Christopher M. Bishop

Entropy

Page 78: Christopher M. Bishop

Conditional Entropy
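Restored from the textbook:

$$H[y \mid x] = -\iint p(x, y) \ln p(y \mid x)\, \mathrm{d}x\, \mathrm{d}y, \qquad H[x, y] = H[y \mid x] + H[x].$$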

Page 79: Christopher M. Bishop

The Kullback-Leibler Divergence
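Restored from the textbook:

$$\mathrm{KL}(p \| q) = -\int p(\mathbf{x}) \ln \left\{ \frac{q(\mathbf{x})}{p(\mathbf{x})} \right\} \mathrm{d}\mathbf{x} \;\geq\; 0,$$

with equality if and only if $p = q$; note it is not symmetric in $p$ and $q$.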

Page 80: Christopher M. Bishop

Mutual Information
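Restored from the textbook:

$$I[x, y] \equiv \mathrm{KL}\big( p(x, y) \,\|\, p(x)\, p(y) \big) = H[x] - H[x \mid y] = H[y] - H[y \mid x] \;\geq\; 0,$$

with equality if and only if $x$ and $y$ are independent.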

Page 81: Christopher M. Bishop

Conclusion

• Machine Learning
• Generalization
• Classification / Regression
• Fitting / Over-fitting
• Regularization
• Bayes’ Theorem
• Bayes’ Decision
• Entropy / KLD / MI

Page 82: Christopher M. Bishop

Q & A