Probability in Machine Learning

Transcript of Probability in Machine Learning

Page 1: Probability in Machine Learning

Probability in Machine Learning

Page 2: Probability in Machine Learning

Three Axioms of Probability

• Given an event $E$ in a sample space $S$, where $S = \bigcup_{i=1}^{N} E_i$

• First axiom: $P(E) \in \mathbb{R}$, $0 \le P(E) \le 1$

• Second axiom: $P(S) = 1$

• Third axiom (additivity): for any countable sequence of mutually exclusive events $E_i$,

$$P\left(\bigcup_{i=1}^{n} E_i\right) = P(E_1) + P(E_2) + \cdots + P(E_n) = \sum_{i=1}^{n} P(E_i)$$
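As a quick numerical check, here is a minimal Python sketch (using a fair six-sided die as an example, which is not on the slide) that exercises all three axioms:

```python
# A minimal sketch checking the three axioms on a fair six-sided die (assumed example).
from fractions import Fraction

# PMF of a fair die: the six faces are mutually exclusive elementary events.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# First axiom: every probability is a real number in [0, 1].
assert all(0 <= p <= 1 for p in pmf.values())

# Second axiom: the whole sample space has probability 1.
assert sum(pmf.values()) == 1

# Third axiom (additivity): P(even) = P(2) + P(4) + P(6) for mutually exclusive events.
p_even = sum(pmf[face] for face in (2, 4, 6))
print(p_even)  # 1/2
```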

Page 3: Probability in Machine Learning

Random Variables

• A random variable is a variable whose values are numerical outcomes of a random phenomenon.

• Discrete variables and the Probability Mass Function (PMF)

• Continuous variables and the Probability Density Function (PDF)

π‘₯

𝑝𝑋 π‘₯ = 1

Pr π‘Ž ≀ 𝑋 ≀ 𝑏 = ΰΆ±π‘Ž

𝑏

𝑓𝑋 π‘₯ 𝑑π‘₯

Page 4: Probability in Machine Learning

Expected Value (Expectation)

• Expectation of a random variable X: $E[X] = \sum_{x} x\,p_X(x)$ for a discrete variable, or $E[X] = \int x\,f_X(x)\,dx$ for a continuous one

• Expected value of rolling one die? $E[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5$ (see the sketch below)

https://en.wikipedia.org/wiki/Expected_value
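A minimal sketch of the die question, computing the expectation exactly and confirming it by simulation:

```python
# Expected value of one fair die, exactly and by Monte Carlo.
import random

# Exact: E[X] = sum over faces of x * p(x).
exact = sum(face * (1 / 6) for face in range(1, 7))
print(exact)  # 3.5

# Simulated: the sample mean approaches E[X] as the number of rolls grows.
rolls = [random.randint(1, 6) for _ in range(100_000)]
print(sum(rolls) / len(rolls))  # ~3.5
```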

Page 5: Probability in Machine Learning

Expected Value of Playing Roulette

• Bet $1 on a single number (0–36) and get a $35 payoff if you win. What's the expected value?

𝐸 πΊπ‘Žπ‘–π‘› π‘“π‘Ÿπ‘œπ‘š $1 𝑏𝑒𝑑 = βˆ’1 Γ—36

37+ 35 Γ—

1

37=βˆ’1

37
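A minimal sketch reproducing the calculation with exact fractions:

```python
# Expected gain of a $1 single-number roulette bet, using exact arithmetic.
from fractions import Fraction

p_win = Fraction(1, 37)   # one winning pocket out of 37 (0-36)
p_lose = 1 - p_win

# Gain is +35 on a win and -1 (the stake) on a loss.
expected_gain = 35 * p_win + (-1) * p_lose
print(expected_gain)  # -1/37, about -2.7 cents per $1 bet
```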

Page 6: Probability in Machine Learning

Variance

• The variance of a random variable X is the expected value of the squared deviation from the mean of X

$$\operatorname{Var}(X) = E[(X - \mu)^2]$$
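A minimal sketch applying the definition to the fair-die example used earlier:

```python
# Variance of one fair die, straight from Var(X) = E[(X - mu)^2].
faces = range(1, 7)
p = 1 / 6

mu = sum(x * p for x in faces)               # E[X] = 3.5
var = sum((x - mu) ** 2 * p for x in faces)  # E[(X - mu)^2]
print(mu, var)  # 3.5 2.916...
```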

Page 7: Probability in Machine Learning

Bias and Variance

Page 8: Probability in Machine Learning

Covariance

• Covariance is a measure of the joint variability of two random variables.

πΆπ‘œπ‘£ 𝑋, π‘Œ = 𝐸 𝑋 βˆ’ 𝐸[𝑋] 𝐸 π‘Œ βˆ’ 𝐸[π‘Œ] = 𝐸 π‘‹π‘Œ βˆ’ 𝐸 𝑋 𝐸[π‘Œ]

https://programmathically.com/covariance-and-correlation/
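A minimal sketch checking both forms of the formula on synthetic data (the data and coefficients are invented for illustration):

```python
# Covariance two ways: definition vs. the E[XY] - E[X]E[Y] shortcut.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)  # y co-varies with x by construction

cov_def = np.mean((x - x.mean()) * (y - y.mean()))  # E[(X - E[X])(Y - E[Y])]
cov_short = np.mean(x * y) - x.mean() * y.mean()    # E[XY] - E[X]E[Y]
print(cov_def, cov_short)  # both ~2.0, matching np.cov(x, y, bias=True)[0, 1]
```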

Page 9: Probability in Machine Learning

Standard Deviation

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} = \sqrt{\operatorname{Var}(X)}$$

• 68–95–99.7 rule: about 68%, 95%, and 99.7% of values fall within one, two, and three standard deviations of the mean, respectively (verified in the sketch below)
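A minimal sketch verifying the rule on standard normal samples (synthetic data, not from the slide):

```python
# Empirical check of the 68-95-99.7 rule.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

for k in (1, 2, 3):
    frac = np.mean(np.abs(samples) <= k)
    print(f"within {k} sigma: {frac:.4f}")  # ~0.6827, ~0.9545, ~0.9973
```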

Page 10: Probability in Machine Learning

6 Sigma

• A product has a 99.99966% chance of being free of defects (about 3.4 defects per million opportunities)

https://www.leansixsigmadefinition.com/glossary/six-sigma/

Page 11: Probability in Machine Learning

Union, Intersection, and Conditional Probability

• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

• $P(A \cap B)$ is abbreviated as $P(AB)$

• Conditional probability $P(A|B)$: the probability of event A given that B has occurred

− $P(A|B) = \dfrac{P(AB)}{P(B)}$

− $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$
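A minimal worked example of the conditional-probability formula (two fair dice, an example not on the slide):

```python
# P(A|B) = P(AB) / P(B) on the 36-outcome space of two fair dice.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs

A = {o for o in outcomes if sum(o) == 8}   # event A: the sum is 8
B = {o for o in outcomes if o[0] == 6}     # event B: the first die shows 6

def p(event):
    return Fraction(len(event), len(outcomes))

print(p(A & B) / p(B))  # 1/6: given a 6, only (6, 2) yields a sum of 8
```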

Page 12: Probability in Machine Learning

Chain Rule of Probability

• The joint probability can be expressed via the chain rule:

$$P(A_1 A_2 \cdots A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 A_2) \cdots P(A_n|A_1 \cdots A_{n-1})$$

Page 13: Probability in Machine Learning

Mutually Exclusive

• $P(AB) = 0$

• $P(A \cup B) = P(A) + P(B)$

Page 14: Probability in Machine Learning

Independence of Events

• Two events A and B are said to be independent if the probability of their intersection is equal to the product of their individual probabilities

− $P(AB) = P(A)\,P(B)$

− $P(A|B) = P(A)$

Page 15: Probability in Machine Learning

Bayes Rule

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

Proof:

Remember $P(A|B) = \dfrac{P(AB)}{P(B)}$, so $P(AB) = P(A|B)\,P(B) = P(B|A)\,P(A)$.

Dividing both sides by $P(B)$ then gives Bayes' rule: $P(A|B) = \dfrac{P(B|A)\,P(A)}{P(B)}$.
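A minimal sketch verifying Bayes' rule on the same two-dice example used above:

```python
# Bayes' rule check: P(A|B) == P(B|A) P(A) / P(B).
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
A = {o for o in outcomes if sum(o) == 8}   # the sum is 8
B = {o for o in outcomes if o[0] == 6}     # the first die shows 6

def p(event):
    return Fraction(len(event), len(outcomes))

p_A_given_B = p(A & B) / p(B)
p_B_given_A = p(A & B) / p(A)
assert p_A_given_B == p_B_given_A * p(A) / p(B)
print(p_A_given_B)  # 1/6
```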

Page 16: Probability in Machine Learning

Naïve Bayes Classifier

Page 17: Probability in Machine Learning

Naïve = Assume All Features Independent
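A minimal naive Bayes sketch over categorical features; the toy weather-style records below are invented for illustration. Under the independence assumption, prediction picks the class $y$ maximizing $P(y) \prod_i P(x_i|y)$:

```python
# Naive Bayes from scratch: class-conditional probabilities are multiplied
# as if all features were independent given the label.
from collections import Counter, defaultdict

# (feature_1, feature_2, label) triples -- hypothetical toy data.
data = [
    ("sunny", "hot", "no"), ("sunny", "cool", "yes"), ("rainy", "cool", "no"),
    ("overcast", "hot", "yes"), ("overcast", "cool", "yes"), ("rainy", "hot", "no"),
]

labels = Counter(y for *_, y in data)
cond = defaultdict(Counter)  # cond[(i, y)][x]: count of feature i == x given label y
for *xs, y in data:
    for i, x in enumerate(xs):
        cond[(i, y)][x] += 1

def predict(xs):
    # Score each class by P(y) * prod_i P(x_i | y); no smoothing in this sketch.
    scores = {}
    for y, n_y in labels.items():
        score = n_y / len(data)
        for i, x in enumerate(xs):
            score *= cond[(i, y)][x] / n_y
        scores[y] = score
    return max(scores, key=scores.get)

print(predict(("sunny", "cool")))  # 'yes'
```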

Page 18: Probability in Machine Learning

Graphical Model

𝑝 π‘Ž, 𝑏, 𝑐, 𝑑, 𝑒 = 𝑝 π‘Ž 𝑝 𝑏 π‘Ž 𝑝 𝑐 π‘Ž, 𝑏 𝑝 𝑑 𝑏 𝑝(𝑒|𝑐)

Page 19: Probability in Machine Learning

Normal (Gaussian) Distribution

• A type of continuous probability distribution for a real-valued random variable

• One of the most important distributions
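The density itself is not transcribed on the slide; for reference, the Gaussian pdf with mean $\mu$ and standard deviation $\sigma$ is:

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$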

Page 20: Probability in Machine Learning

Central Limit Theorem

• Averages of samples of observations of random variables drawn independently from the same distribution converge to the normal distribution

https://corporatefinanceinstitute.com/resources/knowledge/other/central-limit-theorem/
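A minimal simulation of the theorem, averaging batches of die rolls (example data, not from the slide):

```python
# CLT sketch: means of 50 uniform die rolls are approximately normal.
import numpy as np

rng = np.random.default_rng(0)

# 100,000 experiments, each averaging 50 independent rolls of a fair die.
means = rng.integers(1, 7, size=(100_000, 50)).mean(axis=1)

# Sample means cluster around E[X] = 3.5 with spread sigma / sqrt(50).
print(means.mean())                                 # ~3.5
print(means.std(), (35 / 12) ** 0.5 / 50 ** 0.5)    # empirical vs. theoretical
```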

Page 21: Probability in Machine Learning

Bernoulli Distribution

• Definition: a single trial with exactly two outcomes, success ($X = 1$) with probability $p$ and failure ($X = 0$) with probability $q = 1 - p$

• PMF: $P(X = 1) = p$, $P(X = 0) = q$

• $E[X] = p$

• $\operatorname{Var}(X) = pq$

https://en.wikipedia.org/wiki/Bernoulli_distribution

https://acegif.com/flipping-coin-gifs/
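A minimal simulation checking $E[X] = p$ and $\operatorname{Var}(X) = pq$ (the value of $p$ is arbitrary):

```python
# Bernoulli(p): sample mean ~ p, sample variance ~ p(1 - p).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.random(1_000_000) < p  # boolean Bernoulli(p) samples

print(x.mean(), p)            # ~0.3
print(x.var(), p * (1 - p))   # ~0.21
```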

Page 22: Probability in Machine Learning

Information Theory

• Self-information: $I(x) = -\log P(x)$

• Shannon Entropy:

$$H = -\sum_{i} p_i \log_2 p_i$$

Page 23: Probability in Machine Learning

Entropy of Bernoulli Variable

‒𝐻 π‘₯ = 𝐸 𝐼 π‘₯ = βˆ’πΈ[log 𝑃(π‘₯)]

https://en.wikipedia.org/wiki/Information_theory
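A minimal sketch of the entropy of a Bernoulli variable as a function of $p$:

```python
# Entropy of a Bernoulli(p) variable, in bits; maximal at p = 0.5.
import math

def binary_entropy(p: float) -> float:
    # H(p) = -p log2 p - (1 - p) log2 (1 - p)
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.5, 0.9):
    print(p, binary_entropy(p))  # 0.469, 1.0, 0.469
```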

Page 24: Probability in Machine Learning

Kullback-Leibler (KL) Divergence

• $D_{KL}(P \,\|\, Q) = E[\log P(x) - \log Q(x)] = E\!\left[\log \dfrac{P(x)}{Q(x)}\right]$

KL Divergence is Asymmetric!

$$D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$$

https://www.deeplearningbook.org/slides/03_prob.pdf
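A minimal sketch computing the divergence in both directions for two invented PMFs, showing the asymmetry:

```python
# D_KL(P || Q) vs. D_KL(Q || P) on toy distributions.
import math

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

def kl(p, q):
    # sum_i p_i log(p_i / q_i); assumes q_i > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl(p, q))  # ~0.184
print(kl(q, p))  # ~0.192 -- not equal: KL divergence is asymmetric
```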

Page 25: Probability in Machine Learning

Key Takeaways

• Expected value (expectation) is the mean (weighted average) of a random variable

• Events A and B are independent if $P(AB) = P(A)\,P(B)$

• Events A and B are mutually exclusive if $P(AB) = 0$

• The central limit theorem suggests the normal distribution as a default assumption when the data's probability distribution is unknown

• Entropy is the expected value of self-information: $-E[\log P(x)]$

• KL divergence measures the difference between two probability distributions and is asymmetric