PowerPoint 프레젠테이션datamining.uos.ac.kr/wp-content/uploads/2019/02/chap2.pdf ·...
Transcript of PowerPoint 프레젠테이션datamining.uos.ac.kr/wp-content/uploads/2019/02/chap2.pdf ·...
BackGround
02
Basics of Probability and Statistics
서울시립대학교김윤나
001/ Joint and Conditional Probabilities
002/ Bayes’ Rule
003/ Coin Flips and the Binomial Distribution
004/ Maximum Likelihood Parameter Estimation
005/ Bayesian Parameter Estimation
CONTE NTS
006/ Probabilistic Models and Their Applications
Text Data Management and Analysis
𝑥 is drawn from theta
Intro0
random variable 𝑥 is drawn from the probability distribution θ
Ωprobability space
θprobability distribution
Text Data Management and Analysis
- models only assign probabilities for a finite(discrete) set of outcomes
Intro0
- where there are an infinite number of “events” that are not countable
probability distribution
discrete probability distribution
continuous probability distribution
Text Data Management and Analysis
Each event has a probability between zero and one
Intro0
The probability of all events sums to one
event not in has Ω probability zero
3 axioms that should be satisfied in θ with Ω
the probability of any event occurring from Ω is one
Text Data Management and Analysis
Joint probability
1 Joint and Conditional Probabilities
measures the likelihood that two events occur simultaneously
Conditional probabilitymeasures the likelihood that one event occurs given that another event has already occurred
𝑝 𝑥𝑐 = 𝑔𝑟𝑒𝑒𝑛, 𝑥𝑠 = 𝑐𝑖𝑟𝑐𝑙𝑒 =1
6
(𝑟𝑎𝑛𝑑𝑜𝑚 𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠 𝑋 𝑎𝑛𝑑 𝑌)
Text Data Management and Analysis
𝑖𝑓) 𝑋 𝑎𝑛𝑑 𝑌 𝑎𝑟𝑒 𝑖𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡, 𝑡ℎ𝑒𝑛 𝑝 𝑋 𝑌 = 𝑝(𝑋)
2 Bayes’ Rule
Text Data Management and Analysis
If we want to model 𝑛 throws and find the probability of 𝑘 successes,
3 Coin Flips and the Binomial Distribution
when the order of outcomes are not given
when the order of outcomes are given
𝑝 𝑘 ℎ𝑒𝑎𝑑𝑠 = 𝜃𝑘 1 − 𝜃 𝑛−𝑘
Text Data Management and Analysis
Maximum Likelihood Parameter Estimation4
Maximum Likelihood Estimation(MLE)
- to figure out what 𝜃 is based on the observed data- choose the 𝜃 that has the highest likelihood given our data
ex) 𝑝 𝐷 𝜃 = 𝜃3 1 − 𝜃 2
to find the 𝜃 that maximizes the function 𝑓 𝜃 = 𝜃3 1 − 𝜃 2
log 𝑓 𝜃 = 3𝑙𝑜𝑔𝜃 + 2log(1 − 𝜃)
𝑑 log 𝑓 𝜃
𝑑𝜃=3
𝜃−
2
1 − 𝜃= 0
𝜃 =3
5
Text Data Management and Analysis
Maximum Likelihood Parameter Estimation4
Generally,
The value of an arg max expression stays the same if we perform any monotonic transformation of the function inside arg max
Text Data Management and Analysis
Bayesian Parameter Estimation5
Potential problem of MLE
often inaccurate when the size of the data sample is small since it always attempts to fit the data as well as possible
Problem of “overfitting”
This can be addressed and alleviated by considering the uncertainty on the parameter and using Bayesian parameter estimation instead of MLE
In Bayesian parameter estimation, we consider a distribution over all the possible values for the parameter.
Text Data Management and Analysis
Bayesian Parameter Estimation5
p(θ) : to represent a distribution over all possible values for 𝜃which encodes our prior belief about what value is the true value of 𝜃
D : data D provide evidence for or against that belief
(𝑏𝑦 𝐵𝑎𝑦𝑒𝑠′ 𝑟𝑢𝑙𝑒)
(𝑓𝑜𝑟 𝑐𝑜𝑛𝑡𝑖𝑛𝑢𝑜𝑢𝑠 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛)
𝑡ℎ𝑒 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑓𝑜𝑟 𝑎 𝑝𝑎𝑟𝑡𝑖𝑐𝑢𝑙𝑎𝑟 𝜃
Text Data Management and Analysis
Bayesian Parameter Estimation5
(𝑓𝑜𝑟 𝑐𝑜𝑛𝑡𝑖𝑛𝑢𝑜𝑢𝑠 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛)
Text Data Management and Analysis
Bayesian Parameter Estimation5
since the likelihood of the data remains constant
proportional to the prior times the likelihood
Text Data Management and Analysis
Bayesian Parameter Estimation5
𝑓𝑜𝑟 𝑑𝑖𝑠𝑐𝑟𝑒𝑡𝑒 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛
𝑓𝑜𝑟 𝑐𝑜𝑛𝑡𝑖𝑛𝑢𝑜𝑢𝑠 𝑑𝑖𝑠𝑡𝑟𝑖𝑏𝑢𝑡𝑖𝑜𝑛
To compute the mean of the posterior distribution, which is given by the weighted sum of probabilities and the parameter values.
Sometimes, we are interested in using the mode of the posterior distribution as our estimate of the parameter, called Maximum a Posteriori(MAP)
Text Data Management and Analysis
Probabilistic Models and Their Applications6
In the case of a distribution over words, we have parameter for each element in V
<Workflow>1. Define the model
• Capture the probabilities2. Learn its parameters
• Figure out actually how to set the probabilities for each word3. Apply the model
• Once 𝜃 is defined, analyze the probability of a specific subset of words in the corpus and observe unseen data & calculating the probability of seeing the words in the new text.