Shahid Lecture-7- MKAG1273

Post on 15-Jan-2017


Dr. Shamsuddin Shahid

Department of Hydraulics and Hydrology

Faculty of Civil Engineering, Universiti Teknologi Malaysia

Room No.: M46-332; Phone: 07-5531624; Mobile: 0182051586 Email: sshahid@utm.my

MAL1303: STATISTICAL HYDROLOGY

Non-parametric Regression

11/23/2015 Shamsuddin Shahid, FKA, UTM


Simple Linear Regression: Revisited


Test of Significance of Slope

Null Hypothesis, H0: There is no change, m = 0

Alternative Hypothesis, HA: There is a change, m ≠ 0

If |t(calculated)| > t(critical, α, n-2), the null hypothesis is rejected and the change is significant.

t(calculated) = 3.59

t(critical, 0.05, 10) = 2.23

As t(calculated) > t(critical, 0.05, 10), the null hypothesis is rejected. The change is significant.

A change in rainfall of 1 mm causes a change in discharge of 1.08 cumec, at the 95% level of confidence.
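The slope test can be sketched in a few lines of Python. This is only an illustration: the function names and the rainfall/discharge-style data are made up, and the critical t value is passed in by hand (taken from a t table) because the standard library has no t distribution. The last line reuses the values quoted in the lecture.

```python
import math

def ols_slope_t(x, y):
    """Fit y = m*x + c by least squares; return (m, c, t statistic of m)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = mean_y - m * mean_x
    # Residual standard error with n - 2 degrees of freedom
    sse = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    se_m = s / math.sqrt(sxx)  # standard error of the slope
    return m, c, m / se_m

def slope_is_significant(t_calculated, t_critical):
    """Reject H0: m = 0 when |t(calculated)| exceeds t(critical)."""
    return abs(t_calculated) > t_critical

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c, t = ols_slope_t(x, y)
print(f"m = {m:.3f}, c = {c:.3f}, t = {t:.2f}")

# Values quoted in the lecture: t(calculated) = 3.59, t(critical, 0.05, 10) = 2.23
print(slope_is_significant(3.59, 2.23))
```

The same comparison applies to the intercept: with t(calculated) = 0.11 the function returns False, so H0 cannot be rejected.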


Test of Significance of Intercept

Null Hypothesis, H0: The intercept is zero, c = 0

Alternative Hypothesis, HA: The intercept is not zero, c ≠ 0

If |t(calculated)| > t(critical, α, n-2), the null hypothesis is rejected and the intercept is significantly different from zero.

t(calculated) = 0.11

t(critical, 0.05, 10) = 2.23

As t(calculated) < t(critical, 0.05, 10), the null hypothesis CANNOT BE rejected. The intercept is NOT significantly different from zero. It can be commented that discharge is not significantly different from zero at the 95% level of confidence when rainfall is zero.


Residuals

The difference between an actual observation and the predicted value is called the residual.


Distribution of Residuals

Residuals should be normally distributed.


Distribution of Residuals

Distribution of Residuals for the present example.


Abnormal Distribution of Residuals


Leverage

Leverage is a measure of an "outlier" in the x direction. It is a function of the distance from the i-th x value to the middle (mean) of the x values used in the regression.


Leverage

A high leverage point is one where h_i > 3p/n, where p is the number of coefficients in the model (p = 2 in simple linear regression: b0 and b1).

All h_i are less than 3p/n (3 × 2/12 = 0.5).
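A minimal sketch of the leverage check, assuming the usual simple-linear-regression definition h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², which is not written out on the slide. The function names and the 12 x values are hypothetical; the last x value is deliberately placed far from the others.

```python
def leverages(x):
    """Leverage h_i of each x value in a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

def high_leverage_points(x, p=2):
    """Indices of observations with h_i > 3p/n (the rule used in the lecture)."""
    threshold = 3 * p / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > threshold]

# Hypothetical x values (n = 12, so 3p/n = 0.5)
x = [10, 12, 14, 15, 16, 18, 20, 21, 22, 24, 25, 90]
for xi, h in zip(x, leverages(x)):
    print(f"x = {xi:3d}  h = {h:.3f}")
print("high leverage at indices:", high_leverage_points(x))
```

A useful sanity check is that the leverages always sum to p, the number of coefficients.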


Leverage

One h_i is more than 3p/n (3 × 2/12 = 0.5).


Measures of Outliers in the y Direction

One measure of outliers in the y direction is the standardized residual, e_si.

An extreme outlier is one for which |e_si| > 3. There should be only an average of 3 of these in 1,000 observations if the residuals are normally distributed.

|e_si| > 2 should occur about 5 times in 100 observations if the residuals are normally distributed.

More than this number indicates that the residuals do not have a normal distribution.

Where,


Measures of Outliers in the y Direction

An extreme outlier is one for which |e_si| > 3.

|e_si| > 2 should occur about 5 times in 100 observations if the residuals are normally distributed.
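The following sketch shows how standardized residuals could be computed and screened. It assumes the common definition e_si = e_i / (s·√(1 − h_i)), where s is the residual standard error and h_i the leverage; the slide's own formula did not survive extraction, so treat this as an assumption. The data are invented, with one y value pushed well off the line.

```python
import math

def standardized_residuals(x, y):
    """Standardized residuals e_si = e_i / (s * sqrt(1 - h_i)) for y = m*x + c."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    c = mean_y - m * mean_x
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    result = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - mean_x) ** 2 / sxx  # leverage of this observation
        result.append(e / (s * math.sqrt(1 - h)))
    return result

# Hypothetical data: roughly y = 2x + 1, with the sixth value pushed upward
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 20.0, 14.9, 17.2, 18.8, 21.1]
es = standardized_residuals(x, y)
flagged = [i for i, e in enumerate(es) if abs(e) > 2]
print("observations with |e_si| > 2:", flagged)
```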


Measures of Influence of Outliers

Observations with high influence are those which have both high leverage and large outliers. These exert a stronger influence on the position of the regression line than other observations.

The two most widely used measures of the influence of an outlier on a regression equation are:

1. Cook's D
2. DFFITS


Cook's D Method

"Cook's D" is one of the most widely used measures of influence.

The i-th observation is considered to have high influence if

D_i > F(p+1, n−p) at α = 0.05

where p is the number of coefficients.

For Simple Linear Regression (SLR) with more than about 30 observations, the critical value for D_i would be about 2.4.
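As a rough sketch, Cook's D can be computed from the residuals and leverages using the standard form D_i = e_i²·h_i / (p·s²·(1 − h_i)²); this formula is an assumption here, since the slide does not spell it out. The critical F value is supplied by hand (3.7 is the value the lecture quotes for n = 12 and p = 2), and the data are hypothetical, with the last point both far out in x and well off the line.

```python
def cooks_distance(x, y, p=2):
    """Cook's D for each observation of a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    c = mean_y - m * mean_x
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    d = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - mean_x) ** 2 / sxx  # leverage
        d.append(e * e * h / (p * s2 * (1 - h) ** 2))
    return d

# Hypothetical data: eleven points on y = 2x + 1 plus one distant, low point
x = list(range(1, 12)) + [30]
y = [2 * xi + 1 for xi in x[:-1]] + [20]
f_critical = 3.7  # F(p+1, n-p) at alpha = 0.05 for n = 12, p = 2, as quoted in the lecture
d = cooks_distance(x, y)
influential = [i for i, di in enumerate(d) if di > f_critical]
print("high-influence observations:", influential)
```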


DFFITS Method

DFFITS is a more robust method for diagnosing influence.

An observation is considered to have high influence if


Measures of Influence of Outliers

Cook's D: F(p+1, n−p) at α = 0.05 is 3.7; D_i is always less than 3.7.

DFFITS: 2*√pn = 2 *√2*12 = 9.79, DFFITS values are always less than 9.79


Abnormal Distribution of Residuals


Alternative Methods for Regression

Situations such as the above frequently arise where the assumptions of constant variance and normality of residuals required by Ordinary Least Squares (OLS) regression are not satisfied, and transformations to remedy this are either not possible or not desirable.

In these situations, alternative methods are better for fitting lines to data. These include:

• Nonparametric rank-based methods
• Minimizing residual variations
• Smooths


Kendall-Theil Robust Line

Kendall-Theil is a non-parametric rank-based method.

Related to the Kendall-tau rank correlation, it is a robust nonparametric line applicable when Y is linearly related to X.

The advantages of the Kendall-Theil method over OLS regression are:

• The Kendall-Theil line does not depend on the normality of residuals for the validity of significance tests

• It is not strongly affected by outliers


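Kendall-Theil Robust Line

The Kendall-Theil method also tries to find the best-fit line:

Y = mX + c

where the slope m is the median of all pairwise slopes (Yj − Yi)/(Xj − Xi), i < j,

and the intercept is c = median(Y) − m × median(X).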


Kendall-Theil Robust Line: Example


Kendall-Theil Robust Line: Example

0.412 0.595 0.729 0.739 0.750 0.787 0.795 0.812 0.817 0.839 0.856 0.882 0.890 0.937 0.985 1.000 1.000 1.010 1.038 1.053 1.063 1.077 1.220 1.228 1.393 1.500 1.897 2.222

Median is the average of 14th and 15th slopes, i.e., (0.937+0.985)/2 = 0.961


Kendall-Theil Robust Line: Example

C = 49.9 – (0.961 * 47.5) = 4.25

Y = 0.961X + 4.25
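The worked example can be reproduced with a short sketch. The helper `kendall_theil_line` is a hypothetical name; it takes the slope as the median of all pairwise slopes and the intercept as median(Y) − m × median(X). The second part reuses the sorted slopes listed in the lecture, and assumes that 49.9 and 47.5 in the example are the medians of Y and X.

```python
import statistics
from itertools import combinations

def kendall_theil_line(x, y):
    """Return (slope, intercept) of the Kendall-Theil line for paired data."""
    slopes = [
        (yj - yi) / (xj - xi)
        for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)
        if xj != xi  # pairs with equal x have no defined slope
    ]
    m = statistics.median(slopes)
    c = statistics.median(y) - m * statistics.median(x)
    return m, c

# Sorted pairwise slopes as listed in the lecture example
lecture_slopes = [
    0.412, 0.595, 0.729, 0.739, 0.750, 0.787, 0.795, 0.812, 0.817, 0.839,
    0.856, 0.882, 0.890, 0.937, 0.985, 1.000, 1.000, 1.010, 1.038, 1.053,
    1.063, 1.077, 1.220, 1.228, 1.393, 1.500, 1.897, 2.222,
]
m = round(statistics.median(lecture_slopes), 3)
c = 49.9 - m * 47.5  # assumed: 49.9 = median(Y), 47.5 = median(X)
print(f"Y = {m:.3f}X + {c:.2f}")

# Small made-up data set to exercise the helper itself
print(kendall_theil_line([1, 2, 3, 4], [3, 5, 7, 9]))
```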


Kendall-Theil Robust Line: Test of Significance

The test for significance of the Kendall-Theil linear relationship:

H0: m = 0

HA: m ≠ 0

The steps involved in testing the significance are:

1. Calculate S as the sum of the algebraic signs of all possible pairwise slopes.

2. Obtain the significance value from the table using S and n.

3. Decide significance.


Kendall-Theil Robust Line: Test of Significance

The number of positive slopes is 24 and the number of negative slopes is 0. Therefore,

S = 24 – 0 = 24

N = 8

Table value for (S = 24 and N = 8) = 0.0009

Two-tailed test: Significance = 2 × 0.0009 = 0.0018 (Significant)
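Step 1 of the test, computing S, might look like the sketch below. The function names and the five-point data set are invented for illustration; the one-tailed table value still has to be looked up by hand, so only the doubling for the two-tailed test is coded, using the 0.0009 quoted in the lecture.

```python
from itertools import combinations

def kendall_s(x, y):
    """S = (number of positive pairwise slopes) - (number of negative ones)."""
    s = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if xj == xi:
            continue  # no slope defined for tied x values
        slope = (yj - yi) / (xj - xi)
        s += (slope > 0) - (slope < 0)  # +1, -1, or 0 for a zero slope
    return s

def two_tailed(p_one_tailed):
    """Two-tailed significance from a one-tailed table value."""
    return min(1.0, 2 * p_one_tailed)

# Hypothetical data: 10 pairs, 9 with positive slope and 1 with negative slope
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 7, 9]
print("S =", kendall_s(x, y))

# Table value quoted in the lecture for S = 24 and N = 8
print("two-tailed significance =", two_tailed(0.0009))
```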



Confidence Interval of Y


Confidence Interval for Theil Slope

The method for calculating the confidence interval of the slope depends on sample size. For small sample sizes, tabulated values are used.

1. For small sample sizes, the table is used to find the critical value Xu having a p-value nearest to α/2.

2. This critical value is then used to compute the ranks Ru and Rl corresponding to the slope values at the upper and lower confidence limits for the slope.


Kendall-Theil Robust Line: Confidence Interval

0.412 0.595 0.729 0.739 0.750 0.787 0.795 0.812 0.817 0.839 0.856 0.882 0.890 0.937 0.985 1.000 1.000 1.010 1.038 1.053 1.063 1.077 1.220 1.228 1.393 1.500 1.897 2.222

There are 24 slopes.

Median is the average of 14th and 15th slopes, i.e., (0.937+0.985)/2 = 0.961


Kendall-Theil Robust Line: Confidence Interval

To determine a confidence interval for the slope at the 95% level of confidence (α = 0.05), the tabled critical value Xu nearest to α/2 = 0.025 for N = 8 is found to be 16 (p = 0.031).

Therefore,

Ru = (24 + 16)/2 = 20

Rl = [(24 - 16)/2] + 1 = 5
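The rank calculation on this slide reduces to two lines of arithmetic. The sketch below simply encodes the expressions as they appear in the lecture; the function and argument names are hypothetical, and the critical value Xu must still be read from the table.

```python
def theil_rank_limits(n_slopes, x_u):
    """Ranks of the slopes at the upper and lower confidence limits.

    n_slopes: number of slopes entered in the lecture's formula
    x_u: tabled critical value with p-value nearest to alpha/2
    """
    r_upper = (n_slopes + x_u) / 2
    r_lower = (n_slopes - x_u) / 2 + 1
    return r_upper, r_lower

# Values used in the lecture: 24 and Xu = 16
r_u, r_l = theil_rank_limits(24, 16)
print(f"Ru = {r_u:g}, Rl = {r_l:g}")
```

The confidence limits are then the slopes found at ranks Rl and Ru in the sorted list of pairwise slopes.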


Kendall-Theil Robust Line: Confidence Interval

0.412 0.595 0.729 0.739 0.750 0.787 0.795 0.812 0.817 0.839 0.856 0.882 0.890 0.937 0.985 1.000 1.000 1.010 1.038 1.053 1.063 1.077 1.220 1.228 1.393 1.500 1.897 2.222

Median = 0.961 with range 0.750 to 1.228

Ru = 20; Rl = 5


Kendall-Theil Robust Line: Confidence Interval

When, n 20


Regression: Non-parametric

Sen’s Slope Method


Example: Sen’s Slope Method

Net change is 1.6


Weighted Least Squares (WLS)

With WLS, each squared residual is weighted by some weight factor in such a way that observations with greater variance have lesser weight.
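To illustrate the idea, here is a minimal weighted least squares fit for a straight line, which minimizes Σ w_i·(y_i − m·x_i − c)². The function name, data and weights are made up; giving the last observation a weight of zero shows how a down-weighted point stops pulling on the line.

```python
def weighted_line(x, y, w):
    """Return (m, c) minimizing sum(w_i * (y_i - m*x_i - c)**2)."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted means
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    m = sxy / sxx
    return m, mean_y - m * mean_x

# Hypothetical data: four points on y = 2x + 1 and one outlier at x = 5
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 30]

print("equal weights:     ", weighted_line(x, y, [1, 1, 1, 1, 1]))
print("outlier weighted 0:", weighted_line(x, y, [1, 1, 1, 1, 0]))
```

With equal weights the result is the ordinary least squares line, so the outlier drags the slope upward; with the outlier weighted to zero the fit returns to the line through the other four points.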


Weighted Least Squares (WLS)

With WLS, X and Y are weighted by,

Where,

And, c is a constant, commonly taken as 3

S = the IQR of the residuals
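The weight formula itself is missing from this transcript. One widely used choice that matches the description of c and S is the bisquare weight, w_i = (1 − u_i²)² for |u_i| ≤ 1 and 0 otherwise, with u_i = e_i / (c·S). The sketch below assumes that form; it is not confirmed by the slide, and the residuals are invented.

```python
import statistics

def bisquare_weights(residuals, c=3.0):
    """Bisquare weights from residuals, scaled by c times their IQR (assumed form)."""
    q1, _, q3 = statistics.quantiles(residuals, n=4)
    s = q3 - q1  # S = the IQR of the residuals
    if s == 0:
        raise ValueError("IQR of residuals is zero; weights are undefined")
    weights = []
    for e in residuals:
        u = e / (c * s)
        # Residuals beyond c*S get zero weight; small residuals get weight near 1
        weights.append((1 - u * u) ** 2 if abs(u) <= 1 else 0.0)
    return weights

# Hypothetical residuals with one very large value
residuals = [-1.0, -0.5, 0.0, 0.5, 1.0, 20.0]
for e, w in zip(residuals, bisquare_weights(residuals)):
    print(f"residual = {e:5.1f}  weight = {w:.3f}")
```

In an iterative scheme these weights would be fed back into a weighted fit, the residuals recomputed, and the process repeated until the line stops changing.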



Smoothing

1. Smoothing is an exploratory technique, having no simple equation or significance tests associated with it.

2. The most common smooths estimate the center of the data: the conditional mean or median of Y as X changes.

3. The lack of an equation is a strength in the sense that a smooth is not constrained by some prior assumption as to the mathematical function of the relationship.


Moving Average

• It computes an average of the last m consecutive observations
• In contrast to modeling in terms of a mathematical equation, the moving average merely smooths the fluctuations in the data.
• A moving average works well when the data have
  – a fairly linear trend
  – a definite rhythmic pattern of fluctuations
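A moving average over the last m observations can be sketched as follows. The function name and the rainfall-like series are hypothetical; the first smoothed value appears only once m observations are available.

```python
def moving_average(values, m):
    """Average of the last m consecutive observations at each position."""
    if m < 1 or m > len(values):
        raise ValueError("m must be between 1 and the number of observations")
    return [
        sum(values[i - m + 1:i + 1]) / m
        for i in range(m - 1, len(values))
    ]

# Hypothetical series, for illustration only
series = [12, 15, 11, 18, 20, 17, 22, 25]
print(moving_average(series, 3))
```

A larger m gives a smoother curve but responds more slowly to real changes in the data.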


Example of Moving Average

