Reliability - Biomedical Statistics Final Report -

Student: 劉佩昀 (Student ID: 101521090); Instructor: 蔡章仁

From the perspective of classical test theory, an examinee's obtained test score (X) is composed of two components, a true score component (T) and an error component (E):

X=T+E

RELIABILITY

The true score component reflects the examinee's status with regard to the attribute that is measured by the test, while the error component represents measurement error.

Measurement error is random error. It is due to factors that are irrelevant to what is being measured by the test and that have an unpredictable (unsystematic) effect on an examinee's test score.


The score you obtain on a test is likely to be due both to the knowledge you have about the topics addressed by exam items (T) and the effects of random factors (E) such as the way test items are written, any alterations in anxiety, attention, or motivation you experience while taking the test, and the accuracy of your "educated guesses."


Whenever we administer a test to examinees, we would like to know how much of their scores reflects "truth" and how much reflects error. It is a measure of reliability that provides us with an estimate of the proportion of variability in examinees' obtained scores that is due to true differences among examinees on the attribute(s) measured by the test.


When a test is reliable, it provides dependable, consistent results and, for this reason, the term consistency is often given as a synonym for reliability (e.g., Anastasi, 1988).

Consistency = Reliability


Ideally, a test's reliability would be calculated by dividing true score variance by the obtained (total) variance to derive a reliability index. This index would indicate the proportion of observed variability in test scores that reflects true score variability.

True Score Variance/Total Variance = Reliability Index
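As a rough illustration of this ratio, the Python sketch below builds obtained scores from hypothetical true scores and errors and divides true score variance by total variance. The numbers are invented for illustration, and the errors are deliberately chosen to be uncorrelated with the true scores; with real test data T is not observable.

```python
def pvar(values):
    """Population variance (divides by n, like Excel's VARP)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical true scores (T) and random errors (E); E is uncorrelated with T.
true_scores = [10, 10, 20, 20]
errors = [-1, 1, -1, 1]

# Obtained score X = T + E for each examinee.
obtained = [t + e for t, e in zip(true_scores, errors)]

# Reliability index = true score variance / total variance = 25 / 26.
reliability_index = pvar(true_scores) / pvar(obtained)
```

Here about 96% of the variability in the obtained scores reflects true differences among the four hypothetical examinees.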

THE RELIABILITY COEFFICIENT

A test's true score variance is not known, however, and reliability must be estimated rather than calculated directly.

There are several ways to estimate a test's reliability. Each involves assessing the consistency of an examinee's scores over time, across different content samples, or across different scorers.

The common assumption underlying each of these reliability techniques is that consistent variability is true score variability, while variability that is inconsistent reflects random error.


Most methods for estimating reliability produce a reliability coefficient, which is a correlation coefficient that ranges in value from 0.0 to +1.0. When a test's reliability coefficient is 0.0, this means that all variability in obtained test scores is due to measurement error.

Conversely, when a test's reliability coefficient is +1.0, this indicates that all variability in scores reflects true score variability.


The reliability coefficient is symbolized with the letter "r" and a subscript that contains two of the same letters or numbers (e.g., $r_{xx}$).

The subscript indicates that the correlation coefficient was calculated by correlating a test with itself rather than with some other measure.


Regardless of the method used to calculate a reliability coefficient, the coefficient is interpreted directly as the proportion of variability in obtained test scores that reflects true score variability. For example, as depicted in Figure 1, a reliability coefficient of .84 indicates that 84% of variability in scores is due to true score differences among examinees, while the remaining 16% (1.00 - .84) is due to measurement error.


Figure 1. Proportion of variability in test scores: true score variability (84%) and error (16%)

Note that a reliability coefficient does not provide any information about what is actually being measured by a test!

A reliability coefficient only indicates whether the attribute measured by the test, whatever it is, is being assessed in a consistent, precise way.

Whether the test is actually assessing what it was designed to measure is addressed by an analysis of the test's validity.


The selection of a method for estimating reliability depends on the nature of the test.

Each method not only entails different procedures but is also affected by different sources of error. For many tests, more than one method should be used.

METHODS FOR ESTIMATING RELIABILITY

• Test-retest reliability: A measure of reliability obtained by administering the same test twice over a period of time to a group of individuals.

• Parallel forms reliability: A measure of reliability obtained by administering different versions of an assessment tool.

• Inter-rater reliability: A measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions.

• Internal consistency reliability: Average inter-item correlation; split-half reliability.

TYPES OF RELIABILITY

http://www.uni.edu/chfasoa/reliabilityandvalidity.htm

Split-half reliability and coefficient alpha are two methods for evaluating internal consistency. Both involve administering the test once to a single group of examinees, and both yield a reliability coefficient that is also known as the coefficient of internal consistency.

INTERNAL CONSISTENCY RELIABILITY

To determine a test's split-half reliability, the test is split into equal halves so that each examinee has two scores (one for each half of the test).

Scores on the two halves are then correlated. Tests can be split in several ways, but probably the most common way is to divide the test on the basis of odd- versus even-numbered items.
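The odd/even split described above can be sketched in Python as follows. The function names and the small 0/1 score matrix are hypothetical, chosen only to keep the arithmetic easy to follow; this is an illustration of the idea, not the worksheet used in the example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_r(item_scores):
    """item_scores: one row per examinee, one column per item."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)

# Hypothetical data: 4 examinees, 4 items scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
r_half = split_half_r(scores)  # 9/11, about 0.818
```

The result is the correlation between the two half-tests, so it describes a test of half the original length.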


Example 1: 12 students take a test with 50 questions. For each student the total score is recorded along with the sum of the scores for the even questions and the sum of the scores for the odd questions, as shown in Figure 1. Determine whether the test is reliable by using the split-half methodology.

SPLIT-HALF METHODOLOGY EXAMPLE

• The statistical test consists of looking at the correlation coefficient (cell G3 of Figure 1). If it is high then the questionnaire is considered to be reliable.

r = CORREL(C4:C15,D4:D15) = 0.667277

One problem with the split-half reliability coefficient is that, since only half the number of items is used, the reliability coefficient is reduced. To get a better estimate of the reliability of the full test, we apply the Spearman-Brown correction, namely:

$$\rho = \frac{2r}{1 + r} = \frac{2(0.667277)}{1 + 0.667277} = 0.800439$$

A problem with the split-half method is that it produces a reliability coefficient that is based on test scores that were derived from one-half of the entire length of the test.

If a test contains 30 items, each score is based on 15 items. Because reliability tends to decrease as the length of a test decreases, the split-half reliability coefficient usually underestimates a test's true reliability.

For this reason, the split-half reliability coefficient is ordinarily corrected using the Spearman-Brown prophecy formula, which provides an estimate of what the reliability coefficient would have been had it been based on the full length of the test.
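A minimal Python sketch of the Spearman-Brown prophecy formula follows. It uses the general form, in which the test is lengthened by a factor n (n = 2 for the split-half case); the function name and the `length_factor` parameter are illustrative choices, not part of the original worksheet.

```python
def spearman_brown(r, length_factor=2.0):
    """Predicted reliability when a test is lengthened by `length_factor`.

    With length_factor = 2 this is the split-half correction 2r / (1 + r).
    """
    return length_factor * r / (1 + (length_factor - 1) * r)

# Half-test correlation from the split-half example (r = 0.667277).
corrected = spearman_brown(0.667277)  # about 0.8004
```

Leaving the length unchanged (`length_factor=1`) returns the input correlation, which is a quick sanity check on the formula.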


Cronbach's coefficient alpha also involves administering the test once to a single group of examinees. However, rather than splitting the test in half, a special formula is used to determine the average degree of inter-item consistency.

One way to interpret coefficient alpha is as the average reliability that would be obtained from all possible splits of the test. Coefficient alpha tends to be conservative and can be considered the lower boundary of a test's reliability (Novick and Lewis, 1967).

When test items are scored dichotomously (right or wrong), a variation of coefficient alpha known as the Kuder-Richardson Formula 20 (KR-20) can be used.


• The Kuder and Richardson Formula 20 test checks the internal consistency of measurements with dichotomous choices. It is equivalent to performing the split-half methodology on all combinations of questions and is applicable when each question is either right or wrong. A correct answer scores 1 and an incorrect answer scores 0. The test statistic is

$$\rho_{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma^2}\right)$$

• where

k = number of questions

p_j = proportion of people in the sample who answered question j correctly

q_j = proportion of people in the sample who didn’t answer question j correctly (q_j = 1 - p_j)

σ² = variance of the total scores of all the people taking the test = VARP(R1), where R1 = array containing the total scores of all the people taking the test.
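The KR-20 calculation can be sketched in Python as below, assuming the standard form of the statistic, k/(k-1) × (1 - Σ p_j q_j / σ²), with p_j taken as the proportion answering item j correctly and σ² as the population variance of the total scores (as with VARP). The function name and the small response matrix are hypothetical.

```python
def kr20(responses):
    """responses: one row per examinee, entries 1 (correct) or 0 (incorrect)."""
    n = len(responses)      # number of examinees
    k = len(responses[0])   # number of questions

    # p_j: proportion answering item j correctly; q_j = 1 - p_j.
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)

    # Population variance of the total scores (divides by n, like VARP).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    return k / (k - 1) * (1 - sum_pq / var_total)

# Hypothetical data: 4 examinees, 3 questions.
answers = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
rho = kr20(answers)  # 1.5 * (1 - 0.625 / 1.25) = 0.75
```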

KUDER-RICHARDSON FORMULA 20

Example 1: A questionnaire with 11 questions is administered to 12 students. The results are listed in the upper part of Figure 1. Determine the reliability of the questionnaire using Kuder and Richardson Formula 20.

KUDER-RICHARDSON FORMULA 20 EXAMPLE(1/2)

Figure 1 – Kuder and Richardson Formula 20 for Example 1

The values of p in row 18 are the percentage of students who answered that question correctly.

We can calculate ρKR20 as described in Figure 2.

The value ρKR20 = 0.738 shows that the test has high reliability.

KUDER-RICHARDSON FORMULA 20 EXAMPLE(2/2)

Figure 2 – Key formulas for worksheet in Figure 1



One problem with the split-half method is that the reliability estimate obtained using any random split of the items is likely to differ from that obtained using another.

One solution to this problem is to compute the Spearman-Brown corrected split-half reliability coefficient for every one of the possible split-halves and then find the mean of those coefficients. This mean is approximately Cronbach’s alpha (the two coincide when the two halves of every split have equal variances).

Cronbach’s alpha is superior to Kuder and Richardson Formula 20 since it can be used with continuous and non-dichotomous data. In particular, it can be used for testing with partial credit and for questionnaires using a Likert scale.

CRONBACH'S COEFFICIENT ALPHA

Definition 1: Given variables $x_1, \ldots, x_k$ and $x_0 = \sum_{j=1}^{k} x_j$, Cronbach’s alpha is defined to be

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} \mathrm{var}(x_j)}{\mathrm{var}(x_0)}\right)$$
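A short Python sketch of this definition follows, assuming the usual form of alpha: k/(k-1) times one minus the ratio of the summed item variances to the variance of the total score. The function names and both data sets are hypothetical; population variances are used throughout, matching VARP.

```python
def pvar(values):
    """Population variance (divides by n, like Excel's VARP)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def cronbach_alpha(item_scores):
    """item_scores: one row per respondent, one column per item."""
    k = len(item_scores[0])
    item_vars = [pvar([row[j] for row in item_scores]) for j in range(k)]
    total_var = pvar([sum(row) for row in item_scores])  # variance of x_0
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical right/wrong data: for 0/1 items each item variance is p*q,
# so alpha reduces to KR-20.
binary = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha_binary = cronbach_alpha(binary)  # 0.75

# Hypothetical 5-point Likert responses, which KR-20 cannot handle.
likert = [[5, 4, 5], [4, 4, 3], [2, 3, 2], [1, 1, 2]]
alpha_likert = cronbach_alpha(likert)  # 27/29, about 0.931
```

Because only the ratio of variances enters the formula, using sample variances for both the items and the total would give the same alpha.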

Example 1: Calculate Cronbach’s alpha for the data in Example 1 of Kuder and Richardson Formula 20 (repeated in Figure 1 below).

CRONBACH'S COEFFICIENT ALPHA EXAMPLE(1/2)

Row 17 contains the variance for each of the questions. E.g. the variance for question 1 (cell B17) is calculated by the formula =VARP(B4:B15). Other key formulas used to calculate Cronbach’s alpha in Figure 1 are described in Figure 2.

Since the questions only have two answers, Cronbach’s alpha (0.738) is the same as the KR20 reliability calculated in Example 1 of Kuder and Richardson Formula 20.

Figure 2 – Key formulas for worksheet in Figure 1

CRONBACH'S COEFFICIENT ALPHA EXAMPLE(2/2)

Content sampling is a source of error for both split-half reliability and coefficient alpha.

For split-half reliability, content sampling refers to the error resulting from differences between the content of the two halves of the test (i.e., the items included in one half may better fit the knowledge of some examinees than items in the other half); for coefficient alpha, content (item) sampling refers to differences between individual test items rather than between test halves.


The heterogeneity of the content domain is also a source of error for coefficient alpha.

A test is heterogeneous with regard to content domain when its items measure several different domains of knowledge or behavior.


The greater the heterogeneity of the content domain, the lower the inter-item correlations and the lower the magnitude of coefficient alpha.

Coefficient alpha could be expected to be smaller for a 200-item test that contains items assessing knowledge of test construction, statistics, ethics, epidemiology, environmental health, social and behavioral sciences, rehabilitation counseling, etc. than for a 200-item test that contains questions on test construction only.
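The link between inter-item correlations and alpha can be illustrated with the standardized form of alpha, k·r̄ / (1 + (k - 1)·r̄), where r̄ is the mean inter-item correlation. This form assumes items with equal variances, and the item count and correlations below are hypothetical values picked only to show the direction of the effect.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Hypothetical 20-item tests: one homogeneous, one heterogeneous in content.
homogeneous = standardized_alpha(20, 0.30)    # about 0.896
heterogeneous = standardized_alpha(20, 0.05)  # about 0.513
```

With the number of items held fixed, the test whose items correlate less with one another ends up with the lower alpha.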


The methods for assessing internal consistency reliability are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test.

They are not appropriate for assessing the reliability of speed tests because, for these tests, they tend to produce spuriously high coefficients. (For speed tests, alternate forms reliability is usually the best choice.)


Thanks for your attention!!
