Chap2 Hypothesis Testing

53
STOCHASTIC PROCESSES, DETECTION AND ESTIMATION 6.432 Course Notes Alan S. Willsky, Gregory W. Wornell, and Jeffrey H. Shapiro Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Cambridge, MA 02139 Fall 2003

description

mtech r13

Transcript of Chap2 Hypothesis Testing

Page 1: Chap2 Hypothesis Testing

STOCHASTIC PROCESSES,DETECTION AND ESTIMATION

6.432 Course Notes

Alan S. Willsky, Gregory W. Wornell, and Jeffrey H. Shapiro

Department of Electrical Engineering and Computer Science

Massachusetts Institute of TechnologyCambridge, MA 02139

Fall 2003

Page 2: Chap2 Hypothesis Testing
Page 3: Chap2 Hypothesis Testing

2

Detection Theory, Decision

Theory, and Hypothesis

Testing

A wide variety of engineering problems involve making decisions based on a setof measurements. For instance, suppose that in a digital communications system,during a particular interval of time one of two possible waveforms is transmit-ted to signal a 0-bit or a 1-bit. The receiver then obtains a noisy version of thetransmitted waveform, and from this data must determine the bit. Of course, thepresence of noise means in general that the decision will not always be correct.However, we would like to use a decision process that is as good as possible in anappropriate sense.

As another example, this time involving air traffic control, suppose that aradar system is set up to detect the presence of an aircraft in the sky. Duringa particular time interval, a suitably designed radar pulse is transmitted, and ifan aircraft is present, this pulse reflects off the aircraft and is received back atthe ground. Hence, the presence or absence of such a “return pulse” determineswhether an aircraft (or other target) is present. Again the presence of noise in thereceived signal means that perfect detection is generally not possible.

Still other examples, sometimes quite elaborate, arise in voice and face recog-nition systems. Given a segment of voice waveform known to come from one of afinite set of speakers, one is often interested in identifying the speaker. Similarly,the problem of identifying a face from an image (i.e., spatial waveform) is alsoimportant in a number of applications.

Addressing problems of this type is the aim of detection and decision theory,and a natural framework for setting up such problems is in terms of a hypothesistest. In this framework, each of the possible scenarios corresponds to a hypothe-sis. When there are M hypotheses, we denote the set of possible hypotheses us-

55

Page 4: Chap2 Hypothesis Testing

56 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

ing {H0, H1, . . . , HM−1}.1 For each of the possible hypotheses, there is a differentmodel for the observed data, and this is what we will exploit to distinguish amongthe hypotheses.

In this chapter, we will restrict our attention to the case in which the observeddata can be represented as a K-dimensional random vector

y =[

y1 y2 · · · yK

]T, (2.1)

with scalar observations corresponding to the special case K = 1. As will becomeapparent, this case is sufficiently general for a wide range of applications, andallows the key concepts and perspectives to be developed. Extensions to obser-vations that take the form of (infinite-length) random sequences and (continuous-time) random waveforms we postpone until Chapter 6.

In many cases the valid hypothesis can be viewed as a (discrete-valued) ran-dom variable, and thus we denote it using H . That is, we can associate a prioriprobabilities

Pm = Pr [H = Hm]

with the hypothesis. Our binary communication example fits within this class,with M = 2 and the a priori probabilities typically being equal. In this case, themodel for the observed data under each hypothesis takes the form of a conditionalprobability density, i.e., py|H(y|Hm) for m = 0, 1, . . . , M−1. As we’ll see, in practicethese conditional probabilities are often specified implicitly rather than explicitly,and must be inferred from the available information.

In other cases, it is more appropriate to view the valid hypothesis not as arandom variable, but as a deterministic but unknown quantity, which we denotesimply by H . In these situations, a priori probabilities are not associated with thevarious hypotheses. The radar detection problem mentioned above is one thatis often viewed this way, since there is typically no natural notion of the a prioriprobability of an aircraft being present. For these tests, while the valid hypoth-esis is nonrandom, the observations still are, of course. In this case, the proba-bility density model for the observations is parameterized by the valid hypothesisrather than conditioned on it, so these models are denoted using py(y; Hm), form = 0, 1, . . . , M − 1. As in the random hypothesis case, these densities are alsooften specified implicitly.

This chapter explores methods applicable to both random and nonrandomhypothesis tests. However, we begin by focusing on random hypotheses to de-velop the key ideas, and restrict attention to the binary (M = 2) case.

1Note that H0 is sometimes referred to as the “null” hypothesis, particularly in asymmetricproblems where it has special significance.

Page 5: Chap2 Hypothesis Testing

Sec. 2.1 Binary Random Hypothesis Testing: A Bayesian Approach 57

2.1 BINARY RANDOM HYPOTHESIS TESTING: A BAYESIAN APPROACH

In solving a Bayesian binary hypothesis testing problem, two pieces of informationare used. One is the set of a priori probabilities

P0 = Pr [H = H0]

P1 = Pr [H = H1] = 1− P0.(2.2)

These summarize our state of knowledge about the applicable hypothesis beforeany observed data is available.

The second is the measurement model, corresponding to the probability den-sities for y conditioned on each of the hypotheses, i.e.,

H0 : py|H(y|H0)

H1 : py|H(y|H1).(2.3)

The observation densities in (2.3) are often referred to as likelihood functions. Ourchoice of notation suggests that y is continuous-valued; however, y can equallywell be discrete-valued, in which case the corresponding probability mass func-tions take the form

py|H [y|Hm] = Pr [y = y | H = Hm] . (2.4)

For simplicity of exposition, we start by restricting our attention to the continuouscase. Again, it is important to emphasize in many problems this measurementmodel information is provided indirectly, as the following example illustrates.

Example 2.1

As a highly simplified scenario, suppose a single bit of information m ∈ {0, 1} is tobe sent over a communication channel by transmitting the scalar sm, where s0 ands1 are both deterministic, known quantities. Let’s further suppose that the channelis noisy; specifically, what is received is

y = sm + w ,

where w is, independent of m, a zero-mean Gaussian random variable with varianceσ2. From this information, we can readily construct the probability density for theobservation under each of the hypotheses, obtaining:

py |H(y|H0) = N(y; s0, σ2) =

1√2πσ2

e−(y−s0)2/(2σ2)

py |H(y|H1) = N(y; s1, σ2) =

1√2πσ2

e−(y−s1)2/(2σ2).

(2.5)

In addition, if 0’s and 1’s are equally likely to be transmitted we would set the apriori probabilities to

P0 = P1 = 1/2.

Page 6: Chap2 Hypothesis Testing

58 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

2.1.1 Optimum Decision Rules: The Likelihood Ratio Test

The solution to a hypothesis test is specified in terms of a decision rule. We willfocus for the time being on deterministic decision rules. Mathematically, such adecision rule is a function H(·) that uniquely maps every possible K-dimensionalobservation y to one of the two hypotheses, i.e., H : RK → {H0, H1}. From thisperspective, we see that choosing the function H(·) is equivalent to partitioning theobservation space Y = {y} into two disjoint “decision” regions, corresponding tothe values of y for which each of the two possible decisions are made. Specifically,we use Zm to denote those values of y for which our rule decides Hm, i.e.,

Z0 = {y | H(y) = H0}Z1 = {y | H(y) = H1}.

(2.6)

These regions are depicted schematically in Fig. 2.1.

Our goal, then, is to design this bi-valued function (or equivalently the asso-ciated decision regions Z0 and Z1) in such a way that the best possible performanceis obtained. In order to do this, we need to be able to quantify the notion of “best.”This requires that we have a well-defined objective function corresponding to asuitable measure of quality. For Bayesian problems we use an objective functiontaking the form of an expected cost function. Specifically, we use

C(Hj, Hi) , Cij (2.7)

to denote the “cost” of deciding that the hypothesis is H = Hi when the correcthypothesis is H = Hj. Then the optimum decision rule takes the form

H(·) = arg minf(·)

J(f) (2.8)

where the average cost, which is referred to as the “Bayes risk,” is

J(f) = E[

C(H, f(y))]

, (2.9)

and where the expectation in (2.9) is over both y and H , and f(·) is a generic deci-sion rule.

Often, the context of the specific problem suggests how to choose the costsCij . For example, a symmetric cost function of the form Cij = 1− δ[i− j], i.e.,

C00 = C11 = 0

C01 = C10 = 1(2.10)

corresponds to seeking a decision rule that minimizes the probability of a decisionerror. However, there are many applications where such symmetric cost functionsare not well-matched. For example, in a medical diagnosis problem where H0

denotes the hypotheses that the patient does not have a particular disease and H1

Page 7: Chap2 Hypothesis Testing

Sec. 2.1 Binary Random Hypothesis Testing: A Bayesian Approach 59

{y}

Z 0

Z 1Z 1

Z 1 Figure 2.1. The regions Z0 and Z1 asdefined in (2.6) corresponding to an ex-ample decision rule H(·).

that he or she does, we would typically want to select cost assignments such thatC01 ≫ C10.2

Having chosen suitable cost assignments, we proceed to our solution by con-sidering an arbitrary but fixed decision rule f(·). In terms of this generic f(·), theBayes risk can be expanded in the form

J(f) = E[

C(H, f(y))]

= E

[

E[

C(H, f(y)) | y = y]

]

=

J(f(y),y) py(y) dy, (2.11)

withJ(H,y) = E

[

C(H, H)∣

∣y = y

]

, (2.12)

and where to obtain the second equality in (2.11) we have used iterated expecta-tion.

From the last equality in (2.11) we obtain a key insight: since py(y) is nonneg-ative, it is clear that we will minimize J if we minimize J(f(y),y) for each particularvalue of y. The implication here is that we can determine the optimum decision ruleH(·) on a point by point basis, i.e., H(y) for each y.

Let’s consider a particular (observation) point y = y∗. For this point, if wechoose the assignment

H(y∗) = H0,

2In still other problems, it is difficult to make meaningful cost assignments at all. In this case,the Neyman-Pearson framework developed later in the chapter is more natural than the Bayesianframework we develop in this section.

Page 8: Chap2 Hypothesis Testing

60 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

then our conditional expectation (2.12) takes the value

J(H0,y∗) = C00 Pr [H = H0 | y = y∗] + C01 Pr [H = H1 | y = y∗] . (2.13)

Alternatively, if we choose the assignment

H(y∗) = H1,

then our conditional expectation (2.12) takes the value

J(H1,y∗) = C10 Pr [H = H0 | y = y∗] + C11 Pr [H = H1 | y = y∗] . (2.14)

Hence, the optimum assignment for the value y∗ is simply the choice correspond-ing to the smaller of (2.13) and (2.14). It is convenient to express this optimumdecision rule using the following notation (now replacing our particular observa-tion y∗ with a generic observation y):

C00 Pr [H = H0 | y = y]

+ C01 Pr [H = H1 | y = y]

H(y)=H1

RH(y)=H0

C10 Pr [H = H0 | y = y]

+ C11 Pr [H = H1 | y = y] .(2.15)

Note that when the two sides of (2.15) are equal, then either assignment is equallygood—both have the same effect on the objective function (2.11).

A minor rearrangement of the terms in (2.15) results in

(C01 − C11) Pr [H = H1 | y = y]H(y)=H1

RH(y)=H0

(C10 − C00) Pr [H = H0 | y = y] . (2.16)

Evidently, the a posteriori probabilities, i.e., the probabilities for each of the two hy-potheses conditioned on having observed the sample value y of the random vectory, play an important role in the optimum decision rule (2.16). These probabilitiescan be readily computed from our measurement models (2.3) together with thea priori probabilities (2.2). This follows from a simple application of Bayes’ Rule,viz.,

Pr [H = Hm | y = y] =py|H(y|Hm) Pm

py|H(y|H0) P0 + py|H(y|H1) P1. (2.17)

Since for any reasonable choice of cost function the cost of an error is higherthan the cost of being correct, the terms in parentheses in (2.16) are both nonnega-tive, so we can equivalently write (2.16) in the form3

Pr [H = H1 | y = y]

Pr [H = H0 | y = y]

H(y)=H1

RH(y)=H0

(C10 − C00)

(C01 − C11). (2.18)

3Technically, we have to be careful about dividing by zero here. To simplify our exposition,however, as we discuss in Section 2.1.2, we will generally restrict our attention to the case wherethis does not happen.

Page 9: Chap2 Hypothesis Testing

Sec. 2.1 Binary Random Hypothesis Testing: A Bayesian Approach 61

When we then substitute (2.17) into (2.18) and multiply both sides by P0/P1, weobtain the decision rule in its final form, directly in terms of the measurementdensities:

L(y) ,py|H(y|H1)

py|H(y|H0)

H(y)=H1

RH(y)=H0

P0(C10 − C00)

P1(C01 − C11), η. (2.19)

The left side of (2.19) is a function of the observed data y referred to as thelikelihood ratio—which we denote using L(y)—and is constructed from the mea-surement model. The right side of (2.19)—which we denote using η—is a precom-putable threshold which is determined from the a priori probabilities and costs.The overall decision rule then takes the form of what is referred to as a likelihoodratio test (LRT).

2.1.2 Properties of the Likelihood Ratio Test

Several observations lend valuable insights into the optimum decision rule (2.19).First, note that the likelihood ratio L(·) is a scalar-valued function, i.e., L : RK → R,regardless of the dimension K of the data. In fact, L(y) is an example of what isreferred to as a sufficient statistic for the problem: it summarizes everything weneed to know about the observation vector in order to make a decision. Phraseddifferently, in terms of our ability to make the optimum decision (in the Bayesiansense in this case), knowledge of L(y) is as good as knowledge of the full datavector y itself.

We will develop the notion of a sufficient statistic in more detail in subse-quent chapters; however, at this point it suffices to make two observations withrespect to our detection problem. First, (2.19) tells us an explicit construction for ascalar sufficient statistic for the Bayesian binary hypothesis testing problem. Sec-ond, sufficient statistics are not unique. For example, the data y itself is a sufficientstatistic, albeit a trivial one. More importantly, any invertible function of L(y) isalso a sufficient statistic. In fact, for the purposes of implementation or analysis itis often more convenient to rewrite the likelihood ratio test in the form

ℓ(y) = g(L(y))H(y)=H1

RH(y)=H0

g(η) = γ, (2.20)

where g(·) is some suitably chosen, monotonically increasing function. An impor-tant example is the case corresponding to g(·) = ln(·), which simplifies many testsinvolving densities with exponential factors, such as Gaussians.

It is also important to emphasize that while L(y) is a scalar, L = L(y) is a ran-dom variable—i.e., it takes on a different value in each experiment. As such, wewill frequently be interested in its probability density function—or at least statis-tics such as its mean and variance—under each of H0 and H1. Such densities canbe derived using the method of events discussed in Section 1.5.2 of the previouschapter, and are often used in calculating system performance.

Page 10: Chap2 Hypothesis Testing

62 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

It follows immediately from the definition in (2.19) that the likelihood ratiois a nonnegative quantity. Furthermore, depending on the problem, some valuesof y may lead to L(y) being zero or infinite. In particular, the former occurs whenpy|H(y|H1) = 0 but py|H(y|H0) > 0, which is an indication that values in a neighbor-hood of y effectively cannot occur under H1 but can under H0. In this case, therewill be values of y for which we’ll effectively know with certainty that the correcthypothesis is H0. When the likelihood ratio is infinite, corresponding a divisionby zero scenario, an analogous situation exists, but with the roles of H0 and H1

reversed. These cases where such perfect decisions are possible are referred to assingular detection scenarios. In some practical problems, these scenarios do in factoccur. However, in other cases they suggest a potential lack of robustness in thedata modeling, i.e., that some source of inherent uncertainty may be missing fromthe model. In any event, to simplify our development for the remainder of thechapter we will largely restrict our attention to the case where 0 < L(y) < ∞ forall y.

While the likelihood ratio focuses the observed data into a single scalar forthe purpose of making an optimum decision, the threshold η for the test plays acomplementary role. In particular, from (2.19) we see that η focuses the relevantfeatures of the cost function and a priori probabilities into a single scalar. Further-more, this information is combined in a manner that is intuitively satisfying. Forexample, as (2.19) also reflects, an increase in P0 means that H0 is more likely, sothat η is increased to appropriately bias the test toward deciding H0 for any partic-ular observation. Similarly, an increase in C10 means that deciding H1 when H0 istrue is more costly, so η is increased to appropriately bias the test toward decidingH0 to offset this risk. Finally, note that adding a constant to the cost function (i.e.,to all Cij) has, as we would anticipate, no effect on the threshold. Hence, withoutloss of generality we may set at least one of the correct decision costs—i.e., C00 orC11—to zero.

Finally, it is important to emphasize that the likelihood ratio test (2.19) indi-rectly determines the decision regions (2.6). In particular, we have

Z0 = {y | H(y) = H0} = {y | L(y) < η}Z1 = {y | H(y) = H1} = {y | L(y) > η}.

(2.21)

As Fig. 2.1 suggests, while a decision rule expressed in the measurement dataspace {y} can be complicated,4 (2.19) tells us that the observations can be trans-formed into a one-dimensional space defined via L = L(y) where the decisionregions have a particularly simple form: the decision H(L) = H0 is made when-ever L lies to the left of some point on the line, and H(L) = H1 whenever L lies tothe right.

4Indeed, the respective sets Z0 and Z1 are not even connected in general, even for the caseK = 1.

Page 11: Chap2 Hypothesis Testing

Sec. 2.1 Binary Random Hypothesis Testing: A Bayesian Approach 63

2.1.3 Maximum A Posteriori and Maximum Likelihood Detection

An important cost assignment for many problems is that given by (2.10), which aswe recall corresponds to a minimum probability-of-error (Pr(e)) criterion. Indeed,in this case, we have

J(H) = Pr[

H(y) = H0, H = H1

]

+ Pr[

H(y) = H1, H = H0

]

= Pr(e).

The corresponding decision rule in this case can be obtained by simply specializ-ing (2.19) to obtain

py|H(y|H1)

py|H(y|H0)

H(y)=H1

RH(y)=H0

P0

P1. (2.22)

Alternatively, we can obtain additional insight by specializing the equivalenttest (2.16), from which we obtain a form of the minimum probability-of-error testexpressed in terms of the a posteriori probabilities for the problem, viz.,

Pr [H = H1 | y = y]H(y)=H1

RH(y)=H0

Pr [H = H0 | y = y] . (2.23)

From (2.23) we see that to minimize the probability of a decision error, we shouldchoose the hypothesis corresponding to the largest a posteriori probability, i.e.,

H(y) = arg maxA∈{H0,H1}

Pr [H = A | y = y] . (2.24)

For this reason, we refer to the test associated with this cost assignment as themaximum a posteriori (MAP) decision rule.

Still further simplification is possible when the hypotheses are equally likely(P0 = P1 = 1/2). In this case, (2.22) becomes simply

py|H(y|H1)H(y)=H1

RH(y)=H0

py|H(y|H0),

and thus we see that our optimum decision rule chooses the hypothesis for whichthe corresponding likelihood function is largest, i.e.,

H(y) = arg maxA∈{H0,H1}

py|H(y|A). (2.25)

This special case is referred to as the maximum likelihood (ML) decision rule. Max-imum likelihood detection plays an important role in a large number of applica-tions, and in particular is widely used in the design of receivers for digital com-munication systems.

Page 12: Chap2 Hypothesis Testing

64 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

Example 2.2

Continuing with Example 2.1, we can obtain from (2.5) that the likelihood ratio testfor this problem takes the form

L(y) =

1√2πσ2

e−(y−s1)2/(2σ2)

1√2πσ2

e−(y−s0)2/(2σ2)

H(y)=H1

RH(y)=H0

η. (2.26)

As (2.26) suggests—and as is generally the case in Gaussian problems—the naturallogarithm of the likelihood ratio is a more convenient sufficient statistic to work within this example. In this case, taking logarithms of both sides of (2.26) yields

ℓ(y) =1

2σ2

[

(y − s0)2 − (y − s1)

2]

H(y)=H1

RH(y)=H0

ln η. (2.27)

Expanding the quadratics and cancelling terms in (2.27) we obtain the test in itssimplest form, which for s1 > s0 is given by

yH(y)=H1

RH(y)=H0

s1 + s0

2+

σ2 ln η

s1 − s0, γ. (2.28)

We also remark that with a minimum probability-of-error criterion, if P0 = P1

then ln η = 0 and we see immediately from (2.27) that the optimum test takes theform

|y − s0|H(y)=H1

RH(y)=H0

|y − s1|,

which corresponds to a “minimum-distance” decision rule, i.e.,

H(y) = Hm, m = arg minm∈{0,1}

|y − sm|.

As we’ll see later in the chapter, this minimum-distance property holds in multidi-mensional Gaussian problems as well.

Note too that in this problem the decisions regions on the y-axis have a partic-ularly simple form; for example, for s1 > s0 we obtain

Z0 = {y | y < γ}Z1 = {y | y > γ}. (2.29)

In other problems—even Gaussian ones—the decision regions can be more compli-cated, as our next example illustrates.

Example 2.3

Suppose that a zero-mean Gaussian random variable has one of two possible vari-ances, σ2

1 or σ20, where σ2

1 > σ20. Let the costs and prior probabilities be arbitrary.

Then the likelihood ratio test for this problem takes the form

L(y) =

1√

2πσ21

e−y2/(2σ21)

1√

2πσ20

e−y2/(2σ20)

H(y)=H1

RH(y)=H0

η.

Page 13: Chap2 Hypothesis Testing

Sec. 2.1 Binary Random Hypothesis Testing: A Bayesian Approach 65

In this problem, it is a straightforward exercise to show that the test simplifies to oneof the form

|y|H(y)=H1

RH(y)=H0

2σ2

0σ21

σ21 − σ2

0

ln

(

ησ1

σ0

)

, γ.

Hence, the decision region Z1 is the union of two disconnected regions in this case,i.e.,

Z1 = {y | y > γ} ∪ {y | y < −γ}.

2.1.4 The Operating Characteristic of the Likelihood Ratio Test

In this section, we develop some additional perspectives on likelihood ratio teststhat will provide us with further insight on Bayesian hypothesis testing. Theseperspectives will also be important in our development of nonBayesian tests laterin the chapter.

We begin by observing that the performance of any decision rule5 H(·) maybe fully specified in terms of two quantities

PD = Pr[

H(y) = H1

∣H = H1

]

=

Z1

py|H(y|H1) dy

PF = Pr[

H(y) = H1

∣H = H0

]

=

Z1

py|H(y|H0) dy,

(2.30)

where Z0 and Z1 are the decision regions defined via (2.6). Using terminologythat originated in the radar community where H1 refers to the presence of a tar-get and H0 the absence, the quantities PD and PF are generally referred to as the“detection” and “false-alarm” probabilities, respectively (and, hence, the choice ofnotation). In the statistics community, by contrast, PF is referred to as the size ofthe test and PD as the power of the test.

It is worth emphasizing that the characterization in terms of (PD, PF ) is notunique, however. For example, any invertible linear or affine transformation ofthe pair (PD, PF ) is also complete. For instance, the pair of “probabilities of errorof the first and second kind” defined respectively via

P 1E = Pr

[

H(y) = H1

∣H = H0

]

= PF

P 2E = Pr

[

H(y) = H0

∣H = H1

]

= 1− PD , PM

(2.31)

constitute such a characterization, and are preferred in some communities thatmake use of decision theory. As (2.31) indicates, from the radar perspective the

5Note that the arbitrary decision rule we consider here need not be optimized with respectto any particular criterion—it might be, but it might also be a heuristically reasonable rule, or evena bad rule.

Page 14: Chap2 Hypothesis Testing

66 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

PD

PF0 1

1

0

η→0

η→ ∞

← increasing η

Figure 2.2. Operating characteristicassociated with a likelihood ratio test.

probability of error of the first kind is the probability of a false alarm, while prob-ability of error of the second kind is the probability of a miss, which is denoted byPM .

In general, a good decision rule (detector) is one with a large PD and a smallPF (or equivalently, small P 1

E and P 2E). However, ultimately these are competing

objectives. As an illustration of this behavior, let us examine the performance of alikelihood ratio test (2.19) when the threshold η is varied. Note that each choice of ηcompletely specifies a decision rule, with which is associated a particular (PD, PF )operating point. Hence, each value of η is associated with a single point in the PD–PF plane. Moreover, as η is varied from 0 to∞, a curve is traced out in this planeas illustrated in Fig. 2.2. This curve is referred to as the operating characteristic ofthe likelihood ratio test.

As Fig. 2.2 suggests, good PD is generally obtained at the expense of highPF , and so choosing a threshold η for a particular problem involves making anacceptable tradeoff. Indeed, as η → 0 we have (PD, PF ) → (1, 1), while as η → ∞we have (PD, PF ) → (0, 0). From this perspective, the Bayesian test represents aparticular tradeoff, and corresponds to a single point on this curve. To obtain thistradeoff, we effectively selected as our objective function a linear combination ofPD and PF . More specifically, we performed the optimization (2.8) using

J(f) = αPF − βPD + γ,

where the choice of α and β is, in turn, determined by the cost assignment (Cij ’s)and the a priori probabilities (Pm’s). In particular, rewriting (2.9) in the form

J(f) =∑

i,j

Cij Pr[

H(y) = Hi

∣H = Hj

]

Pj

Page 15: Chap2 Hypothesis Testing

Sec. 2.1 Binary Random Hypothesis Testing: A Bayesian Approach 67

we obtain

α = (C10 − C00)P0 β = (C01 − C11)P1 γ = (C00P0 + C01P1).

Let us explore a specific example to gain further insight.

Example 2.4

Let us consider the following special case of our simple scalar Gaussian detectionproblem from Example 2.1:

H0 : y ∼ N(0, σ2)

H1 : y ∼ N(m,σ2), m ≥ 0,(2.32)

which corresponds to choosing s0 = 0 and s1 = m. Specializing (2.28), we see thatthe optimum decision rule takes the form

yH(y)=H1

RH(y)=H0

m

2+

σ2 ln η

m, γ,

so that

PD =

∫ ∞

γpy |H(y|H1) dy (2.33a)

PF =

∫ ∞

γpy |H(y|H0) dy. (2.33b)

The expressions (2.33a) and (2.33b) each correspond to tail probabilities in aGaussian distribution, which are useful to express in the “standard form” describedin Section 1.6.2. In particular, we have, in terms of Q-function notation,

PD = Pr [y > γ | H = H1] = Pr

[

y −m

σ>

γ −m

σ

H = H1

]

= Q

(

γ −m

σ

)

(2.34a)

PF = Pr [y > γ | H = H0] = Pr[y

σ>

γ

σ

∣H = H0

]

= Q

σ

)

. (2.34b)

From (2.34a) and (2.34b) we see that as γ is varied, a curve is traced out in the PD–PF

plane. Moreover, this curve is parameterized by d = m/σ. The quantity d2 can beviewed as a “signal-to-noise ratio,” so that d is a normalized measure of “distance”between the hypotheses. Several of these curves are plotted in Fig. 2.3. Note thatwhen m = 0 (d = 0), the hypotheses (2.32) are indistinguishable, and the curvePD = PF is obtained. As d increases, better performance is obtained—e.g., for agiven PF , a larger PD is obtained.6 Finally, as d→ ∞, the PD–PF curve approachesthe ideal operating characteristic: PD = 1 for all PF > 0.

We will explore generalizations of these results in multidimensional Gaussianproblems later in the chapter.

6For any reasonable performance criterion, a curve above and to the left of another is alwayspreferable.

Page 16: Chap2 Hypothesis Testing

68 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False−Alarm Probability PF

Det

ecti

on

Pro

bab

ilit

y P

D

Figure 2.3. Operating characteristic ofthe likelihood ratio test for the scalarGaussian detection problem. The suc-cessively higher curves correspond tod = m/σ = 0, 1, . . . , 5.

One property of the operating characteristic of the likelihood ratio test isthat it is monotonically nondecreasing. This follows rather immediately from thestructure of these tests. In particular, let PD(η) and PF (η) be the detection andfalse-alarm probabilities, respectively, for the deterministic test associated with ageneric threshold η. Then for any η1 and η2 such that η2 > η1 we have

PD(η2) ≤ PD(η1) (2.35a)

PF (η2) ≤ PF (η1). (2.35b)

Hence,PD(η1)− PD(η2)

PF (η1)− PF (η2)≥ 0.

Later in the chapter, we will explore additional properties of the operatingcharacteristic associated with the likelihood ratio test. For example, we’ll see thatthe structure of the likelihood ratio imposes important constraints on the shape ofthis operating characteristic. Furthermore, we’ll relate the operating characteristicto the performance of other kinds of decision rules. For example, to every possi-ble decision rule for a given problem there corresponds an operating point in thePD–PF plane. We will develop ways of using the operating characteristic of thelikelihood ratio test to bound where the operating points of arbitrary rules maylie. To obtain these results, we need to take a look at decision theory from someadditional perspectives. We begin by considering a variation on the Bayesian hy-pothesis testing framework.

2.2 MIN-MAX HYPOTHESIS TESTING

As we have seen, the Bayesian approach to hypothesis testing is a natural onewhen we can meaningfully assign not only costs Cij but a priori probabilities Pm

Page 17: Chap2 Hypothesis Testing

Sec. 2.2 Min-Max Hypothesis Testing 69

as well. However, in a number of applications it may be difficult to determineappropriate a priori probabilities. Moreover, we don’t want to choose these proba-bilities arbitrarily—if we use incorrect values for the Pm in designing our optimumdecision rule, our performance will suffer.

In this section, we develop a method for making Bayesian hypothesis testingrobust with respect to uncertainty in the a priori probabilities. Our approach is toconstruct a decision rule that yields the best possible worst-case performance. Aswe will see, this corresponds to an optimization based on what is referred to as a“min-max” criterion, which is a powerful and practical strategy for a wide rangeof detection and estimation problems.

Example 2.5

As motivation, let us reconsider our simple radar or communications scenario (2.32)of Example 2.4. For this problem we showed that the decision rule that minimizesthe Bayes risk for a cost assignment {Cij} and a set of a priori probabilities {Pm}reduces to

yH(y)=H1

RH(y)=H0

m

2+

σ2

mln

[

(C10 −C00)P0

(C01 −C11)P1

]

. (2.36)

In a communication scenario the prior probabilities P0, P1 may depend on the char-acteristics of the information source and may not be under the control of the en-gineer who is designing the receiver. In the radar problem, it may be difficult toaccurately determine the a priori probability P1 of target presence.

For scenarios such as that in Example 2.5, let us assess the impact on per-formance of using a test optimized for the a priori probabilities {1−p, p} whenthe corresponding true a priori probabilities are {P0, P1}. The Bayes risk for this“mismatched” system is

J(p, P1) = C00(1−P1)+C01P1+(C10−C00)(1−P1)PF (p)−(C01−C11)P1PD(p), (2.37)

where we have explicitly included the dependence of PF and PD on p to empha-size that these conditional probabilities are determined from a likelihood ratio testwhose threshold is computed using the incorrect prior p.

The system performance in this situation has a convenient geometrical inter-pretation, as we now develop. With the notation (2.37), J(P1, P1) denotes the Bayesrisk when the likelihood ratio test corresponding to the correct priors is used. Bythe optimality properties of this latter test, we know that

J(p, P1) ≥ J(P1, P1). (2.38)

with, of course, equality if p = P1. Moreover, for a fixed p, the mismatch Bayesrisk J(p, P1) is a linear function of the true prior P1. Hence, we can conclude, asdepicted in Fig. 2.4, that when plotted as functions of P1 for a particular p, themismatch risk J(p, P1) is a line tangent to J(P1, P1) at P1 = p. Furthermore, since

Page 18: Chap2 Hypothesis Testing

70 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

J

P110

C00

C11

J(p,P )1

J(P ,P )1 1

p

Figure 2.4. Bayes risk objective func-tions for min-max tests. The solidcurve indicates the objective functionas a function of the true prior probabil-ity for a correctly matched detector—i.e., the detector has correct knowl-edge of the prior probabilities. Thedashed curve indicates the objectivefunction with mismatched detector; forthis curve the assumed prior is p.

this must be true for all choices of p, we can further conclude that the optimumrisk J(P1, P1) must be a concave function of P1 as Fig. 2.4 also reflects.

From the geometric picture of Fig. 2.4, it is clear that the performance J(p, P1)obtained using a fixed assumed prior p when the correct prior is P1, depends onthe value of P1. Moreover, because the mismatch risk is a linear function, we seethat the poorest performance, corresponding to worst case mismatch, is obtainedeither when P1 = 0 or P1 = 1, depending on the sign of the slope of the mismatchrisk function. For example, for the p shown in Fig. 2.4, this worst case performancetakes place when the true prior is P1 = 1.

Given this behavior, a conservative design strategy is to use a likelihood ratiotest based on an assumed prior p chosen so that the worst-case performance is asgood as possible. Mathematically, this corresponds to choosing the assumed prioraccording to a “min-max” criterion: the prior P1 obtained in this manner is givenby

P1 = arg minp

{

maxP1

J(p, P1)}

. (2.39)

This minimizes the sensitivity of J(p, P1) to variations in P1, and hence leads to adecision rule that is robust with respect to uncertainty in the prior P1.

The solution to this min-max problem follows readily from the geometricalpicture of Fig. 2.4. Our solution depends on on the details of the shape of J(P1, P1)over the range 0 ≤ P1 ≤ 1. There are three cases, which we consider separately.

Case 1: J(P1, P1) Monotonically Nonincreasing

An example of this case is depicted in Fig. 2.5(a). In this case, the slope of anyline tangent to J(P1, P1) is nonpositive, so for any p the maximum of J(p, P1) lies

Page 19: Chap2 Hypothesis Testing

Sec. 2.2 Min-Max Hypothesis Testing 71

at P1 = 0. Hence, as the solution to (2.39) we obtain P1 = 0. This corresponds tousing a likelihood ratio test with threshold

η =1− P1

P1

(C10 − C00)

(C01 − C11)=∞.

This detector makes the decision H = H0 regardless of the observed data, andtherefore corresponds to the operating point (PD, PF ) = (0, 0).

Case 2: J(P1, P1) Monotonically Nondecreasing

An example of this case is depicted in Fig. 2.5(b). In this case, the slope of any linetangent to J(P1, P1) is nonnegative, so for any p the maximum of J(p, P1) lies atP1 = 1. Hence, as the solution to (2.39) we obtain P1 = 1. This corresponds tousing a likelihood ratio test with threshold

η =1− P1

P1

(C10 − C00)

(C01 − C11)= 0.

This detector makes the decision H = H1 regardless of the observed data, andtherefore corresponds to the operating point (PD, PF ) = (1, 1).

Case 3: J(P1, P1) Nonmonotonic

An example of this case is depicted in our original Fig. 2.4. In this case, J(P1, P1)has a point of zero slope (and hence a maximum) at an interior point (0 < P1 < 1),so that P1 is the value of p for which the slope of J(p, P1) is zero.

The corresponding point on the operating characteristic (and in turn, implic-itly, the threshold η) can be determined geometrically. In particular, substituting(2.37) with our zero-slope condition, it follows that P1 satisfies

d

dP1

J(p, P1)

p=P1

= (C01−C00)−(C10−C00)PF (P1)−(C01−C11)PD(P1) = 0. (2.40)

Then (2.40) describes the following line in the PD–PF plane

PD =C01 − C00

C01 − C11

− C10 − C00

C01 − C11

PF , (2.41)

which has negative slope as we would expect provided correct decisions are al-ways preferable to incorrect ones, i.e., Cij > Cii for all i 6= j,

Hence, its intersection with the operating characteristic for likelihood ratiotests then determines the desired (PD, PF ) operating point, and implicitly the cor-responding threshold η. Note that to implement the test we never need to explic-itly calculate P1, just η. Furthermore, that a unique intersection (and hence operat-ing point) exists for such problems follows from the monotonicity of the operatingcharacteristic which we established at the end of Section 2.1.4.

Page 20: Chap2 Hypothesis Testing

72 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

10

C00

C11

J(P ,P )1 1

1P

(a)

10

C00

C11J(P ,P )1 1

1P

(b) Figure 2.5. Examples of possibleJ(P1, P1) curves without points of zeroslope.

Page 21: Chap2 Hypothesis Testing

Sec. 2.3 Neyman-Pearson Binary Hypothesis Testing 73

As a final comment, examining the endpoints of the curve in Fig. 2.4 we seethat for some cost assignments we can guarantee that this Case 3 will apply. Forexample, this happens when C00 = C11 = 0. In this case, we see that the particularoperating point on the likelihood ratio test operating characteristic is defined asthe point for which the cost of a miss is the same as the cost of a false alarm, i.e.,

C01(1− PD) = C10PF . (2.42)

As C01 and C10 are varied relative to one another, the likelihood ratio test thresholdη is varied accordingly.

2.3 NEYMAN-PEARSON BINARY HYPOTHESIS TESTING

Both our basic Bayesian and min-max hypothesis testing formulations require thatwe choose suitable cost assignments Cij. As we saw, these cost assignments di-rectly influence the (PD, PF ) operating point of the optimum decision rule. How-ever, in many applications there is no obvious set of cost assignments.

In this kind of situation, an optimization criterion that is frequently morenatural is to choose the decision rule so as to maximize PD subject to a constrainton the maximum allowable PF , i.e.,

maxH(·)

PD such that PF ≤ α.

This is referred to as the Neyman-Pearson criterion. Interestingly, as we will seethe likelihood ratio test also plays a key role in the solution for Neyman-Pearsonproblems. For example, for problems involving continuous-valued data, the opti-mum decision rule is again a likelihood ratio test with the threshold chosen so thatPF = α.

This result can be obtained via the following straightforward Lagrange mul-tiplier approach. As in the Bayesian case, we restrict our attention to deterministicdecision rules for the time being. To begin, let PF = α′ ≤ α, and let us considerminimizing

J(H) = (1− PD) + λ(PF − α′)

with respect to the choice of H(·), where λ is the Lagrange multiplier. To obtainour solution, it is convenient to expand J(H) in the following form

J(H) =

Z0

py|H(y|H1) dy + λ

[∫

Z1

py|H(y|H0) dy− α′

]

=

Z0

py|H(y|H1) dy + λ

[

1−∫

Z0

py|H(y|H0) dy− α′

]

= λ(1− α′) +

Z0

[

py|H(y|H1)− λpy|H(y|H0)]

dy. (2.43)

Page 22: Chap2 Hypothesis Testing

74 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

Now recall from our earlier discussion that specifying Z0 fully determinesH(·), so we can view our problem as one of determining the optimum Z0. Fromthis perspective it is clear we want to choose Z0 so that it contains precisely thosevalues of y for which the term in brackets inside the integral in (2.43) is negative,since this choice makes J(H) smallest. This statement can be expressed in the form

py|H(y|H1)− λpy|H(y|H0)H(y)=H1

RH(y)=H0

0,

which, in turn, corresponds to a likelihood ratio test; specifically,

py|H(y|H1)

py|H(y|H0)

H(y)=H1

RH(y)=H0

λ (2.44)

where λ is chosen so that PF = α′. It remains only to determine α′. However,since the operating characteristic of the likelihood ratio test PD is a monotonicallyincreasing function of PF , the best possible PD is obtained when we let α′ = α.

In summary, we have seen that the optimum deterministic decision rules forBayesian, Min-max, and Neyman-Pearson hypothesis testing all take the form oflikelihood ratio tests with suitably chosen thresholds. Because such tests arise asthe solution to problems with such different criteria, a popular folk theorem is thatall reasonable optimization criteria lead to decision rules that can be describedin terms of likelihood ratio tests. While this statement is a difficult one to makeprecise, for most criteria that have been of interest in practice this statement hasbeen true.

And although we have restricted our attention to the case of continuous-valued data, the likelihood ratio test also plays a central role in problems involvingdiscrete-valued data as we will see beginning in Section 2.5. Moreover, for M-aryhypothesis testing problems with M > 2, the solutions can be described in termsof natural generalizations of our likelihood ratio tests. Before we proceed explorethese topics, however, let us examine the structure of likelihood ratio tests for anumber of basic detection problems involving Gaussian data.

2.4 GAUSSIAN HYPOTHESIS TESTING

A large number of detection and decision problems take the form of binary hy-pothesis tests in which the data are jointly Gaussian under each hypothesis. Inthis section, we explore some of the particular properties of likelihood ratio testsfor these problems. The general scenario we consider takes the form

H0 : y ∼ N(m0,Λ0)

H1 : y ∼ N(m1,Λ1).(2.45)

Page 23: Chap2 Hypothesis Testing

Sec. 2.4 Gaussian Hypothesis Testing 75

Note that y could be a random vector obtained from some array of sensors, or itcould be a collection of samples obtained from a random discrete-time signal y [n],e.g.,

y =[

y [0] y [1] · · · y [N − 1]]T

. (2.46)

For the hypotheses (2.45), and provided Λ0 and Λ1 are nonsingular, the like-lihood ratio test takes the form

L(y) =

1(2π)N/2|Λ1|1/2 exp

[

−12(y −m1)

TΛ−11 (y −m1)

]

1(2π)N/2|Λ0|1/2 exp

[

−12(y −m0)TΛ−1

0 (y −m0)]

H(y)=H1

RH(y)=H0

η. (2.47)

Some algebraic manipulation allows a simpler sufficient statistic to be ob-tained for the problem, which corresponds to the test

1

2yT[

Λ−10 −Λ−1

1

]

y

+ yT[

Λ−11 m1 −Λ−1

0 m0

]

H(y)=H1

RH(y)=H0

lnη +1

2ln (|Λ1|/|Λ0|)

+1

2

[

mT1 Λ−1

1 m1 −mT0 Λ−1

0 m0

]

.

(2.48)

Further simplification of (2.48) can be obtained in various important specialcases. We illustrate two of these via examples.

Example 2.6

Suppose in a spread-spectrum binary communication system, each bit is repre-sented by a code sequence of length K for transmission. Specifically, a 1-bit issignaled via the sequence m1[n], and a 0-bit via m0[n], where n = 0, 1, . . . ,K − 1.At the detector we obtain the following noise corrupted version of the transmittedsequence

H0 : y [n] = m0[n] + w [n]

H1 : y [n] = m1[n] + w [n],(2.49)

where under each hypothesis the noise w [n] is a sequence of independent, identi-cally distributed (IID) Gaussian random variables with mean zero and variance σ2.Collecting the observed data y [n] for n = 0, 1, . . . ,N − 1 into a vector y according to(2.46), we obtain

H0 : y ∼ N(m0, σ2I)

H1 : y ∼ N(m1, σ2I),

(2.50)

where

m0 ,[

m0[0] m0[1] · · · m0[K − 1]]T

m1 ,[

m1[0] m1[1] · · · m1[K − 1]]T

.(2.51)

Defining∆m =

[

∆m[0] ∆m[1] · · · ∆m[K − 1]]T

= m1 −m0, (2.52)

we see from simplifying (2.48) that a sufficient statistic for making the optimumBayesian decision at the detector (regardless of the cost and a priori probability as-signments) is

ℓ(y) = yT ∆m =

K−1∑

n=0

y[n]∆m[n], (2.53)

Page 24: Chap2 Hypothesis Testing

76 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

and the test is

ℓ(y)H(y)=H1

RH(y)=H0

η′ , σ2 ln η +1

2

[

mT1 m1 −mT

0 m0

]

. (2.54)

Note that the correlation computation (2.53) that defines ℓ(y) is equivalent toa “convolution and sample” operation. Specifically, it is a straightforward exerciseto verify that ℓ can be expressed in the form

ℓ = (y[n] ∗ h[n])∣

n=K−1, (2.55)

where the filter unit-sample response is defined via

h[n] =

0 n ≥ K

∆m[K − 1− n] 0 ≤ n ≤ K − 1

0 n ≤ −1.

(2.56)

This detector is an example of a matched filter, a concept we will explore in detailin Chapter 6. Whether the direct computation of ℓ as the correlation (2.53) or itscomputation via (2.55) is more efficient depends on the particular implementation.It is also straightforward to obtain expressions for the performance of this detector;we leave these as an exercise.

To develop further insight, let us consider the generalization of Example 2.6to the case of an arbitrary noise covariance, so our hypotheses are

H0 : y ∼ N(m0,Λ)

H1 : y ∼ N(m1,Λ),(2.57)

where Λ = Λ1 = Λ0 is the covariance matrix. In this case the likelihood ratio testsimplifies to

(y −m0)TΛ−1(y −m0)

H(y)=H1

RH(y)=H0

(y −m1)TΛ−1(y −m1) + 2 ln η (2.58)

or

ℓ′(y) , (m1−m0)TΛ−1y

H(y)=H1

RH(y)=H0

1

2

(

2 ln η + mT1 Λ−1m1 −mT

0 Λ−1m0

)

, η′. (2.59)

Useful insights are obtained from a geometrical interpretation of our results.To develop this geometry, we begin by defining an inner product7

〈x,y〉 = xTΛ−1y (2.60)

and the associated induced norm

‖y‖ =√

〈y,y〉 =

yTΛ−1y. (2.61)

7We leave it as an exercise for you to verify that we have defined a valid inner product.

Page 25: Chap2 Hypothesis Testing

Sec. 2.4 Gaussian Hypothesis Testing 77

Then, our optimum decision rule (2.59) corresponds to comparing a projection

ℓ′(y) = 〈y, ∆m〉 (2.62)

against a threshold, where, again, ∆m = m1 −m0.

Specializing further, note that in the case of a minimum probability of errorcost assignment and equally likely hypotheses, it is clear from using (2.61) with(2.58) that the maximum likelihood decision rule is equivalent to a minimum dis-tance rule, i.e.,

H(y) = Hm where m = arg minm∈{0,1}

‖y −mm‖,

where our distance metric is defined via our norm (2.61). In addition, via (2.59),when ‖m0‖ = ‖m1‖, solving for the minimum distance is equivalent to solving forthe maximum projection; in particular, specializing (2.59) we obtain

H(y) = Hm where m = arg maxm∈{0,1}

〈y,mm〉 ,

where our inner product is that defined in (2.62). In our vector space {y}, thecorresponding decisions regions are separated by a hyperplane equidistant fromm0 and m1 and perpendicular to the line connecting them.

The performance of these likelihood ratio tests can be readily calculated sinceℓ′(y) is a linear function of the Gaussian random vector y under each hypothesis.As a result, ℓ′(y) is Gaussian under each hypothesis. Specifically,

H0 : ℓ′ ∼ N(m′0, σ

2ℓ′)

H1 : ℓ′ ∼ N(m′1, σ

2ℓ′),

(2.63)

where

m′m = 〈∆m,mm〉 (2.64)

σℓ′ = ‖m1 −m0‖. (2.65)

From here our performance evaluation is identical to the scalar case. In particular,we have

PD = Pr [ℓ′ ≥ η′ | H = H1]

= Pr

[

ℓ′ −m′1

σℓ′≥ η′ −m′

1

σℓ′

H = H1

]

= Q

(

η′ −m′1

σℓ′

)

(2.66a)

PF = Pr [ℓ′ ≥ η′ | H = H0]

= Pr

[

ℓ′ −m′0

σℓ′≥ η′ −m′

0

σℓ′

H = H0

]

= Q

(

η′ −m′0

σℓ′

)

. (2.66b)

While computing actual values of PD and PF requires that we evaluate the Q (·)function numerically, we can obtain bounds on these quantities that are explic-itly computable. For example, using (1.129) and (1.128) with (2.66) we obtain, for

Page 26: Chap2 Hypothesis Testing

78 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

thresholds in the range m′0 < η′ < m′

1,

PD ≥ 1− 1

2exp

[

−1

2

(

η′ −m′1

σℓ′

)2]

(2.67a)

PF ≤1

2exp

[

−1

2

(

η′ −m′0

σℓ′

)2]

. (2.67b)

As in Example 2.6, the sufficient statistic for this problem can be imple-mented either directly as a correlator, or as a matched filter. In the latter case, thefilter h[n] captures the relevant information about the signal and the noise, takingthe form

h[n] =

0 n ≥ K

∆m′[K − 1− n] 0 ≤ n ≤ K − 1

0 n ≤ −1,

(2.68)

where

∆m′ =[

∆m′[0] ∆m′[1] · · · ∆m′[K − 1]]T

= Λ−1(m1 −m0). (2.69)

Let us next consider an example of a different class of Gaussian detectionproblems.

Example 2.7

Suppose that we are trying to detect a random signal x [n] at an antenna based onobservations of the samples n = 0, 1, . . . K − 1. Depending on whether the signal isabsent or present, the noisy observations y [n] take the form

H0 : y [n] = w [n]

H1 : y [n] = x [n] + w [n],(2.70)

where under both hypotheses w [n] is an IID sequence of N(0, σ2) random variablesthat is independent of the sequence x [n].

If x [n] was a known signal x[n], then our optimum detector would be that de-termined in Example 2.6 with ∆m[n] = x[n]; i.e., our sufficient statistic is obtainedby correlating our received signal y[n] with x[n]. However, in this example we as-sume we know only the statistics of x [n]—specifically, we know that

x =[

x [0] x [1] · · · x [K − 1]]T ∼ N(0,Λx)

where Λx is the signal covariance matrix.In this case, we have that the likelihood ratio test (2.48) simplifies to the test

ℓ(y) , yTx(y)H(y)=H1

RH(y)=H0

2σ2 ln

(

η|σ2I + Λx|1/2

σK

)

, γ (2.71)

wherex(y) = Λx

[

σ2I + Λx

]−1y =

[

x[0] x[1] · · · x[K − 1]]T

. (2.72)

Page 27: Chap2 Hypothesis Testing

Sec. 2.5 Tests with Discrete-Valued Observations 79

This detector has a special interpretation as our notation suggests. In partic-ular, as will become apparent in Chapter 3, (2.72) is, in fact, what will be referredto as the Bayes least-squares estimate of x based on an observation of y. Hence, inthis example the sufficient statistic is obtained by correlating the received signal y[n]with an estimate x[n] of the unknown signal x[n]. Note, however, that the thresholdγ in (2.71) is different from that in the known signal case. Furthermore, it shouldbe emphasized that sufficient statistics for more general detection problems involv-ing unknown or partially known signals do not always have such interpretations.We explore such issues further in Chapter 6 in the context of joint detection andestimation.

Note that evaluation of the performance of the optimum decision rule is lessstraightforward than was the case in Example 2.6. As is apparent from (2.71) and(2.72), the sufficient statistic in this problem is no longer has a Gaussian distributionbut rather a chi-squared distribution of degree K, i.e., ℓ ∼ χ2

K . As a result, we cannotget simple expressions for PD and PF in terms of the Q (·) function. It is tempting toobtain approximations to PD and PF by directly exploiting a central limit theoremargument to approximate ℓ as Gaussian when K is reasonably large. However, aproblem with this strategy is that the resulting Gaussian approximations are poorestin the tails of the distribution, yet this is typically the regime of interest for PF andPD calculations. However, useful bounds and approximations for these quantitiesare obtained via a technique referred to as the Chernoff bound. Such bounds haveproven useful in a wide range of engineering problems over the last several decades.

2.5 TESTS WITH DISCRETE-VALUED OBSERVATIONS

Thus far we have focussed in this chapter on the case of observation vectors ythat are specifically continuous-valued. However, there are many important deci-sion and detection problems that involve inherently discrete-valued observations.Conveniently, the preceding theory carries over to the discrete case in a largelystraightforward manner, with the likelihood ratio test continuing to play a centralrole, as we now develop. However, there are at least some important differences,which we will emphasize.

2.5.1 Bayesian Tests

With the notation (2.4), the optimum Bayesian decision rule takes a form analo-gous to that for the case of continuous-valued data; specifically,

L(y) ,py|H [y|H1]

py|H [y|H0]

H(y)=H1

RH(y)=H0

P0(C10 − C00)

P1(C01 − C11), η. (2.73)

The derivation of this result closely mimics that for the case of continuous-valueddata, the verification of which we leave as an exercise for the reader.

Page 28: Chap2 Hypothesis Testing

80 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

A couple of special issues arise in the case of likelihood ratio tests of theform (2.73). In particular, when the observations are discrete-valued, the like-lihood function L(y) is also discrete-valued; for future convenience, we denotethese values by 0 ≤ η0 < η1 < η2 < · · · . This property has some significantconsequences. First, it means that for many sets of costs and a priori probabilityassignments, the resulting η formed in (2.73) will often not coincide with one ofthe possible values of L = L(y), i.e., we will often have η 6= ηi for all i. In suchcases, the case of equality in the likelihood ratio test will not arise, and the min-imum Bayes risk is achieved by a likelihood ratio test that corresponds, as usual,to a unique (PD, PF ) point on the associated operating characteristic.

However, for some choices of the costs and a priori probabilities, the resultingthreshold η will satisfy η = ηi for some particular i. This means that, unlike withcontinuous-valued data, in this case equality in the likelihood ratio test will occurwith nonzero probability. Nevertheless, it can be readily verified that the Bayesrisk is the same no matter how a decision is made in this event. Hence, whenequality occurs, the decision can still be made arbitrarily, but these choices willcorrespond to different points on the operating characteristic.

2.5.2 The Operating Characteristic of the Likelihood Ratio Test, and Neyman-Pearson Tests

Let us more generally examine the form of the operating characteristic associatedwith the likelihood ratio test in the case of discrete-valued observations. In partic-ular, let us begin by sweeping the threshold η in (2.73) from 0 to∞ and examiningthe (PD, PF ) values that are obtained. As our development in the last section re-vealed, in order to ensure that each threshold η maps to a unique (PD, PF ), weneed to choose an arbitrary but fixed convention for handling the case of equalityin (2.73). For this purpose let us associate equality with the decision H(y) = H1,expressing the likelihood ratio test in the form

L(y) =py|H [y|H1]

py|H [y|H0]

H(y)=H1

≥<

H(y)=H0

η. (2.74)

Before developing further results, let’s explore a specific example to illustratesome of the key ideas.

Example 2.8

Suppose in an optical communication system bits are signaled by turning on andoff a laser. At the detector, the measured photon arrival rate is used to determinewhether the laser is on or off. When the laser is off (0-bit), the photons arrive accord-ing to a Poisson process with average arrival rate m0; when the laser is on (1-bit),the rate is m1 with m1 > m0. Suppose during a bit period we count the number

Page 29: Chap2 Hypothesis Testing

Sec. 2.5 Tests with Discrete-Valued Observations 81

of photons y that arrive and use this observed data to make a decision. Then thelikelihood functions for this decision problem are

py |H [y|Hi] = Pr [y = y | H = Hi] =

myi e

−mi

y!y = 0, 1, . . .

0 otherwise.

In this example the likelihood ratio (2.74) leads to the test

L(y) =

(

m1

m0

)y

e−(m1−m0)

H(y)=H1

≥<

H(y)=H0

η,

which further simplifies to

y

H(y)=H1

≥<

H(y)=H0

ln η + (m1 −m0)

ln(m1/m0), γ. (2.75)

The discrete nature of the this hypothesis testing problem means that the op-erating characteristic associated with the likelihood ratio test (2.75) is a discrete col-lection of points rather than a continuous curve of the type we encountered in anearlier example involving Gaussian data. Indeed, while the left-hand side of (2.75)is integer-valued, the right side is not in general. As a result, we have that PD andPF are given in terms of γ by the expressions

PD = Pr [y ≥ γ | H = H1] = Pr [y ≥ ⌈γ⌉ | H = H1] =∑

y≥⌈γ⌉

my1e

−m1

y!(2.76a)

PF = Pr [y ≥ γ | H = H0] = Pr [y ≥ ⌈γ⌉ | H = H0] =∑

y≥⌈γ⌉

my0e

−m0

y!. (2.76b)

The resulting operating characteristic is depicted in Fig. 2.6. In Figure 2.6, onlythe isolated (PD, PF ) points indicated by circles are achievable by simple likelihoodratio tests. For example, the uppermost point is achieved for all γ ≤ 0, the nexthighest point for all 0 < γ ≤ 1, the next for 1 < γ ≤ 2, and so on. This behavior isrepresentative of such discrete decision problems.

As we discussed in Section 2.5.1, for Bayesian problems the specific cost anda priori probability assignments determine a threshold η in (2.74), which in turncorresponds to one of the isolated points comprising the operating characteristic.

For Neyman-Pearson problems with discrete observations, it is also straight-forward to show that a likelihood ratio test of the form (2.74) is the optimum de-terministic decision rule. Moreover, as we might expect, for this rule the thresholdη is chosen so as to achieve the largest PD subject to our contraint on the maximumallowable PF (i.e., α). When α corresponds to at least one of the discrete points ofthe operating characteristic, then the corresponding PD indicates the achievabledetection probability. More typically, however, α will lie strictly between the PF

Page 30: Chap2 Hypothesis Testing

82 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

False−Alarm Probability PF

Det

ecti

on

Pro

bab

ilit

y P

D

Figure 2.6. Operating characteristicassociated with the likelihood ratio testfor the Poisson detection problem. Thephoton arrival rates under H0 and H1

are m0 = 2 and m1 = 4, respectively.The circles mark points achievable bylikelihood ratio tests.

values of points on the operating characteristic. In this case, the appropriate oper-ating point corresponds to the (PD, PF ) whose PF is the largest PF that is smallerthan α. This operating point then uniquely specifies the decision rule.

While this decision rule is a straightforward extension of the result for conti-nuous-valued observations, it is possible to develop a more sophisticated decisionrule that typically yields at least somewhat better performance as measured bythe Neyman-Pearson criterion. To see this, we first note that the straightforwardlikelihood ratio test defined above was obtained as the optimum decision ruleamong all possible deterministic tests. Specifically, the likelihood ratio test was thebest decision rule of the form

H(y) =

{

H0 y ∈ Z0

H1 y ∈ Z1

where Z0 and Z1 are mutually exclusive, collectively exhaustive sets in the obser-vation space {y}. For these rules, each observation y maps to a unique decisionH(y).

When the regular likelihood ratio test cannot meet the false-alarm constraintwith equality, better performance can be achieved by employing a decision rulechosen from outside the class of deterministic rules. In particular, if we allow forsome randomness in the decision process, so that a particular observation does notalways produce the same decision, we can obtain improved detection performancewhile meeting our false-alarm constraint. This can be accomplished as follows.Consider the sequence of thresholds η0, η1, . . . that correspond to values that thelikelihood ratio can take on, and denote the corresponding operating points by(PD(ηi), PF (ηi)) for i = 0, 1, . . . . Determine ı such that ηı is the threshold value thatresults in the likelihood ratio test with the smallest false-alarm probability that isgreater than α. Then as illustrated in Fig. 2.7, PF (ηı) and PF (ηı+1) “bracket” α:using ηı+1 results in test with the largest false-alarm probability that is less than α.

Page 31: Chap2 Hypothesis Testing

Sec. 2.5 Tests with Discrete-Valued Observations 83

PD

PFα

ηî+1

ηî

Figure 2.7. Achievable operatingpoints for a randomization betweentwo likelihood ratio tests, one withthreshold ηı and the other withthreshold ηı+1.

Now consider the following randomized decision rule. We flip a biased coin forwhich the probability of “heads” is p and that of “tails” is 1−p. If “heads” turnsup, we use the likelihood ratio test with threshold ηı; if “tails” turns up, we usethe likelihood ratio test with threshold ηı+1. Then the resulting test achieves

PD = pPD(ηı) + (1− p)PD(ηı+1)

PF = pPF (ηı) + (1− p)PF (ηı+1),(2.77)

and corresponds to a point on the line segment connecting (PD(ηı), PF (ηı)) and(PD(ηı+1), PF (ηı+1)). This line segment is indicated with the dashed line segmentin Fig. 2.7. In particular, as p is varied from 0 to 1, the operating point moves from(PD(ηı+1), PF (ηı+1)) to (PD(ηı), PF (ηı)). Hence, by choosing p appropriately, wecan achieve the false-alarm probability constraint with equality, i.e., there exists ap such that

PF = pPF (ηı) + (1− p)PF (ηı+1) = α.

Since the operating characteristic is a monotonically increasing function, it followsthat PD(ηı) > PD(ηı+1), which in turn implies that p is the value of p yielding thelargest possible PD.

Our randomization argument allows us to conclude that all points on theoperating characteristic as well as all points on the line segments connecting oper-ating characteristic points (adjacent or not) can be achieved via randomized likeli-hood ratio tests. Referring to Example 2.8 in particular, this means that all operat-ing points on the piecewise linear dashed line in Fig. 2.6 can be achieved, allowingany false alarm probability constraint α to be met with equality.

Page 32: Chap2 Hypothesis Testing

84 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

While we have developed randomized likelihood ratio tests as a randomchoice among a pair of likelihood ratio tests, they can be implemented in otherways that can sometimes be more convenient. To see this, note that for the ran-domized rule above, both likelihood ratio tests lead to the decision H1 when L(y) ≥ηı+1. Hence, regardless of the outcome of the coin flip, the decision will be H1 forthe associated values of y. Similarly, if L(y) ≤ ηı−1 then both tests lead to thedecision H0, so the randomized rule will decide H0 regardless of the outcome ofthe coin flip. Hence, only when L(y) = ηı does the randomization play a role. Inthis case, if “heads” comes up, the decision will be H1, while if “tails” comes up,the decision will be H0. We can summarize this implementation of the test in thefollowing form:

Pr[

H(y) = H1

∣y = y

]

=

1 L(y) ≥ ηı+1

p L(y) = ηı

0 L(y) ≤ ηı−1

. (2.78)

Our results in this section have suggested that at least in some problemsinvolving discrete-valued data and the Neyman-Pearson criterion a randomizedtest can lead to better performance than a deterministic test. This observation, inturn, raises several natural and important questions. For example, we considereda very particular class of randomized tests—specifically, a simple random choicebetween the outcomes of two (deterministic) likelihood ratio tests. Would someother type of randomized test be able to perform better still? And could somemore general form of randomized test be able to improve performance in the caseof continuous-valued data with Bayesian or Neyman-Pearson criteria? To answerthese questions, in the next section we develop decision rules optimized over of abroad class of randomized tests.

2.6 RANDOMIZED TESTS

For a randomized test, the decision rule is a random function of the data, whichwe denote using H(·). Hence, even for a deterministic argument y, the decisionH(y) is a random quantity. However, H(y) has the property that conditioned onknowledge of y, the function is independent of the hypothesis H . Such a test isfully described by the probabilities

Q0(y) = Pr[

H(y) = H0

∣y = y

]

= Pr[

H(y) = H0

∣y = y, H = Hi

]

Q1(y) = Pr[

H(y) = H1

∣y = y

]

= Pr[

H(y) = H1

∣y = y, H = Hi

]

,(2.79)

where, of course, Q0(y) + Q1(y) = 1.

Page 33: Chap2 Hypothesis Testing

Sec. 2.6 Randomized Tests 85

With this notation, we see that deterministic rules are a special case, corre-sponding to

Q1(y) =

{

1 y ∈ Z1

0 y ∈ Z0

. (2.80)

Moreover, it also follows immediately that tests formed by a random choice amongtwo likelihood ratio tests, such as were considered in Section 2.5.2 are also specialcases. For example, the test described via (2.78) corresponds to

Q1(y) =

1 L(y) ≥ ηı+1

p L(y) = ηı

0 L(y) ≤ ηı−1.

(2.81)

More generally, for a randomized test that corresponds to the random choice be-tween two likelihood ratio tests with respective thresholds η1 and η2 such thatη2 > η1, and where the probability of selecting the first test is p, it is straightfor-ward to verify that

Q1(y) =

1 L(y) ≥ η2

p η1 ≤ L(y) < η2

0 L(y) < η1.

(2.82)

From our general characterization (2.79), we see that specifying a random-ized decision rule is equivalent to specifying Q0(·)—or Q1(·). Hence, determiningthe optimum randomized test for a given performance criterion involves solvingfor the optimum mapping Q0(·). Using this approach, we develop the optimumrandomized tests for Bayesian and Neyman-Pearson hypothesis testing problemsin the next two sections, respectively. As we will see this allows us to draw someimportant conclusions about when randomized tests are—and are not—needed.

2.6.1 Bayesian Case

We begin by establishing a more general version of our Bayesian result for thecase of randomized tests. We consider the case of continuous-valued data; thederivation in the discrete case is analogous. We begin by writing our Bayes risk inthe form

J(Q0) =

J(y) py(y) dy,

where

J(y) = E[

C(H, H(y))∣

∣y = y

]

.

Page 34: Chap2 Hypothesis Testing

86 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

Again we see it suffices to minimize J(y) for each y. Applying, in turn, (2.79) andBayes’ Rule, we can write J(y) in the form

J(y) =∑

i,j

Cij Pr[

H = Hj , H(y) = Hi

∣y = y

]

=∑

i,j

Cij Pr [H = Hj | y = y] Pr[

H = Hi

∣y = y

]

=∑

i,j

Cij Qi(y)Pj py|H(y|Hj)

py(y), (2.83)

from which, with

∆(y) , C10

P0py|H(y|H0)

py(y)+ C11

P1py|H(y|H1)

py(y), (2.84)

we obtain

J(y) = ∆(y) + Q0(y)py|H(y|H0)

py(y)P1(C01 − C11)

[

py|H(y|H1)

py|H(y|H0)− P0(C10 − C00)

P1(C01 − C11)−]

= ∆(y) + Q0(y)py|H(y|H0)

py(y)P1(C01 − C11) [L(y)− η]

(2.85)

using the notation defined in (2.19). From (2.85) we can immediately concludethat J(y) is minimized over 0 ≤ Q0(y) ≤ 1 by choosing Q0(y) = 0 when theterm in brackets is positive and Q0(y) = 1 when the term in brackets is negative.But this is precisely the (deterministic) likelihood ratio test (2.19) we developedearlier! Hence, for Bayesian problems, we can conclude that a deterministic testwill always suffice. Furthermore, although our derivation has been for the case ofcontinuous-valued data, the same conclusion is reached in the discrete case.

2.6.2 Neyman-Pearson Case

We next establish a more general version of our Neyman-Pearson result for thecase of randomized tests. We begin with the case of continuous-valued data. As inthe deterministic case, we follow a Lagrange multiplier approach, expressing ourobjective function as

J(Q0) = 1− PD + λ(PF − α′)

= λ(1− α′) + Pr[

H(y) = H0

∣H = H1

]

− λ Pr[

H(y) = H0

∣H = H0

]

(2.86)

for some α′ ≤ α. This time, though, (2.86) expands as

J(Q0) = λ(1− α′) +

Pr[

H(y) = H0

∣y = y

]

[

py|H(y|H1)− λpy|H(y|H0)]

dy

Page 35: Chap2 Hypothesis Testing

Sec. 2.6 Randomized Tests 87

from which we obtain, using the definition of L(y),

J(Q0) = λ(1− α′) +

Q0(y)[

py|H(y|H1)− λpy|H(y|H0)]

dy

= λ(1− α′) +

Q0(y)[

L(y)− λ]

py|H(y|H0) dy. (2.87)

Since 0 ≤ Q0(y) ≤ 1, the objective function (2.87) is minimized by setting Q0(y) =0 for all values of y such that the term in braces is positive, and Q0(y) = 1 for allvalues of y such that the term in braces is negative; i.e.,

Q0(y) =

{

0 L(y) > λ

1 L(y) < λ.(2.88)

Hence, we can conclude that our optimum rule is almost a simple (deterministic)likelihood ratio test. In particular, at least except when L(y) = λ we have that

H(y) =

{

H1 L(y) > λ

H0 L(y) < λ.(2.89)

It remains only to determine what the nature of the decision (i.e., Q0(y))when L(y) = λ and the choice of λ. These quantities are determined by meet-ing the constraint PF = α′. Specifically,

PF =

[1−Q0(y)] py|H(y|H0) dy

=

{y|L(y)>λ}

py|H(y|H0) dy +

{y|L(y)=λ}

[1−Q0(y)] py|H(y|H0) dy

= Pr [L(y) > λ | H = H0] +

{y|L(y)=λ}

[1−Q0(y)] py|H(y|H0) dy, (2.90)

where the second equality follows from substituting for Q0(y) using (2.88). Ob-serve that PF = 1 if λ < 0, so it suffices to restrict our attention to λ ≥ 0. Further-more, note that the first term in (2.90) is a nonincreasing function of λ.

Now since y is a continuous-valued random variable, then the second term in(2.90) is zero and the remaining term, which is a continuous of λ, can be chosen sothat it equals α′. In this case, it does not matter how we choose Q0(y) when L(y) =λ, so the optimum randomized rule degenerates to a deterministic likelihood ratiotest again. It remains only to show that for optimum PD we want α′ = α. Givenour preceding results, it suffices to exploit the fact that for likelihood ratio tests PD

is a monotonically nondecreasing function of PF .

Page 36: Chap2 Hypothesis Testing

88 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

Discrete Case

When the data is discrete, some important differences arise, which we now ex-plore. Proceeding as we did at the outset of Section 2.6.2, we obtain

J(Q0) = λ(1− α′) +∑

y

Q0(y)[

py|H [y|H1]− λpy|H [y|H0]]

= λ(1− α′) +∑

y

Q0(y)[

L(y)− λ]

py|H [y|H0]. (2.91)

from which we analogously conclude that for L(y) 6= λ we have (2.88). Hence, itremains only to determine λ and the decision when L(y) = λ from the false alarmconstraint.

An important difference from the continuous case is that when y is discrete-valued, so that L(y) is also discrete-valued, the first term in (2.90) is not a continu-ous function of λ, but is piecewise constant.

As before, let us denote the values that L(y) takes on by η0, η1, η2, . . . , where0 < η0 < η1 < η2 < · · · . And let us choose ı so that λ = ηı is the smallest thresholdsuch that

Pr [L(y) > λ | H = H0] = Pr [L(y) ≥ ηı+1 | H = H0] ≤ α′.

Then we can obtain PF = α′ by choosing

Q1(y) = 1−Q0(y)

appropriately for all y such that L(y) = ηı. In particular, 0 < Q1(·) < 1 in thisrange must be chosen so that

α′ − Pr [L(y) ≥ ηı+1 | H = H0] =∑

{y|L(y)=ηı}

Q1(y) py|H [y|H0]

= q Pr [L(y) = ηı | H = H0] (2.92)

whereq , E [Q1(y) | L(y) = ηı] . (2.93)

As we would expect, our decision probabilities Q1(·) when L(y) = ηı appear in(2.92) only through q, so it suffices to appropriately select the latter.

For q = 0, the resulting decision rule is the deterministic test

L(y)

H(y)=H1

≥<

H(y)=H0

ηı+1. (2.94)

More generally, when q > 0, we have that the decision rule involves a randomchoice between the decision rule (2.94) and

L(y)

H(y)=H1

≥<

H(y)=H0

ηı. (2.95)

Page 37: Chap2 Hypothesis Testing

Sec. 2.7 Properties of the Likelihood Ratio Test Operating Characteristic 89

In particular, the test (2.95) is chosen with probability q, and the test (2.94) withprobability 1− q. In terms of the operating characteristic of the likelihood ratiotest, this is a point on the line segment connecting the points corresponding to thetwo deterministic tests (2.94) and (2.95). This, of course, is precisely the form ofthe heuristically designed randomized test we explored at the end of Section 2.5.2,which we now see is optimal. As a result, when we include randomized tests,it makes sense to redefine the operating characteristic for the discrete case as theisolate likelihood ratio test operating points together with the line segments thatconnect these discrete points.

It remains only to verify that for optimum PD we want α′ = α in the discretecase as well. However, for likelihood ratio tests involving discrete data, the PD

values form a nondecreasing sequence as a function of PF . Then since the perfor-mance of randomized tests corresponds to points on the line segments connectingthe performance points associated with deterministic tests, PD is a nondecreasingfunction of PF for our more general class of randomized tests as well.

In summary, optimum Neyman-Pearson decision rules always take the formof either a deterministic rule in the form of a likelihood ratio test or a randomizedrule in the form of a simple randomization between two likelihood ratio tests. Inboth cases, the likelihood ratio test and its associated operating characteristic playa central role. Accordingly, we explore their properties further.

2.7 PROPERTIES OF THE LIKELIHOOD RATIO TEST OPERATING CHARACTERISTIC

By exploiting the special role that the likelihood ratio test plays in both determin-istic and randomized optimum decisions rules, we can develop a number of keyproperties of the PD–PF operating characteristic associated with the likelihood ra-tio test. For future reference, recall that for continuous-valued data the test takesthe form

L(y) =py|H(y|H1)

py|H(y|H0)

H(y)=H1

RH(y)=H0

η, (2.96)

while for discrete-valued data the form is

L(y) =py|H [y|H1]

py|H [y|H0]

H(y)=H1

≥<

H(y)=H0

η. (2.97)

We emphasize at the outset that the detailed shape of the operating charac-teristic is determined by the measurement model for the data—for example, bypy|H(y|H0) and py|H(y|H1) in the continuous-case—since it is this information thatis used to construct the likelihood ratio L(y). However, all operating character-istics share some important characteristics in common, and it is these that we ex-plore in this section. As a simple example, which was mentioned earlier, we have

Page 38: Chap2 Hypothesis Testing

90 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

that the (PD, PF ) points (0, 0) and (1, 1) always lie on the operating characteristic,and correspond to η →∞ and η → 0, respectively.

It is also straightforward to verify that PD ≥ PF , i.e., that the operating char-acteristic always lies above the diagonal in the PD–PF plane. This can be verifiedusing a randomization argument. In particular, suppose our decision rule ignoresthe data y and bases its decision solely on the outcome of a biased coin flip, wherethe probability of “heads” is p. If the coin comes up “heads” we make the decisionH(y) = H1, while if it comes up “tails” we make the decision H(y) = H0. Then forthis rule we have

PD = Pr[

H(y) = H1

∣H = H1

]

= Pr[

H(y) = H1

]

= p

PF = Pr[

H(y) = H1

∣H = H0

]

= Pr[

H(y) = H1

]

= p,

which corresponds to a point on the diagonal in the PD–PF plane. Hence, wecan achieve all points on the diagonal by varying the bias p in the coin. How-ever, for a given PF this decision rule cannot provide a better PD than the cor-responding point on the operating characteristic; otherwise, this would contra-dict our Neyman-Pearson result that this operating characteristic defines the bestachievable PD for a given PF .8 Hence, we conclude PD ≥ PF .

Randomization arguments play an important role in establishing other prop-erties of likelihood ratio test operating characteristics as well. For example, wecan also use such an argument to establish that the operating characteristic is aconcave function. To see this, let (PD(η1), PF (η1)) and (PD(η2), PF (η2)) be pointson the operating characteristic corresponding to two arbitrary thresholds η1 andη2, respectively. Then the operating characteristic is concave if the points on thestraight line segment joining these two points invariably lie below (or on) the op-erating characteristic. However, the points on this straight line segment can beparameterized according to

(PD, PF ) = (pPD(η1) + (1− p)PD(η2), pPF (η1) + (1− p)PF (η2)), (2.98)

where 0 ≤ p ≤ 1 is the parameter. Moreover, the points (2.98) can be achieved viathe following randomized test: a biased coin is flipped, and if it turns up “heads,”the likelihood ratio test with threshold η1 is used; otherwise, the likelihood ratiotest with threshold η2 is used. Again by varying the bias p in our coin, we canachieve all the points on the line segment. Hence, from our Neyman-Pearson re-sult, no part of the likelihood ratio test operating characteristic between PF (η1) andPF (η2) can lie below this line segment. Thus, since the endpoints were arbitrary,we conclude that the operating characteristic is concave.

Let’s consider one final property, which applies to likelihood ratio test op-erating characteristics associated with continuous-valued data. In particular, we

8Indeed, we wouldn’t expect to obtain better performance by ignoring the data y!

Page 39: Chap2 Hypothesis Testing

Sec. 2.7 Properties of the Likelihood Ratio Test Operating Characteristic 91

show that at those points where it is defined, the slope of the operating character-istic is numerically equal to the corresponding threshold η, i.e.,

dPD

dPF= η. (2.99)

Note that since η ≥ 0, this is another way of verifying that the operating charac-teristic is nondecreasing. A proof is as follows. With

Z1(η) = {y | L(y) > η} (2.100)

we have

PD(η) =

Z1(η)

py|H(y|H1) dy,

which after applying the definition of the likelihood function (2.96) yields

PD(η) =

Z1(η)

L(y) py|H(y|H0) dy. (2.101)

Next, note that with u(·) denoting the unit step function, i.e.,

u(x) =

{

1 x > 0

0 otherwise,

we have

u(L(y)− η) =

{

1 y ∈ Z1(η)

0 otherwise. (2.102)

In turn, using the result (2.102) in (2.101) we obtain

PD(η) = E [u(L(y)− η) L(y) | H = H0] = E [u(L− η) L | H = H0] =

∫ ∞

η

L pL|H(L|H0) dL

(2.103)

Finally, differentiating (2.103) with respect to η then yields

dPD

dη= −η pL|H(η|H0). (2.104)

However, since

PF =

∫ ∞

η

pL|H(L|H0) dL

we knowdPF

dη= −pL|H(η|H0). (2.105)

Hence, dividing (2.104) by (2.105) we obtain (2.99).

Page 40: Chap2 Hypothesis Testing

92 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

2.7.1 Achievable Operating Points

Having determined some of the structure of the operating characteristic, let usnext explore more generally how to use this characteristic to define what oper-ating points can be achieved by any test—deterministic or randomized—in thePD–PF plane. First, we note that every point between the operating characteristicand the diagonal (PD = PF ) can be achieved by a simple randomized test. To seethis it suffices to recognize that every point in this region lies on some line con-necting two points on the operating characteristic. Hence, every such point can beachieved by a simple randomization between two likelihood ratio tests having thecorresponding thresholds, using a coin with suitable bias p.

The test that achieves a given operating point in this region need not beunique, however. To illustrate this, let η0 be the threshold corresponding to a par-ticular point on the operating characteristic. Then if for our decision rule we usea random choice (using a coin with bias p) between the outcome of this likelihoodratio test and that of the “guessing rule” that achieves an arbitrary point on thediagonal (using a different coin with bias q). Hence, by choosing η0 > 0, 0 ≤ p ≤ 1,and 0 ≤ q ≤ 1, appropriately, our “doubly-randomized” test can also achieve anydesired point in the region of interest.

Let us next consider which points below the diagonal are achievable. Thiscan be addressed via a simple “rule-reversal” argument. For this we require thefollowing notion of a reversed test. If H(·) describes a deterministic decision rule,

then the corresponding reversed rule, which we denote using H(·), is simply onewhose decisions are made as follows:

H(y) =

{

H0 H(y) = H1

H1 H(y) = H0

.

More generally, if Q0(·) describes a randomized decision rule, then the correspond-ing decision rule, which we denote using Q0(·), is defined via

Q0(y) = Q1(y) = 1−Q0(y).

If a deterministic or randomized test achieves the operating point (PD, PF ) =(β, α), then it is easy to verify that the corresponding reversed test achieves theoperating point (PD, PF ) = (1− β, 1− α), i.e.,

Pr[

H(y) = H1

∣H = H1

]

= Pr[

H(y) = H0

∣H = H1

]

= 1− β

Pr[

H(y) = H1

∣H = H0

]

= Pr[

H(y) = H0

∣H = H0

]

= 1− α.

Using this property, it follows that those points lying on the curve correspond-ing to the operating characteristic reflected across the PD = 1/2 and PF = 1/2lines are achievable by likelihood ratio tests whose decisions are reversed. In turn,all points between the diagonal and this “reflected operating characteristic” are

Page 41: Chap2 Hypothesis Testing

Sec. 2.8 M -ary Hypothesis Testing 93

PD

PF0 1

1

0Figure 2.8. Achievable region in thePD–PF plane.

achievable using a suitably designed randomized test. Hence, as illustrated inFig. 2.8 we can conclude that all points within the region bounded by the oper-ating characteristic and reversed operating characteristic curves can be achievedusing a suitably designed decision rule.

We can also establish the converse result: that no decision rule—deterministicor randomized—can achieve points outside this region. That no (PD, PF ) pointabove the operating characteristic can be achieved follows from our Neyman-Pearson results. That no point below the reflected operating characteristic can beachieved (including (PD, PF ) = (0, 1)!) follows as well, using a “proof-by-contra-diction” argument. In particular, if such a point could be achieved, then so couldits reflection via a reversed test. However, this reflection would then lie above theoperating characteristic, which would contradict the Neyman-Pearson optimalityof the likelihood ratio test.

2.8 M-ARY HYPOTHESIS TESTING

Thus far we have focussed on the case of binary hypothesis testing in this chap-ter. From this investigation, we have developed important insights that apply todecision problems involving multiple hypotheses more generally. However, somespecial considerations and issues arise in the more general M-ary hypothesis test-ing problem. We explore a few of these issues in this section, but emphasize thatour treatment is an especially introductory one.

Page 42: Chap2 Hypothesis Testing

94 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

To simplify our discussion, we restrict our attention to the Bayesian problemformulation with continuous-valued data and deterministic decision rules. Ac-cordingly, the scenario we consider is as follows:

1. There are M hypotheses H0, H1, ..., HM−1 with associated a priori probabilitiesPi = Pr [H = Hi].

2. We have a set of costs of the form Cij, which corresponds to the cost of de-ciding H = Hi when H = Hj is true.

3. Our (generally vector) measurements y are characterized by the set of densi-ties py|H(y|Hi).

For this problem, we explore a procedure for choosing one of the M hy-potheses based on the observed data y so as to minimize the associated expectedcost. This corresponds to designing an M-valued decision rule H(·) : RK →{H0, H1, . . . , HM−1}. By analogy to the binary case, we begin by noting that ify = y and H(y) = Hm, then the expected cost is

J(Hm,y) =

M−1∑

m=0

Cmm Pr [H = Hm | y = y] . (2.106)

Consequently, given the observation y = y we want to choose m to minimize(2.106), i.e.,

H(y) = Hm where m = arg minm∈{0,1,...,M−1}

M−1∑

j=0

Cmj Pr [H = Hj | y = y] . (2.107)

Let’s explore several aspects of this decision rule. We begin by considering aspecial case.

2.8.1 Special Case: Minimum Probability-of-Error Decisions

If all possible errors are penalized equally, i.e.,

Cij =

{

0 i = j

1 i 6= j, (2.108)

then (2.107) becomes

H(y) = Hm where m = arg minm∈{0,1,...,M−1}

j 6=m

Pr [H = Hj | y = y] , (2.109)

Exploiting the fact that∑

j 6=m

Pr [H = Hj | y = y] = 1− Pr [H = Hm | y = y]

Page 43: Chap2 Hypothesis Testing

Sec. 2.8 M -ary Hypothesis Testing 95

we obtain that (2.109) can be equivalently described in the form

H(y) = Hm where m = arg maxm∈{0,1,...,M−1}

Pr [H = Hm | y = y] . (2.110)

Hence, as in the binary (M = 2) case, the minimum probability-of-error decisionrule (2.110) chooses the hypothesis having the largest a posteriori probability. As aresult, for arbitrary M this is referred to as the maximum a posteriori (MAP) rule.

Also as in the binary case, rules such as this can typically be manipulatedinto simpler forms for actual implementation. For example, from Bayes’ rule wehave that

Pr [H = Hm | y = y] =py|H(y|Hm) Pm

M−1∑

j=0

py|H(y|Hj) Pj

. (2.111)

Since the denominator is, as always, a normalization constant independent of m,we can multiply (2.110) by this constant, yielding

H(y) = Hm where m = arg maxm∈{0,1,...,M−1}

py|H(y|Hm) Pm. (2.112)

If the Pm are all equal, it follows immediately that (2.112) can be simplified to themaximum likelihood (ML) rule

H(y) = Hm where m = arg maxm∈{0,1,...,M−1}

py|H(y|Hm). (2.113)

To illustrate the application of these results, let’s explore a minimum probability-of-error rule in the context of simple Gaussian example.

Example 2.9

Suppose that y is a K-dimensional Gaussian vector under each of the M hypotheses,with

py|H(y|Hm) = N(y;mm, I). (2.114)

In this case, applying (2.112)—and recognizing that we may work with the loga-rithm of the quantity we are maximizing—yields

m = arg maxm∈{0,1,...,M−1}

[

−K

2log(2π) − 1

2(y −mm)T(y −mm) + log Pm

]

. (2.115)

This can be simplified to

m = arg maxm∈{0,1,...,M−1}

[

ℓm(y)− 1

2mT

mmm + log Pm

]

(2.116)

whereℓm(y) = 〈mm,y〉 , i = 0, 1, . . . ,M − 1, (2.117)

with the inner product and associated norm defined by

〈x,y〉 = xTy

Page 44: Chap2 Hypothesis Testing

96 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

and

‖y‖ =√

〈y,y〉 =√

yTy.

Consequently, the required data processing consists of the correlation compu-tations in (2.117) followed by the comparisons in (2.116). Note that if, in addition, thehypotheses are equally likely and have equal signal energy—i.e., ‖mm‖2 is the samefor all hypotheses—(2.115) and (2.116) simplify further to the minimum-distancerule

H(y) = Hm where m = arg maxm∈{0,1,...,M−1}

ℓm(y)

= arg minm∈{0,1,...,M−1}

‖y −mm‖. (2.118)

2.8.2 Structure of the General Bayesian Decision Rule

The Bayesian decision rule in the M-ary case is a natural generalization of thecorresponding binary rule. To see this, we substitute Bayes’ rule (2.111) into thegeneral rule (2.107) and multiply through by the denominator in (2.111) to obtainthe following rule

H(y) = Hm where m = arg minm∈{0,1,...,M−1}

M−1∑

j=0

Cmj Pjpy|H(y|Hj). (2.119)

Simplifying further by dividing by py|H(y|H0) and defining likelihood ratios

Lj(y) =py|H(y|Hj)

py|H(y|H0), j = 1, 2, . . . , M − 1, (2.120)

we obtain the rule

H(y) = Hm where m = arg minm∈{0,1,...,M−1}

[

Cm0P0 +M−1∑

j=1

CmjPjLj(y)

]

. (2.121)

Let us illustrate the resulting structure of the rule for the case of three hy-potheses. In this case, minimization inherent in the rule (2.121) requires accessto a subset of the results from three comparisons, each of which eliminates one

Page 45: Chap2 Hypothesis Testing

Sec. 2.8 M -ary Hypothesis Testing 97

H(y)=Hˆ1

H(y)=Hˆ0

H(y)=Hˆ2

L (y)2

L (y)1

Figure 2.9. Decision Regions for theM -ary Bayesian hypothesis tests. Theline segment separating the decisionH0 from H1 (including its dashedextension) corresponds to the test(2.122a); that separating the decisionH1 from H2 corresponds to the test(2.122b); and that separating the deci-sion H0 from H2 corresponds to the test(2.122c)

hypothesis. Rearranging terms in each of these comparisons yields the following:

P1(C01 − C11)L1(y)

H(y) = H1 or H2

(i.e., H(y)6=H0)

RH(y) = H0 or H2

(i.e., H(y)6=H1)

P0(C10 − C00) + P2(C12 − C02)L2(y) (2.122a)

P1(C11 − C21)L1(y)

H(y) = H2 or H0

(i.e., H(y)6=H1)

RH(y) = H1 or H0

(i.e., H(y)6=H2)

P0(C20 − C10) + P2(C22 − C12)L2(y) (2.122b)

P1(C21 − C01)L1(y)

H(y) = H0 or H1

(i.e., H(y)6=H2)

RH(y) = H2 or H1

(i.e., H(y)6=H0)

P0(C00 − C20) + P2(C02 − C22)L2(y) (2.122c)

The decision rule corresponding to (2.122) has the graphical interpretationdepicted in Fig. 2.9. In particular, equality in each of the three relationships (2.122)determines a linear (affine) relationship between L1(y) and L2(y). If we plot theselines in the L1–L2 plane, values on one side correspond to one inequality direction,and values on the other side to the other inequality. Furthermore, it is straightfor-ward to verify that the three straight lines intersect at a single point—it suffices tosum the three left-hand and right-hand sides of (2.122).

In this three hypothesis case, we have seen that the optimum decision rule(2.121) can be can be reduced to a set of three comparisons (2.122) involving linearcombinations of two statistics, L1(y) and L2(y), which are computed from the data.More generally, examining (2.121) we can conclude that for M hypotheses withM > 3, the minimization involves access to a subset of

(

M2

)

= M(M − 1)/2 com-

Page 46: Chap2 Hypothesis Testing

98 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

parisons of linear combinations of the M−1 statistics L1(y), L2(y), . . . , LM−1(y).Each of these comparisons looks at the relative merits of choosing one particu-lar hypothesis over a second specific hypothesis. For example, in the three hy-pothesis case, (2.122a) compares H0 and H1 (see Fig. 2.9). This comparison estab-lishes whether the point (L1(y), L2(y)) lies above or below the line correspondingto (2.122a) including its extension indicated by a dashed line in the figure. Oncethis determination is made, one of the other comparisons yields the final decision.For example, if comparison (2.122a) tells us that (L1(y), L2(y)) lies above the asso-ciated line, we would then use (2.122b) to decide between H1 and H2.

It is important to emphasize that the structure of the decision rule can beviewed as a process of eliminating one hypothesis at a time. Since we need onlyeliminate M − 1 of the hypotheses, only M − 1 of the comparisons need be used.However, we need to have all comparisons available, since the set of M − 1 com-parisons that get used depends upon the actual value of the observation. For ex-ample, if (2.122a) tells us that (L1(y), L2(y)) lies below the line, we would then use(2.122c) to decide between H0 and H2. Finally, each of these two-way comparisonshas something of the flavor of a binary hypothesis testing problem. An importantdifference, however, is that in deciding between, say, H0 and H1, we need to takeinto account the possibility that the actual hypothesis is H2. For example, if thecost C02 of deciding H0 when H2 is correct is much greater than the cost C12 ofdeciding H1 when H2 is correct, the decision rule accounts for this through a term(namely the last one in (2.122a)) in the comparison of H0 and H1 that favors H1

over H0. To be more specific, let us rewrite (2.122a) as

L1(y)H(y) = H1 or H2

RH(y) = H0 or H2

P0(C10 − C00)

P1(C01 − C11)+

P2(C12 − C02)

P1(C01 − C11)L2(y), (2.123)

and note that if the last term in (2.123) were not present, this would be exactlythe binary hypothesis test for deciding between H0 and H1. Assuming that C01 >C11 (i.e., that it is always more costly to make a mistake than to be correct) andthat C12 < C02, we see that the last term is negative, biasing the comparison infavor of H1. Furthermore, this bias increases as L2(y) increases, i.e., when the dataindicates H2 to be more and more likely.

2.8.3 Performance Analysis

Let us next explore some aspects of the performance of the optimum decision rulefor M-ary Bayesian hypothesis testing problems. Generalizing our approach fromthe binary case, we begin with an expression for the expected cost obtained byenumerating all the possible scenarios (i.e., deciding H(y) = Hi when H = Hj iscorrect for all possible values of i and j):

E [C] =M−1∑

i=0

M−1∑

j=0

Cij Pr[

H(y) = Hi

∣H = Hj

]

Pj. (2.124)

Page 47: Chap2 Hypothesis Testing

Sec. 2.8 M -ary Hypothesis Testing 99

Hence, the key quantities to be computed are the decision probabilities

Pr[

H(y) = Hi

∣H = Hj

]

. (2.125)

However, since these probabilities must sum over i to unity, only M − 1 probabil-ities to be calculated for each j. Thus, in total, M(M − 1) of these quantities needto be calculated. Note that this is consistent with the binary case (M = 2) in whichonly two quantities need to be calculated, viz.,

PF = Pr[

H(y) = H1

∣H = H0

]

and PD = Pr[

H(y) = H1

∣H = H1

]

.

In the more general case, however, not only does the number of quantitiesto be computed increase with M , but also the complexity of each such calculationas well. Specifically, as we have seen, reaching the decision H(y) = Hi involves aset of comparisons among M − 1 statistics. For example, in the 3-hypothesis casePr[H(y) = Hm | H = H0] equals the integral over the region marked “H(y) = Hm”in Fig. 2.9 of the joint distribution for L1(y) and L2(y) conditioned on H = H0.Unfortunately, calculating these kinds of multidimensional integrals over such re-gions are cumbersome and can generally only be accomplished numerically. In-deed, there are few simplifications even in the Gaussian case.

Example 2.10

Let us return to Example 2.9 where the densities of y under the various hypothesesare given by (2.114). For simplicity, let us assume that the hypotheses are equallylikely and the signals have equal energy so that the optimum decision rule is givenby (2.118). Let

ℓ(y) =

ℓ0(y)ℓ1(y)

...ℓM−1(y)

= My (2.126)

whereM =

[

m0 m1 · · · mM−1

]T. (2.127)

Then, since y is Gaussian under each hypothesis, so is ℓ(y). In fact, from (2.113)we see that the second-moment statistics of the log likelihood ratio are

E [ℓ | H = Hj] = Mmj =

mT0 mj

mT1 mj...

mTM−1mj

(2.128)

and

Λℓ|H=Hj= E

[

(ℓ− E [ℓ | H = Hj]) (ℓ− E [ℓ | H = Hj])T]

= MMT

=

mT0 m0 mT

0 m1 · · · mT0 mM−1

mT1 m0 mT

1 m1 · · · mT1 mM−1

......

. . ....

mTM−1m0 mT

M−1m1 · · · mTM−1mM−1

(2.129)

Page 48: Chap2 Hypothesis Testing

100 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

so thatpℓ|H(l|Hj) = N(l;Mmj ,MMT) (2.130)

In this case

Pr[

H(y) = Hi

∣H = Hj

]

= Pr [ℓi ≥ ℓm, m = 0, 1, . . . ,M − 1 | H = Hj] (2.131)

which is an M -dimensional integral of the distribution in (2.130) over the set inwhich the ith coordinate is at least as large as all of the others.

Since exact calculation of performance quantities such as those in Exam-ple 2.10 is generally impractical, approximation techniques are often employed.We next consider a class of approximation techniques, based on the union boundmentioned in Section 1.2 and its generalizations, that are frequently useful in suchproblems.

The Union Bound in Performance Calculations

In order to simply our development, let us restrict our attention to the costs asso-ciated with the minimum probability-of-error criterion. Specifically, suppose thatthe Cij are given by (2.108), and in addition that the hypotheses are equally likely.In this case (2.124) becomes

E [C] = Pr(e) =1

M

M−1∑

j=0

Pr[

H(y) 6= Hj

∣H = Hj

]

(2.132)

Furthermore, from (2.110) we see that the event

Ej = {H(y) 6= Hj}is a union, for k 6= j, of the events

Ekj = {Pr [H = Hk | y] > Pr [H = Hj | y]}. (2.133)

Therefore

Pr[

H(y) 6= Hj

∣H = Hj

]

= Pr [Ej | H = Hj]

= Pr

[

k 6=j

Ekj

H = Hj

]

=∑

k 6=j

Pr [Ekj | H = Hj]

−∑

k 6=ji6=ji6=k

Pr [Ekj ∩ Eij | H = Hj]

+∑

k 6=j, i6=j, n 6=jk 6=i, k 6=n, i6=n

Pr [Ekj ∩ Eij ∩ Enj | H = Hj]

− · · · (2.134)

Page 49: Chap2 Hypothesis Testing

Sec. 2.8 M -ary Hypothesis Testing 101

where to obtain the last equality in (2.134) we have used the natural generalizationof the equality (1.5).9

Eq. (2.134) leads us to a natural approximation strategy. Specifically, it fol-lows that

Pr

[

k 6=j

Ekj

H = Hj

]

≤∑

k 6=j

Pr [Ekj | H = Hj] (2.135)

since the sum on the right-hand side adds in more than once the probabilities ofintersection of the Ekj.

The bound (2.135) is referred to as the union bound and is in fact the simplest(and loosest) of a sequence of possible bounds. Specifically, while (2.135) tells usthat the first term on the right of (2.134) is an upper bound to the desired proba-bility, the first two terms together are a lower bound10, the first three terms againform another (somewhat tighter) upper bound, etc. It is the first of these bounds,however, that leads to the simplest computations and is of primary interest, yield-ing

Pr(e) ≤ 1

M

M−1∑

j=0

k 6=j

Pr [Ekj | H = Hj] . (2.136)

The union bound is widely used in bit error rate calculations for communicationsapplications.

We conclude this section by illustrating the application of this bound in thecontext of an example.

Example 2.11

Again, we return to Example 2.9 and its continuation 2.10. In this case, via (2.118)or, equivalently, (2.131)), we have

Ekj = {ℓk > ℓj} = {ℓk − ℓj > 0}so that

Pr(e) ≤ 1

M

M−1∑

j=0

k 6=j

Pr [ℓk − ℓj > 0 | H = Hj] (2.137)

From our earlier calculations we see that

pℓk−ℓj |H(l|Hj) = N(

l; (mk −mj)Tmj , (mk −mj)

T(mk −mj))

, (2.138)

so that Pr [ℓk − ℓj > 0 | H = Hj] is a single error function calculation.

9For example, we have

Pr(A ∪B ∪ C) = Pr(A) + Pr(B) + Pr(C)− Pr(A ∩B)− Pr(A ∩ C)− Pr(B ∩ C) + Pr(A ∩B ∩C).

10To see this, note that we’ve subtracted out probabilities of intersections of pairs of the Ekj

but in the process have now missed probabilities of intersections of three of the Ekj together.

Page 50: Chap2 Hypothesis Testing

102 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

As a specific numerical example, suppose that there are three hypotheses and

m1 =

100

, m2 =

010

, m3 =

001

In this case, for any k 6= j we have

pℓk−ℓj |H(l|Hj) = N(l;−1, 2) (2.139)

so that

Pr(e) ≤ 2

∫ ∞

0

1√4π

e−(l+1)2/4 dl = 2Q

(

1√2

)

(2.140)

2.8.4 Alternative Geometrical Interpretations

It is also possible to develop alternative geometrical interpretations of the opti-mum Bayesian decision rule (2.107). These interpretations lend additional insightinto the structure of these tests. In this section, we explore one such alternativeinterpretation. To begin, we define the conditional probability vector

π(y) =

Pr [H = H0 | y = y]Pr [H = H1 | y = y]

...Pr [H = HM−1 | y = y]

, (2.141)

and note that all of the components of π(y) are nonnegative and sum to unity.As depicted in Fig. 2.10, the sets of all such probability vectors form a line whenM = 2 and a plane when M = 3. Let us also define a set of cost vectors ci, eachof which consists of the set of possible costs associated with making a particulardecision, viz.,

ci =

Ci0

Ci1...

Ci,M−1

. (2.142)

With this new notation, the optimal decision rule (2.107) can be expressed in theform

H(y) = Hm where m = arg minm∈{0,1,...,M−1}

cTmπ(y), (2.143)

or, equivalently,

H(y) = Hm if for all m we have (cm − cm)Tπ(y) ≤ 0 (2.144)

As before, this rule takes the form of a set of comparisons: for each i and kwe have

(ck − ci)Tπ(y) < 0 =⇒ H(y) 6= Hi (2.145a)

(ck − ci)Tπ(y) > 0 =⇒ H(y) 6= Hk (2.145b)

Page 51: Chap2 Hypothesis Testing

Sec. 2.8 M -ary Hypothesis Testing 103

π2

π1

π +π = 11 2

(0,1)

(1,0)(a)

π3

π1

(0,0,1)

(1,0,0)π2

(0,1,0)

π +π +π = 11 2 3

(b)Figure 2.10. Geometry optimum M -ary Bayesian decision rules in π-space.

Page 52: Chap2 Hypothesis Testing

104 Detection Theory, Decision Theory, and Hypothesis Testing Chap. 2

H(y)=Hˆ1

H(y)=Hˆ3

H(y)=Hˆ2

(0,0,1)

(1,0,0) (0,1,0)Figure 2.11. Optimum decision re-gions with triangle of Fig. 2.10(b).

From the vector space geometry developed in Section 1.7, we see that each of theequations (ck − ci)

Tπ(y) = 0 defines a subspace that separates the space into twohalf-spaces corresponding to (2.145a) and (2.145b). By incorporating all of thesecomparisons, the set of subspaces (ck − ci)

Tπ(y) = 0 for all choices of k and ipartition the space into the optimum decision regions. The set of possible proba-bility vectors, which is a subset of the space, is therefore partitioned into decisionregions. For example, for M = 3, there are three planes through the origin inFig. 2.10(b) that partition the space. In Fig. 2.11 we illustrate what this partitioninglooks like restricted to the triangle in Fig. 2.10(b) of possible probability vectors.

2.9 RANDOM AND NONRANDOM HYPOTHESES, AND SELECTING TESTS

Throughout this chapter, recall that we have assumed that the hypotheses Hm wereinherently outcomes of a random variable H . Indeed, we described the probabil-ity density for the data y under each hypothesis Hm as conditional densities ofthe form py|H(y|Hm). And, in addition, with each Hm we associated an a prioriprobability Pm for the hypothesis.

However, it is important to reemphasize that in many problems it may not beappropriate to view the hypotheses as random—the notion of a priori probabilitiesmay be rather unnatural. Rather, as we discussed at the outset of the chapter, thetrue hypothesis H may be a completely deterministic but unknown quantity. Inthese situations, it often makes more sense to view the density for the observeddata not as being conditioned on the unknown hypothesis but rather as beingparameterized by the unknown hypothesis. For such tests it is then appropriate to

Page 53: Chap2 Hypothesis Testing

Sec. 2.9 Random and Nonrandom Hypotheses, and Selecting Tests 105

use the notationH0 : y ∼ py(y; H0)

H1 : y ∼ py(y; H1),(2.146)

which makes this parameterization explicit.

Hence, we can distinguish between three different classes of hypothesis test-ing problems:

1. The true hypothesis is random, and the a priori probabilities for the possibil-ities are known.

2. The true hypothesis is random, but the a priori probabilities for the possibili-ties are unknown.

3. The true hypothesis is nonrandom, but unknown.

For the first case, we can reasonably apply the Bayesian method provided wehave a suitable means for assigning costs to the various kinds of decision errors.When we cannot meaningfully assign costs, it is often more appropriate to applya Neyman-Pearson formulation for the problem.

In the second case, we cannot apply the Bayesian method directly. How-ever, provided suitable cost assignments can be made, we can apply the min-maxmethod to obtain a decision rule that is robust with respect to the unknown priorprobabilities. When suitable cost assignments cannot be made, we can use theNeyman-Pearson approach in this case as well.

Finally, in the third case, corresponding to hypothesis being nonrandom, theNeyman-Pearson is natural approach. Our development in this case proceeds ex-actly as it does in the random hypothesis case, except for our modified notationfor the densities. As a result, the optimum decision rule is a likelihood ratio test,possibly randomized in the case of discrete data, where the likelihood ratio is nowdefined in terms of the modified notion, i.e.,

L(y) =py(y; H1)

py(y; H0)(2.147)

in the continuous-valued observation case, or

L(y) =py[y; H1]

py[y; H0](2.148)

in the case of discrete-valued observations.

The distinction between random and nonrandom hypotheses may seem pri-marily a philosophical one at this juncture in our development. However, makingdistinctions between random quantities and nonrandom but unknown quantities—and keeping track of the consequences of such distinctions—will make our devel-opment in subsequent chapters much easier to follow and provide some importantperspectives. With this approach, it will also be easier to understand the practicalimplications of these distinctions as well as the connections between the variousapproaches to detection and estimation we develop.