Data Analysis and Graphics (Phân Tích Số Liệu Và Tạo Biểu Đồ)


A guide to data analysis and plotting with the R program.


  • 1

    Data analysis and graphics with R

    A practical guide

    Contents

    1 Preface
    2 Introduction to the R language
    2.1 What is R?
    2.2 Downloading and installing R
    2.3 Packages for specialized analyses
    2.4 Starting and quitting R
    2.5 The "grammar" of the R language
    2.6 Naming objects in R
    2.7 Getting help in R
    2.8 The working environment
    3 Data entry
    3.1 Direct entry: c()
    3.2 Direct entry: edit(data.frame())
    3.3 Reading data from a text file: read.table()
    3.4 Reading data from Excel: read.csv
    3.5 Reading data from SPSS: read.spss
    3.6 Basic information about the data
    4 Data editing
    4.1 Checking for missing values: na.omit()
    4.2 Splitting data: subset
    4.3 Extracting data from a data.frame
    4.4 Merging two data.frames: merge
    4.5 Data coding
    4.5.1 Coding with the replace function
    4.5.2 Converting a continuous variable to a discrete variable
    4.6 Dividing a continuous variable into groups: cut
    4.7 Grouping data with cut2 (Hmisc)

  • 2

    5 Using R for simple calculations and matrices
    5.1 Simple calculations
    5.2 Dates
    5.3 Generating sequences with seq, rep and gl
    5.4 Using R for matrix calculations
    5.4.1 Extracting elements from a matrix
    5.4.2 Calculations with matrices
    6 Probability calculations and simulation
    6.1 Simple calculations
    6.1.1 Permutations
    6.1.2 Combinations
    6.2 Random variables and distribution functions
    6.3 Probability distribution functions
    6.3.1 Binomial distribution
    6.3.2 Poisson distribution
    6.3.3 Normal distribution
    6.3.4 Standardized normal distribution
    6.3.5 The t, F and chi-square distributions
    6.4 Simulation
    6.4.1 Simulating the binomial distribution
    6.4.2 Simulating the Poisson distribution
    6.4.3 Simulating the chi-square, t, F, gamma, beta, Weibull and Cauchy distributions
    6.5 Random sampling
    7 Statistical hypothesis testing and the meaning of the P value
    7.1 The P value
    7.2 Scientific hypotheses and falsification
    7.3 The meaning of the P value through simulation
    7.4 The logical problem of the P value
    7.5 The problem of multiple tests of hypothesis
    8 Graphical data analysis
    8.1 The graphics environment and plot design
    8.1.1 Several plots in one window
    8.1.2 Labelling the vertical and horizontal axes
    8.1.3 Setting limits for the vertical and horizontal axes
    8.1.4 Plot types and line types
    8.1.5 Colors, frames and symbols
    8.1.6 Legends
    8.1.7 Writing text in a plot

  • 3

    8.2 Data for graphical analysis
    8.3 Plots for one discrete variable: barplot
    8.4 Plots for two discrete variables: barplot
    8.5 Pie charts
    8.6 Plots for one continuous variable: stripchart and hist
    8.6.1 Stripchart
    8.6.2 Histogram
    8.6.3 Boxplot
    8.6.4 Barchart
    8.6.5 Dotchart
    8.7 Graphical analysis for two continuous variables
    8.7.1 Scatter plot
    8.8 Graphical analysis for several variables: pairs
    8.9 Some multi-purpose plots
    8.9.1 Scatter plot with boxplots
    8.9.2 Scatter plot sized by a third variable
    8.9.3 Bar chart with cumulative probability
    8.9.4 Clock plot
    8.9.5 Plot with standard errors
    8.9.6 Contour plot
    8.9.10 Plot with mathematical symbols
    9 Descriptive statistical analysis
    9.0 The concepts of population and sample
    9.1 Descriptive statistics: summary
    9.2 Testing whether a variable is normally distributed
    9.3 Descriptive statistics by group
    9.4 The t test (t.test)
    9.4.1 One-sample t test
    9.4.2 Two-sample t test
    9.5 Comparing variances (var.test)
    9.6 Wilcoxon test for two samples (wilcox.test)
    9.7 t test for paired variables (paired t-test, t.test)
    9.8 Wilcoxon test for paired variables (wilcox.test)
    9.9 Frequencies
    9.10 Testing a proportion (proportion test, prop.test, binom.test)
    9.11 Comparing two proportions (prop.test, binom.test)
    9.12 Comparing several proportions (prop.test, chisq.test)
    9.12.1 Chi-square test
    9.12.2 Fisher's test

  • 4

    10 Linear regression analysis
    10.1 Correlation coefficients
    10.1.1 Pearson correlation coefficient
    10.1.2 Spearman correlation coefficient
    10.1.3 Kendall correlation coefficient
    10.2 The simple linear regression model
    10.2.1 A few lines of theory
    10.2.2 Simple linear regression with R
    10.2.3 Assumptions of linear regression analysis
    10.2.4 Prediction models
    10.3 Multiple linear regression
    10.4 Polynomial regression analysis
    10.5 Building a linear model from many variables
    10.6 Building a linear model with Bayesian Model Average (BMA)
    11 Analysis of variance
    11.1 One-way analysis of variance (ANOVA)
    11.1.1 The analysis of variance model
    11.1.2 One-way analysis of variance with R
    11.2 Multiple comparisons and adjusting the P value
    11.2.1 Multiple comparisons with Tukey's method
    11.2.2 Graphical analysis
    11.3 Nonparametric analysis
    11.4 Two-way analysis of variance (ANOVA)
    11.4.1 Two-way analysis of variance with R
    11.5 Analysis of covariance (ANCOVA)
    11.5.1 The analysis of covariance model
    11.5.2 Analysis with R
    11.6 Analysis of variance for factorial experiments
    11.7 Analysis of variance for Latin square experiments
    11.8 Analysis of variance for cross-over experiments
    11.9 Analysis of variance for repeated measure experiments
    12 Logistic regression analysis
    12.1 The logistic regression model

  • 5

    12.2 Logistic regression analysis with R
    12.3 Estimating probabilities with R
    12.4 Logistic regression from summarized data with R
    12.5 Multiple logistic regression and model selection
    12.6 Selecting a logistic regression model with Bayesian Model Average
    12.7 Data used for the analysis
    13 Survival analysis
    13.1 Models for time-to-event data
    13.2 Kaplan-Meier estimation with R
    13.3 Comparing two cumulative probability functions: the log-rank test
    13.4 The log-rank test with R
    13.5 The Cox model (Cox's proportional hazards model)
    13.6 Building a Cox model with Bayesian Model Average (BMA)
    14 Meta-analysis
    14.1 The need for meta-analysis
    14.2 Random effects and fixed effects
    14.3 The steps of a meta-analysis
    14.4 Fixed-effects meta-analysis for a continuous outcome
    14.4.1 Meta-analysis by hand calculation
    14.4.2 Meta-analysis with R
    14.5 Fixed-effects meta-analysis for a dichotomous outcome
    14.5.1 The model
    14.5.2 Analysis with R
    15 Estimation of sample size
    15.1 The concept of power
    15.2 Statistical hypothesis testing and disease diagnosis
    15.3 Data needed to estimate sample size
    15.4 Estimating sample size
    15.4.1 Sample size for a mean
    15.4.2 Sample size for comparing two means
    15.4.3 Sample size for analysis of variance
    15.4.4 Sample size for estimating a proportion
    15.4.5 Sample size for comparing two proportions
    16 Appendix 1: Programming and writing functions in R

  • 6

    17 Appendix 2: Some commonly used R commands
    18 Appendix 3: Terminology used in the book
    19 Afterword (references and further reading)

  • 1 Preface

    Contrary to what many people think, statistics is a science in its own right: statistical science. Its methods of analysis rest on mathematics and probability, but that is only the "technical" part; the more important part is designing studies and interpreting what the data mean. A statistician is therefore not merely someone who analyzes data, but a scientist, a thinker about scientific research. That is why statistical science plays an extremely important, indeed indispensable, role in scientific research, especially in the experimental sciences. One could say that today, without statistics, genetic experiments with their millions upon millions of data points would be nothing but lifeless, meaningless numbers.

    A scientific study, however costly and important, has no scientific value if it is not analyzed with the proper methods. That is why, in research journals around the world today, almost every medical article has a "Statistical Analysis" section in which the authors must carefully describe how the analysis and calculations were done and briefly explain why those methods were used, so as to back up, or add scientific weight to, the claims made in the article. The more prestigious the medical journal, the heavier its demands on statistical analysis. To repeat for emphasis: without a statistical analysis section, an article has no scientific value.

    One of the most important developments in statistical science is the use of computers for statistical analysis and calculation. It is no exaggeration to say that without computers, statistical science would still be a dull, dry discipline of complicated formulas with little practical application. Computers have helped statistical science carry out the greatest revolution in its history: bringing statistics into practice, solving the thorniest problems, and contributing to the development of experimental science.

    I remember that more than 20 years ago, when I was a student in a master's program in statistics in Australia, a venerable professor told a story about the famous American statistician Fred Mosteller. During the Second World War he received a research contract from the US Department of Defense to improve the accuracy of American weapons, which required solving a statistical problem with about 30 parameters. He had to hire 20 graduate students for the job: 10 did nothing but calculate by hand all day, and the other 10 checked their calculations. The work took nearly a month. Today, on a modest personal computer, the same analysis can be done in about one second.

    But a computer without software is just a lifeless, useless heap of iron or silicon. One piece of software that has been, is, and will be revolutionizing statistics is R. It has been developed and refined over roughly the past 10 years by statisticians and scientists around the world, for use in learning, teaching and research. This book introduces the reader to R for statistical analysis and graphics.

    Why R? Statistical software has long been developed and widely used. But these programs, from old favourites such as MINITAB and BMD-P to relatively newer ones such as STATISTICA, SPSS, SAS and Stata, are usually very expensive (a university licence can run to hundreds of thousands of dollars a year), beyond the means of an individual or even of some universities. R changes this, because R is completely free. Contrary to common perception, free does not mean poor quality. Indeed, besides being free, R can do all (I repeat: all) of the analyses that commercial software does, and more. R can be downloaded by anyone, at any time, anywhere in the world, and is ready to use after a few minutes of installation. That is why more and more universities in the West and around the world are switching to R for learning, research and teaching. In keeping with this trend, this book has the modest aim of introducing R to readers in Vietnam so that they can keep up with developments in statistical computing and analysis around the world.

    This book is written mainly for university students and scientific researchers who need software for learning statistics, analyzing data, or drawing graphs from scientific data. It is not a textbook on statistical theory, nor is it meant to teach statistical analysis from scratch, but it should help the reader do statistical analysis more effectively and with more enthusiasm. My main aim is to give the reader basic knowledge of statistics and of how to apply R to solve problems, as a foundation for exploring or developing R further.

    I believe that, as in any other field, the best way to learn statistical analysis is to do the analysis yourself. This book is therefore written with many examples and real data. Readers can read the book and follow its instructions at the same time (by typing the commands into the computer), which makes the material more engaging. If you already have research data of your own, learning will be more effective if you apply the calculations in the book to it right away. Students who do not yet have data can use simulation methods to understand statistics better.

    Statistical science is still relatively new in our country, so some terms have not yet been translated in a consistent and complete way. Readers will therefore meet a few unfamiliar terms in this book; in those cases I try to give the original English term for reference. In addition, the end of the book lists the English-Vietnamese terms used. All data sets used in this book can be downloaded from the internet to a personal computer, or accessed directly at: http://www.ykhoa.net/R.

    I hope readers will find in this book some useful information, techniques or calculations for their own study, teaching and research. But probably no book is perfect or free of errors; so if you find a mistake in the book, please let me know by e-mail at [email protected] or [email protected]. My sincere thanks to readers in advance.

    I would like to take this opportunity to thank Dr. Nguyễn Hoàng Dzũng of the Faculty of Chemistry, Ho Chi Minh City University of Technology (Đại học Bách khoa), who suggested this book and helped me have it printed in Vietnam. I thank Dr. Nguyễn Đình Nguyên, who read most of the manuscript, made many practical suggestions, and designed the cover. I also thank the publishing house of Ho Chi Minh City University of Technology for helping me print this book.

    Now I invite the reader to join me on a short statistical journey with R.

    Sydney, 31 March 2006

    Nguyễn Văn Tuấn

  • 2 Introduction to the R language

    2.1 What is R?

    Put briefly, R is software for statistical analysis and graphics. In essence, R is a general-purpose computer language that can be used for many purposes, from simple calculations, recreational mathematics and matrix computations to complex statistical analyses. Because it is a language, R can also be used to develop specialized software for a particular computational problem.

    R was created by two statisticians, Ross Ihaka and Robert Gentleman. Since R appeared, many statisticians and mathematicians around the world have supported and taken part in its development. R's creators follow an open approach (Open Access), which is partly why R is completely free. Anyone, anywhere in the world, can access and download the entire R source code for their own use. In a relatively short time, more and more statisticians, mathematicians and researchers in every field have switched to R for analyzing scientific data. Worldwide there is a network of nearly a million R users, and the number is growing rapidly. It may well be that within 10 years we will no longer need expensive statistical software such as SAS, SPSS or Stata (which can cost up to 100,000 USD a year), because all of these analyses can be carried out with R.

    So anyone doing scientific research, especially in poorer countries such as ours, should learn to use R for statistical analysis and graphics. This short guide shows how. I assume the reader knows nothing about R, but I do expect some familiarity with using a computer.

    2.2 Downloading and installing R

    To use R, the first thing to do is install it on our computer. To do this, we go online to the website of the Comprehensive R Archive Network (CRAN):

    http://cran.R-project.org

    The file to download depends on the version, but its name usually starts with the letter R followed by the version number. For example, the version I used at the end of 2005 was 2.2.1, so the file to download was:

    R-2.2.1-win32.exe

    This file is about 26 MB, and it can be downloaded from:

    http://cran.r-project.org/bin/windows/base/R-2.2.1-win32.exe

    On this website we can also find many documents explaining how to use R, at all levels from beginner to advanced. For readers who are not yet comfortable with English, this guide provides the information needed to use R without having to read the other documents.

    Once R has been downloaded, the next step is to install (set up) it. To do this, simply double-click the downloaded file and follow the on-screen instructions. This is a very simple step; installation takes about a minute.

    2.3 Packages for specialized analyses

    R gives us a computer language and a set of functions for basic, simple analyses. For more complex analyses we need to download additional packages. A package is a small piece of software, developed by statisticians to solve a specific problem, that runs inside R. For linear regression, for example, R has the function lm, but deeper and more complex analyses need packages such as lme4. These packages have to be downloaded and installed.
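    The install-once, load-every-session pattern described above can be sketched as follows. This is a minimal sketch: "Hmisc" is used only as an example name, and the installation line is shown as a message rather than run, because it needs an internet connection.

```r
# Sketch: check whether a package is installed, and load it if so.
pkg <- "Hmisc"   # example package name; any CRAN package works the same way

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
have_pkg <- requireNamespace(pkg, quietly = TRUE)

if (!have_pkg) {
  # Installation is needed only once per computer
  message("Package ", pkg, " is not installed; to install it, run:")
  message('  install.packages("', pkg, '")')
} else {
  # Loading is needed once per R session
  library(pkg, character.only = TRUE)
}

# Packages shipped with R (for example "stats", which provides lm)
# are available without any installation
stats_available <- requireNamespace("stats", quietly = TRUE)
```

    Checking first with requireNamespace() keeps a script from stopping with an error on a computer where the package has not been installed yet.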

    Packages are downloaded from the same address, http://cran.r-project.org, by clicking on "Packages" in the menu on the left of the web page. The packages needed for the examples in this book are:

    Package     Purpose
    Trellis     Drawing graphs and making them look better
    lattice     Drawing graphs and making them look better
    Hmisc       Data modelling methods by F. Harrell
    Design      Study design models by F. Harrell
    Epi         Epidemiological analyses
    epitools    Another package for epidemiological analyses
    foreign     Importing data from other software such as SPSS, Stata, SAS, etc.
    rmeta       Meta-analysis
    meta        Another package for meta-analysis
    survival    Survival analysis with the Cox model (Cox's proportional hazards model)
    splines     Needed for survival to run
    Zelig       Statistical analyses in the social sciences
    genetics    Analysis of genetic data
    BMA         Bayesian Model Average
    leaps       Needed by BMA

    2.4 Starting and quitting R

    After installation, an icon named R 2.2.1 appears on the desktop. We are now ready to use R: double-click this icon and the R window opens.

    R is usually used in "command line" mode, meaning that we type commands directly at the prompt shown in that window. Commands must strictly follow R's grammar and vocabulary. This whole guide is essentially meant to help the reader understand and write in R's language. One rule is that R is case-sensitive: it distinguishes Library from library. Another convention is that when a name is made of two words, R usually joins them with a dot instead of a space, as in data.frame, t.test, read.table, etc. Paying attention to these rules saves a lot of wasted time.

    If a command is grammatically correct, R either shows a new prompt or prints a result (depending on the command); if it is not, R prints a short message saying that it does not accept or understand the command. For example, if we type a valid command such as an assignment to x, R carries it out and then shows a new prompt:

    >

    But if we type:

    > R is great

    R rejects the command, because this is not valid R, and the following message appears:

    Error: syntax error
    >

    To leave R, simply click the window's close button (x), or type q().

    2.5 The grammar of the R language

    In R, a command is generally a function. A function takes arguments, so the function name is followed by the arguments we have to supply. For example, in

    > reg <- lm(y ~ x)

    reg is an object, lm is a function and y ~ x is its argument; and in

    > setwd("c:/works/stats")

    setwd is the function and "c:/works/stats" is its argument.

    To find out which arguments a function takes, use args(x) (args is short for arguments), where x is the function of interest:

    > args(lm)
    function (formula, data, subset, weights, na.action, method = "qr",
        model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
        contrasts = NULL, offset, ...)
    NULL

    R is an object-oriented language, which means that data in R are stored in objects. This has some consequences for how R code is written. For example, to test whether x equals 5 we do not write x = 5 as we would on paper; R requires x == 5.
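    A minimal sketch of the difference between assigning a value and comparing values, as described above. The object names are illustrative only.

```r
# Assignment: store a value in an object
x <- 5        # the usual R assignment operator
y = 5         # "=" also assigns when used at the top level

# Comparison: ask a question and get TRUE or FALSE back
same    <- (x == y)   # is x equal to y?
differs <- (x != 3)   # is x different from 3?

# Objects can hold more than one value
z <- rnorm(10)        # 10 simulated values from a standard normal distribution
n <- length(z)        # number of values stored in z
```

    Writing x == 5 never changes x, whereas x = 5 silently overwrites it, so mixing the two up is a common source of errors.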

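    In R, x = 5 is an assignment, equivalent to x <- 5. Everything after the # symbol on a line is ignored, because # marks a comment for the user's own notes:

    > # the following command simulates 10 normal values
    > x <- rnorm(10)

    2.6 Naming objects in R

    An object name must be written as one word, without spaces: R accepts myobject but not my object. But a name like myobject can be hard to read, so it is better to separate the words with a dot, as in my.object. Capitalization is part of the name, so My.object.u and my.object.L are different objects; here two such objects, whose values sum to 20, are added:

    > My.object.u + my.object.L
    [1] 20

    A few points to note when naming in R:

    Avoid the underscore "_" in variable names, as in my_object. A hyphen, as in my-object, cannot be used at all, because R reads it as a minus sign.

    Avoid giving an object the same name as a variable inside a data set. For example, if a data.frame contains a variable called age, do not also create a separate object called age.

    2.7 Getting help in R

    To see the documentation for a function such as lm, type:

    > ?lm

    A window opens showing how the function is used, usually with examples. These examples can simply be copied and pasted into R to see how they work. Before using R, readers may also want to look through the documentation built into R, by choosing the Help menu and then Html help. The commands found there can likewise be copied and pasted into R to see how R behaves.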
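    The ways of asking R about a function can be sketched as follows. This is a minimal sketch using lm as the example; the lines that open a help window are left as comments so the block runs without a display.

```r
# args() shows a function's arguments; formals() returns them as a named list
lm_signature <- args(lm)
lm_arg_names <- names(formals(lm))
has_formula  <- "formula" %in% lm_arg_names

# exists() checks whether a name is defined before we try to use it
lm_defined <- exists("lm")

# Interactive help (not run here because these open a help window):
#   help(lm)      # same as ?lm
#   example(lm)   # runs the examples from the help page
```

    Calling formals() is handy inside scripts, because it returns the argument names as data rather than printing them.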

    Instead of using the menu, we can simply type:

    > help.start()

    and a window appears giving access to the documentation for the whole R system. The function apropos is also very useful, because it lists all the functions in R whose names contain the characters we are looking for. For example, to find out which functions have lm in their name, type:

    > apropos("lm")

    and R reports the available functions whose names contain lm:

     [1] ".__C__anova.glm"      ".__C__anova.glm.null" ".__C__glm"
     [4] ".__C__glm.null"       ".__C__lm"             ".__C__mlm"
     [7] "anova.glm"            "anova.glmlist"        "anova.lm"
    [10] "anova.lmlist"         "anova.mlm"            "anovalist.lm"
    [13] "contr.helmert"        "glm"                  "glm.control"
    [16] "glm.fit"              "glm.fit.null"         "hatvalues.lm"
    [19] "KalmanForecast"       "KalmanLike"           "KalmanRun"
    [22] "KalmanSmooth"         "lm"                   "lm.fit"
    [25] "lm.fit.null"          "lm.influence"         "lm.wfit"
    [28] "lm.wfit.null"         "model.frame.glm"      "model.frame.lm"
    [31] "model.matrix.lm"      "nlm"                  "nlminb"
    [34] "plot.lm"              "plot.mlm"             "predict.glm"
    [37] "predict.lm"           "predict.mlm"          "print.glm"
    [40] "print.lm"             "residuals.glm"        "residuals.lm"
    [43] "rstandard.glm"        "rstandard.lm"         "rstudent.glm"
    [46] "rstudent.lm"          "summary.glm"          "summary.lm"
    [49] "summary.mlm"          "kappa.lm"

    2.8 The working environment

    Data must be stored in a directory on the computer. Before using R, it is probably best to create a directory to hold the data, for example c:\works\stats. To tell R where the data are, we use the command setwd (set working directory):

    > setwd("c:/works/stats")

    This command tells R that the data are kept in the directory c:\works\stats. Note that R uses the forward slash / rather than the backslash \ used by Windows.

    To find out which directory R is currently working in, type:

    > getwd()
    [1] "C:/Program Files/R/R-2.2.1"
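    A minimal sketch of changing and restoring the working directory. The folder used here is a hypothetical one created inside R's temporary directory, so that the example does not depend on c:\works\stats existing.

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# Create a hypothetical data folder inside R's temporary directory
work_dir <- file.path(tempdir(), "stats")
dir.create(work_dir, showWarnings = FALSE)

# Switch to it; file names without a path now refer to this folder
setwd(work_dir)
in_stats <- basename(getwd()) == "stats"

# Go back to where we started
setwd(old_dir)
```

    Building paths with file.path() avoids typing slashes by hand, which sidesteps the forward slash versus backslash issue altogether.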

    R's default prompt is >. If we prefer a different prompt, it is easy to change:

    > options(prompt="R> ")
    R>

    Or:

    > options(prompt="Tuan> ")
    Tuan>

    R's default output width is 80 characters; for a wider display, type:

    > options(width=100)

    To have R print numbers with 3 significant digits, type:

    > options(digits=3)

    (The scipen option is different: it sets R's preference between fixed and scientific notation.)
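    A minimal sketch of setting display options temporarily and restoring them afterwards. The values chosen are examples only.

```r
# options() returns the previous values of whatever it changes,
# which makes it easy to put things back afterwards
old <- options(width = 100, digits = 3)

# format() follows the "digits" option: pi is shown with 3 significant digits
shown <- format(pi)

# Individual settings can be read with getOption()
current_width <- getOption("width")

# Restore the earlier settings
options(old)
```

    Saving the return value of options() is a simple way to avoid leaving a session with unexpected settings.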

    All of these settings are made with the options() command. To see R's current settings, type:

    > options()

    To find today's date:

    > Sys.Date()
    [1] "2006-03-31"

    For readers who need more information, several documents available online (written in English) are also very useful. They can be downloaded free of charge:

    R for beginners (by Emmanuel Paradis): http://cran.r-project.org/doc/contrib/rdebuts_en.pdf

    Using R for data analysis and graphics (by John Maindonald): http://cran.r-project.org/doc/contrib/usingR.pdf

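  • 3 Data entry

    To analyze data with R, the data must be in a form that R can understand and process. For the analyses in this book, that means the data should be in a data.frame. There are many ways to get data into a data.frame in R, from typing it in directly to reading it from other sources. The most common ways are described below.

    3.1 Direct entry: c()

    Example 1: we have the following data on age and insulin for 10 patients, and want to enter them into R.

    age   insulin
    50    16.5
    62    10.8
    60    32.3
    40    19.3
    48    14.2
    47    11.3
    57    15.5
    70    15.8
    48    16.2
    67    11.2

    We can use the function c as follows:

    > age <- c(50,62,60,40,48,47,57,70,48,67)
    > insulin <- c(16.5,10.8,32.3,19.3,14.2,11.3,15.5,15.8,16.2,11.2)

    age and insulin are now two separate objects.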

    We need to combine these two objects into a data.frame so that R can process them later. To do this we use the function data.frame:

    > tuan <- data.frame(age, insulin)

    To see the contents, type the name of the object:

    > tuan

    and R reports:

       age insulin
    1   50    16.5
    2   62    10.8
    3   60    32.3
    4   40    19.3
    5   48    14.2
    6   47    11.3
    7   57    15.5
    8   70    15.8
    9   48    16.2
    10  67    11.2
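    The same entry steps, with a few checks added, can be sketched as follows. The data are those of Example 1; the checks and the summary at the end are additions for illustration.

```r
# Enter each column as a vector with c()
age     <- c(50, 62, 60, 40, 48, 47, 57, 70, 48, 67)
insulin <- c(16.5, 10.8, 32.3, 19.3, 14.2, 11.3, 15.5, 15.8, 16.2, 11.2)

# Both vectors must have one value per patient
stopifnot(length(age) == length(insulin))

# Combine the vectors into one data.frame
tuan <- data.frame(age, insulin)

# Basic checks on the result
n_rows   <- nrow(tuan)        # number of patients
col_names <- names(tuan)      # column names come from the vector names
mean_age <- mean(tuan$age)    # a column is accessed with the $ sign
```

    Checking the lengths before calling data.frame() catches a skipped or doubled value early, when it is still easy to find.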

    To keep these data in a file in R's own format, we use the command save. Suppose we want to store the data in the directory c:\works\stats; we type:

    > setwd("c:/works/stats")
    > save(tuan, file="tuan.rda")

    The first command (setwd, where wd stands for working directory) tells R that we want to store the data in the directory c:\works\stats. Note that Windows normally uses the backslash \, but in R we use the forward slash /.

    The second command (save) tells R to store the data held in the object tuan in a file named tuan.rda. After these two commands have been typed, a file named tuan.rda will be present in that directory.

    3.2 Direct entry: edit(data.frame())

    Example 1 (continued): we can also enter the age and insulin data for the 10 patients using a very handy function: edit(data.frame()). With this function, R opens a new window with rows and columns like an Excel sheet, and we can type the data into that table. For example:

    > ins <- edit(data.frame())

    3.3 Reading data from a text file: read.table()

    Example 2: the data set below, on 50 patients, has the columns id, sex, age, bmi, hdl, ldl, tc and tg, and is stored in a text file named chol.txt. (Rows 1 and 2 are missing from this transcript.)

    id sex age bmi hdl ldl tc tg
    3 Nu 60 18 3.360 3.0 4.7 0.8
    4 Nam 65 18 5.920 4.0 7.7 1.1
    5 Nam 47 18 6.250 2.1 5.0 2.1
    6 Nu 65 18 4.150 3.0 4.2 1.5
    7 Nam 76 19 0.737 3.0 5.9 2.6
    8 Nam 61 19 7.170 3.0 6.1 1.5
    9 Nam 59 19 6.942 3.0 5.9 5.4
    10 Nu 57 19 5.000 2.0 4.0 1.9
    11 Nu 63 20 4.217 5.0 6.2 1.7
    12 Nam 51 20 4.823 1.3 4.1 1.0
    13 Nu 60 20 3.750 1.2 3.0 1.6
    14 Nam 42 20 1.904 0.7 4.0 1.1
    15 Nam 64 20 6.900 4.0 6.9 1.5
    16 Nu 49 20 0.633 4.1 5.7 1.0
    17 Nu 44 21 5.530 4.3 5.7 2.7
    18 Nu 45 21 6.625 4.0 5.3 3.9
    19 Nu 80 21 5.960 4.3 7.1 3.0
    20 Nu 48 21 3.800 4.0 3.8 3.1
    21 Nu 61 21 5.375 3.1 4.3 2.2
    22 Nu 45 21 3.360 3.0 4.8 2.7
    23 Nu 70 21 5.000 1.7 4.0 1.1
    24 Nu 51 21 2.608 2.0 3.0 0.7
    25 Nam 63 22 4.130 2.1 3.1 1.0
    26 Nam 54 22 5.000 4.0 5.3 1.7
    27 Nu 57 22 6.235 4.1 5.3 2.9
    28 Nam 70 22 3.600 4.0 5.4 2.5
    29 Nu 47 22 5.625 4.2 4.5 6.2
    30 Nu 60 22 5.360 4.2 5.9 1.3
    31 Nu 60 22 6.580 4.4 5.6 3.3
    32 Nam 50 22 7.545 4.3 8.3 3.0
    33 Nam 60 22 6.440 2.3 5.8 1.0
    34 Nu 55 22 6.170 6.0 7.6 1.4
    35 Nu 74 23 5.270 3.0 5.8 2.5
    36 Nam 48 23 3.220 3.0 3.1 0.7
    37 Nu 46 23 5.400 2.6 5.4 2.4
    38 Nam 49 23 6.300 4.4 6.3 2.4
    39 Nu 69 23 9.110 4.3 8.2 1.4
    40 Nu 72 23 7.750 4.0 6.2 2.7
    41 Nam 51 23 6.200 3.0 6.2 2.4
    42 Nu 58 23 7.050 4.1 6.7 3.3
    43 nam 60 24 6.300 4.4 6.3 2.0
    44 Nam 45 24 5.450 2.8 6.0 2.6
    45 Nam 63 24 5.000 3.0 4.0 1.8
    46 Nu 52 24 3.360 2.0 3.7 1.2
    47 Nam 64 24 7.170 1.0 6.1 1.9
    48 Nam 45 24 7.880 4.0 6.7 3.3
    49 Nu 64 25 7.360 4.6 8.1 4.0
    50 Nu 62 25 7.750 4.0 6.2 2.5

    We want to read these data into R for later analysis, using the command read.table as follows:

    > setwd("c:/works/stats")
    > chol <- read.table("chol.txt", header=TRUE)

    Here header=TRUE tells R to treat the first line of the file as the column names. To check that the data have been read, type:

    > chol

    or

    > names(chol)

    R lists the columns in the data (names asks which columns the data contain and what they are called):

    [1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

    We can now save the data in R format for later use:

    > save(chol, file="chol.rda")

    3.4 Reading data from Excel: read.csv

    To read data from Excel, two steps are needed:

    Step 1: use the "Save as" command in Excel to save the data in csv format;
    Step 2: use R (the read.csv command) to read the csv file.

    Example 3: a data set with the following columns is stored in Excel, and we want to move it into R for analysis. The file is named excel.xls.

    ID Age Sex Ethnicity IGFI   IGFBP3 ALS    PINP   ICTP P3NP
    1  18  1   1         148.27 5.14   316.00 61.84  5.81 4.21
    2  28  1   1         114.50 5.23   296.42 98.64  4.96 5.33
    3  20  1   1         109.82 4.33   269.82 93.26  7.74 4.56
    4  21  1   1         112.13 4.38   247.96 101.59 6.66 4.61
    5  28  1   1         102.86 4.04   240.04 58.77  4.62 4.95
    6  23  1   4         129.59 4.16   266.95 48.93  5.32 3.82
    7  20  1   1         142.50 3.85   300.86 135.62 8.78 6.75
    8  20  1   1         118.69 3.44   277.46 79.51  7.19 5.11
    9  20  1   1         197.69 4.12   335.23 57.25  6.21 4.44
    10 20  1   1         163.69 3.96   306.83 74.03  4.95 4.84
    11 22  1   1         144.81 3.63   295.46 68.26  4.54 3.70
    12 27  0   2         141.60 3.48   231.20 56.78  4.47 4.07
    13 26  1   1         161.80 4.10   244.80 75.75  6.27 5.26
    14 33  1   1         89.20  2.82   177.20 48.57  3.58 3.68
    15 34  1   3         161.80 3.80   243.60 50.68  3.52 3.35
    16 32  1   1         148.50 3.72   234.80 83.98  4.85 3.80
    17 28  1   1         157.70 3.98   224.80 60.42  4.89 4.09
    18 18  0   2         222.90 3.98   281.40 74.17  6.43 5.84
    19 26  0   2         186.70 4.64   340.80 38.05  5.12 5.77
    20 27  1   2         167.56 3.56   321.12 30.18  4.78 6.12

    The first thing to do, as mentioned above, is to save the data in csv format from within Excel:

    In Excel, choose File, then Save as
    Under "Save as type", choose CSV (Comma delimited)

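    When this is done, we have a file named excel.csv in the directory c:\works\stats. The second step is to go into R and type the following commands:

    > setwd("c:/works/stats")
    > gh <- read.csv("excel.csv", header=TRUE)

    The read.csv command reads the data in excel.csv into an object named gh.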
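    The csv route can be tried without Excel. The sketch below writes a small csv file into R's temporary directory (the first three rows and five columns of Example 3), reads it back, and then saves and reloads it in R format. The file locations are assumptions made so that the example is self-contained.

```r
# Write a small csv file to a temporary location (stand-in for excel.csv)
csv_file <- file.path(tempdir(), "excel.csv")
writeLines(c("ID,Age,Sex,Ethnicity,IGFI",
             "1,18,1,1,148.27",
             "2,28,1,1,114.50",
             "3,20,1,1,109.82"),
           csv_file)

# Read it with read.csv; the first line supplies the column names
gh <- read.csv(csv_file, header = TRUE)

# read.table reads the same file if told that the separator is a comma
gh2 <- read.table(csv_file, header = TRUE, sep = ",")

# Save in R format, remove the object, and load it back
rda_file <- file.path(tempdir(), "gh.rda")
save(gh, file = rda_file)
rm(gh)
load(rda_file)   # restores the object under its original name, gh
```

    Note that load() brings the object back under the name it had when it was saved, so no assignment is needed.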

    We can now save gh in R format for later use with the following command:

    > save(gh, file="gh.rda")

    3.5 Reading data from SPSS: read.spss

    The statistical package SPSS stores data in sav format. Suppose we have a data set named testo.sav in the directory c:\works\stats and want to convert it into a form that R can understand. We need the command read.spss from the package foreign. The following commands do the job.

    First, load foreign with the library command:

    > library(foreign)

    Second, use read.spss:

    > setwd("c:/works/stats")
    > testo <- read.spss("testo.sav", to.data.frame=TRUE)

    The data can then be saved in R format:

    > save(testo, file="testo.rda")

    3.6 Basic information about the data

    Suppose we have read data into a data.frame named chol, as in Example 2. To find out what this data set contains, we can proceed as follows.

    Tell R that we want to work with chol, using attach(arg), where arg is the name of the data set:

    > attach(chol)

    We can check whether chol is a data.frame with is.data.frame(arg), where arg is the name of the data set. For example:

    > is.data.frame(chol)
    [1] TRUE

    R confirms that chol is indeed a data.frame.

    How many columns (variables) and rows (observations) does this data set have? We use dim(arg), where arg is the name of the data set (dim is short for dimension). For example (R's result appears right after the command):

    > dim(chol)
    [1] 50  8

    So we have 50 rows and 8 columns (variables). What are these variables called?

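    We use names(arg), where arg is the name of the data set. For example:

    > names(chol)
    [1] "id"  "sex" "age" "bmi" "hdl" "ldl" "tc"  "tg"

    How many men and women are there in the variable sex? To answer this, we can use table(arg), where arg is the name of the variable. For example:

    > table(sex)
    sex
    nam Nam  Nu
      1  21  28

    The result shows 21 patients coded Nam (men) and 28 coded Nu (women). One more patient is coded nam; because R is case-sensitive, it counts this spelling as a separate category.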
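    The inspection steps of this section can be sketched on a small data set. The data frame below is invented for illustration (it is not the chol data), and the last step, bringing the spelling nam into line with Nam, is an addition that the text above does not carry out.

```r
# A small invented data set with the same kind of inconsistent coding
chol_demo <- data.frame(
  id  = 1:6,
  sex = c("Nam", "Nu", "Nu", "nam", "Nam", "Nu"),
  age = c(57, 64, 60, 65, 47, 65),
  stringsAsFactors = FALSE
)

is_df <- is.data.frame(chol_demo)   # TRUE for a data.frame
size  <- dim(chol_demo)             # rows, then columns
vars  <- names(chol_demo)           # column names

# "nam" and "Nam" are counted as different categories
counts_raw <- table(chol_demo$sex)

# Work on a copy so that the original data stay untouched
sex_clean <- chol_demo$sex
sex_clean[sex_clean == "nam"] <- "Nam"
counts_clean <- table(sex_clean)
```

    Here the column is reached as chol_demo$sex instead of through attach(), which makes it explicit which data set a variable comes from.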

  • 4 Data editing

    Editing data here does not mean changing the original data (that would be a serious offence, a fraud that science cannot accept); it only means organizing the data so that R can analyze them efficiently. In statistical analysis we often need to gather data into one group, split them into separate groups, or replace characters with numbers to make calculation easier. In this chapter I go through some basic commands for editing data.
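    As a preview of the operations covered in this chapter, here is a minimal sketch on two small invented data frames. The values are made up for illustration; the functions (na.omit, subset, merge and indexing with square brackets) are the ones this chapter describes.

```r
# Two small invented data frames sharing the key column id
d1 <- data.frame(id  = 1:5,
                 sex = c("Nam", "Nu", "Nu", "Nam", "Nam"),
                 tc  = c(4.0, 3.5, NA, 7.7, 5.0),
                 stringsAsFactors = FALSE)
d2 <- data.frame(id = 1:6,
                 tg = c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5))

# Drop rows that contain a missing value (row 3 has NA for tc)
complete <- na.omit(d1)

# Keep only the rows that satisfy a condition
men <- subset(d1, sex == "Nam")

# Keep only some columns, chosen by position (columns 1 and 3)
two_cols <- d1[, c(1, 3)]

# Combine the two data frames by id; all = TRUE keeps unmatched rows,
# so id 6 appears with NA for the columns that come from d1
merged <- merge(d1, d2, by = "id", all = TRUE)
```

    None of these steps changes d1 or d2; each one creates a new object, which is the sense of "editing" used in this chapter.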

    We return to the chol data of Example 2. To make the discussion easier to follow, recall that we read the data into an R data set named chol from a text file named chol.txt:

    > setwd("c:/works/stats")
    > chol <- read.table("chol.txt", header=TRUE)
    > attach(chol)

    4.1 Checking for missing values

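    In research, for many reasons, data cannot always be collected for every subject, and not every variable can be measured on every subject. Data absent for such reasons are called missing values. R represents missing values as NA. Some statistical tests require missing values to be removed before the analysis (because the calculation cannot be done with them). R has a very handy command for this, na.omit, used as follows:

    > chol.new <- na.omit(chol)

    4.2 Splitting data: subset

    To split chol into separate data sets for men and women:

    > nam <- subset(chol, sex=="Nam")
    > nu <- subset(chol, sex=="Nu")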

    After these two commands, we have 2 new data sets (two data.frames) named nam and nu. Note that in the conditions sex == "Nam" and sex == "Nu" we use == rather than =, because these are tests of equality. Of course, we can also split the data into other data.frames with conditions on other variables. For example, the following command creates a new data.frame named old containing the patients aged 60 or older:

    > old <- subset(chol, age>=60)
    > dim(old)
    [1] 25  8

    Or a new data.frame of patients who are aged 60 or older and male:

    > n60 <- subset(chol, age>=60 & sex=="Nam")
    > dim(n60)
    [1] 9 8

    4.3 Extracting data from a data.frame

    chol has 8 variables. We can extract from chol only the variables we need, such as the identifier (id), age (age) and total cholesterol (tc). From names(chol) we know that id is column 1, age is column 3 and tc is column 7. We can use the following command:

    > data2 <- chol[, c(1,3,7)]

    We can select rows in the same way. For example, the first 10 rows of columns 1, 2 and 7 (id, sex and tc):

    > data3 <- chol[1:10, c(1,2,7)]
    > print(data3)
       id sex  tc
    1   1 Nam 4.0
    2   2  Nu 3.5
    3   3  Nu 4.7
    4   4 Nam 7.7
    5   5 Nam 5.0
    6   6  Nu 4.2
    7   7 Nam 5.9
    8   8 Nam 6.1
    9   9 Nam 5.9
    10 10  Nu 4.0

    Note that print(arg) simply lists all the data in the data.frame arg. In fact, just typing data3 gives exactly the same result as print(data3).

    4.4 Merging two data.frames: merge

    Suppose our data are held in two data.frames. The first, named d1, has 3 columns: id, sex and tc:

    id sex tc
    1  Nam 4.0
    2  Nu  3.5
    3  Nu  4.7
    4  Nam 7.7
    5  Nam 5.0
    6  Nu  4.2
    7  Nam 5.9
    8  Nam 6.1
    9  Nam 5.9
    10 Nu  4.0

    The second, named d2, has 3 columns: id, sex and tg:

    id sex tg
    1  Nam 1.1
    2  Nu  2.1
    3  Nu  0.8
    4  Nam 1.1
    5  Nam 2.1
    6  Nu  1.5
    7  Nam 2.6
    8  Nam 1.5
    9  Nam 5.4
    10 Nu  1.9
    11 Nu  1.7

    The two data sets share the variables id and sex, but d1 has 10 rows whereas d2 has 11. We can combine them into a single data.frame with the merge command:

    > d <- merge(d1, d2, by="id", all=TRUE)
    > d
       id sex.x  tc sex.y  tg
    1   1   Nam 4.0   Nam 1.1
    2   2    Nu 3.5    Nu 2.1
    3   3    Nu 4.7    Nu 0.8
    4   4   Nam 7.7   Nam 1.1
    5   5   Nam 5.0   Nam 2.1
    6   6    Nu 4.2    Nu 1.5
    7   7   Nam 5.9   Nam 2.6
    8   8   Nam 6.1   Nam 1.5
    9   9   Nam 5.9   Nam 5.4
    10 10    Nu 4.0    Nu 1.9
    11 11  <NA>  NA    Nu 1.7

    In the merge command, we ask R to combine d1 and d2 into a new data.frame named d, using the variable id as the key. Patient 11 has no value for tc, so R shows NA (meaning "not available").

    4.5 Data coding

    In handling epidemiological data, we often need to convert a continuous variable into a categorical one. In the diagnosis of osteoporosis, for example, women whose bone mineral density (BMD) T-score is at or below -2.5 are considered to have osteoporosis, those with a T-score between -2.5 and -1.0 have osteopenia, and those above -1.0 are normal. Suppose we have BMD values from 10 patients:

    -0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11

    To enter these data into R we can use the function c, and then code the diagnosis as 1, 2 or 3:

    > bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)
    > diagnosis <- bmd
    > diagnosis[bmd <= -2.5] <- 1
    > diagnosis[bmd > -2.5 & bmd <= -1.0] <- 2
    > diagnosis[bmd > -1.0] <- 3
    > data <- data.frame(bmd, diagnosis)
  • > data bmd diagnosis 1 -0.92 3 2 0.21 3 3 0.17 3 4 -3.21 1 5 -1.80 2 6 -2.60 1 7 -2.00 2 8 1.71 3 9 2.12 3 10 -2.11 2 4.5.1 Bin i s liu bng cch dng replace Mt cch bin i s liu khc l dng replace, d cch ny c v rm r cht t. Tip tc v d trn, chng ta bin i t bmd sang diagnosis nh sau: > diagnosis diagnosis diag diag [1] 3 3 3 1 2 1 2 3 3 2 Levels: 1 2 3 Ch R by gi thng bo cho chng ta bit diag c 3 bc: 1, 2 v 3. Nu chng ta yu cu R tnh s trung bnh ca diag, R s khng lm theo yu cu ny, v khng phi l mt bin s s hc: > mean(diag) [1] NA Warning message: argument is not numeric or logical: returning NA in: mean.default(diag) D nhin, chng ta c th tnh gi tr trung bnh ca diagnosis:

> mean(diagnosis)
[1] 2.3

but the result 2.3 has no practical meaning.

4.6 Grouping with cut

A continuous variable can be divided into several groups with the function cut. For example, suppose the variable age holds the ages of 15 subjects, ranging from about 8 to 51 years:

> cut(age, 2)
 [1] (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5] (7.96,29.5] (7.96,29.5]
 [9] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (29.5,51]
Levels: (7.96,29.5] (29.5,51]

cut divides age into 2 groups: group 1 covers ages from 7.96 to 29.5, and group 2 covers ages from 29.5 to 51. We can count the subjects in each age group with table:

> table(cut(age, 2))
(7.96,29.5]   (29.5,51]
         11           4

We can also ask for three groups and give them names:

> ageg <- cut(age, 3, labels=c("low", "medium", "high"))
> table(ageg)
ageg
   low medium   high
    10      2      3

Of course, we can also divide age into 4 groups (quartiles) by supplying the probabilities 0, 0.25, 0.50, 0.75 and 1 to the function quantile (note the function name is quantile, and the labels must be quoted):

cut(age, breaks=quantile(age, c(0, 0.25, 0.50, 0.75, 1)), labels=c("q1", "q2", "q3", "q4"), include.lowest=TRUE)

4.7 Grouping data with cut2 (Hmisc)

The function cut above divides a variable according to its values, not according to the number of observations, so the groups generally do not contain equal numbers of observations. In statistical analysis, however, we sometimes need to split a continuous variable into groups that follow its distribution but have equal or similar sizes. For the variable bmd, for instance, we can cut the series into groups of similar size with the function cut2 (in the Hmisc library):

> # load the Hmisc library so that cut2 is available
> library(Hmisc)
> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)
> # divide bmd into 2 groups and store the result in the object group
> group <- cut2(bmd, g=2)
> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
            5             5

As the example shows, g = 2 means "divide into 2 groups" (g = group). R automatically forms group 1 with bmd values from -3.21 up to (but not including) -0.92, and group 2 with values from -0.92 to 2.12. Each group contains 5 values. Of course, we can also divide the data into 3 groups:

> group <- cut2(bmd, g=3)
> table(group)
group
[-3.21,-1.80) [-1.80, 0.21) [ 0.21, 2.12]
            4             3             3
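When Hmisc is not installed, a similar grouping can be sketched in base R by passing sample quantiles to cut. The block below is only an illustration under that assumption: the variable names (breaks, group2, counts) are made up for the example, and the interval labels differ from those of cut2 because cut closes intervals on the right by default. The last line applies the same idea to the fixed clinical thresholds of section 4.5.

```r
# Same BMD values as in the text.
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Equal-count groups in base R: use sample quantiles as break points.
# probs = c(0, 0.5, 1) splits the data at the median into two groups.
breaks <- quantile(bmd, probs = c(0, 0.5, 1))
group2 <- cut(bmd, breaks = breaks, include.lowest = TRUE)
counts <- table(group2)
print(counts)

# Fixed thresholds (section 4.5) in a single cut() call.
# Intervals are right-closed: (-Inf,-2.5], (-2.5,-1], (-1, Inf).
diagnosis <- cut(bmd, breaks = c(-Inf, -2.5, -1.0, Inf), labels = c(1, 2, 3))
print(diagnosis)
```

With heavily tied data the quantiles may coincide, in which case cut stops with an error about non-unique breaks; cut2 handles that situation more gracefully.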

5 Using R for simple calculations and matrices

One advantage of R is that it can be used like a pocket calculator. Beyond that, R can also be used for matrix calculations and programming. In this chapter I present only a few simple calculations that students can try immediately while reading these lines.

5.1 Simple calculations

Adding two or more numbers:
> 15+2997
[1] 3012

Addition and subtraction:
> 15+2997-9768
[1] -6756

Multiplication and division:
> -27*12/21
[1] -15.42857

Powers: (25 - 5)^3
> (25 - 5)^3
[1] 8000

Square root: √10
> sqrt(10)
[1] 3.162278

The number pi (π):
> pi
[1] 3.141593
> 2+3*pi
[1] 11.42478

Natural logarithm (log base e):
> log(10)
[1] 2.302585

Logarithm base 10:
> log10(100)
[1] 2

Exponential: e^2.7689
> exp(2.7689)
[1] 15.94109
> log10(2+3*pi)
[1] 1.057848

Trigonometric functions:
> cos(pi)
[1] -1

Vectors:
> x <- c(2,3,1,5,4,6,7,6,8)
> x
[1] 2 3 1 5 4 6 7 6 8
> sum(x)
[1] 42
> x*2
[1]  4  6  2 10  8 12 14 12 16
> exp(x/10)
[1] 1.221403 1.349859 1.105171 1.648721 1.491825 1.822119 2.013753 1.822119
[9] 2.225541
> exp(cos(x/10))
[1] 2.664634 2.599545 2.704736 2.405079 2.511954 2.282647 2.148655 2.282647
[9] 2.007132

Sum of squares: 1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?
> x <- c(1,2,3,4,5)
> sum(x^2)
[1] 55

Adjusted sum of squares: $\sum_{i=1}^{n} (x_i - \bar{x})^2 = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)
[1] 10
In the formula above, mean(x) is the mean of the vector x.

Mean square: $\sum_{i=1}^{n} (x_i - \bar{x})^2 / n = ?$
> x <- c(1,2,3,4,5)
> sum((x-mean(x))^2)/length(x)
[1] 2
In the formula above, length(x) is the number of elements in the vector x.

Variance and standard deviation:

Variance: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1) = ?$
> x <- c(1,2,3,4,5)
> var(x)
[1] 2.5
Standard deviation: $s = \sqrt{s^2}$
> sd(x)
[1] 1.581139
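The three formulas above fit together: the adjusted sum of squares divided by n gives the mean square, and divided by n - 1 it gives the variance that var reports. A small sketch (the variable names ss_adj, variance and std_dev are chosen for this illustration only):

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

# Adjusted sum of squares: sum of squared deviations from the mean.
ss_adj <- sum((x - mean(x))^2)

# Dividing by n - 1 gives the sample variance, the quantity var() computes.
variance <- ss_adj / (n - 1)
std_dev <- sqrt(variance)

# Both comparisons should be TRUE.
print(all.equal(variance, var(x)))
print(all.equal(std_dev, sd(x)))
```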

5.2 Dates

In statistical analysis, dates are sometimes a thorny problem because there are so many ways to write them. The date 01/02/2003, for example, is also written 1/2/2003, 01/02/03, 01FEB2003, 2003-02-01, and so on. There is in fact a standard for writing dates, ISO 8601 (although very few people follow it!). Under this standard we write:

2003-02-01

The reasoning behind this format is that we write the largest unit first and move step by step to the smallest unit. With the number 123, for example, we know at once that it means one hundred and twenty-three: first the hundreds, then the tens, and so on. This is also R's standard way of writing dates. Dates are created with as.Date, and subtracting one date from another gives the number of days between them:

> date1 <- as.Date(...)
> date2 <- as.Date(...)
> days <- date2 - date1
> days
Time difference of 28 days

We can also generate a sequence of dates:

> seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by="month")
 [1] "2005-01-01" "2005-02-01" "2005-03-01" "2005-04-01" "2005-05-01"
 [6] "2005-06-01" "2005-07-01" "2005-08-01" "2005-09-01" "2005-10-01"
[11] "2005-11-01" "2005-12-01"
> seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by="2 weeks")
 [1] "2005-01-01" "2005-01-15" "2005-01-29" "2005-02-12" "2005-02-26"
 [6] "2005-03-12" "2005-03-26" "2005-04-09" "2005-04-23" "2005-05-07"
[11] "2005-05-21" "2005-06-04" "2005-06-18" "2005-07-02" "2005-07-16"
[16] "2005-07-30" "2005-08-13" "2005-08-27" "2005-09-10" "2005-09-24"
[21] "2005-10-08" "2005-10-22" "2005-11-05" "2005-11-19" "2005-12-03"
[26] "2005-12-17" "2005-12-31"

5.3 Generating sequences with seq, rep and gl

R can also generate sequences of numbers, which is very convenient for simulation and experimental design. The usual functions for this are seq (sequence), rep (repetition) and gl (generate levels).

Using seq

Create a vector of the numbers 1 to 12:
> x <- (1:12)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> seq(12)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Create a vector of the numbers 12 down to 5:
> x <- (12:5)
> x
[1] 12 11 10  9  8  7  6  5
> seq(12,7)
[1] 12 11 10  9  8  7

The general form of seq is seq(from, to, by= ) or seq(from, to, length.out= ). Its use is illustrated by the following examples.

Create a vector of numbers from 4 to 6 in steps of 0.25:
> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00

Create a vector of 10 numbers, with smallest value 2 and largest value 15:
> seq(length=10, from=2, to=15)
 [1]  2.000000  3.444444  4.888889  6.333333  7.777778  9.222222 10.666667 12.111111 13.555556 15.000000

Using rep

The form of rep is rep(x, times, ...), where x is a variable and times is the number of repetitions. For example:

Repeat the number 10 three times:
> rep(10, 3)
[1] 10 10 10

Repeat the numbers 1 to 4 three times:
> rep(c(1:4), 3)
 [1] 1 2 3 4 1 2 3 4 1 2 3 4

Repeat the numbers 1.2, 2.7, 4.8 five times:
> rep(c(1.2, 2.7, 4.8), 5)
 [1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8

Using gl

gl is used to create a categorical variable (a factor), that is, a variable that is counted rather than computed with. The general form of gl is gl(n, k, length = n*k, labels = 1:n, ordered = FALSE), and its use is illustrated by the following examples.

Create a variable with levels 1 and 2, each level repeated 8 times:
> gl(2, 8)
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
Levels: 1 2

Or a variable with levels 1, 2 and 3, each level repeated 5 times:
> gl(3, 5)
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3

Create a variable with levels 1 and 2, each level repeated 10 times (hence length=20):
> gl(2, 10, length=20)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2

Or:
> gl(2, 2, length=20)
 [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2

Adding labels:
> gl(2, 5, label=c("C", "T"))
 [1] C C C C C T T T T T
Levels: C T

Create a variable with the 4 values 1, 2, 3, 4, each repeated twice:
> rep(1:4, c(2,2,2,2))
[1] 1 1 2 2 3 3 4 4

This is equivalent to:
> rep(1:4, each = 2)
[1] 1 1 2 2 3 3 4 4

With dates and times:
> x <- .leap.seconds[1:3]
> rep(x, 2)
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-12-31 16:00:00 Pacific Standard Time"
[3] "1973-12-31 16:00:00 Pacific Standard Time" "1972-06-30 17:00:00 Pacific Standard Time"
[5] "1972-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00 Pacific Standard Time"
> rep(as.POSIXlt(x), rep(2, 3))
[1] "1972-06-30 17:00:00 Pacific Standard Time" "1972-06-30 17:00:00 Pacific Standard Time"
[3] "1972-12-31 16:00:00 Pacific Standard Time" "1972-12-31 16:00:00 Pacific Standard Time"
[5] "1973-12-31 16:00:00 Pacific Standard Time" "1973-12-31 16:00:00 Pacific Standard Time"

5.4 Using R for matrix calculations

As we know, a matrix is, simply put, made up of rows and columns. When we write A[m, n], we mean that the matrix A has m rows and n columns. R uses the same notation. For example, to create a square matrix A with 3 rows and 3 columns and the elements 1, 2, 3, 4, 5, 6, 7, 8, 9, we write:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$

And in R:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

But if we fill the matrix by row:
> A <- matrix(y, nrow=3, byrow=TRUE)
> A
the result is:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

That is, the transposed matrix. Another way to obtain a transposed matrix is the function t(). For example:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
and B = A' can be written in R as:
> B <- t(A)
> B
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

The identity matrix (the scalar matrix whose diagonal elements all equal 1) is a square matrix, that is, one with as many rows as columns, in which every off-diagonal element is 0 and every diagonal element is 1. We can create such a matrix in R as follows:
> # create a 3 x 3 matrix with all elements equal to 0
> A <- matrix(0, 3, 3)
> # set the diagonal elements to 1
> diag(A) <- 1
> diag(A)
[1] 1 1 1
> # the matrix A is now:
> A
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

5.4.1 Extracting elements from a matrix

> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> # column 1 of A
> A[,1]
[1] 1 2 3
> # column 3 of A
> A[,3]
[1] 7 8 9
> # row 1 of A
> A[1,]
[1] 1 4 7
> # row 2, column 3 of A
> A[2,3]
[1] 8
> # all rows of A except row 2
> A[-2,]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    3    6    9
> # all columns of A except column 1
> A[,-1]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
> # which elements are greater than 3?
> A>3
      [,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE

5.4.2 Calculations with matrices

Adding and subtracting two matrices. Take the two matrices A and B:
> A <- matrix(1:12, 3, 4)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> B <- -A
> B
     [,1] [,2] [,3] [,4]
[1,]   -1   -4   -7  -10
[2,]   -2   -5   -8  -11
[3,]   -3   -6   -9  -12

We can compute A+B:
> C <- A+B
> C
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

Or A-B:
> D <- A-B
> D
     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24

Multiplying two matrices. Take the two matrices:

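$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

To compute AB in R we use the operator %*%:
> y <- c(1,2,3,4,5,6,7,8,9)
> A <- matrix(y, nrow=3)
> B <- t(A)
> AB <- A %*% B
> AB
     [,1] [,2] [,3]
[1,]   66   78   90
[2,]   78   93  108
[3,]   90  108  126

Or, to compute BA, again with %*%:
> BA <- B %*% A
> BA
     [,1] [,2] [,3]
[1,]   14   32   50
[2,]   32   77  122
[3,]   50  122  194

Matrix inversion and solving systems of equations. Suppose we have the following system of equations:

$$\begin{aligned} 3x_1 + 4x_2 &= 4 \\ x_1 + 6x_2 &= 2 \end{aligned}$$

This system can be written in matrix notation as AX = Y, where:

$$A = \begin{pmatrix} 3 & 4 \\ 1 & 6 \end{pmatrix}, \quad X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad \text{and} \quad Y = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$

The solution of the system is $X = A^{-1}Y$, or in R:
> A <- matrix(c(3,1,4,6), nrow=2)
> Y <- matrix(c(4,2), nrow=2)
> X <- solve(A) %*% Y
> X
          [,1]
[1,] 1.1428571
[2,] 0.1428571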

We can check the result:
> 3*X[1,1]+4*X[2,1]
[1] 4

Eigenvalues and eigenvectors can be computed with the function eigen:
> eigen(A)
$values
[1] 7 2

$vectors
           [,1]       [,2]
[1,] -0.7071068 -0.9701425
[2,] -0.7071068  0.2425356

Determinant. How do we know whether a matrix can be inverted? A matrix whose determinant is 0 is a singular matrix and cannot be inverted. To check the determinant, R provides det():
> E <- matrix(1:9, 3, 3)
> E
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> det(E)
[1] 0

The following matrix F, however, can be inverted:
> F <- matrix((1:9)^2, 3, 3)
> F
     [,1] [,2] [,3]
[1,]    1   16   49
[2,]    4   25   64
[3,]    9   36   81
> det(F)
[1] -216

And the inverse of F ($F^{-1}$) can be computed with the function solve():
> solve(F)
          [,1]      [,2]       [,3]
[1,]  1.291667 -2.166667  0.9305556
[2,] -1.166667  1.666667 -0.6111111
[3,]  0.375000 -0.500000  0.1805556
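For solving AX = Y, solve also accepts the right-hand side directly, which avoids forming the inverse explicitly. The sketch below reuses the 2 x 2 system from the text; the names X_direct, X_inverse, residual and F_mat are illustrative only (F_mat is used instead of F so that the built-in shorthand F for FALSE is not masked).

```r
# The system 3*x1 + 4*x2 = 4 and x1 + 6*x2 = 2 from the text.
A <- matrix(c(3, 1, 4, 6), nrow = 2)   # filled by column: (3,1) then (4,6)
Y <- c(4, 2)

# Solve the system directly, without computing the inverse first.
X_direct <- solve(A, Y)

# The route used in the text: invert A, then multiply by Y.
X_inverse <- solve(A) %*% Y

# Residual A %*% X - Y should be zero up to rounding error.
residual <- A %*% X_direct - Y
print(X_direct)
print(max(abs(residual)))

# An invertible matrix times its inverse gives the identity matrix.
F_mat <- matrix((1:9)^2, 3, 3)
print(all.equal(F_mat %*% solve(F_mat), diag(3)))
```

Because floating-point arithmetic is inexact, a determinant that is zero in theory may print as a very small number, so comparing results with all.equal is usually safer than testing for exact equality.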

Beyond these simple calculations, R can be used for more complex ones. A notable strength of R is that it lets users freely build the calculations that suit each specific problem. I will return to this point in more detail in later chapters.

R has a package called Matrix that is designed specifically for matrix computation. Readers can download the package, install it, and use it if needed. The download address is: http://cran.au.r-project.org/bin/windows/contrib/r-release/Matrix_0.995-8.zip

together with its documentation (about 80 pages long): http://cran.au.r-project.org/doc/packages/Matrix.pdf

6 Probability calculations and simulation

Probability is the foundation of statistical analysis. All methods of data analysis and statistical inference rest on probability theory. Probability theory is concerned with describing and expressing the distribution law of a random variable. In practice, "describing" often simply means counting the cases or possibilities that can occur for one or more variables. For instance, if we pick 2 subjects at random and each can be classified by two characteristics such as sex and a preference, the question is how many combinations of these two characteristics there are in total. For a continuous variable such as blood pressure, describing means computing summary statistics such as the mean, median, variance, standard deviation, and so on. From these descriptive statistics, probability theory gives us models for setting up distribution functions for those variables. In this chapter I discuss two main areas: counting and distribution functions.

6.1 Counting

6.1.1 Permutations

By definition, a permutation of n elements is an arrangement of the n elements in a given order. This definition is rather hard to follow! A concrete example may make it clearer. Imagine an emergency centre with 3 doctors (x, y and z) and 3 patients (a, b and c) waiting to be seen. Any of the three doctors can see any of patients a, b or c. The question is: in how many ways can doctors be assigned to patients? To answer it, consider the following:

Doctor x has 3 choices: patient a, b or c.
Once doctor x has chosen a patient, doctor y has the two remaining choices.
Finally, once the other 2 doctors have chosen, doctor z has only 1 choice.

In total we have 3 x 2 x 1 = 6 possible assignments. As another example, at a party of 6 friends, in how many ways can they be seated at a table with 6 chairs? By the same reasoning, the answer is 6.5.4.3.2.1 = 720 ways. (Here the dot "." denotes multiplication.) This is exactly how permutations are counted.

We know that 3! = 3.2.1 = 6, and 0! = 1. In general, the number of permutations of n elements is $n! = n(n-1)(n-2)(n-3) \cdots 1$. In R this is easy to compute with prod():

Find 3!
> prod(3:1)
[1] 6

Find 10!
> prod(10:1)
[1] 3628800

Find 10.9.8.7.6.5.4
> prod(10:4)
[1] 604800

Find (10.9.8.7.6.5.4) / (40.39.38.37.36)
> prod(10:4) / prod(40:36)
[1] 0.007659481

6.1.2 Combinations

A combination of k out of n elements is any subset of k elements taken from a set of n elements. This definition, it must be said, is hard to follow and cumbersome! The easiest way in is an example. Three people (call them A, B and C) stand for the 2 positions of chair and vice-chair. In how many ways can these 2 positions be filled from the 3 people? We can picture 2 seats to be filled from 3 people:

Choice   Chair   Vice-chair
1        A       B
2        B       A
3        A       C
4        C       A
5        B       C
6        C       B

So there are 6 ways to fill the positions. Notice, however, that choices 1 and 2 in fact involve the same pair, which we should count only once (not twice). Likewise, choices 3 and 4, and choices 5 and 6, each count as a single pair. In total there are 3 ways to choose 2 people out of 3 when order does not matter. This number is called a combination.
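The enumeration above can be reproduced in R. The sketch below uses combn from the utils package (part of a standard R installation) to list the unordered pairs; multiplying by 2! recovers the 6 ordered assignments in the table. The names people, pairs, n_pairs and n_ordered are illustrative only.

```r
people <- c("A", "B", "C")

# All unordered pairs; each column of the result is one pair.
pairs <- utils::combn(people, 2)
print(pairs)

# Number of unordered pairs (combinations).
n_pairs <- ncol(pairs)

# Each pair can fill the two positions in 2! orders,
# which gives the ordered assignments listed in the table.
n_ordered <- n_pairs * factorial(2)
print(c(combinations = n_pairs, ordered = n_ordered))
```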

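In fact, the number of choices can be computed with the following formula:

$$\binom{3}{2} = \frac{3!}{2!\,(3-2)!} = \frac{6}{2} = 3$$

In general, the number of ways to choose k people from n people is:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

This formula is sometimes written $C_n^k$ instead of $\binom{n}{k}$. In R the calculation is very simple with the function choose(n, k). Here are a few examples.

Find $\binom{5}{2}$:
> choose(5, 2)
[1] 10

Find the probability that the pair A and B, out of 5 people, is elected to the two positions:
> 1/choose(5, 2)
[1] 0.1

6.2 Random variables and distribution functions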

Most statistical analysis relies on probability distribution laws for inference. The word "distribution" probably deserves a few lines of explanation here. If we pick 10 students at random from a class and record their height and sex, we might obtain data such as:

Student       1       2       3     4       5       6       7     8     9       10
Sex           Female  Female  Male  Female  Female  Female  Male  Male  Female  Male
Height (cm)   156     160     175   145     165     158     170   167   178     155

Pooling the data, we have 6 girls and 4 boys. In percentage terms, 60% are female and 40% male. In the language of probability, the probability of female is 0.6 and of male is 0.4.

For height, the mean is 162.9 cm, with a minimum of 145 cm and a maximum of 178 cm.

In the language of probability and statistics, sex and height are two random variables. They are random because we cannot predict their values exactly; we can only predict their central value, their mean and their variability. The variable sex takes only two values (male or female) and is called a discontinuous or discrete variable, or a categorical variable. Height can take any value from low to high and is therefore called a continuous variable. When we speak of a distribution, we are referring to the values a variable can take. Distribution functions are functions that describe variables in a systematic way, "systematic" meaning according to a specific mathematical model with given parameters. Probability and statistics offer quite a few distribution functions; here we look at some of the most important and most commonly used: the binomial distribution, the Poisson distribution and the normal distribution. For each distribution law there are 4 kinds of function we need to know:

the probability density function;
the cumulative probability distribution function;
the quantile function; and
the simulation function.

R has built-in functions for all of these that can be applied to probability calculations. Each function name consists of a prefix indicating the kind of function, followed by an abbreviation of the distribution's name. The prefixes are d (density, or probability for discrete variables), p (cumulative probability), q (quantile) and r (random numbers). The abbreviations are norm (normal distribution), binom (binomial distribution), pois (Poisson distribution), and so on. The table below summarises the functions and their parameters:

Distribution       Density                         Cumulative                      Quantile                        Simulation
Normal             dnorm(x, mean, sd)              pnorm(q, mean, sd)              qnorm(p, mean, sd)              rnorm(n, mean, sd)
Binomial           dbinom(x, size, prob)           pbinom(q, size, prob)           qbinom(p, size, prob)           rbinom(n, size, prob)
Poisson            dpois(x, lambda)                ppois(q, lambda)                qpois(p, lambda)                rpois(n, lambda)
Uniform            dunif(x, min, max)              punif(q, min, max)              qunif(p, min, max)              runif(n, min, max)
Negative binomial  dnbinom(x, size, prob)          pnbinom(q, size, prob)          qnbinom(p, size, prob)          rnbinom(n, size, prob)
Beta               dbeta(x, shape1, shape2)        pbeta(q, shape1, shape2)        qbeta(p, shape1, shape2)        rbeta(n, shape1, shape2)
Gamma              dgamma(x, shape, rate, scale)   pgamma(q, shape, rate, scale)   qgamma(p, shape, rate, scale)   rgamma(n, shape, rate, scale)
Geometric          dgeom(x, prob)                  pgeom(q, prob)                  qgeom(p, prob)                  rgeom(n, prob)
Exponential        dexp(x, rate)                   pexp(q, rate)                   qexp(p, rate)                   rexp(n, rate)
Weibull            dweibull(x, shape, scale)       pweibull(q, shape, scale)       qweibull(p, shape, scale)       rweibull(n, shape, scale)
Cauchy             dcauchy(x, location, scale)     pcauchy(q, location, scale)     qcauchy(p, location, scale)     rcauchy(n, location, scale)
F                  df(x, df1, df2)                 pf(q, df1, df2)                 qf(p, df1, df2)                 rf(n, df1, df2)
t                  dt(x, df)                       pt(q, df)                       qt(p, df)                       rt(n, df)
Chi-squared        dchisq(x, df)                   pchisq(q, df)                   qchisq(p, df)                   rchisq(n, df)

Notes: in the table above, df = degrees of freedom; prob = probability; n = sample size (the number of random values to generate); size = the number of trials. For the gamma distribution, rate and scale are alternative ways of giving the same parameter. For the other parameters, consult the help page of each distribution. The F, t and chi-squared distributions also have a further parameter, the non-centrality parameter (ncp), which is 0 by default; users can supply another suitable value if needed.

6.3 Probability distribution functions

6.3.1 The binomial distribution

As its name suggests, the binomial distribution deals with variables that have only two values: male / female, alive / dead, yes / no, and so on. It is stated by the following theorem: if an experiment is carried out n times, each time resulting in either success or failure, and the probability of success p is known in advance, then the probability of k successful experiments is:

$$P(k \mid n, p) = C_n^k\, p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n.$$

To understand the theorem better, let us go through a few examples.

Example 1: Binomial probability function. In the example above, the class has 10 students, 6 of whom are female. If 3 students are chosen at random, what is the probability that exactly 2 of them are female? We can answer this fairly manually by listing every case that can occur. Each pick has 2 possibilities (male or female), and with 3 picks we have 2^3 = 8 cases:

Student 1   Student 2   Student 3   Probability
Male        Male        Male        (0.4)(0.4)(0.4) = 0.064
Male        Male        Female      (0.4)(0.4)(0.6) = 0.096
Male        Female      Male        (0.4)(0.6)(0.4) = 0.096
Male        Female      Female      (0.4)(0.6)(0.6) = 0.144
Female      Male        Male        (0.6)(0.4)(0.4) = 0.096
Female      Male        Female      (0.6)(0.4)(0.6) = 0.144
Female      Female      Male        (0.6)(0.6)(0.4) = 0.144
Female      Female      Female      (0.6)(0.6)(0.6) = 0.216
All cases                           1.000

We know in advance that 6 of the 10 students are female, so the probability of female is 0.60 (in other words, the probability of picking a male student is 0.4). The probability that all 3 chosen students are male is therefore 0.4 x 0.4 x 0.4 = 0.064. In the table there are 3 cases with exactly 2 female students: Male-Female-Female, Female-Female-Male and Female-Male-Female, each with probability 0.144. Hence the probability of picking exactly 2 female students among the 3 chosen is 3 x 0.144 = 0.432.

In R, the function dbinom(k, n, p), that is dbinom(x = k, size = n, prob = p), computes the formula

$$P(k \mid n, p) = C_n^k\, p^k (1-p)^{n-k}$$

quickly. For the case above we simply type:
> dbinom(2, 3, 0.60)
[1] 0.432

Example 2: Cumulative binomial probability distribution. The probability that an anti-osteoporosis drug is effective is about 70% (that is, p = 0.70). If we treat 10 patients, what is the probability that at least 8 patients respond? In other words, if X is the number of successfully treated patients, we want P(X ≥ 8). To answer this we use the function pbinom(k, n, p). Recall that pbinom(k, n, p) gives P(X ≤ k). Therefore P(X ≥ 8) = 1 - P(X ≤ 7), and the answer in R is:

> 1-pbinom(7, 10, 0.70)
[1] 0.3827828
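The same upper-tail probability can be reached in two other ways, which makes a handy cross-check: by summing the individual probabilities for k = 8, 9, 10, or by asking pbinom for the upper tail through its lower.tail argument. A short sketch with illustrative variable names:

```r
n <- 10      # number of patients treated
p <- 0.70    # probability that the treatment works

# 1) Complement of the cumulative probability, as in the text.
p_complement <- 1 - pbinom(7, n, p)

# 2) Sum of the point probabilities P(X = 8) + P(X = 9) + P(X = 10).
p_sum <- sum(dbinom(8:10, n, p))

# 3) Upper tail requested directly: P(X > 7).
p_upper <- pbinom(7, n, p, lower.tail = FALSE)

print(c(p_complement, p_sum, p_upper))
```

The lower.tail = FALSE form is generally preferable when the tail probability is very small, because it avoids subtracting a number close to 1 from 1.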

Example 3: Simulating a binomial distribution. Suppose about 20% of a population has hypertension. If we draw 1000 samples, each of 20 people chosen at random from the population, what will the distribution of the number of hypertensive patients look like? To answer this we can use the function rbinom(n, size, prob) in R with the following parameters:

> b <- rbinom(1000, 20, 0.20)
> table(b)
b
  0   1   2   3   4   5   6   7   8   9  10
  6  45 147 192 229 169 105  68  23  13   3

The first row (0, 1, 2, ..., 10) is the number of hypertensive patients among the 20 people we selected. The second row tells us in how many of the 1000 samples that number occurred. So 6 samples contained no hypertensive patient, 45 samples contained just 1, and so on. Perhaps the easiest way to see this is to plot the frequencies with hist:

> hist(b, main="Number of hypertensive patients")

[Figure 1: histogram titled "Number of hypertensive patients"; horizontal axis b (0 to 10), vertical axis Frequency (0 to 200).]

Figure 1. Distribution of the number of hypertensive patients among 20 people chosen at random from a population in which 20% are hypertensive, with the sampling repeated 1000 times.

The figure shows that samples with 4 hypertensive patients (out of 20 people per sample) are the most frequent (22.9%). This makes sense: since the prevalence of hypertension is 20%, we expect on average 4 of the 20 people chosen to be hypertensive. The important point the figure makes, however, is that we sometimes observe as many as 10 hypertensive patients, even though the probability of such a sample is very low (only 3/1000).
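The simulated frequencies can be set against the exact binomial probabilities. The sketch below fixes the random seed so that the run is reproducible; the seed value and the names observed, expected and comparison are arbitrary choices for this illustration, and the simulated counts will differ from those printed in the text because the random draws differ.

```r
set.seed(1)   # arbitrary seed, only to make the run reproducible

# 1000 samples of 20 people; each person is hypertensive with probability 0.20.
b <- rbinom(1000, size = 20, prob = 0.20)

# Observed relative frequency of each possible count 0, 1, ..., 20.
observed <- table(factor(b, levels = 0:20)) / 1000

# Exact probabilities from the binomial formula.
expected <- dbinom(0:20, size = 20, prob = 0.20)

# Side-by-side comparison, rounded for readability.
comparison <- round(cbind(count = 0:20,
                          observed = as.numeric(observed),
                          expected = expected), 3)
print(comparison[1:11, ])
```

Since b takes only whole-number values, barplot(table(b)) is an alternative to hist that draws one bar per count.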

Example 4: Applying the binomial distribution. Twenty customers are invited to taste two beers, A and B, and are asked which they prefer. The result is that 16 people prefer beer A. The question is whether this result is enough to conclude that beer A is preferred by more people than beer B, or whether it could simply be due to chance.

We start by assuming that if there is no difference, the probability of preferring beer A is p = 0.50 and the probability of preferring beer B is q = 0.50. If this assumption is true, what is the probability of observing 16 or more people out of 20 who prefer beer A? We can compute this very easily in R:
> 1- pbinom(15, 20, 0.5)
[1] 0.005908966

The answer is a probability of about 0.006, or 0.6%. In other words, if the two beers really were alike, the probability that 16 or more of 20 people prefer beer A would be only about 0.6%. That is, we have evidence that beer A really is preferred by more people than beer B, and that the result is not due to chance. Note that we use 15 (rather than 16) because P(X ≥ 16) = 1 - P(X ≤ 15), and in our case P(X ≤ 15) = pbinom(15, 20, 0.5).

6.3.2 The Poisson distribution

The Poisson distribution is, broadly speaking, very similar to the binomial distribution, except that the parameter p is usually very small and n is usually very large. For that reason the Poisson distribution is often used to describe events that occur very rarely (such as the number of cancer cases in a population). It has also been applied widely and successfully in engineering and market research, for instance to the number of customers arriving at a restaurant per hour.

Example 5: Poisson probability function. After many months of observation, the rate of spelling errors made by a typist is known: on average, about 1 wrong word in every 2,000 words typed. In a text of 2,000 words, what is the probability that the typist makes exactly 2 errors? More than 2 errors?

Because the frequency is quite low, we can assume that the number of spelling errors (call it X) is a random variable following a Poisson distribution. Here the mean error rate is 1 (λ = 1). The Poisson law states that the probability that X = k, given the mean rate λ, is:

$$P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$$

The answer to the question above is therefore:

$$P(X = 2 \mid \lambda = 1) = \frac{e^{-1}\, 1^2}{2!} = 0.1839$$

This can be computed more quickly in R with the function dpois:
> dpois(2, 1)
[1] 0.1839397

We can also compute the probability of 1 error and the probability of no error:
> dpois(1, 1)
[1] 0.3678794
> dpois(0, 1)
[1] 0.3678794

Note that in the function above we simply supply the parameters k = 2 and λ = 1. This is the probability that the typist makes exactly 2 errors. The probability of more than 2 errors (that is, 3, 4, 5, ... errors) can be computed as:

$$P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5) + \cdots = 1 - P(X \le 2) = 1 - 0.3678 - 0.3678 - 0.1839 = 0.08$$

In R we can compute this as follows:
# P(X ≤ 2)
> ppois(2, 1)
[1] 0.9196986
# 1 - P(X ≤ 2)
> 1-ppois(2, 1)
[1] 0.0803014

6.3.3 The normal distribution

The two distribution laws we have just examined belong to the family of discrete distributions, which apply to variables whose values are counts or categories. For continuous variables there are several other suitable distributions, the most important of which is the normal distribution. The normal distribution is the most important foundation of statistical analysis. It is no exaggeration to say that almost all of statistical theory is built on the normal distribution. The normal density function has two parameters: the mean μ and the variance σ² (or the standard deviation σ). Let X be a variable (such as height); the normal density function states that the probability density at X = x is:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$

Example 6: Normal probability density function. The current mean height of Vietnamese women is 156 cm, with a standard deviation of 4.6 cm. It is also known that height follows a normal distribution. With the two parameters μ = 156 and σ = 4.6, we can construct a distribution function of height for the whole population of Vietnamese women, and this function has the following shape:

[Figure 2: curve titled "Probability distribution of height in Vietnamese women"; horizontal axis Height (130 to 200), vertical axis f(height) (0.00 to 0.08).]

Figure 2. Distribution of height in Vietnamese women, with mean 156 cm and standard deviation 4.6 cm. The horizontal axis is height and the vertical axis is the probability density at each height.

The figure was drawn with the following two commands. The first creates a variable height with the values 130, 131, 132, ..., 200 cm. The second draws the curve for a mean of 156 cm and a standard deviation of 4.6 cm.

> height <- seq(130, 200, 1)
> plot(height, dnorm(height, 156, 4.6), type="l", ylab="f(height)", xlab="Height", main="Probability distribution of height in Vietnamese women")

With the two parameters above (and the figure), we can estimate the density at any height. For example, the density at a height of 160 cm for a Vietnamese woman is:

$$f(160 \mid \mu = 156, \sigma = 4.6) = \frac{1}{4.6\sqrt{2 \times 3.1416}} \exp\left[-\frac{(160-156)^2}{2 \times 4.6^2}\right] = 0.0594$$

The function dnorm(x, mean, sd) in R computes this for us concisely:
> dnorm(160, mean=156, sd=4.6)
[1] 0.05942343

Cumulative normal probability function. Because height is a continuous variable, in practice we rarely want the value at one specific point x; we usually want the probability of a range of values from a to b. We might, for example, want the probability that height is between 150 and 160 cm, that is P(150 ≤ X ≤ 160), or the probability that height is below 145 cm, that is P(X < 145). To answer such questions we need the cumulative normal probability function, defined as follows:

$$P(a \le X \le b) = \int_a^b f(x)\, dx$$

Thus P(150 ≤ X ≤ 160) is the area under the curve of Figure 2 between 150 and 160 on the horizontal axis. In R the very useful function pnorm(x, mean, sd) computes cumulative probabilities for a normal distribution:

$$\text{pnorm}(a, \text{mean}, \text{sd}) = \int_{-\infty}^{a} f(x)\, dx = P(X \le a \mid \text{mean}, \text{sd})$$

For example, the probability that a Vietnamese woman's height is 150 cm or less is 9.6%:
> pnorm(150, 156, 4.6)
[1] 0.0960575

And the probability that a Vietnamese woman is taller than 164 cm is:
> 1-pnorm(164, 156, 4.6)
[1] 0.04100591

In other words, only about 4.1% of Vietnamese women are taller than 164 cm.
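The interval probability P(150 ≤ X ≤ 160) mentioned above is the difference between two cumulative probabilities. The sketch below computes it with pnorm and, as a cross-check, integrates the density numerically with integrate; the variable names are illustrative only.

```r
mu <- 156      # mean height (cm), from the text
sigma <- 4.6   # standard deviation (cm), from the text

# P(150 <= X <= 160) = P(X <= 160) - P(X <= 150)
p_interval <- pnorm(160, mean = mu, sd = sigma) - pnorm(150, mean = mu, sd = sigma)

# Cross-check: area under the density curve between 150 and 160.
area <- integrate(function(h) dnorm(h, mean = mu, sd = sigma),
                  lower = 150, upper = 160)$value

print(c(by_pnorm = p_interval, by_integration = area))
```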

Example 7: Applying the normal distribution. In a population we know that mean blood pressure is 100 mmHg with a standard deviation of 13 mmHg. What proportion of people in this population have a blood pressure of 120 mmHg or higher? The answer in R is:
> 1-pnorm(120, mean=100, sd=13)
[1] 0.0619679

That is, about 6.2% of people in this population have a blood pressure of 120 mmHg or higher.

6.3.4 The standard normal distribution

A variable X that follows a normal distribution with mean μ and variance σ² is usually written:

X ~ N(μ, σ²)

Here μ and σ² depend on the unit of measurement of the variable. Height is measured in cm (or m), blood pressure in mmHg, age in years, and so on, so variables described in their original units are sometimes hard to compare. A simpler approach is to standardize X so that its mean is 0 and its variance is 1. After a little algebra it is easy to show that the transformation of X that meets this condition is:

$$Z = \frac{X - \mu}{\sigma}$$

In mathematical terms: if X ~ N(μ, σ²), then (X - μ)/σ ~ N(0, 1). By this formula, Z is essentially the difference between a value and the mean, expressed in numbers of standard deviations. If Z = 0, X equals the mean μ. If Z = -1, X is exactly 1 standard deviation below the mean. Likewise, if Z = 2.5, X is exactly 2.5 standard deviations above the mean, and so on. The distribution of height in Vietnamese women can be described in this new unit, the z-score, as follows:

[Figure 3: curve titled "Probability distribution of height in Vietnamese women"; horizontal axis z (-4 to 4), vertical axis f(z) (0.0 to 0.4).]

Figure 3. Standardized distribution of height in Vietnamese women.

The figure was drawn with the following two commands:
> height <- seq(-4, 4, 0.1)
> plot(height, dnorm(height, 0, 1), type="l", ylab="f(z)", xlab="z", main="Probability distribution of height in Vietnamese women")

The convenience of the standard normal distribution is that we can use it to describe and compare the distribution of any variable, since everything is converted to z-scores. In the figure above the vertical axis is the density of z and the horizontal axis is z. We can easily compute in R the probability that z is less than some constant. For example, suppose we want P(z ≤ -1.96) for a distribution with mean 0 and standard deviation 1:
> pnorm(-1.96, mean=0, sd=1)
[1] 0.02499790

Or P(z ≤ 1.96):
> pnorm(1.96, mean=0, sd=1)
[1] 0.9750021

Therefore P(-1.96 < z < 1.96) is:
> pnorm(1.96) - pnorm(-1.96)
[1] 0.9500042

In other words, there is a 95% probability that z lies between -1.96 and 1.96. (Note that in the last command I did not supply mean=0, sd=1, because the default value of mean in pnorm is 0 and that of sd is 1.)
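Standardizing does not change any probability: a cumulative probability computed on the original scale equals the one computed from the z-score on the standard scale. A brief sketch using the height parameters from the text (variable names are illustrative):

```r
mu <- 156      # mean height (cm)
sigma <- 4.6   # standard deviation (cm)
x <- 150       # height of interest (cm)

# Convert the height to a z-score.
z <- (x - mu) / sigma

# The same probability on the original and on the standard scale.
p_original <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)
print(c(z = z, p_original = p_original, p_standard = p_standard))

# Going the other way: the heights that bound the central 95% of the distribution.
central_95 <- mu + c(-1.96, 1.96) * sigma
print(central_95)
```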

Example 6 (continued). To recap, the mean height of Vietnamese women is 156 cm with a standard deviation of 4.6 cm. A woman who is 170 cm tall therefore has z = (170 - 156) / 4.6 = 3.04 standard deviations, and the proportion of Vietnamese women taller than 170 cm is very low, only about 0.1%.
> 1-pnorm(3.04)
[1] 0.001182891

Finding a quantile of a normal distribution. Sometimes we need to do the calculation in reverse. We may want to know: if the probability that Z is less than some constant z equals p, what is z? In probability notation, we want to find z such that:

P(Z < z) = p

To answer this question we use the function qnorm(p, mean=, sd=).

Example 8: Given Z ~ N(0, 1) and P(Z < z) = 0.95, we want to find z.
> qnorm(0.95, mean=0, sd=1)
[1] 1.644854

Or P(Z < z) = 0.975 for a normal distribution with mean 0 and standard deviation 1:
> qnorm(0.975, mean=0, sd=1)
[1] 1.959964

6.3.5 The t, F and χ² distributions

The t, F and χ² distributions are in fact functions of the normal distribution. How they are related and how they are computed can be described by the following notes:

    Phn phi Chi bnh phng (2). Phn phi 2 xut pht t tng bnh phng ca mt bin phn phi chun. Nu nu xi ~ N(0, 1), v gi 2

    n

    ii

    u x=

    = , th u tun theo lut phn phi Chi bnh phng vi bc t do n (thng vit tt l df). Ni theo ngn ng ton, u ~ 2n .

  • 14

    Example 9: A Chi-square calculation needs only two inputs, u and n. For instance, the value of the density function at u = 21 with df = 13 is given by dchisq (this is a density, not a probability):

    > dchisq(21, 13)
    [1] 0.01977879

    To find the probability that u is at most 21 with 13 degrees of freedom, that is P(u ≤ 21 | df=13), use pchisq:

    > pchisq(21, 13)
    [1] 0.9270714

    Equivalently, $P(\chi^2_{13} < 21) = 0.927$. To find the quantile of u corresponding to 95% of a χ² distribution with 15 degrees of freedom:

    > qchisq(0.95, 15)
    [1] 24.99579

    In other words, $P(\chi^2_{15} < 24.99) = 0.95$.

    Non-centrality. In the definition above, the χ² distribution arises from the sum of squares of normal variables with mean 0 and variance 1. If the normal variables have means other than 0 (with the variance still equal to 1), the sum of squares follows a non-central Chi-square distribution. If $x_i \sim N(\mu_i, 1)$ and

    $$u = \sum_{i=1}^{n} x_i^2,$$

    then u follows a non-central Chi-square distribution with n degrees of freedom and non-centrality parameter

    $$\lambda = \sum_{i=1}^{n} \mu_i^2,$$

    written $u \sim \chi^2_{n,\lambda}$. In addition, the mean of u is $n + \lambda$ and the variance of u is $2(n + 2\lambda)$.

    To find the probability that u is at most 21, given 13 degrees of freedom and a non-centrality parameter of 5.4:

    > pchisq(21, 13, 5.4)
    [1] 0.6837649

    That is, $P(\chi^2_{13,\,5.4} < 21) = 0.684$.


    To find the quantile corresponding to 50% of a χ² distribution with 7 degrees of freedom and a non-centrality parameter of 3:

    > qchisq(0.5, 7, 3)
    [1] 9.180148

    Therefore $P(\chi^2_{7,\,3} < 9.180148) = 0.50$.
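    The non-central case can be sketched the same way as the central one: shift the normal draws away from 0 and sum their squares. Giving every variable the same mean is an assumption made purely for illustration; any set of means whose squares sum to λ would serve:

```r
set.seed(321)                      # arbitrary seed, only for reproducibility
n_df   <- 13                       # degrees of freedom
lambda <- 5.4                      # non-centrality parameter
n_sim  <- 100000

# Equal means chosen so that the sum of the squared means equals lambda.
mu_each <- sqrt(lambda / n_df)
x <- matrix(rnorm(n_df * n_sim, mean = mu_each, sd = 1), nrow = n_df)
u <- colSums(x^2)

# Empirical versus theoretical P(u <= 21)
emp  <- mean(u <= 21)
theo <- pchisq(21, df = n_df, ncp = lambda)
print(c(empirical = emp, theoretical = theo))

# Sample mean and variance versus n + lambda and 2 * (n + 2 * lambda)
print(c(mean = mean(u), expected_mean = n_df + lambda))
print(c(var = var(u), expected_var = 2 * (n_df + 2 * lambda)))
```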

    The t distribution. We have just seen that if $X \sim N(\mu, \sigma^2)$ then $(X - \mu)/\sigma \sim N(0, 1)$. That statement holds exactly only when the variance σ² is known. In practice we rarely know the variance exactly and have to estimate it from experimental data. When the variance is estimated from the study data, and we call that estimate s², we can state that $(X - \mu)/s \sim t(0, v)$, where v is the degrees of freedom.

    Example 10. Find the probability that x is greater than 1.1, where x follows a t distribution with 6 degrees of freedom:

    > 1-pt(1.1, 6)
    [1] 0.1567481

    That is, P(t6 > 1.1) = 1 - P(t6 < 1.1) = 0.157. To find the quantile corresponding to 95% of a t distribution with 15 degrees of freedom:

    > qt(0.95, 15)
    [1] 1.753050

    In other words, P(t15 < 1.753050) = 0.95.
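    Because the t distribution replaces the known variance with an estimate, its quantiles are wider than the normal ones and shrink toward them as the degrees of freedom grow. A small sketch of that behaviour; the list of degrees of freedom is an arbitrary choice:

```r
dfs <- c(1, 2, 5, 15, 30, 100)     # degrees of freedom to compare
q_t <- qt(0.95, df = dfs)          # 95th percentile of t for each df
q_z <- qnorm(0.95)                 # 95th percentile of N(0, 1)

# Side-by-side table of t quantiles and the normal quantile
print(data.frame(df = dfs, q_t = round(q_t, 4), q_norm = round(q_z, 4)))
```

    With 15 degrees of freedom the table reproduces the value obtained with qt(0.95, 15) above.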

    The F distribution. The ratio of two independent χ² variables, each divided by its degrees of freedom, can be shown to follow an F distribution. In other words, if $u \sim \chi^2_n$ and $v \sim \chi^2_m$ are independent, then $(u/n)/(v/m) \sim F_{n,m}$, where n is the numerator degrees of freedom and m is the denominator degrees of freedom.

    Example 11: Find the probability that an F value is greater than 3.24, given that the variable follows an F distribution with 3 and 15 degrees of freedom and a non-centrality parameter of 5:

    > 1-pf(3.24, 3, 15, 5)
    [1] 0.3558721

    Therefore P(F(3, 15, 5) > 3.24) = 1 - P(F(3, 15, 5) ≤ 3.24) = 0.356.

    With 3 and 15 degrees of freedom, find C such that P(F(3, 15) > C) = 0.05. The solution in R is:

    > qf(1-0.05, 3, 15)
    [1] 3.287382

    In other words, P(F(3, 15) > 3.287382) = 1 - P(F(3, 15) ≤ 3.287382) = 1 - 0.95 = 0.05.
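    The link between the Chi-square and F distributions can be illustrated by simulation: each Chi-square variable is divided by its degrees of freedom before the ratio is taken. This is a sketch under assumed settings (seed and number of simulations are arbitrary):

```r
set.seed(2024)                     # arbitrary seed, only for reproducibility
n <- 3                             # numerator degrees of freedom
m <- 15                            # denominator degrees of freedom
n_sim <- 100000

u <- rchisq(n_sim, df = n)         # independent Chi-square draws
v <- rchisq(n_sim, df = m)
f <- (u / n) / (v / m)             # ratio of scaled Chi-squares

# The 95% quantile from qf() should leave about 5% of simulated values above it
crit     <- qf(0.95, n, m)
emp_tail <- mean(f > crit)
print(c(critical_value = crit, empirical_tail = emp_tail))
```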

    6.4 Simulation

    In statistical analysis, a limited sample size sometimes makes it hard to estimate parameters accurately, and in such uncertain situations we turn to simulation to learn how much one or more parameters vary. Simulation is usually based on probability distributions. It is a fairly complex field that I do not intend to cover fully in this chapter. Here I present only a few illustrative simulation models that the reader can build on.

    Example 11: A simulation to demonstrate that the variance of a mean equals the variance divided by n, $\mathrm{var}(\bar{X}) = \sigma^2 / n$. Consider a discrete variable that takes the values 1, 3 and 5 with the following probabilities:

    x      P(x)
    1      0.60
    3      0.30
    5      0.10

    From these figures the mean is (1 x 0.60) + (3 x 0.30) + (5 x 0.10) = 2.0 and the variance (the reader can verify this) is 1.8. We now use these two parameters as a reference and simulate 500 draws. Three commands are involved: the first creates the three values of x, the second enters the probability of each value of x, and the third, sample, asks R to generate 500 random numbers and store them in the object draws.
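    The three commands described above can be sketched as follows. The name px for the probability vector and the seed are assumptions; only x, draws and the use of sample come from the text. The theoretical mean and variance are computed first so the simulated values have something to be compared with:

```r
set.seed(42)                      # assumed seed, for reproducibility only
x  <- c(1, 3, 5)                  # the three possible values
px <- c(0.60, 0.30, 0.10)         # assumed name for their probabilities

mu     <- sum(x * px)             # theoretical mean
sigma2 <- sum((x - mu)^2 * px)    # theoretical variance
print(c(mean = mu, variance = sigma2))

# 500 random values of x drawn with probabilities px
draws <- sample(x, size = 500, replace = TRUE, prob = px)
print(table(draws))
```

    The argument replace = TRUE is required here because 500 values are drawn from only three possible outcomes.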

    [Figure: histogram titled "500 draws". Horizontal axis: draws (1 to 5); vertical axis: Frequency (0 to 300).]

    From the probability distribution we know that, on average, 60% of the draws take the value 1, 30% the value 3 and 10% the value 5. We therefore expect to observe about 300, 150 and 50 draws of each value. The histogram above shows that the simulated frequencies are close to what we expect. We also know that the variance of this variable is 1.8. Let us check whether the simulation agrees:

    > var(draws)
    [1] 1.835671

    The variance of the 500 draws is 1.836, not far from the expected value. Now we simulate 500 means x̄ (each x̄ is the mean of 4 simulated values) from the same population:

    > draws <- sample(x, size=4*500, replace=TRUE, prob=c(0.6, 0.3, 0.1))
    > draws = matrix(draws, 4)
    > drawmeans = apply(draws, 2, mean)

    The first two commands create an object named draws with 4 rows, each holding 500 values drawn from the distribution above. In other words, we have 4*500 = 2000 numbers arranged in 500 columns (1 to 500), so each column contains 4 numbers. The third command computes the mean of each column, which yields 500 means stored in the object drawmeans. The following histogram shows the distribution of these 500 means:

    > hist(drawmeans, breaks=seq(1,5,by=0.25), main="500 means of 4 draws")

    [Figure: histogram of the means of 4 draws. Horizontal axis: drawmeans (1 to 5); vertical axis: Frequency (0 to 150).]

    We can see that the variance of this distribution is smaller. In fact, the variance of the 500 means is 0.45:

    > var(drawmeans)
    [1] 0.4501112

    This matches the value of 0.45 that we expect from the formula $\mathrm{var}(\bar{X}) = \sigma^2 / 4 = 1.8 / 4 = 0.45$.

    6.4.1 Simulating the binomial distribution

    Example 12: Simulating samples from a population that follows a binomial distribution. Suppose we know that 20% of a population has diabetes (probability p=0.2). We want to draw samples from this population, each sample containing 20 subjects, and repeat the sampling 100 times:

    > bin <- rbinom(100, 20, 0.2)
    > bin
    [1] 4 4 5 3 2 2 3 2 5 4 3 6 7 3 4 4 1 5 3 5 3 4 4 5 1 4 4 4 4 3 2 4 2 2 5 4 5
    [38] 7 3 5 3 3 4 3 2 4 5 2 4 5 5 4 2 2 2 8 5 5 5 3 4 5 7 4 3 6 4 6 6 8 8 3 3 1
    [75] 1 4 4 2 3 9 7 4 4 0 0 8 6 9 3 1 4 5 6 4 5 3 2 4 3 2

    The output shows that the first sample contains 4 people with the disease, the second also 4, the third 5, and so on. These results can be summarized in a histogram:

    > hist(bin, xlab="Number of diabetic patients", ylab="Number of samples", main="Distribution of the number of diabetic patients")

    [Figure: histogram titled "Distribution of the number of diabetic patients". Horizontal axis: Number of diabetic patients (0 to 8); vertical axis: Number of samples (0 to 25).]
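    The simulated counts can be set against what the binomial distribution predicts for 100 samples of 20 subjects with p = 0.2. A sketch using dbinom; the variable names are illustrative only:

```r
k <- 0:20                                     # possible numbers of patients in a sample of 20
prob_k   <- dbinom(k, size = 20, prob = 0.2)  # theoretical binomial probabilities
expected <- 100 * prob_k                      # expected number of samples (out of 100) with k patients

# Expected counts for k = 0, ..., 9, comparable with the histogram bars
print(round(expected[k <= 9], 1))

theo_mean <- sum(k * prob_k)                  # equals n * p = 4
theo_var  <- sum((k - theo_mean)^2 * prob_k)  # equals n * p * (1 - p) = 3.2
print(c(mean = theo_mean, variance = theo_var))
```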

    > mean(bin)
    [1] 3.97

    This is just what we expect: because each sample has 20 subjects and the probability is 20%, we predict an average of 4 diabetic patients per sample.

    6.4.2 Simulating the Poisson distribution

    Example 13: Simulating samples from a population that follows a Poisson distribution. In the following example we simulate 100 values from a population that follows a Poisson distribution with mean λ=3:

    > pois <- rpois(100, lambda=3)
    > pois
    [1] 4 3 2 4 2 3 4 4 0 7 5 0 3 3 4 2 2 6 1 4 2 3 3 5 4 2 1 4 0 2 1 5 1 2 2 2 6
    [38] 1 3 6 3 3 5 4 3 2 2 5 3 3 3 1 4 7 3 4 3 2 6 1 4 1 0 5 2 2 2 3 6 8 4 4 1 4
    [75] 1 0 0 4 3 3 2 3 3 3 4 1 5 4 4 1 3 1 6 4 4 4 2 2 2 4

    And its distribution:

    [Figure: histogram titled "Histogram of pois". Horizontal axis: pois (0 to 8); vertical axis: Frequency (0 to 20).]
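    A defining feature of the Poisson distribution is that its mean and variance are both equal to λ. A sketch with a larger simulation than the one above, so the agreement is easier to see; the seed, the sample size and the name pois_sim are assumptions:

```r
set.seed(7)                                # arbitrary seed, only for reproducibility
pois_sim <- rpois(100000, lambda = 3)      # a large Poisson sample with lambda = 3

# Sample mean and variance should both be close to lambda = 3
print(c(mean = mean(pois_sim), variance = var(pois_sim)))

# Theoretical probabilities of observing 0, 1, ..., 8 events
theo <- dpois(0:8, lambda = 3)
print(round(theo, 3))
```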

    The Poisson and exponential distributions. In the following example we simulate the times at which patients arrive at a hospital. Patients arrive at random according to a Poisson distribution, with an average of 15 patients per 150 minutes. It is easy to show that the time between two consecutive arrivals follows an exponential distribution. We want to know when patients arrive, so we simulate 15 inter-arrival times from an exponential distribution with a rate of 15/150 = 0.1 per minute. The following commands do this:

    # Generate the times between arrivals at the hospital
    > appoint <- rexp(15, 0.1)
    > times <- round(appoint, 0)
    > times
    [1] 37 5 8 10 24 5 1 7 8 6 12 6 3 25 15

    6.4.3 Simulating the χ², t and F distributions

    The same approach applies to other distributions such as the negative binomial (rnbinom), gamma (rgamma), beta (rbeta), Chi-square (rchisq), exponential (rexp), t (rt), F (rf), and so on. The parameters of these simulation functions are given at the beginning of the chapter. The commands below illustrate the common distributions.

    Chi-square distribution for several degrees of freedom:

    > curve(dchisq(x, 1), xlim=c(0,10), ylim=c(0,0.6), col="red", lwd=3)
    > curve(dchisq(x, 2), add=TRUE, col="green", lwd=3)
    > curve(dchisq(x, 3), add=TRUE, col="blue", lwd=3)
    > curve(dchisq(x, 5), add=TRUE, col="orange", lwd=3)
    > abline(h=0, lty=3)
    > abline(v=0, lty=3)
    > legend(par("usr")[2], par("usr")[4], xjust=1,
    +        c("df=1", "df=2", "df=3", "df=5"),
    +        lwd=3, lty=1,
    +        col=c("red", "green", "blue", "orange"))

    [Figure: Chi-square density curves. Horizontal axis: x (0 to 10); vertical axis: dchisq(x, 1) (0.0 to 0.6); legend: df=1, df=2, df=3, df=5.]

    Figure 4. Chi-square distribution with 1, 2, 3 and 5 degrees of freedom.

    The t distribution:

    > curve(dt(x, 1), xlim=c(-3,3), ylim=c(0,0.4), col="red", lwd=3)
    > curve(dt(x, 2), add=TRUE, col="blue", lwd=3)
    > curve(dt(x, 5), add=TRUE, col="green", lwd=3)
    > curve(dt(x, 10), add=TRUE, col="orange", lwd=3)
    > curve(dnorm(x), add=TRUE, lwd=4, lty=3)
    > title(main="Student T distributions")
    > legend(par("usr")[2], par("usr")[4], xjust=1,
    +        c("df=1", "df=2", "df=5", "df=10", "Normal distribution"),
    +        lwd=c(2,2,2,2,2), lty=c(1,1,1,1,3),
    +        col=c("red", "blue", "green", "orange", par("fg")))

    [Figure: plot titled "Student T distributions". Horizontal axis: x (-3 to 3); vertical axis: dt(x, 1) (0.0 to 0.4); legend: df=1, df=2, df=5, df=10, Normal distribution.]

    Figure 5. The t distribution with 1, 2, 5 and 10 degrees of freedom compared with the normal distribution.

    The F distribution:

    > curve(df(x,1,1), xlim=c(0,2), ylim=c(0,0.8), lwd=3)
    > curve(df(x,3,1), add=TRUE)
    > curve(df(x,6,1), add=TRUE, lwd=3)
    > curve(df(x,3,3), add=TRUE, col="red")
    > curve(df(x,6,3), add=TRUE, col="red", lwd=3)
    > curve(df(x,3,6), add=TRUE, col="blue")
    > curve(df(x,6,6), add=TRUE, col="blue", lwd=3)
    > title(main="Fisher F distributions")
    # One legend entry per curve, with line widths and colours matching the curves drawn above
    > legend(par("usr")[2], par("usr")[4], xjust=1,
    +        c("df=1,1", "df=3,1", "df=6,1", "df=3,3", "df=6,3", "df=3,6", "df=6,6"),
    +        lwd=c(3,1,3,1,3,1,3), lty=1,
    +        col=c(par("fg"), par("fg"), par("fg"), "red", "red", "blue", "blue"))

    [Figure: plot titled "Fisher F distributions". Horizontal axis: x (0.0 to 2.0); vertical axis: df(x, 1, 1) (0.0 to 0.8); legend: df=1,1, df=3,1, df=6,1, df=3,3, df=6,3, df=3,6, df=6,6.]

    Figure 6. The F distribution with various degrees of freedom.

    The gamma distribution:

    > curve( dgamma(x,1,1), xlim=c(0,5) )
    > curve( dgamma(x,2,1), add=TRUE, col='red' )
    > curve( dgamma(x,3,1), add=TRUE, col='green' )
    > curve( dgamma(x,4,1), add=TRUE, col='blue' )
    > curve( dgamma(x,5,1), add=TRUE, col='orange' )
    > title(main="Gamma probability distribution function")
    > legend(par('usr')[2], par('usr')[4], xjust=1,
    +        c('k=1 (Exponential distribution)', 'k=2', 'k=3', 'k=4', 'k=5'),
    +        lwd=1, lty=1,
    +        col=c(par('fg'), 'red', 'green', 'blue', 'orange') )

    [Figure: plot titled "Gamma probability distribution function". Horizontal axis: x (0 to 5); vertical axis: dgamma(x, 1, 1) (0.0 to 1.0); legend: k=1 (Exponential distribution), k=2, k=3, k=4, k=5.]

    Figure 7. The gamma distribution with various shapes.

    The beta distribution:

    > curve( dbeta(x,1,1), xlim=c(0,1), ylim=c(0,4) )
    > curve( dbeta(x,2,1), add=TRUE, col='red' )
    > curve( dbeta(x,3,1), add=TRUE, col='green' )
    > curve( dbeta(x,4,1), add=TRUE, col='blue' )
    > curve( dbeta(x,2,2), add=TRUE, lty=2, lwd=2, col='red' )
    > curve( dbeta(x,3,2), add=TRUE, lty=2, lwd=2, col='green' )
    > curve( dbeta(x,4,2), add=TRUE, lty=2, lwd=2, col='blue' )
    > curve( dbeta(x,2,3), add=TRUE, lty=3, lwd=3, col='red' )
    > curve( dbeta(x,3,3), add=TRUE, lty=3, lwd=3, col='green' )
    > curve( dbeta(x,4,3), add=TRUE, lty=3, lwd=3, col='blue' )
    > title(main="Beta distribution")
    > legend(par('usr')[1], par('usr')[4], xjust=0,
    +        c('(1,1)', '(2,1)', '(3,1)', '(4,1)', '(2,2)', '(3,2)', '(4,2)', '(2,3)', '(3,3)', '(4,3)' ),
    +        lwd=1,   # c(1,1,1,1, 2,2,2, 3,3,3)
    +        lty=c(1,1,1,1, 2,2,2, 3,3,3),
    +        col=c(par('fg'), 'red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green', 'blue' ))

    [Figure: plot titled "Beta distribution". Horizontal axis: x (0.0 to 1.0); vertical axis: dbeta(x, 1, 1) (0 to 4); legend: (1,1), (2,1), (3,1), (4,1), (2,2), (3,2), (4,2), (2,3), (3,3), (4,3).]

    Figure 8. The beta distribution with various shapes.

    The Weibull distribution:

    > curve(dexp(x), xlim=c(0,3), ylim=c(0,2))
    > curve(dweibull(x,1), lty=3, lwd=3, add=TRUE)
    > curve(dweibull(x,2), col='red', add=TRUE)
    > curve(dweibull(x,.8), col='blue', add=TRUE)
    > title(main="Weibull Probability Distribution Function")
    > legend(par('usr')[2], par('usr')[4], xjust=1,
    +        c('Exponential', 'Weibull, shape=1', 'Weibull, shape=2', 'Weibull, shape=.8'),
    +        lwd=c(1,3,1,1), lty=c(1,3,1,1),
    +        col=c(par("fg"), par("fg"), 'red', 'blue'))

    [Figure: plot titled "Weibull Probability Distribution Function". Horizontal axis: x (0.0 to 3.0); vertical axis: dexp(x) (0.0 to 2.0); legend: Exponential, Weibull shape=1, Weibull shape=2, Weibull shape=.8.]

    Figure 9. The Weibull distribution.

    The Cauchy distribution:

    > curve(dcauchy(x), xlim=c(-5,5), ylim=c(0,.5), lwd=3)
    > curve(dnorm(x), add=TRUE, col='red', lty=2)
    > legend(par('usr')[2], par('usr')[4], xjust=1,
    +        c('Cauchy distribution', 'Gaussian distribution'),
    +        lwd=c(3,1), lty=c(1,2),
    +        col=c(par("fg"), 'red'))

    [Figure: Cauchy and Gaussian density curves. Horizontal axis: x (-4 to 4); vertical axis: dcauchy(x) (0.0 to 0.5); legend: Cauchy distribution, Gaussian distribution.]

    Figure 9. The Cauchy distribution compared with the normal distribution.

    6.5 Random sampling

    In probability and statistics, random sampling is very important because it ensures the validity of statistical analysis and inference. In R, we can draw a random sample with the function sample.

    Example: We have a population of 40 people (coded 1, 2, 3, ..., 40). If we want to select 5 subjects from this population, who will be chosen? We can use sample() to answer the question:

    > sample(1:40, 5)
    [1] 32 26 6 18 9

    The output shows that subjects 32, 26, 6, 18 and 9 were chosen. Each time the command is run, R selects a different sample rather than repeating the one above. For example:

    > sample(1:40, 5)
    [1] 5 22 35 19 4
    > sample(1:40, 5)
    [1] 24 26 12 6 22
    > sample(1:40, 5)
    [1] 22 38 11 6 18

    and so on.
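    Because every call draws a new sample, a selection can only be reproduced later if the random number generator is seeded first. A minimal sketch with set.seed; the seed value 2006 is an arbitrary choice for illustration:

```r
set.seed(2006)                 # arbitrary seed chosen for illustration
first <- sample(1:40, 5)       # 5 subjects drawn without replacement

set.seed(2006)                 # resetting the same seed replays the same draw
second <- sample(1:40, 5)

print(first)
print(identical(first, second))
```

    Recording the seed alongside the analysis lets someone else regenerate exactly the same list of selected subjects.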


    The commands above perform random sampling without replacement, meaning that a subject who has been chosen is not returned to the population before the next one is drawn. We may instead want to sample with replacement, where each chosen subject is put back into the population before the next draw. For example, to choose 10 people from a population of 50 by random sampling with replacement, we only need to add the argument replace = TRUE:

    > sample(1:50, 10, replace=T)
    [1] 31 44 6 8 47 50 10 16 29 23

    Or toss a coin 10 times. Each toss has two possible outcomes, H and T, and the result of 10 tosses might be:

    > sample(c("H", "T"), 10, replace=T)
    [1] "H" "T" "H" "H" "H" "T" "H" "H" "T" "T"

    We can also imagine a bag holding 5 green balls (X) and 5 red balls (D). We draw one ball, record its colour and put it back in the bag, then draw another ball, record its colour and put it back again. Repeating this 20 times, the result might be:

    > sample(c("X", "D"), 20, replace=T)
    [1] "X" "D" "D" "D" "D" "D" "X" "X" "X" "X" "X" "D" "X" "X" "D" "X" "X" "X" "X"
    [20] "D"

    We can also sample with prespecified probabilities. In the following call we choose 10 subjects from the numbers 1 to 5, with unequal probabilities:

    > sample(5, 10, prob=c(0.3, 0.4, 0.1, 0.1, 0.1), replace=T)
    [1] 3 1 3 2 2 2 2 2 5 1

    Subject 1 was chosen 2 times, subject 2 was chosen 5 times, subject 3 was chosen 2 times, and so on. Because the sample is small, the frequencies do not match the supplied probabilities 0.3, 0.4, 0.1 exactly, but they are not far from expectation either.
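    The gap between observed frequencies and the supplied probabilities narrows as the number of draws grows. A sketch of that effect, assuming an arbitrary seed and a much larger sample of 100000 draws:

```r
set.seed(99)                               # arbitrary seed, only for reproducibility
p <- c(0.3, 0.4, 0.1, 0.1, 0.1)            # sampling probabilities for subjects 1 to 5

# A large sample drawn with replacement and unequal probabilities
large <- sample(5, 100000, replace = TRUE, prob = p)

# Relative frequency of each subject, in the order 1 to 5
freq_large <- as.numeric(table(factor(large, levels = 1:5))) / length(large)
print(rbind(probability = p, frequency = round(freq_large, 3)))
```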


    7 Statistical hypothesis testing and the meaning of the P value

    7.1 The P value

    In scientific research, apart from numerical data, charts and images, the number we meet most often is the P value. In the following chapters the reader will meet the P value many times, and the great majority of statistical and scientific inferences rest on it. Therefore, before discussing methods of statistical analysis with R, I think a few words about the meaning of this number are needed.

    The P value is a probability; the name is short for "probability value". We often meet statements accompanied by such a number, for example "The analysis shows that the fracture rate in the group of patients treated with Alendronate was 2%, lower than the rate in the untreated group (5%), and this difference is statistically significant (p = 0.01)", or a statement such as "After 3 months of treatment, the reduction in blood pressure in the patient group was 10% (p < 0.05)". In these contexts, the great majority of scientists take the P value to reflect the probability that Alendronate or some other treatment is effective; they read the first sentence as saying that "the probability that Alendronate is better than placebo is 0.99" (1 minus 0.01). That reading is completely wrong!

    In the "Từ điển toán kinh tế thống kê, kinh tế lượng Anh - Việt" (an English-Vietnamese dictionary of mathematical economics, statistics and econometrics; Nhà xuất bản Khoa học và Kỹ thuật, 2004), the author defines the P value as follows: "P value (or probability value). The P value is the lowest level of statistical significance at which the observed value of the test statistic is significant" (page 690). This definition is truly hard to understand! In fact it is the standard definition that Western textbooks tend to give. Open any English-language textbook and we will find much the same definition, such as "P value is the probability that the observed difference arose by chance". This definition is incomplete, if not outright wrong. It is precisely because the definition is so vague that many scientists misunderstand the meaning of the P value.

    Indeed, a great many people, not only readers but even the authors of scientific papers themselves, do not understand the meaning of the P value. According to a study published in the renowned journal Statistics in Medicine [1], 85% of scientific authors and research physicians do not understand, or misunderstand, the meaning of the P value. The reader may be surprised at this point, because it means that many researchers sometimes do not understand, or misunderstand, what they themselves have written! So a question has to be asked in earnest: what does the P value mean? To answer it, we need to look at the concept of falsification and at the process of a scientific study.

    7.2 Scientific hypotheses and falsification

    A hypothesis is regarded as "scientific" if it is "falsifiable". According to Karl Popper, the philosopher of science, the one feature that separates a genuine scientific theory from pseudoscience is that a scientific theory can always be "refuted" (falsified) by simple experiments. He called this "falsifiability" (some sources write falsibility). Falsification is a way of conducting experiments not to verify scientific theories but to criticize them, and it can be regarded as a foundation of genuine science. For example, the hypothesis "All crows are black" can be refuted if we find a single red crow.

    The process of falsification can be seen as a way of learning from mistakes! Indeed, in science we learn from mistakes. Science advances in large part by learning from mistakes, something no scientist would deny. Mistakes are the strength of science. Scientific research can be described as a process of testing hypotheses, following these steps:

    Step 1: the researcher defines a null hypothesis, that is, a hypothesis contrary to what the researcher believes to be true. For example, in a clinical study with two groups of patients, one treated with drug A and one treated with placebo, the researcher may state the null hypothesis that the efficacy of drug A is equivalent to that of placebo (meaning that drug A does not have the desired effect).

    Step 2: the researcher defines an alternative hypothesis, that is, the hypothesis the researcher believes to be true and which needs to be supported by data. In the example above, the researcher may state the alternative hypothesis that drug A is more effective than placebo.

    Step 3: after collecting all the relevant data, the researcher uses one or more statistical methods to check which of the two hypotheses above,