A novel preprocessing method using Hilbert Huang transform for MALDI-TOF and SELDI-TOF mass...

Post on 18-Dec-2015


1

A novel preprocessing method using Hilbert Huang transform for MALDI-TOF and SELDI-TOF mass spectrometry data

吳立青

2

Outline

• Introduction
• Methods
• Data source
• Methods of comparison
• Results
• Conclusion

3

Introduction

• Using protein mass spectrometry to discriminate diseased from healthy individuals is becoming more popular
• MALDI-TOF and SELDI-TOF have the following advantages :
  – Fast
  – High throughput
  – Accurate
  – Protein identification

4

SELDI-TOF MS applications in clinical oncology

5

The example of ovarian cancer data

•Large scale

•Full of noise

•Nonlinear and non-stationary

6

Motivation

The analysis of mass spectra seems simple. However, it suffers from several problems that need to be solved.

7

Common preprocessing steps

• Baseline subtraction
• Denoising : very important and also complicated!
• Normalization
• Peak detection
• Peak alignment

8

Noise component

• Chemical noise :
  – Comes from the matrix material and sample contaminations
  – Is itself a kind of biochemical material
• Electrical noise :
  – Comes from the physical characteristics of the machine
  – Carries no meaningful information

9

Chemical noise

• Chemical component (organic acid)
  – We call it the matrix
• Ionization
  – Provides H+ to the peptide or protein so that it is ionized and flies in the machine
• Protection
  – Protects the peptide or protein during the laser flash

10

Problem

• The earlier simulated model did not separate the chemical noise from the spectra.

• Chemical noise mixed with machine noise makes the problem worse.

• A novel preprocessing method should be developed

11

Kwon, D., M. Vannucci, et al. (2008). "A novel wavelet-based thresholding method for the pre-processing of mass spectrometry data that accounts for heterogeneous noise." Proteomics 8(15): 3019-29.

12

Goal

• Develop a method that satisfies the following :
  – Removes the electrical noise
  – Preserves the significant peaks despite the chemical noise

13

Methods

• Hilbert Huang transform
  – Denoising
• Modification
  – Baseline subtraction
  – Rescale
  – Peak detection

14

Flow chart

Ovarian cancer data → HHT → IMFs → Remove IMFs → Modification (Baseline subtraction → Rescale → Peak detection)

15

Hilbert Huang transform

• Method : Hilbert Huang transform (HHT)
  – Wu, Z., N. E. Huang, et al. (2007). "On the trend, detrending, and variability of nonlinear and nonstationary time series." Proc Natl Acad Sci U S A 104(38): 14889-94.
  – An adaptive data analysis method for nonlinear and non-stationary processes
  – The main feature of HHT is the empirical mode decomposition (EMD)
  – After EMD, we obtain the intrinsic mode functions (IMFs) and remove several of them as noise
• Goal : denoising
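The idea of removing IMFs as noise can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the IMFs are already available as rows of an array, and the function name `denoise_by_dropping_imfs` and the choice of which rows to drop are hypothetical.

```python
import numpy as np

def denoise_by_dropping_imfs(imfs, drop):
    """Rebuild a signal from its IMFs, leaving out those listed in `drop`.

    imfs : 2-D array, one IMF per row (row 0 = fastest oscillation)
    drop : row indices treated as noise (a hypothetical choice)
    """
    imfs = np.asarray(imfs, dtype=float)
    keep = np.ones(imfs.shape[0], dtype=bool)
    keep[list(drop)] = False
    # The denoised signal is the sum of the IMFs that were kept.
    return imfs[keep].sum(axis=0)

# Toy decomposition: a fast, noise-like oscillation plus a slow one.
t = np.linspace(0.0, 1.0, 200)
toy_imfs = np.vstack([0.1 * np.sin(2 * np.pi * 60 * t),
                      np.sin(2 * np.pi * 3 * t)])
denoised = denoise_by_dropping_imfs(toy_imfs, drop=[0])
```

Which IMFs count as noise is the real decision; the slides do not state the rule used, so `drop=[0]` here is only a placeholder.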

16

Process of EMD (1/5)

• Find the envelope of the local maxima

17

Process of EMD (2/5)

• Find the envelope of the local minima

18

Process of EMD (3/5)

• Compute the mean envelope from the maximum envelope and minimum envelope

19

Process of EMD (4/5)

• We get IMF1 (i.e. h1) by subtracting the mean envelope m1 from the original signal X(t)
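Steps 3 and 4 can be written compactly. The envelope symbols e_max(t) and e_min(t) are notation assumed here for the maximum and minimum envelopes; m1, h1 and X(t) follow the slides.

```latex
m_1(t) = \frac{e_{\max}(t) + e_{\min}(t)}{2}, \qquad
h_1(t) = X(t) - m_1(t)
```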

20

Process of EMD (5/5)

• We take h1 as the new X(t) and repeat the same process, and so on.

• We terminate the process when the number of extrema and the number of zero-crossings of IMFn differ by at most one
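The five steps can be sketched in code as below. This is a simplified illustration under stated assumptions, not the implementation used in the slides: the envelopes are cubic splines, both envelopes are pinned to the end points of the signal, and the names `mean_envelope`, `sift_one_imf` and the `max_iter` cap are hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def mean_envelope(x):
    """Mean of the spline envelopes through the local maxima and minima."""
    idx = np.arange(len(x))
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema to build envelopes
    # Pin both envelopes to the end points (an assumed boundary rule).
    up = np.concatenate(([0], maxima, [len(x) - 1]))
    lo = np.concatenate(([0], minima, [len(x) - 1]))
    upper = CubicSpline(up, x[up])(idx)
    lower = CubicSpline(lo, x[lo])(idx)
    return (upper + lower) / 2.0

def count_extrema_and_crossings(h):
    """Number of local extrema and number of sign changes of h."""
    n_ext = len(argrelextrema(h, np.greater)[0]) + len(argrelextrema(h, np.less)[0])
    n_zero = int(np.sum(h[:-1] * h[1:] < 0))
    return n_ext, n_zero

def sift_one_imf(x, max_iter=50):
    """Extract one IMF candidate from x by repeated sifting."""
    h = np.asarray(x, dtype=float).copy()
    for _ in range(max_iter):
        m = mean_envelope(h)
        if m is None:
            break
        h = h - m  # subtract the mean envelope (step 4)
        n_ext, n_zero = count_extrema_and_crossings(h)
        if abs(n_ext - n_zero) <= 1:  # stopping rule (step 5)
            break
    return h

# Toy signal: an oscillation riding on a slow trend.
t = np.linspace(0.0, 1.0, 500)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * t
imf = sift_one_imf(signal)
```

A full decomposition would subtract each extracted IMF from the data and sift the remainder again; that outer loop is omitted here to keep the sketch short.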

21

IMFs : 1~16

[Figure: HHT output; x-axis 0 to 4.5 × 10^4, y-axis 0 to 9 × 10^9]

22

Modification

• Baseline subtraction
  – Remove the systematic artifacts
• Rescale
  – Shift the scale to positive
• Peak detection
  – Key feature of the preprocessing method
  – We compare several popular methods
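The first two modification steps can be illustrated with a short sketch. The slides do not say which baseline estimator is used, so the rolling-minimum estimator, the `window` size and both function names below are assumptions made only for illustration.

```python
import numpy as np
from scipy.ndimage import minimum_filter1d, uniform_filter1d

def subtract_baseline(intensity, window=101):
    """Estimate a slowly varying baseline and subtract it.

    The smoothed rolling minimum and `window` are assumptions of this sketch.
    """
    y = np.asarray(intensity, dtype=float)
    baseline = uniform_filter1d(minimum_filter1d(y, size=window), size=window)
    return y - baseline

def rescale_to_nonnegative(intensity):
    """Shift the spectrum so that its smallest value is zero."""
    y = np.asarray(intensity, dtype=float)
    return y - y.min()

# Toy spectrum: a decaying baseline with one peak at index 500.
x = np.arange(1000)
raw = 5.0 * np.exp(-x / 300.0) + 10.0 * np.exp(-((x - 500) ** 2) / (2 * 5.0 ** 2))
corrected = rescale_to_nonnegative(subtract_baseline(raw))
```

After both steps the toy spectrum has no negative intensities and the peak stays at its original position.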

23

Baseline subtraction

24

Baseline subtraction

25

Peak detection

• We use three popular algorithms for peak detection
  – MassSpecWavelet (Du, Kibbe et al. 2006)
  – SpecAlign (Wong, Cagney et al. 2005)
  – PROcess (Li 2005)

26

Data source

• Source : National Cancer Institute
• Type : 50 ovarian cancer data
• Meuleman, W., J. Y. Engwegen, et al. (2008). "Comparison of normalisation methods for surface-enhanced laser desorption and ionisation (SELDI) time-of-flight (TOF) mass spectrometry data." BMC Bioinformatics 9: 88
• Kwon, D., M. Vannucci, et al. (2008). "A novel wavelet-based thresholding method for the pre-processing of mass spectrometry data that accounts for heterogeneous noise." Proteomics 8(15): 3019-29

27

Methods of comparison

• Criteria
  – Number of peaks detected
  – Actual location of the peaks, checked visually
• Interior comparison
  – HHT and modification+SpecAlign
  – HHT and modification+PROcess
  – HHT and modification+MassSpecWavelet

28

Methods of comparison

• Exterior comparison
  – SpecAlign
    • Abbreviation : SA
  – PROcess
    • Interpolation : PRO1
    • Regression : PRO2
  – MassSpecWavelet
    • Abbreviation : MSW
  – PRO2+MSW
    • As suggested in Cruz-Marcelo, Guerra et al. 2008

29

Raw data

30

Results after HHT and modification

31

Results of interior comparison

Number of peaks detected per m/z range (Da)

Algorithm                          2000~4000  4000~6000  6000~8000  8000~10000  10000~12000  12000~14000  14000~15000  Total
HHT modification+MassSpecWavelet          35         32         27          21           44           38           21    218
HHT modification+SpecAlign                18         17          6          11            8            9            3     80
HHT modification+PROcess                  21         18         19          13           14           13           10    108
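Per-range counts like those in the table can be tallied from a list of detected peak positions. The sketch below assumes the peaks are given as m/z values in Da; the function name `peaks_per_range` and the example peak positions are hypothetical and are not taken from the ovarian cancer data.

```python
import numpy as np

# Bin edges in Da, matching the ranges 2000~4000, ..., 14000~15000.
EDGES = np.array([2000, 4000, 6000, 8000, 10000, 12000, 14000, 15000])

def peaks_per_range(peak_mz, edges=EDGES):
    """Count detected peaks in each m/z range and return (counts, total)."""
    counts, _ = np.histogram(np.asarray(peak_mz, dtype=float), bins=edges)
    return counts, int(counts.sum())

# Hypothetical peak positions (m/z in Da), for illustration only.
example_peaks = [2500.0, 3900.0, 4100.0, 7600.0, 14500.0]
counts, total = peaks_per_range(example_peaks)
```

Peaks outside 2000 to 15000 Da are ignored by this binning, which matches the ranges reported in the table.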

32

[Figure: Modified HHT+MSW, peaks detected : 218, m/z range : whole region]

33

[Figure: Modified HHT+SpecAlign, peaks detected : 80, m/z range : whole region]

34

[Figure: Modified HHT+PROcess, peaks detected : 108, m/z range : whole region; annotation : significant peak lost]

35

Results of exterior comparison

Number of peaks detected per m/z range (Da)

Algorithm             2000~4000  4000~6000  6000~8000  8000~10000  10000~12000  12000~14000  14000~15000  Total
mHHT+MassSpecWavelet         35         32         27          21           44           38           21    218
mHHT+SpecAlign               18         17          6          11            8            9            3     80
mHHT+PROcess                 21         18         19          13           14           13           10    108
SpecAlign                    67         43         21          18           16           14            7    186
PRO1                         46         24         19          18           16           16            6    145
PRO2                         40         23          8          12           13           14            4    114
MassSpecWavelet              51         39         25          24           22           20            7    188
PRO2+MSW                     54         37         33          25           23           18            8    198

36

[Figure: PRO1, peaks detected : 145, m/z range : whole region]

Meuleman, W., J. Y. Engwegen, et al. (2008). "Comparison of normalisation methods for surface-enhanced laser desorption and ionisation (SELDI) time-of-flight (TOF) mass spectrometry data." BMC Bioinformatics 9: 88

37

[Figure: PRO2, peaks detected : 114, m/z range : whole region]

Meuleman, W., J. Y. Engwegen, et al. (2008). "Comparison of normalisation methods for surface-enhanced laser desorption and ionisation (SELDI) time-of-flight (TOF) mass spectrometry data." BMC Bioinformatics 9: 88

38

[Figure: MSW, peaks detected : 188, m/z range : whole region]

Meuleman, W., J. Y. Engwegen, et al. (2008). "Comparison of normalisation methods for surface-enhanced laser desorption and ionisation (SELDI) time-of-flight (TOF) mass spectrometry data." BMC Bioinformatics 9: 88

39

[Figure: MSW, peaks detected : 25, m/z range : 6000~8000]

40

[Figure: PRO2+MSW, peaks detected : 198, m/z range : whole region]

Meuleman, W., J. Y. Engwegen, et al. (2008). "Comparison of normalisation methods for surface-enhanced laser desorption and ionisation (SELDI) time-of-flight (TOF) mass spectrometry data." BMC Bioinformatics 9: 88

41

[Figure: PRO2+MSW, peaks detected : 33, m/z range : 6000~8000]


42

Results

• Interior comparison:
  – HHT and modification+MSW covers the most peaks
  – HHT and modification+SpecAlign picks the most important peaks
• Exterior comparison:
  – PROcess misses the significant peaks
  – MassSpecWavelet and PRO2+MSW detect many redundant peaks

43

Results of validation

• Validation
  – Data source : Cathay General Hospital
  – Experiments :
    • Divided into three experiments
      – Water only
      – VrD1

44

Water

Sample : water
Organic acid : CHCA (<1000 Da)

45

VrD1

Sample : VrD1
Type : protein
Organic acid : CHCA (<1000 Da)
Molecular weight : 5119 Da

46

Results of validation

The Water and VrD1 columns give the number of peaks detected.

Algorithm \ sample                   Water   VrD1 (Mw: 5119)   Peak at m/z = 5119 of VrD1   Double charge of VrD1, (5119+2)/2   Peaks (>1000 Da) detected in VrD1
MassSpecWavelet                      400     369               Detected                     Detected                            340
SpecAlign                            391     477               Detected                     Detected                            355
HHT modification+SpecAlign           17      22                Detected                     Detected                            5
Peaks removed by HHT modification    95.7%   94.8%                                                                              98.6%
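The removal percentages can be read as the share of SpecAlign peaks that disappear once HHT and modification are applied first. That reading is an assumption; it reproduces the Water and >1000 Da figures from the counts in the table, and the function name `removal_rate` is hypothetical.

```python
def removal_rate(reference_count, remaining_count):
    """Percentage of reference peaks that are no longer detected."""
    return 100.0 * (1.0 - remaining_count / reference_count)

# SpecAlign alone versus HHT modification+SpecAlign, Water sample.
water = removal_rate(391, 17)
# Same pair, VrD1 peaks above 1000 Da.
above_1000 = removal_rate(355, 5)
```

Rounded to one decimal, these give 95.7 and 98.6 respectively.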

47

The peaks of Water detected by MassSpecWavelet

48

The peaks of VrD1 detected by MassSpecWavelet

Molecular weight : 5119 Da

49

The peaks of Water detected by SpecAlign

50

The peaks of VrD1 detected by SpecAlign

Molecular weight : 5119 Da

51

The peaks of water detected by HHT modification + SpecAlign

52

The peaks of VrD1 detected by HHT modification + SpecAlign

Molecular weight : 5119 Da

53

The peaks of VrD1 detected by HHT modification + SpecAlign

[Figure panels : 0-5200 Da and whole range]

Molecular weight : 5119 Da

Double charge : (5119+2)/2
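The position of the doubly charged peak follows from dividing the protonated mass by the charge. The sketch below uses a proton mass of 1 Da, as the slide does; the function name `mz` is hypothetical, and a more precise proton mass (about 1.00728 Da) would shift the result only slightly.

```python
def mz(molecular_weight, charge, proton_mass=1.0):
    """m/z of a species that gained `charge` protons.

    proton_mass=1.0 follows the rounding used on the slide.
    """
    return (molecular_weight + charge * proton_mass) / charge

double_charge_mz = mz(5119, 2)  # (5119 + 2) / 2
```

For VrD1 this places the doubly charged peak at m/z 2560.5.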

54

Results of validation

• MassSpecWavelet and SpecAlign do not remove the noise

• HHT and modification+SpecAlign detects the fewest peaks but keeps the most significant ones

• HHT and modification+SpecAlign removes 98.6% of the peaks (>1000 Da), which are redundancies and noise

55

Conclusion

• HHT performs well at denoising
• The comparison shows that HHT and modification make the raw data simpler
• At the same time, HHT and modification preserve the significant information
• After preprocessing with HHT and modification, we suggest detecting the peaks with SpecAlign

56

Acknowledgement

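• Thanks to 陳欣昊, a student at our institute, for his dedication
• Thanks to Academician 黃鄂 and to 林澂, a PhD student at our institute
• Thanks to Dr. 鄭宇哲 of Cathay General Hospital, Sijhih (汐止國泰醫院), for the experimental validation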

57

Thanks for your attention!