Computer Skills for Biological Research, Lecture 3 (생물학 연구를 위한 컴퓨터 사용기술 제 3강)

Page 1

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

3rd Lecture, 2015.9.15

Advanced Unix commands & Scripting

Page 2

Syllabus

Week : Topic

Week 1 : Introduction: Why do we need to learn this stuff?
Week 2 : Basics of Unix and running BLAST on your PC
Week 3 : Unix Command Prompt II and shell scripts
Week 4 : Basics of programming
Week 5 : Python Scripting I
Week 6 : Python Scripting II
Week 7 : Python Scripting III
Week 8 : Next Generation Sequencing
Weeks 9-10 : Next Generation Sequencing Analysis
Week 11 : R and statistical analysis
Week 12 : Bioconductor I
Week 13 : Bioconductor II
Week 14 : Network analysis

Page 3

Basic UNIX cheatsheet

cp : copy files
    cp file1 file2
    cp *.fasta ./directory
    cp -r directory1 directory2

cd : change directory
    cd directory_you_want_go
    cd ..
    cd ~
    cd /from/start/to/end/

mv : move files

rm : remove files and directories
    rm file1 file2
    rm *
    rm -d directory
    rm -rf directory

ls : list files
    ls
    ls -l
    ls somefile*

mkdir : make directory
    mkdir directory

Page 4

Basic UNIX cheatsheet II

nano : text editor
    nano filename

cat : view a file or concatenate multiple files
    cat filename
    cat filename1 filename2
    cat *

> : redirection (save the output of a program to a file)
    ls > file
    cat filename > filename.txt

| : pipe (connect the output of one program to the input of another)
    ls | less
    cat filename | grep "search"
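As a small illustration of redirection and pipes working together, here is a sketch that can be pasted into a terminal. The file names (names.txt, human.txt) and the three protein names are made up for the example, and everything happens in a temporary directory.

```shell
# Work in a throwaway directory so nothing in your own folders is touched.
workdir=$(mktemp -d)
cd "$workdir"

# '>' saves the output of a command to a file (an existing file is overwritten).
printf 'TMOD1_HUMAN\nTMOD1_MOUSE\nTMOD3_HUMAN\n' > names.txt

# '|' sends the output of cat to grep, which keeps only lines containing HUMAN;
# the final '>' saves what grep prints.
cat names.txt | grep "HUMAN" > human.txt

# Count the lines that were kept (tr removes the padding some wc versions add).
human_count=$(cat human.txt | wc -l | tr -d ' ')
echo "$human_count"
```

Pipes and redirection can be chained freely: each `|` hands the text on to the next program, and only the last step needs a `>` if you want to keep the result.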

Page 5

Other Tutorials

http://www.purdue.edu/discoverypark/cyber/bioinformatics/assets/pdfs/Unix_for_Biologists_Fall2013.pdf

http://training.bioinformatics.ucdavis.edu/docs/2013/12/AWS/linux-bootcamp.html

Page 6

This week…

We will learn more advanced UNIX commands:

- Learn how to extract the desired data from a text file (parsing)

- Learn shell scripts (combine several commands into a 'program' you can run) and automate your work

- Perform multiple sequence alignments using MUSCLE

Page 7

Text Parsing

Extracting the desired information from a text file (usually the output file of bioinformatics software) is the most common task in biological computing.

For example: BLAST output

curl -O http://www.uniprot.org/uniprot/P28289.fasta
cat P28289.fasta
blastp -query P28289.fasta -db swissprot -outfmt 6 -evalue 1e-5 > list.txt
cat list.txt

sp|P28289|TMOD1_HUMAN gi|135922|sp|P28289.1|TMOD1_HUMAN 100.00 359 0 0 1 359 1 359 0.0 728
sp|P28289|TMOD1_HUMAN gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN 97.49 359 9 0 1 359 1 359 0.0 714
sp|P28289|TMOD1_HUMAN gi|342187054|sp|P49813.2|TMOD1_MOUSE 96.94 359 11 0 1 359 1 359 0.0 709
sp|P28289|TMOD1_HUMAN gi|23396880|sp|P70567.1|TMOD1_RAT 96.10 359 14 0 1 359 1 359 0.0 703
sp|P28289|TMOD1_HUMAN gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN 62.32 345 129 1 1 344 3 347 1e-152 441
sp|P28289|TMOD1_HUMAN gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 60.87 345 134 1 1 344 3 347 3e-150 434
sp|P28289|TMOD1_HUMAN gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN 58.77 342 140 1 2 343 3 343 2e-148 429
sp|P28289|TMOD1_HUMAN gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.19 342 142 1 2 343 3 343 2e-148 429
sp|P28289|TMOD1_HUMAN gi|122145549|sp|Q0VC48.1|TMOD4_BOVIN 58.77 342 140 1 2 343 3 343 3e-148 429
sp|P28289|TMOD1_HUMAN gi|23396879|sp|P70566.1|TMOD2_RAT 60.58 345 134 2 1 344 3 346 6e-144 418
sp|P28289|TMOD1_HUMAN gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN 59.77 348 132 3 1 344 3 346 3e-143 416

Page 8

gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN
    23396886 : GenBank id
    Q9NZR1.1 : Swiss-Prot id

If we can extract these ids from the BLAST output, we can download the nucleotide sequence (GenBank) or the protein sequence (Swiss-Prot).

How can we do that?


Extract a portion of each line of the BLAST output

Page 9

Tab (space) delimited text

Many bioinformatics programs generate tab-delimited text as output: on every line, each field is separated from the next by a <Tab>.

How can we separate the tab-separated fields?

cat <textfile> | awk '{print $2}'

cat <textfile> : output the text file
| : send it to the next program
awk '{print $2}' : print the second column only

Page 10

awk names the fields of each line $1, $2, $3, $4, … from left to right. In the BLAST output, $1 is the query id, $2 is the id of the hit, $3 is the percent identity and $4 is the alignment length.

Using awk, we can separate the fields very easily…

cat <textfile> | awk '{print $2}'

gi|135922|sp|P28289.1|TMOD1_HUMAN
gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN
gi|342187054|sp|P49813.2|TMOD1_MOUSE
gi|23396880|sp|P70567.1|TMOD1_RAT
gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN
gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE
gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN
gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE
gi|122145549|sp|Q0VC48.1|TMOD4_BOVIN
gi|23396879|sp|P70566.1|TMOD2_RAT
gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN
gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE
gi|74955935|sp|O01479.2|TMOD_CAEEL
gi|160395556|sp|Q6P5Q4.2|LMOD2_HUMAN

cat <textfile> | awk '{print $3}'

100.00
97.49
96.94
96.10
62.32
60.87
58.77
58.19
58.77
60.58
59.77
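To try field extraction without running BLAST first, the sketch below writes two shortened lines of BLAST-style output (only the first four fields, separated by tabs) to a temporary file and pulls out $2 and $3. The temporary file stands in for your own result file.

```shell
# Two shortened BLAST-style lines: query id, hit id, percent identity, alignment length.
sample=$(mktemp)
printf 'sp|P28289|TMOD1_HUMAN\tgi|135922|sp|P28289.1|TMOD1_HUMAN\t100.00\t359\n' > "$sample"
printf 'sp|P28289|TMOD1_HUMAN\tgi|143587951|sp|A0JNC0.1|TMOD1_BOVIN\t97.49\t359\n' >> "$sample"

# $2 is the second field (the hit id), $3 the third (percent identity).
hits=$(cat "$sample" | awk '{print $2}')
identities=$(cat "$sample" | awk '{print $3}')

echo "$hits"
echo "$identities"
```

awk prints each field exactly as it appears in the file, so 100.00 stays 100.00. awk can also read the file directly (`awk '{print $2}' file`), which gives the same result as piping it in with cat.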

Page 11

Print specific fields only…

cat <textfile> | awk '{print $2, $3}'

gi|342187054|sp|P49813.2|TMOD1_MOUSE 100.00
gi|23396880|sp|P70567.1|TMOD1_RAT 98.61
gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN 97.77
gi|135922|sp|P28289.1|TMOD1_HUMAN 96.94
gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN 63.77
gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 62.32
gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.48
gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN 59.06
gi|122145549|sp|Q0VC48.1|TMOD4_BOVIN 59.06
gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE 60.46
gi|23396879|sp|P70566.1|TMOD2_RAT 60.46
gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN 60.06
gi|74955935|sp|O01479.2|TMOD_CAEEL 38.29
gi|160395556|sp|Q6P5Q4.2|LMOD2_HUMAN 51.79
gi|160395552|sp|A1A5Q0.1|LMOD2_RAT 52.98
gi|160395552|sp|A1A5Q0.1|LMOD2_RAT 50.00
gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 51.79
gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 48.81
gi|803374865|sp|E7F7X0.1|LMOD3_DANRE 47.59
gi|803374865|sp|E7F7X0.1|LMOD3_DANRE 43.84
gi|803374865|sp|E7F7X0.1|LMOD3_DANRE 52.73
gi|325511399|sp|P29536.3|LMOD1_HUMAN 46.67
gi|325511399|sp|P29536.3|LMOD1_HUMAN 45.00
gi|81875385|sp|Q8BVA4.1|LMOD1_MOUSE 46.37
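By default awk splits on any run of spaces or tabs and joins printed fields with a single space. If you want to keep the output tab-delimited, one option (not used in the lecture, so treat it as an optional variation) is to set the input separator with -F and the output separator OFS explicitly. The sample lines are shortened to four fields.

```shell
sample=$(mktemp)
printf 'sp|P49813|TMOD1_MOUSE\tgi|342187054|sp|P49813.2|TMOD1_MOUSE\t100.00\t359\n' > "$sample"
printf 'sp|P49813|TMOD1_MOUSE\tgi|23396880|sp|P70567.1|TMOD1_RAT\t98.61\t359\n' >> "$sample"

# -F '\t' splits the input on tabs only; OFS='\t' puts a tab between the printed fields.
pairs=$(awk -F '\t' -v OFS='\t' '{print $2, $3}' "$sample")
echo "$pairs"
```

Keeping the tab as separator means the two-column output can itself be fed into another awk command or opened in a spreadsheet without ambiguity.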

Page 12

Find the rows that contain 'MOUSE'

cat result | awk '/MOUSE/'

sp|P49813|TMOD1_MOUSE gi|342187054|sp|P49813.2|TMOD1_MOUSE 100.00 359 0 0 1 359 1 359 0.0 729
sp|P49813|TMOD1_MOUSE gi|23396880|sp|P70567.1|TMOD1_RAT 98.61 359 5 0 1 359 1 359 0.0 719
sp|P49813|TMOD1_MOUSE gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN 97.77 359 8 0 1 359 1 359 0.0 716
sp|P49813|TMOD1_MOUSE gi|135922|sp|P28289.1|TMOD1_HUMAN 96.94 359 11 0 1 359 1 359 0.0 709
sp|P49813|TMOD1_MOUSE gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN 63.77 345 124 1 1 344 3 347 2e-156 450
sp|P49813|TMOD1_MOUSE gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 62.32 345 129 1 1 344 3 347 1e-153 443
sp|P49813|TMOD1_MOUSE gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.48 342 141 1 2 343 3 343 4e-148 429

Every row matches here, because the query id in the first field already contains 'MOUSE'.

Find the rows that contain 'MOUSE' in the second field ($2)

cat result | awk '$2 ~/MOUSE/'

sp|P49813|TMOD1_MOUSE gi|342187054|sp|P49813.2|TMOD1_MOUSE 100.00 359 0 0 1 359 1 359 0.0 729
sp|P49813|TMOD1_MOUSE gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 62.32 345 129 1 1 344 3 347 1e-153 443
sp|P49813|TMOD1_MOUSE gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.48 342 141 1 2 343 3 343 4e-148 429
sp|P49813|TMOD1_MOUSE gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE 60.46 349 128 3 1 344 3 346 9e-145 421
sp|P49813|TMOD1_MOUSE gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 51.79 168 81 0 179 346 202 369 3e-4167
sp|P49813|TMOD1_MOUSE gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 48.81 84 42 1 1 84 4 86 7e-167.0
sp|P49813|TMOD1_MOUSE gi|81875385|sp|Q8BVA4.1|LMOD1_MOUSE 46.37 179 96 0 166 344 296 474 4e-4159
sp|P49813|TMOD1_MOUSE gi|81875385|sp|Q8BVA4.1|LMOD1_MOUSE 46.25 80 40 2 3 82 7 83 4e-055.1

Find the rows that contain 'MOUSE' in the second field ($2) and print only the second field

cat result | awk '$2 ~/MOUSE/ {print $2}'
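The difference between matching the whole line and matching a single field is easy to miss, so here is a small sketch on three shortened lines (query id, hit id, percent identity). The temporary file is a stand-in for the BLAST result file.

```shell
sample=$(mktemp)
printf 'sp|P49813|TMOD1_MOUSE\tgi|342187054|sp|P49813.2|TMOD1_MOUSE\t100.00\n' > "$sample"
printf 'sp|P49813|TMOD1_MOUSE\tgi|23396880|sp|P70567.1|TMOD1_RAT\t98.61\n' >> "$sample"
printf 'sp|P49813|TMOD1_MOUSE\tgi|135922|sp|P28289.1|TMOD1_HUMAN\t96.94\n' >> "$sample"

# /MOUSE/ is tested against the whole line: all three lines match,
# because the first field always contains MOUSE.
whole_line=$(awk '/MOUSE/' "$sample" | wc -l | tr -d ' ')

# $2 ~ /MOUSE/ is tested against the second field only: one line matches,
# and the action {print $2} prints just that field.
field_only=$(awk '$2 ~ /MOUSE/ {print $2}' "$sample")

echo "$whole_line"
echo "$field_only"
```

An awk program has the general shape `pattern {action}`: the pattern chooses the lines, the action says what to print. If the action is left out, the whole matching line is printed.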

Page 13

Find the rows that contain 'MOUSE' or 'HUMAN' in the second field ($2)

cat result | awk '$2 ~/MOUSE|HUMAN/'

sp|P28289|TMOD1_HUMAN gi|135922|sp|P28289.1|TMOD1_HUMAN 100.00 359 0 0 1 359 1 359 0.0 728
sp|P28289|TMOD1_HUMAN gi|342187054|sp|P49813.2|TMOD1_MOUSE 96.94 359 11 0 1 359 1 359 0.0 709
sp|P28289|TMOD1_HUMAN gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN 62.32 345 129 1 1 344 3 347 1e-152 441
sp|P28289|TMOD1_HUMAN gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 60.87 345 134 1 1 344 3 347 3e-150 434
sp|P28289|TMOD1_HUMAN gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN 58.77 342 140 1 2 343 3 343 2e-148 429
sp|P28289|TMOD1_HUMAN gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.19 342 142 1 2 343 3 343 2e-148 429
sp|P28289|TMOD1_HUMAN gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN 59.77 348 132 3 1 344 3 346 3e-143 416
sp|P28289|TMOD1_HUMAN gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE 60.29 345 135 2 1 344 3 346 5e-143 416
sp|P28289|TMOD1_HUMAN gi|160395556|sp|Q6P5Q4.2|LMOD2_HUMAN 49.40 168 85 0 179 346 195 362 4e-46 170
sp|P28289|TMOD1_HUMAN gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 49.40 168 85 0 179 346 202 369 2e-42 159
sp|P28289|TMOD1_HUMAN gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 48.81 84 42 1 1 84 4 86 8e-11 66.6
sp|P28289|TMOD1_HUMAN gi|118572771|sp|Q0VAK6.1|LMOD3_HUMAN 45.28 159 87 0 179 337 237 395 1e-41 157
sp|P28289|TMOD1_HUMAN gi|118572771|sp|Q0VAK6.1|LMOD3_HUMAN 43.04 79 41 3 12 88 17 93 3e-08 58.5

Print the first 10 lines (NR is the number of the current line)

cat result | awk 'NR<=10'

Print lines 10 through 20

cat result | awk 'NR==10, NR==20'

For more examples, see http://www.pement.org/awk/awk1line.txt
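Line-number patterns are easiest to check on a file whose lines say which line they are. The sketch below builds such a file (25 lines, line1 to line25, purely as a stand-in for real data) and counts what each pattern selects.

```shell
# A stand-in file with 25 numbered lines: line1 ... line25.
numbered=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 25; i++) print "line" i }' > "$numbered"

# NR <= 10 is true for lines 1 to 10 (NR < 10 would stop at line 9).
first_ten=$(awk 'NR <= 10' "$numbered" | wc -l | tr -d ' ')

# A range pattern "start, end" selects from the line where the first
# condition is true through the line where the second one is true.
middle=$(awk 'NR == 10, NR == 20' "$numbered")
middle_count=$(printf '%s\n' "$middle" | wc -l | tr -d ' ')

echo "$first_ten"
echo "$middle_count"
```

Lines 10 through 20 inclusive are eleven lines, not ten, which is a common off-by-one surprise when slicing result files.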

Page 14

gi|135922|sp|P28289.1|TMOD1_HUMAN
gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN
gi|342187054|sp|P49813.2|TMOD1_MOUSE
gi|23396880|sp|P70567.1|TMOD1_RAT
gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN
gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE
gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN
gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE
gi|122145549|sp|Q0VC48.1|TMOD4_BOVIN
gi|23396879|sp|P70566.1|TMOD2_RAT
gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN
gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE
gi|74955935|sp|O01479.2|TMOD_CAEEL
gi|160395556|sp|Q6P5Q4.2|LMOD2_HUMAN

Now we have this field ($2). How do we extract just a portion of it?

The data inside the field are also separated, this time by "|". awk can split text at a specific character:

cat result | awk '{split($2,a,"|");print a[4];}'

split($2,a,"|") splits the text inside $2 at each "|" and stores the pieces separately in the array a:

$2   : gi|135922|sp|P28289.1|TMOD1_HUMAN
a[1] : gi
a[2] : 135922
a[3] : sp
a[4] : P28289.1
a[5] : TMOD1_HUMAN

Then print a[4]!

Page 15

cat result | awk '{split($2,a,"|");print a[4];}'

P49813.2
P70567.1
A0JNC0.1
P28289.1
Q9NYL9.1
Q9JHJ0.1
Q9JLH8.1
Q9NZQ9.1
Q0VC48.1
Q9JKK7.2
P70566.1
Q9NZR1.1
O01479.2
Q6P5Q4.2
A1A5Q0.1

What if we want only the part before the "." (for example P49813 from P49813.2)? How can we do that?

…use the split function again, but in a different way:

split(a[4], b, ".")

Then print out b[1]:

split(a[4], b, ".");print b[1];

Page 16

cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}'

P49813
P70567
A0JNC0
P28289
Q9NYL9
Q9JHJ0
Q9JLH8
Q9NZQ9
Q0VC48
Q9JKK7
P70566
Q9NZR1
O01479
Q6P5Q4
A1A5Q0
A1A5Q0
Q3UHZ5
Q3UHZ5
E7F7X0
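The two-step split can be tried on a single id without any BLAST output. In this sketch the id is passed to awk with echo, so inside awk it is the first (and only) field, $1, rather than $2 as in the real result file.

```shell
hit='gi|135922|sp|P28289.1|TMOD1_HUMAN'

# split(text, array, separator) fills array[1], array[2], ... and
# returns the number of pieces it produced.
pieces=$(echo "$hit" | awk '{ n = split($1, a, "|"); print n }')

# Fourth piece: the Swiss-Prot id including its version number.
with_version=$(echo "$hit" | awk '{ split($1, a, "|"); print a[4] }')

# Split that piece again at "." and keep the part before the dot.
accession=$(echo "$hit" | awk '{ split($1, a, "|"); split(a[4], b, "."); print b[1] }')

echo "$pieces $with_version $accession"
```

Printing the number of pieces first is a quick way to check that the separator is doing what you expect before relying on a particular index such as a[4].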

Page 17

Save these results to a new file (uniprot.txt)

cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}' > uniprot.txt

The command works on the BLAST output one step at a time:

cat result
    prints the full BLAST output
cat result | awk '{print $2}'
    keeps only the second field
cat result | awk '{split($2,a,"|");print a[4]}'
    keeps only the Swiss-Prot id (with its version number) from the second field
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}'
    removes the version number

cat uniprot.txt

P28289
A0JNC0
P49813
P70567
Q9NYL9
Q9JHJ0
Q9NZQ9
Q9JLH8
Q0VC48
P70566
Q9NZR1
Q9JKK7
O01479
Q6P5Q4
A1A5Q0
A1A5Q0
Q3UHZ5

Page 18

Shell Scripts

So far, we have learned many (complicated) commands.

Memorizing all these commands and typing them again and again is inconvenient.

You can save all these commands in a text file and execute them at once.

Open a text editor (like nano)

Page 19

Type the commands you entered previously

Save the file under the name you want

Page 20

#! ?

The first line of a script starts with the special characters #!, which specify the program that will interpret the commands in the script.

We are using the 'bash' shell, so the first line should be #!/bin/bash.

If you use a different scripting language (like Python), this line changes accordingly.

Page 21

Permission change

In order to execute a script, you need to give the file execute permission:

chmod +x blast

ls -l blast
-rwxr-xr-x 1 suknamgoong staff 146 Sep 14 10:45 blast

x stands for executable.

The commands saved in the script 'blast':

blastp -query $1 -db swissprot -evalue 1e-5 -outfmt 6 > result
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}' > uniprot.txt

Run it as:

./blast <filename>

<filename> is substituted for $1. For example:

./blast P28289.fasta

Here $1 becomes P28289.fasta.
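To see the three ingredients (the #! line, execute permission and $1) without needing BLAST installed, here is a minimal sketch. The script name showfile is hypothetical and it only echoes its argument; it is written into a temporary directory.

```shell
workdir=$(mktemp -d)

# Write a two-line script. The first line names the interpreter;
# the quoted 'EOF' keeps $1 from being expanded while the file is written.
cat > "$workdir/showfile" <<'EOF'
#!/bin/bash
echo "query file: $1"
EOF

# Give the file execute permission, as with 'chmod +x blast'.
chmod +x "$workdir/showfile"

# The first word after the script name is available inside the script as $1.
message=$("$workdir/showfile" P28289.fasta)
echo "$message"
```

A second argument would be available as $2, a third as $3, and so on, which is how a script can later be extended to take, for example, an e-value cutoff as well as a file name.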

Page 22

Now we have a script that can:

1. Execute blastp using the selected FASTA file.
2. Extract the UniProt ids from the BLAST result.

New function:

3. Download sequences based on the UniProt ids stored in the text file.

cat uniprot.txt
P49813
P70567
A0JNC0
P28289
Q9NYL9
Q9JHJ0
Q9JLH8
Q9NZQ9
Q0VC48
Q9JKK7
P70566
Q9NZR1
O01479
Q6P5Q4
A1A5Q0
A1A5Q0
Q3UHZ5
Q3UHZ5
E7F7X0
….

Each line contains one UniProt id.

1. Read the file line by line and get each UniProt id.
2. Based on the content of the line, download the entry for that UniProt id.

Page 23

Open nano

Save it as ‘list’

Change permission to executable

chmod +x list

Execute

./list

Page 24
Page 25

LOOP

One of the most common tasks in computing is repeating the same operation.

#!/bin/bash
while read p; do
    echo $p
done < uniprot.txt

while read p … done < uniprot.txt : read uniprot.txt one line at a time and store the current line in $p
echo $p : print out $p (the current line)
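The loop can be tried on a three-line stand-in for uniprot.txt. This sketch adds two things that are not on the slide: `read -r`, which stops the shell from treating backslashes specially, and a counter to show that the loop body runs once per line.

```shell
# A short stand-in for uniprot.txt with three ids.
ids=$(mktemp)
printf 'P49813\nP70567\nA0JNC0\n' > "$ids"

count=0
while read -r p; do
    # $p holds the current line; the body runs once for every line in the file.
    echo "current id: $p"
    count=$((count + 1))
done < "$ids"

echo "$count ids read"
```

Because the file is fed in with `<` after `done`, the loop runs in the current shell and the counter keeps its value after the loop ends.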

Page 26
Page 27

LOOP

Instead of displaying each line, we want to download the corresponding file from UniProt…

#!/bin/bash
while read p; do
    echo $p
    curl -O "http://www.uniprot.org/uniprot/"$p".fasta";
done < uniprot.txt

This downloads the UniProt file for each UniProt id in uniprot.txt:

uniprot.txt
P49813
P70567
A0JNC0
P28289
Q9NYL9
Q9JHJ0
Q9JLH8
Q9NZQ9
Q0VC48
Q9JKK7
P70566
Q9NZR1
O01479
Q6P5Q4
A1A5Q0
A1A5Q0
Q3UHZ5
Q3UHZ5
E7F7X0
….

For each id in turn, the loop runs:

curl -O "http://uniprot.org/uniprot/P49813.fasta"
curl -O "http://uniprot.org/uniprot/P70567.fasta"
curl -O "http://uniprot.org/uniprot/A0JNC0.fasta"
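Before letting a loop download dozens of files, it can help to do a dry run that only prints the commands it would execute. The sketch below does that for a two-line stand-in for uniprot.txt; nothing is downloaded, and the URL pattern is simply the one from the script above.

```shell
ids=$(mktemp)
printf 'P49813\nP70567\n' > "$ids"

commands=$(
    while read -r p; do
        # echo prints the command instead of running it, so no download happens.
        echo "curl -O \"http://www.uniprot.org/uniprot/$p.fasta\""
    done < "$ids"
)
echo "$commands"
```

Once the printed commands look right, removing the echo (and the surrounding quotes) turns the dry run into the real download loop.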

Page 28

./list
http://uniprot.org/uniprot/P49813.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
http://uniprot.org/uniprot/P70567.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
http://uniprot.org/uniprot/A0JNC0.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    336      0 --:--:-- --:--:-- --:--:--   336
http://uniprot.org/uniprot/P28289.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    345      0 --:--:-- --:--:-- --:--:--   345
http://uniprot.org/uniprot/Q9NYL9.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    458      0 --:--:-- --:--:-- --:--:--   458
http://uniprot.org/uniprot/Q9JHJ0.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    484      0 --:--:-- --:--:-- --:--:--   484
http://uniprot.org/uniprot/Q9JLH8.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    468      0 --:--:-- --:--:-- --:--:--   469
http://uniprot.org/uniprot/Q9NZQ9.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    506      0 --:--:-- --:--:-- --:--:--   506
http://uniprot.org/uniprot/Q0VC48.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    486      0 --:--:-- --:--:-- --:--:--   486
http://uniprot.org/uniprot/Q9JKK7.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    484      0 --:--:-- --:--:-- --:--:--   483
http://uniprot.org/uniprot/P70566.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    485      0 --:--:-- --:--:-- --:--:--   486
http://uniprot.org/uniprot/Q9NZR1.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   251  100   251    0     0    476      0 --:--:-- --:--:-- --:--:--   476

After the script has run, list the directory:

ls
blast                uniprotD4A615.fasta  uniprotQ14BP6.fasta  uniprotQ5R8C0.fasta  uniprotQ8K3Z0.fasta
data                 uniprotE7F7X0.fasta  uniprotQ19857.fasta  uniprotQ66X01.fasta  uniprotQ93650.fasta
list                 uniprotE9Q5R7.fasta  uniprotQ1L994.fasta  uniprotQ66X03.fasta  uniprotQ96HA7.fasta
result               uniprotO01479.fasta  uniprotQ1ZXD6.fasta  uniprotQ66X05.fasta  uniprotQ9DAM1.fasta
uniprot.txt          uniprotP28289.fasta  uniprotQ3UHZ5.fasta  uniprotQ66X22.fasta  uniprotQ9HC29.fasta
uniprotA0JNC0.fasta  uniprotP29536.fasta  uniprotQ3V3V9.fasta  uniprotQ6E804.fasta  uniprotQ9JHJ0.fasta
uniprotA0JPI9.fasta  uniprotP34342.fasta  uniprotQ4R642.fasta  uniprotQ6F5E8.fasta  uniprotQ9JKK7.fasta
uniprotA1A5Q0.fasta  uniprotP49813.fasta  uniprotQ4UNE4.fasta  uniprotQ6NZL6.fasta  uniprotQ9JLH8.fasta
uniprotA6H639.fasta  uniprotP70566.fasta  uniprotQ53B87.fasta  uniprotQ6P5Q4.fasta  uniprotQ9NPH0.fasta
uniprotA8Y3R9.fasta  uniprotP70567.fasta  uniprotQ53B88.fasta  uniprotQ6ZQY2.fasta  uniprotQ9NYL9.fasta
uniprotB4SSQ7.fasta  uniprotQ0VAA2.fasta  uniprotQ54G18.fasta  uniprotQ7RTR2.fasta  uniprotQ9NZQ9.fasta
uniprotC1F960.fasta  uniprotQ0VAK6.fasta  uniprotQ5DU56.fasta  uniprotQ8BHB0.fasta  uniprotQ9NZR1.fasta
uniprotC3VPR6.fasta  uniprotQ0VC48.fasta  uniprotQ5JU00.fasta  uniprotQ8BVA4.fasta  uniprotQ9Y239.fasta

The uniprot*.fasta entries are the downloaded FASTA files.

Page 29

Let's combine it all together

#!/bin/bash

# Run BLAST
blastp -query $1 -db swissprot -evalue 1e-5 -outfmt 6 > result

# Extract the UniProt ids and save them in uniprot.txt
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}' > uniprot.txt

# Read the UniProt ids saved in uniprot.txt one by one and download each of them
while read p; do
    echo $p
    curl -O "http://www.uniprot.org/uniprot/"$p".fasta";
done < uniprot.txt

# : comment (a description inside the script)

Save the script as 'blastdownload' and run it as:

./blastdownload <filename>
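The uniprot.txt shown earlier lists some ids more than once (A1A5Q0 and Q3UHZ5, for example), because BLAST can report several alignments for the same protein. The lecture script downloads such ids repeatedly; one possible refinement, offered here only as an optional variation, is to remove duplicates with `sort -u` before the loop.

```shell
# A stand-in for uniprot.txt containing repeated ids.
ids=$(mktemp)
printf 'A1A5Q0\nA1A5Q0\nQ3UHZ5\nQ3UHZ5\nE7F7X0\n' > "$ids"

# sort -u sorts the lines and keeps a single copy of each one.
unique_ids=$(sort -u "$ids")
unique_count=$(printf '%s\n' "$unique_ids" | wc -l | tr -d ' ')

echo "$unique_ids"
echo "$unique_count unique ids"
```

In the pipeline this would amount to adding `| sort -u` before `> uniprot.txt`. The price is that the ids are no longer in order of BLAST score, which does not matter for downloading.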

Page 30

Make a pipeline using scripts

1. Blastp against swissprot

2. Extract the UniProt ids

3. Download each UniProt entry

Let's add another step to the pipeline!

Page 31

We now have many FASTA files containing homologs of the original query sequence.

Let's compare all of them using multiple sequence alignment!

Why do we need to learn how to do multiple sequence alignment?

Page 32

Why do we need to compare multiple sequences?

- A single amino acid sequence:

YEKIGKIGEGSYGVVFKCRNRDTGQIVAIKKFLESEDDPVIKKIALREIRMLKQLKHPNLVNLLEVFRRKRRLHLVFEYCDHTVLHELDRYQRGVPEHLVKSITWQTLQAVNFCHKHNCIHRDVKPENILITKHSVIKLCDFGFARLLAGPSDYYTDYVATRWYRSPELLVGDTQYGPPVDVWAIGCVFAELLSGVPLWPGKSDVDQLYLIRKTLGDLIPRHQQVFSTNQYFSGVKIPDPEDMEPLELKFPNISYPALGLLKGCLHMDPTQRLTCEQLLHHPYF

What kind of information can we get from this?

Not much… Molecular weight, isoelectric point?

But what if we compare two homologous sequences?

Page 33

There is some homology between the two sequences.

There are some gaps here…

…Is that it? What about the function of the protein?

Page 34

But if we align more than three proteins and compare them…

Conserved residues (marked with *)… maybe important for function?

Conserved region

Variable region

Phylogenetic analysis

Secondary structure

Page 35

Information from MSA

• A multiple sequence alignment is more informative than a two-sequence alignment

• You can find sequence domains from multiple sequence alignments

From an unknown sequence -> find a novel domain -> deduce potential functions

Page 36

Information from MSA

- Which parts of the sequences are evolutionarily conserved?

The evolutionarily conserved parts of a protein are usually the parts essential to that protein.

Protein function is determined by protein structure.

A significant portion of the evolutionarily conserved regions determine the protein's structure.
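To make "conserved" concrete, the sketch below marks the fully conserved columns of a tiny alignment with *, in the spirit of the marker line alignment programs print. The three rows are the first aligned residues of TMOD2_HUMAN, TMOD4_HUMAN and TMOD1_HUMAN as they appear in the MUSCLE output of this lecture; the awk program is only an illustration (it requires identical residues and ignores gap columns), not how MUSCLE itself scores conservation.

```shell
# Three aligned sequences of equal length; '-' marks a gap.
aln=$(mktemp)
cat > "$aln" <<'EOF'
MALPFQKELEK
-MSSYQKELEK
--MSYRRELEK
EOF

marks=$(awk '
    { seq[NR] = $0 }                      # remember every aligned sequence
    END {
        line = ""
        for (i = 1; i <= length(seq[1]); i++) {
            c = substr(seq[1], i, 1)      # residue of the first sequence in column i
            same = (c != "-")             # a gap never counts as conserved
            for (j = 2; j <= NR; j++)
                if (substr(seq[j], i, 1) != c) same = 0
            line = line (same ? "*" : " ")
        }
        print line
    }' "$aln")

cat "$aln"
echo "$marks"
```

Only the last four columns (ELEK) carry the same residue in all three sequences, so only those are starred. With two sequences almost everything similar looks conserved; adding more sequences is what narrows the stars down to the positions that really do not change.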

Page 37

Protein structure and MSA

- Protein function is correlated with three-dimensional structure

- Compared with sequence, structure is very hard to change

- The conserved parts of a protein are essential for maintaining its structure

Structure 2013, 21, 1690-1697. DOI: 10.1016/j.str.2013.06.020

Page 38

Many different multiple sequence alignment programs are available.

In this course, we will use 'MUSCLE': http://www.drive5.com/muscle

Page 39

http://www.ebi.ac.uk/Tools/msa/muscle/

Page 40

http://www.drive5.com/muscle/downloads.htm

Page 41

Make a new directory named 'muscle' under your home directory.

Download muscle (if you have Linux, download a different file).

Uncompress the tar.gz file.

Rename the extracted program to 'muscle'.

Remove the tar.gz file.

Page 42

Set up PATH for muscle

Change .bash_profile as follows and save it:

MUSCLE=~/muscle
export MUSCLE
PATH=$PATH:~/ncbi-blast-2.2.31+/bin:$MUSCLE
BLASTDB=~/ncbi-blast-2.2.31+/db
export PATH
export BLASTDB
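What the PATH line achieves can be seen with a harmless stand-in. In this sketch a temporary directory plays the role of ~/muscle and a tiny script called mytool (a made-up name) plays the role of the muscle program.

```shell
# Hypothetical stand-in for the ~/muscle directory: a temp directory holding a tiny program.
tooldir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from mytool"\n' > "$tooldir/mytool"
chmod +x "$tooldir/mytool"

# Same pattern as in .bash_profile: append the directory to PATH and export it.
PATH=$PATH:$tooldir
export PATH

# The shell can now find the program by name alone, from any directory.
found=$(command -v mytool)
output=$(mytool)

echo "$found"
echo "$output"
```

`command -v` reports where the shell found a program, so `command -v muscle` is a quick check that the .bash_profile change took effect. Changes to .bash_profile apply to new terminal windows, or after running `source ~/.bash_profile` in the current one.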

Page 43

If you see this, your setup is complete.

Let's do some MSA. First, go back to the directory you were working in before.

Page 44

Input format for muscle

Amino acid sequences in multi-FASTA format: each record has a header line starting with ">", followed by the sequence.

>sp|Q9JHJ0|TMOD3_MOUSE Tropomodulin-3 OS=Mus musculus GN=Tmod3 PE=2 SV=1
MALPFRKDLGDYKDLDEDELLGKLSESELKQLETVLDDLDPENALLPAGFRQKNQTSKSATGPFDRERLLSYLEKQALEHKDRDDYVPYTGEKKGKIFIPKQKPAQTLTEETISLDPELEEALTSASDTELCDLAAILGMHNLIADTPFCDVLGSSNGVNQERFPNVVKGEKILPVFDEPPNPTNVEESLKRIRENDARLVEVNLNNIKNIPIPTLKDFAKTLEANTHVKHFSLAATRSNDPVAVAFADMLKVNKTLKSLNMESNFITGAGVLALIDALRDNETLMELKIDNQRQQLGTSVELEMAKMLEENTNILKFGYQFTQQGPRTRAANAITKNNDLVRKRRIEGDHQ
>sp|Q9JKK7|TMOD2_MOUSE Tropomodulin-2 OS=Mus musculus GN=Tmod2 PE=1 SV=2
MALPFQKGLEKYKNIDEDELLGKLSEEELKQLENVLDDLDPESATLPAGFRQKDQTQKAATGPFDREHLLMYLEKEALEQKDREDFVPFTGEKKGRVFIPKEKPVETRKEEKVTLDPELEEALASASDTELYDLAAVLGVHNLLNNPKFDEETTNGEGRKGPVRNVVKGEKAKPVFEEPPNPTNVEASLQQMKANDPSLQEVNLNNIKNIPIPTLKEFAKSLETNTHVKKFSLAATRSNDPVALAFAEMLKVNKTLKSLNVESNFITGTGILALVEALRENDTLTEIKIDNQRQQLGTAVEMEIAQMLEENSRILKFGYQFTKQGPRTRVAAAITKNNDLVRKKRVEGDRR
>sp|Q9JLH8|TMOD4_MOUSE Tropomodulin-4 OS=Mus musculus GN=Tmod4 PE=2 SV=1
MSSYQKELEKYRDIDEDEILRTLSPEELEQLDCELQEMDPENMLLPAGLRQRDQTKKSPTGPLDRDALLQYLEQQALEVKERDDLVPYTGEKKGKPFIQPKREIPAQEQITLEPELEEALSHATDAEMCDIAAILGMYTLMSNKQYYDAICSGEICNTEGISSVVQPDKYKPVPDEPPNPTNIEEMLKRVRSNDKELEEVNLNNIQDIPIPVLSDLCEAMKTNTYVRSFSLVATKSGDPIANAVADMLRENRSLQSLNIESNFISSTGLMAVLKAVRENATLTELRVDNQRQWPGDAVEMEMATVLEQCPSIVRFGYHFTQQGPRARAAHAMTRNNELRRQQKKR

Page 45

But we have multiple files, each containing a single FASTA sequence.

How do we combine them into a single multi-FASTA file?

Page 46

Use the cat command and a wildcard (*)

cat *.fasta > merged.fasta

cat merged.fasta
>sp|A0JNC0|TMOD1_BOVIN Tropomodulin-1 OS=Bos taurus GN=TMOD1 PE=2 SV=1
MSYRRELEKYRDLDEDEILGGLTEEELRTLENELDELDPDNALLPAGLRQKDQTTKAPTGPFRREELLDHLEKQAKEFKDREDLVPYTGEKRGKVWVPKQKPMDPVLESVTLEPELEEALANASDAELCDIAAILGMHTLMSNQQYYQALGSSSIVNKEGLNSVIKPTQYKPVPDEEPNATDVEETLERIKNNDPKLEEVNLNNIRNIPIPTLKAYAEALKENSYVKKFSIVGTRSNDPVAFALAEMLKVNKVLKTLNVESNFISGAGILRLVEALPYNTSLVELKIDNQSQPLGNKVEMEIVSMLEKNATLLKFGYHFTQQGPRLRASNAMMNNNDLVRKRRLADLTGPIIPKCRSGV
>sp|A0JPI9|LR74A_RAT Leucine-rich repeat-containing protein 74A OS=Rattus norvegicus GN=Lrrc74a PE=2 SV=1
MDDDDIEPLEYETKDETEAALAPQSSEDTLYCEAEAAPSVEKEKPTREDSETDLEIEDTEKFFSIGQKELYLEACKLVGVVPVSYFIRNMEESCMNLNHHGLGPMGIKAIAITLVSNTTVLKLELEDNSIQEEGILSLMEMLHENYYLQELNVSDNNLGLEGARIISDFLQENNSSLWKLKLSGNKFKEECALLLCQALSSNYRIRSLNLSHNEFSDTAGEYLGQMLALNVGLQSLNLSWNHFNVRGAVALCNGLRTNVTLKKLDVSMNGFGNDGALALGDTLKLNSCLVYVDVSRNGITNEGASRISKGLENNECLQVLKLFLNPVSLEGAYSLILAIKRNPKSRMEDLDISNVLVSEQFVKVLDGVCAIHPQLDVVYKGLQGLSTKKTVSLETNPIKLIQNYTDQNKISVVEFFKSLNPSGLMTMPVGDFRKAIIQQTNIPINRYQARELIKKLEEKNGMVNFSGFKSLKVTAAGQL

All of the FASTA files have been saved in merged.fasta.
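A quick sanity check after merging is to count the records: every FASTA record starts with ">", so counting those lines counts the sequences. The sketch uses two made-up single-record files (a.fasta and b.fasta, with short invented sequences) in a temporary directory.

```shell
workdir=$(mktemp -d)

# Two hypothetical single-sequence FASTA files.
printf '>seqA hypothetical record\nMSYRRELEK\n' > "$workdir/a.fasta"
printf '>seqB hypothetical record\nMSSYQKELEK\n' > "$workdir/b.fasta"

# Concatenate every .fasta file in the directory into one multi-FASTA file.
cat "$workdir"/*.fasta > "$workdir/merged.fasta"

# grep -c counts the lines that start with '>', i.e. the number of records.
n_records=$(grep -c '^>' "$workdir/merged.fasta")
echo "$n_records"
```

If the count is lower than the number of files you merged, one of the downloads probably did not contain a FASTA record, which is worth checking before running the alignment.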

Page 47

Run muscle

muscle -in <inputfile> -out <outfilename> -clw

Page 48

cat result.aln
MUSCLE (3.8) multiple sequence alignment

sp|Q53B88|NOD2_HYLLA      GCWDPHSLHPARDLQSHRPAIVR--RLHSHVEGVLDLAWERGFVSQYECDEIRLPIFTPS
sp|Q7RTR2|NLRC3_HUMAN     PDAPLGPCSNDSRIQRHRKALLSKVGGGPELGGPWHRLASLLLVEGLTDLQLREHDFTQV
sp|O01479|TMOD_CAEEL      APSANSQQGTQLPSKVYNKGLKD-------------------------------------
sp|E7F7X0|LMOD3_DANRE     -------MSERTEQESYTDKIDE-------------------------------------
sp|Q0VAK6|LMOD3_HUMAN     --------MSEHSRNSDQEELLD-------------------------------------
sp|Q9JLH8|TMOD4_MOUSE     -------------MSSYQKELEK-------------------------------------
sp|Q0VC48|TMOD4_BOVIN     -------------MSSYQKELEK-------------------------------------
sp|Q9NZQ9|TMOD4_HUMAN     -------------MSSYQKELEK-------------------------------------
sp|P28289|TMOD1_HUMAN     --------------MSYRRELEK-------------------------------------
sp|A0JNC0|TMOD1_BOVIN     --------------MSYRRELEK-------------------------------------
sp|P49813|TMOD1_MOUSE     --------------MSYRRELEK-------------------------------------
sp|P70567|TMOD1_RAT       --------------MSYRRELEK-------------------------------------
sp|Q9NZR1|TMOD2_HUMAN     ------------MALPFQKELEK-------------------------------------
sp|P70566|TMOD2_RAT       ------------MALPFQKGLEK-------------------------------------
sp|Q9JKK7|TMOD2_MOUSE     ------------MALPFQKGLEK-------------------------------------
sp|Q9JHJ0|TMOD3_MOUSE     ------------MALPFRKDLGD-------------------------------------
sp|Q9NYL9|TMOD3_HUMAN     ------------MALPFRKDLEK-------------------------------------
sp|A1A5Q0|LMOD2_RAT       -----------MSTFGYRRGLSK-------------------------------------
sp|Q3UHZ5|LMOD2_MOUSE     -----------MSTFGYRRGLSK-------------------------------------
sp|Q6P5Q4|LMOD2_HUMAN     -----------MSTFGYRRGLSK-------------------------------------
sp|P29536|LMOD1_HUMAN     ----------MSRVAKYRRQVS--------------------------------------
sp|Q8BVA4|LMOD1_MOUSE     ----------MSKVAKYRRQVS--------------------------------------

sp|Q53B88|NOD2_HYLLA      QRARRLLDLATVKANGLAAFLLQHVQELPVPLALPLEAATCRKYMAKLRTTVSAQSRFLS
sp|Q7RTR2|NLRC3_HUMAN     EATRGGGHPARTVALDRLFLPLSRVSVPPRVSITIGVAGMGKTTLVRHFVRLWAHGQVGK
sp|O01479|TMOD_CAEEL      ------------------------------------------------------------
sp|E7F7X0|LMOD3_DANRE     ------------------DEILAGLSAEELKQLQSEMDDIAPDERVPVGLRQKDASHEMT
sp|Q0VAK6|LMOD3_HUMAN     ------------------------------------------------------------
sp|Q9JLH8|TMOD4_MOUSE     ------------------------------------------------------------
sp|Q0VC48|TMOD4_BOVIN     ------------------------------------------------------------
sp|Q9NZQ9|TMOD4_HUMAN     ------------------------------------------------------------
sp|P28289|TMOD1_HUMAN     ------------------------------------------------------------
sp|A0JNC0|TMOD1_BOVIN     ------------------------------------------------------------
sp|P49813|TMOD1_MOUSE     ------------------------------------------------------------
sp|P70567|TMOD1_RAT       ------------------------------------------------------------
sp|Q9NZR1|TMOD2_HUMAN     ------------------------------------------------------------
sp|P70566|TMOD2_RAT       ------------------------------------------------------------
sp|Q9JKK7|TMOD2_MOUSE     ------------------------------------------------------------
sp|Q9JHJ0|TMOD3_MOUSE     ------------------------------------------------------------
sp|Q9NYL9|TMOD3_HUMAN     ------------------------------------------------------------
sp|A1A5Q0|LMOD2_RAT       ------------------------------------------------------------
sp|Q3UHZ5|LMOD2_MOUSE     ------------------------------------------------------------
sp|Q6P5Q4|LMOD2_HUMAN     ------------------------------------------------------------
sp|P29536|LMOD1_HUMAN     ------------------------------------------------------------
sp|Q8BVA4|LMOD1_MOUSE     ------------------------------------------------------------


Inspect MSA

http://www.jalview.org
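
Before opening an alignment in a viewer, a quick command-line check can confirm that every sequence made it into the file. The following is only a sketch under stated assumptions: the file toy.aln, its sequences and the helper name count_aligned are made up for illustration, and sequence names are assumed to start with "sp|" as in the MUSCLE output shown earlier.

```bash
#!/bin/bash
# Build a tiny CLUSTAL-style alignment in a temporary directory (made-up data).
workdir=$(mktemp -d)
cat > "$workdir/toy.aln" <<'EOF'
MUSCLE (3.8) multiple sequence alignment

sp|P28289|TMOD1_HUMAN   --MSYRRELEK
sp|P49813|TMOD1_MOUSE   --MSYRRELEK
sp|Q9NZR1|TMOD2_HUMAN   MALPFQKELEK

sp|P28289|TMOD1_HUMAN   YKDLDEDELLG
sp|P49813|TMOD1_MOUSE   YKDLDEDELLG
sp|Q9NZR1|TMOD2_HUMAN   YKDIDEDELLA
EOF

# Hypothetical helper: count distinct sequence names.
# Only lines starting with "sp|" are used, and only their first field.
count_aligned() {
  awk '/^sp\|/ {print $1}' "$1" | sort -u | wc -l | tr -d ' '
}

n_seqs=$(count_aligned "$workdir/toy.aln")
echo "sequences in alignment: $n_seqs"
rm -r "$workdir"
```

If the count is lower than the number of sequences in merged.fasta, something went wrong before the alignment step.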


?

I cloned an RT-PCR product, but its sequence differs from the reference sequence at one position. Should I use it?


BLAST -> extract UniProt IDs -> download from UniProt -> MUSCLE

The position is 100% conserved, and a Y->C change is a very significant change.

It may well affect protein function.


Combined workflow

#!/bin/bash

# Run BLAST
blastp -query $1 -db swissprot -evalue 1e-5 -outfmt 6 > result

# Extract the UniProt IDs and save them in uniprot.txt
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}' > uniprot.txt

# Read the UniProt IDs saved in uniprot.txt one by one and download each one
while read p; do
    echo $p
    curl -O "http://www.uniprot.org/uniprot/"$p".fasta";
done <uniprot.txt

# Merge the FASTA files into merged.fasta
cat *.fasta > merged.fasta

# Run MUSCLE
muscle -in merged.fasta -out align.aln -clw
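
The ID-extraction step of this workflow can be tried on its own, without running BLAST. The sketch below feeds three made-up hit lines to the same awk expression. It assumes that the second column looks like gi|...|sp|ACCESSION.VERSION|NAME, which is what the split on a[4] in the script implies. The function name extract_uniprot_ids and the extra sort -u (to drop repeated hits to the same protein) are additions for illustration and are not part of the original script.

```bash
#!/bin/bash
# Turn column 2 of BLAST tabular output into bare UniProt accessions.
# sort -u removes duplicates (an addition, not in the original workflow).
extract_uniprot_ids() {
  awk '{split($2, a, "|"); split(a[4], b, "."); print b[1]}' | sort -u
}

# Three made-up hit lines; the first two point at the same subject sequence.
uniprot_ids=$(
  {
    printf 'query\tgi|0001|sp|P28289.1|TMOD1_HUMAN\t100.0\n'
    printf 'query\tgi|0001|sp|P28289.1|TMOD1_HUMAN\t45.2\n'
    printf 'query\tgi|0002|sp|Q9NZR1.1|TMOD2_HUMAN\t61.3\n'
  } | extract_uniprot_ids
)
echo "$uniprot_ids"
```

Without the deduplication, the download loop would simply fetch the same FASTA file more than once.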


Blastp with swissprot

Extract Uniprot id

Download the sequence for each UniProt ID

MSA with muscle


Other examples

Questions

I have a protein sequence. I'm looking for protein structures homologous to my protein. Can I search for them and, if any exist, download them?

1. Search the protein structure DB

2. Get the IDs of the matching protein structures

3. Download the protein structures based on those IDs.


How can we search a protein structure database?

Among the NCBI BLAST databases there is a protein database called pdbaa (all amino acid sequences deposited in the protein structure database, PDB).

http://ftp.ncbi.nlm.nih.gov/FASTA/pdbaa.gz

Download it and build a BLAST database from it:

cd ~/ncbi-blast-2.2.31+/db
curl -O ftp://[email protected]/blast/db/FASTA/pdbaa.gz
gunzip pdbaa.gz
makeblastdb -in pdbaa -dbtype prot


Let's run a BLAST search

Make working directory and download query sequences (Your favorite protein sequences)

Blast using newly built blast db (pdbaa)


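The PDB IDs are here (2J1D, 2Z6E, 3OBV…). How can you extract them?

awk '{print $2}' : print the second column
awk '{split($2,a,"|"); print a[4]}' : split the second column on "|" and print the fourth piece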


blastp -query Q9NZ56.fasta -db pdbaa -evalue 1e-10 -outfmt 6 | awk '{split($2,a,"|");print a[4];}'>pdblist

cat pdblist
2J1D
2Z6E
3OBV
3O4X
1V9D
4EAH
2YLE

Using this list, let's download the PDB files
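
Before downloading, it can be worth cleaning the list, since the same structure may be hit more than once and a stray line would produce a broken URL. The sketch below is an illustration only: clean_pdb_ids is a hypothetical helper, the input list is made up, and the pattern assumes the classic four-character PDB ID (a digit followed by three letters or digits).

```bash
#!/bin/bash
# Keep only unique, well-formed PDB IDs:
# one digit followed by three letters or digits, nothing else on the line.
clean_pdb_ids() {
  grep -E '^[0-9][A-Za-z0-9]{3}$' | sort -u
}

# Made-up list with one duplicate, one blank line and one malformed entry.
clean_ids=$(printf '2J1D\n2Z6E\n2J1D\n\nN/A\n3OBV\n' | clean_pdb_ids)
echo "$clean_ids"
```

With a real list, the same filter could be applied as: clean_pdb_ids < pdblist > pdblist.clean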


Of course, you can download them one by one from the website. But there is a better way..

You can download a PDB file from an address of this form:

http://www.rcsb.org/pdb/files/PDB_ID.pdb.gz

For example:

http://www.rcsb.org/pdb/files/2J1D.pdb.gz

curl -O http://www.rcsb.org/pdb/files/2J1D.pdb.gz
gunzip 2J1D.pdb.gz


Modify the previous script

1. Read the PDB IDs stored in pdblist one by one
2. Download each one with curl from http://www.rcsb.org/pdb/files/pdbid.pdb.gz
3. Uncompress the .gz file with gunzip

#!/bin/bash
while read p; do
    curl -O "http://www.rcsb.org/pdb/files/"$p".pdb.gz"
    gunzip $p.pdb.gz
done <pdblist

Save it as download

chmod +x download
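
Before running a download loop for real, a dry run that only prints the commands is a cheap way to check the URLs. The sketch below assumes the address pattern given earlier in this lecture. The function name pdb_url and the two-ID input are hypothetical, and nothing is actually downloaded.

```bash
#!/bin/bash
# Build the download URL for one PDB ID, following the address pattern above.
pdb_url() {
  printf 'http://www.rcsb.org/pdb/files/%s.pdb.gz\n' "$1"
}

# Dry run: print what would be fetched instead of calling curl.
dry_run=$(
  printf '2J1D\n2Z6E\n' | while read -r id; do
    echo "curl -O $(pdb_url "$id")"
  done
)
echo "$dry_run"
```

Once the printed commands look right, the echo can be replaced by the real curl and gunzip calls.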


Execute scripts


Display Proteins

You can display protein molecules using PyMOL (http://pymol.org) or CueMol2 (http://www.cuemol.org/ja/index.php?cuemol2)


You have many files and want to do the same thing to each..

You have lots of FASTA files.
You want to run BLAST on each of them.
How can you do that without typing the command repeatedly?

#!/bin/bash

for f in *.fasta
do
    echo "processing $f..."
    blastp -query $f -db pdbaa -out $f".blast" -outfmt 6
done

for f in *.fasta : every file whose name ends in .fasta
-out $f".blast" : save the output file as "filename" + ".blast"


For each file, the loop runs the same two commands:

echo "processing A0JNC0.fasta..."
blastp -query A0JNC0.fasta -db pdbaa -out A0JNC0.fasta.blast -outfmt 6

echo "processing A1A5Q0.fasta..."
blastp -query A1A5Q0.fasta -db pdbaa -out A1A5Q0.fasta.blast -outfmt 6

... and so on for every file that matches *.fasta
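
Because the script appends ".blast" to the whole file name, its output for A0JNC0.fasta is named A0JNC0.fasta.blast. If you would rather get A0JNC0.blast, the shell can strip the suffix first. The following is a minimal sketch, not part of the original script: it creates two empty placeholder files in a temporary directory and only prints the blastp commands instead of running them.

```bash
#!/bin/bash
# Create two empty placeholder FASTA files in a temporary directory.
workdir=$(mktemp -d)
touch "$workdir/A0JNC0.fasta" "$workdir/A1A5Q0.fasta"

# Dry run: print the command for each file and remember the output names.
outputs=""
for f in "$workdir"/*.fasta; do
  name=$(basename "$f")         # e.g. A0JNC0.fasta
  out="${name%.fasta}.blast"    # strip the .fasta suffix, then add .blast
  echo "blastp -query $name -db pdbaa -out $out -outfmt 6"
  outputs="${outputs:+$outputs }$out"
done
echo "output files: $outputs"
rm -r "$workdir"
```

The expansion ${name%.fasta} removes the shortest match of ".fasta" from the end of the value and leaves the rest untouched.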


Combining grep and awk, you can do many things…

Let's assume you have downloaded some PDB files..

ls *.pdb
1IO0.pdb 1PGV.pdb 2J1D.pdb 4PKG.pdb 4PKI.pdb

What is the title of each PDB entry?

grep "^TITLE" *.pdb | awk '{if (substr($0,19,1)==" ") {print substr($0,1,4) "\t" substr($0,20)} else {print "\t" substr($0,21)}}'

1IO0	CRYSTAL STRUCTURE OF TROPOMODULIN C-TERMINAL HALF
1PGV	STRUCTURAL GENOMICS OF CAENORHABDITIS ELEGANS: TROPOMODULIN
	C-TERMINAL DOMAIN
2J1D	CRYSTALLIZATION OF HDAAM1 C-TERMINAL FRAGMENT
4PKG	COMPLEX OF ATP-ACTIN WITH THE N-TERMINAL ACTIN-BINDING DOMAIN OF
	TROPOMODULIN
4PKI	COMPLEX OF ATP-ACTIN WITH THE C-TERMINAL ACTIN-BINDING DOMAIN OF
	TROPOMODULIN
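
Long titles are split over several TITLE records, so they end up on several output lines. A variation that joins them into one line per entry is sketched below. It assumes the standard PDB column layout (title text in columns 11 to 80, a continuation number ending in column 10) and works on one file at a time. The toy file and the function name pdb_title are made up for illustration, and the title text is the 1PGV title shown above.

```bash
#!/bin/bash
# Write a two-record TITLE section in PDB column layout (toy file).
# printf pads the fields so the title text starts in column 11.
workdir=$(mktemp -d)
{
  printf 'TITLE%5s%s\n'  ''  'STRUCTURAL GENOMICS OF CAENORHABDITIS ELEGANS: TROPOMODULIN'
  printf 'TITLE%5s %s\n' '2' 'C-TERMINAL DOMAIN'
} > "$workdir/1PGV.pdb"

# Hypothetical helper: concatenate columns 11-80 of every TITLE record,
# then squeeze repeated spaces and trim both ends.
pdb_title() {
  awk '/^TITLE/ { t = t substr($0, 11, 70) }
       END { gsub(/ +/, " ", t); sub(/^ /, "", t); sub(/ $/, "", t); print t }' "$1"
}

title=$(pdb_title "$workdir/1PGV.pdb")
printf '%s\t%s\n' "1PGV" "$title"
rm -r "$workdir"
```

Reading each file separately avoids the file-name prefix that grep adds when it searches several files, so the column numbers are the ones from the PDB format itself.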


What is the resolution of each protein structure?

grep -E "^REMARK +2 RESOLUTION\." *.pdb | awk '{print substr($1,1,4) "\t" $4}'

1IO0	1.45
1PGV	1.80
2J1D	2.55
4PKG	1.80
4PKI	2.30
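
Once the resolution is a plain number, it can drive a decision, for example keeping only the better-resolved structures. The sketch below is an illustration under stated assumptions: the two toy files contain only a REMARK 2 line with values taken from the output above, the function name resolution is hypothetical, and the 2.0 angstrom cutoff is an arbitrary example value.

```bash
#!/bin/bash
# Toy PDB files containing only the REMARK 2 resolution line.
workdir=$(mktemp -d)
printf 'REMARK%4s RESOLUTION.%4s%s ANGSTROMS.\n' '2' '' '1.45' > "$workdir/1IO0.pdb"
printf 'REMARK%4s RESOLUTION.%4s%s ANGSTROMS.\n' '2' '' '2.55' > "$workdir/2J1D.pdb"

# Hypothetical helper: print the resolution value of one file.
# Matching on fields makes the test independent of the exact spacing.
resolution() {
  awk '$1 == "REMARK" && $2 == "2" && $3 == "RESOLUTION." { print $4; exit }' "$1"
}

# Collect the IDs whose resolution is at or below the cutoff.
high_res=""
for f in "$workdir"/*.pdb; do
  id=$(basename "$f" .pdb)
  res=$(resolution "$f")
  if awk -v r="$res" 'BEGIN { exit !(r <= 2.0) }'; then
    high_res="${high_res:+$high_res }$id"
  fi
done
echo "resolution <= 2.0: $high_res"
rm -r "$workdir"
```

The comparison is done in awk because the shell's own test command only compares integers.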

More information:

https://madscientist.wordpress.com/2012/04/22/텍스트-툴로-pdb-뒤비기/


Let's assume you have many FASTA files…

cat Q9UNT1.fasta
>sp|Q9UNT1|RBL2B_HUMAN Rab-like protein 2B OS=Homo sapiens GN=RABL2B PE=2 SV=1
MAEDKTKPSELDQGKYDADDNVKIICLGDSAVGKSKLMERFLMDGFQPQQLSTYALTLYKHTATVDGRTILVDFWDTAGQERFQSMHASYYHKAHACIMVFDVQRKVTYRNLSTWYTELREFRPEIPCIVVANKIDDINVTQKSFNFAKKFSLPLYFVSAADGTNVVKLFND

We want to check the title of each protein…

Search all FASTA files and show the lines that start with ">":

grep "^>" *.fasta
A4D1S5.fasta:>sp|A4D1S5|RAB19_HUMAN Ras-related protein Rab-19 OS=Homo sapiens GN=RAB19 PE=2 SV=2
O00194.fasta:>sp|O00194|RB27B_HUMAN Ras-related protein Rab-27B OS=Homo sapiens GN=RAB27B PE=1 SV=4
O14966.fasta:>sp|O14966|RAB7L_HUMAN Ras-related protein Rab-7L1 OS=Homo sapiens GN=RAB29 PE=1 SV=1
O95716.fasta:>sp|O95716|RAB3D_HUMAN Ras-related protein Rab-3D OS=Homo sapiens GN=RAB3D PE=1 SV=1
O95755.fasta:>sp|O95755|RAB36_HUMAN Ras-related protein Rab-36 OS=Homo sapiens GN=RAB36 PE=2 SV=2


grep "^>" *.fasta | sed 's/^.*>//g'

Remove these parts (everything up to and including ">"):

A4D1S5.fasta:>sp|A4D1S5|RAB19_HUMAN Ras-related protein Rab-19 OS=Homo sapiens GN=RAB19 PE=2 SV=2
O00194.fasta:>sp|O00194|RB27B_HUMAN Ras-related protein Rab-27B OS=Homo sapiens GN=RAB27B PE=1 SV=4
O14966.fasta:>sp|O14966|RAB7L_HUMAN Ras-related protein Rab-7L1 OS=Homo sapiens GN=RAB29 PE=1 SV=1
O95716.fasta:>sp|O95716|RAB3D_HUMAN Ras-related protein Rab-3D OS=Homo sapiens GN=RAB3D PE=1 SV=1
O95755.fasta:>sp|O95755|RAB36_HUMAN Ras-related protein Rab-36 OS=Homo sapiens GN=RAB36 PE=2 SV=2

Result:

sp|A4D1S5|RAB19_HUMAN Ras-related protein Rab-19 OS=Homo sapiens GN=RAB19 PE=2 SV=2
sp|O00194|RB27B_HUMAN Ras-related protein Rab-27B OS=Homo sapiens GN=RAB27B PE=1 SV=4
sp|O14966|RAB7L_HUMAN Ras-related protein Rab-7L1 OS=Homo sapiens GN=RAB29 PE=1 SV=1
sp|O95716|RAB3D_HUMAN Ras-related protein Rab-3D OS=Homo sapiens GN=RAB3D PE=1 SV=1
sp|O95755|RAB36_HUMAN Ras-related protein Rab-36 OS=Homo sapiens GN=RAB36 PE=2 SV=2
sp|P0C0E4|RB40L_HUMAN Ras-related protein Rab-40A-like OS=Homo sapiens GN=RAB40AL PE=1 SV=1
sp|P11234|RALB_HUMAN Ras-related protein Ral-B OS=Homo sapiens GN=RALB PE=1 SV=1
sp|P20336|RAB3A_HUMAN Ras-related protein Rab-3A OS=Homo sapiens GN=RAB3A PE=1 SV=1
sp|P

Extract id and protein name only

grep "^>" *.fasta | sed 's/.*>//g' | awk -F '|' '{print $2, $3;}'

-F '|' : set the field separator to '|'

Extract second and third fields
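
The third field still carries the OS=, GN=, PE= and SV= annotations. If only the accession, the entry name and the plain protein name are wanted, awk can trim the rest. This is a sketch under the assumption that headers follow the UniProt layout seen in the examples above. The function name parse_header is hypothetical, and the test header is copied from the grep output shown earlier.

```bash
#!/bin/bash
# Hypothetical helper: read UniProt-style header lines and print
# accession, entry name and description, separated by tabs.
parse_header() {
  sed 's/^>//' | awk -F '|' '{
    name = $3; sub(/ .*/, "", name)        # entry name: text before the first space
    desc = $3; sub(/^[^ ]* /, "", desc)    # description: text after the first space
    sub(/ OS=.*/, "", desc)                # drop OS=, GN=, PE=, SV= annotations
    print $2 "\t" name "\t" desc
  }'
}

parsed=$(
  printf '%s\n' '>sp|A4D1S5|RAB19_HUMAN Ras-related protein Rab-19 OS=Homo sapiens GN=RAB19 PE=2 SV=2' |
    parse_header
)
echo "$parsed"
```

With real files the input would come from grep, for example: grep -h "^>" *.fasta | parse_header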


Unix-based text utilities (grep, awk, sed..) are very powerful tools

Most biological computing involves text munging:

Run software

Extract what you want

Using that information, run another software…

You will see more uses of them in later lectures..


A script is your 'notebook' for computer-based 'experiments'

The exact sequence of steps in a 'computational experiment'

A record of what you have done

“Reproducible Research”


In the Next Lectures…

Basics of computer programming

Setup basic programming environment

Python and Script Languages..

Handling of Biological Sequences (DNA and Proteins)


Assignments

1. You have this sequence.

MGTRDDEYDYLFKVVLIGDSGVGKSNLLSRFTRNEFNLESKSTIGVEFATRSIQVDGKTIKAQIWDTAGQERYRAITSAYYRGAVGALLVYDIAKHLTYENVERWLKELRDHADSNIVIMLVGNKSDLRHLRAVPTDEARAFAEKNEANVRQTRK

2. Find all homologs (E-value < 1e-20) in human, mouse, Arabidopsis, Xenopus and fission yeast (Schizosaccharomyces pombe) using the results from BLAST (swissprot)


3. Find the UniProt (Swiss-Prot) IDs in the BLAST results and download the sequences

4. Make a multiple sequence alignment with MUSCLE for each organism

human.aln
Mouse.aln
Zebrafish.aln
Xenopus.aln

5. Mark the amino acids in the original sequence that have the highest conservation scores (more than 9)

MGTRDDEYDYLFKVVLIGDSGVGKSNLLSRFTRNEFNLESKSTIGVEFATRSIQVDGKTIKAQIWDTAGQERYRAITSAYYRGAVGALLVYDIAKHLTYENVERWLKELRDHADSNIVIMLVGNKSDLRHLRAVPTDEARAFAEKNEANVRQTRK

Like this.
