Computer Skills for Biology Research, Lecture 3
Uploaded by suk-namgoong
Computational Skill for Modern Biology Research
Department of Biology, Chungbuk National University
3rd Lecture 2015.9.15
Advanced Unix commands & Scripting..
Syllabus
Week: Topic
Week 1: Introduction: Why do we need to learn this stuff?
Week 2: Basics of Unix and running BLAST on your PC
Week 3: Unix Command Prompt II and shell scripts
Week 4: Basics of programming
Week 5: Python Scripting I
Week 6: Python Scripting II
Week 7: Python Scripting III
Week 8: Next Generation Sequencing
Weeks 9-10: Next Generation Sequencing Analysis
Week 11: R and statistical analysis
Week 12: Bioconductor I
Week 13: Bioconductor II
Week 14: Network analysis
Basic UNIX cheatsheet
cp : copy files
cp file1 file2
cp *.fasta ./directory
cp -r directory1 directory2
cd : change directory
cd directory_you_want_go
cd ..
cd ~
cd /from/start/to/end/
mv : move files
rm : remove files and directories
rm file1 file2
rm *
rm -d directory
rm -rf directory
ls : list
ls
ls -l
ls somefile*
mkdir : make directory
mkdir directory
Basic UNIX cheatsheet II
nano : text editor
nano filename
cat : view a file or concatenate multiple files
cat filename
cat filename1 filename2
cat *
> : redirection (save the output of a program to a file)
ls > file
cat filename > filename.txt
| : pipe (connect the output of one program to the input of another)
ls | less
cat filename | grep "search"
Other Tutorials
http://www.purdue.edu/discoverypark/cyber/bioinformatics/assets/pdfs/Unix_for_Biologists_Fall2013.pdf
http://training.bioinformatics.ucdavis.edu/docs/2013/12/AWS/linux-bootcamp.html
This week…
We will learn more advanced UNIX commands..
- Learn how to extract the desired data from a text file (parsing)
- Learn shell scripts (combine several commands into a 'program' you can run) and automate your work
- Perform multiple sequence alignments using MUSCLE
Text Parsing
Extract the desired information from a text file (usually the output file of bioinformatics software)
This is the most common task in biological computing
For example: BLAST output
curl -O http://www.uniprot.org/uniprot/P28289.fasta
cat P28289.fasta
blastp -query P28289.fasta -db swissprot -outfmt 6 -evalue 1e-5 > list.txt
cat list.txt
sp|P28289|TMOD1_HUMAN gi|135922|sp|P28289.1|TMOD1_HUMAN 100.00 359 0 0 1 359 1 359 0.0 728
sp|P28289|TMOD1_HUMAN gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN 97.49 359 9 0 1 359 1 359 0.0 714
sp|P28289|TMOD1_HUMAN gi|342187054|sp|P49813.2|TMOD1_MOUSE 96.94 359 11 0 1 359 1 359 0.0 709
sp|P28289|TMOD1_HUMAN gi|23396880|sp|P70567.1|TMOD1_RAT 96.10 359 14 0 1 359 1 359 0.0 703
sp|P28289|TMOD1_HUMAN gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN 62.32 345 129 1 1 344 3 347 1e-152 441
sp|P28289|TMOD1_HUMAN gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 60.87 345 134 1 1 344 3 347 3e-150 434
sp|P28289|TMOD1_HUMAN gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN 58.77 342 140 1 2 343 3 343 2e-148 429
sp|P28289|TMOD1_HUMAN gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.19 342 142 1 2 343 3 343 2e-148 429
sp|P28289|TMOD1_HUMAN gi|122145549|sp|Q0VC48.1|TMOD4_BOVIN 58.77 342 140 1 2 343 3 343 3e-148 429
sp|P28289|TMOD1_HUMAN gi|23396879|sp|P70566.1|TMOD2_RAT 60.58 345 134 2 1 344 3 346 6e-144 418
sp|P28289|TMOD1_HUMAN gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN 59.77 348 132 3 1 344 3 346 3e-143 416
gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN
GenBank ID: 23396886, Swiss-Prot ID: Q9NZR1.1
If we can extract these IDs from the BLAST output, we can download the nucleotide sequence (GenBank) or the protein sequence (Swiss-Prot)
How can we do that?
Extract a portion of each line
Tab (space) delimited text
Many bioinformatics programs generate tab-delimited text as output: in each line, one field is separated from the next by a <Tab> character
How can we separate each tab-separated block?
cat <textfile> | awk '{print $2}'
cat <textfile> : output the text file
| : send it to the next program
awk '{print $2}' : print the second column only
awk numbers the whitespace-separated fields of each line as $1, $2, $3, $4, …
sp|P28289|TMOD1_HUMAN gi|135922|sp|P28289.1|TMOD1_HUMAN 100.00 359 0 0 1 359 1 359 0.0 728
$1 = sp|P28289|TMOD1_HUMAN
$2 = gi|135922|sp|P28289.1|TMOD1_HUMAN
$3 = 100.00
$4 = 359
Using awk, we can separate the fields very easily…
Print specific fields only…
cat <textfile> | awk '{print $2}'
gi|135922|sp|P28289.1|TMOD1_HUMAN
gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN
gi|342187054|sp|P49813.2|TMOD1_MOUSE
gi|23396880|sp|P70567.1|TMOD1_RAT
gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN
gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE
gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN
gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE
gi|122145549|sp|Q0VC48.1|TMOD4_BOVIN
gi|23396879|sp|P70566.1|TMOD2_RAT
gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN
gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE
gi|74955935|sp|O01479.2|TMOD_CAEEL
gi|160395556|sp|Q6P5Q4.2|LMOD2_HUMAN
cat <textfile> | awk '{print $3}'
100.00
97.49
96.94
96.10
62.32
60.87
58.77
58.19
58.77
60.58
59.77
cat <textfile> | awk '{print $2, $3}'
gi|342187054|sp|P49813.2|TMOD1_MOUSE 100.00
gi|23396880|sp|P70567.1|TMOD1_RAT 98.61
gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN 97.77
gi|135922|sp|P28289.1|TMOD1_HUMAN 96.94
gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN 63.77
gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 62.32
gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.48
gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN 59.06
gi|122145549|sp|Q0VC48.1|TMOD4_BOVIN 59.06
gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE 60.46
gi|23396879|sp|P70566.1|TMOD2_RAT 60.46
gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN 60.06
gi|74955935|sp|O01479.2|TMOD_CAEEL 38.29
gi|160395556|sp|Q6P5Q4.2|LMOD2_HUMAN 51.79
gi|160395552|sp|A1A5Q0.1|LMOD2_RAT 52.98
gi|160395552|sp|A1A5Q0.1|LMOD2_RAT 50.00
gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 51.79
gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 48.81
gi|803374865|sp|E7F7X0.1|LMOD3_DANRE 47.59
gi|803374865|sp|E7F7X0.1|LMOD3_DANRE 43.84
gi|803374865|sp|E7F7X0.1|LMOD3_DANRE 52.73
gi|325511399|sp|P29536.3|LMOD1_HUMAN 46.67
gi|325511399|sp|P29536.3|LMOD1_HUMAN 45.00
gi|81875385|sp|Q8BVA4.1|LMOD1_MOUSE 46.37
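A minimal sketch of field selection on tab-delimited data, assuming two rows copied from the BLAST table in these slides and a scratch file made with mktemp (the function name print_fields is made up for this example). By default awk splits on any run of spaces or tabs; -F'\t' makes the tab delimiter explicit, and OFS sets the separator that print writes between fields.

```shell
#!/bin/bash
# Scratch file holding two tab-delimited rows (first four columns only).
sample=$(mktemp)
printf 'sp|P28289|TMOD1_HUMAN\tgi|135922|sp|P28289.1|TMOD1_HUMAN\t100.00\t359\n' > "$sample"
printf 'sp|P28289|TMOD1_HUMAN\tgi|143587951|sp|A0JNC0.1|TMOD1_BOVIN\t97.49\t359\n' >> "$sample"

# print_fields FILE (hypothetical helper): print the subject ID ($2) and
# percent identity ($3), splitting on tabs and writing a tab between them.
print_fields() {
  awk -F'\t' -v OFS='\t' '{print $2, $3}' "$1"
}

print_fields "$sample"
```

Keeping the output tab-delimited means the result can be passed straight to another awk command in a pipe.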
Search for rows containing 'MOUSE'
cat result | awk '/MOUSE/'
sp|P49813|TMOD1_MOUSE gi|342187054|sp|P49813.2|TMOD1_MOUSE 100.00 359 0 0 1 359 1 359 0.0 729
sp|P49813|TMOD1_MOUSE gi|23396880|sp|P70567.1|TMOD1_RAT 98.61 359 5 0 1 359 1 359 0.0 719
sp|P49813|TMOD1_MOUSE gi|143587951|sp|A0JNC0.1|TMOD1_BOVIN 97.77 359 8 0 1 359 1 359 0.0 716
sp|P49813|TMOD1_MOUSE gi|135922|sp|P28289.1|TMOD1_HUMAN 96.94 359 11 0 1 359 1 359 0.0 709
sp|P49813|TMOD1_MOUSE gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN 63.77 345 124 1 1 344 3 347 2e-156 450
sp|P49813|TMOD1_MOUSE gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 62.32 345 129 1 1 344 3 347 1e-153 443
sp|P49813|TMOD1_MOUSE gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.48 342 141 1 2 343 3 343 4e-148 429
(Every row matches here, because the query name in the first field also contains MOUSE.)
Search for rows containing 'MOUSE' in the second field ($2)
cat result | awk '$2 ~/MOUSE/'
sp|P49813|TMOD1_MOUSE gi|342187054|sp|P49813.2|TMOD1_MOUSE 100.00 359 0 0 1 359 1 359 0.0 729
sp|P49813|TMOD1_MOUSE gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 62.32 345 129 1 1 344 3 347 1e-153 443
sp|P49813|TMOD1_MOUSE gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.48 342 141 1 2 343 3 343 4e-148 429
sp|P49813|TMOD1_MOUSE gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE 60.46 349 128 3 1 344 3 346 9e-145 421
sp|P49813|TMOD1_MOUSE gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 51.79 168 81 0 179 346 202 369 3e-4167
sp|P49813|TMOD1_MOUSE gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 48.81 84 42 1 1 84 4 86 7e-167.0
sp|P49813|TMOD1_MOUSE gi|81875385|sp|Q8BVA4.1|LMOD1_MOUSE 46.37 179 96 0 166 344 296 474 4e-4159
sp|P49813|TMOD1_MOUSE gi|81875385|sp|Q8BVA4.1|LMOD1_MOUSE 46.25 80 40 2 3 82 7 83 4e-055.1
Search for rows containing 'MOUSE' in the second field ($2) and print only the second field
cat result | awk '$2 ~/MOUSE/ {print $2}'
Search for rows containing 'MOUSE' or 'HUMAN' in the second field ($2)
cat result | awk '$2 ~/MOUSE|HUMAN/'
sp|P28289|TMOD1_HUMAN gi|135922|sp|P28289.1|TMOD1_HUMAN 100.00 359 0 0 1 359 1 359 0.0 728
sp|P28289|TMOD1_HUMAN gi|342187054|sp|P49813.2|TMOD1_MOUSE 96.94 359 11 0 1 359 1 359 0.0 709
sp|P28289|TMOD1_HUMAN gi|23396884|sp|Q9NYL9.1|TMOD3_HUMAN 62.32 345 129 1 1 344 3 347 1e-152 441
sp|P28289|TMOD1_HUMAN gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 60.87 345 134 1 1 344 3 347 3e-150 434
sp|P28289|TMOD1_HUMAN gi|23396885|sp|Q9NZQ9.1|TMOD4_HUMAN 58.77 342 140 1 2 343 3 343 2e-148 429
sp|P28289|TMOD1_HUMAN gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.19 342 142 1 2 343 3 343 2e-148 429
sp|P28289|TMOD1_HUMAN gi|23396886|sp|Q9NZR1.1|TMOD2_HUMAN 59.77 348 132 3 1 344 3 346 3e-143 416
sp|P28289|TMOD1_HUMAN gi|146291087|sp|Q9JKK7.2|TMOD2_MOUSE 60.29 345 135 2 1 344 3 346 5e-143 416
sp|P28289|TMOD1_HUMAN gi|160395556|sp|Q6P5Q4.2|LMOD2_HUMAN 49.40 168 85 0 179 346 195 362 4e-46 170
sp|P28289|TMOD1_HUMAN gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 49.40 168 85 0 179 346 202 369 2e-42 159
sp|P28289|TMOD1_HUMAN gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 48.81 84 42 1 1 84 4 86 8e-11 66.6
sp|P28289|TMOD1_HUMAN gi|118572771|sp|Q0VAK6.1|LMOD3_HUMAN 45.28 159 87 0 179 337 237 395 1e-41 157
sp|P28289|TMOD1_HUMAN gi|118572771|sp|Q0VAK6.1|LMOD3_HUMAN 43.04 79 41 3 12 88 17 93 3e-08 58.5
cat result | awk 'NR<=10'
Print the first 10 lines
cat result | awk 'NR==10, NR==20'
Print lines 10 through 20
For more examples, http://www.pement.org/awk/awk1line.txt
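The patterns above can also be combined. The sketch below assumes four rows taken from the BLAST table (first four columns only) and a made-up helper named filter_hits; it keeps rows whose second field matches a name pattern and whose percent identity ($3) is at or above a threshold, passing both values into awk with -v.

```shell
#!/bin/bash
# Scratch copy of a few rows from the BLAST result (columns 1-4 only).
hits=$(mktemp)
cat > "$hits" <<'EOF'
sp|P28289|TMOD1_HUMAN gi|135922|sp|P28289.1|TMOD1_HUMAN 100.00 359
sp|P28289|TMOD1_HUMAN gi|342187054|sp|P49813.2|TMOD1_MOUSE 96.94 359
sp|P28289|TMOD1_HUMAN gi|23396881|sp|Q9JHJ0.1|TMOD3_MOUSE 60.87 345
sp|P28289|TMOD1_HUMAN gi|23396883|sp|Q9JLH8.1|TMOD4_MOUSE 58.19 342
EOF

# filter_hits FILE PATTERN MIN_IDENTITY (hypothetical helper):
# print $2 and $3 for rows where $2 matches PATTERN and $3 >= MIN_IDENTITY.
filter_hits() {
  awk -v pat="$2" -v min="$3" '$2 ~ pat && $3 + 0 >= min + 0 {print $2, $3}' "$1"
}

filter_hits "$hits" MOUSE 60
```

Adding 0 to both sides forces a numeric comparison, so 100.00 is compared as a number and not as text.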
Now we have the second field ($2), for example gi|135922|sp|P28289.1|TMOD1_HUMAN. How do we extract a portion of it?
The data inside this field are also separated, this time by "|". awk can split text on a specific character.
cat result | awk '{split($2,a,"|");print a[4];}'
Split the text inside $2 on "|" and store the pieces separately, like this..
$2 = gi|135922|sp|P28289.1|TMOD1_HUMAN
a[1] = gi, a[2] = 135922, a[3] = sp, a[4] = P28289.1, a[5] = TMOD1_HUMAN
Then print a[4]!
cat result | awk '{split($2,a,"|");print a[4];}'
P49813.2
P70567.1
A0JNC0.1
P28289.1
Q9NYL9.1
Q9JHJ0.1
Q9JLH8.1
Q9NZQ9.1
Q0VC48.1
Q9JKK7.2
P70566.1
Q9NZR1.1
O01479.2
Q6P5Q4.2
A1A5Q0.1
What if we want only the part before the "." (P49813 instead of P49813.2)? How can we do that?
…use the split function again, but in a different way
split(a[4], b, ".")
Then print out b[1]
split(a[4], b, ".");print b[1];
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}'
P49813
P70567
A0JNC0
P28289
Q9NYL9
Q9JHJ0
Q9JLH8
Q9NZQ9
Q0VC48
Q9JKK7
P70566
Q9NZR1
O01479
Q6P5Q4
A1A5Q0
A1A5Q0
Q3UHZ5
Q3UHZ5
E7F7X0
Save these results to a new file (uniprot.txt)
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}' > uniprot.txt
cat uniprot.txt
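The extracted list can contain the same ID more than once (A1A5Q0 and Q3UHZ5 appear twice above), because the tabular BLAST output has one row per alignment and a subject can align in more than one segment. A hedged sketch of one way to keep only the first occurrence, assuming four rows from the table and a made-up helper named unique_accessions:

```shell
#!/bin/bash
# Scratch copy of rows in which LMOD2_RAT appears twice (columns 1-3 only).
hits=$(mktemp)
cat > "$hits" <<'EOF'
sp|P49813|TMOD1_MOUSE gi|160395556|sp|Q6P5Q4.2|LMOD2_HUMAN 51.79
sp|P49813|TMOD1_MOUSE gi|160395552|sp|A1A5Q0.1|LMOD2_RAT 52.98
sp|P49813|TMOD1_MOUSE gi|160395552|sp|A1A5Q0.1|LMOD2_RAT 50.00
sp|P49813|TMOD1_MOUSE gi|123794602|sp|Q3UHZ5.1|LMOD2_MOUSE 51.79
EOF

# unique_accessions FILE (hypothetical helper): same two split() calls as
# in the slides, but print each accession only the first time it is seen.
unique_accessions() {
  awk '{
    split($2, a, "|")      # a[4] is the accession with version, e.g. A1A5Q0.1
    split(a[4], b, ".")    # b[1] is the accession without version
    if (!seen[b[1]]++) print b[1]
  }' "$1"
}

unique_accessions "$hits"
```

The seen array counts how often each accession has appeared, so the order of the first occurrences is preserved.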
cat result
cat result | awk '{print $2}'
cat result | awk '{split($2,a,"|");print a[4]}'
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}'
P28289
A0JNC0
P49813
P70567
Q9NYL9
Q9JHJ0
Q9NZQ9
Q9JLH8
Q0VC48
P70566
Q9NZR1
Q9JKK7
O01479
Q6P5Q4
A1A5Q0
A1A5Q0
Q3UHZ5
Shell Scripts
So far, we have learned many (complicated) commands..
Memorizing all of these commands and typing them again and again is inconvenient
You can save all of these commands in a text file and execute them at once.
1. Open a text editor (like nano)
2. Type the commands you entered previously
3. Save the file under the name you want
#! ?
The first line, starting with #!, is a special line that specifies which program interprets the commands
We are using the 'bash' shell, so it should be #!/bin/bash
If you use a different scripting language (like Python), this line changes.
Permission change
In order to execute the script, you need to give the file execute permission
chmod +x blast
ls -l blast
-rwxr-xr-x 1 suknamgoong staff 146 Sep 14 10:45 blast
x for executable
./blast <filename>
./blast P28289.fasta
<filename> is substituted for $1 in the script (here, P28289.fasta)
Contents of the script 'blast':
#!/bin/bash
blastp -query $1 -db swissprot -evalue 1e-5 -outfmt 6 > result
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}' > uniprot.txt
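To see how $1 is filled in without running BLAST, here is a small sketch. The function describe_query is a made-up stand-in for the body of the 'blast' script: it checks that exactly one argument was given and then echoes the blastp command line that would be run.

```shell
#!/bin/bash
# describe_query FASTA (hypothetical stand-in for the 'blast' script body):
# inside a function, $1 is the first argument, just as in a script file.
describe_query() {
  if [ "$#" -ne 1 ]; then
    echo "usage: ./blast <fasta file>" >&2
    return 1
  fi
  # Echo the command instead of running it, so the sketch needs no BLAST install.
  echo "blastp -query $1 -db swissprot -evalue 1e-5 -outfmt 6"
}

describe_query P28289.fasta
```

Checking the argument count first gives a readable usage message when the file name is forgotten.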
Now we have a script that can..
1. execute blastp using the selected FASTA file..
2. extract the UniProt IDs from the BLAST result..
New function
3. download sequences based on the UniProt IDs stored in the text file..
cat uniprot.txt
P49813
P70567
A0JNC0
P28289
Q9NYL9
Q9JHJ0
Q9JLH8
Q9NZQ9
Q0VC48
Q9JKK7
P70566
Q9NZR1
O01479
Q6P5Q4
A1A5Q0
A1A5Q0
Q3UHZ5
Q3UHZ5
E7F7X0
….
Each line contains one UniProt ID
1. Read the file line by line and get the UniProt ID
2. Download a different UniProt entry depending on the content of each line
LOOP
The most common task for a computer is repeating the same task
#!/bin/bash
while read p; do
  echo $p
done < uniprot.txt
while read p : read uniprot.txt one line at a time and store the line in $p
echo $p : print out $p (the current line)
< uniprot.txt : the file that is read
Open nano
Save the script as 'list'
Change its permission to executable
chmod +x list
Execute
./list
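The loop above can be tried on a small file first. This is a sketch under a few assumptions: the ID file is a scratch file made with mktemp, list_ids is a made-up helper, and IFS= read -r is used (a common, slightly stricter form of read that keeps each line exactly as written). Blank lines are skipped and each ID is numbered.

```shell
#!/bin/bash
# Scratch ID file with one blank line in the middle.
ids=$(mktemp)
printf 'P49813\n\nP70567\nA0JNC0\n' > "$ids"

# list_ids FILE (hypothetical helper): read FILE line by line into $p,
# skip empty lines, and print a running number with each ID.
list_ids() {
  local n=0
  while IFS= read -r p; do
    [ -z "$p" ] && continue
    n=$((n + 1))
    echo "$n: $p"
  done < "$1"
}

list_ids "$ids"
```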
LOOP
Instead of displaying each line, we want to download the file from UniProt…
#!/bin/bash
while read p; do
  echo $p
  curl -O "http://www.uniprot.org/uniprot/"$p".fasta";
done < uniprot.txt
Download the UniProt file for each UniProt ID..
Line in uniprot.txt -> command that is executed
P49813 -> curl -O "http://uniprot.org/uniprot/P49813.fasta"
P70567 -> curl -O "http://uniprot.org/uniprot/P70567.fasta"
A0JNC0 -> curl -O "http://uniprot.org/uniprot/A0JNC0.fasta"
….
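Before downloading anything, it can help to print the URLs the loop would request. The sketch below is a dry run under stated assumptions: the ID file is a scratch file, print_urls is a made-up helper, and the URL pattern is the one shown in the slides (whether that address still serves FASTA files is not checked here).

```shell
#!/bin/bash
# Scratch ID file with three IDs from uniprot.txt.
ids=$(mktemp)
printf 'P49813\nP70567\nA0JNC0\n' > "$ids"

# print_urls FILE (hypothetical helper): build one URL per non-empty line.
# Nothing is downloaded; the URL is only printed.
print_urls() {
  while IFS= read -r p; do
    [ -z "$p" ] && continue
    printf 'http://www.uniprot.org/uniprot/%s.fasta\n' "$p"
  done < "$1"
}

print_urls "$ids"
```

Once the printed URLs look right, replacing the printf line with the curl -O call from the slide performs the actual downloads.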
./list
http://uniprot.org/uniprot/P49813.fasta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
http://uniprot.org/uniprot/P70567.fasta
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
http://uniprot.org/uniprot/A0JNC0.fasta
100   251  100   251    0     0    336      0 --:--:-- --:--:-- --:--:--   336
http://uniprot.org/uniprot/P28289.fasta
100   251  100   251    0     0    345      0 --:--:-- --:--:-- --:--:--   345
http://uniprot.org/uniprot/Q9NYL9.fasta
100   251  100   251    0     0    458      0 --:--:-- --:--:-- --:--:--   458
http://uniprot.org/uniprot/Q9JHJ0.fasta
100   251  100   251    0     0    484      0 --:--:-- --:--:-- --:--:--   484
http://uniprot.org/uniprot/Q9JLH8.fasta
100   251  100   251    0     0    468      0 --:--:-- --:--:-- --:--:--   469
http://uniprot.org/uniprot/Q9NZQ9.fasta
100   251  100   251    0     0    506      0 --:--:-- --:--:-- --:--:--   506
http://uniprot.org/uniprot/Q0VC48.fasta
100   251  100   251    0     0    486      0 --:--:-- --:--:-- --:--:--   486
http://uniprot.org/uniprot/Q9JKK7.fasta
100   251  100   251    0     0    484      0 --:--:-- --:--:-- --:--:--   483
http://uniprot.org/uniprot/P70566.fasta
100   251  100   251    0     0    485      0 --:--:-- --:--:-- --:--:--   486
http://uniprot.org/uniprot/Q9NZR1.fasta
100   251  100   251    0     0    476      0 --:--:-- --:--:-- --:--:--   476
(curl prints the two header lines before every download; they are shown only once here)
Execute the command and list the downloaded FASTA files..
ls
blast               uniprotD4A615.fasta uniprotQ14BP6.fasta uniprotQ5R8C0.fasta uniprotQ8K3Z0.fasta
data                uniprotE7F7X0.fasta uniprotQ19857.fasta uniprotQ66X01.fasta uniprotQ93650.fasta
list                uniprotE9Q5R7.fasta uniprotQ1L994.fasta uniprotQ66X03.fasta uniprotQ96HA7.fasta
result              uniprotO01479.fasta uniprotQ1ZXD6.fasta uniprotQ66X05.fasta uniprotQ9DAM1.fasta
uniprot.txt         uniprotP28289.fasta uniprotQ3UHZ5.fasta uniprotQ66X22.fasta uniprotQ9HC29.fasta
uniprotA0JNC0.fasta uniprotP29536.fasta uniprotQ3V3V9.fasta uniprotQ6E804.fasta uniprotQ9JHJ0.fasta
uniprotA0JPI9.fasta uniprotP34342.fasta uniprotQ4R642.fasta uniprotQ6F5E8.fasta uniprotQ9JKK7.fasta
uniprotA1A5Q0.fasta uniprotP49813.fasta uniprotQ4UNE4.fasta uniprotQ6NZL6.fasta uniprotQ9JLH8.fasta
uniprotA6H639.fasta uniprotP70566.fasta uniprotQ53B87.fasta uniprotQ6P5Q4.fasta uniprotQ9NPH0.fasta
uniprotA8Y3R9.fasta uniprotP70567.fasta uniprotQ53B88.fasta uniprotQ6ZQY2.fasta uniprotQ9NYL9.fasta
uniprotB4SSQ7.fasta uniprotQ0VAA2.fasta uniprotQ54G18.fasta uniprotQ7RTR2.fasta uniprotQ9NZQ9.fasta
uniprotC1F960.fasta uniprotQ0VAK6.fasta uniprotQ5DU56.fasta uniprotQ8BHB0.fasta uniprotQ9NZR1.fasta
uniprotC3VPR6.fasta uniprotQ0VC48.fasta uniprotQ5JU00.fasta uniprotQ8BVA4.fasta uniprotQ9Y239.fasta
Let's combine everything together
Make a pipeline using scripts
Blastp with swissprot -> Extract Uniprot id -> Download Uniprot id
#!/bin/bash
# Run BLAST
blastp -query $1 -db swissprot -evalue 1e-5 -outfmt 6 > result
# Extract the UniProt IDs and save them in uniprot.txt
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}' > uniprot.txt
# Read the UniProt IDs saved in uniprot.txt one by one and download each one
while read p; do
  echo $p
  curl -O "http://www.uniprot.org/uniprot/"$p".fasta";
done < uniprot.txt
# : comment (a description inside the script)
./blastdownload <filename>
Let's add another step to the pipeline!
We now have many FASTA files containing homologs of the original query sequence
Let's compare all of them using a multiple sequence alignment!
Why do we need to learn how to do multiple sequence alignments?
Why do we need to compare multiple sequences?
- A single amino acid sequence
YEKIGKIGEGSYGVVFKCRNRDTGQIVAIKKFLESEDDPVIKKIALREIRMLKQLKHPNLVNLLEVFRRKRRLHLVFEYCDHTVLHELDRYQRGVPEHLVKSITWQTLQAVNFCHKHNCIHRDVKPENILITKHSVIKLCDFGFARLLAGPSDYYTDYVATRWYRSPELLVGDTQYGPPVDVWAIGCVFAELLSGVPLWPGKSDVDQLYLIRKTLGDLIPRHQQVFSTNQYFSGVKIPDPEDMEPLELKFPNISYPALGLLKGCLHMDPTQRLTCEQLLHHPYF
What kind of information can we get from this?
Not much..
Molecular weight, isoelectric point?
But what if we compare two homologous sequences?
There is some homology between the two sequences
There are some gaps here..
….That's it? What about the function of the protein?
But if we align three or more proteins and compare them…
Conserved residues (marked with *)… maybe important for function?
Conserved region
Variable region
Phylogenetic analysis
Secondary structure
Information from MSA
• A multiple sequence alignment is more informative than a two-sequence alignment
• You can find sequence domains from multiple sequence alignments
From an unknown sequence -> find a novel domain -> deduce potential functions
Information from MSA
- Which parts of the sequences are evolutionarily conserved?
The evolutionarily conserved parts of a protein are usually essential parts of that protein
Protein function is determined by protein structure
A significant portion of the evolutionarily conserved regions determine protein structure
Protein structure and MSA
- Protein function is correlated with three-dimensional structure
- Compared with sequence, structure is very hard to change
- The conserved parts of a protein are essential for maintaining protein structure
Structure 2013 21, 1690-1697, DOI: 10.1016/j.str.2013.06.020
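The "*" marks under an alignment can be reproduced with awk. The sketch below is only an illustration, not how MUSCLE itself computes them: it assumes a scratch file with one "name sequence" pair per line, all sequences already aligned to the same length, and a made-up helper named conserved_line. The three sequences are the N-terminal fragments of TMOD4_MOUSE, TMOD1_HUMAN and TMOD2_HUMAN from the MUSCLE output in these slides, with the shared leading gap columns trimmed.

```shell
#!/bin/bash
# Scratch alignment: three aligned fragments of equal length (11 columns).
aln=$(mktemp)
cat > "$aln" <<'EOF'
TMOD4_MOUSE -MSSYQKELEK
TMOD1_HUMAN --MSYRRELEK
TMOD2_HUMAN MALPFQKELEK
EOF

# conserved_line FILE (hypothetical helper): print "*" under every column
# where all sequences carry the same residue (gap columns never count).
conserved_line() {
  awk '
    NR == 1 { ref = $2; len = length(ref); for (i = 1; i <= len; i++) same[i] = 1 }
    NR > 1  { for (i = 1; i <= len; i++) if (substr($2, i, 1) != substr(ref, i, 1)) same[i] = 0 }
    END     {
      line = ""
      for (i = 1; i <= len; i++)
        line = line ((same[i] && substr(ref, i, 1) != "-") ? "*" : " ")
      print line
    }
  ' "$1"
}

conserved_line "$aln"
```

For this fragment only the last four columns (ELEK) are identical in all three sequences, so the stars appear under those columns.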
Many different multiple sequence alignment programs are available..
In this course, we will use 'MUSCLE': http://www.drive5.com/muscle
http://www.ebi.ac.uk/Tools/msa/muscle/
http://www.drive5.com/muscle/downloads.htm
Make a new directory named 'muscle' under your home directory..
Download muscle (if you use Linux, download a different file..)
Uncompress the tar.gz file
Rename the program to 'muscle'
Remove the tar.gz file..
Set up PATH for muscle
Change .bash_profile as follows and save it.
MUSCLE=~/muscle
export MUSCLE
PATH=$PATH:~/ncbi-blast-2.2.31+/bin:$MUSCLE
BLASTDB=~/ncbi-blast-2.2.31+/db
export PATH
export BLASTDB
If you see this, your setup is complete.
Let's do some MSA. First, go back to the directory you were working in before..
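One hedged way to confirm that the PATH change took effect is to ask the shell where it finds each program. The helper check_tool below is made up for this sketch; command -v is a standard shell builtin that prints the location of a command if it can be found on PATH.

```shell
#!/bin/bash
# check_tool NAME (hypothetical helper): report whether NAME is on PATH.
check_tool() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found; check the PATH line in ~/.bash_profile"
    return 1
  fi
}

# '|| true' keeps the script going even when a tool is missing.
check_tool muscle || true
check_tool blastp || true
```

Remember that changes to .bash_profile only apply to new terminal sessions (or after running source ~/.bash_profile).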
Input format for muscle
Amino acid sequences in multi-FASTA format
Header: the line starting with ">"
Sequence: the line(s) after the header
>sp|Q9JHJ0|TMOD3_MOUSE Tropomodulin-3 OS=Mus musculus GN=Tmod3 PE=2 SV=1
MALPFRKDLGDYKDLDEDELLGKLSESELKQLETVLDDLDPENALLPAGFRQKNQTSKSATGPFDRERLLSYLEKQALEHKDRDDYVPYTGEKKGKIFIPKQKPAQTLTEETISLDPELEEALTSASDTELCDLAAILGMHNLIADTPFCDVLGSSNGVNQERFPNVVKGEKILPVFDEPPNPTNVEESLKRIRENDARLVEVNLNNIKNIPIPTLKDFAKTLEANTHVKHFSLAATRSNDPVAVAFADMLKVNKTLKSLNMESNFITGAGVLALIDALRDNETLMELKIDNQRQQLGTSVELEMAKMLEENTNILKFGYQFTQQGPRTRAANAITKNNDLVRKRRIEGDHQ
>sp|Q9JKK7|TMOD2_MOUSE Tropomodulin-2 OS=Mus musculus GN=Tmod2 PE=1 SV=2
MALPFQKGLEKYKNIDEDELLGKLSEEELKQLENVLDDLDPESATLPAGFRQKDQTQKAATGPFDREHLLMYLEKEALEQKDREDFVPFTGEKKGRVFIPKEKPVETRKEEKVTLDPELEEALASASDTELYDLAAVLGVHNLLNNPKFDEETTNGEGRKGPVRNVVKGEKAKPVFEEPPNPTNVEASLQQMKANDPSLQEVNLNNIKNIPIPTLKEFAKSLETNTHVKKFSLAATRSNDPVALAFAEMLKVNKTLKSLNVESNFITGTGILALVEALRENDTLTEIKIDNQRQQLGTAVEMEIAQMLEENSRILKFGYQFTKQGPRTRVAAAITKNNDLVRKKRVEGDRR
>sp|Q9JLH8|TMOD4_MOUSE Tropomodulin-4 OS=Mus musculus GN=Tmod4 PE=2 SV=1
MSSYQKELEKYRDIDEDEILRTLSPEELEQLDCELQEMDPENMLLPAGLRQRDQTKKSPTGPLDRDALLQYLEQQALEVKERDDLVPYTGEKKGKPFIQPKREIPAQEQITLEPELEEALSHATDAEMCDIAAILGMYTLMSNKQYYDAICSGEICNTEGISSVVQPDKYKPVPDEPPNPTNIEEMLKRVRSNDKELEEVNLNNIQDIPIPVLSDLCEAMKTNTYVRSFSLVATKSGDPIANAVADMLRENRSLQSLNIESNFISSTGLMAVLKAVRENATLTELRVDNQRQWPGDAVEMEMATVLEQCPSIVRFGYHFTQQGPRARAAHAMTRNNELRRQQKKR
But we have multiple files, each containing a single FASTA sequence..
How do we combine them into a single FASTA file?
Use the cat command and a wildcard (*)
cat *.fasta > merged.fasta
cat merged.fasta
>sp|A0JNC0|TMOD1_BOVIN Tropomodulin-1 OS=Bos taurus GN=TMOD1 PE=2 SV=1
MSYRRELEKYRDLDEDEILGGLTEEELRTLENELDELDPDNALLPAGLRQKDQTTKAPTGPFRREELLDHLEKQAKEFKDREDLVPYTGEKRGKVWVPKQKPMDPVLESVTLEPELEEALANASDAELCDIAAILGMHTLMSNQQYYQALGSSSIVNKEGLNSVIKPTQYKPVPDEEPNATDVEETLERIKNNDPKLEEVNLNNIRNIPIPTLKAYAEALKENSYVKKFSIVGTRSNDPVAFALAEMLKVNKVLKTLNVESNFISGAGILRLVEALPYNTSLVELKIDNQSQPLGNKVEMEIVSMLEKNATLLKFGYHFTQQGPRLRASNAMMNNNDLVRKRRLADLTGPIIPKCRSGV
>sp|A0JPI9|LR74A_RAT Leucine-rich repeat-containing protein 74A OS=Rattus norvegicus GN=Lrrc74a PE=2 SV=1
MDDDDIEPLEYETKDETEAALAPQSSEDTLYCEAEAAPSVEKEKPTREDSETDLEIEDTEKFFSIGQKELYLEACKLVGVVPVSYFIRNMEESCMNLNHHGLGPMGIKAIAITLVSNTTVLKLELEDNSIQEEGILSLMEMLHENYYLQELNVSDNNLGLEGARIISDFLQENNSSLWKLKLSGNKFKEECALLLCQALSSNYRIRSLNLSHNEFSDTAGEYLGQMLALNVGLQSLNLSWNHFNVRGAVALCNGLRTNVTLKKLDVSMNGFGNDGALALGDTLKLNSCLVYVDVSRNGITNEGASRISKGLENNECLQVLKLFLNPVSLEGAYSLILAIKRNPKSRMEDLDISNVLVSEQFVKVLDGVCAIHPQLDVVYKGLQGLSTKKTVSLETNPIKLIQNYTDQNKISVVEFFKSLNPSGLMTMPVGDFRKAIIQQTNIPINRYQARELIKKLEEKNGMVNFSGFKSLKVTAAGQL
All of the FASTA files were saved in merged.fasta
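A quick way to check a merge is to count the header lines, since every FASTA record starts with ">". The sketch below uses two toy records with made-up IDs in a scratch directory, and two made-up helpers, merge_fasta and count_records. It writes the merged file with a .fa extension: if the merged file were named merged.fasta in the same directory, a second run of cat *.fasta would match the merged file as well.

```shell
#!/bin/bash
# Scratch directory with two toy single-record FASTA files (made-up IDs).
workdir=$(mktemp -d)
printf '>sp|X00001|TOY1_TEST\nMSYRRELEK\n' > "$workdir/a.fasta"
printf '>sp|X00002|TOY2_TEST\nMALPFQKELEK\n' > "$workdir/b.fasta"

# merge_fasta DIR OUTFILE (hypothetical helper): concatenate DIR/*.fasta.
merge_fasta() {
  cat "$1"/*.fasta > "$2"
}

# count_records FILE (hypothetical helper): count lines starting with ">".
count_records() {
  grep -c '^>' "$1"
}

merge_fasta "$workdir" "$workdir/merged.fa"
count_records "$workdir/merged.fa"
```

The count should equal the number of input files when each file holds exactly one sequence.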
Run muscle
muscle -in <inputfile> -out <outfilename> -clw
cat result.aln
MUSCLE (3.8) multiple sequence alignment

sp|Q53B88|NOD2_HYLLA GCWDPHSLHPARDLQSHRPAIVR--RLHSHVEGVLDLAWERGFVSQYECDEIRLPIFTPS
sp|Q7RTR2|NLRC3_HUMAN PDAPLGPCSNDSRIQRHRKALLSKVGGGPELGGPWHRLASLLLVEGLTDLQLREHDFTQV
sp|O01479|TMOD_CAEEL APSANSQQGTQLPSKVYNKGLKD-------------------------------------
sp|E7F7X0|LMOD3_DANRE -------MSERTEQESYTDKIDE-------------------------------------
sp|Q0VAK6|LMOD3_HUMAN --------MSEHSRNSDQEELLD-------------------------------------
sp|Q9JLH8|TMOD4_MOUSE -------------MSSYQKELEK-------------------------------------
sp|Q0VC48|TMOD4_BOVIN -------------MSSYQKELEK-------------------------------------
sp|Q9NZQ9|TMOD4_HUMAN -------------MSSYQKELEK-------------------------------------
sp|P28289|TMOD1_HUMAN --------------MSYRRELEK-------------------------------------
sp|A0JNC0|TMOD1_BOVIN --------------MSYRRELEK-------------------------------------
sp|P49813|TMOD1_MOUSE --------------MSYRRELEK-------------------------------------
sp|P70567|TMOD1_RAT --------------MSYRRELEK-------------------------------------
sp|Q9NZR1|TMOD2_HUMAN ------------MALPFQKELEK-------------------------------------
sp|P70566|TMOD2_RAT ------------MALPFQKGLEK-------------------------------------
sp|Q9JKK7|TMOD2_MOUSE ------------MALPFQKGLEK-------------------------------------
sp|Q9JHJ0|TMOD3_MOUSE ------------MALPFRKDLGD-------------------------------------
sp|Q9NYL9|TMOD3_HUMAN ------------MALPFRKDLEK-------------------------------------
sp|A1A5Q0|LMOD2_RAT -----------MSTFGYRRGLSK-------------------------------------
sp|Q3UHZ5|LMOD2_MOUSE -----------MSTFGYRRGLSK-------------------------------------
sp|Q6P5Q4|LMOD2_HUMAN -----------MSTFGYRRGLSK-------------------------------------
sp|P29536|LMOD1_HUMAN ----------MSRVAKYRRQVS--------------------------------------
sp|Q8BVA4|LMOD1_MOUSE ----------MSKVAKYRRQVS--------------------------------------
:

sp|Q53B88|NOD2_HYLLA QRARRLLDLATVKANGLAAFLLQHVQELPVPLALPLEAATCRKYMAKLRTTVSAQSRFLS
sp|Q7RTR2|NLRC3_HUMAN EATRGGGHPARTVALDRLFLPLSRVSVPPRVSITIGVAGMGKTTLVRHFVRLWAHGQVGK
sp|O01479|TMOD_CAEEL ------------------------------------------------------------
sp|E7F7X0|LMOD3_DANRE ------------------DEILAGLSAEELKQLQSEMDDIAPDERVPVGLRQKDASHEMT
sp|Q0VAK6|LMOD3_HUMAN ------------------------------------------------------------
sp|Q9JLH8|TMOD4_MOUSE ------------------------------------------------------------
sp|Q0VC48|TMOD4_BOVIN ------------------------------------------------------------
sp|Q9NZQ9|TMOD4_HUMAN ------------------------------------------------------------
sp|P28289|TMOD1_HUMAN ------------------------------------------------------------
sp|A0JNC0|TMOD1_BOVIN ------------------------------------------------------------
sp|P49813|TMOD1_MOUSE ------------------------------------------------------------
sp|P70567|TMOD1_RAT ------------------------------------------------------------
sp|Q9NZR1|TMOD2_HUMAN ------------------------------------------------------------
sp|P70566|TMOD2_RAT ------------------------------------------------------------
sp|Q9JKK7|TMOD2_MOUSE ------------------------------------------------------------
sp|Q9JHJ0|TMOD3_MOUSE ------------------------------------------------------------
sp|Q9NYL9|TMOD3_HUMAN ------------------------------------------------------------
sp|A1A5Q0|LMOD2_RAT ------------------------------------------------------------
sp|Q3UHZ5|LMOD2_MOUSE ------------------------------------------------------------
sp|Q6P5Q4|LMOD2_HUMAN ------------------------------------------------------------
sp|P29536|LMOD1_HUMAN ------------------------------------------------------------
sp|Q8BVA4|LMOD1_MOUSE ------------------------------------------------------------
Inspect MSA
http://www.jalview.org
I cloned an RT-PCR product, but its sequence differs from the reference sequence at one position. Should I use it?
BLAST -> extract UniProt IDs -> download from UniProt -> MUSCLE
100% conserved
A Y->C change is a very significant change
It may well affect protein function.
#!/bin/bash
# Run BLAST on the query file given as the first argument
blastp -query "$1" -db swissprot -evalue 1e-5 -outfmt 6 > result
# Extract the UniProt IDs and save them in uniprot.txt
cat result | awk '{split($2,a,"|");split(a[4],b,".");print b[1];}' > uniprot.txt
# Read the UniProt IDs saved in uniprot.txt one by one and download each entry
while read p; do
  echo "$p"
  curl -O "http://www.uniprot.org/uniprot/$p.fasta"
done <uniprot.txt
# Merge the FASTA files into merged.fasta
cat *.fasta > merged.fasta
# Run MUSCLE
muscle -in merged.fasta -out align.aln -clw
Combined workflow
blastp against swissprot
Extract the UniProt IDs
Download the sequence for each UniProt ID
MSA with MUSCLE
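The ID-extraction step can be tried on its own, without running BLAST. The sketch below feeds two made-up tabular lines through the same awk command. It assumes subject IDs of the form gi|number|sp|ACCESSION.version|NAME; the accessions come from the alignment above, while the gi numbers, version suffixes and score columns are invented placeholders.

```shell
#!/bin/bash
# Two hypothetical lines in -outfmt 6 layout (tab-separated, 12 columns).
# The gi numbers and score columns are placeholders, not real BLAST results.
sample=$(printf 'query1\tgi|111|sp|P28289.1|TMOD1_HUMAN\t100.00\t359\t0\t0\t1\t359\t1\t359\t0.0\t700\nquery1\tgi|222|sp|Q9NZR1.1|TMOD2_HUMAN\t60.00\t350\t140\t0\t1\t350\t1\t350\t1e-150\t430\n')

# Column 2 is the subject ID. Split it on "|" and take the 4th piece
# (accession plus version), then split on "." to drop the version.
ids=$(printf '%s\n' "$sample" | awk '{split($2,a,"|"); split(a[4],b,"."); print b[1]}')

echo "$ids"
```

If the subject IDs in your own result file have a different layout, the index 4 in a[4] would need to change accordingly.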
Other examples
Questions
I have a protein sequence, and I am looking for protein structures homologous to my protein. Can I search for them and, if such structures exist, download them?
1. Search the protein structure DB
2. Get the IDs of the matching protein structures
3. Download the protein structures using those IDs.
How can we search the protein structure database?
Among the NCBI BLAST databases there is a protein database called pdbaa (all amino acid sequences deposited in the protein structure database, PDB)
http://ftp.ncbi.nlm.nih.gov/FASTA/pdbaa.gz
cd ~/ncbi-blast-2.2.31+/db
curl -O ftp://[email protected]/blast/db/FASTA/pdbaa.gz
gunzip pdbaa.gz
makeblastdb -in pdbaa -dbtype prot
Download the file and build a BLAST db from it
Let's run a BLAST search
Make a working directory and download the query sequence (your favorite protein sequence)
Run BLAST against the newly built BLAST db (pdbaa)
The PDB IDs are here (2J1D, 2Z6E, 3OBV…). How can you extract them?
awk '{print $2}'
awk '{split($2,a,"|"); print a[4]}'
blastp -query Q9NZ56.fasta -db pdbaa -evalue 1e-10 -outfmt 6 | awk '{split($2,a,"|");print a[4];}'>pdblist
cat pdblist
2J1D
2Z6E
3OBV
3O4X
1V9D
4EAH
2YLE
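One PDB entry often holds several chains, so the same ID can appear more than once among the hits. A minimal sketch, assuming subject IDs of the form gi|number|pdb|ID|chain (the gi numbers and chain letters here are placeholders): adding sort -u to the pipeline keeps each ID once, which avoids downloading the same file twice.

```shell
#!/bin/bash
# Hypothetical first two columns for three hits; two belong to the same entry.
hits=$(printf 'q\tgi|1|pdb|2J1D|G\nq\tgi|2|pdb|2Z6E|A\nq\tgi|3|pdb|2Z6E|B\n')

# Same awk extraction as in the blastp pipeline, followed by sort -u
# to drop repeated IDs.
pdb_ids=$(printf '%s\n' "$hits" | awk '{split($2,a,"|"); print a[4]}' | sort -u)

echo "$pdb_ids"
```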
Using this list, let's download the PDB files
Of course, you can download them one by one from the website. But there is a better way..
You can download a PDB file from this address
http://www.rcsb.org/pdb/files/PDB_ID.pdb.gz
http://www.rcsb.org/pdb/files/2J1D.pdb.gz
curl -O http://www.rcsb.org/pdb/files/2J1D.pdb.gz
gunzip 2J1D.pdb.gz
#!/bin/bash
while read p; do
  curl -O "http://www.rcsb.org/pdb/files/$p.pdb.gz"
  gunzip "$p.pdb.gz"
done <pdblist
Modify the previous script
1. Read the PDB IDs stored in pdblist one by one
2. Download each one with curl from http://www.rcsb.org/pdb/files/pdbid.pdb.gz
3. Uncompress the .gz file with gunzip
Save it as download
chmod +x download
Execute scripts
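Before running a download loop for real, it can help to do a dry run that only prints what would be fetched. The sketch below assumes a pdblist file with one ID per line (written here to a temporary directory so nothing else is touched) and builds the URL for each ID without calling curl.

```shell
#!/bin/bash
# Work in a throwaway directory so the current folder is left alone.
workdir=$(mktemp -d)
printf '2J1D\n2Z6E\n3OBV\n' > "$workdir/pdblist"

# Read one ID per line; IFS= and -r keep each line exactly as written.
count=0
while IFS= read -r p; do
    url="http://www.rcsb.org/pdb/files/${p}.pdb.gz"
    echo "would download: $url"
    count=$((count + 1))
done < "$workdir/pdblist"

echo "$count files in total"
rm -r "$workdir"
```

Once the printed URLs look right, the echo line can be swapped for the curl and gunzip commands.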
You can display protein molecules using PyMol (http://pymol.org) or Cuemol2(http://www.cuemol.org/ja/index.php?cuemol2)
Display Proteins
You have many files and want to do the same thing to each of them..
#!/bin/bash
for f in *.fasta
do
  echo "processing $f..."
  blastp -query $f -db pdbaa -out $f".blast" -outfmt 6
done
Save the output file as "filename" + ".blast"
You have lots of FASTA files.
You want to run BLAST on each of them.
How can you do that without typing the command repeatedly?
Every file name ends in .fasta
echo "processing A0JNC0.fasta..."
blastp -query A0JNC0.fasta -db pdbaa -out A0JNC0.fasta.blast -outfmt 6
echo "processing A1A5Q0.fasta..."
blastp -query A1A5Q0.fasta -db pdbaa -out A1A5Q0.fasta.blast -outfmt 6
… and so on for every file that matches *.fasta
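To see how the loop expands, a dry run that prints each command instead of running blastp is enough. This sketch creates two empty files, named after the examples above, in a temporary directory. The ${f%.fasta} form, which strips the extension so that the output is named A0JNC0.blast rather than A0JNC0.fasta.blast, is an optional variation and not part of the original script.

```shell
#!/bin/bash
workdir=$(mktemp -d)
touch "$workdir/A0JNC0.fasta" "$workdir/A1A5Q0.fasta"
cd "$workdir" || exit 1

commands=""
for f in *.fasta; do
    base=${f%.fasta}     # A0JNC0.fasta -> A0JNC0
    cmd="blastp -query $f -db pdbaa -out $base.blast -outfmt 6"
    echo "$cmd"          # print only; nothing is executed
    commands="$commands$cmd;"
done

cd "$OLDPWD" || exit 1
rm -r "$workdir"
```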
By combining grep and awk you can do many things…
ls *.pdb
1IO0.pdb 1PGV.pdb 2J1D.pdb 4PKG.pdb 4PKI.pdb
Let's assume you have downloaded some PDB files..
What is the title of each PDB entry?
grep "^TITLE" *.pdb | awk '{if (substr($0,19,1)==" ") {print substr($0,1,4) "\t" substr($0,20)} else {print "\t" substr($0,21)}}'
1IO0	CRYSTAL STRUCTURE OF TROPOMODULIN C-TERMINAL HALF
1PGV	STRUCTURAL GENOMICS OF CAENORHABDITIS ELEGANS: TROPOMODULIN
	C-TERMINAL DOMAIN
2J1D	CRYSTALLIZATION OF HDAAM1 C-TERMINAL FRAGMENT
4PKG	COMPLEX OF ATP-ACTIN WITH THE N-TERMINAL ACTIN-BINDING DOMAIN OF
	TROPOMODULIN
4PKI	COMPLEX OF ATP-ACTIN WITH THE C-TERMINAL ACTIN-BINDING DOMAIN OF
	TROPOMODULIN
What is the resolution of each protein structure?
grep "^REMARK   2 RESOLUTION." *.pdb | awk '{print substr($1,1,4) "\t" $4}'
1IO0 1.45
1PGV 1.80
2J1D 2.55
4PKG 1.80
4PKI 2.30
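The resolution lookup can be checked without downloading anything. The sketch below writes two one-line stand-ins for real PDB files, using the resolution values listed above, and assumes the fixed-column PDB layout in which REMARK is followed by three spaces and the remark number 2.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Minimal stand-ins for real PDB files: only the REMARK 2 line is included.
printf 'REMARK   2 RESOLUTION.    1.45 ANGSTROMS.\n' > "$workdir/1IO0.pdb"
printf 'REMARK   2 RESOLUTION.    1.80 ANGSTROMS.\n' > "$workdir/1PGV.pdb"
cd "$workdir" || exit 1

# With several files, grep prefixes each match with "filename:", so the first
# whitespace-separated field starts with the 4-character PDB ID and the
# fourth field is the resolution value.
resolutions=$(grep "^REMARK   2 RESOLUTION." *.pdb | awk '{print substr($1,1,4) "\t" $4}')
echo "$resolutions"

cd "$OLDPWD" || exit 1
rm -r "$workdir"
```

Note that substr starts counting at 1; a start position of 0 gives different results in different awk implementations.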
More information:
https://madscientist.wordpress.com/2012/04/22/텍스트-툴로-pdb-뒤비기/
Let's assume you have many FASTA files…
We want to check the title of each protein…
cat Q9UNT1.fasta
>sp|Q9UNT1|RBL2B_HUMAN Rab-like protein 2B OS=Homo sapiens GN=RABL2B PE=2 SV=1
MAEDKTKPSELDQGKYDADDNVKIICLGDSAVGKSKLMERFLMDGFQPQQLSTYALTLYKHTATVDGRTILVDFWDTAGQERFQSMHASYYHKAHACIMVFDVQRKVTYRNLSTWYTELREFRPEIPCIVVANKIDDINVTQKSFNFAKKFSLPLYFVSAADGTNVVKLFND
grep "^>" *.fasta
A4D1S5.fasta:>sp|A4D1S5|RAB19_HUMAN Ras-related protein Rab-19 OS=Homo sapiens GN=RAB19 PE=2 SV=2
O00194.fasta:>sp|O00194|RB27B_HUMAN Ras-related protein Rab-27B OS=Homo sapiens GN=RAB27B PE=1 SV=4
O14966.fasta:>sp|O14966|RAB7L_HUMAN Ras-related protein Rab-7L1 OS=Homo sapiens GN=RAB29 PE=1 SV=1
O95716.fasta:>sp|O95716|RAB3D_HUMAN Ras-related protein Rab-3D OS=Homo sapiens GN=RAB3D PE=1 SV=1
O95755.fasta:>sp|O95755|RAB36_HUMAN Ras-related protein Rab-36 OS=Homo sapiens GN=RAB36 PE=2 SV=2
Search all FASTA files and show the lines that start with ">"
grep "^>" *.fasta | sed 's/^.*>//g'
A4D1S5.fasta:>sp|A4D1S5|RAB19_HUMAN Ras-related protein Rab-19 OS=Homo sapiens GN=RAB19 PE=2 SV=2
O00194.fasta:>sp|O00194|RB27B_HUMAN Ras-related protein Rab-27B OS=Homo sapiens GN=RAB27B PE=1 SV=4
O14966.fasta:>sp|O14966|RAB7L_HUMAN Ras-related protein Rab-7L1 OS=Homo sapiens GN=RAB29 PE=1 SV=1
O95716.fasta:>sp|O95716|RAB3D_HUMAN Ras-related protein Rab-3D OS=Homo sapiens GN=RAB3D PE=1 SV=1
O95755.fasta:>sp|O95755|RAB36_HUMAN Ras-related protein Rab-36 OS=Homo sapiens GN=RAB36 PE=2 SV=2
Remove these parts (the file name and the ">" in front of each header, e.g. "A4D1S5.fasta:>")
sp|A4D1S5|RAB19_HUMAN Ras-related protein Rab-19 OS=Homo sapiens GN=RAB19 PE=2 SV=2
sp|O00194|RB27B_HUMAN Ras-related protein Rab-27B OS=Homo sapiens GN=RAB27B PE=1 SV=4
sp|O14966|RAB7L_HUMAN Ras-related protein Rab-7L1 OS=Homo sapiens GN=RAB29 PE=1 SV=1
sp|O95716|RAB3D_HUMAN Ras-related protein Rab-3D OS=Homo sapiens GN=RAB3D PE=1 SV=1
sp|O95755|RAB36_HUMAN Ras-related protein Rab-36 OS=Homo sapiens GN=RAB36 PE=2 SV=2
sp|P0C0E4|RB40L_HUMAN Ras-related protein Rab-40A-like OS=Homo sapiens GN=RAB40AL PE=1 SV=1
sp|P11234|RALB_HUMAN Ras-related protein Ral-B OS=Homo sapiens GN=RALB PE=1 SV=1
sp|P20336|RAB3A_HUMAN Ras-related protein Rab-3A OS=Homo sapiens GN=RAB3A PE=1 SV=1
sp|P
Extract id and protein name only
grep "^>" *.fasta | sed 's/.*>//g' | awk -F '|' '{print $2, $3;}'
-F '|' : set the field separator to '|'
Extract second and third fields
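The grep, sed and awk chain can be tried on two small files. In the sketch below the header lines are copied from the examples above, while the sequence line is only a placeholder. The sub() call that trims the " OS=..." annotation from the third field is an optional addition that is not in the original command.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Header lines from the examples above; "XXXXXXXXXX" is a placeholder sequence.
printf '>sp|A4D1S5|RAB19_HUMAN Ras-related protein Rab-19 OS=Homo sapiens GN=RAB19 PE=2 SV=2\nXXXXXXXXXX\n' > "$workdir/A4D1S5.fasta"
printf '>sp|O00194|RB27B_HUMAN Ras-related protein Rab-27B OS=Homo sapiens GN=RAB27B PE=1 SV=4\nXXXXXXXXXX\n' > "$workdir/O00194.fasta"
cd "$workdir" || exit 1

# grep finds the header lines, sed drops everything up to and including ">",
# and awk splits on "|" and removes the " OS=..." part of the third field.
names=$(grep "^>" *.fasta | sed 's/.*>//' | awk -F '|' '{sub(/ OS=.*/, "", $3); print $2 "\t" $3}')
echo "$names"

cd "$OLDPWD" || exit 1
rm -r "$workdir"
```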
You will see more uses of them in later lectures..
Unix text utilities (grep, awk, sed..) are very powerful tools
Run software
Extract what you want
Using this information, run another piece of software…
Most biological computing involves text munging
A script is your 'notebook' for computer-based 'experiments'
The exact sequence of steps in a 'computational experiment'
A record of what you have done
“Reproducible Research”
In the Next Lectures…
Basics of computer programming
Setup basic programming environment
Python and Script Languages..
Handling of Biological Sequences (DNA and Proteins)
Assignments
1. You have this sequence.
MGTRDDEYDYLFKVVLIGDSGVGKSNLLSRFTRNEFNLESKSTIGVEFATRSIQVDGKTIKAQIWDTAGQERYRAITSAYYRGAVGALLVYDIAKHLTYENVERWLKELRDHADSNIVIMLVGNKSDLRHLRAVPTDEARAFAEKNEANVRQTRK
2. Find all homologs (E-value < 1e-20) in human, mouse, Arabidopsis, Xenopus and fission yeast (Schizosaccharomyces pombe) using the BLAST results (swissprot)
3. Extract the UniProt (Swiss-Prot) IDs from the BLAST results and download the sequences
4. Make a multiple sequence alignment with MUSCLE for each organism
human.aln
Mouse.aln
Zebrafish.aln
Xenopus.aln
5. Mark the amino acids in the original sequence that have the highest conservation score (more than 9)
MGTRDDEYDYLFKVVLIGDSGVGKSNLLSRFTRNEFNLESKSTIGVEFATRSIQVDGKTIKAQIWDTAGQERYRAITSAYYRGAVGALLVYDIAKHLTYENVERWLKELRDHADSNIVIMLVGNKSDLRHLRAVPTDEARAFAEKNEANVRQTRK
Like this.
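For step 2, one possible starting point is to filter the tabular BLAST output with awk. The sketch assumes -outfmt 6, where column 2 is the subject ID and column 11 is the E-value, and that Swiss-Prot entry names end in a species code such as _HUMAN or _MOUSE. The three input lines, including the FAKE accessions, are invented placeholders and not real hits.

```shell
#!/bin/bash
# Three hypothetical hits: a strong human hit, a strong mouse hit,
# and a weak human hit that should be filtered out.
hits=$(printf 'q\tgi|1|sp|FAKE01.1|AAA_HUMAN\t90\t150\t0\t0\t1\t150\t1\t150\t1e-80\t250\nq\tgi|2|sp|FAKE02.1|BBB_MOUSE\t85\t150\t0\t0\t1\t150\t1\t150\t1e-75\t240\nq\tgi|3|sp|FAKE03.1|CCC_HUMAN\t30\t150\t0\t0\t1\t150\t1\t150\t1e-05\t50\n')

# Keep rows whose E-value is below 1e-20 and whose entry name ends in _HUMAN,
# then print the accession without its version suffix.
human=$(printf '%s\n' "$hits" | awk '($11 + 0) < 1e-20 && $2 ~ /_HUMAN$/ {split($2,a,"|"); split(a[4],b,"."); print b[1]}')
echo "$human"
```

Changing _HUMAN to another species code gives the list for that organism.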