DNA Sequence Analysis and Fragment Assembly System (FAS), Lecture 92-06
The Organization of a Eukaryotic Gene
[Figure: GENE (enhancer, promoter, exons 1-4 separated by introns) -> Transcription -> mRNA transcript (5'-untranslated region, exons 1-4 with introns, poly(A) signal, 3'-untranslated region) -> Processing -> mature mRNA (5' 7-mG cap, start, exons 1-4 joined, stop, 3' (AAAAAA)n).]
Gene identification involves 4 main stages
(1) Find non-coding features of interest in the sequence
    CpG islands
    Tandem and dispersed repeats
    Promoter regions (TATA box, cap signal, CCAAT-box)
    Transcription factors, Poly-A sites
(2) Find the putative coding region(s) in the sequence
    Open reading frame
(3) Determine the exon-intron organization
    Branch point signal: CT(G,A)A(C,T)
    5' and 3' splice sites: AG/GUAAGU--------------PyPyPyPyPyPyPyPy-CAG/G
(4) Identify the gene
    Motif, signal and pattern
    BLAST, FASTA
    Functional studies
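The branch point signal CT(G,A)A(C,T) can be read as a small pattern: C, T, then G or A, then A, then C or T. The sketch below is a minimal illustration of scanning a DNA string for that consensus with a regular expression. The function name `find_branch_points` is made up for this example, and a real gene finder weighs such signals statistically instead of relying on exact matches.

```python
import re

# Branch point consensus from the slide, CT(G,A)A(C,T), as a regex.
# The lookahead (?=...) lets overlapping matches be reported too.
BRANCH_POINT = re.compile(r"(?=CT[GA]A[CT])")

def find_branch_points(dna):
    """Return 1-based start positions of every branch point consensus match."""
    return [m.start() + 1 for m in BRANCH_POINT.finditer(dna.upper())]

print(find_branch_points("ggCTGACttCTAATcc"))
```

A match only marks a candidate site; the consensus is short enough to occur by chance many times in a long sequence.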
GENE FINDERS
Banbury Cross http://igs-server.cnrs-mrs.fr/igs/banbury
FGENEH http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID http://www1.imim.es/geneid.html
GeneMachine http://genome.nhgri.nih.gov/genemachine
GeneParser http://beagle.colorado.edu/_eesnyder/GeneParser.htl
GENSCAN http://genes.mit.edu/GENSCAN.html
Genotator http://www.fruitfly.org/_nomi/genotator/
GRAIL http://compbio.ornl.gov/tools/index.shtml
GRAIL-EXP http://compbio.ornl.gov/grailexp/
HMMgene http://www.cbs.dtu.dk/services/HMMgene/
MZEF http://www.cshl.org/genefinder
PROCRUSTES http://www-hto.usc.edu/software/procrustes
RepeatMasker http://ftp.genome.washington.edu/RM/RepeatMasker.html
Sputnik http://rast.abajian.com/sputnik/
Function                          Command          GCG  SeqWeb
Sequence manipulation             Reverse          +    +
ORF searching                     Frames           +    +
                                  Map              +    +
                                  Translate        +    +
Mapping (restriction sites)       Map              +    +
                                    (-minc)        +    +
                                    (-maxc)        +    +
                                  Mapsort          +    +
                                    (-exclude)     +    -
                                    (-digest)      +    -
                                  Mapplot          +    +
Mapping (transcription factors)   Map tfsites      +    -
What to do next? The predictions made by these programs are just that: predictions.
NEVER TRUST A COMPUTER!
Programs used in this exercise:
(1) Sequence manipulation: reverse
(3) ORF searching: frames, map, translate
(4) Mapping (restriction sites): map (-minc, -maxc), mapsort (-exclude, -digest), mapplot, plasmidmap
(5) Mapping (transcription factors): map (tfsites)
Sequences used in this exercise:
gb:z18853 (C. elegans mRNA for capping protein alpha subunit), cds: 10-858
gb:x03795 (Human mRNA for platelet-derived growth factor A-chain, PDGF-A), cds: 388-1020
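To make the ORF searching step concrete, here is a minimal sketch of a six-frame open reading frame scan. It is not the GCG Frames or Map program: the function names are invented for this example, an ORF is assumed to run from the first ATG in a frame to the next in-frame stop codon, and coordinates on the minus strand are given relative to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse-complement an unambiguous DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def orfs_in_strand(dna, min_codons=2):
    """Yield (start, end), 1-based inclusive, for ATG...stop ORFs in 3 frames."""
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield (start + 1, i + 3)   # stop codon included
                start = None

def find_orfs(dna, min_codons=2):
    """Scan both strands; minus-strand coordinates refer to the reverse complement."""
    forward = [("+", s, e) for s, e in orfs_in_strand(dna, min_codons)]
    reverse = [("-", s, e)
               for s, e in orfs_in_strand(reverse_complement(dna), min_codons)]
    return forward + reverse

print(find_orfs("CCATGAAATAGCC"))
```

For the exercise sequences, an annotated range such as cds: 10-858 is the kind of interval this scan would be expected to report on the plus strand, with the thresholds adjusted (for example a larger `min_codons`).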
Exercise 92-06-1
Fragment Assembly System (FAS)
Please download Ex92-06.exe
Exercise92-06-2.doc (class exercise)
Gelassemble commands.doc & SeqED commands.doc (command reference)
Seq01.txt - seq10.txt (sequences for the exercise)
Fragment Assembly System (FAS)
(1) Store fragment sequences;
(2) Recognize overlapping sequences and create aligned assemblies, called contigs;
(3) Display, edit and output the contigs for further analysis.
Fragment Assembly System (FAS)
Assemble overlapping fragment sequences from a sequencing project.
[Figure: numbered fragments aligned into Contig 1 and Contig 2, with a consensus.]
A contig may not contain more than 1,650 fragments and may not be longer than 200,000 bases. No single fragment may be longer than 2,500 bases.
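These limits are easy to check before loading data into a project. The sketch below is only an illustration: the constants come from the slide, while the function `check_contig_limits` and its arguments are invented for this example and are not part of FAS.

```python
# Limits quoted on the slide.
MAX_FRAGMENTS_PER_CONTIG = 1650
MAX_CONTIG_LENGTH = 200_000
MAX_FRAGMENT_LENGTH = 2_500

def check_contig_limits(fragment_lengths, contig_length):
    """Return a list of limit violations; an empty list means all limits hold."""
    problems = []
    if len(fragment_lengths) > MAX_FRAGMENTS_PER_CONTIG:
        problems.append(
            f"{len(fragment_lengths)} fragments, limit {MAX_FRAGMENTS_PER_CONTIG}")
    if contig_length > MAX_CONTIG_LENGTH:
        problems.append(
            f"contig is {contig_length} bases, limit {MAX_CONTIG_LENGTH}")
    for index, length in enumerate(fragment_lengths, start=1):
        if length > MAX_FRAGMENT_LENGTH:
            problems.append(
                f"fragment {index} is {length} bases, limit {MAX_FRAGMENT_LENGTH}")
    return problems

print(check_contig_limits([593, 2600], 3000))
```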
GelStart: Begins a fragment assembly session by creating a new fragment assembly project or by identifying an existing project.
GelEnter: Enters fragment sequences into a fragment assembly project from your terminal keyboard, a digitizer, or existing sequence files.
GelMerge: Aligns the sequences in a fragment assembly project into assemblies called contigs.
GelAssemble: A multiple sequence editor for viewing and editing contigs assembled by GelMerge.
GelView: Displays the structure of the contigs in a fragment assembly project.
GelDisassemble: Breaks up the contigs in a fragment assembly project into single fragments.
Contig: mu26b

 8 mu18b      +--------------------->
 7 mu9        <---+
 6 mu32       +--->
 5 mu26       <----+
 4 mu18       +---->
 3 mu27       <--------------------------------+
 2 mu26b      <------------------+
 C CONSENSUS  <-----------------------------------------------+
              |----------|----------|----------|---------|---------|
              0         100        200        300       400
GelStart
Use GelStart to create a new project database for each sequencing project. For each new project, GelStart creates a new directory, named after the project, as a subdirectory of your current working directory.

gcg% gelstart -check

Minimal Syntax: % gelstart [-NAME=]MyProject -Default

Prompted Parameters:
-NEWproject                        begins a new sequencing project
-VECtors=GB:M13mp18,GB:SynpBR322   highlights specified sequences in GELENTER
-SITes=GAATTC,GGATCC               highlights specified patterns in GELENTER

Local Data Files: None

Optional Parameters:
-DELete                            deletes a whole project!
-NOMONitor                         suppresses the screen monitor
SeqEd
SeqEd is an interactive editor for entering and modifying sequences and for assembling parts of existing sequences into new genetic constructs. You can enter sequences from the keyboard or from a digitizer.
AGTCTTAGTCGATCGTAcTGCATRCGA
....|:.......:|.........i.......:.|.........|.........|.........|.........|..
    0         10        20        30        40        50        60        70
"sample.seq" 27 nucleotides

Screen mode -> command mode: <Ctrl>D
Command mode -> screen mode: <Return>
Screen Mode
G, A, T, . . .      - insert a sequence character
<Delete>            - delete a sequence character
<Ctrl>H             - delete a sequence character
/TAACG<Return>      - find the next occurrence of TAACG (last pattern entered is the default)
1<Return>           - move to start of the sequence
<Ctrl>E             - move to end of the sequence
[n]<Right-arrow>    - go ahead n characters
[n]<Left-arrow>     - go back n characters
<Up-arrow>          - go up to check sequence
<Down-arrow>        - go down to original sequence
'markcharacter      - go to marked position
37<Return>          - go to position 37 (any positive integer)
<                   - go back 50 characters
>                   - go ahead 50 characters
<Ctrl>R             - redraw the screen
<Ctrl>D             - enter command mode
[n] is an optional numeric parameter.
Command Mode
      EDit seqname         - get a new sequence file to edit
[n]   Include [seqname]    - insert another sequence [at position n] (SeqEd prompts for range and strand)
s,f   Delete               - delete a range of bases
[s]   Check [/Blind]       - check a range of bases [beginning at s]
      37                   - go to base 37
      REDraw               - redraw the screen
[n]   COmment comment      - insert a comment [at position n]
[n]   COmment              - enter comment editing mode [at position n]
[n]   HEAding              - edit documentary heading [at line n]
      change               - enter screen mode (<Return> is sufficient)
      screen               - enter screen mode (<Return> is sufficient)
      OVERstrike           - enter overstrike mode
      INSert               - enter insert mode
[n]   Mark markcharacter   - mark the sequence [at position n]
      PERFect              - require finds to be perfect matches
      PROtein              - set sequence type to PROTEIN
      NUCleotide           - set sequence type to NUCLEOTIDE
[s,f] Write [seqname]      - write [a part of] the sequence to a file
      DIGitizer            - enter digitizer mode
      RELoad               - enter reload mode
      ACCept               - terminate reload mode
      Help                 - show commands in screen and command modes
[s,f] EXit [seqname]       - write [a part of] the sequence and quit
      Quit                 - quit the editor without writing the sequence
[n] indicates an optional parameter. s and f are numbers for the start and finish of a range of interest.
GelEnter
GelEnter is a sequence editor that accepts sequence data.

gcg% gelenter -check

Minimal Syntax: % gelenter [-INfile1=]mu*.seq

Prompted Parameters: None

Local Data Files:
set.keys (must be in your current working directory to be used)

Optional Parameters:
-ENTER=mu*.seq          enters existing files into the database
-STAden                 enters existing Staden format files into the database
-FASTA                  enters existing FASTA format files into the database
-SINGlecommand          automatically returns to screen mode after each command
-PERFect                sets find to search for perfect symbol matches
-VECtors=gb:synpbr322   highlights sequences from pBR322
-SITes=gaattc           highlights GAATTC patterns
-LANes=g,A,T,C          sets lane order for digitizer
-MINOverlap=10          sets minimum overlap length for Reload command
-PCTOverlap=95          sets stringency for the Reload command
-TOLerance=0.4          sets tolerance for digitizing ambiguity (0 to 1), with 1 being the most tolerant

GelEnter accepts any valid GCG sequence character. Once you enter sequences into a project database, you can no longer edit them with GelEnter.
GelEnter
gcg2 21% gelenter seq02.dat
GelEnter adds fragment sequences to a fragment assembly project. It accepts sequence data from your terminal keyboard, a digitizer, or existing sequence files.
"seq02" 593 nucleotides

IUB/GCG   Meaning
A         A
C         C
G         G
T/U       T
M         A or C
R         A or G
W         A or T
S         C or G
Y         C or T
K         G or T
V         A or C or G
H         A or C or T
D         A or G or T
B         C or G or T
X/N       G or A or T or C
./~       gap character
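The IUB/GCG table above maps each symbol to the set of bases it can stand for, which is what makes a non-perfect pattern search possible. The following is a minimal sketch of such an ambiguity-aware search, assuming that two symbols match when their base sets share at least one base. The names `symbols_match` and `find_ambiguous` are invented for this example, and the matching rule is an assumption, not a description of how SeqEd or GelEnter implement their find commands.

```python
# Symbol -> possible bases, transcribed from the IUB/GCG table (U treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "X": "ACGT", "N": "ACGT",
}

def symbols_match(a, b):
    """True if two symbols share at least one possible base.

    Gap characters and unknown symbols map to the empty set and never match.
    """
    return bool(set(IUB.get(a.upper(), "")) & set(IUB.get(b.upper(), "")))

def find_ambiguous(pattern, sequence):
    """Return 1-based positions where pattern matches, allowing ambiguity codes."""
    n = len(pattern)
    return [
        i + 1
        for i in range(len(sequence) - n + 1)
        if all(symbols_match(p, s) for p, s in zip(pattern, sequence[i:i + n]))
    ]

sample = "AGTCTTAGTCGATCGTAcTGCATRCGA"   # the 27-nucleotide sample.seq shown earlier
print(find_ambiguous("CATGC", sample))  # R in the sequence can stand for G
```

Requiring identical symbols instead of overlapping base sets would correspond to a perfect-match search, in which CATGC would not be found in this sample.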
GelMerge
GelMerge automatically recognizes overlaps among all of the sequences in a project database and creates aligned assemblies, called contigs, from the overlapping sequences. These contigs are stored in the project database. As you add new sequences that connect separate contigs to the project database, GelMerge aligns the contigs into larger assemblies.
% GelMerge

What word size (* 7 *) ?
What fraction of the words in an overlap must match (* 0.80 *) ?
What is the minimum overlap length (* 14 *) ?

Reading ............
Comparing ............
Aligning .........
Writing ...

Input Contigs: 12
Output Contigs: 3

CPU time: 02.29 (seconds)
Minimal Syntax: % gelmerge -Default

Prompted Parameters:
-WORdsize=7                      sets word size for overlap determination
-STRIngency=0.8                  sets minimum fraction of matching words in overlap
-MINOverlap=14                   sets minimum length of overlap

Local Data Files:
-MATRix1=gelmergedna.cmp         assigns the scoring matrix for contig assembly
-MATRix2=gelmergelocaldna.cmp    assigns the scoring matrix for vector recognition

Optional Parameters:
-MINIdentity=14                  sets minimum run of identical bases found at least once in an overlap between two contigs
-MAXGap=10                       sets maximum gap size for overlap determination
-GAPweight=8                     sets gap creation penalty in contig assembly
-LENgthweight=2                  sets gap extension penalty in contig assembly
-ARChive                         creates contigs from the original gel readings
-WORKing                         creates contigs from individual working fragments (with gaps removed)
-REPortfile[=Filename]           writes report of recognized vector sequences
-EXCise                          removes vector sequences from single-fragment contigs
-VECTORSTringency=0.8            sets minimum fraction of matches in vector recognition
-VECTORMINIdentity=12            sets minimum run of identical bases found at least once in a match between vector and fragment
-VECTORMAXGap=5                  sets maximum gap size in first step of vector recognition
-VECTORGAPweight=30              sets gap creation penalty in vector recognition
-VECTORLENgthweight=3            sets gap extension penalty in vector recognition
-NOMERge                         suppresses contig assembly
-NOMONitor                       suppresses screen trace of program progress
-NOSUMmary                       suppresses screen summary at the end of the program
-BATch                           submits program to the batch queue
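To give a feel for what the word size, stringency and minimum overlap parameters control, here is a deliberately simplified sketch of overlap detection between two fragments. It is not GelMerge's algorithm, which the slides do not describe: this version only considers an ungapped overlap between the end of one fragment and the start of another, and it ignores reverse strands, gap penalties and scoring matrices. The function names are invented for this example; the default values mirror the prompted parameters.

```python
def word_match_fraction(x, y, word_size):
    """Fraction of aligned word positions where x and y carry identical words.

    x and y must have equal length, at least word_size.
    """
    n_words = len(x) - word_size + 1
    hits = sum(x[i:i + word_size] == y[i:i + word_size] for i in range(n_words))
    return hits / n_words

def best_overlap(left, right, word_size=7, stringency=0.8, min_overlap=14):
    """Length of the longest suffix of `left` that overlaps a prefix of `right`.

    Returns 0 when no overlap of at least min_overlap bases reaches the
    required fraction of matching words.
    """
    if min_overlap < word_size:
        raise ValueError("min_overlap must be at least word_size")
    left, right = left.upper(), right.upper()
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if word_match_fraction(left[-length:], right[:length], word_size) >= stringency:
            return length
    return 0

shared = "ACGTACGGATCCGATTACA"              # 19 bases common to both fragments
print(best_overlap("T" * 10 + shared, shared + "G" * 10))
```

Raising the stringency or the minimum overlap makes the test stricter, so fewer fragment pairs are joined; lowering them joins more pairs at the risk of false overlaps.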
GelAssemble
After assembling contigs with GelMerge, use the contig editor, GelAssemble, to review and modify the alignments. After choosing a contig for review, GelAssemble lets you edit the individual sequences in that contig to resolve inconsistencies. GelAssemble creates a consensus sequence that uses the IUB nucleotide ambiguity codes. You can modify a sequence and change the alignment in the same way you edit text with a text editor. Although GelMerge assembles and aligns contigs automatically, you can assemble contigs manually using GelAssemble. For example, you could manually assemble separate contigs that do not share sufficient overlap for GelMerge to assemble automatically. You can also separate fragments from a contig if you believe they should not be included. Once you are satisfied with a contig, you can store it in the sequencing project database.
seq03     > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 220
seq01     > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 540
CONSENSUS > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 540
            .........+.........+.........+.........+.........+.........+
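A consensus built with IUB ambiguity codes can be sketched column by column: collect the bases seen in each alignment column and emit the code that stands for exactly that set. The example below is a minimal illustration under that assumption. The exact rule GelAssemble uses is not stated in the slides, and the function name `consensus` and the use of "." for positions a fragment does not cover are choices made for this example.

```python
# Symbol -> possible bases (IUB/GCG codes, U treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT", "X": "ACGT",
}
# Reverse lookup: set of bases -> one preferred symbol.
CODE_FOR = {frozenset(bases): code
            for code, bases in IUB.items() if code not in ("U", "X")}

def consensus(rows):
    """Column-wise consensus of aligned rows; '.' marks an uncovered position."""
    width = max(len(row) for row in rows)
    result = []
    for col in range(width):
        bases = set()
        for row in rows:
            if col < len(row):
                # Gap characters contribute no bases.
                bases.update(IUB.get(row[col].upper(), ""))
        result.append(CODE_FOR[frozenset(bases)] if bases else ".")
    return "".join(result)

print(consensus(["ACGT.", "ACAT.", ".CGTT"]))
```

In the third column the fragments disagree (G, A, G), so the consensus carries R, which is exactly the kind of position an editor would want to revisit.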
Screen mode -> command mode: <Ctrl>D
Command mode -> screen mode: <Return>
Gelassemble Screen Mode
Keys Pressed        Action
[n]<Right-arrow>    move ahead [n bases]
[n]<Left-arrow>     move back [n bases]
[n]<Up-arrow>       move up [to row n]
[n]<Down-arrow>     move down [to row n]
>                   scroll one screen to the right
<                   scroll one screen to the left
1<Return>           move to start of the sequence
<Ctrl>E             move to end of the sequence
165<Return>         move to base 165 in sequence
/GATTC<Return>      find next occurrence of GATTC
<Ctrl>A             move to next ambiguity in alignment
<Ctrl>R             move to next ambiguity in sequence
<Ctrl>V             move to next gap in consensus
<Ctrl>D             enter Command Mode
<Ctrl>L             toggle alignment display enlargement
<Ctrl>W             redraw the screen
<Ctrl>O             toggle INSERT/OVERSTRIKE mode
!                   summary of current sequence
?                   display these help screens
<Ctrl>G             recalculate the consensus
G A T C ....        add base at the cursor
<Delete>            delete a base, or move sequence left
<Ctrl>H             delete a base, or move sequence left
<Space bar>         move the sequence to the right
<Ctrl>X             delete alignment column
<Ctrl>I             restore alignment column
<Ctrl>B             begin selecting a range for removal
<Ctrl>N             remove the selected range
<Ctrl>P             insert the removed range
-                   reject current fragment
Gelassemble Command Mode
[a,b] specifies a range of fragments.
[x,y] specifies a range of bases.
[n] is an optional numeric parameter.

      EDit [ContigName]      replace current contig with a new contig
      CONTIGs                select another contig for editing
      WRite                  write a contig to the database
      EXit                   write the contig and quit
      QUIT                   quit without writing
      ERASE                  delete current contig from the database
      238                    move to position 238 in the current fragment
[x,y] PRETTYout [FileName]   write the sequence alignment [position x - y]
[a,b] SEQOUT                 write fragments [a - b] to sequence files
      BIGPICture [FileName]  write bar schematic to an output file
      OVERstrike             select OVERSTRIKE sequence edit mode
      NOOVERstrike           select INSERT sequence edit mode
[x,y] CONSensus              recalculate the consensus sequence
[a,b] LOCk                   lock strands [a through b]
[a,b] Unlock                 unlock strands [a through b]
[x,y] SELect                 select bases [x through y]
      REMove                 remove the selected bases
[n]   INSert                 insert the removed bases [at position n]
      CAncel                 cancel the selection
[x,y] DElete                 delete bases [x through y]
      GOTo [FragmentName]    move to strand by name
      FInd GAATC             find the next occurrence of GAATC
      DIfferences            show differences from the consensus
      MAtches                show matches with the consensus
      Neither                show neither matches nor differences
      REDraw                 redraw the screen
      Help                   display these help screens
      SORt [DEScending]      sorts strands by their offsets in alignment
[a,b] MOve                   moves a strand [from line a to line b]
      OPen                   opens a blank line at the cursor position
[a,b] ANChor                 anchors strands [a through b]
[a,b] NOANchor               unanchors strands [a through b]
      LOad [ContigName]      loads another contig into the Edit Screen
      REVerse                reverse-complement the (anchored) strand(s)
[n]   Offset                 shifts the current fragment [to begin at n]
      REJect                 removes the current fragment from the screen
      NODUPlicate            removes a duplicated fragment from the screen
      SPAWN                  renames a duplicated fragment
      SEParate               makes two contigs from anchored and unanchored strands
GelView
GelView displays bar diagrams that show the overlaps among the fragments in each contig, providing a schematic view of the whole sequencing project.
Gelview output: filename.view
To display it: cat/more filename.view
GELVIEW Fragment Assembly contig display of Project: bio May 4, 2000 17:42

Contig: seq01

 3 seq03      +------------------->
 2 seq01      +----------------------------->
 C CONSENSUS  +------------------------------------>
              |----------|----------|----------|---------|---------|
              0         200        400        600       800

Contig: seq04

 3 seq02      <---------------+
 2 seq04      +------------>
 C CONSENSUS  +--------------------------->
              |----------|----------|----------|---------|---------|
              0         400        800       1200      1600

Contig: seq05

 2 seq05      +---------------------------->
 C CONSENSUS  +---------------------------->
              |----------|----------|----------|---------|---------|
              0         200        400        600       800

5 Fragments in 3 Contigs
GelDisassemble
GelDisassemble breaks up the contigs in a sequencing project, thus recreating the database as a collection of single fragments.
% geldisassemble
Are you sure you want to disassemble your project (* No *) ? Yes
1) Emptying "relation" directory....
2) Emptying "consensus" directory....
3) Copying "working" to "consensus"....
4) Creating "relation"....
Gel Project Disassembled
Exercise 92-06-2
Download Ex92-06.exe and decompress the file.
Exercise 92-06-2.doc (class exercise)
Gelassemble commands.doc & SeqED commands.doc (command reference)
Seq01.txt - seq10.txt (sequences for the exercise)
Start GCG FAS
Questions:
(1) What is the correct order of the assembled sequence?
(2) Which putative protein does this sequence encode?
(3) Are there any potential regulatory elements upstream of the gene?
(4) What is the identity with the human protein?
Complete Guide to Bioinformatics Analysis
Gene-mRNA-Protein
Protein-mRNA-Gene
mRNA-Protein-Gene
Are there any standard procedures?
Download Bioinfo91-08.exe and decompress the file. You will find the following files in FASTA format:
Gene.txt, RNA.txt, Protein.txt
Gene-mRNA-Protein
OPEN READING FRAME: DNA, RNA; Reverse or Directional
HOMOLOGY SEARCH: FASTA, BLASTn, BLASTx
MOTIF SEARCH
ALIGNMENT: Bestfit, Gap, Pileup
RESTRICTION MAPPING
Secondary structure
FILE PROCESSING (Trace File Viewer & Format Converter)