RNA and Bioinformatics
Daniel Gautheret, V.2006.0, http://www.esil.univ-mrs.fr/~dgaut/Cours
Course outline
• Session 1
– RNA molecules
– RNA structures
  • Secondary
  • Tertiary
– Structure prediction
  • Energies and covariation
• Session 2
– RNomics
  • Experimental results, RFAM
– Searching for RNAs in genomes
  • Known RNAs
  • Unknown RNAs
• Session 3
– Practical: tRNA in 3D / Mfold / Erpin
1a. RNA molecules
Messenger RNA
Non-messenger or non-coding RNAs (ncRNA)
• tRNA
• rRNA
• snRNA
• snoRNA
• microRNA
• Other RNAs: 4.5S, 10Sa, Spot42, DicF, MicF, OxyS, DsrA, 6S (prokaryotes); products of the XIST, H19, IPW, 7H4, His-1 and NTT genes (mammals).
18S and 28S rRNA: pol-I promoters. tRNA and 5S rRNA: pol-III promoters.
Transfer RNA (tRNA)
• Potentially 64 different ones in a genome
• In practice about forty in microbial genomes, and a few hundred in mammalian genomes.
• The first RNA observed in 3D (1974)
Ribosomal RNA (rRNA)
• 5S, 16S, 23S in prokaryotes (about 120, 1500 and 3000 nt)
• 5S, 5.8S, 18S and 28S in eukaryotes.
• Complexed with proteins (~40)
• Present in a few copies in prokaryotes (7 operons in E. coli)
• Vertebrate genomes can carry several hundred identical copies (gene clusters).
• 3D structure in 1999
Source: http://www.cytochemistry.net/Cell-biology/ribosome.htm
Small nuclear RNA (snRNA)
• Small nuclear RNAs involved in splicing and telomere maintenance
• Complexed with proteins
• Several kinds, named U1, U2, U4, U6, U12…
Source: RFAM
Small nucleolar RNAs (snoRNAs)
• Guide rRNA methylation
• Eukaryotes and archaea
• Two families: C/D box and H/ACA box
• Nomenclature: E2, E3, U3, U14, U23…
Source: plant snoRNA database http://bioinf.scri.sari.ac.uk/cgi-bin/plant_snorna/home
microRNA (miRNA)
[Figure: miRNA biogenesis. In the nucleus, the miRNA transcription unit yields a polycistronic transcript (pri-miRNA), processed into a stem-loop precursor (pre-miRNA, 70 nt). In the cytoplasm, the mature miRNA (21-22 nt) joins the miRNA+RISC complex and acts on the mRNA target: translation block or cleavage.]
miRNA
• Found in animals and plants
• 300 to 1000 in human
• A single miRNA can target 100 genes or more
• Searching for miRNAs
– Conservation
– Structure
RNA genomes
• RNA viruses
– Double-stranded, single-stranded
– Partly coding, partly non-coding
Source: Wikipedia
1b. RNA structure
The bases
Ribose or deoxyribose
[Figure: sugar ring with atom labels C1', C2', C3', C4', C5', O2', O3', O4', O5'.]
Credit: Richard Hallick. http://www.blc.arizona.edu/Molecular_Graphics/DNA_Structure/DNA_Tutorial.HTML
2'-OH: ribose
Nucleosides
Credit: Richard Hallick. http://www.blc.arizona.edu/Molecular_Graphics/DNA_Structure/DNA_Tutorial.HTML
The DNA/RNA chain
[Figure: nucleotide chain with atom labels C1' to C5', O2', O3', O4', O5', the phosphate group (P, OA, OB) and the base.]
Credit: Richard Hallick. http://www.blc.arizona.edu/Molecular_Graphics/DNA_Structure/DNA_Tutorial.HTML
The degrees of freedom of the nucleotide
ν0 to ν4 are summarized by: phase + amplitude
Sugar pucker
Because of non-covalent interactions between the ring substituents, the substituents tend to sit as far apart from each other as possible. As a result, 4 of the 5 ribose ring atoms lie approximately in one plane, while the 5th lies about 0.5 Å out of that plane. 4 major conformations: C2'-endo, C3'-exo, C3'-endo, C2'-exo. Endo: on the base side; exo: away from the base.
C3' endo
C2' endo
Base orientation
[Figure: G-C base pair with the 3' ends of both strands marked.]
– Effect on strand orientation (parallel or antiparallel)
Constraints on torsion angles
– Conformational wheels
The double helix (DNA)
Watson-Crick pairs
The distance between the two attachment points is the same for A-T and G-C pairs.
The same geometry is compatible with A-T and T-A pairs, or with G-C and C-G pairs.
Grooves
[Figure labels: major groove, minor groove, helix axis, pseudo-symmetry axis.]
H-bond donors and acceptors: the identity of base pairs
[Figure labels: major groove, minor groove, donor, acceptor, P, OH (RNA).]
Helices A, B and Z
A: DNA/RNA   B: DNA   Z: DNA/RNA
Axial views of the A and B helices
Helix types
Depending on: ionic strength, solvents, degree of hydration.

                              A          B          Z
Found in                      DNA/RNA    DNA        DNA/RNA
Helix sense                   right      right      left
Nt/turn                       11         10         12
Sugar conformation            3' endo    2' endo    2' endo (py), 3' endo (pu)
Diameter                      26 Å       20 Å       18 Å
Glycosidic bond               anti       anti       anti (py), syn (pu)
bp displacement from axis     4 Å        none       -
Major groove width            3 Å        12 Å       flat
Major groove depth            13.5 Å     9 Å        -
Minor groove width            11 Å       6 Å        narrow
Minor groove depth            3 Å        7.5 Å      deep
Base stacking
Stacking in a B-type strand
• Stacking is not imposed by the double helix: it is one of the main stabilizing factors of nucleic acids.
• Aromatic rings are about 4 Å apart
• Causes: hydrophobicity and van der Waals interactions
• YpR and RpY sequences stack differently: the sequence influences the stability and shape of the helix.
Helix parameters
A double helix can be defined with the following parameters:
• Tilt (θt): rotation about the pseudo-symmetry axis
• Twist (t): rotation per residue about the helix axis
• Propeller twist (θp): between a base and its paired partner
• Axial rise (h): rise per residue
• Dislocation (D): distance between the base-pair center and the helix axis
• Roll (θr): angle about the third axis (the C6-C8 axis)
RNA: secondary structures
Excerpt of 23S rRNA
Secondary and tertiary interactions
Secondary structure elements
a. hairpin
b. internal loop
c. bulge
d. multi-branch loop
e. duplex (long-range)
f. pseudoknot
Secondary and tertiary: an example…
– The 1030-1124 region of 23S rRNA has several tertiary interactions: two canonical lone base pairs (1082:1086 and 1087:1102), a base triple ((1092:1099)1072) and three non-canonical tertiary base pairs (1032:1122, 1039:1116 and 1040:1115).
tRNA
Many unusual nucleosides are present, e.g. inosine (I), pseudouridine (Ψ), dihydrouridine (D), ribothymidine (T), base Y, etc.
tRNA: from 2D to 3D
Source: Molecular Biology of the Cell (http://www.garlandscience.com/textbooks/0815332181.asp)
tRNA
• TΨC loop
• D loop
• Variable loop
• Acceptor stem
• Anticodon loop
"L" shape
Interactions in the tertiary structure:
Formation of base triples; interactions with backbone phosphates and with the ribose 2'-OH
tRNA: tertiary interactions
The ribosome
Total mass: 2.6 × 10^3 kDa (100 times more than lysozyme)
Composition: 1/3 protein, 2/3 nucleotides
30S subunit: interaction with mRNA codons and tRNA anticodons
50S subunit: peptidyl transferase activity and interaction with GTP-binding proteins
Best structures to date:
• 70S ribosome of T. thermophilus at 7.8 Å (1999)
• 30S subunit of T. thermophilus at 4.5 Å (1999)
• 50S subunit of H. marismortui at 2.4 Å (2000)
23S: secondary and tertiary interactions (5' half)
23S: secondary and tertiary interactions (3' half)
Ribosome: tRNA binding sites
Source: Molecular Biology of the Cell (http://www.garlandscience.com/textbooks/0815332181.asp)
The large subunit (23S RNA)
Crystal structure of the large ribosomal subunit from Haloarcula marismortui at 2.4 Å resolution
Large subunit: 35 proteins + 2 RNAs (23S, 5S)
Nenad Ban, Poul Nissen, Jeffrey Hansen, Peter B. Moore, Thomas A. Steitz Science. 289:878-9, 2000
Proteins of the large subunit
Domain folding is dominated by inter-helix interactions
A proposed mechanism of peptide synthesis catalyzed by the ribosome. A2486 (A2451) is shown as the standard tautomer in all steps, but could be represented as the imino tautomer, which would have a negative unprotonated N3 and a neutral protonated N3. We expect that the electronic distribution is actually between these two extremes. (A) The N3 of A2486 abstracts a proton from the NH2 group as the latter attacks the carbonyl carbon of the peptidyl-tRNA. (B) A protonated N3 stabilizes the tetrahedral carbon intermediate by hydrogen bonding to the oxyanion. (C) The proton is transferred from the N3 to the peptidyl tRNA 3' OH as the newly formed peptide deacylates. Among the variations on this mechanism that should be considered would be a protonated A2486 stabilizing the intermediate, as in (B), with less contribution on acid-base catalysis, as shown in (A) and (C).
Stabilization of the tetrahedral carbon by N+
P site freed by the transfer of the peptide onto the A-site tRNA
P site   A site
Peptidyl transferase center
Stabilization of the imino tautomer of A2486 raises the pKa of A2486 N3 to 7.6
Note the tertiary interactions
Main function of the 50S subunit
No peptide within 18 Å
Proton abstraction by N3 of A2486
Nucleophilic attack
The polypeptide exit tunnel. (A) The subunit has been cut in half, roughly bisecting its central protuberance and its peptide tunnel along the entire length. The two halves have been opened like the pages of a book. All ribosome atoms are shown in space-filling representation, with all RNA atoms that do not contact solvent shown in white and all protein atoms that do not contact solvent shown in green. Surface atoms of both protein and RNA are color-coded with carbon yellow, oxygen red, and nitrogen blue. A possible trajectory for a polypeptide passing through the tunnel is shown as a white ribbon. PT, peptidyl transferase site. (B) Detail of the polypeptide exit tunnel showing distribution of polar and nonpolar groups, with atoms colored as in (A), the constriction and bend in the tunnel formed by proteins L4 and L22 (green patches close to PT), and the relatively wide exit of the tunnel. A modeled polypeptide is in white. (C) The tunnel surface is shown with backbone atoms of the RNA color coded by domain. Domains I (yellow), II (light blue), III (orange), IV (green), V (light red), 5S (pink), and proteins are blue.
Polypeptide exit tunnel
RNA motifs
[Figure: panels a to e showing the GNRA loop, the GNRA loop / internal loop interaction, the U-turn, the UNCG loop and the adenine platform (stacking).]
Some RNA motifs. Dotted lines denote tertiary interactions. Arrows indicate the 5'->3' direction. The question mark in motif d indicates that the partner of this tertiary interaction is unknown.
E loop
The G:U wobble pair
The Hoogsteen pair
TΨC loop of yeast tRNA-Phe
The U-turn
Anticodon loop of yeast tRNA-Phe
The GNRA loop
Base triples
Triple helix 10-13 + 22-25 + 9 + 45 + 46 of yeast tRNA-Phe
1.c RNA secondary structure prediction
Dot plots
The Zuker and Stiegler algorithm
• A dynamic programming algorithm applied to secondary structures
• It searches for the lowest-energy structures of all sub-fragments of a sequence (Zuker and Stiegler, 1981).
• It guarantees an optimal solution within the chosen energy model. It does not allow pseudoknots.
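The dynamic programming idea can be sketched with a much simpler scoring scheme than the one used by Zuker and Stiegler. The sketch below is a Nussinov-style recursion that maximizes the number of nested base pairs instead of minimizing free energy; the function name `max_base_pairs` and the minimum loop size of 3 are assumptions made for illustration. Like the energy-based algorithm, it fills a table over all sub-fragments and cannot produce pseudoknots.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested pairs in seq.

    This is a simplified stand-in for energy minimization: each
    Watson-Crick or G-U pair scores 1, and a hairpin loop must
    contain at least `min_loop` unpaired bases.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = optimal score for the sub-fragment seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]            # case 1: j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))
```

Because every sub-fragment is solved once and reused, the cost grows as the cube of the sequence length, which is also the order of magnitude usually quoted for energy-based folding without pseudoknots.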
Predicting secondary structures: stacking free energies (Turner et al.)
– Within the helix (Watson-Crick and G-U pairs only). Rows give X, columns give Y; a dot marks a combination that is not a valid pair.

  5' UX 3'
  3' AY 5'
  X\Y      A       C       G       U
  A        .       .       .    -8.1
  C        .       .   -13.3       .
  G        .   -10.5       .    -4.0
  U     -6.6       .    -3.6       .

  5' UX 3'
  3' GY 5'
  X\Y      A       C       G       U
  A        .       .       .    -4.0
  C        .       .    -5.3       .
  G        .    -1.8       .    -5.3
  U     -3.6       .    -3.6       .

  The tables for 5' UX 3' / 3' CY 5' and 5' UX 3' / 3' UY 5' contain no valid entry.

– Warning: A-U stacks depend on strand orientation.
[Figure: two A-U stacks drawn with their 5' ends marked.]
Turner's free energies (continued)
– Terminal mismatches. Rows give X, columns give Y.

           5' AX 3'                   5' AX 3'                   5' AX 3'                   5' AX 3'
           3' AY 5'                   3' CY 5'                   3' GY 5'                   3' UY 5'
  X\Y      A      C      G      U  |   A      C      G      U  |   A      C      G      U  |   A      C      G      U
  A        .      .      .   -4.0  |   .      .      .   -4.3  |   .      .      .   -3.8  | -4.3   -6.0   -6.0   -6.0
  C        .      .   -5.2      .  |   .      .   -7.2      .  |   .      .   -7.1      .  | -2.6   -2.4   -2.4   -2.4
  G        .  -10.3      .   -7.2  |   .   -5.2      .   -4.8  |   .   -9.4      .   -6.6  | -3.4   -6.9   -6.9   -6.9
  U     -4.3      .   -4.3      .  | -2.6      .   -2.6      .  | -3.4      .   -3.4      .  | -3.3   -3.3   -3.3   -3.3

[Figure: example of a terminal mismatch with the 5' ends marked.]
Turner's free energies (continued)
– Dangling ends. Columns give X.

             X:    A      C      G      U
  5' AX 3'
  3' A  5'         .      .      .      .
  5' AX 3'
  3' C  5'         .      .      .      .
  5' AX 3'
  3' G  5'         .      .      .      .
  5' AX 3'
  3' U  5'      -4.9   -0.9   -5.5   -2.3

[Figure: example of a dangling end with the 5' ends marked.]
Comparative analysis
Comparative analysis is the most reliable way to determine secondary structures.
It relies on the detection of covariations.
It requires several aligned homologous sequences.
It works even for non-Watson-Crick pairs!
Measures of covariation
• Contingency table: table p normal 53 61
  (rows: position 53; columns: position 61; totals in %)

  53 \ 61 |    a    c    g    u    -
    total |    0   97    0    3    0
  --------+--------------------------
  a     3 |    0    0    0    3    0
  c     0 |    0    0    0    0    0
  g    97 |    0   97    0    0    0
  u     0 |    0    0    0    0    0
  -     0 |    0    0    0    0    0

  gc = (29, 28.03, 96.7%)   au = (1, 0.03, 3.3%)
  (each triple appears to read: observed count, expected count, frequency)
• Tests:
– Chi-square
– Mutual information
– Phylogenetic events
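As a rough illustration of the mutual information test, the sketch below computes MI (in bits) between two alignment columns. The two columns are rebuilt from the counts shown above (29 G:C and 1 A:U); the function name `mutual_information` is an assumption, and gap handling or small-sample corrections are deliberately left out.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (bits) between two alignment columns.

    col_x and col_y are equal-length strings, one character per sequence.
    """
    n = len(col_x)
    fx = Counter(col_x)
    fy = Counter(col_y)
    fxy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), count in fxy.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((fx[a] / n) * (fy[b] / n)))
    return mi


# Columns rebuilt from the contingency table: 29 G:C pairs and 1 A:U pair.
col53 = "G" * 29 + "A"
col61 = "C" * 29 + "U"
print(round(mutual_information(col53, col61), 3))
```

MI is zero when one of the columns is invariant, which is why a perfectly conserved pair carries no covariation signal even if it is a true base pair.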
Example: covariation of one nucleotide with all the others
– Position 1 of tRNA against all other positions:
Course 2. RNomics
Course material:
http://www.esil.univ-mrs.fr/~dgaut/cours/orsay.html
From November:
http://rna.igmors.u-psud.fr/
Bacterial genome
• 90% transcribed (~90% coding)
[Figure: genome map with coding and intergenic regions; transcribed regions are marked T.]
Vertebrate genomes: 90% transcribed too!
[Figure: genome composition with transcribed regions marked T (one marked "T?"). Labels: intergenic DNA; introns, UTRs; satellite DNA; transposable elements; tRNA, rRNA; coding regions: 1.5%.]
Vertebrate gene: 30 kb (coding: 1.5 kb)
RFAM: The RNA database
Sam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay Khanna, Sean R. Eddy and Alex Bateman.
RFAM stats
• 503 RNA families
– one family = a group of homologous, alignable RNAs
– miRNA = >20 families
• Human:
– ~100 families
– 3000 different annotated RNAs
• E. coli: ~40 families
Families, orthologs, etc.
How many other ncRNAs?
2.1 Experimental approaches
Cloning
– RNomics*
– Extract total RNA
– Isolate small RNAs
– Tag & reverse transcribe
– Clone & sequence
• ~200 ncRNAs identified in mouse
• ~100 ncRNAs in bacteria
* Hüttenhofer et al., 2001
Limitations
• Sensitivity: ncRNAs that are rare, or expressed at specific sites or for a short time
• Not exhaustive
Other approaches
– Full-length cDNA projects
  • FANTOM3:
    – 100,000 mouse cDNAs
    – 32,000 non-coding!
– Tiling arrays
  • Half of the human transcriptome is polyA-, cytoplasmic and maps to unannotated loci
Limitations
• Many non-functional transcripts (transcriptional "leakage")
• No proof of function as RNA
2.2 Bioinformatic search for RNA genes
What's special about ncRNA detection?
[Figure labels: BLAST, HMM.]
– No ORF
– No Markov model / sequence statistics
– ncRNA is defined by both primary and secondary structure
– "Substitution matrices" for nucleic acids are terrible compared to their amino acid counterparts
2.2.a Looking for known ncRNAs
– How can we detect ncRNA genes from known families?
Descriptor-based programs
• Rnamot / Rnamotif (Gautheret '91, Macke '02)
• Palingol (Viari '96)
• Patscan (Overbeek '00)
• PatSearch (Pesole '01)
RnaMot descriptor for the anticodon + TΨC domain of tRNA:
  h1 s1 h1 s2 h2 s3 h2
  h1 5:5 1
  h2 5:5 NNNNR:YNNNN
  s1 7:7 NUNNNNN
  s2 4:40
  s3 7:7 UUCNNNN
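To make the descriptor idea concrete, here is a minimal Python sketch that scans a sequence for the layout h1 s1 h1 s2 h2 s3 h2 with the lengths and sequence constraints listed above. It is not RnaMot itself: the function names are hypothetical, the trailing `1` on the h1 line is read as "one mismatch allowed" (an assumption), G-U is counted as a valid pair (also an assumption), and h2 is required to pair perfectly.

```python
IUPAC = {"N": "ACGU", "R": "AG", "Y": "CU",
         "A": "A", "C": "C", "G": "G", "U": "U"}
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def matches(seq, pattern):
    """True if seq fits the IUPAC pattern position by position."""
    return len(seq) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(seq, pattern))


def mismatches(strand5, strand3):
    """Count non-pairing positions between two helix strands.

    Both strands are written 5'->3', so the first base of strand5
    faces the last base of strand3.
    """
    return sum((a + b) not in PAIRS
               for a, b in zip(strand5, reversed(strand3)))


def scan(seq):
    """Return (start, end) spans that satisfy the descriptor."""
    hits = []
    for i in range(len(seq)):
        h1_5, s1, h1_3 = seq[i:i + 5], seq[i + 5:i + 12], seq[i + 12:i + 17]
        if (len(h1_3) < 5 or not matches(s1, "NUNNNNN")
                or mismatches(h1_5, h1_3) > 1):   # h1: 1 mismatch allowed
            continue
        for gap in range(4, 41):                  # s2: 4 to 40 nt
            j = i + 17 + gap
            h2_5, s3, h2_3 = seq[j:j + 5], seq[j + 5:j + 12], seq[j + 12:j + 17]
            if len(h2_3) < 5:
                break
            if (matches(h2_5, "NNNNR") and matches(h2_3, "YNNNN")
                    and matches(s3, "UUCNNNN")
                    and mismatches(h2_5, h2_3) == 0):
                hits.append((i, j + 17))
    return hits


toy = "GCGCG" + "CUGAAAA" + "CGCGC" + "AAAA" + "CCGGG" + "UUCGAAU" + "CCCGG"
print(scan(toy))
```

The toy sequence is artificial and was built to satisfy every constraint once; on real genomes a descriptor this loose would return many false positives, which is the weighting problem raised in the CONS column.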
Descriptor-based programs
PROS
• Draft descriptors can be quickly sketched and tested
• Alignment is not compulsory, although it is very helpful to have one
• Biologists decide what features are important or not (see also CONS!)
CONS
• Requires good prior knowledge of secondary structure and sequence constraints
• Requires basic computer skills to translate biological constraints into a computer script
• Biologists have the responsibility of correctly weighting each important feature
Probabilistic ncRNA search programs
• Stochastic Context-Free Grammars (first adaptation of CFG to RNA: Searls '94; SCFG: Eddy & Durbin '94)
– Time cost = O(N^4) for a sequence of length N
– Not "practical" for large alignments or genome-wide searches
– Pseudoknots not allowed
[Figure: training set -> production rules, which describe how to generate any structure in the training set.]
ERPIN: Secondary Structure Profiles
Alignment (columns 1-3 and 7-9 are the two helix strands, columns 4-6 a single strand):
  1 2 3   4 5 6   7 8 9
  A C C   C G A   G G U
  A C C   C A A   G G U
  G C G   C G -   U G C
  A C C   - G -   G G U
  A C G   - G A   C G U
• Helices: a 16-row matrix (A:A, G:A, C:A, U:A, A:G, G:G, C:G, U:G, ...) captures base correlations and base-pair frequencies: $S_{b_1 b_2} = \log\left(F_{b_1 b_2} / (F_{b_1} F_{b_2})\right)$
• Single strands: the usual 5-row matrix (A, G, C, U, -), i.e. the weight matrices implemented in PSI-Blast or Prosite
l=10 l=14
Seq
uenc
e (1
4nt
)
best score for forl=14 (0 gaps)
Single-strand profile (14 positions)
best score forl=10 (4 gaps)
Helix score for h5-h3computed from helix profile
GTTCTTGCATGTTTGACGGAACGTTCTTGCATGATTGACGGAACGTTCTTGCATGTTTGACGGAACTTTCCTGCATGCTTGACGGAACTTTAT--CAAGTTCAT-ATAAAATTAT--CGTGCCTTC-ATAATATTAT--CGTGTCTTC-ATAATATTAT--CATGTTTC--ATAAT
Training set
h3h5
Target sequence
ERPIN: Profile-based search
Gautheret & Lambert, JMB, 2001, 313, p. 1005.
Training set ->
• Single-strand profile (5×N; rows A, G, C, U, -): $S_b = \log(F_b / E_b)$
• Helix profile (16×N; rows A:A, G:A, C:A, U:A, A:G, G:G, C:G, U:G, ..., U:U): $S_{b_1 b_2} = \log\left(F_{b_1 b_2} / (F_{b_1} F_{b_2})\right)$
Problem with information-poor training sets
A Mir-133 training set:
  (( - ((((((( ------ ((( - (((( ---------------------- )))) ))) ------ ))))))) - ))
  TC t GGCTGGT caaac- GGA a CCAA gtccgtcttcctgagaggt--- TTGG TCC CCTTCA ACCAGCT a CA
  TG t GGCTGGT caaac- GGA a CCAA gtcaggtgtttctgtgaggt-- TTGG TCC CCTTCA ACCAGAC t AT
  TG t GGCTGGT aaaac- GGA a CCAA gtcaggtgtttttgtgaggt-- TTGG TCC CCTTCA ACCAGCT a TG
  TG c GGCTGGT gaaaa- GGA a CCAC atcaacccagaaaaaggat--- TTGG TCC CCTTCA ACCAGCC g CA
  TA t GGCTGGT caaac- GGA a CCAA gtccgtcttccttagaggt--- TTGG TCC CCTTCA ACCAGCT a TT
  AG t TGCTGGT aaaac- GGA a CCAA gtcgggtgtttgcgagaggt-- TTGG TCC CTTTCA ACCAGCT a CT
  TG t GGCTGGT caaat- GGA a CCAA gtcaggtgtttctgcgaggt-- TTGG TCC CCTTCA ACCAGCT a CT
100% C:G
Other scores = log(obs/expected) = arbitrary low value!
What about G:C or A:U in this column? Is it as bad as C:C or A:G?
Pseudocounts
– Principle: fill columns with expected counts, based on a reasonable model
– Example: column c contains 7 C:Gs; we know C:G often substitutes for G:C, so let's allow for some G:Cs.
– We need substitution matrices!
Henikoff & Henikoff pseudocounts (for amino acids)
[Equation labels: total number of pseudocounts in column c; counts of a in column c; a substituted by i.]
With the previous example:
Column c is 100% C:G: Probability(C:G) = 1, others = 0
Count of A:Ts = Bc × 1 × Probability(C:G | A:T)
Count of A:As = Bc × 1 × Probability(C:G | A:A), etc.
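The worked example can be sketched in a few lines of Python. This is an illustration of substitution-based pseudocounts in the spirit of the slide, not ERPIN's implementation. The function name `pseudocounts`, the three-pair toy substitution table and its values are invented for the example, and the total Bc is derived from a pseudocount weight pcw = pseudocounts / (pseudocounts + true counts).

```python
def pseudocounts(true_counts, subst, pcw):
    """Distribute Bc pseudocounts over pair types in one column.

    true_counts: observed counts per pair type in the column.
    subst[i][a]: substitution weight from observed pair i to pair a
                 (each row is normalized inside the function).
    pcw: pseudocount weight, so Bc = pcw / (1 - pcw) * total true counts.
    """
    n_true = sum(true_counts.values())
    b_total = pcw / (1.0 - pcw) * n_true
    pseudo = {}
    for target in subst:
        weight = 0.0
        for observed, count in true_counts.items():
            row_sum = sum(subst[observed].values())
            weight += (count / n_true) * subst[observed][target] / row_sum
        pseudo[target] = b_total * weight
    return pseudo


# Hypothetical 3-pair substitution table (values invented for illustration).
toy_subst = {
    "C:G": {"C:G": 0.80, "G:C": 0.15, "A:A": 0.05},
    "G:C": {"C:G": 0.15, "G:C": 0.80, "A:A": 0.05},
    "A:A": {"C:G": 0.10, "G:C": 0.10, "A:A": 0.80},
}
column = {"C:G": 7}          # the 100% C:G column from the slide
result = pseudocounts(column, toy_subst, pcw=0.3)
print(result)
```

With this toy table, G:C receives more pseudocounts than A:A, which is the desired behavior: a pair that commonly substitutes for C:G is penalized less than an unrelated mismatch.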
RNA substitution matrices
Obtained from a eukaryote + archaea + bacteria 16S/18S rRNA alignment

Base-pair matrix (16×16):
   AA       AT       AG       AC       TA       TT       TG       TC       GA       GT       GG       GC       CA       CT       CG       CC
AA 6.54e-04 5.20e-06 3.88e-05 4.22e-05 2.13e-05 5.51e-06 1.21e-05 3.84e-05 8.52e-05 1.28e-05 1.76e-04 2.89e-06 1.47e-05 6.47e-06 3.19e-06 4.69e-06
AT 7.96e-05 9.00e-04 5.19e-05 1.78e-04 1.69e-04 1.43e-04 8.85e-05 1.86e-04 4.15e-05 1.69e-04 1.22e-04 1.99e-04 8.73e-05 2.44e-04 1.25e-04 3.30e-04
AG 1.00e-04 8.72e-06 1.35e-03 1.27e-04 1.72e-05 5.09e-06 3.10e-05 1.38e-04 5.74e-05 1.59e-05 1.01e-04 8.22e-06 9.99e-06 1.62e-05 1.33e-05 2.56e-05
AC 4.11e-05 1.13e-05 4.81e-05 9.79e-04 2.79e-06 7.02e-06 2.79e-06 4.47e-05 4.93e-06 1.97e-05 3.05e-05 8.06e-06 5.40e-06 1.55e-05 2.47e-06 7.30e-05
TA 4.23e-04 2.19e-04 1.33e-04 5.69e-05 1.16e-03 2.21e-04 2.35e-04 2.78e-04 9.59e-05 1.18e-04 1.79e-04 1.08e-04 3.54e-04 2.04e-04 2.24e-04 9.28e-05
TT 1.05e-05 1.80e-05 3.80e-06 1.38e-05 2.14e-05 9.30e-04 2.57e-05 7.75e-05 5.79e-06 2.33e-05 4.87e-05 1.18e-05 1.57e-05 8.72e-05 1.83e-05 5.25e-04
TG 1.05e-04 5.03e-05 1.04e-04 2.49e-05 1.03e-04 1.16e-04 1.14e-03 1.80e-04 4.69e-05 4.56e-05 1.25e-04 4.26e-05 1.70e-04 2.15e-04 7.52e-05 3.23e-05
TC 1.45e-05 4.59e-06 2.03e-05 1.73e-05 5.30e-06 1.52e-05 7.82e-06 1.60e-04 4.55e-06 8.99e-06 4.77e-06 3.66e-06 9.00e-06 6.17e-05 2.95e-06 1.61e-05
GA 2.57e-04 8.19e-06 6.74e-05 1.53e-05 1.46e-05 9.11e-06 1.63e-05 3.64e-05 1.47e-03 2.50e-05 8.70e-05 2.12e-05 3.02e-05 2.83e-05 4.40e-06 8.02e-06
GT 1.24e-04 1.07e-04 6.02e-05 1.96e-04 5.81e-05 1.18e-04 5.10e-05 2.31e-04 8.04e-05 1.28e-03 9.39e-05 8.77e-05 2.53e-05 9.12e-05 3.55e-05 4.58e-05
GG 1.82e-04 8.24e-06 4.08e-05 3.24e-05 9.35e-06 2.61e-05 1.49e-05 1.30e-05 2.97e-05 9.98e-06 5.62e-04 6.96e-06 6.83e-06 8.80e-06 1.32e-05 1.06e-05
GC 1.14e-04 5.14e-04 1.26e-04 3.27e-04 2.16e-04 2.44e-04 1.94e-04 3.84e-04 2.78e-04 3.57e-04 2.67e-04 1.49e-03 1.07e-04 5.26e-04 2.57e-04 2.87e-04
CA 1.30e-05 5.04e-06 3.43e-06 4.90e-06 1.58e-05 7.22e-06 1.73e-05 2.10e-05 8.85e-06 2.30e-06 5.85e-06 2.40e-06 5.30e-04 5.30e-05 1.58e-05 7.68e-06
CT 3.86e-06 9.54e-06 3.78e-06 9.51e-06 6.16e-06 2.71e-05 1.48e-05 9.78e-05 5.60e-06 5.61e-06 5.10e-06 7.94e-06 3.58e-05 2.95e-04 5.22e-06 3.52e-05
CG 1.04e-04 2.68e-04 1.70e-04 8.32e-05 3.71e-04 3.12e-04 2.83e-04 2.56e-04 4.77e-05 1.19e-04 4.21e-04 2.12e-04 5.86e-04 2.86e-04 1.35e-03 2.50e-04
CC 2.13e-06 9.81e-06 4.54e-06 3.41e-05 2.12e-06 1.24e-04 1.69e-06 1.94e-05 1.21e-06 2.14e-06 4.70e-06 3.31e-06 3.95e-06 2.68e-05 3.48e-06 5.45e-04

Single-base matrix (4×4):
  A        T        G        C
A 9.13e-04 8.22e-05 1.05e-04 9.35e-05
T 5.57e-05 6.70e-04 7.98e-05 1.41e-04
G 6.94e-05 7.78e-05 7.32e-04 5.03e-05
C 4.09e-05 9.15e-05 3.33e-05 6.03e-04
Performance as a function of training set size and pseudocount weight
[Figure: sensitivity and specificity for training sets of 1 to 19 sequences.]
Setting the pseudocount weight
$pcw = \dfrac{\text{pseudocounts}}{\text{pseudocounts} + \text{true counts}}$
An E-value for RNA motifs?
E-value: number of hits of score > S expected by chance
Can we guess this? Idea 1: run against a random database and compute the score distribution, then extrapolate to any score.
High scores are well behaved
• $Y = 6.6 \times 10^{5}\, e^{-0.3 S}$
• $E = K m n\, e^{-\lambda S}$ (extreme value distribution)
• Score 30 (min): 3.8 hits/100 Mb
• Score 40 (ave): 0.13 hits/100 Mb
[Figure: number of hits (log scale, 10 to 10000) versus score (20 to 35) for random and Markov-model sequences. SECIS hits in 700 Mb of randomized sequences.]
An E-value is possible
1 day to run the simulation!
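The extrapolation step can be sketched as a log-linear fit of hit counts against score. In the sketch below the input points are synthetic, generated from the fitted curve quoted on the slide (6.6 × 10^5 and 0.3), so it only shows the mechanics of the fit and not real simulation data. The function names and the `db_mb` / `sim_mb` parameters are assumptions.

```python
from math import exp, log


def fit_exponential_tail(scores, hit_counts):
    """Least-squares fit of log(count) = log(a) - lam * score.

    Returns (a, lam) such that count ~ a * exp(-lam * score).
    """
    ys = [log(c) for c in hit_counts]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
             / sum((x - mean_x) ** 2 for x in scores))
    return exp(mean_y - slope * mean_x), -slope


def expected_hits(score, a, lam, db_mb, sim_mb):
    """Expected chance hits at or above `score` in a db_mb-sized database,
    given a fit obtained on sim_mb of randomized sequence."""
    return a * exp(-lam * score) * db_mb / sim_mb


# Synthetic points drawn from the slide's fitted curve (not measured data).
scores = [20, 25, 30, 35]
counts = [6.6e5 * exp(-0.3 * s) for s in scores]
a, lam = fit_exponential_tail(scores, counts)
print(a, lam)
```

A fit of this kind is only trustworthy in the score range covered by the simulation, which is why the one-day random search is needed before any score can be converted to an E-value this way.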
Global score distribution is not as good!
[Figure: helix profile (rows A:A, G:A, C:A, U:A, ..., U:U) whose cells hold either a "log(0)" or pseudocount score, or a "finite" high score, together with the distribution of all scores.]
Not Gaussian. How can we model it? (Same for the single-strand profile.)
Discrete convolutions
[Figure: score distribution of a helix profile, computation versus simulation.]
E-values of complete motifs
[Figure: simulated versus computed E-values.]
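The convolution idea can be illustrated as follows: if each profile column contributes an independent score with a known discrete distribution, the distribution of the total score is the convolution of the column distributions, and the tail of that distribution gives the chance probability of reaching a score. This is a minimal sketch under that independence assumption, with integer scores and hypothetical function names; the two-valued column distribution is made up for the example.

```python
from collections import defaultdict


def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete scores.

    Each distribution maps score -> probability.
    """
    out = defaultdict(float)
    for score_a, p_a in dist_a.items():
        for score_b, p_b in dist_b.items():
            out[score_a + score_b] += p_a * p_b
    return dict(out)


def tail_probability(dist, threshold):
    """Probability that the total score is >= threshold."""
    return sum(p for score, p in dist.items() if score >= threshold)


# Made-up column: score +2 with probability 0.25, -1 otherwise.
column = {2: 0.25, -1: 0.75}
two_columns = convolve(column, column)
print(two_columns, tail_probability(two_columns, 1))
```

Multiplying the tail probability by the number of positions searched would give an expected number of chance hits, without running a random-sequence simulation.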
Profile-based search
PROS
• All constraints in the training set are efficiently exploited, resulting in highly specific detections
• No programming is needed
• Scoring system is defined automatically
• E-values are provided for each hit
CONS
• Alignment and secondary structure constraints must be accurate
• Helices of variable length need to be reduced to their shortest consensus
• Program will not depart from the initial alignment in terms of motif size
• Users still have to decide on search order and masked elements
Running a successful ncRNA search
Example: the Signal Recognition Particle (SRP) RNA
• 172 sequences available
• All 3 kingdoms
• Signature: 50-nt domain IV
Organize ncRNA information
• Alignment is a must
– Should be structure-based
– ClustalW OK only as a first attempt
– RNAalifold (Vienna package) can identify covarying base pairs
• Secondary structure annotation
– Will help identify sequence/structure constraints: helix sizes, conserved bases, etc.
Want to publish your finds in the absence of an E-value? Prepare a control procedure.
Sensitivity: $SN = TP / (TP + FN)$, where TP + FN is the total number of "true" objects
Specificity: $SP = TP / (TP + FP)$, where TP + FP is the total number of predictions
TP and FN: easy to obtain, using the training set (leave-one-out)
FP: harder! How do you know a hit is false?
Hint: express SP as FP per Mb in a random sequence
Make it large enough and of the same composition (mono- and di-nt) as the search database (e.g. with the shuffle program)
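A minimal sketch of this control procedure, with hypothetical function names. The shuffle shown here preserves only mononucleotide composition; preserving dinucleotide composition, as the slide recommends, needs a dedicated algorithm or an existing shuffling tool, so this part is a deliberate simplification.

```python
import random


def sensitivity(tp, fn):
    """SN = TP / (TP + FN): fraction of true objects that are found."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """SP = TP / (TP + FP), as defined on the slide
    (fraction of predictions that are true)."""
    return tp / (tp + fp)


def false_positives_per_mb(fp, db_length_nt):
    """Hits in a randomized database, expressed per megabase."""
    return fp / (db_length_nt / 1e6)


def shuffle_mononucleotide(seq, seed=0):
    """Random permutation of seq; keeps base counts, NOT di-nt counts."""
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


control = shuffle_mononucleotide("ACGUACGUGGCCAAUU")
print(control, sensitivity(tp=9, fn=1), false_positives_per_mb(7, 700_000_000))
```

Every hit found in the shuffled database is a false positive by construction, which sidesteps the question of deciding whether a hit in a real genome is false.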
Using the ERPIN program
Training set file (the two digit rows under ">structure" are read column by column, as tens and units of the element number):
  >structure
  000000000000000000000000000000000000001111001000
  222443333333666555588877777788899996661111443222
  >AQU.AEO.
  AGGGUGAACU-CCCCCAGGCCCGAA--AGGGAGCAAGGGUAAGC-CCG
  >THE.THE.
  GGCGUGAACC-GGGUCAGGUCCGGA--AGGAAGCAGCCCUAAGC-GCC
Commands:
  erpin srp.epn sequence.fasta -8,8 -nomask
  erpin srp.epn sequence.fasta -2,2 -nomask
  erpin srp.epn sequence.fasta -2,2 -umask 5 9 -nomask
ERPIN results
Score: based on profile values
E-value: how many hits are expected at this score or higher?
No need for random sequence tests!
Looking for MicroRNA Precursors
[Figure: miRNA biogenesis, from the miRNA transcription unit and polycistronic transcript (pri-miRNA) through the stem-loop precursor (pre-miRNA, 70 nt) to the mature miRNA (21-22 nt) in the miRNA+RISC complex, which blocks translation of or cleaves the mRNA target.]
miR training sets
• >50 miRNA families with different signatures
– Can't use a single profile for all
• 18 training sets built for 18 miRNA families
– Using CLUSTALW + Alifold
– 10 sequences/family on average
Legendre, Lambert, Gautheret, Bioinformatics 2004
ERPIN vs BLAST
• 20 animal genomes scanned
• Compare Erpin and WU-BLAST with sensitive parameters (W=7)
• E-value ≤ 0.01
[Figure: comparison of ERPIN and WU-BLAST hits; labels: 43 (0), 212 (5), 41 (9).]
Analysis of a miR cluster
[Figure: the miR17 cluster, including Ciona.]
Grey: initial training set
"E" indicates hits identified by ERPIN only,
"EB" indicates hits identified by both ERPIN and BLAST.
• Important homologues missed by WU-BLAST
• Profile search is a must in miRNA detection
2.2b De novo ncRNA finding
– How can we detect ncRNA genes when no prior sequence/structure data is available?
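Thermodynamic profiling (Le et al. '88)
[Figure: 100-base windows sliding along the sequence, each annotated with a value (-20, -2, -3, -2); resulting profile value: -4.]
$Z\text{-score} = \dfrac{\text{window free energy} - \operatorname{mean}(\text{energy of random sequences})}{\sqrt{\operatorname{Var}(\text{energy of random sequences})}}$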
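A sketch of the Z-score computation for one window. The folding energy function is passed in as a parameter because no particular folding program is assumed here; the `toy_energy` used in the example is a made-up stand-in (it just counts GC dinucleotides) and has no thermodynamic meaning. The randomization is a plain mononucleotide shuffle, and the function names are hypothetical.

```python
import random
from statistics import mean, stdev


def z_score(window, energy_fn, n_random=100, seed=0):
    """Z-score of a window's energy against shuffled versions of itself.

    energy_fn: callable taking a sequence and returning an energy
               (placeholder for a real folding program).
    """
    rng = random.Random(seed)
    energies = []
    for _ in range(n_random):
        letters = list(window)
        rng.shuffle(letters)          # mononucleotide shuffle only
        energies.append(energy_fn("".join(letters)))
    spread = stdev(energies)
    if spread == 0:
        return 0.0
    return (energy_fn(window) - mean(energies)) / spread


def toy_energy(seq):
    """Made-up energy: more GC dinucleotides means 'lower energy'."""
    return -float(seq.count("GC"))


z = z_score("GCGCGCGCAAAAAAAA", toy_energy)
print(z)
```

A strongly negative Z-score means the window scores lower than its own shuffled versions. With a real folding energy, the slides that follow point out that known ncRNAs often fail to stand out by this measure.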
The problem with thermodynamics
• OK for strong local structures (some success in viral genomes)
• However: true ncRNAs (tRNA, rRNA) do not display higher folding energy than random sequences of the same composition (di-nt: Rivas & Eddy 2000)
• The method would not detect many known ncRNAs
• G+C content alone is a better ncRNA predictor than free energy
G+C content
• In a high A+T background (thermophilic archaebacteria), ncRNAs stand out clearly.
• Combining (G+C)% and CpG% provides the best discriminant (Schattner '02).
• A dozen such predicted ncRNAs were experimentally confirmed in M. jannaschii and P. furiosus.
• Does not work in genomes with "normal" G+C content, except as a complement to other methods (thermodynamics, etc.)
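The composition scan can be sketched as a sliding window that reports the G+C fraction and the CpG dinucleotide fraction. The function name, the 80-nt window and the 40-nt step are arbitrary choices for the illustration, and the slides give no thresholds for calling a window an ncRNA candidate, so none are applied here.

```python
def window_composition(seq, size=80, step=40):
    """Return (start, gc_fraction, cpg_fraction) for each window.

    gc_fraction  = (G + C) / window size
    cpg_fraction = number of CG dinucleotides / (window size - 1)
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - size + 1, step):
        window = seq[start:start + size]
        gc = (window.count("G") + window.count("C")) / size
        cpg = window.count("CG") / (size - 1)
        results.append((start, gc, cpg))
    return results


# Artificial sequence: an A+T-rich stretch followed by a G+C-rich one.
genome = "AT" * 40 + "GC" * 40
for start, gc, cpg in window_composition(genome):
    print(start, gc, round(cpg, 3))
```

In an A+T-rich genome, windows with a high G+C fraction would be the candidates to inspect further, ideally in combination with the other criteria listed above.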
RNAGenie (Carter, Dubchak & Holbrook, 2001)
– Partition E. coli sequences into 2 sets of overlapping 80-nt windows:
  • 675 kb intergenic (negative set)
  • 8 kb true ncRNA (16S, 23S, 5S, tRNA, other small RNAs) (positive set)
– Train a neural network capturing different "compositional characters"
  • nt and di-nt composition
  • frequency of typical RNA motifs: UNCG, GNRA, AAR, CUAG
  • Folding free energy
RNAGenie results
– Claims about 80-90% accuracy
– Predicts 370 new ncRNAs
– 13 predictions confirmed subsequently
– Limitations
  • Small training set: skewed prediction
  • Possible contamination of the "negative" set by ncRNA sequences
Comparative genomics rules!
• Wassarman et al. '01: comparison of Escherichia, Salmonella and Klebsiella: 60 ncRNAs predicted, 23 confirmed
• Many ongoing projects in bacteria, Xenopus, Ciona, human
Comparative genomics: a major source of ncRNA in eukaryotes
– 5-6% of the mammalian genome is under selection vs 1.5% coding (3 times as much as in nematodes)
– Tiling arrays and full-length cDNAs identify as many transcripts in intergenic regions as in known genes
– As many polyA- as polyA+
Functional assignment of conserved regions
– Coding exons
– Regulatory sequences in exons and introns
– Promoters
– ncRNA
– Ancestral repeats
– Others (matrix attachment, etc.)
Margulies et al., 2003
[Figure: fraction of conserved sequences in each class (AR = ancestral repeats).]
Detect this!
Need classification software
- QRNA (Rivas & Eddy, 2001)
- RNAz (Hofacker & Stadler, 2005)
Q-RNA (Rivas & Eddy 2001)
– Analysis of Blast alignments (SCFG-based)
• Model for protein-coding genes: synonymous mutations
• Model for ncRNA: compensatory mutations (also includes loop probabilities obtained from a training set of real ncRNAs)
RNA-Alifold (Hofacker, Stadler)
• Originally: secondary structure prediction
• Predicts the best common structure for a set of aligned RNAs
• Dynamic programming, averaging:
– Energies of aligned bases
– Covariation term
RNAz (Washietl et al. 2004)
• Uses multiple alignments.
• Two basic components:
– (1) A measure of RNA secondary structure conservation, based on computing an RNAalifold consensus secondary structure (Structure Conservation Index)
– (2) A measure of thermodynamic stability, based on a z-score that is normalized with respect to both sequence length and base composition and can be calculated without sampling from shuffled sequences (SVM)
– An SVM again to combine 1 and 2
Q-RNA & RNAz
– QRNA
• Limited range of similarity (65%-85%): too dissimilar = incorrect Blast alignments, too similar = no covariation
• Pairwise: limited covariation
– RNAz
• Multiple alignment: better use of sequence variation
• Already applied to mammals (5-species alignment) & Ciona
Problem: human/mouse/rat ncRNAs are not in this range!
The right species for ncRNA detection?
• Human/mouse ncRNA: ~98-100% identity
• 18S rRNA of fugu/Xenopus/human: 95% identity!
• Obvious interest in diverse animal models
• Can't observe covariation
Conserved elements detected using various species
Human/mouse tRNA-Asp