
IT 13 039

Degree project (Examensarbete), 45 hp, June 2013

Generation and Bioinformatic Analysis of Synthetic Ago HITS-CLIP Data

Mehmet Ali Arslan

Institutionen för informationsteknologi / Department of Information Technology

Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, House 4, Floor 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 - 471 30 03. Fax: 018 - 471 30 00. Website: http://www.teknat.uu.se/student

Abstract

Generation and Bioinformatic Analysis of Synthetic Ago HITS-CLIP Data

Mehmet Ali Arslan

Micro-RNAs (miRNAs) have been found to regulate messenger RNA (mRNA) translation and degradation. Various recent studies have focused on miRNA target prediction in order to better understand the rules and nature of miRNA regulation of mRNAs. In this project we aim to create a software module that identifies miRNA target sites on mRNAs. As the basis for this project, we refer to a study that established a platform for identifying miRNA-mRNA interactions in protein-RNA complexes in mouse brain (the Ago HITS-CLIP study). We propose a probabilistic model of the data from this study and generate synthetic sample data according to this model, in order to create a test bed for a discovery module. Our discovery module analyzes the sample data to identify peak regions where the interaction density is high. To evaluate our module, we present results both on synthetic sample data and on data from the Ago HITS-CLIP study.

Printed by: Reprocentralen ITC. IT 13 039. Examiner: Ivan Christoff. Subject reviewer: Lars Arvestad. Supervisor: Jens Lagergren.

Contents

1 Introduction

2 Background
    2.1 mRNA
        2.1.1 Transcription
        2.1.2 Translation
    2.2 miRNA
    2.3 miRNA-mRNA interaction
    2.4 Poisson Distribution

3 Literature Review

4 Methodology
    4.1 Inputs and their handling
        4.1.1 Selected Genome and Genes
        4.1.2 Ago HITS-CLIP Data
    4.2 Synthetic Data Generation
        4.2.1 Motivation
        4.2.2 Parameters
        4.2.3 Generation
    4.3 Peak (Target Site) Detection
        4.3.1 Peak Calling

5 Results
    5.1 Peak calling on synthetic data
        5.1.1 P-values
        5.1.2 Peaks called
    5.2 Peak calling on Ago data

6 Conclusions
    References

7 Bibliography


Chapter 1

Introduction

miRNAs have been found to regulate gene expression by binding to mRNAs and either causing them to degrade or inhibiting their translation. This in turn affects how the proteins encoded by the targeted mRNAs are produced. Hence, the life-related functions these proteins carry out are regulated by miRNAs (see “Background” for details). This is why we are interested in finding miRNA target sites on mRNAs.

As Chi et al. [1] provide a platform for investigating miRNA-mRNA interaction, we use the output of their study as input to ours. By aligning the isolated mRNA tags from the Argonaute-miRNA-mRNA ternary complex to the genome, we aim to identify regions that are dense in interactions with miRNAs in this ternary complex. We propose a probabilistic model of the aligned mRNA tags from [1]. This model is used both for generating synthetic sample data and in the target site detection algorithm. The ability to generate synthetic sample data is important because it provides a test bed for the detection algorithm.

The rest of this report is organized as follows: Chapter 2 gives a brief background on the biology and probability theory behind the project, while Chapter 3 summarizes related studies. The methodology is presented in Chapter 4, and Chapter 5 presents and discusses the results of our experiments. Finally, Chapter 6 concludes the report.


Chapter 2

Background

2.1 mRNA

While the essence or meaning of life is an open debate, the functions necessary for life, including catalyzing metabolic reactions and DNA replication, are carried out by protein molecules. These molecules are encoded by genes in DNA. Messenger RNAs (mRNAs) carry this encoded genetic information for the amino acid sequence of a protein. Gene expression, namely the manufacturing of a protein, happens in two main phases: transcription, which is the generation of the mRNA, and translation of the genetic code residing in the mRNA into a protein [2]. mRNAs in prokaryotic and eukaryotic cells have different properties and act slightly differently. We focus on eukaryotic mRNAs in this study, and the further discussion considers only eukaryotic mRNAs.

2.1.1 Transcription

Transcription is the process in which the mRNA is generated by complementing part of a DNA strand, named the template strand, that includes the genetic information for the protein to be coded. As can be seen in Fig. 2.1, the process starts with the RNA polymerase enzyme (pol II for mRNA transcription) binding to the promoter region of the gene on the template strand of the DNA. RNA polymerase then unwinds the DNA and adds complementary nucleotides to the 3' end of the newly generated RNA until it reaches the termination site. The result is called a pre-mRNA, which is processed further to produce the mature mRNA [2]. This post-processing includes splicing.


Figure 2.1: DNA transcription to RNA. Figure from Sadava et al. ©2008 Sinauer Associates [3]. Used with permission.

2.1.2 Translation

Translation is the synthesis of a protein from the information residing in the mRNA that codes for this protein (see Fig. 2.2). With the help of transfer RNAs (tRNAs) and the ribosome, the mRNA is read codon by codon (a codon being three nucleotides that specify an amino acid) to synthesize the amino acid chain that constitutes the initial form of a protein. For each codon that is read, a tRNA with the corresponding anticodon carries the amino acid coded by the codon to the ribosome and transfers it to the growing amino acid chain [2, 4].

Figure 2.2: Summary of the translation process. Figure from Mariana Ruiz Villarreal/Wikimedia Commons. Used with permission.

2.2 miRNA

Before we introduce miRNAs, note that only metazoan miRNAs and their functions in metazoans are considered in this project. Thus, the reader should read our discussion as applying to metazoans only.

The history of micro RNAs (miRNAs) goes back to 1993 [5], when it was discovered that the abundance of the LIN-14 protein was regulated by a short RNA product through inhibition of its translation. The discovery was considered peculiar until, at the turn of the millennium, studies reported that the evolutionarily conserved [6] let-7 miRNA regulates the expression of several genes [7]. Today, a search in PubMed with the keyword “microRNA” gives more than 15000 citations, which gives an idea of the amount of interest in miRNAs in the new century.


Figure 2.3: miRNA biogenesis steps. Figure from He and Hannon ©2004 Nature Publishing Group [8]. Used with permission.

miRNAs are ~22-nucleotide RNAs. The precursor of the mature miRNA (the pre-miRNA) is a ~70-nucleotide imperfectly base-paired hairpin segment of the RNA that the miRNA is derived from [9]. Further biogenesis steps transform the pre-miRNA into the mature miRNA (see Fig. 2.3). The mature miRNA, together with Argonaute and several other proteins, is assembled into a complex named the RNA-induced silencing complex (RISC), which is also referred to as the miRNA-protein complex (miRNP). The miRNA directs this complex to the binding site on the mRNA so that the miRNP can perform its functions on the mRNA.

The rules governing miRNA function are not fully understood. This is one of the reasons why there is an increasing amount of diverse research focused on miRNAs. However, there is some common ground, such as the fact that the human genome encodes several hundred unique miRNAs and that these miRNAs interact with thousands of mRNAs [10], which in turn means that a single miRNA can interact with more than one mRNA. It is also known that miRNA expression levels are often perturbed in disease states [11, 12, 13], which is another important reason for research interest.

2.3 miRNA-mRNA interaction

miRNAs interact with mRNAs by complementing mRNA target (interaction) sites. The effect of the interaction depends on the level of complementarity. Perfect complementarity causes endonucleolytic cleavage (i.e. a split of the strand into two by cleavage of the phosphodiester bond between nucleotides) of the RISC-bound mRNA. This occurs only if the RISC includes an Argonaute protein that is capable of endonucleolytic cleavage. Cleaved mRNAs are degraded as a result [14, 15]. In mammals, Ago2 is the only Argonaute protein capable of endonucleolytic cleavage [16]. This kind of cleavage can happen with perfect or near-perfect matches, and in a near-perfect match the positions of the mismatches affect how efficiently cleavage occurs [17, 18]. The more intriguing interaction between miRNAs and mRNAs, however, involves partial complementarity. Studies show that miRNAs that partially complement their mRNA target sites can cause translation inhibition [19, 17]. This decreases the expression of the protein that the mRNA codes for, whereas cleavage through (near-)perfect complementarity decreases mRNA abundance through degradation. Further research on the effects of target site sequence and complementarity on the expression or repression of the protein coded by the target mRNA identifies residues 2 to 8 at the 5' end of the miRNA as a so-called seed section, which should be almost perfectly complementary to the mRNA to cause translation inhibition [20, 21].

2.4 Poisson Distribution

For data modeling purposes later on (see Section 4.2.2.1), we use the Poisson distribution, which we introduce below.

Named after Siméon Poisson, the Poisson process is a counting process that can be used to describe many everyday situations, such as customer arrivals at a queue or phone calls made to a call center. A counting process $\{C(t), t \ge 0\}$ is a stochastic process that counts the events that have occurred up to time $t$, where $C(t)$ is non-negative and integer-valued for all $t \ge 0$ and $C(t)$ is non-decreasing in $t$.

A Poisson distribution, on the other hand, is a discrete probability distribution that describes the probability of a given number of events occurring in a fixed interval of time or space. Namely, the Poisson distribution describes how events occur in a Poisson process. The average rate of occurrences is known, and occurrences in different intervals are independent of each other [22].

Characteristics of a Poisson process include [23]:

• The average rate of success/occurrence (expectation) is known.

• The probability of a single success in an interval is proportional to the size of the interval.

• The probabilities of successes in different intervals are independent of each other.

• The probability that a success will occur in an extremely small region is virtually zero.

The key parameter of the Poisson distribution is the average success rate (expectation) $\lambda$, which is used to describe the variance, mode, etc. of the distribution.

Random splitting of a Poisson process results in two independent Poisson processes. Conversely, several independent Poisson processes can be combined into a new Poisson process; the random variable of the combined process is the sum of the random variables of the independent processes that constitute it [24].

The probability mass function of the discrete stochastic variable $k$ for a Poisson distribution is given as [25]:

$$ f(k;\lambda) = \Pr(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!} \tag{2.1} $$

where $\lambda$ is the average success rate, and $e$ is the base of the natural logarithm.
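As a small numerical illustration of Eq. 2.1, the following Python sketch evaluates the probability mass function. The function name `poisson_pmf` and the example value $\lambda = 2$ are our own choices for illustration and are not taken from the project code.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Pr(X = k) for a Poisson distribution with expectation lam (Eq. 2.1)."""
    if k < 0:
        return 0.0
    if lam == 0:
        # With a zero rate, the only possible outcome is k = 0.
        return 1.0 if k == 0 else 0.0
    # Evaluate in log space so that large k does not overflow k! or lam**k.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


# Illustrative check: the probabilities over all k sum to 1
# (the sum is truncated at k = 50, where the remaining tail is negligible).
total = sum(poisson_pmf(k, 2.0) for k in range(51))
```

Working in log space is only a numerical convenience; mathematically the function is exactly the expression in Eq. 2.1.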

See subsection “4.2.2.1 Poisson Expectations” for the use of the Poisson process and the distribution in this project, together with the motivation for doing so.


Chapter 3

Literature Review

mRNA expression is regulated by miRNAs via miRNA-containing ribonucleoprotein particles (miRNPs). Argonaute proteins are part of these particles, where they bind miRNAs and mediate target mRNA recognition [26]. Immunopurification is a technique for purifying the proteins and the RNAs in RNP complexes. This is achieved using antibodies against the constituent proteins [27, 28].

Microarray analysis is a technique used for identifying the genes that are active in a target sample. The mRNA isolated from the target sample is converted into complementary DNA (cDNA), which is then labeled with a fluorescent dye. The labeled cDNA is applied to a silicon chip or a glass slide covered with single-stranded DNA segments representing each gene. The DNA segments that are complemented by the cDNA identify the genes that are highly expressed [29].

In [30], the authors use miRNP immunopurification to identify functional mRNA targets. Following [1], they use the fact that mRNAs and miRNAs bind to the protein in the miRNP complex (in their case Ago1) at the same time. Their microarray analysis shows a high degree of enrichment for miRNA complementary sites in the 3' UTRs of the mRNAs that were bound to the miRNP in question. They further validate the regulation of miRNP-associated mRNAs by one particular miRNA, miR-1.

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for genome-wide profiling of DNA-binding proteins [31]. ChIP-seq is outside the scope of this work, so we do not consider the details of this technique. We mention it here because in “Model-based Analysis of ChIP-seq data (MACS)” [32], the authors focus on peak identification techniques for short-read sequencer data resulting from ChIP-seq. They try to compensate for various technical characteristics of such sequencers, such as the fact that ChIP-seq tags represent only the ends of the ChIP fragments instead of precise protein-DNA binding sites, or the biases the tags exhibit along the genome due to sequencing and mapping biases, chromatin structure and genome copy number variations. They model the tag distribution along the genome with a Poisson distribution and traverse the genome in windows to find candidate peaks with significant tag enrichment. To evaluate significance, a reference to compare against is needed, namely a background expectation ($\lambda_{BG}$) for having reads at genomic positions. This describes the noise or, in other words, the number of reads expected by chance. Significance is measured by the Poisson distribution p-value based on this $\lambda_{BG}$. Any window with a p-value below a pre-defined threshold is identified as significant. Their default threshold value is $10^{-5}$.

They extend this with the concept of a dynamic background expectation. Instead of using a uniform $\lambda_{BG}$ estimated from the whole genome, they use the maximum of the expectations estimated from several windows of different lengths centered on the position in question ($\lambda_{local}$). This captures the influence of local biases and is robust against occasional low tag counts in small local regions. Through the use of $\lambda_{local}$, false positives that would otherwise be called using $\lambda_{BG}$ are eliminated. This dynamic background expectation is important for our study, since we use a slightly modified version of this technique in peak calling.
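The idea of a dynamic background, as summarized above, can be sketched in Python as follows. This is only an illustration of the description given here, not the MACS implementation: the function names and the window lengths are hypothetical, and tags are represented simply as a sorted list of start positions.

```python
from bisect import bisect_left


def window_rate(sorted_starts, center, length):
    """Tags per position in a window of `length` positions centered on `center`."""
    lo = center - length // 2
    hi = lo + length  # half-open interval [lo, hi)
    count = bisect_left(sorted_starts, hi) - bisect_left(sorted_starts, lo)
    return count / length


def local_lambda(sorted_starts, center, window_lengths=(1000, 5000, 10000)):
    """Maximum of the per-position rates estimated from windows of several lengths.

    The window lengths are assumed values chosen for illustration.
    """
    return max(window_rate(sorted_starts, center, w) for w in window_lengths)


# Hypothetical tag start positions: a dense cluster around position 100.
tag_starts = [100, 101, 102, 103, 5000]
rate_near_cluster = local_lambda(tag_starts, 100, window_lengths=(10, 1000))
```

Taking the maximum means that a position inside a locally dense region is judged against a higher expectation than the genome-wide average would give, which makes the significance test more conservative there.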

Argonaute (Ago) high-throughput sequencing of RNAs isolated by crosslinking immunoprecipitation (HITS-CLIP) is introduced in [1]. HITS-CLIP was developed to “directly identify protein–RNA interactions in living tissues in a genome-wide manner”. The idea stems from the X-ray crystal structures of an Ago-miRNA-mRNA ternary complex, which suggest that Ago maintains close enough contact with both the miRNA bound to it and the nearby mRNAs, which in turn lets Ago HITS-CLIP identify the interaction sites in vivo. We use parts of their output as input to our study. See “Methodology” for further information.


Chapter 4

Methodology

4.1 Inputs and their handling

It is important to describe the inputs in detail so that the reader knows the assumptions made in our study. In this section, we describe the genome assembly used, the list of genes selected and the details of the input from the Ago HITS-CLIP study [1].

4.1.1 Selected Genome and Genes

The Ago HITS-CLIP study was performed on mouse brain, so we limit ourselves to the mouse genome. We use the mm9 mouse genome assembly from the UCSC database for all alignments in this study [33, 34].

Since the Ago HITS-CLIP study was performed on mouse brain, we likewise limit ourselves to mouse genes. We use the genes listed in the refGene table from the RefSeq Genes track in the UCSC database [34], which shows known mouse protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq) [35].

4.1.2 Ago HITS-CLIP Data

As we need mRNA tag samples on which to build our generation model (in order to generate similar data), we use the samples from the Ago HITS-CLIP study [1]. Here we describe the data we chose in more detail.

To identify the particles in the HITS-CLIP results, the authors use a technique called radioisotopic labeling, a molecular tracing technique in which radioisotopes are incorporated into the molecules to be traced. The radiation emitted when these isotopes decay is used to trace the labeled compound(s) [36]. By radiolabeling the results of HITS-CLIP on mouse brain (P13 neocortex), they identify two complexes with different modal molecular sizes, 110 kD and 130 kD. As reported in their study, the 110 kD product corresponds to miRNAs, while the 130 kD product corresponds to mRNAs that simultaneously bind to the Ago complex [1]. In this study we are interested in predicting miRNA target sites, which are located on mRNAs, so we focus on the mRNA product of the HITS-CLIP results, namely the 130 kD product. Since they conducted the same experiment with different mouse brains, we needed to pick one of the samples. We use the sample labeled “Brain A” [1], which contains the tags acquired from the 130 kD product of HITS-CLIP in FASTQ format and is accessible from the website of the study as part of the supplementary material [37]. For more information on the tags generated by Ago HITS-CLIP, please refer to Supplementary Table 1 in the supplement to [1].

More than the tags themselves, we are interested in where they align on the genome. Since the alignments are not provided as supplementary material, we align the tags ourselves using third-party short read mapping software. We use the short read mapper SHRiMP [38] to align the tags to the reference mouse genome mentioned above. To keep things simple, we consider only perfect alignments of the tags to the genome.

4.2 Synthetic Data Generation

In this section we explain why and how we generate synthetic data, covering the motivation, the parameters and an outline of the generation procedure.

While generating synthetic data, we generate reads from the genes in the list described in the “Selected Genome and Genes” section above. This means that we skip reads that do not perfectly align to any gene in our list.

4.2.1 Motivation

The Ago HITS-CLIP sequencing data is available at [37]. To test a miRNA target-site prediction system such as ours, one has to validate its functionality. This can be done by testing with various datasets with different parameters, which can indicate a trustworthy system, or at least a method worth investigating further with costlier methods. To acquire sequencing data that has properties similar to the Ago HITS-CLIP data but still varies, we either have to run our own sequencing experiments or generate synthetic data. While technological improvements have led not just to better tools and methods but also to cheaper ones, generating sequencing data for such datasets is still quite expensive. Thus, while we want to create a system that will be applied to real sequencing data, testing the methodology on synthetic data is a good start.

4.2.2 Parameters

In [1], the authors identify the peaks of clusters of tags using a cubic-spline interpolation technique described in the supplementary material of their study. According to their results, Ago binds within 45-62 nucleotides of these cluster peaks 95% of the time. This region is defined as the Ago-mRNA footprint, and their further results indicate that it is a good prediction site for miRNA targets. In our study we refer to this footprint as the peak window length and use it as one of the parameters.

It is important to evaluate the behavior of our system given datasets with various numbers of peaks. Thus, the number of peaks is another parameter that is used and tweaked in generation.

4.2.2.1 Poisson Expectations

In read generation and peak detection, we assume that the probability of having a read at a particular position is independent of the corresponding probabilities at other positions. We also want to use different probability values for background reads and peak regions. This means that if we know that we are in a non-peak region (i.e. a background region), the probability of having a read there depends only on the background expectation; if, on the other hand, we know that we are in a peak region, the probability of having a read there depends on both the peak region expectation and the background expectation.

As described in Section 2.4, a Poisson process counts independent stochastic events that occur in unit intervals. The Poisson distribution is well suited to describing systems with a very small number of events in a very large domain. In our case, an event is a read starting at a genomic position, the unit interval is 1 base pair, and the very large domain is the transcriptome. The ratio of the total number of reads (taken from the supplementary materials of [1]) to the length of the transcriptome (the total number of positions to check for a read) is very small. Thus the Poisson distribution is a good model for our problem. Note that having a Poisson expectation for each position means that we assume the probabilities for different positions to be independent. The other important property of Poisson processes is that joining two independent Poisson processes again results in a Poisson process, and the joining is a very simple calculation, namely the addition of the expectation values of the two processes. This property is important because it lets us account for the background expectation even in a peak region. We simply add the background expectation to the peak expectation to get the expectation for the position considered.
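The additivity of expectations can be checked numerically. The sketch below convolves the probability mass functions of two independent Poisson variables and compares the result with a single Poisson distribution whose expectation is the sum of the two. The numeric values of the peak and background expectations are hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass function (Eq. 2.1)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def pmf_of_sum(k, lam_a, lam_b):
    """Pr(A + B = k) for independent A ~ Poisson(lam_a) and B ~ Poisson(lam_b),
    computed by explicit convolution of the two mass functions."""
    return sum(poisson_pmf(j, lam_a) * poisson_pmf(k - j, lam_b) for j in range(k + 1))


# Hypothetical per-position expectations.
e_background = 0.001
e_peak = 0.05
e_position = e_peak + e_background  # expectation used for a position inside a peak

# Largest deviation between the convolution and Poisson(e_position) over k = 0..5.
max_difference = max(
    abs(pmf_of_sum(k, e_peak, e_background) - poisson_pmf(k, e_position))
    for k in range(6)
)
```

The two distributions agree up to floating-point rounding, which is why a single Poisson draw with the summed expectation can stand in for separate peak and background draws at a position.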

Background Expectations

To decide whether a region is a peak region or not, we need a reference against which to reason about our observations of this region. In other words, we need some kind of control data or information that we can compare with our observations of the region at hand (e.g. the number of reads that fall into this region). We call this the background expectation, namely the expectation of having a read start at a particular position assuming it is not in a peak region. Given a region in which we count x reads, we can calculate the likelihood of having at least x reads in a non-peak region according to this background expectation. If this likelihood is below a certain threshold, we can call this region a peak, since it is very unlikely to have occurred by chance or noise according to the background expectation.
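The test described above can be sketched as an upper-tail Poisson probability. The function name `prob_at_least`, the window length of 62 positions, the per-position background expectation of 0.001 and the threshold of $10^{-5}$ are all assumed values used only to make the example concrete.

```python
import math


def prob_at_least(x, lam):
    """Pr(X >= x) for X ~ Poisson(lam), summing the upper tail directly."""
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # First term of the tail, Pr(X = x), evaluated in log space.
    term = math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
    total = 0.0
    k = x
    # Add successive terms until they no longer change the sum.
    while term > 0.0 and term > total * 1e-17:
        total += term
        k += 1
        term *= lam / k
    return min(1.0, total)


# Hypothetical setting: 62-position window, background expectation 0.001 per position.
expected_in_window = 62 * 0.001
threshold = 1e-5
five_reads_is_peak = prob_at_least(5, expected_in_window) < threshold
one_read_is_peak = prob_at_least(1, expected_in_window) < threshold
```

Summing the tail directly, rather than computing one minus the lower tail, keeps very small probabilities from being lost to rounding.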

As reported by Chi et al. [1], only 1% of the reproducible Ago-mRNA tags are found in the 5' untranslated region (5' UTR). This low percentage of reproducible tags led us to believe that the number of reads in the 5' UTRs is a good proxy from which to derive background expectations.

Inspired by [32], we define three different backgrounds and respective expectations, namely one background per gene (with expectation $E_{b}^{\mathrm{gene}}$), one per chromosome (with expectation $E_{b}^{\mathrm{chrom}}$), and one global background (with expectation $E_{b}^{\mathrm{global}}$). The motivation behind this decision is to have a varying background and to build a detection algorithm that is resilient to such a background. A dynamic background entails different thresholds for peak calling depending on the chromosome and gene in which the region resides. How these expectations are calculated is explained below.

Peak Expectation

While generating reads starting from a particular peak position, we need to know how many reads to generate. The peak expectation ($E_{p}$) is used at this point. Note that $E_{p}$ is expected to be considerably larger than the background expectations, especially $E_{b}^{\mathrm{global}}$, so that the read generation algorithm generates enough reads in a peak region for that number to have a very low likelihood of being generated by a background expectation.

Calculation of Expectations

First and foremost, it is important to note that every expectation refers to a single position. Namely, if the expectation for a position pos is $E_{pos}$, the expected number of reads starting at position pos is $E_{pos}$. If we want to calculate the expected number of reads in a region of n positions that all belong to the same kind of region as pos (i.e. have the same expectation), we simply multiply the expectation for one position by the length of the region: $E_{pos} \times n$.

As explained above, we use the number of tags found in the 5' UTRs of genes as our input for calculating the background expectations. Hence we first have to calculate the number of alignments that fall in the 5' UTRs of the genes. Using the RefSeq genes table, we determine the 5' UTR of each gene, count the number of alignments from the Ago alignment file (see section “Ago HITS-CLIP Data”) that fall into it, and save this information in a table (the 5' UTR table).

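For a gene g:

if:

$$ n_{g} = \text{\# reads in } g\text{'s 5' UTR} \tag{4.1} $$

$$ l_{g} = g\text{'s 5' UTR length} \tag{4.2} $$

then:

$$ E_{b}^{\mathrm{genes}}[g] = \frac{n_{g}}{l_{g}} \tag{4.3} $$

For a chromosome chr:

$$ E_{b}^{\mathrm{chroms}}[chr] = \frac{\sum_{g \in chr} n_{g}}{\sum_{g \in chr} l_{g}} \tag{4.4} $$

Finally,

$$ E_{b}^{\mathrm{global}} = \frac{\sum_{g \in \mathrm{transcriptome}} n_{g}}{\sum_{g \in \mathrm{transcriptome}} l_{g}} \tag{4.5} $$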
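Equations 4.3 to 4.5 can be sketched in Python as below. The layout of the 5' UTR table (gene name mapped to chromosome, read count and UTR length), the example entries and the handling of zero-length UTRs are assumptions made for this illustration; the source does not specify them.

```python
from collections import defaultdict

# Hypothetical 5' UTR table: gene -> (chromosome, reads in 5' UTR, 5' UTR length).
utr5_table = {
    "geneA": ("chr1", 2, 200),
    "geneB": ("chr1", 0, 300),
    "geneC": ("chr2", 3, 500),
}


def background_expectations(table):
    """Return per-gene, per-chromosome and global background expectations."""
    e_genes = {}
    chrom_reads = defaultdict(int)
    chrom_length = defaultdict(int)
    for gene, (chrom, n_g, l_g) in table.items():
        # Eq. 4.3; a gene without a 5' UTR gets expectation 0 (our assumption).
        e_genes[gene] = n_g / l_g if l_g > 0 else 0.0
        chrom_reads[chrom] += n_g
        chrom_length[chrom] += l_g
    # Eq. 4.4: pooled reads over pooled UTR length per chromosome.
    e_chroms = {
        chrom: chrom_reads[chrom] / chrom_length[chrom]
        for chrom in chrom_length
        if chrom_length[chrom] > 0
    }
    # Eq. 4.5: pooled over the whole table.
    total_length = sum(chrom_length.values())
    e_global = sum(chrom_reads.values()) / total_length if total_length > 0 else 0.0
    return e_genes, e_chroms, e_global


e_genes, e_chroms, e_global = background_expectations(utr5_table)
```

Note that the chromosome and global values pool the counts before dividing, as in Eqs. 4.4 and 4.5, rather than averaging the per-gene ratios.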

We mentioned above that multiplying the expectation by the length of the region under consideration gives the expected number of reads. From the supplementary documents (and from the actual data) of the Ago study, we get the actual total number of reads for the entire transcriptome, which we then use to calculate the peak expectation. We get the background expectation(s) needed for the peak expectation calculation using the formulas above. Window length and number of peaks (see “Parameters”) are the other parameters in the following formulas:

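if:

$$ N = \text{total number of reads} \tag{4.6} $$

$$ l = \text{window length} \tag{4.7} $$

$$ p = \text{\# of peaks to be generated} \tag{4.8} $$

$$ b = |\mathrm{transcriptome}| - p \times l \tag{4.9} $$

then:

$$ (p \times l \times E_{p}) + (b \times E_{b}^{\mathrm{global}}) = N \tag{4.10} $$

and:

$$ E_{p} = \frac{N - (b \times E_{b}^{\mathrm{global}})}{p \times l} \tag{4.11} $$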

Here b represents the number of background positions (namely, the positions that will not reside in peaks), and Eq. 4.10 states that the total number of reads is the number of reads in peaks plus the number of reads at background positions. The number of peak positions is calculated as $p \times l$. Thus, the peak expectation $E_{p}$ is calculated as in Eq. 4.11. A careful reader will notice that we use $E_{b}^{\mathrm{global}}$ in the equation. This conflicts slightly with the idea of having different background expectations for different chromosomes and genes. One could suggest having different expectations for peak positions as well. However, to be able to calculate different peak expectations, we would have to know how many peaks are to be generated in each gene and chromosome. This can be included in future work, but for now we believe that using $E_{b}^{\mathrm{global}}$ here is satisfactory.


4.2.3 Generation

After calculating the expectations as described above, we are ready to generate synthetic sequencing data.

As the real sequencing data that we imitate consists of short reads (namely 32 base-pair tags), we copy 32 bp long segments from genes. The important question is where to copy from: we need to decide where exactly to start a read. The expectations defined in the previous section let us decide whether or not to generate a read starting from a particular position. But as we saw, there is more than one expectation to use. When do we use the peak expectation, and when do we use a background expectation?

We distribute the peaks randomly across the transcriptome. For each position of each gene, we “flip a coin” to decide whether or not a peak starts at this position. “Flipping a coin” is implemented in such a way that the number of peaks generated will be close to the desired number, which is given to the system as p. If the position is to be the start of a peak, we generate a peak with window length l by generating reads in this window according to the peak expectation. We want to be able to cross-check our called peaks (see “Peak Calling”) against these synthetically generated peaks, so we record them as we generate them. Since we do not consider overlapping peaks, we continue traversing the gene after l positions. At all non-peak positions we generate background reads according to the $E_{b}^{\mathrm{gene}}$ of the gene in which the position resides.

The algorithm for synthetic sequencing data generation is given below:

for all g in genes do
    for p in [g.start, g.end] do
        if flip_a_coin() then
            for i in [p, p + l - 1] do
                generate_read(i, E_p)
            end for
            record the peak in synth_peaks
            p = p + l
        else
            generate_read(p, E_b^genes[g])
        end if
    end for
end for

The function generate_read(pos, λ) generates reads that start from the position pos according to the Poisson distribution with the given expectation (λ). g.start and g.end denote the transcript start and end of the gene g, respectively.
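The generation procedure can be sketched in runnable Python as follows. Several details are assumptions made for this illustration and are not taken from the source: the coin bias is set to the desired number of peaks divided by the number of positions, Poisson samples are drawn with Knuth's multiplication method, genes are given as (start, end) pairs, and a read is represented only by its gene and start position (the 32 bp sequence is not copied).

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method (suited to small lam)."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def generate_reads(genes, e_genes, e_peak, n_peaks, window_length, seed=0):
    """genes: name -> (start, end). Returns (reads, synth_peaks)."""
    rng = random.Random(seed)
    total_positions = sum(end - start + 1 for start, end in genes.values())
    peak_start_prob = n_peaks / total_positions  # assumed bias of the "coin"
    reads, synth_peaks = [], []
    for name, (start, end) in genes.items():
        pos = start
        while pos <= end:
            if rng.random() < peak_start_prob:
                last = min(pos + window_length - 1, end)
                for i in range(pos, last + 1):
                    # Peak positions use the peak expectation, as in the outline above.
                    reads.extend([(name, i)] * sample_poisson(e_peak, rng))
                synth_peaks.append((name, pos, last))  # kept for later cross-checking
                pos = last + 1  # peaks do not overlap
            else:
                reads.extend([(name, pos)] * sample_poisson(e_genes[name], rng))
                pos += 1
    return reads, synth_peaks


# Hypothetical single-gene example.
reads, synth_peaks = generate_reads(
    {"g1": (0, 9999)}, {"g1": 0.001}, e_peak=2.0, n_peaks=5, window_length=62, seed=1
)
```

Because the coin is flipped independently at every position, the number of generated peaks is only approximately equal to the requested number.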

4.3 Peak (Target Site) Detection

Our goal is to build a miRNA target site prediction system based on the results of the Ago study. The synthetic sequencing data generation part of our project, described in the previous section, is only a stepping stone towards this goal. With that part explained, we can now assume that we have various datasets that are similar to the sequencing data from the Ago HITS-CLIP study.

After read generation, we align the reads to the genome with the short read mapper SHRiMP, as we did with the Ago reads.

To call a peak, we have to know about the background. In the previous section, we derived the background and peak expectations from the number of Ago HITS-CLIP alignments that fall in the 5' UTRs of the genes in the mouse transcriptome. This was necessary to know where to generate reads and how many.

For peak calling, we derive the expectations in the same way, but this time from the generated data, that is, the data in which we try to detect the peaks (namely the target sites). Thus, in the next section, when we mention the peak expectation or the background expectation, we refer to the expectations for the synthetic data, not the Ago HITS-CLIP data.

4.3.1 Peak Calling

The main idea in our peak calling is to go through each possible window (with length l) in the transcriptome, count the number of alignments that start in this window and check whether or not at least this number of alignments could be background alignments. If the probability of this happening is lower than a threshold, then we call the window a peak.

It is important to note that one unique read that is generated results inmany alignments. Since we control data generation by generation of reads,not alignments, this might cause misleading results. To alleviate this e↵ect,while calculating the number of alignments in a window, we calculate thecontribution of each alignment as 1

n

where n is the number of alignmentsthat are generated through same read.
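The $1/n$ weighting can be illustrated with a short Python sketch. The tuple layout `(read_id, chrom, strand, pos)` and the function name `alignment_weights` are assumptions made for this example; the actual alignment file format is not shown here.

```python
from collections import Counter

def alignment_weights(alignments):
    """Attach a weight of 1/n to every alignment.

    alignments: list of (read_id, chrom, strand, pos) tuples (assumed layout).
    n is the number of alignments that share the same read_id.
    Returns a list of (chrom, strand, pos, weight) tuples.
    """
    per_read = Counter(read_id for read_id, _, _, _ in alignments)
    return [(chrom, strand, pos, 1.0 / per_read[read_id])
            for read_id, chrom, strand, pos in alignments]

if __name__ == "__main__":
    example = [("r1", "chr1", "+", 10), ("r1", "chr2", "-", 500), ("r2", "chr1", "+", 12)]
    for row in alignment_weights(example):
        print(row)
```

With this weighting, the weights of all alignments of one read sum to 1, so the total weight equals the number of aligned reads.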

A crude outline of our peak calling process would be:

for all g in genes do
    for p in [g.start, g.end] do
        k = countAlignmentsInWindow(g.chrom, g.strand, p, l)
        E_b = max(E_b^genes[g], E_b^chroms[g.chrom], E_b^global)
        if cumulative_poisson_prob((E_b × l), k) < threshold then
            call this window a peak
            p = p + l
        end if
    end for
end for

The function countAlignmentsInWindow(g.chrom, g.strand, p, l) counts the number of alignments that start in the window of length l that begins at position p. Chromosome and strand information is also important, since p refers to a position on a chromosome and strands are definitive for mRNAs and alignments. We explain how we implement this function in more detail below; for now, let us continue with the rest of the algorithm outline above.

The line $E_b = \max(E_b^{genes}[g], E_b^{chroms}[g.chrom], E_b^{global})$ is where we take the maximum of the background expectations regarding gene g. We check the observed number of alignments in the current window against this maximum as the background expectation. This is inspired by [32] and aims to protect the system against noise by providing a conservative background estimate. The function cumulative_poisson_prob($\lambda$, k) uses the Poisson probability mass function $\frac{\lambda^k e^{-\lambda}}{k!}$ (2.1) to calculate the probability of having at least k events with the event expectation $\lambda$. As the reader might recall, $E_b$, like all the other expectations, is per position, but we need the probability of having k alignments in a window of length l. That is why we multiply $E_b$ by l when calling cumulative_poisson_prob($\lambda$, k): to get the background expectation for the window. We set the threshold to p/|transcriptome| to limit the expected number of false positives to the number of peaks p that is input to the synthetic sequencing data generation. This threshold can be raised or lowered to increase true positives or decrease false positives, respectively.
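A possible Python sketch of the upper-tail computation is given below. It is an illustration under stated assumptions rather than the thesis code: k is taken to be a non-negative integer (a fractional weighted count would first have to be rounded, which the text does not specify), and the helper name `poisson_pmf` is hypothetical. The tail is summed directly when k exceeds $\lambda$, which avoids the loss of precision of computing one minus a value close to one.

```python
import math

def poisson_pmf(lam, i):
    """Poisson probability mass lam^i * exp(-lam) / i!, evaluated in log space."""
    return math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))

def cumulative_poisson_prob(lam, k):
    """Return P(X >= k) for X ~ Poisson(lam), with k a non-negative integer."""
    if k <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0
    if k <= lam:
        # Lower tail is short here, so subtract it from one.
        return max(0.0, 1.0 - sum(poisson_pmf(lam, i) for i in range(k)))
    # Upper tail: terms shrink monotonically because lam / i < 1 for i > k > lam.
    term = poisson_pmf(lam, k)
    total, i = 0.0, k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)

if __name__ == "__main__":
    window_expectation = 0.001 * 62   # hypothetical per-position expectation times l
    for k in (1, 3, 5):
        print(k, cumulative_poisson_prob(window_expectation, k))
```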

The bottleneck of this algorithm is the function countAlignmentsInWindow(g.chrom, g.strand, p, l), since it is executed for almost every position in the entire transcriptome. Hence, the way we implement this function is critical for the execution time of peak calling. Briefly, the function has to count the alignments that start in the given window and return this count. A brute-force approach that goes through all the alignments one by one for each window, checking whether each lies in the window, results in a runtime of several days for just a small part of the transcriptome.

Instead of the brute-force approach, we introduced indexing to reduce the runtime drastically. We created a large dictionary (hash table) that keeps the number of alignments at each position. To lower the load factor of the dictionary, we index on three tiers: first on chromosome, second on strand, and third on the position relative to the chromosome. At the beginning of peak calling, we go through the alignment file once to initialize this dictionary. With this dictionary, checking the number of alignments that start at a given position is reduced to a dictionary lookup, whereas in the naive approach each alignment had to be visited for each position. A miss on the lookup means that there are no alignments at the position that is looked up. With this indexing, the complexity was reduced from O(|transcriptome| × N) to O(N), where N is the number of alignments.
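The three-tier index could look roughly like the following Python sketch. The function names `build_index` and `count_alignments_in_window`, the input tuple layout, and the half-open window [p, p + l) are assumptions for illustration only.

```python
from collections import defaultdict

def build_index(weighted_alignments):
    """Build a nested dictionary: chromosome -> strand -> start position -> summed weight.

    weighted_alignments: iterable of (chrom, strand, pos, weight) tuples (assumed layout).
    """
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for chrom, strand, pos, weight in weighted_alignments:
        index[chrom][strand][pos] += weight
    return index

def count_alignments_in_window(index, chrom, strand, p, l):
    """Sum the weights of the alignments that start in [p, p + l).

    .get() is used so that a lookup miss does not create new dictionary entries.
    """
    positions = index.get(chrom, {}).get(strand, {})
    return sum(positions.get(i, 0.0) for i in range(p, p + l))

if __name__ == "__main__":
    idx = build_index([("chr1", "+", 10, 1.0), ("chr1", "+", 10, 0.5), ("chr1", "-", 12, 1.0)])
    print(count_alignments_in_window(idx, "chr1", "+", 0, 62))
```

Each window query here still touches l positions; a running sum that adds the entering position and subtracts the leaving one would be a natural refinement, though the text does not say whether this was done.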

Since genes are independent of each other with regard to peak calling, it is possible to divide the transcriptome into collections of genes and do peak calling on these collections in parallel. This further reduces the total runtime of peak calling at the expense of runtime memory, assuming a multi-processor system is available.


Chapter 5

Results

In the following, p refers to the number-of-peaks parameter of the synthetic sequencing data generation, while l is the window length.

5.1 Peak calling on synthetic data

In order to analyze the performance of our model on various datasets, we generated sequencing data with p = 1500, p = 2500, p = 3500, p = 4500 and p = 11500. We also varied the window length from 62 to 150; l = 62 is the “Ago-footprint” taken from [1].

5.1.1 P-values

As mentioned in the peak calling part of the methodology, we traverse the transcriptome window by window to identify the peak windows. To call a window with k alignments a peak, the probability of having k or more alignments in this window, given the background expectation, has to be lower than the threshold. We call this probability the p-value of the window. Note that we weight each alignment's contribution to k as $1/n$, where $n$ is the number of alignments that are generated from the same read.

We plotted these p-values to get a general idea of their distribution. In the plots in Fig. 5.1, the x-axis is the normalized rank of the p-values: the x-coordinate of a point gives the fraction of plotted p-values that are less than the p-value of that point (these lie to the left of the point in the plot). The y-axis is simply the p-value described above.
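As a small illustration of the normalized-rank axis, the following Python sketch computes the plotted points from a list of p-values. The function name `normalized_ranks` is hypothetical, and ties are handled by giving equal p-values the same x-coordinate, which is one reading of "fraction of p-values that are less than" this value.

```python
def normalized_ranks(p_values):
    """Return (x, y) points for the cumulative p-value plot.

    y is a p-value; x is the fraction of all given p-values strictly below y.
    """
    ordered = sorted(p_values)
    n = len(ordered)
    points = []
    below = 0
    for idx, value in enumerate(ordered):
        if idx > 0 and value != ordered[idx - 1]:
            below = idx          # idx values are strictly smaller than this one
        points.append((below / n, value))
    return points

if __name__ == "__main__":
    for x, y in normalized_ranks([0.5, 0.1, 1.0, 0.1]):
        print(x, y)
```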


The total number of windows we traverse in the transcriptome (which is almost the same as the length of the transcriptome) is around 1 billion. However, since the probability of having an alignment at a random position is very low, most of the windows end up having 0 alignments, which in turn gives a p-value very close to 1 (given the background expectation). These p-values are not interesting for plotting purposes, since they would form a long horizontal line at the end of the plot; hence we omitted them to ease both plotting and analysis. The number of windows (and thus the number of points in the plots) after omitting those without alignments is around 14 million, which is about 1/70 of the original number. Observe also that we plotted the dataset with p = 0 (Fig. 5.1a) to see the difference from the other datasets. The number of windows plotted is of a different order of magnitude than the number of called peaks for each set. Keeping this in mind, one can see that each figure except Fig. 5.1a, which is for p = 0, starts with a part where the p-values are very close to zero, and that moving slightly to the right on the x-axis we come to a point where there is a relatively sharp increase in the p-values. This is where the threshold lies. The points below this threshold belong to the called peaks.

5.1.2 Peaks called

After doing peak calling on the synthetic data, we analyze the peak calling performance. This is feasible since we record the starting positions of the generated peaks during synthetic sequencing data generation. Comparing the positions of the called peaks with those of the generated peaks gives us an idea of how well the peak calling performs.

We label a called peak as a true positive if it overlaps a generated peak by at least one base pair. As described in the methodology, we jump l positions in the traversal when we call a window a peak, to avoid overlapping called peaks. As a consequence, we consider even the smallest overlap as an indication of a true positive. Otherwise we would classify called peaks that are actually related to generated peaks as false positives, which would be misleading.

Figure 5.1: Cumulative p-value distributions for peak calling on synthetic datasets with various numbers of peaks. Panels: (a) p = 0, (b) p = 1500, (c) p = 2500, (d) p = 3500, (e) p = 4500, (f) p = 11500.

Another artifact of jumping after calling a peak is that we call approximately two peaks per synthetically generated peak. The explanation is that, during peak calling, the end of the window we are currently checking contains some alignments from the beginning of a generated peak, sufficiently many to call this window a peak. We then jump to the next non-overlapping window (since we called the previous window a peak), which also contains sufficiently many alignments from the end of the same generated peak, so we call this new window a peak too. Thus, we end up with two called peaks per generated peak. In tables 5.1 and 5.2, we call the number of generated peaks that are overlapped by at least one called peak the unique true positives. Correspondingly, we call the number of called peaks that overlap a generated peak all true positives.
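The counting rules for unique true positives, all true positives and false positives can be sketched in Python as follows. This is a simplified illustration: the function name `evaluate_calls` is hypothetical, all positions are assumed to lie on the same chromosome and strand, and both called and generated peaks are treated as windows of length l, so two windows overlap when their start positions differ by less than l.

```python
def evaluate_calls(called, generated, l):
    """Compare called peak starts with generated peak starts (same chromosome and strand).

    A called window [c, c + l) overlaps a generated window [g, g + l)
    by at least one base pair exactly when abs(c - g) < l.
    """
    hit_generated = set()
    all_true = 0
    for c in called:
        hits = [g for g in generated if abs(c - g) < l]
        if hits:
            all_true += 1
            hit_generated.update(hits)
    return {
        "unique_true_positives": len(hit_generated),
        "all_true_positives": all_true,
        "false_positives": len(called) - all_true,
        "called_peaks": len(called),
    }

if __name__ == "__main__":
    # One generated peak at 100 is hit by two called windows (60 and 122).
    print(evaluate_calls(called=[60, 122, 500], generated=[100, 1000], l=62))
```

The example input reproduces the doubling effect described above: one generated peak accounts for two entries in "all true positives" but only one unique true positive.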

To eliminate these artifacts of jumping whenever a peak is called, the jumping should be replaced with a delay in peak calling. That is, whenever the number of alignments detected results in a p-value below the threshold, the system should calculate the p-values for the next l windows and call the one with the lowest p-value as the peak. This way, there will be at most one called peak per generated peak, and the positional accuracy of the called peaks will also increase.
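The proposed delay could be sketched as below. This is only one possible reading of the proposal, which was not implemented in this study: the function name `call_peaks_with_delay` is hypothetical, the p-values are assumed to be precomputed per window start, the candidate set is the triggering window plus the next l windows, and traversal is assumed to resume l positions after the chosen window.

```python
def call_peaks_with_delay(p_values, l, threshold):
    """Call peaks with a delay instead of an immediate jump.

    p_values[i] is the p-value of the window starting at position i.
    Returns the start positions of the called peaks.
    """
    peaks = []
    i = 0
    n = len(p_values)
    while i < n:
        if p_values[i] < threshold:
            # Look at the triggering window and the next l windows,
            # then keep the one with the lowest p-value.
            candidates = range(i, min(i + l + 1, n))
            best = min(candidates, key=lambda j: p_values[j])
            peaks.append(best)
            i = best + l          # assumed: resume after the called window
        else:
            i += 1
    return peaks

if __name__ == "__main__":
    values = [1.0] * 200
    values[50], values[80], values[100] = 1e-7, 1e-12, 1e-8
    print(call_peaks_with_delay(values, l=62, threshold=1e-6))
```

In the example, the immediate-jump strategy would call the window at position 50, whereas the delayed version selects the window at position 80, where the p-value is lowest.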

Tables 5.1 and 5.2 are the summary tables for peak calling done on synthetic sequencing data with various parameters. The window length is different for each table.

p        unique true    all true     false        called
         positives      positives    positives    peaks
1500     812            1643         2244         3887
2500     1256           2537         3783         6320
3500     1793           3629         5149         8778
4500     2249           4568         6615         11183
11500    5674           11232        15592        26824

Table 5.1: Peak calling analysis for l = 62

p        unique true    all true     false        called
         positives      positives    positives    peaks
1500     777            1588         2299         3887
2500     1297           2653         3792         6445
3500     1811           3691         5190         8881
4500     2307           4680         6550         11230
11500    5787           11377        15841        27218

Table 5.2: Peak calling analysis for l = 150

As seen in both table 5.1 and table 5.2, the number of false positives increases steadily with p. This can be explained as follows: increasing p increases the total number of reads, since ~20-40 reads are generated for each peak. This in turn results in a much higher number of alignments that are caused by the same read. For example, the total number of unique alignments is 993544 for p = 1500, while it is 1244820 for p = 4500 (both values are for l = 150). This causes more peaks to be called, and eventually more false positives.
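For reference, the numbers in tables 5.1 and 5.2 can be turned into two simple ratios with the short Python sketch below: the fraction of generated peaks that were recovered (unique true positives divided by p) and the fraction of called peaks that overlap a generated peak (all true positives divided by called peaks). These ratios are derived here purely from the table entries; the names `TABLE_5_1`, `TABLE_5_2` and `summarize` are introduced for this example.

```python
# Rows: (p, unique true positives, all true positives, false positives, called peaks)
TABLE_5_1 = [  # l = 62
    (1500, 812, 1643, 2244, 3887),
    (2500, 1256, 2537, 3783, 6320),
    (3500, 1793, 3629, 5149, 8778),
    (4500, 2249, 4568, 6615, 11183),
    (11500, 5674, 11232, 15592, 26824),
]
TABLE_5_2 = [  # l = 150
    (1500, 777, 1588, 2299, 3887),
    (2500, 1297, 2653, 3792, 6445),
    (3500, 1811, 3691, 5190, 8881),
    (4500, 2307, 4680, 6550, 11230),
    (11500, 5787, 11377, 15841, 27218),
]

def summarize(rows):
    """Per row: (p, fraction of generated peaks recovered, fraction of called peaks that are true)."""
    return [(p, unique / p, all_true / called)
            for p, unique, all_true, _false, called in rows]

if __name__ == "__main__":
    for name, rows in (("l = 62", TABLE_5_1), ("l = 150", TABLE_5_2)):
        for p, recovered, true_fraction in summarize(rows):
            print(f"{name}  p = {p:5d}  recovered = {recovered:.3f}  true fraction = {true_fraction:.3f}")
```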


5.2 Peak calling on Ago data

We also tried peak calling on the Ago data provided in [37]. This data is the result of HITS-CLIP with an Ago antibody on three different mouse brains, labeled A, B and C. As mentioned before, we used the 130 kD sized samples from their sequencing results. In table 5.3, we summarize the called peaks according to the region of the transcript they fall in.

Dataset    Peaks in    Peaks in    Peaks in    Total
           3′ UTR      5′ UTR      CDS
Brain A    315         28          238         581
Brain B    927         60          1070        2057
Brain C    744         39          630         1413

Table 5.3: Regions for called peaks for Ago data

In the supplementary material to their study, Chi et al. plot the regional distribution of mRNA tag clusters within the transcribed genes (determined by RefSeq annotation) in supplementary figure 8 [1]. The results in table 5.3 are close to this distribution of clusters (called peaks in this study) in terms of the percentages of clusters discovered in the various regions. The percentages they present are 1.9%, 44.9% and 53.2% for 5′ UTR, CDS and 3′ UTR, respectively. In table 5.3, the distribution is 3%, 48% and 49% for the same respective regions. On the other hand, the total number of clusters is 11118 in their study, while we discovered only 4051. This, combined with the numbers in table 5.2, indicates that our method needs improvement with respect to clusters left undiscovered.
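The percentages quoted for table 5.3 follow from summing the three brains per region, as the following Python sketch shows. The names `TABLE_5_3` and `regional_percentages` are introduced here for illustration.

```python
TABLE_5_3 = {
    "Brain A": {"3' UTR": 315, "5' UTR": 28, "CDS": 238},
    "Brain B": {"3' UTR": 927, "5' UTR": 60, "CDS": 1070},
    "Brain C": {"3' UTR": 744, "5' UTR": 39, "CDS": 630},
}

def regional_percentages(table):
    """Return (total number of peaks, {region: percentage of all peaks})."""
    totals = {}
    for counts in table.values():
        for region, n in counts.items():
            totals[region] = totals.get(region, 0) + n
    grand_total = sum(totals.values())
    return grand_total, {region: 100.0 * n / grand_total for region, n in totals.items()}

if __name__ == "__main__":
    total, shares = regional_percentages(TABLE_5_3)
    print("total peaks:", total)
    for region, share in shares.items():
        print(f"{region}: {share:.1f}%")
```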


Chapter 6

Conclusions

mRNA-miRNA interaction is a key topic in understanding the rules and regulation of gene expression. In this study, we focused on the results presented in [?], which introduce the Ago-miRNA-mRNA ternary complex as a promising platform for investigating miRNA target sites. We proposed a probabilistic model of a part of their result data, and generated several synthetic datasets using this model to test our software for target site detection.

Our results on the synthetic data showed that our relatively simple probabilistic model works reasonably well regarding true positives, as the detection software finds roughly half of the generated peaks (between 49% and 54%) in all the tests we conducted. In addition, the results from tests on genuine data from [37] showed that the regional distribution of the interaction sites (peaks) our software identified is close to the distribution given in [1]. Despite these positive results, there is vast room for improvement and further study. As discussed in Chapter 5, there are side effects of our method of synthetic data generation; these should be alleviated in the generation process, or a pre-processing phase should be introduced before detection. Furthermore, partially because of these side effects, the numbers of false positives and unidentified peaks are higher than expected. This lowers the quality of the results, since in a real scenario called peaks have to be cross-checked to find out whether they actually correspond to interesting sites in the respective genome.

Even with the problems mentioned, we believe that this study provides a promising introduction to important research in mRNA-miRNA interaction.


Chapter 7

Bibliography

[1] S.W. Chi, J.B. Zang, A. Mele, and R.B. Darnell. Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature, 2009.

[2] S. Clancy and W. Brown. Translation: DNA to mRNA to protein. 2008.

[3] D. Sadava, H.C. Heller, and G.H. Orians. Life: The Science of Biology. W. H. Freeman, 2008.

[4] B.A. Pierce. Genetics: A Conceptual Approach. W. H. Freeman, 2010.

[5] R.C. Lee, R.L. Feinbaum, and V. Ambros. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell, 75(5):843–54, 1993.

[6] A.E. Pasquinelli, B.J. Reinhart, F. Slack, M.Q. Martindale, M.I. Kuroda, B. Maller, D.C. Hayward, E.E. Ball, B. Degnan, P. Müller, J. Spring, A. Srinivasan, M. Fishman, J. Finnerty, J. Corbo, M. Levine, P. Leahy, E. Davidson, and G. Ruvkun. Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature, 408(6808):86–9, 2000.

[7] B.J. Reinhart, F.J. Slack, M. Basson, A.E. Pasquinelli, J.C. Bettinger, A.E. Rougvie, H.R. Horvitz, and G. Ruvkun. The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature, 403(6772):901–6, 2000.

[8] L. He and G.J. Hannon. MicroRNAs: small RNAs with a big role in gene regulation. Nat Rev Genet, 5(7):522–31, 2004.

[9] B.P. Lewis, C.B. Burge, and D.P. Bartel. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell, 120(1):15–20, 2005.


[10] V. Ambros, B. Bartel, D.P. Bartel, C.B. Burge, J.C. Carrington, X. Chen, G. Dreyfuss, S.R. Eddy, S. Griffiths-Jones, M. Marshall, M. Matzke, G. Ruvkun, and T. Tuschl. A uniform system for microRNA annotation. RNA, 9(3):277–9, 2003.

[11] Jun Lu, Gad Getz, Eric A Miska, Ezequiel Alvarez-Saavedra, Justin Lamb, David Peck, Alejandro Sweet-Cordero, Benjamin L Ebert, Raymond H Mak, Adolfo A Ferrando, et al. MicroRNA expression profiles classify human cancers. Nature, 435(7043):834–838, 2005.

[12] J. Lu, G. Getz, E.A. Miska, E. Alvarez-Saavedra, J. Lamb, D. Peck, A. Sweet-Cordero, B.L. Ebert, R.H. Mak, A.A. Ferrando, J.R. Downing, T. Jacks, H.R. Horvitz, and T.R. Golub. A microRNA polycistron as a potential human oncogene. Nature, 435(7043):834–8, 2005.

[13] I. Alvarez-Garcia and E.A. Miska. MicroRNA functions in animal development and human disease. Development, 132(21):4653–62, 2005.

[14] Y. Tomari and P.D. Zamore. Perspective: machines for RNAi. Genes Dev, 19(5):517–29, 2005.

[15] J. Martinez and T. Tuschl. RISC is a 5’ phosphomonoester-producing RNA endonuclease. Genes Dev, 18(9):975–80, 2004.

[16] J. Liu, M.A. Carmell, F.V. Rivas, C.G. Marsden, J.M. Thomson, J. Song, S.M. Hammond, L. Joshua-Tor, and G.J. Hannon. Argonaute2 is the catalytic engine of mammalian RNAi. Science, 305(5689):1437–41, 2004.

[17] R.J. Jackson and N. Standart. How do microRNAs regulate gene expression? Sci STKE, 2007(367):re1, 2007.

[18] S. Yekta, I. Shih, and D.P. Bartel. MicroRNA-directed cleavage of HOXB8 mRNA. Science, 304(5670):594–6, 2004.

[19] M.R. Fabian, N. Sonenberg, and W. Filipowicz. Regulation of mRNA translation and stability by microRNAs. Annu Rev Biochem, 79, 2010.

[20] J. Brennecke, A. Stark, R.B. Russell, and S.M. Cohen. Principles of microRNA-target recognition. PLoS Biol, 3(3):e85, 2005.

[21] J.G. Doench and P.A. Sharp. Specificity of microRNA target selection in translational repression. Genes Dev, 18(5):504–11, 2004.

[22] Frank A. Haight. Handbook of the Poisson distribution. Wiley, 1967.

[23] Stat Trek. Poisson distribution, http://stattrek.com/probability-distributions/poisson.aspx. Date accessed: 2012-02-01.

[24] O. J. Boxma and U. Yechiali. Poisson processes. 2008.


[25] Wikipedia. Poisson distribution, http://en.wikipedia.org/wiki/Poisson_distribution. Date accessed: 2012-02-01.

[26] M. Landthaler, D. Gaidatzis, A. Rothballer, P.Y. Chen, S.J. Soll, L. Dinic, T. Ojo, M. Hafner, M. Zavolan, and T. Tuschl. Molecular characterization of human Argonaute-containing ribonucleoprotein complexes and their bound target mRNAs. RNA, 2008.

[27] F. Lejeune and L.E. Maquat. Immunopurification and analysis of protein and RNA components of mRNP in mammalian cells. Methods Mol Biol, 257, 2004.

[28] A. Galgano and A.P. Gerber. RNA-binding protein immunopurification-microarray (RIP-Chip) analysis to profile localized RNAs. Methods Mol Biol, 714, 2011.

[29] Scitable by Nature Education. Scientists Can Study an Organism’s Entire Genome with Microarray Analysis, http://www.nature.com/scitable/topicpage/scientists-can-study-an-organism-s-entire-6526266. Date accessed: 2012-12-01.

[30] G. Easow, A.A. Teleman, and S.M. Cohen. Isolation of microRNA targets by miRNP immunopurification. RNA, 13(8):1198–204, 2007.

[31] P.J. Park. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet, 2009.

[32] Y. Zhang, T. Liu, C. Meyer, J. Eeckhoute, D. Johnson, B. Bernstein, C. Nussbaum, R. Myers, M. Brown, W. Li, and X. Liu. Model-based Analysis of ChIP-Seq (MACS). Genome Biol, 9(9):R137, 2008.

[33] R.H. Waterston, K. Lindblad-Toh, E. Birney, J. Rogers, Abril, et al. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–62, 2002.

[34] University of California, Santa Cruz. UCSC Genome Browser. http://genome.ucsc.edu/. Date accessed: 2012-01-01.

[35] K.D. Pruitt, T. Tatusova, W. Klimke, and D.R. Maglott. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res, 2008.

[36] A.N.H. Creager. Phosphorus-32 in the Phage Group: radioisotopes as historical tracers of molecular biology. Stud Hist Philos Biol Biomed Sci, 40(1):29–42, 2009.

[37] The Rockefeller University. Ago-miRNA-mRNA ternary map by HITS-CLIP, http://ago.rockefeller.edu. Date accessed: 2011-11-01.


[38] University of Toronto Computational Biology Lab. SHort Read Mapping Package, http://compbio.cs.toronto.edu/shrimp/. Date accessed: 2012-03-01.
