Image Retrieval on Large-Scale Image Databases


Eva Hörster
Multimedia Computing Lab, University of Augsburg, Augsburg, Germany
[email protected]

Rainer Lienhart
Multimedia Computing Lab, University of Augsburg, Augsburg, Germany
[email protected]

Malcolm Slaney
Yahoo! Research, Santa Clara, CA 95054, USA
[email protected]

ABSTRACT

Online image repositories such as Flickr contain hundreds of millions of images and are growing quickly. Along with this growth, the need to support indexing, searching and browsing is becoming more and more pressing. In this work we employ the image content as a source of information to retrieve images. We study the representation of images by Latent Dirichlet Allocation (LDA) models for content-based image retrieval. Image representations are learned in an unsupervised fashion, and each image is modeled as a mixture of the topics/object parts depicted in it. This allows us to project images into subspaces for higher-level reasoning, which in turn can be used to find similar images. Different similarity measures based on the described image representation are studied. The presented approach is evaluated on a real world image database consisting of more than 246,000 images and compared to image models based on probabilistic Latent Semantic Analysis (pLSA). Results show the suitability of the approach for large-scale databases. Finally, we incorporate active learning with user relevance feedback into our framework, which further boosts the retrieval performance.

Categories and Subject Descriptors

I.5.10 [Pattern Recognition]: Models—Statistical; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods

General Terms

Algorithms, Experimentation, Performance

Keywords

large-scale image retrieval, latent Dirichlet allocation

1. INTRODUCTION

Nowadays online image repositories exist that contain hundreds of millions of images of all kinds of quality, size and content. One example of such an image repository is Flickr™. These repositories grow day by day, making


techniques for navigating, indexing, and searching them essential. Currently, indexing is mainly based on manually entered tags and/or individual and group usage patterns. Manually entered tags, however, are very subjective and do not necessarily refer to the shown image content. A good example is the tag "Christmas" on Flickr: only a very small fraction of the images depict the religious event, as one might expect. Instead, the tag often denotes the time and date of creation, so thousands of vacation and party photos pop up with no real common theme. This subjectivity and ambiguity of tags makes image retrieval based on manually entered tags difficult.

In this work we employ a different source of information to retrieve images: the image content. Recently, advanced generative models originally developed for statistical text modeling in large document collections, such as probabilistic Latent Semantic Analysis (pLSA) [7] and Latent Dirichlet Allocation (LDA) [4], have been introduced and repurposed for image content analysis tasks such as scene classification [8] and object recognition [13]. Documents are modeled as mixtures of intermediate (hidden) topics (also called aspects) under the assumption of a bag-of-words document representation. Applied to visual tasks, the mixture of hidden topics refers to the degree to which a certain object/scene type is contained in the image. In the ideal case, this gives rise to a low-dimensional description of the coarse image content and thus enables retrieval in very large databases.

Given unlabeled training images, the probability distributions of the above mentioned models are estimated in a completely unsupervised fashion. The pLSA model has been shown to work in image similarity search tasks in large real world databases [9]. The LDA model is closely related to the pLSA model, but provides a completely generative model and therefore overcomes some problems of pLSA. Thus the suitability of LDA models for the image retrieval problem is studied in this paper. Our evaluation is based on a real world database consisting of more than 246,000 images downloaded from Flickr. The resulting image database was neither cleaned nor preprocessed in any way to increase consistency. Retrieval results are evaluated purely based on image similarity as perceived by ordinary users.

By definition, query-by-example methods are only able to find images of similar content, independent of the precise query concept a user has in mind. Retrieval results are improved by user relevance feedback, since the feedback refines the user's precise query concept. Thus we combine one of the best active learning approaches [15] with the LDA image representation and a novel preprocessing scheme for data selection in order to improve retrieval results. Performance is again evaluated through user studies.

1.1 Related Work

Recently, a few research groups have started to use probabilistic text models [4, 7] for visual retrieval tasks. In this new approach, each image is modeled as a mixture of hidden topics, which in turn model the co-occurrence of so-called visual words within and across images. These models have been successfully applied and extended to scene classification [5, 8, 12] and object categorization [6, 13, 16]. Variations of latent space models have also been applied to the problem of modeling annotated images [2, 3].

In the visual domain, these aspect models have so far mostly been applied to relatively small, carefully selected image databases ranging from a few hundred to a few thousand images. Such databases are far from representative for realistic retrieval tasks on large-scale databases. Our previous work [9] shows that the use of pLSA models (i.e., the topic distributions of the images) improves retrieval performance on large-scale real world image databases. That work centered on finding 'suitable' visual words, and we build on its insights when computing the visual vocabulary.

In this work we combine the generative LDA model [4] with a large-scale real world image retrieval task. The work was inspired by previous work [17] that uses LDA models to improve information retrieval.

1.2 Contributions

The main contributions of this paper are:

• We explore the application of LDA models to content-based image retrieval and judge their suitability through user studies on a real world, large-scale database with more than 246,000 images.

• We evaluate various parameter settings and different distance measures for similarity judgment. In addition, we perform a competitive comparison with the pLSA-based image representation.

• We apply an active learning algorithm [15] to the LDA-based image representation. Retrieval results are further improved by means of a novel data selection method that prunes the set of candidate images used during active learning.

The paper is organized as follows. Section 2 describes the LDA-based image representation: we outline the visual word computation, review the LDA model, and then present different similarity measures for example-based retrieval based on the LDA representation. Experimental results of the proposed retrieval system on the complete image database are shown in Section 3. Section 4 outlines the combination of LDA-based image features and an active learning algorithm; modifications with respect to the original algorithm are described and experimentally evaluated. Section 5 concludes the paper.

2. LDA-BASED IMAGE RETRIEVAL

2.1 Image Representation

Latent Dirichlet Allocation (LDA) [4] is a generative probabilistic model developed for collections of text documents. It represents documents by a finite mixture over latent topics, also called hidden aspects. Each topic in turn is characterized by a distribution over words. In this work we aim to model image databases, not text databases; thus our documents are images, and topics correspond to objects depicted in the images. Most importantly, LDA allows us to represent an image as a mixture of topics, i.e. as a mixture of multiple objects.

The starting point for building an LDA model is to represent the entire corpus of documents by a term-document co-occurrence table of size M × N, where M indicates the number of documents in the corpus and N the number of different words occurring across the corpus. Each matrix entry stores the number of times a specific word (column index) is observed in a given document (row index). Such a representation ignores the order of words/terms in a document and is commonly called a bag-of-words model.

When applying these models to images, a finite number of elementary visual parts, called visual words, are defined in order to enable the construction of the co-occurrence table. Each database image is then searched for occurrences of these visual words. The word occurrences are counted, resulting in a term-frequency vector for each image document. The set of term-frequency vectors constitutes the co-occurrence table of the image database. Since the order of terms in a document is ignored, any geometric relationship between the occurrences of different visual words in images is disregarded.

A finite number of hidden topics is then used in the LDA model to describe the co-occurrence of (visual) words inside and across documents/images. Each occurrence of a word in a specific document is associated with one unobservable topic. Probability distributions of the visual words given a hidden topic, as well as probability distributions of hidden topics given the documents, are learned in a completely unsupervised manner.
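To make the bag-of-words construction concrete, here is a minimal Python sketch with hypothetical variable names; it assumes each image has already been mapped to a list of visual-word indices, as described in Section 2.1.1 below:

```python
import numpy as np

def cooccurrence_table(image_word_ids, vocab_size):
    """Build the M x N term-document count matrix.

    image_word_ids: list of M sequences, each holding the visual-word
    indices (0..vocab_size-1) detected in one image. Word order and
    geometry are deliberately ignored (bag-of-words).
    """
    counts = np.zeros((len(image_word_ids), vocab_size), dtype=np.int32)
    for d, word_ids in enumerate(image_word_ids):
        for w in word_ids:
            counts[d, w] += 1
    return counts

# Three toy "images" over a 5-word vocabulary:
X = cooccurrence_table([[0, 1, 1, 4], [2, 2, 3], [0, 4, 4]], vocab_size=5)
```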

2.1.1 Visual Words Computation

The first step in computing the observable co-occurrence matrix of words in images is to compute a visual vocabulary consisting of N visual words. This is usually derived by vector-quantizing automatically extracted local image descriptors. In this work the well-known SIFT features [10] are chosen as local image descriptors. They are computed in two steps: a sparse set of interest points is detected at extrema in the difference-of-Gaussian pyramid, and a scale, position and orientation are assigned to each interest point. Then we compute a 128-dimensional gradient-based feature vector from the local grayscale gradient neighborhood of each interest point, in a scale- and orientation-invariant manner.

Most works perform k-means clustering on local image features and keep the mean of each cluster as a visual word. In our previous work [9] we investigated three different techniques for computing visual words from local image features for large-scale image databases such as Flickr. Surprisingly, clustering based on subsets of features derived from images with the same tags did not improve performance. This may be a result of the inconsistent labeling often seen in community databases such as Flickr. For visual word computation we use the best performing technique from [9]: merging the results of multiple k-means clusterings on non-overlapping feature subsets. Relatively small sets of features (compared to the entire number of features in all 246,000 images) are selected randomly from all features.


Figure 1: Graphical representation of the LDA model (M denotes the number of images in the database and Nd the number of visual words in image Id). The shaded node denotes the observable random variable w for the occurrence of a visual word; z denotes the topic variable and θ the topic mixture variable.

Then k-means clustering is applied to each subset, and the derived visual words of each subset are amalgamated into the vocabulary. This approach is more efficient with respect to runtime than determining all clusters from one large set of features.

Given the vocabulary, we represent each image Id as consisting of Nd visual words by replacing each detected feature vector by its most similar visual word, defined as the closest word in the 128-dimensional vector space.
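A compact sketch of this vocabulary construction, assuming the SIFT descriptors are already extracted and stacked into one array; scikit-learn's KMeans stands in for whichever k-means implementation was actually used, and the function names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_subsets=12, subset_size=500_000, k=200, seed=0):
    """Merge the centroids of k-means runs on non-overlapping random
    feature subsets into one visual vocabulary of n_subsets * k words."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(descriptors))
    vocab = []
    for s in range(n_subsets):
        subset = descriptors[idx[s * subset_size:(s + 1) * subset_size]]
        vocab.append(KMeans(n_clusters=k, n_init=3).fit(subset).cluster_centers_)
    return np.vstack(vocab)  # shape: (n_subsets * k, 128)

def quantize(descriptors, vocabulary):
    """Replace each 128-d descriptor by the index of its closest visual word.
    Brute force for clarity; at real scale this would be chunked or done
    with an approximate nearest-neighbor index."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```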

2.1.2 LDA Model

Each image Id is represented as a sequence of Nd visual words wn, written wd = {w1, w2, ..., wNd}. In an LDA model [4], the process of generating such an image is described as follows:

• Choose a K-dimensional Dirichlet random variable θ ∼ Dir(α), where K denotes the finite number of topics in the corpus.

• For each of the Nd words wn:

  – Choose a topic zn ∼ Multinomial(θ).

  – Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn.
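Written as code, the generative process is only a few lines; this toy sampler (illustrative sizes and hyperparameters, not the paper's fitted values) draws one synthetic "image" from the model:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N_vocab, N_d = 50, 2400, 300          # topics, visual words, words per image
alpha = np.full(K, 0.1)                  # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.full(N_vocab, 0.01), size=K)  # per-topic word distributions

theta = rng.dirichlet(alpha)             # 1. topic mixture for this image
z = rng.choice(K, size=N_d, p=theta)     # 2a. a topic for every word position
w = np.array([rng.choice(N_vocab, p=beta[zn]) for zn in z])  # 2b. the words
```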

The graphical representation of the LDA model is shown in Figure 1. M indicates the number of images in the entire database and Nd denotes the number of visual words in image Id. The likelihood of an image Id according to this model is given by:

p(\mathbf{w}_d \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N_d} \Big( \sum_{j=1}^{K} p(z_j \mid \theta)\, p(w_n \mid z_j, \beta) \Big)\, d\theta    (1)

The probability of a corpus/database is the product of the marginal probabilities of the single documents. We learn an LDA model by finding the parameters α and β such that the log marginal likelihood of the entire database is maximized. Since Eqn. 1 cannot be solved directly, model parameters are estimated by variational inference [4].

Given the learned corpus parameters α and β, the LDA model allows us to assign probabilities to data outside the training corpus by maximizing the log marginal likelihood of the respective document. Thus we may learn the LDA corpus-level parameters on a subset of the image database (in order to reduce total training time) and then assign probability distributions to all images. This is one of the advantages of the fully generative LDA topic model over the pLSA aspect model: in the pLSA model there exists no direct way to assign probabilities to unseen documents. Additionally, LDA overcomes some overfitting problems of pLSA, which occur because pLSA's large set of parameters is directly linked to the training set [4]. Several extensions of the LDA model have been proposed [3, 14].
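As a sketch of this train-on-a-subset strategy, using scikit-learn's variational-Bayes LDA as a stand-in for the inference procedure of [4] (X is assumed to be the visual-word count matrix from the earlier sketch):

```python
from sklearn.decomposition import LatentDirichletAllocation

# Fit the corpus-level parameters on a training subset only ...
lda = LatentDirichletAllocation(n_components=50, learning_method="batch",
                                max_iter=20, random_state=0)
lda.fit(X[:50_000])          # X: the M x N visual-word count matrix

# ... then infer a topic mixture theta for every image in the database,
# including images never seen during training.
theta = lda.transform(X)     # shape: (M, 50), rows sum to 1
```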

2.2 Image Similarity Measures

Once we have trained an LDA model and computed a probabilistic representation for each image in the database, we need to define an image similarity measure in order to perform image retrieval. In this work we focus on the task of query-by-example, i.e. searching the database for the items most similar to a given query image. The topic mixture θ of each image indicates to what degree a certain topic is contained in the respective image. Based on the topic mixtures, we look at four different ways to measure similarity and evaluate these measures experimentally in Section 3.

First, the similarity between two images Ia and Ib can be measured by calculating the cosine similarity between the topic distributions P(z|θa, α) and P(z|θb, α). The cosine cos(a, b) between two vectors a and b is popular in text retrieval [1] and is defined by:

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}    (2)

A second possibility to measure image similarity is the L1 distance between two topic distributions. The L1 distance between two K-dimensional vectors a and b is given by:

L_1(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{K} |a_i - b_i|    (3)

The third similarity measure that we study is the symmetrized Jensen-Shannon divergence JS(P(z|θa, α), P(z|θb, α)) between the topic distributions of two images. The JS measure is based on the discrete Kullback-Leibler divergence KL(P(z|θa, α), P(z|θb, α)):

JS(\mathbf{a}, \mathbf{b}) = \frac{1}{2}\Big( KL\big(\mathbf{a}, \tfrac{\mathbf{a}+\mathbf{b}}{2}\big) + KL\big(\mathbf{b}, \tfrac{\mathbf{a}+\mathbf{b}}{2}\big) \Big)    (4)

where

KL(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{K} a_i \log \frac{a_i}{b_i}    (5)
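The three measures operating directly on topic distributions are straightforward to implement; a minimal sketch (the epsilon guard in the KL term is an implementation detail not discussed in the paper):

```python
import numpy as np

def cosine_sim(a, b):                 # Eqn. 2
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def l1_dist(a, b):                    # Eqn. 3
    return np.abs(a - b).sum()

def kl_div(a, b, eps=1e-12):          # Eqn. 5
    a, b = a + eps, b + eps
    return (a * np.log(a / b)).sum()

def js_div(a, b):                     # Eqn. 4
    m = (a + b) / 2
    return 0.5 * (kl_div(a, m) + kl_div(b, m))
```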

The fourth measure is adopted from language-based information retrieval. Here, each document is indexed by the likelihood of its model generating the query document, i.e. the most relevant documents are the ones whose model maximizes the conditional probability of the query terms. In content-based image retrieval, a query image can be represented as a sequence of visual words wa, and the above-mentioned likelihood can be written as:

P(\mathbf{w}^a \mid M_b) = \prod_{i=1}^{N_d} P(w_i^a \mid M_b)    (6)

where Mb is the model of image Ib and Nd is the total number of detected visual words in image Ia.


Wei and Croft [17] combine the LDA model and the unigram model with Dirichlet smoothing to estimate the terms P(w_i^a | M_b):

P(w_i^a \mid M_b) = \lambda \cdot P_u(w_i^a \mid M_b^u) + (1 - \lambda) \cdot P_{LDA}(w_i^a \mid M_b^{lda})    (7)

where P_u(w_i^a | M_b^u) is specified by the unigram document model with Dirichlet smoothing according to [18]:

P_u(w_i^a \mid M_b^u) = \frac{N_d}{N_d + \mu} P_{ML}(w_i^a \mid M_b^u) + \Big(1 - \frac{N_d}{N_d + \mu}\Big) P_{ML}(w_i^a \mid D)    (8)

Here D denotes the entire set of images in the database and μ the Dirichlet prior. The term P_{LDA}(w_i^a | M_b^{lda}) in Eqn. 7 refers to the probability of visual word w_i^a of image Ia under the LDA topic model M_b^{lda} of image Ib:

P_{LDA}(w_i^a \mid M_b^{lda}) = P_{LDA}(w_i^a \mid \alpha, \theta_b, \beta) = \sum_{j=1}^{K} P(w_i^a \mid z_j, \beta) \cdot P(z_j \mid \theta_b, \alpha)    (9)
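Putting Eqns. 6-9 together, a sketch of the IR measure in the log domain to avoid numerical underflow over long word sequences; query_words, counts_b, theta_b, beta and p_ml_corpus are assumed inputs from the earlier steps, and N_d is taken as the word count of image Ib, following the smoothing scheme of [18]:

```python
import numpy as np

def log_ir_score(query_words, counts_b, theta_b, beta, p_ml_corpus,
                 mu=50.0, lam=0.2):
    """log P(w^a | M_b): Dirichlet-smoothed unigram model (Eqn. 8)
    mixed with the LDA word probability (Eqn. 9) as in Eqn. 7."""
    n_b = counts_b.sum()
    p_ml_b = counts_b / max(n_b, 1)        # ML unigram estimate of image b
    p_u = (n_b / (n_b + mu)) * p_ml_b + (mu / (n_b + mu)) * p_ml_corpus
    p_lda = theta_b @ beta                 # Eqn. 9: sum_j P(w|z_j, beta) P(z_j|theta_b)
    p = lam * p_u + (1.0 - lam) * p_lda    # Eqn. 7
    return np.log(p[query_words]).sum()    # Eqn. 6 in the log domain
```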

3. EXPERIMENTAL RESULTS

The objective of example-based image retrieval is to obtain images with content similar to a given sample image. We evaluate retrieval results based on the judgments of several test users about the visual similarity of the retrieved images with respect to the query image.

All experiments are performed on a database consisting of approximately 246,000 images. The images were selected from all public Flickr images uploaded prior to September 2006 that were labeled as geotagged together with one of the following tags: sanfrancisco, beach, and tokyo. Of these, only images having at least one of the following tags were kept: wildlife, animal, animals, cat, cats, dog, dogs, bird, birds, flower, flowers, graffiti, sign, signs, surf, surfing, night, food, building, buildings, goldengate, goldengatebridge, baseball. The images can thus be grouped into 12 categories as shown in Table 1.

Preselecting a subset of images from the entire Flickr database based on tags is necessary, as Flickr is a repository with hundreds of millions of images. However, it should be noted that indexing purely based on tags is not sufficient, as tags are a very noisy indication of the content shown in the images.

We computed the visual vocabulary from 12 randomly selected non-overlapping subsets, each consisting of 500,000 local features. Each of these subsets produces 200 visual words, giving a total vocabulary size of 2400 visual words.

3.1 Parameter Settings

The first step in evaluating the retrieval system is to determine suitable parameters for the LDA model, such as the number of training images and the number of topics K. Thus a suitable measure to assess the performance with respect to different parameter settings is needed. The perplexity is frequently used to assess the performance of language models and to evaluate LDA models in the context of document modeling [4]. It measures the performance of the model on a held-out dataset Dtest and is defined by:

per(D_{test}) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)    (10)
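A direct implementation of Eqn. 10, assuming log_p_w holds the per-image log likelihoods log p(wd) on the held-out set and n_words the corresponding word counts:

```python
import numpy as np

def perplexity(log_p_w, n_words):
    """per(D_test) = exp(-sum_d log p(w_d) / sum_d N_d); lower is better."""
    return float(np.exp(-np.sum(log_p_w) / np.sum(n_words)))
```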

Category   OR list of tags                        # of images
1          wildlife, animal, animals, cat, cats   28509
2          dog, dogs                              24660
3          bird, birds                            20908
4          flower, flowers                        25457
5          graffiti                               21888
6          sign, signs                            14333
7          surf, surfing                          29552
8          night                                  33142
9          food                                   18602
10         building, buildings                    16826
11         goldengate, goldengatebridge           23803
12         baseball                               12372
Total # of images (note: images may have multiple tags): 246,348

Table 1: Image database and its categories used for experiments

Figure 2: Perplexity vs. number of topics K

This measure decreases monotonically as the likelihood of the test data increases; thus lower values indicate better modeling performance.

In order to evaluate the influence of the choice of the number of hidden topics, we trained an LDA model on a subset of 50,000 images using different numbers of aspects. The perplexity is then calculated on a previously unseen test set of 25,000 images. Figure 2 shows the perplexity plotted against the number of hidden aspects K. One can see that the perplexity decreases with an increasing number of topics. If the number of topics is small, i.e. K < 30, the perplexity grows rapidly, indicating that the model does not fit the unseen data. For K ≥ 30 the perplexity is almost constant. We need a rich image description for our retrieval task, thus we set K = 50 in our experiments.

Figure 3 shows the perplexity for different sizes of the training set, i.e. the number of images in the training set is varied. The number of topics is fixed to K = 50 in order to evaluate the change in perplexity with respect to the number of images used for training the LDA corpus-level parameters. Perplexity is again calculated for each setting on a previously unseen test set consisting of 25,000 images. The perplexity decreases with an increasing number of training samples and is approximately constant for training set sizes above 20,000 images.


Figure 3: Perplexity vs. number of training samples

However, the decrease in perplexity is not as fast as the decrease observed when varying the number of topics; thus no definite conclusion about the appropriate number of training samples can be drawn. The appropriate number of images used to train the LDA model may also depend on other parameters, such as the maximum number of iterations in the variational inference, the number of topics, and the size of the vocabulary. It is still important to note that in our tests it does not seem to be necessary to perform LDA model computation on the entire database, which is a huge advantage for large-scale databases. It also enables adding novel images to the database without relearning the LDA corpus-level parameters, as long as they depict already learned topics.

3.2 Different Similarity Measures

We described different similarity measures for the LDA-based image representation in Section 2.2. Here we evaluate their effect on the image retrieval task, with the number of topics in the LDA model set to 50 and the model trained on 50,000 images. Once we have computed the model, we assign probabilities to all images by maximizing the log marginal likelihood of the respective document (see Section 2.1). The parameters µ and λ of the information-retrieval-based distance measure are set to 50 and 0.2, respectively.

We let users judge the effect of the similarity measures on the retrieval results: we selected five query images per category at random, resulting in a total of 60 query images for the experiments. For each query image the 19 most similar images derived by the four different measures are presented to the users. The test users were asked to rank the retrieval results from best to worst, assigning 3 points to the best technique, 2 points to the second best, and 1 and 0 points to the second worst and worst performing technique, respectively. We compute the average score for each method over all 60 images. It should be noted that sometimes the performance of all four techniques was not satisfying at all; in those cases the user could assign 0 points to all similarity measures. We allow the corresponding procedure in cases where all four techniques produced perfect results: all techniques could earn 3 points.

The resulting mean scores over 10 test users are depicted in Figure 4. The vertical bars mark the standard deviation of the test users' scores. The best performing distance measure is the probability measure adopted from information retrieval (Eqn. 6) [17]. This indicates that retrieval based on the topic distribution is enhanced by also taking word distributions into account.

Figure 4: User preferences for the four image similarity measures using the LDA image representation

Figure 5: User preferences for the four image similarity measures using the pLSA image representation

Note that the word probability calculated from the unigram model is assigned only a small weight of 0.2, whereas the word probability based on the LDA model is assigned a large weight (0.8).

Of the three similarity measures based only on the topic distributions, the Jensen-Shannon divergence performs best, followed by the L1 distance. If image retrieval is performed on large-scale databases, the probability measure from information retrieval may be too time consuming, and dimensionality reduction of the image representation becomes important. In this case one should also consider the second-best approach, the Jensen-Shannon divergence. As word occurrences are needed only to build the LDA representation, just the low-dimensional topic distribution needs to be stored and processed for the retrieval task. This allows us to search even larger databases in reasonable time.

3.3 pLSA versus LDA

In our earlier study [9] we used a latent aspect model to represent images in the context of a retrieval-by-example task. That work combined the topic vector produced by the pLSA model for each image with the cosine distance measure and found that this approach outperformed both pure visual word-occurrence vectors and color coherence vectors [11]. In this work we use the same real world database for evaluation purposes.

In order to determine the most appropriate image representation for the studied retrieval task, the results obtained using LDA-based image features should be compared to those derived using the pLSA-based image representation.


Figure 6: User preferences for the comparison between the retrieval approach using LDA image features and the approach using pLSA image features

Since the previous section shows that retrieval performance depends on the distance measure used, the most suitable distance measure for the pLSA representation needs to be identified first. Thus, all four similarity measures described in Section 2.2 are applied to the pLSA-based image representation.(1) Results on 60 query images are then judged by test users as described in the previous section. Mean scores and standard deviations of 10 test users are shown in Figure 5. A similar result as for the LDA image features is observed for the pLSA-based image representation (see also Section 3.2): the IR measure outperforms all other similarity measures, followed by the Jensen-Shannon divergence, and the cosine distance shows the worst performance. It can also be seen that the resulting scores are more consistent, i.e. they do not show differences between the distance measures as large as those obtained using the LDA image features.

As we obtain relative scores only for the comparison of similarity measures, we still do not know which image features are more appropriate for the retrieval task. Therefore, we compare the retrieval performance of LDA- and pLSA-based features using the best performing similarity measure – the IR measure (Eqn. 6) – on 60 images evaluated by 10 users. Since we compare only two techniques in this experiment, test users judge the retrieval results of each query image by assigning 1 point to the better performing method and 0 points to the other. Mean scores and standard deviations are depicted in Figure 6. It can be clearly seen that the score of the pLSA-based image representation is significantly lower than that of the LDA-based image representation. Thus, we conclude that the LDA-based image representation studied in this work is better suited for the image retrieval task on a large real world database than the pLSA-based image features.

3.4 Results

Finally, we show some retrieval results obtained by the proposed LDA-based system in Figure 7. As one can see, in the top seven rows the system performs very well. The following rows show queries where the returned results are suboptimal; in the last row the system fails completely. The displayed results were obtained using different similarity measures.

(1) The number of topics in the pLSA model is set to 48.

Figure 7: Retrieval results obtained by our LDA-based system. The leftmost image in each row shows the query image; the four images to its right show the most relevant images retrieved.

4. ACTIVE LEARNING

So far, retrieval results have been obtained in a completely unsupervised manner. In this section, active learning is deployed to improve retrieval results. In active learning the user interactively informs the system about the concept he/she is looking for. The system poses 'questions' to the user, which the user answers in order to give the system feedback about his/her actual search goal. Questioning is performed by asking the user to label an image or a set of images as relevant or irrelevant. As users expect the system to capture their desired concept effectively, i.e. quickly and accurately, the main issue in designing such systems is finding the most informative instances to present to the user for labeling. In this work we combine the support vector machine (SVM) based active learning approach [15], applied to the LDA-based image representation, with a simple preprocessing scheme that effectively prunes the image candidate space.

4.1 SVM-based Active Learning

Tong and Chang [15] proposed active learning with support vector machines (SVM) by regarding the task of learning a target concept as the task of separating the relevant images from the irrelevant ones with an SVM binary classifier, i.e. a hyperplane in some high-dimensional space. The presented active learning method works as follows: an SVM classifier is trained in each query round. In the first query round the algorithm is initialized with one relevant and one irrelevant image, and the user labels a randomly selected set of T images. In each following round the T most informative images are presented to the user for labeling. The most informative images are defined as the images closest to the current hyperplane, according to the so-called 'simple method'. After a number of relevance feedback rounds, the most relevant images are presented to the user as the query result. The binary SVM classifier subdivides the space by the hyperplane into two sets, relevant and irrelevant images; thus the most relevant images are those that are farthest from the current SVM boundary in the kernel space and on the relevant side of the hyperplane.

In order to apply this algorithm to images, each image needs to be represented as a vector. We propose to represent the images in the database by their P(z|θ, α) distributions, thus combining the LDA image representation with SVM active learning.

The active learning algorithm works well for small databases with carefully selected images; problems arise when applying it to large-scale databases. First, the user needs to find at least one positive query image to initialize the algorithm. Fortunately, this work considers the query-by-example task, so the example image can be used to initialize the algorithm. A second problem arises from the small number of images showing the desired content relative to the total number of images in the database: if this fraction is very small (as it usually is in large-scale databases), active search becomes much harder.

In order to solve this problem, a preprocessing step is performed before starting the active learning algorithm. This step reduces the total number of images considered while keeping images that likely contain the desired concept, i.e. the active learning algorithm does not work on the entire database of 246,000 images but only on a preselected subset. As a convenient side effect of preprocessing, the computation time of each query round is reduced, since the algorithm runs on a smaller dataset, making active search faster.

The proposed data selection approach takes advantage of the learned LDA image representation: we choose a subset of R images for active learning based on their detected relevance to the query image. Relevance is defined by similarity based on the LDA image representation and the distance measures discussed in Section 2.2. This makes intuitive sense, as the LDA-based image representation models the image content by topic assignment, and thus images having completely different topic distributions are unlikely to match the user's desired concept.
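A condensed sketch of the full procedure with data preselection, using scikit-learn's SVC as the binary classifier; label_fn stands for the user's relevance feedback and is hypothetical, the paper's RBF parameter α is mapped to scikit-learn's gamma as an assumption, and the random first labeling round of [15] is folded into the margin-based loop for brevity:

```python
import numpy as np
from sklearn.svm import SVC

def active_search(theta, query_idx, neg_idx, label_fn, R=20_000, T=20, rounds=3):
    """SVM active learning on LDA topic mixtures with data preselection."""
    # Preselection: keep the R images with smallest L1 distance to the query.
    cand = np.abs(theta - theta[query_idx]).sum(axis=1).argsort()[:R]
    X = theta[cand]
    labeled, labels = [query_idx, neg_idx], [1, 0]   # one relevant, one irrelevant seed
    for _ in range(rounds):
        svm = SVC(kernel="rbf", gamma=0.01).fit(theta[labeled], labels)
        margin = np.abs(svm.decision_function(X))
        ask = cand[margin.argsort()[:T]]             # 'simple method': closest to hyperplane
        labeled += list(ask)
        labels += [label_fn(i) for i in ask]         # user marks relevant (1) / irrelevant (0)
    score = svm.decision_function(X)                 # most relevant: far on the positive side
    return cand[score.argsort()[::-1]]
```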

4.2 Experimental Results

For the evaluation of the active learning algorithm based on LDA image features, the parameters are set as follows: images are represented by their topic distributions, learned from a 50-topic LDA model. We used a radial basis function (RBF) kernel with α = 0.01 in the SVM, and we set the number of query images T presented to the user in each query round to 20. We chose the parameter R, the size of the preselected subset, to be 20,000. This ensures a sufficient downsizing from the original total number of images while keeping an adequate number of images likely containing the desired content.

Figure 8: User preferences for the comparison between the two active learning approaches and the unsupervised approach

The subset of R images is determined by applying the L1 distance to the topic distributions.

The results of the active learning algorithm with pre-filtering are compared to the results obtained by the active learning algorithm without pre-filtering [15] and to the results of unsupervised retrieval using the IR similarity measure. Evaluation is again performed through user studies. 25 sample query images are chosen from the pool of 60 images used for the evaluation in Section 3. As a typical user will most likely perform no more than three to four query rounds, we presented the 19 images most relevant to the given query concept after three rounds of active learning to the test users. Test users compare the results of all three methods; the best performing method earns 2 points, whereas the second best and worst receive 1 and 0 points, respectively. The mean over all 25 images is then calculated, and the results over all 10 test users are depicted in Figure 8. The results show that active learning clearly improves the results compared to completely unsupervised retrieval. Moreover, an additional improvement over the original active learning algorithm [15] is achieved by pre-filtering (i.e., data preselection).

Figure 9 depicts some sample results showing the effectiveness of the presented active learning approach. Three pairs of 20 images are displayed, each pair showing the query image and the nine most relevant images found using the unsupervised algorithm evaluated in Section 3 (top) and after three rounds of active learning with data preselection (bottom). Green dots mark images showing the correct content; red dots mark incorrectly retrieved images. A clear improvement of the results by active learning can be seen.

5. CONCLUSIONS

This work studies the representation of images by Latent Dirichlet Allocation (LDA) models in the context of query-by-example retrieval on a large real world image database consisting of more than 246,000 images. Results show that the approach performs well. The combination of an LDA-based image representation with an appropriate similarity measure outperforms previous approaches such as a pLSA-based image representation. We found that a probability-based similarity measure developed for information retrieval gives the best retrieval results. We examined the application of an active learning algorithm to the LDA-based image features and proposed a novel data subset selection scheme for retrieval in large databases.


Figure 9: Retrieval results: each image pair shows the results obtained by the unsupervised algorithm (top) and by active learning with pre-filtering (bottom)

Future work will verify the results using a larger number of users, and we will incorporate different types of image features.

6. REFERENCES

[1] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[2] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas,D. M. Blei, and M. I. Jordan. Matching words andpictures. J. Mach. Learn. Res., 3:1107–1135, 2003.

[3] D. M. Blei and M. I. Jordan. Modeling annotated data. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 127–134, New York, NY, USA, 2003. ACM Press.

[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.

[5] A. Bosch, A. Zisserman, and X. Munoz. Sceneclassification via pLSA. In Proceedings of theEuropean Conference on Computer Vision, 2006.

[6] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision, pages 1816–1823, Washington, DC, USA, 2005. IEEE Computer Society.

[7] T. Hofmann. Unsupervised learning by probabilisticlatent semantic analysis. Mach. Learn.,42(1-2):177–196, 2001.

[8] F.-F. Li and P. Perona. A Bayesian hierarchical modelfor learning natural scene categories. In CVPR ’05:Proceedings of the 2005 IEEE Computer SocietyConference on Computer Vision and PatternRecognition (CVPR’05) - Volume 2, pages 524–531,Washington, DC, USA, 2005. IEEE Computer Society.

[9] R. Lienhart and M. Slaney. pLSA on large scale imagedatabases. In IEEE International Conference onAcoustics, Speech and Signal Processing, 2007.

[10] D. G. Lowe. Distinctive image features fromscale-invariant keypoints. Int. J. Comput. Vision,60(2):91–110, 2004.

[11] G. Pass, R. Zabih, and J. Miller. Comparing imagesusing color coherence vectors. In MULTIMEDIA ’96:Proceedings of the fourth ACM internationalconference on Multimedia, pages 65–73, New York,NY, USA, 1996. ACM Press.

[12] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez,T. Tuytelaars, and L. V. Gool. Modeling scenes withlocal descriptors and latent aspects. In ICCV ’05:Proceedings of the Tenth IEEE InternationalConference on Computer Vision (ICCV’05) Volume 1,pages 883–890, Washington, DC, USA, 2005. IEEEComputer Society.

[13] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, andW. T. Freeman. Discovering objects and their locationin images. In International Conference on ComputerVision (ICCV 2005), 2005.

[14] E. B. Sudderth, A. B. Torralba, W. T. Freeman, and A. S. Willsky. Describing visual scenes using transformed Dirichlet processes. In NIPS, 2005.

[15] S. Tong and E. Chang. Support vector machine activelearning for image retrieval. In MULTIMEDIA ’01:Proceedings of the ninth ACM international conferenceon Multimedia, pages 107–118, New York, NY, USA,2001. ACM Press.

[16] G. Wang, Y. Zhang, and L. Fei-Fei. Using dependentregions for object categorization in a generativeframework. In CVPR ’06: Proceedings of the 2006IEEE Computer Society Conference on ComputerVision and Pattern Recognition, pages 1597–1604,Washington, DC, USA, 2006. IEEE Computer Society.

[17] X. Wei and W. B. Croft. LDA-based document modelsfor ad-hoc retrieval. In SIGIR ’06: Proceedings of the29th annual international ACM SIGIR conference onResearch and development in information retrieval,pages 178–185, New York, NY, USA, 2006. ACMPress.

[18] C. Zhai and J. Lafferty. A study of smoothing methodsfor language models applied to ad hoc informationretrieval. In SIGIR ’01: Proceedings of the 24th annualinternational ACM SIGIR conference on Research anddevelopment in information retrieval, pages 334–342,New York, NY, USA, 2001. ACM Press.