Hội nghị toàn quốc về Điều khiển và Tự động hoá (National Conference on Control and Automation) - VCCA-2011
Generic object recognition in static images based on unsupervised
learning of a color-based active basis model (CABM)
T. T. Quyen Bui (1, 2) and Keum-Shik Hong (2)
(1) Institute of Information Technology, Hanoi, Vietnam
(2) Pusan National University, Busan, 609-735, Korea
E-mails: {quyenbt, kshong}@pusan.ac.kr
Abstract: This paper addresses the problem of detecting generic
objects in static images by using unsupervised learning of a
color-based active basis model (CABM). The CIELAB color space is
used. Gabor wavelets of LAB color images, together with color
features such as the local power spectrum of the brightness and of
each color component and color gradient features, are used by both
the unsupervised learning and the inference algorithms. Moreover,
the effectiveness of unsupervised learning of the CABM for
practical object recognition and its significant improvement in
recognizing generic objects are demonstrated.
Nomenclature
Symbol    Meaning
$G(x, y, \sigma, \theta)$    a general form of Gabor wavelets
$Dic_I$    a dictionary of Gabor wavelet elements of an image $I$
$(L^*, a^*, b^*)$    the brightness, red-green channel, and yellow-blue channel, respectively
$D$    the domain of an image lattice
$N_\theta$    the number of different orientations
$N_\sigma$    the total number of scales
$\{I_m\}$    a set of training color images
$I_m^{Lab}$    a representation of $I_m$ in the CIELAB color space
$Dic_{I_m}$    a dictionary of Gabor wavelet elements of $I_m^{Lab}$
$N$    the number of sketches
$NC$    the number of clusters
$M$    the number of training images
$\{T_k^C\}$    a common template
$\{T_{m,k}\}$    a deformable template of image $I_m^{Lab}$
$\{T_k^{C_j}\}$    a common template of the $j$-th cluster
$(\Delta x_{m,k}, \Delta y_{m,k}, \Delta\theta_{m,k})$    perturbations of the basis element of image $I_m^{Lab}$
$b_d$ and $b_\theta$    the bounds
$c_{m,k}$    coefficients
$U_m$    an unexplained residual image
$CG_{ab}$    a full color gradient
$Z(\lambda)$    the normalizing constant of an exponential family model
$u(\cdot)$    a candidate function of image $I_u^{Lab}$
$Q(x, y)$    template matching score
$\{(\Delta x, \Delta y, \Delta\theta)\}$    a set of activities for a basis element
$f_s(\cdot)$    a sigmoid function
Abbreviations
ABM    Active basis model
CABM    Color-based active basis model
EM    Expectation maximization
LPS    Local power spectrum
1. Introduction
Object recognition is one of the challenges in computer vision
applications. There are many approaches to this problem, such as
the use of interest points, the color of objects, the shape of
objects, input templates of objects, etc. In this paper, we present
the detection of generic objects based on unsupervised learning of
a color-based active basis model.
Among the various existing approaches to object recognition,
methods using a deformable template have proven to be remarkably
successful, and the research results are presented in [1-9].
Recently, Wu et al. [10] proposed an active basis model (ABM) for
detecting objects. An ABM, a shared sketch algorithm, and a
computational architecture of sum-max maps are utilized for
representing, learning, and recognizing a deformable template,
respectively. However, the ABM completely ignores all color
features in its detection of objects because the shared sketch
algorithm applies a grey-value local energy to find a common
template together with its deformable versions from training
images.
A color-based active basis model (CABM) was proposed in [11] as an
extension of the ABM, together with supervised learning of the CABM
for object recognition. Both the supervised learning and the
inference algorithm use color-based image features such as the
color-based local power spectrum (LPS) and color gradients.
Experimental results in [11] showed the usefulness of the CABM as
well as its significant improvement in practical object
recognition.
However, supervised learning is applicable only when the objects in
the training images have the same pose and appear at the same
location under the same scale. In a situation where objects may
appear at different unknown locations, orientations, and scales in
the training images, unsupervised learning based on an
expectation-maximization (EM) type algorithm is used. The CIELAB
color space is utilized and, as a result, three opponent color
components, $(L^*, a^*, b^*)$, which are the brightness, red-green
channel, and yellow-blue channel, respectively, are obtained per
pixel. The color-based LPS (the LPSs of the components $L^*$,
$a^*$, and $b^*$, respectively, and a conjunction of these LPSs)
and color gradients are used in both the learning and inference
algorithms. The EM learning process iterates between recognition
and supervised learning. First, the E-step uses the current
template to recognize objects in the training images. Second, the
M-step re-learns the template, by the shared sketch algorithm, from
the images aligned according to the imputed locations,
orientations, and scales.
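The conversion to the three opponent components $(L^*, a^*, b^*)$ can be sketched as follows. This is a minimal illustration, assuming sRGB input in [0, 1] and the D65 reference white; the paper does not state which RGB primaries or white point were used, and the function name `rgb_to_lab` is hypothetical.

```python
import numpy as np

# Assumption: sRGB primaries and D65 white point (standard colorimetric constants).
_RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                        [0.2126729, 0.7151522, 0.0721750],
                        [0.0193339, 0.1191920, 0.9503041]])
_WHITE_D65 = np.array([0.95047, 1.0, 1.08883])


def rgb_to_lab(rgb):
    """Convert an (H, W, 3) sRGB image with values in [0, 1] to CIELAB."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB gamma to obtain linear RGB.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ, normalized by the reference white.
    xyz = linear @ _RGB_TO_XYZ.T / _WHITE_D65
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0          # brightness
    a = 500.0 * (f[..., 0] - f[..., 1])   # red-green channel
    b = 200.0 * (f[..., 1] - f[..., 2])   # yellow-blue channel
    return np.stack([L, a, b], axis=-1)
```

Each output channel can then be filtered separately, which is how the per-component LPSs mentioned above would be obtained in such a sketch.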
The contribution of the paper is the study of unsupervised learning
of the CABM for generic object recognition. Moreover, the
experimental results show how much each of the color cues, the
gradient features, and the EM clustering helps object recognition.
In this paper, we use images from two widely used benchmark image
sets: the PASCAL VOC 2009 database [12] and the Weizmann horse
dataset [13].
The paper is organized as follows. The CABM is briefly reviewed in
Section 2, where the unsupervised learning and inference algorithms
are also discussed in detail. Experimental results are presented in
Section 3. Finally, conclusions are drawn in Section 4.
2. A color-based active basis model
We refer the reader to [11] for more information on the Gabor
filters, color image features, supervised learning of the CABM,
etc., in the detection of generic objects.
2.1 The CABM model
Let $G(x, y, \sigma, \theta)$ be a general form of Gabor wavelets
and $Dic_I$ be a dictionary of Gabor wavelet elements of an image
$I$. The dictionary $Dic_I$ is defined as follows:
$$Dic_I = \{ G(x, y, \sigma, \theta) = (g_{\sigma,\theta,0}(x, y),\ g_{\sigma,\theta,\pi/2}(x, y)) : (x, y) \in D,\ \theta = i\pi/N_\theta,\ i \in \{0, \dots, N_\theta - 1\} \}, \quad (1)$$
where $g_{\sigma,\theta,0}(x, y)$ and $g_{\sigma,\theta,\pi/2}(x, y)$
are the symmetric and anti-symmetric Gabor components,
respectively, $D$ is the domain of an image lattice, $N_\theta$ is
the number of different orientations, and $N_\sigma$ is the total
number of scales. In this paper, $N_\theta = 15$ and $\sigma = 0.7$.
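A dictionary of symmetric and anti-symmetric Gabor components over 15 orientations can be sketched as below. The exact Gabor form is given in [11] and is not reproduced here, so the envelope shape, the kernel size `half_size`, and the mapping from the scale 0.7 to pixels are all assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np


def gabor_pair(theta, scale=0.7, half_size=8):
    """Return unit-norm (even, odd) Gabor kernels at orientation theta."""
    ys, xs = np.mgrid[-half_size:half_size + 1, -half_size:half_size + 1].astype(float)
    # Rotate the coordinates so that xr runs along the carrier direction.
    xr = xs * np.cos(theta) + ys * np.sin(theta)
    yr = -xs * np.sin(theta) + ys * np.cos(theta)
    s = scale * half_size / 2.0      # assumed mapping from scale to pixels
    envelope = np.exp(-(xr ** 2 / (2 * s ** 2) + yr ** 2 / (2 * (2 * s) ** 2)))
    freq = 1.0 / (2.0 * s)           # assumed carrier frequency in cycles per pixel
    even = envelope * np.cos(2 * np.pi * freq * xr)   # symmetric component
    odd = envelope * np.sin(2 * np.pi * freq * xr)    # anti-symmetric component
    even -= even.mean()              # remove the DC response of the even part
    return even / np.linalg.norm(even), odd / np.linalg.norm(odd)


def gabor_dictionary(n_orient=15, scale=0.7):
    """One (theta, even, odd) entry per orientation theta = i * pi / n_orient."""
    bank = []
    for i in range(n_orient):
        theta = i * np.pi / n_orient
        even, odd = gabor_pair(theta, scale)
        bank.append((theta, even, odd))
    return bank
```

In this sketch, the local energy at a pixel would be the sum of the squared responses of the even and odd kernels centered there.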
The proposed scheme of the CABM is shown in Fig. 1, in which
unsupervised learning based on the EM-type algorithm is used. As
presented in [11], instead of utilizing the RGB color components of
natural images, we consider the color components $L^*$, $a^*$, and
$b^*$ in the CIELAB color space. Given a set of training color
images, $\{I_m, m = 1, \dots, M\}$, let $I_m^{Lab}$ be the
representation of $I_m$ in the CIELAB color space. Let $Dic_{I_m}$
be the dictionary of Gabor wavelet elements of $I_m^{Lab}$ and $N$
be the number of sketches that are picked. Let
$\{T_k^C = G(x_k, y_k, \sigma, \theta_k), k = 1, \dots, N\}$ be a
common template that is shared by the training images. The location
of each element, $T_k^C$, can be shifted in its normal direction
with the permitted displacement of the location,
$(\Delta x, \Delta y)$, and the orientation of each element can be
shifted by the permitted value, $\Delta\theta$. The set of
activities for a basis element is defined as follows:
$$\{(\Delta x, \Delta y, \Delta\theta) : \Delta x = d \sin\theta,\ \Delta y = d \cos\theta,\ d \in [-b_d, b_d],\ \Delta\theta \in [-b_\theta, b_\theta]\}, \quad (2)$$
where $b_d$ and $b_\theta$ are the bounds for the permitted
displacements of the location and orientation, respectively. Let
$\{T_{m,k} = G(x_k + \Delta x_{m,k}, y_k + \Delta y_{m,k}, \sigma, \theta_k + \Delta\theta_{m,k}), k = 1, \dots, N\}$
be the specific template that describes only the image $I_m^{Lab}$,
where $T_{m,k}$ is chosen from $Dic_{I_m}$ and
$(\Delta x_{m,k}, \Delta y_{m,k}, \Delta\theta_{m,k})$,
$k = 1, \dots, N$, are the perturbations of the basis element of
image $I_m^{Lab}$. Each image, $I_m^{Lab}$, can be represented as
follows:
$$I_m^{Lab} = \sum_{k=1}^{N} c_{m,k} T_{m,k} + U_m, \quad m = 1, \dots, M, \quad (3)$$
where $c_{m,k}$, $k = 1, \dots, N$, are coefficients, and $U_m$ is
an unexplained residual image.
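The set of activities in equation (2) can be enumerated on a discrete grid as in the following sketch. The discretization (integer displacements $d$, orientation shifts in multiples of $\pi/N_\theta$) and the default bounds `b_d=3` and `b_theta=1` are assumptions for illustration, not values stated in the paper.

```python
import math


def activity_set(theta, b_d=3, b_theta=1, n_orient=15):
    """Enumerate (dx, dy, dtheta) with dx = d*sin(theta), dy = d*cos(theta).

    d ranges over the integers in [-b_d, b_d]; dtheta ranges over
    j * pi / n_orient for integer j in [-b_theta, b_theta].
    """
    step = math.pi / n_orient
    moves = []
    for d in range(-b_d, b_d + 1):
        dx = round(d * math.sin(theta))
        dy = round(d * math.cos(theta))
        for j in range(-b_theta, b_theta + 1):
            moves.append((dx, dy, j * step))
    return moves


def deform(element, move):
    """Apply one perturbation to a template element (x_k, y_k, theta_k)."""
    x, y, theta = element
    dx, dy, dtheta = move
    return (x + dx, y + dy, theta + dtheta)
```

For example, `deform((10, 20, 0.0), activity_set(0.0)[0])` shifts the element by the first allowed perturbation; a deformed template is obtained by choosing one such move per element.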
Fig. 1. A scheme of the color-based active basis model in which
unsupervised learning based on the EM-type algorithm is utilized
for object recognition.
2.2 Inference algorithm
Given a common template of an object, to detect the object in
images, we adopt the template matching of the ABM, which determines
a deformed template of an object by using the log-likelihood score.
For instance, suppose that an object in a new color image, $I_u$,
has to be found. Let $I_u^{Lab}$ be the representation of $I_u$ in
the CIELAB color space. The inference algorithm is carried out in
the following manner, where the parameters,
$\{\lambda_k, k = 1, \dots, N\}$, are the weights, the function
$Z(\lambda)$ is the normalizing constant of an exponential family
model [10], $f_s(\cdot)$ is a sigmoid function whose value
increases from 0 to a saturation constant, and $u(\cdot)$ is a
candidate function of the image $I_u^{Lab}$. If only the
color-based LPS is taken into account, the value of $u(\cdot)$
equals the normalized value of the color-based LPS of the image. If
the LPS and the gradient-based features are combined in a single
detector, $u(\cdot)$ is determined by using the logistic model
[14].
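The template matching described in this section can be sketched as a local maximization followed by a weighted sum. This is a simplified illustration under several assumptions: `u` is a precomputed array of candidate-function values indexed as (orientation, y, x), template elements are (x_k, y_k, orientation index) offsets, image borders wrap around through `np.roll`, and the sigmoid is one common saturating form with a hypothetical saturation constant `xi=6.0`. Recording the per-element perturbations is omitted.

```python
import math

import numpy as np


def saturate(r, xi=6.0):
    """Sigmoid rising from 0 at r = 0 towards the saturation constant xi (assumed form)."""
    return xi * (2.0 / (1.0 + np.exp(-2.0 * r / xi)) - 1.0)


def local_max(u, b_d=1, b_o=1):
    """For each (orientation, y, x), maximize f_s(u) over the allowed perturbations."""
    n_o = u.shape[0]
    s = saturate(u)
    out = np.full(s.shape, -np.inf)
    for o in range(n_o):
        theta = o * math.pi / n_o
        for do in range(-b_o, b_o + 1):
            for d in range(-b_d, b_d + 1):
                dx = int(round(d * math.sin(theta)))
                dy = int(round(d * math.cos(theta)))
                # shifted[y, x] == s[(o + do) % n_o][y + dy, x + dx], with wrap-around
                shifted = np.roll(s[(o + do) % n_o], (-dy, -dx), axis=(0, 1))
                out[o] = np.maximum(out[o], shifted)
    return out


def match_scores(u, template, lam, log_z, b_d=1, b_o=1):
    """Q(x, y) = sum_k [lam_k * max f_s(u(...)) - log Z(lam_k)], stored as q[y, x]."""
    m = local_max(u, b_d, b_o)
    q = np.zeros(u.shape[1:])
    for (xk, yk, ok), l_k, lz_k in zip(template, lam, log_z):
        q += l_k * np.roll(m[ok], (-yk, -xk), axis=(0, 1)) - lz_k
    return q


def detect(u, template, lam, log_z, b_d=1, b_o=1):
    """Return the location (x_hat, y_hat) that maximizes Q, and the score map."""
    q = match_scores(u, template, lam, log_z, b_d, b_o)
    y_hat, x_hat = np.unravel_index(np.argmax(q), q.shape)
    return int(x_hat), int(y_hat), q
```

Computing the local maxima once per orientation and then summing shifted maps avoids repeating the maximization for every candidate location.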
2.3 Unsupervised learning
Given a sample of response vectors (a set of training images), we
can find a dictionary of predictors or regressors ($\{T_{m,k}\}$)
by using linear regression, so that each response vector (image
$I_m^{Lab}$) can be represented as a linear combination of a small
number of regressors picked from the dictionary, as in equation
(3). For this reason, the output of unsupervised learning consists
of the deformed templates, $\{T_{m,k}, k = 1, \dots, N\}$, each of
which describes the image $I_m^{Lab}$ by a small number of outline
edges in the texture of the objects, and the common templates of
the clusters, $\{T_k^{C_j}, k = 1, \dots, N, j = 1, \dots, NC\}$,
where $NC$ is the number of clusters, which is set before the
learning is performed. Templates are learned by using the maximum
likelihood method. However, unlike supervised learning,
unsupervised learning gives us more than one common template.
If the objects appear at unknown orientations, the E-step
recognizes the orientation of the object in the image $I_m^{Lab}$;
then, for the recognized orientation, $I_m^{Lab}$ is transformed
into a new image $\tilde{I}_m^{Lab}$, so that we get a new set of
images $\tilde{I}_m^{Lab}$ that are aligned with each other. The
M-step then re-learns the templates from the aligned images through
supervised learning.
[Fig. 1 block labels: training images $\{I_m, m = 1, \dots, M\}$ -> color components $(L^*, a^*, b^*)$ -> Gabor filter -> dictionaries $\{Dic_{I_m}, m = 1, \dots, M\}$ and image features (color-based LPS, gradient-based features) -> unsupervised learning (EM algorithm) -> deformed templates and common templates -> inference algorithm, which also receives the image features of the testing images.]

Inference algorithm:
1. For every pixel $(x, y)$ of the image $I_u^{Lab}$, compute the template matching score,
$$Q(x, y) = \sum_{k=1}^{N} \left[ \lambda_k \max_{(\Delta x, \Delta y, \Delta\theta)} f_s\big(u(x + x_k + \Delta x,\ y + y_k + \Delta y,\ \sigma,\ \theta_k + \Delta\theta)\big) - \log Z(\lambda_k) \right].$$
2. Select $(\hat{x}, \hat{y}) = \arg\max_{(x, y)} Q(x, y)$. For $k = 1, \dots, N$, record
$$(\Delta x_k, \Delta y_k, \Delta\theta_k) = \arg\max_{(\Delta x, \Delta y, \Delta\theta)} u(\hat{x} + x_k + \Delta x,\ \hat{y} + y_k + \Delta y,\ \sigma,\ \theta_k + \Delta\theta).$$
3. Return the location $(\hat{x}, \hat{y})$ and the deformed template of the object, $\{T_{u,k} = G(\hat{x} + x_k + \Delta x_k,\ \hat{y} + y_k + \Delta y_k,\ \sigma,\ \theta_k + \Delta\theta_k), k = 1, \dots, N\}$.

If the objects appear at different locations in the training
images, the locations of the templates should be predicted during
learning. The shift of the template is transferred to the shift of
the image, and the latter shift gives us a set of aligned images.
Because we can use a given template to recognize and locate the
object in each image $I_m^{Lab}$ by the inference algorithm (see
Section 2.2), the predictive distribution of the unknown location
of a template within the image lattice of $I_m^{Lab}$ is
determined; then, the expectation of the complete-data
log-likelihood is obtained in the E-step. Next, we maximize the
expected complete-data log-likelihood by using the shared sketch
algorithm in the M-step. In other words, supervised learning is
executed to re-learn the templates from the aligned images as in
[11]. Nevertheless, unlike the previous EM-type learning of the
ABM, color-based features, such as the LPSs and the full color
gradient, are taken into account in our method through the
supervised learning and inference algorithms.
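The alternation between locating the object (E-step) and re-learning the template from aligned data (M-step) can be illustrated with a deliberately small one-dimensional toy. It is not the shared sketch algorithm: here the "images" are 1-D signals, recognition is a plain correlation, and the M-step is a simple average of aligned crops. The function name `em_align` and the initialization from the first signal are assumptions.

```python
import numpy as np


def em_align(signals, width, n_iter=3):
    """Hard-EM toy: alternate between locating a template and re-learning it.

    signals: list of 1-D arrays, each containing the pattern at an unknown shift.
    width:   length of the template.
    """
    # Assumed initialization: the leading crop of the first signal.
    template = np.asarray(signals[0][:width], dtype=float).copy()
    shifts = [0] * len(signals)
    for _ in range(n_iter):
        # E-step: impute the location of the pattern in every signal
        # by maximizing the correlation with the current template.
        shifts = [int(np.argmax(np.correlate(s, template, mode="valid")))
                  for s in signals]
        # M-step: re-learn the template from the aligned crops.
        crops = np.stack([s[k:k + width] for s, k in zip(signals, shifts)])
        template = crops.mean(axis=0)
    return template, shifts
```

As with any EM-type procedure, the result of this toy depends on the initialization; a crop that does not overlap the pattern would give uninformative correlations.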
By using different conjunctions of the color-based features, varied
versions of the CABM are obtained. Let the notation
$ABM^{EM}_{l}$ denote the CABM in which unsupervised learning and
the color-based LPS are used, where
$l \in \{L, a, b, La, Lb, Lab\}$. For instance, $ABM^{EM}_{Lb}$
corresponds to the CABM in which unsupervised learning and the sum
of the LPSs of the components $L^*$ and $b^*$ are used. Let the
notation $ABM^{EM}_{gl}$ denote the CABM in which unsupervised
learning based on the EM-type algorithm and a combination of the
color-based LPS and a full color gradient are utilized. Here, the
full color gradient, $CG_{ab}$, is the sum of the two marginal
gradients of the $a^*$ and $b^*$ components. Fig. 2 depicts an
example of the deformed templates of images. The objects are a
horse, a butterfly, and a cow. The numbers of sketches
corresponding to the horse, the butterfly, and the cow are 50, 42,
and 55, respectively. From the second row to the bottom row, the
first method of learning uses grey features only [5], and the
subsequent methods correspond to unsupervised learning in which the
color-based LPSs are used with $l \in \{L, a, b, La, Lb, Lab\}$,
respectively.
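The full color gradient $CG_{ab}$, defined above as the sum of the two marginal gradients of the $a^*$ and $b^*$ components, can be sketched as follows. The gradient operator is not specified in this paper, so the use of central differences via `np.gradient` and the gradient magnitude as the "marginal gradient" are assumptions.

```python
import numpy as np


def gradient_magnitude(channel):
    """Gradient magnitude of one channel (central differences; an assumed operator)."""
    gy, gx = np.gradient(np.asarray(channel, dtype=float))
    return np.hypot(gx, gy)


def full_color_gradient(lab):
    """CG_ab: sum of the marginal gradients of the a* and b* channels of an (H, W, 3) image."""
    return gradient_magnitude(lab[..., 1]) + gradient_magnitude(lab[..., 2])
```

The brightness channel is left out of this quantity by construction; in the paper it enters through the LPS of $L^*$ instead.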
3. Experimental results
The area under the curve (AUC) is usually the best discriminator
[15] when we have a number of receiver operating characteristic
(ROC) curves to compare. The AUC value ranges from 0 to 1;
therefore, the smaller the value of $1 - AUC$ a method has, the
better the method is in object recognition. In this section, we
illustrate the usefulness of unsupervised learning of the CABM for
object recognition and use $1 - AUC$ to evaluate and compare the
performance of the methods in recognizing objects.
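For reference, the quantity $1 - AUC$ can be computed from detection scores as in the sketch below. It uses the rank interpretation of the AUC (the probability that a positive example scores higher than a negative one, with ties counted as one half); how the scores were actually obtained and thresholded in the experiments is not described here, so the inputs are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def one_minus_auc(pos_scores, neg_scores):
    """The error measure reported in the tables: smaller is better."""
    return 1.0 - auc(pos_scores, neg_scores)
```

With hypothetical scores `[0.9, 0.4]` for object images and `[0.5, 0.1]` for background images, three of the four pairs are ordered correctly, so the AUC is 0.75 and $1 - AUC$ is 0.25.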
In this experiment, the number of iterations is three; hence, three
common templates are created after learning. Table 1 shows a
comparison of the $1 - AUC$ values of $ABM^{EM}_{l}$,
$ABM^{EM}_{gl}$, and the ABM. Here, the object is the butterfly,
the image size is 150 x 110, the number of training images is 65,
the number of elements is 42, and the number of clusters is 3. The
columns named "T$i$" (template $i$, $i \in \{1, 2, 3\}$) and "All"
respectively show the $1 - AUC$ values for the separate use of
template $i$ and for the use of all the templates in object
recognition. In the case of all templates, each template is
utilized in turn to detect and then sketch an object in an image.

Fig. 2. Examples of deformable templates (represented by yellow
pixels). Columns: horse ($N = 50$, 150x125), butterfly ($N = 42$,
150x110), and cow ($N = 55$, 170x125). Rows: grey feature [5],
followed by the LPS with $l = L, a, b, La, Lb, Lab$.

Table 1. A comparison of the $1 - AUC$ values of $ABM^{EM}_{l}$,
$ABM^{EM}_{gl}$, and the ABM for butterfly recognition. Here,
$M = 65$, $N = 42$, the size is 150x110, and $NC = 3$.
Method            l     T1      T2      T3      All
Wu et al. [10]    -     0.0381  0.0490  0.0415  0.0376
$ABM^{EM}_{l}$    L     0.0327  0.0302  0.0343  0.0282
                  a     0.0789  0.0938  0.0815  0.0778
                  b     0.0617  0.0625  0.0610  0.0602
                  La    0.0405  0.0329  0.0317  0.0311
                  Lb    0.0403  0.0325  0.0312  0.0305
                  Lab   0.0414  0.0336  0.0341  0.0332
$ABM^{EM}_{gl}$   L     0.0260  0.0192  0.0130  0.0121
                  a     0.0612  0.0694  0.0574  0.0572
                  b     0.0490  0.0445  0.0418  0.0413
                  La    0.0301  0.0248  0.0267  0.0247
                  Lb    0.0302  0.0248  0.0260  0.0242
                  Lab   0.0290  0.0227  0.0254  0.022
Table 2. A comparison of the $1 - AUC$ values of $ABM^{EM}_{l}$,
$ABM^{EM}_{gl}$, and the ABM for horse recognition. Here,
$M = 84$, $N = 45$, the size is 150x125, and $NC = 3$.
Method            l     T1      T2      T3      All
Wu et al. [10]    -     0.0362  0.0543  0.0456  0.0361
$ABM^{EM}_{l}$    L     0.0285  0.0538  0.0333  0.0178
                  a     0.0441  0.0448  0.0725  0.0406
                  b     0.0455  0.0531  0.0483  0.0439
                  La    0.0336  0.0447  0.0542  0.0336
                  Lb    0.027   0.049   0.0415  0.0241
                  Lab   0.0353  0.0454  0.0371  0.0335
$ABM^{EM}_{gl}$   L     0.0229  0.0491  0.0283  0.014
                  a     0.0424  0.0427  0.0608  0.0387
                  b     0.0434  0.0501  0.0455  0.0399
                  La    0.0296  0.0428  0.0407  0.0279
                  Lb    0.0248  0.0448  0.0387  0.02
                  Lab   0.0285  0.0396  0.0333  0.0253
The performance in object recognition is improved when the
color-based LPS is used with $l \in \{L, La, Lb, Lab\}$. Moreover,
the $1 - AUC$ value of $ABM^{EM}_{gl}$ with regard to the
utilization of all the templates is always smaller than those of
$ABM^{EM}_{l}$ and the ABM when $l \in \{L, La, Lb, Lab\}$. If only
the LPS of either of the components, $a^*$ or $b^*$, is utilized in
unsupervised learning, the quality of object recognition is worse
because the common templates obtained cannot represent all the
necessary edges of generic objects. Besides, the $1 - AUC$ value of
$ABM^{EM}_{La}$ is approximately equal to that of $ABM^{EM}_{Lb}$.
This shows that the use of a combination of $L^*$ and $a^*$ is
similar to the use of the conjunction of $L^*$ and $b^*$ in both
the learning and inference algorithms. In addition, among all the
versions of the CABM, $ABM^{EM}_{gL}$ consistently yielded the best
performance for the detection of objects. Tests with other objects,
such as a horse and a pigeon, produced similar experimental
results.
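To put the reported numbers in proportion, the relative reduction of $1 - AUC$ with respect to the ABM of Wu et al. [10] can be computed from the "All" columns of Tables 1 and 2. The helper name `relative_reduction` is hypothetical; the four input values are taken from the tables (the ABM baseline and the gradient-plus-LPS variant with $l = L$).

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is smaller than `baseline`."""
    return (baseline - value) / baseline


# "All" column, Table 1 (butterfly): ABM of [10] versus the L row with gradient.
butterfly = relative_reduction(0.0376, 0.0121)
# "All" column, Table 2 (horse): ABM of [10] versus the L row with gradient.
horse = relative_reduction(0.0361, 0.014)

print(f"butterfly: {butterfly:.1%}, horse: {horse:.1%}")
```

Under this reading of the tables, the error measure drops by roughly two thirds for the butterfly images and by roughly three fifths for the horse images.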
4. Conclusions
The paper presented unsupervised learning of the CABM for generic
object recognition. The learning based on the EM-type algorithm was
discussed in detail. The experimental results were compared with
previous research results. They show that the use of the CABM and
unsupervised learning based on the EM algorithm contributes a
significant improvement to the detection of generic objects. The
learning is able to discover not only the general shapes of objects
but also parts of objects. For this reason, future research will
aim to improve the flexibility of unsupervised learning of the CABM
for detecting generic objects that may appear from different
viewpoints or with only parts visible in the images.
References
[1] Coughlan, J.; Yuille, A.; English, C.; Snow, D.: Efficient
deformable template detection and localization without user
initialization, Computer Vision and Image Understanding, 78 (3)
(2000) 303-319.
[2] Cootes, T.; Edwards, G.; Taylor, C.: Active appearance models,
IEEE Trans. on Pattern Analysis and Machine Intelligence, 23 (6)
(2001) 681-685.
[3] Crandall, D.; Felzenszwalb, P.; Huttenlocher, D.: Spatial
priors for part-based recognition using statistical models,
Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, San Diego, CA, USA, June 20-26,
2005, vol. 1, pp. 10-17.
[4] Felzenszwalb, P. F.; McAllester, D.: The generalized A*
architecture, Journal of Artificial Intelligence Research, 29
(2007) 153-190.
[5] Amit, Y.; Trouve, A.: POP: Patchwork of parts models for object
recognition, International Journal of Computer Vision, 75 (2)
(2007) 267-282.
[6] Zhu, S.; Mumford, D.: A stochastic grammar of images,
Foundations and Trends in Computer Graphics and Vision, 2 (4)
(2007) 259-362.
[7] Leibe, B.; Leonardis, A.; Schiele, B.: Robust object detection
with interleaved categorization and segmentation, International
Journal of Computer Vision, 77 (1) (2008) 259-289.
[8] Si, Z.; Gong, H.; Wu, Y. N.; Zhu, S. C.: Learning mixed image
templates for object recognition, IEEE Conference on Vision and
Pattern Recognition, pp. 272-279, 2009.
[9] Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; Ramanan,
D.: Object detection with discriminatively trained part based
models, IEEE Trans. on Pattern Analysis and Machine Intelligence,
32 (9) (2010) 1627-1645.
[10] Wu, Y. N.; Si, Z. Z.; Gong, H.; Zhu, S. C.: Learning active
basis model for object detection and recognition, International
Journal of Computer Vision, 2010, doi: 10.1007/s11263-009-0287-0.
[11] Bui, T. T. Q.; Hong, K. S.: Supervised learning of a
color-based active basis model for object recognition, Proc. of the
Second Int. Conf. on Knowledge and Systems Engineering (KSE),
Hanoi, Vietnam, 7-9 Oct. 2010, pp. 69-74.
[12] Everingham, M.; Van Gool, L.; Williams, C. K. I.; Winn, J.;
Zisserman, A.: The PASCAL visual object classes challenge 2009
(VOC2009) results. [Online]. Available:
http://www.pascalnetwork.org/challenges/VOC/voc2009/workshop/index.html
[13] Borenstein, E.; Sharon, E.; Ullman, S.: Combining top-down and
bottom-up segmentation, IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops, pp. 46-54, 2004.
[14] Martin, D. R.; Fowlkes, C. C.; Malik, J.: Learning to detect
natural image boundaries using local brightness, color, and texture
cues, IEEE Trans. on Pattern Analysis and Machine Intelligence, 26
(5) (2004) 530-549.
[15] Metz, C. E.: Basic principles of ROC analysis, Seminars in
Nuclear Medicine, 8 (4) (1978) 283-298.
T. T. Quyen Bui received her B.S. and M.S. degrees from Hanoi
University of Technology, Vietnam, in 2001 and 2006, respectively.
She is currently a Ph.D. student in the School of Mechanical
Engineering, Pusan National University, Korea. Her research
interests include robotics, vision systems, and navigation of
autonomous vehicles.