Post on 16-Mar-2016
Computer Seeing and Hearing: A Dream of AI
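Zhang Bo (张钹)
School of Information Science and Technology, Tsinghua University
Department of Computer Science and Technology, Tsinghua University
Tsinghua National Laboratory for Information Science and Technology
State Key Laboratory of Intelligent Technology and Systems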
Is it possible?
• Yes
• No, it is just a daydream!
Computer Vision / Hearing
The Characteristics of Auditory Information (Data)
Ears, Earphones
• A continuous wave
• Digital data: 20K-100K bits/s
• Sparseness (redundancy)
• Noisy
The Characteristics of Visual Information (Data)
Eyes, Digital Camera
• Pixel-based (million, ten million bits)
• Sparseness (redundancy)
• Noisy
• Eyes: a sequence of images, 10^9 bits/sec
The Sparseness of Auditory Signal
Sampling rate and bit resolution
• Broadcast quality: 48 kHz
• CD quality: 44 kHz, 16 bits
• Radio quality: 22 kHz, 8 bits
• Acceptable music: 11 kHz, 4 bits
• Acceptable speech: 5 kHz
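The raw data rate implied by each quality level follows from multiplying the sampling rate by the bit resolution. The sketch below is only an illustration for a single uncompressed channel; the function name `raw_bit_rate` and the dictionary are hypothetical, and only the levels for which the slide gives both numbers are included.

```python
# Raw bit rate of one uncompressed audio channel: sampling rate x bit resolution.
# The (rate, bits) pairs are the ones listed on the slide; names are illustrative.
QUALITY_LEVELS = {
    "CD quality": (44_000, 16),
    "Radio quality": (22_000, 8),
    "Acceptable music": (11_000, 4),
}


def raw_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Return bits per second for a single uncompressed channel."""
    return sample_rate_hz * bits_per_sample


for name, (rate, bits) in QUALITY_LEVELS.items():
    print(f"{name}: {raw_bit_rate(rate, bits):,} bits/s")
```

The spread between the highest and lowest levels is one way to see how much of the signal is redundant for recognition purposes.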
The Sparseness of Visual Signal
The relation between resolution and recognition rate (conceptual)
An Ill-posed Problem
Sparse, redundant, noisy data (110000111100011100011000…)
Microphone (ears), camera (eyes)
Speaker-invariant vowel representation
Vowel-invariant speaker representation
(Object-invariant representation)
A well-posed problem requires existence, uniqueness, and stability of the solution
1. Segmentation & Recognition
Image Segmentation vs. Recognition
Which comes first, the chicken or the egg?
Where is the object ?
What is the object ?
Speech Segmentation vs. Recognition
? What, Where
Technical Difficulties (Technology)
Sparse, redundant, noisy data
Speaker-invariant Vowel Representation
Vowel-invariant Speaker Representation
A Robust Detector
An Invariant Descriptor
Top-down feedback
Local connection
Data-driven: from egg to chicken
High-level a priori knowledge
How do humans solve it?
The Relation Between Activation Patterns and Early Stages of Sound Processing
Speech encoding occurs not only in specialized high-level regions but also in the early stages of sound processing. Early sound processing may exhibit complex spectrotemporal receptive fields and may participate in the high-level encoding of auditory objects, e.g., via local feedback.
Multi-layer Neural Network with feedback connections
G. E. Hinton et al., The “wake-sleep” algorithm for unsupervised neural networks, Science, vol. 268, 26 May 1995, 1158-1161
Representation
RBM: Restricted Boltzmann Machine
Experimental Results
G. E. Hinton, Learning multiple layers of representation, Trends in Cognitive Sciences, vol. 11, no. 10, 428-434, 2007
2. Feature Extraction
Computer Robustly Extractable Features
Sparse, redundant, noisy data
Statistical Approaches
Speech-based Invariant Statistics (Features)
Speaker-invariant Vowel Representation
Vowel-invariant Speaker Representation
Statistical Method
• Choose a speech training corpus
• Extract speech features
• Unsupervised learning (classification)
• Classification criterion: generalization
Which features should be extracted? Those that a computer can detect robustly.
Representation at Different Granularities
Global features (one vector): the coarsest
Pixel-based (1280×800×3 vectors): the finest
An Image
Pixel-based Representation: the finest representation
Million × 3-dimensional vectors: all the details
$(X, F, f)$
$X = \{(x, y)\}$
$f = [\, g_k(x_i, y_j) \,],\quad 1 \le i, j \le n$
$G_k = \{\, g_k(i, j),\ 1 \le i, j \le n \,\}$
Global Features -the coarsest representation
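Color moments
$u_i = \frac{1}{N} \sum_{j=1}^{N} P_{ij}$
$\sigma_i = \left( \frac{1}{N} \sum_{j=1}^{N} (P_{ij} - u_i)^2 \right)^{1/2}$
$s_i = \left( \frac{1}{N} \sum_{j=1}^{N} (P_{ij} - u_i)^3 \right)^{1/3}$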
N: the number of pixels; P: the value of each color
Result: one 9-dimensional vector (three moments for each of three color channels)
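The three color moments (mean, standard deviation, and the cube root of the third central moment) can be computed per channel and concatenated into the 9-dimensional global feature. This is a minimal sketch assuming an H × W × 3 NumPy array as input; the function name `color_moments` is illustrative and not taken from the slides.

```python
import numpy as np


def color_moments(image: np.ndarray) -> np.ndarray:
    """Return the 9-dimensional color-moment vector of an H x W x 3 image."""
    # Flatten to N x 3: one row per pixel, one column per color channel.
    pixels = image.reshape(-1, 3).astype(np.float64)
    mean = pixels.mean(axis=0)                      # first moment, per channel
    centered = pixels - mean
    std = np.sqrt((centered ** 2).mean(axis=0))     # second moment (square root)
    skew = np.cbrt((centered ** 3).mean(axis=0))    # third moment (signed cube root)
    return np.concatenate([mean, std, skew])


# Usage on a random image (values are arbitrary, for illustration only).
rng = np.random.default_rng(0)
feature = color_moments(rng.integers(0, 256, size=(8, 8, 3)))
print(feature.shape)
```

Because every pixel contributes equally regardless of position, the vector carries no spatial structure, which is what makes it robust but weakly expressive.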
Coarse vs. Fine Representation
Representations                                   The Finest Representation          The Coarsest Representation
Expressiveness                                    Good (full structural knowledge)   Poor (no structural knowledge)
Robustness (rotation, translation, scaling, …)    Poor                               Good
Representation with Middle Grain-Size
Region-based Representation
$([X], [F], [f])$
$[X] = \{x_i\},\quad i = 1, 2, \ldots, n$
$[f] = \big( f_1(x_i), f_2(x_i), \ldots, f_k(x_i) \big)$
Local (Spatial) Feature Region-01 Region-11 Region-12
Foreground vs. Background
Vector Representation
$[f]:\ \big( f_k(x_1), f_k(x_2), \ldots, f_k(x_n) \big),\quad k = 1, 2, \ldots, l$
A set of vectors (tens of them, with different lengths)
Similarity measure: weighted
Region-adaptive Grid Partition
Jinhui Yuan (2005…)
Hierarchical (granular) structure
(X, F, f): the finest space
([X], [F], [f]): a coarse space
[X]: the quotient space of X (each element is an equivalence class)
[F]: the quotient structure of F
[f]: the quotient attributes of f
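One way to picture the move from the finest space to a coarser one is a grid partition: pixels falling in the same cell form an equivalence class, and each class receives a single attribute. The sketch below assumes the quotient attribute is the mean color of the cell, which is an illustrative choice rather than the definition used in the talk; `grid_quotient` is a hypothetical name.

```python
import numpy as np


def grid_quotient(image: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Coarsen an H x W x C image to a rows x cols x C grid of cell means.

    Each grid cell plays the role of one equivalence class of pixels;
    the mean over the cell is an assumed choice of quotient attribute.
    """
    height, width, channels = image.shape
    row_edges = np.linspace(0, height, rows + 1).astype(int)
    col_edges = np.linspace(0, width, cols + 1).astype(int)
    coarse = np.empty((rows, cols, channels), dtype=np.float64)
    for i in range(rows):
        for j in range(cols):
            cell = image[row_edges[i]:row_edges[i + 1],
                         col_edges[j]:col_edges[j + 1]]
            coarse[i, j] = cell.reshape(-1, channels).mean(axis=0)
    return coarse


# A 1 x 1 grid gives the coarsest (global) description; finer grids keep more structure.
image = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
for grid in (1, 2, 4):
    print(grid, grid_quotient(image, grid, grid).shape)
```

With a 1 × 1 grid the result collapses to a global feature, and with one cell per pixel it reproduces the image, so the grid size acts as the granularity knob.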
Semantics (text, image)
Primitive (words, pixels)
Semantic Gap
PM: Pyramid Match (feature space, quantization level)
SPM: Spatial Pyramid Match (physical space, grid)
FESCO: Feature Spatial Covariant Kernel
Concept Detection from Video Shots
Experiments
• TRECVID 2005: 10 concepts, 170 hours of news (MSNBC, NBC Nightly News, CNN, LBC, CCTV, NTDTV)
• TRECVID 2006: 20 concepts, 170+150 hours of news
• Keypoint descriptor: 64-dimensional SURF feature (Speeded Up Robust Features)
• AP: non-interpolated average precision
• MAP: mean average precision (7 concepts)
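For readers unfamiliar with the two evaluation measures, the following sketch shows one common way to compute non-interpolated average precision from a ranked list and to average it over concepts. It assumes binary relevance labels ordered by decreasing detector score; the function names and the optional `num_relevant` parameter are illustrative, and the official TRECVID tooling may differ in details such as list truncation.

```python
from typing import Optional, Sequence


def average_precision(ranked_relevance: Sequence[int],
                      num_relevant: Optional[int] = None) -> float:
    """Non-interpolated AP: mean of precision@k over the ranks k of relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    # If the total number of relevant items is not given, use those retrieved.
    total = num_relevant if num_relevant is not None else hits
    return precision_sum / total if total else 0.0


def mean_average_precision(per_concept: Sequence[Sequence[int]]) -> float:
    """MAP: the arithmetic mean of AP over concepts."""
    return sum(average_precision(r) for r in per_concept) / len(per_concept)


# Two toy ranked lists (1 = relevant shot, 0 = irrelevant shot).
print(mean_average_precision([[1, 0, 1], [0, 1]]))
```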
TRECVID Data
Name         Hours   No. shots   No. frames   Date
TRECVID05d   80      44,000      75,000       2004 10-11
TRECVID05t   80      46,000      78,000       2004 11-12
TRECVID06t   150     80,000      144,000      2005 11-12
TRECVID07d   50      18,000      18,000
TRECVID07t   50      22,000      63,000
d: training data, t: testing data
Coarse vs. Fine Granulation
MAP: 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront
Test set          TRECV05t                  TRECV06t
Vocabulary size   18      72      288       18      72      288
Grid 1×1          0.073   0.210   0.244     0.025   0.074   0.109
Grid 2×2          0.223   0.260   0.251     0.078   0.117   0.119
Grid 4×4          0.271   0.254   0.275     0.116   0.123   0.128
Multi-granulation
Combination                                        TRECV05t   TRECV06t
Whole Comb.    (9 combinations)                    0.307      0.166
FESCO          (Pi×Qj = 288)                       0.306      0.166
Fine Space     (Qj = G4×4; Pi = 288, 72, 18)       0.300      0.158
Fine Feature   (Pi = 288; Qj = G1×1, G2×2, G4×4)   0.294      0.155
Fine Comb.     (Pi×Qj > 288)                       0.293      0.151
Coarse Comb.   (Pi×Qj < 288)                       0.250      0.106
MAP: 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront
Multi-granulation (2)
MAP: 7 concepts: car, explosion-fire, flag-US, maps, mountain, sports, waterscape-waterfront
Test set        TRECV05t                     TRECV06t
Fusion method   pre-fusion   post-fusion     pre-fusion   post-fusion
FESCO           0.297        0.306           0.154        0.166
SPM             0.274        0.285           0.140        0.146
PM              0.254        0.269           0.124        0.125
Multi-Granular & Multi-modal
TRECVID2005 (Video Retrieval Evaluation Conference)
86.6 hours of news videos (45766 shots in 140 video clips)
Features:
• A: auto-speech recognition text
• T: visual texture
• R: color of segmented image regions
PMSRA: Probabilistic Model Supported Rank Aggregation
The Comparison between Uni-modal and Multi-granular, Multi-modal
           Uni-modal                    Multi-granular, multi-modal
           ASR      Texture  Region     A+T      A+R      T+R      A+T+R
US-flag    0.0335   0.0155   0.0375     0.0359   0.0506   0.0372   0.0521
Water      0.0034   0.1143   0.0814     0.1022   0.0735   0.1333   0.1211
Mountain   0.0033   0.0693   0.1104     0.0668   0.1066   0.1176   0.1154
Sports     0.0723   0.0769   0.2156     0.1465   0.2678   0.2802   0.3050
Average    0.0281   0.0690   0.1112     0.0879   0.1246   0.1421   0.1484
TRECVID: Text REtrieval Conference (TREC) Video Retrieval Evaluation
Sound waves and spectrograms
Speech information
Global features (one vector): the coarsest
Sampling: the finest
Speech features at different granularities
• Choice of speech unit (granularity): phoneme, syllable, word, …
• Choice of speech parameters:
  MFCC: Mel Frequency Cepstral Coefficients
  LSP: Line Spectrum Pair
  ICA: Independent Component Analysis
• Multi-granularity feature fusion
3. Structural Model
• Temporal Model (HMM)
• Spatial Model
The Temporal Structure of Speech
Multi-granular Structure
Image Region Annotation -horse, sky, mountain, grass, tree
Region-adaptive Grid Partition (2)
Experiments
• 4002 Corel images (384×256 or 256×384)
• 11 basic (region) concepts
• Features: color moment + wavelet
• 5 models: 2 without structural knowledge (GMM, SVM), 3 with structural knowledge (HMM*, RMF*, CRF*)
Image Region Annotation
Spatial Structural Representation
n images; each image has $m_i = H \times V$ grid cells
$(x, y) = \{(x_i, y_i)\},\quad i = 1, 2, \ldots, n$
$(x_i, y_i) = \{(x_i^j, y_i^j)\},\quad j = 1, 2, \ldots, m_i$
(a) i.i.d. generative model
(b) i.i.d. discriminative model
(c) 2-dimensional hidden Markov model (2D HMM)
(d) Markov Random Field (MRF)
(e) Conditional Random Field (CRF)
Different Models
Label configuration: $(x_i, y_i),\quad i = 1, 2, \ldots, N$
Given training data, the label configuration is obtained by MAP (maximum a posteriori) estimation:
$y^* = \arg\max_{y_{1:m}} P(y_{1:m} \mid x_{1:m})$
For the 2D HMM, MRF, and CRF, inference uses the path-limited Viterbi algorithm
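The path-limited Viterbi algorithm used for the 2D models is not spelled out on the slide. As a reference point only, the sketch below shows the standard one-dimensional Viterbi recursion for MAP decoding in an ordinary HMM, worked in log space with NumPy. It is not the path-limited 2D variant; the function name and argument layout are assumptions made for illustration.

```python
import numpy as np


def viterbi(log_init: np.ndarray, log_trans: np.ndarray,
            log_emit: np.ndarray) -> list:
    """MAP state sequence of a 1D HMM (standard Viterbi, not the path-limited 2D variant).

    log_init:  (S,)   log prior over the first state
    log_trans: (S, S) log transition matrix, indexed [from_state, to_state]
    log_emit:  (T, S) log likelihood of each observation under each state
    """
    num_steps, num_states = log_emit.shape
    score = log_init + log_emit[0]
    backpointer = np.zeros((num_steps, num_states), dtype=int)
    for t in range(1, num_steps):
        # candidates[i, j]: best score ending in state i at t-1, then moving to j.
        candidates = score[:, None] + log_trans
        backpointer[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for t in range(num_steps - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return path[::-1]


# Toy example: two labels with "sticky" transitions (all numbers are made up).
init = np.log([0.5, 0.5])
trans = np.log([[0.9, 0.1], [0.1, 0.9]])
emit = np.log([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(viterbi(init, trans, emit))
```

In the 2D setting the number of joint label paths grows quickly, which is presumably why the search is restricted to a limited set of paths.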
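Probabilistic distribution P
$C_s$: labeling clique; $C_0$: labeling and feature clique; $y^*$: the optimal label configuration
$P(x_{1:m}, y_{1:m}) = \frac{1}{Z} \prod_{(y_i, y_j) \in C_s} \phi(y_i, y_j) \prod_{(y_k, x_k) \in C_0} \psi(y_k, x_k)$
$y^* = \arg\max_{y_{1:m}} \prod_{(y_i, y_j) \in C_s} \phi(y_i, y_j) \prod_{(y_k, x_k) \in C_0} \psi(y_k, x_k)$
Here $\phi$ and $\psi$ denote the clique potentials and $Z$ is the normalizing constant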
Markov Random Field Model - MRF model
Comparison Among Different Models
GMM: Gaussian Mixture Model (30 components)
SVM: Support Vector Machine (Gaussian kernel, one-against-one)
HMM: Hidden Markov Model
RMF: Markov Random Field (MRF)
CRF: Conditional Random Field
Limited Path Viterbi Algorithm
Experimental Results
The Spatial Relation Among Region Labels
The probability that some things are above the “sky”, “flower” or “building”
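Future Directions
4. Data-driven Approach
The data-driven approach
The essence of the data-driven approach: for a specific database (speech, images), it is a problem of partitioning a high-dimensional space
Future directions:
• Large-scale annotated databases
• Sparseness in high-dimensional space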
Data Space
Horse: precision 25/30 pictures
Global color feature: horse, green
Eagle: precision 13/25 pictures
Global color feature: eagle, blue
Local features: 17/36 pictures
Region-based color features: foreground color pink
The Blessing of Dimensionality
Sparse Representation
Sample space (data space): Extended Yale B
• 2,414 frontal-face images with different lighting
• 38 individuals
• 192×168 images
J. Wright, et al. Robust face recognition via sparse representation, IEEE PAMI 08
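The idea behind sparse-representation classification can be sketched as follows: express the test sample as a sparse combination of all training samples, then assign it to the class whose training samples reconstruct it with the smallest residual. The code below is a simplified illustration, not the implementation from the cited paper: it swaps the paper's constrained l1 minimization for an l1-regularized least-squares problem solved by iterative soft-thresholding (ISTA), and the names `ista`, `src_classify`, and the parameter `lam` are hypothetical.

```python
import numpy as np


def ista(A: np.ndarray, y: np.ndarray, lam: float = 0.01,
         n_iter: int = 500) -> np.ndarray:
    """Minimise 0.5 * ||A x - y||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))   # gradient step on the quadratic term
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return x


def src_classify(A: np.ndarray, labels: np.ndarray, y: np.ndarray,
                 lam: float = 0.01):
    """Classify y by the class whose training columns best reconstruct it.

    A:      (d, n) matrix with one unit-norm training sample per column
    labels: (n,)   class id of each column
    """
    x = ista(A, y, lam)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - A[:, labels == c] @ x[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]


# Toy data: two classes living in different coordinate subspaces (made-up numbers).
A = np.array([[1, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
A /= np.linalg.norm(A, axis=0)               # normalise each training column
labels = np.array([0, 0, 1, 1])
print(src_classify(A, labels, 0.7 * A[:, 0] + 0.3 * A[:, 1]))
```

Because a corrupted or occluded test image still shares most of its energy with samples of the correct class, the per-class residual tends to remain discriminative, which is the intuition behind the robustness results that follow.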
Robustness to noise: 30%, 50%, 70%
Robustness to occlusion
5. Brain Science (Structural Model)
From eye to primary visual cortex
Li Zhaoping, Theoretical understanding of the early visual processes by data compression and data selection, Network: Computation in Neural Systems, December 2006; 17: 301-334
Two Basic Problems
• Description: What is the object-invariant descriptor in the human brain?
• Detection: How is the descriptor obtained from a huge amount of data?
There are partial answers, but no complete one.
Vision: from a 2D image to a 3D scene
This is a hard problem even for human beings
Eyes + brain
• Billions of years of evolution
• 1/3 of the brain's resources
• Several years of learning
Many problems are still unsolved for human beings
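Image Processing Based on Human Cognition
From data space to perception space (semantics)
Data space, original space, semantic space
2,000 bytes - 50%, (64×64)-dimensional features, tens of bytes
Cognition (Perception) Space
Perception space: semantically meaningful features
• Multi-level (hierarchy)
• Bottom-up data-driven processing + top-down feedback (context, priors, annotation knowledge)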
Object recognition with sparse, localized features
T. Serre, MIT-CSAIL-TR-2006-028
HMAX: sum + max
Computational Model
Experimental Results
• Caltech 101
  Number of categories: 101
  Training samples: 30 per class
  Average recognition rate: 51%
• Vista (car, passerby, bicycle)
  AUC > 90%
• AUC: the area under the ROC (Receiver Operating Characteristic) curve
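The AUC can be computed without drawing the ROC curve: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise formulation; the function name `roc_auc` and the toy scores are illustrative only, and the quadratic loop is meant for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Count positive-negative pairs ranked correctly; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy detector scores (made up): 1 = object present, 0 = object absent.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```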
Experimental study of the human auditory cortex
• Three Dutch vowels (a, i, u)
• Three speakers (1 female, 2 male)
• Features: F1-F2, F0
E. Formisano et al., “Who” Is Saying “What”? Brain-Based Decoding of Human Voice and Speech, Science, vol. 322, 7 Nov. 2008
Thank you!