Learning Large-Scale Multimodal Data Streams
– Ranking, Mining, and Machine Comprehension

Hung-Yi LEE (李宏毅), National Taiwan University
http://speech.ee.ntu.edu.tw/~tlkagk/

Winston H. HSU (徐宏民), National Taiwan University & IBM TJ Watson Ctr., New York
http://winstonhsu.info/

@GTC 2017, May 8, 2017
The First AI-Generated Movie Trailer – Identifying the
“Horror” Factors by Multimodal Learning
▪ The first movie trailer generated by an AI system (Watson)
[Sample scenes tagged: tender / scary / suspenseful]
https://www.ibm.com/blogs/think/2016/08/cognitive-movie-trailer/
Detecting Activities of Daily Living (ADL) from
Egocentric Videos
▪ Activities of daily living – a healthcare term for people's daily self-care activities
– Enabling technologies for exciting applications
▪ Very challenging!!
ADL example: brushing teeth
https://www.advancedrm.com/measuring-adls-to-assess-needs-and-improve-independence/
Our Proposal: Beyond Objects – Leveraging More
Contexts by Multimodal Learning
▪ Objects [1]: tap, cup, toothbrush, …
▪ Scenes (CNN for scene recognition over 67 scene classes): bathroom 0.8, kitchen 0.1, living room 0.01, …
▪ Sensors: accelerometer, microphone, heart rate
[1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
[2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016
[Hsieh et al., ICME’16]
Experimental Results for ADL
– Multimodal Learning Matters!
▪ Egocentric videos collected from 20 people (via Google Glass and GeneActiv)
[Bar chart: ADL recognition accuracy (0%–70%) comparing methods from [1] and [2]]
[1] Ramanan et al., Detecting Activities of Daily Living in First-person Camera Views, CVPR 2012
[2] Hsieh et al., Egocentric activity recognition by leveraging multiple mid-level representations, ICME 2016
Perception/understanding is multimodal.
How to design multimodal (end-to-end)
deep learning frameworks?
Outline
▪ Why learn with multimodal deep neural networks
▪ Required techniques for multimodal learning
▪ Sample projects
– Medical segmentation by cross-modal and sequential learning
– Cross-domain and cross-view learning for 3D retrieval
– Speech summarization
– Speech question answering
– Audio word to vector
3D Medical Segmentation by Deep Neural Networks
▪ Motivation – 3D biomedical segmentation plays a vital role in biomedical analysis
▪ Brain tumors take many different shapes and can appear anywhere in the brain → very challenging to localize
▪ Goal – to perform 3D segmentation with deep methods, segmenting by stacking all the 2D slices (sequences)
▪ Observation – oncologists leverage multi-modal signals in tumor diagnosis
[Tseng et al., CVPR 2017]
Multi-Modal Biomedical Images
▪ 3D multi-modal MRI images
– Different modalities distinguish the boundaries of different tumor tissues (e.g., edema, enhancing core, non-enhancing core, necrosis)
– Four modalities: FLAIR, T1, T1c, T2
[Example slices: FLAIR, T1, T1c, T2]
Related Work – SegNet (2D Image)
▪ Structured as an encoder and a decoder with multi-resolution fusion (MRF)
▪ But
– Ignores multi-modalities
– Lacks sequential learning
Badrinarayanan et al., SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, 2015
3D Medical Segmentation by Deep Neural Networks
▪ Our proposal – (first-ever) utilizing cross-modal learning in (end-to-end) sequential and convolutional neural networks, effectively aggregating multiple resolutions
[Tseng et al., CVPR 2017]
Kuan-Lun Tseng, Yen-Liang Lin, Winston Hsu and Chung-Yang Huang. Joint Sequence
Learning and Cross-Modality Convolution for 3D Biomedical Segmentation. CVPR 2017
ConvLSTM – Temporally Augmented Convolutional
Neural Networks
▪ Convolutional + sequential networks, e.g., ConvLSTM
– Modeling spatial cues through temporal (sequential) evolvement
▪ LSTM vs. ConvLSTM: a traditional LSTM transforms its inputs and states with fully-connected (matrix) products; ConvLSTM replaces these products with convolutions, preserving spatial structure (see the sketch below)
Shi et al., Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, NIPS 2015
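A minimal ConvLSTM cell sketch, assuming PyTorch (class and parameter names are illustrative, not the reference implementation): a single convolution over the concatenated input and hidden state produces all four gates.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates (i, f, o, g) at once.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # hidden/cell states: B x hid_ch x H x W
        gates = self.conv(torch.cat([x, h], dim=1))    # conv replaces the matrix products
        i, f, o, g = gates.chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()   # cell update keeps the spatial layout
        h = o.sigmoid() * c.tanh()
        return h, c

cell = ConvLSTMCell(in_ch=1, hid_ch=16)
x = torch.randn(2, 1, 32, 32)                          # one 2D slice (batch of 2)
h = c = torch.zeros(2, 16, 32, 32)
h, c = cell(x, (h, c))                                 # apply over slices 1 … n in a loop
```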
Cross Modality Convolution (CMC)
– For Each Slice
▪ Per-slice pipeline (detailed structure in Figure 2); a code sketch of the fusion step follows:
– Multi-modal encoder: each modality (FLAIR, T1, T1c, T2) passes through Conv + Batch Norm + ReLU and max-pooling layers
– Cross-modality convolution: the four C-channel feature maps (h × w) are stacked channel-wise into a C × h × w × 4 tensor and convolved with K kernels of size 4 × 1 × 1 × C
– Convolution LSTM: models the sequence of slices 1 … n
– Decoder: deconvolution plus Conv + Batch Norm + ReLU layers produce the per-slice segmentation
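A CMC sketch, assuming PyTorch (names and sizes are illustrative): a 4 × 1 × 1 × C kernel applied at every spatial position is equivalent to a 1 × 1 convolution over the 4·C stacked channels, i.e., a learned weighted sum across modalities and channels.

```python
import torch
import torch.nn as nn

class CrossModalityConv(nn.Module):
    def __init__(self, C, K, M=4):                     # C channels, K kernels, M modalities
        super().__init__()
        # An M x 1 x 1 x C kernel == a 1x1 conv over the M*C stacked channels.
        self.fuse = nn.Conv2d(M * C, K, kernel_size=1)

    def forward(self, feats):                          # feats: list of M tensors, B x C x h x w
        x = torch.cat(feats, dim=1)                    # stack modalities by channel
        return torch.relu(self.fuse(x))                # weighted sum across modalities/channels

cmc = CrossModalityConv(C=64, K=64)
feats = [torch.randn(1, 64, 32, 32) for _ in range(4)]  # FLAIR, T1, T1c, T2 feature maps
fused = cmc(feats)                                     # 1 x 64 x 32 x 32 fused representation
```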
Comparing with the State-of-The-Art in BRATS-2015
[Qualitative results: (a) MRI slices, (b) ground truth, (c) U-Net, (d) CMC (ours), (e) CMC + convLSTM (ours)]
▪ MRF is effective
▪ MME + CMC is better than a regular encoder + decoder
▪ Two-phase training is an important strategy for imbalanced data
▪ convLSTM (sequential modeling) helps slightly
Sketch/Image-Based 3D Model Search
▪ Speeding up 3D design and printing
– Current 3D shape search engines take text inputs only
– Leveraging large-scale, freely available 3D models
▪ Various applications for 3D models: 3D printing, AR, 3D game design, etc.
[Liu et al., ACMMM’15]
demo
[Lee et al., 2017]
Image-Based 3D Shape Retrieval
▪ To retrieve 3D shapes based on photo inputs
▪ Challenges:
– Effective feature representations of 3D shapes (with CNNs)
– Image-to-3D cross-domain similarity learning
[Lee et al., 2017]
Our Proposal – Cross-Domain 3D Shape Retrieval with
View Sequence Learning
▪ Novel proposal – end-to-end deep neural networks for cross-domain and cross-view learning, with efficient triplet learning
▪ A brand-new problem
[Lee et al., 2017]
▪ Pipeline: query image → Image-CNN → adaptation layer → image representation; 3D shape → rendered views → View-CNNs (shared weights) → cross-view convolution → shape representation; rank the 3D shapes by L2 distance to return the top-ranked shapes
Cross-Domain (Distance Metric) Learning: Siamese vs.
Triplet Networks
▪ Siamese: two identical (weight-shared) neural networks (CNN/DNN) with a contrastive loss over an image pair
▪ Triplet: three identical (weight-shared) networks with a triplet loss over (anchor, positive, negative) images (see the sketch below)
Wang, Jiang, et al., “Learning fine-grained image similarity with deep ranking,” CVPR 2014
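A minimal triplet-learning sketch, assuming PyTorch (the toy encoder stands in for the shared CNN/DNN):

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # toy shared encoder
loss_fn = nn.TripletMarginLoss(margin=1.0)     # push d(a, n) above d(a, p) + margin

anchor   = torch.randn(8, 3, 64, 64)           # anchor images
positive = torch.randn(8, 3, 64, 64)           # same-class images
negative = torch.randn(8, 3, 64, 64)           # different-class images

# One shared network embeds all three streams (identical, weights shared).
loss = loss_fn(embed(anchor), embed(positive), embed(negative))
loss.backward()
```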
Baseline: MVCNN, 3D Shape Feature by Max Pooling
– Ignoring Sequences
▪ Straightforward but ignores view sequences
– Each view is passed to the same CNN (shared weights)
– View-pooling is a MAX POOLING operation (see the sketch below)
▪ Pipeline: rendered views → shared CNN (conv1 → pool5) → view-pooling (element-wise max, same size as pool5) → fc6, fc7, fc8 → 4096-D feature and class scores (airplane, car, bed, …)
Su, Hang, et al., “Multi-view convolutional neural networks for 3D shape recognition,” ICCV 2015
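A view-pooling sketch, assuming PyTorch (shapes are illustrative):

```python
import torch

view_feats = torch.randn(12, 256, 6, 6)     # V=12 views of pool5-like feature maps
shape_feat = view_feats.max(dim=0).values   # element-wise max over views; view order discarded
```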
Our Proposal: Cross-Domain Triplet NN with View
Sequence Learning
▪ Cross-View Convolution aggregates multi-view features
▪ The adaptation layer adapts image features to the joint embedding space
▪ Late triplet sampling speeds up the training of cross-domain triplet learning
Cross-View Convolution (CVC)
▪ Stack the feature maps from V views by channel:
V x ( H x W x C ) → H x W x V x C
▪ Convolve the stacked tensor, reshaped from the per-view CNN features, with K kernels (1 x 1 x V x C)
– Setting K == C → #output channels == #input channels (for fair comparisons)
– K = C = 256, the number of channels in an AlexNet pool5 feature map
▪ CVC works as a weighted summation across views and channels (see the sketch below)
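A CVC sketch, assuming PyTorch: a 1 × 1 × V × C kernel is equivalent to a 1 × 1 convolution over the V·C stacked channels.

```python
import torch
import torch.nn as nn

V, C, K, H, W = 12, 256, 256, 6, 6        # K == C keeps output size comparable
cvc = nn.Conv2d(V * C, K, kernel_size=1)  # 1x1xVxC kernels over the stacked channels

views = torch.randn(V, C, H, W)           # per-view pool5-like feature maps
x = views.reshape(1, V * C, H, W)         # stack the V views by channel
shape_feat = torch.relu(cvc(x))           # learned weighted sum across views and channels
```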
Late Triplet Sampling (Fast-CDTNN) – Speeding Up
Cross-Domain Learning
▪ A naive cross-domain triplet neural network (CDTNN) has three streams
▪ Fast-CDTNN has two streams: it forwards each sampled image/3D shape once, then enumerates the triplets (all valid combinations) at an integrated triplet loss layer (see the sketch below)
▪ In our experiments, Fast-CDTNN is ~4–5x faster
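A late-triplet-sampling sketch, assuming PyTorch (labels, sizes, and margin are illustrative, not the authors' code): each embedding is computed once, and triplets are enumerated afterwards on the loss side.

```python
import torch
import torch.nn.functional as F

img_emb = torch.randn(8, 128, requires_grad=True)  # image-stream embeddings (one pass each)
shp_emb = torch.randn(8, 128, requires_grad=True)  # shape-stream embeddings (one pass each)
img_lbl = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])   # category labels
shp_lbl = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])

dist = torch.cdist(img_emb, shp_emb)               # all image-shape distances at once
losses = []
for a in range(8):                                 # anchor = image
    pos = (shp_lbl == img_lbl[a]).nonzero().flatten()
    neg = (shp_lbl != img_lbl[a]).nonzero().flatten()
    for p in pos:
        for n in neg:                              # enumerate triplets after one forward pass
            losses.append(F.relu(dist[a, p] - dist[a, n] + 1.0))
loss = torch.stack(losses).mean()
loss.backward()
```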
Comparisons and Datasets
▪ New image/3D shape benchmark: 12,311 3D shapes and 10,000 images across 40 categories
Methods mAP
AlexNet pool5 [1] 7.16%
MVCNN [2] 7.92%
Joint Embedding [3] 3.44%
CDTNN + view pooling [2] 40.85%
CDTNN + Adaptation Layer 47.84%
CDTNN + Adaptation Layer + CVC 52.67%
[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” NIPS 2012.
[2] Su, Hang, et al. “Multi-view convolutional neural networks for 3D shape recognition.” ICCV 2015.
[3] Li, Yangyan, et al. “Joint embeddings of shapes and images via CNN image purification.” ACM Trans. Graph., 2015.
Sample Results and Demo
[Retrieval examples (a), (b) for query categories: bathtub, person, stool, bed, car, bookshelf, keyboard, guitar]
demo
Speech Summarization
Summarization
[Diagram: audio file to be summarized → summarization → “This is the summary.”]
Extractive Summaries
➢ Select the most informative segments (e.g., “…… deep learning is powerful ……”) to form a compact version
➢ Machine does not write summaries in its own words
[Lee & Lee, Interspeech 12][Lee & Lee, ICASSP 13][Shiang & Lee, Interspeech 13]
Abstractive Summarization
• Now machines can do abstractive summarization (write summaries in their own words)
• Title generation: an abstractive summary of one sentence
[Diagram: training data (documents with titles 1–3) → model → title generated by machine, without hand-crafted rules, in its own words]
Sequence-to-sequence
• Input: transcriptions of audio from automatic speech recognition (ASR); output: title
• An RNN encoder reads through the input words w1 … w4 (hidden states h1 … h4); an RNN generator then decodes the title words wA, wB, … (states z1, z2, …) – see the sketch below
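A minimal sequence-to-sequence sketch for title generation, assuming PyTorch (toy vocabulary, greedy decoding; not the exact model of [Yu, Lee, Lee, SLT 2016]):

```python
import torch
import torch.nn as nn

vocab, emb, hid = 5000, 128, 256
embed   = nn.Embedding(vocab, emb)
encoder = nn.GRU(emb, hid, batch_first=True)
decoder = nn.GRU(emb, hid, batch_first=True)
to_word = nn.Linear(hid, vocab)

words = torch.randint(0, vocab, (1, 30))     # ASR transcription token ids w1 … w30
_, h = encoder(embed(words))                 # encoder reads through the input

token, title = torch.zeros(1, 1, dtype=torch.long), []   # assume id 0 is <bos>
for _ in range(10):                          # greedily generate the title wA, wB, …
    out, h = decoder(embed(token), h)
    token = to_word(out).argmax(-1)
    title.append(token.item())
```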
Sequence-to-sequence
• Training data: 2M story–headline pairs

Input                                     ROUGE-1  ROUGE-L
Manual transcription                      26.8     23.9
ASR transcription                         21.3     20.0
ASR transcription + special structure     22.9     20.9

• The special structure learns to ignore misrecognized words when generating the title
[Yu, Lee, Lee, SLT 2016]
Demo
• http://140.112.30.37:2401/
• https://www.youtube.com/watch?v=X3BapMl7Wv8
• https://www.youtube.com/watch?v=hFVKpVMB-Rc
• https://www.youtube.com/watch?v=hYf3fARyNvg
• https://www.youtube.com/watch?v=usi8EUabU7Y
• https://www.youtube.com/watch?v=FUmd6EnVeWw
• From SONG TUYEN NEWS: https://www.youtube.com/channel/UC-P4mEcWZVrFfdZIuiODiTg
Speech Question Answering
Speech Question Answering
• Machine answers questions based on the information in spoken content
• Q: “What is a possible origin of Venus’ clouds?” → A: “Gases released as a result of volcanic activity”
New task for Machine Comprehension of Spoken Content
• TOEFL Listening Comprehension Test by Machine
Question: “ What is a possible origin of Venus’ clouds? ”
Audio Story:
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
New task for Machine Comprehension of Spoken Content
• TOEFL Listening Comprehension Test by Machine
Question: “What is a possible origin of Venus’ clouds?”
Audio story (ASR transcriptions) + question → neural network → answer among the 4 choices, e.g., (A)
Using previous exams to train the network
Model Architecture
Question: “What is a possible origin of Venus’ clouds?”
Audio Story (after speech recognition): “…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……”
• Pipeline: semantic analysis of the question and the story → attention over the story → answer representation → select the choice most similar to the answer (see the sketch below)
• The whole model is learned end-to-end
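A highly simplified attention sketch, assuming PyTorch (a stand-in for the hierarchical attention model, with illustrative sizes; not the authors' architecture):

```python
import torch

q = torch.randn(128)                      # question semantics vector
story = torch.randn(200, 128)             # per-word story representations (from ASR text)
choices = torch.randn(4, 128)             # representations of the four choices

attn = torch.softmax(story @ q, dim=0)    # attention weights over the story
answer = attn @ story                     # attended answer representation
pred = torch.cosine_similarity(choices, answer.unsqueeze(0)).argmax()  # most similar choice
```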
Demo
Experimental Results
[Bar chart: accuracy (%); baselines – random guessing and naïve approaches]
• Example naïve approach:
1. Find the paragraph containing the most question key terms.
2. Select the choice containing the most key terms from that paragraph.
• It is not easy to get a high score without understanding; a minimal sketch of this baseline follows.
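A minimal sketch of the naïve key-term-matching baseline (plain Python; tokenization and names are illustrative):

```python
# Naive key-term matching baseline: match question terms to a paragraph,
# then match that paragraph's terms to a choice.
def naive_answer(question, paragraphs, choices):
    q_terms = set(question.lower().split())
    # 1. Find the paragraph containing the most question key terms.
    para = max(paragraphs, key=lambda p: len(q_terms & set(p.lower().split())))
    p_terms = set(para.lower().split())
    # 2. Select the choice sharing the most key terms with that paragraph.
    return max(choices, key=lambda c: len(p_terms & set(c.lower().split())))
```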
Experimental Results
[Bar chart: accuracy (%); random baseline and naïve approaches score low]
• 42.2% [Tseng, Shen, Lee, Lee, Interspeech’16]
• 48.8% [Fan, Hsu, Lee, Lee, SLT’16]
Analysis
• There are three types of questions
• Type 3: Connecting Information
➢ Understanding organization
➢ Connecting content
➢ Making inferences
Analysis
• There are three types of questions
• Type 2: Pragmatic Understanding
➢ Understanding the function of what is said
➢ Understanding the speaker’s attitude
Corpus & Code: https://github.com/sunprinceS/Hierarchical-Attention-Model
Audio Word to Vector
Framework of Spoken Language Understanding Tasks
• Typical pipeline: spoken content → speech recognition → text → spoken content retrieval / dialogue / speech summarization / speech question answering
• Can we bypass speech recognition?
• Why? Speech recognition needs manual transcriptions of lots of audio to learn, and most languages have little transcribed data.
New Research Direction: Audio Word to Vector
Typical Word to Vector
• Machine represents each word by a vector encoding its meaning
• Learned from lots of text without supervision
[Embedding space: dog / cat / rabbit cluster together; jump / run cluster; flower / tree cluster]
Audio Word to Vector
• Machine represents each audio segment (word-level) also by a vector
• Learned from lots of audio without supervision
[Chung, Wu, Lee, Lee, Interspeech 16]
Sequence-to-sequence Auto-encoder
• Audio segment → acoustic features x1 x2 x3 x4 → RNN encoder → vector (the vector we want)
• We use a sequence-to-sequence auto-encoder here; the training is unsupervised
Sequence-to-sequence Auto-encoder
• An RNN generator reconstructs the input acoustic features x1 … x4 (outputs y1 … y4) from the vector
• The RNN encoder and generator are jointly trained (see the sketch below)
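A sequence-to-sequence auto-encoder sketch, assuming PyTorch (toy feature size; illustrative of the idea, not the exact model of [Chung, Wu, Lee, Lee, Interspeech 16]):

```python
import torch
import torch.nn as nn

feat, hid = 39, 128                        # e.g., MFCC-like acoustic feature dimension
encoder = nn.GRU(feat, hid, batch_first=True)
decoder = nn.GRU(feat, hid, batch_first=True)
to_feat = nn.Linear(hid, feat)

x = torch.randn(1, 20, feat)               # one audio segment: acoustic features x1 … x20
_, z = encoder(x)                          # z is the fixed-length vector we want

# Reconstruct the input from z alone; reconstruction error is the only training
# signal, so no transcription (supervision) is needed.
y, _ = decoder(torch.zeros_like(x), z)
loss = nn.functional.mse_loss(to_feat(y), x)
loss.backward()
```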
What does machine learn?
• Typical word to vector captures semantics:
V(Rome) − V(Italy) + V(Germany) ≈ V(Berlin)
V(king) − V(queen) + V(aunt) ≈ V(uncle)
• Audio word to vector captures phonetic information, e.g., over spoken words:
V(GIRLS) − V(GIRL) + V(PEARL) ≈ V(PEARLS)
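A sketch of analogy-by-vector-arithmetic, assuming NumPy (random vectors stand in for trained embeddings, so the retrieved word is only meaningful with real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(50) for w in ["Rome", "Italy", "Germany", "Berlin"]}

query = vocab["Rome"] - vocab["Italy"] + vocab["Germany"]    # analogy arithmetic
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
best = max(vocab, key=lambda w: cos(vocab[w], query))        # nearest-neighbor lookup
# With trained embeddings, best ≈ "Berlin"; the random vectors here are placeholders.
```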
Demo
Next Step ……
• Audio word to vector with semantics
[Desired embedding space: dog / cat / cats cluster together; walk / walked / run cluster; flower / tree cluster]
• One day we can build all spoken language understanding applications directly from audio word to vector.
Take-Home Messages
▪ Multimodal deep learning is a ”must” for practical applications
– Perception/understanding is multimodal
– Sensors are complementary and low-cost
▪ Dealing with issues including
– Proper networks with domain knowledge, imbalanced training data, cross-modality/cross-domain learning, training strategies, etc.
▪ Promising in many tasks
– Segmentation, ranking, summarization, question answering, unsupervised representation
Thanks and Comments!
Hung-Yi LEE (李宏毅), National Taiwan University
Winston H. HSU (徐宏民), National Taiwan University & IBM TJ Watson Ctr., New York