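Attribution-NonCommercial-NoDerivs 2.0 Korea
Users are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that they follow the conditions below:
• Attribution. You must attribute the work to the original author.
• NonCommercial. You may not use this work for commercial purposes.
• NoDerivs. You may not alter, transform, or build upon this work.
For any reuse or distribution, you must make clear the license terms applied to this work.
These conditions do not apply if you obtain separate permission from the copyright holder.
Users' rights under copyright law are not affected by the above.
This is an easy-to-understand summary of the Legal Code.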
M.S. THESIS
Knowledge Transfer via Stochastic Drop and Skip Connection
(Korean title: 확률적 누락과 스킵 연결을 통한 지식 전달 기법)
BY
LEE KWANG-JIN
FEBRUARY 2020
DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
COLLEGE OF ENGINEERING
SEOUL NATIONAL UNIVERSITY
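Knowledge Transfer via Stochastic Drop and Skip Connection
Advisor: Byonghyo Shim (심병효)
Submitted as a thesis for the degree of Master of Science in Engineering
February 2020
Graduate School of Seoul National University
Department of Electrical and Computer Engineering
Lee Kwang-jin (이광진)
Confirming the Master of Science in Engineering thesis of Lee Kwang-jin
February 2020
Chair: Vice Chair: Examiner: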
Abstract
Deep neural networks (DNNs) have achieved state-of-the-art performance in various fields. However, because DNNs are computationally and memory intensive, they may need to be scaled down to fit real-world applications. Knowledge distillation has attracted much attention as a means of compressing a network while maintaining its performance. The technique trains a student network using the outputs of a teacher network. Various distillation methods have been proposed, and one of them deploys multiple teacher networks. However, this wastes resources to some extent and has therefore not received much attention. In the proposed approach, we generate multiple sub-networks from a single teacher network by exploiting stochastic blocks and skip connections. These sub-networks play the role of multiple teacher networks and provide sufficient knowledge to the student network without additional resources. We observe improved performance of student networks with the proposed approach on the CIFAR-100 and tiny-imagenet datasets.
keywords: convolutional neural networks, knowledge transfer, image classification,
multiple teacher networks.
student number: 2018-26533
Contents
Abstract
Contents
List of Tables
List of Figures
1 INTRODUCTION
2 Related Works
2.1 Knowledge Transfer
2.2 Multiple Teacher Networks
2.3 Regularizing Output
3 Proposed Framework
3.1 Generating Multiple Networks
3.2 Availability of the proposed framework
3.3 Application to Other Distillation Techniques
4 Experiment
4.1 Dataset and Simulation Setting
4.2 CIFAR-100
4.3 Tiny Imagenet
5 Ablation Study
6 Conclusion
Abstract (In Korean)
List of Tables
4.1 Improvement of knowledge distillation (KD) with the proposed structure on CIFAR-100
4.2 Improvement of attention transfer (AT) with the proposed method on CIFAR-100
4.3 Improvement of mutual learning (ML) with the proposed structure on CIFAR-100
4.4 Improvement of knowledge distillation (KD) with the proposed structure on tiny imagenet
4.5 Improvement of attention transfer (AT) with the proposed method on tiny imagenet
5.1 Comparison
List of Figures
1.1 The diagram of knowledge distillation
1.2 (a), (b), (c) Unravelled views when one block is dropped from a network with the proposed approach.
2.1 Learning diagram of (a) knowledge distillation (b) attention transfer (c) mutual learning.
3.1 Accuracy when each block is dropped from (a) residual network 32, (b) mobilenet, and (c) wide residual network 28-10.
3.2 Entropy when each block is dropped from (a) residual network 32, (b) mobilenet, and (c) wide residual network 28-10.
Chapter 1
INTRODUCTION
Deep neural networks (DNNs) have achieved state-of-the-art performance on complex tasks such as computer vision [1], language modeling [2], and machine translation [3], to name just a few. Despite this superior performance, DNN-based models are not easy to use on embedded systems with restricted memory and computational resources. Over the years, many approaches have been suggested for building smaller but efficient DNNs. One of them, knowledge distillation (KD), has received much attention in recent years [4].
The basic idea of KD is to train a smaller network (a.k.a. the student network) with the help of the softened outputs (a.k.a. soft targets) of a larger network (a.k.a. the teacher network). The teacher network is usually pre-trained by supervised learning. While maintaining low computational complexity, the student network learns the information of the teacher network by mimicking the soft targets (see Figure 1.1). Specifically, the student network is optimized by minimizing a loss function that is the sum of two terms: the cross-entropy loss between the outputs of the student network and the soft targets, and the cross-entropy loss between the outputs of the student network and the labels. Recently, it has been shown that the learning of student networks can be much improved with the help of multiple teacher networks [4, 5, 6]. Multiple teacher networks have different knowledge and views from each other, so they generate different outputs on the same input. The student network can then fuse the distinct knowledge from multiple teacher networks to establish its own comprehensive and in-depth understanding of the knowledge [6]. However, most of the current literature pays little attention to the cost of this approach: training and running several teacher networks is, to some extent, a waste of resources.
Figure 1.1: The diagram of knowledge distillation
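The two-term KD loss described above can be sketched as follows. This is a minimal illustration rather than the thesis's implementation: the temperature value, the unweighted sum, and the example logits are assumptions introduced here, following the general recipe of [4].

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def kd_loss(student_logits, teacher_logits, label, temperature=4.0):
    """Sum of the soft-target term and the label term described in the text."""
    # Soft targets: teacher outputs softened by an assumed temperature.
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Label term: ordinary cross-entropy against the one-hot label.
    one_hot = [1.0 if k == label else 0.0 for k in range(len(student_logits))]
    student_hard = softmax(student_logits)
    return (cross_entropy(soft_targets, student_soft)
            + cross_entropy(one_hot, student_hard))


# Hypothetical logits for a 3-class problem.
teacher = [6.0, 2.0, 1.0]
good_student = [5.0, 2.0, 1.0]
bad_student = [1.0, 5.0, 2.0]
print(kd_loss(good_student, teacher, label=0))
print(kd_loss(bad_student, teacher, label=0))
```

A student whose outputs agree with both the teacher and the label receives the smaller loss, which is the behavior the student network is trained toward.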
The primary goal of this paper is to improve the performance of the student network with the comprehensive knowledge of multiple sub-networks. Here, sub-networks are generated by exploiting stochastic blocks and skip connections. Stochastic blocks are conceptually similar to units in the dropout setting [7]: dropout zeros the output of individual units at random during training, while stochastic blocks, each consisting of one or more layers, are skipped at random during training. Skip connections [1] are identity mappings that bypass one or more layers. Figure 1.2 shows an example of sub-networks generated by dropping each block from the original network. In Figure 1.2, the original network consists of 3 blocks denoted by $f_i$ ($i \in \{1, 2, 3\}$), and Id denotes the identity mapping given by a skip connection. Dropped blocks are marked with crosses, and their outputs are zero. Skip connections allow the network to be viewed as an aggregate of multiple valid paths, whereas an ordinary feed-forward network has only one valid path. Therefore, if any block drops, the feed-forward network becomes useless, while a network with skip connections remains reliable. Indeed, in Figure 1.2, when one block drops, each sub-network still has 4 valid paths out of 8 in total.
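The path counts quoted above can be checked by brute-force enumeration. The sketch below is illustrative only; the function name and the 0-based block indices are choices made here, not notation from the thesis.

```python
from itertools import product


def valid_paths(n_blocks, dropped=()):
    """Enumerate the valid paths through n_blocks residual blocks.

    A path picks, for each block, either the block function ("f") or the
    identity skip connection ("Id"). A path is valid only if it takes the
    skip connection at every dropped block (0-based indices).
    """
    paths = []
    for choice in product(("f", "Id"), repeat=n_blocks):
        if all(choice[i] == "Id" for i in dropped):
            paths.append(choice)
    return paths


print(len(valid_paths(3)))                # all paths of a 3-block network
for i in range(3):
    print(i, len(valid_paths(3, dropped=(i,))))  # one block dropped
```

For a 3-block network this yields 8 paths in total and 4 valid paths whichever single block is dropped, matching Figure 1.2.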
Figure 1.2: (a), (b), (c) Unravelled views when one block is dropped from a network with the proposed approach.
We equip a teacher network with stochastic blocks and skip connections. As each block drops randomly, a sub-network consisting of the remaining paths is generated for each batch. In this way, we can generate multiple sub-networks from one teacher network without additional resources, and the student network is trained with multiple sub-networks over the entire training process. In this paper, we make the following contributions:
• We propose a framework that generates multiple sub-networks from a single teacher network without additional resources.
• We show that the generated sub-networks provide easy-to-learn and regularizing knowledge to the student network.
• As a result, we observe that the performance of the student network improves further compared to the same architecture trained by conventional knowledge transfer methods.
Chapter 2
Related Works
2.1 Knowledge Transfer
Recently, knowledge transfer has received much attention as a means of building smaller but efficient DNNs. Some intuitions and a generalized approach (KD) are presented in [4]. In essence, the softened outputs of a teacher network are transferred to a student network as additional supervision. They provide the student network with the knowledge of the teacher network and prevent the student network from overfitting. Later, an approach that uses distillation to transfer knowledge from powerful, easy-to-train networks to small but hard-to-train networks was proposed [9]. In this approach, intermediate outputs of the teacher network are provided as hints for the student network, which is trained to mimic not only the final outputs but also the intermediate outputs of the teacher network. Another approach, based on attention, has also been proposed [8]. There, attention maps computed from the intermediate feature maps of the networks are used like the hints in [9]. To transfer knowledge while avoiding direct mimicry, an approach has been proposed in which layers of the teacher network are used to generate flows from the Gram matrix of feature maps [10]. The student network is then optimized to learn the flows of the teacher network during training. More recently, mutual learning [5] has been suggested as a new paradigm of bidirectional knowledge transfer. Unlike the conventional teacher-student paradigm, where knowledge is transferred one way from a fixed teacher network to the student network, all networks in mutual learning exchange knowledge in a mutually beneficial way. Figure 2.1 shows diagrams of the conventional transfer methods used in our simulations.
Knowledge transfer has been deployed in several applications. In [11], a mechanism called defensive distillation is proposed. Defensive distillation reduces the effectiveness of adversarial samples on DNNs and thereby enhances the robustness of the DNNs against adversarial attacks. In [12], student networks are trained simultaneously while transferring knowledge to each other, as in [5]. Unlike in [5], they are trained with a new mutual learning loss based on metric learning.
Figure 2.1: Learning diagram of (a) knowledge distillation (b) attention transfer (c) mutual learning.
2.2 Multiple Teacher Networks
In order to provide more extensive knowledge to a student network, knowledge transfer using multiple teacher networks has been proposed [6]. Each teacher network generates distinct and useful outputs for given inputs. Together, multiple teacher networks can therefore provide the student network with more generalized and extensive knowledge, further enhancing the student network. This is effective in various applications, such as image classification, person re-identification [5], and speech recognition [13]. However, deploying multiple teacher networks to transfer knowledge to a single student network is, to some extent, a waste of resources. One straightforward way to overcome this issue is to perturb the outputs of a single teacher network with random noise to imitate the effect of multiple teacher networks [14]. This is problematic because corrupted knowledge could be transferred to the student network. Moreover, this method acts as a regularizer, as in [15], rather than as a way of obtaining knowledge from multiple teacher networks. In contrast, our proposed approach generates multiple networks of valid paths (see Figure 1.2), so that reliable and diverse knowledge is ensured.
2.3 Regularizing Output
In reinforcement learning, the policy has been encouraged to produce output distributions with high entropy [16, 17]. This prevents the policy from converging to a narrow region of the action space and promotes exploration, making it more likely that a policy closer to the optimum is found. Encouraging high-entropy outputs [18] and label smoothing [19] have also been shown to help the training of DNNs. Regularizing highly confident outputs prevents the network from overfitting and increases the adaptivity of the network. In our approach, the generated sub-networks have higher-entropy outputs compared to the original network. Thus, a student network is trained to mimic the less confident outputs of the sub-networks. This is analogous to penalizing highly confident outputs of DNNs and helps the student network generalize.
Chapter 3
Proposed Framework
We propose a framework that generates multiple sub-networks of valid paths from a single teacher network. In essence, the key point of this approach is to equip a single teacher network with stochastic blocks and skip connections. We explain how to generate sub-networks of valid paths and demonstrate that these sub-networks further enhance the performance of student networks.
3.1 Generating Multiple Networks
We exploit the characteristics of skip connections and stochastic blocks to generate multiple sub-networks from a single teacher network. First, we add a skip connection from the input of each block of the teacher network to the corresponding output. In residual networks, where skip connections were introduced, skip connections allow the network to be viewed as an ensemble of multiple paths of different lengths [20]. Let $f_i$ be the function computed by the $i$-th block of the residual network. Then the output $o_{i+1}$ of the $(i+1)$-th block can be expressed as
\[ o_{i+1} = f_{i+1}(o_i) + o_i. \tag{3.1} \]
Since two paths exist from the input to the output of each block, a teacher network of $n$ blocks has $2^n$ valid paths in total (see Figure 1.2).
Next, we make the blocks of the teacher network stochastic. Each block has a survival probability, which plays the same role as the keep probability in dropout. Specifically, let $p_i$ be the survival probability of the $i$-th block ($1 \le i \le n$). Since the initial blocks extract low-level features that are used by later blocks, they should be present more reliably [21]. Thus, we use the linear decay mode, in which $p_i$ satisfies
\[ p_i = 1 - (1 - p_n)\,\frac{i - 1}{n - 1}. \tag{3.2} \]
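As a worked instance of Eq. (3.2), the sketch below computes the survival probabilities for a hypothetical teacher with 5 blocks and $p_n = 0.5$. The function name and the example values are illustrative assumptions.

```python
def survival_probabilities(n_blocks, p_last):
    """Return [p_1, ..., p_n] under the linear decay rule of Eq. (3.2)."""
    if n_blocks < 2:
        raise ValueError("Eq. (3.2) needs at least two blocks")
    return [1.0 - (1.0 - p_last) * (i - 1) / (n_blocks - 1)
            for i in range(1, n_blocks + 1)]


# Hypothetical example: 5 blocks, last-block survival probability 0.5.
print(survival_probabilities(5, 0.5))
```

The first block always survives ($p_1 = 1$), and the probability decreases linearly until it reaches $p_n$ at the last block.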
Stochastic blocks drop randomly in the training phase. However, dropping some blocks from a network with the proposed structure does not harm the performance much. This is because skip connections create exponentially many paths in the network, so valid paths still exist even when some blocks drop (see footnote 1). Therefore, when a neural network consists of stochastic blocks and skip connections, multiple sub-networks, each a part of the original network, are generated, and their reliable performance is ensured.
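A toy forward pass makes the mechanism concrete. The sketch below applies Eq. (3.1) block by block and replaces the output of a dropped block with zero, so that only the identity path remains. The scalar "blocks" are hypothetical stand-ins for real layers, and the helper names are introduced here for illustration.

```python
import random


def sample_keep_mask(survival_probs, rng):
    """Draw one Bernoulli keep/drop decision per block (True = keep)."""
    return [rng.random() < p for p in survival_probs]


def forward(x, blocks, keep_mask):
    """Apply o_{i+1} = f_{i+1}(o_i) + o_i, zeroing the output of dropped blocks."""
    out = x
    for block, keep in zip(blocks, keep_mask):
        out = (block(out) if keep else 0.0) + out
    return out


# Hypothetical scalar blocks standing in for convolutional blocks.
blocks = [lambda v: 0.5 * v, lambda v: v + 1.0, lambda v: 2.0 * v]

rng = random.Random(0)
mask = sample_keep_mask([1.0, 0.75, 0.5], rng)  # one sub-network per batch
print(mask, forward(1.0, blocks, mask))
```

Even when every block is dropped, the input still reaches the output through the skip connections, which is why the sampled sub-networks remain usable as teachers.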
Note that $p_n$ implies a trade-off between the reliability and the variety of the sub-networks. If $p_n$ is high, each generated sub-network keeps more blocks and is therefore longer, but fewer distinct sub-networks are generated. In contrast, if $p_n$ is low, more sub-networks of shorter length are generated, but their performance is somewhat lower. In our proposed framework, we search $p_n$ over $\{0.5, 0.6, 0.7, 0.8, 0.9\}$ and choose the best value for each pair of teacher network and student network.
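One way to get a feel for this trade-off is to compute, for each candidate $p_n$, the expected number of surviving blocks and the probability that no block is dropped at all. This is a rough sketch: the block count of 15 is a hypothetical value, and these two statistics are proxies chosen here, not measures used in the thesis.

```python
import math


def linear_decay(n_blocks, p_last):
    """Survival probabilities p_1..p_n from Eq. (3.2)."""
    return [1.0 - (1.0 - p_last) * (i - 1) / (n_blocks - 1)
            for i in range(1, n_blocks + 1)]


def expected_kept_blocks(probs):
    """Expected sub-network length: the sum of the survival probabilities."""
    return sum(probs)


def prob_no_block_dropped(probs):
    """Probability that the sampled sub-network is the full network."""
    return math.prod(probs)


n_blocks = 15  # hypothetical number of blocks
for p_last in (0.5, 0.6, 0.7, 0.8, 0.9):
    probs = linear_decay(n_blocks, p_last)
    print(p_last, expected_kept_blocks(probs), prob_no_block_dropped(probs))
```

Under linear decay the expected length equals $n(1 + p_n)/2$, so a higher $p_n$ gives longer sub-networks, while a lower $p_n$ makes it rarer to sample the full network and so yields more varied sub-networks.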
3.2 Availability of the proposed framework
In this work, we propose using multiple sub-networks generated from a single teacher network to help the learning of a student network. To serve as teacher networks, the sub-networks should have reliable performance and provide student networks with meaningful knowledge. Here we show that the performance of the sub-networks is comparable to that of the original teacher network and that they provide the student network with easy-to-learn and regularizing knowledge.
Footnote 1 (Section 3.1): If $k$ of the $n$ blocks are dropped, $2^{n-k}$ valid paths still exist.
First, we measure the accuracy of the sub-networks obtained when each block is dropped from pre-trained networks, namely a 32-layer ResNet, MobileNet, and wide ResNet 28-10 (see Figure 3.1), on the CIFAR-100 test set. In Figure 3.1, sto denotes a network pre-trained as in [21], basic denotes a network pre-trained normally without dropping any blocks, and baseline denotes the accuracy when no block is dropped. For the most part, the performance of sub-networks from sto networks is maintained. However, dropping the initial blocks of MobileNet or the 4th block of wide ResNet 28-10 degrades the performance significantly. To observe the impact of such blocks, which are fatal to drop, the ablation study compares the case where they are never dropped with the case where they are dropped like the other blocks.
Sub-networks provide the student network with high-entropy output distributions, which help the learning process of the student network (see Figure 3.2). It is known that regularizing a neural network to be less confident improves its performance [5, 18]. This is because training a network to produce highly confident outputs lets it overfit to the training dataset and reduces its adaptivity by bounding the gradient. In addition, a high-entropy distribution contains important information, such as relations between classes, which are salient cues to how the teacher network generalizes. In [5], it has been shown that using an ensemble of $n$ networks as a teacher is less helpful than using $n$ individual networks as $n$ teachers. This is because ensembling makes the secondary values of the outputs small, so the high-entropy distribution is lost. Therefore, training student networks to mimic a higher-entropy distribution helps them generalize to unseen data. In our proposed approach, generating sub-networks from the teacher network is analogous to using individual networks instead of their ensemble in [5]. Therefore, the knowledge that the sub-networks provide contains important information and is learned easily by student networks.
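To make the notion of a high-entropy output concrete, the sketch below compares the Shannon entropy of two hypothetical output distributions: a confident one, as a full teacher might produce, and a softer one, as a sub-network might produce. The probability values are made up for illustration and are not taken from Figure 3.2.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical 4-class outputs for the same input.
confident = [0.97, 0.01, 0.01, 0.01]  # small secondary values
softer = [0.70, 0.15, 0.10, 0.05]     # secondary values carry class relations

print(entropy(confident))
print(entropy(softer))
```

The softer distribution has the larger entropy, and its non-negligible secondary values are what convey relations between classes to the student.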
3.3 Application to Other Distillation Techniques
We apply the proposed framework to existing distillation techniques, namely KD, AT, and ML. For KD and AT, we slightly change the model by adding skip connections and stochastic blocks to the teacher network. In ML, the notions of teacher and student vanish, since both networks exchange knowledge with each other. For convenience, however, we refer to the network with larger capacity as the teacher network and the other as the student network. As in KD and AT, we add skip connections and stochastic blocks only to the teacher network. In ML, both networks should be pre-trained, since the teacher network is not fixed. If the networks are not pre-trained, they cannot improve because of the randomness of the teacher network. To see why, consider the case where neither network is pre-trained. At the beginning of training, the teacher network has only just been initialized, so its sub-networks have no meaningful knowledge of the given dataset. As a result, the student network receives meaningless knowledge from the sub-networks and does not improve. In turn, the sub-networks are not optimized well because of the disturbing knowledge from the student network.
Figure 3.1: Accuracy when each block is dropped from (a) residual network 32, (b) mobilenet, and (c) wide residual network 28-10.
Figure 3.2: Entropy when each block is dropped from (a) residual network 32, (b) mobilenet, and (c) wide residual network 28-10.
Chapter 4
Experiment
4.1 Dataset and Simulation Setting
We evaluate the proposed method on two datasets: CIFAR-100 [22] and tiny imagenet [23]. The CIFAR-100 dataset consists of 32 × 32 RGB color images drawn from 100 classes, split into 50,000 training and 10,000 test images. The tiny imagenet dataset is a down-sampled version of the ImageNet dataset. It consists of 64 × 64 RGB color images drawn from 200 classes, split into 100,000 training and 10,000 test images.
For CIFAR-100, we normalize each image and augment the training images. The data augmentation consists of horizontal flips and random crops from the image padded by 4 pixels on each side, with the missing pixels filled by reflections of the original image. Each network is trained for 200 epochs with a batch size of 128, and the learning rate is decreased every 60 epochs. For tiny imagenet, we use the raw dataset without augmentation. Each network is trained for 100 epochs with a batch size of 128, and the learning rate is decreased every 40 epochs. We use the stochastic gradient descent optimizer with a momentum of 0.9. The initial learning rate is 0.01 for ML and 0.1 for the other methods.
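The CIFAR-100 augmentation described above can be sketched in plain Python on a single-channel image stored as a list of rows. This is only an illustration: the exact reflection convention (here the border pixel is not repeated), the flip probability of 0.5, and the function names are assumptions, since the text does not specify them.

```python
import random


def reflect_pad(image, pad):
    """Pad a 2-D image (list of rows) by mirroring pixels about its border."""
    def pad_row(row):
        left = row[1:pad + 1][::-1]
        right = row[-pad - 1:-1][::-1]
        return left + row + right

    rows = [pad_row(r) for r in image]
    top = rows[1:pad + 1][::-1]
    bottom = rows[-pad - 1:-1][::-1]
    return top + rows + bottom


def augment(image, pad=4, rng=random):
    """Random crop of the original size from the padded image, then random flip."""
    height, width = len(image), len(image[0])
    padded = reflect_pad(image, pad)
    top = rng.randint(0, 2 * pad)   # randint includes both end points
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + width] for row in padded[top:top + height]]
    if rng.random() < 0.5:          # assumed flip probability
        crop = [row[::-1] for row in crop]
    return crop


# Hypothetical 32 x 32 single-channel image.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
augmented = augment(image, pad=4, rng=random.Random(0))
print(len(augmented), len(augmented[0]))
```

The augmented image keeps the 32 × 32 size of the input, so it can be fed to the network without further resizing.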
Table 4.1: Improvement of knowledge distillation (KD) with the proposed structure on CIFAR-100 (accuracy, %)
Net 1       Net 2      Independent (Net 1)  Independent (Net 2)  KD     Ours
ResNet 32   VGG 13     69.86                67.74                71.5   72.2
ResNet 110  ResNet 20  71.69                68.32                68.72  70.99
WRN 28-10   ResNet 32  78.98                69.86                69.85  74.87
MobileNet   ResNet 32  74.08                69.86                69.88  71.77
ResNet 110  ResNet 32  71.69                69.86                70.12  73.36
MobileNet   VGG 13     74.08                67.74                68.83  71.12
Table 4.2: Improvement of attention transfer (AT) with the proposed method on CIFAR-100 (accuracy, %)
Net 1       Net 2      Independent (Net 1)  Independent (Net 2)  AT     Ours   p_end
ResNet 110  ResNet 20  71.69                68.32                68.34  68.67  0.6
WRN 28-10   ResNet 32  78.98                69.86                69.87  70.64  0.7
ResNet 110  ResNet 32  71.69                69.86                70.22  71.23  0.7
WRN 40-4    ResNet 32  75.67                69.86                70.03  70.59  0.7
WRN 28-10   WRN 40-4   78.98                75.67                75.36  76.09  0.7
Table 4.3: Improvement of mutual learning (ML) with the proposed structure on CIFAR-100 (accuracy, %)
Net 1      Net 2      Independent (Net 1)  Independent (Net 2)  ML (Net 1)  ML (Net 2)  Ours (Net 1)  Ours (Net 2)
ResNet 32  ResNet 32  69.86                69.86                71.14       71.21       73.68         73.58
MobileNet  ResNet 32  74.08                69.86                75.62       71.1        76.2          72.76
WRN 28-10  ResNet 32  78.98                69.86                78.53       72.18       80.65         73.08
MobileNet  MobileNet  74.08                74.08                75          75.16       75.5          76.1
WRN 28-10  MobileNet  78.98                74.08                78.34       76.41       81.03         76.82
WRN 28-10  WRN 28-10  78.98                78.98                78.83       78.95       81            80.66
4.2 CIFAR-100
Here, we present simulation results of knowledge transfer methods on CIFAR-100. Table 4.1 shows the simulation results of KD and of KD with the proposed approach. Table 4.1 confirms that the proposed approach further improves the performance of student networks. In particular, for the (WRN 28-10, ResNet 32) pair, the accuracy of ResNet 32 trained with the proposed structure is more than 5 percentage points higher than when ResNet 32 is trained with the plain WRN 28-10 (74.87% versus 69.85%). Table 4.2 shows the simulation results of AT and of AT with the proposed approach. For AT, we use only ResNets and wide ResNets so that the spatial sizes of the feature maps match conveniently. Attention maps are computed by summing the squared activations along the channel axis, followed by l2 normalization. We confirm that the proposed approach shows further improvement over the plain AT method. Table 4.3 shows the simulation results of ML and of ML with the proposed approach. The networks in the Net 1 column of Table 4.3 are changed to the proposed structure. The proposed approach also shows further improvement in the peer learning paradigm, and both networks improve further.
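The attention map computation described above can be sketched as follows, with a feature map stored as nested lists indexed by channel, row, and column. The l2 distance used as the transfer loss follows the general form of [8] and is an assumption here, as are the function names and the toy feature maps.

```python
import math


def attention_map(feature_map):
    """Sum squared activations over channels, flatten, then l2-normalize."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    flat = [sum(feature_map[c][h][w] ** 2 for c in range(channels))
            for h in range(height) for w in range(width)]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


def attention_loss(student_fm, teacher_fm):
    """l2 distance between normalized attention maps (assumed loss form)."""
    s = attention_map(student_fm)
    t = attention_map(teacher_fm)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


# Hypothetical 2 x 2 feature maps: teacher has 2 channels, student has 1.
teacher_fm = [[[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 0.0]]]
student_fm = [[[1.0, 0.0], [0.0, 1.0]]]
print(attention_map(teacher_fm))
print(attention_loss(student_fm, teacher_fm))
```

Because the channel axis is summed out, the teacher and the student may have different numbers of channels, but their spatial sizes must agree, which is why only ResNets and wide ResNets are paired for AT.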
Table 4.4: Improvement of knowledge distillation (KD) with the proposed structure on tiny imagenet (accuracy, %)
Net 1      Net 2       Independent (Net 1)  Independent (Net 2)  KD     Ours   p_end
ResNet 32  VGG 13      49.01                44.61                55.76  57.56  0.9
ResNet 32  ResNet 20   49.01                46.85                49.57  50.6   0.9
MobileNet  ResNet 20   55.38                46.85                51.8   52.15  0.7
MobileNet  ResNet 32   55.38                49.01                54.48  54.85  0.8
MobileNet  ResNet 110  55.38                52.32                58.15  58.2   0.9
WRN 28-10  ResNet 32   58.91                49.01                55.7   55.34  0.6
Table 4.5: Improvement of attention transfer (AT) with the proposed method on tiny imagenet (accuracy, %)
Net 1       Net 2      Independent (Net 1)  Independent (Net 2)  AT     Ours
ResNet 110  ResNet 20  52.32                46.85                51.49  51.9
WRN 28-10   ResNet 32  58.91                49.01                53.56  54.15
ResNet 110  ResNet 32  52.32                49.01                54.52  54.91
WRN 40-4    ResNet 32  55.19                49.01                54.33  54
WRN 28-10   WRN 40-4   58.91                55.19                60.98  61.36
4.3 Tiny Imagenet
Here, we present simulation results of knowledge transfer methods on tiny imagenet. Tables 4.4 and 4.5 show the simulation results of KD and AT, each with and without the proposed approach. The proposed approach still improves student networks in general, but in each table there is one pair for which the student network lags behind the conventional transfer method.
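The statement about lagging pairs can be checked directly against the table entries. The sketch below copies the baseline and proposed accuracies from Tables 4.4 and 4.5 (network names written out in full) and lists the pairs for which the proposed approach scores lower. The helper name is introduced here for illustration.

```python
# (Net 1, Net 2, baseline transfer accuracy, accuracy with the proposed approach)
table_4_4_kd = [
    ("ResNet 32", "VGG 13", 55.76, 57.56),
    ("ResNet 32", "ResNet 20", 49.57, 50.6),
    ("MobileNet", "ResNet 20", 51.8, 52.15),
    ("MobileNet", "ResNet 32", 54.48, 54.85),
    ("MobileNet", "ResNet 110", 58.15, 58.2),
    ("WRN 28-10", "ResNet 32", 55.7, 55.34),
]
table_4_5_at = [
    ("ResNet 110", "ResNet 20", 51.49, 51.9),
    ("WRN 28-10", "ResNet 32", 53.56, 54.15),
    ("ResNet 110", "ResNet 32", 54.52, 54.91),
    ("WRN 40-4", "ResNet 32", 54.33, 54.0),
    ("WRN 28-10", "WRN 40-4", 60.98, 61.36),
]


def lagging_pairs(rows):
    """Return the (Net 1, Net 2) pairs where the proposed approach is lower."""
    return [(net1, net2) for net1, net2, base, ours in rows if ours < base]


print(lagging_pairs(table_4_4_kd))
print(lagging_pairs(table_4_5_at))
```

The lagging pair is (WRN 28-10, ResNet 32) for KD and (WRN 40-4, ResNet 32) for AT, in both cases by less than 0.4 percentage points.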
Chapter 5
Ablation Study
In Figure 3.1, the performance of the networks drops significantly when certain blocks are dropped: the 4th block of WRN 28-10 and the 1st to 6th blocks of MobileNet. We call these blocks significant blocks. Sub-networks generated by dropping significant blocks have low performance, so they might not be adequate as teacher networks. Hence, we examine whether a student network is degraded when sub-networks in which significant blocks are dropped are also used as teacher networks. We perform the simulation with KD on the CIFAR-100 dataset for the (WRN 28-10, ResNet 32) and (MobileNet, ResNet 32) pairs.
In Table 5.1, partial means that none of the significant blocks is dropped, and full means that all blocks are dropped stochastically in the training phase. The results show that using the whole set of sub-networks is more helpful for improving a student network, even if some of them do not perform well. This is in line with the result of ML [5], where a larger network still benefits from being trained together with a smaller network.
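The difference between the partial and full settings can be expressed as a small change to how the keep/drop mask is sampled. The sketch below is an assumption about how such a mask could be drawn. The `protected` argument, the 0-based indices, and the example probabilities are all hypothetical.

```python
import random


def sample_keep_mask(survival_probs, rng, protected=()):
    """Sample a keep/drop decision per block; protected blocks never drop."""
    return [i in protected or rng.random() < p
            for i, p in enumerate(survival_probs)]


rng = random.Random(0)
probs = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]  # hypothetical survival probabilities

# "full": every block may drop according to its survival probability.
full_mask = sample_keep_mask(probs, rng)

# "partial": the significant block (here the 4th block, index 3) is always kept.
partial_mask = sample_keep_mask(probs, rng, protected={3})
print(full_mask, partial_mask)
```

In the partial setting the protected entries of the mask are always true, so the sub-networks that drop a significant block are never used as teachers.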
Table 5.1: Comparison (accuracy, %)
Net 1      Net 2      Independent (Net 1)  Independent (Net 2)  Partial  Full   p_end
WRN 28-10  ResNet 32  78.98                69.86                74.81    74.87  0.5 / 0.5
MobileNet  ResNet 32  74.08                69.86                70.3     71.77  0.9 / 0.7
Chapter 6
Conclusion
In this work, we propose changing the structure of a teacher network to obtain the effect of multiple teacher networks. With the proposed approach, multiple teacher networks are obtained without additional resources, so that compact networks improve further with the help of more extensive knowledge. The proposed structure can be easily applied to other transfer methods and tasks, e.g., object detection and segmentation.
Bibliography
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[2] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
[3] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[5] Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
[6] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1285–1294. ACM, 2017.
[7] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[8] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
[9] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[10] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.
[11] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.
[12] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, and Jian Sun. Alignedreid: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
[13] Yevgen Chebotar and Austin Waters. Distilling knowledge from ensembles of neural networks for speech recognition. In Interspeech, pages 3439–3443, 2016.
[14] Bharat Bhusan Sau and Vineeth N Balasubramanian. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.
[15] Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. Disturblabel: Regularizing cnn on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.
[16] Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
[17] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[18] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
[19] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[20] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.
[21] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[22] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
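Abstract (translated from the Korean)
Deep neural networks have shown outstanding performance in various fields. However, because deep neural networks are computationally and memory intensive, their size needs to be reduced before they can be used in practice. Knowledge distillation has attracted much attention as a way of reducing the size of a network while maintaining its performance. The basic idea of this technique is to train a student network with the help of a teacher network. Various knowledge distillation techniques have been proposed, and one of them uses multiple teacher networks. However, this wastes resources to some extent and has therefore not received much attention. In this work, we propose generating multiple networks from a single teacher network, without additional resources, by exploiting stochastic blocks and skip connections. The generated networks can thus play the role of multiple teacher networks and provide sufficient knowledge to the student network without additional resources. We confirm that the proposed approach improves the performance of student networks on the cifar-100 and tiny-imagenet datasets.
Keywords: convolutional neural networks, knowledge transfer, image classification, multiple teacher networks.
Student number: 2018-26533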