Attribution-NonCommercial-NoDerivs 2.0 Korea. Users may freely copy, distribute, transmit, display, perform, and broadcast this work under the following conditions: Attribution — you must credit the original author; NonCommercial — you may not use this work for commercial purposes; NoDerivs — you may not alter, transform, or build upon this work. These conditions do not apply if separate permission is obtained from the copyright holder. Users' rights under copyright law are not affected by the above. This is a human-readable summary of the Legal Code.


M.S. THESIS

Knowledge Transfer via Stochastic Drop and Skip Connection

확률적 누락과 스킵 연결을 통한 지식 전달 기법

BY

LEE KWANG-JIN

FEBRUARY 2020

DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

COLLEGE OF ENGINEERING, SEOUL NATIONAL UNIVERSITY


Knowledge Transfer via Stochastic Drop and Skip Connection

확률적 누락과 스킵 연결을 통한 지식 전달 기법

Advisor: Professor Byonghyo Shim (심병효)

Submitted as a thesis for the degree of Master of Science in Engineering

February 2020

Department of Electrical Engineering and Computer Science
Graduate School, Seoul National University

Lee Kwang-Jin

The Master of Science thesis of Lee Kwang-Jin is hereby approved.

February 2020

Chair: ____________   Vice Chair: ____________   Member: ____________


Abstract

Deep neural networks (DNNs) have achieved state-of-the-art performance in various fields. However, DNNs are computationally and memory intensive, so they often need to be scaled down to fit real-world applications. As a means to compress a network while maintaining its performance, knowledge distillation has attracted considerable attention. This technique trains a student network using the outputs provided by a teacher network. Among the various distillation methods that have been proposed, one deploys multiple teacher networks; however, this wastes resources to some extent and has therefore received little attention. In the proposed approach, we generate multiple sub-networks from a single teacher network by exploiting stochastic blocks and skip connections. These sub-networks can play the role of multiple teacher networks and provide sufficient knowledge to the student network without additional resources. We observe improved performance of student networks with the proposed approach on the CIFAR-100 and Tiny ImageNet datasets.

Keywords: convolutional neural networks, knowledge transfer, image classification, multiple teacher networks

Student number: 2018-26533


Contents

Abstract
Contents
List of Tables
List of Figures
1 Introduction
2 Related Works
2.1 Knowledge Transfer
2.2 Multiple Teacher Networks
2.3 Regularizing Output
3 Proposed Framework
3.1 Generating Multiple Networks
3.2 Availability of the Proposed Framework
3.3 Application to Other Distillation Techniques
4 Experiment
4.1 Dataset and Simulation Setting
4.2 CIFAR-100
4.3 Tiny ImageNet


5 Ablation Study
6 Conclusion
Abstract (In Korean)


List of Tables

4.1 Improvement of knowledge distillation (KD) with the proposed structure on CIFAR-100
4.2 Improvement of attention transfer (AT) with the proposed method on CIFAR-100
4.3 Improvement of mutual learning (ML) with the proposed structure on CIFAR-100
4.4 Improvement of knowledge distillation (KD) with the proposed structure on Tiny ImageNet
4.5 Improvement of attention transfer (AT) with the proposed method on Tiny ImageNet
5.1 Comparison of the partial and full stochastic drop settings on CIFAR-100


List of Figures

1.1 The diagram of knowledge distillation
1.2 (a), (b), (c) Unravelled views when one block is dropped from a network with the proposed approach
2.1 Learning diagram of (a) knowledge distillation, (b) attention transfer, and (c) mutual learning
3.1 Accuracy when each block is dropped from (a) residual network 32, (b) mobilenet, and (c) wide residual network 28-10
3.2 Entropy when each block is dropped from (a) residual network 32, (b) mobilenet, and (c) wide residual network 28-10


Chapter 1

Introduction

Deep neural networks (DNNs) have achieved state-of-the-art performance on complex tasks such as computer vision [1], language modeling [2], and machine translation [3], to name just a few. Despite this superior performance, it is not easy to use DNN-based models in embedded systems with restricted memory and computational resources. Over the years, many approaches have been suggested to build smaller but efficient DNNs. One of them, knowledge distillation (KD), has received much attention in recent years [4].

The basic idea of KD is to train a smaller network (a.k.a. the student network) with the help of the softened outputs (a.k.a. soft targets) of a larger network (a.k.a. the teacher network). The teacher network is usually pre-trained by supervised learning. While maintaining low computational complexity, the student network learns the information of the teacher network by mimicking the soft targets (see Fig. 1.1). Specifically, the student network is optimized on the dataset by minimizing a loss function that is the sum of the cross-entropy loss between the outputs of the student network and the soft targets and the cross-entropy loss between the outputs of the student network and the label data. Recently, it has been shown that the learning of student networks can be much improved with the help of multiple teacher networks [4, 5, 6]. Multiple teacher networks have different knowledge and views from each other, so they generate different outputs on the same input.


Figure 1.1: The diagram of knowledge distillation

Then, the student network can fuse the distinct knowledge from the multiple teacher networks to establish its own comprehensive and in-depth understanding [6]. However, most of the current literature does not pay much attention to the cost of doing so, and deploying multiple teacher networks to some extent causes a waste of resources.

The primary goal of this paper is to improve the performance of the student network with the comprehensive knowledge of multiple sub-networks. Here, sub-networks are generated by exploiting stochastic blocks and skip connections. Stochastic blocks are conceptually similar to units in the dropout setting [7]: dropout zeroes the outputs of individual units at random during training, while stochastic blocks, each consisting of one or more layers, are skipped at random during training. Skip connections [1] are identity mappings that bypass one or more layers. Fig. 1.2 shows examples of sub-networks generated by dropping each block from the original network. In Fig. 1.2, the original network consists of 3 blocks represented by f_i (i ∈ {1, 2, 3}), and Id denotes the identity mapping generated by a skip connection. Dropped blocks are marked by crosses, and their outputs are zero. Skip connections allow the network to be viewed as an aggregate of multiple valid paths, whereas an ordinary feed-forward network has only one valid path. Therefore, if any block drops, the feed-forward network becomes useless, while a network with skip connections remains reliable. Indeed, in Fig. 1.2, when one block drops, each sub-network still has 4 valid paths out of the 8 total paths.


Figure 1.2: (a), (b), (c) Unravelled views when one block is dropped from a network with the proposed approach.


We set the teacher network to have stochastic blocks and skip connections. As each block drops randomly, a sub-network of the remaining paths is generated for each batch. In this way, we can generate multiple sub-networks from one teacher network without additional resources, and the student network can be trained with multiple sub-networks over the entire training process. In this paper, we make the following contributions:

• We propose a framework that generates multiple sub-networks from a single teacher network without additional resources.

• We show that the generated sub-networks provide easy-to-learn and regularizing knowledge to the student network.

• As a result, we observe that the performance of the student network improves further compared to the same architecture trained by conventional knowledge transfer methods.


Chapter 2

Related Works

2.1 Knowledge Transfer

Recently, as a means to make smaller but efficient DNNs, knowledge transfer has received much attention. Some intuitions and a generalized approach (KD) are presented in [4]. In essence, the softened outputs of a teacher network are transferred to a student network as additional supervision; they provide the student network with the knowledge of the teacher network and prevent the student network from overfitting. Later, an approach was proposed that uses distillation to transfer knowledge from powerful and easy-to-train networks to small but hard-to-train networks [9]. In this approach, intermediate outputs of the teacher network are provided as hints for the student network, which is trained to mimic not only the final outputs but also the intermediate outputs of the teacher. An attention-based distillation method has also been proposed [8], in which attention maps built from intermediate feature maps are used like the hints in [9]. To transfer knowledge while avoiding direct mimicry, another approach deploys the layers of the teacher network to generate flows using the Gram matrix of feature maps [10]; the student network is then optimized to learn the flows of the teacher network during training.


More recently, mutual learning [5] has been suggested as a new paradigm of bidirectional knowledge transfer. Unlike the conventional teacher-student paradigm, where knowledge is transferred one way from a fixed teacher network to the student network, all networks in mutual learning exchange knowledge in a mutually beneficial way. Diagrams of the conventional transfer methods used in our simulations are shown in Fig. 2.1.

Knowledge transfer techniques have been deployed in several applications. In [11], a mechanism called defensive distillation is proposed; it reduces the effectiveness of adversarial samples on DNNs and thus enhances their robustness against adversarial attacks. In [12], student networks are trained simultaneously and transfer knowledge between each other, as in [5]; differing from [5], they are trained with a new mutual learning loss based on metric learning.


Figure 2.1: Learning diagram of (a) knowledge distillation, (b) attention transfer, and (c) mutual learning.


2.2 Multiple Teacher Networks

In order to provide more extensive knowledge to a student network, knowledge transfer using multiple teacher networks has been proposed [6]. Each teacher network generates distinct and useful outputs for given inputs. Thus, multiple teacher networks together can provide the student network with more generalized and extensive knowledge, further enhancing it. This is effective in various applications, such as image classification, person re-identification [5], and speech recognition [13]. However, deploying multiple teacher networks to transfer knowledge to a single student network is, to some extent, a waste of resources. To overcome this issue, one straightforward way is to perturb the outputs of a single teacher network with random noise to obtain the effect of multiple teacher networks [14]. This is problematic since the corrupted knowledge can be transferred to the student network; moreover, this method acts more as a regularizer, like [15], than as a way of getting knowledge from multiple teacher networks. In contrast, our proposed approach generates multiple networks of valid paths (see Figure 1.2), so reliable and varied knowledge is ensured.

2.3 Regularizing Output

In reinforcement learning, encouraging the policy to have an output distribution with high entropy has been widely used [16, 17]. It prevents the policy from converging to a narrow action space and encourages exploration, which makes it more probable that a near-optimal policy is found. Encouraging high-entropy outputs [18] and label smoothing [19] have also been shown to help the training of DNNs: regularizing highly confident outputs prevents the network from overfitting and increases its adaptivity. In our approach, the generated sub-networks produce higher-entropy outputs compared to the original network. Thus, a student network is trained to mimic the less confident outputs of the sub-networks. This is analogous to penalizing highly confident outputs of DNNs and helps the student network generalize to the dataset.
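For reference, the quantity compared later in Figure 3.2 is the entropy of the softmax outputs; a small PyTorch-style sketch of this measurement follows (the function name is illustrative).

```python
import torch
import torch.nn.functional as F

def mean_output_entropy(logits):
    # Entropy of the softmax output distribution, averaged over the batch.
    probs = F.softmax(logits, dim=1)
    log_probs = F.log_softmax(logits, dim=1)
    return -(probs * log_probs).sum(dim=1).mean()
```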


Chapter 3

Proposed Framework

We propose a framework to generate multiple sub-networks of valid paths from a single teacher network. In essence, the key point of this approach is to equip a single teacher network with stochastic blocks and skip connections. We explain how to generate sub-networks of valid paths and demonstrate that these sub-networks further enhance the performance of student networks.

3.1 Generating Multiple Networks

We exploit the characteristics of skip connections and stochastic blocks to generate multiple sub-networks from a single teacher network. First, we add a skip connection from the input of each block of the teacher network to the corresponding output. In residual networks, where skip connections were introduced, the skip connections let a residual network be viewed as an ensemble of multiple paths of different lengths [20]. Let f_i denote the i-th block of the residual network. Then, the output o_{i+1} of the (i+1)-th block can be expressed as

o_{i+1} = f_{i+1}(o_i) + o_i.    (3.1)

Since two paths exist from the input to the output of each block, for a teacher network of n blocks there exist 2^n valid paths in total (see Figure 1.2).
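For example, unravelling a three-block network as in Fig. 1.2 by repeatedly substituting (3.1) makes the path structure explicit (here o_0 denotes the output of the initial layer):

o_3 = o_2 + f_3(o_2)
    = o_1 + f_2(o_1) + f_3(o_1 + f_2(o_1))
    = o_0 + f_1(o_0) + f_2(o_0 + f_1(o_0)) + f_3(o_0 + f_1(o_0) + f_2(o_0 + f_1(o_0))).

Each of the 2^3 = 8 input-output paths corresponds to choosing, at every block, either the residual branch f_i or the identity branch.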


Then, we set the blocks of the teacher network to be stochastic with a survival probability, i.e., the keep probability of each block, as in dropout. To be specific, let p_i be the survival probability of the i-th block (1 ≤ i ≤ n). Since the initial blocks extract low-level features that are used by later blocks, they should be present more reliably [21]. Thus, we use a linear decay mode in which p_i satisfies

p_i = 1 − (1 − p_n) · (i − 1)/(n − 1).    (3.2)

Stochastic blocks drop randomly in the training phase. However, dropping some blocks from a network with the proposed structure does not harm the performance much, because the skip connections create exponentially many paths, so valid paths still exist even when some blocks drop (if k of the n blocks are dropped, 2^(n−k) valid paths remain). Therefore, when a neural network consists of stochastic blocks and skip connections, multiple sub-networks, each a part of the original network, are generated, and their reliable performance is ensured.

Note that p_n implies a trade-off between the reliability and the variety of the sub-networks. If p_n is high, each generated sub-network is longer, but fewer distinct sub-networks are generated. In contrast, if p_n is low, more and shorter sub-networks are generated, but their performance is a bit lower. In our proposed framework, we test 0.5 ≤ p_n ≤ 0.9 at intervals of 0.1 and choose the best p_n for each pair of teacher and student networks.
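As an illustration, the following PyTorch-style sketch wraps a generic residual block with a survival probability and builds the linear-decay schedule of (3.2). The class and function names, and the test-time scaling by p_i (borrowed from stochastic depth [21]), are assumptions rather than the exact implementation used here.

```python
import torch
import torch.nn as nn

class StochasticResidualBlock(nn.Module):
    """Residual block f_i that is skipped at random with probability 1 - p_i."""

    def __init__(self, block, survival_prob):
        super().__init__()
        self.block = block            # f_i: one or more layers
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            # Drop the whole block for this batch with probability 1 - p_i;
            # the skip connection (identity) keeps valid paths alive.
            if torch.rand(1).item() < self.survival_prob:
                return x + self.block(x)
            return x
        # At evaluation time, keep the block and scale by p_i as in stochastic depth
        # (an assumption; the thesis does not specify the evaluation-time behavior).
        return x + self.survival_prob * self.block(x)

def linear_decay_survival_probs(n, p_n):
    # Linear decay of (3.2): p_i = 1 - (1 - p_n) * (i - 1) / (n - 1), i = 1..n (n >= 2).
    return [1.0 - (1.0 - p_n) * (i - 1) / (n - 1) for i in range(1, n + 1)]
```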

3.2 Availability of the Proposed Framework

In this work, we propose to use multiple sub-networks generated from a single teacher network to help the learning of a student network. To serve as teacher networks, the sub-networks should have reliable performance and provide the student network with meaningful knowledge. Here we show that the performance of the sub-networks is comparable to that of the original teacher network and that they provide the student network with easy-to-learn and regularizing knowledge.

First, we observe the accuracy of the sub-networks obtained when each block is dropped from a pre-trained network, including ResNet-32, MobileNet, and Wide ResNet 28-10 (see Figure 3.1), on the CIFAR-100 test set. In Fig. 3.1, sto denotes that the network is pre-trained as in [21], basic denotes that the network is pre-trained normally without dropping any blocks, and baseline denotes the accuracy when no block drops. Mostly, the performance of the sub-networks from the sto networks is maintained. However, dropping the initial blocks of MobileNet or the 4th block of Wide ResNet 28-10 degrades the performance significantly. To observe the impact of such blocks, which are fatal to drop, we compare in the ablation study the cases where these blocks drop or do not drop like the other blocks.

The sub-networks also provide the student network with higher-entropy output distributions, which helps the learning process of student networks (see Figure 3.2). It is known that regularizing a neural network to be less confident improves performance [5, 18]: training a network to produce highly confident outputs lets it overfit to the training dataset and reduces its adaptivity by bounding the gradient. In addition, a high-entropy distribution contains important information, such as the relations between classes, which are salient cues to how the teacher network generalizes. In [5], it was shown that using an ensemble of n networks as a single teacher is less helpful than using the n networks individually as n teachers, because the ensemble makes the secondary values of the outputs small and thus yields a more peaked, lower-entropy distribution. Therefore, training student networks to mimic higher-entropy distributions helps them generalize to unseen data. In our proposed approach, generating sub-networks from the teacher network is analogous to using the individual networks instead of their ensemble in [5]. Therefore, the knowledge that the sub-networks provide contains important information and is easily learned by student networks.


3.3 Application to Other Distillation Techniques

We apply the proposed framework to other distillation techniques, including KD, AT, and ML. For KD and AT, we slightly change the model by adding skip connections and stochastic blocks to the teacher network. In ML, the notions of teacher and student vanish since both networks give and take knowledge from each other; for convenience, however, we denote the network with the larger capacity as the teacher network and the other as the student network. Here, we add skip connections and stochastic blocks only to the teacher network, as in KD and AT. In ML, both networks should be pre-trained since the teacher network is not fixed. If the networks are not pre-trained, they cannot be improved because of the random property of the teacher network. Consider the situation when neither network is pre-trained: at the beginning of the training process the teacher network is only initialized, so the sub-networks of the teacher network do not have any meaningful knowledge of the given dataset. As a result, the student network receives meaningless knowledge from the sub-networks and does not improve. In addition, the sub-networks are not optimized due to the disturbing knowledge from the student network.
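To make the per-batch sampling concrete, here is a minimal PyTorch-style sketch of one student update, assuming a teacher built from the stochastic blocks of the previous sketch. The temperature and the omission of batch-normalization bookkeeping are simplifying assumptions; keeping the teacher in training mode is what lets a fresh sub-network be sampled for each batch.

```python
import torch
import torch.nn.functional as F

def train_student_step(teacher, student, images, labels, optimizer, T=4.0):
    # Keep the teacher's stochastic blocks active so that each batch sees
    # a freshly sampled sub-network as its teacher.
    teacher.train()
    with torch.no_grad():
        soft_targets = F.softmax(teacher(images) / T, dim=1)
    student_logits = student(images)
    soft_ce = -(soft_targets * F.log_softmax(student_logits / T, dim=1)).sum(dim=1).mean()
    hard_ce = F.cross_entropy(student_logits, labels)
    loss = soft_ce + hard_ce        # sum of the two cross-entropy terms, as in Chapter 1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```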


Figure 3.1: Accuracy when each block is dropped from (a) residual network 32, (b) mobilenet, and (c) wide residual network 28-10.


Figure 3.2: Entropy when each block is dropped from (a) residual network 32, (b) mobilenet, and (c) wide residual network 28-10.


Chapter 4

Experiment

4.1 Dataset and Simulation Setting

We evaluate the proposed method on two datasets: CIFAR-100 [22] and Tiny ImageNet [23]. The CIFAR-100 dataset consists of 32 × 32 RGB color images drawn from 100 classes, split into 50,000 training and 10,000 test images. The Tiny ImageNet dataset is a down-sampled version of the ImageNet dataset; it consists of 64 × 64 RGB color images drawn from 200 classes, split into 100,000 training and 10,000 test images.

For CIFAR-100, we normalize each image and augment the training images. The data augmentation includes horizontal flips and random crops from the image padded by 4 pixels on each side, filling missing pixels with reflections of the original image. Each network is trained for 200 epochs with a batch size of 128, and the initial learning rate is decreased every 60 epochs. For Tiny ImageNet, we use the pure dataset without augmentation. Each network is trained for 100 epochs with a batch size of 128, and the learning rate is decreased every 40 epochs. We use the stochastic gradient descent optimizer with a momentum of 0.9. The initial learning rate is 0.01 for ML and 0.1 for the others.
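For concreteness, a torchvision-style sketch of the CIFAR-100 pipeline described above follows. The normalization statistics, the learning-rate decay factor, the placeholder model, and the variable names are assumptions not specified in the text.

```python
import torch
import torchvision.transforms as T
from torchvision.datasets import CIFAR100

# Assumed CIFAR-100 channel statistics (not stated in the text).
MEAN, STD = (0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4, padding_mode="reflect"),  # pad 4 px, fill by reflection
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])
train_set = CIFAR100(root="./data", train=True, download=True, transform=train_transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# SGD with momentum 0.9, initial learning rate 0.1 (0.01 for ML),
# decreased every 60 epochs; the decay factor 0.1 is an assumption.
model = torch.nn.Linear(3 * 32 * 32, 100)  # placeholder model so the sketch runs as-is
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
```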


Table 4.1: Improvement of knowledge distillation (KD) with the proposed structure on CIFAR-100 (accuracy, %)

Net 1 (teacher)  Net 2 (student)  Net 1 indep.  Net 2 indep.  KD     Ours
Res 32           VGG 13           69.86         67.74         71.5   72.2
Res 110          Res 20           71.69         68.32         68.72  70.99
WRN 28-10        Res 32           78.98         69.86         69.85  74.87
MobileNet        Res 32           74.08         69.86         69.88  71.77
Res 110          Res 32           71.69         69.86         70.12  73.36
MobileNet        VGG 13           74.08         67.74         68.83  71.12


Table 4.2: Improvement of attention transfer (AT) with the proposed method on CIFAR-100 (accuracy, %)

Net 1 (teacher)  Net 2 (student)  Net 1 indep.  Net 2 indep.  AT     Ours   Best p_n
ResNet 110       ResNet 20        71.69         68.32         68.34  68.67  0.6
WRN 28-10        ResNet 32        78.98         69.86         69.87  70.64  0.7
ResNet 110       ResNet 32        71.69         69.86         70.22  71.23  0.7
WRN 40-4         ResNet 32        75.67         69.86         70.03  70.59  0.7
WRN 28-10        WRN 40-4         78.98         75.67         75.36  76.09  0.7


Table 4.3: Improvement of mutual learning (ML) with the proposed structure on CIFAR-100 (accuracy, %; each paired entry reports Net 1 / Net 2)

Net 1      Net 2      Indep. (Net 1 / Net 2)  ML (Net 1 / Net 2)  Ours (Net 1 / Net 2)
ResNet 32  ResNet 32  69.86 / 69.86           71.14 / 71.21       73.68 / 73.58
MobileNet  ResNet 32  74.08 / 69.86           75.62 / 71.1        76.2 / 72.76
WRN 28-10  ResNet 32  78.98 / 69.86           78.53 / 72.18       80.65 / 73.08
MobileNet  MobileNet  74.08 / 74.08           75 / 75.16          75.5 / 76.1
WRN 28-10  MobileNet  78.98 / 74.08           78.34 / 76.41       81.03 / 76.82
WRN 28-10  WRN 28-10  78.98 / 78.98           78.83 / 78.95       81 / 80.66


4.2 CIFAR-100

Here, we present the simulation results of the knowledge transfer methods on CIFAR-100. Table 4.1 shows the results of KD and of KD with the proposed approach; we confirm that the proposed approach further improves the performance of the student networks. In particular, for the (WRN 28-10, ResNet 32) pair, the accuracy of ResNet 32 trained with the proposed structure improves by more than 5% compared to when ResNet 32 is trained with the pure WRN 28-10. Table 4.2 shows the results of AT and of AT with the proposed approach. For AT, we use only ResNets and Wide ResNets so that the spatial sizes of the feature maps match conveniently. Attention maps are computed as the square sum over the channel axis followed by l2 normalization. We confirm that the proposed approach shows further improvement over the pure AT method. Table 4.3 shows the results of ML and of ML with the proposed approach; the networks in the Net 1 column of Table 4.3 are changed to the proposed structure. The proposed approach shows further improvement in the peer learning paradigm, and both networks improve further.
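As a reference, a minimal sketch of the attention-map computation described above (square sum over the channel axis followed by l2 normalization); flattening the spatial dimensions before normalizing is an assumption about the exact layout.

```python
import torch
import torch.nn.functional as F

def attention_map(feature):
    # feature: (N, C, H, W) intermediate feature map.
    # Square sum over the channel axis, then l2-normalize over spatial positions.
    amap = feature.pow(2).sum(dim=1).flatten(start_dim=1)   # (N, H*W)
    return F.normalize(amap, p=2, dim=1)
```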


Table 4.4: Improvement of knowledge distillation (KD) with the proposed structure on Tiny ImageNet (accuracy, %)

Net 1 (teacher)  Net 2 (student)  Net 1 indep.  Net 2 indep.  KD     Ours   Best p_n
ResNet 32        VGG 13           49.01         44.61         55.76  57.56  0.9
ResNet 32        ResNet 20        49.01         46.85         49.57  50.6   0.9
MobileNet        ResNet 20        55.38         46.85         51.8   52.15  0.7
MobileNet        ResNet 32        55.38         49.01         54.48  54.85  0.8
MobileNet        ResNet 110       55.38         52.32         58.15  58.2   0.9
WRN 28-10        ResNet 32        58.91         49.01         55.7   55.34  0.6


Table 4.5: Improvement of attention transfer (AT) with the proposed method on Tiny ImageNet (accuracy, %)

Net 1 (teacher)  Net 2 (student)  Net 1 indep.  Net 2 indep.  AT     Ours
Res 110          Res 20           52.32         46.85         51.49  51.9
WRN 28-10        Res 32           58.91         49.01         53.56  54.15
Res 110          Res 32           52.32         49.01         54.52  54.91
WRN 40-4         Res 32           55.19         49.01         54.33  54
WRN 28-10        WRN 40-4         58.91         55.19         60.98  61.36


4.3 Tiny ImageNet

Here, we present the simulation results of the knowledge transfer methods on Tiny ImageNet. Tables 4.4 and 4.5 show the results of KD and AT with and without the proposed approach. The proposed approach still generally improves the student networks, but in each table there is one pair for which the student network lags behind the conventional transfer method.


Chapter 5

Ablation Study

As shown in Figure 3.1, dropping certain blocks degrades the performance of the network significantly: the 4th block of WRN 28-10 and the 1st to 6th blocks of MobileNet. We call these significant blocks. Sub-networks generated by dropping significant blocks have low performance, so they might not be adequate as teacher networks. Hence, we examine whether a student network is degraded when sub-networks obtained by dropping significant blocks are used as teacher networks. We perform simulations with KD on the CIFAR-100 dataset for the (WRN 28-10, ResNet 32) and (MobileNet, ResNet 32) pairs.

In Table 5.1, partial means that none of the significant blocks drops, and full means that all the blocks drop stochastically during the training phase. The results show that using the whole set of sub-networks is more helpful for improving a student network, even if some of them do not perform well. This is in line with the result of ML [5], where a larger network still benefits from being trained together with a smaller network.


Table 5.1: Comparison of the partial and full stochastic drop settings on CIFAR-100 (accuracy, %)

Net 1      Net 2      Net 1 indep.  Net 2 indep.  Partial  Full   Best p_n (partial / full)
WRN 28-10  ResNet 32  78.98         69.86         74.81    74.87  0.5 / 0.5
MobileNet  ResNet 32  74.08         69.86         70.3     71.77  0.9 / 0.7


Chapter 6

Conclusion

In this work, we propose to change the structure of a teacher network to obtain the effect of multiple teacher networks. In our proposed approach, we obtain multiple teacher networks without additional resources, so that compact networks improve further with the help of more extensive knowledge. The proposed structure can easily be applied to other transfer methods and tasks, e.g., object detection and segmentation.


Bibliography

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[2] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

[3] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[5] Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.

[6] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1285–1294. ACM, 2017.


[7] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[8] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.

[9] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.

[10] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133–4141, 2017.

[11] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.

[12] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, and Jian Sun. AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.

[13] Yevgen Chebotar and Austin Waters. Distilling knowledge from ensembles of neural networks for speech recognition. In Interspeech, pages 3439–3443, 2016.

[14] Bharat Bhusan Sau and Vineeth N. Balasubramanian. Deep model compression: Distilling knowledge from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.


[15] Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. DisturbLabel: Regularizing CNN on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.

[16] Ronald J. Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

[17] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[18] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

[19] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[20] Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.

[21] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.

[22] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.


[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.


초록

깊은 신경망은 여러 분야에서 뛰어난 성능을 보여 왔습니다. 그러나 깊은 신경망은 계산 및 메모리 집약적이므로 실제로 사용되기 위해서는 그 규모를 줄일 필요가 있습니다. 신경망의 크기를 줄이면서도 그 성능을 유지하는 방안으로 지식 증류 기법이 많은 관심을 받았습니다. 이 기법의 기본 아이디어는 학생 신경망을 선생님 신경망의 도움을 받아 학습시키는 것입니다. 다양한 지식 증류 기법들이 제시되었고 그중 하나가 다중 선생 신경망을 사용하는 것입니다. 그러나 이것은 어느 정도 자원의 낭비를 유발하므로 많은 관심을 받지 못했습니다. 이 연구에서 우리는 확률적 블록과 스킵 연결을 활용하여 추가적인 자원 없이 한 개의 선생님 신경망으로부터 여러 개의 신경망을 생성하는 것을 제안합니다. 따라서 생성된 신경망들은 다중 선생 신경망의 역할을 할 수 있고 추가적인 자원 없이 학생 신경망에 충분한 지식을 제공할 수 있습니다. 우리는 제안하는 접근법으로 학생 신경망이 CIFAR-100과 Tiny ImageNet 데이터셋에 대하여 성능 향상이 있음을 확인하였습니다.

주요어: 컨볼루션 신경망, 지식 전달, 이미지 분류, 다중 교사 네트워크

학번: 2018-26533
