IT 17 080
Degree project (Examensarbete) 30 hp
April 2018

Classifying Handwritten Chinese Characters using Convolutional Neural Networks

Georgios Ziogas

Institutionen för informationsteknologi
Department of Information Technology


Abstract

Classifying Handwritten Chinese Characters using Convolutional Neural Networks

Georgios Ziogas

Image recognition applications have been gaining popularity as computer hardware becomes more powerful and cheaper. This increase in computational resources has brought researchers even closer to their goal of creating algorithms that achieve high accuracy in image recognition tasks. These algorithms are applied in many different fields, such as medical image analysis and object recognition in real-time applications.

Previous studies have shown that among the many image recognition algorithms, artificial neural networks, and specifically deep neural networks, perform outstandingly well, due to their ability to recognize patterns, shapes and specific characteristics of an image extremely accurately.

In this thesis project we investigate a specialized type of Deep Neural Network, called the Convolutional Neural Network or CNN, which is designed specifically for image recognition tasks. Furthermore, we analyze its hyper-parameters and explore different architectures, in order to understand how these affect the accuracy and speed of recognition. Finally, we present the results of the different tests in terms of accuracy and validate them according to specific statistical metrics. For the purpose of our research, a data-set of handwritten Chinese characters was used.

Printed by: Reprocentralen ITC
IT 17 080
Examiner: Mats Daniels
Subject reviewer: Michael Ashcroft
Supervisor: Yan Shao


Acknowledgements

I would like to thank my reviewer Michael Ashcroft for his decisive role in introducing me to this amazing and interesting field of computer science and for formulating an appropriate research topic together with me. I would also like to thank my supervisor Yan Shao, who supported and guided me during the whole period of my thesis work, helping me achieve the best results and gain a deep understanding of the topic.

Furthermore, I would like to thank all of my friends in Thessaloniki and Uppsala for their support. Last but not least, I would like to thank my family for believing in me, for supporting me during all my studies and for encouraging me to fulfill all my goals and dreams.


Contents

Acknowledgements

1 Introduction
  1.1 Optical Character Recognition
  1.2 Objectives
  1.3 Thesis Structure
  1.4 Related Work

2 Background
  2.1 Convolutional Neural Networks
    2.1.1 CNN blocks
      Convolutional Layer
      Activation Layer
      Pooling Layer
      Fully Connected Layer

3 Data Pre-processing
  3.1 Images Parsing and Manipulation
  3.2 Data transformation
    3.2.1 Rotation
    3.2.2 Contrast Increase
  3.3 Feature Extraction

4 Experiments
  4.1 Software
    4.1.1 Theano
    4.1.2 Keras
    4.1.3 CUDA
  4.2 Architecture
    4.2.1 Layers Depth
    4.2.2 Activation Layers
    4.2.3 Filter Size
    4.2.4 Number of Neurons
    4.2.5 Max Pooling Layer
    4.2.6 Optimizer
      SGD: Stochastic Gradient Descent
      RMSProp
      Adadelta
      Batch Size
    4.2.7 Learning Rate
    4.2.8 Epochs
    4.2.9 Regularization
      Dropout
  4.3 Gabor Filter Parameters

5 Evaluation
  5.1 Architecture Performance
  5.2 Optimizers Performance
  5.3 Data Transformation Method Performance
  5.4 Directional Feature Extraction Method Performance
  5.5 Statistical Significance Test
  5.6 Visualization

6 Conclusion
  6.1 What has been done
  6.2 Results
  6.3 Future Work

Bibliography


List of Figures

2.1 Example of how the Max-Pooling layer calculates the new reduced activation map. [Source: cs231n Convolutional Neural Networks for Visual Recognition]
2.2 Typical fully connected neural network. All neurons of a precedent layer are connected to all neurons of the following layer. [Source: cs231n Convolutional Neural Networks for Visual Recognition]
2.3 Fully connected layers in a CNN. Fully connected layers are always placed after the convolutional layers, which is usually the end of the network.
3.1 Above are the feature maps of the Chinese character for each orientation. Below are the Gabor filters.
3.2 First 10 Chinese characters in the data-set. The first column corresponds to the original sample, the next 8 columns correspond to the 8-directional feature maps.
4.1 Connections among neurons. a) Before applying the dropout layer, b) After applying the dropout layer.
4.2 Example of comparing different wavelengths λ, used for feature extraction with Gabor filters. Every 4 images in a row correspond to a specific angle. The bottom half of this grid-figure visualizes the specific filters that were used for each case.
4.3 8-direction strokes feature maps, extracted by a Gabor filter with λ=3.5 and σ=0.6.
4.4 8-direction strokes feature maps, extracted by a Gabor filter with λ=3.5 and σ=0.7.
4.5 8-direction strokes feature maps, extracted by a Gabor filter with λ=3.5 and σ=0.8.
4.6 8-direction strokes feature maps, extracted by a Gabor filter with λ=3.5 and σ=0.9.
5.1 Visualizing the 100 activation maps corresponding to the 1st convolutional layer of the M100 model.
5.2 Visualizing the 200 activation maps corresponding to the 2nd convolutional layer of the M100 model.
5.3 Visualizing the 300 activation maps corresponding to the 3rd convolutional layer of the M100 model.
5.4 Visualizing the 400 activation maps corresponding to the 4th convolutional layer of the M100 model.
5.5 Visualization of the 2nd convolutional layer of the M100 model. Of the 200 filters of the 2nd layer, we show the top 32 with the highest activations. In lower-level convolutional layers we can see mostly line patterns that are learned by the filters.
5.6 Visualization of the 4th convolutional layer of the M100 model. Of the 400 filters of the 4th layer, we show the top 32 with the highest activations. Most of the images show easily recognized patterns that can be found in many Chinese characters.


List of Tables

3.1 Format of the offline isolated character data file (*.gnt)
5.1 Comparison among the 4 CNN architectures that were chosen for our experimentation.
5.2 Comparison among the 4 CNN architectures that were chosen for our experimentation, using the original 120-class data-set.
5.3 Accuracy results from the 6 different configurations, which include the 3 chosen optimizers (RMSprop, SGD, Adadelta) and their learning rates. For these experiments, the original 120-class data-set was used.
5.4 Accuracy comparison among the architectures after the addition of transformed samples. The last column stands for the accuracy gain/loss compared to the original 120-class dataset.
5.5 Accuracy comparison among the architectures after the addition of directional features. The last column stands for the accuracy gain/loss compared to the original 120-class dataset.
5.6 Given the confusion matrices of the experiments, McNemar's Chi-squared test gives the probability values that show whether the Null Hypothesis is rejected or not.


List of Abbreviations

OCR Optical Character Recognition

HCCR Handwritten Chinese Character Recognition

MLP Multi-Layer Perceptron

CNN Convolutional Neural Network


Chapter 1

Introduction

Since the invention of typography, people have been gathering enormous amounts of information by writing books and documents in fields such as history, medicine and technology. In the succeeding years, the high demand for research resources, needed to gain knowledge and innovate, led people to invent methods for faster access to this information. During the last century, some mechanical devices were invented for reading text and reproducing it in another format, such as sound, but the big evolution in text recognition took place in the last decades, when the development of computers played a major role. The increase in computational resources offered a great tool for the complex problem of building an accurate text recognition mechanism.

Modern applications of text recognition can nowadays be found in different areas, such as the banking sector and government services, where they are used for information extraction. Furthermore, text recognition is widely used in the digitization of many historical and important books, which can then be accessed via electronic libraries; this provides better readability when a book is old and fragile, as well as faster information searchability.

1.1 Optical Character Recognition

In the field of computer science, Optical Character Recognition (OCR) is the method of taking a large amount of text data and converting it into digitized letters that can be used by text-processing software. In general, OCR is divided into two categories: on-line and off-line OCR. The on-line method works on a data-stream of letters taken from input equipment such as a digitizing tablet. The off-line method is the more common one and involves the recognition of text in images or scanned and printed documents. OCR is a significant part of research in Natural Language Processing, Artificial Intelligence and Computer Vision [1].

1.2 Objectives

The main purpose of this project is to create a classification algorithm for handwritten Chinese characters taken from images and scans. The algorithm is based on Neural Networks, a model family in Machine Learning. Chinese characters were chosen for two distinguishing characteristics: first, the complexity of the shape of Chinese characters and the extreme variability among different handwritings; secondly, the significantly bigger size of the Chinese alphabet compared to Latin alphabets such as English, which consists of only 26 letters. These characteristics make the classification of handwritten Chinese characters a really challenging task.

This master thesis project tries to answer the following research questions:

• How well can a Neural Network with a deep architecture do in handwritten Chinese character recognition?

• How do different architectures of Convolutional Neural Networks affect the classification accuracy as well as the training time?

• What other hyper-parameters can be tuned in order to achieve even higher classification accuracy?

1.3 Thesis Structure

The report follows a straightforward chapter structure. Chapter 2 gives the reader the background on the theory and technologies used in this project. Chapter 3 describes and explains the data-set and the related actions; more specifically, it gives a brief analysis of the data-set preparation, the type of data, and the pre-processing and image analysis methods that were used. Chapter 4 describes the experiments that were conducted, including the whole design process, the neural network architecture design and the hyper-parameter tuning. Chapter 5 presents the results, meaning the classification accuracy obtained from the different configurations, together with their evaluation. In Chapter 6 we briefly discuss the results, followed by the conclusion and future work.


1.4 Related Work

Nowadays, the state-of-the-art algorithm for image classification tasks is the Convolutional Neural Network, a type of Neural Network specialized for image analysis. There is already a large amount of published research on various image analysis tasks and, more specifically, on OCR.

Surveying online sources, we noticed that there is much less research on algorithms and methods for recognizing Chinese characters, compared to the widely developed Latin-character recognition, which already achieves almost human-level accuracy. Given the complexity of Chinese characters, and the fact that Chinese is not used world-wide the way English is, it makes sense that this specific field of recognition is less common. Below, we present three well-known published papers in the field of Chinese character recognition with important findings.

Starting from the oldest paper [2], the authors present their method for the Chinese character classification task, called the Multi-Column Deep Neural Network [3]. This method was also their entry in the ICDAR competition [4]. They begin with a short discussion of the preprocessing of the HWDB1.1 data-set [5] that is used, including rescaling the image size and center-alignment. In addition, executable C++ programs, together with the OpenCV library, were used for further preprocessing of the images, for rescaling and contrast maximization, which, as the authors mention, provides a significant increase in prediction accuracy. Continuing their previous research in image classification, they applied it to Chinese characters by designing 8 different CNN column architectures and combining them by averaging their predictions. Each column, which represents a single CNN, is fed with a differently preprocessed instance of the dataset. The results of the MCDNN were claimed to be the first to achieve human-competitive performance.

In [6], a method called Domain-Specific Knowledge is used together with deep CNNs. The authors present a way of using feature extraction, such as 8-way directional lines, in combination with various deformation transformations, non-linear transformations and the imaginary-strokes technique. These methods are used to train separate deep CNNs, which are then combined into a hybrid model using both a serial and a parallel structure. In this architecture, the CNNs are initially accessed one by one: if the first CNN predicts a class with a probability lower than a pre-defined threshold, the next CNN attempts to give the recognition decision. If none of them predicts with a probability higher than the threshold, the architecture switches to parallel mode, and the decision is based on the average output of all CNNs. The authors claim that their method exceeded the accuracy of the state-of-the-art single-CNN methods at that time.

In the final related work [7], the authors created 2 CNN architectures based on the AlexNet [8] and GoogLeNet [9] architectures, which already achieve great results on high-resolution image classification tasks due to their very deep architecture (GoogLeNet has a total of 19 layers). One of the most significant characteristics of deep and thin architectures like GoogLeNet, as the authors mention, is the reduction in parameters, which leads to a reduction in space requirements. Their contribution to these architectures is the addition of directional feature extraction as domain knowledge, which is provided as extra input to the CNN. The feature extraction methods used are Gabor filtering, gradient extraction and the HoG (Histogram of Gradients) feature. At the time of publication, this method appeared to outperform all previous state-of-the-art single and ensemble deep CNNs on HCCR tasks, in terms of both accuracy and storage performance, with an achieved test error of 3.26%.


Chapter 2

Background

2.1 Convolutional Neural Networks

Convolutional Neural Networks, or ConvNets or CNNs, is a widely used type of

Neural Networks, specified in computer vision tasks. The idea is inspired from the

function of brain’s visual cortex, namely how brain perceives visual fields.

Regular multi-layer networks that use only fully connected layers, such as MLPs, handle an enormous number of weight parameters. Having so many parameters, though, makes it difficult to generalize: in character recognition, and in computer vision tasks generally, it becomes difficult to recognize similar patterns in slightly different spatial areas of an image. This is why we need neurons that act as local receptive fields to detect local features. These neurons are modeled by volumes of convolution filters, usually small ones, which is the basic idea behind convolutional neural networks [10]. Another big advantage of CNNs is that they can easily scale to any input image size, whereas regular Neural Networks depend on the image size, resulting in a huge number of parameters and often causing over-fitting.

By utilizing local receptive fields, a feature learned by a filter can be re-used in another spatial area of the image. Thus, weights are shared locally among all neurons of the input image and the convolutional layer filters. Sharing weights significantly reduces the total number of parameters, which decreases the required memory and the complexity of the network. If the input has a depth of more than one layer, for instance an RGB image with a depth of three layers, then the volume of filters must have the same depth.


2.1.1 CNN blocks

A simple CNN block consists of three types of layers: the Convolutional Layer, the Pooling Layer and the Fully Connected Layer. Among them, there are also some other types of layers, mainly considered parts of the layers mentioned above, such as the Activation Layer and the Dropout Layer. Pooling, Activation and Dropout layers do not keep any weights, so they have no parameters. Furthermore, the Activation layer has no additional hyperparameters, unlike, for instance, the Pooling Layer, which has the pooling factor. Concerning the sequence of these layers, it is quite common to see successive blocks of Convolutional, Activation and Pooling layers before the Fully Connected layers in CNNs.

Convolutional Layer

As we mentioned above, the basic characteristic of a CNN is the use of receptive fields, implemented by small convolution filters which are convolved with the input images. Given an input image f and a filter g, convolution is the sliding dot product between the values of the matrices f and g. The discrete convolution operation describes how the output of a convolutional layer is calculated:

Discrete convolution of a 1D signal:

y[n] = f[n] * g[n] = \sum_{k=-\infty}^{\infty} f[k] \, g[n-k]

Discrete convolution of a 2D signal (image):

y[m, n] = f[m, n] * g[m, n] = \sum_{j=-\infty}^{\infty} \sum_{i=-\infty}^{\infty} f[i, j] \, g[m-i, n-j]

Each convolutional layer consists of many filters that are randomly initialized with different weights. These filters act as the neurons of the convolutional layer. The variability among the filters is useful for detecting different patterns and shapes in the images, which are stored in new images called feature maps. These feature maps are afterwards fed as input to the next convolutional layer.


In order to understand the operation of convolution, we can look at the following example, which performs a convolution between an input image and a Sobel filter. The input image is the 3x3 sample x, indexed x[i, j] with i the column and j the row (so x[0, 0] = 1, x[1, 0] = 2, ...):

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

and we want to convolve it with the following Sobel filter:

h = [[ 1,  2,  1],
     [ 0,  0,  0],
     [-1, -2, -1]]

The indexing of filters is different from that of images: h[0, 0] is located at the center sample of the filter, not at the first sample.


Let's assume that we want to calculate the output pixel y[1, 1]. According to the 2D convolution formula mentioned above, the following calculation shows how this specific pixel is obtained:

y[1, 1] = x[0, 0]·h[1, 1] + x[1, 0]·h[0, 1] + x[2, 0]·h[−1, 1]
        + x[0, 1]·h[1, 0] + x[1, 1]·h[0, 0] + x[2, 1]·h[−1, 0]
        + x[0, 2]·h[1, −1] + x[1, 2]·h[0, −1] + x[2, 2]·h[−1, −1]
        = 1·1 + 2·2 + 3·1 + 4·0 + 5·0 + 6·0 + 7·(−1) + 8·(−2) + 9·(−1)
        = −24                                                        (2.1)

When the indices of x fall out of range (e.g. x[−2, 0]), we use zero padding in the calculations:

y[0, 0] = x[−1, −1]·h[1, 1] + x[0, −1]·h[0, 1] + x[1, −1]·h[−1, 1]
        + x[−1, 0]·h[1, 0] + x[0, 0]·h[0, 0] + x[1, 0]·h[−1, 0]
        + x[−1, 1]·h[1, −1] + x[0, 1]·h[0, −1] + x[1, 1]·h[−1, −1]
        = 0·1 + 0·2 + 0·1 + 0·0 + 1·0 + 2·0 + 0·(−1) + 4·(−2) + 5·(−1)
        = −13                                                        (2.2)

After calculating all the output pixels in the same way, we obtain the complete filtered image.
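As a quick sanity check of the arithmetic above, the same convolution can be reproduced with SciPy. This is a minimal sketch; note that NumPy arrays are indexed [row, column], so the kernel below appears vertically flipped relative to the h[i, j] orientation used in the equations.

```python
import numpy as np
from scipy.signal import convolve2d

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
# Horizontal-edge Sobel kernel, written in [row, column] order.
h = np.array([[-1, -2, -1],
              [ 0,  0,  0],
              [ 1,  2,  1]])

# 'same' output size with zero padding, as in the worked example.
y = convolve2d(x, h, mode='same', boundary='fill', fillvalue=0)
print(y[0, 0], y[1, 1])  # -13 -24, matching equations (2.2) and (2.1)
```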

Activation Layer

The purpose of the activation layer is to change the mapping of the input space into a different space, by using an activation function. Below we describe two types of activation layers that are often used in CNNs.

ReLU, standing for Rectified Linear Unit, uses thresholding at zero. This function converges much faster than other activation functions, such as the hyperbolic tangent (tanh), when used with stochastic gradient descent [11][8]:

f(x) = max(0, x)

The Softmax activation layer is used as the last layer of a network. Its purpose is to produce the classification results, so it is also called the classification layer [12]:

P(y = j \mid x) = \frac{e^{x^T w_j}}{\sum_{k=1}^{K} e^{x^T w_k}}
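As a minimal NumPy sketch of these two activations (the max-subtraction in softmax is a standard numerical-stability trick, not something specified in this thesis):

```python
import numpy as np

def relu(x):
    # Element-wise thresholding at zero: f(x) = max(0, x)
    return np.maximum(0, x)

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(relu(np.array([-2.0, 3.0])))         # [0. 3.]
print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities summing to 1
```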

Pooling Layer

The pooling layer is often used for spatial reduction of parameters, resulting in a size reduction of the feature maps produced by a convolutional layer. Therefore, the input to the next convolutional layer is smaller. By reducing the number of parameters, both the computational time and the over-fitting effect are decreased. The values kept in the sub-sampled feature maps are usually the dominant ones, according to the size and type of the pooling layer.

More than one pooling function is used in pooling layers; average-pooling and max-pooling are the two best-known ones. Average-pooling calculates the mean of the four activation values of a 2x2 region, while max-pooling keeps the dominant activation value. In general, max-pooling is still used more often, as it outperforms the subsampling method of the average-pooling layer [13].

In addition to the pooling function, pooling layers have two parameters. One is the stride, which is the moving step size of the filter, and the other is the size of the filter. The most commonly used configuration is a 2x2 filter with a stride of 2 [14]. The stride, a hyper-parameter also used in convolutional layers, plays an important role, as it determines which of the neighboring pixels compete to be the most dominant. If a stride of 1 were used in combination with a 2x2 max-pooling layer, the same dominant pixels would be re-used multiple times in the pooling-layer output image because of the overlapping patches, meaning that other useful, less dominant pixels would be rejected. On the other hand, if the stride is bigger than the size of the max-pooling filter, many pixels would not be considered at all when deciding the most dominant pixel.

In figure 2.1 we can see an example of using a max-pooling layer. The new, reduced activation map is then fed to the next convolutional layer.

FIGURE 2.1: Example of how the Max-Pooling layer calculates the new reduced activation map. [Source: cs231n Convolutional Neural Networks for Visual Recognition]
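A 2x2 max-pooling step with stride 2, as in figure 2.1, can be sketched in a few lines of NumPy (a minimal sketch, assuming the input height and width are even):

```python
import numpy as np

def max_pool_2x2(a):
    # Split each spatial dimension into (blocks, 2) and take the block maxima.
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(a))  # [[6 8]
                        #  [3 4]]
```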

Fully Connected Layer

A fully connected layer is a layer that has all of its neurons connected to all activations of the previous layer, as shown in figure 2.2. Due to this full connectivity, fully connected layers hold a large share of the network's total weights, compared to convolutional layers, which hold only the weights of their receptive fields.

Fully connected layers are commonly used as the last part of a CNN, as we can see in figure 2.3. The size of such a layer can vary according to the complexity of the problem, except for the last fully connected layer, which must have the same size as the number of prediction classes. Instead of the ReLU activation used in the previous convolutional and fully connected layers, the activation function of the last fully connected layer, which is used for predicting the classification scores, is Softmax.


FIGURE 2.2: Typical fully connected neural network. All neurons of a precedent layer are connected to all neurons of the following layer. [Source: cs231n Convolutional Neural Networks for Visual Recognition]

FIGURE 2.3: Fully connected layers in a CNN. Fully connected layers are always placed after the convolutional layers, which is usually the end of the network.


Chapter 3

Data Pre-processing

The data-set used in our experiments was downloaded from the CASIA institute website [5]. More specifically, we used the HWDB1.1 data-set, which includes offline isolated handwritten Chinese characters from 300 writers, meaning that each character is written in 300 different ways. Furthermore, the data-set is divided into a training set and a test set. An important first step was to understand the structure of the data-set and what its values represent.

HWDB1.1 characteristics:

Total samples: 1,121,749

Classes: 3,755

Train/Test Set: 240/60 writers

The samples above are stored in the '.gnt' files of the downloaded package. With table 3.1, provided by the CASIA institute, as guidance, the next step was to create a Python script to parse all samples from the '.gnt' files and convert them into handy data structures.

3.1 Images Parsing and Manipulation

After careful consideration of table 3.1, we created a script that reads each entry and saves it into appropriate new data structures. The information we extracted was the label of each sample, the sizes of its two dimensions and its pixel values. The pixel values range over the [0-255] gray scale.
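A minimal sketch of such a parser, following the record layout of table 3.1 (the little-endian byte order and the function name are illustrative assumptions; the thesis does not reproduce its script):

```python
import struct
import numpy as np

def parse_gnt(path):
    """Yield (label, image) pairs from a CASIA .gnt file (layout per table 3.1)."""
    with open(path, 'rb') as f:
        while True:
            header = f.read(10)   # 4B sample size, 2B tag, 2B width, 2B height
            if len(header) < 10:
                break             # end of file
            tag = header[4:6]     # GB-coded label, e.g. stored as 0xa1b0
            width, height = struct.unpack('<HH', header[6:10])
            bitmap = np.frombuffer(f.read(width * height), dtype=np.uint8)
            yield tag, bitmap.reshape(height, width)  # stored row by row
```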

As the resolution of each sample was different, we proceeded by rescaling them all to the same image size. After reading various published papers that use the same data-set, we decided to set the size to 48x48 pixels, which seemed an appropriate image resolution for the purpose of our project. More specifically, image samples were resized to 40x40 pixels and then a zero-padding of 4 pixels was added on each side. The padding was added so that there would be a clear distinction between the character and the border of the image.

Item           Type            Length              Comment
Sample size    unsigned int    4B                  Number of bytes for one sample (byte count to next sample)
Tag code (GB)  char            2B                  The label, stored as 0xa1b0
Width          unsigned short  2B                  Number of pixels in a row
Height         unsigned short  2B                  Number of rows
Bitmap         unsigned char   Width*Height bytes  Stored row by row

TABLE 3.1: Format of the offline isolated character data file (*.gnt)

Additionally, a zero-padding of 1 pixel is added before every input image is fed to each convolutional layer; this hyper-parameter is passed as an input argument to the corresponding Keras layer. The reason is that when an input image passes through a convolutional layer, the output size decreases before it is fed to the next convolutional layer. This small loss of information accumulates, shrinking the total information much faster, even before the data volume passes through the max-pooling layer, which may lead to worse performance and misclassifications. Zero-padding prevents this data loss of the input volumes, as we want to preserve the spatial dimensions of the data. The appropriate padding was calculated with the following formula, whose output O is the output size of the data; the input parameters are the height/width of the input data (W), the filter size (K), the stride (S) and the padding size (P):

O = \frac{W - K + 2P}{S} + 1

In our case, in order to preserve the 48x48 pixel input size using 3x3 filters with a stride of 1 pixel, we need P = 1 pixel of padding.
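As a small sketch, the formula can be checked for our configuration:

```python
def conv_output_size(w, k, s, p):
    # O = (W - K + 2P) / S + 1 for one spatial dimension
    return (w - k + 2 * p) // s + 1

# 48x48 input, 3x3 filter, stride 1, padding 1 -> the 48-pixel size is preserved
assert conv_output_size(48, 3, 1, 1) == 48
```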

As a last step, the Numpy package [15] was used to create the containers for the newly extracted data. All samples were stored in a 3D matrix and their labels in a 1D matrix. The first dimension of the 3D matrix corresponds to the index of the image, which is also the index into the label matrix.
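Concretely, the containers look like this sketch (the sample count shown is illustrative):

```python
import numpy as np

n_samples = 28682  # e.g. the size of the 120-class training set
images = np.zeros((n_samples, 48, 48), dtype=np.uint8)  # images[i] is sample i
labels = np.zeros(n_samples, dtype=np.int32)            # labels[i] labels images[i]
```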

3.2 Data transformation

In deep convolutional neural networks, overfitting is a frequent phenomenon due to the enormous number of parameters. Data augmentation is a useful method for reducing overfitting by enlarging the data-set, and is especially useful in combination with deep networks. In our case it is even more critical, as we have only a few samples for each class, i.e. 240 training samples and 60 test samples.

In chapter 4 of [8], the authors propose specific augmentation methods for reducing overfitting with very little computation. These methods are image transformations of the original samples. Among them, we chose two methods to enlarge our data-set: rotation, and color alteration by increasing the contrast. By combining the new images with the original ones, we enlarged our training data-set 3 times.

3.2.1 Rotation

Rotation is one of the transformation methods we used to increase our training data. We decided to rotate all characters by 10° anti-clockwise. This is an acceptable value, as the characters we usually write by hand are not always aligned with each other. We could not rotate them much more, as that would approach the effect of a horizontal or vertical flip. Flips cannot be applied to character images, because such transformations would lead to wrong recognition: some characters, when flipped, are interpreted as other characters.
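A sketch of this rotation using OpenCV (the helper name and the zero border value are assumptions; the thesis does not give its exact implementation):

```python
import cv2

def rotate_sample(img, angle=10):
    # Positive angles in getRotationMatrix2D rotate anti-clockwise.
    h, w = img.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h), borderValue=0)
```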

3.2.2 Contrast Increase

Changing the contrast of the samples is also a good method to create more samples and augment the training data. The reasoning is that, depending on the pressure put on the pencil while writing, handwritten characters can have darker or lighter strokes, so this method has the same effect as varying the writing pressure.

For this task we used the "auto equalization" method of the OpenCV library, which automatically increases the contrast of the given images.
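Assuming the "auto equalization" above refers to OpenCV's histogram equalization, a minimal sketch would be:

```python
import cv2

def increase_contrast(img):
    # img: 8-bit single-channel sample; equalizing the histogram spreads
    # the gray-scale values, which increases the contrast.
    return cv2.equalizeHist(img)
```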


3.3 Feature Extraction

Feature extraction methods are often used in OCR tasks. Features can be specific patterns such as circles, horizontal lines and vertical lines, and several filters and transformations can provide such features.

In CNNs, many features are recognized by the neurons during training, but unfortunately not all important features can be learned. Following the authors of [7], in order to avoid the difficulty of learning some domain-specific HCCR features, we extracted directional feature maps, which were afterwards added to the input layer together with the original data. The authors propose 3 feature extraction methods: histogram of gradients, Gabor transformation, and directional feature extraction using the Sobel filter. According to the characteristics of our data, we decided to proceed with the Gabor transformation, which is considered a widely used and promising feature extraction method for HCCR.

The function that performs the Gabor transformation is:

g_{\theta,\lambda,\sigma}(x, y) = \exp\left\{-\frac{x^2 + y^2}{2\sigma^2}\right\} \exp\left\{j\frac{\pi}{\lambda}(x\cos\theta + y\sin\theta)\right\}

The convolution of this function with the input images gives the Gabor response. Gabor responses are basically edges, line patterns and other textures that can be identified in the input images.

According to [16], in order to successfully detect specific desired patterns, in our case the horizontal, vertical and diagonal lines of Chinese characters, we should find a good combination of the variables σ, θ, γ and λ.

θ: The orientation of the parallel ellipses of the Gabor kernel, i.e. the orientation of the line patterns to detect in the image.

γ: A value in [0, 1] representing the spatial aspect ratio of the Gabor kernel. A value of 1 represents a circle, while γ < 1 stretches the parallel lines of the kernel along the orientation set by θ. The default value is 0.5.

λ: The wavelength of the cosine factor of the Gabor function. A bigger wavelength widens the ellipses of the Gabor filter; practically, a filter with a high wavelength identifies less condensed patterns in the image.

σ: The standard deviation of the Gaussian factor of the Gabor function. It controls the spread of the Gabor filter and is used as a scaling factor. Practically, in our samples, as this value increases, smaller irrelevant lines are filtered out.

After carefully studying the Gabor parameters used in [7], we chose 8 values for the orientation θ: 0°, 22.5°, 45°, 67.5°, 90°, 112.5°, 135° and 157.5°. The rest of the parameters, configured to work efficiently with our data samples, are: standard deviation of the Gaussian envelope σ=0.7, spatial aspect ratio of the support of the Gabor function γ=0.5, and wavelength of the sinusoidal factor λ=3.5. The parameters were selected after empirical analysis of the results.
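A sketch of building this 8-orientation Gabor filter bank with OpenCV (the 9x9 kernel size and the phase offset psi=0 are assumptions; the other values are the ones chosen above):

```python
import cv2
import numpy as np

thetas = np.arange(8) * np.pi / 8   # 0°, 22.5°, ..., 157.5°
kernels = [cv2.getGaborKernel(ksize=(9, 9), sigma=0.7, theta=t,
                              lambd=3.5, gamma=0.5, psi=0)
           for t in thetas]

def directional_feature_maps(img):
    # One directional feature map per orientation; img is a 48x48 sample.
    return [cv2.filter2D(img, cv2.CV_32F, k) for k in kernels]
```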

The purpose of this method is to enhance the performance of the CNN by adding knowledge about the writing strokes of each Chinese character. For instance, feature maps that represent vertical lines have higher pixel values along those vertical lines, meaning that they can activate the neurons that recognize them.

In figure 3.1 we can see the 8 different Gabor filters and their impact on a Chinese character. In figure 3.2, we can see 10 examples of Chinese characters together with their directional feature maps.

FIGURE 3.1: Above are the feature maps of the Chinese character for each orientation. Below are the Gabor filters.


FIGURE 3.2: First 10 Chinese characters in the data-set. The first column corresponds to the original sample, the next 8 columns correspond to the 8-directional feature maps.


Chapter 4

Experiments

4.1 Software

In order to work with convolutional neural networks, we surveyed all the available tools and libraries used for neural network implementation. There are already enough libraries covering most of the demands of deep-learning research.

4.1.1 Theano

Theano is a Python framework for efficient mathematical expressions, especially optimized for fast calculations on matrices [17]. Deep CNNs involve enormous and intense matrix calculations, which were carried out efficiently by Theano.

4.1.2 Keras

Keras is a model-level Python deep-learning library built from high-level building blocks. It uses Theano as its backend engine. Some of its advantages are GPU support, fast prototyping and easy experimentation [18]. It fitted our purpose perfectly, which was the investigation of different CNN architectures and hyper-parameter tuning rather than the implementation of new modules.

4.1.3 CUDA

The CUDA library is a fast, efficient, parallel programming library for mathematical calculations, implemented by NVidia. CUDA is especially used in problems related to computer graphics and neural networks, which involve big computational complexity.

The GPU used for our experiments is an NVidia GTX460, which supports up to CUDA 2.1. Unfortunately, the CUDA version supported by our GPU is quite old, so we were not able to use tools implemented for newer CUDA versions, such as cuDNN.

4.2 Architecture

Designing an appropriate Neural Network is not always an easy task. There are many characteristics to take into account in order to design an efficient and accurate Neural Network, and the study and understanding of the data-set is a significant step in this part.

For ease of reading and understanding the architecture of each Convolutional Neural Network, architectures are named as in the following example: assume we have a network with 4 convolutional layers of sizes 10, 20, 30 and 40 respectively, and 1 fully connected layer of size 20. Then we name it "cnn10-20-30-40-fc-20". Additionally, whenever we talk about a convolutional layer, we assume that it is followed by an activation layer and a max-pooling layer.

More detailed descriptions of the architectures follow in the subsections below.

4.2.1 Layers Depth

As reported in Chapter 3 on data pre-processing, the final size of the image samples after pre-processing is 48x48 pixels. This resolution is big enough for our experiments, but it does not allow working with more than 4 convolutional layers, due to the downsampling by a factor of 2 after each convolutional layer; the final feature maps before the fully connected layers are 6x6 pixels. The generic design we decided to use is 4 convolutional layers followed by 2 fully connected layers.

Networks with smaller depths of 2 or 3 convolutional layers instead of 4 were also tested. The purpose of this design was to analyze the trade-off between performance and speed, as smaller networks train faster but at the same time achieve worse accuracy.


4.2.2 Activation Layers

We use "reLU" activation after all convolutional layers, except the last fully con-

nected layer. For the last fully connected layer, "softmax" activation is used instead,

which gives the probabilities of each object belonging to each of the classes.

4.2.3 Filter Size

The convolution filter size used in all final experiments is 3x3. We also performed some additional experiments with a 5x5 filter size, but it seemed less efficient. Smaller filter sizes are proposed by many published papers, such as [9] and [19], compared to the 7x7 and 5x5 sizes used in older architectures like LeNet and AlexNet [8].

4.2.4 Number of Neurons

The first step was to run many experiments with large variations in the number of neurons, in order to understand the behavior of the CNNs in relation to the increase or decrease of neurons. After that step, we decided to continue with 4 models, described below. The name of each model indicates the number of neurons in its first convolutional layer, and the brief description next to each name follows the naming scheme from the beginning of this chapter.

M16: cnn16-32-64-128-fc-512-120

M32: cnn32-64-128-256-fc-512-120

M64: cnn64-128-256-512-fc-512-120

M100: cnn100-200-300-400-fc-512-120

The first fully connected layer always has 512 neurons, and the final layer has 120, the number of classes used in the experiments.
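As an illustration, the M100 model could be sketched as follows. This is written in current Keras syntax (the thesis used an older, Theano-backed Keras, so the exact calls differed), and it assumes 2x2 max-pooling after each convolutional block as described in this chapter, plus the dropout layer of section 4.2.9.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# M100: cnn100-200-300-400-fc-512-120
model = Sequential()
model.add(Conv2D(100, (3, 3), padding='same', activation='relu',
                 input_shape=(48, 48, 1)))
model.add(MaxPooling2D((2, 2)))
for filters in (200, 300, 400):
    model.add(Conv2D(filters, (3, 3), padding='same', activation='relu'))
    model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))                      # section 4.2.9
model.add(Dense(120, activation='softmax'))  # one output per class
```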

4.2.5 Max Pooling Layer

The max-pooling layer is used for down-sampling the feature maps. We used max-pooling layers of size [2,2], which means that each dimension is halved. The pooling layer results in fewer parameters to compute in the network, which helps control over-fitting.


4.2.6 Optimizer

In this section we present and describe the three different optimizers used during training. In order to obtain results and conclusions faster, no augmented data-sets were used here, only the initial data-sets, so as to reduce execution time.

SGD: Stochastic Gradient Descent

Gradient Descent is one of the most common optimization algorithms, used extensively in many machine learning tasks due to its ease of implementation and fast training speed [20]. It is used to minimize the cost function by calculating the gradients of the function and then updating the weights of the model. The algorithm comes in three different variations.

Batch Gradient Descent, or Vanilla Gradient Descent, updates the weights after calculating the gradients of the cost function over the complete training data set. Despite its good stability, Batch Gradient Descent can be resource intensive, as it may be impossible to fit the whole training data-set in memory when it is big.

Stochastic Gradient Descent is the second variation. The main difference is that the algorithm calculates the gradients and updates the weights for each training sample. This algorithm is much faster and requires much less memory, but convergence behaves quite differently, as the cost function may hop between different local minima every time the algorithm processes a different sample.

The last variation is mini-Batch Stochastic Gradient Descent (MB-SGD). This algorithm basically combines the two previously mentioned ones. It is a stochastic gradient descent, but instead of selecting a single sample to perform the weight update, it chooses a batch of samples whose size is configured as a hyper-parameter. A usual batch size is around 50-256, but it can vary according to the application and the resources. It combines the more stable convergence of Batch Gradient Descent with the memory efficiency and performance of Stochastic Gradient Descent. As MB-SGD is the most commonly chosen variation, it is still simply called Stochastic Gradient Descent (SGD) [21][22]; so in this project, when we mention SGD, we refer to MB-SGD.

Although SGD performs well in general, in some cases it does not produce results as good as other optimization algorithms, due to the manual process of adjusting its parameters, such as the learning rate. Thus, many improvements have been introduced to increase the accuracy of the basic SGD algorithm. One common method used in combination with SGD to achieve faster convergence to the optimal solution is Nesterov momentum [23].

RMSProp

RMSProp is a per-parameter adaptive learning rate optimizer, an updated version of the Adagrad algorithm [24]. There is a major difference between Adagrad and RMSProp in the learning process: Adagrad's learning rates decrease rapidly, which causes the learning process to stop early. In contrast, RMSProp adapts the learning rate of each weight according to the magnitude of its gradients, so the learning rates do not decrease monotonically, resulting in a better learning process [25].

Adadelta

Like RMSProp, Adadelta is a dynamic optimizer that was also created to solve Adagrad's problem of rapid, monotonic learning rate decay [26][24]. The basic difference between Adadelta and RMSProp is that Adadelta's update rule does not depend on the default learning rate. Some of Adadelta's most significant abilities are that it continuously adjusts its learning rate without any manual interference, that it does not need a manually set learning rate, and that it is insensitive to hyperparameters. Furthermore, it is robust against noise and against differences in architecture design.

Batch Size

For the purpose of our project, the training data was divided into smaller batches. This method makes the training process of the network faster and also more stable, as described in the previous subsection on the different SGD variants. The batch size used, considering the performance of our hardware and the relevant literature, was 50 samples. We did not try larger batch sizes because the GPU memory limitation made it impossible to fit that amount of data when running the bigger networks. Nevertheless, with this batch size, convergence was smooth and quick enough.


4.2.7 Learning Rate

In order to increase the variability of our experiments, we used more than one learning rate for the SGD and RMSprop optimizers. The learning rates for each optimizer are listed below, followed by a sketch of the corresponding configurations.

Adadelta: The learning rate is not initialized manually; the default learning rate of 1.0 is used.

RMSProp: 0.001, 0.0005

SGD: 0.01, 0.005, 0.001
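The optimizer grid of sections 4.2.6 and 4.2.7 can be sketched in Keras as below (current Keras syntax; the momentum value used with Nesterov momentum is an assumption, since the thesis does not state it, and `model` is the CNN from the sketch in section 4.2.4):

```python
from keras.optimizers import SGD, RMSprop, Adadelta

optimizers = {
    'sgd-0.01':       SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    'sgd-0.005':      SGD(learning_rate=0.005, momentum=0.9, nesterov=True),
    'sgd-0.001':      SGD(learning_rate=0.001, momentum=0.9, nesterov=True),
    'rmsprop-0.001':  RMSprop(learning_rate=0.001),
    'rmsprop-0.0005': RMSprop(learning_rate=0.0005),
    'adadelta':       Adadelta(learning_rate=1.0),  # default rate, not tuned
}
model.compile(optimizer=optimizers['rmsprop-0.0005'],
              loss='categorical_crossentropy', metrics=['accuracy'])
```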

4.2.8 Epochs

An epoch is one full pass over the training data, meaning one forward pass followed by a backward pass, needed to update the weights of the neural network.

In the beginning of the experiments, when fewer classes were used and we were initially trying to narrow the 50 models we had designed down to a few to proceed with, 20 epochs were used. The reason was that trying 50 different models would be time consuming, and at that point we only wanted to compare the models and understand which of them worked better; achieving the best possible accuracy was the following step.

As the next step, we increased the number of epochs to 50, as the accuracy was still increasing at a notable rate at the end of the 20 epochs. After 50 epochs, the accuracy seemed to become stable.
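Putting the batch-size and epoch choices together, a training run would be launched as in this sketch (the variable names for the data arrays are illustrative; they are the NumPy containers from chapter 3):

```python
# Batch size 50 (section 4.2.6) and 50 epochs; accuracy plateaued around 50.
history = model.fit(x_train, y_train,
                    batch_size=50, epochs=50,
                    validation_data=(x_val, y_val))
```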

4.2.9 Regularization

Neural Networks are powerful Machine Learning algorithms that often produce results of high accuracy. However, due to their complexity and large scale, they often suffer from over-fitting. In order to overcome the over-fitting problem, we introduce the use of various methods called regularization methods.

Dropout

Dropout [27] is a technique used in Deep Neural Networks, where the complexity is high and the amount of data is big. The basic idea behind this technique is that during the training period, the dropout algorithm drops randomly chosen neurons and their connections, as we can see in figure 4.1. The number of neurons to be deactivated is chosen randomly, according to a predefined percentage of the total neurons in the specific layer.

This method has two main advantages: over-fitting reduction and performance improvement. The first comes from the fact that co-adaptation of the same features by different neurons is avoided. The second comes from the fact that during the training period, fewer weights have to be calculated, because of the fewer connections.

The Keras Dropout layer was placed between the first and the second fully connected layers. The drop-factor was set to 50%, which means that 50% of the neurons of that layer, chosen at random, are dropped at each update during training.

FIGURE 4.1: Connections among neurons. a) Before applying the dropout layer, b) After applying the dropout layer.

4.3 Gabor Filter Parameters

The Gabor filter has many parameters, which have to be properly configured in order to achieve a useful result. In this case, we tried to find the best values for the standard deviation σ and the wavelength λ; for the rest of the parameters we used the default values. Concerning the number of angles and which specific angles to use, we followed the choices made in [7].

In order to select the most appropriate wavelength, we started from a specific value and increased it by a constant step up to a specific maximum. We then printed all the outputs together in one figure, in order to visually inspect them and choose the best one to proceed with. In the example in figure 4.2, a few of the values we used are π/4, 2π/4, 3π/4 and π. After empirically comparing many values, we decided that λ=3.5 gave the best visual result, in the sense of how distinctly the writing strokes were displayed.

FIGURE 4.2: Example of comparing different wavelengths λ, used for feature extraction with Gabor filters. Every 4 images in a row correspond to a specific angle. The bottom half of this grid-figure visualizes the specific filters that were used for each case.

Concerning the standard deviation σ, we tried values between 0.6 and 0.9, increasing by 0.1 at every step, and then compared them visually. The following figures, 4.3, 4.4, 4.5 and 4.6, display the extracted images together with the Gabor filters used; each figure corresponds to a specific sigma value. The value we decided to proceed with is σ=0.7, as it seemed to provide the best visual result, meaning that it emphasizes the strokes corresponding to each orientation without adding a lot of noise to the samples.
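A sketch of this sigma sweep, re-using the filter-bank parameters from section 3.3 (the kernel size, phase offset and figure layout are assumptions):

```python
import cv2
import matplotlib.pyplot as plt
import numpy as np

img = np.zeros((48, 48), np.uint8)  # placeholder; use a real sample from chapter 3
thetas = np.arange(8) * np.pi / 8

fig, axes = plt.subplots(4, 8, figsize=(16, 8))
for row, sigma in enumerate((0.6, 0.7, 0.8, 0.9)):
    for col, theta in enumerate(thetas):
        k = cv2.getGaborKernel((9, 9), sigma, theta, 3.5, 0.5, psi=0)
        axes[row, col].imshow(cv2.filter2D(img, cv2.CV_32F, k), cmap='gray')
        axes[row, col].set_axis_off()
plt.show()
```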


FIGURE 4.3: 8-direction strokes feature maps, extracted by a Gabor filter with λ=3.5 and σ=0.6.

FIGURE 4.4: 8-direction strokes feature maps, extracted by a Gabor filter with λ=3.5 and σ=0.7.


FIGURE 4.5: 8-direction strokes feature maps, extracted by a Gabor filter with λ=3.5 and σ=0.8.

FIGURE 4.6: 8-direction strokes feature maps, extracted by a Gabor filter with λ=3.5 and σ=0.9.


Chapter 5

Evaluation

In this chapter, we present the results for the different CNN configurations. The results indicate the accuracy achieved and the gain obtained in each case. The configurations we tried are different CNN architectures, various enlarged training sets, different optimizers and different learning rates.

5.1 Architecture Performance

As noted in chapter 4, the first step was to design various CNN architectures and then proceed with only the best ones. The architectures were carefully chosen after taking our HCCR task into consideration.

Initially we designed a big variety of architectures. Some of them were quite similar and others had completely different designs, which led to a total of 50 architectures. We tested them on the initial data-set with a subset of 30 classes, and below we present the 4 predominant architectures we decided to continue with, together with their validation and test-set accuracy.

Name   CNN Architecture               Filter-Size   Epochs   Batch-Size   Val. Accuracy   Test Accuracy
M16    cnn16-32-64-128-fc-512-30      3x3           50       50           0.9652          0.9638
M32    cnn32-64-128-256-fc-256-30     3x3           50       100          0.9868          0.9766
M64    cnn64-128-256-512-fc-512-30    3x3           50       100          0.9805          0.9732
M100   cnn100-200-300-400-fc-512-30   3x3           50       100          0.9861          0.9760

TABLE 5.1: Comparison among the 4 CNN architectures that were chosen for our experimentation.


The best test-set accuracy belongs to the M32 CNN architecture. A probable reason is that increasing the number of neurons further causes overfitting to the training data.
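To make the naming scheme concrete, cnn32-64-128-256-fc-256-30 denotes four convolutional layers of 32, 64, 128 and 256 filters, followed by a 256-unit fully connected layer and a 30-way output layer. A hedged Keras sketch of such a model is given below; the pooling placement, activations and 64x64 input size are illustrative assumptions, not the thesis code:

    # Hedged sketch of an M32-style model (cnn32-64-128-256-fc-256-30);
    # pooling placement, activations and the 64x64 input size are assumptions.
    from keras.models import Sequential
    from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout

    model = Sequential()
    model.add(Convolution2D(32, 3, 3, activation='relu', input_shape=(1, 64, 64)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    for n_filters in [64, 128, 256]:
        model.add(Convolution2D(n_filters, 3, 3, activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))     # fc-256
    model.add(Dropout(0.5))                      # dropout between the two dense layers
    model.add(Dense(30, activation='softmax'))   # one output per class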

Using this information, we increased the amount of data to 120 classes of Chinese characters, which required minor changes in the CNN architectures due to the larger number of classes. The 30-class subset contains 7174 training samples, while the 120-class subset contains 28682. Below we present the results on this data-set.

Name   CNN Architecture                Filter-Size   Epochs   Batch-Size   Val. Accuracy   Test Accuracy
M16    cnn16-32-64-128-fc-512-120      3x3           50       50           0.9662          0.9639
M32    cnn32-64-128-256-fc-512-120     3x3           50       50           0.9693          0.9696
M64    cnn64-128-256-512-fc-512-120    3x3           50       50           0.9733          0.9682
M100   cnn100-200-300-400-fc-512-120   3x3           50       50           0.9719          0.9690

TABLE 5.2: Comparison among the 4 CNN architectures that were chosen for our experimentation, using the original 120 classes data-set.

In comparison to table 5.1, there is a small drop in test-set accuracy, which can be explained by the increased complexity of the task when the number of classes grows to 120.

5.2 Optimizers Performance

In this section we present the comparison among the RMSprop, SGD and Adadelta optimizers, using different learning rates. The data-set used for the optimizer comparison is the original 120-class data-set, not the augmented one. We proceeded with that data-set because at this point we wanted to compare the optimizers rather than improve the overall accuracy, and the time that training on an artificially enlarged data-set would have required would have been a significant restriction.

As can be seen in table 5.3, the best test-set accuracy was achieved by the M100 model, using the RMSprop optimizer with a learning rate of 0.0005.
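In Keras, this optimizer and learning-rate grid can be set up as in the sketch below; the dictionary keys and the build_model helper are hypothetical, not the thesis script:

    # Sketch of the optimizer/learning-rate grid of table 5.3; the keys and
    # the build_model() helper are hypothetical, not the thesis script.
    from keras.optimizers import RMSprop, SGD, Adadelta

    configs = {
        'rmsprop-0.001':  RMSprop(lr=0.001),
        'rmsprop-0.0005': RMSprop(lr=0.0005),
        'sgd-0.001':      SGD(lr=0.001),
        'sgd-0.005':      SGD(lr=0.005),
        'sgd-0.01':       SGD(lr=0.01),
        'adadelta':       Adadelta(),  # adapts its own rate, hence "not applicable"
    }
    for name, optimizer in configs.items():
        model = build_model()  # hypothetical helper rebuilding the CNN from scratch
        model.compile(optimizer=optimizer, loss='categorical_crossentropy',
                      metrics=['accuracy'])
        # model.fit(...) for 50 epochs, then record validation and test accuracy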


       RMSprop lr=0.001    RMSprop lr=0.0005   SGD lr=0.001        SGD lr=0.005        SGD lr=0.01         Adadelta (lr n/a)
       Val.     Test       Val.     Test       Val.     Test       Val.     Test       Val.     Test       Val.     Test
M16    0.9622   0.9565     0.9660   0.9596     0.9456   0.9353     0.9604   0.9563     0.9637   0.9577     0.9662   0.9639
M32    0.9651   0.9639     0.9653   0.9654     0.9486   0.9392     0.9655   0.9632     0.9627   0.9540     0.9693   0.9696
M64    0.9685   0.9672     0.9707   0.9682     0.9472   0.9413     0.9629   0.9580     0.9611   0.9579     0.9733   0.9682
M100   0.9651   0.9651     0.9739   0.9704     0.9486   0.9347     0.9653   0.9597     0.9650   0.9591     0.9719   0.9690

TABLE 5.3: Accuracy results from the 6 different configurations, which include the 3 chosen optimizers (RMSprop, SGD, Adadelta) and their learning rates. For these experiments, the original 120 classes data-set was used.

5.3 Data Transformation Method Performance

After artificially enlarging the training data using the methods described in chapter 3, we obtain an accuracy gain. This gain was expected, as the initial amount of data for each class was not large enough. In table 5.4 we present the results after using data augmentation, together with the accuracy increase.

Model   Test Accuracy   Accuracy Variation
M16     0.9667          +0.28%
M32     0.9732          +0.36%
M64     0.9750          +0.68%
M100    0.9748          +0.58%

TABLE 5.4: Accuracy comparison among the architectures, after the addition of transformed samples. The last column stands for the accuracy gain/loss compared to the original, 120 classes, data-set.
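A sketch of the two transformations described in chapter 3 (rotation and contrast increase) is given below; the angle and contrast factor are illustrative assumptions, not the thesis settings:

    # Sketch of the two transformations (rotation, contrast increase); the
    # angle and contrast factor are illustrative assumptions.
    import numpy as np
    from scipy import ndimage

    def transform(image, angle=10.0, contrast=1.5):
        # rotate around the center, keeping the original image shape
        rotated = ndimage.rotate(image, angle, reshape=False, mode='nearest')
        mean = rotated.mean()
        # simple contrast stretch around the mean intensity
        boosted = np.clip(mean + contrast * (rotated - mean), 0, 255)
        return boosted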

5.4 Directional Feature Extraction Method Performance

After extracting directional features using the Gabor filter and enlarging the training data-set with these new samples, we ran the same experiments once again. We must point out that for this step we did not use the data-augmentation part, but only the directional features extracted from the original samples.

In table 5.5 we can see the accuracy change after using directional features.


Model   Test Accuracy   Accuracy Variation
M16     0.9520          -1.19%
M32     0.9639          -0.57%
M64     0.9686          +0.04%
M100    0.9649          -0.41%

TABLE 5.5: Accuracy comparison among the architectures, after the addition of directional features. The last column stands for the accuracy gain/loss compared to the original, 120 classes data-set.

An interesting point is that although this data-set contains more training samples than the augmented one, and although feature extraction methods have produced very promising results in OCR, the accuracy we got for almost all models was slightly lower than in table 5.4.

5.5 Statistical Significance Test

In order to show that the improvement in accuracy is strongly connected to the changes we introduced to the data-set, we use statistical significance testing. A result is statistically significant when the probability p of observing the result under the null hypothesis is below the significance level α. A common value used in most experiments is α = 0.05, which means that we are looking for a probability p < α [28].

In our case, the null hypothesis is that the improvement in accuracy is not related to the enlargement of the data-set; under this hypothesis, any observed improvement arises merely from random variation in the data. By showing that our results are significant, we are trying to reject this hypothesis.

To perform the statistical significance test, we used McNemar's test. For each model, the test was applied between the baseline (control) experiment without an enlarged data-set and each of the other two cases: the transformed data-set and the data-set created using the Gabor filter. A sketch of such a computation is shown below.
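The sketch assumes the statsmodels implementation of McNemar's chi-squared test, applied to paired per-sample correctness; this is an illustration, not the thesis code:

    # Sketch of McNemar's chi-squared test on paired per-sample correctness,
    # assuming the statsmodels implementation rather than the thesis code.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def mcnemar_p(baseline_correct, variant_correct):
        # boolean arrays: whether each test sample was classified correctly
        both      = np.sum(baseline_correct & variant_correct)
        only_base = np.sum(baseline_correct & ~variant_correct)
        only_var  = np.sum(~baseline_correct & variant_correct)
        neither   = np.sum(~baseline_correct & ~variant_correct)
        table = [[both, only_base], [only_var, neither]]
        # exact=False gives the chi-squared version with continuity correction
        return mcnemar(table, exact=False, correction=True).pvalue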

P-Values                  M16           M32        M64         M100
Augmented Data-Set        0.2239        0.0712     0.0005874   0.003273
Gabor Filtered Data-Set   0.000001905   0.009369   0.8939      0.04952

TABLE 5.6: Given the confusion matrices of the experiments, McNemar's chi-squared test gives the probability values that show whether the null hypothesis is rejected or not.


5.6 Visualization

Visualization is an important part of understanding how neural networks work and what they have learned. Many visualization methods have been implemented for this purpose, depending on the specific aspects to be investigated. Below we present two of the most common visualization methods.

Visualizing the layer’s activations.

During a feed-forward pass, every convolutional layer produces one activation map per convolution filter for a given input image. For this type of visualization, we need one image from our data-set and the pre-trained model. We then feed the image forward and collect the activation maps corresponding to each filter of each layer. In figures 5.1, 5.2, 5.3 and 5.4 we used the first image-character from our data-set with the M100 model.

FIGURE 5.1: Visualizing the 100 activation maps corresponding to the 1st convolutional layer of the M100 model.


FIGURE 5.2: Visualizing the 200 activation maps corresponding to the 2nd convolutional layer of the M100 model.

Visualizing the images that maximize the activations of the filters.

This method is a useful way of showing the patterns and shapes the filters of the network have learned. Here we are looking for input images that maximize the activations of convolution filters. We take the pre-trained model and create random-noise images to use as input. Then, back-propagating from the output feature map of a specific filter to the input image, we perform gradient ascent on the input, trying to maximize that filter's activations. In figures 5.5 and 5.6 we present the learned filters of the M100 model.
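A sketch of this gradient-ascent loop, following the widely used Keras recipe; the step size, iteration count and image size are assumptions, not the thesis settings:

    # Sketch of gradient ascent on a random-noise input to maximize one
    # filter's activation; step size and iteration count are assumptions.
    import numpy as np
    from keras import backend as K

    def visualize_filter(model, layer_index, filter_index, size=64, steps=20, lr=1.0):
        # mean activation of the chosen filter (Theano dim ordering assumed)
        loss = K.mean(model.layers[layer_index].output[:, filter_index, :, :])
        grads = K.gradients(loss, model.input)[0]
        grads = grads / (K.sqrt(K.mean(K.square(grads))) + 1e-5)  # normalize gradient
        ascend = K.function([model.input, K.learning_phase()], [loss, grads])

        image = np.random.random((1, 1, size, size)) * 0.1 + 0.45  # random-noise start
        for _ in range(steps):
            _, g = ascend([image, 0])   # 0 = test phase
            image += lr * g             # gradient ascent step
        return image[0, 0]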


FIGURE 5.3: Visualizing the 300 activation maps corresponding to the 3rd convolutional layer of the M100 model.


FIGURE 5.4: Visualizing the 400 activation maps corresponding to the 4th convolutional layer of the M100 model.


FIGURE 5.5: Visualization of the 2nd convolutional layer of the M100 model. From the 200 filters of the 2nd layer, we show the top 32 with the highest activations. In this lower-level convolutional layer we can see mostly line patterns learned by the filters.

FIGURE 5.6: Visualization of the 4th convolutional layer of the M100 model. From the 400 filters of the 4th layer, we show the top 32 with the highest activations. Most of the images show easily recognized patterns that can be found in many Chinese characters.


Chapter 6

Conclusion

6.1 What has been done

Our experiments focused on three aspects: network architectures, optimizers and data pre-processing methods. We tried different configurations and implementation methods for each of the three. After every step we chose the best results and proceeded with the rest of the experiments.
results and proceeded with the rest of the experiments.

We started with the network architecture. After testing 50 different models, 4 were selected; these have a depth of 4 convolutional layers and 2 fully connected layers, and differ only in the number of neurons in each layer. We continued experimenting with the other aspects using these 4 architectures.
these specific 4 architectures.

Proceeding with the original data-set of 120 classes, we then tried 3 different optimizers, RMSprop, Adadelta and SGD, with various learning rates, in order to decide which optimizer works best with our type of data. Adadelta outperformed the other optimizers across configurations; in only one case did RMSprop with a 0.0005 learning rate perform slightly better than Adadelta on the corresponding network model.
with the corresponding network model.

Finally, the last part of our experimentation was to investigate whether we could further increase the accuracy by artificially enlarging our data. We tried 2 different enlargement methods. The first added transformed images, produced by rotating the originals and increasing their contrast, to the original data. The other added directional features extracted by Gabor filters to the original data. Concerning size, the transformed data-set was 3 times bigger and the extracted-features data-set was 9 times bigger than the initial data-set.
size of the data-set.


6.2 Results

After completing the configuration and parameter tuning stage, the highest accuracy we achieved was 97.5%. This score was accomplished with the M64 model, which consists of 4 convolutional layers of 64, 128, 256 and 512 neurons respectively, the Adadelta optimizer, and the data-set with the transformed images.
the data-set with the transformed images.

Finally, in order to evaluate the results, we applied McNemar's significance test to show that the increase in accuracy was not due to random variance but connected to the changes we applied. The most significant experiment yielded a probability p = 0.0005874, which is strong evidence that the accuracy improvement occurred due to the introduced changes. The data-set used in that experiment involved the transformed images in combination with the M64 model.
involved the transformed images in combination with the M64 model.

6.3 Future Work

Although the results we obtained were satisfactory, there is more room for improvement in the field of HCCR. Some constraints prevented us from extending the investigation in this project. Below we present some aspects that could bring even better results.
that could bring even better results.

Due to limited time and computational resources, we did not use the whole data-set, but only a subset of 120 classes instead of the full 3755 classes. Using more classes increases the complexity of the classification task and also gives us the chance to experiment with more complex architectures. In addition, we could use more data per class in order to generalize better and reduce overfitting. This increase in data per class could be achieved by applying further data augmentation; for instance, there are many more transformations, such as skewing, zooming, translation and local cropping, that could be used to extend our data-set. With a newer and stronger GPU with more memory, we would be able to investigate even more complex configurations and networks.
networks.

Furthermore, the use of higher image resolution could improve the accuracy even more. Due to the subsampling performed by the max-pooling layers, we were not able to use more convolutional layers and increase the depth of the CNN. Chinese characters are quite complex, which means that more abstract filters could be used in order to identify more complex shapes and patterns.
order to identify more complex shapes or patterns.


Finally, there are well-known architectures, such as ResNet and GoogLeNet [9], that could be extended and adapted to our task with new ideas. Unfortunately, such complex architectures cannot be trained on our GPU, due to their complexity and memory demands.
architectures and memory demanding characteristics.


Bibliography

[1] E. N. Bhatia, “Optical Character Recognition Techniques: A Review”, International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 5, 2014.

[2] D. C. Ciresan and J. Schmidhuber, “Multi-column deep neural networks for offline handwritten Chinese character classification”, CoRR, vol. abs/1309.0261, 2013. [Online]. Available: http://arxiv.org/abs/1309.0261.

[3] D. C. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification”, CoRR, vol. abs/1202.2745, 2012. [Online]. Available: http://arxiv.org/abs/1202.2745.

[4] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, “ICDAR 2013 Chinese handwriting recognition competition”, 13th International Conference on Document Analysis and Recognition, 2013. [Online]. Available: http://www.icdar2013.org/program/competitions.

[5] C. L. Liu, F. Yin, D. H. Wang, and Q. F. Wang, “CASIA online and offline Chinese handwriting databases”, in 2011 International Conference on Document Analysis and Recognition, 2011, pp. 37–41. DOI: 10.1109/ICDAR.2011.17.

[6] W. Yang, L. Jin, Z. Xie, and Z. Feng, “Improved deep convolutional neural network for online handwritten Chinese character recognition using domain-specific knowledge”, CoRR, vol. abs/1505.07675, 2015. [Online]. Available: http://arxiv.org/abs/1505.07675.

[7] Z. Zhong, L. Jin, and Z. Xie, “High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps”, CoRR, vol. abs/1505.04925, 2015. [Online]. Available: http://arxiv.org/abs/1505.04925.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks”, in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions”, CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842.

[10] Y. L. Cun, B. Boser, J. S. Denker, R. E. Howard, W. Habbard, L. D. Jackel, and D. Henderson, in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed., pp. 396–404, 1990. [Online]. Available: http://dl.acm.org/citation.cfm?id=109230.109279.

[11] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models”, 2013.

[12] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[13] D. Scherer, A. Müller, and S. Behnke, “Evaluation of pooling operations in convolutional architectures for object recognition”, in Artificial Neural Networks – ICANN 2010: 20th International Conference, Thessaloniki, Greece, September 15–18, 2010, Proceedings, Part III, K. Diamantaras, W. Duch, and L. S. Iliadis, Eds. Berlin, Heidelberg: Springer, 2010, pp. 92–101, ISBN: 978-3-642-15825-4. DOI: 10.1007/978-3-642-15825-4_10. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-15825-4_10.

[14] B. Graham, “Fractional max-pooling”, CoRR, vol. abs/1412.6071, 2014. [Online]. Available: http://arxiv.org/abs/1412.6071.

[15] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python, 2001–. [Online; accessed 10/09/2016]. Available: http://www.scipy.org/.

[16] P. Moreno, A. Bernardino, and J. Santos-Victor, “Gabor parameter selection for local feature detection”, J. S. Marques, N. Pérez de la Blanca, and P. Pina, Eds., pp. 11–19, 2005.

[17] Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions”, ArXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688.

[18] F. Chollet, Keras, https://github.com/fchollet/keras, 2015.

[19] M. Lin, Q. Chen, and S. Yan, “Network in network”, CoRR, vol. abs/1312.4400, 2013. [Online]. Available: http://arxiv.org/abs/1312.4400.

[20] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, “On optimization methods for deep learning”, L. Getoor and T. Scheffer, Eds., pp. 265–272, 2011. [Online]. Available: http://dblp.uni-trier.de/db/conf/icml/cml2011.html#LeNCLPN11.

[21] S. Ruder, “An overview of gradient descent optimization algorithms”, CoRR, vol. abs/1609.04747, 2016. [Online]. Available: http://arxiv.org/abs/1609.04747.

[22] L. Bottou, Stochastic gradient descent tricks. 2012, pp. 421–436. DOI: 10.1007/978-3-642-35289-8_25. [Online]. Available: https://doi.org/10.1007/978-3-642-35289-8_25.

[23] A. Botev, G. Lever, and D. Barber, “Nesterov’s accelerated gradient and momentum as approximations to regularised update descent”, CoRR, vol. abs/1607.01981, 2016.

[24] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization”, no. UCB/EECS-2010-24, 2010. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-24.html.

[25] T. Tieleman and G. Hinton, Lecture 6.5—RMSprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 2012.

[26] M. D. Zeiler, “ADADELTA: An adaptive learning rate method”, CoRR, vol. abs/1212.5701, 2012. [Online]. Available: http://arxiv.org/abs/1212.5701.

[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting”, J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014, ISSN: 1532-4435. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313.

[28] V. E. Johnson, “Revised standards for statistical evidence”, vol. 110, no. 48, pp. 19313–19317, 2013. DOI: 10.1073/pnas.1313476110.