
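Attribution-NonCommercial-NoDerivs 2.0 Korea

Users are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that they follow the conditions below:

• Attribution. You must attribute the work to the original author.
• Noncommercial. You may not use this work for commercial purposes.
• No Derivative Works. You may not alter, transform, or build upon this work.

• For any reuse or distribution, you must make clear the license terms applied to this work.
• These conditions do not apply if you obtain separate permission from the copyright holder.

Your rights under copyright law are not affected by the above.

This is a human-readable summary of the Legal Code.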

Doctoral Thesis

A Management Architecture for

Client-defined Cloud Storage Services

Jae Yoon Chung (정 재 윤)

Department of Computer Science and Engineering

Pohang University of Science and Technology


A thesis submitted to the faculty of the Pohang University of

Science and Technology in partial fulfillment of the requirements

for the degree of Doctor of Philosophy in the Department of

Computer Science and Engineering.

Jae Yoon Chung

The undersigned have examined this thesis and hereby certify that it is worthy of acceptance for a doctoral degree from POSTECH.

12/15/2014

DCSE 20090260

Jae Yoon Chung (정재윤). A Management Architecture for Client-defined Cloud Storage Services. Department of Computer Science and Engineering, 2015, 76 pages. Advisor: James Won-Ki Hong. Text in English.

ABSTRACT

The development of networking and computing technologies accelerates the growth of the cloud business. Cloud service providers (CSPs) offer virtual instances to users, who pay only for what they use. High-speed networks also help to overcome the limitation of geographical distance between clients and application servers. Users have adopted cloud storage services for data backup and sharing. However, users use only a few cloud storage services because of the complexity of managing multiple accounts and distributing the data to store. In this thesis, we propose the CLIent-defined Management Architecture (CLIMA), which defines a new storage service by coordinating multiple cloud storage services from clients. We address practical issues in coordinating multiple CSPs using a client-based approach. We implement a prototype, called Client-defined privacY-protected Reliable cloUd Storage (CYRUS), as a realization of CLIMA. CYRUS achieves both reliability and privacy protection using erasure coding, and higher performance by optimally scheduling data transmission. We evaluate the benefits of CLIMA using our prototype with commercial cloud storage service providers. CLIMA encourages users to use more cloud storage services with better manageability and flexibility.

Contents

I Introduction
1.1 Motivation and Problem Statement
1.2 Research Goals and Approach
1.3 Organization

II Related Work
2.1 Distributed Storage System
2.2 Client-based Distributed Storage System
2.3 Optimizing Networked Applications

III Client-defined Management Architecture
3.1 Application Domain
3.2 Cloud Domain
3.3 Network Domain

IV CYRUS Design
4.1 CYRUS Requirements
4.2 CYRUS Design Considerations
4.3 Optimized Cloud Selection

V CYRUS Implementation
5.1 Unifying Heterogeneous CSPs
5.2 Decoupling Metadata
5.3 Storing Files as Shares
5.4 File Operations
5.5 Distributed Conflict Detection
5.6 Adapting to CSP Changes

VI Performance Evaluation
6.1 Testbed Experiments
6.2 Real-World Benchmarking
6.3 Comparison with DepSky
6.4 Deployment Trial Results

VII Discussion
7.1 Management Technologies in Network Domain
7.2 Promoting Market Competition in Cloud Market

VIII Conclusion
8.1 Summary
8.2 Contributions
8.3 Future Work

REFERENCES

List of Figures

III.1 Client-defined Internet Application Overview
III.2 CLIMA’s application architecture.
IV.1 CYRUS allows multiple clients to control files stored across multiple CSPs.
IV.2 Mapping files to shares.
V.1 Client-based CYRUS implementation.
V.2 Screenshots of the CYRUS user interface.
V.3 Decoupling metadata and file control.
V.4 Metadata data structures.
V.5 A non-systematic Reed-Solomon erasure code.
V.6 Modified zfec library to use arbitrary vector for generating dispersal matrix.
V.7 File synchronization among multiple clients using client-based coordination.
V.8 Uploading a file (lines refer to Algorithm 2).
V.9 Two types of file conflicts.
V.10 Share migration when removing a CSP.
VI.1 Performance comparison between zfec (solid line) and modified zfec (dashed line). (t, n) configurations - Red: (2, 3), Green: (2, 4), Blue: (3, 4), Purple: (2, 5), Sky blue: (3, 5), and Black: (4, 5).
VI.2 Storage overhead and availability while changing (t, n) configurations
VI.3 Testbed download performance of random, heuristic, and CYRUS’s cloud selection.
VI.4 Testbed completion times of different privacy and reliability configurations.
VI.5 Completion times of different storage schemes.
VI.6 Logical topology between a client and 20 CSPs based on TRACEROUTE results.
VI.7 Completion times with CYRUS and DepSky.
VI.8 Share distribution with CYRUS and DepSky.
VI.9 Completion times during the trial.
VII.1 Subnet-based application traffic grouping and identification
VII.2 Vector Space Model represents application traffic.

List of Tables

II.1 Comparison of CYRUS’s features with similar cloud integration systems.
IV.1 CYRUS’s Application Programming Interface.
IV.2 APIs and measured performance of commercial cloud storage providers. Throughput is calculated from the measured RTT assuming a 0.1% packet loss rate and 65535 byte TCP window size.
VI.1 Testbed evaluation dataset.
VII.1 Classification accuracy comparison. Fixed-port applications: Fileguri, ClubBox, Melon, BigFile. Untraceable-port applications: eMule, BitTorrent.

List of Algorithms

1 CSP and bandwidth selection.
2 Uploading files.
3 Downloading files.

Chapter I Introduction

Cloud services have emerged rapidly thanks to the development of networking and computing technologies. Cloud technology has changed Internet applications, markets, and business models. Enterprise customers have built their IT (Information Technology) infrastructure using virtual hardware resources, such as CPU and memory, from remote clouds. Clouds have also provided development platforms that ease application development and deployment. Rich applications running on clouds, such as Google Apps, have replaced on-device applications. As major cloud vendors blur the lines between the business models, i.e., Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS), users are considering clouds as remote resource providers and leveraging them to extend the limited capacity of local equipment and devices. Client applications use cloud services to seamlessly store and share data among multiple devices and users. For example, photos on smartphones are automatically synchronized with desktop and laptop computers through cloud storage services.

Cloud storage is one of the most popular cloud services today. The cloud storage industry is forecast to be worth $240 billion in 2020 [1] and to have 3.8 billion personal storage accounts in 2018 [2]. From the perspective of users’ data management¹, the reliability and privacy protection of cloud storage services are critical concerns. However, the service level agreements (SLAs) of cloud services do not perfectly match users’ requirements. In fact, cloud storage providers have experienced high-profile data leaks, e.g., credit card information from Home Depot [3] and celebrity photos from iCloud [4], as well as outages: Amazon’s popular S3 service went down for almost an hour in 2013, affecting websites like Instagram and Flipboard [5]. More recently, seven million Dropbox accounts were exposed due to IDs and passwords stolen from a third-party application [6]. Most cloud storage services guarantee 99.9% server uptime per month, which still allows up to 8.76 hours of downtime per year.

1.1 Motivation and Problem Statement

Even though IaaS and PaaS provide well-designed programmable interfaces, SaaS storage services provide pre-implemented functions to reduce the high complexity of configuration. However, SaaS storage services do not allow users to manage the services beyond the offered configuration capabilities. To achieve both the flexibility of IaaS/PaaS and the low management complexity of SaaS, we need to bring distributed storage solutions to client devices, such as (1) unified cloud storage interfaces, (2) data coding for high privacy protection and reliability, and (3) scheduling of data transmission requests. Empowering clients can maximize the economic benefits of users. Malone et al. argued that IT facilitates a move from single-supplier arrangements to multiple-supplier arrangements because IT reduces the cost of coordination with suppliers [7, 8]. Many cloud service providers (CSPs) are competing in the market by attracting users with lower prices. If a client-based management architecture reduces the complexity of account and data management by coordinating multiple CSPs, users can distribute their data efficiently to multiple storage services with higher reliability, better privacy protection, and lower prices.

¹ In this research, “users” and “clients” are used synonymously and can refer to multiple devices owned by one person.

To develop a client-defined management architecture, this thesis tries to answer the following key questions.

• How can we manage multiple cloud resources from clients efficiently?
• What is the most important issue in public cloud storage services?
• How can the client control and coordinate multiple commoditized clouds?
• How can we distribute user data among multiple autonomous clients?
• How can we maximize the benefits of the client?
• How can we integrate existing cloud storage services?
• How can we avoid lock-in from clouds?

1.2 Research Goals and Approach

In this section, we enumerate the research goals of client-defined cloud storage services that resolve the current problems, and outline our approach to achieving them. The research goals of a management architecture for client-defined cloud storage services are as follows.

• To state the most important problems of public cloud storage services, i.e., user privacy, cloud reliability, and cloud lock-in.
• To propose a general management architecture that manages multiple clouds at the client.
• To show how to realize the client-defined management architecture as a client application that provides a data synchronization service.
• To improve privacy protection, reliability, and performance.
• To implement and deploy a prototype.

1.3 Organization

In this thesis, we propose the CLIent-defined Management Architecture (CLIMA) for cloud storage services, which coordinates multiple cloud storage services for clients. CLIMA offers efficient data management for users while satisfying their requirements for reliability and privacy protection. We analyze users’ requirements and realize CLIMA by implementing a prototype called Client-defined Reliable and privacY-protected cloUd Storage (CYRUS). The organization of this thesis is as follows. In Chapter II, we present previous studies on client-based Internet service extension. We describe the design of CLIMA in Chapter III. Chapter IV explains the design considerations and practical challenges. We then describe our prototype implementation in Chapter V and present evaluation results in Chapter VI. In Chapter VII, we introduce network-side traffic management techniques that identify applications and represent traffic using the Vector Space Model. We also describe the impact of CLIMA on the cloud storage market. Finally, we conclude the thesis in Chapter VIII.

Chapter II Related Work

We adopt basic ideas from distributed storage systems. We consider CSPs as object stores that support only simple operations: PUT, GET, LIST, DELETE, etc. Unlike distributed storage systems, in CLIMA, clients are responsible for all data management functions, including encoding/decoding and scattering/gathering.

2.1 Distributed Storage System

Byzantine fault-tolerant systems have been studied for reliable distributed storage [9, 10, 11, 12, 13]. These works proposed protocols that can read and write with unreliable storage. Realizing these protocols, however, requires running code at remote storage servers, which is not applicable to CLIMA. For example, object-based distributed storage systems [14, 15] require a daemon on the server to handle multiple transactions and data synchronization.

HAIL [14] is designed for high availability and integrity of distributed storage. The client makes sure that its file is not corrupted; instead of verifying the entire file, it verifies coded samples. Parities are maintained within each server as well as among servers. HAIL does not address security threats from cloud service providers who can access user data.

RACS [16] applies RAID techniques at the cloud storage level. RACS’s proxy server encodes user data into shares with erasure coding (and decodes them on retrieval) and distributes the shares to multiple cloud storage services. RACS emphasizes diversifying cloud storage providers to avoid vendor lock-in, thereby improving data availability and durability.

Scalia [17] proposed an optimal data placement algorithm that periodically changes data locations based on the data access pattern. Scalia’s optimization objective is to pay a fair price. Assuming that data access frequency is related to price, Scalia performs its optimization procedure periodically (every 5 minutes). Scalia’s optimization formulation does not reflect the tradeoffs among security, performance, and monetary cost. The periodic optimization may also consume substantial network resources in a client-based architecture.

Ceph [15] proposed a distributed file system designed for high performance, reliability, and scalability. Ceph decouples metadata from the actual data using hash-based addressing. Ceph achieves high reliability on top of unreliable object servers by keeping multiple replicas. Ceph requires server-side software for communication among clients, metadata servers, and object servers.

Advanced techniques for efficient and reliable data storage have also been proposed. Data deduplication is a technique for reducing the amount of stored data [18]. By splitting a file into small pieces, a storage system stores only the unique pieces and manages references to duplicated pieces. Content-based chunking, such as Rabin fingerprinting, is widely used because it does not shift chunk boundaries when a file is modified [19, 18].
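As a minimal sketch of the deduplication idea described above, the following Python code cuts a byte string at content-defined boundaries and stores each unique chunk only once. It uses a simple polynomial rolling hash as a stand-in for Rabin fingerprinting; the window size, mask, size limits, and the function names chunk, store_file, and restore_file are illustrative assumptions and are not taken from the cited systems.

```python
import hashlib
import random

WINDOW = 16                     # bytes covered by the rolling hash (assumed value)
MIN_SIZE, MAX_SIZE = 16, 256    # chunk size limits (assumed values)
MASK = (1 << 6) - 1             # cut when the low 6 hash bits are all zero
BASE, MOD = 257, (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def chunk(data: bytes) -> list[bytes]:
    """Split data where the hash of the last WINDOW bytes matches the mask."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: remove the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def store_file(data: bytes, store: dict[str, bytes]) -> list[str]:
    """Store only unseen chunks; return the file recipe (list of digests)."""
    recipe = []
    for piece in chunk(data):
        digest = hashlib.sha256(piece).hexdigest()
        store.setdefault(digest, piece)   # duplicates are kept once
        recipe.append(digest)
    return recipe


def restore_file(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild a file from its recipe."""
    return b"".join(store[d] for d in recipe)
```

Because a boundary depends only on the last WINDOW bytes, inserting a few bytes at the front of a file changes the first chunks but lets later boundaries fall on the same content, so most chunks of the modified file are already in the store.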

Scattering data into n pieces such that any t pieces suffice to reconstruct the original data, the (t, n) property, is the most important property of erasure codes (or regenerating codes). In cryptography, Shamir proposed (t, n) secret sharing based on random polynomial interpolation. Erasure codes focus more on efficiency. A Reed-Solomon code also has the (t, n) property but uses t times less storage space than (t, n) secret sharing. Reed-Solomon codes are widely used in distributed storage and data deduplication research due to their generality. AONT-RS enhanced the security level of Reed-Solomon codes [20].
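To make the (t, n) property concrete, the following Python code is a minimal sketch of Shamir-style secret sharing over a prime field: the secret is the constant term of a random polynomial of degree t - 1, and any t shares recover it by Lagrange interpolation at x = 0. The choice of prime, the integer encoding of the secret, and the function names split and reconstruct are assumptions made for illustration.

```python
import random

PRIME = 2**61 - 1   # a Mersenne prime; the secret must be smaller than this


def split(secret: int, t: int, n: int, rng=None) -> list[tuple[int, int]]:
    """Return n shares (x, y); any t of them reconstruct the secret."""
    rng = rng or random.SystemRandom()
    # Random polynomial of degree t - 1 with the secret as constant term.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):      # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

In this sketch every share is as large as the secret, so n shares occupy n times the original size, whereas a (t, n) Reed-Solomon code stores shares of size 1/t each, which is the factor-of-t saving mentioned above.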

2.2 Client-based Distributed Storage System

Attasena proposed a solution that distributes data to multiple data warehouses [21]. After each column of data is encoded using a (t, n) secret sharing scheme or an erasure code, read and write operations are issued to distributed database management systems. The scope of Attasena’s approach is limited to database systems that support efficient transaction control and atomic operations.

PiCsMu proposed an approach that hides users’ data in image files [22]. PiCsMu leverages multiple cloud services and photo-sharing SNS services. PiCsMu is not designed to synchronize data as the Dropbox application does; it simply supports storing data in distributed locations.

DepSky, the solution closest to CYRUS, attempted to coordinate cloud storage services using simple operations only [23]. DepSky addressed the issue of simultaneous updates from multiple clients. Its solution is to create a lock file at the cloud and to check, after a random backoff time, whether the client is the only one holding the lock. Even though DepSky addressed practical issues and solved the problem, the lock-based solution performs poorly, i.e., it spends two RTTs checking that a lock file has been properly created.

To match users’ requirements with the technologies developed in storage research, we define six comparison properties. Erasure coding is the most significant property, because it improves reliability and also encrypts (encodes) data. We should achieve concurrency that guarantees data integrity while multiple clients are reading and writing data simultaneously. Client-side data deduplication reduces transmission cost because duplicated data is not uploaded to the cloud again. Metadata should be designed so that data encoding configurations can be changed dynamically during operation. Table II.1 summarizes the comparison between the proposed solution and other studies.

Property | Attasena [21] | PiCsMu [22] | InterCloud RAIDer [24] | DepSky [23] | Proposed Solution
Erasure coding | Yes | No | Yes | Yes | Yes
Concurrency | Yes | No | No | Yes | Yes
Data deduplication | No | No | Yes | No | Yes
Versioning | No | No | Yes | Yes | Yes
Optimal cloud selection | No | No | No | No | Yes
Elastic reliability | No | No | No | No | Yes
Client-based architecture | No | No | Yes | Yes | Yes

Table II.1: Comparison of CYRUS’s features with similar cloud integration systems.

2.3 Optimizing Networked Applications

CLIMA is also inspired by application-based traffic control. Application-Layer Traffic Optimization (ALTO) provides network information for building optimal overlay topologies for peer-to-peer (P2P) applications [25]. Application-Based Network Operations (ABNO) supports dynamic resource planning during operation [26, 27]. A client-defined management architecture is necessary to interact efficiently with cloud and network services. Developments in network management and virtualization technology provide programmable networks to applications, which allows seamless communication from hardware to software and even to end users [28].


Chapter III Client-defined Management Architecture

In this chapter, we propose CLIMA, which maximizes users’ benefits (e.g., lower cost and more network and storage resources) and matches users’ requirements. CLIMA is a general architecture that defines the interactions among applications, servers, clouds, and networks. The benefits of CLIMA in terms of privacy, reliability, and performance are as follows.

• Privacy: Resource providers cannot read user data because each of them possesses only a part of it. To access user data, an attacker (including a resource provider) must break into multiple CSPs and collect a sufficient number of data pieces. Direct communication between clients and CSPs prevents user data from being intercepted in the middle.

• Reliability: Assuming that the failure probabilities of resources are independent, building a system from storage services in parallel improves reliability¹.

• Performance: Striping I/O improves performance, as in a RAID storage system. Even though network technologies have developed rapidly, networking is still the bottleneck when storing data on remote servers. Table IV.2 shows the measured RTT and expected throughput to 20 cloud services around the world. Connecting to multiple clouds and striping upload/download connections improves throughput until it reaches the available bandwidth of the client-side network.
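The reliability and performance benefits listed above can be illustrated with a short calculation. The Python sketch below assumes that each CSP fails independently with the same probability and that data is readable whenever at least t of n CSPs are reachable; it also caps striped throughput at the client-side bandwidth. The function names and the example numbers are assumptions for illustration, not measurements from this thesis.

```python
from math import comb


def availability(t: int, n: int, p_fail: float) -> float:
    """Probability that at least t of n independent CSPs are reachable."""
    p_ok = 1.0 - p_fail
    return sum(comb(n, k) * p_ok**k * p_fail**(n - k) for k in range(t, n + 1))


def storage_overhead(t: int, n: int) -> float:
    """Stored bytes per original byte when each share has size 1/t."""
    return n / t


def striped_throughput(per_csp_mbps: list[float], client_mbps: float) -> float:
    """Aggregate throughput of parallel transfers, capped by the client link."""
    return min(sum(per_csp_mbps), client_mbps)
```

For example, with a per-CSP failure probability of 0.001, a single CSP is available with probability 0.999, while a (2, 3) configuration is available with probability of about 0.999997 at a storage overhead of 1.5. Under correlated failures, the independence assumption no longer holds and the gain is smaller.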

CLIMA leverages existing cloud storage services. We assume that the existing services are not 100% trustworthy, which is similar to the assumption of Byzantine fault-tolerant systems. The challenge of CLIMA is to design a protocol between the client and CSPs without changing cloud APIs. Thus, a CLIMA solution should not require any specific functions or implementations from clouds; required functions should be designed as client-side components. CLIMA requires only I/O interfaces from CSPs and context information from networks.

Figure III.1 illustrates the three domains of CLIMA. The Application Domain is the only domain that CLIMA can design and implement for re-defined applications. CSPs in the Cloud Domain are managed autonomously by different players, each implemented and optimized for its own services. CLIMA also measures the Network Domain to collect context data such as performance statistics and quality of service.

¹ The failure probabilities are not always independent. We consider the failure dependency among CSPs in Section 3.2.

[Figure III.1: Client-defined Internet Application Overview. Diagram labels: Application Domain, Network Domain, Cloud Domain, Client, Controller, Coordinator, App. Foundation, Operating System API, App. Server, OSS, Cloud Service Provider.]

3.1 Application Domain

The Application Domain is fully manageable by CLIMA, like the clients and servers of legacy Internet applications. In the Application Domain, CLIMA applications are installed on client devices. Figure III.2 shows the components of the client application. Application Foundation is the implementation of the required functions and modules that are not related to coordination (e.g., data synchronization). Application Foundation is implemented according to the user’s requirements and includes the user interface, application functions, logic, exception handling, etc. Coordinator is responsible for integrating and coordinating multiple remote CSPs. The Information Base in Coordinator stores configurations such as how data is encoded and where data is located. For storing data, Coordinator encodes/decodes and encrypts/decrypts data to scatter it across multiple storage services. Coordinator schedules upload/download requests to CSPs and handles responses asynchronously. Controller is the connector to each resource provider. Controller depends on the available APIs, such as authentication, get, put, list, and delete, and unifies the different API implementations. Cloud connector libraries such as JClouds [29] help to reduce the implementation effort.

[Figure III.2: CLIMA’s application architecture. Diagram labels: Coordinator, Scatter, Gather, Scheduler, Optimizer, Performance Manager, Metadata; Information Base, Performance information, Encoding/Decoding information, Configuration, API & Auth. information, Application information; Controller, Connector 1, Connector 2; Application Foundation, Function, User Interface, System I/O.]
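The role of Controller as a uniform connector layer can be sketched as follows. This Python code is a minimal illustration, assuming a hypothetical Connector interface with put, get, list_keys, and delete, an in-memory stand-in for a real CSP, and hypothetical scatter and gather helpers that place one piece per connector. A real connector would wrap each provider’s own authentication and HTTP API, which is not shown here.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical uniform interface that every CSP adapter implements."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def list_keys(self, prefix: str = "") -> list[str]: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class InMemoryConnector(Connector):
    """Stand-in for one CSP; stores objects in a local dictionary."""

    def __init__(self, name: str):
        self.name = name
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> list[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)


def scatter(connectors: list[Connector], key: str, pieces: list[bytes]) -> None:
    """Upload piece i to connector i under the name '<key>.<i>'."""
    for i, (conn, piece) in enumerate(zip(connectors, pieces)):
        conn.put(f"{key}.{i}", piece)


def gather(connectors: list[Connector], key: str) -> list[bytes]:
    """Download piece i from connector i."""
    return [conn.get(f"{key}.{i}") for i, conn in enumerate(connectors)]
```

Keeping the interface this small matches the constraint that only basic operations can be expected from every CSP; the coordination logic is then written once against Connector rather than against each provider’s API.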

Application Server is an optional component that is responsible for supporting CLIMA applications (e.g., user account management). We do not use the application server as a proxy that relays messages between clients and CSPs. Using Application Server as a proxy would indeed offer advantages in terms of data caching and information sharing. However, a proxy server adds another dependency to CLIMA and introduces security threats such as interception of messages in the middle. Thus, CLIMA applications connect to CSPs directly through secure communication channels.

3.2 Cloud Domain

The Cloud Domain offers powerful computing machines and storage. The design and implementation of the Cloud Domain depend entirely on the CSPs. For example, an IaaS (Infrastructure as a Service) provider such as Amazon provides virtual instances themselves. On the other hand, PaaS (Platform as a Service) and SaaS (Software as a Service) providers offer APIs for accessing their cloud resources. Our survey in Table IV.2 shows that most CSPs provide HTTP(S) RESTful APIs. When coordinating CSPs, a CLIMA application considers their dependencies in terms of reliability and performance. If cloud storage services are deployed in their own data centers, the CSPs are physically independent. Otherwise, if a cloud storage service is deployed on top of another storage service, the CSPs are logically independent but physically correlated. User data cannot be shared among logically independent CSPs, each of which protects user data from the others. However, the failure probabilities of logically independent CSPs are not independent. For example, a failure or outage of an IaaS CSP affects the other CSPs that are deployed on it. The different locations of CSPs also lead to different upload/download performance. As our measurements in Table IV.2 show, heterogeneous CSPs exhibit different latencies, which affect upload/download throughput. CLIMA maximizes reliability and performance by selecting physically independent CSPs while optimally scheduling data transmission.
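To illustrate how latency bounds throughput, the following Python sketch estimates TCP throughput from an RTT using two well-known bounds: the window-limited rate (window size divided by RTT) and the Mathis et al. approximation, MSS / (RTT · sqrt(p)) scaled by sqrt(3/2). The defaults of 0.1% packet loss and a 65535-byte window follow the caption of Table IV.2; the exact formula used for that table is not stated here, so the combination of bounds, the 1460-byte MSS, and the function names are assumptions.

```python
from math import sqrt


def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """At most one window of data can be in flight per round trip."""
    return window_bytes * 8 / rtt_s


def mathis_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Mathis et al. approximation for loss-limited TCP throughput."""
    return sqrt(1.5) * mss_bytes * 8 / (rtt_s * sqrt(loss))


def expected_throughput_bps(rtt_s: float, loss: float = 0.001,
                            window_bytes: int = 65535,
                            mss_bytes: int = 1460) -> float:
    """Take the tighter of the window-limited and loss-limited bounds."""
    return min(window_limited_bps(window_bytes, rtt_s),
               mathis_bps(mss_bytes, rtt_s, loss))
```

Both bounds are inversely proportional to the RTT, so under these assumptions a CSP that is ten times farther away in round-trip time offers roughly one tenth of the per-connection throughput, which is why CSP location matters for scheduling.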


3.3 Network Domain

The Network Domain offers communication paths among CLIMA applications, servers, and CSPs. The Network Domain is not as flexible as the Application and Cloud Domains. Even though northbound APIs of SDN controllers are currently being studied, clients cannot yet obtain network information through standard APIs. At this stage, CLIMA applications therefore use active and passive measurement techniques to obtain information from the Network Domain. Network measurement tools such as PING and TRACEROUTE measure the RTT and logical topology among clients, servers, and CSPs. Statistics of performance metrics such as throughput and service response time also help applications to be aware of the network status. Once standard APIs become available, CLIMA will be able to obtain network information more easily and even control traffic flows while guaranteeing QoS requirements.

Chapter IV CYRUS Design

In this chapter, we design CYRUS as a realization of CLIMA. We assume that a user requires a more reliable and secure cloud storage service. We then specify CLIMA’s components and benefits together with the implementation challenges, and design a system from available solutions. Figure IV.1 illustrates how CYRUS distributes user data to multiple CSPs. To obtain user data, CYRUS collects data pieces, which are not readable by themselves, from multiple CSPs. Even if some pieces are unavailable (damaged or inaccessible), the data can be reconstructed. CYRUS assumes that a user has file storage accounts at multiple CSPs, e.g., Dropbox and Google Drive; it provides an interface for uploading and downloading files to these CSPs. By moving control of the cloud to the client, CYRUS enables clients to simultaneously leverage services from multiple cloud providers. The cloud thus becomes commoditized, with the client controlling the services provided by each cloud and integrating these into a service for users. User files, for instance, can be uploaded by their client devices to multiple CSPs. From the perspective of the client, these CSPs are nothing more than file repositories, and CYRUS can choose which data to upload to which cloud in order to meet requirements such as user-specified reliability and privacy levels.

[Figure IV.1: CYRUS allows multiple clients to control files stored across multiple CSPs. Diagram labels: CYRUS clients (Client Control), Google Drive, Dropbox, Enterprise server.]

4.1 CYRUS Requirements

We identify the basic requirements of CYRUS as a cloud storage service:

• Privacy: By scattering user data to multiple CSPs, user data becomes secure from the CSPs as well as from attackers outside the CSPs.

• Reliability: Users can avoid vendor lock-in, and the distributed architecture improves reliability and accessibility.

• Performance: Striping data transmission across multiple CSPs increases throughput. In addition, users can parallelize data transfers to the CSPs and select the CSPs used for uploading and downloading files so as to minimize delays.

Leveraging multiple CSPs represents an architectural change from the usual Infrastructure-as-a-Service (IaaS) model, which rents customized virtual machines (VMs) to users. While some CSPs offer software-defined service level agreements with customized performance guarantees [30], these approaches fundamentally leave control at the CSP. Clients control their VMs or specify performance through software on the cloud and are still limited by the CSP’s functionality. Yet CYRUS does not consist solely of autonomous clients either, as in peer-to-peer systems that allow each client node to leverage the resources at other connected nodes for storage or computing. CYRUS does not treat CSPs as peers and does not assume that clients can directly communicate with each other; instead, the CSP “nodes” become autonomous storage resources separately controlled by the clients.

CYRUS’s distributed, client-controlled architecture enforces privacy and reliability by scattering user data to multiple CSPs, so that attackers must obtain data from multiple CSPs to read users’ files; by building redundancy into the file pieces, the file remains recoverable if some CSPs fail. File pieces can be uploaded and downloaded in parallel, reducing latency. Since the client controls the file distribution, clients can choose where to distribute the file pieces so as to define customized privacy, reliability, and latency.

Users interact with CYRUS through Table IV.1’s set of API calls. These include standard file operations, such as uploading, downloading, and deleting a file, as well as file recovery and CSP addition and removal. Finally, CYRUS must perform three main functions: integrate multiple clouds, scale to multiple clients, and optimize performance.

Functionality                     CYRUS API call
create a CYRUS cloud s            s = create()
add a cloud storage c             add(s,c)
remove a cloud storage c          remove(s,c)
get a file f of version v         f' = get(s,f,v)
put a file f                      put(s,f)
delete a file f                   delete(s,f)
list files under a directory d    [(f,r),] = list(s,d)
reconstruct s'                    s' = recover(s)

Table IV.1: CYRUS’s Application Programming Interface.

In meeting these requirements, CYRUS addresses four challenges in light of CLIMA (Chapter III). First, CSPs have different file management policies and APIs, e.g., tracking

files with different types of IDs and varied support for file locking. To accommo-

date these differences, CYRUS uses only the most basic cloud APIs. This limitation

makes it particularly difficult to support multiple clients trying to simultaneously

update the same file. CYRUS’s approach is inspired by large-scale, performance-

focused database systems [31, 32, 33]: we allow clients to upload conflicting files

and then resolve conflicts as necessary. Other approaches require more client-CSP

communication, increasing latency [23].

Second, CYRUS’s client-based architecture and the lack of reliable client-to-

client communication make sharing files between clients difficult: clients need to

share information on each file’s storage customization. We avoid storing this data

at a centralized location, which creates a single point of failure, by scattering file

metadata among CSPs.

Third, some CSPs may share infrastructure, e.g., Dropbox running its software

on Amazon’s servers [34]. Storing files at CSPs with the same physical infrastructure

makes simultaneous CSP failures more likely. To maximize reliability, CYRUS

infers these sharing relationships and uses them in selecting the CSPs where files

are stored. CYRUS can monitor correlations in the network performance at different

CSPs and choose where to store files accordingly.

Finally, client connections to different CSPs have different speeds. Thus, CYRUS

significantly reduces latency by optimally choosing the CSPs from which to download

file pieces, subject to privacy and reliability constraints on the number of file pieces

needed to reconstruct the file.

4.2 CYRUS Design Considerations

We first highlight some important challenges in building CYRUS before presenting

CYRUS’s system architecture.

CSP             Format    Protocol   Authentication           RTT (ms)  Throughput (Mbps)
Amazon S3*      XML       SOAP/REST  AWS Signature            235       1.349
Box             JSON      REST       OAuth 2.0                149       2.128
Dropbox         JSON      REST       OAuth 2.0                137       2.314
OneDrive        JSON      REST       OAuth 2.0                142       2.233
Google Drive    JSON      REST       OAuth 2.0                71        4.465
SugarSync       XML       REST       OAuth-like               146       2.171
CloudMine       JSON      REST       ID/Password              215       1.474
Rackspace       XML/JSON  REST       API Key                  186       1.704
Copy            JSON      REST       OAuth                    192       1.651
ShareFile       JSON      REST       OAuth 2.0                215       1.474
4Shared         XML       SOAP       OAuth 1.0                186       1.704
DigitalBucket*  XML       REST       ID/Password              217       1.461
Bitcasa*        JSON      REST       OAuth 2.0                139       2.281
Egnyte          JSON      REST       OAuth 2.0                153       2.072
MediaFire       XML/JSON  REST       OAuth-like               192       1.651
HP Cloud        XML/JSON  REST       OpenStack Keystone V3    210       1.509
CloudApp*       JSON      REST       HTTP Digest              205       1.546
Safe Creative*  XML/JSON  REST       Two-step authentication  295       1.075
FilesAnywhere   XML       SOAP       Custom                   202       1.569
CenturyLink     XML/JSON  SOAP/REST  SAML 2.0                 293       1.082

Table IV.2: APIs and measured performance of commercial cloud storage providers. Throughput is calculated from the measured RTT assuming a 0.1% packet loss rate and 65535 byte TCP window size.

CYRUS’s operations and functionalities must reside either on the clients, CSPs,

or a combination of both. Yet in realizing such functionalities, we face two main

challenges: heterogeneity in CSP APIs and the lack of reliable direct client-to-client

communication.

Unifying heterogeneous CSPs: To illustrate CSP heterogeneity, Table IV.2

shows the APIs and achieved performance for a range of CSPs. Though most are

similar, the API implementations differ in the functions they provide and their file

object handling. For example, Dropbox uses files’ names as their identifiers, while

Google Drive uses a separate file ID. Thus, when a client uploads a file with an existing
filename, Dropbox overwrites the previous file, but Google Drive does not. CYRUS

accommodates such differences by only using basic cloud API calls: authenticate,

list, upload, download, and delete, which are available even on FTP servers. We

therefore shift much of CYRUS’s functionality to the clients, influencing our design

choices:

Ensuring privacy and reliability: Standard replication methods for ensuring reliability do not require significant client or CSP resources but are not secure. Many encryption methods, on the other hand, are vulnerable to loss of the encryption key. CYRUS overcomes these limitations by splitting and encoding files at the client and uploading the file pieces to different CSPs; Figure IV.2 illustrates this division of files into chunks and chunks into shares. For further security, the pieces’ filenames are hashes of information known only to the clients. Reconstructing the file requires access to pieces on multiple, but not all, CSPs, ensuring both privacy and reliability. Finally, CYRUS scatters file pieces to multiple CSPs so that no single CSP can reconstruct a user’s data.

Concurrent file access: Since most CSPs do not support file locking, CYRUS

cannot easily prevent simultaneous file uploads from different clients: the second

client will not know that a file is being updated until the first client finishes upload-

ing. Theoretically, since POST in HTTP is not idempotent, the server status should

change when the first update finishes and we could handle the conflict by overwrit-

ing the first with the second file update. However, CSPs do not always enforce this

standard. Thus, a locking or overwriting approach requires creating lock files and

checking them after a random backoff time, leading to long delays [23]. CYRUS

instead creates new files for each update and then detects and resolves any conflicts

from the client.

Client-based architecture: To reconstruct a file, a client needs access to its

metadata, i.e., information about where and how different pieces of the file are

stored. The easiest way to share this metadata is to maintain a central metadata

server [35], but this solution makes CYRUS dependent on a single server, introducing

a single point of failure and making user data vulnerable to attacks at a single

server. At the other extreme, a peer-to-peer based solution does not guarantee

data accessibility since clients are not always available. Our solution is to scatter

the metadata across all of the CSPs, as we do with the files. Clients access the

metadata by downloading its pieces from the CSPs; without retrieving information

from multiple CSPs, attackers cannot access user data.1 This approach ensures that

CYRUS is as consistent as the CSPs where it stores files.

1 Since we store metadata pieces at all CSPs, clients can always find and download metadata pieces.

Figure IV.2: Mapping files to shares. (Labels shown: chunks, each mapped to several shares.)

CSP selection and performance optimization: CYRUS chooses the number of file pieces to upload or download so as to satisfy privacy and reliability requirements. When choosing which CSPs to use, CYRUS chooses CSPs on independent

cloud platforms so as to maximize reliability: for instance, CSPs marked with an

asterisk in Table IV.2 have Amazon destination IPs. The table also shows that CSPs

have very different client connection speeds; thus, CYRUS chooses the CSPs from

which to download files so as to minimize latency. CYRUS reduces users’ cost by

limiting the amount of data that must be stored on CSPs. Before scattering files,

we divide each file into smaller discrete chunks. Unique chunks are then divided into

shares using secret sharing, which are scattered to the CSPs. Since different files

can use the same chunks, deduplication reduces the total amount of data stored at

CSPs, conserving storage capacity. We also limit the amount of data that can be

stored at different CSPs, e.g., to the maximum amount of free storage capacity.

4.3 Optimized Cloud Selection

4.3.1 Balancing Privacy and Reliability

Users specify CYRUS’s privacy and reliability levels by setting the secret sharing pa-

rameters n and t for each chunk. The user first specifies the privacy level by choosing

t, or the number of shares required to reconstruct a chunk: since we upload at most

one share to each CSP, t specifies the number of CSPs needed for reconstruction.

Taking t = 2 is sufficient to ensure that no one CSP can access any part of users’

data, but more privacy-sensitive users may specify a larger t.

The user then specifies reliability in terms of an upper bound ε on the overall

failure probability (i.e., the probability that we cannot download t shares of a chunk

due to CSP failure).2 The failure probability of any given CSP, which we denote

by p, is estimated using the number of consistent failed attempts to contact CSPs.

Users specify a threshold, e.g., one day, of time; if a CSP cannot be contacted for

that length of time, then we count a CSP failure. The probability that we cannot download $t$ shares equals the probability that more than $n - t$ CSPs fail, which we calculate to be $\sum_{s=0}^{t-1} C(n, s)\,(1-p)^{s}\,p^{n-s}$. We then bound this probability below $\varepsilon$ by searching for the minimum $n$ such that

$$\sum_{s=0}^{t-1} C(n, s)\,(1-p)^{s}\,p^{n-s} \le \varepsilon. \tag{IV.1}$$

We find n by increasing its value from t to its maximum value (the total number of CSPs or clusters); taking the minimum such n limits the data stored on the cloud. The chunk share size is independent of n, so uploading n shares requires an amount of data proportional to n.

2 We do not consider link failures; CYRUS is designed to reduce the impact of failures at CSPs.
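The search for the minimum n in (IV.1) can be sketched in a few lines. This is a minimal illustration rather than the prototype's code: the function names `failure_probability` and `min_shares` and the argument `n_max` (the total number of CSPs or clusters) are assumptions introduced here.

```python
from math import comb


def failure_probability(n, t, p):
    """Left-hand side of (IV.1): probability that fewer than t of n CSPs are up.

    p is the failure probability of a single CSP (assumed independent).
    """
    return sum(comb(n, s) * (1 - p) ** s * p ** (n - s) for s in range(t))


def min_shares(t, p, eps, n_max):
    """Smallest n in [t, n_max] that satisfies (IV.1), or None if none does."""
    for n in range(t, n_max + 1):
        if failure_probability(n, t, p) <= eps:
            return n
    return None
```

For example, with t = 2, p = 0.01, and ε = 10^-4, the bound fails for n = 2 and n = 3 and first holds at n = 4.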

4.3.2 Downlink Cloud Selection

CYRUS chooses the n CSPs to which shares should be uploaded using consistent

hashing on their SHA-1 hash values. To download a file, a client must choose t of

the n CSPs from which to download shares. Since we choose t < n for reliability,

the number of possible selections can be very large: suppose that $R$ chunks need to be downloaded at a given time, e.g., if a user downloads a file with $R$ unique chunks. There are then $C(n, t)^{R}$ possible sets of CSPs, which grows rapidly with $R$. We thus choose the CSPs so as to minimize download completion times.

We suppose that file shares are stored at $C$ CSPs. We index the chunks by $r = 1, 2, \ldots, R$ and the CSPs by $c = 1, 2, \ldots, C$, defining $d_{r,c}$ as an indicator variable for the share download locations: $d_{r,c} = 1$ if a share of chunk $r$ is downloaded from CSP $c$ and 0 otherwise. Denoting chunk $r$’s share size as $b_r$, the total data downloaded from CSP $c$ is $\sum_r b_r d_{r,c}$. We use $\beta_c$ to denote the download bandwidth allocated to CSP $c$ and find the total download time

$$\max_c \left( \frac{\sum_r b_r d_{r,c}}{\beta_c} \right). \tag{IV.2}$$

While minimizing (IV.2), the selected CSPs must satisfy constraints on available

bandwidth and feasibility (e.g., selecting exactly t CSPs).

Bandwidth: The bandwidth allocated to the CSP connectors is restricted in two ways. First, each CSP has a maximum achievable download bandwidth, which may vary over time, e.g., as the CSP demand varies. We express these maxima as upper bounds: $\beta_c \le \bar{\beta}_c$ for all CSPs $c$. Second, the client itself has a maximum download bandwidth, which must be shared by all parallel connections. We thus constrain

$$\sum_c \beta_c \le \beta, \tag{IV.3}$$

where $\beta$ denotes the client’s total downstream bandwidth.3

Feasibility: CYRUS must download $t$ shares of each chunk, and can only download shares from CSPs where they are stored. We thus introduce the indicator variables $u_{r,c}$, which take the value 1 if a share of chunk $r$ is stored at CSP $c$ and 0 otherwise. Our feasibility constraints are then

$$\sum_c d_{r,c} = t, \qquad d_{r,c} \le u_{r,c}. \tag{IV.4}$$

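CYRUS’s download optimization problem is then

$$\min_{y,\,d,\,\beta} \; y \tag{IV.5}$$

$$\text{s.t.} \quad \frac{\sum_r b_r d_{r,c}}{\beta_c} \le y; \quad c = 1, 2, \ldots, C \tag{IV.6}$$

$$\sum_c \beta_c \le \beta, \quad \beta_c \le \bar{\beta}_c, \quad \sum_c d_{r,c} = t, \quad d_{r,c} \le u_{r,c} \tag{IV.7}$$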
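To make the objective concrete, the following sketch evaluates the download time (IV.2) for a given CSP selection and, for tiny instances only, enumerates every feasible selection. It is an illustration under simplifying assumptions: the bandwidths β_c are treated as fixed inputs (so the client-wide limit on their sum is not modeled), and the names `download_time` and `best_selection` are hypothetical.

```python
from itertools import combinations, product


def download_time(selection, sizes, bandwidth):
    """Evaluate (IV.2): the largest per-CSP load divided by that CSP's bandwidth.

    selection[r] is the collection of CSPs chosen for chunk r, sizes[r] is the
    share size b_r, and bandwidth[c] is the (fixed) bandwidth beta_c.
    """
    load = {c: 0.0 for c in bandwidth}
    for r, csps in enumerate(selection):
        for c in csps:
            load[c] += sizes[r]
    return max(load[c] / bandwidth[c] for c in bandwidth)


def best_selection(stored, sizes, bandwidth, t):
    """Brute force over all choices of t storing CSPs per chunk (feasibility).

    stored[r] is the set of CSPs holding a share of chunk r. The search space
    grows as C(n, t)^R, so this is only usable for very small examples.
    """
    options = [list(combinations(sorted(s), t)) for s in stored]
    return min(product(*options),
               key=lambda sel: download_time(sel, sizes, bandwidth))
```

With two equal-sized chunks stored on four CSPs of bandwidths 4, 4, 3, and 3, always picking the two fastest CSPs gives a completion time of 2, whereas the enumeration finds a selection with time 4/3 by also using the slower CSPs.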

Exactly solving (IV.5–IV.7) is difficult: first, the constraints on y are non-convex, and second, there are integrality constraints on $d_{r,c}$. We thus propose a heuristic algorithm that yields a near-optimal solution with low running time. Moreover, our algorithm runs online: we iteratively select the CSPs from which each chunk’s shares should be downloaded, allowing us to begin downloading chunk shares before finding the full solution. This allows us to outperform a heuristic that downloads shares from the CSPs with the highest available bandwidth. In that case, all chunk shares will be downloaded sequentially from the same t CSPs, so some shares will wait a long time before being downloaded. Our algorithm instead also downloads shares in parallel from slower CSPs.

3 Each client maintains local bandwidth statistics to all CSPs for different network interfaces.

Our solution proceeds in two stages. First, we compute a convex approximation to (IV.5–IV.7) that does not consider the integer constraints on $d_{r,c}$. To do so, we first convexify (IV.6) by defining $D_{r,c} \equiv d_{r,c}^{1/2}$. The resulting constraints $\sum_r b_r D_{r,c}^{2} / \beta_c \le y$ are then convex, with the non-convexity moved to the new constraints $D_{r,c} = d_{r,c}^{1/2}$. These are approximated with linear over-estimators $D_{r,c} \ge d_{r,c}^{1/2}$ that minimize the discrepancy between $D_{r,c}^{2}$ and $d_{r,c}$.4 We find that the closest linear estimator is $D_{r,c} = 3^{1/4} d_{r,c}/2 + 3^{-1/4}/2$. On adding these constraints to (IV.5–IV.7) and replacing (IV.6) with $\sum_r b_r D_{r,c}^{2} / \beta_c \le y$, we can solve the convexified version of (IV.5–IV.7) for the optimal $y$, $\beta_c$, $d_{r,c}$, and $D_{r,c}$.
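The coefficients of the linear estimator can be recovered with a short calculation. The derivation below is one plausible route, not necessarily the one used in the thesis: it assumes the estimator is a tangent line to the square root at some point $d_0 \in (0, 1]$ (which makes it an over-estimator by concavity) and that "discrepancy" means the integral of $D^2 - d$ over $[0, 1]$.

```latex
% Tangent to \sqrt{d} at d_0 (an over-estimator, since \sqrt{\cdot} is concave):
D(d) = \frac{d}{2\sqrt{d_0}} + \frac{\sqrt{d_0}}{2} \;\ge\; \sqrt{d}.

% Discrepancy between D^2 and d:
D(d)^2 - d = \left( \frac{d}{2\sqrt{d_0}} - \frac{\sqrt{d_0}}{2} \right)^{2} \ge 0,
\qquad
\int_0^1 \left( D(d)^2 - d \right) \mathrm{d}d
  = \frac{1}{12\, d_0} - \frac{1}{4} + \frac{d_0}{4}.

% Minimizing over d_0:
-\frac{1}{12\, d_0^{2}} + \frac{1}{4} = 0
\;\Longrightarrow\; d_0 = 3^{-1/2},
\qquad
D(d) = \frac{3^{1/4}}{2}\, d + \frac{3^{-1/4}}{2}.
```

Under these assumptions the minimizer reproduces the stated estimator $D_{r,c} = 3^{1/4} d_{r,c}/2 + 3^{-1/4}/2$.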

We then fix the optimal bandwidths $\beta_c$, turning (IV.5–IV.7) into a linear integer optimization problem that can be solved with the standard branch-and-bound algorithm. Branch-and-bound scales exponentially with the number of integer variables, so we impose integer constraints on one chunk’s variables at a time; thus, only $C$ variables are integral ($d_{r,c}$ for one chunk $r$). We constrain chunk 1’s $d_{1,c}$ variables to be integral and re-solve (IV.5–IV.7), then re-solve the convex approximation, fix the resulting bandwidths, constrain $d_{2,c}$ to be integral, etc. Algorithm 1 formalizes this procedure.

Algorithm 1: CSP and bandwidth selection.
for $\eta = 1$ to $R$ do
    Solve convexified, relaxed (IV.5–IV.7) with fixed CSP selections $d_{r,c}$ for $r < \eta$;
    Fix bandwidths $\beta_c$;
    Constrain $d_{\eta,c} \in \{0, 1\}$ for all $c$;
    Solve for the $d_{r,c}$ with fixed bandwidths;
    Fix CSP selections $d_{\eta,c}$;
end

4 By requiring that these approximations be over-estimators, we ensure that if the constraints $\max_c \sum_r b_r D_{r,c}^{2} / \beta_c \le y$ hold, then (IV.6) holds as well.

Chapter V CYRUS Implementation

We have prototyped CYRUS, as a validation of CLIMA, on three representative

platforms (Mac OS X, Windows, and Linux) using Python and C, with the soft-

ware architecture shown in Figure V.1. The prototype has seven main components

matching CLIMA’s application architecture:

1) a graphical user interface and application functions, 2) C modules implement-

ing content-based chunking and data encoding, 3) threads uploading and down-

loading contents to and from CSPs, 4) a metadata manager that constructs and

maintains metadata trees, 5) an event handler for upload and download requests, 6)

a cloud selector selecting CSPs for uploading and downloading shares, and 7) cloud

connectors for popular commercial CSPs. Our CYRUS prototype consists of 3,500

lines of Python code.

Our implementation integrates the CYRUS APIs (Table IV.1) and provides a graphical interface for people to use CYRUS. To increase efficiency, we use C modules to implement file chunking (Rabin’s fingerprinting) and dividing chunks into shares (Reed-Solomon coding). Figure V.2 shows screenshots of the user interface of the CYRUS prototype.1 Figure V.2(a) shows the list of uploaded files and folders. The icons at the top allow users to upload, download, and delete files. Figure V.2(b) shows the shares that are being uploaded to or downloaded from different CSPs.

Figure V.1: Client-based CYRUS implementation. (Components shown: User Interface, Event Handler, Metadata Manager, Meta files, Chunk List, CSP Table, File Constructor, Scatter, Chunk Constructor, Share Encoder, Uploader, Gather, Downloader, Share Decoder, Chunk Aggregator, CSP Selector, Cloud Connector, Authentication, CSP APIs, CSP 1 to CSP n.)

Figure V.2: Screenshots of the CYRUS user interface. (a) Files stored on CYRUS. (b) Shares uploaded to CSPs.

1 A demo video of the prototype is available at http://youtu.be/DPK3NbEvdM8

5.1 Unifying Heterogeneous CSPs

To ensure transparency to different CSPs, we create a standard interface to map

CYRUS’s file operations to vendor-specific cloud APIs. This task involves creating

a specific REST URL with proper parameters and content. We utilize existing

CSP authentication mechanisms for access to each cloud, though such procedures

are not mandatory. We have implemented four connectors, corresponding to the resource controllers of Chapter III, for Google Drive, Dropbox, OneDrive (previously called SkyDrive), and Box, with connectors to more CSPs planned in the future. These providers, as shown in Table IV.2, use OAuth 2.0 authentication mechanisms. Open source cloud connectors like JClouds [29] can be used in place of CYRUS’s own connectors. We identify a minimal set of APIs for the CYRUS implementation: we use GET, PUT, LIST, and DELETE operations to commodify CSPs as simple object stores. By letting the PUT operation always create a new file or overwrite an existing file, we design a protocol for concurrent file access. Simplifying the CSPs’ operations provides high extensibility and applicability to network-accessible storage resources.
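A standard connector interface of this kind might look as follows. This is a minimal sketch, not the prototype's actual classes: the names `CloudConnector`, `InMemoryConnector`, and `list_names` are hypothetical, and the in-memory backend only stands in for a vendor-specific REST client so that the example runs without network access.

```python
from abc import ABC, abstractmethod


class CloudConnector(ABC):
    """Hypothetical uniform interface: authenticate plus LIST, PUT, GET, DELETE."""

    @abstractmethod
    def authenticate(self, credentials): ...

    @abstractmethod
    def list_names(self, folder): ...

    @abstractmethod
    def put(self, folder, name, data): ...

    @abstractmethod
    def get(self, folder, name): ...

    @abstractmethod
    def delete(self, folder, name): ...


class InMemoryConnector(CloudConnector):
    """Stand-in backend; a real connector would issue REST calls to one CSP."""

    def __init__(self):
        self._objects = {}
        self.authenticated = False

    def authenticate(self, credentials):
        # A real connector would run the CSP's own flow here (e.g. OAuth 2.0).
        self.authenticated = True

    def list_names(self, folder):
        return sorted(name for (f, name) in self._objects if f == folder)

    def put(self, folder, name, data):
        # PUT always creates a new object or overwrites an existing one.
        self._objects[(folder, name)] = bytes(data)

    def get(self, folder, name):
        return self._objects[(folder, name)]

    def delete(self, folder, name):
        self._objects.pop((folder, name), None)
```

Keeping the interface this small means a connector for a new storage resource only has to map five calls onto whatever the resource offers.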

5.2 Decoupling Metadata

CYRUS decouples metadata lookup and file placement. Note that clouds for metadata indexing and for file placement are computed independently, so the two cloud groups could be different, as shown in Figure V.3. A metadata file represents one update to a file. CYRUS decides the location of the metadata, which is also distributed to multiple CSPs, based on the user’s configuration. Instead of looking up entire CSPs, we directly detect whether new metadata has been uploaded using the LIST operation.

Figure V.3: Decoupling metadata and file control. (Labels shown: CYRUS; Chunk & Share Creation; Metadata Sync; File Sync; Metadata Storage with efficient metadata lookup; Cloud Storage with private, reliable share transfers; Cloud Communication; Cloud 1, Cloud 2, ..., Cloud n.)

CYRUS maintains metadata for each file, which stores its modification history and share composition. The metadata are stored as a logical tree with a dummy root node (Figure V.4); subsequent levels denote sequential file versions. New files are represented by new nodes at the first level. Each node consists of three tables:

FileMap: This table stores the file Id, i.e., the SHA-1 hash of its content, and prevId, its parent node’s Id (prevId = 0 for new files). We also store the clientId, indicating the client creating this version of the file, as well as the file name, whether it has been deleted, the last modified time, and the size.

ChunkMap: This table provides the information to reconstruct the file from

its chunks. It includes the Id, or SHA-1 hash, of each chunk in the file; offset, or

positions of the chunks in the file; the chunk sizes; and the t and n values used to

divide the chunks into shares.

ShareMap: This table stores the shares’ CSP locations. The chunkId is the chunk content’s SHA-1 hash value, idx is the share index, and cId gives the CSP where it is stored.

Figure V.4: Metadata data structures. (Example entries shown: FileMap with columns Id, prevId, name, clientId, isDelete, time, length and row 3d841, 4918e, a.txt, 2, false, 1392832122, 49381; ChunkMap with columns Id, offset, length, t, n and rows 7e3a, 0, 4829, 2, 3 and ab31, 896, 4811, 2, 3; ShareMap with columns chunkId, cId, idx and rows 7e3a, 2, 1; 7e3a, 5, 2; ab31, 1, 1; ab31, 2, 2. Further nodes ab3a2 and 37e14 also have prevId 4918e and name a.txt.)
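The three tables of a metadata node can be pictured as plain records. The sketch below is only an illustration: the class and field names are hypothetical snake_case renderings of the columns in Figure V.4, and grouping the figure's example rows into a single node is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileMap:
    id: str          # SHA-1 hash of the file content
    prev_id: str     # parent node's Id ("0" for new files)
    name: str
    client_id: int
    is_delete: bool
    time: int        # last modified time
    length: int      # file size


@dataclass
class ChunkEntry:
    id: str          # SHA-1 hash of the chunk
    offset: int
    length: int
    t: int
    n: int


@dataclass
class ShareEntry:
    chunk_id: str
    c_id: int        # CSP where the share is stored
    idx: int         # share index


@dataclass
class MetadataNode:
    file_map: FileMap
    chunk_map: List[ChunkEntry] = field(default_factory=list)
    share_map: List[ShareEntry] = field(default_factory=list)

    def share_locations(self, chunk_id):
        """CSP ids holding shares of the given chunk."""
        return sorted(s.c_id for s in self.share_map if s.chunk_id == chunk_id)


# Example values copied from Figure V.4 (abbreviated hashes as printed there).
node = MetadataNode(
    FileMap("3d841", "4918e", "a.txt", 2, False, 1392832122, 49381),
    [ChunkEntry("7e3a", 0, 4829, 2, 3)],
    [ShareEntry("7e3a", 2, 1), ShareEntry("7e3a", 5, 2)],
)
```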

5.3 Storing Files as Shares

CYRUS divides files into chunks and then divides the chunks into shares, which are

stored at CSPs. A chunk is a piece of a file, and a share is an encoded chunk with the (t, n) threshold property, i.e., any t of the n shares suffice to decode the original chunk.

Constructing file chunks: Chunking is widely used for data deduplication in storage systems, which then store only unique chunks. By dividing files into chunks at clients, CYRUS does not need to transfer duplicate chunks that are already stored at CSPs. CYRUS divides a user’s file into chunks based on content-dependent chunking and determines chunk boundaries using Rabin’s fingerprinting [36], which computes the hash value of the content in a sliding window $w_i$, where $i$ is the offset in the file, ranging over $0 \le i < \text{file size} - \text{window size}$. When the hash value modulo a pre-defined integer $M$ equals a pre-defined value $K$ with $0 \le K < M$, we set the chunk boundary and move on to find the next chunk boundary. Unlike fixed-size chunking, which divides a file into fixed-length chunks, content-based chunking changes boundaries only where data is updated instead of shifting all subsequent boundaries.

Figure V.5: A non-systematic Reed-Solomon erasure code.
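The boundary rule above can be sketched as follows. Note the substitutions: a simple polynomial rolling hash stands in for Rabin's fingerprinting, the constants `WINDOW`, `M`, `K`, `BASE`, and `MOD` are illustrative values chosen here, and no minimum or maximum chunk size is enforced. The prototype performs this step in a C module.

```python
import hashlib

WINDOW = 16                      # sliding-window length in bytes (illustrative)
M, K = 64, 0                     # cut where hash % M == K
BASE, MOD = 257, (1 << 31) - 1   # parameters of the stand-in rolling hash


def split_chunks(data):
    """Split bytes into content-defined chunks using a rolling window hash."""
    chunks, start = [], 0
    if len(data) >= WINDOW:
        top = pow(BASE, WINDOW - 1, MOD)
        h = 0
        for byte in data[:WINDOW]:
            h = (h * BASE + byte) % MOD
        for end in range(WINDOW, len(data) + 1):
            # Here h is the hash of data[end - WINDOW:end].
            if h % M == K:
                chunks.append(data[start:end])
                start = end
            if end < len(data):
                # Slide the window: drop the oldest byte, append the next one.
                h = ((h - data[end - WINDOW] * top) * BASE + data[end]) % MOD
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_ids(data):
    """SHA-1 identifiers of the chunks, usable as deduplication keys."""
    return [hashlib.sha1(c).hexdigest() for c in split_chunks(data)]
```

Because each boundary depends only on the bytes inside the window, inserting a few bytes at the start of a file changes the first chunk but leaves every later chunk, and hence its identifier, unchanged.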

Dividing chunks into shares: After CYRUS constructs chunks from a file, it encodes each chunk into $n$ shares using a Reed-Solomon (R-S) code [37], which has the same threshold property as $(t, n)$ secret sharing [38]. Figure V.5 shows how the R-S code encodes data. Unlike Shamir’s $(t, n)$ secret sharing, the R-S code is designed for efficient data recovery, so it uses a constant dispersal matrix whose rows are linearly independent. For simplicity, a non-systematic R-S code uses a Vandermonde matrix over a finite field with elements $V_{i,j} = g_i^{\,j-1}$, where $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, t$, and $g_i = i$. Using the user’s secret key, we randomize $g_i$ by picking $n$ random numbers from the 4-byte integer field (excluding zero). Figure V.6 shows that we shuffle the elements of the dispersal matrix with minimal code modification to the optimized R-S coding implementation in the zfec library [39].

Figure V.6: Modified zfec library to use an arbitrary vector for generating the dispersal matrix.
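The (t, n) threshold property of a Vandermonde dispersal matrix can be demonstrated with a toy encoder. This sketch is not zfec and not the prototype's code: it works over the small prime field GF(257) for readability, represents shares as lists of integers, and takes the evaluation points $g_i$ as an explicit argument, which must be distinct and nonzero modulo 257. All function names are hypothetical.

```python
P = 257  # small prime field chosen for illustration only


def vandermonde(points, t):
    """Rows [g^0, g^1, ..., g^(t-1)] mod P, one row per evaluation point g."""
    return [[pow(g, j, P) for j in range(t)] for g in points]


def encode(data, points, t):
    """Encode bytes into len(points) shares; each share is a list of ints."""
    padded = list(data) + [0] * (-len(data) % t)
    rows = vandermonde(points, t)
    shares = [[] for _ in points]
    for k in range(0, len(padded), t):
        block = padded[k:k + t]
        for i, row in enumerate(rows):
            shares[i].append(sum(v * x for v, x in zip(row, block)) % P)
    return shares


def solve_mod(matrix, rhs):
    """Solve matrix * x = rhs over GF(P) by Gauss-Jordan elimination."""
    t = len(matrix)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(t):
        pivot = next(r for r in range(col, t) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)  # modular inverse (Python 3.8+)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(t):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(a - f * b) % P for a, b in zip(aug[r], aug[col])]
    return [aug[r][t] for r in range(t)]


def decode(subset, t, length):
    """Recover the data from exactly t shares, given as {point: share}."""
    points = list(subset)
    rows = vandermonde(points, t)
    out = []
    for k in range(len(subset[points[0]])):
        out.extend(solve_mod(rows, [subset[g][k] for g in points]))
    return bytes(out[:length])
```

Any t rows of the matrix form an invertible square Vandermonde system, which is why any t shares suffice; replacing the default points with secret ones changes only the `points` argument.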

Naming: We name each share by the SHA-1 hash of its content, so shares with the same content receive the same file name. The collision probability of SHA-1 is lower than the probability of hardware error, so it is widely used as an identifier of data blocks [40, 18]. Thus, we only overwrite an existing share file if its content is the same, which allows concurrent file reads and writes from multiple clients without file locking.

5.4 File Operations

Figure V.7 shows how CYRUS synchronizes files among autonomous clients. We assume that all clients have permission to access clouds where shares and metadata are stored.

Figure V.7: File synchronization among multiple clients using client-based coordination. (Sequence shown among Client1, Client2, and CSP1 to CSP4: Client1 adds or modifies a file with (T, N) = (2, 3), uploads data shares (PUT data shares), and then uploads metadata shares (PUT metadata shares); Client2 checks for new metadata (GET list of files in metadata folder), downloads metadata shares (GET metadata shares that are missing locally), and downloads data shares (GET data shares) to synchronize the updated file.)

5.4.1 Uploading files

Algorithm 2, illustrated in Figure V.8, shows CYRUS’s procedure for uploading files. The client first updates its metadata tree to ensure that it uses the correct parent node when constructing the file’s metadata (steps 1 and 2 in Figure V.8). We then divide the file into chunks (step 3) and avoid uploading redundant chunks by checking whether shares of each chunk are already stored in the cloud (step 4). New chunks are divided into shares, as in Section 5.3, and the shares are scattered to CSPs (step 5). To scatter the shares, we calculate the CSPs where the shares will be stored and then send upload requests to the cloud connectors. The returns of the requests are handled asynchronously, as explained below. We upload the file metadata to CSPs (line 10 in Algorithm 2) only after receiving returns from all upload requests, so that no other client will attempt to download the file before all shares have been uploaded.

Algorithm 2: Uploading files.
 1  Function Upload(file)
 2      head = getHead(file)
 3      newHead = SHA1(file)
 4      UpdateHead(file, head, newHead)
 5      chunks = Chunking(file)
 6      for chunk in chunks do
 7          UpdateChunkMap(newHead, chunk)
 8          Scatter(chunk)
 9      end
        // Wait until uploading all chunks
10      UploadMeta(getMeta(file))
11  Function Scatter(chunk)
12      clouds = ConsistentHash(chunk.id)
13      if chunk is not stored then
14          shares = RSEncode(chunk, t, n)
15          for share in shares do
16              conn = clouds.next()
17              conn.Requester(PUT, share)    // Async event requester
18          end
19      end
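The ConsistentHash step in Algorithm 2 could be realized with a hash ring, as in the sketch below. The thesis does not specify the ring construction, so the class `CspRing`, the use of virtual points per CSP (`replicas`), and the rule of walking clockwise until n distinct CSPs are found are all assumptions made for illustration.

```python
import bisect
import hashlib


def sha1_int(text):
    """SHA-1 of a string, interpreted as an integer position on the ring."""
    return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)


class CspRing:
    """Consistent-hash ring mapping a chunk id to n distinct CSPs."""

    def __init__(self, csp_names, replicas=64):
        # Each CSP contributes several virtual points to even out the load.
        self._ring = sorted(
            (sha1_int(f"{name}#{i}"), name)
            for name in csp_names
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]
        self._count = len(set(csp_names))

    def select(self, chunk_id, n):
        """Walk clockwise from the chunk's position until n CSPs are found."""
        if n > self._count:
            raise ValueError("not enough CSPs for n shares")
        start = bisect.bisect(self._keys, sha1_int(chunk_id))
        chosen = []
        for k in range(len(self._ring)):
            name = self._ring[(start + k) % len(self._ring)][1]
            if name not in chosen:
                chosen.append(name)
                if len(chosen) == n:
                    break
        return chosen
```

A useful property of this construction is that every client computes the same placement from the chunk id alone, and removing a CSP that holds none of a chunk's shares leaves that chunk's placement unchanged.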

5.4.2 Downloading files

CYRUS follows Algorithm 3 to download a file. First, the client requests a list of metadata files from the cloud. If the client identifies new metadata files, the download procedure is triggered. After downloading and decoding metadata, CYRUS identifies the file’s chunks and gathers their shares, selecting the download CSPs. Clients resolve any resulting conflicts after downloading metadata files and building the version tree. We identify two types of file conflicts: First, two clients may create files with the same filename but different contents. Second, one client can modify the previous version of a file due to delays in syncing metadata. In both cases, we detect conflicts if a node has multiple children. If a client detects conflicts, CYRUS downloads all files corresponding to the children under different file names (with the version name attached as a suffix). When the user merges the conflicted files under the original file name, CYRUS updates the file as the next version.

Figure V.8: Uploading a file (lines refer to Algorithm 2). (Steps shown between client and cloud: 1) find current file version (line 1); 2) check if modified (lines 2-4); 3) chunk the file (line 5); 4) find new chunks (line 12); 5) create and upload shares of new chunks (lines 6-9, 13-19).)

5.5 Distributed conflict detection

We identify two types of file conflicts, as shown in Figure V.9. First, two clients

may create files with the same filename but different contents, resulting in different

metadata file names, e.g., metadata 7dba2abd and 8f456eda. Second, one client can

modify the previous version of a file due to delays in syncing metadata. We then

find two child nodes from one parent, e.g., ab3a363c and 3d84621d in the figure.

Each node represents an independent modification of the parent node’s file.

When new metadata is downloaded from the cloud, we check for conflicts by first checking whether it has a parent node. If not, we check for the first type of conflict by searching for other nodes with the same filename. The second type of conflict arises if the new node has a parent. We traverse the tree upwards from this node, and detect a conflict if we find a node with multiple children.

Algorithm 3: Downloading files.
Function Download(file)
    meta = downloadMeta(file)
    for chunk in meta.chunkMap do
        Gather(chunk, meta.shareMap[chunk.id])
    end
    // Wait until downloading all chunks
    if checkConflict(meta) then
        meta.conflict = True
    end
    updateSnapshot(file, snapshot)

Function Gather(chunk, shareMap)
    clouds = OptSelect(chunk, shareMap)
    for share in chunk.shares do
        conn = clouds.next()
        conn.Requester(GET, share)    // Async event requester
    end
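Both conflict types can be found from the (Id, prevId, name) triples in the FileMap tables alone. The sketch below scans the whole tree at once rather than walking upward from a newly downloaded node, which is a simplification; the function name `find_conflicts`, the tuple layout, and the parent-child relations used in the example are assumptions.

```python
from collections import defaultdict

ROOT = "0"  # prevId value used for new files


def find_conflicts(nodes):
    """Detect both conflict types in a metadata tree.

    nodes is an iterable of (node_id, prev_id, name) tuples. Returns a pair:
    first-level nodes that share a filename (type 1), and parents with more
    than one child (type 2).
    """
    roots_by_name = defaultdict(list)
    children = defaultdict(list)
    for node_id, prev_id, name in nodes:
        if prev_id == ROOT:
            roots_by_name[name].append(node_id)
        else:
            children[prev_id].append(node_id)
    same_name_roots = {name: sorted(ids)
                       for name, ids in roots_by_name.items() if len(ids) > 1}
    forked_parents = {parent: sorted(kids)
                      for parent, kids in children.items() if len(kids) > 1}
    return same_name_roots, forked_parents


# Node ids are taken from Figure V.9; the exact tree shape is assumed.
example = [
    ("7dba2abd", "0", "a.txt"),
    ("8f456eda", "0", "a.txt"),
    ("9b9af694", "7dba2abd", "a.txt"),
    ("ab3a363c", "9b9af694", "a.txt"),
    ("3d84621d", "9b9af694", "a.txt"),
]
```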

File deletion and versioning: Clients can recover previous versions of files

by traversing the metadata tree up from the current file version to the desired

previous version. CYRUS also allows clients to recover deleted files by locating their

metadata. When clients delete a file, CYRUS marks its metadata as “deleted,” but

does not actually delete the metadata file.2 Shares of the file’s component chunks

are left alone, since other files may contain these chunks.

5.6 Adapting to CSP Changes

Over time, a user’s set of viable CSPs may change: users may add or remove CSPs,

which can occasionally fail. Thus, CYRUS must be able to adapt to CSP addition,

removal, and failure. We consider these three cases individually.

2 Since file metadata is very small, metadata from deleted files does not take up much CSP capacity.

Figure V.9: Two types of file conflicts. (Node labels shown: 9b9af694, ab3a363c, 3d84621d, and the file roots 7dba2abd and 8f456eda; annotations mark Conflict 1 and Conflict 2.)

Adding CSPs: A user may add a CSP to CYRUS by updating the list of

available CSPs at the cloud. Once this list is updated, subsequently uploaded chunks

can be stored at the new CSP. Moreover, the new CSP does not affect the reliability

or privacy experienced by previously uploaded chunks. Since uploading shares to

the new cloud can use a significant amount of data, we do not change the locations

of already-stored shares.

Removing CSPs and failure recovery: CYRUS detects a CSP failure if it

fails to upload shares to that CSP; once this occurs, CYRUS periodically checks if

the failed CSP is back up. Until that time, no shares are uploaded to that CSP.

CSP removal can be detected by repeated upload failures or a manual signal from

users. A failed or removed CSP is marked as such in the list of available CSPs.

Unlike adding a CSP, removing a CSP reduces the reliability of previously uploaded chunks. Thus, to maintain reliability we must reconstruct the removed shares and upload them to other CSPs. However, uploading all of these chunk shares requires uploading a large amount of data. Indeed, from a user’s perspective, a cloud may fail temporarily (e.g., for a few hours) but then come back up; it is therefore impractical to move all shares at once. Instead, we note that reliability matters most for frequently accessed shares and use a “lazy addition” scheme. Whenever a client downloads a file, we check the locations of its chunks’ shares. Should one of these locations have been removed or deleted, we create a new share and upload it to a new CSP. Figure V.10 shows the procedure of adding these shares: after a CSP is removed, a client downloads a file. If a share of one of the file’s chunks was stored at the removed CSP, we reconstruct the share from the chunk and upload it to a new CSP.

Figure V.10: Share migration when removing a CSP. (Shown: Clouds 1 to 4 holding shares A1 and A2 of Chunk A, B1 and B2 of Chunk B, and C2 of Chunk C; steps: 1) remove Cloud 2; 2) download share C1 to construct a file of chunks B and C; 3) construct and upload share B2.)
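The check performed at download time can be sketched as a small planning function. This is an illustration only: the name `repair_plan` and the rule for picking the replacement CSP (the first available CSP, in sorted order, that does not already hold a share of the chunk) are assumptions, since CYRUS would choose the new location with its own selection logic.

```python
def repair_plan(chunk_locations, available):
    """Plan lazy share migration for the chunks of a file being downloaded.

    chunk_locations maps a chunk id to the CSPs recorded as holding its
    shares; available is the set of CSPs currently usable. Returns a dict
    mapping chunk id to a list of (lost_csp, replacement_csp) pairs.
    """
    plan = {}
    for chunk_id, csps in chunk_locations.items():
        alive = [c for c in csps if c in available]
        lost = [c for c in csps if c not in available]
        spare = [c for c in sorted(available) if c not in alive]
        moves = list(zip(lost, spare))
        if moves:
            plan[chunk_id] = moves
    return plan
```

Only chunks of files that are actually downloaded appear in the plan, so rarely accessed shares are not moved until they are needed.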

Chapter VI Performance Evaluation

CYRUS’s architecture and system design ensure that it satisfies client requirements

for privacy, reliability, and latency. This chapter considers CYRUS’s performance in

a testbed environment before presenting real-world results from a comparison with

similar systems and trial deployment.

6.1 Testbed Experiments

Testbed setup: We construct a testbed with a MacBook Pro client and seven

private cloud servers as our CSPs, connected with 1Gbps ethernet links. We emulate

CSP cloud performance by using tc and netem to change network conditions for the

different servers. We set maximum throughputs of 15MB/s for four cloud servers

(the “fast” clouds) and 2MB/s for the remaining three clouds (the “slow” clouds).

We evaluate CYRUS’s performance on a dataset of several different file types.

Table VI.1 summarizes the number and size of the dataset files by their extensions.

Figure VI.1: Performance comparison between zfec (solid line) and modified zfec (dashed line); x-axis: chunk size [MB], from 10 to 100. (t, n) configurations: red (2, 3), green (2, 4), blue (3, 4), purple (2, 5), sky blue (3, 5), and black (4, 5).

The total dataset size is 638.43 MB, with an average file size of 3.71 MB. We use

content-based chunking to divide the files into chunks with an average chunk size of

4MB, following Dropbox [41].

Modified zfec library: Figure VI.1 shows the processing time to encode and decode files as the file size varies from 10 MB to 100 MB for different (t, n) configurations. Encoding and decoding take longer when t and n are larger. However, our modification of the zfec library to randomly generate the dispersal matrix does not affect the processing time. The differences between the solid and dashed lines of the same color are less than 0.02 seconds on average.

Storage overhead and availability: Figure VI.2 presents the storage overhead and availability for different (t, n) configurations. We use availability as the reliability metric of the storage service. We assume that each CSP has an availability of 99.9% and that the CSPs fail independently of each other. The R-S code requires n/t times the storage space of the original data. n = 1 means that CYRUS uploads entire files to a single cloud storage service. If t is close to n, the storage overhead is close to 100%, which means that almost no additional space is required. However, availability also decreases as t approaches n. In particular, if t = n (with n > 1), availability is lower than 99.9%, because all n CSPs must be available at the same time, which is a stricter condition. If t = 1, CYRUS requires only one share to decode a chunk, which means that the share is stored without encoding. Otherwise, if 1 < t < n, the availability of the storage improves. The availabilities for (t, n) = (2, 3) and (t, n) = (2, 4) are 99.9997% (five nines) and 99.9999996% (eight nines), respectively. Thus, users can achieve a reliable, secure, and cost-effective storage service by coordinating only a few CSPs.

Figure VI.2: Storage overhead and availability while changing (t, n) configurations, for N = 1 to 7 and T = 1 to 7. (a) Storage overhead. (b) Availability.
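The availability figures above can be checked with a short calculation. Under the stated assumptions (independent CSPs, each available with probability p = 0.999), a chunk is retrievable when at least t of its n CSPs are up, which is a binomial tail sum. The function names below are chosen for this sketch only.

```python
from math import comb


def storage_overhead(t, n):
    """Stored bytes relative to the original data for a (t, n) code."""
    return n / t


def availability(t, n, p=0.999):
    """Probability that at least t of n independent CSPs are available."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t, n + 1))


for t, n in [(2, 3), (2, 4), (3, 3)]:
    print(f"(t, n) = ({t}, {n}): "
          f"overhead {storage_overhead(t, n):.0%}, "
          f"availability {availability(t, n):.8%}")
```

The same function shows why t = n is worse than a single CSP: all n providers must be up simultaneously, so the availability drops to p to the power n.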

Performance results: We consider the effect of changing reliability and pri-

vacy on file download completion times. We test three configurations: (t, n) = (2, 3),

(2, 4), and (3, 4). Compared to the (2, 3) configuration, (t, n) = (2, 4) is more reliable, while (t, n) = (3, 4) gives more privacy. We compare the performance of three

download CSP selection algorithms: optimal, random, and heuristic. The random

algorithm chooses CSPs randomly with uniform probability, and the heuristic algo-

rithm is a round-robin scheme.

The download completion times are shown in Figure VI.3(a). For all configura-

tions, CYRUS’s algorithm has the shortest download times. The random algorithm

has the longest, likely due to the high probability of downloading from a slow cloud.

Figure VI.3(b) shows the distribution of throughputs achieved by all files; we see

that the throughput with CYRUS’s algorithm is decidedly to the right of (i.e., is

larger than) the random and heuristic algorithms’ throughputs.

The completion time of CYRUS’s algorithm when (t, n) = (3, 4) is especially

short compared to the other (t, n) values (Figure VI.3(a)). Since the share size of a

given chunk equals the chunk size divided by t, higher values of t will yield smaller

share sizes, lowering completion times for (parallel) share downloads. However,

the random and heuristic algorithms’ completion times do not vary much with the

configurations. With (t, n) = (3, 4), CYRUS must download shares from three clouds

instead of two with the other configurations, increasing the probability that these

algorithms will use slow clouds and canceling the effect of smaller shares.
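The trade-off described above can be illustrated with a deliberately simplified model, in which the t shares of a chunk are downloaded in parallel and the slowest selected CSP determines the completion time. The model ignores the client's own downlink limit and per-request latency, and `fastest_first` is only a stand-in for a good selection, not CYRUS's optimal algorithm. The throughputs follow the testbed's fast (15 MB/s) and slow (2 MB/s) clouds.

```python
def share_size(chunk_mb, t):
    """Each share of a chunk has size chunk / t."""
    return chunk_mb / t


def completion_time(chunk_mb, t, throughputs_mb_s):
    """Time for t parallel share downloads; the slowest source dominates."""
    if len(throughputs_mb_s) != t:
        raise ValueError("need exactly t source CSPs")
    size = share_size(chunk_mb, t)
    return max(size / rate for rate in throughputs_mb_s)


def fastest_first(rates, t):
    """Pick the t highest-throughput CSPs (illustrative selection rule)."""
    return sorted(rates, reverse=True)[:t]


chunk = 4.0  # MB, the average chunk size used in the testbed
print(completion_time(chunk, 2, fastest_first([15, 15, 2], 2)))  # two fast clouds
print(completion_time(chunk, 3, [15, 15, 15]))                   # smaller shares
print(completion_time(chunk, 3, [15, 15, 2]))                    # one slow cloud
```

With only fast sources, a larger t shortens the download because shares shrink; once a slow cloud is among the three sources, it dominates and cancels the benefit of the smaller shares.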

Figure VI.4 shows the cumulative upload and download times for all files with

CYRUS’s algorithm. As expected, the more private (3, 4) configuration has consis-

tently shorter completion times, especially for uploads. The more reliable (2, 4) and

(2, 3) configurations yield more similar completion times; their shares are the same

size. The (2, 4) configuration has slightly longer upload times since the shares must

be uploaded to all 4 clouds, including the slowest ones.

Extension   # of files   Total bytes   Avg. size (bytes)
pdf                 70      60575608              865366
pptx                11      12263894             1114899
docx                15       9844628              656309
jpg                 55     151918946             2762163
mov                  7     351603110            50229016
apk                 10       4872703              487270
ipa                  4      47354590            11838648
Total              172     638433479             3711823

Table VI.1: Testbed evaluation dataset.

Figure VI.3: Testbed download performance of random, heuristic, and CYRUS’s cloud selection. (a) Mean completion time [sec] for (t, n) = (2, 3), (2, 4), and (3, 4). (b) Throughput [MB/s], (t, n) = (2, 3).

6.2 Real-World Benchmarking

We benchmark CYRUS’s upload and download performance on Dropbox, Google

Drive, SkyDrive, and Box.

Figure VI.5 shows the upload and download completion times for a 40 MB file with CYRUS, DepSky, and two other baseline storage schemes. Full Replication stores a 40 MB replica and Full Striping a 10 MB fragment at each of the four CSPs. Both CYRUS and DepSky use (t, n) = (2, 3), so each share is 20 MB. Since Full Striping uploads the least amount of data to each CSP, it has the shortest upload completion times. However, Full Striping is not reliable, as any CSP failure would prevent the user from retrieving the file. CYRUS has the second-best upload performance. DepSky’s upload time is more than twice as long as CYRUS’s and longer than Full Replication’s, though Full Replication uploads twice as much data as DepSky to each CSP.

Figure VI.4: Testbed completion times of different privacy and reliability configurations, (2, 3), (2, 4), and (3, 4), plotted against file ID. (a) Upload. (b) Download.
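The per-CSP data volumes quoted above follow directly from the scheme definitions. The short sketch below reproduces that arithmetic; the function and scheme names are labels chosen for this example.

```python
FILE_MB = 40
NUM_CSPS = 4


def upload_plan(scheme, t=2, n=3):
    """Return (MB uploaded to each CSP used, number of CSPs used)."""
    if scheme == "full_replication":
        return FILE_MB, NUM_CSPS            # a whole replica everywhere
    if scheme == "full_striping":
        return FILE_MB / NUM_CSPS, NUM_CSPS  # one fragment per CSP
    if scheme == "erasure_coded":            # CYRUS and DepSky with (t, n)
        return FILE_MB / t, n
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("full_replication", "full_striping", "erasure_coded"):
    per_csp, used = upload_plan(scheme)
    print(f"{scheme}: {per_csp} MB per CSP, {per_csp * used} MB in total")
```

Full Striping sends the least data but tolerates no failure, while the (2, 3) code sends half a file to each of three CSPs and survives the loss of any one of them.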

CYRUS’s optimal downlink CSP selection algorithm allows it to achieve the

shortest download completion time. DepSky, which downloads only from the fastest

CSPs, shows longer completion times than Full Striping, which must download from

all four CSPs, including the slowest. Full Replication has the longest download time,

since we averaged its performance over all four CSPs. Its download time would have

been shorter (24.118 seconds) with the optimal CSP, but longer (519.012 seconds)

with the slowest.

Figure VI.6 shows the logical topology between a client and the 20 CSPs in Table IV.2, based on TRACEROUTE results. The green circle and the blue octagons represent the client and the CSPs, respectively. To reduce complexity, we ignore the first 9-10 hops, which belong to the local ISP. White circles represent the subnet addresses of routers with a 24-bit mask. By looking up domain names, we infer that the circles in level 1 are gateway routers in the core network, such as xe-11-2-0.edge2.LosAngeles9.Level3.net. Routers in level 2 are routers in the access networks to which data center routers are connected. We find that the five CSPs marked with an asterisk in Table IV.2 are connected to the same access network and are deployed in Amazon’s data center. To reduce failure dependency, we select four different CSPs in different access networks. We believe that network-side assistance, such as bandwidth reservation and logical topology information, will become available as carrier-grade SDN is deployed in the future.

Figure VI.5: Completion times [sec] of different storage schemes (CYRUS, DepSky, Full Replication, and Full Striping) for upload and download.

Figure VI.6: Logical topology between a client and 20 CSPs based on TRACEROUTE results (labeled regions: Amazon datacenter network and access network).

6.3 Comparison with DepSky

We compare upload and download performance with previous studies and simple baselines in trials. We implemented a simplified DepSky protocol that creates a lock file and backs off before checking the creation of the lock file. Following the DepSky implementation, we set a random backoff time that ranges from 1 s to 2 s. Figure VI.7 shows upload and download throughput when uploading and downloading 1 MB files. DepSky spends two RTTs plus the backoff time to set a file lock. If the clouds are not located nearby and a user attempts to upload a document file, the locking overhead is not negligible. From South Korea, the upload throughput of DepSky is almost half that of CYRUS. Download throughput also shows a large performance gap.

Figure VI.7: Completion times [sec] with CYRUS and DepSky for CSP IDs 1 to 4. (a) Upload. (b) Download.
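The effect of the locking step on small files can be approximated with a back-of-the-envelope model: the lock costs two round trips plus the backoff wait, and this fixed cost is added to the transfer time. The RTT, backoff, and link speed used below are hypothetical values for illustration, not measurements from the trial.

```python
def lock_overhead_s(rtt_s, backoff_s):
    """Two round trips plus the backoff wait before the lock is confirmed."""
    return 2 * rtt_s + backoff_s


def effective_throughput(file_mb, link_mb_s, overhead_s=0.0):
    """Throughput seen by the user when a fixed overhead precedes the transfer."""
    return file_mb / (file_mb / link_mb_s + overhead_s)


# Hypothetical values: 1 MB file, 1 MB/s link, 250 ms RTT, 1.5 s backoff.
overhead = lock_overhead_s(rtt_s=0.25, backoff_s=1.5)
print(effective_throughput(1.0, 1.0))            # no locking
print(effective_throughput(1.0, 1.0, overhead))  # with locking
```

Because the overhead is independent of file size, it matters most for small documents and distant clouds, and is amortized for large files.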

Figure VI.8 shows the number of shares uploaded to and downloaded from each CSP. The DepSky protocol always selects the fastest clouds by canceling pending requests to slower clouds, whereas CYRUS selects clouds with load balancing in mind.

6.4 Deployment Trial Results

We recruited 20 academic (faculty, research staff, and student) users from the United

States and Korea to participate in a small-scale trial. We deployed trial versions of

CYRUS for OS X and Windows that send log data to our development server. Over

the course of the trial in summer 2014, we collected approximately 35k lines of logs.

We compare upload and download times for trial participants in the U.S. and

Korea in Figure VI.9, which shows CYRUS’s completion times when connected to

Dropbox, Google Drive, SkyDrive, and Box. We use (t, n) = (2, 3) and (2, 4), so that

uploading a file to individual CSPs is neither as reliable nor as private as CYRUS.

Figure VI.8: Share distribution with CYRUS and DepSky across CSPs 1 to 4. (a) Upload. (b) Download.

In the U.S. (Figure VI.9(a)), CYRUS encounters a bottleneck of limited total

uplink throughput from the client, slowing its connections to each individual CSP

and lengthening upload completion time. When (t, n) = (2, 3), CYRUS is still faster

than all but one CSP. However, when (t, n) = (2, 4), CYRUS uploads twice as

much data in total as uploading the file to a single CSP; the (2, 4) upload time is

therefore longer than that of all single CSPs. In Korea (Figure VI.9(b)), connections

to individual CSPs are much slower than in the U.S., so CYRUS does not encounter

this client throughput bottleneck with either configuration. Since we upload less

data to each CSP, both CYRUS configurations give shorter upload times than all

individual CSPs.

CYRUS’s download times for both configurations are shorter than those from most individual CSPs. In both the U.S. and Korea, CYRUS has slightly longer download times than the fastest CSP, as it downloads shares from two CSPs. However, its download times are shorter than those from all other CSPs. Thus, CYRUS shortens the average upload time in Korea, where client bandwidth is not a bottleneck, and the average download time in both countries.

Figure VI.9: Completion times [sec] during the trial for Cloud 1 to Cloud 4, CYRUS (2, 3), and CYRUS (2, 4), for upload and download. (a) U.S. (b) Korea.

Chapter VII Discussion

In this chapter, we introduce network-side management technologies to monitor application-layer traffic and protect network resources from selfish and malicious applications. We also describe the impact of CLIMA on the cloud market.

7.1 Management Technologies in Network Domain

As an advanced network management technology, traffic classification is an essential technique for identifying the source of traffic. Because the Internet Protocol does not provide any application-layer information, advanced monitoring and analysis techniques such as Deep Packet Inspection (DPI) for payload signature analysis are needed. Port-based traffic classification is commonly used to filter out specific target traffic, such as instant messenger and P2P applications, because it is the simplest method of filtering. However, today’s Internet applications use dynamic port allocation and port masquerading techniques to avoid port filtering, and the accuracy of port-based traffic classification, even for applications using well-known ports, is less than

70% [Moore:2005]. For this reason, alternative methodologies for traffic classifica-

tion have been developed. We summarize these previous studies into three cate-

gories: payload-based, behavior-based, and machine learning based. Payload-based

classification approaches [42, 43, 44, 45, 46] inspect packet payloads to find targeted

application signatures. This approach assumes that most application traffic contains

signaling packets, and the signatures in signaling packets are unique. By identify-

ing these signatures, we can classify which flow is generated by which application.

Payload-based approaches show reliable accuracy if the signatures are exactly and

uniquely extracted. However, payload-based approaches require exhaustive signa-

ture generation, and cannot be used with encrypted packets. Behavior-based traffic

classification approaches [47, 48] have been developed that analyze host interactions

to distinguish application types. These approaches focused on the connection pat-

terns used by different applications (e.g., P2P and HTTP). For example, BLINC

profiles host behavior using destination IP addresses and port allocation patterns.

However, the accuracy of behavior-based classification is still questionable. In ad-

dition, behavior-based classifications cannot be applied to application level traffic

classification because most applications cannot be distinguished using only host

connection patterns. Recent studies apply machine learning (ML) algorithms to

classify network traffic [49, 50]. Most ML-based approaches use statistical ML algorithms that require training before they can be used. Once trained, most ML-based approaches provide high classification accuracy. However, ML-based classification cannot distinguish application-level traffic, for the same reasons as the behavior-based approaches. In addition, this approach typically has high CPU overhead and requires a large amount of memory.

The initial tasks and the challenges in traffic classification research are as follows:

1. Surveying and listing popular applications: Before extracting traffic classifiers, we should select the target applications to be observed in the network traffic. The common approach to selecting applications is based on user popularity data provided by application markets and surveys. However, these market rankings and popularity surveys do not guarantee that the application traffic will appear in the managed network. As a result, we need to identify the application traffic that is actually present in the target network that we want to analyze. As the initial task in traffic classification research, this analysis of application traffic offers basic insight into the target network, even though we cannot identify all of the application traffic in the network.

2. Applying traffic classifiers: When applying traffic classifiers generated by other research or products, we face unfamiliar applications that are not popular or that have never been observed in our network. Network operators are also unwilling to apply such traffic classifiers because they are often outdated. Commercial entities usually do not publish their classifiers for business reasons; hence, it is difficult for us to modify existing classifiers or add new ones. Even when traffic classifiers are available to the general public, updating the classifiers for hundreds of Internet applications requires exhaustive extraction of application traffic classifiers.

3. Extracting application traffic classifiers: Extracting application traffic classifiers from application traffic is an exhaustive task. To collect the target application traffic exclusively, the target application is usually installed on a machine that is equipped to capture traffic. Application traffic is then intentionally generated by using the application for several minutes. From the collected application traffic, distinguishing traffic characteristics such as port numbers, destination IP addresses, and common substrings are then extracted.

Figure VII.1: Subnet-based application traffic grouping and identification (diagram labels: applications A1 to A4; subnets 1 to 3; iterations 1 and 2; identified traffic sets {}, {A1}, and {A1, A3}).

7.1.1 Identifying Application Traffic In the Dark

Figure VII.1 illustrates the proposed application traffic grouping and identification approach, which is based on subnet grouping. In each iteration, we identify application traffic from the subnet to which the largest amount of unidentified traffic is directed. In Figure VII.1, A_n denotes the traffic generated by application n. Subnet 1 carries traffic from A1 and A2. If A1 is identified in the first iteration, then subnet 2 is the next target for analysis in the second iteration. Even though A2 is not identified in the first iteration, it will have a chance to be analyzed when subnet 1 is again the subnet containing the most unidentified traffic.
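The iteration described above can be sketched as a small selection routine. This is an illustration under assumptions, not the thesis's tool: flows are given as hypothetical `(destination IP, application label, bytes)` tuples, the label stands in for the outcome of manual analysis, and subnets are grouped with a 24-bit mask.

```python
from collections import defaultdict
from ipaddress import ip_network


def subnet_of(dst_ip, prefix=24):
    """Map a destination address to its /24 subnet."""
    return str(ip_network(f"{dst_ip}/{prefix}", strict=False))


def next_target_subnet(flows, identified):
    """Return the subnet carrying the most still-unidentified traffic."""
    volume = defaultdict(int)
    for dst_ip, app, n_bytes in flows:
        if app not in identified:
            volume[subnet_of(dst_ip)] += n_bytes
    if not volume:
        return None  # everything has been identified
    return max(volume, key=volume.get)


# Hypothetical traffic: subnet 1 carries A1 and A2, subnet 2 carries A3.
flows = [
    ("10.0.1.5", "A1", 500), ("10.0.1.9", "A2", 100),
    ("10.0.2.7", "A3", 300), ("10.0.3.2", "A4", 50),
]
print(next_target_subnet(flows, set()))         # subnet 1 first
print(next_target_subnet(flows, {"A1"}))        # then subnet 2
print(next_target_subnet(flows, {"A1", "A3"}))  # subnet 1 again, for A2
```

The analyst's effort is thereby always directed at the subnet where identifying one more application explains the most remaining traffic.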


7.1.2 Vector Space Model for Traffic Classification

Vector Space Modeling (VSM) is an algebraic model from the natural language

processing research area that represents text documents as vectors, and is widely

used for document classification. The goal of document classification is to find a

subset of documents from a set of stored text documents D which satisfy certain

information requests or queries Q. One of the simplest ways to do this is to determine

the significance of a term that appears in the set of documents D [51]. This is

quite similar to traffic classification, which identifies and classifies network traffic

according to the type of application that generated the traffic. The main issue

in VSM is how to determine the weight of each term. Salton et al. [52] have

proposed recommended combinations of term-weighting components: (1) Term

Frequency Component (b, t, n), (2) Collection Frequency Component (x, f, p), and

(3) Normalization Component (x, c). However, these combinations are not suitable

for traffic classification, because these were derived based on the needs of document

classification. We therefore have to consider other weighting methods in order to use

VSM for traffic classification; these methods should correspond to those portions of

the traffic payload that are application-specific and disregard other data.

We apply Vector Space Modeling to traffic classification [53, 54] to mathematically calculate similarities among payloads in application traffic. We define a word as the payload data within an $i$-byte sliding window, where the position of the sliding window can be $1, 2, \ldots, n - i + 1$ for a payload of length $n$. The size of the whole set of representative words is $2^{8i}$. The packet and flow representations using the vector space model are as follows.

Payload Vector: When $w_i$ is the number of times the $i$-th word appears in a payload, the payload vector is $\text{Payload Vector} = [w_1\ w_2\ \ldots\ w_n]^T$, where $n$ here denotes the size of the word space.

Payload Flow Matrix: The Payload Flow Matrix (Figure VII.2) is $\text{PFM} = [p_1\ p_2\ \ldots\ p_k]^T$, where each $p_i$ is a Payload Vector.

Figure VII.2: Vector Space Model represents application traffic.
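The two definitions above can be sketched in a few lines. Since the word space has $2^{8i}$ entries, this illustration stores each payload vector sparsely as a `Counter` of the words that actually occur; the function names and the sparse representation are choices made for this example, not necessarily those of the original implementation.

```python
from collections import Counter


def payload_vector(payload: bytes, i: int = 2) -> Counter:
    """Count every i-byte word seen by a sliding window over the payload.

    A payload of length n yields n - i + 1 window positions; words that
    never occur are simply absent (count 0) in this sparse form.
    """
    return Counter(payload[pos:pos + i] for pos in range(len(payload) - i + 1))


def payload_flow_matrix(payloads, i: int = 2):
    """Represent a flow as the list of its packets' payload vectors."""
    return [payload_vector(p, i) for p in payloads]


vec = payload_vector(b"GET GET", i=2)
print(vec)
pfm = payload_flow_matrix([b"GET /a", b"GET /b"], i=2)
print(len(pfm))
```

With i = 2 the dense word space already has 65,536 dimensions, which is why a sparse representation is convenient for short payloads.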

Then, we calculate the similarity between flows using the Payload Flow Matrix. Cosine similarity is widely used in the natural language processing field. Jaccard similarity counts the common set between vectors, which is similar to payload signature matching. To show the applicability of the Vector Space Model to traffic classification, we classified five P2P applications that were among the top-6 popular applications in Korea in 2009. In fact, P2P application traffic is the most difficult traffic to classify because these applications use masquerading techniques to avoid traffic filters and firewalls.
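The two similarity measures can be sketched on sparse term-frequency vectors as follows. The Jaccard variant shown here is the plain set-based form (shared words over all words, ignoring counts), which is an assumption; the exact variant used in the experiments is not specified in this section.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(a: Counter, b: Counter) -> float:
    """Shared words divided by all words, ignoring how often each occurs."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


# Two toy payload vectors that share one 2-byte word.
p1 = Counter({b"ab": 1, b"bc": 1})
p2 = Counter({b"ab": 1, b"cd": 1})
print(cosine_similarity(p1, p2), jaccard_similarity(p1, p2))
```

Cosine similarity is sensitive to how often each word occurs, whereas the set-based Jaccard measure only asks whether a word fragment is present, which is closer in spirit to signature matching.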

In Table VII.1, Cosine similarity, which is widely used in natural language processing, shows the lower accuracy of the two metrics. Cosine similarity relies on key words, which presumes that the words in a document are chosen intentionally. If we excluded stop words from the documents, Cosine similarity would show more accurate classification results. In payloads, however, filtering out stop words is much more difficult because, in many cases, the same byte patterns are used as both signatures and garbage data (e.g., 0x00 is used to fill unused space in the payload and is also frequently used as a handshake message or a test message). In addition, if we remove the signatures from a payload, the remaining data is almost random. For this reason, some arbitrary word can act as a signature, which significantly reduces the accuracy of the similarity calculation. Jaccard similarity shows higher accuracy and can also deliver relatively good results for unknown traffic. In our payload vector, which is a kind of term-frequency vector, a signature is broken into small fragments. In effect, Jaccard similarity counts the number of shared signature fragments.

Application   Metric    Classified (MB)   Total (MB)   Accuracy
FileGuri      Jaccard          2069.644     2195.551     94.27%
              Cosine           2069.644                  94.27%
ClubBox       Jaccard          1073.425      1073.45    100.00%
              Cosine            544.029                  50.68%
Melon         Jaccard           178.454      178.487     99.98%
              Cosine            178.374                  99.94%
BigFile       Jaccard             0.557        0.561     99.20%
              Cosine              0.537                  95.59%
eMule         Jaccard          1164.291     2296.639     50.70%
              Cosine           1147.812                  49.98%
BitTorrent    Jaccard           834.607      834.632    100.00%
              Cosine            834.622                 100.00%

Table VII.1: Classification accuracy comparison. Fixed-port applications: FileGuri, ClubBox, Melon, BigFile. Untraceable-port applications: eMule, BitTorrent.

7.2 Promoting Market Competition in Cloud Market

While commoditization can ordinarily decrease market competition by favoring large

economies of scale, CYRUS may instead promote competition, as its privacy and

reliability guarantees depend on users having accounts at multiple CSPs.

Without CYRUS, users experience a phenomenon known as “vendor lock-in:”

After buying storage from one CSP, a user will not wish to use storage from other

CSPs due to the overhead in storing files at and retrieving them from multiple

places. Since different CSPs enter the market at different times, vendor lock-in can

thus lead to very uneven adoption of CSPs and correspondingly uneven revenues. To

combat vendor lock-in, many CSPs offer some free storage for individual (i.e., non-

business) users. However, persuading users to later pay for more storage at these

additional CSPs is still difficult, unless the new CSP offers much better service than

the previous one. Business users, who do not receive free storage, have no incentive

to join more than one CSP, exacerbating vendor lock-in.

CYRUS eliminates vendor lock-in for its users by removing the overhead of stor-

ing files in multiple places. Indeed, CYRUS encourages users to purchase similar

amounts of storage at multiple CSPs, as this increases its achievable reliability and

privacy: users can store more chunk shares at different CSPs. Thus, assuming

comparable CSP prices, a given user might purchase storage at all available CSPs,

evening out CSPs’ market shares. CSPs entering into the market would also be able to gain users, as accounts there would allow users greater privacy and reliability. CSPs with better client connectivity and coordination with CYRUS could gain a competitive advantage by providing better services, but some user demand would still exist at other CSPs.

Since CYRUS’s secret sharing scheme increases the amount of data stored by a

factor of n/t, users would need to purchase more total cloud storage with CYRUS.

Total CSP revenue from business users might then increase, though it would likely be

more evenly distributed among CSPs. CSP revenue from individual users, however,

might decrease: some users could collect free storage from all available CSPs without

needing to purchase any additional storage. Thus, CYRUS may discourage CSPs

from offering free introductory storage to individual users, in order to earn more

revenue.


Chapter VIII Conclusion

This chapter summarizes the overall contents of the thesis and lists its contributions. Furthermore, it discusses several research topics as future work.

8.1 Summary

Recent years have witnessed an explosion in the popularity of cloud services, as

more and more businesses and individuals move their computing and storage into

the cloud. Current cloud services impose centrally determined policies on users,

e.g., using the same reliability guarantees and privacy standards for all users’ data.

Most cloud service providers (CSPs), for instance, provide 99.9% availability, or

about 8 hours of downtime per year. While this number is likely acceptable for most

users, some business customers might require, and be willing to pay for, significantly

higher reliability, e.g., 99.9999% or less than 5 minutes of downtime per year [30].

Similarly, some users may want guaranteed privacy. Most CSPs ensure privacy by encrypting users’ data [55], but since they know the encryption key, they can also decrypt and view this data themselves [56].

In this thesis, we proposed CLIMA, which enables clients to actively manage cloud services without modifying the implementation of the clouds. After analyzing users’ requirements, we realized CLIMA as CYRUS, a client application for Mac OS X, Windows, and Linux machines. The application connects to four popular commercial providers: Dropbox, Google Drive, OneDrive, and Box. CYRUS provides customized cloud storage for securely and reliably sharing files among multiple clients. Instead of providing physical storage space, CYRUS integrates storage accounts from multiple providers. CYRUS’s key insight is to scramble files at the client, instead of simply encrypting them at the cloud, and to store the scrambled files by splitting them among multiple clouds. We addressed and solved practical issues in managing CSPs as well as in distributing and sharing data using a purely client-based architecture, i.e., optimally selecting CSPs, sharing file metadata, and handling concurrent data access. The erasure code guarantees that the original data can be reconstructed from any t of the n shares, so up to n − t CSP failures can be tolerated, and access to multiple CSP accounts is needed to decode the data. We also defined cloud-side operations that consist only of primitive functions: GET, PUT, LIST, and DELETE.
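The four cloud-side operations can be pictured as a minimal storage interface that every provider connector would implement. The class and method names below (`CloudStore`, `InMemoryStore`, `list_names`) are hypothetical and chosen for this sketch; they do not reproduce CYRUS's actual connector API, and the in-memory class only stands in for a real provider.

```python
from abc import ABC, abstractmethod


class CloudStore(ABC):
    """The only operations a client needs from a CSP: PUT, GET, LIST, DELETE."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...

    @abstractmethod
    def list_names(self, prefix: str = "") -> list: ...

    @abstractmethod
    def delete(self, name: str) -> None: ...


class InMemoryStore(CloudStore):
    """Stand-in for a real provider, useful for local testing."""

    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        self._objects[name] = bytes(data)

    def get(self, name):
        return self._objects[name]

    def list_names(self, prefix=""):
        return sorted(n for n in self._objects if n.startswith(prefix))

    def delete(self, name):
        self._objects.pop(name, None)


store = InMemoryStore()
store.put("meta/file1", b"metadata")
store.put("share/chunkA.1", b"share bytes")
print(store.list_names("share/"))
```

Keeping the interface this small is what allows all logic for coding, selection, and metadata handling to stay on the client side, with no change to the clouds themselves.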

By intelligently leveraging multiple cloud accounts, CYRUS provides the follow-

ing unique advantages:

• Privacy: Since files are scrambled at the client, no storage provider can see

users’ data. Most providers today encrypt users’ data, but still have access to


the data since they keep the encryption keys.

• Reliability: Redundancy is built into the file pieces. Even if multiple (up

to a certain number) cloud providers fail, files remain recoverable from pieces

stored at the other providers.

• Performance: CYRUS minimizes download delays for each client by opti-

mally choosing cloud providers from which to download file pieces.

• File sharing: Multiple users can seamlessly upload and download files with

the same CYRUS account.

We performed experiments in both testbed and real-world environments using our prototype. When distributing data to four cloud storage services with the (t, n) = (2, 4) configuration, CYRUS shows 99.999999% availability. Striping and scheduling data transmissions improve throughput compared with the baseline approaches and previous studies. CLIMA contributes to managing multiple cloud storage services as a resource pool and encourages users to use more cloud services without reliability and privacy concerns.

8.2 Contributions

• CLIent-defined Management Architecture (CLIMA): We defined three

domains (Application, Network, and Cloud Domain), and analyzed ownership

and manageability. Users can deploy programs and implement functions at

Application Domain. Because Network Domain is not manageable by users,

CLIMA applications perform active and passive monitoring to get network


status information. Using cloud APIs of Cloud Domain, CLIMA applications

leverage cloud resources through HTTP RESTful interfaces.

• Client-defined privacY-protected Reliable cloUd Storage: CYRUS,

a realization of CLIMA, scatters files across different clouds so that no one

CSP can access them but the file can be recovered if some clouds fail. CYRUS

supports multiple clients (i.e., devices) accessing the same CSP resources; these

devices may be owned by different users, allowing them to share files.

• Customizable reliability, privacy, and performance: Users can config-

ure reliability and privacy levels by specifying the number of CSPs to which

different files should be scattered and the number of scattered pieces required

to reconstruct the file. We choose the CSPs from which to download these

pieces so as to minimize latency. CYRUS further reduces latency by down-

loading data from multiple CSPs in parallel.

• Defining operations on multiple clouds and clients: Users can build a

CLIMA cloud by calling a standard set of APIs. The CLIMA application transparently aggregates file operations on multiple clouds, internally storing and

tracking file metadata and CSP storage locations for efficient file operations.

• Prototype implementation and deployment: We implemented a proto-

type of CLIMA in Python on Mac OS X and Windows, evaluating its per-

formance on four commercial providers and showing that our prototype out-

performs other CSP integration services in real-world tests. We also present

results from a small pilot trial in the U.S. and Korea.


• Discussion of market implications: While many users today only use

storage from one CSP to avoid managing multiple CSP accounts, CLIMA

encourages users to use more CSPs for greater privacy and reliability. CLIMA

thus evens out demand across CSPs and makes it easier for market entrants

to gain users.

8.3 Future Work

As future work, we plan to develop a P2P-based metadata management system. One limitation of CYRUS is that it shares metadata through the clouds using only GET and PUT operations. Our design does not change or delete stored files because clouds do not provide lock operations. Even though clouds are reliable contact points, a P2P-based solution is more flexible for implementing sharing and file locking features. To do this, we plan to find a strategy that guarantees higher searchability of metadata files and reliable connections.

Considering the standardization efforts on northbound APIs of network controllers, we also plan to define and implement interfaces between CLIMA clients and the Network Domain. SDN lets applications become aware of network status and submit their demands to ISPs [28]. ISPs can then schedule the demands optimally and provide higher QoS. We are considering three scenarios of carrier-grade SDN: 1) SDN for network management; 2) SDN for cloud and CDN providers; 3) SDN for Internet applications and end-users. If SDN is widely deployed, CLIMA applications can obtain network status, which is currently measured by active probing (TRACEROUTE), and request QoS parameters instead of scheduling transfers locally at the client. Defining northbound APIs of the SDN controller that expose interfaces for Internet applications to exchange network status information, dynamically control flows, guarantee QoS, etc. will be our future research on network-side application traffic management.


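Summary

A Management Architecture for Client-defined Cloud Storage Services

The cloud storage market is growing rapidly every year. According to Gartner
(2013), the cloud market is projected to reach 240 billion dollars in 2017. In
2013, Google and Dropbox announced that users of their cloud storage services had
surpassed 120 million and 200 million, respectively. However, the current service
model, in which users entrust all of their data to a cloud provider, can leak the
personal information and data of storage service users. The recent iCloud hacking
incident is a representative example. In addition, because the provider holds all
management authority under the current service model, users cannot actively respond
to failures.

In the storage field, distributed data storage systems have been studied as a
representative fault-tolerant storage approach. Whereas existing storage research
built its own hardware and software systems, this work differs in that it
integrates different cloud services at the client device.

We propose the Client-defined Management Architecture (CLIMA) and implement
Client-defined privacY-protected Reliable cloUd Storage service (CYRUS), which
realizes it. To this end, we analyzed the requirements and implementation issues
arising from the CLIMA proposal and the CYRUS implementation, and built a prototype
that uses four commercial cloud services. We used erasure codes to improve
reliability and security at the same time, and improved transfer performance
through parallel transfers and transfer optimization.

By combining four cloud storage services, we achieved 99.999999% reliability, and
the proposed method shows higher transfer performance than existing work.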

REFERENCES

[1] M. Dumon, “Cloud storage industry continues rapid growth,” Examiner.com, 2013. [Online]. Available: http://www.examiner.com/article/cloud-storage-industry-continues-rapid-growth

[2] S. Patterson, “Personal cloud storage hit 685 petabytes this year,” WebProNews, 2013. [Online]. Available: http://www.webpronews.com/personal-cloud-storage-hit-685-petabytes-this-year-2013-12

[3] N. Perlroth, “Home Depot data breach could be the largest yet,” The New York Times, 2014. [Online]. Available: http://bits.blogs.nytimes.com/2014/09/08/home-depot-confirms-that-it-was-hacked/

[4] D. Wakabayashi, “Tim Cook says Apple to add security alerts for iCloud users,” The Wall Street Journal, 2014. [Online]. Available: http://online.wsj.com/articles/tim-cook-says-apple-to-add-security-alerts-for-icloud-users-1409880977

[5] B. Stone, “Another Amazon outage exposes the cloud’s dark lining,” Bloomberg Businessweek, 2013. [Online]. Available: http://www.businessweek.com/articles/2013-08-26/another-amazon-outage-exposes-the-clouds-dark-lining

[6] Dropbox Blog, “Dropbox wasn’t hacked,” Oct. 2014. [Online]. Available: https://blog.dropbox.com/2014/10/dropbox-wasnt-hacked/#more-4050

[7] T. W. Malone, J. Yates, and R. I. Benjamin, “Electronic markets and electronic hierarchies,” Commun. ACM, vol. 30, no. 6, pp. 484–497, June 1987.

[8] J. Y. Bakos and E. Brynjolfsson, “Information technology, incentives, and the optimal number of suppliers,” J. Manage. Inf. Syst., vol. 10, no. 2, pp. 37–53, Sept. 1993.


[9] C. Cachin and S. Tessaro, “Optimal resilience for erasure-coded Byzantine distributed storage,” in Proc. of IEEE DSN. Washington, DC, USA: IEEE Computer Society, 2006, pp. 115–124. [Online]. Available: http://dx.doi.org/10.1109/DSN.2006.56

[10] G. R. Goodson, J. J. Wylie, G. R. Ganger, and M. K. Reiter, “Efficient Byzantine-tolerant erasure-coded storage,” in Proc. of IEEE DSN. Washington, DC, USA: IEEE Computer Society, 2004, pp. 135–. [Online]. Available: http://dl.acm.org/citation.cfm?id=1009382.1009729

[11] D. Malkhi and M. Reiter, “Byzantine quorum systems,” in Proc. of ACM STOC. New York, NY, USA: ACM, 1997, pp. 569–578. [Online]. Available: http://doi.acm.org/10.1145/258533.258650

[12] D. Malkhi and M. K. Reiter, “Secure and scalable replication in Phalanx,” in Proc. of IEEE SRDS. Washington, DC, USA: IEEE Computer Society, 1998, pp. 51–. [Online]. Available: http://dl.acm.org/citation.cfm?id=829523.831001

[13] J.-P. Martin, L. Alvisi, and M. Dahlin, “Minimal Byzantine storage,” in Proc. of DISC. London, UK: Springer-Verlag, 2002, pp. 311–325. [Online]. Available: http://dl.acm.org/citation.cfm?id=645959.676126

[14] K. D. Bowers, A. Juels, and A. Oprea, “HAIL: A high-availability and integrity layer for cloud storage,” in Proc. of ACM CCS. New York, NY, USA: ACM, 2009, pp. 187–198. [Online]. Available: http://doi.acm.org/10.1145/1653662.1653686

[15] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, “Ceph: A scalable, high-performance distributed file system,” in Proc. of OSDI. Berkeley, CA, USA: USENIX Association, 2006, pp. 307–320. [Online]. Available: http://dl.acm.org/citation.cfm?id=1298455.1298485

[16] H. Abu-Libdeh, L. Princehouse, and H. Weatherspoon, “RACS: A case for cloud storage diversity,” in Proc. of ACM SoCC. New York, NY, USA: ACM, 2010, pp. 229–240. [Online]. Available: http://doi.acm.org/10.1145/1807128.1807165

[17] T. G. Papaioannou, N. Bonvin, and K. Aberer, “Scalia: An adaptive scheme for efficient multi-cloud storage,” in Proc. of IEEE SC. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, pp. 20:1–20:10. [Online]. Available: http://dl.acm.org/citation.cfm?id=2388996.2389024

[18] B. Debnath, S. Sengupta, and J. Li, “ChunkStash: Speeding up inline storage deduplication using flash memory,” in Proc. of the 2010 USENIX Conference on USENIX Annual Technical Conference, ser. USENIXATC’10. Berkeley, CA, USA: USENIX Association, 2010, pp. 16–16. [Online]. Available: http://dl.acm.org/citation.cfm?id=1855840.1855856

[19] K. Park, S. Ihm, M. Bowman, and V. S. Pai, “Supporting practical content-addressable caching with CZIP compression,” in 2007 USENIX Annual Technical Conference on Proc. of the USENIX Annual Technical Conference, ser. ATC’07. Berkeley, CA, USA: USENIX Association, 2007, pp. 14:1–14:14. [Online]. Available: http://dl.acm.org/citation.cfm?id=1364385.1364399

[20] J. K. Resch and J. S. Plank, “AONT-RS: Blending security and performance in dispersed storage systems,” in Proc. of the 9th USENIX FAST. Berkeley, CA, USA: USENIX Association, 2011, pp. 14–14. [Online]. Available: http://dl.acm.org/citation.cfm?id=1960475.1960489

[21] V. Attasena, N. Harbi, and J. Darmont, “Sharing-based privacy and availability of cloud data warehouses,” in 9emes journees francophones sur les Entrepots de Donnees et l’Analyse en ligne (EDA 13), Blois, ser. Revues des Nouvelles Technologies de l’Information, vol. B-9. Paris: Hermann, June 2013, pp. 17–32.

[22] G. S. Machado, T. Bocek, M. Ammann, and B. Stiller, “A cloud storage overlay to aggregate heterogeneous cloud services,” in Proc. of IEEE LCN. IEEE, 2013.

[23] A. Bessani, M. Correia, B. Quaresma, F. Andre, and P. Sousa, “DepSky: Dependable and secure storage in a cloud-of-clouds,” in Proc. of ACM EuroSys. New York, NY, USA: ACM, 2011, pp. 31–46. [Online]. Available: http://doi.acm.org/10.1145/1966445.1966449

[24] C. W. Ling and A. Datta, “InterCloud RAIDer: A do-it-yourself multi-cloud private data backup system,” in Distributed Computing and Networking. Springer, 2014, pp. 453–468.

[25] R. Alimi, R. Penno, and Y. Yang, “Application-layer traffic optimization (ALTO) protocol,” IETF, RFC 7285, 2014.

[26] D. King and A. Farrel, “A PCE-based architecture for application-based network operations,” IETF, Active Internet-Draft, 2014.

[27] L. Velasco, A. Castro, D. King, O. Gerstel, R. Casellas, and V. Lopez, “In-operation network planning,” Communications Magazine, IEEE, vol. 52, no. 1, pp. 52–60, Jan. 2014.

[28] S. Sezer, S. Scott-Hayward, P. Chouhan, B. Fraser, D. Lake, J. Finnegan, N. Viljoen, M. Miller, and N. Rao, “Are we ready for SDN? Implementation challenges for software-defined networks,” Communications Magazine, IEEE, vol. 51, no. 7, pp. 36–43, July 2013.

[29] Apache, “What is the JClouds?” 2013. [Online]. Available: http://jclouds.apache.org/

[30] J. Lango, “Toward software-defined SLAs,” Queue, vol. 11, no. 11, p. 20, 2013.

[31] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” ACM Transactions on Computer Systems, vol. 26, no. 2, p. 4, 2008.

[32] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s highly available key-value store,” in Proc. of ACM SOSP. ACM, 2007, pp. 205–220.

[33] Apache Software Foundation, “HBase,” 2014. [Online]. Available: http://hbase.apache.org/

[34] K. Finley, “Three reasons why Amazon’s new storage service won’t kill Dropbox,” Wired, 2014. [Online]. Available: http://www.wired.com/2014/07/amazon-zocalo/

[35] A. Bessani, R. Mendes, T. Oliveira, N. Neves, M. Correia, M. Pasin, and P. Verissimo, “SCFS: A shared cloud-backed file system,” in Proc. of the 2014 USENIX Annual Technical Conference, 2014.

[36] M. Rabin, Fingerprinting by Random Polynomials, ser. Center for Research in Computing Technology. Aiken Computation Laboratory, Univ., 1981. [Online]. Available: http://books.google.com/books?id=Emu_tgAACAAJ

[37] R. J. McEliece and D. V. Sarwate, “On sharing secrets and Reed-Solomon codes,” Communications of the ACM, vol. 24, no. 9, pp. 583–584, 1981.

[38] A. Shamir, “How to share a secret,” Commun. ACM, vol. 22, no. 11, pp. 612–613, Nov. 1979. [Online]. Available: http://doi.acm.org/10.1145/359168.359176

[39] J. S. Plank, J. Luo, C. D. Schuman, L. Xu, and Z. Wilcox-O’Hearn, “A performance evaluation and examination of open-source erasure coding libraries for storage,” in Proceedings of the 7th USENIX FAST. Berkeley, CA, USA: USENIX Association, 2009, pp. 253–265. [Online]. Available: http://dl.acm.org/citation.cfm?id=1525908.1525927


[40] S. Quinlan and S. Dorward, “Venti: A new approach to archival data storage,” in Proceedings of the 1st USENIX Conference on File and Storage Technologies, ser. FAST ’02. Berkeley, CA, USA: USENIX Association, 2002. [Online]. Available: http://dl.acm.org/citation.cfm?id=1083323.1083333

[41] I. Drago, M. Mellia, M. M. Munafo, A. Sperotto, R. Sadre, and A. Pras, “Inside Dropbox: Understanding personal cloud storage services,” in Proc. of ACM IMC. New York, NY, USA: ACM, 2012, pp. 481–494. [Online]. Available: http://doi.acm.org/10.1145/2398776.2398827

[42] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of P2P traffic using application signatures,” in Proceedings of the 13th International Conference on World Wide Web, ser. WWW ’04. New York, NY, USA: ACM, 2004, pp. 512–521. [Online]. Available: http://doi.acm.org/10.1145/988672.988742

[43] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated construction of application signatures,” in Proceedings of the 2005 ACM SIGCOMM Workshop on Mining Network Data, ser. MineNet ’05. New York, NY, USA: ACM, 2005, pp. 197–202. [Online]. Available: http://doi.acm.org/10.1145/1080173.1080183

[44] B.-C. Park, Y. Won, M.-S. Kim, and J. Hong, “Towards automated application signature generation for traffic identification,” in Network Operations and Management Symposium, 2008. NOMS 2008. IEEE, April 2008, pp. 160–167.

[45] T. Choi, C. Kim, S. Yoon, J. Park, B. Lee, H. Kim, H. Chung, and T. Jeong, “Content-aware internet application traffic measurement and analysis,” in Network Operations and Management Symposium, 2004. NOMS 2004. IEEE/IFIP, vol. 1, April 2004, pp. 511–524.

[46] F. Risso, M. Baldi, O. Morandi, A. Baldini, and P. Monclus, “Lightweight, payload-based traffic classification: An experimental evaluation,” in Communications, 2008. ICC ’08. IEEE International Conference on, May 2008, pp. 5869–5875.

[47] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: Multilevel traffic classification in the dark,” in Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ser. SIGCOMM ’05. New York, NY, USA: ACM, 2005, pp. 229–240. [Online]. Available: http://doi.acm.org/10.1145/1080091.1080119


[48] T. Karagiannis, A. Broido, M. Faloutsos, and K. claffy, “Transport layer identification of P2P traffic,” in Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, ser. IMC ’04. New York, NY, USA: ACM, 2004, pp. 121–134. [Online]. Available: http://doi.acm.org/10.1145/1028788.1028804

[49] A. W. Moore and K. Papagiannaki, “Toward the accurate identification of network applications,” in Proceedings of the 6th International Conference on Passive and Active Network Measurement, ser. PAM’05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 41–54. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-31966-5_4

[50] J. Erman, M. Arlitt, and A. Mahanti, “Traffic classification using clustering algorithms,” in Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, ser. MineNet ’06. New York, NY, USA: ACM, 2006, pp. 281–286. [Online]. Available: http://doi.acm.org/10.1145/1162678.1162679

[51] H. P. Luhn, “A statistical approach to mechanized encoding and searching of literary information,” IBM J. Res. Dev., vol. 1, no. 4, pp. 309–317, Oct. 1957. [Online]. Available: http://dx.doi.org/10.1147/rd.14.0309

[52] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Aug. 1988. [Online]. Available: http://dx.doi.org/10.1016/0306-4573(88)90021-0

[53] J. Y. Chung, B. Park, Y. J. Won, J. Strassner, and J. W. Hong, “Traffic classification based on flow similarity,” in Proceedings of the 9th IEEE International Workshop on IP Operations and Management, ser. IPOM ’09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 65–77. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-04968-2_6

[54] J. Y. Chung, B. Park, Y. Won, J. Strassner, and J. Hong, “An effective similarity metric for application traffic classification,” in Network Operations and Management Symposium (NOMS), 2010 IEEE, April 2010, pp. 286–292.

[55] S. Rosenblatt, “Google now encrypts cloud storage by default,” CNET, 2013. [Online]. Available: http://news.cnet.com/8301-1023_3-57598786-93/google-now-encrypts-cloud-storage-by-default/

[56] J. Kloc, “Is Dropbox reading your documents?” The Daily Dot, 2013. [Online]. Available: http://www.dailydot.com/news/dropbox-reading-documents-nsa/


Acknowledgements

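First, I thank my beloved family, who believed in me and supported me throughout
the long years of my doctoral studies.

Above all, I would like to express my gratitude to Professor James Won-Ki Hong and
Professor Jae-Hyoung Yoo, whose guidance made it possible for me to complete this
doctoral thesis. I thank them once again for helping and leading me so that I could
pursue research freely in a good research environment and encounter a wide range of
opportunities.

I also thank the seniors and juniors with whom I shared the past six years in the
DPNM lab, years that felt both short and long. I thank Professor Mi-Jung Choi, whom
I first met as an undergraduate; Professor Hong-Taek Ju, who has always been with
the DPNM lab; Professor Myung-sup Kim, who gave me much advice on research; and
Professor Youngjoon Won, who looked after me while nagging about this and that.

Dr. Joon-Myung Kang, always full of energy; Dr. Seong-Cheol Hong, who always looked
tired; Dr. Byungchul Park, who went through many hardships with me; the cheerful
Dr. Sung-Su Kim; and the diligent Dr. Sin-seok Seo: I will not forget any of you.
My undergraduate classmates Yoonseon and Jonghwan; Jian, who is so good at
research; the handsome Taeyoel; and the newcomers Doyoung and Seyeon: all of you
are ties I will never forget.

As a member of the DPNM family, I will live with a grateful heart and with the
resolve to give back as much as I have received. Thank you.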

Curriculum Vitae

Jae Yoon Chung (정 재 윤)

Education

Degree          Year         University                 Department
M.S. & Ph.D.    2009-2015    POSTECH, Pohang, Korea     CSE
                             (Supervisor: James Won-Ki Hong)
B.S.            2005-2009    POSTECH, Pohang, Korea     CSE

Research Interests

Client-defined Internet Services; Internet traffic measurement and analysis;
Intelligent traffic classification; Datacenter networking

Research/Project Experiences

Client-defined Cloud Storage Services

Collaborative Research while visiting Edge Lab., Princeton University (2013-2014)

This research aims at designing and developing client-defined cloud storage
services. I designed and developed a system that coordinates multiple cloud storage
services from the client, and built a prototype called CYRUS on four commercial
cloud storage services. CYRUS was deployed to twenty academic users in the U.S. and
Korea.

Effective Mobile Network Probing Method for LTE

Collaborative Research while visiting Edge Lab., Princeton University (2013)

This research focuses on developing an effective probing method and inferring
network status from the client side. I developed a probing method and an inference
algorithm that take LTE resource allocation into account, as well as a prototype
probing agent for Android devices. Twelve devices have been running near Times
Square, NYC, collecting LTE network status.

Design and Implementation of Mobile Traffic Classification Methodology

Funded by Microsoft Research Asia (2011-2012)

This research project aims at developing a novel application-level traffic
classification methodology to monitor and analyze mobile traffic. The key
challenges are the different characteristics of mobile traffic compared to
traditional Internet traffic and the limited computing resources of mobile devices.
My role in this project includes, but is not limited to, developing methodologies
for discriminating mobile traffic from a mixture of mobile and non-mobile traffic
and a new application-level mobile traffic classification.

Highly Manageable Network and Service Architecture for Next Generation (HiMang)

Funded by Electronics and Telecommunications Research Institute (2010-2012)

HiMang is part of a three-year government-funded project. This research project
develops a novel autonomic and cognitive approach to providing a highly manageable
network and service management architecture for current as well as future networks.
It is based on an innovative knowledge representation methodology that unifies
disparate knowledge sources and greatly improves learning, decision-making, and
reasoning in management systems. My work involves investigating traffic monitoring
methodologies for the HiMang architecture.

Security Research for Mobile Cloud Service

Funded by KT Corporation (2010)

This research project aims at surveying and implementing security techniques for
mobile cloud services. Improving the security level of a mobile cloud service
requires monitoring and analyzing abnormal behavior in each mobile virtual
instance. The major challenge is to minimize the computation performed by the
hypervisor while monitoring both the host-level and network-level behavior of
mobile virtual instances. To address this, I developed a system that monitors the
behavior of mobile virtual instances using a traffic mirroring technique, and
applied a machine learning algorithm to detect abnormal behavior.

IT Convergence for Ubiquitous Autonomic Systems

Funded by Ministry of Education, Science, and Technology, Korea (2009-2013)

This is part of the research work proposed under our World Class University grant
from the Korean government. My work includes investigating how autonomic mechanisms
can be applied to manage new ubiquitous computing systems that use bioinformatics,
nanotechnologies, and networking technologies for building ubiquitous computing
applications (called ubiquitous health and ubiquitous environment applications in
Korea).

Publications: International Journals

1. Jae Yoon Chung, Sangtae Ha, and James Won-Ki Hong, “A Management Architecture for Client-Defined Cloud Storage Services”, International Journal of Network Management (IJNM), 2015 (accepted to appear).

2. Byungchul Park, Youngjoon Won, Jae Yoon Chung, Myung-sup Kim, and James Won-Ki Hong, “Fine-grained Traffic Classification based on Functional Separation”, International Journal of Network Management (IJNM), vol. 23, no. 6, Sept./Oct, 2013, pp. 350-381.

Publications: International Conferences

1. Jae Yoon Chung, Carlee Joe-Wong, Sangtae Ha, James Won-Ki Hong, and Mung Chiang, “CYRUS: Towards Client-Defined Cloud Storage”, ACM EuroSys, Bordeaux, France, April 21-24, 2015.

2. Yoonseon Han, Jian Li, Jae Yoon Chung, Jae-Hyoung Yoo and James Won-Ki Hong, “SAVE: Energy-Aware Virtual Data Center Embedding and Traffic Engineering using SDN”, 1st IEEE Conference on Network Softwarization (NetSoft 2015), UCL, UK, April 13-17, 2015.

3. Taeyoel Jeong, Sin-seok Seo, Jae Yoon Chung, Bernard Niyonteze, Jae-Hyoung Yoo and James Won-Ki Hong, “Multi-Objective Optimization-based Traffic Engineering for Data Center Networks”, 16th Asia-Pacific Network Operations and Management Symposium (APNOMS 2014), Hsinchu, Taiwan, Sept. 17-19, 2014.

4. Jae Yoon Chung, Jian Li, Yeongrak Choi, and James Won-Ki Hong, “Application Traffic Identification Based on Remote Subnet Grouping”, 14th Asia-Pacific Network Operations and Management Symposium (APNOMS 2012), Seoul, Korea, Sep. 25-27, 2012.

5. Yeongrak Choi, Jae Yoon Chung, Byungchul Park, and James Won-Ki Hong, “Automated Classifier Generation for Application-Level Mobile Traffic Identification”, 13th IEEE/IFIP Network Operations and Management Symposium (NOMS 2012), Maui, Hawaii, USA, April 16-20, 2012, pp. 1075-1081.

6. Taehyun Kim, Yeongrak Choi, Seunghee Han, Jae Yoon Chung, Jonghwan Hyun, Jian Li, and James Won-Ki Hong, “Monitoring and Detecting Abnormal Behavior in Mobile Cloud Infrastructure”, 2012 IEEE/IFIP International Workshop on Cloud Management (CloudMan 2012), Maui, Hawaii, USA, April 20, 2012, pp. 1303-1310.

7. Jae Yoon Chung, Yeongrak Choi, Byungchul Park, and James Won-Ki Hong, “Measurement Analysis of Mobile Traffic in Enterprise Networks”, 13th Asia-Pacific Network Operations and Management Symposium (APNOMS 2011), Taipei, Taiwan, Sep. 21-23, 2011, pp. 1-4.

8. Jian Li, Jae Yoon Chung, Jin Xiao, James Won-Ki Hong, and Raouf Boutaba, “On The Design and Implementation of a Home Energy Management System”, 6th IEEE International Symposium on Wireless and Pervasive Computing (ISWPC 2011), Hong Kong, China, Feb. 23-25 2011, pp. 1-6.

9. Jin Xiao, Jae Yoon Chung, Jian Li, Raouf Boutaba, and James Won-Ki Hong, “Near Optimal Demand-Side Energy Management Under Real-time Demand-Response Pricing”, 6th International Conference on Network and Service Management (CNSM 2010), Niagara Falls, Canada, Oct. 25-29, 2010, pp. 527-522.

10. Arum Kwon, Joon-Myung Kang, Sin-seok Seo, Sung-Su Kim, Jae Yoon Chung, John Strassner, and James Won-Ki Hong, “The Design of a Quality of Experience Model for Providing High Quality Multimedia Services”, 5th International Workshop on Modelling Autonomic Communication Environments (MACE 2010), Niagara Falls, Canada, Oct. 28, 2010, pp. 24-36.

11. Jae Yoon Chung, Byungchul Park, Young J. Won, John Strassner, and James Won-Ki Hong, “An Effective Similarity Metric for Application Traffic Classification”, the 12th IEEE/IFIP Network Operations and Management Symposium (NOMS 2010), Osaka, Japan, Apr. 19-23, 2010, pp. 286-292.

12. Jae Yoon Chung, Byungchul Park, Young J. Won, John Strassner, and James Won-Ki Hong, “Traffic Classification Based on Flow Similarity”, 9th IEEE International Workshop on IP Operations and Management (IPOM 2009), Venice, Italy, Oct. 29-30, 2009, pp. 65-77.

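Publications: Domestic Journals

1. Yeongrak Choi, Jae Yoon Chung, Byungchul Park, and James Won-Ki Hong, “A Study on a System for Application-Level Mobile Traffic Monitoring and Analysis” (in Korean), KNOM Review, vol. 14, no. 2, Dec. 2011, pp. 10-21.

2. Jae Yoon Chung, Jian Li, and James Won-Ki Hong, “Demand-Side Energy Management in Smart Grid Environments” (in Korean), KNOM Review, vol. 13, no. 2, Dec. 2010, pp. 1-13.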

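Publications: Domestic Conferences

1. Jonghwan Hyun, Jae Yoon Chung, Jian Li, and James Won-Ki Hong, “A Transmission Scheme Guaranteeing the Integrity of Vehicle Black-Box Video” (in Korean), KNOM Conference 2013, DaeGu, Korea, May 9-10, 2013, pp. 140-141.

2. Taehyun Kim, Jae Yoon Chung, Jonghwan Hyun, Jian Li, and James Won-Ki Hong, “Monitoring and Detecting Abnormal Behavior in Mobile Cloud Environments” (in Korean), KNOM Conference 2012, JeJu, Korea, May 3-4, 2012, pp. 130-134.

3. Jae Yoon Chung, Yeongrak Choi, Jian Li, and James Won-Ki Hong, “Application Selection and Signature Extraction Methods for Application-Layer Traffic Classification” (in Korean), KNOM Conference 2012, JeJu, Korea, May 3-4, 2012, pp. 29-33.

4. Jae Yoon Chung, Jonghwan Hyun, and James Won-Ki Hong, “An Abnormal Behavior Detection System for Cloud Services” (in Korean), 2012 KICS Winter Conference, Yongpyong, Korea, Feb. 8-10, 2012, p. 118.

5. Jae Yoon Chung, Yeongrak Choi, Byungchul Park, and James Won-Ki Hong, “Monitoring and Analysis of Mobile Traffic” (in Korean), KNOM Conference 2011, Pohang, Korea, Apr. 21-22, 2011.

6. Arum Kwon, Joon-Myung Kang, Sin-seok Seo, Sung-Su Kim, Jae Yoon Chung, John Strassner, and James Won-Ki Hong, “A Quality of Experience Model for Providing High-Quality Multimedia Services” (in Korean), KNOM Conference 2011, Pohang, Korea, Apr. 21-22, 2011.

7. Jian Li, Jae Yoon Chung, and James Won-Ki Hong, “Design and Implementation of a Home Energy Management System for the Smart Grid” (in Korean), KNOM Conference 2011, Pohang, Korea, Apr. 21-22, 2011.

8. Byungchul Park, Jae Yoon Chung, and James Won-Ki Hong, “A Study on Application Traffic Classification Using Flow Similarity” (in Korean), KICS Summer Conference, Jeju, Korea, June 23-25, 2010.

9. Jae Yoon Chung, Byungchul Park, and James Won-Ki Hong, “Design and Implementation of a Green Home Server for Intelligent Power Metering and Control” (in Korean), KICS Summer Conference, Jeju, Korea, June 23-25, 2010.

10. Jae Yoon Chung, Youngjoon Won, and James Won-Ki Hong, “Automatic Generation and Collection of Signatures Based on the Network Behavior of Applications” (in Korean), KICS Summer Conference, Jeju, Korea, Jun. 22-24,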

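Publications: Domestic Patents

1. Jae Yoon Chung, James Won-Ki Hong, and Yeongrak Choi, “Apparatus and Method for Analyzing Virtual Instance Behavior in a Cloud System”, Korean Patent No. 10-13203860 (Application No. 10-2012-0022880), Oct. 15, 2013 (filed Mar. 6, 2012).

2. James Won-Ki Hong, Youngjoon Won, and Jae Yoon Chung, “Apparatus for Generating Network Behavior Signatures of Applications, and Collection Server”, Korean Patent No. 10-1148705 (Application No. 10-2009-0046093), May 15, 2012 (filed May 26, 2009).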

Awards

Title                   Organization                       Date
Excellence Award        POSTECH Business Model Contest     11/2014
Best Paper Award        APNOMS 2012                        09/2012
Student Travel Grant    IEEE/IFIP NOMS 2010                04/2010
Academic Scholarship    Korea Student Aid Foundation       03/2005-02/2009

Skills

Programming Experience    C/C++, Python, Java, PHP, MySQL, Web Programming
System Experience         Windows, Linux, Unix, Android

Teaching Assistant & Internship

Title                    Organization                                   Date
Visiting Student and     Edge Lab., Princeton University                06/2013-03/2014
Research Collaborator    - Client-defined Cloud Storage Services
                         - Effective Mobile Network Probing Method
Teaching Assistant       CSE, POSTECH
                         Operating System                               09/2011-12/2011
                         - Programming Projects (Pintos)
                         Digital System Design                          03/2009-06/2009
                         - Circuit Design and Implementation
Research Internship      DPNM Lab. POSTECH                              03/2008-12/2008
                         - An Automated Signature Generation System
                           for Anomaly Traffic
