Satoshi Yoshida* and Takuya Kida*

Satoshi Yoshida* and Takuya Kida**Hokkaido University Graduate school of Information Science and

Technology Division of Computer Science

※ESKM2012（9月20⽇〜9月22⽇）にて発表予定

Input Text

Fixed Length Variable Length

CompressedText

FixedLength

FF Code(Fixed length to Fixed

length code)

VF Code(Variable length to Fixed

length code)

VariableLength

FV Code(Fixed length to

Variable length code)

VV Code(Variable length to Variable

length code)

Huffman Huffman Code

TunstallTunstallCode

Program Searching on Compressed Data

011101011101110100101001011101011101110100101001・・・

Compressed DataSearch Directly

Tunstall code [Tunstall, 1967]

AIVF code using multiple parse trees[Yamamoto and Yokoo, 2001]

AIVF code [Yamamoto and Yokoo, 2001]

utilizing unused codewords

utilizing context between blocks

YY codeYY code

Improves

considerably

Improves compression ratio

considerably

Multiple parse trees

001011010010010001…

VMA tree

Huge time Huge time and memory

Proposing methodProposing method

relatively to the number of kind of characters (k) in the input text

k-1 parse trees

� We can reduce the total number of nodes in comparison to YY code by using a VMA tree.

� We can reduce the total number of nodes and compression time considerably also in experiments.

� We found an upper bound and a lower bound of the number of nodes in a VMA tree.

Number of nodes in VMA tree

Number of nodes reduced byintegration

Upper bound �� 1

2��

2� � � � 1

2� � 1 � � 2

� �� 1

� � 1� � � 2

Lower bound1

� � 1� ��

2��

2� � 3

k : alphabet sizeM : number of codewords

� Let Σ be a finite alphabet.

� Elements in alphabet are sorted in descending

order of their probabilities.

� We assume information source is memoryless.

� We simply say “encoding” encoding to the sequence

on {0, 1}.

aabbbbc Σ={a, b, c}

c a001

110011

Incomplete internal nodeAn internal node which doesn’thave k children.

input text:aabbbbc

compressed data sequence:001 101 101101 101 111

Each branch is labeled by a symbol

in Σ={a, b, c}.

Each leaf node and incomplete internal node has a codeword.

to the block.

Input text is parsed into blocks and we output

codeword corresponding to the block.

001 010000

c a001

110011

We switch multiple parse trees according to the context.

� Problem: time and space consuming◦ We construct k – 1 parse trees.

� We can reduce total number of nodes by sharing nodes.◦ We need to mark each node n in a VMA tree in order to tell which trees the node belongs to.◦ We can realize that by holding the least i such that nbelongs to Ti.

…0T1T

2−kT2T

VMA treemultiple parse trees

� Theorem

kS −

kS −)(i

kS)1( +i

Ti Ti +1

Let be the subtree of that consists of all the nodes under the node corresponding with , which is direct child of the root. Then completely covers . We denote this relation by .

ja)1( +

jiS)(i

)1()( +

ji SS p

From this theorem, we can tell which trees a node belongs to easily.

0T 1T 2−kT

2−kS)0(

1−kS

2−kS)1(

1−kS)1(

)2( −k

kS1−kS1S 2−kS

−−

)2()0(

From the theorem, we have:

VMA tree

multiple parse trees

We call the integrated parse tree a VMA tree.

�Theorem

The number of reduced nodes is not less

than .

2��

2� � 3

2��

2� � 3

……

2−kT

2−kS

1−kS

2−kS)1(

1−kS)1(

)2( −k

3−kT

)3( −k

Each reduced subtree has at least one node.

1−k 2−k 2

Summation:

�Theorem

The number of nodes in a VMA tree is

not less than .

� � 1� ��

��

� ��

��

Largest subtree and it remains in VMA tree.

Pr �� Pr �� ⋯ � Pr ��

The VMA tree is smallest when Pr �� Pr �� ⋯ � Pr �� .

)2( −k

#of codewords:

#of nodes:

� � 1

��

�� 1

� � 1

��

�� 1

� � 1

��

�� 1

� � 1

��

�� 1

� � 1

��

�� 1

� � 1

Summation is not less than �

�� .

� Comparison◦ Total number of nodes (Exp1)◦ Compression times (Exp2)◦ Total number of nodes and upper bound and lower bound of nodes in VMA tree on random sequence (Exp3)

� Algorithms◦ YY coding (YY)◦ Encoding using a VMA tree (VMA)

� Environments◦ CPU: Intel Pentium® 4 Processor 3.0GHz Hyper Threading◦ Memory: 2GB◦ OS: Debian GNU/Linux 5.0◦ Language: C++◦ Compiler: g++4.3

� Codeword length: 12 bits

� The Canterbury Corpus†

file size (in bytes) k content

alice29.txt 152,089 74 English text

asyoulik.txt 125,179 68 Shakespeare

cp.html 24,603 86 HTML source

fields.c 11,150 90 C source

grammar.lsp 3,721 76 LISP source

kennedy.xls 1,029,744 256 Excel Spreadsheet

lcet10.txt 424,754 84 Technical writing

plrabn12.txt 481,861 81 Poetry

ptt5 513,216 159 CCITT test set

sum 38,240 255 SPARC Executable

xargs.1 4,227 74 GNU manual page†http://corpus.canterbury.ac.nz/descriptions

k : 74 68 86 90 76 256 84 81 159 255 74

7 times

32 times

ressio

k : 74 68 86 90 76 256 84 81 159 255 74

3 times

18 times

0 32 64 96 128 160 192 224 256

alphabet size

the number of nodes

on uniform distribution

upper bound

lower bound

0 32 64 96 128 160 192 224 256

alphabet size

the number of nodes

on Zipf distribution

upper bound

lower bound

� We showed that we can reduce the total numberof nodes in comparison to YY code by using a VMAtree.

� We also showed that we can reduce the totalnumber of nodes and times considerably inexperiments.

� We found an upper bound and a lower bound of thenumber of nodes in a VMA tree.

� Applying this idea to STVF coding [Kida, DCC2009]

� Finding tighter bounds

Satoshi Yoshida* and Takuya Kida*

Documents

Transcript of Satoshi Yoshida* and Takuya Kida*

Photographer Kimiko Yoshida

FNC Lozano&Yoshida ÍndicedeComCafete

Yoshida & Romme kunst en cultuur

JUNKI YOSHIDA SENSEI · 2016. 11. 26. · JUNKI YOSHIDA SENSEI TOURNAMENT DIRECTOR AND HOST OF THE YOSHIDA CUP NORTHWEST CLASSIC INVITATIONAL INTERNATIONAL KARATE CHAMPIONSHIP & SEMINARS

Multiple Pattern Matching in LZW Compressed Text Takuya KIDA Masayuki TAKEDA Ayumi SHINOHARA Masamichi MIYAZAKI Setsuo ARIKAWA Department of Informatics.

Palestra CMMI Fatec Ipiranga 2011 - David Yoshida

Laser stability Takuya Natsui. Oscillator phase lock stability.

Tokyo Stock Exchange’s XBRL projectarchive.xbrl.org/.../files/Yoshida-TokyoStockExchangeXBRLProject.pdf · Tokyo Stock Exchange’s XBRL project Koji Yoshida Tokyo Stock Exchange,

Saki Yoshida Mio Imada Suhyun Kim

Spectral decomposition of the automorphic …takuya/papers/spec.pdfSpectral decomposition of the automorphic spectrum of GSp(4) Takuya KONNO takuya@math.kyushu-u.ac.jp Graduate School

Yoshida Gakuen - final project - Ryuu Taiji

TCC_Claudio Yukio Yoshida

Mitsuhiro Yoshida

Masakatsu Yoshida Origami

Sho Murakami, Takuya Yoshihiro, Etsuko Inoue and Masaru Nakagawa

Instructions for use€¦ · MICHITOSHI YOSHIDA Okayama Astrophysical Observatory, National Astronomical Observatory, Kamogata, Okayama 719-02, Japan Electronic mail: yoshida@oao.nao.ac.jp

Material de Aula Rubens Yoshida parte 1

Revolución industrial kida

Nakije kida poster 01.03.2014 Malta conference

Takuya KAWABATA (tkawabat@mri-jma.go.jp) · Takuya KAWABATA (tkawabat@mri-jma.go.jp), TohruKURODA, HiromuSEKO, Kazuo SAITO (Meteorological Research Institute/JMA) NHM-4DVAR is a cloud-resolving