1
PyCon JP 2015
Renyuan Lyu
呂仁園
Chun-Han Lai
賴俊翰
Karaoke-style Read-aloud System
Chang Gung Univ.
Taiwan
Oct 10 (Saturday), 2 p.m.–2:30 p.m., in 会議室1/Conference Room 1
CguTextKaraoke: a Karaoke-style Read-aloud System
Using Speech Alignment and Text-to-Speech Technology
Chun-Han Lai (賴俊翰)
Renyuan Lyu (呂仁園)
Chang Gung University (長庚大學) Taiwan (台灣)
2
Abstract
• A procedure to create a speech-to-text synchronization file from an original text-only file
– can be used to highlight text just like a karaoke machine
– very useful for language-learning purposes.
• TTS (text-to-speech) technology in the cloud, like Google TTS
• Speech-recognition technology, like HTK, for temporal alignment
3
Introduction
• Starting from a text-only file, we use a cloud-based text-to-speech (TTS) technology, like Google Translate/TTS, together with a speech-recognition technology, like the Hidden Markov Model Toolkit (HTK), to generate an associated timed-text file that aligns the text with the speech waveform along the temporal axis.
• Python is used not only as the glue linking different kinds of software resources, like Google Translate and HTK, but also as a powerful tool for all the text-processing tasks in this project.
• From such a timed-text file, we also provide a JavaScript-based web app and a Python GUI program that display time-aligned, word-level highlighted text like a karaoke machine, which is considered very useful for language learning.
4
a Karaoke-style Text Read-aloud System
https://www.youtube-nocookie.com/embed/9a5KoXNCagM?start=180
• Karaoke (カラオケ) is a form of interactive entertainment in which an amateur singer sings along with recorded music.
• Lyrics are usually displayed on a video screen, along with a moving symbol, changing color, or music video images, to guide the singer.
• Here is one of my favorite examples
https://en.wikipedia.org/wiki/Karaoke
5
Speech Shadowing Technique for Language Learning
• The motivation of this project » https://en.wikipedia.org/wiki/Speech_shadowing
– Speech shadowing
• is a language-learning technique in which subjects repeat speech immediately after hearing it.
– A demonstration can be viewed at the following YouTube link:
• “English Speaking Practice: How to improve your English Speaking and Fluency: SHADOWING”
• https://www.youtube.com/watch?v=GVWFGIyNswI
6
Text-to-Speech Synthesis
7
Wikipedia is a multilingual, web-based, free-content encyclopedia project supported
by the Wikimedia Foundation and based on a model of openly editable content. The
name "Wikipedia" is a portmanteau of the words wiki (a technology for creating
collaborative websites, from the Hawaiian word wiki, meaning "quick") and
encyclopedia. Wikipedia's articles provide links designed to guide the user to related
pages with additional information.
Given: a piece of text, e.g., the paragraph above.
The goal is to obtain its speech.
Google TTS API in a Python module
8
• pip install gTTS

from gtts import gTTS

aText = 'Wikipedia is a multilingual, ...'
aLang = 'en'
tts = gTTS(text=aText, lang=aLang)
tts.save("aSpeech.mp3")

aText → aSpeech.mp3
https://github.com/pndurette/gTTS
FFmpeg
• About FFmpeg – [https://en.wikipedia.org/wiki/FFmpeg]
– FFmpeg is a free software project that produces libraries and programs for handling multimedia data.
– It is one of the leading multimedia frameworks, able to do many DSP tasks, including ...
• decode, encode,
• transcode, mux, demux, stream, filter and play
9
10
ffmpeg -i aSpeech.mp3 -y -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav aSpeech.wav

aSpeech.mp3 → aSpeech.wav
PCM, 16 bits/sample, little-endian, 1 (mono) channel, 16000 samples/sec
ffplay aSpeech.wav

Verifying by seeing and hearing, or using an interactive audio tool, like Audacity.
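The conversion step above is easy to script from Python; the following is a minimal sketch (it assumes `ffmpeg` is on the PATH, and the helper name is mine, not from the slides):

```python
import subprocess

def build_ffmpeg_cmd(src_mp3, dst_wav, rate=16000):
    """Build the ffmpeg argument list for mp3 -> 16-bit mono PCM wav."""
    return ["ffmpeg", "-i", src_mp3,
            "-y",                    # overwrite output without asking
            "-vn",                   # drop any video stream
            "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
            "-ac", "1",              # mono
            "-ar", str(rate),        # samples/sec
            "-f", "wav", dst_wav]

cmd = build_ffmpeg_cmd("aSpeech.mp3", "aSpeech.wav")
# subprocess.run(cmd, check=True)  # uncomment to actually convert
```

Keeping the argument list in one function makes it easy to batch-convert every sentence mp3 later in the pipeline.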
Audacity (audio editor)
• Audacity is a powerful, free, open-source digital audio editor
– Its features include:
• Recording and playing back sounds
• Importing and exporting of WAV, MP3, ...
• Viewing and editing via cut, copy, and paste, ...
11
[Figure: waveforms of aSpeech.mp3 and aSpeech.wav]
Text-to-Speech Alignment
12
Wikipedia is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links designed to guide the user to related pages with additional information.
Given: a piece of Text and its speech, e.g.,
The goal is to obtain a ‘Timed-Text’
start(s) end(s)  word
0.000   0.080   sil
0.080   0.870   wikipedia
0.870   0.990   is
0.990   1.080   a
1.080   2.010   multilingual
2.010   2.140   sil
2.160   2.240   sil
2.240   3.020   webbased
3.020   3.180   sil
3.204   3.354   sil
3.354   4.284   freecontent
4.284   5.374   encyclopedia
5.374   5.774   project
5.774   6.454   supported
6.454   6.754   by
6.754   6.904   the
6.904   7.574   wikimedia
7.574   8.414   foundation
8.414   8.514   sil
8.532   8.622   sil
8.622   8.852   and
8.852   9.242   based
9.242   9.382   on
9.382   9.432   a
9.432   9.982   model
9.982   10.032  of
10.032  10.592  openly
10.592  11.212  editable
11.212  11.802  content
11.802  11.932  sil
:       :       :
Wav splitting
13
At the sentence level, this can be done straightforwardly by extracting the time information from the TTS mp3 files, which are received sentence by sentence.
Sentence boundaries
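Finding the sentence boundaries in the text is the first half of this step; a minimal regex-based splitter is sketched below (real text may need a smarter tokenizer, and the function name is mine):

```python
import re

def split_sentences(text):
    """Split on ., !, ? followed by whitespace, keeping the punctuation."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

text = ("Wikipedia is a multilingual, web-based, free-content "
        "encyclopedia project. The name \"Wikipedia\" is a portmanteau.")
sentences = split_sentences(text)
# each sentence is then sent to TTS one by one; the duration of each
# returned mp3 gives the corresponding sentence boundary in the audio
```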
Phonetic Transcription
• Speech-recognition technology needs the text transcribed into phonetic symbols in order to build up phone models.
14
“Wikipedia is a multilingual, web-based, free-content encyclopedia project.”
“wikipedia ɪz ə məltilɪŋwəl, wɛb- best, fri- kɑntɛnt ənsɑjkləpidiə prɑdʒɛkt.”
”wikipedia Iz @ m@ltilINw@l, wEb- best, fri- kAntEnt @nsAykl@pidi@ prAdZEkt.”
Original English Text: (ASCII only, perhaps!)
Transcription in IPA: (needs Unicode)
Transcription in SAMPA: (ASCII only, including non-alphabet symbols)
http://upodn.com/phon.asp
• Post-processing of phonetic transcription
– To map, or simply clean, all undesired symbols from the multiple styles of output (usually Unicode, or some non-alphabet symbols)
• For plain English (en),
– approximately use the original text itself as the phone sequence.
– Although this seems too simple, it has worked well so far.
• For Traditional Chinese (zh-tw),
– Google Translate was used to get phonetic symbols in Pinyin (拼音, pīnyīn), and then plain romanization (eliminating the tone marks).
• For Japanese (ja),
– MeCab has been used recently to get the Katakana (片仮名, カタカナ).
– Romkan has been used to transform Katakana to romaji (kunrei).
• Thanks to Python, which does most of the work in this stage of processing!!
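Eliminating the Pinyin tone marks can be sketched with Unicode normalization: decompose to NFD, then drop the combining marks. This is a simplification (ü loses its diaeresis too), and the function name is mine:

```python
import unicodedata

def strip_tones(pinyin):
    """Remove tone diacritics: NFD-decompose, then drop combining marks."""
    decomposed = unicodedata.normalize('NFD', pinyin)
    return ''.join(ch for ch in decomposed
                   if unicodedata.category(ch) != 'Mn')

strip_tones('pīnyīn')        # -> 'pinyin'
strip_tones('wéijī bǎikē')   # -> 'weiji baike'
```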
15
• Phonetic transcription for English
– Using the regular-expression module
16
enText = '''Wikipedia is a multilingual, web-based, free-content encyclopedia project.'''

import re

def text2phn_en(text):
    # join words with '_', then strip undesired symbols
    phn = '_'.join(text.lower().split())
    pats = '\'|\"|\-|^_|_$|,|\.|\(|\)'
    return re.sub(pats, '', phn)

phn = text2phn_en(enText)
# phn == 'wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project'
• Phonetic transcription for Traditional Chinese
– Using the Google Translate/TTS API
17
tcText = '維基百科是一個自由內容'
phn = text2phn_tc(tcText)
# phn == 'weiji_baike_shi_yige_ziyou_neirong'

import urllib.request

GOOGLE_TTS_URL = 'https://translate.google.com.tw/translate_a/single?dt=bd&dt=ex&dt=at&'
req = urllib.request.Request(GOOGLE_TTS_URL + data)
• Phonetic transcription for Japanese
– Using MeCab and Romkan
18
jpText = '''ウィキペディアは、 信頼されるフリーなオンライン百科事典、'''
phn = text2phn_jp(jpText)
# phn == 'wikipedyia_wa_sil_sinrai_sa_reru_furi-_na_onrain_hyakka_ziten'

import MeCab
import romkan

y = MeCab.Tagger().parse(text)
...
kun = romkan.to_kunrei(phn)
At the Halfway Point
• a bundle of files: wav/lab pairs
19
• HMM Toolkit (HTK) – http://htk.eng.cam.ac.uk/
– Given a speech utterance with its phone sequence, the speech can be well aligned with the phones by the ‘forced alignment’ technique in the HMM approach.
– The HMM Toolkit, HTK, provides a convenient set of tools for applying the HMM approach.
20
Speech recognition technology
• The HTK overview
21
HTK processing (abstract) ....
22
• #[00] setting the working dir
• #[01] creating the (hmm) model prototype
• #[02] label processing
• #[03] feature extraction
• #[04] model initialization
• #[05] model training
• #[06] forced alignment
• #[07] post file moving operation
HTK processing (detail)....
23
#[00] setting the working dir
dirName= ./_wav/
#[01] creating the (hmm) model prototype
CreateHProto....
myHmmPro
N = 3 M = 6
#[02] label processing
000, 0,----> .\_htk\hled -A -i spLab00.mlf -n spLab00.lst -S spLab.scp hLed00
001, 0,----> .\_htk\hled -A -i spLab.mlf -n spLab.lst -S spLab.scp hLed.led
002, 0,----> .\_htk\hled -A -i spLab_p.mlf -n spLab_p.lst -S spLab.scp -I spLab
#[03] feature extraction
003, 0,----> .\_htk\HCopy -A -C hCopy.conf -S spWav2Mfc.scp 1>> 1.htk.out 2>> 2.htk.out
#[04] model initialization
004, 1,----> mkdir hmms_p
005, 0,----> .\_htk\HCompV -A -m -C hInit.conf -S spMfc.scp -I spLab_p.mlf -M hmms_p
#[05] model training
006, 0,----> .\_htk\HERest -A -C hErest.conf -S spMfc.scp -p 1 -t 2000.0 -w 3 -
007, 0,----> .\_htk\HERest -A -C hErest.conf -p 0 -t 2000.0 -w 3 -v 0.05 -I spLab_p
: (repeating several times...)
:
#[06] forced alignment
016, 0,----> .\_htk\HVite -A -a -C hVite.conf -S spMfc.scp -d hmms_p/ -i spLab_aligned
#[07] post file moving operation
017, 1,----> mkdir outDir
018, 1,----> copy spLab_aligned.mlf outDir\./_wav_aligned.mlf
24
[Diagram: HLed label processing — files involved: spLab.scp, spLab.mlf, spLab.lst, hLed.led, spLab00.mlf, spLab00.lst, hLed00.led, spLab_p.mlf, spLab_p.lst, spLab_p.dic]
25
[Diagram: HCopy — converts *.wav to *.mfc feature files, configured by hCopy.conf with file list spWav2Mfc.scp]
HCompV
26
[Diagram: HCompV — initializes the HMMs from *.mfc (listed in spMfc.scp), labels spLab_p.mlf, and prototype myHmmPro; outputs hmms_p/*]
HERest
27
[Diagram: HERest — re-estimates the HMMs in hmms_p/* from *.mfc (spMfc.scp), labels spLab_p.mlf, and list spLab_p.lst, configured by hErest.conf; accumulator hmms_p/HER1.acc; repeated for N iterations (N=5)]
HVite
28
[Diagram: HVite — forced alignment using hVite.conf, *.mfc (spMfc.scp), spLab_p.lst, spLab.mlf, spLab_p.dic, and the models in hmms_p/; outputs spLab_aligned.mlf]
HTK summary
29
HTK tools used: HLed → HCopy → HCompV → HERest → HVite
#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
8700000 9900000 is -855.988770
9900000 10800000 a -693.554871
10800000 20100000 multilingual -7268.197266
20100000 21400000 sil -791.746216
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
8600000 10200000 sil -1048.225220
.
"./_wav/SN2.rec"
0 1500000 sil -1100.892822
1500000 10800000 freecontent -7094.197266
10800000 21700000 encyclopedia -8148.633789
21700000 25700000 project -3247.493896
25700000 32500000 supported -5594.979492
32500000 35500000 by -2412.487305
35500000 37000000 the -1176.310547
37000000 43700000 wikimedia -5128.852051
43700000 52100000 foundation -5995.618164
52100000 53100000 sil -695.872864
.
.
.
spLab_aligned.mlf
wavDir/
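An aligned .mlf file like the one above can be parsed back into a timed text with a few lines of Python. HTK label times are in 100-nanosecond units, so dividing by 1e7 yields seconds; the parser below is a sketch, and its name is mine:

```python
def parse_mlf(mlf_text):
    """Parse an HTK MLF into {utterance: [(start_s, end_s, label), ...]}."""
    utts, current = {}, None
    for line in mlf_text.splitlines():
        line = line.strip()
        if not line or line == '#!MLF!#' or line == '.':
            continue
        if line.startswith('"'):                  # e.g. "./_wav/SN0.rec"
            current = line.strip('"')
            utts[current] = []
        else:
            start, end, label = line.split()[:3]  # 4th field is the log score
            utts[current].append((int(start) / 1e7, int(end) / 1e7, label))
    return utts

mlf = '''#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
.
'''
parse_mlf(mlf)["./_wav/SN0.rec"][1]  # -> (0.08, 0.87, 'wikipedia')
```

The resulting (start, end, word) triples are exactly the timed text that the karaoke browsers consume.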
The major algorithm in HTK
30
‘Holiday Shopping’ = ‘h’+’o’+’l’+’i’+’d’+’ay’+’sil’+’sh’+’o’+’p’+’I’+’ng’
[Diagram: an HMM for each phone: ‘h’, ‘o’, ..., ‘ng’]
• Forced alignment in HTK
– 1. Given a speech signal
– 2. Doing the pronunciation transcription
• Pronunciation symbols must be ASCII-only!!
– 3. Training to get the HMM models
31
– 4. Doing the Viterbi Search for the optimal path (alignment):
32
#!MLF!#
"wavDir/SN0001.rec"
0 800000 sil -567.865356
800000 8700000 wikipedia -5670.471680
8700000 10000000 is -951.059692
10000000 10600000 a -489.843994
10600000 20000000 multilingual -7398.754395
20000000 20700000 sil -416.119415
.
"wavDir/SN0002.rec"
0 900000 sil -632.964050
900000 8600000 webbased -6000.767578
8600000 9900000 sil -914.236206
.
"wavDir/SN0003.rec"
0 2100000 sil -1373.137817
2100000 9000000 freecontent -5306.260742
9000000 18500000 encyclopedia -6654.958984
18500000 25600000 project -5698.730469
25600000 32700000 supported -5713.494141
32700000 33200000 by -429.306763
33200000 34800000 the -1205.477539
34800000 41500000 wikimedia -5115.318359
41500000 50000000 foundation -6074.208496
50000000 52000000 and -1746.236938
52000000 56200000 based -3267.695801
56200000 57000000 on -585.264404
57000000 57700000 a -577.346130
57700000 63200000 model -3769.413574
63200000 63800000 of -524.015503
63800000 65300000 sil -1129.348633
.
wavDir.align
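The Viterbi search behind forced alignment can be illustrated with a toy dynamic program: the phone sequence is fixed, and we look for the monotonic frame-to-phone segmentation with the best total score. This is a didactic sketch, not HTK's implementation, and the scores below are made up:

```python
def force_align(scores):
    """scores[t][p]: log-score of frame t under phone p of the fixed sequence.
    Returns the best phone index per frame (monotonic path from phone 0
    to the last phone; requires at least as many frames as phones)."""
    T, P = len(scores), len(scores[0])
    NEG = float('-inf')
    best = [[NEG] * P for _ in range(T)]
    back = [[0] * P for _ in range(T)]
    best[0][0] = scores[0][0]          # the path must start in the first phone
    for t in range(1, T):
        for p in range(P):
            stay = best[t - 1][p]                        # remain in phone p
            move = best[t - 1][p - 1] if p > 0 else NEG  # advance from p-1
            if move > stay:
                best[t][p], back[t][p] = move + scores[t][p], p - 1
            else:
                best[t][p], back[t][p] = stay + scores[t][p], p
    # backtrack from the last phone at the last frame
    path, p = [P - 1], P - 1
    for t in range(T - 1, 0, -1):
        p = back[t][p]
        path.append(p)
    return path[::-1]

# 4 frames, 2 phones: frames 0-1 favor phone 0, frames 2-3 favor phone 1
scores = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
force_align(scores)  # -> [0, 0, 1, 1]
```

HTK does the same search over HMM states with Gaussian-mixture frame scores, but the boundary times in the .mlf output come from exactly this kind of backtracked path.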
33
Now it’s time to KaraOke !
A Browser in JavaScript and HTML for Text-KaraOke
• https://youtu.be/11-ltx0yv_o
34
A Browser in Python using Tkinter for Text-KaraOke
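Both browsers share the same core job: given the timed text, decide which word to highlight at playback time t. A minimal sketch of that lookup follows (a bisect on the start times; the names and sample data are mine, taken from the alignment example earlier):

```python
import bisect

# (start_s, end_s, word) triples, as produced by the alignment step
timed_text = [(0.00, 0.08, 'sil'), (0.08, 0.87, 'wikipedia'),
              (0.87, 0.99, 'is'), (0.99, 1.08, 'a'),
              (1.08, 2.01, 'multilingual')]
starts = [seg[0] for seg in timed_text]

def word_at(t):
    """Return the word whose span contains time t (None if out of range)."""
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and timed_text[i][0] <= t < timed_text[i][1]:
        return timed_text[i][2]
    return None

word_at(0.5)   # -> 'wikipedia'
word_at(1.50)  # -> 'multilingual'
```

A GUI then just polls the player's position on a timer and re-colors the word returned by this lookup.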
35
Conclusion & Future Work
• Make the process more automatic.
• Make the user interface friendlier.
• Make the program more robust.
• We call for your help to improve it.
• Thank you for listening!
36
37
PyCon JP 2015
Renyuan Lyu
呂仁園
Chun-Han Lai
賴俊翰
Karaoke-style Read-aloud System
Oct 10 (Saturday), 2 p.m.–2:30 p.m., in 会議室1/Conference Room 1
Thank you for listening. ご聴取 有り難う 御座いました。 [Japanese: “Thank you for listening.”]
感謝您的收聽。 [Chinese: “Thank you for listening.”]