LYU0103: Speech Recognition Techniques for Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong
Lei Mo
Outline of Presentation
Project overview
Introduction to SR
Comparison of different SR engines
Audio Extraction
Speech Segmentation
Visual Training Tool
String Alignment
Summary
Project Overview
Our speech recognition (SR) project is part of a parent project called VIEW.
VIEW includes a digital video library, which uses the tool produced by our project to generate captions for its videos.
VIEW stands for Video over InternEt and Wireless; it is supervised by our supervisor.
The objective of VIEW is to develop a multilingual digital video content hub for cultural exchange and commercial deployment.
Video Information Processing
VIEW contains a Video Information Processing (VIP) engine, whose major components include speech recognition, video OCR, and scene detection.
Our task is SR.
Our Project Objectives
Apply speech recognition techniques to video data obtained from the digital video library to retrieve information, including the text of the speech and the timing of every word spoken
We need to embed ViaVoice (introduced later) in the system and try to increase the accuracy of the SR engine as much as possible
Beyond the first objective, we need to perform multilingual speech recognition, timing information retrieval, and real-time dictation. More specifically, …
Video with SR Processing
As shown in the diagram, the ultimate goal of our project is to display the video together with the text of the speaker's speech, with the word currently being spoken highlighted.
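The highlighting goal can be sketched as a lookup over per-word timing records. This is a minimal Python sketch, not the project's implementation: the `TimedWord` record, the helper functions, and the sample timings are all hypothetical, and it assumes the SR step yields a start and end time for each word.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimedWord:
    """Hypothetical record: one recognized word and when it is spoken (seconds)."""
    text: str
    start: float
    end: float


def current_word_index(words: List[TimedWord], t: float) -> Optional[int]:
    """Return the index of the word being spoken at playback time t, or None.

    Assumes `words` is sorted by start time and that words do not overlap.
    """
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word that starts at or before t
    if i >= 0 and t < words[i].end:
        return i
    return None  # t falls in a pause or outside the transcript


def render_caption(words: List[TimedWord], t: float) -> str:
    """Build a caption line with the current word marked by [brackets]."""
    idx = current_word_index(words, t)
    return " ".join(
        f"[{w.text}]" if i == idx else w.text for i, w in enumerate(words)
    )


# Made-up timings, for illustration only.
transcript = [
    TimedWord("good", 0.0, 0.4),
    TimedWord("evening", 0.4, 1.0),
    TimedWord("everyone", 1.2, 1.9),
]
```

A player would call `render_caption(transcript, t)` on each refresh with the current playback time; the binary search keeps the lookup cheap even for long transcripts.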
SR Process
Input signal: the speaker's voice is captured from an input device (commonly a microphone). The quality of the input device and the noise in the room can strongly influence the accuracy of the SR system.
Feature extraction: tries to deal with the problems created in the first stage. Its two aims are to separate classes of sound, such as music and speech, and to effectively suppress irrelevant sources of variation (we are mainly working on this part).
The other stages are the work of the SR engine. Since we will not modify the ViaVoice SR engine, we will not explain them in detail here.
Challenges and Difficulties of SR
Speaker Variability
Channel Variability
Linguistic Variability
Coarticulation
Many improvements have been made in past decades, but computers are still not able to understand every single word pronounced by everyone. Speech recognition is still a very difficult problem.
Speaker variability: two speakers, or even the same speaker, will pronounce the same word differently.
Channel variability: the quality and position of the microphone and the background environment will affect the output.
Linguistic variability: several factors (phonetics, phonology, syntax) will affect the input audio signal.
Coarticulation: a sound is affected by the sounds both preceding and following it.
So we need an SR engine to deal with these problems.
Requirement of our project
state-of-the-art high quality SR engine
The nature of our project requires a state-of-the-art high quality SR engine, which can dictate the speech in a large volume of video segments and produce text with an acceptable accuracy
Different SR Engines
CMU Sphinx
Microsoft SAPI
IBM ViaVoice
During the summer holidays in 2001, we investigated different speech recognition engines such as CMU Sphinx, Microsoft SAPI and IBM ViaVoice.
Visit to IBM and Microsoft this summer
IBM Research Lab, Beijing
Microsoft Research Institute, Beijing
We also visited the IBM Research Lab China and the Microsoft Research Institute of China in Beijing this summer, together with our supervisor, Professor Michael R. Lyu.
Here we want to summarize what we learned and compare the characteristics of the different SR engines,
and then explain why we chose IBM ViaVoice for our project.
CMU Sphinx
Advantages: a. open source
b. free software
c. good for researchers and developers
Disadvantages: a. limited documentation
b. No Chinese version
c. Acoustic build process can take many days
Free: no need for initial investment
No Chinese version: SphinxTrain (the acoustic training environment) was released in July 2001, which enables us to build models for any language. However, it needs a large volume of acoustic data and investigation of the system.
Microsoft SAPI
Advantages:
a. application and engine do not directly communicate with each other -- all communication is done via SAPI.
b. Hides implementation details, which makes developing SR engines and applications more convenient
Disadvantages:
a. An SR engine has to implement COM objects and interfaces to be a SAPI 5 engine
b. Limited language versions
c. Does not support a grammar compiler
Disadvantages: b: there is no Cantonese version
c: a grammar compiler is supported by ViaVoice
IBM ViaVoice
Advantages:
a. Supports dynamic vocabulary handling, database querying, and adding new words to the user's vocabulary
b. Supports 13 languages, including Cantonese and Chinese
c. Developers can write their own audio library to handle input
d. Supports Grammar Compiler APIs
Disadvantages:
a. Constrained input audio data format
Advantages:
a: vocabulary size can expand or shrink according to different applications
Disadvantages:
a: the input must be 22 kHz mono voice data, so we have to convert the wave format in our work
Why choose ViaVoice?
ViaVoice has the highest dictation accuracy if fully trained.
It uses a 150,000-word base dictionary, and users can add up to 64,000 words of their own.
More importantly, it provides both Cantonese and Chinese versions, which enables us to integrate it into the VIEW project.
Our objectives with ViaVoice
We use ViaVoice as the SR engine for the whole project. Why?
Our first task is to get the SR engine to work. Then we try to increase the accuracy of the speech recognition engine as much as possible and obtain the timing information of the speech.
We apply some techniques to the raw audio data and build our own visual training tools. Let me hand over to my partner to introduce them.
Audio Extraction
Our project and also the speech recognition engine mainly deal with audio data
But in the digital video library, most data are stored as multimedia files, a mixture of both video and audio data
Therefore, audio extraction is needed
The IBM ViaVoice engine supports only mono PCM data at 22, 11, or 8 kHz
We decided to store the audio data in mono, 22 kHz wave format
Our project and also the speech recognition engine mainly deal with audio data. But in the digital video library, most data are stored as multimedia files, a mixture of both video and audio data, such as MPEG or AVI files. Therefore, the first step we need to take is to extract the audio data from these multimedia files.
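The target format (mono, 22 kHz wave) can be illustrated with a small conversion routine. This is a rough Python sketch using the standard `wave` module, not the extractor used in the project: the function name `to_mono_half_rate` is hypothetical, and it assumes the input is 16-bit stereo PCM at twice the target rate (for example 44100 Hz in, 22050 Hz out). Averaging two consecutive frames is only a crude substitute for a proper resampling filter.

```python
import array
import sys
import wave


def to_mono_half_rate(src, dst):
    """Downmix 16-bit stereo PCM to mono and halve the sample rate.

    `src` and `dst` may be file paths or binary file objects.
    Each output sample is the average of two consecutive stereo frames
    (four input samples), a very rough low-pass before decimation.
    """
    with wave.open(src, "rb") as reader:
        if reader.getnchannels() != 2 or reader.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM input")
        in_rate = reader.getframerate()
        samples = array.array("h")
        samples.frombytes(reader.readframes(reader.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    out = array.array("h")
    # Interleaved layout is L0 R0 L1 R1 ...; consume two frames per step.
    for i in range(0, len(samples) - 3, 4):
        out.append(sum(samples[i:i + 4]) // 4)
    if sys.byteorder == "big":
        out.byteswap()

    with wave.open(dst, "wb") as writer:
        writer.setnchannels(1)             # mono
        writer.setsampwidth(2)             # 16-bit samples
        writer.setframerate(in_rate // 2)  # e.g. 44100 -> 22050
        writer.writeframes(out.tobytes())
```

For example, `to_mono_half_rate("input.wav", "output.wav")` (hypothetical file names) would write a mono file at half the input rate.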
Microsoft DirectShow
Under the Win32 environment, Microsoft DirectShow provides a convenient multimedia library
The basic building block of DirectShow is a software component called a filter
Filters receive input and produce output
A set of connected filters is called a filter graph
Filter Graph for Playing MPEG
[Filter graph diagram: C:\tvbnews.mpg -> MPEG-I Stream Splitter; audio branch -> MPEG Audio Decoder -> Default DirectSound Device; video branch -> MPEG Video Decoder -> Video Renderer]
This is the normal playback filter graph of an MPEG file. At the top of the graph, a source filter is responsible for grabbing the media stream from the input file. Then the MPEG-I Stream Splitter divides the obtained media stream into two separate streams: video and audio. After that, the MPEG Audio Decoder and MPEG Video Decoder decode the corresponding stream data and pass them to the output devices for rendering.
Filter Graph of Audio Extractor
[Filter graph diagram: C:\tvbnews.mpg -> MPEG-I Stream Splitter -> MPEG Audio Decoder -> ACM Wrapper -> WAV Dest -> C:\tvbnews.wav]
This is the filter graph of the audio extractor working on the same file. The video decoder is removed, and so are the renderers. Instead, the output stream of the MPEG Audio Decoder is directed to an ACM Wrapper, which converts the input PCM stream to the desired format. Then the WAV Dest filter multiplexes the audio stream to produce a stream that can be written to a file by a File Writer.
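The idea of chaining filters so that each one's output feeds the next can be modelled in a few lines. The Python sketch below is a toy illustration of the concept only. It does not use the DirectShow API; the `FilterGraph` class and the stand-in functions are hypothetical, and their behaviour (flattening lists, keeping every second sample) is invented purely to make the data flow visible.

```python
from typing import Callable, List, Tuple


class FilterGraph:
    """Toy linear filter graph: each filter's output is the next one's input."""

    def __init__(self) -> None:
        self.filters: List[Tuple[str, Callable]] = []

    def add(self, name: str, fn: Callable) -> "FilterGraph":
        self.filters.append((name, fn))
        return self  # allow chained calls

    def run(self, data):
        for _name, fn in self.filters:
            data = fn(data)
        return data

    def describe(self) -> str:
        return " -> ".join(name for name, _fn in self.filters)


# Stand-ins for the real filters; the logic is made up for illustration.
def split_audio(stream):
    return stream["audio"]              # keep only the audio branch


def decode_audio(packets):
    return [s for p in packets for s in p]  # flatten packets into samples


def convert_format(samples):
    return samples[::2]                 # pretend conversion: keep every 2nd sample


def wrap_as_wav(samples):
    return {"container": "wav", "samples": samples}


extractor = (
    FilterGraph()
    .add("MPEG-I Stream Splitter", split_audio)
    .add("MPEG Audio Decoder", decode_audio)
    .add("ACM Wrapper", convert_format)
    .add("WAV Dest", wrap_as_wav)
)
```

Calling `extractor.run(...)` on a dictionary with `"video"` and `"audio"` entries passes only the audio data down the chain, mirroring how the extractor graph drops the video decoder and the renderers.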