LYU0103 Speech Recognition Techniques for Digital Video Library
Supervisor: Prof. Michael R. Lyu
Students: Gao Zheng Hong, Lei Mo

Outline of Presentation

Project overview
Introduction to SR
Comparison of different SR engines
Audio Extraction
Speech Segmentation
Visual Training Tool
String Alignment
Summary

Project Overview

Our speech recognition (SR) project is part of a parent project called VIEW (Video over InternEt and Wireless), which is also supervised by our supervisor. VIEW includes a digital video library, which uses the tool produced by our project to generate captions for its videos. The objective of VIEW is to develop a multilingual digital video content hub for cultural exchange and commercial deployment.

Video Information Processing

VIEW contains an information processing engine, VIP (Video Information Processing). Its major components include speech recognition, video OCR, and scene detection. Our task is speech recognition.

Our Project Objectives

Apply speech recognition techniques to video data obtained from the digital video library to retrieve information, including the text of the speech and the timing of every word spoken.
We need to embed ViaVoice (introduced later) in the system and increase the accuracy of the SR engine as much as possible.
After the first objective, we need to perform multilingual speech recognition, timing information retrieval, and real-time dictation. More specifically, …3

Video with SR Processing

As shown in the diagram, the ultimate goal of our project is to display the video together with the text of the speaker's speech, with the word currently being spoken highlighted.
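The highlighting described above can be sketched as a lookup from playback time to word index. This is only an illustration under stated assumptions: the `timed_words` list, its timings, and both function names are hypothetical, and the slides do not say how the recognizer reports timing.

```python
from bisect import bisect_right

# Hypothetical recognition output: (start time in seconds, word).
# These values are made up for illustration only.
timed_words = [
    (0.00, "good"),
    (0.35, "evening"),
    (0.90, "this"),
    (1.10, "is"),
    (1.25, "the"),
    (1.40, "news"),
]


def current_word_index(timed_words, playback_time):
    """Return the index of the word being spoken at playback_time,
    or None if playback has not yet reached the first word."""
    starts = [start for start, _ in timed_words]
    index = bisect_right(starts, playback_time) - 1
    return index if index >= 0 else None


def render_caption(timed_words, playback_time):
    """Return the caption text with the current word wrapped in brackets."""
    current = current_word_index(timed_words, playback_time)
    parts = []
    for i, (_, word) in enumerate(timed_words):
        parts.append(f"[{word}]" if i == current else word)
    return " ".join(parts)
```

A player would call `render_caption` on every display refresh with the current playback position. The binary search keeps the lookup cheap even for long transcripts.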

SR Process

Input signal: the speaker's voice is captured from an input device (commonly a microphone). The quality of the input device and the noise in the room can strongly influence the accuracy of the SR system.
Feature extraction: this stage tries to deal with the problems created in the first stage. Its two aims are to separate classes of sounds, such as music and speech, and to effectively suppress irrelevant sources of variation. This is the part we mainly work on.
The other stages are the work of the SR engine. Since we will not modify the ViaVoice SR engine, we do not explain them in detail here.

Challenges and Difficulties of SR

Speaker variability: two speakers, or even the same speaker, will pronounce the same word differently.
Channel variability: the quality and position of the microphone and the background environment affect the output.
Linguistic variability: several factors (phonetics, phonology, syntax) affect the input audio signal.
Coarticulation: a sound is affected by the sounds both preceding and following it.
Many improvements have been realized in past decades, but computers are still not able to understand every single word pronounced by everyone. Speech recognition is still a very difficult problem, so we need an SR engine that can deal with these issues.

Requirement of our project

The nature of our project requires a state-of-the-art, high-quality SR engine that can transcribe the speech in a large volume of video segments and produce text with acceptable accuracy.

Different SR Engines

During the summer holidays in 2001, we investigated different speech recognition engines, including CMU Sphinx, Microsoft SAPI, and IBM ViaVoice.

Visit to IBM and Microsoft this summer

IBM Research Lab, Beijing
Microsoft Research Institute, Beijing
This summer we also visited IBM Research Lab China and Microsoft Research Institute of China in Beijing together with our supervisor, Professor Michael R. Lyu. Here we summarize what we learned, compare the characteristics of the different SR engines, and then explain why we chose IBM ViaVoice for our project.

CMU Sphinx

Advantages:
a. Open source
b. Free software, so no initial investment is needed
c. Good for researchers and developers
Disadvantages:
a. Limited documentation
b. No Chinese version. SphinxTrain (the acoustic training environment), released in July 2001, enables us to build models for any language, but this needs a large volume of acoustic data and investigation of the system.
c. The acoustic build process can take many days

Microsoft SAPI

Advantages:
a. The application and the engine do not communicate with each other directly; all communication goes through SAPI.
b. SAPI hides implementation details, which is convenient for both the SR engine and the application.
Disadvantages:
a. An SR engine has to implement COM objects and interfaces to be a SAPI 5 engine
b. Limited language versions; there is no Cantonese version
c. Does not support a grammar compiler, which ViaVoice does support

IBM ViaVoice

Advantages:
a. Supports dynamic vocabulary handling, database querying, and adding new words to the user's vocabulary; the vocabulary size can expand or shrink according to the application
b. Supports 13 languages, including Cantonese and Chinese
c. Developers can write an audio library to handle input
d. Supports Grammar Compiler APIs
Disadvantages:
a. Constrained input audio data format: 22 kHz mono voice data, so we have to change the wave format in our work

Why choose ViaVoice?

We use ViaVoice as the SR engine for the whole project. Why? ViaVoice has the highest dictation accuracy if fully trained. It uses a 150,000-word base dictionary, and users can add up to 64,000 words of their own. More importantly, it provides both Cantonese and Chinese versions, which enables us to integrate it as a part of the VIEW project.
Our objectives with ViaVoice: our first task is to get the SR engine to work. Then we try to increase the accuracy of the speech recognition engine as much as possible and obtain the timing information of the speech. To do so, we apply some techniques to the raw audio data and build our own visual training tools. My partner will now introduce them.

Audio Extraction

Our project, like the speech recognition engine, mainly deals with audio data. In the digital video library, however, most data are stored as multimedia files, such as MPEG or AVI files, which mix video and audio data. Therefore, the first step is to extract the audio data from these multimedia files.
The IBM ViaVoice engine supports only mono, 22/11/8 kHz ACM data. We decided to store the audio data in mono, 22 kHz wave format.
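As a rough illustration of the format constraint, the following sketch inspects a wave file and reports whether it already matches the chosen storage format. It uses Python's standard `wave` module rather than anything from ViaVoice. The exact rate of 22,050 Hz and the 16-bit sample width are assumptions, as are the function names.

```python
import wave

# Target format from the slide: mono, 22 kHz (22,050 Hz assumed here).
# The 16-bit sample width is an assumption; the slides do not state it.
TARGET_CHANNELS = 1
TARGET_RATE = 22050
TARGET_SAMPLE_WIDTH = 2  # bytes per sample


def describe_wav(path):
    """Read the header of a PCM wave file and return its basic parameters."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "rate": wav.getframerate(),
            "sample_width": wav.getsampwidth(),
            "seconds": wav.getnframes() / wav.getframerate(),
        }


def needs_conversion(info):
    """Return True if the file does not match the target storage format."""
    return (
        info["channels"] != TARGET_CHANNELS
        or info["rate"] != TARGET_RATE
        or info["sample_width"] != TARGET_SAMPLE_WIDTH
    )
```

A batch job over the video library could use `needs_conversion` to skip files that were already extracted in the right format.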

Microsoft DirectShow

Under the Win32 environment, Microsoft DirectShow provides a convenient multimedia library. The basic building block of DirectShow is a software component called a filter. Filters receive input and produce output. A set of connected filters is called a filter graph.
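The filter and filter graph idea can be modelled in a few lines. The sketch below is a toy model in Python, not the DirectShow API (which is COM-based and used from C++). The `Filter` class, its methods, and the example filters are all hypothetical names chosen for illustration.

```python
class Filter:
    """A processing stage that receives input and produces output."""

    def __init__(self, name, process):
        self.name = name
        self.process = process  # function: input data -> output data
        self.outputs = []       # filters connected downstream of this one

    def connect(self, downstream_filter):
        """Connect this filter's output to another filter's input."""
        self.outputs.append(downstream_filter)
        return downstream_filter  # returned so connections can be chained

    def push(self, data, trace):
        """Process data, then hand the result to every connected filter."""
        trace.append(self.name)
        result = self.process(data)
        for downstream_filter in self.outputs:
            downstream_filter.push(result, trace)


collected = []


def store(samples):
    """Sink behaviour: keep whatever arrives."""
    collected.append(samples)
    return samples


# A three-filter graph: source -> gain x2 -> sink.
source = Filter("source", lambda samples: samples)
gain = Filter("gain x2", lambda samples: [2 * s for s in samples])
sink = Filter("sink", store)
source.connect(gain).connect(sink)

trace = []
source.push([1, 2, 3], trace)
```

After the call, `trace` records the order in which the filters ran and `collected` holds the data that reached the sink. Replacing or removing one filter changes the behaviour of the whole graph without touching the others, which is the property the audio extractor relies on.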

Filter Graph for Playing MPEG

C:\tvbnews.mpg -> MPEG-I Stream Splitter
  audio: MPEG Audio Decoder -> Default DirectSound Device
  video: MPEG Video Decoder -> Video Renderer

This is the normal playback filter graph of an MPEG file. At the top of the graph, a source filter grabs the media stream from the input file. Then the MPEG-I Stream Splitter divides the media stream into two separate streams: video and audio. After that, the MPEG Audio Decoder and the MPEG Video Decoder decode their respective streams and pass them to the output devices for rendering.

Filter Graph of Audio Extractor

C:\tvbnews.mpg -> MPEG-I Stream Splitter -> MPEG Audio Decoder -> ACM Wrapper -> WAV Dest -> C:\tvbnews.wav

This is the filter graph of the audio extractor working on the same file. The video decoder is removed, and so are the renderers. Instead, the output stream of the MPEG Audio Decoder is directed to an ACM Wrapper, which converts the input PCM stream to the desired format. Then the WAV Dest filter multiplexes the audio stream to produce a stream that can be written to a file by a File Writer.

Audio Extraction Outcome

File type    Name          Sample rate   Channels
Media file   tvbnews.mpg   44.100 kHz    Stereo
Wave file    tvbnews.wav   22.050 kHz    Mono
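The change from 44.1 kHz stereo to 22.05 kHz mono amounts to mixing the two channels and halving the sample rate. The sketch below shows one simple way to do that with Python's standard library. It is a stand-in for illustration, not the conversion the ACM Wrapper performs: averaging two consecutive stereo frames is a crude low-pass filter, the 16-bit input is an assumption, and the function name is hypothetical.

```python
import array
import sys
import wave


def stereo_to_mono_half_rate(src_path, dst_path):
    """Convert 16-bit stereo PCM to mono at half the sample rate.

    Each output sample is the average of two consecutive stereo frames
    (four input samples): channel mixing plus decimation by two.
    """
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM input")
        rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    mono = array.array("h")
    # Step over two stereo frames (L0, R0, L1, R1) at a time.
    for i in range(0, len(samples) - 3, 4):
        mono.append(sum(samples[i:i + 4]) // 4)

    if sys.byteorder == "big":
        mono.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate // 2)
        dst.writeframes(mono.tobytes())
```

For a 44,100 Hz stereo input this writes a 22,050 Hz mono file, matching the table above. A production converter would use a proper anti-aliasing filter before decimation.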

Name: 
lyu0103-1
Author: 
mlei
Company: 
ViewTech
Created: 
11/18/2001 5:24:27 PM
Slides: 
41