Voice recognition software

Eric S. Johansson esj at harvee.org
Fri May 5 12:26:18 UTC 2006


braddock wrote:
> 2: Speech recognition is HARD.  World class speaker-independent 
> recognition error rates with unlimited processing times are still 
> around 30% (and if you come up with a technique that improves half a 
> percent, you publish).

Not only is it hard, it requires incredibly specialized knowledge that
takes years to acquire.  It's not something you can hack out in a year
or two.

> 3: What is important to the government customers driving the
> technology is not necessarily what is important for commercial
> accessibility applications.  Naturally Speaking, for example, uses a
> trained limited vocabulary model with high quality mics in a
> controlled environment to get, what, 95%+ accuracy.

I believe it is referred to as a single-speaker, large-vocabulary
continuous speech recognition system.  Fortunately, they have gotten
better at training and at using a language model to improve recognition
accuracy.
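To make the language-model point concrete, here is a toy sketch of the idea (the corpus, the hypotheses, and the smoothing constant are all invented for illustration, not how any real engine is built): a smoothed bigram model scores competing word sequences, and the engine can prefer whichever hypothesis the model finds more plausible, even when the acoustics are ambiguous.

```python
import math
from collections import defaultdict

# Tiny invented training corpus for the toy language model.
corpus = "recognize speech with a language model to recognize speech".split()

bigram = defaultdict(int)
unigram = defaultdict(int)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram[(w1, w2)] += 1
    unigram[w1] += 1

def score(sentence, alpha=0.1):
    """Add-alpha smoothed bigram log-score for a word sequence."""
    words = sentence.split()
    vocab = len(unigram) + 1  # crude vocabulary size for smoothing
    s = 0.0
    for w1, w2 in zip(words, words[1:]):
        p = (bigram[(w1, w2)] + alpha) / (unigram[w1] + alpha * vocab)
        s += math.log(p)
    return s

# Two acoustically confusable hypotheses; the model prefers the one
# whose word pairs it has actually seen.
hyp_a = "recognize speech"
hyp_b = "wreck a nice beach"
best = max([hyp_a, hyp_b], key=score)
```

Real engines use far larger n-gram models and combine the language-model score with the acoustic score, but the rescoring principle is the same.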

> But those are not "real world" conditions for intelligence 
> applications, which are more concerned with things like improved 
> speaker-independent (no training) keyword extraction and speaker 
> identification on telephone-quality links (one of the most standard 
> datasets used is called <a 
> href="http://ucsu.colorado.edu/~francish/swbd.html">"Switchboard"</a>,
>  to give you some idea).

On the intelligence customer requirements, with regard to broadcasts and
speaker independence: from what I understand, most of the improvements
are not coming from tearing the sounds apart into something else but
from the development of better language models.  I once got a
demonstration of a Dragon system transcribing newscasts, and it was
amazing that they were able to differentiate between two speakers and
separate the word streams.

And yes, they could neither confirm nor deny that they were traveling
down to the DC area a lot.  ;-)

> Yes, it is unfortunate, and more than a little spooky.  And I would 
> agree with Eric that $10 million for development is a low and
> high-risk estimate.  The good news is that most of the research and a
> fair amount of code is openly available and could be leveraged,
> although I don't want to think about what the patent situation might
> be.

One of the ways to deal with the patent situation (and yes, it would be
an absolute minefield) would be to license an engine, language model,
and related components, then tear apart the user interface and replace
it with something decent.  No, it wouldn't be open source; yes, we would
have to charge unless we were granted a no-fee license (not bloody
likely).  But the most important thing is that we would have control
over the user interface and the connection to the GNOME accessibility
layer.

One of the halfway steps (assuming it's going to take a while for Wine
to do what we need) is the model of running speech recognition on
Windows in a virtual machine bubble.  The speech recognition bubble
would drive the Linux machine via a gateway program tunneling keystrokes
and mouse events to Linux and sending context back to the speech
recognition engine.
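A minimal sketch of that gateway idea, with an invented wire protocol (one JSON event per line over TCP); a real gateway would hand each event to an injector on the Linux side rather than just collecting them, but the plumbing would look something like this:

```python
import json
import socket
import threading

def serve_once(sock, received):
    """Linux side of the gateway: accept one connection and collect
    the decoded events (a real gateway would inject them)."""
    conn, _ = sock.accept()
    with conn, conn.makefile("r") as f:
        for line in f:
            received.append(json.loads(line))

def send_events(port, events):
    """Recognition-bubble side: ship events as newline-delimited JSON."""
    with socket.create_connection(("127.0.0.1", port)) as c:
        for ev in events:
            c.sendall((json.dumps(ev) + "\n").encode())

# Loopback demo of the tunnel.
received = []
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
t = threading.Thread(target=serve_once, args=(srv, received))
t.start()
send_events(port, [{"type": "key", "key": "a"},
                   {"type": "mouse", "x": 10, "y": 20}])
t.join()
srv.close()
```

The context channel going back to the recognition engine would be a second message type on the same connection; I've left it out to keep the sketch short.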

It's not ideal, and you can't cheat and use xinput; you need to use the
GNOME AT-SPI.  But it would seriously improve the ability to drive Linux
via speech recognition, and it may help us cope when Microsoft Vista's
speech recognition further savages the market and becomes the only
recognition engine available to end users.

This is one of those times when I don't care about ideology; I care
about having the tools I need to work and make money.  Open source is
great, I have two projects of my own, but dammit, it doesn't pay my
mortgage or put food in the fridge.  Having high-quality speech
recognition (dictation, command and control, etc.) is the difference
between being on Social Security disability and having a decent
standard of living.


More information about the ubuntu-users mailing list