FW: Voice Recognition for Linux

Ian Pascoe softy.lofty.ilp at btinternet.com
Sat Feb 24 19:32:22 GMT 2007


Hi Henrik / Eric

Although I don't want to get embroiled in your discussions, I have a
question to ask on this.

How does voice recognition work - does it use word parts as in a TTS engine
like eSpeak but in reverse, or does it maintain a dictionary of actual
words?

I presume that the problems you are corresponding about are not with the
way the STT engine works but with the way it interprets the input?

Fascinating discussions - thanks

Ian


-----Original Message-----
From: ubuntu-accessibility-bounces at lists.ubuntu.com
[mailto:ubuntu-accessibility-bounces at lists.ubuntu.com]On Behalf Of Eric
S. Johansson
Sent: 23 February 2007 19:45
To: Henrik Nilsen Omma
Cc: Ubuntu Accessibility Mailing List
Subject: Re: Voice Recognition for Linux


Henrik Nilsen Omma wrote:
> Eric S. Johansson wrote:

> Looks like the original text got caught in a spam filter somewhere
> because of the attachment (I found it in the web archives). No worries
> about the tone. We are having a frank technical discussion and need to
> speak directly to get our points across. So my turn :) ...

Thanks for the understanding but it always helps to be polite.
>
> I think you are too caught up in the current working model of NS to see
> how things can be done differently.

You haven't seen the comments I've made in the past about speech user
interfaces and what Dragon has done wrong.  I have proposed many things
that should be fixed, but the current command model is not one of them.
>
> I have not studied the details of voice recognition and voice models,
>... but I do appreciate the need for custom voice model training over time.
> There is a need for feedback, but it does _not_ need to be real-time.
> Personally, I would prefer it not to be real time. NS does in theory
> tout this as a feature when they claim that you can record speech on a
> voice recorder and dump it into NS for transcription. I have no idea
> whether that actually works.

Okay, I should probably attempt to capture some of the user experience
issues.

Correction of misrecognitions is something people debate a lot.  If
you don't correct misrecognitions, you'll most likely get the same
thing over and over again.  The output of the language and recognition
model is probabilistic, so misrecognitions will change from time to
time, but it'll basically be the same kind of misrecognition.  (Yes, all
uncorrected.)

The user is then faced with a choice: do you correct the recognition
engine or do you edit the document?  In both cases it's painful.  But
then you get the odd case where the misrecognition is completely
unintelligible and you don't have any idea what the hell you said.  Then
you have no choice but to go back, listen to what was said at that
phrase, and make a correction.  This is a very real user experience.  I
have spoken with people who write documents in Microsoft Word and
they'll go back to page 5 out of 20, see something that's garbled, and
play it back so they can figure out what they said.  They usually don't
correct heavy garbling but just say it again and get a more consistent
recognition from that point forward, courtesy of the incremental training.

In theory, you can dictate into most applications using something called
natural text.  It's direct text injection with a history of what was
said (audio and recognition).  You can do limited correction by
Select-and-Say, and it even sort of, kind of works if it's a full native
Microsoft Windows application.  Tools like Thunderbird, Gaim, and Emacs
don't work so well.  How they feel is a discussion for later.

But you have this nice tool, that's almost right, called the dictation
box.  It's a little window which has full editing and correction
capability using the voice model of NaturallySpeaking.  When you are
done with your dictation, you can inject it into the application it's
associated with.  The wonderful thing about the dictation box is that
making corrections significantly improves accuracy.  If I dictated into
nothing but the dictation box for a week, I would have a significantly
more accurate system and a lower level of frustration over
misrecognitions.  If I had whatever magic the dictation box uses on all of
my applications, I would be ecstatic.  I wouldn't need to retrain every
six months.  But it's not sufficient.  Why is, again, a conversation for a
future time.

If you want to migrate away from incremental recognition, you'll need to
look to NaturallySpeaking 3 or NaturallySpeaking 4 for the user
experience.  You would probably lose one to two percent (or more) of
accuracy, which is really significant.  Believe me, there's a huge
difference between 99% and 99.5% recognition accuracy in actual operating
conditions.  It's also important to note that Dragon moved away from the
incremental correction model a couple of times.  The last time I was in
touch with Dragon employees (before the Bakers got greedy), they were
really convinced that incremental training, properly done, gave a
significantly better user experience, and I would have to say, from what I
hear and from what I have experienced, I think they were right.  Maybe
they were drinking their own Kool-Aid, maybe they were onto something.
I am no stranger to figuring out interesting ways to get the signals you
need to do something right, so I trust them.

But independent of your desire, you may not be able to turn it off.  You
may have users who know how it works making your life uncomfortable
because you have made their life less pleasant.  You will have me
demanding the highest possible accuracy.  :-)

I think at this point it would be a really good idea for you to go
purchase a copy of NaturallySpeaking 9 Preferred.  Get a really good
headset.  The one that comes in the box is a piece of crap.  No,
seriously, it's really bad.  I can give you some recommendations on
headsets (VXI mostly), but I really, really love my VXI Bluetooth wireless
headset.  It is just so sweet.  It has some flaws, but it's really sweet too.

> I don't really want to interact with the voice engine all the time, I
> want it to mostly stay out of my way. I don't want to look at the little
> voice level bar when I'm speaking or read the early guesses of the voice
> engine. I want to look out the window or look at the spreadsheet that
> I'm writing an email about :) The fact that NS updates the voice model
> incrementally is actually a bad feature. I don't want that. If I have a
> cold one day or there is noise outside or the mic is a bit displaced the
> profile gets damaged. That's probably why you have to start a fresh one
> every six months.

Can you use your keyboard without the delete or backspace key?  Or even
the arrow keys?  The correction dialog I'm talking about is as core to
your daily operation as those keys are.  As for changing focus, sure,
you can do it, but only if you have an application which is sufficiently
speech-aware to record your audio track at the same time and be able to
play back a segment you think is an error.  It's the only way you'll
make corrections unless you have a memory which is a few orders of
magnitude better than mine.

I should also note that if you don't have a clear and accurate
indication of what's a misrecognition error, correcting something that
is right can make your user model go bad quickly.  At least, so I am
told.  Of course, I've never done anything like that, no, no way.  Uh-huh.


> Instead of saving my voice profile every day, I would like to save up a
> log of all the mistakes that were made during the week. I would then sit
> down for a session of training to help NS cope with those words and
> phrases better. I would first take a backup of my voice profile, then
> say a few sample sentences to make sure everything was generally working
> OK. I would then read passages from the log and do the needed correction
> and re-training. I would save the profile and start using the new one
> for the next week. I would also save profiles going back four weeks, and
> once a month I would do a brief test with the stored up profiles to see
> if it had degraded over time. If it had, I would roll back to an older
> one and perhaps do some training from recent logs too. There is no
> reason a voice profile should just automatically go bad over time.

Now you're thinking like a geek.  Ordinary users eventually learn when
to save a profile based on the type and number of corrections they make.
They don't test them; they just save them and count on the system to
automatically back up every few saves.  I don't save mine every day, and I
only save my profile when I correct really persistent misrecognitions.
If I'm getting a cold or hay fever, I definitely don't save, but I also
suffer from reduced recognition for a few days.

User reluctance to put in the effort is the reason why you train on a
document once at the beginning.  I usually choose a couple of different
documents to train on after a month on a new model, but I am a rarity.  I
described this behavior in a white paper I wrote called "spam filters
are like dogs".  You have expert trainers and you have people whose dogs
crap on the neighbors' lawns.  Same category of animal, with roughly the
same skill potential, but very different training models.  NaturallySpeaking
is trying to take advantage of the "less formal" behaviors for
training, and they're doing a pretty good job of succeeding with those
signals.

Don't force the ordinary user to train at an expert level.  It won't
work, it will just piss them off, and it will discourage if not drive
away the moderately expert user who wants to work in the way they are
comfortable.
>
> The fact that you have to constantly interact with the voice engine is
> not a feature, it's a bug! It's just that you have adapted your
> dictation to work around it. It's not at all clear that interactive
> correction is better than batched correction. It certainly should not be
> seen as a blocker for a project like this going forward. I wouldn't want
> to spend years on a project simply to replicate NS on Linux. There is
> plenty of room for improvement in the current system.

You constantly interact with your computer and expect from it a bunch of
feedback.  This is no different.  You may not be looking at speech levels,
but you may be looking at load averages, time of day, alerts about e-mail
coming in, cursor position in an editor buffer, color changes for syntax
highlighting.  These are all forms of feedback.  Incremental training
and looking at recognition sequences are just different forms of
feedback.  He learned to incorporate it in your operation.

("he learned" is a persistent misrecognition error that mostly shows up
when using natural text, because I'm not in a place where I can correct
it often enough, it keeps showing up if I was in dictation box right
now, it would be mostly gone.  This is why incremental recognition
correction is so very very important.  batch training has never made
this go away and I've tried.  The only thing that has succeeded has been
incremental in one context.)

>
> OK, now for some replies:

You mean the above weren't enough?  :-)

>
>> There is a system that art exists that does exactly what you've
>> opposed.
> [assuming you meant 'proposed' here] Unlikely. If a system with the
> level of usability existed it would already be in widespread use.
>
>>   While it was technically successful, it has failed in that nobody
>> but the originator uses it in even he admits  this model  has some
>> serious shortcomings.
>>
> What system, where? What was the model and what were the shortcomings?

http://eepatents.com/ but the package is no longer visible.  Ed took it
down a while ago.  His package used xinput direct injection.  He used a
Windows application with a window to receive the dictation information
and inject it into the virtual machine.  He was able to do straight
injection of text, limited by what NaturallySpeaking put out.  I think he
did some character sequence translations, but I'm not sure.  He couldn't
control the mouse, couldn't shift windows, and had only global commands,
not application-specific commands.  I could be wrong on some of these
points, but that's basically what I remember.

There was also a bunch of other stuff, like being complicated to set up,
etc., but that can be fixed relatively easily, especially if you remove
the dependency on Twisted.

To my mind, it's the same as what you're proposing.  And there is
general agreement that it is only a starting point for the very
committed/dedicated.
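
To make that concrete, here is a rough sketch (mine, in Python, not Ed's
actual code) of the X side of that kind of scheme: recognized text arrives
over a local socket from wherever NaturallySpeaking is running and gets
typed into whatever window has focus via xdotool, which stands in for the
xinput injection his package used.  The port number and the
one-utterance-per-line protocol are assumptions made up for the example.

#!/usr/bin/env python
# Sketch only: receive recognized text from the NaturallySpeaking side
# and type it into the currently focused X window via xdotool.
import socket
import subprocess

LISTEN_PORT = 9178  # made-up port; whatever the sending side is configured for

def inject(text):
    """Send the recognized text to the focused X application as keystrokes."""
    subprocess.call(["xdotool", "type", "--delay", "0", text])

def main():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", LISTEN_PORT))
    server.listen(1)
    while True:
        conn, _addr = server.accept()
        stream = conn.makefile()
        for line in stream:          # one utterance of plain text per line
            inject(line.rstrip("\n"))
        conn.close()

if __name__ == "__main__":
    main()

Everything beyond plain text (mouse control, window switching,
application-specific commands) is exactly what a scheme like this leaves
out, which matches the limitations listed above.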

>
>> The reason I insist on feedback is very simple.  A good speech
>> recognition environment lets you lets you correct recognition errors
>> and create application-specific and application neutral commands.
> Yes, we agree that you need correction. The application-specific
> features can be implemented in this model too, in the same way that Orca
> uses scripting.

I don't know how Orca uses scripting.  Pointers?

Seriously though, I want a grammar and the ability to associate methods
with the grammar.  I do know I'm not the only one, because there are a
fair number of people who have built grammars using the
NaturallySpeaking Visual Basic environment, natpython, and a couple of
macro packages built on top of natpython.

Even if you convince me, you'll have to convince them.
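
To illustrate what I mean by associating methods with a grammar, here is a
toy sketch.  The rule syntax and dispatcher are invented for the example;
this is not the natpython API, just the shape of the thing I want.

# Toy grammar: spoken phrases (with slots) mapped to handler methods.
import re

class EditCommands:
    rules = {
        r"^next buffer$":           "next_buffer",
        r"^kill line (\d+) times$": "kill_lines",
    }

    def next_buffer(self):
        print("would switch to the next buffer")

    def kill_lines(self, count):
        print("would kill %s lines" % count)

    def dispatch(self, utterance):
        """Match a recognized utterance against the rules and call the method."""
        for pattern, method_name in self.rules.items():
            match = re.match(pattern, utterance)
            if match:
                getattr(self, method_name)(*match.groups())
                return True
        return False  # not a command; treat as dictation

if __name__ == "__main__":
    g = EditCommands()
    g.dispatch("kill line 3 times")   # -> would kill 3 lines
    g.dispatch("next buffer")         # -> would switch to the next buffer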

> You would still have to correct the mistake at some point. I would
> prefer to just dictate on and come back and correct all the mistakes at
> the end. One should read through before sending in any case ;)

Oh, I understand, but in my experience, if I don't pay attention to what
the recognition system is saying, my speech gets sloppy and my
recognition accuracy drops significantly until I have something which is
completely unrecognizable at the end.  Also, I'm probably "special" in
this case, but even when I was typing, I would continually look back at the
document, as far as the screen permits, searching for errors.  It seems to
help me keep speaking written speech and identify where I'm using
spoken speech for writing.  I know other people like you want to just
dictate and not look back.  Some of them will turn their chair around
and stare at a painting on the wall while they dictate.  But there are
those, like me, who can't.

> And I think that is a serious design-flaw for two (related) reasons: It
> gradually corrupts your voice files AND it makes the reader constantly
> worry about whether that is happening. You have to make sure to speak as
> correctly as possible at all times and always make sure to stop
> immediately and correct all the mistakes. Otherwise your profile will be
> hosed. I repeat: that is a bug, not a feature. You end up adapting more
> to the machine than the machine adapts to you. *That is a bug.*

It's a feature... Seriously, get NaturallySpeaking and play with the
dictation box as well as natural-text-driven applications.  When you
have something that is Select-and-Say enabled, you don't need to pay
attention all the time; you can go back a paragraph or two or three and
fix your errors.  The only time you need to pay attention is when you
are using natural text, which is one way Nuance forces you to toe the
line when it comes to applications.  That is a bug!


> I think this is an NS bug too. I don't want natural editing, I only want
> natural dictation. I want two completely separate modes: pure dictation
> and pure editing. If I say 'cut that' I want the words 'cut that' to be
> typed. To edit I want to say: 'Hal: cut that bit'. Why? because that
> would improve overall recognition and would remove the worry that you
> might delete a paragraph by mistake. NS would only trigger its special
> functions on a single word, and otherwise just do its best to
> transcribe. You would of course select that word to be one that it would
> never get wrong. (you could argue that natural editing is a feature, but
> the fact that you cannot easily configure it to use the modes I
> described is a design-flaw).

A few things are very important in this paragraph.  Prefacing a command
is something I will really fight against.  It is a horrible thing to
impose on the user because it adds extra vocal load and cognitive load.
Voice coder has a "yo" command model for certain commands, and I just
refuse to use them.  I type instead; saying that sequence is so repellent
to me.  I have also had significant experience with modal commands with
DragonDictate, which is why I have such a strong reaction against the
command preface, and it's also why Dragon Systems went away from them.
Remember, they were a technology-dedicated company, and I know for a fact
that some of the employees were quite smart.  If Dragon's research group
does something and sticks with it, there's probably a good reason for it.

I think part of our differences comes from modal versus non-modal user
interfaces.  I like Emacs, which is (mostly) non-modal; other people like
vi, which is exceptionally modal.  Non-modal user interfaces are preferable
in these circumstances, provided the indicator to activate some command or
different course of action is relatively natural.  For example, if I say
"show dictation box" in the middle of running dictation, I just get text.
But if I say "show dictation box" with a pause before the words as well as
after, up comes the dictation box.  Same words, but the simple addition of
natural-length pauses allows NaturallySpeaking to identify the command and
activate it only when it's asked for.  Yes, it's training, but it's minimal
training, and it applies everywhere when separating commands from text.
This works for NaturallySpeaking commands and my private commands.
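
Here is a sketch of that pause heuristic as I understand it (my guess at
the logic, not how NaturallySpeaking actually implements it); the
threshold value is a made-up assumption.

# Same words count as a command only when the utterance is isolated,
# i.e. there is enough silence on both sides of it.
PAUSE_THRESHOLD = 0.25   # seconds of silence; illustrative value only
COMMANDS = {"show dictation box", "hide dictation box"}

def classify(words, silence_before, silence_after):
    """Return 'command' or 'dictation' for one recognized utterance."""
    isolated = (silence_before >= PAUSE_THRESHOLD and
                silence_after >= PAUSE_THRESHOLD)
    if isolated and words.lower() in COMMANDS:
        return "command"
    return "dictation"

# "show dictation box" embedded in running speech is just text...
print(classify("show dictation box", 0.05, 0.02))   # dictation
# ...but the same words, set off by pauses, trigger the command.
print(classify("show dictation box", 0.6, 0.5))     # command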

There is one additional form of mode switching in NaturallySpeaking, and
that's the switching of commands based on which program is active and
its state (i.e. running dialog boxes or something equivalent).  That's
why I have Emacs commands that are only active when running Emacs.
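
On the Linux side, that kind of context switching could look something
like this sketch, assuming xdotool is available for querying the active
window; the window-name test and command tables are illustrative only.

# Only consult the Emacs command set when the active window looks like Emacs.
import subprocess

GLOBAL_COMMANDS = {"show dictation box"}
EMACS_COMMANDS = {"next buffer", "save buffer"}

def active_window_name():
    out = subprocess.check_output(["xdotool", "getactivewindow", "getwindowname"])
    return out.decode("utf-8", "replace").strip()

def active_command_set():
    commands = set(GLOBAL_COMMANDS)
    if "emacs" in active_window_name().lower():
        commands |= EMACS_COMMANDS
    return commands

A dispatcher would then check an utterance against active_command_set()
before deciding whether it is a command or plain dictation.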

> Precisely. It's because they don't want to fiddle with the program, they
> just want to dictate.

But those who just dictate get unacceptable results.  Try it.  When
you get NaturallySpeaking running, just dictate, never ever correct,
and see what happens.  Then try it the other way around, using the
dictation box whenever possible.
---eric


--
Speech-recognition in use.  It makes mistakes, I correct some.

--
Ubuntu-accessibility mailing list
Ubuntu-accessibility at lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-accessibility




