FW: FW: Voice Recognition for Linux

Ian Pascoe softy.lofty.ilp at btinternet.com
Sat Feb 24 21:23:11 GMT 2007


Will - thanks.

I think I have grasped the basics now

Ian

-----Original Message-----
From: William.Walker at Sun.COM [mailto:William.Walker at Sun.COM]
Sent: 24 February 2007 20:10
To: Ian Pascoe
Cc: Ubuntu-accessibility at lists.ubuntu.com
Subject: Re: FW: Voice Recognition for Linux


Here's a paper that describes how Sphinx-4 works:

http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4Whitepaper.pdf
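
Very roughly: engines like Sphinx-4 model sub-word units (phones) with an
acoustic model, map words onto phones through a pronunciation dictionary,
and then use a language model to pick the most probable word sequence.
A toy sketch of that scoring idea in Python (illustrative only, not
Sphinx-4's actual API, which is Java):

    # Toy sketch of how a recognizer scores one hypothesis: an acoustic
    # model scores sub-word units (phones), a pronunciation dictionary
    # maps words to phones, and a language model scores the word
    # sequence.  The scores below are dummies; a real decoder searches
    # a large lattice of such hypotheses.

    PRONUNCIATIONS = {                  # dictionary: word -> phones
        "recognize": ["R", "EH", "K", "AH", "G", "N", "AY", "Z"],
        "speech":    ["S", "P", "IY", "CH"],
    }

    def acoustic_score(audio, phones):
        """Stand-in for log P(audio | phones) from the acoustic model."""
        return -0.1 * len(phones)       # dummy value

    def language_score(words):
        """Stand-in for log P(words) from the n-gram language model."""
        return -1.0 * len(words)        # dummy value

    def hypothesis_score(audio, words):
        phones = [p for w in words for p in PRONUNCIATIONS[w]]
        return acoustic_score(audio, phones) + language_score(words)

    print(hypothesis_score(b"...", ["recognize", "speech"]))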

Hope this helps,

Will

Ian Pascoe wrote:
> Hi Henrik / Eric
>
> Although I don't want to get embroiled in your discussions, I have a
> question to ask on this.
>
> How does voice recognition work - does it use word parts as in a TTS
> engine like eSpeak but in reverse, or does it maintain a dictionary of
> actual words?
>
> I presume that the problems you are corresponding about are not the way
> the STT engine works but the way it interprets the input?
>
> Fascinating discussions - thanks
>
> Ian
>
> -----Original Message-----
> From: ubuntu-accessibility-bounces at lists.ubuntu.com
> [mailto:ubuntu-accessibility-bounces at lists.ubuntu.com]On Behalf Of Eric
> S. Johansson
> Sent: 23 February 2007 19:45
> To: Henrik Nilsen Omma
> Cc: Ubuntu Accessibility Mailing List
> Subject: Re: Voice Recognition for Linux
>
>
> Henrik Nilsen Omma wrote:
>> Eric S. Johansson wrote:
>
>> Looks like the original text got caught in a spam filter somewhere
>> because of the attachment (I found it in the web archives). No worries
>> about the tone. We are having a frank technical discussion and need to
>> speak directly to get our points across. So my turn :) ...
>
> Thanks for the understanding but it always helps to be polite.
>> I think you are too caught up in the current working model of NS to see
>> how things can be done differently.
>
> You haven't seen the comments I've made in the past about speech user
> interfaces and what Dragon has done wrong.  I have proposed many things
> that should be fixed, but the current command model is not one of them.
>> I have not studied the details of voice recognition and voice models,
>> ... but I do appreciate the need for custom voice model training over
>> time.
>> There is a need for feedback, but it does _not_ need to be real-time.
>> Personally, I would prefer it not to be real time. NS does in theory
>> tout this as a feature when they claim that you can record speech on a
>> voice recorder and dump it into NS for transcription. I have no idea
>> whether that actually works.
>
> Okay, I should probably attempt to capture some of the user experience
> issues.
>
> Correction of misrecognitions is something people debate a lot.  If
> you don't correct misrecognitions, you'll most likely get the same
> thing over and over again.  The output of the language and recognition
> model is probabilistic, so misrecognitions will change from time to
> time, but it'll basically be the same kind of misrecognition.  (Yes,
> all uncorrected.)
>
> The user is then faced with a choice: do you correct the recognition
> engine or do you edit the document?  In both cases, it's painful.  But
> then you get the odd case where the misrecognition is completely
> unintelligible and you don't have any idea what the hell you said.  Then
> you have no choice but to go back, listen to the audio for that phrase,
> and make a correction.  This is a very real user experience.  I have
> spoken with people who write documents in Microsoft Word and they'll go
> back to page 5 out of 20, see something that's garbled, and play it back
> so they can figure out what they said.  They usually don't correct heavy
> garbling but just say it again and get a more consistent recognition
> from that point forward, courtesy of the incremental training.
>
> In theory, you can dictate into most applications using something called
> natural text.  It's direct text injection with a history of what was
> said (audio and recognition).  You can do limited correction by
> Select-and-Say, and it even sort of kind of works if it's a full native
> Microsoft Windows application.  Tools like Thunderbird, gaim, and Emacs
> don't work so well.  How they feel is for a later discussion.
>
> But you have this nice tool, that's almost right, called the dictation
> box.  It's a little window which has full editing and correction
> capability using the voice model of NaturallySpeaking.  When you are
> done with your dictation, you can inject that into the application it's
> associated with.  The wonderful thing about the dictation box is that
> making corrections significantly improves accuracy.  If I dictated into
> nothing but the dictation box for a week, I would have a significantly
> more accurate system and a lower level of frustration with
> misrecognitions.  If I had whatever magic the dictation box uses in all
> of my applications, I would be ecstatic.  I wouldn't need to retrain
> every six months.  But it's not sufficient.  Why is, again, a
> conversation for a future time.
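>
> To make that concrete, the guts of something like the dictation box is
> just a history of utterances that keeps the recognized text and the
> audio together, so a garbled phrase can be played back and a correction
> fed to the engine.  A rough sketch (the engine interface here is made
> up, not Nuance's):
>
>     # Rough sketch of a dictation history: each utterance keeps its
>     # audio and recognized text so garbled passages can be played back
>     # and corrected.  The engine interface is hypothetical.
>
>     class Utterance:
>         def __init__(self, audio, text):
>             self.audio = audio        # raw audio for playback
>             self.text = text          # what the engine recognized
>
>     class DictationHistory:
>         def __init__(self, engine):
>             self.engine = engine
>             self.utterances = []
>
>         def add(self, audio, text):
>             self.utterances.append(Utterance(audio, text))
>
>         def play_back(self, index, player):
>             player(self.utterances[index].audio)
>
>         def correct(self, index, corrected_text):
>             u = self.utterances[index]
>             # feeding audio plus corrected text back to the engine is
>             # what improves accuracy; uncorrected errors keep coming back
>             self.engine.train(u.audio, corrected_text)
>             u.text = corrected_text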
>
> If you want to migrate away from incremental recognition, you'll need to
> look to NaturallySpeaking 3 or NaturallySpeaking 4 for the user
> experience.  You would probably lose 1 to 2% (or more) on the accuracy,
> which is really significant.  Believe me, there's a huge difference
> between 99% and 99.5% recognition accuracy in actual operating
> conditions.  It's also important to note that Dragon changed away from
> the incremental correction model a couple of times.  The last time I was
> in touch with Dragon employees (before the Bakers got greedy), they were
> really convinced that incremental training, properly done, gave a
> significantly better user experience, and I would have to say, from what
> I hear and from what I have experienced, I think they were right.  Maybe
> they were drinking their own Kool-Aid, maybe they were onto something.
> I am no stranger to figuring out interesting ways to get the signals you
> need to do something right, so I trust them.
>
> But independent of your desire, you may not be able to turn it off.  You
> may have users who know how it works making your life uncomfortable
> because you have made their life less pleasant.  You will have me
> demanding the highest possible accuracy.  :-)
>
> I think at this point it would be a really good idea for you to go
> purchase a copy of NaturallySpeaking 9 Preferred.  Get a really good
> headset.  The one that comes in the box is a piece of crap.  No,
> seriously, it's really bad.  I can give you some recommendations on
> headsets (VXI mostly) but I really, really love my VXI Bluetooth
> wireless headset.  It is just so sweet.  It has some flaws but it's
> really sweet too.
>
>> I don't really want to interact with the voice engine all the time, I
>> want it to mostly stay out of my way. I don't want to look at the little
>> voice level bar when I'm speaking or read the early guesses of the voice
>> engine. I want to look out the window or look at the spreadsheet that
>> I'm writing an email about :) The fact that NS updates the voice model
>> incrementally is actually a bad feature. I don't want that. If I have a
>> cold one day or there is noise outside or the mic is a bit displaced the
>> profile gets damaged. That's probably why you have to start a fresh one
>> every six months.
>
> Can you use your keyboard without the delete or backspace key?  Or even
> the arrow keys?  The correction dialog I'm talking about is as core to
> your daily operation as those keys are.  As for changing focus, sure,
> you can do it but only if you have an application which is sufficiently
> speech aware to record your audio track at the same time and be able to
> play back a segment you think is an error.  It's the only way you'll
> make corrections unless you have a memory which is a few orders of
> magnitude better than mine.
>
> I should also note that if you don't have a clear and accurate
> indication of what's a misrecognition error, correcting something that
> is right can make your user model go bad quickly.  At least, so I am
> told.  Of course, I've never done anything like that, no, no way.  Uh-huh.
>
>
>> Instead of saving my voice profile every day, I would like to save up a
>> log of all the mistakes that were made during the week. I would then sit
>> down for a session of training to help NS cope with those words and
>> phrases better. I would first take a backup of my voice profile, then
>> say a few sample sentences to make sure everything was generally working
>> OK. I would then read passages from the log and do the needed correction
>> and re-training. I would save the profile and start using the new one
>> for the next week. I would also save profiles going back four weeks, and
>> once a month I would do a brief test with the stored up profiles to see
>> if it had degraded over time. If it had, I would roll back to an older
>> one and perhaps do some training from recent logs too. There is no
>> reason a voice profile should just automatically go bad over time.
>
> Now you're thinking like a geek.  Ordinary users eventually learn when
> to save a profile based on the type and number of corrections they make.
> They don't test them, they just save them and count on the system to
> automatically back up every few saves.  I don't save mine every day and
> I only save my profile when I correct really persistent misrecognitions.
> If I'm getting a cold or hay fever, I definitely don't save, but I also
> suffer from reduced recognition for a few days.
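>
> That said, the backup-and-roll-back half of what you describe is easy
> enough to script.  A sketch (the profile location and layout here are
> made up; real engines keep their profiles wherever they like):
>
>     # Sketch of weekly profile backup and rollback.  The profile
>     # directory and naming are invented for illustration.
>
>     import shutil
>     from datetime import date
>     from pathlib import Path
>
>     PROFILE = Path.home() / ".voice-profile"          # hypothetical
>     BACKUPS = Path.home() / ".voice-profile-backups"
>     KEEP_WEEKS = 4
>
>     def backup_profile():
>         BACKUPS.mkdir(exist_ok=True)
>         dest = BACKUPS / ("profile-" + date.today().isoformat())
>         shutil.copytree(PROFILE, dest)
>         # keep only the last few weekly snapshots
>         for old in sorted(BACKUPS.iterdir())[:-KEEP_WEEKS]:
>             shutil.rmtree(old)
>
>     def roll_back(snapshot_name):
>         shutil.rmtree(PROFILE)
>         shutil.copytree(BACKUPS / snapshot_name, PROFILE)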
>
> User reluctance to put in the effort is the reason why you train on a
> document once at the beginning.  I usually choose a couple of different
> documents to train on after a month on a new model, but I am a rarity.
> I described this behavior in a white paper I wrote called "spam filters
> are like dogs".  You have expert trainers and you have people whose dogs
> crap on the neighbors' lawns.  Same category of animals, with roughly
> the same skill potential but very different training models.
> NaturallySpeaking is trying to take advantage of the "less formal"
> behaviors for training, and they're doing a pretty good job at
> succeeding with those signals.
>
> Don't force the ordinary user to train at an expert level.  It won't
> work, it will just piss them off, and it will discourage if not drive
> away the moderately expert user who wants to work in the way they are
> comfortable.
>> The fact that you have to constantly interact with the voice engine is
>> not a feature, it's a bug! It's just that you have adapted your
>> dictation to work around it. It's not at all clear that interactive
>> correction is better than batched correction. It certainly should not be
>> seen as a blocker for a project like this going forward. I wouldn't want
>> to spend years on a project simply to replicate NS on Linux. There is
>> plenty of room for improvement in the current system.
>
> You constantly interact with your computer and expect from it a bunch of
> feedback.  This is no different.  You're not looking at speech levels,
> but you may be looking at load averages, time of day, alerts about
> e-mail coming in, cursor position in an editor buffer, color changes for
> syntax highlighting.  These are all forms of feedback.  Incremental
> training and looking at recognition sequences are just different forms
> of feedback.  He learned to incorporate it in your operation.
>
> ("He learned" is a persistent misrecognition error that mostly shows up
> when using natural text.  Because I'm not in a place where I can correct
> it often enough, it keeps showing up.  If I were in the dictation box
> right now, it would be mostly gone.  This is why incremental recognition
> correction is so very, very important.  Batch training has never made
> this go away, and I've tried.  The only thing that has succeeded has
> been incremental correction in one context.)
>
>> OK, now for some replies:
>
> You mean the above weren't enough?  :-)
>
>>> There is a system that already exists that does exactly what you've
>>> opposed.
>> [assuming you meant 'proposed' here] Unlikely. If a system with that
>> level of usability existed, it would already be in widespread use.
>>
>>>   While it was technically successful, it has failed in that nobody
>>> but the originator uses it, and even he admits this model has some
>>> serious shortcomings.
>>>
>> What system, where? What was the model and what were the shortcomings?
>
> http://eepatents.com/  but the package is no longer visible.  Ed took it
> down a while ago.  His package used xinput direct injection.  He used a
> Windows application with a window to receive the dictation information
> and inject it into the virtual machine.  He was able to do straight
> injection of text limited by what NaturallySpeaking put out.  I think he
> did some character sequence translations but I'm not sure.  He couldn't
> control the mouse, couldn't switch windows, had only global commands and
> not application-specific commands.  I could be wrong on some of these
> points but that's basically what I remember.
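>
> The core trick, direct injection, is simple enough: take whatever text
> the recognizer hands you and replay it into X as synthetic key events.
> Something along these lines (python-xlib and the XTest extension; a
> sketch of the idea, not Ed's actual code, and it ignores shifted
> characters, focus handling, and so on):
>
>     # Sketch: inject recognized text into X as synthetic key events
>     # via the XTest extension (python-xlib).
>
>     from Xlib import X, XK, display
>     from Xlib.ext import xtest
>
>     def inject_text(text):
>         d = display.Display()
>         for ch in text:
>             name = "space" if ch == " " else ch
>             keycode = d.keysym_to_keycode(XK.string_to_keysym(name))
>             if keycode == 0:
>                 continue              # no mapping for this character
>             xtest.fake_input(d, X.KeyPress, keycode)
>             xtest.fake_input(d, X.KeyRelease, keycode)
>         d.sync()
>
>     inject_text("hello from the recognizer")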
>
> There was also a bunch of other stuff, like being complicated to set up,
> etc., but that can be fixed relatively easily, especially if you remove
> the dependency on Twisted.
>
> To my mind, it's the same as what you're proposing.  And there is
> general agreement that it's only a starting point for the very
> committed/dedicated.
>
>>> The reason I insist on feedback is very simple.  A good speech
>>> recognition environment lets you correct recognition errors
>>> and create application-specific and application neutral commands.
>> Yes, we agree that you need correction. The application-specific
>> features can be implemented in this model too, in the same way that Orca
>> uses scripting.
>
> I don't know how Orca uses scripting.  Pointers?
>
> Seriously though, I want a grammar and the ability to associate methods
> with the grammar.  I do know I'm not the only one, because there are a
> fair number of people who have built grammars using the
> NaturallySpeaking Visual Basic environment, natpython, and a couple of
> macro packages built on top of natpython.
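>
> By "grammar" I mean something with roughly this shape (a sketch only;
> the API below is made up and is not natpython's actual interface):
>
>     # Sketch of a grammar with methods attached to its rules, in the
>     # general spirit of the natpython-style macro packages.
>
>     class Grammar:
>         def __init__(self, app=None):
>             self.app = app            # active only for this program
>             self.rules = {}           # spoken form -> handler
>
>         def rule(self, spoken):
>             def register(handler):
>                 self.rules[spoken] = handler
>                 return handler
>             return register
>
>         def on_recognition(self, words, active_app=None):
>             if self.app and active_app != self.app:
>                 return False          # grammar not active here
>             handler = self.rules.get(" ".join(words))
>             if handler is None:
>                 return False          # fall through to plain dictation
>             handler()
>             return True
>
>     emacs = Grammar(app="emacs")
>
>     @emacs.rule("save buffer")
>     def save_buffer():
>         print("send C-x C-s to Emacs")   # placeholder action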
>
> Even if you convince me, you'll have to convince them.
>
>> You would still have to correct the mistake at some point. I would
>> prefer to just dictate on and come back and correct all the mistakes at
>> the end. One should read through before sending in any case ;)
>
> Oh, I understand, but in my experience, if I don't pay attention to what
> the recognition system is saying, my speech gets sloppy and my
> recognition accuracy drops significantly until I have something which is
> completely unrecognizable at the end.  Also, I'm probably "special" in
> this case, but even when I was typing, I continually looked back at the
> document as far as the screen permitted, searching for errors.  It seems
> to help me keep speaking written speech and to identify where I'm using
> spoken speech for writing.  I know other people like you want to just
> dictate and not look back.  Some of them will turn their chair around
> and stare at a painting on the wall while they dictate.  But there are
> those, like me, that can't.
>
>> And I think that is a serious design-flaw for two (related) reasons: It
>> gradually corrupts your voice files AND it makes the reader constantly
>> worry about whether that is happening. You have to make sure to speak as
>> correctly as possible at all times and always make sure to stop
>> immediately and correct all the mistakes. Otherwise your profile will be
>> hosed. I repeat: that is a bug, not a feature. You end up adapting more
>> to the machine than the machine adapts to you. *That is a bug.*
>
> It's a feature... seriously, get NaturallySpeaking and play with the
> dictation box as well as natural text driven applications.  When you
> have something that is Select-and-Say enabled, you don't need to pay
> attention all the time; you can go back a paragraph or two or three and
> fix your errors.  The only time you need to pay attention is when you
> are using natural text, which is one way Nuance forces you to toe the
> line when it comes to applications.  That is a bug!
>
>
>> I think this is an NS bug too. I don't want natural editing, I only want
>> natural dictation. I want two completely separate modes: pure dictation
>> and pure editing. If I say 'cut that' I want the words 'cut that' to be
>> typed. To edit I want to say: 'Hal: cut that bit'. Why? because that
>> would improve overall recognition and would remove the worry that you
>> might delete a paragraph by mistake. NS would only trigger its special
>> functions on a single word, and otherwise just do its best to
>> transcribe. You would of course select that word to be one that it would
>> never get wrong. (you could argue that natural editing is a feature, but
>> the fact that you cannot easily configure it to use the modes I
>> described is a design-flaw).
>
> A few things are very important in this paragraph.  Prefacing a command
> is something I will really fight against.  It is a horrible thing to
> impose on the user because it adds extra vocal load and cognitive load.
> Voice coder has a "yo" command model for certain commands and I just
> refuse to use them; I type rather than say them, that sequence is so
> repellent to me.  I have also had significant experience with modal
> commands with DragonDictate, which is why I have such a strong reaction
> against the command preface, and this is why Dragon Systems went away
> from them.  Remember, they were a technology-dedicated company, and I
> know for a fact that some of the employees were quite smart.  If
> Dragon's research group does something and sticks with it, there's
> probably a good reason for it.
>
> I think part of our differences comes from modal versus non-modal user
> interfaces.  I like Emacs, which is (mostly) non-modal; other people
> like vi, which is exceptionally modal.  Non-modal user interfaces are
> preferable in these circumstances if the indicator to activate some
> command or different course of action is relatively natural.  For
> example, if I say "don't show dictation box" I just get text.  But if I
> say "show dictation box" with a pause before the text as well as after,
> up comes the dictation box.  Same words, but the simple addition of
> natural-length pauses allows NaturallySpeaking to identify the command
> and activate it only when it's asked for.  Yes, it's training, but
> minimal training, and it applies everywhere when separating commands
> from text.  This works for NaturallySpeaking commands and my private
> commands.
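>
> In other words, the separation can be as simple as only treating an
> utterance as a command when it matches a known command and is bracketed
> by natural-length pauses (a toy sketch; the threshold is made up):
>
>     # Toy sketch of pause-based command detection.
>
>     PAUSE_SECONDS = 0.25                       # made-up threshold
>     COMMANDS = {"show dictation box", "hide dictation box"}
>
>     def is_command(text, pause_before, pause_after):
>         return (text in COMMANDS
>                 and pause_before >= PAUSE_SECONDS
>                 and pause_after >= PAUSE_SECONDS)
>
>     # spoken mid-sentence it comes out as text...
>     assert not is_command("show dictation box", 0.05, 0.05)
>     # ...spoken in isolation it fires the command
>     assert is_command("show dictation box", 0.4, 0.6)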
>
> There is one additional form of mode switching in NaturallySpeaking, and
> that's the switching of commands based on which program is active and
> its state (i.e. running dialog boxes or something equivalent).  That's
> why I have Emacs commands that are only active when running Emacs.
>
>> Precisely. It's because they don't want to fiddle with the program, they
>> just want to dictate.
>
> But those that just dictate get unacceptable results.  Try it.  When
> you get NaturallySpeaking running, just dictate and never ever correct,
> and see what happens.  Then try it the other way around, using the
> dictation box whenever possible.
> ---eric
>
>
> --
> Speech-recognition in use.  It makes mistakes, I correct some.
>
> --
> Ubuntu-accessibility mailing list
> Ubuntu-accessibility at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/ubuntu-accessibility
>
>
>





