Voice Recognition for Linux

Eric S. Johansson esj at harvee.org
Fri Feb 23 03:37:48 GMT 2007


Henrik Nilsen Omma wrote:
> Eric S. Johansson wrote:


> In short: Create a copy-left (GPL) tool to transfer text from Naturally 
> Speaking on Windows to Linux.

this is one half of the solution needed.  Not only do you need to 
propagate text to Linux, but you need to provide enough context back to 
Windows so that NaturallySpeaking can select different grammars.  It 
would also be nice to modify the text injected into Linux, because 
Nuance really screwed the pooch on natural text.
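To make the two-way requirement concrete, here is a minimal sketch of the wire protocol such a bridge might use, assuming newline-delimited JSON messages over a socket.  The message framing, the `kind` field names, and the `window_class` context key are all my inventions for illustration, not anything an existing tool defines:

```python
import json

def encode_message(kind, payload):
    """Frame one message: kind is 'text' (Windows -> Linux dictated text)
    or 'context' (Linux -> Windows active-window info for grammar choice)."""
    return (json.dumps({"kind": kind, "payload": payload}) + "\n").encode("utf-8")

def decode_messages(buffer):
    """Split a byte buffer into complete messages plus any trailing partial line."""
    messages = []
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        messages.append(json.loads(line))
    return messages, buffer

# The Windows side would send dictated text down the pipe...
wire = encode_message("text", "hello world")
# ...and the Linux side would answer with context so NaturallySpeaking
# can swap in the grammar appropriate to the focused application.
wire += encode_message("context", {"window_class": "Emacs"})

msgs, rest = decode_messages(wire)
```

The point of the framing is only that the channel carries two kinds of traffic in both directions; any real implementation would pick its own transport.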

> A few starts have been made on this, but it needs to be organised as a 
> proper community project and driven forward by several people. 

this is a difficult task.  There is a very nice package called 
VoiceCoder, spearheaded by Alain Desilets up at NRC-IIT in conjunction 
with David Fox.  They haven't gotten a whole lot of additional 
contributions.  People with upper extremity disorders tend not to 
volunteer a whole lot because, quite frankly, life is bouncing physical 
pain against what needs to be done.  It's exhausting.

> The user 
> interface should aim to be better than what the native Windows NS 
> version has. It should be speech engine and OS agnostic. That way you'll 
> get people using it to transfer speech between all sorts of different 
> systems, and it will get more use and development. You should be able to 
> easily plug in a free engine like Sphinx (so these will be encouraged to 
> improve) or even Vista's native system, which will be very widespread.

damn, you are the optimist.  Yes, the user interface does need to be 
better, but it may not be possible because the recognition engine or the 
systems around it may not expose the interfaces necessary to make it 
better.  For example, where do you get the information to give the user 
clear feedback that the system is hearing something and it's at the 
right level?  And what about the whole process of adding or deleting 
words from your dictionary, training, or testing your audio input to 
make sure it works right?  I'm not saying it's impossible.  I'm just 
saying be prepared to work very, very hard.  I think we'd be better off 
finding some way of overlaying the user interface from NaturallySpeaking 
on top of a Linux virtual machine screen.  It sucks, but you might get 
done faster than with your very desirable but overly optimistic wish.
> 
> My biggest gripe with NS is the editing interface. The actual 
> recognition is quite good IMO, but when you do make a mistake it is very 
> awkward to fix it without using the keyboard. If you give an edit 
> command and that is not understood correctly either then you get a 
> meaningless sentence and you are no longer able to easily correct the 
> one you originally wanted to fix. The end result is that you totally 
> lose the flow of what you were trying to express.

It's not quite that bad.  Select-and-Say, when it works, is quite useful 
for small phrases.  What we need to do is propagate the Emacs 
mark-and-point interface into a GUI environment.  It's far more 
effective, at least when you're noodling about within an error-prone 
navigation process.
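For anyone who hasn't lived in Emacs, the mark-and-point model is easy to illustrate: the "region" is just the span between two positions, so every voice command that moves point or sets mark composes with every command that operates on the region.  A toy sketch (the `Buffer` class and its method names are mine, purely illustrative):

```python
class Buffer:
    """Toy flat-text buffer demonstrating the Emacs mark/point/region model."""

    def __init__(self, text):
        self.text = text
        self.point = 0      # the cursor
        self.mark = None    # the other end of the region, once set

    def set_mark(self):
        self.mark = self.point

    def region(self):
        """The text between mark and point, in either order."""
        if self.mark is None:
            return ""
        lo, hi = sorted((self.mark, self.point))
        return self.text[lo:hi]

    def kill_region(self):
        """Delete and return the region, leaving point at its start."""
        lo, hi = sorted((self.mark, self.point))
        killed, self.text = self.text[lo:hi], self.text[:lo] + self.text[hi:]
        self.point, self.mark = lo, None
        return killed

buf = Buffer("hello brave world")
buf.point = 6
buf.set_mark()
buf.point = 12          # region now covers "brave "
killed = buf.kill_region()
```

Because selection is two cheap positioning commands rather than one fragile drag gesture, a misrecognized navigation command only costs you one re-positioning, not the whole selection.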

In any event, take a look at the VoiceCoder UI for making corrections. 
I really like it.  It's the best correction interface I've seen so far. 
David Fox is responsible for that wonderful creation.

> I presume the macro functionality in NS is configured so that the 
> pattern recognition is quite good on the macros you define yourself. So 
> when you say 'Paste in my address' it generally works. We can (ab)use 
> this macro facility for our own editing needs. We would define a set of 
> macros that would be processed by the NS engine and would give us a known 
> and parseable string.

natlink is the way to go.  It lets a user create a SAPI 4 grammar and 
associate a method with the grammar when it resolves.  Or is the term 
"hits a terminal node"?  Anyway, it works, it's reliable, it's written in 
Python, and from the user level it looks to be relatively portable 
between recognition engines.

> 
> So saying 'Macro: delete sentence' would actually insert the text 
> **MACRO-DELETE-SENTENCE** into the text stream. If you were watching the 
> text on the Windows system the real text would be interspersed with such 
> commands, but on the Linux system receiving the stream it would just Do 
> the Right Thing. The big advantage is that it's very configurable this 
> way so we can make it do what we want.

you mean something like this...

     <operation> = left | right | delete | kill | switch | copy;
     <datatype> = character | word | sentence | paragraph | line | region;
     <doit> exported = <operation> <datatype> ;

---
     def gotResults_operation(self, words, fullResults):
         # Map each recognized <operation> <datatype> pair onto the
         # Emacs keystroke sequence that implements it.
         translationtable = {
             'leftcharacter':       "{ctrl+b}",
             'rightcharacter':      "{ctrl+f}",
             'deletecharacter':     "{Backspace}",
             'killcharacter':       "{ctrl+d}",
             'switchcharacter':     "{ctrl+t}",
             'leftword':            "{esc}b",
             'rightword':           "{esc}f",
             'killword':            "{esc}d",
             'deleteword':          "{esc}{Backspace}",
             'switchword':          "{esc}t",
             'leftsentence':        "{esc}a",
             'rightsentence':       "{esc}e",
             'killsentence':        "{esc}k",
             'deletesentence':      "{ctrl+x}{Backspace}",
             'switchsentence':      "{esc}xtranspose-sentences{Enter}",
             'leftparagraph':       "{esc}xbackward-paragraph{Enter}",
             'rightparagraph':      "{esc}xforward-paragraph{Enter}",
             'killparagraph':       "{esc}xkill-paragraph{Enter}",
             'deleteparagraph':     "{esc}xbackward-kill-paragraph{Enter}",
             'switchparagraph':     "{esc}xtranspose-paragraphs{Enter}",
             'leftline':            "{ctrl+a}",
             'rightline':           "{ctrl+e}",
             'killline':            "{ctrl+k}",
             'deleteline':          "{ctrl+@}{ctrl+a}{ctrl+w}",
             'switchline':          "{ctrl+x}{ctrl+t}",
             'killregion':          "{ctrl+w}",
             'copyregion':          "{esc}w",
         }

         # Pull the words matched by each rule out of the parse results,
         # join them into a lookup key, and play the keystrokes.
         recognized = convertResults(fullResults)
         part1 = ''.join(recognized["operation"])
         part2 = ''.join(recognized["datatype"])
         natlink.playString(translationtable[part1 + part2])

...except you only have to say "delete line" and not "macro delete 
line".  Stick with me, kid, and you'll be able to talk to a recognition 
system in no time.  Of course, in normal conversation you'll 
occasionally find yourself explicitly saying punctuation in the wrong 
places, but it's worth it.  :-)
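For completeness, the Linux end of your quoted **MACRO-...** token scheme would be a small filter over the incoming text stream.  A sketch, where the marker syntax comes from your example but the `dispatch` function and its callbacks are hypothetical:

```python
import re

# Matches the **MACRO-SOMETHING** markers interspersed in the text stream.
MACRO = re.compile(r"\*\*MACRO-([A-Z-]+)\*\*")

def dispatch(stream, type_text, run_macro):
    """Walk the incoming stream: literal spans go to type_text(),
    each **MACRO-...** marker goes to run_macro() with its name."""
    pos = 0
    for match in MACRO.finditer(stream):
        if match.start() > pos:
            type_text(stream[pos:match.start()])
        run_macro(match.group(1))
        pos = match.end()
    if pos < len(stream):
        type_text(stream[pos:])

typed, macros = [], []
dispatch("This sentence is wrong.**MACRO-DELETE-SENTENCE**Better now.",
         typed.append, macros.append)
```

In a real bridge, `type_text` would inject keystrokes into the focused Linux application and `run_macro` would look the name up in a table like the one above.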

seriously, you need to live with speech recognition before you know 
what's the right thing to say.  You want to economize on what you make 
the user utter because, quite frankly, once their hands are toast, you 
don't want to test something even more fragile: their vocal cords. 
Once the vocal cords are gone, the user is well and truly screwed.

> We might eventually be able to get the engine running in Wine. Frankly 
> I'm not too interested in having the whole NS run in Wine because of the 
> interface. If we can make a better interface and can demonstrate a need 
> for speech recognition (a commercial need) then we may well see the 
> owners of the code port the speech engine to Linux. Low latency kernels 
> should be a big draw for them as well.

you really want to tap into the Open Source Speech Recognition 
Initiative.  No, we don't have a website because, as I said, most 
disabled folks don't have a whole lot of energy for volunteering.  But 
we'll get there, probably before we get our 501(c)(3) status.

We have negotiated for rights to a speech recognition engine.  I don't 
know if it's better than what the Sphinx group has, but it is open 
source, and the developer is still interested in seeing it have a life.

but even if we go forward with something like Wine, much of what we are 
talking about building will be quite useful in either case.

> Now we just need someone willing to go on the barricades and front such 
> a project :)

I think that's going to be you and me, with my time limited by making 
money to pay the mortgage.  I will try to finish up a draft of the 
mediator as seen by myself and a couple of folks in OSSRI, and do so 
relatively soon.

> Perhaps we can start this off as a Google Summer of Code project.

perhaps, but I think it's going to be much bigger than what Summer of 
Code can do, so we will need to dig up some alternative funding sources 
so people can get paid.

-- 
Speech-recognition in use.  It makes mistakes, I correct some.



More information about the Ubuntu-accessibility mailing list