Voice Recognition for Linux

Henrik Nilsen Omma henrik at ubuntu.com
Thu Feb 22 20:25:53 GMT 2007


Eric S. Johansson wrote:
> (warning, I am an unrepentant curmudgeon and negative filter.  Interpret 
> the following accordingly.  If I'm wrong on any points, and someone 
> wants to correct me, I will gladly learn.)
>
> In a nutshell, not much.  

I agree that it does look limited at the moment and that Naturally 
Speaking is the only viable path. Via Voice is outdated on Linux and the 
Windows version of NS is better anyway.

As an unrepentant optimist, though, I can see the following path forward:

In short: Create a copyleft (GPL) tool to transfer text from Naturally 
Speaking on Windows to Linux.

A few starts have been made on this, but it needs to be organised as a 
proper community project and driven forward by several people. The user 
interface should aim to be better than what the native Windows NS 
version has. It should be speech-engine and OS agnostic. That way you'll 
get people using it to transfer speech between all sorts of different 
systems, and it will get more use and development. You should be able to 
easily plug in a free engine like Sphinx (so these will be encouraged to 
improve) or even Vista's native system, which will be very widespread.
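
One way to picture the engine-agnostic part is a small interface that every speech engine adapter implements, so the receiving side never needs to know which engine produced the text. This is only a rough sketch: the names TextSource, ListSource and relay are made up for illustration and do not correspond to any existing NS or Sphinx API.

```python
from typing import Callable, Iterable, Iterator, Protocol


class TextSource(Protocol):
    """Hypothetical interface: anything that yields recognised text chunks."""

    def chunks(self) -> Iterator[str]: ...


class ListSource:
    """Stand-in engine for testing. A real adapter would wrap an actual engine."""

    def __init__(self, items: Iterable[str]) -> None:
        self._items = list(items)

    def chunks(self) -> Iterator[str]:
        return iter(self._items)


def relay(source: TextSource, deliver: Callable[[str], None]) -> int:
    """Forward every chunk from the source to deliver; return how many were sent."""
    count = 0
    for chunk in source.chunks():
        deliver(chunk)
        count += 1
    return count
```

Here deliver could be anything that accepts a string, for example a function that writes to a network socket on the Windows side or one that types into the focused application on the Linux side.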

My biggest gripe with NS is the editing interface. The actual 
recognition is quite good IMO, but when you do make a mistake it is very 
awkward to fix it without using the keyboard. If you give an edit 
command and that is not understood correctly either, then you get a 
meaningless sentence and you are no longer able to easily correct the 
one you originally wanted to fix. The end result is that you totally 
lose the flow of what you were trying to express.

The user interface is what we would have to reconstruct in whole or in 
part anyway, so it's no big loss. We should make it much more 
configurable so you can work around whatever shortcomings it has and 
encourage community contributions to improving usability. Use the NS 
macro system to send custom commands and use scripting on the receiving 
end to allow it to adapt to applications.

I presume the macro functionality in NS is configured so that the 
pattern recognition is quite good on the macros you define yourself. So 
when you say 'Paste in my address' it generally works. We can (ab)use 
this macro facility for our own editing needs. We would define a set of 
macros that would be processed by the NS engine and would give us a known 
and parseable string.

So saying 'Macro: delete sentence' would actually insert the text 
**MACRO-DELETE-SENTENCE** into the text stream. If you were watching the 
text on the Windows system the real text would be interspersed with such 
commands, but on the Linux system receiving the stream it would just Do 
the Right Thing. The big advantage is that it's very configurable this 
way so we can make it do what we want.
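
A minimal sketch of what the receiving end might do with such a stream, assuming the markers always look like **MACRO-SOME-NAME**. The marker pattern, the DELETE-WORD macro and the handler functions are assumptions made up for illustration. Only the DELETE-SENTENCE marker comes from the example above, and the sentence splitting is deliberately naive.

```python
import re

# Assumed marker format, modelled on the **MACRO-DELETE-SENTENCE** example.
MACRO_RE = re.compile(r"\*\*MACRO-([A-Z]+(?:-[A-Z]+)*)\*\*")


def delete_sentence(text):
    """Drop the last sentence from the text received so far."""
    stripped = text.rstrip()
    # Ignore the final terminator so the cut lands at the previous sentence end.
    body = stripped[:-1] if stripped and stripped[-1] in ".!?" else stripped
    cut = max(body.rfind(ch) for ch in ".!?")
    return stripped[:cut + 1] + " " if cut >= 0 else ""


def delete_word(text):
    """Drop the last whitespace-separated word (hypothetical extra macro)."""
    words = text.split()
    return " ".join(words[:-1]) + (" " if len(words) > 1 else "")


# Macro name -> editing function. Users could extend this table themselves.
HANDLERS = {"DELETE-SENTENCE": delete_sentence, "DELETE-WORD": delete_word}


def apply_stream(stream):
    """Return the text that results from applying embedded macros in order."""
    out = ""
    pos = 0
    for match in MACRO_RE.finditer(stream):
        out += stream[pos:match.start()]
        handler = HANDLERS.get(match.group(1))
        if handler is not None:
            out = handler(out)
        else:
            out += match.group(0)  # unknown macro: leave the marker visible
        pos = match.end()
    return out + stream[pos:]
```

For example, apply_stream("Hello there. This is wrong. **MACRO-DELETE-SENTENCE**This is right.") drops the second sentence before appending the third. Keeping the handlers in a plain table is what would make the behaviour easy to reconfigure per user or per application.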

We might eventually be able to get the engine running in Wine. Frankly 
I'm not too interested in having the whole NS run in Wine because of the 
interface. If we can make a better interface and can demonstrate a need 
for speech recognition (a commercial need) then we may well see the 
owners of the code port the speech engine to Linux. Low latency kernels 
should be a big draw for them as well.

Now we just need someone willing to go on the barricades and front such 
a project :)

Perhaps we can start this off as a Google Summer of Code project.

Henrik




More information about the Ubuntu-accessibility mailing list