Voice Recognition? Kinda hard . . . .

OK, I decided on yet another foray into speech-to-text software.  I want to see exactly what I can do with it, and honestly, the idea of being a poor man’s Tony Stark is just too cool to pass up.  Well, I guess his commands are like “computer, reconfigure titanium actuator motors” and mine would just be “computer, open minesweeper.”  Still cool though.

Speech to text: you talk, and the computer either types what you said or executes a command.  Speech to text is also called voice recognition (VR).  A lot of these applications also do text to speech, where the computer reads text aloud or responds vocally.
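To make the "types it or executes it" split concrete, here is a minimal sketch in Python of how recognized text might get routed. This is not how any of the products below actually work; the wake word, the command table, and the program names are all my own placeholders.

```python
# Hypothetical command table: spoken phrase -> program to launch.
# The program names are placeholders, not something a real VR package ships with.
COMMANDS = {
    "open minesweeper": "winmine.exe",
    "open notepad": "notepad.exe",
}

# Assumed wake word; commands must start with it.
WAKE_WORD = "computer"


def route(recognized_text):
    """Decide whether recognized speech is a command or plain dictation.

    Returns ("command", program) or ("dictation", text).
    """
    text = recognized_text.strip()
    lowered = text.lower()
    # Commands look like "computer, open minesweeper".
    if lowered.startswith(WAKE_WORD):
        phrase = lowered[len(WAKE_WORD):].strip(" ,.")
        if phrase in COMMANDS:
            return ("command", COMMANDS[phrase])
    # Anything that is not a known command is treated as text to type.
    return ("dictation", text)
```

So `route("Computer, open minesweeper")` comes back as a command, while anything without the wake word (or with an unknown phrase after it) just gets typed. A real engine would also have to cope with misheard words, which this sketch ignores completely.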

I started researching what was open source and what wasn’t, with these four features as priorities:

  1. Be able to take dictation.
  2. Be able to process already recorded audio files (MP3s) into text.
  3. Execute commands.
  4. Respond like HAL 9000.

After some digging, I came up with these applications:

  • Open source – Sphinx project from Carnegie Mellon.
  • Open source – Simon (a non-American effort).
  • Plain old Microsoft VR functionality on a PC.
  • Shareware – E-Speaking VR software ($14; 30 day trial).
  • Commercial – Dragon Speech from Nuance (currently $75, on sale with a headset).

Sphinx is purely an API for people wanting to build academic applications or needing a core library to make an app.  There are pieces in both C++ and Java, but the ramp-up time looked too long for me, since I don’t want to CODE speech to text, I want to DO speech to text.

Here is the Sphinx site:  http://cmusphinx.sourceforge.net/

I played with Simon for about 4 hours and came to this conclusion: it may be a fancy, high-potential application, but it’s over-engineered, complicated, and missing basic “how-to” instructions.  It’s the kind of thing that would have any “see-I-told-you-so” usability expert writing a second edition of their book to scold us developers.  There is a LOT of setup: you have to download a dictionary, load it, create “scenarios,” create grammars, and record a lot of samples before this thing will get off the ground.  And since me and my homey Noam Chomsky haven’t hung out in years, I was a little short on linguistics PhD knowledge to make this thing fly.  It exposed all the innards I didn’t want to know about.

Also, Simon is built on KDE, the Linux desktop environment.  I didn’t dig into the architecture, but it kinda reminds me of a desktop KDE install on a PC.  Still, I think this product is platform-independent and even runs on Mac.

The Simon site: http://simon-listens.org

In all fairness, projects like Sphinx and Simon are crucial for the advancement of VR technology, CRUCIAL.  I appreciate the efforts on both projects, thank you!

But I want to be Tony Stark.  NOW, dammit.

Through my searches I found out Microsoft has some VR built in, depending on what you have installed.  Here we go again . . . grasping at straws . . . maybe.  First I look in my install of Office 2007.  Hmm, nothing.  I look in Control Panel under Speech, and the VR tab is missing.  After some reading I find this out: XP itself doesn’t have it; you need to have installed Office 2002 or 2003.  After that, they moved the VR module into the OS with Vista, so Office 2007 and later don’t include it.  CRAP.  Then I find this site:  http://support.microsoft.com/kb/306537  and a light goes on: just install the right service pack.

I find all the software here:  http://www.microsoft.com/downloads/en/details.aspx?FamilyID=5e86ec97-40a7-453f-b0ee-6583171b4530&displaylang=en

And after downloading all of it, I finally figure out that what I want is the 68 meg file (this was quite painful, I don’t know WHAT I did to my poor Compaq).  So Bingo/Bango, it’s installed . . . and I can, uh, train it.  I stopped there, because I wasn’t sure how to make it really work on my XP machine.

E-Speaking is next.  I am impressed.  It gives you immediate out-of-the-box functionality and a nice interface with commands for your PC.  The UI is relatively easy to use.  It TYPES into Notepad!  It has a cool talking face you can skin.  Very impressed.  The E-Speaking product does most of what I want, with relatively little pain.  Awesome.  Also, it sits on top of the Microsoft SAPI engine, so it’s PC only, but who cares for my uses.

Here’s the E-Speaking site:  http://www.e-speaking.com/

Dragon Speech is a pay-for product.  I have used it before, and it is FAST and the UI is very good.  For more money, it does what you want.  What I don’t like is that, reading their site, they seem to want to limit your choice of hardware to theirs.  I am not sure whether they can lock out your own Bluetooth headset, though; that would totally SUCK.  At the current $75 price point I may purchase it.  Also, it’s difficult to figure out from their site exactly which features come with the different prices and versions.  I don’t want to have to pay another $100 to have the word commands “unlocked,” and I don’t know whether I can even train my own apps on their software; it might be quite locked down.  Also, Dragon at a higher price level (for both Mac and PC) can transcribe audio files.

Dragon’s site:  http://www.nuance.com

It’s an interesting revenue model that E-Speaking follows, and I have an idea Dragon will follow it too.  Basically, if you build the wrapper and some basic functionality, you can charge for different voices, bigger dictionaries, command libraries for applications, etc.  It’s like mapping software: you buy the wrapper cheap, and pay more for the maps you want, if you want them.  Not a bad idea.

The winner, right now, for my purposes: E-Speaking.  It does goals 1, 3, and 4, and very quickly.  None of the ones I tried did goal 2, processing MP3s into text, though Dragon apparently can at a higher price level.  Maybe I’ll see what it takes to write something like that.
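If I ever do write it, I imagine the boring outer loop looks something like this. It is only a sketch in Python under one big assumption: `recognize` is a hypothetical stand-in for whatever engine does the actual speech-to-text, which is the hard part and is not shown here.

```python
from pathlib import Path


def transcribe_folder(folder, recognize):
    """Write a .txt transcript next to every .mp3 file in `folder`.

    `recognize` is a hypothetical callable (audio path -> text) standing in
    for whatever speech engine ends up doing the real work.
    Returns the list of transcript paths that were written.
    """
    written = []
    # Sort so the files are handled in a predictable order.
    for audio in sorted(Path(folder).glob("*.mp3")):
        text = recognize(audio)
        # recording.mp3 -> recording.txt, in the same folder.
        transcript = audio.with_suffix(".txt")
        transcript.write_text(text, encoding="utf-8")
        written.append(transcript)
    return written
```

Passing the recognizer in as a function means the loop doesn't care which engine is behind it, so swapping one engine for another wouldn't touch this part.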
