Phrase-Based Voice Recognition Plugin [Windows-Only]

Hey guys,

I spent the weekend researching voice recognition APIs and how to integrate them with UE4. I’ve worked out most of the technical side (in the C# SAPI library, for example), but how the plugin gets used kinda dictates how I expose it to you, the user.

In the following video (early testing, done only in a C# console application), I defined:

“Find”
“restaurants”, “hotels”, “gas stations”
“near”
“Seattle”, “Boston”, “Dallas”

and

“Team”
“alpha”, “bravo”, “charlie”, “delta”, “echo”, “foxtrot”
“attack”, “defend”, “retreat from”
“Sydney”, “Brisbane”, “Melbourne”

Because it’s phrase-directed rather than pure dictation, it’s pretty fast. I screwed it up by being a bit hasty and sitting too far back from the microphone:

https://www.youtube.com/watch?v=7jBSRmDL_s0
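
For anyone curious what the phrase-directed setup looks like in code, here’s a minimal sketch along the lines of the console test — not the exact code from the video, but the standard Choices/GrammarBuilder approach in System.Speech (the managed SAPI wrapper):

```csharp
using System;
using System.Speech.Recognition; // requires a reference to the System.Speech assembly

class GrammarDemo
{
    static void Main()
    {
        // "Find <place> near <city>" as a phrase grammar rather than open dictation.
        var places = new Choices("restaurants", "hotels", "gas stations");
        var cities = new Choices("Seattle", "Boston", "Dallas");

        var find = new GrammarBuilder("Find");
        find.Append(places);
        find.Append("near");
        find.Append(cities);

        using (var recognizer = new SpeechRecognitionEngine())
        {
            recognizer.LoadGrammar(new Grammar(find));
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.SpeechRecognized += (s, e) =>
                Console.WriteLine("Recognized: " + e.Result.Text);
            recognizer.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine(); // keep listening until Enter is pressed
        }
    }
}
```

Because the recognizer only has to choose between a handful of fixed alternatives, it comes back much faster than open dictation would.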

I’m intending to let the user bind Blueprint and C++ functions to the ‘speech recognized’ callbacks, so that when something is recognized, you’ll be notified (including the phrase itself). Possibly a wrapper on top of that, so you can bind entire functions to phrases and let the plugin invoke them under the hood (i.e. you bind your ‘Select All’ function to the recognition callback, and it only calls ‘Select All’ when the chosen phrase is recognized… which would probably be the phrase “Select All”, but that would be up to you).
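
To make the phrase-to-function binding idea concrete, it could look something like this (still in C# console-prototype terms; the class and method names are placeholders, not the plugin’s actual API):

```csharp
using System;
using System.Collections.Generic;
using System.Speech.Recognition;

// Placeholder wrapper: map exact phrases to callbacks and invoke the matching
// callback whenever the recognizer reports that phrase.
class PhraseBindings
{
    private readonly Dictionary<string, Action> bindings =
        new Dictionary<string, Action>(StringComparer.OrdinalIgnoreCase);

    public void Bind(string phrase, Action callback) => bindings[phrase] = callback;

    public void Attach(SpeechRecognitionEngine recognizer)
    {
        recognizer.SpeechRecognized += (s, e) =>
        {
            if (bindings.TryGetValue(e.Result.Text, out var callback))
                callback();
        };
    }
}
```

Usage would be along the lines of bindings.Bind("Select All", SelectAll); in the plugin itself the same idea would surface as a Blueprint-assignable event rather than a C# delegate.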

The problem with any of these approaches is that it can be hard to add new phrases on the fly. It’s possible, but I need to be sure that that’s actually desired.

Would you prefer to set up the speech in a document and have that loaded in, or to specify it via Blueprint nodes/C++ functions, or via properties/member variables? The former is easier for lots of text, wildcards, and tricky combinations, but it makes binding callbacks more complicated (and makes the more advanced callback options less likely to be included as a feature).

Throw your feedback at me!

Are these APIs English-only? For someone whose native tongue isn’t English, there are two problems with all these voice-command games:

  1. You probably have some kind of accent that screws up the parser. That means you have to put a lot of effort into getting the words right. Most of the time a voice command system is meant to give you an extra mode of input besides your hands, but the combined result is that you spend so much attention on getting the voice commands right that you forget about your hands and everything they’re doing in the game.

  2. It feels unnatural to speak these robotic phrases in a language that isn’t your first one (i.e. to speak English when you otherwise wouldn’t). You can compare it to taking a sip from a glass of water: speaking these words is like taking a sip. If the language is your own, the glass is next to you, well within reach, and it’s easy to drink. If the language is not your own, the glass is in a different room, and each time you want a sip you have to get up, walk over to the glass, take a sip, leave the glass in the other room, and walk back to where you were. At some point it simply becomes more effort than it’s worth.

I’d suggest you put on a really thick foreign accent and see how hard it is to get it to work. Any sort of language support or calibration will do a lot to make this more accessible to the masses.

Either way this is cool and a good alternative for people who are unable or do not wish to play with their hands (or feet).

How flexible are these systems? Can they ignore filler or unimportant words like Siri and others do, or is it really a fixed comparison against a set of preloaded phrases? I experimented with something like this a few years ago, but it was a bit inconvenient having to remember and use exact wordings. I don’t know how much has improved since then, though.

This is awesome, I’d love to see it grow!

While it’s not something I have to worry about (as can be heard from the video), it’s definitely something to think about. I’m at the mercy of Microsoft and the voice packs they’ve released. If I’m not mistaken, these are the available packs:

MSSpeech_SR_en-US_TELE.msi
MSSpeech_SR_ca-ES_TELE.msi
MSSpeech_SR_da-DK_TELE.msi
MSSpeech_SR_de-DE_TELE.msi
MSSpeech_SR_en-AU_TELE.msi
MSSpeech_SR_en-CA_TELE.msi
MSSpeech_SR_en-GB_TELE.msi
MSSpeech_SR_en-IN_TELE.msi
MSSpeech_SR_es-ES_TELE.msi
MSSpeech_SR_es-MX_TELE.msi
MSSpeech_SR_fi-FI_TELE.msi
MSSpeech_SR_fr-CA_TELE.msi
MSSpeech_SR_fr-FR_TELE.msi
MSSpeech_SR_it-IT_TELE.msi
MSSpeech_SR_ja-JP_TELE.msi
MSSpeech_SR_ko-KR_TELE.msi
MSSpeech_SR_nb-NO_TELE.msi
MSSpeech_SR_nl-NL_TELE.msi
MSSpeech_SR_pl-PL_TELE.msi
MSSpeech_SR_pt-BR_TELE.msi
MSSpeech_SR_pt-PT_TELE.msi
MSSpeech_SR_ru-RU_TELE.msi
MSSpeech_SR_sv-SE_TELE.msi
MSSpeech_SR_zh-CN_TELE.msi
MSSpeech_SR_zh-HK_TELE.msi
MSSpeech_SR_zh-TW_TELE.msi

I’m not sure if they determine the words to recognize or the accent (or both), but I’ll be taking a look for sure. Not sure if I can calibrate with this API, either.
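
For what it’s worth, the managed API can at least enumerate whichever of those packs are installed and create a recognizer for a specific culture, so picking the language should be straightforward. A quick sketch (this assumes the en-AU pack is actually installed; these are stock System.Speech calls, nothing plugin-specific):

```csharp
using System;
using System.Globalization;
using System.Speech.Recognition;

class RecognizerList
{
    static void Main()
    {
        // List every speech recognizer (language pack) installed on this machine.
        foreach (var info in SpeechRecognitionEngine.InstalledRecognizers())
            Console.WriteLine(info.Culture + " - " + info.Description);

        // Create an engine for a specific culture, e.g. Australian English.
        using (var recognizer = new SpeechRecognitionEngine(new CultureInfo("en-AU")))
            Console.WriteLine("Using: " + recognizer.RecognizerInfo.Description);
    }
}
```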

My intention for such things in-game is more as a supplementary tool than a replacement for commanding by hand.

That’s the thing - if I allow the speech to be set up via some sort of file, it’s easy to include wildcards in your phrases. It means you can make them rather easy to say, but with more work on the developer’s end (where it should be!).
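
As a rough illustration (the file format here is just something made up for the example), a one-phrase-per-line file with “*” as a wildcard marker could map straight onto GrammarBuilder.AppendWildcard():

```csharp
using System.IO;
using System.Speech.Recognition;

static class PhraseFileLoader
{
    // Assumed file format: one phrase per line, with "*" marking a slot where
    // the player can say anything, e.g. "Team * attack *".
    public static Grammar Load(string path)
    {
        var alternatives = new Choices();
        foreach (var line in File.ReadAllLines(path))
        {
            if (string.IsNullOrWhiteSpace(line))
                continue;

            var phrase = new GrammarBuilder();
            foreach (var token in line.Split(' '))
            {
                if (token == "*")
                    phrase.AppendWildcard(); // matches arbitrary speech at this position
                else
                    phrase.Append(token);
            }
            alternatives.Add(phrase);
        }
        return new Grammar(new GrammarBuilder(alternatives));
    }
}
```

Loading everything from a file like this keeps the tricky combinations on the developer’s side, which is exactly the trade-off described above.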

Have you done any more work on integrating this with Unreal? What speech recognition plugin did you settle on? This looks awesome.

Wow, that’s impressive! Any chance I can get my hands on this? I’d like to use it to voice-command the AI companions in my project… I would spend about 100 dollars on this if it’s able to call Blueprints from voice commands… maybe a Marketplace submission in a more advanced state? Or even a free plugin in this state?

Hey guys - I didn’t take this much further because I had trouble getting the API working (in that it’s a C# library and I’m trying to use it in C++). The C++ version of the API is horrible, but I may take a look later.

I still need answers to my questions in the OP about how people would prefer to use it. I can’t proceed without those.

@ - take a look at CMU Sphinx and, in particular, pocketsphinx (the portable library). User ShaneC made a plugin for it here: [PLUGIN] Speech Recognition - Game Development - Epic Developer Community Forums

To get you interested: it offers a bit more than SAPI does (phoneme recognition, word-list recognition, and free-form recognition).

Currently it works quite well, though the recognition probability isn’t exposed (so if you have similar-sounding words it might miscategorize them), and there’s a bug when you say a few sentences in a row before pausing, as opposed to saying one or a few words and then pausing. Only the word-list support is currently implemented.
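
For reference, pocketsphinx’s own keyword-spotting mode is driven by a plain keyword file - one phrase per line with a detection threshold. I’m not sure whether ShaneC’s plugin exposes this format directly, so take it as an illustration of the underlying library rather than of the plugin:

```
team alpha attack /1e-30/
team bravo defend /1e-30/
select all /1e-20/
```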

I’m going to use speech recognition for my magic system and the phoneme part of pocketsphinx for lipsync, and it’d be awesome to have more eyes on it :)