Speech Recognition Plugin - Sphinx-UE4

Hi djego,
I believe we should keep the discussion of your issue within the github ticket you have raised.


@ShaneC - Looking into the phoneme recognition again, it seems like all that really needs to be done to enable it is to pass the correct parameters and model to ps_init and then add a new event for returning the phoneme strings - does that seem right?

From what I can tell, almost all the other code is the same as in continuous.c in the pocketsphinx source (which does phoneme recognition if the right options are passed, namely -allphone <model file>).

Pretty much man, the WordSpoken event returns the phoneme string, so no need for a new event imo.
Let me know how it works for you. The phoneme strings were pretty abnormal in my very limited testing.

I did try the example program from the pocketsphinx source and noticed that it spat out extra phonemes for no good reason, even on its own test files. That said, I can get away with a bit more due to the resolution of current VR hardware: NPC and player character mouths should be small enough that the weirdness won’t be obvious unless the player is right up in someone’s grill.

Anyway I’ll see how it goes and report back!

Alright, well, I tried it and it does indeed spit out phonemes. Unfortunately it seems pretty random what it gives you, and it’s also a bit slow.

This is me saying “hello”. The “SIL” is just denoting silence. I couldn’t really get it to reliably spit out the right sounds, and it seems to often have a very large (2-3 seconds or more) delay before outputting the results.

LogBlueprintUserMessages: [MyCharacter2_C_2] Speech Recognition Init : SUCCESS
SpeechRecognitionPlugin: SIL G OW L OW L HH SIL M AH L

I’m going to play around with using some other models and see if I can get it to be a bit more reliable.

Makes me wonder what Oculus are using for the recent OVRLipSync plugin for Unity.

Update: Getting faster detection using VoxForge’s dictionary and language model. I need to see about outputting a phonetic LM based on those, though.


So not passing in anything after -allphone actually gives much better results, for some reason. Not sure what’s going on, but I did manage to get this output from me saying “Test”:

SpeechRecognitionPlugin: SIL T EH S T Z F SIL

So the above is just with the default models recommended on the wiki, and I’ve made one change to the plugin, which is to simply pass the “-allphone” parameter with no following file parameter.

It’s also quite fast :slight_smile:

Edit 2:

Set up an example using one of my Fuse characters, roughly translating the visemes into the morph values needed to make each mouth shape, then splitting the phonemes and blending between them across the whole spoken phrase.
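For anyone wanting to try the same thing, here’s a minimal sketch of what a phoneme-to-viseme lookup could look like. The viseme names and groupings below are my own rough illustration, not the exact mapping used in my project or in Fuse’s morph targets:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Illustrative only: collapses CMU ARPAbet phonemes into a small viseme set.
// Both the groupings and the viseme names are hypothetical choices.
std::string PhonemeToViseme(const std::string& Phoneme)
{
    static const std::unordered_map<std::string, std::string> Map = {
        {"AA", "Open"},  {"AE", "Open"},  {"AH", "Open"},  {"AO", "Open"},
        {"EH", "Open"},  {"IH", "Open"},  {"IY", "Smile"}, {"EY", "Smile"},
        {"UW", "Round"}, {"OW", "Round"}, {"UH", "Round"}, {"W", "Round"},
        {"M", "Closed"}, {"B", "Closed"}, {"P", "Closed"},
        {"F", "FV"},     {"V", "FV"},
        {"TH", "Teeth"}, {"DH", "Teeth"},
        {"SIL", "Rest"}
    };
    auto It = Map.find(Phoneme);
    // Fall back to a neutral mouth shape for anything unmapped.
    return It != Map.end() ? It->second : "Rest";
}
```

You’d then drive the morph target weights from whatever viseme the current phoneme maps to.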

It would be a lot better to have it end the utterance earlier and spit out phonemes while it’s still going, so that you don’t have to stop talking to start seeing the morphs go. I’m not exactly sure how to set that up.

Anyway, on the whole it definitely works and is decent enough for rough lipsync, especially good IMO for doing multiplayer voip!

I’ll be getting a video soon, though I’m going to have to shift the audio a bit to make it not look terrible :stuck_out_tongue:

Edit 3: By commenting out the bit that waits for silence before trying to figure out a hypothesis, I’ve made it work in sync with your voice.

I tried to set it up so it would also pass out the frame times for each phoneme, but there’s a bug with custom events and arrays that’s fixed in 4.11p2 onward, so I’ll have to wait for that or wrap it in a struct.
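In case it helps anyone, the struct-wrapping workaround could look something like the sketch below. This is a hypothetical plain-C++ version (in the actual plugin it would be a USTRUCT so Blueprint events can carry it); the example segment data is made up to resemble the “hello” output earlier in the thread:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical workaround: instead of passing parallel arrays of phonemes
// and frame times through a custom event (which hits the 4.11 array bug),
// bundle each segment into one struct and pass an array of those.
struct FPhonemeSegment
{
    std::string Phoneme;
    int StartFrame = 0; // pocketsphinx frames; 100 per second with the default -frate
    int EndFrame = 0;
};

std::vector<FPhonemeSegment> MakeSegments()
{
    // Made-up example data, shaped like the "hello" output above.
    return {
        {"SIL", 0, 10},
        {"HH", 10, 18},
        {"AH", 18, 30},
        {"L",  30, 38},
        {"OW", 38, 55},
    };
}
```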

Awesome mate, would love to see the code and video.
Able to pm with a link to both sometime? :slight_smile: cheers

Absolutely - I’ll be taking a new vid (the ones I took last night aren’t good enough!) and will mention the code changes in detail a bit later today.

First, code changes, these are really simple.

Change config to look like this

	// Start Sphinx
	config = cmd_ln_init(NULL, ps_args(), 1,
		"-hmm", modelPath.c_str(),
		"-dict", dictionaryPath.c_str(),
		"-allphone", "",
		"-bestpath", "yes",

Passing “-allphone” with a blank file parameter.

Next, I had to comment out the part where it sets phrases to recognize, as it will simply keep doing that if these are set:

	//ps_set_keyphrase(ps, "keyphrase_search", const_copy, tollerences, phraseCnt);
	//ps_set_search(ps, "keyphrase_search");

And lastly, and this gives you the “real time” results, commenting out the bit that makes it wait for silence before forming a hypothesis:

if (/*!in_speech && */ utt_started) {

So if utt_started is true at all, we start spitting out results.

Looks a little choppy because 1) mouth shapes are just my interpretations of the visemes into Fuse’s morphs, so they’re not that accurate and 2) blending between morphs in my project is currently a bit hacky and 3) it’s not taking into account how long the phonemes are, it’s just doing each the same amount of time.
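For point 3, a simple fix could be weighting each phoneme’s share of the blend timeline by its frame count rather than giving every phoneme equal time. A minimal sketch, assuming you already have per-segment frame counts from pocketsphinx (frames are 100 per second with the default `-frate`; the counts below are made up):

```cpp
#include <cassert>
#include <vector>

// Convert per-phoneme frame counts into normalized blend weights, so each
// phoneme occupies a share of the animation proportional to its duration.
std::vector<double> DurationWeights(const std::vector<int>& FrameCounts)
{
    int Total = 0;
    for (int Frames : FrameCounts)
        Total += Frames;

    std::vector<double> Weights;
    Weights.reserve(FrameCounts.size());
    for (int Frames : FrameCounts)
        Weights.push_back(Total > 0 ? static_cast<double>(Frames) / Total : 0.0);
    return Weights;
}
```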

I tried your plugin and it works, but something strange: I tried 3 words (“spawn”, “one”, “two”) in dictionary mode, and it always recognizes “two”. What’s wrong? And phoneme mode doesn’t work at all.

You were able to make it work? I kept getting “speech recognition not compatible”. I used the 4.11 dll…

Pr0t0Ss12 , from the wiki:
"At the moment, this plugin should be used to detect phrases. (eg. “open browser”). Singular words recognition is poor. I am looking at ways to improve this to a passable level."
Ignore the Phoneme mode for now.

I have noticed single numbers are especially poorly recognized.
I suspect much of this relates to issues with Pocketsphinx.
For example, the word “two” in the dictionary is made of just two phonemes.
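To make that concrete, each line in the CMU pronouncing dictionary is a word followed by its ARPAbet phonemes, so short words give the recognizer very little to work with:

```text
one    W AH N
spawn  S P AO N
two    T UW
```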
From the CMU Sphinx tutorial (Building a language model – CMUSphinx Open Source Speech Recognition):

“For the best accuracy it is better to have keyphrase with 3-4 syllables. Too short phrases are easily confused.”

If you feel you can make improvements to the accuracy, by all means, download the source and get working on it. :slight_smile:
It’s available on GitHub: shanecolb/sphinx-ue4 (a speech recognition plugin for Unreal Engine 4).
Please let me know if you make some significant progress.

thewolfgoddess, the 4.11 binaries were built against the preview releases of 4.11.
I see now that the main release build is available, so it’s possible the previous binaries are not compatible.
Later tonight, I’ll download the main release build of 4.11 and investigate.

Ah, alright that makes sense. I rolled it back to 4.10 and got it working. I can try to work on the accuracy if I am able, but at the moment I am trying to finish a project before the deadline XD I have noticed that it works better with phrases, or it goes a little wonky, but I am now having a different issue… Is this plugin only compatible with Speech to text? I have been trying to get it to trigger actions and events but it only seems to work when using it to output strings. Otherwise it just does… nothing. I am probably doing something wrong lol

I have added a section to the wiki.

Please take a look. :slight_smile: and let me know of any problems.
I have created an experimental branch in Github. (this is what is used by the above demo)
For now, the major changes are correcting the enum spelling + more granular tolerance settings.
However, in the future, I plan to change this up considerably, and merge back when it is significantly improved in all respects.

Tomorrow, I plan to create some C++ examples, that will show-case the plugin as well.

A number of people had asked how to use my plugin within C++, instead of just blueprints.
I have updated the code (experimental branch), as well as updated the example projects to include an example.
Please take a look, and let me know if there are any serious issues.

My HTC Vive arrived about a week ago; otherwise, the Android port would probably have been done by now.
Hopefully I can focus on that this weekend.

Yay for Android port! :slight_smile:

So I’m not sure why I had some good results with the phoneme recognition and then it later went to crud, but I’m going to start testing this again soon and see if I can get you a version that gives the results I saw before.

No problem. I think I recall reading that their phoneme recognition wasn’t great :confused: by their own admission.
Let me know if you make any headway.
I’m going to look at adding grammar support, and having it switch based on trigger-able contexts and such.

Would it be better (and hopefully easier) to utilize Android’s native speech recognition on Android?