Speech Recognition Plugin - Sphinx-UE4

Hi Artigen,
At this time, the plugin does not work under Mac OS.
I would be happy to port it to Mac OS, if you were willing to pay me for my time.
Alternatively, the code is publicly available on GitHub.
All the best,

Has this been tested yet on 4.11?

And oh my… this is exactly the plugin I have been looking for since UE4 came out \o/


I’m getting this error when I try to open it on 4.10.
I tried building from source using VS2015,
but I still get the same error.

They must have compiled with debugging and released that version instead of the non-debugging one. The d on the end signifies debugging.

Ahh, that would be my bad. I’ll update the DLLs and let you know when I have done so.

Okay, I have updated GitHub, and the Wiki.

Note: There’s an additional step to copy across the appropriate DLL to match the version of Unreal you are using.
Please let me know if you encounter any issues.

Sorry for the delay. About converting the Arabic dictionary file into a Latin transliteration: I don’t think this will work. I tried it before in Unity, but who knows, it could work in Unreal :smiley:
Also, I’ve got good news: 4.11 - 4.12 will support Arabic, so it should be easy to complete your work without converting anything :smiley:

Oh, that’s great news. :slight_smile: Thanks for letting me know.
I’ll let the person who e-mailed me know that Arabic is being added in 4.11-4.12.

Hi, I have a problem: when I run my project and press “I” (to init Speech), my microphone input stops working. When I stop and close the project, the microphone works again :confused:

Has this ever happened to you?

Hi djego,
I believe we should keep the discussion of your issue within the GitHub ticket you have raised.


@ShaneC - Looking into the phoneme recognition again, it seems like all that really needs to be done to enable it is to pass the correct parameters and model to ps_init and then add a new event for returning the phoneme strings - does that seem right?

From what I can tell, almost all the other code is the same as in continuous.c in the pocketsphinx source (which does phoneme recognition if the right options are passed, namely -allphone <model file>).

Pretty much man, the WordSpoken event returns the phoneme string, so no need for a new event imo.
Let me know how it works for you. The phoneme strings were pretty abnormal from my very limited testing.

I did try the example program from the pocketsphinx source and noticed that it spits out extra phonemes for no good reason, even on its own test files. That said, I can get away with a bit more due to the resolution of current VR hardware: unless the player is right up in someone’s grill, NPC and player character mouths should be small enough that the weirdness won’t be as obvious.

Anyway I’ll see how it goes and report back!

Alright, well, I tried it and it does indeed spit out phonemes. Unfortunately it seems pretty random what it gives you, and it’s also a bit slow.

This is me saying “hello”. The “SIL” is just denoting silence. I couldn’t really get it to reliably spit out the right sounds, and it seems to often have a very large (2-3 seconds or more) delay before outputting the results.

LogBlueprintUserMessages: [MyCharacter2_C_2] Speech Recognition Init : SUCCESS
SpeechRecognitionPlugin: SIL G OW L OW L HH SIL M AH L
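A minimal sketch of how a phoneme string like the one above could be split into individual phonemes with the “SIL” markers dropped. The function name SplitPhonemes is hypothetical and not part of the plugin, and the input is assumed to be just the space-separated phoneme string, without the log prefix.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the plugin): splits a space-separated
// phoneme string and drops the "SIL" tokens that denote silence.
std::vector<std::string> SplitPhonemes(const std::string& hypothesis)
{
    std::vector<std::string> phonemes;
    std::istringstream stream(hypothesis);
    std::string token;
    while (stream >> token) {
        if (token != "SIL") {
            phonemes.push_back(token);
        }
    }
    return phonemes;
}
```

Called on "SIL G OW L OW L HH SIL M AH L", this leaves the nine non-silence phonemes in their original order.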

I’m going to play around with using some other models and see if I can get it to be a bit more reliable.

Makes me wonder what Oculus are using for the recent OVRLipSync plugin for Unity.

Update: Getting faster detection using VoxForge’s dictionary and language model. I need to see about outputting a phonetic LM based on those, though.


So not passing in anything after -allphone actually gives much better results, for some reason. Not sure what’s going on, but I did manage to get this output from me saying “Test”:

SpeechRecognitionPlugin: SIL T EH S T Z F SIL

So the above is just with the default models recommended on the wiki, and I’ve made one change to the plugin, which is to simply pass the “-allphone” parameter with no following file parameter.

It’s also quite fast :slight_smile:

Edit 2:

Set up an example using one of my Fuse characters by roughly translating the visemes to the values for the morphs needed to make that shape of the mouth, and set it up to split the phonemes, then blend between them for the whole spoken phrase.
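Roughly, the phoneme-to-morph step could look like the sketch below. Everything in it is an assumption for illustration: the MouthShape fields, the weight values, and the function names are hypothetical and are not the actual Fuse morph names or the values used in my project.

```cpp
#include <map>
#include <string>

// Hypothetical morph weights for one mouth shape (all values are made up).
struct MouthShape {
    float JawOpen;
    float LipsWide;
    float LipsRound;
};

// Look up a mouth shape for a phoneme; unknown phonemes fall back to a
// slightly open neutral mouth.
MouthShape ShapeForPhoneme(const std::string& phoneme)
{
    static const std::map<std::string, MouthShape> table = {
        {"AH", {1.0f, 0.25f, 0.0f}},
        {"EH", {0.5f, 0.75f, 0.0f}},
        {"OW", {0.5f, 0.0f, 1.0f}},
        {"M",  {0.0f, 0.0f, 0.0f}},
    };
    const auto it = table.find(phoneme);
    return it != table.end() ? it->second : MouthShape{0.25f, 0.0f, 0.0f};
}

// Linear blend between two shapes; alpha = 0 gives a, alpha = 1 gives b.
MouthShape Blend(const MouthShape& a, const MouthShape& b, float alpha)
{
    return MouthShape{
        a.JawOpen   + (b.JawOpen   - a.JawOpen)   * alpha,
        a.LipsWide  + (b.LipsWide  - a.LipsWide)  * alpha,
        a.LipsRound + (b.LipsRound - a.LipsRound) * alpha,
    };
}
```

Stepping alpha from 0 to 1 over each phoneme’s time slot and feeding the result to the morph targets gives the blend across the whole phrase.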

It would be a lot better to have it end the utterance earlier and spit out phonemes while it’s still going, so that you don’t have to stop talking to start seeing the morphs go. I’m not exactly sure how to set that up.

Anyway, on the whole it definitely works and is decent enough for rough lipsync, especially good IMO for doing multiplayer VoIP!

I’ll be getting a video soon, though I’m going to have to shift the audio a bit to make it not look terrible :stuck_out_tongue:

Edit 3: By commenting out the bit that waits for silence before trying to figure out a hypothesis, I’ve made it work in sync with your voice.

I tried to set it up so it would also pass out the frame times for each phoneme, but there’s a bug with custom events and arrays that’s fixed in 4.11p2 onward, so I’ll have to wait for that or wrap it in a struct.
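The struct-wrapping idea could be sketched like this in plain C++. The names are hypothetical; in the plugin it would presumably need to be a USTRUCT to be usable from Blueprints. The sketch assumes the end frame is inclusive and that frames run at 100 per second, which I believe is the pocketsphinx default frame rate, so treat both as assumptions.

```cpp
#include <string>

// Hypothetical wrapper for one recognised phoneme and its frame span.
struct PhonemeTiming {
    std::string Phoneme;
    int StartFrame;
    int EndFrame; // assumed inclusive
};

// Converts the frame span to seconds.
// framesPerSecond = 100 is an assumption about the recogniser's frame rate.
double DurationSeconds(const PhonemeTiming& timing, int framesPerSecond = 100)
{
    const int frameCount = timing.EndFrame - timing.StartFrame + 1;
    return static_cast<double>(frameCount) / framesPerSecond;
}
```

With durations available, each morph blend could be stretched to match how long the phoneme was actually held.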

Awesome mate, would love to see the code and video.
Are you able to PM me a link to both sometime? :slight_smile: Cheers

Absolutely - I’ll be taking a new vid (the ones I took last night aren’t good enough!) and will mention the code changes in detail a bit later today.

First, the code changes. These are really simple.

Change the config to look like this:

	// Start Sphinx
	config = cmd_ln_init(NULL, ps_args(), 1,
		"-hmm", modelPath.c_str(),
		"-dict", dictionaryPath.c_str(),
		"-allphone", "",
		"-bestpath", "yes",

Passing “-allphone” with a blank file parameter.

Next, I had to comment out the part where it sets phrases to recognize, as it will simply keep doing keyphrase recognition if these are set:

	//ps_set_keyphrase(ps, "keyphrase_search", const_copy, tollerences, phraseCnt);
	//ps_set_search(ps, "keyphrase_search");

And lastly, and this gives you the “real time” results, commenting out the bit that makes it wait for silence before forming a hypothesis:

if (/*!in_speech && */ utt_started) {

So if utt_started is true at all, we start spitting out results.
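To spell out the difference between the two conditions, here is a small standalone sketch. ShouldEmitHypothesis and waitForSilence are hypothetical names that only mirror the logic of the if statement; they are not in the plugin.

```cpp
// Hypothetical helper mirroring the condition above.
// waitForSilence = true  : original check, results only once speech has stopped.
// waitForSilence = false : modified check, results as soon as an utterance has started.
bool ShouldEmitHypothesis(bool inSpeech, bool uttStarted, bool waitForSilence)
{
    if (waitForSilence) {
        return !inSpeech && uttStarted;
    }
    return uttStarted;
}
```

The only case where the two differ is while you are still talking (inSpeech and uttStarted both true): the original check holds back, the modified one emits.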

Looks a little choppy because 1) the mouth shapes are just my interpretations of the visemes into Fuse’s morphs, so they’re not that accurate, 2) blending between morphs in my project is currently a bit hacky, and 3) it’s not taking into account how long the phonemes are, it just holds each one for the same amount of time.

I tried your plugin and it works :) But strangely, when I tried 3 words (“spawn, one, two”) in dictionary mode, it recognized “two” every time. What’s wrong? And phoneme mode doesn’t work at all.