Using neural networks in the gaming industry, including the generation of music, images and 3D models, is on everyone’s lips these days.
One possible application is interactive NPCs that the player can talk to by voice.
A number of well-known companies have already started experimenting in this field. For example, NVIDIA has introduced NVIDIA ACE, and Ubisoft has introduced Neo NPC.
Several lesser-known companies have demonstrated their solutions as well: one, two.
Admittedly, we were not very impressed with the results, so we thought we could at least match them =]. It is up to you to decide whether we have managed to do so.
We decided to conduct R&D and prepare a proof of concept: an application demonstrating how neural networks can be used to create interactive NPCs in games.
The results of our research are presented below. The application works in real time and does not require an Internet connection to any third-party services.
The concept
The player asks the NPC a question by voice; the NPC formulates a response and answers by voice as well.
The NPC should have its own background and react to the player accordingly.
The player’s interaction with the NPC should resemble a dialogue with a real person as much as possible: no scripted lines, voice-only interaction and, ideally, visible facial expressions of the NPC or at least lip movement.
The NPC should also remember the player, so that communication does not start from scratch in a new game session.
Based on the above, our interactive NPC should perform the following functions:
Listen — convert the player’s speech into text;
Understand — analyze the player’s request and formulate a response consistent with the NPC’s background;
Respond — convert the response text into an audio file and synchronize the audio with the movement of the 3D model’s lips;
Remember — keep the history of communication with the player.
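Taken together, these four stages form one dialogue turn. Here is a minimal, self-contained C++ sketch with stubbed-out stages; all function names are illustrative, not the actual plugin API:

```cpp
// One dialogue turn: Listen -> Understand -> Respond, with Remember as history.
// All names are illustrative stubs, not the actual plugin API.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct DialogueTurn { std::string Player, Npc; };

// Listen: speech-to-text (stands in for the vosk-based recognizer).
std::string RecognizeSpeech(const std::vector<int16_t>&) { return "Is there any work for me?"; }

// Understand: LLM reply constrained by the NPC's background (Mistral via llama.cpp).
std::string GenerateReply(const std::string&, const std::vector<DialogueTurn>&)
{
    return "Indeed, I have a task for someone with your skills.";
}

// Respond: text-to-speech (stands in for piper) plus lip-synced playback.
std::vector<int16_t> SynthesizeSpeech(const std::string&) { return {}; }
void PlayWithLipSync(const std::vector<int16_t>&) {}

void OnPlayerUtterance(const std::vector<int16_t>& Pcm, std::vector<DialogueTurn>& History)
{
    const std::string Question = RecognizeSpeech(Pcm);
    const std::string Reply = GenerateReply(Question, History);
    History.push_back({Question, Reply}); // Remember: persisted between sessions
    PlayWithLipSync(SynthesizeSpeech(Reply));
    std::cout << "NPC: " << Reply << "\n";
}

int main()
{
    std::vector<DialogueTurn> History;
    OnPlayerUtterance({}, History);
}
```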
To be on the safe side, third-party services (ChatGPT, Google Speech, etc.) are not used. Instead, ready-made open-source solutions are adapted to our task.
The implementation
Unreal Engine 5 has been chosen as the game engine.
One of the reasons for this choice is the access to ready-to-use highly detailed models of people with a ready-made set of poses and animations — MetaHuman.
The functionality is divided into several plugins:
SpeechRecognitionSystem — speech recognition. It is based on the vosk-api library.
UGpt — generating a response to the player’s text request. The LLM is Mistral (7B); inference runs via llama.cpp on the CPU.
SpeechGenerationSystem — converting text to audio. The plugin is based on the piper library.
LipSyncSystem — synchronizing lip motion animation with audio. The plugin is a redesigned version of the Oculus LipSync plugin. The original plugin is not compatible with UE5 and (mostly) does not work in real time.
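As an illustration of the speech recognition part, here is a minimal sketch against the vosk-api C interface; the model path and input file are placeholders, and error handling is omitted:

```cpp
// Minimal vosk-api usage sketch; paths are placeholders, error handling omitted.
// Expects raw 16 kHz 16-bit mono PCM (no WAV header) on input.
#include <vosk_api.h>
#include <cstdio>

int main()
{
    VoskModel* Model = vosk_model_new("model/vosk-model-small-en-us-0.15");
    VoskRecognizer* Recognizer = vosk_recognizer_new(Model, 16000.0f);

    FILE* Audio = fopen("speech.raw", "rb");
    char Buffer[4096];
    size_t Read;
    while ((Read = fread(Buffer, 1, sizeof(Buffer), Audio)) > 0)
    {
        // Returns 1 when an utterance boundary is detected, 0 otherwise.
        if (vosk_recognizer_accept_waveform(Recognizer, Buffer, (int)Read))
            printf("%s\n", vosk_recognizer_result(Recognizer));         // final JSON
        else
            printf("%s\n", vosk_recognizer_partial_result(Recognizer)); // partial JSON
    }
    printf("%s\n", vosk_recognizer_final_result(Recognizer));

    fclose(Audio);
    vosk_recognizer_free(Recognizer);
    vosk_model_free(Model);
}
```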
We ran into a number of issues while developing these plugins, but they are beyond the scope of this article.
How it works (in general terms)
All the above plugins work according to the same principle:
- when a plugin component is initialized, a separate thread is created (an FRunnableThread running an FRunnable object) with an infinite loop in which the component waits for tasks;
- the component accepts a request and puts it in the request queue;
- at the next iteration of the loop, the request is retrieved from the queue and processed;
- the result of the request is added to the results queue (see the sketch below).
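A simplified sketch of this pattern, assuming a hypothetical FInferenceWorker class (the real plugin code differs in details such as request types and shutdown handling):

```cpp
// Shared worker pattern used by the plugins: a dedicated thread with an
// infinite loop, fed by a request queue and drained via a results queue.
// FInferenceWorker is a hypothetical name; Process() stands in for speech
// recognition, LLM inference, TTS, etc.
#include "HAL/Runnable.h"
#include "HAL/RunnableThread.h"
#include "HAL/ThreadSafeBool.h"
#include "HAL/PlatformProcess.h"
#include "Containers/Queue.h"

class FInferenceWorker : public FRunnable
{
public:
    FInferenceWorker()
    {
        // Spawn the dedicated thread when the component initializes.
        Thread = FRunnableThread::Create(this, TEXT("InferenceWorker"));
    }

    virtual ~FInferenceWorker() override
    {
        Stop();
        if (Thread)
        {
            Thread->WaitForCompletion();
            delete Thread;
        }
    }

    // Game thread: enqueue a request.
    void EnqueueRequest(const FString& Request) { RequestQueue.Enqueue(Request); }

    // Game thread: poll for a finished result (e.g. on Tick).
    bool DequeueResult(FString& OutResult) { return ResultQueue.Dequeue(OutResult); }

    // Worker thread: the infinite loop waiting for tasks.
    virtual uint32 Run() override
    {
        while (!bStopRequested)
        {
            FString Request;
            if (RequestQueue.Dequeue(Request))
            {
                ResultQueue.Enqueue(Process(Request));
            }
            else
            {
                FPlatformProcess::Sleep(0.01f); // avoid busy-waiting
            }
        }
        return 0;
    }

    virtual void Stop() override { bStopRequested = true; }

private:
    FString Process(const FString& Request); // plugin-specific work

    FRunnableThread* Thread = nullptr;
    FThreadSafeBool bStopRequested = false;
    TQueue<FString, EQueueMode::Mpsc> RequestQueue; // game thread(s) -> worker
    TQueue<FString, EQueueMode::Spsc> ResultQueue;  // worker -> game thread
};
```

The lock-free TQueue lets the game thread submit requests and poll for results without ever blocking on the worker.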
Below you may find the simplified blueprints for each component of the system and links to the documentation for the plugins.
Speech generation and lip sync
In addition, we would like to touch upon setting up the background (behavior) of the interactive NPC.
To ‘describe a character’, prompt engineering is required.
As an example, we decided to simulate a dialogue between Dovakin and Jarl Ulfric from Skyrim and prepared the following simple prompt.
Imagine that you are one of the jarls of Skyrim.
You are talking to the great warrior {USER}.
{AI} answers politely and is willing to answer any questions.
{AI} has a job for the hero.
{USER}:
USER: Dovakin
AI: Ulfric
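Before such a prompt is sent to the model, the {USER} and {AI} placeholders have to be filled in; a hypothetical sketch of that step (BuildPrompt is not the actual UGpt API):

```cpp
// Hypothetical placeholder substitution; the actual UGpt plugin API may differ.
#include "CoreMinimal.h"

FString BuildPrompt(const FString& PromptTemplate,
                    const FString& PlayerName, const FString& NpcName)
{
    return PromptTemplate
        .Replace(TEXT("{USER}"), *PlayerName, ESearchCase::CaseSensitive)
        .Replace(TEXT("{AI}"), *NpcName, ESearchCase::CaseSensitive);
}

// Usage: BuildPrompt(Template, TEXT("Dovakin"), TEXT("Ulfric"));
```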
Dovakin — Ulfric dialogue
Dovakin: Greetings respected Jarl.
Ulfric: Greetings mighty Dovakin, what brings you to my hall?
Dovakin: Is there any work for me? I’m a good fencer.
Ulfric: Indeed, your reputation precedes you.
Ulfric: But I have a task for someone with your skills.
Ulfric: The Stormcloaks have taken over Fort Sungard, and we need it back.
Ulfric: They’ve been using it as a base to launch attacks against our people.
Ulfric: Can you help us retake it?
Dovakin: What should I do?
Ulfric: Gather some allies and head to Fort Sungard.
Ulfric: The Stormcloaks are led by a high-ranking officer named Captain Arvel.
Ulfric: He’s a formidable opponent, but with your skills and the help of your allies, you should be able to defeat him and retake the fort.
Dovakin: What will I get for completing the task?
Ulfric: Our gratitude and the knowledge that you have helped protect the people of Skyrim.
Ulfric: And if you prove yourself to be a valuable ally, perhaps there will be other opportunities for you in the future.
Dovakin: Where should I look for them?
Ulfric: You can find potential allies among the local militia or among the other jarls.
Ulfric: Some may join you out of loyalty or for rewards.
Ulfric: Be cautious, however, as not all will be trustworthy.
Ulfric: Use your judgment wisely.
Dovakin: Where can I get supplies for this mission?
Ulfric: You can acquire supplies from various merchants or by scavenging abandoned structures.
Ulfric: Remember, preparation is key to success on the battlefield.
Dovakin: Good. I am ready to complete this mission.
Ulfric: Excellent.
Ulfric: May the gods go with you, Dovakin.
It is somewhat scary to imagine what will happen if an LLM is deliberately trained on the entire lore of Warhammer 40,000 and an ardent fan of the universe is asked to do the prompt engineering…
What next?
- learn how to automatically (or, rather, almost automatically) create a MetaHuman counterpart from a face photo using 3D face mesh reconstruction;
- master voice cloning;
- try to ‘revive’ our NPC even further by adding sentiment analysis of both the player’s requests and the NPC’s responses, so that the facial expressions and behavior of the interactive NPC change accordingly;
- improve lip sync (we will try to make the lip animation smoother and more accurate).
The ultimate goal is to create a digital counterpart of a person whose appearance, voice and behavior will resemble the original.
Conclusion
In our opinion, interactive NPCs have a future in the gaming industry. This future will not come today or tomorrow, but it will certainly come. It is already possible to experiment with new genres, gameplay mechanics and so on.
At this point, though, it is difficult to imagine how this will affect the work of game designers, screenwriters and the industry as a whole.
And what is your opinion?
Feel free to contact us: stendhal.syndrome.studio@gmail.com