Building AI-driven Voice Recognition in Unity with Sentis
Let AI understand what you're saying
This tutorial is part of the Making Games with AI Course, a free upcoming course on creating AI-powered games for Unity and Unreal.
Using your voice in games and being understood by NPCs (non-player characters) is one of the most exciting uses of AI in game development. Think about how immersive it will be to:
Have real-time conversations with NPCs using chat models and AI voices.
Be able to talk with them directly with your voice.
Control your character through text or voice.
Although this was already possible thanks to APIs, there are two drawbacks:
Dependence on an internet connection, where API lag risks breaking immersion.
Potentially high costs from API usage, especially with many players.
Fortunately, Unity launched Sentis, a neural network inference library that lets you run AI models directly inside your game without relying on APIs.
So today, we will take a famous speech-to-text model called Whisper and run it inside Unity.
This is what you’ll get at the end of this tutorial:
You can try the demo (Windows) 👉 here.
To make this project, we’re going to use:
Unity Game Engine (2022.3 or later).
The Unity Sentis library, the neural network inference library that allows us to run our AI model directly inside our game.
The demo Unity package 👉 here.
At the end of this tutorial, you'll be able to use this speech-to-text demo in your own game and build exciting new gameplay on top of it.
Sounds fun? Let’s get started!
The power of Speech-To-Text Models 🤖
How does this demo work?
The demo is simple but powerful:
You select your microphone.
You click record and speak.
You click again, and the speech-to-text model transcribes what you said.
What is Speech-To-Text 🗣️?
Speech-to-text (STT), also known as Automatic Speech Recognition (ASR), is the task of transcribing a given audio clip into text.
For instance, if the player records “What’s the weather today?”, the model takes this audio as input and outputs the transcription “What’s the weather today?”.
There are many different speech-to-text models, but in our case, we’ll use one of the most famous families of STT models called Whisper.
You can try it yourself in your browser here, and learn more about speech-to-text models 👉 here.
Let’s build our first speech-to-text (STT) demo with Sentis!
Step 0: Create a new Unity project
Open Unity Hub and create a new Unity project.
Step 1: Install Unity Sentis
The Sentis Documentation 👉 https://docs.unity3d.com/Packages/com.unity.sentis@latest
Open your project
Click the Sentis Pre-Release package link, or go to Window > Package Manager, click the + icon, select "Add package by name…", and type com.unity.sentis.
Press the Add button to install the package.
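If you want to make sure the install worked, a tiny script that references the library will only compile if the package is present. This is just a hypothetical smoke test, not part of the demo package:

```csharp
using UnityEngine;
using Unity.Sentis;

// Hypothetical smoke test: if this script compiles and logs, the
// com.unity.sentis package is correctly installed in the project.
public class SentisSmokeTest : MonoBehaviour
{
    void Start()
    {
        Debug.Log("Sentis is available. Example backend: " + BackendType.GPUCompute);
    }
}
```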
Step 2: Install Newtonsoft JSON
Click again on “Add package by name”.
Type com.unity.nuget.newtonsoft-json
Press the Add button to install the package.
Step 3: Download and import the package into your Unity project
You can find the package file 👉 here.
From the top menu in Unity, select Assets > Import Package > Custom Package, then navigate to the package file.
In the Import Unity Package window that pops up, select Import and wait for the assets to import.
We now need to download our Sentis Whisper model. The official Sentis models by Unity are on the Hugging Face Hub 🤗
Step 4: Download Whisper from the Hugging Face Hub
For this demo, we’re going to use the model called sentis-whisper-tiny, an official model made and optimized by the Unity Team.
You can find it 👉 here
For this demo we need:
AutoDecoder_Tiny.sentis
AutoEncoder_Tiny.sentis
LogMelSpectro.sentis
vocab.json
You can download them by going to the unity/sentis-whisper-tiny model page, opening Files and versions, and clicking on the ⬇️.
In the Unity project, create a folder called StreamingAssets, and put the *.sentis files and vocab.json in the Assets/StreamingAssets folder.
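As a side note, vocab.json is why we installed Newtonsoft JSON in Step 2: it maps each token string to an integer id, and the demo reads it to turn Whisper's output tokens back into text. Here is a minimal sketch of loading it (illustrative names, not the demo's actual code):

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using UnityEngine;

// Illustrative sketch: load the Whisper vocabulary from StreamingAssets.
// vocab.json maps each token string to an integer id.
public class VocabLoaderSketch : MonoBehaviour
{
    void Start()
    {
        string json = File.ReadAllText(
            Path.Combine(Application.streamingAssetsPath, "vocab.json"));
        Dictionary<string, int> vocab =
            JsonConvert.DeserializeObject<Dictionary<string, int>>(json);
        Debug.Log($"Loaded {vocab.Count} tokens.");
    }
}
```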
Now, if you open Scenes > MainNoFeel, you can try the demo directly in the Editor by clicking ▶️.
Step 5: Let’s understand the code
For this demo, we have two main C# scripts:
SpeechRecognitionController.cs: handles the Microphone, voice recording, and button animations.
RunWhisper.cs: handles loading the model and transcribing user voice input.
The code is commented, so I won't dive into all the details; instead, I'll highlight the most important elements.
SpeechRecognitionController.cs
We have three main functions (a minimal sketch of this flow follows the list):
StartRecording(): starts recording when the player clicks the record button.
StopRecording(): stops recording the user's voice.
SendRecording(): sends the recorded audio to the Whisper model.
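Here's what that flow looks like using Unity's built-in Microphone API. This is a simplified sketch with illustrative names; the demo's actual script also drives the microphone dropdown and button animations:

```csharp
using UnityEngine;

// Simplified sketch of the recording flow in SpeechRecognitionController.cs.
public class RecordingSketch : MonoBehaviour
{
    private AudioClip clip;
    private string device;

    void Start()
    {
        // The demo lets the player pick a microphone from a dropdown;
        // here we simply take the first available device.
        if (Microphone.devices.Length > 0)
            device = Microphone.devices[0];
    }

    public void StartRecording()
    {
        // Whisper works with 16 kHz audio; record a clip of up to 30 seconds.
        clip = Microphone.Start(device, false, 30, 16000);
    }

    public void StopRecording()
    {
        Microphone.End(device);
        SendRecording();
    }

    private void SendRecording()
    {
        // Hand the clip to the Whisper runner. Something like
        // runWhisper.Transcribe(clip) stands in for however
        // RunWhisper.cs exposes transcription in the demo.
    }
}
```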
RunWhisper.cs
In RunWhisper.cs, when we start the game, we load the models (Whisper here is composed of 3 models: a log-mel spectrogram converter, an encoder, and a decoder) and create workers.
In Sentis, you create a worker to break down the model into executable tasks, run the tasks on the GPU or CPU, and output the result.
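To make that concrete, here is a minimal sketch of the loading step, assuming the Sentis 1.x API (ModelLoader / WorkerFactory) that this demo was built against and the file names from Step 4. The demo's actual code is more complete:

```csharp
using UnityEngine;
using Unity.Sentis;

// Sketch of the model-loading step in RunWhisper.cs (Sentis 1.x API).
public class WhisperLoaderSketch : MonoBehaviour
{
    IWorker spectroWorker, encoderWorker, decoderWorker;

    void Start()
    {
        string dir = Application.streamingAssetsPath;

        // Whisper here is split into three models: spectrogram, encoder, decoder.
        Model spectro = ModelLoader.Load(dir + "/LogMelSpectro.sentis");
        Model encoder = ModelLoader.Load(dir + "/AutoEncoder_Tiny.sentis");
        Model decoder = ModelLoader.Load(dir + "/AutoDecoder_Tiny.sentis");

        // A worker schedules the model's operations on a backend (GPU or CPU).
        var backend = BackendType.GPUCompute;
        spectroWorker = WorkerFactory.CreateWorker(backend, spectro);
        encoderWorker = WorkerFactory.CreateWorker(backend, encoder);
        decoderWorker = WorkerFactory.CreateWorker(backend, decoder);
    }

    void OnDestroy()
    {
        // Workers hold native resources, so dispose them explicitly.
        spectroWorker?.Dispose();
        encoderWorker?.Dispose();
        decoderWorker?.Dispose();
    }
}
```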
What can you do now?
This demo is a good codebase to start iterating on. For instance, I've modified Jammo the Robot from our previous tutorial to respond to player voice commands instead of text.
In addition to incorporating this demo in your game, I have two resources to help you:
If you want to dive deeper into audio in AI (transcription, voice generation, etc.), don't hesitate to check out the free Hugging Face Audio Course.
Unity provides very good Sentis documentation. Don't hesitate to read Understand the Sentis workflow to better understand the process.
Limitations and future improvements
If you tried the demo, you'll see that the model is super fast, but there are some limitations in terms of accuracy, especially if you have an accent like me (I'm French 🍷 🥖, so not the easiest accent). That's the tradeoff for its speed and low CPU/GPU usage.
This is because the model is the tiny version of Whisper, and the decoding strategy used for word selection is greedy search. We will add more models in the future that will increase accuracy.
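In case "greedy" is unfamiliar: at each step, the decoder simply picks the single most probable next token, instead of exploring alternatives the way beam search would. A hypothetical illustration, not the demo's actual code:

```csharp
// Hypothetical illustration of greedy decoding: at each step, take the
// argmax over the decoder's logits and append that token to the transcript.
static class GreedyDecodingSketch
{
    public static int NextToken(float[] logits)
    {
        int best = 0;
        for (int i = 1; i < logits.Length; i++)
            if (logits[i] > logits[best]) best = i;
        return best;
    }
}
// Greedy is fast, but an early wrong token can never be revised,
// which is part of the accuracy tradeoff mentioned above.
```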
For now, this is already an amazing opportunity to create exciting new gameplay, and I can't wait to see what you're going to make with it 🔥.
That’s all for today. Congratulations on finishing this tutorial.
What you can do next is dive deeper into the code and modify the demo or use it in your games. I commented every part of it, so it should be relatively straightforward.
And if you do, please tag me on Twitter @thomassimonini and LinkedIn. I would love to see what you're doing.
If you liked my article, please click the ❤️. And don't hesitate to follow me on Substack.
This tutorial is part of the Making Games with AI Course, a free upcoming course on creating AI-powered games for Unity and Unreal. Don't forget to sign up 👉 here
And finally, I would like to thank Dylan Ebert for his help in making this demo possible. But also the Unity Sentis Team who published this Whisper Model 🤗.
See you next time,
Keep learning, stay awesome 🤗