The websocket connection is open. Let's test it. "Hi, can you hear me? This is the first time, so it will probably take some time." "Yes, I can hear you. What do you need help with?" Wow, how does that sound? That's really interesting. "Hey, what's your name and what can you help me with?" "Hi, I'm Noty, your ultra-concise AI assistant. I can help with AI dev tutorials, coding demos, and more at nottogether.com." Wow. I'm definitely going to self-host this speech-to-speech model. I'm speechless.

Anyway, welcome back to yet another self-hosted multimodal large language model tutorial, and this is going to be a fun one. This is a sequel to our last video, where we self-hosted Ultravox. At the end of that video I said we would pair it up with Kokoro TTS and make an end-to-end speech-to-speech model, and I was really fascinated with the result, at least the way it produces the voice output. I am super impressed with Kokoro TTS, especially paired with the Ultravox model, because it gives such fast inference on the same single-GPU machine I used in the Ultravox self-hosting tutorial. It was really mind-blowing.

In this video I'm going to take you through three steps. In the first step we integrate the Ultravox model with Kokoro TTS in a simple demo. Then we turn it into a websocket-based server that we can use with any frontend client, like the one you have just seen. In the third part we will try to replicate the Ultravox API specification, the way the hosted Ultravox works (more on that in the tutorial sections), and set up a proper end-to-end speech-to-speech server we can use in different AI workflows. So if you're ready, let's get to my screen and let me show you how this is all done.

Welcome to the core tutorial. First, I'm going to turn off my video recording. The reason is that I'm going to run so many things locally, and I have seen Loom crash, and sometimes fail to record the audio, while recording in the recent past. Stopping the camera gives me the most network bandwidth while I take you through the step-by-step flow; you will see me again at the end of the video.

There is actually a lot to cover in this tutorial, so I'll try to finish up as quickly as possible. Before any of the code deep-dive, let me quickly go into the server. From the last video, where we self-hosted Ultravox, I actually didn't kill the server; I kept it running while I was testing a bunch of things, and now I'm going to close it. The reason I kept it alive is that, if you ran the Ultravox demo from the previous tutorial, you saw it took a lot of time to download the model and set up the environment, so I didn't want to repeat all of that while I set up TTS with Kokoro.

If you don't know what Kokoro TTS is, it is a really powerful local text-to-speech solution. Why do I say local? If you go to Hugging Face, you will see it is only an 82-million-parameter TTS model, which is really small, but its quality is really good, as I have tested and as you probably saw in the demo at the start of the video. Since the Ultravox speech model was responding very fast in the last video, I thought: why not just pair it with Kokoro TTS and get a real speech-to-speech experience? I did worry it might become slow, which is why I wanted to test it before creating this video. But as far as I've tested,
it was really fast and really good. Now let me quickly go through the code I have created. It's not very fancy, and it's very similar to the last one, so I'll just quickly walk through what I've done for the first demo.

The first file is the Ultravox-Kokoro test, and it is almost exactly like the previous Ultravox app file where we did the self-hosting and loaded the Ultravox 0.4 model using Transformers. We are doing basically the same thing, but the extra part is that we define a voice assistant interface and specify a few of the voices, like Nicole, Michael, Emma, and George; these are voices available by default within Kokoro TTS. In the init method, just like last time, we load the Ultravox model from Hugging Face, and we also initialize the Kokoro TTS pipeline. The kokoro library is very simple, and you can install it with pip; you don't even have to download the model manually. If you look at requirements.txt, you will see kokoro listed as a Python library. Once you install it, it downloads the model as part of that and uses it under the hood, which is really awesome.

Again, very similar to last time, we keep a default system prompt, and we create a TTS pipeline that generates the response using whichever voice you select. Then there is the process_speech method, which is an improved version of the previous process_audio method: instead of taking audio input and returning only a text output, it takes the speech input and generates speech output using Kokoro TTS. As in the last video, we load the input audio at a 16,000 Hz sampling rate and use a system prompt. If a text response is generated, we load the selected voice and generate the audio with that voice config, whatever voice was chosen from the UI. This again runs on top of a Gradio UI, as you will see soon. At the end, once the audio segments are created, we combine them at a 24,000 Hz sampling rate and return both the text response and the audio. The create_interface function just builds the Gradio UI for us: a dropdown to select the voice, a field for the system prompt, and two output areas, one for the text output and one for the voice output. That's it, really; it's around 200 lines of code, and of course I created it with AI, so it didn't take long once the Ultravox app code was ready.

Now I'm going to copy this, go into the server, create a folder called uv-speech-test, and go inside it. We create the file, which I'll name ultravox_speech_test.py, copy the whole thing, and press Shift+Ctrl+V to paste the code. Again, if you are spinning up the server for the first time, or you don't know how: I have used Massed Compute, and I have attached their link and their 50% discount code in the video description, so go ahead and find that, spin up their VM, and use it to test this model or for your production purposes, whatever suits you.
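The output path just described, generating per-chunk audio with the selected voice and then combining the chunks at 24,000 Hz for the Gradio audio output, can be sketched with a small helper. This is a minimal sketch with my own function name and shapes, not the video's exact code:

```python
import numpy as np

def combine_segments(segments, sample_rate=24000):
    """Concatenate per-chunk TTS audio into one waveform.

    Kokoro-style pipelines typically yield one float32 array per text
    chunk at 24 kHz; Gradio's Audio output accepts a (rate, array) tuple.
    """
    if not segments:
        return sample_rate, np.zeros(0, dtype=np.float32)
    audio = np.concatenate([np.asarray(s, dtype=np.float32) for s in segments])
    return sample_rate, audio

# Example with two fake half-second segments:
rate, audio = combine_segments([np.zeros(12000), np.ones(12000)])
print(rate, audio.shape)  # → 24000 (24000,)
```

Returning the `(rate, array)` tuple is what lets the demo hand the result straight to a Gradio audio component alongside the text response.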
What you additionally need to do is copy requirements.txt. I am already using my existing environment, so I'm just going to activate it, but you should create a Python environment, create this requirements.txt, and then install it with pip install -r requirements.txt.

I have already done it, so I don't need to run it, but as you can see when I run it anyway, every requirement is already satisfied, which means all of these modules are installed. Now let's give it a quick test. First, let me delete the file we created by mistake. Then we just run python on ultravox_speech_test.py, which should start a Gradio-based UI application we can test. Again, if you're running it for the first time, it will download all the model files, which takes some time; for me, the models are already downloaded, so it quickly loaded the files and my Gradio app is ready. All I do is copy the URL, come back here, and open the Gradio live application, and there you go, this is how the interface looks.

We'll use one of the voices there, and I'm just going to ask the same question we have been testing with previously: "Hey, what is the capital of France?" Then I hit Process Speech. The first time takes a moment because the model has only just loaded, but it was still really fast, as you can see; the text response was there, and then we can play the voice response. The TTS output is really good, I can tell you, especially given that it's such a small model; I wouldn't be able to tell whether it's a premium model like ElevenLabs or just a local model. Let's test another one: "How long is the Eiffel Tower in the city of Paris?" Again this creates a long response, which I don't mind. Let me select another voice, maybe a male voice, and ask another question: "How many people approximately live in London?" This time we are using a male voice; let's see how the response comes out. Yeah,
that looks good. So as you can see, this is all working, and this project literally gave me the idea: if this is all working, why don't we create a voice-to-voice application where we don't have to click Process Speech and play the audio every time? It should be automatic, like a real voice conversation. That was really challenging, of course, because there is a lot to take care of: different voice modulations, managing a websocket (WSS) server, and so on. But I have been able to create this script. This particular script is a bit complex, and there is a bit more going on here, but I'll try to show you how it all works.

Because we are building a speech-to-speech application, we need some kind of voice activity detection to detect when the speech has stopped, which is why we are using Silero VAD. In the end I also added a button to control it, because I found Silero VAD was not working really well at a 16,000 Hz sampling rate; I'll come back to that when we test it. In process_audio we do something very simple: we collect the audio chunks, merge them, and then process the result. Again we are using the different voices you can see here, and, very similar to the previous script, we have the TTS pipeline that takes the voice you select in the UI and uses it to create the output.

Because this is a websocket server, we have added a CORS middleware. We have allowed "*" here, but in a real-life production scenario you definitely want a CORS filter for security purposes, where you allow API requests only from specific domains, like your production application's domain, for added security.
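The chunk handling just mentioned, collecting audio chunks from the websocket and merging them before inference, can be sketched like this. I'm assuming the client sends 16-bit little-endian PCM, which is a common choice for browser audio capture, though the exact wire format isn't spelled out here:

```python
import numpy as np

def merge_pcm_chunks(chunks):
    """Join raw 16-bit PCM byte chunks into one float32 waveform in [-1, 1].

    The server buffers chunks arriving over the websocket until VAD decides
    the utterance has ended, then merges them into the array the model consumes.
    """
    pcm = b"".join(chunks)
    return np.frombuffer(pcm, dtype="<i2").astype(np.float32) / 32768.0

# Example: 160 samples of silence followed by one full-scale sample.
chunks = [np.zeros(160, dtype="<i2").tobytes(),
          np.array([32767], dtype="<i2").tobytes()]
audio = merge_pcm_chunks(chunks)
print(audio.shape)  # → (161,)
```

Dividing by 32768 maps the int16 range onto [-1, 1), the float range speech models typically expect.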
I added some HTML content just to check whether the Ultravox speech-to-speech server is available, simply by hitting the root path. So there are a bunch of routes: the root path just tells you whether the server is live, then we have startup and the /ws endpoint. Startup handles starting and loading the models, and /ws is the websocket endpoint used by the client we created to do the communication. Here we have the system prompt we will use for this particular application. It is a very simple application: we have created an assistant called Noty, and she is basically an assistant for our channel. Let's see how the responses come out. There is nothing much beyond that, as you can see; it just runs uvicorn, so the script runs as an app or an API rather than a Gradio application.

That means this is just the API part; we also needed to create the front end. For that I created a very simple front end with just one app.js file.
In app.js there is a list of voices from which we select the voice, and there is an icon button that looks like a microphone. When you click it, it checks whether recording is already on; if it is, it stops the streaming, otherwise it starts streaming, meaning it starts recording and sends the audio to the backend server over the websocket. There are some animations, of course: when recording is on, the microphone box shows some wavy animations. That's all, really; these are just UI bits I added. Again, it's just one file, not a really fancy UI application, but it works, as you will see.

So first, just like last time, we copy the speech assistant server, go back to our server, come out of the existing Gradio application, and create another file, which we'll name server.py (name it anything, really; it doesn't matter), then press Ctrl+D, and that's saved. We run python server.py, and that should start our server, hopefully without any issue. It's loading the Ultravox model and using CUDA, which means it's using our GPU, and our application has started, as you can see. How do we confirm the application is started? Very simple: we go to our Massed Compute instance, copy the address, hit port 7860, and we can see the page has loaded with the header "Ultravox websocket server is ready". That means the server is running and has loaded everything, so we can do the test.

For the test, we come back here and go inside the speech-with-uv-kokoro folder, which should be inside the ultravox-kokoro folder, and run npm run dev, which starts the UI, the client-side application. I'll minimize this, and there goes our simple yet effective front end. As you can see, it has also been able to establish the websocket connection; the server reports that the connection is open, so we can test it now.

Last time we hadn't tested this voice, so let's test with Nicole. Just allow the microphone: "Hey, can you hear me?" It has sent the voice, and because it's the first time it will take a moment. I didn't actually hear any audio; let's try again: "Can you hear me? What can you help me with?" "...tools and frameworks. What do you need help with?" Let's try another voice and ask: "Where can I get all the tutorials, resources, and everything?" And the reply: head over to nottogether.
com for all tutorials, resources, and more. This voice is really crazy; I think there will be a lot of misuse of this voice, I'm telling you. Anyway, let's use Emma, a UK female voice, and see how the accent sounds: "I'm Noty, your AI assistant. I have knowledge on AI dev tutorials, practical coding demos, and the latest AI tools and frameworks." "Where can I find all the resources for all the demos and everything that you produce?" "Head over to nottogether.com for all my demos, tutorials, and resources."

So as you can see, this is really, really fast; this is how a real speech-to-speech model should work, and the latency is even within a second, as far as I have seen. That actually gave me the idea to create the Ultravox-API-compatible endpoint. When you use the hosted Ultravox API, you first make a call to the /api/calls endpoint, and that first API call gives you a joinUrl, which is nothing but a websocket URL; over that websocket, your client application does the communication. So as a next step, I converted this speech assistant into a speech server using that same approach: the first API call sends back the WSS URL, and then our client uses that URL to join the call and communicate.

But, spoiler alert, this code is still not working. There have been a lot of issues in creating this particular assistant: we had to infuse memory; we had to ensure our VAD parameters are good, so it can detect when the speech input is finished and start processing; it also needs to process speech of different sample sizes; and there is state management happening as well. For example, the agent will be in a listening state, then go into a thinking state, then a speaking state, so there are different states being managed. There is a lot happening in this server, and when I did a quick test, it was getting stuck in the listening state; I think the reason is that the voice activity detector is not properly set up yet. So I will do some more work on this particular script. To be fair, I have not been producing anything special for the member section, so I will probably add the improved version of this server to a members-only video.
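The listening/thinking/speaking state management described above can be sketched as a tiny state machine. The state names come from the description here, but the transition table is my own guess at the intended flow:

```python
from enum import Enum

class AgentState(Enum):
    LISTENING = "listening"  # buffering mic audio, waiting for VAD end-of-speech
    THINKING = "thinking"    # running the speech model on the merged utterance
    SPEAKING = "speaking"    # streaming TTS audio back to the client

# Allowed transitions; anything else indicates a bug,
# such as the agent getting stuck in LISTENING.
TRANSITIONS = {
    AgentState.LISTENING: {AgentState.THINKING},
    AgentState.THINKING: {AgentState.SPEAKING},
    AgentState.SPEAKING: {AgentState.LISTENING},
}

def advance(current: AgentState, target: AgentState) -> AgentState:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target

# One full turn of the conversation:
state = AgentState.LISTENING
state = advance(state, AgentState.THINKING)
state = advance(state, AgentState.SPEAKING)
state = advance(state, AgentState.LISTENING)
print(state.value)  # → listening
```

Making illegal transitions raise loudly is one way to surface the stuck-in-listening bug: if VAD never fires, the agent never legally leaves `LISTENING`, and that shows up immediately in logs.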
But as a regular member, you will have access to the script, the one I've already created, so please download it and try to improve upon it; see if you can make it work. That will probably be a task for you, and you can update me in the comments of this video.

Now let me quickly show you how far we have been able to progress with this. Again, it needs a lot of refinement, because we are actually building toward a realtime server API specification, and that needs to handle a lot more than just producing speech-to-speech output, doesn't it? So, just like last time, we create speech_service.py, paste the code, and run python speech_service.py, which should start our server and load the models. Let's see: it's all loaded, again on port 7860. We will close this, but instead of the client we created for this video, we will use the demo from our earlier Ultravox video, so we can test a proper use case. If you remember, or if you have seen the previous Ultravox demo, we actually created an assistant for a Dr. Donut drive-thru, so we are going to use that. I'm not going to go through the Ultravox sample client code in detail, because I've already covered it in much detail in that particular video, where I explored the Ultravox demo; please go check that out if you need to know how it works. All I'm going to do is quickly load a conda environment. This is just a client, and we already have the API running, similar to the Ultravox hosted version. So let's load the environment... I have forgotten which environment we created for this. Oh, I remember: this is the uv app, so we don't need a conda environment; we can just do uv run and then run webcore app.
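The Ultravox-style handshake mentioned earlier, where the client first POSTs to a calls endpoint and receives a joinUrl pointing at a websocket, can be sketched like this. The callId/joinUrl field names mirror the hosted Ultravox API's response, but the path layout and host here are illustrative:

```python
import json
import uuid

def create_call(host: str = "localhost:7860") -> dict:
    """Build the response body an Ultravox-style POST /api/calls would return."""
    call_id = str(uuid.uuid4())
    return {"callId": call_id, "joinUrl": f"ws://{host}/ws/{call_id}"}

# The client reads joinUrl from this response and then
# opens a websocket connection to that URL to join the call.
print(json.dumps(create_call(), indent=2))
```

Embedding the call id in the websocket path is what lets the server associate the incoming websocket connection with the call state (system prompt, memory, agent state) created by the initial API call.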