With just a 5 to 15 second recording of anyone speaking, you can now clone that voice and then use it with text to speech to sound like that very person. Unlike other methods, E2 F5 does not require any training, and the quality sounds the best out of all the free zero-shot tools that I tried. It also runs very fast on most supported hardware, and if you combine it with other open-source AI tools, you can create some cool things, like making anyone say anything on video or creating your own video podcast. You'll learn
everything you need to know in the following tutorial by AI Voice Tutor. Today we will start by preparing the input voices that we want to clone. Then we will install and use E2 F5 TTS for the actual text-to-speech generation, using its features such as podcast generation and speaking in different emotions. Currently only English and Chinese are supported, but more on that later. At the end I will briefly show you how you can use FaceFusion 3 to lip-sync videos to the voices. All tools should work on most hardware, and you will
need about 15 GB of free disk space. To put things into perspective: when it comes to free open-source AI voice generation, there are two main approaches. The first one is to use your own voice and convert that into the voice of someone else. The best tool for that is RVC v2. This requires the voices to be trained before they can be used, which can take a while depending on your hardware. RVC also works in real time, which means that you can have some fun with your friends in apps that support voice calls, such as
Discord. The second technique is text to speech, and a tool called Tortoise TTS lets you generate some very high-quality AI voices ("automatically adjust some values, and you can then click on Save Training Configuration"), but just like RVC, you need to train the voice models before you can use them. E2 F5, however, is zero-shot text to speech, which means it does not require any additional training. If you want to learn more about the other tools, you can find the videos on my channel or check out my AI voice cloning playlist. E2 F5
only needs 5 to 15 seconds of a voice. I downloaded a podcast ("the big question for me in that timeline is why didn't we do it that"), but audiobooks or movies also work fine. To turn a video into a short audio clip, you can use many different tools such as CapCut, but I prefer to use a free open-source tool called Audacity. Just make sure to download the version without MuseHub. To use Audacity with video files, you need to go to the preferences to download FFmpeg, and then Audacity should be able to
locate it. Once you open an audio or video file, you can highlight parts of the audio and delete them by hitting Backspace. Once you have about 10 seconds of one person speaking, you can export that highlighted portion as an MP3 file if you set the export to the current selection only. While this one 10-second clip is enough, I suggest you cut multiple clips of the same speaker while you're at it. Just make sure that the files you cut are less than 15 seconds long. In case they are longer, E2 F5
will cut them off after 15 seconds, and if it cuts them off in the middle of a word, it can lead to unpredictable results. I don't think there's an ideal length for the audio files, as I have been getting great results with input files that were 6 seconds long as well as with ones that were 15 seconds long. F5 TTS also supports voices with different emotions ("everybody's down on the floor and we in lockdown"), so in case you do have a source of a speaker with different emotions, it helps to name the files accordingly.
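
As a rough aid for the 15-second limit mentioned above, here is a minimal Python sketch that checks how long a reference clip is before you load it. It assumes the clips were exported as WAV rather than MP3 (Python's standard library cannot read MP3 files), and the function names are made up for illustration; none of this is part of E2 F5 itself.

```python
import wave

# Limit taken from the tutorial: longer reference clips get cut off.
MAX_REFERENCE_SECONDS = 15.0


def clip_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path, max_seconds=MAX_REFERENCE_SECONDS):
    """Return (duration, ok); ok is True when the clip fits the limit."""
    duration = clip_duration_seconds(path)
    return duration, duration <= max_seconds
```

Calling `check_reference_clip` on each exported file gives you the clip length and a flag telling you whether it would be cut off, so you can re-trim it in Audacity at a pause instead of in the middle of a word.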
In case there are any background noises in your audio files, you can use another open-source tool called Ultimate Vocal Remover (UVR). By using the same settings as seen here, you can easily extract vocals from songs, but you can also get rid of background noises, as can be heard here. Here is the original: "come here for a second, I see, yeah, he goes your punches". And here are the vocals extracted with UVR: "come here for a second, I said, yeah, he goes your punches". Keep in mind that having good and clean input audio clips can be
crucial for achieving high-quality generated audio. Next we install or update Pinokio, and we do that by running the latest installer from the website. Once done, you will find the list of installed apps here, and in the Discover tab you will be able to locate E2 F5 TTS. Installing takes just a few clicks. Once finished, it will launch the tool in a Pinokio window. You can use it right there, or you can switch to running it in a web browser by clicking on this URL. By the way, if you want to update a tool within Pinokio,
then you need to stop it first, and then you can start the update. The user interface is pretty straightforward: you only need to load your input audio file, or as the tool calls it, the reference audio, then you enter the text you want to generate and hit Synthesize. You can always monitor the detailed progress of what the tool is doing in the terminal window of Pinokio. "In a world where you can clone anyone's voice with just a 15 seconds audio sample, anything is possible." In case you forget to save the generated audio files in the UI,
you can find all files spread across these subfolders in the cache folder of the tool. If you're not happy with the result, you have a few different options. You can just generate the audio again, and it will use a different seed, therefore sounding slightly different. You can't manipulate the seed, though, which makes comparisons a bit hard. You can also try using another input file, which sometimes can work wonders, or you can try to adjust the syntax of your text a bit: misspell words on purpose or rephrase some of the sentences. And even though
the transcribing feature works quite well, in some cases it can go wrong. You can check what the tool has transcribed for the reference text in the Pinokio terminal. Copy the text from the terminal and paste it into the field for the reference text. There you can edit it and correct any potential mistakes. E2 and F5 are two separate components combined in this tool, and in my opinion there's no need to use E2, since F5 is far superior: "In a world where you can clone anyone's voice with just a 15 seconds audio sample, anything is possible."
I do, however, highly suggest that you play around with the speed of the generated audio, as it can have an impact on the quality of the output, and to me it seems like some voices work better with a setting of 0.8 or 0.9: "In a world where you can clone anyone's voice." "In a world where you can clone anyone's voice." "In a world where you can clone anyone's voice." But you should not increase or decrease it too much, since at some point the output will become unusable and sometimes a bit scary: "where you can clone anyone's voice with just a 15 seconds and audios and sample anything is possible".
If you think that was weird, then watch until the end of the video to hear what the 0.3 version sounds like, but be warned. Another cool feature is the ability to create a conversation between two people. You can either write your own podcast script or have an AI assist you with it. What's important is that each paragraph has the name of the speaker in front of it. In the E2 F5 user interface, we enter the name of
the first speaker and then point it to the input audio file. Again, you should enter the reference text in case the transcribing doesn't work or if you want to save some time, since the audio will be transcribed again for each of the paragraphs. Do the same for speaker 2, and then you can paste your podcast script in here, decide whether you want to remove the silences or not, and then hit Generate. The output will be one audio file that has the entire podcast or conversation, and you will get to hear a portion of it in
a bit when we lip-sync it to video. But first we'll take a look at the multi-style function, which lets you generate speech in different emotions. Just like with the podcast function, you need to use a special syntax, with the emotion in brackets in front of the sentence. So just like with the normal mode, we upload our input audio for the regular emotion, and then we need to scroll down so we can add another speech type or emotion. Then we give it a name, and this is the same name that we will have to
use in the text to trigger the emotion. Repeat these steps for all speech types you have, and afterwards enter the text you want to generate. For me there was no noticeable difference with the silences removed, so maybe play around with that function yourself, and hit Generate.
"This is my regular way of speaking, neither too calm nor too excited, just normal. Sometimes, though, I am very relaxed, calm, and I will not get worked up by anything or anyone. Some other times, however, I am very angry, and I feel like I will not calm down anytime soon. It doesn't take long, though, until I am so happy again that I could scream for joy." Even though it's not perfect, it shows the variation of a voice that you can achieve with different input audio files. Now let's look at how we can take the audio files we generated and turn them into a video with lip sync. For that I prepared some short video clips of both speakers, along with the generated audio clips of about the same length. Then we need to install FaceFusion 3 through Pinokio, the same way that we installed E2 F5.
After we run the tool, we enable the lip syncer and disable the face swapper, since we won't be swapping any faces today. You can also enable the face enhancer, but keep in mind that the order in which you activate the processors matters. The face enhancer is optional, and using it means the entire process will take about twice as long, but if you don't use it, the mouth and the entire face will be quite blurry in the output, as can be seen in this comparison. Then we enable the GPU and disable the CPU. Next we select
the source, which is our text-to-speech audio clip, then we select the target video and click on Start. You can monitor the progress right here in the UI, or if you want a bit more detail, you can switch to the terminal window in Pinokio. Here's a portion of the finished podcast that I stitched together, along with some notes.
"Welcome back to the podcast. Today I'm joined by AI Voice Tutor. Tutor, it's great to have you here. Thanks, Lex. I'm excited, especially because we're talking about AI. So tell me, what's this about creating cat videos with spaghetti? It's all about open-source AI. People can now create images or videos, like a cat eating spaghetti in a fancy restaurant, with just a few inputs. That's incredible. So anyone can just create these things at home, no animation studio required. It's easy, and the results are amazing. Brilliant. AI-powered cat content. It's the future, Lex. Thanks for explaining all this. Now I've got ideas. Anytime, AI Voice Tutor. Looking forward to seeing those videos."
And while the lip sync isn't perfect yet, just like with everything AI, it's only going to get better
over time. And while the generated audio from F5 isn't perfect either, I think it's very impressive, and some of its issues can be fixed easily with a bit of audio editing afterwards. To me it felt like it was easier to generate male voices that sounded good, but I'm not sure if that's due to the tool or maybe I just didn't have good input audio. Support for other languages is theoretically possible, but the requirements to train a model for another language are very high, as you need about 10,000 hours of voice data and a
lot of GPU power. People are working on it, however, and I will make a video about it once there is support for more languages, so make sure to subscribe if you don't want to miss that. Before playing the weird audio file, I want to thank all of you and everyone who is contributing to the open-source AI scene. I hope you learned something new while watching, and I'll see you next time. "roll through the depths the [Music] CL anyone's voice with just fixes what ear B"
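
For the podcast mode covered earlier, where every paragraph needs the speaker's name in front of it, a small check like the following can catch formatting slips before you paste a script into the tool. This is only a sketch: it assumes the name is followed by a colon and that paragraphs are separated by blank lines, which the video does not spell out, and `split_podcast_script` is a hypothetical helper, not a function of E2 F5.

```python
import re


def split_podcast_script(script, speakers):
    """Split a script into (speaker, text) turns.

    Assumption: each paragraph looks like "<speaker name>: <text>".
    Raises ValueError if a paragraph has no known speaker in front.
    """
    names = "|".join(re.escape(name) for name in speakers)
    pattern = re.compile(r"^(" + names + r"):\s*(.+)$", re.DOTALL)

    turns = []
    # Paragraphs are assumed to be separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", script):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        match = pattern.match(paragraph)
        if match is None:
            raise ValueError(f"No known speaker in front of: {paragraph[:40]!r}")
        turns.append((match.group(1), match.group(2).strip()))
    return turns


example = (
    "Lex: Welcome back to the podcast.\n\n"
    "AI Voice Tutor: Thanks, Lex. I'm excited."
)
for speaker, text in split_podcast_script(example, ["Lex", "AI Voice Tutor"]):
    print(f"{speaker} says: {text}")
```

The speaker names passed in should be the same ones you typed into the user interface, since those are what the tool matches against.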