Hello everyone. In this video we're going to learn how to use OpenAI's Whisper model, which provides state-of-the-art speech to text. Speech to text can be used for a wide range of applications, like conversational apps, voice commands, transcribing different types of audio, and translation from different languages. This API is fairly straightforward to access, but there are a few tips and tricks you can learn to make you an expert at Whisper speech to text.

Let's start by going to the documentation. Go to platform.openai.com/docs, and on the left-hand sidebar go down to the endpoints section and click on Speech to text. OpenAI speech to text has two endpoints: transcriptions and translations. Transcriptions is for doing speech to text in the same language, so if you have German audio, it's going to transcribe it into German text. You access the transcriptions endpoint with client.audio.transcriptions.create, and then you provide the model, which is whisper-1, and the audio file, and it returns an object which contains the text. Translations is for when you want to transcribe a foreign language into English, so for example German audio would be transcribed into English text, and you access that endpoint with client.audio.translations.create.

Whisper was trained on 98 languages, and one of the great things about it is that it automatically detects the language, so you don't have to specify it. It will detect the language and do any translation if that's needed. You can add timestamps to the transcription by including the timestamp_granularities parameter. Whisper can handle a maximum file size of 25 megabytes, which is roughly 20 to 25 minutes of audio depending on the format and bitrate. If the file is larger than that, you can break it up into segments and process them separately with something like PyDub.

You can also add a prompt as an additional parameter, and this is to improve the quality of the output. One of the benefits is that if your audio has a lot of industry-specific jargon, adding some of that jargon to the prompt can help the model recognize it and account for it. If you split your audio up into different segments, it can sometimes help to use the text from the previous segment as the prompt for consistency. Adding an example of text in the prompt that's well formatted with punctuation can improve the output. The model naturally removes a lot of filler words like "um" and "uh", but you might actually want to keep those, so you can include a prompt with text that contains these words. Why would you want to keep them? Well, maybe you have a voice coaching app, or some other app that needs to analyze somebody's speech to grade it or provide advice on how to improve it. For some languages that can be written in very different styles, it also helps to include an example in the prompt. And finally, for improving readability, if you're not getting the results you want from the prompt parameter, you can also do post-processing with GPT-4, basically running the transcript through GPT-4 with instructions to give it a certain type of formatting.

Okay, so let's code this up into an app. I'll open up Visual Studio Code, where I've created a project folder called whisper tutorial, and I'm going to create a file, main.py. The way this works, if I go back to the documentation, is that first I have to create an audio file, and then I have to provide that file to the OpenAI client. I'm going to write some code so we can do all this with Python, but first I'm going to have to install some libraries. I'll create a requirements.txt file, and the libraries I'll need are openai, so we can access Whisper; python-dotenv, to help load the API key; sounddevice, to record the audio; scipy, to save the audio file; and another library called keyboard, just so I can control when it's recording. I'll save this and open a terminal. I can install the libraries all at once from the requirements file, but first I'll create and activate a virtual environment so that nothing I do here affects the rest of my system. Then I can install my libraries with pip install -r requirements.txt.

Once that's done, I can go back to the main.py file and build the audio recorder. I'll import some libraries, then I'll create a function called record_audio. This function waits for the user to press the Enter key, then it records the audio using the sounddevice library. I just chose standard numbers for the sample rate and number of channels parameters. The recording stops when the user presses Enter again, and then finally the audio file is saved using SciPy's write function. Now we can test this. I'll save it, open the terminal, and run python main.py. Press Enter to start recording: "Hello, this is a test, 1, 2, 3." Okay, and we can see up here that it created an output.wav file. I'll open this and test it: "Hello, this is a test, 1, 2, 3." Perfect, it works.

Now we can start to integrate Whisper and the speech to text. Go back to the documentation, and in the quick start, copy the code for transcription. Then go to your main.py file and create a new function called speech_to_text. Just paste the code inside the function for now. We can move this import, so I'll put it up here with the other libraries, and the same goes for the line where we create the OpenAI client object, which I'll move up here as well. We'll also need an OpenAI API key, and I'm going to store that in a .env file. Go back to platform.openai.com, log in, or sign up for an account if you don't have one already. Then go to the dashboard, and on the left-hand side click API keys, where you can create a new API key for this project. Once you have that, paste your secret API key into your .env file and save it.

Go back to your main.py. We're going to use the dotenv library to get the API key from the .env file, and I'll just add that as a parameter here: api_key equals os.getenv, and then the name of the key, which is OPENAI_API_KEY. Great, that should all be set up now, so let's go back down to the speech_to_text function. We're creating a file called output.wav, so we need to be able to add that. Okay, so what's happening here is we're recording audio with the record_audio function, and we're saving that audio to a file called output.
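The transcription step described in this walkthrough can be sketched roughly as follows. This is a minimal sketch, not the exact code shown on screen: the function signature, the optional prompt argument, and the OPENAI_API_KEY variable name are assumptions, and the client is passed in as an argument so the helper can be tried without a real API call.

```python
import os


def speech_to_text(client, audio_path="output.wav", prompt=None):
    """Send a recorded audio file to Whisper and return the transcribed text."""
    # Only pass the optional prompt when one was actually supplied.
    extra = {"prompt": prompt} if prompt else {}
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            **extra,
        )
    return transcription.text


def main():
    # Imported here so the helper above can be exercised without these packages.
    from dotenv import load_dotenv
    from openai import OpenAI

    load_dotenv()  # reads the .env file in the current directory
    # OPENAI_API_KEY is the assumed name of the variable stored in .env.
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    print(speech_to_text(client))
```

Calling main() after a recording has been saved should print the transcript. Passing the client in, instead of creating it inside the function, is just one way to keep the helper easy to try with a stand-in object.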