[ MUSIC ] Martin Cai: Hi, everyone. Welcome to this session. My name is Martin Cai.
I am a Product Manager on the Azure AI Platform team working on Phi small language models. On the stage with me is Gina Lee, on the same team with me, a Product Manager. We are your presenters today.
Small language models, or SLMs, are a revolutionary advancement in AI and natural language processing. Unlike their larger counterparts, large language models, SLMs are designed to be compact, efficient, and resource-friendly while maintaining high levels of performance and accuracy. SLMs require much less computational power and storage, which leads to much lower operational cost.
And this cost-effectiveness of SLMs makes them accessible to many small businesses or startups that may not have the resources to host large language models. SLMs can be integrated across various platforms and environments, from cloud-based systems to on-premises servers and edge devices. And this flexibility really allows organizations to deploy these SLMs in the manner that best suits their requirements and constraints.
And the speed and responsiveness of SLMs really translate into a much better user experience for applications powered by SLMs, for things like real-time user chats, agents, virtual assistants, or any other interactive platforms. SLMs are also much easier to manage and fine-tune compared to their larger counterparts. They can be tailored to meet very specific task types or to incorporate custom domain knowledge.
Now, the Phi model family has been at the forefront of AI development since its inception, from Phi-2 and Phi-2.5 to Phi-3 and now Phi-3.5.
Each iteration has brought improvements in computational efficiency, capabilities, and versatility, which makes these Phi models invaluable tools for a wide range of applications. Now, the key reason customers really like the Phi models is their size-to-quality ratio. They're really small, yet of such quality that Phi models are prime candidates for further customization, usable wherever they best suit a specific use case.
And I'll talk about the size-to-quality ratio in a bit. Now, the latest Phi-3.5-Mini, Vision, and MoE models were released earlier this year.
They were all instruction-tuned and fully safety-aligned with our responsible AI principles. These models were published under the MIT license, which is well known for its simplicity and permissiveness. And they are available in the Azure Model Catalog and on Hugging Face, and there are other variants of the models available that work with ONNX Runtime, NVIDIA NIM, and Ollama, which really enables these models to operate across a variety of hardware platforms.
Now, let's talk about a couple of Phi's use cases. Phi models are designed to perform various language processing tasks very efficiently. Deploying these models in an offline environment, whether on-device or on-premises, is particularly advantageous where you need local inference.
For example, IoT devices, autonomous vehicles, or places where there's just no internet access. And in many applications, fast response time is not just beneficial, it's critical. So Phi models offer a real advantage in those latency-bound scenarios where every millisecond matters.
And delays in response can impact user experience, efficiency, or sometimes even safety. In cost-constrained environments, Phi offers a very good solution for achieving high-quality language processing capabilities without incurring excessive expenses. So from individuals and startups to even larger institutions, deploying a small language model like Phi can enable a wide range of applications while staying within budget.
And lastly, Phi models are very well suited to fine-tuning for a particular task, making them really adapt to various specialized tasks and highly tailored to an application's unique needs. Now on this chart, we plotted the model size in billions of active parameters on the horizontal axis, and model quality using an MMLU aggregate metric on the vertical axis. We found that the Phi-3.5-MoE model can achieve a quality score better than other models of a similar size, and the Mini model stands very competitive even against models 2x its size. Now this means that Phi models can really deliver excellent results without the need for massive computational resources. Again, this proves that bigger isn't always better in the world of language models.
Now let's take a closer look at each of the Phi-3.5 models. The Mini model has 3.8 billion parameters, supports over 20 languages, and is capable of maintaining coherence and context with its 128K context window. This model excels in various tasks like reasoning, mathematics, code generation, and summarizing lengthy documents or even long meeting transcripts. It has been instruction-tuned and fully safety-aligned. The 3.5-Vision model is multimodal, with 4.2 billion parameters that can handle both text and vision inputs. It is suitable for tasks that require both visual and textual analysis. This model also supports a 128K context length and excels in complex reasoning, optical character recognition, and multi-frame summarization tasks, same as its Mini sibling, and this Vision model has been instruction-tuned and safety-aligned. The 3.5-MoE model is the only mixture-of-experts model in the Phi family. It has 16 experts with a total of 42 billion parameters. During token processing, only two experts are activated, so it requires only enough computation to handle 6.6 billion parameters, making the MoE model incredibly computationally efficient while outperforming other dense models of similar size.
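The routing arithmetic behind that efficiency is easy to sketch. Note that 2 of 16 uniform experts would be only 5.25B parameters of a 42B total, so the 6.6B active figure presumably also counts shared, non-expert layers; that inference is mine, not from the talk:

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Fraction of the model's weights doing work for any given token."""
    return active_params_b / total_params_b

# Phi-3.5-MoE figures quoted in the talk: 42B total, 6.6B active per token.
frac = active_fraction(6.6, 42.0)
# Dense-model baseline for comparison: every token touches 100% of weights.
```

So per token, the MoE model touches under 16% of its weights, which is where the compute savings over a similarly sized dense model come from.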
The MoE model supports 20 languages, also with 128K context support. It excels in real-world and academic benchmarks, surpassing several other leading models in various tasks like reasoning, mathematics, and code generation. Now that I've told you about the Phi models, let me share with you a couple of applications and success stories of Phi.
Bayer is a global enterprise with its core competencies in the fields of healthcare and agriculture. Many of Bayer's customers, farmers, face challenges because they don't have easy access to expert advice or knowledge. And without this expert advice, farmers often run into problems and risk lower crop yields and increased pest problems.
So to boost farm sustainability and productivity, Bayer worked with Microsoft to create a new lightweight model. It's called ELY (pronounced "Eli") Crop Protection. It's based on the Phi small language model, trained on U.S. crop protection datasets with an emphasis on reasoning-dense properties. Now, why Phi models?
It's because a small language model like Phi can unlock broader usage across different industries through reduced cost, reduced latency, and improved accuracy. The benchmarks show that the ELY Crop Protection model is already 1.5x better than the baseline language models.
And the solution has unlocked real benefits. It's already helping Bayer increase productivity for over 1,500 Bayer employees here in the U.S. And this ELY Crop Protection model is available to all Azure customers in the Azure AI model catalog. Next, we want to look at Capacity. Capacity helps teams do their best work.
Enterprise employees often struggle to find information by searching across untagged or siloed content, which leads to wasted time and frustration. Capacity addressed this challenge by leveraging the Phi SLM on Azure to experiment with how effective these language-model-based tagging solutions would be. They applied prompt engineering, adherence workflows, and scale testing to accomplish tasks such as title generation and tagging to better prepare answers to be searched.
So Phi enabled Capacity to rapidly build, deploy, and scale its advanced tagging solution, one that is both highly relevant to its customers' needs and cost-effective. Capacity customers can now save both time and money by quickly finding the relevant information they need. So the true power of SLMs lies in their ability to be customized for a specific task and domain.
There are three primary methods: prompt engineering, RAG, which stands for retrieval-augmented generation, and fine-tuning. Prompt engineering involves crafting a specific input prompt that guides the model towards generating a desired output. It is relatively simple and does not require any modification to the underlying model itself.
It's certainly good for quick experiments. However, the limitation is that it may not always produce the desired result consistently, especially for difficult and complex scenarios. RAG combines the model with an information retrieval system, which can be a repository of private data, documents, or even the public internet.
With real-time retrieval of relevant information, RAG can significantly improve the accuracy of the response. However, implementing RAG can sometimes be technically challenging, and it can be costly too, since running this real-time information retrieval is computationally expensive.
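As a toy illustration of the retrieval step (real systems use vector search over embeddings; this keyword-overlap ranker and the sample documents are made up here for clarity):

```python
def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (toy stand-in for vector search)."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

docs = [
    "Phi-3.5-Mini supports a 128K context window and over 20 languages.",
    "Bayer and Microsoft built the ELY Crop Protection model on Phi.",
    "Prompt engineering crafts inputs that steer a model's output.",
]
# Retrieved text is prepended to the prompt so the model can ground its answer.
query = "what context window does Phi-3.5-Mini support?"
context = retrieve(query, docs)[0]
prompt = f"Answer using this context: {context}\nQuestion: {query}"
```

The expensive parts in production are exactly what this sketch omits: embedding the corpus, maintaining the index, and running a retrieval on every request.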
Fine-tuning involves training a pre-existing model on a specific dataset to adjust its behavior. This process allows the model to really learn the particular nuances and patterns that are relevant to the targeted task or domain, essentially making it highly specialized. But fine-tuning also requires investment in computational resources for the training, as well as access to a high-quality dataset.
So while prompt engineering and RAG offer valuable benefits, I think fine-tuning still stands as the most robust and versatile solution for customizing a small language model like Phi. Now, there are two methods to fine-tune Phi models on Azure. Choosing between them really depends on what you need to do and the priority of your fine-tuning task.
You want to use the serverless API approach for quick and cost-effective fine-tuning tasks that don't require extensive resource management. Serverless compute is managed by Azure, meaning you don't need to worry about setting up or managing the infrastructure yourself, which allows you to focus solely on the model development work. And the setup is really quick and simple.
It is ideal for smaller fine-tuning jobs or those sporadic fine-tuning jobs. And you only pay for the resource that you use during fine-tuning. On the other hand, you want to use managed compute for fine-tuning if you need a stable and controlled environment, especially for those complex, long-running fine-tuning jobs.
Because managed compute gives you advanced control over the fine-tuning environment itself. And also, besides deploying the fine-tuned model in the Azure cloud, you can export the model. So this makes this method the optimal choice for fine-tuning an application or solution that will eventually run on the edge.
Okay, let's do a couple of demos to show you how we can easily fine-tune the Phi models. One more slide. So imagine I'm running a security firm. I'm creating an AI-powered workflow that does news headline analysis for me to gather insights and make informed decisions. So I'm using Phi-3.5-Mini as the core component of this workflow to perform sentiment analysis of news headlines.
Let's look at two prompts. So the first one says, "Contoso system breached. Client data potentially compromised." Let's see what the model says. So the model thinks this is a negative sentiment, which is correct. And it also gave an explanation of how it reached this conclusion.
So although the negative answer is correct, it doesn't quite meet my requirement. Because there are other downstream components that will consume the output from the model, and they don't need that wordy information. They just need a very concise answer.
So I need to teach the model to be concise. Second example: "Contoso recovers from a ransomware attack." Now, the model thinks this is a positive sentiment.
Why? Because Contoso recovered from a negative event, which makes sense in a way. And it also gave this very wordy explanation of how it came to this conclusion.
Now, besides teaching the model to be concise, I need the model to align with my business requirement. Because my cybersecurity firm considers all cyber attacks as negative news, negative sentiment. So I need to teach the model to think and to treat this type of news headline as negative.
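Behavior changes like these are taught through the training examples themselves. Here is a sketch of how one line of such a chat-completion dataset might be built; the `messages` schema shown is the common convention, and the exact format Azure expects should be checked against its docs:

```python
import json

SYSTEM = ("Classify the news headline's sentiment as a single word. "
          "Treat any cyber attack, including recoveries from one, as negative.")

def make_example(headline: str, label: str) -> str:
    """Serialize one prompt-response pair as a JSONL training line."""
    return json.dumps({"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": headline},
        {"role": "assistant", "content": label},   # concise target output
    ]})

line = make_example("Contoso recovers from a ransomware attack.", "negative")
```

Repeating this pattern across many headlines teaches both behaviors at once: the one-word style and the "all cyber attacks are negative" business rule.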
All right. Let me show you how I can just start a quick experiment, fine-tuning in the Azure AI Foundry. So this is the Azure AI model catalog.
I'm going to start with the Phi-3.5-Mini-Instruct. So on the model homepage, you have the option of deploying this model as-is or fine-tuning it.
Let's go fine-tune the model. And I will start with serverless API. This is a very simple UI that just collects information from you and allows you to customize this model fine-tuning experience.
So I'm just going to call it Ignite Demo. This is the name of the output model, not the original model itself.
So on this page, there are a few things you can choose. One is task type. So in this case, we're demoing a chat completion task type.
When it comes to training data, you have the option of either uploading a dataset from your local machine or using a dataset that's already in Azure, which I've prepared in advance. So I'm going to use this training dataset. Now, once the training dataset is loaded, the UI will show a preview of the dataset itself.
It's really for sanity checking. On the next page, the wizard asks you, "How do you want to validate the fine-tuned model?" So I have the option of splitting off a small portion of my training data to be used for validation or choosing a separate dataset, which I will do.
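Splitting off a validation slice, as the wizard offers, is also straightforward to do yourself ahead of time; here is a sketch (the 90/10 ratio is just a common default, not what the wizard enforces):

```python
import random

def split_dataset(lines: list[str], val_ratio: float = 0.1, seed: int = 42):
    """Shuffle JSONL lines and carve off a validation set."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = lines[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]

train, val = split_dataset([f"example-{i}" for i in range(100)])
```

Keeping the split disjoint matters: validation loss only tells you about generalization if the model never trained on those lines.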
Similarly, once the validation dataset is chosen, it shows a preview. Now, this is really the page where you get to pick those hyperparameters that you want to set for the fine-tuning job. It's a very simple view.
It allows you to customize the batch size, learning rate, and number of epochs. Without changing any of these, I'll just click "Next", and this is the last page where it's a summary of the input I've entered so far, and it's my last opportunity to change any of these before submitting the job. So I'll just submit the job and let it run.
In my training dataset, I have over 85,000 lines. So typically, a fine-tuning job of this size will run for one to two hours. So while this job is running, I just want to quickly show you: once the fine-tuning job finishes, you will get to this view, and there's a metrics view that shows you all the dashboards and graphs from this fine-tuning job, things like how the loss curve has converged.
So that was it. That was how I would fine-tune this Phi-3.5-Mini model using the serverless API.
Next, I'll hand over the stage to Gina, who is going to show you how she fine-tunes the model using a Jupyter notebook and with managed compute that gives you much better control of this fine-tuning environment itself. Gina? Gina Lee: Thank you, Martin.
So now let's switch gears and use this Jupyter notebook that I have prepared to walk through the same exact fine-tuning process, but this time using managed compute. This notebook shows you how to do a chat completion fine-tuning task by specifying your own parameters as well as managing your own compute. So to start off, we'll set up some prerequisites by downloading some necessary Python packages and setting up our fine-tuning environment, which includes the ML workspace, Azure subscription, and Azure resources.
Next, we will pick a foundation model to fine-tune. And in our case, we are going to be fine-tuning the same Phi-3.5-Mini-Instruct model.
And as a next step, you'll be creating a compute to be used for your fine-tuning job. The fine-tuning job works only with GPU compute, but there are different sizes and types of GPU compute that you can specify for your own fine-tuning needs. In my case, I used an A100 Tensor Core GPU.
And as you can see -- oops, sorry -- this script gives you an overview of the types of the different GPU computes you can use as well. And after you run through this script, you will get a little summary of the compute cluster that is running. So that's how you easily manage your own compute.
You can choose a size and type for your fine-tuning task's needs. The next step is to prepare the dataset for fine-tuning. Here, I have a snippet of the many different prompt-response pairs that I used to fine-tune my model.
And this is an example of that sentiment classification that Martin was just showing on the slides. So this is the prompt, composed of the news headline, with the corresponding sentiment classification value as the response. And we are going to see later, in Azure AI Foundry, how the fine-tuned model responds after being fine-tuned with this type of prompt-response pair.
The next step is to submit the fine-tuning job. And this is where the managed compute method for fine-tuning really shows its value, because this is where you can specify all of your hyperparameters in much more granular detail. There are two different types of parameters: training and optimization. Training parameters define the training aspects, things like learning rate, number of training steps, and batch size. Optimization parameters, in contrast, help in optimizing the GPU memory and effectively using the managed compute that we set up in the previous step.
And some examples include using DeepSpeed, using LoRA, mixed-precision training, or even multi-node training. And here you see I have 25 different hyperparameters that I have specified for my particular fine-tuning task. In comparison to the three default parameters that we had in the serverless API method, here you have 25.
And remember, you can have a lot more than this, or a lot less than this, depending on what your fine-tuning task will look like. So that's the power of using the managed compute fine-tuning method. And after specifying the parameters, this is where you'll be specifying your data set.
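To give a feel for the shape of such a configuration, here is an illustrative parameter dict; the names gesture at the knobs listed above (learning rate, LoRA, DeepSpeed, precision, multi-node) but are not the exact Azure ML component argument names:

```python
finetune_config = {
    # training parameters: define the training run itself
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    # optimization parameters: stretch the GPU memory you provisioned
    "apply_lora": True,        # train low-rank adapters, not all weights
    "lora_r": 16,
    "apply_deepspeed": True,   # shard optimizer state across devices
    "precision": "bf16",       # mixed-precision training
    "num_nodes": 1,            # >1 enables multi-node training
}

# Accumulation multiplies the effective batch size without more memory.
effective_batch = (finetune_config["per_device_train_batch_size"]
                   * finetune_config["gradient_accumulation_steps"])
```

The interplay shown in the last line is typical of why these knobs exist: a small per-device batch fits in GPU memory, while gradient accumulation recovers a larger effective batch for training stability.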
So here you see there are two lines for specifying datasets. The first one is the training data that I had shown you the snippet of, and the second one is the validation data. The validation dataset is optional, but it is really useful if you would like to evaluate the progress of your fine-tuning while the training process is running.
And after you have put in the information, you're finally ready to submit the fine-tuning job. And once you do, you will see this sort of response as the script runs. It is uploading the training dataset and the validation dataset.
And then you get this live web view link to the Azure ML Studio that shows you the live progress of your fine-tuning job running. And here is an instance of the fine-tuning job, the same exact model with the same exact parameters, that I ran right before I came up on the stage today. And this is the base model that we are training with, the Phi-3.5-Mini-Instruct model. This is the fine-tuning pipeline in progress. So if you click into it, you can see all of the detailed steps that are happening throughout the pipeline.
And when you go to the Job Overview, you can see this raw JSON file that gives you all of the details about the parameters and parameter values that you use for fine-tuning. And when you go to Metrics, you can see the same graphs that we saw for the serverless API method. The loss curve is at the bottom here.
And also the learning rate graph. And this is just a really good visualization for you to see how successfully your fine-tuning pipeline ran. Now let's go back to the notebook and register the model that we just fine-tuned.
So running this piece of script, we'll be able to register the model to the AML registry. And you can choose a fine-tuned model name of your liking. And then it will give you a summary of the model that has just been registered.
And then once you register the model, you deploy the fine-tuned model to an online endpoint so that you can interact with it live, either through your local device or through Azure AI Foundry. So here you give your online endpoint a name of your choice and your managed compute type. And then you should be ready to go.
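Once the endpoint is up, invoking it is essentially one authenticated POST. A sketch of such a client follows; the URL, key, and request body schema here are placeholders and assumptions, so check them against your own deployment's consume page:

```python
import json
import urllib.request

ENDPOINT_URL = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
API_KEY = "<endpoint-key>"  # placeholder

def build_request(prompt: str) -> urllib.request.Request:
    """Assemble the scoring request; the body schema is an assumed chat format."""
    body = json.dumps(
        {"input_data": {"input_string": [{"role": "user", "content": prompt}]}}
    ).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,  # presence of data makes this a POST
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )

req = build_request("Contoso system breached. Client data potentially compromised.")
# urllib.request.urlopen(req)  # only works against a live endpoint with a real key
```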
Here I have included a very short Python script for you to be able to run and interact with the endpoints live on your terminal. But because that's not pretty, we're going to look at the pretty version, which is the Azure AI Foundry. So here I have pre-loaded two prompts that Martin showed us on the slides.
Two prompts that were about sentiment classification of news headlines. And you can see the base model; here on the deployment page, it says the Phi-3.5-Mini base model.
And the responses are very detailed. So you can see that we wanted a concise and sort of more confident answer, but the base model didn't really give us what we wanted. So we're going to see what the fine-tuned model will give us after that prompt-response pair dataset training.
This is the fine-tuned model. Here's the deployment, the Phi-3.5-Mini fine-tuned model.
And you can see the answers are single-word, very concise, very confident. And you can see that for our second prompt, which contained the words "ransomware attack", it gave us a confident negative value, which is what we wanted. So from this example, we can see that fine-tuning can improve not only the style of the response that you want the model to give you, but also the direction of the response that the model gives you.
And as a last step of this end-to-end managed compute fine-tuning demo, I'd like to show you how to download a local copy of your fine-tuned model. And this is the real benefit of using a small language model like Phi because it is small enough for you to be able to download the full fine-tuned model instance and use it for local inferencing or local training. So when you go to your fine-tuned models page on the Machine Learning Studio and go to this artifacts tab, you can see that it has a layout of all the different files that you can individually download.
For example, there's the configuration file and the model file. And when you go to the data folder, you can also download a copy of all the tensors and the tokenizer JSON file. So the real benefit of fine-tuning a Phi model through this managed compute route is that you can not only manage your compute cluster size and type and specify as many hyperparameters as you want, but also download a local copy of your fine-tuned model to be able to do local inferencing.
So that completes my demo part, and I'll give it back to Martin. Martin Cai: Yeah, thank you, Gina. Right, so it's really wonderful to see that our customers can easily fine-tune the models in the cloud and then choose between deploying on Azure or deploying to edge devices.
Now, today we're very pleased to have Mr. Sameer Sharma, AVP and GM for HAI and IoT at MediaTek, come here to tell you about their story of using Phi SLMs. Sameer Sharma: Thank you, Martin, and thank you, Gina. [ APPLAUSE ] So folks, as Martin described, the real action happens when these SLMs get deployed on the edge, because the edge is where we live.
Edge is where the action is. And the biggest edge device that all of us use today is a smartphone in your pocket. So I'll show you how an SLM runs on the latest MediaTek chipset.
Before I get into the demo itself, I think MediaTek may or may not be a well-known name in this crowd. So let me give a quick overview so you have context for how we get SLMs onto millions of edge devices eventually. As a company, MediaTek is one of the top five semiconductor companies in the world.
About $14 billion in revenue last year, which was a down year for us. It was a tough year for technology in general. $3.6 billion in R&D investment. We ship about 2 billion SoC devices every year. And with more than 21,000 people and headquarters in Taiwan, we are a global company.
But all this work has resulted in the number one position by volume in smartphones, smart TVs, connectivity, broadband and networking, Chromebooks, Android tablets, Android SoCs, ARM tablets. We have a partnership with NVIDIA on automotive, and we are growing in IoT. All these are markets where our installed base means that, on average, most of you have five to ten devices powered by MediaTek in your homes or at your workplace.
What does that mean in terms of the AI capabilities we can bring to the table? Because MediaTek has the number one position; in this eye chart, the orange bar is MediaTek's market segment share compared to Apple, Qualcomm, Samsung, Unisoc, and Huawei. We have consistently held the number one position in flagship phones for four consecutive years.
And we are able to bring all that capability into end devices. Leading with smartphones and tablets, the latest addition to our portfolio is the Samsung S10 Ultra device. And so you can imagine these SLMs eventually making their way into devices like these and becoming more useful in the context of the device that you're actually using on a day-to-day basis.
But to do that, to make that happen, the underlying AI hardware is absolutely critical. So here I'm showing the AI capabilities of the latest-generation SoC that we just launched. And I'm doing a year-over-year comparison to our previous generation, the Dimensity 9300, also comparing it to the latest Apple Silicon and Samsung, who tend to be our good competitors in the market.
And you can see the AI benchmark capabilities on D9400 which is the demo I'm about to show you. What we have done year-over-year in terms of increasing the capabilities shows up in multiple ways in the user experience. So if you're running a Stable Diffusion model, it is going to be twice as fast on D9400 compared to D9300.
By the way, these two chips were just one year apart. So in a year, we've doubled the performance of video editing tools. Any kind of language model performance is going to be 80% faster, which means your prompts will feel much more responsive, both in terms of how they're processed and how the response comes back to you.
And of course, everybody wants more battery life on their phones; that goes without saying. So now let's get to the wonderful world of Phi-3.5-Mini.
When we worked with Martin and team, there were three things that I think Martin sort of touched on, but let me re-summarize those for you, that really attracted us to this SLM. The fact that for its size, it outperforms most of the other models out there. And not just that, it's actually outperforming many models that are twice the size.
In addition to that, when you look at the ability to fine-tune (and you saw the demos and how easy it was to load up the model and start fine-tuning it), we believe this is a good way to democratize access to SLMs. SLMs themselves are going to democratize access and interaction with edge devices, because they're going to be simpler to train, simpler to deploy. But then adding the capability for LoRA adaptation and extensive fine-tuning really takes it to the next level in terms of removing the barrier.
And the third thing is the MIT license terms, right? A vast majority of AI developers want to simplify their lives and they want to know whatever they're picking up is easy to use, will not result in liability, and can be modified and adapted to their use case. The licensing terms allow you to do that.
So on the left, you can see the Microsoft Azure workflow, where we are taking a standard model, the Phi-3.5-Mini model, and fine-tuning it for the precise use case. But we are able to then combine the training capabilities available in Azure with the inference capabilities on the MediaTek SoC.
So this LoRA tuning is then fused at the edge device level. And what that means is that you're able to get much better output, much better performance for the specific use case that you're trying to deploy. In case of a smartphone, it's going to be things like, how do we make a better summary of whatever we are speaking, right?
Or how do we interact for making the phone our personal assistant? But now be able to do that locally on the device for a more rapid response system. So let's look at the demo and then I'll come back to what are the unique things that we're creating in this demo in terms of the user experience.
Speaker 1: I'm curious about the human body. Do you know how many bones we have? Computer Voice: The adult human body has around 206 bones.
However, the exact number can vary slightly due to genetic and health factors. Speaker 1: Is zero a positive or negative number? Computer Voice: Zero is a whole non-negative number.
It is the absence of any quantity, yet it is considered a number in itself. It is neither positive nor negative. Positive and negative numbers are relative to zero.
Sameer Sharma: I like the philosophical bent with which it responded to the second question. The remarkable thing here is not the interaction and the text-to-speech and everything else you saw. The remarkable part is, as you can see on the phone, it's all happening in an offline mode.
There is no connectivity to the internet. The entire processing, including the response and interpretation, is happening on the device itself. And that is very powerful.
I think Martin early on gave some examples of use cases where you could be trying to use your phone on a long flight where Wi-Fi access is not available or spotty. You can allow complete interaction on your phone. But then you start looking at IoT use cases where you have lack of coverage or limited coverage or expensive coverage where you can use these capabilities in an offline mode.
And that makes for a very powerful business transformation tool. It takes it beyond experimentation at the speech level and into something you can integrate into your business processes. So just to highlight some things here: the performance we had was greater than 18 tokens per second, which is 50% better than the Apple A16 Bionic.
That'll give you about 12 tokens per second in terms of AI processing. And then, in addition to the fact that you have the benefits that I talked about, one of the big benefits is because the data all stays on your device, there are many use cases, for example, in healthcare or sensitive nuclear installations where keeping the data limited to the device is actually a big benefit and almost a requirement because of legal and regulatory reasons. So because of all these things that I just described, having this kind of a capability available on the edge devices, I think, from our perspective, has been the fruit of a great partnership with Microsoft.
Of course, the partnership between companies spans multiple areas. And I'm sure in future, you'll get to hear more about the other areas we're collaborating in. But thank you, Martin.
Thank you, Gina. It was a pleasure to be here. [ APPLAUSE ] Martin Cai: All right.
So we see that fine-tuning has become a really critical process to enhance the capabilities and adaptability of Phi SLMs to meet various requirements. There are general-purpose requirements, such as achieving more concise, efficient answers with specialized domain knowledge. Or you can optimize the model to interpret instructions and behave in a way that aligns with your business.
And there are also vertical use cases, such as conversion of natural language to code, adaptation to dialects or stylistic guidelines and formats, and also incorporating custom domain-specific knowledge into the model itself. All right, so this concludes our presentation on the Phi model family and customization through fine-tuning. I want to leave you with a few resources where you can find more information on Phi, including the Phi Cookbook, which I think is an excellent place to get started if this is your first time trying the Phi models.
There's lots of resources over there. And then also the link to the Jupyter Notebook sample that Gina had presented in her demo. And I want to thank everyone again for coming to this breakout session.
Now I think we do have about six minutes for Q&A. But in case we run out of time, I'll be at one of the Azure AI booths on the third floor. And if you want to chat or ask more questions, feel free to come and find me, and we'll chit-chat.
Thank you. [ APPLAUSE ] Speaker 1: How does the pricing work? So on your website, for the base model, one million tokens is $1.50. For the fine-tuned model, do you know the pricing for that? Martin Cai: Yeah, the question is, how does the pricing for the fine-tuned model work, right?
So essentially, when you fine-tune the model, you obviously need to pay for the resources used during fine-tuning, the GPU hours, whether you're using the serverless API or managed compute, where you have the GPUs at your own expense to use however you want. In terms of inferencing, there is an extra cost of, I believe, $0.80 per hour for a Phi model to host that fine-tuned model.
But the actual input-token and output-token rates are identical to the base model itself. So the only extra part you really pay for is the hosting. Over here.
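The pricing arithmetic here is easy to sanity-check. A minimal sketch, treating the figures quoted in the session ($1.50 per million tokens, $0.80 per hour for fine-tuned hosting) as placeholder rates; check the current Azure pricing page for real numbers:

```python
# Back-of-the-envelope cost comparison for a fine-tuned Phi deployment.
# Rates are the figures mentioned in the session; actual Azure pricing
# varies by model and region, so treat these as placeholders.

TOKEN_RATE_PER_MILLION = 1.50   # $ per 1M tokens (same for base and fine-tuned)
HOSTING_RATE_PER_HOUR = 0.80    # $ per hour to host a fine-tuned model

def monthly_cost(tokens_per_month: int, hours_hosted: float, fine_tuned: bool) -> float:
    """Estimate monthly inference cost in dollars."""
    token_cost = tokens_per_month / 1_000_000 * TOKEN_RATE_PER_MILLION
    hosting_cost = hours_hosted * HOSTING_RATE_PER_HOUR if fine_tuned else 0.0
    return token_cost + hosting_cost

# 50M tokens/month, fine-tuned model hosted around the clock (~730 hours)
base = monthly_cost(50_000_000, 730, fine_tuned=False)
tuned = monthly_cost(50_000_000, 730, fine_tuned=True)
print(f"base: ${base:.2f}, fine-tuned: ${tuned:.2f}")
# → base: $75.00, fine-tuned: $659.00
```

At high volumes the fixed hosting fee becomes a small fraction of total spend, since the per-token rates are identical.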
Speaker 2: So what's the roadmap for the small language models? Because if you add more parameters, they become large language models. Martin Cai: Yes.
Speaker 2: So what are the other features and capabilities you want to enable? Martin Cai: So we are actually working towards creating models even smaller than the current Mini-sized model. Right now, the Mini model stands at 3.8 billion parameters, and we are in active development on an even smaller model, which hopefully we'll be able to show within the next quarter or two, at least before the next event. We want that model to really run on edge devices, because even with 3.8 billion parameters, we've heard feedback from customers that the memory footprint is still too large to fit the model. And even if they can fit the model, there is other stuff running on the phone, so you have to share the memory, right?
So that is a challenge. So in terms of roadmap, smaller. We want to go smaller.
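The memory-footprint concern can be roughed out with simple arithmetic. A minimal sketch, counting weights only (real runtimes also need room for activations and the KV cache, so these are optimistic floors, not measured footprints):

```python
# Rough lower bound on the memory needed just to hold a model's weights,
# at different numeric precisions. Weights-only: activations and the KV
# cache add more on top, so real footprints are larger.

def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Gigabytes needed to store `params` weights at the given precision."""
    return params * bytes_per_param / (1024 ** 3)

PHI_MINI_PARAMS = 3.8e9  # Phi-3 Mini parameter count, as mentioned above

for label, bytes_pp in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label}: {weight_memory_gb(PHI_MINI_PARAMS, bytes_pp):.1f} GB")
# → fp16: 7.1 GB
# → int8: 3.5 GB
# → int4: 1.8 GB
```

Even aggressively quantized, a 3.8B-parameter model claims a meaningful share of a phone's RAM, which is exactly the pressure that motivates going smaller.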
Speaker 3: Can you bring your own model to Azure to fine-tune? Martin Cai: I think so. So there are a couple of ways.
So the Serverless API has a predefined set of supported models, so with Serverless API, I don't think you can bring your own custom model. But with Managed Compute, I think you can leverage the training pipelines, the templates that are already available for the other open-source models.
All you need to do is swap out the model that's in Azure with your own custom model and use the existing template to fine-tune it. Or you can go to the other end of the spectrum, which is to build your own fine-tuning environment inside a GPU VM, where you set up everything and do it all yourself.
So it really depends on how hands-on you want to be with setting up the infrastructure and managing the environment. You have the option of either doing everything yourself, fully DIY, or leveraging the Managed Compute option. Speaker 4: Which capability of the small language model do you cherish or care about the most?
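Whichever route you pick, SLM fine-tuning recipes commonly rely on parameter-efficient methods such as LoRA rather than updating every weight. A minimal NumPy sketch of the LoRA idea, with hypothetical layer dimensions that are not Phi's actual sizes and no claim to match the Azure training templates:

```python
import numpy as np

# LoRA (Low-Rank Adaptation) sketch: instead of updating a full weight
# matrix W (d_out x d_in), train a small low-rank correction B @ A and
# compute y = (W + B @ A) @ x. Dimensions below are illustrative only.

rng = np.random.default_rng(0)
d_in, d_out, rank = 3072, 3072, 16   # hypothetical layer size and LoRA rank

W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-init

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Base path plus low-rank adapter path (initially a no-op, since B is zero).
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
y = lora_forward(x)

full_params = W.size
lora_params = A.size + B.size
print(f"full: {full_params:,} vs LoRA: {lora_params:,} trainable params")
# → full: 9,437,184 vs LoRA: 98,304 trainable params
```

The point of the sketch: only A and B are trained, roughly 1% of the layer's parameters here, which is what makes fine-tuning an SLM cheap enough to run on a single GPU VM.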
You can list a top three or top five. Martin Cai: Capabilities? Well, the base model itself does a few things, like summarization, content extraction, instruction following, and chit-chat.
But really, through fine-tuning, you can teach the model to do anything you want in terms of language processing. So in one of the customer stories I showed, Bayer created a brand new model from Phi, which they called ELY Crop Protection, and it does so well on crop protection domain knowledge.
So that's sort of a Q&A knowledge-sharing scenario. Really, it depends on what you want to do. As long as you have the right data set and your recipe to fine-tune, the sky's the limit, I think, with these models.
Speaker 5: For the example that you showed, I think you had about 85,000 data points as you were fine-tuning. What, in your recommendation, is kind of the lower limit on that? Martin Cai: I want to say the more, the better.
Because ultimately, you spend the time and resources to train a model, to teach a model to do something, and you want the outcome to be as close to your desired performance as possible. So there is a bit of trial and error initially: you may want to try the data set, see if it gets the model where you want it, and then play with a few things.
And eventually, you do this whole end-to-end fine-tuning or training work. And really, with a small language model, it's not just fine-tuning, I want to say; you can actually do more continuous training and really deeply customize the model to do something very specific.
So I don't think I have a recommendation on a lower bound for the data. I always want to say, the more, the better. The higher the quality of the training data, the better your outcome will be.
Speaker 6: You said that the Phi model supports 20 languages. Would it make it smaller to drop the language count to one? For example, for specific applications.
And another question: will it ever be feasible to fine-tune a Phi model on a local desktop computer, or will we always require cloud services for that? Martin Cai: Got it. So the first question is, if I drop the 20 languages to maybe two or three languages, would that make the model smaller? No.
The number of languages supported does not change the parameter size, but focusing on fewer languages can make the ones you want to support even better. So we've had customers come to us and say, "Oh, I don't need all 20 languages, but can you make these three languages do really well on something?" So if you know exactly the language scope you want to support, and you have the right data set, go for it.
The second question is, would you be able to train the model on a desktop? Yes, though it depends on what hardware you want to use. If you're using an A100 GPU on a desktop, which can be somewhat expensive hardware, or commodity hardware like an RTX card, there are some additional software frameworks that you may need.
I know ONNX has the ONNX Runtime training framework that you can leverage and run on your desktop. And I do believe -- Speaker 7: Olive. Martin Cai: Yeah, Olive, that's the one.
And I think, I'm not sure, but TensorRT also has special support for Phi models. So those are a couple of things you can try. And then there are other training frameworks, like DeepSpeed, and PyTorch has its own thing.
Just play with it. I assume that if you want to train on your local machine, it's not going to be something huge, but something small, like experiments, right?
This kind of R&D, research-focused work, I think it's great that you want to do that. And you don't always have to go to the cloud, like I said. Yeah.
All right, I think we're out of time. Well, thank you all.