Okay, hello everyone, I'm Dima from Fireworks AI. I want to talk today about open models and how they also have a place next to GPT, which we just heard about from the previous speaker. I'm actually curious: how many folks here use something like GPT in production? Okay. And how many folks use open models like Llama or Mixtral? Okay, so it's actually pretty balanced. I guess my point is that it's going to stay balanced, or maybe the share of open models will even increase over the next several years.

First I wanted to talk a little bit about my background. Like a lot of the Fireworks founding team, I come from Meta and Google AI, so we've been doing this for pretty much the better part of a decade: productionizing open-source innovation and so on. I myself was previously at Meta for many years as one of the core maintainers of PyTorch, so this theme of open source, of driving deep-learning innovation through open-source frameworks, is really close to my heart. At Fireworks we push that same theme of open-source innovation: we help businesses and developers use open-weights models and build with them, whether it's a small app or a big production enterprise use case.
So first of all, why might you want to use smaller or open models? What's the tradeoff? Don't get me wrong: GPT is great, GPT-4 is amazing, I'm sure the next one will be even better, and frontier models in general are really great. Especially if you're prototyping a new capability and seeing what's possible, you should totally use those models. However, while they provide the best quality across all use cases, they don't necessarily provide the best tradeoff in terms of cost when you go to production. As in this chart, GPT is great at all the things, but most production use cases don't actually need to cover all possible cases: if you're building a new customer-support chatbot, it probably doesn't need to know all 150 Pokemon names or write you poems.

What does that mean concretely? There are several axes. One of them is latency: if you're using a really big model that's good at everything, it often hinders real-time processing, so user interactions become slow and product adoption might suffer. Latency has been improving, for example GPT-4 Turbo is much better than the original GPT-4, but still, in many cases you can pick a much smaller open-weights model for your use case and get multiple-x lower latency.
The flip side of that is operational cost. Even if you're using GPT-4 and you have a moderately popular application, maybe 10,000 users per day, you can easily rack up millions in cost per year. Whereas, again, by using a smaller model for your particular case, or even just off-the-shelf models from providers such as Fireworks, you can often get a multiple-x saving versus GPT-4, something like 10x or even 40x cheaper, which really adds up as your application goes from the prototyping stage to actual production and real usage.

And finally there's adaptability, which is something we also see with many customers. Again, GPT-4 is amazing across all possible use cases, and you can get pretty far with prompt engineering and trying to customize it that way, but many use cases are actually pretty narrow. Maybe it's a specialized chatbot, or some structured text extraction that operates in a domain with its own terminology and vocabulary, where narrowing the model down to a smaller subset is really useful. That's where being able to control the model, work around its quirks, and do some customization really pays off. If you go to Hugging Face, there are by now probably more than 100K open-source LLMs fine-tuned for all possible use cases, and we also see businesses doing this in production for many different kinds of applications.
So open source is great; you should all go try it. You can go to Hugging Face or Meta's website, grab Llama or Mixtral or whatever, and run it, even on your laptop. So what are the challenges of actually using it in production? It turns out there's still a gap between just getting the model and building the production application. First, the setup itself is complex: you need to figure out where to procure GPUs, and GPUs tend to break. You need to figure out the deployment setup, how to shard the model, how to optimize it for your particular case. What we see pretty often is that this requires looking into workload characteristics and trying to understand your application's needs: maybe you're building a real-time voice assistant and latency is really important to you, or maybe you're building batch processing and you really care about throughput. You need to run a lot of experimentation with quality, fine-tuning, and so on, and since new models get released almost every week, you kind of have to repeat this process over and over for each new model. And finally, when you hit production, as I said, GPUs tend to break; operating inference infrastructure at scale, especially on specialized accelerators, is not that easy. That's some of the value API providers do provide. So even if you're bought into open-source models and you want open weights for quality and customization, for the majority of cases it still makes sense to use hosted API providers that serve them, taking care of all the infrastructure details while keeping you in control of quality and outcomes for your use case.
all infrastructure uh and of tidbits while keeping you in control of quality and outcome for your case so that's uh PL fireworks has been doing for the past two years and our initial Focus was purely on a inference so what we done given our background in kind performance optimization and serving production needs meta and Google uh can come into this with L experience so we basically built our own Serv stack from the ground up which uh you know all the way from writing good kernels to optimizing kind of deployment and Shar and all of that
so all the kind of tricks which you might imagine for efficient serving and some and some more which you don't see see online all of that is incorporated so customizing for for different needs what we specifically focus on is custom and servy needs for your Cas uh basically Chan and tunan hardware and software set up for specific characteristics of the workl and which goals in terms of latency or cost you cost you might have so what what does it mean in practice right uh different like especially for LS but pretty much for all generic models
What does that mean in practice? For LLMs especially, but really for pretty much all generative models, needs can vary a lot. This chart is from Artificial Analysis, an independent benchmarking organization comparing different providers; if you haven't seen their website, it's pretty cool. Interestingly, they offer a selection of workload characteristics, like whether the benchmark should use a shorter or a longer prompt, and changing those characteristics changes the results dramatically, even for shared public API providers. For example, a lot of people have heard that Groq is super fast: if you ask it "hello, how are you", it's going to generate something like 500 tokens per second. However, if you give it a slightly longer prompt, say 1,000 or 10,000 tokens, which often happens with business use cases when they're trying to ground the model, the performance picture flips quite a bit, because the hardware characteristics it runs on are actually not very favorable for that setup. So depending on the characteristics of your workload, the length of your prompts, whether you have different chat sessions for the same user with repeated context, and so on,
there's going to be a really big difference in how you optimize the software and the setup. And we don't just do LLMs; we focus on all the different modalities. For image generation, for example, we're probably the fastest SDXL provider, multiple times faster than what you'd get by grabbing your own GPU from one of the GPU providers; we're actually the backend that Stability's API serves Stable Diffusion on.
As I said, even on the public platform there's a lot of performance you can gain, but where this performance work really comes in is customized deployments for specific use cases. If, for example, you're trying to productionize an LLM and figure out how much capacity you need to run it, you'll probably be looking at a lot of graphs like this one, which basically plot load versus latency: on fixed hardware, depending on the load, the requests per second my LLM gets, what is the latency of the response? Usually this graph climbs gradually and at some point turns nearly vertical as the server gets overloaded. And you often see pictures like this where, depending on whether you're optimizing for latency or for throughput, whether or not you're optimizing for serving as many users as possible under some reasonable constraint, you end up with very different setups. In this case, the red line has higher latency when the server is not loaded, but can serve a much higher load on the same hardware. What we often see is that use cases operate under an assumption like "I want to serve this many tokens under, say, a two-second response limit", and that's where a lot of customization actually lands; you often see multiple-x improvements there. So depending on what your latency target is, you can get much better throughput, meaning serving a higher number of users on the same hardware, which means lower cost for you.
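As a rough illustration of how you'd collect one of those curves yourself, here's a minimal load-testing sketch: fire batches of concurrent requests at an OpenAI-compatible chat endpoint and record the median latency at each concurrency level. The URL, model id, and API key are placeholders for illustration, not something prescribed in the talk.

```python
# Minimal sketch: measure median request latency at increasing concurrency
# levels to trace out a load-vs-latency curve like the one on the slide.
import asyncio
import statistics
import time

import httpx

URL = "https://api.fireworks.ai/inference/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}              # placeholder key
BODY = {
    "model": "accounts/fireworks/models/llama-v3-8b-instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 64,
}

async def one_request(client: httpx.AsyncClient) -> float:
    """Send one chat completion and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    await client.post(URL, json=BODY, headers=HEADERS, timeout=120)
    return time.perf_counter() - start

async def median_latency(concurrency: int) -> float:
    """Fire `concurrency` requests at once and return the median latency."""
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    return statistics.median(latencies)

for level in (1, 8, 32, 128):  # sweep the load axis
    print(level, asyncio.run(median_latency(level)))
```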
As a concrete example, I don't know whether you know Cursor; it's a pretty cool AI-first code editor, kind of like Copilot++, if you will. They're one of our customers, and they published a blog post about how they use us for their code-rewrite use case, where by post-training the model and customizing the serving setup they were able to get something like 12x better performance than out of the box, and 7x higher speed than what they could get elsewhere; they tried pretty much every provider, closed-source ones included. The post is actually pretty cool; they go into the details of how the feature was built.

So that's performance. In terms of coverage, we host pretty much all the popular open-weights models, and some of the proprietary ones as well through partnerships, so you can go to the platform and try experimenting with and productionizing those. And we're pretty quick about adding new models, for example Gemma, which was released maybe two weeks ago.
What this means in practice is that on the platform you can start with open-source models, hundreds of those. We also have some fine-tunes we've done ourselves, and I'm going to talk about one of those later on. But you can also go and customize the model. If you want to do it yourself, on your own infrastructure and hardware, you're welcome to fine-tune, whether it's full fine-tuning or parameter-efficient, and upload the result to our platform; or you can use our hosted fine-tuning service if you don't want to figure out how to spin up GPUs and run training. Specifically, if you're tuning parameter-efficiently, customizing with techniques like LoRA, an interesting property of the platform is that we can serve those adapters efficiently on the same GPUs, shared across many customers. What that means for you is that you can train, say, a thousand LoRA models, upload them to the platform, and still call the API paying per token, without having to pay for hosting each model, at pretty much no additional cost. The alternative, if you spin up GPUs on your own, is having a lot of GPUs sitting there not getting much traffic. We see many customers using this either for experimentation, where they have lots of variants, or for per-domain customization: for example, if they're themselves a platform and they have 100 customers, they might want to fine-tune a model variant for every customer, but still serve them efficiently on the same set of hardware without paying crazy costs.
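From the caller's side, serving many adapters this way is just a matter of addressing each one by its model id. Here's a minimal sketch, assuming you've already uploaded per-customer LoRA adapters; the account name and adapter ids are hypothetical.

```python
# Sketch: query several uploaded LoRA adapters through the same per-token API.
# The adapters share base-model GPUs server-side, so there's no per-adapter
# hosting cost; "my-account" and the adapter names below are made up.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_API_KEY",  # placeholder
)

for adapter in ("customer-a-support-lora", "customer-b-support-lora"):
    resp = client.chat.completions.create(
        model=f"accounts/my-account/models/{adapter}",  # hypothetical adapter ids
        messages=[{"role": "user", "content": "Where is my order #123?"}],
    )
    print(adapter, resp.choices[0].message.content)
```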
So inference is what we started with, but as you actually heard in the previous talk about GPTs, an LLM is just one building block in a broader AI application. There's only so much you can get from just calling an LLM. One obvious problem is hallucination: because models don't know what they don't know, and they really try to please, they come up with plausible but made-up explanations; that's why grounding is really important. Multimodality, and specialization in different modalities, is also important, and while there's been a lot of progress in natively multimodal models, chaining different specialized models together is still pretty common in practice, especially in more robust applications. And interfacing with external knowledge, whether it's databases or documents in a RAG setting, is another common pattern for building production end-to-end AI applications. The term a lot of people have started using is "compound AI", which basically means going from "the model is the application" to a broader setup where you might have a model plus some documents, maybe external APIs, maybe some retrieval system, some tools, and so on.
So we started from just raw inference, but we're actually building products around this compound-AI paradigm, and I want to talk a little bit about function calling, which we also heard about from a couple of angles in the previous talk. I wanted to go a bit under the hood: how it works, and how you can customize open-source models further for it. The paradigm, and why people are excited about function calling, is that you basically want the LLM in the driver's seat. It interfaces with the user, maybe in a customer chat or some other application, and it has access to specialized models, so you get routing capabilities across different modalities: for example, go generate some images, or transcribe an audio file the user uploads. But it also interfaces with product-specific tools and APIs, to take actions or retrieve data and so on. Function calling as a paradigm basically sits in the middle, translating the user request
into whatever actions actually need to be taken. As a concrete example, say I'm trying to build a hello-world function-calling chatbot that can answer questions about stock prices. When the user submits a request, say "what's the stock price of Nvidia", which I guess has changed since I last updated these slides, what actually happens under the hood is this: the model knows which functions it controls, described as an API schema, and when a user request comes in that might usefully be answered by one of those functions, that's what the model needs to figure out. It can emit a specially formatted output saying, "hey, I want to call this function". Something on the backend then goes and executes that function and gives the output back to the LLM, and the LLM decides whether it needs to call more functions, or whether it wants to summarize the result and give it to the user; in the simplest case it just formats the number and says the price is such-and-such.
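Here's a minimal sketch of that flow against an OpenAI-compatible endpoint; it's not the demo's actual code. The `get_stock_price` schema is a made-up example and the model id is an assumption; the key idea is that the model replies with a structured tool call rather than plain text.

```python
# Sketch: describe one callable function to the model and inspect the
# structured tool call it emits for a matching user request.
import json

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_API_KEY",  # placeholder
)

# API schema for our single hypothetical backend function.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest price for a stock ticker.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Ticker symbol, e.g. NVDA"},
            },
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v1",  # assumed model id
    messages=[{"role": "user", "content": "What is the stock price of Nvidia?"}],
    tools=tools,
)

call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# Expected shape: get_stock_price {'ticker': 'NVDA'}
```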
a lot of models which kind of can do function poing gp4 is really is really great of that on open source side it's actually been a little bit of a struggle with good high quality cases well maybe ma will release something in the future they haven't haven't done so yet uh one of the challenges that for example if you take llama it's really very exting it's going to be very perverse and helpful and answering but if you try to curves it to do function coordin through prompt it doesn't actually look very great there are a
lot of f Tunes over there which specialize function coord and TR a lot of data just do but they lose ability to do catness so for example if you ask it like jum just general knowledge question it's gonna like try to Pur it into some function po even though it's not really really not related uh and finally is balancing between those is pretty is pretty hard and finally G4 great uh for all but open as I mentioned earlier kind of lower latency and high speed are particular importance for function qu models because they often like
a router which sits in front of the in front of the more so what we what we've done and we also this we've trained a series of fun models which are ly Tunes but we took went to great lens to kind of balance both chattiness and helpfulness uh and make sure that uh make sure it kind of strikes the right balance there and while being based on open weight modules and pair this our INF engine it's also very fast so it's end up being much cheaper and faster than uh thanp so this is like a
Here's a demo, which is actually online and open source. It has a bunch of function calls available to it, a set curated for this demo: generate images, draw some charts, ask for a stock price. The model is also somewhat self-aware: the first question was "hey, what can you do", and it's self-aware enough to answer based on which functions are available to it. Then, for example, we can ask it, "hey, can you generate me a chart of several stock prices", but instead of naming companies I'm asking for the stock prices of the top three cloud providers. So what actually needs to happen is that the model has to figure out from its internal knowledge that the top three cloud providers are Amazon, Microsoft, and Google; then it needs to make three function calls, because the API provided to it can only fetch a single stock price at a time; then it has to take those results, aggregate them together, and call another API to draw the chart for you and give you the result. Which works. Finally, it's context-aware: if you ask it, "wait, can you add one more to it", it's going to use the previous results and generate another function call. This really showcases multi-step function calling, and if you've tried to get multiple function calls working well, you know it's pretty hard.
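Behind a demo like that sits some variant of the standard tool loop: call the model, execute whatever tool calls it emits, append the results, and repeat until it answers in plain text. Here's a minimal sketch of that loop, reusing `client` and `tools` from the earlier snippet; the dispatch function and its stubbed price are hypothetical.

```python
# Sketch of a multi-step function-calling loop. The model may emit several
# tool calls per turn (e.g. one get_stock_price call per ticker); we execute
# each one and feed the results back until it produces a plain-text answer.
import json

def run_tool(name: str, args: dict) -> dict:
    """Hypothetical dispatch into our own backend implementations."""
    if name == "get_stock_price":
        return {"ticker": args["ticker"], "price": 123.45}  # stubbed result
    raise ValueError(f"unknown tool: {name}")

messages = [{"role": "user", "content": "Chart the stock prices of the top 3 cloud providers."}]
while True:
    resp = client.chat.completions.create(
        model="accounts/fireworks/models/firefunction-v1",  # assumed model id
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:   # a plain-text answer means the model is done
        print(msg.content)
        break
    messages.append(msg)     # keep the assistant turn that requested the calls
    for call in msg.tool_calls:
        result = run_tool(call.function.name, json.loads(call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
```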
Just to round it out, the demo also has SDXL hooked up, so if you ask it to generate some image, it forgets about the stock prices, because the topic changed, and just generates you a data center, which of course is what you wanted to see. This demo is open source, so you can go look at how it works under the hood: just go to fireworks.ai. The model itself is also open source and hosted on our platform, so you can download it from Hugging Face, play with it at high speed in our playground, or try building applications with it.

Besides function calling, there are a lot of underlying techniques, which I didn't talk about today, that are useful for structured output generation; you can use them on our platform to customize things further, for example to build something tailored to your use case, maybe your own take on function calling, which we also help a lot of customers with. And just to wrap up: I've talked a lot about the platform and how you can get started. You can go to fireworks.ai to check all the models that are available, some of
them serverless, some of them on-demand with per-hour pricing. In terms of the offering, we provide the whole range. It starts with serverless inference, which is basically an OpenAI-compatible API where you pay per token and don't need to worry about hardware configuration at all, and you can still deploy your fine-tuned LoRA adapters serverlessly. Second, as you graduate to a more serious use case, people often transition to on-demand deployments, where we really help with a hardware setup customized for your case; there you can still use open-source models, or whatever customizations you've built on top, and pricing is per GPU-second, depending on the use case. And finally, for enterprise clients, there's a more personalized setup, and even custom feature development we can help with as a use case graduates to more scale and sophistication. We've only been around for a couple of years, but we already have a lot of production use cases, from startups to pretty big enterprises.
They use us at runtime to serve features in their apps: I talked about Cursor, but Uber is another one, and so on; more are being added to the site. So definitely don't be shy about using us for production-scale use cases. We also really care about developers, especially given our PyTorch background, and about supporting the open-source ecosystem. We work closely with many agentic libraries and framework providers, LangChain, LlamaIndex, and so on; actually, based on LangChain's report from last year, we were happy to find we're one of the top open-source model providers, which was really great. It also means you can start by just calling our API, which again is OpenAI-compatible, so with the OpenAI client you just change one URL somewhere; or you can use one of the frameworks, whether it's AutoGen or LangChain or LlamaIndex, and often it's a one-line change.
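That "one-line change" looks roughly like this with the OpenAI Python SDK; the model id below is just an example, and the only line that differs from a stock OpenAI setup is the `base_url`.

```python
# Sketch: point the standard OpenAI client at the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # the one-line change
    api_key="YOUR_API_KEY",  # placeholder
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3-8b-instruct",  # example model id
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(resp.choices[0].message.content)
```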
So that's it for a short overview of the platform. Thank you everyone, and I guess we have time for maybe a few questions, if there are any.

Q: Thanks, Dima, that was great. One question: how does DSPy generate candidate prompts? Does it use an LLM behind this?

A: Okay, so I'm not an expert on DSPy, but I think I can try to answer. If I remember correctly, they have both. There's the simpler mode, which is like a combinatorial optimization framework that tunes the prompt automatically for you: you provide a template, and it tries multiple patterns of how to select and format few-shot examples, puts them together, basically evaluates a bunch of them, and sees which gives the higher quality. And I think they also had some LLM-based prompt advising, but I'm not positive; I suspect that might be a question for a different talk, but happy to answer.

Q: We have a few more, so go ahead. Thanks. Did I understand correctly that for fine-tuning, users will actually upload their data?
A: So there are two options, and it's probably similar to the closed providers. We have a hosted fine-tuning service where you upload your data; or you can actually log feedback from the platform itself, so if you're calling an open-source model for inference, you can send us a signal of whether an interaction was good or not, and we'll log it and build a dataset from it automatically. Either way, you then use the hosted fine-tuning service, where you basically click a button and fine-tune a model on that dataset.
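For the upload path, the dataset would look something like chat-formatted JSONL; the exact schema here is an assumption for illustration, so check the platform docs rather than treating this as the required format.

```python
# Sketch: write a tiny chat-style JSONL training file for hosted fine-tuning.
# The {"messages": [...]} schema is assumed, not confirmed by the talk.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Where is my order #123?"},
        {"role": "assistant", "content": "Order #123 shipped yesterday and should arrive Friday."},
    ]},
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```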
But you're also welcome to fine-tune on your own, meaning you have GPUs somewhere, or you use some other provider, whatever; and whether you end up with a full model or a LoRA, you can just upload the model to us.

Q: Got it. And that's not going to be available to anybody else?

A: No, no, no. You can make it available if you wanted to, but no. Was there one more question?

Q: Yeah, how do you manage data security?

A: I mean, it's the regular setup: we make sure datasets have the usual multi-tenant separation, so your data goes to a separate bucket, isolated from other tenants, and so on, what you would expect from a regular SaaS provider.

Q: And in terms of security compliance?

A: Yeah, SOC 2 and HIPAA compliance, all of the regular stuff you would expect. And to reiterate, we're not going to use your data to fine-tune other models, and we're not going to give your data, or the models you tune, to other people. Unless you want to make your model public for everyone to use; we don't actually have that feature yet, but I'm curious to learn what your use cases are afterwards, maybe we can talk about it.

Moderator: We've got time for one more question. I think there's one here. Oh, here we go.
Q: Yeah, thank you. Let's say I have an app already, right now using LangChain and OpenAI, like GPT-4, and I have like 20 functions written. If I just swap in, I think it was FireFunction v2, will it just work and save me hundreds or thousands of dollars right now? What's the downside potential?

A: I mean, you should try it, right? Because, you know, there are lies, big lies, and benchmarks. In many of the cases we tested, if you have a reasonable set of functions, it does a pretty good job, so go try it. The API is OpenAI-compatible, so you should be able to swap it in directly; as with any model switch, you might need to do some prompt engineering in between, and so on. And maybe, if your reasoning chains are extremely complicated, which I usually don't see in function-calling use cases, the model will not be quite as capable as GPT-4. But if instead of GPT-4 you're really using 3.5, then yeah, it matches. Thank you.