I think, at the end of the day, what it all comes down to is communication. If there's some agent that's doing something, we need to communicate to it how we want it to behave. In some cases maybe a prompt is enough, but code is a very good communicator of what we want to happen as well.

All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Harrison Chase. Harrison is co-founder and CEO of
LangChain. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Harrison, welcome to the pod.

Thanks for having me. It's great to be here. I've been a listener for a bunch of episodes, so I'm honored to be here.

I'm excited to chat with you and looking forward to digging into all things LangChain, as well as touching on topics like RAG and agents and evaluation and more. I'd love to have you start us out with a little bit of your background and the LangChain story.

Sure. So my background is in ML and MLOps. I worked at a fintech company for a few years on the machine learning team there, did some time-series stuff and then some NLP stuff, specifically entity linking. Kensho was the name of the company. A great company, a very strong team to learn from, and I'm really grateful to a lot of the people there who helped me get started in the industry. I then worked at Robust Intelligence, which is an MLOps company
doing testing and validation of machine learning models, a smaller, earlier-stage startup. I was there for a few years, and at some point I knew I was going to leave. I wanted to go back to some place small or start my own thing, but didn't quite know what I wanted to do. This was in September of 2022, so I started going to a bunch of meetups and hackathons and talking to a bunch of people. This was right after Stable Diffusion had come out, but before ChatGPT had launched. A lot of people were building on the underlying OpenAI APIs, and so I got a chance to talk to a bunch of them, saw some patterns in terms of what they were building, and thought it would be really fun to try to abstract out some of those patterns and put that in a Python package. That became LangChain, and, you know, a year and a half later, almost two years later, here we are. Happy to chat more about what we do at LangChain now, but that's a bit
on my background and the start.

That's awesome. I pulled some stats from your website: as of the last website update, you're at something like 15 million monthly downloads, 100,000 apps powered by LangChain, 75,000 GitHub stars, 2,000 contributors. That is huge growth for such a short period of time. What do you think is the driver behind that?

The rise in popularity of LLMs, basically. Right time, right place. We launched LangChain almost exactly a month before ChatGPT, and it was kind of the first of its kind in terms of these orchestration, middleware frameworks. A lot of people want to build with LLMs, and we try to make that as easy as possible. Maybe the number that's most interesting is the number of contributors. You know, I think 2,000 is a lot, and it's probably a bit higher by now, but I think that's largely due to where we sit in the ecosystem. We really sit as the glue that connects all these components together. There are something like 80 different LLM providers that we integrate with, 100 different vector store providers, a hundred different document loaders. There are all these integration points, and that's really where we get a lot of the community contributions from. I'm extremely grateful to everyone in the community who added those, because there is such a long tail of integrations, and so I'm very thankful
for people who try to bring their tool of choice into the LangChain ecosystem.

Absolutely. So LangChain was the original product, that abstraction layer that you talked about, but the product family has since evolved to include LangGraph, LangGraph Cloud, and LangSmith. Can you talk about the portfolio as it sits today and what each of the products does for users?

Yeah, absolutely, and I'll maybe go chronologically. So we launched with LangChain, the open-source package. Again, that was a side project launched before there was a company; there was no strategy, no real thought behind it. As soon as we thought seriously about forming a company around the package, one of the things that we started working on was LangSmith, so we've been working on that basically from the beginning as well. The idea is we noticed that one of the biggest pain points for people when they were building applications, whether they were using LangChain or not, was basically bridging the gap from prototype to production. There are a lot of factors that go into that, obviously, but one of the big things was just understanding what exactly is going on in your app, testing to see how well it's doing, and then also making sure that you don't introduce regressions. When you do start to get some scale, monitoring that, and having some sort of annotations that get turned into data you can use for evaluation or fine-tuning or few-shotting or
something. That's really what we built LangSmith for. It works with or without LangChain. The two big parts are observability and then testing and evaluation, but we also have a prompt hub, a human annotation queue, monitoring charts, and things like that. We launched that initially a little over a year ago. It was in private beta for a while; we launched it in GA a few months ago, and we've seen pretty good adoption there. About 20% of our usage comes from non-LangChain users. I think that's a fun fact that I always like to track. So that's LangSmith, and I could talk for hours about LangSmith.

There are other things we're working on. Notably, we've recently been spending a bunch of time on LangGraph. The way that I think about LangGraph is that LangGraph is purely orchestration. I mentioned before that LangChain had a lot of integrations, and that's one of the big benefits of
using a framework like LangChain. Another benefit of using a framework, especially as the types of applications you're building become more and more complex, is the orchestration framework, and LangChain has an underlying orchestration framework. But LangGraph is basically a version that's much better suited to agentic applications, specifically ones that involve lots of looping, lots of branching based on what an LLM decides to do, and generally some form of persistence and memory. All of these are built into LangGraph. LangGraph is very low level, very controllable, and has a built-in persistence layer. We launched it as an open-source project at the start of the year and saw really good interest in it, especially among people who were trying to bridge that gap from prototype to production. There's a little bit of a learning curve; it's not the easiest thing to pull off the shelf and get started with. LangChain's easier for that. But if you really want to do serious things, I think
LangGraph is the place to turn to.

And then you mentioned LangGraph Cloud, something very new that's a few weeks old. That's basically a hosted runtime for LangGraph. You can kind of think of LangGraph as being a framework like Airflow, but for agents, and LangGraph Cloud is infrastructure for deploying that. The way I like to describe it is: if you think of the Assistants API that OpenAI launched, there's a lot in there besides just a model, right? There's persistence of all the chat messages. You can create different assistants, you can create threads, they have a concept of runs.

Exactly. And so the idea behind LangGraph Cloud is: okay, define your agent with LangGraph, that's the cognitive architecture, the logic of the agent, but then deploy it to LangGraph Cloud and get this infrastructure that is really handy. You know, OpenAI did a great job with a lot of the stuff there. So you get all of that, but for your specific agent and your specific cognitive architecture.
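As an aside, the assistants/threads/runs data model being described here can be sketched in a few lines of plain Python. This is a hypothetical in-memory illustration of the concepts, not the LangGraph Cloud or OpenAI Assistants API; all names are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable
import uuid

@dataclass
class Assistant:
    name: str
    # The "cognitive architecture": a function from message history to a reply.
    # A real system would put an LLM-driven agent here.
    logic: Callable[[list], str]

@dataclass
class Thread:
    # A thread persists the message history across runs.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    messages: list = field(default_factory=list)

def create_run(assistant: Assistant, thread: Thread, user_input: str) -> str:
    """One run: append the user message, invoke the agent, persist its reply."""
    thread.messages.append({"role": "user", "content": user_input})
    reply = assistant.logic(thread.messages)
    thread.messages.append({"role": "assistant", "content": reply})
    return reply

# A trivial stand-in for an LLM-backed agent: reports how many turns it has seen.
echo_agent = Assistant("echo", lambda msgs: f"seen {len(msgs)} messages")

thread = Thread()
create_run(echo_agent, thread, "hello")
create_run(echo_agent, thread, "hello again")
# The thread now holds all four messages: persistence across runs is
# the kind of plumbing a hosted runtime gives you for free.
```

The point of the hosted infrastructure is that the storage, threading, and run bookkeeping around `create_run` are handled for you, so you only supply the `logic`.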
I'd love to dig into agents and the way you think about that as an opportunity for users and developers, as well as an opportunity for LangChain the business. How would you describe the state of the world in terms of agents? What are folks doing today? How capable are the technologies? You know, I see both a lot of promise and a lot of frustration, like there's a huge gap in capability to make all the promise real. I'm wondering if you're seeing similar, or if there are some sweet spots in terms of applications where folks are really putting this agentic idea to use.

There are definitely some sweet spots. So maybe let me give what's in my mind a quick history of agents. Actually, before ChatGPT came out, there was a great paper by Shunyu Yao on ReAct, so reasoning and acting, and basically the core cognitive architecture for a pretty generic type of agent. That was in October, I
think, of 2022. We included that in LangChain in like November, and the whole space kind of took off when ChatGPT took off. But this idea of using an agent, and I can maybe talk about this, because I also think agent is not the best word, since I don't know if there's a concrete, agreed-upon definition and it means a bunch of different things to a bunch of different people. For the purposes of this, let's think about something simplistic that's running in a loop, and that's exactly what took off with AutoGPT. So AutoGPT, around March of 2023, was the fastest-growing GitHub project in history or something like that. What they did is they basically ran an LLM in a loop. They gave it a bunch of tools that were really powerful, like write to the file system and search the web, and they gave it really ambitious tasks as well, like, you know, grow my Twitter following or something like that. I think that idea of having this autonomous system that just did something for you was really interesting to people, and that's when I think interest in agents really started taking off. I'd say it probably hit a peak in summer of 2023, and then, as you mentioned, people started noticing a lot of the flaws of agents, mainly that they weren't really reliable. They didn't really do what you asked them to do.
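For reference, the "LLM in a loop" pattern being described here can be sketched without any framework. This is a minimal, framework-free illustration of the idea, with a scripted stub standing in for the model; in a real system `decide` would be an LLM call, and the tool names are invented for the example.

```python
def run_agent(decide, tools, task, max_steps=10):
    """decide(task, history) -> (tool_name, tool_input) or ("finish", answer)."""
    history = []
    for _ in range(max_steps):  # cap steps so a confused model can't loop forever
        action, arg = decide(task, history)
        if action == "finish":
            return arg
        observation = tools[action](arg)      # run the chosen tool
        history.append((action, arg, observation))  # feed the result back in
    return None  # gave up after max_steps

# Scripted stand-in for the model: search once, then finish with what it found.
def scripted_decide(task, history):
    if not history:
        return ("search", task)
    return ("finish", history[-1][2])

tools = {"search": lambda q: f"top result for {q!r}"}
answer = run_agent(scripted_decide, tools, "grow my Twitter following")
```

The reliability problems discussed next mostly live in `decide`: the loop itself is trivial, but an LLM choosing actions unboundedly is where things went wrong.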
They weren't capable of doing really complex tasks, and then, secondarily, they would take a while and were really slow. But I think the biggest thing is they just didn't really work. So I'd say largely between late summer of 2023 and the end of the year, there was a bit of a decrease in interest and an increase in skepticism about agents. Then, starting at the beginning of this year, in 2024, there have been more agentic systems being shipped. Ramp shipped an agentic system, Notion shipped something, Elastic shipped one, and we wrote a case study with them. Klarna shipped something agentic.

Can you describe one or two of those, just for context?

Yeah, absolutely. The one that I'm most intimately familiar with is the Elastic one, because we work pretty closely with them. That is basically an assistant, an assistant for some of their logs. So you can go in, you
can debug things, you can take actions, it can answer questions, it can do a little bit of RAG research, but it's not a simple chatbot. It can dive a little bit deeper into the logs and explain things. Another one that I'll note is Ramp. Ramp had a really cool one as well, where basically you'd go on their website and you could ask their agent, "Hey, how do I, I don't know, file this expense report?" and it wouldn't just chat back; it would actually show you the button to click. So it was in the browser, and I thought that was really cool because it wasn't really chat based. I think those are two good examples. Some common things between all of these are that they're relatively focused applications. You're not asking it to increase your Twitter following, right? It's much more narrow, like, hey, what's going on in this log, or something like that. And so they're
very focused. And then if you dig a little bit into the cognitive architecture of what's going on behind the scenes, for a lot of them, they're not really just running in a loop. They're a little bit more bounded. Maybe they have some check after the fact that verifies that what it does is correct. Maybe they have some explicit classification step at the start to determine whether it even needs to use an LLM or not. But there's basically this really custom thing that we're calling a cognitive architecture. We didn't come up with that term; I think around 1960, or somewhere around there, was the first time people started using it. But when we talk about it, we use it to describe how your system or your application thinks, and we're generally talking about systems that use LLMs to do this reasoning and then interact with external things. Anyway, the point is that for all these applications, they're not letting it run unbounded. They have lots of little checks and little classification steps that really keep it focused on a specific workflow. And I think the commonality a lot of these agents have is that they're very workflow based. I'd say two of the big industries where we see them being successful are customer support and data enrichment. So, talking about customer support,
you know, there can be a lot of edge cases, but there's generally one nice job to be done. There are existing channels of communication. There's generally some reference material that maybe 50% of the questions can be answered by, and then maybe there's customer-specific information that answers another 30%. And the great thing about customer support is that you have this built-in concept of escalation, very naturally, where you can just escalate to a human. So it's a great place, and I think that's where we saw a lot of agents first being used. Then a second category that I think has been coming online more and more recently is this idea of data enrichment. We did this internally: we launched LangGraph Cloud, we got a bunch of people signed up for LangGraph Cloud, and we wanted to learn a little bit more about them and their company. We don't want to ask them about it, though. That's
a lot to fill out in a form. So, you know, they say, "Hey, my name's Jim, I'm from company XYZ." We'll use a search engine, we use Tavily in our case, but there are a bunch of other ones out there, and we'll basically use an agent to search the web and fill in different pieces of information, like where the company is located, how big they are, whether they've raised funding, things like that. We see this being used a bunch in sales, enriching sales data, because that's a very valuable place to do it. But I think in general this concept of data enrichment is a second really good use case for these agents. And it's again very workflow based. If you think about how you or I would do this, it's a pretty reasonable workflow: I'd go to Google, I'd search something, maybe I'd look at the first result. Okay, maybe that has it. Great. Then I'll fill in the information.
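This search-check-retry workflow can be written out as plain code. The sketch below is illustrative only: `fake_search` and the `good_enough` check are stand-ins for a real search API and an LLM's judgment of whether a result actually answers the question, and the company name and field are invented.

```python
def enrich_field(field, queries, search, good_enough, max_searches=3):
    """Try each query in turn; return the first result that looks good enough."""
    for query in queries[:max_searches]:  # bounded retries keep the agent reliable
        for result in search(query):
            if good_enough(field, result):
                return result
    return None  # no good information found; leave the field blank

def fake_search(query):
    # Stubbed search: only the reworded query surfaces the answer,
    # forcing one retry, like the workflow described above.
    if "funding" in query:
        return ["raised $25M Series B"]
    return ["unrelated blog post"]

value = enrich_field(
    field="funding",
    queries=["Acme Corp", "Acme Corp funding round"],
    search=fake_search,
    good_enough=lambda field, result: "$" in result,  # crude stand-in check
)
```

In the agentic version, the LLM supplies the parts stubbed out here: judging whether a result is good enough, rewording the query, and deciding when to give up.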
Maybe it doesn't have it. Then I'll go back, I'll check the second result. Okay, maybe I need to redo the search and type in something else. There's this general workflow that we can describe, and I think that's the key for reliable agents: workflows that you can reasonably describe, whether through prompting or through the cognitive architecture of your whole system.

Now, just digging into that a little bit more deeply: we've had data enrichment for a while; entire large companies have been built
on scraping the web for this kind of data. From the perspective of this cognitive architecture that you've described, can you dig a little bit deeper into what makes these systems agentic? Is it an LLM making decisions about when to continue? You've kind of suggested that. But you've also said classification, and I'm imagining that once you start going down the path of having your LLM do classification, it's only a matter of time before you pull that out and do something cheaper. What are the core agentic aspects of this particular example?

Yeah. So I really liked what you said as well, using the word agentic rather than asking what an agent is in this case. I'm stealing this take from Andrew Ng, but he basically said it's really hard to define what an agent is, but there's a spectrum of agenticness, and rather than talking about whether something's an agent or not, we should be talking about how agentic it is.
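One way to make the spectrum concrete is to wire the same task three ways, with a model deciding progressively more of the control flow. The stubs below (simple lambdas in place of LLM calls) are purely illustrative.

```python
def fixed_chain(x, steps):
    # Not agentic: control flow is entirely fixed in code.
    for step in steps:
        x = step(x)
    return x

def routed(x, router, routes):
    # Slightly agentic: the model picks one branch, then code takes over.
    return routes[router(x)](x)

def looped(x, decide, tools, max_steps=5):
    # More agentic: the model decides which tool to call and when to stop.
    for _ in range(max_steps):
        choice = decide(x)
        if choice == "done":
            return x
        x = tools[choice](x)
    return x

double = lambda v: v * 2
inc = lambda v: v + 1

a = fixed_chain(3, [double, inc])  # always double, then increment
b = routed(3, lambda v: "small" if v < 10 else "big",
           {"small": double, "big": inc})
c = looped(3, lambda v: "double" if v < 10 else "done", {"double": double})
```

The question "how agentic is it?" then becomes: how much of the branching and looping is decided by the `router`/`decide` calls versus written into the code?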
Oh, and really, in what ways? I don't know that saying this is a seven agentic versus a ten agentic really does justice to all the different ways that a system could be agentic. And, for that matter, I've kind of come around on this whole line of thinking. My initial perspective on what an agent was, was the thing that we've all kind of wanted: this avatar, for lack of a better term, which is also super overloaded, but this thing that lives in
the web that does stuff for me. Hey, I want to take a trip to New York City; go figure it all out for me. And it would kind of autonomously take some actions on my behalf based on things it's seeing out on the internet. And, you know, as much as we get out of LLMs, we're still pretty far from that vision, I think. But when I see some examples of systems where people are using LLMs to go fetch a lot of information, frame out some decisions, interact with other downstream applications, reformat information, it seems to be a good fit for what LLMs are capable of, and, if you squint the right way, it feels agentic.

Yeah. I think there are a lot of components that make up what it means to be agentic. The main one that I look at is how much the LLM is deciding the control flow
of the application, basically. And with classification, yes, this is why I like the spectrum of agenticness rather than a concrete definition, because when you do classification, you're kind of determining what happens, what the control flow of the information is. But one classification step isn't that agentic. Maybe if you have ten different steps and some of them can loop back to each other, okay, that's a little bit more interesting. If you have tool calling, then it's deciding what tools to call right off the bat, and it's deciding the inputs to those as well. So one of the things I always try to understand when talking with people about the systems they're building is how much the LLM is deciding what's happening and what the control flow of the application is. And so, going back to your original question about these data enrichment agents and
what maybe makes them different, or why LLMs are good for it: I think there are a bunch of cases in data enrichment, and in things in general, where there's some level of ambiguity, or you hit some edge cases, or you hit some errors. LLMs make the ambiguity part much easier. You can just describe in natural language what fields you want it to enrich, and it can go do those right off the shelf. You don't have to have done that before to have some confidence that it will do it well. And then when it tries to search, if it doesn't find something the first time, it can decide that what it found isn't good enough, right? It can make that judgment, and then it can also decide: do I need to look for another link in the search results, or do I actually need to issue a new search? And after it issues three new searches, it can decide, hey, there's really no good information here, or it can say, hey, I actually need to issue a fourth, because I think I'm narrowing in on it. So those types of decisions help cover some of the long tail of things. And, this is slightly separate, but using LLMs makes it easier to do data enrichment on arbitrary fields. So, for those two reasons, I think tasks like data enrichment are good candidates for agentic approaches.

Yeah. I wonder how you'd respond
to this. Another argument that I sometimes make against agents, and I think this applies broadly to LLM-based applications, but agent stuff in particular, is that there's often this feeling that we're not really building for the LLMs as they are today. We're building for GPT-5, or the version after that. The point is to learn and grab some land and build some capability, and hope that the LLMs catch up. But in the case of agents, we've added tools into the core LLMs to a large degree, at least that's the way most developers think of them, and we're adding super long context windows. So I guess the argument is: okay, you say the agent doesn't really work right now, it's kind of flaky, because the LLMs aren't all that great with cognitive stuff and reasoning, and that's going to come in the next version of GPT-X, whatever. But the next version of GPT-X might not need the whole agent architecture to do the thing that you're trying to do. It might just be able to do it. Any reaction or thoughts to that, just to help me tune my mental model of this?

Yeah, a few quick thoughts. One, I think there will always be some aspect of agenticness that's needed for certain tasks. If I need to look up current information, I need to get that from
somewhere. And possibly this is handled under the hood by GPT-5, but that just means that under the hood, GPT-5 is an agent itself. If you need to connect to knowledge that's not in the training data, then you need to get that somehow: you need to do some calls, reason over the results, maybe do some more calls. Two, I do think there are cases today where people are using agents to just make two calls to an LLM with nothing in the middle, no calls to external databases, no calls to external search engines. And I think those types of things absolutely will disappear as the models get better and better. That, to me, is a prime case of something that should just be a single prompt as opposed to tool calls.

And is an example of this a research type of application, where you're asking an LLM to develop some idea and then pass it to an editor to give the original LLM instructions for improving some writing or something?

Yeah, I think that's a good example. Or just rewriting things in a different tone, so having one LLM generate an answer and then rewriting it in a different tone; that can probably be one prompt in the future. I think an even more extreme example of this is chain-of-thought prompting, right? That used to be a prompting technique that was needed to get any reasonable answers. Now that's just RLHF'd into the
models. And I think that will happen more and more. That wasn't really two calls, but it was a prompting technique that also became less relevant, and I think that absolutely will happen.

I think you're alluding to something I've seen that I think of very similarly, and that's the idea that a lot of what I've seen folks trying to use agentic systems for is patching over shortcomings of LLMs. Maybe this is very similar to what I said previously, but it's like: I have this task, I'm going to decompose it into these three subtasks and send those out to individual, quote-unquote, agents, but that's mostly because either the context window is too small or the LLMs aren't strong enough reasoners and they lose the point; they can't hold multiple things in context at one time. Is that kind of what you're getting at?

To some degree, yes, but I'll also argue against that. What I'll say is that I think, at the end of the day,
what it all comes down to is communication. If there's some agent that's doing something, we need to communicate to it how we want it to behave. In some cases maybe a prompt is enough, but code is a very good communicator of what we want to happen as well. So if there's some system that breaks things up, right now that absolutely is almost needed for these models to work. But going forward, maybe that's also just easier for us to do; maybe it's more deterministic and it's cheaper. Code is a form of communication as well. And as we think of these systems as agents, and we need to communicate to these systems how they should behave, is that really going to be all text, or will there also be some code in there? Because code is great at communicating exactly what should happen in specific scenarios. So I think the answer is it depends. For some situations, yeah, there's probably some kind of chaining right now where there's not a ton of sense in breaking it up; you can communicate well enough to the LLM what it should do in one prompt, and it's just not following it, and that's a shortcoming of the LLM. But in other situations, organizing in this modular, reusable way is an efficient form of communication, and that's around for the long term. So, a boring middle answer, but I think it depends.

I appreciate that. And by the
way, I also really liked what you said earlier about how we have to do this because of the shortcomings, but that's okay because it's great to learn how these systems work. I totally agree with that. There's so much learning that happens just from building with these systems today. There was a good podcast that the CEO of Klarna was on, and he was basically saying, you know, a lot of our internal tooling we built ourselves, and sure, maybe we could have bought some of it, the AI-specific internal tooling, but we learned so much from that. Even if they decide, okay, we don't really need this, we'll go buy it now, they accumulated all that knowledge, and that's going to be valuable in the future. So I really like that you said that, because I think there's so much truth to it.

Beyond the limitations of LLMs today, what are the key challenges that folks are running into when they're trying to deploy these agentic systems?
I mean, I think a lot of it still comes down to communication. Some of this is due to the LLMs not being good at understanding, but there are a bunch of situations where we see people building agents and asking, "Why isn't this agent working?" and they're just not providing it with the right context, whether that's through the prompt or through retrieval or through the tools it has access to. I think we're still figuring out as an industry how to best communicate with these LLMs. So there are definitely scenarios where they can be useful, but they're underperforming because the communication on our part isn't good enough. I'd say that's probably number one. After that, cost and latency start to become somewhat of an issue. Models have been getting really cheap recently, though. Well, it depends what scale you're operating at; I shouldn't say cost is less of an issue, but it does depend on scale. Latency is maybe the one we hear about more than cost, especially if you're doing chat. If you're doing multiple calls behind the scenes, it starts to add up. So I'd say the number one thing we talk to companies about is still performance, then probably latency, and then probably cost.

How are folks using observability tools like LangSmith to build better agents?

They're just getting insights about what's actually happening. It sounds really basic, but probably the number one feature used in LangSmith
is tracing. And tracing is just a record of what steps the agent took and the input and output at each step. Crucially, with agents, you don't know in advance what steps they'll take, so being able to see, for example, how many times it's calling a tool matters. And then the second part is: at those steps, what's the exact input and output? Oftentimes, when an agent messes up, it's because an LLM messes up, and an LLM has an input and an output. And often the LLM messes up because the input is just wrong in some way. Wrong can mean maybe I cut off some context, maybe I didn't include some context, maybe the context window got really long. So having this observability into what the steps are and what the inputs and outputs are at each step sounds super basic, but it's crucial for developing agents.

And when folks are using a tool like LangSmith, are they importing the library, registering it
in some way, and then getting this degree of transparency for free, so to speak? Or do they have to instrument their code pretty deeply in order to get the kind of visibility that's needed to practically operate and debug an agent? It depends a little bit. If you're using a framework like LangChain or LangGraph, or other ones out there, a lot of the observability solutions have hooks into that framework. Obviously we integrate very deeply with LangChain and LangGraph, but so do a bunch of other observability solutions, and they do the same with other frameworks as well. If you're doing that, you basically get it for free, and I think that's actually an understated benefit of using a framework, especially for these complex applications. That's a separate topic, but if you're just doing a single call to an LLM, sure, maybe you don't need a framework; when you start doing more complex things, there's a lot of benefit. If you're not using a framework, there are a few different ways this looks. A lot of vendors, ourselves included, have things that patch OpenAI's client directly, as well as other clients like Anthropic's or Google's, but OpenAI is by far the most common one. So you can either import LangSmith and it registers itself in the background, or you can import a wrapped OpenAI client from LangSmith and use it exactly as you'd use the regular OpenAI client; we just patch it with some logging. With that you still don't get the full trace, and the full trace is really valuable, but you do get all the LLM calls. If you want the full trace and you're not using a framework, then it's manually adding it in: we have a traceable concept, you add it as a decorator to your functions, and it automatically logs all the inputs and outputs, and if you call any functions inside that are also traceable, it logs those as well. So it's very OTel-ish. You made this distinction between using a framework and not using a framework, but I think it'd be interesting to dig into the level of abstraction of LangGraph relative to other frameworks. Is it an agent framework? I have the impression that it's lower level. We've got a GenAI meetup where we've explored a bunch of the agent frameworks, and there are very strong opinions within our meetup as to whether you should just build your agent from scratch versus use an agent framework, because we've
run into, for example, hidden prompts: you don't have control over some prompt that's deep in an agent library, and it's messing up your agent and its reasoning. Is it correct that LangGraph operates at a somewhat lower level of abstraction, and how do you think about the different types of frameworks out there? Yeah, LangGraph is extremely low level. And to be honest, we learned from a lot of the mistakes, or pain points, that people had with LangChain originally. In LangChain we had concepts of agents and concepts of chains, and they had these pre-built prompts and pre-built cognitive architectures under the hood determining what was actually happening. That made it great to get started, and it definitely contributed a bunch to the rise of LangChain early on, but it also made it tough to customize once you got past a certain point. We were working on LangGraph for six or seven months before we announced it. It started as an idea from Nuno, one of our lead engineers, which was basically: be really low-level, really controllable. The interface for it looks like NetworkX, the popular Python graph library, and there's another layer underneath it that's inspired by Pregel, a graph-processing paper from Google that has no official open-source implementation. Those are just generic graph interfaces, right? So it's an agent framework in the sense that some of the decisions and the way things are orchestrated are done in a way that's important for agents. We have first-class support for streaming, which is really important for agents but might not matter for traditional data orchestration. We have really good support for looping: a lot of data orchestration tools are DAGs, but you need loops for agents. Really good support for conditional edges, really good support for human-in-the-loop.
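The shape being described here, nodes as plain functions plus conditional edges that can loop, can be sketched in a few lines of Python. To be clear, this is a toy runner illustrating the concept, not LangGraph's actual API, and the `agent`/`tool` node names are hypothetical:

```python
# Toy graph runner -- an illustration of the idea (NOT LangGraph's real API):
# nodes are plain functions, and conditional edges let the graph loop,
# which is what separates agent orchestration from a DAG.

def run_graph(nodes, edges, state, entry, max_steps=20):
    """Run nodes until a conditional edge returns None (the end)."""
    current = entry
    for _ in range(max_steps):
        state = nodes[current](state)    # each node is an arbitrary function
        current = edges[current](state)  # conditional edge picks the next node
        if current is None:
            return state
    raise RuntimeError("graph did not terminate")

def agent(state):
    """Hypothetical 'agent' node: in a real app this would be an LLM call."""
    state["calls"] += 1
    return state

def tool(state):
    """Hypothetical tool node: in a real app this might hit a search API."""
    state["evidence"].append(f"result-{state['calls']}")
    return state

nodes = {"agent": agent, "tool": tool}
edges = {
    # Loop back to the tool until we have two pieces of evidence.
    "agent": lambda s: "tool" if len(s["evidence"]) < 2 else None,
    "tool": lambda s: "agent",
}

final = run_graph(nodes, edges, {"calls": 0, "evidence": []}, entry="agent")
print(final["evidence"])  # ['result-1', 'result-2']
```

The point of the loop edge is exactly the DAG-versus-agent distinction above: the agent node keeps deciding whether to go around again, which a pure DAG orchestrator can't express.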
All of these things that are important for agents are built in, but on the surface it looks very much like Airflow or Temporal: very, very low level. There are no hidden prompts, no enforced cognitive architectures; you build your own. And I do think it's the only thing really like it out there right now. There are other agent frameworks, and I don't know the extent to which they have built-in prompts, but I think they're all a bit more opinionated: they have a concept of what an agent is and how an agent should communicate with other agents. We did not want to be opinionated at all and wanted to make it really low level. And are there migration paths between LangChain and LangGraph, or what is the interoperability story between those two? Yeah. We had a concept of an agent in LangChain, and the single high-level abstraction we have in LangGraph is basically a replica of that, because we wanted to make it seamless for people to switch between them. So if you were using the LangChain agent executor, you can easily use an equivalent built on top of LangGraph, which means it's easier to extend and has better management of memory, things like that. You can use LangGraph with or without LangChain. LangGraph is a graph: basically nodes and edges, and the nodes and edges can be arbitrary Python functions, so you can put whatever you want in there. And at this point a big part of LangChain is all the integrations it has, and those are still very useful to use with LangGraph, but you don't have to if you don't want to. Okay, very cool. Another topic that's of a lot of interest beyond agents is RAG, particularly in enterprise use cases, but fairly broadly. How are you thinking about RAG? And, well, let's start with: what are
some of the key RAG use cases that you're seeing? At a high level, it's basically just bringing external knowledge to the LLM. Customer support is a great example: there are all these instructions for how to handle specific situations, and information about the product, that the LLM doesn't know about, so you need to bring that to it in some form. And Jason Liu on Twitter has basically been saying that RAG is just search, and yeah, that's basically what it is: you want to do some search over some documents and then insert the results as context. There are a bunch of applications where that's needed, where search is needed, and I don't know if you'd always call it RAG. Take the data-enrichment agent, for example: it needs to go look information up, and oftentimes it does that on the web. You're not doing any of the typical vectorization yourself because you're just using a search engine; under the hood the search engine probably isn't using vectors, but it's got a good index that you can query. So basically there are two separate components. There's the indexing of data: data that you own that you want to put in a format the LLM can easily query. And then there's the retrieval, which is actually retrieving it once it's been indexed, and that can be data you've indexed or data that someone else, like Google, has indexed. Most applications involve some form of this, whether they call it RAG or not. A year ago the hello world of LLM apps was the RAG chatbot, right? You could chat with your data: upload a PDF, do some indexing, put it in a vector store, then build a chatbot over it. More and more, we see agentic RAG becoming the popular thing, where an LLM uses that same retriever you used in the RAG chatbot, but as a tool. It decides when to call it. It decides whether it wants to call it twice, or three times, or just once, or not at all.
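That decide-then-retrieve loop can be sketched in plain Python. Everything here is an illustrative stand-in, not LangChain code: `fake_llm` hard-codes a decision a real model would make via a tool call, and the toy keyword retriever and document strings are made up:

```python
# Sketch of agentic RAG: the retriever from a RAG chatbot exposed as a tool
# the model can call zero, one, or several times. `fake_llm`, the document
# strings, and the hard-coded decision rule are all illustrative stand-ins.

def retrieve(query, docs):
    """Toy keyword retriever over an in-memory list of documents."""
    return [d for d in docs if query.lower() in d.lower()]

def fake_llm(question, context):
    # A real agent lets the model emit a tool call; here the "decision"
    # is hard-coded: retrieve once if we have no context, then answer.
    if not context:
        return {"action": "retrieve", "query": "refund"}
    return {"action": "answer",
            "text": f"Based on {len(context)} doc(s): refunds take 5 days."}

def agentic_rag(question, docs, max_turns=3):
    context = []
    for _ in range(max_turns):
        step = fake_llm(question, context)
        if step["action"] == "retrieve":   # the model chose to use the tool
            context.extend(retrieve(step["query"], docs))
        else:
            return step["text"]

docs = ["Refund policy: refunds take 5 days.", "Shipping is free over $50."]
answer = agentic_rag("How long do refunds take?", docs)
print(answer)
```

The difference from classic RAG is just where control lives: the retriever is the same, but the loop lets the model decide whether and how often to call it.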
That's basically because we see applications becoming more and more complex. But I say this to make the point that RAG is not just a RAG chatbot; RAG is also really useful as a tool for an agent, and from that point of view, most applications have some form of it. And do LangChain or LangGraph offer any particular degree of support for the retrieval part of RAG, or is it, again, more the metaphor of a tool, or data ingestion, or something along those lines? It's more in the form of tools and data ingestion. In LangChain we have integrations with a bunch of different vector stores and embedding models, as well as a bunch of different retrievers, whether non-vector-based retrieval strategies like BM25 or things where you don't even need to build the index, like a search engine. All of those integrations are in LangChain. So, okay, you've got your retriever; then the question of how you use it is really up to you. We have some guides for getting started, again with the simple RAG chatbot, and with using it as a tool, but from there the world is your oyster. But yes, we absolutely help you build indexes and use different retrieval strategies; those are all in the form of integrations within LangChain. Talk a little bit about LangSmith and how LangSmith is helping folks tackle
evaluation, and in particular this idea you referenced earlier of getting beyond the proof of concept. It's easy to do the hello-world RAG application, but tuning it so that it's actually good enough to put in front of users is a lot harder, and it's a stumbling block for a lot of the folks I talk to. My understanding is that LangSmith is the tool in your toolbox that's meant to address that, or is a piece of that puzzle. How are folks using it? Yeah. There are two big buzzwordy things people do with LangSmith: one is observability, the other is testing and evaluation. On your second question, what these have to do with getting your app from prototype to production: the thing we're seeing is that in order to do that, you really want to start building up a data flywheel of sorts. That's a little vague, so allow me to dive in a bit more. First you want to see how the application is doing on real-world queries, because maybe you launch it in a private beta or something like that, and you don't really know how people are going to use it. That's where observability comes into play. But pretty quickly after that, or maybe even before, if you just test it out yourself, you want to make sure it's doing well on what you want it to do well on. The point I want to emphasize here, though, is that what you want it to do well on isn't a static thing you know 100% ahead of time. It evolves as you ship to users, as you hit more edge cases, as people use it in ways you can't imagine. So you need this path from seeing how users are using it, to taking
those inputs and setting up some sort of system to make sure it's doing what it should be doing, which is testing, or evaluation, and then start spinning that loop. That's part of the data flywheel. The next step from there is: okay, I can see what users are doing, I've started building test cases to ensure it performs the way I want; how do I actually start making it better? If it's getting this one wrong, how do I get it to improve? That, I think, is still underexplored and under-investigated. It really gets at this idea of continual learning, which I'd say has been the holy grail of MLOps for a while. With LLMs, concretely, this looks like few-shot prompting or fine-tuning. You start collecting data points from users, and you start collecting feedback on where it's doing well or poorly. That feedback is another really big part of the data flywheel, because you can't look at every data point, so if you can get feedback from your users, that's fantastic. Then you feed that back into the application, either in the prompt in the form of few-shot examples, or via fine-tuning and updating the weights, or by manually tweaking the system yourself based on what you see. So that's the data-flywheel aspect, and
that's the whole process you really want to get turning to go from something that's a prototype to production. A key part of that is evaluation, and evaluation is really hard. There are two main parts to it: the data you're evaluating on, and the metrics you use to evaluate. What we've seen is that this is really custom for each application. Yes, there are academic datasets and some off-the-shelf metrics, but generally you need to think pretty carefully about this: you're definitely going to want to build up a dataset of your own, and you're probably going to want metrics that aren't off the shelf. We try to help with that as best we can. We try to make it as easy as possible to go from production logs to your datasets, and we'll be launching something shortly around generating synthetic examples based on the dataset you do have, to expand it. On the metrics side, we provide some off-the-shelf prompts and best practices, but the main thing we provide is a framework for thinking about it: seeing the results, comparing the results, tracking the results over time. I like to ask, for all of these things, what's the difference between this and software engineering? LLMs are great and new, but what's actually new here? If you compare this to traditional software engineering, two things jump to mind, maybe three. One
is that you generally don't expect it to pass all the tests. Maybe there are some test cases you always want it to pass, but generally you have a dataset and you're benchmarking: it scores 80% here, I make this change and it goes up to 81, or it goes down to 60 and oops, that's bad. So having that time-series view over time, rather than just "yes, it passes everything" or "no, it doesn't," is important. The second point, and this is really a second and third point combined, is that there are often still humans looking at the data, and that really matters, so you want to make it easy for people to do that: put it in a beautiful format, make it easy to look at. And the third thing is you want to make it easy to compare, because even for specific data points it's often hard to evaluate absolutely. "This summary is a seven out of ten": what does that mean? That's really hard. But it's easier to say this summary is better than the previous summary, both for humans and if you start automating it with LLM-as-a-judge-type approaches. That pairwise comparison is something fairly unique to LLM evaluation that doesn't really exist in traditional software evals. You could maybe argue it exists in performance testing or something like that, but I think it's a bit distinctive.
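A minimal sketch of that pairwise idea, with a toy word-overlap "judge" standing in for an actual LLM-as-a-judge call (all names and data here are hypothetical):

```python
# Pairwise comparison sketch: instead of asking for an absolute score
# ("rate this summary 1-10"), ask which of two candidates is better and
# aggregate win rates across a dataset. `judge` is a toy word-overlap
# stand-in for an LLM-as-a-judge call.

def judge(reference, a, b):
    """Return 'a' or 'b', whichever shares more words with the reference."""
    overlap = lambda text: len(set(text.split()) & set(reference.split()))
    return "a" if overlap(a) >= overlap(b) else "b"

def win_rate(dataset):
    """Fraction of (reference, a, b) examples where candidate a wins."""
    wins = sum(judge(ref, a, b) == "a" for ref, a, b in dataset)
    return wins / len(dataset)

data = [
    ("the cat sat on the mat", "a cat sat on a mat", "dogs are great"),
    ("rain is expected today", "expect rain today", "sunny all week"),
]
print(win_rate(data))  # 1.0
```

Swapping one system's outputs in as candidate `a` and another's as candidate `b` turns this into the kind of comparison view described above: a single win-rate number you can track across changes.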
And so those are some of the new things we're building within LangSmith to help with evaluation. Mhm. When you talk about the aspect of evaluation that is defining a dataset and coming up with some custom metric, and then comparing runs of that metric against versions of your dataset or versions of your process or model, I think of something like Weights & Biases, which lets you collect these metrics and graph them across different runs. How does LangSmith compare to something like that? Is it like Weights & Biases for textual evaluation? It's not just looking at a chart over time. A big part is diving into the specific data points, which are often very textual, and which often have traces and trajectories of what's going on under the hood; if you want to debug, you need to be able to see that. It's doing pairwise comparisons. These are all things that, you know, they're adding as well, but I think they're not part of the traditional Weights & Biases-type platform. And I've mainly used Weights & Biases for plotting my loss curves over time; I haven't done a ton of dataset management in there. I'm sure they have some things for it, but I don't know that it's the core focus of the platform. That has been a larger emphasis of ours: we think data is really important, it should be a central piece, and so how can we make it a delightful experience to create datasets from traces, generate synthetic ones, upload new ones, track versions of datasets over time? Again, I'm sure they have some version of that, but I think it's a little different from how I've used Weights & Biases in the past. So, reading between the lines, what I'm hearing is that some of the pieces and parts might be similar: you've got some repository for a dataset, you've got collectors that gather metrics, but a lot of the distinction sounds like user experience, and a vision for the UI a developer needs in order to build a better agentic system or LLM application. Yeah, 100%. Building LLM applications, especially if you're not fine-tuning the model but just using an API, is a different experience from training deep learning models. So there's a different UX, a different DevX, for
all of this. And, you know, I think we're all still trying to figure out what the best version of that is. I'd love to have you riff a little bit, crystal ball, on where you see all this going. I'm going to be wrong with everything I say, but I love giving hot takes, so I'll give it a try. Oh, we should have started with the hot takes. I don't know how burning hot some of these are; you can tell me. I'm generally bullish on agentic applications, but I don't think it's going to look like AutoGPT. I think it's going to be more like streamlined workflows that involve some LLM. I don't think the models are going to get so good so quickly that having some sort of orchestration framework like LangGraph isn't going to be necessary. I'm basing that on, one, even if the models do get really good, you still need to communicate with them, and communication is hard, and a framework can help you communicate. And two, I feel like progress has slowed on the frontier models recently. And now what, am I talking like Gary Marcus all of a sudden? I think it's great to build for the future, but you also have to build for what's here today, and here today you need to do a lot of cognitive architecture design. I'm also very bullish on few-shot prompting. I think that could lead to something really cool around continual learning, or personal learning; personalization is maybe a better word for it. The whole power of these LLMs is that they can learn on the fly, basically, so collecting feedback and then serving it back as few-shot examples, I'm extremely bullish on that, and we're doing a bunch of combined research and product development there. Multimodal is going to be interesting. I have no clue what to expect from models that are natively multimodal,
specifically around speech. I know a bunch of them do images as input, but speech is going to be really interesting. No clue what that will look like, but I'm looking forward to it. Is there a canonical use case in your mind for a speech multimodal model? Customer support is a great one; a lot of customer support is done over the phone. And you'd just be skipping the transcription step. Yeah, that's the way it's done today: speech-to-text, then text into an LLM, then text-to-speech on the other end. I've chatted with a few people in the field, and I think it's an open question how much to build on this existing stack versus waiting for, you know, OpenAI to release a natively multimodal model and building on that. So, no clue what that's going to look like. Well, you asked me to evaluate your takes. If you're saying that scaling laws are dead, I'd say that would be a hot take. I mean, I think the pace is slowing down. Do you think slowing down is a hot take? No, that's probably evidenced. Okay. All right. I also think, and this has come up in several conversations, and you've touched on it a couple of times, that it's probably under-discussed, and we haven't talked about it in enough detail for everyone
to appreciate it: this idea you're hinting at of fusing RAG and few-shot prompting into a process that can personalize LLM responses for individual users or individual contexts, based on retrieved information from a query. Taking a step back: with RAG you usually think, hey, I get this query, I need to answer this question, and the answer isn't in the web data the model was trained on, it's in my enterprise document database, so I pull those documents, or chunks of them, and give them to the LLM to generate a response directly. But there's this other idea you're suggesting, where I'm just using a regular prompt; I'm not trying to augment with external data so I can generate based on that data, but as part of my prompt I include examples, and if I can include better examples, I can make the LLM's responses better, not necessarily to get information from those examples into the output. I've been hearing quite a bit about folks starting to do that, but I haven't seen any tooling for it; it sounds like you haven't either, and you're starting to work on it. Thinking about analogies: with the internet, a lot of work went into how to do personalization, and a lot of tooling grew out of that, and I could see this being a really interesting area in the future. A hundred percent. Yeah, we commonly call this dynamic few-shot prompting. Basically, it's few-shot prompting
but you dynamically choose which examples, and there are different ways of choosing, right? You could choose randomly; that's not a great way to do it, but it is dynamic. And it might have its merits. Yeah, absolutely. There has actually been tooling for this in LangChain from nearly the beginning. Sam Whitmore, who has an awesome startup building Dot, added a bunch of stuff around memory, and, you could argue this is related to memory, she added this concept of dynamic few-shot prompting as well, I believe super early on. I think it's one of the more underutilized parts of LangChain, partly because there's still so much low-hanging fruit in prompting and the models keep getting better. But as we start to dial in on how to go from prototype to production, it becomes more relevant. And yeah, we're also working on tooling outside LangChain to facilitate this, because people just want it to be easy, and it's not easy enough if you have to manage all of that yourself, which I think is valid as well. So I'm very, very bullish on this concept. Awesome. Very cool. Well, Harrison, thanks so much for taking a few minutes out of your very busy schedule to share a bit about what you've been working on and how you see the world. Absolutely, thank you for having me. As I mentioned, I've listened to a few of these episodes, so honored to
be a part of it. Great to reconnect.