We used to think about building an AI system as: we collect the data, we build the model, we have an output. A lot of the more sophisticated companies are already thinking in a different way, right? AI, especially agents, will need to learn or be developed in a different way, where you have a hypothesis, you cover that hypothesis with your data, you model, you evaluate that hypothesis, and you make sure that your systems are updated. That's where synthetic data is actually helping to change things, and this is what we call acceleration through epistemic development, where synthetic data is the main tool to achieve that.
All right. So, today we have Fabiana Clemente, senior director and distinguished engineer at KPMG. So, Fabiana, welcome to the podcast.
Hi, Ben. Thank you. It's a pleasure to be here.
Our main topic today is synthetic data, and we'll try to focus on that, but obviously we may get derailed here and there. I think it's fair to say at this point most listeners have heard of this notion of synthetic data. Some have probably even tried to generate their own or used a tool, but obviously you're much more hands-on and much more active on a day-to-day basis when it comes to synthetic data.
So maybe we'll start, Fabiana, if you can describe what you would say are the top two to three use cases where synthetic data seems to work right now?
Yeah, that's a good start. And yes, it's true that a lot of users have already dealt with or heard of synthetic data before. That does not mean there aren't a lot of misconceptions, but we can delve into that a bit later on. In a nutshell, understanding that synthetic data is any data that is not collected from real-world events, we can think of a whole spectrum of use cases and applications. We can go from the low-hanging fruit of test data management, so data that allows you to test systems, all the way to more intelligent use cases where you need to help the development of AI agents. And in between, you can think of synthetic data as a privacy-preserving way for you to have access to data.
So it's a large and broad scope, and this scope is not served, by any means, by the same technology. Of course it will vary depending on your application, your use case, and what you want and expect to gain from synthetic data generation.
So when you talk about AI applications, right?
So, most people think of things like coding and programming, and maybe customer support, things like that. So, what would be the equivalent for synthetic data? What are the most cited examples?
If you were to give a talk and you're pressed to give examples where synthetic data is being used, what would be the top two most common reasons for using it?
Yeah, the three that I mentioned are the most common.
So one of them is: okay, I have a real data set, I want to share this with my offshore team, but I can't, because the data can't leave the country. I still want to keep some level of structure, but also correlations, so you go for synthetic data instead. Here you use synthetic replicas, which is a type of synthetic data. Or you are developing your own AI agents and you are looking into improving your training and your evals, and then you leverage synthetic data to construct the whole system and change the epistemics around your AI agents. So I would say those two are fundamentally different, but they are true applications of how synthetic data can help nowadays.
So you've been working on synthetic data for a while.
So what are one or two examples where synthetic data solved the problem and it actually surprised you?
Surprised me? I wouldn't say it surprised me, but it's definitely probably the best way to leverage it.
One of them I just mentioned. It was really to enable the offshore teams to have access to a data set that is similar and, in this case, develop analytics solutions on top, for example. And that one you usually don't think about: you think about how companies are restricted from sharing data with external entities, but sometimes you don't think about how an external entity can still be the same company, just in a different country.
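A minimal sketch of the "synthetic replica" idea described above, assuming a two-column numeric table and a plain bivariate Gaussian as the generator. The column names, the data, and the Gaussian choice are all hypothetical and chosen for illustration; production tools typically use richer generative models, and fitting a Gaussian does not by itself give any privacy guarantee. The point is only that the replica contains no original rows while keeping the correlation between columns.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def fit_gaussian(xs, ys):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return mx, my, sx, sy, pearson(xs, ys)

def sample_replica(params, n, rng):
    """Draw n synthetic rows from the fitted bivariate Gaussian."""
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 and z2 so that corr(x, y) equals rho.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows

# Hypothetical "real" table: income and spend, positively correlated.
rng = random.Random(0)
real_income = [rng.gauss(50_000, 10_000) for _ in range(5_000)]
real_spend = [0.3 * inc + rng.gauss(0, 2_000) for inc in real_income]

params = fit_gaussian(real_income, real_spend)
replica = sample_replica(params, 5_000, rng)
syn_income, syn_spend = zip(*replica)

real_corr = pearson(real_income, real_spend)
syn_corr = pearson(syn_income, syn_spend)
print(round(real_corr, 2), round(syn_corr, 2))
```

Only the fitted parameters are used to produce the replica, which is why an analytics team could work on it without ever seeing the original rows.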
On the other hand, I would say that I have also seen cases where synthetic data did help a lot in improving the results of fraud detection, which to an extent is not an obvious path for improving your results when it comes to fraud detection.
So for teams that don't have a lot of experience with synthetic data, what are, let's say, the one to two most common mistakes?
Oh, that's a good one. Yeah. I would say that the biggest mistake I have seen is perhaps oversimplifying the complexity of synthetic data.
And I'm not saying synthetic data complexity in a bad way. But as with anything that leverages data, you need planning. You need to think about what you want to get as an outcome.
So even if you are just building a test data set to test a software application, you need to plan which use cases you really want to cover with the synthetic data. And usually people have this expectation that synthetic data is just a click on a button and it'll do exactly everything I want. It's simple and it's just dummy data.
So it's very easy to do. That, I'll say, is one of the biggest mistakes I have observed. And the second one, I would say, is not understanding that there are different methodologies and different types of synthetic data that you can leverage, and not being able to select the correct one for your objectives.
And these two are fundamental. They are not technical, if you ask me; they are really around requirements and understanding the technology that you want to leverage.
So is it fair to say that, historically, a few years ago, and this is my impression at least, I guess because it was before ChatGPT, synthetic data tended to be around computer vision, images, those kinds of things? So these days, as far as data modality, is it basically across the board, with everyone trying to do synthetic data? I mean, even people in robotics are doing synthetic data at this point. But what's the dominant type of data that people are...
I would say that, first, the first data type that leveraged synthetic data was actually structured data, way before text or images, if you think about it. So it may be surprising, but we have probably been doing that for more than 50 years regardless, right?
And I think that images did evolve quite interestingly in the last 10 years, probably, I would say, as well as text. And I would say that nowadays, if you think about it, text is probably the type of synthetic data that is dominating the market. That doesn't mean the space of synthetic data for text is well defined or well structured, because today, if you are oversimplifying, the outcome of an LLM can be considered synthetic data. But that does not mean it's well structured or is actually being correctly used and leveraged for what they are doing. But I would say that definitely text is dominating nowadays.
So, without synthetic data, right? Normally what you would do is: okay, I want to build a model. Here's some historical data or, in the case of finance, here are historical trades and financial data.
And then I'll build the model, and then I'll test the model out, and then deploy to production. But obviously things can go wrong even in the scenario I painted, right? So you can have drift, right?
So the real world changes, and what you built your model on is no longer the same. Or the sample you created your model from may have ended up biased, and so on and so forth. Obviously the same problems will occur with synthetic data, right? So what are some of the common technical problems?
I guess, is the question about synthetic data on that side? I wouldn't say it's a technical problem of synthetic data. It's a technical problem of data in general.
What you just described is definitely a fundamental problem of how the processes around building data solutions are defined.
But it could be the case, Fabiana, that your data is perfectly fine but your synthetic data tool was bad, and so then the synthetic data generated was bad, right? So what are...
I wouldn't say so, and again, that goes exactly to my initial point. You can also end up with good data and end up with a crappy model, and that's a you problem, and that's a problem of you not understanding how models behave.
No, but surely, just like models and model building tools, there are synthetic generation tools that are better than others.
So I guess, what should people look for in terms of what tools they're using?
It depends a lot on the use case, on the end application. Right.
Yeah. That's a reasonable answer, and it's an answer that nobody likes to hear, but for me that's the true answer, right? It depends, and you need to be aware of what you want in order to search for the right parameters and functionalities that you are looking for.
But basically synthetic data becomes a part of the workflow just as if it were real data, right? So whatever you would do in order to harden the model or analytics that you're building with real data, you would apply the same hardening steps if you're using synthetic data.
100%. And I think it's very important that you have what I would call a governance process around what you consider a synthetic data set that is ready for you to leverage. There are evaluation metrics that you should put in place. Those evaluation metrics will depend on the type of data that you are leveraging, but also on the use case that you are building, and those processes are really important.
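As one illustration of such evaluation metrics, the sketch below checks a synthetic table against the real one on three simple fidelity measures (mean shift, spread ratio, and the largest gap in pairwise correlation) and then applies a pass/fail gate. The metric choice, the thresholds, and the table contents are assumptions made for this example, not a standard; as noted above, the right metrics depend on the data type and the use case.

```python
import itertools
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )

def fidelity_report(real, synth):
    """Compare two tables given as {column_name: list_of_floats}."""
    report = {"columns": {}, "max_corr_gap": 0.0}
    for col in real:
        r_std = statistics.pstdev(real[col])
        mean_diff = statistics.fmean(real[col]) - statistics.fmean(synth[col])
        report["columns"][col] = {
            # Mean shift expressed in units of the real standard deviation.
            "mean_gap": abs(mean_diff) / r_std,
            "std_ratio": statistics.pstdev(synth[col]) / r_std,
        }
    for a, b in itertools.combinations(real, 2):
        gap = abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b]))
        report["max_corr_gap"] = max(report["max_corr_gap"], gap)
    return report

def is_ready(report, max_mean_gap=0.1, std_band=(0.8, 1.25), max_corr_gap=0.1):
    """Hypothetical governance gate: every metric must be inside its threshold."""
    cols_ok = all(
        c["mean_gap"] <= max_mean_gap and std_band[0] <= c["std_ratio"] <= std_band[1]
        for c in report["columns"].values()
    )
    return cols_ok and report["max_corr_gap"] <= max_corr_gap

rng = random.Random(1)

def make_table(n, keep_correlation):
    """Illustrative data: age and account balance."""
    age = [rng.gauss(40, 10) for _ in range(n)]
    if keep_correlation:
        balance = [100 * a + rng.gauss(0, 500) for a in age]
    else:
        # Same marginal mean and spread, but drawn independently of age.
        balance = [rng.gauss(4_000, 1_118) for _ in range(n)]
    return {"age": age, "balance": balance}

real = make_table(5_000, keep_correlation=True)
good_synth = make_table(5_000, keep_correlation=True)
bad_synth = make_table(5_000, keep_correlation=False)

print(is_ready(fidelity_report(real, good_synth)))
print(is_ready(fidelity_report(real, bad_synth)))
```

The second table matches every column on its own but has lost the relationship between columns, which is the kind of mistake a per-column check alone would not catch.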
You should make sure that the people who are leveraging synthetic data are also well trained on it because, as you said, yes, as with training a model, synthetic data can lead to potential mistakes that you don't want to propagate. Those mistakes usually stem exactly from the lack of governance processes on how to generate synthetic data, when to generate it, from where, and for what. And having those metrics and that assurance, I think, is essential for companies to adopt a synthetic data generation method on a daily basis.
So, with the rise of foundation models and generative AI, a few of the trends there are things like agents, multimodality, reasoning, right? So let's take them one at a time. So agents: obviously agents is a broad topic, but at the simplest level you have an agent that does one thing well, but even that one thing may involve multiple steps, could involve tool calling and things like this, right? So are people starting to use synthetic data as part of their agent building process?
I wouldn't generalize to everyone across the industry, but I would say that we have evidence that some companies are definitely adopting it, right? Facebook, and when I say Facebook I mean Meta, actually, OpenAI...
So it sounds like really advanced companies.
Yes, exactly. And I was about to say even xAI. They are all leveraging synthetic data, and all of them are betting on leveraging synthetic data to enable a different, structured exploration of the knowledge spaces, right? Exactly what you said: an AI agent, or a multi-agent system, will require reasoning, a multi-step kind of framework, and usually your knowledge base is not structured in that way, right?
Or it's less structured, if you go and check. So synthetic data is actually one of the pieces that is helping to have those knowledge spaces well structured, in a way that they can optimize the outcome from agents, for example, or even to change how models actually acquire understanding. So in the traditional way, we used to think about building an AI system as: we collect the data, we build the model, we have an output.
A lot of the more sophisticated companies are already thinking in a different way, right? AI, especially agents, will need to learn or be developed in a different way, where you have a hypothesis, you cover that hypothesis with your data, you model, you evaluate that hypothesis, and you make sure that your systems are updated. That's where synthetic data is actually helping to change things, and this is what we call acceleration through epistemic development, where synthetic data is the main tool to achieve that. But this is how we understand, in a general way, that sophisticated companies are using it. I wouldn't dare to say that everyone in the industry is using it that way.
Yeah. Yeah. Yeah.
So one of the more interesting things in this area, I think, is this emerging body of practice around agent optimization. The key insight there is that you can boost your agent a lot by just rewiring the agent graph without upgrading your model. Right?
Mhm.
And so now you've got a bunch of open-source projects, ranging from TextGrad to OpenEvolve to GEPA, all designed to do a lot of these things. And I would imagine, even as you're optimizing your agent, you're going to want to run this agent through a bunch of scenarios that don't exist in your data set, and that could even involve edge cases. And now that these agents are actually, as we discussed, doing a bunch of things using a bunch of tools, that space is kind of broad, and I doubt that you would have that historical data handy anyway.
So you would need to have tools that allow you to know with confidence that you've optimized this agent properly and that it's ready to be rolled out, at least in a limited way.
Exactly. Exactly.
And what you just described is exactly this need for a change of paradigm, right? We used to think that we need to learn by exposure, by learning from historical data. We definitely now need to have our systems learning by construction and be able to test them right away.
And that's where I think synthetic data is actually a very good and needed accelerator. And I'm just glad that AI agents brought that perspective, because this perspective already existed. It was just harder to conceptualize and see the value, because it's very abstract, right?
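To make "by construction" a little less abstract, here is a small sketch of building test scenarios for an agent from the ground up instead of waiting for historical logs. Everything in it is hypothetical: the intents, tool states, input conditions, the expected actions, and the `toy_agent` function are stand-ins for whatever a real team would define. It shows one possible shape of the idea, enumerating combinations (including edge cases) and checking the agent against a stated expectation for each.

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    intent: str
    tool_state: str
    input_quality: str

# Hypothetical dimensions of the scenario space.
INTENTS = ["refund", "order_status", "cancel_subscription"]
TOOL_STATES = ["ok", "timeout", "malformed_response"]
INPUT_QUALITY = ["clean", "typos", "missing_order_id"]

def build_scenarios():
    """Construct every combination, including ones never seen in production."""
    combos = itertools.product(INTENTS, TOOL_STATES, INPUT_QUALITY)
    return [Scenario(*combo) for combo in combos]

def expected_action(s):
    """The behavior we hypothesize a correct agent should show."""
    if s.input_quality == "missing_order_id":
        return "ask_clarification"
    if s.tool_state == "timeout":
        return "retry_tool"
    if s.tool_state == "malformed_response":
        return "escalate"
    return "answer"

def toy_agent(s):
    """Hypothetical stand-in for a real agent; returns the action it takes."""
    if s.input_quality == "missing_order_id":
        return "ask_clarification"
    if s.tool_state == "timeout":
        return "retry_tool"
    # Deliberate gap: malformed tool responses are not handled.
    return "answer"

def evaluate(agent, scenarios):
    """Return the pass rate and the list of failing scenarios."""
    failures = [s for s in scenarios if agent(s) != expected_action(s)]
    return 1 - len(failures) / len(scenarios), failures

scenarios = build_scenarios()
pass_rate, failures = evaluate(toy_agent, scenarios)
print(len(scenarios), round(pass_rate, 2))
for s in failures:
    print(s)
```

The failing scenarios point at one specific gap (malformed tool responses), which is the kind of targeted signal that is hard to get from whatever happened to be logged in the past.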
So yeah. Yeah. If you think of all the agents, at least on the business side, right?
So set aside the coding agents, but actually a lot of these business agents are coming out of China. Since I spent a lot of time in China in the past, I've been talking to a bunch of people there, and I guess the reason that the Chinese companies are moving to the West is that it's much easier to charge people in the West than in China. Right? So for whatever reason they're here, they're building these tools that will automate a bunch of things.
Right? So the canonical example would be: create a PowerPoint presentation based on the following specs, and blah blah blah. But you can imagine these business process agents becoming more and more complex, hitting more and more tools.
It's just impossible to think that you would have all of that historical data handy anyway. So you would really need a way to simulate the behavior of these agents. And one question I have, Fabiana: one of the things that you keep reading about, and I guess is generally true of millennials, is chatbots becoming kind of true friends or companions or even romantic partners.
Right? So it got me thinking that, okay, if that's happening, in order to harden these chatbots you would need to simulate data where the chatbot is now starting to detect emotional responses, you know, not just plain text. As you're testing these chatbots, you have to inject all sorts of emotional scenarios, because now it's acting like a friend of someone.
So, have you heard of emotion being part of synthetic data generation somehow?
Not really. And probably I'm a bit more skeptical when it comes to emotion.
Because, and I understand your point, it depends on what you consider emotion.
I'm skeptical too.
I'm not sure if it's happening. I'm just speculating that, because the interaction is becoming emotional to some degree, there must be some people attempting to generate data that has emotional dimensions somehow.
I'm just making this up, by the way.
Yeah. Yeah.
Yeah. No, I bet it's a possibility, and I wouldn't be surprised if someone was doing that. Emotions have been one of the focuses of AI; we always heard about sentiment analysis, that always happened, so I wouldn't be surprised. I'm not aware of it myself. But as I told you, I'm really skeptical that even synthetic data could be helpful on that side. Perhaps you can create better boundaries, if that makes sense, but still, there's always a limited capability of these models to really understand beyond syntax, right? And that's where I still stand. Even if someone told me they were able to get some better results, I do bet those better results were achieved in a very specific, narrow kind of situation. Though, well, we have heard some stories of people who are very happy with bots, who never felt a better companionship than the bots they have right now.
So there are a lot of nuances there.
So one of the things that brought synthetic data back into the headlines maybe 12 to 18 months ago was that there was suddenly a lot of talk about: all right, we're running out of data. All these models are being trained on internet data, but everyone has basically vacuumed up all of that data, so now we have to find a way to distinguish our model or make our models even better. Obviously scaling laws have multiple dimensions.
There's compute, there's data. But since data is running out, we need synthetic data, right? On the other hand, though, a lot of people raised the possibility that AI trained on AI data is going to lead to some sort of model collapse, right? So what have you heard recently in terms of the concerns around that? Obviously there's no such thing as a free lunch, right? So every kind of thing you use has potential disadvantages.
So this disadvantage that people bring up, Fabiana, is: okay, if you're going to train models on synthetic data, then that's going to degrade the model over time, because basically it's like a loop, right? The model's capability of generating synthetic data is limited by the model itself, so therefore, you know...
Yeah, yeah, in that sense, under the assumption that the synthetic data we are talking about is generated by LLMs. We can't forget that there is way more to synthetic data. There are simulations, and simulations have been used for quite some time with very good results, right?
They were used for the studies of COVID vaccination. They are used every day with weather, and they work. But of course there's a limitation, and I agree there's no free lunch.
I wouldn't say a degradation in the capability of the model, but I would definitely say a plateau. Because, well, unless you are making assumptions based on what you know, when you just know that there is no collected data, and this actually happens, or unless you know new behaviors, the fact that you are generating the same data around the same behaviors means you will reach a plateau. And also I think that's one of the things that generative AI like LLMs will always have a problem with: they are always dependent on having seen a lot of data, right? And we know that plateau will eventually be reached, and then we have a totally different problem: how can we solve this plateau mathematically? And on that side, I don't think synthetic data will be the answer anymore.
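The loop discussed here can be illustrated with a deliberately tiny toy, not an LLM: a one-dimensional Gaussian "model" that is refit, generation after generation, on samples drawn from its own previous fit. The sample size, the number of generations, and the `fresh_fraction` parameter are assumptions picked for this sketch. In the closed loop no new information ever enters, and with small samples the fitted spread tends to shrink; mixing in data that carries the original behavior keeps the fit anchored. How far this carries over to large models is exactly the open question in the conversation.

```python
import random
import statistics

def run_generations(n_samples, generations, fresh_fraction, seed):
    """Refit a Gaussian on its own samples, optionally mixed with fresh data.

    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: model matches the "real" N(0, 1) data
    n_fresh = int(n_samples * fresh_fraction)
    for _ in range(generations):
        # Data produced by the current model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_fresh)]
        # Newly collected data from the real process (may be empty).
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        data = synthetic + fresh
        # "Training": maximum-likelihood fit of mean and standard deviation.
        mu, sigma = statistics.fmean(data), statistics.pstdev(data)
    return sigma

closed_loop = run_generations(20, 300, fresh_fraction=0.0, seed=0)
with_fresh = run_generations(20, 300, fresh_fraction=0.5, seed=0)
print(closed_loop, with_fresh)
```

Comparing the two printed values shows the difference between generating "the same data around the same behaviors" and continuing to feed in data about behaviors the model did not produce itself.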
So what we just discussed there focuses mainly on LLMs and foundation models involving text, but one area that people seem particularly excited about these days is foundation models for the physical world, primarily robotics, right? So in that world, it seems like there are a few general approaches that people are taking. One is to actually collect data, but obviously they don't have the same internet-scale data that you'll have for LLMs. Secondly, you generate data by having humans do a task and you just capture it on video, and that's how you collect data.
And then the third approach is simulation, right? So basically, now that you've collected human data, maybe you can have simulations to expand the amount of data you have.
Mhm.
So the critics say that, you know, the simulations are fine, but there's still a gap between simulation and real data. And these are, you know, people like Rodney Brooks, right? One of the granddaddies of robotics.
So it seems like in certain areas like that, synthetic data may still need work.
No, I wouldn't say it may still need work, but I would say that it definitely needs to be more explored. It's more on that side.
Because I know companies that work specifically on synthetic data for robotics, and they are having very good results. And I understand that, but we have to have them talk to Rodney, perhaps, because we have to be pragmatic, right? You want to develop robots and solutions for automation, but data collection is expensive and time consuming, and it's very hard to get all the movements that you want to capture collected just by nature.
Having said that, simulation is great. Synthetic data can help in building a bridge between the real data and the simulations. In some cases it won't cover 100%, but it will cover perhaps 80 to 90%.
And sometimes it's better to just have 80% of the cases than to have the 20% covered by real data. I think here it's more a pragmatic approach. In real-world scenarios, a lot of times the 80% is very good, excellent actually.
So, in closing, going back to the topic of agents: obviously people tend to get ahead of themselves, right? People are still working on single agents to do very narrow tasks, but on the other hand there's already a lot of talk about multi-agents, right? So obviously multi-agents introduces a lot more complexity, for one, particularly if the agents are communicating.
So there are just the communication challenges between those agents. So, what are some of the new tools that you're hearing about that specifically target multi-agents, or the scale that agents have introduced to synthetic data?
Not new tools, actually. But of course we have been actively working, and a lot of the vendors in synthetic data that already work with this type of data are exploring and covering new scenarios and new features, right?
A lot of these agents rely, for example, on document processing. So there are new solutions for document generation, which is highly helpful. Or one of the things that I also like is, for example, in market research, all the synthetic personas being required nowadays to accelerate hypothesis testing and learning speed, which is very interesting. Or there are solutions being developed to help with reasoning structure for bots. So those are, I wouldn't say, specifically tools that are coming out, but they are definitely solutions that are being developed targeting these needs and requirements to test multi-agent architectures.
Yeah, it seems like there's a group out of Meta that, I don't know how real this is, but they released a paper that basically even uses Ray, right? So for scale and orchestration, and specifically to increase throughput, mainly to generate synthetic data for multi-agent scenarios. I'm not sure.
It seems like, according to the paper, they're actually using this. But I'm not sure if anyone else is using it.
But companies will use it differently, right?
That's an architecture solution for a problem they had. They want to augment the throughput and test the system loads, and that will be a decision for the different engineering teams on how to apply the synthetic data generation. Because testing throughput, testing system capabilities, well, we have been using synthetic data that way for decades now. It's just the change of paradigm. And by the way, it's not really a change, because if we think about multi-agents just as we think about microservices from the 2010s, it's the same concept, it's the same needs. It's just a shift in terms of tools, because instead of being applied to just software engineering, you're actually applying this to AI-driven solutions.
So I see a lot of change in that area of tooling, even, for example, authentication for agents, right? We are seeing a lot of solutions exactly for that, but it's not something specific to synthetic data. It's more in the broader sense of architectural solutions to deliver multi-agent systems.
Yeah. Yeah.
And also it seems like it fits into the natural tooling that's happening in multimodal data and data for generative AI in general, in that you need high throughput, but you also need efficient utilization of resources between GPUs and CPUs, and fine-grained utilization, because basically these are precious computing resources. So, and with that, thank you, Fabiana.
Thank you, Ben.
Thank you for having me. This was a pleasure.