[Applause] Okay, thank you. Hi everybody. This talk aims to give an overview of the AI-related projects and technologies used at INA, so I can't go into detail on everything.

So we will focus on metadata creation and a few highlights from some of the most prominent projects. Okay, thanks. I will go quickly through the outline. As an introduction, we have to keep in mind that INA started in 1975, and since then the amount of archives we have to manage has grown and grown. We have three main uses. The first is professional use.

This is the first one we tried to work on: these are the marketable archives, the ones we can sell. On the other side we have research use. Since 1992, those are the legal deposit archives. We cannot commercialize these, but we have to describe the contents. I will just show the amount of archives: we preserve all the programs broadcast in France. And since about the 2000s we started to move onto the web and to address new uses by editorializing our contents, so we are more and more a B2C and B2B platform. With these three uses we reach about 30 million hours of content: 2.5 million for the professional archives and about 25 million for the legal deposit.

Since 2006 we also have the web legal deposit. This is not the whole web but a selection. I think it's about 300 websites that we archive daily.

The question today is how AI can help us manage all those archives. On one hand we have all the AI processing for enhancing media quality, and on the other hand all the processing for enhancing metadata creation. About the processing for enhancing media quality, I will be brief. Today we have a dilemma between the preservation and the promotion of our archives.

The idea is that a lot of our archives are still in poor definition, like standard definition (SD), but more and more people need ultra high definition. So today we have to make choices and we have to transform our archives, but that raises a lot of ethical questions. The main technologies we are experimenting with (most of them are only at the experimental stage) are super resolution, that is, upscaling the image, some frame rate modification experiments, and other things that are very hard to push at INA because they raise a lot of ethical questions, like the inpainting of archives. About all those technologies and their use, there are four main thoughts we can share so far. The first is that all of those technologies require some human control and supervision.

Most of the time, to get good results we have to combine and cross different tools and methods. But the problem we face today is that most of these technologies will be outdated maybe tomorrow, maybe in a few years. So we have to think about what is really important and what can stay in our processes for the long term. And I would say that all of those tools raise ethical questions, internally but also for our customers, yet more and more we get requests for producing HD formats and things like that.

I will now focus on the processes for enhancing metadata creation. We have to know that we have three main sources of metadata. The first is document description, the description of contents.

We also have cataloging. We make a distinction between these two activities. Description really describes the contents, while cataloging is more like stating, for the legal deposit, that a program was broadcast on this date, at this hour, on this channel.

For all the rest we rely on data imports, a lot of data imports. Yet detailed description of content concerns only a really small part of our archives. Talking about the legal deposit, 85% of our contents don't have keywords, summaries and the like. So a very poor description.

And about the professional collection, the marketable one, a third doesn't have a summary paragraph or content description. So it's very hard to make it searchable. For maybe six years now we have been trying to adopt a lot of technologies. For many of them we don't have many use cases and we haven't processed many archives with them, but most of the ones I show here are ready to use. Now the most important thing we have to ensure is to have the proper system to manage, run and validate them, to have quality control over the data we produce with those tools.

I will focus on gen AI and semantic search engines at the end, because this is one of the approaches we want to move toward. The main uses we aim for with all this are the discoverability of archives, how we can easily search and find the documents we want, and on the analysis side, whether we can create new knowledge and new value from our archives. This is a screenshot from a website I will talk about later in the highlights.

Okay, only 20 minutes left. The first highlight is about transcription. In 2024 we started to implement Whisper at INA, and we aim to transcribe all of our archives. We took one year to set up the systems and to get some GPUs, some computational power, to run all the transcription processes. At the end of 2024 we finished the transcription of the professional archives, about 2.5 million hours, and at the beginning of 2025 we started the transcription of the legal deposit archives.

We hope that by the end of the year we will have completed the legal deposit transcriptions, but maybe we rushed a little bit, because we noticed a lot of weaknesses in Whisper. The most impactful one is that Whisper and other ASR models are trained on past data, so they are not up to date with the new terms that come up in the media, in the news. So for everything that is not yet a named entity but a future named entity, we cannot get good results. Last October I gave a short talk about what we planned for transcription, and back then we wanted to fine-tune Whisper on our archives to make it better at detecting new terms and new people. But it's too expensive, so we gave up on this idea, because labeling data is very costly and we don't want armies of labelers who spend all their time annotating data.

So, too expensive. It is not very creative work, and the first experiments we ran showed that the results were not very satisfying. So we gave up on this idea and turned to other solutions.

We have two main approaches we want to experiment with. It's very new for us, so I don't have real feedback to give about the results. The first thing we want to improve is the use of existing metadata: we have a lot of metadata about our programs, and we are trying to build pipelines that can add some knowledge during transcription. Whisper can have a hot words feature: you give a list of words and it will try to find those words in the audio you give it. That's one solution. The other one is to rely on other technologies: we think a named entity system can help us, if not to correct the transcription result, at least to give us other data we can rely on.
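To picture the metadata idea, here is a minimal sketch that gathers distinct terms from a program's existing metadata into a short hint list. The field names (`guests`, `keywords`) and the helper `build_hotwords` are hypothetical, not INA's schema, and how the resulting string is handed to the speech-to-text engine depends on the Whisper implementation in use, so that step is left out.

```python
def build_hotwords(metadata: dict, max_words: int = 20) -> str:
    """Collect distinct hint terms from program metadata, keeping first-seen order."""
    seen = set()
    terms = []
    # Hypothetical metadata fields; a real catalogue would have its own schema.
    for field in ("guests", "keywords"):
        for value in metadata.get(field, []):
            term = value.strip()
            key = term.casefold()  # case-insensitive de-duplication
            if term and key not in seen:
                seen.add(key)
                terms.append(term)
    return ", ".join(terms[:max_words])


# Illustrative record with made-up values.
program = {
    "title": "Evening news",
    "guests": ["Jean Dupont", "Awa Diallo "],
    "keywords": ["climat", "jean dupont"],
}
hotwords = build_hotwords(program)
print(hotwords)
```

The cap on the number of terms is a design choice for the sketch: hint lists are usually kept short so they stay specific to the program being transcribed.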

The idea is to have an in-house solution based on ReFinED. ReFinED is an open-source solution made by Amazon, I think. It can do named entity recognition and named entity linking. Linking is the disambiguation part, when you have a knowledge base. Starting from a potential named entity, the model will detect that this word seems to be a person or seems to be an organization.

Then in a second step it will check in the knowledge base whether it can match one entity. We are setting this up and we aim to have a first working prototype at the end of this month. The idea is that maybe, starting from the named entity results, we can try to correct and enhance the transcription results. I'm sorry, I should add that the knowledge base is based on Wikidata, and we aim to import Wikidata data at least every week. We have two ways to try to enhance the named entities and have quality control on them. Instead of only extracting named entities from our audiovisual footage, we try to rely on news feeds, because they are text, and for named entities we will maybe have better results if we start from text, because there is no ambiguity coming from the pronunciation of the words.
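The two steps described above (detect a candidate mention, then look it up in a knowledge base) can be sketched as follows. This is a toy illustration, not ReFinED's API: the capitalised-word regex stands in for a real recognition model, and the knowledge base entries and identifiers (`KB001`, `KB002`) are invented placeholders rather than Wikidata IDs.

```python
import re

# Invented knowledge base; a real one would be loaded from a Wikidata import.
KNOWLEDGE_BASE = {
    "KB001": {"label": "Institut national de l'audiovisuel",
              "type": "organization", "aliases": ["INA"]},
    "KB002": {"label": "Jean Dupont", "type": "person", "aliases": ["J. Dupont"]},
}


def build_alias_index(kb: dict) -> dict:
    """Map every known name (label or alias, case-folded) to its entity id."""
    index = {}
    for entity_id, entity in kb.items():
        for name in [entity["label"], *entity["aliases"]]:
            index[name.casefold()] = entity_id
    return index


def detect_mentions(text: str) -> list:
    """Step 1, stand-in for a recognition model: runs of capitalised words."""
    return re.findall(r"[A-Z]\w*(?:\s+[A-Z]\w*)*", text)


def link_mentions(text: str, kb: dict) -> list:
    """Step 2: match each mention against the knowledge base, or None."""
    index = build_alias_index(kb)
    return [(m, index.get(m.casefold())) for m in detect_mentions(text)]


links = link_mentions("the journalist Jean Dupont visited INA in Paris.",
                      KNOWLEDGE_BASE)
print(links)
```

Mentions that come back with `None` are the interesting ones for the news-feed idea: they are candidates for new entries in the knowledge base.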

So the idea is that we will extract a lot of words from news feeds and complete our database with the new named entities we come across there. Then we can try to correct the named entities when we extract them from our archives and audiovisual contents. We want to experiment with phonetic embedding technologies. It's not something that can correct everything, because there are a lot of cases where it won't work: if you have an exotic name (we are in France, so if we have people with names that are not common), we are not sure that people will pronounce it correctly. So it won't solve everything, but it's maybe one approach that can help to get better results.
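As a rough illustration of matching a name heard in audio against names collected from text, the sketch below reduces names to a crude sound-alike key and compares keys with a string similarity ratio. This is a deliberately simple stand-in, not the learned phonetic embeddings mentioned above; the rewrite rules, the `0.8` threshold and the function names are all assumptions made for the example.

```python
import difflib
import unicodedata


def phonetic_key(name: str) -> str:
    """Reduce a name to a crude sound-alike key (illustrative rules only)."""
    text = unicodedata.normalize("NFKD", name.casefold())
    text = "".join(c for c in text if c.isalpha())  # drops accents and spaces
    for src, dst in (("ph", "f"), ("ck", "k"), ("qu", "k"),
                     ("c", "k"), ("y", "i"), ("z", "s")):
        text = text.replace(src, dst)
    collapsed = []
    for c in text:  # collapse doubled letters: "ll" -> "l"
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    return "".join(collapsed)


def closest_entity(heard: str, known_names: list, threshold: float = 0.8):
    """Return the known name whose key is most similar to the heard one, or None."""
    best, best_score = None, 0.0
    for candidate in known_names:
        score = difflib.SequenceMatcher(
            None, phonetic_key(heard), phonetic_key(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


names_from_news = ["Kylian Mbappé", "Zinedine Zidane"]
match = closest_entity("Killian Mbappe", names_from_news)
print(match)
```

As the talk notes, this kind of matching only helps when the name is pronounced close to its written form; a threshold keeps unrelated names from being forced onto a match.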

The idea is to combine more things that can help us get better results: the phonetic approach and maybe the metadata approach, starting from the named entities we find in the news feeds. Maybe we can use that data and compare it with the named entities we find in the audiovisual footage.

One use we can make of transcription is to move toward a semantic search approach. We already talked about semantic search this morning and this afternoon. The idea is that we can make embeddings. Embeddings are like a mathematical description of information.

They are floating-point numbers, and in a multi-dimensional space we can represent this data and make links, relations between data. In this example, king and man are related, man is related to woman, and king is related to queen. So we can link queen and woman. That's the main idea.
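The king and queen example can be made concrete with a few hand-made vectors. The numbers below are invented for illustration (each dimension loosely stands for royalty, male, female, fruit); real embeddings come from a trained model and have hundreds of dimensions.

```python
import math

# Toy vectors, invented for the example: [royalty, male, female, fruit].
vectors = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity: close to 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# "king" minus "man" plus "woman" should land near "queen".
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(nearest)
```

Similarity between two items is then just a number, which is what makes clustering and search over embeddings possible.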

With these embeddings we will use similarity approaches to make clusters, bunches of data that are similar, so we can say this is more or less the same information. What we want to reach is a new way of searching for information. So far, with the transcription only, we can have keywords.

That is very useful, but now we have to add semantic concepts: if I'm talking about something but don't have the right word or the specific word, it will expand the search with synonyms, something like that. An example: if I want to search about climate and ecological matters, here I typed "climatic", but there's an error, this is not how we write "climatic" in French, so we won't get results. With semantic search, the idea is that it could compensate for this, and we will get results even if we make mistakes in our search. On the other side, full-text search will only give us results that include the word we typed.

But with semantic search, we can get different contents that are related, maybe not exactly to the thing we are searching for, but it expands the results. Sorry, this is not really detailed and I am not talking about the technologies and how we do things.

It's very simplified, but the idea is this: starting from the ASR, the speech-to-text model like Whisper, we transcribe everything, we can maybe add the named entities, and then with LLMs we do some summarization. This has been explained before: we make chunks of the text and we make the summary on the chunks. We can also rely on metadata and named entities to enhance the context and the information we give to the LLM. For the final part we make the embeddings and the semantic search. It's still a prototype, and the semantic search is not released yet. We are still experimenting and running evaluations with our customers, or rather our internal customers.
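The chunk-then-summarize step can be sketched as below. The chunk size, the overlap and the function names are assumptions for the example, and `summarize` is a placeholder callable standing in for whatever LLM call is actually used; the `context` string is where metadata or named entities could be added.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list:
    """Split a transcript into overlapping chunks of at most `size` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def summarize_transcript(text, summarize, context="", size=50, overlap=10):
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    partial = [summarize(f"{context}\n{chunk}".strip())
               for chunk in chunk_words(text, size, overlap)]
    if not partial:
        return ""
    return summarize("\n".join(partial)) if len(partial) > 1 else partial[0]


# Fake transcript and fake "LLM" (returns the first word) to show the flow.
transcript = " ".join(f"w{i}" for i in range(120))
chunks = chunk_words(transcript)
summary = summarize_transcript(transcript, lambda prompt: prompt.split()[0])
print(len(chunks), summary)
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks.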

Lastly, we want to set up multimodal semantic search, a multimodal approach like the one mentioned before. We are experimenting with CLIP too. For now we don't have real results, but we are experimenting with the different technologies we can easily work with. We have to make sure we have algorithms that can be used for multimodal data: text, image, sound.

We must have similarity algorithms that are satisfactory and can make those links between the different embeddings. And finally, we are working on finding a proper vector search engine, to have something searchable with a good response time.

Just two other highlights about use cases that can bring some benefit to users, to the internal company users. A few months ago, maybe three months ago, we started to expose a service called Toolbox.

In that toolbox the idea is to do on-demand processing. So far we have only implemented the ASR model, Whisper. It's mostly for the editorial department. They are the most frequent users of this service, and they can make transcriptions on demand with Whisper. I think that was very important to us, because so far we were just experimenting on the technical side, and people in the company didn't really know what they could do with it and to what extent it is useful. So it's a good way to make users understand how they can use this. And by the way, most of the journalists and people from the editorial side were already using transcription tools, but not the ones we produced in the institute.

About data.ina.fr, this website: it's a website that gives, not analysis, but some information, some data to make proper analysis of French media.

It relies on AI-generated data. So far we work with 20 channels, five radio channels and 15 TV channels, and the metadata runs from 2018 to 2025. The idea is to have metadata coming from facial recognition or named entity data.

So far we don't use facial recognition, but we want to use it by the end of the year, and we have transcription that can help us produce summaries. We have other tools like sound classification, which recognizes voices and music, and we try to have gender analytics, to say how much men talk and how much women talk, some parity analysis. But the fact is that we have set up and experimented with a lot of things, and maybe we should have started by implementing systems for labeling and validating data. So we are doing this now, after having processed a lot of data.
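The parity analysis mentioned above boils down to adding up speaking time per label. A minimal sketch, assuming the sound classification step yields `(label, start, end)` segments in seconds with labels such as `male`, `female` and `music`; that segment format and the function name are assumptions for the example.

```python
def speech_share(segments):
    """Return each gender's share of total speaking time (music is ignored)."""
    totals = {}
    for label, start, end in segments:
        if label in ("male", "female"):
            totals[label] = totals.get(label, 0.0) + (end - start)
    speech = sum(totals.values())
    return {label: t / speech for label, t in totals.items()} if speech else {}


# Made-up segments for one program: (label, start_seconds, end_seconds).
segments = [
    ("male", 0, 30),
    ("music", 30, 40),
    ("female", 40, 50),
    ("male", 50, 60),
]
shares = speech_share(segments)
print(shares)
```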

The idea is that if we want to be sure the quality is good, the only option is to have humans check and correct the data. So we need to set up good interfaces and good metrics that help us understand what we are generating and, if something changes or doesn't go as we want, to be able to show it really quickly.

So we have two main websites. The first is a website for labeling data. We called it grown control, because that was the name of a restaurant we were talking about and it fit the program. It is a website in which we can have several IHM screens.

IHM is the French term for user interface. The idea is that we can easily build interfaces for different use cases, like transcription or topic detection.

But so far, for transcription, this interface is not very satisfying, because all we can do is compute a word error rate: we can validate or invalidate a block, a paragraph, and then correct it if we want, but the result is that we have a word error rate and we don't know what to do with it. It doesn't help us improve the systems. So we are working on something more consistent, more helpful for understanding what is really happening, like hallucinations and missing words, things we could detect with this interface.

About monitoring AI, the idea is to have something that gives you a quick overview of what you have processed so far and what is missing, and a detail view (I don't have a screenshot for it) that gives you more information about the process.
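On the transcription side, one way to get more out of a corrected block than a single number is to break the word error rate into substitutions, deletions and insertions. The sketch below does this with a standard word-level edit-distance alignment; reading insertions as a rough proxy for hallucinated words and deletions as missing words is an assumption of the example, not a claim about the interface described above.

```python
def error_breakdown(reference: str, hypothesis: str) -> dict:
    """Word-level edit distance between a corrected text and the ASR output,
    split into substitutions, deletions (missing) and insertions (extra)."""
    ref, hyp = reference.split(), hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:  # walk back along one optimal alignment
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            if ref[i - 1] != hyp[j - 1]:
                counts["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    counts["wer"] = dist[len(ref)][len(hyp)] / max(len(ref), 1)
    return counts


report = error_breakdown(
    "the minister visited the archive today",        # human-corrected text
    "the minister visited archive today thank you",  # ASR output
)
print(report)
```

In this made-up example the word error rate is 0.5, but the breakdown says more: one word is missing and two extra words were added at the end.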

As a conclusion, to sum up: most of the feedback leads us to say that we have to combine tools and methods to improve the results. That is also true for metadata creation. We want to enhance semantic search systems and move toward multimodal search.

It's not there yet, but this is one of our main priorities for the next two or three years. We want to enhance monitoring, quality control and labeling systems, and the goal of all of this is to give internal company users access to those technologies, because for the last four years they have just heard about them but never used them. And finally, we want to follow the evolving legal and rights landscape of AI, because we have technologies like facial recognition that we are working on, but we don't know if we can make real use of them. So that side is still quite unknown.

Thank you.

MMC Seminar 2025 - INA IA project overview / Transcription and named entity focus / Semantic search