Welcome back, everyone; we'll get started here. I'm Anne Carpenter. I direct the Imaging Platform here at the Broad, and I have expertise in image analysis and in AI for drug discovery. Today I'm really excited to be chairing this session, because we'll be shifting from what we learned about this morning, a lot to do with genomics and protein structures, and transitioning during this session to looking at the functional impacts of proteins, and then also cell biology, which is my own personal interest.

So I'm very excited to see this session play out. Our first speaker will be talking about gene variations to protein functional impact at scale, and that's Sumaiya Iqbal from here at the Broad Institute.

Thank you for the introduction; very excited to be here. Good afternoon, everyone. I realized that one good thing about speaking in the afternoon is that I can almost entirely skip the background, because that's what Hillary and also marage covered a lot. But thinking about connecting the dots in biology, one thing that I think a lot about is

how to connect these gene variations to protein functional impact, and my talk will focus on that. So what is the connection between genes and proteins? It basically means connecting the instruction manual to the messenger, all the way to the protein that performs the function of the cell. That understanding is even more important when there is an error in the instruction manual and we want to develop therapeutic hypotheses accordingly. And these connections are even more important to understand when we want to unveil

the molecular effect, that is, the structure-function effect, of protein-coding missense mutations, because for many, many proteins it has been shown that rare disease mutations, especially rare monogenic disease mutations, cluster on the protein structure, pinpointing a specific functional region and hinting at the functional effect of the mutation. That was one example, but there are many, many such examples: the UMOD kidney disease mutations, GABA receptors and epileptic encephalopathy, and then the Mark 2 gene and the associated Charcot-Marie-Tooth missense mutations.

And why is this understanding important? Because understanding the functional effect of missense mutations is a challenge in human genetics, and it's a longstanding bottleneck for translation. This stat should be known by now to this audience: 50% of all protein-coding genome variations are missense variations, and of all the observed missense variations only 2% are classified as pathogenic variants, leaving almost 84% as variants of uncertain significance. This is according to 2024 ClinVar data, and this number of

variants of uncertain significance keeps on growing every year. This has a huge implication, because these variants are not clinically actionable, and they are also not included in any case-control studies due to the lack of interpretation. Because it has this huge implication, we asked a deeper question: what fraction of these variants of uncertain significance is actually present in healthy individuals who are just fine? That fraction is 45%. So 45% of this huge number of variants of uncertain significance is

present in healthy individuals. Then we also asked the opposite question: what fraction of these variants actually have a functional impact, but we just don't know it? Thanks to databases such as MaveDB and ProteinGym, which record functional assay readouts of variant effects, what we could quantify is that, for about 50 genes, 21% of these variants of uncertain significance are actually loss-of-function or non-functional variants. That makes me think that clinical interpretation, or pathogenicity interpretation, and functional effect interpretation

are related problems but not really the same problem, and this is exactly what Hillary also mentioned. As I was saying, many of these things have already been said in the morning session. Instead, I'd say this problem is actually a multi-dimensional problem: a variant can sit within a spectrum of pathogenicity, within a spectrum of selection and frequency, and likewise of conservation. And what particularly interests me is the protein functional spectrum, from gain of function to loss of

function. Now let's think a little bit about protein function. I will try to illuminate two special challenges associated with functional effect prediction. One is that protein structure is actually more conserved than protein sequence. If you look at all the kinases together, you will see they all look the same, and if you look at all the ion channels, they all look the same, but across these classes they look very different. That makes me think that it requires some protein or class

specific, or even cellular-context-specific, tuning. And the second one is my favorite: what does protein function mean? Does function mean stability? Does function mean interaction? Does function mean regulation? Does function mean cellular toxicity? So when I think about the molecular and functional effect of a missense mutation, what I also think a lot about is that this functional effect requires a precise, context-specific definition of what we are actually measuring. Towards that direction, in my lab we do

method development to get as precise an understanding as we can, of course, of the functional and molecular impact of missense mutations. I'll give you a very high-level overview of two different projects that are currently going on in the lab. The first project is led by Silky and Jordan; they should be around here, so after this talk, if you have any questions, feel free to ask them. Here the topic is how we can develop an interpretable score, and I highlighted or

underlined the word interpretable because I care about that interpretation a lot. When we throw a score to a molecular biologist or a therapeutic scientist, that score is a number, fine, and then they ask the deeper question: what does it actually mean? What function is it actually impacting? So this score development is based solely on a very fundamental principle, the central dogma of molecular biology, that structure and function are encoded in sequence. We hypothesized that protein sequence, structure, and function information can help interpret missense mutations

because, fundamentally, a missense mutation is changing the sequence. But what it requires us to do is basically connecting the dots: connecting the genetic data to structural biology data, which under the hood is even more complicated, because it requires you to take the genetic variations to the correct transcript, then to the correct protein isoform, and then to the correct protein structure. When you think about Protein Data Bank data, it's very messy and it's

a tricky thing to do. AlphaFold came in and solved many problems, or at least made our lives easier, but it's still a difficult challenge. So in the lab we developed, and it is now published online, the Genomics to Proteins portal, which is a discovery tool to link genetic screening outputs to protein sequence and structure. This portal has a resource module, as you can see, and what we achieved with this resource module is that we integrated a variety of sources of genomic data with transcriptomes and then with protein sequence

and structure. Then we collected genetic variation data from three major databases, gnomAD, ClinVar, and HGMD, and mapped those data onto protein sequence and structure. But importantly, I want to mention this particular side of the portal, which is an interactive mapping tool. With that tool you can upload your own customized annotations. Let's say you predict the functional effect of variations in a particular protein using AlphaMissense and you want to map that onto the structure: you can upload your scores and your mutations,

and you can also upload a structure that you predicted using your favorite tool, it doesn't matter whether ESMFold or AlphaFold, and then establish this connection, exactly in this way, side by side, between protein sequence and structure. After doing the mapping, you can download the result and go into deeper analysis. I also want to highlight another thing we did very deliberately, and I'll tell you why: we annotated the human proteome at the residue level with a suite of biologically interpretable features. So this

is a pretty messy slide, so I'll simplify it. What we have here is the biochemical properties of the amino acid residues and the structural features computed from PDB and AlphaFold structures. We have collected the post-translational modification annotations; we have collected the variant effects, or MAVEs, from the MaveDB database; we collected all the sequence, function, and activity annotations available in UniProt; and we also computed a suite of protein-protein interaction annotations, which can be inter-protein or intra-protein interactions, from both PDB and AlphaFold structures.

Now, if you look at these groups of features, they actually point to the different precise functions that I was talking about. PTMs will tell you about regulation; protein-protein interactions will give you the protein-binding, loss-of-binding type of function; the functional activity comes from UniProt; and stability comes from the structural features. So with what we have gathered into the portal, the variation data and the feature annotations, we can ask a very simple question, and that

question is: which of these protein features are enriched for pathogenic variations, which are the cases, versus the population variations, which are the controls? We did that with the pathogenic and disease mutations from ClinVar and HGMD as cases, and the common variants in gnomAD and benign variants from ClinVar as controls, and the results look like what you can see at the top. When a feature is on the red side, it is enriched for pathogenic variants, and when it's on the blue side, it is enriched for control variants. So how do we interpret that result?
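The case-versus-control comparison described here can be sketched for a single feature as a 2x2 table. This is only an illustrative sketch: the talk does not name the statistic behind the enrichment effect size, so the odds ratio, the 0.5 continuity correction, the function name, and the counts below are all assumptions.

```python
import math


def feature_enrichment(path_with, path_without, ctrl_with, ctrl_without):
    """Odds ratio (and its log2) for one protein feature from a 2x2 table.

    path_with / path_without -- pathogenic variants at residues with / without the feature
    ctrl_with / ctrl_without -- control variants at residues with / without the feature
    Assumption: a 0.5 continuity correction is added to every cell so that an
    empty cell does not cause a division by zero.
    """
    a, b, c, d = (x + 0.5 for x in (path_with, path_without, ctrl_with, ctrl_without))
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, math.log2(odds_ratio)


# Hypothetical counts, invented purely for illustration (not from the talk).
odds_ratio, log2_or = feature_enrichment(400, 600, 250, 750)

# A positive log2 odds ratio would sit on the "red" (pathogenic-enriched) side,
# a negative one on the "blue" (control-enriched) side.
side = "pathogenic-enriched" if log2_or > 0 else "control-enriched"
print(f"odds ratio = {odds_ratio:.2f}, log2 = {log2_or:.2f} ({side})")
```

With these made-up counts the feature comes out roughly two-fold enriched among pathogenic variants; swapping the case and control counts flips the sign of the log2 value.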

Here is an example case. Without looking into the details of those two top rows, what we can see is that the beta-sheet structural motifs, across the entire human proteome variations, are two-fold enriched for pathogenic variations compared to the other structural motifs. Now I'm going to show you how it looks across hundreds of features. Okay, so, next slide; but before that, what we can do now, if you look here: we have this kind of enrichment effect size for hundreds of features, and we

can do a very simple thing: we can take a cumulative sum of it. Although it's a very simple score, what you can then do is look at the contributing factors coming from these different groups of features. That will tell you: okay, we are getting a high molecular impact score, but what is the function being perturbed that is contributing to that score? And this is what it looks like when we look at hundreds of features. The hundreds of features are not

included here, but you can assume that the x-axis is the features. What I have highlighted over there is the enrichment effect size we see in different features if we take the entire human proteome variations. But I also mentioned in the beginning that protein structure-function is class-specific, so what we did is repeat this entire computation after dividing the proteins into different classes, and this is how it looks. Starting from the top, you

can see the DNA metabolism proteins, RNA metabolism proteins, and so on, divided into different classes. I just want to highlight two specific examples here, because it's a very heavy slide. That particular dark red dot, a 27-fold enrichment of pathogenic variants, corresponds to the class of immunity proteins, and that particular feature is beta strand. That enrichment is not present when we look at the entire human proteome in the top row. So that is the variation of feature enrichment across different functional classes.
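The cumulative-sum score mentioned earlier, together with its breakdown by feature group, can be sketched as follows. This is a minimal sketch under stated assumptions: the feature names, group labels, and effect-size values are hypothetical, and the actual score, including how class-specific effect sizes are chosen and scaled, may well differ.

```python
from collections import defaultdict


def impact_score(residue_features, enrichment, feature_group):
    """Sum the enrichment effect sizes of the features present at a residue.

    Returns the total score plus per-group subtotals, so a high score can be
    traced back to the kind of function that contributes to it.
    """
    by_group = defaultdict(float)
    for feature in residue_features:
        if feature in enrichment:  # features without an effect size are skipped
            by_group[feature_group[feature]] += enrichment[feature]
    total = sum(by_group.values())
    return total, dict(by_group)


# Hypothetical effect sizes for one protein class (invented for illustration).
enrichment = {
    "beta_strand": 1.0,
    "interchain_hbond": 1.5,
    "near_ptm_site": 0.75,
    "coil": -0.5,
}
# Hypothetical mapping from each feature to an interpretable group.
feature_group = {
    "beta_strand": "structure",
    "interchain_hbond": "interaction",
    "near_ptm_site": "regulation",
    "coil": "structure",
}

total, parts = impact_score(["beta_strand", "interchain_hbond"], enrichment, feature_group)
print(total, parts)
```

Keeping the per-group subtotals next to the total is what makes the number interpretable: the same total could come mostly from interaction features for one mutation and mostly from regulation features for another.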

The second example is the transcription regulator proteins, where you can see that the key feature comes out to be the DNA-binding region, which makes sense. So we did the pfes score calculation in a protein-class-specific way. I will skip the validation and the math and all these things, and show you how to use this score. Given a gene name, first of all you can get a heat map like that for all possible substitutions, but then you can break it down, and you can

break it down by different changes: okay, is this red line that I can see coming from a biochemical property change, or a stability change, or a sequence activity change, or an interaction and regulation change? When you give a mutation as input, you get a score, which is the summation, like 8.89, but then you can break it down into the different contributing factors. And when you take this mutation and map it using the Genomics to Proteins portal, it will tell you that that particular

mutation is breaking an interchain hydrogen bond in the structure. The second example is this one, the desmin protein, where the genetic mutation is at the amino acid just adjacent to a post-translational modification site, which is a SUMOylation site, and regulation is basically the major impact on the score. This wraps up the first part of the talk. The second project, which I will again present very briefly, is how we are now using these protein features and large language models for functional effect

prediction for a particular class, the voltage-gated ion channels. We picked that class for a variety of reasons. One is, of course, that the brain-expressed voltage-gated channels are targets significantly associated with neurodevelopmental diseases, and the word function is at least reasonably well defined in the context of an ion channel: the closing and opening of the channel and signal transduction. We also have a wonderful set of collaborators on this project, and grad student Franchesca is leading this effort, so shout out to her.

What did we do? A very simple thing: we collected a lot of literature data with loss-of-function and gain-of-function annotations for the functional effects of ion channel variants, then we used the protein features that we have gathered in G2P, trained a neural network, and did the prediction task. Then we repeated that same task using embeddings extracted from the ESM model, basically the output of the encoder. And in the next step we combined those two things.
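The three-way comparison described here (features alone, embeddings alone, and both combined, all on the same task) can be sketched as building three input sets for the same variants. This is a hedged sketch only: the helper name is hypothetical, combining by simple concatenation is an assumption, the numbers are placeholders rather than real protein features or language-model embeddings, and the neural network itself is not shown.

```python
def build_input_sets(features, embeddings):
    """Build three input variants for the same prediction task.

    features   -- one list of residue-level feature values per variant
    embeddings -- one embedding vector per variant (placeholder numbers here,
                  not real language-model output)
    Assumption: "combining" means concatenating the two vectors per variant.
    """
    if len(features) != len(embeddings):
        raise ValueError("features and embeddings must describe the same variants")
    combined = [list(f) + list(e) for f, e in zip(features, embeddings)]
    return {"features": features, "embeddings": embeddings, "combined": combined}


# Two hypothetical variants with made-up values and labels.
features = [[1.0, 0.0, 0.3], [0.0, 1.0, 0.9]]
embeddings = [[0.12, -0.40], [0.05, 0.33]]
labels = ["loss_of_function", "gain_of_function"]

inputs = build_input_sets(features, embeddings)
for name, rows in inputs.items():
    # Each input set would be fed to the same classifier and scored the same way.
    print(name, "->", len(rows[0]), "values per variant")
```

Holding the labels, the model, and the evaluation fixed while only the input set changes is what lets any difference in performance be attributed to the inputs.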

With the protein features and the embeddings together, we did exactly the same task again, and what the result showed is that combining the protein features and embeddings is beneficial for the prediction. So the protein feature and embedding information is complementary, which is not surprising given what I pointed out in the beginning: these different pieces of evidence are complementary for the prediction. Something important I want to mention here is that I deliberately did not include which ESM version, because there are ESM-2, ESM-1b, and ESM3, and

this is because we are also, under the hood, developing a tool for how we can compare these embeddings and then use the right one. For our case it was ESM-1b that was actually performing reasonably well; we still need to compare with ESM3. So I presented this slide to close the loop: these pieces of information are complementary. To summarize, connecting genomics, and not only clinical genomics but also functional genomics, to the unit of function, which is the protein, is important for mutation effect prediction and developing

therapeutic strategies, and towards that we developed the Genomics to Proteins portal, which is an interface between genomics and structural biology. Three very exciting things are coming up on the G2P side. Of course, we will make this pfes score and its interpretation available, and we will make druggable pocket annotations on the structures available. I think one area where we need more democratization is therapeutics and drug discovery, so we'll make our little effort in that direction. And we'll also annotate the human proteome with base-editable synthetic mutations, that is, what are

the mutations we can actually make synthetically. So, until those new additions come in: our portal has recently been published, so go read the paper and look at the portal if it is useful for you. And with that, thanks to the team, and thanks to all these centers and the people within the centers that I talk to every day and learn from, my collaborators outside, all the in partners and Ladders to Cures, and the Broad Institute. Thank you all for listening. [Applause]

A question: I thought the examples were really neat about

which portions of the protein are more likely to have these pathogenic variants. I wondered if there were any surprises other than the two examples you showed, of transcription factors being DNA-binding, and just generally that loops are more permissive to variants.

There are, actually: the cytoskeleton proteins and the cell-cell adhesion protein class. Those two other classes of proteins are also very promiscuous, yes. And on the loops: there are actually three different types of loops in protein structure, the coil, the turn, and the bend, and the turn is

the only hydrogen-bond-determined loop structure, and that turns out to be an enriched feature, but not the other two: not the coil and not the bend, but the turns, which are hydrogen-bond determined.

Okay, yeah, thank you. Thanks very much, Sumaiya. All right, next up we have Iain Cheeseman from the Whitehead Institute and MIT, and he'll be moving us more into the domain of cell biology, looking at function from the perspective of impacts on cells.

Thanks, it's great to be here. I'm really enjoying today, and I'm really glad to be able to

be part of this. We're going to change gears; as you just heard, I'm going to move much more to the cell biology side. I want to calibrate you for what to expect over the next 15 minutes. I'm going to put myself on the Alex Reeves scale of where I am on this: you know, cell biology at scale in the age of Al. Like, who is this Al guy? I've got to meet him; he seems like he's doing this amazing stuff, right?

But I'm going to tell you why I need everything that you all are talking about today, so we'll talk a lot about Al. Okay, so for myself, and really what our lab is interested in: I love cells, I love cell biology, and I really want to understand how these work. So I look at a picture like this, and I see the different components of the cell, the microtubules, the DNA, these kinetochore structures, but

actually, really, when I look at this cell, I see a bunch of different molecular machines, and we think about all of the different protein components and how they're organized and functioning, individually and in combination, to achieve this really beautiful cellular function. If we want to understand these protein components, and really the mechanistic biology that's happening for this cell to work, there are a few things that we critically need to understand about every single protein in the cell: where is that protein expressed, where does it localize, who is it

touching physically, what are its activities, how is it controlled, and what happens when you get rid of it? We've already heard a lot about those today. We're moving to the functional section of this conference, and so I'm going to focus particularly on this question of what happens when you get rid of a core factor. Fundamentally, what we're talking about is a phenotype. So what really is a cellular phenotype? I'm going to spend some time talking about the different ways we can think

about that today. I think probably the most easily stated definition of a phenotype, and the way we're going to look at it, is the observable morphological and physiological characteristics of a cell; we could think the same about an organismal phenotype too. And if I just describe the field of cell biology, dating back to the earliest microscopes, a lot of this really has been visual and morphological. Thinking about the way that at least I approached this until a few years ago: this is a beautiful

picture of a cell dividing. This is happy mitosis, and this is not; this is messed up. I think this is kind of our level of resolution for how we describe one of these morphological phenotypes, and I can learn a lot by saying, okay, this is messed up, something's going wrong here. Actually, I went to a very cool database, the Cellular Phenotype Database, which is a summary of a lot of different screens that have been done over the past 15 to 20 years, and I tried to

interrogate what the phenotypes are that are listed there. I'm grabbing a bunch of these here, and what you can see is that all of these categories are very much about the morphological features of the cell, and they're very qualitative. I think my favorite was "increased number of zigzag actin stress fibers", where in some ways you can kind of get what it's talking about. But I think it's not reproducible, it's not quantitative, I don't know what I'd do with it, and I don't know how I would score it the

same way as the person who scored that screen. But it's kind of beautiful to think about whatever zigzags are happening in this case as well. So I think these qualitative phenotypes have really been the way that cell biology has approached phenotypes for a long time, but it really doesn't give you what you need. I want to think instead about quantitative phenotypes: how can we actually latch on to the features of what happens when you get rid of a protein, for example,

and what kind of quantitative metrics can we apply? I think one quantitative metric that's actually quite good is a change in cellular fitness. This is something that we can measure, and if you're part of Project Achilles and DepMap, you're not doing this for just one gene and one cell line; you're doing this for every single gene across every single cell line. I am a huge fan of this project; I think I probably look at this website every single day. Here we have a nice quantitative

readout for this ribosome gene: I can tell you that it is fitness-conferring in every single cell line. I can learn a lot because this is a quantitative parameter, but it's also predictive, and really nice studies have been done to latch on to the similarities across all these different conditions. I'm really excited about the work that Carollyn's lab is doing right now to push this to a different boundary. And for us as cell biologists, not only do I want to know that information about the ribosome protein and the

fact that the cells need it, but it's actually also been something that's driven the biology that we're doing. A paper from Ali win in our lab used DepMap as a starting point to find a new regulator of genome organization. We simply would not have come across this factor had it not been for the quantitative readouts and parameters that are available in DepMap. Okay, so I like that. I think that we can think about other definitions of a cellular phenotype, particularly ones that can be quantitative, and I

think anything that we can see as a change in a cell's properties or composition is also a really excellent quantitative phenotype. When we think about the central dogma and the different changes that can happen at the level of your DNA, your RNA, or your protein, each of these provides a really valuable readout. We just heard a great talk thinking about changes to the DNA sequence and how we can read these out across rare disease, for example through whole-genome sequencing. At the RNA level, both bulk and

single-cell RNA-seq are incredibly effective ways to monitor certain features of a cell's composition. And at the protein level, proteomics, where at the single-cell level maybe we're not quite there yet, is a great way to quantitatively probe changes across conditions and in a cellular phenotype. Thinking about these kinds of approaches to a phenotype, we've already heard mentioned today a very nice study from Jonathan Weissman's lab next door, using single-cell RNA-seq in combination with perturbations to probe

the phenotypes resulting from the knockout of key factors. Not only did they do this, but they did a great job of again using these in a predictive way, to say that factors with related changes in gene expression are going to show shared functions. I think that predictive value is such a valuable part of any quantitative phenotype. I would argue that we learn a little less, though; not that you can see anything in the data set that I showed up here, but I think that understanding what the consequences of

a knockout are just by looking at the transcriptome, I certainly can compare them, but for really understanding what's going wrong in the cell, maybe we have a few limitations there. I'd also highlight that RNA-seq is amazing, and yet it can be limiting. One of the reasons is that RNA levels really do not relate directly to protein levels, for example, and that's because not only do we need to think about what protein products we're making, but really any change across the central dogma: changes to, you

know, transcription start sites, splicing, alternative translation start sites; these things are not read out in the kind of RNA-seq that we're doing these days. And then downstream from that, how you're making the protein, its post-translational modifications: many of these things are also really central to understanding what the cell is doing and how it's behaving. I think that we can take a step further, though, to say that ultimately, if we want to understand the downstream, the consequences to composition, we want to think about the protein. Ultimately, what those proteins

are doing is affecting cellular function; they're making a change in the organization, properties, and features of that cell. So actually one of the easiest ways to probe composition, to probe gene expression, to probe protein levels, to try to probe PTMs, is to come back to that first definition that I mentioned, which is that we want to go look at a cell and define its observable and physiological characteristics, and to do this not just in the qualitative way that at least I had been doing for probably

25 years, but in a much more quantitative way. You're about to hear from Paul, who's going to do a much better job of describing optical pooled screening than I am, so I'm just going to mention it briefly. We've been really fortunate to be able to work with Paul's group, which has really changed the game on what we as cell biologists can do. In particular, this work that I'll briefly mention was from Luke in Paul's lab, and is driven in our lab by Quon Chong Mato Caitlyn Jimmy and brani.

Okay, so the beauty of this is that it really takes what our lab had been doing for a couple of decades, which is to knock out a gene and look at the resulting phenotype, and just supercharges it: crank it to 11, go big or go home, all of those sorts of things. Here what we're doing is taking a library of different guide RNAs, introducing them into a cell with an inducible Cas9, creating phenotypes, and then taking this large pool of cells and being able to go

in and just take lots and lots of pictures. But I think the thing that Paul's lab did so beautifully, and it also just works, is to subsequently come back and use in situ sequencing to define the guide that's present in every single one of these cells. So we can collect a large number of cells and then subsequently deconvolve them to understand what the knockout is in every cell, and group those together. Effectively, what this allows us to do is, instead of taking just one picture, take really millions of

pictures of cells. Obviously this is about 0.001% of the cells that were imaged; I probably even got the zeros wrong in there. It's a lot of cells we're imaging. But even with this setup, we can see quite robust morphologies; we can really see a lot in these pictures, really the same as we have been doing. I think the thing that this scale allows you to achieve, though, is a level of quantitation that we just didn't have before, and I think there are two parts of that that are so

powerful. One is that you don't have to worry about well-to-well batch effects: all of the cells are grouped together, so your ability to see a distinction between your control cell and the knockout cell right next to it is very effective. The second is that we have very good statistics across the, say, about 6,000 cells per gene target that we're looking at in this case. And so, really building on the beautiful work from Anne Carpenter and the Carpenter-Singh lab over time, being able to take the kind of metrics

that CellProfiler has used to really quantitatively measure every single one of these cells. In this slide I'm showing you DNA damage intensity across all these cells, and it just does a great job of helping us identify the core factors that result in DNA damage, for example. What blew my mind, and what I did not expect was going to happen, is that not only did we get, say, DNA damage intensity, but because of the kind of work that Anne and others had done in CellProfiler, we could actually

extract about a thousand different phenotypic parameters for each one of these cells: how big is the cell, how round is the nucleus, how bright is the DNA, and really complex morphological features. When we do this, what we really generate is in fact a fingerprint: every knockout cell has a little bit higher DNA, a little bit larger cell size, and so on, and by comparing all of these different individual parameters, we can do a great job of revealing a phenotypic profile for every single one of these cells.
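The idea of a phenotypic fingerprint can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the pipeline used in the work described: normalizing each feature against non-targeting control cells as a z-score and comparing profiles by cosine similarity are common choices assumed here, and the two features and all values are made up.

```python
import math
from statistics import mean, pstdev


def zscore_profile(knockout_cells, control_cells):
    """Per-feature z-score of the knockout mean against the control cells.

    Each cell is a list of feature values (for example cell area, DNA intensity).
    Assumption: a feature with zero spread in the controls is scaled by 1.0.
    """
    profile = []
    for j in range(len(control_cells[0])):
        ctrl = [cell[j] for cell in control_cells]
        ko = [cell[j] for cell in knockout_cells]
        spread = pstdev(ctrl) or 1.0
        profile.append((mean(ko) - mean(ctrl)) / spread)
    return profile


def cosine_similarity(u, v):
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical two-feature measurements, invented for illustration.
controls = [[1.0, 10.0], [3.0, 10.0]]  # cells carrying non-targeting guides
gene_a = [[4.0, 12.0], [4.0, 12.0]]    # cells carrying guides against gene A
gene_b = [[6.0, 14.0]]                 # gene B: same direction of change, stronger
gene_c = [[0.0, 12.0]]                 # gene C: a different direction of change

profile_a = zscore_profile(gene_a, controls)
profile_b = zscore_profile(gene_b, controls)
profile_c = zscore_profile(gene_c, controls)

print(cosine_similarity(profile_a, profile_b))  # similar fingerprints
print(cosine_similarity(profile_a, profile_c))  # dissimilar fingerprints
```

In this toy setup, genes whose knockouts shift the features in the same direction get a similarity near 1 even when the magnitudes differ, which is the kind of comparison that would let knockouts be grouped by shared phenotype.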

Now when we compare these fingerprints, we can again really map them out morphologically in relation to each other. This is a simplified representation of that: we're seeing the non-targeting guides down here, and everything away from that shows you that there is a phenotype resulting from the knockout. But what we were able to do, and this was really the work of Luke Funk, was to group these into functional categories. Not only did this work in terms of us being able to map out,

DNA damage markers or vesicle trafficking, but it worked with a level of fine-grained precision that I actually don't think any of us expected. And just to show you what this is: not only did we map out translation, and we can group translation, but just by simply comparing images of cells we were able to distinguish the 40S subunits from the 60S subunits, and the translation initiation factors from the biogenesis components. We're able to get a really high degree of clarity on how these phenotypes relate to

each other. And so, you know, I love DepMap, for example; really, I'm a huge fan of DepMap, but it has limitations in the sort of gene requirements and similarities it captures. We actually substantially outperform DepMap in our ability to distinguish and group the translation components relative to each other. I think we can go stronger on this, and so, with work from Matteo in our lab and certainly many, many others: how do we go beyond simply these predesigned metrics in CellProfiler? And so the use of ML or AI

to be able to achieve that, and particularly the variational autoencoders and other strategies that many of you are talking about: I am really hungry for those as a cell biologist. If we can define where that hidden latent space is, as a different way to think about that underlying biology, I think that's going to be very, very effective. But it is also worth keeping in mind that we are simplifying this, and so ultimately one thing I want to highlight in terms of

a phenotype is that it's not a single metric, it's not a single value. A phenotype really is a range of states in which a cell exists. Just to represent this, here is a sort of toy example from Matteo: even a control cell is going to exist in a continuum of states, and so we've got to make sure, when we're using these sort of morphological approaches, that we do not oversimplify it. Then, when we're comparing, say, a knockout of gene one, it's going to have an overlap with the control, and therefore also distinctions.
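One way to keep the whole distribution in view, rather than a single summary value, is to compare populations directly. The sketch below is a toy under stated assumptions (simulated values for one hypothetical feature, an arbitrary bin range and bin count), not the analysis shown in the talk. It estimates how much a knockout population overlaps with controls using a histogram overlap coefficient, where 1.0 means indistinguishable distributions and 0.0 means no shared states.

```python
import random


def histogram(values, lo, hi, bins):
    """Fraction of values falling in each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        index = int((v - lo) / width)
        index = min(max(index, 0), bins - 1)  # clamp out-of-range values to edge bins
        counts[index] += 1
    return [c / len(values) for c in counts]


def overlap_coefficient(a, b, lo=-6.0, hi=8.0, bins=30):
    """Sum over bins of the smaller of the two bin fractions."""
    pa, pb = histogram(a, lo, hi, bins), histogram(b, lo, hi, bins)
    return sum(min(x, y) for x, y in zip(pa, pb))


rng = random.Random(0)  # fixed seed so the toy example is repeatable
# Simulated single-feature values per cell (hypothetical, arbitrary units).
control = [rng.gauss(0.0, 1.0) for _ in range(2000)]
knockout = [rng.gauss(1.0, 1.5) for _ in range(2000)]  # shifted and broader

overlap = overlap_coefficient(control, knockout)
print(f"overlap with control: {overlap:.2f}")
```

A partial overlap like this is the point being made: many knockout cells look like control cells, some do not, and both facts are part of the phenotype.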

We need to understand that heterogeneity: not throw it away, but really value it. That is a core part of the phenotype. Similarly, when another knockout comes in, it's going to occupy a different space, but be similarly heterogeneous within it. Okay, so I'm going to take two more minutes and tell you where I think the limitations still remain. I want to come back to this point about fingerprints and really think about that value, and just how valuable it is to compare two fingerprints. When we can do that with a

high level of precision and accuracy and quantification, the predictive value, to say you committed that crime or you didn't, is just so effective. But if we take a fingerprint on its own, if each of you were to give me just your fingerprint, it really tells me nothing about you. And so ultimately, as a cell biologist, I need a strategy not only to define this quantitatively but to really understand qualitatively what the cell is doing, and I need this information to really drive that prediction. So I'm going to show you

one example of that. I think a really nice feature of the morphological profiling is that not only did it group genes, it gave us different categories where we can see the resulting phenotype. So I'm going to zoom in on this one here, which is spindle bipolarity. Okay, ten of you probably know what this means; for everyone else, what are we talking about here as a phenotype? I just want to diagram it here. In the mitotic cell, you need to segregate the chromosomes. You form this really beautiful structure I've been showing you pictures

of: it's the mitotic spindle. When you form this bipolar structure, it allows you to segregate the chromosomes away from each other into two new daughter cells. But you can mess this up. There are a number of players that are needed to drive this, and if you do not effectively create this bipolar structure, you can actually have a collapse down to a monopolar structure, here: a very distinct thing, with the poles in the middle surrounded by the chromosomes. Okay, and so if we take this category, when we were looking at it, and

"we" here means Océane Marescal, who's a grad student in the lab, what she noticed is that there was something that created very robust monopolar phenotypes that just didn't make any sense to us. This is actually a proteasome component, PSMD1: a beautiful array of microtubules with the dot in the middle and a nice ring around them. This really surprised us, because it had never been implicated in spindle bipolarity before. And what was even more surprising to Océane is that it really was just a subset of these proteins. You take the

proteasome, the major degradation enzyme of the cell, and Océane was only seeing this with the cap being affected. So we wanted to pursue that, and because we have this qualitative phenotype, we can think about what's going into the resulting nature of it. For example, it could be an imbalance of spindle forces, as has been defined by previous work; it could be a defect in centriole duplication. We can come up with some conceptual models for what could be going wrong, and Océane was able to exclude and include various versions of these in thinking

about the basis of that phenotype. In a really nice detective story that I just don't have time to tell you about today, she was actually able to narrow this down to a single component that was responsible for this phenotype. Just to summarize a lot of work, what Océane has shown is that normally the proteasome is a tightly regulated degradation machine, but if you eliminate this cap, it results in a sort of premature and unregulated degradation of one substrate, KIF11, which is resulting in this phenotype. Okay, so final slide, just

to say: I think I've highlighted today how important it is to me as a cell biologist to have this kind of quantitative phenotype, something that we can really use a variety of strategies to unpack computationally. But I also need this qualitative phenotype, and what I really hope is that AI, here, working on this in the middle, can help bridge these. So this is from my ChatGPT; this is a picture of AI working hard in the lab. And so, just particularly to thank the work

from the people I mentioned, work from the large-scale cell biology team as well as Océane Marescal. I'm happy to take any questions. Thank you. Questions for Ian? I'll start, if you want to make your way to a microphone. I'm just curious whether you've had the opportunity to try to merge your datasets with datasets about a protein's natural localization in the cell. Yeah, and I think, going back to that first thing that I mentioned, there are a lot of different ways of looking at the function of a

protein: who it's touching, where it's localizing. I think that when we can get to that level, it's going to tell us so much more. For localization, I think we're probably at maybe a B-minus now. I'd say Human Protein Atlas and OpenCell are great resources for that. So yes, as we can start to bring that in; but we have not even touched that yet. I think even getting to the point where we could maybe merge Perturb-seq, DepMap, and the Vesuvius dataset is

going to be a really nice place to be. Yeah, proteomics as well. Yeah, sounds good. Was there a question? Yeah. Very exciting talk, thank you, Ian. I have a question regarding the phenotypic landscape. I assume you got that landscape from the latent features; what are your thoughts on relating those latent features to the actual cellular phenotypes? Yeah, so the datasets are all available, and we really hope other people will look at them. I think at this point we are

substantially undersampling the information that is there. These really are simple features, like how bright the DNA is, compressed across all of the cells to hit that median point. By comparing many of those features, I think you get other information, but I really think we're at basically step one of trying to extract robust information from that. Unpack it; you can play. All right, thank you very much, Ian. Our last speaker in this mini session is Paul Blainey

from the Broad and also MIT. Thank you. Super. So I'm going to bring it back to the middle of the Ian-Alex spectrum, which is a place I'm comfortable, and tell you about our approach to generating data at large scale. We've heard a lot today about how we can use large-scale data, interpret it, and build these powerful AI models. I want to talk about how we're going to generate the new large-scale datasets, and imagine what we can do

interpreting those data. I'm going to start with some disclosures and just highlight the ones that are most relevant. On the funding side, that's a wonderful new CZI award that I have, actually together with Ian, to scale up some of the data generation with OPS and think about new ways to apply screens. On the business side, two companies: one is Bifrost, which is a company commercializing instrumentation to generate OPS data, and then also Ramona Optics, a company I'm advising in the high-throughput

microscopy space. Okay, so I want to start with this archetype set, which is really two different archetypes for conducting interventional experimental science to look at, in my mind, how cells respond to experimental perturbation. These are real empirical experiments with an independent parameter that's experimentally manipulated, and then we look at an outcome. The two archetypes are things you'll recognize. One is benchtop experimentation. I always think of this as a postdoc, you know,

at the bench doing a focused experiment. Anna always describes this as hand-to-hand combat; it's a really evocative metaphor. So this is the small-scale, everyday biological research, which is in routine use: we have a high frequency of practice and all manner of readouts. Generally this is high-quality work, each experiment teaches us something, and there is generally high statistical power. At the other end of this spectrum is another archetype, which is high-throughput data, and the Broad gets a little bit

stereotyped in this category sometimes. What I'm illustrating with the parameters here is a set of trade-offs that I think many of you will recognize. To achieve really high scale, oftentimes these are substantial efforts that we're not practicing on a very routine basis. Sometimes there's a compromise in the information content of the readouts: we might get a binary readout, or we might only see results from the very top, tiny fraction of hits. That would be an example of a case where we're actually

only getting useful information from a small fraction of the measurements that are made, and that can lead to problems, as I've learned, in terms of big class imbalances when you're trying to carry out model training. And then oftentimes the statistical power of the assays that are amenable to high-throughput use is not as good, and in a screening context, for example for drug discovery, this often leads us to trade off our sensitivity in order to achieve higher specificity, so that things work out downstream

as we're investing more and more resources in our hits. Okay, so the observation that we've had, from this perspective of trying to generate training data, is that the rate at which we're learning from these experiments is something like the product of all the factors in these columns. So if the objective is to maximize the learning rate, one way to approach this is to just go and try to maximize the factor in each of those columns. This is really what my group does, and we've been doing this since

before, I think, I ever heard the term deep learning. But of course I think this approach is well matched to the challenge of generating really effective training data, and doing it at huge scales. So I'm going to focus in on OPS and pick up where Ian left off with his beautiful overview and contextualization of how we've been using it, and really show you where we're going with that kind of technology and some additional ways it can be applied. This is all about functional

genomics, and so I'm going to start by elaborating a metaphor for functional genomics, really with the point of telling you that it can be a little more complicated than it might appear to be. So what are we trying to do with these functional genetic interventional experiments in a cell biological context? We're taking the complex system of a cell, and here my metaphor, from the engineering perspective, is an automobile, and we're asking the question that Ian posed: what happens when we get rid of it?

So let's think first about single loss-of-function perturbations; of course, we can think about other types of perturbations as well. In the simplest type of pooled genomic screen, we're going to have a viability phenotype, so we can ask: does the car survive, or does it crash? We can go through a few examples here. Steering wheel: delete the steering wheel, and okay, it's probably going to crash, right? It's essential. Next we'll delete this; some of you probably recognize it as a brake caliper, part of the brakes.

Delete that, and it turns out the car's fine, not a problem. Turn signal: take the turn signal away, and yeah, the car's fine. So what's going on here? Clearly we'd like to be able to get a little bit more out of these screens than the essential/non-essential calls you saw there. With the steering wheel, okay, the screen kind of worked, but it's not that satisfying to just find out that the car crashes when you remove it. We're like, okay,

but why does it crash? We want to know why it crashes. The brakes: your car, it turns out, has four brake calipers, and they're in two different, diagonally redundant hydraulic systems. So brakes are certainly essential for a functional car, but they're redundant, so if you just knock one out you don't get a phenotype at the level of essentiality. Turn signals are a super interesting one. A turn signal doesn't affect the way your car drives at all, but it is pretty important when you have a

system of cars, and the turn signals are mediating the car-car interactions. That's something we would like to be able to learn about with high-throughput screening approaches. So you can see there's a lot of subtlety here, and a lot of different aspects of how we do functional genomics: how we do the perturbations, what type of model system we use, what our assay is, whether we have a binary readout or a complex readout. This stuff really matters in terms of what it's possible to learn

coming out of the experiments. This is what led to our interest in developing the optical pooled screening technology, which is an example of a high-content, pooled, single-cell screening technology. Now, a few of you are familiar with that; even more of you are probably familiar with another screening technology of that type called Perturb-seq, which similarly takes a pooled cell library with genomic perturbations and then does readout with single-cell RNA sequencing in order to produce a dataset structured like

this, where for each cell we have a gene expression profile and we know the identity of the gene that was perturbed. Optical pooled screening, as you heard from Ian, produces a similarly structured dataset, except now we have these image-based features, CellProfiler or other image-based features, and the identity of the perturbation from the in situ sequencing. I'll give you just a little bit more detail on the protocol. We use a high-amplification protocol where we take a padlock probe, span

that over the barcode or CRISPR guide sequence that we want to identify the perturbation with, and amplify that by rolling circle amplification. It's actually that amplification process that gives us the strong signals we need to be able to zoom out with low-power optics and look at a whole lot of cells in each field of view. This is a theme that others in the field have taken even further, by adding more amplification with additional in vitro transcription approaches. The other thing to

know about the protocol for OPS is that it has proven to be quite portable across model systems. Work in our lab and other labs at the Broad has now taken this into dozens of cancer cell lines and even some more challenging models, and those new protocols I mentioned with extra amplification are going to really allow us to go anywhere we want, even solid tissues. So we've been hard at work demonstrating this technology for morphological phenotypes, plus some other types of

phenotypes that are compatible with optical pooled screening, and I'm going to highlight just a couple here as we go through. Another message I want to share, and this work, a collaborative effort to screen for host determinants of Ebola virus infection outcomes together with Rob Davey at BU and Caroline's group, is really a great example, is the high quality of these OPS data and the image-based phenotypes. So here what I'm showing you is a

reproducibility analysis, where we take some of the different image-based features that we're scoring (here we're looking at viral protein expression), so screening scores from the genome-wide OPS that we carried out, and then look at how well those reproduced in a secondary, independent validation screen. We can summarize the correlation across the set of CRISPR perturbations, the genes that were perturbed in the secondary screen. This is pretty good reproducibility, not just on which genes hit but in terms of the

quantitative feature scores for each gene across those independent pooled screens. The other thing is we don't really get dropout, even at the single-cell level: if there's a cell there, we're going to get a number for the size or eccentricity of the nucleus, and so on. So the data quality has really been powerful. Of course, like Perturb-seq, these are also single-cell-resolved data, so we can choose which cells pass all the quality hurdles we want to put in front of them and ultimately make

it into an analysis. We can also do some kinds of exotic things. For example, remember the turn signals and the cell-non-autonomous phenotype. We have an example of a project like that in revision now, where a graduate student in my group, Anna, developed an OPS-compatible protocol for what's called the artificial synapse formation assay, where you can get primary neurons to synapse onto a cancer cell line in which we can carry out genetic perturbations, to look postsynaptically at regulators of synaptogenesis itself. So it's a cell-non-autonomous phenotype, and we can

run a pooled screen based on that. It's a great story that we hope will be coming out sometime soon. The other thing that's been happening, which is important for you to know if you'd like to learn more about this kind of screening approach, is that we've been hosting the Optical Screening Forum as a virtual seminar series for a couple of years now, and you can see we've had just an amazing diversity of great talks. Our most recent one was this summer, with a new-user

panel. You can see we had great representation from academia and industry there, including a couple of Broad alums that it was great to re-engage with on this topic. Okay, so I'm going to wrap up in the last minute or two with some more methodology that's really enabling for us and others in this single-cell, pooled, high-content screening world, and that actually ties together some of the expression-based approaches like Perturb-seq and some of the image-based approaches like OPS. This is the CROPseq-multi lentiviral

system for delivering guide RNA sequences to cells in order to effect CRISPR perturbations. This is a dual-guide version of the CROP-seq system originally developed in Christoph Bock's group in Vienna. The key thing about the CROP-seq system is that you get both a Pol III product, which is the functional guide RNA, and a Pol II product that's polyadenylated and can be read out using different approaches, including OPS or single-cell RNA-seq. So we retooled this so you would have

two guides, and this required quite a bit of re-engineering of the construct. It has also been set up so that it's easy to clone, and so that all the variable elements, the CRISPR spacers and associated barcodes, are really close together. It turns out that's crucial, because if they're too far apart, they actually recombine and get all swapped around and messed up during the lentiviral production step. And of course, if all the data are randomly relabeled, the computer scientists in the audience will know that that's really bad for the

statistical power of the screen. We're really pleased with the performance of this system. We're seeing really faithful retention of the sequences of those spacers and barcodes that are the key elements. We can make libraries with really uniform representation, which is effectively increasing our throughput a lot: we're learning more per cell that we read out. And we've got great control over this recombination, the swapping of different elements, with 90% of our constructs, after they're delivered to the cells we're going to screen, having

a totally faithful set of relationships, retaining that statistical power. This system, the CROPseq-multi system, is compatible with both single-cell RNA-seq readout and OPS. Here are some examples of OPS with very high detectability of guides, to make sure that in each cell we actually detect a guide and can include that cell, which we went through the effort of analyzing, in our final dataset. We've got some other tricks as well. For example, by being able to read out two barcodes at once, and a synthetic barcode

rather than an actual guide sequence, we can go from an OPS experiment that would have taken us 10 or 12 sequencing cycles to read out with standard CROP-seq to as little as a three-base readout experiment, again enhancing our throughput. So, bringing it back to where I started, we feel really good about where OPS is today as a data generation method with high learning rates to drive model training, and there's a really strong roadmap for how we can improve

that going forward. I won't go through it, for time, but there are a lot of different things going on that are going to enhance this. So I'll stop there and take any questions. [Applause] Questions for Paul? Raise your hand and someone will come to you with a microphone. I was just curious, I don't think we've ever talked about this potential: do you see any deal breakers for being able to transfer optical pooled screening to a tissue slice or in vivo or something? No deal breakers, yeah.

And just a little bit of detail: I alluded to these new protocols with an extra amplification step. That's done using what's known as the Zombie in situ amplification protocol, from the DNA copy of the target, using T7-based in vitro transcription. That protocol in particular gives screaming signals that I think are going to work really well in vivo. And folks in the field, Nir Hacohen in particular, his lab has been developing methods to do the in vivo screens and prepare

those types of tissues, which I think, yes, we could read out using OPS. Thank you, it was a fascinating talk. Just a question, maybe more on the technical side: Perturb-seq measurements and readouts are kind of noisy or difficult to model, but is CROPseq-multi enhancing or improving the limit of what we get from Perturb-seq readouts? Oh, great question, I really appreciate this. I think the answer is yes and no. No, it's not going to do anything for the expression profile itself;

that's going to retain the same statistical qualities. But in terms of detection of the perturbation, and whether the detected perturbation is in fact the true perturbation that was in that cell, I think CROPseq-multi has a lot to offer in terms of the guide detection rates and the faithful representation of the perturbations. We have those validation experiments in progress currently, so I hope to be able to tell you more about that next time. Excellent, thank you very much. Thank you so much, Paul. [Applause]
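As a rough back-of-the-envelope illustration of the barcode point made in the talk, that reading two synthetic barcodes at once can shorten the readout, the sketch below counts how many sequencing cycles are needed before the number of possible base combinations covers a library. It rests on simplifying assumptions that are not from the talk: four bases per cycle, barcodes designed to use the full sequence space, no extra cycles for error tolerance, and a hypothetical library of 4,000 constructs. It does not model why reading native guide sequences takes 10 or 12 cycles.

```python
def cycles_needed(library_size, barcodes_read_in_parallel=1):
    """Smallest number of cycles n such that 4 ** (n * barcodes) >= library_size.

    Assumes idealized synthetic barcodes with no error-correction margin.
    """
    if library_size < 1 or barcodes_read_in_parallel < 1:
        raise ValueError("library_size and barcodes_read_in_parallel must be >= 1")
    cycles = 0
    while 4 ** (cycles * barcodes_read_in_parallel) < library_size:
        cycles += 1
    return cycles


library_size = 4000  # hypothetical number of distinct constructs
print(cycles_needed(library_size, barcodes_read_in_parallel=1))  # 6
print(cycles_needed(library_size, barcodes_read_in_parallel=2))  # 3
```

Under these idealized assumptions, doubling the number of barcodes read per cycle halves the cycle count; a real design would likely spend some of that capacity on error detection.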

Connecting the Dots: Biology At Scale In The Age Of AI - Session 2