[Music] Okay, good. So I'm Jaya Tripathi, and I am a principal scientist at the MITRE Corporation here in Bedford, Massachusetts. Next, please. Just a line about MITRE: MITRE works in the public interest, and we operate several FFRDCs, federally funded research and development centers. One of the programs we have is an internal research program that fosters creativity. It's called the innovation program, and the work that I've been doing for the last seven or eight years was funded by this research program. Next, please. Just a couple of quotes on data, because that's

what I do; I'm a data scientist. Here is one I really like: "Without data, you're just another person with an opinion." I don't see the slides here, but that's okay. There's another one that I like. Next, please. This is from Sherlock Holmes, A Study in Scarlet, by Sir Arthur Conan Doyle: "It is a mistake to theorize before one has data." Next. Okay, so Harvard says I have the sexiest job. This is a quote from the Harvard Business Review: data scientist, the sexiest job of the 21st century. So

that's me. Next. So what does a data scientist do? What does someone like me do? First, it's an interdisciplinary approach, so we use a lot of techniques. You must know your domain, you must have domain expertise, and you use math, statistics, visualization techniques, graph analytics and so on. Essentially, we use all these scientific methods, algorithms and subject matter expertise to try to extract knowledge from data and explain the phenomenon that you're trying to address. Next, please. So you are familiar with truth in an informal way, in terms of ethics, philosophy,

religion; you've read about it in the Bible. And in math there is a concept of absolute truth: we use proofs like reductio ad absurdum, which is proof by contradiction, to establish the absolute truth. However, this notion of absolute truth does not exist in data science, particularly when applied to medical research. Why is that? Because medical research is based on observable properties of the phenomenon, something you measure. You must have heard the term "evidence-based." And what you measure and observe today may be different in a different data

set or at a different point in time. Also, if you subscribe to the notion of absolute truth, it violates the scientific method, because you're bringing in preconceived notions and biases. Lastly, truth in the scientific method is relative to the context or the model, and there may be different pathways or different models with which to approach your problem. Next, please. So one method of scientific research that I want to talk about is hypothesis testing, and there's a quote there from Edward Teller that I like: a fact is a simple statement that everyone believes, right?

it is innocent until found guilty. A hypothesis is guilty until found effective. So you begin with the hypothesis, right? You ask yourself an interesting question, you begin with the hypothesis, you make some assumptions, and then you create tests to test those assumptions and you build models. Then, how confident are you that the result is not due to chance? Once you're confident, you stick with your hypothesis; otherwise you revert to the alternate hypothesis. You accept the initial assumption or reject it in favor of the alternate position, and then you go on and try

and expand your hypothesis to other data sets to generalize. I have done several projects. In one of them, my hypothesis was that I can predict a particular kind of prescription fraud committed by prescribers, or physicians, solely by looking at data sets that have certain data elements. The data elements have to do with the distance the person traveled, certain combinations of drugs and so on. So that was my hypothesis, and the first thing I did was to go and seek data that would be suitable to test my hypothesis.

One such source was the PDMPs, prescription drug monitoring programs. All the states have them. They are prescriptions reported by pharmacies once a controlled substance is filled: an oxycodone or hydrocodone, a benzo (a benzodiazepine) and so on. Next, please. So once I have my hypothesis, I seek the appropriate data. This is an industry standard for data mining called CRISP-DM, the Cross-Industry Standard Process for Data Mining, and the first step is understanding your data. You really need to spend a lot of time understanding the data well. You do the data

quality assessment, and then the next stage is preparation: removing duplicates, spurious data, missing values and so on. The part that I would say I spent the most time on is modeling. There are several different techniques there, but I used predictive machine learning methods, examples of which are support vector machines, artificial neural networks (you all must have heard of those), random forests and so on. And that's a cyclical process. Essentially you're trying to predict something; in this case I'm trying to predict who the bad doctors are, and so

you try different features to feed into your model, and if that doesn't work out, you come back and change your features and so on. So it's a cyclical process, and at the end is evaluation and validation. Next slide, please. Okay, so I'm just going to show the next three slides, and then I'll end after that. I'm going to show you some examples of exploration. Before I got into the modeling and the machine learning, I did some exploration of the data and visualization. Histograms: you are all familiar with histograms, two basic concepts.
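
As a rough illustration of this kind of exploratory histogram, here is a minimal sketch that counts filled prescriptions per age bin, separately for each sex. The records, the 10-year bin width, and the function name `age_histogram` are all assumptions made for the example; none of them come from the talk's data.

```python
from collections import Counter

BIN_WIDTH = 10  # assumed bin width in years

# Hypothetical records: (patient sex, patient age in years), one per filled prescription.
prescriptions = [
    ("M", 7), ("M", 8), ("M", 9), ("M", 34), ("M", 52),
    ("F", 8), ("F", 41), ("F", 45), ("F", 67),
]

def age_histogram(records, sex, bin_width=BIN_WIDTH):
    """Count prescriptions for one sex, keyed by the lower edge of each age bin."""
    counts = Counter()
    for record_sex, age in records:
        if record_sex == sex:
            # Ages 0-9 fall in bin 0, ages 10-19 in bin 10, and so on.
            counts[(age // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Print one text histogram per sex so the two can be compared side by side.
for sex in ("M", "F"):
    print(sex)
    for lower, n in age_histogram(prescriptions, sex).items():
        print(f"  {lower:>3}-{lower + BIN_WIDTH - 1:<3} {'#' * n}")
```

Comparing the two printed histograms bin by bin is the same reading exercise the audience is asked to do on the slide.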

On this chart here I have male children on the upper graph and females on the lower one, and I'm just doing an age-gender histogram of the prescriptions that were filled. Remember, these are all controlled substance prescriptions. What do you see? Since the audience is small enough, it's okay for you to shout out. What do you see that's different in the two graphs? Correct. So on the right side... sorry, your name is? Yeah, and she pointed out that the difference is mainly in children under the age of 10 years, right? And so

it's worth exploring. A histogram is something that's very simple to do, and you should always do your basic univariate and bivariate statistics in the beginning. So I look at it, and I see that most of the drugs, almost all the drugs, were ADHD drugs, Ritalin and Adderall. You look up the literature and research it, and you find that there is a two-point-something times greater prevalence of that diagnosis in male children than in female children. At this point a lot of data scientists would have said, well, that explains it, you know, there's

a roughly 2.5 times greater prevalence of ADHD, and that explains it. But something in me wanted to delve further, so I looked further. I think this is the point that I'm trying to make: don't just look for the obvious explanation. It's worth it to spend a little bit more time and delve further. So I looked further and found that most of these prescriptions were actually filled by a pharmacy in a different state, more than 100 miles away, when there were several pharmacies much closer.
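
One way to look for the pattern described here, a fill at a pharmacy more than 100 miles away when closer ones exist, is sketched below. This is only an illustration under stated assumptions: the coordinates are made up, the great-circle (haversine) distance stands in for whatever distance measure was actually used, and the flagging rule in `is_suspicious_fill` is a hypothetical reading of the talk, not the speaker's method.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def is_suspicious_fill(patient, used_pharmacy, all_pharmacies, threshold_miles=100.0):
    """Flag a fill when the pharmacy used is beyond the threshold and a closer one exists."""
    used = haversine_miles(*patient, *used_pharmacy)
    nearest = min(haversine_miles(*patient, *p) for p in all_pharmacies)
    return used > threshold_miles and nearest < used

# Hypothetical (lat, lon) locations, chosen only for illustration.
patient = (39.77, -86.16)
near_pharmacy = (39.80, -86.20)
far_pharmacy = (41.88, -87.63)
pharmacies = [near_pharmacy, far_pharmacy]

print(is_suspicious_fill(patient, far_pharmacy, pharmacies))
print(is_suspicious_fill(patient, near_pharmacy, pharmacies))
```

A flag like this only says "look closer"; as in the talk, it is a starting point for investigation rather than a conclusion.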

And without revealing much more, let's just say that this led to a case. Next slide, please. So here I am, before getting into complex machine learning models or doing any artificial intelligence applications; this is me just trying to get a feel for the data. Essentially, after you do the data quality assessment, you look at distributions. You look to see if your data is Gaussian, because certain models require certain distributions. In this case I'm just hypothesizing and doing certain visualizations. In the upper-right one I am doing clustering. So what clustering is,

it's a method of unsupervised machine learning where you ask: what is the data saying about itself? You have no bias, no prior knowledge of the data. If I'm trying to make two classes, for example, it's called a binary classifier: two classes, bad guys and good guys, for example. That's as simple as a classifier gets. Can the data distribute itself into two groups? I'm not saying which one's bad and which one's good; it's just the data. Here I am trying to classify the people based on who took multiple drugs at the same time, so not one drug

in January for 30 days and then another one in August, but overlaps, and overlaps across different pharmacological groups. Do you guys see that neat clustering into two groups? That's promising, and it tells me this is an input that I should use in my input vector when I do machine learning. So this is uniquely separable data. Here is another hypothesis I have. In this case, these dots are all prescribers, or doctors, red and blue. The red ones are the ones that prescribed a certain combination of drugs

for which there is no medical legitimacy. By that I mean there are certain combinations of drugs which no doctor would ever give. Why? Because it leads to a 12 or more times greater chance of respiratory depression, coma or even death. So I hypothesized that for most of those who prescribed this particular combination of drugs, there were cash payments involved. That's a hypothesis, right? It's just a guess. Sometimes the hypothesis is based on some sort of expert knowledge, and sometimes it is just something worth trying out. So I wanted to show you here the blue dots and

the red. Do you see? So was my hypothesis correct or wrong? What do you see from that graph? Okay, I'll give you the answer: you could more or less say that it's kind of linear, along the 45-degree line, right? Here I have Medicare payments and here you have cash. My hypothesis was that patients who went to doctors who gave the sort of combination that's a no-no had some sort of cash payments mixed in, to go under the radar. And I would say that's not

an entirely true hypothesis, right? Because it's along the 45-degree line. However, there is something happening, because it's below this vertical line here: most of the people who did that also had a lot of prescriptions. This is a logarithmic scale, so they had a lot of prescriptions. These are examples of visualizing the data. Next, please. So here's a technique called geospatial analysis, or heat maps and hotspots. You must have seen this all the time. This is the map of Indiana, because that's where I got the data, and it is normalized by livable land area and by population, because if

you don't normalize by population, Indianapolis in the center will always turn out bright red because there are lots of people there. So you have to normalize by population. A ZCTA, a ZIP Code Tabulation Area, is for the most part the same as the US Postal Service ZIP code; the Postal Service construct is more or less the same thing. But I take the ZCTA because it gives me access to other demographic information, so I can tie in the ZIP codes or the ZCTAs with income, overdose deaths, education and so on. And so here

this is a heat map of a particular drug called Opana. I did heat maps of many different drugs, the top 10 drugs, and of polypharmacy, combinations of drugs, and also overlaid them with other data sets from the CDC and other places. So what sticks out here? The red areas: Hartford City and Scottsburg. Now, I don't really live in Indiana, so I gave it to Health and Human Services and the health commissioner of Indiana and asked him to look into it. We didn't have any plausible explanation for why

this is red. These are, you know, largely rural areas, and no one knew what was going on. So guess what? Two months after I gave the state of Indiana this chart, a huge HIV epidemic broke out in Scott County. A relatively large percentage of the population had HIV. This was in 2014, and the CDC and the federal government had to intervene and give aid and so on. At the conclusion of the investigation, it turned out that the HIV epidemic could be attributed to people sharing needles to inject Opana. So here's a technique that's

validated, kind of. This is my kind of truth. It's validated because there's a heat map that said, you know, explore this region further. And states typically have limited resources. So, not in this particular example but in others, if you want to see where law enforcement should be, police could decide where to place the drug diversion officers along the heat maps, and you could focus your education and outreach programs in the red-spot areas, and so on. Next, please. So in conclusion: first, you can't do really

good research unless you're really passionate about the topic. So pick a topic that really speaks to you, that you're passionate about. Make observations, ask interesting questions, formulate hypotheses, and develop predictions that you can test. Gather the correct data, test your predictions, and accept or reject your hypothesis. If it is accepted, see if you can generalize it and validate it with other data sets. For further reading, may I suggest reading about Simpson's paradox and the UC Berkeley gender bias study. How many of you are familiar with it? Good. And then the man-in-the-cave allegory from Plato's

Republic. Thank you for having me.
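
For readers who have not met Simpson's paradox, mentioned in the further reading above, the short sketch below shows the effect with made-up admissions counts. These are hypothetical numbers chosen to make the arithmetic easy, not the actual UC Berkeley figures, and the names `admissions`, `rate` and `overall_rate` are illustrative.

```python
# Hypothetical counts, NOT the real Berkeley data:
# department -> group -> (admitted, applicants)
admissions = {
    "A": {"men": (80, 100), "women": (18, 20)},
    "B": {"men": (4, 20), "women": (30, 100)},
}

def rate(admitted, applicants):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants

def overall_rate(data, group):
    """Admission rate for one group after pooling all departments."""
    admitted = sum(dept[group][0] for dept in data.values())
    applicants = sum(dept[group][1] for dept in data.values())
    return rate(admitted, applicants)

# Within each department, compare the two groups' rates.
for name, dept in admissions.items():
    print(name, {g: f"{rate(*counts):.0%}" for g, counts in dept.items()})

# After pooling the departments, the comparison flips.
print("overall", {g: f"{overall_rate(admissions, g):.0%}" for g in ("men", "women")})
```

In this toy data women have the higher rate inside each department, yet men have the higher rate once the departments are pooled, because the two groups applied in very different proportions to the department that admits few applicants. The same data supports opposite conclusions depending on the level at which it is examined, which is the talk's point about truth being relative to the context or model.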

Truth in Data Science | Jaya Tripathi | TEDxYouth@BHS