56:13
11/01/2023
Episode 17

Methylation Risk Scores

You may be familiar with polygenic risk scores (PRS), but have you ever heard of methylation risk scores (MRS)?

MRS are worth understanding: they quantify DNA methylation levels at specific genomic regions linked to particular conditions, shedding light on how epigenetic modifications may affect disease susceptibility.

In contrast, PRS calculates an individual’s genetic disease risk by considering multiple genetic variants across the genome, often identified through genome-wide association studies.

While PRS offers valuable insights into genetic predisposition for complex diseases such as heart disease and diabetes, it has its limitations, including the risk of false positives and challenges in clinical interpretation.

The choice between MRS and PRS depends on the specific disease or research context and the available data, as both scores provide unique perspectives on disease risk.

In this week’s Everything Epigenetics podcast, Dr. Michael Thompson and I chat about the importance and benefits of MRS, how to calculate such scores, and how these scores compare to PRS. For example, in his recent paper, Mike discovered that MRS significantly improved the imputation of 139 outcomes, whereas the PRS improved only 22.

We focus on the results from a study Mike published last year showing that MRS are associated with a collection of phenotypes within electronic health record (EHR) systems. Mike’s work added significant MRS to state-of-the-art EHR imputation methods that leverage the entire set of medical records, and found that including MRS as a medical feature in the algorithm significantly improves EHR imputation in 37% of lab tests examined (median R2 increase 47.6%). His publicly available results show promise for methylation risk scores as clinical and scientific tools.

Mike is currently in Barcelona working on using artificial intelligence to map and learn the biological effects of mutating everything (and anything) in every single position from a genetic variant to the change in splicing or to some other interesting phenotype.

In this podcast you’ll learn about:

– How Mike got into the field of Epigenetics
– What epigenetics means to Mike
– Mike’s interesting background starting with his undergraduate journey to his graduate and postgraduate studies
– The importance and limitations of electronic health records (EHR)
– The importance and benefits of methylation risk scores (MRS)
– The importance and limitations of polygenic risk scores (PRS)
– How MRS compares to polygenic risk scores (PRS)
– Mike’s paper titled “Methylation risk scores are associated with a collection of phenotypes within electronic health record systems” and what prompted this investigation
– How you create an MRS
– Why we don’t see MRS commercialized quite yet
– The EHR-derived phenotypes spanning medications, labs, and diagnoses that Mike investigated
– Future application of MRS
– The future of Mike’s career

About this Guest

Dr. Mike Thompson

A Los Angeles native, Mike attended UCLA for both his bachelor’s and doctorate. During his undergraduate studies of Microbiology, Immunology, and Molecular Genetics, Mike began his research career in a lab studying computational genomics in the context of cancers. During his PhD, he worked with professors Eran Halperin and Noah Zaitlen in method development in statistical genetics and electronic medical records. Now, Mike is pursuing a post-doctorate at the Centre for Genomic Regulation in Barcelona, developing interpretable deep learning techniques with Ben Lehner.

Google Scholar

Dr. Michael Thompson’s MRS Study

Hannah Went (00:00.863)
Welcome to the Everything Epigenetics podcast, Dr. Thompson. Thanks for being with me here today to chat. Looking forward to it.

Mike Thompson (00:08.298)
Me too, thanks a lot for inviting me. Excited to be here.

Hannah Went (00:10.547)
Yeah, absolutely. You’ve done a lot of work in this space. I know when we were talking beforehand, you said your first paper was relevant to EWAS, so that’s epigenome-wide association studies, capturing some sources of variability that may affect them. You had a really great second paper that was disentangling genetic effects that are specific to tissues and cell types from those that are shared across the tissues and cell types I mentioned. And you used a method there to talk about expression heritability

and did a transcriptome-wide association study. But your most recent paper that we’re going to focus on today is going to be the MRS or the methylation risk score and medical records paper. So I’m really excited to dive in and talk all things methylation risk scores with you. And really, I want to start off by just talking about epigenetics, obviously. What does epigenetics mean to you, and how has it influenced

your life because I know growing up in high school and college, I think I may have heard the word once. So I’d love to hear how you were kind of opened up to this wonderful field.

Mike Thompson (01:15.838)
Yeah. So, yeah, thanks. Basically, yeah, I also didn’t really hear about it too much in high school, not until exactly I was doing my undergrad at UCLA, and you kind of run through all the typical intro genetics classes. And I don’t think it was even until maybe late, late during the undergrad until taking upper divisions or things like that, where you start to hear about these, I don’t know, these epigenetic markers or…

Hannah Went (01:19.659)
Hehehe

Mike Thompson (01:45.154)
there are some parts that kind of, I guess you have this idea that a lot of changes or phenotypic changes shouldn’t be heritable from cell to cell. And you hear about this thing called epigenetics, which potentially is passing down histone markers or different methylation signatures that are going from cell to cell. And it kind of breaks this, I don’t know, this kind of original idea you have about things that are what should be heritable and what shouldn’t be heritable from cell to cell. And it kind of, I guess, goes against what you might.

expect intuitively based on what you learned originally. And so that was super interesting for me as an undergrad. And then when I started grad school, I joined the lab of Eran Halperin. There were quite a few people before me who had already been kind of diving deep into method development and things like that for methylation. Like I guess Elior Rahmani, for example, was like an early mentor for me. And he had written tons of papers about interesting variability and things they could do with methylation.

That was kind of where my interest really got piqued. So what epigenetics, I guess, what it meant to me, at least during my PhD, it gave me a lot of interesting problems to think about. So that was, I guess, like, I don’t know, kind of like the first thing I thought about it. And then I guess once we started doing some of these other papers and hearing more about epigenetic clocks and a lot of these interesting influences of epigenetics where outside from just…

Hannah Went (02:56.351)
Yeah.

Mike Thompson (03:11.382)
being an interesting scientific exercise, it was something nice to think about here and there.

Hannah Went (03:16.607)
Definitely. And where are you located now? I know you said, you know, there was, we could chat about how you were moving countries maybe for science versus staying locally. That really piqued my interest too. So tell me a little bit about that and maybe your journey through schooling and then how you are where you are today.

Mike Thompson (03:36.514)
Sure, sure, sure. So yeah, I did my undergrad at UCLA, and I grew up in basically a suburb outside of LA, so called Locker Center. I don’t know, I think it’s like a 40 minute drive to UCLA. So it wasn’t that far, but I was super happy to have the opportunity to study at UCLA. And then I think maybe my third or fourth year, I knew I wanted to do grad school. And so I applied again, mostly to UCs. And at the end of kind of…

Hannah Went (03:41.023)
Mm-hmm.

Mike Thompson (04:06.498)
doing these interviews and these processes, I was lucky to, I don’t know, have the ability to decide between a couple of programs. And I was deciding between staying at UCLA or going to UCSF, or University of California, San Francisco. And there were quite a few people there whose research really piqued my interest. And ultimately I had kind of decided to stay in LA and maybe do some type of collaboration or pseudo collaboration with the people at San Francisco.

And so I kind of just stayed at UCLA for my last year and started doing some of these doctorate or PhD classes early on in my last year to make room for doing other things. I guess, basically doing the prerequisites for the doctorate in the last year of undergrad, and then when I started grad school, I had the ability to study a few other subjects that interested me like statistics or computer science. And then…

kind of serendipitously, the person I was working with at UCLA told me that they had just recruited a couple of professors to UCLA from University of California, San Francisco. And I don’t know by, yeah, exactly. It was really a serendipitous moment. My second year of my PhD, the person I wanted to work with at UCSF was now at UCLA. And that was Noah, Noah Zaitlen, who I started to collaborate with. So…

Hannah Went (05:23.723)
Yeah.

Mike Thompson (05:34.698)
Yeah, I ended up working with both him and Eran and a few other professors during my PhD. And toward the end of my PhD, I was kind of looking for a change of scenery. And I don’t know, I had wanted to go abroad for my postdoc for, I don’t know, as long as I can remember during my academic career. And one of my advisors recommended the lab that I’m in now, which is Ben Lehner’s. And he’s doing, I don’t know, he has a lot of…

Hannah Went (05:37.815)
Mm-hmm.

Hannah Went (05:52.881)
haha

Hannah Went (06:01.527)
I’m gonna go get some water.

Mike Thompson (06:03.126)
very interesting ways of thinking about science. And I don’t know, clever questions about data that may have been public data for 10 years, but still no one has analyzed it in a way he’s thinking about. And so I kind of wanted to learn some of that intuition from him. And so I reached out and I don’t know, kind of very luckily, he was looking for someone who had a computational skillset like the one I was able to develop during my PhD. And…

Yeah, things worked out and I’m here now working with him. I’m excited.

Hannah Went (06:34.847)
Definitely. And where are you located now physically? Where is that at? You said?

Mike Thompson (06:38.038)
So now I’m in Barcelona in Spain. Yeah.

Hannah Went (06:39.947)
Okay, okay, so you are abroad, gotcha. Yeah, I was like, I know he’s at UCLA for his undergrad. I was like, I don’t think he’s still there, but I think he went abroad somewhere, so I just wanted to confirm that information. So very cool, and yeah, to anyone listening as well, I was actually lucky enough to do a quick study abroad program, and by quick I mean about two and a half, three weeks, should have been a lot longer, in Australia in my undergrad, so we did Melbourne

Mike Thompson (06:49.563)
Hahaha

Hannah Went (07:06.223)
and Cairns, and that was very cool, but it was way too short. So I encourage anyone who can get out there and study abroad while the worries are still low and you’re not making your roots in any one place to go and experience the world. A lot of insights that can be learned there. So I’m excited to see, yeah, what you do in the lab you’re in now. It’s definitely interesting that you were able to collaborate with everyone you wanted to at UCSF at UCLA still, so it’s funny how things work out sometimes.

But yeah, we’re going to jump right into that paper I mentioned where you’re creating all of these methylation risk scores. So this is going to be very, very new to listeners. You can break it down as much as possible. I’ve talked about methylation risk scores, I think, with Danni Gadd looking at the proteomic values one other time, but we didn’t get too much into the weeds. So…

If you’d like to go ahead and just define, you know, what is a methylation risk score and maybe compare that to a polygenic risk score too, because that may be more familiar.

Mike Thompson (08:08.898)
Sure, yeah, yeah, yeah. And then, do you want, I can explain a little bit of like the background or intro of like why we took the approach to, if you like, or do you want me, okay, sure.

Hannah Went (08:15.527)
Yeah, for sure. Yeah, go right into all of it. Yeah, would love to hear it all.

Mike Thompson (08:19.566)
Sure, sure, sure. Okay, cool. So, yeah. Basically, we were kind of interested in evaluating these genetics or genomics-based risk scores in the context of health records. And so, a lot of times right now on medical or electronic health records, at least, for example, the ones I’ve worked with in the States, the data is quite sparse. And so, this just means that

For a lot of patients, you have a lot of missing data. And this can kind of happen for a number of different reasons. You can imagine, in an idealistic world, only really young people between the ages of 16 and 25 are missing a lot of data because they’re super healthy, and they don’t really need to go to the doctor and get checkups that often. And whereas maybe people who are a bit older do go to the doctor more often because they want to monitor their risk, and maybe, I don’t know, they need to get blood and urine checked more frequently. And so…

the data looks kind of sparse, but we know why it looks sparse, and we have an idea. It turns out that the real world is not really like that, and there’s a number of reasons why people have a lot of missing data and health records. It could be access and all sorts of other things. And so because there is a lot of data missing and because it doesn’t have this random structure, in other words, there’s specific reasons why this data is missing the way it is. It’s not missing at random.

It kind of motivates that if you’re going to try and predict and fill in the gaps of some of this missing data of, ah, is this person at risk for a certain disease at a certain time point? I want to borrow information from other patients who have this disease. I can’t really fill in those gaps because maybe the population from which I’m trying to learn a predictor is different than the population from which I’m going to apply the predictor. So if someone’s in their 20s and I want to predict they have a risk of the disease,

and I only have a bunch of data from people who are on the older side of the spectrum, this predictor is not gonna be as calibrated or as useful for these younger people. And so kind of what this points to is that maybe something that’s gonna help you is something that’s external to these medical records to try and fill in these gaps. And so you can think of a variety of things that are gonna be external from medical records, but one thing we obviously like to think about are genomic sets of genomic information. And so kind of the classic…

Mike Thompson (10:46.326)
thing that’s been really, really studied for quite some time is these polygenic risk scores. And the basic idea is that I can collect the genetics of a certain individual. I can look at these spots in the genome that are known to have variability, and we’ll call them SNPs or single nucleotide polymorphisms. They don’t necessarily imply anything’s wrong with a human being. They just are kind of spots in the genome that we know vary from person to person.

And they’re useful enough that we can actually associate these little spots of variability with a variety of traits or sicknesses of interest. And after we’ve done these associations and found which parts of the genes or the genome are specifically relevant to a specific trait or sickness of interest, we can build a predictor that maps from these locations to one of these traits. And there’s tons of ways of doing this. Maybe the most classic is

running like a linear regression or some flavor of a linear regression where you are just looking for a weighted sum of SNP effects onto a trait and use that as a predictor. And it turns out these things are great. They tell us a lot about biology. They can also be really interesting for subtyping different traits. You can imagine maybe for one of the more common thoughts is in cancer. If somebody has a type of small cell lung cancer, for example.

maybe their genetic signature can tell us that, okay, when we’re combating small cell lung cancer, we have a myriad of treatment options, and based on this genomic signature of person A, we’re gonna use treatment number two, and based on the genomic signature of person B, we’re gonna use, I don’t know, the medication three, and you can kind of use this to guide decisions as kind of the ultimate hope. So, polygenic risk scores are super useful and really interesting.

But kind of one of the drawbacks of them when we’re looking at diagnosis and not doing this kind of patient stratification of medications or things like that is that they don’t change over time. And so when you’re a baby, if I look at your genetics, your genome is largely gonna be static. So if I look at your genotypes or your counts of your SNPs, what SNPs you have, that number is not going to change over time. And so if I build a polygenic risk score when you’re…

Hannah Went (12:54.591)
Mm-hmm.

Mike Thompson (13:11.038)
six months old and when you’re 60 years old, this polygenic score is going to give me the exact same number. Of course, you can imagine that somebody’s risk for disease is going to change over time. So maybe they’re, again, if we’re looking at lung cancer, if we know that somebody smokes cigarettes or they have certain exercise habits or they have certain dietary or I don’t know, any disease is going to be multifactorial, but you can imagine that there’s a number of things that are going to influence their risk of disease.

And so it turns out a lot of times when people are using polygenic risk scores to predict somebody’s disease risk in medical records, they’re not just using the polygenic risk score alone, but they’re using all these other patient factors like, I don’t know, smoking score and these kind of other covariates I’ve named. And so that’s great. You can do pretty well in prediction using these like kind of ensemble approaches of these other covariates and these polygenic risk scores.

But we had the idea that one thing that might be useful in this case is methylation. And why? It’s because DNA methylation does change over time. And so it’s this little, I’m sure most people know who are listening, it’s this little marker that lives on top of your DNA. And it changes over time. It changes based on exposures and environment, and maybe it’s responding to certain environmental factors or certain traits or diseases or disease risk in a human being. And so…

It turns out this is a really useful biomarker for doing prediction tests because I can actually get an idea of how much somebody is smoking cigarettes from their methylation. I can get an idea of their biological age and things like that, which I’m sure people know a lot about. But for this reason, it seems like kind of a clear biomarker to use because you don’t need to…

construct and measure all these different covariates that are potentially going to be useful to predict a disease risk, because this genomic biomarker has all these things baked into it. And so if you’re building a mapping from genomics to a trait of interest or a disease of interest, well, hopefully you’re going to tag some of these other information along the way, and even some potential environmental signal that you missed with this polygenic risk score. And so that’s kind of like the setup of what we did.

Mike Thompson (15:32.226)
And the idea was just to evaluate if we were to build these methylation risk scores. So it’s just the analog to a polygenic risk score where I’m going to take a weighted sum of somebody’s methylation states at different positions in the genome, and I’m going to use that weighted sum to predict some trait or disease of interest. How does this compare to using a weighted sum of somebody’s genotypes or SNP values to predict these traits of interest? And the whole goal of the paper was to kind of

establish what are the differences, is one more useful than the other for certain tasks, and the task we were interested in particular in this paper was doing the, was kind of filling in this missing data, so we call it imputation or diagnosis. So if somebody does have a disease at a specific time point or is at risk for, I don’t know, a variety of diseases or trying to predict certain traits, things like that.
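
The "weighted sum" idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name risk_score, the three sites, and all weights and measurements are made up and do not come from the paper. It only shows why the same formula gives a fixed number for genotypes and a changing number for methylation.

```python
def risk_score(values, weights, intercept=0.0):
    """Weighted sum of per-site measurements (hypothetical helper)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return intercept + sum(w * x for w, x in zip(weights, values))


# Made-up per-site weights; in practice these would be learned by a regression.
snp_weights = [0.30, -0.10, 0.25]
cpg_weights = [1.5, -2.0, 0.8]

# Genotypes are allele counts (0, 1, or 2) and stay fixed across a lifetime.
genotypes = [2, 0, 1]

# Methylation levels are proportions between 0 and 1 and can drift over time.
methylation_at_30 = [0.20, 0.70, 0.40]
methylation_at_60 = [0.35, 0.55, 0.60]

prs = risk_score(genotypes, snp_weights)            # same value at any age
mrs_30 = risk_score(methylation_at_30, cpg_weights)
mrs_60 = risk_score(methylation_at_60, cpg_weights)

print(f"PRS (any age): {prs:.3f}")
print(f"MRS at 30: {mrs_30:.3f}, MRS at 60: {mrs_60:.3f}")
```

With these made-up numbers the polygenic score is identical whenever it is computed, while the methylation score moves as the underlying methylation levels move.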

Hannah Went (16:27.703)
Sure. Yeah, that is a great definition. No, thank you for comparing the two. And again, I’m going to link your paper out when this episode is ready to go and publish, because it’s an amazing paper, published, and correct me if I’m wrong, I think, in Nature on August 25, 2022. So it’s still very, very recent. And it’s titled, you know, methylation risk scores are associated with the collection of phenotypes within the electronic health record systems. I’m really interested in, though,

Mike Thompson (16:40.27)
Thanks.

Hannah Went (16:56.095)
you know, what prompted that investigation? You realize kind of there’s this missing data there. Have you always known there was kind of that gap in the data or I guess how did you identify the question in the first place? I’m kind of backing up even further from what you just told us.

Mike Thompson (17:13.502)
Yeah, so I kind of joined the lab at a pretty lucky time when this was, people were thinking about this before I was. And so, Eran and a ton of other people at UCLA had kind of been interested in this electronic health record system initiative at UCLA in which they collected, I mean, after the kind of electronic medical records were restructured, there was a ton of

Hannah Went (17:18.167)
Mm-hmm.

Mike Thompson (17:40.918)
really clever people who wanted to see what they could predict and how they could fit genomics into the space of electronic medical records. And so there had already existed a team of quite a few people who had decided that at some point they would like to collect the genotypes for tens of thousands of individuals at UCLA or going to the UCLA hospital and asking patients if they wanted to be part of these studies and things like that. And then my understanding is…

Eran, Elior, and quite a few other people on the team had been interested in also collecting methylation because, aside from potentially being useful for prediction tasks and things like that, there’s a ton of basic research questions you can ask about patients and diseases and biological systems when you have multiple types of genomics. You could ask, do some of these genetic risks that exist through genotypes or SNPs alone,

Hannah Went (18:33.815)
Mm-hmm.

Mike Thompson (18:39.842)
how are they actually manifesting a change in a disease or a trait of interest? Is it acting through methylation? Is it mediated by the methylation? I think there’s just tons of interesting questions you could ask. The other thing is that LA is like a very cosmopolitan city and we know that genotypes, I don’t know, when we build a risk score from genotypes, that these things are often gonna be confounded by population structure and ancestral history of a lot of…

each individual or sample. And so supplementing that with something that’s a bit more dynamic and maybe captures more environmental risk was probably a more interesting question. And so a lot of people on the team were also interested in studying, like I guess one of the things when you’re writing a grant is if you don’t have enough money to do things at a giant population scale, you need to specify what your samples of interest are going to be.

Hannah Went (19:37.207)
Mm-hmm.

Mike Thompson (19:38.002)
And so the kind of sample, and this is called ascertainment. So we’re going to choose one basic trait that we’re really interested in studying. And we want to predict things relevant to this trait. And that trait for this study was people who have kidney injury or kidney injury after surgery. And there had been a lot of work done by people in the lab about, can you predict kidney injury using medical records alone? So if

somebody is going to have kidney failure in some given time frame, can you look at these existing covariates in the electronic health records and predict kidney status? And then so we kind of augmented this and said, well, hey, we can do a good job of predicting this, but we want to see if we can do a bit better. And one way that we think we can do a bit better at predicting this health outcome is maybe with a genomic biomarker and maybe with methylation.

And so I think the idea was to basically build a better predictor focusing on this space. And then once we had the data, we kind of just did a shotgun approach of seeing, okay, aside from kidney related traits, what else can we predict? And is there anything we can predict? And it turns out, luckily, yes, there’s quite a bit of information in the game.
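
One way to picture "seeing if we can do a bit better" is to compare how much variance a records-only predictor explains against a predictor that also uses a methylation risk score. The sketch below is only illustrative: the observed values and both sets of predictions are made-up numbers, and r_squared is a hypothetical helper, not code from the study. It shows how a relative R² increase of the kind quoted in the show notes would be computed.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up lab values for four patients (illustrative only).
observed = [1.0, 2.0, 3.0, 4.0]

# Hypothetical predictions from medical records alone...
baseline_pred = [2.0, 2.0, 3.0, 3.0]
# ...and from medical records plus a methylation risk score.
augmented_pred = [1.5, 2.0, 3.0, 3.5]

r2_baseline = r_squared(observed, baseline_pred)
r2_augmented = r_squared(observed, augmented_pred)
relative_increase = (r2_augmented - r2_baseline) / r2_baseline

print(f"R2 records only: {r2_baseline:.2f}")
print(f"R2 records + MRS: {r2_augmented:.2f}")
print(f"Relative increase: {relative_increase:.0%}")
```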

Hannah Went (20:35.039)
Yeah.

Hannah Went (20:52.955)
Yeah, no, I like that. I always like hearing the stories. I think it’s interesting. Like you said, you may have gotten in the lab at a lucky time, but it’s, it’s, you know, nice to see the progression of the questions asked and how they evolve and how, um, you know, instead of just going into the kidney disease, right? We were able to predict a lot of other things, um, which opens our eyes to, to even that much more. Um, and you mentioned a good point there as well, Dr. Thompson about the population, which you used. Um, it was.

just people, well, I’ll let you talk about it and explain that. Can you explain the population or that data set you’re using from the electronic health record systems?

Mike Thompson (21:31.022)
Sure, sure, sure. So basically the data set is comprised of people living in Los Angeles or who kind of regularly go to the hospital at UCLA. And so there were a lot of people who consented to being part of these studies. And so, yeah, thanks a lot to those who volunteered, if any of you are listening to this. And a lot of people who basically signed up for these genetic studies and then we could…

The cool thing about methylation is that once you’ve sampled some of the genotypes of these people, depending on what technology you use, you can actually reuse their blood sample for methylation. And so it was as easy as just asking for consent again, hey, you know you’ve already done this. Are you interested in being part of this methylation-based study? And so again, we got about 1,000 people who were willing to do that at the time. And…

It’s, we also tried to get, again, a mixture that’s somewhat representative of the population at UCLA. So, I mean, if the future goal is to improve these predictors and diagnoses in hospitals, then you want to do something realistic, and that’s going to basically be useful at some point in the future, and this, in theory, should represent the population at large. And so, a lot of times, when these polygenic risk scores are being made, since we know that

population structure or ancestry of patients is such a big influence in the variability you see, it’s common that you might only construct these things for a given population. And for a number of reasons, you might end up with differences in sample size for if you’re predicting on Europeans or people, Latino Americans or African Americans or things like this. And this is kind of something that

Hannah Went (23:05.42)
Mm-hmm.

Hannah Went (23:20.279)
Mm-hmm.

Mike Thompson (23:24.066)
I think people are still dealing with quite a lot in the polygenic risk score space. And in the methylation risk score space, I guess it’s known that ancestry and ancestral roots can affect some of the variability in methylation and sites and that this information pops up, but it hadn’t been too extensively evaluated how ancestral information is gonna affect your predictors in methylation alone. And so…

Basically, our population was comprised of a number of different self-reported race ethnicities in the UCLA biobank. Obviously, there’s things that are really fine-grained, but in order to get the most juice out of your sample size when you only have a thousand individuals, it becomes, I guess for lack of a better word, convenient or easy to group.

people into large self-reported identities or self-reported ethnicities. And so in the study, it looks like there’s a proportion of people who report as Asian American or African American, Latino American, and then Caucasian or white American, who don’t specify a further ancestral background. And then since we have kind of the genotypic backgrounds of these individuals, we can try and make sure that these groups are…

Hannah Went (24:23.351)
future.

Hannah Went (24:30.431)
Mm-hmm.

Hannah Went (24:38.175)
Mm-hmm.

Mike Thompson (24:48.77)
are basically genetic or genetically heterogeneous in the sense that we want to maximize each group size. So we want to maximize our power. We want to maximize how large each group is, and we want to make sure there’s no mismatches. So if someone didn’t really want to report their self-reported race ethnicity, we can kind of double-check a little bit based on what they reported and some genetic data where these people live and if we should put them in one group or the other.

Hannah Went (24:53.856)
Mm-hmm.

Mike Thompson (25:17.724)
and then construct the risk score based off of that.

Hannah Went (25:17.793)
Gotcha.

Yeah, yeah. No, that makes a lot of sense. And obviously, your sample size, you said that was about a thousand people. What I think is even more important is also the replication and the validation as well. So since we’re talking about the population, I might as well ask this question. I saw in the study as I was reading it, you also replicated several methylation risk scores in multiple external studies of methylation. And then you replicated…

22 of the 30 tested methylation risk scores internally in two separate cohorts of different ethnicities, of course, as we are chatting about. So why is that important? Of course, it’s probably rather obvious, but, and maybe would you argue that the validation is more important than even the sample size itself sometimes or kind of the variation there?

Mike Thompson (26:03.018)
Yeah, so it’s a good point. So the two are probably going to be tightly linked with the sample size and your ability to replicate it out. Because you can imagine if we found something with a really small effect size, but in our population, it happened to have a large effect size because of ascertainment issues, then maybe it won’t replicate at large. Or maybe there’s some statistical anomalies to consider. For us, I think in statistical genetics in particular,

Hannah Went (26:08.567)
Mm-hmm.

Hannah Went (26:22.871)
Mm-hmm.

Mike Thompson (26:32.546)
these replications are super important because, again, you want to see in one cohort, you want to make sure that your findings are legitimate and not confounded by some weird structure ascertainment bias in your data. Because, yeah, we collected a bunch of people who have kidney disease, but maybe there’s some people who by accident have some disease or medication they’re taking that we didn’t consider. And maybe some of our findings are spurious or not legitimate, and they’re due to this…

unmeasured, or not paid attention to, confounder. And so being able to build predictors on one population and showing that it applies generally to other populations is something super important to make sure to at least try and give you some additional evidence or support that your findings are real. And so we did this not only for the kidney related phenotypes, but also for psychiatric traits or some other things that we’re looking at. So

I guess one tricky part of the methylation risk score is the directionality. So are you actually predicting that someone is going to develop a psychiatric disorder, or are you seeing a response in their methylation from the psychiatric disorder? In other words, are you predicting a future time risk, or are you just seeing what’s going on right now? And so we focused on what’s going on right now. And we didn’t actually have a.

Hannah Went (27:46.359)
Mm-hmm.

Hannah Went (27:50.645)
Mm-hmm.

Mike Thompson (27:57.25)
phenotypes for specific people of whether or not they had, let’s say, schizophrenia, but we knew that they were taking medication that is specifically prescribed for people who have schizophrenia. And so one of the replications we did was say, okay, I can predict whether or not somebody is taking medication for schizophrenia based on the methylation. And ideally, this should be a proxy for whether or not they have schizophrenia. I mean, there’s a lot to unpack there because maybe

Hannah Went (28:08.513)
Mm.

Mike Thompson (28:26.418)
they’re taking medication, it should change their signature and we shouldn’t be able to predict they have schizophrenia, but there’s a lot to unpack. I’ll say that. But what we did is go to another cohort of people who had schizophrenia and we say, hey, we made this risk score to predict that people are taking a schizophrenia-related medication, which should serve as a proxy for whether or not you have schizophrenia. And we show that this medication risk score can actually predict whether or not someone…

has schizophrenia in an external cohort. And so this was super exciting because it validated a phenotype that should be independent or a trait that should be independent of kidney disease. I assume.

Hannah Went (29:06.896)
Mm-hmm.

Yeah, I know a lot to take in there. There’s just like a lot of, I feel like there’s more data and more insights I wanna talk to you about that, you know, and found in your paper. It’s like you said, okay, we’re measuring methylation, which is kind of, again, what genes are being currently expressed and how much, et cetera. So is the medication, is the methylation you’re seeing an effect from the medication or…

did it come beforehand because of the disease or kind of the outcome. So it’s, I think, sometimes a little bit hard to disentangle there, I’d imagine. But appreciate you going into the importance of the replication and validation of these markers. So getting back to just talking about methylation risk scores in general, what’s the benefit of using that over the polygenic risk score? I think you went a little bit in depth there when you said,

You know, it’s changeable. It’s going to kind of open our eyes to, to a lot more than just the polygenic risk scores where your disease for, or where your risk for certain diseases may be different at different time points, but what other benefits would you want to acknowledge?

Mike Thompson (30:16.022)
Yeah, so exactly that. So yeah, the biggest one I would say is that your risk changes over time. And so ideally, if we’re looking at diagnosis, you can think of basically, let’s say there’s a disease that changes over time. And yeah, you wanna capture it. This is one part where it’s gonna be particularly useful, but also these things that are gonna be diseases that might be basically

diseases that are diagnosed based on exclusion. So let’s say I’m interested, I mean, this is like a hypothetical, but let’s say I’m interested in studying a specific subtype of a digestive or autoimmune disorder like IBS or IBD. And there’s a lot of these things that a lot of these traits that exist. And the thing is, my understanding of it is that these can be hard to diagnose and a lot of times they can be.

kind of arduous, like you might need to change your diet for a number of months, or it might take a year before you’re diagnosed with which specific subtype of this disease you have. And so in a perfect world, we would have some cool biomarker or some easy way to do this so that the patient doesn’t have to suffer longer than they need to. And potentially since methylation is gonna capture a lot of these signatures of change in risk over time, things related to autoimmune and cell types, maybe this is gonna be one specific part where it’s gonna be used.

So for diagnosing cases where we don’t have enough information, it’s potentially one interesting avenue. Another could also be that methylation, since it captures, let’s say things like cigarette smoking and a lot of environmental things, these might be things that might be hard to measure or to kind of get information from the patient about. So let’s say that I smoke a pack of cigarettes a day and I don’t really want to admit this to my physician.

Maybe I feel embarrassed because of, I don’t know, a number of the campaigns and public health kind of messaging that’s been going on. And I underreport how many cigarettes I’m smoking a day to my physician. So I say, no, I just smoke once in the morning, once at night, and that’s it. Well, if I’m trying to come up with a good predictor of my disease risk, and I know that cigarette smoking is an important part of this risk, then I’m going to kind of underreport or underpredict the risk I have for a certain disease, because

Mike Thompson (32:41.386)
Well, like if cigarette smoking is super important, you don’t smoke that much, well, you’re probably not gonna develop this disease. But methylation is kind of, I don’t know, telling on you and saying, no, that’s wrong. You actually smoke quite a lot of cigarettes and your disease risk should be higher. And so this could be one thing. It could also just be unintentional. Like if the patient doesn’t remember what their diet looks like for the past two weeks or doesn’t remember how many cigarettes they smoked, because they’re not paying attention to things like that. So it could be a lot more innocent. And…

Hannah Went (33:09.855)
Yeah.

Mike Thompson (33:09.974)
because these things appear, it could just be useful overall. And it could also be capturing other things that we don’t really know. So maybe pollution has an effect on your methylation. And if I’m looking at you developing some type of asthma or some type of issue or respiratory issue in the future, and I have some people who are from the center of LA where it’s kind of car city and maybe somebody from, I don’t know, rural Oregon, and you wanna build a methylation risk score, well, it might be hard for me to come up with

Hannah Went (33:18.007)
Mm-hmm.

Hannah Went (33:31.575)
Mm-hmm.

Mike Thompson (33:41.487)
what’s a good measurement of this pollution and how much you’ve been exposed to it, how many days, hours per week, or coming up with some interesting variable of this pollution might be difficult to construct and throw into a model. But if methylation captures this, then you can imagine that I just throw in this biomarker into this predictor, into this model, and hopefully the model will just figure out what parts are important that capture maybe this pollution level effect. And I don’t need to kind of…

Hannah Went (33:53.547)
Mm-hmm.

Mike Thompson (34:08.098)
cleverly come up with a way to come up with some pollutant exposure score.

Hannah Went (34:11.623)
Yeah. Yeah, no, that’s great. I think, I mean, yeah, you gave a lot there. I mean, there’s a lot of applications. There’s a lot of benefit to using methylation risk scores. The IBS, IBD example is a really good one. I think, yeah, the gut, a lot of issues in the gut are hard to diagnose, leaky gut, things like that. So I think being able to

identify risk of those diseases that may be hard to define or again, diagnose in the first place. Your change in risk, like you mentioned, how the methylation is gonna be changeable, yet the polygenic risk score is gonna be stable over time. And then going into the self-reported status as well. So the work I do at TruDiagnostic and what I spend most of my time on, we have an alcohol consumption methylation risk score, smoking methylation risk score and different things.

So you are able to tell, hey, you know, maybe you don’t realize you’re not drinking maybe during the week, but you’re having a lot of drinks during the weekend and you’re not realizing how much you’re drinking over that time. So those methylation patterns can give us a lot more insight into that. And then like you said, identifying maybe even some methylation risk scores for pollution and toxic levels. I think that’d be interesting too. And what some people would really want to know, maybe they want to go through a detoxification process or maybe they are susceptible to other types of.

diseases and have increased risk depending on the pollutants in those areas. So that’s great. I appreciate you going over those.

Mike Thompson (35:40.77)
Sure, sure. Yeah, it’s… Oh, go ahead. I was gonna ask. It’s kind of, it’s a super interesting field of being able to predict like drug use, because I don’t know, it seems kind of like it could be a double-edged sword. Like for example, if someone is, you’re able to predict whether or not someone’s been taking their Alzheimer’s medication. And you know, if you have a methylation risk score and you can say, hey, there seems to be like a discordance between the two, and maybe this person needs additional help and they need to take their medication more frequently. So maybe…

Hannah Went (35:44.885)
Yeah.

Hannah Went (35:57.963)
Mm-hmm.

Hannah Went (36:08.735)
Mmm.

Mike Thompson (36:08.946)
They need a nurse visit or someone to kind of keep track of this. And this is one kind of like monitoring part where it could be really useful and really exciting and helpful. But I guess the other kind of side of the coin that’s a bit more scary is maybe biasing or kind of, I don’t know, doing some sort of discrimination based on drug use. And yeah, I guess one thing that’s easy to overlook when you’re developing these scores is what they’ll be used for and where they’re useful. But yeah, I guess just a thought experiment.

Hannah Went (36:13.971)
Oh.

Hannah Went (36:30.016)
Mm.

Hannah Went (36:35.623)
Yeah, I think you have to know that I love that. That’s a really, really good example for, yeah, maybe assistance with people who are forgetting to take their medications because of some neurological disease or disorder. So that’s a really good example and one I haven’t thought about. I think the drug use though is very interesting, like having, I think we will probably see some type of commercialized tests way down the line for marijuana testing, right? Drug testing and the kind of…

career world, right, for people for hire. But then again, you can get into the discrimination. So I think that is a really good thought experiment, but kind of leads me into the thought of, yeah, you know, if these methylation risk scores are so great and they can do all of these different things, you know, why aren’t we seeing them kind of everywhere? Why aren’t we seeing them, I guess, commercialized is what I’m trying to say.

Mike Thompson (37:29.198)
I think it’s a great question. I think because, yeah, maybe because we don’t know the directionality of the methylation risk score, and it’s really just going to be kind of a current time measurement of your risk for a certain trait or disease, people are a bit reluctant to study them. The polygenic risk score is particularly interesting because you can imagine you can do, since the risk doesn’t really change over time,

If this is predictive when somebody is really young, then maybe you can enact some protective measures to try and prevent this person from developing a trait over time. And so if you have a really good predictor when someone’s five years old, then maybe they’re going to be at risk for a heart attack or some trait of interest when they’re in their 30s or 50s. This is really important and potentially a really useful tool if we can use this early on.

Hannah Went (37:58.205)
Mm-hmm.

Hannah Went (38:09.844)
Mm-hmm.

Mike Thompson (38:27.778)
Kind of the nice thing about genotypes and polygenic risk scores is that the directionality should only be going one direction. It should be that this genotype is causing the trait because the trait is probably not gonna change your genotype unless maybe we’re talking about cancer and talking about different interesting somatic alterations, but probably the disease is not going to change your biomarker. And so we’re kind of more confident that the biomarker is going to affect the disease. And so…

Hannah Went (38:27.915)
Mm-hmm.

Hannah Went (38:45.96)
Yeah.

Mike Thompson (38:55.222)
since this is a bit of an easier relationship to follow, this could be it. I’m not so sure. I’m actually kind of optimistic that now people will hopefully incorporate these things in larger cohorts because I mean, we show that they’re super useful on only a thousand individuals. And there’s a lot of these cohorts that are hundreds of thousands of individuals. And in some cases, we’re doing better with a thousand individuals than polygenic risk scores using 200,000 individuals. And so…

Hannah Went (39:08.311)
Mm-hmm.

Hannah Went (39:12.855)
Mm-hmm.

Mike Thompson (39:24.474)
what I would like to see is what would happen if we up the sample size a few orders of magnitude, and surely there’s going to be more things we can predict. We show in the paper that obviously your power is going to go up as your sample size goes up for quite a few different traits. So I don’t know. I think if the goal is to do a current diagnosis, like let’s say again, one of these interesting

Hannah Went (39:29.451)
Mm-hmm.

Hannah Went (39:32.673)
Yeah.

Mike Thompson (39:54.086)
Arguably, what you want is the best predictor at a given time point. And if methylation outperforms polygenic risk scores at a specific time point, then hopefully you should use that to kind of give you like the best guess of what somebody’s got. So I don’t know, I’m optimistic that these will pick up some speed and start to be in larger cohorts.

Hannah Went (39:58.667)
Mm-hmm.

Hannah Went (40:13.849)
Yeah, definitely. I think it’s still too early to tell, right? Things take time. Polygenic risk scores have been around for a long time. I think polygenic risk scores, as great as they are, are kind of sad in my opinion, right? They’re, you know, your genetic information, it’s not going to be changeable. I know, for example, I have an APOE4 variant. I know my increased risk for dementia, Alzheimer’s, et cetera.

really that’s not ever going to change, but if I have a methylation risk score for that same thing, I can maybe say, oh, if I’m, I don’t know, taking a pharmaceutical or if I’m doing certain things to my lifestyle, then I can actually reduce that risk according to my methylation too, which I think is a lot more hopeful, right? And gives people maybe a better outlook. So I think using the two together is always fine as well. Just maybe depends on your perspective of those things.

Mike Thompson (40:51.744)
Yeah.

Mike Thompson (41:02.934)
Yeah, totally. I wonder if people would do this type of thing like with methylation age becoming kind of a consumer good at some point if people want to keep track of, ah, I have a chronological age of, I don’t know, 35, but my methylation age says I’m 36. I need to start exercising. It’d be interesting, apart from disease risk, obviously, to kind of gamify this thing. It’s a weird idea that an individual will have these risk scores like readily available to them.

Hannah Went (41:05.344)
Yeah.

Mike Thompson (41:32.558)
But yeah, it’s interesting to hear that you’ve done some of these tests. I’ve still not got my genome sequenced, but maybe in the future.

Hannah Went (41:41.827)
Yeah, yeah. No, it’s all super interesting again. I think we don’t know what we don’t know and we’ll kind of see what people come up with. So yeah, like you said, really good thought experiment and could chat about that all day. One thing I want to focus on though, Dr. Thompson, I know you mentioned this briefly, but I want to make it clear for listeners. So I’m going to pull a quote out of your paper. In your paper, you actually say that MRS significantly improved the imputation

of 139 outcomes whereas the polygenic risk scores improved only 22. And then you know you give a lot of examples on how those imputations actually improve based on methylation in medications, labs, and diagnosis codes, respectively, whereas again genotypes only improved the labs, by a very, very small amount compared to

methylation risk score. So what do you mean by the imputation in this context? I know you said that was like what you were trying to predict, but could you explain that?

Mike Thompson (42:37.526)
Yeah, yeah, of course. So basically, when we look at a specific time point in the medical records, so let’s say I look today. And I can look at all the information that’s missing for a bunch of individuals. And let’s say, for example, I haven’t been to the doctor. And let’s say I haven’t been in a year. And there’s a lot of information for a bunch of people at this specific time point or whenever. And I want to say, OK, what’s Mike’s risk for developing?

Hannah Went (42:45.227)
Mm-hmm.

Hannah Went (43:01.771)
Mm-hmm.

Mike Thompson (43:06.838)
this trait of interest on this specific day or at this specific time. Well, what I can do, because I don’t have that data, because Mike didn’t come to the hospital, it’s a missing data. It’s not there. And so, imputation in this case, we’re kind of referring to the process of filling in the missing data. And so, in other words, I wanna guess whether Mike has this disease at this specific time. And so, anytime we’re kind of using imputation in the paper, we’re saying, we wanna do prediction.

or filling in the blanks at the current time point. We’re not gonna look in the future and say, is Mike going to develop this in the future five years from now? No, instead we’re gonna say, right now, does he have this trait? What is his lab value? Is he taking a medication? Things like that. And so that’s kind of what we’re trying to say for diagnosis. Now, to kind of evaluate what’s going on, you can’t look at missing data that’s actually missing because you don’t have a ground truth and can’t say, oh, you’re doing a great job because well,

Hannah Went (43:39.671)
Hmm.

Hannah Went (43:51.677)
Mm-hmm. Mm-hmm.

Gotcha.

Mike Thompson (44:05.714)
you can’t really evaluate if you’re doing a great job because the data is not there. So kind of the way you do this is you take, say 90% of the data and you say, I’m gonna build a model on this 90% of this data. And then the other 10% that we have, that’s not missing, this other 10% that we’ve measured and know what the disease status is, we’re gonna pretend that it is missing. And what we’re gonna do is using this 90%, we’re gonna train a model and say, okay, using 90% of this data,

Try and build a mapping from methylation or genotypes to this trait of interest. And this is what’s gonna be our model. So we’re gonna have a methylation model and a genotype model that we built on 90% of the available data. Now, on the remaining 10% of the available data, we’re gonna say, okay, apply this model on this 10% and see how predictive you are. And this is basically supposed to give us an idea of how accurate we’re gonna be in this imputation strategy. So when we’re filling in the blanks,

Hannah Went (45:02.315)
Mmm.

Mike Thompson (45:03.598)
because we’re simulating that this 10% is blank. So we’re filling in this 10% blank and then saying, how accurate are we doing? And then you kind of do a round robin approach where you change who these 10% are. So everyone takes a turn of being in the 90% or in the 10% to kind of get an idea of how accurate your diagnosis or your imputation is. And so that’s kind of what we mean. One important caveat of those numbers is that we are…

Hannah Went (45:12.875)
Mmm

Mike Thompson (45:32.962)
doing all this on 1,000 individuals. And we know that in general, genetics or polygenic effects, so we’re looking at SNPs, these little variable spots, the effect sizes are gonna be really small generally. And so what that means is to pick up these really small effect sizes, you need a ton of individuals because if you want to be confident that they’re not zero, the more people you have, the more confident you’re gonna be that it’s not a zero effect. So when you’re building this weighted sum and you’re saying,

Hannah Went (45:36.023)
Mm.

Hannah Went (45:59.2)
Mm-hmm.

Mike Thompson (46:00.842)
How much do I weight each of these important markers? These weights are what you’re learning from the data, but these weights are generally gonna be super small. And so what we’re interested in doing is saying, given the same sample size, so condition on the fact that we’re only looking at a thousand individuals or 800 individuals, what is a more useful biomarker in these diagnosis tasks? And so methylation has larger effect sizes, which…

Hannah Went (46:09.132)
Mm-hmm.

Hannah Went (46:24.736)
Mm-hmm.

Mike Thompson (46:29.862)
at least for the things we looked at or discovered, obviously. And so these effect sizes should be easier to pick up because they’re larger. And so basically, you have this kind of issue of power. So maybe for genotypes, my predictor could be super good, but I need 200,000 individuals. But for methylation, since the effect sizes are quite a bit larger, I only need 1,000 individuals to build a useful predictor. And so these numbers where we’re saying, hey, we kind of are blowing the PRS out of the water,

Hannah Went (46:49.887)
Mm-hmm.

Mike Thompson (46:59.69)
Well, we have to keep in mind that we’re looking only at 1,000 individuals, and maybe the predictive power will kind of saturate for methylation, and a PRS will be equally as good when we’re looking at 500,000 individuals for both. That doesn’t necessarily seem to be the case for a lot of the traits we looked at, but it’s something to keep in mind.
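The hold-out scheme Mike describes here (learn the weights on 90% of the people, pretend the other 10% are missing, score them, then rotate so everyone takes a turn) can be sketched roughly as follows. This is only an illustration on simulated data, not the paper’s pipeline: the function names, the toy data, and the penalty value are all made up, and a ridge penalty is used as a stand-in because it has a simple closed form, whereas Mike talks about penalties that can also make the solution sparse.

```python
import numpy as np

def fit_ridge(X, y, penalty):
    """Learn one weight per marker with ridge-penalized least squares."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    n_markers = X.shape[1]
    weights = np.linalg.solve(Xc.T @ Xc + penalty * np.eye(n_markers), Xc.T @ yc)
    intercept = y_mean - X_mean @ weights
    return weights, intercept

def round_robin_scores(X, y, n_folds=10, penalty=10.0, seed=0):
    """Hold each person out exactly once and score them with a model
    trained on everyone else (10 folds = the 90%/10% split)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    scores = np.empty(len(y))
    for held_out in np.array_split(order, n_folds):
        train = np.setdiff1d(order, held_out)
        weights, intercept = fit_ridge(X[train], y[train], penalty)
        # The risk score is just a weighted sum of the markers.
        scores[held_out] = X[held_out] @ weights + intercept
    return scores

# Simulated stand-in data: 1,000 people, 50 methylation sites, 5 of them informative.
rng = np.random.default_rng(1)
n_people, n_sites = 1000, 50
methylation = rng.normal(size=(n_people, n_sites))
true_weights = np.zeros(n_sites)
true_weights[:5] = 0.5
trait = methylation @ true_weights + rng.normal(size=n_people)

mrs = round_robin_scores(methylation, trait)
r2 = np.corrcoef(mrs, trait)[0, 1] ** 2
print(f"held-out R^2: {r2:.2f}")
```

Because every score comes from a model that never saw that person, the held-out R² is an honest estimate of how well the "fill in the blanks" step would do on data that is truly missing.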

Hannah Went (47:03.883)
Mm.

Hannah Went (47:21.215)
Yeah, no, I appreciate that little caveat there. Yeah, that makes sense. And that’s a really interesting approach. I like that, you know, you took 90% trying to predict that 10% and then, um, obviously you’re very good at filling in that information and predicting that, um, kind of missing 10% as well. So, um, talking about kind of the, the EHR-derived phenotypes. Um, yeah, what were you looking at? You know, I know you’re doing different medications, labs, um, diagnoses. Can you maybe name some of those, uh, that you’re grabbing?

Mike Thompson (47:49.394)
Yeah, so to be quite honest, we basically pulled anything with, quote-unquote, a big enough sample size. So we wanted something where the disease wasn’t super rare, so it appeared in more than 5% of the individuals. And something that… So the prevalence of the disease, we say, like it appeared in more than 5% of the individuals. And also the sample size was greater than, say, 100 individuals. So…

Hannah Went (47:58.57)
Okay.

Mike Thompson (48:18.486)
because this missing data thing, you have a lot of diseases where maybe if I’m checking for, let’s say, I’m taking kidney status or kidney disease, maybe I’ve only measured this trait in 500 individuals. And I can only build a model on these 500 individuals, assuming that more than 5% of 500 are positive. And so basically, we looked at every lab panel. So if I took the model, I could measure

Hannah Went (48:35.404)
Mm-hmm.

Mike Thompson (48:47.17)
blood or urine measurements, and I want to look at metabolites or things like that, or any disease or any medication, anything that kind of fit these criteria. So we end up looking at like 600 or 700 traits. Off the top of my head, yeah, a lot of these kidney-related ones. There’s also a lot of medications. So these kind of psychiatric-related disorders, there could be different cancers, different drugs that people take for cancer,

Hannah Went (49:03.403)
Mm-hmm.

Mike Thompson (49:15.754)
heart attack and heart-related phenotypes, whether they’re taking alpha or beta blockers and what specific medications they’re taking. Labs, it could be things like glucose, C-reactive protein, creatinine, urea nitrogen, all sorts of hematocrit, hemoglobin. It kind of runs the gamut of things that are, should in theory be relatively common in medical records or.

Hannah Went (49:22.412)
Mm-hmm.

Hannah Went (49:25.942)
Yeah.

Mike Thompson (49:43.062)
If you’re going to go get a blood or urine test, these things that are measured.
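The two inclusion rules Mike mentions for yes/no traits (measured in more than about 100 people, and present in more than 5% of them) amount to a simple filter. A rough sketch, where the function name, the data layout, and the toy records are all hypothetical and only the two thresholds come from the conversation:

```python
def eligible_traits(records, min_measured=100, min_prevalence=0.05):
    """Keep binary traits that are measured often enough and are not too rare.

    `records` maps a trait name to one value per person:
    1 = has it, 0 = does not, None = never measured (missing).
    """
    kept = []
    for trait, values in records.items():
        measured = [v for v in values if v is not None]
        if len(measured) <= min_measured:
            continue  # too few people with this trait recorded
        prevalence = sum(measured) / len(measured)
        if prevalence > min_prevalence:
            kept.append(trait)
    return kept

# Toy records for 1,000 people.
toy_records = {
    "kidney_disease": [1] * 60 + [0] * 440 + [None] * 500,   # 500 measured, 12% positive
    "rare_disorder": [1] * 10 + [0] * 990,                   # 1,000 measured, 1% positive
    "rarely_measured": [1] * 40 + [0] * 40 + [None] * 920,   # only 80 measured
}

print(eligible_traits(toy_records))
```

Continuous lab values would only need the sample-size rule, since prevalence has no meaning for them.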

Hannah Went (49:44.)
Yeah.

No, that definitely makes sense. Hey, what’s available? What can we do with it, with the data that we have? So kind of this, yeah, this next question I was pondering and was thinking about too, which maybe you just answered it, but do you think surrogate methylation risk scores for things like inflammatory markers or disease specific markers like, you know, the HbA1c you kind of mentioned or other things for kidney function would also be helpful to improve MRS? Which I think you just answered, yes, right? The more data, the better.

But how are you including all of those in an MRS model? I guess that goes back to the foundational question of, yeah, how are you making these MRS in the first place? And I know you touched on that briefly at the beginning, where you said there’s different flavors, which I liked that of the linear regression models that you use. So yeah, maybe just a quick overview of taking the weighted sum, being able to make that predictor with the MRS. But.

How are you adding more data on top of it to make the predictor better of a certain outcome?

Mike Thompson (50:49.314)
Sure, yeah, yeah, no, it’s a great point. So yeah, to start off, the methylation risk score or the polygenic risk score is this weighted sum. And so what you do is you set up a very basic regression framework where you have some outcome of interest, and then all of the methylation sites or all of the genotypes, all of these basically, these spots of these methylation measurements or these SNPs across the genome. And what you can basically do is just regress this

Hannah Went (50:58.944)
Mm-hmm.

Mike Thompson (51:18.678)
outcome or disease of interest onto these biomarkers. And this regression strategy, you can use some penalty on the effect sizes and say, hey, I want my solution to be sparse, meaning use very few markers, or I can kind of slide across the spectrum and say use as many as possible or use as many as are helpful. And so basically this regression procedure is what gives you these weights. And so to kind of come up with a predictor,

Yeah, exactly. The predictor is a weighted sum of these markers. And the strategy for learning these weights is this penalized regression framework. And so to answer your question, basically, yeah, in kind of the original polygenic risk score framework, you have your polygenic risk score and then maybe some exercise status, cigarettes, age, biological sex, things like this. And in addition to the polygenic risk score, you toss in all this information.

in your final model, and then you use that final model to predict the outcome of interest. And so your prediction is actually coming from all of these components, not just the weighted genetic score. It’s coming from quite a few different things. The benefit of methylation is that maybe when I’m building a marker for kidney status, maybe in my predictor of kidney

Hannah Went (52:23.531)
Mm-hmm.

Mike Thompson (52:45.17)
Also, what’s useful, like you say, is going to be creatinine or urea nitrogen. And maybe some of these methylation sites that are associated with these metabolites are actually being included in the model. So if I say to the model, hey, give me 100 sites that are going to predict whether or not someone’s going to get a kidney disease. Well, maybe there’s some kidney disease sites that are really useful predicting this, but maybe there’s also some sites that give me some knowledge about urea nitrogen or creatinine, and maybe these sites are going to get included in these 100.

And so maybe the model is smart enough to kind of capture these things for us if they’re going to be useful. The kind of other side of the coin, if we go away from biomarkers alone, is how people are kind of doing imputation as a whole right now. And so one common approach, which we refer to in the paper, is called SoftImpute. But basically what you can do is you can imagine I take a patient and I have all of their medical

history of whatever was available. So if they’ve come in the last few years and I have some blood, urine, disease status, medication status for a bunch of different things. And what I can do is I can construct a matrix. So basically like a big spreadsheet of all of my individuals, all of their history, what medications they’re taking, their values for a bunch of lab panels, and then yes or no if they have a bunch of different diseases. And I’ll construct this giant spreadsheet for a bunch of different patients.

And what I’ll do is I’ll fill in the blanks using what’s called a matrix completion algorithm. So basically, if I look at the spreadsheet and I try to fill in the blanks, there’s a few clever ways of doing that. And it turns out those things do really well. And people use those a lot when they’re building machine learning algorithms to predict different traits in medical records. But what our question was is, if very naively, I construct these MRS, and then instead of just saying, hey, fill in the blanks using

labs, medications, and disease diagnoses, what happens if I also give you these methylation risk scores as now a fourth data type in this matrix completion, or spreadsheet completion, algorithm? How does that help, or what does that do? And it turns out, if I just treat the MRS as like an additional patient measurement, I can actually predict quite a few different things a lot better than if I use their medication, or sorry, their medical history alone.

Mike Thompson (55:05.422)
And so this is what was super exciting for us, is if we say, hey, how people are currently doing things in the state of the art for filling in the blanks in these machine learning algorithms. We know that these methylation risk scores are useful for predicting some of these things. But if we want to augment and help out some of these imputation algorithms, can they actually be useful? And excitingly, the answer was yes.
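To make the "spreadsheet completion" idea concrete, here is a minimal sketch of soft-thresholded SVD imputation (the core idea behind Soft-Impute) run on simulated data, once with only medical-record columns and once with an MRS added as one more column. Everything here is an assumption for illustration: the simulated patients, the column names, and the shrinkage value are made up, and this simplified loop is not the implementation used in the paper.

```python
import numpy as np

def soft_impute(table, shrinkage, n_iters=100):
    """Fill NaN cells by repeatedly replacing them with a shrunken
    low-rank reconstruction of the whole table."""
    missing = np.isnan(table)
    filled = np.where(missing, 0.0, table)  # start blanks at 0 (the simulated mean)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U * np.maximum(s - shrinkage, 0.0)) @ Vt
        filled = np.where(missing, low_rank, table)  # observed cells never change
    return filled

# Simulated patients: a hidden "severity" drives both a lab value and the MRS,
# while the other record columns are driven by something unrelated.
rng = np.random.default_rng(0)
n_patients = 500
severity = rng.normal(size=n_patients)
unrelated = rng.normal(size=n_patients)
ehr = unrelated[:, None] + 0.5 * rng.normal(size=(n_patients, 3))
lab = severity + 0.3 * rng.normal(size=n_patients)
mrs = severity + 0.3 * rng.normal(size=n_patients)

# Pretend roughly 10% of the lab values are missing so the fill-in can be checked.
hidden = rng.random(n_patients) < 0.10
lab_masked = np.where(hidden, np.nan, lab)

def held_out_error(columns):
    """Mean squared error of the imputed lab values on the hidden cells."""
    table = np.column_stack(columns + [lab_masked])
    completed = soft_impute(table, shrinkage=12.0)
    return np.mean((completed[hidden, -1] - lab[hidden]) ** 2)

err_without = held_out_error([ehr])
err_with = held_out_error([ehr, mrs])
print(f"error without MRS: {err_without:.2f}, with MRS: {err_with:.2f}")
```

In this toy setup the MRS helps by construction, since it shares a hidden driver with the masked lab. Whether it helps on real records is exactly the empirical question Mike describes testing.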

Hannah Went (55:29.087)
Very cool. Yeah. Thanks for just putting that out very clearly. I wanted to make sure we had a good definition for all of the listeners. I was, I’m, I’m writing as, as we go and kind of making these different diagrams and things to, to make sense of it all. So, um, no, that’s great. Dr. Thompson. I appreciate that. And we’re getting to the end of this, this episode here. Um, just a couple of final questions for you. Um, what’s next? What are you working on in Barcelona? What’s the next big subject for you?

Mike Thompson (55:53.2)
Yeah, oh, thanks for asking. So now I’m working with Ben Lehner and a few other people out here. And kind of the idea is they have a lot of these data sets that are called massively parallel reporter assays or deep mutational scans. And basically the way they work is you have some…

let’s say you have a gene of interest or a protein of interest, and you’re interested in studying some phenomenon of this. And so it’s kind of, I guess, for me, one of the really exciting parts was that you’re moving from the statistical genetic side of things of like a GWAS or an EWAS where you’re associating a variant and saying, hey, it’s related to a trait. And what you’re actually doing is you’re going into a model organism, and you’re saying, hey, if I take all of these variants that

I think are associated with a trait and I go one by one and I mutate each of them in this model organism, can I actually learn their biological effect? And so let’s say I’m interested in studying some disease that’s mediated by alternative splicing of a gene. So let’s say if this gene alternatively splices incorrectly, it leads to a disease.

What I can do is look at GWAS and say, hey, there’s a bunch of different variants that are leading to these kinds of errors or changes in alternative splicing. I don’t know exactly what they do, but I’m interested in finding out their exact effects. What I can do is go to a yeast cell or anything like that, throw in this gene with each mutation one by one, and measure what’s actually happening.

That’s kind of the broad picture of what we’re doing, but what we’re doing now is saying: if we do this at scale and just mutate everything in every single position in any sort of way, can we learn what this mapping should be from genetic variant to the change in splicing or to some other interesting phenotype? And so what I’m doing now is building these mappings, and the way we’re doing that is using some flavor of artificial intelligence that’s not only accurate but super interpretable. And so…

Mike Thompson (58:05.13)
It’s still kind of in the model building space, but now I’ve kind of switched basically the toolkit I’m working with. So going from linear models to neural networks or some nice variant thereof, yeah.
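
[Editor's illustration] The "mapping from genetic variant to phenotype" can be made concrete with the simplest linear version of it: an additive model in which each single substitution gets its own effect and a variant's predicted phenotype is the wild-type value plus the sum of its substitutions' effects. This is only a sketch of the linear starting point mentioned above, not the models being built in the lab. The sequence, the scores, and the function names `fit_additive` and `predict` are all hypothetical.

```python
WILD_TYPE = "ACGT"  # hypothetical four-base reference sequence


def mutations(variant, wild_type=WILD_TYPE):
    """Return the set of (position, new_base) substitutions relative to wild type."""
    return {(i, b) for i, (a, b) in enumerate(zip(wild_type, variant)) if a != b}


def fit_additive(measurements, wild_type=WILD_TYPE):
    """Estimate one effect per substitution as single-mutant score minus wild-type score."""
    baseline = measurements[wild_type]
    effects = {}
    for variant, score in measurements.items():
        muts = mutations(variant, wild_type)
        if len(muts) == 1:  # only single mutants inform a per-substitution effect
            (mut,) = muts
            effects[mut] = score - baseline
    return baseline, effects


def predict(variant, baseline, effects, wild_type=WILD_TYPE):
    """Predict a phenotype by adding the effects of every substitution in the variant."""
    return baseline + sum(effects[m] for m in mutations(variant, wild_type))


# Invented scores, e.g. the fraction of transcripts that include an exon.
measurements = {"ACGT": 0.80, "TCGT": 0.60, "ACGA": 0.70}
baseline, effects = fit_additive(measurements)

# Predict an unmeasured double mutant from the two single-mutant effects.
double_mutant = predict("TCGA", baseline, effects)
print(f"predicted score for TCGA: {double_mutant:.2f}")
```

An additive model like this cannot capture interactions between mutations, which is one reason to move from linear models toward neural networks when the measured phenotypes are not simply additive.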

Hannah Went (58:08.169)
Mm-hmm.

Hannah Went (58:15.827)
Yeah. Cool. Well, very cool. I’m excited to follow your work and see everything that you publish and post and come up with, and your experimental findings. I always love to hear what you’re doing next. Very last question in the podcast. This one is kind of a curveball. But Dr. Thompson, if you could be any animal in the world, what would you be and why?

Mike Thompson (58:34.626)
Hahaha.

Mike Thompson (58:40.374)
Hmm. Yeah. I guess I would be a bear. Yeah, I guess one of my favorite things to do outside of science and looking at a screen is to be outside as much as I can. So a lot of my vacation time is spent hiking and backpacking in the mountains and yeah, I don’t know. I guess an animal. Yeah.

Hannah Went (58:55.255)
I’m sorry.

Hannah Went (59:01.567)
Mm-hmm.

Hannah Went (59:04.895)
Very cool. Yeah, I thought I almost stumped you there. Have you ever seen a bear while you were hiking or in the mountains? Yeah, okay.

Mike Thompson (59:10.762)
Yeah, a few times. Luckily from a far enough distance, and they weren’t messing with us while we were eating, but yeah.

Hannah Went (59:16.647)
Yeah. Yeah. I’ve never seen one. I love to camp and hike and stuff. There’s a couple of gorges here around Lexington, and Gatlinburg, Tennessee, and places close to us. So yeah, very cool. Well, we’ve come to the end of this amazing podcast. I had such a great time chatting with you. For any listeners who want to learn more about your work or see anything that you have going on, where can they find you? I don’t know if you’re on social media or Twitter or anything.

Mike Thompson (59:26.478)
Cool.

Mike Thompson (59:41.374)
Yeah, so I’m pretty bad at being involved on social media. But I guess email or Google Scholar will be the most relevant thing for now. I’ll make a LinkedIn at some point. And yeah.

Hannah Went (59:56.576)
Perfect. That’s fine. I’ll link your Google Scholar just so people can read the work and the papers if they’re interested. But yeah, I really appreciate your time. It’s been great. For everyone listening, thank you for joining us at the Everything Epigenetics podcast. Remember you have control over your epigenetics, so tune in next time to learn more. Thanks, Dr. Thompson.

Mike Thompson (01:00:03.683)
Sure, sure.

Mike Thompson (01:00:15.586)
Thank you so much. Thanks for listening, everybody.