Science Friday - 65 Genomes Expand Our Picture Of Human Genetics

Starting point is 00:00:00 Hey, it's Ira Flatow, and this is Science Friday. A new look at the human genome, both with more detail and more diversity. And the point in doing this is to understand how we differ in our sequences, our structures, and trying to understand how proteins also differ amongst us. Remember the Human Genome Project? Well, the initial draft was declared complete back in 2003. But researchers then realized that one genome doesn't paint a complete picture of the human race. So fast forward a decade or so, there came the 1000 Genomes Project,

Starting point is 00:00:44 an attempt to expand the picture by sampling people from all over the world with different backgrounds and try to get a fuller look at how we're the same or how we're different. Writing this month in the journal Nature, two teams of researchers take another look at some of those thousand genomes, resequencing and reassembling with more advanced techniques to lessen the number of typos and really firm up how the pieces of the genome puzzle fit together. Joining me now to talk about how it's going are my guests, Dr. Christine Beck, Associate Professor of Genetics and Genome Sciences at the University of Connecticut Health Center and the Jackson Laboratory. She's in Farmington, Connecticut.

Starting point is 00:01:25 Dr. Glennis Logsdon, assistant professor of genetics and core member of the Epigenetics Institute at the University of Pennsylvania in Philly. And Dr. Adam Phillippy, a senior investigator in the Center for Genomics and Data Science Research at the National Human Genome Research Institute at the NIH in Bethesda, Maryland. Welcome all of you to Science Friday. Thanks. Thanks, Ira. Great to be here. Nice to have you all. Dr. Logsdon, let me start with you. What's the thousand-mile view of this paper? What were you all trying to do here? Yeah, the goal of the paper was really to generate complete sequence assemblies of 65 diverse humans from around the world. We were mainly interested in trying to resolve all sequences within all the 46 chromosomes

Starting point is 00:02:13 within these genomes. And that includes the most challenging regions of the genomes that have been kind of plaguing scientists for decades to try to resolve these regions. And the point in doing this is to understand the genetic and epigenetic variation of these genomes, to understand how we differ in our sequences, our structures, and trying to understand how proteins also differ amongst us. Some of the most interesting regions that we resolved are the centromeres, which are these essential chromosomal regions found on every single chromosome in our genome,

Starting point is 00:02:43 and they're important for ensuring that our chromosomes are equally and accurately segregated during mitosis and meiosis. And so for the first time, we were able to resolve about 1,200 centromeres from among these 65 genomes and understand how they differ between each of us and what that difference might mean in terms of function. Does that mean the older genome sequences had problems with them? Yes, that's exactly what that means. Sequencing technologies from about a decade ago weren't really able to resolve these regions of the genome, and that's because they're really highly repetitive and very large, and the sequencing reads that were generated back then were shorter than the repeat itself.
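The point about reads being shorter than the repeat can be illustrated with a small sketch. It is a toy model only, under loud assumptions: the sequences, flank names, and read lengths below are invented for illustration, reads are error-free, and only the set of distinct reads is compared (real data also carries coverage depth, which this ignores). It shows two sequences that differ only in repeat copy number producing identical short reads but different long reads.

```python
def read_set(genome: str, read_len: int) -> set[str]:
    """Return every distinct error-free read of a fixed length from the sequence."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


# Hypothetical toy sequences: unique flanks around a tandem repeat.
LEFT_FLANK, RIGHT_FLANK = "GGATCC", "TTAGCA"
REPEAT_UNIT = "ACGTT"

genome_a = LEFT_FLANK + REPEAT_UNIT * 6 + RIGHT_FLANK  # 30-base repeat array
genome_b = LEFT_FLANK + REPEAT_UNIT * 9 + RIGHT_FLANK  # 45-base repeat array

# Reads shorter than the repeat array cannot tell the two genomes apart.
short_reads_match = read_set(genome_a, 8) == read_set(genome_b, 8)

# Reads long enough to span genome_a's whole array, flank to flank, can.
long_reads_match = read_set(genome_a, 40) == read_set(genome_b, 40)

print(short_reads_match, long_reads_match)  # True False
```

In this toy case the 8-base reads only ever see a flank edge or the inside of the repeat, so the copy number is invisible to them; the 40-base reads cross the entire shorter array and expose the difference.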

Starting point is 00:03:21 But now we're able to actually resolve these regions in their entirety, traverse them from one side of the chromosome to the other, and finally get complete maps at high resolution of these regions of the genome. Interesting. Dr. Beck, tell me more about these 65 individuals that Dr. Logsdon talked about. Where did these samples come from?

Starting point is 00:03:41 Sure. These 65 samples were actually part of the 1000 Genomes Project. So the samples are from around the world. And basically, with data from previous sequencing projects, we had a good handle on how much variation there was between individual samples and a reference genome. So therefore, we chose cell lines from individuals

Starting point is 00:04:05 that would maximize the amount of novel sequence variation that was discovered in our work. Because if we sequenced a bunch of individuals that were really, really similar, we'd have less return on our investment for sequencing each individual person. So we sequenced these 65 people, and from them, we discovered a large amount of DNA variation

Starting point is 00:04:26 from person to person. And part of the reason why that's important is because we don't really have a good handle on how much DNA variation there is in some of these complex regions of the genome. So without a good understanding of that kind of background topology of the genome, it's really, really hard to separate benign differences in the population from pathogenic ones. So you really did find diversity, similarity, dissimilarity in all these different genomes? So we looked between all of these genomes. And between them, we cataloged 188,000-plus variations between people that were greater

Starting point is 00:05:10 than 50 base pairs in length. Is that surprising, all of those? You know, to a certain degree, it's not. So previous studies, looking at an individual genome versus the reference assemblies, were able to find a decent number of variants. But in these complex regions, like Glennis was talking about, and other loci comprised largely of repeat sequences, we were able to uncover vast amounts of differences between individuals in the population that had heretofore been undiscovered because of the quality of sequencing.
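The size cutoff mentioned here (variants greater than 50 base pairs) can be sketched as a simple filter. This is a minimal illustration, not the consortium's pipeline: the `Variant` record, its fields, and the example calls are all hypothetical, and length is measured only as the difference between reference and alternate allele lengths, which covers insertions and deletions but not other kinds of rearrangement.

```python
from dataclasses import dataclass

# Threshold taken from the speaker's wording ("greater than 50 base pairs").
MIN_LENGTH_BP = 50


@dataclass(frozen=True)
class Variant:
    """A hypothetical, simplified variant record (not a real file format)."""
    chrom: str
    pos: int
    ref: str  # reference allele
    alt: str  # alternate allele

    @property
    def length_change(self) -> int:
        """Size of the insertion or deletion, in base pairs."""
        return abs(len(self.alt) - len(self.ref))


def large_variants(variants: list[Variant]) -> list[Variant]:
    """Keep only variants whose length change exceeds the threshold."""
    return [v for v in variants if v.length_change > MIN_LENGTH_BP]


# Made-up example records for illustration only.
toy_calls = [
    Variant("chr1", 10_000, "A", "G"),              # single-base change
    Variant("chr1", 20_000, "ACG", "A"),            # 2 bp deletion
    Variant("chr2", 30_000, "T", "T" + "CA" * 40),  # 80 bp insertion
    Variant("chr3", 40_000, "G" + "AT" * 60, "G"),  # 120 bp deletion
]

selected = large_variants(toy_calls)
print([(v.chrom, v.length_change) for v in selected])
# [('chr2', 80), ('chr3', 120)]
```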

Starting point is 00:05:46 Right. So just as a quick kind of side-by-side example, just four years ago, the Human Genome Structural Variation Consortium published another study cataloging variants between people. And with that sequencing technology, there were almost 2,000 fewer variants between every individual and the reference genome than there are in this recent compendium. Wow. Dr. Phillippy, what is it that is letting you do this work now? Is it better machinery to actually do the sequencing? Is it better tools to assemble all the data? What is it? Yeah, all of the above, including better computers.

Starting point is 00:06:26 You mentioned at the top the 1000 Genomes Project, which was initiated almost 20 years ago now. That was the original collection and sequencing of these samples. But it wasn't just typos that we had at the time. There were, if you want to continue the Book of Life analogy, entire pages missing from each of these individual genomes. And as was noted by Glennis, a lot of this was due to repeats. And so these repeating pages,

Starting point is 00:06:52 repeating sentences, so to speak, in the genome are just like when you're doing a jigsaw puzzle, hard to put back together again when they're highly repetitive. So how complete do you think you've gotten? Are we finished? Yeah, I was on a number of years ago talking about the Telomere-to-Telomere project, which was the first completion of an entire human genome. And we estimated at the time that that filled in about 8% of what was missing after the initial Human Genome Project from the early 2000s. And I would say that number holds about the same for these genomes.

Starting point is 00:07:25 And so for all of the genomes presented here, each of them has about 8% more sequence than the initial product of the Human Genome Project from 2003. And the technology, the sequencing methods, are able to read a longer stretch of DNA at a time, and that helps. The computational methods have advanced. And we have better and more accurate methods of putting those puzzles back together again. And just handling this sheer quantity of data, generating millions and millions of sequencing reads from all of these individual genomes and putting it back together again, is really only possible with the advance of computing that we've seen over the past couple of decades as well. So how close are we to a final number or a final end to all the sequence?

Starting point is 00:08:09 Well, how many billions of people deep would you like to go? Obviously, we're just scratching the surface here, but as Christine said, we're trying to do it in a way that kind of maximizes our return on investment. And so we can go into a population of people and pick out the ones that look most different from one another, sequence those first. And then over time, we start to saturate the amount of variation that we return. So now we're talking about, you know, 50-ish genomes. In the next year or two, we'll be talking about thousands of genomes. And if this field just continues to increase exponentially like it has over the past two decades,

Starting point is 00:08:45 yeah, the sky's the limit. Now, you mentioned that this really helped fill in the bits of the genome that repeat over and over. What does that tell us? Why does it repeat over and over? Is there information there? Yeah, there's absolutely information in the repetitive regions of the genome, and not only just information, there's function. I mentioned earlier the centromeres. They're some of the most mutable, highly dynamic regions of our genome, and they're so mutable

Starting point is 00:09:13 that I haven't yet seen two centromeres that look identical across humans. And despite this variability, we can see quite a bit of variation that, in fact, affects function. So we find that when certain regions of the centromeres are deleted or expanded or duplicated, this could actually affect the way that the chromosomes segregate during meiosis and mitosis. Is this repetitive stuff what we once called junk DNA? It is. It is exactly what you would call junk DNA, but we know for sure that it's not junk DNA. In fact, it's very functional, important regions of our genome. It's important for life. If we didn't have these regions of the genome, then we wouldn't be able to live. I think that that's kind of an important

Starting point is 00:09:54 part to touch on, because I think repeats of all classes really shine with these novel techniques and novel sequencing modalities, as well as the assemblies. So both the centromeric repeats that Glennis has studied, as well as segmental duplications and complex kind of different ways of arraying those puzzle pieces from beginning to end, have begun to come to light with these new sequences. And from that, you can kind of infer whether or not the mutations or the differences between these people have actually affected the coding sequences of genes embedded in these repeats, or whether or not it might have changed the cis-regulatory landscape.

Starting point is 00:10:37 Like, let's say, the ability to turn a gene off or turn it up to 11 is also altered between some of these genomes. So getting a good picture of that repetitive nature of the underlying sequence is really, really key to understanding differences in function downstream. Turning a gene up to 11 is something we haven't spoken about before. After the break, what comes next in genetics research? How will researchers be able to capitalize on the new genetic knowledge? I think in the next few years, we'll see genomic language models, so to speak, trained on that data and be able to predict associations quite accurately between atypical sequences and their disease associations. Stay with us. How much data do you need to have to tell if

Starting point is 00:11:41 something is, quote, unquote, normal genetically? I just throw that out to any of you. I think that's a great question, and I think it's really the power of this type of kind of fundamental knowledge generation that we're doing in these types of projects. You know, being trained as a computer scientist, I think a lot from that lens. And in a similar way that something like AlphaFold succeeded at protein prediction based on this foundation of the Protein Data Bank that was decades in progress, we're building now this foundation of what typical human genomes look like. And I think in the next few years, we'll see genomic language models, so to speak, trained on that data and be able to predict associations quite accurately between atypical sequences and their disease

Starting point is 00:12:27 associations. Exactly how many sequences you need and how many people with diseases and without diseases you need in that training set always depends on the type of the disease, how complex those associations are, and so forth. But I think we're rapidly approaching a tipping point in being able to make very accurate predictions off of this genomic data alone. And what kinds of predictions are we talking about? So imagine as a thought experiment, we just mutate a random base in your genome. How well do you think we can predict the effect of that mutation, whether it will be deleterious or not. We're not quite that good at it compared to some other aspects of prediction. But with these resources, we're getting much, much better, in particular in the non-coding

Starting point is 00:13:09 regions of the genome that Dr. Beck was just mentioning. A large fraction of those mutations, you have many millions of them in your genome compared to a typical reference genome, and the vast majority of them are benign. But the few that matter are the important ones, and we're going to get much better in the coming years at making those predictions and being able to spot, basically at birth with DNA sequencing, those variants that will likely result in some form of genetic disease. Would I be wrong in assuming, Dr. Phillippy, that as a computer scientist, you're using a lot of AI here? More and more, it's kind of embedded into a lot of the things we do. The sequencing technologies

Starting point is 00:13:48 that we're using to read off the DNA are using state-of-the-art AI methods to make a prediction from the electrical current or the optical image that you're seeing to the actual A's, C's, G's, and T's, so that translation process uses AI. And yes, these kind of DNA models that I was referring to are also coming of age now, and people are actively using them to make predictions of the suspected pathogenicity of a variant that you see in one genome compared to another. Final question. I'll send it to you, Dr. Beck.

Starting point is 00:14:19 I remember when the Human Genome Project was announced, it was hailed as a major breakthrough in helping to cure illnesses down the road. How has that been working out? How would you grade the success so far and looking forward? Oh, nice. A softball. So I think that at the end of the day, I think that the sequencing of the human genome

Starting point is 00:14:44 has allowed a lot of inference into Mendelian diseases. So the architecture of diseases that are highly penetrant in the population where you have a clear variant and effect, so a cause and effect that you can tie together very clearly, those things have really been helped astronomically by the development of the human genome reference sequence. And then stepping into kind of the murkier territory of complex disease genetics, I think that there's still a lot of work to be done to figure out the underlying genetic architecture of those diseases and understanding kind of the combinatorics of alleles and variants that come together to equal the predisposition to diseases with environmental

Starting point is 00:15:38 factors added to them. So I think that getting back to what Dr. Phillippy said earlier, I think that an understanding of this is probably going to be borne out by a much better understanding of variation in genomes, which we're gaining with studies like ours, mixed with machine learning approaches to kind of plumb the depths of those data for variants that might, in aggregate or individually, contribute to these complex diseases. So long story short, I think that there has been a lot of progress, but I also think in the future there's a lot of work and progress to be done. I think I would be remiss to not give credit to the initial Human Genome Project that we're building on here that finished up, as you said, about two decades ago now. And I find it really

Starting point is 00:16:27 informative to look back and realize that that project took about 10 years and, in today's dollars, about $5 billion. Each of these individual genomes that we're doing now at a better quality can be done in basically a few days for around $5,000. And so just do the simple math, that's a million-fold reduction in the cost to sequence a human genome, thanks to these research investments that have been made over the past 25 years by the NIH and by my home institute, NHGRI. And it's just amazing to reflect on the progress that this field has undergone over the past 20 years with those investments. And so if you look back at the economic impact of that, there was a study in 2013 that estimated the economic impact of the Human Genome Project at $1 trillion.
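The "simple math" in this passage can be checked directly. The two dollar figures below are the round numbers quoted by the speaker (about $5 billion in today's dollars then, around $5,000 per genome now); the variable names are illustrative only.

```python
# Round figures as quoted in the interview (approximate, in US dollars).
hgp_cost_usd = 5_000_000_000  # original Human Genome Project, today's dollars
per_genome_cost_usd = 5_000   # one genome today, per the speaker

# Ratio of the two costs: how many times cheaper a genome is now.
cost_fold_reduction = hgp_cost_usd // per_genome_cost_usd

print(f"{cost_fold_reduction:,}-fold reduction")  # 1,000,000-fold reduction
```

Both inputs are rounded estimates, so the result is best read as an order of magnitude rather than an exact figure.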

Starting point is 00:17:14 And that was 10 years ago. Imagine what those returns are now. So this Human Genome Project is just a gift that keeps giving, both in economic terms and in terms of quality of life. Well, I want to thank all of you for taking time to be with us. This is very informative. I imagine you're all very, very hopeful about the future. Yeah, absolutely.

Starting point is 00:17:34 Please come back and tell us more about where this is heading when you get a chance. Will do. Thanks. Thanks much, Ira. Thank you so much. You're welcome. Dr. Adam Phillippy at the National Human Genome Research Institute, that is at NIH in Bethesda. Dr. Christine Beck of the University of Connecticut Health Center and the Jackson Laboratory, and Dr. Glennis Logsdon at the University of Pennsylvania. Thank you all for taking time,

Starting point is 00:18:00 as I say, to be with us today. Hey, thanks for listening. This episode was produced by Charles Bergquist. See you next time. I'm Ira Flatow.
