Science Friday - 65 Genomes Expand Our Picture Of Human Genetics
Episode Date: August 5, 2025
The first complete draft of the human genome was published back in 2003. Since then, researchers have worked both to improve the accuracy of human genetic data, and to expand its diversity, looking at the genetics of people from many different backgrounds. Three genetics experts join Host Ira Flatow to talk about a recent close examination of the genomes of 65 individuals from around the world, and how it may help researchers get a better understanding of genomic functioning and diversity.
Guests:
Dr. Christine Beck is an associate professor of genetics and genome sciences in the University of Connecticut Health Center and the Jackson Laboratory.
Dr. Glennis Logsdon is an assistant professor of genetics and a core member of the Epigenetics Institute at the University of Pennsylvania.
Dr. Adam Philippy is a Senior Investigator in the Center for Genomics and Data Science Research at the National Human Genome Research Institute at the NIH.
Transcripts for each episode are available within 1-3 days at sciencefriday.com. Subscribe to this podcast. Plus, to stay updated on all things science, sign up for Science Friday's newsletters.
Transcript
Hey, it's Ira Flatow, and this is Science Friday.
A new look at the human genome, both with more detail and more diversity.
And the point in doing this is to understand how we differ in our sequences, our structures,
and trying to understand how proteins also differ amongst us.
Remember the Human Genome Project?
Well, the initial draft was declared complete back in 2003.
But researchers then realized that one genome doesn't paint a complete picture of
the human race. So fast forward a decade or so, there came the 1000 Genomes Project,
an attempt to expand the picture by sampling people from all over the world with different
backgrounds and try to get a fuller look at how we're the same or how we're different.
Writing this month in the journal Nature, two teams of researchers take another look at some of
those thousand genomes, resequencing, reassembling with more advanced techniques
to lessen the number of typos and really firm up how the pieces of the genome puzzle fit together.
Joining me now to talk about how it's going are my guests, Dr. Christine Beck,
Associate Professor of Genetics and Genome Sciences at the University of Connecticut Health Center
and the Jackson Laboratory. She's in Farmington, Connecticut.
Dr. Glennis Logsdon, assistant professor of genetics and core member of the Epigenetics
Institute at the University of Pennsylvania in Philly. And Dr. Adam Philippy, a senior investigator in the
Center for Genomics and Data Science Research at the National Human Genome Research Institute
at the NIH in Bethesda, Maryland. Welcome all of you to Science Friday. Thanks. Thanks, Ira.
Great to be here. Nice to have you all. Dr. Logsdon, let me start with you. What's the thousand-mile
view of this paper? What were you all trying to do here? Yeah, the goal of the paper
was really to generate complete sequence assemblies of 65 diverse humans from around the world.
We were mainly interested in trying to resolve all sequences within all the 46 chromosomes
within these genomes.
And that includes the most challenging regions of the genomes that have been kind of plaguing
scientists for decades as they tried to resolve these regions.
And the point in doing this is to understand the genetic and epigenetic variation of
these genomes, to understand how we differ in our sequences, our structures,
and trying to understand how proteins also differ amongst us.
Some of the most interesting regions that we resolved are the centromeres,
which are these essential chromosomal regions found on every single chromosome in our genome,
and they're important for ensuring that our chromosomes are equally and accurately segregated during mitosis and meiosis.
And so for the first time, we were able to resolve about 1,200 centromeres from among these 65 genomes
and understand how they differ between each of us and what that difference might mean in terms of function.
Does that mean the older genome sequences had problems with them?
Yes, that's exactly what that means.
Sequencing technologies from about a decade ago weren't really able to resolve these regions
of the genome, and that's because they're really highly repetitive and very large,
and the sequencing data that was generated back then was smaller than the size of the repeat itself.
But now we're able to actually resolve these regions in their entirety,
traverse them from one side of the chromosome to the other,
and finally get complete maps at high resolution
of these regions of the genome.
Interesting.
Dr. Beck, tell me more about these 65 individuals
that Dr. Logsdon talked about.
Where did these samples come from?
Sure.
These 65 samples were actually part of the 1000 Genomes Project.
So the samples are from around the world.
And basically, with data from previous sequencing projects,
we had a good handle on how much variation there was
between individual samples
and a reference genome.
So therefore, we chose cell lines from individuals
that would maximize the amount of novel sequence variation
that was discovered in our work.
Because if we sequenced a bunch of individuals
that were really, really similar,
we'd have less return on our investment
for sequencing each individual person.
So we sequenced these 65 people,
and from them, we discovered a large amount of DNA variation
from person to person.
And part of the reason why that's important is because we don't really have a good handle on how much
DNA variation there is in some of these complex regions of the genome. So without a good understanding
of that kind of background topology of the genome, it's really, really hard to separate
benign differences in the population from pathogenic ones. So you really did find diversity, similarity,
dissimilarity in all these different genomes?
So we looked between all of these genomes.
And between them, we cataloged 188,000-plus variations between people that were greater
than 50 base pairs in length.
Is that surprising, all of those?
You know, to a certain degree, it's not.
In previous studies, looking at an individual genome versus the reference assemblies,
we were able to find a decent number of variants.
But in these complex regions, like Glennis was talking about and other loci, comprised largely
of repeat sequences, we were able to uncover vast amounts of differences between individuals
in the population that had heretofore been undiscovered because of the quality of sequencing.
Right.
So just as a quick kind of side-by-side example, just four years ago, the Human Genome Structural
Variation Consortium published another study where they cataloged variants between people.
And with that sequencing technology, there were almost 2,000 fewer variants between every individual
and the reference genome than there are in this recent compendium.
Wow. Dr. Philippy, what is it that is letting you do this work now? Is it better machinery to
actually do a sequence? Is it better tools to assemble all the data? What is it?
Yeah, all of the above, including better computers.
You mentioned at the top the 1000 Genomes Project,
which was initiated almost 20 years ago now,
that was the original collection and sequencing of these samples.
But it wasn't just typos that we had at the time.
There was, if you want to continue the Book of Life analogy,
entire pages missing from each of these individual genomes.
And as was noted by Glennis, a lot of this was due to repeats.
And so these repeating pages,
repeating sentences, so to speak, in the genome are just like when you're doing a jigsaw puzzle,
hard to put back together again when they're highly repetitive.
So how complete do you think you've gotten? Are we finished?
Yeah, I was on a number of years ago talking about this Telomere-to-Telomere project,
which was the first completion of an entire human genome.
And we estimated at the time that that filled in about 8% of what was missing after the
initial Human Genome Project from the early 2000s.
And I would say that number holds about the same for these genomes.
And so for all of the genomes presented here, each of them has about 8% more sequence than the initial product of the Human Genome Project from 2003.
And the technology, the sequencing methods are able to read a longer stretch of DNA at a time that helps.
The computational methods have advanced.
And we have better and more accurate methods of putting those puzzles back together again.
And just handling this sheer quantity of data, generating millions and millions of sequencing
reads from all of these individual genomes and putting it back together again is really only
possible with the advance of computing that we've seen over the past couple of decades as well.
So how close are we to a final number or a final end to all the sequence?
Well, how many billions of people deep would you like to go?
Obviously, we're just scratching the surface here, but as Christine said, we're trying to do it
in a way that kind of maximizes our return on investment.
And so we can go into a population of people and pick out the ones that look most different from one another,
sequence those first.
And then over time, we start to saturate the amount of variation that we return.
So now we're talking about, you know, 50-ish genomes. In the next year or two, we'll be talking about thousands of genomes.
And if this field just continues to increase exponentially like it has over the past two decades,
yeah, the sky's the limit.
Now, you mentioned that this really helped fill in the bits of the genome that repeat over and over.
What does that tell us?
Why does it repeat over and over?
Is there information there?
Yeah, there's absolutely information in the repetitive regions of the genome, and not only just information, there's function.
I mentioned earlier the centromeres.
They're some of the most mutable, highly dynamic regions of our genome, and they're so mutable
that I haven't yet seen two centromeres that look identical across
humans. And despite this variability, we can see quite a bit of variation that, in fact,
affects function. So we find that when certain regions of the centromeres are deleted or expanded
or duplicated, this could actually affect the way that the chromosomes segregate during
meiosis and mitosis. Is this repetitive stuff what we once called junk DNA? It is. It is exactly
what you would call junk DNA, but we know for sure that it's not junk DNA. In fact, it's very
functional, important regions of our genome. It's important for life. If we didn't have these
regions of the genome, then we wouldn't be able to live. I think that that's kind of an important
part to touch on, because I think repeats of all classes really shine with these novel
techniques and novel sequencing modalities, as well as the assemblies. So both the centromeric
repeats that Glennis has studied, as well as segmental duplications and complex kind of different ways
of arraying those puzzle pieces from beginning to end have begun to come to light with these
new sequences.
And from that, you can kind of infer whether or not the mutations or the differences between
these people have actually affected the coding sequences of genes embedded in these repeats,
or whether or not it might have changed the cis-regulatory landscape.
Like, let's say, the ability to turn a gene off or turn it up to 11 is also altered
between some of these genomes. So getting a good picture of that repetitive nature of the
underlying sequence is really, really key to understanding differences in function downstream.
Turning a gene up to 11 is something we haven't spoken about before.
After the break, what comes next in genetics research? How will researchers be able to capitalize
on the new genetic knowledge? I think in the next few years, we'll see genomic language models,
so to speak, trained on that data and be able to predict associations quite accurately between atypical
sequences and their disease associations. Stay with us. How much data do you need to have to tell if
something is, quote, unquote, normal genetically? I just throw that out to any of you.
I think that's a great question, and I think it's really the power of this type of kind of fundamental
knowledge generation that we're doing in these types of projects. You know, being trained as a
computer scientist, I think a lot from that lens. And in a similar way that something like AlphaFold
succeeded at protein prediction based on this foundation of the Protein Data Bank that was decades
in progress, we're building now this foundation of what typical human genomes look like. And I think
in the next few years, we'll see genomic language models, so to speak, trained on that data
and be able to predict associations quite accurately between atypical sequences and their disease
associations. Exactly how many sequences you need and how many people with diseases and without diseases
you need in that training set always depends on the type of the disease, how complex those associations
are, and so forth. But I think we're rapidly approaching a tipping point in being able to make
very accurate predictions off of this genomic data alone. And what kinds of predictions are we
talking about? So imagine as a thought experiment, we just mutate a random base in your genome. How well do you
think we can predict the effect of that mutation, whether it will be deleterious or not.
We're not quite that good at it compared to some other aspects of prediction.
But with these resources, we're getting much, much better, in particular in the non-coding
regions of the genome that Dr. Beck was just mentioning.
A large fraction of those mutations, you have many millions of them in your genome compared
to a typical reference genome, and the vast majority of them are benign.
But the few that matter are the important ones, and we're going to get much better in
the coming years at making those predictions and being able to spot, basically at birth with
DNA sequencing, those predictions, those variants that will likely result in some form of genetic
disease. Would I be wrong in assuming, Dr. Philippy, that as a computer scientist, you're using a lot
of AI here? More and more, it's kind of embedded into a lot of the things we do. The sequencing technologies
that we're using to read off the DNA are using state-of-the-art AI methods to make a prediction from
the electrical current or the optical image that you're seeing to the actual A's, C's, G's, and T's,
so that translation process uses AI.
And yes, these kind of DNA models that I was referring to are also coming of age now,
and people are actively using them to make predictions of the suspected pathogenicity of a
variant that you see in one genome compared to another.
Final question.
I'll send it to you, Dr. Beck.
I remember when the Human Genome Project was announced, it was hailed as a major breakthrough
in helping to cure illnesses down the road.
How has that been working out?
How would you grade the success so far and looking forward?
Oh, nice.
A softball.
So I think that at the end of the day,
I think that the sequencing of the human genome
has allowed a lot of inference into Mendelian diseases.
So the architecture of diseases that are highly penetrant
in the population where you have a clear variant and effect, so a cause and effect that you can tie
together very clearly, those things have really been helped astronomically by the development
of the human genome reference sequence. And then stepping into kind of the murkier territory
of complex disease genetics, I think that there's still a lot of work to be done to figure out
the underlying genetic architecture of those diseases and understanding kind of the combinatorics
of alleles and variants that come together to equal the predisposition to diseases with environmental
factors added to them. So I think that getting back to what Dr. Philippy said earlier, I think that an
understanding of this is probably going to be borne out by a much better understanding of variation in genomes,
which we're gaining with studies like ours, mixed with machine learning approaches to kind of
plumb the depths of those data for variants that might, in aggregate or individually,
contribute to these complex diseases. So long story short, I think that there has been a lot
of progress, but I also think in the future there's a lot of work and progress to be done.
I think I would be remiss to not give credit to the initial Human Genome Project that we're
building on here that finished up, as you said, about two decades ago now. And I find it really
informative to look back and realize that that project took about 10 years and, in today's dollars,
about $5 billion. Each of these individual genomes that we're doing now at a better quality
can be done in basically a few days for around $5,000. And so just do the simple math, that's a million-fold
reduction in the costs to sequence a human genome, thanks to these research investments
that have been made over the past 25 years by the NIH and by my home institute, NHGRI.
And it's just amazing to reflect on the progress that this field has undergone over the past 20 years with those investments.
And so if you look back at the economic impact on that,
there was a study in 2013 that estimated the economic impact of the Human Genome Project at $1 trillion.
And that was 10 years ago.
Imagine what those returns are now.
So this Human Genome Project is just a gift that keeps giving, both
in economic terms and in terms of quality of life.
Well, I want to thank all of you for taking time to be with us.
This is very informative.
I imagine you're all very, very hopeful about the future.
Yeah, absolutely.
Please come back and tell us more about where this is heading when you get a chance.
Will do. Thanks.
Thanks much, Ira.
Thank you so much.
You're welcome.
Dr. Adam Philippy at the National Human Genome Research Institute that is at NIH in Bethesda.
Dr. Christine Beck of the University of Connecticut Health Center and the Jackson Laboratory
and Dr. Glennis Logsdon at the University of Pennsylvania. Thank you all for taking time,
as I say, to be with us today. Hey, thanks for listening. This episode was produced by Charles Bergquist.
See you next time. I'm Ira Flatow.
