The Genome Aggregation Database has collected 15,708 genomes and 125,748 exomes (the protein-coding part of the genome) to help shed light on how genetic mutations can lead to disease. Dr Daniel MacArthur, scientific lead of the gnomAD Project, explains how the project started, how they collect the data and what they hope to achieve.


How did the Genome Aggregation Database (gnomAD) Project start?

I started my lab in Boston back in 2012. At that time, one of the first projects we launched was to begin sequencing DNA from patients affected by severe muscle diseases, like muscular dystrophy.

What became clear was that we desperately needed better databases of normal variation [in genomes] to make sense of the genetic changes that we were seeing in these patients. Imagine you sequence the DNA from a patient and find a new genetic change there. You need to know whether that is a really rare genetic change that is disease-causing, or whether it’s actually really common in the population, in which case it’s almost certainly not responsible for that patient’s disease.


So we set about building a new database of human genetic variation by taking advantage of a whole bunch of other studies that had been done at the time. These were starting to pull together DNA-sequencing data for a whole variety of different common adult-onset disorders, things like type 2 diabetes and heart disease.
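The rare-versus-common check described in this answer can be sketched in a few lines of Python. This is an illustration only, not the gnomAD team's actual pipeline: the variant names, the counts and the 0.01 per cent frequency cutoff are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class PopulationRecord:
    """How often a variant was seen in a reference population (hypothetical data)."""
    allele_count: int   # chromosomes carrying the variant
    allele_number: int  # chromosomes sequenced at that position


# Assumed cutoff: anything more common than this is treated as too common
# to explain a severe rare disease. The real threshold depends on the disease.
MAX_ALLELE_FREQUENCY = 0.0001


def allele_frequency(record: PopulationRecord) -> float:
    """Fraction of sequenced chromosomes that carry the variant."""
    if record.allele_number == 0:
        return 0.0
    return record.allele_count / record.allele_number


def is_candidate(variant_id: str, reference_db: dict,
                 max_af: float = MAX_ALLELE_FREQUENCY) -> bool:
    """Keep a patient's variant only if it is absent or rare in the population."""
    record = reference_db.get(variant_id)
    if record is None:
        return True  # never seen in the reference population
    return allele_frequency(record) <= max_af


# Invented reference database: one common variant, one very rare one.
reference_db = {
    "variant_A": PopulationRecord(allele_count=12_000, allele_number=250_000),
    "variant_B": PopulationRecord(allele_count=2, allele_number=250_000),
}

# Invented patient: variant_C does not appear in the reference database at all.
patient_variants = ["variant_A", "variant_B", "variant_C"]
candidates = [v for v in patient_variants if is_candidate(v, reference_db)]
print(candidates)  # ['variant_B', 'variant_C']
```

The common variant drops out, leaving only the rare and the never-before-seen changes for closer inspection. The larger the reference database, the more confident that filtering step becomes.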

How do you collect the data?

We don’t actually generate the data ourselves. The gnomAD Project team are data parasites: we take advantage of the data that’s been generated by others when they agree to contribute that data to our project.


We can then do the hard work of pulling that data together, aggregating it, cleaning it up and producing this big shared database that can be released to the rest of the world.

It is, I think, a testament to the overall generosity of the human genomics community that we’ve been able to pull together such a large resource through that approach.

The database contains genomes and exomes. What are exomes?

The exome is the protein-coding part of the genome. It covers the genes that we know the most about, which are also the genes most commonly involved in very rare diseases like muscular dystrophy or severe retinal disease.

Exome sequencing is widely used because it is cheaper than sequencing a whole genome.

How common are genetic variations?

There are about three billion base pairs, or letters, in the human genome. If you compare any two humans, we differ at about one in every thousand letters.

Everyone has genetic variations. It depends on exactly which ancestry group you’re from, but we all carry somewhere between three and six million points of variation across our genomes – points where we differ compared to the person sitting next to us or someone who is unrelated to us.
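The arithmetic behind those figures is easy to check. The short Python sketch below uses only the rounded numbers quoted in this answer, so the result is a ballpark rather than a measurement.

```python
# Rounded figures quoted in the interview.
GENOME_LENGTH_LETTERS = 3_000_000_000  # about three billion base pairs
ONE_DIFFERENCE_EVERY = 1_000           # two people differ at about 1 in 1,000 letters

# Expected number of differences between two unrelated people.
expected_differences = GENOME_LENGTH_LETTERS // ONE_DIFFERENCE_EVERY
print(f"{expected_differences:,}")  # 3,000,000

# The quoted range of variation per person, as a share of the whole genome.
for variant_count in (3_000_000, 6_000_000):
    share = 100 * variant_count / GENOME_LENGTH_LETTERS
    print(f"{variant_count:,} points of variation = {share:.1f} per cent of the genome")
```

One difference per thousand letters gives about three million differences, which matches the lower end of the three-to-six-million range.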


Most of that variation is benign, and a lot of it is quite common, but some is rare. Most of it doesn’t have a big impact on our health, but a small subset does.

In some people, such as people who suffer from severe diseases like muscular dystrophy, those genetic changes can be catastrophic.

How do these genetic variations arise?

They arise through what we call mutation. When we’re born, every one of us carries about 60 to 80 new genetic changes. These are genetic sequence differences that were not present in either of our parents.

The rest of the variants that we carry are much older; those are the ones we inherit from our parents. They have also arisen from mutations, but they may have occurred many generations ago, potentially thousands of generations ago.

What makes some of these variants benign, and some of them cause disease?

There are a lot of different ways that a genetic change can affect the function of a gene. A gene is composed of lots and lots of pieces called exons. These are the protein-coding bits and they’re surrounded by a big sea of non-protein-coding DNA.

Most of the genes that we understand the best in the human genome are protein-coding genes. Their job is basically to serve as a template for making a specific protein.

A good example might be a gene called CFTR. This encodes a little protein that sits in the surface of cells in the lung and is responsible for transporting chloride ions across the cell membrane. Usually it does that very well, but in some individuals who have two broken copies of that gene, that transporter can be completely dysfunctional. As a result, they end up with cystic fibrosis.

So, these proteins are typically the ‘doing molecules’ in the cell. They’re the bits of the cell that get particular jobs done. These genes are important because they encode all these different molecules that perform these critical chemical functions.

How does a gene become ‘broken’?

Usually by introducing what we call a ‘stop signal’ right in the middle of that protein-coding sequence so that the protein can’t be made at all. As a result, that gene becomes non-functional.
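To see how a single changed letter can introduce a stop signal, here is a minimal Python sketch. The 18-letter sequence is a made-up toy gene, not CFTR or any real gene, and the function names are invented for the example. The only real biology in it is the standard genetic code, which maps each three-letter codon to an amino acid or to a stop.

```python
# Standard genetic code: 64 codons in T, C, A, G order; '*' marks a stop signal.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))


def translate(dna: str) -> str:
    """Read the sequence three letters at a time until a stop signal is reached."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break  # stop signal: the protein ends here
        protein.append(amino_acid)
    return "".join(protein)


# Toy coding sequence (invented): ATG GCC CAG TGG AAA TGA
normal = "ATGGCCCAGTGGAAATGA"

# Change one letter, C to T, so the third codon CAG becomes the stop codon TAG.
mutant = normal[:6] + "T" + normal[7:]

print(translate(normal))  # MAQWK
print(translate(mutant))  # MA
```

In the toy example the changed sequence yields only the first two amino acids. In a real cell, a protein cut short this early is typically not produced in working form at all.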

What’s interesting is that these loss-of-function variants are seen in all of us. We all carry about 100 to 150 of these gene-breaking mutations, and most of them don’t hurt us. They knock out genes that are either not necessary for life, or they only knock out one copy of the gene and the other copy is enough for us to continue living healthily.

But some of these mutations can be really important – either because they result in very severe disease or because they potentially teach us something about the function of that gene.


By studying the people who carry these loss-of-function variants, we can learn about what breaking that gene actually does to us. And that, in turn, can tell us what the normal function of that gene is.

What is DNA?

DNA is made up of base pairs, or ‘letters’. A gene is a particular section of DNA that carries information to make proteins, or instructions for traits and characteristics.

Proteins are large, complex molecules that do most of the work in cells – and they’re necessary for the structure, function and regulation of the body’s tissues, organs and processes.

A genome is the complete set of an organism’s DNA. The ‘exome’ makes up around 1-2 per cent of the genome, yet contains all of the protein-coding sequence. The exome contains the vast majority of known disease-causing variants. The remaining 98-99 per cent of the genome is called ‘non-coding’ DNA.


Jason Goodyer, Commissioning editor, BBC Science Focus

Jason is the commissioning editor for BBC Science Focus. He holds an MSc in physics and was named Section Editor of the Year by the British Society of Magazine Editors in 2019. He has been reporting on science and technology for more than a decade. During this time, he's walked the tunnels of the Large Hadron Collider, watched Stephen Hawking deliver his Reith Lecture on Black Holes and reported on everything from simulation universes to dancing cockatoos. He looks after the magazine’s and website’s news sections and makes regular appearances on the Instant Genius Podcast.