How Much of Our Genome Is Sequenced?

I'm getting ready for a class on the size and composition of the human genome so I thought I'd check to see the latest estimate of its size. Recall that in an earlier posting I concluded that the size of the human genome was 3,200,000,000 bp (3,200,000 kb, 3,200 Mb, 3.2 Gb) [How Big Is the Human Genome?].

You might think that all you have to do is check out the human genome websites and look up the exact size. That doesn't work because not all of the human genome has been sequenced and organized into a contiguous assembly of 24 different strands (one for each chromosome). So that prompts the question, how much of the human genome has actually been sequenced?1

The latest assembly is GRCh37 Patch Release 7 (GRCh37.p7), released on Feb. 3, 2012. If you look at the data for this assembly you will see an estimate of the "Total Sequenced Bases in the Assembly." The number is 3,173,036,847 bp or 3.17 Gb. This value is close to estimates of the genome size from the years before the first draft of the genome sequence was published.

I was suspicious of this number since we know that there are many gaps in the human genome sequence. The largest gaps cover highly repetitive parts of the genome—mostly around the centromeres and other heterochromatic regions. There were also gaps at the locations of several gene clusters (e.g. ribosomal RNA genes) where it's impossible to determine the exact number of copies. In the case of ribosomal RNA gene clusters, these gaps have now been closed.

Deanna Church posted a few comments on my earlier posting. She's with the Genome Reference Consortium (GRC). That's the group responsible for updating the human genome. Deanna explained that "Total Sequenced Bases in the Assembly" is not an accurate representation of the truth.2 What it actually means is total sequenced bases plus estimated sizes of the gaps. In other words, it's a good estimate of the size of the genome.

So, how much of the genome is actually sequenced and organized into "scaffolds," or contiguous stretches of DNA? You can see the actual numbers by clicking on Ungapped Lengths on the NCBI website.

The total number of sequenced base pairs that have been organized into scaffolds and placed on a particular chromosome is 2,861,332,606 bp. An additional 6,110,758 bp have been sequenced but the blocks of sequence cannot be placed in the assembly. Most of this unassigned sequence is on chromosomes 1, 4, 9, and 17 but some of it can't even be associated with a particular chromosome.

If we assume that the true haploid genome size is 3.2 Gb, or 3,200 Mb, then the sequenced and assigned part of the genome represents 89.4% and the unassigned sequenced part is 0.2%, for a combined 89.6%.
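The percentages are easy to check. Here is a minimal sketch in Python that uses only the figures quoted above; the variable and function names are mine, and the 3.2 Gb genome size is the assumption stated in the text, not a measured value.

```python
# Figures quoted in the post for GRCh37.p7.
PLACED_BP = 2_861_332_606    # sequenced and placed on a chromosome
UNPLACED_BP = 6_110_758      # sequenced but not placed in the assembly
GENOME_BP = 3_200_000_000    # ASSUMED true haploid genome size (3.2 Gb)


def percent_of_genome(bp, genome_bp=GENOME_BP):
    """Return bp as a percentage of the assumed genome size."""
    return 100 * bp / genome_bp


placed_pct = percent_of_genome(PLACED_BP)
unplaced_pct = percent_of_genome(UNPLACED_BP)
combined_pct = percent_of_genome(PLACED_BP + UNPLACED_BP)

print(f"placed:   {placed_pct:.1f}%")    # 89.4%
print(f"unplaced: {unplaced_pct:.1f}%")  # 0.2%
print(f"combined: {combined_pct:.1f}%")  # 89.6%
```

Changing GENOME_BP shows how sensitive the result is to the assumed genome size.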

We can say that only 90% of the human genome has been sequenced and the remaining 10% falls into 357 gaps scattered throughout the genome. (Every chromosome has unsequenced gaps but some have more than others and it doesn't depend on the size of the chromosome.)
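For a rough idea of how big the gaps are, the numbers above can be combined. This is only a sketch: it assumes that "Total Sequenced Bases in the Assembly" is the sum of the placed sequence, the unplaced sequence, and the estimated gap sizes, which is my reading of Deanna's explanation and not something the NCBI page states. The names are hypothetical.

```python
ASSEMBLY_TOTAL_BP = 3_173_036_847  # "Total Sequenced Bases in the Assembly"
PLACED_BP = 2_861_332_606          # ungapped sequence placed on chromosomes
UNPLACED_BP = 6_110_758            # sequenced but unplaced
GAP_COUNT = 357                    # number of gaps quoted in the post

# ASSUMPTION: assembly total = placed + unplaced + estimated gaps.
estimated_gap_bp = ASSEMBLY_TOTAL_BP - PLACED_BP - UNPLACED_BP
gap_pct = 100 * estimated_gap_bp / ASSEMBLY_TOTAL_BP
mean_gap_bp = estimated_gap_bp / GAP_COUNT

print(f"estimated gap total: {estimated_gap_bp:,} bp")  # 305,593,483 bp
print(f"share of assembly:   {gap_pct:.1f}%")           # 9.6%
print(f"mean gap size:       {mean_gap_bp:,.0f} bp")
```

The mean is not very informative by itself because a few very large gaps (centromeres and other heterochromatic regions) account for most of the missing sequence.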

The Wellcome Trust Sanger Institute is part of the Genome Reference Consortium but it maintains its own website on the human genome [Whole Genome]. The data on the Ensembl page refers to build GRCh37.p5 from Feb. 2009 but it also says the data was updated in Dec. 2011.

According to the Sanger Institute, the size of the sequenced genome is 3,283,984,159 bp and the "golden path length" is 3,101,804,739 bp. I've tried to find out what these numbers mean but if the information is present on the Ensembl website then it's very well hidden.

Are you interested in the number of genes? Here's the data from Ensembl. It indicates that the human genome contains 33,399 genes! [What Is a Gene?] [What is a gene, post-ENCODE?] The total is inflated by including 12,523 genes that make an RNA product that's not translated, a count that is almost certainly far too high.

The data indicates that there are 181,744 gene transcripts or between 5 and 9 transcripts per gene depending on how you count the genes. I don't believe there are this many biologically functional transcripts per gene. I think the actual number is much closer to one (1) [Genes and Straw Men].
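The "between 5 and 9" range comes from dividing the transcript count by two different gene counts. Here is a minimal sketch using the Ensembl figures quoted above; the names are mine, and treating "all genes minus the untranslated-RNA genes" as the second way of counting is my assumption.

```python
TOTAL_GENES = 33_399         # all genes reported by Ensembl
UNTRANSLATED_GENES = 12_523  # genes whose RNA product is not translated
TRANSCRIPTS = 181_744        # total gene transcripts reported

# ASSUMPTION: the alternative count excludes the untranslated-RNA genes.
remaining_genes = TOTAL_GENES - UNTRANSLATED_GENES

per_gene_all = TRANSCRIPTS / TOTAL_GENES
per_gene_remaining = TRANSCRIPTS / remaining_genes

print(f"genes excluding untranslated-RNA genes: {remaining_genes:,}")  # 20,876
print(f"transcripts per gene (all genes):       {per_gene_all:.1f}")   # 5.4
print(f"transcripts per gene (remaining genes): {per_gene_remaining:.1f}")  # 8.7
```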

1. It certainly doesn't "beg the question." That means something else entirely [Begging the Question].

2. That's a euphemism for "It's a lie!"
