IBM Journal of Research and Development
IBM Skip to main content
  Home     Products & services     Support & downloads     My account  

  Select a country  
Journals Home  
  Systems Journal  
Journal of Research
and Development
  ·  Current Issue  
  ·  Recent Issues  
  ·  Papers in Progress  
  ·  Search/Index  
  ·  Orders  
  ·  Description  
  ·  Patents  
  ·  Recent publications  
  ·  Author's Guide  
  Staff  
  Contact Us  
  Related links:  
     IBM Research  

IBM Journal of Research and Development  
Volume 45, Number 3/4, Page 439 (2001)
Deep computing for the life sciences
  Full article: arrowHTML arrowPDF arrowASCII   arrowCopyright info





   

Brute force estimation of the number of human genes using EST clustering as a measure

by D. B. Davison, J. F. Burke
A current question of considerable interest to both the medical and nonmedical communities concerns the number of human transcription units (which, for the purposes of this paper, are “genes”) and proteins. Even with the recent announcement of the completion of the draft sequence of the human genome, it is still extremely difficult to predict the number of genes present in the genome. There are several methods for gene prediction, all involving computational tools. One way to approach this question, involving both computation and experiment, is to look at copies of fragments of messenger ribonucleic acid (mRNA) called expressed sequence tags (ESTs). The mRNA comes only from a gene being expressed, or translated, into RNA; by clustering mRNA fragments, we can try to reconstruct the expressed gene. While the final result is a very rough representation of the “true expressed transcript,” it is probably within 20% of the real number. Here, we review the issues involved in EST clustering and present an estimate of the total number of human genes. Our results to date indicate that there are some 70000 transcription units, with an average of 1.2 different transcripts per transcription unit. Thus, we estimate the total number of human proteins to be at least 85000. The total number of proteins will be higher because of post-translational modification.
Related Subjects: Biology and biomedical studies; Biotechnology; Genomic and proteomic analysis