Phrap

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Genomesequencer (talk | contribs) at 18:24, 10 April 2009. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Phrap is a widely used program for DNA sequence assembly. It is part of the Phred-Phrap-Consed package.

History

Phrap was originally developed by Prof. Phil Green for the assembly of cosmids in large-scale cosmid shotgun sequencing within the Human Genome Project. Phrap has been widely used for many different sequence assembly projects, including bacterial genome assemblies and EST assemblies.

Phrap was written as a command line program for easy integration into automated data workflows in genome sequencing centers. For users who want to use Phrap from a graphical interface, the commercial programs MacVector (for Mac OS X only) and CodonCode Aligner (for Mac OS X and Windows) are available.

Methods

A detailed (albeit partially outdated) description of the Phrap algorithms can be found in the Phrap documentation. A recurring thread within the Phrap algorithms is the use of Phred quality scores. Phrap used quality scores to solve a problem that other assembly programs struggled with at the beginning of the Human Genome Project: correctly assembling frequent repeats, in particular Alu sequences. Phrap uses quality scores to judge whether observed differences in repeated regions are more likely due to random errors in the sequences or to the sequences coming from different copies of the Alu repeat. Typically, Phrap had no problems differentiating between the different Alu copies in a cosmid and correctly assembling the cosmids (or, later, BACs).
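The reasoning about discrepancies in repeats can be illustrated with a small sketch. This is not Phrap's actual algorithm; it only shows why quality scores are informative here. It assumes the standard Phred relation between a quality score Q and an error probability P (P = 10^(-Q/10)), treats the two base calls as independent, and uses a hypothetical decision threshold of 0.01. The function names are invented for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mismatch_error_prob(q_a, q_b):
    """Upper bound on the chance that a mismatch between two aligned bases
    is caused by a base-calling error: the probability that at least one
    of the two calls is wrong, assuming independent errors."""
    p_a = phred_to_error_prob(q_a)
    p_b = phred_to_error_prob(q_b)
    return 1 - (1 - p_a) * (1 - p_b)


def classify_mismatch(q_a, q_b, threshold=0.01):
    """Hypothetical rule (the threshold is an assumption, not a Phrap value):
    if a sequencing error is a plausible explanation, call it an error;
    otherwise the reads probably come from different repeat copies."""
    if mismatch_error_prob(q_a, q_b) >= threshold:
        return "likely sequencing error"
    return "likely different repeat copies"


# Two high-quality bases that disagree are unlikely to be an error.
print(classify_mismatch(40, 40))
# A low-quality base involved in the mismatch makes an error plausible.
print(classify_mismatch(10, 40))
```

In this simplified picture, a discrepancy between two high-quality bases is strong evidence that the reads belong to different copies of a repeat, while a discrepancy involving a low-quality base is not.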

Quality based consensus sequences

Another use of Phred quality scores by Phrap that contributed to the program's success was the determination of consensus sequences using sequence qualities. In effect, Phrap automated a step that was a major bottleneck in the early phases of the Human Genome Project: determining the correct consensus sequence at all positions where the assembled sequences had discrepant bases. This approach had been suggested by Bonfield and Staden in 1995,[1] and was implemented and further optimized in Phrap. At any consensus position with discrepant bases, Phrap examines the quality scores of the aligned sequences to find the highest quality sequence. In the process, Phrap takes confirmation of local sequence by other reads into account, after considering direction and sequencing chemistry.

The mathematics of this approach were rather simple, since Phred quality scores are logarithmically linked to error probabilities. This means that the quality scores of confirming reads can simply be added, as long as the error distributions are sufficiently independent. To satisfy this independence criterion, reads must typically be in different directions, since peak patterns that cause base calling errors are often identical when a region is sequenced several times in the same direction.
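The additivity follows from the standard definition of a Phred quality score, Q = -10 log10(P), where P is the estimated error probability. If two confirming reads have independent errors, the probability that both are wrong is approximately the product of their error probabilities, and the logarithm of a product is a sum. The sketch below illustrates this; it is a simplification under the independence assumption, and the function names are hypothetical rather than taken from Phrap.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred quality score Q = -10 log10(P)."""
    return -10 * math.log10(p)


def combined_quality(qualities):
    """Quality of a base confirmed by several reads with independent errors.

    The joint error probability is taken as the product of the individual
    error probabilities, which is equivalent to summing the quality scores.
    """
    joint_error = 1.0
    for q in qualities:
        joint_error *= phred_to_error_prob(q)
    return error_prob_to_phred(joint_error)


# A Q20 read (1 error in 100) confirmed by an independent Q30 read
# (1 error in 1,000) gives a joint error of about 1 in 100,000, i.e. Q50.
print(round(combined_quality([20, 30]), 6))
```

Two reads in the same direction would not satisfy the independence assumption used here, which is why the simple sum applies mainly to reads from opposite strands.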

If a consensus base is covered by both high-quality sequence and (discrepant) low-quality sequence, Phrap's selection of the higher quality sequence will in most cases be correct. Phrap then assigns the confirmed base quality to the consensus sequence base. This makes it easy to (a) find consensus regions that are not covered by high-quality sequence (such regions will also have low consensus quality), and (b) quickly calculate a reasonably accurate estimate of the error rate of the consensus sequence. This information can then be used to direct finishing efforts, for example re-sequencing of problem regions.
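Both uses of the consensus qualities can be sketched in a few lines. The example assumes the consensus qualities are available as a list of Phred scores and uses a hypothetical cutoff of Q20 for "low quality"; neither the function names nor the cutoff come from Phrap itself.

```python
def expected_errors(consensus_qualities):
    """Estimate the number of errors in a consensus sequence by summing
    the per-base error probabilities P = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in consensus_qualities)


def low_quality_regions(consensus_qualities, min_quality=20):
    """Return half-open (start, end) index ranges of consecutive consensus
    bases whose quality is below min_quality (an assumed cutoff)."""
    regions = []
    start = None
    for i, q in enumerate(consensus_qualities):
        if q < min_quality:
            if start is None:
                start = i  # a low-quality run begins here
        elif start is not None:
            regions.append((start, i))  # the run ended at the previous base
            start = None
    if start is not None:
        regions.append((start, len(consensus_qualities)))
    return regions


qualities = [40, 40, 10, 15, 40, 5]
print(low_quality_regions(qualities))  # candidate regions for re-sequencing
print(expected_errors(qualities))      # estimated number of consensus errors
```

Dividing the expected number of errors by the consensus length gives an estimated error rate, and the low-quality ranges are natural candidates for finishing work.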

The combination of accurate, base-specific quality scores and a quality-based consensus sequence was a critical element in the success of the Human Genome Project. Phred and Phrap, and similar programs that picked up on the ideas pioneered by these two programs, enabled the assembly of large parts of the human genome (and many other genomes) at an accuracy (less than 1 error in 10,000 bases) that was substantially higher than the typical accuracy of carefully hand-edited sequences that had previously been submitted to the GenBank database.[2]

References

  1. ^ Bonfield JK, Staden R (1995). The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23(8):1406-10. PMID 7753633
  2. ^ Krawetz SA (1989). Sequence errors described in GenBank: a means to determine the accuracy of DNA sequence interpretation. Nucleic Acids Res. 17(10):3951-7


Other Software