In order to assess database search methods, it is necessary to have a test dataset of sequences whose relationships are known. It is important that this knowledge be derived independently of the methods being tested. We use the structurally and evolutionarily derived relationships in the SCOP database for this purpose (figure). Specifically, the ASTRAL database provides SCOP sequences filtered at various levels of sequence identity. We use the ASTRAL genetic domain database filtered at 40% identity to make our evaluations specific for remote homologs. Sequences that are classified in the same superfamily are related both structurally and evolutionarily. Therefore, our tests evaluate a given method's ability to find the relations between superfamily members.
A good search method will find as many of the true homologs as possible in a database while cleanly separating them from the non-homologs. It will additionally be coupled to a good statistical scoring method that accurately reports how likely it is that any given match arose purely by chance and not because the sequences are really related.
To assess a database search method, we perform a database-vs-database search with our test sequences, derived from the ASTRAL database (figure). That is, we take each sequence in turn and use it as a query to search the database. All of the hits from all of these searches are then pooled and sorted from most significant to least significant. Since we already know which sequences are related, we can walk down this list and count how many relations the method being evaluated got right. Ideally, there would be a clean separation between the true homologs at the top of the list and the non-homologs at the bottom. In practice this never happens.
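The pooled all-vs-all procedure can be sketched as below. This is a minimal illustration, not the actual evaluation code: `search` is a hypothetical stand-in for whichever pairwise method is being tested (BLAST, SSEARCH, etc.), assumed to return (target, E-value) pairs for a query against the database.

```python
def pooled_hits(database, search):
    """Run every sequence as a query, pool all hits, and sort them
    from most significant (smallest E-value) to least significant."""
    hits = []
    for query in database:
        for target, evalue in search(query, database):
            if target is not query:          # skip trivial self-hits
                hits.append((evalue, query, target))
    hits.sort(key=lambda h: h[0])
    return hits
```

The single pooled list is what makes the evaluation comparable across queries: every hit from every search competes on the same significance scale.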
These results can be rendered in a Coverage versus Errors per Query (CVE) plot (figure). Coverage is the fraction of the true relations in the database that were found. Errors per query is the number of times a false relationship was reported, divided by the number of query sequences in the database. The sorted list of all database search hits is traversed, and at each significance level a point on the CVE plot is generated. Looking at the CVE plot in the figure, you can see that as coverage increases (as you find more homologs), the number of errors also increases. This is what is meant by the compromise between sensitivity and specificity.
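Tracing the CVE curve from the pooled hit list can be sketched as follows; the function and parameter names here are illustrative, not from the manuscript. Each hit either increments the count of true relations found or the count of errors, and each step yields one (coverage, errors-per-query) point.

```python
def cve_points(sorted_hits, is_homolog, n_true, n_queries):
    """Walk the ranked hit list (most significant first) and emit one
    (coverage, errors per query) point per hit.

    sorted_hits : list of (evalue, query, target), already sorted
    is_homolog  : predicate telling whether (query, target) is a true relation
    n_true      : total number of true relations in the database
    n_queries   : number of query sequences in the database
    """
    found = errors = 0
    points = []
    for evalue, query, target in sorted_hits:
        if is_homolog(query, target):
            found += 1
        else:
            errors += 1
        points.append((found / n_true, errors / n_queries))
    return points
```

Plotting these points directly gives the CVE curve; a better method's curve pushes toward higher coverage at any fixed error rate.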
CVE results from any two database search methods (or parameter sets, or scoring schemes, etc.) can be compared, but it is not immediately obvious how significant any performance difference may be. To address this question, we use a technique pioneered by Brad Efron known as bootstrapping (figure). Bootstrapping is a method for estimating the distribution from which a given statistic derives, even when you cannot resample from the population again. That is exactly the problem we face here. There is a population of protein sequences out in nature, and only a small fraction of it is sampled in the ASTRAL database. We cannot easily throw out the ASTRAL database and have new ones generated. If we could, we would recalculate the CVE statistics for each one and observe their distribution directly. Instead, we bootstrap the data we do have. That is, we resample it by making new databases that derive from the original database. These resampled databases have the same number of sequences as the original database, but each sequence is represented a random number of times.
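The resampling step itself is simple: draw sequences from the original database with replacement until a database of the same size is built, so each sequence appears a random number of times (possibly zero). A minimal sketch:

```python
import random

def bootstrap_database(database, rng=random):
    """Resample the database with replacement: same number of sequences,
    but each original sequence appears a random number of times."""
    return [rng.choice(database) for _ in range(len(database))]
```

One would then recompute the CVE statistics on each resampled database; the spread of those statistics across many replicates estimates the sampling distribution of the original statistic.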
Another consideration in these analyses is the representational bias within our test set. Since our test sequences derive from solved structures, this problem is particularly acute: only sequences that are amenable to structure determination and deemed interesting research subjects have their structures solved and make their way into the ASTRAL database. The ASTRAL database, filtered at 40% identity, has a few very large superfamilies, like the immunoglobulins, and very many smaller superfamilies. To address this problem, we can normalize results by superfamily size, downweighting relations from larger superfamilies. We employ two approaches. Linear normalization downweights each correctly identified relation in linear proportion to its superfamily size; larger superfamilies still contribute more to the overall results, but less than in unnormalized results. The other scheme, quadratic normalization, weights the results from each superfamily equally, regardless of its size.
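One formulation consistent with this description (the exact weights used in the manuscript may differ) follows from counting relations: a superfamily of m sequences contains m(m-1) ordered true relations, so weighting each relation by 1/m makes the superfamily's total contribution grow linearly with its size, while weighting by 1/(m(m-1)) makes every superfamily contribute equally.

```python
def relation_weight(superfamily_size, scheme="linear"):
    """Weight for one true relation from a superfamily of the given size.

    A superfamily of m sequences contains m*(m-1) ordered true relations.
    'linear'    : weight 1/m        -> total contribution ~ m (linear in size)
    'quadratic' : weight 1/(m*(m-1)) -> total contribution = 1 (equal for all)
    """
    m = superfamily_size
    if scheme == "linear":
        return 1.0 / m
    if scheme == "quadratic":
        return 1.0 / (m * (m - 1))
    raise ValueError("unknown scheme: %s" % scheme)
```

Under these weights, coverage becomes a weighted fraction of true relations found rather than a raw count.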
The figure shows a CVE plot of four popular pairwise search programs, using parameters optimized for this test (see the "Bootstrapping and Normalization for Enhanced Evaluations of Pairwise Sequence Comparison" manuscript for details). As you can see, all four methods fail to detect the majority of the evolutionary relationships in the database. This is what is meant by understanding how well the programs perform in an absolute sense. If our test database is anything like the real sequence databases biologists use (like GenBank), this means that when one does a database search, most of the sequences that are really related cannot be detected. So, here's a good take-home message: if you clone and sequence a new gene, BLAST it against GenBank, and fail to find any significant hits, do not write in your paper that your sequence is not homologous to any other known sequence. It very well may be; you just can't detect it.
Another interesting observation is that for all four methods, normalizing the results increases the coverage. This means that when results from large superfamilies are discounted, the situation improves. Therefore, in the ASTRAL database, the pairwise relations in larger superfamilies are, on the whole, more difficult to detect than those in smaller superfamilies. Why should this be the case? I do not know.
Using the bootstrap procedure, we can determine whether the coverage difference we observe at a given error rate is significant. The figure to the right shows the 200 CVE lines generated when we bootstrap resample the database 200 times. Overlaid is the original CVE line, that is, the CVE line from the unbootstrapped database. As you can see, the resampled databases are distributed around the original line. The inset shows a histogram of the 200 bootstrap coverage values at a 1% error rate. If you calculate the standard deviation of these values and plug it into the Gaussian formula, bootstrap theory predicts that the resulting curve should fit the actual data well. It looks like it does.
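Fitting that Gaussian is just a matter of taking the mean and standard deviation of the 200 bootstrap coverage values and evaluating the normal density; a minimal sketch (helper names are illustrative):

```python
import math

def mean_sd(values):
    """Sample mean and standard deviation of the bootstrap coverage values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    """Normal density predicted by bootstrap theory for the histogram."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
```

Overlaying `gaussian_pdf` (scaled by bin width and sample count) on the histogram of bootstrap coverages gives the visual fit described above.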
This figure shows how significant the differences between these methods are. The left axis is the Z-score of a two-sample parametric means test between two methods, using the bootstrapped standard error. Generally, a Z-score whose absolute value is greater than 2 or 3 means that two values are significantly different. By this measure, SSEARCH generates results that are significantly better than the heuristic methods. The right axis is the fraction overlap between the two bootstrap distributions, a simple metric meant to give a general sense of how the two distributions compare. It is generated by randomly drawing one coverage value from each method's 200 bootstrap values and checking which is greater; this is repeated 1000 times. If the two distributions are identical, the value drawn from either one will be greater about half the time, giving a fraction overlap of 0.5. As you can see, this statistic roughly mirrors the Z-score of the two-sample parametric means test.
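Both statistics are easy to compute from the two sets of bootstrap coverage values. The sketch below is one plausible reading of the description (the manuscript's exact test statistic may differ in detail): the Z-score divides the difference in mean coverage by the combined bootstrap standard errors, and the fraction overlap is the empirical win rate of one method over the other across random draws.

```python
import math
import random

def _mean_sd(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
    return mu, math.sqrt(var)

def z_score(cov_a, cov_b):
    """Two-sample parametric means test using bootstrap standard errors."""
    (mu_a, sd_a), (mu_b, sd_b) = _mean_sd(cov_a), _mean_sd(cov_b)
    return (mu_a - mu_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)

def fraction_overlap(cov_a, cov_b, draws=1000, rng=random):
    """Fraction of random draws in which a coverage value from method A
    exceeds one from method B; identical distributions give about 0.5."""
    wins = sum(rng.choice(cov_a) > rng.choice(cov_b) for _ in range(draws))
    return wins / draws
```

Well-separated distributions push the fraction overlap toward 0 or 1, mirroring a large absolute Z-score.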
The datasets used in these evaluations derive from ASTRAL, and up-to-date versions can always be obtained there. You can also get the tools and datasets for sequence comparison, as used in "Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison".
Georg Fuellen put together a very nice BioComputing Hypertext Coursebook. It has a chapter on pairwise sequence alignment. Keith Robinson made an introduction to sequence analysis.