05 Dec 02 Pairwise Sequence Analysis Tools Copyright (c) 2002 Ed Green , Gavin Price , and Steven Brenner , Univ. of California, Berkeley Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. (This is the MIT Open Source License, http://www.opensource.org/licenses/mit-license.html) ******************************************************************************* Table of Contents I. Obtain II. Install III. Understand IV. Use V. Complain ******************************************************************************* I. Obtain The most recent versions of this software can be obtained by visiting http://compbio.berkeley.edu/people/ed and following the links. ******************************************************************************* II. Install This package was developed and used exclusively under Perl-5.6.1 and RedHat Linux 7.3. I can't think of any reason why this code wouldn't work in other environments, but make no claims that it will. The shabang lines of these scripts call /usr/bin/env perl -- You may want to customize that. Running any of the scripts without arguments describes their invocation syntax. Installation should be straigtforward if you are at all familiar with perl5. Just follow these easy steps: 1. Unzip and unpack the distribution. [ed@wolf ~]$ gzip -d SeqCompEval.tar.gz [ed@wolf ~]$ tar -xvf SeqCompEval.tar 2. Set your PERL5LIB environmental variable to point to the place where you have unpacked this package. The method for doing this will depend on your shell. For tcsh, this works: [ed@wolf ~]$ setenv PERL5LIB $PERL5LIB:~/SeqCompEval 3. Go get bioperl (www.bioperl.org) and install it [optional]. ******************************************************************************* III. Understand This package contains several tools for performing some of the analyses described in Green and Brenner, "Bootstrapping and normalization for enhanced evaluations of pairwise alignment methods", 2002, Proc IEEE. Each of the files is briefly described below. The shabang line of each is /usr/bin/env perl -- You may want to customize that. Running any of the scripts without arguments describes how to invoke them. astral_stats.pl Gives some basic information about the size and superfamily content of an astral database. BBNorm.pm The Bayesian version of FamNorm.pm. bootStrap2.pl Performs standard and Bayesian bootstraps on a dbvdbsum ouput file. The first version only did standard bootstraps. covDist.pl Analyzes bootstrap distribution data to determine the mean and standard error at a given error rate. cve_automator.pl Runs cve_maker multiple times to easily set up bootstrap files for plotting. cve_maker.pl Generates coverage versus errors per query (CVE) data from the results of an ASTRAL database search. DBVDB.pm Module for easily accessing output of dbvdbsum.pl dbvdbsum.pl Parses the output files of an ASTRAL database versus database search and summarizes them. EPQvScore.pl Generates data showing how many errors were made at each level of significance. Shows how accurate the statistical scores (E-values) are. ex/ directory with example data, output, scripts, etc. See the 'Use' section below. FamNorm.pm Module for ASTRAL database files. Calculates normalization terms for each superfamily, among other things. FamNormLite.pm A viciously eviscerated version of FamNorm.pm for use with the Bayesian portion of bootStrap2.pl. All FNL does is parse the ASTRAL database. FamNorm's normalization routines were recoded in BBNorm.pm. maxCov.pl Outputs an ordered list of the coverage at a user-specified error rate given a list of cve_maker.pl output files. Good to use when you are comparing the results from many different runs. meansTest.pl Does a two sample parametric means test on the bootstrap distributions of two database searches. pOverlap.pl Calculates fraction overlap from two separate bootstrap distributions. README This file. sfSizeDist.pl Shows superfamily size distribution within an ASTRAL database. sfSizePerf.pl Calculates coverage and erros on a per superfamily basis. splitAstralByFold.pl Splits an ASTRAL database, by fold, into two non-overlapping databases. Each of the tools and a general flow-chart is shown below: 1. Do the database versus database search with whatever method, matrix, parameter set, etc. that you want to evaluate. You're on your own for this part. The only requirements are that all sequences in the database be used, individually, as query to search the database and that each of these queries generates a discrete output file. Also, these output files must be accessible by recursively descending down from a single subdirectory, i.e., they can't be all over the place. The easiest way to do this is just write them all to a single directory. 2. Run dbvdbsum.pl. This program summarizes the results of your database versus database search. For an example of its output, check out ex/blastpgp-2.2.4_astral-1.61.sum In short, it looks at the ASTRAL database you used and figures how many hits you should find. Then it goes through all your output files and writes a single line per hit. Each of these lines has the query sequence id, the id of the sequence that was hit, the e-value of the hit, a number between 0 and 3 indicating the level of SCOP relatedness between these two sequences, and then the weighting factors for normalization. The file formats supported for parsing are currently pretty limited, but it's easy enough to add more. Just add another entry to the %FORMATS at the beginning, then in processThisResult(), add another elsif clause for whatever output format your method writes and parse it accordingly. Alternatively, you can write your own code for generating the this file. The module, FamNorm.pm, will help you generate the information in the header lines. 3. Run cve_maker.pl. This program generates the CVE data. It also reports each false positive hit. 4. Plot the data using gnuplot. 5. Bootstrap the output using bootstrap2.pl. Perform both a Bayesian and standard bootstrap. 6. Make cve files for each bootstrap file with cve_automator.pl. 7. Plot the bootstrap data using gnuplot. ******************************************************************************* IV. Use I have included a sample dataset that you can step through to see how the package works. To start, look in ex/ There, you'll see a directory called blastpgp-2.2.4_astral-1.61/ This has output generated by searching the ASTRAL version 1.61 database, filtered at 40% sequence identity against itself. This is valid input for dbvdbsum.pl You can then run dbvdbsum.pl on this directory, like so: dbvdbsum.pl -q db/astral-1.61-40.s.fa -r .bla.bz2 -d blastpgp-2.24_astral-1.61 -a 1 -f B > blastpgp-2.24_astral-1.61.sum This takes about 10 minutes on my machine. This program, dbvdbsum.pl, reports on all the hits in all the results files in the directory specified by the -d flag. If you set the -a flag to a true value, then it also checks to make sure there is an output file for each sequence in the database and gives information on the level of SCOP relatedness of each hit. Take a look at the output file, blastpgp-2.2.4_astral-1.61.sum It has some information about the database used for this search and how many superfamily relations there were. Next, run cve_maker.pl on this output file, like so: cve_maker.pl -i blastpgp-2.2.4_astral-1.61.sum -o blastpgp-2.2.4_astral-1.61.sum.dat -v 1 > blastpgp-2.2.4_astral-1.61.fp This generates two more files. Take a look at blastpgp-2.2.4-astral-1.61.sum.dat It has the sets of CVE data that you can then plot with gnuplot to make a nice CVE plot. Each line contains the x and y coordinates of a single datapoint on the CVE plots. The datasets are separated by two blank lines -- gnuplot style. The first dataset is the unnormalized CVE data. The second is linearly normalized and the third is quadratically normalized. Now, look at the file blastpgp-2.2.4_astral-1.61.fp This file has a list of the false positive hits, in order of significance score. As you can see, the astral domain sequence d1ds9a_ generated a very significant alignment with d1a9na_ (e-value 8e-08), yet these sequences aren't even in the same superfamily! Now, plot the CVE data and take a look at it. The script, cve.gnuscript can be used for this. It plots CVE lines under the three normalization schemes and generates a postscript file called CVE.ps gnuplot cve.gnuscript gv CVE.eps Now, take a look at how easy or difficult is was to find superfamily level relations on a per superfamily basis, at a given error rate. The script sfSizePerf.pl does this. sfSizePerf.pl -a ex/db/astral-1.61-40.s.fa -s ex/blastpgp-2.2.4_astral-1.61.sum -e 0.01 > sfSizePerf_0.01.dat Again, the output file is suitable as a datafile for gnuplot. If we look at the output file, sfSizePerf_0.01.dat, we see the first section lists all superfamilies, by size, with the number of correctly identified relations, incorrectly identified relations, and the number that were undetermined. Undetermined means that the two sequences involved were in the same fold, but not the same superfamily. The next section compiles these results for all superfamilies of a given size. To view these data graphically, do this: gnuplot sfSizePerf.gnuscript gv sfSizePerf.eps As you can see, there is a general negative correlation between superfamily size and ability to detect pairwise relations. This is the result we expect given the normalization behavior. Now, let's see how good the blastpgp-2.2.4 e-values are. We can plot the number of errors made at each e-value using EPQvScore.pl The file blastpgp-2.2.4_astral-1.61.fp has each error (false positive), along with the e-value at which it was reported. This file is used as input to EPQvScore.pl EPQvScore.pl -a ex/db/astral-1.61-40.s.fa -f blastpgp-2.2.4_astral-1.61.fp > blastpgp-2.2.4_astral-1.61.epqvscore.dat gnuplot epqscore.gnuscript gv epqvscore.eps Looking at the plot, we see that at very low error rates, there were more errors than the BLAST e-value predicted there would be. This means that the significance is over-estimated at low error rates. The 'ideal' line is how a method would perform if it was perfect at estimating how many errors it makes. Once you get up near 1% EPQ and above, the scores look pretty good. If we had data from multiple runs, for example, if we had run blast with many different matrices, and we wanted to know which one is the 'best', we use maxCov.pl. This program looks at all the cve_maker.pl output files in a given directory and makes a sorted list of the coverage that they generated. In this example, we only have data from one run, but we'll do it anyway. Call maxCov.pl thusly: maxCov.pl -i . -e 0.01 -r .sum.dat > maxCov.out Looking at the output file, maxCov.out, we see that there was only one file that had the filename restriction '.sum.dat', our single cve_maker.pl output file. If there had be more then our ouput file would be longer. The output file shows how much coverage was achieved for each normalization method in this run. Now we'll use the dbvdbsum file to generate bootstrapped data sets with bootStrap2.pl. This program makes a user-defined number of bootstrapped databases and samples the results in the .sum file to see what the results would have been if some sequences were left out and other repeated. It can be run in two different modes: regular or Bayesian. The regular mode assigns an integer value to each sequence (0, 1, 2, ...) which will be the number of times this sequence will be represented in the bootstrap database. The Bayesian method assigns a floating point weight to each sequence. The regular mode was described in the manuscript, "Bootstrapping and normalization for enhanced pairwise sequence comparison." As mentioned there, there is an unfortunate bias in bootstrapping this way: some smaller superfamiles, by chance, do not get sampled at all. Because of the aforementioned tendency for the easier to detect relations to occur in smaller superfamilies, this creates a bias in this bootstrap method. The effect of this bias can be seen by comparing the mean of the bootstrapped coverage distribution with the coverage of the original data. The bootstrapped mean is less because some of the easier relations fall out. To address this, we implemented the Bayesian bootstrap. This assigns a weight to each sequence. In this way, each sequence will be represented and the bias drops out. Efron gives a rule of thumb of 200 bootstrap replicates to get a reliable distribution. For the sake of keeping this tutorial down to a managable size, however, I am only going to run 10. If you run this yourself and do 200 replicates, beware that it generates a lot of output. Make a new directory for the bootstrap data: mkdir bs Then go there and run the bootstrap2.pl program, like this: bootStrap2.pl -a ../db/astral-1.61-40.s.fa -f ../blastpgp-2.2.4_astral-1.61.sum -n 10 -b 1 The -b flag is used to specify running a Bayesian bootstrap. Take a look at the resulting bootstrap files. They have the .sum.bbs.X extension, where X is the bootstrap number. They're identical in format to a standard dbvdbsum output file. Next we'll use cve_automator.pl to make CVE files for each bootstrap file. Again, a new output cve data file is made for each bootstrap replicate. So, if you have done a lot of these, you're going to get a lot more data here. cve_automator just runs cve_maker.pl multiple times so the user doesn't have to invoke it once per each bootstrap file. First, we need to copy the original summary file to the bs subdirectory. From the bs subdirectory, do this: cp ../blastpgp-2.2.4_astral-1.61.sum . Now, invoke cve_automator thusly: cve_automator.pl -i blastpgp-2.2.4_astral-1.61.sum -n 10 -b 1 The output files will be identical (format-wise) to the file generated earlier by cve_maker.pl. Now let's plot the whole mess. Go back up to the ex subdirectory and run: gnuplot bootstrap.gnuscript and then: gv bayesianbootstrap.eps As you can see, the bootstrap distribution falls pretty much evenly in a fairly narrow band around the orignal CVE lines. These distributions can then be used to statistically compare any two methods. Now, we'll use some of the tools that examine these distributions. The first one is covDist.pl. It goes through all the cve (.dat) files in a given directory and finds the distribution of coverage at a user-specified error rate under a user-specified normalization scheme. It finds the expected value (mean) and the standard error of the coverage under these user-specified values. It also outputs a gnuplot script file that shows the histogram of these values and the guassian distribution generated by the parameters. If the bootstrap technique is working correctly, the histogram should closely mirror this guassian. Let's run covDist.pl at 0.01 errors per query, looking at the linearly normalized bootstrap data: covDist.pl -i bs -e 0.01 -g covDist.gnuscript -r .dat -n 2 This makes two output files, covDist.gnuscript and covDist.gnuscript.dat and reports some information to the standard output. Let's look at covDist.gnuscript.dat It reports the expected (mean) coverage and the standard error at the error rate we specified. It also has some gnuplot style data. This data is a histogram of the coverage in the bootstrap data. In covDist.gnuscript we have a gnuplot script file that will show us this data graphically. To see it, call gnuplot on this script: gnuplot covDist.gnuscript This will create covDist.gnuscript.eps Looking at this file: gv covDist.gnuscript.eps We see that this histogram is centered in the gaussian, but the gaussian is much steeper than our data. This is because the binning of the histogram is tuned for 200 replicates. You can adjust this value in covDist.pl, but if you have done 200 replicates, this generates a nicer looking plot. The next program we'll use is meansTest.pl This runs a two sample parametric means test on the bootstrap data of two database searches. It can be used to determine if the performance differences are statistically significant. Since, in this example, we only have results from one database search, we'll just run that against itself and see if it differs significantly from itself. meansTest.pl -o covDist.gnuscript.dat -t covDist.gnuscript.dat -s 10 Z score = 0 As expected, there is no difference between these two distributions. Not a big surprise since they are the same distribution. Another way to compare the two distributions is to see how much they overlap. That is how often is a randomly sampled value from one greater than a randomly sampled value from the other. To get this value, we use pOverlap.pl This program looks at the coverage distributions at a user-defined error rate and normalization scheme from two bootstrap datasets. Again, we only have one so we'll have to compare it against itself. The number reported is the percent of time the the first one was greater than the second. Call pOverlap.pl like this: pOverlap.pl -o bs -t bs -n 100000 -e 0.01 -r .dat -n 1000 -N 2 0.504 The number, 0.504, means the 50.4% of the time the coverage at 0.01 errors per query in the bootstrap distribution under linear normalization in bs/ was greater than that in /bs (the same distribution). This means the two distributions are very nearly the same. We, however, know that they are exactly the same. ******************************************************************************* V. Complain You can contact me at ed@compbio.berkeley.edu or edgreen@ugaalum.uga.edu You can contact Gavin (the author of the Bayesian bootstrap stuff) at gavin@compbio.berkeley.edu. He's a nasty little troll, though, so don't be surprised if his response consists only of 'RTFM.' There is no manual. I'm very interested in results generated by these or related tools or suggestions for improvement or new features. I'm not interested in making this code work on obscure platforms or other trivial problems.