05 Dec 02

Pairwise Sequence Analysis Tools 

  Copyright (c) 2002 Ed Green <ed@compbio.berkeley.edu>,
		     Gavin Price <gavin@compbio.berkeley.edu>,  and
                     Steven Brenner <brenner@compbio.berkeley.edu>,
		     Univ. of California, Berkeley

  Permission is hereby granted, free of charge, to any person obtaining a 
  copy of this software and associated documentation files (the "Software"),
  to deal in the Software without restriction, including without limitation
  the rights to use, copy, modify, merge, publish, distribute, sublicense,
  and/or sell copies of the Software, and to permit persons to whom the
  Software is furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice shall be included
  in all copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
  IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
  FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 
  THE SOFTWARE.

  (This is the MIT Open Source License, 
  http://www.opensource.org/licenses/mit-license.html)

*******************************************************************************

Table of Contents

I.   Obtain
II.  Install
III. Understand
IV.  Use
V.   Complain

*******************************************************************************

I.   Obtain

	The most recent versions of this software can be obtained by
	visiting http://compbio.berkeley.edu/people/ed and following
	the links. 

*******************************************************************************

II.  Install

	This package was developed and used exclusively under
	Perl-5.6.1 and RedHat Linux 7.3.  I can't think of any reason
	why this code wouldn't work in other environments, but make no
	claims that it will.  The shabang lines of these scripts call
	/usr/bin/env perl --  You may want to customize that.  Running
	any of the scripts without arguments describes their
	invocation syntax. 
 
	Installation should be straigtforward if you are at all
	familiar with perl5.  Just follow these easy steps:

1.  	Unzip and unpack the distribution.
    	[ed@wolf ~]$ gzip -d SeqCompEval.tar.gz
    	[ed@wolf ~]$ tar -xvf SeqCompEval.tar

2.  	Set your PERL5LIB environmental variable to point to the place
	where you have unpacked this package.  The method for doing
	this will depend on your shell.  For tcsh, this works:
    	[ed@wolf ~]$ setenv PERL5LIB $PERL5LIB:~/SeqCompEval

3.  	Go get bioperl (www.bioperl.org) and install it [optional].

*******************************************************************************

III. Understand

	This package contains several tools for performing some of the
	analyses described in Green and Brenner, "Bootstrapping and
	normalization for enhanced evaluations of pairwise alignment
	methods", 2002, Proc IEEE. 

	Each of the files is briefly described below.  The shabang
	line of each is /usr/bin/env perl -- You may want to customize
	that.  Running any of the scripts without arguments describes
	how to invoke them. 

astral_stats.pl
	Gives some basic information about the size and superfamily
	content of an astral database.

BBNorm.pm
	The Bayesian version of FamNorm.pm.

bootStrap2.pl
	Performs standard and Bayesian bootstraps on a dbvdbsum ouput
	file. The first version only did standard bootstraps.	

covDist.pl
	Analyzes bootstrap distribution data to determine the mean and
	standard error at a given error rate.

cve_automator.pl
	Runs cve_maker multiple times to easily set up bootstrap files
	for plotting.

cve_maker.pl
	Generates coverage versus errors per query (CVE) data from the
	results of an ASTRAL database search.

DBVDB.pm
	Module for easily accessing output of dbvdbsum.pl

dbvdbsum.pl
	Parses the output files of an ASTRAL database versus database
	search and summarizes them.

EPQvScore.pl
	Generates data showing how many errors were made at each level
	of significance.  Shows how accurate the statistical scores
	(E-values) are.

ex/	
	directory with example data, output, scripts, etc.  See the
	'Use' section below.

FamNorm.pm
	Module for ASTRAL database files.  Calculates normalization
	terms for each superfamily, among other things.

FamNormLite.pm
	A viciously eviscerated version of FamNorm.pm for use with the
	Bayesian portion of bootStrap2.pl. All FNL does is parse the
	ASTRAL database. FamNorm's normalization routines were recoded
	in BBNorm.pm.

maxCov.pl
	Outputs an ordered list of the coverage at a user-specified
	error rate given a list of cve_maker.pl output files.  Good
	to use when you are comparing the results from many different
	runs.

meansTest.pl
	Does a two sample parametric means test on the bootstrap
	distributions of two database searches. 

pOverlap.pl
	Calculates fraction overlap from two separate bootstrap
	distributions. 

README
	This file.

sfSizeDist.pl
	Shows superfamily size distribution within an ASTRAL
	database. 

sfSizePerf.pl
	Calculates coverage and erros on a per superfamily basis.

splitAstralByFold.pl
	Splits an ASTRAL database, by fold, into two non-overlapping
	databases. 


Each of the tools and a general flow-chart is shown below:

1.  	Do the database versus database search with whatever method,
    	matrix, parameter set, etc. that you want to evaluate.  You're
	on your own for this part.  The only requirements are that all
    	sequences in the database be used, individually, as query to
    	search the database and that each of these queries generates a
    	discrete output file.  Also, these output files must be
    	accessible by recursively descending down from a single
    	subdirectory, i.e., they can't be all over the place.  The
	easiest way to do this is just write them all to a single
	directory.  

2.  	Run dbvdbsum.pl.  This program summarizes the results of your
	database versus database search.  For an example of its
	output, check out ex/blastpgp-2.2.4_astral-1.61.sum

	In short, it looks at the ASTRAL database you used and figures
	how many hits you should find.  Then it goes through all your
	output files and writes a single line per hit.  Each of these
	lines has the query sequence id, the id of the sequence that
	was hit, the e-value of the hit, a number between 0 and 3
	indicating the level of SCOP relatedness between these two
	sequences, and then the weighting factors for normalization.
	The file formats supported for parsing are currently pretty
	limited, but it's easy enough to add more.  Just add another
	entry to the %FORMATS at the beginning, then in
	processThisResult(), add another elsif clause for whatever
	output format your method writes and parse it accordingly.
	Alternatively, you can write your own code for generating the
	this file.  The module, FamNorm.pm, will help you generate the
	information in the header lines.

3.	Run cve_maker.pl.  This program generates the CVE data.  It
	also reports each false positive hit.

4.	Plot the data using gnuplot.

5.	Bootstrap the output using bootstrap2.pl. Perform both a 
	Bayesian and standard bootstrap.

6.	Make cve files for each bootstrap file with cve_automator.pl.

7.	Plot the bootstrap data using gnuplot.

*******************************************************************************

IV.	Use
      
	I have included a sample dataset that you can step through to
	see how the package works.  To start, look in ex/
	There, you'll see a directory called
	blastpgp-2.2.4_astral-1.61/
	This has output generated by searching the ASTRAL version 1.61
	database, filtered at 40% sequence identity against itself.
	This is valid input for dbvdbsum.pl
	You can then run dbvdbsum.pl on this directory, like so:
	
dbvdbsum.pl -q db/astral-1.61-40.s.fa -r .bla.bz2 -d blastpgp-2.24_astral-1.61 -a 1 -f B > blastpgp-2.24_astral-1.61.sum
	
	This takes about 10 minutes on my machine.  This program,
	dbvdbsum.pl, reports on all the hits in all the results
	files in the directory specified by the -d flag.  If you set
	the -a flag to a true value, then it also checks to make sure
	there is an output file for each sequence in the database and
	gives information on the level of SCOP relatedness of each
	hit.  Take a look at the output file,
	blastpgp-2.2.4_astral-1.61.sum 
	It has some information about the database used for this
	search and how many superfamily relations there were. 
	
	Next, run cve_maker.pl on this output file, like so:

cve_maker.pl -i blastpgp-2.2.4_astral-1.61.sum -o blastpgp-2.2.4_astral-1.61.sum.dat -v 1 > blastpgp-2.2.4_astral-1.61.fp

	This generates two more files.  Take a look at
	blastpgp-2.2.4-astral-1.61.sum.dat
	It has the sets of CVE data that you can then plot with
	gnuplot to make a nice CVE plot.  Each line contains the x and
	y coordinates of a single datapoint on the CVE plots.  The
	datasets are separated by two blank lines -- gnuplot style.
	The first dataset is the unnormalized CVE data.  The second is 
	linearly normalized and the third is quadratically
	normalized.

	Now, look at the file blastpgp-2.2.4_astral-1.61.fp
	This file has a list of the false positive hits, in order of
	significance score.  As you can see, the astral domain
	sequence d1ds9a_ generated a very significant alignment with
	d1a9na_ (e-value 8e-08), yet these sequences aren't even in
	the same superfamily!

	Now, plot the CVE data and take a look at it.  The script,
	cve.gnuscript can be used for this.  It plots CVE lines under
	the three normalization schemes and generates a postscript
	file called CVE.ps

gnuplot cve.gnuscript
gv CVE.eps

	Now, take a look at how easy or difficult is was to find
	superfamily level relations on a per superfamily basis, at a
	given error rate.  The script sfSizePerf.pl does this.
	
sfSizePerf.pl -a ex/db/astral-1.61-40.s.fa -s ex/blastpgp-2.2.4_astral-1.61.sum -e 0.01 > sfSizePerf_0.01.dat

	Again, the output file is suitable as a datafile for gnuplot.
	If we look at the output file, sfSizePerf_0.01.dat, we see
	the first section lists all superfamilies, by size, with the
	number of correctly identified relations, incorrectly
	identified relations, and the number that were undetermined.
	Undetermined means that the two sequences involved were in the
	same fold, but not the same superfamily.  The next section
	compiles these results for all superfamilies of a given size.
	To view these data graphically, do this:

gnuplot sfSizePerf.gnuscript
gv sfSizePerf.eps	

	As you can see, there is a general negative correlation
	between superfamily size and ability to detect pairwise
	relations.  This is the result we expect given the
	normalization behavior. 

	Now, let's see how good the blastpgp-2.2.4 e-values are.  We
	can plot the number of errors made at each e-value using
	EPQvScore.pl
	The file blastpgp-2.2.4_astral-1.61.fp has each error (false
	positive), along with the e-value at which it was reported.
	This file is used as input to EPQvScore.pl

EPQvScore.pl -a ex/db/astral-1.61-40.s.fa -f blastpgp-2.2.4_astral-1.61.fp > blastpgp-2.2.4_astral-1.61.epqvscore.dat
gnuplot epqscore.gnuscript
gv epqvscore.eps

	Looking at the plot, we see that at very low error rates,
	there were more errors than the BLAST e-value predicted there
	would be.  This means that the significance is over-estimated
	at low error rates.  The 'ideal' line is how a method would
	perform if it was perfect at estimating how many errors it
	makes.  Once you get up near 1% EPQ and above, the scores look
	pretty good.

	If we had data from multiple runs, for example, if we had run blast
	with many different matrices, and we wanted to know which one is
	the 'best', we use maxCov.pl.  This program looks at all the 
	cve_maker.pl output files in a given directory and makes a sorted
	list of the coverage that they generated.  In this example, we 
	only have data from one run, but we'll do it anyway.  Call
	maxCov.pl thusly:

maxCov.pl -i . -e 0.01 -r .sum.dat > maxCov.out

	Looking at the output file, maxCov.out, we see that there was only
	one file that had the filename restriction '.sum.dat', our single 
	cve_maker.pl output file.  If there had be more then our ouput
	file would be longer.  The output file shows how much coverage
	was achieved for each normalization method in this run.

	Now we'll use the dbvdbsum file to generate bootstrapped data 
	sets with bootStrap2.pl.  This program makes a user-defined
	number of bootstrapped databases and samples the results in
	the .sum file to see what the results would have been if some
	sequences were left out and other repeated.  It can be run in
	two different modes: regular or Bayesian.  The regular mode
	assigns an integer value to each sequence (0, 1, 2, ...) which
	will be the number of times this sequence will be represented
	in the bootstrap database.  The Bayesian method assigns a
	floating point weight to each sequence.  The regular mode was
	described in the manuscript, "Bootstrapping and normalization
	for enhanced pairwise sequence comparison."  As mentioned
	there, there is an unfortunate bias in bootstrapping this way:
	some smaller superfamiles, by chance, do not get sampled at
	all.  Because of the aforementioned tendency for the easier to
	detect relations to occur in smaller superfamilies, this
	creates a bias in this bootstrap method.  The effect of this
	bias can be seen by comparing the mean of the bootstrapped
	coverage distribution with the coverage of the original data.
	The bootstrapped mean is less because some of the easier
	relations fall out.  To address this, we implemented the
	Bayesian bootstrap.  This assigns a weight to each sequence.
	In this way, each sequence will be represented and the bias
	drops out.  

	Efron gives a rule of thumb of 200 bootstrap replicates to get
	a reliable distribution.  For the sake of keeping this
	tutorial down to a managable size, however, I am only going to
	run 10.  If you run this yourself and do 200 replicates,
	beware that it generates a lot of output.  Make a new
	directory for the bootstrap data:

mkdir bs

	Then go there and run the bootstrap2.pl program, like this:

bootStrap2.pl -a ../db/astral-1.61-40.s.fa -f ../blastpgp-2.2.4_astral-1.61.sum -n 10 -b 1

	The -b flag is used to specify running a Bayesian bootstrap.

	Take a look at the resulting bootstrap files.  They have the
	.sum.bbs.X extension, where X is the bootstrap number.
	They're identical in format to a standard dbvdbsum output file.

	Next we'll use cve_automator.pl to make CVE files for each bootstrap
	file.  Again, a new output cve data file is made for each
	bootstrap replicate.  So, if you have done a lot of these,
	you're going to get a lot more data here.
	cve_automator just runs cve_maker.pl multiple times so
	the user doesn't have to invoke it once per each bootstrap
	file.  First, we need to copy the original summary file to the
	bs subdirectory.  From the bs subdirectory, do this:

cp ../blastpgp-2.2.4_astral-1.61.sum .

	Now, invoke cve_automator thusly:

cve_automator.pl -i blastpgp-2.2.4_astral-1.61.sum -n 10 -b 1

	The output files will be identical (format-wise) to the file 
	generated earlier by cve_maker.pl.
	
	Now let's plot the whole mess. Go back up to the ex
	subdirectory and run:

gnuplot bootstrap.gnuscript

	and then:

gv bayesianbootstrap.eps

	As you can see, the bootstrap distribution falls pretty much
	evenly in a fairly narrow band around the orignal CVE lines.
	These distributions can then be used to statistically compare
	any two methods.

	Now, we'll use some of the tools that examine these
	distributions.  The first one is covDist.pl.  It goes through
	all the cve (.dat) files in a given directory and finds the
	distribution of coverage at a user-specified error rate under
	a user-specified normalization scheme.  It finds the expected
	value (mean) and the standard error of the coverage under
	these user-specified values.  It also outputs a gnuplot script
	file that shows the histogram of these values and the guassian
	distribution generated by the parameters.  If the bootstrap
	technique is working correctly, the histogram should closely
	mirror this guassian. 

	Let's run covDist.pl at 0.01 errors per query, looking at the
	linearly normalized bootstrap data:

covDist.pl -i bs -e 0.01 -g covDist.gnuscript -r .dat -n 2

	This makes two output files, covDist.gnuscript and
	covDist.gnuscript.dat and reports some information to the
	standard output.  Let's look at covDist.gnuscript.dat 
	It reports the expected (mean) coverage and the standard error
	at the error rate we specified.  It also has some gnuplot
	style data.  This data is a histogram of the coverage in the
	bootstrap data.

	In covDist.gnuscript we have a gnuplot script file that will
	show us this data graphically.  To see it, call gnuplot on
	this script:	 

gnuplot covDist.gnuscript
	
	This will create covDist.gnuscript.eps
	Looking at this file:

gv covDist.gnuscript.eps

	We see that this histogram is centered in the gaussian, but
	the gaussian is much steeper than our data.  This is because
	the binning of the histogram is tuned for 200 replicates.  You
	can adjust this value in covDist.pl, but if you have done 200
	replicates, this generates a nicer looking plot.

	The next program we'll use is meansTest.pl
	This runs a two sample parametric means test on the bootstrap
	data of two database searches.  It can be used to determine if
	the performance differences are statistically significant.
	Since, in this example, we only have results from one database
	search, we'll just run that against itself and see if it
	differs significantly from itself. 

meansTest.pl -o covDist.gnuscript.dat -t covDist.gnuscript.dat -s 10
Z score = 0

	As expected, there is no difference between these two
	distributions.  Not a big surprise since they are the same
	distribution. 

	Another way to compare the two distributions is to see how
	much they overlap.  That is how often is a randomly sampled
	value from one greater than a randomly sampled value from the
	other.  To get this value, we use pOverlap.pl
	This program looks at the coverage distributions at a
	user-defined error rate and normalization scheme from two
	bootstrap datasets.  Again, we only have one so we'll have to
	compare it against itself.  The number reported is the percent
	of time the the first one was greater than the second.
	Call pOverlap.pl like this:

pOverlap.pl -o bs -t bs -n 100000 -e 0.01 -r .dat -n 1000 -N 2
0.504

	The number, 0.504, means the 50.4% of the time the coverage at
	0.01 errors per query in the bootstrap distribution under
	linear normalization in bs/ was greater than that in /bs (the
	same distribution).  This means the two distributions are very
	nearly the same.  We, however, know that they are exactly the
	same. 


*******************************************************************************

V.	Complain

	You can contact me at ed@compbio.berkeley.edu or
	edgreen@ugaalum.uga.edu

	You can contact Gavin (the author of the Bayesian bootstrap 
	stuff) at gavin@compbio.berkeley.edu. He's a nasty little
	troll, though, so don't be surprised if his response consists
	only of 'RTFM.'

	There is no manual.

	I'm very interested in results generated by these or related
	tools or suggestions for improvement or new features.  I'm not
	interested in making this code work on obscure platforms or
	other trivial problems.