A program to generate DNA or RNA position-specific scoring matrices and to search databases of sequences with these matrices.
Background:
Some biologically relevant sites in DNA or RNA sequence can be
described using a simple consensus sequence. The binding site for
the EcoRI DNA restriction enzyme, for example, is GAATTC.
This string completely describes the EcoRI site in that EcoRI always
binds GAATTC and never binds any other sequence. However, many
biologically relevant sequences can not be described as a simple
string. For example, a transcription factor may have a preference
for binding a T at a certain position in its binding site yet may
tolerate some other base, provided the rest of its binding preferences
are met. For biologically relevent sites like these, a consensus
will fail to capture all that is known about the site.
One way to capture more information is by using a slightly more sophisticated model. The binding site or motif can be encoded as a position-specific scoring matrix (PSSM) also called a postion weight matrix (PWM). These models of motifs encode data describing what tendencies there are at each position in the motif. This includes capturing whether a given position is important (conserved) or not.
A PSSM is trained using the data from a multiple sequence alignment of the motif. The data in each column is used to determine how often each base occurs at each position. To generate a PSSM, one often uses a multiple sequence alignment of examples of the site. For example, suppose were are given this multiple sequence alignment of binding site instances for a fictional example protein, EX:
Multiple sequence alignment for binding site EX
>Sequence1
ACGCTGA
>Sequence2
ACCCTCA
>Sequence3
AGGGTAA
>Sequence4
AGCCTTA
>Sequence5
AAGCTAA
>Sequence6
AGCTTGA
>Sequence7
AGCGTGA
>Sequence8
AGCCCAA
We can consider these 8 sequences as a sample of all the sequences that EX could bind. Then, we'll infer what the binding preferences for EX are. Inspecting this alignment, you may notice that the first and last positions are always 'A' and a few other positions are often one base or another. One could generate a consensus of this site: AGCCTXA, for example. Using this representation, however, we would not know that the first and last positions are always conserved, whereas other positions are not. To capture all such tendencies observed in our multiple sequence alignment, we can use a PSSM.
Constructing the PSSM:
First, we make a frequency table of the observed base frequencies at each
position in the multiple sequence alignment. We are assuming that the
multiple sequence alignment is correct and completely describes the
binding preferences of the site we which to model. To the extent that
this is not true, our model will be weakened, but let us not worry
about that now. Using the multiple sequence alignment shown above, our
frequency table would be:
Column | A | C | G | T |
---|---|---|---|---|
1 | 8 | 0 | 0 | 0 |
2 | 1 | 2 | 4 | 0 |
3 | 0 | 5 | 3 | 0 |
4 | 0 | 5 | 2 | 1 |
5 | 0 | 1 | 0 | 7 |
6 | 3 | 1 | 3 | 1 |
7 | 8 | 0 | 0 | 0 |
Then, convert the frequency table to a table showing in what fraction of sequences each base occurs at each position. In other words, divide each number in the frequency table by the number of sequences. In our example, this number is 8:
Column | A | C | G | T |
---|---|---|---|---|
1 | 1 | 0 | 0 | 0 |
2 | 1/8 | 1/4 | 1/2 | 0 |
3 | 0 | 5/8 | 3/8 | 0 |
4 | 0 | 5/8 | 1/4 | 1/8 |
5 | 0 | 1/8 | 0 | 7/8 |
6 | 3/8 | 1/8 | 3/8 | 1/8 |
7 | 1 | 0 | 0 | 0 |
Then, we compare the relative frequency of bases at each position to a background or random frequency. Normally this background model just assumes that each base is equally likely, i.e., occurs 1/4 of the time. Dividing each element in the relative frequency table by the background frequency gives an odds ratio that a given base matches the model versus the chance that its just background. In column 1, for example, for the base 'A', we would divide 1 by 1/4, which equals 4. That means that later, when we are searching some sequence with our PSSM for EX binding sites and we see an A at this position, it is 4 times more likely to match binding site EX than it is to match background. Of course, this is just one of 7 positions and matching that 'A' will not be enough to generate a good total score.
Column | A | C | G | T |
---|---|---|---|---|
1 | 4 | 0 | 0 | 0 |
2 | 1/2 | 1 | 2 | 0 |
3 | 0 | 5/2 | 3/2 | 0 |
4 | 0 | 5/2 | 1 | 1/2 |
5 | 0 | 1/2 | 0 | 7/2 |
6 | 3/2 | 1/2 | 3/2 | 1/2 |
7 | 4 | 0 | 0 | 0 |
One final transformation of our table and we'll have a proper PSSM. Notice that when we divide the observed frequency at each position by the background frequency that the number is more than 1 if that base is more likely to match the binding site. It is less than 1 if that base is more like the background. It is equal to 1 if it is equally likely to match our binding site or the background. To simplify things, we'll take the log, base 2, of each number. Note that the log of a number less than 1 is negative, the log of 1 is 0, and the log of a number greater than 1 is positive. So, whenever we are scoring a base at a position and it is more like the model than the background it gets a positive score. If it is more like the background than the model, then it will get a negative score. If it is just as much like the model as the background, it will get a score of 0.
Column | A | C | G | T |
---|---|---|---|---|
1 | 2 | -inf | -inf | -inf |
2 | -1 | 0 | 1 | -inf |
3 | -inf | log2(5)-1 | log2(3)-1 | -inf |
4 | -inf | log2(5)-1 | 0 | -1 |
5 | -inf | -1 | -inf | log2(7)-1 |
6 | log2(3)-1 | -1 | log2(3)-1 | -1 |
7 | 2 | -inf | -inf | -inf |
This is a PSSM, so now what. Well, now we can compare any sequence against this PSSM and get a log-odds score that compares two options: (1) the sequence is an instance of binding site EX or (2) the sequence is just random background sequence and does not look like binding site EX. To generate this score, take the sequence that we want to compare, for example, ACAGCGA. Then, for each base, extract the score from our table and add all these scores up. Why does that work? Because each score from the table is a log-odds score. Adding logs is the same thing as multiplying the operand of the log. So when we add the log-odds score at each position, we're multiplying the probability of binding site EX versus background at each position, which is exactly what we want to do. In other words, we are treating each position as an independent observation that is either support for our sequence being binding site EX or is support for our sequence NOT being binding site EX (or is no support for either).
One problem with scoring the above example (ACAGCGA), you may have noticed, is the 'A' in column 3. Because we never saw an 'A' in this position in our training data then its infinitely more likely that an A at this position supports the background model than it supports the binding site EX model (remember a negative score means background is more likely and a positive score means EX is more likely). When we add up our scores, this negative infinity will dominate. In fact, in scoring any sequence against this PSSM, anytime we see a certain base at a position that was not observed in our training data we'll have this problem. If we feel like our training data is complete enough, this may not actually be a problem. In other words, if we are sure that binding site EX can never have an 'A' at position three, then any sequence that has an 'A' at its third position cannot be binding site EX. However, we may not have such confidence that we have sampled all the possible instances of binding site EX and may want to leave the door open for things we did not actually observe. To do this, we can use pseudocounts.
Pseudocounts, as the name implies, are observations that go into our frequency table, that we did not actually observe. They are intended to correct the problem described above. But what do we use for a pseudocount? One simple option is to just start with 1 in each position in our frequency table. This has the nice benefit of never giving us a score of negative infinity. Also, this small pseudocount will be drowned out more and more as we add more data. It has the drawback of not giving a score of negative infinity when we actually want one. That is, it eliminates that possibility of saying that a certain base never occurs at a certain position. There are much more sophisticated ways of determining a proper value for a pseudocount. As you might have guessed, determining a proper pseudocount is naturally a Bayesian problem and has been approached in that way.
There are two important limitations of PSSMs to keep in mind:
1. They are only as good as the training data. Any biases or
inaccuracies within the training data or in the alignment will be
captured in the PSSM.
2. They do not model higher-order properties of the site. PSSMs are
zeroth order Markov Models. That means that each site is
assumed to evolve within its own constraints and tendencies, free of
the influence of any neighboring sites. This is a fine assumption for
many sequence motifs, but not for others. It is a particularly bad
assumption for RNA sequences that form secondary structures.
Builiding PSSMs and searching databases with them using motifBS:
motifBS is a program that will generate a log-odds matrix (a PSSM)
given an input multiple sequence alignment in fasta format. The
alignment can have gap characters. They will be treated as no
observation at that position.
motifBS can also search a fasta database of sequences with a PSSM and
output the high-scoring instances of the PSSM within the search
database. motifBS is a perl program with a compiled C implementation
of the search routine that makes searching very fast.
motifBS will not discover motifs within sequences or align motifs. The
advantages of motifBS over other similar programs are
1. Its fast - especially for searching eukaryotic genome-sized databases
2. Its easy to use.
3. Its free.
Some related resources:
meme/mast
a program to discover and search for motifs
WebLogo
a web-service for graphically displaying PSSMs
Stormo review
on representing binding site data
Get bioperl
and the Inline C
modules
Get the motifBS code.
(c) 2003 Ed Green, Steven Brenner, UC Regents