SEQUENCE ANALYSIS
Sequence analysis is one of the most fundamental methods used in
bioinformatics to compare two or more sequences, either nucleic acid or
protein. The process of comparing two or more sequences to find the
similarity between them is termed sequence alignment. By comparing sequences,
it is possible to infer relationships in structure, function and evolution
from a common ancestor.
Sequence alignment is the most common method used to compare two (pairwise
alignment) or more (multiple sequence alignment) sequences by placing one
below the other. It is a major step in sequence similarity search. The process
involves finding and constructing significant alignments of the sequences. Two
or more strings of bases or amino acids are lined up against each other so as
to get the highest number of matching characters.
For example (1), the two sequences BANANA and ANANAS are written as follows:
BANANA
ANANAS
In this arrangement each column holds either a match (identical characters) or
a mismatch (non-identical characters). To reduce the mismatches in the above
alignment, a gap ‘_’ is added as follows:
BANANA_
_ANANAS
A column of the alignment that contains a gap is called an indel
(insertion-deletion). A column with a gap in the lower row is called a
deletion, and a column with a gap in the top row is called an insertion. In
sequence alignment, then, the sequences are written in the way that gives the
highest similarity.
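The effect of the gap can be checked by counting the columns of each arrangement. The sketch below is only an illustration: the function name `count_columns` is made up for this example and does not come from any library.

```python
def count_columns(top, bottom, gap="_"):
    """Count match, mismatch and gap columns of two equal-length aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            counts["gap"] += 1        # indel column
        elif a == b:
            counts["match"] += 1      # identical characters
        else:
            counts["mismatch"] += 1   # non-identical characters
    return counts


ungapped = count_columns("BANANA", "ANANAS")
gapped = count_columns("BANANA_", "_ANANAS")
print(ungapped)  # {'match': 0, 'mismatch': 6, 'gap': 0}
print(gapped)    # {'match': 5, 'mismatch': 0, 'gap': 2}
```

Adding one gap at each end turns six mismatched columns into five matches and two indel columns.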
Eg. (2)
seq.1  A C C G T T G C T A G C A A C - - - G C C C A A T
       | | | |   | | |   | | | |           | | | |   | |
seq.2  A C C G C T G C G A G C A C G A T T G C C C T A T
Identical residues are placed in the same column, representing the unchanged
ancestral sequence. In example 2, matches are highlighted with vertical bars.
Non-identical residues placed in the same column are mismatches and are left
without a vertical bar between the aligned sequences. Mismatches in the
alignment correspond to mutations, which transform one base into another. The
sequences are also padded with gaps, which are denoted by dashes. Gaps in the
alignment correspond to insertions or deletions. Comparison of sequences thus
shows how much evolutionary change has taken place between them.
SCORING MATRICES
Once the alignment is done, a score can be assigned to each aligned pair
according to a chosen scoring scheme. We usually reward matches and penalize
mismatches and gaps. The total score assigned to an alignment is then the sum
of the scores of each pair of aligned residues. The similarity of two sequences
can be defined as the best score among all possible alignments between them.
For instance, using a scoring scheme that gives 1 to matches, 0 to mismatches
and -1 to gaps (gap penalty), the alignment of ACGTGG and ACAGTAG is scored as
follows:
A C - G T G G
| |   | |   |
A C A G T A G
The alignment shows 5 matches, 1 mismatch and 1 gap. So the alignment score is
5 x 1 + 1 x 0 + 1 x (-1) = 4. A table of scores for all possible matches and
mismatches is known as a scoring matrix.
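The arithmetic above can be reproduced with a few lines of code. This is a minimal sketch: `alignment_score` is a hypothetical helper, and the gap is written as "-" here.

```python
def alignment_score(top, bottom, match=1, mismatch=0, gap=-1, gap_char="-"):
    """Sum the column scores of two equal-length aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            score += gap          # gap penalty
        elif a == b:
            score += match        # reward for a match
        else:
            score += mismatch     # score for a mismatch
    return score


score = alignment_score("AC-GTGG", "ACAGTAG")
print(score)  # 4
```

Changing the `match`, `mismatch` and `gap` arguments shows how strongly the score of the same alignment depends on the chosen scoring scheme.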
TYPES OF SEQUENCE ALIGNMENT
According to the number of sequences being compared,
sequence alignment is of two types:
1. Pairwise sequence alignment: only two sequences are compared with each
other to find the regions of similarity and dissimilarity.
Eg. Sequence 1  T A C G T A C
    Sequence 2  T A C G T A C
2. Multiple sequence alignment: more than two sequences are compared with each
other to find the regions of similarity and dissimilarity.
Eg. Sequence 1  T A C G T A C
    Sequence 2  T A C G T A C
    Sequence 3  T A C G T A C
    Sequence 4  T A C G T A C
The three types of change in genetic information are substitution, insertion
and deletion. These changes accumulate over several thousand generations,
resulting in considerable change in the genetic information. Therefore, a
comparison of homologous sequences shows how much evolutionary change has
taken place between them. The differences in the sequences reflect their
evolutionary history. If one aligns (compares) the sequences of trypsin from
rat and mouse, there may not be many differences between them. This indicates
that the organisms are closely related. But if the same mouse trypsin sequence
is compared with that of a distantly related organism, a large portion of the
sequence may show deletions, insertions (indels) and substitutions. This is
indicative of a long evolutionary history.
In all the above cases, certain regions of the sequences may show a very high
degree of similarity. Such stretches of the sequences are called consensus
(conserved) regions. They have not undergone evolutionary change. The likely
reason for their unchanging nature is that those regions play very important
structural and functional roles. If a change does occur in such a conserved
region, the organism carrying it is likely to perish, so the change is not
passed on to successive generations.
Alignment (pairwise and multiple) is central to biological sequence analysis.
Some of the purposes of aligning sequences are:
1. Reconstructing molecular evolution
2. Matching of functionally equivalent regions
3. Definition of patterns the sequences must contain
According to the length of sequence over which the comparison is made,
sequence alignment is of two types:
1. Global alignment: In global alignment, the two sequences to be aligned are
assumed to be generally similar over their entire length. Alignment is carried
out from the beginning to the end of both sequences to find the best possible
alignment across their entire length. This method is more applicable for
aligning two closely related sequences of roughly the same length. For
divergent sequences and sequences of variable lengths, this method may not
generate optimal results because it fails to recognize highly similar local
regions between the two sequences. Tools like ALIGN, LALIGN etc. are used for
the alignment of two sequences.
2. Local alignment: Local alignment does not assume that the two sequences in
question are similar over their entire length. It only finds the local regions
with the highest level of similarity between the two sequences and aligns these
regions without regard for the alignment of the rest of the sequence. This
approach can be used for aligning more divergent sequences with the goal of
searching for conserved patterns in DNA or protein sequences. The two sequences
aligned can be of different lengths. This approach is more appropriate for
aligning divergent biological sequences that contain only certain similar
modules, which are referred to as domains or motifs.
N ---------------------------------------------------------------------C
N ---------------------------------------------------------------------C
Global alignment
N ------------------------|---------------------|-----------------------------C
N       ------------------|---------------------|---------------------C
Local alignment
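The contrast between the two approaches can be sketched with dynamic programming. The block below is a minimal illustration and not the method of any particular tool. It uses the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, which the text above does not name. The function names, the second pair of example sequences and the -1 mismatch score in the local case are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Best score of an end-to-end alignment (Needleman-Wunsch recurrence)."""
    prev = [j * gap for j in range(len(b) + 1)]      # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                             # a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings (Smith-Waterman recurrence)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # 0 restarts the alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Whole-length comparison of the two sequences from the scoring example
print(global_score("ACGTGG", "ACAGTAG"))                   # 4
# Two otherwise unrelated sequences that share the module ACGTACGT
print(local_score("TTTTACGTACGTCCCC", "GGGACGTACGTAAA"))   # 8
```

The only structural difference is the `0` in the local recurrence: it lets an alignment start afresh anywhere, so the flanking regions that do not match are simply ignored.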
BLAST
BLAST is the abbreviated form of Basic Local Alignment Search Tool. It was
proposed by Altschul et al. at NIH in 1990. BLAST provides software tools for
finding high-scoring local alignments between two sequences. In a BLAST search,
a query sequence is submitted and a pairwise comparison of the query with every
individual sequence in a database is performed. It works by finding short
stretches of identical or nearly identical letters in two sequences. These
short strings of characters are called words. The basic assumption is that two
related sequences must have at least one word in common. By first identifying
word matches, a longer alignment can be obtained by extending the similarity
regions outwards from the words. It uses statistical methods to evaluate hits
for their significance.
BLAST Algorithm
In the BLAST algorithm, the first step is to create a list of words from the
query sequence. Each word is typically three residues long for protein
sequences and eleven residues long for DNA sequences. The list includes every
possible word extracted from the query sequence. This step is also called
seeding.
The second step is to search a sequence database for occurrences of these
words, in order to identify the database sequences that contain matching words.
The matching of the words is scored with a given substitution matrix.
The third step involves pairwise alignment by extending from the words in both
directions while accumulating the alignment score using the same substitution
matrix. The resulting contiguous aligned segment pair without gaps is called a
high-scoring segment pair (HSP). The highest-scoring segment is extended in
both directions, and gaps may be introduced wherever needed. The extension
continues as long as the alignment score is above a certain threshold;
otherwise it is terminated. The choice of the threshold value for continuing
the extension is an important search parameter, because it determines how
likely the resulting sequences are to be biologically relevant homologs of the
query sequence.
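The first two steps can be sketched as follows. This is a simplified illustration, not the actual BLAST implementation: it reports only exact word matches, whereas the text above notes that word matches are scored with a substitution matrix. The function names `query_words` and `find_seeds` are hypothetical.

```python
def query_words(query, word_size=3):
    """Step 1: list every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def find_seeds(query, subject, word_size=3):
    """Step 2: report (word, query position, subject position) for exact word hits."""
    index = {}                                   # word -> start positions in the subject
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    return [(word, i, j)
            for i, word in enumerate(query_words(query, word_size))
            for j in index.get(word, [])]


words = query_words("MRDPYNKLIS")
seeds = find_seeds("MRDPYNKLIS", "MHEPYNDVPW")
print(words)  # ['MRD', 'RDP', 'DPY', 'PYN', 'YNK', 'NKL', 'KLI', 'LIS']
print(seeds)  # [('PYN', 3, 3)]
```

Indexing the database sequence once and looking each query word up in that index is what makes the word search fast compared with aligning every pair of sequences in full.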
Eg.
Query: M R D P Y N K L I S
Scan every three residues to be used in searching BLAST word databases.
Assuming one of the words finds matches in the database:
Query     PYN  PYN  PYN  PYN
Database  PYN  PFN  PFQ  PFE
Find the database sequence corresponding to the best word match and extend the
alignment in both directions. Calculate the scores using the BLOSUM62 matrix.
Query:    M R D P Y N K L I S
Database: M H E P Y N D V P W
          <--------   -------->
Eg. 2
Input sequence: A I L V P T V
1. Break the query sequence into words
AILVPTVI….
AILV
 ILVP
  LVPT
   VPTV
2. Search for word matches (also called high-scoring pairs, or HSPs) in the
database sequences
              AILV
MVQGWALYDFLKCRAILVGTVIAML……….
3. Extend the match until the local alignment score falls below a fixed
threshold (the most recent version of BLAST allows gaps in the extended match)
                  -----→
              AILVPTVI
MVQGWALYDFLKCRAILVGTVIAML………
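The extension in step 3 can be sketched as an ungapped walk outwards from the word hit. This is a toy version under stated assumptions: it scores +1 for identical residues and -1 otherwise instead of using BLOSUM62, and it stops once the running score has dropped more than `x_drop` below the best score seen so far. The name `extend_seed` and the parameter `x_drop` are hypothetical.

```python
def extend_seed(query, subject, q_start, s_start, word_size,
                match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit in both directions without gaps.

    Returns (score, q_begin, q_end, s_begin, s_end) with end-exclusive indices.
    """
    def extend(q_pos, s_pos, step):
        # Walk one residue at a time and remember the best-scoring extent
        score = best = 0
        length = best_len = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            score += match if query[q_pos] == subject[s_pos] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break                          # score fell too far: stop extending
            q_pos += step
            s_pos += step
        return best, best_len

    right_score, right_len = extend(q_start + word_size, s_start + word_size, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    score = word_size * match + left_score + right_score
    return (score,
            q_start - left_len, q_start + word_size + right_len,
            s_start - left_len, s_start + word_size + right_len)


query = "AILVPTVI"
subject = "MVQGWALYDFLKCRAILVGTVIAML"
hit = subject.find("AILV")                     # word hit at subject position 14
hsp = extend_seed(query, subject, 0, hit, word_size=4)
print(hsp)                                     # (6, 0, 8, 14, 22)
print(query[hsp[1]:hsp[2]], subject[hsp[3]:hsp[4]])  # AILVPTVI AILVGTVI
```

The extension steps over the single P/G mismatch because the three matches that follow raise the score again before the drop-off limit is reached.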
There are five different traditional BLAST programmes:
BLASTP: compares a protein query sequence against a protein sequence database,
hence P
BLASTN: compares a nucleotide sequence against a nucleotide sequence database,
hence N
BLASTX: takes a nucleotide query sequence and translates it into protein for
comparison against a protein sequence database
TBLASTN: compares a protein sequence against a nucleotide sequence database
after translating the nucleotide database sequences into protein
TBLASTX: compares translations of a nucleotide query sequence into protein
against translations of a nucleotide sequence database into protein.
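The five programmes differ only in the type of query, the type of database and which side is translated, so the choice can be written as a small lookup table. The table and the helper `candidate_programs` below are illustrative only and are not part of any BLAST distribution.

```python
# For each programme: type of query, type of database, and which side is translated
BLAST_PROGRAMS = {
    "BLASTP":  {"query": "protein",    "database": "protein",    "translated": ()},
    "BLASTN":  {"query": "nucleotide", "database": "nucleotide", "translated": ()},
    "BLASTX":  {"query": "nucleotide", "database": "protein",    "translated": ("query",)},
    "TBLASTN": {"query": "protein",    "database": "nucleotide", "translated": ("database",)},
    "TBLASTX": {"query": "nucleotide", "database": "nucleotide",
                "translated": ("query", "database")},
}


def candidate_programs(query_type, database_type):
    """Return the programmes that accept this combination of query and database."""
    return [name for name, spec in BLAST_PROGRAMS.items()
            if spec["query"] == query_type and spec["database"] == database_type]


print(candidate_programs("protein", "nucleotide"))     # ['TBLASTN']
print(candidate_programs("nucleotide", "nucleotide"))  # ['BLASTN', 'TBLASTX']
```

A nucleotide query against a nucleotide database has two candidates: BLASTN compares the nucleotides directly, while TBLASTX compares the translations of both.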
THE SALIENT FEATURES OF BLAST
1. Local alignments: BLAST tries to find patches of regional similarity, rather
than the best alignment between the entire query and an entire database
sequence.
2. Ungapped alignments: Alignments generated by the original BLAST do not
contain gaps. BLAST's speed and statistical model depend on this, although in
theory it reduces sensitivity. However, BLAST will report multiple local
alignments between the query and a database sequence.
3. Explicit statistical theory: BLAST is based on an explicit statistical
theory developed by Samuel Karlin and Stephen Altschul (1990). The original
theory was later extended to cover multiple weak matches between query and
database entries.
MULTIPLE SEQUENCE ALIGNMENT (MSA)
The purpose of MSA is to bring the largest possible number of similar features
into the same column of the alignment. If the sequences in the MSA align well,
they are likely to be derived from a common ancestral sequence. An MSA of a set
of sequences can show which regions are most alike across the set. In proteins,
such regions may represent conserved functional or structural domains.
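Once an MSA is available, the most alike regions can be read off column by column. The sketch below is a minimal illustration: the aligned rows are invented for this example, and `conserved_columns` is a hypothetical helper.

```python
def conserved_columns(aligned_rows, gap="-"):
    """Return the indices of columns where every row has the same non-gap residue."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return [i for i, column in enumerate(zip(*aligned_rows))
            if gap not in column and len(set(column)) == 1]


# Hypothetical aligned rows (not taken from real data)
rows = ["TACGTAC",
        "TAC-TAC",
        "TTCGTAC",
        "TACGAAC"]
conserved = conserved_columns(rows)
print(conserved)  # [0, 2, 5, 6]
```

Runs of neighbouring conserved columns are the candidates for the functional or structural domains mentioned above.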
Some of the many MSA programmes are Clustal W, Clustal X, MSA and PRALINE.
Clustal
Clustal performs a global multiple sequence alignment by a stepwise process. In
step 1, it performs pairwise alignments of all the sequences provided by the
user. In step 2, the scores obtained from the pairwise alignments are used to
produce a phylogenetic tree. In step 3, the phylogenetic tree is used as a
guide to align the sequences sequentially. Thus the most closely related
sequences are aligned first, and then additional sequences are added one by one
to the profile of the existing MSA.
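The first two steps, and the joining order they produce for step 3, can be sketched as below. This is a toy stand-in and not Clustal's actual code: the pairwise scores come from a simple global alignment with scores 1/0/-1, a plain average-linkage clustering is swapped in for the tree-building step, and the profile alignment itself is left out. The sequences and all function names are hypothetical.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Best end-to-end alignment score of two sequences (dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def guide_order(seqs):
    """Return the order in which sequences or clusters would be joined."""
    # Step 1: pairwise alignment scores, converted to distances
    dist = {}
    for x, y in combinations(seqs, 2):
        score = global_score(seqs[x], seqs[y])
        dist[frozenset((x, y))] = 1 - score / max(len(seqs[x]), len(seqs[y]))

    # Step 2: repeatedly join the two closest clusters (average linkage)
    clusters = {name: (name,) for name in seqs}      # cluster label -> member names

    def cluster_distance(c1, c2):
        pairs = [dist[frozenset((m, n))] for m in clusters[c1] for n in clusters[c2]]
        return sum(pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merged = (c1, c2)
        clusters[merged] = clusters.pop(c1) + clusters.pop(c2)
        merges.append(merged)                        # Step 3 would align in this order
    return merges


# Hypothetical sequences (not taken from real data)
seqs = {"seq1": "TACGTAC", "seq2": "TACGTAG", "seq3": "TTCGAAG", "seq4": "GGCCAAG"}
merges = guide_order(seqs)
for step, pair in enumerate(merges, start=1):
    print(step, pair)
```

With these sequences the first join is seq1 with seq2, the closest pair, which mirrors the statement that the most closely related sequences are aligned first.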
The most important releases of Clustal are Clustal X and Clustal W. The primary
difference between them is that Clustal W has a simple text-mode interface,
whereas Clustal X has a graphical user interface (GUI) built using the NCBI
VIBRANT toolkit. Clustal can be used to align any group of protein or nucleic
acid sequences that are related to each other over their entire lengths. The
use of Clustal W is not advisable if:
· Sequences do not share common ancestry
· Sequences have large, variable N- and C-terminal overhangs
· Sequences are partially related
· Sequences include short non-overlapping fragments