Thursday, 9 November 2017

SEQUENCE ANALYSIS



Sequence analysis is one of the most fundamental methods used in bioinformatics to compare two or more sequences, either nucleic acid or protein. The process of comparing two or more sequences to find the similarity between them is termed sequence alignment. Sequence comparison makes it possible to infer relationships in structure, function and evolution from a common ancestor.
Sequence alignment is the most common method used to compare two (pairwise alignment) or more (multiple sequence alignment) sequences by placing one below the other. It is a major step in sequence similarity searching. The process involves finding and constructing significant alignments of the sequences: two or more strings of bases or amino acids are lined up against each other so as to obtain the highest number of matching characters.
For example (1), the two sequences BANANA and ANANAS are written as follows:
BANANA
ANANAS
Written this way, each column holds either a match (identical characters) or a mismatch (non-identical characters); here every column is a mismatch. To reduce the mismatches, a gap ‘_’ is added to each sequence as follows:
BANANA_
_ANANAS
A column of the alignment that contains a gap is called an indel (insertion-deletion). A column with the gap in the lower row is called a deletion, and a column with the gap in the top row is called an insertion. In a sequence alignment, the sequences are thus written in the way that gives the highest similarity.
Eg. (2)
seq.1   A C C G T T G C T A G C A A C - - - G C C C A A T
        | | | |   | | |   | | | |           | | | |   | |
seq.2   A C C G C T G C G A G C A C G A T T G C C C T A T
Identical residues are placed in the same column, representing the unchanged ancestral sequence. In example 2, matches are highlighted with vertical bars. Non-identical residues are also placed in the same column, as mismatches, but without a vertical bar between them. Mismatches in the alignment correspond to mutations, which transform one base into another. The sequences are also padded with gaps, which are denoted by dashes. Gaps in the alignment correspond to insertions or deletions. Comparison of sequences thus shows how much evolutionary change has taken place between them.
SCORING MATRICES
Once the alignment is done, a score can be assigned to each aligned pair according to a chosen scoring scheme. We usually reward matches and penalize mismatches and gaps. The total score assigned to an alignment is then the sum of the scores of each pair of aligned residues. The similarity of two sequences can be defined as the best score among all possible alignments between them. For instance, using a scoring scheme that gives 1 to matches, 0 to mismatches and -1 to gaps (the gap penalty), the alignment of ACGTGG and ACAGTAG is scored as follows:
A C - G T G G
| |   | |   |
A C A G T A G
The alignment shows 5 matches, 1 mismatch and 1 gap, so the alignment score is (5 x 1) + (1 x 0) + (1 x -1) = 4. A table of the scores for every possible match and mismatch pair is known as a scoring matrix.
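The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name score_alignment and its parameter names are made up for illustration, and the default scores are the 1 / 0 / -1 scheme used in the example.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # column containing a gap (indel)
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # non-identical residues
    return total


# 5 matches, 1 mismatch, 1 gap
print(score_alignment("AC-GTGG", "ACAGTAG"))  # 4
```

Changing the three score arguments is the simplest way to see how the choice of scoring scheme changes the value assigned to the same alignment.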
TYPES OF SEQUENCE ALIGNMENT
According to the number of sequences being compared, sequence alignment is of two types:
1.     Pairwise sequence alignment: only two sequences are compared with each other to find the regions of similarity and dissimilarity.
Eg.    Sequence 1   T A C G T A C
       Sequence 2   T A C G T A C
2.     Multiple sequence alignment: more than two sequences are compared with each other to find the regions of similarity and dissimilarity.
Eg.    Sequence 1   T A C G T A C
       Sequence 2   T A C G T A C
       Sequence 3   T A C G T A C
       Sequence 4   T A C G T A C
The three types of change in genetic information are substitution, insertion and deletion. These changes accumulate over many thousands of generations, resulting in considerable change in genetic information. A comparison of homologous sequences will therefore show how much evolutionary change has taken place between them; the differences between the sequences reflect their evolutionary history. If one aligns the trypsin sequences of rat and mouse, there may not be much difference between them, which indicates that the organisms are closely related. But if the same mouse trypsin sequence is compared with that of a distantly related organism, a large portion of the sequence may show insertions and deletions (indels) and substitutions. This is indicative of a long evolutionary history.
In all the above cases, certain regions of the sequences may show a very high degree of similarity. Such stretches are called conserved (consensus) regions: they have changed little during evolution. The likely reason is that those regions play very important structural and functional roles. If a change does occur in a conserved region, the organism carrying it is likely to perish, so the change is not passed on to successive generations.
Alignment (pairwise and multiple) is central to biological sequence analysis. Some of the purposes of aligning sequences are:
1.     Reconstructing molecular evolution
2.     Matching of functionally equivalent regions
3.     Definition of patterns the sequences must contain
According to the length of sequence being compared, sequence alignment is of two types:
1.     Global alignment: In global alignment, the two sequences to be aligned are assumed to be generally similar over their entire length. Alignment is carried out from the beginning to the end of both sequences to find the best possible alignment across their entire length. This method is most applicable to aligning two closely related sequences of roughly the same length. For divergent sequences and sequences of variable length, it may not generate optimal results because it fails to recognize highly similar local regions between the two sequences. Tools such as ALIGN are used for the global alignment of two sequences, while the related tool LALIGN computes local alignments.
2.     Local alignment: Local alignment does not assume that the two sequences in question are similar over their entire length. It finds only the local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the rest of the sequences. This approach can be used for aligning more divergent sequences with the goal of searching for conserved patterns in DNA or protein sequences, and the two sequences can be of different lengths. It is most appropriate for divergent biological sequences that share only certain similar modules, which are referred to as domains or motifs.
N ---------------------------------------------------- C
N ---------------------------------------------------- C
                  Global alignment

N ---------------|--------------------|--------------- C
       N --------|--------------------|------ C
                  Local alignment
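The difference between the two strategies can be made concrete with the classic dynamic-programming recurrences, which the text above does not name: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The sketch below is an illustration under assumptions, not what any particular tool does. The function names are hypothetical, gaps are scored with a simple per-position penalty, and only the best score is returned rather than the alignment itself. The local version uses a mismatch score of -1, because local alignment needs mismatches to cost something.

```python
def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Best end-to-end alignment score (Needleman-Wunsch recurrence)."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]                               # both sequences fully consumed


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best-scoring local region (Smith-Waterman recurrence)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # 0 restarts
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ACGTGG", "ACAGTAG"))  # 4, the score worked out earlier
print(local_score("BANANA", "ANANAS"))    # 5, the shared stretch ANANA
```

The only structural differences are the zero in the local recurrence, which lets an alignment start afresh anywhere, and the fact that the local score is taken from the best cell rather than the final one.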
BLAST
BLAST stands for Basic Local Alignment Search Tool. It was proposed by Altschul et al. at the NIH in 1990. BLAST provides software tools for finding high-scoring local alignments between two sequences. In a BLAST search, a query sequence is submitted and compared pairwise with every individual sequence in a database. The program works by finding short stretches of identical or nearly identical letters in the two sequences. These short strings of characters are called words. The basic assumption is that two related sequences must have at least one word in common. Once word matches have been identified, a longer alignment can be obtained by extending the regions of similarity outwards from the words. BLAST uses statistical methods to evaluate the significance of its hits.
BLAST Algorithm
The first step in the BLAST algorithm is to create a list of words from the query sequence. Each word is typically three residues long for protein sequences and eleven residues long for DNA sequences. The list includes every possible word extracted from the query sequence. This step is also called seeding.
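The word list can be pictured as a sliding window over the query. The following is a minimal sketch of that idea only; the function name query_words is hypothetical and the sequence is an arbitrary ten-residue protein query.

```python
def query_words(query, word_size):
    """Return every overlapping word of length word_size, in query order."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


print(query_words("MRDPYNKLIS", 3))
# ['MRD', 'RDP', 'DPY', 'PYN', 'YNK', 'NKL', 'KLI', 'LIS']
```

A query of length n yields n - w + 1 words of size w, so the list grows linearly with the query.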
The second step is to search a sequence database for occurrences of these words, in order to identify the database sequences that contain matching words. The matching of the words is scored with a given substitution matrix.
The third step involves pairwise alignment by extending from the words in both directions while accumulating the alignment score using the same substitution matrix. The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP). The highest-scoring segment is extended in both directions, and gaps may be introduced wherever needed. The extension continues as long as the alignment score stays above a certain threshold; otherwise it is terminated. The choice of the threshold value for continuing the extension is an important search parameter, because it determines how likely the resulting sequences are to be biologically relevant homologs of the query sequence.
Eg.1
Query: M R D P Y N K L I S
Scan the query three residues at a time to obtain the words used to search the database.
Assume that one of the words, PYN, finds matches in the database:
Query         PYN    PYN    PYN    PYN
Database      PYN    PFN    PFQ    PFE
Find the database sequence corresponding to the best word match and extend the alignment in both directions, calculating the scores with the BLOSUM62 matrix:
Query:       M R D P Y N K L I S
Database:    M H E P Y N D V P W
             <-----       ----->
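To see why PYN is the best of the four word matches, the words can be scored column by column with BLOSUM62. The sketch below hard-codes only the six BLOSUM62 entries this example needs (the full matrix is 20 x 20); the dictionary and function names are hypothetical.

```python
# Only the BLOSUM62 entries needed for this example; keys are sorted pairs.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Y", "Y"): 7, ("N", "N"): 6,
    ("F", "Y"): 3, ("N", "Q"): 0, ("E", "N"): 0,
}


def pair_score(a, b):
    """Look up the symmetric substitution score for two residues."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def word_score(query_word, db_word):
    """Sum the substitution scores of the aligned residues of two words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, db_word))


for db_word in ["PYN", "PFN", "PFQ", "PFE"]:
    print(db_word, word_score("PYN", db_word))
# PYN 20
# PFN 16
# PFQ 10
# PFE 10
```

The exact match scores highest, and each substitution lowers the score by an amount that depends on how readily the two residues replace one another.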

Eg.2
Input sequence:  A I L V P T V
1.     Break the query sequence into words
AILVPTVI….
AILV
 ILVP
  LVPT
   VPTV
2.     Search for word matches (hits) in the database sequences; extending a hit yields a high-scoring segment pair, or HSP
              AILV
MVQGWALYDFLKCRAILVGTVIAML……….
3.     Extend the match until the local alignment score falls below a fixed threshold (the most recent version of BLAST allows gaps in the extended match)
              ----->
              AILVPTVI
MVQGWALYDFLKCRAILVGTVIAML………
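Steps 2 and 3 can be sketched as a seed-and-extend routine. This is a simplified illustration, not BLAST's actual implementation: words must match exactly, residues are scored +1 / -1 instead of with a substitution matrix, no gaps are introduced, and the stopping rule (give up once the running score has fallen more than max_drop below the best score seen) is one assumed reading of the threshold in step 3. All function and parameter names are hypothetical.

```python
def find_seeds(query, subject, word_size):
    """Yield (query_pos, subject_pos) for every exact word match."""
    for q in range(len(query) - word_size + 1):
        word = query[q:q + word_size]
        s = subject.find(word)
        while s != -1:
            yield q, s
            s = subject.find(word, s + 1)


def extend(query, subject, q, s, word_size, match=1, mismatch=-1, max_drop=2):
    """Extend a seed without gaps; return (score, q_start, s_start, length)."""
    def walk(qi, si, step):
        # Walk in one direction and remember the best running score seen.
        score = best = best_len = steps = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            steps += 1
            if score > best:
                best, best_len = score, steps
            elif best - score > max_drop:
                break                      # fell too far below the best score
            qi += step
            si += step
        return best, best_len

    right, right_len = walk(q + word_size, s + word_size, 1)
    left, left_len = walk(q - 1, s - 1, -1)
    score = word_size * match + left + right
    return score, q - left_len, s - left_len, word_size + left_len + right_len


query = "AILVPTVI"
subject = "MVQGWALYDFLKCRAILVGTVIAML"
for q, s in find_seeds(query, subject, 4):
    print((q, s), extend(query, subject, q, s, 4))
# (0, 14) (6, 0, 14, 8)
```

Here the single seed AILV at database position 14 extends to the right across the P/G mismatch and covers all eight query residues: 7 matches and 1 mismatch give a score of 6.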


There are five traditional BLAST programmes:
BLASTP: compares a protein query sequence against a protein sequence database (hence P)
BLASTN: compares a nucleotide query sequence against a nucleotide sequence database (hence N)
BLASTX: takes a nucleotide query sequence and translates it into protein for comparison against a protein sequence database
TBLASTN: compares a protein query sequence against a nucleotide sequence database after translating the database sequences into protein
TBLASTX: compares the protein translations of a nucleotide query sequence against the protein translations of a nucleotide sequence database


THE SALIENT FEATURES OF BLAST
1.     Local alignments: BLAST tries to find patches of regional similarity, rather than the best alignment between the entire query and an entire database sequence.
2.     Ungapped alignments: Alignments generated by the original BLAST do not contain gaps. BLAST’s speed and statistical model depend on this, although in theory it reduces sensitivity. To compensate, BLAST reports multiple local alignments between the query and a database sequence. As noted above, more recent versions allow gaps in the extended match.
3.     Explicit statistical theory: BLAST is based on an explicit statistical theory developed by Samuel Karlin and Stephen Altschul (1990). The original theory was later extended to cover multiple weak matches between query and database entries.

MULTIPLE SEQUENCE ALIGNMENT (MSA)
The purpose of MSA is to arrange the sequences so that as many similar features as possible fall in the same column of the alignment. If the sequences in an MSA align well, they are likely to be derived from a common ancestral sequence. An MSA of a set of sequences can show which regions are most alike across the set. In proteins, such regions may represent conserved functional or structural domains.
Some of the many MSA programmes are Clustal W, Clustal X, MSA and PRALINE.
Clustal
Clustal performs a global multiple sequence alignment by a stepwise process. In step 1, it performs pairwise alignments of all the sequences provided by the user. In step 2, the scores obtained from the pairwise alignments are used to produce a phylogenetic tree. In step 3, the phylogenetic tree is used as a guide to align the sequences sequentially. Thus the most closely related sequences are aligned first, and additional sequences are then added one by one to the profile of the existing MSA.
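The "closest sequences first" idea can be illustrated with a toy ordering routine. This is a loose sketch and not Clustal's algorithm: the pairwise alignment score of step 1 is replaced by the fraction of differing positions between equal-length sequences, and the guide tree of step 2 is replaced by a greedy ordering. The function names and the four example sequences are invented for illustration.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of differing positions; stands in for a pairwise alignment score."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def joining_order(seqs):
    """Return sequence names in the order a progressive aligner might add them."""
    names = list(seqs)
    dist = {frozenset(pair): distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(names, 2)}
    # Start with the closest pair of sequences.
    first = min(combinations(names, 2), key=lambda pair: dist[frozenset(pair)])
    order = list(first)
    remaining = [n for n in names if n not in order]
    while remaining:
        # Next, add the sequence with the smallest mean distance to those placed.
        nxt = min(remaining,
                  key=lambda n: sum(dist[frozenset((n, m))] for m in order) / len(order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


seqs = {"s4": "GGCCAAG", "s1": "TACGTAC", "s3": "TTCGAAG", "s2": "TACGTAG"}
print(joining_order(seqs))  # ['s1', 's2', 's3', 's4']
```

s1 and s2 differ at a single position, so they are paired first; s3 and then the most divergent sequence, s4, are added afterwards.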
The most important releases of Clustal are Clustal W and Clustal X. The primary difference between them is that Clustal W has a simple text-mode interface, whereas Clustal X has a graphical user interface (GUI) built using the NCBI VIBRANT toolkit. Clustal can be used to align any group of protein or nucleic acid sequences that are related to each other over their entire lengths. The use of Clustal W is not advisable if:
·        Sequences do not share common ancestry
·        Sequences have large, variable, N- and C- terminal overhangs
·        Sequences are partially related
·        Sequences include short non-overlapping fragments

