Sequence data can be analyzed for (a) sequence characteristics by knowledge-based sequence analysis, (b) similarity search by pairwise sequence comparison, (c) multiple sequence alignment, (d) sequence motif discovery in multiple alignment, and (e) phylogenetic inference.
The nucleotide sequences can be retrieved from one of the three IC (International Collaboration) nucleotide sequence repositories/databases: GenBank, EMBL Nucleotide Sequence Database, and DNA Data Bank of Japan (DDBJ). The retrieval can be conducted via accession numbers or keywords. Keynet (http://www.ba.cnr.it/keynet.html) is a tree browsing database of keywords extracted from
NUCLEOTIDE SEQUENCE ANALYSIS 171
Figure 9.1. BLAST server of Entrez.
EMBL and GenBank aimed at assisting the user in biosequence searching. The GenBank nucleotide sequence database (http://www.ncbi.nlm.nih.gov/GenBank/) can be accessed from the integrated database retrieval system of NCBI, Entrez (http://www.ncbi.nlm.nih.gov), by selecting the Nucleotide menu. Entering the text keyword displays the summary of hits. Pick the desired records and view them in the summary text or graphics that display the sequence, CDS, and protein product. Save the sequences in GenBank format or fasta format (Display: fasta and View: plain text) for subsequent sequence analyses. Access to the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl.html) is accomplished via the European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk/embl/Access/index.html. The database search at EBI can be performed by the Sequence Retrieval System (SRS).
EMBL incorporates sequence data produced by a number of genome projects and maintains Genome MOT (genome monitoring table). DNA Data Bank of Japan (DDBJ) can be found at http://srs.ddbj.nig.ac.jp/index.html. Genome Information Broker, GIB (http://mol.genes.nig.ac.jp/gib/), can be used to retrieve the complete genome data. Chapter 3 describes the procedures for database retrieval from GenBank, EMBL, and DDBJ.
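Sequences saved in fasta format can be read back for subsequent analyses with a few lines of code. The following is a minimal sketch, assuming the usual convention that each record starts with a ">" header line followed by one or more sequence lines; the function name parse_fasta and the two example records are hypothetical and are not taken from any database.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            # sequence lines of one record are concatenated
            records[header] += line.upper()
    return records


# Hypothetical records for illustration only.
example = """>seq1 hypothetical example record
ATGGCCAAGT
TGACC
>seq2 another hypothetical record
ggatccttaa
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), seq)
```

For a file saved from Entrez, the same function could be called as parse_fasta(open(filename)), where filename is whatever name was chosen when saving.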
9.3.2. Similarity Search
All three IC centers also provide facilities for sequence similarity search and alignment. The widely used database search algorithms are FASTA (Lipman and Pearson, 1985) at http://www.nbrf.georgetown.edu/pirwww/search/fasta.html and BLAST (Altschul et al., 1990) at http://www.ncbi.nlm.nih.gov/BLAST/. For BLAST
172 GENOMICS: NUCLEOTIDE SEQUENCES AND RECOMBINANT DNA
Figure 9.2. Shuttle vector map retrieved from Riken Gene Bank. The map of shuttle vector pYAC 3/4/5 shows major restriction sites.
nucleotide sequence analysis at NCBI (Figure 9.1), paste the sequence, select blastn (for nucleotides), and then choose the basic BLAST and alignment view. Click the Search button to submit the query sequence. After successful submission of the query sequence, as indicated by the assignment of a Request ID, click Format results to display the search results. Similarity searches using FASTA (http://www2.ebi.ac.uk/fasta3/) and BLAST (http://www2.ebi.ac.uk/blastall/) are also available at EBI and DDBJ (http://spiral.genes.nig.ac.jp/homology/top-e.html). A P-value refers to the probability of obtaining, by chance, a pairwise sequence comparison of the observed similarity given the length of the query sequence and the size of the database searched. Thus, low P-values indicate sequence similarities of high significance.
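The dependence of the P-value on query length and database size can be made concrete with the statistics commonly used for BLAST-type local alignments, in which the expected number of chance hits is E = K·m·n·exp(−λS) and P = 1 − exp(−E). The sketch below assumes that model; the values of λ and K are placeholders chosen for illustration (real values depend on the scoring system and are reported by the search program), and the function names are hypothetical.

```python
import math


def expect_value(score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least `score`
    (E = K * m * n * exp(-lambda * S))."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Placeholder statistical parameters (assumed, not from the text).
LAMBDA = 1.37
K = 0.71

for db_size in (10**6, 10**9):
    e = expect_value(score=30, query_len=200, db_len=db_size, lam=LAMBDA, k=K)
    print(f"database size {db_size:>10}: E = {e:.3e}, P = {p_value(e):.3e}")
```

The same raw score becomes less significant (larger P) as the database grows, which is why P-values from searches against different databases are not directly comparable.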
9.3.3. Recombinant DNA
To search for a vector at Riken Gene Bank (http://www.rtc.riken.go.jp/), click DNA Database Search and then select Vector Database to open the vector search page. Enter the keyword (e.g., pBR322, cosmid, or a wild card as in pBR*, p*), and click the Start button. Choose the desired vector from the hit list by clicking detail-idC. The search returns a description (name, classification, size of vector DNA, restriction sites, cloning site, genetic markers, host organism, growth condition, GenBank accession, and reference) and the restriction map of the vector (Figure 9.2).
A catalog of vectors is available from the American Type Culture Collection (ATCC) at http://www.atcc.org/. From the list of Search a Collection, select Molecular Biology and then Vectors. A tabulated list of name, map and/or sequence, hosts, and brief description of vectors is returned. Select map or sequence to view/save the restriction map or the nucleotide sequence (with references) of the vector. The nucleotide sequence of the known vector can be retrieved from the Nucleotide tool of Entrez
Figure 9.3. Restriction map produced by Webcutter. The partial restriction map shows the nucleotide sequence of the human lysozyme gene submitted to Webcutter using options for all restriction endonucleases with recognition sites equal to or greater than six nucleotides long and cutting the sequence 2 to 6 times (at least 2 times and at most 6 times). The restriction profile (map) is returned if "Map of restriction sites" is selected for display. The tables by enzyme name and by base pair number can also be returned if the displays for "Table of sites, sorted alphabetically by enzyme name" and "Table of sites, sorted sequentially by base pair number" are chosen.
(http://www.ncbi.nlm.nih.gov). Select and save the plasmid with circular DNA (check the header of the GenBank format for circular DNA).
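The check of the GenBank header can also be done programmatically. In the GenBank flat file format the first line (LOCUS) carries the molecule topology as the word "circular" or "linear". The sketch below assumes that convention; the function name locus_topology is hypothetical, and the LOCUS line shown is an illustrative header modeled on a pBR322 entry rather than a verbatim copy of a database record.

```python
def locus_topology(genbank_text):
    """Return 'circular' or 'linear' from the LOCUS line, or None if absent."""
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            for word in ("circular", "linear"):
                if word in fields:
                    return word
            return None  # LOCUS line present but no topology keyword
    return None  # no LOCUS line found


# Illustrative header only (assumed layout, hypothetical date).
example_header = (
    "LOCUS       SYNPBR322               4361 bp    DNA     circular SYN 01-JAN-2000\n"
    "DEFINITION  Cloning vector pBR322, complete sequence.\n"
)

print(locus_topology(example_header))
```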
To search for an appropriate restriction enzyme and its restriction profile, submit the query DNA to Webcutter at http://www.firstmarket.com/cutter/cut2.html. Upload the sequence file (enter drive:\directory\seqfilename) or paste the sequence into the query box. Indicate your preferences with respect to the type of analysis, site display, and restriction enzymes to include in the analysis. After clicking the Analyze Sequence button, the restriction map (duplex sequence with restriction enzymes at the cleavage sites), as shown in Figure 9.3, is returned if Map of restriction sites is selected for display. You may also select Table of sites, sorted alphabetically by enzyme name for display, which lists the number of cuts, positions of sites, and recognition sequences.
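The kind of table returned by such a server can be approximated with a simple pattern search. The sketch below is not Webcutter's implementation; it assumes a small, hand-picked set of six-base palindromic recognition sequences, reports the 1-based position where each recognition sequence starts (not the exact cleavage position), and uses hypothetical function names and a made-up test sequence.

```python
import re

# A small subset of six-base cutters (assumed selection for illustration).
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "PstI": "CTGCAG",
    "SalI": "GTCGAC",
    "KpnI": "GGTACC",
}


def find_sites(seq, site):
    """1-based start positions of `site` in `seq`, overlapping matches included."""
    return [m.start() + 1 for m in re.finditer("(?=%s)" % site, seq.upper())]


def restriction_table(seq, enzymes=ENZYMES, min_cuts=1, max_cuts=None):
    """Table of sites sorted by enzyme name, filtered by number of cuts."""
    table = {}
    for name in sorted(enzymes):
        positions = find_sites(seq, enzymes[name])
        if len(positions) < min_cuts:
            continue
        if max_cuts is not None and len(positions) > max_cuts:
            continue
        table[name] = positions
    return table


# Made-up sequence containing two EcoRI sites and one BamHI site.
demo = "AA" + "GAATTC" + "TT" + "GGATCC" + "AA" + "GAATTC" + "GG"

print(restriction_table(demo))                          # all enzymes that cut
print(restriction_table(demo, min_cuts=2, max_cuts=6))  # cut 2 to 6 times
```

Because the listed recognition sequences are palindromic, searching one strand is sufficient; non-palindromic sites would also require a search of the reverse complement.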
The primer selection for PCR can be accessed via the Primer3 server of the Whitehead Institute/MIT Center for Genome Research at http://genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi (Figure 9.4). Paste the nucleotide template into the query box on the Primer3 home page. Key in the desired specifications, for example, included targets, excluded regions if any, product size, and primer picking conditions. Click the Pick Primers button. The returned output lists the oligos (left primer and right primer) with their start position, length, Tm, GC%, and sequence (5′→3′). The positions of the left primer and the right primer are also shown along the source (template) sequence (Figure 9.5).
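Two of the reported primer properties, GC% and Tm, can be estimated by hand for a quick plausibility check. The sketch below uses the simple rule of thumb Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a rough approximation for short oligonucleotides; Primer3 uses its own, more detailed model, so its reported Tm values will differ. The function names and the 20-mer are illustrative assumptions.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def rule_of_thumb_tm(primer):
    """Approximate Tm in degrees C: 2 per A/T plus 4 per G/C (rough estimate)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq):
    """Reverse complement, e.g. to write a right primer 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


primer = "CGAAGAGTGAAGTGCACAGG"  # illustrative 20-mer
print(len(primer), gc_percent(primer), rule_of_thumb_tm(primer))
print(reverse_complement(primer))
```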
Web Primer (http://genome-www2.stanford.edu/cgi-bin/SGD/web-primer) searches 35 base pairs upstream and 35 base pairs downstream of the coding sequence to locate primers. On the entry page of Web Primer (Figure 9.6), paste the query sequence, select Sequencing [info], and click the Submit button. The parameters page returns with options for the location of the primer (length of DNA in which to search for a valid primer, choice of DNA strand, distance between sequencing primers, and primer length), primer composition (expressed in %GC content), and
Figure 9.4. Request form for primer selection. The nucleotide sequence of a target DNA for polymerase chain reaction can be submitted for primer selection at Primer3 server.
Figure 9.5. Output of primer selection for PCR. The abbreviated output of the primer selection for PCR by the Primer3 server shows the input nucleotide sequence with the primer oligonucleotide segments marked.
primer annealing. Accept or modify the default options and click the Submit button. The user is instructed to click here for the list of primers. This returns data for primer pairs, listing the starting position and sequence of octadodecanucleotide primers of the coding strand.
The catalog of synthetic oligonucleotides that have proven useful as PCR primers or gene probes can be downloaded from the National Cancer Research Institute
Figure 9.6. Home page of Web Primer.
in Genova, Italy by selecting molprob.gz (PC version) from ftp://ftp.biotech.ist.unige.it/pub/MPDB. For example, the following information describes the primer for the tumor protein p53 gene:
ID: MP04028
Name: VNTRa
Type: primer
Sequence: 5′ CGAAGAGTGAAGTGCACAGG 3′
DataSource: Literature
Methods: PCR primers
Applications: Loss Of Heterozygosity
Species: human
TargetGene: TP53
GeneDescription: Tumor Protein p53
ComplementaryPrimer: VNTRb
Bibliography: Cancer Genet Cytogenet 1995;82:106-115 [PMID: 7664239]
9.3.4. Application of BioEdit
BioEdit is a software program for nucleic acid/protein sequence editing, alignment, manipulation, and analysis. It can be downloaded from http://www.mbio.ncsu.edu/RnaseP/info/program/BIOEDIT/bioedit.html as BioEdit.zip. After installation, click the BioEdit icon to open the main window. Select Open (to open a new file in fasta
Figure 9.7. Restriction map generated with BioEdit. Synthetic DNA encoding human calcitonin is subjected to restriction with all REBASE restriction endonucleases to generate the restriction map.
format) or New from clipboard (to copy a sequence) from the File menu. The input of sequence(s) changes the menu bar of the window (with File, Edit, Sequence, Alignment, View, WWW, Accessory application, RNA, Option, Window, and Help menus). The Edit menu provides tools for manipulating nucleotide sequences. The Sequence menu provides tools for global alignment and calculation of identity/similarity of two sequences, creating a plasmid from a nucleotide sequence, and analyses of nucleic acid and protein sequences. The Alignment menu provides tools for multiple alignment, creating a consensus sequence, entropy plot, positional nucleotide numerical summary, and finding conserved regions. Version 4.7.8 supports up to 20,000 sequences per document.
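The basic nucleic acid analyses offered through the Sequence menu (base composition, complement, RNA transcription, and translation in a chosen frame) are simple string operations, and it can help to see them spelled out. The sketch below is not BioEdit code; it assumes the standard genetic code, treats the input as the coding strand, and uses hypothetical function names and a made-up 12-base sequence.

```python
from collections import Counter

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def base_composition(seq):
    """Count of each base in the sequence."""
    return Counter(seq.upper())


def complement(seq):
    """Complementary strand, written in the same left-to-right order."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))


def transcribe(seq):
    """mRNA corresponding to a coding-strand DNA sequence (T replaced by U)."""
    return seq.upper().replace("T", "U")


def translate(seq, frame=1):
    """Translate DNA in reading frame 1, 2, or 3; '*' marks a stop codon."""
    s = seq.upper()[frame - 1:]
    return "".join(CODON_TABLE.get(s[i:i + 3], "X") for i in range(0, len(s) - 2, 3))


dna = "ATGGCCAAGTGA"  # made-up example
print(dict(base_composition(dna)))
print(complement(dna))
print(transcribe(dna))
print(translate(dna, frame=1), translate(dna, frame=2))
```

Codons containing characters other than A, C, G, or T are reported as "X" in this sketch.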
To analyze a nucleotide sequence for base composition, complement sequence, RNA transcription, protein translation (choice of frames), plasmid creation, and restriction mapping, use the Sequence menu. For example, to construct a restriction map:
· Choose the Nucleic Acid tool of the Sequence menu (i.e., Sequence → Nucleic Acid → Restriction Map) to open a dialog box.
· Select the output display and desired restriction enzymes.
· Click Generate map to return the restriction map (Figure 9.7).
To construct recombinant DNA:
· Input a desired plasmid sequence via File → Open.
· Copy the desired insertion DNA sequence to the clipboard.
· Identify the cloning site by viewing the restriction map of the plasmid.
Figure 9.8. Construction of cloning vector with BioEdit. The plasmid pJRD158 is retrieved from Entrez and used to construct a vector for cloning DNA encoding human somatostatin with BioEdit. The cloning vector with the somatostatin gene (arrow) is displayed.
· Return to the plasmid window, place the cursor at the cloning site, and paste the insertion DNA sequence from the clipboard (i.e., Edit → Paste).
· Rename the vector if desired (click the highlighted name and type in the new name).
· Highlight the vector and select Sequence → Create Plasmid from Sequence to display the circular vector.
· Select the Add Feature tool of the new Vector menu to open the dialog box.
· Enter the name of the insertion DNA as Feature name, identify the region of the insertion sequence by entering the start position and the end position, select the type of display for the insertion sequence (color in box or arrow), and click Apply & Close.
· Display the reference restriction sites by selecting Vector → Restriction Sites to open the selection box. Transfer the desired enzyme sites to the Show box and click Apply & Close to display the recombinant DNA (Figure 9.8).
· Save the file as recdna.pmd.
To perform multiple alignment:
· Input a file containing multiple sequences by choosing File → Open.
· Highlight the headings (ID) of all sequences to be aligned.
· Select Alignment → "Plot identities to first sequence with a dot" to align all sequences with reference to the first sequence.
Figure 9.9. Sequence alignment with BioEdit. The sequences of DNA encoding preprosomatostatin mRNA are aligned to identify the consensus sequence.
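The displays derived from an alignment (identities to the first sequence shown as dots, a consensus sequence, and a per-column entropy) can be illustrated on a toy example. The sketch below does not perform the alignment itself and is not BioEdit's implementation; it assumes the sequences are already aligned to equal length, takes the most frequent residue in each column as the consensus, and computes Shannon entropy in bits. The function names and the three short sequences are hypothetical.

```python
import math
from collections import Counter


def plot_identities(aligned):
    """Show the first sequence in full; in the others, replace residues
    identical to the first sequence with a dot."""
    ref = aligned[0]
    rows = [ref]
    for seq in aligned[1:]:
        rows.append("".join("." if a == b else a for a, b in zip(seq, ref)))
    return rows


def consensus(aligned):
    """Most frequent residue in each column of the alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def column_entropy(aligned):
    """Shannon entropy (bits) of each column; 0 means fully conserved."""
    result = []
    for col in zip(*aligned):
        n = len(col)
        result.append(-sum((c / n) * math.log2(c / n) for c in Counter(col).values()))
    return result


# Made-up, pre-aligned sequences of equal length.
alignment = ["ATGCTGTCC", "ATGCTGTCT", "ATGTTGTCC"]

for row in plot_identities(alignment):
    print(row)
print(consensus(alignment))
print([round(h, 3) for h in column_entropy(alignment)])
```

Columns with zero entropy correspond to conserved positions, which is the basis for locating conserved regions in an alignment.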