65 4.2 The numerical sequence and associated DFTs of the basic cattle. left) and sour beef (right). 102 5.5 Classification accuracy (in percentage) of the proposed. AmPseAAC+RBF) method for self-consistency and jackknife testing.
Background and scope of the thesis
There is therefore a critical challenge to develop automated methods for the rapid and accurate determination of protein structures. The problem of protein structural class prediction was mainly focused on the effective representation of protein sequences and then the development of robust classification algorithms to efficiently predict the class attribute.
Motivation
This large dimension reduces the accuracy of the classifier as there is a risk of overloading. Therefore, joint time-frequency analysis is needed to better understand hidden artifacts in genomic signals.
Objective of the dissertation
Measuring and processing the genomic and proteomic signal in the time domain and frequency domain alone does not provide complete information about the structure, sequence and pattern of the molecules. Propose a hybrid feature extraction method for efficient classification of the gene expression profiles of cancer microarrays.
Dissertation Outline
First, the spectrum of the protein sequence is obtained to show the energy distribution of the frequencies in the time-frequency domain. The performance of the proposed method is evaluated and compared with existing methods using benchmark datasets.
Conclusion
An exhaustive simulation study is performed on the standards 204, 277 and 498 datasets to demonstrate the effectiveness of the new feature representation. This chapter provides a brief overview of the signal processing and soft computing techniques used to solve some challenging problems in bioinformatics.
Signal processing techniques used in the analysis
Discrete Fourier transform
Therefore, it is necessary to represent the signal in some alternative fields where the intrinsic characteristics of the signal can be reflected in a better way. Direct calculation of the DFT coefficients (point N which is a power of 2) requires O(N2) operations, while the FFT algorithm can calculate the same in only O(N log2N) operations.
Discrete cosine transform
An efficient algorithm known as fast Fourier transform (FFT) exists to reduce the computational complexity in computing the DFT and its inverse. This property of DFT can be used to identify the periodicities present in the signal.
Time-frequency analysis
Actually, the time information of the spectral elements is hidden in the phase spectrum of the signal. The spectrum of the time series is calculated using STFT, WT and S-transform and is shown in Fig.2.2.
Soft computing techniques used in the analysis
Artificial neural network
An artificial neural network (ANN) is an information processing system that attempts to simulate biological neural networks, i.e. the nervous system in the brain [18] [19]. In this algorithm, the weights and biases are initialized as very small random values. The intermediate and final outputs of the MLP are calculated by equations (2.22) and (2.23), respectively.
Conclusion
Genes and Proteins
It consists of alternating parts of coding regions (exons) and non-coding regions (introns). Therefore, searching for coding regions in a DNA sequence involves searching for the many nucleotides that make up the DNA strand.
Fundamentals of 3-base periodicity in protein coding regions . 37
Because of these discrepancies and many other reasons, the identification of protein-coding regions in the DNA sequence is a complex task. Also, many model-independent methods have been proposed to identify the coding regions in DNA sequences.
Numerical mapping of DNA sequences
The DNA sequence can be converted to the numerical sequence by replacing each nucleotide with the corresponding EIIP value. For example, if x[n] =AAT GCAT CA, then using the values from Table 3.1, the corresponding numerical sequence is given as.
Spectral content measure method
Therefore, coding regions are identified by evaluating S[N/3] in a window of N samples, then sliding the window by one or more samples and recomputing S[N/3]. Peaks in the spectra obtained by sliding window DFT correspond to protein-coding regions.
Digital filter method
When the radius R is small and close to unity, G(z) is a notch filter with zero at frequency w0. Therefore, H(z) can be a good anti-notch filter defined as 3.8) The amplitude response of a filter against a notch with radius R = 0.99 is shown in the figure.
Statistical method of coding region identification
The probability of the sequence is the product of state transitions and symbol emissions. Mathematically, the state sequence can be found by maximizing the conditional probability P(A/S) with respect to A.
The proposed time-frequency analysis method
Identification of protein coding regions in DNA using
Peaks in the energy of the filtered output signal identify the locations of protein coding regions. This energy is called the energy sequence corresponding to the three-base periodicity of the DNA sequence.
Results and Performance evaluation
Data resorces
The number of single-exon genes is 43, and the number of multi-exon genes is 152. The proportion of the coding sequence in this dataset is 14%, the intronic sequence is 46%, and the intergenic DNA is 40%.
Experimetal result
A comparative analysis of the average accuracy against threshold values for the three model-independent methods is shown in the figure. A classification experiment was also conducted to compare the performance of the proposed method.
Discussion
Therefore, the S-transform based filtering method provides a better performance on the classification, thereby evaluating the superiority of the proposed method. This happens due to the scaling nature of the Gaussian window during spectrum calculation, which can affect the time-frequency filtering operation and also the accuracy.
Conclusion
A schematic view of the interaction of a protein with target through the hot spots is shown in Fig.4.1. Since the interaction involves binding of the protein to the target, it releases some energy in that process known as binding free energy (∆G).
Review of hot spot identification methods
The signal processing techniques are the suitable candidates to extract these characteristic frequencies in the protein sequence based only on the sequence information. Under such a situation, if the conventional digital filtering technique is used, it will not uniquely detect all characteristic frequencies present in the protein sequence.
Resonant recognition model
The characteristic RRM frequency in the consensus spectrum corresponds to a particular biological function of the protein family. To analyze the characteristic frequency change in this case, a joint time-frequency analysis is needed.
Time-frequency analysis
A simple procedure was applied to identify the hotspots by changing the amplitude of the Fourier coefficients corresponding to the characteristic frequencies. Therefore, the TFA evaluates the energy concentration along the frequency axis at a given time, providing a joint time-frequency representation of the signal.
Hot spot localization in proteins using the proposed S-transform
Experimental study
These computational methods use the knowledge of the ASEdb database to predict hotspots. The hotspots identified by these methods are shown in Table 4.7 and the performance comparison is also shown in Table 4.8.
Discussion of results
Thus, it saves time, effort and cost for the identification of hotspots in proteins. But the proposed signal processing technique does not require prior 3D structural information of the protein for hotspot localization.
Conclusion
Review of protein structural class prediction
94] indicated that the protein structural classes are strongly related to amino acid composition (AAC). Much other information regarding the protein sequence is embedded in the PseAAC to optimally reflect the sequence order and length effects.
Feature representation method of protein
Amino acid composition (AAC) feature of protein
112] incorporated a wavelet power spectrum into PseAAC to reflect the long-range interaction of amino acids in the protein. 113] [114] used Fourier spectrum analysis with PseAAC to improve membrane protein prediction.
Amphiphilic Pseudo amino acid composition (AmPseAAC)
These values are listed in Table θ1 and are called the first-layer correlation factors that reflect the sequence order correlation between the most contiguous residues along the protein sequence via hydrophobicity and hydrophilicity. Therefore, a significant amount of sequence information is incorporated into the 2λ correlation factors via the amphiphilic values of the amino acid residues along a protein chain.
Spectrum based feature of protein
Amino acid hydrophobicity and hydrophilicity values determined by Tanford [118] and Hopp &. DFT is a possible tool to transform the discrete protein sequence into Table 5.1: Amino Acid Hydrophobicity and Hydrophilicity Values.
The proposed DCT amphiphilic pseudo amino acid composition
These dominant low-frequency motions have a great effect on the structure (both alpha helix and beta sheets) of the formation of the protein. The low frequency DCT coefficients on. the protein is denoted by γk and the symbol w represents the weighting factor which controls the degree of the sequence order effect to be incorporated.
Classification strategy
The first 20 values in Eq. 5.6) represent the classical amino acid composition, the following 2λ values reflect the amphiphilic sequence correlation along the protein chain and the remaining δ discrete values contain the low-frequency global information of the protein.
Performance measures
This procedure is repeated until every protein in the database is omitted for testing. The Jackknife test is the most desired and is a useful test used by most researchers to test the effectiveness of the prediction models.
Results and Discussion
Datasets
Experimental Results
The success rate of the proposed feature representation method is found to be the best compared to that obtained by others. Furthermore, the classification performance of the proposed feature representation with the RBF network is studied for the replacement and jackknife test.
Conclusion
Molecular biology research evolves through the development of the technologies used to carry it out. Therefore, there is a need to develop faster, automatic and efficient tools for the analysis of microarray data.
Microarray Technology
Gene expression data
In general, a microarray experiment produces a gene expression matrix E =Eij, 1≤ i≤n,1≤j ≤m, which is a realistically evaluated gene expression level in different samples, as shown in the figure. In the expression matrix, a row contains the expression of gene samples, the columns represent the expression profiles of the samples, and each cell is the measured expression level of gene ′i′ in sample ′j′.
Dimension reduction techniques
The F-score method of feature selection
It is a simple and robust method and does not require a priori knowledge of the sequence. The coefficients of the linear filter can be used to model the microarray samples in gene space in terms of their global spectral features.
Performance evaluation
Datasets
This dataset consists of four categories of small blue round tumors (SRBCT) with 83 samples from 2308 genes [144]. This dataset consists of three types of leukemia, namely ALL, MLL and AML with 72 samples from 12582 genes [150].
Experimental results
The success rates of all classifiers are evaluated on all six benchmark datasets. It is clear from the table that RBFNN classifier has the best performance among all classifiers and for all datasets.
Conclusion
The potential of the proposed method is assessed by evaluating ROC curves and statistical parameters (Sp, Sn, Average accuracy) in C. Exhaustive simulation study is performed on the proposed method with six standard cancer microarray datasets to demonstrate its potential. The results show that the proposed method is comparable to the existing best algorithm, but with an advantage of reduced complexity.
Future work
Lee, “A feature-based approach for modeling protein-protein interaction hotspots,” Nucleic Acids Research, vol.37, no. Antoniou, “Identification of hotspot locations in proteins using digital filters”, IEEE Journal of Selected Topics in Signal Processing, vol.2, no.
The DFT of a periodic signal
Comparison of the power spectra obtained by STFT, WT and
The structure of a single neuron
The structure of MLP network
The architecture of radial basis function network
The process to evaluate sensitivity and specificity
The relationship between the DNA sequence, gene, intergenic spaces,
The central dogma of molecular biology (flow of genetic information
DFT power spectrum of coding region of gene F56F11.4
DFT power spectrum of non-coding region of gene F56F11.4
The anti-notch filter response
A simple hidden Markov model
The synthetic time series and its amplitude spectra
The synthetic time series and its S-transform spectrum
The time-band limited filter
The recovered signal and its S-transform spectrum
Spectrogram of the DNA sequence of gene F56F11.4
The flow graph of the S-transform based filtering approach for protein