Software/Tools used in the study - Materials and Methods

Chapter 2: IPMS analysis of DCUN1D1 substrates

2.4 Materials and Methods

2.4.7 Software/Tools used in the study

2.4.7.1 Software/Tools used within X Tandem (Alanine 2017.2.1.4)

Gene Ontology

Gene Ontology (GO) is a computational tool compiled and produced by the Gene Ontology Consortium for the unification of biology using genes and gene products (Ashburner et al., 2000; Carbon et al., 2017). It consists of 12 reference genomes including E. coli, A. thaliana, S. cerevisiae, C. elegans, M. musculus and H. sapiens, based on experimental findings from ~140 000 peer-reviewed published articles, which are collated and annotated to represent 600 000 experimentally-supported GO annotations (Carbon et al., 2017). It uses this information, together with the extensive information obtained from whole-genome sequencing experiments performed on model organisms and species-specific sequencing experiments, to infer >6 million functional annotations. These include experiment-supported, phylogenetically-inferred and computationally-inferred GO annotations (Carbon et al., 2017). The analysis is based on the understanding that multiple core biological functions are shared among eukaryotes and that the genes encoding these functions are similar and produce phenotypes that can be extrapolated to other eukaryotes. GO therefore creates a computational model for biological systems at various cellular levels, including cellular components, cellular processes and molecular functions (Ashburner et al., 2000). In addition, following an experiment, GO enrichment analysis can be performed on input datasets to analyse the underlying molecular changes obtained from measuring the levels of certain molecules (RNA, DNA or protein). It can identify the groups of genes that function together based on over-representation or under-representation of “GO terms” for an annotated gene set, taking into consideration the frequency of the gene within the database (background frequency), the sample frequency and p value determinations (Ashburner et al., 2000; Carbon et al., 2017). In this study, the GO outputs were based on the log (I), log (p) and descriptions derived from database and sample frequencies.
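The over-representation logic described above (background frequency, sample frequency and a p value) can be illustrated with a one-sided hypergeometric test. This is a minimal sketch only: it is not the calculation performed by X Tandem or by the GO Consortium tools, and the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(background_total, background_with_term,
                       sample_total, sample_with_term):
    """One-sided hypergeometric p value for over-representation of a GO term.

    Returns the probability of observing at least `sample_with_term`
    annotated genes when `sample_total` genes are drawn at random from a
    background of `background_total` genes, of which `background_with_term`
    carry the term.
    """
    max_hits = min(background_with_term, sample_total)
    denominator = comb(background_total, sample_total)
    tail = sum(
        comb(background_with_term, k)
        * comb(background_total - background_with_term, sample_total - k)
        for k in range(sample_with_term, max_hits + 1)
    )
    return tail / denominator


# Hypothetical counts: 20 000 background genes, 200 annotated with the term;
# 100 genes in the sample, 8 of them annotated with the term.
p = enrichment_p_value(20000, 200, 100, 8)
print(f"background frequency = {200 / 20000:.3f}")
print(f"sample frequency     = {8 / 100:.3f}")
print(f"p value              = {p:.3g}")
```

A sample frequency well above the background frequency, together with a small p value, is what marks a term as over-represented in this kind of test.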

KEGG

Kyoto Encyclopedia of Genes and Genomes (KEGG) is a multi-database resource consisting of 18 databases that provide molecular level and high-level information to understand the functions, interactions and uses of biological systems (Kanehisa and Goto, 2000; Kanehisa et al., 2016, 2017). It includes genomics information in the form of gene-sequencing data obtained from experiments plus inferences from orthologs and protein-sequencing data, based on experimental characterisation of the function of proteins (Kanehisa and Goto, 2000; Kanehisa et al., 2016, 2017). In addition, it contains chemical information, based on experiments performed using chemical substances and systems information based on “interomics” or network generation and analysis. It also contains health information, using disease and drug-treatment analyses to understand dysregulated biological systems. It then provides a computational representation of these biological systems from more than 4000 genome annotations as well as data generated from the use of viruses and plasmids (Kanehisa and Goto, 2000; Kanehisa et al., 2016, 2017). In this study, the KEGG PATHWAY database was used within X Tandem to determine the pathways significantly dysregulated following IPMS analysis, based on the log (I), log (p) and descriptions based on database and sample frequencies.

2.4.7.2 Software/Tools used for additional analyses

PAST

PAST (Paleontological Statistics) is a software package that was designed for use in quantitative palaeontology studies and is increasingly being used in other biological studies (Hammer, Ryan and Harper, 2001). It runs on the Windows operating system and uses a user-friendly, spreadsheet-like platform to perform “standard numerical analysis and operations”, using methods that are specific for palaeontology and biology (Hammer, Ryan and Harper, 2001). These include plotting graphical functions, multivariate analysis, phylogenetic analysis, correlation of geological strata based on geological timelines, curve-fitting, time-series analysis and geometric analysis (Hammer, Ryan and Harper, 2001). In this study, we used PAST version 3 (PAST3) to perform statistical analyses following immunoprecipitation-coupled MS, including univariate and multivariate tests such as the Shapiro-Wilk test for normality, the Welch F test for unequal variance, the Kruskal-Wallis test for equal medians and principal component analysis (PCA). These were performed using the default settings, with a significance level of 0.05. Missing values were either supported (replaced with the mean value for the dataset) or deleted, depending on the test performed and its assumptions.
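To make one of the listed tests concrete, the Kruskal-Wallis H statistic can be computed from pooled ranks. The following is a minimal sketch of the textbook formula, with average ranks for ties and the usual tie correction. It is not PAST's implementation, and the function name and input values are hypothetical.

```python
from collections import defaultdict


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Average 1-based rank for each distinct value in the pooled data.
    positions = defaultdict(list)
    for index, value in enumerate(pooled, start=1):
        positions[value].append(index)
    rank = {value: sum(idx) / len(idx) for value, idx in positions.items()}

    # H = 12 / (n (n + 1)) * sum(R_j^2 / n_j) - 3 (n + 1)
    rank_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(len(idx) ** 3 - len(idx) for idx in positions.values())
    correction = 1 - ties / (n ** 3 - n)
    return h / correction


# Hypothetical intensities for three groups of three replicates each.
h_value = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(f"H = {h_value:.2f}")
```

To reach a decision at the 0.05 level, H is compared against a chi-squared distribution with one fewer degrees of freedom than the number of groups. The sketch assumes that the pooled values are not all identical, since the tie correction would then be zero.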

Venny

Venny is an online program that allows for manual drawing of Venn diagrams to investigate relationships between different datasets, entered as lists of elements (Oliveros, 2007). It uses overlapping circles depicted on a single plane to compare and visualize up to 4 lists of elements. We used Venny 2.1.0 to compare the relationships within the unfiltered and filtered datasets, using the default settings and the colour style.
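The list comparison that underlies a Venn diagram amounts to set arithmetic: each element is assigned to the region defined by the lists that contain it. The following is a minimal sketch of that idea, not Venny's own code; the function name and the protein identifiers are hypothetical.

```python
def venn_regions(named_lists):
    """Map each combination of list names to the elements found in exactly those lists."""
    sets = {name: set(items) for name, items in named_lists.items()}
    regions = {}
    for element in set().union(*sets.values()):
        members = tuple(name for name in sets if element in sets[name])
        regions.setdefault(members, set()).add(element)
    return regions


# Hypothetical protein identifiers, for illustration only.
lists = {
    "unfiltered": ["P1", "P2", "P3", "P4"],
    "filtered": ["P2", "P3", "P5"],
}
for members, elements in sorted(venn_regions(lists).items()):
    print(" & ".join(members), sorted(elements))
```

Each key in the result corresponds to one region of the diagram, so the same function handles two, three or four lists without modification.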

STRING

The STRING functional protein association network is a protein-protein interaction database that annotates and provides associations between proteins, based on physical binding and on indirect interaction through a similar mechanism of action or participation in similar cellular processes (Snel et al., 2000; Mering et al., 2003; Szklarczyk et al., 2015). STRING uses genomics information for annotation and prediction, information based on conserved expression, and data obtained from high-throughput experiments such as “Omics” studies (Snel et al., 2000; Mering et al., 2003; Szklarczyk et al., 2015). In addition, it uses automated text mining and knowledge obtained from data previously analysed using STRING to assist with making protein-protein associations (Snel et al., 2000; Mering et al., 2003; Szklarczyk et al., 2015). The proteins are represented as nodes, while lines (edges) represent the protein-protein associations, and any meaningful functional information attributed to an association is represented as line length, thickness, evidence etc. (https://string-db.org/). These can also be represented in the form of hubs, depending on the output (Snel et al., 2000; Mering et al., 2003; Szklarczyk et al., 2015). Currently, the database contains information on 9 643 763 proteins from 2031 organisms, including E. coli, A. thaliana, S. cerevisiae, C. elegans, M. musculus and H. sapiens (https://string-db.org/). The numbers of interactions annotated in STRING at each confidence score level are: 1 380 838 440 at low confidence (≥ 0.150), 320 182 220 at medium confidence (≥ 0.400), 71 673 028 at high confidence (≥ 0.700) and 25 914 693 at the highest confidence (≥ 0.900) (https://string-db.org/).
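The confidence levels quoted above are lower bounds on the association score, so an association that passes a stricter cut-off also passes every looser one. The following is a minimal sketch of how scored edges could be labelled and filtered with those cut-offs. The function names and the example edges are hypothetical, and no STRING API is used.

```python
# Lower bounds of the confidence levels quoted in the text, highest first.
CONFIDENCE_BANDS = [
    (0.900, "highest"),
    (0.700, "high"),
    (0.400, "medium"),
    (0.150, "low"),
]


def confidence_band(score):
    """Return the strictest confidence label met by a score, or None if below 0.150."""
    for lower_bound, label in CONFIDENCE_BANDS:
        if score >= lower_bound:
            return label
    return None


def filter_edges(edges, minimum_score):
    """Keep only (protein_a, protein_b, score) edges at or above the cut-off."""
    return [(a, b, s) for a, b, s in edges if s >= minimum_score]


# Hypothetical edges between placeholder proteins A to D.
edges = [("A", "B", 0.95), ("A", "C", 0.55), ("B", "D", 0.20)]
for a, b, score in edges:
    print(a, b, score, confidence_band(score))
print(filter_edges(edges, 0.400))
```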

CRAPome

CRAPome is a central repository for aggregated negative control samples generated from immunoprecipitation-coupled MS experiments for the determination of protein-protein interactions (Mellacheruvu et al., 2013). It consists of negative control samples that are added to the database as raw files, including metadata and study protocols describing the solid matrices (agarose, magnetic beads), antibodies/affinity approach (anti-GFP, streptavidin, calmodulin) and epitope tags (TAP, HA, GFP) used (Mellacheruvu et al., 2013). The deposited raw files are then processed using a uniform pipeline, in which spectral count data is parsed and protein identifications are mapped to NCBI gene identifications. They are then organized based on controlled vocabularies, generating standardised information for analysis by the user (Mellacheruvu et al., 2013). This allows researchers to query the polypeptides identified during their study to distinguish true interactors from non-specific interactors. The approach is based on the understanding that negative control samples are, generally, not dependent on the type of bait used (TAP, HA, GFP) and that comparing studies of different sizes using CRAPome can provide insights from a larger sample of negative control data (Mellacheruvu et al., 2013). It offers three workflows, which either run the user’s dataset against the database components or generate custom background protein subsets for data comparison. It contains >360 experiments that were performed by 12 laboratories across different countries, the majority of which were performed using human cell lines (Mellacheruvu et al., 2013). It can be searched against MS data generated using E. coli, S. cerevisiae and H. sapiens samples. We used CRAPome workflow 1.1, selected H. sapiens and entered our data using Ensembl protein identification names. Where identifications showed “no protein information”, we used other annotated protein names (where possible), due to the dynamic nature of data repository-based databases (https://www.crapome.org/).
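The principle of querying identified proteins against aggregated negative controls can be sketched as a frequency calculation: the more control experiments a protein appears in, the more likely it is a non-specific interactor. The following is a minimal sketch under stated assumptions and does not reproduce the repository's own workflows. The function names, the protein labels and the 0.3 cut-off are all hypothetical.

```python
def control_frequency(protein, control_experiments):
    """Fraction of negative-control experiments in which the protein was detected."""
    if not control_experiments:
        raise ValueError("at least one control experiment is required")
    found = sum(1 for detected in control_experiments if protein in detected)
    return found / len(control_experiments)


def split_by_background(candidates, control_experiments, max_frequency=0.3):
    """Separate candidates into likely specific and likely non-specific lists.

    `max_frequency` is an arbitrary illustrative cut-off, not a repository default.
    """
    specific, non_specific = [], []
    for protein in candidates:
        frequency = control_frequency(protein, control_experiments)
        (specific if frequency <= max_frequency else non_specific).append(protein)
    return specific, non_specific


# Hypothetical controls: each set lists the proteins detected in one experiment.
controls = [{"K", "T"}, {"K"}, {"K", "X"}, {"T"}]
specific, non_specific = split_by_background(["K", "X", "B"], controls)
print("likely specific:    ", specific)
print("likely non-specific:", non_specific)
```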
