Cancer Institute

  • Cancer Bioinformatics Core Software and Tools

    Some of the tools developed by members of the Cancer Bioinformatics Core are listed here. Many of them are currently hosted on servers at other institutions.

    SpliceScan II

    This program is an improved version of the SpliceScan tool. It scores splice sites (SSs) using an exon definition model, which simultaneously scores the 3'SS (acceptor) and 5'SS (donor) signals, exon length, and the contribution of exonic/intronic enhancer/silencer elements associated with the delicate balance of factors that commit exons to splicing. All scoring is done with Bayes' rule in terms of log-odds (LOD) scores, where the LOD score is the base-2 logarithm of the normalized signal concentration ratio in the vicinity of a true versus a decoy SS.
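
    The LOD definition above can be illustrated with a minimal Python sketch. This is not the SpliceScan II implementation: the function names are hypothetical, the input frequencies are made-up placeholders, and summing component LODs into one exon score is an assumption that holds only if the signals are treated as independent.

```python
import math


def lod_score(freq_true: float, freq_decoy: float) -> float:
    """Return log2 of the signal frequency ratio near true vs. decoy splice sites.

    Both inputs are assumed to be already normalized and strictly positive.
    """
    if freq_true <= 0 or freq_decoy <= 0:
        raise ValueError("frequencies must be strictly positive")
    return math.log2(freq_true / freq_decoy)


def combined_score(component_lods):
    """Add component LODs (assumption: the components are independent)."""
    return sum(component_lods)


# Placeholder frequencies chosen only to make the arithmetic easy to follow.
acceptor = lod_score(0.5, 0.125)   # signal enriched near true sites, positive LOD
silencer = lod_score(0.125, 0.5)   # signal enriched near decoys, negative LOD
total = combined_score([acceptor, silencer])
```

    A positive LOD means the signal is more concentrated near true sites than near decoys; a negative LOD means the opposite.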

    Autism Candidate Gene Map (ACGMAP) Database

    The ACGMAP contains information on protein-coding and noncoding autism candidate genes, single nucleotide polymorphisms (SNPs), copy number variants (CNVs), micro insertions and deletions, de novo mutations, functional characteristics of candidate genes (e.g. domains), splice variants and types of alternative splicing events, and genomic and protein sequence information.

    The database is undergoing continuous development and will include non-coding genes, the Autism Genome Atlas and other relevant genetic information associated with autism. The database is also linked to other important databases through hyperlinks.

    HPIDB (Host Pathogen Protein-Protein Interaction Database)

    Protein-protein interactions (PPIs) play a crucial role in initiating infection in a host-pathogen system. Identification of these PPIs is important for understanding the underlying biological mechanism of infection and identifying putative drug targets. Database resources for studying host-pathogen systems are scarce and are either host specific or dedicated to specific pathogens. Here we describe HPIDB, a host-pathogen PPI database, which will serve as a unified resource for host-pathogen interactions. Specifically, HPIDB integrates experimental PPIs from several public databases into a single, non-redundant, web-accessible resource.

    The database can be searched with a variety of options such as sequence identifiers, symbol, taxonomy, publication, author, or interaction type. The output is provided in a tab-delimited text file format that is compatible with Cytoscape, an open-source resource for PPI visualization. HPIDB allows the user to search protein sequences using BLASTP to retrieve homologous host/pathogen sequences. For high-throughput analysis, the user can search multiple protein sequences at a time using BLASTP and obtain results in tabular and sequence alignment formats. The taxonomic categorization of proteins (bacterial, viral, fungal, etc.) involved in PPIs enables the user to perform category-specific BLASTP searches. In addition, a new tool is introduced that allows searching for homologous host-pathogen interactions in the HPIDB database.
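
    As a rough illustration of working with tab-delimited interaction output, the Python sketch below reads such a file into a non-redundant list of host-pathogen pairs. The column names (host_protein, pathogen_protein) and the sample rows are hypothetical placeholders, not the actual HPIDB export format, so they would need to be adjusted to the real file header.

```python
import csv
import io


def read_interactions(handle, host_col="host_protein", pathogen_col="pathogen_protein"):
    """Read tab-delimited rows and return sorted, de-duplicated (host, pathogen) pairs.

    The default column names are assumptions made for this sketch.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = set()
    for row in reader:
        pairs.add((row[host_col], row[pathogen_col]))
    return sorted(pairs)


# Made-up sample with one duplicated interaction.
sample = io.StringIO(
    "host_protein\tpathogen_protein\tinteraction_type\n"
    "HOST_A\tPATH_X\tphysical association\n"
    "HOST_B\tPATH_Y\tdirect interaction\n"
    "HOST_A\tPATH_X\tphysical association\n"
)
edges = read_interactions(sample)
```

    The resulting pairs can be written back out as a two-column table, which is a form that network tools such as Cytoscape can import as an edge list.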

    HPIDB is a unified, comprehensive resource for host-pathogen PPIs. The user interface provides new features and tools helpful for studying host-pathogen interactions.

    TAAPP (Tiling Array Data Analysis Pipeline)

    High-density tiling arrays provide a closer view of transcription than regular microarrays and can also be used for annotating functional elements in genomes. The identified transcripts usually have a complex overlapping architecture when compared to the existing genome annotation. Therefore, there is a need for customized tiling array data analysis tools. Since most of the initial tiling array experiments were conducted in eukaryotes, existing data analysis methods are well suited for eukaryotic genomes. To use whole-genome tiling arrays to identify previously unknown transcriptional elements such as small RNA and antisense RNA in prokaryotes, existing data analysis methods need to be tailored to prokaryotic genome architecture. Furthermore, automation of such a custom data analysis workflow is necessary for biologists to apply this powerful platform for knowledge discovery.

    Here we describe TAAPP, a web-based package that consists of two modules for prokaryotic tiling array data analysis. The transcript generation module works on raw data to generate transcriptionally active regions (TARs). The feature extraction and annotation module then maps TARs to existing genome annotation. This module further categorizes the transcription profile into potential novel non-coding RNA, antisense RNA, gene expression and operon structures.
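
    The two-module idea can be sketched in Python as follows. This is an illustrative simplification, not TAAPP's actual algorithm: the function names, the fixed intensity threshold, the max_gap and min_probes parameters, and the three category labels are all assumptions made for the example.

```python
def call_tars(probes, threshold, max_gap=50, min_probes=3):
    """Group consecutive above-threshold probes into (start, end) regions.

    probes: list of (position, intensity) tuples sorted by position.
    A run is closed when a probe falls below the threshold or when the
    distance to the previous probe in the run exceeds max_gap.
    """
    tars = []
    run = []
    for pos, signal in probes:
        if signal >= threshold and (not run or pos - run[-1] <= max_gap):
            run.append(pos)
            continue
        if len(run) >= min_probes:
            tars.append((run[0], run[-1]))
        run = [pos] if signal >= threshold else []
    if len(run) >= min_probes:
        tars.append((run[0], run[-1]))
    return tars


def classify_tar(tar, tar_strand, genes):
    """Label a TAR by its overlap with annotated genes.

    genes: list of (start, end, strand) tuples from the genome annotation.
    """
    start, end = tar
    overlapping = [g for g in genes if g[0] <= end and start <= g[1]]
    if not overlapping:
        return "intergenic"   # candidate novel transcript
    if any(g[2] == tar_strand for g in overlapping):
        return "gene"         # expression of an annotated gene
    return "antisense"        # opposite strand of an annotated gene


# Made-up probe intensities and a single made-up gene annotation.
probes = [(100, 9.0), (125, 8.5), (150, 9.2), (175, 2.0),
          (400, 8.8), (425, 9.1), (450, 8.9), (475, 9.3)]
genes = [(90, 200, "+")]
tars = call_tars(probes, threshold=8.0)
labels = [classify_tar(t, "+", genes) for t in tars]
```

    Operon detection and non-coding RNA prediction would need additional evidence, such as adjacent co-expressed genes or sequence features, which this sketch does not attempt.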

    Hardware resources

    Supercomputer at University of Mississippi, Oxford

    The Cancer Bioinformatics Core's computational infrastructure is an integrated platform designed to minimize time spent on staging data for analysis and maximize efficiency of analysis. In practical terms, this means that datasets visible on networked laptops, PCs and workstations will also be available to thousands of CPU cores on the high-performance computational resources: the Linux cluster and the supercomputer. Currently, the main computing platforms include networked laptops, Dell Precision PCs and Dell workstations with dual quad-core processors, allowing for parallel computing.

    With financial support from the Cancer Institute, plans are underway to purchase a 20-node Linux cluster for high-performance computing and servers for data storage and backups. In addition, we have developed a strategic partnership with the Supercomputing Center at Ole Miss. Computational power is tailored to fit researchers' requirements, and the infrastructure handles large and small projects. The infrastructure is designed to be flexible, with open access for Cancer Institute and UMMC researchers to computational servers outfitted with a broad range of bioinformatics and statistical genomics application development tools.

    Software resources

    Software tools include R, MATLAB, SAS, SPSS, GenePattern, PLINK, Osprey System, Cytoscape, Pathway Studio and many web-based tools. In addition, the Cancer Bioinformatics Core has computing and programming capabilities in Java, Perl, C and C++. Software not currently on the machines can be installed on request on a cost-recovery basis. Commonly used software for sequence analysis, gene expression analysis, and proteomics is available to all researchers. The high-performance computing infrastructure is particularly well suited for application development. A significant number of computational and software-development projects are underway, ranging from software for specialized analysis to enterprise-wide data management and data analysis systems.