SBEAMS - Proteomics


SBEAMS - Proteomics Overview

The current SEQUEST analysis pipeline for Proteomics data is entirely file based. Raw data files are generated by the instrument. Individual mass spectra are extracted from the data files and searched with SEQUEST, which in turn creates more output files. These files are summarized in two steps to create a single HTML result. The investigator then iteratively culls this final file to filter out peptides that are either misidentified or uninteresting. Because the process is simple, it scales well with respect to disk space and requires little CPU power (except for the SEQUEST step, of course). A filesystem imposes little order and can thus be both very flexible and very messy. However, it becomes very difficult to keep track of many experiments this way, to integrate results from multiple experiments, and to capitalize on the analysis performed in previous experiments while analyzing a new one. To address these needs, we are developing software that uses a relational database management system (RDBMS) to manage the organization, storage, and exploration of the Proteomics data produced at ISB.

SBEAMS-Proteomics is part of the SBEAMS (Systems Biology Experiment Analysis Management System) Project, a framework for collecting, storing, and accessing data produced by a variety of different experiments; these experiments can be managed separately and then correlated later under the same framework. This integrated system combines four components: a unified relational database management system (RDBMS) back end; a collection of tools to store, manage, and query experiment information and results in the RDBMS; a web front end for querying the database and providing integrated access to remote data sources; and an interface to existing programs for clustering and other analysis. Since all data from each step of the experiment are warehoused in a modular schema in the RDBMS, quality control and data analysis tasks are greatly simplified.

With SBEAMS-Proteomics, users first enter their project, experiment, and sample information into the database. The data products of the file-based analysis pipeline are then ingested into the database, and subsequent exploration of the data, annotation of individual pieces, and correlation with other experiments can all be managed within the same system. The interface allows a much more flexible analysis of the data: individual genes, proteins, and peptides may be examined closely across multiple experiments; information about the quality of identification can be stored with the data; and peptides that could not be properly identified from high-quality spectra can be flagged and followed up with additional searches. The SBEAMS framework makes it easy to add a quick interface to new queries for exploring the data. The knowledge stored while analyzing experiments remains easily accessible at later times and to other investigators, so that previous work can be fully capitalized on.
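The ingest-then-explore workflow described above can be illustrated with a minimal sketch. It is not the SBEAMS-Proteomics schema or code: the table names (experiment, search_hit), the column names, the helper functions, and the sample peptides and scores are all hypothetical, and SQLite stands in for the RDBMS purely so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE experiment (
            experiment_id  INTEGER PRIMARY KEY,
            experiment_tag TEXT NOT NULL
        );
        CREATE TABLE search_hit (
            search_hit_id  INTEGER PRIMARY KEY,
            experiment_id  INTEGER NOT NULL REFERENCES experiment(experiment_id),
            peptide        TEXT NOT NULL,
            protein        TEXT NOT NULL,
            score          REAL NOT NULL
        );
        """
    )
    return conn


def ingest(conn, experiment_tag, hits):
    """Store one experiment and its (peptide, protein, score) search hits."""
    cur = conn.execute(
        "INSERT INTO experiment (experiment_tag) VALUES (?)", (experiment_tag,)
    )
    experiment_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO search_hit (experiment_id, peptide, protein, score) "
        "VALUES (?, ?, ?, ?)",
        [(experiment_id, pep, prot, score) for pep, prot, score in hits],
    )
    conn.commit()
    return experiment_id


def peptides_in_all_experiments(conn):
    """Return the peptides that were observed in every stored experiment."""
    rows = conn.execute(
        """
        SELECT peptide
        FROM search_hit
        GROUP BY peptide
        HAVING COUNT(DISTINCT experiment_id) = (SELECT COUNT(*) FROM experiment)
        ORDER BY peptide
        """
    ).fetchall()
    return [row[0] for row in rows]


# Made-up data: two experiments sharing one peptide.
conn = build_demo_db()
ingest(conn, "exp_A", [("PEPTIDEK", "prot1", 3.1), ("SAMPLER", "prot2", 2.4)])
ingest(conn, "exp_B", [("PEPTIDEK", "prot1", 2.9), ("ANOTHERK", "prot3", 1.8)])
print(peptides_in_all_experiments(conn))
```

Once results from several experiments sit in the same tables, a cross-experiment question such as "which peptides recur everywhere?" becomes a single query, whereas the file-based approach would require parsing and merging each experiment's output files by hand.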

There are disadvantages to such a system, and nearly all stem from running the RDBMS. An RDBMS capable of easily ingesting, and returning quick query results on, many experiments of several hundred megabytes each requires significant hardware. A machine with 4-8 CPUs, 4-8 GB of RAM, and 500-1000 GB of database storage is really needed to keep up with several experiments produced per week and multiple users accessing the data. For just a few users and experiments, today's commodity dual-CPU hardware would suffice. Managing a large database requires more administrative talent and time than managing a large filesystem. Finally, software that interacts with an RDBMS adds a layer of complexity over programs that just manipulate flat text files. Nonetheless, these disadvantages are certainly outweighed by the new capabilities afforded by SBEAMS-Proteomics.

Screenshots and diagrams:

To the right is a screenshot of a session in the SBEAMS web interface, which is compatible with all major browsers. The upper left window shows the main welcome screen of SBEAMS, inviting the authenticated user to choose one of the modules. Currently the main modules are Microarray, Proteomics, and Inkjet. Additional smaller ISB projects that use the SBEAMS interface to access their databases are also listed.

Below that, to the right, is a window in which the user has chosen to issue a SQL query (the query parameter entry fields are scrolled out of view) that summarizes the peptides that have been annotated for the "ARP*" genes in two Drosophila Proteomics experiments (click here to execute this query on the database - SBEAMS login required). Many peptides have been observed and annotated just once, while several have been annotated many times. Various hyperlinks give the user access to more information about the genes, proteins, and peptides actually observed and annotated.
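The query behind that screenshot is not reproduced on this page. As a rough illustration only, a summary of this kind could be expressed along the following lines. The peptide_annotation table, its columns, the summarize_annotated_peptides helper, and the gene names and peptides are all hypothetical, SQLite is used only to keep the sketch self-contained, and the "*" wildcard is assumed to map onto the SQL LIKE wildcard "%".

```python
import sqlite3


def summarize_annotated_peptides(conn, gene_pattern, experiment_tags):
    """Count annotations per (gene, peptide) for genes matching a wildcard
    pattern, restricted to the given experiments."""
    # Assumption: a user-facing "*" wildcard corresponds to "%" in SQL LIKE.
    like_pattern = gene_pattern.replace("*", "%")
    # One "?" placeholder per experiment tag keeps the query parameterized.
    placeholders = ", ".join("?" for _ in experiment_tags)
    sql = f"""
        SELECT gene_name, peptide, COUNT(*) AS n_annotations
        FROM peptide_annotation
        WHERE gene_name LIKE ?
          AND experiment_tag IN ({placeholders})
        GROUP BY gene_name, peptide
        ORDER BY n_annotations DESC, gene_name, peptide
    """
    return conn.execute(sql, [like_pattern, *experiment_tags]).fetchall()


# Made-up annotation records (hypothetical table, genes, and peptides).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE peptide_annotation "
    "(experiment_tag TEXT, gene_name TEXT, peptide TEXT)"
)
conn.executemany(
    "INSERT INTO peptide_annotation VALUES (?, ?, ?)",
    [
        ("exp_A", "ARP1", "PEPTIDEK"),
        ("exp_B", "ARP1", "PEPTIDEK"),
        ("exp_B", "ARP1", "PEPTIDEK"),
        ("exp_A", "ARP2", "SAMPLER"),
        ("exp_A", "ACT5", "OTHERK"),    # gene does not match "ARP*"
        ("exp_C", "ARP1", "PEPTIDEK"),  # experiment not selected
    ],
)

summary = summarize_annotated_peptides(conn, "ARP*", ["exp_A", "exp_B"])
for gene_name, peptide, n_annotations in summary:
    print(gene_name, peptide, n_annotations)
```

Sorting by annotation count puts the repeatedly annotated peptides first, which mirrors the observation that many peptides are annotated once and a few many times.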

At the bottom is additional information about all the annotated occurrences of one of the peptides as identified by SEQUEST. The table includes information about which of the two experiments the peptides were observed in, the masses, the actual peptide sequences (some of which contain tagged cysteines), pI values, ICAT quantitation ratios, annotation information (clipped off at the right edge), and much more. Search results can be annotated to record additional insights.

Here is the current in-development relational schema for the SBEAMS-Proteomics database: