Notes installing the SBEAMS - Proteomics module $Id$ ------------------------------------------------------------------------------- 1) Software and module Dependencies You must first install the SBEAMS Core. See the separate installation notes (sbeams.installnotes) on how to accomplish that. You must also install the BioLink module; follow the installation instructions provided with that module first (BioLink.installnotes). Furthermore, you must also install the Gene Ontology module if you want to make use of the Gene Ontology Annotations (BioLink/GeneOntology.installnotes), though this can be done at a later time. The following Perl Modules often not found on a standard UNIX/Linux setup are required to successfully use SBEAMS - Proteomics (in addition to the dependencies for the SBEAMS Core). Math::Interpolate XML::Xerces PDL Perl Data Language - available from CPAN PDL::Graphics::PGPLOT PDL::PGLOT --------------- The following non-Perl software is required: Xerces C++ (a requirement for XML::Xerces Perl module) PGPLOT (required by PDL::PGPLOT) (Note that only the MS/MS spectrum viewer uses these modules. It might be nice to convert the spectrum view to use GD like some of the other plotting functions in SBEAMS. But this won't be a trivial task.) ------------------------------------------------------------------------------- 2) Installation Location SBEAMS is designed to live entirely in the "htdocs" area of your Apache web server. For the remainder of this installation, it will be assumed that your installation is configured as follows; compensate for your specific setup: servername: db DocumentRoot: /local/www/html Primary location: directly located in DocumentRoot, /local/www/html/sbeams --> http://db/sbeams/ Development location: In a dev1 tree starting in the DocumentRoot, /local/www/html/dev1/sbeams -> http://db/dev1/sbeams/ All modules live in the same area and should be unpacked into the main SBEAMS area. Set up the following sym link: cd $SBEAMS/lib/scripts/Proteomics ln -s ../share/load_biosequence_set.pl ------------------------------------------------------------------------------- 3) Create and populate the database It is assumed that you have already created and tested your SBEAMS Core database. You may either create a separate database for the Proteomics database or you can put everything in the same database. Note that some database engines (rare now) may not permit cross-database queries in which case your may NOT use separate databases. If you do use separate databases, you may not be able to enforce referential integrity between tables in the different databases. This may or may not be a significant concern. - If you decide on a separate database, create it and within it, enable users "sbeams", "sbeamsro", "sbeamsadmin" as done for the Core. - Generate the appropriate schema for your type(s) of database as follows: set dbtype="mssql" #### one of mssql mysql pgsql oracle etc. cd $SBEAMS/lib/scripts/Core ./generate_schema.pl \ --table_prop ../../conf/Proteomics/Proteomics_table_property.txt \ --table_col ../../conf/Proteomics/Proteomics_table_column.txt \ --schema_file ../../sql/Proteomics/Proteomics \ --module Proteomics \ --destination_type $dbtype - Verify that the SQL CREATE and DROP statements have been correctly generated in $SBEAMS/lib/sql/Proteomics/ cd $SBEAMS/lib/sql/Proteomics more Proteomics_CREATETABLES.mssql Several tables defined in the Proteomics module are also defined in the BioLink module. If you wish to use separate databases for these two modules, you should run the sql as is. If you are installing into a single instance, you may either remove the CREATE TABLE and any associated CREATE CONSTRAINT stmts from the SQL files, or you can simply run the CREATETABLES and CREATECONSTRAINTS scripts with the -i (--ignore_errors) flag, as shown below. - Execute the statements to create and populate the database with some bare bones data and indexes (for faster loading and querying): SQL Server Example: To CREATE and POPULATE: $SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_CREATETABLES.mssql -i -delim GO $SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_POPULATE.mssql $SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_CREATECONSTRAINTS.mssql -i -delim GO (you might notice 2 "Column x is not the same data type as referencing column y" errors while loading this file; you can safely ignore them.) $SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_ADD_MANUAL_CONSTRAINTS.mssql -delim GO To CREATE indexes: $SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_CREATEINDEXES.mssql -delim GO To DROP: #### FIXME: DON'T DO THIS BECAUSE THIS WILL AFFECT BIOLINK AS WELL!!! #$SBEAMS/lib/scripts/Core/runsql.pl Proteomics_DROP_MANUAL_CONSTRAINTS.mssql -delim GO #$SBEAMS/lib/scripts/Core/runsql.pl Proteomics_DROPCONSTRAINTS.mssql -delim GO #$SBEAMS/lib/scripts/Core/runsql.pl Proteomics_DROPTABLES.mssql -delim GO Note that the Proteomics_POPULATE.mssql is not auto-generated and should probably work for all flavors of database Examples for table creation for other database flavors can be found in the Core installation notes and will not be repeated here. ------------------------------------------------------------------------------- 4) Edit the SBEAMS Configuration files cd $SBEAMS/lib/conf edit SBEAMS.conf Specifically: DBPREFIX{Proteomics} = proteomics.dbo. RAW_DATA_DIR{Proteomics} = /raw/datasets/root/location Set DBPREFIX{Proteomics} to the database name and schema/owner to prefix to table names. RAW_DATA_DIR{Proteomics} should be set to the location where SEQUEST and other data processing will take place. Typical organization might be like: /data3/sbeams/archive/$PROJECT_TAG/$EXPERIMENT_TAG/$SEARCH_BATCH_TAG/ for which: RAW_DATA_DIR{Proteomics} = /data3/sbeams/archive ------------------------------------------------------------------------------- 5) Populate the driver tables and register the module cd $SBEAMS/lib/scripts/Core set CONFDIR = "../../conf" ./update_driver_tables.pl $CONFDIR/Proteomics/Proteomics_table_property.txt ./update_driver_tables.pl $CONFDIR/Proteomics/Proteomics_table_column.txt ./update_driver_tables.pl $CONFDIR/Proteomics/Proteomics_table_column_manual.txt If this doesn't work. Do not proceed, debug first. NOTE: You will potentially need to re-run this step every time either of these files is updated (Proteomics_table_property.txt and _column.txt) -Register the Proteomics module: $SBEAMS/lib/scripts/Core/addModule.pl Proteomics ------------------------------------------------------------------------------- 6) Add reference data cd $SBEAMS/lib/refdata/Proteomics - Add required Proteomics work_groups and TGS entries. Section below outlines the procedure for doing this manually. The records indicated will be loaded via the DataImport, and shouldn't be repeated. The instructions were left in place for reference purposes. ../../scripts/Core/DataImport.pl -s Proteomics_work_groups.xml - Add default experiment types ../../scripts/Core/DataImport.pl -s experiment_type.xml - Add default instrument types ../../scripts/Core/DataImport.pl -s instrument_type.xml -- Instructions for manually adding work groups. These were inserted via the DataImport statement above, so while you should read this section for informational purposes, you needn't add any of the specified groups. Log in via the web interface as a user with Administrator privileges, switch to the Admin group using the pull-down menu at the top, and add two work groups: [SBEAMS Home] [Admin] [Manage Work Groups] [Add Work Group] Add entries, note that this was already done above. Proteomics_user Proteomics_admin Proteomics_readonly (Note that after INSERTing the first, you can click [Back], edit the previous information slightly, and click [INSERT] to add another.) Proteomics_admin has privilege over all tables in the Proteomics module, while the Proteomics_user only has access to certain tables and may often not modify other users records. The Proteomics_readonly group is a separate group for users who are allowed to view but not add/modify/delete data. Now set up the table group securities: rowprivate - Proteomics_user - data_writer rowprivate - Proteomics_admin - data_writer rowprivate - Proteomics_readonly - data_writer project - Proteomics_user - data_writer project - Proteomics_admin - data_modifier common - Proteomics_user - data_writer common - Proteomics_admin - data_modifier Proteomics_infrastructure - Proteomics_admin - data_modifier Proteomics_user - Proteomics_user - data_writer Proteomics_user - Proteomics_admin - data_modifier Now that the Proteomics driver tables are loaded, and the groups have been established, you should be able to go to the web site again and click on SBEAMS - Proteomics and explore the tables. They're all going to be empty, but you shouldn't get any errors, just empty resultsets. If this doesn't work. Do not proceed, debug first. ------------------------------------------------------------------------------- 7) Add some sample data In the previous section we required Proteomics work groups, you will now have to add yourself (and pertinent others) to the work groups. Via the web UI, go to: [SBEAMS Home] [Admin] [Manage User Group Associations], [Add ...], and add yourself and whoever else to these groups as appropriate. (logged in as your regular user account) Add a Project as follows: - Switch to the Proteomics_user group by using the drop-down box at top - Click on [Proteomics] module or [Proteomics Home] - Click "My Projects" tab - Under "Projects You Own", click [Add A New Project]. Required fields are in red. If you don't have a budget number, enter NA. - Fill in the appropriate information and [INSERT] Register an Experiment as follows: - Click on [Proteomics Home] - Click on "My Projects" tab - Click on the Project you just created - Click [Register another Experiment] - Fill in the appropriate information - If no appropriate Experiment Type exists, click on the green + and add it. After [INSERT]ing the Experiment Type, close that window and go back to the Experiment window and click on the [REFRESH] button at the bottom of the form, and then select the new Experiment Type - If no appropriate Instrument exists, click on the green + and add it. After [INSERT]ing the Instrument, close that window and go back to the Experiment window and click on the [REFRESH] button at the bottom of the form, and then select the new Instrument - If no appropriate Instrument Type exists, click on the green + and add it. After [INSERT]ing the Instrument Type, close that window and go back to the Instrument window and click on the [REFRESH] button at the bottom of the form, and then select the new Instrument Type - See section 9 if you wish to enter a Gradient Program (optional) Register a Biosequence Set as follows: A biosequence_set is a set of proteins, genes, orfs, etc., sometimes called a "sequence database". Relevant for the Proteomics module are the "sequence databases" you run SEQUEST against. Click [Core Management: BiosequenceSets] [Add Biosequence Set] Fill in the appropriate information. Make sure that the set_path points to the location of the FASTA file that is SEQUEST searched against. Do not use the upload functionality. Currently, the system will work much better if the set_path matches the entry for the FASTA file in the SEQUEST .out files. If they do not match, some manual overriding must take place. (Hint: look at the sequest.params file to find out what Biosequence Set was used.) [INSERT] that record. ------------------------------------------------------------------------------- 8) Test the Proteomics command line functionality - Load a biosequence set: cd $SBEAMS/lib/scripts/Proteomics ./load_biosequence_set.pl --check ./load_proteomics_experiment.pl --list These two programs should list the entries for the BioSequence Set and Experiment you have loaded above. Now try loading a biosequence set: ./load_biosequence_set.pl --load --set_tag YeastORF This should load the biosequence set called YeastORF. Replace YeastORF with the tag of your Biosequence Set as defined in step 7. If this doesn't work. Do not proceed, debug first. 8b) Load possible (enzymatic) peptide list [optional; can be done later] Procedure: generate a formatted list of tryptic peptides from the Biosequence Set you used above by using digestdb, then load to database. - Compile the digestdb program (if not already done so): cd $SBEAMS/lib/c/Proteomics/digestdb make digestdb (If this doesn't work. Do not proceed, debug first.) - Generate file to load: cd $SBEAMS/lib/scripts/Proteomics $SBEAMS/lib/c/Proteomics/digestdb/digestdb $SET_PATH > peptides.out where $SET_PATH is the location of FASTA input file. - Now load the file: ./load_possible_peptides.pl --set_tag YeastORF --source_file peptides.out \ --halt_at SWN:K1CL_HUMAN Replace YeastORF with the tag of your Biosequence Set as defined in step 7; replace SWN:K1CL_HUMAN with the first line in the file you want to stop loadng (e.g. where contaminants begin). - You can then check the status of the loaded data: ./load_possible_peptides.pl --set_tag YeastORF --check_status ------------------------------------------------------------------------------- 9) Add a Gradient Program If you didn't already in step 7 as part of adding an experiment add a gradient program, it is recommended that you do so now. It is not necessary. Click [Core Management: Gradient Program] [Add Gradient Program] Fill in the appropriate information for Name and Description. Under Program Data, enter rows with four space-separated columns representing the following information: Time(minutes), %ACN in buffer A, %ACN in buffer B, and flow rate (mL/min). The last value can be set to zero if not known. ------------ Example: This is an example for a "LC Gradient 5-65% in 0-165 min" gradient: Time Buf A % Buf B % Flow mL/min ---- ------- ------- ----------- 0 95 5 0.01 165 35 65 0.01 166 20 80 0.01 170 20 80 0.01 171 95 5 0.01 196 95 5 0.01 [Note that the header can be left on the form, and any number of spaces can delimit the fields.] ------------ [INSERT] the record. You can now go back and update any experiments that used this gradient; simply select the appropriate gradient from the drop-down list. ------------------------------------------------------------------------------- 10) Load the data products for a sample experiment. The next step is to load the SEQUEST output and data products of your data processing. The recommended way of organizing searches is as follows: RAW_DATA_DIR{Proteomics}/$PROJECT_TAG/$EXPERIMENT_TAG/$SEARCH_BATCH_TAG/ for which: RAW_DATA_DIR{Proteomics} = Some absolute location as defined in step 4 $PROJECT_TAG = Tag (i.e. short name) defined for the project in step 7 $EXPERIMENT_TAG = Tag (i.e. short name) for the experiment defined in step 7 $SEARCH_BATCH_TAG = Unique tag for a search_batch (i.e. a running of SEQUEST) It is possible to run SEQUEST multiple times on the same experiment, against different Biosequence Sets or against the same one, perhaps with different parameters. The recommendation is to name the $SEARCH_BATCH_TAG after the Biosequence Set tag with additional qualifiers if multiples searches against the same set exist or are expected. Eventually, there will be an automated system which will organize the data so that this does not need to happen manually (search of sbeamsbot) but it is not yet finished. The following files should be placed in the directories: RAW_DATA_DIR{Proteomics}/$PROJECT_TAG/$EXPERIMENT_TAG/ *.dat *.nfo *.png *.mzXML RAW_DATA_DIR{Proteomics}/$PROJECT_TAG/$EXPERIMENT_TAG/$SEARCH_BATCH_TAG/ sequest.params *.html interact* RAW_DATA_DIR{Proteomics}/$PROJECT_TAG/$EXPERIMENT_TAG/$SEARCH_BATCH_TAG/$FRAC/ *.dta *.out If the files are thus organized, you can trigger the load with: cd $SBEAMS/lib/scripts/Proteomics ./load_proteomics_experiment.pl --list ./load_proteomics_experiment.pl \ --load \ --experiment_tag $EXPERIMENT_TAG \ --search_subdir=$SEARCH_BATCH_TAG Now you can load extra data (update command): ./load_proteomics_experiment.pl \ --experiment_tag $EXPERIMENT_TAG \ --search_subdir=$SEARCH_BATCH_TAG \ --update_from_summary_files \ --update_search \ --update_probabilities If this experiment has a gradient program associated with it, you will want to update the elution information by also using the --update_timing_info flag. -You may also want to take a look at the $SBEAMS/lib/scripts/Proteomics/ load_proteomics_experiment.start and load_proteomics_experiment.csh scripts to accomplish the unpacking, loading, and updating of an experiment. To load Protein Prophet xml output: ./load_ProteinProphet.pl --search_batch_id 1 -Now log into SBEAMS and explore the data for your experiment. Refer to the SBEAMS tutorial for an intro to using SBEAMS. ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- Troubleshooting 1) How to re-generate and update table schemas - Drop table and (firstly) its constraints, via SQL commands. Look under DOMAIN_DROPCONSTRAINTS.mssql and DOMAIN_DROPTABLES.mssql. e.g. look under Core_DROPCONSTRAINTS.mssql and Core_DROPTABLES.mssql - Edit the appropriate $DOMAIN_table_column.txt file e.g. make a field in conf/Core/Core_table_property.txt nullable=N - Generate new schema files using generate_schema.pl e.g. cd $SBEAMS/lib/scripts/Core ./generate_schema.pl \ --table_prop ../../conf/Core/Core_table_property.txt \ --table_col ../../conf/Core/Core_table_column.txt \ --schema_file ../../sql/Core/Core \ --destination_type mssql - Now re-create table and its constraints using the new (updated) dll. e.g. look under Core_CREATETABLES.mssql and Core_CREATECONSTRAINTS.mssql - Populate the table, if required. e.g. look in Core_POPULATE.mssql ------------------------------------------------------------------------------- 2) How to delete a search batch / experiment -Warning: deleting considerably slows down the database! ./load_proteomics_experiment.pl \ --delete_search_batch \ --experiment_tag $EXPTAG \ --search_subdir=$SUBDIR