Notes on installing the SBEAMS - Proteomics module

$Id$


-------------------------------------------------------------------------------
1) Software and module Dependencies

You must first install the SBEAMS Core.  See the separate installation
notes (sbeams.installnotes) on how to accomplish that. You must also
install the BioLink module; follow the installation instructions
provided with that module first (BioLink.installnotes). Furthermore,
you must also install the Gene Ontology module if you want to make 
use of the Gene Ontology Annotations (BioLink/GeneOntology.installnotes),
though this can be done at a later time.

The following Perl modules, which are often not found on a standard
UNIX/Linux setup, are required to use SBEAMS - Proteomics (in addition
to the dependencies for the SBEAMS Core):

Math::Interpolate
XML::Xerces
PDL                  Perl Data Language - available from CPAN
PDL::Graphics::PGPLOT
PDL::PGPLOT

---------------
The following non-Perl software is required:

Xerces C++ (a requirement for XML::Xerces Perl module)
PGPLOT              (required by PDL::PGPLOT)

(Note that only the MS/MS spectrum viewer uses these modules.  It might
be nice to convert the spectrum view to use GD like some of the other
plotting functions in SBEAMS.  But this won't be a trivial task.)
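
A quick way to see which of the Perl modules listed above are already
installed is to ask perl to load each one.  This is only a sketch in POSIX
sh; check_perl_module is a hypothetical helper name, not part of SBEAMS, and
the module list can be extended with the remaining entries from the list.

```shell
# Hypothetical helper (not part of SBEAMS): succeeds if perl can load the module.
check_perl_module() {
    # "perl -MName -e 1" exits non-zero when the module cannot be loaded.
    perl -M"$1" -e 1 >/dev/null 2>&1
}

for module in Math::Interpolate XML::Xerces PDL PDL::Graphics::PGPLOT; do
    if check_perl_module "$module"; then
        echo "ok       $module"
    else
        echo "MISSING  $module"
    fi
done
```

Any module reported as MISSING would need to be installed before the MS/MS
spectrum viewer can be expected to work.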


-------------------------------------------------------------------------------
2) Installation Location

SBEAMS is designed to live entirely in the "htdocs" area of your Apache
web server.  For the remainder of this installation, it will be assumed
that your installation is configured as follows; compensate for your
specific setup:
  servername: db
  DocumentRoot: /local/www/html
  Primary location: directly located in DocumentRoot,
                   /local/www/html/sbeams  --> http://db/sbeams/
  Development location: In a dev1 tree starting in the DocumentRoot,
                   /local/www/html/dev1/sbeams  --> http://db/dev1/sbeams/

All modules live in the same area and should be unpacked into the
main SBEAMS area.

Set up the following symbolic link:

cd $SBEAMS/lib/scripts/Proteomics
ln -s ../share/load_biosequence_set.pl


-------------------------------------------------------------------------------
3) Create and populate the database

It is assumed that you have already created and tested your SBEAMS Core
database.  You may either create a separate database for the Proteomics
database or you can put everything in the same database.

Note that some database engines (rare now) may not permit
cross-database queries, in which case you may NOT use separate
databases.  If you do use separate databases, you may not be able to
enforce referential integrity between tables in the different
databases.  This may or may not be a significant concern.

- If you decide on a separate database, create it and within it,
  enable users "sbeams", "sbeamsro", "sbeamsadmin" as done for the Core.

- Generate the appropriate schema for your type(s) of database as follows:

set dbtype="mssql"   #### one of mssql mysql pgsql oracle etc.
cd $SBEAMS/lib/scripts/Core

./generate_schema.pl \
 --table_prop ../../conf/Proteomics/Proteomics_table_property.txt \
 --table_col ../../conf/Proteomics/Proteomics_table_column.txt \
 --schema_file ../../sql/Proteomics/Proteomics \
 --module Proteomics \
 --destination_type $dbtype
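
The commands above use csh syntax; in a Bourne-style shell,
"set dbtype=..." does not assign the variable.  The following is a minimal
POSIX sh sketch that prints the same generate_schema.pl command once per
database type so it can be reviewed first.  DBTYPES and schema_command are
illustrative names only; the flags are copied from the command above.

```shell
# Space-separated list of database types, e.g. "mssql mysql" (illustrative name).
DBTYPES=${DBTYPES:-mssql}

# Print the generate_schema.pl command line for one database type.
schema_command() {
    printf '%s\n' "./generate_schema.pl --table_prop ../../conf/Proteomics/Proteomics_table_property.txt --table_col ../../conf/Proteomics/Proteomics_table_column.txt --schema_file ../../sql/Proteomics/Proteomics --module Proteomics --destination_type $1"
}

# $DBTYPES is intentionally unquoted so it splits into one word per type.
for dbtype in $DBTYPES; do
    schema_command "$dbtype"
done
```

To execute the printed commands, run them from $SBEAMS/lib/scripts/Core.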


- Verify that the SQL CREATE and DROP statements have been correctly
  generated in $SBEAMS/lib/sql/Proteomics/

cd $SBEAMS/lib/sql/Proteomics
more Proteomics_CREATETABLES.mssql

Several tables defined in the Proteomics module are also defined in the BioLink
module.  If you wish to use separate databases for these two modules, you
should run the SQL as is.  If you are installing into a single instance, you
may either remove the duplicate CREATE TABLE and any associated CREATE
CONSTRAINT statements from the SQL files, or you can simply run the
CREATETABLES and CREATECONSTRAINTS scripts with the -i (--ignore_errors) flag,
as shown below.
                                                                                
- Execute the statements to create and populate the database with some
  bare bones data and indexes (for faster loading and querying):

SQL Server Example:

To CREATE and POPULATE:
$SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_CREATETABLES.mssql -i -delim GO
$SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_POPULATE.mssql
$SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_CREATECONSTRAINTS.mssql -i -delim GO
  (you might notice 2 "Column x is not the same data type as referencing column
   y" errors while loading this file; you can safely ignore them.)
$SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_ADD_MANUAL_CONSTRAINTS.mssql -delim GO

To CREATE indexes:
$SBEAMS/lib/scripts/Core/runsql.pl -u sbeamsadmin -s Proteomics_CREATEINDEXES.mssql -delim GO

To DROP:
#### FIXME: DON'T DO THIS BECAUSE THIS WILL AFFECT BIOLINK AS WELL!!!
#$SBEAMS/lib/scripts/Core/runsql.pl Proteomics_DROP_MANUAL_CONSTRAINTS.mssql -delim GO
#$SBEAMS/lib/scripts/Core/runsql.pl Proteomics_DROPCONSTRAINTS.mssql -delim GO
#$SBEAMS/lib/scripts/Core/runsql.pl Proteomics_DROPTABLES.mssql -delim GO


Note that Proteomics_POPULATE.mssql is not auto-generated and should
probably work for all flavors of database.

Examples for table creation for other database flavors can be found in the
Core installation notes and will not be repeated here.
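
The SQL Server CREATE, POPULATE and index steps above must run in the order
shown.  The sketch below, in POSIX sh, chains them so that a failing step
stops the sequence.  It assumes runsql.pl returns a non-zero exit status on
failure, which these notes do not confirm; DRY_RUN, RUNSQL and run_step are
illustrative names, and the default SBEAMS path is the example location from
section 2.

```shell
# Example install location from section 2; override by setting SBEAMS.
RUNSQL="${SBEAMS:-/local/www/html/sbeams}/lib/scripts/Core/runsql.pl"

# Print the command by default; set DRY_RUN=0 to execute it instead.
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$RUNSQL -u sbeamsadmin -s $*"
    else
        # Expected to be run from $SBEAMS/lib/sql/Proteomics.
        "$RUNSQL" -u sbeamsadmin -s "$@"
    fi
}

# Same files and flags as in the SQL Server example, in the same order.
run_step Proteomics_CREATETABLES.mssql -i -delim GO &&
run_step Proteomics_POPULATE.mssql &&
run_step Proteomics_CREATECONSTRAINTS.mssql -i -delim GO &&
run_step Proteomics_ADD_MANUAL_CONSTRAINTS.mssql -delim GO &&
run_step Proteomics_CREATEINDEXES.mssql -delim GO
```

The DROP scripts are deliberately left out, for the reason given in the FIXME
above.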


-------------------------------------------------------------------------------
4) Edit the SBEAMS Configuration files

cd $SBEAMS/lib/conf
edit SBEAMS.conf

Specifically:

DBPREFIX{Proteomics}    = proteomics.dbo.
RAW_DATA_DIR{Proteomics} = /raw/datasets/root/location

Set DBPREFIX{Proteomics} to the database name and schema/owner to prefix
to table names.  RAW_DATA_DIR{Proteomics} should be set to the location
where SEQUEST and other data processing will take place.  Typical
organization might be like:

/data3/sbeams/archive/$PROJECT_TAG/$EXPERIMENT_TAG/$SEARCH_BATCH_TAG/

for which:
RAW_DATA_DIR{Proteomics} = /data3/sbeams/archive


-------------------------------------------------------------------------------
5) Populate the driver tables and register the module

cd $SBEAMS/lib/scripts/Core
set CONFDIR = "../../conf"
./update_driver_tables.pl $CONFDIR/Proteomics/Proteomics_table_property.txt
./update_driver_tables.pl $CONFDIR/Proteomics/Proteomics_table_column.txt
./update_driver_tables.pl $CONFDIR/Proteomics/Proteomics_table_column_manual.txt

If this doesn't work, do not proceed; debug first.

NOTE: You may need to re-run this step whenever one of these files is
updated (Proteomics_table_property.txt and Proteomics_table_column.txt).
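
For a Bourne-style shell, the three update_driver_tables.pl calls above could
be written as a loop that stops at the first failure, in keeping with the
"debug first" advice.  This is a sketch only: it assumes the script returns a
non-zero exit status on failure, and DRY_RUN and the function name are
illustrative.

```shell
CONFDIR=${CONFDIR:-../../conf}

# Print the commands by default; set DRY_RUN=0 to run them from
# $SBEAMS/lib/scripts/Core.
update_driver_tables() {
    for name in table_property table_column table_column_manual; do
        file="$CONFDIR/Proteomics/Proteomics_$name.txt"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "./update_driver_tables.pl $file"
        else
            ./update_driver_tables.pl "$file" || {
                echo "failed on $file" >&2
                return 1
            }
        fi
    done
}

update_driver_tables
```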


-Register the Proteomics module:

$SBEAMS/lib/scripts/Core/addModule.pl Proteomics


-------------------------------------------------------------------------------
6) Add reference data

cd $SBEAMS/lib/refdata/Proteomics

- Add required Proteomics work_groups and TGS entries.  The records are loaded
by the DataImport command below and should not be entered again by hand.  The
manual procedure is described later in this section for reference only.

../../scripts/Core/DataImport.pl -s Proteomics_work_groups.xml

- Add default experiment types

../../scripts/Core/DataImport.pl -s experiment_type.xml

- Add default instrument types

../../scripts/Core/DataImport.pl -s instrument_type.xml

-- Instructions for manually adding work groups.  These were inserted via the 
DataImport statement above, so while you should read this section for
informational purposes, you needn't add any of the specified groups.

Log in via the web interface as a user with Administrator privileges,
switch to the Admin group using the pull-down menu at the top, and add
three work groups:
[SBEAMS Home] [Admin] [Manage Work Groups] [Add Work Group]
Add the following entries (again, this was already done by the import above):
  Proteomics_user
  Proteomics_admin
  Proteomics_readonly
(Note that after INSERTing the first, you can click [Back], edit the previous
information slightly, and click [INSERT] to add another.)

Proteomics_admin has privilege over all tables in the Proteomics
module, while the Proteomics_user only has access to certain tables
and may often not modify other users' records.  The Proteomics_readonly
group is a separate group for users who are allowed to view but not 
add/modify/delete data.

Now set up the table group securities:
  rowprivate - Proteomics_user - data_writer
  rowprivate - Proteomics_admin - data_writer
  rowprivate - Proteomics_readonly - data_writer
  project - Proteomics_user - data_writer
  project - Proteomics_admin - data_modifier
  common - Proteomics_user - data_writer
  common - Proteomics_admin - data_modifier
  Proteomics_infrastructure - Proteomics_admin - data_modifier
  Proteomics_user - Proteomics_user - data_writer
  Proteomics_user - Proteomics_admin - data_modifier

Now that the Proteomics driver tables are loaded, and the groups have been
established, you should be able to go to the web site again and click on
SBEAMS - Proteomics and explore the tables.  They're all going to be empty,
but you shouldn't get any errors, just empty resultsets.

If this doesn't work, do not proceed; debug first.


-------------------------------------------------------------------------------
7) Add some sample data

In the previous section we created the required Proteomics work groups; you
will now have to add yourself (and pertinent others) to them.  Via the web UI,
go to:

[SBEAMS Home] [Admin] [Manage User Group Associations] [Add ...] and add
yourself and anyone else to these groups as appropriate.


(logged in as your regular user account)

Add a Project as follows:

- Switch to the Proteomics_user group by using the drop-down box at top
- Click on [Proteomics] module or [Proteomics Home]
- Click "My Projects" tab
- Under "Projects You Own", click [Add A New Project].
  Required fields are in red.  If you don't have a budget number, enter NA.
- Fill in the appropriate information and [INSERT]

Register an Experiment as follows:
- Click on [Proteomics Home]
- Click on "My Projects" tab
- Click on the Project you just created
- Click [Register another Experiment]
- Fill in the appropriate information

- If no appropriate Experiment Type exists, click on the green + and add it.
  After [INSERT]ing the Experiment Type, close that window and go back to
  the Experiment window and click on the [REFRESH] button at the bottom
  of the form, and then select the new Experiment Type

- If no appropriate Instrument exists, click on the green + and add it.
  After [INSERT]ing the Instrument, close that window and go back to
  the Experiment window and click on the [REFRESH] button at the bottom
  of the form, and then select the new Instrument

- If no appropriate Instrument Type exists, click on the green + and add it.
  After [INSERT]ing the Instrument Type, close that window and go back to
  the Instrument window and click on the [REFRESH] button at the bottom
  of the form, and then select the new Instrument Type

- See section 9 if you wish to enter a Gradient Program (optional)

Register a Biosequence Set as follows:

A biosequence_set is a set of proteins, genes, orfs, etc., sometimes called
a "sequence database".  Relevant for the Proteomics module are the
"sequence databases" you run SEQUEST against.

Click [Core Management: BiosequenceSets] [Add Biosequence Set]
Fill in the appropriate information.  Make sure that the set_path points
to the location of the FASTA file that SEQUEST searches against.
Do not use the upload functionality.  Currently, the system will work
much better if the set_path matches the entry for the FASTA file in the
SEQUEST .out files.  If they do not match, some manual overriding must
take place.
(Hint: look at the sequest.params file to find out what Biosequence Set
was used.)
[INSERT] that record.
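
To follow the hint above, the FASTA path can be pulled out of sequest.params
with a one-line filter.  This sketch assumes the parameter is named
"database_name" or "first_database_name" and that ";" starts a comment; check
your own sequest.params, since the key may differ between SEQUEST versions.
sequest_database is a hypothetical helper name.

```shell
# Hypothetical helper: print the FASTA path recorded in a sequest.params file.
# Limitation: a path containing spaces is cut at the first space.
sequest_database() {
    sed -n 's/^[[:space:]]*\(first_\)\{0,1\}database_name[[:space:]]*=[[:space:]]*\([^[:space:];]*\).*/\2/p' "$1"
}

# Report the database for the current directory if a params file is present.
if [ -f sequest.params ]; then
    sequest_database sequest.params
fi
```

The printed path is the value to compare against the set_path of the
Biosequence Set.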


-------------------------------------------------------------------------------
8) Test the Proteomics command line functionality


- Load a biosequence set:

cd $SBEAMS/lib/scripts/Proteomics

./load_biosequence_set.pl --check

./load_proteomics_experiment.pl --list

These two programs should list the entries for the BioSequence Set and
Experiment you have loaded above.

Now try loading a biosequence set:

./load_biosequence_set.pl --load --set_tag YeastORF

This should load the biosequence set called YeastORF.  Replace YeastORF with
the tag of your Biosequence Set as defined in step 7.

If this doesn't work, do not proceed; debug first.


8b) Load possible (enzymatic) peptide list [optional; can be done later]

Procedure: generate a formatted list of tryptic peptides from the
Biosequence Set you used above by using digestdb, then load to database.

- Compile the digestdb program (if not already done):
cd $SBEAMS/lib/c/Proteomics/digestdb
make digestdb
(If this doesn't work, do not proceed; debug first.)

- Generate file to load:
cd $SBEAMS/lib/scripts/Proteomics
$SBEAMS/lib/c/Proteomics/digestdb/digestdb $SET_PATH > peptides.out

where $SET_PATH is the location of FASTA input file.

- Now load the file:
./load_possible_peptides.pl --set_tag YeastORF --source_file peptides.out \
  --halt_at SWN:K1CL_HUMAN

Replace YeastORF with the tag of your Biosequence Set as defined in step 7;
replace SWN:K1CL_HUMAN with the first entry in the file at which you want to
stop loading (e.g. where contaminants begin).

- You can then check the status of the loaded data:
./load_possible_peptides.pl --set_tag YeastORF --check_status


-------------------------------------------------------------------------------
9) Add a Gradient Program

If you did not add a gradient program in step 7 as part of adding an
experiment, it is recommended (but not required) that you do so now.

Click [Core Management: Gradient Program] [Add Gradient Program]
Fill in the appropriate information for Name and Description.

Under Program Data, enter rows with four space-separated columns representing
the following information: Time(minutes), %ACN in buffer A, %ACN in buffer B,
and flow rate (mL/min). The last value can be set to zero if not known.

------------ Example:
This is an example for a "LC Gradient 5-65% in 0-165 min" gradient:

Time  Buf A %  Buf B %  Flow mL/min
----  -------  -------  -----------
  0   95        5        0.01
165   35       65        0.01
166   20       80        0.01
170   20       80        0.01
171   95        5        0.01
196   95        5        0.01

[Note that the header can be left on the form, and any number of spaces
can delimit the fields.]
------------

[INSERT] the record.

You can now go back and update any experiments that used this gradient;
simply select the appropriate gradient from the drop-down list.
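
Before pasting Program Data into the form, the rows can be sanity-checked
from the command line.  The sketch below only verifies what is stated above:
each data row has four whitespace-separated numeric columns.  Header and
ruler lines (whose first field is not a number) are skipped.  check_gradient
is a hypothetical helper, not an SBEAMS tool, and how the web form itself
validates input is not covered here.

```shell
# Hypothetical helper: check that every data row has four numeric columns.
# Reads the named file(s), or standard input when no file is given.
check_gradient() {
    awk '
        BEGIN { bad = 0; rows = 0 }
        # Skip blank lines and header or ruler lines (first field not a number).
        NF == 0 || $1 !~ /^[0-9]*\.?[0-9]+$/ { next }
        NF != 4 {
            printf "line %d: expected 4 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            for (i = 2; i <= 4; i++) {
                if ($i !~ /^[0-9]*\.?[0-9]+$/) {
                    printf "line %d: column %d is not a number\n", NR, i
                    bad = 1
                }
            }
            rows++
        }
        END {
            if (rows == 0) { print "no data rows found"; bad = 1 }
            exit bad
        }
    ' "$@"
}
```

For example, "check_gradient gradient.txt && echo looks fine" prints a
message only if every data row passes.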


-------------------------------------------------------------------------------
10) Load the data products for a sample experiment.

The next step is to load the SEQUEST output and data products of your
data processing.

The recommended way of organizing searches is as follows:

RAW_DATA_DIR{Proteomics}/$PROJECT_TAG/$EXPERIMENT_TAG/$SEARCH_BATCH_TAG/

for which:
RAW_DATA_DIR{Proteomics} = Some absolute location as defined in step 4
$PROJECT_TAG = Tag (i.e. short name) defined for the project in step 7
$EXPERIMENT_TAG = Tag (i.e. short name) for the experiment defined in step 7
$SEARCH_BATCH_TAG = Unique tag for a search_batch (i.e. a running of SEQUEST)
  It is possible to run SEQUEST multiple times on the same experiment,
  against different Biosequence Sets or against the same one, perhaps
  with different parameters.  The recommendation is to name the
  $SEARCH_BATCH_TAG after the Biosequence Set tag with additional
  qualifiers if multiple searches against the same set exist or are
  expected.

Eventually, an automated system (search for sbeamsbot) will organize the data
so that this need not be done manually, but it is not yet finished.

The following files should be placed in the directories:

RAW_DATA_DIR{Proteomics}/$PROJECT_TAG/$EXPERIMENT_TAG/
  *.dat
  *.nfo
  *.png
  *.mzXML
RAW_DATA_DIR{Proteomics}/$PROJECT_TAG/$EXPERIMENT_TAG/$SEARCH_BATCH_TAG/
  sequest.params
  *.html
  interact*
RAW_DATA_DIR{Proteomics}/$PROJECT_TAG/$EXPERIMENT_TAG/$SEARCH_BATCH_TAG/$FRAC/
  *.dta
  *.out

If the files are thus organized, you can trigger the load with:

cd $SBEAMS/lib/scripts/Proteomics

./load_proteomics_experiment.pl --list

./load_proteomics_experiment.pl \
  --load \
  --experiment_tag $EXPERIMENT_TAG \
  --search_subdir=$SEARCH_BATCH_TAG


Now you can load extra data (update command):

./load_proteomics_experiment.pl \
  --experiment_tag $EXPERIMENT_TAG \
  --search_subdir=$SEARCH_BATCH_TAG \
  --update_from_summary_files \
  --update_search \
  --update_probabilities

If this experiment has a gradient program associated with it, you
will want to update the elution information by also using the
--update_timing_info flag.


-You may also want to take a look at the $SBEAMS/lib/scripts/Proteomics/
 load_proteomics_experiment.start and load_proteomics_experiment.csh scripts
 to accomplish the unpacking, loading, and updating of an experiment.


To load ProteinProphet XML output:
./load_ProteinProphet.pl --search_batch_id 1

-Now log into SBEAMS and explore the data for your experiment. Refer to the
SBEAMS tutorial for an intro to using SBEAMS.


-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Troubleshooting

1) How to re-generate and update table schemas

- Drop the table and, before that, its constraints via SQL commands.  Look under
  DOMAIN_DROPCONSTRAINTS.mssql and DOMAIN_DROPTABLES.mssql.
  e.g. look under Core_DROPCONSTRAINTS.mssql and Core_DROPTABLES.mssql

- Edit the appropriate $DOMAIN_table_column.txt file
  e.g. make a field in conf/Core/Core_table_column.txt nullable=N

- Generate new schema files using generate_schema.pl
  e.g.  cd $SBEAMS/lib/scripts/Core
        ./generate_schema.pl \
          --table_prop ../../conf/Core/Core_table_property.txt \
          --table_col ../../conf/Core/Core_table_column.txt \
          --schema_file ../../sql/Core/Core \
          --destination_type mssql

- Now re-create the table and its constraints using the new (updated) DDL.
  e.g. look under Core_CREATETABLES.mssql and Core_CREATECONSTRAINTS.mssql

- Populate the table, if required.
  e.g. look in Core_POPULATE.mssql

-------------------------------------------------------------------------------
2) How to delete a search batch / experiment

-Warning: deleting considerably slows down the database!

./load_proteomics_experiment.pl \
  --delete_search_batch \
  --experiment_tag $EXPTAG \
  --search_subdir=$SUBDIR