$Id:$
=======================================================================
         Installing Gene Ontology (GO) support for SBEAMS
=======================================================================

In the following document, two directory areas are referred to:
$SBEAMS is the base area of your sbeams installation (or dev area)
$GOARC is an area where the GO source files will be stored and unpacked,
which is best not under $SBEAMS (e.g. /local/data/GeneOntology/  ).


=======================================================================
1. Download and import the Gene Ontology data
=======================================================================

    __________
___/ Download \________________________________________________________

Visit http://archive.godatabase.org/full to see what is the latest
monthly build. In this example, it is 2006-01-01:

cd $GOARC
setenv MONTH 2006-01-01
setenv DATAFILE go_200601-assocdb
setenv GZFILE ${DATAFILE}-data.gz
setenv TGZFILE ${DATAFILE}-tables.tar.gz
mkdir $MONTH
cd $MONTH
wget -O $GZFILE http://archive.godatabase.org/full/$MONTH/$GZFILE
wget -O $TGZFILE http://archive.godatabase.org/full/$MONTH/$TGZFILE
gunzip -c $GZFILE > $DATAFILE.mysql

    ______
___/ Load \____________________________________________________________

If your SBEAMS back-end is MySQL, then you can load the direct distribution
as provided by the Gene Ontology folks.  If you have a different back-end,
the process can be a little more difficult.


--- MySQL Example -------------------------------------------------

    In the following aliases, 'mysql' is the name of the database
    server, 'go' is the name of the database where the GO tables
    will be created (this should be separate from the main 'sbeams'
    database), 'goadmin' is a user with ALL privileges in database
    'go', with password 'xx22++'.

    - One way to create the database is the following:
      CREATE DATABASE go;
      GRANT ALL ON go.* TO "goadmin"@"localhost" IDENTIFIED BY 'xx22++';
      GRANT FILE ON *.* TO "goadmin"@"localhost" IDENTIFIED BY 'xx22++';

    - Furthermore, if the database server will be accessed over the
      network (i.e. SBEAMS is installed on a separate machine), you
      might also have to:
      GRANT ALL ON go.* TO "goadmin"@"10.%" IDENTIFIED BY 'xx22++';
      GRANT FILE ON *.* TO "goadmin"@"10.%" IDENTIFIED BY 'xx22++';


    - Set up some variables:

    setenv DSN "-h mysql -p mysql"
    setenv ROOTDSN "-h mysql -u goadmin -p go"

    - Verify login, drop and recreate database:

    mysql $ROOTDSN
    echo "SHOW PROCESSLIST;" | mysql $ROOTDSN
    echo "SHOW DATABASES;" | mysql $ROOTDSN
    echo "DROP DATABASE go; CREATE DATABASE go;" | mysql $ROOTDSN

    - Verify regular user login (must have added user)

    mysql $DSN


    - Load!  (could possibly take several hours)
    cd $GOARC/$MONTH
    date ; nice +19 cat $DATAFILE.mysql | mysql $DSN ; date


--- SQL Server/Sybase Example -------------------------------------

    No direct dump is provided for SQL Server, so the data must be
    BCPed in. (BCP is a SQL server/SYBASE specific tool for bulk
    copying data into tables)

    - Create a database on SQL Server called 'go' or something like
      that. Let 'sbeamsadmin' or some similar account have dbo privileges.

    --- Prepare files ---------------------------------------------

    cd $GOARC/$MONTH
    tar -zxvf $TGZFILE
    cd ${DATAFILE}-tables

    rm *.sql
    $SBEAMS/lib/scripts/BioLink/GOMySQL2MSSQL.pl \
      --in ../${DATAFILE}.mysql --out GO \
      --bcpuser sbeamsadmin --bcppass xxxxxx --bcpdatabase go

    ( Replace 'xxxxxx' in the previous command with the sbeamsadmin
      password in database 'go'.)

    - The following command creates tab-delimited fields from the
      original dump file to use in the BCP loading process. It also
      translates embedded tabs to spaces:

    foreach file (*.txt)
      echo $file
      mv $file $file.before
      cat $file.before | sed -e 's/\t\\N/\t/g' | sed -e 's/\\\t/  /g' | unix2dos > $file
    end

    - If the above produced no errors, then you could:
    rm *.before


    --- Load files ------------------------------------------------

    - If there's a previous database, drop all tables with:

    $SBEAMS/lib/scripts/Core/runsql.pl --user sbeamsadmin \
      --database go --delim GO --sfile GO.DROP.mssql

    - Then create all GO tables with:

    $SBEAMS/lib/scripts/Core/runsql.pl --user sbeamsadmin \
      --database go --delim GO --sfile GO.CREATE.mssql


    - Now copy the entire directory ${DATAFILE}-tables (e.g.
      go_200601-assocdb-tables) to a location that is visible from
      the SQL Server OS itself (if it is not already), then in a
      shell on the SQL Server itself (cmd.exe), cd to this directory
      and run the batch script created by GOMySQL2MSSQL.pl:

    GO.BCP.bat
      or
    GO.BCP.bat > bcpout.txt  (to be able to examine the diagnostic messages more easily)
    (This BCP usually under 30 mins, depending on the server hardware)

    - When this is complete, delete GO.BCP.bat (and bcpout.txt) since
      it contains a password in it:
    rm GO.BCP.bat
    rm bcpout.txt


=======================================================================
2. Run gene_annotation scripts in BioLink
=======================================================================

setenv BIOLINK_DATABASE sbeams
setenv GO_DATABASE go
setenv GO_DATABASE_PREFIX go.dbo.

#### If a previous set of tables existed, drop them
$SBEAMS/lib/scripts/Core/runsql.pl --user sbeamsadmin \
  --sfile $SBEAMS/lib/sql/BioLink/BioLink_GOAnnotation_DROPTABLES.sql \
  --database $BIOLINK_DATABASE

#### Create tables in BioLink as sbeamsadmin
$SBEAMS/lib/scripts/Core/runsql.pl --user sbeamsadmin \
  --sfile $SBEAMS/lib/sql/BioLink/BioLink_GOAnnotation_CREATETABLES.sql \
  --database $BIOLINK_DATABASE --delim GO


cd $SBEAMS/lib/scripts/BioLink

#### Load Yeast (SGD) annotations
time ./load_gene_annotation.pl --xref_dbname SGD \
  --godatabaseprefix $GO_DATABASE_PREFIX

Sample times:
  SGD may take 1:45 hours


#### Repeat for xref_dbnames FB, TAIR, UniProt as desired
as above


#### Create indexes
$SBEAMS/lib/scripts/Core/runsql.pl --user sbeamsadmin \
  --sfile $SBEAMS/lib/sql/BioLink/BioLink_GOAnnotation_CREATEINDEXES.sql \
  --database $BIOLINK_DATABASE --delim GO


=======================================================================
3. Link to Proteomics tables
=======================================================================

#### Create linking table in Proteomics with (skip if already existing)
$SBEAMS/lib/scripts/Core/runsql.pl --user sbeamsadmin \
  --sfile $SBEAMS/lib/sql/Proteomics/Proteomics_GOAnnotationLink_CREATETABLES.sql \
  --database $BIOLINK_DATABASE --delim GO

#### Populate it
#### At present it is probably best to study this file and only execute
#### the SQL that is relevant to your local installation
#### (Remove 'Biolink..' if installing on a single-database instance.)
edit $SBEAMS/lib/sql/Proteomics/Proteomics_GOAnnotationLink_POPULATE.sql

#### Remember to (re-)create the indexes after every time you modify
#### this table.


=======================================================================
4. Extra information and troubleshooting
=======================================================================

- You will need to execute most of the procedure outlined in this 
  document whenever you wish to load updated GO data. The data at
  godatabase.org is updated more or less on a monthly basis. You 
  may want to take a look at, and make use of the following script
  to help you through part of the process:

  $SBEAMS/lib/scripts/BioLink/get_GO.sh

  Please look through the code first so you understand it before
  running it.


================================= FIN =================================