Coscinasterias muricata Coelomic Epithelium Transcriptome

Introduction

Coscinasterias is a cosmopolitan genus of large and sea stars with the ability of somatic fission as a clonal reproductive strategy. Of these, C. muricata from New Zealand is particularly interesting because it reproduces exclusively sexually in certain geographical areas and mostly by cloning in others. During fission, the animals tear themselves apart across their central disc, where after the lost body parts are regenerated. We have sequenced and subsequently analyzed the transcriptome of the coelomic epithelium of a clonal C. muricata specimen. This work has been submitted to Marine Genomics (Gabre et al, “The coelomic epithelium transcriptome from a clonal sea star, Coscinasterias muricata”, 2015). If you find the following supplementary material useful for your research, please cite that paper.

Online resources

Transcriptome Browser

We have uploaded the transcript sequences into a GBrowse2 ^[1] engine to get a transcriptome browser, where many of the datasets produced by the protocols on this supplementary material web page were projected over each sequence.

https://compgen.bio.ub.edu/gbrowse/starfish_transcriptome/

The BLAST web form

We have adapted viroBLAST ^[2] to fit into our CMS system. That BLAST web service has already defined two nucleotide databases for the reads, both raw 454 sequences and cleaned reads (by Trimmomatic) and three databases for the transcriptomes produced, including SOAPdenovo-Trans (recommended), Trinity, and NewBler. There are two amino acid databases for each of those three transcriptome assemblies, made upon the six frame translations of the nucleotide sequences for the contigs, as well as taking the longest ORF from that set separately. Let us know if there is any problem using this service.

https://compgen.bio.ub.edu/transcriptomes/blast/CmurCE.TR/blast.php

Download Files

The following table summarizes the sequence files along with their approximate byte size. All the files were compressed using bzip2.

Set	Raw 454 Data	Raw Reads	Clean Reads (FastX)	Clean Reads (Trimmomatic)
G0CB7HT03	SFF (230MB)	FASTQ (42MB)	FASTQ (7.3MB)	FASTQ (25MB)
G0CB7HT04	SFF (239MB)	FASTQ (42MB)	FASTQ (8.5MB)	FASTQ (21MB)

Assembly	NewBler (RAW)	NewBler (FX)	NewBler (TR)	Trinity (FX)	Trinity (TR)	SOAPdenovoTrans (FX)	SOAPdenovoTrans (TR)
Contigs	FASTA (3.1MB)	FASTA (235KB)	FASTA (1.3MB)	FASTA (454KB)	FASTA (2.7MB)	FASTA (182KB)	FASTA (1.5MB)

“FX” and “TR” correspond to the assembly run using the FastX or the Trimmomatic protocols to clean the 454 reads respectively, you can find the assembly details on the following methods section.

The raw data has been also submitted to the NCBI-SRA database:

BioSample	SAMN03371637
Experiment	SRX889239
SRA Run	SRR1816871

Material and Methods

This section contains a detailed description of the computational protocols that produced the current C. muricata transcriptome version, as well as its annotation. Code blocks for all the steps show the command-line arguments passed to the different tools used all along the analyses performed.

Pre-Assembly Procedures

Raw Sequence Data

Sequence reads were produced at Amplicon Express, using half plate of a Roche GS FLX+ 454-Titanium sequencer. sff2fastq was used to retrieve reads sequence and qualities in FASTQ format from the original standard flowgram format (SFF) files.

Retrieving sequences from the raw 454 data SFF files

# BD is the shell variable for the project base directory.

# Using sff2fastq to extract nucleotides and qualities from SFF files into fastq

for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    sff2fastq -o $BD/seqs/$DS.fq \
                 $BD/raw_data/$DS.sff \
              2> $BD/seqs/$DS.sff2fastq.log

  }; done

Some basic stats on raw reads

for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    echo "# Running commands on $DS" 1>&2

    # Basic read length and GC content summary using EMBOSS infoseq

    infoseq -noheading -only -name -length -pgc \
            fastq::$BD/seqs/$DS.fq \
                 > $BD/seqs/$DS.info.tbl

    gawk '{ S+=$2; N++ }
      END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; }
         ' $BD/seqs/$DS.info.tbl

    # Using a custom R script to plot Length vs GC content scatterplots

    Rscript Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/$DS.info.tbl

    # QC analysis using FastQC

    mkdir $BD/seqs/$DS.QC
    fastqc -f fastq --extract --outdir $BD/seqs/$DS.QC $BD/seqs/$DS.fq

  }; done

Set	Total seqs	Total length (bp)	Avg Length (bp)
G0CB7HT03	194,344	77,400,490	398.265
G0CB7HT04	195,424	70,224,129	359.342
TOTAL	389,768	147,624,619	378.750

Set	Quality per Position	Nucleotide Content	GC distribution	Length vs GC content
G0CB7HT03
G0CB7HT04

Cleaning Raw Reads

There are several tools that can help cleaning the raw reads before the assembly, we used here those provided by FASTX tools suite. The tools it provides facilitate the pipelining of filters to clean up the read sequences. However, fastq_quality_filter does not trim low quality ends, it removes any read that does not meet the criteria of a minimum quality score (-+q=20+- for Phred Q20) across a given length percent (-+p=50+- in our case). fastx_artifacts_filter has not found any issue on our data set, thus the cleaning was mainly performed by the fastx_clipper steps.

Cleaning reads sequences with FASTX tools

for DS in G0CB7HT03 G0CB7HT04 ;
  do {

   # FastX Quality filtering
   fastq_quality_filter -i $BD/seqs/${DS}.fq -o $BD/seqs/${DS}_q20.fq -p 50 -q 20 -Q 33

   # FastX Artifact filtering
   fastx_artifacts_filter -i $BD/seqs/${DS}_q20.fq -o $BD/seqs/${DS}_q20_noartfct.fq -Q 33

   # Fastx Adapters clipping
   fastx_clipper                 -i $BD/seqs/${DS}_q20_noartfct.fq \
                                 -a CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG -Q 33 | \
     fastx_clipper               -a CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG -Q 33 | \
       fastx_clipper             -a GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC -Q 33 | \
         fastx_clipper           -a TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG   -Q 33 | \
           fastx_clipper         -a CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA   -Q 33 | \
             fastx_clipper       -a GCCTCCCTCGCGCCATCAG                          -Q 33 | \
               fastx_clipper     -a GCCTTGCCAGCCCGCTCAG                          -Q 33 | \
                 fastx_clipper   -a CGTATCGCCTCCCTCGCGCCATCAG                    -Q 33 | \
                   fastx_clipper -a CTATGCGCCTTGCCAGCCCGCTCAG                    -Q 33   \
                                 -o $BD/seqs/${DS}_q20_noartfct_clipped.fq

  }; done

Some basic stats on cleaned reads

for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    for FL in q20 q20_noartfct q20_noartfct_clipped ;
      do {

        echo "# Running commands on $DS _ $FL" 1>&2

        # Basic read length and GC content summary using EMBOSS infoseq

        infoseq -noheading -only -name -length -pgc \
                fastq::$BD/seqs/${DS}_${FL}.fq \
                     > $BD/seqs/${DS}_${FL}.info.tbl

        gawk '{ S+=$2; N++ }
           END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; }
             ' $BD/seqs/${DS}_${FL}.info.tbl

        # Using a custom R script to plot Length vs GC content scatterplots

        Rscript Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/${DS}_${FL}.info.tbl

        # QC analysis using FastQC

        mkdir $BD/seqs/${DS}_${FL}.QC
        fastqc -f fastq --extract --outdir $BD/seqs/${DS}_${FL}.QC $BD/seqs/${DS}_${FL}.fq

      }; done

  }; done

Set	Step	Total seqs	Total length (bp)	Avg Length (bp)
G0CB7HT03	raw	194,344	77,400,490	398.265
	>=Q20	193,763	77,330,283	399.097
	+Artifacts	193,763	77,330,283	399.097
	+Adapters	169,557	13,286,454	78.360
G0CB7HT04	raw	195,424	70,224,129	359.342
	>=Q20	194,415	70,085,189	360.493
	+Artifacts	194,415	70,085,189	360.493
	+Adapters	170,063	13,371,236	78.625

Final Set	Step	Quality per Position	Nucleotide Content	GC distribution	Length vs GC content
G0CB7HT03	>=Q20
	+Adapters
G0CB7HT04	>=Q20
	+Adapters

After this step, reads sequence length distribution shows a dramatic decrease, while the GC content seems to fit perfectly over the theoretical distribution. We tried a more integrated approach to see if we can improve the cleaning step on the quality trimming, which was based on Trimmomatic.

Cleaning reads sequences with Trimmomatic

export TMC=/opt/install/Trimmomatic-0.32/
for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    java -jar $TMC/trimmomatic-0.32.jar  SE  \
        $BD/seqs/${DS}.fq  $BD/seqs/${DS}.trimmomatic.fq \
        ILLUMINACLIP:$TMC/adapters/454_adapters-SE.fa:2:30:10 \
        LEADING:15 TRAILING:15 SLIDINGWINDOW:4:15 MINLEN:30 TOPHRED33 \
        2> $BD/seqs/${DS}.trimmomatic.log 1>&2

  }; done

Some basic stats on Trimmomatic cleaned reads

for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    echo "# Running commands on $DS" 1>&2

    # Basic read length and GC content summary using EMBOSS infoseq

    infoseq -noheading -only -name -length -pgc \
            fastq::$BD/seqs/${DS}.trimmomatic.fq \
                 > $BD/seqs/${DS}.trimmomatic.info.tbl

    gawk '{ S+=$2; N++ }
       END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; }
         ' $BD/seqs/${DS}.trimmomatic.info.tbl

    # Using a custom R script to plot Length vs GC content scatterplots

    Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/${DS}.trimmomatic.info.tbl

    # QC analysis using FastQC

    mkdir $BD/seqs/${DS}.trimmomatic.QC
    fastqc -f fastq --extract --outdir $BD/seqs/${DS}.trimmomatic.QC $BD/seqs/${DS}.trimmomatic.fq

  }; done

# Making the boxplots
Rscript  Compare_Vector_Distributions.R \
    "Reads Length Distribution::NA::Sequence Length (bp)" \
    blastout/violin.G0CB7HT03_reads_lengthdist.png   \
    "Raw reads${RET}::2:: ::"seqs/G0CB7HT03.info.tbl"$SPC" \
    "FASTX${RET}q20::2:: ::"seqs/G0CB7HT03_q20.info.tbl"$NBC"                           \
    "FASTX${RET}q20+art::2:: ::"seqs/G0CB7HT03_q20_noartfct.info.tbl"$NBC"              \
    "FASTX${RET}q20+art+clip::2:: ::"seqs/G0CB7HT03_q20_noartfct_clipped.info.tbl"$NBC" \
    "Trimmomatic${RET}::2:: ::"seqs/G0CB7HT03.trimmomatic.info.tbl"$TRC"

Rscript  Compare_Vector_Distributions.R \
    "Reads Length Distribution::NA::Sequence Length (bp)" \
    blastout/violin.G0CB7HT04_reads_lengthdist.png   \
    "Raw reads${RET}::2:: ::"seqs/G0CB7HT04.info.tbl"$SPC" \
    "FASTX${RET}q20::2:: ::"seqs/G0CB7HT04_q20.info.tbl"$NBC"                           \
    "FASTX${RET}q20+art::2:: ::"seqs/G0CB7HT04_q20_noartfct.info.tbl"$NBC"              \
    "FASTX${RET}q20+art+clip::2:: ::"seqs/G0CB7HT04_q20_noartfct_clipped.info.tbl"$NBC" \
    "Trimmomatic${RET}::2:: ::"seqs/G0CB7HT04.trimmomatic.info.tbl"$TRC"

Set	Total seqs	Total length (bp)	Avg Length (bp)
G0CB7HT03	raw	194,344	77,400,490	398.265
	Trimmomatic	180,024	50,285,583	279.327
G0CB7HT04	raw	195,424	70,224,129	359.342
	Trimmomatic	166,177	38,285,817	230.392

Final Set	Quality per Position	Nucleotide Content	GC distribution	Length vs GC content
G0CB7HT03
G0CB7HT04

The improved nucleotide qualities after processing reads by Trimmomatic does not affect significantly to the length of the reads as in the previous procedure, as it is shown on the Length versus GC content scatter-plots on the right column of the above table. For comparison purposes, the reads length distribution after each procedure is shown in the violin plots below:

G0CB7HT03	G0CB7HT04

Assembly Overview

For this step three different assemblers were tested: NewBler (the reference assembler for 454 sequences), trinity (routinely used for NGS transcriptome assemblies), and SOAPdenovo-Trans (a variant of SOAPdenovo specific for transcriptomes). The latter two are modern assembly tools, yet it has been reported that SOAP-denovo produces less chimeric assemblies (). In this sequence set there is no paired-end data, yet we expect that the lack of paired-end data will be compensated by the longer reads we got from the 454 sequencing experiment.

`NewBler`

Assembling reads using NewBler: reads cleaned by FastX

# Preparing the project folder
newAssembly -cdna starfish_newbler_fastx

# Preparing the reads sequences
addRun starfish_newbler_fastx ./seqs/G0CB7HT03_q20_noartfct_clipped.fasta \
                              ./seqs/G0CB7HT04_q20_noartfct_clipped.fasta
# 2 read files successfully added.
#     G0CB7HT03_q20_noartfct_clipped.fasta
#     G0CB7HT04_q20_noartfct_clipped.fasta

# Running the assembler
nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler_fastx  \
                       2> starfish_newbler_fastx/starfish_newbler_fastx.log 1>&2 &

Assembling reads using NewBler: reads cleaned by Trimmomatic

# Preparing the project folder
newAssembly -cdna starfish_newbler_trimmo

# Preparing the reads sequences
addRun starfish_newbler_trimmo ../seqs/G0CB7HT03.trimmomatic.fasta ../seqs/G0CB7HT04.trimmomatic.fasta
# 2 read files successfully added.
#     G0CB7HT03.trimmomatic.fasta
#     G0CB7HT04.trimmomatic.fasta

# Running the assembler
nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler_trimmo \
                       2> starfish_newbler_trimmo/starfish_newbler_trimmo.log 1>&2 &

Assembling reads using NewBler: using raw SFF files

# Preparing the project folder
newAssembly -cdna starfish_newbler

# Preparing the reads sequences
addRun starfish_newbler ../seqs/G0CB7HT03.sff ../seqs/G0CB7HT04.sff
# 2 read files successfully added.
#     G0CB7HT03.sff
#     G0CB7HT04.sff

# Running the assembler
nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler  \
                       2> starfish_newbler/starfish_newbler.log 1>&2 &

`Trinity`

Assembling reads using Trinity: reads cleaned by FastX

nohup Trinity.pl --seqType fq --JM 100G --CPU 32 -min_contig_length 200 \
                 --single ../seqs/G0CB7HT03_q20_noartfct_clipped.fq \
                          ../seqs/G0CB7HT04_q20_noartfct_clipped.fq \
                 --output ./starfish_trinity \
                 2> ./starfish_trinity/Trinity.log 1>&2 &

TrinityStats.pl starfish_trinity/Trinity.fasta

Assembling reads using Trinity: reads cleaned by Trimmomatic

nohup Trinity.pl --seqType fq --JM 100G --CPU 32 -min_contig_length 200 \
                 --single ../seqs/G0CB7HT03.trimmomatic.fq ../seqs/G0CB7HT04.trimmomatic.fq \
                 --output ./starfish_trinity_trimmo \
                 2> ./starfish_trinity_trimmo/Trinity.trimmomatic_fq.log 1>&2 &

TrinityStats.pl starfish_trinity_trimmo/Trinity.fasta

`SOAPdenovo-Trans`

As there are no Paired-End reads, scaffold sequences, aka transcripts, produced with the +-SOAPdenovo-Trans+- are the same as those from the contigs from step 2. The config files are available from these links: starfish.conf and starfish_trimmo.conf.

Assembling reads using SOAPdenovo-Trans: reads cleaned by FastX

# step1: computing pre-graph
SOAPdenovo-Trans-127mer pregraph -s ./starfish_soap/starfish.conf -K 63 \
                                 -o ./starfish_soap/starfish -p 32 \
                                 1> ./starfish_soap/starfish.pregraph.log \
                                 2> ./starfish_soap/starfish.pregraph.err
# step2: building contigs
SOAPdenovo-Trans-127mer contig   -g ./starfish_soap/starfish \
                                 1> ./starfish_soap/starfish.contig.log \
                                 2> ./starfish_soap/starfish.contig.err
# step3: mapping reads
SOAPdenovo-Trans-127mer map      -s ./starfish_soap/starfish.conf \
                                 -g ./starfish_soap/starfish -p 32 \
                                 1> ./starfish_soap/starfish.map.log \
                                 2> ./starfish_soap/starfish.map.err
# step4: scaffolding
SOAPdenovo-Trans-127mer scaff    -g ./starfish_soap/starfish -p 32 \
                                 1> ./starfish_soap/starfish.scaff.log \
                                 2> ./starfish_soap/starfish.scaff.err

Assembling reads using SOAPdenovo-Trans: reads cleaned by Trimmomatic

# step1
SOAPdenovo-Trans-127mer pregraph -s ./starfish_soap_trimmo/starfish_trimmo.conf -K 63 \
                                 -o ./starfish_soap_trimmo/starfish_trimmo -p 32 \
                                 1> ./starfish_soap_trimmo/starfish_trimmo.pregraph.log \
                                 2> ./starfish_soap_trimmo/starfish_trimmo.pregraph.err
# step2
SOAPdenovo-Trans-127mer contig   -g ./starfish_soap_trimmo/starfish_trimmo \
                                 1> ./starfish_soap_trimmo/starfish_trimmo.contig.log \
                                 2> ./starfish_soap_trimmo/starfish_trimmo.contig.err
# step3
SOAPdenovo-Trans-127mer map      -s ./starfish_soap_trimmo/starfish_trimmo.conf \
                                 -g ./starfish_soap_trimmo/starfish_trimmo -p 32 \
                                 1> ./starfish_soap_trimmo/starfish_trimmo.map.log \
                                 2> ./starfish_soap_trimmo/starfish_trimmo.map.err
# step4
SOAPdenovo-Trans-127mer scaff    -g ./starfish_soap_trimmo/starfish_trimmo -p 32 \
                                 1> ./starfish_soap_trimmo/starfish_trimmo.scaff.log \
                                 2> ./starfish_soap_trimmo/starfish_trimmo.scaff.err

Contigs Summary

Assembling reads using SOAPdenovo-Trans: reads cleaned by Trimmomatic

# Retrieve fasta sequences for all assemblies
for DS in starfish_newbler_fastx/assembly/454Isotigs.fna \
          starfish_newbler_trimmo/assembly/454Isotigs.fna \
          starfish_newbler/assembly/454Isotigs.fna \
          starfish_trinity/Trinity.fasta \
          starfish_trinity_trimmo/Trinity.fasta \
          starfish_soap/starfish.scafSeq \
          starfish_soap_trimmo/starfish_trimmo.scafSeq ;
  do {

    ds=${DS%%/*}

    echo "# Running commands on $DS -> $ds" 1>&2

    ln -vs $DS $ds.fa

    # Basic read length and GC content summary using EMBOSS infoseq

    infoseq -noheading -only -name -length -pgc \
            fasta::$ds.fa > $ds.info.tbl

    gawk '{ S+=$2; N++ }
       END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; }
         ' $ds.info.tbl

    # Using a custom R script to plot Length vs GC content scatterplots

    Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R $ds.info.tbl

  }; done

# Making boxplots
Rscript  Compare_Vector_Distributions.R \
    "Transcript Length Distribution::NA::Sequence Length (bp)" \
    blastout/violin.assemblies_lengthdist.png \
    "NewBler (RAW)::2:: ::"assemblies/starfish_newbler.info.tbl"$NBC"       \
    "NewBler (FX)::2:: ::"assemblies/starfish_newbler_fastx.info.tbl"$NBC"  \
    "NewBler (TR)::2:: ::"assemblies/starfish_newbler_trimmo.info.tbl"$NBC" \
    "Trinity (FX)::2:: ::"assemblies/starfish_trinity.info.tbl"$TRC"        \
    "Trinity (TR)::2:: ::"assemblies/starfish_trinity_trimmo.info.tbl"$TRC" \
    "SOAPdnT (FX)::2:: ::"assemblies/starfish_soap.info.tbl"$SPC"           \
    "SOAPdnT (TR)::2:: ::"assemblies/starfish_soap_trimmo.info.tbl"$SPC"

Rscript  Compare_Vector_Distributions_ylog.R \
    "Transcript Length Distribution::NA::Sequence Length (bp)" \
    blastout/violin.assemblies_lengthdist_ylog.png \
    "NewBler (RAW)::2:: ::"assemblies/starfish_newbler.info.tbl"$NBC"       \
    "NewBler (FX)::2:: ::"assemblies/starfish_newbler_fastx.info.tbl"$NBC"  \
    "NewBler (TR)::2:: ::"assemblies/starfish_newbler_trimmo.info.tbl"$NBC" \
    "Trinity (FX)::2:: ::"assemblies/starfish_trinity.info.tbl"$TRC"        \
    "Trinity (TR)::2:: ::"assemblies/starfish_trinity_trimmo.info.tbl"$TRC" \
    "SOAPdnT (FX)::2:: ::"assemblies/starfish_soap.info.tbl"$SPC"           \
    "SOAPdnT (TR)::2:: ::"assemblies/starfish_soap_trimmo.info.tbl"$SPC"

Assembly Run	Assembled bases(bp)	#Genes	#Transcripts	Avg GC%	Min Length(bp)	Avg Length(bp)	Median Length(bp)	Max Length(bp)	N10 length	N50 length	N90 length
NewBler (FX)	761,693	2,057	2,241	42.88	12	340.5	244	6,566	—	429	—
NewBler (TR)	4,773,068	4,525	5,447	43.57	18	876.3	607	10,814	—	1,019	—
NewBler (RAW)	10,753,480	8,182	10,022	42.10	12	1,074.0	711	13,849	—	1,319	—

Trinity (FX)	1,416,706	3,559	3,743	41.19	201	378.5	305	5,255	1,009	386	—
Trinity (TR)	8,883,819	12,342	13,817	43.59	201	643.0	474	12,023	2,420	745	—

SOAPdenovo-Trans (FX)	630,161	—	2,178	41.33	104	289.3	252	3,309	633	299	198
SOAPdenovo-Trans (TR)	5,212,378	—	11,369	43.61	100	458.5	339	9,225	1,712	529	241

Reads Cleaning Protocol	NewBler (RAW)	NewBler	Trinity	SOAPdenovo-Trans
FastX^*
Trimmomatic

^* For this row, NewBler results shown in the first column were obtained using the raw SFF files,
while all the other columns present results derived from the clean reads produced by our FastX protocol.

As better results were obtained after Trimmomatic cleaning, all the downstream analyses were performed over the NewBler “RAW”, Trinity “TR”, and SOAPdenovo-Trans “TR”.

Post-Assembly Procedures

Removing VBF genomic contamination

Just in case, we run BLAST search against a database of Viral, Bacterial, and Fungal genomes (VBF), to remove transcript sequences in the assembly that have a reasonable match to another microorganism that was processed along with the samples. We consider those transcripts as contaminant genomic sequence, and we remove from the transcript set.

cleanvbfseqs.sh

#!/bin/bash
#
# cleanvbfseqs.sh  query.fa  output_prefix
# 
#   Homoloy search over VBF genomes to remove potential contaminated sequences
#
# Note: set and export shell var BDBH, pointing to the BLAST DB VBF files
# 

# INPUT ARGS
QUERY=$1
OUTPREFIX=$2

# Minimum coverage for a contaminated sequence has to be larger than this
MINCOV=50
# BLAST PARAMETERS
EVAL=10e-25
BLSTOFMT='6 qseqid qlen sseqid slen qstart qend sstart send length score evalue bitscore '
BLSTOFMT=$BLSTOFMT'pident nident mismatch positive gapopen gaps ppos qframe sframe qseq sseq'
CPUS=32

# Checking Virus genomes
echo "### BLAST x Viral genomes..." 1>&2
blastn -task megablast \
       -query $QUERY   \
       -db $BDBH/viral_genomes           \
       -evalue $EVAL -outfmt "$BLSTOFMT" \
       -dust "yes" -num_threads $CPUS    \
               2> $OUTPREFIX.viral_genomes.log | \
    gzip -vc9 - > $OUTPREFIX.viral_genomes.out.gz

# Checking Bacterial genomes
echo "### BLAST x Bacterial genomes..." 1>&2
blastn -task megablast \
       -query $QUERY   \
       -db $BDBH/bacterial_genomes.raj   \
       -evalue $EVAL -outfmt "$BLSTOFMT" \
       -dust "yes" -num_threads $CPUS    \
               2> $OUTPREFIX.bacterial_genomes.log | \
    gzip -vc9 - > $OUTPREFIX.bacterial_genomes.out.gz

# Checking Fungal genomes
echo "### BLAST x Fungal genomes..." 1>&2
blastn -task megablast \
       -query $QUERY   \
       -db $BDBH/fungal_genomes.raj      \
       -evalue $EVAL -outfmt "$BLSTOFMT" \
       -dust "yes" -num_threads $CPUS    \
               2> $OUTPREFIX.fungal_genomes.log | \
    gzip -vc9 - > $OUTPREFIX.fungal_genomes.out.gz

# Calculating HSPs sequence coverage
echo "### Calculate sequence coverage..." 1>&2
for VBF in viral bacterial fungal;
  do {
    zcat $OUTPREFIX.${VBF}_genomes.out.gz | \
        coverage_blastshorttbl.pl --prog=blastn - \
                     > $OUTPREFIX.${VBF}_genomes.covtbl
  }; done

# Retrieving sequence ids for contaminated sequences:
#    criteria a significant match to viral/bacterial/fungal genomes
#             with more than 50% sequence sequence coverage
echo "### Retrieving sequence ids for contaminated sequences..." 1>&2
gawk -v COV=$MINCOV '($1 == "Q" && $13 > COV) { print $2, $14 }
     ' $OUTPREFIX.{viral,bacterial,fungal}_genomes.covtbl | \
  sort | uniq  > $OUTPREFIX.VBF_contaminated_seqs.QS_ids

gawk '{ print $1 }' $OUTPREFIX.VBF_contaminated_seqs.QS_ids | \
      sort | uniq > $OUTPREFIX.VBF_contaminated_seqs.ids

# Filtering out those sequences that DO NOT appear on the previous list
echo "### Filtering out vbf clean sequences..." 1>&2
grepID.pl -vAFN $OUTPREFIX.VBF_contaminated_seqs.ids \
                $QUERY \
              > $QUERY.vbfclean.fa

exit 0;

Cleaning VBF contamination

# Running the VBF cleaning script
bin/cleanvbfseqs.sh assemblies/starfish_newbler.fa \
                    blastcont/starfish_newbler \
                 2> blastcont/starfish_newbler.cleanvbfseqs.log 1>&2 &
bin/cleanvbfseqs.sh assemblies/starfish_trinity_trimmo.fa \
                    blastcont/starfish_trinity_trimmo \
                 2> blastcont/starfish_trinity_trimmo.cleanvbfseqs.log 1>&2 &
bin/cleanvbfseqs.sh assemblies/starfish_soap_trimmo.fa \
                    blastcont/starfish_soap_trimmo \
                 2> blastcont/starfish_soap_trimmo.cleanvbfseqs.log 1>&2 &

# Only few hits found and removed
wc -l blastcont/starfish_{newbler,trinity_trimmo,soap_trimmo}.VBF_contaminated_seqs.ids
  7 blastcont/starfish_newbler.VBF_contaminated_seqs.ids
 30 blastcont/starfish_trinity_trimmo.VBF_contaminated_seqs.ids
 22 blastcont/starfish_soap_trimmo.VBF_contaminated_seqs.ids

Finding putative CDS on transcripts

Finding longest ORFs to predict CDSs

while read IFA OFA;
  do {

    ifa=${IFA%.fa}

    # getting longest ORFs
    cdna2orfs.v2.pl -VSBNldu assemblies/$IFA \
                           > assemblies/$OFA.nuc.fa
    cdna2orfs.v2.pl -VSBldu  assemblies/$IFA \
                           > assemblies/$OFA.aa.fa

    # getting transcript length
    infoseq -noheading -only -name -length -pgc \
            fasta::assemblies/$IFA \
                 > assemblies/$ifa.nuc.info.tbl

    # getting longest ORFs length
    infoseq -noheading -only -name -length -pgc \
            fasta::assemblies/$OFA.nuc.fa \
                 > assemblies/$OFA.nuc.info.tbl

    Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R assemblies/$OFA.nuc.info.tbl

    # comparing transcript and longest ORFs lengths
    gawk 'BEGIN{ OFS="\t";
            while (getline<ARGV[1]>0) I[$1] = $2;
            ARGV[1] = "";
            print "ID", "TRlen", "ORFlen";
          }
          { gsub(/\.[012][\-+][0-9]+$/, "", $1);
	    if ($1 in I) {
              print $1, I[$1], $2;
              F[$1] = $2;
            } else {
              print $1, "NA", $2;
            }; 
          }
          END {
            for (n in I) {
              if (! (n in F)) print n, I[n], "NA";
            };
          }' assemblies/$ifa.nuc.info.tbl \
             assemblies/$OFA.nuc.info.tbl \
           > assemblies/$OFA.nuc.trcomp.tbl

     Rscript Transcript_vs_LongestORF_length_plot.R assemblies/$OFA.nuc.trcomp.tbl ${OFA%%.*} 25000

  }; done <<'EOF'
starfish_newbler.fa.vbfclean.fa         starfish_newbler.longest_orfs
starfish_trinity_trimmo.fa.vbfclean.fa  starfish_trinity_trimmo.longest_orfs
starfish_soap_trimmo.fa.vbfclean.fa     starfish_soap_trimmo.longest_orfs
EOF
# starfish_newbler       : From 10,149 cDNAs, 10,149 out of 1,143,474 ORFs were filtered...
# starfish_trinity_trimmo: From 13,787 cDNAs, 13,787 out of   933,239 ORFs were filtered...
# starfish_soap_trimmo   : From 11,347 cDNAs, 11,347 out of   559,743 ORFs were filtered...

	NewBler (RAW)	Trinity (TR)	SOAPdenovo-Trans (TR)
ORFs length vs GC Content
Transcript vs ORFs lengths

Fixing Strand of Transcript Sequences

In the previous section, the longest ORF was computed for each transcript, by considering all six possible coding frames (just to avoid transcripts assembled on the reverse strand). Here, that information is used to fix the orientation (strand) of all the transcript sequences that have its longest ORF encoded in the reverse strand. It is also a good moment to standardize the names of the final transcript sequences, as well as to sort them by size (longest first).

Fixing transcripts sequence orientation

while read BFA NFA SFX;
  do {

    echo "# working on $BFA" 1>&2;

    ( gawk '$1 ~ /\.[012]\+[0-9]+$/ {
              gsub(/\.[012]\+[0-9]+$/, "", $1);
              print $1;
            }' assemblies/$BFA.longest_orfs.nuc.info.tbl | \
               grepID.pl -vAF - assemblies/$NFA \
                      2> assemblies/$BFA.strfixFWD.log;
      gawk '$1 ~ /\.[012]\-[0-9]+$/ { 
              gsub(/\.[012]\-[0-9]+$/, "", $1);
              print $1;
            }' assemblies/$BFA.longest_orfs.nuc.info.tbl | \
               grepID.pl -vAF - assemblies/$NFA \
                             2> assemblies/$BFA.strfixREV.log | \
                 gawk '$1 ~ /^>/ { $1=$1"_rc"; }
                                 { print $0;  }'
      ) | fa2tbl.pl -  | \
          gawk '{ print length($2), $0 }' - | \
          sort -k1,1nr | \
            gawk -v sfx="$SFX" 'BEGIN{ C=1; }
              { sln=$1; id=$2; sq=$3; $1=$2=$3="";
                dsc=($0~/len(gth)?\=/) ? $0 : " length="sln" "$0;
                gsub(/ +/," ",dsc);
                gsub(/len\=/,"length=",dsc);
                gsub(/ path\=\[[^\]]+\]/,"",dsc);
                print sprintf("Cmur%s_TR%08d", sfx, C), sq, id, dsc;
                C++; }' - | \
          tbl2fa.pl -  > assemblies/$BFA.vbfclean_strandfix.fa

    # getting contig lengths
    infoseq -noheading -only -name -length -pgc \
            fasta::assemblies/$BFA.vbfclean_strandfix.fa \
                 > assemblies/$BFA.vbfclean_strandfix.info.tbl

  }; done <<'EOF'
starfish_newbler        starfish_newbler.fa.vbfclean.fa         NWBR
starfish_trinity_trimmo starfish_trinity_trimmo.fa.vbfclean.fa  TRIN
starfish_soap_trimmo    starfish_soap_trimmo.fa.vbfclean.fa     SOAP
EOF

Comparing Assemblers Output

Preparing BLAST target databases

while read FA TI;
  do {
    makeblastdb -in assemblies/$FA \
                -input_type fasta \
                -dbtype nucl \
                -title "$TI" \
                -out blastdb/$TI;
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF
# starfish_newbler       : added 10149 sequences in 0.801275 seconds.
# starfish_trinity_trimmo: added 13787 sequences in 0.947321 seconds.
# starfish_soap_trimmo   : added 11347 sequences in 0.677559 seconds.

Comparing transcriptome output: all-against-all

BLSTOFMT='6 qseqid qlen sseqid slen qstart qend sstart send length score evalue bitscore '
BLSTOFMT=$BLSTOFMT'pident nident mismatch positive gapopen gaps ppos qframe sframe qseq sseq'
export BLSTOFMT

# Running BLASTN and TBLASTX to find best pair-wise HSP hits
while read IFA IDB OUT;
  do {
    blastn -task blastn                \
           -query assemblies/$IFA      \
           -db blastdb/$IDB            \
           -strand 'plus'              \
           -outfmt "$BLSTOFMT"         \
           -evalue 10e-10 -num_threads 12        \
                  2> blastout/$OUT.blastn.log |  \
        gzip -c9 - > blastout/$OUT.blastn.out.gz &
    tblastx -query assemblies/$IFA \
            -db blastdb/$IDB       \
            -strand 'plus'         \
            -outfmt "$BLSTOFMT"    \
            -matrix BLOSUM90       \
            -evalue 10e-10 -num_threads 12        \
                  2> blastout/$OUT.tblastx.log |  \
        gzip -c9 - > blastout/$OUT.tblastx.out.gz &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_trinity_trimmo  sf_newbler-x-trinitytr
starfish_newbler.vbfclean_strandfix.fa         starfish_soap_trimmo     sf_newbler-x-soaptr
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_newbler         sf_trinitytr-x-newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_soap_trimmo     sf_trinitytr-x-soaptr
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_newbler         sf_soaptr-x-newbler
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_trinity_trimmo  sf_soaptr-x-trinitytr 
EOF

# Filtering HSPs to find best HITs
while read FL;
  do {
    zcat blastout/$FL.blastn.out.gz | \
        coverage_blastshorttbl.pl --prog=blastn - \
                     > blastout/$FL.blastn.covtbl &
    zcat blastout/$FL.tblastx.out.gz | \
        coverage_blastshorttbl.pl --prog=tblastx - \
                     > blastout/$FL.tblastx.covtbl &
  }; done <<'EOF'
sf_newbler-x-trinitytr
sf_newbler-x-soaptr   
sf_trinitytr-x-newbler
sf_trinitytr-x-soaptr 
sf_soaptr-x-newbler   
sf_soaptr-x-trinitytr 
EOF

Aligning reads to transcriptome contigs

Several tests were performed to assess which read mapper was going to align the long 454 reads over the transcript contigs (unpublished data). Bowtie2 yield the larger amount of mapped reads, therefore those commands are shown below.

Mapping 454 reads over the transcriptome contigs

# Preparing BOWTIE2 target databases (for the transcript contigs)
while read FA TI;
  do {
    bowtie2-build -f --large-index --seed 3523 \
                  ./assemblies/$FA \
                  ./bowtie2/$TI.bowtiedb \
               2> ./bowtie2/$TI.bowtiedb.log 1>&2 
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Mapping sequencing reads over assembled contigs
while read FA TI;
  do {
    nohup bowtie2 -q --all --phred33 --seed 3523 \
                  --local --very-sensitive-local \
                  --no-unal --met 60 --threads 32 \
                  --met-file  ./bowtie2/$TI.bowtie.metrics \
                  -x          ./bowtie2/$TI.bowtiedb \
                  -U          ./seqs/G0CB7HT03.trimmomatic.fq \
                  -U          ./seqs/G0CB7HT04.trimmomatic.fq \
                  -S          ./bowtie2/$TI.bowtie.UPL.sam \
                  --un-gz     ./bowtie2/$TI.bowtie.UPL.unaln.gz \
                  2>          ./bowtie2/$TI.bowtie.UPL.log 1>&2 &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Getting contig coverage tables
while read FA TI;
  do {
    printf "# Working on $TI " 1>&2
    samtools view -bS   ./bowtie2/$TI.bowtie.UPL.sam \
                      > ./bowtie2/$TI.bowtie.UPL.bam \
                     2> ./bowtie2/$TI.bowtie.UPL.sam2bam.log
    printf "... SAM2BAM " 1>&2
    samtools sort       ./bowtie2/$TI.bowtie.UPL.bam \
                        ./bowtie2/$TI.bowtie.UPL.sorted \
                     2> ./bowtie2/$TI.bowtie.UPL.sorted.log
    printf "... BAMSORT " 1>&2
    samtools index      ./bowtie2/$TI.bowtie.UPL.sorted.bam
    printf "... BAMSIDX " 1>&2
    samtools mpileup -f ./assemblies/$FA \
                        ./bowtie2/$TI.bowtie.UPL.sorted.bam \
                      > ./bowtie2/$TI.bowtie.UPL.pileup \
                     2> ./bowtie2/$TI.bowtie.UPL.pileup.log
    printf "... PILEUP " 1>&2
    samtools depth      ./bowtie2/$TI.bowtie.UPL.sorted.bam \
                      > ./bowtie2/$TI.bowtie.UPL.pileup_depthonly
    printf "... DEPTH\n" 1>&2
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

Estimating expression levels from sequence coverage

# 
while read FA TI;
  do {
    echo "# Working on $TI " 1>&2

    gawk 'BEGIN{
            while (getline<ARGV[1]>0) { SQL[$1]=$2; TSQL+=$2; };
            ARGV[1]="";
          }
          { if (!($1 in sc)) NT++;
            SC[$1]+=$4; sc[$1]++;
            TC+=$4; tc++;
            if (CVmax[$1] == 0) { CVmin[$1] = $4; CVmin["TOT"] = $4; };
            CVmax[$1] = CVmax[$1] < $4 ? $4 : CVmax[$1];
            CVmin[$1] = CVmin[$1] > $4 ? $4 : CVmin[$1];
            CVmax["TOT"] = CVmax["TOT"] < $4 ? $4 : CVmax["TOT"];
            CVmin["TOT"] = CVmin["TOT"] > $4 ? $4 : CVmin["TOT"];
          }
          END{
            printf "TRID\tLENGTH\tNUCL_TOTAL\tCOVG_SUM\tCOVG_MIN\tCOVG_MAX\tCOVG_AVGall\tCOVG_AVGmap\n";
            printf "TOTALseqs:%d\t%d\t%d\t%d\t%d\t%d\t%.3f\t%.3f\n", NT, TSQL, tc, TC, CVmin["TOT"], CVmax["TOT"], TC/TSQL, TC/tc;
            for (n in SQL) {
              printf "%s\t%d\t%d\t%d\t%d\t%d\t%.3f\t%.3f\n", n, SQL[n], sc[n], SC[n], CVmin[n], CVmax[n],
                                                             (n in SQL && SQL[n] != 0) ? SC[n]/SQL[n] : "NA",
                                                             (n in sc  && sc[n]  != 0) ? SC[n]/sc[n]  : "NA";
            }
          }' ./assemblies/$FA.info.tbl       \
             ./bowtie2/$TI.bowtie.UPL.pileup \
           > ./bowtie2/$TI.bowtie.UPL.pileup.summary

    egrep -v '^TRID|TOTAL' ./bowtie2/$TI.bowtie.UPL.pileup.summary \
                         > ./bowtie2/$TI.bowtie.UPL.pileup.sum2plt
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix     starfish_soap_trimmo
EOF

Having only one sample we cannot perform a differential expression analysis, yet we can check differences in coverage that will correspond to over-expressed transcripts on the sampled tissue. Sorting by average coverage of reads per transcript, we got the following top ten contigs for each of the three assemblies:

Sequences with top 10 average coverage levels

sort -k8,8nr ./bowtie2/starfish_newbler.bowtie.UPL.pileup.summary | head
CmurNWBR_TR00009750     425     425    1272737    371     3401    2994.675    2994.675
CmurNWBR_TR00005436     674     674    1202455    625     2334    1784.058    1784.058
CmurNWBR_TR00003299     957     957    1460945     73     2012    1526.588    1526.588
CmurNWBR_TR00008353     497     497     526194     37     1860    1058.740    1058.740
CmurNWBR_TR00006255     607     607     597169      4     1817     983.804     983.804
CmurNWBR_TR00005569     662     662     642126      5     1454     969.979     969.979
CmurNWBR_TR00009130     472     472     444261     26     1514     941.231     941.231
CmurNWBR_TR00008452     494     481     442215      2     1312     895.172     919.366
CmurNWBR_TR00003245     973     973     776495      9     1157     798.042     798.042
CmurNWBR_TR00006095     620     620     475386      3     1326     766.752     766.752

sort -k8,8nr ./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.summary | head
CmurTRIN_TR00000205    2602    2602    5457543      1     6635    2097.442    2097.442
CmurTRIN_TR00000032    5220    5220    5572076      1     3409    1067.448    1067.448
CmurTRIN_TR00000126    3155    3151    2799743      1     2045     887.399     888.525
CmurTRIN_TR00000028    5474    5474    2795886      1     1157     510.757     510.757
CmurTRIN_TR00000589    1686    1659     680445      1     1140     403.585     410.154
CmurTRIN_TR00000027    5543    5543    2156262      1      959     389.006     389.006
CmurTRIN_TR00000246    2406    2406     854651      1      912     355.217     355.217
CmurTRIN_TR00000212    2560    2560     886427      1      835     346.261     346.261
CmurTRIN_TR00000585    1689    1689     449898      1      628     266.369     266.369
CmurTRIN_TR00000637    1635    1635     434485     38      503     265.740     265.740

sort -k8,8nr ./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.summary | head
CmurSOAP_TR00003834     428     428    1796798    317     5671    4198.126    4198.126
CmurSOAP_TR00000442    1291    1291    2938505      0     6246    2276.146    2276.146
CmurSOAP_TR00000119    2066    2066    4193432     14     3409    2029.735    2029.735
CmurSOAP_TR00000328    1452    1452    2446201     28     2333    1684.711    1684.711
CmurSOAP_TR00001376     784     784    1178624     14     2013    1503.347    1503.347
CmurSOAP_TR00007586     272     272     252061     25     1132     926.695     926.695
CmurSOAP_TR00002681     529     529     470614    169     1217     889.629     889.629
CmurSOAP_TR00002033     629     629     513931    122     1126     817.060     817.060
CmurSOAP_TR00000056    2474    2474    1889173      9     1157     763.611     763.611
CmurSOAP_TR00010178     194     194     143838     36     2314     741.433     741.433

If we consider the functional annotation, if any, for those contigs from BLASTN searches against NCBInt, then:

Functional annotation of highest coverage contigs

Comparing coverage distributions

# Making boxplots
Rscript  Compare_Vector_Distributions_ylog.R \
    "Minimum Read Coverage${RET}per Transcript::NA::Bowtie2 Minimum Coverage" \
    bowtie2/violin.bowtie_read_mincoverage.png \
    "NewBler::5::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC"        \
    "Trinity::5::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \
    "SOAPdnT::5::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC"
Rscript  Compare_Vector_Distributions_ylog.R \
    "Maximum Read Coverage${RET}per Transcript::NA::Bowtie2 Maximum Coverage" \
    bowtie2/violin.bowtie_read_maxcoverage.png \
    "NewBler::6::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC"        \
    "Trinity::6::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \
    "SOAPdnT::6::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC"
Rscript  Compare_Vector_Distributions_ylog.R \
    "Average Read Coverage${RET}per Transcript::NA::Bowtie2 Average Coverage" \
    bowtie2/violin.bowtie_read_avgcoverage.png \
    "NewBler::7::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC"        \
    "Trinity::7::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \
    "SOAPdnT::7::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC"

Functional Annotation

Global shell vars for this section

TAB=$'\t'
RET=$'\n'
NBC="::#CC000099"
TRC="::#00CC0099"
SPC="::#00CC0099"

Comparing starfish and sea urchin transcriptomes

Retrieving sea urchin transcripts from UCSC genomes database

# Downloading transcript related files
wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/est.fa.gz     -O seaurchin/est.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/strPur2.fa.gz -O seaurchin/strPur2.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/refMrna.fa.gz -O seaurchin/refMrna.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/mrna.fa.gz    -O seaurchin/mrna.fa.gz

BLASTX/TBLASTX comparison among starfish and sea urchin transcriptomes

# Preparing BLAST target databases
for F in mrna est refMrna;
  do {
    zcat seaurchin/$F.fa.gz | \
      makeblastdb -in - \
                  -input_type fasta \
                  -dbtype nucl \
                  -title "SeaUrchin_$F" \
                  -out seaurchin/SeaUrchin_$F;
  }; done
#    mrna: added  30,401 sequences in  4.02561 seconds.
#     est: added 141,833 sequences in 11.2223 seconds.
# refMrna: added     466 sequences in  0.0728889 seconds.

# Running BLASTN and TBLASTX searches
while read FA TI;
  do {
    for SU in mrna est refMrna;
      do {
        OUT=$TI"-x-Spur_"$SU;
        blastn -task  blastn               \
               -query assemblies/$FA       \
               -db seaurchin/SeaUrchin_$SU \
               -outfmt "$BLSTOFMT" -dust "yes"     \
               -evalue 10e-25 -num_threads 12      \
                    2> blastout/$OUT.blastn.log |  \
          gzip -c9 - > blastout/$OUT.blastn.out.gz &
        tblastx -query assemblies/$FA       \
                -db seaurchin/SeaUrchin_$SU \
                -outfmt "$BLSTOFMT"         \
                -matrix BLOSUM62            \
                -evalue 10e-25 -num_threads 12        \
                     2> blastout/$OUT.tblastx.log |  \
           gzip -c9 - > blastout/$OUT.tblastx.out.gz &
      }; done
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Processing transcript coverage
for TI in starfish_newbler        \
          starfish_trinity_trimmo \
          starfish_soap_trimmo    ;
  do {
    for SU in mrna est refMrna;
      do {
        OUT=$TI"-x-Spur_"$SU;
        # BLASTN
        zcat blastout/$OUT.blastn.out.gz | \
            coverage_blastshorttbl.pl --prog=blastn - \
                         > blastout/$OUT.blastn.covtbl 
        egrep '^Q\b' blastout/$OUT.blastn.covtbl \
                     > blastout/$OUT.blastn.covtbl.qryonly

        # TBLASTX
        zcat blastout/$OUT.tblastx.out.gz | \
            coverage_blastshorttbl.pl --prog=tblastx - \
                         > blastout/$OUT.tblastx.covtbl
        egrep '^Q\b' blastout/$OUT.tblastx.covtbl \
                     > blastout/$OUT.tblastx.covtbl.qryonly

      }; done
  }; done

# Making boxplots
Rscript  Compare_Vector_Distributions.R \
    "BLASTN x Sea urchin::NA::% Coverage" \
    blastout/violin.Spur_est.blastn.png \
    "NewBler${RET}x ESTs::13::${TAB}::"blastout/starfish_newbler-x-Spur_est.blastn.covtbl.qryonly"$NBC"         \
    "NewBler${RET}x mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_mrna.blastn.covtbl.qryonly"$NBC"        \
    "NewBler${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_refMrna.blastn.covtbl.qryonly"$NBC" \
    "Trinity${RET}x ESTs::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_est.blastn.covtbl.qryonly"$TRC"         \
    "Trinity${RET}x mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_mrna.blastn.covtbl.qryonly"$TRC"        \
    "Trinity${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_refMrna.blastn.covtbl.qryonly"$TRC" \
    "SOAPdnT${RET}x ESTs::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_est.blastn.covtbl.qryonly"$SPC"         \
    "SOAPdnT${RET}x mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_mrna.blastn.covtbl.qryonly"$SPC"        \
    "SOAPdnT${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_refMrna.blastn.covtbl.qryonly"$SPC"

Rscript  Compare_Vector_Distributions.R \
    "TBLASTX x Sea urchin::NA::% Coverage" \
    blastout/violin.Spur_est.tblastx.png \
    "NewBler${RET}x ESTs::13::${TAB}::"blastout/starfish_newbler-x-Spur_est.tblastx.covtbl.qryonly"$NBC"         \
    "NewBler${RET}x mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_mrna.tblastx.covtbl.qryonly"$NBC"        \
    "NewBler${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_refMrna.tblastx.covtbl.qryonly"$NBC" \
    "Trinity${RET}x ESTs::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_est.tblastx.covtbl.qryonly"$TRC"         \
    "Trinity${RET}x mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_mrna.tblastx.covtbl.qryonly"$TRC"        \
    "Trinity${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_refMrna.tblastx.covtbl.qryonly"$TRC" \
    "SOAPdnT${RET}x ESTs::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_est.tblastx.covtbl.qryonly"$SPC"         \
    "SOAPdnT${RET}x mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_mrna.tblastx.covtbl.qryonly"$SPC"        \
    "SOAPdnT${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_refMrna.tblastx.covtbl.qryonly"$SPC"

The following tables summarize the number of transcripts having a BLAST hit (“query” column) and the number of distinct sequences from the database aligning to those transcripts (“target” column). TBL files were compressed using bzip2. As it can be appreciated when comparing both tables, TBLASTX returned almost three times more target hits while only twice as much query hits. Sea urchin mRNA transcript set has 30,401 sequences. This can be due to two reasons: 1) less isoforms per gene on starfish (but with the available data we cannot check this); and 2) the incompleteness of the current starfish transcriptome because it was built from a specific tissue (coelomic epithelium). We can check later on if there are differences at functional GO annotation.

BLASTN x S.pur mRNAs

SET	#Transcripts	#Queries	#Targets	Best Hit
NewBler	10,022	1,440	2,424	TBL
Trinity	13,817	1,772	3,169	TBL
SOAPdenovo-Trans	11,369	1,309	2,214	TBL

TBLASTX x S.pur mRNAs

SET	#Queries	#Targets	Best Hit
NewBler	2,803	7,309	TBL
Trinity	3,371	8,755	TBL
SOAPdenovo-Trans	2,317	6,378	TBL

Homology Search over Protein Sequences Databases

UniProt

UniProt was mirrored from the database ftp server on September 2014.

BLASTX over UniProt

# BLASTDB is the shell var set to the path to BLAST databases
while read FA TI;
  do {
    OUT=$TI"-x-UniProt";
    blastx -query assemblies/$FA           \
           -db $BLASTDB/UniProt_SProt_2014 \
           -strand 'plus'                  \
           -outfmt "$BLSTOFMT"             \
           -matrix BLOSUM62                \
           -evalue 10e-25 -num_threads 12  \
                2> blastout/$OUT.blastx.log |  \
      gzip -c9 - > blastout/$OUT.blastx.out.gz &
    OUT=$TI"-x-UniProtTREmbl";
    blastx -query assemblies/$FA            \
           -db $BLASTDB/UniProt_TREmbl_2014 \
           -strand 'plus'                   \
           -outfmt "$BLSTOFMT"              \
           -matrix BLOSUM62                 \
           -evalue 10e-25 -num_threads 12   \
                2> blastout/$OUT.blastx.log |  \
      gzip -c9 - > blastout/$OUT.blastx.out.gz &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Getting sequence coverage
for TI in starfish_newbler        \
          starfish_trinity_trimmo \
          starfish_soap_trimmo    ;
  do {
    for SU in UniProt UniProtTREmbl ;
      do {
        OUT=$TI"-x-"$SU;
        zcat blastout/$OUT.blastx.out.gz | \
            coverage_blastshorttbl.pl --prog=blastx - \
                         > blastout/$OUT.blastx.covtbl &
        egrep '^Q\b' blastout/$OUT.blastx.covtbl \
                     > blastout/$OUT.blastx.covtbl.qryonly
      }; done
  }; done

# Making the boxplots
Rscript  Compare_Vector_Distributions.R \
    "BLASTX x UniProt::NA::% Coverage" \
    blastout/violin.UniProt.blastx.png \
    "NewBler${RET}x UniProt::13::${TAB}::"blastout/starfish_newbler-x-UniProt.blastx.covtbl.qryonly"$NBC"        \
    "Trinity${RET}x UniProt::13::${TAB}::"blastout/starfish_trinity_trimmo-x-UniProt.blastx.covtbl.qryonly"$TRC" \
    "SOAPdnT${RET}x UniProt::13::${TAB}::"blastout/starfish_soap_trimmo-x-UniProt.blastx.covtbl.qryonly"$SPC"       \
    "NewBler${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_newbler-x-UniProtTREmbl.blastx.covtbl.qryonly"$NBC" \
    "Trinity${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_trinity_trimmo-x-UniProtTREmbl.blastx.covtbl.qryonly"$TRC" \
    "SOAPdnT${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_soap_trimmo-x-UniProtTREmbl.blastx.covtbl.qryonly"$SPC"

Violin plots comparison of assembled transcripts (on all three sets NewBler, Trinity, and SOAPdenovo-Trans) %coverage for the BLASTX search over UniProt and UniProt-TrEMBL protein sequence databases.

In order to have an equivalent functional annotation to compare starfish and sea urchin transcriptomes, same searches were performed using the sea urchin sequences.

BLASTX over UniProt for the sea urchin sequences

nohup zcat ./seaurchin/mrna.fa.gz | \
        blastx -query - \
               -db $BLASTDB/UniProt_SProt_2014 \
               -outfmt "$BLSTOFMT"             \
               -matrix BLOSUM62                \
               -evalue 10e-25 -num_threads 64  \
             > ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.out \
            2> ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.log &

coverage_blastshorttbl.pl --prog=blastx \
         ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.out \
       > ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl

NCBI non redundant

NCBI non redundant (-+NCBI-nr+-) Release 197.0 (August 15, 2013) was downloaded from the NCBI BLAST db ftp site.

BLASTX over NCBI-nr

# BLASTDB is the shell var set to the path to BLAST databases
while read FA TI;
  do {
    OUT=$TI"-x-NCBInr";
    blastx -query assemblies/$FA \
           -db $BLASTDB/nr       \
           -strand 'plus'        \
           -outfmt "$BLSTOFMT"   \
           -matrix BLOSUM62      \
           -evalue 10e-25 -num_threads 12   \
                2> blastout/$OUT.blastx.log |  \
      gzip -c9 - > blastout/$OUT.blastx.out.gz &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Getting sequence coverage
for TI in starfish_newbler        \
          starfish_trinity_trimmo \
          starfish_soap_trimmo    ;
  do {
    OUT=$TI"-x-NCBInr"

    zcat blastout/$OUT.blastx.out.gz | \
      coverage_blastshorttbl.pl --prog=blastx - \
                              > blastout/$OUT.blastx.covtbl &

    egrep '^Q\b' blastout/$OUT.blastx.covtbl \
                 > blastout/$OUT.blastx.covtbl.qryonly
  }; done

# Making the boxplots
Rscript  Compare_Vector_Distributions.R \
    "BLASTX x NCBInr::NA::% Coverage" \
    blastout/violin.NCBInr.blastx.png \
    "NewBler::13::${TAB}::"blastout/starfish_newbler-x-NCBInr.blastx.covtbl.qryonly"$NBC"        \
    "Trinity::13::${TAB}::"blastout/starfish_trinity_trimmo-x-NCBInr.blastx.covtbl.qryonly"$TRC" \
    "SOAPdnT::13::${TAB}::"blastout/starfish_soap_trimmo-x-NCBInr.blastx.covtbl.qryonly"$SPC"

Violin plots comparison of assembled transcripts (on all three sets NewBler, Trinity, and SOAPdenovo-Trans) %coverage for the BLASTX search over NCBI non-redundant protein sequence database (NCBInr).

Here too, in order to have an equivalent functional annotation to compare starfish and sea urchin transcriptomes, same searches were performed using the sea urchin sequences.

BLASTX over NCBInr for the sea urchin sequences

nohup zcat ./seaurchin/mrna.fa.gz | \
        blastx -query - \
               -db $BLASTDB/nr                 \
               -outfmt "$BLSTOFMT"             \
               -matrix BLOSUM62                \
               -evalue 10e-25 -num_threads 64  \
             > ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.out \
            2> ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.log &

coverage_blastshorttbl.pl --prog=blastx \
         ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.out \
       > ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.covtbl

NCBI nucleotide

NCBI nucleotide (-+NCBI-nt+-) Release 197.0 (August 15, 2013) was downloaded from the NCBI BLAST db ftp site.

BLASTN over NCBI-nt

# BLASTDB is the shell var set to the path to BLAST databases
while read FA TI;
  do {
    OUT=$TI"-x-NCBInt";
    blastn -query assemblies/$FA \
           -db $BLASTDB/nt       \
           -outfmt "$BLSTOFMT"   \
           -evalue 10e-5 -num_threads 32   \
                2> blastout/$OUT.blastn.log |  \
      gzip -c9 - > blastout/$OUT.blastn.out.gz &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Getting sequence coverage
for TI in starfish_newbler        \
          starfish_trinity_trimmo \
          starfish_soap_trimmo    ;
  do {
    OUT=$TI"-x-NCBInt"

    zcat blastout/$OUT.blastn.out.gz | \
      coverage_blastshorttbl.pl --prog=blastn - \
                              > blastout/$OUT.blastn.covtbl

    egrep '^Q\b' blastout/$OUT.blastn.covtbl \
                  > blastout/$OUT.blastn.covtbl.qryonly
  }; done

# Making the boxplots
Rscript  Compare_Vector_Distributions.R \
    "BLASTX x NCBInt::NA::% Coverage" \
    blastout/violin.NCBInt.blastn.png \
    "NewBler::13::${TAB}::"blastout/starfish_newbler-x-NCBInt.blastn.covtbl.qryonly"$NBC"        \
    "Trinity::13::${TAB}::"blastout/starfish_trinity_trimmo-x-NCBInt.blastn.covtbl.qryonly"$TRC" \
    "SOAPdnT::13::${TAB}::"blastout/starfish_soap_trimmo-x-NCBInt.blastn.covtbl.qryonly"$SPC"

BLASTN x NCBInt

SET	#Transcripts	#Queries	#Targets	Best Hit
NewBler	10,022	776	27,484	TBL
Trinity	13,817	902	29,342	TBL
SOAPdenovo-Trans	11,369	779	31,486	TBL

Violin plots comparison of assembled transcripts (on all three sets NewBler, Trinity, and SOAPdenovo-Trans) %coverage for the BLASTN search over NCBI nucleotide sequence database (NCBInt).

Searching for PFAM Domains

The two PFAM database sections, Pfam-A and Pfam-B were downloaded from Pfam ftp site. Pfam-A is the manually curated portion of the database that contains over 13,000 entries; for each entry a protein sequence alignment and a hidden Markov model is stored. These hidden Markov models can be used to search sequence databases with the HMMER package. Pfam-B is an automatically generated supplement to Pfam-A that contains a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found. HMMER tools were used to scan transcript sequences to annotate domains. In fact, longest ORFs derived from those transcripts were taken, and compared with the HMM models from PFAM database (version 27.0, 2014). In order to get comparable e-values, it is required to adjust a command-line switch to set the effective search space, calculated from this formula: Z = #query_sequences x #target_models

PFAM search

# PFAMDB is the shell var set to the path to the PFAM HMM models
zegrep -c '^ACC' $PFAMDB/Pfam-[AB].hmm.gz
 14,831  Pfam-A.hmm.gz
 20,000  Pfam-B.hmm.gz
 34,831  HMM models in total

# From that, #target_models was set to 35,000. 
# As there are 10,149, 13,787, and 11,347 transcripts on the NewBler, 
# Trinity, and SOAPdenovo-Trans assemblies respectively, #query_sequences 
# was rounded up to 15,000.
# Therefore, Z = 15,000 x 35,000 = 525,000,000 = 5.25e8 ~ 6e8

while read FA TI;
  do {

    # Get the longest ORF for the final transcript set
    cdna2orfs.v2.pl -VSldu assembly/${FA}.fa \
                         > assembly/${FA}_lORFs.fa

    # HMMER against Pfam A
    OUT=$TI"_lORFs-x-PfamA"
    nohup hmmsearch -Z 6e8 -E 10e-5 --max --cpu 24 \
                    -o       ./hmmer_output/${OUT}.hmmsearch.out \
                    --tblout ./hmmer_output/${OUT}.hmmsearch.tbl \
                             $PFAMDB/Pfam-A.hmm.gz \
                             ./assembly/${FA}_lORFs.fa \
                          2> ./hmmer_output/${OUT}.hmmsearch.log &

    # HMMER against Pfam B
    OUT=$TI"_lORFs-x-PfamB";
    nohup hmmsearch -Z 6e8 -E 10e-5 --max --cpu 24 \
                    -o       ./hmmer_output/${OUT}.hmmsearch.out \
                    --tblout ./hmmer_output/${OUT}.hmmsearch.tbl \
                             $PFAMDB/Pfam-B.hmm.gz \
                             ./assembly/${FA}_lORFs.fa \
                          2> ./hmmer_output/${OUT}.hmmsearch.log &

  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix         starfish_newbler_cs
starfish_trinity_trimmo.vbfclean_strandfix  starfish_trinity_trimmo_cs
starfish_soap_trimmo.vbfclean_strandfix     starfish_soap_trimmo_cs
EOF

Retrieving BLAST Best Hists

getbesthit.awk

#!/usr/bin/gawk -f

{
  split($3,m,/\|/);

  if (MX[$1] < $10) {
    MX[$1] = $10;
    aln = $5"-"$6"/"$7"-"$8":"$9;
    coverage = sprintf("%.3f", $9/$2 * 100);
    GX[$1] = $1" "$2" "m[2]" "m[3]" "$4" "aln" "$10" "$11" "coverage;
  };

  GH[$1]++;
}
END {
  for (n in MX)
    print GX[n], GH[n];
}

Retrieving best scoring alignments for each contig

# get best HSPs for starfish contigs
for TI in newbler trinity_trimmo soap_trimmo ;
  do {
    for DB in UniProt UniProtTREmbl NCBInr ;
      do {
        OUT="starfish_"$TI"-x-"$DB
        echo "# working on $OUT..." 1>&2
        zcat   blastout/$OUT.blastx.out.gz | \
          ./getbesthit.awk - \
             > blastout/$OUT.blastx.besthsp.tbl
      }; done
  }; done

# get best HSPs for sea urchin contigs
for DB in uniprot ncbinr;
  do {
    OUT="seaurchin_mrna.blastx_"$DB"_eval10e-25"
    echo "# working on $OUT..." 1>&2
    ./getbesthit.awk seaurchin/$OUT.out \
                   > seaurchin/$OUT.besthsp.tbl
  }; done

Retrieving UniPROT and NCBI best hit descriptions

# Merging all UniProt ids from best BLAST hits
gawk '{ print $3 }
     ' ./blastout/starfish_newbler-x-UniProt.blastx.besthsp.tbl         \
       ./blastout/starfish_trinity_trimmo-x-UniProt.blastx.besthsp.tbl  \
       ./blastout/starfish_soap_trimmo-x-UniProt.blastx.besthsp.tbl     \
     > ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids

# Merging all NCBInr ids from best BLAST hits
gawk '{ print $3 }
     ' ./blastout/starfish_newbler-x-NCBInr.blastx.besthsp.tbl         \
       ./blastout/starfish_trinity_trimmo-x-NCBInr.blastx.besthsp.tbl  \
       ./blastout/starfish_soap_trimmo-x-NCBInr.blastx.besthsp.tbl     \
     > ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids

# How many ids do we have on each set?
wc -l ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids \
     ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids
  7540  blastout/blastx_uniprot_eval10e-25.besthsp.gbids
  4950  blastout/blastx_ncbinr_eval10e-25.besthsp.gbids

# Batch downloading GenBank nucleotide (gb) or protein (gp) records
batchwebquerymaker.pl -V \
        --getids-from-file  ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids \
        --set-database      'protein' \
        --db-output-rettype 'gp' \
        --output-file       ./blastout/blastx_uniprot_eval10e-25.besthsp \
                         2> ./blastout/blastx_uniprot_eval10e-25.besthsp.gp.log &

batchwebquerymaker.pl -V \
        --getids-from-file  ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids \
        --set-database      'protein' \
        --db-output-rettype 'gb' \
        --output-file       ./blastout/blastx_ncbinr_eval10e-25.besthsp \
                         2> ./blastout/blastx_ncbinr_eval10e-25.besthsp.gb.log &

An annotation summary table was also retrieved from the UniProt upload list web form, using a list of all the matching UniProt identifiers filtered out from the best hits file blastx_uniprot_eval10e-25.besthsp.gbids. The downloaded file uniprot_besthits_dowloaded_info.tab.gz was then used to annotate the starfish contigs and to create the supplementary excel files.

Formating UniProt functional annotation into Excel files

# Processing UniProt alignments
for TI in newbler        \
          trinity_trimmo \
          soap_trimmo    ;
  do {
    zcat ./blastout/uniprot_besthits_dowloaded_info.tab.gz | \
      gawk 'BEGIN{ FS=OFS="\t";
                   while (getline<ARGV[1]>0)
                     I[$1]=$2"\t"$3"\t"$10"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$11"\t"$12;
                   ARGV[1]=""; FS=" ";
                   hdr="Transcript ID\tLength(bp)\tBest HSP\tScore\tEvalue\tCoverage\tNum HSPs\t";
                   hdr=hdr"UniProt Entry\tEntry name (ACC)\tLength(aa)\tStatus\tProtein names\tGene names\t";
                   hdr=hdr"Organism\tCatalytic activity\tFunction [CC]\tKeywords\tGene ontology (GO)";
                   print hdr; }
            $3 in I { print $1,$2,$6,$7,$8,$9,$10,I[$3]; }
           ' - ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.tbl \
           > ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.tbl

    ./table2excel.pl ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.tbl \
                     ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.xls \
                     "${TI}-x-UniProt"
  }; done

mapNCBIinfo2hsps.awk

#!/usr/bin/gawk -f

BEGIN {
    while (getline<ARGV[1]>0) {
        J=$4;
        gsub(/\.[0-9]+$/,"",J);
        IDS[$3]=$4;
        TRS[$3]=TRS[$3]" "$1;
        CTG[$3""$1]=$1"\t"$3;
        ALN[$3""$1]=$6" "$7sco" "$9"bits\t"$8;
    };
    ARGV[1]="";
    N=D=A=S=F=K=G="";
    E="NotAnnotated";
    dflg=kflg=0;
}

dflg == 1 && $0 !~ /^ / { dflg=0 }
kflg == 1 && $0 !~ /^ / { kflg=0 }
dflg == 1 { gsub(/^ +/," ",$0); D=D""$0 }
kflg == 1 { gsub(/^ +/," ",$0); K=K""$0 }
  $1 == "LOCUS" { N=$2 }
  $1 == "DEFINITION" { D=$0; gsub(/^DEFINITION +/,"",D); dflg=1 }
  $1 == "ACCESSION" { A=$2 }
  $1 == "VERSION" { G=$NF; gsub(/^GI:/,"",G) }
  $1 == "SOURCE" { S=$0; gsub(/^SOURCE +/,"",S) }
  $1 == "KEYWORDS" { K=$0 ; gsub(/^KEYWORDS +/,"",K); kflg=1 }
  $0 ~ /^ *\/product/ { F=$0; gsub(/^ +\/product=/,"",F); gsub(/\"/,"",F) }  #"
  $0 ~ /^ *\/UniProtKB_evidence/ { E=$0; gsub(/^ +\/UniProtKB_evidence=/,"",E); gsub(/\"/,"",E) }  #"
  $0 ~ /^\/\/$/ {
      if (G in IDS) {
         SQ=TRS[G];
         gsub(/^ */,"",SQ);
         n=split(SQ,sq," ");
         for (i=1; i<=n; i++) {
             ti=sq[i];
             print CTG[G""ti]"\t"A"\t"N"\t"ALN[G""ti]"\t"S"\t"D"\t"F"\t"K"\t"E;
         };
       };
       N=D=A=S=F=K=G=""; E="NotAnnotated"; dflg=kflg=0;
  }

Formating NCBInr functional annotation into Excel files

# Processing NCBInr alignments
for TI in newbler        \
          trinity_trimmo \
          soap_trimmo    ;
  do {
    ./mapNCBIinfo2hsps.awk ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.tbl \
                           ./blastout/blastx_ncbinr_eval10e-25.besthsp.gb \
                         > ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.tbl

    ./tableNR2excel.pl ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.tbl \
                       ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.xls \
                       "${TI}-x-NCBInr"
  }; done

BLASTX x UniProt

SET	#Transcripts	#Queries	#Targets	Best Hit
NewBler	10,022	1,059	27,565	TBL / XLS
Trinity	13,817	1,503	37,010	TBL / XLS
SOAPdenovo-Trans	11,369	923	27,168	TBL / XLS

Sea Urchin	30,401	19,343	149,902

BLASTX x NCBInr

SET	#Queries	#Targets	Best Hit
NewBler	1,365	266,711	TBL / XLS
Trinity	2,021	406,049	TBL / XLS
SOAPdenovo-Trans	1,252	265,750	TBL / XLS

Sea Urchin	3,449	523,925

Merging Coverage information and Functional Annotation into Excel files

# Keep transcript IDs together with assembly contigs ID and length in bp.
egrep '^>' assemblies/starfish_newbler.vbfclean_strandfix.fa | \
  sed 's/^>//; s/\(length\|gene\)=//g;' | \
    gawk '{ print $1, $3":"$2, $4 }
         ' > assemblies/starfish_newbler.vbfclean_strandfix.idsmap
egrep '^>' assemblies/starfish_trinity_trimmo.vbfclean_strandfix.fa | \
  sed 's/^>//; s/\(length\|gene\)=//g;' | \
    gawk '{ print $1, $2, $3 }
         ' > assemblies/starfish_trinity_trimmo.vbfclean_strandfix.idsmap
egrep '^>' assemblies/starfish_soap_trimmo.vbfclean_strandfix.fa | \
  sed 's/^>//; s/\(length\|gene\)=//g;' | \
    gawk '{ print $1, $2, $3 }
         ' > assemblies/starfish_soap_trimmo.vbfclean_strandfix.idsmap

# Merging IDs with assembly coverage, UniProt, and NCBInr best hits info
for TI in newbler        \
          trinity_trimmo \
          soap_trimmo    ;
  do {

    echo "# working on $TI" 1>&2

    # merging tables
    gawk 'BEGIN{
        FS=" "; OFS="\t";
        while (getline<ARGV[1]>0) { S[$1] = $2; C[$2] = $1; len[$1] = $3; };
        while (getline<ARGV[2]>0) { sumcov[$1] = $(NF-2); avgcov[$1] = $NF; };
        while (getline<ARGV[3]>0) { s=$1 in S ? $1 : "##";
                                    GBid[s] = $3; GBac[s] = $4; GBaln[s] = $6" "$7" "$8" "$9" "$10; };
        while (getline<ARGV[4]>0) { s=$1 in S ? $1 : "##";
                                    UPid[s] = $3; UPac[s] = $4; UPaln[s] = $6" "$7" "$8" "$9" "$10; };
        for (s in S) {
          print s, S[s], len[s], (s in sumcov ? sumcov[s] : "NA"), (s in avgcov ? avgcov[s] : "NA"),
                (s in GBid ? GBid[s] : "NA"), (s in GBac ? GBac[s] : "NA"), (s in GBaln ? GBaln[s] : "NA"),
                (s in UPid ? UPid[s] : "NA"), (s in UPac ? UPac[s] : "NA"), (s in UPaln ? UPaln[s] : "NA");
        };
      }' ./assemblies/starfish_$TI.vbfclean_strandfix.idsmap    \
         ./bowtie2/starfish_${TI}.bowtie.UPL.pileup.summary     \
         ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.tbl  \
         ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.tbl \
       > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.tbl

    # adding UniProt descriptions
    zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
      gawk 'BEGIN{ FS=OFS="\t";
              while (getline<ARGV[1]>0) { U[$2] = ($10 == "" ? "NA\tNA" : $10"\t"$5" ["$6)"]"; };
              ARGV[1] = "";
            }
            { print $0, ($9 in U ? U[$9] : "NA\tNA"); } 
            ' - ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.tbl \
              > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl

    # Fixing NCBInr ids and inserting descriptions
    #
    # nr.ids data was filtered out from the fasta headers of the NCBInr sequences, as follows
    #   zegrep '^>' $BLASTDB/fasta/nr.gz | \
    #     perl -ne 'print join("\n", map { s/^>?gi\|//o; s/\|/\t/; s/ /\t/; s/ \[/\t\[/; $_ }
    #                                split /\x01/o, $_);' | \
    #        gzip -c9 - > $BLASTDB/fasta/nr.ids.gz &
    #
    zcat $BLASTDB/fasta/nr.ids.gz | \
      gawk 'BEGIN{
              FS=OFS="\t";
              while (getline<ARGV[1]>0) { N[$6] = $1; };
              ARGV[1] = "";
            }
            ($1 in N) { print $0 }
            ' ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl  - | \
          gawk 'BEGIN{
                  FS=OFS="\t";
                  while (getline<ARGV[1]>0) {
                    N[$1] = $2"\t"$3"\t"($4 != "" ? $4 : "NA");
                    sub(/^[^\|]*\|/,"",$2);
                    sub(/\|[^\|]*$/,"",$2);
                    I[$1] = $2;
                  };
                  ARGV[1] = "";
                }
                { $7=$6 in I ? I[$6] : "NA"; 
                  p=($6 in N ? N[$6] : "NA\tNA\tNA")"\t"$9;
                  $9=p;
                  print $0;
                }' - ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl \
                   > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.tbl

    # Converting tabular file to excel
    ./tableNR+UP2excel.pl ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.tbl \
                          ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.xls \
                          "NCBInr+UniProt_BLASTX"

    # Get total counts
    gawk -v ti=$TI '
           $6 != "NA" && $12 == "NA" { GB++ }
           $6 == "NA" && $12 != "NA" { UP++ }
           $6 != "NA" && $12 != "NA" { BO++ }
           $6 == "NA" && $12 == "NA" { NO++ }
          END{ printf "SET: %s  UP: %d  NR: %d  BOTH: %d  NONE: %d\n",
                       ti, UP, GB, BO, NO; }
          ' blastout/starfish_$TI.best_blastx_NCBI+UniProt.GB+UPdesc.tbl

   }; done

The following table summarizes the number of transcripts (contigs) having a match to at least one target sequence from UniProt, NCBI-nr, both, and no hit (none). TBL files were compressed using bzip2.

SET	#Transcripts	#UniProt-only	#NCBInr-only	#Both	#None	Summary Files
NewBler	10,022	0	0	1,365	8,769	TBL / XLS
Trinity	13,817	3	0	2,021	11,763	TBL / XLS
SOAPdenovo-Trans	11,369	8	0	1,252	10,084	TBL / XLS

Comparing Homology Search Results

In an attempt to quantify which assembly returned the best set of transcripts, one can consider the ratio of percent coverage versus length for the best HSP from BLAST of each transcript. The longer the hit and the higher the coverage percent (with respect to the transcript full length), the better; while low coverage may correspond to the presence of chimeric transcripts.

# Making scatterplots for queries best HSP coverage
while read TI LB CL;
  do {
    echo "# Working on $TI : BLASTN x Sea urchin" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-Spur_mrna.blastn.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTN x Sea urchin::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-Spur_mrna.blastn.covtbl.qryonly.scat.png
    echo "# Working on $TI : TBLASTX x Sea urchin" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-Spur_mrna.tblastx.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP TBLASTX x Sea urchin::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-Spur_mrna.tblastx.covtbl.qryonly.scat.png
    echo "# Working on $TI : BLASTX x UniProt" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-UniProt.blastx.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTX x UniProt::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-UniProt.blastx.covtbl.qryonly.scat.png
    echo "# Working on $TI : BLASTX x UniProtTREmbl" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-UniProtTREmbl.blastx.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTX x UniProt-TREmbl::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-UniProtTREmbl.blastx.covtbl.qryonly.scat.png
    echo "# Working on $TI : BLASTX x NCBInr" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-NCBInr.blastx.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTX x NCBInr::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-NCBInr.blastx.covtbl.qryonly.scat.png
    echo "# Working on $TI : BLASTN x NCBInt" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-NCBInt.blastn.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTN x NCBInt::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-NCBInt.blastn.covtbl.qryonly.scat.png
  }; done <<'EOF'
newbler         NewBler  #CC000099
trinity_trimmo  Trinity  #00CC0099
soap_trimmo     SOAPdnT  #0000CC99
EOF

BLAST SET	NewBler	Trinity	SOAPdenovo-Trans
BLASTN x Sea urchin mRNAs
TBLASTX x Sea urchin mRNAs
BLASTN x NCBInt
BLASTX x NCBInr
BLASTX x UniProt
BLASTX x UniProt-TrEBML

Functional analysis: Gene Ontology

On this section, SOAPdenovo-Trans assembly will be functionally characterized using the Gene Ontology (GO) annotation extracted from the UniProt database.

Deriving GO table from UniProt

#  Preparing GO annotation table for the current UNIPROT release
# UNIPROT points to the folder where UniProt database was downloaded

zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.dat.gz | \
  perl -ne 'BEGIN { $I=$A=$E=$O=""; $G="GO"; $T="TM"; $D="DM"; $S="unknown"; $K="."; $C=$S=0; }
    $_ =~ m{^ID\s+(\S+)\s}o                      && ($I=$A=$1, next);
    $_ =~ m{^AC\s+(\S+);}o                       && ($A=$1, next);
    $_ =~ m{^OS\s+([^\s]+)\s+([^\.\s]+)}o        && ($S=$1."_".$2, next);
    $_ =~ m{^OX\s+NCBI_TaxID=(\d+);}o            && ($C=$1, next);
    $_ =~ m{^DR\s+EMBL;\s+([^;]+);\s+([^;]+);}o  && ($E=$1, $O=$2, next);
    $_ =~ m{^DR\s+GO;\s+GO:([^;]+);\s+([^;]+);}o && ($G.=";".$1."#".$2, next);
    $_ =~ m{^KW\s+(.*)\.}o                       && ($K=$1, next);
    $_ =~ m{^FT\s+DOMAIN\s+(\d+)\s+(\d+)\s+([^\.]+)\.}o && do {
      $D.=";".$1."_".$2; ($s = $3) =~ s/\s+/_/og; $D.="_".$s; next; };
    $_ =~ m{^FT\s+TRANSMEM\s+(\d+)\s+(\d+)\s}o          && ($T.=";".$1."_".$2, next);
    $_ =~ m{^//}o && do {
      print join("\t", $I, $A, $E, $O, $S, $C, $D, $T, $G, $K),"\n";
      $I=$A=$E=$O=""; $G="GO"; $T="TM"; $D="DM"; $S="unknown"; $K="."; $C=$S=0;
      next;
    }' | gzip -c9 - \
            > $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz

zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz  | \
  gawk '   { if ($9 == "GO") { N++ } else { Y++ }; }
        END{ print "# Total: ", N+Y, " with GOs: ",Y," without GOs: ",N }'
# Total:  546439  with GOs:  518296  without GOs:  28143

Project GO Annotations (GOAs)

GOAs can be mapped from the raw set of BLAST hits or transferred from the best hit only, the latter was the chosen option. Best hits files were produced at ((Coscinasterias+transcriptome#UniProt|Homology Search over UniProt)) section.

Mapping GOAs from the BLASTX best HSPs set: starfish x UniProt

# Annotate starfish sequences
zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
   gawk 'BEGIN{
    FS=OFS="\t";
    while (getline<ARGV[1]>0) {
      if ($1 == "GO") continue;
      sub(/^GO;?/,"",$9);
      if (!($2 in GO)) GO[$2] = $9; # using uniprot accession i.e. Q6GZX4
      else GO[$2] = GO[$2]";"$9;
    };
    ARGV[1] = "";
   }
   $1 == "Q" {
    split($14,U,"|");
    u=U[2];
    split(U[3],K," ");
    if (u in GO && GO[u] != "") print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], GO[u];
    else print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], "NA.GO";
   }' - ./blastout/starfish_soap_trimmo-x-UniProt.blastx.covtbl \
      > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl

# Counting GOAs
gawk 'BEGIN{
        FS=OFS="\t";
        print "GO\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot";
      }
      $NF != "NA.GO" && $1 != "GO" {
        n=split($NF,m,";");
        for (i=1; i<=n; i++) { GO[m[i]]++; GOT++ };
      }
      END{
        for (j in GO)  {
           gop = GO[j];
           gon = GOT - GO[j];
           print j, gop, sprintf("%.5f", gop / GOT * 100), gon, sprintf("%.5f", gon / GOT * 100), GOT;
        } # print j, GO[j], GOT;
      }' ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \
       > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts.tbl

Mapping GOAs from the BLASTX best HSPs set: sea urchin x UniProt

# Annotate sea urchin sequences
zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
   gawk 'BEGIN{
    FS=OFS="\t";
    while (getline<ARGV[1]>0) {
      if ($1 == "GO") continue;
      sub(/^GO;?/,"",$9);
      if (!($2 in GO)) GO[$2] = $9; # using uniprot accession i.e. Q6GZX4
      else GO[$2] = GO[$2]";"$9;
    };
    ARGV[1] = "";
   }
   $1 == "Q" {
    split($14,U,"|");
    u=U[2];
    split(U[3],K," ");
    if (u in GO && GO[u] != "") print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], GO[u];
    else print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], "NA.GO";
   }' - ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl \
      > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl

# Counting GOAs
gawk 'BEGIN{
        FS=OFS="\t";
        print "GO\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot";
      }
      $NF != "NA.GO" && $1 != "GO" {
        n=split($NF,m,";");
        for (i=1; i<=n; i++) { GO[m[i]]++; GOT++ };
      }
      END{
        for (j in GO)  {
           gop = GO[j];
           gon = GOT - GO[j];
           print j, gop, sprintf("%.5f", gop / GOT * 100), gon, sprintf("%.5f", gon / GOT * 100), GOT;
        } # print j, GO[j], GOT;
      }' ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \
       > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts.tbl

Comparing GOAs between sets

mkGOcountstbl.awk

#!/usr/bin/gawk -f
BEGIN{
  FS="\t";
  while (getline<ARGV[1]>0) {
    if ($1 == "GO") continue; 
    if ($6 != "NA.GO") {
      gsub(/[\'\"]/,"",$6); #"
      n=split($6,m,";");
      for (i=1; i<=n; i++) {
        k=m[i]; 
        gsub(/\#.*$/,"",k); 
        gsub(/^.*\#/,"",m[i]);
        if (!(k in GD)) { GD[k] = "'"m[i]"'"; RO[k] = 0; };
        RO[k]++; ROT++;
      };
    };
  };
  ARGV[1] = "";
  print "GO\tGOdesc\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot\tROpos\tROpos.pct\tROneg\tROneg.pct\tROtot\tGOROpct";
  FS="\t";
}
$6 != "NA.GO" && $1 != "GO" {
  gsub(/[\'\"]/,"",$6); #"
  n=split($6,m,";");
  for (i=1; i<=n; i++) {
    k=m[i]; 
    gsub(/\#.*$/,"",k); 
    gsub(/^.*\#/,"",m[i]);
    if (!(k in GD)) { GD[k] = "'"m[i]"'"; GO[k] = 0; };
    GO[k]++; GOT++;
  };
}
END{
  for (j in GD) {
     if (!(j in GO)) GO[j]=0;
     if (!(j in RO)) RO[j]=0;
     gop = GO[j];        gopp = GOT > 0 ? sprintf("%.5f", (gop / GOT * 100)) : "NA";
     gon = GOT - GO[j];  gonp = GOT > 0 ? sprintf("%.5f", (gon / GOT * 100)) : "NA";
     rop = RO[j];        ropp = ROT > 0 ? sprintf("%.5f", (rop / ROT * 100)) : "NA";
     ron = ROT - RO[j];  ronp = ROT > 0 ? sprintf("%.5f", (ron / ROT * 100)) : "NA";
     grpp = (GOT == 0 || ROT == 0) ? "NA" : sprintf("%.5f", (GOT / ROT * 100));
     printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
            "GO:"j, GD[j], gop, gopp, gon, gonp, GOT, rop, ropp, ron, ronp, ROT, grpp;
  } # print j, GO[j], GOT;
}

Extended GOA counts

# Comparing GOAs counts between starfish and sea urchin
bin/mkGOcountstbl.awk  ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \
                    ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \
                  > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.tbl

# Comparing GOAs counts between starfish and UniProt
zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
 gawk 'BEGIN{ FS=OFS="\t" } { print $1,"2","3","4","5",$9 }' - | \
  bin/mkGOcountstbl.awk - \
                    ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \
                  > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.tbl

# Comparing GOAs counts between sea urchin and UniProt
zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
 gawk 'BEGIN{ FS=OFS="\t" } { print $1,"2","3","4","5",$9 }' - | \
  bin/mkGOcountstbl.awk - \
                    ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \
                  > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.tbl

Calculating counts differences between set pairs

# Comparing GOAs counts between starfish and sea urchin
Rscript bin/GO_calc_pval.R \
            ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.tbl \
          > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.GOcalc.log
Rscript bin/GO_getGOdesc.R \
            ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.chisq+hypergeom_pvals.tbl \
         2> ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.chisq+hypergeom_pvals.desc.log

# Comparing GOAs counts between starfish and UniProt
Rscript bin/GO_calc_pval.R \
            ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.tbl \
          > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.GOcalc.log
Rscript bin/GO_getGOdesc.R \
            ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.tbl \
         2> ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.desc.log

# Comparing GOAs counts between sea urchin and UniProt
Rscript bin/GO_calc_pval.R \
            ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.tbl \
          > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.GOcalc.log
Rscript bin/GO_getGOdesc.R \
            ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.tbl \
         2> ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.desc.log

GOA Frequencies Analysis

From the Gene Ontology ftp repository, termdb (version 2015-01-20) was retrieved. From the term.txt, term_synonym.txt, and term2term.txt it is possible to build the full graph for the GO terms from which the parent-child relationships can be extracted. This is useful to calculate frequencies at distinct GO graph levels.

Getting GO terms parent nodes

# GOTERMDB point to the folder where the termdb database was downloaded from Gene Ontology repository
gawk 'BEGIN {
    FS=OFS="\t";
    while (getline<ARGV[1]>0) {
      if ($7 == 1) continue; # 1 is the GO code for is_relation
      GO[$1] = $4;
      DE[$1] = $2;
      ON[$1] = $3;
      RO[$1] = $6;
    };
    while (getline<ARGV[2]>0) {
      if ($4 != 27) continue; # 27 is the GO code for alt_id  synonym_type
      if ($1 in SY) {
        SY[$1] = SY[$1]" "$2;
      } else {
        SY[$1] = $2;
      };
    };
    ARGV[1]=ARGV[2]=0;
    print "GOID","ONTOLOGY","TERM","PARENT","SYNONYM";
  } $2 == 1 && $5 == 0 { # $2 is relationship_type_id and 1 means is_a
     if (($4 in GO) && GO[$4] !~ /^obsolete/) {
       print GO[$4], ON[$4], DE[$4], RO[$4] == 1 ? "all" : GO[$3], ($4 in SY) ? SY[$4] : "NA";
       SN[$4]++;
     };
  } END{
     for (n in GO) {
       if (!(n in SN)) {
         print "#", GO[n], ON[n], DE[n], RO[n] == 1 ? "all" : GO[n], ($4 in SY) ? SY[n] : "NA";
       };
     };
  }' $GOTERMDB/term.txt\
     $GOTERMDB/term_synonym.txt \
     $GOTERMDB/term2term.txt \
   > GOA/goterms+parents.tbl

On this section, frequencies are computed for the three sets, yet the main interest was only the starfish GOs mapped from UniProt.

GO terms frequencies analysis

for F in starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot \
         starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin \
         seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot ;
  do {
    echo "# working on $F"

    bin/GO_prune_GOlevels.pl ./GOA/goterms+parents.tbl \
                              ./GOA/$F.tbl > ./GOA/$F.gofreqs.tbl
  }; done

GO terms subgraphs

for F in starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot \
         starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin \
         seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot ;
 do {
  echo "# working on $F"

  SF="./GOA/"$F".chisq+hypergeom_pvals.goinfo";

  bin/GO_make_graphviz_dotfile_ext.pl ./GOA/goterms+parents.tbl \
                                    $SF.tbl \
                                    1e-5 \
                                 2> $SF.ext.dot.maxpval_1e-5.log;
  for o in CC BP MF;
    do {
      mv -v $SF.graphviz.${o}.ext.dot $SF.graphviz.${o}.ext.maxpval_1e-5.dot
    }; done;

  echo "# working on $F";
  bin/GO_make_graphviz_dotfile_ext.pl ./GOA/goterms+parents.tbl \
                                    $SF.tbl \
                                 2> $SF.ext.dot.log;
  for o in CC BP MF;
    do {
      echo "# graphviz dot on $o [co/pdf]";
      dot -Tpdf $SF.graphviz.${o}.ext.dot \
              > $SF.graphviz.${o}.ext.dot.pdf;
      echo "# graphviz dot on $o [co/svg]";
      dot -Tsvg $SF.graphviz.${o}.ext.dot \
              > $SF.graphviz.${o}.ext.dot.svg;
    }; done;

  for o in CC BP MF;
    do {
      echo "# graphviz dot on $o [co/pdf]";
      dot -Tpdf $SF.graphviz.${o}.ext.maxpval_1e-5.dot \
              > $SF.graphviz.${o}.ext.maxpval_1e-5.dot.pdf;
      echo "# graphviz dot on $o [co/svg]";
      dot -Tsvg $SF.graphviz.${o}.ext.maxpval_1e-5.dot \
              > $SF.graphviz.${o}.ext.maxpval_1e-5.dot.svg;
    }; done;
 }; done;

Producing the GO barplots with R

Using R to draw GO frequency barplots

### GO freqs analysis:
GODATAFILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.gofreqs.tbl"
GOs.Cmur <- read.table(GODATAFILE, header=TRUE, sep="\t");

# Overview of the GO data
nrow(GOs.Cmur)

sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 1 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 1 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 1 ])

sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 2 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 2 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 2 ])

sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 3 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 3 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 3 ])

sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel > 1 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel > 1 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel > 1 ])

GK <- list(CCU="#E94F6C", # lightred
           CCD="#A24436", # darkred
           BPU="#52E853", # lightgreen
           BPD="#426A2B", # darkgreen ... #0A945C
           MFU="#6FAFF9", # lightblue
           MFD="#007CBA") # darkblue

numbars <- 25

for (golvl in 3:4) {

  GOpct.MF.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "MF" &
                             GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ]
  GOpct.CC.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "CC" &
                             GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ]
  GOpct.BP.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "BP" &
                             GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ]
  cat("numMF:",nrow(GOpct.MF.Cmur),"\n") #->  96 gt0 vs 512 total
  cat("numCC:",nrow(GOpct.CC.Cmur),"\n") #->  65 gt0 vs 612 total
  cat("numBP:",nrow(GOpct.BP.Cmur),"\n") #-> 214 gt0 vs 943 total
  GOsum.MF.Cmur <- sum(GOpct.MF.Cmur$GOcount) #->  800
  GOsum.CC.Cmur <- sum(GOpct.CC.Cmur$GOcount) #-> 1271
  GOsum.BP.Cmur <- sum(GOpct.BP.Cmur$GOcount) #-> 3403
  cat("sumMF:",GOsum.MF.Cmur,"\n")
  cat("sumCC:",GOsum.CC.Cmur,"\n")
  cat("sumBP:",GOsum.BP.Cmur,"\n")

  GOpct.MF.Cmur$GOpct <- GOpct.MF.Cmur$GOcount / GOsum.MF.Cmur * 100
  GOpct.CC.Cmur$GOpct <- GOpct.CC.Cmur$GOcount / GOsum.CC.Cmur * 100
  GOpct.BP.Cmur$GOpct <- GOpct.BP.Cmur$GOcount / GOsum.BP.Cmur * 100

  GOord.MF.Cmur <- head(GOpct.MF.Cmur[ order(-GOpct.MF.Cmur$GOpct), ], n=numbars)
  GOord.CC.Cmur <- head(GOpct.CC.Cmur[ order(-GOpct.CC.Cmur$GOpct), ], n=numbars)
  GOord.BP.Cmur <- head(GOpct.BP.Cmur[ order(-GOpct.BP.Cmur$GOpct), ], n=numbars)
  GOplt.MF.Cmur <- rbind(GOord.MF.Cmur, 
                         data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="MF",
                                    TERM="Other low represented GO codes",
                                    GOpct=100 - sum(GOord.MF.Cmur$GOpct)))
  GOplt.CC.Cmur <- rbind(GOord.CC.Cmur, 
                         data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="CC",
                                    TERM="Other low represented GO codes",
                                    GOpct=100 - sum(GOord.CC.Cmur$GOpct)))
  GOplt.BP.Cmur <- rbind(GOord.BP.Cmur, 
                         data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="BP",
                                    TERM="Other low represented GO codes",
                                    GOpct=100 - sum(GOord.BP.Cmur$GOpct)))

  GOplt.MF.Cmur$LBL <- paste(GOplt.MF.Cmur$TERM,
                             ifelse(is.na(GOplt.MF.Cmur$GOID), "", paste(" [",GOplt.MF.Cmur$GOID,"]",sep="")),
                             sep="")
  GOplt.MF.Cmur.max <- max(GOplt.MF.Cmur$GOpct)

  pdf(file=paste("GOA/GOfreqs_Cmur_MF_lvl",golvl,".pdf", sep=""),
      width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE);
  par.def<-par(mar=c(4,25,1,1)+0.1);
  barplot(GOplt.MF.Cmur$GOpct,
          names.arg=GOplt.MF.Cmur$LBL,
          col=as.character(GK["MFU"]),
          horiz=TRUE, xlim=c(0,GOplt.MF.Cmur.max),
          border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE)
  axis(1,at=c(seq(0,GOplt.MF.Cmur.max+5,2)),line=0.25)
  abline(v=seq(0,GOplt.MF.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black")
  mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
  text(GOplt.MF.Cmur.max, numbars*1.2, # lpos*1.2,
       labels="Molecular\nFunction",
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["MFD"]),
         # as.character(KK[c("BPD","CCD","MFD")]),
       adj=c(1,1.5), font=2, cex=2)
  text(-2, -0.7, # lpos*1.2,
       labels=paste("GO level ",golvl,sep=""),
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["MFD"]),
         # as.character(KK[c("BPD","BPD","MFD")]),
       adj=c(1,1), font=2, cex=1.25, xpd=TRUE)
  par(par.def);
  dev.off();

  GOplt.CC.Cmur$LBL <- paste(GOplt.CC.Cmur$TERM,
                             ifelse(is.na(GOplt.CC.Cmur$GOID), "", paste(" [",GOplt.CC.Cmur$GOID,"]",sep="")),
                             sep="")
  GOplt.CC.Cmur.max <- max(GOplt.CC.Cmur$GOpct)

  pdf(file=paste("GOA/GOfreqs_Cmur_CC_lvl",golvl,".pdf", sep=""),
      width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE);
  par.def<-par(mar=c(4,25,1,1)+0.1);
  barplot(GOplt.CC.Cmur$GOpct,
          names.arg=GOplt.CC.Cmur$LBL,
          col=as.character(GK["CCU"]),
          horiz=TRUE, xlim=c(0,GOplt.CC.Cmur.max),
          border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE)
  axis(1,at=c(seq(0,GOplt.CC.Cmur.max+5,2)),line=0.25)
  abline(v=seq(0,GOplt.CC.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black")
  mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
  text(GOplt.CC.Cmur.max, numbars*1.2, # lpos*1.2,
       labels="Cellular\nComponent",
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["CCD"]),
         # as.character(KK[c("BPD","CCD","MFD")]),
       adj=c(1,1.5), font=2, cex=2)
  text(-2, -0.7, # lpos*1.2,
       labels=paste("GO level ",golvl,sep=""),
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["CCD"]),
         # as.character(KK[c("BPD","BPD","MFD")]),
       adj=c(1,1), font=2, cex=1.25, xpd=TRUE)
  par(par.def);
  dev.off();

  GOplt.BP.Cmur$LBL <- paste(GOplt.BP.Cmur$TERM,
                             ifelse(is.na(GOplt.BP.Cmur$GOID), "", paste(" [",GOplt.BP.Cmur$GOID,"]",sep="")),
                             sep="")
  GOplt.BP.Cmur.max <- max(GOplt.BP.Cmur$GOpct)

  pdf(file=paste("GOA/GOfreqs_Cmur_BP_lvl",golvl,".pdf", sep=""),
      width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE);
  par.def<-par(mar=c(4,25,1,1)+0.1);
  barplot(GOplt.BP.Cmur$GOpct,
          names.arg=GOplt.BP.Cmur$LBL,
          col=as.character(GK["BPU"]),
          horiz=TRUE, xlim=c(0,GOplt.BP.Cmur.max),
          border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE)
  axis(1,at=c(seq(0,GOplt.BP.Cmur.max+5,2)),line=0.25)
  abline(v=seq(0,GOplt.BP.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black")
  mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
  text(GOplt.BP.Cmur.max, numbars*1.2, # lpos*1.2,
       labels="Biological\nProcess",
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["BPD"]),
         # as.character(KK[c("BPD","BPD","MFD")]),
       adj=c(1,1.5), font=2, cex=2)
  text(-2, -0.7, # lpos*1.2,
       labels=paste("GO level ",golvl,sep=""),
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["BPD"]),
         # as.character(KK[c("BPD","BPD","MFD")]),
       adj=c(1,1), font=2, cex=1.25, xpd=TRUE)
  par(par.def);
  dev.off();

} # for golvl

Using R to draw starfish/sea urchin GO comparison barplots

# Loading GO info datasets
FILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin";
DT.CmurxSpur <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""),
                           header=TRUE, sep="\t");
FILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot";
DT.CmurxUPrt <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""),
                           header=TRUE, sep="\t");
FILE <- "./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot";
DT.SpurxUPrt <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""),
                           header=TRUE, sep="\t");

# volcano plots
plot((1 - DT.CmurxUPrt$hypergeom.pval) ~ DT.CmurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - DT.CmurxUPrt$hypergeom.padj) ~ DT.CmurxUPrt$logodds, pch=16, col="red")

plot((1 - DT.SpurxUPrt$hypergeom.pval) ~ DT.SpurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - DT.SpurxUPrt$hypergeom.padj) ~ DT.SpurxUPrt$logodds, pch=16, col="red")

plot((1 - DT.CmurxSpur$hypergeom.pval) ~ DT.CmurxSpur$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - DT.CmurxSpur$hypergeom.padj) ~ DT.CmurxSpur$logodds, pch=16, col="red")

dt.CmurxUPrt <- DT.CmurxUPrt[!is.na(DT.CmurxUPrt$hypergeom.padj) & DT.CmurxUPrt$hypergeom.padj < 1e-5, ]
dt.SpurxUPrt <- DT.SpurxUPrt[!is.na(DT.SpurxUPrt$hypergeom.padj) & DT.SpurxUPrt$hypergeom.padj < 1e-5, ]
dt.CmurxSpur <- DT.CmurxSpur[!is.na(DT.CmurxSpur$hypergeom.padj) & DT.CmurxSpur$hypergeom.padj < 1e-5, ]

td.CmurxUPrt <- DT.CmurxUPrt[!is.na(DT.CmurxUPrt$hypergeom.padj), ]
td.SpurxUPrt <- DT.SpurxUPrt[!is.na(DT.SpurxUPrt$hypergeom.padj), ]
td.CmurxSpur <- DT.CmurxSpur[!is.na(DT.CmurxSpur$hypergeom.padj), ]

# volcano plots filtered
plot((1 - dt.CmurxUPrt$hypergeom.pval) ~ dt.CmurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - dt.CmurxUPrt$hypergeom.padj) ~ dt.CmurxUPrt$logodds, pch=16, col="red")

plot((1 - dt.SpurxUPrt$hypergeom.pval) ~ dt.SpurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - dt.SpurxUPrt$hypergeom.padj) ~ dt.SpurxUPrt$logodds, pch=16, col="red")

plot((1 - dt.CmurxSpur$hypergeom.pval) ~ dt.CmurxSpur$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - dt.CmurxSpur$hypergeom.padj) ~ dt.CmurxSpur$logodds, pch=16, col="red")

# logods dotplots
gox1 <- split(DT.CmurxUPrt, DT.CmurxUPrt$Ontology)
gox2 <- split(DT.SpurxUPrt, DT.SpurxUPrt$Ontology)
gox3 <- split(DT.CmurxSpur, DT.CmurxSpur$Ontology)

gos.BP <- unique(sort(c(as.character(gox1$BP$GO),
                        as.character(gox2$BP$GO),
                        as.character(gox3$BP$GO))));
gos.BP.len <- length(gos.BP);
gos.MF <- unique(sort(c(as.character(gox1$MF$GO), 
                       as.character(gox2$MF$GO),
                        as.character(gox3$MF$GO))));
gos.MF.len <- length(gos.MF);
gos.CC <- unique(sort(c(as.character(gox1$CC$GO),
                        as.character(gox2$CC$GO),
                        as.character(gox3$CC$GO))));
gos.CC.len <- length(gos.CC);

plot(DT.CmurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO & gos.CC %in% DT.SpurxUPrt$GO],
     DT.SpurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO & gos.CC %in% DT.CmurxUPrt$GO])

PTBP.Cmur <- DT.CmurxUPrt$logodds[gos.BP %in% DT.CmurxUPrt$GO]
PTBP.Cmur[ is.na(PTBP.Cmur) ] <- 0
PTBP.Spur <- DT.SpurxUPrt$logodds[gos.BP %in% DT.CmurxUPrt$GO]
PTBP.Spur[ is.na(PTBP.Spur) ] <- 0
plot(PTBP.Cmur,PTBP.Spur)

PTMF.Cmur <- DT.CmurxUPrt$logodds[gos.MF %in% DT.CmurxUPrt$GO]
PTMF.Cmur[ is.na(PTMF.Cmur) ] <- 0
PTMF.Spur <- DT.SpurxUPrt$logodds[gos.MF %in% DT.CmurxUPrt$GO]
PTMF.Spur[ is.na(PTMF.Spur) ] <- 0
plot(PTMF.Cmur,PTMF.Spur)

PTCC.Cmur <- DT.CmurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO]
PTCC.Cmur[ is.na(PTCC.Cmur) ] <- 0
PTCC.Spur <- DT.SpurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO]
PTCC.Spur[ is.na(PTCC.Spur) ] <- 0
plot(PTCC.Cmur,PTCC.Spur)

# barplot tests
st.CmurxUPrt <- head(dt.CmurxUPrt[order(dt.CmurxUPrt$hypergeom.padj), ], n=500)
st.SpurxUPrt <- head(dt.SpurxUPrt[order(dt.SpurxUPrt$hypergeom.padj), ], n=500)

st.CmurxSpur <- head(dt.CmurxSpur[order(dt.CmurxSpur$hypergeom.padj), ], n=500)

par.def<-par(mar=c(4,25,1,1)+0.1);
barplot(st.CmurxSpur$logodds,names.arg=paste(st.CmurxSpur$Term," [",st.CmurxSpur$GO,"]",sep=""),horiz=TRUE,las=2)
par(par.def);

# K<-list(CC="red",BP="green",MF="blue")
K<-list(CC="#E94F6C",BP="#52E853",MF="#6FAFF9")
k<-list(CC="#A24436",BP="#0A945C",MF="#007CBA")

gord.CmurxSpur <- st.CmurxSpur[order(st.CmurxSpur$Ontology, -st.CmurxSpur$logodds), ]
par.def<-par(mar=c(4,25,1,1)+0.1);
barplot(gord.CmurxSpur$logodds,
        names.arg=paste(gord.CmurxSpur$Term," [",gord.CmurxSpur$GO,"]",sep=""),
        col=as.character(K[gord.CmurxSpur$Ontology]),
        horiz=TRUE,las=2)
par(par.def);

ts.CmurxUPrt <- head(td.CmurxUPrt[order(td.CmurxUPrt$hypergeom.padj), ], n=500)
ts.SpurxUPrt <- head(td.SpurxUPrt[order(td.SpurxUPrt$hypergeom.padj), ], n=500)

# filtering 100 topmost and bottom overrepresented terms
tdu.CmurxSpur <- td.CmurxSpur[order(-td.CmurxSpur$logodds),]
tts.CmurxSpur <- rbind(head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "CC", ], n=100),
                       head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "BP", ], n=100),
                       head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "MF", ], n=100),
                       tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "CC", ], n=100),
                       tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "BP", ], n=100),
                       tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "MF", ], n=100))

KK<-list(CCU="#E94F6C", # lightred
         CCD="#A24436", # darkred
         BPU="#52E853", # lightgreen
         BPD="#426A2B", # darkgreen ... #0A945C
         MFU="#6FAFF9", # lightblue
         MFD="#007CBA") # darkblue

tts.CmurxSpur$OntoColors <- c(rep(KK["CCU"],100),rep(KK["BPU"],100),rep(KK["MFU"],100),
                              rep(KK["CCD"],100),rep(KK["BPD"],100),rep(KK["MFD"],100))

# GO comparison final barplots (for topmost and bottom 100 terms)
pdf(file="GOA/GOcomp_CmurxSpur_100low+100top.pdf", onefile=TRUE,
    width=8, height=24, bg="white", pointsize=10, useDingbats=FALSE);
ggord.CmurxSpur <- tts.CmurxSpur[order(tts.CmurxSpur$Ontology, -tts.CmurxSpur$logodds), ]
lpos<-c(length(ggord.CmurxSpur$Ontology[ ggord.CmurxSpur$Ontology=="BP" ]),
        length(ggord.CmurxSpur$Ontology[ ggord.CmurxSpur$Ontology=="BP" | ggord.CmurxSpur$Ontology=="CC" ]),
        length(ggord.CmurxSpur$Ontology)) # [ ggord.CmurxSpur$Ontology=="MF" ]
par.def<-par(mar=c(6,25,1,1)+0.1);
barplot(ggord.CmurxSpur$logodds[1:200],
        names.arg=paste(ggord.CmurxSpur$Term[1:200]," [",ggord.CmurxSpur$GO[1:200],"]",sep=""),
        col=as.character(ggord.CmurxSpur$OntoColors[1:200]), # K[ggord.CmurxSpur$Ontology]),
        horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i")
mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
text(8, 200*1.2, # lpos*1.2,
     labels="Biological\nProcess", # c("Biological Process", "Cellular Component", "Molecular Function"),
     col=as.character(KK["BPD"]), # KK[c("BPD","CCD","MFD")]),
     adj=c(1,1.5), font=2, cex=2)
abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black")
abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE)
barplot(ggord.CmurxSpur$logodds[201:400],
        names.arg=paste(ggord.CmurxSpur$Term[201:400]," [",ggord.CmurxSpur$GO[201:400],"]",sep=""),
        col=as.character(ggord.CmurxSpur$OntoColors[201:400]), # K[ggord.CmurxSpur$Ontology]),
        horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i")
mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
text(8, 200*1.2, # lpos*1.2,
     labels="Cellular\nComponent", # c("Biological Process", "Cellular Component", "Molecular Function"),
     col=as.character(KK["CCD"]), # as.character(KK[c("BPD","CCD","MFD")]),
     adj=c(1,1.5), font=2, cex=2)
abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black")
abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE)
barplot(ggord.CmurxSpur$logodds[401:600],
        names.arg=paste(ggord.CmurxSpur$Term[401:600]," [",ggord.CmurxSpur$GO[401:600],"]",sep=""),
        col=as.character(ggord.CmurxSpur$OntoColors[401:600]), # K[ggord.CmurxSpur$Ontology]),
        horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i")
mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
text(8, 200*1.2, # lpos*1.2,
     labels="Molecular\nFunction", # c("Biological Process", "Cellular Component", "Molecular Function"),
     col=as.character(KK["MFD"]), # as.character(KK[c("BPD","CCD","MFD")]),
     adj=c(1,1.5), font=2, cex=2)
abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black")
abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE)
par(par.def);
dev.off();

# filtering 25 topmost and bottom overrepresented terms
tdo.CmurxSpur <- td.CmurxSpur[order(-td.CmurxSpur$logodds),]
tts2.CmurxSpur <- rbind(head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "CC", ], n=25),
                        head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "BP", ], n=25),
                        head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "MF", ], n=25),
                        tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "CC", ], n=25),
                        tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "BP", ], n=25),
                        tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "MF", ], n=25))
tts2.CmurxSpur$OntoColors <- c(rep(KK["CCU"],25),rep(KK["BPU"],25),rep(KK["MFU"],25),
                               rep(KK["CCD"],25),rep(KK["BPD"],25),rep(KK["MFD"],25))

pdf(file="GOA/GOcomp_CmurxSpur_25low+25top.pdf",
    width=11, height=16, bg="white", pointsize=12, useDingbats=FALSE);
ggord2.CmurxSpur <- tts2.CmurxSpur[order(tts2.CmurxSpur$Ontology, -tts2.CmurxSpur$logodds), ]
lpos<-c(length(ggord2.CmurxSpur$Ontology[ ggord2.CmurxSpur$Ontology=="BP" ]),
        length(ggord2.CmurxSpur$Ontology[ ggord2.CmurxSpur$Ontology=="BP" | ggord2.CmurxSpur$Ontology=="CC" ]),
        length(ggord2.CmurxSpur$Ontology)) # [ ggord2.CmurxSpur$Ontology=="MF" ]
par.def<-par(mar=c(6,25,1,1)+0.1);
barplot(ggord2.CmurxSpur$logodds,
        names.arg=paste(ggord2.CmurxSpur$Term," [",ggord2.CmurxSpur$GO,"]",sep=""),
        col=as.character(ggord2.CmurxSpur$OntoColors),
        horiz=TRUE, xlim=c(-3,7.5), border=NA, las=1, cex.names=0.5, yaxs = "i")
mtext("\nCmur/Spur log odds ratio\n[dark:lower25 + light:top25]", side=1, cex=1.5, adj=0.5, padj=1, xpd=TRUE)
text(7, lpos*1.2,
     labels=c("Biological\nProcess", "Cellular\nComponent", "Molecular\nFunction"),
     col=as.character(KK[c("BPD","CCD","MFD")]),
     adj=c(1,1.5), font=2, cex=2)
abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black")
abline(h=lpos*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE)
abline(h=seq(5,145,5)*1.2, lty="dotted", lwd=0.25, col="black", xpd=TRUE)
par(par.def);
dev.off();

# 
pdf(file="GOA/GOcomp_CmurxSpur_hypergeom_adjpval_1-5_signif.pdf",
    width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE);
tdc.CmurxSpur <- dt.CmurxSpur[order(-dt.CmurxSpur$logodds),]
ggordco.CmurxSpur <- tdc.CmurxSpur[order(tdc.CmurxSpur$Ontology, -tdc.CmurxSpur$logodds), ]
lpos<-c(length(ggordco.CmurxSpur$Ontology[ ggordco.CmurxSpur$Ontology=="BP" ]),
        length(ggordco.CmurxSpur$Ontology[ ggordco.CmurxSpur$Ontology=="BP" | ggordco.CmurxSpur$Ontology=="CC" ]),
        length(ggordco.CmurxSpur$Ontology)) # [ ggordco.CmurxSpur$Ontology=="MF" ]
par.def<-par(mar=c(10,24,1,1)+0.1);
barplot(ggordco.CmurxSpur$logodds,
        names.arg=paste(ggordco.CmurxSpur$Term," [",ggordco.CmurxSpur$GO,"]",sep=""),
        col=as.character(K[ggordco.CmurxSpur$Ontology]),
        horiz=TRUE, xlim=c(0,7.5), border=NA, las=1)
mtext("\nCmur/Spur log odds ratio\n(hypergeometric test adjusted p-value < 10-5)", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
text(7, lpos*1.2,
     labels=c("BP", "CC", "MF"),
     col=c(as.character(k[c("CC","BP","MF")])),
     adj=c(1,1.25), font=2, cex=2)
abline(v=seq(0,7,1), lty="dotted", lwd=0.5, col="black")
par(par.def);
dev.off();

# GO wordclouds?
library("GOsummaries");

wcd1 <- data.frame(Term = st.CmurxUPrt$Term,
                  Score = st.CmurxUPrt$hypergeom.padj)
wcd2 <- data.frame(Term = st.SpurxUPrt$Term,
                  Score = st.SpurxUPrt$hypergeom.padj)
wcd3 <- data.frame(Term = st.CmurxSpur$Term,
                  Score = st.CmurxSpur$hypergeom.padj)

gs <- gosummaries(wc_data = list(Results1 = wcd1, Results2 = wcd2))
 gs <- gosummaries(wc_data = list(Results3 = wcd3))
plot(gs)

Software Versions

Linux Debian wheezy (7.3), kernel 3.2.51-1(x86_64)
GNU bash 4.2.37(1)
GNU Awk 4.0.1
perl 5.14.2
R 3.1.2
- Bioconductor version 3.0 ( BiocInstaller 1.16.2)
sff2fastq 0.9.0 https://github.com/indraniel/sff2fastq
FASTX tools 0.0.13.2 http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
Trimmomatic 0.32 http://www.usadellab.org/cms/?page=trimmomatic
NewBler 2.3(091027_1459) 454 Life Sciences Corp
Trinity Trinityrnaseq_r20131110 http://trinityrnaseq.github.io/
SOAPdenovo-Trans 1.03 http://soap.genomics.org.cn/SOAPdenovo-Trans.html
viroBLAST http://indra.mullins.microbiol.washington.edu/viroblast/viroblast.php
GBrowse2 http://gmod.org/wiki/GBrowse

References

[1] Stein LD.
    “Using GBrowse 2.0 to visualize and share next-generation sequence data.”
    Brief Bioinform. 14(2):162-71, 2013 PMID.
[2] Deng W, Nickle DC, Learn GH, Maust B, Mullins JI.
    “ViroBLAST: A stand-alone BLAST web server for flexible queries of multiple databases and user’s datasets.”
    Bioinformatics, 23(17):2334-2336, 2007 PMID.