Loading…
 
Coscinasterias muricata Coelomic Epithelium Transcriptome

Introduction


Coscinasterias is a cosmopolitan genus of large and sea stars with the ability of somatic fission as a clonal reproductive strategy. Of these, C. muricata from New Zealand is particularly interesting because it reproduces exclusively sexually in certain geographical areas and mostly by cloning in others. During fission, the animals tear themselves apart across their central disc, where after the lost body parts are regenerated. We have sequenced and subsequently analyzed the transcriptome of the coelomic epithelium of a clonal C. muricata specimen. This work has been submitted to Marine Genomics (Gabre et al, “The coelomic epithelium transcriptome from a clonal sea star, Coscinasterias muricata”, 2015). If you find the following supplementary material useful for your research, please cite that paper.

Online resources

Transcriptome Browser


We have uploaded the transcript sequences into a GBrowse2 [1] engine to get a transcriptome browser, where many of the datasets produced by the protocols on this supplementary material web page were projected over each sequence.

The BLAST web form


We have adapted viroBLAST [2] to fit into our CMS system. That BLAST web service has already defined two nucleotide databases for the reads, both raw 454 sequences and cleaned reads (by Trimmomatic) and three databases for the transcriptomes produced, including SOAPdenovo-Trans (recommended), Trinity, and NewBler. There are two amino acid databases for each of those three transcriptome assemblies, made upon the six frame translations of the nucleotide sequences for the contigs, as well as taking the longest ORF from that set separately. Let us know if there is any problem using this service.

Download Files


The following table summarizes the sequence files along with their approximate byte size. All the files were compressed using bzip2.

Set Raw 454 Data Raw Reads Clean Reads
(FastX)
Clean Reads
(Trimmomatic)
G0CB7HT03 SFF (230MB) FASTQ (42MB) FASTQ (7.3MB) FASTQ (25MB)
G0CB7HT04 SFF (239MB) FASTQ (42MB) FASTQ (8.5MB) FASTQ (21MB)

Assembly NewBler (RAW) NewBler (FX) NewBler (TR) Trinity (FX) Trinity (TR) SOAPdenovoTrans (FX) SOAPdenovoTrans (TR)
Contigs FASTA (3.1MB) FASTA (235KB) FASTA (1.3MB) FASTA (454KB) FASTA (2.7MB) FASTA (182KB) FASTA (1.5MB)

“FX” and “TR” correspond to the assembly run using the FastX or the Trimmomatic protocols to clean the 454 reads respectively, you can find the assembly details on the following methods section.

The raw data has been also submitted to the NCBI-SRA database:

BioSample SAMN03371637
Experiment SRX889239
SRA Run SRR1816871

Material and Methods


This section contains a detailed description of the computational protocols that produced the current C. muricata transcriptome version, as well as its annotation. Code blocks for all the steps show the command-line arguments passed to the different tools used all along the analyses performed.

Pre-Assembly Procedures

Raw Sequence Data


Sequence reads were produced at Amplicon Express, using half plate of a Roche GS FLX+ 454-Titanium sequencer. sff2fastq was used to retrieve reads sequence and qualities in FASTQ format from the original standard flowgram format (SFF) files.

Retrieving sequences from the raw 454 data SFF files
# BD is the shell variable for the project base directory.

# Using sff2fastq to extract nucleotides and qualities from SFF files into fastq

for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    sff2fastq -o $BD/seqs/$DS.fq \
                 $BD/raw_data/$DS.sff \
              2> $BD/seqs/$DS.sff2fastq.log

  }; done

Some basic stats on raw reads
for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    echo "# Running commands on $DS" 1>&2

    # Basic read length and GC content summary using EMBOSS infoseq

    infoseq -noheading -only -name -length -pgc \
            fastq::$BD/seqs/$DS.fq \
                 > $BD/seqs/$DS.info.tbl

    gawk '{ S+=$2; N++ }
      END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; }
         ' $BD/seqs/$DS.info.tbl

    # Using a custom R script to plot Length vs GC content scatterplots

    Rscript Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/$DS.info.tbl

    # QC analysis using FastQC

    mkdir $BD/seqs/$DS.QC
    fastqc -f fastq --extract --outdir $BD/seqs/$DS.QC $BD/seqs/$DS.fq

  }; done


Set Total seqs Total length (bp) Avg Length (bp)
G0CB7HT03 194,344 77,400,490 398.265
G0CB7HT04 195,424 70,224,129 359.342
TOTAL 389,768 147,624,619 378.750 

Set Quality per Position Nucleotide Content GC distribution Length vs GC content
G0CB7HT03 Quality distribution per base position for the starfish raw reads, set G0CB7HT03. Nucleotide content distribution per base position for the starfish raw reads, set G0CB7HT03. GC content distribution along all raw read sequences: G0CB7HT03. Length vs GC content plot for the starfish raw reads, set G0CB7HT03.
G0CB7HT04 Quality distribution per base position for the starfish raw reads, set G0CB7HT04. Nucleotide content distribution per base position for the starfish raw reads, set G0CB7HT04. GC content distribution along all raw read sequences: G0CB7HT03. Length vs GC content plot for the starfish raw reads, set G0CB7HT04.


Cleaning Raw Reads


There are several tools that can help cleaning the raw reads before the assembly, we used here those provided by FASTX tools suite. The tools it provides facilitate the pipelining of filters to clean up the read sequences. However, fastq_quality_filter does not trim low quality ends, it removes any read that does not meet the criteria of a minimum quality score (-+q=20+- for Phred Q20) across a given length percent (-+p=50+- in our case). fastx_artifacts_filter has not found any issue on our data set, thus the cleaning was mainly performed by the fastx_clipper steps.

Cleaning reads sequences with FASTX tools
for DS in G0CB7HT03 G0CB7HT04 ;
  do {

   # FastX Quality filtering
   fastq_quality_filter -i $BD/seqs/${DS}.fq -o $BD/seqs/${DS}_q20.fq -p 50 -q 20 -Q 33

   # FastX Artifact filtering
   fastx_artifacts_filter -i $BD/seqs/${DS}_q20.fq -o $BD/seqs/${DS}_q20_noartfct.fq -Q 33

   # Fastx Adapters clipping
   fastx_clipper                 -i $BD/seqs/${DS}_q20_noartfct.fq \
                                 -a CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG -Q 33 | \
     fastx_clipper               -a CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG -Q 33 | \
       fastx_clipper             -a GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC -Q 33 | \
         fastx_clipper           -a TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG   -Q 33 | \
           fastx_clipper         -a CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA   -Q 33 | \
             fastx_clipper       -a GCCTCCCTCGCGCCATCAG                          -Q 33 | \
               fastx_clipper     -a GCCTTGCCAGCCCGCTCAG                          -Q 33 | \
                 fastx_clipper   -a CGTATCGCCTCCCTCGCGCCATCAG                    -Q 33 | \
                   fastx_clipper -a CTATGCGCCTTGCCAGCCCGCTCAG                    -Q 33   \
                                 -o $BD/seqs/${DS}_q20_noartfct_clipped.fq

  }; done

Some basic stats on cleaned reads
for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    for FL in q20 q20_noartfct q20_noartfct_clipped ;
      do {

        echo "# Running commands on $DS _ $FL" 1>&2

        # Basic read length and GC content summary using EMBOSS infoseq

        infoseq -noheading -only -name -length -pgc \
                fastq::$BD/seqs/${DS}_${FL}.fq \
                     > $BD/seqs/${DS}_${FL}.info.tbl

        gawk '{ S+=$2; N++ }
           END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; }
             ' $BD/seqs/${DS}_${FL}.info.tbl

        # Using a custom R script to plot Length vs GC content scatterplots

        Rscript Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/${DS}_${FL}.info.tbl

        # QC analysis using FastQC

        mkdir $BD/seqs/${DS}_${FL}.QC
        fastqc -f fastq --extract --outdir $BD/seqs/${DS}_${FL}.QC $BD/seqs/${DS}_${FL}.fq

      }; done

  }; done

Set Step Total seqs Total length (bp) Avg Length (bp)
G0CB7HT03 raw 194,344 77,400,490 398.265
>=Q20 193,763 77,330,283 399.097
+Artifacts 193,763 77,330,283 399.097
+Adapters 169,557 13,286,454 78.360
G0CB7HT04 raw 195,424 70,224,129 359.342
>=Q20 194,415 70,085,189 360.493
+Artifacts 194,415 70,085,189 360.493
+Adapters 170,063 13,371,236 78.625 

Final Set Step Quality per Position Nucleotide Content GC distribution Length vs GC content
G0CB7HT03 >=Q20 Quality distribution per base position for all starfish reads after Phred Q20 cleaning: G0CB7HT03. Nucleotide content distribution per base position for all starfish reads after Phred Q20 cleaning: G0CB7HT03. GC content distribution along all read sequences after Phred Q20 cleaning: G0CB7HT03. Length vs GC content plot for all starfish reads after Phred Q20 cleaning: G0CB7HT03.
+Adapters Quality distribution per base position for all starfish reads after Phred Q20 + artifact + adapter cleaning: G0CB7HT03. Nucleotide content distribution per base position for all starfish reads after Phred Q20 + artifact + adapter cleaning: G0CB7HT03. GC content distribution along all read sequences after Phred Q20 + artifact + adapter cleaning: G0CB7HT03. Length vs GC content plot for all starfish reads after Phred Q20 + artifact + adapter cleaning: G0CB7HT03.
G0CB7HT04 >=Q20 Quality distribution per base position for all starfish reads after Phred Q20 cleaning: G0CB7HT04. Nucleotide content distribution per base position for all starfish reads after Phred Q20 cleaning: G0CB7HT04. GC content distribution along all read sequences after Phred Q20 cleaning: G0CB7HT04. Length vs GC content plot for all starfish reads after Phred Q20 cleaning: G0CB7HT04.
+Adapters Quality distribution per base position for all starfish reads after Phred Q20 + artifact + adapter cleaning: G0CB7HT04. Nucleotide content distribution per base position for all starfish reads after Phred Q20 + artifact + adapter cleaning: G0CB7HT04. GC content distribution along all read sequences after Phred Q20 + artifact + adapter cleaning: G0CB7HT04. Length vs GC content plot for all starfish reads after Phred Q20 + artifact + adapter cleaning: G0CB7HT04.


After this step, reads sequence length distribution shows a dramatic decrease, while the GC content seems to fit perfectly over the theoretical distribution. We tried a more integrated approach to see if we can improve the cleaning step on the quality trimming, which was based on Trimmomatic.

Cleaning reads sequences with Trimmomatic
export TMC=/opt/install/Trimmomatic-0.32/
for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    java -jar $TMC/trimmomatic-0.32.jar  SE  \
        $BD/seqs/${DS}.fq  $BD/seqs/${DS}.trimmomatic.fq \
        ILLUMINACLIP:$TMC/adapters/454_adapters-SE.fa:2:30:10 \
        LEADING:15 TRAILING:15 SLIDINGWINDOW:4:15 MINLEN:30 TOPHRED33 \
        2> $BD/seqs/${DS}.trimmomatic.log 1>&2

  }; done

Some basic stats on Trimmomatic cleaned reads
for DS in G0CB7HT03 G0CB7HT04 ;
  do {

    echo "# Running commands on $DS" 1>&2

    # Basic read length and GC content summary using EMBOSS infoseq

    infoseq -noheading -only -name -length -pgc \
            fastq::$BD/seqs/${DS}.trimmomatic.fq \
                 > $BD/seqs/${DS}.trimmomatic.info.tbl

    gawk '{ S+=$2; N++ }
       END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; }
         ' $BD/seqs/${DS}.trimmomatic.info.tbl

    # Using a custom R script to plot Length vs GC content scatterplots

    Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/${DS}.trimmomatic.info.tbl

    # QC analysis using FastQC

    mkdir $BD/seqs/${DS}.trimmomatic.QC
    fastqc -f fastq --extract --outdir $BD/seqs/${DS}.trimmomatic.QC $BD/seqs/${DS}.trimmomatic.fq

  }; done

# Making the boxplots
Rscript  Compare_Vector_Distributions.R \
    "Reads Length Distribution::NA::Sequence Length (bp)" \
    blastout/violin.G0CB7HT03_reads_lengthdist.png   \
    "Raw reads${RET}::2:: ::"seqs/G0CB7HT03.info.tbl"$SPC" \
    "FASTX${RET}q20::2:: ::"seqs/G0CB7HT03_q20.info.tbl"$NBC"                           \
    "FASTX${RET}q20+art::2:: ::"seqs/G0CB7HT03_q20_noartfct.info.tbl"$NBC"              \
    "FASTX${RET}q20+art+clip::2:: ::"seqs/G0CB7HT03_q20_noartfct_clipped.info.tbl"$NBC" \
    "Trimmomatic${RET}::2:: ::"seqs/G0CB7HT03.trimmomatic.info.tbl"$TRC"

Rscript  Compare_Vector_Distributions.R \
    "Reads Length Distribution::NA::Sequence Length (bp)" \
    blastout/violin.G0CB7HT04_reads_lengthdist.png   \
    "Raw reads${RET}::2:: ::"seqs/G0CB7HT04.info.tbl"$SPC" \
    "FASTX${RET}q20::2:: ::"seqs/G0CB7HT04_q20.info.tbl"$NBC"                           \
    "FASTX${RET}q20+art::2:: ::"seqs/G0CB7HT04_q20_noartfct.info.tbl"$NBC"              \
    "FASTX${RET}q20+art+clip::2:: ::"seqs/G0CB7HT04_q20_noartfct_clipped.info.tbl"$NBC" \
    "Trimmomatic${RET}::2:: ::"seqs/G0CB7HT04.trimmomatic.info.tbl"$TRC"

Set Total seqs Total length (bp) Avg Length (bp)
G0CB7HT03 raw 194,344 77,400,490 398.265
Trimmomatic 180,024 50,285,583 279.327
G0CB7HT04 raw 195,424 70,224,129 359.342
Trimmomatic 166,177 38,285,817 230.392 

Final Set Quality per Position Nucleotide Content GC distribution Length vs GC content
G0CB7HT03 Quality distribution per base position for all starfish reads after Trimmomatic cleaning: G0CB7HT03. Nucleotide content distribution per base position for all starfish reads after Trimmomatic cleaning: G0CB7HT03. GC content distribution along all read sequences after Trimmomatic cleaning: G0CB7HT03. Length vs GC content plot for all starfish reads after Trimmomatic cleaning: G0CB7HT03.
G0CB7HT04 Quality distribution per base position for all starfish reads after Trimmomatic cleaning: G0CB7HT04. Nucleotide content distribution per base position for all starfish reads after Trimmomatic cleaning: G0CB7HT04. GC content distribution along all read sequences after Trimmomatic cleaning: G0CB7HT04. Length vs GC content plot for all starfish reads after Trimmomatic cleaning: G0CB7HT04.


The improved nucleotide qualities after processing reads by Trimmomatic does not affect significantly to the length of the reads as in the previous procedure, as it is shown on the Length versus GC content scatter-plots on the right column of the above table. For comparison purposes, the reads length distribution after each procedure is shown in the violin plots below:

G0CB7HT03
G0CB7HT04
Violinplot_summary_reads_length_distribution_G0CB7HT03 Violinplot_summary_reads_length_distribution_G0CB7HT04

Assembly Overview

For this step three different assemblers were tested: NewBler (the reference assembler for 454 sequences), trinity (routinely used for NGS transcriptome assemblies), and SOAPdenovo-Trans (a variant of SOAPdenovo specific for transcriptomes). The latter two are modern assembly tools, yet it has been reported that SOAP-denovo produces less chimeric assemblies (). In this sequence set there is no paired-end data, yet we expect that the lack of paired-end data will be compensated by the longer reads we got from the 454 sequencing experiment.

NewBler

Assembling reads using NewBler: reads cleaned by FastX
# Preparing the project folder
newAssembly -cdna starfish_newbler_fastx

# Preparing the reads sequences
addRun starfish_newbler_fastx ./seqs/G0CB7HT03_q20_noartfct_clipped.fasta \
                              ./seqs/G0CB7HT04_q20_noartfct_clipped.fasta
# 2 read files successfully added.
#     G0CB7HT03_q20_noartfct_clipped.fasta
#     G0CB7HT04_q20_noartfct_clipped.fasta

# Running the assembler
nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler_fastx  \
                       2> starfish_newbler_fastx/starfish_newbler_fastx.log 1>&2 &
Assembling reads using NewBler: reads cleaned by Trimmomatic
# Preparing the project folder
newAssembly -cdna starfish_newbler_trimmo

# Preparing the reads sequences
addRun starfish_newbler_trimmo ../seqs/G0CB7HT03.trimmomatic.fasta ../seqs/G0CB7HT04.trimmomatic.fasta
# 2 read files successfully added.
#     G0CB7HT03.trimmomatic.fasta
#     G0CB7HT04.trimmomatic.fasta

# Running the assembler
nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler_trimmo \
                       2> starfish_newbler_trimmo/starfish_newbler_trimmo.log 1>&2 &
Assembling reads using NewBler: using raw SFF files
# Preparing the project folder
newAssembly -cdna starfish_newbler

# Preparing the reads sequences
addRun starfish_newbler ../seqs/G0CB7HT03.sff ../seqs/G0CB7HT04.sff
# 2 read files successfully added.
#     G0CB7HT03.sff
#     G0CB7HT04.sff

# Running the assembler
nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler  \
                       2> starfish_newbler/starfish_newbler.log 1>&2 &

Trinity

Assembling reads using Trinity: reads cleaned by FastX
nohup Trinity.pl --seqType fq --JM 100G --CPU 32 -min_contig_length 200 \
                 --single ../seqs/G0CB7HT03_q20_noartfct_clipped.fq \
                          ../seqs/G0CB7HT04_q20_noartfct_clipped.fq \
                 --output ./starfish_trinity \
                 2> ./starfish_trinity/Trinity.log 1>&2 &

TrinityStats.pl starfish_trinity/Trinity.fasta
Assembling reads using Trinity: reads cleaned by Trimmomatic
nohup Trinity.pl --seqType fq --JM 100G --CPU 32 -min_contig_length 200 \
                 --single ../seqs/G0CB7HT03.trimmomatic.fq ../seqs/G0CB7HT04.trimmomatic.fq \
                 --output ./starfish_trinity_trimmo \
                 2> ./starfish_trinity_trimmo/Trinity.trimmomatic_fq.log 1>&2 &

TrinityStats.pl starfish_trinity_trimmo/Trinity.fasta

SOAPdenovo-Trans

As there are no Paired-End reads, scaffold sequences, aka transcripts, produced with the +-SOAPdenovo-Trans+- are the same as those from the contigs from step 2. The config files are available from these links: starfish.conf and starfish_trimmo.conf.

Assembling reads using SOAPdenovo-Trans: reads cleaned by FastX
# step1: computing pre-graph
SOAPdenovo-Trans-127mer pregraph -s ./starfish_soap/starfish.conf -K 63 \
                                 -o ./starfish_soap/starfish -p 32 \
                                 1> ./starfish_soap/starfish.pregraph.log \
                                 2> ./starfish_soap/starfish.pregraph.err
# step2: building contigs
SOAPdenovo-Trans-127mer contig   -g ./starfish_soap/starfish \
                                 1> ./starfish_soap/starfish.contig.log \
                                 2> ./starfish_soap/starfish.contig.err
# step3: mapping reads
SOAPdenovo-Trans-127mer map      -s ./starfish_soap/starfish.conf \
                                 -g ./starfish_soap/starfish -p 32 \
                                 1> ./starfish_soap/starfish.map.log \
                                 2> ./starfish_soap/starfish.map.err
# step4: scaffolding
SOAPdenovo-Trans-127mer scaff    -g ./starfish_soap/starfish -p 32 \
                                 1> ./starfish_soap/starfish.scaff.log \
                                 2> ./starfish_soap/starfish.scaff.err
Assembling reads using SOAPdenovo-Trans: reads cleaned by Trimmomatic
# step1
SOAPdenovo-Trans-127mer pregraph -s ./starfish_soap_trimmo/starfish_trimmo.conf -K 63 \
                                 -o ./starfish_soap_trimmo/starfish_trimmo -p 32 \
                                 1> ./starfish_soap_trimmo/starfish_trimmo.pregraph.log \
                                 2> ./starfish_soap_trimmo/starfish_trimmo.pregraph.err
# step2
SOAPdenovo-Trans-127mer contig   -g ./starfish_soap_trimmo/starfish_trimmo \
                                 1> ./starfish_soap_trimmo/starfish_trimmo.contig.log \
                                 2> ./starfish_soap_trimmo/starfish_trimmo.contig.err
# step3
SOAPdenovo-Trans-127mer map      -s ./starfish_soap_trimmo/starfish_trimmo.conf \
                                 -g ./starfish_soap_trimmo/starfish_trimmo -p 32 \
                                 1> ./starfish_soap_trimmo/starfish_trimmo.map.log \
                                 2> ./starfish_soap_trimmo/starfish_trimmo.map.err
# step4
SOAPdenovo-Trans-127mer scaff    -g ./starfish_soap_trimmo/starfish_trimmo -p 32 \
                                 1> ./starfish_soap_trimmo/starfish_trimmo.scaff.log \
                                 2> ./starfish_soap_trimmo/starfish_trimmo.scaff.err

Contigs Summary

Assembling reads using SOAPdenovo-Trans: reads cleaned by Trimmomatic
# Retrieve fasta sequences for all assemblies
for DS in starfish_newbler_fastx/assembly/454Isotigs.fna \
          starfish_newbler_trimmo/assembly/454Isotigs.fna \
          starfish_newbler/assembly/454Isotigs.fna \
          starfish_trinity/Trinity.fasta \
          starfish_trinity_trimmo/Trinity.fasta \
          starfish_soap/starfish.scafSeq \
          starfish_soap_trimmo/starfish_trimmo.scafSeq ;
  do {

    ds=${DS%%/*}

    echo "# Running commands on $DS -> $ds" 1>&2

    ln -vs $DS $ds.fa

    # Basic read length and GC content summary using EMBOSS infoseq

    infoseq -noheading -only -name -length -pgc \
            fasta::$ds.fa > $ds.info.tbl

    gawk '{ S+=$2; N++ }
       END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; }
         ' $ds.info.tbl

    # Using a custom R script to plot Length vs GC content scatterplots

    Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R $ds.info.tbl

  }; done

# Making boxplots
Rscript  Compare_Vector_Distributions.R \
    "Transcript Length Distribution::NA::Sequence Length (bp)" \
    blastout/violin.assemblies_lengthdist.png \
    "NewBler (RAW)::2:: ::"assemblies/starfish_newbler.info.tbl"$NBC"       \
    "NewBler (FX)::2:: ::"assemblies/starfish_newbler_fastx.info.tbl"$NBC"  \
    "NewBler (TR)::2:: ::"assemblies/starfish_newbler_trimmo.info.tbl"$NBC" \
    "Trinity (FX)::2:: ::"assemblies/starfish_trinity.info.tbl"$TRC"        \
    "Trinity (TR)::2:: ::"assemblies/starfish_trinity_trimmo.info.tbl"$TRC" \
    "SOAPdnT (FX)::2:: ::"assemblies/starfish_soap.info.tbl"$SPC"           \
    "SOAPdnT (TR)::2:: ::"assemblies/starfish_soap_trimmo.info.tbl"$SPC"

Rscript  Compare_Vector_Distributions_ylog.R \
    "Transcript Length Distribution::NA::Sequence Length (bp)" \
    blastout/violin.assemblies_lengthdist_ylog.png \
    "NewBler (RAW)::2:: ::"assemblies/starfish_newbler.info.tbl"$NBC"       \
    "NewBler (FX)::2:: ::"assemblies/starfish_newbler_fastx.info.tbl"$NBC"  \
    "NewBler (TR)::2:: ::"assemblies/starfish_newbler_trimmo.info.tbl"$NBC" \
    "Trinity (FX)::2:: ::"assemblies/starfish_trinity.info.tbl"$TRC"        \
    "Trinity (TR)::2:: ::"assemblies/starfish_trinity_trimmo.info.tbl"$TRC" \
    "SOAPdnT (FX)::2:: ::"assemblies/starfish_soap.info.tbl"$SPC"           \
    "SOAPdnT (TR)::2:: ::"assemblies/starfish_soap_trimmo.info.tbl"$SPC"
Assembly Run Assembled bases(bp) #Genes #Transcripts Avg GC% Min Length(bp) Avg Length(bp) Median Length(bp) Max Length(bp) N10 length N50 length N90 length
NewBler (FX) 761,693  2,057  2,241  42.88  12  340.5  244  6,566  429 
NewBler (TR) 4,773,068  4,525  5,447  43.57  18  876.3  607  10,814  1,019 
NewBler (RAW) 10,753,480  8,182  10,022  42.10  12  1,074.0  711  13,849  1,319 
Trinity (FX) 1,416,706  3,559  3,743  41.19  201  378.5  305  5,255  1,009  386 
Trinity (TR) 8,883,819  12,342  13,817  43.59  201  643.0  474  12,023  2,420  745 
SOAPdenovo-Trans (FX) 630,161  2,178  41.33  104  289.3  252  3,309  633  299  198
SOAPdenovo-Trans (TR) 5,212,378  11,369  43.61  100  458.5  339  9,225  1,712  529  241 

Reads Cleaning Protocol NewBler (RAW) NewBler Trinity SOAPdenovo-Trans
FastX* Length vs GC content scatterplot for the assembled sequences produced by NewBler (raw SFF). Length vs GC content scatterplot of the sequences assembled by NewBler from the clean sequences provided by our FastX protocol. Length vs GC content scatterplot for the assembled sequences produced by Trinity (using clean reads after FastX). Length vs GC content scatterplot for the assembled sequences produced by SOAPdenovo-Trans (using clean reads after FastX).
Trimmomatic Length vs GC content scatterplot for the assembled sequences produced by NewBler (using clean reads after Trimmomatic). Length vs GC content scatterplot for the assembled sequences produced by Trinity (using clean reads after Trimmomatic). Length vs GC content scatterplot for the assembled sequences produced by SOAPdenovo-Trans (using clean reads after Trimmomatic).

* For this row, NewBler results shown in the first column were obtained using the raw SFF files,
while all the other columns present results derived from the clean reads produced by our FastX protocol.


As better results were obtained after Trimmomatic cleaning, all the downstream analyses were performed over the NewBler “RAW”, Trinity “TR”, and SOAPdenovo-Trans “TR”.

Post-Assembly Procedures

Removing VBF genomic contamination

Just in case, we run BLAST search against a database of Viral, Bacterial, and Fungal genomes (VBF), to remove transcript sequences in the assembly that have a reasonable match to another microorganism that was processed along with the samples. We consider those transcripts as contaminant genomic sequence, and we remove from the transcript set.

cleanvbfseqs.sh
#!/bin/bash
#
# cleanvbfseqs.sh  query.fa  output_prefix
# 
#   Homoloy search over VBF genomes to remove potential contaminated sequences
#
# Note: set and export shell var BDBH, pointing to the BLAST DB VBF files
# 

# INPUT ARGS
QUERY=$1
OUTPREFIX=$2

# Minimum coverage for a contaminated sequence has to be larger than this
MINCOV=50
# BLAST PARAMETERS
EVAL=10e-25
BLSTOFMT='6 qseqid qlen sseqid slen qstart qend sstart send length score evalue bitscore '
BLSTOFMT=$BLSTOFMT'pident nident mismatch positive gapopen gaps ppos qframe sframe qseq sseq'
CPUS=32

# Checking Virus genomes
echo "### BLAST x Viral genomes..." 1>&2
blastn -task megablast \
       -query $QUERY   \
       -db $BDBH/viral_genomes           \
       -evalue $EVAL -outfmt "$BLSTOFMT" \
       -dust "yes" -num_threads $CPUS    \
               2> $OUTPREFIX.viral_genomes.log | \
    gzip -vc9 - > $OUTPREFIX.viral_genomes.out.gz

# Checking Bacterial genomes
echo "### BLAST x Bacterial genomes..." 1>&2
blastn -task megablast \
       -query $QUERY   \
       -db $BDBH/bacterial_genomes.raj   \
       -evalue $EVAL -outfmt "$BLSTOFMT" \
       -dust "yes" -num_threads $CPUS    \
               2> $OUTPREFIX.bacterial_genomes.log | \
    gzip -vc9 - > $OUTPREFIX.bacterial_genomes.out.gz

# Checking Fungal genomes
echo "### BLAST x Fungal genomes..." 1>&2
blastn -task megablast \
       -query $QUERY   \
       -db $BDBH/fungal_genomes.raj      \
       -evalue $EVAL -outfmt "$BLSTOFMT" \
       -dust "yes" -num_threads $CPUS    \
               2> $OUTPREFIX.fungal_genomes.log | \
    gzip -vc9 - > $OUTPREFIX.fungal_genomes.out.gz

# Calculating HSPs sequence coverage
echo "### Calculate sequence coverage..." 1>&2
for VBF in viral bacterial fungal;
  do {
    zcat $OUTPREFIX.${VBF}_genomes.out.gz | \
        coverage_blastshorttbl.pl --prog=blastn - \
                     > $OUTPREFIX.${VBF}_genomes.covtbl
  }; done

# Retrieving sequence ids for contaminated sequences:
#    criteria a significant match to viral/bacterial/fungal genomes
#             with more than 50% sequence sequence coverage
echo "### Retrieving sequence ids for contaminated sequences..." 1>&2
gawk -v COV=$MINCOV '($1 == "Q" && $13 > COV) { print $2, $14 }
     ' $OUTPREFIX.{viral,bacterial,fungal}_genomes.covtbl | \
  sort | uniq  > $OUTPREFIX.VBF_contaminated_seqs.QS_ids

gawk '{ print $1 }' $OUTPREFIX.VBF_contaminated_seqs.QS_ids | \
      sort | uniq > $OUTPREFIX.VBF_contaminated_seqs.ids

# Filtering out those sequences that DO NOT appear on the previous list
echo "### Filtering out vbf clean sequences..." 1>&2
grepID.pl -vAFN $OUTPREFIX.VBF_contaminated_seqs.ids \
                $QUERY \
              > $QUERY.vbfclean.fa

exit 0;
Cleaning VBF contamination
# Running the VBF cleaning script
bin/cleanvbfseqs.sh assemblies/starfish_newbler.fa \
                    blastcont/starfish_newbler \
                 2> blastcont/starfish_newbler.cleanvbfseqs.log 1>&2 &
bin/cleanvbfseqs.sh assemblies/starfish_trinity_trimmo.fa \
                    blastcont/starfish_trinity_trimmo \
                 2> blastcont/starfish_trinity_trimmo.cleanvbfseqs.log 1>&2 &
bin/cleanvbfseqs.sh assemblies/starfish_soap_trimmo.fa \
                    blastcont/starfish_soap_trimmo \
                 2> blastcont/starfish_soap_trimmo.cleanvbfseqs.log 1>&2 &

# Only few hits found and removed
wc -l blastcont/starfish_{newbler,trinity_trimmo,soap_trimmo}.VBF_contaminated_seqs.ids
  7 blastcont/starfish_newbler.VBF_contaminated_seqs.ids
 30 blastcont/starfish_trinity_trimmo.VBF_contaminated_seqs.ids
 22 blastcont/starfish_soap_trimmo.VBF_contaminated_seqs.ids

Finding putative CDS on transcripts

Finding longest ORFs to predict CDSs
while read IFA OFA;
  do {

    ifa=${IFA%.fa}

    # getting longest ORFs
    cdna2orfs.v2.pl -VSBNldu assemblies/$IFA \
                           > assemblies/$OFA.nuc.fa
    cdna2orfs.v2.pl -VSBldu  assemblies/$IFA \
                           > assemblies/$OFA.aa.fa

    # getting transcript length
    infoseq -noheading -only -name -length -pgc \
            fasta::assemblies/$IFA \
                 > assemblies/$ifa.nuc.info.tbl

    # getting longest ORFs length
    infoseq -noheading -only -name -length -pgc \
            fasta::assemblies/$OFA.nuc.fa \
                 > assemblies/$OFA.nuc.info.tbl

    Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R assemblies/$OFA.nuc.info.tbl

    # comparing transcript and longest ORFs lengths
    gawk 'BEGIN{ OFS="\t";
            while (getline<ARGV[1]>0) I[$1] = $2;
            ARGV[1] = "";
            print "ID", "TRlen", "ORFlen";
          }
          { gsub(/\.[012][\-+][0-9]+$/, "", $1);
	    if ($1 in I) {
              print $1, I[$1], $2;
              F[$1] = $2;
            } else {
              print $1, "NA", $2;
            }; 
          }
          END {
            for (n in I) {
              if (! (n in F)) print n, I[n], "NA";
            };
          }' assemblies/$ifa.nuc.info.tbl \
             assemblies/$OFA.nuc.info.tbl \
           > assemblies/$OFA.nuc.trcomp.tbl

     Rscript Transcript_vs_LongestORF_length_plot.R assemblies/$OFA.nuc.trcomp.tbl ${OFA%%.*} 25000

  }; done <<'EOF'
starfish_newbler.fa.vbfclean.fa         starfish_newbler.longest_orfs
starfish_trinity_trimmo.fa.vbfclean.fa  starfish_trinity_trimmo.longest_orfs
starfish_soap_trimmo.fa.vbfclean.fa     starfish_soap_trimmo.longest_orfs
EOF
# starfish_newbler       : From 10,149 cDNAs, 10,149 out of 1,143,474 ORFs were filtered...
# starfish_trinity_trimmo: From 13,787 cDNAs, 13,787 out of   933,239 ORFs were filtered...
# starfish_soap_trimmo   : From 11,347 cDNAs, 11,347 out of   559,743 ORFs were filtered...
NewBler (RAW) Trinity (TR) SOAPdenovo-Trans (TR)
ORFs length
vs GC Content
Transcript length versus GC content scatterplot for NewBler assembly from raw SFF data, after VBF cleaning. Transcript length versus GC content scatterplot for Trinity assembly from Trimmomatic cleaned reads, after VBF cleaning. Transcript length versus GC content scatterplot for SOAPdenovo-Trans assembly from Trimmomatic cleaned reads, after VBF cleaning.
Transcript
vs ORFs lengths
Transcript length versus longest ORF for the same transcript, both from NewBler assembly from raw SFF data, after VBF cleaning. Transcript length versus longest ORF for the same transcript, both from Trinity assembly from Trimmomatic cleaned reads, after VBF cleaning. Transcript length versus longest ORF for the same transcript, both from SOAPdenovo-Trans assembly from Trimmomatic cleaned reads, after VBF cleaning.

Fixing Strand of Transcript Sequences

In the previous section, the longest ORF was computed for each transcript, by considering all six possible coding frames (just to avoid transcripts assembled on the reverse strand). Here, that information is used to fix the orientation (strand) of all the transcript sequences that have its longest ORF encoded in the reverse strand. It is also a good moment to standardize the names of the final transcript sequences, as well as to sort them by size (longest first).

Fixing transcripts sequence orientation
while read BFA NFA SFX;
  do {

    echo "# working on $BFA" 1>&2;

    ( gawk '$1 ~ /\.[012]\+[0-9]+$/ {
              gsub(/\.[012]\+[0-9]+$/, "", $1);
              print $1;
            }' assemblies/$BFA.longest_orfs.nuc.info.tbl | \
               grepID.pl -vAF - assemblies/$NFA \
                      2> assemblies/$BFA.strfixFWD.log;
      gawk '$1 ~ /\.[012]\-[0-9]+$/ { 
              gsub(/\.[012]\-[0-9]+$/, "", $1);
              print $1;
            }' assemblies/$BFA.longest_orfs.nuc.info.tbl | \
               grepID.pl -vAF - assemblies/$NFA \
                             2> assemblies/$BFA.strfixREV.log | \
                 gawk '$1 ~ /^>/ { $1=$1"_rc"; }
                                 { print $0;  }'
      ) | fa2tbl.pl -  | \
          gawk '{ print length($2), $0 }' - | \
          sort -k1,1nr | \
            gawk -v sfx="$SFX" 'BEGIN{ C=1; }
              { sln=$1; id=$2; sq=$3; $1=$2=$3="";
                dsc=($0~/len(gth)?\=/) ? $0 : " length="sln" "$0;
                gsub(/ +/," ",dsc);
                gsub(/len\=/,"length=",dsc);
                gsub(/ path\=\[[^\]]+\]/,"",dsc);
                print sprintf("Cmur%s_TR%08d", sfx, C), sq, id, dsc;
                C++; }' - | \
          tbl2fa.pl -  > assemblies/$BFA.vbfclean_strandfix.fa

    # getting contig lengths
    infoseq -noheading -only -name -length -pgc \
            fasta::assemblies/$BFA.vbfclean_strandfix.fa \
                 > assemblies/$BFA.vbfclean_strandfix.info.tbl

  }; done <<'EOF'
starfish_newbler        starfish_newbler.fa.vbfclean.fa         NWBR
starfish_trinity_trimmo starfish_trinity_trimmo.fa.vbfclean.fa  TRIN
starfish_soap_trimmo    starfish_soap_trimmo.fa.vbfclean.fa     SOAP
EOF


Comparing Assemblers Output

Preparing BLAST target databases
while read FA TI;
  do {
    makeblastdb -in assemblies/$FA \
                -input_type fasta \
                -dbtype nucl \
                -title "$TI" \
                -out blastdb/$TI;
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF
# starfish_newbler       : added 10149 sequences in 0.801275 seconds.
# starfish_trinity_trimmo: added 13787 sequences in 0.947321 seconds.
# starfish_soap_trimmo   : added 11347 sequences in 0.677559 seconds.
Comparing transcriptome output: all-against-all
BLSTOFMT='6 qseqid qlen sseqid slen qstart qend sstart send length score evalue bitscore '
BLSTOFMT=$BLSTOFMT'pident nident mismatch positive gapopen gaps ppos qframe sframe qseq sseq'
export BLSTOFMT

# Running BLASTN and TBLASTX to find best pair-wise HSP hits
while read IFA IDB OUT;
  do {
    blastn -task blastn                \
           -query assemblies/$IFA      \
           -db blastdb/$IDB            \
           -strand 'plus'              \
           -outfmt "$BLSTOFMT"         \
           -evalue 10e-10 -num_threads 12        \
                  2> blastout/$OUT.blastn.log |  \
        gzip -c9 - > blastout/$OUT.blastn.out.gz &
    tblastx -query assemblies/$IFA \
            -db blastdb/$IDB       \
            -strand 'plus'         \
            -outfmt "$BLSTOFMT"    \
            -matrix BLOSUM90       \
            -evalue 10e-10 -num_threads 12        \
                  2> blastout/$OUT.tblastx.log |  \
        gzip -c9 - > blastout/$OUT.tblastx.out.gz &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_trinity_trimmo  sf_newbler-x-trinitytr
starfish_newbler.vbfclean_strandfix.fa         starfish_soap_trimmo     sf_newbler-x-soaptr
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_newbler         sf_trinitytr-x-newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_soap_trimmo     sf_trinitytr-x-soaptr
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_newbler         sf_soaptr-x-newbler
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_trinity_trimmo  sf_soaptr-x-trinitytr 
EOF

# Filtering HSPs to find best HITs
while read FL;
  do {
    zcat blastout/$FL.blastn.out.gz | \
        coverage_blastshorttbl.pl --prog=blastn - \
                     > blastout/$FL.blastn.covtbl &
    zcat blastout/$FL.tblastx.out.gz | \
        coverage_blastshorttbl.pl --prog=tblastx - \
                     > blastout/$FL.tblastx.covtbl &
  }; done <<'EOF'
sf_newbler-x-trinitytr
sf_newbler-x-soaptr   
sf_trinitytr-x-newbler
sf_trinitytr-x-soaptr 
sf_soaptr-x-newbler   
sf_soaptr-x-trinitytr 
EOF

Aligning reads to transcriptome contigs

Several tests were performed to assess which read mapper was going to align the long 454 reads over the transcript contigs (unpublished data). Bowtie2 yield the larger amount of mapped reads, therefore those commands are shown below.

Mapping 454 reads over the transcriptome contigs
# Preparing BOWTIE2 target databases (for the transcript contigs)
while read FA TI;
  do {
    bowtie2-build -f --large-index --seed 3523 \
                  ./assemblies/$FA \
                  ./bowtie2/$TI.bowtiedb \
               2> ./bowtie2/$TI.bowtiedb.log 1>&2 
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Mapping sequencing reads over assembled contigs
while read FA TI;
  do {
    nohup bowtie2 -q --all --phred33 --seed 3523 \
                  --local --very-sensitive-local \
                  --no-unal --met 60 --threads 32 \
                  --met-file  ./bowtie2/$TI.bowtie.metrics \
                  -x          ./bowtie2/$TI.bowtiedb \
                  -U          ./seqs/G0CB7HT03.trimmomatic.fq \
                  -U          ./seqs/G0CB7HT04.trimmomatic.fq \
                  -S          ./bowtie2/$TI.bowtie.UPL.sam \
                  --un-gz     ./bowtie2/$TI.bowtie.UPL.unaln.gz \
                  2>          ./bowtie2/$TI.bowtie.UPL.log 1>&2 &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Getting contig coverage tables
while read FA TI;
  do {
    printf "# Working on $TI " 1>&2
    samtools view -bS   ./bowtie2/$TI.bowtie.UPL.sam \
                      > ./bowtie2/$TI.bowtie.UPL.bam \
                     2> ./bowtie2/$TI.bowtie.UPL.sam2bam.log
    printf "... SAM2BAM " 1>&2
    samtools sort       ./bowtie2/$TI.bowtie.UPL.bam \
                        ./bowtie2/$TI.bowtie.UPL.sorted \
                     2> ./bowtie2/$TI.bowtie.UPL.sorted.log
    printf "... BAMSORT " 1>&2
    samtools index      ./bowtie2/$TI.bowtie.UPL.sorted.bam
    printf "... BAMSIDX " 1>&2
    samtools mpileup -f ./assemblies/$FA \
                        ./bowtie2/$TI.bowtie.UPL.sorted.bam \
                      > ./bowtie2/$TI.bowtie.UPL.pileup \
                     2> ./bowtie2/$TI.bowtie.UPL.pileup.log
    printf "... PILEUP " 1>&2
    samtools depth      ./bowtie2/$TI.bowtie.UPL.sorted.bam \
                      > ./bowtie2/$TI.bowtie.UPL.pileup_depthonly
    printf "... DEPTH\n" 1>&2
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF
Estimating expression levels from sequence coverage
# 
while read FA TI;
  do {
    echo "# Working on $TI " 1>&2

    gawk 'BEGIN{
            while (getline<ARGV[1]>0) { SQL[$1]=$2; TSQL+=$2; };
            ARGV[1]="";
          }
          { if (!($1 in sc)) NT++;
            SC[$1]+=$4; sc[$1]++;
            TC+=$4; tc++;
            if (CVmax[$1] == 0) { CVmin[$1] = $4; CVmin["TOT"] = $4; };
            CVmax[$1] = CVmax[$1] < $4 ? $4 : CVmax[$1];
            CVmin[$1] = CVmin[$1] > $4 ? $4 : CVmin[$1];
            CVmax["TOT"] = CVmax["TOT"] < $4 ? $4 : CVmax["TOT"];
            CVmin["TOT"] = CVmin["TOT"] > $4 ? $4 : CVmin["TOT"];
          }
          END{
            printf "TRID\tLENGTH\tNUCL_TOTAL\tCOVG_SUM\tCOVG_MIN\tCOVG_MAX\tCOVG_AVGall\tCOVG_AVGmap\n";
            printf "TOTALseqs:%d\t%d\t%d\t%d\t%d\t%d\t%.3f\t%.3f\n", NT, TSQL, tc, TC, CVmin["TOT"], CVmax["TOT"], TC/TSQL, TC/tc;
            for (n in SQL) {
              printf "%s\t%d\t%d\t%d\t%d\t%d\t%.3f\t%.3f\n", n, SQL[n], sc[n], SC[n], CVmin[n], CVmax[n],
                                                             (n in SQL && SQL[n] != 0) ? SC[n]/SQL[n] : "NA",
                                                             (n in sc  && sc[n]  != 0) ? SC[n]/sc[n]  : "NA";
            }
          }' ./assemblies/$FA.info.tbl       \
             ./bowtie2/$TI.bowtie.UPL.pileup \
           > ./bowtie2/$TI.bowtie.UPL.pileup.summary

    egrep -v '^TRID|TOTAL' ./bowtie2/$TI.bowtie.UPL.pileup.summary \
                         > ./bowtie2/$TI.bowtie.UPL.pileup.sum2plt
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix     starfish_soap_trimmo
EOF

Having only one sample we cannot perform a differential expression analysis, yet we can check differences in coverage that will correspond to over-expressed transcripts on the sampled tissue. Sorting by average coverage of reads per transcript, we got the following top ten contigs for each of the three assemblies:

Sequences with top 10 average coverage levels
sort -k8,8nr ./bowtie2/starfish_newbler.bowtie.UPL.pileup.summary | head
CmurNWBR_TR00009750     425     425    1272737    371     3401    2994.675    2994.675
CmurNWBR_TR00005436     674     674    1202455    625     2334    1784.058    1784.058
CmurNWBR_TR00003299     957     957    1460945     73     2012    1526.588    1526.588
CmurNWBR_TR00008353     497     497     526194     37     1860    1058.740    1058.740
CmurNWBR_TR00006255     607     607     597169      4     1817     983.804     983.804
CmurNWBR_TR00005569     662     662     642126      5     1454     969.979     969.979
CmurNWBR_TR00009130     472     472     444261     26     1514     941.231     941.231
CmurNWBR_TR00008452     494     481     442215      2     1312     895.172     919.366
CmurNWBR_TR00003245     973     973     776495      9     1157     798.042     798.042
CmurNWBR_TR00006095     620     620     475386      3     1326     766.752     766.752

sort -k8,8nr ./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.summary | head
CmurTRIN_TR00000205    2602    2602    5457543      1     6635    2097.442    2097.442
CmurTRIN_TR00000032    5220    5220    5572076      1     3409    1067.448    1067.448
CmurTRIN_TR00000126    3155    3151    2799743      1     2045     887.399     888.525
CmurTRIN_TR00000028    5474    5474    2795886      1     1157     510.757     510.757
CmurTRIN_TR00000589    1686    1659     680445      1     1140     403.585     410.154
CmurTRIN_TR00000027    5543    5543    2156262      1      959     389.006     389.006
CmurTRIN_TR00000246    2406    2406     854651      1      912     355.217     355.217
CmurTRIN_TR00000212    2560    2560     886427      1      835     346.261     346.261
CmurTRIN_TR00000585    1689    1689     449898      1      628     266.369     266.369
CmurTRIN_TR00000637    1635    1635     434485     38      503     265.740     265.740

sort -k8,8nr ./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.summary | head
CmurSOAP_TR00003834     428     428    1796798    317     5671    4198.126    4198.126
CmurSOAP_TR00000442    1291    1291    2938505      0     6246    2276.146    2276.146
CmurSOAP_TR00000119    2066    2066    4193432     14     3409    2029.735    2029.735
CmurSOAP_TR00000328    1452    1452    2446201     28     2333    1684.711    1684.711
CmurSOAP_TR00001376     784     784    1178624     14     2013    1503.347    1503.347
CmurSOAP_TR00007586     272     272     252061     25     1132     926.695     926.695
CmurSOAP_TR00002681     529     529     470614    169     1217     889.629     889.629
CmurSOAP_TR00002033     629     629     513931    122     1126     817.060     817.060
CmurSOAP_TR00000056    2474    2474    1889173      9     1157     763.611     763.611
CmurSOAP_TR00010178     194     194     143838     36     2314     741.433     741.433

If we consider the functional annotation, if any, for those contigs from BLASTN searches against NCBInt, then:

Functional annotation of highest coverage contigs

Comparing coverage distributions
# Making boxplots
Rscript  Compare_Vector_Distributions_ylog.R \
    "Minimum Read Coverage${RET}per Transcript::NA::Bowtie2 Minimum Coverage" \
    bowtie2/violin.bowtie_read_mincoverage.png \
    "NewBler::5::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC"        \
    "Trinity::5::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \
    "SOAPdnT::5::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC"
Rscript  Compare_Vector_Distributions_ylog.R \
    "Maximum Read Coverage${RET}per Transcript::NA::Bowtie2 Maximum Coverage" \
    bowtie2/violin.bowtie_read_maxcoverage.png \
    "NewBler::6::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC"        \
    "Trinity::6::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \
    "SOAPdnT::6::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC"
Rscript  Compare_Vector_Distributions_ylog.R \
    "Average Read Coverage${RET}per Transcript::NA::Bowtie2 Average Coverage" \
    bowtie2/violin.bowtie_read_avgcoverage.png \
    "NewBler::7::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC"        \
    "Trinity::7::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \
    "SOAPdnT::7::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC"

Functional Annotation

Global shell vars for this section
TAB=$'\t'
RET=$'\n'
NBC="::#CC000099"
TRC="::#00CC0099"
SPC="::#00CC0099"

Comparing starfish and sea urchin transcriptomes

Retrieving sea urchin transcripts from UCSC genomes database
# Downloading transcript related files
wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/est.fa.gz     -O seaurchin/est.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/strPur2.fa.gz -O seaurchin/strPur2.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/refMrna.fa.gz -O seaurchin/refMrna.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/mrna.fa.gz    -O seaurchin/mrna.fa.gz
BLASTX/TBLASTX comparison among starfish and sea urchin transcriptomes
# Preparing BLAST target databases
for F in mrna est refMrna;
  do {
    zcat seaurchin/$F.fa.gz | \
      makeblastdb -in - \
                  -input_type fasta \
                  -dbtype nucl \
                  -title "SeaUrchin_$F" \
                  -out seaurchin/SeaUrchin_$F;
  }; done
#    mrna: added  30,401 sequences in  4.02561 seconds.
#     est: added 141,833 sequences in 11.2223 seconds.
# refMrna: added     466 sequences in  0.0728889 seconds.

# Running BLASTN and TBLASTX searches
while read FA TI;
  do {
    for SU in mrna est refMrna;
      do {
        OUT=$TI"-x-Spur_"$SU;
        blastn -task  blastn               \
               -query assemblies/$FA       \
               -db seaurchin/SeaUrchin_$SU \
               -outfmt "$BLSTOFMT" -dust "yes"     \
               -evalue 10e-25 -num_threads 12      \
                    2> blastout/$OUT.blastn.log |  \
          gzip -c9 - > blastout/$OUT.blastn.out.gz &
        tblastx -query assemblies/$FA       \
                -db seaurchin/SeaUrchin_$SU \
                -outfmt "$BLSTOFMT"         \
                -matrix BLOSUM62            \
                -evalue 10e-25 -num_threads 12        \
                     2> blastout/$OUT.tblastx.log |  \
           gzip -c9 - > blastout/$OUT.tblastx.out.gz &
      }; done
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Processing transcript coverage
for TI in starfish_newbler        \
          starfish_trinity_trimmo \
          starfish_soap_trimmo    ;
  do {
    for SU in mrna est refMrna;
      do {
        OUT=$TI"-x-Spur_"$SU;
        # BLASTN
        zcat blastout/$OUT.blastn.out.gz | \
            coverage_blastshorttbl.pl --prog=blastn - \
                         > blastout/$OUT.blastn.covtbl 
        egrep '^Q\b' blastout/$OUT.blastn.covtbl \
                     > blastout/$OUT.blastn.covtbl.qryonly

        # TBLASTX
        zcat blastout/$OUT.tblastx.out.gz | \
            coverage_blastshorttbl.pl --prog=tblastx - \
                         > blastout/$OUT.tblastx.covtbl
        egrep '^Q\b' blastout/$OUT.tblastx.covtbl \
                     > blastout/$OUT.tblastx.covtbl.qryonly

      }; done
  }; done

# Making boxplots
Rscript  Compare_Vector_Distributions.R \
    "BLASTN x Sea urchin::NA::% Coverage" \
    blastout/violin.Spur_est.blastn.png \
    "NewBler${RET}x ESTs::13::${TAB}::"blastout/starfish_newbler-x-Spur_est.blastn.covtbl.qryonly"$NBC"         \
    "NewBler${RET}x mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_mrna.blastn.covtbl.qryonly"$NBC"        \
    "NewBler${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_refMrna.blastn.covtbl.qryonly"$NBC" \
    "Trinity${RET}x ESTs::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_est.blastn.covtbl.qryonly"$TRC"         \
    "Trinity${RET}x mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_mrna.blastn.covtbl.qryonly"$TRC"        \
    "Trinity${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_refMrna.blastn.covtbl.qryonly"$TRC" \
    "SOAPdnT${RET}x ESTs::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_est.blastn.covtbl.qryonly"$SPC"         \
    "SOAPdnT${RET}x mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_mrna.blastn.covtbl.qryonly"$SPC"        \
    "SOAPdnT${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_refMrna.blastn.covtbl.qryonly"$SPC"

Rscript  Compare_Vector_Distributions.R \
    "TBLASTX x Sea urchin::NA::% Coverage" \
    blastout/violin.Spur_est.tblastx.png \
    "NewBler${RET}x ESTs::13::${TAB}::"blastout/starfish_newbler-x-Spur_est.tblastx.covtbl.qryonly"$NBC"         \
    "NewBler${RET}x mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_mrna.tblastx.covtbl.qryonly"$NBC"        \
    "NewBler${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_refMrna.tblastx.covtbl.qryonly"$NBC" \
    "Trinity${RET}x ESTs::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_est.tblastx.covtbl.qryonly"$TRC"         \
    "Trinity${RET}x mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_mrna.tblastx.covtbl.qryonly"$TRC"        \
    "Trinity${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_refMrna.tblastx.covtbl.qryonly"$TRC" \
    "SOAPdnT${RET}x ESTs::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_est.tblastx.covtbl.qryonly"$SPC"         \
    "SOAPdnT${RET}x mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_mrna.tblastx.covtbl.qryonly"$SPC"        \
    "SOAPdnT${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_refMrna.tblastx.covtbl.qryonly"$SPC"

The following tables summarize the number of transcripts having a BLAST hit (“query” column) and the number of distinct sequences from the database aligning to those transcripts (“target” column). TBL files were compressed using bzip2. As it can be appreciated when comparing both tables, TBLASTX returned almost three times more target hits while only twice as much query hits. Sea urchin mRNA transcript set has 30,401 sequences. This can be due to two reasons: 1) less isoforms per gene on starfish (but with the available data we cannot check this); and 2) the incompleteness of the current starfish transcriptome because it was built from a specific tissue (coelomic epithelium). We can check later on if there are differences at functional GO annotation.

BLASTN x S.pur mRNAs
SET #Transcripts #Queries #Targets Best Hit
NewBler 10,022 1,440 2,424 TBL
Trinity 13,817 1,772 3,169 TBL
SOAPdenovo-Trans 11,369 1,309 2,214 TBL
TBLASTX x S.pur mRNAs
SET #Queries #Targets Best Hit
NewBler 2,803 7,309 TBL
Trinity 3,371 8,755 TBL
SOAPdenovo-Trans 2,317 6,378 TBL


Homology Search over Protein Sequences Databases

UniProt

UniProt was mirrored from the database ftp server on September 2014.

BLASTX over UniProt
# BLASTDB is the shell var set to the path to BLAST databases
while read FA TI;
  do {
    OUT=$TI"-x-UniProt";
    blastx -query assemblies/$FA           \
           -db $BLASTDB/UniProt_SProt_2014 \
           -strand 'plus'                  \
           -outfmt "$BLSTOFMT"             \
           -matrix BLOSUM62                \
           -evalue 10e-25 -num_threads 12  \
                2> blastout/$OUT.blastx.log |  \
      gzip -c9 - > blastout/$OUT.blastx.out.gz &
    OUT=$TI"-x-UniProtTREmbl";
    blastx -query assemblies/$FA            \
           -db $BLASTDB/UniProt_TREmbl_2014 \
           -strand 'plus'                   \
           -outfmt "$BLSTOFMT"              \
           -matrix BLOSUM62                 \
           -evalue 10e-25 -num_threads 12   \
                2> blastout/$OUT.blastx.log |  \
      gzip -c9 - > blastout/$OUT.blastx.out.gz &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Getting sequence coverage
for TI in starfish_newbler        \
          starfish_trinity_trimmo \
          starfish_soap_trimmo    ;
  do {
    for SU in UniProt UniProtTREmbl ;
      do {
        OUT=$TI"-x-"$SU;
        zcat blastout/$OUT.blastx.out.gz | \
            coverage_blastshorttbl.pl --prog=blastx - \
                         > blastout/$OUT.blastx.covtbl &
        egrep '^Q\b' blastout/$OUT.blastx.covtbl \
                     > blastout/$OUT.blastx.covtbl.qryonly
      }; done
  }; done

# Making the boxplots
Rscript  Compare_Vector_Distributions.R \
    "BLASTX x UniProt::NA::% Coverage" \
    blastout/violin.UniProt.blastx.png \
    "NewBler${RET}x UniProt::13::${TAB}::"blastout/starfish_newbler-x-UniProt.blastx.covtbl.qryonly"$NBC"        \
    "Trinity${RET}x UniProt::13::${TAB}::"blastout/starfish_trinity_trimmo-x-UniProt.blastx.covtbl.qryonly"$TRC" \
    "SOAPdnT${RET}x UniProt::13::${TAB}::"blastout/starfish_soap_trimmo-x-UniProt.blastx.covtbl.qryonly"$SPC"       \
    "NewBler${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_newbler-x-UniProtTREmbl.blastx.covtbl.qryonly"$NBC" \
    "Trinity${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_trinity_trimmo-x-UniProtTREmbl.blastx.covtbl.qryonly"$TRC" \
    "SOAPdnT${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_soap_trimmo-x-UniProtTREmbl.blastx.covtbl.qryonly"$SPC"

In order to have an equivalent functional annotation to compare starfish and sea urchin transcriptomes, same searches were performed using the sea urchin sequences.

BLASTX over UniProt for the sea urchin sequences
nohup zcat ./seaurchin/mrna.fa.gz | \
        blastx -query - \
               -db $BLASTDB/UniProt_SProt_2014 \
               -outfmt "$BLSTOFMT"             \
               -matrix BLOSUM62                \
               -evalue 10e-25 -num_threads 64  \
             > ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.out \
            2> ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.log &

coverage_blastshorttbl.pl --prog=blastx \
         ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.out \
       > ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl

NCBI non redundant

NCBI non redundant (-+NCBI-nr+-) Release 197.0 (August 15, 2013) was downloaded from the NCBI BLAST db ftp site.

BLASTX over NCBI-nr
# BLASTDB is the shell var set to the path to BLAST databases
while read FA TI;
  do {
    OUT=$TI"-x-NCBInr";
    blastx -query assemblies/$FA \
           -db $BLASTDB/nr       \
           -strand 'plus'        \
           -outfmt "$BLSTOFMT"   \
           -matrix BLOSUM62      \
           -evalue 10e-25 -num_threads 12   \
                2> blastout/$OUT.blastx.log |  \
      gzip -c9 - > blastout/$OUT.blastx.out.gz &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Getting sequence coverage
for TI in starfish_newbler        \
          starfish_trinity_trimmo \
          starfish_soap_trimmo    ;
  do {
    OUT=$TI"-x-NCBInr"

    zcat blastout/$OUT.blastx.out.gz | \
      coverage_blastshorttbl.pl --prog=blastx - \
                              > blastout/$OUT.blastx.covtbl &

    egrep '^Q\b' blastout/$OUT.blastx.covtbl \
                 > blastout/$OUT.blastx.covtbl.qryonly
  }; done

# Making the boxplots
Rscript  Compare_Vector_Distributions.R \
    "BLASTX x NCBInr::NA::% Coverage" \
    blastout/violin.NCBInr.blastx.png \
    "NewBler::13::${TAB}::"blastout/starfish_newbler-x-NCBInr.blastx.covtbl.qryonly"$NBC"        \
    "Trinity::13::${TAB}::"blastout/starfish_trinity_trimmo-x-NCBInr.blastx.covtbl.qryonly"$TRC" \
    "SOAPdnT::13::${TAB}::"blastout/starfish_soap_trimmo-x-NCBInr.blastx.covtbl.qryonly"$SPC"

Here too, in order to have an equivalent functional annotation to compare starfish and sea urchin transcriptomes, same searches were performed using the sea urchin sequences.

BLASTX over NCBInr for the sea urchin sequences
nohup zcat ./seaurchin/mrna.fa.gz | \
        blastx -query - \
               -db $BLASTDB/nr                 \
               -outfmt "$BLSTOFMT"             \
               -matrix BLOSUM62                \
               -evalue 10e-25 -num_threads 64  \
             > ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.out \
            2> ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.log &

coverage_blastshorttbl.pl --prog=blastx \
         ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.out \
       > ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.covtbl

NCBI nucleotide

NCBI nucleotide (-+NCBI-nt+-) Release 197.0 (August 15, 2013) was downloaded from the NCBI BLAST db ftp site.

BLASTN over NCBI-nt
# BLASTDB is the shell var set to the path to BLAST databases
while read FA TI;
  do {
    OUT=$TI"-x-NCBInt";
    blastn -query assemblies/$FA \
           -db $BLASTDB/nt       \
           -outfmt "$BLSTOFMT"   \
           -evalue 10e-5 -num_threads 32   \
                2> blastout/$OUT.blastn.log |  \
      gzip -c9 - > blastout/$OUT.blastn.out.gz &
  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix.fa         starfish_newbler
starfish_trinity_trimmo.vbfclean_strandfix.fa  starfish_trinity_trimmo
starfish_soap_trimmo.vbfclean_strandfix.fa     starfish_soap_trimmo
EOF

# Getting sequence coverage
for TI in starfish_newbler        \
          starfish_trinity_trimmo \
          starfish_soap_trimmo    ;
  do {
    OUT=$TI"-x-NCBInt"

    zcat blastout/$OUT.blastn.out.gz | \
      coverage_blastshorttbl.pl --prog=blastn - \
                              > blastout/$OUT.blastn.covtbl

    egrep '^Q\b' blastout/$OUT.blastn.covtbl \
                  > blastout/$OUT.blastn.covtbl.qryonly
  }; done

# Making the boxplots
Rscript  Compare_Vector_Distributions.R \
    "BLASTX x NCBInt::NA::% Coverage" \
    blastout/violin.NCBInt.blastn.png \
    "NewBler::13::${TAB}::"blastout/starfish_newbler-x-NCBInt.blastn.covtbl.qryonly"$NBC"        \
    "Trinity::13::${TAB}::"blastout/starfish_trinity_trimmo-x-NCBInt.blastn.covtbl.qryonly"$TRC" \
    "SOAPdnT::13::${TAB}::"blastout/starfish_soap_trimmo-x-NCBInt.blastn.covtbl.qryonly"$SPC"
BLASTN x NCBInt
SET #Transcripts #Queries #Targets Best Hit
NewBler 10,022 776 27,484 TBL
Trinity 13,817 902 29,342 TBL
SOAPdenovo-Trans 11,369 779 31,486 TBL


Searching for PFAM Domains

The two PFAM database sections, Pfam-A and Pfam-B were downloaded from Pfam ftp site. Pfam-A is the manually curated portion of the database that contains over 13,000 entries; for each entry a protein sequence alignment and a hidden Markov model is stored. These hidden Markov models can be used to search sequence databases with the HMMER package. Pfam-B is an automatically generated supplement to Pfam-A that contains a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found. HMMER tools were used to scan transcript sequences to annotate domains. In fact, longest ORFs derived from those transcripts were taken, and compared with the HMM models from PFAM database (version 27.0, 2014). In order to get comparable e-values, it is required to adjust a command-line switch to set the effective search space, calculated from this formula: Z = #query_sequences x #target_models

PFAM search
# PFAMDB is the shell var set to the path to the PFAM HMM models
zegrep -c '^ACC' $PFAMDB/Pfam-[AB].hmm.gz
 14,831  Pfam-A.hmm.gz
 20,000  Pfam-B.hmm.gz
 34,831  HMM models in total

# From that, #target_models was set to 35,000. 
# As there are 10,149, 13,787, and 11,347 transcripts on the NewBler, 
# Trinity, and SOAPdenovo-Trans assemblies respectively, #query_sequences 
# was rounded up to 15,000.
# Therefore, Z = 15,000 x 35,000 = 525,000,000 = 5.25e8 ~ 6e8

while read FA TI;
  do {

    # Get the longest ORF for the final transcript set
    cdna2orfs.v2.pl -VSldu assembly/${FA}.fa \
                         > assembly/${FA}_lORFs.fa

    # HMMER against Pfam A
    OUT=$TI"_lORFs-x-PfamA"
    nohup hmmsearch -Z 6e8 -E 10e-5 --max --cpu 24 \
                    -o       ./hmmer_output/${OUT}.hmmsearch.out \
                    --tblout ./hmmer_output/${OUT}.hmmsearch.tbl \
                             $PFAMDB/Pfam-A.hmm.gz \
                             ./assembly/${FA}_lORFs.fa \
                          2> ./hmmer_output/${OUT}.hmmsearch.log &

    # HMMER against Pfam B
    OUT=$TI"_lORFs-x-PfamB";
    nohup hmmsearch -Z 6e8 -E 10e-5 --max --cpu 24 \
                    -o       ./hmmer_output/${OUT}.hmmsearch.out \
                    --tblout ./hmmer_output/${OUT}.hmmsearch.tbl \
                             $PFAMDB/Pfam-B.hmm.gz \
                             ./assembly/${FA}_lORFs.fa \
                          2> ./hmmer_output/${OUT}.hmmsearch.log &

  }; done <<'EOF'
starfish_newbler.vbfclean_strandfix         starfish_newbler_cs
starfish_trinity_trimmo.vbfclean_strandfix  starfish_trinity_trimmo_cs
starfish_soap_trimmo.vbfclean_strandfix     starfish_soap_trimmo_cs
EOF

Retrieving BLAST Best Hists

getbesthit.awk
#!/usr/bin/gawk -f

{
  split($3,m,/\|/);

  if (MX[$1] < $10) {
    MX[$1] = $10;
    aln = $5"-"$6"/"$7"-"$8":"$9;
    coverage = sprintf("%.3f", $9/$2 * 100);
    GX[$1] = $1" "$2" "m[2]" "m[3]" "$4" "aln" "$10" "$11" "coverage;
  };

  GH[$1]++;
}
END {
  for (n in MX)
    print GX[n], GH[n];
}
Retrieving best scoring alignments for each contig
# get best HSPs for starfish contigs
for TI in newbler trinity_trimmo soap_trimmo ;
  do {
    for DB in UniProt UniProtTREmbl NCBInr ;
      do {
        OUT="starfish_"$TI"-x-"$DB
        echo "# working on $OUT..." 1>&2
        zcat   blastout/$OUT.blastx.out.gz | \
          ./getbesthit.awk - \
             > blastout/$OUT.blastx.besthsp.tbl
      }; done
  }; done

# get best HSPs for sea urchin contigs
for DB in uniprot ncbinr;
  do {
    OUT="seaurchin_mrna.blastx_"$DB"_eval10e-25"
    echo "# working on $OUT..." 1>&2
    ./getbesthit.awk seaurchin/$OUT.out \
                   > seaurchin/$OUT.besthsp.tbl
  }; done
Retrieving UniPROT and NCBI best hit descriptions
# Merging all UniProt ids from best BLAST hits
gawk '{ print $3 }
     ' ./blastout/starfish_newbler-x-UniProt.blastx.besthsp.tbl         \
       ./blastout/starfish_trinity_trimmo-x-UniProt.blastx.besthsp.tbl  \
       ./blastout/starfish_soap_trimmo-x-UniProt.blastx.besthsp.tbl     \
     > ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids

# Merging all NCBInr ids from best BLAST hits
gawk '{ print $3 }
     ' ./blastout/starfish_newbler-x-NCBInr.blastx.besthsp.tbl         \
       ./blastout/starfish_trinity_trimmo-x-NCBInr.blastx.besthsp.tbl  \
       ./blastout/starfish_soap_trimmo-x-NCBInr.blastx.besthsp.tbl     \
     > ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids

# How many ids do we have on each set?
wc -l ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids \
     ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids
  7540  blastout/blastx_uniprot_eval10e-25.besthsp.gbids
  4950  blastout/blastx_ncbinr_eval10e-25.besthsp.gbids

# Batch downloading GenBank nucleotide (gb) or protein (gp) records
batchwebquerymaker.pl -V \
        --getids-from-file  ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids \
        --set-database      'protein' \
        --db-output-rettype 'gp' \
        --output-file       ./blastout/blastx_uniprot_eval10e-25.besthsp \
                         2> ./blastout/blastx_uniprot_eval10e-25.besthsp.gp.log &

batchwebquerymaker.pl -V \
        --getids-from-file  ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids \
        --set-database      'protein' \
        --db-output-rettype 'gb' \
        --output-file       ./blastout/blastx_ncbinr_eval10e-25.besthsp \
                         2> ./blastout/blastx_ncbinr_eval10e-25.besthsp.gb.log &

An annotation summary table was also retrieved from the UniProt upload list web form, using a list of all the matching UniProt identifiers filtered out from the best hits file blastx_uniprot_eval10e-25.besthsp.gbids. The downloaded file uniprot_besthits_dowloaded_info.tab.gz was then used to annotate the starfish contigs and to create the supplementary excel files.

Formating UniProt functional annotation into Excel files
# Processing UniProt alignments
for TI in newbler        \
          trinity_trimmo \
          soap_trimmo    ;
  do {
    zcat ./blastout/uniprot_besthits_dowloaded_info.tab.gz | \
      gawk 'BEGIN{ FS=OFS="\t";
                   while (getline<ARGV[1]>0)
                     I[$1]=$2"\t"$3"\t"$10"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$11"\t"$12;
                   ARGV[1]=""; FS=" ";
                   hdr="Transcript ID\tLength(bp)\tBest HSP\tScore\tEvalue\tCoverage\tNum HSPs\t";
                   hdr=hdr"UniProt Entry\tEntry name (ACC)\tLength(aa)\tStatus\tProtein names\tGene names\t";
                   hdr=hdr"Organism\tCatalytic activity\tFunction [CC]\tKeywords\tGene ontology (GO)";
                   print hdr; }
            $3 in I { print $1,$2,$6,$7,$8,$9,$10,I[$3]; }
           ' - ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.tbl \
           > ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.tbl

    ./table2excel.pl ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.tbl \
                     ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.xls \
                     "${TI}-x-UniProt"
  }; done
mapNCBIinfo2hsps.awk
#!/usr/bin/gawk -f

BEGIN {
    while (getline<ARGV[1]>0) {
        J=$4;
        gsub(/\.[0-9]+$/,"",J);
        IDS[$3]=$4;
        TRS[$3]=TRS[$3]" "$1;
        CTG[$3""$1]=$1"\t"$3;
        ALN[$3""$1]=$6" "$7sco" "$9"bits\t"$8;
    };
    ARGV[1]="";
    N=D=A=S=F=K=G="";
    E="NotAnnotated";
    dflg=kflg=0;
}

dflg == 1 && $0 !~ /^ / { dflg=0 }
kflg == 1 && $0 !~ /^ / { kflg=0 }
dflg == 1 { gsub(/^ +/," ",$0); D=D""$0 }
kflg == 1 { gsub(/^ +/," ",$0); K=K""$0 }
  $1 == "LOCUS" { N=$2 }
  $1 == "DEFINITION" { D=$0; gsub(/^DEFINITION +/,"",D); dflg=1 }
  $1 == "ACCESSION" { A=$2 }
  $1 == "VERSION" { G=$NF; gsub(/^GI:/,"",G) }
  $1 == "SOURCE" { S=$0; gsub(/^SOURCE +/,"",S) }
  $1 == "KEYWORDS" { K=$0 ; gsub(/^KEYWORDS +/,"",K); kflg=1 }
  $0 ~ /^ *\/product/ { F=$0; gsub(/^ +\/product=/,"",F); gsub(/\"/,"",F) }  #"
  $0 ~ /^ *\/UniProtKB_evidence/ { E=$0; gsub(/^ +\/UniProtKB_evidence=/,"",E); gsub(/\"/,"",E) }  #"
  $0 ~ /^\/\/$/ {
      if (G in IDS) {
         SQ=TRS[G];
         gsub(/^ */,"",SQ);
         n=split(SQ,sq," ");
         for (i=1; i<=n; i++) {
             ti=sq[i];
             print CTG[G""ti]"\t"A"\t"N"\t"ALN[G""ti]"\t"S"\t"D"\t"F"\t"K"\t"E;
         };
       };
       N=D=A=S=F=K=G=""; E="NotAnnotated"; dflg=kflg=0;
  }
Formating NCBInr functional annotation into Excel files
# Processing NCBInr alignments
for TI in newbler        \
          trinity_trimmo \
          soap_trimmo    ;
  do {
    ./mapNCBIinfo2hsps.awk ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.tbl \
                           ./blastout/blastx_ncbinr_eval10e-25.besthsp.gb \
                         > ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.tbl

    ./tableNR2excel.pl ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.tbl \
                       ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.xls \
                       "${TI}-x-NCBInr"
  }; done

The following tables summarize the number of transcripts having a BLAST hit (“query” column) and the number of distinct sequences from the database aligning to those transcripts (“target” column). TBL files were compressed using bzip2.

BLASTX x UniProt
SET #Transcripts #Queries #Targets Best Hit
NewBler 10,022  1,059  27,565 TBL / XLS
Trinity 13,817  1,503  37,010 TBL / XLS
SOAPdenovo-Trans 11,369  923  27,168 TBL / XLS
Sea Urchin 30,401  19,343  149,902
BLASTX x NCBInr
SET #Queries #Targets Best Hit
NewBler 1,365  266,711 TBL / XLS
Trinity 2,021  406,049 TBL / XLS
SOAPdenovo-Trans 1,252  265,750 TBL / XLS
Sea Urchin 3,449  523,925


Merging Coverage information and Functional Annotation into Excel files
# Keep transcript IDs together with assembly contigs ID and length in bp.
egrep '^>' assemblies/starfish_newbler.vbfclean_strandfix.fa | \
  sed 's/^>//; s/\(length\|gene\)=//g;' | \
    gawk '{ print $1, $3":"$2, $4 }
         ' > assemblies/starfish_newbler.vbfclean_strandfix.idsmap
egrep '^>' assemblies/starfish_trinity_trimmo.vbfclean_strandfix.fa | \
  sed 's/^>//; s/\(length\|gene\)=//g;' | \
    gawk '{ print $1, $2, $3 }
         ' > assemblies/starfish_trinity_trimmo.vbfclean_strandfix.idsmap
egrep '^>' assemblies/starfish_soap_trimmo.vbfclean_strandfix.fa | \
  sed 's/^>//; s/\(length\|gene\)=//g;' | \
    gawk '{ print $1, $2, $3 }
         ' > assemblies/starfish_soap_trimmo.vbfclean_strandfix.idsmap

# Merging IDs with assembly coverage, UniProt, and NCBInr best hits info
for TI in newbler        \
          trinity_trimmo \
          soap_trimmo    ;
  do {

    echo "# working on $TI" 1>&2

    # merging tables
    gawk 'BEGIN{
        FS=" "; OFS="\t";
        while (getline<ARGV[1]>0) { S[$1] = $2; C[$2] = $1; len[$1] = $3; };
        while (getline<ARGV[2]>0) { sumcov[$1] = $(NF-2); avgcov[$1] = $NF; };
        while (getline<ARGV[3]>0) { s=$1 in S ? $1 : "##";
                                    GBid[s] = $3; GBac[s] = $4; GBaln[s] = $6" "$7" "$8" "$9" "$10; };
        while (getline<ARGV[4]>0) { s=$1 in S ? $1 : "##";
                                    UPid[s] = $3; UPac[s] = $4; UPaln[s] = $6" "$7" "$8" "$9" "$10; };
        for (s in S) {
          print s, S[s], len[s], (s in sumcov ? sumcov[s] : "NA"), (s in avgcov ? avgcov[s] : "NA"),
                (s in GBid ? GBid[s] : "NA"), (s in GBac ? GBac[s] : "NA"), (s in GBaln ? GBaln[s] : "NA"),
                (s in UPid ? UPid[s] : "NA"), (s in UPac ? UPac[s] : "NA"), (s in UPaln ? UPaln[s] : "NA");
        };
      }' ./assemblies/starfish_$TI.vbfclean_strandfix.idsmap    \
         ./bowtie2/starfish_${TI}.bowtie.UPL.pileup.summary     \
         ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.tbl  \
         ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.tbl \
       > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.tbl

    # adding UniProt descriptions
    zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
      gawk 'BEGIN{ FS=OFS="\t";
              while (getline<ARGV[1]>0) { U[$2] = ($10 == "" ? "NA\tNA" : $10"\t"$5" ["$6)"]"; };
              ARGV[1] = "";
            }
            { print $0, ($9 in U ? U[$9] : "NA\tNA"); } 
            ' - ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.tbl \
              > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl

    # Fixing NCBInr ids and inserting descriptions
    #
    # nr.ids data was filtered out from the fasta headers of the NCBInr sequences, as follows
    #   zegrep '^>' $BLASTDB/fasta/nr.gz | \
    #     perl -ne 'print join("\n", map { s/^>?gi\|//o; s/\|/\t/; s/ /\t/; s/ \[/\t\[/; $_ }
    #                                split /\x01/o, $_);' | \
    #        gzip -c9 - > $BLASTDB/fasta/nr.ids.gz &
    #
    zcat $BLASTDB/fasta/nr.ids.gz | \
      gawk 'BEGIN{
              FS=OFS="\t";
              while (getline<ARGV[1]>0) { N[$6] = $1; };
              ARGV[1] = "";
            }
            ($1 in N) { print $0 }
            ' ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl  - | \
          gawk 'BEGIN{
                  FS=OFS="\t";
                  while (getline<ARGV[1]>0) {
                    N[$1] = $2"\t"$3"\t"($4 != "" ? $4 : "NA");
                    sub(/^[^\|]*\|/,"",$2);
                    sub(/\|[^\|]*$/,"",$2);
                    I[$1] = $2;
                  };
                  ARGV[1] = "";
                }
                { $7=$6 in I ? I[$6] : "NA"; 
                  p=($6 in N ? N[$6] : "NA\tNA\tNA")"\t"$9;
                  $9=p;
                  print $0;
                }' - ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl \
                   > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.tbl

    # Converting tabular file to excel
    ./tableNR+UP2excel.pl ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.tbl \
                          ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.xls \
                          "NCBInr+UniProt_BLASTX"

    # Get total counts
    gawk -v ti=$TI '
           $6 != "NA" && $12 == "NA" { GB++ }
           $6 == "NA" && $12 != "NA" { UP++ }
           $6 != "NA" && $12 != "NA" { BO++ }
           $6 == "NA" && $12 == "NA" { NO++ }
          END{ printf "SET: %s  UP: %d  NR: %d  BOTH: %d  NONE: %d\n",
                       ti, UP, GB, BO, NO; }
          ' blastout/starfish_$TI.best_blastx_NCBI+UniProt.GB+UPdesc.tbl

   }; done

The following table summarizes the number of transcripts (contigs) having a match to at least one target sequence from UniProt, NCBI-nr, both, and no hit (none). TBL files were compressed using bzip2.

SET #Transcripts #UniProt-only #NCBInr-only #Both #None Summary Files
NewBler 10,022  1,365  8,769  TBL / XLS
Trinity 13,817  2,021  11,763  TBL / XLS
SOAPdenovo-Trans 11,369  1,252  10,084  TBL / XLS


Comparing Homology Search Results

In an attempt to quantify which assembly returned the best set of transcripts, one can consider the ratio of percent coverage versus length for the best HSP from BLAST of each transcript. The longer the hit and the higher the coverage percent (with respect to the transcript full length), the better; while low coverage may correspond to the presence of chimeric transcripts.

GO
# Making scatterplots for queries best HSP coverage
while read TI LB CL;
  do {
    echo "# Working on $TI : BLASTN x Sea urchin" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-Spur_mrna.blastn.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTN x Sea urchin::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-Spur_mrna.blastn.covtbl.qryonly.scat.png
    echo "# Working on $TI : TBLASTX x Sea urchin" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-Spur_mrna.tblastx.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP TBLASTX x Sea urchin::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-Spur_mrna.tblastx.covtbl.qryonly.scat.png
    echo "# Working on $TI : BLASTX x UniProt" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-UniProt.blastx.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTX x UniProt::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-UniProt.blastx.covtbl.qryonly.scat.png
    echo "# Working on $TI : BLASTX x UniProtTREmbl" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-UniProtTREmbl.blastx.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTX x UniProt-TREmbl::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-UniProtTREmbl.blastx.covtbl.qryonly.scat.png
    echo "# Working on $TI : BLASTX x NCBInr" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-NCBInr.blastx.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTX x NCBInr::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-NCBInr.blastx.covtbl.qryonly.scat.png
    echo "# Working on $TI : BLASTN x NCBInt" 1>&2;
    Rscript  Scatterplot_2cols.R \
        blastout/starfish_${TI}-x-NCBInt.blastn.covtbl.qryonly \
        "13::12::${TAB}::FALSE" \
        "${LB}${RET}Best HSP BLASTN x NCBInt::% Coverage::HSP length::${CL}" \
        blastout/starfish_${TI}-x-NCBInt.blastn.covtbl.qryonly.scat.png
  }; done <<'EOF'
newbler         NewBler  #CC000099
trinity_trimmo  Trinity  #00CC0099
soap_trimmo     SOAPdnT  #0000CC99
EOF
BLAST SET NewBler Trinity SOAPdenovo-Trans
BLASTN x
Sea urchin mRNAs
Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTN of NewBler assembly against Spur mRNA set. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTN of Trinity assembly against Spur mRNA set. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTN of SOAPdenovo-Trans assembly against Spur mRNA set.
TBLASTX x
Sea urchin mRNAs
Scatterplot comparing coverage percent with HSP length for the best HSP from TBLASTX of NewBler assembly against Spur mRNA set. Scatterplot comparing coverage percent with HSP length for the best HSP from TBLASTX of Trinity assembly against Spur mRNA set. Scatterplot comparing coverage percent with HSP length for the best HSP from TBLASTNXof SOAPdenovo-Trans assembly against Spur mRNA set.
BLASTN x
NCBInt
Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTN of NewBler assembly against NCBInt. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTN of Trinity assembly against NCBInt. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTN of SOAPdenovo-Trans assembly against NCBInt.
BLASTX x
NCBInr
Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of NewBler assembly against NCBInr. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of Trinity assembly against NCBInr. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of SOAPdenovo-Trans assembly against NCBInr.
BLASTX x
UniProt
Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of NewBler assembly against UniProt. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of Trinity assembly against UniProt. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of SOAPdenovo-Trans assembly against UniProt.
BLASTX x
UniProt-TrEBML
Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of NewBler assembly against UniProtTREmbl. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of Trinity assembly against UniProtTREmbl. Scatterplot comparing coverage percent with HSP length for the best HSP from BLASTX of SOAPdenovo-Trans assembly against UniProtTREmbl.

Functional analysis: Gene Ontology

On this section, SOAPdenovo-Trans assembly will be functionally characterized using the Gene Ontology (GO) annotation extracted from the UniProt database.

Deriving GO table from UniProt

GO
#  Preparing GO annotation table for the current UNIPROT release
# UNIPROT points to the folder where UniProt database was downloaded

zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.dat.gz | \
  perl -ne 'BEGIN { $I=$A=$E=$O=""; $G="GO"; $T="TM"; $D="DM"; $S="unknown"; $K="."; $C=$S=0; }
    $_ =~ m{^ID\s+(\S+)\s}o                      && ($I=$A=$1, next);
    $_ =~ m{^AC\s+(\S+);}o                       && ($A=$1, next);
    $_ =~ m{^OS\s+([^\s]+)\s+([^\.\s]+)}o        && ($S=$1."_".$2, next);
    $_ =~ m{^OX\s+NCBI_TaxID=(\d+);}o            && ($C=$1, next);
    $_ =~ m{^DR\s+EMBL;\s+([^;]+);\s+([^;]+);}o  && ($E=$1, $O=$2, next);
    $_ =~ m{^DR\s+GO;\s+GO:([^;]+);\s+([^;]+);}o && ($G.=";".$1."#".$2, next);
    $_ =~ m{^KW\s+(.*)\.}o                       && ($K=$1, next);
    $_ =~ m{^FT\s+DOMAIN\s+(\d+)\s+(\d+)\s+([^\.]+)\.}o && do {
      $D.=";".$1."_".$2; ($s = $3) =~ s/\s+/_/og; $D.="_".$s; next; };
    $_ =~ m{^FT\s+TRANSMEM\s+(\d+)\s+(\d+)\s}o          && ($T.=";".$1."_".$2, next);
    $_ =~ m{^//}o && do {
      print join("\t", $I, $A, $E, $O, $S, $C, $D, $T, $G, $K),"\n";
      $I=$A=$E=$O=""; $G="GO"; $T="TM"; $D="DM"; $S="unknown"; $K="."; $C=$S=0;
      next;
    }' | gzip -c9 - \
            > $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz

zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz  | \
  gawk '   { if ($9 == "GO") { N++ } else { Y++ }; }
        END{ print "# Total: ", N+Y, " with GOs: ",Y," without GOs: ",N }'
# Total:  546439  with GOs:  518296  without GOs:  28143

Project GO Annotations (GOAs)


GOAs can be mapped from the raw set of BLAST hits or transferred from the best hit only, the latter was the chosen option. Best hits files were produced at ((Coscinasterias+transcriptome#UniProt|Homology Search over UniProt)) section.

Mapping GOAs from the BLASTX best HSPs set: starfish x UniProt
# Annotate starfish sequences
zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
   gawk 'BEGIN{
    FS=OFS="\t";
    while (getline<ARGV[1]>0) {
      if ($1 == "GO") continue;
      sub(/^GO;?/,"",$9);
      if (!($2 in GO)) GO[$2] = $9; # using uniprot accession i.e. Q6GZX4
      else GO[$2] = GO[$2]";"$9;
    };
    ARGV[1] = "";
   }
   $1 == "Q" {
    split($14,U,"|");
    u=U[2];
    split(U[3],K," ");
    if (u in GO && GO[u] != "") print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], GO[u];
    else print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], "NA.GO";
   }' - ./blastout/starfish_soap_trimmo-x-UniProt.blastx.covtbl \
      > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl

# Counting GOAs
gawk 'BEGIN{
        FS=OFS="\t";
        print "GO\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot";
      }
      $NF != "NA.GO" && $1 != "GO" {
        n=split($NF,m,";");
        for (i=1; i<=n; i++) { GO[m[i]]++; GOT++ };
      }
      END{
        for (j in GO)  {
           gop = GO[j];
           gon = GOT - GO[j];
           print j, gop, sprintf("%.5f", gop / GOT * 100), gon, sprintf("%.5f", gon / GOT * 100), GOT;
        } # print j, GO[j], GOT;
      }' ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \
       > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts.tbl
Mapping GOAs from the BLASTX best HSPs set: sea urchin x UniProt
# Annotate sea urchin sequences
zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
   gawk 'BEGIN{
    FS=OFS="\t";
    while (getline<ARGV[1]>0) {
      if ($1 == "GO") continue;
      sub(/^GO;?/,"",$9);
      if (!($2 in GO)) GO[$2] = $9; # using uniprot accession i.e. Q6GZX4
      else GO[$2] = GO[$2]";"$9;
    };
    ARGV[1] = "";
   }
   $1 == "Q" {
    split($14,U,"|");
    u=U[2];
    split(U[3],K," ");
    if (u in GO && GO[u] != "") print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], GO[u];
    else print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], "NA.GO";
   }' - ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl \
      > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl

# Counting GOAs
gawk 'BEGIN{
        FS=OFS="\t";
        print "GO\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot";
      }
      $NF != "NA.GO" && $1 != "GO" {
        n=split($NF,m,";");
        for (i=1; i<=n; i++) { GO[m[i]]++; GOT++ };
      }
      END{
        for (j in GO)  {
           gop = GO[j];
           gon = GOT - GO[j];
           print j, gop, sprintf("%.5f", gop / GOT * 100), gon, sprintf("%.5f", gon / GOT * 100), GOT;
        } # print j, GO[j], GOT;
      }' ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \
       > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts.tbl

Comparing GOAs between sets

mkGOcountstbl.awk
#!/usr/bin/gawk -f
BEGIN{
  FS="\t";
  while (getline<ARGV[1]>0) {
    if ($1 == "GO") continue; 
    if ($6 != "NA.GO") {
      gsub(/[\'\"]/,"",$6); #"
      n=split($6,m,";");
      for (i=1; i<=n; i++) {
        k=m[i]; 
        gsub(/\#.*$/,"",k); 
        gsub(/^.*\#/,"",m[i]);
        if (!(k in GD)) { GD[k] = "'"m[i]"'"; RO[k] = 0; };
        RO[k]++; ROT++;
      };
    };
  };
  ARGV[1] = "";
  print "GO\tGOdesc\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot\tROpos\tROpos.pct\tROneg\tROneg.pct\tROtot\tGOROpct";
  FS="\t";
}
$6 != "NA.GO" && $1 != "GO" {
  gsub(/[\'\"]/,"",$6); #"
  n=split($6,m,";");
  for (i=1; i<=n; i++) {
    k=m[i]; 
    gsub(/\#.*$/,"",k); 
    gsub(/^.*\#/,"",m[i]);
    if (!(k in GD)) { GD[k] = "'"m[i]"'"; GO[k] = 0; };
    GO[k]++; GOT++;
  };
}
END{
  for (j in GD) {
     if (!(j in GO)) GO[j]=0;
     if (!(j in RO)) RO[j]=0;
     gop = GO[j];        gopp = GOT > 0 ? sprintf("%.5f", (gop / GOT * 100)) : "NA";
     gon = GOT - GO[j];  gonp = GOT > 0 ? sprintf("%.5f", (gon / GOT * 100)) : "NA";
     rop = RO[j];        ropp = ROT > 0 ? sprintf("%.5f", (rop / ROT * 100)) : "NA";
     ron = ROT - RO[j];  ronp = ROT > 0 ? sprintf("%.5f", (ron / ROT * 100)) : "NA";
     grpp = (GOT == 0 || ROT == 0) ? "NA" : sprintf("%.5f", (GOT / ROT * 100));
     printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
            "GO:"j, GD[j], gop, gopp, gon, gonp, GOT, rop, ropp, ron, ronp, ROT, grpp;
  } # print j, GO[j], GOT;
}
Extended GOA counts
# Comparing GOAs counts between starfish and sea urchin
bin/mkGOcountstbl.awk  ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \
                    ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \
                  > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.tbl

# Comparing GOAs counts between starfish and UniProt
zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
 gawk 'BEGIN{ FS=OFS="\t" } { print $1,"2","3","4","5",$9 }' - | \
  bin/mkGOcountstbl.awk - \
                    ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \
                  > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.tbl

# Comparing GOAs counts between sea urchin and UniProt
zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \
 gawk 'BEGIN{ FS=OFS="\t" } { print $1,"2","3","4","5",$9 }' - | \
  bin/mkGOcountstbl.awk - \
                    ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \
                  > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.tbl
Calculating counts differences between set pairs
# Comparing GOAs counts between starfish and sea urchin
Rscript bin/GO_calc_pval.R \
            ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.tbl \
          > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.GOcalc.log
Rscript bin/GO_getGOdesc.R \
            ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.chisq+hypergeom_pvals.tbl \
         2> ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.chisq+hypergeom_pvals.desc.log

# Comparing GOAs counts between starfish and UniProt
Rscript bin/GO_calc_pval.R \
            ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.tbl \
          > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.GOcalc.log
Rscript bin/GO_getGOdesc.R \
            ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.tbl \
         2> ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.desc.log

# Comparing GOAs counts between sea urchin and UniProt
Rscript bin/GO_calc_pval.R \
            ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.tbl \
          > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.GOcalc.log
Rscript bin/GO_getGOdesc.R \
            ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.tbl \
         2> ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.desc.log

GOA Frequencies Analysis

From the Gene Ontology ftp repository, termdb (version 2015-01-20) was retrieved. From the term.txt, term_synonym.txt, and term2term.txt it is possible to build the full graph for the GO terms from which the parent-child relationships can be extracted. This is useful to calculate frequencies at distinct GO graph levels.

Getting GO terms parent nodes
# GOTERMDB point to the folder where the termdb database was downloaded from Gene Ontology repository
gawk 'BEGIN {
    FS=OFS="\t";
    while (getline<ARGV[1]>0) {
      if ($7 == 1) continue; # 1 is the GO code for is_relation
      GO[$1] = $4;
      DE[$1] = $2;
      ON[$1] = $3;
      RO[$1] = $6;
    };
    while (getline<ARGV[2]>0) {
      if ($4 != 27) continue; # 27 is the GO code for alt_id  synonym_type
      if ($1 in SY) {
        SY[$1] = SY[$1]" "$2;
      } else {
        SY[$1] = $2;
      };
    };
    ARGV[1]=ARGV[2]=0;
    print "GOID","ONTOLOGY","TERM","PARENT","SYNONYM";
  } $2 == 1 && $5 == 0 { # $2 is relationship_type_id and 1 means is_a
     if (($4 in GO) && GO[$4] !~ /^obsolete/) {
       print GO[$4], ON[$4], DE[$4], RO[$4] == 1 ? "all" : GO[$3], ($4 in SY) ? SY[$4] : "NA";
       SN[$4]++;
     };
  } END{
     for (n in GO) {
       if (!(n in SN)) {
         print "#", GO[n], ON[n], DE[n], RO[n] == 1 ? "all" : GO[n], ($4 in SY) ? SY[n] : "NA";
       };
     };
  }' $GOTERMDB/term.txt\
     $GOTERMDB/term_synonym.txt \
     $GOTERMDB/term2term.txt \
   > GOA/goterms+parents.tbl

On this section, frequencies are computed for the three sets, yet the main interest was only the starfish GOs mapped from UniProt.

GO terms frequencies analysis
for F in starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot \
         starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin \
         seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot ;
  do {
    echo "# working on $F"

    bin/GO_prune_GOlevels.pl ./GOA/goterms+parents.tbl \
                              ./GOA/$F.tbl > ./GOA/$F.gofreqs.tbl
  }; done
GO terms subgraphs
for F in starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot \
         starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin \
         seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot ;
 do {
  echo "# working on $F"

  SF="./GOA/"$F".chisq+hypergeom_pvals.goinfo";

  bin/GO_make_graphviz_dotfile_ext.pl ./GOA/goterms+parents.tbl \
                                    $SF.tbl \
                                    1e-5 \
                                 2> $SF.ext.dot.maxpval_1e-5.log;
  for o in CC BP MF;
    do {
      mv -v $SF.graphviz.${o}.ext.dot $SF.graphviz.${o}.ext.maxpval_1e-5.dot
    }; done;

  echo "# working on $F";
  bin/GO_make_graphviz_dotfile_ext.pl ./GOA/goterms+parents.tbl \
                                    $SF.tbl \
                                 2> $SF.ext.dot.log;
  for o in CC BP MF;
    do {
      echo "# graphviz dot on $o [co/pdf]";
      dot -Tpdf $SF.graphviz.${o}.ext.dot \
              > $SF.graphviz.${o}.ext.dot.pdf;
      echo "# graphviz dot on $o [co/svg]";
      dot -Tsvg $SF.graphviz.${o}.ext.dot \
              > $SF.graphviz.${o}.ext.dot.svg;
    }; done;

  for o in CC BP MF;
    do {
      echo "# graphviz dot on $o [co/pdf]";
      dot -Tpdf $SF.graphviz.${o}.ext.maxpval_1e-5.dot \
              > $SF.graphviz.${o}.ext.maxpval_1e-5.dot.pdf;
      echo "# graphviz dot on $o [co/svg]";
      dot -Tsvg $SF.graphviz.${o}.ext.maxpval_1e-5.dot \
              > $SF.graphviz.${o}.ext.maxpval_1e-5.dot.svg;
    }; done;
 }; done;

Producing the GO barplots with R

Using R to draw GO frequency barplots
### GO freqs analysis:
GODATAFILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.gofreqs.tbl"
GOs.Cmur <- read.table(GODATAFILE, header=TRUE, sep="\t");

# Overview of the GO data
nrow(GOs.Cmur)

sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 1 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 1 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 1 ])

sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 2 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 2 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 2 ])

sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 3 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 3 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 3 ])

sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel > 1 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel > 1 ])
sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel > 1 ])

GK <- list(CCU="#E94F6C", # lightred
           CCD="#A24436", # darkred
           BPU="#52E853", # lightgreen
           BPD="#426A2B", # darkgreen ... #0A945C
           MFU="#6FAFF9", # lightblue
           MFD="#007CBA") # darkblue

numbars <- 25

for (golvl in 3:4) {

  GOpct.MF.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "MF" &
                             GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ]
  GOpct.CC.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "CC" &
                             GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ]
  GOpct.BP.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "BP" &
                             GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ]
  cat("numMF:",nrow(GOpct.MF.Cmur),"\n") #->  96 gt0 vs 512 total
  cat("numCC:",nrow(GOpct.CC.Cmur),"\n") #->  65 gt0 vs 612 total
  cat("numBP:",nrow(GOpct.BP.Cmur),"\n") #-> 214 gt0 vs 943 total
  GOsum.MF.Cmur <- sum(GOpct.MF.Cmur$GOcount) #->  800
  GOsum.CC.Cmur <- sum(GOpct.CC.Cmur$GOcount) #-> 1271
  GOsum.BP.Cmur <- sum(GOpct.BP.Cmur$GOcount) #-> 3403
  cat("sumMF:",GOsum.MF.Cmur,"\n")
  cat("sumCC:",GOsum.CC.Cmur,"\n")
  cat("sumBP:",GOsum.BP.Cmur,"\n")

  GOpct.MF.Cmur$GOpct <- GOpct.MF.Cmur$GOcount / GOsum.MF.Cmur * 100
  GOpct.CC.Cmur$GOpct <- GOpct.CC.Cmur$GOcount / GOsum.CC.Cmur * 100
  GOpct.BP.Cmur$GOpct <- GOpct.BP.Cmur$GOcount / GOsum.BP.Cmur * 100

  GOord.MF.Cmur <- head(GOpct.MF.Cmur[ order(-GOpct.MF.Cmur$GOpct), ], n=numbars)
  GOord.CC.Cmur <- head(GOpct.CC.Cmur[ order(-GOpct.CC.Cmur$GOpct), ], n=numbars)
  GOord.BP.Cmur <- head(GOpct.BP.Cmur[ order(-GOpct.BP.Cmur$GOpct), ], n=numbars)
  GOplt.MF.Cmur <- rbind(GOord.MF.Cmur, 
                         data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="MF",
                                    TERM="Other low represented GO codes",
                                    GOpct=100 - sum(GOord.MF.Cmur$GOpct)))
  GOplt.CC.Cmur <- rbind(GOord.CC.Cmur, 
                         data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="CC",
                                    TERM="Other low represented GO codes",
                                    GOpct=100 - sum(GOord.CC.Cmur$GOpct)))
  GOplt.BP.Cmur <- rbind(GOord.BP.Cmur, 
                         data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="BP",
                                    TERM="Other low represented GO codes",
                                    GOpct=100 - sum(GOord.BP.Cmur$GOpct)))

  GOplt.MF.Cmur$LBL <- paste(GOplt.MF.Cmur$TERM,
                             ifelse(is.na(GOplt.MF.Cmur$GOID), "", paste(" [",GOplt.MF.Cmur$GOID,"]",sep="")),
                             sep="")
  GOplt.MF.Cmur.max <- max(GOplt.MF.Cmur$GOpct)

  pdf(file=paste("GOA/GOfreqs_Cmur_MF_lvl",golvl,".pdf", sep=""),
      width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE);
  par.def<-par(mar=c(4,25,1,1)+0.1);
  barplot(GOplt.MF.Cmur$GOpct,
          names.arg=GOplt.MF.Cmur$LBL,
          col=as.character(GK["MFU"]),
          horiz=TRUE, xlim=c(0,GOplt.MF.Cmur.max),
          border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE)
  axis(1,at=c(seq(0,GOplt.MF.Cmur.max+5,2)),line=0.25)
  abline(v=seq(0,GOplt.MF.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black")
  mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
  text(GOplt.MF.Cmur.max, numbars*1.2, # lpos*1.2,
       labels="Molecular\nFunction",
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["MFD"]),
         # as.character(KK[c("BPD","CCD","MFD")]),
       adj=c(1,1.5), font=2, cex=2)
  text(-2, -0.7, # lpos*1.2,
       labels=paste("GO level ",golvl,sep=""),
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["MFD"]),
         # as.character(KK[c("BPD","BPD","MFD")]),
       adj=c(1,1), font=2, cex=1.25, xpd=TRUE)
  par(par.def);
  dev.off();

  GOplt.CC.Cmur$LBL <- paste(GOplt.CC.Cmur$TERM,
                             ifelse(is.na(GOplt.CC.Cmur$GOID), "", paste(" [",GOplt.CC.Cmur$GOID,"]",sep="")),
                             sep="")
  GOplt.CC.Cmur.max <- max(GOplt.CC.Cmur$GOpct)

  pdf(file=paste("GOA/GOfreqs_Cmur_CC_lvl",golvl,".pdf", sep=""),
      width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE);
  par.def<-par(mar=c(4,25,1,1)+0.1);
  barplot(GOplt.CC.Cmur$GOpct,
          names.arg=GOplt.CC.Cmur$LBL,
          col=as.character(GK["CCU"]),
          horiz=TRUE, xlim=c(0,GOplt.CC.Cmur.max),
          border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE)
  axis(1,at=c(seq(0,GOplt.CC.Cmur.max+5,2)),line=0.25)
  abline(v=seq(0,GOplt.CC.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black")
  mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
  text(GOplt.CC.Cmur.max, numbars*1.2, # lpos*1.2,
       labels="Cellular\nComponent",
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["CCD"]),
         # as.character(KK[c("BPD","CCD","MFD")]),
       adj=c(1,1.5), font=2, cex=2)
  text(-2, -0.7, # lpos*1.2,
       labels=paste("GO level ",golvl,sep=""),
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["CCD"]),
         # as.character(KK[c("BPD","BPD","MFD")]),
       adj=c(1,1), font=2, cex=1.25, xpd=TRUE)
  par(par.def);
  dev.off();

  GOplt.BP.Cmur$LBL <- paste(GOplt.BP.Cmur$TERM,
                             ifelse(is.na(GOplt.BP.Cmur$GOID), "", paste(" [",GOplt.BP.Cmur$GOID,"]",sep="")),
                             sep="")
  GOplt.BP.Cmur.max <- max(GOplt.BP.Cmur$GOpct)

  pdf(file=paste("GOA/GOfreqs_Cmur_BP_lvl",golvl,".pdf", sep=""),
      width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE);
  par.def<-par(mar=c(4,25,1,1)+0.1);
  barplot(GOplt.BP.Cmur$GOpct,
          names.arg=GOplt.BP.Cmur$LBL,
          col=as.character(GK["BPU"]),
          horiz=TRUE, xlim=c(0,GOplt.BP.Cmur.max),
          border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE)
  axis(1,at=c(seq(0,GOplt.BP.Cmur.max+5,2)),line=0.25)
  abline(v=seq(0,GOplt.BP.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black")
  mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
  text(GOplt.BP.Cmur.max, numbars*1.2, # lpos*1.2,
       labels="Biological\nProcess",
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["BPD"]),
         # as.character(KK[c("BPD","BPD","MFD")]),
       adj=c(1,1.5), font=2, cex=2)
  text(-2, -0.7, # lpos*1.2,
       labels=paste("GO level ",golvl,sep=""),
            # c("Biological Process", "Cellular Component", "Molecular Function"),
       col=as.character(GK["BPD"]),
         # as.character(KK[c("BPD","BPD","MFD")]),
       adj=c(1,1), font=2, cex=1.25, xpd=TRUE)
  par(par.def);
  dev.off();

} # for golvl
Using R to draw starfish/sea urchin GO comparison barplots
# Loading GO info datasets
FILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin";
DT.CmurxSpur <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""),
                           header=TRUE, sep="\t");
FILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot";
DT.CmurxUPrt <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""),
                           header=TRUE, sep="\t");
FILE <- "./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot";
DT.SpurxUPrt <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""),
                           header=TRUE, sep="\t");

# volcano plots
plot((1 - DT.CmurxUPrt$hypergeom.pval) ~ DT.CmurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - DT.CmurxUPrt$hypergeom.padj) ~ DT.CmurxUPrt$logodds, pch=16, col="red")

plot((1 - DT.SpurxUPrt$hypergeom.pval) ~ DT.SpurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - DT.SpurxUPrt$hypergeom.padj) ~ DT.SpurxUPrt$logodds, pch=16, col="red")

plot((1 - DT.CmurxSpur$hypergeom.pval) ~ DT.CmurxSpur$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - DT.CmurxSpur$hypergeom.padj) ~ DT.CmurxSpur$logodds, pch=16, col="red")

dt.CmurxUPrt <- DT.CmurxUPrt[!is.na(DT.CmurxUPrt$hypergeom.padj) & DT.CmurxUPrt$hypergeom.padj < 1e-5, ]
dt.SpurxUPrt <- DT.SpurxUPrt[!is.na(DT.SpurxUPrt$hypergeom.padj) & DT.SpurxUPrt$hypergeom.padj < 1e-5, ]
dt.CmurxSpur <- DT.CmurxSpur[!is.na(DT.CmurxSpur$hypergeom.padj) & DT.CmurxSpur$hypergeom.padj < 1e-5, ]

td.CmurxUPrt <- DT.CmurxUPrt[!is.na(DT.CmurxUPrt$hypergeom.padj), ]
td.SpurxUPrt <- DT.SpurxUPrt[!is.na(DT.SpurxUPrt$hypergeom.padj), ]
td.CmurxSpur <- DT.CmurxSpur[!is.na(DT.CmurxSpur$hypergeom.padj), ]

# volcano plots filtered
plot((1 - dt.CmurxUPrt$hypergeom.pval) ~ dt.CmurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - dt.CmurxUPrt$hypergeom.padj) ~ dt.CmurxUPrt$logodds, pch=16, col="red")

plot((1 - dt.SpurxUPrt$hypergeom.pval) ~ dt.SpurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - dt.SpurxUPrt$hypergeom.padj) ~ dt.SpurxUPrt$logodds, pch=16, col="red")

plot((1 - dt.CmurxSpur$hypergeom.pval) ~ dt.CmurxSpur$logodds, xlim=c(-10,10), ylim=c(0,1))
points((1 - dt.CmurxSpur$hypergeom.padj) ~ dt.CmurxSpur$logodds, pch=16, col="red")

# logods dotplots
gox1 <- split(DT.CmurxUPrt, DT.CmurxUPrt$Ontology)
gox2 <- split(DT.SpurxUPrt, DT.SpurxUPrt$Ontology)
gox3 <- split(DT.CmurxSpur, DT.CmurxSpur$Ontology)

gos.BP <- unique(sort(c(as.character(gox1$BP$GO),
                        as.character(gox2$BP$GO),
                        as.character(gox3$BP$GO))));
gos.BP.len <- length(gos.BP);
gos.MF <- unique(sort(c(as.character(gox1$MF$GO), 
                       as.character(gox2$MF$GO),
                        as.character(gox3$MF$GO))));
gos.MF.len <- length(gos.MF);
gos.CC <- unique(sort(c(as.character(gox1$CC$GO),
                        as.character(gox2$CC$GO),
                        as.character(gox3$CC$GO))));
gos.CC.len <- length(gos.CC);

plot(DT.CmurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO & gos.CC %in% DT.SpurxUPrt$GO],
     DT.SpurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO & gos.CC %in% DT.CmurxUPrt$GO])

PTBP.Cmur <- DT.CmurxUPrt$logodds[gos.BP %in% DT.CmurxUPrt$GO]
PTBP.Cmur[ is.na(PTBP.Cmur) ] <- 0
PTBP.Spur <- DT.SpurxUPrt$logodds[gos.BP %in% DT.CmurxUPrt$GO]
PTBP.Spur[ is.na(PTBP.Spur) ] <- 0
plot(PTBP.Cmur,PTBP.Spur)

PTMF.Cmur <- DT.CmurxUPrt$logodds[gos.MF %in% DT.CmurxUPrt$GO]
PTMF.Cmur[ is.na(PTMF.Cmur) ] <- 0
PTMF.Spur <- DT.SpurxUPrt$logodds[gos.MF %in% DT.CmurxUPrt$GO]
PTMF.Spur[ is.na(PTMF.Spur) ] <- 0
plot(PTMF.Cmur,PTMF.Spur)

PTCC.Cmur <- DT.CmurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO]
PTCC.Cmur[ is.na(PTCC.Cmur) ] <- 0
PTCC.Spur <- DT.SpurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO]
PTCC.Spur[ is.na(PTCC.Spur) ] <- 0
plot(PTCC.Cmur,PTCC.Spur)

# barplot tests
st.CmurxUPrt <- head(dt.CmurxUPrt[order(dt.CmurxUPrt$hypergeom.padj), ], n=500)
st.SpurxUPrt <- head(dt.SpurxUPrt[order(dt.SpurxUPrt$hypergeom.padj), ], n=500)

st.CmurxSpur <- head(dt.CmurxSpur[order(dt.CmurxSpur$hypergeom.padj), ], n=500)

par.def<-par(mar=c(4,25,1,1)+0.1);
barplot(st.CmurxSpur$logodds,names.arg=paste(st.CmurxSpur$Term," [",st.CmurxSpur$GO,"]",sep=""),horiz=TRUE,las=2)
par(par.def);

# K<-list(CC="red",BP="green",MF="blue")
K<-list(CC="#E94F6C",BP="#52E853",MF="#6FAFF9")
k<-list(CC="#A24436",BP="#0A945C",MF="#007CBA")

gord.CmurxSpur <- st.CmurxSpur[order(st.CmurxSpur$Ontology, -st.CmurxSpur$logodds), ]
par.def<-par(mar=c(4,25,1,1)+0.1);
barplot(gord.CmurxSpur$logodds,
        names.arg=paste(gord.CmurxSpur$Term," [",gord.CmurxSpur$GO,"]",sep=""),
        col=as.character(K[gord.CmurxSpur$Ontology]),
        horiz=TRUE,las=2)
par(par.def);

ts.CmurxUPrt <- head(td.CmurxUPrt[order(td.CmurxUPrt$hypergeom.padj), ], n=500)
ts.SpurxUPrt <- head(td.SpurxUPrt[order(td.SpurxUPrt$hypergeom.padj), ], n=500)

# filtering 100 topmost and bottom overrepresented terms
tdu.CmurxSpur <- td.CmurxSpur[order(-td.CmurxSpur$logodds),]
tts.CmurxSpur <- rbind(head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "CC", ], n=100),
                       head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "BP", ], n=100),
                       head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "MF", ], n=100),
                       tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "CC", ], n=100),
                       tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "BP", ], n=100),
                       tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "MF", ], n=100))

KK<-list(CCU="#E94F6C", # lightred
         CCD="#A24436", # darkred
         BPU="#52E853", # lightgreen
         BPD="#426A2B", # darkgreen ... #0A945C
         MFU="#6FAFF9", # lightblue
         MFD="#007CBA") # darkblue

tts.CmurxSpur$OntoColors <- c(rep(KK["CCU"],100),rep(KK["BPU"],100),rep(KK["MFU"],100),
                              rep(KK["CCD"],100),rep(KK["BPD"],100),rep(KK["MFD"],100))

# GO comparison final barplots (for topmost and bottom 100 terms)
pdf(file="GOA/GOcomp_CmurxSpur_100low+100top.pdf", onefile=TRUE,
    width=8, height=24, bg="white", pointsize=10, useDingbats=FALSE);
ggord.CmurxSpur <- tts.CmurxSpur[order(tts.CmurxSpur$Ontology, -tts.CmurxSpur$logodds), ]
lpos<-c(length(ggord.CmurxSpur$Ontology[ ggord.CmurxSpur$Ontology=="BP" ]),
        length(ggord.CmurxSpur$Ontology[ ggord.CmurxSpur$Ontology=="BP" | ggord.CmurxSpur$Ontology=="CC" ]),
        length(ggord.CmurxSpur$Ontology)) # [ ggord.CmurxSpur$Ontology=="MF" ]
par.def<-par(mar=c(6,25,1,1)+0.1);
barplot(ggord.CmurxSpur$logodds[1:200],
        names.arg=paste(ggord.CmurxSpur$Term[1:200]," [",ggord.CmurxSpur$GO[1:200],"]",sep=""),
        col=as.character(ggord.CmurxSpur$OntoColors[1:200]), # K[ggord.CmurxSpur$Ontology]),
        horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i")
mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
text(8, 200*1.2, # lpos*1.2,
     labels="Biological\nProcess", # c("Biological Process", "Cellular Component", "Molecular Function"),
     col=as.character(KK["BPD"]), # KK[c("BPD","CCD","MFD")]),
     adj=c(1,1.5), font=2, cex=2)
abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black")
abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE)
barplot(ggord.CmurxSpur$logodds[201:400],
        names.arg=paste(ggord.CmurxSpur$Term[201:400]," [",ggord.CmurxSpur$GO[201:400],"]",sep=""),
        col=as.character(ggord.CmurxSpur$OntoColors[201:400]), # K[ggord.CmurxSpur$Ontology]),
        horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i")
mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
text(8, 200*1.2, # lpos*1.2,
     labels="Cellular\nComponent", # c("Biological Process", "Cellular Component", "Molecular Function"),
     col=as.character(KK["CCD"]), # as.character(KK[c("BPD","CCD","MFD")]),
     adj=c(1,1.5), font=2, cex=2)
abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black")
abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE)
barplot(ggord.CmurxSpur$logodds[401:600],
        names.arg=paste(ggord.CmurxSpur$Term[401:600]," [",ggord.CmurxSpur$GO[401:600],"]",sep=""),
        col=as.character(ggord.CmurxSpur$OntoColors[401:600]), # K[ggord.CmurxSpur$Ontology]),
        horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i")
mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
text(8, 200*1.2, # lpos*1.2,
     labels="Molecular\nFunction", # c("Biological Process", "Cellular Component", "Molecular Function"),
     col=as.character(KK["MFD"]), # as.character(KK[c("BPD","CCD","MFD")]),
     adj=c(1,1.5), font=2, cex=2)
abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black")
abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE)
par(par.def);
dev.off();

# filtering 25 topmost and bottom overrepresented terms
tdo.CmurxSpur <- td.CmurxSpur[order(-td.CmurxSpur$logodds),]
tts2.CmurxSpur <- rbind(head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "CC", ], n=25),
                        head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "BP", ], n=25),
                        head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "MF", ], n=25),
                        tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "CC", ], n=25),
                        tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "BP", ], n=25),
                        tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "MF", ], n=25))
tts2.CmurxSpur$OntoColors <- c(rep(KK["CCU"],25),rep(KK["BPU"],25),rep(KK["MFU"],25),
                               rep(KK["CCD"],25),rep(KK["BPD"],25),rep(KK["MFD"],25))

pdf(file="GOA/GOcomp_CmurxSpur_25low+25top.pdf",
    width=11, height=16, bg="white", pointsize=12, useDingbats=FALSE);
ggord2.CmurxSpur <- tts2.CmurxSpur[order(tts2.CmurxSpur$Ontology, -tts2.CmurxSpur$logodds), ]
lpos<-c(length(ggord2.CmurxSpur$Ontology[ ggord2.CmurxSpur$Ontology=="BP" ]),
        length(ggord2.CmurxSpur$Ontology[ ggord2.CmurxSpur$Ontology=="BP" | ggord2.CmurxSpur$Ontology=="CC" ]),
        length(ggord2.CmurxSpur$Ontology)) # [ ggord2.CmurxSpur$Ontology=="MF" ]
par.def<-par(mar=c(6,25,1,1)+0.1);
barplot(ggord2.CmurxSpur$logodds,
        names.arg=paste(ggord2.CmurxSpur$Term," [",ggord2.CmurxSpur$GO,"]",sep=""),
        col=as.character(ggord2.CmurxSpur$OntoColors),
        horiz=TRUE, xlim=c(-3,7.5), border=NA, las=1, cex.names=0.5, yaxs = "i")
mtext("\nCmur/Spur log odds ratio\n[dark:lower25 + light:top25]", side=1, cex=1.5, adj=0.5, padj=1, xpd=TRUE)
text(7, lpos*1.2,
     labels=c("Biological\nProcess", "Cellular\nComponent", "Molecular\nFunction"),
     col=as.character(KK[c("BPD","CCD","MFD")]),
     adj=c(1,1.5), font=2, cex=2)
abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black")
abline(h=lpos*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE)
abline(h=seq(5,145,5)*1.2, lty="dotted", lwd=0.25, col="black", xpd=TRUE)
par(par.def);
dev.off();

# 
pdf(file="GOA/GOcomp_CmurxSpur_hypergeom_adjpval_1-5_signif.pdf",
    width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE);
tdc.CmurxSpur <- dt.CmurxSpur[order(-dt.CmurxSpur$logodds),]
ggordco.CmurxSpur <- tdc.CmurxSpur[order(tdc.CmurxSpur$Ontology, -tdc.CmurxSpur$logodds), ]
lpos<-c(length(ggordco.CmurxSpur$Ontology[ ggordco.CmurxSpur$Ontology=="BP" ]),
        length(ggordco.CmurxSpur$Ontology[ ggordco.CmurxSpur$Ontology=="BP" | ggordco.CmurxSpur$Ontology=="CC" ]),
        length(ggordco.CmurxSpur$Ontology)) # [ ggordco.CmurxSpur$Ontology=="MF" ]
par.def<-par(mar=c(10,24,1,1)+0.1);
barplot(ggordco.CmurxSpur$logodds,
        names.arg=paste(ggordco.CmurxSpur$Term," [",ggordco.CmurxSpur$GO,"]",sep=""),
        col=as.character(K[ggordco.CmurxSpur$Ontology]),
        horiz=TRUE, xlim=c(0,7.5), border=NA, las=1)
mtext("\nCmur/Spur log odds ratio\n(hypergeometric test adjusted p-value < 10-5)", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE)
text(7, lpos*1.2,
     labels=c("BP", "CC", "MF"),
     col=c(as.character(k[c("CC","BP","MF")])),
     adj=c(1,1.25), font=2, cex=2)
abline(v=seq(0,7,1), lty="dotted", lwd=0.5, col="black")
par(par.def);
dev.off();

# GO wordclouds?
library("GOsummaries");

wcd1 <- data.frame(Term = st.CmurxUPrt$Term,
                  Score = st.CmurxUPrt$hypergeom.padj)
wcd2 <- data.frame(Term = st.SpurxUPrt$Term,
                  Score = st.SpurxUPrt$hypergeom.padj)
wcd3 <- data.frame(Term = st.CmurxSpur$Term,
                  Score = st.CmurxSpur$hypergeom.padj)

gs <- gosummaries(wc_data = list(Results1 = wcd1, Results2 = wcd2))
 gs <- gosummaries(wc_data = list(Results3 = wcd3))
plot(gs)

Software Versions

References

[1] Stein LD.
    “Using GBrowse 2.0 to visualize and share next-generation sequence data.”
    Brief Bioinform. 14(2):162-71, 2013 PMID.

[2] Deng W, Nickle DC, Learn GH, Maust B, Mullins JI.
    “ViroBLAST: A stand-alone BLAST web server for flexible queries of multiple databases and user’s datasets.”
    Bioinformatics, 23(17):2334-2336, 2007 PMID.

Page last modified on Wednesday 31 of January, 2018 19:03:30 CET