NewBler
Trinity
SOAPdenovo-Trans
Coscinasterias is a cosmopolitan genus of large and sea stars with the ability of somatic fission as a clonal reproductive strategy. Of these, C. muricata from New Zealand is particularly interesting because it reproduces exclusively sexually in certain geographical areas and mostly by cloning in others. During fission, the animals tear themselves apart across their central disc, where after the lost body parts are regenerated. We have sequenced and subsequently analyzed the transcriptome of the coelomic epithelium of a clonal C. muricata specimen. This work has been submitted to Marine Genomics (Gabre et al, "The coelomic epithelium transcriptome from a clonal sea star, Coscinasterias muricata", 2015). If you find the following supplementary material useful for your research, please cite that paper.
We have uploaded the transcript sequences into a GBrowse2 [1] engine to get a transcriptome browser, where many of the datasets produced by the protocols on this supplementary material web page were projected over each sequence.
GBrowse2
We have adapted viroBLAST [2] to fit into our CMS system. That BLAST web service has already defined two nucleotide databases for the reads, both raw 454 sequences and cleaned reads (by Trimmomatic) and three databases for the transcriptomes produced, including SOAPdenovo-Trans (recommended), Trinity, and NewBler. There are two amino acid databases for each of those three transcriptome assemblies, made upon the six frame translations of the nucleotide sequences for the contigs, as well as taking the longest ORF from that set separately. Let us know if there is any problem using this service.
viroBLAST
BLAST
Trimmomatic
The following table summarizes the sequence files along with their approximate byte size. All the files were compressed using bzip2.
bzip2
"FX" and "TR" correspond to the assembly run using the FastX or the Trimmomatic protocols to clean the 454 reads respectively, you can find the assembly details on the following methods section.
FastX
The raw data has been also submitted to the NCBI-SRA database:
This section contains a detailed description of the computational protocols that produced the current C. muricata transcriptome version, as well as its annotation. Code blocks for all the steps show the command-line arguments passed to the different tools used all along the analyses performed.
Sequence reads were produced at Amplicon Express, using half plate of a Roche GS FLX+ 454-Titanium sequencer. sff2fastq was used to retrieve reads sequence and qualities in FASTQ format from the original standard flowgram format (SFF) files.
sff2fastq
# BD is the shell variable for the project base directory. # Using sff2fastq to extract nucleotides and qualities from SFF files into fastq for DS in G0CB7HT03 G0CB7HT04 ; do { sff2fastq -o $BD/seqs/$DS.fq \ $BD/raw_data/$DS.sff \ 2> $BD/seqs/$DS.sff2fastq.log }; done
for DS in G0CB7HT03 G0CB7HT04 ; do { echo "# Running commands on $DS" 1>&2 # Basic read length and GC content summary using EMBOSS infoseq infoseq -noheading -only -name -length -pgc \ fastq::$BD/seqs/$DS.fq \ > $BD/seqs/$DS.info.tbl gawk '{ S+=$2; N++ } END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; } ' $BD/seqs/$DS.info.tbl # Using a custom R script to plot Length vs GC content scatterplots Rscript Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/$DS.info.tbl # QC analysis using FastQC mkdir $BD/seqs/$DS.QC fastqc -f fastq --extract --outdir $BD/seqs/$DS.QC $BD/seqs/$DS.fq }; done
There are several tools that can help cleaning the raw reads before the assembly, we used here those provided by FASTX tools suite. The tools it provides facilitate the pipelining of filters to clean up the read sequences. However, fastq_quality_filter does not trim low quality ends, it removes any read that does not meet the criteria of a minimum quality score (-+q=20+- for Phred Q20) across a given length percent (-+p=50+- in our case). fastx_artifacts_filter has not found any issue on our data set, thus the cleaning was mainly performed by the fastx_clipper steps.
fastq_quality_filter
fastx_artifacts_filter
fastx_clipper
for DS in G0CB7HT03 G0CB7HT04 ; do { # FastX Quality filtering fastq_quality_filter -i $BD/seqs/${DS}.fq -o $BD/seqs/${DS}_q20.fq -p 50 -q 20 -Q 33 # FastX Artifact filtering fastx_artifacts_filter -i $BD/seqs/${DS}_q20.fq -o $BD/seqs/${DS}_q20_noartfct.fq -Q 33 # Fastx Adapters clipping fastx_clipper -i $BD/seqs/${DS}_q20_noartfct.fq \ -a CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG -Q 33 | \ fastx_clipper -a CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG -Q 33 | \ fastx_clipper -a GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC -Q 33 | \ fastx_clipper -a TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG -Q 33 | \ fastx_clipper -a CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA -Q 33 | \ fastx_clipper -a GCCTCCCTCGCGCCATCAG -Q 33 | \ fastx_clipper -a GCCTTGCCAGCCCGCTCAG -Q 33 | \ fastx_clipper -a CGTATCGCCTCCCTCGCGCCATCAG -Q 33 | \ fastx_clipper -a CTATGCGCCTTGCCAGCCCGCTCAG -Q 33 \ -o $BD/seqs/${DS}_q20_noartfct_clipped.fq }; done
for DS in G0CB7HT03 G0CB7HT04 ; do { for FL in q20 q20_noartfct q20_noartfct_clipped ; do { echo "# Running commands on $DS _ $FL" 1>&2 # Basic read length and GC content summary using EMBOSS infoseq infoseq -noheading -only -name -length -pgc \ fastq::$BD/seqs/${DS}_${FL}.fq \ > $BD/seqs/${DS}_${FL}.info.tbl gawk '{ S+=$2; N++ } END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; } ' $BD/seqs/${DS}_${FL}.info.tbl # Using a custom R script to plot Length vs GC content scatterplots Rscript Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/${DS}_${FL}.info.tbl # QC analysis using FastQC mkdir $BD/seqs/${DS}_${FL}.QC fastqc -f fastq --extract --outdir $BD/seqs/${DS}_${FL}.QC $BD/seqs/${DS}_${FL}.fq }; done }; done
After this step, reads sequence length distribution shows a dramatic decrease, while the GC content seems to fit perfectly over the theoretical distribution. We tried a more integrated approach to see if we can improve the cleaning step on the quality trimming, which was based on Trimmomatic.
export TMC=/opt/install/Trimmomatic-0.32/ for DS in G0CB7HT03 G0CB7HT04 ; do { java -jar $TMC/trimmomatic-0.32.jar SE \ $BD/seqs/${DS}.fq $BD/seqs/${DS}.trimmomatic.fq \ ILLUMINACLIP:$TMC/adapters/454_adapters-SE.fa:2:30:10 \ LEADING:15 TRAILING:15 SLIDINGWINDOW:4:15 MINLEN:30 TOPHRED33 \ 2> $BD/seqs/${DS}.trimmomatic.log 1>&2 }; done
for DS in G0CB7HT03 G0CB7HT04 ; do { echo "# Running commands on $DS" 1>&2 # Basic read length and GC content summary using EMBOSS infoseq infoseq -noheading -only -name -length -pgc \ fastq::$BD/seqs/${DS}.trimmomatic.fq \ > $BD/seqs/${DS}.trimmomatic.info.tbl gawk '{ S+=$2; N++ } END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; } ' $BD/seqs/${DS}.trimmomatic.info.tbl # Using a custom R script to plot Length vs GC content scatterplots Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R $BD/seqs/${DS}.trimmomatic.info.tbl # QC analysis using FastQC mkdir $BD/seqs/${DS}.trimmomatic.QC fastqc -f fastq --extract --outdir $BD/seqs/${DS}.trimmomatic.QC $BD/seqs/${DS}.trimmomatic.fq }; done # Making the boxplots Rscript Compare_Vector_Distributions.R \ "Reads Length Distribution::NA::Sequence Length (bp)" \ blastout/violin.G0CB7HT03_reads_lengthdist.png \ "Raw reads${RET}::2:: ::"seqs/G0CB7HT03.info.tbl"$SPC" \ "FASTX${RET}q20::2:: ::"seqs/G0CB7HT03_q20.info.tbl"$NBC" \ "FASTX${RET}q20+art::2:: ::"seqs/G0CB7HT03_q20_noartfct.info.tbl"$NBC" \ "FASTX${RET}q20+art+clip::2:: ::"seqs/G0CB7HT03_q20_noartfct_clipped.info.tbl"$NBC" \ "Trimmomatic${RET}::2:: ::"seqs/G0CB7HT03.trimmomatic.info.tbl"$TRC" Rscript Compare_Vector_Distributions.R \ "Reads Length Distribution::NA::Sequence Length (bp)" \ blastout/violin.G0CB7HT04_reads_lengthdist.png \ "Raw reads${RET}::2:: ::"seqs/G0CB7HT04.info.tbl"$SPC" \ "FASTX${RET}q20::2:: ::"seqs/G0CB7HT04_q20.info.tbl"$NBC" \ "FASTX${RET}q20+art::2:: ::"seqs/G0CB7HT04_q20_noartfct.info.tbl"$NBC" \ "FASTX${RET}q20+art+clip::2:: ::"seqs/G0CB7HT04_q20_noartfct_clipped.info.tbl"$NBC" \ "Trimmomatic${RET}::2:: ::"seqs/G0CB7HT04.trimmomatic.info.tbl"$TRC"
The improved nucleotide qualities after processing reads by Trimmomatic does not affect significantly to the length of the reads as in the previous procedure, as it is shown on the Length versus GC content scatter-plots on the right column of the above table. For comparison purposes, the reads length distribution after each procedure is shown in the violin plots below:
For this step three different assemblers were tested: NewBler (the reference assembler for 454 sequences), trinity (routinely used for NGS transcriptome assemblies), and SOAPdenovo-Trans (a variant of SOAPdenovo specific for transcriptomes). The latter two are modern assembly tools, yet it has been reported that SOAP-denovo produces less chimeric assemblies (). In this sequence set there is no paired-end data, yet we expect that the lack of paired-end data will be compensated by the longer reads we got from the 454 sequencing experiment.
trinity
SOAPdenovo
SOAP-denovo
# Preparing the project folder newAssembly -cdna starfish_newbler_fastx # Preparing the reads sequences addRun starfish_newbler_fastx ./seqs/G0CB7HT03_q20_noartfct_clipped.fasta \ ./seqs/G0CB7HT04_q20_noartfct_clipped.fasta # 2 read files successfully added. # G0CB7HT03_q20_noartfct_clipped.fasta # G0CB7HT04_q20_noartfct_clipped.fasta # Running the assembler nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler_fastx \ 2> starfish_newbler_fastx/starfish_newbler_fastx.log 1>&2 &
# Preparing the project folder newAssembly -cdna starfish_newbler_trimmo # Preparing the reads sequences addRun starfish_newbler_trimmo ../seqs/G0CB7HT03.trimmomatic.fasta ../seqs/G0CB7HT04.trimmomatic.fasta # 2 read files successfully added. # G0CB7HT03.trimmomatic.fasta # G0CB7HT04.trimmomatic.fasta # Running the assembler nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler_trimmo \ 2> starfish_newbler_trimmo/starfish_newbler_trimmo.log 1>&2 &
# Preparing the project folder newAssembly -cdna starfish_newbler # Preparing the reads sequences addRun starfish_newbler ../seqs/G0CB7HT03.sff ../seqs/G0CB7HT04.sff # 2 read files successfully added. # G0CB7HT03.sff # G0CB7HT04.sff # Running the assembler nohup runProject -cpu 32 -vt ./vectors_all.fa starfish_newbler \ 2> starfish_newbler/starfish_newbler.log 1>&2 &
nohup Trinity.pl --seqType fq --JM 100G --CPU 32 -min_contig_length 200 \ --single ../seqs/G0CB7HT03_q20_noartfct_clipped.fq \ ../seqs/G0CB7HT04_q20_noartfct_clipped.fq \ --output ./starfish_trinity \ 2> ./starfish_trinity/Trinity.log 1>&2 & TrinityStats.pl starfish_trinity/Trinity.fasta
nohup Trinity.pl --seqType fq --JM 100G --CPU 32 -min_contig_length 200 \ --single ../seqs/G0CB7HT03.trimmomatic.fq ../seqs/G0CB7HT04.trimmomatic.fq \ --output ./starfish_trinity_trimmo \ 2> ./starfish_trinity_trimmo/Trinity.trimmomatic_fq.log 1>&2 & TrinityStats.pl starfish_trinity_trimmo/Trinity.fasta
As there are no Paired-End reads, scaffold sequences, aka transcripts, produced with the +-SOAPdenovo-Trans+- are the same as those from the contigs from step 2. The config files are available from these links: starfish.conf and starfish_trimmo.conf.
# step1: computing pre-graph SOAPdenovo-Trans-127mer pregraph -s ./starfish_soap/starfish.conf -K 63 \ -o ./starfish_soap/starfish -p 32 \ 1> ./starfish_soap/starfish.pregraph.log \ 2> ./starfish_soap/starfish.pregraph.err # step2: building contigs SOAPdenovo-Trans-127mer contig -g ./starfish_soap/starfish \ 1> ./starfish_soap/starfish.contig.log \ 2> ./starfish_soap/starfish.contig.err # step3: mapping reads SOAPdenovo-Trans-127mer map -s ./starfish_soap/starfish.conf \ -g ./starfish_soap/starfish -p 32 \ 1> ./starfish_soap/starfish.map.log \ 2> ./starfish_soap/starfish.map.err # step4: scaffolding SOAPdenovo-Trans-127mer scaff -g ./starfish_soap/starfish -p 32 \ 1> ./starfish_soap/starfish.scaff.log \ 2> ./starfish_soap/starfish.scaff.err
# step1 SOAPdenovo-Trans-127mer pregraph -s ./starfish_soap_trimmo/starfish_trimmo.conf -K 63 \ -o ./starfish_soap_trimmo/starfish_trimmo -p 32 \ 1> ./starfish_soap_trimmo/starfish_trimmo.pregraph.log \ 2> ./starfish_soap_trimmo/starfish_trimmo.pregraph.err # step2 SOAPdenovo-Trans-127mer contig -g ./starfish_soap_trimmo/starfish_trimmo \ 1> ./starfish_soap_trimmo/starfish_trimmo.contig.log \ 2> ./starfish_soap_trimmo/starfish_trimmo.contig.err # step3 SOAPdenovo-Trans-127mer map -s ./starfish_soap_trimmo/starfish_trimmo.conf \ -g ./starfish_soap_trimmo/starfish_trimmo -p 32 \ 1> ./starfish_soap_trimmo/starfish_trimmo.map.log \ 2> ./starfish_soap_trimmo/starfish_trimmo.map.err # step4 SOAPdenovo-Trans-127mer scaff -g ./starfish_soap_trimmo/starfish_trimmo -p 32 \ 1> ./starfish_soap_trimmo/starfish_trimmo.scaff.log \ 2> ./starfish_soap_trimmo/starfish_trimmo.scaff.err
# Retrieve fasta sequences for all assemblies for DS in starfish_newbler_fastx/assembly/454Isotigs.fna \ starfish_newbler_trimmo/assembly/454Isotigs.fna \ starfish_newbler/assembly/454Isotigs.fna \ starfish_trinity/Trinity.fasta \ starfish_trinity_trimmo/Trinity.fasta \ starfish_soap/starfish.scafSeq \ starfish_soap_trimmo/starfish_trimmo.scafSeq ; do { ds=${DS%%/*} echo "# Running commands on $DS -> $ds" 1>&2 ln -vs $DS $ds.fa # Basic read length and GC content summary using EMBOSS infoseq infoseq -noheading -only -name -length -pgc \ fasta::$ds.fa > $ds.info.tbl gawk '{ S+=$2; N++ } END{ print "Total seqs: ", N, " Total nucleotide length: ", S, " AvgLength: ", S/N; } ' $ds.info.tbl # Using a custom R script to plot Length vs GC content scatterplots Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R $ds.info.tbl }; done # Making boxplots Rscript Compare_Vector_Distributions.R \ "Transcript Length Distribution::NA::Sequence Length (bp)" \ blastout/violin.assemblies_lengthdist.png \ "NewBler (RAW)::2:: ::"assemblies/starfish_newbler.info.tbl"$NBC" \ "NewBler (FX)::2:: ::"assemblies/starfish_newbler_fastx.info.tbl"$NBC" \ "NewBler (TR)::2:: ::"assemblies/starfish_newbler_trimmo.info.tbl"$NBC" \ "Trinity (FX)::2:: ::"assemblies/starfish_trinity.info.tbl"$TRC" \ "Trinity (TR)::2:: ::"assemblies/starfish_trinity_trimmo.info.tbl"$TRC" \ "SOAPdnT (FX)::2:: ::"assemblies/starfish_soap.info.tbl"$SPC" \ "SOAPdnT (TR)::2:: ::"assemblies/starfish_soap_trimmo.info.tbl"$SPC" Rscript Compare_Vector_Distributions_ylog.R \ "Transcript Length Distribution::NA::Sequence Length (bp)" \ blastout/violin.assemblies_lengthdist_ylog.png \ "NewBler (RAW)::2:: ::"assemblies/starfish_newbler.info.tbl"$NBC" \ "NewBler (FX)::2:: ::"assemblies/starfish_newbler_fastx.info.tbl"$NBC" \ "NewBler (TR)::2:: ::"assemblies/starfish_newbler_trimmo.info.tbl"$NBC" \ "Trinity (FX)::2:: ::"assemblies/starfish_trinity.info.tbl"$TRC" \ "Trinity (TR)::2:: ::"assemblies/starfish_trinity_trimmo.info.tbl"$TRC" \ "SOAPdnT (FX)::2:: ::"assemblies/starfish_soap.info.tbl"$SPC" \ "SOAPdnT (TR)::2:: ::"assemblies/starfish_soap_trimmo.info.tbl"$SPC"
As better results were obtained after Trimmomatic cleaning, all the downstream analyses were performed over the NewBler "RAW", Trinity "TR", and SOAPdenovo-Trans "TR".
Just in case, we run BLAST search against a database of Viral, Bacterial, and Fungal genomes (VBF), to remove transcript sequences in the assembly that have a reasonable match to another microorganism that was processed along with the samples. We consider those transcripts as contaminant genomic sequence, and we remove from the transcript set.
#!/bin/bash # # cleanvbfseqs.sh query.fa output_prefix # # Homoloy search over VBF genomes to remove potential contaminated sequences # # Note: set and export shell var BDBH, pointing to the BLAST DB VBF files # # INPUT ARGS QUERY=$1 OUTPREFIX=$2 # Minimum coverage for a contaminated sequence has to be larger than this MINCOV=50 # BLAST PARAMETERS EVAL=10e-25 BLSTOFMT='6 qseqid qlen sseqid slen qstart qend sstart send length score evalue bitscore ' BLSTOFMT=$BLSTOFMT'pident nident mismatch positive gapopen gaps ppos qframe sframe qseq sseq' CPUS=32 # Checking Virus genomes echo "### BLAST x Viral genomes..." 1>&2 blastn -task megablast \ -query $QUERY \ -db $BDBH/viral_genomes \ -evalue $EVAL -outfmt "$BLSTOFMT" \ -dust "yes" -num_threads $CPUS \ 2> $OUTPREFIX.viral_genomes.log | \ gzip -vc9 - > $OUTPREFIX.viral_genomes.out.gz # Checking Bacterial genomes echo "### BLAST x Bacterial genomes..." 1>&2 blastn -task megablast \ -query $QUERY \ -db $BDBH/bacterial_genomes.raj \ -evalue $EVAL -outfmt "$BLSTOFMT" \ -dust "yes" -num_threads $CPUS \ 2> $OUTPREFIX.bacterial_genomes.log | \ gzip -vc9 - > $OUTPREFIX.bacterial_genomes.out.gz # Checking Fungal genomes echo "### BLAST x Fungal genomes..." 1>&2 blastn -task megablast \ -query $QUERY \ -db $BDBH/fungal_genomes.raj \ -evalue $EVAL -outfmt "$BLSTOFMT" \ -dust "yes" -num_threads $CPUS \ 2> $OUTPREFIX.fungal_genomes.log | \ gzip -vc9 - > $OUTPREFIX.fungal_genomes.out.gz # Calculating HSPs sequence coverage echo "### Calculate sequence coverage..." 1>&2 for VBF in viral bacterial fungal; do { zcat $OUTPREFIX.${VBF}_genomes.out.gz | \ coverage_blastshorttbl.pl --prog=blastn - \ > $OUTPREFIX.${VBF}_genomes.covtbl }; done # Retrieving sequence ids for contaminated sequences: # criteria a significant match to viral/bacterial/fungal genomes # with more than 50% sequence sequence coverage echo "### Retrieving sequence ids for contaminated sequences..." 1>&2 gawk -v COV=$MINCOV '($1 == "Q" && $13 > COV) { print $2, $14 } ' $OUTPREFIX.{viral,bacterial,fungal}_genomes.covtbl | \ sort | uniq > $OUTPREFIX.VBF_contaminated_seqs.QS_ids gawk '{ print $1 }' $OUTPREFIX.VBF_contaminated_seqs.QS_ids | \ sort | uniq > $OUTPREFIX.VBF_contaminated_seqs.ids # Filtering out those sequences that DO NOT appear on the previous list echo "### Filtering out vbf clean sequences..." 1>&2 grepID.pl -vAFN $OUTPREFIX.VBF_contaminated_seqs.ids \ $QUERY \ > $QUERY.vbfclean.fa exit 0;
# Running the VBF cleaning script bin/cleanvbfseqs.sh assemblies/starfish_newbler.fa \ blastcont/starfish_newbler \ 2> blastcont/starfish_newbler.cleanvbfseqs.log 1>&2 & bin/cleanvbfseqs.sh assemblies/starfish_trinity_trimmo.fa \ blastcont/starfish_trinity_trimmo \ 2> blastcont/starfish_trinity_trimmo.cleanvbfseqs.log 1>&2 & bin/cleanvbfseqs.sh assemblies/starfish_soap_trimmo.fa \ blastcont/starfish_soap_trimmo \ 2> blastcont/starfish_soap_trimmo.cleanvbfseqs.log 1>&2 & # Only few hits found and removed wc -l blastcont/starfish_{newbler,trinity_trimmo,soap_trimmo}.VBF_contaminated_seqs.ids 7 blastcont/starfish_newbler.VBF_contaminated_seqs.ids 30 blastcont/starfish_trinity_trimmo.VBF_contaminated_seqs.ids 22 blastcont/starfish_soap_trimmo.VBF_contaminated_seqs.ids
while read IFA OFA; do { ifa=${IFA%.fa} # getting longest ORFs cdna2orfs.v2.pl -VSBNldu assemblies/$IFA \ > assemblies/$OFA.nuc.fa cdna2orfs.v2.pl -VSBldu assemblies/$IFA \ > assemblies/$OFA.aa.fa # getting transcript length infoseq -noheading -only -name -length -pgc \ fasta::assemblies/$IFA \ > assemblies/$ifa.nuc.info.tbl # getting longest ORFs length infoseq -noheading -only -name -length -pgc \ fasta::assemblies/$OFA.nuc.fa \ > assemblies/$OFA.nuc.info.tbl Rscript /usr/local/molbio/libR/Sequence_Length-x-PGC_scatterplot+density.R assemblies/$OFA.nuc.info.tbl # comparing transcript and longest ORFs lengths gawk 'BEGIN{ OFS="\t"; while (getline<ARGV[1]>0) I[$1] = $2; ARGV[1] = ""; print "ID", "TRlen", "ORFlen"; } { gsub(/\.[012][\-+][0-9]+$/, "", $1); if ($1 in I) { print $1, I[$1], $2; F[$1] = $2; } else { print $1, "NA", $2; }; } END { for (n in I) { if (! (n in F)) print n, I[n], "NA"; }; }' assemblies/$ifa.nuc.info.tbl \ assemblies/$OFA.nuc.info.tbl \ > assemblies/$OFA.nuc.trcomp.tbl Rscript Transcript_vs_LongestORF_length_plot.R assemblies/$OFA.nuc.trcomp.tbl ${OFA%%.*} 25000 }; done <<'EOF' starfish_newbler.fa.vbfclean.fa starfish_newbler.longest_orfs starfish_trinity_trimmo.fa.vbfclean.fa starfish_trinity_trimmo.longest_orfs starfish_soap_trimmo.fa.vbfclean.fa starfish_soap_trimmo.longest_orfs EOF # starfish_newbler : From 10,149 cDNAs, 10,149 out of 1,143,474 ORFs were filtered... # starfish_trinity_trimmo: From 13,787 cDNAs, 13,787 out of 933,239 ORFs were filtered... # starfish_soap_trimmo : From 11,347 cDNAs, 11,347 out of 559,743 ORFs were filtered...
In the previous section, the longest ORF was computed for each transcript, by considering all six possible coding frames (just to avoid transcripts assembled on the reverse strand). Here, that information is used to fix the orientation (strand) of all the transcript sequences that have its longest ORF encoded in the reverse strand. It is also a good moment to standardize the names of the final transcript sequences, as well as to sort them by size (longest first).
while read BFA NFA SFX; do { echo "# working on $BFA" 1>&2; ( gawk '$1 ~ /\.[012]\+[0-9]+$/ { gsub(/\.[012]\+[0-9]+$/, "", $1); print $1; }' assemblies/$BFA.longest_orfs.nuc.info.tbl | \ grepID.pl -vAF - assemblies/$NFA \ 2> assemblies/$BFA.strfixFWD.log; gawk '$1 ~ /\.[012]\-[0-9]+$/ { gsub(/\.[012]\-[0-9]+$/, "", $1); print $1; }' assemblies/$BFA.longest_orfs.nuc.info.tbl | \ grepID.pl -vAF - assemblies/$NFA \ 2> assemblies/$BFA.strfixREV.log | \ gawk '$1 ~ /^>/ { $1=$1"_rc"; } { print $0; }' ) | fa2tbl.pl - | \ gawk '{ print length($2), $0 }' - | \ sort -k1,1nr | \ gawk -v sfx="$SFX" 'BEGIN{ C=1; } { sln=$1; id=$2; sq=$3; $1=$2=$3=""; dsc=($0~/len(gth)?\=/) ? $0 : " length="sln" "$0; gsub(/ +/," ",dsc); gsub(/len\=/,"length=",dsc); gsub(/ path\=\[[^\]]+\]/,"",dsc); print sprintf("Cmur%s_TR%08d", sfx, C), sq, id, dsc; C++; }' - | \ tbl2fa.pl - > assemblies/$BFA.vbfclean_strandfix.fa # getting contig lengths infoseq -noheading -only -name -length -pgc \ fasta::assemblies/$BFA.vbfclean_strandfix.fa \ > assemblies/$BFA.vbfclean_strandfix.info.tbl }; done <<'EOF' starfish_newbler starfish_newbler.fa.vbfclean.fa NWBR starfish_trinity_trimmo starfish_trinity_trimmo.fa.vbfclean.fa TRIN starfish_soap_trimmo starfish_soap_trimmo.fa.vbfclean.fa SOAP EOF
while read FA TI; do { makeblastdb -in assemblies/$FA \ -input_type fasta \ -dbtype nucl \ -title "$TI" \ -out blastdb/$TI; }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo EOF # starfish_newbler : added 10149 sequences in 0.801275 seconds. # starfish_trinity_trimmo: added 13787 sequences in 0.947321 seconds. # starfish_soap_trimmo : added 11347 sequences in 0.677559 seconds.
BLSTOFMT='6 qseqid qlen sseqid slen qstart qend sstart send length score evalue bitscore ' BLSTOFMT=$BLSTOFMT'pident nident mismatch positive gapopen gaps ppos qframe sframe qseq sseq' export BLSTOFMT # Running BLASTN and TBLASTX to find best pair-wise HSP hits while read IFA IDB OUT; do { blastn -task blastn \ -query assemblies/$IFA \ -db blastdb/$IDB \ -strand 'plus' \ -outfmt "$BLSTOFMT" \ -evalue 10e-10 -num_threads 12 \ 2> blastout/$OUT.blastn.log | \ gzip -c9 - > blastout/$OUT.blastn.out.gz & tblastx -query assemblies/$IFA \ -db blastdb/$IDB \ -strand 'plus' \ -outfmt "$BLSTOFMT" \ -matrix BLOSUM90 \ -evalue 10e-10 -num_threads 12 \ 2> blastout/$OUT.tblastx.log | \ gzip -c9 - > blastout/$OUT.tblastx.out.gz & }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_trinity_trimmo sf_newbler-x-trinitytr starfish_newbler.vbfclean_strandfix.fa starfish_soap_trimmo sf_newbler-x-soaptr starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_newbler sf_trinitytr-x-newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo sf_trinitytr-x-soaptr starfish_soap_trimmo.vbfclean_strandfix.fa starfish_newbler sf_soaptr-x-newbler starfish_soap_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo sf_soaptr-x-trinitytr EOF # Filtering HSPs to find best HITs while read FL; do { zcat blastout/$FL.blastn.out.gz | \ coverage_blastshorttbl.pl --prog=blastn - \ > blastout/$FL.blastn.covtbl & zcat blastout/$FL.tblastx.out.gz | \ coverage_blastshorttbl.pl --prog=tblastx - \ > blastout/$FL.tblastx.covtbl & }; done <<'EOF' sf_newbler-x-trinitytr sf_newbler-x-soaptr sf_trinitytr-x-newbler sf_trinitytr-x-soaptr sf_soaptr-x-newbler sf_soaptr-x-trinitytr EOF
Several tests were performed to assess which read mapper was going to align the long 454 reads over the transcript contigs (unpublished data). Bowtie2 yield the larger amount of mapped reads, therefore those commands are shown below.
Bowtie2
# Preparing BOWTIE2 target databases (for the transcript contigs) while read FA TI; do { bowtie2-build -f --large-index --seed 3523 \ ./assemblies/$FA \ ./bowtie2/$TI.bowtiedb \ 2> ./bowtie2/$TI.bowtiedb.log 1>&2 }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo EOF # Mapping sequencing reads over assembled contigs while read FA TI; do { nohup bowtie2 -q --all --phred33 --seed 3523 \ --local --very-sensitive-local \ --no-unal --met 60 --threads 32 \ --met-file ./bowtie2/$TI.bowtie.metrics \ -x ./bowtie2/$TI.bowtiedb \ -U ./seqs/G0CB7HT03.trimmomatic.fq \ -U ./seqs/G0CB7HT04.trimmomatic.fq \ -S ./bowtie2/$TI.bowtie.UPL.sam \ --un-gz ./bowtie2/$TI.bowtie.UPL.unaln.gz \ 2> ./bowtie2/$TI.bowtie.UPL.log 1>&2 & }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo EOF # Getting contig coverage tables while read FA TI; do { printf "# Working on $TI " 1>&2 samtools view -bS ./bowtie2/$TI.bowtie.UPL.sam \ > ./bowtie2/$TI.bowtie.UPL.bam \ 2> ./bowtie2/$TI.bowtie.UPL.sam2bam.log printf "... SAM2BAM " 1>&2 samtools sort ./bowtie2/$TI.bowtie.UPL.bam \ ./bowtie2/$TI.bowtie.UPL.sorted \ 2> ./bowtie2/$TI.bowtie.UPL.sorted.log printf "... BAMSORT " 1>&2 samtools index ./bowtie2/$TI.bowtie.UPL.sorted.bam printf "... BAMSIDX " 1>&2 samtools mpileup -f ./assemblies/$FA \ ./bowtie2/$TI.bowtie.UPL.sorted.bam \ > ./bowtie2/$TI.bowtie.UPL.pileup \ 2> ./bowtie2/$TI.bowtie.UPL.pileup.log printf "... PILEUP " 1>&2 samtools depth ./bowtie2/$TI.bowtie.UPL.sorted.bam \ > ./bowtie2/$TI.bowtie.UPL.pileup_depthonly printf "... DEPTH\n" 1>&2 }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo EOF
# while read FA TI; do { echo "# Working on $TI " 1>&2 gawk 'BEGIN{ while (getline<ARGV[1]>0) { SQL[$1]=$2; TSQL+=$2; }; ARGV[1]=""; } { if (!($1 in sc)) NT++; SC[$1]+=$4; sc[$1]++; TC+=$4; tc++; if (CVmax[$1] == 0) { CVmin[$1] = $4; CVmin["TOT"] = $4; }; CVmax[$1] = CVmax[$1] < $4 ? $4 : CVmax[$1]; CVmin[$1] = CVmin[$1] > $4 ? $4 : CVmin[$1]; CVmax["TOT"] = CVmax["TOT"] < $4 ? $4 : CVmax["TOT"]; CVmin["TOT"] = CVmin["TOT"] > $4 ? $4 : CVmin["TOT"]; } END{ printf "TRID\tLENGTH\tNUCL_TOTAL\tCOVG_SUM\tCOVG_MIN\tCOVG_MAX\tCOVG_AVGall\tCOVG_AVGmap\n"; printf "TOTALseqs:%d\t%d\t%d\t%d\t%d\t%d\t%.3f\t%.3f\n", NT, TSQL, tc, TC, CVmin["TOT"], CVmax["TOT"], TC/TSQL, TC/tc; for (n in SQL) { printf "%s\t%d\t%d\t%d\t%d\t%d\t%.3f\t%.3f\n", n, SQL[n], sc[n], SC[n], CVmin[n], CVmax[n], (n in SQL && SQL[n] != 0) ? SC[n]/SQL[n] : "NA", (n in sc && sc[n] != 0) ? SC[n]/sc[n] : "NA"; } }' ./assemblies/$FA.info.tbl \ ./bowtie2/$TI.bowtie.UPL.pileup \ > ./bowtie2/$TI.bowtie.UPL.pileup.summary egrep -v '^TRID|TOTAL' ./bowtie2/$TI.bowtie.UPL.pileup.summary \ > ./bowtie2/$TI.bowtie.UPL.pileup.sum2plt }; done <<'EOF' starfish_newbler.vbfclean_strandfix starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix starfish_soap_trimmo EOF
Having only one sample we cannot perform a differential expression analysis, yet we can check differences in coverage that will correspond to over-expressed transcripts on the sampled tissue. Sorting by average coverage of reads per transcript, we got the following top ten contigs for each of the three assemblies:
sort -k8,8nr ./bowtie2/starfish_newbler.bowtie.UPL.pileup.summary | head CmurNWBR_TR00009750 425 425 1272737 371 3401 2994.675 2994.675 CmurNWBR_TR00005436 674 674 1202455 625 2334 1784.058 1784.058 CmurNWBR_TR00003299 957 957 1460945 73 2012 1526.588 1526.588 CmurNWBR_TR00008353 497 497 526194 37 1860 1058.740 1058.740 CmurNWBR_TR00006255 607 607 597169 4 1817 983.804 983.804 CmurNWBR_TR00005569 662 662 642126 5 1454 969.979 969.979 CmurNWBR_TR00009130 472 472 444261 26 1514 941.231 941.231 CmurNWBR_TR00008452 494 481 442215 2 1312 895.172 919.366 CmurNWBR_TR00003245 973 973 776495 9 1157 798.042 798.042 CmurNWBR_TR00006095 620 620 475386 3 1326 766.752 766.752 sort -k8,8nr ./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.summary | head CmurTRIN_TR00000205 2602 2602 5457543 1 6635 2097.442 2097.442 CmurTRIN_TR00000032 5220 5220 5572076 1 3409 1067.448 1067.448 CmurTRIN_TR00000126 3155 3151 2799743 1 2045 887.399 888.525 CmurTRIN_TR00000028 5474 5474 2795886 1 1157 510.757 510.757 CmurTRIN_TR00000589 1686 1659 680445 1 1140 403.585 410.154 CmurTRIN_TR00000027 5543 5543 2156262 1 959 389.006 389.006 CmurTRIN_TR00000246 2406 2406 854651 1 912 355.217 355.217 CmurTRIN_TR00000212 2560 2560 886427 1 835 346.261 346.261 CmurTRIN_TR00000585 1689 1689 449898 1 628 266.369 266.369 CmurTRIN_TR00000637 1635 1635 434485 38 503 265.740 265.740 sort -k8,8nr ./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.summary | head CmurSOAP_TR00003834 428 428 1796798 317 5671 4198.126 4198.126 CmurSOAP_TR00000442 1291 1291 2938505 0 6246 2276.146 2276.146 CmurSOAP_TR00000119 2066 2066 4193432 14 3409 2029.735 2029.735 CmurSOAP_TR00000328 1452 1452 2446201 28 2333 1684.711 1684.711 CmurSOAP_TR00001376 784 784 1178624 14 2013 1503.347 1503.347 CmurSOAP_TR00007586 272 272 252061 25 1132 926.695 926.695 CmurSOAP_TR00002681 529 529 470614 169 1217 889.629 889.629 CmurSOAP_TR00002033 629 629 513931 122 1126 817.060 817.060 CmurSOAP_TR00000056 2474 2474 1889173 9 1157 763.611 763.611 CmurSOAP_TR00010178 194 194 143838 36 2314 741.433 741.433
If we consider the functional annotation, if any, for those contigs from BLASTN searches against NCBInt, then:
# Making boxplots Rscript Compare_Vector_Distributions_ylog.R \ "Minimum Read Coverage${RET}per Transcript::NA::Bowtie2 Minimum Coverage" \ bowtie2/violin.bowtie_read_mincoverage.png \ "NewBler::5::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC" \ "Trinity::5::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \ "SOAPdnT::5::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC" Rscript Compare_Vector_Distributions_ylog.R \ "Maximum Read Coverage${RET}per Transcript::NA::Bowtie2 Maximum Coverage" \ bowtie2/violin.bowtie_read_maxcoverage.png \ "NewBler::6::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC" \ "Trinity::6::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \ "SOAPdnT::6::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC" Rscript Compare_Vector_Distributions_ylog.R \ "Average Read Coverage${RET}per Transcript::NA::Bowtie2 Average Coverage" \ bowtie2/violin.bowtie_read_avgcoverage.png \ "NewBler::7::${TAB}::"./bowtie2/starfish_newbler.bowtie.UPL.pileup.sum2plt"$NBC" \ "Trinity::7::${TAB}::"./bowtie2/starfish_trinity_trimmo.bowtie.UPL.pileup.sum2plt"$TRC" \ "SOAPdnT::7::${TAB}::"./bowtie2/starfish_soap_trimmo.bowtie.UPL.pileup.sum2plt"$SPC"
TAB=$'\t' RET=$'\n' NBC="::#CC000099" TRC="::#00CC0099" SPC="::#00CC0099"
# Downloading transcript related files wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/est.fa.gz -O seaurchin/est.fa.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/strPur2.fa.gz -O seaurchin/strPur2.fa.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/refMrna.fa.gz -O seaurchin/refMrna.fa.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/strPur2/bigZips/mrna.fa.gz -O seaurchin/mrna.fa.gz
# Preparing BLAST target databases for F in mrna est refMrna; do { zcat seaurchin/$F.fa.gz | \ makeblastdb -in - \ -input_type fasta \ -dbtype nucl \ -title "SeaUrchin_$F" \ -out seaurchin/SeaUrchin_$F; }; done # mrna: added 30,401 sequences in 4.02561 seconds. # est: added 141,833 sequences in 11.2223 seconds. # refMrna: added 466 sequences in 0.0728889 seconds. # Running BLASTN and TBLASTX searches while read FA TI; do { for SU in mrna est refMrna; do { OUT=$TI"-x-Spur_"$SU; blastn -task blastn \ -query assemblies/$FA \ -db seaurchin/SeaUrchin_$SU \ -outfmt "$BLSTOFMT" -dust "yes" \ -evalue 10e-25 -num_threads 12 \ 2> blastout/$OUT.blastn.log | \ gzip -c9 - > blastout/$OUT.blastn.out.gz & tblastx -query assemblies/$FA \ -db seaurchin/SeaUrchin_$SU \ -outfmt "$BLSTOFMT" \ -matrix BLOSUM62 \ -evalue 10e-25 -num_threads 12 \ 2> blastout/$OUT.tblastx.log | \ gzip -c9 - > blastout/$OUT.tblastx.out.gz & }; done }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo EOF # Processing transcript coverage for TI in starfish_newbler \ starfish_trinity_trimmo \ starfish_soap_trimmo ; do { for SU in mrna est refMrna; do { OUT=$TI"-x-Spur_"$SU; # BLASTN zcat blastout/$OUT.blastn.out.gz | \ coverage_blastshorttbl.pl --prog=blastn - \ > blastout/$OUT.blastn.covtbl egrep '^Q\b' blastout/$OUT.blastn.covtbl \ > blastout/$OUT.blastn.covtbl.qryonly # TBLASTX zcat blastout/$OUT.tblastx.out.gz | \ coverage_blastshorttbl.pl --prog=tblastx - \ > blastout/$OUT.tblastx.covtbl egrep '^Q\b' blastout/$OUT.tblastx.covtbl \ > blastout/$OUT.tblastx.covtbl.qryonly }; done }; done # Making boxplots Rscript Compare_Vector_Distributions.R \ "BLASTN x Sea urchin::NA::% Coverage" \ blastout/violin.Spur_est.blastn.png \ "NewBler${RET}x ESTs::13::${TAB}::"blastout/starfish_newbler-x-Spur_est.blastn.covtbl.qryonly"$NBC" \ "NewBler${RET}x mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_mrna.blastn.covtbl.qryonly"$NBC" \ "NewBler${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_refMrna.blastn.covtbl.qryonly"$NBC" \ "Trinity${RET}x ESTs::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_est.blastn.covtbl.qryonly"$TRC" \ "Trinity${RET}x mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_mrna.blastn.covtbl.qryonly"$TRC" \ "Trinity${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_refMrna.blastn.covtbl.qryonly"$TRC" \ "SOAPdnT${RET}x ESTs::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_est.blastn.covtbl.qryonly"$SPC" \ "SOAPdnT${RET}x mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_mrna.blastn.covtbl.qryonly"$SPC" \ "SOAPdnT${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_refMrna.blastn.covtbl.qryonly"$SPC" Rscript Compare_Vector_Distributions.R \ "TBLASTX x Sea urchin::NA::% Coverage" \ blastout/violin.Spur_est.tblastx.png \ "NewBler${RET}x ESTs::13::${TAB}::"blastout/starfish_newbler-x-Spur_est.tblastx.covtbl.qryonly"$NBC" \ "NewBler${RET}x mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_mrna.tblastx.covtbl.qryonly"$NBC" \ "NewBler${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_newbler-x-Spur_refMrna.tblastx.covtbl.qryonly"$NBC" \ "Trinity${RET}x ESTs::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_est.tblastx.covtbl.qryonly"$TRC" \ "Trinity${RET}x mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_mrna.tblastx.covtbl.qryonly"$TRC" \ "Trinity${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_trinity_trimmo-x-Spur_refMrna.tblastx.covtbl.qryonly"$TRC" \ "SOAPdnT${RET}x ESTs::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_est.tblastx.covtbl.qryonly"$SPC" \ "SOAPdnT${RET}x mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_mrna.tblastx.covtbl.qryonly"$SPC" \ "SOAPdnT${RET}x Ref.mRNA::13::${TAB}::"blastout/starfish_soap_trimmo-x-Spur_refMrna.tblastx.covtbl.qryonly"$SPC"
The following tables summarize the number of transcripts having a BLAST hit ("query" column) and the number of distinct sequences from the database aligning to those transcripts ("target" column). TBL files were compressed using bzip2. As it can be appreciated when comparing both tables, TBLASTX returned almost three times more target hits while only twice as much query hits. Sea urchin mRNA transcript set has 30,401 sequences. This can be due to two reasons: 1) less isoforms per gene on starfish (but with the available data we cannot check this); and 2) the incompleteness of the current starfish transcriptome because it was built from a specific tissue (coelomic epithelium). We can check later on if there are differences at functional GO annotation.
TBL
UniProt was mirrored from the database ftp server on September 2014.
UniProt
# BLASTDB is the shell var set to the path to BLAST databases while read FA TI; do { OUT=$TI"-x-UniProt"; blastx -query assemblies/$FA \ -db $BLASTDB/UniProt_SProt_2014 \ -strand 'plus' \ -outfmt "$BLSTOFMT" \ -matrix BLOSUM62 \ -evalue 10e-25 -num_threads 12 \ 2> blastout/$OUT.blastx.log | \ gzip -c9 - > blastout/$OUT.blastx.out.gz & OUT=$TI"-x-UniProtTREmbl"; blastx -query assemblies/$FA \ -db $BLASTDB/UniProt_TREmbl_2014 \ -strand 'plus' \ -outfmt "$BLSTOFMT" \ -matrix BLOSUM62 \ -evalue 10e-25 -num_threads 12 \ 2> blastout/$OUT.blastx.log | \ gzip -c9 - > blastout/$OUT.blastx.out.gz & }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo EOF # Getting sequence coverage for TI in starfish_newbler \ starfish_trinity_trimmo \ starfish_soap_trimmo ; do { for SU in UniProt UniProtTREmbl ; do { OUT=$TI"-x-"$SU; zcat blastout/$OUT.blastx.out.gz | \ coverage_blastshorttbl.pl --prog=blastx - \ > blastout/$OUT.blastx.covtbl & egrep '^Q\b' blastout/$OUT.blastx.covtbl \ > blastout/$OUT.blastx.covtbl.qryonly }; done }; done # Making the boxplots Rscript Compare_Vector_Distributions.R \ "BLASTX x UniProt::NA::% Coverage" \ blastout/violin.UniProt.blastx.png \ "NewBler${RET}x UniProt::13::${TAB}::"blastout/starfish_newbler-x-UniProt.blastx.covtbl.qryonly"$NBC" \ "Trinity${RET}x UniProt::13::${TAB}::"blastout/starfish_trinity_trimmo-x-UniProt.blastx.covtbl.qryonly"$TRC" \ "SOAPdnT${RET}x UniProt::13::${TAB}::"blastout/starfish_soap_trimmo-x-UniProt.blastx.covtbl.qryonly"$SPC" \ "NewBler${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_newbler-x-UniProtTREmbl.blastx.covtbl.qryonly"$NBC" \ "Trinity${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_trinity_trimmo-x-UniProtTREmbl.blastx.covtbl.qryonly"$TRC" \ "SOAPdnT${RET}x UPrt_TrEMBL::13::${TAB}::"blastout/starfish_soap_trimmo-x-UniProtTREmbl.blastx.covtbl.qryonly"$SPC"
In order to have an equivalent functional annotation to compare starfish and sea urchin transcriptomes, same searches were performed using the sea urchin sequences.
nohup zcat ./seaurchin/mrna.fa.gz | \ blastx -query - \ -db $BLASTDB/UniProt_SProt_2014 \ -outfmt "$BLSTOFMT" \ -matrix BLOSUM62 \ -evalue 10e-25 -num_threads 64 \ > ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.out \ 2> ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.log & coverage_blastshorttbl.pl --prog=blastx \ ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.out \ > ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl
NCBI non redundant (-+NCBI-nr+-) Release 197.0 (August 15, 2013) was downloaded from the NCBI BLAST db ftp site.
# BLASTDB is the shell var set to the path to BLAST databases while read FA TI; do { OUT=$TI"-x-NCBInr"; blastx -query assemblies/$FA \ -db $BLASTDB/nr \ -strand 'plus' \ -outfmt "$BLSTOFMT" \ -matrix BLOSUM62 \ -evalue 10e-25 -num_threads 12 \ 2> blastout/$OUT.blastx.log | \ gzip -c9 - > blastout/$OUT.blastx.out.gz & }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo EOF # Getting sequence coverage for TI in starfish_newbler \ starfish_trinity_trimmo \ starfish_soap_trimmo ; do { OUT=$TI"-x-NCBInr" zcat blastout/$OUT.blastx.out.gz | \ coverage_blastshorttbl.pl --prog=blastx - \ > blastout/$OUT.blastx.covtbl & egrep '^Q\b' blastout/$OUT.blastx.covtbl \ > blastout/$OUT.blastx.covtbl.qryonly }; done # Making the boxplots Rscript Compare_Vector_Distributions.R \ "BLASTX x NCBInr::NA::% Coverage" \ blastout/violin.NCBInr.blastx.png \ "NewBler::13::${TAB}::"blastout/starfish_newbler-x-NCBInr.blastx.covtbl.qryonly"$NBC" \ "Trinity::13::${TAB}::"blastout/starfish_trinity_trimmo-x-NCBInr.blastx.covtbl.qryonly"$TRC" \ "SOAPdnT::13::${TAB}::"blastout/starfish_soap_trimmo-x-NCBInr.blastx.covtbl.qryonly"$SPC"
Here too, in order to have an equivalent functional annotation to compare starfish and sea urchin transcriptomes, same searches were performed using the sea urchin sequences.
nohup zcat ./seaurchin/mrna.fa.gz | \ blastx -query - \ -db $BLASTDB/nr \ -outfmt "$BLSTOFMT" \ -matrix BLOSUM62 \ -evalue 10e-25 -num_threads 64 \ > ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.out \ 2> ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.log & coverage_blastshorttbl.pl --prog=blastx \ ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.out \ > ./seaurchin/seaurchin_mrna.blastx_ncbinr_eval10e-25.covtbl
NCBI nucleotide (-+NCBI-nt+-) Release 197.0 (August 15, 2013) was downloaded from the NCBI BLAST db ftp site.
# BLASTDB is the shell var set to the path to BLAST databases while read FA TI; do { OUT=$TI"-x-NCBInt"; blastn -query assemblies/$FA \ -db $BLASTDB/nt \ -outfmt "$BLSTOFMT" \ -evalue 10e-5 -num_threads 32 \ 2> blastout/$OUT.blastn.log | \ gzip -c9 - > blastout/$OUT.blastn.out.gz & }; done <<'EOF' starfish_newbler.vbfclean_strandfix.fa starfish_newbler starfish_trinity_trimmo.vbfclean_strandfix.fa starfish_trinity_trimmo starfish_soap_trimmo.vbfclean_strandfix.fa starfish_soap_trimmo EOF # Getting sequence coverage for TI in starfish_newbler \ starfish_trinity_trimmo \ starfish_soap_trimmo ; do { OUT=$TI"-x-NCBInt" zcat blastout/$OUT.blastn.out.gz | \ coverage_blastshorttbl.pl --prog=blastn - \ > blastout/$OUT.blastn.covtbl egrep '^Q\b' blastout/$OUT.blastn.covtbl \ > blastout/$OUT.blastn.covtbl.qryonly }; done # Making the boxplots Rscript Compare_Vector_Distributions.R \ "BLASTX x NCBInt::NA::% Coverage" \ blastout/violin.NCBInt.blastn.png \ "NewBler::13::${TAB}::"blastout/starfish_newbler-x-NCBInt.blastn.covtbl.qryonly"$NBC" \ "Trinity::13::${TAB}::"blastout/starfish_trinity_trimmo-x-NCBInt.blastn.covtbl.qryonly"$TRC" \ "SOAPdnT::13::${TAB}::"blastout/starfish_soap_trimmo-x-NCBInt.blastn.covtbl.qryonly"$SPC"
The two PFAM database sections, Pfam-A and Pfam-B were downloaded from Pfam ftp site. Pfam-A is the manually curated portion of the database that contains over 13,000 entries; for each entry a protein sequence alignment and a hidden Markov model is stored. These hidden Markov models can be used to search sequence databases with the HMMER package. Pfam-B is an automatically generated supplement to Pfam-A that contains a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found. HMMER tools were used to scan transcript sequences to annotate domains. In fact, longest ORFs derived from those transcripts were taken, and compared with the HMM models from PFAM database (version 27.0, 2014). In order to get comparable e-values, it is required to adjust a command-line switch to set the effective search space, calculated from this formula: Z = #query_sequences x #target_models
HMMER
Z = #query_sequences x #target_models
# PFAMDB is the shell var set to the path to the PFAM HMM models zegrep -c '^ACC' $PFAMDB/Pfam-[AB].hmm.gz 14,831 Pfam-A.hmm.gz 20,000 Pfam-B.hmm.gz 34,831 HMM models in total # From that, #target_models was set to 35,000. # As there are 10,149, 13,787, and 11,347 transcripts on the NewBler, # Trinity, and SOAPdenovo-Trans assemblies respectively, #query_sequences # was rounded up to 15,000. # Therefore, Z = 15,000 x 35,000 = 525,000,000 = 5.25e8 ~ 6e8 while read FA TI; do { # Get the longest ORF for the final transcript set cdna2orfs.v2.pl -VSldu assembly/${FA}.fa \ > assembly/${FA}_lORFs.fa # HMMER against Pfam A OUT=$TI"_lORFs-x-PfamA" nohup hmmsearch -Z 6e8 -E 10e-5 --max --cpu 24 \ -o ./hmmer_output/${OUT}.hmmsearch.out \ --tblout ./hmmer_output/${OUT}.hmmsearch.tbl \ $PFAMDB/Pfam-A.hmm.gz \ ./assembly/${FA}_lORFs.fa \ 2> ./hmmer_output/${OUT}.hmmsearch.log & # HMMER against Pfam B OUT=$TI"_lORFs-x-PfamB"; nohup hmmsearch -Z 6e8 -E 10e-5 --max --cpu 24 \ -o ./hmmer_output/${OUT}.hmmsearch.out \ --tblout ./hmmer_output/${OUT}.hmmsearch.tbl \ $PFAMDB/Pfam-B.hmm.gz \ ./assembly/${FA}_lORFs.fa \ 2> ./hmmer_output/${OUT}.hmmsearch.log & }; done <<'EOF' starfish_newbler.vbfclean_strandfix starfish_newbler_cs starfish_trinity_trimmo.vbfclean_strandfix starfish_trinity_trimmo_cs starfish_soap_trimmo.vbfclean_strandfix starfish_soap_trimmo_cs EOF
#!/usr/bin/gawk -f { split($3,m,/\|/); if (MX[$1] < $10) { MX[$1] = $10; aln = $5"-"$6"/"$7"-"$8":"$9; coverage = sprintf("%.3f", $9/$2 * 100); GX[$1] = $1" "$2" "m[2]" "m[3]" "$4" "aln" "$10" "$11" "coverage; }; GH[$1]++; } END { for (n in MX) print GX[n], GH[n]; }
# get best HSPs for starfish contigs for TI in newbler trinity_trimmo soap_trimmo ; do { for DB in UniProt UniProtTREmbl NCBInr ; do { OUT="starfish_"$TI"-x-"$DB echo "# working on $OUT..." 1>&2 zcat blastout/$OUT.blastx.out.gz | \ ./getbesthit.awk - \ > blastout/$OUT.blastx.besthsp.tbl }; done }; done # get best HSPs for sea urchin contigs for DB in uniprot ncbinr; do { OUT="seaurchin_mrna.blastx_"$DB"_eval10e-25" echo "# working on $OUT..." 1>&2 ./getbesthit.awk seaurchin/$OUT.out \ > seaurchin/$OUT.besthsp.tbl }; done
# Merging all UniProt ids from best BLAST hits gawk '{ print $3 } ' ./blastout/starfish_newbler-x-UniProt.blastx.besthsp.tbl \ ./blastout/starfish_trinity_trimmo-x-UniProt.blastx.besthsp.tbl \ ./blastout/starfish_soap_trimmo-x-UniProt.blastx.besthsp.tbl \ > ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids # Merging all NCBInr ids from best BLAST hits gawk '{ print $3 } ' ./blastout/starfish_newbler-x-NCBInr.blastx.besthsp.tbl \ ./blastout/starfish_trinity_trimmo-x-NCBInr.blastx.besthsp.tbl \ ./blastout/starfish_soap_trimmo-x-NCBInr.blastx.besthsp.tbl \ > ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids # How many ids do we have on each set? wc -l ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids \ ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids 7540 blastout/blastx_uniprot_eval10e-25.besthsp.gbids 4950 blastout/blastx_ncbinr_eval10e-25.besthsp.gbids # Batch downloading GenBank nucleotide (gb) or protein (gp) records batchwebquerymaker.pl -V \ --getids-from-file ./blastout/blastx_uniprot_eval10e-25.besthsp.gbids \ --set-database 'protein' \ --db-output-rettype 'gp' \ --output-file ./blastout/blastx_uniprot_eval10e-25.besthsp \ 2> ./blastout/blastx_uniprot_eval10e-25.besthsp.gp.log & batchwebquerymaker.pl -V \ --getids-from-file ./blastout/blastx_ncbinr_eval10e-25.besthsp.gbids \ --set-database 'protein' \ --db-output-rettype 'gb' \ --output-file ./blastout/blastx_ncbinr_eval10e-25.besthsp \ 2> ./blastout/blastx_ncbinr_eval10e-25.besthsp.gb.log &
An annotation summary table was also retrieved from the UniProt upload list web form, using a list of all the matching UniProt identifiers filtered out from the best hits file blastx_uniprot_eval10e-25.besthsp.gbids. The downloaded file uniprot_besthits_dowloaded_info.tab.gz was then used to annotate the starfish contigs and to create the supplementary excel files.
blastx_uniprot_eval10e-25.besthsp.gbids
uniprot_besthits_dowloaded_info.tab.gz
# Processing UniProt alignments for TI in newbler \ trinity_trimmo \ soap_trimmo ; do { zcat ./blastout/uniprot_besthits_dowloaded_info.tab.gz | \ gawk 'BEGIN{ FS=OFS="\t"; while (getline<ARGV[1]>0) I[$1]=$2"\t"$3"\t"$10"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$11"\t"$12; ARGV[1]=""; FS=" "; hdr="Transcript ID\tLength(bp)\tBest HSP\tScore\tEvalue\tCoverage\tNum HSPs\t"; hdr=hdr"UniProt Entry\tEntry name (ACC)\tLength(aa)\tStatus\tProtein names\tGene names\t"; hdr=hdr"Organism\tCatalytic activity\tFunction [CC]\tKeywords\tGene ontology (GO)"; print hdr; } $3 in I { print $1,$2,$6,$7,$8,$9,$10,I[$3]; } ' - ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.tbl \ > ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.tbl ./table2excel.pl ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.tbl \ ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.protinfo.xls \ "${TI}-x-UniProt" }; done
#!/usr/bin/gawk -f BEGIN { while (getline<ARGV[1]>0) { J=$4; gsub(/\.[0-9]+$/,"",J); IDS[$3]=$4; TRS[$3]=TRS[$3]" "$1; CTG[$3""$1]=$1"\t"$3; ALN[$3""$1]=$6" "$7sco" "$9"bits\t"$8; }; ARGV[1]=""; N=D=A=S=F=K=G=""; E="NotAnnotated"; dflg=kflg=0; } dflg == 1 && $0 !~ /^ / { dflg=0 } kflg == 1 && $0 !~ /^ / { kflg=0 } dflg == 1 { gsub(/^ +/," ",$0); D=D""$0 } kflg == 1 { gsub(/^ +/," ",$0); K=K""$0 } $1 == "LOCUS" { N=$2 } $1 == "DEFINITION" { D=$0; gsub(/^DEFINITION +/,"",D); dflg=1 } $1 == "ACCESSION" { A=$2 } $1 == "VERSION" { G=$NF; gsub(/^GI:/,"",G) } $1 == "SOURCE" { S=$0; gsub(/^SOURCE +/,"",S) } $1 == "KEYWORDS" { K=$0 ; gsub(/^KEYWORDS +/,"",K); kflg=1 } $0 ~ /^ *\/product/ { F=$0; gsub(/^ +\/product=/,"",F); gsub(/\"/,"",F) } #" $0 ~ /^ *\/UniProtKB_evidence/ { E=$0; gsub(/^ +\/UniProtKB_evidence=/,"",E); gsub(/\"/,"",E) } #" $0 ~ /^\/\/$/ { if (G in IDS) { SQ=TRS[G]; gsub(/^ */,"",SQ); n=split(SQ,sq," "); for (i=1; i<=n; i++) { ti=sq[i]; print CTG[G""ti]"\t"A"\t"N"\t"ALN[G""ti]"\t"S"\t"D"\t"F"\t"K"\t"E; }; }; N=D=A=S=F=K=G=""; E="NotAnnotated"; dflg=kflg=0; }
# Processing NCBInr alignments for TI in newbler \ trinity_trimmo \ soap_trimmo ; do { ./mapNCBIinfo2hsps.awk ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.tbl \ ./blastout/blastx_ncbinr_eval10e-25.besthsp.gb \ > ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.tbl ./tableNR2excel.pl ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.tbl \ ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.gb.fulldesc.xls \ "${TI}-x-NCBInr" }; done
The following tables summarize the number of transcripts having a BLAST hit ("query" column) and the number of distinct sequences from the database aligning to those transcripts ("target" column). TBL files were compressed using bzip2.
# Keep transcript IDs together with assembly contigs ID and length in bp. egrep '^>' assemblies/starfish_newbler.vbfclean_strandfix.fa | \ sed 's/^>//; s/\(length\|gene\)=//g;' | \ gawk '{ print $1, $3":"$2, $4 } ' > assemblies/starfish_newbler.vbfclean_strandfix.idsmap egrep '^>' assemblies/starfish_trinity_trimmo.vbfclean_strandfix.fa | \ sed 's/^>//; s/\(length\|gene\)=//g;' | \ gawk '{ print $1, $2, $3 } ' > assemblies/starfish_trinity_trimmo.vbfclean_strandfix.idsmap egrep '^>' assemblies/starfish_soap_trimmo.vbfclean_strandfix.fa | \ sed 's/^>//; s/\(length\|gene\)=//g;' | \ gawk '{ print $1, $2, $3 } ' > assemblies/starfish_soap_trimmo.vbfclean_strandfix.idsmap # Merging IDs with assembly coverage, UniProt, and NCBInr best hits info for TI in newbler \ trinity_trimmo \ soap_trimmo ; do { echo "# working on $TI" 1>&2 # merging tables gawk 'BEGIN{ FS=" "; OFS="\t"; while (getline<ARGV[1]>0) { S[$1] = $2; C[$2] = $1; len[$1] = $3; }; while (getline<ARGV[2]>0) { sumcov[$1] = $(NF-2); avgcov[$1] = $NF; }; while (getline<ARGV[3]>0) { s=$1 in S ? $1 : "##"; GBid[s] = $3; GBac[s] = $4; GBaln[s] = $6" "$7" "$8" "$9" "$10; }; while (getline<ARGV[4]>0) { s=$1 in S ? $1 : "##"; UPid[s] = $3; UPac[s] = $4; UPaln[s] = $6" "$7" "$8" "$9" "$10; }; for (s in S) { print s, S[s], len[s], (s in sumcov ? sumcov[s] : "NA"), (s in avgcov ? avgcov[s] : "NA"), (s in GBid ? GBid[s] : "NA"), (s in GBac ? GBac[s] : "NA"), (s in GBaln ? GBaln[s] : "NA"), (s in UPid ? UPid[s] : "NA"), (s in UPac ? UPac[s] : "NA"), (s in UPaln ? UPaln[s] : "NA"); }; }' ./assemblies/starfish_$TI.vbfclean_strandfix.idsmap \ ./bowtie2/starfish_${TI}.bowtie.UPL.pileup.summary \ ./blastout/starfish_${TI}-x-NCBInr.blastx.besthsp.tbl \ ./blastout/starfish_${TI}-x-UniProt.blastx.besthsp.tbl \ > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.tbl # adding UniProt descriptions zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \ gawk 'BEGIN{ FS=OFS="\t"; while (getline<ARGV[1]>0) { U[$2] = ($10 == "" ? "NA\tNA" : $10"\t"$5" ["$6)"]"; }; ARGV[1] = ""; } { print $0, ($9 in U ? U[$9] : "NA\tNA"); } ' - ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.tbl \ > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl # Fixing NCBInr ids and inserting descriptions # # nr.ids data was filtered out from the fasta headers of the NCBInr sequences, as follows # zegrep '^>' $BLASTDB/fasta/nr.gz | \ # perl -ne 'print join("\n", map { s/^>?gi\|//o; s/\|/\t/; s/ /\t/; s/ \[/\t\[/; $_ } # split /\x01/o, $_);' | \ # gzip -c9 - > $BLASTDB/fasta/nr.ids.gz & # zcat $BLASTDB/fasta/nr.ids.gz | \ gawk 'BEGIN{ FS=OFS="\t"; while (getline<ARGV[1]>0) { N[$6] = $1; }; ARGV[1] = ""; } ($1 in N) { print $0 } ' ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl - | \ gawk 'BEGIN{ FS=OFS="\t"; while (getline<ARGV[1]>0) { N[$1] = $2"\t"$3"\t"($4 != "" ? $4 : "NA"); sub(/^[^\|]*\|/,"",$2); sub(/\|[^\|]*$/,"",$2); I[$1] = $2; }; ARGV[1] = ""; } { $7=$6 in I ? I[$6] : "NA"; p=($6 in N ? N[$6] : "NA\tNA\tNA")"\t"$9; $9=p; print $0; }' - ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.UPdesc.tbl \ > ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.tbl # Converting tabular file to excel ./tableNR+UP2excel.pl ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.tbl \ ./blastout/starfish_${TI}.best_blastx_NCBI+UniProt.GB+UPdesc.xls \ "NCBInr+UniProt_BLASTX" # Get total counts gawk -v ti=$TI ' $6 != "NA" && $12 == "NA" { GB++ } $6 == "NA" && $12 != "NA" { UP++ } $6 != "NA" && $12 != "NA" { BO++ } $6 == "NA" && $12 == "NA" { NO++ } END{ printf "SET: %s UP: %d NR: %d BOTH: %d NONE: %d\n", ti, UP, GB, BO, NO; } ' blastout/starfish_$TI.best_blastx_NCBI+UniProt.GB+UPdesc.tbl }; done
The following table summarizes the number of transcripts (contigs) having a match to at least one target sequence from UniProt, NCBI-nr, both, and no hit (none). TBL files were compressed using bzip2.
In an attempt to quantify which assembly returned the best set of transcripts, one can consider the ratio of percent coverage versus length for the best HSP from BLAST of each transcript. The longer the hit and the higher the coverage percent (with respect to the transcript full length), the better; while low coverage may correspond to the presence of chimeric transcripts.
# Making scatterplots for queries best HSP coverage while read TI LB CL; do { echo "# Working on $TI : BLASTN x Sea urchin" 1>&2; Rscript Scatterplot_2cols.R \ blastout/starfish_${TI}-x-Spur_mrna.blastn.covtbl.qryonly \ "13::12::${TAB}::FALSE" \ "${LB}${RET}Best HSP BLASTN x Sea urchin::% Coverage::HSP length::${CL}" \ blastout/starfish_${TI}-x-Spur_mrna.blastn.covtbl.qryonly.scat.png echo "# Working on $TI : TBLASTX x Sea urchin" 1>&2; Rscript Scatterplot_2cols.R \ blastout/starfish_${TI}-x-Spur_mrna.tblastx.covtbl.qryonly \ "13::12::${TAB}::FALSE" \ "${LB}${RET}Best HSP TBLASTX x Sea urchin::% Coverage::HSP length::${CL}" \ blastout/starfish_${TI}-x-Spur_mrna.tblastx.covtbl.qryonly.scat.png echo "# Working on $TI : BLASTX x UniProt" 1>&2; Rscript Scatterplot_2cols.R \ blastout/starfish_${TI}-x-UniProt.blastx.covtbl.qryonly \ "13::12::${TAB}::FALSE" \ "${LB}${RET}Best HSP BLASTX x UniProt::% Coverage::HSP length::${CL}" \ blastout/starfish_${TI}-x-UniProt.blastx.covtbl.qryonly.scat.png echo "# Working on $TI : BLASTX x UniProtTREmbl" 1>&2; Rscript Scatterplot_2cols.R \ blastout/starfish_${TI}-x-UniProtTREmbl.blastx.covtbl.qryonly \ "13::12::${TAB}::FALSE" \ "${LB}${RET}Best HSP BLASTX x UniProt-TREmbl::% Coverage::HSP length::${CL}" \ blastout/starfish_${TI}-x-UniProtTREmbl.blastx.covtbl.qryonly.scat.png echo "# Working on $TI : BLASTX x NCBInr" 1>&2; Rscript Scatterplot_2cols.R \ blastout/starfish_${TI}-x-NCBInr.blastx.covtbl.qryonly \ "13::12::${TAB}::FALSE" \ "${LB}${RET}Best HSP BLASTX x NCBInr::% Coverage::HSP length::${CL}" \ blastout/starfish_${TI}-x-NCBInr.blastx.covtbl.qryonly.scat.png echo "# Working on $TI : BLASTN x NCBInt" 1>&2; Rscript Scatterplot_2cols.R \ blastout/starfish_${TI}-x-NCBInt.blastn.covtbl.qryonly \ "13::12::${TAB}::FALSE" \ "${LB}${RET}Best HSP BLASTN x NCBInt::% Coverage::HSP length::${CL}" \ blastout/starfish_${TI}-x-NCBInt.blastn.covtbl.qryonly.scat.png }; done <<'EOF' newbler NewBler #CC000099 trinity_trimmo Trinity #00CC0099 soap_trimmo SOAPdnT #0000CC99 EOF
On this section, SOAPdenovo-Trans assembly will be functionally characterized using the Gene Ontology (GO) annotation extracted from the UniProt database.
# Preparing GO annotation table for the current UNIPROT release # UNIPROT points to the folder where UniProt database was downloaded zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.dat.gz | \ perl -ne 'BEGIN { $I=$A=$E=$O=""; $G="GO"; $T="TM"; $D="DM"; $S="unknown"; $K="."; $C=$S=0; } $_ =~ m{^ID\s+(\S+)\s}o && ($I=$A=$1, next); $_ =~ m{^AC\s+(\S+);}o && ($A=$1, next); $_ =~ m{^OS\s+([^\s]+)\s+([^\.\s]+)}o && ($S=$1."_".$2, next); $_ =~ m{^OX\s+NCBI_TaxID=(\d+);}o && ($C=$1, next); $_ =~ m{^DR\s+EMBL;\s+([^;]+);\s+([^;]+);}o && ($E=$1, $O=$2, next); $_ =~ m{^DR\s+GO;\s+GO:([^;]+);\s+([^;]+);}o && ($G.=";".$1."#".$2, next); $_ =~ m{^KW\s+(.*)\.}o && ($K=$1, next); $_ =~ m{^FT\s+DOMAIN\s+(\d+)\s+(\d+)\s+([^\.]+)\.}o && do { $D.=";".$1."_".$2; ($s = $3) =~ s/\s+/_/og; $D.="_".$s; next; }; $_ =~ m{^FT\s+TRANSMEM\s+(\d+)\s+(\d+)\s}o && ($T.=";".$1."_".$2, next); $_ =~ m{^//}o && do { print join("\t", $I, $A, $E, $O, $S, $C, $D, $T, $G, $K),"\n"; $I=$A=$E=$O=""; $G="GO"; $T="TM"; $D="DM"; $S="unknown"; $K="."; $C=$S=0; next; }' | gzip -c9 - \ > $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \ gawk ' { if ($9 == "GO") { N++ } else { Y++ }; } END{ print "# Total: ", N+Y, " with GOs: ",Y," without GOs: ",N }' # Total: 546439 with GOs: 518296 without GOs: 28143
GOAs can be mapped from the raw set of BLAST hits or transferred from the best hit only, the latter was the chosen option. Best hits files were produced at ((Coscinasterias+transcriptome#UniProt|Homology Search over UniProt)) section.
# Annotate starfish sequences zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \ gawk 'BEGIN{ FS=OFS="\t"; while (getline<ARGV[1]>0) { if ($1 == "GO") continue; sub(/^GO;?/,"",$9); if (!($2 in GO)) GO[$2] = $9; # using uniprot accession i.e. Q6GZX4 else GO[$2] = GO[$2]";"$9; }; ARGV[1] = ""; } $1 == "Q" { split($14,U,"|"); u=U[2]; split(U[3],K," "); if (u in GO && GO[u] != "") print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], GO[u]; else print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], "NA.GO"; }' - ./blastout/starfish_soap_trimmo-x-UniProt.blastx.covtbl \ > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl # Counting GOAs gawk 'BEGIN{ FS=OFS="\t"; print "GO\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot"; } $NF != "NA.GO" && $1 != "GO" { n=split($NF,m,";"); for (i=1; i<=n; i++) { GO[m[i]]++; GOT++ }; } END{ for (j in GO) { gop = GO[j]; gon = GOT - GO[j]; print j, gop, sprintf("%.5f", gop / GOT * 100), gon, sprintf("%.5f", gon / GOT * 100), GOT; } # print j, GO[j], GOT; }' ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \ > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts.tbl
# Annotate sea urchin sequences zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \ gawk 'BEGIN{ FS=OFS="\t"; while (getline<ARGV[1]>0) { if ($1 == "GO") continue; sub(/^GO;?/,"",$9); if (!($2 in GO)) GO[$2] = $9; # using uniprot accession i.e. Q6GZX4 else GO[$2] = GO[$2]";"$9; }; ARGV[1] = ""; } $1 == "Q" { split($14,U,"|"); u=U[2]; split(U[3],K," "); if (u in GO && GO[u] != "") print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], GO[u]; else print u, $2, (K[5] == "++" ? "+1" : "-1"), K[4], K[3], "NA.GO"; }' - ./seaurchin/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl \ > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl # Counting GOAs gawk 'BEGIN{ FS=OFS="\t"; print "GO\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot"; } $NF != "NA.GO" && $1 != "GO" { n=split($NF,m,";"); for (i=1; i<=n; i++) { GO[m[i]]++; GOT++ }; } END{ for (j in GO) { gop = GO[j]; gon = GOT - GO[j]; print j, gop, sprintf("%.5f", gop / GOT * 100), gon, sprintf("%.5f", gon / GOT * 100), GOT; } # print j, GO[j], GOT; }' ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \ > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts.tbl
#!/usr/bin/gawk -f BEGIN{ FS="\t"; while (getline<ARGV[1]>0) { if ($1 == "GO") continue; if ($6 != "NA.GO") { gsub(/[\'\"]/,"",$6); #" n=split($6,m,";"); for (i=1; i<=n; i++) { k=m[i]; gsub(/\#.*$/,"",k); gsub(/^.*\#/,"",m[i]); if (!(k in GD)) { GD[k] = "'"m[i]"'"; RO[k] = 0; }; RO[k]++; ROT++; }; }; }; ARGV[1] = ""; print "GO\tGOdesc\tGOpos\tGOpos.pct\tGOneg\tGOneg.pct\tGOtot\tROpos\tROpos.pct\tROneg\tROneg.pct\tROtot\tGOROpct"; FS="\t"; } $6 != "NA.GO" && $1 != "GO" { gsub(/[\'\"]/,"",$6); #" n=split($6,m,";"); for (i=1; i<=n; i++) { k=m[i]; gsub(/\#.*$/,"",k); gsub(/^.*\#/,"",m[i]); if (!(k in GD)) { GD[k] = "'"m[i]"'"; GO[k] = 0; }; GO[k]++; GOT++; }; } END{ for (j in GD) { if (!(j in GO)) GO[j]=0; if (!(j in RO)) RO[j]=0; gop = GO[j]; gopp = GOT > 0 ? sprintf("%.5f", (gop / GOT * 100)) : "NA"; gon = GOT - GO[j]; gonp = GOT > 0 ? sprintf("%.5f", (gon / GOT * 100)) : "NA"; rop = RO[j]; ropp = ROT > 0 ? sprintf("%.5f", (rop / ROT * 100)) : "NA"; ron = ROT - RO[j]; ronp = ROT > 0 ? sprintf("%.5f", (ron / ROT * 100)) : "NA"; grpp = (GOT == 0 || ROT == 0) ? "NA" : sprintf("%.5f", (GOT / ROT * 100)); printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", "GO:"j, GD[j], gop, gopp, gon, gonp, GOT, rop, ropp, ron, ronp, ROT, grpp; } # print j, GO[j], GOT; }
# Comparing GOAs counts between starfish and sea urchin bin/mkGOcountstbl.awk ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \ ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \ > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.tbl # Comparing GOAs counts between starfish and UniProt zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \ gawk 'BEGIN{ FS=OFS="\t" } { print $1,"2","3","4","5",$9 }' - | \ bin/mkGOcountstbl.awk - \ ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.tbl \ > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.tbl # Comparing GOAs counts between sea urchin and UniProt zcat $UNIPROT/knowledgebase/complete/uniprot_sprot.tbl.gz | \ gawk 'BEGIN{ FS=OFS="\t" } { print $1,"2","3","4","5",$9 }' - | \ bin/mkGOcountstbl.awk - \ ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.tbl \ > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.tbl
# Comparing GOAs counts between starfish and sea urchin Rscript bin/GO_calc_pval.R \ ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.tbl \ > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.GOcalc.log Rscript bin/GO_getGOdesc.R \ ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.chisq+hypergeom_pvals.tbl \ 2> ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin.chisq+hypergeom_pvals.desc.log # Comparing GOAs counts between starfish and UniProt Rscript bin/GO_calc_pval.R \ ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.tbl \ > ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.GOcalc.log Rscript bin/GO_getGOdesc.R \ ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.tbl \ 2> ./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.desc.log # Comparing GOAs counts between sea urchin and UniProt Rscript bin/GO_calc_pval.R \ ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.tbl \ > ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.GOcalc.log Rscript bin/GO_getGOdesc.R \ ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.tbl \ 2> ./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot.chisq+hypergeom_pvals.desc.log
From the Gene Ontology ftp repository, termdb (version 2015-01-20) was retrieved. From the term.txt, term_synonym.txt, and term2term.txt it is possible to build the full graph for the GO terms from which the parent-child relationships can be extracted. This is useful to calculate frequencies at distinct GO graph levels.
termdb
term.txt
term_synonym.txt
term2term.txt
# GOTERMDB point to the folder where the termdb database was downloaded from Gene Ontology repository gawk 'BEGIN { FS=OFS="\t"; while (getline<ARGV[1]>0) { if ($7 == 1) continue; # 1 is the GO code for is_relation GO[$1] = $4; DE[$1] = $2; ON[$1] = $3; RO[$1] = $6; }; while (getline<ARGV[2]>0) { if ($4 != 27) continue; # 27 is the GO code for alt_id synonym_type if ($1 in SY) { SY[$1] = SY[$1]" "$2; } else { SY[$1] = $2; }; }; ARGV[1]=ARGV[2]=0; print "GOID","ONTOLOGY","TERM","PARENT","SYNONYM"; } $2 == 1 && $5 == 0 { # $2 is relationship_type_id and 1 means is_a if (($4 in GO) && GO[$4] !~ /^obsolete/) { print GO[$4], ON[$4], DE[$4], RO[$4] == 1 ? "all" : GO[$3], ($4 in SY) ? SY[$4] : "NA"; SN[$4]++; }; } END{ for (n in GO) { if (!(n in SN)) { print "#", GO[n], ON[n], DE[n], RO[n] == 1 ? "all" : GO[n], ($4 in SY) ? SY[n] : "NA"; }; }; }' $GOTERMDB/term.txt\ $GOTERMDB/term_synonym.txt \ $GOTERMDB/term2term.txt \ > GOA/goterms+parents.tbl
On this section, frequencies are computed for the three sets, yet the main interest was only the starfish GOs mapped from UniProt.
for F in starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot \ starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin \ seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot ; do { echo "# working on $F" bin/GO_prune_GOlevels.pl ./GOA/goterms+parents.tbl \ ./GOA/$F.tbl > ./GOA/$F.gofreqs.tbl }; done
for F in starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot \ starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin \ seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot ; do { echo "# working on $F" SF="./GOA/"$F".chisq+hypergeom_pvals.goinfo"; bin/GO_make_graphviz_dotfile_ext.pl ./GOA/goterms+parents.tbl \ $SF.tbl \ 1e-5 \ 2> $SF.ext.dot.maxpval_1e-5.log; for o in CC BP MF; do { mv -v $SF.graphviz.${o}.ext.dot $SF.graphviz.${o}.ext.maxpval_1e-5.dot }; done; echo "# working on $F"; bin/GO_make_graphviz_dotfile_ext.pl ./GOA/goterms+parents.tbl \ $SF.tbl \ 2> $SF.ext.dot.log; for o in CC BP MF; do { echo "# graphviz dot on $o [co/pdf]"; dot -Tpdf $SF.graphviz.${o}.ext.dot \ > $SF.graphviz.${o}.ext.dot.pdf; echo "# graphviz dot on $o [co/svg]"; dot -Tsvg $SF.graphviz.${o}.ext.dot \ > $SF.graphviz.${o}.ext.dot.svg; }; done; for o in CC BP MF; do { echo "# graphviz dot on $o [co/pdf]"; dot -Tpdf $SF.graphviz.${o}.ext.maxpval_1e-5.dot \ > $SF.graphviz.${o}.ext.maxpval_1e-5.dot.pdf; echo "# graphviz dot on $o [co/svg]"; dot -Tsvg $SF.graphviz.${o}.ext.maxpval_1e-5.dot \ > $SF.graphviz.${o}.ext.maxpval_1e-5.dot.svg; }; done; }; done;
### GO freqs analysis: GODATAFILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot.gofreqs.tbl" GOs.Cmur <- read.table(GODATAFILE, header=TRUE, sep="\t"); # Overview of the GO data nrow(GOs.Cmur) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 1 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 1 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 1 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 2 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 2 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 2 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == 3 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == 3 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == 3 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel > 1 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel > 1 ]) sum(GOs.Cmur$GOcount[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel > 1 ]) GK <- list(CCU="#E94F6C", # lightred CCD="#A24436", # darkred BPU="#52E853", # lightgreen BPD="#426A2B", # darkgreen ... #0A945C MFU="#6FAFF9", # lightblue MFD="#007CBA") # darkblue numbars <- 25 for (golvl in 3:4) { GOpct.MF.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "MF" & GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ] GOpct.CC.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "CC" & GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ] GOpct.BP.Cmur <- GOs.Cmur[ GOs.Cmur$ONTOLOGY == "BP" & GOs.Cmur$GOlevel == golvl & GOs.Cmur$GOcount > 0, ] cat("numMF:",nrow(GOpct.MF.Cmur),"\n") #-> 96 gt0 vs 512 total cat("numCC:",nrow(GOpct.CC.Cmur),"\n") #-> 65 gt0 vs 612 total cat("numBP:",nrow(GOpct.BP.Cmur),"\n") #-> 214 gt0 vs 943 total GOsum.MF.Cmur <- sum(GOpct.MF.Cmur$GOcount) #-> 800 GOsum.CC.Cmur <- sum(GOpct.CC.Cmur$GOcount) #-> 1271 GOsum.BP.Cmur <- sum(GOpct.BP.Cmur$GOcount) #-> 3403 cat("sumMF:",GOsum.MF.Cmur,"\n") cat("sumCC:",GOsum.CC.Cmur,"\n") cat("sumBP:",GOsum.BP.Cmur,"\n") GOpct.MF.Cmur$GOpct <- GOpct.MF.Cmur$GOcount / GOsum.MF.Cmur * 100 GOpct.CC.Cmur$GOpct <- GOpct.CC.Cmur$GOcount / GOsum.CC.Cmur * 100 GOpct.BP.Cmur$GOpct <- GOpct.BP.Cmur$GOcount / GOsum.BP.Cmur * 100 GOord.MF.Cmur <- head(GOpct.MF.Cmur[ order(-GOpct.MF.Cmur$GOpct), ], n=numbars) GOord.CC.Cmur <- head(GOpct.CC.Cmur[ order(-GOpct.CC.Cmur$GOpct), ], n=numbars) GOord.BP.Cmur <- head(GOpct.BP.Cmur[ order(-GOpct.BP.Cmur$GOpct), ], n=numbars) GOplt.MF.Cmur <- rbind(GOord.MF.Cmur, data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="MF", TERM="Other low represented GO codes", GOpct=100 - sum(GOord.MF.Cmur$GOpct))) GOplt.CC.Cmur <- rbind(GOord.CC.Cmur, data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="CC", TERM="Other low represented GO codes", GOpct=100 - sum(GOord.CC.Cmur$GOpct))) GOplt.BP.Cmur <- rbind(GOord.BP.Cmur, data.frame(GOID=NA,GOlevel=4,GOcount=0,ONTOLOGY="BP", TERM="Other low represented GO codes", GOpct=100 - sum(GOord.BP.Cmur$GOpct))) GOplt.MF.Cmur$LBL <- paste(GOplt.MF.Cmur$TERM, ifelse(is.na(GOplt.MF.Cmur$GOID), "", paste(" [",GOplt.MF.Cmur$GOID,"]",sep="")), sep="") GOplt.MF.Cmur.max <- max(GOplt.MF.Cmur$GOpct) pdf(file=paste("GOA/GOfreqs_Cmur_MF_lvl",golvl,".pdf", sep=""), width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE); par.def<-par(mar=c(4,25,1,1)+0.1); barplot(GOplt.MF.Cmur$GOpct, names.arg=GOplt.MF.Cmur$LBL, col=as.character(GK["MFU"]), horiz=TRUE, xlim=c(0,GOplt.MF.Cmur.max), border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE) axis(1,at=c(seq(0,GOplt.MF.Cmur.max+5,2)),line=0.25) abline(v=seq(0,GOplt.MF.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black") mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE) text(GOplt.MF.Cmur.max, numbars*1.2, # lpos*1.2, labels="Molecular\nFunction", # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(GK["MFD"]), # as.character(KK[c("BPD","CCD","MFD")]), adj=c(1,1.5), font=2, cex=2) text(-2, -0.7, # lpos*1.2, labels=paste("GO level ",golvl,sep=""), # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(GK["MFD"]), # as.character(KK[c("BPD","BPD","MFD")]), adj=c(1,1), font=2, cex=1.25, xpd=TRUE) par(par.def); dev.off(); GOplt.CC.Cmur$LBL <- paste(GOplt.CC.Cmur$TERM, ifelse(is.na(GOplt.CC.Cmur$GOID), "", paste(" [",GOplt.CC.Cmur$GOID,"]",sep="")), sep="") GOplt.CC.Cmur.max <- max(GOplt.CC.Cmur$GOpct) pdf(file=paste("GOA/GOfreqs_Cmur_CC_lvl",golvl,".pdf", sep=""), width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE); par.def<-par(mar=c(4,25,1,1)+0.1); barplot(GOplt.CC.Cmur$GOpct, names.arg=GOplt.CC.Cmur$LBL, col=as.character(GK["CCU"]), horiz=TRUE, xlim=c(0,GOplt.CC.Cmur.max), border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE) axis(1,at=c(seq(0,GOplt.CC.Cmur.max+5,2)),line=0.25) abline(v=seq(0,GOplt.CC.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black") mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE) text(GOplt.CC.Cmur.max, numbars*1.2, # lpos*1.2, labels="Cellular\nComponent", # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(GK["CCD"]), # as.character(KK[c("BPD","CCD","MFD")]), adj=c(1,1.5), font=2, cex=2) text(-2, -0.7, # lpos*1.2, labels=paste("GO level ",golvl,sep=""), # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(GK["CCD"]), # as.character(KK[c("BPD","BPD","MFD")]), adj=c(1,1), font=2, cex=1.25, xpd=TRUE) par(par.def); dev.off(); GOplt.BP.Cmur$LBL <- paste(GOplt.BP.Cmur$TERM, ifelse(is.na(GOplt.BP.Cmur$GOID), "", paste(" [",GOplt.BP.Cmur$GOID,"]",sep="")), sep="") GOplt.BP.Cmur.max <- max(GOplt.BP.Cmur$GOpct) pdf(file=paste("GOA/GOfreqs_Cmur_BP_lvl",golvl,".pdf", sep=""), width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE); par.def<-par(mar=c(4,25,1,1)+0.1); barplot(GOplt.BP.Cmur$GOpct, names.arg=GOplt.BP.Cmur$LBL, col=as.character(GK["BPU"]), horiz=TRUE, xlim=c(0,GOplt.BP.Cmur.max), border=NA, las=1, cex.names=0.75, yaxs = "i", axes=FALSE) axis(1,at=c(seq(0,GOplt.BP.Cmur.max+5,2)),line=0.25) abline(v=seq(0,GOplt.BP.Cmur.max+5,2), lty="dotted", lwd=0.5, col="black") mtext("\n% GO term abundancy", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE) text(GOplt.BP.Cmur.max, numbars*1.2, # lpos*1.2, labels="Biological\nProcess", # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(GK["BPD"]), # as.character(KK[c("BPD","BPD","MFD")]), adj=c(1,1.5), font=2, cex=2) text(-2, -0.7, # lpos*1.2, labels=paste("GO level ",golvl,sep=""), # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(GK["BPD"]), # as.character(KK[c("BPD","BPD","MFD")]), adj=c(1,1), font=2, cex=1.25, xpd=TRUE) par(par.def); dev.off(); } # for golvl
# Loading GO info datasets FILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.seaurchin"; DT.CmurxSpur <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""), header=TRUE, sep="\t"); FILE <- "./GOA/starfish_soap_trimmo-x-UniProt.blastx.covtbl.blastx2go.counts_ext.uniprot"; DT.CmurxUPrt <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""), header=TRUE, sep="\t"); FILE <- "./GOA/seaurchin_mrna.blastx_uniprot_eval10e-25.covtbl.blastx2go.counts_ext.uniprot"; DT.SpurxUPrt <- read.table(paste(FILE, ".chisq+hypergeom_pvals.goinfo.tbl", sep=""), header=TRUE, sep="\t"); # volcano plots plot((1 - DT.CmurxUPrt$hypergeom.pval) ~ DT.CmurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1)) points((1 - DT.CmurxUPrt$hypergeom.padj) ~ DT.CmurxUPrt$logodds, pch=16, col="red") plot((1 - DT.SpurxUPrt$hypergeom.pval) ~ DT.SpurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1)) points((1 - DT.SpurxUPrt$hypergeom.padj) ~ DT.SpurxUPrt$logodds, pch=16, col="red") plot((1 - DT.CmurxSpur$hypergeom.pval) ~ DT.CmurxSpur$logodds, xlim=c(-10,10), ylim=c(0,1)) points((1 - DT.CmurxSpur$hypergeom.padj) ~ DT.CmurxSpur$logodds, pch=16, col="red") dt.CmurxUPrt <- DT.CmurxUPrt[!is.na(DT.CmurxUPrt$hypergeom.padj) & DT.CmurxUPrt$hypergeom.padj < 1e-5, ] dt.SpurxUPrt <- DT.SpurxUPrt[!is.na(DT.SpurxUPrt$hypergeom.padj) & DT.SpurxUPrt$hypergeom.padj < 1e-5, ] dt.CmurxSpur <- DT.CmurxSpur[!is.na(DT.CmurxSpur$hypergeom.padj) & DT.CmurxSpur$hypergeom.padj < 1e-5, ] td.CmurxUPrt <- DT.CmurxUPrt[!is.na(DT.CmurxUPrt$hypergeom.padj), ] td.SpurxUPrt <- DT.SpurxUPrt[!is.na(DT.SpurxUPrt$hypergeom.padj), ] td.CmurxSpur <- DT.CmurxSpur[!is.na(DT.CmurxSpur$hypergeom.padj), ] # volcano plots filtered plot((1 - dt.CmurxUPrt$hypergeom.pval) ~ dt.CmurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1)) points((1 - dt.CmurxUPrt$hypergeom.padj) ~ dt.CmurxUPrt$logodds, pch=16, col="red") plot((1 - dt.SpurxUPrt$hypergeom.pval) ~ dt.SpurxUPrt$logodds, xlim=c(-10,10), ylim=c(0,1)) points((1 - dt.SpurxUPrt$hypergeom.padj) ~ dt.SpurxUPrt$logodds, pch=16, col="red") plot((1 - dt.CmurxSpur$hypergeom.pval) ~ dt.CmurxSpur$logodds, xlim=c(-10,10), ylim=c(0,1)) points((1 - dt.CmurxSpur$hypergeom.padj) ~ dt.CmurxSpur$logodds, pch=16, col="red") # logods dotplots gox1 <- split(DT.CmurxUPrt, DT.CmurxUPrt$Ontology) gox2 <- split(DT.SpurxUPrt, DT.SpurxUPrt$Ontology) gox3 <- split(DT.CmurxSpur, DT.CmurxSpur$Ontology) gos.BP <- unique(sort(c(as.character(gox1$BP$GO), as.character(gox2$BP$GO), as.character(gox3$BP$GO)))); gos.BP.len <- length(gos.BP); gos.MF <- unique(sort(c(as.character(gox1$MF$GO), as.character(gox2$MF$GO), as.character(gox3$MF$GO)))); gos.MF.len <- length(gos.MF); gos.CC <- unique(sort(c(as.character(gox1$CC$GO), as.character(gox2$CC$GO), as.character(gox3$CC$GO)))); gos.CC.len <- length(gos.CC); plot(DT.CmurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO & gos.CC %in% DT.SpurxUPrt$GO], DT.SpurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO & gos.CC %in% DT.CmurxUPrt$GO]) PTBP.Cmur <- DT.CmurxUPrt$logodds[gos.BP %in% DT.CmurxUPrt$GO] PTBP.Cmur[ is.na(PTBP.Cmur) ] <- 0 PTBP.Spur <- DT.SpurxUPrt$logodds[gos.BP %in% DT.CmurxUPrt$GO] PTBP.Spur[ is.na(PTBP.Spur) ] <- 0 plot(PTBP.Cmur,PTBP.Spur) PTMF.Cmur <- DT.CmurxUPrt$logodds[gos.MF %in% DT.CmurxUPrt$GO] PTMF.Cmur[ is.na(PTMF.Cmur) ] <- 0 PTMF.Spur <- DT.SpurxUPrt$logodds[gos.MF %in% DT.CmurxUPrt$GO] PTMF.Spur[ is.na(PTMF.Spur) ] <- 0 plot(PTMF.Cmur,PTMF.Spur) PTCC.Cmur <- DT.CmurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO] PTCC.Cmur[ is.na(PTCC.Cmur) ] <- 0 PTCC.Spur <- DT.SpurxUPrt$logodds[gos.CC %in% DT.CmurxUPrt$GO] PTCC.Spur[ is.na(PTCC.Spur) ] <- 0 plot(PTCC.Cmur,PTCC.Spur) # barplot tests st.CmurxUPrt <- head(dt.CmurxUPrt[order(dt.CmurxUPrt$hypergeom.padj), ], n=500) st.SpurxUPrt <- head(dt.SpurxUPrt[order(dt.SpurxUPrt$hypergeom.padj), ], n=500) st.CmurxSpur <- head(dt.CmurxSpur[order(dt.CmurxSpur$hypergeom.padj), ], n=500) par.def<-par(mar=c(4,25,1,1)+0.1); barplot(st.CmurxSpur$logodds,names.arg=paste(st.CmurxSpur$Term," [",st.CmurxSpur$GO,"]",sep=""),horiz=TRUE,las=2) par(par.def); # K<-list(CC="red",BP="green",MF="blue") K<-list(CC="#E94F6C",BP="#52E853",MF="#6FAFF9") k<-list(CC="#A24436",BP="#0A945C",MF="#007CBA") gord.CmurxSpur <- st.CmurxSpur[order(st.CmurxSpur$Ontology, -st.CmurxSpur$logodds), ] par.def<-par(mar=c(4,25,1,1)+0.1); barplot(gord.CmurxSpur$logodds, names.arg=paste(gord.CmurxSpur$Term," [",gord.CmurxSpur$GO,"]",sep=""), col=as.character(K[gord.CmurxSpur$Ontology]), horiz=TRUE,las=2) par(par.def); ts.CmurxUPrt <- head(td.CmurxUPrt[order(td.CmurxUPrt$hypergeom.padj), ], n=500) ts.SpurxUPrt <- head(td.SpurxUPrt[order(td.SpurxUPrt$hypergeom.padj), ], n=500) # filtering 100 topmost and bottom overrepresented terms tdu.CmurxSpur <- td.CmurxSpur[order(-td.CmurxSpur$logodds),] tts.CmurxSpur <- rbind(head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "CC", ], n=100), head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "BP", ], n=100), head(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "MF", ], n=100), tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "CC", ], n=100), tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "BP", ], n=100), tail(tdu.CmurxSpur[tdu.CmurxSpur$Ontology == "MF", ], n=100)) KK<-list(CCU="#E94F6C", # lightred CCD="#A24436", # darkred BPU="#52E853", # lightgreen BPD="#426A2B", # darkgreen ... #0A945C MFU="#6FAFF9", # lightblue MFD="#007CBA") # darkblue tts.CmurxSpur$OntoColors <- c(rep(KK["CCU"],100),rep(KK["BPU"],100),rep(KK["MFU"],100), rep(KK["CCD"],100),rep(KK["BPD"],100),rep(KK["MFD"],100)) # GO comparison final barplots (for topmost and bottom 100 terms) pdf(file="GOA/GOcomp_CmurxSpur_100low+100top.pdf", onefile=TRUE, width=8, height=24, bg="white", pointsize=10, useDingbats=FALSE); ggord.CmurxSpur <- tts.CmurxSpur[order(tts.CmurxSpur$Ontology, -tts.CmurxSpur$logodds), ] lpos<-c(length(ggord.CmurxSpur$Ontology[ ggord.CmurxSpur$Ontology=="BP" ]), length(ggord.CmurxSpur$Ontology[ ggord.CmurxSpur$Ontology=="BP" | ggord.CmurxSpur$Ontology=="CC" ]), length(ggord.CmurxSpur$Ontology)) # [ ggord.CmurxSpur$Ontology=="MF" ] par.def<-par(mar=c(6,25,1,1)+0.1); barplot(ggord.CmurxSpur$logodds[1:200], names.arg=paste(ggord.CmurxSpur$Term[1:200]," [",ggord.CmurxSpur$GO[1:200],"]",sep=""), col=as.character(ggord.CmurxSpur$OntoColors[1:200]), # K[ggord.CmurxSpur$Ontology]), horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i") mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE) text(8, 200*1.2, # lpos*1.2, labels="Biological\nProcess", # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(KK["BPD"]), # KK[c("BPD","CCD","MFD")]), adj=c(1,1.5), font=2, cex=2) abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black") abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE) barplot(ggord.CmurxSpur$logodds[201:400], names.arg=paste(ggord.CmurxSpur$Term[201:400]," [",ggord.CmurxSpur$GO[201:400],"]",sep=""), col=as.character(ggord.CmurxSpur$OntoColors[201:400]), # K[ggord.CmurxSpur$Ontology]), horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i") mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE) text(8, 200*1.2, # lpos*1.2, labels="Cellular\nComponent", # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(KK["CCD"]), # as.character(KK[c("BPD","CCD","MFD")]), adj=c(1,1.5), font=2, cex=2) abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black") abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE) barplot(ggord.CmurxSpur$logodds[401:600], names.arg=paste(ggord.CmurxSpur$Term[401:600]," [",ggord.CmurxSpur$GO[401:600],"]",sep=""), col=as.character(ggord.CmurxSpur$OntoColors[401:600]), # K[ggord.CmurxSpur$Ontology]), horiz=TRUE, xlim=c(-4,8), border=NA, las=1, cex.names=0.5, yaxs = "i") mtext("\nCmur/Spur log odds ratio\n[dark:lower100 + light:top100]", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE) text(8, 200*1.2, # lpos*1.2, labels="Molecular\nFunction", # c("Biological Process", "Cellular Component", "Molecular Function"), col=as.character(KK["MFD"]), # as.character(KK[c("BPD","CCD","MFD")]), adj=c(1,1.5), font=2, cex=2) abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black") abline(h=seq(10,200,10)*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE) par(par.def); dev.off(); # filtering 25 topmost and bottom overrepresented terms tdo.CmurxSpur <- td.CmurxSpur[order(-td.CmurxSpur$logodds),] tts2.CmurxSpur <- rbind(head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "CC", ], n=25), head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "BP", ], n=25), head(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "MF", ], n=25), tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "CC", ], n=25), tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "BP", ], n=25), tail(tdo.CmurxSpur[tdo.CmurxSpur$Ontology == "MF", ], n=25)) tts2.CmurxSpur$OntoColors <- c(rep(KK["CCU"],25),rep(KK["BPU"],25),rep(KK["MFU"],25), rep(KK["CCD"],25),rep(KK["BPD"],25),rep(KK["MFD"],25)) pdf(file="GOA/GOcomp_CmurxSpur_25low+25top.pdf", width=11, height=16, bg="white", pointsize=12, useDingbats=FALSE); ggord2.CmurxSpur <- tts2.CmurxSpur[order(tts2.CmurxSpur$Ontology, -tts2.CmurxSpur$logodds), ] lpos<-c(length(ggord2.CmurxSpur$Ontology[ ggord2.CmurxSpur$Ontology=="BP" ]), length(ggord2.CmurxSpur$Ontology[ ggord2.CmurxSpur$Ontology=="BP" | ggord2.CmurxSpur$Ontology=="CC" ]), length(ggord2.CmurxSpur$Ontology)) # [ ggord2.CmurxSpur$Ontology=="MF" ] par.def<-par(mar=c(6,25,1,1)+0.1); barplot(ggord2.CmurxSpur$logodds, names.arg=paste(ggord2.CmurxSpur$Term," [",ggord2.CmurxSpur$GO,"]",sep=""), col=as.character(ggord2.CmurxSpur$OntoColors), horiz=TRUE, xlim=c(-3,7.5), border=NA, las=1, cex.names=0.5, yaxs = "i") mtext("\nCmur/Spur log odds ratio\n[dark:lower25 + light:top25]", side=1, cex=1.5, adj=0.5, padj=1, xpd=TRUE) text(7, lpos*1.2, labels=c("Biological\nProcess", "Cellular\nComponent", "Molecular\nFunction"), col=as.character(KK[c("BPD","CCD","MFD")]), adj=c(1,1.5), font=2, cex=2) abline(v=seq(-3,7,1), lty="dotted", lwd=0.5, col="black") abline(h=lpos*1.2, lty="dashed", lwd=0.5, col="black", xpd=TRUE) abline(h=seq(5,145,5)*1.2, lty="dotted", lwd=0.25, col="black", xpd=TRUE) par(par.def); dev.off(); # pdf(file="GOA/GOcomp_CmurxSpur_hypergeom_adjpval_1-5_signif.pdf", width=8, height=6, bg="white", pointsize=8, useDingbats=FALSE); tdc.CmurxSpur <- dt.CmurxSpur[order(-dt.CmurxSpur$logodds),] ggordco.CmurxSpur <- tdc.CmurxSpur[order(tdc.CmurxSpur$Ontology, -tdc.CmurxSpur$logodds), ] lpos<-c(length(ggordco.CmurxSpur$Ontology[ ggordco.CmurxSpur$Ontology=="BP" ]), length(ggordco.CmurxSpur$Ontology[ ggordco.CmurxSpur$Ontology=="BP" | ggordco.CmurxSpur$Ontology=="CC" ]), length(ggordco.CmurxSpur$Ontology)) # [ ggordco.CmurxSpur$Ontology=="MF" ] par.def<-par(mar=c(10,24,1,1)+0.1); barplot(ggordco.CmurxSpur$logodds, names.arg=paste(ggordco.CmurxSpur$Term," [",ggordco.CmurxSpur$GO,"]",sep=""), col=as.character(K[ggordco.CmurxSpur$Ontology]), horiz=TRUE, xlim=c(0,7.5), border=NA, las=1) mtext("\nCmur/Spur log odds ratio\n(hypergeometric test adjusted p-value < 10-5)", side=1, cex=1.25, adj=0.5, padj=1.25, xpd=TRUE) text(7, lpos*1.2, labels=c("BP", "CC", "MF"), col=c(as.character(k[c("CC","BP","MF")])), adj=c(1,1.25), font=2, cex=2) abline(v=seq(0,7,1), lty="dotted", lwd=0.5, col="black") par(par.def); dev.off(); # GO wordclouds? library("GOsummaries"); wcd1 <- data.frame(Term = st.CmurxUPrt$Term, Score = st.CmurxUPrt$hypergeom.padj) wcd2 <- data.frame(Term = st.SpurxUPrt$Term, Score = st.SpurxUPrt$hypergeom.padj) wcd3 <- data.frame(Term = st.CmurxSpur$Term, Score = st.CmurxSpur$hypergeom.padj) gs <- gosummaries(wc_data = list(Results1 = wcd1, Results2 = wcd2)) gs <- gosummaries(wc_data = list(Results3 = wcd3)) plot(gs)
Linux Debian
GNU bash
GNU Awk
perl
R
Bioconductor
BiocInstaller
FASTX tools
[1] Stein LD. "Using GBrowse 2.0 to visualize and share next-generation sequence data." Brief Bioinform. 14(2):162-71, 2013 PMID. [2] Deng W, Nickle DC, Learn GH, Maust B, Mullins JI. "ViroBLAST: A stand-alone BLAST web server for flexible queries of multiple databases and user's datasets." Bioinformatics, 23(17):2334-2336, 2007 PMID.
The original document is available at https://compgen.bio.ub.edu/CompGenOld/Coscinasterias+transcriptome