Global Catalogue of Type Strain (gcType) Platform Manual v2



1. Overview

1.1 Genome assembly and annotation pipeline overview

The genome assembly and annotation pipeline is composed of three analysis procedures: (1) raw read trimming and assembly, (2) genomic component analysis, and (3) gene annotation.

  • (1.1) If long reads (PacBio or Nanopore) are provided as input, the raw sequencing reads are trimmed and assembled into contigs/scaffolds with Canu [Genome research, 2017] or Flye [Nature Biotechnology, 2019]. If NGS short reads (Illumina paired-end reads) are also provided, they are used to polish the contigs/scaffolds with Pilon [PLoS One, 2014].

  • (1.2) If only NGS short reads are provided, the raw reads are trimmed into clean reads with sickle or Trimmomatic [Bioinformatics, 2014], corrected with Musket [Bioinformatics, 2012], and assembled into contigs/scaffolds with multiple assemblers (SOAPdenovo2 [Gigascience, 2012], SPAdes [Journal of computational biology, 2012], Velvet [Genome research, 2008], Platanus [Genome research, 2014] and IDBA [Bioinformatics, 2012]). The best assembly is then selected based on several widely used genome assembly metrics. Afterwards, the reads are mapped to the best assembly to check for misassemblies and to compute read coverage.

  • (1.3) The final assembly (the best assembly) is used to estimate genome completeness and contamination with CheckM [Genome research, 2015] and serves as the input for genomic component analysis.

  • (2) Genomic component analysis is performed on the final assembly. It includes CRISPR array recognition with PILER-CR [BMC Bioinformatics, 2007], repetitive structure detection with TRF [Nucleic acids research, 1999], non-coding RNA prediction with tRNAscan-SE [Nucleic acids research, 2016] and RNAmmer [Nucleic acids research, 2007], and gene prediction with Prodigal [BMC Bioinformatics, 2010]. The predicted genes are annotated in the next step.

  • (3) Genes predicted in the previous step are annotated against several databases, including KEGG, COG, NR, SwissProt, antiSMASH, MetaCyc, PHI, Pfam, CARD and VFDB.
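
The branching in step (1) can be summarised in a few lines of Python. This is only an illustrative sketch of the decision logic described above, not the platform's code; the function name and the step labels are made up for this example.

```python
from __future__ import annotations


def plan_assembly_steps(has_long_reads: bool, has_short_reads: bool) -> list[str]:
    """Return the ordered actions of procedure (1) for the given read types."""
    if has_long_reads:
        # (1.1) long reads drive the assembly; short reads only polish it
        steps = ["trim and assemble long reads (Canu or Flye)"]
        if has_short_reads:
            steps.append("polish contigs/scaffolds with short reads (Pilon)")
    elif has_short_reads:
        # (1.2) short reads only
        steps = [
            "trim reads (sickle or Trimmomatic)",
            "correct reads (Musket)",
            "assemble with multiple assemblers",
            "select the best assembly",
            "map reads to the best assembly",
        ]
    else:
        raise ValueError("at least one type of reads is required")
    # (1.3) applies to the final assembly in both cases
    steps.append("estimate completeness and contamination (CheckM)")
    return steps


for step in plan_assembly_steps(has_long_reads=True, has_short_reads=True):
    print(step)
```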


FIGURE: Genome assembly and annotation pipeline overview.

1.2 Species identification overview

This pipeline implements the procedure proposed in [International journal of systematic and evolutionary microbiology, 2018], using the gcType 16S rDNA gene database and whole genome reference database.

  • (1) The only required input of this pipeline is the genome sequence file in FASTA format. The first step extracts the 16S rDNA sequence(s) from the submitted genome sequence. If a 16S rDNA sequence of the same strain is also submitted, the sequence similarities between the extracted and submitted 16S rDNA sequences are reported.

  • (2.1) The extracted 16S rDNA sequence(s) are aligned to the gcType 16S rDNA gene database using BLAST. BLAST results are sorted by identity in descending order. The top sequence, i.e. the one with the highest sequence similarity, is used to estimate the 16S rDNA gene completeness [International journal of systematic and evolutionary microbiology, 2012]. A total of 'M1' sequences are selected from top to bottom according to their identity, with the restriction that at most 'K1' sequences are selected from the same genus.

  • (2.2) The submitted genome sequence(s) are compared against the gcType whole genome database using Mash [Genome Biology, 2016]. Mash results are sorted by distance in ascending order. A total of 'M2' sequences are selected from top to bottom, with the restriction that at most 'K2' sequences are selected from the same genus. Various genome similarity metrics are calculated between the submitted genome sequence and the selected sequences.

  • (3) The extracted and selected 16S rDNA sequences are aligned using MAFFT [Molecular biology and evolution, 2013] or MUSCLE [Nucleic acids research, 2004]. A phylogenetic analysis is then performed using MEGA [Molecular biology and evolution, 2018], FastTree [PloS one, 2010] or RAxML [Bioinformatics, 2014]. For the submitted and selected genome sequences, 56 marker genes [Nature Communications, 2016] are extracted and used to perform the phylogenetic analysis.

    * Users can upload their own 16S rDNA and genome sequences as references for all of the analyses.


FIGURE: Species identification pipeline overview.


2. Instructions

2.1 Genome assembly and annotation pipeline usage

The pipeline contains four parts: demo, procedure selection, inputs and parameters, feedback.


2.1.1 Demo

Two demos are provided for this pipeline. Click "load input and arguments" to load the sequencing data, enter your e-mail address in the feedback text box, and click the "run" button to submit a demo job. The job may take 20-60 minutes. You will receive an e-mail with the job status link when the job is successfully submitted, and an e-mail with the result link when it finishes.

FIGURE: Demo usage.

FIGURE: Fill in the e-mail address to receive feedback and submit the job.


2.1.2 Procedure selection

The genome assembly and annotation pipeline is composed of three analysis procedures. gcType provides four combinations of them: assembly (1), assembly and annotation (1, 2 and 3), genome annotation (2, 3), and gene annotation (3). The required input differs between combinations. A sequencing reads file in fastq/fastq.gz/bam format is required for assembly (1) and for assembly and annotation (1, 2 and 3). A genome file in fasta/fasta.gz/fna/fna.gz format is required for genome annotation (2, 3). A gene file in fasta/fasta.gz format is required for gene annotation (3).
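
Before uploading, the file type can be checked locally against the formats listed above. The following is a minimal sketch; the dictionary keys, the function name and the case-insensitive extension matching are assumptions of this example, and the platform may validate files differently.

```python
# Accepted file extensions per procedure combination, as listed in this section.
ACCEPTED_EXTENSIONS = {
    "assembly": (".fastq", ".fastq.gz", ".bam"),
    "assembly and annotation": (".fastq", ".fastq.gz", ".bam"),
    "genome annotation": (".fasta", ".fasta.gz", ".fna", ".fna.gz"),
    "gene annotation": (".fasta", ".fasta.gz"),
}


def is_accepted(procedure: str, filename: str) -> bool:
    """Return True if the file extension matches the selected procedure."""
    # str.endswith accepts a tuple of candidate suffixes.
    return filename.lower().endswith(ACCEPTED_EXTENSIONS[procedure])


print(is_accepted("genome annotation", "Demo1.fna.gz"))
print(is_accepted("gene annotation", "genes.fna"))
```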

FIGURE: Procedure selection.


2.1.3 NGS and TGS reads

To analyze the raw sequencing data of a type strain, you must first specify the type of your data. gcType supports NGS reads (single library), TGS reads (PacBio or Nanopore), and NGS reads (single library) combined with TGS reads (PacBio or Nanopore). If you select the 'NGS reads' option, you must upload paired-end NGS reads (NOTICE!), i.e. two files whose names end with "_1" and "_2", "_R1" and "_R2", or "1" and "2".
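
Whether two file names form a valid pair under this naming rule can be checked before upload. The sketch below assumes that the suffix sits directly before the file extension and that only the fastq, fastq.gz and bam extensions occur; the helper names are hypothetical and this is not the platform's own check.

```python
import re
from typing import Optional, Tuple

# Read-file extensions named in section 2.1.2.
_EXTENSION = re.compile(r"\.(fastq\.gz|fastq|bam)$", re.IGNORECASE)
# Mate suffixes named in the manual: "_R1"/"_R2", "_1"/"_2" or "1"/"2".
_MATE = re.compile(r"^(?P<stem>.*?)(?P<tag>_R|_|)(?P<mate>[12])$")


def split_mate(filename: str) -> Optional[Tuple[str, str, int]]:
    """Return (stem, tag, mate number), or None if there is no mate suffix."""
    base = _EXTENSION.sub("", filename)
    match = _MATE.match(base)
    if match is None:
        return None
    return match.group("stem"), match.group("tag"), int(match.group("mate"))


def is_pair(first: str, second: str) -> bool:
    """True if both names share stem and suffix style and are mates 1 and 2."""
    a, b = split_mate(first), split_mate(second)
    if a is None or b is None:
        return False
    return a[:2] == b[:2] and (a[2], b[2]) == (1, 2)


print(is_pair("Demo1_R1.fastq.gz", "Demo1_R2.fastq.gz"))
```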

FIGURE: Sequencing reads selection.


2.1.4 Sample name

'Sample name' is used as the folder name and file prefix. Please do not use special characters, including 'space', 'tab', 'quote', '*', '+', '/', '~', '#', '.', ';', '`', '|', '!', '$', '?' and ':'.
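
A sample name can be checked against this list before submission. The sketch below is an illustration only: it treats 'quote' as both the single and the double quote, which is an assumption, and the list above may not be exhaustive, so the platform may reject further characters.

```python
# Characters named in this section; 'quote' is assumed to mean both ' and ".
FORBIDDEN = set(" \t'\"*+/~#.;`|!$?:")


def invalid_characters(sample_name: str) -> list:
    """Return the forbidden characters found in the name, sorted."""
    return sorted(set(sample_name) & FORBIDDEN)


print(invalid_characters("Demo1"))         # nothing forbidden
print(invalid_characters("my sample.v2"))  # contains a space and a dot
```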


2.1.5 Quality control and assembly for NGS reads

gcType provides two raw read trimming programs; please select one. gcType also provides five genome assemblers, each with several sets of parameters. The best assembly among the results is selected.

FIGURE: Assembly parameters for NGS reads.


2.1.6 Quality control and assembly for TGS reads

gcType provides two assemblers; please select one. The instrument and the estimated genome size must be filled in according to the metadata of your sample.

FIGURE: Assembly parameters for TGS reads.


2.1.7 Genomic component analysis and gene annotation

gcType provides several software tools to analyze the genome sequence and several databases to annotate the predicted genes. Prodigal, which predicts genes from the genome, and CheckM, which estimates genome completeness, contamination and strain heterogeneity, are compulsory.

FIGURE: Genomic component analysis and gene annotation.


2.2 Species identification pipeline usage

The pipeline contains three parts: demo, inputs and parameters, feedback.


2.2.1 Demo

One demo is provided for this pipeline. Click "load input and arguments" to load the genome sequence and 16S rDNA sequence, enter your e-mail address in the feedback text box, and click the "run" button to submit a demo job. The job may take 5-10 minutes. You will receive an e-mail with the job status link when the job is successfully submitted, and an e-mail with the result link when it finishes.

FIGURE: Demo usage.

FIGURE: Fill in the e-mail address to receive feedback and submit the job.


2.2.2 Genome and 16S rDNA sequences submission

The required input for species identification is the genome sequence. If a 16S rDNA sequence of the same sample is also provided, it is used to validate the 16S rDNA prediction result. Please note that the genome sequence and the 16S rDNA sequence of the same sample must be submitted in the same block. To submit more than one sample, click "Add another sample".

FIGURE: Input the genome and 16S rDNA sequences of your samples.


2.2.3 Sequences selection for the phylogenetic analysis

The 16S rDNA alignment results are reported sorted by identity in descending order. The genome distance calculation results are reported sorted by distance in ascending order. However, not all of them are used for the phylogenetic analysis. The selection parameters and input files are described below.

FIGURE: Selection parameters.

Example: 'M1' is set to 12 and 'K1' is set to 5. The 16S rDNA alignment results are sorted by identity in descending order. Bars of different colors represent aligned sequences belonging to different genera. At most 5 ('K1') sequences from the same genus are selected, so sequences 6, 7 and 8 in green are not selected. In total, 12 ('M1') sequences are selected, so sequence 2 in blue, sequence 1 in purple, and the sequences not shown in the figure are not selected. To include all of the top 'M1' sequences, set 'K1' to the same value as 'M1'.
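
The selection rule can be written as a single pass over the sorted results. The sketch below is one way to express it, assuming the hits are already sorted best-first; the function name, the data layout and the example hits (genera A, B and C) are invented for illustration and do not reproduce the figure.

```python
from collections import Counter


def select_hits(hits, m, k):
    """Pick up to m hits in order, taking at most k hits per genus.

    hits: list of (sequence_name, genus) tuples, already sorted best-first.
    """
    per_genus = Counter()
    selected = []
    for name, genus in hits:
        if len(selected) == m:
            break              # 'M' limit reached
        if per_genus[genus] >= k:
            continue           # 'K' limit reached for this genus
        per_genus[genus] += 1
        selected.append(name)
    return selected


# Invented example: 8 hits from genus A, then 4 from B, then 6 from C.
hits = (
    [(f"A{i}", "A") for i in range(1, 9)]
    + [(f"B{i}", "B") for i in range(1, 5)]
    + [(f"C{i}", "C") for i in range(1, 7)]
)
print(select_hits(hits, m=12, k=5))
```

With 'M' = 12 and 'K' = 5, A6 to A8 are skipped by the per-genus limit and C4 to C6 are dropped because 12 sequences have already been selected. The same rule applies to the Mash results with 'M2' and 'K2'.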

FIGURE: Example of selection procedure.

FIGURE: Additional species selection.

FIGURE: Upload additional genome or 16S rDNA sequences.


2.2.4 Phylogenetic analysis

Phylogenetic analysis is performed on the multiple sequence alignment. If MEGA is selected, one of three methods for constructing phylogenetic trees from evolutionary distance data can be chosen.

FIGURE:


3. Output

3.1 Interface

After the job is submitted or done, users will obtain a link like https://gctype.wdcm.org/alysisresult.jsp?type=pipeline()&jobId=(). The 'type' value is 1 or 2, representing the genome assembly and annotation pipeline or the species identification pipeline, respectively. The 'jobId' value is a string of 32 characters. The page contains three tabs: parameters, job status, and report and files. A download button is provided to download the report and output files.
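
If result links are collected in a script, the two query parameters can be read with the Python standard library. This is a small sketch under the assumptions stated in its comments; the function name is hypothetical and the job ID in the example is a made-up placeholder, not a real job.

```python
from urllib.parse import parse_qs, urlparse

PIPELINES = {
    "1": "genome assembly and annotation",
    "2": "species identification",
}


def describe_result_link(url: str):
    """Return (pipeline name, job ID) parsed from a result link."""
    query = parse_qs(urlparse(url).query)
    pipeline_type = query["type"][0]
    job_id = query["jobId"][0]
    if pipeline_type not in PIPELINES:
        raise ValueError(f"unexpected type: {pipeline_type!r}")
    if len(job_id) != 32:
        raise ValueError("jobId should be a string of 32 characters")
    return PIPELINES[pipeline_type], job_id


# Host and path are copied from the link pattern in this manual;
# the jobId is a placeholder made of 32 zeros.
link = "https://gctype.wdcm.org/alysisresult.jsp?type=1&jobId=" + "0" * 32
print(describe_result_link(link))
```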

FIGURE: Header of the result page.


3.2 Genome assembly and annotation pipeline output

3.2.1 Sections of the report

The report contains the following sections: project status, pipeline overview, summary, 16S and gene annotation. If 'assembly (1)' is selected, the genomic components and gene annotation sections are removed. If 'genome annotation (2, 3)' is selected, the sequencing status sections are removed. If 'gene annotation (3)' is selected, the sequencing status, genome assembly and genomic components sections are removed. A short link to each section is hidden in its title; it is shown when you move the mouse over the title.

FIGURE: Short link to every section.


3.2.2 Output files

TABLE: Output files and folders list.

File or folder name    Type    Description
16s_blast.result.xls   file    The result of 16S gene sequence alignment against the gcType 16S rDNA gene reference database.
best_par.txt           file    Assembler and parameters of the best assembly.
Demo1_antiSMASH        folder  The output of antiSMASH. Contains a zipped file.
Demo1_checkM           folder  The output of CheckM. Contains a log file and a lineage file.
Demo1_diamond          folder  Annotation results with Diamond. Contains several files named like *_diamond.txt.
Demo1.fasta            file    Best assembly result.
Demo1_figure           folder  Figures created with ggplot2.
Demo1_gc_cov.png       file    GC content and sequencing depth distribution.
Demo1.gene.png         file    Gene length distribution.
Demo1.gene.xls         file    Gene prediction statistics.
Demo1_html             folder  Report folder.
Demo1_htmls            folder  Gene location and annotation result html file.
Demo1_insert_size.png  file    Insert length distribution.
Demo1_kmer_freq.png    file    Kmer depth distribution.
Demo1_piler-cr         folder  The output of PILER-CR.
Demo1_prodigal         folder  The output of Prodigal.
Demo1_ratio.html       file    Detailed annotation result.
Demo1_RepeatMasker     folder  The output of TRF.
Demo1_Rfam             folder  Annotation of Rfam.
Demo1_RNAmmer          folder  The output of RNAmmer.
Demo1_tables           folder  Statistics of annotated genes against selected databases. Contains several files named like *_table.txt.
Demo1_tRNAscan         folder  The output of tRNAscan-SE.
genome_info.txt        file    Assembly statistics.
kmer_info.txt          file    Kmer statistics.
reads_info.txt         file    Quality control summary.
reads_match.txt        file    The statistics of the mapping result.
RNA.xls                file    Non-coding RNA statistics.
scaffold_info.txt      file    GC content, scaffold length and sequencing depth of all scaffolds.

3.3 Species identification pipeline output

3.3.1 Sections of the report

The report contains the following sections: project status, pipeline overview, sequencing status, summary, 16S rDNA based analysis and genome based analysis. A short link to each section is hidden in its title; it is shown when you move the mouse over the title.

FIGURE: Short link to every section.


3.3.2 Output files

TABLE: Output files and folders list.

File or folder name  Type    Description
0.16s                folder  input16s_blast.stat.xls (alignment between predicted and submitted 16S rDNA sequences); genome_info.txt (statistics of submitted genome); query*.genome.RNAmmer.fasta (predicted 16S rDNA sequence); filter.xls (16S rDNA sequences above 1200 bp).
1.blast_16s          folder  blast.out (alignment against gcType 16S rDNA sequence database); blast.stat.xls (selected 16S rDNA sequence statistics).
2.tree_16s           folder  all.16s.tree (phylogenetic tree).
3.mash               folder  genome.stat.xls (Mash and ANI values between selected genomes and query genomes).
4.markergene         folder  *.phylosift (the output of phylosift).
5.allmarker          folder  *.list (genome filter record); COG*.fasta (multiple alignment file of 56 marker genes).
6.wholegenome        folder  all.genome.fasta (multiple alignment file of 56 marker genes).
7.tree               folder  genome.tree (phylogenetic tree).

4. Detail

4.1.1 Insert length distribution

The insert length distribution is computed using Picard from the sorted BAM file created by mapping the reads to the contigs.

4.1.2 GC content and sequencing depth

Assembled contigs/scaffolds are chopped into fragments with a sliding window. The GC content and sequencing depth of each window are computed. The window size and step size of the sliding window are 500 and 20, respectively. The figure shows 97.5% of the data; outliers are removed.
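
The GC part of this calculation can be sketched as follows. The example assumes that the window and step sizes are measured in base pairs (the manual gives no unit) and that every base in a window, including N, counts towards the denominator. It does not compute sequencing depth, which needs the read mapping, and it is not the pipeline's actual code.

```python
def gc_windows(sequence: str, window: int = 500, step: int = 20):
    """Return (start, GC fraction) for each full sliding window."""
    seq = sequence.upper()
    results = []
    # Only full-length windows are reported; a trailing partial window is dropped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc_count = chunk.count("G") + chunk.count("C")
        results.append((start, gc_count / window))
    return results


# Toy contig: 500 bases of AT followed by 500 bases of GC.
contig = "AT" * 250 + "GC" * 250
for start, gc in gc_windows(contig, window=500, step=250):
    print(start, gc)
```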

4.1.3 Assembly tools, parameters and best result selection

TABLE: Kmer setting of assembly tools.

Tool         Read length  Kmer settings (alternatives separated by ";")
SPAdes       [50,70)      -k 21,27,33,39,45; default
SPAdes       [70,100)     -k 27,35,43,51,59; default
SPAdes       [100,127)    -k 21,29,37,45,53; -k 21,35,49,63,77,91; default
SPAdes       [127,*)      -k 21,35,49,63,77; -k 17,39,61,83,105,127; default
IDBA         [50,70)      --mink 21 --maxk 45 --step 6; default
IDBA         [70,100)     --mink 17 --maxk 59 --step 8; --mink 27 --maxk 67 --step 10; default
IDBA         [100,127)    --mink 21 --maxk 53 --step 8; --mink 21 --maxk 91 --step 14; default
IDBA         [127,*)      --mink 21 --maxk 77 --step 14; --mink 17 --maxk 124 --step 22; default
Velvet       [50,70)      45; default
Velvet       [70,100)     67; default
Velvet       [100,127)    91; default
Velvet       [127,*)      127; default
SOAPdenovo2  [50,70)      -K 21 -m 45; default
SOAPdenovo2  [70,100)     -K 27 -m 67; default
SOAPdenovo2  [100,127)    -K 35 -m 91; default
SOAPdenovo2  [127,*)      -K 39 -m 127; default
Platanus-b   [50,70)      -k 21 -K 0.3 -s 6; default
Platanus-b   [70,100)     -k 27 -K 0.4 -s 8; -k 17 -K 0.45 -s 10; default
Platanus-b   [100,127)    -k 21 -K 0.35 -s 8; -k 21 -K 0.6 -s 14; default
Platanus-b   [127,*)      -k 21 -K 0.51 -s 14; -k 17 -K 0.82 -s 22; default
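
The read-length ranges in this table are half-open intervals, so a lookup needs only the lower bounds. The sketch below illustrates this for the explicit Velvet kmer values; the function name is hypothetical, the 'default' alternative is not modelled, and treating reads shorter than 50 as unsupported is an assumption, since the table does not cover them.

```python
import bisect

# Lower bounds of the ranges [50,70), [70,100), [100,127), [127,*).
BIN_STARTS = [50, 70, 100, 127]
# Explicit Velvet kmer value listed for each range.
VELVET_KMER = [45, 67, 91, 127]


def velvet_kmer(read_length: int) -> int:
    """Return the Velvet kmer value listed for the given read length."""
    if read_length < BIN_STARTS[0]:
        raise ValueError("read lengths below 50 are not covered by the table")
    # bisect_right counts the lower bounds that are <= read_length.
    index = bisect.bisect_right(BIN_STARTS, read_length) - 1
    return VELVET_KMER[index]


for length in (50, 99, 100, 150):
    print(length, velvet_kmer(length))
```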

The best assembly is selected using a mixed scoring system with the following terms: N50*0.0001*20%, N75*0.0001*15%, contig*1*35%, largest*0.0001*15%, total*0.0001*10%, N*0.001*5%.
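
The individual terms can be evaluated as in the sketch below. The manual does not state how the terms are combined into a final score, nor whether the contig and N terms count for or against an assembly, so the example only computes each term as written. The function name and the example metric values are invented.

```python
# (scale, weight) for each metric, copied from the terms listed above.
TERMS = {
    "N50": (0.0001, 0.20),
    "N75": (0.0001, 0.15),
    "contig": (1, 0.35),
    "largest": (0.0001, 0.15),
    "total": (0.0001, 0.10),
    "N": (0.001, 0.05),
}


def weighted_terms(metrics: dict) -> dict:
    """Evaluate metric * scale * weight for every term of the scoring system."""
    return {
        name: metrics[name] * scale * weight
        for name, (scale, weight) in TERMS.items()
    }


# Invented assembly statistics, for illustration only.
example = {
    "N50": 100_000,
    "N75": 50_000,
    "contig": 40,
    "largest": 400_000,
    "total": 5_000_000,
    "N": 1_000,
}
for name, value in weighted_terms(example).items():
    print(name, round(value, 4))
```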

4.1.4 Version of tools and databases

TABLE: Version of tools and databases.

Tools or databases  Version or date        Arguments and description
fastQC              v0.11.5                (Default)
trimmomatic         v0.38                  SLIDINGWINDOW:5:20 MINLEN:20
sickle              v1.33                  (Default)
SPAdes              v3.13.0                (several sets of kmers) --careful --sc --disable-gzip-output
IDBA                v1.1.3                 (several sets of kmers) --pre_correction
Velvet              v1.2.10                (several sets of kmers) -cov_cutoff auto -exp_cov auto
SOAPdenovo2         v2.04                  (several sets of kmers) -R -d 1 -M 1 -D 1 -F
Platanus-b          v1.2.0                 (several sets of kmers)
CANU                v1.8                   (Default)
flye                v2.5                   (Default)
Pilon               v1.23                  (--changes --fix all)
checkM              v1.0.11                (lineage_wf)
PILER-CR            v1.06                  (Default)
tRNAScan-SE         v2.0                   (-qQ -Y)
RNAmmer             v1.2                   (-multi)
Prodigal            v2.6.3                 (Default)
TRF                 v4.07b                 (2 7 7 80 10 50 500 -f -d -m -h)
Diamond             v0.9.19.120            -e 1e-5 --id 40 --query-cover 40 --subject-cover 40
NR                  Download at Jan. 2020  (Diamond)
KEGG                Release 87.0 & 58.1    (Diamond)
COG                 Version 2014           (Diamond)
Pfam                Release 32.0           (pfam_scan.pl)
TIGRfam             Release 15.0           (pfam_scan.pl)
Rfam                v1.1.2                 (Infernal)
Swiss-Prot          Release 2017_07        (Diamond)
MetaCyc             Release 18.1           (Diamond)
PHI                 Version 4.5            (Diamond)
CAZy                Date Jul 20 2017       (Diamond)
Anti-SMASH          Release 4.2.0          (Diamond)
CARD                Version 2.0.3          (Diamond)
VFDB                Date Oct 5 2018        (Diamond)

5. Supplementary

Citation: https://gctype.wdcm.org/

Contact: ma@im.ac.cn; wulh@im.ac.cn; shiwy@im.ac.cn

Announcement: If you use the software and results described in this manual, please cite the corresponding articles.

   

   


6. Change log

version 2.14 2021-02-09

The minimum size of genome files and gene files is set to 2KB in the genome annotation pipeline.

Fix some bugs in the calculation of average intergenic in genome assembly and annotation pipeline.

Fix some bugs when pasting a large number of genome files in the species identification pipeline.

version 2.14 2021-01-26

The cloud computing cluster will be maintained and updated from Jan 26 9:00 AM to Jan 27 12:00 AM (Beijing time). The analysis services will be shut down for several days. Please do not submit time-consuming analysis jobs before Jan 26.

version 2.13 2020-12-03

Fix a small unit bug (bp -> kbp) in the report of genome assembly and annotation pipeline.

version 2.12 2020-11-18

Add several restrictions on sample names.

version 2.11 2020-11-09

Task statistics are shown on the submission page.

version 2.10 2020-11-04

Fix some bugs in argument selection (AntiSMASH) of genome assembly and annotation pipeline.

version 2.9 2020-09-28

Maximum limitations of genome and 16S rDNA file size are set to 20MB and 1MB, respectively.

version 2.8 2020-09-17

Fix some file fragmentation transfer bugs in file uploading plug-in.

version 2.7 2020-08-18

Update to MEGA X in the phylogenetic analysis.

Modify report template of both pipelines.

version 2.6 2020-08-14

Modify report template of both pipelines.

version 2.5 2020-08-10

Modify report template of both pipelines.

version 2.4 2020-08-01

Fix some bugs in processing special characters in sample names.

Add some output in genome assembly and annotation pipeline.

version 2.3 2020-07-10

Add predicted 16S rDNA alignment against gcType 16S rDNA database in genome assembly and annotation pipeline.

Modify report template of both pipelines.

version 2.2 2020-06-29

Modify output data format of genome assembly and annotation pipeline.

Modify tree viewer plugin in species identification pipeline.

Modify report template of both pipelines.

version 2.1 2020-06-15

Modify output figures in the report of genome assembly and annotation pipeline.

version 2.0 2020-05-31

The pipelines have been moved to cloud-based servers. The security, CPUs, nodes, parallel processing capability and data storage have been greatly improved.