# metagWGS: Documentation

## Introduction
**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150 bp; PacBio HiFi reads, single-end).

### Pipeline graphical representation
The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) `.fa/.fasta` and uses the modules represented in this figure:
![](docs/Pipeline.png)

### metagWGS steps

metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis:

* `S01_CLEAN_QC` (can be stopped at with `--stop_at_clean` ; can be skipped with `--skip_clean`)
   * trims adapter sequences and removes low-quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
   * removes host contaminants ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) + [Samtools](http://www.htslib.org/))
   * controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
   * makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [plot_kaiju_stat.py](bin/plot_kaiju_stat.py) + [merge_kaiju_results.py](bin/merge_kaiju_results.py))
* `S02_ASSEMBLY` (can be stopped at with `--stop_at_assembly`)
   * assembles cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit))
   * assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
   * deduplicates cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) + [Samtools](http://www.htslib.org/) + [Bedtools](https://bedtools.readthedocs.io/en/latest/))
* `S03_FILTERING` (can be stopped at with `--stop_at_filtering` ; can be skipped with `--skip_assembly`)
   * filters contigs with low CPM value ([Filter_contig_per_cpm.py](bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `S04_STRUCTURAL_ANNOT` (can be stopped at with `--stop_at_structural_annot`)
   * makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](bin/Rename_contigs_and_genes.py))
* `S05_ALIGNMENT`
   * aligns reads to the contigs ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) + [Samtools](http://www.htslib.org/))
   * aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* `S06_FUNC_ANNOT` (can be skipped with `--skip_func_annot`)
   * makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](bin/cd_hit_produce_table_clstr.py))
   * quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](bin/Quantification_clusters.py))
   * makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [best_bitscore_diamond.py](bin/best_bitscore_diamond.py) + [merge_abundance_and_functional_annotations.py](bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](bin/quantification_by_functional_annotation.py))
* `S07_TAXO_AFFI` (can be skipped with `--skip_taxo_affi`)
   * taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](bin/aln2taxaffi.py))
   * taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](bin/aln2taxaffi.py))
   * counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](bin/quantification_by_contig_lineage.py))
* `S08_BINNING` (not yet implemented)
   * aligns the reads of every sample against all assemblies (BWA-MEM2)
   * performs metagenome binning (METABAT2 + MAXBIN2 + CONCOCT)
   * refines bin sets (bin_refinement.sh adapted from METAWRAP bin_refinement)
   * dereplicates bins between samples (DREP)
   * taxonomically affiliates the bins (GTDBTK)
   * calculates bin abundances between samples (BWA-MEM2 + SAMTOOLS)
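
The contig filtering of `S03_FILTERING` listed above can be illustrated with a minimal Python sketch. It assumes the common definition of CPM (counts per million: reads mapped to a contig, divided by the total number of mapped reads, times one million) and a single sample. The function names, the `min_cpm` threshold and the example counts are hypothetical, and this is not the code of `Filter_contig_per_cpm.py`.

```python
def contig_cpm(read_counts):
    """Map each contig to its counts per million (CPM) of mapped reads."""
    total = sum(read_counts.values())
    if total == 0:
        # No mapped reads at all: every contig gets a CPM of zero.
        return {contig: 0.0 for contig in read_counts}
    return {
        contig: count / total * 1_000_000
        for contig, count in read_counts.items()
    }


def filter_contigs(read_counts, min_cpm):
    """Keep contigs whose CPM is at least min_cpm (hypothetical threshold)."""
    cpm = contig_cpm(read_counts)
    return [contig for contig, value in cpm.items() if value >= min_cpm]


if __name__ == "__main__":
    # Made-up read counts per contig, for illustration only.
    counts = {"contig_1": 900_000, "contig_2": 99_999, "contig_3": 1}
    print(filter_contigs(counts, min_cpm=10))
```

In this toy example `contig_3` has a CPM of about 1 and is dropped, while the two other contigs are kept.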

All steps are launched one after another by default. Use `--stop_at_[STEP]` and `--skip_[STEP]` parameters to tweak execution to your will.
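
As a rough illustration of how these two families of parameters combine, here is a small Python sketch. It only models the behaviour described above (steps run in order, skipped steps are left out, nothing runs after the stop point). The `plan_steps` helper is hypothetical and is not part of the pipeline; the exact flag names and their effects are described in the usage documentation.

```python
# Ordered step names as listed in this README (S08_BINNING is left out
# because it is not yet implemented).
STEPS = [
    "S01_CLEAN_QC",
    "S02_ASSEMBLY",
    "S03_FILTERING",
    "S04_STRUCTURAL_ANNOT",
    "S05_ALIGNMENT",
    "S06_FUNC_ANNOT",
    "S07_TAXO_AFFI",
]


def plan_steps(steps, skip=(), stop_at=None):
    """Return the steps that would run, in order (hypothetical helper)."""
    skipped = set(skip)
    planned = []
    for step in steps:
        if step not in skipped:
            planned.append(step)
        if step == stop_at:
            break  # nothing after the stop point is launched
    return planned


if __name__ == "__main__":
    # Skip cleaning and stop after assembly: only the assembly step runs.
    print(plan_steps(STEPS, skip=["S01_CLEAN_QC"], stop_at="S02_ASSEMBLY"))
```

With no `skip` and no `stop_at`, the helper returns every step, which mirrors the default behaviour of launching all steps one after another.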

An HTML report file is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
Two [Singularity](https://sylabs.io/docs/) containers are available, making installation trivial and results highly reproducible.
## Documentation

The metagWGS documentation can be found in the following pages:

   * [Installation](docs/installation.md)
      * The pipeline installation procedure.
   * [Usage](docs/usage.md)
      * An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
   * [Output](docs/output.md)
      * An overview of the different output files and directories produced by the pipeline.
   * [Use case](docs/use_case.md)
      * A tutorial to learn how to launch the pipeline on a test dataset on the [genologin cluster](http://bioinfo.genotoul.fr/).
   * [Functional tests](functional_tests/README.md)
      * (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.

## Contact us

If you have any questions or suggestions for improvement, please contact us at claire.hoede[@]inrae.fr.

## Cite us

For the moment, if you use metagWGS for your research, please cite:
Joanna Fourquet, Céline Noirot, Christophe Klopp, Philippe Pinton, Sylvie Combes, et al. Whole metagenome analysis with metagWGS. JOBIM2020, Jun 2020, Montpellier, France. ⟨hal-03176836⟩