Quality control and feature tabulation¶
There are many ways in which to enter the QIIME2 workflow. Which one to use depends mostly on whether your data has been preprocessed in any way. In our case, sequences were already demultiplexed by the sequencing facility, such that we have two files for each of the samples: one containing the forward read “R1” and one with the reverse “R2”. We can import all these sequence files into QIIME2 in one go using a manifest file.
See the QIIME2 Importing data tutorial information on how to import different input data formats.
Prepping the manifest file¶
The manifest file includes the sample ids, the path to where each fastq.gz file is stored, and its orientation.
Because we have paired-end data and thus two files for each sample, we will list each sample twice, once for the forward and once for the reverse orientation.
This is what the first few lines of the manifest file look like (or run head data/MANIFEST.csv
):
sample-id,absolute-filepath,direction
DUMMY10,$PWD/data/Sample-DUMMY10_R1.fastq.gz,forward
DUMMY10,$PWD/data/Sample-DUMMY10_R2.fastq.gz,reverse
DUMMY11,$PWD/data/Sample-DUMMY11_R1.fastq.gz,forward
DUMMY11,$PWD/data/Sample-DUMMY11_R2.fastq.gz,reverse
...
You can find the manifest file under data/MANIFEST.csv as a comma-separated file.
Importing data¶
Important
Don’t forget to activate your conda environment before running any QIIME2 command.
Just run: conda activate qiime2-2021.2
.
We can use the manifest file to import the data into QIIME2. I already selected the appropriate data type and input format, so just run:
mkdir -p prep
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path data/MANIFEST.csv \
--input-format PairedEndFastqManifestPhred33 \
--output-path prep/demux-seqs.qza
This will create a special data object called an ‘artifact’ which is a zip archive with a ‘.qza’ (QIIME zipped artifact) extension. Artifacts are the main file type used in QIIME2 analyses. In addition to the actual data, in this case the fastq.gz files, artifacts contain metadata such as file format, relevant citations and the data provenance.
If you quickly want to see what sort of artifact it is, run
qiime tools peek prep/demux-seqs.qza
From the demultiplexed artifact we can create an interactive summary of the sequences. This summary includes the number of sequences that were obtained per sample and the distribution of sequence quality scores at each position.
qiime demux summarize \
--i-data prep/demux-seqs.qza \
--o-visualization prep/demux-seqs.qzv
The result is the other main object in QIIME2: a ‘vizualisation’, which is a zip archive with ‘.qzv’ (QIIME zipped vizualisation) extension.
Unlike artifacts, vizualisations cannot be used as input for further analyses in QIIME2.
They are only ment for display, for instance using the qiime tools view
utility.
qiime tools view prep/demux-seqs.qzv
Alternatively, we can view it online: https://view.qiime2.org/. This online utility is also handy to share QIIME2 results with collaborators that do not have QIIME2 installed.
Question 1
Check out the interactive quality plots. What could cause the ‘drop’ in quality at the 5-prime end of your sequences?
Primer removal¶
The next step is to remove any primer sequences. We will use the cutadapt QIIME2 plugin for that.
Because we have paired-end data, there is a forward and reverse primer, referenced by the parameters --p-front-f
and --p-front-r
in below command.
These are already the correct primer sequences used for your data.
Tip
Some steps in the QIIME2 workflow can take quite some time. We can speed things up by running certain processes using more than one cpu. Exactly how many cpus you can select mainly depends on how many you have available on your computer.
For the following command, I used --p-cores 8
(but please adjust the number of cpus based on how many you have available!).
qiime cutadapt trim-paired \
--i-demultiplexed-sequences prep/demux-seqs.qza \
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-r CCGYCAATTYMTTTRAGTTT \
--p-error-rate 0 \
--o-trimmed-sequences prep/trimmed-seqs.qza \
--p-cores 1 \
--verbose
The output written to the screen shows the commands that are run, as well as some info on the number of reads that are processed and trimmed.
The actual result is not written to the screen but saved as another QIIME2 artifact, which contains the trimmed sequences. Like with prep/demux-seqs.qza from the previous step, we can create a summary vizualisation and view it like so:
qiime demux summarize \
--i-data prep/trimmed-seqs.qza \
--o-visualization prep/trimmed-seqs.qzv
qiime tools view prep/trimmed-seqs.qzv
Compare the summary vizualisations of your demux and trimmed sequences.
Question 2
Feature table construction¶
Sequences were traditionally clustered into operational taxonomic units based on a fixed similarity threshold, typically 97% (Rideout et al. 2014). Such OTU clustering methods have been largely replaced now by denoise algorithms, which correct amplicon sequence errors and produce high-resolution amplicon sequence variants that resolve differences of as little as one nucleotide (Callahan et al. 2017).
Two denoisers are implemented in QIIME2: deblur-denoise and DADA2-denoise. Their performance is quite similar so which one you use in the end depends largely on taste (Nearing et al. 2018, Caruso et al. 2019). In the following sections you will run both.
See Estaki et al. 2020 for further details on the denoising workflow.
Deblur denoise¶
The Deblur-denoising procedure is split up in 3 steps.
Step 1. Join read pairs¶
First, join the forward and reverse reads into a single sequence spanning the entire target region. This joining is based on the overlap between the forward and reverse reads.
Lets first do some math:
Question 3
qiime cutadapt trim-paired
command)Now that you have checked whether there is enough overlap to reliably join your forward and reverse reads, lets do the joining. Note that I used –p-threads 8, but you need to decide for yourself how many cpus you have available.
mkdir -p deblur
qiime vsearch join-pairs \
--i-demultiplexed-seqs prep/trimmed-seqs.qza \
--o-joined-sequences deblur/joined-seqs.qza \
--p-threads 1 \
--verbose
Lets summarize and view again:
qiime demux summarize \
--i-data deblur/joined-seqs.qza \
--o-visualization deblur/joined-seqs.qzv
qiime tools view deblur/joined-seqs.qzv
Question 4
Step 2. Quality filter¶
The fastq.gz files not only contain the DNA sequences but also a quality score for each of the nucleotides. These quality scores are used by deblur-denoise to apply an initial quality filtering.
Warning
With qiime2-2021.2, the following command will produce a YAMLLoadWarning
warning, which you can ignore.
To perform quality filtering, just run:
qiime quality-filter q-score \
--i-demux deblur/joined-seqs.qza \
--o-filtered-sequences deblur/filt-seqs.qza \
--o-filter-stats deblur/filt-stats.qza \
--verbose
View the summary of your joined and filtered sequences:
qiime demux summarize \
--i-data deblur/filt-seqs.qza \
--o-visualization deblur/filt-seqs.qzv
qiime tools view deblur/filt-seqs.qzv
These results actually look very, very similar to the summary vizualisation of the joined sequences. In fact, when you look at the trimming stats, nothing really seems to have happened:
qiime metadata tabulate \
--m-input-file deblur/filt-stats.qza \
--o-visualization deblur/filt-stats.qzv
qiime tools view deblur/filt-stats.qzv
This is somewhat unusual but so is the quality of this particular data set. Just be prepared that your own data may be of lower quality.
Question 5
Have a look at the qiime quality-filter q-score
help function and see if you can find an explanation why no quality filtering was performed:
qiime quality-filter q-score --help
Now lets prepare for the next step where the actual denoising is performed:
Important
When we continue with deblur-denoise in the next step, you need to truncate your fragments such that they are all of the same length.
The position at which sequences are truncated is specified by the --p-trim-length
parameter.
Any sequence that is shorter than this value will be lost from your analyses.
Any sequence that is longer will be truncated at this position.
Question 6
--p-trim-length
?Step 3. Denoise¶
Run deblur-denoise on the joined and quality trimmed sequences.
I selected a --p-trim-length
of 370, because that resulted in minimal data loss.
That is, only <9% of the reads were discarded for being too short, and only 4 bases were trimmed off from sequences of median length.
Again, I used 8 cpus here, but you may have to modify that (--p-jobs-to-start 8
)
qiime deblur denoise-16S \
--i-demultiplexed-seqs deblur/filt-seqs.qza \
--p-trim-length 370 \
--o-representative-sequences deblur/deblur-reprseqs.qza \
--o-table deblur/deblur-table.qza \
--p-sample-stats \
--o-stats deblur/deblur-stats.qza \
--p-jobs-to-start 1 \
--verbose
This command results in three output files:
A feature table artifact with the frequencies per sample and feature.
A representative sequences artifact with one single fasta sequence for each of the features in the feature table.
A stats artifact with details of how many reads passed each filtering step of the deblur procedure.
Let start with converting the stats artifact into a vizualisations which we can then view again, like so:
qiime deblur visualize-stats \
--i-deblur-stats deblur/deblur-stats.qza \
--o-visualization deblur/deblur-stats.qzv
qiime tools view deblur/deblur-stats.qzv
The resulting table shows how many sequences per sample passed the deblur quality check, including the total number and the number of unique sequences. Just hoover over the table headers to get more information. Note that artifact here is used in the traditional sense of the word. It has nothing to do with the file type ‘artifact’.
To view the feature table summary statistics, just run:
qiime feature-table summarize \
--i-table deblur/deblur-table.qza \
--o-visualization deblur/deblur-table.qzv
qiime tools view deblur/deblur-table.qzv
Particularly interesting is the Interactive sample detail page. This shows you the number of features per sample. If there is large variation in this number between samples, it means it is difficult to directly compare samples. In that case it is often recommended to standardize your data by for instance rarefaction. Rarefaction will not be treated in detail in the tutorial, but the idea is laid out below.
Note
Rarefaction in a nut shell: for each sample you only include as many features as you have available for the smallest sample. Which features those are is determined by chance (i.e. subsample the features without replacement). If your smallest sample has very few features, you need to discard a lot of features from the other samples, which is of course a pity. In such situations it may be better to just exclude the small sample entirely.
Use the sample depth slider to see the effect of rarefying the data up to a minimum sampling depth.
Question 7
Lastly, lets check out the representative sequences:
qiime feature-table tabulate-seqs \
--i-data deblur/deblur-reprseqs.qza \
--o-visualization deblur/deblur-reprseqs.qzv
qiime tools view deblur/deblur-reprseqs.qzv
In addition to some summary statistics, this vizualization allows you to BLAST each representative sequence against the NCBI nt database. Just click on the sequence and then the View report button.
Important
Results of the “top hits” from a simple BLAST search such as this are known to be poor predictors of the true taxonomic affiliations of these features, especially in cases where the closest reference sequence in the database is not very similar to the sequence that you are using as a query. Instead, use automated taxonomic classification (see tutorial for tomorrow).
Question 8
What is the best BLAST hit for the most abundant feature in the data set?
DADA2 denoise¶
The DADA2-denoiser can be run using a single command because joining and filtering will be done automatically.
However, you need to decide on two important parameter values: --p-trunc-len-f
and --p-trunc-len-r
.
Question 9
Check out the qiime dada2 denoise-paired
help function to find out what --p-trunc-len-f
and --p-trunc-len-r
are.
Which of the previous vizualisations can help you decide on appropriate values for these parameters?
What do you think happens if you set these values at, say, 185 each?
Lets run dada2, note that I used 16 cpus, which you may have to tune down a bit.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs prep/trimmed-seqs.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 230 \
--p-trunc-len-r 220 \
--p-n-threads 1 \
--o-table dada2/dada2-table.qza \
--o-representative-sequences dada2/dada2-reprseqs.qza \
--o-denoising-stats dada2/dada2-stats.qza \
--verbose
This will take some time, so lets keep it running and pick this up again this afternoon or tomorrow morning. Then create some vizualisations:
qiime metadata tabulate \
--m-input-file dada2/dada2-stats.qza \
--o-visualization dada2/dada2-stats.qzv
qiime feature-table summarize \
--i-table dada2/dada2-table.qza \
--o-visualization dada2/dada2-table.qzv
qiime feature-table tabulate-seqs \
--i-data dada2/dada2-reprseqs.qza \
--o-visualization dada2/dada2-reprseqs.qzv
View and compare these results to the ones obtained with deblur-denoise.
Question 10
What method yields most features? Which method would you choose to continue with?