
CD-HIT-OTU User's Guide

Last updated: 2015/11/25 14:48

http://weizhong-lab.ucsd.edu/cd-hit-otu

Web server developed by Weizhong Li's lab at UCSD http://weizhong-lab.ucsd.edu liwz@sdsc.edu

Introduction

A fundamental question in metagenomics is the estimation of microbial diversity, often described in Operational Taxonomic Units (OTUs). The 16S ribosomal RNA survey is the most common approach for OTU identification. However, noise in PCR amplification and sequencing often causes over-estimation of OTUs, by up to orders of magnitude.

To address this problem, new methods such as SLP [1] were developed, and the current gold standard is flowgram clustering as implemented in PyroNoise [2], Denoiser [3] and AmpliconNoise [4]. Although these methods remove much more noise than earlier approaches, they still produce many spurious OTUs (sometimes >50%) [4]. In addition, both SLP and flowgram-based methods are computationally intensive, which prevents their use on very large datasets.

Compared with these methods, CD-HIT-OTU, the tool introduced here, has comparable accuracy in identifying true OTUs and produces far fewer spurious OTUs. CD-HIT-OTU is also ~2-4 orders of magnitude faster, especially for longer reads. It is built on our ultra-fast clustering program CD-HIT [5] with several additional algorithms and is available from http://weizhongli-lab.org/cd-hit-otu.

Six steps in CD-HIT-OTU

The key to CD-HIT-OTU is filtering out erroneous and chimeric reads by combining sequence clustering with statistical simulations. CD-HIT-OTU includes 6 steps.

1. Raw read filtering and trimming

For pyrosequencing data

Raw reads with ambiguous base calls, which account for a large portion of sequencing errors, are removed. If a 16S primer sequence is provided, reads that don’t match the primer are removed. Otherwise, a consensus of k bases (k=6 by default, adjustable by users) is built from the beginning of all reads, and reads that don’t match the consensus are removed. Once these reads are removed, the remaining sequence errors are more randomly distributed, which makes our simulations more accurate.
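The filtering described above can be sketched roughly as follows. This is only an illustration, not the CD-HIT-OTU code: the function name filter_by_prefix is hypothetical, the consensus is taken as a per-position majority vote, and requiring an exact match to the primer or consensus is a simplifying assumption.

```python
from collections import Counter

def filter_by_prefix(reads, primer=None, k=6):
    """Drop reads with ambiguous bases, then keep reads starting with the primer
    (or, if no primer is given, with the k-base consensus of all read starts)."""
    # Reads containing anything other than A, C, G, T are treated as ambiguous.
    clean = [r for r in reads if set(r.upper()) <= set("ACGT")]
    if primer is None:
        # Per-position majority vote over the first k bases.
        long_enough = [r for r in clean if len(r) >= k]
        if not long_enough:
            return []
        primer = "".join(
            Counter(r[i] for r in long_enough).most_common(1)[0][0]
            for i in range(k)
        )
    # Exact prefix match is an assumption made for this sketch.
    return [r for r in clean if r.startswith(primer)]

reads = ["CAACGCGAAG", "CAACGCGTTG", "CAANGCGAAG", "TTACGCGAAG", "CAACGCGAAC"]
kept = filter_by_prefix(reads)
print(kept)
```

In this toy input, the read containing N and the read starting with TTACGC are removed.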

Since the rRNA tags start with the same primer, they will align together at their 5’ ends during the OTU clustering analysis. The 3’ tail of a very long sequence cannot be aligned with many other sequences during the clustering step, so the quality of the tail of a long read cannot be reliably validated. Therefore, we also trim long reads and remove very short reads. The median length Lm is calculated for all reads. Longer reads are trimmed to this length. Reads shorter than Lm x f (f=0.8 by default, adjustable by users) are removed.

For Illumina data
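As a minimal sketch of the length rule above (the function name is hypothetical, and truncating a fractional median to an integer is an assumption):

```python
from statistics import median

def trim_and_filter(reads, f=0.8):
    """Trim reads to the median length Lm and drop reads shorter than Lm * f."""
    if not reads:
        return []
    lm = int(median(len(r) for r in reads))  # median read length Lm
    min_len = lm * f                         # shortest length that is kept
    kept = []
    for r in reads:
        if len(r) < min_len:
            continue                         # too short: removed
        kept.append(r[:lm])                  # longer reads are cut back to Lm
    return kept

reads = ["A" * 100, "A" * 120, "A" * 70, "A" * 95, "A" * 100]
trimmed = trim_and_filter(reads)
print([len(r) for r in trimmed])  # [100, 100, 95, 100]
```

Here Lm is 100, so the 120-base read is trimmed to 100 bases and the 70-base read (below 80) is removed.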

Similar to the process for pyrosequencing data, Illumina reads with ambiguous bases and reads that don’t match user-provided primers or calculated consensus are removed.

A read is also removed if it has more than 10 bases with error probability ≥ 0.1. According to the binomial distribution, the probability that such a read has ≥ 2 wrong bases is at least 0.608.
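The binomial calculation behind this kind of threshold can be reproduced with a few lines. The sketch assumes n independent bases, each wrong with probability exactly p, and computes P(X ≥ 2) = 1 - P(X = 0) - P(X = 1); the function name is hypothetical. With p = 0.1 the result is about 0.303 for n = 11 and about 0.608 for n = 20, so the 0.608 figure appears to correspond to 20 such bases.

```python
from math import comb

def prob_at_least_two_errors(n, p=0.1):
    """P(X >= 2) for X ~ Binomial(n, p)."""
    p0 = (1 - p) ** n                         # no wrong base
    p1 = comb(n, 1) * p * (1 - p) ** (n - 1)  # exactly one wrong base
    return 1 - p0 - p1

for n in (11, 20):
    print(n, round(prob_at_least_two_errors(n), 3))
```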

If we observe a sudden drop in quality scores at the last few bases, we will trim these bases for all the reads.

For paired reads, we assemble them into contigs. We allow up to 3 mismatches in the overlapping region and generate multiple candidate contigs that cover all combinations of the mismatched bases (see the following figure). For example, if a pair of reads has 3 mismatches, we create 8 candidate contigs. All the contigs, including the candidate contigs, are clustered at 100% ID using CD-HIT-DUP. For any pair of reads, the candidate contig in the largest cluster is considered correct and the other candidate contigs are removed. This unique step in CD-HIT-OTU preserves more reads than other methods in OTU analysis.

figure5_cd-hit-otu.jpg
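The enumeration of candidate contigs can be illustrated on the overlapping region alone. This sketch assumes the two overlap strings are already aligned, of equal length, and in the same orientation (the reverse read already reverse-complemented); the function name is hypothetical and the real assembly step is more involved. With k mismatches there are 2^k candidates.

```python
from itertools import product

def candidate_overlaps(fwd, rev, max_mismatches=3):
    """Return every combination of forward/reverse base choices at mismatched positions."""
    if len(fwd) != len(rev):
        raise ValueError("overlap strings must have the same length")
    mismatches = [i for i, (a, b) in enumerate(zip(fwd, rev)) if a != b]
    if len(mismatches) > max_mismatches:
        return []  # too many mismatches: the pair is not assembled
    candidates = []
    for choice in product((0, 1), repeat=len(mismatches)):
        seq = list(fwd)
        for pos, pick in zip(mismatches, choice):
            seq[pos] = (fwd, rev)[pick][pos]  # 0 = forward base, 1 = reverse base
        candidates.append("".join(seq))
    return candidates

cands = candidate_overlaps("ACGTACGT", "ACCTAGGA")  # 3 mismatches
print(len(cands))  # 8
```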

The tails of long contigs are then trimmed using a method similar to that used for pyrosequencing reads.

2. Clustering of duplicates

The filtered reads are grouped into clusters of duplicates. CD-HIT-DUP clusters reads that align at the 5’ end and share 100% identity over their full length (if the reads differ in length, only the full length of the shorter sequence needs to be covered).
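The duplicate criterion can be illustrated with a naive prefix test. This is not how CD-HIT-DUP is implemented; it is a quadratic-time sketch with a hypothetical function name, meant only to show the rule that a shorter read joins a cluster when it is identical to the 5' prefix of the representative.

```python
def cluster_duplicates(reads):
    """Group reads that are identical to the 5' prefix of a longer (or equal) read."""
    clusters = []  # list of (representative, members)
    # Longest reads first, so the representative is the longest read of its cluster.
    for read in sorted(reads, key=len, reverse=True):
        for rep, members in clusters:
            if rep.startswith(read):
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters

dup_clusters = cluster_duplicates(["ACGTAC", "ACGT", "ACGTAC", "TTGCA", "ACGA"])
for rep, members in dup_clusters:
    print(rep, len(members))
```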

3. Chimeric reads detection

Chimeric clusters are identified from the duplicate clusters in step (2).

4. Read recruitment allowing one error

Secondary clusters are recruited into primary clusters. Since all reads in a duplicate cluster from step (2) are identical, only the representative read (the longest one) is used for each cluster. The representative reads from step (2) are clustered by CD-HIT at a threshold that allows only one error. For example, 200bp reads are clustered at 99.5% identity. Unlike the default CD-HIT algorithm, which sorts reads in order of decreasing length, this step sorts the representative sequences in order of decreasing cluster size, so that the representatives of smaller secondary clusters can be recruited into their corresponding, larger primary clusters.
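Two details of this step are easy to check numerically: the identity threshold that tolerates exactly one error, and the ordering of representatives by cluster size. The helper names and the example cluster sizes are hypothetical.

```python
def one_error_identity(read_length):
    """Identity threshold that allows a single error over read_length bases."""
    return (read_length - 1) / read_length

def order_representatives(cluster_sizes):
    """Sort representative IDs by decreasing duplicate-cluster size."""
    return sorted(cluster_sizes, key=lambda rep: cluster_sizes[rep], reverse=True)

print(one_error_identity(200))  # 0.995, i.e. 99.5% identity

sizes = {"repA": 5, "repB": 1200, "repC": 40}
order = order_representatives(sizes)
print(order)  # ['repB', 'repC', 'repA']
```

Processing the largest clusters first means a likely primary cluster becomes the seed before its smaller one-error neighbours are examined.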

5. Removing noise

After the last step, the remaining clusters include primary clusters and small non-secondary clusters; the latter are noise to be removed. A cutoff x is calculated statistically at this step, and clusters smaller than x are treated as noise. x is calculated as follows:

Let the size of the largest primary cluster be M. We reciprocally estimate the upper bound N of this sequence’s depth using our simulation method. Given N, we then estimate the average size of the tertiary clusters, and this size is used as the cutoff x. Clusters smaller than x are considered noise and are therefore removed. x is often very small, such as 2 or 3 in our test pyrosequencing datasets and ~100 for Illumina datasets.

6. OTU clustering

Remaining representative reads from non-chimeric clusters are clustered into OTUs at a user-specified OTU cutoff (e.g. 97% ID at species level) using CD-HIT-EST (parameters: -c 0.97 -n 10 -l 11 -p 1 -d 0 -g 1). Here option “-c 0.97” means 97% identity, a commonly used cutoff for OTU clustering at species level. “-g 1” means accurate mode. Other parameters control word size and format etc. The output clusters are the identified OTUs.

Output files

OTU.clstr contains the OTU clusters with only the representatives of denoised, non-chimeric clusters.
OTU.nr2nd.clstr contains the OTU clusters together with the sequences from non-chimeric primary and secondary clusters. In this file, the number of reads in each OTU cluster can represent the abundance of that OTU.
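If these files follow the usual CD-HIT .clstr layout (a ">Cluster N" header line followed by one line per member sequence), which is an assumption here, per-OTU read counts can be tallied with a short parser. The function name and the sample lines are hypothetical.

```python
def otu_abundance(clstr_lines):
    """Count member lines under each '>Cluster N' header."""
    counts = {}
    current = None
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]          # e.g. "Cluster 0"
            counts[current] = 0
        elif current is not None:
            counts[current] += 1        # one member sequence
    return counts

example = [
    ">Cluster 0",
    "0\t95nt, >sample1_sequence1... *",
    "1\t92nt, >sample1_sequence2... at +/99.00%",
    ">Cluster 1",
    "0\t91nt, >sample2_sequence1... *",
]
abundance = otu_abundance(example)
print(abundance)  # {'Cluster 0': 2, 'Cluster 1': 1}
```

For a real file, the same function can be applied to an open file handle, for example otu_abundance(open("OTU.nr2nd.clstr")).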

How to use CD-HIT-OTU

INSTALLATION

CD-HIT-OTU works under Linux or UNIX systems

(1) Download the software from http://weizhongli-lab.org/cd-hit-otu/download.php, for example cd-hit-otu-454-0.0.2-2011-1004.tar.gz (for 454 data) or cd-hit-otu-illumina-0.0.1-2011-1004.tar.gz (for Illumina data)

(2) Decompress it with the command:

  tar zxvf cd-hit-otu-454-0.0.2-2011-1004.tar.gz

or

  tar zxvf cd-hit-otu-illumina-0.0.1-2011-1004.tar.gz

This will create a new directory like cd-hit-otu-454-0.0.2 or cd-hit-otu-illumina-0.0.1.

(3) Enter this new directory, and then enter the “cd-hit” directory (it may also be named cd-hit-version-number). This directory contains the cd-hit package, which you need to compile with either

  (a) make openmp=yes
  (b) make       

Here (a) builds a multi-threaded version of cd-hit and (b) builds the regular cd-hit.

cd-hit is a package we developed, and it is regularly updated. You can check http://cd-hit.org for newer packages or updates.

(4) Enter the “cdhit-dup” directory (it may be named cdhit-dup-version-number). This directory contains the cdhit-dup tool, which you need to compile with

  make

USER'S GUIDE

There are a few examples in this package; please enter the example directory.

requirement of input

cd-hit-otu starts with a FASTA/FASTQ format file containing the rRNA tags.

Barcodes should be removed. All the tags must start at 5' with the same rRNA primer shared by all rRNA genes (rare wrong base calls or small variations are ok - we will filter and trim them away).

for example:

>D8YAWCR01AD8YT|test40b
CAACGCGAAGAACCTTACCTGGACTTGACATGCACTTGAAAACTATAGAGATATAGTCCCTCTTCGGAGCAAGTGTGCAGGTGATGCATGGCTGT
>D8YAWCR01BQSTE|test40b
CAACGCGAAGAACCTTACCTGGGTTTGACATCCTGTGAACGTCTAAGAGATTAGACAGTGCCTTCGGGAGCACAGAGACAGGTGGTGCATGG
>D8YAWCR01ADZWF|test40b
CAACGCGAAGAACCTTACCTGGACTTGACATGCACTTGAAAACTATAGAGATATAGTCCCTCTTCGGAGCAAGTGTGCAGGTGCTGCATGG

The common primer ensures that all the sequences align at the 5' end during the clustering analysis, which is important for accuracy.

If your file doesn't have the rRNA primer, please add it back at the 5' end. If the primer is at the 3' end, please use the reverse strand (you can use fasta_reverse_strand.pl in this package).

If you don't know the primer, add any string such as “ATGCATGCATGCCCC” at the 5' end. The actual primer sequence is not important; what matters is that every tag starts with the same sequence.

Some people cluster the tags with the primer included; others do not. If you think the primer should NOT be considered, you can adjust the cutoff at the last step. For example, if your non-primer sequence is 80bp, the primer is 20bp, and you want to cluster at 97% identity (over the 80bp), then please use 97.6% as the OTU cutoff. Here 97.6% = (20 + 80*97%) / (20+80).
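The same adjustment can be computed for other primer and tag lengths. The helper name is hypothetical; the formula is the one given above, assuming the primer region always matches perfectly.

```python
def adjusted_cutoff(primer_len, tag_len, identity):
    """OTU cutoff over primer + tag that equals `identity` over the tag alone."""
    return (primer_len + tag_len * identity) / (primer_len + tag_len)

cutoff = adjusted_cutoff(20, 80, 0.97)
print(round(cutoff, 3))  # 0.976
```

The resulting value is what would be passed as the -c option in place of 0.97.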

For 454 data, one FASTA file is needed.

For single-end Illumina data, one FASTQ file is needed.

For paired-end Illumina data, two FASTQ files (forming a pair) are needed.

One script call

For 454 data

Please enter the examples directory.

You can run the OTU analysis with a command such as:

../cd-hit-otu-all.pl -i ArtificialGSFLX.fna -o ArtificialGSFLX-dir -e 0.005 -c 0.97 -f 0.8 -p 6

Run the command without parameters to see a description of these parameters:

../cd-hit-otu-all.pl

For single-end Illumina data

Please enter the examples directory.

You can run the OTU analysis with a command such as:

../cd-hit-otu-all-single.pl -i C1_1.fastq -o C1_1-dir -t 80 -c 0.97 -e 0.01

Run the command without parameters to see a description of these parameters:

../cd-hit-otu-all-single.pl

For paired-end Illumina data

Please enter the examples directory.

You can run the OTU analysis with a command such as:

../cd-hit-otu-all-pair.pl -i C1_1.fastq,C1_2.fastq -o C1-dir -t 125,125 -p 6,6 -c 0.97 -m true -e 0.01

Run the command without parameters to see a description of these parameters:

../cd-hit-otu-all-pair.pl 

Clustering pooled samples

Many samples can be pooled together for OTU analysis. cd-hit-otu needs a single FASTA/FASTQ file that contains all the reads from these samples. It is recommended that each read carry the sample name in its ID, such as

>sample1_sequence1
CAACGCGAAGAACCTTACCTGGACTTGACATGCACTTGAAAACTATAGAGATATAGTCCCTCTTCGGAGCAAGTGTGCAGGTGATGCATGGCTGT
>sample1_sequence2
CAACGCGAAGAACCTTACCTGGGTTTGACATCCTGTGAACGTCTAAGAGATTAGACAGTGCCTTCGGGAGCACAGAGACAGGTGGTGCATGG
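One way to produce such IDs when pooling per-sample FASTA files is to prefix every header with the sample name. This is a sketch with a hypothetical function name; it assumes the sample name itself does not contain the delimiter, so that sample and sequence names can be separated again later.

```python
def tag_fasta_with_sample(sample, fasta_lines, delimiter="_"):
    """Prefix each FASTA header with '<sample><delimiter>'."""
    out = []
    for line in fasta_lines:
        if line.startswith(">"):
            out.append(">" + sample + delimiter + line[1:])
        else:
            out.append(line)  # sequence lines are left unchanged
    return out

pooled = []
pooled += tag_fasta_with_sample("sample1", [">sequence1", "CAACGCGAAG"])
pooled += tag_fasta_with_sample("sample2", [">sequence1", "CAACGCGTTG"])
print("\n".join(pooled))
```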

cd-hit-otu includes several scripts that parse its results to generate OTU distributions, etc.

examples:

../clstr_sample_count_matrix.pl _ OTU.nr2nd.clstr # generate several OTU distribution files
../clstr_sample_count.pl _ OTU.nr2nd.clstr        #_ is the delimiter between sample name and sequence name

Methods

CD-HIT-OTU, the new tool we developed, is much faster than the existing methods and gives very accurate OTUs. The key point of our method is to first denoise the sequences and then do OTU clustering.

A metagenomic sample is composed of many species at various abundance levels. Each species has one or more copies of identical or distinct 16S rRNA genes. In a high-throughput 16S sequencing protocol, each unique 16S rRNA tag is sequenced many times. When the reads from a unique 16S rRNA tag are clustered at 100% sequence identity over their full length, the largest cluster contains the error-free reads. This cluster is called the primary cluster for this tag. There can be many secondary clusters, each containing reads that share exactly one identical error (an indel or a substitution) at exactly the same position. A secondary cluster is much smaller than its corresponding primary cluster because all of its members must make exactly the same error. It is also possible to observe tertiary clusters, which contain reads that share exactly the same two errors. Tertiary clusters are much smaller than secondary clusters and are thus ignored in our study. Smaller clusters and singletons may also include badly erroneous reads or artifacts such as chimeric reads.

The size and distribution of the primary and secondary clusters of a unique tag depend on the sequencing error model, the read length and the sequencing depth of this tag. Previous studies have carefully analyzed pyrosequencing errors. The reported per-base error rate is ~0.005, and it drops to ~0.0025 if obviously incorrect reads, such as those with ambiguous bases, are removed. Here, we ran large-scale simulations that introduce sequencing errors into tags under different read lengths, depths and error rates. For each simulation, we clustered the simulated reads to observe the primary and secondary clusters. We found that the secondary clusters are at least 2 orders of magnitude smaller than the corresponding primary clusters when the depth is ≥1000 (Figure 1, Table 1).

figure1_cd-hit-otu.jpg

table1_cd-hit-otu.jpg
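A toy version of such a simulation is sketched here. It is not the simulation used for Figure 1 and Table 1: it assumes uniform, independent substitution errors only (no indels), an arbitrary 200-base template, and hypothetical function names.

```python
import random
from collections import Counter

def simulate_clusters(template, depth, error_rate, seed=0):
    """Sequence `template` `depth` times with per-base substitutions; count identical reads."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(depth):
        read = list(template)
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        counts["".join(read)] += 1
    return counts

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

template = "ACGT" * 50  # arbitrary 200-base tag
counts = simulate_clusters(template, depth=2000, error_rate=0.0025)
primary = counts[template]  # error-free reads
secondary = max(
    (n for seq, n in counts.items() if hamming(seq, template) == 1), default=0
)  # largest cluster of reads sharing one identical error
print(primary, secondary)
```

Each cluster here is simply a group of identical reads, which corresponds to clustering at 100% identity for reads of equal length.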

When the reads from a metagenomic dataset are clustered at 100% identity, we observe a mixture of primary, secondary, tertiary (and so on) clusters of the original rRNA templates and of chimeric sequences (Figure 2). In principle, the non-chimeric primary clusters alone are necessary and sufficient for OTU clustering. All other clusters are noise and will inflate the estimated number of OTUs. So the objective here is to identify this noise effectively.

figure2_cd-hit-otu.jpg

Examples

For 454 data, nine artificial metagenomic datasets (Table 2), which were generated from mixtures of known DNA sequences, were used to test CD-HIT-OTU and other methods. The same mock datasets have also been used previously [1-4]. All files were downloaded from http://userweb.eng.gla.ac.uk/christopher.quince/Data/ according to references [2,4]. The Divergent, Artificial and Titanium datasets were downloaded as SFF files, which were then converted into FASTA files, quality score files and flowgram files. For the remaining datasets (Even and Uneven), we were not able to get the SFF files. Instead, we could only obtain the filtered (but not denoised) reads in FASTA format and AmpliconNoise-specific data files, so we were not able to run Denoiser on them because we could not retrieve the flowgram data. Results and CPU time are shown in Table 2. Here, sensitivity is the ratio of predicted true OTUs to all true OTUs, and specificity is the ratio of predicted true OTUs to all predicted OTUs. All the methods have comparably high sensitivities, but CD-HIT-OTU has significantly better specificity than the others. In addition, it is ~2-4 orders of magnitude faster.

The Illumina datasets (Table 2) tested in our study were downloaded according to the papers [6-8]. These include benchmark datasets and an actual environmental sample from Arctic tundra soil. CD-HIT-OTU identifies the correct or a very close number of OTUs in 1-5 minutes for the artificial communities. It also finds the expected number of OTUs for the Arctic tundra soil sample (AT1, AT2), with up to 5 million reads, within ~1h. In order to get correct OTUs, the original studies [6-8] all needed to remove sequences below an abundance cutoff of ~10^4 reads. Otherwise they overestimated the OTUs by orders of magnitude. In CD-HIT-OTU, the cutoff x, which is equivalent to an abundance cutoff, is often < 10^2, so our method can identify more rare populations than previous methods. In addition, CD-HIT-OTU uses a less strict quality filtering process, which allows it to use many more reads (55-95%) than existing methods (22-64%).

table2_cd-hit-otu.jpg
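The two ratios defined above can be written out directly. The function name and the counts in the example are made up for illustration and are not taken from Table 2.

```python
def sensitivity_specificity(predicted_true, all_true, all_predicted):
    """Sensitivity = predicted true OTUs / all true OTUs;
    specificity = predicted true OTUs / all predicted OTUs."""
    return predicted_true / all_true, predicted_true / all_predicted

# Hypothetical run: 20 true OTUs, 24 predicted OTUs, 18 of which are true.
sens, spec = sensitivity_specificity(18, 20, 24)
print(sens, spec)  # 0.9 0.75
```

Note that specificity in this sense falls as spurious OTUs are added, even when every true OTU is found.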

Discussion

With its ultra-high speed, CD-HIT-OTU can process millions of reads pooled from a series of samples in a few minutes. Such pooled clustering further improves OTU identification because reads from different samples validate each other. CD-HIT-OTU is available from http://weizhongli-lab.org/cd-hit-otu.

References

1. S. M. Huse, D. M. Welch, H. G. Morrison et al., Environ Microbiol 12 (7), 1889 (2010).
2. C. Quince, A. Lanzen, T. P. Curtis et al., Nat Methods 6 (9), 639 (2009).
3. J. Reeder and R. Knight, Nat Methods 7 (9), 668 (2010).
4. C. Quince, A. Lanzen, R. J. Davenport et al., BMC Bioinformatics 12, 38 (2011).
5. W. Z. Li and A. Godzik, Bioinformatics 22 (13), 1658 (2006).
6. P. H. Degnan and H. Ochman, ISME J (2011).
7. J. G. Caporaso, C. L. Lauber, W. A. Walters et al., Proc Natl Acad Sci U S A 108 Suppl 1, 4516 (2011).
8. A. K. Bartram, M. D. Lynch, J. C. Stearns et al., Appl Environ Microbiol 77 (11), 3846 (2011).

cd-hit-otu-user-guide.txt · Last modified: 2015/11/25 14:48 by liwz