The other day I was asked to compare the taxonomic distribution of two metagenomic samples and to answer whether they are significantly different. Given my deep dislike of using statistical methods without understanding the data, the first thing I tried was to SEE whether they differ. I did a clustering, (again) with the CLANS software: I joined the two sets of reads, ran an all-against-all BLASTN, and clustered them in 2D. The last step was to color the points, one sample in red, the other in blue.
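For the coloring to work, each read has to remember which sample it came from after the two sets are joined. Here is a minimal sketch of one way to do that, not necessarily what I did at the time: prefix every FASTA header with a sample label before merging. The labels `A` and `B`, the `|` separator and the function names are assumptions made up for this illustration.

```python
def tag_fasta(lines, tag):
    """Yield FASTA lines with 'tag|' prepended to every header line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep the original read ID, but record the sample in front of it.
            yield f">{tag}|{line[1:]}"
        elif line:
            # Sequence lines pass through unchanged; blank lines are dropped.
            yield line


def merge_samples(sample_a, sample_b, tag_a="A", tag_b="B"):
    """Join two FASTA line iterables into one list of tagged lines."""
    merged = list(tag_fasta(sample_a, tag_a))
    merged.extend(tag_fasta(sample_b, tag_b))
    return merged


def sample_of(read_id):
    """Recover the sample label from a tagged read ID (hypothetical 'tag|id' convention)."""
    return read_id.lstrip(">").split("|", 1)[0]
```

Tagging also keeps read IDs unique when both samples happen to use the same names, and `sample_of` is all you need later to decide between red and blue.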
Because the points are so dense, the picture answers the question only partially. Detailed inspection revealed that the black spots contain reads from both samples, although that is not clear from the picture. What is hardly visible here is that both samples maintain the same cluster structure, so one can assume they are highly similar. To get a more detailed picture, I clustered the reads by sequence identity at the 90% level (using cd-hit-454, a version of cd-hit written to deal with reads from the 454 sequencer). Clustering again in CLANS confirmed the initial assessment: the two samples were basically identical in terms of taxonomic coverage (to make the picture clearer I removed the connections; the group in the middle is a single cluster of highly connected reads):
The two sets of reads were not identical in size, but if they differed in taxonomic coverage, one would expect at least two clusters.
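The "detailed inspection" can also be done without squinting at the picture. The sketch below reads a cd-hit style `.clstr` file and counts, for each cluster, how many reads come from each sample. It assumes the usual `.clstr` layout (a `>Cluster N` line followed by member lines containing `>read_id...`) and that read IDs carry a sample prefix such as `A|` or `B|`; that prefix convention and the function names are hypothetical, chosen only for this example.

```python
import re
from collections import Counter

# Member lines look like: "0\t120nt, >A|read1... *"
READ_ID = re.compile(r">(.+?)\.\.\.")


def cluster_composition(clstr_lines):
    """Return one Counter per cluster, mapping sample label -> read count."""
    clusters = []
    for line in clstr_lines:
        if line.startswith(">Cluster"):
            clusters.append(Counter())
            continue
        match = READ_ID.search(line)
        if match and clusters:
            # Assumed convention: the sample label precedes the first '|'.
            sample = match.group(1).split("|", 1)[0]
            clusters[-1][sample] += 1
    return clusters


def shared_fraction(clusters, samples=("A", "B")):
    """Fraction of clusters that contain at least one read from every sample."""
    if not clusters:
        return 0.0
    shared = sum(1 for c in clusters if all(c[s] > 0 for s in samples))
    return shared / len(clusters)
```

A shared fraction close to 1 says the same thing as the black spots in the picture: most clusters hold reads from both samples. It is a crude summary rather than a statistical test, and singleton clusters will pull it down.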
Of course, the initial question could be answered in other ways, some of which also involve fancy pictures (such as mapping the reads onto one of NCBI’s databases and then visualizing the taxonomic coverage with MEGAN, which can also compare two samples visually). The advantage of using CLANS for this particular question is that it is robust, in the sense that it is independent of sample size (as you can see in the pictures above, the clusters look different, although the answer is the same).