
In science (tools) we trust

[Figure: Plot showing the growth of NCBI's GenBank data (image via Wikipedia)]

Software generates errors, and misuse of software generates even more errors. Here are a few examples of the use and misuse of common, non-scientific software and services.

I think I have mentioned this elsewhere: scientists neither care about search engines nor know how to use them. My friend was quite lucky when naming a newly found domain "NERD", because it's easy to remember and still unique enough to be findable. But usually, people aren't that lucky. Here's an example:

Jane: suggesting journals, finding experts. Schuemie MJ, Kors JA.

SUMMARY: With an exponentially growing number of articles being published every year, scientists can use some help in determining which journal is most appropriate for publishing their results, and which other scientists can be called upon to review their work. Jane (Journal/Author Name Estimator) is a freely available web-based application that, on the basis of a sample text (e.g. the title and abstract of a manuscript), can suggest journals and experts who have published similar articles. AVAILABILITY: http://biosemantics.org/jane.

and

JANE: efficient mapping of prokaryotic ESTs and variable length sequence reads on related template genomes. Liang C, Schmid A, López-Sánchez MJ, Moya A, Gross R, Bernhardt J, Dandekar T.

BACKGROUND: ESTs or variable sequence reads can be available in prokaryotic studies well before a complete genome is known. Use cases include (i) transcriptome studies or (ii) single cell sequencing of bacteria. Without suitable software their further analysis and mapping would have to await finalization of the corresponding genome. RESULTS: The tool JANE rapidly maps ESTs or variable sequence reads in prokaryotic sequencing and transcriptome efforts to related template genomes. It provides an easy-to-use graphics interface for information retrieval and a toolkit for EST or nucleotide sequence function prediction. Furthermore, we developed for rapid mapping an enhanced sequence alignment algorithm which reassembles and evaluates high scoring pairs provided from the BLAST algorithm. Rapid assembly on and replacement of the template genome by sequence reads or mapped ESTs is achieved. This is illustrated (i) by data from Staphylococci as well as from a Blattabacteria sequencing effort, (ii) mapping single cell sequencing reads is shown for poribacteria to sister phylum representative Rhodopirellula Baltica SH1. The algorithm has been implemented in a web-server accessible at http://jane.bioapps.biozentrum.uni-wuerzburg.de. CONCLUSION: Rapid prokaryotic EST mapping or mapping of sequence reads is achieved applying JANE even without knowing the cognate genome sequence.

These were published about 20 months apart. That's one part of the story. The second part is that unless you have a service like EMBL's SMART, using "cool-server-name" plus "server" as a Google query almost guarantees that you won't find what you're looking for. That's probably why the authors of the latter paper didn't realize that a service with exactly the same name already existed. Neither of these is found by a "Jane server" query anyway. The toughest example I've stumbled across recently is a bioinformatics service named "Project HOPE". Go find it. I can give you a hint – it comes from the Gert Vriend lab.

Common desktop software generates errors too. In 2004, Zeeberg and colleagues published a report about the irreversible, automatic renaming of genes by Microsoft Excel. Their conclusion was that in the large datasets we manage nowadays such mistakes go unnoticed, and they provide a number of examples of online databases containing errors of this kind. My former colleague Dirk Linke noticed another peculiarity: Microsoft Word sometimes automatically corrects words it thinks are misspelled, giving funny results such as "DANN polymerase" (a product of the German version of MS Office; "dann" is the German word for "then") or "praline" instead of "proline" (apparently all language versions of MS Office are capable of producing this mistake).
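To make the Excel issue concrete, here is a minimal sketch (my own illustration, not from Zeeberg et al.) that scans a gene-symbol column for values looking like Excel's date conversions ("2-Sep" for SEPT2, "1-Mar" for MARCH1). The file name and column name are hypothetical.

```python
# A minimal sketch (not from the original post) of the problem Zeeberg et al. describe:
# gene symbols such as SEPT2 or MARCH1 are silently converted to dates ("2-Sep",
# "1-Mar") once a list passes through a default Excel import. The file and column
# names below are hypothetical; the check flags values that look like such dates.
import csv
import re
import sys

# Excel-style date strings produced from gene symbols, e.g. "2-Sep", "1-Mar", "01-Dec"
EXCEL_DATE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)

def flag_mangled_symbols(path, column="gene_symbol"):
    """Yield (line number, value) for entries that look like Excel date conversions."""
    with open(path, newline="") as handle:
        # start=2 because line 1 of the file is the header row
        for line_no, row in enumerate(csv.DictReader(handle), start=2):
            value = (row.get(column) or "").strip()
            if EXCEL_DATE.match(value):
                yield line_no, value

if __name__ == "__main__":
    # Usage: python check_symbols.py genes.csv
    for line_no, value in flag_mangled_symbols(sys.argv[1]):
        print(f"line {line_no}: suspicious entry '{value}' - possible Excel date conversion")
```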

I have an example from my own area – sequence annotation. The level of misannotation in sequence databases had not been systematically assessed until recently, but over the years I've learned not to trust functional annotation that much (especially assignments like "will die slowly"). In a recent issue of PLoS Computational Biology there's a paper entitled "Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies", where the authors write:

The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%.

Because of the nature of the annotation process, errors tend to propagate; fixing the software (so that, for example, it doesn't overpredict) will therefore not automatically fix the errors, unless someone redoes the annotations in the large databases. One more note: while the level of misannotation in the NCBI NR database isn't that surprising (it's still high), it's interesting to see how little error there is in Swiss-Prot. I didn't expect manual curation to produce almost no errors, especially since I did exactly that (manual curation) for years during my PhD studies. It's also much lower than in other areas of the life sciences: for example, people in protein-protein interaction studies report 2%-9% curation errors across datasets from three different species curated by five different interaction databases.
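For readers who like to see the propagation argument in numbers, here is a toy simulation (my own sketch, not taken from the cited paper): every new sequence copies the annotation of a randomly chosen, already-annotated entry, with a small chance of a fresh mistake, and "fixing the software" halfway through stops new errors from appearing but leaves the already-propagated ones in place.

```python
# A toy model (my own illustration, not from the PLoS paper) of annotation transfer:
# each new sequence copies the function assigned to a randomly chosen, already-annotated
# entry, with a small chance of introducing a fresh error. At "fix_at" the error-introducing
# step is switched off ("the software is fixed"), yet the misannotations accumulated
# before that point keep being copied to new entries.
import random

def simulate(n_sequences=100_000, transfer_error=0.02, fix_at=50_000, seed=1):
    random.seed(seed)
    annotations = [True]  # one correctly annotated seed sequence
    for i in range(1, n_sequences):
        template_correct = random.choice(annotations)  # transfer from a random earlier entry
        rate = 0.0 if i >= fix_at else transfer_error  # no new errors after the "fix"
        fresh_error = random.random() < rate
        annotations.append(template_correct and not fresh_error)
    return annotations.count(False) / len(annotations)

if __name__ == "__main__":
    # With a 2% per-transfer error rate the overall misannotation level ends up far
    # above 2%, and it stays there even though the second half introduces no new errors.
    print(f"fraction misannotated: {simulate():.1%}")
```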