
Patent data available. So what?

From an O’Reilly Radar post entitled “Unlikely Group Working Happily Together To Solve Patent Problem”:

In September, the Patent Office announced a rather strange “Request for Information” (RFI). Under this proposed scheme, the Patent Office would receive a substantial (upwards of $10 million!) donation of equipment from a vendor. In return, the vendor would get to be the official distributor of the patent database to the public, and would get to sell “value-added products.” Among other things, the vendor would get access to the patents before the public does, allowing them to mine the database, and would be allowed to sell a variety of bulk products.

My initial reaction was “cool, the more data the better”. After a while, though, it turned into “so what?”. The more data that becomes available, the more value there is in secretly developed algorithms that can analyze, mine, or assemble it. I doubt even a large, bursty group effort could compete with the dedicated teams at Google or Intellectual Ventures.

Nanobiotechnology should be mobile

It occurred to me that whatever bionanotechnology comes to market in the following years must treat mobile devices, especially mobile phones, as its platform. Upcoming technologies are already portable; paper-based microfluidic analytical devices are just one example. We only need to couple them with mobile phones and a computing cloud that takes care of the data analysis, and that seems to be only a few steps away. Creating novel uses for mobile phones is the goal of NextLab over at MIT. One of their projects is MoCA, a mobile diagnostics infrastructure. Here’s a presentation about the project:

[Video: Moca Final Presentation from nextlab on Vimeo]

This builds on already existing technology. Blood or urine tests on mobile phones don’t seem to be light-years away. Sequencing a human genome on a phone sounds like a nice next challenge for NGS technologies.
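To make the phone-plus-cloud pattern concrete, here is a minimal Python sketch of the handset’s side: capture the assay locally, ship it to a cloud service for the heavy analysis. The endpoint URL and the response fields are invented for illustration; MoCA’s actual protocol is certainly different.

```python
import json
import urllib.request

# Hypothetical cloud endpoint that accepts a raw assay image (e.g. a
# photo of a paper-based microfluidic strip) and returns the analysis.
ANALYSIS_ENDPOINT = "https://example.org/api/assays"  # made up for illustration

def submit_assay(image_path):
    """Upload a raw assay image and return the cloud's analysis result."""
    with open(image_path, "rb") as f:
        request = urllib.request.Request(
            ANALYSIS_ENDPOINT,
            data=f.read(),
            headers={"Content-Type": "image/jpeg"},
        )
    with urllib.request.urlopen(request) as response:
        # Hypothetical response shape, e.g. {"analyte": "glucose", "level": 5.4}
        return json.load(response)

if __name__ == "__main__":
    print(submit_assay("strip_photo.jpg"))
```

The point of the split is that the phone stays dumb and cheap: everything that needs computing power or an up-to-date model lives server-side.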

Tags at Technology Review blogs

I’m rereading this very nice post by Ed Boyden:

Technology Review: Blogs: Ed Boyden’s blog: How to Think

When I applied for my faculty job at the MIT Media Lab, I had to write a teaching statement. One of the things I proposed was to teach a class called “How to Think” which would focus on how to be creative, thoughtful, and powerful in a world where problems are extremely complex, targets are continuously moving, and our brains often seem like nodes of enormous networks that constantly reconfigure. In the process of thinking about this, I composed 10 rules, which I sometimes share with students. I’ve listed them here, followed by some practical advice on implementation.

I’d read that post before, but only today did I notice one puzzling thing about its tags. The most interesting ones, such as creativity, thinking, synthesis or simplicity, are attached to this one particular post alone among all the (dozen or two) blogs over at Technology Review. Apparently such topics are not what technology is all about.

In science (tools) we trust

[Figure: plot showing the growth of NCBI’s GenBank data. Image via Wikipedia]

Software generates errors, and misuse of software generates even more errors. Here are a few examples of the use and misuse of common, non-scientific software and services.

I think I have mentioned this elsewhere: scientists neither care about search engines nor know how to use them. My friend was quite lucky when naming a newly found domain “NERD”, because it’s easy to remember and still unique enough to be findable. But usually people aren’t that lucky. Here’s an example:

Jane: suggesting journals, finding experts. Schuemie MJ, Kors JA.

SUMMARY: With an exponentially growing number of articles being published every year, scientists can use some help in determining which journal is most appropriate for publishing their results, and which other scientists can be called upon to review their work. Jane (Journal/Author Name Estimator) is a freely available web-based application that, on the basis of a sample text (e.g. the title and abstract of a manuscript), can suggest journals and experts who have published similar articles. AVAILABILITY: http://biosemantics.org/jane.

and

JANE: efficient mapping of prokaryotic ESTs and variable length sequence reads on related template genomes. Liang C, Schmid A, López-Sánchez MJ, Moya A, Gross R, Bernhardt J, Dandekar T.

BACKGROUND: ESTs or variable sequence reads can be available in prokaryotic studies well before a complete genome is known. Use cases include (i) transcriptome studies or (ii) single cell sequencing of bacteria. Without suitable software their further analysis and mapping would have to await finalization of the corresponding genome. RESULTS: The tool JANE rapidly maps ESTs or variable sequence reads in prokaryotic sequencing and transcriptome efforts to related template genomes. It provides an easy-to-use graphics interface for information retrieval and a toolkit for EST or nucleotide sequence function prediction. Furthermore, we developed for rapid mapping an enhanced sequence alignment algorithm which reassembles and evaluates high scoring pairs provided from the BLAST algorithm. Rapid assembly on and replacement of the template genome by sequence reads or mapped ESTs is achieved. This is illustrated (i) by data from Staphylococci as well as from a Blattabacteria sequencing effort, (ii) mapping single cell sequencing reads is shown for poribacteria to sister phylum representative Rhodopirellula Baltica SH1. The algorithm has been implemented in a web-server accessible at http://jane.bioapps.biozentrum.uni-wuerzburg.de. CONCLUSION: Rapid prokaryotic EST mapping or mapping of sequence reads is achieved applying JANE even without knowing the cognate genome sequence.

These were published ca. 20 months apart. That’s one part of the story. The second part is that unless you have a service like EMBL’s SMART, using “cool-server-name” plus “server” as a Google query almost guarantees that you won’t find what you’re looking for. That’s why the authors of the latter paper didn’t realize that a service with exactly the same name already existed. Neither of them is found by a “Jane server” query anyway. The toughest example I’ve stumbled across recently was a bioinformatics service named “Project HOPE”. Go find it. I can give you a hint: it comes from the Gert Vriend lab.
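Out of curiosity, name collisions like Jane/JANE can also be checked programmatically against the literature rather than via Google. Here’s a minimal sketch using the public Europe PMC REST search API; the endpoint and the query/format/pageSize parameters are part of that API, though the exact TITLE: field syntax is an assumption worth checking against their documentation.

```python
import json
import urllib.parse
import urllib.request

# Europe PMC literature search endpoint (public REST API).
EUROPE_PMC = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def find_name_collisions(tool_name, hits=10):
    """Return (year, title) pairs for papers whose title mentions the name."""
    query = urllib.parse.urlencode({
        "query": f'TITLE:"{tool_name}"',  # restrict the search to titles
        "format": "json",
        "pageSize": hits,
    })
    with urllib.request.urlopen(f"{EUROPE_PMC}?{query}") as response:
        results = json.load(response)["resultList"]["result"]
    return [(r.get("pubYear"), r.get("title")) for r in results]

if __name__ == "__main__":
    # Both "Jane" papers discussed above should appear among the hits.
    for year, title in find_name_collisions("JANE"):
        print(year, title)
```

Running something like this before settling on a server name would have flagged the clash immediately.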

Common desktop software generates errors too. In 2004 Zeeberg and colleagues published a report about the irreversible, automatic renaming of genes by Microsoft Excel. Their conclusion was that in the large datasets we manage now, such mistakes go unnoticed, and they provide a number of examples of online databases containing errors of this kind. My former colleague Dirk Linke noticed another peculiarity: Microsoft Word sometimes automatically corrects words it thinks are misspelled, giving funny results such as “DANN polymerase” (a product of the German version of MS Office; “dann” is German for “then”) or “praline” instead of “proline” (apparently all language versions of MS Office are capable of producing that mistake).
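The Excel failure mode is easy to screen for before a gene list ever touches a spreadsheet. Here’s a toy Python checker for the two kinds of mangling Zeeberg and colleagues describe: gene symbols reinterpreted as dates (SEPT2 becomes “2-Sep”) and RIKEN-style clone identifiers reinterpreted as numbers in scientific notation (2310009E13 becomes 2.31E+19). The patterns are illustrative, not a complete list.

```python
import re

# Gene symbols that spreadsheets tend to read as dates (SEPT2 -> "2-Sep",
# MARCH1 -> "1-Mar") and identifiers that parse as scientific notation
# (2310009E13 -> 2.31E+19). Illustrative patterns only, not exhaustive.
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)\d+$",
    re.IGNORECASE,
)
SCI_NOTATION_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)

def risky_symbols(symbols):
    """Return the identifiers a spreadsheet is likely to mangle."""
    return [s for s in symbols
            if DATE_LIKE.match(s) or SCI_NOTATION_LIKE.match(s)]

if __name__ == "__main__":
    genes = ["SEPT2", "MARCH1", "TP53", "2310009E13", "BRCA1"]
    print(risky_symbols(genes))  # ['SEPT2', 'MARCH1', '2310009E13']
```

Forcing text import (or quoting every field) avoids the problem, but the renaming is silent, which is exactly why it propagates into public databases.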

I have an example from my own area, sequence annotation. The level of misannotation in sequence databases hadn’t been assessed until recently, but over the years I’ve learned not to trust functional annotation that much (especially assignments like “will die slowly”). In a recent issue of PLoS Computational Biology there’s a paper entitled “Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies”, where the authors write:

The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%.

Because of the nature of the annotation process, errors tend to propagate; fixing the software (so that, for example, it doesn’t overpredict) will not automatically fix the errors unless someone redoes the annotations for the large databases. One more note: while the level of misannotation in the NCBI NR database doesn’t surprise that much (it’s still high), it’s interesting to see how little error there is in Swiss-Prot. I didn’t expect manual curation to produce almost no errors, especially since I’ve done exactly that (manual curation) for years during my PhD studies. It’s much lower than in other areas of the life sciences: for example, people in protein-protein interaction studies report 2%-9% curation errors across datasets from three different species curated by five different interaction databases.
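The propagation argument is easy to make concrete with a toy model. Suppose annotations are transferred by homology from the nearest annotated neighbour, and each transfer itself goes wrong with some small probability p (overprediction, picking the wrong paralog, and so on); a transferred annotation is then wrong if its source was wrong or the transfer erred. The numbers below are invented, but they show how a 1% database error climbs toward the reported misannotation levels after a handful of transfer generations.

```python
def propagated_error(e0, p, generations):
    """Error fraction after repeated homology-based annotation transfer.

    e0: initial fraction of wrong annotations in the source database
    p:  probability that a single transfer introduces a new error
    Each generation: e' = e + (1 - e) * p, i.e. an annotation is wrong
    if the source was wrong or the transfer step itself erred.
    Closed form: e_n = 1 - (1 - e0) * (1 - p) ** n.
    """
    e = e0
    for _ in range(generations):
        e = e + (1.0 - e) * p
    return e

if __name__ == "__main__":
    # Invented numbers: 1% initial error, 5% per-transfer error rate.
    for n in (1, 3, 5, 10):
        print(f"after {n:2d} transfers: {propagated_error(0.01, 0.05, n):.1%}")
```

With these made-up rates the error passes 40% after ten generations, which is why fixing the predictor alone, without re-annotating what is already deposited, doesn’t help much.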