Open data and open source. Incompatible?

The open data movement seemingly doesn’t differ much from other “open” movements: its goal is free access to data (you can substitute articles, source code or creative content for data) without certain restrictions and mechanisms of control. Quite recently a group of usual suspects (Cameron Neylon, John Wilbanks, Peter Murray-Rust and Rufus Pollock) put together a set of principles for open data in science called the Panton Principles.

It’s easy to think about open data in a similar way to open access or open source. However, I’ve come to think that open data, especially in the life sciences, is different. The background story is that we had to remove the “open data” component from a big grant proposal because it was incompatible with the policy (aka long-term strategy) of the funder. It didn’t violate any of the requirements, but in the get-a-grant game you don’t want to lower your chances on purpose. Hence no “open data” on that project. Konrad pointed out in this FriendFeed thread that Poland is no different from other countries with respect to the compatibility of open data principles with the long-term strategy of science development. And indeed, the Wikipedia entry on open scientific data states that it can already be challenged by individual institutions as well as by grant agencies:

As the term Open Data is relatively new it is difficult to collect arguments against it. Unlike Open Access where groups of publishers have stated their concerns, Open Data is normally challenged by individual institutions. Their arguments may include:

  • this is a non-profit organisation and the revenue is necessary to support other activities (e.g. learned society publishing supports the society)
  • the government gives specific legitimacy for certain organisations to recover costs (NIST in US, Ordnance Survey in UK)
  • government funding may not be used to duplicate or challenge the activities of the private sector (e.g. PubChem)

And this made me think that, from the point of view of a funder that wants some return on its investment in science, open data and open source are pretty much incompatible. When open source code is abundant, data becomes an asset and is more protected. When open data is abundant, analysis methods become more valuable and, as such, more protected (meaning less likely to be released under an open source license).

Current business models don’t fit well in a situation of abundance of everything (that is, data and software), so I don’t expect that science funders (the least innovative group in the world) can get out of the “scarcity/competitiveness” frame of thinking anytime soon. I have a feeling that our issues with open data being incompatible with the long-term strategy of science development will return some day when we try to put open source into yet another grant.

Open science – campfire, formal knowledge acquisition or both?

Recently I stumbled upon this provocative post by Robert Paterson entitled Are Books Bad For Us?. Of course he doesn’t advocate burning all books, but rather wonders whether books lower our ability to observe and think for ourselves. What caught my attention was the paragraph below:

How did pottery get invented? Surely no one said “Let’s have a project to invent Pottery!” How can you invent something that had never existed? No it must have happened like this – The People stopped for the night after a rainfall. The next morning, as they prepared to leave, the fire keeper noticed that beneath the coals that she was harvesting, the ground had baked to a crust. Maybe she could carry the fire in this thing – this bowl. That night as they shared the food around the fire, she told the people what had happened and showed them the “bowl” that she had lifted out of the earth the day before. And the conversation began  “how had that been? Did it hold the fire well? What else could it hold? What if we put it back in the fire? Would it hold water?” And on and on. Experiments were made. Some earth worked better than others. At the seasonal meeting with the Cousin Peoples, the People shared their story with the others and gave up a “bowl” as a gift their elder. At the next season meeting, the two tribes spent days sharing the stories of the experiments that they had been making…

Having open conversations and sharing stories of experiments are, at least that is my feeling, the sentiments of open science. Knowledge acquisition has become too formal and has taken away the joy of discovery. However, some commenters under the post above pointed out that the increased complexity of knowledge requires a formal acquisition process. The way I understand it, the amount and complexity of data require certain protocols and formats, such as MIAME for microarray experiments.

On the other hand, shared stories of experiments are cultural events. Have a look at the notebooks of people working in Steve Koch’s lab. I like to browse them, even if I hardly have an idea what the project is all about. In some ways they resemble articles from MAKE magazine, but you can easily interact with the authors (some of them have blogs or are active on FriendFeed). Blog posts linked from the Polymath project page are another example: in addition to good science, they often make a good story (I like this sentence from Nature’s article on the first Polymath project: “Who would have guessed that the working record of a mathematical project would read like a thriller?”). And sharing stories is as important as making the data reusable. Without the “campfire” aspect, open science is not that exciting anymore.

Biases, biases everywhere

When I skim through the list of cognitive biases, I wonder why papers like Over-optimism in bioinformatics: an illustration (hat tip: Neil) do not appear more often. Here’s the abstract:

MOTIVATION: In statistical bioinformatics research, different optimization mechanisms potentially lead to “over-optimism” in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking. RESULTS: We present an empirical study on over-optimism using highdimensional classification as example. Specifically, we consider a “promising” new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, we quantitatively demonstrate that it can artificially seem superior to existing approaches if we “fish for significance”. The investigated sources of over-optimism include the optimization of data sets, of settings, of competing methods and, most importantly, of the method’s characteristics. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should always be demonstrated on independent validation data. AVAILABILITY: The R codes and relevant data can be downloaded from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/overoptimism/, such that the study is completely reproducible.

Emphasis in the last sentence is mine. It’s the first time I have seen an explicit statement that a study is reproducible, even though publishing in a peer-reviewed journal is meant to imply that reviewers made sure it is.
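The “fishing for significance” mechanism from the abstract can be illustrated with a toy simulation. This is only a minimal sketch under loud assumptions and not the authors’ R code: the labels carry no signal at all, every “setting” is a classifier that guesses at random, and the names simulate, n_samples and n_settings are hypothetical. It reports the error of the setting that looks best on the data used for selection, next to the error of that same setting on independent validation data.

```python
import random


def error_rate(predictions, labels):
    """Fraction of predictions that disagree with the labels."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)


def simulate(n_samples=50, n_settings=100, seed=0):
    """Pick the best of many useless 'settings' and compare two error estimates.

    Assumption (hypothetical setup): labels are pure noise and every setting
    guesses at random, so the true error rate of each setting is 0.5.
    """
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(n_samples)]
    validation_labels = [rng.randint(0, 1) for _ in range(n_samples)]

    results = []
    for _ in range(n_settings):
        test_preds = [rng.randint(0, 1) for _ in range(n_samples)]
        validation_preds = [rng.randint(0, 1) for _ in range(n_samples)]
        results.append((
            error_rate(test_preds, test_labels),
            error_rate(validation_preds, validation_labels),
        ))

    # "Fishing": keep the setting with the lowest error on the test data.
    best_test, best_validation = min(results, key=lambda r: r[0])
    mean_test = sum(r[0] for r in results) / len(results)
    return {
        "selected_test_error": best_test,              # optimistic estimate
        "selected_validation_error": best_validation,  # independent data
        "mean_test_error": mean_test,                  # typical setting
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

With enough settings to choose from, the selected test error should fall clearly below 0.5 although no setting is better than chance, while the error of the same setting on the validation data is not subject to that selection. This is the abstract’s argument for demonstrating superiority on independent validation data.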

Anyway, I’m looking for more examples of biases in the research process. If you have some, please let me know.

Complex systems and technology

Two months ago I spoke at TEDxWulkan, a TEDx-like event inspired by the eruption of Eyjafjallajökull. Its theme came from a well-known (at least for fans of Sean Connery) movie line: “It’s impossible. But doable.” The video of my short talk is now posted on YouTube.

My point was that inspiration for modern technology could come not only from particular solutions found in living systems but also (or mainly) from the design of nature’s ways of coping with change. I used slime molds (fungus-like organisms that do not belong to the kingdom Fungi) as an example of a complex adaptive system that exhibits quite sophisticated behaviour but can hardly be called “intelligent” by design. I showed research done at the International Centre for Unconventional Computing (must love the name) by Jeff Jones and Andrew Adamatzky. They used Physarum polycephalum to reproduce motorways between major cities in England (see the original paper, Road planning with slime mould: If Physarum built motorways it would route M6/M74 through Newcastle, and the supplementary material for some cool videos). Similar work was published in Science a few months ago, where a group of researchers modelled the Tokyo railway network. I argued that we should design our technology in a similar way, so it can adapt to changing circumstances on its own.

Of course it’s easy for me to theorize like that: I’m a biologist, and a rather computational one, and I don’t put my ideas into laboratory practice. There are some niches where adaptive systems have been built for quite some time (for example large wireless networks), but these are rather exceptions. Adaptive software is still rare but seems to cover a wide area of uses (see this fascinating paper on adaptive software by Peter Norvig and David Cohn). Obviously it’s doable, but really hard. At least the field is gaining recognition. Consider that only two years ago the same group that published the Science paper on slime mold reproducing the Tokyo railway network was awarded an Ig Nobel Prize for research on solving maze puzzles with the same organism (published in Nature).

Project sketches – spatial dynamic forest metagenome

Project sketches: ideas that never made it into a grant application (for many reasons, but mostly because I’m not an expert on the topic, so they wouldn’t be funded anyway).

The idea for this project was quite simple: to build a spatial and dynamic map of a forest metagenome. I wanted to collect many samples from several locations (on the ground, on low-hanging branches, at the tops of the trees) under different conditions to assess how DNA spreads in such an environment. This was supposed to be entirely a fishing expedition: I still don’t have any fundamental question such a project would answer. I hoped that adding spatial and temporal data to metagenomic analysis would reveal small fluctuations that shape the overall state of the system. In other words, I hoped to find a butterfly effect in metagenomic space.


The inspiration for this project wasn’t a paper or a FriendFeed thread. It was a cool device called the Windbelt. The device pictured above (it comes in different sizes) is a small wind-powered generator that exploits the motion of the string to generate energy. A Windbelt plus a battery (as energy storage) was supposed to power sensors (wind, air quality, temperature, humidity), a communication module (Bluetooth or Wi-Fi) and a device that would collect DNA samples. All in a semi-independent package (at some point you would need to take the samples out and move them to a sequencing lab).


The airborne metagenome isn’t something particularly new. According to my searches, the first such experiment using modern NGS technology was published in PLoS One as early as 2008: a study of the urban indoor environment gave an insight into both the microbial and the functional diversity of the air metagenome inside densely populated buildings. Whether the dynamics of the air environment are as important as I believed remains to be checked, although I’m not that sure the methodology I imagined is really going to work.

Importance of meatspace – session at Science Online 2010

This is a long-overdue report on the session on science freelancing and science coworking at Science Online 2010 that I co-moderated with Brian Russell from Carrboro Creative Coworking.

In the first part of the session I briefly described my journey from being a freelance scientist (posts documenting about a year of being a freelance scientist: Freelancing science – today and tomorrow and End of freelancing as a scientist (for now)) to being the head of a virtual research institute (or rather a virtual contract research organisation). Then Brian introduced everybody to the concept of coworking and how it works for IT projects. He pointed out that one of the advantages of a coworking space is that a real community forms around it, without much intervention or moderation.

In the second part, inspired by Bora Zivkovic’s ideas of a science hostel (see his blog posts: Co-Researching spaces for Freelance Scientists? and What’s an office for?), we went on to discuss how much organizational freedom can be applied to research institutions. It turned out that many people agreed that under certain conditions (for example, no wet lab work with pathogens) one could do research in a coworking space in the majority of fields. Of course, some fields are more suitable than others (for example field work, or theoretical/computer work), but everybody seemed to agree that there are no real obstacles to having a science hostel for any kind of research. Bora also pointed out that such hostels should be organised around equipment instead of research area, so that people from different fields can talk together and exchange ideas. Finally, Trevor Owens (community lead for the Zotero project) had a cool idea of making a Craigslist of unused (and available to use/rent) lab space.

During the discussion I noted down the major ideas in this mindmap, available at MindMeister.

I think the major take-away message from this session was that people are surprisingly open to new ways of organizing the research process. I don’t think any such initiative would have a problem convincing a few researchers to at least try. Research parks, science hostels or virtual contract research organizations: all of these were seen as obvious solutions to certain (but not all) issues we face in our typical academic environment.

Of course, the main obstacle to such ideas is funding, which is relatively simple in the first example (a research park) and not that simple in the others. But since funding is an obstacle to many other good ideas, it wasn’t discussed at all.