openfoo

In the discussion under my recent post on incompatibilities between open source and open data, Bill Anderson pointed out the frequent confusion between “open source” and “free software”. He cited Richard Stallman’s essay, which argues that open source is a software development methodology, while free software is a social movement. Building on that, Bill wrote that “‘Open data’ is not a data development method; it’s a data sharing practice (…)”, which sounded quite right.

However, after reading Stallman’s essay again and looking at the official definition of open source software (linked also by Stallman) I view the distinction between these two quite differently.

“Open” as sharing practice

The official definition of open source software specifies the conditions for sharing and distributing source code. As such, it is no different from the definition of open data, except that it contains some conditions very specific to software (for example, the code must be shared under a technology-neutral license). So the first layer of “open” would be a non-restrictive sharing policy.

“Open” as social movement

Attaching a societal idea to “open” gives the second layer in this schema. “Free software” is one example; “making research results available to taxpayers” is another. In many cases a social movement was already in place before the sharing practice was established. In other cases (for example with open data, or open science in general) we still struggle to define the ultimate societal benefit in less than six words.

“Open” as technology

Here’s the place for viewing open source as a software development method. Here’s also the place to view open science as a method of organizing research. Most “open” ideas result in practical advantages, such as new business models, increased sustainability, or faster growth and development. I call them “technology” because they are applied to processes. The exact benefits depend on the domain – open source and open data are comparable only by analogy.

Discussing openness

I don’t think it’s the ultimate solution, but I find such a three-layer model of “open” quite useful in clarifying discussions on openness. First, it shows that mixing different aspects of openness produces such abstract ideas as software that has liberties (ideas which in the end might become very important, but which don’t help in establishing the basics in areas with a shorter history than open source). Second, it helps to divide tasks between interested parties based on their competencies. This is how it works in the case of Open Access (establishing policies, advocacy and software development are usually done by different organizations). And finally, you can easily place new or already established “open” ideas within that framework – for example, it helped me to understand the differences between open politics, open government and open democracy (the differences were not as intuitive for me as I had expected).

This model could possibly be improved by adding other layers or aspects. If you have any ideas, please let me know.

Quite recently my colleagues published a paper on a genome-wide model of translation (in PLoS Computational Biology). They used a number of different data sets, although in one case the data weren’t available in the text or as supplementary material – they needed to ask the other group for certain raw numbers from their experiment on ribosome profiling in yeast (published in Science). The data were shared promptly, so my colleagues could finish the project. They cited the original paper, and they expressed their gratitude by mentioning the data sharing in the acknowledgements.

Because sharing data resulted in a citation, I wonder how long it will take for Open Data advocates to start using this “open data citation advantage” as an argument for sharing data.

Don’t get me wrong – I’m all for Open Foo, but I do get frustrated when “citation advantage” becomes a major or even the only reason for going open. It obscures the debate on Open Foo and limits it to the aspects that only (some) scientists care about.

Must read: advantage, schmantage post by Bill Hooker on OA citation advantage.

Intellectual Ventures is a private company whose business model relies on building a large patent portfolio and licensing the patents to companies with infringing products. In other words, their business model is patent trolling. Given my attitude towards openness, it’s clear that I don’t like their approach at all, although I must admit that some of the ideas they have developed are freaking cool. You can imagine my satisfaction when it turned out that their first venture fund, IDF I, isn’t doing that well (see embedded document below, page 7).

[Embedded document: fundperformanceactive]

However, a recent tweet from Glyn Moody points to an article on TechDirt, which states that the numbers (an internal rate of return of -78%) might be meaningless and that the true revenue from patent trolling is still unknown.

This news inspired a heated discussion at our small research institute (the one we started last year) about our attitude towards intellectual property protection and management. It seems clear that some form of IP protection is going to stay, at least for privately funded research or applied science. On the other hand, the price we as a society pay for the current patent management system is constantly going up. Michael Heller in “Gridlock economy” claims that the so-called “quick” resolution of the Golden Rice intellectual property issues took 6 years, and I don’t think the situation has got better since then. For that reason, I’m a big fan of Michael Nielsen’s idea of automated contracts and of extending it to the patent system.

In that discussion I argued that in the long run we should stay away from holding IP rights, because of the huge investment (time and money) needed to obtain them (just after that discussion I saw an interview with Craig Venter in which he says “nobody has made any serious money off patents on human genes except patent attorneys” – worth a read for other reasons as well). Instead I argued for building a platform for streamlining negotiations between patent holders and businesses in our area of competence (which is currently green and sustainable technology). Even if we don’t grow beyond the local market (Poland), my feeling is that such a platform is going to be more profitable than collecting patents.

I didn’t compare our capacity for filing patents with the patent portfolio of Intellectual Ventures (that would be silly). Also, despite my aversion to the IV business model, I’m not that sure their returns will never become positive. Rather, I argued that patent trolling might actually be profitable, but only if you can sell licenses for the whole process, or most of it. In other words, a large patent portfolio might be an equivalent of automated contracts. Because the price of individual licenses is going to drop (have a look at the amounts awarded for solutions over at Innocentive – there’s no way any Western university would price its services so low), small non-profit research institutions (including ours) aren’t going to earn enough from their patents to make their research sustainable.


Which complex system?

Complexity theory, that is the study of complex systems, can be traced back to the 18th century and the classical political economy of the Scottish Enlightenment, although the real pioneers of the field are 20th-century philosophers, economists, mathematicians and social scientists. It’s a rather young field, but it already covers quite a large number of topics (such as complex adaptive systems, chaos theory, non-linearity, emergence and self-organization) and influences other fields of science, like biology, sociology and economics. In this post, each time I mention a complex system I mean a “complex adaptive system” (CAS), which is adaptive (which is not the case for a non-linear system), non-deterministic (which is not the case for a chaotic system) and non-predictable (which is not the case for a simple or linear system). John Holland’s definition of CAS is:

A Complex Adaptive System (CAS) is a dynamic network of many agents (which may represent cells, species, individuals, firms, nations) acting in parallel, constantly acting and reacting to what the other agents are doing. The control of a CAS tends to be highly dispersed and decentralized. If there is to be any coherent behavior in the system, it has to arise from competition and cooperation among the agents themselves. The overall behavior of the system is the result of a huge number of decisions made every moment by many individual agents.

I think we can safely say that science, as a system of organized research within and outside certain institutions, exhibits a large number of the properties attributed to CAS. Therefore let’s assume that science is a complex system.
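To make Holland’s definition a bit more tangible, here is a minimal sketch of a toy system with the listed ingredients: many agents, no central control, and each agent adapting to what its neighbours do. Everything in it is an assumption made for illustration (the ring layout, the payoff rule, the function name `simulate_cas` and its parameters); it is not a model of science and not Holland’s own formalism.

```python
import random


def simulate_cas(n_agents=50, steps=200, seed=0):
    """Toy adaptive system: agents on a ring imitate their best-off neighbour."""
    rng = random.Random(seed)
    # Each agent holds a strategy in [0, 1]; nobody coordinates them centrally.
    strategies = [rng.random() for _ in range(n_agents)]
    for _ in range(steps):
        # Hypothetical payoff rule: an agent does better the closer it is
        # to what the whole population is doing on average.
        mean = sum(strategies) / n_agents
        payoffs = [1.0 - abs(s - mean) for s in strategies]
        updated = []
        for i in range(n_agents):
            left, right = (i - 1) % n_agents, (i + 1) % n_agents
            # Look only at yourself and your two neighbours (local information).
            best = max((i, left, right), key=lambda j: payoffs[j])
            # Adapt: copy the locally best strategy, plus a little exploration.
            candidate = strategies[best] + rng.gauss(0.0, 0.01)
            updated.append(min(1.0, max(0.0, candidate)))
        strategies = updated
    return strategies


if __name__ == "__main__":
    final = simulate_cas()
    print("spread of strategies after the run:", max(final) - min(final))
```

Whatever pattern shows up at the end is not written anywhere in the code as a global rule; it comes only from the local imitation step, which is the point of the definition.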

Laws vs models

It’s important to remember that the behavior of a complex system may depend on a unique set of fundamental laws, but these are different from the models we use for practical purposes to describe this behavior. In other words, models of complex systems do not have to be reducible to unique laws. Let me pull out another quote, this time from this recent post by Wavefunction (emphasis mine):

A molecular mechanics model of a molecule assumes the molecule to be a classical set of balls and springs, with the electrons neglected. By any definition this is a ludicrously simple model that completely ignores quantum effects (or at least takes them into consideration implicitly by getting parameters from experiment). Yet, with the right parametrization, it works well-enough to be useful. There could conceivably be many other models which could give the same results. Yet nobody would make the argument that the behavior of molecules modeled in molecular mechanics is not reducible to quantum mechanics.

So, despite some people claiming that they know exactly how science operates and that we are all wrong with our analogies, we are free to make as many models of science as we wish, and there’s nothing wrong with that. This is not only because laws and models are different. In many cases, emergent properties of the system cannot be derived from a set of underlying laws – we use (often naive) models to capture these phenomena.

Models of science

How many models of science can we build? Or how many models are enough?

We could compare science to a multi-agent system, where researchers would compete for goods produced by science funders.

We could compare science to a culture, where research areas would rise and fall as a result of competition between memes. Researchers and science funders would be agents of transmission.

We could compare science to a simple system, with linear laws (such as “more money, more papers”) which becomes unpredictable due to inherent elements of randomness (scientific discoveries).

We could compare science to a social system, in which behaviour of researchers could be modelled by game theory.

We could compare science to a campfire, where people gather and tell stories.

We could make analogies to art, economics, sociology, or almost anything else. We could derive “laws” or “rules” from these models, which can often approximate the behavior of the system quite accurately (within certain boundaries).
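As an illustration of how small such a model can be, the “more money, more papers” analogy from the list above could be sketched as a linear rule plus a random breakthrough term. All names and numbers here (`papers_per_unit`, `breakthrough_prob`, `breakthrough_bonus`) are made up for the example and are not estimates of anything.

```python
import random


def papers_per_year(funding, papers_per_unit=2.0, breakthrough_prob=0.05,
                    breakthrough_bonus=20.0, rng=None):
    """Linear 'law' (output proportional to funding) plus a rare random jump."""
    rng = rng or random.Random()
    expected = papers_per_unit * funding  # the "more money, more papers" part
    # A discovery either happens this year or it does not; this is the
    # element of randomness that makes individual years hard to predict.
    bonus = breakthrough_bonus if rng.random() < breakthrough_prob else 0.0
    return expected + bonus


def simulate(funding_by_year, seed=0):
    """Run the toy model over a sequence of yearly funding levels."""
    rng = random.Random(seed)
    return [papers_per_year(f, rng=rng) for f in funding_by_year]


if __name__ == "__main__":
    print(simulate([10, 10, 12, 12, 15]))
```

Within its boundaries (years without a breakthrough) the model is perfectly predictable, which is roughly what I mean by a rule that approximates the system only under certain conditions.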

Model agnosticism

However, asking which model is the best one is like asking which approximation of molecules is the best one. The answer is that it depends on the experiment. For protein structures, a large spectrum of different approximations is used, depending on the task (rough structure comparison, structure modelling, molecular dynamics, docking of small compounds). For other complex systems the situation is quite similar – the practical purpose determines the choice of the model. This is often forgotten when you move to other fields.

There are also two other approaches – multi-model or multilevel modelling (represented roughly by multiscale modelling) and model-free modelling (represented roughly by neural networks) – but if these are chosen, it is for practical purposes, not because they represent “reality” better.

Why “science as a complex system” (or “why such a long introduction?”)

I’ve been thinking about the future of science and strategy for science for quite some time. It can be quite difficult already at a personal level (career strategy) and really hard to get right at a larger level (for example, an open science strategy for Poland). What I’ve learned from Michael Nielsen is that if you want to make predictions about the future, you need to understand the present as well as possible. I don’t know any better way of understanding something than constructing model after model (and testing them, if that’s possible).

However, if you look at the predictions made by some people around, they usually focus on the one or two ideas their authors like the most. Also, people don’t test their predictions against different models, not to mention trying to combine models or learn something from models incompatible with their own ideas.

But treating science as a complex system doesn’t mean only a slight update to our methodology, that is testing different approaches. It also provides us with a variety of tools to build and test our models (network analysis, multi-agent modelling, pattern-oriented modelling, cellular automata, game theory, and the list goes on and on). How to apply these tools to understand how science develops will be the topic of upcoming posts.

The open data movement seemingly doesn’t differ much from other “open” movements – its goal is to have free access to data (you can substitute articles/source code/creative content for data) without certain restrictions and mechanisms of control. Quite recently a group of the usual suspects (Cameron Neylon, John Wilbanks, Peter Murray-Rust and Rufus Pollock) put together a set of principles for open data in science called the Panton Principles.

It’s easy to think about open data in a similar way to open access or open source. However, I’ve come to think that open data, especially in the life sciences, is different. The background story is that we had to remove the “open data” component from a big grant proposal because it was incompatible with the policy (aka long-term strategy) of the funder. It didn’t violate any of the requirements, but in the get-a-grant game you don’t want to lower your chances on purpose. Hence no “open data” in that project. Konrad pointed out in this FriendFeed thread that Poland is no different from other countries with respect to the compatibility of open data principles with the long-term strategy of science development. And indeed, the Wikipedia entry on open scientific data states that open data can be challenged by individual institutions as well as by grant agencies:

As the term Open Data is relatively new it is difficult to collect arguments against it. Unlike Open Access where groups of publishers have stated their concerns, Open Data is normally challenged by individual institutions. Their arguments may include:

  • this is a non-profit organisation and the revenue is necessary to support other activities (e.g. learned society publishing supports the society)
  • the government gives specific legitimacy for certain organisations to recover costs (NIST in US, Ordnance Survey in UK)
  • government funding may not be used to duplicate or challenge the activities of the private sector (e.g. PubChem)

And this made me think that, from the point of view of a funder that wants some return on investment in science, open data and open source are pretty much incompatible. When open source code is abundant, data becomes an asset and is more protected. When open data is abundant, analysis methods become more valuable, and as such more protected (meaning less likely to be released under an open source license).

Current business models don’t fit well in a situation of abundance-of-everything (that is, data and software), so I don’t expect that science funders (the least innovative group in the world) can get out of the “scarcity/competitiveness” frame of thinking anytime soon. I have a feeling that our issue with open data being incompatible with the long-term strategy of science development will return some day, when we try to put open source into yet another grant.


Recently I stumbled upon this provocative post by Robert Paterson entitled Are Books Bad For Us?. Of course he doesn’t advocate burning all books, but rather wonders whether books lower our ability to observe and think for ourselves. What caught my attention was the paragraph below:

How did pottery get invented? Surely no one said “Let’s have a project to invent Pottery!” How can you invent something that had never existed? No it must have happened like this – The People stopped for the night after a rainfall. The next morning, as they prepared to leave, the fire keeper noticed that beneath the coals that she was harvesting, the ground had baked to a crust. Maybe she could carry the fire in this thing – this bowl. That night as they shared the food around the fire, she told the people what had happened and showed them the “bowl” that she had lifted out of the earth the day before. And the conversation began  “how had that been? Did it hold the fire well? What else could it hold? What if we put it back in the fire? Would it hold water?” And on and on. Experiments were made. Some earth worked better than others. At the seasonal meeting with the Cousin Peoples, the People shared their story with the others and gave up a “bowl” as a gift their elder. At the next season meeting, the two tribes spent days sharing the stories of the experiments that they had been making…

Having open conversations and sharing stories of experiments are, at least that is my feeling, the sentiments of open science. Knowledge acquisition became too formal and took away the joy of discovery. However, some readers commenting under the post above pointed out that the increased complexity of knowledge requires a formal acquisition process. The way I understand it, the amount and complexity of data requires certain protocols and formats, such as MIAME for microarray experiments.

On the other hand, shared stories of experiments are cultural events. Have a look at the notebooks of people working in Steve Koch’s lab. I like to browse them, even if I hardly have an idea what the project is all about. In some ways they resemble articles from MAKE magazine, but you can easily interact with the authors (some of them have blogs or are active on FriendFeed). Blog posts linked from the Polymath project page are further examples – in addition to good science, they often make a good story (I like this sentence from Nature’s article on the first Polymath project: “Who would have guessed that the working record of a mathematical project would read like a thriller?”). And sharing stories is as important as making the data reusable. Without the “campfire” aspect, open science is not that exciting anymore.

When I skim through the list of cognitive biases, I wonder why papers like Over-optimism in bioinformatics: an illustration (hat tip: Neil) do not appear more often. Here’s the abstract:

MOTIVATION: In statistical bioinformatics research, different optimization mechanisms potentially lead to “over-optimism” in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking. RESULTS: We present an empirical study on over-optimism using highdimensional classification as example. Specifically, we consider a “promising” new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, we quantitatively demonstrate that it can artificially seem superior to existing approaches if we “fish for significance”. The investigated sources of over-optimism include the optimization of data sets, of settings, of competing methods and, most importantly, of the method’s characteristics. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should always be demonstrated on independent validation data. AVAILABILITY: The R codes and relevant data can be downloaded from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/overoptimism/, such that the study is completely reproducible.

Emphasis in the last sentence is mine. It’s the first time I have seen a statement that research is reproducible, even though publishing in a peer-reviewed journal means reviewers made sure it is.

Anyway, I’m looking for more examples of biases in research process. If you have some, please let me know.


Two months ago I spoke at TEDxWulkan – a TEDx-like event inspired by the eruption of Eyjafjallajökull, with a theme taken from a well-known (at least to fans of Sean Connery) movie line: “It’s impossible. But doable.” The video of my short talk is now posted on YouTube.

My point was that inspiration for modern technology could come not only from particular solutions found in living systems but also (or mainly) from the design of nature’s ways of coping with change. I used slime molds (fungus-like organisms that do not belong to the Fungi kingdom) as an example of a complex adaptive system that exhibits quite sophisticated behaviour but can hardly be called “intelligent” by design. I showed research done at the International Centre for Unconventional Computing (must love the name) by Jeff Jones and Andrew Adamatzky. They used Physarum polycephalum to reproduce motorways between major cities in England (see the original paper, Road planning with slime mould: If Physarum built motorways it would route M6/M74 through Newcastle, and the supplementary material for some cool videos). Similar work was published in Science a few months ago, in which a group of researchers modelled the Tokyo railway network. I argued that we should design our technology in a similar way, so that it can adapt to changing circumstances on its own.

Of course it’s easy for me to theorize like that – I’m a biologist, a rather computational one, and I don’t put my ideas into laboratory practice. There are some niches where adaptive systems have been built for quite some time (for example large wireless networks), but these are rather exceptions. Adaptive software is still rare but seems to cover a wide area of uses (see this fascinating paper on adaptive software by Peter Norvig and David Cohn). Obviously it’s doable, but really hard. At least the field is gaining recognition. Consider that only two years ago the same group that published this paper in Science on slime mold reproducing the Tokyo railway network was awarded the Ig Nobel Prize for research on solving maze puzzles with the same organism (published in Nature).

Project sketches – ideas that never made it into a grant application (for many reasons, but mostly because I’m not an expert on the topic, so they wouldn’t be funded anyway).

The idea for this project was quite simple: to build a spatial and dynamic map of the forest metagenome. I wanted to collect many samples from several locations (on the ground, low-hanging branches, tops of the trees) in different conditions to assess how DNA spreads in such an environment. This was supposed to be entirely a fishing expedition – I still don’t have any fundamental question such a project would answer. I hoped that adding spatial and temporal data to metagenomic analysis would reveal small fluctuations that shape the overall state of the system. In other words, I hoped to find a butterfly effect in metagenomic space.
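For readers who haven’t met the term, the butterfly effect (sensitive dependence on initial conditions) can be shown in a few lines with the logistic map, a standard textbook example of deterministic chaos. This is only an illustration of the concept; it says nothing about metagenomes, and the starting value, the size of the perturbation and the number of steps are arbitrary choices.

```python
def logistic_trajectory(x0, r=4.0, steps=100):
    """Iterate the logistic map x -> r * x * (1 - x) starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs that differ only by a tiny fluctuation in the starting point.
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Gap between the two runs at every step: tiny at first, then it grows
# until the two trajectories have nothing to do with each other.
gap = [abs(x - y) for x, y in zip(a, b)]

if __name__ == "__main__":
    for step in (0, 10, 20, 30, 40, 50):
        print(step, gap[step])
```

The rule itself is fully deterministic (the same start always gives the same run), yet a difference far below any measurement precision ends up dominating the outcome.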

[Image: micro_windbelt]

The inspiration for this project wasn’t a paper, nor a FriendFeed thread. It was a cool device called the Windbelt. The device pictured above (there are devices of different sizes) is a small wind-powered generator that exploits the motion of a string to generate energy. A Windbelt plus a battery (as energy storage) was supposed to power sensors (wind, air quality, temperature, humidity), a communication module (BT or Wifi) and a device that would collect DNA samples, all in a semi-independent package (at some point you would need to take the samples out and move them to a sequencing lab).

[Image: device_schema]

The airborne metagenome isn’t something particularly new. According to my searches, the first such experiment using modern NGS technology was published already in 2008 in PLoS One: a study of the urban indoor environment that gave an insight into both the microbial and the functional diversity of the air metagenome inside densely populated buildings. Whether the dynamics of the air environment is as important as I believed remains to be checked, although I’m not that sure the methodology I imagined is really going to work.


This is a long-overdue report on the session on science freelancing and science coworking at Science Online 2010 that I co-moderated with Brian Russell from Carrboro Creative Coworking.

In the first part of the session I briefly described my journey from being a freelance scientist (posts documenting about a year of being a freelance scientist: Freelancing science – today and tomorrow and End of freelancing as a scientist (for now)) to being the head of a virtual research institute (or rather a virtual contract research organisation). Then Brian introduced everybody to the concept of coworking and how it works for IT projects. He pointed out that one of the advantages of a coworking space is that a real community forms around it, without much intervention or moderation.

In the second part, inspired by Bora Zivkovic’s idea of a science hostel (see his blog posts: Co-Researching spaces for Freelance Scientists? and What’s an office for?), we went on to discuss how much organizational freedom can be applied to research institutions. It turned out that many people agreed that under certain conditions (for example no pathogenic wet lab work) one could do research in a coworking space in the majority of fields. Of course, some fields are more suitable than others (for example field work, or theoretical/computer work), but everybody seemed to agree that there are no real obstacles to having a science hostel for any kind of research. Bora also pointed out that such hostels should be organised around equipment instead of research area, so that people from different fields can talk together and exchange ideas. Finally, Trevor Owens (community lead for the Zotero project) had a cool idea of making a CraigsList of unused (and available to use/rent) lab space.

During the discussion I was putting down the major ideas in this mindmap, available at MindMeister.

I think the major take-away message from this session was that people are surprisingly open to new ways of organizing the research process. I don’t think any such initiative would have a problem convincing a few researchers to at least try. Research parks, science hostels or virtual contract research organizations – all of these were seen as obvious solutions to certain (but not all) issues we face in our typical academic environment.

Of course the main obstacle to such ideas is funding, which is relatively simple in the first example (research park) and not that simple in the others, but since funding is an obstacle to many other good ideas, it wasn’t discussed at all.
