## Friday, December 21, 2007

### Christmas presents...

Our Christmas tree has not been decorated yet, but the presents are there: the BMC Bioinformatics paper on userscripts in life sciences, Bioclipse 1.2.0, a long list of blogs to rate, and a very nice overview from Wendy Warr on workflow environments, discussing and comparing different offerings like Pipeline Pilot, Taverna, and KNIME.

Userscripts
The paper on userscripts describes how Greasemonkey scripts can be used to combine different information sources (DOI:10.1186/1471-2105-8-487). A trailer:
Background
The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available with a wide variety of types of information. There is a huge need to aggregate and organise this information. However, the sheer number of resources makes it unrealistic to link them all in a centralised manner. Instead, search engines to find information in those resources flourish, and formal languages like Resource Description Framework and Web Ontology Language are increasingly used to allow linking of resources. A recent development is the use of userscripts to change the appearance of web pages, by on-the-fly modification of the web content. This opens possibilities to aggregate information and computational results from different web resources into the web page of one of those resources.

Peter et al. have been using this technology for CrystalEye too, but unfortunately the paper was already being finalized when their userscript was announced.

Bioclipse 1.2.0
The other present is the Bioclipse 1.2.0 release, for which the QSAR feature is a great new addition (see my blog the other day for an overview of blog items detailing my participation in that feature). Ola et al. have done a great job with the plot functionality, which makes it very easy to create scatter plots of calculated descriptors. This release is likely going to be the last one in the Bioclipse 1 series, except for bug fix releases, so this release also means I can start contributing to the Bioclipse 2 series. Recent items in the Bioclipse blog show a bright future, with project-based resource handling and better scripting (R, Ruby, JavaScript, BeanShell?).

BTW, we never have presents under the tree; we have Sinterklaas.

## Thursday, December 20, 2007

### The molecular QSAR descriptors in the CDK

Pending the release of Bioclipse 1.2.0, Ola asked me to do some additional feature implementation for the QSAR feature, such as having the filenames as labels in the descriptor matrix. See also these earlier items: (How much more open notebook science can you get?)

But I ran into some trouble when both JOElib and CDK descriptors were selected; or Ola did, really. Now, there is nothing much I plan to do on the JOElib code, but at least I could investigate the CDK code.

The QSAR descriptor framework has been published in the paper Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics (DOI:10.2174/138161206777585274). However, while most molecular descriptors had JUnit tests for at least the calculate() method, full and proper module testing had not been set up. This involves rough coverage testing and test methods for all methods in the classes.

So, I set up a new CDK module called qsarmolecular, and added the coverage test class QsarmolecularCoverageTest. This class is really short and basically only requires a module to be set up, as reflected by the line:
private final static String CLASS_LIST = "qsarmolecular.javafiles";
The actual functionality is inherited from the CoverageTest. Unlike tools like Emma, for which reports are generated by Nightly, this coverage testing requires a certain naming scheme (explained in Development Tools. 1. Unit testing in CDK News 2.2).
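The idea behind naming-scheme-based coverage checking can be sketched as follows. This is a toy illustration, not the actual CoverageTest code: the class names below are made up, and the real implementation reads the module's *.javafiles list instead of a hard-coded list.

```java
import java.util.List;
import java.util.Set;

// Toy sketch: every class in a module must have a matching test class
// named <ClassName>Test; report any class for which that test is missing.
public class CoverageSketch {
    public static void main(String[] args) {
        // Hypothetical module contents (in CDK these come from qsarmolecular.javafiles).
        List<String> moduleClasses = List.of("XLogPDescriptor", "TPSADescriptor");
        // Hypothetical set of test classes actually present in the test source tree.
        Set<String> testClasses = Set.of("XLogPDescriptorTest");

        for (String cls : moduleClasses) {
            String expectedTest = cls + "Test"; // the naming scheme
            if (testClasses.contains(expectedTest)) {
                System.out.println("covered: " + cls);
            } else {
                System.out.println("MISSING TEST: " + expectedTest);
            }
        }
    }
}
```

The point of the naming convention is exactly this mechanical check: no annotations or configuration are needed, only a predictable mapping from class name to test class name.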

Now, the tests for many of the methods in the IMolecularDescriptor and IDescriptor interfaces are actually identical for all descriptors. Therefore, I wrote a MolecularDescriptorTest and made all JUnit test classes for the molecular descriptors extend this new class. This means that by writing only 10 new tests, with 29 assert statements, for the 45 molecular descriptor classes, 450 new unit tests are run without special effort, taking the total number of unit tests run each night by Nightly for trunk/ past 4500.
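The inheritance trick can be sketched like this. It is a simplified, self-contained illustration, not the actual MolecularDescriptorTest: the Descriptor interface and the checks are invented here, while the real base class tests the IMolecularDescriptor API.

```java
// Invented minimal descriptor interface, standing in for IDescriptor.
interface Descriptor {
    String[] getParameterNames();
    Object calculate(double input);
}

// Base test class: every check written here runs for every subclass,
// so N checks times M descriptor test classes gives N*M tests for free.
abstract class DescriptorTestBase {
    protected Descriptor descriptor; // set by each concrete subclass

    void testParameterNamesNotNull() {
        if (descriptor.getParameterNames() == null)
            throw new AssertionError("parameter names must not be null");
    }

    void testCalculateReturnsValue() {
        if (descriptor.calculate(1.0) == null)
            throw new AssertionError("calculate() must return a value");
    }
}

// One concrete test class per descriptor: only the setup differs.
class ToyLogPDescriptorTest extends DescriptorTestBase {
    ToyLogPDescriptorTest() {
        descriptor = new Descriptor() {
            public String[] getParameterNames() { return new String[0]; }
            public Object calculate(double input) { return input * 0.5; }
        };
    }
}

public class SharedTestSketch {
    public static void main(String[] args) {
        DescriptorTestBase test = new ToyLogPDescriptorTest();
        test.testParameterNamesNotNull();
        test.testCalculateReturnsValue();
        System.out.println("all shared tests passed");
    }
}
```

In real JUnit the inherited test methods are picked up automatically by the runner; the explicit calls in main above are only there to keep the sketch self-contained.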

Now, this turned out to be necessary. I count 52 new failing tests, which should hit Nightly in the next 24 hours.

## Wednesday, December 19, 2007

### Test results for the CDK 1.0.x branch

The Chemistry Development Kit has never really been free of bugs, which is reflected in the number of failing JUnit tests. For trunk/ this is today 106 failing tests (live stats). For the stable cdk-1.0.x/ branch, however, the number of failing tests is not much lower: 64 failing tests today (live stats).

Overall, only a low percentage of the tests fails (<2% for cdk-1.0.x/ and <3% for trunk/), and, more importantly, it is particular algorithms that are typically broken. For example, in the structgen module 8 tests fail, for both CDK versions. In the cdk-1.0.x/ branch it is the valency checker code that causes quite a few fails, which I discussed in Atom typing in the CDK and which is the reason for the atom type perception refactoring in progress in trunk/ (see Evidence of Aromaticity). Not all code in trunk/ has been updated yet, and this causes quite a few failing tests for trunk/ in the reaction, qsarAtomic and qsarBond modules.

Back to the cdk-1.0.x/ branch. Previous CDK releases tended to have around 40 failing tests, so I was worried about the number of tests failing now. Maybe backported patches caused additional failures? To study that, I had my machine run the JUnit tests for all revisions of the cdk-1.0.x/ branch since the branch was made in commit 8343. The result looks like this:

Indeed, it is a number of backports that cause the clear increase in bugs between commits 9044 and 9058. Nothing in particular that I can see, and worse, the intermediate revisions do not compile and do not have test results:
104  9044  3731   84   73  979.709  0
105  9045     0    0    0    0.000  0
106  9046     0    0    0    0.000  0
107  9047     0    0    0    0.000  0
108  9048     0    0    0    0.000  0
109  9049     0    0    0    0.000  0
110  9050     0    0    0    0.000  0
111  9051     0    0    0    0.000  0
112  9052     0    0    0    0.000  0
113  9053     0    0    0    0.000  0
114  9054     0    0    0    0.000  0
115  9055     0    0    0    0.000  0
116  9056     0    0    0    0.000  0
117  9057     0    0    0    0.000  0
118  9058  3740  104  146  989.566  0

I should have taken more care when merging in these patches, even though they are supposed to fix issues:
Merged r8697: Add a method to the query atom container creator which creates a queryatomcontainer. This replaces each pseudoatom with an anyatom.
Merged r8699 and r8700: Added test file by Volker (see cdk-user) for the shortest path problem; JUnit test provided by Volker Haehnke (haehnke - bioinformatik uni-frankfurt de), somewhat rewritten.
Merged r8701: Renamed a variable to comply with http://en.wikipedia.org/wiki/Dijkstra's_algorithm
Merged r8751: Bug fixes for bugs #1783367 'SmilesParser incorrectly assigns double bonds' and #1783381 'SmilesParser uses Molecule instead of IMolecule'. Test case for bug #1783367.
Merged r8754 and r8773: Fix and test case for bugs #1783547 and #1783546 'Lost aromaticity in SmilesParser with Biphenyl and Benzene'
Merged r8774: Add an MDL RXN reader which uses the MDLV2000Reader instead of the MDLReader
Merged r8775, r8776, r8777: bug fixes for #150354, #1783774, #1778479 in the SmilesParser, SmilesGenerator and MDLWriter/PseudoAtom.
Merged r8791: Code for v,mass atom two digits mass atom and exception handling
Merged r8800: Fixed reading of MDL molfiles with exactly 12 columns (== valid) in the bond block
Merged r8802: Made a little more memory efficient by removing unnecessary cloning operations
Merged r8803: Fixed it so that we make a deep copy of the input molecule
Merged r8809: Added code to work on a local copy of the input molecule
Merged r8811: Updated Javadocs
Merged r8824, 8821, 8820, 8819, 8817, 8816: Added code to properly work on a local copy

I'm quite sure it must be the deep-cloning fix ported in commits 8800-8824. I already fixed a number of bugs in the IP calculation code, which still accounts for a good share of the failing tests in the cdk-1.0.x/ branch (and affects trunk/ too), as can be seen by the drop in bugs just after the big increase:
r9079 | egonw | 2007-10-15 13:24:10 +0200 (Mon, 15 Oct 2007) | 1 line
Renamed container to localClone to clear up code. Fixed a bug where the uncloned atoms was searched in the cloned atomcontainer. More bugs like this are in the code. Miguel is contacted about this problem.
------------------------------------------------------------------------
r9082 | egonw | 2007-10-15 13:48:15 +0200 (Mon, 15 Oct 2007) | 1 line
Renamed container to localClone to clear up code. Fixed a bug where the uncloned atoms was searched in the cloned atomcontainer.

The big drop in number of fails is caused by the removal of the SMARTS code from the branch, which has been present since the start of the branch (see this page).

From this analysis I conclude that CDK 1.0.2 can soon be released, with the note that the ionization potential calculation is not safe to use.

## Monday, December 17, 2007

### Open Data getting more recognition

The OD part of ODOSOS is getting more and more attention, and it seems that Peter's Open Data battle is paying off (see his original Open Data article in Wikipedia): an Open Data-specific license has reached the beta stage (see this announcement).

The idea behind this license seems to come down to:
Facts are free. The Rightsholder takes the position that factual information is not covered by Copyright. This Document however covers the Work in jurisdictions that may protect the factual information in the Work by Copyright, and to cover any information protected by Copyright that is contained in the Work.

I am looking forward to seeing how this license will be picked up by the community. PubChem may be a good candidate to use this license, to formalize their dump into the public domain. Not just yet, though, because things might still change. It is said that a wiki will be set up to ask for feedback. Paul has written a nice writeup on the history of this license.

I particularly like the quote by Tim O'Reilly from this blog:
One day soon, tomorrow's Richard Stallman will wake up and realize that all the software distributed in the world is free and open source, but that he still has no control to improve or change the computer tools that he relies on every day. They are services backed by collective databases too large (and controlled by their service providers) to be easily modified. Even data portability initiatives such as those starting today merely scratch the surface, because taking your own data out of the pool may let you move it somewhere else, but much of its value depends on its original context, now lost.

In the past I have argued for the CC-BY license, and so does Peter in this recent comment on a post by Deepak on educating people about data ownership. Interestingly, the new license proposes to remove ownership as a solution to freeing the data :)

## Thursday, December 13, 2007

### I don't blame Individuals in Commercial Chemoinformatics

The comment I left on the ChemSpider blog was probably a bit blunt. ChemSpider announced having licensed software from OpenEye. I have seen such announcements more often, but am intrigued about the nature of such announcements. Is it bad that ChemSpider is using OpenEye software? Certainly not. But it is surprising that they "announced today they had entered into an agreement that will allow the incorporation of a number of OpenEye’s products into ChemZoo’s online chemistry database and property prediction service, ChemSpider" (emphasis mine).

Is it really special that you buy software and then use it? Maybe, it increasingly is, with a number of good software products freely available. Even many proprietary products are freely available, sometimes to a selected group only, though. Or, is there some license behind this that restricts you in what you may and may not do with it?

Anyway, I made the somewhat inconsiderate comment: "Amazing! (Forgive me that I [have] not read every bit…) But, amazing! A press release for the fact that one may use software ;)".

Anthony replied with these lines: "Yes, I think it is amazing that companies of this caliber are willing to provide their tools at no cost to systems like ChemSpider". He read my sarcasm correctly. I find it absurd that the future of chemoinformatics is left to the goodwill of benevolent companies. Chemoinformatics is way too important, and in way too crappy a state, to be kept as a proprietary toy of industry; that's something I argued before.

Let me try to explain where my sarcasm is coming from.

I do not blame Individuals in Commercial Chemoinformatics
There is nothing wrong with getting paid for what you do. I get paid for the software I develop too, though most of my contributions to the CDK, Jmol, and even some of my contributions to Bioclipse I have made as a hobby, in my spare time, unpaid. Nothing wrong with a good hobby, I would say.

But I do not blame people for not doing the same. Neither do I blame myself for making a reasonable living in the Netherlands, unlike all those poor bastards who struggle to make it to the next month, like many in the United States. But I do not like the situation. Neither do I blame people for being religious, though I really dislike several of the things the Church is trying to make people believe (such as that the HIV virus can get through condoms). I hate the situation.

I do not dislike the Commercial Model
People have to make a living. I do; anyone does. I do feel, however, there is a difference between making a living because you work, and getting money because you happen to be at the right side of the money flow. There is a difference between a baker getting up at 5am every morning to feed a village, and someone selling a thin slice of bread via eBay to a poor African soul who just received his/her OLPC laptop. Not that I think this really applies to the ChemSpider/OpenEye deal; it is just to make a statement about commercialism.

The Bill Gates foundation spending a lot of money on scientific research is what the Dutch would call een sigaar uit eigen doos. This translates to something like getting a present you paid for yourself; literally, 'a cigar from one's own box'. But that's another story.

I hate the situation
I hate the situation that research for new drugs is so expensive, and medicine likewise. I hate it that the pharmaceutical industry cannot sell these drugs cheaply to developing countries, because they will be sold expensively in western markets. But I do not blame the scientists working in the pharma industry.

I hate the situation that scientific results cannot be reproduced independently, because software is being used as a black box. But I do not blame the guy who wrote the code.

I hate the situation that I cannot contribute to the excellent products around, because their licenses disallow me to discuss my work with others. But I do not blame the guy who sold me the license.

I hate the situation that many very qualified scientists have to take post-doc after post-doc before they give up and go to industry. I hate the situation that the better a scientist you are, the less science you actually do, because all your time is spent on getting further funding. But I do not blame those who paid for those temporary post-doc positions.

I hate the situation that people have to use commercial models for their scientific contributions, just to make a living, even though they would have loved to contribute that to mankind. But I do not blame them for wanting to be able to fulfill their primary living requirements (and those of their families).

I hate the situation that I review papers for free for commercial publishers, just to help science progress. I do blame myself for not having stopped doing that yet.

But I do not blame ChemSpider for buying or using commercial products. I do not blame the people working at OpenEye for making a living. But I do find it absurd that we have to be amazed that scientific software is put to work.

I apologize for being blunt, but I cannot apologize for disliking the current situation chemoinformatics is in.

## Monday, December 10, 2007

### Tagging, thesauri or ontologies?

Controlled vocabularies, hierarchies, microformats, RDF. Nico Adams pointed me to this excellent video:

It's a really nifty piece of work, which goes into the differences between thesauri, controlled vocabularies and, as such, ontologies on the one hand, and social tagging systems on the other. Both have their virtues; it is fuzzy logic versus ODEs all over again. Whether one is better than the other depends only on the problem at hand. For example, can you imagine social tagging in atom typing prior to performing force field calculations? Or a 150-term ontology to annotate the scientific content of your literature archive?

More from where that came from...
The video appears to be made by the Digital Ethnography group, which has made several more movies. Certainly something I'm going to check out over the winter holidays (I guess I am quite a bit more religious about ODOSOS than about gods).

Nico wrote: As long as we appreciate that there may be more than one top node…. I am not entirely sure, but he may be referring to thesauri, which are a particular form of ontology in which basically the only relations are is-a or is-parent-of, resulting in a hierarchy of controlled terminology with one top node (such as the Gene Ontology). Ontologies can and should be much richer if we really want to take advantage of our information technologies, just like we do with any graph mining. Why mould reality into a tight hierarchy?

Chemical ontologies
Peter has not seen the movie yet, but replied with a recent comment he had on CML:
Ebs and Michael had reviewed CML and questioned why the key concepts were atoms, molecules, electron, substances, whereas they suggested it would have been better to start from reactions. I think that’s a very clear difference in orientation between endurants and perdurants. Although chemists publish reactions, most of the emphasis is on (new) substances and their properties. CML is designed to map directly onto the way chemists seem to think - at least in their public communication - e.g. through documents. Of course we can also do reactions in CML, but even there the emphasis is often on the components.

The suggestion by Ebs and Michael is indeed quite surprising: ontologies try to capture knowledge and express it in a small set of terms, each of which has an accurate and non-overlapping (orthogonal, if you wish) meaning. Now, the terms carbon, nitrogen, oxygen, and the other 104 elements are quite accurate and rather different from each other, at least from a chemical point of view. Sure, bonding is more difficult, and let's not start about aromaticity. But to question atoms, bonds or electrons as key concepts??

## Friday, December 07, 2007

### Open Source, Open Data at the European Bioinformatics Institute

I was pleased to hear that Christoph will move to the EBI early next year. Christoph has been working on Open Source and Open Data chemoinformatics since at least 1997. I first got in contact with Christoph when I wrote code for JChemPaint (which Christoph developed) to be able to read Chemical Markup Language (CML). This also got me into contact with Dan Gezelter, who is the original author of Jmol, to which I also added CML support. And, of course, with Henry and Peter, who first developed CML. This was before XML was an official recommendation, and I have worked with CML files which you would no longer recognize. It was in Dan's office that the CDK was founded, where Christoph, Dan and I designed data classes to replace the JChemPaint and Jmol data classes. Both JChemPaint and Jmol were rewritten afterwards, but for Jmol it was later decided that more tuned classes were needed to achieve the required performance for the live rendering of tens of thousands of atoms.

Well, Christoph has done much other Open Source and Open Data work, including the NMRShiftDB, Bioclipse and Seneca, a tool for computer-aided structure elucidation (CASE). The scientific impact of Christoph's work is considerable. When I realize that much of his past work was laying out foundations, and that these foundations have been found to be solid, I am happy to hear that he can now start to apply his work to life science problems, where current methods are failing.

Christoph, cheers!

## Tuesday, December 04, 2007

### My Open Laboratory 2007 submissions

As promised, here is my list of submissions for the Open Laboratory 2007:
BTW, even though the judges have started their way through the submissions, you can still submit entries.

## Monday, December 03, 2007

### Web2O, Open Chemistry, and Chemblaics

The December issue of Chemistry World features a nice item on the future of data in chemistry: Surfing Web2O; Peter gave an excerpt, and Peter commented on it.

The article discusses many of the things that have been happening in the field of chemical data. It touches on Jean-Claude's work on Open Notebook Science, and then moves to Peter's Open Data, mentions a number of other blogs and the Chemical blogspace. Via some video efforts, it ends up with Mitch's Chemmunity, which has the coolest Captcha I have seen so far:

It also cites Rich's blog item on 32 free chemical databases, Christoph's NMRShiftDB.org, Project Prospect, and CML, which recently saw its 7th research paper.

Of course, this is the arena of chemblaics, but unfortunately my blog is not cited (though my name is mentioned). So, what is wrong with my blog??