Tuesday, December 22, 2015

Nerd Food: Dogen: The Package Management Saga

We've just gone past Dogen's Sprint 75, so I guess it's time for one of those "reminiscing posts" - something along the lines of what we did for Sprint 50. This one is a bit more practical though; if you are only interested in the practical side, keep scrolling until you see "Conan".

So, package management. Like any other part-time C++ developer whose professional mainstay is C# and Java, I have keenly felt the need for a package manager when in C++-land. The problem is less visible when you are working with mature libraries and dealing with just Linux, due to the huge size of the package repositories and the great tooling built around them. However, things get messier when you start to go cross-platform, and messier still when you are coding on the bleeding edge of C++: either the package you need is not available in the distro's repos or even PPAs; or, when it is, it's rarely at the version you require.

Alas, for all our sins, that's exactly where we were when Dogen got started.

A Spoonful of Dogen History

Dogen sprang to life just a tad after C++-0x became C++-11, so we experienced first-hand the highs of a quasi-new-language followed by the lows of feeling the brunt of the bleeding edge pain. For starters, nothing we ever wanted was available out of the box on any of the platforms we were interested in. Even Debian testing was a bit behind - probably stalled due to a compiler transition or another, but I can't quite recall the details. In those days, Real Programmers were Real Programmers and mice were mice: we had to build and install the C++ compilers ourselves and, even then, C++-11 support was new, a bit flaky and limited. We then had to use those compilers to compile all of the dependencies in C++-11 mode.

The PFH Days

After doing this manually once or twice, it soon stopped being fun. And so we solved this problem by creating the PFH - the Private Filesystem Hierarchy - a gloriously over-ambitious name to describe a set of wrapper scripts that helped with the process of downloading tarballs, unpacking, building and finally installing them into well-defined locations. It worked well enough in the confines of its remit, but we were often outside those, having to apply out-of-tree patches, adding new dependencies and so on. We also didn't use Travis in those days - not even sure it existed, but if it did, the rigmarole of the bleeding edge experience would certainly put a stop to any ideas of using it. So we used a local install of CDash with a number of build agents on OSX, Windows (MinGW) and Linux (32-bit and 64-bit). Things worked beautifully when nothing changed and the setup was stable; but, every time a new version of a library - or god forbid, of a compiler - was released, one had that sense of dread: do I really need to upgrade?

Since one of the main objectives of Dogen was to learn about C++-11, one has to say that the pain was worth it. But all of the moving parts described above were far from ideal, and they were certainly not what you want to be wasting your precious - and very scarce - time on. Nor were they scalable.

The Good Days and the Bad Days

Things improved slightly for a year or two when distros started to ship C++-11 compliant compilers and recent Boost versions. It was all so good we were able to move over to Travis and ditch almost all of our private infrastructure. For a while things looked really good. However, due to Travis' Ubuntu LTS policy, we were stuck with a rapidly ageing Boost version. At first PPAs were a good solution for this, but soon these became stale too. We also needed the latest CMake, as there are a lot of developments on that front, but we certainly could not afford (time-wise) to revert back to the bad old days of the PFH. At the same time, it made no sense to freeze dependencies in time, providing a worse development experience. So the only route left was to break Travis and hope that some solution would appear. Some alternatives were tried, such as Drone.io, but none were successful.

There was nothing else for it; what was needed was a package manager to manage the development dependencies.

Nuget Hopes Dashed

Having used Nuget in anger for both C# and C++ projects, and given Microsoft's recent change of heart with regards to open source, I was secretly hoping that Nuget would get some traction in the wider C++ world. To recap, Nuget worked well enough in Mono; in addition, C++ support for Windows was added early on. It was somewhat limited and a bit quirky at the start, but it kept on getting better, to the point of usability. Trouble was, their focus was just Visual Studio.

Alas, nothing much ever came from my Nuget hopes. However, there have been a couple of recent announcements from Microsoft that make me think that they will eventually look into this space:

Surely the logical consequence is to be able to manage packages in a consistent way across platforms? We can but hope.

Biicode Comes to the Rescue?

Nuget did not pan out but what did happen was even more unlikely: some crazy-cool Spaniards decided to create a standalone package manager. Being from the same peninsula, I felt compelled to use their wares, and was joyful as they went from strength to strength - including the success of their open source campaign. And I loved the fact that it integrated really well with CMake, and that CLion provided Biicode integration very early on.

However, my biggest problem with Biicode was that it was just too complicated. I don't mean to say the creators of the product didn't have very good reasons for their technical choices - lord knows creating a product is hard enough, so I have nothing but praise for anyone who tries. However, for me personally, I never had the time to understand why Biicode needed its own version of CMake, nor did I want to modify my CMake files too much in order to fit properly with Biicode and so on. Basically, I needed a solution that worked well and required minimal changes at my end. Having been brought up with Maven and Nuget, I just could not understand why there wasn't a simple "packages.xml" file that specified the dependencies, plus some non-intrusive CMake support to expose those to the CMake files. As you can see from some of my posts, it just seemed to require "getting" Biicode in order to make use of it, which for me was not an option.

Another thing that annoyed me was the difficulty of knowing what the "real" version of a library was. I wrote, at the time:

One slightly confusing thing about the process of adding dependencies is that there may be more than one page for a given dependency and it is not clear which one is the "best" one. For RapidJson there are three options, presumably from three different Biicode users:

  • fenix: authored 2015-Apr-28, v1.0.1
  • hithwen: authored 2014-Jul-30
  • denis: authored 2014-Oct-09

The "fenix" option appeared to be the most up-to-date so I went with that one. However, this illustrates a deeper issue: how do you know you can trust a package? In the ideal setup, the project owners would add Biicode support and that would then be the one true version. However, like any other project, Biicode faces the initial adoption conundrum: people are not going to be willing to spend time adding support for Biicode if there aren't a lot of users of Biicode out there already, but without a large library of dependencies there is nothing to draw users in. In this light, one can understand that it makes sense for Biicode to allow anyone to add new packages as a way to bootstrap their user base; but sooner or later they will face the same issues as all distributions face.

A few features would be helpful in the meantime:

  • popularity/number of downloads
  • user ratings

These metrics would help in deciding which package to depend on.

For all these reasons, I never found the time to get Biicode set up, and these stories lingered in Dogen's backlog. And the build continued to be red.

Sadly, Biicode the company didn't make it either. I feel very sad for the guys behind it, because their hearts were in the right place.

Which brings us right up to date.

Enter Conan

When I was a kid, we were all big fans of Conan. No, not the barbarian, the Japanese anime Future Boy Conan. For me the name Conan will always bring back great memories of this show, which we watched in the original Japanese with Portuguese subtitles. So I was secretly pleased when I found conan.io, a new package management system for C++. The guy behind it seems to be one of the original Biicode developers, so a lot of lessons from Biicode were learned.

To cut a short story short, the great news is I managed to add Conan support to Dogen in roughly 3 hours, with very minimal knowledge about Conan. This to me was a litmus test of sorts, because I have very little interest in package management - creating my own product has proven to be challenging enough, so the last thing I need is to divert my energy further. The other interesting thing is that roughly half of that time was taken up by trying to get Travis to behave, so it's not quite fair to pin it all on Conan.

Setting Up Dogen for Conan

So, what changes did I make to get it all working? It was a very simple three-step process. First, I installed Conan using a Debian package from their site.

I then created a conanfile.txt in my top-level directory:

[requires]
Boost/1.60.0@lasote/stable

[generators]
cmake

Finally I modified my top-level CMakeLists.txt:

# conan support
if(EXISTS "${CMAKE_BINARY_DIR}/conanbuildinfo.cmake")
    message(STATUS "Setting up Conan support.")
    include("${CMAKE_BINARY_DIR}/conanbuildinfo.cmake")
    CONAN_BASIC_SETUP()
else()
    message(STATUS "Conan build file not found, skipping include")
endif()

This means that it is entirely possible to build Dogen without Conan, but if it is present, it will be used. With these changes in place, all that was left to do was to build:

$ cd dogen/build/output
$ mkdir gcc-5-conan
$ cd gcc-5-conan
$ conan install ../../..
$ cmake ../../..
$ make -j5 run_all_specs

Et voilà, I had a brand spanking new build of Dogen using Conan. Well, actually, not quite. I've omitted a couple of problems that are a bit of a distraction from the Conan success story. Let's look at them now.

Problems and Their Solutions

The first problem was that the Conan package for Boost 1.59 does not appear to provide an overridden FindBoost, which meant I was not able to link. I moved to Boost 1.60 - which I wanted to do anyway - and it worked out of the box.

The second problem was that Conan seems to get confused with Ninja, my build system of choice. For whatever reason, when I use the Ninja generator, it fails like so:

$ cmake ../../../ -G Ninja
$ ninja -j5
ninja: error: '~/.conan/data/Boost/1.60.0/lasote/stable/package/ebdc9c0c0164b54c29125127c75297f6607946c5/lib/libboost_system.so', needed by 'stage/bin/dogen_utility_spec', missing and no known rule to make it

This is very strange because libboost_system.so is clearly available in the Conan download folder. Switching to Make solved the problem. I am going to open a ticket on the Conan GitHub project to investigate this.

The third problem is more Boost-related than anything else. Boost Graph has not been as well maintained as it should be, really. Thus users now find themselves carrying patches, all because no one seems to be able to apply them upstream. Dogen is in this situation, as we've hit the issue described here: Compile error with boost.graph 1.56.0 and g++ 4.6.4. Sadly this is still present in Boost 1.60; the patch exists in Trac but remains unapplied (#10382). This is a tad worrying as we make a lot of use of Boost Graph and intend to increase that usage in the future.

At any rate, as you can see, none of the problems were showstoppers, nor can they all be attributed to Conan.

Getting Travis to Behave

Once I got Dogen building locally, I then went on a mission to convince Travis to use it. It was painful, but mainly because of the lag between committing and hitting an error. The core of the changes to my YML file was as follows:

install:
<snip>
  # conan
  - wget https://s3-eu-west-1.amazonaws.com/conanio-production/downloads/conan-ubuntu-64_0_5_0.deb -O conan.deb
  - sudo dpkg -i conan.deb
  - rm conan.deb
<snip>
script:
  - export GIT_REPO="`pwd`"
  - cd ${GIT_REPO}/build
  - mkdir output
  - cd output
  - conan install ${GIT_REPO}
  - hash=`ls ~/.conan/data/Boost/1.60.0/lasote/stable/package/`
  - cd ~/.conan/data/Boost/1.60.0/lasote/stable/package/${hash}/include/
  - sudo patch -p0 < ${GIT_REPO}/patches/boost_1_59_graph.patch
  - cmake ${GIT_REPO} -DWITH_MINIMAL_PACKAGING=on
  - make -j2 run_all_specs
<snip>

I probably should have a bash script by now, given the size of the YML, but hey - it works. The changes above deal with the installation of the package, applying the Boost patch and using Make instead of Ninja. Quite trivial in the end, even though it took a lot of iterations to get there.

Conclusions

Having a red build is a very distressing event for a developer, so you can imagine how painful it has been to have red builds for several months. So it was with great pleasure that I got to see build #628 in a shiny emerald green. As far as that goes, Conan has been an unmitigated success.

In a broader sense though, what can we say about Conan? There are many positives to take home, even at this early stage of Dogen usage:

  • it is a lot less intrusive than Biicode and easier to set up. Biicode was very well documented, but it was easy to stray from the beaten track, and that then required reading a lot of different wiki pages. It seems easier to stay on the beaten track with Conan.
  • as with Biicode, it seems to provide solutions for Debug/Release builds and for multiple platforms and compilers. We shall be testing it on Windows soon and reporting back.
  • hopefully, since it has been Open Source from the beginning, it will form a community of developers around the source with the know-how required to maintain it. It would also be great to see a business form around it, since someone will have to pay the cloud bill.

In terms of negatives:

  • I still believe the most scalable approach would have been to extend Nuget for the C++ Linux use case, since Microsoft is willing to take patches and since they foot the bill for the public repo. However, I can understand why one would prefer to have total control over the solution rather than depend on the whims of some middle-manager in order to commit.
  • it seems publishing packages requires getting down into Python - see the sketch after this list. I haven't tried it yet, but I'm hoping it will be made as easy as importing packages with a simple text file. The more complexity the tool adds around these flows, the less likely they are to be used.
  • there are still no "official builds" from projects. As explained above, this is a chicken-and-egg problem, because people are only willing to dedicate time to it once there are enough users complaining. Having said that, since Conan is easy to set up, one hopes to see some adoption in the near future.
  • even when using a GitHub profile, one still has to define a Conan-specific password. This was not required with Biicode. A minor pain, but still, if they want to increase traction, this is probably an unnecessary stumbling block. It was sufficient to make me think twice about setting up a login, for one.
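
On the Python point above: I haven't published anything yet, so take this with a pinch of salt, but from a quick skim of the Conan documentation a package recipe appears to be a small Python class along the following lines. Everything here - the library name "foo", the build commands, the file patterns - is purely illustrative rather than lifted from a real recipe, and the API may well change as Conan evolves:

# Hypothetical Conan recipe sketch; all names and commands are illustrative.
from conans import ConanFile

class FooConan(ConanFile):
    name = "foo"
    version = "1.0.0"
    settings = "os", "compiler", "build_type", "arch"
    exports = "*"  # ship the sources alongside the recipe

    def build(self):
        # drive the library's own build; a real recipe would forward
        # the compiler and arch settings properly as well
        self.run("cmake . -DCMAKE_BUILD_TYPE=%s" % self.settings.build_type)
        self.run("cmake --build .")

    def package(self):
        # copy headers and libraries out of the build tree into the package
        self.copy("*.hpp", dst="include", src="include")
        self.copy("*.a", dst="lib", keep_path=False)
        self.copy("*.so", dst="lib", keep_path=False)

    def package_info(self):
        # what consumers should link against
        self.cpp_info.libs = ["foo"]

If that really is all there is to it, then it is not much worse than the text file on the consuming side; the proof will be in actually trying to publish something.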

In truth, these are all very minor negative points, but they are still worth making. All in all, I am quite pleased with Conan thus far.


Monday, December 21, 2015

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

Startups et al.

General Coding

Databases

  • What's new in PostgreSQL 9.5: The RCs are starting and 9.5 looks to continue the trend of amazing Postgres releases. My only remaining wish is for native (and full) support for bitemporality, really, though to be fair Temporal Tables is probably enough for my needs.

C++

  • Optimizing software in C++: One to bookmark now but to digest later. A whole load of stuff on optimisation.
  • Support for Android CMake projects in Visual Studio: So, as if the latest patches to Clang hadn't been enough, MS now decides to add support for CMake in Visual Studio. A bit embryonic, and a bit too Android-focused, but surely it should be extensible for more regular C++ use. What's going on at MS? This is all far too cool to be true.
  • Quickly Loading Things From Disk: interesting analysis about the state of affairs of serialisation in C++. I'll probably require a few passes to fully digest it.
  • Beyond ad-hoc automation: leveraging structured platforms: I've been consuming this presentation slowly but steadily. It deals with a lot of the questions we all have about the new world of containers and microservices, and it seems vital to learn from experience before one finds oneself in a much bigger mess than the monolith could ever get you into. Bridget Kromhout talks intelligently about the subject.

Layman Science

Other


Friday, December 11, 2015

Nerd Food: Pull Request Driven Development

Being in this game for the best part of twenty years, I must confess that it's not often I find something that revolutionises my coding ways. I do tend to try a lot of things, but most of them end up revealing themselves as fads or are incompatible with my flow. For instance, I never managed to get BDD to work for me, try as I might. I will keep trying because it sounds really useful, but it hasn't clicked just yet.

Having said all of that, these moments of enlightenment do occasionally happen, and when they do, nothing beats that life-changing feeling. "Pull Request Driven Development" (or PRDD) is my latest find. I'll start by confessing that "PRDD" as a name was totally made up for this post and hopefully you can see it's rather tongue-in-cheek. However, the benefits of this approach are very real. In fact, I've been using PRDD for a while now but I just never really noticed its presence creeping in. Today, as I introduced a new developer to the process, I finally had the eureka moment and saw just how brilliant it has been thus far. It also made me realise that some people are not aware of this great tool in the developer's arsenal.

But first things first. In order to explain what I mean by PRDD, I need to provide a bit of context. Everyone is migrating to git these days, even those of us locked behind corporate walls; in our particular case, the migration path implied exposure to Git Stash. For those not in the know, picture it as an expensive and somewhat less featureful version of GitHub, but with most of the core functionality there. Of course, I'm sure GitHub is not that cheap for enterprises either, but hey, at least it's the tool everyone uses. Anyway - grumbling or not - we moved to Stash and all development started to revolve around Pull Requests (PRs), raised for each new feature.

Not long after PRs were introduced, a particularly interesting habit started to appear: developers began opening their PRs earlier and earlier in the feature cycle rather than waiting until the very end. Taking this approach to the limit, the idea is that when you start to work on a new feature, you raise the ticket and the PR before you write any code at all. In practice - due to Stash's anachronisms - you need to push at least one commit, but the general notion is valid. This was never mandated anywhere, and there was no particular coordination. I guess one possible explanation for this behaviour is that one wants to get rid of the paperwork as quickly as possible to get to the coding. At any rate, the causes may be obscure but the emerging behaviour was not.

When you combine early PRs with the commit early and commit often approach - which you should be using anyway - the PR starts to become a living document; people see your development work as it progresses and they start commenting on it and possibly even sending you patches as you go along. In a way, this is an enabler for a very efficient kind of peer programming - particularly if you have a tightly knit team - because it gives you maximum parallelism but in a very subtle, non-noticeable way. The main author of the PR is coding as she would normally be, but whenever there is a lull in development - those moments where you'd be browsing the web for five minutes or so - you can quickly check for any comments on your PR and react to those. Similarly, other developers can carry on doing their own work and browse the PRs on their downtime; this allows them to provide feedback whenever it is convenient to them, and to choose the format of the feedback - lengthy or quick, as time permits.

Quick feedback is often invaluable in large code bases because everyone tends to know their own little corner of the code and only very few old hands know how it all hangs together. Thus, seemingly trivial one-liners such as "have you considered using API xyz instead of rolling your own" or "don't forget to do abc when you do that" could save you many hours of pain and enable knowledge to be transferred organically - something that no number of wiki pages could hope to achieve in a million years, because it's very difficult to find these pearls in a sea of uncurated content. And because you committed early and often, each commit is very small and very easy to parse in a small interval of time, so people are much more willing to review - as opposed to that several Kb (or even Mb!) patch that you will have to allocate a day or two for. Further: if you take your commit messages seriously - as, again, you should - you will find that the number of reviewers grows rapidly, simply because developers are nosy and opinionated.

Note that this review process involves no vague meetings and no lengthy and unfocused email chains; it is very high-quality because it is (or can be) very focused on specific lines of code; it causes no unwanted disruptions because you review where and when you choose to review; reviewers can provide examples and even fix things themselves if they so choose; it is totally inclusive because anyone who wants to participate can, but no one is forced to; and it equalises local and remote developers because they all have access to the same data (modulo some IRL conversations that always take place) - an important feature in this world of near-shoring, off-shoring and home-working. Most importantly, instead of finding out about some fundamental errors of approach at the end of an intense period of coding, you now have timely feedback. This saves an enormous amount of time - an advantage that anyone who has been through lengthy code reviews and then spent a week or two reacting to that feedback can appreciate.

I am now a believer in PRDD. So much so that whenever I go back to work on legacy projects in svn, I find myself cringing all the way to the end of the feature. It just feels so nineties.

Update: As I finished penning this post and started reflecting on it, it suddenly dawned on me that a lot of things we now take for granted are only possible because of git. And I don't mean DVCSs in general, I specifically mean git. For example, PRDD is made possible to a large extent because committing in git is a reversible process and history can be fluid if required. This means that people are not afraid of committing, which in turn enables a lot of the goodness I described above. Many DVCSs didn't like this way of viewing history - and to be fair, I know of very few people that liked the idea until they started using it. Once you figure out what it is good for (and not so good for), it suddenly becomes an amazing tool. Git is full of little decisions like this that at first sight look either straight insane or just not particularly useful, but then turn out to change entire development flows.


Wednesday, December 09, 2015

Nerd Food: Interesting…

Time to flush all those tabs again. Some interesting stuff I bumped into recently-ish.

Finance, Economics, Politics

Startups et al.

General Coding

Databases

C++

  • New ELF Linker from the LLVM Project: LLVM keeps on delivering! Now a new ELF linker. To be totally honest, I haven't even started using Gold in anger - I get the feeling the LLVM linker is going to be transitioned in much quicker than Gold.
  • Clang with Microsoft CodeGen in VS 2015 Update 1: OMG, OMG how cool is this - MSFT decided to create a backend for Clang that is totally compatible with MSVC AND open source it! This is just insane. This means for example that you now can develop C++ on Windows without ever having to use MSVC and Visual Studio. It also means you can cross-compile from Linux into Windows with 100% certainty things will work. It means that projects like Wine and ReactOS can start thinking about a migration path into Clang (not quite as simple as it may sound but surely makes sense). CLion with Clang on Windows will rock. The possibilities are just endless. I never quite understood what C2 was all about until I read this announcement - suddenly it all makes sense. This is fantastic news.

Layman Science

Other

  • NoiseRV Live: Still discovering this Portuguese musician, but love his work. Great concert. He could do a little bit less talking between songs, but still - artist's prerogative and all that.
  • Warm Focus: Winging It: Interesting set of "intelligent dance music" as we used to call it back in the day.
  • Mosaic - The “First” Web Browser: Super-cool podcasts about internet history. It would be great to have something like this for UNIX!
  • Jackson C. Frank (1965): Tragic musician from the 60s. Great tunes.
  • Reason in common sense: Always wanted to read Santayana properly. Started, but I guess it will be a very long exercise. Interesting, if somewhat strange book.
  • Ceu - jazz baltica Live (2010): New find, Brazilian musician Ceu.


Monday, November 30, 2015

Nerd Food: Tooling in Computational Neuroscience - Part II: Microscopy

Research is what I'm doing when I don't know what I'm doing.
Wernher von Braun

Welcome to the second instalment of our second series on Computational Neuroscience for lay people. You can find the first post of the previous series here, and the first post of the current series here. As you'd expect, this second series is slightly more advanced, and, as such, it is peppered with unavoidable technical jargon. Having said that, we shall continue to pursue our ambitious target of making things as easy to parse as possible (but no easier). If you read the first series, the second should hopefully make some sense.1

Our last post discussed Computational Neuroscience as a discipline, and the kind of things one may want to do in this field. We also spoke about models and their composition, and the desirable properties of a platform that runs simulations of said models. However, it occurred to me that we should probably build some kind of "end-to-end" understanding; that is, by starting with the simulations and models we are missing a vital link with the physical (i.e. non-computational) world. To put matters right, this part attempts to provide a high-level introduction on how data is acquired from the real world and can then be used - amongst other things - to inform the modeling process.

Macro and Micro Microworlds

For the purposes of this post, the data gathering process starts with the microscope. Of course, keep in mind that we are focusing only on the morphology at present - the shape and the structures that make up the neuron - so we are ignoring other important activities in the lab. For instance, one can conduct experiments to measure voltage in a neuron, and these measurements provide data for the functional aspects of the model. Alas, we will skip these for now, with the promise of returning to them at a later date2.

So, microscopes then. Microscopy is the technical name for the observation work done with the microscope. Because neurons are so small - some 4 to 100 microns in size - only certain types of microscopes are suitable for neuronal microscopy. To make matters worse, the sub-structures inside the neuron are an important area of study and they can be ridiculously small: a dendritic spine - one of the minute protrusions that come out of the dendrites - can be as tiny as 500 nanometres; the lipid bilayer itself is only 2 or 3 nanometres thick, so you can imagine how incredibly small ion channels and pumps are. Yet these are some of the things we want to observe and measure. Let's call this the "micro" work. On the other hand, we also want to understand connectivity and other larger structures, as well as perform observations of the evolution of the cell and so on. Let's call this the "macro" work. These are not technical terms, by the by, just so we can orient ourselves. So, how does one go about observing these differently sized microworlds?

Figure 1: Example of measurements one may want to perform on a dendrite. Source: Reversal of long-term dendritic spine alterations in Alzheimer disease models

Optical Microscopy

The "macro" work is usually done using the Optical "family" of microscopes, which is what most of us think of when hearing the word microscope. As it was with Van Leeuwenhoek's tool in the sixteen hundreds, so it is that today's optical microscopes still rely on light and lenses to perform observations. Needless to say, things did evolve a fair bit since then, but standard optical microscopy has not completely removed the shackles of its limitations. These are of three kinds, as Wikipedia helpfully tells us: a) the objects we want to observe must be dark or strongly refracting - a problem, since the internal structures of the cell are transparent; b) visible light's diffraction limit means that we cannot go much lower than 200 nanometres - pretty impressive, but unfortunately not quite low enough for detailed sub-structure analysis; and c) out of focus light hampers image clarity.

Workarounds to these limitations have been found in the guise of additional techniques that augment the abilities of standard optical microscopy. There are many of these. There is Confocal Microscopy3, which improves resolution and contrast; the Fluorescence Microscope, which uses sub-diffraction techniques to reconstruct some of the detail that is missing due to diffraction; or the incredible-looking movies produced by Multiphoton Microscopy. And of course, it is possible to combine multiple techniques in a single microscope, as is the case with Multiphoton Fluorescence Microscopes (MTMs) and many others.

In fact, given all of these developments, it seems there is no sign of optical microscopy dying out. Presumably some of this is due to the relative lower cost of this approach as well as to the ease of use. In addition, optical microscopy is complementary to the other more expensive types of microscopes; it is the perfect tool for "macro" work that can then help to point out where to do "micro" work. For example, you can use an optical microscope to assess the larger structures and see how they evolve over time, and eventually decide on specific areas that require more detailed analysis. And when you do, you need a completely different kind of microscope.

Electron Microscopy

When you need really high resolution, there is only one tool to turn to: the Electron Microscope (EM). This crazy critter can provide insane levels of magnification by using a beam of electrons instead of visible light. Just how insane, you ask? Well, if you consider that an optical microscope lives in the range of 1500x to 2000x - that is, it can magnify a sample up to two thousand times - an EM can magnify as much as 10 million times, and provide sub-nanometre resolution4. It is mind-boggling. In fact, we've already seen images of atoms taken with an EM in part II, but perhaps it wasn't easy to appreciate just how amazing a feat that is.

Of course, EM is itself a family - and a large one at that, with many and diverse members. As with optical microscopy, each member of the family specialises in a given technique or combination of techniques. For example, the Scanning Electron Microscope (SEM) performs a scan of the object under study, and has a resolution of 1 nanometre or better; the Scanning Confocal Electron Microscope (SCEM) uses the same confocal technique mentioned above to provide higher depth resolution; and Transmission Electron Microscopy (TEM) has the ability to penetrate inside the specimen during the imaging process, given samples with a thickness of 100 nanometres or less.

A couple of noteworthy points are required at this juncture. First, whilst some of these EM techniques may sound new and exciting, most have been around for a very long time; it just seems they keep getting better and better as they mature. For example, TEM was used in the fifties to show that neurons communicate over synaptic junctions, but it's still wildly popular today. Secondly, it's important to understand that the entire imaging process is not at all trivial - certainly not for TEM, nor EM in general, and probably not for Optical Microscopy either. It is just a very labour-intensive and very specialised process - most likely done by an expert human neuroanatomist - and the difficulties range from the chemical preparation of the samples all the way up to creating the images. The end product may give the impression it was easy to produce, but easy it was not.

At any rate, whatever the technical details, the fact is that the imagery that results from all these advances is truly evocative - haunting, even. Take this image produced by SEM:

Personally, I think it is incredibly beautiful; simultaneously awe-inspiring and depressing because it really conveys the messiness and complexity of wetware. By way of contrast, look at the neatness of man-made micro-structures:

Figure 3: The BlueGene/Q chip. Source: IBM plants transactional memory in CPU

Stacks and Stacks of 'Em

Technically, pictures like the ones above are called micrographs. As you can see in the neuron micrograph, these images provide a great visual description of the topology of the object we are trying to study. You also may notice a slight coloration of the cell in that picture. This is most likely due to the fact that the people doing the analysis stain the neuron to make it easier to image. Now, in practice - at least as far as I have seen, which is not very far at all, to be fair - 2D grayscale images are preferred by researchers to the nice, Public Relations friendly pictures like the one above; those appear to be more useful for magazine covers. The working micrographs are not quite as exciting to the untrained eye but very useful to the professionals. Here's an example:

Figure 4: The left-hand side shows the original micrograph; the right-hand side shows the result of processing it with machine learning. Source: Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Let's focus on the left-hand side of this image for the moment. It was taken using ssTEM - serial-section TEM, an evolutionary step in TEM. The ss part of ssTEM is helpful in creating stacks of images, which is why you see the little drawings on the left of the picture; they are there to give you the idea that the top-most image is one of 30 in a stack5. The process of producing the images above was as follows: they started off with a neuronal tissue sample, which was prepared for observation. The sample was 1.5 micrometres thick and was then sectioned into 30 slices of 50 nanometres each. Each of these slices was imaged, at a resolution of 4x4 nanometres per pixel.

As you can imagine, this work is extremely sensitive to measurement error. The trick is to ensure there is some kind of visual continuity between images so that you can recreate a 3D model from the 2D slices. This means, for instance, that if you are trying to figure out connectivity, you need some way to relate a dendrite to its soma and, say, to the axon of the neuron it connects to - and that's one of the reasons why the slices have to be so thin. It would be no good if the pictures missed this information out, as you would not be able to recreate the connectivity faithfully. This is actually really difficult to achieve in practice due to the minute sizes involved; a slight tremor that displaces the sample by a few nanometres would cause shifts in alignment, and even with the high precision of the tools, you can imagine that there is always some kind of movement in the sample's position as part of the slicing process.

Images in a stack are normally stored using traditional formats such as TIFF6. You can see an example of the raw images in a stack here. It's worth noting that, even though the images are 2D grey-scale, since each pixel covers only a few nanometres (4x4 in this case), the full size of an image is very large. Indeed, the latest generation of microscopes produces stacks in the 500 Terabyte range, making the processing of the images a "big-data" challenge.
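
To get a feel for why the numbers blow up so quickly, here is a quick back-of-the-envelope calculation. The figures are illustrative assumptions - a cubic millimetre of tissue, 4x4 nanometre pixels, 50 nanometre sections, one byte per pixel - rather than the parameters of any particular instrument:

# Back-of-the-envelope estimate of the raw data produced by imaging a cubic
# millimetre of tissue at EM resolutions. All figures are illustrative
# assumptions, not the specs of a real microscope.
mm = 1e-3                   # metres
pixel_size = 4e-9           # 4 nm pixels
slice_thickness = 50e-9     # 50 nm sections
bytes_per_pixel = 1         # 8-bit grey-scale

pixels_per_side = mm / pixel_size          # 250,000 pixels
pixels_per_slice = pixels_per_side ** 2    # 62.5 gigapixels
slices = mm / slice_thickness              # 20,000 sections
total_bytes = pixels_per_slice * slices * bytes_per_pixel

print("per slice: %.1f GB" % (pixels_per_slice * bytes_per_pixel / 1e9))
print("whole volume: %.2f PB" % (total_bytes / 1e15))
# per slice: 62.5 GB
# whole volume: 1.25 PB

In other words, a single cubic millimetre at these resolutions is already in petabyte territory, which makes the 500 Terabyte figure above look almost conservative.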

What To Do Once You Got the Images

But back to the task at hand. Once you have the stack, the next logical step is to try to figure out what's what: which objects are in the picture. This is called segmentation and labelling, presumably because you are breaking the one big monolithic picture into discrete objects and giving them names. Historically, segmentation has been done manually, but it's a painful, slow and error-prone process. Due to this, there is a lot of interest in automation, and it has recently become feasible - what with the abundance of cheap computing resources as well as the advent of "useful" machine learning (rather than the theoretical variety). Cracking this puzzle is gaining traction amongst the programming herds, as you can see by the popularity of challenges such as this one: Segmentation of neuronal structures in EM stacks challenge - ISBI 2012. It is from this challenge that we sourced the stack and micrograph above; the right-hand side is the finished product after machine learning processing.

There are also open source packages to help with segmentation. A couple of notable contenders are Fiji and Ilastik. Below is a screenshot of Ilastik.

Figure 5: Source: Ilastik gallery.

An activity that naturally follows on from segmentation and labelling is reconstruction. The objective of reconstruction is to try to "reconstruct" morphology given the images in the stack. It could involve inferring the missing bits of information by mathematical means or any other kind of analysis which transforms the set of discrete objects spotted by segmentation into something looking more like a bunch of connected neurons.

Once we have a reconstructed model, we can start performing morphometric analysis. As Wikipedia tells us, morphometry is "the quantitative analysis of form"; as you can imagine, there are a lot of useful things one may want to measure in brain structures and sub-structures, such as lengths, volumes, surface areas and so on. Some of these measurements can of course be done in 2D, but life is made easier if the model is available in 3D. One such tool is NeuroMorph. It is an open source extension written in Python for the popular open source 3D computer graphics software Blender.

Conclusion

This post was a bit of a whirlwind tour of some of the sources of real-world data for Computational Neuroscience. As I soon found out, each of these sections could easily have been ten times bigger and still not provide you with a proper overview of the landscape; having said that, I hope the post at least gives some impression of the terrain and its main features.

From a software engineering perspective, it's worth pointing out the lack of standardisation in information exchange. In an ideal world, one would want a pipeline with components to perform each of the steps of the complete process, from data acquisition off a microscope (either optical or EM), to segmentation, labelling, reconstruction and finally morphometric analysis. This would then be used as an input to the models. Alas, no such overarching standard appears to exist.

One final point in terms of Free and Open Source Software (FOSS). On one hand, it is encouraging to see the large number of FOSS tools and programs being used. Unfortunately - at least for the lovers of Free Software - there are also some proprietary tools that are widely used such as NeuroLucida. Since the software is so specialised, the fear is that in the future, the better funded commercial enterprises will take over more and more of the space.

That's all for now. Don't forget to tune in for the next instalment!

Footnotes:

1

As it happens, what we are doing here is to apply a well-established learning methodology called the Feynman Technique. I was blissfully unaware of its existence all this time, even though Feynman is one of my heroes and even though I had read a fair bit about the man. On this topic (and the reason why I came to know about the Feynman Technique), it's worth reading Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something, where Feynman discusses his disappointment with science education in Brazil. Unfortunately the Portuguese and the Brazilian teaching systems have a lot in common - or at least they did when I was younger.

2

Nor is the microscope the only way to figure out what is happening inside the brain. For example, there are neuroimaging techniques which can provide data about both structure and function.

3

Patented by Marvin Minsky, no less - yes, he of Computer Science and AI fame!

4

And, to be fair, sub-nanometre doesn't quite capture just how low these things can go. For an example, read Electron microscopy at a sub-50 pm resolution.

5

For a more technical but yet short and understandable take, read Uniform Serial Sectioning for Transmission Electron Microscopy.

6

On the topic of formats: it's probably time we mentioned the Open Microscopy Environment (OME). The microscopy world is dominated by hardware and as such it's the perfect environment for corporations, their proprietary formats and expensive software packages. The OME guys are trying to buck the trend by creating a suite of open source tools and protocols, and looking at some of their stuff, they seem to be doing alright.


Wednesday, November 11, 2015

Nerd Food: Tooling in Computational Neuroscience - Part I: NEURON

In the previous series of posts we did a build-up of theory - right up to the point where we were just about able to make sense of Integrate and Fire - one of the simpler families of neuron models. The series used a reductionist approach - or bottom-up, if you prefer1. We are now starting a new series with the opposite take, this time coming at it from the top. The objective is to provide a (very) high-level overview - in layman's terms, still - of a few of the "platforms" used in computational neuroscience. As this is a rather large topic, we'll try to tackle a couple of platforms each post, discussing a little bit of their history, purpose and limitations - whilst trying to maintain a focus on file formats or DSLs. "File formats" may not sound particularly exciting at first glance, but it is important to keep in mind that these are instances of meta-models of the problem domain in question, and as such, their expressiveness is very important. Understand those and you've understood a great deal about the domain and about the engineering choices of those involved.

But first, let's introduce Computational Neuroscience.

Computers and the Brain

Part V of our previous series discussed some of the reasons why one would want to model neurons (section Brief Context on Modeling). What we did not mention is that there is a whole scientific discipline dedicated to this endeavour, called Computational Neuroscience. Wikipedia has a pretty good working definition, which we will take wholesale. It states:

Computational neuroscience […] is the study of brain function in terms of the information processing properties of the structures that make up the nervous system. It is an interdisciplinary science that links the diverse fields of neuroscience, cognitive science, and psychology with electrical engineering, computer science, mathematics, and physics.

Computational neuroscience is distinct from psychological connectionism and from learning theories of disciplines such as machine learning, neural networks, and computational learning theory in that it emphasizes descriptions of functional and biologically realistic neurons (and neural systems) and their physiology and dynamics. These models capture the essential features of the biological system at multiple spatial-temporal scales, from membrane currents, proteins, and chemical coupling to network oscillations, columnar and topographic architecture, and learning and memory.

These computational models are used to frame hypotheses that can be directly tested by biological or psychological experiments.

Lots of big words, of course, but hopefully they make some sense after the previous posts. If not, don't despair; what they all hint at is an "interdisciplinary" effort to create biologically plausible models, and to use these to provide insights on how the brain is performing certain functions. Think of the Computational Neuroscientist as the right-hand person of the Neuroscientist - the "computer guy" to the "business guy", if you like. The Neuroscientist (particularly the experimental Neuroscientist) gets his or her hands messy with wetware and experiments, which end up providing data and a better biological understanding; the Computational Neuroscientist takes these and uses them to make improved computer models, which are used to test hypotheses or to make new ones, which can then be validated by experiments and so on, in a virtuous feedback loop.2 Where the "interdisciplinary" part comes in is that many of the people doing the role of "computer guys" are actually not computer scientists but instead come from a variety of backgrounds such as biology, physics, chemistry and so on. This variety adds a lot of value to the discipline because the brain is such a complex organ; understanding it requires all kinds of skills - and then some.

It's Models All the Way Down

At the core, then, the work of the Computational Neuroscientist is to create models. Of course, as we have already seen, one does not just walk straight into Mordor and start creating the "most biologically plausible" model of the brain possible; all models must have a scope as narrow as possible if they are to be a) understandable and b) computationally feasible. Thus engineering trade-offs are crucial to the discipline.

Also, it is important to understand that creating a model does not always imply writing things from scratch. Instead, most practitioners rely on a wealth of software available, all with different advantages and disadvantages.

At this juncture you are probably wondering just what exactly are these "models" we speak so much of. Are they just equations like IaF? Well, yes and no. As it happens, all models have roughly the following structure:

  • a morphology definition: we've already spoken a bit about morphology; think of it as the definition of the entities that exist in your model, their characteristics and relationships. This is actually closer to what we computer scientists think the word modeling means. For example, the morphology defines how many neurons you have, how many axons and dendrites, connectivity, spatial positioning and so on.
  • a functional, mathematical or physical definition: I've heard it named in many ways, but fundamentally, what it boils down to is the definition of the equations that your model requires. For example, are you modeling electrical properties or reaction/diffusion?

For the simpler models, the morphology gets somewhat obscured - after all, in LIF, there is very little information about a neuron because all we are interested in are the spikes. For other models, a lot of morphological details are required.

The Tooling Landscape

Idealised…

It is important to keep in mind that these models are to be used in a simulation; that is, we are going to run the program for a period of time (hours or days) and observe different aspects of its behaviour. Thus the functional definition of the model provides the equations that describe the dynamics of the system being simulated and the morphology will provide some of the inputs for those equations.

From here one can start to sketch the requirements of a system for the Computational Neuroscientist:

  • a platform of some kind to provide simulation control: starting, stopping, re-running, storing the results and so on. As the simulations can take a long time to run, the data sets can be quite large - in the hundreds-of-gigs range - so efficient handling of the output data is a must.
  • some kind of DSL that provides a user-friendly way to define models, ideally with a graphical user interface that helps author the DSL. The DSL must cover the two aspects we mentioned above.
  • efficient libraries of numerical routines to help solve the equations. The libraries must be exposed in some way to the DSL so that users can make use of them when defining the functional aspects of the model.

Architecturally, the ability to use a cluster or GPUs would of course be very useful, but we shall ignore those aspects for now. Given this idealised platform, we can now make a bit more sense of what actually exists in the wild.

… vs Actual

The multidisciplinary nature of Computational Neuroscience poses some challenges when it comes to software development: as mentioned, many of the practitioners in the field do not have a Software Engineering background; of those that do, most tend not to have strong biology and neuroscience backgrounds. As a result, the landscape is fragmented and the quality is uneven. On one side, most of the software is open source, making reuse a lot less of a problem. On the other hand, things such as continuous integration, version control, portability, user interface guidelines, organised releases, packaging and so on are still lagging behind most "regular" Free and Open Source projects3.

In some ways, to enter Computational Neuroscience is a bit like travelling in time to an era before git, before GitHub, before Travis and all the other things we take for granted. Not everywhere, of course, but still in quite a few places, particularly with the older and more popular projects. One cannot help but get the feeling that the field could do with some of the general energy we have in the FOSS community, but the technical barriers to contributing tend to be large since the domain is so complex.

So after all of this boring introductory material, we can finally look at our first system.

NEURON

Having to choose, one feels compelled to start with NEURON - the most venerable of the lot, with roots in the 80s4. NEURON is a simulation environment with great depth of functionality and a comprehensive user manual published as a (non-free) book. For the less wealthy, an overview paper is available, as are many other online resources. The software itself is fully open source, with a public mercurial repo.

As with many of the older tools in this field, NEURON development has not quite kept pace with the latest and greatest. For instance, it still has a Motif'esque look to its UI but, alas, do not be fooled - it's not Motif but InterViews - a technology I had never heard of, but which seems to have been popular in the 80s and early 90s. One fears that NEURON may just be the last widely used program relying on InterViews - and the fact that they carry their own fork of it does not make me hopeful.

Figure 1: Source: NEURON Cell Builder

However, once one goes past these layers of legacy, the domain functionality of the tool is very impressive. This goes some way to explain why so many people rely on it daily and why so many papers have been written using it - over 600 papers at the last count.

Whilst NEURON is vast, we are particularly interested in only two aspects of it: hoc and mod (in its many incarnations). These are the files that can be used to define models.

Hoc

Hoc has a fascinating history and a pedigree to match. It is actually the creation of Kernighan and Pike, two UNIX luminaries, and has as contemporaries tools like bc and dc. NEURON took hoc and extended it, both in terms of syntax and in the number of available functions; NEURON's hoc is now an interpreted object-oriented language, albeit with some limitations such as the lack of inheritance. Programs written in hoc execute in an interpreter called oc. There are a few variations of this interpreter, with different kinds of libraries made available to the user (UI, neuron-modeling-specific functionality, etc.) but the gist of it is the same, and the strong point is the interactive development with rapid feedback. On the GUI versions of the interpreter, the script can specify its UI elements, including input widgets for parameters and widgets to display the output. Hoc is thus used as a mix of model/view logic and morphological definition language.

To get a feel for the language, here's a very simple sample from the manual:

create soma    // model topology
access soma    // default section = soma

soma {
   diam = 10   // soma dimensions in um
   L = 10/PI   //   surface area = 100 um^2
}

NMODL

The second language supported by NEURON is NMODL - The NEURON extended MODL (Model Description Language). NMODL is used to specify a physical model in terms of equations such as simultaneous nonlinear algebraic equations, differential equations and so on. In practice, there are actually different versions of NMODL for different NEURON versions, but to keep things simple I'll just abstract these complexities and refer to them as one entity5.

As intimated above, NMODL is a descendant of MODL. As with Hoc, the history of MODL is quite interesting; it was a language defined by the National Biomedical Simulation Resource to specify models for use with SCoP - the Simulation Control Program6. From what I can gather of SCoP, its main purpose was to make life easier when creating simulations, providing an environment where users could focus on what they were trying to simulate rather than on nitty-gritty, implementation-specific details.

NMODL took MODL syntax and extended it with the primitives required by its domain; for instance, it added the NEURON block to the language, which allows multiple instances of "entities". As with MODL, NMODL is translated into efficient C code and linked against supporting libraries that provide the numerics; the NMODL translator to C also had to take into account the requirement of linking against NEURON libraries rather than SCoP.

The below is a snippet of NMODL code, copied from the NEURON book (chapter 9, listing 9.1):

NEURON {
  SUFFIX leak
  NONSPECIFIC_CURRENT i
  RANGE i, e, g
}

PARAMETER {
  g = 0.001  (siemens/cm2)  < 0, 1e9 >
  e = -65    (millivolt)
}

ASSIGNED {
  i  (milliamp/cm2)
  v  (millivolt)
}

NMODL and hoc are used together to form a model; hoc to provide the UI, parameters and morphology and NMODL to provide the physical modeling. The website ModelDB provides a database of models in a variety of platforms with the main objective of making research reproducible. Here you can see an example of a production NEURON model in its full glory, with a mix of hoc and NMODL files - as well as a few others such as session files, which we can ignore for our purposes.

Thoughts

NEURON is more or less a standard in Computational Neuroscience - together with a few other tools such as GENESIS, which we shall cover later. Embedded deeply in its source code is domain logic learned painstakingly over several decades. Whilst, software-engineering-wise, it is creaking at the seams, finding a next-generation heir will be a non-trivial task given the features of the system, the number of models that exist out there, and the knowledge and large community that uses it.

Due to this, the solution a lot of next-generation tools have adopted is to use NEURON as a backend: they provide a shiny modern frontend and then generate the appropriate hoc and NMODL required by NEURON. This is then executed in a NEURON environment and the results are sent back to the user for visualisation and processing using modern tools. Le Roi Est Mort, Vive Le Roi!

Conclusions

In this first part we've outlined what Computational Neuroscience is all about, what we mean by a model in this context and what services one can expect from a platform in this domain. We also covered the first of such platforms. Tune in for the next instalment where we'll cover more platforms.

Footnotes:

1

I still owe you the final post of that series, coming out soon, hopefully.

2

Of course, once you scratch the surface, things get a bit murkier. Erik De Schutter states:

[…] The term is often used to denote theoretical approaches in neuroscience, focusing on how the brain computes information. Examples are the search for “the neural code”, using experimental, analytical, and (to a limited degree) modeling methods, or theoretical analysis of constraints on brain architecture and function. This theoretical approach is closely linked to systems neuroscience, which studies neural circuit function, most commonly in awake, behaving intact animals, and has no relation at all to systems biology. […] Alternatively, computational neuroscience is about the use of computational approaches to investigate the properties of nervous systems at different levels of detail. Strictly speaking, this implies simulation of numerical models on computers, but usually analytical models are also included […], and experimental verification of models is an important issue. Sometimes this modeling is quite data driven and may involve cycling back and forth between experimental and computational methods.

3

This is a problem that has not gone unnoticed; for instance, this paper provides an interesting and thorough review of the state of the union in Computational Neuroscience: Current practice in software development for computational neuroscience and how to improve it. In particular, it explains the dilemmas faced by the maintainers of neuroscience packages.

4

The early story of NEURON is available here; see also the scholarpedia page.

5

See the NMODL page for details, in the history section.

6

As far as I can see, in the SCoP days MODL was just called the SCoP Language, but as the related paper is under a paywall I can't prove it either way. Paper: SCoP: An interactive simulation control program for micro- and minicomputers, from Springer.
