Monday, April 06, 2020

Nerd Food: Notes on Computational Finance, Part I: Introduction

Nerd Food: Computational Finance, Part I: Introduction

Welcome to the first part of what we hope will be a long series of articles exploring QuantLib, Open Source Risk Engine (ORE), libbitcoin and other interesting finance-related FOSS (Free and Open Source) projects. However, I'm afraid this will be a bit of a dull first post, as we need to clarify our objectives before we can jump into the fray.

About

Even though I've been a software engineer in the financial sector for over fifteen years, I've always felt I lacked a deep understanding of the foundational concepts that make up the domain. As a self-confessed reductionist, I find this state of affairs extremely uncomfortable, akin to hearing a continuous loop of David Hilbert's words: wir müssen wissen, wir werden wissen1. The situation had to be remedied, somehow, and the material you are now reading is the proposed fix for my ailments. As to the methodology: given I've had some success in applying the Feynman Technique2 to other complex domains, it seemed only natural to try to use it for this endeavour as well. Past experience also demonstrated writing is an adequate replacement for in vivo communication, which is just as well in this brave new world of social distancing.

So that's that for the why and the how. But, just what exactly are we researching?

Scope

These posts shall largely amble where our fancy takes us, within the porous boundaries of finance. Alas, we can hardly keep calling our target domain "trading, accounting, crypto and a bit of quantitative finance, when viewed through the lens of FOSS" - descriptive though that might sound. We are Software Engineers after all, and if there is one thing we do is to name things, especially when we lack competence to do so3. In this vein, I decided to call this motley domain of ours "Computational Finance". Should the name have merit, I'm afraid I have little claim to it, as it was shamelessly stolen from this passage in Wolfram's writings:

Doctors, lawyers, teachers, farmers, whatever. The future of all these professions will be full of computational thinking. Whether it’s sensor-based medicine, computational contracts, education analytics or computational agriculture - success is going to rely on being able to do computational thinking well.

I’ve noticed an interesting trend. Pick any field X, from archeology to zoology. There either is now a “computational X” or there soon will be. And it’s widely viewed as the future of the field.

These seemed like inspiring words for anyone embarking on a long and uncertain journey, so we made them our own and, in turn, it gave us a name to rally around. But what of its boundaries? One of the biggest challenges facing the reductionist is that, in the limit, everything is interconnected with everything else, for there is no natural halting function. Thus, if you are not careful, all paths will eventually lead you into the realm of quarks and particle physics, regardless of your starting point. Now, that would not be an entirely useful outcome. I have never found a completely satisfactory answer to this question in any on my previous meanderings, but in general I tend to follow an empiric approach and let taste be my guide4. Granted, its probably not the scientific solution you were expecting, but it seems that there are "intuitive" boundaries in subjects, and when we hit one of those we shall stop5. As an example, for our purposes we need not look in detail at legal frameworks when trying to understand financial concepts, though the two disciplines are deeply intertwined.

Structure

An issue which is closely interrelated with the previous one is on how to strike a balance between computational exploration versus domain definitions. Too much exploration and you proceed full steam ahead without knowing the meaning of things; too many boring definitions and they become just words without bringing any light to the subject under scrutiny. The sweet spot lies somewhere in the middle.

Our approach can be described as follows. We shall try to progress very slowly and methodically through the concepts in the domain, building them up as we climb the abstraction ladder but without making them too dense and technical. We'll make extensive use of Wikipedia definitions, where possible, but keeping these focused only on the point at hand rather than exploring the myriad of possibilities around a theme.

Finally, we shall try to marry domain concepts with our chosen implementations - the computational experiments part - in order to illustrate their purpose and get a better understanding at what it is that they are trying to do. So, each post will be focused on one fairly narrow subject area, start with a bunch of definitions which are hopefully self-explanatory and then proceed to explore the available implementations on that topic, or code that we write ourselves.

Audience

The target audience for this material is the fabled homo developus, that non-existent "standard developer" - in this particular case, one moderately competent on C++ but unfamiliar with computational finance. On the "finance" part, if you are already familiar with the domain, you will no doubt find the content very slow going. I'm afraid this is by design: the objective is to try to build the concepts on a solid foundation for those not in the know, so slowness is unavoidable6.

With regards to the computational part: the astute reader will likely point out that there are a great deal of tutorials on QuantLib, ORE and many other libraries of a similar ilk, and many books have been written on quantitative finance. One could be forgiven for wondering if there is a need to pile on more literature onto a seemingly already crowded subject.

In our defence, we are yet to find work that directly targets "plain" software developers and provides them with a sufficiently broad view of the domain. In addition, most of the existing material is aimed at either those with strong mathematical abilities but no domain knowledge, or its converse, leaving many gaps in the understanding. What we are instead aiming for is to target those with strong programming abilities but no particular knowledge of either computational finance or mathematics. And this leads us nicely to our next topic.

Mathematics

Our assumption is that you, dear reader, are not able to attain deep levels of understanding by staring at page after page of complex mathematical formulae. I, for one, certainly cannot. Unfortunately, non-trivial mathematics is difficult to avoid when covering a subject matter of this nature so, as a counterweight, we shall strive to use it sparingly and only from a software engineering application perspective. Note that this approach is clearly not suitable for the mathematically savvy amongst us, as they will find it unnecessarily laboured; then again, our focus lies elsewhere.

Our core belief is that an average reader (like me) should be able to attain a software engineer's intuition of how things work just by fooling around with software models of formulae. The reason why I am very confident on this regard is because that's how developers learn: by coding and seeing what happens. In fact, it is this very tight feedback loop between having an idea and experimenting with it that got many of us hooked into programming in the first place, so its a very powerful tool in the motivational arsenal. And, as it turns out, these ideas are related to Wolfram's concept of Experimental Mathematics. Ultimately, our aspiration is to copy the approach taken by Klein in Coding the Matrix, though perhaps that sets the bar a tad too high. Well, at least you get the spirit of the approach.

Cryptos

Another rather peculiar idea we pursued is the use of cryptocurrencies throughout, to the exclusion of everything else. Whilst very popular in the media, where they are known as cryptos, in truth cryptocurrencies still have a limited presence in the "real" world of finance, and nowhere more so than in derivatives - i.e., the bulk of our analysis. So at first blush, this is a most puzzling choice. We have decided to do so for three reasons.

Firstly, just because I wanted to learn more about cryptos. Secondly, because there is a need to bridge the knowledge gap between these two distinct worlds of finance; to blend the old with the new if you will. Personally, I think it will be interesting to see what the proliferation of derivatives will do to cryptos - but for that we need to disseminate financial knowledge. Finally, and most important of all, because in order to properly illustrate all of the concepts we shall cover, and to drive the open source libraries to their potential, one needs vast amounts of data of the right kind. Lets elaborate further on this point.

One of the biggest problems with any material in quantitative finance is in obtaining data sets which are sufficiently rich to cover all of the concepts being explained. This, in my opinion, is one of the key shortcomings with most tutorials: they either assume users can source the data themselves, or provide a small data set to prove a very specific point but which is insufficient for wider exploration7. This document takes a slightly different approach. We will base ourselves on a simulated world - a parallel reality if you'd like, thinly anchored to our reality by freely available data taken from the crypto markets. We shall then generate all of the remaining data, to the level of precision, richness and consistency required both to drive the code samples, but also to allow for "immersive" exploration. In fact, the very processes for data generation will be used as a pathway for domain exploration.

Of course, generated data is not perfect - i.e., realistic it is not, by definition - but our point is to understand the concepts, not to create new quant models that trade in the real world, so it is adequate for our needs. In addition, the data sets and code samples, as well as the means used to derive them shall be part of a git repository under an open source licence, so they can be extended and improved over time.

If you are not familiar with cryptos, don't worry. For starters, we can assume the intricate mechanistic details to a large extent - the blockchain and so forth - and introduce key concepts as required. We need not concern ourselves with this because there is plenty of freely available material covering it in excruciating detail, and designed specifically for software engineers. Instead, we shall treat cryptos as if they were regular currencies, except where they are just too different - in which case we'll point out the differences. Its a bit of a strange approach, but hopefully it will produce the desired results.

Non Goals

If you are trying to learn techniques on how to trade, this is not the material for you. Even when we discuss trading strategies and other similar topics, our focus is always on trying to understand how the machinery works rather than on how to make money with it. Similarly, if you are a quant or are trying to become one, you are probably better off reading the traditional books such as Hull or Wilmott rather than these posts, as our treatment of mathematics will be far too basic for your requirements. However, if you are an expert in this subject area, or if you find any mistakes please do point them out.

Legalese

As with anything to do with finance, we need to set out the standard disclaimers. To make sure these are seen, we shall add them to each post.

Legal Disclaimer

All of the content, including source code, is either written by the author of the posts, or obtained from freely available sites in the internet, with suitable software licences. All content sources shall be clearly identified at the point of use. No proprietary information of any kind - including, but not limited to, source code, text, market data or mathematical models - shall be used within this material.

All of the views expressed here represent exclusively myself and are not those of any corporation I may be engaged in commercial activities with.

The information available in these blog posts is for your general information and use and is not intended to address your particular requirements. In particular, the information does not constitute any form of financial advice or recommendation and is not intended to be relied upon by users in making (or refraining from making) any investment decisions.8

All software written by the author for these posts is licensed under the Gnu GPL v3. As per the licence, it is "distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details."

Next

With all of the preliminaries out of the way, we can move on to the meat of the subject. On Part II we shall discuss our first real topic, and it could not be much more fundamental: Money.

Footnotes:

1

"We must know, we will know". As per Wikipedia:

The epitaph on his tombstone in Göttingen consists of the famous lines he spoke at the conclusion of his retirement address to the Society of German Scientists and Physicians on 8 September 1930. The words were given in response to the Latin maxim: "Ignoramus et ignorabimus" or "We do not know, we shall not know".

2

The Feynman Technique is a well-established learning methodology. For more details, see Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something.

3

There are no circumstances under which I have seen software developers lacking confidence. I feel that the motto of our profession should be the Latin translation of Make up with confidence that which you lack for in competence.

4

An idea that was most likely inspired by Linus' views on good taste. For details see Applying the Linus Torvalds “Good Taste” Coding Requirement.

5

Of course, your intuition is not my intuition. I'm afraid you will have to take my taste as a given, even where you disagree. Feel free to make your views heard though.

6

As they say in my home country of Angola, malembe malembe. The expression can be loosely translated to English as "slowly but surely", or "slowly does it".

7

As an example, the latter approach is taken by a library I respect very much, the Open Source Risk Engine (ORE).

8

This paragraph was obtained from the Truly Independent Ltd and modified to suit our needs.

Wednesday, January 03, 2018

Nerd Food: The Refactoring Quagmire

Nerd Food: The Refactoring Quagmire

The latest Dogen sprint turned out to be a really long and tortuous one, which is all the more perplexing given the long list of hard sprints that preceded it. Clearly, the slope of the curve is steepening unrelentingly. Experience teaches that whenever you find yourself wandering over such terrains, it is time to stop and gather your thoughts; more likely than not, you are going the wrong way - fast.

Thus, for Dogen, this a post of reflection. To the casual reader - if nothing else - it will hopefully serve as a cautionary tale.

Not Even Wrong

If you are one of the lucky few internauts who avidly follows our release notes, you may recall that the previous sprint had produced a moment of enlightenment where we finally understood yarn as the core of Dogen. At the time, it felt like one of those rare eureka moments, and "the one last great change to the architecture"; afterwards, all would be light. "Famous last words", you may have said then and, of course, if you did, you were right. But given the historical context, the optimism wasn't entirely unjustified. To understand why, we need to quickly recap how the architecture has evolved over time.

Dogen started out divided into three very distinct parts: the frontends (Dia, JSON), the middle-end (yarn) and the backends (C++, C#). The "pipeline" metaphor guided our design because we saw Dogen very much like a compiler, with its frontend, middle-end and backend stages. This was very handy as it meant we could test all stages of the pipeline in isolation. Composition was done by orchestrating frontend, middle-end and backends, at a higher level. This architecture had very good properties when it came to testability and debuggability: we'd start by running the entire pipeline and locating the problem; then, one could easily isolate the issue to a specific component either by looking at the log file, or by dumping the inputs and outputs of the different stages and sifting through them. As a result, bug reproduction was very straightforward since we just needed to record the inputs and create the test at the right level. Whilst the names of the models and their responsibilities changed over time, the overall pipeline architecture remained so since the very early days of Dogen.

In parallel to this, a second trend had emerged over the last ten sprints or so: we moved more and more functionality from the frontends and backends to the middle-end. The key objective here was DRY: we soon found a lot of commonalities between frontends, driving us to create a simple frontend intermediate format so that the work was carried out only once. Not long after, we discovered that backends suffered from precisely the same malaise, so the same cure begun to be applied there too. So far so good, as we were following Roberts and Johnson's sage advice:

People develop abstractions by generalizing from concrete examples. Every attempt to determine the correct abstractions on paper without actually developing a running system is doomed to failure. No one is that smart. A framework is a reusable design, so you develop it by looking at the things it is supposed to be a design of. The more examples you look at, the more general your framework will be.

The literature was with us and the wind was on our sails: the concrete code in the frontends and backends was slowly cleaned up, made general and moved across to the middle-end. As this process took hold, the middle-end grew and grew in size and responsibilities, just as everybody else shed them. Before long, we ended up with one big model, a couple medium-sized models and lots of very small models: "modelets", we named them. These were models with very little responsibility other than gluing together one or two things. The overhead of maintaining a physical component (e.g. static or dynamic library) for the sake of one or two classes seemed a tad too high.

As we begun to extrapolate the trend somewhat, a vision suddenly appeared: why not centralise everything in the middle-end? That is:

  • place all meta-models and transforms in one single central location, together with their orchestration; call it the "core model";
  • orchestration becomes either helper code or a transform in its own right;
  • within this "core model", provide interfaces that backends and frontends implement, injecting them dynamically;
  • make these new interfaces appear as transform chains themselves (mostly).

In this elegant and clean brave new world, we would no longer have "ends" as such but something more akin to "plugins", dynamically glued into the "middle-end" via the magic of dependency injection; the "middle-end" itself would no longer be a "middle" but the center of everything. Backends and frontends had to merely implement the interfaces supplied by the core and the system would just magically sort itself out. The idea seemed amazing and we quickly moved to implementation.

Alas, in our haste to jump into the fray, we had forgotten to heed Mencken:

[T]here is always a well-known solution to every human problem — neat, plausible, and wrong.

The Strange Loop

One of the biggest downsides of working alone and in your spare time is the lack of feedback from other developers. And it's not even just that other developers will teach you lots of new things. No, most often than not, they'll simply drag you away from the echo chambers and tunnels of self-reinforcement you carefully craft and curate for yourself. You are your own intellectual jailer.

In the cold light of day, any developer will tell you that creating cycles is not a good idea, and should not be done without a great deal of thought. Yet, we managed to create "circular" dependencies between all components of the system by centralising all responsibilities into yarn. Now, you may say that these are not "canonically circular" - and this is probably why the problem was not picked up in the first place - because yarn provides interfaces for other models to implement. Well, Lakos is very helpful here in explaining what is going on: our logical design had no cycles - because yarn does not explicitly call any frontends or backends - but the physical design did have them. And these came at a cost.

For starters, it screwed up reasonability. Even though frontends and backends still had their own models, the net result was that we jumbled up all of the elements of the pipeline into a single model, making it really hard to tell what's what. Explaining the system to a new developer now required saying things such as "ah, don't worry about that part for now, it belongs to the middle-end, but here we are dealing only with the backends" - a clear code smell. Once a property of the architecture, reasonability now had to be conveyed in lossy natural language. Testability and debuggability got screwed up too because now everything went through one single central model; if you needed to test a frontend fix you still required building the backends and middle-end and initialise them too. Our pursuit of clarity muddied up the waters.

To make matters worse, an even more pertinent question arose: just when exactly should you stop refactoring? In my two decades of professional development, I had never encountered this problem. In the real world, you are fortunate if you get a tiny amount of time allocated to refactoring - most of the time you need to somehow sneak it in into some overall estimate and hope no one notices. Like sharks, Project Managers (PM) are bred to smell refactoring efforts from a mile a way and know how to trim estimates down to the bone. Even when you are in a greenfield project or just lucky enough to have an enlightened PM who will bat for you, you still need to contend with the realities of corporate development: you need to ship, now. No one gets away with endless refactoring. No one, that is, other than the Free and Open Source Software Developer.

Like many a spare time project, Dogen is my test bed of ideas around coding and coding processes; a general sandbox to have fun outside of work. As such - and very much by design - the traditional feedback loops that exist in the real world need not apply. I wanted to see what would happen if you coded without any constraints and, in the end, what I found out was that if you do not self-impose some kind of halting machinery, you will refactor on forever. In practice, physics still apply, so your project will eventually die out because its energy will dissipate across the many refactoring fronts and entropy will, as always, triumph. But if you really want to keep it at bay, at least for a little while, you need to preserve energy by having one single, consistent vision - "wrong" as it may be according to some metric or other. For, as Voltaire said and we often forget, "le mieux est l'ennemi du bien".

The trouble is that refactoring is made up of a set of engineering trade-offs, and when you optimise for one thing you'll inevitably make something else worse. So, first and foremost, you need to make sure you understand what your trade-offs are, and prioritise accordingly. Secondly, looking for a global minima in such a gigantic multidimensional space is impossible, so you need to make do with local minima. But how do you known you reach a "good enough" point in that space? You need some kind of conceptual cost function.

Descending the Gradient

So it was that we started by defining the key dimensions across which we were trying to optimise. This can be phrased slightly differently: given what we now know about the domain and its implementation, what are the most important characteristics of an idealised physical and logical design?

After some thinking, the final answer was deceptively simple:

  • the entities of the logical design (models, namespaces, classes, methods and the like) should reflect what one reads in the literature of Model Driven Engineering (MDE). That is, a person competent on the field should find a code base that talks his or her language.
  • logical and physical design should promote reasonability and isolation, and thus orchestration should be performed via composition rather than by circular physical dependencies.

For now, these are the two fundamental pillars guiding the Dogen architecture; any engineering trade-offs to be made must ensure these dimensions take precedence. In other words, we can only optimise away any "modelets" if they do not impact negatively either of these two dimensions. If they do, then we must discard this refactoring option. More generally, it is now possible to "cost" all refactoring activity - a conceptual refactoring gradient descent if you'd like; it either brings us closer to the local minima or further away. It gave us a sieve with which to filter the product and sprint backlogs.

To cut a rather long story short, we ended up with a "final" - ha, ha - set of changes to the architecture to get us closer to the local minima:

  • move away from sewing terms: from the beginning we had used terms such as knitter, yarn and so forth. These were… colourful, but did not add any value and detracted us from the first dimension. This was a painful decision but clearly required if one is to comply to point one above: we need to replace all sewing terms with domain specific vocabulary.
  • reorganise the models into a pipeline: however, instead of simply going back to the "modelets" of the past, we need to have a deep think as to what responsibilities belong at what stage of the pipeline. Perhaps the "modelets" were warning us of design failures.

Conclusion

Its never a great feeling when you end a long and arduous sprint only to figure out you were going in the wrong direction in design space. In fact, it is rather frustrating. We have many stories in the product backlog which are really exciting and which will add real value to the end users - well, at this point, just us really but hey - yet we seemed to be lost in some kind of refactoring ground hog day, with no end in sight. However, the main point of Dogen is to teach, and learn we undoubtedly did.

As with anything in the physical world, nothing in software engineering exists in splendid perfection like some kind of platonic solid. Perfection belongs to the realm of maths. In engineering, something can only be described as "fit for purpose", and to do so requires to first determine best we can what that purpose might be. So, before you wonder into a refactoring quagmire of your own making, be sure to have a very firm idea of what your trade-offs are.

Created: 2018-01-03 Wed 15:55

Emacs 25.2.2 (Org mode 8.2.10)

Validate

Saturday, March 25, 2017

Nerd Food: Northwind, or Using Dogen with ODB - Part IV

Nerd Food: Northwind, or Using Dogen with ODB - Part IV

So, dear reader, we meet again for the fourth and final instalment of our series of posts on using Dogen with ODB! And if you missed an episode - unlikely as it may be - well, fear not for you can always catch up! Here are the links: Part I, Part II and Part III. But, if you are too lazy and need a summary: all we've done thus far is to install and setup an Oracle Express database, populate it with a schema (and data) and finally code-generate an ORM model with Dogen and ODB.

I guess it would not be entirely unfair to describe our adventure thus far as a prelude; if nothing else, it was a character building experience. But now we can finally enjoy the code.

Building Zango

Assuming you have checked out zango as described in Part III and you are sitting on its containing directory, you can "configure" the project fairly simply:

$ . /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh
$ cd zango
$ git pull origin master
$ cd build
$ mkdir output
$ cd output
$ CMAKE_INCLUDE_PATH=/full/path/to/local/include CMAKE_LIBRARY_PATH=/full/path/to/local/lib cmake ../.. -G Ninja
-- The C compiler identification is GNU 6.3.0
<lots of CMake output>
-- Generating done
-- Build files have been written to: /path/to/zango/build/output

As always, do not forget to replace /full/path/to/local with your path to the directory containing the ODB libraries. If all has gone according to plan, CMake should have found ODB, Boost, Dogen and all other dependencies we have carefully and painstakingly setup in the previous three parts.

Once the configuration is done, you can fire up Ninja to build:

$ ninja -j5
[1/100] Building CXX object projects/northwind/src/CMakeFiles/northwind.dir/io/category_id_io.cpp.o
<lots of Ninja output>
[98/100] Linking CXX static library projects/northwind/src/libzango.northwind.a
[99/100] Building CXX object CMakeFiles/application.dir/projects/application/main.cpp.o
[100/100] Linking CXX executable application

That was easy! But what exactly have we just built?

The "Application"

We've created a really simple application to test drive the northwind model. Of course, this is really not how your production code should look like, but it'll do just fine for our purposes. We shall start by reading a password from the command line and then we use it to instantiate our Oracle database:

    const std::string password(argv[1]);
    using odb::oracle::database;
    std::unique_ptr<database>
        db(new database("northwind", password, "XE", "localhost", 1521));

We then use this database to read all available customers:

std::list<zango::northwind::customers>
load_customers(odb::oracle::database& db) {
    odb::oracle::transaction t(db.begin());

    std::list<zango::northwind::customers> r;
    auto rs(db.query<zango::northwind::customers>());
    for (auto i(rs.begin ()); i != rs.end (); ++i)
        r.push_back(*i);
    return r;
}

Please note that this is a straightforward use of the ODB API, but barely scratches the surface of what ODB can do. ODB supports all sorts of weird and wonderful things, including fairly complex queries and other great features. If you'd like more details on how to use ODB, you should read its manual: C++ Object Persistence with ODB. It's extremely comprehensive and very well written.

Once we have the customers in memory, we can start to do things with them. We can for example serialise them to a Boost serialisation binary archive and read them back out:

    boost::filesystem::path file("a_file.bin");
    {
        boost::filesystem::ofstream os(file);
        boost::archive::binary_oarchive oa(os);
        oa << customers;
    }

    std::cout << "Wrote customers to file: "
              << file.generic_string() << std::endl;

    std::list<zango::northwind::customers> customers_from_file;
    {
        boost::filesystem::ifstream is(file);
        boost::archive::binary_iarchive ia(is);
        ia >> customers_from_file;
    }

This is where hopefully you should start to see the advantages of Dogen: without writing any code, we have full serialisation support to all classes in the model - in addition to ODB support, of course.

Another very useful feature is to dump objects into a stream:

    for (const auto& c : customers_from_file)
        std::cout << "Customer: " << c << std::endl;

The objects are written in JSON, making it easy to post-process the output with JSON tools such as JQ, resulting in a nicely formatted string:

{
  "__type__": "zango::northwind::customers",
  "customer_id": {
    "__type__": "zango::northwind::customer_id",
    "value": 90
  },
  "customer_code": "WILMK",
  "company_name": "Wilman Kala",
  "contact_name": "Matti Karttunen",
  "contact_title": "Owner/Marketing Assistant",
  "address": "Keskuskatu 45",
  "city": "Helsinki",
  "region": "",
  "postal_code": "21240",
  "country": "Finland",
  "phone": "90-224 8858",
  "fax": "90-224 8858"
}

Dogen supports dumping arbitrarily-nested graphs, so it's great for logging program state as you go along. We make extensive use of this in Dogen, since - of course - we use Dogen to develop Dogen. Whilst this has proven invaluable, we have also hit some limits. For example, sometimes you may bump into really large and complex objects and JQ just won't cut it. But the great thing is that you can always dump the JSON into PostgreSQL - very easily indeed, given the ODB support - and then run queries on the object using the power of JSONB. With a tiny bit more bother you can also dump the objects into MongoDB.

However, with all of this said, it is also important to notice that we do not support proper JSON serialisation in Dogen at the moment. This will be added Real-Soon-Now, as we have a real need for it in production, but its not there yet. At present all you have is this debug-dumping of objects into streams which happens to be JSON. It is not real JSON serialisation. Real JSON support is very high on our priority list though, so expect it to land in the next few sprints.

Another useful Dogen feature is test data generation. This can be handy for performance testing, for example. Let's say we want to generate ~10K customers and see how Oracle fares:

std::vector<zango::northwind::customers> generate_customers() {
    std::vector<zango::northwind::customers> r;
    const auto total(10 * 1000);
    r.reserve(total);

    zango::northwind::customers_generator g;
    for (int i = 0; i < total; ++i) {
        const auto c(g());
        if (i > 100)
            r.push_back(g());
    }

    return r;
}

Note that we skipped the first hundred customers just to avoid clashes with the customer_id primary key. Now, thanks to the magic of ODB we can easily push this data into the database:

void save_customers(odb::oracle::database& db,
    const std::vector<zango::northwind::customers>& customers) {

    odb::transaction t(db.begin());
    for (const auto c : customers)
        db.persist(c);
    t.commit();
}

Et voilá, we have lots of customers in the database now:

SQL> select count(1) from customers;

  COUNT(1)
----------
      9990

To be totally honest, this exercise revealed a shortcoming in Dogen: since it does not know of the size of fields on the database, the generated test data may in some cases be too big to fit the database fields:

Saving customers...
terminate called after throwing an instance of 'odb::oracle::database_exception'
  what():  12899: ORA-12899: value too large for column "NORTHWIND"."CUSTOMERS"."CUSTOMER_CODE" (actual: 6, maximum: 5)

I solved this problem with a quick hack for this article (by removing the prefix used in the test data) but a proper fix is now sitting in Dogen's product backlog for implementation in the near future.

Finally, just for giggles, I decided to push the data we read from Oracle into Redis, an in-memory cache that seems to be all the rage amongst the Cool-Kid community. To keep things simple, I used the C API provided by hiredis. Of course, if this was the real world, I would have used one of the many c++ clients for Redis such as redis-cplusplus-client or cpp redis. As it was, I could not find any Debian packages for them, so I'll just have to pretend I know C. Since I'm not much of a C programmer, I decided to do a very bad copy and paste job from this Stack Overflow article. The result was this beauty (forgive me in advance, C programmers):

    redisContext *c;
    redisReply *reply;
    const char *hostname = "localhost";
    int port = 6379;
    struct timeval timeout = { 1, 500000 }; // 1.5 seconds
    c = redisConnectWithTimeout(hostname, port, timeout);
    if (c == NULL || c->err) {
        if (c) {
            std::cerr << "Connection error: " << c->errstr << std::endl;
            redisFree(c);
        } else {
            std::cerr << "Connection error: can't allocate redis context"
                      << std::endl;
        }
        return 1;
    }

    std::ostringstream os;
    boost::archive::binary_oarchive oa(os);
    oa << customers;
    const auto value(os.str());
    const std::string key("customers");
    reply = (redisReply*)redisCommand(c, "SET %b %b", key.c_str(),
        (size_t) key.size(), value.c_str(), (size_t) value.size());
    if (!reply)
        return REDIS_ERR;
    freeReplyObject(reply);

    reply = (redisReply*)redisCommand(c, "GET %b", key.c_str(),
        (size_t) key.size());
    if (!reply)
        return REDIS_ERR;

    if ( reply->type != REDIS_REPLY_STRING ) {
        std::cerr << "ERROR: " << reply->str << std::endl;
        return 1;
    }

    const std::string redis_value(reply->str, reply->len);
    std::istringstream is(redis_value);
    std::list<zango::northwind::customers> customers_from_redis;
    boost::archive::binary_iarchive ia(is);
    ia >> customers_from_redis;
    std::cout << "Read from redis: " << customers_from_redis.size()
              << std::endl;
    std::cout << "Front customer (redis): "
              << customers_from_redis.front() << std::endl;
    freeReplyObject(reply);

And it actually works. Here's the output, with manual formatting of JSON:

Read from redis: 91
Front customer (redis):  {
  "__type__": "zango::northwind::customers",
  "customer_id": {
    "__type__": "zango::northwind::customer_id",
    "value": 1
  },
  "customer_code": "ALFKI",
  "company_name": "Alfreds Futterkiste",
  "contact_name": "Maria Anders",
  "contact_title": "Sales Representative",
  "address": "Obere Str. 57",
  "city": "Berlin",
  "region": "",
  "postal_code": "12209",
  "country": "Germany",
  "phone": "030-0074321",
  "fax": "030-0076545"
}

As you can hopefully see, in very few lines of code we managed to connect to a RDBMS, read some data, push it into a stream, read it and write into Boost Serialization archives and push it into and out of Redis. All this in fairly efficient C++ code (and some very dodgy C code, but we'll keep that one quiet).

A final note on the CMake targets. Zango comes with a couple of targets for Dogen and ODB:

  • knit_northwind generates the Dogen code from the model.
  • odb_northwind runs ODB against the Dogen model, generating the ODB sources.

The ODB target is added automatically by Dogen. The Dogen target was added manually by yours truly, and it is considered good practice to have one such target when you use Dogen so that other Dogen users know how to generate your models. You can, of course, name it what you like, but in the interest of making everyone's life easier its best if you follow the convention.

Oracle and Bulk Fetching

Whilst I was playing around with ODB and Oracle, I noticed a slight problem: there is no bulk fetch support in the ODB Oracle wrappers at present; it works for other scenarios, but not for selects. I reported this to the main ODB mailing list here. By the by, the ODB community is very friendly and their mailing list is a very responsive place to chat about ODB issues.

Anyway, so you can have an idea of this problem, here's a fetch of our generated customers without prefetch support:

<snip>
Generating customers...
Generated customers. Size: 9899
Saving customers...
Saved customers.
Read generated customers. Size: 9990 time (ms): 263.449
<snip>

Remember the 263.449 for a moment. Now say you delete all rows we generated:

delete from  customers where customer_id > 100;

Then, say you apply to libodb-oracle the hastily-hacked patch I mentioned in that mailing list thread. Of course, I am hand-waving here greatly, as you need to rebuild the library, install the binaries, rebuild zango, etc, but you get the gist. At any rate, here's the patch, hard-coding an unscientifically-obtained-prefetch of 5K rows:

--- original_statement.txt 2017-02-09 15:45:56.585765500 +0000
+++ statement.cxx        2017-02-13 10:18:28.447916100 +0000
@@ -1574,18 +1574,29 @@
       OCIError* err (conn_.error_handle ());
+      const int prefetchSize(5000);
+      sword r = OCIAttrSet (stmt_,
+          OCI_HTYPE_STMT,
+          (void*)&prefetchSize,
+          sizeof(int),
+          OCI_ATTR_PREFETCH_ROWS,
+          err);
+
+      if (r == OCI_ERROR || r == OCI_INVALID_HANDLE)
+          translate_error (err, r);
+
       // @@ Retrieve a single row into the already bound output buffers as an
       // optimization? This will avoid multiple server round-trips in the case
       // of a single object load.
       //
-      sword r (OCIStmtExecute (conn_.handle (),
+      r = OCIStmtExecute (conn_.handle (),
                                stmt_,
                                err,
                                0,
                                0,
                                0,
                                0,
-                               OCI_DEFAULT));
+                               OCI_DEFAULT);
       if (r == OCI_ERROR || r == OCI_INVALID_HANDLE)
         translate_error (conn_, r);

And now re-run the command:

Generated customers. Size: 9899
Saving customers...
Saved customers.
Read generated customers. Size: 9990 time (ms): 40.85

Magic! We're down to 40.85. Now that I have a proper setup, I am going to start working on upstreaming this patch, so that ODB can expose the fetch configuration for fetching in a similar manner it already does for other purposes. If you are interested in the gory technical details, have a look at Boris' reply.

Conclusion

Hopefully this concluding part gave you an idea of why you might want to use Dogen with ODB for your modeling needs. Sadly, its not easy to frame the discussion adequately, so that you have all the required context in order to place these two tools in the continuum of tooling; but I'm hoping this series of articles was useful to at least help you setup Oracle Express in Debian and get an idea of what you can do with these two tools.

Created: 2017-03-25 Sat 20:26

Emacs 25.1.1 (Org mode 8.2.10)

Validate

Monday, March 20, 2017

Nerd Food: Northwind, or Using Dogen with ODB - Part III

Nerd Food: Northwind, or Using Dogen with ODB - Part III

Optimism is an occupational hazard of programming; feedback is the treatment. -- Kent Beck

Welcome to the third part of a series of N blog posts on using Dogen with ODB against an Oracle database. If you want more than the TL;DR, please read Part I and Part II. Otherwise, the story so far can be quickly summarised as follows: we got our Oracle Express database installed and set up by adding the required users; we then built the ODB libraries and installed the ODB compiler.

After this rather grand build up, we shall finally get to look at Dogen - just about. It now seems clear these series will have to be extended by at least one or two additional instalments in order to provide a vaguely sensible treatment of the material I had initially planned to cover. I wasn't expecting N to become so large, but - like every good software project - I'm now realising you can only estimate the size of the series properly once you've actually finished it. And to rub salt into the wounds, before we can proceed we must start by addressing some of the instructions in the previous posts which were not quite right.

Est Humanum Errare?

The first and foremost point in the errata agenda is concerned with the additional Oracle packages we downloaded in Part I. When I had originally checked my Oracle XE install, I did not find an include directory, which led me to conclude that a separate download was required for driver libraries and header files. I did find this state of affairs somewhat unusual - but then again, it is Oracle we're talking about here, so "unusual" is the default behaviour. As it turns out, I was wrong; the header files are indeed part of the Oracle XE install, just placed under a rather… shall we say, creative, location: /u01/app/oracle/product/11.2.0/xe/rdbms/public. The libraries are there too, under the slightly more conventionally named lib directory.

This is quite an important find because the downloaded OCI driver has moved on to v12 whereas XE is still on v11. There is backwards compatibility, of course - and everything should work fine connecting a v12 client against an v11 database - but it does introduce an extra layer of complexity: you now need to make sure you do not simultaneously have both v11 and v12 shared objects in the path when linking and running or else you will start to get some strange warnings. As usual, we try our best to confuse only one issue at a time, so we need to make sure we are making use of v11 and purge all references to v12; this entails recompiling ODB's oracle support.

If you followed the instructions on Part II and you have already installed the ODB Oracle library, you'll need to remove it first:

rm /full/path/to/local/lib/libodb-oracle* /full/path/to/local/include/odb/oracle

Remember to replace /full/path/to/local with the path to your local directory. Then, you can build by following the instructions as per previous post, but with one crucial difference at configure time: point to the Oracle XE directories instead of the external OCI driver directories:

. /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh
LD_LIBRARY_PATH=/u01/app/oracle/product/11.2.0/xe/lib CPPFLAGS="-I/full/path/to/local/include -I/u01/app/oracle/product/11.2.0/xe/rdbms/public" LDFLAGS="-L/full/path/to/local/lib -L/u01/app/oracle/product/11.2.0/xe/lib" ./configure --prefix=/full/path/to/local

Again, replacing the paths accordingly. If all goes well, the end result should be an ODB Oracle library that uses the OCI driver from Oracle XE. You then just need to make sure you have executed oracle_env.sh before running your binary, but don't worry too much because I'll remind you later on. Whilst we're on the subject of Oracle packages, it's worth mentioning that I did a minor update to Part I: you didn't need to download SQLPlus separately either, as it is also included in XE package. So, in conclusion, after a lot of faffing, it turns out you can get away with just downloading XE and nothing else.

The other minor alteration to what was laid out on the original posts is that I removed the need for the basic database schema. In truth, the entities placed in that schema were not adding a lot of value; their use cases are already covered by the northwind schema, so I removed the need for two schemas and collapsed them into one.

A final note - not quite an errata per se but still, something worthwhile mentioning. We didn't do a "proper" Oracle setup, so when you reboot your box you will find that the service is no longer running. You can easily restart it from the shell, logged in as root:

# cd /etc/init.d/
# ./oracle-xe start
Starting oracle-xe (via systemctl): oracle-xe.service.

Notice that Debian is actually clever enough to integrate the Oracle scripts with systemd, so you can use the usual tools to find out more about this service:

# systemctl status oracle-xe
● oracle-xe.service - SYSV: This is a program that is responsible for taking care of
   Loaded: loaded (/etc/init.d/oracle-xe; generated; vendor preset: enabled)
   Active: active (exited) since Sun 2017-03-12 15:10:47 GMT; 6s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 16761 ExecStart=/etc/init.d/oracle-xe start (code=exited, status=0/SUCCESS)

Mar 12 15:10:37 lorenz systemd[1]: Starting SYSV: This is a program that is responsible for taking c…e of...
Mar 12 15:10:37 lorenz oracle-xe[16761]: Starting Oracle Net Listener.
Mar 12 15:10:37 lorenz su[16772]: Successful su for oracle by root
Mar 12 15:10:37 lorenz su[16772]: + ??? root:oracle
Mar 12 15:10:37 lorenz su[16772]: pam_unix(su:session): session opened for user oracle by (uid=0)
Mar 12 15:10:39 lorenz oracle-xe[16761]: Starting Oracle Database 11g Express Edition instance.
Mar 12 15:10:39 lorenz su[16800]: Successful su for oracle by root
Mar 12 15:10:39 lorenz su[16800]: + ??? root:oracle
Mar 12 15:10:39 lorenz su[16800]: pam_unix(su:session): session opened for user oracle by (uid=0)
Mar 12 15:10:47 lorenz systemd[1]: Started SYSV: This is a program that is responsible for taking care of.
Hint: Some lines were ellipsized, use -l to show in full.

With all of this said, lets resume from where we left off.

Installing the Remaining Packages

We still have a number of packages to install, but fortunately the installation steps are easy enough so we'll cover them quickly in this section. Let's start with Dogen.

Dogen

Installing Dogen is fairly straightforward: we can just grab the latest release from BinTray:

As it happens, we must install v99 or above because we did a number of fixes to Dogen as a result of this series of articles; previous releases had shortcomings with their ODB support.

As expected, the setup is pretty standard-fare debian:

$ wget https://dl.bintray.com/domaindrivenconsulting/Dogen/0.99.0/dogen_0.99.0_amd64-applications.deb -O dogen_0.99.0_amd64-applications.deb
$ sudo dpkg -i dogen_0.99.0_amd64-applications.deb
[sudo] password for USER:
Selecting previously unselected package dogen-applications.
(Reading database ... 551550 files and directories currently installed.)
Preparing to unpack dogen_0.99.0_amd64-applications.deb ...
Unpacking dogen-applications (0.99.0) ...
Setting up dogen-applications (0.99.0) ...

If all has gone according to plan, you should see something along the lines of:

$ dogen.knitter --version
Dogen Knitter v0.99.0
Copyright (C) 2015-2017 Domain Driven Consulting Plc.
Copyright (C) 2012-2015 Marco Craveiro.
License: GPLv3 - GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.

Dia

Dogen has multiple frontends - at the time of writing, JSON and Dia. We'll stick with Dia because of its visual nature, but keep in mind that what you can do with Dia you can also do with JSON.

A quick word on Dia for those not in the know, copied verbatim from its home page:

Dia is a GTK+ based diagram creation program for GNU/Linux, MacOS X, Unix, and Windows, and is released under the GPL license.

Dia is roughly inspired by the commercial Windows program 'Visio,' though more geared towards informal diagrams for casual use. It can be used to draw many different kinds of diagrams. It currently has special objects to help draw entity relationship diagrams, UML diagrams, flowcharts, network diagrams, and many other diagrams.

Dia does not change very often, which means any old version will do. You should be able to install dia straight off of package manager:

apt-get install dia

Other Dependencies

I had previously assumed Boost to be installed on Part II but - if nothing else, purely for the sake of completeness - here are the instructions to set it up, as well as CMake and Ninja. We will need these in order to build our application, but we won't dwell on them too much on them or else this series of posts would go on forever. Pretty much any recent version of Boost and CMake will do, so again we'll just stick to vanilla package manager:

# apt-get install cmake ninja-build libboost-all-dev

Mind you, you don't actually need the entirety of Boost for this exercise, but it's just easier this way.

Emacs and SQL Plus

Finally, a couple of lose notes which I might as well add here. If you wish to use SQLPlus from within Emacs - and you should, since the SQLi mode is just simply glorious - you can configure it to use our Oracle Express database quite easily:

(add-to-list 'exec-path "/u01/app/oracle/product/11.2.0/xe/bin")
(setenv "PATH" (concat (getenv "PATH") ":/u01/app/oracle/product/11.2.0/xe/bin"))
(setenv "ORACLE_HOME" "/u01/app/oracle/product/11.2.0/xe")

After this you will be able to start SQL Plus from Emacs with the usual sql-oracle command. I recommend you to do at least a minimal setup of SQL Plus too, to make it usable:

SQL> set linesize 8192
SQL> set pagesize 50000

Introducing Zango

After this excruciatingly long setup process, we can at long last start to create our very "simple" project. Simple in quotes because it ended up being a tad more complex than what was originally envisioned, so it was easier to create a GitHub repository for it. It would have been preferable to describe it from first principles, but then the commentary would literally go on for ever. A compromise had to be made.

In order to follow the remainder of this post please clone zango from GitHub:

git clone git@github.com:DomainDrivenConsulting/zango.git

Zango is a very small Dogen project that builds with CMake. Here are some notes on the folder structure to help you navigate:

  • build/cmake: additional CMake modules that are not part of the standard CMake distribution. We need this for ODB, Oracle and Dogen.
  • data: some application data that we will use to populate our database.
  • projects: where all the code lives.
  • projects/input_models: location of the Dogen models - in this case, we just have one. You could, of course, place it anywhere you'd like, but traditionally this is where they live.
  • projects/northwind: code output of the Dogen model. This is the key project of zango.
  • projects/application: our little command line driver for the application.

Now, before we get to look at the code I'd like to first talk about Northwind and on the relationship between Dogen and ODB.

Northwind Schema

Microsoft makes the venerable Northwind database available in CodePlex, at this location. I found a useful description of the Northwind database here, which I quote:

Northwind Traders Access database is a sample database that shipped with Microsoft Office suite. The Northwind database contains the sales data for a fictitious company called Northwind Traders, which imports and exports specialty foods from around the world. You can use and experiment with Access with Northwind database while you're learning and develop ideas for Access.

If you really want a thorough introduction to Northwind, you could do worse than reading this paper: Adapting the Access Northwind Database to Support a Database Course. Having said that, for the purposes of this series we don't really need to dig that deep. In fact, I'll just present CodePlex's diagram with the tables and their relationships to give you an idea of the schema - without any further commentary - and that's more or less all that needs to be said about it:

Northwind Schema (C) Microsoft.

Now, in theory, we could use this image to manually extract all the required information to create a Dia diagram that follows Dogen's conventions, code-generate that and Bob's your Uncle. However, in practice we have a problem: the CodePlex project only contains the SQL statements for Microsoft SQL Server. Part of the point of this exercise is to show that we can load real data from Oracle, rather than just generate random data, so it would be nice to load up the "real" Northwind data from their own tables. This would be more of an "end-to-end" test, as opposed to using ODB to generate the tables, and Dogen to generate random data which we can push to the database.

However, its not entirely trivial to convert T-SQL into Oracle SQL, and since this is supposed to be a "quick" project on the side - focusing on ODB and Dogen - I was keen on not spending time on unrelated activities such as SQL conversions. Fortunately, I found exactly what I was looking for: a series of posts from GeeksEngine entitled "Convert MS Access Northwind database to Oracle". For reference, these are as follows:

If you don't care too much about the details, you can just look at the Oracle SQL statements, available here and copied across into the Zango project. I guess it's still worthwhile mentioning that GeeksEngine has reduced considerably the number of entities in the schema - for which they provide a rationale. Before we start an in-depth discussions into the merits of normalisation and de-normalisation and other DBA level topics, I have to stop you in your tracks. Please do not get too hung-up on the "quality" of the database schema of Northwind - either the Microsoft or the GeeksEngine one. The purpose of this exercise is merely to demonstrate how Dogen and ODB work together to provide an ORM solution. From this perspective, any vaguely realistic database schema is adequate - provided it allows us to test-drive all the features we're interested in. Whether you agree or not with the decisions the original creators of this schema made is a completely different matter, which is well beyond the scope of this series of posts.

Right, so now we need to setup our Northwind schema and populate it with data. For this you can open a SQL Plus session with user Northwind as explained previously and then run in the SQL script:

@/path/to/zango/data/Oracle-Northwind.sql

Replacing /path/to with the full path to your Zango checkout. This executes the GeeksEngine script against your local Oracle XE database. If all has gone well, you should now have a whole load of tables and data. You can sanity-check the setup by running the following SQL:

SQL> select table_name from all_tables where owner = 'NORTHWIND';

TABLE_NAME
------------------------------
ORDER_DETAILS
CATEGORIES
CUSTOMERS
EMPLOYEES
SUPPLIERS
SHIPPERS
PRODUCTS
ORDERS

8 rows selected.

SQL> select employee_id, firstname, lastname from employees where rownum <3;

EMPLOYEE_ID FIRSTNAME  LASTNAME
----------- ---------- --------------------
      1 Nancy      Davolio
      2 Andrew     Fuller

Now then, let's model these entities in Dogen.

The Dogen Model for Northwind

Before we proceed, I'm afraid I must make yet another disclaimer: a proper explanation on how to use Dia (and UML in general) is outside the scope of these articles, so you'll see me hand-waving quite a lot. Hopefully the diagrams are sufficiently self-explanatory for you to get the idea.

The process of modeling is simply to take the entities of the GeeksEngine SQL schema and to model them in Dia, following Dogen's conventions: each SQL type is converted to what we deemed to be the closest C++ type. You can open the diagram from the folder projects/input_models/northwind.dia, but if you haven't got it handy, here's a screenshot of most of the UML model:

Dogen Northwind model.

The first point of note in that diagram is - if you pardon the pun - the UML note.

odb_uml_note.png

Figure 1: UML Note from northwind model.

This configuration is quite important so we'll discuss it a bit more detail. All lines starting with #DOGEN are an extension mechanism used to supply meta-data into Dogen. First, lets have a very quick look at the model's more "general settings":

  • yarn.dia.comment: this is a special command that tells Dogen to use this UML note as the source code comments for the namespace of the model (i.e. northwind). Thus the text "The Northwind model is a…" will become part of a doxygen comment for the namespace.
  • yarn.dia.external_modules: this places all types into the top-level namespace northwind.
  • yarn.input_language: the notation for types used in this model is C++. We won't delve on this too much, but just keep in mind that Dogen supports both C++ and C#.
  • quilt.cpp.enabled: as we are using C++, we must enable it.
  • quilt.cpp.hash.enabled: we not require this feature for the purposes of this exercise.
  • quilt.csharp.enabled: As this is a C++-only model, we will disable C#.
  • annotations.profile: Do not worry too much about this knob, it just sets a lot of default options for this project such as copyright notices and so forth.

As promised, you won't fail to notice we hand-waved quite a lot on the description of these settings. It is very difficult to explain them properly them without giving the reader an immense amount of context about Dogen. This, of course, needs to be done - particularly since we haven't really spent the required time updating the manual. However, in the interest of keeping this series of posts somewhat focused on ODB and ORM, we'll just leave it at that, with a promise to create Dogen-specific posts on them.

Talking about ORM, the next batch of settings is exactly related to that.

  • yarn.orm.database_system: here, we're stating that we're interested in both oracle and postgresql databases.
  • yarn.orm.letter_case: this sets the "case" to use for all identifiers; either upper_case or lower_case. So if you choose upper_case, all your table names will be in upper case and vice-versa. This applies to all columns and object names on the entirety of this model (e.g. customers becomes CUSTOMERS and so forth).
  • yarn.orm.schema_name: finally we set the schema name to northwind. Remember that we are in upper case, so the name becomes NORTHWIND.

In addition to the meta-data, the second point worth noticing is that there is a large overlap between C++ classes and the entities in the original diagram. For example, we have customers, suppliers, employees and so forth - the Object-Relational Mapping is very "linear". This is a characteristic of the Dogen approach to ORM, but you do not necessarily need to use ODB in this manner; we discuss this in the next section.

If one is to look at a properties of a few attributes in more detail, one can see additional Dogen meta-data. Take customer_id in the customers class:

odb_customer_id_attributes.png

Figure 2: Properties of customerid in the customer class.

The setting yarn.orm.is_primary_key tells Dogen that this attribute is the primary key of the table. Note that we did not use an int as the type of customer_id but instead made use of a Dogen feature called "primitives". Primitives are simple wrappers around builtins and "core" types such as std::string, intended to have little or no overhead after the compiler is done with them. They are useful when you want to use domain concepts to clarify intent. For example, primitives help making it obvious when you try to use a customer_id when a supplier_id was called for. It's also worth noticing that customer_id makes use of yarn.orm.is_nullable - settable to true or false. It results in Dogen telling ODB if a column can be NULL or not.

As we stated, each of the attributes of these classes has the closest C++ type we could find that maps to the SQL type used in the database schema. Of course, different developers can make different choices for these types. For example, were we to store the picture data rather than a path to the picture as GeeksEngine decided to do, we would use a std::vector<char> instead of a std::string. In that case, we'd have to perform some additional mapping too:

#DOGEN yarn.orm.type_override=postgresql,BYTEA
#DOGEN yarn.orm.type_override=oracle,BLOB

This tells Dogen about the mapping of the attribute's type to the SQL type. Dogen then conveys this information to ODB.

Dogen's ORM support is still quite young - literally a couple of sprints old - so there will be cases where you may need to perform some customisation which is not yet available in its meta-model. In these cases, you can bypass Dogen and make use of ODB pragmas directly. As an example, GeeksEngine Oracle schema named a few columns in Employees without underscores such as FIRSTNAME and LASTNAME. We want the C++ classes to have the correct names (e.g. first_name, last_name, etc) so we simply tell ODB that these columns have different names in the database. Take last name for example:

odb_last_name_attributes.png

Figure 3: Properties of last name in the employee class.

A final note on composite keys. Predictably, Dogen follows the ODB approach - in that primary keys that have more than one column must be expressed as a class on its own right. In northwind, we use the postfix _key for these class names in order to make them easier to identify - e.g. order_details_key. You won't fail to notice that this class has the flag yarn.orm.is_value set. It tells Dogen - and, by extension, ODB - that it is not really a full-blown type, which would map it to a table, but instead should be treated like other value types such as std::string.

Interlude: Dogen with ODB vs Plain ODB

"The technical minutiae is all well and good", the inquisitive reader will say, "but why Dogen and ODB? Why add yet another layer of indirection when one can just use ODB?" Indeed, it may be puzzling for there to be a need for a code-generator which generates code for another code-generator. "Turtles all the way down" and "We over-engineered it yet again", the crowd chants from the terraces.

Let me attempt to address some of these concerns.

First, it is important to understand the argument we're trying to make here: Dogen models benefit greatly from ODB, but its not necessarily the case that all ODB users benefit from Dogen. Let's start with a classic ODB use case, which is to take an existing code base and add ORM support to it. In this scenario it makes no sense to introduce Dogen; after all, ODB requires only small changes to the original source code and has the ability to parse very complex C++. And, of course, using ODB in this manner also allows one to deal with impedance mismatches between the relational model and the object model of your domain.

Dogen, on the other hand, exists mainly to support Model Driven Software Development (MDSD), so the modeling process is the driver. This means that one is expected to start with a Dogen model, and to use the traditional MDSD techniques for the management of the life-cycle of your model - and eventually for the generation of entire software product lines. Of course, you need not buy into the whole MDSD sales pitch in order to make use of Dogen, but you should at least understand it in this context. At a bare minimum, it requires you to think in terms of Domain Models - as Domain Driven Development defines them - and then in terms of "classes of features" required by the elements of your domain. These we call "facets" in Dogen parlance. There are many such facets like io, which is the ability to dump an object's state into a C++ stream - at present using JSON notation - or serialization which is the ability to serialise an object using Boost serialisation. It is in this context that ODB enters the Dogen world. We could, of course, generate ORM mappings (and SQL) directly from Dogen. But given what we've seen from ODB, it seems this would be a very large project - or, conversely, we'd have very poor support, not dealing with a great number of corner cases. By generating the very minimal and very non-intrusive code that ODB needs, we benefit from the years of experience accumulated in this tool whilst at the same time making life easier for Dogen users.

Of course, as with all engineering trade-offs, this one is not without its disadvantages. When things do go wrong you now have more moving parts to go through when root-causing: was it an error in the diagram, or was it Dogen, or was it the mapping between Dogen and ODB or was it ODB? Fortunately, I found that this situation is minimised by the way in which you end up using Dogen. For instance, all generated code can be version-controlled, so you can look at the ODB input files generated by Dogen and observe how they change with changes in the Dogen model. The Dogen ODB files should also look very much like regular hand-crafted ODB files - making use of pragmas and so forth - and you are also required to run ODB manually against them. Thus, in practice, I have found troubleshooting straightforward enough that the layers of indirection end up not constituting a real problem.

Finally, its worth pointing out that the Domain Models Dogen generates have a fairly straightforward shape to them, making the ODB mapping a lot more trivial that "general" C++ code would have. It is because of this that we have orm parameters in Dogen, which can expand to multiple ODB pragmas - the user should not need to worry about that expansion.

Conclusion

This part is already becoming quite large, so I'm afraid we need to stop it here and continue on Part IV. However, we have managed to address a few of the mistakes of the Oracle setup of previous parts, introduced the remaining applications that we need to install and then discussed Northwind - both in terms of its original intent and also in terms of the Dogen objectives. Finally we provided an explanation of how Dogen and ODB fit together in a tooling ecosystem.

Created: 2017-03-19 Sun 23:07

Emacs 25.1.1 (Org mode 8.2.10)

Validate