"Smart Data"

The Problem

In business data processing there is a profound mismatch between the data types used in the real world and the data types supported by most programming languages. This is thought to be one of the legacies of the mathematical origin of computers, but until the advent of high-level languages with support for user-defined data types (such as Java), nothing much could be done to resolve this conflict. Most numbers in the real world have units attached to them - for instance, distances are in feet, metres or parsecs, bank balances and prices have to be in some currency, and dates and times have to be relative to a given time zone - but most computer languages treat all quantities as if they were dimensionless. Many of the values we handle in our business applications do not carry an explicit precision either, so the precision has to be defined in the code that references them. Real-world quantities also obey rules that plain numbers do not: two currency amounts cannot be multiplied together, a distance times a distance is an area rather than another distance, and two dates cannot be added. Numbers in the real world seldom behave the same way that numbers do in a computer.
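As a rough illustration of these rules, here is a minimal Java sketch in which the type system only offers the operations that make sense for each kind of quantity. The class names (Distance, Area, Money) and their methods are invented for this example and are not taken from any existing library; the currency is held as a plain string code purely as an assumption.

```java
import java.math.BigDecimal;

// Hypothetical types for illustration only.
final class Distance {
    final double metres;
    Distance(double metres) { this.metres = metres; }

    // distance + distance is still a distance
    Distance plus(Distance other) { return new Distance(metres + other.metres); }

    // distance x distance is an area, not another distance
    Area times(Distance other) { return new Area(metres * other.metres); }
}

final class Area {
    final double squareMetres;
    Area(double squareMetres) { this.squareMetres = squareMetres; }
}

final class Money {
    final BigDecimal amount;
    final String currency;   // assumed to be a short code such as "CAD"

    Money(BigDecimal amount, String currency) {
        this.amount = amount;
        this.currency = currency;
    }

    // Amounts can be added only when the currencies match.
    Money plus(Money other) {
        if (!currency.equals(other.currency)) {
            throw new IllegalArgumentException("currency mismatch");
        }
        return new Money(amount.add(other.amount), currency);
    }

    // Scaling by a dimensionless factor is allowed.
    // There is deliberately no times(Money) method.
    Money times(BigDecimal factor) {
        return new Money(amount.multiply(factor), currency);
    }
}

class DimensionDemo {
    public static void main(String[] args) {
        Distance side = new Distance(3.0);
        Area floor = side.times(side);
        System.out.println(floor.squareMetres + " square metres");

        Money price = new Money(new BigDecimal("19.99"), "CAD");
        Money total = price.times(new BigDecimal("3"));
        System.out.println(total.amount + " " + total.currency);

        // price.times(price);  // would not compile: Money cannot be multiplied by Money
    }
}
```

The point of the sketch is that the mistake of multiplying two prices is rejected by the compiler, while a mismatch of currencies, which cannot be seen at compile time in this design, is at least caught when the program runs.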

The vast majority of the numeric values in today's business applications are described by working code, not explicitly in (or associated with) the data itself. It was reported a few years ago that many of the magnetic data tapes in the tape library belonging to the US Navy had become illegible, not because of I/O errors, but because the data formats were not specified explicitly but only held inside program code, and nobody knew which programs or copy members described which tapes.

In 1999 NASA's Mars Climate Orbiter was lost because of a mismatch in units, and there is the famous case in Canada of the so-called "Gimli Glider": on July 23, 1983, a Boeing 767 operating as Air Canada Flight 143 ran out of fuel in flight and glided to a landing at Gimli, Manitoba. Here is a quote from a fascinating description of the incident by Gail Marsella:

When the ground crew conducted the drip procedure they determined that the tanks contained 7,682 L. The crew knew that the flight required 22,300 kg, and they knew that volume should be multiplied by density to obtain weight. But the density of jet fuel can be expressed in various units such as pounds per gallon, pounds per liter, or kilograms per liter. The ground crew used the value 1.77 without being certain of its units.
The result was that they added about 5,000 L when they should have added about 20,000 L. At the time of takeoff Flight 143 had about 10,000 kg of fuel - less than half the amount needed to reach Edmonton.

Why did the pilots and ground crew so readily accept the value 1.77? Because, when accompanied by the proper units, it is a valid conversion factor that they had all used in the past. The density of jet fuel is 1.77 pounds per liter.
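The arithmetic in the quoted passage can be checked with a few lines of Java. This is only a sketch: the class and method names are invented, and the density in kilograms per litre is derived here from the quoted 1.77 pounds per litre using the usual factor of about 2.20462 pounds per kilogram, rather than taken from the original account.

```java
final class FuelCheck {
    static final double LB_PER_KG = 2.20462;      // pounds in one kilogram
    static final double LB_PER_LITRE = 1.77;      // the figure the crew used
    static final double KG_PER_LITRE = LB_PER_LITRE / LB_PER_KG;  // about 0.803

    // Litres to add so that the tanks hold requiredKg,
    // given a density that is supposed to be in kg per litre.
    static double litresToAdd(double litresOnBoard, double requiredKg, double kgPerLitre) {
        double kgOnBoard = litresOnBoard * kgPerLitre;
        return (requiredKg - kgOnBoard) / kgPerLitre;
    }

    public static void main(String[] args) {
        double litresOnBoard = 7682;
        double requiredKg = 22300;

        // The mistake: 1.77 (pounds per litre) used as if it were kg per litre.
        double wrong = litresToAdd(litresOnBoard, requiredKg, LB_PER_LITRE);

        // The same calculation with the density actually expressed in kg per litre.
        double right = litresToAdd(litresOnBoard, requiredKg, KG_PER_LITRE);

        System.out.printf("with 1.77 taken as kg/L: %.0f L%n", wrong);  // roughly 4,900 L
        System.out.printf("with density in kg/L:    %.0f L%n", right);  // roughly 20,100 L
    }
}
```

Both results agree with the "about 5,000 L" and "about 20,000 L" in the quotation. Nothing in the method signature stops a caller from passing pounds per litre where kilograms per litre are expected, which is exactly the problem: the number 1.77 is valid, but its unit travels only in people's heads.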

Here is a quote from a NASA release about the loss of the Mars Climate Orbiter:

The peer review preliminary findings indicate that one team used English units (e.g., inches, feet and pounds) while the other used metric units for a key spacecraft operation. This information was critical to the maneuvers required to place the spacecraft in the proper Mars orbit.

When humans talk about numeric values, we normally include the units - thus we would say that a weight is "150 pounds", not just "150". On the other hand, when transmitting values between computers, we very seldom include the units. This is partly because in most cases there are no generally accepted codes for units. Including such codes, if they existed, would also imply some (automatic) mechanism for doing conversions between them, and I am not aware of any such mechanism in common use in the data processing (DP) community.

However, at a more fundamental level, a block of data in a computer program is usually programmed as a set of fields, where each field represents an attribute of the entity being described, and each field is usually a simple numeric value or character string. Only floating point numbers include some metadata, and they are not appropriate for most business uses. This generally applies even to advanced database concepts such as relational databases. It was not until the advent of advanced OO languages, such as Java, that we could start thinking of almost all data as references, rather than atomic data fields. If an attribute is in fact usually a reference, then the reference can (and in many cases, should) be to a complex object, containing both unit and data.
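To make the idea of "a complex object, containing both unit and data" concrete, here is a minimal sketch in Java. The names MassUnit and Mass are hypothetical, only two units are covered, and a double is used for brevity even though a decimal type would be the better choice for most business values. The one hard fact relied on is that the international pound is defined as exactly 0.45359237 kg.

```java
// Hypothetical unit table; a real one would need an agreed set of unit codes.
enum MassUnit {
    KILOGRAM(1.0),
    POUND(0.45359237);   // kilograms in one international pound

    final double kgPerUnit;
    MassUnit(double kgPerUnit) { this.kgPerUnit = kgPerUnit; }
}

// The value and its unit travel together in one immutable object.
final class Mass {
    final double value;
    final MassUnit unit;

    Mass(double value, MassUnit unit) {
        this.value = value;
        this.unit = unit;
    }

    // Conversion is done by the object itself, not by the code that uses it.
    Mass to(MassUnit target) {
        return new Mass(value * unit.kgPerUnit / target.kgPerUnit, target);
    }

    // Adding masses in different units converts the argument first.
    Mass plus(Mass other) {
        return new Mass(value + other.to(unit).value, unit);
    }

    @Override
    public String toString() { return value + " " + unit; }
}

class MassDemo {
    public static void main(String[] args) {
        Mass weight = new Mass(150, MassUnit.POUND);
        System.out.println(weight);                        // the unit is printed with the value
        System.out.println(weight.to(MassUnit.KILOGRAM));  // a little over 68 kilograms
    }
}
```

A field that refers to a Mass object can be passed between programs without either side having to remember, in its own code, whether the number was in pounds or kilograms.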

Even when you consider the values themselves, leaving aside the question of units, there may be a wide variety of different representations. This is especially vexatious in the context of dates. Consider a value of 020511: does this represent the 2nd of May 2011? Or perhaps the 11th of May 2002? Or the 5th of February, 1911? In the last century, one could be reasonably certain that a pair of digits such as 94 was the year, but what about the other two pairs of digits? On the banking project I worked on, we decided to always display dates as "ddmmmyy" (later changed to "ddmmmyyyy"), where "dd" is day number, "mmm" is an alpha month abbreviation, and "yy" or "yyyy" is the year. But alpha month abbreviations raise the question of natural languages - how many do you support, and how are they identified? Again, I don't believe there is any universal standard.
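The three readings of 020511 can be reproduced with the standard java.time classes (Java 8 or later). This is a sketch only: the class name, the formatter names and the choice of 1900 as the base year for the third reading are assumptions made for the example. The display method shows the "ddmmmyyyy" style, with the natural language supplied explicitly as a Locale.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Locale;

final class AmbiguousDate {
    // Day, month, two-digit year; java.time reads "uu" as a year in 2000-2099.
    static final DateTimeFormatter DMY = DateTimeFormatter.ofPattern("ddMMuu");

    // Two-digit year, month, day; same 2000-2099 rule for the year.
    static final DateTimeFormatter YMD = DateTimeFormatter.ofPattern("uuMMdd");

    // Month, day, two-digit year, with the year taken as 1900-1999.
    static final DateTimeFormatter MDY_1900 = new DateTimeFormatterBuilder()
            .appendPattern("MMdd")
            .appendValueReduced(ChronoField.YEAR, 2, 2, 1900)
            .toFormatter();

    // Unambiguous display form: day, alpha month abbreviation, four-digit year.
    static String display(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofPattern("ddMMMuuuu", locale).format(date);
    }

    public static void main(String[] args) {
        String raw = "020511";
        System.out.println(LocalDate.parse(raw, DMY));       // 2011-05-02
        System.out.println(LocalDate.parse(raw, YMD));       // 2002-05-11
        System.out.println(LocalDate.parse(raw, MDY_1900));  // 1911-02-05
        System.out.println(display(LocalDate.parse(raw, DMY), Locale.ENGLISH));  // 02May2011
    }
}
```

The same six digits yield three different dates, and nothing in the data says which formatter is the right one. The month abbreviation produced by display depends on the Locale passed in, which is the natural-language question again in a different guise.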

Now, the task of creating classes to represent the whole world of magnitudes and units would obviously be a daunting one, but we can take a manageable application and develop the classes for that application. The hope is that, if this is done carefully, we can use it as a starting-point to grow a bigger and bigger set of classes and their related methods. Just as is currently happening with XML, different application areas have to develop standards so code and data can be shared by everyone who subscribes to these standards. For instance, to the best of my knowledge, there is no universally accepted, machine-processable standard for units of mass, which means that people developing classes for mass will have to develop their own standards and get them accepted across a large group of users.

I believe it is possible to design a set of physical units that are fairly natural to use in Java, and take advantage of Java's compile-time checking (see below).  However, in no programming language that I know of can you prevent programmers from expressing physical quantities as, say, floating-point numbers.   By comparison, in the area of business, I believe you can provide a set of classes that will be so much easier to use than the "vanilla" data types that you don't have to stand over the programmers with a big stick!

In "Business Data Types", I describe a set of Java classes that we developed for an electronic brokerage package that some colleagues and I worked on from 2000 to 2001. These Business Data Types were posted to SourceForge in 2009 under the project name JBDTypes (https://sourceforge.net/projects/jbdtypes/), and have recently been picked up by Softpedia.  

In "Physical Units", you will find a more speculative discussion of a possible implementation of some of the ideas discussed above. The classes described in that web page have never been used for an application, but I feel they make a start towards addressing a number of the problems I have described. Feedback would be appreciated.