Expand then contract

Saturday, 30 Jun 2007

I recently read some retrospection by Jeni Tennison, where she mentions I should have normalised [the data in a separate step] from the start, instead of attempting to deal with all that variability in the [processing step].

It reminded me that I just recently had such an instance myself, and not for the first time. I keep rediscovering this basic tenet of data munging over and over: that the easiest way of dealing with data that has highly variable structure is “expand then contract” – or “normalise then mangle”, I suppose. The principle is that instead of going straight from input to output, you go from input to an intermediary format to output, where the intermediary format is regularised in some fashion. Ideally, all related constructs in the input can be mapped to a single construct in the intermediary format; as far as possible, the same applies to constructs in the output. That way, instead of m·n mappings you only have to deal with m+n.

I need to learn how to recognise sooner when I’m running into such a situation. I’ve already spent way too much of my life rediscovering this utterly basic concept. Time and again I’ll have to deal with some gnarly complexity in some processing that I eventually solve by first transforming the input to something other than the output format. Only afterwards do I recognise that it’s the bog-standard m·n → m+n recipe once again, whereupon I find myself slightly embarrassed. It should seem to be possible to recognise when one is dealing with an m·n situation before wasting entirely too much time wrestling the m·n-type corner cases.

But is it?

The crux is that the formulæ above are oversimplifications, of course. Implicit in m+n are two factors assumed to be equal to 1; it’s really m·i₁+i₂·n, where the products with i₁ and i₂ correspond to the mismatch between input and intermediary format and between the intermediary and output format. These factors are always a great deal larger than 1. However, where this approach is applicable, they’re much smaller than n and m, respectively, leading to m·i₁+i₂·n < m·n. That explains why it can be so hard to see the forest for all the trees with this seemingly simple concept: it’s rarely immediately obvious that the added complexity of transforming twice would introduce significantly smaller auxiliary factors and therefore be a win.