Sturdy and massive

Sunday, 3 Sep 2006

I recently re-read Steve Yegge’s article Is Weak Typing Strong Enough? There aren’t any surprises or striking insights in it for folks who are already in the “weak typing” camp, although it is notable that he backs it up with relatively hard evidence:

In any case, for several years I got to watch Perl and Java folks working side by side doing pretty much the same tasks. In some cases, they even had to implement the same logic in both languages. Yes, there were inefficiencies with our Perl-and-Java approach. However, it was the right decision at the time, and as a result, I was personally able to witness a more or less apples-to-apples, multi-year comparison of the strong-typing and weak-typing camps at work in the same production environment.

In a nutshell, I was pretty impressed. I was a die-hard Java guy at the time, and even then, I could see that the Perl code was far smaller and simpler than the Java code [… and] seemed modular enough. It had a well-defined architecture, and it got the job done, year in and year out.

Our Java code (to me) seemed far more complex, even though I could read Java more easily. I think Java programmers have a tendency to overengineer things, myself included.

I didn’t expect anything else, but it’s certainly nice to have it confirmed.

What I have been thinking rather more about is the following trail of questions he posed in his article’s wrap-up section:

I still have some doubts. Do weakly-typed systems have inherently lower scalability? Do they tend to dissolve into vast typeless traps at a certain size, as the static camp would have you believe? Do the runtime type-error rates get out of hand, even with rigorous unit testing and software-engineering discipline?

The problem with these questions is the apples-to-apples aspect. Let’s say that a hypothetical system in a dynamically typed language does indeed grow very large, to the point of unmanageability. And let’s further assume that typing is indeed the problem that causes the system to finally fall over and collapse on itself.

These are big assumptions, but let’s take this at face value for just a bit.

How large would such a system be? Let’s say it’s 5 million lines. Remember, face value.

By the very definition of our hypothetical example, a statically typed system of the same size would not have broken down.

Here’s the issue with apples-to-apples: systems in dynamically typed languages grow much more slowly. Evidence seems to suggest that codebases written under static typing constraints tend to be around half an order to one order of magnitude larger than equivalent codebases in dynamically typed languages. This is even worse in the case of Java, since Java limits expressiveness and abstraction such that codebases have a tendency to grow superlinearly.

So a 5Msloc system in a dynamically typed language is not actually comparable to a 5Msloc system in a statically typed one. To make a fair comparison, you’d have to consider it as equivalent to a 30Msloc system in a statically typed language. Or maybe a 100-150Msloc one in Java.
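
To make the arithmetic behind that comparison explicit, here is a rough sketch. It assumes "half to one order of magnitude" means a multiplier between 10^0.5 and 10^1; the names and the helper function are mine, purely for illustration, and the multipliers are guesses from this post rather than measurements.

```python
# Rough size-equivalence arithmetic. The multipliers are this post's
# guesses ("half to one order of magnitude"), not measured values.
DYNAMIC_SLOC = 5_000_000

# Assumption: half an order of magnitude is 10**0.5, one order is 10**1.
LOW_FACTOR = 10 ** 0.5
HIGH_FACTOR = 10 ** 1.0


def equivalent_static_sloc(dynamic_sloc, factor):
    """Scale a dynamic-language line count by an assumed expansion factor."""
    return dynamic_sloc * factor


low = equivalent_static_sloc(DYNAMIC_SLOC, LOW_FACTOR)
high = equivalent_static_sloc(DYNAMIC_SLOC, HIGH_FACTOR)

# Print the range in millions of lines.
print(f"{low / 1e6:.1f}M to {high / 1e6:.1f}M sloc")
```

Under those assumptions the range comes out at roughly 16 to 50 million lines, so the 30Msloc figure sits somewhere in the middle. The Java figure implies a multiplier of 20 to 30, which is my guess at the superlinear growth and not something this range covers.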

Doesn’t sound so good for either camp anymore, does it?

How well do statically typed languages scale to such system sizes? I doubt the type system will help you much at that point. The problem is not fundamentally the type system. The problem is that large systems inherently become unmanageable once the modelled domain passes a certain point of complexity. If the chosen languages differ mainly in their choice of type system, then that point will be roughly the same, regardless of the size of the codebase itself.

I’ve said it before: I think the safety conferred by static typing is a self-fulfilling prophecy. Systems written in such languages grow so fast that you need the proffered safety to cope. Conversely, when you abstain from asking for such safety by using a dynamically typed language, you will in turn be rewarded by much slower codebase growth that negates the urgent need for type safety.

I think this fundamental problem is the cause of the religious war between the camps: those in the static typing camp seem to have a blind spot here in that they assume that the typical system growth rate they are used to is universal.

But it’s not. It’s a dependent variable.