In a recent Huffington Post article I calculated the probability that Boston will get even more snow than its current historical record. The function I ended up with was
where s is the additional inches of snow beyond the 45.5 inch record set on Saturday (February 15th). Let me explain how I got that function.
First, Some Background in Probability
Let’s start with some basics. The total snowfall in any given month is a variable. But it’s not your “usual” kind of variable; it’s what we mathematicians call a “continuous random variable.” These types of variables have a function associated with them, called a “probability density function” or “PDF,” that tells us the probability that the random variable’s value will fall within a certain range (the “continuous” part means that the values of the random variable can be any real number). In our case the random variable of interest is the total February snowfall in Boston, which I’ll denote by S; I’ll denote its associated PDF by p(x). Then the probability that S is between, say, 45.5 inches and 45.5+h inches, where h is small (like, way less than 1), is approximately
$$P(45.5 \le S \le 45.5 + h) \approx p(45.5)\,h.$$
To approximate the probability that S is between, say, 45.5 and 46.5 we’d need to add up many terms like the one on the right-hand side. One such approximation is
$$P(45.5 \le S \le 46.5) \approx p(45.5)(0.1) + p(45.6)(0.1) + \cdots + p(46.4)(0.1).$$
(This is the case when all h’s are equal to 0.1.) If we want the exact answer we need to add up infinitely many terms, each corresponding to an h-value that is infinitesimally small. The way we do that in calculus is by integrating. So, in calculus-speak,
$$P(45.5 \le S \le 46.5) = \int_{45.5}^{46.5} p(x)\,dx.$$
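To see the "add up many small terms" idea in action, here is a minimal Python sketch. The real PDF has not been introduced at this point, so the sketch assumes a placeholder exponential density with a made-up rate (`RATE = 0.1`); that value and the helper names `p`, `riemann_sum`, and `exact_probability` are illustrative assumptions, not the article's fit.

```python
import math

RATE = 0.1  # hypothetical rate, chosen only for illustration (NOT the fitted value)


def p(x):
    """Placeholder PDF: an exponential density with the assumed rate."""
    return RATE * math.exp(-RATE * x)


def riemann_sum(pdf, a, b, h):
    """Left-endpoint sum of pdf(x) * h over [a, b] using steps of width h."""
    n = round((b - a) / h)  # number of strips
    return sum(pdf(a + i * h) * h for i in range(n))


def exact_probability(a, b):
    """Exact integral of the placeholder PDF from a to b."""
    return math.exp(-RATE * a) - math.exp(-RATE * b)


coarse = riemann_sum(p, 45.5, 46.5, 0.1)     # ten terms, h = 0.1
fine = riemann_sum(p, 45.5, 46.5, 0.0001)    # many more, much smaller terms
exact = exact_probability(45.5, 46.5)

print("h = 0.1    :", coarse)
print("h = 0.0001 :", fine)
print("integral   :", exact)
```

Shrinking h moves the sum toward the value of the integral, which is the sense in which integration "adds up infinitely many terms."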
Now Back to Boston’s Snowy Month
If you’ve made it this far one thing is clear: we can’t calculate anything without the PDF. That’s where the data linked above comes in. By downloading the total snowfall column in the data into a spreadsheet we can create the histogram below.
This histogram tells us how frequently (in the 95 years between 1920 and 2014) the February snowfall total was between 0 and 5 inches, 5 and 10 inches, etc. (For example, the first bar says that the total snowfall was between 0 and 5 inches 26% of the time.) The black curve is the exponential function
where x is the total snowfall (in inches). This curve is Excel’s best fit to the data. It’s not perfect, but it does a better job than other fits (like a linear function).
To make f(x) into a PDF we need to make sure that the probability that S is between zero and infinity is 1. (Roughly speaking, this expresses the fact that all probabilities must add to 1.) In calculus jargon, this means that we first need to calculate the integral of f(x) between zero and infinity and then divide f(x) by that value. And since
Our PDF—which I’ll call p(x)—is therefore
(This is the PDF of an exponential distribution, a well-known PDF from probability theory that has many applications to business, physics, and engineering; see here for more.)
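The normalization step can be written out in general form. The fitted constants are not reproduced here, so the sketch below assumes the fit has the shape f(x) = A e^{-bx}, where A > 0 and b > 0 are placeholder symbols standing in for Excel's numbers.

$$\int_0^\infty A e^{-bx}\,dx = \left[-\frac{A}{b}\,e^{-bx}\right]_0^\infty = \frac{A}{b}, \qquad\text{so}\qquad p(x) = \frac{f(x)}{A/b} = b\,e^{-bx}.$$

Notice that the constant A cancels: under this assumption only the decay rate b of the fitted curve survives in the PDF.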
We’ve found our PDF…woohoo! We can now calculate the probability that S, the total snowfall amount in February in Boston, will be between zero and some other number y. As before, that’s just the following integral:
$$P(0 \le S \le y) = \int_0^y p(x)\,dx.$$
This is a pretty straightforward integral to calculate using a technique called u-substitution. The answer is
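For readers who want to see the u-substitution, here it is in general form. It assumes the exponential PDF is written as p(x) = b e^{-bx}, with b > 0 a placeholder for the fitted rate. Substituting u = -bx, so that du = -b dx, gives

$$\int_0^y b\,e^{-bx}\,dx = -\int_0^{-by} e^{u}\,du = -\left[e^{u}\right]_0^{-by} = 1 - e^{-by}.$$

As a quick check, this is 0 when y = 0 and approaches 1 as y grows, as a probability of this kind should.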
We’re almost done, I promise. The last step is to use the fact that 45.5 inches have already fallen. So, given that 45.5 inches of snow have already accumulated, what is the probability that s more inches will fall? Well, if we denote by y the total snowfall (i.e., y=s+45.5), then in math-speak we want to calculate
This is an example of a “conditional probability,” and by the laws of probability, this simplifies to
and in terms of integrals becomes
Finally, calculating and simplifying gives
This is the P(s) formula I gave at the start of the article. It’s amazing what the internet and math (oh, and Excel) can accomplish. Pretty neat, huh?
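The conditional-probability step can be sanity-checked with a short Python sketch. It rests on two assumptions that are mine: that the quantity of interest is the probability that S reaches at least 45.5 + s given that it has already reached 45.5, and that the PDF is exponential with a made-up rate (`RATE = 0.1`, not the fitted value). The names `survival` and `prob_additional_snow` are hypothetical helpers.

```python
import math

RATE = 0.1     # hypothetical rate, for illustration only (NOT the fitted value)
RECORD = 45.5  # inches already on the ground, from the article


def survival(x, rate=RATE):
    """P(S >= x) for an exponential PDF: the integral of p from x to infinity."""
    return math.exp(-rate * x)


def prob_additional_snow(s, rate=RATE, record=RECORD):
    """P(S >= record + s | S >= record), computed as a ratio of two tail probabilities."""
    return survival(record + s, rate) / survival(record, rate)


for s in (1.0, 5.0, 10.0):
    print(f"s = {s:4.1f} in: conditional = {prob_additional_snow(s):.4f}, "
          f"exp(-rate*s) = {math.exp(-RATE * s):.4f}")
```

Under these assumptions the ratio collapses to exp(-rate * s), so the 45.5 inches already fallen drop out of the answer. That is the "memoryless" property of the exponential distribution.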