How to pull off machine learning’s quickest trick, part 1: basic linear regression

African Centurion 2017: team ZaiLab walks the talk
4th December 2017
Ax in the CCTR, part 6: how to motivate your agents for the final push
8th December 2017

Using Google Sheets, it’s incredibly straightforward. But really that's the whole point.

So we spoke a short while back about linear regression – the simplest form of machine learning around. It’s so simple it’s hard to call it machine learning even, but it is, and we’re going to show you how to do it.

Linear regression – basically fitting a line to a set of data points – is super great when there’s a proportional link between certain values. Which is to say, when one goes up (or down), the other is likely to go down (or up) in response.

So this is a thing you can do in Excel. Or, hey, Google Sheets, which is what we’re going to use.

Here’s the easy way. And by the end of it, you'll have a much better understanding of just how cool and straightforward it is.

 

Step 1

Open a new sheet. Y’know.

 

Step 2

Remember our example of trying to predict the selling price of vintage Care Bear comics by their year of publication? We like Care Bears, so we’ll keep using the example if it’s all the same to you.

(You may feel free to adapt every step here to your own set of data. In fact we recommend it.)

So here you want to create two headings atop columns A and B: Comic Issue and Sales Price.

(Remember how we used publication date in our example last time? If we assume that the comics came out at the same frequency, we don't need to use the date in this example – we can just conveniently use the issue numbers. Score.)

 

Step 3

Beneath the comic issue column we’re just going to pop down the issue numbers, from 1 to whatever. These are your independent variables – independent because their changes aren't reliant on the price changes; they just march along like sequential numbers tend to do.

 

Step 4

Finally, under the sales price column we’re laying in our fictional selling prices. They kinda fluctuate but as you can see the comics definitely decrease in price the newer they are. (You’ll notice that we don’t have pricing data for all the issues. C’est la vie, yo.)

These are your dependent variables – in that the way they change is dependent on the date of publication (or in this example just the issue number).

 

Step 5

We’re getting there! What you want to do now is highlight the data (including the headings) in columns A and B. Now you’re going to select Insert (in the top menu), and click Chart.

You ought to get yourself something like the below. If Google Sheets is doing its homework, it should automatically be a scatter plot, which is what we want. If you don't get a scatter plot, follow step 6 – otherwise go straight to 7.

 

Step 6

See that right-hand column of settings? That’s the chart editor, and it’s what we’re hitting next. Click Chart Type, then scroll down until you see Scatter and fire that baby up.

That’s better. Now we’re going to add a trendline.

Step 7

Still in the chart editor, click the Customise tab and expand the section labelled Series. We’re gonna click that checkbox that says Trend Line – and lo, we have ourselves a trendline.

These are the very basics of linear regression – just fitting that line as best we can to a scattering of points that have some kind of direction to them. So if you’re looking for the price of issue 4, which we don’t have, you can just stick a pin in that there line and get a pretty good estimate of what it should be.

We’re not going to stop there, though. We’re going to show you another way to calculate that trendline (using scary-sounding but super simple things like slope and intercept). Then we’re going to show you an easy way to get a forecast for any particular date based on those two things.

 

Step 8

At the top of column D, write the heading Slope. Once cell down, write the heading Intercept. Now we’re going to use two easy Google Sheets functions. If this is your first time using functions, don’t you worry. You got this.

First, here’s the skinny on slopes and intercepts: the slope simply defines the angle of your trendline along the X axis (where the issue numbers are). The intercept is the point where the trendline crosses the Y axis (where the prices are). We'll chat more on this in a second.

So now let’s find out what the slope of your data is first. Click on the cell next to the Slope heading and just type:

=SLOPE(

Once you’ve hit that open-bracket key, the function is expecting you to tell it what data to use. This function first wants to know what your vertical, dependent data is, so go ahead and highlight column B (your prices).

Now hit the comma key, which lets the function know you’re about to give it its second bit of info, which in this case are your horizontal, independent variables. Highlight column A (your issue numbers).

Now hit the close-bracket key. You should have a formula that looks like this:

=SLOPE(B:B,A:A)

Just crush that enter key and oh snap look at that weird number you got! That’s your slope, and all it does is tell you how much to multiply your issue number by to get its offset on the graph. (But by itself it's not going to help us – because we don't know yet where to intercept that offset, get it?)

 

Step 9

We’re going to do almost exactly the same thing for the intercept. It also wants your dependent and independent variables, in that order.

Like the SLOPE function, you want an intercept function that looks like this:

=INTERCEPT(B:B,A:A)

Another weird number. But look, remember how you multiply your issue number by the slope to get its offset? All you gots to do now is add (simply add!) the intercept to that number, and shazam, you've got your price prediction.

Let's do that.

 

Step 10

Here's what your next formula is going to look like (first in plain English):

predicted price = slope × issue number + intercept

It's super simple, see? Now, underneath your Intercept heading, write Issue #. Under that, write Forecast.

After Forecast, we're going to use the following formula (remembering that our slope is in E1, our intercept is in E2, and the issue we want to predict on is in E3):

=E3*E1+E2

Now once you plug in an issue number, that simple little formula is going to tell you what your price prediction will be based on the line that was fitted to your data.

Let's try it with issue 2, which is missing its price in our set. We get $185, which is absolutely in the right ballpark.

 

Interlude

Look, we kind of led you slightly the long way around there. You don't have to calculate the slope and intercept individually to run your forecast. Google Sheets has a helpful function called FORECAST, see.

We're not sorry, though – because hopefully now you have a better understanding of how that forecast is generated. We'll let you figure out how to use the more straightforward function.

 

Step 11

Now, okay, we've got a bit of a problem here. Have you seen it? Let's plug in, say, issue 50 and see what our predicted price is.

We get $97 – but in the negative 🙁 Clearly nobody in their right mind is going to pay nearly a hundred dollars to give a cheap comic away (although stranger things have happened, we guess).

This is a limitation of linear regression. We don’t have enough data, so the line it’s projecting goes straight through the floor and keeps going.

And if we did have enough data, well. Let’s say that past a certain point your comics will never sell below a dollar each. If you have a few hundred of these comics, you’ll see that the data sort of curves down like a skateboard ramp toward zero but never actually reaches zero.

Trying to draw a straight line through this gives us a very inaccurate result. Most of its predictions will lie outside the truth. What we need is something else.

That skateboard-ramp? It’s an exponential curve, and as you might have guessed there is a function for predicting a value anywhere that too.

But that's for part 2, folks.