Predicting fuel efficiency with Elixir, Nx, and Axon: a gentle introduction to Machine Learning

Having slept on ML for the last few years, I woke up and decided I’d figure out what I’m missing. I’m a senior developer with 20+ years of mostly web and mobile experience. It’s a great time to explore the space, since Sean Moriarity and the Elixir team have introduced a whole suite of ML tools. Nx and Axon are shiny and new so I thought I’d take them for a spin. Big thanks to https://github.com/elucid and https://twitter.com/ghedamat for going through this with me.

There are many tutorials on ML but most seem to assume a deep knowledge of math, stats, and other elements of data science. After a little digging into the subject, I decided to write this article with workaday developers in mind, those of us without a science or stats background. I will bring you along with me as I dip my toes in the ML waters, and try to make something work from start to finish.

This imperfect, introductory exploration of machine learning is cobbled together from a number of resources (listed in the footnotes). It is intended to give a flavour of the type of work involved rather than to be a complete guide, and possibly to get you ready to work with a data scientist. I cover just one of the many types of machine learning algorithms out there. This is an example of “supervised learning” using “linear regression” and a “deep neural network”. But don’t stress about that.

I should also mention that I’m basing my example shamelessly on this TensorFlow tutorial which predicts the fuel efficiency of some cars (in MPG, miles per gallon). They use Python and TensorFlow; we’re going to try to use Elixir. I don’t care about cars, and you don’t need to either to make sense of this article.

Part 1, the data

The data looks something like this.

cylinders  displacement   horsepower    MPG   make/model
        8         307.0        130.0   18.0    "chevrolet chevelle malibu"
        8         350.0        165.0   15.0    "buick skylark 320"
        8         318.0        150.0   18.0    "plymouth satellite"
        8         304.0        150.0   16.0    "amc rebel sst"
        8         302.0        140.0   17.0    "ford torino"

We’re going to build a machine learning “model” that will be able to predict fuel efficiency (MPG) for any car for which we know the number of cylinders, the displacement and the horsepower. You can think of a model as a set of parameters that, when combined with a set of inputs, will return a prediction.

model = some_nebulous_parameters # what even are these?
prediction = predict(model, input)

These parameters are conceptually just a bunch of numbers that when set to “appropriate” values will cause inputs to turn into “appropriate” predictions.

Let’s explore this idea of tunable parameters a bit to get a sense of how they’ll be used. Say we’re trying to guide someone to either Grandma’s house or The Wolf’s den.

# destinations
grandmas_house = 0
wolfs_den = 1

# directions
left = -1.0
right = 1.0

# a function that returns the direction as a value from -1.0 to 1.0 
def predict(params, destination) do
  Enum.at(params, destination)
end

Given appropriate values, we get correct answers:

# appropriate params
params = [-1.0, 1.0]
predict(params, grandmas_house)
# => -1.0 # left

predict(params, wolfs_den)
# => 1.0 # right

If whoever (or whatever) assigns these parameters comes up with sensible values, then correct answers come out. If the parameters are wrong, we get junk out.

# random params
params = [0.03, 0.0]
predict(params, grandmas_house)
# => 0.03 # ??? a tiny bit left?

predict(params, wolfs_den)
# => 0.0 # ??? the middle? straight? nowhere?

And if we are kind of close, then the answer is kind of close too. We can see that this 3rd attempt is hinting strongly at the correct answer, but it’s not exactly correct like we’re used to in regular software.

# nearly right params
params = [-0.99, 0.98]
predict(params, grandmas_house)
# => -0.99 # Close enough to left that it may as well be left

predict(params, wolfs_den)
# => 0.98 # Close enough to right

Machine learning is getting software to continually improve the parameters for us, given exposure to inputs where we know the correct answer. ML would eventually be able to converge on appropriate parameters to be able to tell us which way to go to grandma’s house. This is done by having a function that can calculate how far off, and in which direction, a set of parameters is, and then looping over the data until, hopefully, the parameters settle on values that lead to outputs very close to the real thing.

This function we’re imagining is called a “loss function”, and this looping, measuring loss, and adjusting the parameters is what is called “training”.

Training

Let’s just jump right in and see what training looks like. This is the training loop. We give it our car data as inputs, along with the expected MPG for each of those cars, and it gives us back a set of parameters tuned so that combining them with the inputs produces something close to the expected MPG. Those outputs are our predictions.

def train(inputs, targets) do
  inputs_and_targets = Enum.zip(inputs, targets)

  # Start with a random set of parameters, they will produce
  # horrible predictions to start.
  params = randomize_params()

  # for each pair of input and target we have in our dataset
  Enum.reduce(inputs_and_targets, params, fn {car_input, actual_mpg}, params ->
    # make a prediction with the parameters we have
    predicted_mpg = predict(params, car_input)

    # get an idea of how close/bad our prediction was
    loss = calculate_loss(actual_mpg, predicted_mpg)

    # use this loss to hopefully improve our parameters
    adjusted_params = optimize(params, loss)

    # use these for the next round
    adjusted_params
  end)
end

I have a few more concepts to layer on before we actually try it out.

Epochs.

Instead of just running through our data-set of about 400 rows once, which isn’t very much data, we’ll do the whole process defined above N times. Epochs are simply the N here. It’s a number which we will choose to help get us an appropriate level of training. This will be discovered rather than known.

def train(epochs, inputs, targets) do
  inputs_and_targets = Enum.zip(inputs, targets)
  params = randomize_params()

  # do what we do above `epochs` number of times...
  Enum.reduce(1..epochs, params, fn _epoch, params ->
    Enum.reduce(inputs_and_targets, params, fn {car_input, actual_mpg}, params ->
      ... # same as before
    end)
  end)
end

Understanding tune-able parameters

Let’s look at the predict function for a moment. We’re going to be using something called “linear regression” to solve this problem. Why? It’s the only way I know currently, and as I understand, it is the easiest to come to grips with. So let’s take this one on faith. You may have come across this in high school math:

The equation of a straight line is:

y = m * x + b

https://www.mathsisfun.com/equation_of_line.html

Where m is the slope (angle) of the line and b is the “Y intercept” (how far up or down the Y axis when x is 0).

Pretend for a moment we were only trying to predict MPG based on horsepower. If we imagine a graph where x is horsepower and y is MPG, we could say something like:

mpg = m * horsepower + b

But what are m and b? Where do they come from? It turns out that these are the “parameters” that we’re trying to tune in our training process. So for this single, 2 dimensional problem, the m and the b are simply numbers (“scalars”). So you could imagine the training loop starting with random numbers for m and b, and slowly discovering values that approach the real correlation between horsepower and mpg.

Together, m and b represent what we’ve been calling “parameters”, and are commonly referred to as “weights” and “biases” respectively. If we start with some random weights and bias, we can get a value for mpg:

inputs = 130 # horsepower
weights = 0.15
bias = 5

mpg = weights * inputs + bias
# => 0.15 * 130 + 5
# => 19.5 + 5
# => 24.5
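To make the idea of “slowly discovering” m and b a little more concrete, here is a minimal sketch in plain Elixir that fits the line to the five sample cars from Part 1. It is not the code used later in this article. The module name LineFitSketch is made up, the starting point is zeros rather than random numbers so the run is repeatable, the two gradient formulas are derived by hand for a mean-squared-error loss, and each step averages over all five rows instead of visiting one row at a time.

```elixir
defmodule LineFitSketch do
  # {horsepower, mpg} pairs for the five sample cars shown in Part 1
  @rows [{130.0, 18.0}, {165.0, 15.0}, {150.0, 18.0}, {150.0, 16.0}, {140.0, 17.0}]

  def rows, do: @rows

  # Mean squared error of the line `m * x + b` over all rows.
  def loss({m, b}, rows) do
    rows
    |> Enum.map(fn {x, y} -> square(m * x + b - y) end)
    |> mean()
  end

  # One training step: nudge m and b against the slope of the loss.
  # The gradients are worked out by hand (an assumption of this sketch).
  def step({m, b}, rows, learning_rate) do
    grad_m = rows |> Enum.map(fn {x, y} -> 2 * (m * x + b - y) * x end) |> mean()
    grad_b = rows |> Enum.map(fn {x, y} -> 2 * (m * x + b - y) end) |> mean()

    {m - learning_rate * grad_m, b - learning_rate * grad_b}
  end

  # Repeat the step a fixed number of times.
  def train(params, rows, learning_rate, steps) do
    Enum.reduce(1..steps, params, fn _step, acc -> step(acc, rows, learning_rate) end)
  end

  defp square(value), do: value * value
  defp mean(values), do: Enum.sum(values) / length(values)
end

rows = LineFitSketch.rows()
start = {0.0, 0.0}
trained = LineFitSketch.train(start, rows, 1.0e-5, 100)

IO.inspect(LineFitSketch.loss(start, rows), label: "loss before")
IO.inspect(LineFitSketch.loss(trained, rows), label: "loss after")
```

The learning rate is tiny here because the raw horsepower values are large; with a bigger step the updates overshoot and the loss grows instead of shrinking. On these five rows the loss should fall from about 284 into single digits.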

Further, if you want to consider more than just 1 column at a time as input (say cylinders, displacement and horsepower), then we can no longer just use simple scalar numbers; we must move into 3 or more dimensions. This sounds kind of scary, but it just means our weights and biases (our m and our b) become matrices. The equation keeps the same form as the two-dimensional one, just with matrices in place of single numbers. Don’t remember how matrices work? Relax with this gentle ASMR explanation: https://youtu.be/YegPj0H6yDA

With three dimensions, and some randomly concocted values, it looks like:

cylinders = 8
displacement = 307.0
horsepower = 130.0

inputs = [[cylinders, displacement, horsepower]]
weights = [[0.01], [0.05], [0.06]]
bias = [-0.9]

mpg = inputs * weights + bias
# => [23.2] + [-0.9]
# => 22.3 mpg

If you try to run this in Elixir though, it doesn’t go well:

** (ArithmeticError) bad argument in arithmetic expression
     code: mpg = inputs * weights + bias
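Plain Elixir lists don’t know how to multiply themselves, so the arithmetic has to be spelled out. As a rough sketch of what the matrix version of m * x + b is doing, here is a hand-rolled multiply-and-add over lists of rows. MatrixSketch and its two functions are made-up names for illustration only, and the code makes no attempt to check that the shapes line up.

```elixir
defmodule MatrixSketch do
  # Multiply an (a x b) matrix by a (b x c) matrix, both given as lists of rows.
  def dot(left, right) do
    # Turn the right-hand matrix into a list of its columns.
    columns = right |> Enum.zip() |> Enum.map(&Tuple.to_list/1)

    for row <- left do
      for column <- columns do
        row |> Enum.zip(column) |> Enum.map(fn {x, w} -> x * w end) |> Enum.sum()
      end
    end
  end

  # Add a bias row to every row of a matrix, element by element.
  def add_bias(matrix, bias) do
    for row <- matrix do
      row |> Enum.zip(bias) |> Enum.map(fn {value, b} -> value + b end)
    end
  end
end

inputs = [[8, 307.0, 130.0]]
weights = [[0.01], [0.05], [0.06]]
bias = [-0.9]

# The result is a 1 x 1 matrix, so pull the single value out of it.
[[mpg]] = inputs |> MatrixSketch.dot(weights) |> MatrixSketch.add_bias(bias)
IO.inspect(mpg, label: "predicted mpg")
```

Each input is multiplied by its own weight, the products are summed, and the bias is added at the end, which comes to roughly 22.3 for these values.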

This is where the Nx library comes in. With it we can do matrix math and a whole pile of other things, as we’ll see.

  defp deps do
    [
      {:nx, "~> 0.1.0-dev", github: "elixir-nx/nx", sparse: "nx", override: true}
    ]
  end

inputs = [[cylinders, displacement, horsepower]] |> Nx.tensor()
weights = [[0.01], [0.05], [0.06]] |> Nx.tensor()
bias = [-0.9] |> Nx.tensor()

mpg =
  inputs
  |> Nx.dot(weights)
  |> Nx.add(bias)
  |> Nx.to_flat_list()
  |> List.first()

# => 22.3

So this set of arbitrary weights and biases predicts an MPG of 22.3 when given 8 cylinders, a displacement of 307 and 130 horsepower as inputs. If we pick another set of weights and biases, we get a different result:

weights = [[0.2], [0.2], [0.2]] |> Nx.tensor()
bias = [0.5] |> Nx.tensor()

mpg =
  inputs
  |> Nx.dot(weights)
  |> Nx.add(bias)
  |> Nx.to_flat_list()
  |> List.first()

# => 89.5

We can see from this that when we change the weights and biases, the prediction changes. What values should we use? We don’t really know, but if we have inputs and the correct answer for those inputs, and we have a method of measuring how far off our predictions are from the correct answers, then we can keep trying new values until the loss goes down to an “appropriate” level. This is the essence of machine learning, or at least this one type of machine learning.

What is Nx.tensor() all about? What are tensors? A tensor is a data structure that allows lists and matrices to be expressed in an efficient format. It includes the data, as well as metadata expressing the type (s64, a signed 64-bit integer) and the shape of the data, a 3 x 1 matrix in this case:

Nx.tensor([[1], [2], [3]])

#Nx.Tensor<
  s64[3][1]
  [
    [1],
    [2],
    [3]
  ]
>

Parsing

Knowing a bit about the process, let’s turn back to preparing our data. We pull our data out of the file line-by-line and use binary pattern matching to return a list of values for each line. The output will be a list of lists, to represent a 2-dimensional table of values.

# parse the data
parsed =
  filename
  |> File.stream!()
  |> Enum.map(&convert/1) # split each line using binary pattern matching...
  |> Enum.reject(fn r -> r == :doesnt_parse end) # toss out bad rows

def convert(line) do
  <<
    mpg::binary-size(7),
    cylinders::binary-size(4),
    displacement::binary-size(11),
    horsepower::binary-size(11),
    _::binary
  >> = line

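  [
    mpg |> String.trim() |> String.to_float(),
    cylinders |> String.trim() |> String.to_integer(),
    displacement |> String.trim() |> String.to_float(),
    horsepower |> String.trim() |> String.to_float()
  ]
rescue
  # String.to_float/1 and String.to_integer/1 raise ArgumentError on values
  # they can't parse; a line too short for the pattern above raises MatchError.
  _e in [ArgumentError, MatchError] ->
    :doesnt_parse
end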
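To see the fixed-width slicing on its own, here is a small sketch that only cuts the fields out and trims them, without converting anything to numbers. FixedWidthSketch is a made-up name, and the sample line is built by hand with String.pad_trailing/2 so that it matches the 7, 4, 11 and 11 byte widths used above; it is not a line copied from the real data file.

```elixir
defmodule FixedWidthSketch do
  # Slice four fixed-width fields off the front of a line and trim the padding.
  def fields(line) do
    <<mpg::binary-size(7), cylinders::binary-size(4), displacement::binary-size(11),
      horsepower::binary-size(11), _rest::binary>> = line

    Enum.map([mpg, cylinders, displacement, horsepower], &String.trim/1)
  end
end

# A made-up line padded to the same widths as the pattern (7, 4, 11, 11)
line =
  String.pad_trailing("18.0", 7) <>
    String.pad_trailing("8", 4) <>
    String.pad_trailing("307.0", 11) <>
    String.pad_trailing("130.0", 11) <> "everything else on the line"

IO.inspect(FixedWidthSketch.fields(line))
# => ["18.0", "8", "307.0", "130.0"]
```

Each field is still a string at this point. A field holding something non-numeric, such as "?", would make String.to_float/1 raise an ArgumentError, which is what the rescue clause in convert/1 is there to catch.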

Next we’ll split the columns we will use as input into their own list of lists. We’re also going to split those into 2 sets of rows:

  1. The training set, which should be the bulk of the data. We’re going to use 99% in this case. We’ll use this data to “train” the machine learning model.
  2. The test set, which is the remaining 1%. We’ll use this data once we’ve created our model to test that it is able to predict the MPG on cars it was not trained on. We can use the known MPG in the test set to verify that our predictions are not too far off from the real values.

# Columns to use as input
input_columns = [
  "Cylinders",
  "Displacement",
  "Horsepower"
]

# The ratio of test rows to training rows
test_train_ratio = 0.01 # use 1% of the rows to test our model against.

{test_inputs, train_inputs} =
  parsed
  |> slice_columns(input_columns) # create a new list of lists with these columns
  |> HackyTools.hacky_normalize() # see "Normalization" below
  |> Enum.map(&Nx.tensor/1) # We'll now have a list of 1x3 tensors
  |> split(test_train_ratio) # split the rows into test and train
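The split helper isn’t shown in this article, so here is one way it might look, as a sketch under assumptions. SplitSketch is a made-up name, the function takes the test rows off the end of the list, and it returns {test, train} in that order to match the destructuring above.

```elixir
defmodule SplitSketch do
  # Returns {test_rows, train_rows}; the test rows come off the end of the list.
  def split(rows, test_ratio) do
    test_count = round(length(rows) * test_ratio)
    {train_rows, test_rows} = Enum.split(rows, length(rows) - test_count)
    {test_rows, train_rows}
  end
end

{test_rows, train_rows} = SplitSketch.split([:a, :b, :c, :d, :e], 0.4)
IO.inspect(test_rows, label: "test")    # => test: [:d, :e]
IO.inspect(train_rows, label: "train")  # => train: [:a, :b, :c]
```

This version keeps the rows in file order. Shuffling them first (with Enum.shuffle/1, say) is a common choice so the test set isn’t just whatever happens to sit at the end of the file, but the inputs and targets would then need to be shuffled together so they stay paired up.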

Normalization

Normalization means taking the columns of your data and transforming them so they have the same scale. This prevents a column whose values are large from disproportionately affecting the result when compared with a column whose values are small. Normalization gets us a level playing field. Concretely, if we normalize just the first 5 rows, we’d go from this:

Cylinders Displacement Horsepower Name
        8        307.0      130.0 chevrolet chevelle malibu
        8        350.0      165.0 buick skylark 320
        8        318.0      150.0 plymouth satellite
        8        304.0      150.0 amc rebel sst
        8        302.0      140.0 ford torino

to this:

Cylinders Displacement Horsepower Name
      1.0      0.10416    0.00000 chevrolet chevelle malibu
      1.0      1.00000    1.00000 buick skylark 320
      1.0      0.33333    0.57142 plymouth satellite
      1.0      0.04166    0.57142 amc rebel sst
      1.0      0.00000    0.28571 ford torino

You can see that the highest value in each column becomes 1.0, the lowest 0.0, and the rest fall in between. The Skylark has the highest displacement, the Torino has the lowest.

Normalization is outside the scope of this article, but the math is fairly simple and you can check the full source to my example to see how I hacked my way through it. My feeling is that this is something I should be able to do in Nx, but I couldn’t figure it out, so I made my own naive solution. Feedback welcome!
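For the curious, the min-max scaling shown in the table above can be sketched in a few lines of plain Elixir. This is not the HackyTools code from the full source. MinMaxSketch is a made-up name, and mapping a column whose values are all the same (like Cylinders in these five rows) to 1.0 is an assumption made only so the output matches the table; 0.0 would be just as defensible.

```elixir
defmodule MinMaxSketch do
  # Scale every column of a list of rows into the range 0.0..1.0.
  def normalize(rows) do
    # Find the {min, max} of each column.
    ranges =
      rows
      |> Enum.zip()
      |> Enum.map(fn column -> column |> Tuple.to_list() |> Enum.min_max() end)

    Enum.map(rows, fn row ->
      row
      |> Enum.zip(ranges)
      |> Enum.map(fn {value, {min, max}} -> scale(value, min, max) end)
    end)
  end

  # A constant column has no range to scale over (assumption: call it 1.0).
  defp scale(_value, min, max) when min == max, do: 1.0
  defp scale(value, min, max), do: (value - min) / (max - min)
end

rows = [
  [8, 307.0, 130.0],
  [8, 350.0, 165.0],
  [8, 318.0, 150.0],
  [8, 304.0, 150.0],
  [8, 302.0, 140.0]
]

rows |> MinMaxSketch.normalize() |> Enum.each(&IO.inspect/1)
```

Each value has its column’s minimum subtracted and is then divided by the column’s range, so the smallest value lands on 0.0 and the largest on 1.0.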

The target column

We’ll also split the target column (MPG) off into its own table, and its own set of training and testing values. The target is the value we are trying to predict.

# The target column
target_column = "MPG"

{test_targets, train_targets} =
  parsed
  |> slice_columns([target_column]) # create a table with just the MPG values
  |> Enum.map(fn a -> Nx.tensor([a]) end) # We'll have a list of 1 x 1 tensors
  |> split(test_train_ratio)

In the end we’ve split our data into 4 parts:

Training inputs                       |                     Training targets

cylinders  displacement   horsepower  |   MPG    Name
        8         307.0        130.0  |  18.0    chevrolet chevelle malibu
        8         350.0        165.0  |  15.0    buick skylark 320
        8         318.0        150.0  |  18.0    plymouth satellite
-----------------------------------------------------------------------------
        8         304.0        150.0  |  16.0    amc rebel sst
        8         302.0        140.0  |  17.0    ford torino
Testing inputs                        |                      Testing targets  

Training the model

Once we have the data parsed and split, we can pass it to train/2 to create our model.

From there we can print out a few predictions.

# train the model
model = train(train_inputs, train_targets)

# make some predictions
test_inputs
|> Enum.zip(test_targets)
|> Enum.each(fn {car_input, actual_mpg} ->
  predicted_mpg =
    predict(model, car_input)
    |> scalar()

  Logger.info("Actual: #{scalar(actual_mpg)}. Predicted: #{predicted_mpg}")
end)

Training is much as we had it above. Loop over the Epochs and data, try to find better fitting parameters.

@learning_rate 0.01

def train(training_data, targets) do
  init_params = init_random_params()

  data = Enum.zip([training_data, targets])

  Enum.reduce(1..@epochs, init_params, fn epoch, params ->
    IO.write("#{epoch} ")

    Enum.reduce(data, params, fn {input, target}, cur_params ->
      update(cur_params, input, target)
    end)
  end)
end

defn update({m, b} = params, input, target) do
  {grad_m, grad_b} = grad(params, &loss(&1, input, target))

  {
    m - grad_m * @learning_rate,
    b - grad_b * @learning_rate
  }
end

defn loss(params, car_input, actual_mpg) do
  predicted_mpg = predict(params, car_input)
  Nx.mean(Nx.power(actual_mpg - predicted_mpg, 2))
end

defn predict({weights, bias}, car_input) do
  car_input
  |> Nx.dot(weights)
  |> Nx.add(bias)
end

The update function calls the Nx-provided grad function, which takes the parameters and a loss function that we define. grad/2 returns the gradient of the loss with respect to the weights and bias: for each parameter, how quickly, and in which direction, the loss changes as that parameter changes. Nx works this out by automatically differentiating our loss function, not by running random trials. Because the gradient points in the direction that makes the loss bigger, we subtract it, and instead of taking a full step we take only a tiny one. The 0.01 here is what’s called the “learning rate”, and it’s there to make sure we take small steps towards our goal on every iteration.
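If the word “gradient” feels abstract, it may help to see it as nothing more than a slope. The sketch below estimates that slope numerically for a single weight by nudging it a little in each direction, then takes one small step downhill. This is only an illustration: Nx’s grad computes the exact derivative rather than estimating it this way, and SlopeSketch, the single made-up data point, and the starting weight are all assumptions of the example.

```elixir
defmodule SlopeSketch do
  # Squared error for one made-up car, where the prediction is just m * x.
  def loss(m, x, y), do: (m * x - y) * (m * x - y)

  # Estimate the slope of the loss at m by nudging m a little each way.
  def slope(m, x, y, h \\ 1.0e-4) do
    (loss(m + h, x, y) - loss(m - h, x, y)) / (2 * h)
  end
end

x = 0.5 # a normalized horsepower (made up)
y = 18.0 # the actual MPG
m = 10.0 # a bad starting weight
learning_rate = 0.01

slope = SlopeSketch.slope(m, x, y)

# Step against the slope: a negative slope means increasing m lowers the loss.
new_m = m - learning_rate * slope

IO.inspect(slope, label: "slope of the loss at m")
IO.inspect(SlopeSketch.loss(m, x, y), label: "loss before the step")
IO.inspect(SlopeSketch.loss(new_m, x, y), label: "loss after the step")
```

The slope comes out at about -13, so the step nudges m up slightly and the loss drops a little. Repeating that many times is all the training loop does.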

The loss function takes both the training inputs and actual values, makes a prediction using only the inputs and then measures how far the predictions were from the actual values. There are a number of ways to measure the accuracy of predictions, but mean squared error is fairly common. It squares the difference between predicted and actual values before averaging so that positive and negative differences don’t cancel out and suggest a higher degree of accuracy than is warranted.
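A tiny example shows why the squaring matters. In this sketch (plain Elixir, with LossSketch and the two made-up predictions being assumptions of the example), one prediction is 2 MPG too high and the other is 2 MPG too low.

```elixir
defmodule LossSketch do
  # Plain average of the differences: positive and negative errors cancel.
  def mean_error(actual, predicted) do
    actual
    |> Enum.zip(predicted)
    |> Enum.map(fn {a, p} -> a - p end)
    |> mean()
  end

  # Mean squared error: every miss counts, whichever direction it is in.
  def mean_squared_error(actual, predicted) do
    actual
    |> Enum.zip(predicted)
    |> Enum.map(fn {a, p} -> (a - p) * (a - p) end)
    |> mean()
  end

  defp mean(values), do: Enum.sum(values) / length(values)
end

actual = [18.0, 16.0]
predicted = [20.0, 14.0] # one guess 2 MPG too high, one 2 MPG too low

IO.inspect(LossSketch.mean_error(actual, predicted))
# => 0.0
IO.inspect(LossSketch.mean_squared_error(actual, predicted))
# => 4.0
```

The plain average reports a perfect score of 0.0 even though both predictions are wrong, while the mean squared error reports 4.0.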

Finally, predict is just as I described above: we multiply the input by the weights and then add the bias.

How does it do then? When we run our hand-rolled implementation, we see the epochs fly by fairly quickly and then it outputs:

15:07:21.031 [info]  Actual: 18.0. Predicted: 22.534465789794922
15:07:21.034 [info]  Actual: 15.0. Predicted: 19.156362533569336
15:07:21.034 [info]  Actual: 18.0. Predicted: 20.994869232177734
15:07:21.034 [info]  Actual: 16.0. Predicted: 21.397958755493164

That’s not too bad right? Our model is able to predict fuel efficiency within a few MPG of the real values, so it’s sorta working. It seems like our predictions are always a bit high, but for a first go, this is great.

Making it easy on ourselves.

I’ve shown you a very simplistic implementation so far. I’ve ignored many of the details in order to show the process more clearly. In practice though we want to choose a pre-made loss function, we can also use an off-the-shelf optimizer which is something that automates the grad and updating for us, and we’d like to try to tighten up our predictions if we can. The Axon framework offers all of that and then some, so before continuing, we’ll port our current model over to Axon. It has all the same pieces, just different abstractions.

The Axon version

def train(inputs, targets) do
  # How many times we're going to go over the training data
  epochs = 30

  # How big or small of a step to take towards
  # what the optimizer tells us is the right direction
  learning_rate = 0.01

  # the same loss function as before, but make Axon do it
  loss = :mean_squared_error

  # an optimizer will do the updating and grad for us
  optimizer = Axon.Optimizers.adamw(learning_rate)

  # define the shape of our model
  model =
    Axon.input({nil, Enum.count(@input_columns)})
    |> Axon.dense(1)

  # let Axon do the training and updating of params for us.
  %{params: trained_params} =
    model 
    |> Axon.Training.step(loss, optimizer)
    |> Axon.Training.train(inputs, targets, epochs: epochs)
	
  {model, trained_params}
end

def predict({model, trained_params}, car_input) do
  model
  |> Axon.predict(trained_params, car_input) # Have Axon do the work
  |> Nx.to_flat_list() # we get a 1 x 1 Tensor back, make it a list
  |> List.first() #  Take the only value out of the list
end

Axon.input/1 wants to know the shape of the incoming data so that it can line up the matrix shapes for us. I’m telling it to expect 3 values per row as input, and to give us an output layer as a “dense layer”, which we’ll cover soon. The nil stands in for the batch dimension, that is, the number of rows passed in at once, which is left unspecified so the model accepts any number of rows.

When it’s run, we get back very similar results:

03:03:05.826 [info]  Actual: 18.0. Predicted: 21.5460205078125
03:03:05.826 [info]  Actual: 15.0. Predicted: 23.39251708984375
03:03:05.826 [info]  Actual: 18.0. Predicted: 22.440349578857422
03:03:05.826 [info]  Actual: 16.0. Predicted: 22.274457931518555

Improving the model

I mentioned previously that our predictions aren’t terribly accurate. I’m by no means an expert, but my assumption is that there aren’t enough tweak-able parameters for the model to use. With a 3x1 matrix and a single value for the bias, that leaves only 4 parameters to tweak.

One thing we can do to give the model more parameters to tweak and more ways to express nuance is to give it additional sets of weights and biases. With more weights and biases to tweak, we have more parameters that our model can tune which will hopefully allow us to make more accurate predictions. We call these sets of weights and biases layers. So instead of a 1x3 matrix for inputs, a 3x1 matrix for weights, and a 1x1 matrix bias, we might go from 1x3 inputs to the first layer with a 3x10 matrix for weights, a 1x10 matrix for the bias, and then to a second set of weights with a 10x1 matrix and a 1x1 bias. The choice of layer dimensions is somewhat arbitrary. The only requirement is that the shapes of the input, layer, and output matrices must line up. The main thing to understand is that we’re moving from:

input > params > output

to:

input > layer1 > layer2 > output

You can add as many layers as you want, but the “shapes” of the matrices must line up. You can review the rules of matrix multiplication on your own, but the way I think of it is that if you have two matrices [a x b] and [b x c], the output matrix will have shape [a x c] and also the two “inner” values (the ‘b’s) must be the same.

# with one layer:
[1 x 3] * [3 x 1]
[1 x 1]

# with two layers:
[1 x 3] * [3 x 10] * [10 x 1]
[1 x 10] * [10 x 1]
[1 x 1]

The choice to use 10 columns for the first layer’s weights is arbitrary. You could replace all the 10s above with 5s and it would also work.
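The “inner values must match” rule is easy to check mechanically. Here is a small sketch that threads an input shape through a list of weight shapes and reports where things stop lining up. ShapeSketch and its function names are made up for illustration, and shapes are written as {rows, columns} tuples.

```elixir
defmodule ShapeSketch do
  # Shape of multiplying an {a, b} matrix by a {b, c} matrix. The repeated `b`
  # in the first clause only matches when the two inner sizes are equal.
  def dot_shape({a, b}, {b, c}), do: {:ok, {a, c}}
  def dot_shape(left, right), do: {:error, {:shapes_do_not_line_up, left, right}}

  # Thread an input shape through a list of weight shapes, one layer at a time.
  def chain(input_shape, weight_shapes) do
    Enum.reduce_while(weight_shapes, {:ok, input_shape}, fn weights, {:ok, shape} ->
      case dot_shape(shape, weights) do
        {:ok, _shape} = ok -> {:cont, ok}
        error -> {:halt, error}
      end
    end)
  end
end

IO.inspect(ShapeSketch.chain({1, 3}, [{3, 10}, {10, 1}]))
# => {:ok, {1, 1}}
IO.inspect(ShapeSketch.chain({1, 3}, [{3, 5}, {5, 1}]))
# => {:ok, {1, 1}}
IO.inspect(ShapeSketch.chain({1, 3}, [{3, 10}, {5, 1}]))
# => {:error, {:shapes_do_not_line_up, {1, 10}, {5, 1}}}
```

The first two calls show that 10 and 5 both work as the middle size, as long as it is used consistently. The third mixes them and fails at the second layer.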

There are other aspects of layers that I have not covered yet. Layers that are not the inputs or outputs are deemed “hidden layers”, and the presence of hidden layers means the neural network is a “deep neural network (DNN)”. The layers we’re adding here are “dense” layers, meaning every value coming in is connected to every value going out through its own weight.

Adding one or more of these layers is really easy with Axon:

model =
      Axon.input({nil, Enum.count(@input_columns)})
      |> Axon.dense(10)
      |> Axon.dense(1)

Boom. We have now introduced a 3 x 10 matrix for our model to play with. Running it gives:

03:21:12.466 [info]  Actual: 18.0. Predicted: 19.242849349975586
03:21:12.466 [info]  Actual: 15.0. Predicted: 16.678634643554688
03:21:12.466 [info]  Actual: 18.0. Predicted: 18.168617248535156
03:21:12.466 [info]  Actual: 16.0. Predicted: 18.571996688842773

That’s closer!
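To put a number on how much extra room the new layer gives the model, here is a quick sketch that counts tunable parameters from the layer widths. ParamCountSketch is a made-up name, and the count assumes every dense layer has one weight per input/output pair plus one bias per output, as in the matrix shapes described above.

```elixir
defmodule ParamCountSketch do
  # layer_sizes lists the width of the input followed by each dense layer,
  # e.g. [3, 10, 1] for 3 inputs, a 10-wide layer, then a single output.
  def count(layer_sizes) do
    layer_sizes
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [inputs, outputs] -> inputs * outputs + outputs end)
    |> Enum.sum()
  end
end

IO.inspect(ParamCountSketch.count([3, 1]))
# => 4
IO.inspect(ParamCountSketch.count([3, 10, 1]))
# => 51
```

So the single-layer model had 4 numbers to tune, and the two-layer version has 51.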

Thanks

If you’ve made it this far, thanks for reading! Please don’t hesitate to get in touch if you have questions or answers or want to poke holes in my explanations. I welcome the discussion and the opportunity to learn more about this.

Grox.io has some upcoming content on Nx and Axon, which promises to be interesting. https://twitter.com/redrapids/status/1423678792246169601?s=20

Hire me!

As a consultant I like to keep an eye out for interesting people to work with and projects to work on. If you or your team need help on a project, get in touch. I have been a developer for 20+ years and have lots of experience in web and mobile development in Elixir, Ruby, Android. Let’s have a chat about how I can help you.

Resources