I’ve spent the past two weeks just starting to get a handle on how to use the popular low-level, memory-safe programming language Rust. In general, it gives me C++ vibes when I’m programming in it. That said, I think that it could have good potential as a machine learning development environment, so long as it is supplied with the appropriate external libraries (or “crates,” to use Rust’s terminology.)

So, to get a handle on how I could use Rust specifically
for machine learning purposes, I did some (very brief) research into what tools
might be useful. I knew that I wanted to make use of an *auto-differentiation*
tool, as this is essentially the bedrock of a machine learning dev environment.
Auto-differentiation allows us to compute partial derivatives of the output of
our models such that we can tweak and tune them into being better predictors. I looked
over three different auto-differentiation crates:

`easy-ml`

: This is actually a much more comprehensive machine learning crate with lots of out-of-the-box machine learning implementations which are easy to implement. It also contains a module,`differentiation`

which houses its auto-differentiation engine. This module could suffice for my purposes, but I don’t like the overhead of needing to install all of the extra bells and whistles provided by`easy-ml`

just to get access to its`differentiation`

module. I’d rather use a crate which contains only the necessary tools & traits for auto-differentiation in particular.`autodiff`

: The problem with using this crate in particular is that it is mostly tailored toward calculus on scalar functions, whereas most machine learning applications require higher-dimensional differentiation in the form of vectors, matrices and scalars. I don’t know if`autodiff`

is necessarily incapable of implementing these features, but most of their examples and documentation emphasize their scalar equivalents.`autograd`

: Of the above two alternatives,`autograd`

certainly takes the cake for popularity and satisfies my complaints about the other two. The crate implements auto-differentiation on the standard Rust matrix/tensor crate,`ndarray`

, which can allow for maximal cross-compatibility with other matrix-related crates. Moreover, it is very well documented and small enough for me to be able to scan the source code to answer questions not addressed by the documentation. Moreover, it’s easily extendible for custom tensor operations and I/O pipelines.

So, I decided to go with `autograd`

for my auto-differentiation
engine. I could have gone deeper into my research to find other alternatives, but
these three appeared to be the most popular (according to crates.io), and `autograd`

serves all of the basic purposes which I am looking for.

### Linear Regression as a Neural Network

The simplest machine learning application I could come
up with to test this crate out (which wasn’t already done in their tutorials collection)
was a simple linear regression neural network. For those who don’t know, a neural
network is essentially a statistical model which recursively applies a linear regression
and some nonlinear *activation* function. So, if I wanted to use a neural network
for the purposes of a simple, one-dimensional linear regression, I would use a 1D
input layer connected directly to a 1D output layer with no activation function.
The weight parameter of the single neuron in this network corresponds to the slope
of our linear regression, while its bias parameter is the offset. This is why the
output equation of our model would look exactly like the formula for a line:

In a diagram, this linear regression model looks like the neural network pictured below:

Obviously, if this was all I wanted to use my auto-differentiation
engine for, it would be woefully overpowered, but this is just a simple “hello world.”
More generally, if I really wanted (only) to implement a linear regression algorithm
in Rust, I would just use the closed-form formula for a linear regression and use
`ndarray-linalg`

for matrix multiplications and inverses to implement it. However,
in future, I’d like to construct more complicated neural models which will require the
full power of `autograd`

, so this is a small proof-of-concept which allows me to get
comfortable with the crate.

Now that I have a basic idea of what sort of model I’m
going to be building, I can start writing some Rust code.

### Implementing the Model with AutoGrad

The main example from `autograd`

’s GitHub repo with which
my project has the most overlap is the `sine.rs`

example, which regresses a multi-layer perceptron (MLP) to the sine function.
Because I’m doing something simpler, I’ll only be following it insofar as it is a
useful reference for how regression works generally in `autograd`

.

The first thing I do is import a handful of modules from the crate:

```
use autograd as ag;
use ag::optimizers::*;
use ag::optimizers::adam::Adam;
use ag::prelude::*;
use ag::tensors_ops::*;
```

The first of these just assigns a nickname, `ag`

, which I can use in place of the
longer namespace, `autograd`

. After that, I’m importing tools for optimizers, the
Adam optimizer specifically, some high-level traits from `autograd`

’s prelude, and
the `tensor_ops`

module.

Next, I create my `main`

function, and start initializing the core engine of `autograd`

:
the so-called `VariableEnvironment`

, along with an instance of a random number generator.

```
fn main() {
// Initialize VariableEnvironment and (default seed) RNG
let mut env = ag::VariableEnvironment::new();
let rng = ag::ndarray_ext::ArrayRng::<f32>::default();
// --snip--
```

In short, the `VariableEnvironment`

is a container which
manages all of the variables which we use in learning and keeps track of how they
change throughout the learning process. In setting it up, I register our two learned
parameters in the `VariableEnvironment`

, which involves associating them with a name:

```
// --snip--
// Initializing our parameter tensors
env.name("w").set(rng.standard_uniform(&[1, 1]));
env.name("b").set(ag::ndarray_ext::zeros(&[1, 1]));
// --snip--
```

With `env.name()`

, we are registering a new `NamedVariableSlot`

in the environment,
which we then immediately give an initial value with the `set()`

method. The arguments
of the functions inside of the `set`

call are each references to arrays of `usize`

,
each of which state that we are initializing `w`

with a \(1\times 1\) tensor of
random values, and that we are initializing `b`

with a \(1\times 1\) tensor of zeros.

So, our next step in initializing the model is creating an
instance of an optimizer which governs how we use the partial derivatives we compute
in order to reduce loss and predict outputs more accurately. We’re using the Adam
optimizer for this, which we give some parameters to:

```
// --snip--
// Initializing our optimizer (Adam)
let adam = Adam::new(0.01, // Learning rate (alpha)
1e-08, // Error Toler. (epsilon)
0.9, // First Moment decay (beta1)
0.999, // Second Moment decay (beta2)
env.default_namespace().current_var_ids(), // Variable IDs from VariableEnvironment namespace
&mut env, // VariableEnvironment instance
"linear_reg_adam"); // Name string
// --snip--
```

All of the parameters I’ve put in above are generally considered
to be the “default” parameters for most use cases of Adam, but I’ve beefed up the
learning rate (\(\alpha\)) by a factor of 10 to speed up learning on such a dimensionally
small model. It’s also noteworthy that I need to provide the optimizer with the Variable
IDs from the `VariableEnvironment`

(so it knows what our learned parameters are), as
well as a *mutable* reference to the environment. We also supply it with a name.

Now I start thinking about the training process. For the purposes
of this project, I’d like to train the linear regressor in batches of size \(64\), over the
course of \(1000\) epochs (i.e. batches). So I just quickly define some constants for
these quantities:

```
// Initializing training constants
const N_EPOCHS : usize = 1000;
const BATCHSIZE : usize = 64;
```

At this point, I’m almost ready to code up the training loop.
But it feels like I’m forgetting something; perhaps the actual training data itself?
Where am I getting that data from? Because we’re working totally with a synthetic dataset
for this project, it suffices to define some “true” values we expect our network to learn,
then use those values (and some data noise) to generate data points as we train. So,
I define the parameters I’d like to learn:

```
// Setting up the (synthetic) dataset
let true_w = 2.5f32;
let true_b = -1.0f32;
let noise_scale = 0.5f32;
```

Now we can jump into the meat of our algorithm: the training
loop. This consists of a `for`

-loop to iterate over the epochs, along with the following
contents which I will explain:

```
// --snip--
for epoch in 0..N_EPOCHS {
env.run(|ctx| {
let x = standard_uniform(&[BATCHSIZE, 1], ctx) * 20. - 10.; // Randomly generate input coords
let w = ctx.variable("w"); // Retrieve variables
let b = ctx.variable("b");
let y = true_w * x + true_b +
standard_normal(&[BATCHSIZE, 1], ctx) * noise_scale; // Compute expected output (w/ noise)
let z = matmul(x, w) + b; // Compute LinReg predictions
let mean_loss = reduce_mean(square(z - y), &[0], false); // Compute Mean Sq Error Loss
let ns = ctx.default_namespace();
let (vars, grads) = grad_helper(&[mean_loss], &ns); // Compute gradients for variables
let update_op = adam.get_update_op(&vars, &grads, ctx); // Compute weight/bias change
let results = ctx.evaluator().push(mean_loss).push(update_op).run(); // "Push" both loss and changes to evaluation buffer
match epoch % 100 {
0 => println!("Mean Loss (epoch {}) = {}", epoch, results[0].as_ref().unwrap()), // Print training loss every 100 epochs
_ => ()
}
});
}
// --snip--
```

Here we note that we are passing a function to our `VariableEnvironment`

via the function `run()`

. This function expects its argument to be a function with an argument
of a type called `Context`

, which is often abbreviated as `ctx`

in variable names,
which I use in the closure. Anything we run inside of the closure we pass to `env.run()`

is used to create a `Graph`

, which is a component frequently seen in auto-differentiation
engines. Without getting into too many nitty-gritty details, you can think of this as a
graph where nodes are variables in our model, and we have arrows between them indicating
how data flows between them such that we can “trace back” how they interact with one another and
use this data to compute partial derivatives.

So what does all of this mean for us? It means that anything we do
inside of `env.run()`

will be “watched” by the graph so we can compute derivatives. Let’s
take a look at what we do inside of the context closure, one digestible chunk at a time:

```
let x = standard_uniform(&[BATCHSIZE, 1], ctx) * 20. - 10.; // Randomly generate input coords
let w = ctx.variable("w"); // Retrieve variables
let b = ctx.variable("b");
let y = true_w * x + true_b +
standard_normal(&[BATCHSIZE, 1], ctx) * noise_scale; // Compute expected output (w/ noise)
let z = matmul(x, w) + b; // Compute LinReg predictions
```

What we initially do is generate a tensor of uniformly-distributed
input coordinates, spread evenly on the interval \((-10, 10)\). Then, I call the
`variable()`

function from the `Context`

instance, which allows me to retrieve a variable
tensor by the name I assigned to it earlier on when I initialized it. The next thing I do
is generate the expected output for these data points according to the line formula,
to which I add a (scaled) tensor of Gaussian noise to add some realistic data noise like
we would see in a real dataset.

On the last line, I assign to `z`

what’s normally called the
“forward propagation” of the network. That is, I take the tensor batch of input points and
propagate it through the model and compute the output which gives our model’s prediction.
This is done through a matrix multiplication (via `matmul()`

from the `tensor_ops`

module)
after which we add our bias tensor variable `b`

.

```
let mean_loss = reduce_mean(square(z - y), &[0], false); // Compute Mean Sq Error Loss
let ns = ctx.default_namespace();
let (vars, grads) = grad_helper(&[mean_loss], &ns); // Compute gradients for variables
let update_op = adam.get_update_op(&vars, &grads, ctx); // Compute weight/bias change
let results = ctx.evaluator().push(mean_loss).push(update_op).run(); // "Push" both loss and changes to evaluation buffer
```

In this next chunk, I start by computing the current loss
of the model on the current batch, storing the value in `mean_loss`

. Because I’m
using Mean Squared Error (MSE) as our loss metric, I can compute it very simply by
constructing a tensor of the difference between our prediction and expectation (`z - y`

),
squaring its elements (via `square()`

), then, taking the average of the resulting tensor
in the form of a scalar (accomplished by `reduce_mean`

). The extra arguments at the end of
`reduce_mean`

are an array of the axes along which we’re taking the mean (`&[0]`

), and
`false`

to indicate that we want to discard the axis we’re taking the mean along, which
leaves us with a scalar.

The next meaningful line is where we assign values to `vars`

and
`grads`

- this uses the function `grad_helper()`

from the `optimizers`

module, to
which I hand both an array of the loss variable(s) I’m trying to minimize (in our case,
`&[mean_loss]`

, the MSE variable), as well as the namespace from our `Context`

which
contains all of our learnable parameters (in our case, `w`

, and `b`

as variable tensors).

I then use the `adam`

optimizer instance to call `adam.get_update_op()`

,
which takes in references to the variables and their respective gradients, as well as the
`Context`

instance, and returns the adjustments that need to be made to the variables
in order to reduce the loss, which I call `update_op`

.

In the next line, I call a sequence of functions in the form of
`ctx.evaluator().push().run()`

. The `Context`

instance contains an `Evaluator`

, which
actually enacts the changes suggested by our optimizer. The pattern important here is
“pushing” values to the evaluator, then finally calling `run()`

to turn those values
into instructions that get evaluated. In computer science, data structures that store
a set of instructions and only evaluate their output when it’s necessary is known as a
“lazy data structure.” This is why `autograd`

is built upon a “Lazy Tensor” evaluation
engine - the results aren’t explicitly computed (which can be costly) *until* we push
them to the evaluator and call “run.”

I initially had to ask; “why are we pushing our mean loss?”
After all, that’s not necessary for any of the tensor updates. The main reason is so that
we can print out the *evaluated* value of `mean_loss`

. Because we computed `mean_loss`

purely
with tensor variables, the resulting scalar is still a tensor variable and doesn’t actually
contain a numeric value. Because I’d like to print out what error we have, I need to push
it to the evaluator and get that numeric value lazily. In the next line, I use a
`match`

statement (which I like to think of as Rust’s equivalent of a `switch`

statement)
to print out the loss of the algorithm every 100 epochs.

```
match epoch % 100 {
0 => println!("Mean Loss (epoch {}) = {}", epoch, results[0].as_ref().unwrap()), // Print training loss every 100 epochs
_ => ()
}
```

Note that the index I’m using to select from `results`

to get
the numeric value of `mean_loss`

corresponds to the order which we pushed symbolic
tensors to the evaluator. Because I pushed `mean_loss`

first, it takes index `0`

. The
actual entry at that index is a `Result`

type (which can be either an `Ok`

value for the
numeric tensor value, or an `Err`

error value), so I need to `unwrap()`

it.

Now that we’ve coded up the training loop, the last thing there
is to do is handle behavior of this script after the training is complete. I’d like to
save the learned parameters to a file, which is luckily very easy with `autograd`

:

```
// --snip--
env.save("results.json").unwrap(); // Save the learned parameters to a JSON file
// --snip--
```

The very last thing I do before ending the script is print to
the command line what the learned parameters are after our last epoch, so we can
visually compare them to their “true” values we set for our synthetic dataset.

```
// --snip--
let ns = env.default_namespace();
let finalw = &ns.get_array_by_name("w").unwrap().borrow(); // Borrow the learned weight value
let finalb = &ns.get_array_by_name("b").unwrap().borrow(); // Borrow the learned bias value
println!("Final w = {}", finalw);
println!("Final b = {}", finalb); // Print them out
```

Once again, I make use of the environment’s namespace, storing
it as `ns`

. It is from this `Namespace`

instance which I can borrow the latest evaluated values
assigned to each variable by calling the function `get_array_by_name()`

, which also results
in a `Result`

, so we must unwrap it. However, the `Result`

itself wraps a `RefCell`

object
(used often in multi-threaded applications), from which we can get the stored value via
the `borrow()`

function.

And that’s pretty much all there is to it! Using `cargo build`

gives us no compiler errors or warnings, and we can execute the resulting compiled
program to see the results we expected:

I’m satisfied with these results, because it’s clear that the
error quickly converged within the first 700 or so epochs, and our learned parameters
are within 0.05 of their “true values.”

### Where to go from here

I’d like to find more practical uses of the auto-differentiation utilities provided by
`autograd`

in a more comprehensive machine learning project. I think the next application
I’ll try to implement with this will be a custom Variational Autoencoder (VAE). Stay
tuned for blog updates to see how I get on with implementing this (admittedly, much
more complicated) model in the future.