I’ve spent the past two weeks getting a handle on Rust, the popular low-level, memory-safe programming language. In general, it gives me C++ vibes when I’m programming in it. That said, I think it has good potential as a machine learning development environment, so long as it is supplied with the appropriate external libraries (or “crates,” to use Rust’s terminology).


     So, to get a handle on how I could use Rust specifically for machine learning purposes, I did some (very brief) research into what tools might be useful. I knew that I wanted to make use of an auto-differentiation tool, as this is essentially the bedrock of a machine learning dev environment. Auto-differentiation allows us to compute partial derivatives of the outputs of our models so that we can tweak and tune them into being better predictors. I looked over the three auto-differentiation crates that appeared to be the most popular on crates.io.

     Of those, I decided to go with autograd as my auto-differentiation engine. I could have gone deeper into my research to find other alternatives, but autograd serves all of the basic purposes I am looking for.

Linear Regression as a Neural Network

     The simplest machine learning application I could come up with to test this crate out (which wasn’t already done in their tutorials collection) was a simple linear regression neural network. For those who don’t know, a neural network is essentially a statistical model which repeatedly composes linear transformations with nonlinear activation functions. So, if I wanted to use a neural network for the purposes of a simple, one-dimensional linear regression, I would use a 1D input layer connected directly to a 1D output layer with no activation function. The weight parameter of the single neuron in this network corresponds to the slope of our linear regression, while its bias parameter is the offset. This is why the output equation of our model looks exactly like the formula for a line:

\[y = mx + b \qquad \Leftrightarrow \qquad y = \text{weight}\cdot x + \text{bias}\]

     In a diagram, this linear regression model looks like the neural network pictured below:


[Figure: the 1D linear regression model drawn as a single-neuron neural network]


     Obviously, if this were all I wanted to use my auto-differentiation engine for, it would be woefully overpowered, but this is just a simple “hello world.” More generally, if I really wanted (only) to implement a linear regression algorithm in Rust, I would just use the closed-form formula for linear regression, with ndarray-linalg handling the matrix multiplications and inverses. However, in the future I’d like to construct more complicated neural models which will require the full power of autograd, so this is a small proof-of-concept which lets me get comfortable with the crate.
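
     For comparison, here is a minimal sketch of that closed-form route in one dimension, in plain Rust with no external crates (ndarray-linalg would only come into play for the general multivariate case):


/// Ordinary least squares for y = m*x + b in one dimension:
/// m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
fn ols_1d(xs: &[f32], ys: &[f32]) -> (f32, f32) {
    let n = xs.len() as f32;
    let mean_x = xs.iter().sum::<f32>() / n;
    let mean_y = ys.iter().sum::<f32>() / n;
    // Covariance and variance share the same 1/n factor, so it cancels in the ratio
    let cov: f32 = xs.iter().zip(ys).map(|(x, y)| (x - mean_x) * (y - mean_y)).sum();
    let var: f32 = xs.iter().map(|x| (x - mean_x).powi(2)).sum();
    let m = cov / var;
    (m, mean_y - m * mean_x)
}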


     Now that I have a basic idea of what sort of model I’m going to be building, I can start writing some Rust code.

Implementing the Model with AutoGrad

     The example from autograd’s GitHub repo with which my project has the most overlap is the sine.rs example, which fits a multi-layer perceptron (MLP) to the sine function. Because I’m doing something simpler, I’ll only be following it insofar as it is a useful reference for how regression generally works in autograd.

The first thing I do is import a handful of modules from the crate:


  use autograd as ag;
  use ag::optimizers::*;
  use ag::optimizers::adam::Adam;
  use ag::prelude::*;
  use ag::tensor_ops::*;


     The first of these just assigns a nickname, ag, which I can use in place of the longer namespace, autograd. After that, I’m importing tools for optimizers, the Adam optimizer specifically, some high-level traits from autograd’s prelude, and the tensor_ops module.


     Next, I create my main function, and start initializing the core engine of autograd: the so-called VariableEnvironment, along with an instance of a random number generator.


fn main() {
    // Initialize VariableEnvironment and (default seed) RNG
    let mut env = ag::VariableEnvironment::new();
    let rng = ag::ndarray_ext::ArrayRng::<f32>::default();
    // --snip--


     In short, the VariableEnvironment is a container that manages all of the variables we use in learning and keeps track of how they change throughout the training process. To set it up, I register our two learned parameters in the VariableEnvironment, associating each with a name:


    // --snip--
    // Initializing our parameter tensors
    env.name("w").set(rng.standard_uniform(&[1, 1]));
    env.name("b").set(ag::ndarray_ext::zeros(&[1, 1]));
    // --snip--


     With env.name(), we are registering a new NamedVariableSlot in the environment, which we then immediately give an initial value with the set() method. The arguments to the initializer functions inside each set() call are references to arrays of usize describing tensor shapes: they state that we are initializing w with a \(1\times 1\) tensor of random values, and b with a \(1\times 1\) tensor of zeros.
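
     Nothing about this pattern is specific to tiny tensors, either: the sine.rs example registers its (much larger) hidden-layer parameters the same way, using a Glorot-uniform initializer. A hypothetical wider layer for this model (the names here are my own) would be registered identically:


    // Hypothetical: registering a wider hidden layer's parameters the same way,
    // using the Glorot-uniform initializer seen in autograd's sine.rs example
    env.name("w_hidden").set(rng.glorot_uniform(&[1, 16]));
    env.name("b_hidden").set(ag::ndarray_ext::zeros(&[1, 16]));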


     So, our next step in initializing the model is creating an instance of an optimizer which governs how we use the partial derivatives we compute in order to reduce loss and predict outputs more accurately. We’re using the Adam optimizer for this, which we give some parameters to:


    // --snip--
    // Initializing our optimizer (Adam)
    let adam = Adam::new(0.01,  // Learning rate (alpha)
                         1e-08, // Error Toler. (epsilon)
                         0.9,   // First Moment decay (beta1)
                         0.999, // Second Moment decay (beta2)
                         env.default_namespace().current_var_ids(), // Variable IDs from VariableEnvironment namespace
                         &mut env,                                  // VariableEnvironment instance    
                         "linear_reg_adam");                        // Name string
    // --snip--


     All of the parameters I’ve put in above are generally considered to be the “default” parameters for most use cases of Adam, but I’ve beefed up the learning rate (\(\alpha\)) by a factor of 10 to speed up learning on such a dimensionally small model. It’s also noteworthy that I need to provide the optimizer with the Variable IDs from the VariableEnvironment (so it knows what our learned parameters are), as well as a mutable reference to the environment. We also supply it with a name.
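
     For reference, these hyperparameters play the following roles in Adam’s standard update rule: for a parameter \(\theta\) with gradient \(g_t\) at step \(t\),

\[m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2\]

\[\hat{m}_t = \frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \qquad \theta_t = \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}\]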


     Now I start thinking about the training process. For the purposes of this project, I’d like to train the linear regressor in batches of size \(64\), over the course of \(1000\) iterations (each iteration sees one freshly generated batch, so I’ll loosely call them epochs). So I just quickly define some constants for these quantities:


    // Initializing training constants
    const N_EPOCHS: usize = 1000;
    const BATCHSIZE: usize = 64;


     At this point, I’m almost ready to code up the training loop. But it feels like I’m forgetting something: perhaps the actual training data itself? Where am I getting that data from? Because we’re working entirely with a synthetic dataset for this project, it suffices to define some “true” values we expect our network to learn, then use those values (plus some noise) to generate data points as we train. So, I define the parameters I’d like the model to learn:


    // Setting up the (synthetic) dataset
    let true_w = 2.5f32;
    let true_b = -1.0f32;
    let noise_scale = 0.5f32;
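
     In other words, each synthetic sample will be drawn as

\[y_i = 2.5\,x_i - 1.0 + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0,\ 0.5^2)\]

so the network’s job is simply to recover the slope and offset from noisy observations.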


     Now we can jump into the meat of our algorithm: the training loop. This consists of a for-loop to iterate over the epochs, along with the following body, which I’ll explain piece by piece:


    // --snip--
    for epoch in 0..N_EPOCHS {
        env.run(|ctx| {
            let x = standard_uniform(&[BATCHSIZE, 1], ctx) * 20. - 10.;     // Randomly generate input coords
            let w = ctx.variable("w");                                      // Retrieve variables
            let b = ctx.variable("b");
            let y = true_w * x + true_b +
                    standard_normal(&[BATCHSIZE, 1], ctx) * noise_scale;    // Compute expected output (w/ noise)

            let z = matmul(x, w) + b;                                       // Compute LinReg predictions

            let mean_loss = reduce_mean(square(z - y), &[0], false);                // Compute Mean Sq Error Loss
            let ns = ctx.default_namespace();
            let (vars, grads) = grad_helper(&[mean_loss], &ns);                     // Compute gradients for variables
            let update_op = adam.get_update_op(&vars, &grads, ctx);                 // Compute weight/bias change
            let results = ctx.evaluator().push(mean_loss).push(update_op).run();    // "Push" both loss and changes to evaluation buffer

            match epoch % 100 {
                0 => println!("Mean Loss (epoch {}) = {}", epoch, results[0].as_ref().unwrap()),    // Print training loss every 100 epochs
                _ => ()
            }
        });
    }
    // --snip--


     Here we note that we are passing a closure to our VariableEnvironment via the function run(). That closure takes a single argument of a type called Context, conventionally abbreviated ctx, which is the name I use here. Anything we run inside the closure passed to env.run() is used to build a Graph, a component frequently seen in auto-differentiation engines. Without getting into too many nitty-gritty details, you can think of this as a graph whose nodes are the variables in our model, with edges indicating how data flows between them, so that we can “trace back” how they interact with one another and use this structure to compute partial derivatives.
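
     To make the graph idea concrete, here is a minimal standalone sketch in the spirit of the crate’s own basic-usage example (adapted from memory, so treat the details as approximate): we build a symbolic expression, then ask for a derivative, which is itself just another node in the graph.


use autograd as ag;
use ag::tensor_ops as T;

fn main() {
    ag::run(|ctx: &mut ag::Context<_>| {
        let x = ctx.placeholder("x", &[]);
        let y = ctx.placeholder("y", &[]);
        let z = 2. * x * x + 3. * y + 1.;   // Builds graph nodes; nothing is computed yet

        // dz/dy is the constant 3, so it can be evaluated without feeding x or y
        let gy = T::grad(&[z], &[y])[0];
        println!("dz/dy = {:?}", gy.eval(ctx));   // Prints Ok(3.0) as a scalar ndarray
    });
}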


     So what does all of this mean for us? It means that anything we do inside of env.run() will be “watched” by the graph so we can compute derivatives. Let’s take a look at what we do inside of the context closure, one digestible chunk at a time:


  let x = standard_uniform(&[BATCHSIZE, 1], ctx) * 20. - 10.;     // Randomly generate input coords
  let w = ctx.variable("w");                                      // Retrieve variables
  let b = ctx.variable("b");
  let y = true_w * x + true_b +
          standard_normal(&[BATCHSIZE, 1], ctx) * noise_scale;    // Compute expected output (w/ noise)

  let z = matmul(x, w) + b;                                       // Compute LinReg predictions


     What we initially do is generate a tensor of input coordinates drawn uniformly from the interval \([-10, 10)\) (standard_uniform samples from \([0, 1)\), which we scale by \(20\) and shift by \(-10\)). Then, I call the variable() function from the Context instance, which allows me to retrieve a variable tensor by the name I assigned to it earlier on when I initialized it. Next, I generate the expected output for these data points according to the line formula, adding a (scaled) tensor of Gaussian noise to mimic the noise we would see in a real dataset.


     On the last line, I assign to z what’s normally called the “forward propagation” of the network. That is, I take the tensor batch of input points and propagate it through the model to compute the output, which is our model’s prediction. This is done through a matrix multiplication (via matmul() from the tensor_ops module), after which we add our bias tensor variable b.


  let mean_loss = reduce_mean(square(z - y), &[0], false);                // Compute Mean Sq Error Loss
  let ns = ctx.default_namespace();
  let (vars, grads) = grad_helper(&[mean_loss], &ns);                     // Compute gradients for variables
  let update_op = adam.get_update_op(&vars, &grads, ctx);                 // Compute weight/bias change
  let results = ctx.evaluator().push(mean_loss).push(update_op).run();    // "Push" both loss and changes to evaluation buffer


     In this next chunk, I start by computing the model’s loss on the current batch, storing the value in mean_loss. Because I’m using Mean Squared Error (MSE) as our loss metric, I can compute it very simply by constructing a tensor of the difference between our prediction and expectation (z - y), squaring its elements (via square()), then taking the average of the resulting tensor in the form of a scalar (accomplished by reduce_mean). The extra arguments at the end of reduce_mean are an array of the axes along which we’re taking the mean (&[0]), and false to indicate that we want to discard the axis we’re averaging along, which leaves us with a scalar.
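
     In symbols, with batch size \(N = 64\), the quantity being computed is

\[\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(z_i - y_i\right)^2\]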


     The next meaningful line is where we assign values to vars and grads. This uses the function grad_helper() from the optimizers module, to which I hand both an array of the loss variable(s) I’m trying to minimize (in our case, &[mean_loss], the MSE variable) and the namespace from our Context which contains all of our learnable parameters (in our case, the variable tensors w and b).


     I then use the adam optimizer instance to call adam.get_update_op(), which takes in references to the variables and their respective gradients, as well as the Context instance, and returns an operation (which I call update_op) that, when evaluated, applies the adjustments needed to reduce the loss.


     In the next line, I call a sequence of functions in the form of ctx.evaluator().push().run(). The Context instance contains an Evaluator, which actually enacts the changes suggested by our optimizer. The important pattern here is “pushing” values to the evaluator, then finally calling run() to turn those values into instructions that get evaluated. In computer science, a data structure that stores a description of a computation and only evaluates its output when necessary is known as “lazy.” This is why autograd is built upon a “lazy tensor” evaluation engine: the results aren’t explicitly computed (which can be costly) until we push them to the evaluator and call run().
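
     The same idea shows up in ordinary Rust iterators, which make for a familiar analogy: adapter calls merely describe work, and nothing executes until a consumer demands a value.


  let v = vec![1, 2, 3, 4];
  let doubled = v.iter().map(|n| n * 2);   // Lazily describes the work; nothing runs yet
  let total: i32 = doubled.sum();          // Evaluation happens here, much like calling run()
  assert_eq!(total, 20);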


     I initially had to ask: “why are we pushing our mean loss?” After all, that’s not necessary for any of the tensor updates. The main reason is so that we can print out the evaluated value of mean_loss. Because we computed mean_loss purely from symbolic tensors, the resulting scalar is itself still a symbolic tensor and doesn’t yet contain a numeric value. Because I’d like to print out what error we have, I need to push it to the evaluator so that the numeric value actually gets computed. In the next line, I use a match statement (which I like to think of as Rust’s equivalent of a switch statement) to print out the loss of the algorithm every 100 epochs.


  match epoch % 100 {
      0 => println!("Mean Loss (epoch {}) = {}", epoch, results[0].as_ref().unwrap()),    // Print training loss every 100 epochs
      _ => ()
  }


     Note that the index I’m using to select from results to get the numeric value of mean_loss corresponds to the order in which we pushed symbolic tensors to the evaluator. Because I pushed mean_loss first, it takes index 0. The actual entry at that index is a Result type (which holds either an Ok value containing the numeric tensor, or an Err value describing an evaluation error), so I need to unwrap() it.
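
     If we wanted to fail more gracefully than unwrap() allows, a sketch of an explicit match on that Result (reporting an evaluation error instead of panicking) could look like this:


  match results[0].as_ref() {
      Ok(loss) => println!("Mean Loss (epoch {}) = {}", epoch, loss),   // The happy path, as before
      Err(e) => eprintln!("Loss evaluation failed: {:?}", e),           // Report instead of panicking
  }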


     Now that we’ve coded up the training loop, the last thing to do is handle the behavior of this script after training is complete. I’d like to save the learned parameters to a file, which is luckily very easy with autograd:


  // --snip--
  env.save("results.json").unwrap();      // Save the learned parameters to a JSON file
  // --snip--


     The very last thing I do before ending the script is print to the command line what the learned parameters are after the final epoch, so we can visually compare them to the “true” values we set for our synthetic dataset.


  // --snip--
  let ns = env.default_namespace();
  let finalw = &ns.get_array_by_name("w").unwrap().borrow();  // Borrow the learned weight value
  let finalb = &ns.get_array_by_name("b").unwrap().borrow();  // Borrow the learned bias value
  println!("Final w = {}", finalw);
  println!("Final b = {}", finalb);                           // Print them out


     Once again, I make use of the environment’s namespace, storing it as ns. It is from this Namespace instance that I can borrow the latest evaluated values assigned to each variable by calling the function get_array_by_name(), whose return value must be unwrapped. The unwrapped value is a RefCell, Rust’s single-threaded interior-mutability container (it enforces the borrow rules at runtime rather than at compile time), from which we can get the stored value via the borrow() function.
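
     A quick standalone illustration of how RefCell behaves, independent of autograd:


  use std::cell::RefCell;

  let cell = RefCell::new(1.5f32);
  *cell.borrow_mut() += 1.0;        // Mutate through a shared handle (checked at runtime)
  println!("{}", cell.borrow());    // Read access via borrow(), as above; prints 2.5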


     And that’s pretty much all there is to it! Using cargo build gives us no compiler errors or warnings, and we can execute the resulting compiled program to see the results we expected:


[Figure: command-line output of the training run]


     I’m satisfied with these results, because it’s clear that the error quickly converged within the first 700 or so epochs, and our learned parameters are within 0.05 of their “true values.”

Where to Go from Here

I’d like to find more practical uses of the auto-differentiation utilities provided by autograd in a more comprehensive machine learning project. I think the next application I’ll try to implement will be a custom Variational Autoencoder (VAE). Stay tuned for blog updates to see how I get on with implementing this (admittedly much more complicated) model in the future.