This work was brief, amusing and experimental. The result is a simple Shiny app that contrasts MCMC search via simulated annealing versus the (more standard) Metropolis algorithm. While far from groundbreaking, I did pick up the following few bits of intuition along the way.

I like teaching things to anyone who will listen. Fancy models are useless if your boss doesn’t understand. Simple analogies are immensely effective for communicating almost anything at all.

The goal of Bayesian inference is, given some data, to figure out which parameters were employed in generating that data. The data themselves come from generative processes – a Binomial process, a Poisson process, a Normal process (as is used in this post), for example – which each require parameters to get off the ground (in the same way that an oven needs a temperature and a time limit before it can start making bread) – $n$ and $p$; $\lambda$; and $\mu$ and $\sigma$, respectively. In statistical inference, we work backwards: we’re given the data, we hypothesize from which type of process(es) it was generated, and we then do our best to guess what these initial parameters were. Of course, we’ll never actually know: if we did, we wouldn’t need to do any modeling at all.

Bayes’s theorem is as follows:

$$P(\theta \mid X) = \frac{P(X \mid \theta)\,P(\theta)}{P(X)}$$

Initially, all we have is $X$: the data that we’ve observed. During inference, we pick a parameter value – let’s start with some candidate $\theta$ – and compute both $P(\theta)$ and $P(X \mid \theta)$. We then multiply these two together, leaving us with an expression of how likely $\theta$ is to be the *real* parameter that was initially plugged into our generative process (that then generated the data we have on hand). This expression is, up to the constant $P(X)$, the posterior probability (of $\theta$, given $X$).

The centerpiece of this process is the computation of the quantities $P(\theta)$ and $P(X \mid \theta)$. To understand them, let us use the example of *vetting*, i.e. vetting an individual applying for citizenship in your country – a typically multi-step process. In this particular vetting process, there are two steps.

- The first step, and perhaps the “broader stroke” of the two, is the prior probability $P(\theta)$ of this parameter. In setting up our problem we choose a prior distribution – i.e. our *a priori* belief of the possible range of values this true parameter can take – and the prior probability echoes how likely we thought $\theta$ to be the real thing before we saw any data at all.
- The second step is the likelihood $P(X \mid \theta)$ of our data given this parameter. It says: “assuming $\theta$ is the real thing, how likely was it to have observed the data that we did?” For further clarity, let’s assume in an altogether different problem that our data consists of 20 coin-flips – 17 heads, and 3 tails – and the parameter we’re currently *vetting* is a value of $p$ (where $p$ is the probability of flipping “heads”) that sits well away from the observed share of heads. So, “assuming this $p$ is the real thing, how likely was it to have observed the data that we did?” The answer: “Err, not likely whatsoever.”

Finally, we multiply these values together to obtain the posterior probability, or the “yes admit this person into the country” score. If it’s high, they check out.

The prior probability for a given parameter $\theta$, expressed $P(\theta)$, is a single floating point number. The likelihood of a single data point $x_i$ given that parameter $\theta$, expressed $P(x_i \mid \theta)$, is a single floating point number. To compute the likelihood of *all* of our data given that parameter $\theta$, expressed $P(X \mid \theta)$, we must multiply the individual likelihoods together – one for each data point. For example, if we have 100 data points, $P(X \mid \theta)$ is the product of 100 floating point numbers. We can write this compactly as:

$$P(X \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta)$$

Likelihood values are often pretty small, and multiplying small numbers together makes them even smaller. As such, computing the posterior on the log scale allows us to *add* instead of multiply, which solves some numerical precision troubles that computers often have. With 100 data points, computing the log posterior would be a sum of 101 numbers. On the natural scale, the posterior would be the product of 101 numbers.
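To make the arithmetic concrete, here is a minimal sketch in Python using the coin-flip example from earlier (17 heads, 3 tails). The flat prior and the function names are my own assumptions for illustration; they are not taken from the Shiny app.

```python
import math

# 20 coin flips: 1 = heads, 0 = tails (17 heads, 3 tails).
flips = [1] * 17 + [0] * 3


def log_prior(p):
    # Assumption: a flat prior on (0, 1), so every valid p scores log(1) = 0.
    return 0.0 if 0.0 < p < 1.0 else float("-inf")


def log_likelihood(p, data):
    # One term per data point: log P(x_i | p) for a single Bernoulli flip.
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)


def log_posterior(p, data):
    # Unnormalized log posterior: 1 prior term plus len(data) likelihood terms.
    return log_prior(p) + log_likelihood(p, data)


# Vet a few candidate parameters against the same data.
for candidate in (0.1, 0.5, 0.85):
    print(candidate, round(log_posterior(candidate, flips), 2))
```

With 20 flips this is a sum of 21 numbers; the candidate closest to the observed share of heads should earn the highest score.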

I used to get easily frustrated with oft-used big words that I personally felt conferred no meaning whatsoever. “Optimization,” for example: “We here at XYZ Consulting undertake optimal processes for maximal profit.” Optimal, eh? What does that actually mean?

In recent months, I’ve realized optimization is just a collection of strategies for finding intelligent and warm-hearted prospective citizens: specifically, given one prospective citizen (parameter) that is sufficiently terrific (has a high posterior probability given the data we’ve observed), how do we then find a bunch more? In the discrete world, simulated annealing, mixed-integer programming, branch and bound, etc. would be a few of these strategies. In the continuous world, gradient descent, L-BFGS, the Powell algorithm and brute-force grid search are a few such examples.

In most mathematical cases, the measure of “sufficiently terrific” refers to a good score on a relevant loss function. In the case of XYZ Consulting, this metric is certainly more vague. (And while there may be some complex numerical optimization routine guiding their decisions, well, I’d guess the PR-score is the metric they’re more likely after.)

The most salient difference between the simulated annealing and Metropolis samplers is the “cooling schedule” of the former. In effect, simulated annealing becomes fundamentally more “narrow-minded” as time goes on: it finds a certain type of prospective citizen it likes, and it thereafter goes searching only for others that are very close in nature. Concretely, with a very quick cooling schedule, this can result in a skinny and tall posterior; when using simulated annealing for MCMC, we must take care to use a schedule that allows for sufficient exploration of the parameter space. With the Metropolis sampler, we don’t have this problem.
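The contrast can be sketched in a few lines of Python. This is not the code behind the Shiny app: the stand-in target (a standard normal log posterior), the step size, and the geometric cooling schedule are all assumptions chosen only to show the effect. The one difference between the two runs is the temperature passed to the acceptance rule.

```python
import math
import random
import statistics


def log_target(theta):
    # Stand-in log posterior (an assumption): a standard normal, up to a constant.
    return -0.5 * theta * theta


def sample(n_steps, temperature, seed=0, step_sd=0.5, start=3.0):
    """Random-walk sampler; temperature(k) returns the temperature at step k."""
    rng = random.Random(seed)
    theta, draws = start, []
    for k in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_sd)
        # Accept with probability min(1, exp(delta / T)), compared on the log scale.
        delta = log_target(proposal) - log_target(theta)
        if math.log(1.0 - rng.random()) < delta / temperature(k):
            theta = proposal
        draws.append(theta)
    return draws


def constant(k):
    # Metropolis: the temperature stays at 1 for the whole run.
    return 1.0


def fast_cooling(k):
    # Simulated annealing: geometric cooling, floored to avoid dividing by ~0.
    return max(0.995 ** k, 1e-3)


metropolis = sample(5000, constant)
annealing = sample(5000, fast_cooling)

# Compare the spread of the second half of each run.
metropolis_sd = statistics.pstdev(metropolis[2500:])
annealing_sd = statistics.pstdev(annealing[2500:])
print(metropolis_sd, annealing_sd)
```

Once the temperature has dropped, the annealing run should hug the mode far more tightly than the Metropolis run, which is the “skinny and tall” effect described above.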

Finally, I’ve found that I quite enjoy using R. Plotting is miles easier than in Python and the pipe operators aren’t so bad.

The code for this project can be found here. Thanks a lot for reading.

In short, I built a Shiny app that estimates my/your typical week with RescueTime. In long, the analysis is as follows.

The basic model of RescueTime is thus: track all activity, then categorize this activity by both “category” – “Software Development,” “Reference and Learning,” “Social Networking,” etc. – and “productivity level” – “Very Productive Time,” “Productive Time,” “Neutral Time,” “Distracting Time” and “Very Distracting Time.” For example, 10 minutes spent on Twitter would be logged as (600 seconds, “Social Networking,” “Very Distracting Time”), while 20 minutes on an arXiv paper would be logged as (1200 seconds, “Reference and Learning,” “Very Productive Time”). Finally, RescueTime maintains (among other minutiae) an aggregate “productivity score” by day, week and year.

The purpose of this post is to take my weekly summary for 2016 and examine how I’m doing thus far. More specifically, with a dataset containing the total seconds-per-week spent [viewing resources categorized] at each of the 5 distinct productivity levels, I’d like to infer the productivity breakdown of a typical week. Rows of this dataset – after dividing all values in each row by that row’s sum – will contain 5 numbers, with each expressing the proportion of that week spent at the respective productivity level. Examples might include: (.2, .3, .1, .2, .2), (.1, .3, .2, .15, .25) or (.05, .25, .3, .25, .15). Of course, the values in each row must sum to 1.
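As a small illustration of that normalization step, here is a sketch in Python. The seconds-per-week figures below are made up for the example and are not my actual RescueTime data.

```python
# Hypothetical seconds-per-week at each of the 5 productivity levels (made-up numbers).
weeks = [
    [3600, 9000, 7200, 3000, 36000],
    [8000, 6000, 6500, 3500, 27000],
]


def to_proportions(row):
    # Divide each value by the row total so that the row sums to 1.
    total = sum(row)
    return [value / total for value in row]


proportions = [to_proportions(row) for row in weeks]
for row in proportions:
    print([round(p, 3) for p in row], round(sum(row), 6))
```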

In effect, we can view each row as an empirical probability distribution over the 5 levels at hand. As such, our goal is to infer the process that generated these samples in the first place. In the canonical case, this generative process would be a Dirichlet distribution – a thing that takes a vector $\alpha$ and returns vectors of the same length as $\alpha$ containing values that sum to 1. With a Dirichlet model conditional on the RescueTime data observed, the world becomes ours: we can generate new samples (a “week” of RescueTime log!) ad infinitum, ask questions of these samples (e.g. “what percentage of the time can we expect to log more ‘Very Productive Time’ than ‘Productive Time?'”), and get some proxy lens into the brain cells and body fibers that spend our typical week in front of the computer in the manner that they do.
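To show what “generate new samples and ask questions of them” could look like, here is a rough Python sketch. It draws from a Dirichlet by normalizing independent Gamma draws (a standard construction); the concentration vector `alpha` is an assumption I picked for illustration, not something inferred from the data.

```python
import random


def dirichlet_draw(alpha, rng):
    # Standard construction: independent Gamma(alpha_i, 1) draws, normalized to sum to 1.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(42)

# Assumed concentration vector, ordered: very distracting, distracting,
# neutral, productive, very productive. These values are illustrative only.
alpha = [2.0, 3.0, 3.0, 2.0, 10.0]

simulated_weeks = [dirichlet_draw(alpha, rng) for _ in range(10000)]

# "How often do we log more 'Very Productive Time' than 'Productive Time'?"
share = sum(week[4] > week[3] for week in simulated_weeks) / len(simulated_weeks)
print(round(share, 3))
```

Each simulated week is a vector of 5 proportions that sums to 1, so any question about a week reduces to counting over the draws.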

To begin this analysis, I first download the data at the following link. If you’re not logged in, you’ll be first prompted to do so. To do the same, you must be a paying RescueTime user. If you’re not, you’re welcome to use my personal data in order to follow along.

The data at hand have 48 rows. First, let’s see what they look like.

    > head(report)
                     week very_distracting distracting   neutral productive very_productive
    1 2016-01-31T00:00:00       0.05802495  0.15878213 0.1179268 0.05471899       0.6105471
    2 2016-02-07T00:00:00       0.16082036  0.11625240 0.1251466 0.06762928       0.5301514
    3 2016-02-14T00:00:00       0.07335485  0.18299335 0.1269896 0.08361825       0.5330439
    4 2016-02-21T00:00:00       0.07911463  0.04051227 0.1445033 0.05395296       0.6819169
    5 2016-02-28T00:00:00       0.07513117  0.12542957 0.1560940 0.04884047       0.5945047
    6 2016-03-06T00:00:00       0.04554125  0.12288119 0.1410541 0.08958757       0.6009359

Next, let’s see how each level is distributed:

Finally, let’s choose a modeling approach. Once more, I venture that each week should be viewed as a draw from a Dirichlet distribution; at the very least, no matter how modeled, each week (draw) is inarguably a vector of values that sum to 1. To this effect, I see a few possible approaches.

A Dirichlet Process (DP) is a model of *a distribution over distributions*. In our example, this would imply that each week’s vector is drawn from one of several possible Dirichlet distributions, each one governing a fundamentally different type of week altogether. For example, let’s posit that we have several different kinds of work weeks: a “lazy” week, a “fire-power-super-charged week,” a “start-slow-finish-strong week,” a “cram-all-the-things-on Friday week.” Each week, we arrive to work on Monday morning and our aforementioned brain cells and body fibers “decide” what type of week we’re going to have. Finally, we play out the week and observe the resulting vector. Of course, while two weeks might have the same type, the resulting vectors will likely be (at least) slightly different.

In this instance, given a Dirichlet Process $DP(G_0, \alpha)$ – where $G_0$ is the base distribution (from which the cells and fibers decide what type of week we’ll have) and $\alpha$ is some prior – we first draw a week-type distribution from the base, then draw our week-level probability vector from the result. As a bonus, a DP is able to infer an *infinite* number of week-type distributions from our data (as compared to K-Means, for example, in which we would have to specify this value *a priori*) which fits nicely with the problem at hand: *à la base*, how many distinct week-types do we truly have? How would we ever know?

Dirichlet Processes are best understood through one of several simple generative statistical processes, namely the Chinese Restaurant Process, Polya Urn Model or Stick-Breaking Process. Edwin Chen has an *excellent* post dissecting each and its relation to the DP itself.
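For a feel of how the number of week-types can be left open, here is a minimal Python sketch of the Chinese Restaurant Process. It only assigns weeks to week-types; it does not draw the per-type Dirichlet distributions. The function name and the choice of `alpha = 1.0` are my own assumptions for illustration.

```python
import random


def chinese_restaurant_process(n_weeks, alpha, rng):
    """Assign each week to a week-type; week i opens a new type with probability alpha / (i + alpha)."""
    assignments = []   # week-type index for each week
    counts = []        # number of weeks already assigned to each week-type
    for i in range(n_weeks):
        # Existing type k is chosen with weight counts[k]; a new type with weight alpha.
        weights = counts + [alpha]
        u = rng.random() * (i + alpha)
        cumulative, choice = 0.0, len(counts)
        for k, w in enumerate(weights):
            cumulative += w
            if u < cumulative:
                choice = k
                break
        if choice == len(counts):
            counts.append(1)      # open a brand-new week-type
        else:
            counts[choice] += 1   # join an existing week-type
        assignments.append(choice)
    return assignments, counts


assignments, counts = chinese_restaurant_process(48, 1.0, random.Random(0))
print(len(counts), counts)
```

Popular week-types attract more weeks, yet there is always some probability of a new type appearing, so the number of types is never fixed in advance.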

Given that a Dirichlet distribution is a member of the exponential family, that “all members of the exponential family have conjugate priors”^{1}, and that our data can be intuitively viewed as Dirichlet draws, it would be fortuitous if there existed some nice algebraic conjugacy to make inference a breeze. We know how to use *Beta-Binomial* and *Dirichlet-Multinomial*, but unfortunately there doesn’t seem to be much in the way of *X-Dirichlet*^{2}. As such, this approach unfortunately dead-ends here.

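A final approach has us modeling the mean of each productivity-level proportion, $\mu_i$, as:

$$\mu_i = \frac{e^{\phi_i}}{\sum_{j} e^{\phi_j}}$$

For each $\phi_i$, we place a normal prior $\phi_i \sim \text{Normal}(0, 1)$, and finally give the likelihood of each observed productivity-level proportion $y_i$ as $y_i \sim \text{Normal}(\mu_i, \sigma)$, as in the canonical Bayesian linear regression. There are two key points to make on this approach.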

- As each $\mu_i$ is given by the softmax function, the $\phi_i$ values are not uniquely identifiable, i.e. `softmax(vector) = softmax(100 + vector)`. In other words, because the magnitude of the values (how big they are) is unimportant, we cannot (nor do we need to) solve for these values exactly. I like to think of this as inference on two multi-collinear variables in a linear regression: with $x_1 = x_2$, we can re-express our regression $y = \beta_1 x_1 + \beta_2 x_2$ as $y = (\beta_1 + \beta_2) x_1$; in effect, we now have only 1 coefficient to solve for, and while the sum $\beta_1 + \beta_2$ is what we’re trying to infer, the individual values $\beta_1$ and $\beta_2$ are of no importance. (For example, for any fixed value of the sum, we could split it between $\beta_1$ and $\beta_2$ in any number of ways to no material difference.) In this case, while interpretation of the individual coefficients $\beta_1$ and $\beta_2$ would be erroneous, we can still make perfectly sound predictions on $y$ with the posterior for $\beta_1 + \beta_2$. To close, this is but a tangential way of saying that while the posteriors of each individual $\phi_i$ will be of little informative value, the softmax itself will still work out just fine.
- I’ve chosen the likelihood as the normal distribution with respective means $\mu_i$ and a *shared* standard deviation $\sigma$. First, I note that I hope this is the correct approach, i.e. “do it like you would with a typical linear model.” Second, I chose a shared standard deviation (and, frankly, prayed it would be small) as my aim was to omit it from analysis/posterior prediction entirely: while simulating $\mu_i$ seems perfectly sound, making a draw from the likelihood function, i.e. the normal distribution with mean $\mu_i$ and standard deviation $\sigma$, would cause our simulated productivity-level proportions to no longer add up to 1! This seems like the worst of all evils. While the spread of the respective distributions *does* seem to vary – thus suggesting we would be wise to infer a separate $\sigma_i$ for each – I chose to brush this fact aside because: one, the $\mu_i$’s are not independent, i.e. as one goes up another must necessarily go down, which I hoped might be in some way “captured” by the single $\sigma$ parameter, and two, I didn’t intend to use the posterior for $\sigma$ in the analysis for the reason mentioned above, checking only to see that it converged.
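The non-identifiability in the first point is easy to check numerically. Below is a minimal Python sketch; the five values in `phi` are arbitrary numbers I made up, not posterior estimates.

```python
import math


def softmax(values):
    # Subtracting the max is itself a constant shift, so it leaves the result
    # unchanged while keeping exp() from overflowing.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


phi = [-1.0, 0.0, 0.5, -0.3, 1.2]      # arbitrary illustrative values
shifted = [100 + v for v in phi]       # same vector, every entry moved by 100

mu = softmax(phi)
mu_shifted = softmax(shifted)

print([round(m, 4) for m in mu])
print(max(abs(a - b) for a, b in zip(mu, mu_shifted)))
```

The two outputs should agree up to floating-point error, which is why the data can pin down the differences between the inputs but never their overall level.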

In the end, I chose Option 3 for a few simple reasons. First, I have no reason to believe that the data were generated by a variety of distinct “week-type” distributions; a week is a rather large unit of time. In addition, the spread of the empirical distributions doesn’t appear, by no particularly rigorous measure, that erratic. Conversely, if this were instead day-level data, this argument would be much more plausible and the data would likely corroborate this point. Second, Gelman suggests this approach in response to a similar question, adding “I’ve personally never had much success with Dirichlets.”^{3}

To build this model, I elect to use Stan in R, defining it as follows:

    data {
      int<lower=1> N;                 // number of weeks
      real very_distracting[N];
      real distracting[N];
      real neutral[N];
      real productive[N];
    }
    parameters {
      real phi_a;
      real phi_b;
      real phi_c;
      real phi_d;
      real phi_e;
      real<lower=0, upper=1> sigma;   // shared standard deviation
    }
    transformed parameters {
      // softmax of the phi values; mu_e is left out and recovered afterwards
      real mu_a;
      real mu_b;
      real mu_c;
      real mu_d;
      mu_a = exp(phi_a) / ( exp(phi_a) + exp(phi_b) + exp(phi_c) + exp(phi_d) + exp(phi_e) );
      mu_b = exp(phi_b) / ( exp(phi_a) + exp(phi_b) + exp(phi_c) + exp(phi_d) + exp(phi_e) );
      mu_c = exp(phi_c) / ( exp(phi_a) + exp(phi_b) + exp(phi_c) + exp(phi_d) + exp(phi_e) );
      mu_d = exp(phi_d) / ( exp(phi_a) + exp(phi_b) + exp(phi_c) + exp(phi_d) + exp(phi_e) );
    }
    model {
      // priors
      sigma ~ uniform( 0 , 1 );
      phi_a ~ normal( 0 , 1 );
      phi_e ~ normal( 0 , 1 );
      phi_d ~ normal( 0 , 1 );
      phi_c ~ normal( 0 , 1 );
      phi_b ~ normal( 0 , 1 );
      // likelihood
      very_distracting ~ normal(mu_a, sigma);
      distracting ~ normal(mu_b, sigma);
      neutral ~ normal(mu_c, sigma);
      productive ~ normal(mu_d, sigma);
    }

Here (with *a, b, c, d, e* corresponding respectively to “Very Distracting Time,” “Distracting Time,” “Neutral,” “Productive Time,” “Very Productive Time”) I model the likelihoods of all but *e*, as this can be computed deterministically from the posterior samples of *a, b, c* and *d* as $\mu_e = 1 - (\mu_a + \mu_b + \mu_c + \mu_d)$. For priors, I place $\text{Normal}(0, 1)$ priors on each $\phi$, the magnitude of which should be practically irrelevant as stated previously. Finally, I give $\sigma$ a $\text{Uniform}(0, 1)$ prior, which seemed like a logical magnitude for mapping a vector of values that sum to 1 to another vector of values that sum to (something close to) 1.

Instinctually, this modeling framework seems like it might have a few leaks in the theoretical ceiling – especially with respect to my choices surrounding the shared $\sigma$ parameter. Should you have some feedback on this approach, please do drop a line in the comments below.

To fit the model, I use the standard Stan NUTS engine to build 4 MCMC chains, following Richard McElreath’s “four short chains to check, one long chain for inference!”^{3} The results – fortunately, quite smooth – are as follows:

The gray area of the plot pertains to the warmup period, while the white gives the valid samples. All four chains appear highly-stationary, well-mixing and roughly identical. Finally, let’s examine the convergence diagnostics themselves:

    Inference for Stan model: model.
    4 chains, each with iter=2000; warmup=1000; thin=1;
    post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd 1.5% 98.5% n_eff Rhat
    mu_a  0.07       0 0.01 0.05  0.09  4000    1
    mu_b  0.07       0 0.01 0.05  0.09  4000    1
    mu_c  0.20       0 0.01 0.18  0.22  4000    1
    mu_d  0.12       0 0.01 0.10  0.14  4000    1
    sigma 0.07       0 0.00 0.06  0.07  1792    1

Both `Rhat` – a value which we hope to equal 1, and which would be “suspicious at 1.01 and catastrophic at 1.10”^{3} – and `n_eff` – which expresses the “effective” number of samples, i.e. roughly how many independent draws our autocorrelated NUTS samples are worth – are right where we want them to be. Furthermore, $\sigma$ ends up being rather small, and with a rather tight 97% interval to boot.

Next, let’s draw 2000 samples from the joint posterior and plot the respective distributions of the $\mu$ values against one another:

Remember, the above posterior distributions are for the *expected values* (mean) of each productivity-level proportion. In our model, we then insert this mean into a normal distribution (the likelihood function) with standard deviation $\sigma$ and draw our final value.

Finally, let’s compute the mean of each posterior for a final result:

In summary, I’ve got work to do. Time to cast off those “Neutral” clothes and toss it to the purple.

—

Footnotes:

- Conjugate Prior – Wikipedia
- Conjugate prior to Dirichlets?
- McElreath, Richard. *Statistical Rethinking*. Chapman and Hall/CRC, 2015. VitalBook file.

—

Code:

The code for this project can be found here.

One of the expressed goals of machine learning is to learn structure in data. First, data, in line with the notion above, is a record of a thing that happened, or is. For example, data could be a piece of paper that lists all of the sales my company made yesterday. In addition, data could be a photograph which captures (via numerical pixel values) an instant in time.

So, what does structure in this data mean? Structure refers to patterns. Structure refers to high-level relationships and phenomena in this data. In the first case, finding structure could be discovering that Sunday is our most profitable day; in the second, structure could be discovering that in a large set of photographs of people, wherever we see a human nose, there are typically two eyes just above and a mouth just below.

In machine learning, discovering structure in data in an unsupervised fashion – and especially when dealing with image, audio or video data – is typically performed via auto-encoders. The job of an auto-encoder is similar to that of a data compression model: take the original data and reduce it into something smaller that, *crucially*, contains all of the information contained in the original. Said a different way, given the compressed representation of the data, we should be able to fully reconstruct the original input.

In this post, I set out to discover structure in world flags. Specifically, I’d like to know: “what are the features that comprise these flags?” If successful, I should be able to *numerically encode* a flag as not just its raw pixel values, but instead, “some red background, plus a green star in the middle” (in the case of Morocco). Of course, these features would only arise in a dataset full of flags: if viewed through the lens of pictures of cats, the Moroccan flag would instead be encoded as “a blood-red sunset, plus a cat in a green super-hero cape, minus the cat.”

In the family of auto-encoders, the sparse auto-encoder is one of the simplest. In effect, this is a neural network with a single hidden layer which takes an image as input and learns to predict that image as output. The hidden layer is typically of a size smaller than the input and output layers, and has non-linear activations. Finally, a sparsity constraint is enforced such that the model favors having only a few non-zero hidden-layer activation values. For a given image, these activations *are* its compressed representation, i.e. its “encoding.”
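To ground the description, here is a toy forward pass and loss in plain Python. It is a sketch under stated assumptions, not the hand-rolled model used for this post: the layer sizes, the sigmoid activations, the KL-divergence form of the sparsity penalty, and the values of `rho` and `beta` are all choices I made for illustration, and no training is performed.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    # One fully connected layer with sigmoid activations.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def kl_sparsity(rho, rho_hat):
    # KL divergence between a Bernoulli(rho) target and the observed mean activation.
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


rng = random.Random(0)
n_in, n_hidden = 12, 4            # toy sizes, not the sizes used in this post
images = [[rng.random() for _ in range(n_in)] for _ in range(5)]

w_enc = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
b_enc = [0.0] * n_hidden
w_dec = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
b_dec = [0.0] * n_in

encodings = [layer(img, w_enc, b_enc) for img in images]          # compressed representations
reconstructions = [layer(enc, w_dec, b_dec) for enc in encodings]

# Reconstruction error: mean squared difference between input and output.
mse = sum((x - y) ** 2
          for img, rec in zip(images, reconstructions)
          for x, y in zip(img, rec)) / (len(images) * n_in)

# Sparsity penalty: push each hidden unit's mean activation toward rho.
rho, beta = 0.05, 3.0             # assumed values for illustration
mean_activation = [sum(enc[j] for enc in encodings) / len(encodings)
                   for j in range(n_hidden)]
penalty = sum(kl_sparsity(rho, a) for a in mean_activation)

loss = mse + beta * penalty
print(round(mse, 4), round(penalty, 4), round(loss, 4))
```

Training would adjust the weights to shrink `loss`, trading reconstruction quality against keeping most hidden activations near zero.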

With a trained sparse auto-encoder, we can do a few things.

- We can visualize the “features” each hidden-layer node is “looking for.” These are the high-level features that characterize our data, i.e. stars, stripes and crescents in a dataset of flags.
- We can take a composition of existing encodings and generate a composite flag. For example, feed a combination of two flags’ encodings into the hidden layer of the network, pass it through to the final layer and see what results.
- Pass a vector of random values into our hidden-layer, pass it through to the final layer and generate a new flag entirely.

A more comprehensive technical primer on sparse auto-encoders is not the premise of this post, as I believe much better resources already exist. Here are a few links I like to get you started:

The following model was trained with a hand-rolled sparse auto-encoder found here. The technical specifications are as follows:

- Downsize images to , which is roughly proportional to the largest bounding box of the originals. Then, flatten to vectors of values.
- Network dimensions are .
- Learning rate .
- Training for 1000 epochs.
- Sparsity parameter .
- Sparsity parameter in loss function .
- Initialize weights and biases with Gaussian of , .
- The full dataset is of size 138. Yes, it’s tiny! Use the first 100 examples for training, the next 20 for validation, and the final 18 for testing.

First, let’s see how well our network does. Again, its goal was to learn how to compress an image into a reduced representation containing enough information to recreate the original thereafter.

Here’s an image of the downsized flag of Afghanistan as passed into our network:

So, this is as good as we’re ever going to do. When we pass this into our network, here’s what it predicts:

Not terrible. Of course, this could be improved with, squarely, more training data.

After training our auto-encoder, we solve for the 64 individual images that “maximally activate” each of the 64 “feature detectors,” i.e. each of our hidden-layer nodes.

As anticipated, there does in fact appear to be some higher-level “structure” in our flags. In other words, we can now empirically see: a flag is a thing made up of some combination of horizontal stripes, vertical stripes, diagonal crosses, central emblems, the British crest, etc.

Next, let’s pass all images back through our network, obtain the 64-dimensional encoding for each, reduce these encodings into 2-dimensional space via the t-SNE algorithm, and plot.

Points that are close together indicate flags that are visually similar. So, what have we learned (or rather, what human intuition have we corroborated with empirical, numerical evidence)? Notable similarities include:

- Belgium, Chad and Mali:

- Malaysia, Liberia and Puerto Rico:

- Canada, Denmark and Peru:

Here, we see that similarity is defined not just across one type of feature, but necessarily, across all. Respectively, the above 3 groups seem heavy in: the “3 vertical bars” feature(s), the “stripes” and “thing in the top-left corner” feature(s), and the “cherry red” feature(s). (I include the optional “s” because the features are not particularly easy to identify nor apparently mutually exclusive in the feature map above.)

Finally, let’s generate some new flags. The following images are what happens when we pass the respective composite encodings into the hidden-layer of our auto-encoder, and feed-forward (i.e. pass it through the decoder). The result is then resized back to the original (where more resolution is inherently lost).

- Morocco:

- Morocco + Colombia:

- Morocco + Colombia + Malaysia:

If only there were more countries in the world such that I could get more data. But hey, we need fewer borders, not more. Thanks for reading.

—

The code and notebook for this project can be found in the links.

One Monday morning, Ernie from the ‘Street climbs out from under his red-and-blue pinstriped covers, puts both feet on the ground and opens his bedroom window. He stares out into a bustling metropolis of cookies and fur, straightens his banana-yellow turtleneck, lets out a deep, vigorous, crescent-shaped morning yawn and exclaims aloud: “Today, I’m going to make cupcakes for my dear friend Bert.”

Unfortunately, Ernie has never made cupcakes before. But no matter! He darts hastily to the kitchen, pulls out a cookbook, organizes the ingredients and turns on his small Easy-Bake oven. “I’ll experiment here. I’ll make the greatest cupcake known to all stuffed-animal-kind. And when I’m happy with the result, I’ll make 50 more,” he shouts.

Hours later, Ernie’s work is done: his cupcake – a 3-story stack of blueberry, strawberry and bacon-flavored sub-cakes – is the single best thing he’s ever tasted. Far better than anything that fraud Cookie Monster had ever tried! Ernie is thrilled, and sits back in his now-filthy kitchen to admire the result. He thinks of Bert, and wonders just how quickly he can deliver his gift. “Now that I’ve baked the perfect cupcake, I’ll just need to bake 50 more. This shouldn’t be that hard. Right?”

Ernie spins around to look at his Easy-Bake. “Well, that thing only bakes one cupcake at a time. At that rate, 50 would take me days!” Spirits still high, he runs to the local bakery and asks to use their oven – this one much larger. They happily oblige, and Ernie starts baking right away.

Unfortunately, as he’s mixing the ingredients he starts to have problems. The electric mixer breaks. The knife doesn’t quite cut the strawberries in just the right way. The measuring cups have an ever-so-slightly different size. Ernie starts to stress. He thought he was at the finish line, but now realizes that he’s really just at the start. While Ernie came equipped with the recipe to bake the cupcakes, he notes that he’s now using all new tools in a completely new kitchen under completely different circumstances. “Can’t a stuffed animal just bake a single cupcake in his small oven, then bring the recipe and ingredients to a bigger oven and bake a bunch more? Why does this need to be so complicated?”

In sadness and despair, Ernie wanders to the seaport to clear his mind. There, he comes across hundreds of blue and white, SUV-sized shipping containers and gets a funny idea: “What if I did my baking in there? I’ll move all of my tools inside – the cutting board, the knife, the mixer, the utensils – and write the recipe on the inner wall. The only thing missing will be the oven, but I can get that anywhere. That way, using the oven *chez moi,* I can continue to bake one cupcake at a time; conversely, using the oven at the bakery I can bake a whole lot more.” Perfect. Ernie grabs the first container he sees and races home to pack it full.

After writing the ingredients on the container’s inner wall, Ernie realizes that if he’s going to bring this container to the bakery, it better be light. If not, he won’t be able to carry it! Therefore, instead of actually including his tools – the knife, the mixer, etc. – he simply writes down the names and numbers of these products and instructions as to where they can be acquired. Similarly, instead of including the actual ingredients for the cupcakes, he expects them to be available at the bakery itself. Then, when the recipe says “take 3 tablespoons of sugar from the cupboard,” that sugar will have already been placed in the cupboard itself.

Baking cupcakes on Sesame Street is a metaphor for building models for Kaggle. Typically, we build small prototypes on our local machine, then temporarily rent a more powerful machine sitting on a farm somewhere in Virginia to do the heavy lifting. In Kaggle competitions, Ernie’s initial problem is all too common: even after finding an electric mixer, measuring cups, etc. comparable to his own – i.e. even after installing all those libraries on our remote machine that we had on our local – the environments still weren’t quite the same and problems therein arose. Docker containers solve this problem: if we can bake our cupcake once in our kitchen, we can deterministically re-bake it *n* times in any kitchen – and preferably one with an oven much more powerful than our own.

A remote instance is the bakery: it is a computer, like ours, that can process data faster and in larger quantities. In other words, it is a kitchen with a much bigger oven.

In lieu of including cooking utensils in our container we merely specify which utensils we need and how to acquire them. For a Kaggle competition, this is akin to installing the libraries – pandas, scikit-learn, etc. – necessary for the task at hand. Once more, we do not include these libraries in our container, but instead provide instructions as to where and how to install them. In practice, this often looks like a `pip install -r requirements.txt` in our `Dockerfile`.

In lieu of including ingredients in our container we merely assume they’ll be available in our host bakery. This is a bit trickier than it sounds for the following reasons:

- Our host bakery is several blocks from our home. If we want ingredients to be available in that bakery, we’re going to need to physically carry them there in some sense.
- Even after physically bringing ingredients to the bakery, they still won’t be immediately available inside the container. Remember, after bringing our container to the bakery, the cooking that transpires within the container is isolated from the rest of the bakery itself; it interfaces only with the bakery’s oven.

For a Kaggle competition, how do we make local data available *within the container*, *on a remote machine?*

Docker Volumes allow data to be shared between a directory inside of a container and a directory in the local file system of the machine hosting that container. This is akin to Ernie:

- Carrying his ingredients to the bakery, along with (but not inside) his container.
- Upon arrival, placing a jar of sugar in a blue bucket in the corner of the room.
- Stipulating that, upon beginning to bake inside of the container at the bakery, ingredients should be shared between the blue bucket in the corner of the room and the cupboard. That way, when the recipe says “get a jar of sugar from the cupboard,” Ernie can reach into the cupboard inside of the container and retrieve the jar of sugar *from the blue bucket sitting in the corner of the bakery.* Remember: the container did not ship with any ingredients inside; the cupboard, therefore, would have itself been empty.

Carrying the container to the bakery is akin to a simple `docker run` onto the remote machine. Carrying ingredients to the bakery, i.e. placing a data file on the local file system of the remote machine, is much less sexy. In the simplest sense, this is akin to using `scp` or `rsync` to transfer a file from the local machine to the remote, or even using `curl` to download a file directly onto the remote machine itself.

In practice, this often looks like:

`docker --tlsverify --tlscacert="$HOME/.docker/machine/certs/ca.pem" --tlscert="$HOME/.docker/machine/certs/cert.pem" --tlskey="$HOME/.docker/machine/certs/key.pem" -H=tcp://12.34.56:78 run --rm -i -v /data:/data kaggle-contest build_model.sh`

To bake his cupcake, Ernie used a one-of-a-kind cutting board that Bert had hand-molded for him. How can he use this at the bakery? In Kaggle terms: how can I use a library in my project that is not available on a public package repository (i.e. one that I built myself)?

To this end, there’s really no secret sauce. With the cutting board/library, we can either:

- Include it in our container and deal with the extra weight.
- Treat it as an ingredient, carry it to the bakery, and access it via a Docker Volume.

Moving your local development inside of a Docker container, and/or Dockerizing this local environment once you’re ready to use a remote resource to do the heavier lifting, will ensure you only have to figure out how to bake the cupcake once. Prototype locally, then send stress-free to the bakery for mass production.

Happy cooking.

Here are two resources I found very helpful when learning about Docker for Kaggle:

Like most of you, I spent the day of Donald Trump’s election in a state of disbelief, paralysis and exasperation. Like many more, I had several long, critical conversations about what had just happened and where we go from here. In one conversation, a friend exclaimed:

I, admittedly, live in a bubble: I know no Trump supporters. You, Will, have a diverse group of friends: surely you know a few.

While my peers are very diverse in race, culture, gender and geography, I don’t believe a single one voted for Trump.

As it is now so grippingly clear, these people do exist, and in numbers! They are blue collar workers. They are white men and white women from Middle America. They are Latina females in Florida. They are tailgaters at Penn State football games. They are critical thinkers from Ohio. And I don’t know a single one.

We have spent our adult lives in swift, upward mobility. We went to strong Universities because we went to strong high schools. For many, our parents paid for our education, leaving us as debt-free graduates with little existential challenge beyond navigating the subway map. Thereafter, we moved to great cities and found great jobs. We traveled internationally. We learned to speak Russian, to play the viola, to build ceramic pots on the weekends and take pictures with a DSLR. Moreover, from birth, we were taught that education, self-diversification and intellectual curiosity were normal and cool. Simply, we were taught how to dream.

While we are not oligarchs with gold watches, we are in no way average. Why? Because average is average. Average is having average ambition. Average is having average intellect. Average is receiving an average education. Average is living in the same town where you grew up, be it rural or urban. Average is average because it is average, and we – my liberal, creative, resilient, radically-capable peers – my entire Facebook feed – by trivial definition, are far from average.

*So, while we were busy with prosperity, what was the average American doing?*

First and foremost, I hardly know. (This, I now resolutely realize, is a major problem; much more on this below.) However, at peripheral speculation, I’d posit the following:

The average American is in the throes of a rapidly changing world and isn’t sure how to react. This is an honest, decent taxi driver who’s lost her job to Uber or a factory worker who’s lost his job to a machine.

The average American is a judicious high-school graduate without the resources to attend a strong University, and thereafter obtain a well-paying job. Vacation, mobility, and even food security are not promised.

The average American is a nurse with no idea why they are working 20-hour shifts 4 times per week.

Finally, the average American is an honest citizen who expects their government to facilitate a reasonable quality of life, whereby with average work ethic, average ambition and average creativity one can live a healthy, comfortable, average life.

To me, this seems incredibly reasonable. As we know, this is less and less our reality. The last few Presidents have done very little for these people. They are rational, and they wanted something new. That is a simplified explanation as to why a reckless and woefully under-prepared Donald Trump now finds himself in the most powerful office in the world. So, what do we do from here?

First and foremost, it is now brutally apparent that we must get to know the other side. Education is power, right? When we want to solve a problem, we first learn as much as we possibly can about the subject in question. This time, however, the process may be especially challenging when the subject in question spits at your very existence: am I really meant to hold a dialogue with someone flying a Nazi flag? Here, I’m saying yes. I’d like your thoughts as well.

We roil Trump and his base for bigotry, but bigotry is but a vehicle for blame. Blame for what, though? For the problems outlined above – for the increasingly elusive promise of a comfortable life for the typical citizen. Trump was elected because he represented a sidestep from the American politics that have underserved our masses for the last quarter-century. Throughout his campaign, he resonated with those people who had real problems that needed solving, all the while using immigrants and minorities as a scapegoat.

Simply put, the American government is largely to blame for these problems – for perpetuating serving-to-few policies on wealth distribution, education and healthcare. Furthermore, if these problems didn’t exist, I personally doubt that the people that screamed “time to go home, Apu” to a Google engineer in Silicon Valley yesterday would harbor blind hatred towards anyone at all, let alone complete strangers.

On average, lawnmowers kill 35 times more people annually than “Islamic jihadist immigrants;” when I hear fears of the latter, I chuckle in basic rationality and instead fear my chances of crossing a Manhattan Avenue. On the same note, we must realize that the scary individual who spray-painted “Sieg Heil 2016” in downtown Philadelphia on the 78th anniversary of Kristallnacht is an *extreme minority*, and while this case and any case of hate and bigotry should be treated extremely seriously, there is not (yet) reason to start running for the hills. Conversely, I would submit that the overwhelming majority of Trump’s base is a rational and welcoming human like most of the world, motivated by little more than frustration and basic human selfishness, i.e. wanting the best future for yourself and those you love.

This comes from my friend Adrian, and admittedly, I didn’t initially understand what he meant.

“If Hillary had won, we wouldn’t be having this conversation: we would have virtually high-five’d each other and moved on with our days. We wouldn’t be discussing political involvement. We wouldn’t be discussing the problems in this country! We wouldn’t be discussing what to do next.”

In times of “crisis,” people are incredibly willing to take meetings, take phone calls, hear ideas and share their own. We, liberals, now realize there are real problems in our country that we, capable and creative, can perhaps work to solve.

Let’s not forget this. Furthermore, the millennial map was blue, blue and blue.

I frequently discuss geopolitics as a point of hobby. However, I’ve never involved myself in the political process. Presently, I live in Morocco. Let us remember – for as much as I and we offer our thoughts and ideas – talk is very cheap. None of this matters if we don’t actually do something. If we don’t build tools for education. If we don’t facilitate communication between diverse groups of people. If we don’t take time out of our days to actively learn about those that supported and support our to-be President, no matter how gut-wrenchingly challenging that may well be.

As a point of closure, if you take any issue with this post, please let me know. Please tell me I’m ignorant of American realities. Please tell me I’m willfully blind. This is the point. Educate me and let’s educate ourselves. No problem was ever really solved without first deeply understanding the problem itself. I, for one, have a lot of things to learn about half of the country in which I was born.

On the outside, recurrent neural networks differ from typical, feedforward neural networks in that they take a *sequence* of input instead of an input of fixed length. Concretely, imagine we are training a sentiment classifier on a bunch of tweets. To embed these tweets in vector space, we create a bag-of-words model with vocabulary size 3. In a typical neural network, this implies an input layer of size 3; an input is a single vector of 3 numbers. In a recurrent neural network, our input layer has the same size 3, but instead of just a single size-3 input, we can feed it a sequence of size-3 inputs of any length: two such vectors, or five, or fifty.
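To make the difference in input shape concrete, here is a small sketch in numpy. The specific vectors are made up for illustration; only the vocabulary size of 3 comes from the text.

```python
import numpy as np

VOCAB_SIZE = 3  # vocabulary size from the text

# Feedforward network: one fixed-length input (values are made up).
feedforward_input = np.array([2, 0, 1])

# Recurrent network: a sequence of size-3 inputs, of any length.
short_sequence = np.array([[1, 0, 0],
                           [0, 0, 1]])
long_sequence = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 0, 1],
                          [1, 0, 0]])

for sequence in (short_sequence, long_sequence):
    n_steps, input_size = sequence.shape
    assert input_size == VOCAB_SIZE
    print(f"{n_steps} time steps, each of size {input_size}")
```

The input layer never changes size; only the number of time steps does.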

On the inside, recurrent neural networks have a different feedforward mechanism than typical neural networks. In addition, each input in our sequence of inputs is processed individually and chronologically: the first input is fed forward, then the second, and so on. Finally, after all inputs have been fed forward, we compute some gradients and update our weights. Like in feedforward networks, we also use backpropagation. However, we must now backpropagate errors to our parameters at every step in time. In other words, we must compute gradients with respect to: the state of the world when we fed our first input forward, the state of the world when we fed our second input forward, and up until the state of the world when we fed our last input forward. This algorithm is called Backpropagation Through Time.

There are many resources for understanding how to compute gradients using Backpropagation Through Time. In my view, Recurrent Neural Networks Maths is the most mathematically comprehensive, while Recurrent Neural Networks Tutorial Part 3 is more concise yet equally clear. Finally, there exists Andrej Karpathy’s Minimal character-level language model, accompanying his excellent blog post on the general theory and use of RNN’s, which I initially found convoluted and hard to understand.

In all posts, I think the authors unfortunately blur the line between the derivation of the gradients and their (efficient) implementation in code, or at the very least jump too quickly from one to another. They define variables like `dbnext` and `delta_t` without thoroughly explaining their place in the analytical gradients themselves. As one example, the first post includes the snippet:

So far, he’s just talking about analytical gradients. Next, he gives hint to the implementation-in-code that follows.

So the thing to note is that we can delay adding in the backward propagated errors until we get further into the loop. In other words, we can initially compute the derivatives of J with respect to the third unrolled network with only the first term:

And then add in the other term only when we get to the second unrolled network:

Note the opposing definitions of the same variable across the two excerpts. As far as I know, the latter is, in a vacuum, categorically false. This said, I believe the author is simply providing an alternative definition of this quantity in line with a computational shortcut he later takes.

Of course, these ambiguities become very emotional, very quickly. I myself was confused for two days. As such, the aim of this post is to derive recurrent neural network gradients from scratch, and emphatically clarify that all implementation “shortcuts” thereafter are nothing more than just that, with no real bearing on the analytical gradients themselves. In other words, if you can derive the gradients, you win. Write a unit test, code these gradients in the crudest way you can, watch your test pass, and then immediately realize that your code can be made more efficient. At this point, all “shortcuts” that the above authors (and myself, now, as well) take in their code will make perfect sense.

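In the simplest case, let’s assume our network has 3 layers, and just 3 parameters to optimize: $W^{xh}$, $W^{hh}$ and $W^{hy}$. The foundational equations of this network are as follows:

$$z_t = W^{xh} x_t + W^{hh} h_{t-1}$$

$$h_t = \tanh(z_t)$$

$$y_t = W^{hy} h_t$$

$$p_t = \text{softmax}(y_t)$$

$$J_t = \text{cross-entropy}(p_t, \text{labels}_t)$$

I’ve written “softmax” and “cross-entropy” for clarity: before tackling the math below, it is important to understand what they do, and how to derive their gradients by hand.

Before moving forward, let’s restate the definition of a partial derivative itself.

A partial derivative, for example $\frac{\partial y}{\partial x}$, measures how much $y$ increases with every 1-unit increase in $x$.

Our cost $J$ is the *total cost* (i.e., not the average cost) of a given sequence of inputs. As such, a 1-unit increase in $W^{hy}$ will impact each of $J_1$, $J_2$ and $J_3$ individually. Therefore, our gradient is equal to the sum of the respective gradients at each time step $t$:

$$\frac{\partial J}{\partial W^{hy}} = \sum_t \frac{\partial J_t}{\partial W^{hy}} = \frac{\partial J_3}{\partial W^{hy}} + \frac{\partial J_2}{\partial W^{hy}} + \frac{\partial J_1}{\partial W^{hy}}$$

$$\frac{\partial J}{\partial W^{hh}} = \sum_t \frac{\partial J_t}{\partial W^{hh}} = \frac{\partial J_3}{\partial W^{hh}} + \frac{\partial J_2}{\partial W^{hh}} + \frac{\partial J_1}{\partial W^{hh}}$$

$$\frac{\partial J}{\partial W^{xh}} = \sum_t \frac{\partial J_t}{\partial W^{xh}} = \frac{\partial J_3}{\partial W^{xh}} + \frac{\partial J_2}{\partial W^{xh}} + \frac{\partial J_1}{\partial W^{xh}}$$

Let’s take this piece by piece.

$\frac{\partial J}{\partial W^{hy}}$:

Starting with $\frac{\partial J_3}{\partial W^{hy}}$, we note that a change in $W^{hy}$ will only impact $J_3$ at time $t = 3$: $W^{hy}$ plays no role in computing the value of anything other than $y_3$. Therefore:

$$\frac{\partial J_3}{\partial W^{hy}} = \frac{\partial J_3}{\partial y_3} \frac{\partial y_3}{\partial W^{hy}}$$

$$\frac{\partial J_2}{\partial W^{hy}} = \frac{\partial J_2}{\partial y_2} \frac{\partial y_2}{\partial W^{hy}}$$

$$\frac{\partial J_1}{\partial W^{hy}} = \frac{\partial J_1}{\partial y_1} \frac{\partial y_1}{\partial W^{hy}}$$

$\frac{\partial J}{\partial W^{hh}}$:

Starting with $\frac{\partial J_3}{\partial W^{hh}}$, a change in $W^{hh}$ will impact our cost $J_3$ in *3 separate ways*: once, when computing the value of $h_1$; once, when computing the value of $h_2$, which depends on $h_1$; once, when computing the value of $h_3$, which depends on $h_2$, which depends on $h_1$.

More generally, a change in $W^{hh}$ will impact our cost $J_t$ on $t$ separate occasions. Therefore:

$$\frac{\partial J_t}{\partial W^{hh}} = \sum_{k=1}^{t} \frac{\partial J_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial z_k} \frac{\partial z_k}{\partial W^{hh}}$$

Here, $\frac{\partial z_k}{\partial W^{hh}}$ is the immediate derivative at step $k$, i.e. with $h_{k-1}$ held fixed. Then, with this definition, we compute our individual gradients as:

$$\begin{aligned}
\frac{\partial J_3}{\partial W^{hh}} &= \frac{\partial J_3}{\partial y_3} \frac{\partial y_3}{\partial h_3} \frac{\partial h_3}{\partial z_3} \frac{\partial z_3}{\partial W^{hh}} \\
&+ \frac{\partial J_3}{\partial y_3} \frac{\partial y_3}{\partial h_3} \frac{\partial h_3}{\partial z_3} \frac{\partial z_3}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial W^{hh}} \\
&+ \frac{\partial J_3}{\partial y_3} \frac{\partial y_3}{\partial h_3} \frac{\partial h_3}{\partial z_3} \frac{\partial z_3}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial W^{hh}}
\end{aligned} \tag{1}$$

$$\begin{aligned}
\frac{\partial J_2}{\partial W^{hh}} &= \frac{\partial J_2}{\partial y_2} \frac{\partial y_2}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial W^{hh}} \\
&+ \frac{\partial J_2}{\partial y_2} \frac{\partial y_2}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial W^{hh}}
\end{aligned} \tag{2}$$

$$\frac{\partial J_1}{\partial W^{hh}} = \frac{\partial J_1}{\partial y_1} \frac{\partial y_1}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial W^{hh}} \tag{3}$$

$\frac{\partial J}{\partial W^{xh}}$:

Similarly:

$$\frac{\partial J_t}{\partial W^{xh}} = \sum_{k=1}^{t} \frac{\partial J_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial z_k} \frac{\partial z_k}{\partial W^{xh}}$$

Therefore:

$$\begin{aligned}
\frac{\partial J_3}{\partial W^{xh}} &= \frac{\partial J_3}{\partial y_3} \frac{\partial y_3}{\partial h_3} \frac{\partial h_3}{\partial z_3} \frac{\partial z_3}{\partial W^{xh}} \\
&+ \frac{\partial J_3}{\partial y_3} \frac{\partial y_3}{\partial h_3} \frac{\partial h_3}{\partial z_3} \frac{\partial z_3}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial W^{xh}} \\
&+ \frac{\partial J_3}{\partial y_3} \frac{\partial y_3}{\partial h_3} \frac{\partial h_3}{\partial z_3} \frac{\partial z_3}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial W^{xh}}
\end{aligned} \tag{4}$$

$$\begin{aligned}
\frac{\partial J_2}{\partial W^{xh}} &= \frac{\partial J_2}{\partial y_2} \frac{\partial y_2}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial W^{xh}} \\
&+ \frac{\partial J_2}{\partial y_2} \frac{\partial y_2}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial W^{xh}}
\end{aligned} \tag{5}$$

$$\frac{\partial J_1}{\partial W^{xh}} = \frac{\partial J_1}{\partial y_1} \frac{\partial y_1}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial W^{xh}} \tag{6}$$

Finally, we plug in the individual partial derivatives to compute our final gradients, where:

- $\frac{\partial J_t}{\partial y_t} = p_t - \text{labels}_t$, where $\text{labels}_t$ is a one-hot vector of the correct answer at a given time-step $t$
- $\frac{\partial h_t}{\partial z_t} = 1 - \tanh^2(z_t) = 1 - h_t^2$, as $h_t = \tanh(z_t)$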

At this point, you’re done: you’ve computed your gradients, and you understand Backpropagation Through Time. From this point forward, all that’s left is writing some for-loops.
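As a hedged sketch of what “writing some for-loops” can look like, the snippet below codes the gradients in the crudest way possible (for every cost term, walk back through every earlier time step) and checks them against finite differences. It assumes a tanh hidden layer with softmax and cross-entropy on top; the names `W_xh`, `W_hh`, `W_hy`, the layer sizes, the random inputs and the labels are all illustrative choices, not taken from any of the posts above.

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_SIZE, HIDDEN_SIZE, N_STEPS = 3, 4, 3  # illustrative sizes

W_xh = rng.normal(scale=0.5, size=(HIDDEN_SIZE, INPUT_SIZE))
W_hh = rng.normal(scale=0.5, size=(HIDDEN_SIZE, HIDDEN_SIZE))
W_hy = rng.normal(scale=0.5, size=(INPUT_SIZE, HIDDEN_SIZE))
inputs = rng.normal(size=(N_STEPS, INPUT_SIZE))
labels = np.eye(INPUT_SIZE)[[0, 2, 1]]  # one one-hot label per time step


def forward(W_xh, W_hh, W_hy):
    """Return the total cost, plus hidden states and probabilities per step."""
    h = [np.zeros(HIDDEN_SIZE)]  # h[0] is the initial hidden state
    p, cost = [], 0.0
    for t in range(N_STEPS):
        z_t = W_xh @ inputs[t] + W_hh @ h[-1]
        h.append(np.tanh(z_t))
        y_t = W_hy @ h[-1]
        p_t = np.exp(y_t - y_t.max())
        p_t /= p_t.sum()                      # softmax
        p.append(p_t)
        cost += -np.log(p_t @ labels[t])      # cross-entropy, summed over time
    return cost, h, p


def crude_gradients(W_xh, W_hh, W_hy):
    """For each cost term J_t, walk back through every step k <= t. No shortcuts."""
    _, h, p = forward(W_xh, W_hh, W_hy)
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    for t in range(N_STEPS):
        dJ_dy = p[t] - labels[t]              # softmax + cross-entropy gradient
        dW_hy += np.outer(dJ_dy, h[t + 1])
        dJ_dh = W_hy.T @ dJ_dy                # gradient w.r.t. hidden state at step t
        for k in range(t, -1, -1):
            dJ_dz = dJ_dh * (1 - h[k + 1] ** 2)   # back through tanh at step k
            dW_xh += np.outer(dJ_dz, inputs[k])
            dW_hh += np.outer(dJ_dz, h[k])        # h[k] is the previous hidden state
            dJ_dh = W_hh.T @ dJ_dz                # pass the error one step further back
    return dW_xh, dW_hh, dW_hy


def numerical_gradient(index, eps=1e-5):
    """Central finite differences on the parameter at position `index`."""
    params = [W_xh, W_hh, W_hy]
    grad = np.zeros_like(params[index])
    for i in np.ndindex(grad.shape):
        original = params[index][i]
        params[index][i] = original + eps
        cost_plus = forward(*params)[0]
        params[index][i] = original - eps
        cost_minus = forward(*params)[0]
        params[index][i] = original           # restore the parameter
        grad[i] = (cost_plus - cost_minus) / (2 * eps)
    return grad


analytical = crude_gradients(W_xh, W_hh, W_hy)
for index, name in enumerate(["W_xh", "W_hh", "W_hy"]):
    assert np.allclose(analytical[index], numerical_gradient(index), atol=1e-6)
    print(f"{name}: analytical gradient matches numerical gradient")
```

The nested loop is deliberately wasteful: it recomputes the same backward steps for every cost term. Once the check passes, collapsing it into a single backward pass is purely an efficiency change.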

As you’ll readily note, the pieces of a gradient that end at the first time-step need our labels at time-steps 1, 2 and 3, since all three costs depend on the first hidden state. The pieces that end at the second time-step need our labels at time-steps 2 and 3. Finally, the pieces that end at the third time-step need our labels at just time-step 3. Naturally, we look to make this efficient: when looping backwards through time, how about computing the parts available at the current time-step, and adding in the rest as we reach the earlier ones? Instead of explaining further, I leave this step to you: it is ultimately trivial, a good exercise, and when you’re finished, you’ll find that your code readily resembles much of that written in the above resources.

Throughout this process, I learned a few lessons.

- When implementing neural networks from scratch, derive gradients by hand at the outset. *This makes things so much easier.*
- Turn more readily to your pencil and paper before writing a single line of code. They are not scary and they absolutely have their place.
- The chain rule remains simple and clear. If a derivative seems to exceed the general difficulty of the chain rule, there’s probably something else you’re missing.

Happy RNN’s.

—

Key references for this article include:

- Recurrent Neural Networks Tutorial, Part 2: Implementing a RNN with Python, Numpy and Theano
- Recurrent Neural Networks Tutorial, Part 3: Backpropagation Through Time and Vanishing Gradients
- The Unreasonable Effectiveness of Recurrent Neural Networks
- Minimal character-level language model with a Vanilla Recurrent Neural Network, in Python/numpy
- Machine Learning – Recurrent Neural Networks Maths

And as with all manual voting systems, one cannot rule out at least some degree of misclassification of papers on some scale, no matter how small. We know of no evidence of cheating, and Colombia is to be lauded for the seriousness of its referendum process, but the distinction between intentional and unintentional misclassification by individual counters can occasionally become blurred in practice.

In other words, it was humans – tired humans – counting ballots by hand.

The technology of tired humans sorting pieces of paper into four stacks is, at best, crude. As a large research literature has made clear, we can reasonably assume that even well-rested people would have made mistakes with between 0.5% and 1% of the ballots. On this estimate, about 65,000-130,000 votes would have been unintentionally misclassified. It means the number of innocent counting errors could easily be substantially larger than the 53,894 yes-no difference.
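The arithmetic in the passage above is easy to check. The sketch below uses the official ballot counts that appear in the constants further down in this post, and treats the 0.5% and 1% error rates as applying to all ballots cast (an assumption, though it reproduces the quoted range).

```python
YES_BALLOTS = 6377482
NO_BALLOTS = 6431376
UNMARKED_BALLOTS = 86243
NULL_BALLOTS = 170946
TOTAL_VOTES = YES_BALLOTS + NO_BALLOTS + UNMARKED_BALLOTS + NULL_BALLOTS

# Margin between the "No" and "Yes" tallies.
margin = NO_BALLOTS - YES_BALLOTS

# Ballots misclassified under a 0.5% and a 1% counting-error rate.
low_estimate = 0.005 * TOTAL_VOTES
high_estimate = 0.01 * TOTAL_VOTES

print(f"margin: {margin:,}")
print(f"misclassified ballots: {low_estimate:,.0f} to {high_estimate:,.0f}")
```

Even the low end of the error range is larger than the margin itself.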

Is it possible that the majority wanted “Yes” and still happened to lose?

To answer this question, we can frame the vote as a simple statistical process and ask: “if we were to re-hold the vote many more times, how often would the ‘Yes’ vote actually win?”

Should we choose, we could pursue this result analytically, i.e. solve the problem with a pencil and paper. This gets messy quickly. Instead, we’ll disregard closed-form theory and run a basic simulation; “if you can write a for-loop, you can do statistics.”

We’ll frame our problem as follows:

1. $N$ voters arrive at the polls.

2. A proportion $p_{\text{yes}}$ of them intend to vote “Yes”; the remaining $1 - p_{\text{yes}}$ intend to vote “No.”

3. Each voter casts an invalid (unmarked or void) ballot with probability $p_{\text{invalid}}$.

4. Of the valid ballots, the poll workers misclassify the vote with probability $p_{\text{misclassification}}$.

5. Majority vote wins.

    import numpy as np

    YES_BALLOTS = 6377482
    NO_BALLOTS = 6431376
    UNMARKED_BALLOTS = 86243
    NULL_BALLOTS = 170946
    TOTAL_VOTES = YES_BALLOTS + NO_BALLOTS + UNMARKED_BALLOTS + NULL_BALLOTS

    P_INVALID = .02
    P_MISCLASSIFICATION = .01
    N_TRIALS = 100000

In each trial, we assume a true, underlying $p_{\text{yes}}$ for the voting populace. For example, if $p_{\text{yes}}$ is .48, we will have $.48N$ individuals intending to vote “Yes,” and $.52N$ voters intending to vote “No,” where $N$ is the total number of voters (`TOTAL_VOTES`). We assume these values to be static: they are not generated by a random process.

Next, each voter casts an invalid ballot with probability $p_{\text{invalid}}$ (`P_INVALID`), which we model as a Binomial random variable. Each remaining, valid ballot is then misclassified with probability $p_{\text{misclassification}}$ (`P_MISCLASSIFICATION`). Finally, the tallies of “Yes” and “No” votes are counted, and the percentage of “Yes” votes is returned.

    def simulate_vote(probability_yes):
        # Static counts of voters intending to vote "Yes" and "No"
        yes_votes = int(TOTAL_VOTES * probability_yes)
        no_votes = TOTAL_VOTES - yes_votes

        # One entry per trial, so each binomial draw returns N_TRIALS samples
        yes_votes_samples = N_TRIALS * [yes_votes]
        no_votes_samples = N_TRIALS * [no_votes]

        # Invalid (unmarked or void) ballots cast by each group
        invalid_ballots_yes = np.random.binomial(n=yes_votes_samples, p=P_INVALID)
        invalid_ballots_no = np.random.binomial(n=no_votes_samples, p=P_INVALID)
        valid_yes_votes = yes_votes - invalid_ballots_yes
        valid_no_votes = no_votes - invalid_ballots_no

        # Valid ballots are counted correctly with probability 1 - P_MISCLASSIFICATION
        yes_votes_from_yes_voters = np.random.binomial(n=valid_yes_votes, p=1-P_MISCLASSIFICATION)
        no_votes_from_yes_voters = valid_yes_votes - yes_votes_from_yes_voters
        no_votes_from_no_voters = np.random.binomial(n=valid_no_votes, p=1-P_MISCLASSIFICATION)
        yes_votes_from_no_voters = valid_no_votes - no_votes_from_no_voters

        # Final tallies, and the share of tallied votes that were "Yes"
        tallied_yes_votes = yes_votes_from_yes_voters + yes_votes_from_no_voters
        tallied_no_votes = no_votes_from_no_voters + no_votes_from_yes_voters
        return tallied_yes_votes / (tallied_yes_votes + tallied_no_votes)

Let’s try this out for varying values of $p_{\text{yes}}$. To start, if the true, underlying percentage of “Yes” voters were 51%, how often would the “No” vote still win?

    In [16]: percentage_of_tallied_votes_that_were_yes = simulate_vote(.51)
        ...: (percentage_of_tallied_votes_that_were_yes < .5).mean()
    Out[16]: 0.0

That’s comforting. Given our assumptions, if 51% of the Colombian people arrived at the polls intending to vote “Yes,” the “No” vote would have nonetheless won in 0 of 100,000 trials. So, how close can we get before we start seeing backwards results?

    for epsilon in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]:
        probability_yes = .5 + epsilon
        percentage_of_tallied_votes_that_were_yes = simulate_vote(probability_yes)
        proportion_of_trials_won_by_no = (percentage_of_tallied_votes_that_were_yes < .5).mean()
        results = "p_yes: {:1.6f}% | no_win_percentage: {:1.3f}%"
        print(results.format(100*probability_yes, 100*proportion_of_trials_won_by_no))

    p_yes: 60.000000% | no_win_percentage: 0.000%
    p_yes: 51.000000% | no_win_percentage: 0.000%
    p_yes: 50.100000% | no_win_percentage: 0.000%
    p_yes: 50.010000% | no_win_percentage: 0.191%
    p_yes: 50.001000% | no_win_percentage: 38.688%
    p_yes: 50.000100% | no_win_percentage: 48.791%
    p_yes: 50.000010% | no_win_percentage: 50.063%

Our first frustration comes at $p_{\text{yes}} = .5001$: if ≈ 6,534,330 voters wanted “Yes” vs. ≈ 6,531,717 who wanted “No,” the “No” vote would have still won 0.191% of the time. Again, this reversal derives from human error: both on the part of the voter in casting an invalid ballot, and on the part of the poll-worker incorrectly classifying that ballot by hand.

As we move further down, the results get tighter. At $p_{\text{yes}} = .50001$, the “Yes” vote can only be expected to have won 1 – .38688 = .61312, or 61.312%, of the time. Finally, at $p_{\text{yes}} = .5000001$ (which, keep in mind, implies an “I intend to vote ‘Yes’” vs. “I intend to vote ‘No’” differential of no more than a few voters), the “No” vote actually wins the *majority* of the 100,000 hypothetical trials. At that point, we’re really just flipping coins.
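As a cross-check on the simulated figures, the pencil-and-paper route is not too painful if we settle for a normal approximation. This is a sketch, not part of the original analysis: it treats each voter as moving the “Yes” minus “No” tally by +1, -1 or 0, sums the means and variances, and ignores ties. The function name is hypothetical; the constants mirror those used in the simulation.

```python
import math

TOTAL_VOTES = 6377482 + 6431376 + 86243 + 170946
P_INVALID = .02
P_MISCLASSIFICATION = .01


def approx_probability_no_wins(probability_yes):
    """Normal approximation to P(tallied 'Yes' < tallied 'No')."""
    yes_votes = int(TOTAL_VOTES * probability_yes)
    no_votes = TOTAL_VOTES - yes_votes

    # A "Yes" voter contributes +1 (valid, counted correctly),
    # -1 (valid, misclassified) or 0 (invalid ballot) to yes - no.
    p_valid = 1 - P_INVALID
    mean_per_yes_voter = p_valid * (1 - 2 * P_MISCLASSIFICATION)
    # A "No" voter has the opposite mean and the same variance.
    variance_per_voter = p_valid - mean_per_yes_voter ** 2

    mean = (yes_votes - no_votes) * mean_per_yes_voter
    std = math.sqrt(TOTAL_VOTES * variance_per_voter)

    # Standard normal CDF evaluated at (0 - mean) / std.
    return 0.5 * (1 + math.erf(-mean / (std * math.sqrt(2))))


for probability_yes in [.51, .5001, .50001]:
    no_win = approx_probability_no_wins(probability_yes)
    print(f"p_yes: {probability_yes} | approx. no_win_percentage: {100 * no_win:.3f}%")
```

Under these assumptions the approximation lands close to the simulated results at 50.01% and 50.001%, which suggests the simulation is behaving as intended.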

In summary, as the authors of the above post suggest, it would be statistically irresponsible to claim a definitive win for the “No.” Conversely, the true, underlying margin does prove to be extremely tight: maybe a majority vote just isn’t the best way to handle these issues after all.

—

The notebook and repo for the analysis can be found here. Key references include:

I want to become more of a technical expert in machine learning. I want to use this expertise to solve real-world problems that actually matter.

To this end, I see two main roads: a traditional graduate program, and the OSMLM.

For me, graduate school is suboptimal for 3 key reasons:

1. **It’s expensive.** Upon a quick Google search, a 2-year graduate program would cost, conservatively, $80,000 in tuition fees alone. This is a wholly nontrivial sum of money that would impact how I structure the next 10 years of my life.

2. **There are far more dependencies.** I have to apply. I have to get accepted. I have to find the right professor. I have to find a city suitable to my broader interests and lifestyle. This takes time.

3. **By the time I finish, the field of machine learning will look fundamentally different than it did when I started.** This is the most important point of all. The only way to remain current with the latest tools and techniques is to do just that. Given the furious and only-accelerating-faster pace at which machine learning is moving, this requires much more than just a few hours on the weekend.

1. **I think the higher education paradigm is changing.** Access to critical, academic knowledge is increasingly democratic: Khan Academy can teach me about the Central Limit Theorem as well as any statistics professor. The ~$250,000 in tuition fees commanded by an undergraduate education at a private American university is, for some, several decades of debt and concession, and for others, prohibitive beyond comedy, reason and fantasy alike. If hard-skills are your end, online self-education is an immensely attractive, intuitive, and practical road to follow – especially in an industry as meritocratic as tech.

2. **I’m keenly aware of how productive I am in a self-teaching environment.** I’m largely self-taught in data science. Before that, it was online poker: a 5-year, $50 to $150,000 journey of instructional videos, online forums, critical discussion with other players and personal coaching – all from the comfort of my bedroom. I’m very effective at learning things online.

3. **Some of the most impactful projects I’ve completed professionally stemmed directly from those I’d completed personally.** I would not know how to ensemble models if not for Kaggle. I would not know how to perform hierarchical Bayesian inference if not for Bayesian Methods for Hackers. The open-source data science community continues to teach me creative ways to use data to solve challenging problems. To this end, I want to consume, consume, consume.

4. **The road to further technical expertise is a function of little more than time and effort.** I have a few years’ industry experience as a Data Scientist. I can write clean code and productionize machine learning things. For me, the OSMLM is nothing more than taking all of the extra-curricular time spent learning new tools and algorithms and making it a full-time job.

5. **I’m extremely motivated.** The thought of studying machine learning all day has me smiling from ear to ear. Simply put, I f*cking love this stuff.

6-9 months. Not forever.

I aim to speak indistinguishably fluent French and Spanish by the time I’m 30. I’m currently 27. The Spanish box is largely checked. With 6-9 months in Francophone Morocco, the French box will be largely checked as well.

Furthermore, I’ve always wanted to live in a Muslim country: I grew up in a predominantly Jewish suburb of Philadelphia, and have had fantastic experiences traveling the Muslim world.

I’ll be spending my best 8-10 hours of the day working from a co-working space. I’ll be taking online courses, reading textbooks, participating in machine learning competitions and publishing open-source code. I intend to post frequently to this blog.

I have 4 main areas of focus:

1. **“Deep Learning” with flavors of: auto-encoders, recommendation, and natural language processing.** I remain obsessed with encoding real-world entities as lists of numbers. I like applications that seek to understand people better than they understand themselves. Free-form text is everywhere (and relatively quick to process).

2. **Bayesian Inference.** Because they taught me frequentist statistics in school.

3. **Game Theory and Reinforcement Learning.** I wrote an undergraduate thesis in game theory and group dynamics and remain eager to tackle more. Reinforcement Learning seems like the hipster way to solve such problems these days.

4. **Apache Spark and Distributed Computing.** I have a bit of professional experience with Spark. As data continues to grow in size, distributed computing will move from a thing Google does to a no-duh occupational necessity.

Success has a few faces:

1. **Technical.** Have the technical expertise to lead teams focused on each of the above 4 topics (weighted towards the former 3, realistically).

2. **Personal.** Learning how I best learn. How do I structure my ideal working day? Do I prefer working alone, or indeed as part of a team? What is my optimal balance of reading, thinking, and coding?

3. **Language.** I intend to speak French like it’s my mother tongue.

I’m probably moving to Colombia, where I intend to devote myself to an impossibly awesome technology project and team for a period of several years.

In addition to self-study, I’d like to assist a few fascinating Moroccan technology organizations with their data problems. As such, if you know anyone in-country with even the most fleeting shared interest, please put me in touch.

The Open-Source Machine Learning Masters in Casablanca, Morocco allows me to pursue several significant personal goals at the same time. This is my Francophone machine learning adventure.

Currently, this project contains a feedforward neural network with sigmoid activation functions, and both mean squared error and cross-entropy loss functions. Because everything is an object, adding future activation functions, loss functions, and optimization routines should be a breeze.

In addition to being highly composable, this project is highly readable. Too often, data science code is a veritable circus of variables named “wtxb”, six-times-nested for loops, and 80-line functions. The code within is both explicit and straightforward; few readers should be left wondering: “what the f*ck does that variable mean?”

Code can be found here. An example notebook is included. This is an ongoing project: I intend to add more loss functions, activation functions, convolutional and recurrent neural networks, and other optimization improvements in due time.

Our neuron looks like this:

Our parameters look like this:

    ACTIVATION = 3
    INITIAL_WEIGHT = .5
    INITIAL_BIAS = 2
    ACTUAL_OUTPUT_NEURON_ACTIVATION = 0
    N_EPOCHS = 5000
    LEARNING_RATE = .05

We have an initial weight and bias of:

    weight = INITIAL_WEIGHT
    bias = INITIAL_BIAS

After each iteration of gradient descent, we update these parameters via:

    weight += -LEARNING_RATE * weight_gradient
    bias += -LEARNING_RATE * bias_gradient

This is where the “learning” is concretized: changing a weight and a bias to a different weight and bias that makes our network better at prediction. So: how do we obtain the `weight_gradient` and `bias_gradient`? More importantly, *why do we want these things in the first place?*

I’ll be keeping this simple because it is simple. Our initial weight ($w$) and bias ($b$) were chosen randomly. As such, our network will likely make terrible predictions. By definition, our cost will be high. We want to make our cost low. Let’s pick a new weight and bias that change this cost by $\Delta C$, where $\Delta C$ is some strictly negative number. For our weight: define $\Delta C$, for a small change in weight, as “how much our cost changes with respect to a 1 unit change in our weight” times “how much we changed our weight”. In math, that looks like:

$$\Delta C = \frac{\partial C}{\partial w} \Delta w \tag{1}$$

Our goal is to make $\Delta C$ strictly negative, such that every time we update our weight, we do so in a way that lowers our cost. Duh. Let’s choose $\Delta w = -\eta \frac{\partial C}{\partial w}$, where $\eta$ is our (positive) learning rate. Our previous expression becomes:

$$\Delta C = -\eta \left(\frac{\partial C}{\partial w}\right)^2 \tag{2}$$

$\left(\frac{\partial C}{\partial w}\right)^2$ is strictly positive for any non-zero gradient, and a positive number multiplied by a negative number ($-\eta$) is strictly negative. So, by choosing $\Delta w = -\eta \frac{\partial C}{\partial w}$, our $\Delta C$ is always negative; in other words, at each iteration of gradient descent – in which we perform `weight += delta_weight`, a.k.a. `weight += -LEARNING_RATE * weight_gradient` – our cost always goes down. Nice.

For our bias, it’s the very same thing.

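Deriving both $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ is pure 12th grade calculus. Plain and simple. If you forget your 12th grade calculus, spend ~2 minutes refreshing your memory with an article online. It’s not difficult. Before we begin, we must first define a cost function and an activation function. Let’s choose quadratic loss (written here with the conventional factor of ½) and a sigmoid respectively.

$$C = \frac{(y - a)^2}{2}$$

$$a = \sigma(z)$$

.. where $a$ is the neuron’s final output, $y$ is the output we want to see, $z$ is the linear combination ($wx + b$) input, and $\sigma(z) = \frac{1}{1 + e^{-z}}$.

Using the chain rule, our desired expression becomes:

$$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w} \tag{3}$$

For our bias, the expression is almost identical:

$$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial b} \tag{4}$$

Now we need expressions for $\frac{\partial C}{\partial a}$ and $\frac{\partial a}{\partial z}$. Let’s derive them.

$$\frac{\partial C}{\partial a} = \frac{\partial}{\partial a} \frac{(y - a)^2}{2} = -(y - a) = a - y \tag{5}$$

$$\frac{\partial a}{\partial z} = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)(1 - \sigma(z)) \tag{6}$$

As such, noting that $\frac{\partial z}{\partial w} = x$ and $\frac{\partial z}{\partial b} = 1$, our final expressions for $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ are:

$$\frac{\partial C}{\partial w} = (a - y)\,\sigma(z)(1 - \sigma(z))\,x$$

$$\frac{\partial C}{\partial b} = (a - y)\,\sigma(z)(1 - \sigma(z))$$

From there, we just plug in our values from the start ($x$ is our `ACTIVATION`) to solve for `weight_gradient` and `bias_gradient`. The result of each is a *real-valued number*. It is no longer a function, nor expression, nor nebulous mathematical concept.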
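To see that these really are just numbers, here is a small sketch that plugs in the starting values and compares the result against a finite-difference estimate. It assumes the quadratic loss is written as half the squared error and the activation is a sigmoid; the helper names `sigmoid`, `cost` and `gradients` are illustrative.

```python
import math

ACTIVATION = 3
INITIAL_WEIGHT = .5
INITIAL_BIAS = 2
ACTUAL_OUTPUT_NEURON_ACTIVATION = 0


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def cost(weight, bias):
    # Quadratic loss, assumed here to be half the squared error.
    a = sigmoid(weight * ACTIVATION + bias)
    return 0.5 * (ACTUAL_OUTPUT_NEURON_ACTIVATION - a) ** 2


def gradients(weight, bias):
    # Chain rule: dC/da * da/dz * dz/dw (or dz/db, which is 1).
    a = sigmoid(weight * ACTIVATION + bias)
    dC_da = a - ACTUAL_OUTPUT_NEURON_ACTIVATION
    da_dz = a * (1 - a)
    return dC_da * da_dz * ACTIVATION, dC_da * da_dz


weight_gradient, bias_gradient = gradients(INITIAL_WEIGHT, INITIAL_BIAS)

# Finite-difference estimates of the same two quantities.
eps = 1e-6
numerical_weight_gradient = (
    cost(INITIAL_WEIGHT + eps, INITIAL_BIAS) - cost(INITIAL_WEIGHT - eps, INITIAL_BIAS)
) / (2 * eps)
numerical_bias_gradient = (
    cost(INITIAL_WEIGHT, INITIAL_BIAS + eps) - cost(INITIAL_WEIGHT, INITIAL_BIAS - eps)
) / (2 * eps)

print(f"weight_gradient: {weight_gradient:.6f} (numerical: {numerical_weight_gradient:.6f})")
print(f"bias_gradient:   {bias_gradient:.6f} (numerical: {numerical_bias_gradient:.6f})")
```

If the analytical and numerical values agree, the derivation is almost certainly right.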

Finally, as initially prescribed, we update our weight and bias via:

    weight += -LEARNING_RATE * weight_gradient
    bias += -LEARNING_RATE * bias_gradient

Because of Equation (2), the resulting weight and bias will give a lower cost than before. Nice!
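Putting the pieces together, a minimal sketch of the full loop might look like the following. It reuses the parameters from the start of the post and assumes the same half-squared-error loss and sigmoid activation; it is a toy version for illustration, not the notebook’s exact code.

```python
import math

ACTIVATION = 3
INITIAL_WEIGHT = .5
INITIAL_BIAS = 2
ACTUAL_OUTPUT_NEURON_ACTIVATION = 0
N_EPOCHS = 5000
LEARNING_RATE = .05


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


weight, bias = INITIAL_WEIGHT, INITIAL_BIAS
costs = []

for _ in range(N_EPOCHS):
    # Forward pass: linear combination, then sigmoid.
    a = sigmoid(weight * ACTIVATION + bias)
    costs.append(0.5 * (ACTUAL_OUTPUT_NEURON_ACTIVATION - a) ** 2)

    # Gradients via the chain rule.
    dC_dz = (a - ACTUAL_OUTPUT_NEURON_ACTIVATION) * a * (1 - a)
    weight_gradient = dC_dz * ACTIVATION
    bias_gradient = dC_dz

    # Gradient descent update.
    weight += -LEARNING_RATE * weight_gradient
    bias += -LEARNING_RATE * bias_gradient

print(f"initial cost: {costs[0]:.4f} | final cost: {costs[-1]:.6f}")
```

Recording the cost at every epoch makes it easy to confirm that it never goes back up.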

Here’s a notebook showing this process in action. Happy gradient descent!

]]>