Slide 25 of 32
mdesoto

I'm guessing this step control parameter is equivalent to a learning rate in ML applications?

keenan

@mdesoto Right. In ML, specifically in stochastic gradient descent (SGD), the "learning rate" corresponds to the step size. However, the treatment of step control/learning rate in ML is rather simplistic: just pick a fixed constant (or maybe learn this constant).
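To make the correspondence concrete, here is a minimal sketch of gradient descent with a fixed step size (the "learning rate") on a toy 1-D objective. The function names and the objective are illustrative, not from the slide.

```python
# Gradient descent with a fixed, constant step size ("learning rate"),
# shown on the toy objective f(x) = (x - 3)^2, whose minimum is at x = 3.

def grad_f(x):
    # derivative of f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

def gradient_descent(x0, step=0.1, iters=100):
    x = x0
    for _ in range(iters):
        x = x - step * grad_f(x)  # fixed step: no line search, no adaptation
    return x

x_min = gradient_descent(0.0)  # converges toward 3
```

The entire "step control strategy" here is the single constant `step` — which is exactly the simplistic treatment described above: too large and the iteration diverges, too small and it crawls.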

In general, there are certain concepts that are ignored (or simply work out differently) in ML due to the structure and constraints of the problem. For instance, in ML you are often dealing with very large but heterogeneous problems; here SGD provides scalability (and other features) that make it "good enough" for learning. Line search/step control strategies are eschewed for reasons discussed and debated here.

But beware! When solving optimization problems outside of ML, you will want to consider a different bag of tricks. For instance, you may have small subproblems in a graphics algorithm where convergence rate and accuracy can be improved dramatically by judicious use of line search. ...and in fact, more and more, people are building layers for deep neural nets that perform some little local computation---which in turn might require different optimization strategies.
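As a point of contrast with the fixed-step approach, here is a minimal sketch of backtracking (Armijo) line search for a scalar problem. The names, constants, and toy objective are illustrative assumptions, not code from the lecture.

```python
# Backtracking (Armijo) line search: start with a big step and shrink it
# until the objective decreases "sufficiently" along the descent direction.
# Scalar (1-D) case for clarity; the same idea applies componentwise in R^n.

def backtracking_line_search(f, grad, x, d, t0=1.0, alpha=0.25, beta=0.5):
    t = t0
    fx = f(x)
    slope = grad(x) * d              # directional derivative at x along d
    while f(x + t * d) > fx + alpha * t * slope:
        t *= beta                    # step too large: cut it back
    return t

def minimize(f, grad, x0, iters=50):
    x = x0
    for _ in range(iters):
        d = -grad(x)                 # steepest-descent direction
        if abs(d) < 1e-12:
            break                    # gradient (numerically) zero: done
        x = x + backtracking_line_search(f, grad, x, d) * d
    return x
```

For small, smooth subproblems like the ones mentioned above, this kind of adaptive step selection can converge in far fewer iterations than any single fixed step size.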

As recommended on another slide, you'll gain a lot of power by learning the fundamentals of optimization "in the abstract," rather than limiting yourself to just one little application domain (like ML or graphics).

rlpo

Are gradient calculations in computer graphics programs usually done through finite differences or autograd systems?

keenan

@rlpo It depends on the context.

For optimization problems, it's most common to either (i) use automatic differentiation, or (ii) write out the derivatives by hand and code them up. The second option might sound crazy, but in many problems you just have a few small derivatives you want to evaluate over and over again. Writing them out by hand provides opportunities for optimization (and more careful numerical implementation) that autodiff does not. Finite differences are not typically used for minimizing an objective function, because they're both numerically inaccurate and expensive to evaluate.

On the other hand...

For solving partial differential equations (especially on grids), finite differences are an absolutely critical part of the machinery. Unlike in optimization, you don't know how to evaluate the function at every point of space---you only have a sampled representation of the function. So, the only thing you can do is take finite differences. And because you're solving a very different algorithmic problem (integrating PDEs, rather than solving an optimization problem), things are actually super fast, and quite accurate.
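To illustrate the grid setting, here is a hypothetical sketch of one explicit time step for the 1-D heat equation u_t = u_xx, using the standard centered difference for u_xx on sampled values. It is not code from the lecture.

```python
# One explicit time step of the 1-D heat equation u_t = u_xx on a uniform
# grid with spacing dx. The function is only known through its samples u[i],
# so the second derivative is approximated by a centered finite difference.
# Stable for dt <= dx^2 / 2.

def heat_step(u, dt, dx):
    """Advance the sampled values u one time step; endpoints held fixed."""
    new = u[:]  # copy; boundary values (Dirichlet) stay as-is
    for i in range(1, len(u) - 1):
        uxx = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)
        new[i] = u[i] + dt * uxx
    return new
```

A single step already shows the expected behavior: a spike of "heat" diffuses into its neighbors.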

Moral of the story: pick the right tool for the job! E.g., finite differences are not "good" or "bad." You just have to understand them deeply, so you can determine which contexts are / are not appropriate to apply them.

keenan

@rlpo Also, here's a little note on the usage of the terms autodiff, autograd, etc., from notes [1] by Roger Grosse:

(excerpt from [1] not reproduced here)

[1] http://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/readings/L06%20Automatic%20Differentiation.pdf