SU-CS361 APR092024 — Jemoka Knowledge Base

More Bracketing Methods

Quadratic search

If your function is unimodal…

1. Pick three points that give “high, low, high”
2. Fit a quadratic to them, evaluate its minimum, and add it to the point set
3. Now, drop one of the four resulting points, so that the remaining three still give “high, low, high”

Shubert-Piyavskii Method

This is a Bracketing approach which guarantees optimality WITHOUT unimodality by using the Lipschitz Constant. But, this only works in one dimension.

Consider a Lipschitz continuous function with Lipschitz Constant L. We start with our two initial points a and b. First, we arbitrarily pick a point in the middle to evaluate; this will give us a cone (see Lipschitz Condition) which bounds the function. We will then evaluate the function and draw a new cone at each of our end points a, b. By piecing together the cones, we now obtain a sawtooth which lower bounds our function.

We will continue this by choosing the lowest point on our lower bound, reevaluating the function there, and raising the bound to the new sawtooth.

[Figure: example sawtooth bound for a maximization problem]

Bisection Method

Like Newton’s Method, this is a ROOT FINDING METHOD which we are co-opting to find the minimum by solving for the FONC (the first-order necessary condition, f'(x) = 0) within an interval.

Bisect f'(x) by sampling points in the middle of the “valid interval” until you find the point which gives f'(x) = 0. You do this by evaluating f' at each edge, ensuring that there is a sign switch between the edges (i.e. there is a root between the edge points), and then sampling the middle of the interval and keeping the half that still has the sign switch. You know that there is a root somewhere in between by the intermediate value theorem.

Local Descent

Descent Direction Iteration

Descent Direction Iteration is a class of methods that uses a “local model” to improve the design point until we converge.
1. check whether our current \bold{x} meets our termination conditions; if not…
2. calculate some descent direction \bold{d} to update our \bold{x}; sometimes, people say it has to be normalized
3. decide some step size \alpha
4. have fun: \bold{x} \leftarrow \bold{x} + \alpha \bold{d}

Line Search

We can choose the step size \alpha using line search; i.e., figure out our \bold{d} somehow, and then use any of the Bracketing methods (or grid it up) to solve:

\begin{equation} \min_{\alpha} f(\bold{x} + \alpha \bold{d}) \end{equation}
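As a rough sketch of how the pieces above might fit together, the Python below runs descent direction iteration with a line search that solves for \alpha by bisecting the directional derivative (the bisection idea from the bracketing section). The names `bisect_root`, `line_search`, and `descend`, the tolerances, the interval-doubling step, and the choice of the negative gradient as \bold{d} are all assumptions made for illustration; they are not from the original notes.

```python
def bisect_root(g, lo, hi, tol=1e-8, max_iter=200):
    """Bisection: find a root of g on [lo, hi], assuming a sign switch between the edges."""
    g_lo, g_hi = g(lo), g(hi)
    if g_lo == 0.0:
        return lo
    if g_hi == 0.0:
        return hi
    if g_lo * g_hi > 0.0:
        raise ValueError("g(lo) and g(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_mid == 0.0 or hi - lo < tol:
            return mid
        if g_lo * g_mid < 0.0:
            hi = mid               # the sign switch is in the left half
        else:
            lo, g_lo = mid, g_mid  # the sign switch is in the right half
    return 0.5 * (lo + hi)


def line_search(grad, x, d, alpha_hi=1.0, max_doublings=60):
    """Approximately solve min_alpha f(x + alpha d) by finding where the slope along d is zero."""
    def slope(alpha):
        point = [xi + alpha * di for xi, di in zip(x, d)]
        return sum(gi * di for gi, di in zip(grad(point), d))

    # Assumption: d is a descent direction, so slope(0) <= 0. Grow the interval
    # until the slope turns non-negative, so that the edges have a sign switch.
    for _ in range(max_doublings):
        if slope(alpha_hi) >= 0.0:
            break
        alpha_hi *= 2.0
    return bisect_root(slope, 0.0, alpha_hi)


def descend(grad, x, eps=1e-8, k_max=100):
    """Descent direction iteration using the negative gradient as the direction."""
    for _ in range(k_max):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < eps:        # 1. termination check
            break
        d = [-gi for gi in g]                            # 2. descent direction
        alpha = line_search(grad, x, d)                  # 3. step size
        x = [xi + alpha * di for xi, di in zip(x, d)]    # 4. update
    return x
```

For example, with the hypothetical bowl f(x, y) = (x - 1)^2 + 2(y + 2)^2, calling `descend(lambda p: [2 * (p[0] - 1), 4 * (p[1] + 2)], [0.0, 0.0])` should move the point toward (1, -2).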

Decaying \alpha

We can also give up solving for the best \alpha, fix an initial learning rate \alpha, and then decay it using \alpha \gamma^{n}, where n is the number of iterations and \gamma \in (0, 1] is the decay rate.

Approximate Line Search

Instead of repeatedly evaluating the function f to solve the line search exactly, we use a first-order approximation built from our directional derivative (plus some acceptability factor \beta \in [0,1], usually \beta=1 \times 10^{-4}). We will then choose the largest \alpha that satisfies the Sufficient Decrease Condition:

\begin{equation} f(x_{t+1}) \leq f(x_{t}) + \beta \alpha \nabla_{d} f(x_{t}) \end{equation}
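One common way to approximate “the largest \alpha that satisfies sufficient decrease” is backtracking: start from a large step and shrink it until the condition holds. The sketch below assumes that approach; the function name `backtracking_line_search`, the starting step of 1, and the shrink factor of 0.5 are illustrative choices, not something the notes prescribe. It only finds the largest acceptable \alpha among the candidates 1, 0.5, 0.25, and so on.

```python
def backtracking_line_search(f, grad, x, d, alpha=1.0, shrink=0.5,
                             beta=1e-4, max_shrinks=50):
    """Shrink alpha until f(x + alpha d) <= f(x) + beta * alpha * (directional derivative)."""
    fx = f(x)
    # Directional derivative of f at x along d: grad f(x) . d
    slope = sum(gi * di for gi, di in zip(grad(x), d))
    for _ in range(max_shrinks):
        x_new = [xi + alpha * di for xi, di in zip(x, d)]
        if f(x_new) <= fx + beta * alpha * slope:   # sufficient decrease holds
            return alpha
        alpha *= shrink                             # step too long, back off
    return alpha  # fallback: smallest candidate tried
```

Because each candidate costs one evaluation of f, this is usually much cheaper than solving the line search exactly, at the price of a step size that is only “good enough”.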

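Curvature Condition: a second condition, which bounds the “shallowness” of the directional derivative at the new point (it must be shallower than at the current point, which rules out steps that are too short).

Trust Region Methods

We often want to bound our change in x by some region \delta in our steps; so, we really want to…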

\begin{align} \min_{x'}\ &f(x') \\ \text{s.t.}\ & \| x-x' \| \leq \delta \end{align}

To choose \delta, we grow or shrink our region of trust based on how well our local model \hat{f} predicted the actual improvement (if we used a first-order local model to figure out our descent direction, we use that first-order estimate for \hat{f}):

\begin{align} \eta = \frac{f(x)-f(x')}{f(x)-\hat{f}(x')} \end{align}

If \eta < \eta_{1}, we reject our new point and scale down \delta by some amount, as evidently our actual improvement is smaller than expected; if \eta > \eta_{2}, we accept our new point and scale up \delta by some amount, as our improvement is better than expected. Otherwise, we accept the new point and do nothing to the trust region.

Termination Conditions

Maximum Iterations

\begin{equation} k > k_{\max } \end{equation}

termination condition for those on a deadline

Absolute Improvement

\begin{equation} |f(x_{t}) - f(x_{t+1})| < \epsilon \end{equation}

Relative Improvement

\begin{equation} f(x_{t}) - f(x_{t+1}) < \epsilon | f(x_{t})| \end{equation}

i.e., the improvement falls within some range of acceptability relative to the current function value.

Gradient

\begin{equation} \|\nabla f(x_{t})\| < \epsilon \end{equation}

First-Order Methods

gradient descent

see gradient descent

\begin{equation} \bold{d} = -\nabla f(x) \end{equation}
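To make the termination conditions concrete, here is a minimal sketch of fixed-step gradient descent (stepping along the negative gradient) that checks each of them in turn. The function name `gradient_descent`, the default thresholds, the fixed step size, and the order in which the conditions are checked are all assumptions for illustration.

```python
def gradient_descent(f, grad, x, alpha=0.1, k_max=1000,
                     eps_abs=1e-10, eps_rel=1e-10, eps_grad=1e-6):
    """Fixed-step gradient descent; returns (x, name of the condition that stopped it)."""
    fx = f(x)
    for _ in range(k_max):                               # maximum iterations
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < eps_grad:   # gradient magnitude
            return x, "gradient"
        x_new = [xi - alpha * gi for xi, gi in zip(x, g)]  # step along -grad f(x)
        fx_new = f(x_new)
        improvement = fx - fx_new
        stop_abs = abs(improvement) < eps_abs            # absolute improvement
        stop_rel = improvement < eps_rel * abs(fx)       # relative improvement
        x, fx = x_new, fx_new
        if stop_abs:
            return x, "absolute improvement"
        if stop_rel:
            return x, "relative improvement"
    return x, "maximum iterations"
```

Returning the name of the condition that fired is just a convenience for seeing which check ended the run; in practice people often use only one or two of these checks.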

Conjugate Gradient

Motivation: steepest descent, but choose search directions that are orthogonal with respect to A! Suppose A is symmetric, positive-definite. Choosing orthogonal search directions helps us preclude searching in overlapping directions (and wasting time). Essentially, we choose directions such that \langle s^{(q)}, s^{(\hat{q})} \rangle_{A} = 0 for q \neq \hat{q}, where \langle u, v \rangle_{A} = u^{\top} A v is the A-weighted (matrix-scaled) inner product. (“Orthogonality carries regardless of weighting”). We optimize the function as if it's a quadratic function:

\begin{equation} \min_{x} f(x) = \frac{1}{2} \bold{x}^{\top} \bold{A}\bold{x} + \bold{b}^{\top} \bold{x} + c \end{equation}

where A is a positive-definite matrix. Under this assumption, the function behaves like a bowl. We can then formulate:

\begin{equation} \bold{d}_{t+1} = -\nabla_{t+1} f + \beta \bold{d}_{t} \end{equation}

where \bold{d}_{t+1} is the step direction that takes us to x_{t+2}!! So we are essentially mixing the new steepest-descent direction with the previous step direction. We usually set \beta using the Fletcher-Reeves or Polak-Ribière formula. All descent directions are Mutually Conjugate.

Mutually Conjugate

If x_{i} and x_{j} (i \neq j) are Mutually Conjugate with respect to A, we have:

\begin{equation} x_{i}^{\top} A x_{j} = 0 \end{equation}
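As an illustration, the sketch below applies conjugate gradient to the quadratic \frac{1}{2} \bold{x}^{\top} \bold{A}\bold{x} + \bold{b}^{\top} \bold{x} + c, whose gradient is \bold{A}\bold{x} + \bold{b}. It assumes the Fletcher-Reeves choice of \beta and the exact step size available for a quadratic; the helper names `matvec`, `dot`, and `conjugate_gradient` are made up for this example. It also returns the directions it used, so that mutual conjugacy can be checked.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(A, b, x, tol=1e-12):
    """Minimize 0.5 x^T A x + b^T x for symmetric positive-definite A."""
    g = [axi + bi for axi, bi in zip(matvec(A, x), b)]   # gradient A x + b
    d = [-gi for gi in g]                                # first direction: steepest descent
    directions = []
    for _ in range(len(b)):                              # at most n steps
        if dot(g, g) < tol:
            break
        Ad = matvec(A, d)
        alpha = -dot(g, d) / dot(d, Ad)                  # exact line search on a quadratic
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = [gi + alpha * adi for gi, adi in zip(g, Ad)]
        beta = dot(g_new, g_new) / dot(g, g)             # Fletcher-Reeves
        directions.append(d)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x, directions
```

For a 2-by-2 system such as A = [[4, 1], [1, 3]], b = [-1, -2], two steps are enough, and the two returned directions should satisfy the conjugacy relation above up to rounding error.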

Error-Analysis for Conjugate Gradient

“Steepest Descent/gradient descent, but choose step directions to be A orthogonal instead”

Initial error e^{(1)} = \sum_{\hat{q} = 1}^{n} \beta^{(\hat{q})} s^{(\hat{q})} (the \beta are the coefficients that we don't know; they are what we want to solve for, since they give us the error). After q-1 steps, the error is:

\begin{equation} e^{(q)} = e^{(1)} + \sum_{\hat{q} = 1}^{q-1} \alpha^{(\hat{q})} s^{(\hat{q})} \end{equation}

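taking the A-weighted inner product with s^{(q)}, and using the A-orthogonality of the search directions, this implies that: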

\begin{equation} \langle s^{(q)}, e^{(q)} \rangle_{A} = \langle s^{(q)}, e^{(1)} \rangle_{A} \end{equation}

this gives:

\begin{equation} \alpha^{(q)} = \frac{s^{(q)\top} r^{(q)}}{s^{(q)\top} A s^{(q)}} = - \frac{\langle s^{(q)}, e^{(q)} \rangle_{A}}{\langle s^{(q)},s^{(q)} \rangle_{A}} = -\beta^{(q)} \end{equation}

meaning: if all s are A orthogonal, these choices of \alpha will result in convergence in n steps, since each step cancels exactly one component \beta^{(q)} s^{(q)} of the initial error. Note that for \tilde{q} < q, s^{(\tilde{q})} \cdot r^{(q)} = -\langle s^{(\tilde{q})}, e^{(q)} \rangle_{A} = 0. Meaning, residuals are orthogonal to all previous search directions.

Momentum

We descend by tracking a “position” and a “velocity”:

\begin{equation} v_{t+1} = \beta v_{t} - \alpha \nabla_{x_{t}} f \end{equation}

\begin{equation} x_{t+1} = x_{t} + v_{t+1} \end{equation}

If the momentum \beta is set to 0, we just get normal gradient descent. If \beta is positive, your update vector will carry over some of the previous update direction.
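The two update equations translate almost line for line into code. The sketch below is one possible rendering; the name `momentum_descent`, the zero initial velocity, and the default values of \alpha, \beta, and the step count are assumptions chosen for illustration.

```python
def momentum_descent(grad, x, alpha=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: track a position x and a velocity v."""
    v = [0.0] * len(x)                                   # assumed: start at rest
    for _ in range(steps):
        g = grad(x)
        v = [beta * vi - alpha * gi for vi, gi in zip(v, g)]   # velocity update
        x = [xi + vi for xi, vi in zip(x, v)]                  # position update
    return x
```

With `beta=0.0` the velocity is just -\alpha times the gradient, so each step reduces to a plain gradient descent step, matching the remark above.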