Old engine for Continuous Time Bayesian Networks. Superseded by reCTBN. 🐍
https://github.com/madlabunimib/PyCTBN
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
560 lines
19 KiB
560 lines
19 KiB
"""Trust Region Reflective algorithm for least-squares optimization.
|
|
|
|
The algorithm is based on ideas from paper [STIR]_. The main idea is to
|
|
account for the presence of the bounds by appropriate scaling of the variables (or,
|
|
equivalently, changing a trust-region shape). Let's introduce a vector v:
|
|
|
|
| ub[i] - x[i], if g[i] < 0 and ub[i] < np.inf
|
|
v[i] = | x[i] - lb[i], if g[i] > 0 and lb[i] > -np.inf
|
|
| 1, otherwise
|
|
|
|
where g is the gradient of a cost function and lb, ub are the bounds. Its
|
|
components are distances to the bounds at which the anti-gradient points (if
|
|
this distance is finite). Define a scaling matrix D = diag(v**0.5).
|
|
First-order optimality conditions can be stated as
|
|
|
|
D^2 g(x) = 0.
|
|
|
|
Meaning that components of the gradient should be zero for strictly interior
|
|
variables, and components must point inside the feasible region for variables
|
|
on the bound.
|
|
|
|
Now consider this system of equations as a new optimization problem. If the
|
|
point x is strictly interior (not on the bound), then the left-hand side is
|
|
differentiable and the Newton step for it satisfies
|
|
|
|
(D^2 H + diag(g) Jv) p = -D^2 g
|
|
|
|
where H is the Hessian matrix (or its J^T J approximation in least squares),
|
|
Jv is the Jacobian matrix of v with components -1, 1 or 0, such that all
|
|
elements of matrix C = diag(g) Jv are non-negative. Introduce the change
|
|
of the variables x = D x_h (_h would be "hat" in LaTeX). In the new variables,
|
|
we have a Newton step satisfying
|
|
|
|
B_h p_h = -g_h,
|
|
|
|
where B_h = D H D + C, g_h = D g. In least squares B_h = J_h^T J_h, where
|
|
J_h = J D. Note that J_h and g_h are proper Jacobian and gradient with respect
|
|
to "hat" variables. To guarantee global convergence we formulate a
|
|
trust-region problem based on the Newton step in the new variables:
|
|
|
|
0.5 * p_h^T B_h p + g_h^T p_h -> min, ||p_h|| <= Delta
|
|
|
|
In the original space B = H + D^{-1} C D^{-1}, and the equivalent trust-region
|
|
problem is
|
|
|
|
0.5 * p^T B p + g^T p -> min, ||D^{-1} p|| <= Delta
|
|
|
|
Here, the meaning of the matrix D becomes more clear: it alters the shape
|
|
of a trust-region, such that large steps towards the bounds are not allowed.
|
|
In the implementation, the trust-region problem is solved in "hat" space,
|
|
but handling of the bounds is done in the original space (see below and read
|
|
the code).
|
|
|
|
The introduction of the matrix D doesn't allow to ignore bounds, the algorithm
|
|
must keep iterates strictly feasible (to satisfy aforementioned
|
|
differentiability), the parameter theta controls step back from the boundary
|
|
(see the code for details).
|
|
|
|
The algorithm does another important trick. If the trust-region solution
|
|
doesn't fit into the bounds, then a reflected (from a firstly encountered
|
|
bound) search direction is considered. For motivation and analysis refer to
|
|
[STIR]_ paper (and other papers of the authors). In practice, it doesn't need
|
|
a lot of justifications, the algorithm simply chooses the best step among
|
|
three: a constrained trust-region step, a reflected step and a constrained
|
|
Cauchy step (a minimizer along -g_h in "hat" space, or -D^2 g in the original
|
|
space).
|
|
|
|
Another feature is that a trust-region radius control strategy is modified to
|
|
account for appearance of the diagonal C matrix (called diag_h in the code).
|
|
|
|
Note that all described peculiarities are completely gone as we consider
|
|
problems without bounds (the algorithm becomes a standard trust-region type
|
|
algorithm very similar to ones implemented in MINPACK).
|
|
|
|
The implementation supports two methods of solving the trust-region problem.
|
|
The first, called 'exact', applies SVD on Jacobian and then solves the problem
|
|
very accurately using the algorithm described in [JJMore]_. It is not
|
|
applicable to large problem. The second, called 'lsmr', uses the 2-D subspace
|
|
approach (sometimes called "indefinite dogleg"), where the problem is solved
|
|
in a subspace spanned by the gradient and the approximate Gauss-Newton step
|
|
found by ``scipy.sparse.linalg.lsmr``. A 2-D trust-region problem is
|
|
reformulated as a 4th order algebraic equation and solved very accurately by
|
|
``numpy.roots``. The subspace approach allows to solve very large problems
|
|
(up to couple of millions of residuals on a regular PC), provided the Jacobian
|
|
matrix is sufficiently sparse.
|
|
|
|
References
|
|
----------
|
|
.. [STIR] Branch, M.A., T.F. Coleman, and Y. Li, "A Subspace, Interior,
|
|
and Conjugate Gradient Method for Large-Scale Bound-Constrained
|
|
Minimization Problems," SIAM Journal on Scientific Computing,
|
|
Vol. 21, Number 1, pp 1-23, 1999.
|
|
.. [JJMore] More, J. J., "The Levenberg-Marquardt Algorithm: Implementation
|
|
and Theory," Numerical Analysis, ed. G. A. Watson, Lecture
|
|
"""
|
|
import numpy as np
|
|
from numpy.linalg import norm
|
|
from scipy.linalg import svd, qr
|
|
from scipy.sparse.linalg import lsmr
|
|
from scipy.optimize import OptimizeResult
|
|
|
|
from .common import (
|
|
step_size_to_bound, find_active_constraints, in_bounds,
|
|
make_strictly_feasible, intersect_trust_region, solve_lsq_trust_region,
|
|
solve_trust_region_2d, minimize_quadratic_1d, build_quadratic_1d,
|
|
evaluate_quadratic, right_multiplied_operator, regularized_lsq_operator,
|
|
CL_scaling_vector, compute_grad, compute_jac_scale, check_termination,
|
|
update_tr_radius, scale_for_robust_loss_function, print_header_nonlinear,
|
|
print_iteration_nonlinear)
|
|
|
|
|
|
def trf(fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, x_scale,
|
|
loss_function, tr_solver, tr_options, verbose):
|
|
# For efficiency, it makes sense to run the simplified version of the
|
|
# algorithm when no bounds are imposed. We decided to write the two
|
|
# separate functions. It violates the DRY principle, but the individual
|
|
# functions are kept the most readable.
|
|
if np.all(lb == -np.inf) and np.all(ub == np.inf):
|
|
return trf_no_bounds(
|
|
fun, jac, x0, f0, J0, ftol, xtol, gtol, max_nfev, x_scale,
|
|
loss_function, tr_solver, tr_options, verbose)
|
|
else:
|
|
return trf_bounds(
|
|
fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, x_scale,
|
|
loss_function, tr_solver, tr_options, verbose)
|
|
|
|
|
|
def select_step(x, J_h, diag_h, g_h, p, p_h, d, Delta, lb, ub, theta):
|
|
"""Select the best step according to Trust Region Reflective algorithm."""
|
|
if in_bounds(x + p, lb, ub):
|
|
p_value = evaluate_quadratic(J_h, g_h, p_h, diag=diag_h)
|
|
return p, p_h, -p_value
|
|
|
|
p_stride, hits = step_size_to_bound(x, p, lb, ub)
|
|
|
|
# Compute the reflected direction.
|
|
r_h = np.copy(p_h)
|
|
r_h[hits.astype(bool)] *= -1
|
|
r = d * r_h
|
|
|
|
# Restrict trust-region step, such that it hits the bound.
|
|
p *= p_stride
|
|
p_h *= p_stride
|
|
x_on_bound = x + p
|
|
|
|
# Reflected direction will cross first either feasible region or trust
|
|
# region boundary.
|
|
_, to_tr = intersect_trust_region(p_h, r_h, Delta)
|
|
to_bound, _ = step_size_to_bound(x_on_bound, r, lb, ub)
|
|
|
|
# Find lower and upper bounds on a step size along the reflected
|
|
# direction, considering the strict feasibility requirement. There is no
|
|
# single correct way to do that, the chosen approach seems to work best
|
|
# on test problems.
|
|
r_stride = min(to_bound, to_tr)
|
|
if r_stride > 0:
|
|
r_stride_l = (1 - theta) * p_stride / r_stride
|
|
if r_stride == to_bound:
|
|
r_stride_u = theta * to_bound
|
|
else:
|
|
r_stride_u = to_tr
|
|
else:
|
|
r_stride_l = 0
|
|
r_stride_u = -1
|
|
|
|
# Check if reflection step is available.
|
|
if r_stride_l <= r_stride_u:
|
|
a, b, c = build_quadratic_1d(J_h, g_h, r_h, s0=p_h, diag=diag_h)
|
|
r_stride, r_value = minimize_quadratic_1d(
|
|
a, b, r_stride_l, r_stride_u, c=c)
|
|
r_h *= r_stride
|
|
r_h += p_h
|
|
r = r_h * d
|
|
else:
|
|
r_value = np.inf
|
|
|
|
# Now correct p_h to make it strictly interior.
|
|
p *= theta
|
|
p_h *= theta
|
|
p_value = evaluate_quadratic(J_h, g_h, p_h, diag=diag_h)
|
|
|
|
ag_h = -g_h
|
|
ag = d * ag_h
|
|
|
|
to_tr = Delta / norm(ag_h)
|
|
to_bound, _ = step_size_to_bound(x, ag, lb, ub)
|
|
if to_bound < to_tr:
|
|
ag_stride = theta * to_bound
|
|
else:
|
|
ag_stride = to_tr
|
|
|
|
a, b = build_quadratic_1d(J_h, g_h, ag_h, diag=diag_h)
|
|
ag_stride, ag_value = minimize_quadratic_1d(a, b, 0, ag_stride)
|
|
ag_h *= ag_stride
|
|
ag *= ag_stride
|
|
|
|
if p_value < r_value and p_value < ag_value:
|
|
return p, p_h, -p_value
|
|
elif r_value < p_value and r_value < ag_value:
|
|
return r, r_h, -r_value
|
|
else:
|
|
return ag, ag_h, -ag_value
|
|
|
|
|
|
def trf_bounds(fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev,
|
|
x_scale, loss_function, tr_solver, tr_options, verbose):
|
|
x = x0.copy()
|
|
|
|
f = f0
|
|
f_true = f.copy()
|
|
nfev = 1
|
|
|
|
J = J0
|
|
njev = 1
|
|
m, n = J.shape
|
|
|
|
if loss_function is not None:
|
|
rho = loss_function(f)
|
|
cost = 0.5 * np.sum(rho[0])
|
|
J, f = scale_for_robust_loss_function(J, f, rho)
|
|
else:
|
|
cost = 0.5 * np.dot(f, f)
|
|
|
|
g = compute_grad(J, f)
|
|
|
|
jac_scale = isinstance(x_scale, str) and x_scale == 'jac'
|
|
if jac_scale:
|
|
scale, scale_inv = compute_jac_scale(J)
|
|
else:
|
|
scale, scale_inv = x_scale, 1 / x_scale
|
|
|
|
v, dv = CL_scaling_vector(x, g, lb, ub)
|
|
v[dv != 0] *= scale_inv[dv != 0]
|
|
Delta = norm(x0 * scale_inv / v**0.5)
|
|
if Delta == 0:
|
|
Delta = 1.0
|
|
|
|
g_norm = norm(g * v, ord=np.inf)
|
|
|
|
f_augmented = np.zeros((m + n))
|
|
if tr_solver == 'exact':
|
|
J_augmented = np.empty((m + n, n))
|
|
elif tr_solver == 'lsmr':
|
|
reg_term = 0.0
|
|
regularize = tr_options.pop('regularize', True)
|
|
|
|
if max_nfev is None:
|
|
max_nfev = x0.size * 100
|
|
|
|
alpha = 0.0 # "Levenberg-Marquardt" parameter
|
|
|
|
termination_status = None
|
|
iteration = 0
|
|
step_norm = None
|
|
actual_reduction = None
|
|
|
|
if verbose == 2:
|
|
print_header_nonlinear()
|
|
|
|
while True:
|
|
v, dv = CL_scaling_vector(x, g, lb, ub)
|
|
|
|
g_norm = norm(g * v, ord=np.inf)
|
|
if g_norm < gtol:
|
|
termination_status = 1
|
|
|
|
if verbose == 2:
|
|
print_iteration_nonlinear(iteration, nfev, cost, actual_reduction,
|
|
step_norm, g_norm)
|
|
|
|
if termination_status is not None or nfev == max_nfev:
|
|
break
|
|
|
|
# Now compute variables in "hat" space. Here, we also account for
|
|
# scaling introduced by `x_scale` parameter. This part is a bit tricky,
|
|
# you have to write down the formulas and see how the trust-region
|
|
# problem is formulated when the two types of scaling are applied.
|
|
# The idea is that first we apply `x_scale` and then apply Coleman-Li
|
|
# approach in the new variables.
|
|
|
|
# v is recomputed in the variables after applying `x_scale`, note that
|
|
# components which were identically 1 not affected.
|
|
v[dv != 0] *= scale_inv[dv != 0]
|
|
|
|
# Here, we apply two types of scaling.
|
|
d = v**0.5 * scale
|
|
|
|
# C = diag(g * scale) Jv
|
|
diag_h = g * dv * scale
|
|
|
|
# After all this has been done, we continue normally.
|
|
|
|
# "hat" gradient.
|
|
g_h = d * g
|
|
|
|
f_augmented[:m] = f
|
|
if tr_solver == 'exact':
|
|
J_augmented[:m] = J * d
|
|
J_h = J_augmented[:m] # Memory view.
|
|
J_augmented[m:] = np.diag(diag_h**0.5)
|
|
U, s, V = svd(J_augmented, full_matrices=False)
|
|
V = V.T
|
|
uf = U.T.dot(f_augmented)
|
|
elif tr_solver == 'lsmr':
|
|
J_h = right_multiplied_operator(J, d)
|
|
|
|
if regularize:
|
|
a, b = build_quadratic_1d(J_h, g_h, -g_h, diag=diag_h)
|
|
to_tr = Delta / norm(g_h)
|
|
ag_value = minimize_quadratic_1d(a, b, 0, to_tr)[1]
|
|
reg_term = -ag_value / Delta**2
|
|
|
|
lsmr_op = regularized_lsq_operator(J_h, (diag_h + reg_term)**0.5)
|
|
gn_h = lsmr(lsmr_op, f_augmented, **tr_options)[0]
|
|
S = np.vstack((g_h, gn_h)).T
|
|
S, _ = qr(S, mode='economic')
|
|
JS = J_h.dot(S) # LinearOperator does dot too.
|
|
B_S = np.dot(JS.T, JS) + np.dot(S.T * diag_h, S)
|
|
g_S = S.T.dot(g_h)
|
|
|
|
# theta controls step back step ratio from the bounds.
|
|
theta = max(0.995, 1 - g_norm)
|
|
|
|
actual_reduction = -1
|
|
while actual_reduction <= 0 and nfev < max_nfev:
|
|
if tr_solver == 'exact':
|
|
p_h, alpha, n_iter = solve_lsq_trust_region(
|
|
n, m, uf, s, V, Delta, initial_alpha=alpha)
|
|
elif tr_solver == 'lsmr':
|
|
p_S, _ = solve_trust_region_2d(B_S, g_S, Delta)
|
|
p_h = S.dot(p_S)
|
|
|
|
p = d * p_h # Trust-region solution in the original space.
|
|
step, step_h, predicted_reduction = select_step(
|
|
x, J_h, diag_h, g_h, p, p_h, d, Delta, lb, ub, theta)
|
|
|
|
x_new = make_strictly_feasible(x + step, lb, ub, rstep=0)
|
|
f_new = fun(x_new)
|
|
nfev += 1
|
|
|
|
step_h_norm = norm(step_h)
|
|
|
|
if not np.all(np.isfinite(f_new)):
|
|
Delta = 0.25 * step_h_norm
|
|
continue
|
|
|
|
# Usual trust-region step quality estimation.
|
|
if loss_function is not None:
|
|
cost_new = loss_function(f_new, cost_only=True)
|
|
else:
|
|
cost_new = 0.5 * np.dot(f_new, f_new)
|
|
actual_reduction = cost - cost_new
|
|
Delta_new, ratio = update_tr_radius(
|
|
Delta, actual_reduction, predicted_reduction,
|
|
step_h_norm, step_h_norm > 0.95 * Delta)
|
|
|
|
step_norm = norm(step)
|
|
termination_status = check_termination(
|
|
actual_reduction, cost, step_norm, norm(x), ratio, ftol, xtol)
|
|
if termination_status is not None:
|
|
break
|
|
|
|
alpha *= Delta / Delta_new
|
|
Delta = Delta_new
|
|
|
|
if actual_reduction > 0:
|
|
x = x_new
|
|
|
|
f = f_new
|
|
f_true = f.copy()
|
|
|
|
cost = cost_new
|
|
|
|
J = jac(x, f)
|
|
njev += 1
|
|
|
|
if loss_function is not None:
|
|
rho = loss_function(f)
|
|
J, f = scale_for_robust_loss_function(J, f, rho)
|
|
|
|
g = compute_grad(J, f)
|
|
|
|
if jac_scale:
|
|
scale, scale_inv = compute_jac_scale(J, scale_inv)
|
|
else:
|
|
step_norm = 0
|
|
actual_reduction = 0
|
|
|
|
iteration += 1
|
|
|
|
if termination_status is None:
|
|
termination_status = 0
|
|
|
|
active_mask = find_active_constraints(x, lb, ub, rtol=xtol)
|
|
return OptimizeResult(
|
|
x=x, cost=cost, fun=f_true, jac=J, grad=g, optimality=g_norm,
|
|
active_mask=active_mask, nfev=nfev, njev=njev,
|
|
status=termination_status)
|
|
|
|
|
|
def trf_no_bounds(fun, jac, x0, f0, J0, ftol, xtol, gtol, max_nfev,
|
|
x_scale, loss_function, tr_solver, tr_options, verbose):
|
|
x = x0.copy()
|
|
|
|
f = f0
|
|
f_true = f.copy()
|
|
nfev = 1
|
|
|
|
J = J0
|
|
njev = 1
|
|
m, n = J.shape
|
|
|
|
if loss_function is not None:
|
|
rho = loss_function(f)
|
|
cost = 0.5 * np.sum(rho[0])
|
|
J, f = scale_for_robust_loss_function(J, f, rho)
|
|
else:
|
|
cost = 0.5 * np.dot(f, f)
|
|
|
|
g = compute_grad(J, f)
|
|
|
|
jac_scale = isinstance(x_scale, str) and x_scale == 'jac'
|
|
if jac_scale:
|
|
scale, scale_inv = compute_jac_scale(J)
|
|
else:
|
|
scale, scale_inv = x_scale, 1 / x_scale
|
|
|
|
Delta = norm(x0 * scale_inv)
|
|
if Delta == 0:
|
|
Delta = 1.0
|
|
|
|
if tr_solver == 'lsmr':
|
|
reg_term = 0
|
|
damp = tr_options.pop('damp', 0.0)
|
|
regularize = tr_options.pop('regularize', True)
|
|
|
|
if max_nfev is None:
|
|
max_nfev = x0.size * 100
|
|
|
|
alpha = 0.0 # "Levenberg-Marquardt" parameter
|
|
|
|
termination_status = None
|
|
iteration = 0
|
|
step_norm = None
|
|
actual_reduction = None
|
|
|
|
if verbose == 2:
|
|
print_header_nonlinear()
|
|
|
|
while True:
|
|
g_norm = norm(g, ord=np.inf)
|
|
if g_norm < gtol:
|
|
termination_status = 1
|
|
|
|
if verbose == 2:
|
|
print_iteration_nonlinear(iteration, nfev, cost, actual_reduction,
|
|
step_norm, g_norm)
|
|
|
|
if termination_status is not None or nfev == max_nfev:
|
|
break
|
|
|
|
d = scale
|
|
g_h = d * g
|
|
|
|
if tr_solver == 'exact':
|
|
J_h = J * d
|
|
U, s, V = svd(J_h, full_matrices=False)
|
|
V = V.T
|
|
uf = U.T.dot(f)
|
|
elif tr_solver == 'lsmr':
|
|
J_h = right_multiplied_operator(J, d)
|
|
|
|
if regularize:
|
|
a, b = build_quadratic_1d(J_h, g_h, -g_h)
|
|
to_tr = Delta / norm(g_h)
|
|
ag_value = minimize_quadratic_1d(a, b, 0, to_tr)[1]
|
|
reg_term = -ag_value / Delta**2
|
|
|
|
damp_full = (damp**2 + reg_term)**0.5
|
|
gn_h = lsmr(J_h, f, damp=damp_full, **tr_options)[0]
|
|
S = np.vstack((g_h, gn_h)).T
|
|
S, _ = qr(S, mode='economic')
|
|
JS = J_h.dot(S)
|
|
B_S = np.dot(JS.T, JS)
|
|
g_S = S.T.dot(g_h)
|
|
|
|
actual_reduction = -1
|
|
while actual_reduction <= 0 and nfev < max_nfev:
|
|
if tr_solver == 'exact':
|
|
step_h, alpha, n_iter = solve_lsq_trust_region(
|
|
n, m, uf, s, V, Delta, initial_alpha=alpha)
|
|
elif tr_solver == 'lsmr':
|
|
p_S, _ = solve_trust_region_2d(B_S, g_S, Delta)
|
|
step_h = S.dot(p_S)
|
|
|
|
predicted_reduction = -evaluate_quadratic(J_h, g_h, step_h)
|
|
step = d * step_h
|
|
x_new = x + step
|
|
f_new = fun(x_new)
|
|
nfev += 1
|
|
|
|
step_h_norm = norm(step_h)
|
|
|
|
if not np.all(np.isfinite(f_new)):
|
|
Delta = 0.25 * step_h_norm
|
|
continue
|
|
|
|
# Usual trust-region step quality estimation.
|
|
if loss_function is not None:
|
|
cost_new = loss_function(f_new, cost_only=True)
|
|
else:
|
|
cost_new = 0.5 * np.dot(f_new, f_new)
|
|
actual_reduction = cost - cost_new
|
|
|
|
Delta_new, ratio = update_tr_radius(
|
|
Delta, actual_reduction, predicted_reduction,
|
|
step_h_norm, step_h_norm > 0.95 * Delta)
|
|
|
|
step_norm = norm(step)
|
|
termination_status = check_termination(
|
|
actual_reduction, cost, step_norm, norm(x), ratio, ftol, xtol)
|
|
if termination_status is not None:
|
|
break
|
|
|
|
alpha *= Delta / Delta_new
|
|
Delta = Delta_new
|
|
|
|
if actual_reduction > 0:
|
|
x = x_new
|
|
|
|
f = f_new
|
|
f_true = f.copy()
|
|
|
|
cost = cost_new
|
|
|
|
J = jac(x, f)
|
|
njev += 1
|
|
|
|
if loss_function is not None:
|
|
rho = loss_function(f)
|
|
J, f = scale_for_robust_loss_function(J, f, rho)
|
|
|
|
g = compute_grad(J, f)
|
|
|
|
if jac_scale:
|
|
scale, scale_inv = compute_jac_scale(J, scale_inv)
|
|
else:
|
|
step_norm = 0
|
|
actual_reduction = 0
|
|
|
|
iteration += 1
|
|
|
|
if termination_status is None:
|
|
termination_status = 0
|
|
|
|
active_mask = np.zeros_like(x)
|
|
return OptimizeResult(
|
|
x=x, cost=cost, fun=f_true, jac=J, grad=g, optimality=g_norm,
|
|
active_mask=active_mask, nfev=nfev, njev=njev,
|
|
status=termination_status)
|
|
|