Solving inverse problems with TFOCS

This is a short vignette about using TFOCS to solve linear inverse problems, with a focus on problems from imaging and compressed sensing; it also covers some of the basic ideas in the TFOCS package.

From a generic point of view, linear inverse problems assume that we have data collected via (noisy) linear measurements: $b = A x_0 + z$. To recover the unknown signal $x_0$, we look at the set of signals that agree with the measurements: $C = \{ x : \|Ax - b\|_2 \le \epsilon \}$ for a reasonable value of $\epsilon$.

To select the “best” signal from this class, there are many options. A common choice is the signal with least norm, for some choice of norm. In particular, the $\ell_1$ norm is often used because it induces sparsity. For example, we have the Basis Pursuit De-Noising (BPDN) problem:

$$ \min_x \, \|x\|_1 \quad\text{subject to}\quad x \in C. \tag{BPDN} $$
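
To make the setup concrete, here is a minimal Python sketch that generates a sparse signal, takes noisy Gaussian measurements, and solves (BPDN) with CVXPY as a generic convex solver. This is not the TFOCS algorithm; the dimensions, sparsity level, and noise level are arbitrary illustrative choices.

```python
# Sketch: set up a compressed sensing measurement model and solve BPDN with CVXPY
# (a generic convex solver, used here only to make the formulation concrete).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m, k = 256, 80, 10                 # signal length, measurements, sparsity (illustrative)

# Ground truth: a k-sparse signal x0, Gaussian measurements A, additive noise z.
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)
z = 0.01 * rng.standard_normal(m)
b = A @ x0 + z
eps = 1.1 * np.linalg.norm(z)         # a "reasonable" epsilon: slightly above the noise level

# BPDN: minimize ||x||_1 subject to ||A x - b||_2 <= eps.
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm(A @ x - b, 2) <= eps])
prob.solve()
print("relative error:", np.linalg.norm(x.value - x0) / np.linalg.norm(x0))
```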

There are quite a few algorithms that solve similar problems, such as

$$ \min_x \, \|x\|_1 + \frac{\lambda}{2} \|Ax-b\|_2^2 \tag{LASSO} $$

which is equivalent to (BPDN) for some correspondence between $\lambda$ and $\epsilon$, but this correspondence is not usually known in advance. There are many algorithms that solve (LASSO), but our algorithm, TFOCS, is one of the few first-order methods that can efficiently solve the constrained version of the problem (BPDN).
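
For comparison, here is a minimal proximal-gradient (ISTA) sketch for the unconstrained LASSO form, just to make that formulation concrete; this generic method is not the dual-smoothing approach that TFOCS takes for the constrained problem.

```python
# Sketch: proximal gradient (ISTA) for  min_x ||x||_1 + (lam/2) ||A x - b||_2^2.
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, lam, n_iter=500):
    """Proximal gradient descent on ||x||_1 + (lam/2)||Ax - b||^2."""
    step = 1.0 / (lam * np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = lam * A.T @ (A @ x - b)               # gradient of the smooth (quadratic) term
        x = soft_threshold(x - step * grad, step)    # prox step on the l1 term
    return x
```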

TFOCS can also deal with variations. Consider the following problem:

$$ \min_x \, \|Wx\|_1 + \lambda \|x\|_{TV} \quad\text{subject to}\quad x \in C $$

where $W$ is a wavelet transform (most natural images have wavelet coefficients that decay quickly) and $\|\cdot\|_{TV}$ is the total-variation norm, which penalizes changes in an image. Below are some results on the classic “cameraman” test image, where we applied various formulations solved with TFOCS. This is a denoising problem, so the linear operator $A$ is the identity and we add noise $z$.

Denoising the cameraman image (four panels):
Original: the image without noise.
Noisy: the noisy version of the image (this is what we want to clean up).
Wavelet: the cleaned-up image using wavelet thresholding (with an oracle to pick the optimal threshold); there are some artifacts.
TV analysis: the cleaned-up image using the wavelet and TV composite objective (see the equation above); the TV norm helps reduce the artifacts.

Smoothing

To solve these problems efficiently, it is preferable to minimize a smooth function rather than a non-smooth one, because gradient-descent methods have better convergence rates than subgradient-descent methods.

The approach of TFOCS is to add a strongly convex term to the primal problem. This doesn't make the primal smooth; rather, it makes the dual smooth, so we solve the dual problem via gradient descent (using an optimal first-order method such as Nesterov's accelerated gradient algorithm). The picture below shows the effect of the strongly convex term added to the $\ell_1$ norm (the red line, labeled “dual smoothed”). As you can see, it retains the sharp kink at 0, which is important since this is what promotes sparsity. Also plotted is the Huber function (the green line, labeled “primal smoothed”), a smooth approximation of the $\ell_1$ norm that we used in NESTA.

Above: the absolute value function, its primal-smoothed version (in green), and the dual-smoothed version (in red). TFOCS uses dual smoothing.
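
The three curves are easy to reproduce. The sketch below plots the absolute value, its Huber (primal) smoothing, and the absolute value plus the strongly convex term $\frac{\mu}{2}x^2$ described above; the value of $\mu$ is an arbitrary choice for illustration.

```python
# Sketch: the absolute value, its primal (Huber) smoothing, and the "dual smoothed"
# version, i.e. |x| plus the strongly convex term (mu/2) x^2 that keeps the kink at 0.
import numpy as np
import matplotlib.pyplot as plt

mu = 0.5
x = np.linspace(-2, 2, 401)

abs_val = np.abs(x)
huber = np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)  # primal smoothing
dual_smoothed = np.abs(x) + (mu / 2) * x**2                             # keeps the kink at 0

plt.plot(x, abs_val, "k", label="|x|")
plt.plot(x, huber, "g", label="primal smoothed (Huber)")
plt.plot(x, dual_smoothed, "r", label="dual smoothed: |x| + (mu/2) x^2")
plt.legend()
plt.show()
```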

Accelerated continuation

The smoothing is controlled by a parameter $\mu$, where the extra term added to the objective is $\frac{\mu}{2}\|x-y\|_2^2$ for some point $y$. That is, we replace an optimization problem like

$$ \min_{x\in C} \; f(x) $$

with

$$ \min_{x\in C} \; f(x) + \frac{\mu}{2}\|x-y\|_2^2 . $$

We want $\mu$ to be small so that we closely approximate the original problem, but we also want it to be large, because the smoothed problem is then easier to solve (i.e., faster convergence). The standard remedy is continuation: solve with a large value of $\mu$, then use that solution to “seed” the algorithm with a smaller value of $\mu$. Continuation can also be done by keeping $\mu$ fixed and updating $y$ with a better guess.
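
Schematically, continuation with a fixed $\mu$ looks like the following sketch, where `solve_smoothed` is a hypothetical stand-in for a single solve of the smoothed problem $\min_{x\in C} f(x) + \frac{\mu}{2}\|x-y\|_2^2$ centered at $y$.

```python
# Sketch: "regular" continuation with a fixed mu, re-centering y at each outer iteration.
import numpy as np

def continuation(solve_smoothed, y0, mu, n_outer=20):
    y = y0.copy()
    for _ in range(n_outer):
        x = solve_smoothed(y, mu)   # inner solve of the smoothed problem centered at y
        y = x                       # regular continuation: re-center at the new solution
    return y
```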

The TFOCS paper describes an “accelerated continuation” scheme that improves on this. We analyze the fixed-$\mu$ version of continuation in terms of the proximal point method and show that a continuation step is just a gradient step of the proximal point method; thus we can apply Nesterov's accelerated gradient algorithm to it. The plot below shows the result: we plot the error between the smoothed and non-smoothed solutions, where each iteration is an update of $y$, either in the simple fashion (“regular continuation”) or in the accelerated fashion.
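
A sketch of the accelerated variant is below: since re-centering $y$ acts like a (proximal-point) gradient step, Nesterov-style momentum can be added to the sequence of centers. The particular $t_k$ schedule shown is the standard Nesterov sequence, used here for illustration; the exact scheme in the TFOCS paper may differ in its details.

```python
# Sketch: accelerated continuation — Nesterov-style momentum on the centers y.
import numpy as np

def accelerated_continuation(solve_smoothed, y0, mu, n_outer=20):
    y = y0.copy()
    x_old = y0.copy()
    t = 1.0
    for _ in range(n_outer):
        x = solve_smoothed(y, mu)                    # inner solve centered at y
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x + ((t - 1) / t_new) * (x - x_old)      # momentum step on the centers
        x_old, t = x, t_new
    return x_old
```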

TFOCS also discusses some cases where the strongly convex term does not affect the solution at all, a phenomenon known as “exact regularization”. Simple cases have been known since the 1970s; more complete results appear in “Exact regularization of convex programs” by Friedlander and Tseng (2007).

Above: regular continuation (blue) compared to our accelerated continuation method (green). Each iteration is an outer iteration of the larger problem.

Restart

Many compressed sensing problems enjoy local strong convexity, and it is possible to exploit this with a restarted Nesterov algorithm. Note that, unlike plain gradient descent, accelerated methods do not automatically achieve linear convergence for strongly convex objectives.

In TFOCS, we explored the idea of automatic restarts, which restart the algorithm whenever the function value increases. This idea was later explored in more depth by Brendan O'Donoghue and Emmanuel Candès in “Adaptive restart for accelerated gradient schemes”.
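
The restart heuristic itself is simple. The sketch below applies it to a generic FISTA-style accelerated gradient method for a smooth objective `f` with gradient `grad_f` (not the Auslender-Teboulle variant used in the experiments): whenever the function value increases, the momentum is reset.

```python
# Sketch: accelerated gradient descent with a function-value restart.
import numpy as np

def accel_grad_restart(f, grad_f, x0, step, n_iter=1000):
    x, y, t = x0.copy(), x0.copy(), 1.0
    f_prev = f(x)
    for _ in range(n_iter):
        x_new = y - step * grad_f(y)                 # gradient step from the momentum point
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)  # Nesterov momentum
        f_new = f(x_new)
        if f_new > f_prev:                           # function value increased: restart momentum
            y, t_new = x_new, 1.0
        x, t, f_prev = x_new, t_new, f_new
    return x
```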

Above: convergence rates. We compare a simple method with no line search (“fixed step”), the standard TFOCS line search (“no restart”), and restarted versions for various values of the restart parameter. All methods use the Auslender-Teboulle algorithm. Restarting restores linear convergence rates.

The need for first-order solvers

The 1990s saw the explosion of interior-point methods (IPM), and by the end of the decade IPM solvers were mature and robust in both theory and practice. However, they have scaling issues: for a primal variable of dimension $n$, each step requires solving a linear system of size roughly $n \times n$ (or even larger, depending on the number of constraints), so each iteration costs at least order $n^2$ in memory and work, and typically order $n^3$ with a direct factorization. Even worse, the number of iterations grows mildly with the problem size, on the order of $\sqrt{n}$ in the worst case. The KKT system can sometimes be solved with an iterative method such as conjugate gradient, but this introduces other issues.

Below are some plots of a very respected IPM solver (SeDuMi, called via CVX), as well as l1-Magic (which is tailored to the basis pursuit problem and uses conjugate gradient to solve the KKT system). You can see that the TFOCS first-order method is much faster at large problem sizes, and (not shown) has a much smaller memory footprint. SeDuMi required more than 4 GB of RAM on the largest problem, and consequently the machine thrashed memory.

The plot shows two curves for each algorithm: one run with a loose tolerance and one with a stricter (higher-accuracy) tolerance. LIPSOL is a Mehrotra-based primal-dual IPM built into MATLAB's linprog, and Simplex also uses MATLAB's linprog implementation.

Above: Comparison of solver times

But are first-order methods like TFOCS as accurate as IPMs? For this problem (a noiseless basis pursuit problem with a sparse signal), the answer is yes. In fact, TFOCS returns an answer exact to machine precision and exactly identifies the support, whereas IPMs never produce exactly zero entries (since, by definition, their iterates stay in the interior).

Also, the Simplex method failed on most of the problem instances, and SeDuMi failed on some of the larger problems (it returned NaN, so those points are not plotted).

Above: Comparison of solver accuracy

Compressed sensing

Much of my work involves algorithms that arise from compressed sensing. There are already many excellent overviews of compressed sensing on the web, so I just list a few here: