This is a little vignette about using TFOCS to solve linear inverse problems, with a focus on problems from imaging and compressed sensing, and it covers some basic ideas in the TFOCS package.
From a generic point-of-view, linear inverse problems assume that we have data collected via (noisy) linear measurements: To recover the unknown signal , we look at the set of signals that agree with the measurements: for a reasonable value of .
To select the “best” signal from this class, there are many options. A common choice is choosing the signal with least norm, for some choice of norm. In particular, the norm is often used because it induces sparsity. For example, we have the Basis Pursuit De-Noising (BPDN) problem:
There are quite a few algorithms that solve similar problems, such as
which are equivalent for some choice of and , but this relation is not usually known. There are many algorithms to solve LASSO, but our algorithm, TFOCS, is one of the few first-order algorithms that can efficiently solve the constrained version of this problem (BPDN).
TFOCS can also deal with variations. Consider the following problem:
where is a wavelet transform (most natural images have wavelet coefficients that decay quickly) and is the total-variation norm, which penalizes changes in an image. Below are some results on the classic “cameraman” test image, where we have applied various formulations that we solved with TFOCS. This is a denoising problem, so the linear operator is the identity, and we add noise .
The original image without noise
The noisy version of the image (this is what we want to clean-up)
The cleaned-up image if we use wavelet thresholding (using an oracle to pick the optimal threshold). There are some artifacts.
The cleaned-up image using the wavelet and TV composite function (see equation above). The TV norm helps reduce the artifacts.
To solve the problems efficiently, it is preferable to minimize a smooth function rather than a non-smooth function, because gradient descent methods have better convergence rates than sub-gradient descent methods.
The approach of TFOCS is to add a strongly convex term to the primal problem. This doesn't make the primal smooth, but rather it makes the dual smooth, and so we solve the dual problem via gradient descent (using an optimal version of gradient descent, such as Nesterov's algorithm). The picture below shows the effect of the strongly convex term added to the norm (the red line, labeled “dual smoothed”). As you can see, it retains the sharp kink at 0, which is important since this is what promotes sparsity. Also plotted is the Huber function (the green line, labeled “primal smoothed”), which is a smooth approximation of the norm that we used in NESTA.
The smoothing is controlled by a parameter , where the extra term added to the objective is for some . That is, we replace an optimization problem like
We want to be small so that we closely approximate the original problem, but we also want it large because then the smoothed problem is easier to solve (i.e. faster convergence). The standard solution is to use continuation: we solve for a large value of and then use this solution to “seed” the algorithm using a small value of . Continuation can also be done by keeping fixed and updating with a better guess.
The TFOCS paper describes an “accelerated continuation” scheme that improves on this. We analyze the fixed- version of continuation in terms of the proximal point method, and show that continuation is just the gradient step of a proximal point method. Thus we can apply Nesterov's accelerated gradient algorithm. The plot below shows this: we are plotting the error between the smoothed solution and non-smoothed solution. Each iteration is an update of in either the simple fashion (“regular continuation”) or the accelerated fashion.
TFOCS also discusses some cases where the strongly convex parameter doesn't affect the solution, in so-called “exact regularization”. Simple cases have been known since the 1970s; more complete results appear in Exact regularization of convex programs by Friedlander and Tseng (2007).
Many compressed sensing problems enjoy a local strong convexity, and it is possible to exploit this using a restarted Nesterov algorithm. Note that, unlike gradient descent, accelerated algorithms do not automatically have linear convergence for strongly convex objectives.
In TFOCS, we explored the idea of automatic restarts, which restart whenever the function value increases. After this, the idea was explored in more depth by Brendan O'Donoghue and Emmanuel Candès in Adaptive resart for accelerated gradient schemes.
The 1990s saw the explosion of interior-point methods (IPM), and by the end of the decade, IPM solvers were mature and robust in both theory and practice. However, they do have scaling issues: for a primal variable of dimension , each step requires the solving of a linear system on the order of (or even larger, depending on the number of constraints), so this is automatically an order algorithm. Even worse, the number of steps scales slightly in . The KKT system can sometimes be solved with an iterative solver like conjugate gradient, but this introduces other issues.
Below are some plots of a very respected IPM solver (SeDuMi, called via cvx), as well as l1-Magic (which is custom for the basis pursuit problem and uses conjugate gradient to solve the KKT system). You can see that the TFOCS first-order method is way faster at large problem sizes, and (not shown) requires a small memory footprint. SeDuMi required more than 4 GB of RAM on the largest problem and consequently memory calls thrashed.
The plot shows two curves for each algorithm: one curve used a loose tolerance, and one used a higher tolerance. LIPSOL is a Mehrotra-based primal-dual IPM built into MATLAB's linprog, and Simplex is also using MATLAB's implementation in linprog.
But are first-order methods like TFOCS as accurate as IPM methods? For this problem (a noiseless basis pursuit problem, with a sparse signal), the answer is yes. In fact, TFOCS returns an answer exact to machine precision, and exactly identifies the support, whereas IPM never exactly find zero entries (since, by definition, their solution is always on the interior).
Also, the Simplex method failed on most of the problem instances, and SeDuMi failed on some of the larger problems (it returned NaN, so those points are not plotted).
Much of my work involves algorithms that arise from compressed sensing. There are already many excellent overviews of compressed sensing on the web, so I just list a few here:
Rice University maintains the de facto standard of compressed sensing and compressive sampling resources.
Terence Tao's blog posts about CS
Optimization Online is a repository of e-prints about optimization algorithms.