Gradient Based Joint Diagonalization of a Set of Matrices

Yaroslav O. Halchenko Barak A. Pearlmutter
Department of Computer Science
University of New Mexico
Albuquerque, NM  87131, USA

Abstract:

This paper presents an iterative algorithm to jointly approximately diagonalize (JAD) a set of real-valued $ m \times m$ matrices $ \mathbf{C}_k$ with a single $ n \times m$ matrix $ \mathbf{B}$, where $ n \leq m$. The algorithm is based on simple gradient descent minimization of the criterion $ \sum_k \omega_k \, {\mathrm{off}}(\mathbf{B}\mathbf{C}_k \mathbf{B}^T) - \lambda \log \vert\mathbf{B}\vert$, where $ {\mathrm{off}}(\mathbf{A}) = \sum_i \sum_{j \neq i} a_{ij}^2$ measures the off-diagonal weight, $ \omega_k$ is a weight assigned to the $ k$-th matrix, and $ \lambda \log \vert\mathbf{B}\vert$ is a penalty term that prevents a singular solution. No restriction (Hermitian, positive definite, etc.) is placed on the matrices $ \mathbf{C}_k$. A performance analysis and a comparison with existing JAD techniques are presented, as well as the results of applying the method to the ICA problem.


Introduction

The problem of joint approximate diagonalization of a set of matrices is important in source separation, and a number of efficient methods for JAD have been developed. Pham and Cardoso (2000) and Cardoso and Souloumiac (1996) use Jacobi-like techniques to diagonalize a set of matrices through a unitary transformation, following a prewhitening step. Pham (1999) also takes a Jacobi approach. All of these Jacobi-based methods operate on pairs of rows and columns at each step, which can carry some overhead in computational expense.

The original AC-DC algorithm (Yeredor, 2002) also operates only on Hermitian or symmetric matrices, and requires computing the largest eigenvalue and corresponding eigenvector at each AC step. Many existing methods thus rely on the matrices to be diagonalized being Hermitian, which does hold in many practical settings: the matrices to be diagonalized are typically covariance matrices or matrices of cumulants. These JAD methods are widely used as building blocks of frameworks for separating signals in ICA problems.

The tree-decorrelation algorithm (Zibulevsky and Pearlmutter, 2000) also relies on computing an eigenspace for each matrix at each step, and makes use of the noise induced in the resulting eigenvalues by noise in the matrices input to the algorithm. Although joint diagonalization is not the explicit target of that work and does not appear as a criterion, it is known from PCA that eigenvectors yield minimum quadratic error solutions, and the tree-decorrelation method uses this property to reduce the off-diagonal entries of the transformed covariance matrices.

Here we derive a simple gradient descent rule for optimizing the criterion (2) derived below. Natural gradient descent (Amari, 1998) can be used to improve performance by making the optimization covariant. Because the proposed method converges no faster than existing methods, and convergence to the global minimum is not guaranteed, it is best used as a simple fine-tuning mechanism for other ``fast'' algorithms, which reach their optimum near1 a local minimum of the optimized criterion.

??? Is this true? I.e., do these methods find the global optimum?
!!! You are right -- they do not promise convergence to the global minimum. Pham's paper states explicitly that his criterion can have several (a fixed number of) local minima. Cardoso's older diagonalization method (the one used in JADE) does seem to reach the global minimum of its criterion, but it works on presphered data, so the main error is introduced by the presphering, as mentioned in the footnote.
??? Do they converge as quickly as gradient methods?
!!! At ICA-2001 Cardoso told me that gradient methods should be faster. I have not done any convergence analysis, but all the methods I have considered converge much faster, and some even show quadratic convergence, as Pham mentions. In this paper we derive just a simple gradient method, and its main problem is that computing the gradient at each step requires many matrix multiplications. If we could approximate the gradient to avoid this we would gain a lot of speed, but for now it is computationally intensive.
??? We also need to add something about this method not constraining the solution to perfectly diagonalize a particular covariance matrix, and therefore expanding the solution space to include potentially better separations.
!!! Isn't that really a question of presphering? Cardoso (1994) already analyzed how badly presphering can influence the result. Pham's method also does not use presphering, so it too gives potentially better results.
??? Maybe give an example, e.g., if all the matrices are contaminated by the same amount of noise, singling one out for special treatment would raise the error.
!!! OK, will be done. We will include a comparison between our diagonalizer and those of Cardoso and Pham.


Gradient-Based JAD

We wish to diagonalize a set of real-valued $ m \times m$ matrices $ \mathbf{C}_k$. We adopt the usual quality criterion for diagonalization, namely the sum of the squares of the off-diagonal elements, subject to a constraint that keeps the diagonal elements from also going to zero. For this purpose we define

$\displaystyle {\mathrm{off}}(\mathbf{C}) = \sum_i \sum_{j \neq i} c_{ij}^2$ (1)

and then, for a set of matrices $ \mathbf{C}_k$ in which each matrix has a weight $ \omega_k$, we will minimize

$\displaystyle E(\mathbf{B}) = \sum_k \omega_k {\mathrm{off}}(\mathbf{B}\mathbf{C}_k \mathbf{B}^T)$ (2)

We must enforce a constraint to prevent $ \mathbf{B}=0$ from being a solution. One approach is to project the solution back onto a suitable manifold after each unconstrained optimization step, for instance by re-normalizing each row. Another is to directly fix the norm of $ \mathbf{B}$ using one of the methods described by Douglas et al. (1999). In the spirit of Bell and Sejnowski (1995), we instead use a Lagrange multiplier on the logarithm of the absolute value of the determinant, which prevents $ \mathbf{B}$ from becoming singular.2 Using this approach we obtain a criterion suitable for unconstrained minimization,

$\displaystyle \hat{E}(\mathbf{B}) = \sum_k \omega_k {\mathrm{off}}(\mathbf{B}\mathbf{C}_k \mathbf{B}^T) - \lambda \log \vert\mathbf{B}\vert$ (3)
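For concreteness, criterion (3) can be evaluated directly. The following NumPy sketch is ours, not the authors' implementation; the function names off and criterion are introduced here purely for illustration:

    import numpy as np

    def off(C):
        # off(C) of eq. (1): sum of squared off-diagonal elements.
        return float(np.sum(C ** 2) - np.sum(np.diag(C) ** 2))

    def criterion(B, Cs, weights, lam):
        # E-hat(B) of eq. (3): weighted off-diagonality minus lambda * log|det B|.
        E = sum(w * off(B @ C @ B.T) for w, C in zip(weights, Cs))
        return E - lam * np.log(abs(np.linalg.det(B)))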

Taking the derivative (see Appendix A) leads to

$\displaystyle \frac{d\hat{E}(\mathbf{B})}{d\mathbf{B}} = 2 \sum_k \omega_k \left( \hat{\mathbf{C}}^{\mathrm{off}}_k \mathbf{B}\mathbf{C}_k^T + \hat{\mathbf{C}}^{\mathrm{off}\,T}_k \mathbf{B}\mathbf{C}_k \right) - \lambda \mathbf{B}^{-T}$ (4)

where $ \hat{\mathbf{C}}_k = \mathbf{B}\mathbf{C}_k \mathbf{B}^T$ is $ \mathbf{C}_k$ expressed in the basis $ \mathbf{B}$, and $ \hat{\mathbf{C}}^{\mathrm{off}}_k$ is $ \hat{\mathbf{C}}_k$ with its diagonal elements set to zero. Note that if the $ \mathbf{C}_k$ are symmetric then $ \hat{\mathbf{C}}^{\mathrm{off}}_k \mathbf{B}\mathbf{C}_k^T = \hat{\mathbf{C}}^{\mathrm{off}\,T}_k \mathbf{B}\mathbf{C}_k$ and the expression simplifies.
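A direct NumPy transcription of this gradient (again an illustrative sketch of ours, reusing the helper functions defined above) might read:

    def gradient(B, Cs, weights, lam):
        # Gradient (4) of criterion (3) with respect to B.
        G = np.zeros_like(B)
        for w, C in zip(weights, Cs):
            Chat = B @ C @ B.T
            Coff = Chat - np.diag(np.diag(Chat))  # off-diagonal part of B C_k B^T
            G += 2.0 * w * (Coff @ B @ C.T + Coff.T @ B @ C)
        return G - lam * np.linalg.inv(B).T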

Optimization

With the gradient (4) in hand we can use any suitable gradient optimization method to minimize (3). The experimental results below were obtained using simple steepest descent, which has the advantage of simplicity but is not in general very efficient. To increase stability and improve convergence we can instead use the natural gradient (Amari, 1998), which makes the algorithm covariant and avoids the matrix inversion in the computation of the derivative. The update then takes the form

$\displaystyle \mathbf{B}_{t+1} = \mathbf{B}_t - \eta_t \mathbf{M}(\mathbf{B}_t) \frac{dE(\mathbf{B}_t)}{d\mathbf{B}},$ (5)

where $ \mathbf{M}(\mathbf{B}_t)$ sets the descent direction: $ \mathbf{M}(\mathbf{B}_t)=\mathbf{I}$ gives plain steepest descent, while the natural gradient uses $ \mathbf{M}(\mathbf{B}_t)=\mathbf{B}_t$, as suggested by Amari et al. (1996).

When the Lagrange-multiplier term $ \lambda \log \vert\mathbf{B}\vert$ is used to keep $ \mathbf{B}$ from becoming singular, we would also like to avoid spending gradient descent steps merely on re-scaling $ \mathbf{B}$ toward the saddle point between the two components of (3). For this purpose we introduce a projection operation, performed once before optimization begins,

$\displaystyle \hat{\mathbf{B}}_0 = \mathbf{B}_0 \cdot \left( \frac{\lambda m}{4 E(\mathbf{B}_0)} \right)^{1/4}$ (6)
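Putting the pieces together, a minimal steepest-descent loop (the $ \mathbf{M}=\mathbf{I}$ case of (5), preceded by the one-time projection (6), and reusing off and gradient from the sketches above) could look as follows. The identity initialization, the stopping rule, and the default parameter values (mirroring those quoted for the non-stationary-sources experiment below) are our own illustrative choices:

    def jointly_diagonalize(Cs, weights=None, lam=1e-4, eta=5e-3, tol=1e-9, max_iter=100000):
        # Steepest descent on criterion (3), starting from the identity matrix.
        m = Cs[0].shape[0]
        if weights is None:
            weights = [1.0] * len(Cs)
        B = np.eye(m)
        E0 = sum(w * off(B @ C @ B.T) for w, C in zip(weights, Cs))
        B = B * (lam * m / (4.0 * E0)) ** 0.25  # one-time projection (6); assumes E0 > 0
        for _ in range(max_iter):
            step = eta * gradient(B, Cs, weights, lam)
            B = B - step
            if np.max(np.abs(step)) < tol:  # "error threshold" stopping rule
                break
        return B

In practice the learning rate $ \eta$ needs to be tuned to the problem at hand, as discussed in the experiments below.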

??? Was this projection used or not? If so, why bother with the log-det term?
!!! Good point; the paper and the algorithm have been corrected accordingly. It may be better simply to rescale $ \mathbf{B}$ and drop the Lagrange multiplier altogether.
??? Was the natural gradient used, or not? And what exactly is the natural gradient equation? This is very confusing.
!!! I am confused too, in the sense that I do not know whether the usual $ W^TW$ factor applies to our optimization rule, since our criterion space can be quite different from the ICA criteria for which it was proven to work. I will check this in practice and adjust the article accordingly.


Numerical Experiments

These numerical experiments used data from Pham and Cardoso (2000), to which Prof. Cardoso was kind enough to give us access. There are three speech waveforms sampled at 8 kHz in this data set. About 2000 ms of speech are available for each speaker: one male, one female, and one child. The non-stationarity of these signals is used in Section 4.2, but is not important for the results of Section 4.1, which merely compares the diagonalization results obtained by various algorithms.


Diagonalization Performance

??? Why is this section called ``Diagonalization''? Isn't that the subject of the whole paper?

The proposed algorithm can be used to fine-tune results obtained with algorithms that introduce additional constraints, for instance a presphering of the data.

The matrices to be diagonalized in these experiments were lagged covariance matrices of synthetic sensor signals, computed at delays $ \tau=[$0:10 12:2:20 25:5:100 110:10:200$ ] \cdot $0.125 $ ms$ (in Matlab notation). The covariance matrix at each $ \tau$ was symmetrized to give the set of matrices $ \mathbf{C}_k$ via $ \mathbf{C}_k = (\mathbf{C}_\tau + \mathbf{C}_\tau^T)/2$.
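As an illustration of this construction, and under our own assumptions (sensor signals stored as an $ m \times T$ NumPy array, lags expressed in samples, and a plain sample-covariance estimator), the set $ \mathbf{C}_k$ could be formed as:

    def symmetrized_lagged_covariances(X, lags):
        # C_k = (C_tau + C_tau^T) / 2 for each lag tau (given in samples).
        X = X - X.mean(axis=1, keepdims=True)
        Cs = []
        for tau in lags:
            T = X.shape[1] - tau
            C_tau = X[:, :T] @ X[:, tau:tau + T].T / T  # lagged covariance estimate
            Cs.append(0.5 * (C_tau + C_tau.T))
        return Cs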

Cardoso's JAD method and Pham's BGML method were used to evaluate the off-diagonality criterion (2), and their outputs served as starting points from which the separating matrix was refined with our iterative steepest-descent algorithm. To compare the performance of the different algorithms, each final unmixing matrix was row-scaled to give 1's on the diagonal of $ \hat{\mathbf{C}}_0$, the zero-delay covariance matrix of the separated sources.
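The row-scaling used for this comparison amounts to the following (illustrative) helper, where C0 denotes the zero-delay covariance matrix of the observations:

    def scale_rows_to_unit_diagonal(B, C0):
        # Rescale the rows of B so that diag(B C0 B^T) consists of 1's.
        d = np.diag(B @ C0 @ B.T)
        return B / np.sqrt(d)[:, None]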

As shown in Figure 1, Cardoso's method did not give good results, probably due to its presphering step, and when started from such a point, far from the right solution, our gradient algorithm frequently converged to a local minimum. Pham's method, which is based on a different criterion, gave much better results. Further refinement of the output of Pham's algorithm with the gradient method proposed here lowered the off-diagonality criterion to nearly that of the original unmixed signals.

Figure 1: Distribution of the off-diagonality criterion (2) obtained on the lagged covariance matrices of the three signals, mixed with a random mixing matrix, using different diagonalization techniques as the starting point for our iterative algorithm.
??? This caption is completely unclear. What are you trying to say? What is the point of this graph? In any case, it must be redone in some other way which is (a) understandable, (b) big enough for normal people to read without a magnifying glass, and (c) does not require color.
!!! This graph shows a lot, in my view: first, that our algorithm should start close to the solution, otherwise it performs almost like Cardoso's, which requires presphering. Cardoso's algorithm gives a solution that depends on the mixing matrix and is far from the global minimum. Pham's algorithm gives a much better, mixing-matrix-independent solution. If we use its solution as the starting point of our algorithm, the criterion comes closer to the value given by the original unmixing matrix. It also suggests that minimizing the squares of the off-diagonal elements is more appropriate than Pham's log-det criterion; maybe I did not state that explicitly. You are right that it should be more readable and colorless; I am just not sure how to accomplish that, perhaps by creating separate graphs for each criterion.
\includegraphics[width=0.8\textwidth]{cmp_offdiag.eps}


Non-Stationary Sources

We applied our algorithm to separate data online, as in Pham and Cardoso (2000). Our steepest-descent algorithm was not as fast as the Jacobi-like rotations of Pham and Cardoso (2000), although proper tuning of the learning parameters (step size and error threshold) improved it on specific examples, and the introduction of modern stochastic gradient methods (Schraudolph, 2002) should give a considerable further acceleration.

Running code obtained from Pham and Cardoso (2000) with the same parameters (speech signals, a block length of 320 samples (40 ms), and a memory of the last $ L=12$ covariance matrices), but using our diagonalizer ($ \lambda=10^{-4}$, error threshold $ 10^{-9}$, and $ \eta_t=5 \times 10^{-3}$) and scaling the rows to give a unit diagonal, we obtained

$\displaystyle \hat{\mathbf{B}}\mathbf{A}= \left[\begin{array}{rrr} 1.0000 & \cdots & \cdots \\ \cdots & 1.0000 & -0.0006 \\ -0.0218 & 0.0100 & 1.0000 \end{array} \right]$ (7)

The JD-BGL algorithm gave better results (its off-diagonal elements were 10-100 times smaller than those obtained with our diagonalizer), but our results are better than those reported for JADE in Pham and Cardoso (2000), which includes a presphering phase.

Figure 2 shows the convergence of the on-line version of JD-BGL and of the same procedure using our diagonalization algorithm instead. Our algorithm converged close to the solution even earlier than JD-BGL. So while JD-BGL gives better results overall, the simple criterion (2) needs less data (fewer covariance matrices) to give plausible results.

Figure 2: Convergence of the 9 coefficients of the global system $ \mathbf{B}(t)   \mathbf{A}$ versus the number of blocks for an $ n=3$ source case, shown for (a) our algorithm and (b) JD-BGL of Pham and Cardoso (2000).
??? Consider using a log scale for the x-axis. Consider plotting something else, like a measure of separation. Do something to make the fonts on the axes readable to a normal person, and to make this less ugly.
!!! I included this graph in the same form as it appeared in Cardoso's paper, alongside the same plot from running our algorithm, so the two would be easier to compare.
\includegraphics[width=\columnwidth]{TestOnlineGrad0} \includegraphics[width=\columnwidth]{TestOnlineBGML}

Discussion

Acknowledgements

Supported in part by NSF CAREER award 97-02-311, the Mental Illness and Neuroscience Discovery Institute, and a gift from the NEC Research Institute.

Bibliography

Amari, S., Cichocki, A., and Yang, H. H. (1996).
A new learning algorithm for blind signal separation.
In Advances in Neural Information Processing Systems 8. MIT Press.

Amari, S.-I. (1998).
Natural gradient works efficiently in learning.
Neural Computation, 10(2):251-276.

Bell, A. J. and Sejnowski, T. J. (1995).
An information-maximization approach to blind separation and blind deconvolution.
Neural Computation, 7(6):1129-1159.

Cardoso, J.-F. (1994).
On the performance of orthogonal source separation algorithms.
In Proc. EUSIPCO, pages 776-779, Edinburgh.

Cardoso, J.-F. and Souloumiac, A. (1996).
Jacobi angles for simultaneous diagonalization.
SIAM Journal on Matrix Analysis and Applications, 17(1).

Douglas, S., Amari, S., and Kung, S. (1999).
Gradient adaptation with unit-norm constraints.
Technical Report EE-99-003, Dept. of Electrical Engineering, Southern Methodist University, Dallas, TX.

Pham, D.-T. (1999).
Joint approximate diagonalization of positive definite matrices.
Technical report, LMC/IMAG, France.

Pham, D.-T. and Cardoso, J.-F. (2000).
Blind separation of instantaneous mixtures of non-stationary sources.
In Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000), pages 187-193, Helsinki, Finland.

Schraudolph, N. N. (2002).
Fast curvature matrix-vector products for second-order gradient descent.
Neural Computation, 14(7):1723-1738.

Yeredor, A. (2002).
Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation.
IEEE Transactions on Signal Processing, 50(7):1545-1553.

Zibulevsky, M. and Pearlmutter, B. A. (2000).
Second order blind source separation by recursive splitting of signal subspaces.
In International Workshop on Independent Component Analysis and Blind Signal Separation, pages 489-491, Helsinki, Finland.



Derivation of the Gradient

First let us find the derivative $ {d\hat{\mathbf{C}}}/{d\mathbf{B}}$, where $ \hat{\mathbf{C}}= \mathbf{B}\mathbf{C}\mathbf{B}^T$ (we omit the index $ k$ on the $ \mathbf{C}$ matrices for now). To simplify notation and avoid use of the Kronecker product, we work element by element:

$\displaystyle \frac{\partial\hat{\mathbf{C}}}{\partial b_{lm}} = \mathbf{E}_{lm} \mathbf{C}\mathbf{B}^T + \mathbf{B}\mathbf{C}\mathbf{E}_{ml},$ (8)

where $ \mathbf{E}_{lm}$ is the elementary matrix which is zero everywhere except for a one in its $ (l,m)$ element.3 Thus $ {\partial\hat{\mathbf{C}}}/{\partial b_{lm}}$ is non-zero only in its $ l$-th row and $ l$-th column.

The off-diagonality criterion $ {\mathrm{off}}(\hat{\mathbf{C}}) = \sum_i \sum_{j \neq i} \hat{c}_{ij}^2$ then has the derivative

$\displaystyle \frac{\partial\,{\mathrm{off}}(\hat{\mathbf{C}})}{\partial b_{lm}} = 2 \sum_i \sum_{j \neq i} \hat{c}_{ij} \frac{\partial\hat{c}_{ij}}{\partial b_{lm}}$ (9)

Equation (9) simplifies because, as noted above, each $ {\partial\hat{\mathbf{C}}}/{\partial b_{lm}}$ has only a single non-zero row and column.
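Spelling out this observation element-wise (an intermediate step added here for clarity, using only the definitions above),

$\displaystyle \frac{\partial\hat{c}_{ij}}{\partial b_{lm}} = \delta_{il}\, (\mathbf{C}\mathbf{B}^T)_{mj} + \delta_{jl}\, (\mathbf{B}\mathbf{C})_{im},$

and substituting this into (9) and collecting the terms over $ l$ and $ m$ gives, in matrix form,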

$\displaystyle \frac{\partial\,{\mathrm{off}}(\hat{\mathbf{C}})}{\partial \mathbf{B}} = 2 \left( \hat{\mathbf{C}}^{\mathrm{off}} (\mathbf{C}\mathbf{B}^T)^T + \hat{\mathbf{C}}^{\mathrm{off}\,T} \mathbf{B}\mathbf{C}\right)$ (10)

Finally, combining this result for each weighted term with $ {d \log\vert\mathbf{B}\vert}/{d\mathbf{B}} = \mathbf{B}^{-T}$ for the penalty term in (3), we obtain the result (4).
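As a sanity check on (4) (not part of the original derivation), the analytic gradient can be compared with central finite differences of (3), reusing the illustrative NumPy helpers defined earlier:

    def check_gradient(Cs, weights, lam=1e-4, eps=1e-6, seed=0):
        # Maximum absolute difference between gradient (4) and a
        # central finite-difference approximation of criterion (3).
        rng = np.random.default_rng(seed)
        m = Cs[0].shape[0]
        B = rng.standard_normal((m, m))
        G = gradient(B, Cs, weights, lam)
        G_num = np.zeros_like(B)
        for i in range(m):
            for j in range(m):
                Bp, Bm = B.copy(), B.copy()
                Bp[i, j] += eps
                Bm[i, j] -= eps
                G_num[i, j] = (criterion(Bp, Cs, weights, lam)
                               - criterion(Bm, Cs, weights, lam)) / (2 * eps)
        return float(np.max(np.abs(G - G_num)))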



Footnotes

... near1
Any method which uses presphering followed by a unitary rotation gives a near-local-minimum solution (Cardoso, 1994).
... singular.2
This particular term assumes that $ \mathbf{B}$ is square.
... element3
Pre-multiplication of a matrix $ \mathbf{A}$ by $ \mathbf{E}_{lm}$ gives a matrix that is zero everywhere except that its $ l$-th row contains the $ m$-th row of $ \mathbf{A}$. Post-multiplication acts similarly on columns: $ \mathbf{A}\mathbf{E}_{ml}$ is zero everywhere except that its $ l$-th column contains the $ m$-th column of $ \mathbf{A}$.


Yaroslav Halchenko 2004-01-28