# Alternating randomized least squares for CP Decomposition

The function cp_arls computes an estimate of the best rank-R CP model of a tensor X using alternating randomized least-squares algorithm. The input X must be a (dense) tensor. The output CP model is a ktensor. The CP-ARLS method is described in the following reference:

• C. Battaglino, G. Ballard, T. G. Kolda. A Practical Randomized CP Tensor Decomposition, to appear in SIAM J. Matrix Analysis and Applications, 2017. Preprint: (arXiv:1701.06600).

## Set up a sample problem

We set up an especially difficult and somewhat large sample problem that has high collinearity (0.9) and 1% noise. This is an example where the randomized method will generally outperform the standard method.

```sz = [200 300 400];
R = 5;
ns = 0.01;
coll = 0.9;

info = create_problem('Size', sz, 'Num_Factors', R, 'Noise', ns, ...
'Factor_Generator', @(m,n) matrandcong(m,n,coll), ...
'Lambda_Generator', @ones);

% Extract data and solution
X = info.Data;
M_true = info.Soln;
```

## Running the CP-ARLS method

Running the method is essentially the same as using CP-ALS, feed the data matrix and the desired rank. Note that the iteration is of the form NxN which is the number of epochs x the number of iterations per epoch. The default number of iterations per epoch is 50. At the end of each epoch, we check the convergence criteria. Because this is a randomized method, we do not achieve strict decrease in the objective function. Instead, we look at the number of epochs without improvement (newi) and exit when this crosses the predefined tolerance (`newitol`), which defaults to 5. It is important to note that the fit values that are reported are approximate, so this is why it is denoted by `f~` rather than just `f`.

```tic
[M1, ~, out1] = cp_arls(X,R);
time1 = toc;
scr1 = score(M1,M_true);
fprintf('\n*** Results for CP-ARLS (with mixing) ***\n');
fprintf('Time (secs): %.3f\n', time1)
fprintf('Score (max=1): %.3f\n', scr1);
```
```CP-ARLS (with mixing):
Iter 10x50: f~ = 9.895866e-01 newi = 0
Iter 20x50: f~ = 9.895783e-01 newi = 4
Iter 21x50: f~ = 9.895769e-01 newi = 5

*** Results for CP-ARLS (with mixing) ***
Time (secs): 8.923
Score (max=1): 0.989
```

## Speed things up by skipping the initial mixing

The default behavior is to mix the data in each mode using an FFT and diagonal random +/-1 matrix. This may add substantial preprocessing time, though it helps to ensure that the method converges. Oftentimes, such as with randomly-generated data, the mixing is not necessary.

```tic
[M2, ~, out2] = cp_arls(X,R,'mix',false);
time2 = toc;
scr2 = score(M2,M_true);

fprintf('\n*** Results for CP-ARLS (no mix) ***\n');
fprintf('Time (secs): %.3f\n', time2)
fprintf('Score (max=1): %.3f\n', scr2);
```
```CP_ARLS (without mixing):
Iter 10x50: f~ = 9.889934e-01 newi = 1
Iter 20x50: f~ = 9.891646e-01 newi = 0
Iter 30x50: f~ = 9.890769e-01 newi = 5

*** Results for CP-ARLS (no mix) ***
Time (secs): 8.029
Score (max=1): 0.976
```

## Comparing with CP-ALS

CP-ALS may be somewhat faster, especially since this is a relatively small problem, but it usually will not achieve as good of an answer in terms of the score.

```tic;
[M3, ~, out3] = cp_als(X,R,'maxiters',500,'printitn',10);
time3 = toc;
scr3 = score(M3,M_true);
fprintf('\n*** Results for CP-ALS ***\n');
fprintf('Time (secs): %.3f\n', time3)
fprintf('Score (max=1): %.3f\n', scr3);
```
```CP_ALS:
Iter 10: f = 9.726647e-01 f-delta = 4.8e-04
Iter 20: f = 9.756937e-01 f-delta = 2.6e-04
Iter 30: f = 9.868227e-01 f-delta = 2.7e-03
Iter 33: f = 9.884157e-01 f-delta = 5.8e-05
Final f = 9.884157e-01

*** Results for CP-ALS ***
Time (secs): 3.505
Score (max=1): 0.709
```

## How well does the approximate fit do?

It is possible to check the accuracy of the fit computation by having the code compute the true fit and the final solution, enabled by the `truefit` option.

```[M4,~,out4] = cp_arls(X,R,'truefit',true);
```
```CP-ARLS (with mixing):
Iter 10x50: f~ = 9.897749e-01 newi = 2
Iter 17x50: f~ = 9.898412e-01 newi = 5
Final fit = 9.898393e-01 Final estimated fit = 9.898489e-01
```

## Varying epoch size

It is possible to vary that number of iterations per epoch. Fewer iterations means that more time is spent checking for convergence and it may also be harder to detect as an single iteration can have some fluctuation and we are actually looking for the overall trend. In contrast, too many iterations means that the method won't realize when it has converged and may spend too much time computing.

```tic
M = cp_arls(X,R,'epoch',1,'newitol',20);
toc
fprintf('Score: %.4f\n',score(M,M_true));
```
```CP-ARLS (with mixing):
Iter 10x1: f~ = 9.701522e-01 newi = 1
Iter 20x1: f~ = 9.725415e-01 newi = 5
Iter 30x1: f~ = 9.738450e-01 newi = 0
Iter 40x1: f~ = 9.745092e-01 newi = 0
Iter 50x1: f~ = 9.770124e-01 newi = 0
Iter 60x1: f~ = 9.877161e-01 newi = 0
Iter 70x1: f~ = 9.881069e-01 newi = 0
Iter 80x1: f~ = 9.882726e-01 newi = 2
Iter 90x1: f~ = 9.885071e-01 newi = 1
Iter 100x1: f~ = 9.886444e-01 newi = 0
Iter 110x1: f~ = 9.886701e-01 newi = 8
Iter 120x1: f~ = 9.888270e-01 newi = 1
Iter 130x1: f~ = 9.888974e-01 newi = 1
Iter 140x1: f~ = 9.891171e-01 newi = 0
Iter 150x1: f~ = 9.891283e-01 newi = 0
Iter 160x1: f~ = 9.890805e-01 newi = 1
Iter 170x1: f~ = 9.892718e-01 newi = 0
Iter 180x1: f~ = 9.891860e-01 newi = 8
Iter 190x1: f~ = 9.893004e-01 newi = 6
Iter 200x1: f~ = 9.894006e-01 newi = 0
Iter 210x1: f~ = 9.893971e-01 newi = 1
Iter 220x1: f~ = 9.894173e-01 newi = 0
Iter 230x1: f~ = 9.894356e-01 newi = 2
Iter 240x1: f~ = 9.894381e-01 newi = 3
Iter 250x1: f~ = 9.894686e-01 newi = 6
Iter 260x1: f~ = 9.894537e-01 newi = 4
Iter 270x1: f~ = 9.894538e-01 newi = 1
Iter 280x1: f~ = 9.895148e-01 newi = 1
Iter 290x1: f~ = 9.895310e-01 newi = 7
Iter 300x1: f~ = 9.895601e-01 newi = 17
Iter 303x1: f~ = 9.895373e-01 newi = 20
Elapsed time is 3.350615 seconds.
Score: 0.9171
```
```tic
M = cp_arls(X,R,'epoch',200,'newitol',3,'printitn',2);
toc
fprintf('Score: %.4f\n',score(M,M_true));
```
```CP-ARLS (with mixing):
Iter  2x200: f~ = 9.896707e-01 newi = 0
Iter  4x200: f~ = 9.896960e-01 newi = 1
Iter  6x200: f~ = 9.896960e-01 newi = 3
Elapsed time is 10.381222 seconds.
Score: 0.9720
```

## Set up another sample problem

We set up another problem with 10% noise, but no collinearity.

```sz = [200 300 400];
R = 5;
ns = 0.10;

info = create_problem('Size', sz, 'Num_Factors', R, 'Noise', ns, ...
'Factor_Generator', @rand,'Lambda_Generator', @ones);

% Extract data and solution
X = info.Data;
M_true = info.Soln;
```

## Terminating once a desired fit is achieved

If we know the noise level is 10%, we would expect a fit of 0.90 at best. So, we can set a threshold that is close to that and terminate as soon as we achieve that accuracy. Since detecting convergence is hard for a randomized method, this can lead to speed ups. However, if the fit is not high enough, the accuracy may suffer consequently.

```M = cp_arls(X,R,'newitol',20,'fitthresh',0.895,'truefit',true);
fprintf('Score: %.4f\n',score(M,M_true));
```
```CP-ARLS (with mixing):
Iter  1x50: f~ = 8.966351e-01 newi = 0
Final fit = 8.972839e-01 Final estimated fit = 8.966351e-01
Score: 0.9566
```

## Changing the number of function evaluation samples

The function evaluation is approximate and based on sampling the number of entries specified by `nsampfit`. If this is too small, the samples will not be accurate enough. If this is too large, the computation will take too long. The default is , which should generally be sufficient. It may sometimes be possible to use smaller values. The same sampled entries are used for every convergence check --- we do not resample to check other entries.

```M = cp_arls(X,R,'truefit',true,'nsampfit',100);
fprintf('Score: %.4f\n',score(M,M_true));
```
```CP-ARLS (with mixing):
Iter  7x50: f~ = 8.890117e-01 newi = 5
Final fit = 8.935068e-01 Final estimated fit = 8.948104e-01
Score: 0.9809
```

## Change the number of sampled rows in least squares solve

The default number of sampled rows for the least squares solves is `ceil(10*R*log2®)`. This seemed to work well in most tests, but this can be varied higher or lower. For R=5, this means we sample 117 rows per solve. The rows are different for every least squares problem. Let's see what happens if we reduce this to 10.

```M = cp_arls(X,R,'truefit',true,'nsamplsq',10);
fprintf('Score: %.4f\n',score(M,M_true));
```
```CP-ARLS (with mixing):
Iter 10x50: f~ = 7.180423e-01 newi = 4
Iter 11x50: f~ = 7.033075e-01 newi = 5
Final fit = 7.553195e-01 Final estimated fit = 7.554767e-01
Score: 0.1939
```

What if we use 25?

```M = cp_arls(X,R,'truefit',true,'nsamplsq',25);
fprintf('Score: %.4f\n',score(M,M_true));
```
```CP-ARLS (with mixing):
Iter  8x50: f~ = 8.812275e-01 newi = 5
Final fit = 8.816266e-01 Final estimated fit = 8.813898e-01
Score: 0.9236
```