Assessing the performance of PLS regression and CCA with high-dimensional two-block data structure
C. Lalanne (a), E. Duchesnay (a,b,c), V. Frouin (a), B. Thyreau (a), A. Tenenhaus (d), B. Thirion (a,e), J.-B. Poline (a,b,c,e)
(a) CEA, I2BM, Neurospin, Gif-sur-Yvette, France
(b) IFR49, Institut d’Imagerie Neurofonctionnelle, Paris, France
(c) INSERM-CEA U1000, Neuroimaging & Psychiatry Unit, SHFJ, Orsay, France
(d) Supélec, Department of Signal Processing and Electronic Systems, Gif-sur-Yvette, France
(e) INRIA Saclay-Ile-de-France, Parietal project, France
Introduction
Partial Least Squares (PLS) regression and Canonical Correlation Analysis (CCA) make it possible to study co-varying networks of variables between two-block multivariate structures [Wegelin, 2000]. Such reduction methods are particularly useful when dealing with high-dimensional data sets, such as omics data (e.g., sequence polymorphisms or gene products; Waaijenborg & Zwinderman, 2007; Parkhomenko, Tritchler & Beyene, 2009; Lê Cao, Rossouw, Robert-Granié & Besse, 2008) or anatomical neuroimaging studies [Hardoon et al., 2009]. Comparing the performance of these two techniques is, however, challenging, because they give rise neither to a common statistic whose p-values might be compared nor to a likelihood allowing model selection. We propose here a new way to assess the covariation between two data sets of dimensions p and q measured on the same n subjects (p+q>>n), based on the simultaneous reduction of both data sets and the estimation of a common indicator of their degree of association. The statistical significance of the two-block link is assessed with empirical p-values obtained by permutation.
Methods
We generated an artificial data set of mixed categorical (X, n x p) and continuous (Y, n x q) variables using a generative latent variable model, in which we varied both the reliability of the measurements and the intra- and inter-block correlations. In this way, we were able to generate different kinds of cross-covariance matrices whose sparsity reflects the amount of variance shared between the two blocks. Next, we selected the top k ranked features in the X block that were maximally correlated with any one of the q variables in the Y block, using optimized F-tests; this feature selection was carried out within a 10-fold cross-validation (CV) scheme. We then estimated, on the training subjects, the X and Y canonical variates from the PLS and CCA models under varying levels of L1 regularization applied to each block, namely {0, 25, 50, 75, 100}% of penalization on the variable loadings. Factorial scores for the test subjects were then computed from these canonical variates, and a test correlation was computed as the mean cross-product of the first X and Y factor scores for the k-th fold. To estimate the distribution of this statistic under the null hypothesis, the whole procedure---feature selection, regularization and computation of the test correlation---was repeated over 1000 permutations, which ensures that we can draw conclusions at a nominal 5% level when assessing the significance of the results.
Results
Results of our simulations indicate that PLS and CCA are good candidates for dealing with two-block data in the case n << p+q.