Consonni, V., Ballabio, D., Manganaro, A., Mauri, A., Todeschini, R. (2009). Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 2. Variable reduction. ANALYTICA CHIMICA ACTA, 648, 52-59 [10.1016/j.aca.2009.06.035].

Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 2. Variable reduction

CONSONNI, VIVIANA; BALLABIO, DAVIDE; TODESCHINI, ROBERTO
2009

Abstract

This paper proposes a new method for determining the subset of variables that reproduces, as well as possible, the main structural features of the complete data set. The method can be useful for pre-treating large data sets because it allows variables that contain redundant information to be discarded. Reducing the number of variables often makes it easier to investigate the data structure and yields more stable results from multivariate modelling methods. The method is based on the recently proposed canonical measure of correlation (CMC index) between two sets of variables [R. Todeschini, V. Consonni, A. Manganaro, D. Ballabio, A. Mauri, Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 1. Theory and simple chemometric applications, Anal. Chim. Acta submitted for publication (2009)]. Following a stepwise procedure (backward elimination), each variable in turn is compared to all the other variables, and the most correlated one is permanently discarded. In the end, a key subset of variables that are as orthogonal as possible is selected. Performance was evaluated on both simulated and real data sets. The effectiveness of the method is discussed by comparison with the results of other well-known methods for variable reduction, such as the Jolliffe techniques, the McCabe criteria, the Krzanowski approach and its modification based on genetic algorithms, loadings of the first principal component, Key Set Factor Analysis (KSFA), Variance Inflation Factor (VIF), the pairwise correlation approach, and K correlation analysis (KIF). The results obtained are consistent with those of the other methods considered; moreover, the proposed CMC method has the advantage that the calculation is very fast and can be easily implemented in any software application. © 2009 Elsevier B.V. All rights reserved.
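The backward-elimination loop described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula of the CMC index, so this sketch does not implement it: as a loudly labelled stand-in, each variable is scored by its mean absolute Pearson correlation with the other retained variables. The function names (`pearson`, `redundancy`, `backward_eliminate`), the `n_keep` stopping rule and the toy data are all hypothetical choices made for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)


def redundancy(j, kept, columns):
    """Stand-in score, NOT the CMC index: mean |r| of column j
    against the other columns that are still retained."""
    others = [k for k in kept if k != j]
    return sum(abs(pearson(columns[j], columns[k])) for k in others) / len(others)


def backward_eliminate(columns, n_keep):
    """Discard the most redundant variable, one per step,
    until only n_keep variables remain. Returns retained indices."""
    kept = list(range(len(columns)))
    while len(kept) > max(n_keep, 1):
        worst = max(kept, key=lambda j: redundancy(j, kept, columns))
        kept.remove(worst)  # discarded variables are never reconsidered
    return kept


# Toy data: column 1 is a noisy copy of column 0; column 2 is nearly unrelated.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],
]
kept = backward_eliminate(columns, n_keep=2)
print(kept)
```

On this toy data one of the two near-duplicate columns is dropped and the weakly correlated column is retained. Substituting the actual CMC index for `redundancy` would require the definition given in Part 1 of the paper.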
Journal article - Scientific article
Keywords: similarity/diversity; correlation; variable reduction; CMC index; CMD index
Language: English
Pages: 52-59
Consonni, V; Ballabio, D; Manganaro, A; Mauri, A; Todeschini, R

Use this identifier to cite or link to this document: https://hdl.handle.net/10281/6940