Trait selection using Procrustes analysis for the study of genetic diversity in Conilon coffee

Trait selection is sometimes necessary to save money and time and to accelerate breeding programs. This study proposes two criteria for selecting traits based on Procrustes analysis, a technique that has been poorly explored in genetic breeding: Criterion 1 (backward algorithm) and Criterion 2 (exhaustive algorithm). These two criteria were then compared with Jolliffe’s criterion, which has often been used to select traits in genetic diversity studies. Sixteen agronomic traits were evaluated in 40 Conilon coffee (Coffea canephora) accessions. The results showed that the flexibility in selecting traits by researcher preference, graphical visualization, and the Procrustes statistic through Criteria 1 and 2 is a fast and reliable alternative for decision-making. These decisions concern the removal and addition of traits for phenotyping in studies of Conilon coffee diversity and can be applied to other crops. Other relevant aspects of trait selection criteria are also discussed.


Introduction
Studies on genetic diversity play an important role in breeding programs because they are crucial in the initial phase, called prebreeding, in which it is possible to regenerate, characterize, explore, and promote the conservation of the variability available in the base population. Moreover, the information collected in the prebreeding phase is useful for identifying potential candidates to serve as divergent parents. These parents are more likely to give satisfactory results regarding the genetic potential of derived cultivars or lineages, as well as regarding their combining ability to produce heterotic hybrids.
The genetic diversity existing among and within populations can be measured by the difference between the phenotypic values of their accessions and is obtained in field experiments using a considerable number of morphological, agronomic, and other traits of the studied cultivars. If a collection of accessions evaluated in a given experiment comes from a population or a germplasm bank, it can be re-evaluated in future studies for a variety of purposes. When a particular trait is costly or difficult to measure, it may be valuable to evaluate fewer traits than those recorded in the germplasm bank. However, variability is extremely important in the development of new varieties and in the conservation of genetic resources, and it is the breeder's responsibility to investigate the extent to which the exclusion of one or more traits will affect the variability present in the group of accessions under analysis.
The relative importance of traits in genetic diversity studies can be assessed using the criteria proposed by Singh (1981) and Jolliffe (1972). However, the use of each one is tied to the researcher's initial choice of method for studying genetic diversity, since the two criteria follow different approaches.
The first criterion is used when diversity is evaluated from dissimilarity measures (distances between pairs of accessions) that feed a cluster analysis. The second criterion is based on principal components, which generate a graphical dispersion in two- or three-dimensional space.
In addition to the trait-discarding criteria mentioned above, there is another methodology based on Procrustes analysis. Although it is rarely used in genetic diversity studies (especially in trait selection), the Procrustes approach has been used in many different areas, including food engineering (Oliveira & Benassi, 2010; Mauricio, Palazzo, Caselato, & Bolini, 2016) and health sciences (Douglas, 2004; Daboul, Ivanovska, Bülow, Biffar, & Cardini, 2018). Thus, its application has shown great promise and has been evaluated in several studies. However, the technique has been poorly explored in genetic breeding (Klingenberg, 2003; Bramardi, Bernet, Asíns, & Carbonell, 2005; García-Peña & Dias, 2009), which was the main motivation for this paper.
Procrustes analysis allows a comparison of two configurations or two datasets as long as each row corresponds to the same individual. If two sets of vectors differ from each other but are defined in the same subspace, it is possible to estimate how much their respective graphical representations differ by means of the Procrustes statistic. The smaller the value of this statistic, the more similar the two configurations. Krzanowski (1987) presents a methodology that combines principal component analysis (PCA), which is used to obtain the scores of the configurations, and Procrustes analysis to determine how well a subset of traits represents the structure of the original dataset (with all traits). The author discusses Procrustes analysis from two perspectives: the selection of traits by a backward elimination algorithm that uses the Procrustes statistic as a discard criterion, and the comparison, using the same statistic, of the grouping patterns produced by trait subsets resulting from different selection methods.
Based on the strategy proposed by Krzanowski (1987) and using the Procrustes statistic given by Peres-Neto and Jackson (2001), our objective is to propose two cut-off criteria, called the backward algorithm (Criterion 1) and the exhaustive algorithm (Criterion 2), for the selection of traits in the study of genetic diversity of Conilon coffee (Coffea canephora). To validate the methodology, we compare both with Jolliffe's criterion (1972), since it is considered a more efficient trait-discarding method for genetic diversity studies, providing more savings in a breeding program.

Database
The database provided by Ferrão et al. (2008) refers to the means of the characteristics. The experiment used a randomized block design with six replications for PH and DCA and four replications for the other characteristics, with each plot consisting of two plants. Genotype effects were considered fixed, and analyses of variance of the characteristics were performed on plot means according to the following model:

$Y_{ij} = \mu + g_i + b_j + \varepsilon_{ij}$

where $Y_{ij}$ is the phenotypic value of the observation of the i-th genotype in the j-th block; $\mu$ is the overall mean of the character; $g_i$ is the effect of the i-th genotype (i = 1, 2, ..., 40); $b_j$ is the effect of the j-th block (j = 1, 2, ..., 4 or 6); and $\varepsilon_{ij}$ is the experimental error.

Sixteen agronomic traits from 40 Conilon coffee accessions were evaluated in the Sooretama municipality, located in the Brazilian state of Espírito Santo, in the year 2000. The traits evaluated were the number of days (D) between flowering and total fruit maturation; the grain yield (GY, kg ha-1); the plant height (PH, in cm); the diameter crown average (DCA, in cm), taken at the "middle third" of the plant; the cherry and coconut dry coffee relationship (ChCo), taken in a 2 kg sample of cherry coffee and its dried weight; the cherry and green coffee relationship (ChBe), taken in a 2 kg sample of cherry coffee and its dried weight after processing; the coconut and green coffee relationship (CoBe), taken in a 2 kg sample of cherry coffee and its dried weight after processing; the coarse grain percentage (CG%); the "flat" grain percentage (FG%); the "mocha" grain percentage (MG%); the grain moisture percentage (GM%); the percentage of grains retained on the sieve mesh size #17 (S17); the percentage of grains retained on the sieve mesh size #15 (S15); the percentage of grains retained on the sieve mesh size #13 (S13); the percentage of grains retained on the sieve mesh size #11 (S11); and the medium strainer (MS) (medium grain size).
According to Ferrão et al. (2008), the coefficients of experimental variation in percentage (CVe) of the characteristics are D (0.05), GY (23.24), PH (5.29), DCA (6.75), ChCo (6.85), ChBe (5.38), CoBe (7.26), CG (65.93), FG (5.20), MG (32.90), GM (11.20), S17 (31.95), S15 (11.17), S13 (15.95), S11 (51.52), and MS (2.16). Most of these values are below 30%, which indicates good experimental precision for the coffee crop (Bonomo et al., 2004; Ferrão et al., 2008; Rodrigues, Brinate, Martins, Colodetti, & Tomaz, 2017).

Procrustes analysis
To contextualize the criteria proposed in this work, it is important to first present pertinent information about Procrustes analysis. This technique allows the comparison of two datasets or two configurations, as long as each row corresponds to the same individual. If there are two sets of vectors that differ from one another but that define the same subspace, this technique allows the user to measure the difference between their respective graphical representations by means of the Procrustes statistic. When the comparison is performed for more than two datasets or configurations, it is called a generalized Procrustes analysis.
To understand the technique, consider the triangles $Y$ and $Z$ as the representation of two configurations in a two-dimensional space (matrices of n = 3 individuals and p = 2 traits) with different sizes, locations, and orientations (Figure 1a). The difference between these configurations is obtained by means of a Procrustes analysis so that their corresponding points align as well as possible. Procrustes analysis is a procedure that minimizes the trace of the sum of squared differences between two configurations (i.e., two data matrices) in a multivariate Euclidean space (Equation [1]), which is obtained in two steps by adjusting the configuration $Z$ to a reference configuration $Y$.

Equation [1]: $\min_{Q} \operatorname{tr}\left[(Y - ZQ)^{\top}(Y - ZQ)\right]$

First, centering (translation) and scaling (dilation) are carried out in $Y$ and $Z$ (Figure 1b), such that $Y \leftarrow (I - P)Y$ and $Z \leftarrow (I - P)Z$, where $I$ is an identity matrix and $P$ is a matrix with all elements equal to $1/n$, followed by the reflection (Figure 1c), if necessary, and the rotation of $Z$ (Figure 1d) for its adjustment to $Y$. That is, $Z$ is rotated to $ZQ$ such that $Q = VU^{\top}$ is the rotation matrix and $U \Sigma V^{\top}$ is the singular value decomposition of $Y^{\top}Z$, where $\Sigma$ is a diagonal matrix and $U$ and $V$ are orthogonal matrices. Finally, we have the statistic $M^2$ (Equation [2]) as a result of the comparison between $Y$ and $Z$, referred to as the Procrustes statistic or residual sum of squares, ranging from zero to infinity.

Equation [2]: $M^2 = \operatorname{tr}\left(Y^{\top}Y + Z^{\top}Z - 2\Sigma\right)$

According to Peres-Neto and Jackson (2001), the variation of the statistic is restricted to between 0 and 1 using the following transformation, in which both configurations are scaled to unit sum of squares, $\operatorname{tr}(Y^{\top}Y) = \operatorname{tr}(Z^{\top}Z) = 1$:

Equation [3]: $M^2 = 1 - \left[\operatorname{tr}(\Sigma)\right]^2$

The terms Procrustes analysis, Procrustes transformation, and Procrustes rotation convey the idea that the configurations should be brought as close as possible (in the same subspace) before they are compared. Thus, configurations under the same reference can be fairly compared, and the "real difference" between them can be quantified.
When Procrustes analysis is performed on the same configuration (Y = Z), we have $M^2 = 0$, indicating a perfect fit. Thus, the smaller the value of the statistic, the more similar the configurations (García-Peña & Dias, 2009).
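The steps described in this section (translation, dilation, then rotation or reflection) can be sketched in Python with NumPy. This is a minimal illustration, not the GENES implementation: the function name `procrustes_m2` and the example triangles are assumptions introduced here, and the statistic is computed as one minus the squared sum of the singular values of the cross-product of the centered, unit-scaled configurations, which is the usual bounded form of the statistic.

```python
import numpy as np

def procrustes_m2(Y, Z):
    """Procrustes statistic bounded in [0, 1]; 0 means a perfect fit."""
    Y = np.asarray(Y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Translation: remove the column means of each configuration
    Yc = Y - Y.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    # Dilation: scale each configuration to unit total sum of squares
    Yc = Yc / np.linalg.norm(Yc)
    Zc = Zc / np.linalg.norm(Zc)
    # Rotation/reflection: only the singular values are needed for the statistic
    s = np.linalg.svd(Yc.T @ Zc, compute_uv=False)
    return max(0.0, 1.0 - s.sum() ** 2)

# Hypothetical triangle Y and a rotated, enlarged, shifted copy Z
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Z = 3.0 * (Y @ R) + 5.0
W = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])  # a differently shaped triangle

print(procrustes_m2(Y, Z))  # close to 0: same shape after fitting
print(procrustes_m2(Y, W))  # larger: the shapes differ
```

Because size, location, and orientation are removed before the comparison, only a genuine difference in shape contributes to the value.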

Trait selection criteria
To reduce the number of agronomic traits in the Conilon coffee dataset, we initially selected a subset of traits using Procrustes analysis according to the methodology presented by Krzanowski (1987). This methodology combines principal component analysis (PCA) and Procrustes analysis (Figure 2) to determine how well a subset of q traits represents the structure of the set of p original traits. After PCA is performed on the matrix $X$ of the original traits and on the matrix $\tilde{X}$ of the subset of q traits, the configurations $Y$ and $Z$ are obtained, represented by the scores of the data matrices to be compared. Thus, if the true dimensionality of the data is k, then $Y$ will be the true configuration and $Z$ the corresponding approximation based on only q traits. The difference between the two configurations $Y$ and $Z$ was calculated by the $M^2$ statistic from the differences between the corresponding scores of these configurations. The loss of information due to the exclusion of (p-q) traits is represented by the residual produced when only the q traits are used instead of all p traits.

Figure 2. Diagram illustrating Procrustes analysis data by Krzanowski (1987).
The choice of k in most fields has generally been based on retaining the first k principal components that explain as much of the total variance as possible, thereby preserving as much of the information in the original dataset as possible. Here, a fixed value of k = 2 was used in order to compare two-dimensional graphical dispersions, which are very useful in genetic breeding. Therefore, the $M^2$ statistic characterizes the disagreement between the two graphical representations, based on the distance between accessions presented on a single 2D chart.
The strategy proposed by Krzanowski (1987) uses the PCA scores to obtain the configurations. However, since his statistic ranges from zero to infinity, an unbounded interval, it is better to use the Procrustes statistic provided by Peres-Neto and Jackson (2001), which is bounded. Accordingly, we established two criteria, the backward (Criterion 1) and exhaustive (Criterion 2) algorithms, for the selection of traits in the study of genetic diversity. We compare them with Jolliffe's criterion (1972) to validate the methodology. It is assumed that there is a subset of traits that satisfactorily represents the structure of the original dataset with a minimal loss of information (represented by $M^2$). That is, the residual produced by the loss of information due to discarding some of the traits is minimal, and therefore the cluster pattern of the evaluated accessions is not significantly affected.
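The comparison between the two-component configuration of all traits and that of a subset can be sketched as follows. This is an illustration under stated assumptions: the helper names `pca_scores`, `procrustes_m2`, and `subset_m2` are hypothetical, the data are simulated rather than the coffee dataset, and the bounded statistic is computed as one minus the squared sum of singular values of the centered, unit-scaled score matrices.

```python
import numpy as np

def pca_scores(X, k=2):
    """Scores of the standardized data on the first k principal components."""
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the component loadings, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ Vt[:k].T

def procrustes_m2(Y, Z):
    """Procrustes statistic bounded in [0, 1]; 0 means identical configurations."""
    Yc = Y - Y.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    Yc = Yc / np.linalg.norm(Yc)
    Zc = Zc / np.linalg.norm(Zc)
    s = np.linalg.svd(Yc.T @ Zc, compute_uv=False)
    return max(0.0, 1.0 - s.sum() ** 2)

def subset_m2(X, cols, k=2):
    """M^2 between the k-PC scores of all traits and of the traits in cols."""
    X = np.asarray(X, dtype=float)
    return procrustes_m2(pca_scores(X, k), pca_scores(X[:, list(cols)], k))

# Simulated stand-in for a 40 x 16 matrix of trait means (not the coffee data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 3))
X = latent @ rng.normal(size=(3, 16)) + 0.5 * rng.normal(size=(40, 16))

print(subset_m2(X, range(16)))          # all traits kept: close to 0
print(subset_m2(X, [0, 2, 5, 7, 11]))   # loss from keeping only five traits
```

Since the sign of a principal component is arbitrary, a subset may yield a mirrored scatter; the Procrustes fit absorbs such reflections, so a configuration and its mirror image give a statistic close to zero.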
Criterion 1: The Backward Algorithm
The backward algorithm proposed by Krzanowski (1987), with $M^2$ given by Equation [3], has no stopping rule. The result is a sequence of (p-k) traits and their respective estimated $M^2$ values. It is important to remember that k = 2 is the number of principal components chosen to graphically evaluate the genetic diversity. Moreover, the decision about which traits to retain in the study is arbitrary.
To select the subset of traits by means of this algorithm, Criterion 1 establishes a cutoff value for the $M^2$ statistic, called the critical value ($M^2_c$). The resulting subset of traits, named the optimal selection, is the subset whose estimated $M^2$ value is closest to (and less than or equal to) $M^2_c$.
Criterion 2: The Exhaustive Algorithm
Considering all combinations with k, k+1, ..., (p-1) traits, totaling $\binom{p}{k}, \binom{p}{k+1}, \ldots, \binom{p}{p-1}$ subsets, respectively, it is possible to find the optimal solution from the same $M^2_c$. Thus, a total of $\sum_{q=k}^{p-1}\binom{p}{q}$ analyses were performed on all subsets, characterizing a new procedure referred to as the exhaustive algorithm, which certainly demands greater computational effort than Criterion 1.
From $M^2_c$, this procedure provided a series of subsets of traits with $M^2$ values lower than $M^2_c$. However, as in Criterion 1, the optimal selection was the one whose estimated $M^2$ was closest to, and not greater than, $M^2_c$.
Jolliffe's criterion: Traits discarded by the principal component technique
The trait subsets obtained by Criteria 1 and 2 were compared with the subset obtained according to Jolliffe's criterion (1972), which removes the traits with the greatest weights in the least important components (those with the smallest variance). Considering standardized traits for genetic diversity studies, Cruz et al. (2011) recommend that the number of traits to discard should be equal to the number of components with eigenvalues less than 0.7.
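Criteria 1 and 2 as described above can be sketched in a few lines. This is one possible reading rather than the authors' implementation: all function names are hypothetical, the data are simulated with only p = 6 traits so the exhaustive search stays small, and the bounded statistic is computed from the first k = 2 principal component scores of standardized data.

```python
from itertools import combinations
import numpy as np

def m2_subset(X, cols, k=2):
    """Bounded Procrustes M^2 between k-PC scores of all traits and of cols."""
    def unit_scores(A):
        A = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        S = A @ Vt[:k].T                  # PC scores (already centered)
        return S / np.linalg.norm(S)      # unit total sum of squares
    Y, Z = unit_scores(X), unit_scores(X[:, list(cols)])
    s = np.linalg.svd(Y.T @ Z, compute_uv=False)
    return max(0.0, 1.0 - s.sum() ** 2)

def backward_sequence(X, k=2):
    """Backward elimination without a stopping rule: one (M^2, subset) per step."""
    kept, steps = list(range(X.shape[1])), []
    while len(kept) > k:
        # Drop the trait whose removal gives the smallest M^2
        m2, drop = min((m2_subset(X, [c for c in kept if c != d], k), d)
                       for d in kept)
        kept = [c for c in kept if c != drop]
        steps.append((m2, tuple(kept)))
    return steps

def criterion1(X, m2_crit=0.1, k=2):
    """Subset of the backward sequence with M^2 closest to, and <=, m2_crit."""
    ok = [step for step in backward_sequence(X, k) if step[0] <= m2_crit]
    # Assumption: keep all traits if no step satisfies the cutoff
    return max(ok) if ok else (0.0, tuple(range(X.shape[1])))

def criterion2(X, m2_crit=0.1, k=2):
    """All subsets with k..p-1 traits and M^2 <= m2_crit, plus the closest one."""
    p = X.shape[1]
    sols = [(m2_subset(X, c, k), c)
            for q in range(k, p) for c in combinations(range(p), q)]
    sols = [s for s in sols if s[0] <= m2_crit]
    return sols, (max(sols) if sols else None)

# Simulated data with p = 6 traits (hypothetical; not the coffee dataset)
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(40, 6))

m2_back, subset_back = criterion1(X)
solutions, best = criterion2(X)
print(subset_back, round(m2_back, 4))
print(len(solutions), best)
```

The backward search returns a single subset, whereas the exhaustive search returns every admissible subset, which lets a breeder pick one that contains traits of particular interest.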
To prevent traits with greater variance from dominating the grouping result, the data are commonly standardized before PCA, since the principal components are obtained from the covariance matrix. Here, each value was standardized by subtracting the mean and dividing by the standard deviation of its respective variable. After standardization, the principal components are obtained from the covariance matrix of the standardized data, which, according to Mingoti (2005), is the same as obtaining the principal components from the correlation matrix (R) of the original dataset; that is, the covariance matrix of the standardized data equals R. All analyses were performed in GENES software version 2016 (Cruz, 2013; 2016). GENES software is available at http://www.ufv.br/dbg/genes/genes.htm.
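The standardization step and Jolliffe's discarding rule can be illustrated with a short sketch. The function name `jolliffe_discard` and the simulated data are assumptions made for this example, and skipping a trait that was already discarded when it dominates a second component is one common reading of the rule, not something stated in the text above.

```python
import numpy as np

def jolliffe_discard(X, threshold=0.7):
    """Indices of traits to discard, visiting components from the smallest eigenvalue."""
    X = np.asarray(X, dtype=float)
    Zs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    # Covariance of the standardized data equals the correlation matrix
    assert np.allclose(np.cov(Zs, rowvar=False), R)
    eigval, eigvec = np.linalg.eigh(R)          # eigenvalues in ascending order
    n_discard = int((eigval < threshold).sum())
    discarded = []
    for j in range(n_discard):
        weights = np.abs(eigvec[:, j])
        if discarded:
            weights[discarded] = -1.0           # assumption: skip traits already discarded
        discarded.append(int(weights.argmax()))
    return discarded

# Simulated data: the last column nearly duplicates the first (hypothetical)
rng = np.random.default_rng(1)
base = rng.normal(size=(40, 4))
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=40)])
print(jolliffe_discard(X))
```

In this example the near-duplicate pair produces an eigenvalue close to zero, so one member of the pair is the first trait to be discarded.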

Results and discussion
With a critical value of $M^2$ = 0.1, subsets with six, eight, and seven Conilon coffee traits were obtained according to Criterion 1 (backward algorithm), Criterion 2 (exhaustive algorithm), and Jolliffe's criterion, respectively (Table 1). Additionally, the subsets given by the optimal selections of Criteria 1 and 2 had traits in common, such as MG%, S15, and MS. The importance of these traits to the variability of the Conilon coffee accessions is shown by the fact that the subsets selected by both criteria increased the total variance explained by the first two principal components.

Table 1.
Criterion 2. Selected traits: DCA, ChBe, CoBe, MG%, GM%, S15, S11, and MS; $M^2$: 0.1; VTa%†: 56.82; discarded traits: D, GY, PH, ChCo, CG%, FG%, S17, and S13; number of solutions: 1 in 9,841.
Jolliffe. Selected traits: D, DCA, ChCo, CoBe, CG%, FG%, and S17; $M^2$: 0.3359; VTa%†: 51.50; discarded traits: S13, MS, MG%, ChBe, S11, GY, PH, S15, and GM%; number of solutions: only one.
†VTa%: Cumulative percentage of the total variation explained by the first two principal components.
The 2D graphical dispersion of the Conilon coffee accessions considering all 16 traits (Figure 3a) represents the original data configuration and explains 49.35% of the total variance. Although the 40 accessions could not be grouped into clusters, the graphical dispersion was considered useful for making inferences about the diversity of the Conilon coffee accessions in this study. Figure 3a shows that accessions 13 and 8 are divergent, and depending on their per se potential, they can be used in a cross to explore vigor and increase variability.
Figure 3b shows the graphical dispersion of the accession scores on the first two components for the optimal selection resulting from Criterion 1. Note the change in the position of the accessions, which were reflected around the origin of component 2 in relation to the original configuration in Figure 3a (accession scores that were previously negative are now positive). To make the matching between these configurations feasible, following the steps described in material and methods and illustrated in Figure 2, the Procrustes analysis fitted the configuration of Figure 3b to that of Figure 3a such that the distance between them was minimal. After the Procrustes transformation of the optimal selection, it was then possible to calculate its true difference in relation to the original dataset, estimated by means of $M^2$. Accession 8, which like accession 13 was previously divergent, was now in the same group of genotypes as accession 14 (Figure 3c). Thus, the optimal selection with six traits did not satisfactorily represent the diversity pattern of the dispersion given by the original dataset (Figure 3a). The change in the clustering pattern of the accessions can be better visualized by superimposing graphs a and c (Figure 3d).
It is worth pointing out that even if the optimal selection included the characteristics of interest, it was not adequate for evaluating the diversity of Conilon coffee accessions.
Criterion 1 provided a sequence of traits whose exclusion at each step of the backward algorithm gave the lowest estimate of the $M^2$ value (Table 2). Note that the estimated value increases as traits are discarded at each step of the algorithm. This was expected, as discarding variables increases the residual produced by the loss of information relative to the original dataset. As the six-trait subset resulting from Criterion 1 did not satisfactorily represent the structure of the accessions' diversity, the researcher can choose a different critical value (higher or lower than the one used). In this way, more or fewer traits are considered in a re-evaluation of the clustering pattern among accessions, according to how much loss of information the researcher can tolerate. If the subsequent values are inappropriate, the ...
Table 2 legend. CoBe: Coconut and green coffee relationship; S11: Percentage of grains retained on the sieve mesh size #11; D: Number of days; PH: Plant height (cm); S13: Percentage of grains retained on the sieve mesh size #13; FG%: "Flat" grain percentage; ChBe: Cherry and green coffee relationship; DCA: Diameter crown average (cm); GM%: Grain moisture percentage; CG%: Coarse grain percentage; MS: Medium strainer (medium grain size); GY: Grain yield (kg ha-1); ChCo: Cherry and coconut dry coffee relationship; S15: Percentage of grains retained on the sieve mesh size #15; MG%: "Mocha" grain percentage; and S17: Percentage of grains retained on the sieve mesh size #17.
Figure 4b shows that the accessions given by the optimal selection of Criterion 2 were reflected, as in Criterion 1, but now in relation to the origin of components 1 and 2 simultaneously. According to the dispersion of the accessions presented by the optimal selection resulting from Criterion 2, no change in the clustering pattern (Figure 4d) was observed.
Therefore, the optimal selection (Figure 4c) provided a global dispersion satisfactorily close to the dispersion of the original dataset (Figure 4a). Criterion 2 provided a total of 9,841 combinations (subsets) with $M^2$ values lower than the critical value (Table 3), which include the optimal selection resulting from Criterion 1. If the optimal selection of Criterion 2 does not satisfy the breeder's purposes, it is possible to evaluate the diversity of other subsets with more or fewer traits.
From the data presented in Table 4, it is possible to identify the relative importance of the traits for the genetic diversity of the Conilon coffee accessions, on which the discarding must be based. According to the criterion presented by Jolliffe (1972) and suggested by Cruz et al. (2011), from the last to the ninth principal component, the traits of greatest weights were S13, MS, MG%, ChBe, S11, GY, PH, S15, and GM%. Accordingly, the optimal selection was given by the subset of traits D, DCA, ChCo, CoBe, CG%, FG%, and S17.
Table 4. Eigenvalue estimates from the correlation matrix, containing 16 traits and associated eigenvectors (components). Column headers: VT%, VTa%, and the traits D (days), GY (kg/ha), PH (cm), DCA (cm), ChCo, ChBe, CoBe, CG%, FG%, MG%, GM%, S17 (%), S15 (%), S13 (%), S11 (%).
Figure 5b shows the dispersion of the accession scores on the first two principal components for the subset of seven traits established by Jolliffe's criterion. As in the previous cases, the positions of the accessions changed because of the exclusion of some traits, and the accessions were reflected around the origin of components 1 and 2. After the Procrustes transformation of the optimal selection, its real difference in relation to the original set was estimated by $M^2$. The estimated magnitude of $M^2$ reflected the lack of proximity between the coffee accessions in the corresponding configurations (Figure 5d), which revealed a significant change in the clustering pattern of the accessions.
This difference can be seen in accession 19, which was placed in the group of accession 13 after the transformation, as well as in accessions 16, 17, and 35, all of which belonged to the group of accession 8 after the transformation (Figure 5c).
Once the researcher knows which traits are of greater biological importance for the characteristic to be improved, prioritizing them can save time and financial resources, making breeding programs more sustainable. Thus, if the breeder is interested in a specific subset, its diversity can be evaluated graphically and its estimated $M^2$ value compared with that obtained by the optimal selection of the exhaustive or backward algorithm, as a way of discovering how the magnitude of $M^2$ affects the dispersion of its group of accessions. According to the results, the optimal selection given by Criterion 1 provided the lowest estimated $M^2$ value and the smallest number of traits. However, it did not adequately represent the Conilon coffee diversity obtained by the PCA of the set with all 16 traits (Figure 5a). In contrast, the subset selected by Criterion 2, despite having a greater number of traits, satisfactorily represented the diversity among the accessions. Note that the subset selected by Jolliffe's criterion provided a high estimated $M^2$ value, more than three times the critical value, revealing a change in the cluster pattern of the accessions and making this criterion relatively less efficient than the others.
Acta Scientiarum. Agronomy, v. 42, e43195, 2020
Based on the Procrustes analysis, the number of solutions of each criterion should be taken into account. Criterion 1 and Jolliffe's criterion each provided only one optimal selection, while Criterion 2 provided all subsets of traits with an estimated $M^2$ value below the critical value (Table 3). This opens a range of possibilities for the researcher's decision-making, since the critical value and the backward algorithm may leave out some variables that are biologically important to the genetic improvement of the crop. Additionally, the subset selected by Criterion 1 may not reveal a graphical scatter equivalent to that obtained by the analysis of the original set.
We must also pay attention to the process of obtaining solutions. Unlike Criterion 1, which excludes one variable at a time in each step of the backward algorithm, Criterion 2 uses the exhaustive algorithm, which evaluates all possibilities of discarding traits: one by one, two by two, and so on. The stepwise algorithm, which is useful for selecting traits in linear regression models, differs from the method of Criterion 2 because it establishes the importance of traits by a different decision rule and excludes or includes traits iteratively.
The total number of analyses performed by the exhaustive algorithm can be verified according to the number of traits studied (Table 5). Note that as the number of traits increases, the number of analyses performed by Criterion 2 increases considerably. Thus, Criterion 2 becomes unattractive for high-order data matrices, whose handling involves a high computational cost and whose processing may take months. The researcher must therefore weigh its use against the computational resources and time available, since there are currently no studies that establish the computational cost of this algorithm in relation to the number of traits. Notice that the statistic used in this work ranges from 0 to 1, and the critical value can be interpreted as the acceptable percentage loss of information resulting from the selected subset of traits. Thus, the researcher must consider that even if a relatively small loss is established, the resulting subset of traits may or may not satisfactorily represent the genetic diversity of the original dataset. This occurs because the breeder's view of the biological importance of a variable may differ from its statistical significance. Therefore, the optimal selection must include all traits that are important to the breeder and represent the level of diversity in the original dataset.
Another interesting aspect of the criteria based on Procrustes analysis concerns the critical $M^2$ value suggested in this study. It is worth mentioning that in Criteria 1 and 2, the critical value can be slightly relaxed according to the number of traits that the breeder wishes to discard. In this sense, we suggest an interval from a minimum of 0.05 to a maximum of 0.15, as long as the clustering pattern of the accessions of the crop is maintained. These limits do not constitute a rule, since no other studies discard traits using these specific limits for genetic diversity, and therefore the researcher must decide on them. However, it is worth remembering that the Procrustes statistic adopted in this work ranges from 0 to 1, and the critical value was selected assuming that the residual produced by the loss of information with the resulting subset of traits would be 10% at most.
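The growth in the number of analyses can be checked with a short calculation. The sketch below assumes the counting rule stated for Criterion 2 (all subsets with k, k+1, ..., p-1 traits); the function name `n_exhaustive_analyses` is hypothetical, and the counts may differ from Table 5 if the table follows another convention. Under this rule, p = 16 traits and k = 2 give 65,518 subsets.

```python
from math import comb

def n_exhaustive_analyses(p, k=2):
    """Number of subsets with k, k+1, ..., p-1 traits out of p."""
    return sum(comb(p, q) for q in range(k, p))

# The count roughly doubles with each additional trait
for p in (8, 12, 16, 20, 30):
    print(p, n_exhaustive_analyses(p))
```

Each subset requires its own PCA and Procrustes fit, so the running time grows in proportion to these counts.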
Based on the strategy proposed by Krzanowski (1987) and the Krzanowski (1996) backward algorithm, Munita, Barroso, and Oliveira (2013) found that a subset of only eight traits was sufficient to interpret the data on two axes (k = 2 principal components), which explained 76.6% of the total variation, without substantial loss of information. The dataset comprised the concentrations of 13 chemical elements (traits) obtained by neutron activation in a set of 40 samples of ceramic fragments, whose first two components explained 79.9% of the total variation. Guedes and Ivanqui (1998) obtained similar results in a medical study with simulated data on 14 traits related to liver cancer, whose first two principal components explained 93.61% of the total variation. Based on the backward algorithm without a stopping rule, Procrustes analysis established a subset of 8 traits with a configuration similar to that of the original set, represented on two axes that explained 93.66% of the data variation.
The results obtained in this study also showed that even when the first two principal components explain only a limited share of the total variation of the data, it was possible to obtain a satisfactory representation of the accessions' diversity on two axes from the optimal selection obtained by Procrustes analysis. This confirmed the importance of the contribution of the proposed criteria and of the technique presented for the selection of traits in the study of genetic diversity. Finally, the exhaustive procedure stands out for the number of solutions it provides, which suggests enormous potential for genetic studies.
Procrustes analysis has wide applicability and offers interesting approaches. In plant breeding, there is currently no literature using Procrustes analysis to select phenotypic traits, which further highlights the relevance of this study for genetic improvement. Although Procrustes analysis has been minimally explored in plant breeding, García-Peña and Dias (2009) used it to compare different uni- and multivariate analysis techniques under the AMMI methodology in the study of genotype × environment interaction. The joint use of Procrustes analysis and PCA has enormous potential, and its application in genetic improvement extends beyond the selection of variables, including the possibility of evaluating, through graphical dispersion, the genetic diversity that is important for a breeding program.
This study provides the breeder with a technique based on Procrustes analysis to assist in decision-making regarding the exclusion of redundant characters. In practical terms, character exclusion can reduce possible measurement errors as well as experiment time and costs, since the excluded character may be expensive or difficult to measure. Technically, Procrustes analysis in diversity studies allows the grouping pattern of accessions to be visualized after variables are discarded. This allows the breeder to graphically evaluate the selected subset of traits, whether it was chosen by an automated selection method or by the breeders themselves. In addition, it allows the loss of information of a reduced subset of selected traits, relative to the set of all traits, to be quantified through the $M^2$ statistic.

Conclusion
The flexibility in selecting traits by the researcher, graphical visualization, and the Procrustes $M^2$ statistic through Criteria 1 and 2 constitute a fast and reliable alternative for deciding which traits to phenotype in studies of Conilon coffee diversity as well as of other crops. Procrustes analysis is advantageous in selecting traits and provides a relevant contribution to genetic diversity studies as an efficient alternative to Jolliffe's criterion.