Analyzing weight evolution in mice infected by Trypanosoma cruzi

Concerning the specificities of a longitudinal study, the trajectories of a subject's mean responses not always present a linear behavior, which calls for tools that take into account the nonlinearity of individual trajectories and that describe them towards associating possible random effects with each individual. Generalized additive mixed models (GAMMs) have come to solve this problem, since, in this class of models, it is possible to assign specific random effects to individuals, in addition to rewriting the linear term by summing unknown smooth functions, not parametrically specified, then using the P-splines smoothing technique. Thus, this article aims to introduce this methodology applied to a dataset referring to an experiment involving 57 Swiss mice infected by Trypanosoma cruzi, which had their weights monitored for 12 weeks. The analyses showed significant differences in the weight trajectory of the individuals by treatment group; besides, the assumptions required to validate the model were met. Therefore, it is possible to conclude that this methodology is satisfactory in modeling data of longitudinal sort, because, with this approach, in addition to the possibility of including fixed and random effects, these models allow adding complex correlation structures to residuals.


Introduction
Regression analysis is a statistical technique that allows investigating and modeling the relationship between two or more variables of interest, with the regression being linear if one considers that the relationship of the response variable (dependent variable) with the regressor variables (independent variables) is a linear function of some parameters, and non-linear if at least one of the partial derivatives in relation to the parameter is a function of parameters that are not known and that cannot be linearized by means of transformations. Paulson (2006) reports that the literature contains methodologies focused on non-linear modeling, such as Gompertz, Von Bertalanffy, Brody, and others; according to Paro de Paz et al. (2004), they can be applied to describing growth curves, besides being parameters of easy biological interpretation. Moreover, segmented regressions, called spline functions, can be used as well.
According to Keele (2008), splines are segmented polynomial functions united by cutoff points known as "nodes", which are used for fitting a curve in a dataset. Spline functions have been widely used in the analysis of longitudinal data -for instance, in the fitting of lactation and growth curves, as seen in Caldeira (2011), Pinheiro et al. (2013), Domingues, Gonçalves, Braz, & Pereira (2014), and Santos et al. (2015). The literature provides several techniques to represent this fitting by parts, such as linear spline, cubic spline, Bspline, etc.
Authors Boor (1978), Eilers and Marx (1996), and Keele (2008) define that a B-spline can be expressed as a set of polynomial functions of the m order for each node-limited interval; also, the connections in the nodes are smoothed, which does not happen in linear splines, thus becoming one single continuous curve. In order to make the model more flexible and prevent a possible "overfitting", Eilers and Marx (1996) proposed a new technique, which consists of combining B-splines with discrete penalty (to control the smoothness of the fit) in the log-likelihood function, and named it P-spline. Because it is not always possible to describe a phenomenon through a linear regression model, new classes of regression models have emerged, such as generalized linear models, generalized additive models, etc. When it comes to non-linear modeling techniques, Generalized Additive Mixed Models stand out; in them, the linear term is replaced by unknown smooth functions, giving more flexibility to the hypothesis of a linear relationship between the response variable and the regressor variables on the scale of a link function. Currie and Durbán (2002) highlight that this class of models allows estimating individualdifference curves using non-parametric smoothing functions, considering the presence of random effects associated with subjects assigned to treatment groups.
Thus, this research aims to use smoothing via P-spline associated with a generalized additive mixed model in order to model the weight evolution of 57 male Swiss mice, with 56 days of age, subjected to homeopathic treatments with different dilutions and observed for 12 weeks, all infected by Trypanosoma cruzi and sourced from the State University of Maringá's central vivarium.

Experiment description
The dataset for this research was collected by Ferreira et al. (2018) with the aim of assessing the biotherapeutic effect of chicken serum on rodents experimentally infected by Trypanosoma cruzi, considering clinical, immunological and parasitological parameters. The project for the execution of the experiments was approved by the Ethics Committee on Research Involving Animals of the State University of Maringá, Paraná, Brazil, under legal opinion CEUA 2401220716/2016.
Based on Ferreira et al. (2018), the experiments were run through blind, controlled and randomized assays. They used 57 male Swiss mice, aged 56 days, sourced from the central vivarium of the State University of Maringá. The mice were divided into treatment groups in a way that the initial mean weight of the animals in each group was approximately equal, and kept in cages with a maximum of 5 individuals; all groups were maintained under the same experimental conditions. With n being the number of individuals per group, and g the number of cages, the experimental groups (treatments) were characterized as follows:  G1: (Non-infected control). Non-infected and non-treated animals (n=5)(g=1)  G2: Animals treated with biotherapeutic chicken serum with 13cH dilution (n=13) 1  G3: Animals treated with biotherapeutic chicken serum with 6cH dilution (n=13)(g=3)  G4: (Infection control). The animals were infected and received no treatment (n=13)(g=3).  G5: Animals treated with biotherapeutic chicken serum with 3cH dilution (n=13)(g=4). Their body weight was monitored over 12 weeks, measured on a semi-analytical scale (Balance BEL®)

P-splines
According to Currie and Durbán (2002), a P-spline is a combination of B-splines and a difference penalty function of the order associated with the estimated coefficients of the B-splines, and the use of this tool helps reduce the flexibility of the B-splines, thus preventing an overfitting of the curve.
Authors Eilers and Marx (1996) present in their article an interesting result concerning modeling through the combination of B-splines; they state that the integral of the second derivative squared can be expressed as a quadratic function in the coefficients associated with the sum. Thus, the roughness of a curve is expressed by taking a base with several B-splines associated with a penalty, as follows: in which [a,b] is the study interval, d is the order of the difference operator for , B is the regression basis function, λ is the control parameter of the smoothing, and is a function that composes B-spline sums.

Generalized additive mixed models using P-splines
Over the last decade, advancements have been observed in the application of P-splines to the context of generalized additive models. This is due to the possibility of rewriting the linear term by summing unknown smooth functions, not parametrically specified. In this sphere, it is possible to assign peculiar random effects to subjects, generating a new class, called Generalized Additive Mixed Models. Currie and Durbán (2002), Durbán, Harezlak, Wand, and Carroll (2005) and Durbán (2014) describe that the appropriate model for this situation is that which allows estimating the curves of specific differences between individuals using non-parametric smooth terms. In this model, a smoothing non-parametric gi function is incorporated, thus enabling, according to Ruppert, Wand, and Carroll (2003), an interaction between treatment groups with the continuous predictor. This model can be described as: , ( in which yij indicates the response variable, , in which denote the random effects, and is a smoothing function that explains the trajectory of each individual, and indicates the i-th sampling unit observed at the j-th instant, and the k index indicates the k-ith treatment group. Besides, the fxij function is estimated using P-splines, which can be rewritten in the context of mixed model; besides: in which the specific curves of the sampling units have parametric components and non-parametric components , with both being random.
Thus, model 3 can be rewritten in its matrixial form as: , The X matrix of fixed effects is expressed by: in which Ti is a matrix that indicates if the i-th subject receives the k-th treatment.
The β parameter vector is defined by: The Z matrix of random effects is defined by: The vector of random effects is expressed by: And the covariance matrix is written as: According to Diggle, Heagerty, Liang, and Zeger (2002), correlation structures are used for modeling covariance matrices, referring to the dependence between observations; when it comes to mixed-effect models, they are used for modeling the dependence between errors within the group. Because the variance is related to components and , one can describe the structures of G or R, or both. Littell, Pendergast, and Natarajan (2000) and Wolfinger (1993) mention that the covariance between two observations for the same individual depends on the length of the time interval between measurements, and that the variance is considered constant over time. Covariance can be described in terms of the variance, and correlations can be expressed as a function of the time interval between measurements.

Model selection
Choosing or selecting the model that best explains data is undoubtedly fundamental in any statistical analysis in the modeling context. There are numerous tests and criteria for model selection in the literature, with one of them being the Akaike information criterion (AIC). According to Akaike (1974), the idea behind this criterion is to compare models, expressed by the following equation: with l( ) being the logarithm of the likelihood function applied to the estimator of maximum likelihood of θ, ( ) and p the number of explanatory variables. Thus, the satisfactory model will the one with the lowest AIC statistical value.

Computational aspects
Software R, version 3.6.2 R Core Team (2019), was used, for both data treatment and the fitting of the models employed. The gamm function in the mgcv package by Wood (2017) was used for the modeling involving generalized additive mixed models. The estimation method selected was REML, by means of the method function, and the lme function in the nlme package by Pinheiro, Bates, Debroy, Sarkar, and R Core Team (2019) for the fitting of the mixed models. For the B-splines bases to be selected, the bs function in the splines package was used. To build the charts, the following packages were used: lattice by Sarkar (2008), lattice Extra by Sarkar and Andrews (2019), and ggplot2 by Wickham (2016).

Description of basic statistics
To conduct the statistical analyses, only the 'Treatment groups' and 'Experiment follow-up time' variables were considered, that is, the Cage variable was not treated due to the study objective. Table 1 contains summarized statistical data on the 'weight' response variable for each one of the treatment types: Observing Table 1 allows concluding that the mean of the weights in relation to each one of the treatments is 45.49 g for the 'Non-infected control' treatment group, and 41.44 g for the individuals allocated in the homeopathic treatment with 3cH dilution.
It is also possible to see a difference as to the maximum and minimum values for the weights in each one of the treatment groups; the minimum weights of the individuals allocated in the homeopathic treatment group with 3cH dilution (G5) is 31.00 g, whereas the minimum weight of the individuals in the Control treatment group (G1) is 39.20 g; as for the maximum weight, it ranges from 55.00 g, representing the 'Infection control' group of individuals (G4), to 49.20 g, representing the group under homeopathic treatment with 6cH dilution (G3). Finally, about the coefficient of variation, there is a small oscillation, with values ranging from 0.076 g and 0.106 g, approximately. Figure 1 displays the behavior of the response variable for each one of the treatments; the group under homeopathic treatment with 3cH dilution (G5) had an average weight level lower than that of the other treatment groups, while the non-infected control group had the highest average weight level compared to the other treatment groups.
Acta Scientiarum. Health Sciences, v. 42, e51437, 2020   About the profile chart, Figure 3 shows an increase behavior as of the second week as for weight over time. An oscillating behavior is seen for the weights soon in the first week, a factor that can be explained by this being the beginning of the treatment; besides, the response variable seems to behave linearly from the second to the sixth week for each subject in relation to the treatment groups as time varies.  Figure 4 shows dispersion charts by treatment; there may be some variability between these observations, justifying the possibility of using some type of smoothing that takes into account the variability found.
It is also possible to notice that the weight gain of the mice in each treatment group varies greatly between the intercept and the slope of the lines that have been fitted; this allows considering mixed models with random effects, be they independent or correlated between intercept and slope in relation to time -in this case, for the Time covariable.

Adjustment
Considering the smoothing function over time, the model considered is described by: in which means the weight of the i-th subject in the j-th week, for i = 1,...,57, j = 1,..., 12. The homeopathic treatment groups range from k = 1,..., 5. Moreover, refers to the mean curve for the individuals allocated in the Non-infected control group, refers to mean curve for the subjects that received the homeopathic treatment with 13cH dilution, refers to the mean curve for the individuals that received the treatment with 6cH dilution, refers to the mean curve for the individuals allocated in the Infection control group, and refers to the mean curve for the individuals that received the homeopathic treatment with 3cH dilution, while and refer to the random effects corresponding to the intercepts and slopes of the curves.
Also, is the unknown smoothing function that describes the trajectory of this variable in relation to Weight evolution, in which is estimated by means of P-splines, with third-polynomialdegree B-splines basis function, and a second-order discrete penalty. The variability between subjects is determined through the ai random effect and a random effect in the slope for the Time variable, in which and , assuming the dependence of the random terms. Table 2 displayed the fitted models, considering different structures of variances and covariances, defined by: a model without considering a correlation structure in the errors, and three other models using the AR(1), CS and ARMA(1,1) correlation structures. For model selection, the Akaike criterion was used. Observing Table 2 allows concluding that the model to be selected is the one with the AR(1) correlation structure, since it provides the lowest value for the AIC statistics; in addition, this structure allows improving the fit of the curves both for the independent variables and the dependent variable. Figure 5 contains the chart for the smoothed curve of the Time covariable after the determination of the correlation structure; it is possible to see that the Weight evolution presented an increase behavior until the 7 th week approximately and, after this period, it decreased until the 12 th week; as a consequence, the effect of this covariable follows a non-linear pattern.
Acta Scientiarum. Health Sciences, v. 42, e51437, 2020 Figure (6) displays the correlogram with the AR(1) structure in the residuals; it is possible to see a significant portion of the autocorrelation values contained within the significance levels, that is, there is a decrease in the issue found as to the assumption of independence in the residuals of the model without considering a correlation structure.  Table 3 contains the estimates for the λ parameter, with their respective effective degrees of freedom (e.d.f), F test, and p-value. It can be concluded that the smoothing of the Time covariable for each homeopathic treatment group was highly significant in relation to Weight evolution.  Figure 7 shows the behavior of the estimated smoothing non-parametric functions of model 11 for each treatment group; the continuous line is the estimated smoothing function, and the gray line represents the ranges with 95% confidence. Because the effect of the Time variable for each treatment group is not linear, the estimated smoothing functions are significant with 3.751, 4.763, 2.726, 6.864 and 2.000 effective degrees of freedom. Additionally, the evolution of the weights over time has a period of increase until the 7th week approximately in the groups under homeopathic treatment with 13cH dilution (Figure 7b), 6cH dilution (Figure 7c), and 3cH dilution (Figure 7e) with the passing of time; as for the Non-infected control group (Figure 7a) and the Infection control group (Figure 7d), there is a slight rise in the beginning of the treatment, with an oscillation as of the 2nd week. Figures 7b and 7d show a significant decrease, thus evidencing that the animals lost weight after said period. Figure 8 contains the specific estimated curves for the subjects allocated in their treatment groups; model 11 describe said curves, taking into account the non-linearity and the associated random effects. It is possible to observe the fitting capacity of the model using 10 nodes, which presents itself satisfactorily as to interpolation.
Thus, with the model defined in 11, it is possible to describe the mean trajectories by treatment group and verify the objective proposed in the beginning of this application, in which Figure 9 presents the estimated curves by treatment group:  Observing Figure 9, it is possible to verify that the animals allocated in the control group had, during a significant portion of the study, an average weight superior to that of the other treatment groups. When it comes to the group under homeopathic treatment with 13cH dilution, none of the animals reached the end of the study and, in said group, the greatest dilution of the biotherapeutic serum was used for treating the Chagas disease. Finally, one can further observe that, on average, the subjects allocated in the groups under homeopathic treatment with 13cH, 6cH and 3cH dilutions and the Infection control group had a decrease in weight as of the 7 th week. Acta Scientiarum. Health Sciences, v. 42, e51437, 2020

Conclusion
Using the P-splines smoothing technique in the context of mixed models for analyzing longitudinal data provides researchers with greater flexibility in terms of modeling, since this methodology allows describing the individual trajectories of a certain situation as a function of time.
Furthermore, the introduction of random effects is taken into account for each one of the estimated curves by treatment, through the smoothing non-parametric function, thus supplying information on the average behavior of a certain object classified into observation groups.
As for the result, it is possible to conclude that the weight, until the 6 th week, of the animals treated with a dilution level of 13cH of biotherapeutic serum is superior to that of the other treatment groups. Concerning the animals treated with a dilution level of 3cH of biotherapeutic serum, they reached the end of the study, with average weights greater than those of the animals treated with a dilution level of 6cH of biotherapeutic serum.