|
|
||||||||
,2

* Institut National de la Recherche Agronomique, UR337, Station de Génétique Quantitative et Appliquée, F-78352 Jouyen-Josas, France
Institute of Agricultural and Nutritional Sciences, Martin-Luther University, Halle-Wittenberg, 06099 Halle, Germany
Institute of Scientific Computing, Technical University Dresden, 01062 Dresden, Germany
1 Corresponding author: helene.leclerc{at}jouy.inra.fr
| ABSTRACT |
|---|
|
|
|---|
Key Words: genetic evaluation validation best linear unbiased predictor random regression
| INTRODUCTION |
|---|
|
|
|---|
Making use of a simulation tool is not the only option to validate new genetic evaluation software. A simpler alternative may be preferred. For instance, the direct inversion of mixed-model equations (MME) for a small data example could verify that the iterative EBV match those obtained from direct inversion. Nevertheless, when the genetic evaluation model is more sophisticated, this does not guarantee that the MME are properly set up, and numerical problems are often detected only on larger data sets. Another option is to compare results from the new software with the ones obtained from reference software. This method is widely used when it is technically possible.
In actuality, a large number of BLUP software programs have been developed worldwide, fulfilling specific needs. For instance, in France, BLUP software (GeneKit, V. Ducrocq, personal communication) was developed to deal with test-day models (TDM). Indeed, national genetic evaluation models for dairy traits are increasingly based on TDM instead of 305-d lactation models. A large variety of models have been proposed, differing in 1) how the lactation curve is modeled as a function of DIM [with fixed classes, parametric curves, or semiparametric (spline) curves; White et al., 1999], 2) how the genetic and permanent environmental components are described (fixed or random regression using Legendre or other polynomials), and 3) how heterogeneous residual variances are accounted for. Unfortunately, no general evaluation software including all possible models is available for TDM with very large data sets. Therefore, countries have developed custom software to perform routine genetic evaluations for their own population using TDM. The lack of reference software makes the software validation step complex. Furthermore, the routine genetic evaluation software usually relies on iterative solving algorithms, which makes it even more difficult to debug them and makes the results complicated to verify. For TDM needs, extension from R. Thompsons strategy to random regression situations is not straightforward.
The objective of this paper was to present a general and flexible strategy that could be used to validate the correctness of newly developed genetic evaluation software. Consistent phenotypic data were generated in such a way that BLUP estimates from correct evaluation software were mathematically equal to the simulated effects. This methodology can be considered a helpful tool in the development and further refinement of BLUP software.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Estimation Method
The following multiple-trait linear model y = Xb + Za + e is considered to describe the derivation of constraints enforcing the simulated performance to fulfill the required properties. y is the vector of observations; b is a vector of fixed effects; a is the vector of breeding values following a normal distribution, with E[a] = 0 and Var[a] = G = G0
A; and e is the vector of random residuals following a normal distribution, with E[e] = 0 and var[e] = R = R0
I. X and Z are matrices relating y to the appropriate fixed and genetic effects. G0 is the covariance matrix for the genetic effects, and A is the additive genetic relationship matrix.
The MME corresponding to this model are
![]() | [1] |
The MME are analogous to the normal equation in the standard linear model. Indeed, MME were initially obtained by maximizing the posterior distribution of a given y (Henderson et al., 1959). System [1] can also be rewritten as F'Fx = F'c, with
, and
, and
, where M
is, for a symmetric general positive definite matrix M, the unique positive definite matrix B, with BB = M. System [1] can then be viewed as "normal equations" (Searle, 1971), giving the least squares solutions of the overdetermined system Fx = c. The least squares solutions are obtained by minimizing the squared Euclidean norm of the residual ||Fx c – 2||22:
![]() | [2] |
To minimize [2], its derivative with respect to x is set to zero, that is, 2F'Fx – 2F'c = 0. After simplification, the minimizing vector x is a solution of the normal equation F'Fx = F'c, which can be written as F'(c –Fx) = 0.
Therefore, to simulate the data y = Xb + Za + e such that F'(c – Fx) = 0 whatever the values of b, we must have
, which leads to the 2 following conditions: X'R–1e = 0 and Z'R–1e = G–1a.
Application to a Random Regression TDM with Unknown Parent Groups
The previous model can be extended to
![]() |
where y is the vector of observations; b is a vector of fixed effects; g is the vector of genetic group effects (e.g., phantom parents groups); a* is the vector of breeding values corrected for genetic group effects, that is, with E[a*] = 0 or, alternatively, E[a] = Qg, with a = Qg + a* and Var[a*] = G = G0
A; p is the vector of permanent environmental effects, with E[p] = 0 and Var[p] = P = P0
I; e is the vector of random residuals, with E[e] = 0 and var(e) = R; X, Z, and W are matrices relating y to the appropriate fixed, genetic, and permanent environmental effects, possibly through continuous covariates; and Q is the matrix assigning animals in a* to groups in g. In the case of random regressions, G0 and P0 are the covariance matrices for the genetic and permanent environmental effects, respectively, and A is the additive genetic relationship matrix. R is a diagonal matrix of residual variances. In the case of heterogeneous variances, for example, as a function of DIM (e.g., Druet et al., 2003), the diagonal terms of R vary from one record to the next. If heterogeneous variances are estimated simultaneously with the other effects during iterations, the validation test is no longer applicable because solutions are then nonlinear functions of the data. However, the validation test can still be used to validate part of the software if the updating of heterogeneous variance estimates can be bypassed.
This model is an extension of the previous one, with X and b replaced by [X ZQ] and
, respectively, and Z and a replaced by [Z W] and
, respectively. In this case, the MME are
![]() | [3] |
and the residual conditions derived above become X'R–1e = 0, Q'Z'R–1e = 0, G–1a* = Z'R–1e, and P–1p = W'R–1e. Most models used in animal breeding, such as the maternal model, dominance, and the epistatic model or gametic model, can be easily implemented.
Fulfilling of Residual Conditions
To fulfill the first 2 residual conditions, a working variable
is first simulated for all observations with any underlying distribution, for example,
i
N(0, 
2) with variance
2. Two vectors, β and
, are then computed by using any least squares solver, for example, the one included in the software package FSPAK90 (Misztal and Perez-Enciso, 1998), as solutions for
![]() |
If we choose e =
(
– Xb – ZQ
), one can check that X'R–1e = 0 and Q'Z'R–1e = 0. Here,
is a scaling factor that can take any value. Then b and g can be simulated by using any underlying distribution because they do not influence e.
The next step is to derive a* such that G–1a* = (G0–1
A–1)a* = (I
A–1)(G0–1
I)a* = Z'R–1e. Define u = (G0–1
I)a* so we need a* in such a way that (I
A–1)u = Z'R–1e. For example, if the random regression model includes 3 genetic terms (3 genetic values j per animal i, ai1*, ai2*, and ai3*), u = {uj} also includes 3 terms, ui1, ui2, and ui3, for each animal i, which are linear combinations of the terms ai1*, ai2*, and ai3*. For each j, we need to solve A–1uj = Z'R–1ej. This is easily done by using the decomposition A–1 = T–1D–1T–T (Quaas, 1976). T–1 is a lower triangular matrix with at most 2 nonzero off-diagonal elements: the ith diagonal element is equal to 1 and the columns corresponding to the sire and dam of animal i are equal to –
. D is the diagonal matrix of the proportion of the genetic variance corresponding to the Mendelian sampling terms. First, we must solve (T–1D–1)vj = Z'R–1ej for vj and then T–Tuj = vj. These 2 triangular systems are very simple. Finally, we can compute a* = G0
u [or for animal i ([ai,1* ai,2* ai,3*] ' = G0[ui,1 ui,2 ui,3]')].
We obtain p enforcing P–1p = (P0–1
I)p = W'R–1e by choosing p= (P0
I)W'R–1e. This reduces to the product of P0 by the terms in W'R–1e corresponding to animal i. Finally, y is constructed as y = Xb + ZQg + Za* + Wp + e.
Application to a TDM
As an illustration, this strategy was applied to a real data set with 47,492 first lactations of Montbéliarde cows from a sample of herds in Jura (a region of France), that is, 377,080 test-day observations recorded between September 1995 and August 2005 in 28,020 herd test-day (HTD) combinations. The average number of test-day records per lactation was 8, with a minimum of 5. On average, 115 test-day records were known per herd and per year, with a minimum of 50 and a maximum of 431. A total of 118 classes of calving month and 96 classes of calving age were considered. Days in milk had to be between 5 and 335. Information about gestation (days carried calf, DCC) was also available. The pedigree file included 102,339 animals with 8 groups of unknown parents. The approach described above was tested on the following model. For animal i, test-day j, and DCC k, the performance recorded after t days of lactation is
![]() |
where yijkrs(t) is the performance recorded, HTDj is the jth fixed herd by test-day effect applied to all animals in the herd on test day j, β is a regression coefficient related to the effect of the calf after k days of gestation on performance, brl and csp are fixed regression coefficients specific to calving month r and calving age class s,
im are the genetic values for cow i, and pin are the permanent environmental effects for cow i.
l(t),
p(t),
m(t), and
n(t) are continuous covariates depending on DIM. For this model, random regression curves and fixed regression curves for calving age were modeled by using Legendre polynomials of order 2 (Nc = Na = Np = 3). Fixed regression curves for calving month were modeled by using 6 knot splines (White et al., 1999) at DIM 5, 20, 50, 135, 245, and 335. eijkrs(t) is the residual effect for each observation, with a heterogeneous variance continuously changing with DIM.
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
In contrast with other simulation approaches (R. Thompson, Rothamsted Research, Harpenden, UK; personal communication; Täubert et al., 2002; Wensch-Dorendorf et al., 2005), the proposed strategy to generate breeding values and performances can be applied to actual data sets by simply replacing performance values with simulated ones. Its main specificity is that appropriate residuals are created when the other approaches do not simulate any residual. This is more complex but also more realistic: for example, R. Thompsons (Rothamsted Research, Harpenden, UK; personal communication) approach requires animals from the base generation to have "virtual" records. Such a requirement is not imposed here.
Moreover, the method could help to investigate the relevance of different iterative algorithms and convergence criteria as long as the simulated effects are of a magnitude similar to the real effects. This can be adjusted by choosing an appropriate scaling factor
. Misztal et al. (1988) showed that the real accuracy of solutions could be far from the one suggested by some convergence criteria. Here, true solutions of the system are directly available. The relative average difference between current and true solutions, as proposed by Misztal et al. (1988), can be computed at each iteration for a particular data set. Usually, the convergence of genetic evaluation software is checked by comparing solutions obtained from a limited number of iterations with the ones resulting from an increased number of iterations. This method is easy to implement because it does not require modification of the original software, but it does not allow one to study the relevance of the convergence criteria nor the correctness of the converged solutions.
The main target for the proposed approach may be researchers implementing new genetic evaluation software to be used for sophisticated models for which no reference software exists. The development of a specific simulation tool to generate genetic and nongenetic effects as well as data under the most frequent models used in countries participating in international evaluation can be envisioned but should be done once for all countries. Covering a large number of potential models is a difficult task but is probably easier to consider than creating an all-purpose reference software. Such a tool should present 2 main advantages: it would enable any Interbull participating country to check the correctness of its national genetic evaluation software and to verify that the number of iteration rounds or the convergence criteria are adequate.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| FOOTNOTES |
|---|
Received for publication December 21, 2007. Accepted for publication March 26, 2008.
| REFERENCES |
|---|
|
|
|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |