JDS
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


J. Dairy Sci. 2008. 91:3179-3183. doi:10.3168/jds.2007-0972
© 2008 American Dairy Science Association ®

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Interpretive Summary
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Leclerc, H.
Right arrow Articles by Swalve, H. H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leclerc, H.
Right arrow Articles by Swalve, H. H.

A General Method to Validate Breeding Value Prediction Software

H. Leclerc*,1,2, M. Wensch-Dorendorf{dagger},2, J. Wensch{ddagger}, V. Ducrocq* and H. H. Swalve{dagger}

* Institut National de la Recherche Agronomique, UR337, Station de Génétique Quantitative et Appliquée, F-78352 Jouyen-Josas, France
{dagger} Institute of Agricultural and Nutritional Sciences, Martin-Luther University, Halle-Wittenberg, 06099 Halle, Germany
{ddagger} Institute of Scientific Computing, Technical University Dresden, 01062 Dresden, Germany

1 Corresponding author: helene.leclerc{at}jouy.inra.fr


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The validity of national genetic evaluations depends on the quality of input data, on the model of analysis, and on the correctness of genetic evaluation software. A general strategy was developed to validate national breeding value prediction software: performances from a real data file were replaced with simulated ones, created from simulated fixed and random effects and residuals in such a way that BLUP estimates from the evaluation software must be equal to the simulated effects. This approach was implemented for a multiple-trait model and a random regression test-day model. An example was presented on test-day observations analyzed with a random regression animal model including a lactation curve described as a sum of fixed polynomial regression and fixed spline regression on days in milk, and with genetic and permanent environmental effects modeled by using Legendre polynomials of order 2. Residuals had heterogeneous variances, and phantom parent groups were included. This method can be easily extended to other linear models. The comparison of genetic evaluation results with simulated true effects is used to demonstrate the great efficiency and usefulness of the proposed method.

Key Words: genetic evaluation • validation • best linear unbiased predictor • random regression


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
At the national level, the reliability of genetic evaluation depends on a large number of factors, such as 1) the quality of raw data, 2) the data edits, 3) the correctness of the evaluation model and of the software used, and 4) the postprocessing steps. A large number of tests are done at the national level by genetic evaluation centers to ensure the quality of their EBV. For instance, the results from 2 consecutive evaluations are systematically compared. More and more countries, such as France, Germany, or the Netherlands, have developed or are developing a quality management system based on ISO 9001 standards. At the worldwide level, Interbull has been providing international predicted breeding values of dairy bulls since 1995 on the scale of each participating country by using national genetic evaluation results. Input data quality is a crucial issue in international genetic evaluations, because the validity of results from complex genetic and statistical analyses depends on it. Therefore, monitoring and validation of input data are essential for Interbull (Fikse, 2004): data included in international evaluations have to pass a series of stringent tests before acceptance. The consistency of evaluations is assessed by the comparison of breeding values from consecutive evaluations to identify changes larger than expected based on statistical properties of the breeding values (Klei et al., 2002). Genetic trends are estimated to check that national breeding values are unbiased (Boichard et al., 1995). These checks are also used at the domestic level to guarantee the quality of national genetic evaluations and to keep customers satisfied. However, these tests do not guarantee correctness of computation. The diversity and complexity of models used in the various countries to analyze different traits have led to a situation in which new methods of validation of national data, models, or both are increasingly needed. One research project identified by Interbull in 2002 was the development of a general simulation tool to validate national genetic evaluation systems, especially the development of a simulation environment to test breeding value prediction software. With this aim, a program was developed from a strategy described by R. Thompson (Rothamsted Research, Harpenden, UK; personal communication) to simulate data with known breeding values and phenotypes for a single-trait animal model (Täubert et al., 2002) in such a way that BLUP solutions for breeding values should be equal to the simulated ones. This strategy assumes that residuals are zero but with constrained fixed and random effects. This method was later extended to a multiple-trait animal model (Wensch-Dorendorf et al., 2005).

Making use of a simulation tool is not the only option to validate new genetic evaluation software. A simpler alternative may be preferred. For instance, the direct inversion of mixed-model equations (MME) for a small data example could verify that the iterative EBV match those obtained from direct inversion. Nevertheless, when the genetic evaluation model is more sophisticated, this does not guarantee that the MME are properly set up, and numerical problems are often detected only on larger data sets. Another option is to compare results from the new software with the ones obtained from reference software. This method is widely used when it is technically possible.

In actuality, a large number of BLUP software programs have been developed worldwide, fulfilling specific needs. For instance, in France, BLUP software (GeneKit, V. Ducrocq, personal communication) was developed to deal with test-day models (TDM). Indeed, national genetic evaluation models for dairy traits are increasingly based on TDM instead of 305-d lactation models. A large variety of models have been proposed, differing in 1) how the lactation curve is modeled as a function of DIM [with fixed classes, parametric curves, or semiparametric (spline) curves; White et al., 1999], 2) how the genetic and permanent environmental components are described (fixed or random regression using Legendre or other polynomials), and 3) how heterogeneous residual variances are accounted for. Unfortunately, no general evaluation software including all possible models is available for TDM with very large data sets. Therefore, countries have developed custom software to perform routine genetic evaluations for their own population using TDM. The lack of reference software makes the software validation step complex. Furthermore, the routine genetic evaluation software usually relies on iterative solving algorithms, which makes it even more difficult to debug them and makes the results complicated to verify. For TDM needs, extension from R. Thompson’s strategy to random regression situations is not straightforward.

The objective of this paper was to present a general and flexible strategy that could be used to validate the correctness of newly developed genetic evaluation software. Consistent phenotypic data were generated in such a way that BLUP estimates from correct evaluation software were mathematically equal to the simulated effects. This methodology can be considered a helpful tool in the development and further refinement of BLUP software.


    MATERIALS AND METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Outline of the Procedure
The starting point is a pedigree file and a data file containing, for each record, the relevant levels, variables, or a combination of both for all effects, the animal’s recoded number and permanent environmental effect level, and all other pertinent pieces of information [elements required to compute random regression coefficients; the weight of records; the genetic, permanent environment, and residual (co)variance matrices, etc.]. These files can be real data sets. The procedure to check genetic evaluation software can then be divided into 3 steps:

  1. For each effect as well as for one residual per observation, simulate values following the approach described below, leading to a simulated performance record for each record in the data file.
  2. Include as input data these simulated performance records in the national genetic evaluation software. Estimates are obtained for all effects included in the model.
  3. Compare estimates of fixed effects and predicted random effects from the national genetic evaluation software with the true (simulated) ones. If the resulting breeding values, permanent environmental effects, and all estimable contrasts of fixed effects are identical to the true ones, then the genetic evaluation software can be considered as correct.

Estimation Method
The following multiple-trait linear model y = Xb + Za + e is considered to describe the derivation of constraints enforcing the simulated performance to fulfill the required properties. y is the vector of observations; b is a vector of fixed effects; a is the vector of breeding values following a normal distribution, with E[a] = 0 and Var[a] = G = G0 {otimes} A; and e is the vector of random residuals following a normal distribution, with E[e] = 0 and var[e] = R = R0 {otimes} I. X and Z are matrices relating y to the appropriate fixed and genetic effects. G0 is the covariance matrix for the genetic effects, and A is the additive genetic relationship matrix.

The MME corresponding to this model are


Formula 1[1]

The MME are analogous to the normal equation in the standard linear model. Indeed, MME were initially obtained by maximizing the posterior distribution of a given y (Henderson et al., 1959). System [1] can also be rewritten as F'Fx = F'c, with Formula 1, and Formula 1, and Formula 1, where M1/2 is, for a symmetric general positive definite matrix M, the unique positive definite matrix B, with BB = M. System [1] can then be viewed as "normal equations" (Searle, 1971), giving the least squares solutions of the overdetermined system Fx = c. The least squares solutions are obtained by minimizing the squared Euclidean norm of the residual ||Fx c – 2||22:


Formula 2[2]

To minimize [2], its derivative with respect to x is set to zero, that is, 2F'Fx2F'c = 0. After simplification, the minimizing vector x is a solution of the normal equation F'Fx = F'c, which can be written as F'(cFx) = 0.

Therefore, to simulate the data y = Xb + Za + e such that F'(c Fx) = 0 whatever the values of b, we must have Formula 2, which leads to the 2 following conditions: X'R–1e = 0 and Z'R–1e = G–1a.

Application to a Random Regression TDM with Unknown Parent Groups
The previous model can be extended to


Formula 2

where y is the vector of observations; b is a vector of fixed effects; g is the vector of genetic group effects (e.g., phantom parents groups); a* is the vector of breeding values corrected for genetic group effects, that is, with E[a*] = 0 or, alternatively, E[a] = Qg, with a = Qg + a* and Var[a*] = G = G0 {otimes} A; p is the vector of permanent environmental effects, with E[p] = 0 and Var[p] = P = P0 {otimes} I; e is the vector of random residuals, with E[e] = 0 and var(e) = R; X, Z, and W are matrices relating y to the appropriate fixed, genetic, and permanent environmental effects, possibly through continuous covariates; and Q is the matrix assigning animals in a* to groups in g. In the case of random regressions, G0 and P0 are the covariance matrices for the genetic and permanent environmental effects, respectively, and A is the additive genetic relationship matrix. R is a diagonal matrix of residual variances. In the case of heterogeneous variances, for example, as a function of DIM (e.g., Druet et al., 2003), the diagonal terms of R vary from one record to the next. If heterogeneous variances are estimated simultaneously with the other effects during iterations, the validation test is no longer applicable because solutions are then nonlinear functions of the data. However, the validation test can still be used to validate part of the software if the updating of heterogeneous variance estimates can be bypassed.

This model is an extension of the previous one, with X and b replaced by [X ZQ] and Formula 2, respectively, and Z and a replaced by [Z W] and Formula 2, respectively. In this case, the MME are


Formula 3[3]

and the residual conditions derived above become X'R1e = 0, Q'Z'R1e = 0, G1a* = Z'R1e, and P1p = W'R1e. Most models used in animal breeding, such as the maternal model, dominance, and the epistatic model or gametic model, can be easily implemented.

Fulfilling of Residual Conditions
To fulfill the first 2 residual conditions, a working variable {varepsilon} is first simulated for all observations with any underlying distribution, for example, {varepsilon}i ~ N(0, {sigma}{varepsilon}2) with variance {sigma}2. Two vectors, β and {gamma}, are then computed by using any least squares solver, for example, the one included in the software package FSPAK90 (Misztal and Perez-Enciso, 1998), as solutions for


Formula 3

If we choose e = {delta}({varepsilon}XbZQ{gamma}), one can check that X'R–1e = 0 and Q'Z'R–1e = 0. Here, {delta} is a scaling factor that can take any value. Then b and g can be simulated by using any underlying distribution because they do not influence e.

The next step is to derive a* such that G–1a* = (G0–1 {otimes} A–1)a* = (I {otimes} A–1)(G0–1 {otimes} I)a* = Z'R–1e. Define u = (G0–1 {otimes} I)a* so we need a* in such a way that (I {otimes} A–1)u = Z'R–1e. For example, if the random regression model includes 3 genetic terms (3 genetic values j per animal i, ai1*, ai2*, and ai3*), u = {uj} also includes 3 terms, ui1, ui2, and ui3, for each animal i, which are linear combinations of the terms ai1*, ai2*, and ai3*. For each j, we need to solve A–1uj = Z'R–1ej. This is easily done by using the decomposition A–1 = T–1D–1T–T (Quaas, 1976). T–1 is a lower triangular matrix with at most 2 nonzero off-diagonal elements: the ith diagonal element is equal to 1 and the columns corresponding to the sire and dam of animal i are equal to –1/2. D is the diagonal matrix of the proportion of the genetic variance corresponding to the Mendelian sampling terms. First, we must solve (T–1D–1)vj = Z'R–1ej for vj and then T–Tuj = vj. These 2 triangular systems are very simple. Finally, we can compute a* = G0 {otimes} u [or for animal i ([ai,1* ai,2* ai,3*] ' = G0[ui,1 ui,2 ui,3]')].

We obtain p enforcing P–1p = (P0–1 {otimes} I)p = W'R–1e by choosing p= (P0 {otimes} I)W'R–1e. This reduces to the product of P0 by the terms in W'R–1e corresponding to animal i. Finally, y is constructed as y = Xb + ZQg + Za* + Wp + e.

Application to a TDM
As an illustration, this strategy was applied to a real data set with 47,492 first lactations of Montbéliarde cows from a sample of herds in Jura (a region of France), that is, 377,080 test-day observations recorded between September 1995 and August 2005 in 28,020 herd test-day (HTD) combinations. The average number of test-day records per lactation was 8, with a minimum of 5. On average, 115 test-day records were known per herd and per year, with a minimum of 50 and a maximum of 431. A total of 118 classes of calving month and 96 classes of calving age were considered. Days in milk had to be between 5 and 335. Information about gestation (days carried calf, DCC) was also available. The pedigree file included 102,339 animals with 8 groups of unknown parents. The approach described above was tested on the following model. For animal i, test-day j, and DCC k, the performance recorded after t days of lactation is


Formula 3

where yijkrs(t) is the performance recorded, HTDj is the jth fixed herd by test-day effect applied to all animals in the herd on test day j, β is a regression coefficient related to the effect of the calf after k days of gestation on performance, brl and csp are fixed regression coefficients specific to calving month r and calving age class s, {alpha}im are the genetic values for cow i, and pin are the permanent environmental effects for cow i. {omega}l(t), {xi}p(t), {alpha}m(t), and {psi}n(t) are continuous covariates depending on DIM. For this model, random regression curves and fixed regression curves for calving age were modeled by using Legendre polynomials of order 2 (Nc = Na = Np = 3). Fixed regression curves for calving month were modeled by using 6 knot splines (White et al., 1999) at DIM 5, 20, 50, 135, 245, and 335. eijkrs(t) is the residual effect for each observation, with a heterogeneous variance continuously changing with DIM.


    RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Correlations between simulated effects and estimates obtained with the national genetic evaluation software from the 377,080 test-day records were all equal to 1. Standard deviations for all effects included in the model were identical. For contrast analysis, the largest relative difference observed between "estimates" (estimable contrasts or predicted random effects) and true values (i.e., 478,468 estimated effects) was 4·10–4 when the convergence criterion used for BLUP evaluation (i.e., average solution change between 2 iterations) was 10–6. The software implemented for the national evaluation was therefore considered as correct for the tested model.

In contrast with other simulation approaches (R. Thompson, Rothamsted Research, Harpenden, UK; personal communication; Täubert et al., 2002; Wensch-Dorendorf et al., 2005), the proposed strategy to generate breeding values and performances can be applied to actual data sets by simply replacing performance values with simulated ones. Its main specificity is that appropriate residuals are created when the other approaches do not simulate any residual. This is more complex but also more realistic: for example, R. Thompson’s (Rothamsted Research, Harpenden, UK; personal communication) approach requires animals from the base generation to have "virtual" records. Such a requirement is not imposed here.

Moreover, the method could help to investigate the relevance of different iterative algorithms and convergence criteria as long as the simulated effects are of a magnitude similar to the real effects. This can be adjusted by choosing an appropriate scaling factor {delta}. Misztal et al. (1988) showed that the real accuracy of solutions could be far from the one suggested by some convergence criteria. Here, true solutions of the system are directly available. The relative average difference between current and true solutions, as proposed by Misztal et al. (1988), can be computed at each iteration for a particular data set. Usually, the convergence of genetic evaluation software is checked by comparing solutions obtained from a limited number of iterations with the ones resulting from an increased number of iterations. This method is easy to implement because it does not require modification of the original software, but it does not allow one to study the relevance of the convergence criteria nor the correctness of the converged solutions.

The main target for the proposed approach may be researchers implementing new genetic evaluation software to be used for sophisticated models for which no reference software exists. The development of a specific simulation tool to generate genetic and nongenetic effects as well as data under the most frequent models used in countries participating in international evaluation can be envisioned but should be done once for all countries. Covering a large number of potential models is a difficult task but is probably easier to consider than creating an all-purpose reference software. Such a tool should present 2 main advantages: it would enable any Interbull participating country to check the correctness of its national genetic evaluation software and to verify that the number of iteration rounds or the convergence criteria are adequate.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This research was partly funded by a grant from the French Ministry of Agriculture (Action Innovante UTILEG, Paris, France) and from the German Research Foundation (Deutsche Forschungsgemein-schaft, Bonn, Germany). One part of this study was realized within the framework of the Unité Mixte Technologique (Joint Technology Unit) INRA-Institut de l’Elevage (Jouyen-Josas, France) dealing with cattle genetic evaluation.


    FOOTNOTES
 
2 Equal contribution. Back

Received for publication December 21, 2007. Accepted for publication March 26, 2008.


    REFERENCES
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 


Boichard, D., B. Bonaiti, A. Barbat, and S. Mattalia. 1995. Three methods to validate the estimation of genetic trend for dairy cattle. J. Dairy Sci. 78:431–437.[Abstract]

Druet, T., F. Jaffrézic, D. Boichard, and V. Ducrocq. 2003. Modeling lactation curves and estimations of genetic parameters for first lactation test-day records of French Holstein cows. J. Dairy Sci. 86:2480–2490.[Abstract/Free Full Text]

Fikse, W. F. 2004. Interbull guides through the labyrinth of national genetic evaluations. Page 326 in Proc. of the 55th Mtg. Eur. Assoc. Anim. Prod., Interlaken, Switzerland. http://www-interbull.slu.se/w-agora/getfile.php/interbullforum_edumaterial/216/EAAP2004.pdf Accessed Mar. 25, 2008.

Henderson, C. R., O. Kempthorne, S. R. Searle, and C. N. Von Krosigk. 1959. Estimation of environmental and genetic trends from records subject to culling. Biometrics 15:192–218.[CrossRef]

Klei, B., T. Mark, W. F. Fikse, and T. Lawlor. 2002. A method for verifying genetic evaluation results. Interbull Bull. 29:178–182.

Misztal, I., D. Gianola, and L. R. Schaeffer. 1988. Convergence rates in animal model solutions. J. Dairy Sci. 70:2577–2584.

Misztal, I., and M. Perez-Enciso. 1998. FSPAK90: A Fortran 90 interface to sparse-matrix package FSPAK with dynamic memory allocation and sparse matrix structure. Proc. 6th World Congr. Genet. Appl. Livest. Prod. 27:467–468.

Quaas, R. L. 1976. Computing the diagonal elements and inverse of a large numerator relationship matrix. Biometrics 32:949–953.[CrossRef]

Searle, S. R. 1971. Linear Models. John Wiley and Sons Inc., New York, NY.

Täubert, H., H. H. Swalve, and H. Simianer. 2002. The Interbull Audit Project Part II: Development of a program for auditing breeding value estimation programs. Interbull Bull. 29:165–167.

Wensch-Dorendorf, M., H. H. Swalve, and J. Wensch. 2005. Simulation of multiple trait data for testing breeding value estimation programs. Page 207 in Proc. 56th Mtg. Eur. Assoc. Anim. Prod., Uppsala, Sweden Wageningen Acad. Publ., Wageningen, the Netherlands.

White, I. M. S., R. Thompson, and S. Brotherstone. 1999. Genetic and environmental smoothing of lactation curves with cubic splines. J. Dairy Sci. 82:632–638.[Abstract]



This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Interpretive Summary
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Leclerc, H.
Right arrow Articles by Swalve, H. H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leclerc, H.
Right arrow Articles by Swalve, H. H.


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS