* Animal Science Unit, Gembloux Agricultural University, B-5030 Gembloux, Belgium
National Fund for Scientific Research, B-1000 Brussels, Belgium
Animal and Microbial Biology Unit, Gembloux Agricultural University, B-5030 Gembloux, Belgium
Department of Genetics and Animal Breeding, August Cieszkowski Agricultural University of Poznan, Poland
1 Corresponding author: gengler.n{at}fsagx.ac.be
| ABSTRACT |
|---|
Key Words: regression on gene content, single gene effect, test-day model
| INTRODUCTION |
|---|
The objective of the present study was to evaluate the accuracy of this new method in estimating and, subsequently, using predicted gene content in the context of the estimation of the effects of a candidate gene on first-lactation milk, fat, and protein test-day (TD) yields and SCS in Holstein cows and the estimation of combined breeding values associating single gene effect and polygenic effects.
| MATERIALS AND METHODS |
|---|
Candidate Gene Studied
The selected gene was the bovine transmembrane growth hormone-receptor (GHR). Indeed, Falaki et al. (1996) found effects of polymorphism of this gene on milk protein percentage. As reported by Arranz et al. (1998) and Blott et al. (2003), other interesting results showed the possible segregation of a QTL on chromosome 20 (therefore close to the GHR gene) that seemed to influence milk yield and composition in Holstein dairy cattle. Semen samples of 961 Canadian Holstein AI bulls, born mostly during the late 1980s and early 1990s, were provided by Semex Alliance (Guelph, ON, Canada) for a previous study (Parmentier, 2004). In that study, bulls were genotyped by a PCR allele-specific method to determine if they had a T→A substitution at the transmembrane domain of the GHR gene, leading to the replacement of a phenylalanine by a tyrosine (Parmentier, 2004). Around 75% of the cows with production records had at least one genotyped sire or grandsire.
Prediction of Gene Content
Gene content was approximated using the method described by Gengler et al. (2007). This method computes the conditional expectation of gene contents for nongenotyped animals, given molecular and pedigree data. The Appendix shows an alternative derivation to that given by Gengler et al. (2007), showing the underlying hypotheses.
Cross Validation of Gene Content Prediction
The method to predict gene content was used in a preliminary cross-validation study. The aim of this study was to assess the differences between predicted and known genotypes, for animals with known genotypes. For this purpose, a file containing the 961 genotyped sires was used. The sires were removed one by one from the file and their gene content was estimated by the method described above, using the remaining 960 sires and the relationship between the removed sire and the other sires. Results obtained for each sire were stored for later comparison to the true gene contents.
For each real gene content (0, 1, and 2), means and standard deviations (SD) of estimated gene contents, and mean square errors (MSE) between real and estimated gene contents were calculated. Mean square error was computed as
$$\mathrm{MSE} = \frac{1}{n_q}\sum_{i=1}^{n_q}\left(q_{ir} - q_{ie}\right)^2$$

where $q_{ir}$ is the real gene content (0, 1, or 2) of the ith animal, $q_{ie}$ is the estimated gene content of the ith animal, and $n_q$ is the number of animals with the same real gene content (0, 1, or 2).
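The per-class summary described above can be sketched in a few lines of Python. This is only an illustration: the function name `mse_by_true_content` and the example values are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def mse_by_true_content(true_contents, predicted_contents):
    """Return {real gene content: (mean of predictions, MSE)}.

    Animals are grouped by their real gene content (0, 1, or 2), and the
    squared differences between real and estimated gene content are
    averaged within each group.
    """
    groups = defaultdict(list)
    for q_true, q_pred in zip(true_contents, predicted_contents):
        groups[q_true].append(q_pred)

    summary = {}
    for q_true, preds in groups.items():
        n_q = len(preds)
        mean_pred = sum(preds) / n_q
        mse = sum((q_true - q) ** 2 for q in preds) / n_q
        summary[q_true] = (mean_pred, mse)
    return summary


# Hypothetical values, for illustration only.
true_q = [0, 0, 1, 2]
pred_q = [0.5, 0.25, 1.0, 1.5]
result = mse_by_true_content(true_q, pred_q)
```

In a leave-one-out setting, `pred_q` would hold the gene content predicted for each sire after removing that sire's genotype from the file.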
Analysis Model
A generic TD model was used; results obtained would have been similar with slightly different models. The analysis model provided flexibility for the fixed portion and a minimum number of parameters for the random portion through the use of polynomials. Random regression effects were modeled using modified Legendre polynomials, to reduce correlations among regression coefficients. The use of third-order polynomials (constant, linear, and quadratic) was considered sufficient for a single yield trait to describe the random variation around the fixed lactation curve (Gengler et al., 1999). The 3 modified Legendre polynomials used were:
$$I_0 = 1$$

$$I_1 = 3^{0.5}\,x$$

$$I_2 = 5^{0.5}\left(1.5x^2 - 0.5\right)$$

where $x = -1 + 2[(\mathrm{DIM} - 1)/(305 - 1)]$ and DIM = days in milk.
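As a rough illustration of how the regression covariates can be built from DIM, the sketch below assumes that the modified Legendre polynomials are the ordinary Legendre polynomials rescaled by $(2n + 1)^{0.5}$, so that the constant term equals 1. That scaling and the function names are assumptions made for this example.

```python
import math


def standardized_dim(dim, first=1, last=305):
    """Map days in milk onto [-1, 1]: x = -1 + 2 (DIM - 1) / (305 - 1)."""
    return -1.0 + 2.0 * (dim - first) / (last - first)


def modified_legendre(dim):
    """Constant, linear, and quadratic covariates for one test day.

    Assumed scaling: sqrt(2n + 1) * P_n(x), where P_n is the ordinary
    Legendre polynomial of degree n.
    """
    x = standardized_dim(dim)
    constant = 1.0
    linear = math.sqrt(3.0) * x
    quadratic = math.sqrt(5.0) * (1.5 * x * x - 0.5)
    return constant, linear, quadratic


covariates_mid_lactation = modified_legendre(153)
```

One row of such covariates per test-day record would form the covariate matrix for the random regressions.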
The model used for the estimation of the effects in the first-lactation TD records was the following mixed inheritance model:
![]() |
where y is a vector of production data (TD yields or SCS; 12,858,741 records); htd is a vector of herd and TD fixed effects (1,320,824 levels); sarc is a vector of season, group of age, region, and class of lactation fixed effects (560 levels); $\alpha$ is the allelic substitution effect; p is a vector of permanent environmental random effects (1,656,599 levels); a is a vector of random polygenic additive effects (2,755,058 levels); e is a vector of residual effects; H, S, Z, and Z* are incidence matrices; $\hat{\mathbf{q}}$ is a vector of estimated gene content for the tyrosine-coding allele; and W is the covariate matrix for Legendre polynomials. There were 2 calving seasons: September to March and April to August. Four groups of calving age were defined: first <25 mo, second between 25 and 30 mo, third between 30 and 35 mo, and fourth >35 mo. The region was a province or a group of provinces. For the 10 different provinces the following regions were defined: British Columbia (region 1); Alberta, Saskatchewan, and Manitoba (region 2); Newfoundland, New Brunswick, Nova Scotia, and Prince Edward Island (region 3); Quebec (region 4); and Ontario (region 5). Fourteen lactation stage classes were created. Each class spanned 20 DIM, from d 25 (<25, 45, 65...) to d 305. It must be acknowledged that there was no adjustment for the fact that gene content was itself estimated. We did not try to quantify the potential loss of variation in the estimates compared with the observed values. Future research should focus on this aspect.
No suitable (co)variance components were directly available; therefore, they were estimated from a random subset of the available data. The sample included 89,877 TD records for 11,844 cows in production. The mean production per cow per TD was 24.18 kg for milk (SD = 6.25 kg), 885 g for fat (SD = 225 g), 782 g for protein (SD = 187 g), and 2.03 for SCS (SD = 1.74). These values were close to those of the whole population in production given before. A pedigree file was extracted from the Canadian Holstein database and included 24,138 animals representing the selected cows and all their ancestors.
The model used for the estimation of the variance components was the same as the analysis model. The (co)variance components were obtained using REMLF90 (Misztal, 2002) as described by Gengler et al. (1999).
Estimation of Allelic Substitution Effect and Variance
A preconditioned conjugate gradient solver was used to solve the analysis model (Stranden and Lidauer, 1999). The allelic substitution effect was defined as the expected phenotypic difference resulting from the substitution of a copy of the phenylalanine-coding allele by a copy of the tyrosine-coding allele under the assumption of no dominance. This effect is given by $\alpha$. The associated allelic substitution variance was estimated as $2P_A(1 - P_A)\alpha^2$, where $P_A$ is the frequency of the A allele in the base population.
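The variance formula lends itself to a one-line computation. The sketch below uses a hypothetical helper name, `substitution_variance`, and plugs in the allele frequency (23.3%) and the milk substitution effect (295 g/d) reported in the Results purely as example inputs.

```python
def substitution_variance(p_a, alpha):
    """Allelic substitution variance: 2 * P_A * (1 - P_A) * alpha^2."""
    return 2.0 * p_a * (1.0 - p_a) * alpha ** 2


# Example inputs: base-population frequency of the tyrosine-coding allele
# and the estimated substitution effect on milk yield (g/d).
variance_milk = substitution_variance(p_a=0.233, alpha=295.0)
```

The result is expressed in squared units of the trait (here, squared grams per day).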
Validation Through Simulation and Permutations
Obtaining exact standard errors for our estimates of single gene effects would have been impossible. Indirect methods exist, such as those based on mixed model conjugate normal equations (Croquet et al., 2006). However, these methods do not take into account the uncertainty of the gene content used in the mixed model. Therefore, an alternative indirect method based on permutations was used:
Step 1: Genotypes were simulated as follows. A biallelic gene was simulated for all animals of the pedigree without known parents. If only one parent was known, only one allele of the gene was simulated. Each allele was simulated by sampling once from a uniform distribution. If the value obtained was equal to or smaller than the estimated $P_A$, the animal received the tyrosine-coding allele; otherwise, it received the phenylalanine-coding allele. Once alleles had been attributed to all animals with unknown parents, they were dropped down the pedigree assuming a transmission probability of 1/2.
Step 2: Production records of each cow were modified using the simulated genotypes and the allelic substitution effects previously estimated to give y* = y + d, where y = 1 of the production traits, d = $\alpha$ if the simulated genotype was homozygous with 2 tyrosine-coding alleles, d = $-\alpha$ if the simulated genotype was homozygous with 2 phenylalanine-coding alleles, and d = 0 if the simulated genotype was heterozygous.
Step 3: Only the simulated genotypes of the 961 GHR-genotyped bulls were assumed to be known; for the other animals, gene content was estimated by the method described by Gengler et al. (2007).
Step 4: The allelic substitution effect was estimated using the mixed inheritance model described above with the modified production records (step 2) and estimated gene content (step 3).
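Steps 1 and 2 can be sketched as a small gene-dropping routine. This is a minimal illustration under stated assumptions: the pedigree is a list of `(animal, sire, dam)` tuples sorted so that parents precede their offspring, `None` marks an unknown parent, allele 1 stands for the tyrosine-coding allele, and all names (`drop_alleles`, `record_shift`, the toy pedigree) are hypothetical.

```python
import random


def drop_alleles(pedigree, p_a, rng):
    """Simulate a biallelic genotype for every animal in the pedigree.

    An allele coming from an unknown parent is sampled from the base
    population (allele 1 if the uniform draw is <= p_a); an allele coming
    from a known parent is one of that parent's two alleles, each with
    probability 1/2.
    """
    genotypes = {}
    for animal, sire, dam in pedigree:
        alleles = []
        for parent in (sire, dam):
            if parent is None:
                alleles.append(1 if rng.random() <= p_a else 0)
            else:
                alleles.append(rng.choice(genotypes[parent]))
        genotypes[animal] = tuple(alleles)
    return genotypes


def record_shift(genotype, alpha):
    """Return d: +alpha for 2 copies of allele 1, -alpha for 0 copies, 0 otherwise."""
    return (sum(genotype) - 1) * alpha


# Toy pedigree, for illustration only.
pedigree = [
    ("sire", None, None),
    ("dam", None, None),
    ("cow", "sire", "dam"),
    ("heifer", "sire", None),
]
rng = random.Random(2008)
genotypes = drop_alleles(pedigree, p_a=0.233, rng=rng)
shifts = {animal: record_shift(g, alpha=295.0) for animal, g in genotypes.items()}
```

Each cow's records would then be modified as y* = y + d using her entry in `shifts`, before gene content is re-estimated from the genotyped bulls only.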
Steps 1 to 4 were repeated 15 times, which represented a compromise between available time and resources and the natural complexity of a test-day model. Mean, standard deviation, bias, and standard error (Efron and Tibshirani, 1986) were computed from 15 estimates for each production trait using the following definitions. To allow an easy comparison between the different traits, bias and standard errors were computed as relative values of simulated parameters for allele frequency and substitution effects:
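$$\bar{\theta} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_i, \qquad \text{relative bias} = \frac{\bar{\theta} - \hat{\theta}}{\hat{\theta}}, \qquad \text{relative standard error} = \frac{1}{\hat{\theta}}\sqrt{\frac{\sum_{i=1}^{n}\left(\hat{\theta}_i - \bar{\theta}\right)^2}{n - 1}}$$

where n is the number of repetitions, here 15; $\hat{\theta}$ = parameter estimate based on original data; $\hat{\theta}_i$ = ith estimate of a parameter based on modified data. To test whether the allelic effect, $\alpha$, was significantly different from zero, approximate t-tests were performed for the substitution effects based on the inverse of the relative standard error associated with 14 degrees of freedom.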
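The summary statistics of the repetitions can be sketched as follows. The function name and the toy inputs are hypothetical. The standard error is taken here as the sample standard deviation of the repeated estimates (divisor n - 1), which is an assumption consistent with the 14 degrees of freedom mentioned for 15 repetitions.

```python
import statistics


def permutation_summary(estimates, original):
    """Summarize repeated estimates relative to the original estimate.

    Returns the mean, SD, relative bias, relative standard error, and an
    approximate t statistic computed as the inverse of the relative
    standard error, with n - 1 degrees of freedom.
    """
    n = len(estimates)
    mean = statistics.mean(estimates)
    sd = statistics.stdev(estimates)  # divisor n - 1
    rel_bias = (mean - original) / original
    rel_se = sd / abs(original)
    return {
        "mean": mean,
        "sd": sd,
        "relative_bias": rel_bias,
        "relative_se": rel_se,
        "t_approx": 1.0 / rel_se,
        "df": n - 1,
    }


# Toy inputs, for illustration only.
summary = permutation_summary([1.0, 2.0, 3.0], original=2.0)
```

The approximate t statistic would then be compared with a t distribution on `df` degrees of freedom.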
Cross Validation of Mixed Inheritance Model Using Predicted Gene Content
To validate the use of predicted gene content in a mixed inheritance model, data splitting (cross-validation) was used because it allows evaluation of the predictive ability of a model. Assessing the predictive ability of a model involves leaving out a portion of the data, fitting the model to the remainder of the data, and then testing the model fit on the omitted portion. The strategy proposed by Ramirez-Valverde et al. (2001) was used. According to earlier work by Picard and Cook (1984), optimal cross-validation should be done by splitting data randomly, to imitate a sample of future observations. This is especially true in the setting of genetic evaluations, where we try to predict unknown breeding values from limited knowledge of phenotypes and, in our case, genotypes. Our data splitting technique involved duplicating the data set and randomly discarding the TD records of one-half of the cows in one subset, with the TD records of the other half of the cows being discarded in the other subset. This strategy was a slight modification of the one proposed by Ramirez-Valverde et al. (2001), who randomly discarded half of the records rather than the records belonging to half of the cows. Under the original strategy, the lactation curve of a given cow could still be reasonably well predicted from her remaining records. By eliminating all of her records, we consider that the predictive ability of the model is genuinely tested. The predictive ability of the mixed inheritance model was compared with that of the model without the regression on predicted gene content. For each model, breeding values were calculated from both subsets. For the mixed inheritance model including the regression on predicted gene content, combined breeding values were defined as the sum of the polygenic effect and the product of the single gene effect and predicted gene content. Correlations among breeding values obtained from the 2 subsets were calculated.
Computations were done for all 4 traits. Ten different random samples were created according to the above criteria, and reported correlations were the average of the 10 replicates. For each model, the estimated correlation coefficients provided an estimate of the model performance. Higher correlation estimates between complementary subsets indicate a greater stability of the model to predict breeding values. Correlation coefficients were calculated for all animals, for all cows with records, and for the 961 genotyped sires.
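The cow-wise data splitting and the correlation between breeding values from the 2 subsets can be sketched as below. Records are represented as hypothetical `(cow_id, value)` tuples, and the helper names are illustrative. Fitting the test-day model to each subset is outside the scope of this sketch.

```python
import math
import random


def split_cows(cow_ids, rng):
    """Randomly assign cows to 2 complementary halves."""
    cows = sorted(set(cow_ids))
    rng.shuffle(cows)
    half = len(cows) // 2
    return set(cows[:half]), set(cows[half:])


def make_subsets(records, rng):
    """Build 2 subsets; each keeps all TD records of one half of the cows."""
    group_a, group_b = split_cows([cow for cow, _ in records], rng)
    subset_1 = [rec for rec in records if rec[0] in group_a]
    subset_2 = [rec for rec in records if rec[0] in group_b]
    return subset_1, subset_2


def pearson(x, y):
    """Pearson correlation between 2 equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy records: 6 cows with 3 test days each, for illustration only.
records = [(cow, td) for cow in range(6) for td in range(3)]
subset_1, subset_2 = make_subsets(records, random.Random(1))
```

After evaluating each subset, `pearson` would be applied to the two vectors of breeding values for the same animals, and the procedure repeated over the random replicates.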
| RESULTS AND DISCUSSION |
|---|
Estimation of Allelic Substitution Effect and Variance
Estimated effects of a substitution of a copy of the phenylalanine-coding allele by a copy of the tyrosine-coding allele on production traits were 295 g/d for milk, –8.14 g/d for fat, –1.83 g/d for protein, and –0.022 for SCS (Table 2). The frequency of the tyrosine-coding allele estimated by the new method was found to be 23.3% instead of the 23.8% estimated from the genotypes of the 961 sires. The difference is linked to the fact that the method used to estimate allele frequency weights the importance of every sire relative to its relationship to the population. Therefore, the obtained value reflects the allele frequency of the founders. This is an interesting feature of the method because knowledge of this frequency is important in many situations.
Expressed relative to the genetic standard deviation ($\sigma_G$) and phenotypic standard deviation ($\sigma_P$), the substitution effect decreased in the order milk, fat, SCS, and protein.
Estimated allelic substitution variances are also given in Table 2. Based on absolute values and especially values relative to total genetic and phenotypic variances, the allelic effects considered were nearly negligible compared with the overall average TD genetic ($\sigma_G^2$) and phenotypic ($\sigma_P^2$) variances.
The estimates of gene frequency and allelic effects are in general agreement with the results reported by Blott et al. (2003) for 2 populations of Holstein-Friesian cattle and 1 of Jersey. The rare tyrosine-coding allele was correlated with greater milk yield. As in Blott et al. (2003), the negative effect of tyrosine-coding allele on fat yield and protein yield was not so pronounced. In the present study, the effect on milk yield was greater than that reported by Blott et al. (2003) for the same polymorphism, but explained a similar portion of trait variability. Our estimate of allelic substitution effect was also greater than the one calculated for other candidate genes under a model with an additive effect of a single SNP (Szyda and Komisarek, 2007).
The larger substitution effect could be explained by the narrower range of the regressor variable (predicted gene content), which arises because no cows were genotyped. In this case, the average predicted content of tyrosine-coding alleles is lower than 2 for cows carrying 2 tyrosine-coding alleles and greater than 0 for cows carrying 2 phenylalanine-coding alleles. This should be taken into account in the interpretation of the results.
We anticipate that more cows will be genotyped in the future and that the tendency to potentially overestimate the single gene effect will decrease. Still, single gene effects estimated in this study can be considered to be small to very small if compared with results of the meta-analysis done by Hayes and Goddard (2001). Indeed, in our study, all allelic substitution effects were equal to or below 0.119 and 0.065 of $\sigma_G$ and $\sigma_P$, respectively. Hayes and Goddard (2001) reported that very few single gene effects as small as these were given in the literature.
Validation of Allele Substitution Results
Results of the 15 repetitions, means, SD, relative bias, and relative standard errors for milk, fat, protein, and SCS and simulated values for frequency of the tyrosine-coding allele ($F_A$) are in Table 3. Compared with $P_A$, allele frequencies $F_A$ provided by the 15 repetitions of the simulations had similar mean values. It must be remembered here that only the alleles from unknown parents were simulated using $P_A$; the other alleles were simulated with a one-half probability of receiving each parental allele. The very low relative bias of 0.2% shows that the estimation of $P_A$ based on the method to approximate gene content is consistent with the value used in the simulation, which is the value estimated from the base population. Moreover, the relative standard error was rather small, indicating that the estimation was not only unbiased but had a low sampling error, too. The new method to estimate allele frequencies proved to be rather reliable. The method proved to be resistant to selection, a feature that is directly built into the system. These results confirm the conclusion found by Gengler et al. (2007), which showed that this method computes values that are similar to those obtained by MCMC methods and iterative peeling, methods that are theoretically considered to be the most appropriate for genotype probability calculations.
This study showed that larger effects are, as expected, easier to estimate precisely. Still, the magnitude of the effects that could be detected was surprisingly low (below 0.1 $\sigma_P$) compared with the results reported by Hayes and Goddard (2001) and based on a large number of studies. However, our results also showed large relative standard errors, even for the effect on milk that can be considered significant. This result is in line with the results of Hayes and Goddard (2001), who reported that small to medium QTL effects might simply be artifacts of experimental error. We also found the tendency to overestimate the size of the single gene effects as expected from the results reported by these authors (Hayes and Goddard, 2001). Small effects appeared to be most overestimated. However, the results of this simulation study cannot establish whether the presented method is superior in its detection power to traditional methods such as that used by Szyda et al. (2005). Using the same data and a very similar approach to that of Szyda et al. (2005), the results obtained were nearly identical with this method (results not shown). However, the present method is easier and more general because it accepts genotypes from any genotyped animal and integrates smoothly into existing genetic evaluation models of any size and kind. Moreover, it allows genetic evaluations for traits where mixed inheritance models combining polygenic and single gene effects are required. Traditional methods using only some sires that are genotyped cannot be used directly in this context because they relate more to QTL detection than to genetic evaluation. This method has a much larger scope.
In this study, the simulation approach was based on a real-life situation and compared, under the hypothesis that the effect exists, the estimated value to the simulated one. This method has merit in that it is as close as possible to the expected situation. The weak point is that this simulation is done under the hypothesis of a total independence of the simulated gene with the rest of the genome. This hypothesis is obviously far from reality. However, it is at the root of the mixed inheritance model, which does not consider interactions between the gene and the rest of the genome.
Cross Validation of Mixed Inheritance Model Using Predicted Gene Content
Results of this cross-validation are summarized in Table 4. The correlation coefficients for breeding values for milk, fat, and protein estimated between the 2 subsets were slightly higher for the model with an additional regression on predicted gene content, the greatest difference being observed for milk, and then protein and fat. For SCS, correlation coefficients were similar for the 2 models. These results were expected, except for the inversion of the rank of fat and protein, given the results presented earlier and given the fact that small effects are more difficult to estimate precisely.
| CONCLUSIONS |
|---|
| APPENDIX |
|---|
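$$E\left(\mathbf{u}_x \mid \mathbf{u}_y\right) = \mathbf{1}\mu_g + \mathbf{A}_{xy}\mathbf{A}_y^{-1}\left(\mathbf{u}_y - \mathbf{1}\mu_g\right)$$

where $\mathbf{u}_x$ is a vector of unknown breeding values, $\mathbf{1}$ is a vector of ones, $\mathbf{u}_y$ is a vector of known breeding values, $\mathbf{A}_{xy}$ is the additive relationship matrix between individuals with unknown breeding values and their relatives with known breeding values, $\mathbf{A}_y$ is the additive relationship matrix among individuals with known breeding values, and $\mu_g$ is the average breeding value and could also be a genetic group estimate.
We can then rewrite breeding values as the sum of single gene effects for every biallelic locus i. By doing this we obtain the following prediction equations:

$$E\left(\sum_i \mathbf{q}_{xi}\alpha_i \,\Big|\, \mathbf{q}_{y1}, \mathbf{q}_{y2}, \ldots\right) = \mathbf{1}\mu_g + \mathbf{A}_{xy}\mathbf{A}_y^{-1}\left(\sum_i \mathbf{q}_{yi}\alpha_i - \mathbf{1}\mu_g\right)$$

where $\mu_g = \sum_i \mu_i\alpha_i$, $\alpha_i$ is the allele substitution effect for locus i, $\mathbf{q}_{yi}$ is a vector of known gene contents (genotyped animals) for locus i, $\mathbf{q}_{xi}$ is a vector of unknown gene contents (ungenotyped animals) for locus i, and $\mu_i$ is the average allele content for locus i, which is also equal to the allele frequency × 2. Under certain hypotheses, such as the normality of the contributions of single gene effects to breeding values, we can write for a given locus i:

$$E\left(\mathbf{q}_{xi}\alpha_i \mid \mathbf{q}_{yi}\right) = \mathbf{1}\mu_i\alpha_i + \mathbf{A}_{xy}\mathbf{A}_y^{-1}\left(\mathbf{q}_{yi}\alpha_i - \mathbf{1}\mu_i\alpha_i\right)$$

The conditional expectation of gene contents for nongenotyped animals, given molecular and pedigree data, is then simply derived by dividing both sides of the equation by the allele substitution effect $\alpha_i$:

$$E\left(\mathbf{q}_{xi} \mid \mathbf{q}_{yi}\right) = \mathbf{1}\mu_i + \mathbf{A}_{xy}\mathbf{A}_y^{-1}\left(\mathbf{q}_{yi} - \mathbf{1}\mu_i\right)$$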
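For a small pedigree, the conditional expectation of gene content for nongenotyped animals can be computed directly. The sketch below assumes the prediction takes the form $\mathbf{1}\mu_i + \mathbf{A}_{xy}\mathbf{A}_y^{-1}(\mathbf{q}_{yi} - \mathbf{1}\mu_i)$ and treats $\mu_i$ as known. The helper names, the dense Gaussian elimination, and the toy relationships are illustrative assumptions; a national-scale evaluation would need sparse, iterative techniques instead.

```python
def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def predict_gene_content(a_xy, a_y, q_y, mu):
    """Return mu + A_xy * A_y^{-1} * (q_y - mu) for each nongenotyped animal."""
    z = solve(a_y, [q - mu for q in q_y])
    return [mu + sum(a * zi for a, zi in zip(row, z)) for row in a_xy]


# Toy case: 2 unrelated, noninbred genotyped sires with gene contents 2 and 0,
# and 1 nongenotyped offspring of each sire out of unknown dams.
a_y = [[1.0, 0.0], [0.0, 1.0]]
a_xy = [[0.5, 0.0], [0.0, 0.5]]
predicted = predict_gene_content(a_xy, a_y, q_y=[2.0, 0.0], mu=0.5)
```

With an allele frequency of 0.25 (so that the mean gene content is 0.5), the offspring of the homozygous carrier sire is expected to carry 1 copy from its sire plus 0.25 from its unknown dam, and the offspring of the noncarrier sire 0.25 copies in total.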
| ACKNOWLEDGEMENTS |
|---|
Received for publication March 26, 2007. Accepted for publication January 1, 2008.
| REFERENCES |
|---|