Effects of preselection of genotyped animals on reliability and bias of genomic prediction in dairy cattle
Abstract
Objective
Models for genomic selection assume that the reference population is an unselected population. However, in practice, genotyped individuals, such as progeny-tested bulls, are highly selected, and the reference population is created after preselection. In dairy cattle, the intensity of selection is higher in males than in females, suggesting that cows can be added to the reference population with less bias and loss of accuracy. The objective was to develop formulas that can be applied to any genomic prediction study or practice in which preselected animals form the reference population.
Methods
We developed formulas for calculating the reliability and bias of genomically enhanced breeding values (GEBV) in the reference population where individuals are preselected on estimated breeding values. Based on the formulas presented, deterministic simulation was conducted by varying heritability, preselection percentage, and the reference population size.
Results
The number of bulls equal to a cow in terms of the reliability of GEBV was expressed through a simple formula for a reference population consisting of preselected animals. The bull population was vastly superior to the cow population regarding the reliability of GEBV for low-heritability traits. However, the superiority in reliability of the bull reference population over the cow population decreased as heritability increased. Bias was greater for bulls than for cows. Bias and the reduction in reliability of GEBV due to preselection were alleviated by expanding the reference population.
Conclusion
The reference population is easier to expand with cows than with bulls. Expanding the cow reference population alleviates the bias and the reduction in reliability of GEBV of bulls, which are more highly preselected than cows.
INTRODUCTION
Genomic prediction (GP) is used to predict the genomic breeding values of genotyped individuals [1]. The GP models usually do not account for selection. However, the reference population used to estimate marker effects with GP models usually consists of progeny-tested bulls, which are highly selected. Therefore, the prediction models are unable to incorporate past selection based on pedigree and phenotypes, which may lead to bias as well as decreased accuracy.
A formula for approximating the reliability and bias of the genomically enhanced breeding values (GEBV) that accounted for the prior selection of genotyped test bulls from among all test bull candidates was proposed [2]. In that method, the differences between the means and standard deviations of the estimated breeding values (EBV) of all of the test bull candidates are used to estimate the proportion of selective genotyping. Then, the selection difference or intensity of selection is calculated from quantitative genetics textbooks [3], and the authors approximated the reliability and bias of the GEBV by accounting for the effect of the intensity of selection [2]. However, the true genetic variance was reduced not only by the intensity of selection but also by the reliability of EBV [3]. The reliability of EBV differs by trait and between males and females. In dairy cattle populations, the intensity of selection is higher in males than in females [4], suggesting that cows potentially could be added to the reference population with less bias and loss of accuracy.
The genotyping of cows has become more prevalent as the cost of genotyping, in general, has decreased. The same individuals should be both genotyped and phenotyped, instead of genotyping the parents and phenotyping their progeny [5]. Adding genotyped females and their phenotypic records to the existing sire reference population is expected to increase the reliability of GP, and this increase in reliability would lead to increased genetic gain and decreased inbreeding in dairy cattle [6].
The genetic correlation between phenotypes of bulls and cows was approximately 0.6 for all yield traits and differed significantly from 1 [7]. Using selection index theory, [8] presented the reliability of GEBV for a reference population in which the information content of the phenotypes differs between groups, i.e., a reference population consisting of both sires and cows. However, that method did not take the effect of preselection on reliability into account. Reevaluating the value of cows would be beneficial given their less intense selection and the ease of incorporating them to expand the reference population, particularly with regard to how the bias and accuracy of GP vary with heritability and with the intensity of preselection among the animals chosen to create the reference population. A deterministic prediction model is necessary to develop simple formulas for calculating the reliability and bias of GEBV that account for prior selection of animals in the reference population. Some results on using cows in the reference population have been reported with real data [9–12]. Constructing a real mixed reference population on the basis of the deterministic model presented here would improve the reliability of GEBV.
The first objective of the current study was to develop formulas for calculating the reliability and bias of GP in which the effects of both the intensity of selection and the reliability of EBV of preselected animals in the reference population are taken into account. The second objective was to present a formula for calculating the number of bulls equal to a cow, i.e., the number of preselected bulls that yields the same reliability of GEBV as one preselected cow in the reference population. The third objective was to present a guideline for creating a reference population composed of preselected bulls and cows, so that the bias and the reduction in reliability of GEBV due to preselection can be limited before the reference population is actually created.
MATERIALS AND METHODS
Formulas for reliability and bias under selection
The variance of true breeding value of the animals selected on the basis of EBV can be expressed as:
where * denotes values after selection, G is true breeding value,
where
Assuming prediction error cov(EBV, GEBV) is zero, the correlation between EBV and GEBV can be shown as:
Therefore, the variance of the GEBV after selection is:
In addition, the cov(EBV, GEBV) can be written as:
A general expression for a covariance after selection [13] is:
where
In summary, the reliability of GEBV after accounting for selection on EBV can be expressed based on the reliability of GEBV under random selection (
When
The regression coefficient of G or deregressed proofs on GEBV is a criterion for bias in GEBV [2,14,15]. Accordingly, the regression coefficient of G on GEBV after selection by using EBV is written as:
When
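The steps of this subsection can be sketched as follows. This is a reconstruction from the assumptions stated in the text (selection on EBV, uncorrelated prediction errors of EBV and GEBV, and the general expression for a covariance after selection), not necessarily the notation of the original equations. The symbols $r^2_{EBV}$ and $r^2_{GEBV}$ (reliabilities before selection) and the form $k = i(i - x)$, with $i$ the selection intensity and $x$ the standardized truncation point, are assumptions made here.

```latex
\begin{align}
\sigma^2_{G^*} &= \left(1 - k\,r^2_{EBV}\right)\sigma^2_G \\
\rho_{EBV,GEBV} &= r_{EBV}\,r_{GEBV} \\
\sigma^2_{GEBV^*} &= \left(1 - k\,r^2_{EBV}\,r^2_{GEBV}\right) r^2_{GEBV}\,\sigma^2_G \\
\mathrm{cov}^*(G, GEBV) &= \mathrm{cov}(G, GEBV)
  - k\,\frac{\mathrm{cov}(G, EBV)\,\mathrm{cov}(GEBV, EBV)}{\sigma^2_{EBV}}
  = \left(1 - k\,r^2_{EBV}\right) r^2_{GEBV}\,\sigma^2_G \\
r^{2*}_{GEBV} &= \frac{\left[\mathrm{cov}^*(G, GEBV)\right]^2}{\sigma^2_{G^*}\,\sigma^2_{GEBV^*}}
  = r^2_{GEBV}\,\frac{1 - k\,r^2_{EBV}}{1 - k\,r^2_{EBV}\,r^2_{GEBV}} \\
b^*_{G \cdot GEBV} &= \frac{\mathrm{cov}^*(G, GEBV)}{\sigma^2_{GEBV^*}}
  = \frac{1 - k\,r^2_{EBV}}{1 - k\,r^2_{EBV}\,r^2_{GEBV}}
\end{align}
```

Under these assumptions, $k = 0$ (random selection) gives $r^{2*}_{GEBV} = r^2_{GEBV}$ and $b^* = 1$, and $b^* < 1$ whenever $k > 0$ and $r^2_{GEBV} < 1$, which corresponds to overestimation of GEBV.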
Estimating the number of bulls equal to a cow in the reliability of the GEBV of preselected animals in the reference population
Reliability in a reference population after selection was expressed as (1). The reliability of GEBV without selection (
The reliabilities of GEBVs depend on the size of the reference population (nP), the effective number of loci for which effects have to be estimated (nG), and the correlation of the G of a genotyped individual with its phenotypic record (r). In a random sample of the population, the reliability of GEBV or the correlation between GEBV and G (
where λ = nP/nG. Parameter nG depends on the historical effective size of the unselected population (NE) and on the size of the genome, L (in Morgans), and can be estimated as shown in [17]:
When an individual in the reference population is both genotyped and phenotyped, r is equal to the square root of heritability of the trait. Then, the reliability of cows from their own records is:
When the reference population is based on progeny-tested sires, i.e., when sires are genotyped but their offspring are phenotyped, r equals the accuracy of the EBV obtained from progeny testing [5]:
where N is the number of half-sibling progeny on which the EBV is based.
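The quantities defined above can be illustrated with a short Python sketch. The functional forms are assumptions made here rather than formulas confirmed by this text: nG = 2·NE·L / ln(4·NE·L), reliability of GEBV = λr² / (λr² + 1), and r² = N / (N + (4 − h²)/h²) for a progeny-tested sire. The last form does reproduce the values 0.562 and 0.877 quoted in the Discussion for 50 daughters at heritabilities of 0.1 and 0.5. All function names are hypothetical.

```python
import math


def n_independent_segments(n_e: float, genome_morgans: float) -> float:
    """Effective number of loci n_G; assumed form 2*N_E*L / ln(4*N_E*L)."""
    return 2.0 * n_e * genome_morgans / math.log(4.0 * n_e * genome_morgans)


def record_reliability_cow(h2: float) -> float:
    """r^2 for an animal that is genotyped and has its own record: r^2 = h^2."""
    return h2


def record_reliability_bull(h2: float, n_daughters: int) -> float:
    """r^2 for a progeny-tested sire; assumed form N / (N + (4 - h^2) / h^2)."""
    return n_daughters / (n_daughters + (4.0 - h2) / h2)


def gebv_reliability(n_p: float, n_g: float, r2: float) -> float:
    """Reliability of GEBV in an unselected reference population.

    Assumed form lam * r^2 / (lam * r^2 + 1), with lam = n_P / n_G.
    """
    lam = n_p / n_g
    return lam * r2 / (lam * r2 + 1.0)


# Parameters taken from the simulation settings: N_E = 100, L = 30 Morgans.
n_g = n_independent_segments(100, 30)
for h2 in (0.1, 0.3, 0.5):
    r2_bull = record_reliability_bull(h2, 50)
    r2_cow = record_reliability_cow(h2)
    print(
        f"h2={h2}: bull r2={r2_bull:.3f}, "
        f"GEBV reliability bulls={gebv_reliability(10_000, n_g, r2_bull):.3f}, "
        f"cows={gebv_reliability(10_000, n_g, r2_cow):.3f}"
    )
```

For equal reference population sizes, the gap between the bull and cow columns narrows as heritability increases, because r² of a cow's own record rises faster with h² than r² of a progeny test.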
Parameter nP can be transformed from (4):
When the reference population is composed of either bulls (m) or cows (f), parameter nP in (5) can be written as:
where the subscript letters m and f refer to male and female, respectively.
Therefore,
Alternatively, the reliability of GEBV under random selection can be written as (3) by using that of GEBV after preselection, therefore using the subscript letters m and f as defined earlier,
and
Substituting (7) into (6) yields:
The numbers of bulls equal to a cow in regard to the specific reliabilities of the GEBV of animals in the reference population with and without preselection are calculated by using (8) and (6), respectively. Note that
Therefore, the number of bulls equal to a cow in terms of achieving the same reliability of the GEBV of preselected animals in the reference population is:
Note that the number of bulls equal to a cow in terms of the reliability of GEBV without preselection, i.e., k = 0, depends only on the reliability of EBV of the individuals both genotyped and phenotyped in the reference population.
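The following Python sketch illustrates how such a bulls-per-cow ratio can arise. It is a reconstruction under assumed forms, not the formula (9) itself: the reliability of GEBV is taken as λr² / (λr² + 1), and the reliability after preselection as r²(1 − k·r²_EBV) / (1 − k·r²_EBV·r²). Under these assumptions the ratio reduces to r²_f(1 − k_f·r²_EBV,f) / [r²_m(1 − k_m·r²_EBV,m)], which does not contain the reliability of GEBV. The function names and the numerical inputs for k and r²_EBV are hypothetical illustrations.

```python
def required_reference_size(target_rel, n_g, r2_record, k=0.0, r2_ebv=0.0):
    """Reference size n_P that gives reliability `target_rel` AFTER preselection.

    Assumed forms (not taken verbatim from the source):
      rel_after = rel_random * (1 - a) / (1 - a * rel_random), a = k * r2_ebv
      rel_random = lam * r2 / (lam * r2 + 1), lam = n_P / n_G
    """
    a = k * r2_ebv
    # Undo the assumed preselection effect to recover the unselected reliability.
    rel_random = target_rel / (1.0 - a + a * target_rel)
    # Invert the reliability formula for n_P.
    return n_g * rel_random / (r2_record * (1.0 - rel_random))


def bulls_per_cow(r2_cow, r2_bull, k_cow=0.0, r2_ebv_cow=0.0,
                  k_bull=0.0, r2_ebv_bull=0.0):
    """Number of bulls giving the same GEBV reliability as one cow (assumed form)."""
    return (r2_cow * (1.0 - k_cow * r2_ebv_cow)) / (
        r2_bull * (1.0 - k_bull * r2_ebv_bull)
    )


h2 = 0.1
r2_bull = 50 / (50 + (4 - h2) / h2)  # progeny test with 50 daughters

# Without preselection (k = 0) the ratio is r2_cow / r2_bull.
print("no preselection:", round(bulls_per_cow(h2, r2_bull), 3))

# With preselection; k and r2_ebv values below are illustrative only.
ratio = bulls_per_cow(h2, r2_bull, k_cow=0.5, r2_ebv_cow=0.236,
                      k_bull=0.86, r2_ebv_bull=0.3)
print("with preselection:", round(ratio, 3))

# The ratio of required reference sizes is the same for any target reliability.
for target in (0.3, 0.5, 0.7):
    n_m = required_reference_size(target, 639.0, r2_bull, 0.86, 0.3)
    n_f = required_reference_size(target, 639.0, h2, 0.5, 0.236)
    print(target, round(n_m / n_f, 3))
```

In this sketch, stronger preselection of bulls than of cows raises the value of a cow relative to the case without preselection.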
Reliability of GEBV in the reference population consisting of preselected bulls and cows
Using selection index theory, [10] derived the reliability of GEBV explained by markers in a reference population consisting of multiple groups of animals whose phenotypes differ in their information content. We extended selection index theory to a reference population consisting of preselected bulls and cows.
From selection index theory,
where
The increase in reliability gained by moving from a bulls-only reference population to one that includes both bulls and cows is expressed as the difference between the reliability of GEBV in the reference population consisting of preselected bulls and cows and that in the reference population consisting of preselected bulls only, i.e.,
The number of bulls corresponding to this increase (nPm−Δ) can be computed applying (5):
We designated the number of cows in the reference population as nPf. The increase in reliability is derived from adding cows into the reference population that consists of bulls only. That is, the number of bulls equal to a cow in regard to the reliability of GEBV in the reference population consisting of preselected bulls and cows (Cowvalue_m+f) is:
Simulation data
We preselected animals in the reference population according to the EBV of the trait of interest rather than selecting them randomly, to obtain more realistic reference populations [14, 18,19]. Although the animals in a reference population come from several past generations, we simplified this by treating them as a single generation, i.e., they were preselected on EBV, and the reference population was created from the phenotypic data of the preselected bulls’ daughters or from the preselected cows’ own records. When the reference population was based on progeny-tested bulls, the number of daughters per test bull was set to 50 or 100. All test bull candidates were assumed to be preselected according to the parent average (PA). PA was computed by using the EBVs of the sire (from 50 or 100 daughters) and the dam. When the number of daughters per progeny-tested bull was set to 50, PA was calculated from the EBV of a sire with 50 daughters; that is, the number of daughters of the sire in PA was set equal to that of the progeny-tested bull. Note that bulls for progeny testing were preselected from all test bull candidates; they became test bulls after preselection and progeny-tested sires with their daughters’ records (50 or 100) after progeny testing. Heifers were preselected using their PA, and cows were preselected according to the EBV from their own records. After preselection, test bulls, heifers, and cows were used to create the reference population. The reliability of the EBV from a cow’s own record was calculated by selection index theory, in which the reliabilities of PA and of the individual record constituted the index in a manner similar to equation (10), and the number of daughters of the sire in PA was set to 50.
We assumed that the length of the genome was 30 Morgans and that the heritability of the trait of interest was 0.1, 0.3, or 0.5. The historical effective population size was set to 100 animals [5,20]. The preselection percentage on EBV of animals used to create the reference population was set to 5%, 30%, and 100% for males and to 70%, 90%, and 100% for females. When animals were selected randomly, the proportional reduction (k) in the variance of G was set to zero. The reference population size was set to 5,000, 10,000, 20,000, and 40,000.
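The proportional reduction k for each preselection percentage can be sketched in Python as follows. The sketch assumes the standard truncation-selection result k = i(i − x), where x is the standardized truncation point and i the selection intensity for a normally distributed selection criterion. This form and the function name are assumptions made here, not definitions confirmed by the text.

```python
from statistics import NormalDist


def variance_reduction_factor(selected_fraction: float) -> float:
    """k = i * (i - x) for truncation selection of the top `selected_fraction`.

    Assumes a normally distributed selection criterion; returns 0 when all
    animals are kept (random selection).
    """
    if not 0.0 < selected_fraction <= 1.0:
        raise ValueError("selected_fraction must be in (0, 1]")
    if selected_fraction == 1.0:
        return 0.0
    std_normal = NormalDist()
    x = std_normal.inv_cdf(1.0 - selected_fraction)  # truncation point
    i = std_normal.pdf(x) / selected_fraction        # selection intensity
    return i * (i - x)


# Preselection percentages used in the simulation for males and females.
for label, fractions in (("males", (0.05, 0.30, 1.00)),
                         ("females", (0.70, 0.90, 1.00))):
    for p in fractions:
        print(f"{label}: {p:.0%} selected, k = {variance_reduction_factor(p):.3f}")
```

Under this assumption k lies between 0 and 1 and grows as the selected fraction shrinks, so the 5% preselection of males reduces the variance of G far more than the 70% or 90% preselection of females.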
RESULTS
Reliability of GEBV of preselected animals in the reference population
We calculated the reliability of the GEBV of non-preselected animals and the ratio of the reliability of preselected animals to that of non-preselected animals for reference populations composed solely of proven bulls preselected on PA computed by using EBVs of sire (from 50 or 100 daughters) and dam (Table 1). The reliability of the GEBV of cows is shown in Table 2. The reliability of the preselection criterion (PA, EBV from a cow’s own record, or a bull’s progeny test based on 50 daughters) was calculated at three levels of heritability (Table 3). The reliability of GEBV in the bulls-only reference population was the highest among the three reference populations (bulls preselected on PA, heifers preselected on PA, and cows preselected on EBV from their own records). The bulls-only population was particularly superior to the cow population for low-heritability traits (h2 = 0.1), especially when progeny testing was based on 100 daughters. However, the superiority of the reliability associated with the bull reference population decreased as heritability increased, regardless of whether the animals in the reference population were preselected or not.

Table 1. Reliability of GEBV of non-preselected bulls (100% preselection) and the ratio of the reliability of preselected bulls (<100% preselection) to that of non-preselected bulls
In addition, the reliability of GEBV decreased as the intensity of preselection increased (i.e., a decrease in the preselection percentage), and this trend became more conspicuous as heritability increased. This change occurs because the effect of preselection on the reduction of the variance of G increases as heritability increases. The decrease in the reliability of preselected animals compared with that of non-preselected animals became more conspicuous as the reference population size decreased. That is, the effect of preselection on the decrease in reliability became more deleterious as the reference population became smaller.
Bias of GEBV
Regression coefficients of G on GEBV for animals in the reference populations composed solely of proven bulls preselected on PA calculated by using EBVs of sire (from 50 and 100 daughters) and dam are shown in Table 4. The regression coefficients of G on GEBV for cows preselected on EBV are shown in Table 5. Regression coefficients of G on GEBV deviated more from 1 as the intensity of preselection increased, indicating that overestimation of GEBV became more prominent as the intensity of preselection increased. Because the intensity of preselection is higher in bulls than in cows, bias was more problematic in bulls than in cows. When the cow reference population preselected on PA was compared with that preselected by using the EBV from their own records, bias or overestimation of GEBV was greater for cows preselected on the EBV from their own records than for those preselected on PA, because the reliability of EBV from their individual record was greater than that of PA (Table 3). In the same way, bias or overestimation of GEBV was greater for bulls progeny-tested with 100 daughters than for those tested with 50 daughters. That is, bias became more pronounced as the reliability of the preselection criterion for the animals used to create the reference population increased. Bias or overestimation of GEBV was alleviated by increasing the reference population size (Tables 4, 5).
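The trends described for reliability and bias can be reproduced qualitatively with a small Python sketch. It assumes, as a reconstruction from the stated assumptions rather than the authors' exact equations, that the reliability after preselection is r²_GEBV(1 − k·r²_EBV) / (1 − k·r²_EBV·r²_GEBV) and that the regression of G on GEBV is (1 − k·r²_EBV) / (1 − k·r²_EBV·r²_GEBV). The function names and the input values are hypothetical.

```python
def reliability_after_preselection(r2_gebv: float, r2_ebv: float, k: float) -> float:
    """Reliability of GEBV after selection on EBV (assumed form)."""
    a = k * r2_ebv
    return r2_gebv * (1.0 - a) / (1.0 - a * r2_gebv)


def regression_g_on_gebv(r2_gebv: float, r2_ebv: float, k: float) -> float:
    """Regression coefficient of G on GEBV after selection on EBV (assumed form).

    Values below 1 indicate overestimation of GEBV.
    """
    a = k * r2_ebv
    return (1.0 - a) / (1.0 - a * r2_gebv)


# Illustrative inputs only: k = 0.86 (intense preselection), two reliabilities
# of the preselection criterion, and GEBV reliabilities that rise as the
# reference population grows.
for r2_ebv in (0.3, 0.6):
    for r2_gebv in (0.5, 0.7, 0.9):
        rel = reliability_after_preselection(r2_gebv, r2_ebv, k=0.86)
        b = regression_g_on_gebv(r2_gebv, r2_ebv, k=0.86)
        print(f"r2_EBV={r2_ebv}, r2_GEBV={r2_gebv}: "
              f"reliability after={rel:.3f}, regression={b:.3f}")
```

In this sketch the regression coefficient moves toward 1 as the reliability of GEBV increases (larger reference population) and away from 1 as k or the reliability of the preselection criterion increases, matching the direction of the trends reported above.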
Number of bulls equal to a cow for the same reliability
The number of bulls equal to a cow in terms of achieving the same reliability of the GEBV of preselected animals was calculated by (9) (Table 6). This parameter is related solely to the reliability (
Reliability of the GEBV in the reference population comprising both bulls and cows
The combined reliabilities in the reference population composed of 10,000 preselected bulls and 10,000 or 20,000 preselected cows were calculated by (10) and are shown together with the reliabilities of reference populations composed solely of bulls or cows (Table 7). The cows in Table 7 are only those preselected on EBV from their individual records, because the combined reliability was almost equivalent whether heifers were preselected on PA or cows were preselected on the EBV from their own records. The combined reliability increased as the number of cows increased from 10,000 to 20,000, and this trend was more conspicuous for high-heritability traits (h2 = 0.5) than for low-heritability traits (h2 = 0.1). As shown in Tables 2 and 6, this result again confirmed cows’ favorable properties regarding the reliability of high-heritability traits. The contribution of cows to the reliability of the combined population compared with that of a reference population composed of bulls only, i.e., the difference between the combined reliability due to bulls and cows and the reliability due to bulls only, ranged from 0.03 to 0.22. The reliability for a reference population composed solely of either bulls or cows was computed by using (1). The number of bulls equal to a cow in terms of the reliability of the reference population created from both preselected bulls and cows, computed by using (12), coincided with the number computed by using (9). That is, the number of bulls equal to a cow in regard to the reliability of GEBV in the reference population comprising both bulls and cows agreed with the number of bulls equal to a cow in the reference population created solely from bulls or cows in Table 6.
DISCUSSION
Benefit of cows regarding the reliability of GEBV for high-heritability traits
The superiority of the reliability of the GEBV from a bulls-only reference population over the cow population decreased as heritability increased regardless of whether animals in the reference population were preselected (Tables 1, 2). To improve GP, the same individuals should be both genotyped and phenotyped instead of genotyping parents and phenotyping their progeny [5]. In the current study, cows are both genotyped and phenotyped, whereas bulls are genotyped and their progeny are phenotyped. Because the reliability of a cow’s EBV was based on her own record, the increase in reliability concurrent with an increase in heritability was greater for cows than for bulls (Table 3). For example, at a heritability of 0.1, the reliability of a cow’s EBV based on her own record and that of a bull’s EBV based on 50 of his daughters’ records are 0.236 and 0.562, respectively; at a heritability of 0.5, they are 0.604 and 0.877, respectively. Consequently, we consider that genotyping of cows with phenotypes is advantageous for high-heritability traits from the standpoint of increasing the reliability of GEBV. The value of genotyping cows with phenotypes was reduced by increasing the number of daughters per test bull (results not shown), because the reliability of bulls increased with the number of daughters per test bull (Table 1).
The effects of preselection on reliability and bias of GEBV
The effect of preselection on reducing the variance of G increased as heritability increased, thereby decreasing the reliability of the GEBV of preselected animals. However, the reliability of GEBV in the reference population increased as heritability increased even if preselection had been practiced (Tables 1, 2). This result indicates that the effect of the increase in heritability on the increase in the reliability of the EBV of animals by using a cow’s own record or progeny testing of bulls was greater than that on the decrease in reliability due to reduction of the variance of G from preselection.
The effect of preselection on the decreased reliability of GEBV became more deleterious for smaller reference populations (Tables 1, 2). This result is explained by (1). That is, the reliability of GEBV after preselection is written as:
where rGEBV is the accuracy of GEBV in the absence of preselection or under random selection. By contrast, the ratio of reliability after preselection (
The ratio is 1.0 without loss of reliability when
Bias emerged when the cow reference population was under selection, but it decreased as the size of the cow reference population increased (Table 5). The inclusion of cows in the reference population slightly reduced the bias in GEBV [17]. Routine genotyping of heifer calves or yearling heifers can be a cost-effective strategy for enhancing the genetic value of replacement females on commercial dairy farms [21]. That is, basing replacement decisions on GEBV implies that genotyped heifer calves or yearling heifers are included in the reference population. In general, it is easier to increase reference population size by using cows than by using bulls. Expanding the reference population alleviated bias when heritability and intensity of preselection remained constant (Tables 4, 5). This effect occurs because the regression coefficient as a criterion of bias is defined as (
The value of cows compared with that of bulls in terms of the reliability of GEBV
The overall reliability in the reference population comprising both bulls and cows was not the sum of the reliabilities from those containing bulls or cows only (Table 7); this result indicated that marker information between bulls and cows was not independent, which was in agreement with the reliability results obtained from Danish cows and US bulls [26]. That is, the off-diagonal elements in (10) derived from selection index theory for the reference populations containing both bulls and cows were not zero.
The number of bulls equal to a cow in terms of the reliability of GEBV as calculated from (12) in the reference population containing both bulls and cows agreed with the number computed from (9) in a reference population created solely from bulls or cows. This agreement occurs because the increase in reliability due to the addition of cows to a bulls-only population was converted to the increase per head in the bulls-only population, and the numbers of bulls and of cows needed to yield that increase were then compared. Consequently, the number of bulls equal to a cow in terms of the reliability of the combined reference population could be computed by using the simple formula of (9) applied to reference populations created solely from bulls or cows. Cows are, in general, selected much less intensely than bulls; consequently, the effect of preselection on decreased reliability and bias of GEBV would be much smaller for cows than for bulls.
Assumption of parameters
The proportion of genetic variance explained by its markers is influenced by the effective size of the population (NE) and the density at which the genetic analysis covers the genome. The number of independent segments present in the genome is expected to be lower at low (compared with high) NE [27]. Therefore, the accuracy of GP is expected to be higher in a population with a smaller NE than in a population with a larger NE. An NE of 750 was assumed by [8], whereas an NE of 100 was assumed in the current study and by [5,20].
Simulation studies have shown that the accuracy of GEBV decreases slowly over generations when mating is random [1, 28] and more rapidly when selection is considered [29]. Given that recombination breaks up linkage disequilibrium (LD) in both situations, this finding indicates that selection is an important factor in decreasing LD between markers and quantitative trait loci. That is, accurate prediction of GEBV strongly depends on the persistence of LD between markers and quantitative trait loci across generations. However, we used a single generation in the current study to develop a simple formula for assessing the accuracy of GEBV that accounted for the effect of selection instead of accounting for persistent accuracy across generations. Advances in the use of sequence data and gene expression studies would lead to improved persistence of GP and potentially lead to greater reliabilities [12]. The development of methodology for estimating persistent accuracy of GEBV across generations is warranted.
An empirical value for the number of independent chromosome segments could be used in place of nG [6]. A formula for the accuracy of GEBV was also proposed by [27]. The current study used the original formula [6], because the number of bulls equal to a cow in terms of the reliability of GEBV must be compared under the same reliability of GEBV and is written without the term for the reliability of GEBV, as shown in (9). The accuracies of the GEBV in this study are not comparable with the correlations between GEBV and deregressed proofs determined from validation studies of field data [14,30]. The reason for this difference is that the information here is based on LD information alone, whereas the markers used for prediction of GEBV capture information on both genetic relationship and LD [24]. Both the theoretical number of bulls equal to a cow and the actual number derived from field data in terms of reliability warrant further study to validate the developed formula. However, the value of cows in terms of reliability of GEBV in the reference population under selection was simplified to the formula in (9), which is likely to be a useful guideline for creating reference populations containing both bulls and cows.
CONCLUSION
Bias was greater for bulls than for cows, because the intensity of preselection was higher in the bull population. Bias and the reduction in reliability of GEBV due to preselection were alleviated by expanding the reference population, including by increasing the size of the cow reference population even when the size of the bull reference population was held constant. Therefore, cows can contribute to reducing bias and increasing reliability because they make it easier to expand the reference population and because they provide more recent animals than bulls. The number of bulls equal to a cow in terms of achieving the same reliability of the GEBV of preselected animals in the reference population was described by a simple formula (9) composed of the reliability of the EBV of the trait of interest and the intensity and accuracy of preselection, regardless of whether the reference population consists of bulls only, cows only, or both bulls and cows. The generalized formulas presented in this study satisfy the property of invariance and thus serve as a general guideline for creating the reference population under selection for any combination of bull and cow populations.
ACKNOWLEDGMENTS
We thank our teacher, Lin CY, for helpful discussion and constructive comments, and we thank the reviewers for their constructive comments, which helped improve the manuscript.
Notes
CONFLICT OF INTEREST
We certify that there is no conflict of interest with any financial organization regarding the material discussed in the manuscript.