ANIMAL GENETICS |


* Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, West Mains Road, Edinburgh, EH9 3JT, United Kingdom;
and
Clinical Data Inc., Five Science Park, New Haven, CT 06511; and
Maple Leaf Foods Inc., 30 St. Clair Avenue West, Toronto, Ontario M4V 3A2, Canada
| Abstract |
|---|
Key Words: genetic marker, genotyping error, parentage identification, pig, product tracing, single nucleotide polymorphism
| INTRODUCTION |
|---|
Thus, we analyze methods to identify the source, typically the sire or dam, or both, and through that the source herd, of the product using genotypic data on the offspring (the meat sample). Because there may be thousands of breeding animals in a selection and multiplication program, including many relatives (e.g., sibs, cousins, and uncles), precision is needed to enable the parents to be distinguished uniquely.
Single nucleotide polymorphisms are becoming the method of choice in genotyping, and we assume their use. Even so, genotyping errors can arise, and we accommodate these. To make full use of the data, we use maximum likelihood (ML; Weir, 1996), as do package programs such as CERVUS (Marshall et al., 1998; Kalinowski et al., 2007). We do not introduce new theory but develop the methods for the application in livestock product tracing, particularly in pigs. The objective is to achieve a high probability of identifying the correct parent(s), say over 99.99%, and we investigate the numbers of SNP required. Some results may be relevant to other applications of SNP in parentage identification.
| MATERIALS AND METHODS |
|---|
Model
Because the populations from which male and female parents are drawn may well be of different breeds and themselves be crossbred, we allow different gene frequencies between the parent populations and departures from Hardy-Weinberg equilibrium frequencies within them. Consider a locus with alleles A and a, where allele A has frequency $p_m$ and $p_f$ in the male and female parental (i.e., sire and dam) populations and a has frequencies $q_m = 1 - p_m$ and $q_f = 1 - p_f$, respectively. The genotype frequencies in the male population, for example, for AA, Aa, and aa are $g_{m1}$, $g_{m2}$, and $g_{m3}$, respectively, with $g_{m1} = p_m^2$ if they are in Hardy-Weinberg equilibrium.
The expected genotype frequencies in the offspring generation, assuming random mating between the populations, for AA, Aa, and aa, respectively, are described by the following vector:
$$\mathbf{g}_o = \left(p_m p_f,\;\; p_m q_f + q_m p_f,\;\; q_m q_f\right).$$
For sires (dams) of specified genotype, mating to random dams (sires), and for sire × dam genotype pairs, the expected frequencies of offspring genotypes are given in Table 1.
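As an illustration of the quantities just described, the following minimal sketch computes the offspring genotype frequency vector and the sire-to-offspring transmission probabilities that follow from Mendelian inheritance when a sire of given genotype is mated to random dams. The function names and the genotype coding (0 = AA, 1 = Aa, 2 = aa) are assumptions made for this example, not notation from the paper.

```python
def offspring_genotype_freqs(pm, pf):
    """Expected (AA, Aa, aa) frequencies in offspring under random mating
    between a male population (A frequency pm) and a female one (pf)."""
    qm, qf = 1.0 - pm, 1.0 - pf
    return [pm * pf, pm * qf + qm * pf, qm * qf]


def sire_transmission_matrix(pf):
    """T[h][j]: probability that a sire of true genotype h (0=AA, 1=Aa, 2=aa)
    mated to a random dam has an offspring of true genotype j."""
    qf = 1.0 - pf
    return [
        [pf, qf, 0.0],          # AA sire always transmits A
        [pf / 2, 0.5, qf / 2],  # Aa sire transmits A or a with probability 1/2
        [0.0, pf, qf],          # aa sire always transmits a
    ]
```

Weighting the rows of the transmission matrix by the sire genotype frequencies should recover the offspring frequency vector, which gives a quick consistency check.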
The distribution of miscalls can be represented by a table or 3 × 3 matrix E in which elements $e_{ij}$ denote the probability that an individual of real genotype i is assigned genotype j. The matrices are general, but we assume as a reference point and in most examples, unless specified otherwise, that there is the same probability of miscalling each allele as the other one (rows and columns ordered AA, Aa, aa):
$$\mathbf{E} = \begin{pmatrix} 1 - e - \varepsilon^2 & e & \varepsilon^2 \\ e/2 & 1 - e & e/2 \\ \varepsilon^2 & e & 1 - e - \varepsilon^2 \end{pmatrix},$$
where $\varepsilon$ = the allelic error rate and $e = 2\varepsilon(1 - \varepsilon) \approx 2\varepsilon$. For simplicity, we shall refer to e as the genotyping error rate. In sire identification, errors in miscalling heterozygotes as homozygotes cause the most false exclusions, assuming that the probability of calling the wrong homozygote ($\varepsilon^2$, e.g., AA as aa) is very small; therefore, we shall also consider a model where the only errors are in miscalling of heterozygotes [i.e., the first and last rows of E are (1, 0, 0) and (0, 0, 1), respectively]. For generality, and because different procedures may be used for genotyping parents and offspring (for example, the latter may be genotyped in a one-off process using a different assay when a production problem is detected), this matrix may differ among male parents, female parents, and offspring. Hence, we define matrices $\mathbf{E}_m$, $\mathbf{E}_f$, and $\mathbf{E}_o$, respectively.
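A minimal sketch of the two error models described above, assuming independent miscalls of the two alleles with allelic error rate `eps`. The function names are hypothetical and chosen only for this illustration.

```python
def error_matrix(eps):
    """Miscall matrix E[i][j] = P(observed genotype j | true genotype i),
    with genotypes ordered AA, Aa, aa and allelic error rate eps."""
    e = 2 * eps * (1 - eps)  # genotyping error rate, roughly 2*eps
    return [
        [(1 - eps) ** 2, e, eps ** 2],
        [eps * (1 - eps), (1 - eps) ** 2 + eps ** 2, eps * (1 - eps)],
        [eps ** 2, e, (1 - eps) ** 2],
    ]


def heterozygote_only_error_matrix(e):
    """Alternative model: only heterozygotes are miscalled (rate e),
    equally often as either homozygote."""
    return [
        [1.0, 0.0, 0.0],
        [e / 2, 1 - e, e / 2],
        [0.0, 0.0, 1.0],
    ]
```

Each row is a probability distribution over observed genotypes, so a row-sum check is a cheap guard against transcription mistakes when other error structures are substituted.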
Mutations. Germ-line mutations have a similar influence to genotyping errors on the distribution of genotypes of parent-offspring pairs. Thus, if the mutation rate of A to a is uA and of a to A is ua, and mutations occur sufficiently rarely that double mutations can be ignored, the total error rate Eo* recorded in offspring is:

The matrices Em and Ef are unchanged.
Likelihood Calculations
We consider as an example the calculation of likelihoods for detection of the sire. That for dams is essentially the same, and the necessary extensions for detection of a sire and dam pair are given subsequently.
Likelihood when the Putative Sire is Unrelated to the Sire.
The observed frequency (i.e., including possible erroneous genotypes and mutations) of individuals of genotype j at an arbitrary locus in the offspring population, which is also equal to the expected frequency in the offspring of an unrelated sire, is given by $g_j = \sum_i g_{oi} e_{oij}$, an element of the matrix product
$$\mathbf{g} = \mathbf{g}_o \mathbf{E}_o .$$
Hence, the probability that the offspring has observed genotype j and the putative sire has observed genotype i when it is a random male, unrelated (U) to the sire, does not depend on i and is
$$P(i, j \mid U) = g_j .$$
Combining loci, k, which are assumed to be in linkage equilibrium, and now letting $g_{kj}$ be the probability that the offspring has genotype j at locus k, the likelihood of the data given that the putative sire is unrelated to the sire is given by the product
$$L(U) = \prod_k g_{kj} .$$
Likelihood when the Putative Sire is the Sire. Let the element $t_{hj}$ of
$$\mathbf{T} = \begin{pmatrix} p_f & q_f & 0 \\ p_f/2 & 1/2 & q_f/2 \\ 0 & p_f & q_f \end{pmatrix}$$
denote the frequency that a sire of true genotype h has an offspring of true genotype j (elements of Table 1), and so the h, j element of the product $\mathbf{T}\mathbf{E}_o$ denotes the probability that a sire of true genotype h has an offspring of observed genotype j.
The joint probability that, for example, a sire has true genotype h and observed genotype i is given by $g_{sh} e_{shi}$. Hence, the conditional probability that a sire has true genotype h given observed genotype i is
$$c_{sih} = \frac{g_{sh}\, e_{shi}}{\sum_{h'} g_{sh'}\, e_{sh'i}},$$
where $c_{sih}$ denotes an element of a 3 × 3 matrix $\mathbf{C}_s$.
Hence, P(i, j|S), the probability that the offspring has observed genotype j and that the putative sire has observed genotype i when it is the sire, is obtained by summing over all possible combinations of true genotype of sire and true genotype of offspring. It is given by $x_{ij} = \sum_h c_{sih} \sum_l t_{hl}\, e_{olj}$, an element of
$$\mathbf{X} = \mathbf{C}_s \mathbf{T} \mathbf{E}_o .$$
Combining all loci under the same assumptions of independence as above, the likelihood of the data given it is the sire is
$$L(S) = \prod_k x_{kij} .$$
Likelihood Ratio. The likelihood ratio L(S)/L(U) is the relative probability of obtaining the observed genotypes when the putative sire is indeed the sire (S) compared with when it is unrelated (U). The logarithm taken to base 10 of the likelihood ratio (lod) is determined as follows:
$$\mathrm{lod} = \log_{10}\!\left[\frac{L(S)}{L(U)}\right] = \sum_k \left[\log_{10}(x_{kij}) - \log_{10}(g_{kj})\right].$$
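The sire likelihood ratio can be sketched in a few lines of plain Python. This is an illustrative reading of the matrix definitions in this section, not the authors' software: it assumes Hardy-Weinberg genotype frequencies in the sire population, the same error matrix for sires and offspring, no mutation, and genotypes coded 0 = AA, 1 = Aa, 2 = aa. All function names are hypothetical.

```python
import math


def error_matrix(e):
    """Miscall matrix for genotyping error rate e = 2*eps*(1 - eps)."""
    eps = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0  # allelic error rate
    w = eps * eps                                  # wrong-homozygote call
    return [[1 - e - w, e, w], [e / 2, 1 - e, e / 2], [w, e, 1 - e - w]]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def locus_tables(pm, pf, e):
    """Return (g, X) for one locus: g[j] is the observed offspring genotype
    frequency and X[i][j] the probability of observed offspring genotype j
    when the putative sire with observed genotype i is the true sire."""
    qm, qf = 1 - pm, 1 - pf
    E = error_matrix(e)                     # assumed equal for sire and offspring
    gs = [pm * pm, 2 * pm * qm, qm * qm]    # assumes HWE among sires
    go = [pm * pf, pm * qf + qm * pf, qm * qf]
    T = [[pf, qf, 0.0], [pf / 2, 0.5, qf / 2], [0.0, pf, qf]]
    g = [sum(go[i] * E[i][j] for i in range(3)) for j in range(3)]
    Cs = []
    for i in range(3):
        joint = [gs[h] * E[h][i] for h in range(3)]  # P(true h, observed i)
        total = sum(joint)
        Cs.append([v / total for v in joint])         # P(true h | observed i)
    X = matmul(matmul(Cs, T), E)
    return g, X


def lod_score(sire_obs, offspring_obs, loci):
    """Sum of single-locus lods; loci is a list of (pm, pf, e) tuples."""
    lod = 0.0
    for i, j, (pm, pf, e) in zip(sire_obs, offspring_obs, loci):
        g, X = locus_tables(pm, pf, e)
        lod += math.log10(X[i][j]) - math.log10(g[j])
    return lod
```

For a single locus with p = 0.5 and e = 0.005, matching homozygotes give a lod of about 0.30, whereas opposite homozygotes (an apparent exclusion) give about -2, which shows how heavily the assumed error rate penalizes an exclusion.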
Identification of Dams or Both Parents. For identification of dams with unknown sires, the same analysis as outlined above still applies but with Cs, gs, and Es replaced by Cd, gd, and Ed, respectively.
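For identification of parental pairs, there are now 9 possible real and observed combinations of sire and dam genotypes. The transmission probabilities are given by a 9 × 3 matrix H, with elements given in Table 1, ordered as dam genotype nested within sire genotype, where $\mathbf{H}_i$ (3 × 3) defines progeny of sire genotype i:
$$\mathbf{H} = \begin{pmatrix} \mathbf{H}_1 \\ \mathbf{H}_2 \\ \mathbf{H}_3 \end{pmatrix},$$
with, for example,
$$\mathbf{H}_1 = \begin{pmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 0 \\ 0 & 1 & 0 \end{pmatrix}.$$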
For example, the second row of $\mathbf{H}_1$ defines the genotype probabilities for offspring of an AA male mated to an Aa female. The probability that the sire has true genotype h given observed genotype i and that the dam has true genotype y given observed genotype z is given by $c_{sih} c_{dzy}$, elements of the 9 × 9 matrix $\mathbf{C}_s \otimes \mathbf{C}_d$, the direct product of $\mathbf{C}_s$ and $\mathbf{C}_d$. Hence, the probabilities of each combination of observed offspring genotype and observed sire × dam parental genotype pair are given by the elements of the 9 × 3 matrix $\mathbf{X}^*$, where
$$\mathbf{X}^* = (\mathbf{C}_s \otimes \mathbf{C}_d)\, \mathbf{H}\, \mathbf{E}_o .$$
The likelihood, given that these are the true parents, is then the product over loci k of elements of $\mathbf{X}^*_k$, identified by observed genotypes of putative sire (i) and dam (z) and offspring (j):
$$L(S, D) = \prod_k x^*_{k(iz)j} .$$
If both putative sire and dam are unrelated to the offspring, the likelihood is given by L(U) as above, and the log-likelihood ratio by lodSD = log10[L(S,D)] – log10[L(U)]. Further, if for example the sire is known, log10[L(S,D)] – log10[L(S)] is the conditional log-likelihood ratio for the dam.
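The parental-pair calculation can be sketched in the same way. The example below assumes Hardy-Weinberg frequencies in both parental populations, one error matrix shared by sires, dams, and offspring, and genotype coding 0 = AA, 1 = Aa, 2 = aa. Function names are hypothetical, and the direct (Kronecker) product is written out by hand so the block needs only the standard library.

```python
import math


def error_matrix(e):
    """Miscall matrix for genotyping error rate e = 2*eps*(1 - eps)."""
    eps = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0
    w = eps * eps
    return [[1 - e - w, e, w], [e / 2, 1 - e, e / 2], [w, e, 1 - e - w]]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def kron(a, b):
    """Direct product: row 3*i + z, column 3*h + y holds a[i][h] * b[z][y]."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def true_given_observed(g_true, E):
    """C[i][h] = P(true genotype h | observed genotype i)."""
    C = []
    for i in range(3):
        joint = [g_true[h] * E[h][i] for h in range(3)]
        total = sum(joint)
        C.append([v / total for v in joint])
    return C


def pair_transmission_matrix():
    """9 x 3 matrix H, dam genotype nested within sire genotype."""
    t = [1.0, 0.5, 0.0]  # probability of transmitting A for AA, Aa, aa
    return [[t[h] * t[y], t[h] * (1 - t[y]) + (1 - t[h]) * t[y], (1 - t[h]) * (1 - t[y])]
            for h in range(3) for y in range(3)]


def pair_tables(pm, pf, e):
    """Return (g, X_star) for one locus."""
    qm, qf = 1 - pm, 1 - pf
    E = error_matrix(e)
    gs = [pm * pm, 2 * pm * qm, qm * qm]   # assumes HWE among sires
    gd = [pf * pf, 2 * pf * qf, qf * qf]   # assumes HWE among dams
    go = [pm * pf, pm * qf + qm * pf, qm * qf]
    g = [sum(go[i] * E[i][j] for i in range(3)) for j in range(3)]
    C = kron(true_given_observed(gs, E), true_given_observed(gd, E))
    return g, matmul(matmul(C, pair_transmission_matrix()), E)


def lod_pair(sire_obs, dam_obs, offspring_obs, loci):
    """lod for a putative sire and dam pair; loci is a list of (pm, pf, e)."""
    lod = 0.0
    for i, z, j, (pm, pf, e) in zip(sire_obs, dam_obs, offspring_obs, loci):
        g, x_star = pair_tables(pm, pf, e)
        lod += math.log10(x_star[3 * i + z][j]) - math.log10(g[j])
    return lod
```

With p = 0.5 and e = 0.005, an AA × AA pair with an AA offspring gives a single-locus lod of about 0.60, roughly twice the value obtained from the sire alone.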
Posterior Probabilities.
If there are a limited number of sires that could possibly be the father, and these have prior probabilities $\pi_m$, the posterior probability that sire M is the true sire is given by $P_M = \pi_M L_M(S) / \sum_m \pi_m L_m(S)$. Approximate values for these prior probabilities may be known: for example, a subset of the sires used in AI may have on average 10 times as many progeny as those used in natural service.
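The posterior calculation is a weighted normalization, sketched below. Because L(U) depends only on the offspring genotype and is therefore the same for every candidate sire, lod scores can be used in place of the likelihoods themselves. The function name is hypothetical, and the priors need only be relative weights.

```python
def posterior_probabilities(lods, priors):
    """Posterior probability that each candidate is the true sire, given
    its lod score and its (possibly unnormalized) prior weight."""
    top = max(lods)  # subtract the largest lod to avoid overflow in 10**lod
    weights = [prior * 10.0 ** (lod - top) for lod, prior in zip(lods, priors)]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, `posterior_probabilities([2.0, 1.0, 1.0], [1, 1, 1])` gives the first candidate a posterior of 100/120, whereas giving the other two candidates a 10-fold greater prior weight makes all three equally probable.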
| RESULTS |
|---|
To assess the efficacy of the likelihood ratio comparisons to find the correct parent and to judge the numbers of loci to be used as a function of their frequency, we consider the distribution of the likelihood ratios under the 2 hypotheses that the putative parent is or is not the real parent. If we assume loci are unlinked and in linkage equilibrium, it is sufficient to evaluate these quantities for single loci. Thus, we compute the expectations for sire identification,
$E(\mathrm{lod} \mid S)$ and $E(\mathrm{lod} \mid U)$, and the variances of each of these quantities. The expectations were computed by summing the likelihoods over all possible genotypic combinations, weighted by their frequencies according to whether the putative sire was the sire, to obtain E(lod|S), or was unrelated, to obtain E(lod|U). From these we compute
$$E(\mathrm{lod}_S) = E(\mathrm{lod} \mid S) - E(\mathrm{lod} \mid U),$$
the expected difference in lod score between the sire and a putative sire unrelated to it. The actual parameters used in computing their expectation to weight the likelihoods and those assumed in computing the likelihoods do not have to be the same; therefore, the effects of wrong assumptions can be checked.
Results are given in Table 2 for cases in which the gene frequency is the same in the male and female populations, $p_m = p_f = p$, and only for $p \le 0.5$, because those for gene frequency p and 1 – p are the same. In each case, there is assumed to be a genotyping error rate (e) of 0.005 (based on Kuruvilla et al., 2006). As an example, consider results for sire identification with p = 0.5 for a single locus, for which E(lod|S) = 0.0718 and E(lod|U) = –0.2126, with the difference E(lodS) = 0.2845 (Table 2). With n loci, these expected values increase n-fold; for example, with 100 loci the expectations are 7.18, –21.26, and 28.45, respectively. The standard deviation of the lod score given that the putative sire is indeed the sire, $\sigma(\mathrm{lod} \mid S) = \sqrt{E[(\mathrm{lod} \mid S)^2] - [E(\mathrm{lod} \mid S)]^2}$, is 0.148, and correspondingly, if it is not the sire, it is 0.683 (Table 2). The standard deviation is much greater for the nonparent, because exclusions occur with binomial frequencies, and each such exclusion generates a large negative lod (with magnitude depending on the assumed genotyping error rate). There is no meaningful standard deviation of the difference of the lod under the 2 hypotheses, because the calculations are conditional on the hypothesis. The standard deviation $\sigma$ increases in proportion to $\sqrt{n}$ for n loci, and thus, $E(\mathrm{lod} \mid S)/\sigma(\mathrm{lod} \mid S)$ increases in proportion to $\sqrt{n}$. For example, with 100 loci of frequency 0.5, the mean and standard deviation of the lod, if the putative sire is the true sire, are 7.18 ± 1.48; therefore, a negative lod score would be very improbable. Loci at intermediate frequency show the greatest values of E(lodS), presumably because these have the greatest heterozygosity and also the greatest chance of showing an exclusion.
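The single-locus expectations and standard deviations quoted above can be reproduced with a short enumeration over the 9 observed sire × offspring genotype combinations. This sketch assumes equal gene frequencies in both parental populations, Hardy-Weinberg proportions, and the same error matrix for sires and offspring. The function names are hypothetical.

```python
import math


def error_matrix(e):
    """Miscall matrix for genotyping error rate e = 2*eps*(1 - eps)."""
    eps = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0
    w = eps * eps
    return [[1 - e - w, e, w], [e / 2, 1 - e, e / 2], [w, e, 1 - e - w]]


def lod_moments(p, e):
    """Return {'S': (mean, sd), 'U': (mean, sd)} of the single-locus lod
    when the putative sire is the true sire (S) or unrelated (U)."""
    q = 1 - p
    E = error_matrix(e)
    gs = [p * p, 2 * p * q, q * q]                 # true sire genotypes (HWE)
    go = gs                                        # offspring, since pm = pf = p
    T = [[p, q, 0.0], [p / 2, 0.5, q / 2], [0.0, p, q]]
    obs_s = [sum(gs[h] * E[h][i] for h in range(3)) for i in range(3)]
    g = [sum(go[i] * E[i][j] for i in range(3)) for j in range(3)]
    X = []
    for i in range(3):
        c = [gs[h] * E[h][i] / obs_s[i] for h in range(3)]
        X.append([sum(c[h] * T[h][l] * E[l][j] for h in range(3) for l in range(3))
                  for j in range(3)])
    out = {}
    for hyp in ("S", "U"):
        m1 = m2 = 0.0
        for i in range(3):
            for j in range(3):
                # frequency of this observed combination under the hypothesis
                weight = obs_s[i] * (X[i][j] if hyp == "S" else g[j])
                lod = math.log10(X[i][j] / g[j])
                m1 += weight * lod
                m2 += weight * lod * lod
        out[hyp] = (m1, math.sqrt(m2 - m1 * m1))
    return out
```

Calling `lod_moments(0.5, 0.005)` gives means of about 0.0718 and -0.2126 and standard deviations of about 0.149 and 0.683, in line with the values cited from Table 2. For n independent loci the means scale by n and the standard deviations by the square root of n.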
Consequences of Errors in Assumptions
The results in Table 2 and Figure 1 were obtained under the assumption that the gene frequencies used in the calculations were indeed those in the parental populations, but in practice and certainly early in use of a parent assignment scheme, these frequencies may not be known accurately. Examples are given in Table 3 for frequencies in error by 0.1 or so, more than would be expected for all but the smallest initial genotype samples (under 50 or so individuals). It is seen that, although the expected likelihood ratios under both the unrelated and sire assumptions are influenced by the assumed gene frequency, the difference in E(lodS) is little affected over these examples. It seems possible for E(lodS) to be greater with incorrect assumptions on frequency, basically because similarity of genotype for a rare allele of father and son is expected to be rarer than it actually is. In general, however, the methodology appears to be robust.
(lod|U), but a low assumed rate increases the magnitude (of negative values for the former) of both quantities. This is not necessarily beneficial in discrimination, however, because the lower the assumed error rate, the greater the penalty attached to an apparent exclusion. The distribution of lod|U is skewed downwards by these tiny probabilities, because the apparent exclusion completely outweighs all of the data coming from other loci which have nonexcluded genotype combinations. If the genotyping error rate is assumed to be zero, when actually it is not, then E(lod|U) becomes infinite.
Distinguishing Parents from Relatives of the Parents
Consider a relative of the sire, with relationship R to it (strictly the numerator relationship, or twice the coancestry). Because the offspring shares 1 gene with the true parent, the probability is R that it shares 1 gene with the relative of the parent. Thus, R = 0.5 for a full sib or parent of the sire (i.e., an uncle or grandparent of the progeny), and R = 0.25 for a half-sib of the sire (i.e., a half-uncle of the progeny).
The expectation of the lod for a putative sire related to the true sire is therefore the weighted average of the lod for a sire and for an unrelated individual:
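$$E(\mathrm{lod} \mid R) = R\,E(\mathrm{lod} \mid S) + (1 - R)\,E(\mathrm{lod} \mid U),$$
and therefore,
$$E(\mathrm{lod} \mid S) - E(\mathrm{lod} \mid R) = (1 - R)\left[E(\mathrm{lod} \mid S) - E(\mathrm{lod} \mid U)\right].$$
Similarly, the standard deviation of the lod is given by that for the mix of 2 groups:
$$\sigma(\mathrm{lod} \mid R) = \sqrt{R\left\{\sigma^2(\mathrm{lod} \mid S) + [E(\mathrm{lod} \mid S)]^2\right\} + (1 - R)\left\{\sigma^2(\mathrm{lod} \mid U) + [E(\mathrm{lod} \mid U)]^2\right\} - [E(\mathrm{lod} \mid R)]^2}.$$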
Results given in Table 2 and Figure 1 can be used accordingly. Basically, it is harder to distinguish a sire from a full uncle or grandsire than from a half-uncle than, in turn, an unrelated individual.
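The mixture argument for relatives can be sketched as follows, taking the single-locus means and standard deviations under the sire and unrelated hypotheses as inputs and treating loci as independent. The function name and argument order are assumptions made for this illustration.

```python
import math


def relative_lod_moments(r, mean_s, sd_s, mean_u, sd_u, n_loci=1):
    """Mean and standard deviation of the lod over n_loci independent loci
    for a putative sire with relationship r to the true sire, treating each
    locus as a mixture of the 'sire' (weight r) and 'unrelated' (1 - r) cases."""
    mean_r = r * mean_s + (1 - r) * mean_u
    second_moment = (r * (sd_s ** 2 + mean_s ** 2)
                     + (1 - r) * (sd_u ** 2 + mean_u ** 2))
    var_r = second_moment - mean_r ** 2
    return n_loci * mean_r, math.sqrt(n_loci * var_r)
```

With the single-locus values for a gene frequency of 0.4 quoted later in the Discussion (0.0701 ± 0.154 for the sire and -0.1981 ± 0.654 for an unrelated male), `relative_lod_moments(0.5, 0.0701, 0.154, -0.1981, 0.654, n_loci=100)` returns about -6.40 ± 4.94, the figures given there for full brothers.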
The likelihood equations could be set up for a series of relationships of the putative sire (say, uncle) to the true sire, the elements of the joint probabilities of the genotype of the uncle and offspring merely being weighted sums based on the conditional probabilities given in Table 1. In practice, when sire tracing, this would require a series of likelihood calculations to be undertaken, and it is not clear that it would be worthwhile as a standard process. It would, of course, be possible to estimate the relationship of the putative sire to the true sire by maximizing the likelihood as a function of R, but that is outside the scope of this paper and a subject of study elsewhere for many years (Thompson, 1975; Weir et al., 2006).
Risk of Misidentification
A practical question of importance is the number of loci needed to be sure that the correct sire is identified, or a correct conclusion drawn that the sire of the offspring tested is absent from the database. Approximate values can be obtained using the expectations of the lod scores, E(lod|S) and E(lod|U), and their standard deviations assuming normality under the central limit theorem. These turn out to be rather poor estimates and only a guide when probabilities of wrong assignment become very small and tail values of the distribution are used, because lod scores at individual loci are very heavily skewed downwards, particularly for unrelated individuals due to (real or false) exclusions.
Hence, we used simulation in a practical breeding situation for a large-scale production system of pigs, in which the true sire was assumed to have 4 full brothers, 20 half-brothers, and 976 unrelated males in the data set of possible sires. Parental genotypes were sampled, with genotyping errors as appropriate. To reduce simulation at the expense of increasing sampling error because of correlated observations, but so as not to cause bias, 1,000 offspring were generated from each sire using unrelated dams, and the lod was computed for each putative sire-offspring combination. In each case, the putative sire with the greatest lod score was designated the sire. Results were run with a range of gene frequencies and numbers of loci and are shown in Table 5, for a total of 50,000 replicates, comprising 50 sires each with 1,000 offspring. In line with the calculations of expected lod scores (Table 2), risks of wrong assignment are smaller at intermediate gene frequencies but do not differ greatly over the frequency range 0.3 to 0.7. There it is shown that although, for example, with only 50 loci genotyped there is a high probability that the sire will be misidentified, the probability is less than 1% with 100 loci having intermediate gene frequencies or 150 loci having minor allele frequencies of 0.2 or more.
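A much reduced version of such a simulation is sketched below. It is not the authors' program: the candidate set contains only the true sire and unrelated males (no full or half-brothers), all loci share one gene frequency, Hardy-Weinberg proportions are assumed, and one error matrix applies to sires and offspring. All function names and default parameter values are hypothetical.

```python
import math
import random


def error_matrix(e):
    """Miscall matrix for genotyping error rate e (must be > 0 here)."""
    eps = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0
    w = eps * eps
    return [[1 - e - w, e, w], [e / 2, 1 - e, e / 2], [w, e, 1 - e - w]]


def lod_table(p, e):
    """table[i][j]: single-locus lod for observed sire i, observed offspring j."""
    q = 1 - p
    E = error_matrix(e)
    gs = [p * p, 2 * p * q, q * q]
    T = [[p, q, 0.0], [p / 2, 0.5, q / 2], [0.0, p, q]]
    g = [sum(gs[h] * E[h][j] for h in range(3)) for j in range(3)]
    table = []
    for i in range(3):
        joint = [gs[h] * E[h][i] for h in range(3)]
        total = sum(joint)
        table.append([
            math.log10(sum(joint[h] / total * T[h][l] * E[l][j]
                           for h in range(3) for l in range(3)) / g[j])
            for j in range(3)])
    return table


def draw(rng, probs):
    """Sample an index from a probability vector."""
    u, acc = rng.random(), 0.0
    for k, pr in enumerate(probs):
        acc += pr
        if u < acc:
            return k
    return len(probs) - 1


def misassignment_rate(n_loci, n_candidates, n_offspring, p=0.5, e=0.005, seed=1):
    """Fraction of offspring for which the top-lod candidate is not the sire.
    Candidate 0 is the true sire; all others are unrelated to it."""
    rng = random.Random(seed)
    q = 1 - p
    E = error_matrix(e)
    gs = [p * p, 2 * p * q, q * q]
    table = lod_table(p, e)
    true = [[draw(rng, gs) for _ in range(n_loci)] for _ in range(n_candidates)]
    obs = [[draw(rng, E[t]) for t in cand] for cand in true]
    wrong = 0
    for _ in range(n_offspring):
        off = []
        for t in true[0]:
            # genotype code = number of a alleles; the sire passes a with
            # probability t/2, and a random dam passes a with probability q
            a_sire = 1 if rng.random() < t / 2 else 0
            a_dam = 1 if rng.random() < q else 0
            off.append(draw(rng, E[a_sire + a_dam]))
        lods = [sum(table[c[k]][off[k]] for k in range(n_loci)) for c in obs]
        if max(range(n_candidates), key=lods.__getitem__) != 0:
            wrong += 1
    return wrong / n_offspring
```

A call such as `misassignment_rate(n_loci=50, n_candidates=200, n_offspring=200)` gives a rough feel for how the risk changes with the number of loci. Because close relatives are omitted, the rates from this sketch will understate those for the population structure described above.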
| DISCUSSION |
|---|
A simple alternative to the use of ML to identify the parent(s) is to eliminate others by simple exclusion of genotype combinations; for example, an AA sire cannot produce an aa offspring. If there are no genotyping errors or mutations and there is random mating, the probability of exclusion of a randomly sampled unrelated individual is the frequency of different homozygote pairs, $P_X = p_m^2(1 - p_f)^2 + (1 - p_m)^2 p_f^2$. If $p_m > 0.5$, $P_X$ is maximized when $p_f = 0$, otherwise when $p_f = 1$. With equal frequencies in the lines, $P_X = 2p^2q^2$, which has a maximum of 0.125 at p = 0.5, is 0.1152 if p = 0.4, but is only 0.0162 if p = 0.1. The probability of exclusion for a related sire, for example, is simply a factor 1 – R of that for an unrelated individual, assuming there are no genotyping errors.
To enable comparison with results from the likelihood analysis, consider the example in which the sire has 4 full brothers, 20 half-brothers, and 976 unrelated males in the data set. The probability that at least 1 putative sire is, by chance, not excluded if 100 loci each with frequency 0.4 are genotyped without error is given by:
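$$1 - \left[1 - (1 - 0.5\,P_X)^{100}\right]^{4} \left[1 - (1 - 0.75\,P_X)^{100}\right]^{20} \left[1 - (1 - P_X)^{100}\right]^{976}, \qquad P_X = 2(0.4)^2(0.6)^2 = 0.1152 .$$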
In comparison, there is only a 0.27% chance that the lod of the sire will be exceeded by that of any of the other putative sires, allowing for a genotyping error rate of 0.005 (Table 5).
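The exclusion calculation can be sketched as follows, assuming no genotyping errors, independent loci, and independence among candidates, with the per-locus exclusion probability for a relative taken as (1 – R) times that for an unrelated male. The function names and the `(count, R)` grouping are assumptions made for this illustration.

```python
def exclusion_probability(pm, pf):
    """Single-locus probability that a random unrelated male is excluded:
    observed opposite homozygotes in putative sire and offspring."""
    return pm ** 2 * (1 - pf) ** 2 + (1 - pm) ** 2 * pf ** 2


def prob_any_not_excluded(px, n_loci, groups):
    """Probability that at least one wrong candidate escapes exclusion at
    every locus. groups is a list of (count, R) pairs, where R is the
    relationship of those candidates to the true sire."""
    all_excluded = 1.0
    for count, r in groups:
        escapes = (1 - (1 - r) * px) ** n_loci  # one candidate never excluded
        all_excluded *= (1 - escapes) ** count
    return 1 - all_excluded


px = exclusion_probability(0.4, 0.4)
risk = prob_any_not_excluded(px, 100, [(4, 0.5), (20, 0.25), (976, 0.0)])
```

Under these assumptions `risk` comes out at roughly 0.018, with most of the contribution coming from the 4 full brothers.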
An important complication that we have incorporated quantitatively within the ML analysis is the effect of errors in genotyping or mutation, which can lead to false exclusions and thus potential wrong decisions if genotypic exclusion is the sole criterion. We consider just the most important class, calling heterozygotes as homozygotes, simply assuming Aa is equally likely to be called as aa or AA, and ignoring second order terms. If the error rate is $e_m$ in sires and $e_o$ in offspring, the exclusion rate for a sire, using Table 1, comprises terms for AA sires with Aa offspring miscalled as aa ($\tfrac{1}{2} p_m^2 q_f e_o$), Aa sires miscalled as AA with aa offspring ($\tfrac{1}{2} p_m q_m q_f e_m$), etc., totaling $\tfrac{1}{2} p_m q_f (p_m e_o + q_m e_m) + \tfrac{1}{2} q_m p_f (q_m e_o + p_m e_m)$. With equal gene frequencies in males and females, and equal error rates in parents and offspring, this reduces to pqe, or, for example, to 0.00125 for p = 0.5 and e = 0.005, the error rate assumed in the ML analysis (Table 2). There is thus a probability of about 12% of at least 1 false exclusion of the true sire when 100 SNP are tested. For gene frequencies closer to 0 or 1, the probability of a false exclusion of a sire increases relative to that of the correct exclusion of a nonsire [e/(2pq) approximately].
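The false-exclusion arithmetic can be checked with a short sketch. It uses the heterozygote-miscall model described above and ignores second order terms, as in the text. The function name is hypothetical.

```python
def false_exclusion_rate(pm, pf, em, eo):
    """Approximate single-locus probability that the true sire appears to be
    excluded because a heterozygote (sire or offspring) is miscalled as a
    homozygote; em and eo are the error rates in sires and offspring."""
    qm, qf = 1 - pm, 1 - pf
    return (0.5 * pm * qf * (pm * eo + qm * em)
            + 0.5 * qm * pf * (qm * eo + pm * em))


per_locus = false_exclusion_rate(0.5, 0.5, 0.005, 0.005)
at_least_one = 1 - (1 - per_locus) ** 100  # over 100 independent SNP
```

Here `per_locus` is 0.00125 (equal to pqe) and `at_least_one` is about 0.118, the "about 12%" quoted for 100 SNP.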
Practical Issues
An alternative to using SNP would be to adopt highly polymorphic markers such as microsatellites or variable number tandem repeats. Although many more SNP are needed to achieve the same level of precision in identifying relatives (Brenner, 1999; Gill, 2001; Vignal et al., 2002), many costs are the same, such as collection of blood-tissue samples and DNA extraction. The developments in technology are such that relative costs and accuracies of high-throughput genotyping are changing such that SNP are becoming the method of choice. An added advantage is the binary nature of SNP, with 2 alleles, such that error rates are relatively low. Thus, SNP have been proposed for use in paternity testing (Heaton et al., 2002) and product tracing (Plastow et al., 2007) in livestock.
Unsurprisingly, the analysis shows that SNP at intermediate gene frequencies in the sire population are most informative and become increasingly so the nearer are loci in the female line to homozygosity. The fall in efficiency (exemplified by Table 2 for E(lod) or Table 5 for risk) is small, however, over the range 0.3 < p < 0.7. The numbers of loci required to be almost certain to identify the correct parent depend on the number of putative competitors and particularly on how many are closely related to the sire (e.g., full sibs or father). Even so, with large numbers of competitors including relatives, 100 to 150 loci seem sufficient, and simulations of data sets of populations having small numbers of breeding individuals so that there are many relationships among the sires illustrate that correct identification occurs with under 150 loci.
The analysis has not considered relationships among putative sires other than full or half-sibs. In practice, there are almost certain to be additional relationships, for example cousins and second cousins, and various multiple relationships. Hence, the calculations of both the exclusion probabilities (Dodds et al., 1996; Double et al., 1997) and the risk of wrong assignment are conservative and so may be greater in practice. Because the risk of wrong assignment is greatest for the closest relatives, full sibs (Table 5), particularly when many loci are used, it is unlikely that further relatives among putative sires will greatly affect the probabilities of misidentification of the sire unless these are more highly related than full sibs. A clone of the sire could not, of course, be distinguished from the sire by genotyping. In the risk calculations, we have, however, assumed an equal prior probability for all sires, whereas in practice, this is unlikely to be the case. For example, some sires may have been used in AI, and others not, and the pattern of use over time of sires may be known.
In our analysis of risk, we have assumed the genotype of the actual sire is present in the data set and designated as sire the putative sire with the greatest lod score, regardless of the actual and relative sizes of the lod scores of the winner and runners-up. In practice, we advocate presenting lod scores for at least the top few candidates for perusal and decision as in CERVUS (Marshall et al., 1998; Kalinowski et al., 2007; http://www.fieldgenetics.com/pages/aboutCervus_Functions.jsp). In the context of product tracing, if sufficiently important, possible ties or marginally convincing assignments could be resolved by further genotyping using an additional set of SNP (or a search for the absent fathers).
Simulations can be undertaken, as in CERVUS for example, on the actual data structure to assess the confidence of the parentage assignment. In the more limited context of product tracing, in which only a single relationship is to be identified and the population is known, the information on the standard deviations of lod scores would seem to provide a guide in most cases. We use the results on the distributions of lod scores, which can readily be computed for any population. As a simple example, assume all gene frequencies are 0.4; taking results from Table 2 for a single locus, the mean and standard deviation of lod|S are 0.0701 ± 0.154. Thus, for 100 such markers, lod|S ~7.01 ± 1.54; similarly, for unrelated individuals, lod|U ~–19.81 ± 6.54; and for full brothers, lod|R ~–6.40 ± 4.94. As noted previously, the distribution of the lod scores, even with large numbers of loci, departs from normality, in particular being skewed downwards, but even so, the normal distribution provides a guide. Assuming no relative could be closer than a full sib, then an observed lod in excess of 5 is strong evidence that the sire has been identified, whereas a negative lod is strong evidence that the sire has not been found.
Although genotyping errors should, of course, be minimized, the analysis shows that the ML method is reasonably robust, providing the error rate is not assumed to be too small (when false exclusions generate very high negative lod scores). The actual error rate to be used will depend on the genotyping platform and assignment rate in use. Therefore, the base figure of 0.005 mainly used should be regarded as no more than a typical figure.
The application considered here to product tracing is simpler than other potential applications, because the kind of relationship to be identified is limited solely to parents. In this case, there is clearly more information to be gained from a limited number of loci by identifying both parents, even though the actual matings made are not recorded (e.g., artificial insemination using mixed semen), because the likelihood ratios are much larger (Table 2). If sire tracing is sufficient, it may be much more economical, because fewer animals need to be genotyped.
| Footnotes |
|---|
2 Corresponding author: w.g.hill{at}ed.ac.uk
Received for publication May 17, 2007. Accepted for publication May 19, 2008.
| LITERATURE CITED |
|---|