J. Anim Sci.
J. Anim Sci. 2008. 86:2508-2517. doi:10.2527/jas.2007-0276
© 2008 American Society of Animal Science


ANIMAL GENETICS

Parentage identification using single nucleotide polymorphism genotypes: Application to product tracing

W. G. Hill*, B. A. Salisbury† and A. J. Webb‡

* Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, West Mains Road, Edinburgh, EH9 3JT, United Kingdom; † Clinical Data Inc., Five Science Park, New Haven, CT 06511; and ‡ Maple Leaf Foods Inc., 30 St. Clair Avenue West, Toronto, Ontario M4V 3A2, Canada


    Abstract
Identification of relatives using SNP markers has many possible applications. One is as a route to tracing a food product such as a cut of meat back to its source of origin by identifying the parents of the animal from which the product was derived. We develop methods for using SNP markers with maximum likelihood, allowing for the possibility of genotyping errors that would cause false exclusions by simpler methods. We use expectations of likelihood ratios to consider how gene frequencies in the parental populations, numbers of loci, and error rates affect accuracy. This is further quantified as the risk, the probability that an incorrect sire is identified from a panel that contains many other putative sires including its relatives, using a breeding structure relevant to pig breeding. This appears to be a straightforward and potentially effective means of product tracing.

Key Words: genetic marker • genotyping error • parentage identification • pig • product tracing • single nucleotide polymorphism


    INTRODUCTION
Marker genotypes are used to identify relatives in many applications (Weir et al., 2006). Examples include identification and paternity analysis in livestock (Heaton et al., 2002), identifying family members in nonpedigreed natural populations (Marshall et al., 1998), and in forensic studies to identify disaster victims or find suspects via the genotype of a relative on the database (Brenner and Weir, 2003; Bieber et al., 2006). In tracing food products, genotyping can be used to identify breed of origin and, further, to track a piece of meat with specific desirable or undesirable properties back to the parents of the animals and thus farm of origin (Hayes et al., 2005; Plastow et al., 2007). Tracing could also be used in a breeding program as a way to select on desirable traits, for example meat quality, identified at final product level.

Thus, we analyze methods to identify the source, typically the sire or dam, or both, and through that the source herd, of the product using genotypic data on the offspring (the meat sample). Because there may be thousands of breeding animals in a selection and multiplication program, including many relatives (e.g., sibs, cousins, and uncles), precision is needed to enable the parents to be distinguished uniquely.

Single nucleotide polymorphisms are becoming the method of choice in genotyping, and we assume their use. Even so, genotyping errors can arise, and we accommodate these. To make full use of the data, we use maximum likelihood (ML; Weir, 1996), as do package programs such as CERVUS (Marshall et al., 1998; Kalinowski et al., 2007). We do not introduce new theory but develop the methods for the application in livestock product tracing, particularly in pigs. The objective is to achieve a high probability of identifying the correct parent(s), say over 99.99%, and we investigate the numbers of SNP required. Some results may be relevant to other applications of SNP in parentage identification.


    MATERIALS AND METHODS
Animal Care and Use Committee approval was not obtained for this study because no animals were used.

Model

Because the populations from which male and female parents are drawn may well be of different breeds and themselves be crossbred, we allow different gene frequencies between the parent populations and departures from Hardy-Weinberg equilibrium frequencies within them. Consider a locus with alleles A and a, where allele A has frequency p_m and p_f in the male and female parental (i.e., sire and dam) populations and a has frequencies q_m = 1 – p_m and q_f = 1 – p_f, respectively. The genotype frequencies in the male population, for example, for AA, Aa, and aa are g_m1, g_m2, and g_m3, respectively, with g_m1 = p_m^2 if they are in Hardy-Weinberg equilibrium.

The expected genotype frequencies in the offspring generation, assuming random mating between the populations, for AA, Aa, and aa, respectively, are described by the following vector:


\[ \mathbf{g}_o = \begin{pmatrix} p_m p_f \\ p_m q_f + q_m p_f \\ q_m q_f \end{pmatrix} \]

For sires (dams) of specified genotype, mating to random dams (sires), and for sire × dam genotype pairs, the expected frequencies of offspring genotypes are given in Table 1.
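
As an illustration of these expected frequencies, the following minimal Python sketch computes the offspring genotype vector for random mating between the two populations and the transmission probabilities for a sire of known genotype mated to random dams. The function names `offspring_freqs` and `sire_transmission` are hypothetical, and the transmission matrix is written from ordinary Mendelian segregation rather than copied from Table 1, which is not reproduced here.

```python
def offspring_freqs(pm, pf):
    """Expected (AA, Aa, aa) frequencies in offspring under random mating
    between a sire population (frequency of A = pm) and a dam population (pf)."""
    qm, qf = 1.0 - pm, 1.0 - pf
    return [pm * pf, pm * qf + qm * pf, qm * qf]


def sire_transmission(pf):
    """Rows: true sire genotype AA, Aa, aa; columns: offspring AA, Aa, aa,
    for matings to random dams whose frequency of A is pf."""
    qf = 1.0 - pf
    return [
        [pf, qf, 0.0],          # AA sire always transmits A
        [pf / 2, 0.5, qf / 2],  # Aa sire transmits A or a with equal chance
        [0.0, pf, qf],          # aa sire always transmits a
    ]


g_o = offspring_freqs(0.5, 0.2)
T = sire_transmission(0.2)
print(g_o)
for row in T:
    print(row)
```

Weighting the rows of `T` by Hardy-Weinberg sire frequencies recovers `g_o`, which is a convenient consistency check on both functions.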


Table 1. Frequency distribution of offspring genotypes according to genotype of parents and whether the parental genotype is known or a random unknown
 
Genotyping Errors. Single nucleotide polymorphism genotyping is not devoid of error, and miscalls can arise with automatic or manual systems, so the possibility of genotyping error and indeed of mutation has to be included in the calculations (Sobel et al., 2002). Otherwise, a putative parent with homozygous genotype AA, for example, could be excluded if the offspring sample was incorrectly called as aa (Table 1), regardless of genotypes at other loci.

The distribution of miscalls can be represented by a table or 3 × 3 matrix E in which elements e_ij denote the probability that an individual of real genotype i is assigned genotype j. The matrices are general, but we assume as a reference point and in most examples, unless specified otherwise, that there is the same probability of miscalling each allele as the other one:


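\[ \mathbf{E} = \begin{pmatrix} (1-\delta)^2 & 2\delta(1-\delta) & \delta^2 \\ \delta(1-\delta) & 1 - 2\delta(1-\delta) & \delta(1-\delta) \\ \delta^2 & 2\delta(1-\delta) & (1-\delta)^2 \end{pmatrix} = \begin{pmatrix} 1 - e - \delta^2 & e & \delta^2 \\ e/2 & 1 - e & e/2 \\ \delta^2 & e & 1 - e - \delta^2 \end{pmatrix} \]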

where δ = the allelic error rate and e = 2δ(1 – δ) ≈ 2δ. For simplicity, we shall refer to e as the genotyping error rate. In sire identification, errors in miscalling heterozygotes as homozygotes cause the most false exclusions, assuming that the probability of calling the wrong homozygote (δ^2, e.g., AA as aa) is very small; therefore, we shall also consider a model where the only errors are in miscalling of heterozygotes [i.e., the first and last rows of E are (1, 0, 0) and (0, 0, 1), respectively]. For generality, and because different procedures may be used for genotyping parents and offspring (for example, the latter may be genotyped in a one-off process using a different assay when a production problem is detected), this matrix may differ among the male parent, female parent, and offspring groups. Hence, we define matrices E_m, E_f, and E_o, respectively.
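
The error model can be sketched in a few lines of Python. This is a minimal illustration under the stated assumption that each allele is miscalled independently with probability δ; the function names are hypothetical, and the heterozygote row of the heterozygotes-only variant is assumed here to be the same as in the reference model.

```python
import math


def allelic_error_rate(e):
    """Invert e = 2*delta*(1 - delta), taking the root with delta < 0.5."""
    return (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0


def error_matrix(e, heterozygotes_only=False):
    """3 x 3 matrix E; rows = true genotype, columns = called genotype,
    both ordered AA, Aa, aa."""
    d = allelic_error_rate(e)
    if heterozygotes_only:
        # Variant in which homozygotes are always called correctly.
        return [[1.0, 0.0, 0.0], [e / 2, 1.0 - e, e / 2], [0.0, 0.0, 1.0]]
    return [
        [(1 - d) ** 2, e, d * d],
        [e / 2, 1.0 - e, e / 2],
        [d * d, e, (1 - d) ** 2],
    ]


E = error_matrix(0.005)
for row in E:
    print([round(x, 6) for x in row])
```

Each row sums to 1, and with e = 0.005 the chance of calling the wrong homozygote is of order 10^-5, which is why the heterozygotes-only variant is a close approximation.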

Mutations. Germ-line mutations have a similar influence to genotyping errors on the distribution of genotypes of parent-offspring pairs. Thus, if the mutation rate of A to a is u_A and of a to A is u_a, and mutations occur sufficiently rarely that double mutations can be ignored, the total error rate E_o* recorded in offspring is:


Formula

The matrices Em and Ef are unchanged.

Likelihood Calculations

We consider as an example the calculation of likelihoods for detection of the sire. That for dams is essentially the same, and the necessary extensions for detection of a sire and dam pair are given subsequently.

Likelihood when the Putative Sire is Unrelated to the Sire. The observed frequency (i.e., including possible erroneous genotypes and mutations) of individuals of genotype j at an arbitrary locus in the offspring population, which is also equal to the expected frequency in the offspring of an unrelated sire, is given by g_j = Σ_i g_oi e_oij, an element of the matrix product


\[ \mathbf{g}^{\mathsf T} = \mathbf{g}_o^{\mathsf T} \mathbf{E}_o \]

Hence, the probability that the offspring has observed genotype j and the putative sire has observed genotype i when it is a random male, unrelated (U) to the sire, does not depend on i and is


\[ P(i, j \mid U) = g_j \]

Combining loci, k, which are assumed to be in linkage equilibrium, and now letting g_kj be the probability that the offspring has genotype j at locus k, the likelihood of the data given that the putative sire is unrelated to the sire is given by the product


\[ L(U) = \prod_k g_{kj} \]

Likelihood when the Putative Sire is the Sire. Let the element t_hj of

\[ \mathbf{T} = \begin{pmatrix} p_f & q_f & 0 \\ p_f/2 & 1/2 & q_f/2 \\ 0 & p_f & q_f \end{pmatrix} \]

denote the frequency that a sire of true genotype h has an offspring of true genotype j (elements of Table 1), and so the h, j element of the product TE_o denotes the probability that a sire of true genotype h has an offspring of observed genotype j.

The joint probability that, for example, a sire has true genotype h and observed genotype i is given by g_sh e_shi. Hence, the conditional probability that a sire has true genotype h given observed genotype i is

\[ c_{sih} = \frac{g_{sh}\, e_{shi}}{\sum_{h'} g_{sh'}\, e_{sh'i}} \]

where c_sih denotes an element of a 3 × 3 matrix C_s.

Hence, P(i, j|S), the probability that the offspring has observed genotype j and that the putative sire has observed genotype i when it is the sire, is obtained by summing over all possible combinations of true genotype of sire and true genotype of offspring. It is given by x_ij = Σ_h c_sih Σ_l t_hl e_olj, an element of

\[ \mathbf{X} = \mathbf{C}_s \mathbf{T} \mathbf{E}_o \]

Combining all loci under the same assumptions of independence as above, the likelihood of the data given it is the sire is


\[ L(S) = \prod_k x_{kij} \]

Likelihood Ratio. The likelihood ratio L(S)/L(U) is the relative probability of obtaining the observed genotypes when the putative sire is indeed the sire (S) compared with when it is unrelated (U). The logarithm taken to base 10 of the likelihood ratio (lod) is determined as follows:


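\[ \mathrm{lod} = \log_{10} \frac{L(S)}{L(U)} = \sum_k \left( \log_{10} x_{kij} - \log_{10} g_{kj} \right) \]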
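
The likelihood calculations for sire identification can be sketched as follows. This is a minimal illustration, not the authors' software: it assumes Hardy-Weinberg proportions in the sire population, the same error matrix for sires and offspring, and the allele-wise error model described above; the names `locus_tables` and `lod_score` are hypothetical. It forms X = C_s T E_o and g = g_o E_o per locus and sums log10 x_kij – log10 g_kj over loci.

```python
import math

AA, Aa, aa = 0, 1, 2  # genotype codes used as matrix indices


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def error_matrix(e):
    d = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0  # allelic error rate
    return [[(1 - d) ** 2, e, d * d], [e / 2, 1 - e, e / 2], [d * d, e, (1 - d) ** 2]]


def locus_tables(pm, pf, e):
    """Return (X, g) for one locus: X[i][j] is the probability that the
    offspring is called j given the true sire is called i; g[j] is the
    frequency of called genotype j among offspring of unrelated sires."""
    qm, qf = 1 - pm, 1 - pf
    E = error_matrix(e)                     # assumed equal for sires and offspring
    gs = [pm * pm, 2 * pm * qm, qm * qm]    # true sire genotypes (Hardy-Weinberg)
    go = [pm * pf, pm * qf + qm * pf, qm * qf]
    T = [[pf, qf, 0.0], [pf / 2, 0.5, qf / 2], [0.0, pf, qf]]
    called = [sum(gs[h] * E[h][i] for h in range(3)) for i in range(3)]
    C = [[gs[h] * E[h][i] / called[i] for h in range(3)] for i in range(3)]
    X = matmul(matmul(C, T), E)
    g = [sum(go[l] * E[l][j] for l in range(3)) for j in range(3)]
    return X, g


def lod_score(sire, offspring, loci):
    """loci: list of (pm, pf, e) per locus; sire, offspring: called genotypes."""
    total = 0.0
    for i, j, (pm, pf, e) in zip(sire, offspring, loci):
        X, g = locus_tables(pm, pf, e)
        total += math.log10(X[i][j]) - math.log10(g[j])
    return total


loci = [(0.5, 0.5, 0.005)] * 4
offspring = [AA, Aa, aa, AA]
print(lod_score([AA, Aa, Aa, AA], offspring, loci))  # compatible at every locus
print(lod_score([AA, Aa, AA, AA], offspring, loci))  # apparent exclusion at locus 3
```

Because the error rate is nonzero, the apparent exclusion at the third locus lowers the lod by a finite amount rather than ruling the candidate out.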

Identification of Dams or Both Parents. For identification of dams with unknown sires, the same analysis as outlined above still applies but with Cs, gs, and Es replaced by Cd, gd, and Ed, respectively.

For identification of parental pairs, there are now 9 possible real and observed combinations of sire and dam genotypes. The transmission probabilities are given by a 9 × 3 matrix H, with elements given in Table 1, ordered as dam genotype nested within sire genotype, where H_i (3 × 3) defines progeny of sire genotype i:

\[ \mathbf{H} = \begin{pmatrix} \mathbf{H}_1 \\ \mathbf{H}_2 \\ \mathbf{H}_3 \end{pmatrix} \]

with, for example,

\[ \mathbf{H}_1 = \begin{pmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 0 \\ 0 & 1 & 0 \end{pmatrix} \]

For example, the second row of H_1 defines the genotype probabilities for offspring of an AA male mated to an Aa female. The probability that the sire has true genotype h given observed genotype i and that the dam has true genotype y given observed genotype z is given by c_sih c_dzy, elements of the 9 × 9 matrix C_s ⊗ C_d, the direct product of C_s and C_d. Hence, the probabilities of each combination of observed offspring genotype and observed sire × dam parental genotype pair are given by the elements of the 9 × 3 matrix X*, where

\[ \mathbf{X}^* = (\mathbf{C}_s \otimes \mathbf{C}_d)\, \mathbf{H}\, \mathbf{E}_o \]

The likelihood, given that these are the true parents, is then the product over loci k of elements of X*_k, identified by observed genotypes of putative sire (i) and dam (z) and offspring (j):

\[ L(S, D) = \prod_k x^*_{k(iz)j} \]

If both putative sire and dam are unrelated to the offspring, the likelihood is given by L(U) as above, and the log-likelihood ratio by lodSD = log10[L(S,D)] – log10[L(U)]. Further, if for example the sire is known, log10[L(S,D)] – log10[L(S)] is the conditional log-likelihood ratio for the dam.
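
For parental pairs, the single-locus table can be sketched in the same way. The code below is an illustrative reconstruction under stated assumptions (Hardy-Weinberg proportions in both parental populations, one error matrix for all three groups, hypothetical function names); it builds H from Mendelian segregation, forms the direct product of C_s and C_d, and multiplies through by the offspring error matrix.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def error_matrix(e):
    d = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0
    return [[(1 - d) ** 2, e, d * d], [e / 2, 1 - e, e / 2], [d * d, e, (1 - d) ** 2]]


def true_given_called(g, E):
    """C[i][h]: probability of true genotype h given called genotype i."""
    called = [sum(g[h] * E[h][i] for h in range(3)) for i in range(3)]
    return [[g[h] * E[h][i] / called[i] for h in range(3)] for i in range(3)]


def kron(a, b):
    """Direct (Kronecker) product of two square matrices held as nested lists."""
    n = len(b)
    size = len(a) * n
    return [[a[i // n][j // n] * b[i % n][j % n] for j in range(size)]
            for i in range(size)]


def pair_transmission():
    """9 x 3 matrix H; rows ordered as dam genotype nested within sire genotype."""
    prob_A = [1.0, 0.5, 0.0]  # chance that an AA, Aa, aa parent transmits A
    return [[s * d, s * (1 - d) + (1 - s) * d, (1 - s) * (1 - d)]
            for s in prob_A for d in prob_A]


def pair_table(pm, pf, e):
    E = error_matrix(e)
    gs = [pm * pm, 2 * pm * (1 - pm), (1 - pm) ** 2]
    gd = [pf * pf, 2 * pf * (1 - pf), (1 - pf) ** 2]
    Cs, Cd = true_given_called(gs, E), true_given_called(gd, E)
    return matmul(matmul(kron(Cs, Cd), pair_transmission()), E)


H = pair_transmission()
X_star = pair_table(0.5, 0.5, 0.005)
row = X_star[3 * 0 + 2]  # sire called AA (index 0), dam called aa (index 2)
print([round(x, 4) for x in row])
```

The row for an AA sire and aa dam puts nearly all probability on heterozygous offspring, with the small remainder arising only from genotyping errors.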

Posterior Probabilities. If there are a limited number of sires that could possibly be the father, and these have prior probabilities Π_m, the posterior probability that sire M is the true sire is given by ρ_M = Π_M L_M(S)/Σ_m Π_m L_m(S). Approximate values for these prior probabilities may be known: for example, a subset of the sires used in AI may have on average 10 times as many progeny as those used in natural service.
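
A short sketch of the posterior calculation follows. It works from lod scores rather than raw likelihoods, which is equivalent here because L(U) depends only on the offspring genotypes and so cancels between candidates; the function name and the example lod scores and prior weights are hypothetical.

```python
def posterior_probabilities(lods, priors):
    """Posterior probability of each candidate sire, proportional to
    prior * 10**lod. Subtracting the largest lod avoids overflow."""
    top = max(lods)
    weights = [prior * 10.0 ** (lod - top) for lod, prior in zip(lods, priors)]
    total = sum(weights)
    return [w / total for w in weights]


# Three candidates; the second is an AI sire given 10 times the prior weight.
lods = [7.0, 6.5, -3.0]
priors = [1.0, 10.0, 1.0]
post = posterior_probabilities(lods, priors)
print([round(x, 4) for x in post])
```

In this made-up example the AI sire has the larger posterior probability despite the smaller lod, which shows how unequal priors can change the ranking of close candidates.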


    RESULTS
Properties of Likelihood Ratios

To assess the efficacy of the likelihood ratio comparisons to find the correct parent and to judge the numbers of loci to be used as a function of their frequency, we consider the distribution of the likelihood ratios under the 2 hypotheses that the putative parent is or is not the real parent. If we assume loci are unlinked and in linkage equilibrium, it is sufficient to evaluate these quantities for single loci. Thus, we compute the expectations for sire identification,


\[ E(\mathrm{lod} \mid S) \quad \text{and} \quad E(\mathrm{lod} \mid U) \]

, and the variances of each of these quantities. The expectations were computed by summing the likelihoods over all possible genotypic combinations, weighted by their frequencies according to whether the putative sire was the sire, to obtain E(lod|S), or was unrelated, to obtain E(lod|U). From these we compute:


\[ E(\mathrm{lodS}) = E(\mathrm{lod} \mid S) - E(\mathrm{lod} \mid U) \]

, the expected difference in lod score between the sire and a putative sire unrelated to it. The actual parameters used in computing their expectation to weight the likelihoods and those assumed in computing the likelihoods do not have to be the same; therefore, the effects of wrong assumptions can be checked.

Results are given in Table 2 for cases in which the gene frequency is the same in the male and female populations, p_m = p_f = p, and only for p ≤ 0.5, because those for gene frequency p and 1 – p are the same. In each case, there is assumed to be a genotyping error rate (e) of 0.005 (based on Kuruvilla et al., 2006). As an example, consider results for sire identification with p = 0.5 for a single locus, for which E(lod|S) = 0.0718 and E(lod|U) = –0.2126, with the difference E(lodS) = 0.2845 (Table 2). With n loci, these expected values increase n-fold; for example, with 100 loci the expectations are 7.18, –21.26, and 28.45, respectively. The standard deviation of the lod score given that the putative sire is indeed the sire, σ(lod|S) = √{E[(lod|S)^2] – [E(lod|S)]^2}, is 0.148, and correspondingly, if it is not the sire, it is 0.683 (Table 2). The standard deviation is much greater for the nonparent, because exclusions occur with binomial frequencies, and each such exclusion generates a large negative lod (with magnitude depending on the assumed genotyping error rate). There is no meaningful standard deviation of the difference of the lod under the 2 hypotheses, because the calculations are conditional on the hypothesis. The standard deviation σ increases in proportion to √n for n loci, and thus, E(lod|S)/σ(lod|S) increases in proportion to √n. For example, with 100 loci of frequency 0.5, the mean and standard deviation of the lod, if the putative sire is the true sire, are 7.18 ± 1.48; therefore, a negative lod score would be very improbable. Loci at intermediate frequency show the greatest values of E(lodS), presumably because these have the greatest heterozygosity and also the greatest chance of showing an exclusion.
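
The single-locus expectations and standard deviations quoted above for p = 0.5 can be checked with a short calculation. The sketch below is an independent reconstruction rather than the authors' code; it assumes p_m = p_f = p, Hardy-Weinberg proportions, the allele-wise error model, and the same error matrix for sires and offspring, and the function name is hypothetical.

```python
import math


def single_locus_lod_moments(p, e):
    """Mean and SD of the single-locus lod when the putative sire is the
    sire (S) or is unrelated (U), for equal gene frequency p in both lines."""
    q = 1.0 - p
    d = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0
    E = [[(1 - d) ** 2, e, d * d], [e / 2, 1 - e, e / 2], [d * d, e, (1 - d) ** 2]]
    T = [[p, q, 0.0], [p / 2, 0.5, q / 2], [0.0, p, q]]
    true = [p * p, 2 * p * q, q * q]
    # Called genotype frequencies; identical for sires and offspring here.
    g = [sum(true[h] * E[h][i] for h in range(3)) for i in range(3)]
    C = [[true[h] * E[h][i] / g[i] for h in range(3)] for i in range(3)]
    TE = [[sum(T[h][l] * E[l][j] for l in range(3)) for j in range(3)] for h in range(3)]
    X = [[sum(C[i][h] * TE[h][j] for h in range(3)) for j in range(3)] for i in range(3)]
    lod = [[math.log10(X[i][j] / g[j]) for j in range(3)] for i in range(3)]

    moments = {}
    for label in ("S", "U"):
        mean = second = 0.0
        for i in range(3):
            for j in range(3):
                # Joint frequency of called sire i and called offspring j.
                weight = g[i] * (X[i][j] if label == "S" else g[j])
                mean += weight * lod[i][j]
                second += weight * lod[i][j] ** 2
        moments[label] = (mean, math.sqrt(second - mean * mean))
    return moments


m = single_locus_lod_moments(0.5, 0.005)
print({k: (round(mu, 4), round(sd, 3)) for k, (mu, sd) in m.items()})
```

Multiplying the means by n and the standard deviations by √n gives the corresponding values for n independent loci of the same frequency.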


Table 2. Expectations and standard deviations of lod scores [=log10(likelihood ratio)] for single loci in which the gene frequencies (p_m, p_f) are the same in the sire and dam populations, for sire identification (upper block) or joint sire and dam identification (lower block)
 
The likelihood ratios depend on the frequencies in both parental populations, even when only 1 parent is to be identified. This is illustrated in Figure 1, which shows E(lodS) as curves for a series of gene frequencies (p_m) plotted against frequency in the dam population (p_f). In general, the differences in lod are maximized when the gene frequency in the dam population is near 0 or 1. This is, presumably, because there is no ambiguity as to the source of an A gene, say, in a heterozygous offspring if the dam line is fixed for aa. (Indeed, this is just an example of the utility of inbreds in test crossing.)


Figure 1. Expected lod scores [=log10(likelihood ratio)] for sire identification, E(lodS), plotted for a range of gene frequencies (pm) in the sire population (lines) and gene frequencies in the dam population (pf). Genotyping error rate (e) is 0.005.

 
The expected lod scores between a parental pair of true sire and dam and an unrelated pair of putative parents differ substantially more than that between just sire and unrelated individuals (Table 2), because the information content is much greater. For example, an AA sire and aa dam can have homozygous progeny only as a result of genotyping errors or mutation.

Consequences of Errors in Assumptions

The results in Table 2 and Figure 1 were obtained under the assumption that the gene frequencies used in the calculations were indeed those in the parental populations, but in practice and certainly early in use of a parent assignment scheme, these frequencies may not be known accurately. Examples are given in Table 3 for frequencies in error by 0.1 or so, more than would be expected for all but the smallest initial genotype samples (under 50 or so individuals). It is seen that, although the expected likelihood ratios under both the unrelated and sire assumptions are influenced by the assumed gene frequency, the difference in E(lodS) is little affected over these examples. It seems possible for E(lodS) to be greater with incorrect assumptions on frequency, basically because similarity of genotype for a rare allele of father and son is expected to be rarer than it actually is. In general, however, the methodology appears to be robust.


Table 3. Influence of errors in estimates of gene frequency (p_m, p_f) on expectations and standard deviations of lod scores [=log10(likelihood ratio)] for sire identification
 
Similarly, genotyping error rates are unlikely to be estimated accurately, certainly for each of the loci used. Some examples are given in Table 4 of the consequences on mean and standard deviation of lod scores of different actual and assumed rates of errors in genotyping (Kuruvilla et al., 2006), but assuming that the same rates apply in sire, dam, and offspring populations. In Table 4, errors in genotyping both homozygotes and heterozygotes are included, as in Tables 2 and 3. It is seen (Table 4) that, for the sire, E(lod|S) is not very sensitive to either the real or assumed error rate, although tending to fall with increased real error rate and rise with increased assumed error rate. Its standard deviation changes more and in the opposite direction; thus, assuming a low error rate leads to an increasing probability of a negative value. For an unrelated individual, the actual rate has little effect on E(lod|U) or σ(lod|U), but a low assumed rate increases the magnitude (of negative values for the former) of both quantities. This is not necessarily beneficial in discrimination, however, because the lower the assumed error rate, the greater the penalty attached to an apparent exclusion. The distribution of lod|U is skewed downwards by these tiny probabilities, because the apparent exclusion completely outweighs all of the data coming from other loci which have nonexcluded genotype combinations. If the genotyping error rate is assumed to be zero, when actually it is not, then E(lod|U) becomes infinite.


Table 4. Influence on expectations and standard deviations of lod scores [=log10(likelihood ratio)] for sire identification of actual and assumed genotyping error rate (e) of heterozygotes and homozygotes, and actual and assumed genotyping error rates in homozygotes with the actual and assumed error rate for heterozygotes of 0.005
 
The relative effect of genotyping errors of homozygotes and heterozygotes is analyzed in Table 4. The heterozygote rate was kept constant at 0.005, and the real and assumed rate in homozygotes varied. The means and standard deviations of the lod scores for sire evaluation are little affected by genotyping errors of homozygotes, basically because they do not lead to exclusions, except in the very rare case that the wrong homozygote is assigned. The effect is greater when assigning 2 parents (data not shown), in which falsely reading as Aa an AA offspring of an AA × AA mating leads to an apparent exclusion.

Distinguishing Parents from Relatives of the Parents

Consider a relative of the sire, with relationship R to it (strictly the numerator relationship, or twice the coancestry). Because the offspring shares 1 gene with the true parent, the probability is R that it shares 1 gene with the relative of the parent. Thus, R = 0.5 for a full sib or parent of the sire (i.e., an uncle or grandparent of the progeny), and R = 0.25 for a half-sib of the sire (i.e., a half-uncle of the progeny).

The expectation of the lod for a putative sire related to the true sire is therefore the weighted average of the lod for a sire and for an unrelated individual:


\[ E(\mathrm{lod} \mid R) = R\, E(\mathrm{lod} \mid S) + (1 - R)\, E(\mathrm{lod} \mid U) \]

and therefore,


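\[ E(\mathrm{lod} \mid S) - E(\mathrm{lod} \mid R) = (1 - R)\, [E(\mathrm{lod} \mid S) - E(\mathrm{lod} \mid U)] \]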

Similarly, the standard deviation of the lod is given by that for the mix of 2 groups:


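\[ \sigma(\mathrm{lod} \mid R) = \sqrt{R\, \sigma^2(\mathrm{lod} \mid S) + (1 - R)\, \sigma^2(\mathrm{lod} \mid U) + R(1 - R)\, [E(\mathrm{lod} \mid S) - E(\mathrm{lod} \mid U)]^2} \]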

Results given in Table 2 and Figure 1 can be used accordingly. Basically, it is harder to distinguish a sire from a full uncle or grandsire of the offspring than from a half-uncle, which in turn is harder than distinguishing it from an unrelated individual.
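
The mixture calculation for a relative of the sire can be illustrated numerically. The sketch below uses the single-locus values for gene frequency 0.4 that are quoted in the Discussion (mean and standard deviation 0.0701 and 0.154 for the sire, –0.1981 and 0.654 for an unrelated male) and treats the lod of a relative as a two-component mixture with weights R and 1 – R; the function name is hypothetical.

```python
import math


def relative_lod_moments(r, mean_s, sd_s, mean_u, sd_u, n_loci=1):
    """Mean and SD of the lod for a putative sire with relationship r to the
    true sire, as a mixture of the sire (weight r) and unrelated (1 - r) cases,
    scaled to n_loci independent loci with the same single-locus moments."""
    mean = r * mean_s + (1 - r) * mean_u
    var = r * sd_s ** 2 + (1 - r) * sd_u ** 2 + r * (1 - r) * (mean_s - mean_u) ** 2
    return n_loci * mean, math.sqrt(n_loci * var)


# Full brother of the sire (R = 0.5), 100 loci with gene frequency 0.4.
mean_fb, sd_fb = relative_lod_moments(0.5, 0.0701, 0.154, -0.1981, 0.654, n_loci=100)
# Half-brother of the sire (R = 0.25), same loci.
mean_hb, sd_hb = relative_lod_moments(0.25, 0.0701, 0.154, -0.1981, 0.654, n_loci=100)
print(round(mean_fb, 2), round(sd_fb, 2))
print(round(mean_hb, 2), round(sd_hb, 2))
```

The full-brother result agrees with the figure of about –6.40 ± 4.94 given in the Discussion, and the half-brother mean lies between that and the unrelated mean, as the text describes.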

The likelihood equations could be set up for a series of relationships of the putative sire (say, uncle) to the true sire, the elements of the joint probabilities of the genotype of the uncle and offspring merely being weighted sums based on the conditional probabilities given in Table 1. In practice, when sire tracing, this would require a series of likelihood calculations to be undertaken, and it is not clear that it would be worthwhile as a standard process. It would, of course, be possible to estimate the relationship of the putative sire to the true sire by maximizing the likelihood as a function of R, but that is outside the scope of this paper and a subject of study elsewhere for many years (Thompson, 1975; Weir et al., 2006).

Risk of Misidentification

A practical question of importance is the number of loci needed to be sure that the correct sire is identified, or a correct conclusion drawn that the sire of the offspring tested is absent from the database. Approximate values can be obtained using the expectations of the lod scores, E(lod|S) and E(lod|U), and their standard deviations assuming normality under the central limit theorem. These turn out to be rather poor estimates and only a guide when probabilities of wrong assignment become very small and tail values of the distribution are used, because lod scores at individual loci are very heavily skewed downwards, particularly for unrelated individuals due to (real or false) exclusions.

Hence, we used simulation in a practical breeding situation for a large-scale production system of pigs, in which the true sire was assumed to have 4 full brothers, 20 half-brothers, and 976 unrelated males in the data set of possible sires. Parental genotypes were sampled, with genotyping errors as appropriate. To reduce simulation at the expense of increasing sampling error because of correlated observations, but so as not to cause bias, 1,000 offspring were generated from each sire using unrelated dams, and the lod was computed for each putative sire-offspring combination. In each case, the putative sire with the greatest lod score was designated the sire. Results were run with a range of gene frequencies and numbers of loci and are shown in Table 5, for a total of 50,000 replicates, comprising 50 sires each with 1,000 offspring. In line with the calculations of expected lod scores (Table 2), risks of wrong assignment are smaller at intermediate gene frequencies but do not differ greatly over the frequency range 0.3 to 0.7. There it is shown that although, for example, with only 50 loci genotyped there is a high probability that the sire will be misidentified, the probability is less than 1% with 100 loci having intermediate gene frequencies or 150 loci having minor allele frequencies of 0.2 or more.
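
A much reduced version of such a simulation can be sketched as follows. It is not the authors' program: it considers only the true sire and his full brothers, uses far fewer families and offspring, assumes equal gene frequency p in all populations and the allele-wise error model, and all function names, default values, and the random seed are hypothetical choices for illustration.

```python
import math
import random


def lod_table(p, e):
    """Single-locus lod[i][j] for called sire genotype i and called offspring
    genotype j (0 = AA, 1 = Aa, 2 = aa), with the same p and e throughout."""
    q = 1.0 - p
    d = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0
    E = [[(1 - d) ** 2, e, d * d], [e / 2, 1 - e, e / 2], [d * d, e, (1 - d) ** 2]]
    T = [[p, q, 0.0], [p / 2, 0.5, q / 2], [0.0, p, q]]
    hw = [p * p, 2 * p * q, q * q]
    g = [sum(hw[h] * E[h][i] for h in range(3)) for i in range(3)]
    C = [[hw[h] * E[h][i] / g[i] for h in range(3)] for i in range(3)]
    TE = [[sum(T[h][l] * E[l][j] for l in range(3)) for j in range(3)] for h in range(3)]
    X = [[sum(C[i][h] * TE[h][j] for h in range(3)) for j in range(3)] for i in range(3)]
    return [[math.log10(X[i][j] / g[j]) for j in range(3)] for i in range(3)]


def simulate_risk(n_loci, p=0.5, e=0.005, n_families=20, n_offspring=50,
                  n_full_sibs=4, seed=1):
    """Fraction of offspring for which a full brother of the sire has a
    strictly greater lod than the true sire."""
    rng = random.Random(seed)
    d = (1.0 - math.sqrt(1.0 - 2.0 * e)) / 2.0
    lod = lod_table(p, e)

    def gamete(g):   # 1 if allele A is transmitted by a parent of genotype g
        return 1 if rng.random() < (2 - g) / 2 else 0

    def founder():   # random genotype drawn from the population
        return 2 - ((rng.random() < p) + (rng.random() < p))

    def observe(g):  # miscall each allele independently with probability d
        alleles = [1 if g <= 1 else 0, 1 if g == 0 else 0]
        called = [a if rng.random() >= d else 1 - a for a in alleles]
        return 2 - sum(called)

    wrong = total = 0
    for _ in range(n_families):
        grandsire = [founder() for _ in range(n_loci)]
        granddam = [founder() for _ in range(n_loci)]
        # Index 0 is the true sire; the others are his full brothers.
        brothers = [[2 - (gamete(gs) + gamete(gd)) for gs, gd in zip(grandsire, granddam)]
                    for _ in range(1 + n_full_sibs)]
        observed = [[observe(g) for g in b] for b in brothers]
        for _ in range(n_offspring):
            # Offspring of the true sire and a random, unrelated dam.
            child = [observe(2 - (gamete(g) + (rng.random() < p))) for g in brothers[0]]
            scores = [sum(lod[i][j] for i, j in zip(cand, child)) for cand in observed]
            wrong += max(scores[1:]) > scores[0]
            total += 1
    return wrong / total


risk_10 = simulate_risk(10)
risk_150 = simulate_risk(150)
print(risk_10, risk_150)
```

With so few replicates the estimates are rough, and very small risks cannot be resolved at all, which is why the runs reported here use tens of thousands to a million replicates.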


Table 5. Frequency from 50,000 replicates in which the lod score [=log10(likelihood ratio)] for one or more of its 4 full sibs, 20 half-sibs, or 976 unrelated males exceeded the lod of the true sire for different actual and assumed gene frequencies (p = p_m = p_f)
 
Further, if the number of loci is large, such that the probability of incorrect assignment is small, an incorrect sire assignment would usually be to a full sib (see Table 5, results for 100, 150, and 200 loci), even though full sibs comprised only 4% of the putative sires in these examples. On this basis, further simulations were run with many more replicates (1,000,000, comprising 1,000 sires each with 1,000 offspring) and more loci, but considering only 4 full sibs in addition to the true sire, such that more accurate estimates of the risk could be obtained when it was low (Table 6). With 250 loci and minor allele frequencies of 0.25 or more, the probability of wrong identification is less than 10^-5. Some values from Tables 5 and 6 are plotted together in Figure 2, which shows that the risk of wrong assignment falls approximately linearly as -log(number of loci).


Table 6. Frequency from 1,000,000 replicates in which the lod score [=log10(likelihood ratio)] for the sire was exceeded by that for 1 or more of its full sibs
 

Figure 2. Probability of incorrect sire identification, log10(P), as a function of number of loci and gene frequencies in the sire and dam populations (p), for p = 0.2 (greatest risk), 0.3, 0.4, and 0.5 (least risk). Model as in Table 5.

 
Poor estimates of genotyping error rates or of gene frequencies do not greatly increase the risk of incorrect assignment of the sire (Table 7), as anticipated from the analysis of expected lod (Table 4). Indeed, as might be hoped, the real values have a greater influence than do the assumed values of these parameters.


Table 7. Effects of errors in assumed gene frequencies (p, same in male and female parental populations) and genotyping error rates (e) on the frequency in which the lod score [=log10(likelihood ratio)] of the true sire was exceeded by the lod for 1 or more of its 4 full sibs, 20 half-sibs, or 976 unrelated males
 

    DISCUSSION
Exclusions

A simple alternative to the use of ML to identify the parent(s) is to eliminate others by simple exclusion of genotype combinations; for example, an AA sire cannot produce an aa offspring. If there are no genotyping errors or mutations and there is random mating, the probability of exclusion of a randomly sampled unrelated individual is the frequency of different homozygote pairs, P_X = p_m^2(1 – p_f)^2 + (1 – p_m)^2 p_f^2. If p_m > 0.5, P_X is maximized when p_f is 0, otherwise when p_f = 1. With equal frequencies in the lines, P_X = 2p^2q^2, which has a maximum of 0.125 at p = 0.5, is 0.1152 if p = 0.4, but is only 0.0162 if p = 0.1. The probability of exclusion for a putative sire related to the true sire, for example, is simply a factor 1 – R of that for an unrelated individual, assuming there are no genotyping errors.
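
The exclusion probabilities quoted above are easily checked. The following short sketch evaluates P_X for the three equal-frequency cases and for a dam line fixed for the alternative allele; the function name is hypothetical.

```python
def exclusion_probability(pm, pf):
    """Single-locus probability that an unrelated male is excluded:
    sire AA with offspring aa, or sire aa with offspring AA."""
    return pm ** 2 * (1 - pf) ** 2 + (1 - pm) ** 2 * pf ** 2


for p in (0.5, 0.4, 0.1):
    print(p, round(exclusion_probability(p, p), 4))

# Dam line fixed for a: every AA male is then excluded by an aa offspring.
print(round(exclusion_probability(0.7, 0.0), 4))
```

With the dam line fixed, P_X equals p_m^2, far above the 0.125 maximum attainable when the two lines have equal gene frequencies.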

To enable comparison with results from the likelihood analysis, consider the example in which the sire has 4 full brothers, 20 half-brothers, and 976 unrelated males in the data set. The probability that at least 1 putative sire is, by chance, not excluded if 100 loci each with frequency 0.4 are genotyped without error is given by:


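\[ 1 - \left[ 1 - (1 - P_X)^{100} \right]^{976} \left[ 1 - \left(1 - \tfrac{1}{2} P_X\right)^{100} \right]^{4} \left[ 1 - \left(1 - \tfrac{3}{4} P_X\right)^{100} \right]^{20}, \qquad P_X = 0.1152 \]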

In comparison, there is only a 0.27% chance that the lod of the sire will be exceeded by that of any of the other putative sires, allowing for a genotyping error rate of 0.005 (Table 5).
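
The exclusion-only calculation for this example can be sketched as below. It assumes that candidates escape exclusion independently of one another and that a candidate with relationship R to the sire is excluded at each locus with probability (1 – R)P_X; the function name is hypothetical, and the result is an independent evaluation rather than a figure taken from the paper.

```python
def prob_some_candidate_not_excluded(px, n_loci, groups):
    """groups: list of (number of candidates, relationship R to the true sire).
    Returns the chance that at least one wrong candidate escapes exclusion
    at every locus, assuming error-free genotypes."""
    all_excluded = 1.0
    for count, r in groups:
        escapes = (1.0 - (1.0 - r) * px) ** n_loci  # one candidate, all loci
        all_excluded *= (1.0 - escapes) ** count
    return 1.0 - all_excluded


px = 2 * 0.4 ** 2 * 0.6 ** 2  # equal gene frequency 0.4 in both lines
groups = [(976, 0.0), (4, 0.5), (20, 0.25)]  # unrelated, full brothers, half-brothers
risk = prob_some_candidate_not_excluded(px, 100, groups)
print(round(risk, 4))
```

Under these assumptions the result is between 1 and 2%, several times the 0.27% obtained from the likelihood analysis, and most of it comes from the 4 full brothers.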

An important complication that we have incorporated quantitatively within the ML analysis is the effect of errors in genotyping or mutation, which can lead to false exclusions and thus potential wrong decisions if genotypic exclusion is the sole criterion. We consider just the most important class, calling heterozygotes as homozygotes, simply assuming Aa is equally likely to be called as aa or AA, and ignoring second order terms. If the error rate is e_m in sires and e_o in offspring, the exclusion rate for a sire, using Table 1, comprises terms for AA sires with Aa offspring miscalled as aa (1/2 p_m^2 q_f e_o), Aa sires miscalled as AA with aa offspring (1/2 p_m q_m q_f e_m), etc., totaling 1/2 p_m q_f(p_m e_o + q_m e_m) + 1/2 q_m p_f(q_m e_o + p_m e_m). With equal gene frequencies in males and females, and equal error rates in parents and offspring, this reduces to pqe, or, for example, to 0.00125 for p = 0.5 and e = 0.005, the error rate assumed in the ML analysis (Table 2). There is thus a probability of about 12% that the true sire is falsely excluded by at least 1 SNP when 100 SNP are tested. For gene frequencies closer to 0 or 1, the probability of a false exclusion of a sire increases relative to that of the correct exclusion of a nonsire [approximately e/(2pq)].
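
The false exclusion rate can be checked numerically with the first-order expression given above. The sketch below evaluates it for p = 0.5 and e = 0.005, accumulates it over 100 SNP assuming independent loci, and compares it with the correct exclusion rate of a nonsire; the function name is hypothetical.

```python
def false_exclusion_rate(pm, pf, em, eo):
    """First-order single-locus probability that the true sire is excluded
    because a heterozygote (sire or offspring) is miscalled as a homozygote."""
    qm, qf = 1 - pm, 1 - pf
    return 0.5 * pm * qf * (pm * eo + qm * em) + 0.5 * qm * pf * (qm * eo + pm * em)


p, e = 0.5, 0.005
per_snp = false_exclusion_rate(p, p, e, e)       # reduces to p*q*e
any_of_100 = 1 - (1 - per_snp) ** 100            # at least one false exclusion
ratio = per_snp / (2 * p ** 2 * (1 - p) ** 2)    # relative to P_X = 2p^2q^2
print(per_snp, round(any_of_100, 3), ratio)
```

The ratio equals e/(2pq), so it grows as the gene frequency moves away from 0.5 and the product pq shrinks.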

Practical Issues

An alternative to using SNP would be to adopt highly polymorphic markers such as microsatellites or variable number tandem repeats. Although many more SNP are needed to achieve the same level of precision in identifying relatives (Brenner, 1999; Gill, 2001; Vignal et al., 2002), many costs are the same, such as collection of blood-tissue samples and DNA extraction. The developments in technology are such that relative costs and accuracies of high-throughput genotyping are changing such that SNP are becoming the method of choice. An added advantage is the binary nature of SNP, with 2 alleles, such that error rates are relatively low. Thus, SNP have been proposed for use in paternity testing (Heaton et al., 2002) and product tracing (Plastow et al., 2007) in livestock.

Unsurprisingly, the analysis shows that SNP at intermediate gene frequencies in the sire population are most informative and become increasingly so the nearer are loci in the female line to homozygosity. The fall in efficiency (exemplified by Table 2 for E(lod) or Table 5 for risk) is small, however, over the range 0.3 < p < 0.7. The numbers of loci required to be almost certain to identify the correct parent depend on the number of putative competitors and particularly on how many are closely related to the sire (e.g., full sibs or father). Even so, with large numbers of competitors including relatives, 100 to 150 loci seem sufficient, and simulations of data sets of populations having small numbers of breeding individuals so that there are many relationships among the sires illustrate that correct identification occurs with under 150 loci.

The analysis has not considered relationships among putative sires other than full or half-sibs. In practice, there are almost certain to be additional relationships, for example cousins and second cousins, and various multiple relationships. Hence, the calculations of both the exclusion probabilities (Dodds et al., 1996; Double et al., 1997) and the risk of wrong assignment are conservative and so may be greater in practice. Because the risk of wrong assignment is greatest for the closest relatives, full sibs (Table 5), particularly when many loci are used, it is unlikely that further relatives among putative sires will greatly affect the probabilities of misidentification of the sire unless these are more highly related than full sibs. A clone of the sire could not, of course, be distinguished from the sire by genotyping. In the risk calculations, we have, however, assumed an equal prior probability for all sires, whereas in practice, this is unlikely to be the case. For example, some sires may have been used in AI, and others not, and the pattern of use over time of sires may be known.

In our analysis of risk, we have assumed the genotype of the actual sire is present in the data set and designated as sire the putative sire with the greatest lod score, regardless of the actual and relative sizes of the lod scores of the winner and runners-up. In practice, we advocate presenting lod scores for at least the top few candidates for perusal and decision as in CERVUS (Marshall et al., 1998; Kalinowski et al., 2007; http://www.fieldgenetics.com/pages/aboutCervus_Functions.jsp). In the context of product tracing, if sufficiently important, possible ties or marginally convincing assignments could be resolved by further genotyping using an additional set of SNP (or a search for the absent fathers).

Simulations can be undertaken, as in CERVUS for example, on the actual data structure to assess the confidence of the parentage assignment. In the more limited context of product tracing, in which only a single relationship is to be identified and the population is known, the information on the standard deviations of lod scores would seem to provide a guide in most cases. We use the results on the distributions of lod scores, which can readily be computed for any population. As a simple example, assuming all gene frequencies are 0.4 and taking results from Table 2, for a single locus, the mean and standard deviation of lod|S are 0.0701 ± 0.154. Thus, for 100 such independent markers, for which the mean scales with the number of loci and the standard deviation with its square root, lod|S ~7.01 ± 1.54; similarly, for unrelated individuals, lod|U ~–19.81 ± 6.54; and for full brothers, lod|R ~–6.40 ± 4.94. As noted previously, the distribution of the lod scores, even with large numbers of loci, departs from normality, in particular being skewed downwards, but even so, the normal distribution provides a guide. Assuming no relative could be closer than a full sib, an observed lod in excess of 5 is strong evidence that the sire has been identified, whereas a negative lod is strong evidence that the sire has not been found.
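The arithmetic of this example can be sketched as follows, as a rough guide only. The per-locus mean and standard deviation for the true sire (S) are those quoted above. The per-locus values for unrelated individuals (U) and full brothers (R) are back-calculated here from the 100-locus figures and are therefore an assumption of the sketch, as are the function names. Loci are treated as independent and identically distributed, and the normal approximation ignores the downward skew noted above.

```python
import math

# Per-locus (mean, SD) of the lod score at gene frequency 0.4.
# S is quoted in the text; U and R are back-calculated from the
# 100-locus figures (mean / 100, SD / 10) and are assumptions here.
PER_LOCUS = {
    "S": (0.0701, 0.154),    # candidate is the true sire
    "U": (-0.1981, 0.654),   # candidate is unrelated to the sire
    "R": (-0.0640, 0.494),   # candidate is a full brother of the sire
}


def scale_lod(mean_per_locus, sd_per_locus, n_loci):
    """Mean and SD of a lod score summed over n independent, alike loci."""
    return n_loci * mean_per_locus, math.sqrt(n_loci) * sd_per_locus


def prob_exceeds(threshold, mean, sd):
    """Normal-approximation probability that the lod exceeds threshold."""
    z = (threshold - mean) / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))


def summary(n_loci, threshold=5.0):
    """Map each relationship to (mean, SD, P(lod > threshold))."""
    out = {}
    for label, (m, s) in PER_LOCUS.items():
        mean, sd = scale_lod(m, s, n_loci)
        out[label] = (mean, sd, prob_exceeds(threshold, mean, sd))
    return out


result = summary(100)
```

Calling `summary` with a different number of loci or threshold gives a quick indication of how far apart the three distributions lie, although simulation on the actual data structure remains the better check when the decision matters.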

Although genotyping errors should, of course, be minimized, the analysis shows that the ML method is reasonably robust, provided the assumed error rate is not too small (in which case false exclusions generate very large negative lod scores). The actual error rate to be used will depend on the genotyping platform and assignment rate in use. Therefore, the base figure of 0.005 mainly used should be regarded as no more than a typical figure.

The application considered here to product tracing is simpler than other potential applications, because the kind of relationship to be identified is limited solely to parents. In this case, there is clearly more information to be gained from a limited number of loci by identifying both parents, even though the actual matings made are not recorded (e.g., artificial insemination using mixed semen), because the likelihood ratios are much larger (Table 2). If sire tracing is sufficient, it may be much more economical, because fewer animals need to be genotyped.


    Footnotes
 
1 We are grateful to Maple Leaf Foods Inc., Toronto, Canada, for financing this study as part of their DNA traceability initiative for pork, to colleagues for assistance, to Bruce Weir (University of Washington) for comments, and to Wei Zou (Clinical Data Inc.) and 2 referees for comments on an earlier draft.

2 Corresponding author: w.g.hill{at}ed.ac.uk

Received for publication May 17, 2007. Accepted for publication May 19, 2008.


    LITERATURE CITED


Bieber, F. R., C. H. Brenner, and D. Lazer. 2006. Finding criminals through DNA of their relatives. Science 312:1315–1316.

Brenner, C. H. 1999. The power of SNP’s – Even without population data. http://dna-view.com/SNPpost.htm Accessed June 22, 2006.

Brenner, C. H., and B. S. Weir. 2003. Issues and strategies in the DNA identification of World Trade Center victims. Theor. Popul. Biol. 63:173–178.

Dodds, K. G., M. L. Tate, J. C. McEwan, and A. M. Crawford. 1996. Exclusion probabilities for pedigree testing farm animals. Theor. Appl. Genet. 92:966–975.

Double, M. C., A. Cockburn, S. C. Barry, and P. E. Smouse. 1997. Exclusion probabilities for single-locus paternity analysis when related males compete for matings. Mol. Ecol. 6:1155–1166.

Gill, P. 2001. An assessment of the utility of single nucleotide polymorphisms (SNPs) for forensic purposes. Int. J. Legal Med. 114:204–210.

Hayes, B., A. K. Sonneson, and B. Gjerde. 2005. Evaluation of three strategies using DNA markers for traceability in aquaculture species. Aquaculture 250:70–81.

Heaton, M. P., G. P. Harhay, G. L. Bennett, R. T. Stone, W. M. Grosse, E. Casas, J. W. Keele, T. P. L. Smith, C. G. Chitko-McKown, and W. W. Laegreid. 2002. Selection and use of SNP markers for animal identification and paternity analysis in U.S. beef cattle. Mamm. Genome 13:272–281.

Kalinowski, S. T., M. W. Taper, and T. C. Marshall. 2007. Revising how the computer program CERVUS accommodates genotyping error increases success in paternity assignment. Mol. Ecol. 16:1099–1106.

Kuruvilla, F., T. Green, D. Altshuler, M. Daly, and S. Gabriel. 2006. An evaluation of the Bayesian robust linear modelling using Mahalanobis distance (BRLMM) genotyping algorithm. http://www.broad.mit.edu/gen_analysis/genotyping/brlmm_affy_ncrr.html Accessed Nov. 25, 2007.

Marshall, T. C., J. Slate, L. E. B. Kruuk, and J. M. Pemberton. 1998. Statistical confidence for likelihood-based paternity inference in natural populations. Mol. Ecol. 7:639–655.

Plastow, G. S., A. J. Mileham, T. Wilken, C. Gladney, and J. Bastiaansen. 2007. Pig Improvement Company (UK) Ltd., assignee. System for tracing animal products. US patent 7229764.

Sobel, E., J. C. Papp, and K. Lange. 2002. Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet. 70:496–508.

Thompson, E. A. 1975. The estimation of pairwise relationships. Ann. Hum. Genet. 39:173–188.

Vignal, A., D. Milan, M. SanCristobal, and A. Eggen. 2002. A review on SNP and other types of molecular markers and their use in animal genetics. Genet. Sel. Evol. 34:275–305.

Weir, B. S. 1996. Genetic data analysis. II. Sinauer, Sunderland, MA.

Weir, B. S., A. D. Anderson, and A. B. Hepler. 2006. Genetic relatedness analysis: Modern data and new challenges. Nat. Rev. Genet. 7:771–780.

