On Counting Tandem Duplication Trees
http://www.100md.com
分子生物学进展 2004年第6期
Department of Mathematics, National University of Singapore, Singapore
E-mail: matzlx@nus.edu.sg.
Key Words: molecular phylogeny ? tandem duplication history ? duplication tree model
Introduction
Large genomes are full of repeated DNA sequences. It was estimated that over half of the human DNA consists of repeated sequences (Baltimore 2001; Eichler 2001; Leem et al. 2002). Tandem duplication is one of the important evolutionary mechanisms for producing repeated DNA sequences, in which the copies that may or may not contain genes are adjacent along the genome. Fitch (1977) first observed that tandem duplication histories are much more constrained than speciation histories and proposed to model them assuming that unequal crossover is the biological mechanism from which they originate. The corresponding trees are now called tandem duplication trees, the term tandem sometimes being omitted for the sake of conciseness. With more and more genomic sequences becoming known, inferring tandem duplication history has again redrawn researchers' attention (Benson and Dong 1999; Tang, Waterman, and Yooseph 2002; Elemento, Gascuel, and Lefranc 2002; Zhang et al. 2003).
The aim of this article is, first, to present a simple recurrence formula for the number of rooted duplication trees based on the recurrence formula proved in Gascuel et al. (2003). We also give a simple non-counting proof of the fact that the number of rooted duplication trees for n segments is exactly twice the number of unrooted duplication trees for n segments. Notice that this fact was proved based on a counting argument in Gascuel et al. (2003).
Duplication Tree Model
Assume n sequence segments {1, 2, ... , n} were formed from a locus by tandem duplication. Then, assume that the locus had grown from a single copy through a series of tandem duplications. Each duplication replaced a stretch of DNA sequences containing several repeats with two identical and adjacent copies of itself. If the stretch contains k repeats, the duplication is called a k-duplication.
A rooted duplication tree for tandemly repeated segments {1, 2, ... , n} is a rooted binary tree that contains blocks as shown in figure 1. A node in represents a repeat. Obviously, the root represents the original copy at the locus and leaves the given segments.
FIG. 1. A rooted duplication tree . Multi-duplication blocks are [d, f], [h, i] [j, k], and [l, m]
A block in represents a duplication event. Each non-leaf node appears in a unique block; no node is an ancestor of another in a block. If the block corresponds to a k-duplication, it has k nodes u1, u2, ... uk listed from left to right. Assume lc(ui) and rc(ui) are the left and right children of ui, 1 i k. Then, in the model ,
are placed from left to right. Hence, for any i and j, 1 i j k, the directed edges (ui, rc(ui)) and (uj, lc(uj)) cross each other. But no other edges cross in the model. For simplicity, we will only draw blocks corresponding to multi-duplication events that contain more than one internal node.
The leaves representing given segments are placed from left to right in the order of the segments appearing on the chromosome. Here we assume that such an order is the increasing order.
Counting Rooted Duplication Trees
An r-duplication event in a rooted duplication tree is visible if none of the 2r copied segments produced by the event have been duplicated subsequently. It is easy to see that the children of the nodes contained in a visible duplication block are leaves. For example, there are five visible duplications in the rooted duplication tree in figure 1: [q], [n], [j, k], [o], [p]. For a visible r-duplication event, if there are i given segments remaining to the right of the 2r copies produced by the event, we refer it as a (i, r)-duplication event. In figure 1, the visible 2-duplication [j, k] is a (7, 2)-duplication.
We use rn to denote the number of all the rooted duplication trees for n segments. For n 2 and 0 i n – 2, let p(n, i) denote the number of all the rooted duplication trees for n segments in which the leftmost visible duplication is an (i, r)-duplication for some r, 1 r (n – i)/2. Then,
Lemma 1 (Gascuel et al. 2003) For any n 2 and 1 i n – 4,
By applying Lemma 1, we obtain the following simple recurrence formula for rn.
THEOREM 1 For any n 2,
Proof. Obviously, r2 = 1. Now, we only consider the case n 3. By definition,
as demonstrated in Gascuel et al. (2003). This can be generalized into
for any k from 2 to n.
Now, we show equation (4) by induction on k. When k = 2, 3, equation (4) becomes equation (3) and hence is true. Assume it is true for k j. For k = j + 1 > 3, by equation (2),
where we assume () = 0 to derive the 5th equality and use the formula () + () = () for two integers a, b such that 1 a, a – 1 b. This concludes the induction proof and hence equation (4) holds for any k from 2 to n.
Now, by equation (1),
where we use the formula ) for two integers 0 a b. This finishes the proof.
The recurrence formula in Theorem 1 allows us to find a closed formula for computing rn. Let X = (rn, rn–1, ... , r3, r2)T. Then, by the recurrence formula,
where A = (aij)(n–1)x(n–1) is defined as
Since A is an upper triangular matrix having 1s along the diagonal, its determinant is 1. Hence, the fact that only the last entry ‘1’ is non-zero in the right-hand vector implies that rn is the co-factor of the row n – 1 and column 1 in A. For example, we have
Unrooted Duplication Trees
An unrooted duplication tree is a tree derived from a rooted duplication tree by removal of the root. Let be an unrooted duplication tree for n segments {1, 2, ... , n}. By definition, at least one rooted duplication tree can be obtained from by rooting it at some edge in the path from 1 to n. We say that can be rooted at an edge e if a rooted duplication tree can be formed by rooting it at e. Let Si denote the set of unrooted duplication trees that can be rooted at exactly i edges, i 1. Then, the following facts are true:
If the unrooted duplication tree can be rooted at two edges e and e', then it can be rooted at any edge between e and e' in the path from segment 1 to segment n.
Let Sk for some k 3. Assume the edges in the path from 1 to n in are
where ei = (ui–1, ui), u0 = 1, and um = n. If can only be rooted at the edges ej (i j i') where i' – i + 1 = k. Then, ui–1 must be contained in a multi-duplication block and so is ui'.
Assume Sk, k 3 and i and i' are as in (2). Let Tj denote the subtree of rooted at a child of uj that is off the path from 1 to n. We also let j(j+1) denote the unrooted duplication tree obtained from by interchanging Tj and Tj+1 as illustrated in figure 2. Then, we attain i' – i – 2 unrooted trees (i+1)(i+2), (i+2)(i+3), ..., (i'–2)(i'–1) from . It is easy to see that, for each j from i + 1 to i' – 2, j(j+1) can be rooted uniquely at the edge ej+1.
FIG. 2. (a) An unrooted duplication tree . It can be rooted at 5 edges e3, e4, e5, e6, e7. The rooted duplication tree derived from by rooting it at e5 is given in figure 1. (b) An unrooted duplication tree 45 obtained from by interchanging subtrees T4 and T5. 45 can only be rooted at e5
Conversely, if can be rooted uniquely at some edge ei+1 = (ui, ui+1) in the path from 1 to n, then, ui and ui+1 must be contained in a double duplication block (and hence the right subtree of ui and the left subtree of ui+1 are swapped). Thus, i(i+1) obtained by interchanging Ti and Ti+1 can be rooted at three edges ei, ei+1, ei+2. This implies that i(i+1) is an unrooted duplication tree that can be rooted in at least three ways.
Therefore, the mapping from to = {j(j+1) | i + 1 j i' – 2} is one-to-one from unrooted duplication trees Sk (k 3) to the (k – 2)-subsets of S1. Recall that rn denotes the number of rooted duplication trees for n segments. Letting un be the number of unrooted duplication trees, we obtain
where we use the fact || = j – 2 for Sj, j 3.
In conclusion, the following two facts hold for duplication trees:
The number rn of rooted duplication trees for n segments is twice the number of unrooted duplication trees (Gascuel et al. 2003); it satisfies the following recurrence relation:
A rooted duplication tree for segments {1, 2, ... , n} is ordered if it contains only 1-duplications. Such a duplication tree is just a rooted binary tree having its leaves listed from left to right in the increasing order. The number of rooted, ordered duplication trees for n segments is the same as the number of unrooted, ordered duplication trees for n + 1 segments (Elemento and Gascuel 2003); it is ()/n (Zhang, Ma, and Wang 2003; Elemento and Gascuel 2003).
Acknowledgements
L. Zhang thanks Mike Steel for drawing attention to the counting problem studied here, and for further discussions. Thanks, too, to Olivier Gascuel for useful comments on the first version of this paper. The work is partially supported by Singapore BioMedical Research Council Research grant BMRC01/1/21/19/140.
Literature Cited
Baltimore, D. 2001. Our genome unveiled. Nature 409:814-816.
Eichler, E. E. 2001. Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17:661-669.
Elemento, O., and O. Gascuel. 2003. An exact and polynomial distance-based algorithm to reconstruct single copy tandem duplication trees. Pp. 96–108 in Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching (CPM'03), Morelia, Mexico. Lecture Notes in Computer Science, vol. 2676, Springer-Verlag.
Elemento, O., O. Gascuel, and M.-P. Lefranc. 2002. Reconstructing the duplication history of tandemly repeated genes. Mol. Biol. Evol. 19:278-288.
Fitch, W. 1977. Phylogenies constrained by cross-over process as illustrated by human hemoglobins in a thirteen cycle, eleven amino-acid repeat in human apolipoprotein A-I. Genetics 86:623-644.
Gascuel, O., M. D. Hendy, A. Jean-Marie, and R. McLachlan. 2003. The combinatorics of tandem duplication trees. Syst. Biol. 52:110-118.
Leem, S.-H., J. A. Londo?o-Vallejo, and J.-H. Kim, et al. 2002. The human telomerase gene: complete genomic sequence and analysis of tandem repeat polynomisms in intronic regions. Oncogene 21:769-777.
Tang, M., M. Waterman, and S. Yooseph. 2002. Zinc finger gene clusters and tandem gene duplication. J. Comput. Biol. 9:429-446.
Zhang, L., B. Ma, L. Wang, and Y. Xu. 2003. Greedy method for inferring tandem duplication history. Bioinformatics 19:1497-1504.(Jialiang Yang and Louxin )
E-mail: matzlx@nus.edu.sg.
Key Words: molecular phylogeny ? tandem duplication history ? duplication tree model
Introduction
Large genomes are full of repeated DNA sequences. It was estimated that over half of the human DNA consists of repeated sequences (Baltimore 2001; Eichler 2001; Leem et al. 2002). Tandem duplication is one of the important evolutionary mechanisms for producing repeated DNA sequences, in which the copies that may or may not contain genes are adjacent along the genome. Fitch (1977) first observed that tandem duplication histories are much more constrained than speciation histories and proposed to model them assuming that unequal crossover is the biological mechanism from which they originate. The corresponding trees are now called tandem duplication trees, the term tandem sometimes being omitted for the sake of conciseness. With more and more genomic sequences becoming known, inferring tandem duplication history has again redrawn researchers' attention (Benson and Dong 1999; Tang, Waterman, and Yooseph 2002; Elemento, Gascuel, and Lefranc 2002; Zhang et al. 2003).
The aim of this article is, first, to present a simple recurrence formula for the number of rooted duplication trees based on the recurrence formula proved in Gascuel et al. (2003). We also give a simple non-counting proof of the fact that the number of rooted duplication trees for n segments is exactly twice the number of unrooted duplication trees for n segments. Notice that this fact was proved based on a counting argument in Gascuel et al. (2003).
Duplication Tree Model
Assume n sequence segments {1, 2, ... , n} were formed from a locus by tandem duplication. Then, assume that the locus had grown from a single copy through a series of tandem duplications. Each duplication replaced a stretch of DNA sequences containing several repeats with two identical and adjacent copies of itself. If the stretch contains k repeats, the duplication is called a k-duplication.
A rooted duplication tree for tandemly repeated segments {1, 2, ... , n} is a rooted binary tree that contains blocks as shown in figure 1. A node in represents a repeat. Obviously, the root represents the original copy at the locus and leaves the given segments.
FIG. 1. A rooted duplication tree . Multi-duplication blocks are [d, f], [h, i] [j, k], and [l, m]
A block in represents a duplication event. Each non-leaf node appears in a unique block; no node is an ancestor of another in a block. If the block corresponds to a k-duplication, it has k nodes u1, u2, ... uk listed from left to right. Assume lc(ui) and rc(ui) are the left and right children of ui, 1 i k. Then, in the model ,
are placed from left to right. Hence, for any i and j, 1 i j k, the directed edges (ui, rc(ui)) and (uj, lc(uj)) cross each other. But no other edges cross in the model. For simplicity, we will only draw blocks corresponding to multi-duplication events that contain more than one internal node.
The leaves representing given segments are placed from left to right in the order of the segments appearing on the chromosome. Here we assume that such an order is the increasing order.
Counting Rooted Duplication Trees
An r-duplication event in a rooted duplication tree is visible if none of the 2r copied segments produced by the event have been duplicated subsequently. It is easy to see that the children of the nodes contained in a visible duplication block are leaves. For example, there are five visible duplications in the rooted duplication tree in figure 1: [q], [n], [j, k], [o], [p]. For a visible r-duplication event, if there are i given segments remaining to the right of the 2r copies produced by the event, we refer it as a (i, r)-duplication event. In figure 1, the visible 2-duplication [j, k] is a (7, 2)-duplication.
We use rn to denote the number of all the rooted duplication trees for n segments. For n 2 and 0 i n – 2, let p(n, i) denote the number of all the rooted duplication trees for n segments in which the leftmost visible duplication is an (i, r)-duplication for some r, 1 r (n – i)/2. Then,
Lemma 1 (Gascuel et al. 2003) For any n 2 and 1 i n – 4,
By applying Lemma 1, we obtain the following simple recurrence formula for rn.
THEOREM 1 For any n 2,
Proof. Obviously, r2 = 1. Now, we only consider the case n 3. By definition,
as demonstrated in Gascuel et al. (2003). This can be generalized into
for any k from 2 to n.
Now, we show equation (4) by induction on k. When k = 2, 3, equation (4) becomes equation (3) and hence is true. Assume it is true for k j. For k = j + 1 > 3, by equation (2),
where we assume () = 0 to derive the 5th equality and use the formula () + () = () for two integers a, b such that 1 a, a – 1 b. This concludes the induction proof and hence equation (4) holds for any k from 2 to n.
Now, by equation (1),
where we use the formula ) for two integers 0 a b. This finishes the proof.
The recurrence formula in Theorem 1 allows us to find a closed formula for computing rn. Let X = (rn, rn–1, ... , r3, r2)T. Then, by the recurrence formula,
where A = (aij)(n–1)x(n–1) is defined as
Since A is an upper triangular matrix having 1s along the diagonal, its determinant is 1. Hence, the fact that only the last entry ‘1’ is non-zero in the right-hand vector implies that rn is the co-factor of the row n – 1 and column 1 in A. For example, we have
Unrooted Duplication Trees
An unrooted duplication tree is a tree derived from a rooted duplication tree by removal of the root. Let be an unrooted duplication tree for n segments {1, 2, ... , n}. By definition, at least one rooted duplication tree can be obtained from by rooting it at some edge in the path from 1 to n. We say that can be rooted at an edge e if a rooted duplication tree can be formed by rooting it at e. Let Si denote the set of unrooted duplication trees that can be rooted at exactly i edges, i 1. Then, the following facts are true:
If the unrooted duplication tree can be rooted at two edges e and e', then it can be rooted at any edge between e and e' in the path from segment 1 to segment n.
Let Sk for some k 3. Assume the edges in the path from 1 to n in are
where ei = (ui–1, ui), u0 = 1, and um = n. If can only be rooted at the edges ej (i j i') where i' – i + 1 = k. Then, ui–1 must be contained in a multi-duplication block and so is ui'.
Assume Sk, k 3 and i and i' are as in (2). Let Tj denote the subtree of rooted at a child of uj that is off the path from 1 to n. We also let j(j+1) denote the unrooted duplication tree obtained from by interchanging Tj and Tj+1 as illustrated in figure 2. Then, we attain i' – i – 2 unrooted trees (i+1)(i+2), (i+2)(i+3), ..., (i'–2)(i'–1) from . It is easy to see that, for each j from i + 1 to i' – 2, j(j+1) can be rooted uniquely at the edge ej+1.
FIG. 2. (a) An unrooted duplication tree . It can be rooted at 5 edges e3, e4, e5, e6, e7. The rooted duplication tree derived from by rooting it at e5 is given in figure 1. (b) An unrooted duplication tree 45 obtained from by interchanging subtrees T4 and T5. 45 can only be rooted at e5
Conversely, if can be rooted uniquely at some edge ei+1 = (ui, ui+1) in the path from 1 to n, then, ui and ui+1 must be contained in a double duplication block (and hence the right subtree of ui and the left subtree of ui+1 are swapped). Thus, i(i+1) obtained by interchanging Ti and Ti+1 can be rooted at three edges ei, ei+1, ei+2. This implies that i(i+1) is an unrooted duplication tree that can be rooted in at least three ways.
Therefore, the mapping from to = {j(j+1) | i + 1 j i' – 2} is one-to-one from unrooted duplication trees Sk (k 3) to the (k – 2)-subsets of S1. Recall that rn denotes the number of rooted duplication trees for n segments. Letting un be the number of unrooted duplication trees, we obtain
where we use the fact || = j – 2 for Sj, j 3.
In conclusion, the following two facts hold for duplication trees:
The number rn of rooted duplication trees for n segments is twice the number of unrooted duplication trees (Gascuel et al. 2003); it satisfies the following recurrence relation:
A rooted duplication tree for segments {1, 2, ... , n} is ordered if it contains only 1-duplications. Such a duplication tree is just a rooted binary tree having its leaves listed from left to right in the increasing order. The number of rooted, ordered duplication trees for n segments is the same as the number of unrooted, ordered duplication trees for n + 1 segments (Elemento and Gascuel 2003); it is ()/n (Zhang, Ma, and Wang 2003; Elemento and Gascuel 2003).
Acknowledgements
L. Zhang thanks Mike Steel for drawing attention to the counting problem studied here, and for further discussions. Thanks, too, to Olivier Gascuel for useful comments on the first version of this paper. The work is partially supported by Singapore BioMedical Research Council Research grant BMRC01/1/21/19/140.
Literature Cited
Baltimore, D. 2001. Our genome unveiled. Nature 409:814-816.
Eichler, E. E. 2001. Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 17:661-669.
Elemento, O., and O. Gascuel. 2003. An exact and polynomial distance-based algorithm to reconstruct single copy tandem duplication trees. Pp. 96–108 in Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching (CPM'03), Morelia, Mexico. Lecture Notes in Computer Science, vol. 2676, Springer-Verlag.
Elemento, O., O. Gascuel, and M.-P. Lefranc. 2002. Reconstructing the duplication history of tandemly repeated genes. Mol. Biol. Evol. 19:278-288.
Fitch, W. 1977. Phylogenies constrained by cross-over process as illustrated by human hemoglobins in a thirteen cycle, eleven amino-acid repeat in human apolipoprotein A-I. Genetics 86:623-644.
Gascuel, O., M. D. Hendy, A. Jean-Marie, and R. McLachlan. 2003. The combinatorics of tandem duplication trees. Syst. Biol. 52:110-118.
Leem, S.-H., J. A. Londo?o-Vallejo, and J.-H. Kim, et al. 2002. The human telomerase gene: complete genomic sequence and analysis of tandem repeat polynomisms in intronic regions. Oncogene 21:769-777.
Tang, M., M. Waterman, and S. Yooseph. 2002. Zinc finger gene clusters and tandem gene duplication. J. Comput. Biol. 9:429-446.
Zhang, L., B. Ma, L. Wang, and Y. Xu. 2003. Greedy method for inferring tandem duplication history. Bioinformatics 19:1497-1504.(Jialiang Yang and Louxin )