The Extended Newick Format


The eNewick (for “extended Newick”) string defining a phylogenetic network appeared in the packages PhyloNet (Rice University BioInformatics Group 2007) and NetGen (Morin and Moret 2006) related to phylogenetic networks, with some differences between them. The former encodes a phylogenetic network with k hybrid nodes as a series of k trees in Newick format, while the latter encodes it as a single tree in Newick format but with k repeated nodes.

The Perl module we introduce here accepts both formats as input, but it implements a complete standard for eNewick, based mainly on NetGen and following the suggestions of D. Huson and M. M. Morin (among others) to make it as complete as possible. The adopted standard has the practical advantage of encoding a whole phylogenetic network as a single string, and it also includes mandatory tags to distinguish among the various hybrid nodes in the network.

The procedure to obtain the eNewick string representing a phylogenetic network N goes as follows: Let {H1,…,Hm} be the set of hybrid nodes of N, ordered in any fixed way. For each hybrid node H = Hi, say with parents u1,u2,…,uk and children v1,v2,…,vℓ: split H into k different nodes; let the first copy be a child of u1 and have all of v1,v2,…,vℓ as its children; let the other copies be children of u2,…,uk (one for each) and have no children. Label each of the copies of H as

[label]#[type]tag[:branch_length]

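where square brackets mark optional parts and the parameters are: label, an optional string labeling the node; type, an optional string giving the kind of reticulation event the node represents (for instance, H for hybridization or LGT for lateral gene transfer); tag, a mandatory integer identifying the hybrid node, shared by all copies of that node; and branch_length, an optional number giving the length of the branch from this copy to its parent.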

In this way, we get a tree whose set of leaves is the set of leaves of the original network together with the set of hybrid nodes (possibly repeated). Then, the Newick string of the obtained tree (note that some internal nodes will be labeled and some leaves will be repeated) is the eNewick string of the phylogenetic network. The leftmost occurrence of each hybrid node in an eNewick string corresponds to the full description of the network rooted at that node.




Figure 1: A phylogenetic network N (left), and the tree (right) associated with N for computing its eNewick string.


Consider, for example, the phylogenetic network depicted together with its decomposition in Fig. 1. The eNewick string for this network would be ((1,(2)#H1),(#H1,3)); or ((1,(2)h#H1)x,(h#H1,3)y)r; if all internal nodes are labeled. The leftmost occurrence of the hybrid node in the latter string corresponds to the full description of the network rooted at that node: (2)h#H1.
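The splitting procedure can be sketched in Python as a depth-first traversal that writes the children of a hybrid node only on its first (leftmost) visit. This is a minimal sketch, not the Perl module's implementation: the function name to_enewick, the dictionary-of-children representation, and the node names are assumptions made for illustration, tags are assigned in order of first visit (one possible fixed ordering), and a single type string is used for all hybrid nodes. The network is the one of Fig. 1 as read off the strings above.

```python
def to_enewick(children, root, labels=None, hybrid_type="H"):
    """Serialize a rooted DAG as an eNewick string (illustrative sketch).

    children: dict mapping a node to the ordered list of its children.
    labels:   dict mapping a node to its printed label (missing = unlabeled).
    """
    labels = labels or {}

    # A hybrid node is a node with more than one parent.
    indegree = {}
    for kids in children.values():
        for kid in kids:
            indegree[kid] = indegree.get(kid, 0) + 1

    tags = {}  # hybrid node -> integer tag, assigned on first visit

    def visit(node):
        name = labels.get(node, "")
        if indegree.get(node, 0) > 1:
            if node in tags:
                # Later copy: a childless node carrying only the hybrid label.
                return f"{name}#{hybrid_type}{tags[node]}"
            tags[node] = len(tags) + 1
            name = f"{name}#{hybrid_type}{tags[node]}"
        kids = children.get(node, [])
        subtree = "(" + ",".join(visit(k) for k in kids) + ")" if kids else ""
        return subtree + name

    return visit(root) + ";"


# Network of Fig. 1 (node names r, x, y, h are taken from the labeled string).
fig1 = {"r": ["x", "y"], "x": ["1", "h"], "y": ["h", "3"], "h": ["2"]}
leaf_labels = {"1": "1", "2": "2", "3": "3"}
all_labels = {n: n for n in ["r", "x", "y", "h", "1", "2", "3"]}

print(to_enewick(fig1, "r", leaf_labels))  # ((1,(2)#H1),(#H1,3));
print(to_enewick(fig1, "r", all_labels))   # ((1,(2)h#H1)x,(h#H1,3)y)r;
```

Because the traversal visits children from left to right, the copy that keeps the children of a hybrid node is always its leftmost occurrence in the output.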

Recovering a network from its eNewick string is correspondingly simple: parse the string as a tree, and then identify with one another all nodes that are labeled as hybrid nodes with the same identifier.
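A minimal sketch of this recovery step in Python follows. It is a simplified parser written for illustration, not the Perl module's code: the name parse_enewick is hypothetical, quoted labels and comments are not handled, branch lengths are discarded, non-hybrid labels are assumed to be unique, and all copies of a hybrid node are merged by keying them on the text after the "#" sign.

```python
import re
from itertools import count

STRUCTURAL = {"(", ")", ",", ";"}


def parse_enewick(text):
    """Return (root, edges) for an eNewick string, merging hybrid copies."""
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    pos = 0
    fresh = count()  # generates ids for unlabeled tree nodes
    edges = []

    def node_id(label):
        name = label.split(":")[0].strip()  # drop an optional branch length
        if "#" in name:
            # Copies of one hybrid node share the same type and tag.
            return "#" + name.split("#", 1)[1]
        return name or f"_{next(fresh)}"

    def subtree():
        nonlocal pos
        kids = []
        if tokens[pos] == "(":
            pos += 1  # consume "("
            kids.append(subtree())
            while tokens[pos] == ",":
                pos += 1
                kids.append(subtree())
            pos += 1  # consume ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in STRUCTURAL:
            label = tokens[pos]
            pos += 1
        node = node_id(label)
        edges.extend((node, kid) for kid in kids)
        return node

    root = subtree()
    return root, edges


root, edges = parse_enewick("((1,(2)h#H1)x,(h#H1,3)y)r;")
parents_of_hybrid = sorted(u for u, v in edges if v == "#H1")
print(root, parents_of_hybrid)  # r ['x', 'y']
```

After merging, the hybrid node appears once in the edge list, with one incoming edge per copy in the string and with the children listed under its leftmost occurrence.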




Figure 2: Representation of a lateral gene transfer event (left) as a hybrid node in a phylogenetic network (right).


Notice that gene transfer events can be represented in a unique way as hybrid nodes. Consider, for example, the lateral gene transfer event depicted in Fig. 2, where a gene is transferred from species 2 to species 3 after the divergence of species 1 from species 2. The eNewick string ((1,(2,(3)h#LGT1)y)x,h#LGT1)r; describes such a phylogenetic network. A program interpreting the eNewick string can use the information on node types in different ways; for instance, to render tree nodes circled, hybridization nodes boxed, and lateral gene transfer nodes as arrows between edges.
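As an illustration of how a program might use the node types, the following Python sketch splits a node label of the form [label]#[type]tag[:branch_length] into its parts and picks a drawing style from the type. The regular expression, the function names, and the style names are assumptions made for this example, as is the choice to draw a hybrid node with no explicit type as a box.

```python
import re

# [label]#[type]tag[:branch_length], with the type assumed to be alphabetic
# and the tag assumed to be a non-negative integer.
HYBRID_LABEL = re.compile(
    r"^(?P<label>[^#:]*)#(?P<type>[A-Za-z]*)(?P<tag>\d+)(?::(?P<length>[^:]+))?$"
)


def parse_hybrid_label(text):
    """Return the parts of a hybrid node label, or None for a tree node."""
    m = HYBRID_LABEL.match(text)
    if m is None:
        return None
    return {
        "label": m["label"],
        "type": m["type"],
        "tag": int(m["tag"]),
        "length": float(m["length"]) if m["length"] else None,
    }


def style_for(text):
    """Choose a (hypothetical) drawing style from the node type."""
    parts = parse_hybrid_label(text)
    if parts is None:
        return "circle"  # ordinary tree node
    if parts["type"] == "LGT":
        return "arrow"   # lateral gene transfer
    return "box"         # hybridization, or no explicit type (assumption)


for node in ["x", "h#H1", "h#LGT1", "#1:0.5"]:
    print(node, style_for(node))
```

Keeping the parsing separate from the style choice lets further node types be supported by extending style_for alone.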

References

   M. M. Morin and B. M. E. Moret. NETGEN: generating phylogenetic networks with diploid hybrids. Bioinformatics, 22(15):1921–1923, 2006.

   Rice University BioInformatics Group. Phylonet: Phylogenetic networks toolkit (v. 1.4). Available at http://bioinfo.cs.rice.edu/phylonet/, 2007.