Haplotype genealogy graphs based on Fitch distances
Fitchi is a python script that produces haplotype genealogy graphs from alignment files in Nexus format, along with summary statistics.
Haplotype genealogy graphs
With population genetic sequence data, haplotype genealogy graphs are often a better way to depict variation among populations than standard phylogenetic trees. In a haplotype genealogy graph, nodes usually represent unique sequences (haplotypes), and their size reflects the number of records in the input alignment that share exactly this sequence. Popular programs such as TCS draw relationships among these sequences as haplotype networks that include reticulations. These can represent either ambiguous node connections or a true biological signal of conflicting tree topologies due to recombination within the data set. If recombination is considered unlikely within the data set (e.g. for mitochondrial data or short nuclear fragments), it may thus be helpful to visualize sequence data without reticulations, which is what Fitchi does. For more information on cases where networks with reticulations or genealogies without reticulations may be more appropriate, see Salzburger et al. (2011) and Mardulyn (2012).
In Fitchi's haplotype genealogies, edges between nodes indicate their connections in a user-supplied bifurcating phylogenetic tree, and edge lengths reflect the minimum number of mutations by which these sequences are separated. In Fitchi, ancestral sequences are reconstructed using the algorithm presented by Walter M. Fitch in his 1970 paper "Distinguishing homologous from analogous proteins", hence the name of this script. Details on the transformation of a bifurcating phylogenetic tree into a haplotype genealogy are given in Salzburger et al. (2011).
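To make the reconstruction step more concrete, here is a minimal sketch of the bottom-up pass of Fitch-style parsimony for a single alignment site on a bifurcating tree. It is not taken from Fitchi's source code; the nested-tuple tree representation and the function name fitch_sets are assumptions made purely for illustration.

```python
def fitch_sets(tree):
    """Return (possible_states, n_changes) for one site on a bifurcating tree.

    Hypothetical representation: a leaf is a one-character string such as "A",
    an internal node is a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_set, left_changes = fitch_sets(left)
    right_set, right_changes = fitch_sets(right)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change is needed.
        return shared, left_changes + right_changes
    # The children disagree: keep all candidate states and count one change.
    return left_set | right_set, left_changes + right_changes + 1


site = (("A", "A"), ("G", ("A", "G")))
states, changes = fitch_sets(site)
print(sorted(states), changes)
```

When the set at a node contains more than one state, several reconstructions are equally parsimonious and one of them has to be chosen.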
The positioning of nodes in a haplotype genealogy graph can be tricky, especially with larger node numbers. In a perfectly laid-out graph, connecting edges of a node would be evenly spaced, no edges would cross each other, and no two nodes would overlap. Fortunately, graph theoreticians have run into this problem before us and developed algorithms that don't always work perfectly, but usually do a good enough job. Fitchi employs one of the most popular of these algorithms, called "neato", and accesses it through the graphviz package.
Requirements for Fitchi
Unfortunately, Fitchi is unlikely to run on your machine right out of the box, but may require a few additional installations. The instructions below should work well on Macintosh and Linux systems. It might also be possible to get all requirements to run on Windows; however, this has not been tested.
1 - python3 and pip3
First of all, Fitchi only runs with python3, not with earlier versions of python. See whether you already have this version by typing
python3 --version
in a terminal window. If you see something like "Python 3.X.X", you should be fine. If not, please download and install python3 from here.
If python3 is correctly installed, it is likely that the python module manager called pip has come along with it. We will need it later, so make sure that you have it:
pip3 --version
You should see something like "pip 7.1.2 from /usr/local/lib/python3.5/site-packages (python 3.5)" (make sure the python version number given in parentheses is 3.X). If you don't, note that pip for python3 might also be named just plain "pip". Try this command instead of "pip3". If this still doesn't work for you, this blog post could help.
2 - graphviz
Next, the above-mentioned graphviz package should be installed on your machine, and it needs some additional tools in order to be accessible from python scripts (like Fitchi). If you're on OS X and you have both the highly recommended package manager Homebrew and Apple's command line tools, then the graphviz installation should be as easy as
brew install graphviz
brew install pkg-config
To check whether the installation has worked, running "neato -V" should result in something like "neato - graphviz version 2.38.0 (20140413.2041)".
3 - python modules
Finally, Fitchi requires that the following four python modules are installed: pygraphviz, biopython, and both scipy and numpy. You may use "pip3 list" to check whether any of these are installed already, and install the remaining ones with
sudo pip3 install pygraphviz
sudo pip3 install biopython
sudo pip3 install scipy
sudo pip3 install numpy
If you see an "ImportError" related to pygraphviz, you could try
sudo pip3 install pygraphviz --install-option="--include-path=/usr/include/graphviz" --install-option="--library-path=/usr/lib/graphviz"
(Make sure to specify the right installation path of graphviz; it may not be in /usr/include/graphviz.)
sudo pip3 install pygraphviz
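As an alternative to scanning the output of "pip3 list", the check for the four modules can be sketched in a few lines of Python. This helper is not part of Fitchi; the function name missing_packages is made up for this example. The only non-obvious detail is that the biopython package is imported under the module name "Bio".

```python
import importlib.util

# Package name (as used with pip) -> module name (as used with import).
REQUIRED = {
    "pygraphviz": "pygraphviz",
    "biopython": "Bio",
    "scipy": "scipy",
    "numpy": "numpy",
}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


print("Missing packages:", missing_packages() or "none")
```

Run it with python3 so that it checks the same interpreter that will later run Fitchi.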
Fitchi reads files in Nexus format that include both a sequence alignment and a bifurcating phylogenetic tree of these sequences. It is left to the user how this phylogenetic tree is obtained, but obvious candidate programs would include PAUP* or RAxML, depending on the number and length of sequence records. The sequence alignment may include missing data, coded with IUPAC ambiguity codes. If this is the case, Fitchi uses the Fitch algorithm to infer the sequence compatible with the ambiguity code that has the smallest number of nucleotide changes compared to internal nodes. Thus, nucleotides with completely unknown state, coded as 'N', '?', or '-', will never be counted as mismatches. Note that for Fst calculations (following Weir and Cockerham 1984), sequences are expected to be diploid and phased, with each pair of two consecutive sequences assumed to be from the same individual. If sequences are haploid (e.g. mitochondrial), this can be specified with option --haploid (see below). See here for what an example file looks like.
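The treatment of ambiguity codes can be illustrated with a small sketch that counts mismatches between two aligned sequences. It is an illustration only and not Fitchi's implementation: two sites are counted as different only if the sets of nucleotides allowed by their codes do not overlap, so 'N', '?', and '-' never add a mismatch. The function name count_mismatches is hypothetical.

```python
# Standard IUPAC nucleotide codes; '?' and '-' are treated like 'N' here.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "?": "ACGT", "-": "ACGT",
}


def count_mismatches(seq1, seq2):
    """Count sites whose sets of possible nucleotides do not overlap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq1.upper(), seq2.upper())
        if not set(IUPAC[a]) & set(IUPAC[b])
    )


print(count_mismatches("ACGTN-R", "ACTTAGG"))
```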
The good news is that once you've got the above packages and modules installed, and you've got an input file ready in Nexus format, running Fitchi is absolutely easy. Just place the python script fitchi.py somewhere on your machine, and type
python3 fitchi.py input.nex output.html
This should give you an output file in HTML format, named output.html. This output file contains the haplotype genealogy as an embedded SVG formatted graph, plus summary statistics for nucleotide diversity and overall differentiation. However, the haplotype genealogy will appear in plain grey, as no population identifiers were specified, and sequences could not be assigned to populations.
For a more informative graph, specify population identifiers that are also included in the ids of the sequence records in the alignment. For instance, populations in the example input file are simply named "pop1", "pop2", etc., and are recognized by Fitchi if the script is started with the "-p" option:
python3 fitchi.py example.nex example.html -p pop1 pop2 pop3 pop4 pop5 pop6 pop7 pop8
Sequence records that cannot be assigned to any of the specified population identifiers are always shown in light grey, as are the sequence records of the last populations when more than 13 population identifiers are specified (for lack of a good color scheme with more colors that are easy to discriminate).
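The idea of matching population identifiers against record ids can be sketched as follows. This is an assumption about the general approach (a simple substring test), not a copy of Fitchi's matching rule, and the function name assign_population is hypothetical.

```python
def assign_population(record_id, populations):
    """Return the first population identifier contained in record_id, or None.

    Caveat of plain substring matching: "pop1" is also contained in "pop10",
    so identifiers should be chosen so that none is a prefix of another.
    """
    for population in populations:
        if population in record_id:
            return population
    return None


populations = ["pop1", "pop2", "pop3"]
for record_id in ["ind01_pop1_a", "ind07_pop3_b", "outgroup_x"]:
    print(record_id, "->", assign_population(record_id, populations))
```

Records for which the function returns None correspond to the sequences that would be drawn in light grey.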
Fitchi can also be piped, which can be useful when doing sliding window analyses, e.g. for genomic datasets:
cat input.nex | python3 fitchi.py > output.html
The following additional options can be specified on the command line:
Using "-f", followed by an integer number specifies the first position in the alignment that should be used for analysis. Effectively, this cuts off the first sites up to, but not including, the specified position.
Just like the "-f" option trims away the first part of the alignment, "-t" cuts off its tail. For example, "-f 3 -t 10" would cut off the first two positions, and everything from position 11 to the end of the alignment. These two options affect both the haplotype genealogy graph and all statistics calculated from the alignment.
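Because both positions are 1-based and inclusive, the trimming maps onto a Python slice as shown in this small sketch. The function name trim_alignment is hypothetical and only mirrors the behaviour described above.

```python
def trim_alignment(sequences, first=None, last=None):
    """Keep alignment positions first..last (1-based, inclusive)."""
    start = 0 if first is None else first - 1
    # A slice end of None keeps everything up to the end of the sequence.
    return [sequence[start:last] for sequence in sequences]


alignment = ["ACGTACGTACGTAC", "ACGTTCGTACGAAC"]
# Equivalent of "-f 3 -t 10": keep positions 3 to 10.
print(trim_alignment(alignment, first=3, last=10))
```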
With larger alignments, haplotype genealogy graphs can soon become too cluttered to be meaningful. In this case, it might help to specify a minimum edge length for display in the graph with the "-e" option. This means that all edges shorter than this minimum length (in Fitch distances) will not be shown, and nodes that are linked by these short edges will be collapsed into one. As a result, nodes may no longer represent unique single sequences, but instead a collection of closely related sequences. This option only affects the haplotype genealogy graph, not the alignment statistics.
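Collapsing nodes that are joined by short edges can be sketched with a small union-find structure. This is one possible way to implement the idea and not necessarily how Fitchi does it; the edge-list format and the function name collapse_short_edges are assumptions for this example.

```python
def collapse_short_edges(edges, min_length):
    """Merge nodes joined by edges shorter than min_length.

    edges is a list of (node_a, node_b, length) tuples. Returns a list of
    node groups and the list of edges that remain visible.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # shorten the path as we go
            node = parent[node]
        return node

    for node_a, node_b, length in edges:
        root_a, root_b = find(node_a), find(node_b)
        if length < min_length:
            parent[root_a] = root_b  # collapse the two groups into one

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    kept = [(find(a), find(b), length) for a, b, length in edges if length >= min_length]
    return list(groups.values()), kept


edges = [("h1", "h2", 1), ("h2", "h3", 3), ("h3", "h4", 1)]
groups, kept = collapse_short_edges(edges, min_length=2)
print(sorted(sorted(group) for group in groups), kept)
```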
In a similar fashion, a minimum node size for display can be specified with the "-n" option. The effect of this option depends on the degree of each node. Nodes of degree 1 (terminal nodes) that are below the minimum size will disappear together with their connecting edge. For nodes of degree 2, the two connecting edges will be merged into a single edge and the node will be removed. Nodes of larger degrees (with more than two edges) will still be included in the graph, but will appear as empty nodes with size 0. Like "-e", "-n" affects only the graph, not the alignment statistics.
Another option to reduce graph complexity is to ignore all transitions and only use transversions to calculate edge lengths, which can be chosen with "-x". Like "-e" and "-n", this only affects the graph, not the alignment statistics.
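To illustrate the distinction, the sketch below counts only transversions (changes between a purine and a pyrimidine) between two aligned sequences and ignores transitions. It is a simplified illustration that skips ambiguity codes, not Fitchi's own code; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(a, b):
    """True if one nucleotide is a purine and the other a pyrimidine."""
    return (a in PURINES and b in PYRIMIDINES) or (a in PYRIMIDINES and b in PURINES)


def transversion_distance(seq1, seq2):
    """Count transversions between two aligned sequences; transitions are ignored."""
    return sum(is_transversion(a, b) for a, b in zip(seq1.upper(), seq2.upper()))


# A>G and C>T are transitions, A>C is a transversion.
print(transversion_distance("AAGC", "GCGT"))
```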
With the default setting, the calculation of pairwise Fst values assumes that the alignment contains diploid phased sequences, with each pair of consecutive sequences coming from the same individual. If sequences are haploid and each individual contributes only a single sequence, this can be specified with --haploid. This option only affects the Fst calculation, not the other statistics or the haplotype genealogy graph.
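The assumption about consecutive sequences can be made explicit with a short sketch that groups alignment records by individual. The function name group_by_individual is hypothetical, and the sketch only mirrors the pairing rule described above, not the Fst calculation itself.

```python
def group_by_individual(sequences, haploid=False):
    """Group records by individual: pairs of consecutive records by default,
    single records if haploid is True."""
    step = 1 if haploid else 2
    if len(sequences) % step:
        raise ValueError("diploid data requires an even number of sequences")
    return [tuple(sequences[i:i + step]) for i in range(0, len(sequences), step)]


records = ["ind1_a", "ind1_b", "ind2_a", "ind2_b"]
print(group_by_individual(records))
print(group_by_individual(records, haploid=True))
```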
For purely aesthetic reasons, you might want to increase or decrease the size of all nodes in relation to the edge lengths connecting them. This can be done by specifying a scale factor for all radii with the "-m" option (for example "-m 2.5"). You could also try "-m auto", which tells Fitchi to try to find an ideal size automatically so that no two nodes overlap.
A random number seed can also be specified. Wherever multiple solutions of the Fitch algorithm are equally good, Fitchi decides among these solutions at random, so multiple runs of Fitchi with the same dataset (without a random number seed) may lead to different results. To make results reproducible, specify a random number seed; using the same seed will always produce the same solution of the Fitch algorithm.
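The effect of a seed on tie-breaking can be sketched with Python's random module. This is an illustration of the principle only; the function name resolve_ties and the representation of candidate states as sets are assumptions, not Fitchi internals.

```python
import random


def resolve_ties(candidate_sets, seed=None):
    """Pick one state per site from sets of equally good candidates."""
    rng = random.Random(seed)  # a private generator; seed=None uses system entropy
    # Sorting makes the choice independent of set iteration order.
    return "".join(rng.choice(sorted(candidates)) for candidates in candidate_sets)


sites = [{"A", "G"}, {"C"}, {"C", "T"}, {"A", "C", "G"}]
print(resolve_ties(sites, seed=42))
print(resolve_ties(sites, seed=42))  # same seed, same reconstruction
```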
Fitchi writes HTML files with embedded SVG code. These can be read by most browsers, including recent versions of Firefox and Safari. HTML files are great for displaying the graphs and associated statistics; however, if you're running a large number of analyses with Fitchi, or you'd like to prepare a haplotype genealogy for publication, you might want to extract particular bits of information from these HTML files. That's why fitchi_extract.py comes along with fitchi.py. Using the -e option of fitchi_extract.py, you can choose which information to extract from the HTML. For example,
python3 fitchi_extract.py example.html example.svg -e svg
returns only the SVG part of the HTML and writes it to file example.svg, after minor changes are made to the SVG code to include a figure legend. If you need a black and white figure, you can specify "-e svg_bw", and all colors will be converted to black and white before extraction of the SVG code. Similarly, "-e svg_simple" removes the semi-transparent gradients that cause the glossy look of nodes, and "-e svg_simple_bw" removes this and also converts to black and white.
If you're not interested in the haplotype genealogy at all, but only in a particular alignment statistic, you could specify for example "-e prop_var" for the overall proportion of variable sites in the alignment, "-e tot_var" for the total number of variable sites, or "-e fst" for the first pairwise Fst value. Type
python3 fitchi_extract.py -h
to see a full list of available options.
Since fitchi_extract.py can be piped just like fitchi.py, you could do the following to obtain a statistic (here the Fst between pop3 and pop5) directly without writing an HTML file:
cat example.nex | python3 fitchi.py -p pop3 pop5 | python3 fitchi_extract.py -e fst
Credits are due to Ethan Schoonover, whose color scheme Solarized substantially contributes to the good look of Fitchi's haplotype genealogy graphs. Further thanks go to the developers of the networkx and pygraphviz python modules. While Fitchi version >1.1 no longer imports the networkx module, it implements the Graph class developed for networkx.