How to Use the MicrobesOnline Tree-Browser

Navigating to the tree-browser

To use the tree browser, the first step is to choose an "anchor" gene whose evolutionary history you want to explore. To find a gene, visit microbesonline.org, and use "Search genes" or "Sequence search". Then click on the T link in the search results or visit the gene page and click on "Browse genomes by trees".

If you want to link to the tree browser from your own web site, use the URL http://microbesonline.org/cgi-bin/treeBrowse.cgi?locus=NNNN, where NNNN is the numeric VIMSS ID of the gene. We try to keep VIMSS IDs stable even when the genome sequence or gene annotations are updated.
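
If you generate many such links, a small helper can assemble them. The following Python sketch is only an illustration: the function name tree_browser_url is our own invention, the ID 12345 in the usage note is a placeholder rather than a real gene, and only the URL pattern itself comes from the text above.

```python
from urllib.parse import urlencode

# URL pattern taken from the help text; everything else is illustrative.
BASE_URL = "http://microbesonline.org/cgi-bin/treeBrowse.cgi"


def tree_browser_url(vimss_id: int) -> str:
    """Build a tree-browser link for a numeric VIMSS ID (hypothetical helper)."""
    # int() rejects non-numeric input, since the locus parameter is numeric.
    return f"{BASE_URL}?{urlencode({'locus': int(vimss_id)})}"
```

For example, tree_browser_url(12345) yields the base URL followed by ?locus=12345.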

What the tree-browser shows

Given a gene of interest, the tree browser selects a domain or gene family and displays relevant parts of a phylogenetic tree. The tree shows you which relatives are closest, and hence which are most likely to have the same function. The gene trees are computed beforehand and are rooted.

In the initial display, the tree-browser shows a gene tree together with the genomic context of those genes (see example). Conserved context implies conserved function, and often implies a similarity in function to the surrounding genes.

The tree browser can also show the gene tree together with the species tree so that you can compare them (see example). This highlights the presence or absence of a gene in related genomes, or, if close relatives in the gene tree are from distant genomes, suggests that horizontal gene transfer occurred.

The gene tree

At the top, the tree browser reports which gene it is showing information for and which gene family it used to build the tree. The gene family computations are not perfect: sometimes a gene is assigned to a family but close homologs of the gene are not. In this case those homologs will be missing from the tree (see coverage).

The gene tree is at the bottom left. The gene of interest, or "anchor", is at the top of the tree, and the gene's closest homologs are beneath it. To allow more distant homologs to be shown, groups of closely related homologs are collapsed to a single cluster or clade, and a single gene is shown for each cluster. Each clustered gene is shown with a "bush" at the left to show how deep the internal branch is and with a "+" on the name. When you hover over a clustered gene in the tree, it will show "gene-name and N similar sequences". To highlight the phylogenetic position of the "anchor" gene, it is not collapsed into a cluster unless the other sequences in the cluster are 99% identical to it. If you wish, you can change the level of clustering or turn it off entirely.

If a gene is from one of the genomes you have selected, its name will be in magenta. Genes from selected genomes are always shown (they are never "hidden" inside clusters). If you have not selected any genomes, then the anchor gene and its paralogs (other genes from that genome) will be in magenta and will always be shown.

If a gene has been analyzed in a published paper, the gene name will be underlined in green. By default, characterized genes are collapsed into clusters with uncharacterized genes but not with other characterized genes. Most genes have few characterized homologs, but if there are many characterized homologs, then more distant homologs will not be shown. If you don't want to see all these characterized homologs, uncheck the box for "Always show characterized genes."

You can click on a gene's name in the gene tree (or in the genome context view) to bring up a menu. The menu lets you view the gene (or cluster), recenter the tree-browser to focus on that gene, or add the gene to a cart for future analysis, such as building your own alignment or phylogenetic tree. If the gene represents a cluster, the menu also includes the option to partially "expand" the cluster.

Confident clades in the tree (those with support 0.95 or higher) are marked with a black circle, and less confident clades (support of at least 0.8 but below 0.95) are marked with a grey circle. You can see the support value at any node in the tree by hovering over it. However, if you find a phylogenetic grouping of interest by using the tree-browser, we strongly urge you to build your own tree. You can do this easily within MicrobesOnline by adding genes to a cart, building a multiple sequence alignment, and then building a tree. Building your own tree allows you to check alignment quality and to use a higher-quality (slower) tree-building method.
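
The marker rule can be written out as a short Python sketch. The function name support_marker is hypothetical and this is not the tree-browser's own code; only the two thresholds come from the text above.

```python
from typing import Optional


def support_marker(support: float) -> Optional[str]:
    """Return the circle color for a node's support value, or None for no marker."""
    if support >= 0.95:
        return "black"  # confident clade
    if support >= 0.8:
        return "grey"   # less confident clade
    return None         # below 0.8: no circle is drawn
```

Because the higher threshold is checked first, a node with support 0.95 or more is always black, never grey.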

Gene context

In the default view, the tree-browser shows a tree at the left, and to the right of each gene it shows the region of the genome surrounding that gene. Within the genome context view, every gene is shown by a pointed box; the direction of the box shows what strand the gene is on. The gene box that corresponds to the gene tree entry will have its name in green and will (by default) be at the center of the display.

To help you determine which of the other genes shown are homologous to each other, related genes are shown with the same color. If a gene is in grey, you can still try recentering on it to see what is homologous. For an explanation of how it works, see the color option. Non-coding genes are in black.

As in the gene tree, you can see more information about each gene by hovering your cursor on it, and clicking on the gene brings up a menu with more options.

The species tree

If you click on "show species tree", then the tree-browser will show you the species tree. The genome of the selected gene will be at the top of the species tree. Genomes that contain one or more genes from the shown portion of the gene tree will be shown in green. Related genomes that do not contain any of those shown genes will be shown in red. Note that a genome will still be shown in red if it contains genes that are in the tree but are too distantly related to the anchor to be shown.

By default, closely related groups of genomes are collapsed to a single node so that the genome tree is more compact and comprehensible. These groups will be labelled with something like "Vibrionaceae (10 genes 8 genomes)" to show how many species are grouped together, and also how many of the genes in the shown portion of the gene tree are in those genomes. You can hover on this to see the names of some of those genomes, and you can click to see the full list of genomes or for more detailed information about the genome. The group will be in yellow if some genomes in the group contain genes from the shown section of the gene tree but other genomes in the group do not.
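
The green, red and yellow rule for a genome or a collapsed group of genomes can be summarized in a few lines of Python. This is a sketch of the rule as described above, not the tree-browser's implementation, and the name group_color is hypothetical.

```python
from typing import List


def group_color(has_shown_gene: List[bool]) -> str:
    """Color for a genome group; each flag says whether one genome in the
    group contains a gene from the shown portion of the gene tree."""
    if not has_shown_gene:
        raise ValueError("a group must contain at least one genome")
    if all(has_shown_gene):
        return "green"   # every genome has a shown gene
    if any(has_shown_gene):
        return "yellow"  # some genomes have a shown gene, others do not
    return "red"         # no genome has a shown gene
```

A single uncollapsed genome is just a group of one, so it can only be green or red.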

To indicate how the gene tree corresponds to the species tree, the genes or clusters in the gene tree are numbered in blue: 1, 2, 3, etc. This numbering is only shown when the species tree is shown. Each genome or group of genomes is labelled with the numbers of the genes that it contains, e.g. 1,11. Because closely related genes are shown as a single node in the gene tree, the same gene number can show up in several genomes. Conversely, even a single genome can contain several members of a gene family (that is, paralogs).
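
The following Python sketch shows how such labels could be derived. The input layout (a list of clusters in display order, each a list of (gene, genome) pairs) and the name label_genomes are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def label_genomes(clusters: List[List[Tuple[str, str]]]) -> Dict[str, List[int]]:
    """Map each genome to the numbers of the gene-tree entries it contains.

    clusters[i] holds the (gene, genome) pairs collapsed into entry i + 1.
    """
    labels = defaultdict(set)
    for number, members in enumerate(clusters, start=1):
        for _gene, genome in members:
            labels[genome].add(number)
    return {genome: sorted(numbers) for genome, numbers in labels.items()}
```

With the made-up input [[("geneA", "G1")], [("geneB", "G1"), ("geneC", "G2")]], genome G1 is labelled 1,2 (it holds paralogs) and entry 2 appears in both G1 and G2 (a cluster spanning two genomes).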

The tree-browser's controls

The tree-browser has many controls that allow you to customize the display. After adjusting these settings, hit the "Update" button to see a new display.

Which tree to use

Domain used: This option lets you choose which gene family or which domain of the protein to show the tree for. MicrobesOnline includes pre-computed trees for every COG, Pfam, TIGRFam, SMART, PIRSF, SuperFamily, and Gene3D family. (Roughly speaking, COGs and TIGRFams are full-length gene families and Pfams, SMART, PIRSF, and Gene3D are domain families.) To ensure that virtually every gene has a tree, MicrobesOnline also includes trees for gene families that were identified by FastBLAST and for additional "ad-hoc" families. By default, the tree-browser chooses a tree that has the most aligned positions and the best coverage of putative orthologs.

Coverage: Occasionally the tree-browser selects a family that does not include all of the close homologs of the gene of interest. To test for this problem, the tree-browser shows how many of the putative orthologs of the anchor gene are present in the tree. You can check more thoroughly by clicking on "Check coverage of homologs." If the coverage is poor, try selecting a tree for a different family.

Which genes to show

Cluster: By default, the tree browser clusters together closely related clades. That is, given a clade in the gene tree whose members are all closely related to each other, it selects just one of them to show. You can turn this feature off by setting "Cluster" to none, or you can adjust the amount of clustering. Lower values allow more homologs to be grouped together so that you can see more distant homologs; the value corresponds roughly to the minimum %identity of the members of a cluster. The anchor gene is treated specially, and is only put in a cluster with genes that are ~99% identical to it (unless you turn clustering off).
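
To make the idea concrete, here is a simplified Python sketch of this kind of clustering. It is not the tree-browser's actual algorithm: the nested-tuple tree, the table of pairwise %identities, the choice of representative and the function names are all assumptions made for illustration. The sketch collapses a clade when every pair of its members is at least as similar as the threshold, and applies a stricter 99% threshold to any clade that contains the anchor.

```python
from itertools import combinations


def leaves(node):
    """List the leaf names under a node (a leaf is a str, a clade is a tuple)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def collapse(node, identity, threshold, anchor, anchor_threshold=99.0):
    """Return (representative, cluster_size) pairs for the displayed entries.

    identity maps frozenset({a, b}) to the %identity between genes a and b.
    """
    members = leaves(node)
    limit = anchor_threshold if anchor in members else threshold
    pairs = combinations(members, 2)
    if all(identity[frozenset(pair)] >= limit for pair in pairs):
        # A single leaf has no pairs, so it always forms its own entry.
        representative = anchor if anchor in members else members[0]
        return [(representative, len(members))]
    shown = []
    for child in node:
        shown.extend(collapse(child, identity, threshold, anchor, anchor_threshold))
    return shown
```

In this sketch, lowering the threshold merges more genes into each entry, which is why lower values leave room for more distant homologs in the display.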

Genomes selected: The tree browser's clustering also depends on the "Genomes selected" at the top of the page. The tree browser always shows genes in selected genomes (they are never hidden inside clusters with other genes), and colors them magenta in the gene tree. You can change the list of selected genomes and then hit Update to make the tree browser show genes from specific genomes of interest.

Expand: You can "expand" a specific gene of interest by clicking on a leaf in the gene tree. Only collapsed nodes (those marked with a "+") can be expanded. Once a node is expanded, you can collapse it by clicking on the red minus sign.

Limit: You can also control how many genes or clusters to show in the gene tree. Beyond the "limit," more distant homologs are ignored. Showing fewer clusters creates a more compact display and greatly speeds up the browser when showing genome context. However, if there is an error in the tree, then the supposedly more distant homologs that were ignored may actually be important for understanding the function or history of the gene.

Gene context options

Overlapping genes on separate lines: If you are showing gene context, then by default the context of each gene in the tree is drawn on a single line. This can make it hard to see overlapping genes. If you wish, you can place overlapping genes on separate lines instead.

Color: If you are showing gene context, then the genes are colored according to their COG. The tree-browser can also use "orthologs" to color genes that are not in a COG, but by default the tree-browser only does this for orthologs of genes in the 1st track (the anchor genome). If a gene is in grey, that means it is not in a COG and is not orthologous to any shown gene from the anchor genome. (Orthologs are assigned based on BLAST scores, so they may not be reliable.) If you select the "exhaustive" option, then the tree-browser looks at all orthologs of all genes shown to try to find relationships. However, because of paralogs, a grey gene could still have close homologs in the view. To learn more about a gene, click on it and select "recenter."
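
The default coloring rule can be sketched in Python as follows. The dictionary layout for a gene, the anchor_orthologs lookup table and the name color_key are hypothetical; the sketch returns a key such that genes with the same key would share a color, and it does not model the "exhaustive" option.

```python
from typing import Dict


def color_key(gene: dict, anchor_orthologs: Dict[str, str]) -> str:
    """Return a key identifying the color group of a gene in the context view.

    anchor_orthologs maps a gene ID to the ID of its ortholog among the
    shown genes of the anchor genome (the 1st track), if there is one.
    """
    if not gene["coding"]:
        return "black"                  # non-coding genes are drawn in black
    if gene.get("cog"):
        return "COG:" + gene["cog"]     # members of one COG share a color
    ortholog = anchor_orthologs.get(gene["id"])
    if ortholog:
        return "ortholog:" + ortholog   # colored like its anchor-genome ortholog
    return "grey"                       # no COG and no shown ortholog
```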

Species tree options

Cluster species: If you are showing the species tree, then you can control the extent to which similar genomes are grouped together. This is analogous to the "Cluster" control for the gene tree. However, the %identities for species are on a different scale because the species tree is built using the most highly conserved proteins. For example, between Escherichia coli and Salmonella typhimurium, the typical protein is about 20% different, but the species distance is only 1% (that is, 0.01).

Simplify: It often happens that a gene is present in one bacterium but not in any of its relatives. By default, the tree-browser will group some of those relatives together, even though they do not form a clade, so that the species tree is more compact. If you want to see the details of gene presence/absence or if you want to check for horizontal gene transfer events, you should turn "Simplify" off. In particular, this will highlight cases where a genome contains the gene but multiple related groups of bacteria lack this gene. This suggests that the gene was acquired by horizontal gene transfer (although multiple independent losses of the gene could also have occurred). Before concluding that HGT occurred, you should check the tree's coverage.

Changing the tree's look

Rectangular style: By default, the tree-browser draws trees in a "straight" style, in which the vertical dimension is meaningless and the horizontal length of a branch indicates the amount of evolution on that branch. The tree-browser can use the traditional rectangular style instead.

Use branch lengths: If you wish, the tree-browser can ignore the branch lengths and show only the branching order.

How the trees are built

All of the trees shown in the tree-browser are pre-computed. Every time we update the MicrobesOnline database, we compute a new tree for every gene family and a new species tree.

Computing the gene trees

MicrobesOnline includes pre-computed trees for every COG, TIGRFam, Pfam, SMART, and PIRSF family, and for every Superfamily and Gene3D model. Because many genes have homologs but do not belong to any of these families, we also build trees for all of the additional families identified by FastBLAST. We do not build trees for PANTHER families because the alignments are highly gapped (many of the hits only align to a small fraction of the model). Instead, we build "ad hoc" trees for genes that were not included in any of the other trees. These ad-hoc trees include the seed gene and its homologs as identified by FastBLAST. Genes are assigned to COGs by reverse position-specific BLAST against the conserved domain database (CDD), and genes are only included in the best-hitting COG and only if the hit covers most of the COG. Genes are assigned to other families using FastHMM.

Once we have a list of homologous regions of proteins, we need to align them. We create an alignment for each COG with MUSCLE, limited to two iterations (-maxiters 2); this greatly speeds the alignment but reduces quality. We align HMM-based families with hmmalign from the HMMer package. FastBLAST alignments are derived from the pairwise alignments of the seed sequence. The alignments are trimmed slightly: positions that are gaps in ≥ 90% of the sequences are removed. This trimming is minimal, and makes the trees more sensitive to any errors in the alignments, but more aggressive trimming on large gene families results in very small numbers of aligned positions and poor phylogenetic signal, which would also lead to errors. Finally, we build phylogenetic trees with FastTree, a fast implementation of neighbor-joining written by Morgan Price. The support values are from a local bootstrap (that is, we see how often resampled alignments support the given split over other arrangements of the 4 subtrees). As many of the trees contain thousands of sequences, and some trees contain over 100,000 sequences, higher-quality tree-building methods (e.g. computing maximum-likelihood trees, or even neighbor-joining with maximum likelihood distances) would be prohibitively slow. We perform midpoint rooting on the trees.
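
The trimming step is simple enough to illustrate directly. The Python sketch below is not the MicrobesOnline pipeline code; it assumes the alignment is a list of equal-length strings with "-" as the gap character, and the name trim_gappy_columns is our own.

```python
from typing import List


def trim_gappy_columns(alignment: List[str], max_gap_fraction: float = 0.9) -> List[str]:
    """Drop alignment columns that are gaps in >= max_gap_fraction of sequences."""
    if not alignment:
        return []
    n_seqs = len(alignment)
    keep = [
        col
        for col in range(len(alignment[0]))
        # Keep a column only if its gap fraction is strictly below the cutoff.
        if sum(seq[col] == "-" for seq in alignment) / n_seqs < max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]
```

With ten sequences, a column that is a gap in nine of them is removed, while a column that is a gap in eight of them is kept.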

As mentioned previously, if you find a phylogenetic grouping of interest using the tree-browser, we strongly urge you to confirm it by building your own custom tree on MicrobesOnline. We have done this ourselves for dozens of genes. High-bootstrap nodes in the pre-built trees are usually correct, but on rare occasions, high-bootstrap nodes are strongly rejected by the custom tree. We suspect that this is because of alignment problems. A more common situation is that the custom tree is better resolved because more positions are aligned or because more positions are maintained after trimming. This is expected if the custom tree includes only close homologs. Also, sometimes the tree is for a gene family that does not include all of the close homologs of the gene of interest. The tree-browser can check for coverage.

If you want to download these alignments or these trees, please contact us.

Computing the species tree

MicrobesOnline includes a pre-computed species tree. This tree is updated regularly as new genomes are added. The species tree is built from a variety of source trees by using matrix representation of parsimony ("MRP"). The source trees are built from concatenated alignments of highly conserved proteins, and are primarily built by maximum likelihood methods.

The first source tree is a high-quality tree of 211 bacteria and 52 genes. Genes were aligned with MUSCLE, alignments were trimmed with gblocks, and the tree was computed with Mr. Bayes. However, this tree required several months to build, so it cannot be updated regularly, and it covers only these 211 bacteria.

The second source tree is another high-quality tree of 191 species (including some eukaryotes, which we ignore) and 31 genes from Bork's group (Ciccarelli et al. 2006). For genomes in common, the topology of this tree is very similar to that of the Mr. Bayes tree.

The third source tree includes almost all of the prokaryotes in MicrobesOnline. (Genomes are not included if they are low-quality draft assemblies or if they are mixtures of two related species. The mixtures have also been renamed to end with "spp.") This tree is based on 74 COGs that are present as a single copy in most bacteria and archaea. Each COG was aligned with MUSCLE (using default settings). Positions were trimmed if they were gaps in more than 5% of genomes or if they were adjacent to such a position. We used protdist from the phylip package to estimate distances with gamma-distributed rates. For each COG, we normalized the distances so that the median off-diagonal entry in the matrix was 1. Then, we combined these distance matrices by using the median, and we used neighbor to build a tree. We bootstrapped this procedure 100 times by resampling the COGs (not resampling the alignment positions). Given the median distances for each bootstrap, we computed a tree with neighbor-joining (using FastTree), and finally we computed an extended-majority-rule consensus tree (using consense from the phylip package).
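
The normalization and median-combination steps can be sketched in Python as below. This is an illustration rather than the code we ran: it assumes that every per-COG matrix is a square list of lists covering the same genomes in the same order (in practice a COG may be missing from some genomes), and the function names are hypothetical.

```python
from statistics import median
from typing import List

Matrix = List[List[float]]


def normalize(matrix: Matrix) -> Matrix:
    """Rescale a distance matrix so that its median off-diagonal entry is 1."""
    n = len(matrix)
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    scale = median(off_diagonal)
    return [[d / scale for d in row] for row in matrix]


def combine(matrices: List[Matrix]) -> Matrix:
    """Combine per-COG matrices by taking the entry-wise median after normalizing."""
    normalized = [normalize(m) for m in matrices]
    n = len(normalized[0])
    return [
        [median(m[i][j] for m in normalized) for j in range(n)]
        for i in range(n)
    ]
```

Normalizing first keeps fast-evolving COGs from dominating, and the median makes the combined distance robust to a few COGs with unusual histories.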

Additional high-quality trees were computed for much smaller groups of genomes. We began with a supertree defined by the previous trees (with the third tree weighted low because of its lower quality). For each internal node in this rough supertree, we selected a small number of descendant genomes and close outgroups (fewer than 20 genomes). Given this small group of genomes, we built a maximum likelihood tree (using phyml) from proteins that are conserved among them (but might not be present in distantly related organisms), again using MUSCLE and gblocks. Using more genes increases the phylogenetic signal near the tips of the tree, and building trees for small groups of genomes allows us to use maximum-likelihood methods, which are believed to be more accurate than neighbor-joining.

Finally, we used an unrooted variant of matrix representation of parsimony ("MRP") to combine all of the above trees. We weighted the Mr. Bayes tree and the Bork group's tree 3x higher and the small maximum-likelihood trees 2x higher than the neighbor-joining tree. Within each tree, we weighted nodes with long internal branches higher, with relative weights ranging from 0.5 to 1.5. Overall, the results are dominated by maximum-likelihood trees. Given this matrix, we searched for the maximum parsimony tree with PAUP, using "multrees=no" for performance reasons. After combining the tree topologies, we computed branch lengths from the distance matrix (the same one that was used to create the neighbor-joining tree), using phylip's fitch.
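
For readers unfamiliar with MRP, the Python sketch below shows how source trees are typically encoded as a weighted character matrix. It uses the standard clade-based coding, not the unrooted variant we used, and it applies one weight per source tree without the per-node weighting described above. The input layout and the name mrp_matrix are assumptions for illustration; the parsimony search itself (done with PAUP) is not shown.

```python
from typing import Dict, List, Set, Tuple

# One source tree: (weight, taxa present in the tree, clades of the tree).
SourceTree = Tuple[float, Set[str], List[Set[str]]]


def mrp_matrix(source_trees: List[SourceTree], all_taxa: List[str]):
    """Encode each clade as one character: 1 = in the clade, 0 = in the
    tree but outside the clade, ? = taxon absent from that source tree."""
    columns: List[Dict[str, str]] = []
    weights: List[float] = []
    for weight, taxa, clades in source_trees:
        for clade in clades:
            columns.append({
                taxon: ("1" if taxon in clade else "0") if taxon in taxa else "?"
                for taxon in all_taxa
            })
            weights.append(weight)
    rows = {taxon: "".join(col[taxon] for col in columns) for taxon in all_taxa}
    return rows, weights
```

Each column carries the weight of the tree it came from, so a parsimony program that supports character weights will favor splits found in the more trusted trees.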

The supertree contains support values that indicate the proportion of relevant source trees that support the split. The small maximum-likelihood trees also have support values (from phyml aLRT, not bootstrap). However, because the concatenated alignments are large, errors in tree reconstruction are more likely to be due to the limitations of the statistical models and of the reconstruction algorithms, rather than limitations in the amount of sequence. In other words, we wouldn't necessarily expect support value analysis to identify potential errors in the tree. Also, relationships within a species should be viewed skeptically because of recombination, which is not considered by the methods we used. Finally, distances within a genus tend to be very small (and hence not necessarily meaningful) in the supertree, but there should be a useful maximum-likelihood tree for each genus.

You can browse the source trees by clicking on nodes of the supertree (either internal or terminal nodes) in the species tree viewer.

Downloading data & making figures

You can download data from the tree-browser for further analysis or for making figures.

For more information about the tree-browser, please contact us at gtlweb@vimss.lbl.gov.

Last updated July 10 2008