Introduction to data.tree (2024)

Abraham Lincoln once said, "Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
Aunt Margaret used to say, "If you dream of a forest, you'd better learn how to plant a tree."
data.tree says, "No matter if you are a lumberjack or a tree hugger. I will be your sanding block, and I will be your seed."

Trees

Trees are ubiquitous in mathematics, computer science, data sciences,finance, and in many other attributes. Trees are especially useful whenwe are facing hierarchical data. For example, trees areused:

in decision theory (cf.decision trees)
in machine learning (e.g.classification trees)
in finance, e.g.to classify financial instruments into assetclasses
in routing algorithms
in computer science and programming (e.g.binary search trees,XML)
e.g.for family trees

For more details, see the applications vignette by typingvignette("applications", package = "data.tree")

Trees in R

Tree-like structures are already used in R. For example, environmentscan be seen as nodes in a tree. And CRAN provides numerous packages thatdeal with tree-like structures, especially in the area of decisiontheory. Yet, there is no general purpose hierarchical data structurethat could be used as conveniently and generically as, say,data.frame.

As a result, people often try to resolve hierarchical problems in atabular fashion, for instance with data.frames. But often, hierarchiesdon’t marry with tables, and various workarounds are usuallyrequired.

Trees in `data.tree`

This package offers an alternative. The data.treepackage lets you create hierarchies, called data.treestructures. The building block of theses structures areNode objects. The package provides basic traversal, search,and sort operations, and an infrastructure for recursive treeprogramming. You can decorate Nodes with your ownattributes and methods, so as to extend the package to your needs.

The package also provides convenience methods for neatly printing andplotting trees. It supports conversion from and todata.frames, lists, and other tree structuressuch as dendrogram, phylo objects from the apepackage, igraph, and other packages.

Technically, data.tree structures are bi-directional,ordered trees. Bi-directional means that you can navigate from parent tochildren and vice versa. Ordered means that the sort order of thechildren of a parent node is well-defined.

Definitions

data.tree structure: a tree,consisting of multiple Node objects. Often, the entry pointto a data.tree structure is the root Node
Node: both a class and the basicbuilding block of data.tree structures
attribute: an active, a field, or a method. **Notto be confused with standard R attributes, c.f. ?attr,which have a different meaning. Many methods and functions have anattribute arg, which can refer to a an active, a field or amethod. For example, see ?Get
active (sometimes called property): a field on aNode that can be called like an attribute, but behaves likea function without arguments. For example:node$position
field: a named value on a Node,e.g.node$cost <- 2500
method: a function acting on an object (on aNode in this context). Many methods are available in OOstyle (e.g.node$Revert()) or in traditional style(Revert(node))
inheritance: in this context, inheritance refers toa situation in which a child Node inherits e.g.anattribute from one of its ancestors. For example, see ?Get,?SetNodeStyle

Tree creation

There are different ways to create a data.treestructure. For example, you can create a treeprogrammatically, by conversion fromother R objects, or from a file.

Create a tree programmatically

Let’s start by creating a tree programmatically. We do this bycreating Node objects, and linking them together so as todefine the parent-child relationships.

In this example, we are looking at a company, Acme Inc., and the treereflects its organisational structure. The root (level 1) is thecompany. On level 2, the nodes represent departments, and the leaves ofthe tree represent projects that the company is considering for nextyear:

library(data.tree)acme <- Node$new("Acme Inc.") accounting <- acme$AddChild("Accounting") software <- accounting$AddChild("New Software") standards <- accounting$AddChild("New Accounting Standards") research <- acme$AddChild("Research") newProductLine <- research$AddChild("New Product Line") newLabs <- research$AddChild("New Labs") it <- acme$AddChild("IT") outsource <- it$AddChild("Outsource") agile <- it$AddChild("Go agile") goToR <- it$AddChild("Switch to R")print(acme)

## levelName## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Software ## 4 ¦ °--New Accounting Standards## 5 ¦--Research ## 6 ¦ ¦--New Product Line ## 7 ¦ °--New Labs ## 8 °--IT ## 9 ¦--Outsource ## 10 ¦--Go agile ## 11 °--Switch to R

As you can see from the previous example, each Node isidentified by its name, i.e.the argument you pass into theNode$new(name) constructor. The name needs to beunique among siblings, such that paths to Nodesare unambiguous.

Node inherits from R6 reference class. Thishas the following implications:

You can call methods on a Node in OO style,e.g.acme$Get("name")
Node exhibits reference semantics. Thus,multiple variables in R can point to the same Node, andmodifying a Node will modify it for all referencingvariables. In the above code example, both acme$IT andit reference the same object. This is different from thevalue semantics, which is much more widely used in R.

Create a tree from a `data.frame`

Creating a tree programmatically is useful especially in the contextof algorithms. However, most times you will create a tree by conversion.This could be by conversion from a nested list-of-lists, by conversionfrom another R tree-structure (e.g.an ape phylo), or byconversion from a data.frame. For more details on all theoptions, type ?as.Node and refer to the See Alsosection.

One of the most common conversions is the one from adata.frame in table format. The following code illustratesthis. We load the GNI2014 data from the treemap package. Thisdata.frame is in table format, meaning that each row willrepresent a leaf in the data.tree structure:

library(treemap)data(GNI2014)head(GNI2014)

## iso3 country continent population GNI## 3 BMU Bermuda North America 67837 106140## 4 NOR Norway Europe 4676305 103630## 5 QAT Qatar Asia 833285 92200## 6 CHE Switzerland Europe 7604467 88120## 7 MAC Macao SAR, China Asia 559846 76270## 8 LUX Luxembourg Europe 491775 75990

Let’s convert that into a data.tree structure! We startby defining a pathString. The pathString describes thehierarchy by defining a path from the root to each leaf. In thisexample, the hierarchy comes very naturally:

GNI2014$pathString <- paste("world", GNI2014$continent, GNI2014$country, sep = "/")

Once our pathString is defined, conversion to Node is very easy:

population <- as.Node(GNI2014)print(population, "iso3", "population", "GNI", limit = 20)

## levelName iso3 population GNI## 1 world NA NA## 2 ¦--North America NA NA## 3 ¦ ¦--Bermuda BMU 67837 106140## 4 ¦ ¦--United States USA 313973000 55200## 5 ¦ ¦--Canada CAN 33487208 51630## 6 ¦ ¦--Bahamas, The BHS 309156 20980## 7 ¦ ¦--Trinidad and Tobago TTO 1310000 20070## 8 ¦ ¦--Puerto Rico PRI 3971020 19310## 9 ¦ ¦--Barbados BRB 284589 15310## 10 ¦ ¦--St. Kitts and Nevis KNA 40131 14920## 11 ¦ ¦--Antigua and Barbuda ATG 85632 13300## 12 ¦ ¦--Panama PAN 3360474 11130## 13 ¦ ¦--Costa Rica CRI 4253877 10120## 14 ¦ ¦--Mexico MEX 111211789 9870## 15 ¦ ¦--Grenada GRD 90739 7910## 16 ¦ ¦--St. Lucia LCA 160267 7260## 17 ¦ ¦--Dominica DMA 72660 6930## 18 ¦ ¦--St. Vincent and the Grenadines VCT 104574 6610## 19 ¦ ¦--Dominican Republic DOM 9650054 6040## 20 ¦ °--... 7 nodes w/ 0 sub NA NA## 21 °--... 6 nodes w/ 171 sub NA NA

This is a simple example, and more options are available. Type?FromDataFrameTable for all the details.

Create a tree from a file

Often, trees are created from one of many file formats. Whendeveloping this package, We opted for a multi-step approach, meaningthat you first import the file into one of the well-known R datastructures. Then you convert these into a data.treestructure. For example, typical import patterns could be:

csv -> data.frame in table format (?read.csv) ->data.tree (?as.Node.data.frame)
Newick -> ape phylo (?ape::read.tree) ->data.tree (?as.Node.phylo )
csv -> data.frame in network format (?read.csv)-> data.tree (c.f. ?FromDataFrameNetwork)
yaml -> list of lists (?yaml::yaml.load) ->data.tree (?as.Node.list)
json -> list of lists (e.g.?jsonlite::fromJSON)-> data.tree (?as.Node.list)

If you have a choice, we recommend you consider yaml format to storeand share your hierarchies. It is concise, human-readable, and very easyto convert to a data.tree. An example is provided here for illustration.The data represents what platforms and OS versions a group of studentsuse:

library(yaml)yaml <- "name: OS Students 2014/15OS X: Yosemite: users: 16 Leopard: users: 43Linux: Debian: users: 27 Ubuntu: users: 36Windows: W7: users: 31 W8: users: 32 W10: users: 4"osList <- yaml.load(yaml)osNode <- as.Node(osList)print(osNode, "users")

## levelName users## 1 OS Students 2014/15 NA## 2 ¦--OS X NA## 3 ¦ ¦--Yosemite 16## 4 ¦ °--Leopard 43## 5 ¦--Linux NA## 6 ¦ ¦--Debian 27## 7 ¦ °--Ubuntu 36## 8 °--Windows NA## 9 ¦--W7 31## 10 ¦--W8 32## 11 °--W10 4

In cases where your leaf elements have no attributes, you might wantto interpret them as nodes, and not as attributes. In such cases, youcan use interpretNullAsList = TRUE to convert these intoNodes (instead of attributes).

For example:

library(yaml)yaml <- "name: OS Students 2014/15OS X: Yosemite: Leopard:Linux: Debian: Ubuntu:Windows: W7: W8: W10:"osList <- yaml.load(yaml)osNode <- as.Node(osList, interpretNullAsList = TRUE)osNode$printFormatters <- list(h = "\u2500" , v = "\u2502", l = "\u2514", j = "\u251C")print(osNode, "users")

## levelName users## 1 OS Students 2014/15 NA## 2 ├─OS X NA## 3 │├─Yosemite NA## 4 │└─Leopard NA## 5 ├─Linux NA## 6 │├─Debian NA## 7 │└─Ubuntu NA## 8 └─Windows NA## 9 ├─W7 NA## 10 ├─W8 NA## 11 └─W10 NA

Node methods

As seen above, a data.tree structure is composed ofNode objects, and the entry point to adata.tree structure is always a Node, oftenthe root Node of a tree.

There are different types of methods:

OO-style actives (sometimes called properties) onNodes, such as e.g.Node$isRoot
OO-style methods on Nodes, such ase.g.Node$AddChild(name)
Classical R methods, such as e.g.Clone(node).

Actives Examples (aka Properties)

Actives look and feel like attributes, but they are dynamicallyevaluated. They are documented in the Node documentation,which is accessed by typing ?Node.

Remember our population example:

print(population, limit = 15)

## levelName## 1 world ## 2 ¦--North America ## 3 ¦ ¦--Bermuda ## 4 ¦ ¦--United States ## 5 ¦ ¦--Canada ## 6 ¦ ¦--Bahamas, The ## 7 ¦ ¦--Trinidad and Tobago ## 8 ¦ ¦--Puerto Rico ## 9 ¦ ¦--Barbados ## 10 ¦ ¦--St. Kitts and Nevis ## 11 ¦ ¦--Antigua and Barbuda ## 12 ¦ ¦--Panama ## 13 ¦ ¦--Costa Rica ## 14 ¦ ¦--Mexico ## 15 ¦ °--... 12 nodes w/ 0 sub## 16 °--... 6 nodes w/ 176 sub

population$isRoot

## [1] TRUE

population$height

## [1] 3

population$count

## [1] 7

population$totalCount

## [1] 196

population$attributes

## character(0)

population$attributesAll

## [1] "GNI" "continent" "country" "iso3" "population"

population$averageBranchingFactor

## [1] 24.375

The naming convention of the package is that attributes and activesare lower case, whereas methods are upper / CamelCase. RStudio and otherIDEs work well with data.tree. If you have aNode, simply type myNode$ + SPACE to get alist of available attributes, actives and methods.

OO-Style Methods Examples

Examples of OO-Style methods

You will find more information on these examples below.

Get will traverse the tree and collect specific values for theNodes it traverses:

sum(population$Get("population", filterFun = isLeaf))

## [1] 6683146875

Prune traverses the tree and keeps only the subtrees for which thepruneFun returns TRUE.

Prune(population, pruneFun = function(x) !x$isLeaf || x$population > 1000000)

## [1] 39

Note that the Prune function has side-effects, as it acts on theoriginal population object. The population sum is now smaller:

sum(population$Get("population", filterFun = isLeaf), na.rm = TRUE)

## [1] 6669737814

Traditional R Methods

popClone <- Clone(acme)

Traditional S3 generics are available especially for conversion:

as.data.frame(acme)

## levelName## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Software ## 4 ¦ °--New Accounting Standards## 5 ¦--Research ## 6 ¦ ¦--New Product Line ## 7 ¦ °--New Labs ## 8 °--IT ## 9 ¦--Outsource ## 10 ¦--Go agile ## 11 °--Switch to R

Though there is also a more specialised non-generic version:

ToDataFrameNetwork(acme)

## from to## 1 Acme Inc. Accounting## 2 Acme Inc. Research## 3 Acme Inc. IT## 4 Accounting New Software## 5 Accounting New Accounting Standards## 6 Research New Product Line## 7 Research New Labs## 8 IT Outsource## 9 IT Go agile## 10 IT Switch to R

Climbing a tree (tree navigation)

To climb a tree means to navigate to a specificNode in the data.tree structure.

Navigation by path

The most natural form of climbing a tree is to climb by path:

Navigation by position

However, there is a number of other ways to get to a specificNode. We can access the children of a Nodedirectly through Node$children:

acme$children[[1]]$children[[2]]$name

## [1] "New Accounting Standards"

Navigation by attributes

Furthermore, we can not only navigate by name, but also by otherattributes. This is achieved with the Climb method. Thename of each ... argument designates the field, and thevalue matches against Nodes. Each argument refers to thesubsequent level to climb. In this example, Climb takesacme’s child at position 1 (i.e.Accounting), then it takesAccounting's child called New Software:

acme$Climb(position = 1, name = "New Software")$path

## [1] "Acme Inc." "Accounting" "New Software"

As a shortcut, you can climb multiple levels with a singleargument:

tree <- CreateRegularTree(5, 5)tree$Climb(position = c(2, 3, 4))$path

## [1] "1" "1.2" "1.2.3" "1.2.3.4"

Finally, you can even combine. The following example starts on theroot, then looks for child at position 2, then for its child at position3. Next, we move to the child having name = “1.2.3.4”, and finally itschild having name “1.2.3.4.5”:

tree$Climb(position = c(2, 3), name = c("1.2.3.4", "1.2.3.4.5"))$path

## [1] "1" "1.2" "1.2.3" "1.2.3.4" "1.2.3.4.5"

Custom attributes

Just as with, say, a list, we can add any custom fieldto any Node in a data.tree structure. Let’s goback to our acme company:

acme

## levelName## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Software ## 4 ¦ °--New Accounting Standards## 5 ¦--Research ## 6 ¦ ¦--New Product Line ## 7 ¦ °--New Labs ## 8 °--IT ## 9 ¦--Outsource ## 10 ¦--Go agile ## 11 °--Switch to R

We now add costs and probabilities to the projects in eachdepartment:

acme$Accounting$`New Software`$cost <- 1000000acme$Accounting$`New Accounting Standards`$cost <- 500000acme$Research$`New Product Line`$cost <- 2000000acme$Research$`New Labs`$cost <- 750000acme$IT$Outsource$cost <- 400000acme$IT$`Go agile`$cost <- 250000acme$IT$`Switch to R`$cost <- 50000acme$Accounting$`New Software`$p <- 0.5acme$Accounting$`New Accounting Standards`$p <- 0.75acme$Research$`New Product Line`$p <- 0.25acme$Research$`New Labs`$p <- 0.9acme$IT$Outsource$p <- 0.2acme$IT$`Go agile`$p <- 0.05acme$IT$`Switch to R`$p <- 1print(acme, "cost", "p")

## levelName cost p## 1 Acme Inc. NA NA## 2 ¦--Accounting NA NA## 3 ¦ ¦--New Software 1000000 0.50## 4 ¦ °--New Accounting Standards 500000 0.75## 5 ¦--Research NA NA## 6 ¦ ¦--New Product Line 2000000 0.25## 7 ¦ °--New Labs 750000 0.90## 8 °--IT NA NA## 9 ¦--Outsource 400000 0.20## 10 ¦--Go agile 250000 0.05## 11 °--Switch to R 50000 1.00

Note that there is a list of reserved names you cannot use asNode attributes:

NODE_RESERVED_NAMES_CONST

## [1] "AddChild" "AddChildNode" "AddSibling" ## [4] "AddSiblingNode" "attributes" "attributesAll" ## [7] "averageBranchingFactor" "children" "Climb" ## [10] "Navigate" "FindNode" "clone" ## [13] "count" "Do" "fields" ## [16] "fieldsAll" "Get" "GetAttribute" ## [19] "height" "initialize" "isBinary" ## [22] "isLeaf" "isRoot" "leafCount" ## [25] "leaves" "level" "levelName" ## [28] "name" "parent" "path" ## [31] "pathString" "position" "printFormatters" ## [34] "Prune" "Revert" "RemoveAttribute" ## [37] "RemoveChild" "root" "Set" ## [40] "siblings" "Sort" "totalCount" ## [43] ".*"

Custom attributes in constructor

An alternative, often convenient way to assign custom attributes isin the constructor, or in the Node$AddChild method:

birds <- Node$new("Aves", vulgo = "Bird")birds$AddChild("Neognathae", vulgo = "New Jaws", species = 10000)birds$AddChild("Palaeognathae", vulgo = "Old Jaws", species = 60)print(birds, "vulgo", "species")

## levelName vulgo species## 1 Aves Bird NA## 2 ¦--Neognathae New Jaws 10000## 3 °--Palaeognathae Old Jaws 60

Custom attributes as function

Nothing stops you from setting a function as a field. This calculatesa value dynamically, i.e.whenever a field is accessed in treetraversal. For example, you can add a new Node to yourstructure, and the function will reflect this. Think of this as ahierarchical spreadsheet, in which you can set formulas into cells.

Consider the following example:

birds$species <- function(self) sum(sapply(self$children, function(x) x$species))print(birds, "species")

## levelName species## 1 Aves 10060## 2 ¦--Neognathae 10000## 3 °--Palaeognathae 60

data.tree maps the self argument to theNode at hand. Thus, you must name the argumentself.

Now, let’s assume we discover a new species. Then, the species on theroot adjusts dynamically:

birds$Palaeognathae$species <- 61print(birds, "species")

## levelName species## 1 Aves 10061## 2 ¦--Neognathae 10000## 3 °--Palaeognathae 61

This, together with the Set method and recursion,becomes a very powerful tool, as we’ll see later.

Printing

Basic Printing

Basic printing is easy, as you surely have noted in the previoussections. print displays a tree in a tree-grid view. On theleft, you have the hierarchy. Then you have a column per variable youwant to print:

print(acme, "cost", "p")

## levelName cost p## 1 Acme Inc. NA NA## 2 ¦--Accounting NA NA## 3 ¦ ¦--New Software 1000000 0.50## 4 ¦ °--New Accounting Standards 500000 0.75## 5 ¦--Research NA NA## 6 ¦ ¦--New Product Line 2000000 0.25## 7 ¦ °--New Labs 750000 0.90## 8 °--IT NA NA## 9 ¦--Outsource 400000 0.20## 10 ¦--Go agile 250000 0.05## 11 °--Switch to R 50000 1.00

For more advanced printing, you have a few options.

Formatters

You can use formatters to output a variable in a certainway. You can use formatters in two ways:

You can set them on a Node using theSetFormat method. If you do this, then the formatter willbe picked up as a default formatter whenever you print,Get, convert to data.frame, etc. Formatterscan be set on any Node in a data.treestructure act on any descendant. So you can overwrite a formatter for asub-tree.
You can add an explicit ad-hoc formatter to the Getmethod (see below). This will overwrite default formatters previouslyset via the SetFormat method. You can also set theformatter to identity to void a default formatter.

Setting a formatter using the SetFormat method:

SetFormat(acme, "p", formatFun = FormatPercent)SetFormat(acme, "cost", formatFun = function(x) FormatFixedDecimal(x, digits = 2))print(acme, "cost", "p")

## levelName cost p## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Software 1000000.00 50.00 %## 4 ¦ °--New Accounting Standards 500000.00 75.00 %## 5 ¦--Research ## 6 ¦ ¦--New Product Line 2000000.00 25.00 %## 7 ¦ °--New Labs 750000.00 90.00 %## 8 °--IT ## 9 ¦--Outsource 400000.00 20.00 %## 10 ¦--Go agile 250000.00 5.00 %## 11 °--Switch to R 50000.00 100.00 %

Printing using `Get`

Formatting with the Get method overwrites any formattersfound along the path:

data.frame(cost = acme$Get("cost", format = function(x) FormatFixedDecimal(x, 2)), p = acme$Get("p", format = FormatPercent))

## cost p## Acme Inc. ## Accounting ## New Software 1000000.00 50.00 %## New Accounting Standards 500000.00 75.00 %## Research ## New Product Line 2000000.00 25.00 %## New Labs 750000.00 90.00 %## IT ## Outsource 400000.00 20.00 %## Go agile 250000.00 5.00 %## Switch to R 50000.00 100.00 %

Plotting

`plot`

data.tree is mainly a data structure. As it is easy toconvert data.tree structures to other formats, you haveaccess to a large number of tools to plot a data.treestructure. For example, you can plot a data.tree structureas a dendrogram, as an ape tree, as a treeview, etc. Additionally,data.tree also provides its own plotting facility. It isbuilt on GraphViz/DiagrammeR, and you can access these features via theplot and ToGraphViz functions. Note thatDiagrammeR is not required to use data.tree, so plot onlyworks if DiagrammeR is installed on your system. For example:

plot(acme)

acme

Styling

Similar to formatters for printing, you can style your tree and storethe styling directly in the tree, for later use:

SetGraphStyle(acme, rankdir = "TB")SetEdgeStyle(acme, arrowhead = "vee", color = "grey35", penwidth = 2)SetNodeStyle(acme, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow", fontname = "helvetica", tooltip = GetDefaultTooltip)SetNodeStyle(acme$IT, fillcolor = "LightBlue", penwidth = "5px")plot(acme)

acme

For details on the styling attributes, see https://graphviz.org/Documentation.php .

Note that, by default, most Node style attributes will be inherited.Though, for example, label will not be inherited. However,inheritance can be avoided for all style attributes, as for theAccounting node in the following example:

SetNodeStyle(acme$Accounting, inherit = FALSE, fillcolor = "Thistle", fontcolor = "Firebrick", tooltip = "This is the accounting department")plot(acme)

acme

Use Do to set style on specific nodes:

Do(acme$leaves, function(node) SetNodeStyle(node, shape = "egg"))plot(acme)

acme

Other Visualisations

However, there are also endless other possibilities to visualisedata.tree structures. There are more examples in theapplications vignette. Typevignette('applications', package = "data.tree").

Dendrogram

For example, using dendrogram:

igraph

Or, using igraph:

library(igraph)plot(as.igraph(acme, directed = TRUE, direction = "climb"))

networkD3

Or, using networkD3: (you can actually touch these thingies and dragthem around, don’t be shy!)

library(networkD3)acmeNetwork <- ToDataFrameNetwork(acme, "name")simpleNetwork(acmeNetwork[-3], fontSize = 12)

Another example, which at the same time shows conversion fromcsv:

fileName <- system.file("extdata", "useR15.csv", package="data.tree")useRdf <- read.csv(fileName, stringsAsFactors = FALSE)#define the hierarchy (Session/Room/Speaker)useRdf$pathString <- paste("useR", useRdf$session, useRdf$room, useRdf$speaker, sep="|")#convert to NodeuseRtree <- as.Node(useRdf, pathDelimiter = "|")#plot with networkD3useRtreeList <- ToListExplicit(useRtree, unname = TRUE)radialNetwork( useRtreeList)

Tree Conversion

In order to take advantage of the R eco-system, you can convert yourdata.tree structure to other oft-used data types. Thegeneral rule is that, for each target type, there is a one-does-it-allgenerics, and a few more specialised conversion functions. For example,in order to convert a data.tree to a data.frame, you caneither use as.data.frame.Node, orToDataFrameTree, ToDataFrameTable, orToDataFrameNetwork. The documentation for all of thesevariations is accessible via ?as.data.frame.Node.

Converting to `data.frame`

As you saw just above, creating a data.frame iseasy.

Again, note that we always call such methods on the rootNode of a data.tree structure, or on the rootNode of a subtree:

acmedf <- as.data.frame(acme)as.data.frame(acme$IT)

## levelName## 1 IT ## 2 ¦--Outsource ## 3 ¦--Go agile ## 4 °--Switch to R

The same can be achieved by using the more specialised method:

ToDataFrameTree(acme)

We can also add field values of the Nodes as columns tothe data.frame:

ToDataFrameTree(acme, "level", "cost")

## levelName level cost## 1 Acme Inc. 1 NA## 2 ¦--Accounting 2 NA## 3 ¦ ¦--New Software 3 1000000## 4 ¦ °--New Accounting Standards 3 500000## 5 ¦--Research 2 NA## 6 ¦ ¦--New Product Line 3 2000000## 7 ¦ °--New Labs 3 750000## 8 °--IT 2 NA## 9 ¦--Outsource 3 400000## 10 ¦--Go agile 3 250000## 11 °--Switch to R 3 50000

Note that it is not required that the field is set on each and everyNode.

Other data frame conversions are:

ToDataFrameTable(acme, "pathString", "cost")

## pathString cost## 1 Acme Inc./Accounting/New Software 1000000## 2 Acme Inc./Accounting/New Accounting Standards 500000## 3 Acme Inc./Research/New Product Line 2000000## 4 Acme Inc./Research/New Labs 750000## 5 Acme Inc./IT/Outsource 400000## 6 Acme Inc./IT/Go agile 250000## 7 Acme Inc./IT/Switch to R 50000

ToDataFrameNetwork(acme, "cost")

## from to cost## 1 Acme Inc. Accounting NA## 2 Acme Inc. Research NA## 3 Acme Inc. IT NA## 4 Accounting New Software 1000000## 5 Accounting New Accounting Standards 500000## 6 Research New Product Line 2000000## 7 Research New Labs 750000## 8 IT Outsource 400000## 9 IT Go agile 250000## 10 IT Switch to R 50000

And, finally, we can also put attributes of our nodes in a column,based on a type discriminator. This sounds more complicated then what itis. Consider the default discriminator, level:

ToDataFrameTypeCol(acme, 'cost')

## level_1 level_2 level_3 cost## 1 Acme Inc. Accounting New Software 1000000## 2 Acme Inc. Accounting New Accounting Standards 500000## 3 Acme Inc. Research New Product Line 2000000## 4 Acme Inc. Research New Labs 750000## 5 Acme Inc. IT Outsource 400000## 6 Acme Inc. IT Go agile 250000## 7 Acme Inc. IT Switch to R 50000

Let’s look at a somewhat more advanced example. First, let’s assumethat for the outsourcing project, we have two separate possibilities:Outsourcing to India or outsourcing to Poland:

acme$IT$Outsource$AddChild("India")acme$IT$Outsource$AddChild("Poland")

Now, with this slightly more complex tree structure, the level is nota usefully discriminator anymore, because some projects are in level 3,while the new projects are in level 4. For this reason, we introduce atype field on our node objects: A node type can be a company (rootonly), a department (Accounting, Research, and IT), a program(Oursource), and a project (the rest, i.e.all the leaves):

acme$Set(type = c('company', 'department', 'project', 'project', 'department', 'project', 'project', 'department', 'program', 'project', 'project', 'project', 'project'))

Our tree now looks like this:

print(acme, 'type')

## levelName type## 1 Acme Inc. company## 2 ¦--Accounting department## 3 ¦ ¦--New Software project## 4 ¦ °--New Accounting Standards project## 5 ¦--Research department## 6 ¦ ¦--New Product Line project## 7 ¦ °--New Labs project## 8 °--IT department## 9 ¦--Outsource program## 10 ¦ ¦--India project## 11 ¦ °--Poland project## 12 ¦--Go agile project## 13 °--Switch to R project

We can now create a data.frame in which we have one column perdistinct type value. Namely, a company column, a department column, aprogram column, and a project column. Note that the columns are nothardcoded, but derived dynamically from your data in the treestructure:

ToDataFrameTypeCol(acme, type = 'type', prefix = NULL)

## company department program project## 1 Acme Inc. Accounting <NA> New Software## 2 Acme Inc. Accounting <NA> New Accounting Standards## 3 Acme Inc. Research <NA> New Product Line## 4 Acme Inc. Research <NA> New Labs## 5 Acme Inc. IT Outsource India## 6 Acme Inc. IT Outsource Poland## 7 Acme Inc. IT <NA> Go agile## 8 Acme Inc. IT <NA> Switch to R

Converting to List of Lists

List of lists are useful for various use cases:

as an intermediate step in converting to JSON, XML, YAML
for functions that take a lol as an input. This is especially thecase for visualisations and charts, e.g with many html widgets
to save a data.tree structure as an R object (seeperformance considerations below)

data(acme)str(as.list(acme$IT))

## List of 3## $ Outsource :List of 2## ..$ cost: num 4e+05## ..$ p : num 0.2## $ Go agile :List of 2## ..$ cost: num 250000## ..$ p : num 0.05## $ Switch to R:List of 2## ..$ cost: num 50000## ..$ p : num 1

str(ToListExplicit(acme$IT, unname = FALSE, nameName = "id", childrenName = "dependencies"))

## List of 2## $ id : chr "IT"## $ dependencies:List of 3## ..$ Outsource :List of 3## .. ..$ id : chr "Outsource"## .. ..$ cost: num 4e+05## .. ..$ p : num 0.2## ..$ Go agile :List of 3## .. ..$ id : chr "Go agile"## .. ..$ cost: num 250000## .. ..$ p : num 0.05## ..$ Switch to R:List of 3## .. ..$ id : chr "Switch to R"## .. ..$ cost: num 50000## .. ..$ p : num 1

Converting to other objects

There are also conversions to igraph objects, to phylo / ape, todendrogram, and others. For details, see ?as.phylo.Node,?as.dendrogram.Node, ?as.igraph.Node.

Tree traversal is one of the core concepts of trees. See, forexample, here: Tree Traversal onWikipedia.

`Get`

The Get method traverses the tree and collects valuesfrom each node. It then returns a vector or a list, containing thecollected values.

Additional features of the Get method are:

execute a function on each node, and append the function’s result tothe returned vector
execute a Node method on each node, and append themethod’s return value to the returned vector

Traversal order

The Get method can traverse the tree in various ways.This is called traversal order.

Pre-Order

The default traversal mode is pre-order.

pre-order

This is what is used e.g.in print:

print(acme, "level")

## levelName level## 1 Acme Inc. 1## 2 ¦--Accounting 2## 3 ¦ ¦--New Software 3## 4 ¦ °--New Accounting Standards 3## 5 ¦--Research 2## 6 ¦ ¦--New Product Line 3## 7 ¦ °--New Labs 3## 8 °--IT 2## 9 ¦--Outsource 3## 10 ¦--Go agile 3## 11 °--Switch to R 3

Post-Order

The post-order traversal mode returns childrenfirst, returning parents only after all its children have been traversedand returned:

post-order

We can use it like this on the Get method:

acme$Get('level', traversal = "post-order")

## New Software New Accounting Standards Accounting ## 3 3 2 ## New Product Line New Labs Research ## 3 3 2 ## Outsource Go agile Switch to R ## 3 3 3 ## IT Acme Inc. ## 2 1

This is useful if your parent’s value depends on the children, aswe’ll see below.

Ancestor

This is a non-standard traversal mode that does not traverse theentire tree. Instead, the ancestor mode starts from a Node,then walks the tree along the path from parent to parent, up to theroot.

data.frame(level = agile$Get('level', traversal = "ancestor"))

## level## Go agile 3## IT 2## Acme Inc. 1

Filter and Prune

You can add a filter and/or a prune function to the Getmethod. These functions have to take a Node as an input,and return TRUE if the Node should beconsidered, and FALSE otherwise. The difference between thepruneFun and the filterFun is that filters actonly on specific nodes, whereas if the pruneFun returnsFALSE, then the entire sub-tree spanned by theNode is ignored.

For example:

acme$Get('name', pruneFun = function(x) x$position <= 2)

## Acme Inc. Accounting ## "Acme Inc." "Accounting" ## New Software New Accounting Standards ## "New Software" "New Accounting Standards" ## Research New Product Line ## "Research" "New Product Line" ## New Labs ## "New Labs"

There are also some convenient filter functions available in thepackage, such as isLeaf, isRoot,isNotLeaf, etc.

acme$Get('name', filterFun = isLeaf)

## New Software New Accounting Standards ## "New Software" "New Accounting Standards" ## New Product Line New Labs ## "New Product Line" "New Labs" ## Outsource Go agile ## "Outsource" "Go agile" ## Switch to R ## "Switch to R"

Attributes

The attribute parameter determines what is collected.This is called attribute, but it should not be confusedwith R’s concept of object attributes (e.g.?attributes).In this context, an attribute can be either:

the name of a Node field
the name of a Node method or active
a function, whose first argument must be a Node

Throughout this document, we refer to attribute in thissense.

Field

acme$Get('name')

## Acme Inc. Accounting ## "Acme Inc." "Accounting" ## New Software New Accounting Standards ## "New Software" "New Accounting Standards" ## Research New Product Line ## "Research" "New Product Line" ## New Labs IT ## "New Labs" "IT" ## Outsource Go agile ## "Outsource" "Go agile" ## Switch to R ## "Switch to R"

Method

You can pass a standard R function to the Get method(and thus to print, as.data.frame, etc.). Theonly requirement this function must satisfy is that its first argumentbe of class Node. Subsequent arguments can be added throughthe ellipsis (…). For example:

ExpectedCost <- function(node, adjustmentFactor = 1) { return ( node$cost * node$p * adjustmentFactor)}acme$Get(ExpectedCost, adjustmentFactor = 0.9, filterFun = isLeaf)

## New Software New Accounting Standards New Product Line ## 450000 337500 450000 ## New Labs Outsource Go agile ## 607500 72000 11250 ## Switch to R ## 45000

Using recursion

Recursion comes naturally with data.tree, and it is one of its corestrengths:

Cost <- function(node) { result <- node$cost if(length(result) == 0) result <- sum(sapply(node$children, Cost)) return (result)}print(acme, "p", cost = Cost)

## levelName p cost## 1 Acme Inc. NA 4950000## 2 ¦--Accounting NA 1500000## 3 ¦ ¦--New Software 0.50 1000000## 4 ¦ °--New Accounting Standards 0.75 500000## 5 ¦--Research NA 2750000## 6 ¦ ¦--New Product Line 0.25 2000000## 7 ¦ °--New Labs 0.90 750000## 8 °--IT NA 700000## 9 ¦--Outsource 0.20 400000## 10 ¦--Go agile 0.05 250000## 11 °--Switch to R 1.00 50000

There is a built-in function that would make this example evensimpler: Aggregate. It is explained below.

`Do`

Do is similar to Get in that it also traverses a tree ina specific traversal order. However, instead of fetching an attribute,it will (surprise!) do something, namely run a function. For example, wecan tell the Do method to assign a value to eachNode it traverses. This is especially useful if theattribute parameter is a function, as in the previous examples. Forinstance, we can store the aggregated cost for later use andprinting:

acme$Do(function(node) node$cost <- Cost(node), filterFun = isNotLeaf)print(acme, "p", "cost")

## levelName p cost## 1 Acme Inc. NA 4950000## 2 ¦--Accounting NA 1500000## 3 ¦ ¦--New Software 0.50 1000000## 4 ¦ °--New Accounting Standards 0.75 500000## 5 ¦--Research NA 2750000## 6 ¦ ¦--New Product Line 0.25 2000000## 7 ¦ °--New Labs 0.90 750000## 8 °--IT NA 700000## 9 ¦--Outsource 0.20 400000## 10 ¦--Go agile 0.05 250000## 11 °--Switch to R 1.00 50000

`Set`

The Set method is the counterpart to theGet method. The Set method takes a vector or asingle value as an input, and traverses the tree in a certain order.Each Node is assigned a value from the vector, one afterthe other, recycling.

Assigning values

acme$Set(id = 1:acme$totalCount)print(acme, "id")

## levelName id## 1 Acme Inc. 1## 2 ¦--Accounting 2## 3 ¦ ¦--New Software 3## 4 ¦ °--New Accounting Standards 4## 5 ¦--Research 5## 6 ¦ ¦--New Product Line 6## 7 ¦ °--New Labs 7## 8 °--IT 8## 9 ¦--Outsource 9## 10 ¦--Go agile 10## 11 °--Switch to R 11

The Set method can take multiple vectors as an input,and, optionally, you can define the name of the attribute. Finally, justas for the Get method, the traversal orderis important for the Set.

secretaries <- c(3, 2, 8)employees <- c(52, 43, 51)acme$Set(secretaries, emps = employees, filterFun = function(x) x$level == 2)print(acme, "emps", "secretaries", "id")

## levelName emps secretaries id## 1 Acme Inc. NA NA 1## 2 ¦--Accounting 52 3 2## 3 ¦ ¦--New Software NA NA 3## 4 ¦ °--New Accounting Standards NA NA 4## 5 ¦--Research 43 2 5## 6 ¦ ¦--New Product Line NA NA 6## 7 ¦ °--New Labs NA NA 7## 8 °--IT 51 8 8## 9 ¦--Outsource NA NA 9## 10 ¦--Go agile NA NA 10## 11 °--Switch to R NA NA 11

Deleting attributes

The Set method can also be used to assign a single valuedirectly to all Nodes traversed. For example, to remove theavgExpectedCost, we assign NULL on each node,using the fact that the Set recycles:

acme$Set(avgExpectedCost = NULL)

However, note that setting a field to NULL will not makeit gone for good. You will still see it:

acme$attributesAll

## [1] "avgExpectedCost" "cost" "id" "emps" ## [5] "secretaries" "p"

In order remove it completely, you can use theRemoveAttribute method:

acme$Do(function(node) node$RemoveAttribute("avgExpectedCost"))

Using Set and function assignment

Earlier, we saw that we can add a function dynamically to aNode. We can, of course, also do this via theSet method

acme$Set(cost = c(function(self) sum(sapply(self$children, function(child) GetAttribute(child, "cost")))), filterFun = isNotLeaf)print(acme, "cost")

## levelName cost## 1 Acme Inc. 4950000## 2 ¦--Accounting 1500000## 3 ¦ ¦--New Software 1000000## 4 ¦ °--New Accounting Standards 500000## 5 ¦--Research 2750000## 6 ¦ ¦--New Product Line 2000000## 7 ¦ °--New Labs 750000## 8 °--IT 700000## 9 ¦--Outsource 400000## 10 ¦--Go agile 250000## 11 °--Switch to R 50000

acme$IT$AddChild("Paperless", cost = 240000)print(acme, "cost")

## levelName cost## 1 Acme Inc. 5190000## 2 ¦--Accounting 1500000## 3 ¦ ¦--New Software 1000000## 4 ¦ °--New Accounting Standards 500000## 5 ¦--Research 2750000## 6 ¦ ¦--New Product Line 2000000## 7 ¦ °--New Labs 750000## 8 °--IT 940000## 9 ¦--Outsource 400000## 10 ¦--Go agile 250000## 11 ¦--Switch to R 50000## 12 °--Paperless 240000

`Traverse` and explicit traversal

Previously, we have used the Get, Set andDo methods in their OO-style version. This is often veryconvenient for quick access to variables. However, sometimes you want tore-use the same traversal for multiple sequential operations. For this,you can use what is called explicit traversal. It workslike so:

traversal <- Traverse(acme, traversal = "post-order", filterFun = function(x) x$level == 2)Set(traversal, floor = c(1, 2, 3))Do(traversal, function(x) { if (x$floor <= 2) { x$extension <- "044" } else { x$extension <- "043" } })Get(traversal, "extension")

## Accounting Research IT ## "044" "044" "043"

`Aggregate`

The Aggregate method provides a shorthand for theoft-used case when a parent is the aggregate of its child values, asseen in the previous example. Aggregate calls a functionrecursively on children. If a child holds the attribute, that value isreturned. Otherwise, the attribute is collected from all children, andaggregated using the aggFun. For example:

Aggregate(node = acme, attribute = "cost", aggFun = sum)

## [1] 5190000

We can also use this in the Get method, of course:

acme$Get(Aggregate, "cost", sum)

Note, however, that this is not very efficient:Aggregate will be called twice on, say, IT: Oncewhen the traversal passes IT itself, the second timerecursively when Aggregate is called on the root. For thisreason, we have the option to store/cache the calculated value along theway. For one thing, this is a convenient way to save an additionalSet call in case we want to store the aggregated value.Additionally, it speeds up calculation because Aggregate onan ancestor will use a cached value on a descendant:

acme$Do(function(node) node$cost <- Aggregate(node, attribute = "cost", aggFun = sum), traversal = "post-order")print(acme, "cost")

## levelName cost## 1 Acme Inc. 5190000## 2 ¦--Accounting 1500000## 3 ¦ ¦--New Software 1000000## 4 ¦ °--New Accounting Standards 500000## 5 ¦--Research 2750000## 6 ¦ ¦--New Product Line 2000000## 7 ¦ °--New Labs 750000## 8 °--IT 940000## 9 ¦--Outsource 400000## 10 ¦--Go agile 250000## 11 ¦--Switch to R 50000## 12 °--Paperless 240000

`Cumulate`

In its simplest form, the Cumulate function just sums upan attribute value along siblings, taking into consideration allsiblings before the Node on which Cumulate iscalled:

Cumulate(acme$IT$`Go agile`, "cost", sum)

## [1] 650000

Or, to find the minimum cost among siblings:

Cumulate(acme$IT$`Go agile`, "cost", min)

## [1] 250000

This can be useful in combination with traversal, e.g.to calculate arunning sum among siblings. Specifically, thecacheAttribute lets you store the running sum in a field.This not only speeds up calculation, but lets you re-use the calculatedvalues later:

acme$Do(function(node) node$cumCost <- Cumulate(node, attribute = "cost", aggFun = sum))print(acme, "cost", "cumCost")

## levelName cost cumCost## 1 Acme Inc. 5190000 5190000## 2 ¦--Accounting 1500000 1500000## 3 ¦ ¦--New Software 1000000 1000000## 4 ¦ °--New Accounting Standards 500000 1500000## 5 ¦--Research 2750000 4250000## 6 ¦ ¦--New Product Line 2000000 2000000## 7 ¦ °--New Labs 750000 2750000## 8 °--IT 940000 5190000## 9 ¦--Outsource 400000 400000## 10 ¦--Go agile 250000 650000## 11 ¦--Switch to R 50000 700000## 12 °--Paperless 240000 940000

`Clone`

As stated above, Nodes exhibit reference semantics. Ifyou call, say, Set, then this changes theNodes in the tree. The changes will be visible for allvariables having a reference on the data.tree structure. Asa consequence, you might want to “save away” the current state of astructure. To do this, you can Clone an entire tree:

acmeClone <- Clone(acme)acmeClone$name <- "New Acme"# acmeClone does not point to the same reference object anymore:acme$name == acmeClone$name

## [1] FALSE

`Sort`

With the Sort method, you can sort an entire tree, asub-tree, or children of a specific Node. The method willsort recursively and sort children with respect to a child attribute. Asexplained earlier, the child attribute can be a function or amethod.

Sort(acme, "name")acme

## levelName## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Accounting Standards## 4 ¦ °--New Software ## 5 ¦--IT ## 6 ¦ ¦--Go agile ## 7 ¦ ¦--Outsource ## 8 ¦ ¦--Paperless ## 9 ¦ °--Switch to R ## 10 °--Research ## 11 ¦--New Labs ## 12 °--New Product Line

Sort(acme, Aggregate, "cost", sum, decreasing = TRUE, recursive = TRUE)print(acme, "cost", aggCost = acme$Get(Aggregate, "cost", sum))

## levelName cost aggCost## 1 Acme Inc. 5190000 5190000## 2 ¦--Research 2750000 2750000## 3 ¦ ¦--New Product Line 2000000 2000000## 4 ¦ °--New Labs 750000 750000## 5 ¦--Accounting 1500000 1500000## 6 ¦ ¦--New Software 1000000 1000000## 7 ¦ °--New Accounting Standards 500000 500000## 8 °--IT 940000 940000## 9 ¦--Outsource 400000 400000## 10 ¦--Go agile 250000 250000## 11 ¦--Paperless 240000 240000## 12 °--Switch to R 50000 50000

`Prune`

You can prune sub-trees out of a tree, by that removing an entiresub-tree from a tree. There are two variations of this: *temporary pruning, e.g.just for printing: This is thepruneFun parameter, e.g.in Get * sideeffect or permanent pruning, meaning that you modify yourdata.tree structure for good. This is achieved with thePrune method.

Consider the following example of permanent pruning:

acme$Do(function(x) x$cost <- Aggregate(x, "cost", sum))Prune(acme, function(x) x$cost > 700000)

## [1] 5

print(acme, "cost")

## levelName cost## 1 Acme Inc. 5190000## 2 ¦--Research 2750000## 3 ¦ ¦--New Product Line 2000000## 4 ¦ °--New Labs 750000## 5 ¦--Accounting 1500000## 6 ¦ °--New Software 1000000## 7 °--IT 940000

CPU

The data.tree package has been built to work withhierarchical data, to support visualization, to foster rapidprototyping, and for other applications where development time saved ismore important than computing time lost. Having said this, it becomesclear that big data and data.tree do not marry particularlywell. Don’t expect R to build your data.tree structure witha few million Nodes during your cigarette break. Do not tryto convert a gigabyte JSON document to a data.treestructure in a testthat test case.

However, if you are respecting the following guidelines, I promisethat you and your Nodes will have a lot of fun together. Sohere it goes:

Creating a Node is relatively expensive.CreateRegularTree(6, 6) creates a data.treestructure with 9331 Nodes. On an AWS c4.large instance,this takes about 2.5 seconds.
Clone is similar to Node creation, with anextra penalty of about 50%.
Traversing (Traverse, Get,Set and Do) is relatively cheap.

This is really what you would expect. data.tree buildson R6, i.e.reference objects. There is an overhead in creating them, asyour computer needs to manage the references they hold. However,performing operations that change your tree (e.g.Prune orSet) are often faster than value semantics, as yourcomputer does not need to copy the entire object in memory.

Just to give you an order of magnitude: The following times areachieved on an AWS c4.large instance:

system.time(tree <- CreateRegularTree(6, 6))

## user system elapsed ## 2.499 0.009 2.506

system.time(tree <- Clone(tree))

## user system elapsed ## 3.704 0.023 3.726

system.time(traversal <- Traverse(tree))

## user system elapsed ## 0.096 0.000 0.097

system.time(Set(traversal, id = 1:tree$totalCount))

## user system elapsed ## 0.205 0.000 0.204

system.time(ids <- Get(traversal, "id"))

## user system elapsed ## 0.569 0.000 0.569

leaves <- Traverse(tree, filterFun = isLeaf)Set(leaves, leafId = 1:length(leaves))system.time(Get(traversal, function(node) Aggregate(node, "leafId", max)))

## user system elapsed ## 1.418 0.000 1.417

With caching, you can save some time:

system.time(tree$Get(function(node) Aggregate(tree, "leafId", max, "maxLeafId"), traversal = "post-order"))

## user system elapsed ## 0.69 0.00 0.69

Memory

data.tree structures have a relatively large memory footprint.However, for every-day applications using modern computers, this willnot normally have an impact on your work except when saving adata.tree structure to disk.

For an explanation why that is the case, you might want to read thisanswer on StackOverflow.

Depending on your development environment, you might want to turn offthe option to save the workspace to .RData on exit.

Introduction to data.tree (2024)

Trees

Trees in R

Trees in data.tree

Definitions

Tree creation

Create a tree programmatically

Create a tree from a data.frame

Create a tree from a file

Node methods

Actives Examples (aka Properties)

OO-Style Methods Examples

Traditional R Methods

Climbing a tree (tree navigation)

Navigation by path

Navigation by position

Navigation by attributes

Custom attributes

Custom attributes in constructor

Custom attributes as function

Printing

Basic Printing

Formatters

Printing using Get

Plotting

plot

Styling

Other Visualisations

Dendrogram

igraph

networkD3

Tree Conversion

Converting to data.frame

Converting to List of Lists

Converting to other objects

Get

Traversal order

Pre-Order

Post-Order

Ancestor

Filter and Prune

Attributes

Field

Method

Using recursion

Do

Set

Assigning values

Deleting attributes

Using Set and function assignment

Traverse and explicit traversal

Aggregate

Cumulate

Clone

Sort

Prune

CPU

Memory

Trees in `data.tree`

Create a tree from a `data.frame`

Printing using `Get`

`plot`

Converting to `data.frame`

`Get`

`Do`

`Set`

`Traverse` and explicit traversal

`Aggregate`

`Cumulate`

`Clone`

`Sort`

`Prune`