Friday, April 18, 2014

Statnet & "foreign" Vertex IDs

It's common for a network to come from an external source in which the vertices already have unique identifiers. Unfortunately, when I import a network into statnet in edgelist format, statnet expects the vertex identifiers to be consecutive integers beginning at one. That means pre-processing the data, which is not always desirable: I need a look-up step to figure out who a particular vertex refers to, and attaching additional attributes gets more complicated.
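
For comparison, here's roughly what that pre-processing looks like in base R. This is only a sketch: the names foreign.edgelist, id.lookup, and seq.edgelist are my own placeholders, not anything statnet provides.

```r
# Hypothetical edge list that uses "foreign" vertex identifiers.
foreign.edgelist <- rbind(c(3, 4), c(4, 6), c(6, 8), c(6, 9))

# Look-up table: the position of an identifier in id.lookup is its new ID.
id.lookup <- sort(unique(as.vector(foreign.edgelist)))

# Renumber every endpoint to its position in the look-up table.
# match() flattens the matrix column by column, so matrix() restores the shape.
seq.edgelist <- matrix(match(foreign.edgelist, id.lookup), ncol = 2)

# Reverse look-up: recover the original identifiers of the tail vertices.
original.tails <- id.lookup[seq.edgelist[, 1]]
```

The catch is that id.lookup has to be kept around (and kept in sync with the network) for as long as I want to translate back to the original identifiers.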

However, two "lightly documented" features can help here. First, statnet already accepts textual vertex names in the standard network constructors. So, if the vertices already have (unique) text names, they can be used with no additional work:

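library(network)  # part of the statnet suite

# Character edge list: the strings themselves become the vertex names
text.edgelist <- rbind(c("tom_", "dick7"), c("dick7", "harry"),
   c("harry", "jill"), c("harry", "jane"))
n_alpha <- as.network(text.edgelist, matrix.type='edgelist', directed=FALSE)
plot(n_alpha, displaylabels=TRUE)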
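
One payoff of keeping the names is that attributes keyed by the original identifiers can be attached without a separate look-up table. Here's a minimal sketch; the ages data frame is made up for illustration, and I'm assuming the vertex names come back from network.vertex.names() exactly as they appeared in the edge list.

```r
library(network)

text.edgelist <- rbind(c("tom_", "dick7"), c("dick7", "harry"),
                       c("harry", "jill"), c("harry", "jane"))
net.names <- as.network(text.edgelist, matrix.type = "edgelist",
                        directed = FALSE)

# Hypothetical attribute table, deliberately not in vertex order.
ages <- data.frame(name = c("jill", "tom_", "harry", "jane", "dick7"),
                   age = c(31, 45, 27, 38, 52),
                   stringsAsFactors = FALSE)

# Find the internal vertex ID for each row by matching on the vertex name.
idx <- match(ages$name, network.vertex.names(net.names))

# Attach the attribute to exactly those vertices.
set.vertex.attribute(net.names, "age", ages$age, v = idx)
```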

If the identifiers are numbers, the simplest approach is to convert them to strings (of digits) and pass those strings to as.network:

num.edgelist <- rbind(c(3, 4), c(4, 6),
   c(6, 8), c(6, 9))

# Map the numeric edge list to characters
char.edgelist <- apply(num.edgelist, c(1,2), as.character)
n_alpha <- as.network(char.edgelist, matrix.type='edgelist', directed=FALSE)
plot(n_alpha, displaylabels=TRUE)
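
One caveat with the conversion: as.character() on a double can switch to scientific notation, so 100000 becomes "1e+05", which makes for a confusing vertex name. If the identifiers are whole numbers that fit in R's integer type (an assumption here), converting to integer first sidesteps this. The big.edgelist values below are invented for illustration.

```r
# Hypothetical edge list with large, round identifiers stored as doubles.
big.edgelist <- rbind(c(100000, 200000), c(200000, 300001))

# Convert to integer first; storage.mode<- keeps the matrix dimensions.
int.edgelist <- big.edgelist
storage.mode(int.edgelist) <- "integer"

# Integers are converted to plain digit strings, never scientific notation.
char.edgelist <- int.edgelist
storage.mode(char.edgelist) <- "character"
```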

This approach works and, depending on the data and goals, may be the easiest one. There's also a second feature in statnet that can solve this problem: persistent IDs. Persistent IDs are part of the networkDynamic package, and provide a way to attach unique values to vertices and edges that don't change regardless of any network manipulation.

Persistent IDs (at least to me) "feel" like the better approach to this, but they are also more complex to use in practice. In particular, adding edges is not as easy as you'd like. In order to use persistent IDs, I first have to create the network, then initialize it for use with persistent IDs. In this example, I'm going to start with a (mostly) empty network, and then add all of the vertices and edges to it. I'm assuming that the edgelist is already loaded in num.edgelist and it has two columns, a tail and a head. The network has to be initialized with at least one node, otherwise the initialize.pids() call is not "sticky."

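library(networkDynamic)  # provides the persistent ID functions

# Start with one placeholder vertex so that initialize.pids() is "sticky"
net.pid <- network.initialize(1, directed=FALSE)
initialize.pids(net.pid)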

The next step is to add all of the vertices. This step is straightforward. First, I create a list of the unique node identifiers (unique() is part of base R). Then I use that list to add the vertices and initialize them with their persistent IDs.

node.list <- unique(c(unique(num.edgelist[,1]),
    unique(num.edgelist[,2])))
add.vertices(net.pid, length(node.list), vertex.pid = node.list)

Unfortunately, there is no version of add.edges() that takes edges defined in terms of persistent IDs. This is not a huge problem: it takes only a few lines of code to map persistent IDs to the internal vertex IDs, using the existing get.vertex.id() function. Using apply, it's one (logical) line of R code, which gives the edgelist "mapped" to the internal vertex IDs. The tail and head columns of that mapped edgelist can then be passed to add.edges():

mapped.edges <- apply(num.edgelist, c(1,2),
   function(v.pid, net) {
      get.vertex.id(net, v.pid)
   }, net.pid)
add.edges(net.pid, mapped.edges[,1], mapped.edges[,2])
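
To make the mapping step concrete without needing a network object, here's a base-R sketch of the idea. It is only an illustration, not networkDynamic's implementation: vertex.pids is a hypothetical stand-in for the vertex.pid attribute, and the -1 marks the placeholder vertex from the initialization step (whose real persistent ID will be something else).

```r
# Hypothetical stand-in for the vertex.pid attribute: element i holds the
# persistent ID of internal vertex i. Element 1 is the placeholder vertex.
vertex.pids <- c(-1, 3, 4, 6, 8, 9)

num.edgelist <- rbind(c(3, 4), c(4, 6), c(6, 8), c(6, 9))

# Translate every persistent ID to its position (the internal vertex ID).
mapped <- matrix(match(num.edgelist, vertex.pids), ncol = 2)

# In this sketch an unmatched ID shows up as NA, so it's cheap to check
# before any edges are added.
stopifnot(!anyNA(mapped))
```

Note that the internal IDs start at 2 here, because the placeholder vertex occupies position 1.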

There's one last bit of clean-up to do. At the beginning, I created the network with one node. This was done to make sure that the initialize.pids() call was "sticky." I need to remove that initial node. Since it was the first node added, I know its ID is 1:

delete.vertices(net.pid, 1)

Finally, it's possible to plot the graph using the persistent IDs as labels. Just use the %v% operator to extract the vertex.pid attribute:

plot(net.pid, label=net.pid %v% 'vertex.pid')