I'm not sure I have a huge point here, other than even seeming expertise can sometimes mask a weird fundamental flaw.
text.edgelist <- rbind(c("tom_", "dick7"), c("dick7", "harry"),
c("harry", "jill"), c("harry", "jane"))
n_alpha <- as.network(text.edgelist, matrix.type='edgelist', directed=F)
plot(n_alpha, displaylabels=T)
If the identifiers are numbers, the simple approach is to convert
the numeric vertex identifiers to strings (of digits), and use those
strings as the input to as.network:
num.edgelist <- rbind(c(3, 4), c(4, 6),
c(6, 8), c(6, 9))
# Map the numeric edge list to characters
char.edgelist <- apply(num.edgelist, c(1,2), as.character)
n_alpha <- as.network(char.edgelist, matrix.type='edgelist', directed=F)
plot(n_alpha, displaylabels=T)
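One way to confirm that the conversion did what was intended is to compare the vertex names and the vertex count against the distinct identifiers in the edge list. This is only a minimal sketch: it assumes the network package is available (statnet loads it), and n.check is an illustrative name, not something from the original code.

```r
library(network)

num.edgelist <- rbind(c(3, 4), c(4, 6),
                      c(6, 8), c(6, 9))

# Same conversion as above: numeric identifiers become strings
char.edgelist <- apply(num.edgelist, c(1, 2), as.character)
n.check <- as.network(char.edgelist, matrix.type = 'edgelist', directed = FALSE)

# Distinct identifiers present in the original edge list
ids <- unique(as.vector(num.edgelist))

# The vertex names should be the identifiers (as strings), one vertex each
network.vertex.names(n.check)
network.size(n.check) == length(ids)
```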
This approach works and, depending on the data and goals, may be the
easiest one. There's also a second feature in statnet that can
solve this problem: persistent IDs. Persistent IDs are part of the
networkDynamic package, and provide a way to attach unique values to
vertices and edges that don't change regardless of any network
manipulation.
The first step is to create the network and initialize the persistent
IDs. I create the network with a single placeholder node because, on
an empty network, the initialize.pids() call is not "sticky."
net.pid <- network.initialize(1, directed=F)
initialize.pids(net.pid)
The next step is to add all of the vertices. This step is
straightforward. First, I create a list of the unique node
identifiers (unique() is part of the base R distribution). Then, I use
that list to add the vertices and initialize them with the persistent
IDs.
node.list <- unique(c(unique(num.edgelist[,1]),
unique(num.edgelist[,2])))
add.vertices(net.pid, length(node.list), vertex.pid = node.list)
Unfortunately, there is no add.edges() function which
takes an edge defined in terms of the persistent IDs. This is not a
huge problem: it's only a few lines of code to map a list of
persistent IDs to the internal node representation, using the
existing get.vertex.id function. Using apply, it's one (logical) line
of R code, which gives the edgelist "mapped" to the internal vertex
IDs. This list can be passed directly to add.edges():
mapped.edges <- apply(num.edgelist, c(1,2),
function(v.pid, net) {
get.vertex.id(net, v.pid)
}, net.pid)
add.edges(net.pid, mapped.edges[,1], mapped.edges[,2])
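The idea behind the mapping step can be illustrated in base R alone. The sketch below is not the networkDynamic API; it uses match() to look up each identifier's position in a list of unique identifiers, which is the same kind of lookup. The positions it produces are indices into node.list, so they will not necessarily equal the internal vertex IDs in net.pid (the placeholder vertex created at the start shifts those until it is deleted).

```r
num.edgelist <- rbind(c(3, 4), c(4, 6),
                      c(6, 8), c(6, 9))

# Unique identifiers, in order of first appearance
node.list <- unique(c(num.edgelist[, 1], num.edgelist[, 2]))

# match() returns the position of each identifier in node.list;
# reshape the result back into a two-column edge list
mapped <- matrix(match(num.edgelist, node.list), ncol = 2)
mapped
```

Each row of mapped is an edge expressed in positions 1..length(node.list) rather than in the original identifiers.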
There's one last bit of clean-up to do. At the beginning, I created the network
with one node. This was done to make sure that the initialize.pids()
call was "sticky." I need to remove that initial node. Since it was
the first node added, I know its id is 1:
delete.vertices(net.pid, 1)
Finally, it's possible to plot the graph using the persistent IDs as
labels. Just use the %v% operator to extract the vertex.pid attribute:
plot(net.pid, label=net.pid %v% 'vertex.pid')
Use the following approach:
network <- network(matrix(1, n, n), directed=F)
(At least with Statnet 1.7 / R64 2.13)
Statnet seems optimized for sparsely connected graphs. This is not too surprising, since many of the "real" graphs I deal with have a density around d = .0002 or thereabouts, and even some fairly large small world graphs, like IMDB, only have a density of ~.18. However, there's one special case where I need to have a fully connected graph: the input to an edgecov() term in an ERGM model. This graph has to have all possible edges, not just the observed edges, and so, density = 1.0.
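To get a feel for the sizes involved: a fully connected, undirected graph on n nodes (without self-loops) has n(n-1)/2 edges. A small base R sketch, where full.edge.count is a hypothetical helper name used only for illustration:

```r
# Number of edges in a fully connected, undirected graph on n nodes
# (no self-loops): n choose 2 = n * (n - 1) / 2
full.edge.count <- function(n) choose(n, 2)

full.edge.count(200)   # 19900
full.edge.count(500)   # 124750
```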
One of the challenges is how to create and initialize these very large networks. The step to create them would often take a very long time, and it wasn't clear I was using the best approach. There are at least two methods that seemed plausible: use matrix to initialize an adjacency matrix, or use network.initialize and then add all of the edges in afterwards. It was not clear up front which one would be faster. So, I did a quick experiment: I ran each method 50 times on various sized graphs, and compared the results.
# Method 1: network(matrix())
startTime = proc.time()
for (i in 1:50) n <- network(matrix(1, 200, 200), directed=F)
proc.time() - startTime
# Method 2: network.initialize() and then assignment
startTime = proc.time()
for (i in 1:50) {
n <- network.initialize(200, directed=F)
n[,] <- 1
}
proc.time() - startTime
The results were pretty clear:
Method | 200x200 | 500x500 |
#1 | 12.32 | 361.91 |
#2 | 42.42 | 8+ hours |
I used proc.time() for the timing. There are suggestions that this is not super-accurate, but the difference is so stark that I think even 1s resolution is more than enough. Also: I've discovered that 32-bit R is a really bad environment for working with even "medium-sized" graphs (500 nodes or so), much less "large" graphs. The extra address space afforded by the 64-bit version of R avoids a lot of out-of-memory conditions.
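For anyone who wants to repeat the comparison, the same experiment can be wrapped in base R's system.time(). This is a minimal sketch rather than the code behind the numbers above: time.builds is a hypothetical helper, and the graph size and repetition count are deliberately small assumptions so that it finishes quickly.

```r
library(network)

# Hypothetical helper: total elapsed seconds for `reps` calls of `build`
time.builds <- function(build, reps) {
  system.time(for (i in seq_len(reps)) build())[["elapsed"]]
}

n.nodes <- 50   # kept small so the sketch runs quickly
reps <- 5

# Method 1: network(matrix())
t.matrix <- time.builds(function() {
  network(matrix(1, n.nodes, n.nodes), directed = FALSE)
}, reps)

# Method 2: network.initialize() and then assignment
t.assign <- time.builds(function() {
  n <- network.initialize(n.nodes, directed = FALSE)
  n[, ] <- 1
  n
}, reps)

c(matrix = t.matrix, assign = t.assign)
```

Raising n.nodes and reps toward the values used above should show how the gap between the two methods changes with graph size on a given machine.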
Green Label | Edradour | Blue Label |
4 | 9 | 5 |
n=18
 | Green Label | Edradour | Blue Label |
Observed | 4 | 9 | 5 |
Expected | 16% | 21% | 63% |
n=18