Thursday, September 13, 2012

ERGM: edgecov and dyadcov specifications

Exponential Random Graph Models (ERGMs) are a powerful technique to model, understand, and predict networks. What I find powerful about ERGMs is that the model can contain terms describing the nodes, edges, local structure and global structure of the graph, all at once. That's the good side of ERGMs. The bad side is that statnet, although amazingly powerful, is still very much a work in progress, and some of the documentation is rather sparse.

This is especially clear in the case of edgecov() and dyadcov(), the terms that model edge effects on the network. The standard reference for ERGM terms is here, and its dyadcov section has this to say about the specification of the term:

If the network is undirected, x is either a matrix of edgewise covariates, or a network; if the latter, optional argument attrname provides the name of the edge attribute to use for edge values.

I spent a few months beating my head against the wall, trying to figure out exactly what that meant, in the practical application of how to actually specify an edgecov() term. (For this discussion, edgecov() and dyadcov() are more or less interchangeable. Of course, you need to understand your data and your models to know if that's the case for you.) Ideally, I wanted something as simple as nodematch:

network ~ edges + nodematch("age")

Instead, all of the examples went something like this:

network ~ edges + edgecov(full_net, "age")

These examples left many of the details of full_net ambiguous. (Honest! Try to find a single example with full data on the web. I couldn't, and I'm an Information Professional(tm).) The best starting point I found was a recent posting on The Shortest Path, but that was still a little ambiguous for me, talking about a "complete network" and the "observed network."

At this point, step back a bit, and remember what the ERGM call is doing conceptually. At a high level, it takes your supplied network, adds and removes edges, and sees how that changes the network. It uses those (small) changes to build an understanding of how all the terms in the model specification interact and affect the overall network. In the case of attributes on the nodes, it can use the data already in the network; it's assumed you have observed all possible nodes, and have only a small amount of missing data. Hence, the nodal term specification can refer to the supplied network (e.g., "network" in the examples above).

But this gets more complex when you're looking at edge effects. Remember, the ERGM process is adding and removing edges. Therefore, ERGM needs to know the values of all possible edges in the network, not just the observed ones. A moment of thought (or in my case, a few days of thought) should make it clear that you can't finesse the issue by setting all the unobserved edges to NA: the only "computable" network would be the observed network, and every other random perturbation that ERGM tries to create would have missing data. In fact, you can't have any missing edge data: NA/NaN/Inf will throw an error when you try to compute the ERGM. (There are probably techniques to handle a limited amount of missing data, but they are a) far beyond my abilities right now, and b) unlikely to work in common cases.)

So, where does this get us? The network that you pass into edgecov needs to be complete, with no missing data. Of course, if you have an undirected network, you only need the upper or lower triangle, and you may be able to set the diagonal to a constant. In many cases, unobserved edges have a fixed value: for instance, if your data set revolves around meetings, and edge weights are the number of meetings two people attended together, non-existence of the edge implies the value is zero. But in other cases, even if the edge wasn't observed, it still has a value. For instance, if edge weights are the geographic distance from one person to another, you still have to calculate the distance for every pair, including pairs of people in different cities who share no observed edge.
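To make the "complete, with no missing data" requirement concrete, here is a minimal base R sketch of the meetings case. The data and the names n, seen, and meetings_m are made up for illustration, and it assumes vertex ids are the integers 1..n.

```r
# Hypothetical data: 4 people, and shared-meeting counts for the
# pairs that were actually observed (id1, id2, count).
n <- 4
seen <- data.frame(id1 = c(1, 1, 3), id2 = c(2, 3, 4), count = c(5, 2, 1))

# Start from zero everywhere: "no edge observed" means "zero meetings".
meetings_m <- matrix(0, nrow = n, ncol = n)

# Fill in both triangles so the matrix is symmetric (undirected case).
idx <- cbind(seen$id1, seen$id2)
meetings_m[idx] <- seen$count
meetings_m[idx[, 2:1, drop = FALSE]] <- seen$count

# A covariate is usable only if every entry is a finite number
# (is.finite() is FALSE for NA, NaN, Inf and -Inf).
complete <- all(is.finite(meetings_m))
```

For a covariate like distance there is no such default value, so every pair has to be computed and filled in explicitly before a check like this will pass.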

Let me get even more specific. In one example, I end up having two networks: people, and all_pairs. The data files are set up as comma-separated files, without header rows. The people file has just two columns, id1 and id2, in a standard edge-list representation. It contains only the observed edges. First, I load and create a network based on people:

# Loading ergm also attaches the network package
library(ergm)

# I use _e to indicate the edgelist array
# and _n to indicate the actual network

people_e <- read.csv('people.net', header=FALSE)
people_n <- network(people_e, directed=FALSE)

Next, I want to load the all_pairs network, and add the edge weights. all_pairs has three columns: id1, id2, and weight (distance). This has every possible edge. Since it's undirected, I only have one triangle here, the one where the row id is greater than the column id. It's also in edgelist format, mostly because it's generated from the same underlying database as people.net, which ensures my nodes align.

all_pairs_e <- read.csv('all_pairs.net', header=FALSE)
# The [,1:2] gets just the first two columns
all_pairs_n <- network(all_pairs_e[,1:2], directed=FALSE)
# The third column holds the distance for each pair
set.edge.attribute(all_pairs_n, 'dist', all_pairs_e[,3])

Now, it's off to the races:

m0 <- ergm(people_n ~ edges)
m1 <- ergm(people_n ~ edges + dyadcov(all_pairs_n, 'dist'))
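Since the documentation quoted above says x can also be a matrix of edgewise covariates, the same distances could presumably be supplied as a plain n-by-n matrix instead of a second network. Here is a sketch in base R, under some assumptions: ids are the integers 1..n, a small made-up data frame stands in for all_pairs.net, and dist_m is a hypothetical name.

```r
# Stand-in for all_pairs_e: one triangle (id1 > id2) of every pair,
# with the distance in the third column.
all_pairs_e <- data.frame(V1 = c(2, 3, 3, 4, 4, 4),
                          V2 = c(1, 1, 2, 1, 2, 3),
                          V3 = c(1.5, 4.0, 2.5, 7.0, 6.0, 3.0))
n <- max(all_pairs_e[, 1:2])

# Start with NA everywhere so any pair we forget stays visible.
dist_m <- matrix(NA_real_, n, n)
diag(dist_m) <- 0  # the diagonal is set to a constant, as discussed above

# Copy the one stored triangle into both halves of the matrix.
idx <- as.matrix(all_pairs_e[, 1:2])
dist_m[idx] <- all_pairs_e[, 3]
dist_m[idx[, 2:1]] <- all_pairs_e[, 3]

# Fail early if any pair is still missing or non-finite.
stopifnot(all(is.finite(dist_m)))

# With people_n built as above, the model call would then look like:
# m1 <- ergm(people_n ~ edges + dyadcov(dist_m))
```

Rows and columns of the matrix have to follow the same vertex order as people_n, so it is worth checking that alignment before trusting the fit.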

I hope this helps...or maybe I'm just dense about this, and this is obvious to everyone else.

3 Comments:

Anonymous said...

Thank you that post was *really* useful.

If all_pairs_n had been directed, i.e. if the 'distance' from i to j in one direction was different than that from j to i, how do you know which way around to format the dyadcov matrix?

dyadcov produces three outputs (upper, lower and mutual) when the network is directed, so maybe it doesn't matter and you just need to choose the right one to interpret. But then how do you know which one?

I'm two days into ERGMs.

12:57 PM  
Anonymous said...

Thanks! I also struggled to find examples with edgecov and dyadcov terms.

2:39 PM  
Anonymous said...

thank you!

10:05 AM  
