Thursday, September 13, 2012

ERGM: edgecov and dyadcov specifications

Exponential Random Graph Models (ERGMs) are a powerful technique to model, understand, and predict networks. What I find powerful about ERGMs is that the model can contain terms describing the nodes, edges, local structure and global structure of the graph, all at once. That's the good side of ERGMs. The bad side is that statnet, although amazingly powerful, is still very much a work in progress, and some of the documentation is rather sparse.

This is especially clear in the case of edgecov() and dyadcov(), the terms that model edge effects on the network. The standard reference for ERGM terms is here, and its dyadcov section has this to say about the specification of the term:

If the network is undirected, x is either a matrix of edgewise covariates, or a network; if the latter, optional argument attrname provides the name of the edge attribute to use for edge values.

I spent a few months beating my head against the wall, trying to figure out exactly what that meant, in the practical application of how to actually specify an edgecov() term. (For this discussion, edgecov() and dyadcov() are more or less interchangeable. Of course, you need to understand your data and your models to know if that's the case for you.) Ideally, I wanted something as simple as nodematch:

network ~ edges + nodematch("age")

Instead, all of the examples went something like this:

network ~ edges + edgecov(full_net, "age")

That left many of the details of full_net ambiguous. (Honest! Try to find a single example with full data on the web. I couldn't, and I'm an Information Professional(tm).) The best starting point I found was a recent posting on The Shortest Path, but that was still a little ambiguous for me, talking about a "complete network" and the "observed network."

At this point, step back a bit, and remember what the ERGM call is doing conceptually. At a high level, it takes your supplied network, adds and removes edges, and sees how that changes the network. It uses those (small) changes to build an understanding of how all the terms in the model specification interact and affect the overall network. In the case of attributes on the nodes, it can use the data already in the network; it's assumed you have observed all possible nodes, and have only a small amount of missing data. Hence, the nodal term specification can refer to the supplied network (e.g. "network" in the examples above).

But this gets more complex when you're looking at edge effects. Remember, the ERGM process is adding and removing edges. Therefore, ERGM needs to know the covariate values of all possible edges in the network, not just the observed ones. A moment of thought (or in my case, a few days of thought) should make it clear that you can't finesse the issue by setting all the unobserved edges to NA: the only "computable" network would be the observed network, and every other random perturbation that ERGM tries to create would have missing data. In fact, you can't have any missing edge data: NA/NaN/Inf will throw an error when you try to compute the ERGM. (There are probably techniques to handle a limited amount of missing data, but they are a) far beyond my abilities right now, and b) unlikely to work in common cases.)
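To make that concrete, here is a minimal base-R sketch of the statistic an edge covariate term contributes, assuming the usual definition (the sum of covariate values over the edges currently present). The matrices y and x and the helper edgecov_stat are made up for illustration; this is not how ergm computes things internally.

```r
# Hypothetical 4-node undirected example.
# y: adjacency matrix of the observed network (1 = edge present).
y <- matrix(0, 4, 4)
y[1, 2] <- y[2, 1] <- 1
y[3, 4] <- y[4, 3] <- 1

# x: covariate values, filled in only for the observed edges; the rest NA.
x <- matrix(NA_real_, 4, 4)
x[1, 2] <- x[2, 1] <- 2.5
x[3, 4] <- x[4, 3] <- 1.0

# Sum of covariate values over present edges, counting each pair once.
edgecov_stat <- function(y, x) {
  ut <- upper.tri(y)
  sum(x[ut][y[ut] == 1])
}

# Only the two observed edges are used here, so the result is 3.5.
stat_observed <- edgecov_stat(y, x)

# Propose adding the unobserved edge 1-3, as the sampler would.
y_new <- y
y_new[1, 3] <- y_new[3, 1] <- 1

# The sum now includes x[1, 3], which is NA, so the result is NA.
stat_proposed <- edgecov_stat(y_new, x)
```

The observed network gives a usable number, but the first proposed change already produces NA, which is why the covariate has to be defined for every pair.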

So, where does this get us? The network that you pass into edgecov needs to be complete, with no missing data. Of course, if you have an undirected network, you only need the upper or lower triangle, and you may be able to set the diagonal to a constant. In many cases, unobserved edges have a fixed value: for instance, if your data set revolves around meetings, and edge weights are the number of meetings two people attended together, non-existence of the edge implies the value is zero. But in other cases, even if the edge wasn't observed, it still has a value. For instance, if edge weights are the geographic distance from one person to another, you still have to calculate the distance for every pair, even for pairs in two different cities that no observed edge connects.
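For the distance case, a complete covariate matrix can be built in base R. The sketch below assumes you have planar coordinates for every node; the coords values are invented for illustration, and real latitude/longitude data would need a great-circle distance rather than the Euclidean one dist() computes.

```r
# Hypothetical node coordinates (e.g. projected x/y in km); one row per node,
# in the same order as the nodes of the observed network.
coords <- cbind(x = c(0, 3, 0, 10),
                y = c(0, 4, 8, 0))

# dist() computes every pairwise Euclidean distance; as.matrix() expands
# the result into a full symmetric n x n matrix with a zero diagonal.
dist_m <- as.matrix(dist(coords))

# Before handing this to edgecov()/dyadcov(), confirm it is complete.
stopifnot(nrow(dist_m) == nrow(coords),
          isSymmetric(dist_m),
          all(is.finite(dist_m)))
```

For the meetings case, the equivalent starting point would be a matrix of zeros, with the observed counts filled in afterwards.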

Let me get even more specific. In one example, I end up having two networks: people, and all_pairs. The data files are set up as comma-separated files, without header rows. People has just two columns, id1, id2, in a standard edge-list representation. This is just the observed edges. First, I load and create a network based on people:

# statnet loads the network and ergm packages used below
library(statnet)

# I use _e to indicate the edgelist array
# and _n to indicate the actual network

people_e <- read.csv('people.net', header=FALSE)
people_n <- network(people_e, directed=FALSE)

Next, I want to load the all_pairs network, and add the edge weights. all_pairs has three columns, id1, id2, and weight (distance). This has every possible edge. Since it's undirected, I only have one triangle here, where r > c. It's also in edgelist format, mostly because it's generated from the same underlying database as people.net, which ensures my nodes align.

all_pairs_e <- read.csv('all_pairs.net', header=FALSE)
# The [,1:2] gets just the first two columns
all_pairs_n <- network(all_pairs_e[,1:2], directed=FALSE)
# Attach the third column as the 'dist' edge attribute (modifies all_pairs_n in place)
set.edge.attribute(all_pairs_n, 'dist', all_pairs_e[,3])

Now, it's off to the races:

m0 <- ergm(people_n ~ edges)
m1 <- ergm(people_n ~ edges + dyadcov(all_pairs_n, 'dist'))
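Since the documentation quoted earlier says x can also be a plain matrix, another route is to turn the all_pairs edge list into a matrix yourself and check it before fitting. This is only a sketch: the small data frame below is a made-up stand-in for all_pairs.net, and it assumes node ids are the integers 1..n.

```r
# Hypothetical stand-in for all_pairs.net: id1, id2, distance, with integer
# node ids 1..n and every pair listed once (lower triangle, id1 > id2).
all_pairs_e <- data.frame(id1  = c(2, 3, 3, 4, 4, 4),
                          id2  = c(1, 1, 2, 1, 2, 3),
                          dist = c(5, 8, 5, 10, 8, 13))

n <- max(all_pairs_e$id1, all_pairs_e$id2)
dist_m <- matrix(NA_real_, n, n)
diag(dist_m) <- 0   # self-pairs are never edges; any constant works

# Fill both triangles so the matrix is symmetric.
idx <- as.matrix(all_pairs_e[, c("id1", "id2")])
dist_m[idx] <- all_pairs_e$dist
dist_m[idx[, 2:1]] <- all_pairs_e$dist

# Every possible pair must now have a value.
stopifnot(!anyNA(dist_m), isSymmetric(dist_m))

# With people_n built as above, the matrix can stand in for the network:
# m1 <- ergm(people_n ~ edges + edgecov(dist_m))
```

One thing to watch: the matrix must have one row per node of people_n, in the same order, so it is worth comparing network.size(people_n) with nrow(dist_m) before fitting.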

I hope this helps...or maybe I'm just dense about this, and this is obvious to everyone else.

Saturday, September 08, 2012

B. wants to have a town similar to Provincetown/Guerneville that's near Seattle. Unfortunately for B, I spent most of the afternoon wrestling with geographic models of gay male social networks, and approaches to understanding them. It's useful to look at what helps support these kinds of gay resort villages (GRVs).

I'm going to use the dataset from Handel, Shklovski, "Ambiguity, Risk and Disclosure" (In press at GROUP 2012: Sanibel Island, FL) as a basis for the analysis, mostly because I have it on my machine so I can do quick and dirty queries against it. I decided to look at the four largest cities by population from this dataset (n=13442). A simple table really helps illustrate how GRVs work from a population / financial basis:

City            "Partner" GRV   CBSA Percentage
San Francisco   Guerneville     ~9.5%
Chicago         Saugatuck       7.5%
Los Angeles     Palm Springs    ~9.5%
New York City   Fire Island     9%

As a key to this, look at Los Angeles. Its GRV is Palm Springs, and the Los Angeles CBSA constitutes about 9.5% of the total population of the dataset. Note: I'm playing really fast and loose with my Core-Based Statistical Areas here, but I don't think it really impacts my basic argument.

What sort of jumps out here is that these cities each have about 7.5-10% of the total population of the dataset. Together, they are more than one third of the entire dataset (35.5%). This suggests that a GRV needs a huge catchment basin to be successful. Unfortunately, even a generous reading of the Seattle catchment basin (WA, OR, ID, and MT) only gives us about 4% of the dataset (even though, surprisingly, Seattle is #6 on the by-city list).
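The arithmetic is easy to redo in R. The shares below are the rounded figures from the table, so the head counts are only rough; the variable names are mine.

```r
# CBSA shares (percent of dataset) as listed in the table above.
share <- c("San Francisco" = 9.5, "Chicago" = 7.5,
           "Los Angeles" = 9.5, "New York City" = 9)
n_total <- 13442   # dataset size quoted above

total_share <- sum(share)                  # combined share, in percent (35.5)
approx_n <- round(n_total * share / 100)   # rough head counts per city
seattle_n <- round(n_total * 4 / 100)      # generous Seattle basin at ~4%
```

Even the smallest of the four cities works out to roughly twice the generous Seattle figure.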

I hate to be a Debbie Downer, but I just don't see how it's going to work from a simple numbers perspective. There are two caveats here, though. First, I think the proper analysis needs to use something more sophisticated, like the gravity model used in demand forecasting. Second, using this simple model, I can't really explain Provincetown: its catchment basin is only 2.25% of the dataset. Admittedly, the Handel/Shklovski dataset is purely gay male, and lesbians may be critical for the success of smaller GRVs.
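For what it's worth, the basic shape of a gravity model is simple: the pull of a destination on an origin grows with the origin's size and shrinks with some power of the distance. The sketch below is a toy version under that assumption; gravity_pull, the exponent beta = 2, and the inputs for origins A and B are all invented for illustration, not measured or fitted values.

```r
# Gravity-style pull of a resort on each origin: share / distance^beta.
# beta = 2 is a conventional placeholder, not an estimate.
gravity_pull <- function(share, distance, beta = 2) {
  share / distance^beta
}

# Purely illustrative inputs: two made-up origins, A (large, far away)
# and B (small, close by). The distances are NOT real measurements.
pull <- gravity_pull(share    = c(A = 9.5, B = 4),
                     distance = c(A = 100, B = 50))
```

In this toy example the smaller but closer origin B ends up with the larger pull, which is the kind of effect a raw percentage comparison can't capture.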