Tuesday, December 20, 2011

Thoughts on ICDM I: Negative Results (part B)

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned !

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because
  • you can target your algorithms to graphs that look like this, hopefully making them more efficient
  • You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
  • If you're lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.
But there aren't even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn't have the heavy-tailed degree distribution that's common in social networks. It also doesn't have a property that's common to large social networks: densification, or the fact that even as the network grows, the diameter stays small (implying that the network seems to get denser over time).

One approach to fixing this models a social network as a Stochastic Kronecker graph. You can read more about these graphs here: a simple way of imagining them is that you add an edge in the graph by a random process that does a (kind of) quad tree like descent down a partitioning of the adjacency matrix and places a 1 at a leaf. SKGs were proposed by Leskovec, Chakrabarti, Kleinberg and Faloutsos, and include ER graphs as a special case. They appear to capture heavy tailed degree distributions as well as densification, and have become a popular model used when testing algorithms on social networks. They're also used as the method to generate benchmark graphs for the HPC benchmark Graph500.

But a thorough understanding of the formal properties of SKGs has been lacking. In "An In-Depth Analysis of Stochastic Kronecker Graphs", Seshadhri, Pinar and Kolda show some rather stunning results. Firstly, they provide a complete analysis of the degree distribution of an SKG, and prove a beautiful result showing that it oscillates between having a lognormal and exponential tail. Their theorems are quite tight: plots of the actual degree distribution match their theorems almost perfectly, and convincingly display the weird oscillations in the degree frequencies (see Figure 2 in the paper).

Secondly, they also formally explain why a noisy variant of SKGs appears to have much more well-behaved degree distribution, proving that a slightly different generative process will indeed generate the desired distribution observed in practice.

Finally, they also show that the graphs generated by an SKG have many more isolated nodes than one might expect, sometimes upto 75% of the total number of vertices ! This has direct implications for the use of SKGs as benchmarks. Indeed, they mention that the Graph500 committee is considering changing their benchmarks based on this paper - now that's impact :)

What I like about this paper is that it proves definitive theoretical results about a popular graph model, and very clearly points out that it has significant problems. So any methodology that involves using SKGs for analysis will now have to be much more careful about the claims it makes.

p.s There's also more supporting evidence on the lack of value of SKGs from another metric (the clustering coefficient, that measures how many configurations uv, uw also have the third edge vw). Real social networks have a high CC, and SKGs don't.This was first mentioned by Sala, Cao, Wilson, Zablit, Zheng and Zhao, and Seshadhri/Pinar/Kolda have more empirical evidence for it as well. (Disclaimer: I was pointed to these two references by Seshadhri: my opinions are my own though :))


  1. From a more practical perspective, we also found that SKGs are incredibly sensitive to parameter tuning, and computationally very expensive to boot.

  2. Nice post!
    "the classic Erdos-Renyi (ER) model of a random graph doesn't have the heavy-tailed degree distribution that's common in social networks."

    What is your opinion of the Barabási-Albert random graphs in this context?

  3. There's a fair bit of controversy about what exactly the Barabasi/Albert graphs tend to produce in terms of degree distribution. I'm not knowledgeable enough to address that aspect of things: however, I point you to the Leskovec et al paper that I linked to above - they have a good discussion of desiderata for graphs that model social networks and how preferential attachment systems like Barabasi/Albert graphs fail to capture observed phenomena in social network graphs.

  4. It's interesting that you mention that there's some controversy regarding that. I'm only an enthusiast hoping to learn more. But John Hopcroft had a similar answer too.

    Thanks for pointing out that paper, I didn't click through. Looks very nice!

  5. You should start here for some understanding of the controversy: http://cs.unm.edu/~aaron/blog/archives/2005/10/links_links_lin.htm

    In brief, the issue is "What's a power law", and I think the definitive research article on how to determine whether data follows a power law is by Clauset, Shalizi and Newman: http://arxiv.org/abs/0706.1062

  6. Thank you very much! The Shalizi paper looks excellent.

  7. For a more high-level take on the "power law" issue, you might check out my editorial:
    The Future of Power Law Research


    or my survey -- which (unfortunately) predates the SKG stuff, but a lot of the same issues keep reappearing.



Disqus for The Geomblog