Using Modularity to Find and Explore Web Archived Communities

While playing with my WAT files of Canadian Political Parties, I wondered about finding communities and clusters of websites. Using Gephi’s functions, could we learn something about the websites that cluster around a specific political party?

This wasn’t the most successful experiment, but I think it lays the groundwork for some future explorations with metatext, and perhaps for using Gephi’s command-line functions to begin automating this sort of analysis. But let me show you what I found.

The initial impetus came from this:
[Screenshot: links into and out of the Conservative and Liberal party websites]

At left, we see the links into and out of the Conservative Party of Canada’s website; at right, the links into and out of the Liberal Party of Canada’s. Raw links alone only get us so far, but Gephi has a modularity function built in, and in that there’s potential to help us learn more about these massive dumps of data.

As the authors of the underlying algorithm explain:

Our method, that we call Louvain Method (because, even though the co-authors now hold positions in Paris, London and Louvain, the method was devised when they all were in Louvain), outperforms other methods in terms of computation time, which allows us to analyze networks of unprecedented size (e.g. the analysis of a typical network of 2 million nodes only takes 2 minutes). The Louvain method has also been shown to be very accurate by focusing on ad-hoc networks with known community structure. Moreover, due to its hierarchical structure, which is reminiscent of renormalization methods, it allows to look at communities at different resolutions.

The method consists of two phases. First, it looks for “small” communities by optimizing modularity in a local way. Second, it aggregates nodes of the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained.
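If you would rather experiment with the same algorithm outside of Gephi, here is a minimal sketch using the Louvain implementation bundled with networkx (version 2.8 or later); the filename is a hypothetical stand-in for a graph exported from Gephi.

```python
# A minimal sketch of a Louvain pass outside Gephi, using networkx >= 2.8.
# "parties.gexf" is a hypothetical export of the link graph from Gephi.
import networkx as nx

G = nx.Graph(nx.read_gexf("parties.gexf"))  # collapse to an undirected graph

# Louvain's two phases -- local modularity optimization, then aggregation
# of communities into super-nodes -- repeat until modularity stops improving.
communities = nx.community.louvain_communities(G, seed=42)

print(len(communities), "communities found")
print("modularity:", nx.community.modularity(G, communities))
```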

Accordingly, I ran Gephi’s modularity statistic, which produces the colouring you see in the graphs below.

[Screenshot: the network graph coloured by modularity class]

We can inspect each node to see the modularity class it has been assigned – which is great for those of us who are colour-blind – and notice a few different things. Note that the algorithm has found a community spanning out from each political party, and that it is distinct from the community containing the party itself. If we want the political blogosphere, we can calculate the nearest neighbours like so.

[Screenshot: inspecting a node’s modularity class and nearest neighbours]

If you wanted a rough approximation of Liberal interests, you could simply pick the websites belonging to the Liberal.ca modularity class, or you could incorporate the separate blogosphere community as well.
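Continuing the hypothetical networkx sketch above, that rough cut might look like the following; the node label is an assumption about how hosts are named in your graph.

```python
# A rough cut at "Liberal interests": the modularity class containing the
# party's host, plus that host's direct neighbours in the link graph.
# "liberal.ca" is a hypothetical node label -- match it to your own data.
seed = "liberal.ca"
seed_class = next(c for c in communities if seed in c)  # its modularity class
neighbours = set(G.neighbors(seed))                     # sites linked to/from it

liberal_interests = seed_class | neighbours
print(sorted(liberal_interests))
```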

What I did next was:

      * go to the Data Laboratory;
      * export the data as a CSV file, making sure to include the modularity class;
      * sort the different classes in Excel to see who belonged to what (a scripted equivalent is sketched below).
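For those who would rather skip Excel, here is a minimal sketch of that grouping in Python, assuming a hypothetical nodes.csv exported from the Data Laboratory; the column names are assumptions, so adjust them to match your export’s headers.

```python
# Group hosts by modularity class from a hypothetical "nodes.csv" export.
# "Id" and "modularity_class" are assumed column names -- check your CSV.
import csv
from collections import defaultdict

classes = defaultdict(list)
with open("nodes.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        classes[row["modularity_class"]].append(row["Id"])

for cls, hosts in sorted(classes.items()):
    print(f"class {cls}: {len(hosts)} hosts")
    for host in sorted(hosts)[:10]:  # first ten hosts per community
        print("   ", host)
```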

We then begin to see things like this:

[Screenshot: hosts belonging to the Liberal.ca modularity class]

Based on links – and links alone – we see that Gephi is finding some Liberal blogs, Maclean’s, NOW Magazine in Toronto, and is also grouping in the federal NDP. The NDP’s bloggers don’t form the same community, but the two parties’ websites link to, and are linked from, similar websites.

The Conservative class is similar:

[Screenshot: hosts belonging to the Conservative modularity class]

For both parties, what I found very useful is that while private citizens’ blogs don’t generally land in the same class, cabinet ministers and MPs are lumped in: the Conservative class reads like a who’s who of backbenchers and cabinet ministers, including Cheryl Gallant, Dave Mackenzie, Christian Paradis, and Chris Alexander.

This isn’t earth-shattering, but it tells me that the approach is working. As a rough finding aid for unstructured collections of websites, it would be useful.

The next step broke down, however. I took each of those host domains and fed it into a command that looked like:

wget -O davemackenzie.ca.html https://web.archive.org/web/20140805200004/http://davemackenzie.ca
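Scripted over the full list, the fetch might look something like this minimal sketch; hosts.txt is a hypothetical one-host-per-line file drawn from the CSV export above, and the timestamp is the one from the wget example.

```python
# A minimal sketch looping the same fetch over a hypothetical "hosts.txt".
# The Wayback Machine redirects each request to the snapshot nearest
# the given timestamp.
import urllib.request

TIMESTAMP = "20140805200004"  # taken from the wget example above

with open("hosts.txt", encoding="utf-8") as f:
    hosts = [line.strip() for line in f if line.strip()]

for host in hosts:
    url = f"https://web.archive.org/web/{TIMESTAMP}/http://{host}"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            html = resp.read()
        with open(f"{host}.html", "wb") as out:
            out.write(html)
    except OSError as e:  # dead hosts, broken redirects, timeouts
        print(f"failed on {host}: {e}")
```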

And then, across the resulting directory of files:

for f in *.html; do html2text "$f" >> 2014.txt; done

My goal was to get snippets of each website and run them through a text analysis suite, so that each community would have a body of text to help me make sense of it. I’d then compare 2006, 2007, and so forth to each other, and see how official websites stacked up against community blogospheres.

Alas, it failed. We get lots of language-selection splash pages (Canada being a bilingual country) and redirects; some sites resolve to index.html and others don’t, and even setting a redirect limit of one or two doesn’t lead to good results. Many politicians try to collect user information before you can even proceed to their website. Short of recursively grabbing all of the pages, I’m at a loss.

But I think there’s a decent idea in here, and I would love to hear your thoughts. The next step is going to be extracting metatext from these WAT files and comparing the metatext within these communities to each other. That may require some memory wrangling – and time.

3 thoughts on “Using Modularity to Find and Explore Web Archived Communities”

  1. davidzylberberg says:

    Ian,

    One thing I wonder is whether it is possible to track the relationship between Twitter, Facebook, and blog posts to check who is cross-referencing or reading each other. Many people have a sense that followers of political parties have become increasingly siloed in recent years (the phenomenon in the US has been written about in terms of Red Families v. Blue Families, and demographic clustering). If you could do a social network analysis of Facebook accounts that noted who either explicitly supported candidates or forwarded party material, you could probably get a decent sense of the extent to which individuals’ social networks disproportionately support one party while the country is split. If you can track changes since 2005, you might also be able to determine whether this is intrinsic to a social-media-driven news culture or developed at some specific point. I look forward to hearing more about this project as it progresses.

    David Zylberberg

  2. Ian Milligan says:

    Hi David –

    Great points here, too. Work connecting web archives and social media is lacking. My colleague Peter Webster has made this point in a great blog post – http://peterwebster.me/2015/01/20/religion-social-media-and-the-web-archive/ – as social media researchers are too focused on the present.

    The downside is that while we can access archived Tweets, it’s super hard and often expensive to do so at anything approaching the scale we want. And Facebook – well, given privacy concerns, I don’t think we’re going to be able to go back and track changes…

    But one can dream, and if anything, we need to be cognizant of the gaps in the record that are emerging.
