While playing with my WAT files of Canadian Political Parties, I wondered about finding communities and clusters of websites. Using Gephi's functions, could we learn something about the websites that cluster around a specific political party?
This wasn't the most successful experiment, but I think it lays the groundwork for some future explorations with meta-text, and perhaps for using Gephi's command-line functions to begin automating this sort of analysis. But let me show you what I found.
At left, we see the links into and out of the Conservative Party of Canada's website; at right, the links into and out of the Liberal Party of Canada's. Links by themselves aren't the best example, but Gephi has a modularity function built in, and in that there's potential to help us learn more about these massive dumps of data. As the Louvain method's authors explain:
> Our method, that we call Louvain Method (because, even though the co-authors now hold positions in Paris, London and Louvain, the method was devised when they all were in Louvain), outperforms other methods in terms of computation time, which allows us to analyze networks of unprecedented size (e.g. the analysis of a typical network of 2 million nodes only takes 2 minutes). The Louvain method has also been shown to be very accurate by focusing on ad-hoc networks with known community structure. Moreover, due to its hierarchical structure, which is reminiscent of renormalization methods, it allows to look at communities at different resolutions.
>
> The method consists of two phases. First, it looks for “small” communities by optimizing modularity in a local way. Second, it aggregates nodes of the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained.
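Gephi runs all of this behind its modularity button, but if you want to experiment outside the GUI, here is a minimal sketch using networkx (2.8 or later, which ships `louvain_communities`). The file name `parties.gexf` is a hypothetical export of the link graph; any edge list would do:

```python
# Louvain community detection on a Gephi graph export, outside Gephi.
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# "parties.gexf" is a hypothetical export of the hyperlink graph.
G = nx.read_gexf("parties.gexf").to_undirected()

# Both phases (local optimization, then aggregation) run internally,
# repeating until modularity stops improving; the result is node sets.
communities = louvain_communities(G, seed=42)
print(len(communities), "communities, modularity =",
      round(modularity(G, communities), 3))

# Map each host to a community id, analogous to Gephi's modularity class.
cls = {node: i for i, comm in enumerate(communities) for node in comm}
```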
Accordingly, I ran the modularity statistic in Gephi, which produces the colouring that you see on the graphs below.
We can inspect each node to see the actual modularity value assigned (which is great for us colour-blind people) and notice a few different things. Note that Gephi has found a community spanning out from each political party that is distinct from the community of the party itself. If we want the political blogosphere, we can calculate the nearest neighbours like so.
A rough approximation of Liberal interests would be to pick the websites belonging to the Liberal.ca modularity class, perhaps incorporating the separate blogosphere community as well.
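In code terms, and continuing the hypothetical networkx sketch above, that heuristic might look like the following; the node label `"liberal.ca"` is an assumption about how hosts are keyed in the graph:

```python
# Immediate neighbours of liberal.ca, intersected with its modularity class.
# The "liberal.ca" label is an assumption about the graph's node keys.
neighbours = set(G.neighbors("liberal.ca"))
liberal_class = {n for n, c in cls.items() if c == cls["liberal.ca"]}
print(sorted(neighbours & liberal_class))
```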
What I did next was:

- Go to the Data Laboratory;
- Export the data as a CSV file, making sure to include the modularity class;
- Sort the different classes in Excel to see who belonged to what (or script that step, as sketched below).
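For what it's worth, the Excel sort can also be scripted. Here is a minimal pandas sketch; the file and column names (`nodes.csv`, `modularity_class`, `Id`) are assumptions, so check the header of your own Data Laboratory export:

```python
# Group the Gephi Data Laboratory export by modularity class.
# File and column names are assumptions; adjust to your own CSV header.
import pandas as pd

nodes = pd.read_csv("nodes.csv")
for mod_class, group in nodes.groupby("modularity_class"):
    # Print the class id and the first ten hosts assigned to it.
    print(mod_class, ":", ", ".join(group["Id"].astype(str).head(10)))
```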
And we then begin to see things like this:
Based on links, and links alone, we see that Gephi is finding some liberal blogs, Maclean's, NOW Magazine in Toronto, and is also grouping in the federal NDP. Their bloggers don't form the same community, but the two party websites link to, and are linked from, similar websites.
The Conservative class is similar:
For both parties, I found it very useful that while private citizens' blogs don't generally fall into the same class, cabinet ministers and MPs are lumped in: I'm currently looking at a who's who of backbenchers and cabinet ministers, including Cheryl Gallant, Dave Mackenzie, Christian Paradis, and Chris Alexander.
This isn't earth-shattering, but it tells me that the approach is working. As a rough finding aid for unstructured websites, this would be useful.
The next step broke down, however. What I did was take each of those host domains and feed it into a command that looked like:
wget -O davemackenzie.ca.html https://web.archive.org/web/20140805200004/http://davemackenzie.ca
And then, in each of the resulting directories:
for f in *.html; do html2text "$f" >> 2014.txt; done
My goal was to get snippets of each website and run them through a text analysis suite, so that each community would have a set of data to help me make sense of it. I'd then compare 2006, 2007, and so forth against each other, to see how official websites stacked up against community blogospheres.
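Had the harvesting worked, a first pass at that comparison could have been as crude as word counts per year file, along these lines (the `2006.txt`, `2007.txt` naming is assumed from the loop above):

```python
# Crude word frequencies per year file, for eyeballing change over time.
from collections import Counter
from pathlib import Path
import re

for path in sorted(Path(".").glob("2*.txt")):
    words = re.findall(r"[a-z]{4,}", path.read_text(errors="ignore").lower())
    print(path.stem, Counter(words).most_common(10))
```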
Alas, the harvesting failed. We get lots of splash pages (Canada being a bilingual country) and redirects; some hosts resolve to index.html and others don't, and even a redirect limit of one or two doesn't lead to good results. Many politicians also try to collect user information before you can even proceed to their website. Short of recursively grabbing all of the pages, I'm at a loss.
But I think there's a decent idea in here, and I would love to hear your thoughts. The next step is going to be extracting meta-text from these WAT files and comparing the meta-text within these communities to each other, though that may require some memory wrangling, and time.
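For the curious, here is roughly where I would start on that meta-text extraction, sketched with warcio under the assumption that the WAT files follow the Common Crawl JSON layout (yours may differ):

```python
# Walk a WAT file and print the <title> and <meta> tags of each record.
# WAT files are WARC files whose "metadata" records carry JSON payloads.
import json
from warcio.archiveiterator import ArchiveIterator

with open("parties.wat.gz", "rb") as fh:   # hypothetical file name
    for record in ArchiveIterator(fh):
        if record.rec_type != "metadata":
            continue
        payload = json.loads(record.content_stream().read())
        # The path below follows the Common Crawl WAT layout; adjust as needed.
        head = (payload.get("Envelope", {})
                       .get("Payload-Metadata", {})
                       .get("HTTP-Response-Metadata", {})
                       .get("HTML-Metadata", {})
                       .get("Head", {}))
        if head:
            print(head.get("Title"), head.get("Metas"))
```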