Taking the full text of my sample of Canadian (.ca only) websites (still being finessed, but currently 622,365 URLs out of a scrape total of 8,512,275, or 7.31%), I ran it through Stanford NER and extracted the most frequently mentioned locations, organizations, and people. This is a morning’s work, done mainly while my desktop crunched away at some other stuff, so I need to preface the post by saying that the data has not been cleaned up.
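For anyone wanting to try the counting step, here is a minimal sketch in Python. It assumes the tagger was run with its inline XML output, so that entities come back wrapped in tags like `<LOCATION>…</LOCATION>`; that is an assumption about the setup, not necessarily how this run was done. The function name `count_entities` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

# Matches inline-XML style entity tags, e.g. <LOCATION>Ottawa</LOCATION>.
# The three labels are the classes discussed in this post.
TAG_PATTERN = re.compile(r"<(LOCATION|ORGANIZATION|PERSON)>(.*?)</\1>", re.DOTALL)


def count_entities(tagged_text):
    """Return one Counter per entity class, keyed by the entity string."""
    counts = {
        "LOCATION": Counter(),
        "ORGANIZATION": Counter(),
        "PERSON": Counter(),
    }
    for label, entity in TAG_PATTERN.findall(tagged_text):
        # Collapse internal whitespace so line breaks do not split one entity in two.
        counts[label][" ".join(entity.split())] += 1
    return counts


# Made-up sample text, purely illustrative.
sample = (
    "<ORGANIZATION>Library and Archives Canada</ORGANIZATION> is in "
    "<LOCATION>Ottawa</LOCATION>, <LOCATION>Ontario</LOCATION>. "
    "<LOCATION>Ottawa</LOCATION> is the capital."
)

entity_counts = count_entities(sample)
for place, n in entity_counts["LOCATION"].most_common():
    print(place, n)
```

Calling `most_common()` on each counter gives the ranked lists that the rest of the post talks about.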
The results were interesting but fairly dry: “Canada” was the top location, for example, followed by Ontario, Toronto, Ottawa, Alberta, etc. The United States comes out as under-represented, mainly because its name appears in so many forms (US, u s, America, United States, etc.) and each form is counted separately. There will be a similar issue with the United Kingdom. If this turns into ‘real’ research rather than tinkering, there’s a lot of cleaning up to do. But overall, we can get a rough sense of which countries appear in this sample and how often.
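The spelling problem could be tackled with a small lookup table that folds variants into one canonical name before counting. The sketch below is only that, a sketch: the `VARIANTS` table, the function names, and the counts in `raw` are all invented for illustration, and a real cleanup would need a much longer list.

```python
import re
from collections import Counter

# Hypothetical variant table: normalized key -> canonical country name.
VARIANTS = {
    "us": "United States",
    "u s": "United States",
    "usa": "United States",
    "america": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "u k": "United Kingdom",
    "united kingdom": "United Kingdom",
    "great britain": "United Kingdom",
    "britain": "United Kingdom",
}


def normalize(name):
    """Map a raw location string to its canonical name, if we know one."""
    # Lower-case, turn punctuation into spaces ("U.S." -> "u s"), squeeze whitespace.
    key = re.sub(r"[^a-z ]", " ", name.lower())
    key = " ".join(key.split())
    return VARIANTS.get(key, name)


def merge_counts(raw_counts):
    """Sum the counts of all variants under their canonical name."""
    merged = Counter()
    for name, n in raw_counts.items():
        merged[normalize(name)] += n
    return merged


# Made-up counts, not figures from the actual sample.
raw = Counter({"US": 3, "u s": 1, "America": 2, "United States": 4, "U.K.": 2, "Canada": 10})
merged = merge_counts(raw)
print(merged.most_common())
```

Names that are not in the table pass through untouched, so the approach can be built up incrementally as more variants turn up in the data.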
Thanks to IBM Many Eyes we can throw this stuff at the wall and see what comes out.
In this graphic, we see different countries and how often they are mentioned. With the caveat that this is rough data, we can see the big players emerge. Canada drowns out all others, which is unsurprising given the sample. After that come the United States, Russia, China, Brazil, Australia, and Western Europe. There is almost nothing in sub-Saharan Africa, which tells you something about how this sample came together. The fact that the map looks plausible suggests, I think, that the approach is working at least a bit.
Let’s take Canada alone:
Now this is useful. Ontario stands out, but what is surprising is the relative overrepresentation of Alberta.
If we had multiple scrapes, we could really learn something here. Anyways, a neat approach to archived big data.