Week 11: Mining Big Data

One of the aspects of Digital History that excites historians is the huge pile of data that institutions have scanned and users have uploaded. Maybe, just maybe, that random piece of ephemera or key piece of research is lurking out there on the internet.

Now the question is how we use it, now that it’s out there, and where the heck is it? Several articles this week covered attempts to mine this data and how overwhelming the sheer piles of apparently unrelated data afloat in cyberspace can be. Daniel Cohen’s article From Babel to Knowledge discusses H-Bot and his use of an application programming interface (API) to create a syllabus finder. For the latter he used Google’s search engine to craft a way to search for syllabuses. Later he describes how APIs are prohibitively expensive for non-profits to supply, mentioning that they tax servers and require technical support and staff time.

This is the part of the equation I don’t understand. How was Cohen able to create the Syllabus Finder? Did he have a grant and, if so, what happened after the grant ran out? He wasn’t hosting any of the syllabuses; instead he created a tailored Google search to do his bidding and pull syllabuses from any university website that posts them. This would increase the traffic to these sites and therefore raise costs slightly for the institutions. The question is, would they even notice? Who is pulling syllabuses in such numbers that servers bog down? If more APIs are created specifically to search for historical material, access gets easier. Institutions that have nothing in common but Victorian cat photos could be searched at the same time, much to the benefit of the ironic feline photography aficionado.

I also had fun tinkering around with some of the word tools. Voyant Tools and Wordle do approximately the same thing: input a URL or paste text and they generate visual art based on how common words are in the text. I tried my blog, Wired Love, and a series of other novels and blogs. Sadly the most common word is “the” (surprise!). Even with my tinkering I couldn’t get the tools to take out those common words so I could see anything useful. They have the potential to be cool, but I am simply missing something.
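For what it’s worth, the usual fix for the “the” problem is a stop-word list: a set of very common words that get thrown out before counting. Here is a minimal sketch of that idea in Python. It is not how Voyant Tools or Wordle work internally; the tiny stop-word list, the `top_words` function and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption for this sketch;
# real tools tend to use much longer lists).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is", "it", "that", "was", "i"}


def top_words(text, n=5, stop_words=STOP_WORDS):
    """Return the n most common words in text, skipping stop words."""
    # Lowercase everything and keep only runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Count only the words that are not in the stop-word list.
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Made-up sample text, just to show the call.
sample = "The cat sat on the mat. The cat saw the dog, and the dog saw the cat."
print(top_words(sample, n=3))
```

With the stop words gone, “the” never reaches the counter, so whatever rises to the top is at least a word that says something about the text.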

Ngram Viewer, oh the fun I had! Not only did I test if “Amanda” showed up more than “Barry” but I got to see how popular my favorite Victorian sayings were. First, “some pumpkins” is how you compliment a fine looking lady in a slightly vulgar way: “She is some pumpkins.” Second, “virago queens” are loud, overly dramatic women. Please see the Ngram here. Now, the drawback is exactly what the execute button says: “search lots of books.” It’s not all books or all data ever, it’s all books that are searchable in Google Books. Pretty cool, still.

