Today is often referred to as an era of “information overload.” There is more information available than anyone could read in a lifetime. Remember the bookshop scene from Disney’s Beauty and the Beast?
Those days of having read everything available are long gone. This is somewhat troublesome for researchers, since they’re expected to review the important and relevant works on their research topics, and today those works number far more than “just a few,” making it extremely difficult to keep up with the latest trends in various academic and popular fields. Given this information overload, how can we get a better look at what’s out there among the millions of written pieces being created every day? What is the optimal way to sort through all of this “stuff” and find the gem that will answer our burning questions? A few tools have emerged to help us sift through all of this information.
Jean-Baptiste Michel et al.’s Quantitative Analysis of Culture Using Millions of Digitized Books was a thought-provoking read, revealing the research possibilities that Optical Character Recognition (OCR) opens up. The authors used the Google Ngram Viewer to track issues and people over time. They chose a subset of over 5 million books digitized by Google Books (roughly 4% of all books ever published as of the article’s submission) and ran a few experiments, reflecting in the article on the Ngram Viewer’s many uses. One of the topics I found most interesting was its ability to detect censorship and repression. See the example of the Jewish artist Marc Chagall:
Nazi censorship of the Jewish artist Marc Chagall is evident by comparing the frequency of “Marc Chagall” in English and in German books. In both languages, there is a rapid ascent starting in the late 1910s (when Chagall was in his early 30s). In English, the ascent continues. But in German, the artist’s popularity decreases, reaching a nadir from 1936 to 1944, when his full name appears only once. (In contrast, from 1946 to 1954, “Marc Chagall” appears nearly 100 times in the German corpus.)
For my own experiment, I decided to search terms that described First Peoples over the Twentieth Century and into the Twenty-First Century. Drawing from the English Corpus, I searched the terms “Native Americans”, “First Nations”, “Aboriginals”, “First Peoples”, and “Natives.” I didn’t include “Indians” because the Ngram Viewer would not be able to distinguish between what I was looking for and someone from India, though I’m almost certain that “Indian” would dominate the lexicon for the better part of the century. It turned out that the term “Natives” dominated the lexicon for most of the century until the late 1980s, when “Native Americans” shot up and continued to dominate through 2008. Nevertheless, it still appeared that in 2008, English-speaking authors were a bit confused as to what to call the first inhabitants of this continent.
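Under the hood, the Viewer’s y-axis is just each n-gram’s yearly count divided by the total number of words published that year, and Google also releases the raw ngram files for download. The sketch below shows how a comparison like mine could be recomputed from that data; it assumes the published tab-separated layout (ngram, year, match_count, volume_count), and every number in it is invented for illustration, not real Google Books data.

```python
# Sketch: turning raw Google Books ngram counts into the relative
# frequencies the Ngram Viewer plots. The line format mirrors the
# downloadable datasets, but these counts are made up for illustration.

sample_ngram_lines = """\
Native Americans\t1980\t1200\t340
Native Americans\t1990\t20000\t2100
Natives\t1980\t15000\t4200
Natives\t1990\t14000\t3900"""

# Hypothetical yearly word totals (in the real datasets these come
# from a separate total_counts file).
total_words = {1980: 1_000_000_000, 1990: 1_200_000_000}

def relative_frequencies(raw: str, totals: dict) -> dict:
    """Return {ngram: {year: frequency}}, as a fraction of all words that year."""
    freqs = {}
    for line in raw.splitlines():
        ngram, year, match_count, _volumes = line.split("\t")
        year, match_count = int(year), int(match_count)
        freqs.setdefault(ngram, {})[year] = match_count / totals[year]
    return freqs

freqs = relative_frequencies(sample_ngram_lines, total_words)
for ngram, by_year in freqs.items():
    print(ngram, {y: f"{f:.2e}" for y, f in sorted(by_year.items())})
```

With these invented counts, “Natives” leads in 1980 and “Native Americans” overtakes it by 1990, mirroring the shape of the trend I saw in the Viewer.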
In a related study, Adrian Veres and John Bohannon used the Ngram Viewer to create the Science Hall of Fame. The person who topped the list will probably surprise you. Or, if you decided not to click the link, the top five were Bertrand Russell, Charles Darwin, Albert Einstein, Lewis Carroll, and Claude Bernard. I personally thought Einstein or Stephen Hawking would top the list, so it goes to show how data can introduce a different perspective and trigger more questions. The one thing I would love to see added to the Ngram Viewer is newspapers in the corpus. I understand how challenging that would be in terms of copyright, availability, and so on, but their inclusion would make the data so much richer and give us a larger window into the trends of the past.
Now, there are some drawbacks to OCR and its search capabilities. Ian Milligan blogged about his experience using digitized newspaper databases in his research, warning about several of these pitfalls. He used the example of the Toronto Star, which digitized 110 years’ worth of microfilmed newspapers in just four months; some errors are bound to be made in a rush job like that. He noted that microfilm streaks can obscure characters, tilted pages or characters can be missed, and words hyphenated across two lines are not accounted for. He suggested that we ought to hold content providers accountable by asking how their databases work, but also that we should dig into the databases ourselves and be more thorough in our searching. For example, if we know a certain event occurred on a particular date, it may be worth simply browsing the articles written on and around that date; maybe serendipity will reward us with the “golden article” that we need.
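One common way to search more forgivingly around this kind of OCR damage is fuzzy matching with Levenshtein (edit) distance, so a streak-damaged “Stai” still matches a search for “Star.” Milligan doesn’t prescribe this particular technique; the sketch below is just one illustration, and its OCR snippet is invented rather than real Toronto Star output.

```python
# Sketch: edit-distance fuzzy search over OCR'd text, so words mangled
# by streaks or tilted pages can still be found. The sample text is an
# invented example of typical OCR damage, not a real database excerpt.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_find(term: str, text: str, max_dist: int = 2) -> list:
    """Return the words in `text` within `max_dist` edits of `term`."""
    return [w for w in text.split()
            if edit_distance(term.lower(), w.lower()) <= max_dist]

ocr_text = ("The Torcnto Star reported that tilted pages "
            "and streaks meant Stai was all that survived")
print(fuzzy_find("Star", ocr_text))  # matches both "Star" and the damaged "Stai"
```

A tolerance of one or two edits catches most single-character OCR errors while still excluding unrelated words; real newspaper databases that offer “fuzzy” search are doing something in this spirit, usually with more sophisticated indexing.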
Finally, I also wonder about the long-term impact of scanning the web. For me, the internet is kind of like an all-you-can-eat buffet: everything looks so good that I dare not take a heaping plate of any one thing, since I might miss out on something else. The drawback of this type of searching is that I often don’t take the time to really immerse myself in anything. I find the quote I’m looking for, and I move on to the next article. Nicholas Carr wrote an intriguing piece entitled “Is Google Making Us Stupid?”. He talked about how his reading habits have changed with the advent of the internet, confessing that though the internet has been a godsend to him as a writer, he is no longer able to focus on long articles and gets easily distracted. He reminisced, “Once I was a scuba diver in the sea of words. Now I zip along the surface like a guy on a Jet Ski.” Though his article is purely anecdotal, I find I can relate to much of what he’s talking about, and I wonder if the amount of time I’ve logged on the internet since I was 14 is the reason why I dislike reading most books and rarely do it for fun.
For class next week
1. Reflect upon Carr’s article. Can you relate to Carr and his experience? Or is he just a “worrywart”?
2. Try playing with the Google Ngram Viewer. Can you discover any neat trends? Do you notice any drawbacks or deficiencies in using this tool?