What OCR can do for You!

Today is often called an era of “information overload.” There is more information available than anyone could read in a lifetime. Remember this scene in the bookshop from Disney’s Beauty and the Beast?

Those days of having read everything available are long gone. This is troublesome for researchers, since they’re expected to review the important and relevant works on their research topics. Today those works number far more than “just a few,” which makes it extremely difficult to keep up with the latest trends in academic and popular fields. Given this information overload, how can we get a better look at what’s out there among the millions of written pieces being created every day? What is the best way to sort through all of this “stuff” and find the gem that will answer our burning questions? A few tools have emerged to help us sift through it all.

Jean-Baptiste Michel’s (et al.) Quantitative Analysis of Culture Using Millions of Digitized Books was a thought-provoking read that reveals the kinds of research Optical Character Recognition (OCR) makes possible. He used the Google Ngram Viewer to track issues and people over time. He chose a subset of over 5 million books digitized by Google Books (roughly 4% of the books ever published as of the article’s submission) and ran a few experiments. In the article he reflected on the many uses of the Ngram Viewer. The one I found most interesting was using it to detect censorship and repression. See the example of the Jewish artist Marc Chagall:

Nazi censorship of the Jewish artist Marc Chagall is evident by comparing the frequency of “Marc Chagall” in English and in German books. In both languages, there is a rapid ascent starting in the late 1910s (when Chagall was in his early 30s). In English, the ascent continues. But in German, the artist’s popularity decreases, reaching a nadir from 1936 to 1944, when his full name appears only once. (In contrast, from 1946 to 1954, “Marc Chagall” appears nearly 100 times in the German corpus.)

For my own experiment, I decided to search for terms that described First Peoples over the twentieth century and into the twenty-first. Drawing from the English corpus, I searched the terms “Native Americans”, “First Nations”, “Aboriginals”, “First Peoples”, and “Natives.” I didn’t include “Indians” because the Ngram Viewer cannot distinguish between the sense I was looking for and people from India, though I am almost certain that “Indian” would have dominated the lexicon for the better part of the century. It turned out that “Natives” dominated for most of the century until the late 1980s, when “Native Americans” shot up and continued to dominate through 2008. Even so, it appeared that in 2008 English-speaking authors were still a bit confused about what to call the first inhabitants of this continent.
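
To make the idea behind a search like this concrete, here is a minimal Python sketch of how a per-year relative frequency for a phrase could be computed. It is only an illustration and not how the Ngram Viewer is actually implemented: the toy corpus is made up, the function names `ngrams` and `phrase_frequency_by_year` are hypothetical, and splitting on whitespace is a simplifying assumption that ignores punctuation.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frequency_by_year(corpus, phrase):
    """Return {year: occurrences of phrase / total n-grams of the same length}.

    corpus maps a year to a list of texts published that year (hypothetical
    structure, chosen for this sketch only).
    """
    target = phrase.lower()
    n = len(target.split())
    result = {}
    for year, texts in sorted(corpus.items()):
        hits = 0
        total = 0
        for text in texts:
            # Naive tokenization: lowercase and split on whitespace.
            grams = ngrams(text.lower().split(), n)
            total += len(grams)
            hits += grams.count(target)
        # Normalize by the size of that year's corpus so years are comparable.
        result[year] = hits / total if total else 0.0
    return result


# Made-up toy corpus, for illustration only.
toy_corpus = {
    1950: ["the natives of the region traded furs",
           "natives and settlers met"],
    1995: ["native americans of the region traded furs",
           "native americans and settlers met"],
}

natives = phrase_frequency_by_year(toy_corpus, "Natives")
native_americans = phrase_frequency_by_year(toy_corpus, "Native Americans")

for year in sorted(toy_corpus):
    print(year, natives[year], native_americans[year])
```

Dividing by the total number of n-grams in each year matters because far more text is published in later years; raw counts alone would make almost every term look like it is rising.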


In a related study, Adrian Veres and John Bohannon used the Ngram Viewer to create the Science Hall of Fame. The person who topped the list will probably surprise you. Or, if you decided not to click the link, the top five were Bertrand Russell, Charles Darwin, Albert Einstein, Lewis Carroll, and Claude Bernard. I personally thought Einstein or Stephen Hawking would top the list, so it goes to show how the data can introduce a different perspective and trigger more questions. The one thing I would love to see added to the Ngram Viewer’s corpus is newspapers. I understand how challenging that would be in terms of copyright, availability, and so on, but their inclusion would make the data much richer and give us a larger window into the trends of the past.

Now, there are some drawbacks to OCR and its search capabilities. Ian Milligan blogged about his experience using digitized newspaper databases in his research. He used the example of the Toronto Star, which digitized 110 years’ worth of microfilmed newspapers in just four months. Some errors are bound to be made in a rush job like that. He noted that microfilm streaks can obscure characters, tilted pages or characters can be missed, and hyphenations across two lines are not accounted for. He implied that we ought to hold content providers accountable by asking how their databases work, but that we should also dig into the databases more and be more thorough with our searching. For example, if we know a certain event occurred on a particular date, it may be worth browsing articles written on and around that date; maybe serendipity will reward us with the “golden article” that we need.
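
The hyphenation problem is easy to see in a few lines of Python. This is a rough sketch under my own assumptions, not a description of how any newspaper database actually works: the sample text is invented, and `rejoin_hyphenated` is a hypothetical helper that simply glues together any word split by a hyphen at a line break.

```python
import re


def rejoin_hyphenated(ocr_text):
    """Join words split by a hyphen at a line break: 'news-\\npaper' -> 'newspaper'.

    Caveat: genuinely hyphenated words that happen to break at the hyphen
    (e.g. 'well-\\nknown') are also merged, so this is only a heuristic.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", ocr_text)


# Invented sample of raw OCR output with a word broken across two lines.
page = "The strike was reported in every news-\npaper in the city."

# A plain keyword search misses the word in the raw OCR text...
print("newspaper" in page)                     # False
# ...but finds it once the line-break hyphen has been removed.
print("newspaper" in rejoin_hyphenated(page))  # True
```

If a database skips a cleanup step like this, a keyword that happened to fall at the end of a column line will never turn up in a search, which is one more reason to browse around a known date rather than trust the search box alone.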

Finally, I also wonder about the long-term impact of scanning the web. For me, the internet is kind of like an all-you-can-eat buffet: everything looks so good that I dare not take a heaping plate of any one thing, since I might miss out on something else. The drawback of this type of search is that I often don’t take the time to really immerse myself in anything. I find the quote I’m looking for, and I move on to the next article. Nicholas Carr wrote an intriguing piece entitled “Is Google Making Us Stupid?” He talks about how his reading habits have changed with the advent of the internet. He confessed that though the internet has been a godsend to him as a writer, he is no longer able to focus on long articles and gets easily distracted. He reminisced, “Once I was a scuba diver in the sea of words. Now I zip along the surface like a guy on a Jet Ski.” Though his article is purely anecdotal, I find I can relate to much of what he’s talking about, and I wonder if the amount of time I’ve logged on the internet since I was 14 is the reason why I dislike reading most books and rarely do it for fun.

For class next week

1. Reflect upon Carr’s article. Can you relate to Carr and his experience? Or is he just being a “worrywart” about what he feels?

2. Try playing with the Google Ngram Viewer. Did you discover any neat trends? Are there any drawbacks or deficiencies to using this tool?


5 thoughts on “What OCR can do for You!”

  1. Thanks for this post! Not only is it informative – but any blog that uses a clip from one of my all-time favourite Disney movies has to be great!

    I agree with you that our reading habits have most likely changed since the days before the Internet. Even when I am reading a good novel I find myself skipping over pages or skimming chapters because I am losing focus. I sometimes even read the end first because I can’t wait and need to know what the conclusion is (and if my favourite characters died). I suppose these are all a part of our generation and how our culture is in desperate need of all information on an immediate basis. But interesting points to think about!

  2. Very interesting post Joel! I loved the Beauty and the Beast clip, a great example for us 80s and 90s kids to relate to; it’s definitely not that simple a society anymore.

    Nicholas Carr’s article (which I focused hard on reading every line of and not skimming) presented some great questions and ideas that I really agree with but had never thought of as an issue of our society today. Probably very similar to everyone else in a university program, we have learned to skim works in order to get the main points for our discussions and essays. We have come to the conclusion that if we look for these main points, maybe read the introduction and conclusion, and get an understanding of what the author is trying to convey, there is no need to examine the full text, especially when time is limited. This is the same for newspaper articles; I find I will read the subheading and the beginning to get the facts, then just skim the rest until I get bored and move on to another article. I enjoy leisurely reading, but in choosing a book I now look at Amazon reviews to determine whether I am truly interested in the subject of the book and whether it is worth my time. I think leisure time is much more important for us today; we want to gain something from our time and not look back on it as wasted.

    Carr’s point that our style of reading has changed, that we now focus on “efficiency and immediacy” when reading a text, is very interesting, especially in a society where we get information from things like Twitter or Facebook status updates. With so much information available to us today on the internet (as you said, Joel, “information overload”), it is hard to focus on reading a full article and not become distracted by an incoming email. Carr is a worrywart, but he isn’t the first, as he makes clear through his examples of Socrates and critics of the Gutenberg Press. This worrywart tendency is good for us: it makes us question our society and realize the issues that we need to examine further.

  3. Joel, I too found the use of the Ngram viewer to track censorship and repression in Jean-Baptiste Michel’s (et al.) “Quantitative Analysis of Culture Using Millions of Digitized Books” very interesting. It’s one thing to see what is there, but another to see what is missing. It changes our perspective and forces us to look at the social context. The Ngram viewer is a gateway for historians to easily find the gaps, holes, and missing links in order to dive into history at the corresponding points.

    Speaking of the change in reading approach, I think it is key to remember our purpose when we read. If we are skimming headlines or Twitter feeds we are looking for something to catch our attention. We are looking for the hook, and if we are not provided with one, we should be critical of the information that is provided to us. In terms of, let’s say, assigned readings or research, skimming is pertinent because of time restrictions. We are generally provided with an abstract for most journal articles or books, at which point we decide if the article will be useful or not without even reading it. I would say the common theme here is the provider knowing their audience, selling their argument, and providing keywords from different perspectives. As we become the provider of information it is key that we take note of how to maintain our public’s interest.
    Great post Joel.

  4. Great blog Joel! I agree with Laura and Liz regarding your choice of video clip.

    I particularly enjoyed the Carr article. Within the first five paragraphs I was thinking to myself, “This guy is overreacting. Humans still have the ability to concentrate on intellectually stimulating and lengthy articles.” I then proceeded to wonder why my fingers were sticky from eating Cheerios, contemplated how thin the walls are in my apartment building, checked my Facebook that was sitting open on a different tab, and then returned to bemoan the length of Carr’s article. I think it would be fascinating to use this article for a study on the attention span of the average adult in the digital age. Do we push through it as a matter of pride to prove him wrong? I think that’s why I read everything.

    I don’t think that Google is making us stupid. What it is doing is rewiring our approaches to research and making us more efficient. Why read an entire article to discover you only needed one paragraph on page 7 when you can search for the few phrases in the search bar and get the same result? However, I think we are still capable of reading the lengthy articles and books but to do so we need to practice and train ourselves to concentrate.

  5. Loved the blog post Joel! I had a lot of thoughts about Carr’s article that were constantly conflicting with one another. On the one hand, I definitely agree- my reading habits have changed. I lose interest easily, skim paragraphs and forget about lengthy journal articles- my eyelids are closed by the end of the second page! But then I had to question whether this is really a bad thing. If I find a really good (and I mean REALLY good) book, I am still capable of immersing myself in it. So maybe this lack of concentration on our own parts really just calls for better writing and more interesting styles! It definitely affects the way information should be presented to students from teachers- lengthy and wordy articles are no longer the way. Maybe more precise and interesting blogs that keep it pretty short are better. Maybe more readings at shorter lengths will do it- after all information is in no short supply.
    But then I thought, how are we ever to really explore any topic in depth or understand the most important and influential arguments of a topic? In this way I totally can see why Carr is being a worrywart, because I am too! As much as we have access to it all, we use very little, and getting an overall picture, let alone a detailed one, becomes very difficult.
    One thing I did notice about my reading style that Carr did not mention is that if I persist and push myself to read books that I find interesting without skimming, or just read a little of a book I like every day, I gradually regain the skill of reading in-depth pieces. Maybe it’s a trait that needs constant maintenance but that your muscles still have memory for.
    Anyways, you along with the article you’ve provided have given me a lot to think about!
