What OCR Can Do for You!

Today is often referred to as an era of “information overload.”  There is more information available than anyone could read in a lifetime.  Remember this scene in the bookshop from Disney’s Beauty and the Beast?

Those days of having read everything available are long gone.  This is somewhat troublesome for researchers, since they’re expected to review the important and relevant works on their research topics.  Today, those works number far more than “just a few,” making it extremely difficult to keep up with the latest trends in various academic and popular fields.  With the issue of information overload, how can we get a better look at what’s out there among the millions of written pieces being created every day?  What is the optimal way to sort through all of this “stuff” and find the gem that will answer our burning questions?  A few tools have emerged to help us sift through all of this information.

Jean-Baptiste Michel et al.’s article Quantitative Analysis of Culture Using Millions of Digitized Books was a thought-provoking read, revealing the research possibilities that Optical Character Recognition (OCR) opens up.  Michel and his co-authors used the Google Ngram Viewer to track issues and people over time.  They chose a subset of over 5 million books digitized by Google Books–roughly 4% of the books ever published as of the article’s submission–and ran a few experiments.  In the article they reflected upon the many uses of the Ngram Viewer.  One of the topics that I found most interesting was that the Ngram Viewer can be used to detect censorship and repression.  See the example of Jewish artist Marc Chagall:

Nazi censorship of the Jewish artist Marc Chagall is evident by comparing the frequency of “Marc Chagall” in English and in German books. In both languages, there is a rapid ascent starting in the late 1910s (when Chagall was in his early 30s). In English, the ascent continues. But in German, the artist’s popularity decreases, reaching a nadir from 1936 to 1944, when his full name appears only once. (In contrast, from 1946 to 1954, “Marc Chagall” appears nearly 100 times in the German corpus.)

For my own experiment, I decided to search terms that described First Peoples over the twentieth century and into the twenty-first.  Drawing from the English corpus, I searched the terms “Native Americans”, “First Nations”, “Aboriginals”, “First Peoples”, and “Natives.”  I didn’t include “Indians” because the Ngram Viewer would not be able to distinguish between what I was looking for and someone from India, though I am almost certain that “Indian” would dominate the lexicon for the better part of the century.  It turned out that the term “Natives” dominated for most of the century until the late 1980s, when the term “Native Americans” shot up and continued to dominate through 2008.  Nevertheless, it appeared that in 2008, English-speaking authors were still a bit confused as to what to call the first inhabitants of this continent.
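For anyone curious about what sits behind a chart like this, here is a minimal sketch of the idea as I understand it: for each year, count how often a phrase appears and divide by the total number of same-length phrases printed that year.  The function names and the tiny corpus below are my own inventions for illustration; this is not Google's code, and the sentences are not real data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(corpus_by_year, phrase):
    """For each year, the share of that year's n-grams that match the phrase."""
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in sorted(corpus_by_year.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


# Invented toy corpus (hypothetical sentences, NOT real book data).
toy_corpus = {
    1950: ["the natives met the traders", "natives and settlers traded"],
    2000: ["native americans met the traders", "the native americans traded"],
}

freq = relative_frequency(toy_corpus, "native americans")
for year, share in freq.items():
    print(year, round(share, 3))
```

Dividing by the yearly total matters because far more books are published in recent years than in earlier ones, so raw counts alone would make almost every term look like it is rising.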


In a related study, Adrian Veres and John Bohannon used the Ngram Viewer to create the Science Hall of Fame.  The person who topped the list will probably surprise you.  Or, if you decided not to click the link, the top five were Bertrand Russell, Charles Darwin, Albert Einstein, Lewis Carroll, and Claude Bernard.  I personally thought Einstein or Stephen Hawking would top the list, so it goes to show how the data can introduce a different perspective, and how it can trigger more questions.  The one thing I would love to see added to the Ngram Viewer’s corpus is newspapers.  I understand how challenging that would be in terms of copyright, availability, and so on, but their inclusion would make the data so much richer, and give us a larger window into the trends of the past.

Now, there are some drawbacks to OCR and its search capabilities.  Ian Milligan blogged about his experience using digitized newspaper databases in his research.  He used the example of the Toronto Star, which digitized 110 years’ worth of microfilmed newspapers in just four months.  Some errors are bound to be made in a rush job like that.  He noted that microfilm streaks can obscure characters, tilted pages or characters can be missed, and hyphenations across two lines are not accounted for.  He implied that we ought to hold content providers accountable by asking questions about how their databases work, but that we should also dig into the databases more and be more thorough with our searching.  For example, if we know a certain event occurred on a particular date, it may be worth just browsing articles written on and around that date.  Maybe serendipity will reward us with the “golden article” that we need.
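The hyphenation problem is easy to see in a small example.  The sketch below is my own illustration, not how any particular newspaper database works: it takes OCR text in which a word was split across two lines and rejoins the halves so that a keyword search can find the word.  The function name and the sample sentence are made up.

```python
import re


def rejoin_hyphenated(text):
    """Join words that were split by a hyphen at the end of a line."""
    # "news-\npaper" becomes "newspaper"; other line breaks are left alone.
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)


# Made-up OCR output with a word broken across two lines.
raw = "The story ran in the morning news-\npaper on Friday."
fixed = rejoin_hyphenated(raw)

print("newspaper" in raw)    # the search misses the split word
print("newspaper" in fixed)  # the search finds it after rejoining
```

One caveat: a simple rule like this also glues together genuinely hyphenated words that happen to fall at a line break (“well-” and “known” would become “wellknown”), so a more careful version would check the result against a dictionary.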

Finally, I also wonder about the long-term impact of scanning the web.  For me, the internet is kind of like an all-you-can-eat buffet: everything looks so good that I dare not take a heaping plate of any one thing, since I might miss out on something else.  The drawback of this type of search is that I often don’t take the time to really immerse myself in anything.  I find the quote I’m looking for, and I move on to the next article.  Nicholas Carr wrote an intriguing piece entitled “Is Google Making Us Stupid?”  He talks about how his reading habits have changed with the advent of the internet.  He confessed that though the internet has been a godsend to him as a writer, he is no longer able to focus on long articles and gets easily distracted.  He reminisced, “Once I was a scuba diver in the sea of words. Now I zip along the surface like a guy on a Jet Ski.”  Though his article is purely anecdotal, I find I can relate to much of what he’s talking about, and I wonder if the amount of time I’ve logged on the internet since I was 14 is the reason why I dislike reading most books, and rarely do it for fun.

For class next week

1. Reflect upon Carr’s article.  Can you relate to Carr and his experience?  Or is he just being a “worrywart”?

2. Try playing with the Google Ngram Viewer.  Could you discover any neat trends? Any drawbacks or deficiencies to using this tool?

Who’s the Greatest of All Time?

Being as big a basketball fan as I am, I knew it wouldn’t be too long until I started blogging about historical basketball topics on this page. Now, I realize this is a bit of a niche topic, and in fact, not many academic historians discuss sports history, except when it has a direct impact on society at large, or serves as some kind of microcosm of what was happening in the world (e.g. the 1972 hockey Summit Series between Canada and the Soviet Union).  The history of sport is usually left to journalists.  Today, I’m going to break the rules because I need to let off some steam and this is the only forum I know of where I can discuss it.

There is a lot of debate among basketball pundits these days as to who the greatest basketball player of all time is.  This has especially flared up given how LeBron James of the Miami Heat has been having his way with opponents these past few seasons.  So where does LeBron rank with the greats?  Can he be included in the same league as Kareem Abdul-Jabbar (the NBA’s all-time leading scorer)? How about Wilt Chamberlain (scored 100 points in one game)? What about Michael Jordan (undefeated in six NBA Finals appearances)?  To answer this question, some people would go to the statistical records, while others would look at how many awards or playoff games each player won.  One journalist-turned-NBA-executive, John Hollinger, developed the player efficiency rating (PER), which takes various individual statistics into account and boils them down to a single number to make comparing players easier.

Now, I’m a casual basketball player, and I play on a fairly regular basis.  When I play against someone, I can tell whether I’m better than them (or not) by watching them play.  My dad still swears, just from watching them play, that Bill Russell was a superior basketball player to Wilt Chamberlain despite his lower statistics.  I think there’s some value to the “eye test.”  Now, I’ve been watching basketball since I was 9 years old (1994), and it’s my personal opinion that the two best basketball players since 1994 were Michael Jordan and Shaquille O’Neal (explicit lyrics in O’Neal link).  The reason is that when I watched Michael and Shaq play, they seemed unbeatable.  They inspired fear in the teams they faced and in those teams’ fans alike.  As much as I wanted my team to win, when they faced Shaq or Jordan, I was always resigned to the fact that my team would lose–and they did.  I didn’t need to see statistics, awards, or any other data to know how good Shaq and Jordan were.  I saw it on the court.

For what it’s worth, I thought I’d throw in my two cents and tell you who I think the best players of all time were, based on what I have seen.  I do apologize that my opinions are biased toward the post-1994 era, and toward what’s available on YouTube of legendary players from before that time.  Anyway, here’s the list…

Centers: Shaquille O’Neal, Hakeem Olajuwon

Power Forwards: Tim Duncan

Small Forwards: LeBron James, Larry Bird, Scottie Pippen

Shooting Guards: Michael Jordan, Pete Maravich, Kobe Bryant

Point Guards: Magic Johnson, Chris Paul, Gary Payton

The history angle to this is that the NBA severely lacks video footage of league games prior to 1980.  There have been a lot of great moments in NBA history of which video footage doesn’t exist (e.g. Wilt Chamberlain’s 100-point game).  I’m glad that today most archives–especially in television–have the sense to preserve what they are creating right now for posterity.  I suppose my wish is that any NBA video footage that does exist could be digitized and made available online so that we can have a better sense of how to settle these debates.  They may seem trivial to some, but for people whose lives revolve around sports, this footage could be incredibly valuable, and it might even inspire others to start playing the game.

In Support of Wikipedia

I recently read Roy Rosenzweig’s article Can History be Open Source? Wikipedia and the Future of the Past. Though I had no issues with his article, it brought my mind back to a project I did two years ago about Wikipedia and Encyclopaedia Britannica.  This article–along with pondering my project–prompted me to think about what Wikipedia’s role should be among historical scholarly and non-scholarly resources.

Wikipedia receives a wide range of mostly-deserved criticism.  Some articles are woefully written, and others give more coverage to less important issues than to the ones that matter. Some Wikipedians simply write things that are not true, or that are taken out of context, perpetuating rumours and half-truths.  All of this criticism is justified, but I believe we allow it to overshadow the good that Wikipedia does.  Wikipedia is good at correcting itself, and seems to get better every day through the efforts of volunteer contributors and administrators (check out the articles for deletion forums). Furthermore, I believe that we give some of the seminal reference works (e.g. Encyclopaedia Britannica) a little too much credit for what they do, especially compared to Wikipedia.  Let’s remember that encyclopaedias were never meant to be authoritative sources of information.  A friend of mine, a seasoned academic librarian, put it this way: “you go to an encyclopaedia if you’re feeling lazy, and just want a few tidbits of information, quick and dirty.”  To be honest, I believe Wikipedia serves that purpose quite well.  Moreover, its articles are longer and typically contain more detail than the average traditional encyclopaedia entry.

For the purposes of historical research–especially at the high school and undergraduate levels–I think Wikipedia does well as a starting point for research, especially on a topic that the researcher may not be familiar with.  Of course, the student would never use a Wikipedia article as fuel to support his/her argument, but if it is used to get to know the key names, dates, and statistics of a particular event, I think it’s a great way to start research.  Many Wikipedia articles also come with some impressive bibliographies with reputable sources.  Take Wikipedia’s article “Attack on Pearl Harbor”: it had 122 endnotes as of September 13, 2013, and comes with a list of some reputable sources.  The authors of the sources include some of the world’s experts on World War II like Sir Martin Gilbert, Admiral Samuel Eliot Morison, and Dr. Gordon W. Prange.  The resources also include works published by many reputable academic presses including Oxford, Cornell, and Taylor & Francis.  A professor seeing those types of sources used for a research paper would not be disappointed.

From a Public History standpoint, in our last Digital History class we talked a lot about how we were excited about the possibilities of making historical information available to a wider audience at a low or no cost.  Though Wikipedia certainly has its flaws, so far I don’t think anyone has come up with a better way to disseminate information freely than Jimmy Wales through Wikipedia.  On his donation page, he stated that, “Wikipedia is something special. It is like a library or a public park. It is like a temple for the mind. It is a place we can all go to think, to learn, to share our knowledge with others.”  Though it’s a bit of a utopian idea, I appreciate Wales’s sentiment to make knowledge freely available to anyone who wants it.