Has Big Data Changed Historical Research?

The use of quantitative methods in historical research is not new. Quantitative research methods were part of economic, political and social history from the 1960s to the 1980s. Modern advances in computing mean that collecting data no longer has to be difficult and time-consuming, so it is easy to see why, in 2012, Danah Boyd and Kate Crawford believed the era of big data had begun. [1] Large data sets, often seen as the defining feature of big data, can now be created and manipulated on desktop computers. These technological advances made big data more accessible, and it naturally found its way into the digital humanities. That does not mean it has been welcomed with open arms into either digital humanities or digital history. Scepticism about the arguments and evidence big data has to offer has been high, and its impact on historical research, and on the types of research question it creates, has been a major bone of contention for digital historians. This essay will examine whether big data changes the nature of historical research, narrowing the types of research question, or whether it opens up other avenues of historical research and allows for the possibility of expanding our historical knowledge.

While quantitative methods have been implemented in historical research, the research method most associated with history is ‘close reading’. Historical research has been dominated by the interrogation of a small number of carefully selected, often text-based, primary sources which can provide valuable insights. Jean-Baptiste Michel et al. argue ‘reading small collections of carefully chosen works enables scholars to make powerful inferences about trends in human thought’ [2]. Yet, as Franco Moretti argues, the trouble with close reading is that it necessarily depends on an extremely small canon and as such cannot provide an insight into the underlying system or social phenomena. The hostility towards big data could be attributed to the challenge it presents to primary sources. The expertise of ‘close reading’ is not only central to humanistic disciplines but ‘to the self identity of lots of humanists themselves’. Historical research methods implementing big data can open up other, larger avenues of research to the historian, but doing so involves stepping away from the text, and it is in this way that big data changes historical research.

Distant reading focuses on quantitative data, removing text from its context. Instead of analysing sources for their literary forms, conventions and semantics, the source is text mined to create statistics and probabilities. Big data offers a different way of reading primary sources, either by reducing the text to units smaller than itself, such as devices, themes or tropes, or by expanding it to include larger units such as genres or systems. This form of research reveals cultural and historical trends which would otherwise have remained hidden. A good example of historical trends missed by historians but found through distant reading can be seen in the n-grams database created by Erez Lieberman Aiden and Jean-Baptiste Michel. The censorship of unknown authors and artists during the Nazi period was uncovered through the analysis of this database, which contains over two trillion words.[3] Thus there are advantages to big data. It has the potential to create a global history by measuring the various changes in human society. And it does this by opening up the bigger research questions historians have always wanted to ask but have been restrained, by a lack of time, funding and resources, from pursuing on such a scale. Boyd and Crawford argue big data creates a radical shift in how we think about research by looking at new objects and new methods of obtaining historical knowledge. [4] For them, big data is not considered in the binary conversation of close versus distant reading but in how it ‘reframes key questions about the constitution of knowledge, processes of research and how we should engage with knowledge’[5]. However, it is important not to over-emphasise distant reading as a revolutionary new practice of digital history; big data has its disadvantages and limitations.
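At its simplest, the kind of n-gram counting behind a resource like the Google Books database reduces a text to the frequencies of its word sequences. The following is a minimal sketch of that idea, not the actual method used by Aiden and Michel; the corpus and function name are invented for illustration.

```python
from collections import Counter

def ngram_frequencies(text, n=2):
    """Count how often each n-gram (a sequence of n consecutive words) appears."""
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(ngrams)

# A toy 'corpus'; a real distant reading would run over millions of digitised pages.
corpus = "the censorship of authors and the censorship of artists"
freqs = ngram_frequencies(corpus, n=2)
```

Plotting such frequencies over time, rather than reading any single page, is what lets a trend such as the suppression of an author's name become visible at scale.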

The first limitation of big data is the reliance on computers for research purposes. Arguably, the majority of historians and history students are tech-savvy enough to have a good handle on standard programmes such as Microsoft Word and Excel. However, big data requires more sophisticated representations of its research findings than the tables, graphs and charts that can be created with a fair knowledge of Microsoft Office programmes. The creation of Extensible Markup Language (XML) and Application Programming Interfaces (APIs), and the gathering and analysing of large amounts of data, is a ‘skill set generally restricted to those with a computational background’[6]. Perhaps it is this lack of computational knowledge which makes digital historians sceptical about the benefits big data has to offer digital history. Boyd and Crawford argue humanists use digital resources all the time but, naive as to how the resources work, are unaware of the potential to get more out of them. Instead, humanists arrange resources in ways that make them far harder to use. But big data is not just limited by the computer literacy of the digital historian; the data sets themselves carry their own limitations.

It is important to interrogate textual sources before carrying out qualitative research; similarly, it is important to ask critical questions of big data before quantitative research is carried out. Regardless of the size of the data set, its properties and limitations should be understood and the biases of the data should be determined before carrying out quantitative research.[7] Furthermore, historians should assess the limitations of the questions they can ask of big data and determine which interpretations are the most appropriate.[8] Boyd and Crawford argue big data is problematic in that it enables patterns to be read in the results where they do not exist.[9] But is this not also a problem faced with textual sources? While the research methods differ, both close and distant reading suffer from the same problems of using primary sources. Both require the retention of context. When a word, sentence, phrase or paragraph is removed from the entire document, it loses its context. So too does quantitative data lose context when it is interpreted at scale, and context is ‘even harder to maintain when data [is] reduced to fit into a model’[10]. But this is not the only similarity between close and distant reading. The process of generalisation in close reading and the process of categorisation in distant reading both rely on historians’ judgements. The fear of big data changing historical research could be a manifestation of the fear of digital history being ‘no more than a colonization [sic] of the humanities by the sciences’[11].

Big data may apply some of the quantitative methods of computer science, but it does not apply objectivity as readily as its connection with science would imply. Big data requires an interpretive framework and is thus subjective. Big data’s tools of representation, summary tables, charts, line graphs and other modes of visualisation, all require interpretation themselves. Trevor Owens argues data can be treated as constructed artefacts, as texts and as processed information, all of which can be interpreted. Data sets are created by historians who make choices about what data to collect and how to encode it. Therefore, as constructed artefacts, data has the same characteristics as textual sources created by authors, so consideration of its author’s intended audience and purpose for the data should be undertaken. Even at the visualisation stage, data is not evidence but new artefacts, objects and texts that are generated and which can also be read and explored. Ben Schmidt supports Owens, arguing the ‘graphs derived from big data require nuanced, contextual interpretation [and] . . . give a new source to interpret’. While the objects of interpretation are not manuscripts stored at archives, or digitally for that matter, these new artefacts created by big data are still reliant on literary canons. Big data, for the moment at least, is mainly created from textual sources. Although these textual sources are not studied in-depth but are mined for word frequency, use of semantics or other such statistical purposes, the interpretations created from them are still reliant on textual information. Tim Hitchcock argues this is why distant reading, to a certain extent, does not provide new insights to historians. Research questions are still being determined by literary canons and still resemble those asked through older technology. This is why big data does not always change the research questions historians ask.
Retention of old concepts and methods can lead historians to impose limitations on big data. By reading data in terms of interpretations already amassed, the value of the information the data has to offer is lost. Heuser and Le-Khac have recognised the problems of such impositions on big data. They argue there is a tendency to throw away data that does not fit established concepts, which potentially damages historical knowledge. Big data needs to be more than a validation of existing interpretations, otherwise ‘quantitative methods [will] never produce new knowledge’[12].

Historians too concerned with whether big data changes historical research miss the opportunities provided by big data to increase their historical knowledge. Instead of arguing over which research method should be employed in digital history, historians could employ a mixture of both close and distant reading to provide both the depth and breadth they have been striving for. Owens argues ‘big data is an opportunity . . . to bring the skills . . . honed in the close reading of texts . . . into service for this new species of text’. By employing close reading methods on artefacts created by distant reading methods, historians receive both the in-depth and specific human experience and the broader trends of society that experience sits within. Big data may be a new research method but the sources it creates can strengthen pre-existing methods.

Quantitative methods may not be new to historical research, but technological advances have allowed quantitative research to penetrate history to such an extent that it has sparked debates over its effects on historical research. This debate has centred on the use of close and distant reading, the advantages and disadvantages of both methods of research, and the contributions each makes to historical knowledge. The main concern of historians is how big data changes research questions. While big data offers the opportunity to explore larger, cultural research questions, its reliance on narrow literary canons allows big data to create sources which in turn can be closely read and thus used in the same research questions we ask of non-digital primary sources. Big data has the potential not only to inform historians of new trends, both globally and socially, but also to enhance pre-existing research questions.



[1] Danah Boyd & Kate Crawford, ‘Critical Questions for Big Data’, Information, Communication & Society, Vol. 15, No. 5 (2012), pp. 662-679

[2] Jean-Baptiste Michel et al., ‘Quantitative Analysis of Culture Using Millions of Digitized Books’, Science, Vol. 331 (2011), pp. 176-182

[3] John Bohannon, ‘Google Opens Books to New Cultural Studies’, Science, Vol. 330 (2010), p. 1600

[4] Boyd & Crawford, ‘Critical’, p. 665

[5] Boyd & Crawford, ‘Critical’, p. 665

[6] Boyd & Crawford, ‘Critical’, p. 674

[7] Boyd & Crawford, ‘Critical’, p. 668

[8] Boyd & Crawford, ‘Critical’, p. 670

[9] Boyd & Crawford, ‘Critical’, p. 668

[10] Boyd & Crawford, ‘Critical’, p. 671

[11] Ryan Heuser & Long Le-Khac, ‘Learning to Read Data: Bringing out the Humanistic in the Digital Humanities’, Victorian Studies, Vol. 54, No. 1 (2011), pp. 79-86

[12] Heuser & Le-Khac, ‘Learning’, p. 81


Hyperlink Bibliography

Hitchcock, Tim, ‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’, Historyonics, http://historyonics.blogspot.co.uk/2013/12/big-data-for-dead-people-digital.html; consulted 10th April 2014

Manning, Patrick, ‘Big Data in History’, We Think History, http://wethink.hypotheses.org/1485; consulted 10th April 2014

Moretti, Franco, ‘Conjectures on World Literature’, New Left Review, Vol. 1 (2000), http://newleftreview.org/II/1/franco-moretti-conjectures-on-world-literature; consulted 12th April 2014

Mullen, Abby, ‘”Big” Data for Military Historians’, Canadian Military History, http://canadianmilitaryhistory.ca/big-data-for-military-historians-by-abby-mullen/; consulted 10th April 2014

Owens, Trevor, ‘Defining Data for Humanists: Text, Artifact, Information or Evidence?’, Journal of Digital Humanities, Vol. 1, No. 1 (2011), http://journalofdigitalhumanities.org/1-1/defining-data-for-humanists-by-trevor-owens/; consulted 11th April 2014

Schmidt, Ben, ‘Assisted Reading vs. Data Mining’, Sapping Attention, http://sappingattention.blogspot.co.uk/2010/12/assisted-reading-vs-data-mining.html; consulted 11th April 2014

Schöch, Christof, ‘Big? Smart? Clean? Messy? Data in the Humanities’, Journal of Digital Humanities, Vol. 2, No. 3 (2013), http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/; consulted 12th April 2014

Digital History

Critique of the Clergy of the Church of England Database

The Clergy of the Church of England Database (CCEd) is a relational database (a database which stores information in multiple, linked tables) which connects primary sources relating to the clerical careers of the Church of England between 1540 and 1835. The creators of the database feel its contents are of use to the general public and genealogists, but that it will be best utilised by political and social historians wanting to trace individual career paths, understand the structure of the Church of England or determine patterns in clerical migration.
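The record-linkage idea behind a relational database like the CCEd can be sketched in miniature: one table of people, one of appointments, and a join that reassembles a career. This is an illustrative toy, not the CCEd's actual schema; the table names, column names and dates below are invented for the example.

```python
import sqlite3

# Hypothetical two-table schema: people in one table, events in another,
# linked by a foreign key. The real CCEd schema will differ.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE clergy (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE appointments (
                   id INTEGER PRIMARY KEY,
                   clergy_id INTEGER REFERENCES clergy(id),
                   office TEXT,
                   year INTEGER)""")

cur.execute("INSERT INTO clergy VALUES (1, 'William Paley')")
cur.executemany("INSERT INTO appointments VALUES (?, ?, ?, ?)",
                [(1, 1, "Curate", 1766),          # illustrative dates only
                 (2, 1, "Archdeacon of Carlisle", 1782)])

# Joining the tables in date order reconstructs a simple 'career narrative'.
cur.execute("""SELECT c.name, a.office, a.year
               FROM clergy c JOIN appointments a ON a.clergy_id = c.id
               ORDER BY a.year""")
career = cur.fetchall()
```

The join is what saves the historian the work of plotting a career by hand: each record is entered once, and the narrative is assembled by the query.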

Homepage of the CCEd

The presentation of the database is simple and clear. The layout is minimal and does not distract the user with garish or numerous images.

Homepage of CCEd Evaluated for Accessibility and Design

Evaluation of Web Design for Accessibility

One of its best features is the ‘How to Use the Database’ section. However, navigation is a bit cumbersome: the site often suggests using another section but does not link to it.

For the CCEd, a web database is the most appropriate tool, permitting quick and complicated queries to be carried out from the web page. The ease of updating from any computer, and the ability to link records, allow the project to create career narratives for an in-depth analysis of the sources. These narratives save historians the time and hassle of trying to plot the careers of clergymen themselves and can quickly show them the major events taking place in clerical careers.

Career Narrative of William Paley

Example of Career Narrative

This simple database with limited visualisations would be relatively cheap to create and maintain compared to high-end, technically supported databases, yet it is complex enough to hold a large amount of data (the CCEd contains 1,250,000 individual records).

A big data project like the CCEd allows for both close reading and distant reading. Patterns and trends in the structure of the church and in clerical migration can be ascertained through distant reading. However, we can lose the human element by looking at big patterns; the individual experience can challenge the overarching trends. Engagement and imagination are an essential part of a historian’s interaction with primary sources, and close reading can provide such interaction. However, direct engagement with the primary sources is not facilitated by the database.

Digitisation Methodologies

The data capture method of textual input, although time-consuming, increases the accuracy of the information captured, especially when compared to other methods such as Optical Character Recognition (OCR), which struggles with early manuscripts and handwriting (it is so renowned for its mistakes that there is even a Twitter account satirising it!). However, the selection of only very specific information from the primary sources raises the question of whether other valuable information has been missed.

Old Typed Print Scanned by OCR with Terrible Replication

Example of OCR Going Wrong. Image from A Report and Review of the Scanning Claim by the Editor at janelead.org (Link via Image)

It is understandable for a project of this magnitude to want to contain only the most essential information but a low resolution page image of the primary source would help the historian feel connected to the primary source without taking up too much storage space.

Although a page image cannot be searched or manipulated, for the purpose of the database it would not have to be. It could simply function as a standalone feature, adding another layer of understanding to the interpretation of the sources. The image quality would not have to be high either, as long as the source remained readable when zoomed in.

Old handwritten register scanned into a digital format and presented as a page image

Page Image of an Old Register. Image from the Wellcome Library, who retains the copyrights.

Instead, the user is presented with a ‘screen format’ of the records used, giving no feel for the primary source and certainly no engagement with it.

Screen Format Version of Primary Sources. Information is presented in typed up tables.

Example of Screen Format of Primary Sources

To facilitate the dissemination of work interpreting the records in the database, the website has its own online journal. This is where the website uses XML to facilitate the searching of articles, although it does not provide any transparency in its use of this tool. XML is used only sparingly because the database’s own search engine does not carry the time-consuming disadvantage of having to create the building blocks XML requires.
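The ‘building blocks’ XML requires are the tags that must be authored around every piece of content before it can be searched by field. A minimal sketch of how markup enables article searching follows; the element names and article titles are invented for illustration, and this is not the journal's actual markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical article metadata: each field must be explicitly tagged
# before a search engine can target it.
xml_source = """
<journal>
  <article><title>Clerical Migration in Kent</title><author>A. Smith</author></article>
  <article><title>Patronage and the Parish</title><author>B. Jones</author></article>
</journal>
"""

root = ET.fromstring(xml_source)
# The markup lets a query address a specific field (title) rather than raw text.
matches = [a.findtext("title") for a in root.findall("article")
           if "Migration" in a.findtext("title")]
```

Authoring those tags for every record is the up-front cost; once paid, searches can distinguish a title from an author in a way a plain-text index cannot.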

Overall, the database renders the primary sources redundant. The pre-selection of sources and of the information required from them, the presentation of their data in field format, the lack of images of the primary sources and the methods of analysis (record linkage and career narratives) seem to place an emphasis on the database as a source of historical information rather than on the primary sources.

For historians who like to read the primary sources, this extraction of information must go against the grain. Could other contextual information have been contained within the primary sources?

Bibliography of Links

‘Advanced’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/how-to-use-the-database/advanced/; consulted 1st March 2014

‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’, Historyonics – Tim Hitchcock’s Blog, http://historyonics.blogspot.co.uk/2013/12/big-data-for-dead-people-digital.html; consulted 1st March 2014

‘Bibliography of sources used in the Database’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/reference/bibliography-of-sources-used-in-the-database/; consulted 28th February 2014

‘Close Reading’, University of Warwick, http://www2.warwick.ac.uk/fac/arts/english/currentstudents/undergraduate/modules/fulllist/second/en227/closereading/; consulted 1st March 2014

Cohen, Daniel J and Rosenzweig, Roy, ‘Appendix – Database’, Digital History, http://chnm.gmu.edu/digitalhistory/appendix/1.php; consulted 28th February 2014

Cohen, Daniel J and Rosenzweig, Roy, ‘Becoming Digital – Digitizing Text: What Do You Want to Provide?’, Digital History, http://chnm.gmu.edu/digitalhistory/digitizing/2.php; consulted 1st March 2014

‘Contents of Database’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/about/about-the-database/content-of-database/; consulted 1st March 2014

‘Data Capture’, University of Oxford, http://digital.humanities.ox.ac.uk/methods/datacapture.aspx; consulted 1st March 2014

‘How to Use the Database’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/how-to-use-the-database/; consulted 28th February 2014

‘Information for Genealogists’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/information-for-genealogists/; consulted 28th February 2014

‘Information for General Public’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/information-for-general-pubilc/; consulted 28th February 2014

‘Interpreting Career Narratives’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/how-to-use-the-database/interpreting-career-narratives/; consulted 1st March 2014

‘Introduction to XML’, W3Schools, http://www.w3schools.com/xml/xml_whatis.asp; consulted 1st March 2014

‘Journal’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/journal/; consulted 1st March 2014

‘OCR (Optical Character Recognition)’, TechTarget, http://searchcontentmanagement.techtarget.com/definition/OCR-optical-character-recognition; consulted 1st March 2014

‘OCR Fail’, Twitter, https://twitter.com/OCRfail; consulted 1st March 2014

Schulz, Kathryn, ‘What is Distant Reading?’, The New York Times, http://www.nytimes.com/2011/06/26/books/review/the-mechanic-muse-what-is-distant-reading.html?pagewanted=all&_r=2&; consulted 1st March 2014

‘Welcome to the CCEd’, Clergy of the Church of England Database, http://theclergydatabase.org.uk/; consulted 28th February 2014

‘What are Relational Databases?’, How Stuff Works, http://computer.howstuffworks.com/question599.html; consulted 28th February 2014

‘When OCR Goes Bad: Google’s Ngram Viewer & The F-Word’, Search Engine Land, http://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181; consulted 1st March 2014