Has Big Data Changed Historical Research?

The use of quantitative methods in historical research is not new. Quantitative research methods were part of economic, political and social history from the 1960s to the 1980s. Modern advances in computing mean that collecting data no longer has to be difficult and time-consuming, so it is easy to see why, in 2012, Danah Boyd and Kate Crawford believed the era of big data had begun. [1] Large data sets, often seen as the defining feature of big data, can be created and manipulated on desktop computers. These technological advances made big data more accessible, and it was naturally introduced into digital humanities. That does not mean it has been welcomed with open arms into either digital humanities or digital history. Scepticism about the arguments or evidence big data has to offer has been high, and the impact it has on historical research, and on the types of research question it creates, has been a major bone of contention for digital historians. This essay will examine whether big data changes the nature of historical research, narrowing the types of research questions, or whether it opens up other avenues of historical research and allows for the possibility of expanding our historical knowledge.

While quantitative methods have been implemented in historical research, the research method most associated with history is ‘close reading’. Historical research has been dominated by the interrogation of a small number of carefully selected, often text-based, primary sources which can provide valuable insights. Jean-Baptiste Michel et al. argue that ‘reading small collections of carefully chosen works enables scholars to make powerful inferences about trends in human thought’ [2]. Yet, as Franco Moretti argues, the trouble with close reading is that it necessarily depends on an extremely small canon and as such cannot provide an insight into the underlying system or social phenomena. The hostility towards big data could be attributed to the challenge it presents to primary sources. The expertise of ‘close reading’ is central not only to humanistic disciplines but ‘to the self identity of lots of humanists themselves’. Historical research methods implementing big data can open up other, larger avenues of research to the historian, but they involve stepping away from the text, and it is in this way that big data changes historical research.

Distant reading focuses on quantitative data, removing text from its context. Instead of analysing sources for their literary forms, conventions and semantics, the source is text mined to create statistics and probabilities. Big data offers a different way of reading primary sources, either by reducing the text to units smaller than itself, such as devices, themes or tropes, or by expanding it to include larger units such as genres or systems. This form of research reveals cultural and historical trends which would otherwise have remained hidden. A good example of historical trends missed by historians but found through distant reading can be seen in the n-grams database created by Erez Lieberman Aiden and Jean-Baptiste Michel. The censorship of unknown authors and artists during the Nazi period was found through analysis of the database, which contains over two trillion words.[3] Thus there are advantages to big data. It has the potential to create a global history by measuring the various changes in human society, and it does this by opening up the bigger research questions historians have always wanted to ask but have been restrained from asking by a lack of time, funding and resources to embark on a task of such magnitude. Boyd and Crawford argue big data creates a radical shift in how we think about research by looking at new objects and new methods of obtaining historical knowledge. [4] For them, big data is not considered in the binary conversation of close versus distant reading; rather, big data ‘reframes key questions about the constitution of knowledge, processes of research and how we should engage with knowledge’[5]. However, it is important not to over-emphasise distant reading as a revolutionary new practice of digital history; big data has its own disadvantages and limitations.
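The kind of word-frequency counting that underlies the n-grams database can be illustrated with a short sketch. This is a minimal, hypothetical example (the toy corpus and function names are invented for illustration), not the actual Google Books pipeline:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A toy corpus standing in for millions of digitised books.
corpus = "the history of the book is the history of reading"
tokens = corpus.split()

# Count 2-grams: frequencies like these, tracked over time,
# are what lets distant reading surface trends a close reader would miss.
counts = Counter(ngrams(tokens, 2))
print(counts[("the", "history")])  # the phrase occurs twice in this corpus
```

Scaled up to a corpus of trillions of words and broken down by year of publication, such counts are the raw material from which patterns like the Nazi-era censorship findings were drawn.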

The first limitation of big data is the reliance on computers for research purposes. Arguably, the majority of historians and history students are tech-savvy enough to have a good handle on standard programs such as Microsoft Word and Excel. However, big data requires more sophisticated representations of its research findings than the tables, graphs and charts created with a fair knowledge of Microsoft Office. Working with Extensible Markup Language (XML) and Application Programming Interfaces (APIs), and gathering and analysing large amounts of data, is a ‘skill set generally restricted to those with a computational background’[6]. Perhaps it is this lack of computational knowledge which makes digital historians sceptical about the benefits big data has to offer digital history. Boyd and Crawford argue humanists use digital resources all the time but, because of their naivety about how the resources work, are unaware of the potential to get more out of them. Instead, humanists arrange resources in ways that make it far harder for them to be used. But big data is not just limited by the computer literacy of the digital historian; the data sets themselves carry their own limitations.

It is important to interrogate textual sources before carrying out qualitative research; similarly, it is important to ask critical questions of big data before quantitative research is carried out. Regardless of the size of the data set, its properties and limitations should be understood, and the biases of the data determined, before carrying out quantitative research.[7] Furthermore, historians should assess the limitations of the questions they can ask of big data and determine which interpretations are the most appropriate.[8] Boyd and Crawford argue big data is problematic in that it enables patterns to be read into the results where they do not exist.[9] But is this not also a problem faced with textual sources? While the research methods differ, both close and distant reading suffer from the same problems of using primary sources. Both require the retention of context. When a word, sentence, phrase or paragraph is removed from the entire document, it loses its context. So too does quantitative data lose context when data is interpreted at scale, and it is ‘even harder to maintain when data [is] reduced to fit into a model’[10]. But this is not the only similarity between close and distant reading. The process of generalisation in close reading and the process of categorisation in distant reading both rely on historians’ judgements. The fear of big data changing historical research could be a manifestation of the fear of digital history being ‘no more than a colonization [sic] of the humanities by the sciences’[11].

Big data may apply some of the quantitative methods of computer science, but it does not apply objectivity as readily as its connection with science would imply. Big data requires an interpretive framework and is thus subjective. Its tools of representation (summary tables, charts, line graphs and other modes of visualisation) all require interpretation themselves. Trevor Owens argues data can be constructed artefacts, texts and processed information, all of which can be interpreted. Data sets are created by historians, who make choices about what data to collect and how to encode it. Therefore, as constructed artefacts, data has the same characteristics as textual sources created by authors, so the author’s intended audience and purpose for the data should be considered. Even at the visualisation stage, data is not evidence but new artefacts, objects and texts that are generated and can also be read and explored. Ben Schmidt supports Owens, arguing that the ‘graphs derived from big data require nuanced, contextual interpretation [and] . . . give a new source to interpret’. While the objects of interpretation are not manuscripts stored at archives, or digitally for that matter, these new artefacts created by big data are still reliant on literary canons. Big data, for the moment at least, is mainly created from textual sources. Although these textual sources are not studied in depth but are mined for word frequency, use of semantics or other such statistical purposes, the interpretations created from them are still reliant on textual information. Tim Hitchcock argues this is why distant reading, to a certain extent, does not provide new insights to historians. Research questions are still being determined by literary canons and still resemble those asked through older technology. This is why big data does not always change the research questions historians ask.
Retention of old concepts and methods can lead historians to impose limitations on big data. By reading data in terms of the interpretations already amassed, the value of the information the data has to offer is lost. Heuser and Le-Khac have recognised the problems of such impositions on big data. They argue there is a tendency to throw away data that does not fit established concepts, which potentially damages historical knowledge. Big data needs to be more than validation of existing interpretations, otherwise ‘quantitative methods [will] never produce new knowledge’[12].

Historians too concerned with whether big data changes historical research miss the opportunities provided by big data to increase their historical knowledge. Instead of arguing over which research method should be employed in digital history, historians could employ a mixture of both close and distant reading to provide both the depth and breadth they have been striving for. Owens argues ‘big data is an opportunity . . . to bring the skills . . . honed in the close reading of texts . . . into service for this new species of text’. By employing close reading methods on artefacts created by distant reading methods, historians receive both the in-depth and specific human experience and the broader trends of society that experience sits within. Big data may be a new research method but the sources it creates can strengthen pre-existing methods.

Quantitative methods may not be new to historical research, but technological advances have allowed quantitative research to penetrate history to such an extent that it has sparked debates over the effects on historical research. This debate has centred on the use of close and distant reading, the advantages and disadvantages of both methods of research and the contributions each makes to historical knowledge. The main concern of historians is how big data changes research questions. While big data offers the opportunity to explore larger, cultural research questions, its reliance on narrow literary canons means it creates sources which can in turn be closely read and thus used in the same research questions we ask of non-digital primary sources. Big data has the potential not only to inform historians of new trends, both globally and socially, but also to enhance pre-existing research questions.



[1] Danah Boyd & Kate Crawford, ‘Critical Questions for Big Data’, Information, Communication & Society, Vol. 15, No. 5 (2012), pp. 662-679

[2] Jean-Baptiste Michel et al., ‘Quantitative Analysis of Culture Using Millions of Digitized Books’, Science, Vol. 331 (2011), pp. 176-182

[3] John Bohannon, ‘Google Opens Books to New Cultural Studies’, Science, Vol. 330 (2010), p. 1600

[4] Boyd & Crawford, ‘Critical’, p. 665

[5] Boyd & Crawford, ‘Critical’, p. 665

[6] Boyd & Crawford, ‘Critical’, p. 674

[7] Boyd & Crawford, ‘Critical’, p. 668

[8] Boyd & Crawford, ‘Critical’, p. 670

[9] Boyd & Crawford, ‘Critical’, p. 668

[10] Boyd & Crawford, ‘Critical’, p. 671

[11] Ryan Heuser & Long Le-Khac, ‘Learning to Read Data: Bringing out the Humanistic in the Digital Humanities’, Victorian Studies, Vol. 54, No. 1 (2011), pp. 79-86

[12] Heuser & Le-Khac, ‘Learning’, p. 81


Hyperlink Bibliography

Hitchcock, Tim, ‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’, Historyonics, http://historyonics.blogspot.co.uk/2013/12/big-data-for-dead-people-digital.html; consulted 10th April 2014

Manning, Patrick, ‘Big Data in History’, We Think History, http://wethink.hypotheses.org/1485; consulted 10th April 2014

Moretti, Franco, ‘Conjectures on World Literature’, New Left Review, Vol. 1 (2000), http://newleftreview.org/II/1/franco-moretti-conjectures-on-world-literature; consulted 12th April 2014

Mullen, Abby, ‘”Big” Data for Military Historians’, Canadian Military History, http://canadianmilitaryhistory.ca/big-data-for-military-historians-by-abby-mullen/; consulted 10th April 2014

Owens, Trevor, ‘Defining Data for Humanists: Text, Artifact, Information or Evidence?’, Journal of Digital Humanities, Vol. 1, No. 1 (2011), http://journalofdigitalhumanities.org/1-1/defining-data-for-humanists-by-trevor-owens/; consulted 11th April 2014

Schmidt, Ben, ‘Assisted Reading vs. Data Mining’, Sapping Attention, http://sappingattention.blogspot.co.uk/2010/12/assisted-reading-vs-data-mining.html; consulted 11th April 2014

Schöch, Christof, ‘Big? Smart? Clean? Messy? Data in the Humanities’, Journal of Digital Humanities, Vol. 2, No. 2 (2013), http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/; consulted 12th April 2014

Digital History

Reflection on Digital History

Mirror Against A Wall

This image was taken from PublicDomainPictures.net. Photographer: Lynn Greyling. http://www.publicdomainpictures.net/view-image.php?image=59258&picture=mirror-against-a-wall This image falls under the Public Domain License.

This may sound really naïve, but it never occurred to me that I would need a sound knowledge of how the internet worked for a digital history module. Being computer literate and having grown up with the internet, I thought the module would be a walk in the park. But within the first week I realised I would need to research and understand how things such as the search function really worked, because even this simple feature (which I have used without ever once wondering how it worked) can affect the usefulness of a digital history project and has an even greater impact on how easy it is to use.

This brings me to my other misconception. I thought the module would consist of essay writing on historical topics using only digital history projects as research sources. I’m pleased that I was wrong. Had I not learned how to create a digital history project, I would never have thought to evaluate their usefulness for historians. Considering there are many more digital history projects available than I initially thought, the ability to identify which project is more useful and easier to use is essential when thinking about the way historical research is carried out and how technological advances shape historians’ research methods.

The role technological advances play in shaping digital history projects impressed upon me the problematic nature of digital history. When weighing up the best method of digitising primary sources, the choice of method seems to be determined by expense. Discussions of the advantages and disadvantages of digitisation and representation methods highlight how historians are trapped between wanting to create a detailed and useful tool and lacking the funding to use the best digitisation methods available.

I’ve come to realise historians (and historical institutions for that matter) have moved towards social media as a way of getting around this problem. Take for instance the British Library’s Flickr account. The move to hosting their digitised images on a social media site has many advantages for the British Library. Firstly, they only had to create the page images, as the images can be tagged and searched for by their tags; there is no need for OCR, advanced search features, or even XML or an API. Secondly, by placing it on a social media website there are no hosting fees and no need to update software or buy more servers to host all this digital data. Flickr does all this for them.
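The appeal of tag-based search is its simplicity: each image carries a set of user-supplied tags, and retrieval is just a set-membership test. The sketch below is an invented illustration of that idea (the catalogue, image ids and function name are hypothetical), not Flickr’s actual implementation:

```python
# Hypothetical catalogue: image identifiers mapped to crowd-sourced tags.
catalogue = {
    "img_001": {"map", "london", "19th-century"},
    "img_002": {"portrait", "engraving"},
    "img_003": {"map", "ireland"},
}

def search_by_tags(catalogue, wanted):
    """Return the ids of images carrying every requested tag."""
    return sorted(
        image_id
        for image_id, tags in catalogue.items()
        if wanted <= tags  # subset test: all wanted tags present
    )

print(search_by_tags(catalogue, {"map"}))  # ['img_001', 'img_003']
```

Because the tags come from users rather than OCR, the quality of search depends entirely on the crowd, which is exactly why the encouragement of tagging matters so much to the project.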

Personally, I think there are issues with hosting the images on Flickr (removing the image from its original context is just one of them). But doing so allows open access to images otherwise withheld from the public domain, and the encouragement of tagging by Flickr users has created a crowd-sourced project. Public engagement is often a stipulation for project funding, yet the expensive nature of digital history methods can make this difficult. I think this is why so many historians have taken to Twitter and blogging to disseminate their research. I was quite surprised to see historians had a sizeable presence on social media. Who knew there were so many Twitterstorians!

An online historical community has many benefits for the academic historian. Research can be shared as easily as a retweet, historians can hone their writing skills, and they can receive valuable feedback not only from other historians but from the general public, who can often be an untapped source of knowledge. Personally, I’m pleased to know that once I have left university my involvement with academic history doesn’t have to stop. I can still interact with and follow historical research.

Bibliography of Links

‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’, Historyonics – Tim Hitchcock’s Blog, http://historyonics.blogspot.co.uk/2013/12/big-data-for-dead-people-digital.html; consulted 27th March 2014

Cohen, Daniel J and Rosenzweig, Roy, ‘Becoming Digital’, Digital History, http://chnm.gmu.edu/digitalhistory/appendix/1.php; consulted 27th March 2014

‘Historians on Twitter’, Active History, http://www.activehistory.co.uk/historians-on-twitter/; consulted 27th March 2014

‘The British Library’, Flickr, https://www.flickr.com/people/britishlibrary/; consulted 27th March 2014

‘What is Public Engagement?’, National Co-ordinating Centre for Public Engagement, http://www.publicengagement.ac.uk/what; consulted 27th March 2014