Dodging the Memory Hole 2016: Saving online news

Memory holes and permanent errors: Part 1

Day 1: Intro/current practices

This is part one of a four-part white paper, “Memory Holes and Permanent Errors,” written by University of Missouri graduate student Tamar Wilner as a part of her scholarship activities related to the Dodging the Memory Hole 2016 event at UCLA Library last October. Wilner’s paper examines the neglected and complicated issues surrounding whether and how online news archives should preserve corrections, updates and other post-publication changes. Parts two, three and four will be published in the coming days, along with a PDF of the entire white paper at the conclusion.

Introduction

Online journalism is at risk of disappearing. We often think of digital words as having permanence; for some, unflattering or invasive details linger way too long. But digital news is actually incredibly fragile. Whereas once major news organizations could rest easy that librarians were collecting their old issues or that reporters could dig up old material from the paper’s own dusty “morgues,” now few institutions regularly preserve digital news (Carner, McCain, & Zarndt, 2014). File formats evolve; storage units degrade; files get corrupted or disappear during migrations from one software system to another. All this leaves old stories to become increasingly unreadable or to disappear completely (Carner, McCain, & Zarndt, 2014). New story formats and platforms, including interactives, data visualization, user-generated content and social media, raise even more questions of what should be preserved and how (Hansen & Paul, 2015). As veteran journalist and archivist Victoria McCargar writes, “The fact that digital archives are much more fragile than paper ones is a problem of which many publishers are completely unaware” (McCargar, 2011).

Such ignorance has a very real effect. For example, in 2002, the Columbia Missourian lost 15 years of text and several years of photography in a server crash (McCargar, 2008; Warhover, 2011). For many newspapers, their very first print issues are more readily available than content from the inception of their online product. In fact, interviews with 10 news organizations found that most newspapers’ online archives went back no earlier than 2008, meaning “the opportunity to do serious historical research about the dawn of the digital news age is lost” (Hansen & Paul, 2015). And with news outlets’ revenues under threat, the risk that many will go under — and take years’ or decades’ worth of articles with them — is grave.

Unlike their static print cousins, online articles can and often do change over their lifetime, and this poses difficult dilemmas. First, we must consider how archives treat errors in news stories and what problems are inherent in preserving fallacious or misleading content. There is evidence that corrections are not being adequately preserved or clearly displayed, posing the risk that both current readers and future generations could be misinformed by archived news. At the same time, overzealous or opaque corrections can effectively “scrub” stories, obliterating the original error and leaving readers, including future journalists, unable to learn from mistakes of the past. Contemplating this scenario, one can’t help but be reminded of the “memory holes” in George Orwell’s 1984, for which the Dodging the Memory Hole series of born-digital news archives conferences was named:

“As soon as all the corrections which happened to be necessary in any particular number of the Times had been assembled and collated, that number would be reprinted, the original copy destroyed, and the corrected copy placed on the files in its stead… All history was a palimpsest, scraped clean and reinscribed exactly as often as was necessary. In no case would it have been possible, once the deed was done, to prove that any falsification had taken place” (Orwell, 1949, p.41).

Of course, this paper is concerned with overzealous and opaque approaches to corrections, not with the wholesale fabrication that Orwell describes. But the “memory hole” effect is the same: once an online story is scrubbed, readers are left no clues of the change that took place.

The mutability of online news raises other questions for preservation, even in the absence of errors or corrections. Online news items are frequently updated as more information becomes available to reporters. Should these changes be flagged up for future readers? How many and which versions should be saved? There are also cases in which news outlets make substantive post-publication changes to online articles that neither fix errors nor add breaking information but do alter the tone, tenor or content of a piece. Some editors argue that these changes are made to strengthen the articles — that if they have the means to improve a piece, they should do so. But critics have raised concerns that without alerting readers to changes, the news outlets are “stealth editing” and trying readers’ trust in the process. How should these changes be documented, if at all? Is there a public interest in preserving the earlier versions, and if so, how should this be accomplished?

This white paper poses important questions to consider and factors to weigh at the nexus between online news changes and news preservation. My research was anchored by four interviews with former and current library and editorial staff at two major news outlets, The New York Times and the Los Angeles Times. These interviews inform two case studies of correction and preservation practices at those outlets and also help to supply publisher perspectives throughout the paper. The word “archive” means different things to different players in this space, and throughout the paper I have tried to consider how varying incentives might influence proposed solutions to these preservation problems.

Due to the vastness of the issue under discussion, I have placed several limitations on my research. This white paper examines only written journalism, not video, audio, interactives, social media or other formats. In keeping with the theme of the 2016 Dodging the Memory Hole conference, which sparked this investigation, the paper focuses on the problem of preserving journalism intended for online consumption. It does not examine questions of how readers of historical newspapers, now preserved as online PDFs, microfilms or on commercial databases, should be alerted to relevant corrections, though that is also a difficult and important issue.

It is my hope that this white paper can act as a launch pad for important discussions in the online news preservation space. The field is nascent. The questions raised here are therefore less about changing current practice than about properly designing future tools, technologies and workflows — most of which haven’t even been dreamt of yet. They also serve as a reminder of one more facet of the news preservation problem, for which we must seek to align the needs of various groups: publishers, journalists, librarians, archivists, researchers, database vendors and, of course, readers.

Current archival practices

“Online news archiving” means different things to different stakeholders, depending on their interests. Some approaches are well developed while others are little more than bright ideas. Below, I outline a few of the major modes of thinking about saving online news and sketch out the incentives driving work in each area. It should be noted that many approaches towards saving or collecting online news don’t technically qualify as “archiving,” because they don’t meet technical standards for ensuring longevity and fidelity.

Internal archives

On the more established end are the archives that reside on news outlets’ websites. News organizations have a strong incentive to maintain archives here, rather than bequeath them to memory institutions, because archives can be a substantial source of revenue. “The worry that someone else will profit from their copyrighted content has deepened the resolve of many publishers to tighten control of their digital content. So when memory organizations approach them about handing off their content to be preserved, there has historically been little trust,” write Carner, McCain and Zarndt (2014). Needless to say, all major news organizations carry old stories on their websites, though they vary widely in the extent and searchability of their archives. The purpose of these archives is not merely to serve readers. Journalists have long benefited from referring to old stories, kept first in the paper archives known as “morgues” (Hansen & Paul, 2015). In the late 20th century, however, journalists increasingly came to rely on third-party databases such as Nexis (Carner, McCain, & Zarndt, 2014).

Professional archivists are concerned, however, that newspapers’ internal archives don’t adequately preserve newspaper contents. In fact, the word “archive” is likely a bit of a misnomer, given the lack of protections. Newspapers’ ad-hoc approach to keeping their old online content, often simply leaving stories in the content management system (CMS) used for publishing, is largely responsible for the fragile state of online news preservation described in the introduction to this white paper.

Memory institutions

To “memory institutions,” librarians’ and archivists’ collective lingo for their organizations, relying on news outlets’ internal archives is untenable. These professionals argue that only by archiving news at responsible, sustainable, outside institutions can we ensure that the content will be available for generations to come. In this way, memory institutions hope to continue with online news the work they have done for decades maintaining hard-bound and microfilm copies of newspapers. “While the marketplace rewards breaking news, managing previously published news content has historically been someone else’s problem, most often a librarian’s,” Carner, McCain and Zarndt (2014) argue. They write that besides cost, lack of incentives and fear of losing content ownership, publishers are also hamstrung in their preservation efforts by a lack of expertise and understanding.

The most potent memory institution in web preservation today is the Internet Archive (https://archive.org; McCain, 2016b). The non-profit’s Wayback Machine is a digital archive of the World Wide Web, including multiple versions of particular pages as they change over time. It has so far saved over 279 billion web pages (Internet Archive, n.d.). But such content is piecemeal (Hansen & Paul, 2015); it is not collected or catalogued in any consistent way because no organizations have made commitments to do so. Furthermore, the content is not truly functional; many links suffer from “link rot” and no longer work. Other content is written in outdated programming languages and no longer displays (McCain, 2016b).

Beyond the Internet Archive, the preservation of online news at memory institutions is almost unheard of. This is because news organizations fear the potential threat this might pose to their revenues, and memory institutions do not have enough funding to compensate news outlets for their losses. A survey by the Reynolds Journalism Institute found that only 11 percent of online-only news operations were supplying content to memory institutions (Carner, McCain, & Zarndt, 2014). For hybrid organizations, the figure was 60 percent, but for many organizations this could well be limited to print content (McCain, 2016a). A notable exception is the Library of Congress’s web archiving program, which aims to create a snapshot of how particular websites look at particular points in time, including HTML coding, images, audio, video and PDF files (Library of Congress, n.d.). The library recently instituted a focus on archiving born-digital news sites such as Buzzfeed and Vox (Zwaard, 2016).

One proposed solution for wider preservation is for memory institutions to place online news into “dark archives,” with material only made public once it is out of copyright (Carner, McCain, & Zarndt, 2014). Another possible solution is for the Library of Congress to enforce its right of deposit for online material. The U.S. requires that copies of copyrightable works be sent to the government for preservation at the Library of Congress, much as many other countries require deposits with their own national libraries. But, like many other countries, the U.S. has carved out an exemption for digital-only works (Zarndt, Carner, & McCain, 2015). If news outlets were compelled to provide their online content to the Library of Congress, this would circumvent the current financial roadblock.

Technology NGOs and start-ups

Another type of player to consider is the technologist. Small tech ventures, usually side projects of professionals in journalism, library science or computer science, have been experimenting with news archiving solutions. Some of these bypass traditional memory institutions and appeal directly to individual users. For example, NewsDiffs (newsdiffs.org) is a website that automatically detects and documents changes made to online articles at five news outlets: The New York Times, Washington Post, CNN, the BBC and Politico. Such technology-led efforts are often experimental, but their approaches could be expanded through investment in the start-ups or adopted by more established organizations.
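
To make the idea of automated change detection concrete, the sketch below compares two saved versions of an article and lists the lines that were removed or added. It is a minimal illustration only, not a description of how NewsDiffs itself works: the function name summarize_changes and the sample article text are hypothetical, and the comparison relies on Python's standard difflib module.

```python
import difflib
from typing import List


def summarize_changes(old_text: str, new_text: str) -> List[str]:
    """Return unified-diff lines describing how new_text differs from old_text.

    Hypothetical helper for illustration; not taken from any real archiving tool.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # unified_diff yields nothing when the two versions are identical.
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="earlier version",
            tofile="later version",
            lineterm="",
        )
    )


# Invented sample text standing in for two captures of the same story.
earlier = (
    "The council voted 5-2 to approve the budget.\n"
    "The mayor declined to comment."
)
later = (
    "The council voted 6-1 to approve the budget.\n"
    "The mayor declined to comment."
)

for line in summarize_changes(earlier, later):
    print(line)
```

Lines beginning with a minus sign appear only in the earlier version, and lines beginning with a plus sign appear only in the later one. A record of this kind, stored alongside each capture, is one way an archive might document a post-publication change without discarding the original text.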

Database vendors

The final party to consider in collecting online news is the database vendor. These companies offer online news content in the products that they sell to end-users, including universities, public libraries, other memory institutions and commercial clients, which include news organizations themselves. This is already happening on a fairly large scale, unlike the deposit of online content with memory institutions. For example, the database Proquest carries content from over 1,200 newspapers and over 200 blogs, podcasts and websites (Proquest, n.d.). Most vendors don’t publicly disclose data on their subscribers and revenue, but based on the size of their holding companies, the largest are likely LexisNexis, Proquest, Factiva and Newsbank (Carner, 2016).

Like news organizations themselves, database vendors are driven by a profit motive. They preserve online news stories because it is profitable to charge customers for accessing those stories. This presents several issues for those who wish to see online news preserved. First, there is no guarantee about what might happen to the vendors’ stock of news articles should they go out of business. Second, any of the solutions discussed here for dealing with errors, updates and large-scale edits will be non-starters if they require cooperation from database vendors, but don’t reward vendors for their efforts.

References

Carner, D., McCain, E., & Zarndt, F. (2014). Missing links: The digital news preservation discontinuity. In IFLA (International Federation of Library Associations and Institutions). Lyon, France.

Carner, D. (2016, December 15). Personal correspondence.

Hansen, K. A., & Paul, N. (2015). Newspaper archives reveal major gaps in digital age. Newspaper Research Journal, 36(3), 290–298. http://doi.org/10.1177/0739532915600745

Internet Archive: Wayback Machine (home page). (n.d.). Retrieved December 3, 2016, from https://archive.org/web/

Library of Congress. (n.d.). Web Archiving FAQs. Retrieved December 3, 2016, from https://www.loc.gov/webarchiving/faq.html#faqs_13

McCain, E. (2016a). Personal communication, November 4.

McCain, E. (2016b). Remarks delivered in DtMH 2016 webinar #2, hosted by Educopia Institute. Retrieved from https://meetings.webex.com/collabs/url/6XnBUIilg12TkPah0xA2N5YzthsZzDtxjL8qpq-AVUq00000

McCain, E. (2015). Plans to save born-digital news content examined. Newspaper Research Journal, 36(3), 337–347. http://doi.org/10.1177/0739532915600747

McCargar, V. (2008). Missouri J-School and the “backstory.” Retrieved from https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/45033/MissouriJSchoolAndTheBackstory2008.pdf?sequence=1

McCargar, V. (2011). A mandate to preserve: Assessing the inaugural Newspaper Archive Summit. Reynolds Journalism Institute. Retrieved from https://www.rjionline.org/downloads/a-mandate-to-preserve-assessing-the-inaugural-newspaper-archive-summit/

Orwell, G. (1949). Nineteen Eighty-Four. Orlando, Fla.: Harcourt Brace.

Proquest. (n.d.). Website accessed through University of Missouri proxy server. Retrieved December 3, 2016.

Warhover, T. (2011, April 19). Dear reader: Digital archives don’t last: A tale of corruption and crashes. Columbia Missourian. Retrieved from http://www.columbiamissourian.com/news/dear-reader-digital-archives-don-t-last-a-tale-of/article_ddfd387b-03f2-54b5-b4e8-019de24351b6.html

Zarndt, F., Carner, D., & McCain, E. (2015). An international survey of born digital legal deposit policies and practices. Paper published at the 2015 International Federation of Library Associations International News Media Conference, “Transformation of the online news media: implications for preservation and access.” Stockholm, Sweden, April 15-16, 2015. Retrieved from https://drive.google.com/file/d/0B4gyLSJzlES5RmZlRDlHclAzZjQ/view

Zwaard, K. (2016). Remarks delivered at Dodging the Memory Hole conference, Oct. 13-14, 2016. Los Angeles, CA.

All references cited will be available in the final PDF download.
