Extraordinary documentation can make for an extraordinary story — and terrible trouble for sources and vulnerable populations if handled without enough care. Recently, The Intercept published a story about a leaked NSA report, posted to DocumentCloud, that alleged Russian hacker involvement in a campaign to phish American election officials.
At the same time, the FBI arrested a government contractor, Reality Winner, for allegedly leaking documents to an online news outlet. The affidavit partially revealed how the FBI caught Winner, including a postmark and physical characteristics of the document that The Intercept posted.
The Intercept isn’t alone in leaving digital footprints in its published material. In a post called “We Are with John McAfee Right Now, Suckers,” Vice posted a picture of the then-fugitive John McAfee, complete with GPS coordinates pinpointing their source’s location. McAfee was in official custody shortly afterward. In 2014, The New York Times improperly redacted an NSA document from the Snowden trove, revealing the name of an NSA agent.
The first step with any sensitive material is to consider what will happen when its subjects or the public see it. It can be hard to pause in the rush of getting a story out, but thinking through the nature of the information you’re releasing can prevent real problems: what needs to be released, what could be used in unexpected ways, and what could harm people.
A Checklist for Sensitive Documents
Removing potentially harmful information from documents is difficult. To make it a little easier, DocumentCloud is creating a checklist of what to think about when making a sensitive document public. But even when the material isn’t on DocumentCloud, this checklist can help reporters and news organizations protect their sources, or other vulnerable people, from getting hurt by the materials posted along with a story.
✔ Have you scrubbed the document metadata?
Many modern file formats contain metadata to support popular features. If you’ve used track changes or geotagged a photo, you have created metadata that can persist invisibly in a document and may reveal details about vulnerable people or sources. Beyond those two examples, nearly every modern file format carries some form of metadata, from email headers to the ID3 details embedded in MP3s. It can seem daunting, but a search on the format of each file you have plus the word “metadata” can help you find tools to analyze and, if needed, remove metadata.
A few examples…
- Microsoft Word documents: These documents may contain a few types of hidden information. Here’s a primer.
- Images: EXIF is the metadata attached to digital photos. There are quite a few free online EXIF viewers, but if you can’t afford to upload sensitive material, you can also view EXIF data on your own machine via these browser plugins for Firefox and Chrome.
- PDFs: Here’s an overview of PDF properties and metadata. In DocumentCloud’s case, its platform will convert images, Word and Excel documents, and HTML pages into PDFs. In these conversions, DocumentCloud removes the metadata from the original when creating the PDF. However, DocumentCloud currently does not remove metadata from documents uploaded directly as PDFs.
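As a rough illustration of how much a file can carry invisibly, here is a minimal sketch for Word documents using only Python's standard library. It relies on the fact that a .docx file is a zip archive whose document properties usually sit in docProps/core.xml and whose body sits in word/document.xml. The function names are our own, the tracked-changes count is approximate, and none of this replaces a purpose-built scrubbing tool or removes anything from the file.

```python
import zipfile
import xml.etree.ElementTree as ET


def _local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_docx_core_properties(docx_file):
    """Return {property name: text} from docProps/core.xml, if present.

    docx_file may be a path or a file-like object opened in binary mode.
    """
    properties = {}
    with zipfile.ZipFile(docx_file) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return properties
        root = ET.fromstring(archive.read("docProps/core.xml"))
    for element in root:
        if element.text and element.text.strip():
            properties[_local_name(element.tag)] = element.text.strip()
    return properties


def count_tracked_changes(docx_file):
    """Roughly count tracked insertions and deletions in the document body.

    Counts every element named 'ins' or 'del' in word/document.xml, which is
    how tracked revisions are normally recorded. Treat the number as a hint
    that revisions exist, not as an exact tally.
    """
    with zipfile.ZipFile(docx_file) as archive:
        if "word/document.xml" not in archive.namelist():
            return 0
        root = ET.fromstring(archive.read("word/document.xml"))
    return sum(1 for element in root.iter()
               if _local_name(element.tag) in ("ins", "del"))
```

Calling read_docx_core_properties on a file typically surfaces fields such as creator and lastModifiedBy, which are exactly the details worth checking before publication. An empty result only means this one location is clean; other parts of the file may still hold identifying data.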
✔ Have you checked for identifiers?
Identifiers may include:
- Printer dots
- Watermarks
- Text/font variations
- Unusual spacing
Documents can be modified to allow the author to track a document’s life after creation. The oldest technique for doing this is a faint print on the paper: the traditional watermark. With digital documents, variations in text, spacing, spelling or even phrasing can allow an author to create versions that link back to specific people or groups of people, making it possible to trace the origin of a leak. Additionally, printers can “sign” paper documents, adding physical metadata through microdots printed directly on the page that are barely visible to the human eye.
Defeating these techniques requires a careful inspection of the documents, looking for telltale signs and modifying the document to obscure its origin. Sometimes, recreating the document may be necessary, but that’s a judgement call that you have to make on a case-by-case basis. Inspection is never foolproof, but spotting and correcting the spacing, spelling, and physically identifying features of a document can go a long way toward mitigating danger to the people who would become vulnerable once a document is published.
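For digital text, part of that inspection can be automated. The sketch below is one possible approach, not an established tool: it flags invisible formatting characters, non-standard space characters, and runs of repeated spaces, any of which could be used to mark a particular copy. The function name is our own. It says nothing about printer dots, watermarks, or deliberate changes in spelling and wording, which still need a human eye.

```python
import re
import unicodedata


def find_suspicious_spacing(text):
    """Return a sorted list of (position, description) for unusual spacing.

    Flags Unicode format characters (category Cf, such as zero-width spaces),
    space separators other than the ordinary space (category Zs, such as
    no-break spaces), and runs of two or more ordinary spaces.
    """
    findings = []
    for index, char in enumerate(text):
        category = unicodedata.category(char)
        if category == "Cf" or (category == "Zs" and char != " "):
            name = unicodedata.name(char, "UNNAMED")
            findings.append((index, f"U+{ord(char):04X} {name}"))
    for match in re.finditer(r" {2,}", text):
        findings.append((match.start(), f"run of {len(match.group())} spaces"))
    return sorted(findings)
```

A clean result does not prove a document is unmarked, and a flagged character is often innocent, since no-break spaces and double spaces are common in ordinary writing. The output is a list of places to look more closely.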
✔ Have you accounted for other information that, combined with this document, could reveal vulnerable people?
In considering the newsworthiness of a document, it’s also worth considering what will happen when the public or subjects of a document see that document. Sometimes details that aren’t personally identifying on their own can be patched together with other publicly available information, in articles or public webpages, and reveal identities or unintentional details.
It’s hard to know in advance if this is possible, but it’s worth taking some time to consider. Uniquely identifying information, such as geographical or life details, can often narrow down the identity of an anonymous person quickly. Harassers (or worse) can find vulnerable people.
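To see why harmless-looking details add up, consider this small sketch. Every record and field name in it is invented for illustration. It counts how many people in a hypothetical population still match as each detail from a document is applied in turn.

```python
def narrow_candidates(population, known_details):
    """Return [(field, people still matching)] after applying each detail in order."""
    remaining = list(population)
    counts = []
    for field, value in known_details:
        remaining = [person for person in remaining if person.get(field) == value]
        counts.append((field, len(remaining)))
    return counts


# Entirely made-up records; a real population would be far larger.
population = [
    {"city": "Springfield", "employer": "County Clerk", "years_on_job": 3},
    {"city": "Springfield", "employer": "County Clerk", "years_on_job": 12},
    {"city": "Springfield", "employer": "School Board", "years_on_job": 3},
    {"city": "Shelbyville", "employer": "County Clerk", "years_on_job": 3},
]

# Details a reader might piece together from a document and public sources.
details = [("city", "Springfield"), ("employer", "County Clerk"), ("years_on_job", 3)]

for field, count in narrow_candidates(population, details):
    print(f"after {field}: {count} candidate(s) remain")
```

In this toy example, three details that each describe several people leave exactly one candidate when combined. The same arithmetic applies to real documents, only with public records and social media standing in for the list.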
✔ Is the document properly redacted?
Documents can contain sensitive content which you wish to redact. These could be addresses, phone numbers, personally identifying information or information which could reveal a source. There are a number of redaction tools, DocumentCloud included, which will expunge text and visible content in a document. But it is important to understand how your redaction tools work, and to verify the results. It’s not enough to draw black boxes over digital text — the text itself must be expunged from the document.
For example, DocumentCloud will remove a digital page from a PDF, and replace that page with an image snapshot of that page. DocumentCloud will then use optical character recognition (OCR) on the image, and use the resulting text in the document. This ensures that there is no way for the text which you wish to remove to become inadvertently included in your document. In DocumentCloud, you can check the results by clicking on the text tab in the viewer, as well as checking the original document link.
Whatever tool you use, read its instructions and double-check your redactions before the document is made public.
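One way to double-check is to extract the text from the redacted file with whatever tool you trust, then search that text for the things you meant to remove. The sketch below assumes the extraction has already been done and covers only the comparison. The function names are our own, and the matching is deliberately simple: it ignores case and collapses whitespace, so a term split across lines is still caught.

```python
import re


def _normalize(value):
    """Collapse whitespace and ignore case so line breaks don't hide a match."""
    return re.sub(r"\s+", " ", value).strip().casefold()


def find_unredacted_terms(extracted_text, sensitive_terms):
    """Return the sensitive terms that still appear in the extracted text."""
    haystack = _normalize(extracted_text)
    leaked = []
    for term in sensitive_terms:
        needle = _normalize(term)
        if needle and needle in haystack:
            leaked.append(term)
    return leaked
```

An empty result is encouraging but not conclusive. It cannot catch misspellings, partial names, or text that survives in metadata or in an earlier version of the file, so it supplements a manual review and does not replace one.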
✔ Is the document the minimum needed for the story?
Publishing only what the story needs, in content and context, minimizes the possibility of harm and focuses reader attention on what matters the most.
It’s our hope that by following this checklist, and thinking carefully about how the document will be perceived and used in public, journalists can maximize the effectiveness of the evidence that supports their stories while minimizing the harm to sources and bystanders.
This post first appeared on Source, an Open News website, and is cross-posted here with permission. It has also been translated into Arabic and Russian by GIJN.
Ted Han is director of technology for @DocumentCloud. He studied computational linguistics and has worked in technology and startups for more than a decade. He was a participant in the Knight Mozilla Journalism Challenge and has worked on DataMapper, Merb and a variety of data-based projects.
Quinn Norton is a technology journalist who started studying hackers in 1995. She has been published in Wired, The Atlantic and Maximum PC, and covers science, copyright law, robotics, body modification and medicine, but no matter how many times she tries to leave she always comes back to hackers.