In the Utah desert, about 25 miles south of Salt Lake City, the National Security Agency (NSA) is putting the finishing touches on a massive surveillance hub that, by most estimates, will constitute the largest coordinated effort at intrusive surveillance ever undertaken by the United States.
The Utah Data Center at Bluffdale — presented by the NSA as a major step toward fighting cyber-terrorism — will, by many published estimates, be able to monitor all international communications originating or terminating in the United States via telephone or Internet and store in excess of five zettabytes of data (5,497,558,100,000 gigabytes or 5,902,958,100,000,000,000,000 bytes).
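The figures in parentheses can be checked with a few lines of arithmetic. The sketch below assumes the estimate uses binary units (one zettabyte taken as 2^70 bytes and one gigabyte as 2^30 bytes), which is the reading that reproduces the rounded numbers quoted above; under the decimal definition, five zettabytes would be exactly 5 × 10^21 bytes.

```python
# Assumption: binary units, i.e. 1 zettabyte = 2**70 bytes, 1 gigabyte = 2**30 bytes.
ZETTABYTE = 2**70
GIGABYTE = 2**30

capacity_bytes = 5 * ZETTABYTE
capacity_gigabytes = capacity_bytes // GIGABYTE

print(f"{capacity_bytes:,} bytes")          # 5,902,958,103,587,056,517,120 bytes
print(f"{capacity_gigabytes:,} gigabytes")  # 5,497,558,138,880 gigabytes
```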
As mind-bogglingly enormous as that number is, this is not the only way that the NSA mines for information on its targets. In April, MuckRock — a public information research firm — filed a Freedom of Information Act (FOIA) request with the NSA, which led to the declassification and release of “Untangling the Web: A Guide to Internet Research.”
The 643-page book, which is available for download and was published by the NSA’s Center for Digital Content, takes a tongue-in-cheek approach to advising “operators” on how to use search engines, the Internet Archive and other online tools to mine for useful information.
A spy named Google
One of the more interesting chapters in the book is titled “Google Hacking.” It offers practical hints on manipulating the popular search engine to return listings for information that would not appear in a typical search. For example, if an NSA “operator” wanted to access inside information on companies in South Africa, he would enter a search term such as “filetype:xls site:za confidential” into Google, according to the book. Such a search will target South African Excel spreadsheets marked as “confidential” that were inadvertently posted on a publicly accessible channel, a possible bonanza of information for the “operator.”
How about Russian passwords? Try “filetype:xls site:ru login.” According to the book, control words such as “password,” “login” and “userid” typically appear in English, even in a non-English environment.
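The queries quoted from the book follow a simple pattern: each operator is joined to its value with a colon and no space, and ordinary keywords follow. The sketch below shows one way to assemble such a query string and encode it into a search URL. The helper names `build_query` and `search_url` are hypothetical, and the endpoint with its `q` parameter is an assumption about the public search page, not something taken from the manual.

```python
from urllib.parse import urlencode


def build_query(terms, filetype=None, site=None):
    """Combine search operators and plain keywords into one query string."""
    parts = []
    if filetype:
        parts.append(f"filetype:{filetype}")  # operator and value, no space
    if site:
        parts.append(f"site:{site}")
    parts.extend(terms)
    return " ".join(parts)


def search_url(query):
    # Assumption: the public search page accepts the query in a "q" parameter.
    return "https://www.google.com/search?" + urlencode({"q": query})


south_africa = build_query(["confidential"], filetype="xls", site="za")
russia = build_query(["login"], filetype="xls", site="ru")

print(south_africa)        # filetype:xls site:za confidential
print(search_url(russia))  # https://www.google.com/search?q=filetype%3Axls+site%3Aru+login
```

Keeping the operators as separate arguments makes it easy to vary the file type or country domain while leaving the keyword list unchanged.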
None of this is particularly illegal. The use of qualifiers like “filetype:” and “site:” is encouraged by Google in order to produce more accurate searches. As the author points out, “Nothing I am going to describe to you is illegal, nor does it in any way involve accessing unauthorized data.” Instead it “involves using publicly available search engines to access publicly available information that almost certainly was not intended for public distribution.”
The legality of such data mining gets into more of a gray area, however, when it’s not the government doing it.
Last November, the U.S. District Court for the District of New Jersey found Andrew Auernheimer, 26, of Fayetteville, Ark., guilty of one count of identity fraud and one count of conspiracy to access a computer without authorization. Auernheimer and a friend, Daniel Spitler, 26, of San Francisco, Calif., had exploited a security hole on the AT&T website by writing a program that repeatedly issued GET requests for the email addresses and ICC-IDs (the unique identifiers for SIM cards) of iPad owners subscribed to AT&T.
In all, Auernheimer gathered information on more than 100,000 individuals using everyday HTTP commands. Auernheimer was sentenced to 41 months in prison and three years of supervised release.
Metadata
The NSA manual’s authors point to other risks in viewing publicly available information. For example, the authors remark in a footnote: “It is critical that you handle all Microsoft file types on the internet with extreme care. Never open a Microsoft file type on the internet. Instead, use one of the techniques described here.” The word “here” is hyperlinked, but the link is inaccessible.
It’s easy, however, to guess at the risk: Microsoft documents carry large amounts of metadata, information that can be used to identify a document and its users. Browser plugins that allow documents to be read online share the metadata with Microsoft for usage analysis, or, more likely, are hosted by Microsoft directly.
For example, the “BTK” serial killer was ultimately caught because of the metadata in a Microsoft Word document stored on a floppy disk he had sent to a Wichita television station.
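To make the idea of document metadata concrete, here is a minimal sketch of how author fields can be read from a Word file. It assumes the newer .docx format, which is a ZIP package that keeps its core properties in `docProps/core.xml`; older .doc files store metadata differently and are not covered. The helper name `read_core_properties` and the sample author values are hypothetical, and the in-memory package stands in for a real document so the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(docx_file):
    """Return author-related metadata stored in a .docx package."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "creator": root.findtext("dc:creator", namespaces=NS),
        "last_modified_by": root.findtext("cp:lastModifiedBy", namespaces=NS),
    }


# Hypothetical stand-in for a real document: a ZIP holding only the metadata part.
SAMPLE_CORE_XML = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:creator>Example Author</dc:creator>"
    "<cp:lastModifiedBy>Example Editor</cp:lastModifiedBy>"
    "</cp:coreProperties>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as package:
    package.writestr("docProps/core.xml", SAMPLE_CORE_XML)
sample.seek(0)

metadata = read_core_properties(sample)
print(metadata)
```

Passing a path to a real .docx file instead of the in-memory sample would read that file’s recorded creator and last editor, which is the kind of identifying detail the passage above describes.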
It should also be pointed out that the NSA’s book is in large part outdated. Originally published in 2007, it describes techniques that are commonly known in the information technology community, whose members by now keep an eye out for such tricks. For example, the Google search term “filetype:doc confidential” produced mostly governmental and education non-disclosure agreements dating from 2011. Sam Richter, who wrote the first draft of his web search book, “Take the Cold Out of Cold Calling,” in 2009, told Mint Press News that he estimated that 80 percent of the information in that draft is now outmoded.
None of the information presented in the government manual is unique or top secret. In 2007, Johnny Long published “Google Hacking for Penetration Testers,” in which the founder of the Google Hacking Database explains almost all of the same techniques highlighted in “Untangling the Web.” Currently, Long runs Hackers For Charity, which connects hackers to in-need charities in order to provide security assessments, customized computing options and computer training.
Despite this, the revelation of this training manual raises a particular question: Should the intelligence community rely on the Internet for information?
The reliability of the Internet
The Internet has become the greatest fount of knowledge and intellectual promotion that has ever existed in the course of humanity. But it’s also the vehicle by which politicians fall from grace for posting pictures of their private parts.
As stated by the authors of the NSA book in a web tip, “Think of search engine databases as huge warehouses in which everything from diamonds to debris is stored. Your job is to find the jewels amid the muck.”
In weighing the veracity of the Internet, two facts must be considered. First, information on the Internet does not die. Even if the information on the hosting server is changed or removed, it still exists on any computer that requested and stored that data. While it is not true that everything that existed on the Internet still exists today, organizations such as the Internet Archive, WebCite and Archive-It are working to change that.
What this means is that it is becoming increasingly common to come across outdated or currently incorrect information. More damningly, it is now nearly impossible to redact or correct inaccurate information, as the uncorrected version will almost certainly remain intact — somewhere.
This brings us to the second fact. The Internet is a user-submitted, user-edited forum. Anyone can post to the Internet, meaning that anyone can knowingly post false information. Recently, the Heritage Foundation, a respected right-wing think tank used as a factual resource by many journalists, came under fire for posting a position paper attacking the intelligence of Hispanics and claiming that immigration reform will cost $6.3 trillion.
When vetting information from the Internet, we now must do our best to determine whether the information is timely, unbiased and accurate. As it is typically impossible to ascertain this, it’s important to remember an old reporting adage: “If you’re re-reporting a published fact and can’t find a living, breathing person to verify it, don’t use the fact. Likely, it’s wrong.”