What is the NSA Collecting? An Exercise in Open-Source Intelligence Analysis

Layout of the NSA's Utah Data Storage Facility. Image: Wikimedia.

The recent revelations of NSA warrantless data collection are of interest mostly because one has to wonder why so many people seem surprised. It wasn’t a matter of reading tea leaves; it was right out in the open for anyone who was paying attention.

One of the big “clues” has been the construction of a massive NSA data storage facility in Utah for all the stuff they collect. Last week I got to thinking about whether the NSA collects only “metadata” (who originates and receives data, when, where, etc.) and not actual content, as they so piously insisted. Of course, we now know that they do collect content, but I arrived at that conclusion last week.

How, you ask?

Sometimes all it takes is a little back-of-the-envelope calculation to see whether the numbers add up or, if they don’t, what they in fact add up to. I asked myself how the proposed size of the new data collection facility comports with the claim that the NSA collects only metadata.

My first step is to do some data collecting of my own. The amount of metadata collected from say, a phone call (number calling, number receiving, duration of the call, etc.) doesn’t really need much data. This paragraph is more than long enough to accommodate it. (218 characters, excluding spaces)

But let’s allow one (1) kilobyte or 1000 characters for each record. That should be more than enough. Some quick research on the web gave me the following statistics about phone and other forms of traffic in the US:

    • Phone calls per day: 3 billion
    • Text messages: 4.1 billion
    • Emails: 249 billion

That comes to a total of 256.1 billion records per day. Just in case we missed other items (credit card purchases, chat sessions, VoIP calls, etc.) that get scooped up, let’s double it to roughly 512 billion, or $5.12 \times 10^{11}$.

Since each record uses $10^{3}$ bytes, the total information collected in a single day using our model comes to $5.12 \times 10^{14}$ bytes.
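
The tally is easy to check in a few lines of Python. This is only a sketch of the arithmetic in this post: the names are my own, and the traffic figures are the rough web estimates quoted above, not measured data.

```python
# Rough daily traffic estimates quoted in the post (records per day).
DAILY_RECORDS = {
    "phone_calls": 3.0e9,
    "text_messages": 4.1e9,
    "emails": 249e9,
}

BYTES_PER_RECORD = 1_000  # generous allowance: 1 kilobyte per metadata record
SAFETY_FACTOR = 2         # doubled to cover traffic types we did not count


def daily_metadata_bytes(records=DAILY_RECORDS,
                         bytes_per_record=BYTES_PER_RECORD,
                         safety_factor=SAFETY_FACTOR):
    """Return the estimated bytes of metadata collected per day."""
    total_records = sum(records.values()) * safety_factor
    return total_records * bytes_per_record


if __name__ == "__main__":
    print(f"{daily_metadata_bytes():.3e} bytes per day")
```

Changing any of the three inputs (the record counts, the bytes per record, or the safety factor) shows how sensitive the daily total is to each guess.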

Now, let’s look at that Utah storage facility. A government web site remarks that the storage capacity is measured in “zettabytes” but does not say how many. A zetta- of something is $10^{21}$, or 1 followed by 21 zeros.

We can now ask, “How many days would it take to fill one zettabyte of data at the current rate?” Divide the total capacity by the daily take:

$10^{21} / (5.12 \times 10^{14}) \approx 1.95 \times 10^{6}$

This comes to about 1,950,000 days, or in excess of 5,300 years. Ah, but the flow of electronic traffic is going to grow, isn’t it? Let’s look at this another way. How many times can the rate of traffic double before we hit one zettabyte?

$(5.12 \times 10^{14}) \cdot 2^{t} = 10^{21}$

where $t$ is the number of doublings. Divide through by the daily rate:

$2^{t} = \dfrac{10^{21}}{5.12 \times 10^{14}}$

$2^{t} = 0.195 \times 10^{7} = 1.95 \times 10^{6}$

Take the log (base 10) of both sides:

$t \log(2) = \log(1.95 \times 10^{6}) \approx 6.3$

$t = 6.3 / \log(2) \approx 20.9$

This means the rate can double about 21 times before a single day’s take amounts to one zettabyte. So, if the amount of traffic doubles every 5 years, that point is roughly 100 years off. If it doubles every year, it is about 21 years off. Counting everything that piles up along the way, one zettabyte fills sooner than that, but under these assumptions it still takes a decade or more. That’s a huge amount of storage. Something tells me that (a) the facility will hold more than one zettabyte and (b) the numbers point to storing much, much more than just metadata.
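
For anyone who wants to play with the numbers, here is a small Python sketch of the fill-time arithmetic. It assumes the daily estimate worked out earlier and a capacity of exactly one zettabyte. The function names and the smooth exponential growth model in the third function are my own assumptions for illustration. That third function adds up everything collected along the way rather than looking only at the daily rate.

```python
import math

DAILY_BYTES = 5.12e14  # estimated metadata collected per day (from the post)
ZETTABYTE = 1e21       # assumed capacity: exactly one zettabyte


def days_to_fill(capacity=ZETTABYTE, daily=DAILY_BYTES):
    """Days needed to fill the capacity if the daily take never grows."""
    return capacity / daily


def doublings_until_daily_take_equals(capacity=ZETTABYTE, daily=DAILY_BYTES):
    """Solve daily * 2**t = capacity for t."""
    return math.log2(capacity / daily)


def years_to_fill_with_growth(doubling_years, capacity=ZETTABYTE,
                              daily=DAILY_BYTES):
    """Years until the accumulated total reaches capacity, assuming the
    rate grows smoothly and doubles every `doubling_years` years."""
    yearly = daily * 365.25
    # The integral of yearly * 2**(t/T) from 0 to t is
    # yearly * T / ln(2) * (2**(t/T) - 1); set it to capacity, solve for t.
    growth = 1 + capacity * math.log(2) / (yearly * doubling_years)
    return doubling_years * math.log2(growth)


if __name__ == "__main__":
    print(f"Days to fill at a flat rate: {days_to_fill():,.0f}")
    print("Doublings until one day's take is a zettabyte: "
          f"{doublings_until_daily_take_equals():.1f}")
    for period in (1, 5):
        print(f"Doubling every {period} yr: full in about "
              f"{years_to_fill_with_growth(period):.0f} years")
```

Swapping in a larger capacity, or a bigger per-record size to stand in for actual content, shows how quickly the picture changes once more than metadata is being stored.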

Of course, as I mentioned, we now have some confirmation that this is the case, but with a little math and the willingness to play with some numbers, one can peek behind the curtain just a bit.

Call it a case of using open-source intelligence, but I prefer to think of it as an answer to that whiny question from bored high school students in math class: “Why are we learning this?”


Comments

What is the NSA Collecting? An Exercise in Open-Source Intelligence Analysis — 3 Comments

  1. If it takes 3 pumping stations for the cooling water, it seems silly to build this in the Utah desert.
    Why didn’t they build it in Alaska, where cooling is not such a problem and there are wide rivers of icy water?

    The problem with building secret facilities is the lack of adult oversight in location planning. So they get handed out to states as political candy, not based on common sense. This is why they have mission control in Texas even though they launch the rockets from Florida.

  2. Which reminds me, I don’t know if you know how good speech-to-text software is these days. Have you used a recent version of DragonSpeak on a fast new computer?

    I have always assumed that monitored conversations in the modern world are NOT normally listened to by men with headphones, but are turned into text, which is word-searchable and very small to store.

    There is also talk of a phonetic intermediary storage where the sounds are identified and stored as symbols, a precursor to translation into the actual text. I understand this takes even less storage.

    • James,

      I don’t know the state of the art of speech recognition, but I have heard that the DragonSpeak app for the iPhone is quite good. I tried it myself once using some unusual proper names and technical terms and it got nearly all of them right. The NSA tech is said to be about ten years ahead of commercial technology. If that’s the case, I imagine their equivalent is excellent.
