An installation at the Hacked By Def Con Press Preview at the 2016 Tribeca Film Festival. (Photo by Rob Kim/Getty Images for Tribeca Film Festival)
From a data science standpoint, one of the most fascinating criticisms of the US Government report on Russian hacking of the US presidential election is that, for all its sweeping claims, the hard evidence the report actually presents is relatively weak, and some of that evidence ends up undermining its conclusions rather than supporting them.
To the general American public, identifying the culprits behind a cyber attack must seem relatively straightforward, much like an episode of CSI in which a few keystrokes are all that is needed to bring up definitive proof of who the hacker is, name and home address blinking on the screen in neon red, ready for the police to swoop in and close the case.
In reality, cyber attacks are popular precisely because they are often so difficult to attribute beyond a shadow of a doubt, and thus offer the attacker the benefit of deniability. In the physical world, even the most talented bank robber will leave enough of a trail, in the form of city-wide surveillance cameras, electronic device records, DNA evidence, financial transactions and other information, to identify the likely culprit and establish concrete evidence of guilt. In the cyber realm, an attacker can construct a chain of intermediaries and false evidence that makes attribution extremely difficult.
One has only to read the daily roundup of the latest major cyber attacks in newsletters like The CyberWire to understand just how global, pervasive and damaging cyber attacks have become. While high-profile attacks like potential Russian influence in the US election garner international headlines, the myriad other attacks happening around the planet at this moment tend to receive little attention. Moreover, it is fascinating to watch the back-and-forth around attribution that often plays out in the aftermath of major attacks, with headlines one day proudly proclaiming proof positive of the culprit, while the very next day a different organization presents its own proof positive of a very different culprit.
Like any other form of data science, cyber attribution involves processing and analyzing reams of often contradictory or individually weak data points to weave together a cohesive and holistic narrative that supports a given conclusion. Such analyses traditionally incorporate many different data sources and methods to present and eliminate competing alternatives to the hypothesis in question. The evidence may range from verified, externally confirmed information through suggestions that are weak in isolation but compelling in aggregate.
The problem comes when this continuum of evidence is presented together without distinguishing between the strongest and weakest supporting items. In the case of the Russian hackers report, one of the most-discussed elements was its list of 876 IP addresses that the US intelligence community determined were used by Russian hackers, and from which the US Government encouraged companies to report any accesses.
If all of these addresses were exclusively used by hackers affiliated with the Russian government, then the list would serve as an excellent reference database for companies examining their networks for signs of possible Russian intrusion. The problem, as Micah Lee of The Intercept points out, is that nearly half of them are simply Tor exit nodes: while Russian hackers may have used Tor to mask their origins, accesses from such IP addresses could equally represent any of the millions of other people who use Tor. IP addresses also frequently change hands, meaning even a confirmed Russian hacking IP address could tomorrow be "granny's bake shop." Addresses tied to common services, such as Yahoo's email platform, further increase the likelihood of false positives, since those services are used far more widely than just by Russian hackers. Many more addresses could simply be infected or compromised servers temporarily used by hackers as botnet nodes or intermediaries, but eventually restored to normal service.
As any data scientist would say, this list of IP addresses would have been far more useful if the US Government agencies that produced it had separated the addresses into categories of concern. Knowing that a Tor user has visited one's website might be of interest to a company and possibly worthy of follow-up depending on the service accessed, but it is not necessarily grounds for immediately shutting down the entire corporate network for fear that foreign hackers are actively penetrating its cyber defenses. Similarly, knowing an employee is checking their personal Yahoo email account from work is useful to flag as a possible violation of company internet use policy, but it is not in the same vein as an active exfiltration of corporate databases. A categorized list would allow companies to alert only on accesses from IP addresses currently and actively associated primarily with hacking activity. Of course, releasing the list at all is problematic in that even rudimentary hackers will use it to determine which of their exit points have been identified and simply switch to a different set of IP addresses.
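To make the idea concrete, here is a minimal sketch of the kind of triage a categorized indicator list would enable, versus the flat list the report actually published. The IP addresses, category names and severity levels below are invented for illustration and do not come from the report itself.

```python
# Hypothetical sketch: matching network logs against a *categorized*
# indicator list, so each hit gets a severity instead of a blanket alarm.
# All IPs and categories here are invented examples.
CATEGORIZED_INDICATORS = {
    "198.51.100.7": "dedicated_hacking",  # actively attacker-controlled
    "203.0.113.44": "tor_exit",           # shared by millions of Tor users
    "192.0.2.19":   "webmail",            # common consumer service
}

# A flat list would treat all three identically; a categorized list lets
# a defender respond proportionately to each category.
SEVERITY = {
    "dedicated_hacking": "alert",
    "tor_exit": "review",
    "webmail": "log_only",
}

def triage(access_log_ips):
    """Return (ip, severity) pairs for logged IPs that match an indicator."""
    hits = []
    for ip in access_log_ips:
        category = CATEGORIZED_INDICATORS.get(ip)
        if category is not None:
            hits.append((ip, SEVERITY[category]))
    return hits

log = ["10.0.0.5", "203.0.113.44", "198.51.100.7"]
print(triage(log))
```

Under this scheme, the Tor exit node hit is queued for human review while only the dedicated infrastructure triggers an alert, exactly the distinction the published flat list could not support.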
On the one hand, one must appreciate that the US Government’s report likely does not encapsulate the entirety of intelligence community knowledge and evidence regarding attribution of the attacks in question due to a need to preserve the security of sensitive sources and methods. On the other hand, this suggests that it may have been better not to release anything at all, rather than a report which relies on such weak indicators.
From a data science perspective, the reaction to the report is a stark reminder of the importance of separating out your evidence and being clear about both the importance of and confidence in each indicator. All too often, data analyses in the academic and commercial sectors present surprising new findings based on a multitude of evidence, only to be quickly dismissed through an attack on their weakest piece of supporting material. The inclusion of even a single questionable indicator can overwhelm discussion of an analysis' far stronger indicators, which suggests analysts should segment the evidence in their appendixes by strength.
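The segmentation being argued for can be sketched minimally: rather than publishing one undifferentiated appendix, group indicators by confidence tier so that critics can challenge the weakest tier without discrediting the strongest. The indicator names and tiers below are invented examples, not items from the actual report.

```python
# Hypothetical sketch: presenting an analysis' evidence grouped by
# confidence tier rather than as one flat list. Indicator names and
# tiers are invented for illustration.
indicators = [
    ("command-and-control infrastructure overlap", "high"),
    ("malware code reuse across known campaigns", "high"),
    ("access from a Tor exit node", "low"),
    ("login to a widely used webmail service", "low"),
]

def group_by_confidence(indicators):
    """Bucket (name, tier) pairs into a dict keyed by confidence tier."""
    tiers = {}
    for name, tier in indicators:
        tiers.setdefault(tier, []).append(name)
    return tiers

grouped = group_by_confidence(indicators)
for tier in ("high", "low"):
    print(tier, "->", grouped[tier])
```

A reader who rejects the low-confidence tier is then still confronted with the high-confidence tier on its own terms, which is precisely the property the flat presentation in the report lacked.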
Putting this all together, the reaction to the US Government's Russian hacking report offers a powerful lesson to all data scientists about the importance of being up front about the strength of your sources and your confidence in them. Sometimes an analysis is built exclusively on a confluence of weak evidence, but all too often trust in a data analysis comes to rest on the weakest piece of evidence used, a stark reminder for data scientists to always play devil's advocate with every analysis.