Using machine learning to detect possible insider threats

Cloud applications are now commonplace in enterprises. From productivity applications to storage, employees and IT departments are realizing the benefits of offloading documents and data into the cloud. But as data, identities, and applications move to the cloud, security teams must manage the risk that comes with losing control of the traditional network perimeter. This is largely a problem of visibility: If data never travels across corporate networks, how can defenders understand what users are doing, whether their activities are legitimate, and whether their accounts have been compromised?

In the Cisco 2018 Annual Cybersecurity Report, we offer insights on the impact of the cloud on user activity, and on how careful analysis of this activity could potentially unearth threats from inside an organization. Using a machine-learning algorithm, Cisco researchers looked at data exfiltration trends for 150,000 users in 34 countries, all using cloud service providers during a 6-month period from January to June 2017. The algorithm took into account not only the volume of documents being downloaded, but also variables such as the time of day of downloads, IP addresses, and locations. In the next 1.5 months, the algorithm started flagging deviations from the norm for each individual user.
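To make the idea of a per-user norm concrete, here is a minimal sketch in Python. It is not Cisco's algorithm, which the report does not describe in detail; it assumes a simple baseline built from one user's past downloads and reports which of the variables mentioned above (volume, time of day, IP address, location) deviate from it. The names DownloadEvent, build_baseline, and deviation_reasons, as well as the z-score threshold, are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class DownloadEvent:
    """One download session for a user (hypothetical record layout)."""
    hour: int      # hour of day, 0-23
    ip: str
    country: str
    n_docs: int    # number of documents downloaded


def build_baseline(history):
    """Summarize a user's past activity: typical volume, usual hours, known IPs and countries."""
    volumes = [e.n_docs for e in history]
    return {
        "mean": mean(volumes),
        "std": pstdev(volumes),
        "hours": {e.hour for e in history},
        "ips": {e.ip for e in history},
        "countries": {e.country for e in history},
    }


def deviation_reasons(baseline, event, z_threshold=3.0):
    """Return the list of variables on which this event departs from the user's baseline.

    The z-score threshold of 3.0 is an illustrative assumption, not a value from the report.
    """
    reasons = []
    std = baseline["std"] or 1.0  # guard against a perfectly constant history
    z = (event.n_docs - baseline["mean"]) / std
    if z > z_threshold:
        reasons.append("volume")
    if event.hour not in baseline["hours"]:
        reasons.append("hour")
    if event.ip not in baseline["ips"]:
        reasons.append("ip")
    if event.country not in baseline["countries"]:
        reasons.append("country")
    return reasons
```

A download of 500 documents at 2 a.m. from an unfamiliar IP address would come back with "volume", "hour", and "ip" for a user who normally fetches about a dozen documents during office hours from one address. A production system would use a richer model, but the principle of comparing each user against their own history is the same.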

Cisco’s algorithm flagged 0.5 percent of users for suspicious downloads. This may seem like a small number, but these users (roughly 750 of the 150,000 studied) downloaded, in total, 3.9 million documents from corporate cloud systems, an average of 5,200 documents per user during the 1.5-month period. Sixty-two percent of the suspicious downloads occurred outside of normal work hours, while forty percent took place on weekends.

On its face, the volume of data might be a concern for many organizations. However, such activity could be normal; the total number alone doesn’t necessarily indicate nefarious behavior by employees or serve as evidence of compromised accounts. To dig deeper, Cisco threat researchers conducted a text-mining analysis on the titles of the 3.9 million documents. The most popular keyword in document titles was “data”; the keywords most commonly appearing with the word “data” were “employee” and “customer.”
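The report does not say which text-mining method was used. As a rough illustration only, the sketch below counts keywords in document titles, picks the most frequent one, and then counts which other words appear in the same titles. The function names, the stopword list, and the sample titles are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real analysis would use a fuller one).
STOPWORDS = {"the", "of", "and", "for", "a", "to", "in"}


def tokenize(title):
    """Lowercase a title and return its distinct alphabetic words, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS}


def top_keyword_and_companions(titles, n=2):
    """Find the most frequent title keyword and the n words that most often appear with it."""
    token_sets = [tokenize(t) for t in titles]
    counts = Counter(w for tokens in token_sets for w in tokens)
    top, _ = counts.most_common(1)[0]
    companions = Counter(
        w for tokens in token_sets if top in tokens for w in tokens if w != top
    )
    return top, companions.most_common(n)


# Made-up titles for demonstration; these are not from the Cisco data set.
sample_titles = [
    "Customer data export",
    "Employee data review",
    "Employee data backup",
    "Customer data summary",
    "Employee data audit",
    "Meeting notes",
]
top, companions = top_keyword_and_companions(sample_titles)
print(top, companions)
```

On these sample titles the most frequent keyword is "data", with "employee" and "customer" as its most common companions, which mirrors the shape of the finding described above.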

The fact that documents with keywords like “employee” and “customer” are being downloaded from commonly used productivity applications could be a sign of data exfiltration, and security teams might want to focus on these outliers for further investigation. But the findings would need to be matched against an analysis of normal user behavior to determine whether a threat from inside is actually present.

Machine-learning algorithms can provide a more nuanced view of user activity across multiple cloud platforms. For example, the algorithm can “learn” the history of each user according to several variables, such as location and time of day for download activity. In Cisco’s analysis, 23 percent of users were flagged more than 3 times for suspicious downloads, usually starting with small numbers of documents. The volume slowly increased each time, and eventually, these users showed a sudden and significant spike in downloads.
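The pattern described here, repeated flags with slowly growing volume followed by a sudden spike, can be expressed as a simple rule. The sketch below is an assumption about how such a check might look, not the method used in the analysis; the function name shows_escalation and the spike_factor value are hypothetical.

```python
def shows_escalation(flag_volumes, min_flags=4, spike_factor=5.0):
    """Check a user's flagged download volumes (oldest first) for a slow ramp then a spike.

    min_flags=4 encodes "flagged more than 3 times"; spike_factor is an
    illustrative threshold, not a figure from the report.
    """
    if len(flag_volumes) < min_flags:
        return False
    *ramp, last = flag_volumes
    # Earlier flags should never decrease in volume.
    rising = all(a <= b for a, b in zip(ramp, ramp[1:]))
    # The final flag should dwarf everything that came before it.
    spike = last >= spike_factor * max(ramp)
    return rising and spike
```

For example, flagged volumes of 20, 35, 60, and then 900 documents would match, while 20, 35, 60, and 80 would not, because the last step is only a modest increase.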

Such activity, especially an increase in downloads that is outside of the norm for that user, could point to a possible threat. Equipped with knowledge of how and when users download information from the cloud, defenders can cut the time it takes to investigate download activity, since they won’t need to examine what is likely legitimate behavior. At the same time, machine learning can help defenders spot patterns that could indicate data exfiltration, and stop it before the damage is done.