There is a lot of talk about the dark web these days, not least about how cyber criminals use it to spread malware, leak intellectual property, and publish user account credentials.
We decided to explore the surface, deep, and dark parts of the web to see what information is available and how it is connected. What we found was that there really is no sharp border between them. Information tends to seep into the surface web from its darker parts, and it is more appropriate to talk about one web, with different shades of darkness. The logic behind this is that brokers of illicit information on the dark web need to market their products, and hence need to post links to them on the surface web (Brian Krebs has noted the same).
People talk about the dark web as a mysterious place, hard to find and inaccessible to normal internet users. We found that there is no sharp border between the surface web and the dark web, and that there are indeed links from the former to the latter. Different parts of the web thus exhibit varying degrees of shadiness, and can be characterized both by their actual content and by what they link to. Conceptually, we might distinguish three levels of the web, each with different characteristics:
Surface Web
- Freely accessible
- Indexed by Google, Bing, and others
- Mostly open, but sometimes behind paywalls
- Fairly stable; content is available from the source for a long time
- Language (mostly) suited for traditional natural language processing (NLP), and tools exist for extracting and analyzing data
Deep Web
- Often behind logins, but accessible to anyone registering
- Database driven, and therefore not indexed by search engines
- Sometimes by invitation only
- Mostly un-indexed by search engines such as Google and Bing
Dark Web
- Not indexed or searchable by Google, Bing, etc.
- Often on other networks such as TOR, Freenet, I2P, etc.
- Frequently behind logins, accessible by invitation only
- Sometimes uses special language like slang or leetspeak, which is not easily analyzed by normal NLP tools
- Volatile, with content that sometimes stays available for only a few minutes (in one study we did, more than 10% of Pastebin posts were removed within 48 hours)
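To illustrate why the slang and leetspeak noted in the list above trip up standard NLP tools, here is a minimal sketch of a normalization step that could run before tokenization. The character mapping and the function names `normalize_leet` and `normalize_text` are assumptions made for illustration, not part of any tool described in this article, and the mapping is lossy (for example, `1` can stand for either "i" or "l").

```python
# Hypothetical, deliberately small substitution table; real-world slang
# needs a much richer, context-aware treatment.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

def normalize_leet(token):
    """Map common leetspeak characters in one token back to letters."""
    # Leave pure numbers (years, prices, counts) untouched.
    if token.isdigit():
        return token
    return token.lower().translate(LEET_MAP)

def normalize_text(text):
    """Normalize every whitespace-separated token in a text."""
    return " ".join(normalize_leet(t) for t in text.split())

if __name__ == "__main__":
    print(normalize_text("h4x0r s3lling fr33 CVV 2015"))
```

A step like this should be applied only to free text. It would corrupt identifiers such as hashes or Onion hostnames, which legitimately mix digits and letters.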
Information tends to seep out even from the darkest corners of the web, if for no other reason than that the information has a value which cannot be realized unless it is possible to find. Therefore it has to be marketed in some way. Wikipedia lists three uses of the dark web (or Darknet):
- Out of privacy concerns or for fear by dissidents of political reprisal
- To publish for criminal gain
- To share media files (sometimes copyrighted files)
A Journey to the Dark Side
What does the linkage into the dark web look like in reality? We used the Recorded Future index to investigate this. Recorded Future collects and analyzes surface web sources, and its index also contains data from forums, blogs, social media, and paste sites that we expect to contain both suspect or threat-related content and links to other parts of the internet (e.g., TOR sites).
As an initial example, we used the TOR Uncensored Hidden Wiki index (http://zqktlwi4fecvo6ri.onion/wiki/index.php/Main_Page) to manually locate a dubious reseller of credit cards (Premium Cards, http://slwc4j5wkn3yyo5j.onion/):
These references all come from Pastebin. One of the pastes, for example, provides an index to several useful “Financial Marketplaces”:
Credit card information with CVVs is a good example of such material. We focused on material published in 2015, and only in Russian. This yielded a small but interesting set of references related to advertising content and advice on how to obtain and use the stolen credit card information:
Thus, there is no doubt illicit material is being marketed not only on the dark web but also on other channels such as paste sites and forums. Some of this content is nefarious enough to get quickly removed, even from Pastebin:
Links From the Surface to the Dark Web
Inspired by the discoveries above, we investigated the linkage from Twitter and Pastebin to TOR/Onion links. It turns out to be fairly low volume: out of 509 million tweets published in Q1 2015 (about 65 million of which had cyber-related content), there were 37 million URLs, but only 499 of those were Onion links. As another example, of 6.7 million Pastebin documents from Q1 2015, with 226 million references in total, there were 8,316 Onion links, but only 1,036 unique links (the links with the most references were to index pages, adult comics, and sellers of cannabis, passports, and ID cards). In general, the number of links to TOR was low in volume, but some of them were high value.
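The kind of count described above can be sketched in a few lines. This is a minimal illustration, not the method used in the Recorded Future index: the function name `count_onion_links` and the regular expression are assumptions. The pattern accepts 16-character hostnames (the format of the addresses quoted in this article) and the newer 56-character format, both drawn from the base32 alphabet (a-z, 2-7).

```python
import re
from collections import Counter

# Assumed pattern: optional scheme, then a 16- or 56-character base32 label
# followed by ".onion".
ONION_RE = re.compile(
    r"\b(?:https?://)?([a-z2-7]{16}|[a-z2-7]{56})\.onion\b",
    re.IGNORECASE,
)

def count_onion_links(documents):
    """Return (total Onion references, Counter of references per Onion host)."""
    hosts = Counter()
    for text in documents:
        for match in ONION_RE.finditer(text):
            hosts[match.group(1).lower() + ".onion"] += 1
    return sum(hosts.values()), hosts

if __name__ == "__main__":
    docs = [
        "index: http://zqktlwi4fecvo6ri.onion/wiki and http://slwc4j5wkn3yyo5j.onion/",
        "mirror at slwc4j5wkn3yyo5j.onion",
        "nothing here: http://example.com/page",
    ]
    total, hosts = count_onion_links(docs)
    print(total, len(hosts))
    print(hosts.most_common(1))
```

The gap between total and unique references is the interesting part. In the Pastebin figures above, 1,036 unique links out of 8,316 references means each Onion address was cited about eight times on average, and 499 Onion links among 37 million URLs on Twitter is roughly 0.001%.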
The Malware Marketplace
We have seen how stolen financial credentials are marketed, but what about tools used by cyber criminals — can those also be found in this borderland? In some cases, the answer is a straightforward “yes.” To download a Remote Access Trojan (RAT) like DarkComet, just Google for instructions and download sites.
For a period of 3.5 months, we took all texts on paste sites and forums that contained a reference to malware and had a link to some other site, and evaluated where each link was directed. Comparing the top link targets with a list of popular file sharing sites for general content, such as http://www.ebizmba.com/articles/file-sharing-websites, we see a mix of “general” file sharing sites and some clearly more focused on shady material. We also note that some very popular file sharing sites, like Dropbox, are missing from the top link list.
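A tally like this one can be approximated as follows. This is a sketch under stated assumptions rather than the analysis actually run: texts are plain strings, a "reference to malware" is a case-insensitive substring match against a list of terms, and the function name `top_link_targets` and the sample hostnames are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Rough URL matcher: stops at whitespace and common closing punctuation.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")

def top_link_targets(texts, malware_terms, n=10):
    """Count link target hosts in texts that mention any of the given terms."""
    terms = [t.lower() for t in malware_terms]
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        if not any(term in lowered for term in terms):
            continue  # no malware reference, so skip this text
        for url in URL_RE.findall(text):
            try:
                host = urlparse(url).hostname
            except ValueError:
                continue  # malformed URL
            if not host:
                continue
            if host.startswith("www."):
                host = host[4:]  # treat www.example and example as one target
            counts[host] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    sample = [
        "DarkComet RAT download http://www.files.example/x or http://share.example/y",
        "darkcomet mirror: https://files.example/z",
        "holiday photos http://photos.example/q",
    ]
    print(top_link_targets(sample, ["DarkComet"]))
```

Grouping by hostname rather than full URL is what makes the comparison with a list of general file sharing sites possible, since individual download links are rarely repeated.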
There are clear borders between the surface, deep, and dark web in terms of accessibility and tools, but there exists information on the surface web and on the deep web that can be used to gain important understanding of what is happening on the dark web. Simple marketing mechanics underlie this: when something needs to be sold, prospective customers need to be able to find information about it quickly. The available information includes topics, link patterns, and activity levels.
As illustrated by the study of mentions of the DarkComet malware, sites such as Pastebin act as a marketing channel by providing a fairly unregulated place for posting both instructions and links to download sites for malware. Using a threat intelligence platform to monitor the activity on paste sites can, therefore, be a good way to get early warning signals for increased use of certain kinds of malware and stolen data or credentials.
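One simple way to turn paste site activity into an early warning signal is to compare each period's mention count with a trailing baseline. The sketch below assumes weekly mention counts for a single malware name have already been collected. The function name `flag_spikes`, the four-week window, and the three-standard-deviation threshold are illustrative choices, not parameters of any particular platform.

```python
from statistics import mean, stdev

def flag_spikes(weekly_counts, window=4, threshold=3.0):
    """Return indices of weeks whose count clearly exceeds the trailing baseline.

    A week is flagged when its count is above the mean of the preceding
    `window` weeks by more than `threshold` standard deviations.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    flagged = []
    for i in range(window, len(weekly_counts)):
        baseline = weekly_counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # With a flat baseline sigma is 0, so require at least one extra mention.
        limit = mu + max(threshold * sigma, 1.0)
        if weekly_counts[i] > limit:
            flagged.append(i)
    return flagged

if __name__ == "__main__":
    mentions = [5, 6, 4, 5, 6, 5, 30, 7]  # made-up weekly counts
    print(flag_spikes(mentions))
```

A trailing window keeps the baseline local, so a tool that gradually becomes popular raises its own baseline and only sudden jumps are flagged.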
Topics also tend to migrate over time, from dark to surface web, and analyzing these patterns allows us to understand when high-end malware tools are becoming commodity malware. Such a shift means the volume of attacks using the commodity malware will increase, but the average skill level of attackers will go down — and the highly skilled attackers will have moved on to using another tool.