Editor’s Note: This post is from GIJN’s upcoming Reporter’s Guide to Investigating Digital Threats. Part one, on disinformation, has already been published. The guide will be released in full this September at the Global Investigative Journalism Conference.
Just as with a legitimate online site, any disinformation campaign or spyware attack relies upon digital infrastructure that includes one or more domains, servers, and applications. Anything running on the internet leaves behind some traces that can be used to track its activity and in some cases link together different infrastructure. This chapter gives an introduction to online tools you can use to investigate digital infrastructure.
How Digital Infrastructure Works
The first step in tracking digital infrastructure is to understand how it works. Let’s take as an example GIJN’s website, gijn.org.
Domain Name
First, it uses the domain name gijn.org. Domain names were established in the early days of the internet to provide user-friendly names for websites, so people wouldn’t have to remember complex technical IP addresses like 174.24.134.42. The Domain Name System (DNS) protocol is used to convert domain names to IP addresses. DNS supports several record types. The main ones are A records, which map a domain to an IPv4 address (the traditional address format, such as 174.24.134.42), and AAAA records, which map it to an IPv6 address. (IPv6 is a more recent format allowing more addresses on the internet; most systems still use both IPv4 and IPv6 addresses.) MX records identify the server that routes emails to the correct address linked to a domain (such as info@gijn.org). Your web browser automatically does the work of resolving the domain name you enter to its IP address when you visit a website, but you can use an online tool like CentralOps to do it manually. For example, when we enter gijn.org in CentralOps, we get the IPv4 address 34.122.151.197 and no IPv6 address.
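The same lookup can be scripted. Here is a minimal sketch in Python using only the standard library; the example result in the comment mirrors the CentralOps lookup above, and live results may differ:

```python
import ipaddress
import socket

def record_type(address: str) -> str:
    """Classify an IP address string as an A (IPv4) or AAAA (IPv6) record."""
    return "A" if ipaddress.ip_address(address).version == 4 else "AAAA"

def resolve(domain: str) -> dict:
    """Resolve a domain to its IPv4/IPv6 addresses (requires network access)."""
    records = {"A": set(), "AAAA": set()}
    for *_, sockaddr in socket.getaddrinfo(domain, None):
        records[record_type(sockaddr[0])].add(sockaddr[0])
    return records

# resolve("gijn.org") would return something like:
# {"A": {"34.122.151.197"}, "AAAA": set()}
```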
Domain names need to be acquired from registrars. These companies manage the registration of domain names for customers and act as intermediaries with registries, which manage top-level domains (TLDs) like .com, .org, or .fr. Registries maintain a database of information about the domains that exist for their TLD, called the Whois database. It is possible to perform a Whois search for information on current domains using web tools like CentralOps. In some cases, your Whois search will provide information about the owner of a domain, including their name, phone number, email address, and physical address. But personal data on domain owners is often hidden from Whois databases for privacy reasons: people can pay to have this information hidden from Whois search results, and many people and companies choose to do so. Even in these cases, it is still possible to find the date of registration, date of renewal, and the registrar used. For instance, here is what we get for a Whois search of gijn.org.
Even if registrant information has been redacted, we can see that the domain was first registered through the registrar GoDaddy on June 24, 2009, and has been renewed regularly since.
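For a sense of what sits behind tools like CentralOps: the Whois protocol itself (RFC 3912) is just a plain-text query over TCP port 43. The sketch below sends a raw query and extracts a few fields of interest. The default server (whois.iana.org, which replies with a referral to the authoritative registry) and the field names parsed are assumptions based on common Whois output, which varies by registry:

```python
import socket

def whois_query(domain: str, server: str = "whois.iana.org") -> str:
    """RFC 3912: open TCP port 43, send the query plus CRLF, read the reply.
    whois.iana.org replies with a referral to the authoritative registry."""
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall(domain.encode() + b"\r\n")
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks).decode(errors="replace")

def parse_whois(text: str) -> dict:
    """Extract a few 'Key: value' fields; exact field names vary by registry."""
    wanted = {"registrar", "creation date", "updated date"}
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in wanted and value.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields
```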
Server
A website needs to be hosted somewhere. This is a physical computer called a server, where all of the files associated with the website are stored and made available whenever someone requests a page on the site via their web browser. Most servers today are hosted by professional hosting providers, such as OVH or DigitalOcean, or by cloud providers like Amazon Web Services or Google Cloud.
Servers are linked to the internet via one or several IP addresses (most of the time it’s via one IPv4 and one IPv6 address). These IP addresses are delegated by Regional Internet Registries to companies or organizations, who use them for their systems. A hosting company will have many IP addresses and will assign them to its various servers, which are used to host individual websites.
Each IP address owner also needs to inform the other networks connected to the internet of the IPs they manage, so that those networks can route traffic in their direction. This requires registering an Autonomous System (AS), an administrative entity recognized by all the internet networks and identified by a unique number. For example, AS1252 is the number of UNMC-AS, the Autonomous System of the University of Nebraska Medical Center. A fairly comprehensive list of AS numbers exists online. Most hosting companies own one or several ASes.
A tool like ipinfo.io allows you to identify the AS of an IP address, the company behind it, and an estimate of the location of the server to which the IP is connected. Note that this geolocation info is not perfectly accurate. For GIJN’s 34.122.151.197, we see that it is part of AS396982, which belongs to Google, and appears to be located in Google’s Iowa data center. Whois searches can also be run on IP addresses, which sometimes yield information more precise than the AS alone, though not in this example. A tool like ipinfo.io will give you the most complete results.
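ipinfo.io also exposes a JSON endpoint that can be queried directly. A hedged sketch follows; the URL pattern and the layout of the "org" field (AS number followed by the organization name) reflect ipinfo.io's unauthenticated API as commonly documented, and unauthenticated requests are rate-limited:

```python
import json
import urllib.request

def ip_info(ip: str) -> dict:
    """Fetch AS and geolocation details for an IP from ipinfo.io (network required)."""
    with urllib.request.urlopen(f"https://ipinfo.io/{ip}/json", timeout=10) as resp:
        return json.load(resp)

def as_number(info: dict) -> str:
    """The 'org' field starts with the AS number, e.g. 'AS396982 Google LLC'."""
    return info.get("org", "").split(" ", 1)[0]

# as_number(ip_info("34.122.151.197")) would return "AS396982"
```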
HTTPS Certificate
Hypertext Transfer Protocol Secure (HTTPS) is a secure protocol used for communication between a web browser and the server hosting a website. It enables the browser to verify the identity of the server using a cryptographic certificate. This helps ensure that the browser is loading the real gijn.org, and not a server impersonating it. Each cryptographic certificate is issued by a third-party certificate authority that is recognized by the different browsers and operating systems. These certificates are issued for a limited period of time (generally between three months and a year) and need to be renewed regularly. To view a certificate, you can click on the lock icon in your browser bar, and select “connection is secure” and “more information.” Here is what we get for the GIJN website.
We see this certificate was provided by the free certificate authority Let’s Encrypt on February 20, 2023 and will be valid until May 21, 2023. If the lock icon in your browser bar appears unlocked, or if there is no lock icon and the browser shows a “not secure” warning, it means you are browsing a website over the insecure HTTP protocol, which neither encrypts the communication with the server nor verifies its authenticity.
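You can also retrieve a certificate programmatically rather than through the browser. This sketch uses Python's standard ssl module; getpeercert() returns the validated certificate's fields, including the notBefore/notAfter validity dates discussed above:

```python
import socket
import ssl

def get_certificate(hostname: str, port: int = 443) -> dict:
    """Connect over TLS and return the server's validated certificate fields."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()

def validity_period(cert: dict) -> tuple:
    """Return the (notBefore, notAfter) validity dates from a parsed certificate."""
    return cert["notBefore"], cert["notAfter"]

# validity_period(get_certificate("gijn.org")) would return something like
# ("Feb 20 00:00:00 2023 GMT", "May 21 23:59:59 2023 GMT")
```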
Here is a diagram summarizing the different aspects of this infrastructure.
Let’s summarize what we learned about gijn.org:
- It uses the domain gijn.org that was initially bought on GoDaddy on June 24, 2009.
- It is hosted on a server with IP address 34.122.151.197 that is part of AS396982, which belongs to Google Cloud.
- It uses an HTTPS certificate, most recently issued by Let’s Encrypt on February 20, 2023.
Sources of Data
Now that we understand the basics of a digital infrastructure, let’s look at how we can look into it further. There are multiple data sources that can be used for a more in-depth investigation. Some of these tools are free, and some require paid access. (Some platforms provide free research access for journalists, so it’s worth reaching out to ask.)
Whois and Historical Whois
As we saw earlier, Whois domain records can display information such as a name, phone number, email, or address, but this information is often hidden for privacy reasons. (The EU’s General Data Protection Regulation, or GDPR, accelerated this trend.) The good news is that several commercial platforms have been collecting Whois data for years and can provide access to these databases. This is useful in several ways. First, by using historical data, you can go back to a time before the domain owner enabled privacy protection and find their information. This is mostly useful for websites that have been online for a long time, meaning at least a few years. You can also use this ownership information as a pivot point to find additional domains registered by the same person or entity.
For example, in 2019 I was investigating a phishing and spyware campaign targeting activists from Uzbekistan. Using historical domain records, I identified that a domain used for phishing was registered with the email address b.adan1[@]walla.co.il. It turns out the attacker didn’t think of enabling Whois privacy.
By searching for other domains registered using the same email address, I was able to identify many more domains related to this online campaign.
Commercial platforms that provide historical information include RiskIQ, DomainTools, Recorded Future, and Cisco Umbrella. Services such as Whoxy.com and Whoisology.com that have a free tier also sometimes provide snippets of historical records.
Passive DNS Information
As detailed earlier, the DNS protocol allows you to find the server’s IP address for a domain at a given time. To follow how infrastructure evolves, people and companies collect records of DNS queries and answers, preserving the historical resolutions over time. This type of data is called passive DNS. It is the equivalent of historical Whois records, but for DNS.
Passive DNS is an important tool to track infrastructure. Many malicious online sites are temporary and may only be live for a couple of days or weeks. As such, having historical data allows us to gain a much better understanding of the domains and servers used. It also makes it possible to track digital infrastructure over a long period of time, helping us understand when the malicious activity started.
Passive DNS data is typically presented in the form of an IP, domain, start date, and end date. Most platforms allow searching per IP or domain, and some platforms include more DNS types than just A/AAAA.
To continue with the phishing campaign example started above, one of the first phishing emails identified had a link to the domain mail.gmal.con.my-id[.]top. To identify the servers used, we can search for all IP resolutions for this domain in a Passive DNS Database like Farsight DNSDB.
We can then search for domains hosted on this same IP address around the time of the attack.
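Each passive DNS provider has its own API, but the pivot itself is simple set logic over (domain, IP, first seen, last seen) records. The sketch below uses made-up records and reserved documentation IP ranges to illustrate the "which other domains were on this IP at that time" query:

```python
from datetime import date

# Hypothetical passive DNS records: (domain, IP, first seen, last seen).
# 198.51.100.0/24 and 203.0.113.0/24 are reserved documentation ranges.
RECORDS = [
    ("mail.gmal.con.my-id[.]top", "198.51.100.7", date(2019, 3, 1), date(2019, 4, 2)),
    ("other-phish.example", "198.51.100.7", date(2019, 3, 10), date(2019, 3, 20)),
    ("unrelated.example", "203.0.113.9", date(2018, 1, 1), date(2018, 2, 1)),
]

def domains_on_ip(ip: str, start: date, end: date) -> list:
    """Find domains that resolved to a given IP within a time window of interest."""
    return sorted(
        domain
        for domain, record_ip, first, last in RECORDS
        if record_ip == ip and first <= end and last >= start
    )

# Pivot: which domains shared the phishing server around the attack?
print(domains_on_ip("198.51.100.7", date(2019, 3, 1), date(2019, 4, 30)))
# → ['mail.gmal.con.my-id[.]top', 'other-phish.example']
```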
Passive DNS providers include Farsight DNSDB, DomainTools, RiskIQ, CIRCL, Zetalytics, Recorded Future, Cisco Umbrella, and SecurityTrails. Different providers have different data sources for passive DNS collection, so most datasets are incomplete and complementary. You ideally want to use multiple services to get a more complete picture. The same is true for historical Whois records.
Certificate Transparency Databases
Just as every website has a domain name and IP address(es), most also use an HTTPS certificate. This means we can use information on certificates as part of an infrastructure investigation. Certificates are available for auditing thanks to a security standard called Certificate Transparency, which creates public logs of all certificates issued by authorities. Platforms like Censys or Crt.sh provide free access to this data. Certificates do not provide many details on who created them, but you can confirm whether a certificate covered a given domain or subdomain, and build a timeline of when such domains were in use.
The phishing campaign targeting activists in Uzbekistan used Android spyware that was communicating with the domain garant-help[.]com. A quick search in Crt.sh gives us a timeline of when this domain (and thus the spyware) was actually actively used by the operators of the campaign.
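Crt.sh also has a JSON output mode that makes this kind of timeline easy to script. A sketch follows; the query URL and the not_before/issuer_name field names match crt.sh's JSON output as observed, but the service offers no formal API guarantee:

```python
import json
import urllib.request

def certificates_for(domain: str) -> list:
    """Query crt.sh's JSON endpoint for certificates covering a domain."""
    url = f"https://crt.sh/?q={domain}&output=json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def issuance_timeline(certs: list) -> list:
    """Reduce crt.sh entries to a chronological (not_before, issuer_name) list."""
    return sorted((c["not_before"], c["issuer_name"]) for c in certs)
```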
Internet-Wide Scan
The internet consists of a few billion interconnected systems. In fact, there are only around four billion possible IPv4 addresses in total. With the bandwidth available today, it is possible to regularly scan a large part of the internet’s systems. Several companies do this type of internet-wide scan regularly and provide access to databases with the results.
The scans have limitations, as not all services are scanned by these platforms, and they only do standard requests, which would not, for instance, give information on all websites installed on a given server. But it provides an important source of information in digital investigations. First, it allows you to quickly look at what is running on a server that may be suspicious and get an idea of the infrastructure setup. Some databases also have historical data that allows you to explore what was running before on a server. Finally, it can be used to develop complex queries in order to find related infrastructure using the same specific setup. This last feature can be critical for research: the Amnesty Tech Lab used it to track NSO Group’s Pegasus infrastructure over several years. As a journalist, it may be useful to collaborate with technical experts to execute such tracking and analysis.
The two major platforms for internet-wide scans are Shodan and Censys, but other platforms like ZoomEye, BinaryEdge, or Onyphe can also be used. Most provide free access to data, but charge for historical data and complex queries.
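As an illustration, Shodan's REST API returns a JSON report per IP. The sketch below assumes you have an API key (the placeholder below is not real); the host endpoint and its "ports" field follow Shodan's documented API:

```python
import json
import urllib.request

SHODAN_API_KEY = "YOUR_API_KEY"  # placeholder: requires a Shodan account

def host_report(ip: str) -> dict:
    """Fetch Shodan's collected scan data for a single IP address."""
    url = f"https://api.shodan.io/shodan/host/{ip}?key={SHODAN_API_KEY}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def open_ports(report: dict) -> list:
    """Host reports include a 'ports' list of open ports seen by the scanner."""
    return sorted(report.get("ports", []))
```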
Databases of Malicious Activity
Many platforms exist to identify, track, or index known malicious infrastructure. They are widely used by the cybersecurity industry. These platforms may also hold information on non-malicious or malicious-adjacent infrastructure (like disinformation), which makes them useful for journalists. Here’s a look at some of these platforms.
VirusTotal. This famous antivirus platform was created almost 20 years ago in Spain and was later acquired by Google. It allows anyone to submit a file and have it scanned by more than 70 antivirus scanners and URL/domain blocklisting services. VirusTotal is the world’s largest repository of legitimate and malicious files, and provides access to this internal virus database to many cybersecurity companies. If you are working on a spyware investigation, VirusTotal is a good place to search for similar programs or related infrastructure. If you use VirusTotal to check whether a file you received is malicious, keep in mind that uploaded documents become available to thousands of people around the world, so uploading a private document is a bad idea. It’s a good idea to reach out to experts to assist with this kind of analysis.
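One way to avoid the upload problem is to hash the file locally and search VirusTotal for the hash instead, which reveals nothing about the document itself. A sketch using VirusTotal's v3 API (an API key is required; the endpoint and header follow VirusTotal's public documentation):

```python
import hashlib
import json
import urllib.request

def sha256_of(path: str) -> str:
    """Hash the file locally so it never has to be uploaded anywhere."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def vt_lookup(file_hash: str, api_key: str) -> dict:
    """Look up an existing VirusTotal report by hash (API v3, key required).
    A 404 response simply means the file has never been submitted."""
    req = urllib.request.Request(
        f"https://www.virustotal.com/api/v3/files/{file_hash}",
        headers={"x-apikey": api_key},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```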
URLScan. URLScan is an open platform that lets users submit a URL and safely see details about the site and its infrastructure without visiting it themselves. It is useful when you identify a suspicious link and want to check it without exposing yourself. It can also surface related URLs that someone else may have submitted to the platform. Scans can be either public or private, although private scans are only available to users with paid access.
AlienVault OTX. This is a free database containing a significant amount of data on infrastructure that has been identified as malicious. You don’t even need an account to search its database, just enter a domain or IP address in the search bar. A search for the malicious domain garant-help[.]com, for instance, immediately led to a related publication.
The below diagram summarizes the type of tools you can use to examine each part of a digital infrastructure.
Case Studies
Mandiant report about the APT1 threat group. In 2013, the US company Mandiant attributed the activity of a threat actor called APT1 to Unit 61398 of the Chinese People’s Liberation Army. This Chinese military group had been active since at least 2006 and was responsible for compromising at least 141 organizations.
Investigation of Vietnamese group Ocean Lotus. Journalists from German public broadcaster Bayerischer Rundfunk and Zeit Online did great work investigating the infrastructure used by Ocean Lotus, a threat group generally considered to be connected to the Vietnamese authorities. This investigation mixed human sourcing with technical investigation of the domains and servers used by the group.
Citizen Lab report on the surveillance firm Circles. By using custom, internet-wide scanning, Citizen Lab was able to identify the configuration used by Circles to serve its customers. It allowed Citizen Lab to identify 25 governments that were customers of the Israeli surveillance company.
Togo: Hackers-for-Hire in West Africa. In October 2021, Amnesty International’s Security Lab published a report about a spyware attack against a Togolese activist. The attack was then linked to an Indian company called Innefu Labs. The attribution here is an interesting example of how technical mistakes in an attacker’s infrastructure can be used to identify the actor behind an attack.
Additional Resources
Investigating Digital Threats: Disinformation
Reporter’s Guide to Investigating Organized Crime — Cybercrime
GIJN Resource: Digital Security
Etienne “Tek” Maynier is a security researcher at Amnesty International’s Security Lab. He has been investigating digital attacks against civil society since 2016, and has published many investigations on phishing, spyware, and disinformation campaigns. He can be found on his website or on Mastodon.