Using multi-threaded Python for large-scale DNS name resolution
I was running a honeypot back in 2020 and an attacker kindly left a BIG list of hostnames behind. I wanted to resolve the hostnames in DNS to get their IP addresses, but there were over half a million names in the list and my initial attempt at resolving them all was too slow, so I had to switch things up a bit.
First the worst
My initial attempt looked a bit like this: a single-threaded script that read the text file and dumped everything to the screen. Unfortunately, network interactions are slow compared to local processing, and it was taking far too long to finish.
# Single-threaded DNS-to-IP hostname resolver
# Version 1
# Richard Atkin

import socket
import datetime


def resolveDns():
    filename = "C:\\Users\\richa\\Documents\\hostnames.txt"
    with open(filename) as file:
        hostnames = file.readlines()
    hostnames = [line.rstrip() for line in hostnames]

    start = datetime.datetime.now()
    for host in hostnames:
        try:
            print(f"{host}: {socket.gethostbyname(host)}")
        except Exception as e:
            print(f"{host}: {e}")
            continue
    end = datetime.datetime.now()
    duration = end - start

    print(" ")
    print(f"Time taken: {duration}")
    print("")


if __name__ == "__main__":
    resolveDns()
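As an aside (my own suggestion, not part of the original script), `time.perf_counter()` is a handy alternative to `datetime` subtraction for this kind of elapsed-time measurement, since it is a monotonic clock intended for benchmarking. A minimal sketch, with a cheap loop standing in for the DNS work:

```python
import time

start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the DNS resolution loop
elapsed = time.perf_counter() - start

print(f"Time taken: {elapsed:.4f}s")
```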
Second the best
To get around the performance issue, I decided I needed a way of issuing multiple DNS requests in parallel, and that the best way to do this was via multi-threading. With a multi-threaded approach, I could divide and conquer: by splitting the list up into multiple smaller chunks and dedicating one thread to each chunk, I figured I could get the job done faster. I'd never used threading in Python before (and haven't since), so while it's far from pretty, it gets the job done.
# Multi-threaded DNS-to-IP hostname resolver
# Version 2
# Richard Atkin

import threading
import socket
import datetime


def resolveDns(hostnames):
    for host in hostnames:
        try:
            print(f"{host}: {socket.gethostbyname(host)}")
        except Exception as e:
            print(f"{host}: {e}")
            continue


if __name__ == "__main__":
    filename = "C:\\Users\\richa\\Documents\\hostnames.txt"
    with open(filename) as file:
        hostnames = file.readlines()
    hostnames = [line.rstrip() for line in hostnames]

    start = datetime.datetime.now()
    threads = list()
    chunksize = 100
    # Split the hostname list into chunks of up to `chunksize` names each
    chunks = [hostnames[i:i + chunksize] for i in range(0, len(hostnames), chunksize)]
    for chunk in chunks:
        x = threading.Thread(target=resolveDns, args=(chunk,))
        threads.append(x)
        x.start()
    # Wait for every worker thread to finish before stopping the clock
    for thread in threads:
        thread.join()
    end = datetime.datetime.now()
    duration = end - start

    print(" ")
    print(f"Time taken: {duration}")
    print("")
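For what it's worth, the standard library's `concurrent.futures.ThreadPoolExecutor` can handle the chunking and thread lifecycle for you. A minimal sketch of the same idea with a fixed pool of worker threads (the single-entry `hostnames` list is a placeholder; the real list would come from the file as above):

```python
from concurrent.futures import ThreadPoolExecutor
import socket

def resolve(host):
    # Return (hostname, IP) on success, or (hostname, error text) on failure
    try:
        return host, socket.gethostbyname(host)
    except OSError as e:
        return host, str(e)

hostnames = ["localhost"]  # placeholder; the real list comes from the file

# The pool feeds hostnames to up to 50 worker threads as they become free
with ThreadPoolExecutor(max_workers=50) as pool:
    results = dict(pool.map(resolve, hostnames))

for host, ip in results.items():
    print(f"{host}: {ip}")
```

One nice property of this shape is that slow names don't hold up a whole chunk: each worker pulls the next hostname as soon as it finishes the last one.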
The proof is in the numbers
The single-threaded approach takes 73.21 seconds to resolve a list of ~1000 hostnames on my current system, but the multi-threaded approach completes the same task in just 3.96 seconds!
This is obviously a huge performance improvement, but it got me thinking: if I used smaller chunk sizes and more threads, could I get it even lower? Well…
| Chunk size | Completion time (seconds) |
|---|---|
| 200 | 4.73 |
| 100 | 3.96 |
| 50 | 1.11 |
| 25 | 0.5761 |
| 10 | 0.5969 |
| 5 | 0.5659 |
| 2 | 0.5951 |
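One thing worth spelling out (my own observation, not part of the measurements above): the chunk size determines the thread count, so shrinking the chunks quickly means hundreds of threads. For ~1000 hostnames:

```python
from math import ceil

n_hosts = 1000
# Each chunk gets its own thread, so thread count = ceil(hosts / chunksize)
for chunksize in (200, 100, 50, 25, 10, 5, 2):
    print(f"chunksize {chunksize:>3} -> {ceil(n_hosts / chunksize)} threads")
```

That jump from 40 threads (chunksize 25) to 500 threads (chunksize 2) likely explains why the times plateau: past a certain point, thread start-up overhead and the resolver itself dominate rather than the parallelism.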
Just over half a second to resolve 1000 hostnames is a huge improvement on the original 73 seconds. This is a great example of how spending a little time optimising some code can bring huge performance benefits.
GitHub
This code is available on GitHub: