Using multi-threaded Python for large-scale DNS name resolution
I was running a honeypot back in 2020 and an attacker kindly left a BIG list of hostnames behind. I wanted to resolve the hostnames in DNS to get their IP addresses, but there were over half a million names in the list and my initial attempt at resolving them all was too slow, so I had to switch things up a bit.
First the worst
My initial attempt looked a bit like this: a single-threaded script that read the text file and dumped everything to the screen. Unfortunately, network interactions are slow compared to local processing, and it was taking far too long to finish.
# Single-threaded DNS-to-IP hostname resolver
# Version 1
# Richard Atkin

import socket
import datetime


def resolveDns():
    filename = "C:\\Users\\richa\\Documents\\hostnames.txt"
    with open(filename) as file:
        hostnames = file.readlines()
    hostnames = [line.rstrip() for line in hostnames]

    start = datetime.datetime.now()
    for host in hostnames:
        try:
            print(f"{host}: {socket.gethostbyname(host)}")
        except Exception as e:
            print(f"{host}: {e}")
            continue
    end = datetime.datetime.now()
    duration = end - start

    print(" ")
    print(f"Time taken: {duration}")
    print("")


if __name__ == "__main__":
    resolveDns()
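As an aside (my own suggestion, not part of the original script), `time.perf_counter()` is a handy alternative to `datetime` subtraction for this kind of elapsed-time measurement, since it is a monotonic clock intended for benchmarking. A minimal sketch, with a cheap loop standing in for the DNS work:

```python
import time

start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the DNS resolution loop
elapsed = time.perf_counter() - start

print(f"Time taken: {elapsed:.4f}s")
```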
Second the best
To get around the performance issue, I decided I needed a way of issuing multiple DNS requests in parallel, and that the best way to do this was via multi-threading. With a multi-threaded approach, I could divide and conquer: by splitting the list up into multiple smaller chunks and dedicating one thread to each chunk, I figured I could get the job done faster. I'd never used threading in Python before (and haven't since), so while it's far from pretty, it gets the job done.
# Multi-threaded DNS-to-IP hostname resolver
# Version 2
# Richard Atkin

import threading
import socket
import datetime


def resolveDns(hostnames):
    for host in hostnames:
        try:
            print(f"{host}: {socket.gethostbyname(host)}")
        except Exception as e:
            print(f"{host}: {e}")
            continue


if __name__ == "__main__":
    filename = "C:\\Users\\richa\\Documents\\hostnames.txt"
    with open(filename) as file:
        hostnames = file.readlines()
    hostnames = [line.rstrip() for line in hostnames]

    start = datetime.datetime.now()
    threads = list()
    chunksize = 100
    # Split the hostname list into chunks of up to `chunksize` names each
    chunks = [hostnames[i:i + chunksize] for i in range(0, len(hostnames), chunksize)]
    for chunk in chunks:
        x = threading.Thread(target=resolveDns, args=(chunk,))
        threads.append(x)
        x.start()
    # Wait for every worker thread to finish before stopping the clock
    for thread in threads:
        thread.join()
    end = datetime.datetime.now()
    duration = end - start

    print(" ")
    print(f"Time taken: {duration}")
    print("")
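For what it's worth, the standard library's `concurrent.futures.ThreadPoolExecutor` can handle the chunking and thread lifecycle for you. A minimal sketch of the same idea with a fixed pool of worker threads (the single-entry `hostnames` list is a placeholder; the real list would come from the file as above):

```python
from concurrent.futures import ThreadPoolExecutor
import socket

def resolve(host):
    # Return (hostname, IP) on success, or (hostname, error text) on failure
    try:
        return host, socket.gethostbyname(host)
    except OSError as e:
        return host, str(e)

hostnames = ["localhost"]  # placeholder; the real list comes from the file

# The pool feeds hostnames to up to 50 worker threads as they become free
with ThreadPoolExecutor(max_workers=50) as pool:
    results = dict(pool.map(resolve, hostnames))

for host, ip in results.items():
    print(f"{host}: {ip}")
```

One nice property of this shape is that slow names don't hold up a whole chunk: each worker pulls the next hostname as soon as it finishes the last one.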
The proof is in the numbers
The single-threaded approach takes 73.21 seconds to resolve a list of ~1000 hostnames on my current system, but the multi-threaded approach completes the same task in just 3.96 seconds!
This is obviously a huge performance improvement, but it got me thinking: if I used smaller chunk sizes and more threads, could I get it even lower? Well…
| Chunk size | Completion time (seconds) |
|---|---|
| 200 | 4.73 |
| 100 | 3.96 |
| 50 | 1.11 |
| 25 | 0.5761 |
| 10 | 0.5969 |
| 5 | 0.5659 |
| 2 | 0.5951 |
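One thing worth spelling out (my own observation, not part of the measurements above): the chunk size determines the thread count, so shrinking the chunks quickly means hundreds of threads. For ~1000 hostnames:

```python
from math import ceil

n_hosts = 1000
# Each chunk gets its own thread, so thread count = ceil(hosts / chunksize)
for chunksize in (200, 100, 50, 25, 10, 5, 2):
    print(f"chunksize {chunksize:>3} -> {ceil(n_hosts / chunksize)} threads")
```

That jump from 40 threads (chunksize 25) to 500 threads (chunksize 2) likely explains why the times plateau: past a certain point, thread start-up overhead and the resolver itself dominate rather than the parallelism.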
Just over half a second to resolve 1000 hostnames is a huge improvement on the original 73 seconds. This is a great example of how spending a little time optimising some code can bring huge performance benefits.
GitHub
This code is available on GitHub: