Bulk WHOIS lookup of 20,000 domains - getting timeouts


I am trying to bulk-extract WHOIS information for 20,000 domain names. The Python code below works when my CSV file contains only 2 domain names, but running it against the full dataset of 20,000 produces timeout errors.

import whois
import matplotlib.pyplot as plt
import numpy as np  
import pandas as pd  
import socket
import os
import csv 
import datetime
import time
import requests
from ipwhois import IPWhois
from urllib import request
from ipwhois.utils import get_countries
import tldextract
from ipwhois.experimental import bulk_lookup_rdap
from ipwhois.hr import (HR_ASN, HR_ASN_ORIGIN, HR_RDAP_COMMON, HR_RDAP, HR_WHOIS, HR_WHOIS_NIR)
countries = get_countries(is_legacy_xml=True)
import ipaddress

df = pd.read_csv('labelled_dataset.csv')

# Timeout setting: a timeout set on one unused socket object has no effect on
# other lookups; socket.setdefaulttimeout() applies to sockets created afterwards
socket.setdefaulttimeout(10)

#Date Processing Function

def check_date_type(d):
    # whois may return a single datetime or a list of datetimes
    if type(d) is datetime.datetime:
        return d
    if type(d) is list:
        return d[0]
    return None

for index, row in df.iterrows():

    DN = df.iloc[index]['Domains']

    df['IPaddr'] = socket.gethostbyname(DN)
    df['IPcity'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['city']
    df['ASNumber'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['asn']
    df['NetAddr'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['address']
    df['NetCity'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['city']
    df['NetPostCode'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['postal_code']
    W = whois.whois(DN)
    df['WebsiteName'] = W.name
    df['ASRegistrar'] = W.registrar
    df['CtryCode'] = W.country
    df['Dstatus'] = W.status[1]
    df['RegDate'] = check_date_type(W.creation_date)
    df['ExDate'] = check_date_type(W.expiration_date)

df.to_csv('extracted_dataset_1_1.csv', index=False)

Expected output: ASN details and WHOIS information for each domain name, exported to a CSV file.

1 Answer

Answered by cam8001:

You're creating a new IPWhois object for every property you look up. That means you are running at least 5 WHOIS queries (and 6 DNS resolutions) per iteration.

That generates a lot of network traffic and is totally unnecessary: you can run the lookup once per domain and read each field from the result.

Try changing the code in your loop to something like this:

ip = socket.gethostbyname(DN)
ipwhois = IPWhois(ip, allow_permutations=True).lookup_whois()
if ipwhois:
    # df.at writes a single cell; df['col'] = value would overwrite
    # the whole column on every pass through the loop
    df.at[index, 'IPaddr'] = ip
    df.at[index, 'IPcity'] = ipwhois['nets'][0]['city']
    df.at[index, 'ASNumber'] = ipwhois['asn']
    df.at[index, 'NetAddr'] = ipwhois['nets'][0]['address']
    df.at[index, 'NetCity'] = ipwhois['nets'][0]['city']
    df.at[index, 'NetPostCode'] = ipwhois['nets'][0]['postal_code']

There are some other optimisations I'd suggest:

  • Write to your file on every iteration, or every n iterations, so that you work incrementally and don't lose your results if your code errors part-way through 20,000 rows (see the first sketch below).
  • Use one library - IPWhois or whois - not both.
  • Look at using asyncio. At present, your code has to wait for a response from each whois query before continuing, and a network round trip is many orders of magnitude slower than the rest of your loop body. With an asynchronous model, you can fire off multiple whois queries and act on the results as they arrive (see the second sketch below).
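
A minimal sketch of the incremental-write idea, assuming a hypothetical lookup_row() helper that performs the per-domain lookups above and returns a dict of results; pandas can append each chunk with mode='a':

import os
import pandas as pd

CHUNK = 100   # flush results to disk every 100 domains
OUT = 'extracted_dataset_1_1.csv'

buffer = []
for index, row in df.iterrows():
    buffer.append(lookup_row(row['Domains']))   # hypothetical per-domain lookup helper
    if len(buffer) >= CHUNK:
        # append the chunk; write the header only if the file doesn't exist yet
        pd.DataFrame(buffer).to_csv(OUT, mode='a', index=False,
                                    header=not os.path.exists(OUT))
        buffer.clear()

if buffer:   # flush whatever is left after the loop
    pd.DataFrame(buffer).to_csv(OUT, mode='a', index=False,
                                header=not os.path.exists(OUT))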
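
And a minimal sketch of the asyncio approach, assuming the blocking whois library from the question: the blocking calls are pushed onto a thread pool with run_in_executor so several queries are in flight at once, and a semaphore caps the concurrency:

import asyncio
import whois

async def lookup(domain, sem):
    async with sem:   # cap the number of in-flight queries
        loop = asyncio.get_running_loop()
        try:
            # whois.whois blocks, so run it on the default thread pool
            return domain, await loop.run_in_executor(None, whois.whois, domain)
        except Exception as exc:
            return domain, exc   # record the failure and keep going

async def run_all(domains):
    sem = asyncio.Semaphore(20)
    return await asyncio.gather(*(lookup(d, sem) for d in domains))

results = asyncio.run(run_all(df['Domains'].tolist()))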