I am trying to bulk-extract WHOIS information for 20,000 domain names. The Python code works with 2 items in my CSV file but raises errors with the whole dataset of 20,000 domain names.
I tried it with 2 domain names and it ran fine. With the full list of 20k domain names it fails with errors.
import whois
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import socket
import os
import csv
import datetime
import time
import requests
from ipwhois import IPWhois
from urllib import request
from ipwhois.utils import get_countries
import tldextract
from ipwhois.utils import get_countries
countries = get_countries(is_legacy_xml=True)
from ipwhois.experimental import bulk_lookup_rdap
from ipwhois.hr import (HR_ASN, HR_ASN_ORIGIN, HR_RDAP_COMMON, HR_RDAP, HR_WHOIS, HR_WHOIS_NIR)
countries = get_countries(is_legacy_xml=True)
import ipaddress
df = pd.read_csv('labelled_dataset.csv')
#TimeOut Setting
s = socket.socket()
s.settimeout(10)
#Date Processing Function
def check_date_type(d):
    if type(d) is datetime.datetime:
        return d
    if type(d) is list:
        return d[0]
for index,row in df.iterrows():
    DN = df.iloc[index]['Domains']
    df['IPaddr'] = socket.gethostbyname(DN)
    df['IPcity'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['city']
    df['ASNumber'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['asn']
    df['NetAddr'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['address']
    df['NetCity'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['city']
    df['NetPostCode'] = IPWhois(socket.gethostbyname(DN), allow_permutations=True).lookup_whois()['nets'][0]['postal_code']
    W = whois.whois(DN)
    df['WebsiteName'] = W.name
    df['ASRegistrar'] = W.registrar
    df['CtryCode'] = W.country
    df['Dstatus'] = W.status[1]
    df['RegDate'] = check_date_type(W.creation_date)
    df['ExDate'] = check_date_type(W.expiration_date)
df.to_csv('extracted_dataset_1_1.csv', index=False)
Expected output: the ASN details and WHOIS information for each domain name, exported to a CSV file.
You're creating a new IPWhois object for every property you are looking up. That means you are running at least 5 whois queries per iteration.
That's going to generate a lot of network traffic, and it is totally unnecessary: you can just run the whois once per domain and access the results as members. Try changing the code in your loop so that it performs each lookup once per domain and reads every field from that single result.
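A minimal sketch of that single-lookup approach, under a few assumptions: the helper names (`lookup_domain`, `query_ip_whois`, `query_domain_whois`, `first_or_value`) are made up for illustration, the lookups are passed in as parameters so they can be swapped or stubbed, and the first status entry is kept rather than `status[1]` as in the question. A failure on one domain (for example a name that no longer resolves) is recorded in an `error` field instead of stopping the run, which may or may not be what is breaking the 20,000-domain run.

```python
import socket


def first_or_value(value):
    """Return the first item when a field comes back as a list, else the value itself."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def query_ip_whois(ip):
    # Imported lazily so the rest of the sketch can be exercised without the package.
    from ipwhois import IPWhois
    return IPWhois(ip).lookup_whois()


def query_domain_whois(domain):
    import whois
    return whois.whois(domain)


def lookup_domain(domain, resolve=socket.gethostbyname,
                  ip_whois=query_ip_whois, domain_whois=query_domain_whois):
    """Run each lookup once for `domain` and return one flat row (a dict)."""
    row = {"Domains": domain}
    try:
        ip = resolve(domain)              # one DNS resolution
        ip_info = ip_whois(ip)            # one IP WHOIS query, reused for every field
        net = (ip_info.get("nets") or [{}])[0]
        w = domain_whois(domain)          # one domain WHOIS query
        row.update({
            "IPaddr": ip,
            "ASNumber": ip_info.get("asn"),
            "NetAddr": net.get("address"),
            "NetCity": net.get("city"),
            "NetPostCode": net.get("postal_code"),
            "ASRegistrar": getattr(w, "registrar", None),
            "CtryCode": getattr(w, "country", None),
            "Dstatus": first_or_value(getattr(w, "status", None)),
            "RegDate": first_or_value(getattr(w, "creation_date", None)),
            "ExDate": first_or_value(getattr(w, "expiration_date", None)),
        })
    except Exception as exc:  # broad on purpose: one bad domain should not stop the run
        row["error"] = f"{type(exc).__name__}: {exc}"
    return row
```

The rows could then be collected with `rows = [lookup_domain(d) for d in df['Domains']]` and written out with `pd.DataFrame(rows).to_csv(...)`. Building one row per domain also avoids `df['IPaddr'] = value`, which assigns a single value to the entire column on every iteration.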
There are some other optimisations I'd suggest:
- Use either IPWhois or whois, not both.
- Consider an asynchronous model. Your code currently waits for each whois query before continuing, and a network query is many orders of magnitude slower than the rest of the code in each iteration of your loop. With an asynchronous model, you can fire off multiple whois queries and only act on the results when they arrive. This model could help optimise your application's efficiency.
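One way to get that overlap without rewriting the blocking lookups is a thread pool, which is a substitute for a true async model rather than the same thing. The sketch below assumes a hypothetical `lookup` callable that takes a domain name and returns a dict for it. The function name `run_concurrently` and the default `max_workers=20` are illustrative choices, not recommendations from either library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrently(domains, lookup, max_workers=20):
    """Call `lookup(domain)` for every domain using a pool of worker threads.

    Results are returned in the same order as `domains`, even though the
    individual lookups finish in whatever order the network allows.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every lookup up front; the pool runs up to max_workers at once.
        futures = {pool.submit(lookup, domain): domain for domain in domains}
        for future in as_completed(futures):
            domain = futures[future]
            try:
                results[domain] = future.result()
            except Exception as exc:  # keep going if a single lookup fails
                results[domain] = {"Domains": domain,
                                   "error": f"{type(exc).__name__}: {exc}"}
    return [results[domain] for domain in domains]
```

Many WHOIS servers rate-limit clients, so a modest `max_workers` value and a retry or back-off for failed domains are probably worth adding before running this over 20,000 names.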