Getting the univariate probability density function for a dataset of IP addresses


I have two simple datasets, each containing 10k IP addresses encoded as integers (so the data is discrete, and each value can be any number between 1 and roughly 4 billion).

FYI: One dataset is real, captured on a network, while the other is synthetic (generated via AI/ML). In the end, I want to see how good the synthetic one is compared to the real one. But I am stuck right at the beginning :D

Since the datasets' distribution is unknown and does not follow any well-known distribution, I want to calculate their PDFs (and later compare how similar they are).

My two datasets are termed p and q, both arrays of IP addresses (as integers).

I am not an expert in probability theory, so please, bear with me :)

Since I eventually want to compare the two distributions, I calculate their PDFs over all possible events (i.e., IP addresses) present in either p or q. For this, I do the following in Python using numpy:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt #needed for the plots further down

q=np.array(real_data_1m.srcip) #
p=np.array(syn_data_1m.srcip)

#get all possible discrete events from p and q
px=np.array(list(set(p))) #use set here to remove duplicates
qx=np.array(list(set(q))) #use set here to remove duplicates

#concatenate px and qx
mx=np.concatenate([px,qx])

mx.sort() #sort them, as they are anyway integers
mx=np.array(list(set(mx))) #remove duplicates by creating a set
#mx.reshape((len(mx),1)) #reshape from 1D to nD, where n=len(mx)

Then, to calculate the PDF, I created a simple function, create_prob_dist(), to help towards this goal.

def create_prob_dist(data: np.ndarray, sample_space: np.ndarray):
  #total number of observations, so that the probabilities sum to 1
  sum_of_events = data.size
  #get the counts of each event via pandas.crosstab()
  data_counts = pd.crosstab(index='counts', columns=data)
  
  #create probabilities for each event
  prob_dist=dict()

  for i in sample_space:
    if i in data_counts:
      prob_dist[i]=(data_counts[i]['counts'])/sum_of_events
    else: 
      prob_dist[i]=0

  return prob_dist

This function does not return the PDF itself. At this stage, it returns a Python dictionary whose keys are the IP addresses that appear in either p or q, i.e., in mx, and whose values are the corresponding probabilities. For example, dict[2130706433]=0.05 means that the probability of IP address 127.0.0.1 in the dataset is 0.05.
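As a quick sanity check of the integer encoding used in the example above, here is a minimal sketch with Python's standard ipaddress module. It only illustrates the mapping between 2130706433 and 127.0.0.1 and is not part of the original code.

```python
import ipaddress

# Convert an integer-encoded IPv4 address to dotted-quad form and back.
addr = ipaddress.IPv4Address(2130706433)
dotted = str(addr)                                   # "127.0.0.1"
as_int = int(ipaddress.IPv4Address("127.0.0.1"))     # 2130706433

print(dotted, as_int)
```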

Once I have this dictionary of probabilities, I try to plot it, and this is where my problems start:

#create true PDFs of p and q using mx
p_pdf=create_prob_dist(p, mx)
q_pdf=create_prob_dist(q, mx)

#get the probability values only from the dictionary
p_pdf=np.array(list(p_pdf.values())) #already sorted according to mx
q_pdf=np.array(list(q_pdf.values())) #already sorted according to mx

plt.figure()
plt.plot(mx, q_pdf, 'g', label="Q")
plt.plot(mx, p_pdf, 'r', label="P")
plt.legend(loc="upper right")
plt.show()

The PDF plot does not look good

I know there must be a problem with the scales or something similar, but I cannot figure it out.

What am I doing wrong? Is it a wrong Python call, or is the calculation of the PDF wrong?

Btw., the plain histograms of p and q look like this:

# plot a histogram of the two datasets to have a quick look at them
plt.hist(np.array(syn_data_1m.srcip), bins=100)
plt.hist(np.array(real_data_1m.srcip),bins=100, alpha=.5)
plt.show()

Histogram of the two datasets

There are 1 best solutions below

cs.lev On

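Thanks to slothrop, the solution is to call sort() after the final set() deduplication rather than before it, because converting to a set does not preserve the sorted order: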

import numpy as np
import pandas as pd

q=np.array(real_data_1m.srcip) #
p=np.array(syn_data_1m.srcip)

#get all possible discrete events from p and q
px=np.array(list(set(p))) #use set here to remove duplicates
qx=np.array(list(set(q))) #use set here to remove duplicates

#concatenate px and qx
mx=np.concatenate([px,qx])


mx=np.array(list(set(mx))) #remove duplicates by creating a set
#call sort() last: set() does not preserve order, so sort after deduplicating
mx.sort() #sort them, as they are anyway integers
#mx.reshape((len(mx),1)) #reshape from 1D to nD, where n=len(mx)
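As a possibly more compact alternative to the set-based steps, here is a minimal sketch using np.union1d, which returns the sorted, unique values found in either input array. The small p and q arrays are hypothetical stand-ins for the real and synthetic data, not values from the actual datasets.

```python
import numpy as np

# Toy stand-ins for the real and synthetic arrays (hypothetical values).
p = np.array([30, 10, 10, 20])
q = np.array([20, 40, 40, 10])

# np.union1d deduplicates and sorts in one call, so no ordering is lost.
mx = np.union1d(p, q)
print(mx)
```

Because the result is already sorted, no separate sort() call is needed afterwards.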

The PDFs look good now: the correct PDF obtained after fixing the code
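For what it is worth, the probabilities can also be computed without a Python loop. The sketch below is only one way to do it, under some assumptions: empirical_pmf is a hypothetical helper name, the toy arrays are made up, and each count is divided by the number of observations so that the values sum to 1. Strictly speaking, for discrete data like this the result is a probability mass function rather than a density.

```python
import numpy as np

def empirical_pmf(data, support):
    """Relative frequency of each value of `support` in `data`.

    Assumes `support` is sorted, unique, and contains every value in `data`.
    """
    values, counts = np.unique(data, return_counts=True)
    pmf = np.zeros(support.size, dtype=float)
    # Locate each observed value inside the sorted support.
    idx = np.searchsorted(support, values)
    pmf[idx] = counts / data.size
    return pmf

# Toy stand-ins for the real and synthetic arrays (hypothetical values).
p = np.array([30, 10, 10, 20])
q = np.array([20, 40, 40, 10])

mx = np.union1d(p, q)          # shared, sorted sample space
p_pmf = empirical_pmf(p, mx)   # probabilities aligned with mx
q_pmf = empirical_pmf(q, mx)

print(p_pmf.sum(), q_pmf.sum())
```

Both arrays are aligned with mx, so they can be passed to the same plotting calls as before.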