Bin edges must be unique in array Pandas pd.qcut


I have a dataset of 400K entries with values ranging from 0 to 50K+. Here is a random sample of 100 values (sorted):

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 89, 128, 146, 173, 201, 319, 363, 396, 403, 488, 582, 790, 827, 849, 929, 2013, 2310, 2342, 2412, 2448, 2611, 2620, 2928, 3232, 3243, 3275, 3647, 3778, 4296, 4453, 4673, 4785, 5330, 5560, 5605, 5899, 5921, 5955, 6990, 7855, 9706, 11572, 14487, 15924, 20179, 21515, 24479, 30231, 30539, 32862, 41236, 44120, 50890]

I want to bin these values into quantiles for one-hot encoding. However, I am getting this error:

Bin edges must be unique in array([ 0. , 0. , 0. , 835.8, 5376. , 50890. ]). You can drop duplicate edges by setting the 'duplicates' kwarg.

This particular error was raised with 5 chunks.

Below is my code:

import pandas as pd

# Load the data and keep only the first 100 rows for this test
TrainingDataset = pd.read_csv("CSVFiles/ElsveirPaper TrainingData.csv")
TrainingDataset = TrainingDataset.head(100)
TrainingDataset = TrainingDataset.astype({'stperList': int})

l = TrainingDataset['stperList'].tolist()
print(sorted(l))

chunks = 50
for a in range(1, chunks):
    # One label per bin, used later as column headings
    temp = [str(i) + "_col" + str(i) for i in range(a)]

    try:
        pd.qcut(l, a, labels=temp)
        print("done")  # only reached when a is 1 or 2
    except Exception as error:
        print("Error at", a, error)

With these 100 values, the code works only when the number of chunks is 1 or 2. On the whole dataset, only chunks=1 works.

First, why does this error occur, when we can see that the array above contains many different numbers?
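
For reference, the duplicate edges in the error message can be reproduced directly from the 100 sample values. The sketch below assumes that pd.qcut derives its edges from evenly spaced quantiles, computed here with np.quantile and its default linear interpolation; the variable names are illustrative only. Since 47 of the 100 values are 0, the 0%, 20% and 40% quantiles are all 0, so three of the six edges coincide even though the array holds many distinct numbers.

```python
import numpy as np

# The 100 sorted sample values from the question: 47 zeros, then 53 positive values.
values = [0] * 47 + [
    89, 128, 146, 173, 201, 319, 363, 396, 403, 488, 582, 790, 827, 849, 929,
    2013, 2310, 2342, 2412, 2448, 2611, 2620, 2928, 3232, 3243, 3275, 3647,
    3778, 4296, 4453, 4673, 4785, 5330, 5560, 5605, 5899, 5921, 5955, 6990,
    7855, 9706, 11572, 14487, 15924, 20179, 21515, 24479, 30231, 30539, 32862,
    41236, 44120, 50890,
]

# Edges for 5 quantile bins are the 0%, 20%, ..., 100% quantiles.
edges = np.quantile(values, np.linspace(0, 1, 6))
print(edges)

# More than 40% of the values are 0, so the first three edges collapse to 0
# and only four of the six edges are distinct.
print(np.unique(edges).size)
```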

Second, I want to create quantiles with more than 10 chunks. What changes should I make to the code above?
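
Two possible directions, sketched under assumptions: the error message itself points to the duplicates keyword, and a commonly used alternative is to rank the values before cutting so that every edge is unique. The data below is a synthetic stand-in with the same shape as the sample (47 zeros plus 53 distinct positive values), not the real dataset, and q = 12 is an arbitrary choice above 10.

```python
import pandas as pd

# Stand-in data (an assumption, not the real dataset): 47 zeros, 53 distinct positives.
s = pd.Series([0] * 47 + [10 * i for i in range(1, 54)])
q = 12

# Option 1: drop duplicate edges. Integer codes are requested instead of custom
# labels, because the number of surviving bins is not known in advance.
dropped = pd.qcut(s, q, labels=False, duplicates="drop")
print(dropped.nunique())  # fewer than q bins, since the zero edges are merged

# Option 2: rank the values first so every value is unique, then cut the ranks.
# This yields exactly q bins, but tied zeros are spread over several bins.
ranked = pd.qcut(s.rank(method="first"), q, labels=False)
print(ranked.nunique())
```

Option 1 keeps all zeros in one bin at the cost of fewer bins than requested. Option 2 keeps the requested bin count but assigns identical values to different bins according to their order, which may not be acceptable for every use case.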

Note: My ultimate goal is one-hot encoding; if there are better ways to do that, I am open to them. Also, the code has for-loops only because I wanted to see which chunk sizes work and which do not. Thank you.
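
As a rough illustration of the end goal, the sketch below bins a column and one-hot encodes the resulting bin codes with pd.get_dummies. It again uses synthetic stand-in data rather than the real file, and the "bin" column prefix is a hypothetical naming choice.

```python
import pandas as pd

# Stand-in data (an assumption, not the real dataset): 47 zeros, 53 distinct positives.
s = pd.Series([0] * 47 + [10 * i for i in range(1, 54)], name="stperList")

# Bin into at most 12 quantile bins; duplicate edges are merged, so the
# actual number of bins can be smaller than requested.
bins = pd.qcut(s, 12, labels=False, duplicates="drop")

# One indicator column per surviving bin; each row has exactly one active column.
one_hot = pd.get_dummies(bins, prefix="bin")
print(one_hot.columns.tolist())
print(one_hot.head())
```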
