Find the smallest range that contains a given percentage of values

I have a .csv file that contains file sizes. I need to find the smallest possible interval [a, b] that contains a given majority of the files (75%, 80%, 85%, or 90%). I am using Python.

I know how to do this when I check a specific percentile of the files, but I have no idea how to solve the minimization problem.

percentile_80 = df['file_size'].quantile(0.8)
num_files = df.shape[0]
num_files_in_range = df[df['file_size'] <= percentile_80].shape[0]
percent_files_in_range = num_files_in_range / num_files * 100
range_start = df['file_size'].min()
range_end = percentile_80

Best answer, by rpm:

Here's how I understand your question:

You have a list of file sizes, and you're trying to find file sizes a and b such that 80% (or some other predetermined percentage) of the files have size s in the range [a,b], and |a-b| is minimized.

I suspect there's no built-in pandas function for this, but it's not too bad to do manually:

import math

def minimum_size_range(file_sizes, percentage):
    # calculate how many files need to be in the range
    window_size = math.ceil(len(file_sizes) * percentage / 100)

    sorted_sizes = sorted(file_sizes)

    # initialize variables with worst-case values
    min_size, max_size = sorted_sizes[0], sorted_sizes[-1]
    min_interval = max_size - min_size

    # calculate interval for every window
    for i in range(len(sorted_sizes) - (window_size - 1)):
        lower, upper = sorted_sizes[i], sorted_sizes[i + (window_size - 1)]
        interval = upper - lower

        # if we found a new minimum interval, replace values
        if interval < min_interval:
            min_interval = interval
            min_size, max_size = lower, upper

    return min_size, max_size

Quick explanation: Since we know the desired percentage beforehand, we know how many files we want in our range, so we can just sort our file sizes and find the window with the desired number of files that has the smallest range of sizes.

You should be able to call this like so:

min_size, max_size = minimum_size_range(df['file_size'], 80)
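
For larger files, the same sliding-window idea can be sketched without an explicit Python loop. This is only an illustration, not part of the original answer: the name `minimum_size_range_np` and the sample sizes are made up, and it assumes NumPy is available. It computes the width of every window of the required length in a single array subtraction and picks the narrowest one, then tries each percentage from the question.

```python
import math

import numpy as np


def minimum_size_range_np(file_sizes, percentage):
    """Hypothetical vectorized variant of the sliding-window search."""
    sizes = np.sort(np.asarray(file_sizes))

    # number of files that must fall inside the range
    k = math.ceil(len(sizes) * percentage / 100)
    if not 1 <= k <= len(sizes):
        raise ValueError("percentage must select between 1 and all files")

    # widths[i] = sizes[i + k - 1] - sizes[i], for every window of k files
    widths = sizes[k - 1:] - sizes[:len(sizes) - k + 1]

    # index of the first narrowest window
    i = int(np.argmin(widths))
    return sizes[i].item(), sizes[i + k - 1].item()


# made-up sample data: eight small files and two large outliers
sample_sizes = [1, 2, 3, 4, 100, 5, 6, 7, 8, 200]
for pct in (75, 80, 85, 90):
    print(pct, minimum_size_range_np(sample_sizes, pct))
```

With this sample, 75% and 80% both need 8 of the 10 files and give the range 1 to 8, while 85% and 90% need 9 files and are forced to include the first outlier, giving 1 to 100.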