Get statistics (q1, median, q3) from count dictionary


I have a dictionary of counts like this:

{1:2, 2:1, 3:1}

I need to calculate q1, median, and q3 from this. It is pretty straightforward for arrays with an odd number of elements, but I can't figure out the even case. I want to do it without using any libraries like numpy.

Example:

counts = {
            "4": 1,
            "1": 2,
            "5": 1
        }
results = {
            "q1": 1,
            "median": 2.5,
            "q3": 4,
        }

I have something along these lines so far but this doesn't handle all cases.

import math

def get_ratings_stats(counts):
    """This function will return min, q1, median, q3 and max value from list of ratings."""

    cumulative_sum = 0
    cumulative_dict = {}
    for key, value in sorted(counts.items()):
        cumulative_sum += value
        cumulative_dict[key] = cumulative_sum

    q1_index = math.floor(cumulative_sum * 0.25)
    q3_index = math.ceil(cumulative_sum * 0.75)
    median_index = cumulative_sum * 0.5

    q1, q3, median = None, None, None
    print('indexes: ', q1_index, median_index, q3_index)
    for key, sum in cumulative_dict.items():
        if not q1 and sum >= q1_index:
            q1 = key
        if not q3 and sum >= q3_index:
            q3 = key
        if not median and sum >= median_index:
            median = key

Answer by cards

OP's code is almost complete as it stands; the only problems are in the final part. Several implementations are shown below, together with their measured execution times.

import math
import statistics as st # used for stats_with_stats & workbench


def stats_with_stats(data:dict):
    # flatten the data: repeat each value as many times as its frequency
    f_table = []
    for v, freq in data.items():
        f_table.extend([v]*freq)
    return st.quantiles(f_table)


def stats_by_cards(data:dict):
    # no explicit extra container
    n = sum(data.values()) # total frequency

    q1_i = math.floor(n * 0.25)
    q2_i = n * 0.5
    q3_i = math.ceil(n * 0.75)

    qs = iter((q1_i, q2_i, q3_i))

    out_stats = []
    q = next(qs)
    cum_f = 0
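    for v, freq in sorted(data.items()):
        cum_f_new = cum_f + freq
        # 'while' rather than 'if': several quartile positions may fall on the same value
        while q is not None and cum_f <= q < cum_f_new:
            out_stats.append(v)
            q = next(qs, None)
        if q is None:
            break
        cum_f = cum_f_new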

    return out_stats


def stats_by_learner(data:dict):
    # using an extra container - it is more time consuming!
    tmp_data = {}
    f_cum = 0
    for v, f in sorted(data.items()):
        f_cum_new = f_cum + f
        tmp_data[v] = (f_cum, f_cum_new) # <- pairs
        f_cum = f_cum_new

    q1_i = math.floor(f_cum * 0.25)
    q2_i = f_cum * 0.5
    q3_i = math.ceil(f_cum * 0.75)

    qs = iter((q1_i, q2_i, q3_i))

    out_stats = []
    q = next(qs)
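    for v, (lower_freq, upper_freq) in tmp_data.items():
        # 'while' rather than 'if': several quartile positions may fall on the same value
        while q is not None and lower_freq <= q < upper_freq:
            out_stats.append(v)
            q = next(qs, None)
        if q is None:
            break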

    return out_stats        
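
The functions above always return one of the data values for the median, so they cannot produce the 2.5 expected in the question's even-length example. The following is a minimal sketch, not part of the original answer and not included in the timings, of one way to handle that case directly from the counts. The name quartiles_from_counts and the helper value_at_rank are hypothetical. It assumes numeric keys (the question's string keys would need converting with int first), reuses the floor/ceil rank rule from the question's code for q1 and q3, and averages the two middle values when the total count is even.

```python
import math


def quartiles_from_counts(counts):
    """Return (q1, median, q3) from a {value: count} mapping without expanding it."""
    items = sorted(counts.items())   # (value, count) pairs in ascending value order
    n = sum(count for _, count in items)   # total number of observations
    if n == 0:
        raise ValueError("counts is empty")

    def value_at_rank(rank):
        """Value at 1-based position `rank` in the implied sorted data."""
        running = 0
        for value, count in items:
            running += count
            if running >= rank:
                return value
        return items[-1][0]   # rank beyond the data: fall back to the largest value

    # Rank rule taken from the question's code; q1's rank is clamped to at least 1.
    q1 = value_at_rank(max(1, math.floor(n * 0.25)))
    q3 = value_at_rank(math.ceil(n * 0.75))

    # Median: the single middle value for odd n, the mean of the two middle values for even n.
    lo = value_at_rank((n + 1) // 2)
    hi = value_at_rank(n // 2 + 1)
    median = lo if n % 2 else (lo + hi) / 2
    return q1, median, q3


print(quartiles_from_counts({4: 1, 1: 2, 5: 1}))   # (1, 2.5, 4)
```

With the question's example this gives q1 = 1, median = 2.5 and q3 = 4, matching the expected results there. Other quartile conventions would give different q1 and q3 values for the same data.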

The timings were taken with the following dataset:

from collections import Counter
import random

# test with sample dataset
random.seed(123456) # for sake of "reproducibility"
dataset = Counter([random.randint(1, 100) for _ in range(1_000)])

Output

dataset
length  1000
seed    123456

check outputs
stats_by_cards      [23, 49, 75]
stats_by_learner    [23, 49, 75]
stats_with_stats    [23.0, 49.0, 75.0]

timing:
quartiles with "stats_by_cards"
times           [28.688197855000908, 29.46196023000084, 27.1563763080012, 27.481716411999514, 27.487445683997066]
mean            28.055139297799904
std             0.9796393506117709
quartiles with "stats_by_learner"
times           [44.43848133800202, 37.32753902799959, 36.33182038599989, 36.39275053000165, 36.30151881200072]
mean            38.15842201880078
std             3.5366534480997154
quartiles with "stats_with_stats"
times           [79.48812012000053, 80.60817988300187, 81.6248098290016, 81.89885164600128, 81.02726684300069]
mean            80.9294456642012
std             0.950457690904102
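
The timing code itself is not shown. A harness along the following lines could produce a report of the same shape; it is only a sketch under assumptions. The function names bench and quantiles_flat are hypothetical, the repeat and number values are placeholders rather than the settings used for the figures above, and the use of the sample standard deviation (statistics.stdev) is an assumption that appears consistent with the numbers shown.

```python
import random
import statistics as st
import timeit
from collections import Counter


def quantiles_flat(counts):
    """Stand-in function to time: expand the counts, then call statistics.quantiles."""
    flat = []
    for value, freq in counts.items():
        flat.extend([value] * freq)
    return st.quantiles(flat)


def bench(func, data, repeat=5, number=100):
    """Take `repeat` measurements, each the total time in seconds of `number` calls."""
    times = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return {"times": times, "mean": st.mean(times), "std": st.stdev(times)}


random.seed(123456)
dataset = Counter(random.randint(1, 100) for _ in range(1_000))

report = bench(quantiles_flat, dataset)
print("times", report["times"])
print("mean ", report["mean"])
print("std  ", report["std"])
```

Absolute times depend on the machine and on the chosen number of calls, so only the relative ordering of the implementations is meaningful.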

Remark on the definition of quartiles: the index-based definition used above (as in OP's code) may not be consistent with other definitions, such as the one used by statistics.quantiles:

check outputs (with 50 terms & seed=123456)
stats_by_learner    [13, 42, 66]
stats_by_cards      [13, 42, 66]
stats_with_stats    [12.75, 40.5, 65.25]
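
Part of the difference comes from interpolation: statistics.quantiles interpolates between neighbouring data points, and it offers two conventions through its method parameter ("exclusive", the default, and "inclusive"). As a small illustration, the snippet below applies both to the question's example expanded into a list. It is a sketch for comparison only, not a recommendation of either convention.

```python
import statistics as st

# The question's example {1: 2, 4: 1, 5: 1}, expanded into sorted raw data.
data = [1, 1, 4, 5]

exclusive = st.quantiles(data, n=4)                      # default: method="exclusive"
inclusive = st.quantiles(data, n=4, method="inclusive")  # treats data as the full population

print("exclusive", exclusive)   # [1.0, 2.5, 4.75]
print("inclusive", inclusive)   # [1.0, 2.5, 4.25]
```

Both conventions agree on the median here (2.5) but give different values for q3, and neither returns the q3 of 4 expected in the question, which corresponds to picking an actual data value rather than interpolating.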