Improving quick sort to work with lots of duplicates in the array, using "randomized swaps"

80 Views Asked by At

If the array of size n contains just one unique number (an array with all duplicates), then the traditional quick sort algorithm will create very uneven partitions of sizes 1 and n-1. This will be the case no matter how we choose the pivot element. This would make the average case time complexity O(n^2) for such inputs.

I thought of a simple way to alleviate this problem. While partitioning the array, if the element is exactly equal to the pivot element, we can equi-probably (randomly through a coin toss) choose to treat it as "less than" or "greater than" the pivot, so that duplicates get evenly distributed around the pivot.

def partition(arr: list[int]):
  pivot = arr[-1]
  current_index = 0
  for i in range(len(arr) - 1):
    if arr[i] < pivot or (arr[i] == pivot and random.random() < 0.5):
      arr[i], arr[current_index] = arr[current_index], arr[i]
      current_index += 1
  # put the pivot in its place
  arr[-1], arr[current_index] = arr[current_index], arr[-1]
  # return the partition index
  return current_index

(Emphasis on the arr[i] == pivot and random.random() < 0.5 check)

But I have never seen this approach being used anywhere. Is there a problem with this approach that it's not widely used?

2

There are 2 best solutions below

0
rcgldr On BEST ANSWER

If the array of size n contains just one unique number (an array with all duplicates), then the traditional quick sort algorithm will create very uneven partitions of sizes 1 and n-1.

This is only true if using Lomuto partition scheme. If using Hoare partition scheme (the original quicksort scheme), the splits become ideal as the number of duplicates increases, although it needlessly swaps equal elements. Look at the repeated elements section in the Wiki article:

https://en.wikipedia.org/wiki/Quicksort

2
Caridorc On

Non-deterministic sorting is quite un-intuitive as the same list is sorted in different ways different times, leading to potentially rare and hard to find bugs:

import random
from typing import List

class User:
    def __init__(self, name, age):
        self.name = name
        self.age = age

def partition(users: List[User], key='name'):
    pivot = getattr(users[-1], key)
    current_index = 0
    for i in range(len(users) - 1):
        if getattr(users[i], key) < pivot or (getattr(users[i], key) == pivot and random.random() < 0.5):
            users[i], users[current_index] = users[current_index], users[i]
            current_index += 1
    # put the pivot in its place
    users[-1], users[current_index] = users[current_index], users[-1]
    # return the partition index
    return current_index

def quicksort(users: List[User], key='name'):
    if len(users) <= 1:
        return users
    else:
        pivot_index = partition(users, key)
        left_part = quicksort(users[:pivot_index], key)
        right_part = quicksort(users[pivot_index + 1:], key)
        return left_part + [users[pivot_index]] + right_part

# Example usage:
users = [User("Alice", 25), User("Bob", 30), User("Charlie", 25)]
sorted_users_by_name = quicksort(users, key='name')
sorted_users_by_age = quicksort(users, key='age')

print("Sorted by name:")
for user in sorted_users_by_name:
    print(f"{user.name} - {user.age} years old")

for i in range(10):
    sorted_users_by_age = quicksort(sorted_users_by_age, key='age')
    print(f"\nRe-sorting {i}: Sorted by age:")
    for user in sorted_users_by_age:
        print(f"{user.name} - {user.age} years old")

With output:

Sorted by name:
Alice - 25 years old
Bob - 30 years old
Charlie - 25 years old

Re-sorting 0: Sorted by age:
Alice - 25 years old
Charlie - 25 years old
Bob - 30 years old

Re-sorting 1: Sorted by age:
Charlie - 25 years old
Alice - 25 years old
Bob - 30 years old

Re-sorting 2: Sorted by age:
Alice - 25 years old
Charlie - 25 years old
Bob - 30 years old

Re-sorting 3: Sorted by age:
Charlie - 25 years old
Alice - 25 years old
Bob - 30 years old

Re-sorting 4: Sorted by age:
Alice - 25 years old
Charlie - 25 years old
Bob - 30 years old

Re-sorting 5: Sorted by age:
Charlie - 25 years old
Alice - 25 years old
Bob - 30 years old

Re-sorting 6: Sorted by age:
Charlie - 25 years old
Alice - 25 years old
Bob - 30 years old

Re-sorting 7: Sorted by age:
Charlie - 25 years old
Alice - 25 years old
Bob - 30 years old

Re-sorting 8: Sorted by age:
Alice - 25 years old
Charlie - 25 years old
Bob - 30 years old

Re-sorting 9: Sorted by age:
Alice - 25 years old
Charlie - 25 years old
Bob - 30 years old

Apart from this minor inconvenience, I think your solution is fine.