How to distribute files among directories so that each ends up with an equal number and no directory keeps its initial files?


I want to write code that redistributes files among the directories they are placed in, subject to several conditions:

  1. At the end of the distribution, there should be no remaining initial files in the directories.
  2. All directories should have an equal number of files after the distribution. This number is determined by a specific algorithm (described below).
  3. There may be some files left over at the end of the distribution, when there are not enough of them to give every directory one more file; these leftover files should come from the directories that initially had more files than the distribution number.
  4. Obviously, we are not interested in the case where the total number of files is less than the number of directories.

I can't come up with a good algorithm.

I think one already exists, but my searches have been unsuccessful.


How is the distribution number calculated?

We have the number of files in each directory and the total number of files.

  1. Find the minimum difference between the total number of files and the number of files in a directory:

    min_diff = min(total_number_of_files - files_number_in_directory[i])
    
  2. Calculate the average number of files per directory:

    avg = total_number_of_files / directories_number
    
  3. Choose the smaller of the two values obtained in steps 1 and 2 - this will be the distribution number.

    dis_number = min(min_diff, avg)
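
Expressed as a minimal JavaScript sketch (the function name `distributionNumber` is mine, not from the question; `/` is taken as integer division, matching the examples), the three steps above are:

```javascript
// Compute the distribution number from the per-directory file counts.
function distributionNumber(counts) {
  const total = counts.reduce((a, b) => a + b, 0);
  // 1. minimum difference between the total and each directory's count
  const minDiff = Math.min(...counts.map(c => total - c));
  // 2. average number of files per directory (integer division)
  const avg = Math.floor(total / counts.length);
  // 3. the smaller of the two is the distribution number
  return Math.min(minDiff, avg);
}
```

For example, `distributionNumber([10, 2, 2])` gives `min(14 - 10, floor(14 / 3)) = 4`, matching example 2 below.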
    

Examples:

Please don't pay attention to the order of files in the output directories. Any ordering is fine as long as the conditions are met.

  1. The easiest case: every directory has an equal number of files:

    input:

    directory_1: file_1_1, file_1_2
    directory_2: file_2_1, file_2_2
    directory_3: file_3_1, file_3_2
    

    output:

    directory_1: file_2_1, file_3_2
    directory_2: file_3_1, file_1_2
    directory_3: file_1_1, file_2_2
    

    Of course, this is just one way to distribute files between those 3 directories!

  2. A case where extra files are left over: input:

    dir_1: file_1_1, file_1_2, ..., file_1_10
    dir_2: file_2_1, file_2_2
    dir_3: file_3_1, file_3_2
    

    output:

    dir_1: file_2_1,  file_2_2, file_3_1, file_3_2
    dir_2: file_1_1, file_1_2, file_1_3, file_1_4
    dir_3: file_1_5, file_1_6, file_1_7, file_1_8
    
    extra: file_1_9, file_1_10
    

    Our distribution number is 14 - 10 = 4, so we can place all files from dir_2 and dir_3 into dir_1, because dir_1 has more files than their combined count. But file_1_9 and file_1_10 cannot be placed.

My current algorithm is too complicated and probably doesn't work in every case:

The main idea is to move, in each iteration, only as many files as the minimum number of files among the directories. It works like a permutation, with a new shift number chosen every time.

At each step, we "forget" about the files that have been moved previously and work with the remaining original files. It is evident that the directories that initially had the minimum number of files now have 0 files. To address this, we check if we can take files from the directories with the maximum number of files and distribute them to the directories that have 0 files. How many files should we take? It should be equal to the new minimum value among the directories.

By following this approach, all files will be distributed according to the distribution number.

Steps:

I store directories as vectors of files, so the files are indexed, and I initialize a vector temp_directories.

I also keep an array of ints that stores how many files are in every directory. The index of an element corresponds to the index of a directory.

At every step the vector of directories changes because of pop_back operations.

  1. Find minimum_files_number among directories.

  2. If minimum_files_number != 0:

    2.1. Randomly initialize shift: a number in the range [1, n - 1], where n is the number of directories.

    2.2. Place the last element of directory i into the temp_directory indexed (i + shift) mod n, and pop back that element from directory i.

    2.3. Repeat step 2.2 minimum_files_number times.

  3. If minimum_files_number == 0:

    3.1. Mark the directories with no elements (0-directories), e.g. give them flag == 1.

    3.2. Check if current_total_number_of_files > ((number_of_0-directories + 1) * minimum_files_number)

       3.2.1. Yes: initialize `current_fill_number = minimum_files_number`
    
       3.2.2. No: initialize `current_fill_number = current_avg_number`, where current_avg_number is just `current_total_number_of_files / directories_number`
    

    3.3. From the directory with the largest number of files, move files to the 0-directories:

       3.3.1. Find max_index, the index of the directory with the largest number of files.

       3.3.2. Push back the last file from directory-`max_index` to 0-directory-`i`. Pop back the last file from directory-`max_index`.
    
       3.3.3. Repeat steps 3.3.1 and 3.3.2 current_fill_number times.
    

    3.4. Directories that were not flagged as 0-directories are sent back to step 2, but with minimum_files_number = current_fill_number.
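
The shift move in step 2 can be sketched like this (a minimal illustration, assuming directories are plain arrays of file names; the function name `shiftMove` is my own, and the modulo wrap-around is inferred from the permutation example):

```javascript
// Sketch of step 2: move the last `count` files from each directory i
// into tempDirs[(i + shift) % n], popping them from the source.
function shiftMove(dirs, tempDirs, count, shift) {
  const n = dirs.length;
  for (let step = 0; step < count; step++) {
    for (let i = 0; i < n; i++) {
      tempDirs[(i + shift) % n].push(dirs[i].pop());
    }
  }
}
```

With two directories `[["a1", "a2"], ["b1", "b2"]]`, `count = 2` and `shift = 1`, all of directory 0's files end up in temp directory 1 and vice versa.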

Example:

We have directories with such number of files:

dir_index: 0   1   2   3
files:     4   8  13  10

At first, minimum_files_number is 4. We initialize shift as a random number in the range [1, 3]. Suppose it is 2. So the permutation is

(0   1   2   3)
(2   3   0   1)

And we take the last 4 files from directory-1 and put them in temp_directory-3(!), and so on. Because of pop back, our new count of files in the directories is [0, 4, 9, 6].

A 0 appears, so we need to take files from the directories with 9 and 6 elements to fill it; as a result we would have [4, 4, 5, 6]. We no longer pay attention to the directory that was filled, so we repeat the previous steps for the directories [4, 5, 6], moving the last 4 elements between them with a new value of shift.

Now the counts are [0, 0, 1, 2]. The number of remaining files is less than the number of directories, so they go to the "extra" group.

1 Answer

Answered by tucuxi:

I will first restate the problem in my own words, and then propose a solution.

The simplified problem: you are given N sets. You must move elements between sets until (a) all sets are of the same size, and (b) no set retains any of its original elements. You are allowed to build an extra set of left-overs for when (a) and (b) conflict.

My approach would be to:

  • find the target size of sets. If there are M total elements, the target size is floor(M/N). This is our target for rule a.
  • initialize an exclusion map, where map[element] points to its original set. To comply with rule b, we are not allowed to place that element into that set.
  • empty all sets, and start filling them by iterating the previous map. To balance the sets, in the for loop below, I would start with the set next to the one that was last successfully filled (round-robin):
while (not all elements placed):
  place next element, e
  for all sets s, in round-robin fashion,
    if size(set[s]) < target, and
       map(e) != s, then:
         place e into set[s], and exit for loop
  if no placing possible, put e into the left-over set

let new_target = minimum size of all sets
for all sets s with size(set[s]) > new_target, 
   move size(set[s]) - new_target elements into the left-over set

This is not a famous algorithm, but the question does not seem to need one. The worst-case cost of the above pseudo-code is O(n*m), and I doubt that you can find a general solution that is significantly faster or fills the sets to a greater capacity (I do not guarantee this algorithm is optimal, but I cannot think of a case where it would fail to be, either). You can shave off part of the cost in some cases (as you were trying to do above), but often a simple algorithm that always works is better than one that could sometimes be faster but is harder to understand and test for correctness.

It seems that when N is large, since each element is prohibited from only 1 of the N sets, there will be very few conflicts. Conflicts mostly arise when some initial sets are much larger than others; if all are initially roughly balanced, there will be very few problems assigning the target number of elements to each set.

edit

Working code in JavaScript

const initial_sets = [
  {id: "s1", values: ["f11", "f12", "f13", "f14", "f15", 
         "f16", "f17", "f18", "f19", "f16"]},
  {id: "s2", values: ["f21", "f22"]},
  {id: "s3", values: ["f31", "f32"]}
];

function balance(sets) {
  
  // preprocess sets
  const elements = []; 
  const exclusion_map = new Map();
  let total_elements = 0;
  let total_sets = 0;
  const workspace = [];
  for (const set of sets) {
    total_elements += set.values.length;
    total_sets ++;
    workspace.push({id: set.id, size: 0});
    for (const v of set.values) {
      exclusion_map.set(elements.length, set.id);
      elements.push({src: set.id, dst: null, value: v})
    }
  }
  const target_size = Math.floor(total_elements/total_sets);

  // place elements without breaking rules
  let starting_set = 0;
  for (let el_idx = 0; el_idx<total_elements; el_idx++) {   
      console.log("Placing ", elements[el_idx].value,"; set sizes: ", 
                workspace.map(o => `${o.id}: ${o.size}`).join());
    
    let placed = false;
    for (let set_idx = 0; !placed && set_idx<total_sets; set_idx++) {
      const dst_idx = (set_idx + starting_set)%total_sets;
      console.log("trying set ", dst_idx, "...")
      if (workspace[dst_idx].size < target_size && 
          workspace[dst_idx].id != exclusion_map.get(el_idx)) {
        starting_set = dst_idx + 1;
        workspace[dst_idx].size ++;
        elements[el_idx].dst = workspace[dst_idx].id;
        placed = true;
      }
    }
  }
  
  // build result
  const result = workspace.map(o => ({id: o.id, values:
    elements.filter(e => e.dst == o.id).map(ee => ee.value)}));
  const leftovers = 
    elements.filter(e => e.dst == null).map(ee => ee.value);
  
  // ensure that all have same size as smallest 
  const smallest = Math.min(...workspace.map(o => o.size));
  for (const set of result) {
    if (set.values.length > smallest) {
      console.log("Trimming ", set.id, "to size", smallest);
      leftovers.push(... set.values.splice(smallest));
    }
  }
  result.push({id: "_leftovers", values: leftovers});
  return result;  
}

const r = balance(initial_sets);
console.log(r);