I want to write code that redistributes files among the directories they are placed in, with several conditions:
- At the end of the distribution, no directory should retain any of its initial files.
- All directories should have an equal number of files after the distribution. This number is determined by a specific algorithm (described below).
- Some files may be left over at the end because there are not enough of them to give every directory "one more file"; these leftovers should come from the directories that initially had more files than the distribution number.
- Obviously, we are not interested in the case where the total number of files is less than the number of directories.
I can't come up with a good algorithm. I suspect one already exists, but my searches have been unsuccessful.
How is the distribution number calculated?
We have the number of files in each directory and the total number of files.
1. Find the minimum difference between the total number of files and the number of files in a directory:
`min_diff = min(total_number_of_files - files_number_in_directory[i])`
2. Calculate the average number of files per directory:
`avg = total_number_of_files / directories_number`
3. Choose the smaller of the two values obtained in steps 1 and 2 - this will be the distribution number:
`dis_number = min(min_diff, avg)`
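In code, the three steps can be sketched like this (a Python sketch; I assume integer division for the average, since the distribution number is a count of files):

```python
def distribution_number(counts):
    """Distribution number for a list of per-directory file counts."""
    total = sum(counts)
    # Step 1: minimum difference between the total and any directory's count
    min_diff = min(total - c for c in counts)  # same as total - max(counts)
    # Step 2: (integer) average number of files per directory
    avg = total // len(counts)
    # Step 3: the smaller of the two values
    return min(min_diff, avg)

print(distribution_number([2, 2, 2]))   # 2
print(distribution_number([10, 2, 2]))  # 4
```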
Examples:
Please don't pay attention to the order of files in the output directories. The ordering can be anything, as long as the conditions are met.
The easiest case: every directory has an equal number of files.
input:
directory_1: file_1_1, file_1_2
directory_2: file_2_1, file_2_2
directory_3: file_3_1, file_3_2
output:
directory_1: file_2_1, file_3_2
directory_2: file_3_1, file_1_2
directory_3: file_1_1, file_2_2
Of course, this is just one way to distribute files between those 3 directories!
A case where extra files may be left over:
input:
dir_1: file_1_1, file_1_2, ..., file_1_10
dir_2: file_2_1, file_2_2
dir_3: file_3_1, file_3_2
output:
dir_1: file_2_1, file_2_2, file_3_1, file_3_2
dir_2: file_1_1, file_1_2, file_1_3, file_1_4
dir_3: file_1_5, file_1_6, file_1_7, file_1_8
extra: file_1_9, file_1_10
Our distribution number is 14 - 10 = 4, so we can place all files from dir_2 and dir_3 into dir_1, because dir_1 has more files than their combined count. But file_1_9 and file_1_10 cannot be placed.
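To pin the conditions down, here is a small checker (a sketch; `check_distribution` and `origin` are my own names) that verifies a finished distribution against the three rules, applied to the example above:

```python
def check_distribution(original, final, extra):
    """Assert the three conditions on a finished distribution.

    original/final: lists of per-directory file lists; extra: leftovers.
    """
    # Map every file to the directory it started in.
    origin = {f: i for i, files in enumerate(original) for f in files}
    # Condition 1: no directory keeps any of its initial files.
    assert all(origin[f] != i for i, files in enumerate(final) for f in files)
    # Condition 2: every directory ends with the same number of files.
    assert len({len(files) for files in final}) == 1
    # Condition 3: leftovers come only from directories that initially
    # had more files than the distribution number.
    dis_number = len(final[0])
    assert all(len(original[origin[f]]) > dis_number for f in extra)

original = [
    [f"file_1_{i}" for i in range(1, 11)],  # dir_1: 10 files
    ["file_2_1", "file_2_2"],               # dir_2
    ["file_3_1", "file_3_2"],               # dir_3
]
final = [
    ["file_2_1", "file_2_2", "file_3_1", "file_3_2"],
    ["file_1_1", "file_1_2", "file_1_3", "file_1_4"],
    ["file_1_5", "file_1_6", "file_1_7", "file_1_8"],
]
extra = ["file_1_9", "file_1_10"]
check_distribution(original, final, extra)  # raises AssertionError on a violation
```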
My current algorithm is too complicated and probably doesn't work in every case:
The main idea is to move, on each iteration, only as many files as the minimum number of files among the directories - like a permutation, with a new shift each time.
At each step, we "forget" about the files that were moved previously and work only with the remaining original files. Evidently, the directories that initially had the minimum number of files now have 0 files. To address this, we check whether we can take files from the directories with the maximum number of files and give them to the directories that have 0 files. How many files should we take? As many as the new minimum value among the directories.
By following this approach, all files will be distributed according to the distribution number.
Steps:
I store the directories as vectors of files, so they are indexed, and I initialize a vector temp_directories.
I also keep an array of ints that stores how many files are in every directory; the index of an element corresponds to the index of a directory.
On every step the directory vectors shrink because of pop_back operations.
1. Find `minimum_files_number` among the directories.
2. If `minimum_files_number != 0`:
2.1. Randomly initialize `shift` - a number in the range [1, n - 1], where n is the number of directories.
2.2. Into the temp_directory indexed `i + shift`, place the last element of directory `i`, and pop back that element from directory `i`.
2.3. Repeat step 2.2 `minimum_files_number` times.
3. If `minimum_files_number == 0`:
3.1. Mark the directories with no elements (0-directories), e.g. give them flag == 1.
3.2. Check whether `current_total_number_of_files > ((number_of_0-directories + 1) * minimum_files_number)`:
3.2.1. Yes: initialize `current_fill_number = minimum_files_number`
3.2.2. No: initialize `current_fill_number = current_avg_number`, where current_avg_number is just `current_total_number_of_files / directories_number`
3.3. From the directory with the largest number of files, move files to the 0-directories:
3.3.1. Find `max_index` - the index of the directory with the largest number of files.
3.3.2. Push back the last file from directory `max_index` to 0-directory `i`. Pop back that file from directory `max_index`.
3.3.3. Repeat steps 3.3.1 and 3.3.2 `current_fill_number` times.
3.4. Send the directories that were not flagged as 0-directories back to step 2, but with `minimum_files_number = current_fill_number`.
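Step 2 - moving the last `minimum_files_number` files of each directory i into temp_directory i + shift - can be sketched like this (a Python sketch; `shifted_move` is my own name, and the wraparound mod n is inferred from the permutation in the worked example):

```python
def shifted_move(directories, shift, count):
    """Move the last `count` files of directory i into temp directory
    (i + shift) % n. Mutates `directories` via pop, like pop_back."""
    n = len(directories)
    temp = [[] for _ in range(n)]
    for i, d in enumerate(directories):
        for _ in range(count):
            temp[(i + shift) % n].append(d.pop())  # pop_back the last file
    return temp

dirs = [["file_1_1", "file_1_2"], ["file_2_1", "file_2_2"], ["file_3_1", "file_3_2"]]
temp = shifted_move(dirs, shift=1, count=2)
print(temp)  # each temp directory now holds its neighbour's files
```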
Example:
We have directories with such number of files:
dir_index: 0 1 2 3
files: 4 8 13 10
At first, the minimum_files_number is 4. We initialize shift as a random number in the range [1, 3]. For example, it is 2. So the permutation is
(0 1 2 3)
(2 3 0 1)
We take the last 4 files from directory-1 and put them into temp_directory-3(!), and so on. Because of pop_back, the new counts of files in the directories are [0 4 9 6].
A 0 appears, so we need to take files from the directories with 9 and 6 elements to fill it; as a result we get [4 4 5 6]. We don't pay attention to the directory that was filled, so we repeat the previous steps for the directories [4 5 6], moving the last 4 elements between them with a new value of shift.
Now we have [0 0 1 2]. The number of remaining files is less than the number of directories, so they go to the "extra" group.
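The walkthrough's final leftover count can be confirmed in a few lines (assuming integer division for the average, as before):

```python
counts = [4, 8, 13, 10]
total = sum(counts)                        # 35
dis_number = min(total - max(counts),      # min_diff = 35 - 13 = 22
                 total // len(counts))     # avg = 35 // 4 = 8
extras = total - dis_number * len(counts)  # 35 - 8 * 4 = 3
print(dis_number, extras)
```

So each directory ends up with 8 files and three files go to "extra", matching the [0 0 1 2] state above.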
I will first restate the problem in my own words, and then propose a solution.
The simplified problem: you are given N sets. You must move elements between sets until all sets are (a) of the same size, and (b) no set retains any of its original elements. You are allowed to build an extra set of left-overs for when (a) and (b) conflict.
My approach would be to:
1. Count the M total elements; the target size is floor(M/N). This is our target for rule (a).
2. Build a map where map[element] points to the element's original set. To comply with rule (b), we are never allowed to place an element back into that set.
3. In the for loop, start with the set next to the one that was last successfully filled (round-robin).
This is not a famous algorithm, but the question does not seem to need one. The worst-case cost of this pseudo-code is O(n*m), and I doubt that you can find a general solution that is significantly faster, or one that manages to fill the sets to a greater capacity (I do not guarantee this algorithm to be optimal, but I cannot think of a case where it fails to be, either). You can shave part of the cost in some cases (as you were trying to do above), but often a simple algorithm that always works is better than one that could be faster sometimes, yet is harder to understand and test for correctness.
It seems that when N is large, since each element is prohibited from only 1 of the N sets, there will be very few conflicts. Conflicts mostly arise when some initial sets are much larger than others; if all sets start out mostly balanced, there will be very few problems assigning the target number of elements to each set.
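A Python sketch of this approach (my own reconstruction, not the answerer's working code; `redistribute` is my name): record each element's origin, then deal elements out round-robin, skipping a set whenever the candidate element originated there.

```python
from collections import deque

def redistribute(sets):
    """Round-robin greedy: fill every set to floor(M/N) elements,
    never placing an element back into its original set.
    Returns (result_sets, extra)."""
    n = len(sets)
    target = sum(len(s) for s in sets) // n             # rule (a): floor(M/N)
    origin = {e: i for i, s in enumerate(sets) for e in s}
    # Deal elements of the largest sets first so they spread out early.
    pool = deque(e for s in sorted(sets, key=len, reverse=True) for e in s)
    result = [[] for _ in range(n)]
    cur, retries = 0, 0
    while pool and retries < len(pool):
        e = pool.popleft()
        for step in range(n):                           # round-robin scan
            j = (cur + step) % n
            if len(result[j]) < target and origin[e] != j:  # rule (b)
                result[j].append(e)
                cur, retries = (j + 1) % n, 0
                break
        else:
            pool.append(e)                              # no legal set now; retry later
            retries += 1
    return result, list(pool)                           # leftovers become "extra"
```

For the question's [10, 2, 2] example this fills each directory with 4 files and leaves 2 files from the largest directory in the extra set.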
edit
Working code in JavaScript