Random string collision after using Fisher-Yates algorithm (C#)

365 Views Asked by At

I am doing an exercise from exercism.io, in which I have to generate random names for robots. I am able to get through a bulk of the tests until I hit this test:

[Fact]
public void Robot_names_are_unique()
{
    var names = new HashSet<string>();
    for (int i = 0; i < 10_000; i++) {
        var robot = new Robot();
        Assert.True(names.Add(robot.Name));
    }
}

After some googling around, I stumbled upon a couple of solutions and found out about the Fisher-Yates algorithm. I tried to implement it into my own solution but unfortunately, I haven't been able to pass the final test, and I'm stumped. If anyone could point me in the right direction with this, I'd greatly appreciate it. My code is below:

EDIT: I forgot to mention that the format of the string has to follow this: @"^[A-Z]{2}\d{3}$"

public class Robot
{
string _name;
Random r = new Random();
string alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
string nums = "0123456789";

public Robot()
{
    _name = letter() + num();
}

public string Name
{
    get { return _name; }
}

private string letter() => GetString(2 ,alpha.ToCharArray(), r);

private string num() => GetString(3, nums.ToCharArray(), r);

public void Reset() => _name = letter() + num();

public string GetString(int length,char[] chars, Random rnd)
{
    Shuffle(chars, rnd);
    return new string(chars, 0, length);
}

public void Shuffle(char[] _alpha, Random r)
{


    for(int i = _alpha.Length - 1; i > 1; i--)
    {
        int j = r.Next(i);
        char temp = _alpha[i];
        _alpha[i] = _alpha[j];
        _alpha[j] = temp;
    }

}

}
4

There are 4 best solutions below

0
Matt Burland On BEST ANSWER

Here's one way you can do it:

  1. Generate the list of all possible names
  2. For each robot, select a name from the list at random
  3. Remove the selected name from the list so it can't be selected again

With this, you don't even need to shuffle. Something like this (note, I stole Optional Option's method of generating names because it's quite clever and I couldn't be bother thinking of my own):

public class Robot
{
    private static List<string> names;
    private static Random rnd = new Random();
    public string Name { get; private set; }

    static Robot()
    {
        Console.WriteLine("Initializing");
        // Generate possible candidates
        names = Enumerable.Range(0, 675999).Select(i => 
        {
            var sb = new StringBuilder(5);
            sb.Append((char)('A' + i % 26));
            i /= 26;
            sb.Append((char)('A' + i % 26));
            i /= 26;
            sb.Append(i % 10);
            i /= 10;
            sb.Append(i % 10);
            i /= 10;
            sb.Append(i % 10);
            return sb.ToString();
        }).ToList();
    }

    public Robot()
    {
        // Note: if this needs to be multithreaded, then you'd need to do some work here
        // to avoid two threads trying to take a name at the same time
        // Also note: you should probably check that names.Count > 0 
        // and throw an error if not
        var i = rnd.Next(0, names.Count - 1);
        Name = names[i];
        names.RemoveAt(i);
    }
}

Here's a fiddle that generates 20 random names. They can only be unique because they are removed once they are used.

The point about multitheading is very important however. If you needed to be able to generate robots in parallel, then you'd need to add some code (e.g. locking the critical section of code) to ensure that only one name is being picked and removed from the list of candidates at a time or else things will get really bad, really quickly. This is why, when people need a random id with a reasonable expectation that it'll be unique, without worrying that some other thread(s) are trying the same thing at the same time, they use GUIDs. The sheer number of possible GUIDs makes collisions very unlikely. But you don't have that luxury with only 676,000 possible values

2
Christopher On

The first rule of any ID is:

It does not mater how big it is, how many possible value it has - if you just create enough of them, you will get a colission eventually.

To Quote Trillian from the Hithchikers Guide: "[A colission] is not impossible. Just realy, really unlikely."

However in this case, I think it is you creating Random Instances in a Loop. This is a classical beginners mistake when workign with Random. You should not create a new random isntance for each Robot Instance, you should have one for the application that you re-use. Like all Pseudorandom Number Generators, Random is deterministic. Same inputs - same outputs.

As you did not specify a seed value, it will use the time in milliseconds. Wich is going to the same between the first 20+ loop itterations at last. So it is going to have the same seed and the same inputs, so the same outputs.

1
Optional Option On

The easiest solution for unique names is to use GUIDs. In theory, it is possible to generate non-unique GUIDs but it is pretty close to zero.

Here is the sample code:

var newUniqueName = Guid.NewGuid().ToString();

Sure GUIDs do not look pretty but they are really easy to use.

EDIT: Since the I missed the additional requirement for the format I see that GUID format is not acceptable.

Here is an easy way to do that too. Since format is two letters (26^2 possibile values) and 3 digits (10^3 possible values) the final number of possible values is 26^2 * 10^3 = 676 * 1000 = 676000. This number is quite small so Random can be used to generate the random integer in the range 0-675999 and then that number can be converted to the name. Here is the sample code:

            var random = new System.Random();
            var value = random.Next(676000);
            var name = ((char)('A' + (value % 26))).ToString();
            value /= 26;
            name += (char)('A' + (value % 26));
            value /= 26;
            name += (char)('0' + (value % 10));
            value /= 10;
            name += (char)('0' + (value % 10));
            value /= 10;
            name += (char)('0' + (value % 10));

The usual disclaimer about possible identical names applies here too since we have 676000 possible variants and 10000 required names.

EDIT2: Tried the code above and generating 10000 names using random numbers produced between 9915 and 9950 unique names. That is no good. I would use a simple static in class member as a counter instead of random number generator.

2
Mathias R. Jessen On

First, let's review the test you're code is failing against:

  • 10.000 instances created
  • Must all have distinct names

So somehow, when creating 10000 "random" names, your code produces at least two names that are the same.

Now, let's have a look at the naming scheme you're using:

AB123

The maximum number of unique names we could possibly create is 468000 (26 * 25 * 10 * 9 * 8).

This seems like it should not be a problem, because 10000 < 468000 - but this is where the birthday paradox comes in!

From wikipedia:

In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday.

Rewritten for the purposes of your problem, we end up asking:

What's the probability that, in a set of 10000 randomly chosen people, some pair of them will have the same name.

The wikipedia article also lists a function for approximating the number of people required to reach a 50% propbability that two people will have the same name:

BirthdayProblemApproxN

where m is the total number of possible distinct values. Applying this with m=468000 gives us ~806 - meaning that after creating only 806 randomly named Robots, there's already a 50% chance of two of them having the same name.

By the time you reach Robot #10000, the chances of not having generated two names that are the same is basically 0.

As others have noted, you can solve this by using a Guid as the robot name instead.

If you want to retain the naming convention you might also get around this by implementing an LCG with an appropriate period and use that as a less collision-prone "naming generator".