I have a priority queue of "door numbers". I get the next door number from the priority queue (i.e. the door with the lowest corresponding priority value), and then open the door. Behind the door, there may be a gift or not. Based on the presence / absence of a gift, I update the priority for this door number, and put it back into the priority queue. I then repeat, getting the next door number to open, and so on.
Assuming every door has a different gift-replenishment rate (i.e. some may get a new gift daily, others never at all), how should I update the priority values in order to maximize the number of gifts I find? That is, I want to maximize the ratio of doors I open with gifts to doors I open without gifts.
I should note that replenishment rates are not guaranteed to be fixed over time / there is random variation. But I'm okay with simplifying assumptions here.
This almost seems like a Monte-Carlo problem to me, except that the more often I explore a node (door), the lower its expected value. (And of course there's no tree to build; we only need to figure out the value of depth-1 nodes.)
The most trivial way is to keep track of last priority (LP) and current priority (CP), with delta = CP - LP. If we find a gift, set the next priority NP = CP + delta - 1; otherwise set NP = CP + delta + 1. This works I guess, but seems rather slow in its optimization.
Or we could have a multiplicative value instead: NP = CP + delta * shrink or NP = CP + delta * grow, where shrink < 1 and grow > 1. This is what I currently have, and it seemed to work fine for months, but now I'm getting the situation where some doors are being opened back-to-back (i.e. open door D, found gift, put back on priority queue, D is now best choice again, no gift found of course, now put back on queue with worse priority) which seems pretty bad. For reference, I used shrink = 0.9 and grow = 1.3.
Is there a math formula (as with Monte-Carlo) expressing the optimal way to explore doors?
Multi-armed bandit theory runs deep and is not my specialty, so there's probably a reference that I don't know about. That being said, my first instinct is:
Multi-armed bandits typically have to balance exploration with exploitation, but my hunch here is that we'll get this naturally from the replenishment process.
Most of the technical detail is in doing the estimate. We have a bunch of examples (x, b) where x is the time since we last opened the door and b is whether there was a gift. For a given rate λ, the formula above for the priority gives the expected value of b. I'll suggest a maximum likelihood estimator for λ. This means maximizing the sum of log(exp(−λx)) = −λx over all (x, 0) examples plus the sum of log(1 − exp(−λx)) over all (x, 1) examples. This function can be optimized directly, but there are two issues:
What I would actually recommend is picking a small set of λ values to make this a discrete optimization problem.
(Another potential problem is that the priority formula could be inefficient for many doors. What you could do instead is pick a target threshold for priority and then calculate when the priority will exceed that threshold.)