r - Latin hypercube sampling with varying number of levels per variable


I did some digging around, but I'm still very new to the concept of Latin hypercube sampling. I found this example, which uses the lhs package:

library(lhs)
set.seed(1)
randomLHS(5,2)

           [,1]       [,2]
[1,] 0.84119491 0.89953985
[2,] 0.03531135 0.74352370
[3,] 0.33740457 0.59838122
[4,] 0.47682074 0.07600704
[5,] 0.75396828 0.35548904

From my understanding, the entries in the resulting matrix are the coordinates of 5 points that will be used to determine combinations of two continuous variables.
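A quick way to check that reading is to look at which fifth of the unit interval each coordinate falls in. The sketch below copies the printed matrix by hand, so it does not need the lhs package. In a Latin hypercube sample, each column should have exactly one point in each of the 5 equal-width intervals. The rescaling at the end uses a made-up range of 10 to 20 purely for illustration.

```r
# The 5 x 2 matrix printed above, copied by hand (column-major order).
pts <- matrix(c(0.84119491, 0.03531135, 0.33740457, 0.47682074, 0.75396828,
                0.89953985, 0.74352370, 0.59838122, 0.07600704, 0.35548904),
              ncol = 2)

n <- nrow(pts)

# Index (1..5) of the equal-width interval of [0, 1) that holds each coordinate.
strata <- floor(pts * n) + 1
print(strata)

# Latin property: every column is a permutation of 1:5.
is_latin <- all(apply(strata, 2, function(col) all(sort(col) == seq_len(n))))
print(is_latin)  # TRUE

# Mapping the unit-interval coordinates onto a continuous variable.
# The range [10, 20] is a hypothetical example, not something from the question.
x1 <- qunif(pts[, 1], min = 10, max = 20)
print(x1)
```

Each row of `pts` is one design point, and each column is one variable scaled to [0, 1).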

I'm trying to do a simulation with 5 categorical variables. The number of levels per variable ranges from 2 to 5, which results in 2 x 3 x 4 x 2 x 5 = 240 scenarios. I'd like to cut that down as much as possible, so I was thinking of using a Latin hypercube, but I'm confused about how to proceed. Any ideas would be much appreciated!

Also, do you know of any good resources that explain how to analyze the results from Latin hypercube sampling?

Answered by pjs

I'd recommend sticking with the full factorial with 240 design points, for the following reasons.

  1. Heck, this is what computers are for: automating tedious computational tasks. 240 design points is nothing when you're doing this on a computer! You can easily automate the process with nested loops iterating through the levels, one loop per factor. Don't forget an innermost loop for replications. If each simulation takes more than a minute or two, split the runs across multiple cores or multiple machines. One of my students recently did this for his MS thesis work, and was able to run more than a million simulated experiments over a weekend.

  2. With continuous factors, you generally assume some degree of smoothness in the response surface and infer/project the response between adjacent design points based on regression. With categorical data, inference isn't valid for excluded factor combinations and interactions may very well be the dominant effects. Unless you do the full factorial, the combinations you omit may or may not be the most important ones, but the point is that you'll never know if you didn't sample there.
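The full factorial with replications from point 1 can be sketched in R roughly as follows. Instead of writing five nested loops by hand, this version uses `expand.grid` to enumerate the combinations. The factor names, level labels, replication count, and the `run_sim` function are all placeholders invented for illustration; only the level counts (2, 3, 4, 2, 5) come from the question.

```r
# Hypothetical factor names and level labels; only the level counts are from the question.
levels_list <- list(
  A = c("a1", "a2"),
  B = c("b1", "b2", "b3"),
  C = c("c1", "c2", "c3", "c4"),
  D = c("d1", "d2"),
  E = c("e1", "e2", "e3", "e4", "e5")
)

# One row per factor combination: the full factorial design.
design <- expand.grid(levels_list, stringsAsFactors = TRUE)
print(nrow(design))  # 240

n_reps <- 3  # assumed number of replications per design point

# Placeholder for the real simulation: takes one scenario (a one-row data frame)
# and returns a single numeric response. Replace with the actual model.
run_sim <- function(scenario) {
  rnorm(1)
}

set.seed(1)

# Repeat every design row n_reps times, then run the simulation once per row.
runs <- design[rep(seq_len(nrow(design)), each = n_reps), , drop = FALSE]
rownames(runs) <- NULL
runs$rep <- rep(seq_len(n_reps), times = nrow(design))
runs$y <- vapply(seq_len(nrow(runs)),
                 function(i) run_sim(runs[i, ]),
                 numeric(1))

print(head(runs))
```

If individual runs are slow, the `vapply` call is the natural place to swap in a parallel map; `parallel::mclapply` is one option on Unix-like systems, though that is a suggestion here rather than something the answer prescribes.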

In general, you use the same analysis tools you would use with any other kind of sampling: regression, logistic regression, ANOVA, partition trees, and so on. For categorical factors, I'm a fan of partition trees.
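As a rough illustration of two of those tools, the sketch below fits an ANOVA with an interaction term and, if the rpart package is installed, a partition (regression) tree. The data are synthetic: two made-up factors `A` and `B` and a response with a deliberately built-in interaction, standing in for real simulation output.

```r
set.seed(42)

# Synthetic stand-in for simulation results: 2 x 3 factorial, 10 replications each.
dat <- expand.grid(A = factor(c("a1", "a2")),
                   B = factor(c("b1", "b2", "b3")),
                   rep = 1:10)

# Response with a main effect of A and an A:B interaction, plus noise (all assumed).
dat$y <- 1 +
  2 * (dat$A == "a2") +
  3 * (dat$A == "a2" & dat$B == "b3") +
  rnorm(nrow(dat), sd = 0.5)

# Two-way ANOVA including the interaction term.
fit <- aov(y ~ A * B, data = dat)
print(summary(fit))

# Partition tree, only if rpart is available (it ships with most R installations).
if (requireNamespace("rpart", quietly = TRUE)) {
  tree <- rpart::rpart(y ~ A + B, data = dat, method = "anova")
  print(tree)
}
```

The tree output lists the factor-level splits that best separate the response, which tends to be easier to read than a long table of interaction coefficients when there are many categorical factors.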