Trouble generating randomly distributed points within bin bounds from 2D histogram


My goal is to produce a scatterplot from a 2D histogram, where if a bin has a count of n, then n points are randomly generated within the bin boundaries. Sort of like this:

[Image: 2D histogram and corresponding scatterplot]

However, I am having trouble generating the points within the bin boundaries when the distribution is more uniform. For example, the following heatmap does not produce the scatterplot it should.

2D histogram:

[Image: 2D histogram]

Scatterplot that does NOT reflect the 2D histogram:

[Image: scatterplot that does not reflect the 2D histogram]

I have added the code and function calls below. How can I fix my code so it will correctly generate points?

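import numpy as np
import matplotlib.pyplot as plt
from random import uniform  # assumed: uniform comes from the standard random module

def repopulateScatterHelper(x,y,m):
    """
      generate a random point within the bounds of bin (x, y);
      relies on the global xedges and yedges returned by np.histogram2d
    """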
    # compute x and y axis min and max
    maxX = max(xedges) #the max x value from edges
    maxY = max(yedges)

    minX = min(xedges)
    minY = min(yedges)

    # compute bin boundaries
    x1 = float(x)/m * (maxX-minX) + minX
    x2 = float(x+1)/m *(maxX-minX) + minX

    y1 = float(y)/m * (maxY-minY) + minY
    y2 = float(y+1)/m * (maxY-minY) + minY

    # generate random point within bin boundaries
    the_x = uniform(x1, x2)
    the_y = uniform(y1, y2)
    return the_x, the_y
# enddef

def repopulateScatter(H, m):
  """
    @params
      H - 2D array of counts
      m - number of bins along each axis
    @returns
      new_x, new_y - Generated corresponding x and y coordinates of points

  """
  new_x = []
  new_y = []
  for i in range(0,m): # x bins (first axis of H)
      for j in range(0,m): # y bins (second axis of H)
          if H[i][j] > 0: # if count is greater than zero, generate points
              for point in range(0, int(H[i][j])):
                  x_i, y_i = repopulateScatterHelper(i,j,m)
                  new_x.append(x_i)
                  new_y.append(y_i)
              #endfor
          #endif
      #endfor
  #endfor

  return new_x,new_y
#enddef

def plotHistToScatter(new_x, new_y):
  """
    new_x, new_y - x,y coordinates to plot
  """
  new_x = np.array(new_x)
  new_y = np.array(new_y)

  # plot data points
  fig, ax = plt.subplots()
  ax.scatter(new_x,new_y)

  # add LOBF to plot  - https://www.statology.org/line-of-best-fit-python/
  a, b = np.polyfit(new_x,new_y, 1)
  a = float(a)

  plt.plot(new_x, a*new_x+b, color = "red")
  print("DP LOBF:", a , "*(x) +" , b)

  # label the plot
  plt.xlabel(xAxisLabel)
  plt.ylabel(yAxisLabel)
  plt.title("heatmap to scatterplot for " + xAxisLabel + " vs " + yAxisLabel + ", epsilon = " + str(epsilon))

  plt.show()
#enddef

My function calls are:

H, xedges, yedges = np.histogram2d(df[xAxisLabel],df[yAxisLabel], bins=(m, m)) # compute 2D histogram counts and bin edges
new_x,new_y = repopulateScatter(H,m)
plotHistToScatter(new_x, new_y)
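
One thing that may be worth checking, as an assumption only because the heatmap plotting code is not shown here: np.histogram2d puts x along the first axis of H and y along the second, so H[i][j] is the count for x bin i and y bin j. An image plot treats rows as the vertical axis, so H usually has to be transposed before it is displayed. The toy data below is hypothetical and only illustrates the indexing.

```python
import numpy as np

# Hypothetical toy data: three points at large x / small y, one at small x / large y.
x = np.array([0.9, 0.9, 0.9, 0.1])
y = np.array([0.1, 0.1, 0.1, 0.9])

H, xedges, yedges = np.histogram2d(x, y, bins=(2, 2), range=[[0, 1], [0, 1]])

# First index is the x bin, second index is the y bin.
count_high_x_low_y = H[1, 0]
count_low_x_high_y = H[0, 1]

# For display, rows should be y bins and columns x bins, so transpose.
# A typical call would then be:
#   ax.imshow(image, origin="lower",
#             extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
image = H.T
```

If the heatmap was drawn from H without the transpose, the heatmap and the regenerated scatterplot would look like mirror images across the diagonal even when the point generation itself is fine.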

I have tried to change the repopulateScatterHelper() function to fix it, however I have been unsuccessful.
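
For comparison, here is a minimal sketch of the same idea that reads the bin boundaries straight from xedges and yedges instead of recomputing them from the min and max. The function name repopulate_from_edges and the rng parameter are hypothetical, and the clipping of negative or fractional counts is an assumption about what H might contain (for example if noise was added to it).

```python
import numpy as np

def repopulate_from_edges(H, xedges, yedges, rng=None):
    """Draw int(H[i, j]) uniform points inside bin (i, j) of a 2D histogram."""
    rng = np.random.default_rng() if rng is None else rng
    xedges = np.asarray(xedges)
    yedges = np.asarray(yedges)

    # Assumption: negative counts are treated as zero and fractions are truncated.
    counts = np.clip(np.asarray(H), 0, None).astype(int)

    # Indices of non-empty bins: i is the x bin, j is the y bin.
    i_idx, j_idx = np.nonzero(counts)
    reps = counts[i_idx, j_idx]

    # Repeat each bin index once per point that falls in that bin.
    i_rep = np.repeat(i_idx, reps)
    j_rep = np.repeat(j_idx, reps)

    # Sample each point uniformly between its bin's lower and upper edge.
    new_x = rng.uniform(xedges[i_rep], xedges[i_rep + 1])
    new_y = rng.uniform(yedges[j_rep], yedges[j_rep + 1])
    return new_x, new_y
```

A quick sanity check for any version of this function is to run np.histogram2d on the generated points with bins=(xedges, yedges) and compare the result to H. If the counts match, the points are landing in the right bins and the mismatch is more likely in how the heatmap is drawn.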
