MLOps Hyperparameter tuning when the dataframe is too large to be broadcast

I am trying to do some hyperparameter optimization on Databricks using MLOps. The dataframe containing my training data is extremely large (over 1 GB) and it can't be split up due to the nature of the model; it's also too large to be broadcast. So I tried to follow the best practices recommended by Databricks for very large datasets (my model is not a Lasso one, this is just an example):

import os
import shutil
import tempfile

import numpy as np
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe
from sklearn import linear_model


def train_and_eval(data, alpha):
  # Unpack the pre-split dataset and score one candidate alpha
  X_train, X_test, y_train, y_test = data
  model = linear_model.Lasso(alpha=alpha)
  model.fit(X_train, y_train)
  loss = model.score(X_test, y_test)
  return {"loss": loss, "status": STATUS_OK}
 

def tune_alpha(objective):
  # Distribute the search across the cluster with SparkTrials
  best = fmin(
    fn=objective,
    space=hp.uniform("alpha", 0.0, 10.0),
    algo=tpe.suggest,
    max_evals=4,
    trials=SparkTrials(parallelism=2),
  )
  return best["alpha"]


def save_to_dbfs(data):
  """
  Saves input data (a tuple of numpy arrays) to a temporary file on DBFS and returns its path.
  """
  # Save data to a local file first
  data_filename = "data.npz"
  local_data_dir = tempfile.mkdtemp()
  local_data_path = os.path.join(local_data_dir, data_filename)
  np.savez(local_data_path, *data)

  # Move the data to DBFS, which is shared among cluster nodes
  dbfs_tmp_dir = "/dbfs/ml/tmp/hyperopt"
  os.makedirs(dbfs_tmp_dir, exist_ok=True)
  dbfs_data_dir = tempfile.mkdtemp(dir=dbfs_tmp_dir)
  dbfs_data_path = os.path.join(dbfs_data_dir, data_filename)
  shutil.move(local_data_path, dbfs_data_path)
  return dbfs_data_path
 
def load(path):
  """
  Loads saved data (a tuple of numpy arrays).
  """
  return list(np.load(path).values())


# Save the large dataset once from the driver so the workers can read it from DBFS
data_large_path = save_to_dbfs(data_large)


def objective_large(alpha):
  # Load data back from DBFS onto workers
  data = load(data_large_path)
  return train_and_eval(data, alpha)
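
(The driver-side call isn't shown above; I invoke it roughly like this, where best_alpha is just my own name for the result.)

# Run the distributed search; SparkTrials evaluates objective_large on the workers
best_alpha = tune_alpha(objective_large)
print("Best alpha found:", best_alpha)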

But this does not work in my case: in this approach (which is essentially a cut-and-paste from the Databricks tutorials), "data" comes back as a plain list of arrays, while my model expects a dataframe with specific columns. Should there be a step that combines those arrays back into a dataframe? If so, what's the most efficient way of doing it? Thanks for any help!
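
To make the question concrete, the kind of recombination step I have in mind would go inside the objective, along the lines of the sketch below (the "feature_i" column names are placeholders, and I don't know whether this is the right or an efficient way):

import pandas as pd

def objective_large(alpha):
  # Load the saved arrays back on the worker
  X_train, X_test, y_train, y_test = load(data_large_path)
  # Rebuild dataframes with the column names the model expects
  # ("feature_0", "feature_1", ... are placeholder names here)
  columns = [f"feature_{i}" for i in range(X_train.shape[1])]
  X_train = pd.DataFrame(X_train, columns=columns)
  X_test = pd.DataFrame(X_test, columns=columns)
  return train_and_eval((X_train, X_test, y_train, y_test), alpha)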
