I am trying to do hyperparameter optimization on Databricks using Hyperopt with SparkTrials. My DataFrame of training data is extremely large (over 1 GB) and it can't be subsampled because of the nature of the model; it's also too large to broadcast. So I tried to follow the best practices Databricks recommends for very large datasets (my model is not actually a Lasso, that's just the example from the tutorial):
import os
import shutil
import tempfile

import numpy as np
from sklearn import linear_model
from hyperopt import fmin, hp, tpe, STATUS_OK, SparkTrials


def train_and_eval(data, alpha):
    X_train, X_test, y_train, y_test = data
    model = linear_model.Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    # Negate R^2 so that fmin, which minimizes the loss, favours better fits
    loss = -model.score(X_test, y_test)
    return {"loss": loss, "status": STATUS_OK}
def tune_alpha(objective):
    best = fmin(
        fn=objective,
        space=hp.uniform("alpha", 0.0, 10.0),
        algo=tpe.suggest,
        max_evals=4,
        trials=SparkTrials(parallelism=2))
    return best["alpha"]
def save_to_dbfs(data):
    """
    Saves input data (a tuple of numpy arrays) to a temporary file on DBFS and returns its path.
    """
    # Save data to a local file first
    data_filename = "data.npz"
    local_data_dir = tempfile.mkdtemp()
    local_data_path = os.path.join(local_data_dir, data_filename)
    np.savez(local_data_path, *data)
    # Move the data to DBFS, which is shared among cluster nodes
    dbfs_tmp_dir = "/dbfs/ml/tmp/hyperopt"
    os.makedirs(dbfs_tmp_dir, exist_ok=True)
    dbfs_data_dir = tempfile.mkdtemp(dir=dbfs_tmp_dir)
    dbfs_data_path = os.path.join(dbfs_data_dir, data_filename)
    shutil.move(local_data_path, dbfs_data_path)
    return dbfs_data_path
def load(path):
    """
    Loads saved data (a tuple of numpy arrays).
    """
    npz = np.load(path)
    # Return the arrays in the order they were saved (arr_0, arr_1, ...)
    return [npz[key] for key in npz.files]
# data_large is the (X_train, X_test, y_train, y_test) tuple of NumPy arrays
data_large_path = save_to_dbfs(data_large)


def objective_large(alpha):
    # Load data back from DBFS onto workers
    data = load(data_large_path)
    return train_and_eval(data, alpha)
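For completeness, I then kick off the tuning roughly like this (paraphrasing the rest of the tutorial; the variable name is mine):

best_alpha = tune_alpha(objective_large)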
But this does not work in my case: in this approach (essentially cut and pasted from the Databricks tutorial), "data" comes back as a list of plain NumPy arrays, while my model expects a DataFrame with specific columns. Should there be a step that turns those arrays back into a DataFrame? If so, what's the most efficient way of doing it? Thanks for any help!
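In case it clarifies what I'm after, this is the kind of step I imagine might be missing: rebuilding pandas DataFrames from the loaded arrays before they reach the model. The column names below are placeholders for my real ones, and I have no idea whether this is the efficient way to do it:

import pandas as pd

FEATURE_COLUMNS = ["feat_1", "feat_2", "feat_3"]  # placeholders for my real column names


def objective_large_df(alpha):
    # Load the four arrays back from DBFS on the worker
    X_train, X_test, y_train, y_test = load(data_large_path)
    # Rebuild DataFrames so the model sees the columns it expects
    X_train = pd.DataFrame(X_train, columns=FEATURE_COLUMNS)
    X_test = pd.DataFrame(X_test, columns=FEATURE_COLUMNS)
    return train_and_eval((X_train, X_test, y_train, y_test), alpha)

Or would it be better to skip the .npz round trip entirely, write the original DataFrame to Parquet on DBFS (df.to_parquet("/dbfs/ml/tmp/hyperopt/data.parquet")) and pd.read_parquet it inside the objective, so that column names and dtypes survive?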