Programmatic checkout of Databricks Repos branch


I have an integration test that compares the output from running the same scripts on 2 different branches (i.e., master and a feature branch). Currently this test kicks off from my local machine, but I'd like to migrate it to a Databricks job and run it entirely from the Workflows interface.

I'm able to recreate most of the existing integration test (written in Python) using notebooks and dbutils, with the exception of the feature branch checkout. I can make a call from my local machine to the Repos REST API to perform the checkout, but (from what I can tell) I can't make that same call from a job that's running on the Databricks cloud. (I run into credentials/authentication issues when I try, and my solutions are getting increasingly hacky.)

Is there a way to check out a branch using pure Python code; something like a dbutils.repos.checkout()? Alternatively, is there a safe way to call the REST APIs from a job that's running on the Databricks cloud?

2 Answers

Alex Ott (accepted answer):

You can use the Repos REST API, specifically its Update command. But if you're doing CI/CD, it's easier to use the databricks repos update command of the Databricks CLI, like this:

databricks repos update --path <path> --branch <branch>

P.S. I have an end-to-end example of doing CI/CD for Repos + Notebooks on Azure DevOps, but the approach will be the same for other systems. Here is an example of using the Databricks CLI for checkout.
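
If you want to stay in pure Python inside a notebook rather than shelling out to the CLI, here is a minimal sketch of the same Update call against the Repos REST API; the repo path and branch name are placeholders you would replace with your own:

import json
import requests

# Reuse the workspace URL and token from the notebook context
ctx = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
host = ctx['extraContext']['api_url']
token = ctx['extraContext']['api_token']
headers = {"Authorization": f"Bearer {token}"}

# Look up the repo ID by its workspace path (placeholder path)
repos = requests.get(f"{host}/api/2.0/repos",
                     params={"path_prefix": "/Repos/your_user/your_repo"},
                     headers=headers).json()["repos"]
repo_id = repos[0]["id"]

# Check out the feature branch via the Update endpoint
requests.patch(f"{host}/api/2.0/repos/{repo_id}",
               headers=headers,
               json={"branch": "my-feature-branch"}).raise_for_status()

Because this reuses the token from the notebook context, it should avoid having to store a personal access token in the job itself.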

George Sotiropoulos:

Just for the record, here is some code you can execute in a notebook to "update" another Repos folder and then execute it. I believe it does what the accepted answer says, by using the databricks-cli API from within a Databricks notebook.

import json

from databricks_cli.repos.api import ReposApi
from databricks_cli.sdk.api_client import ApiClient
from databricks_cli.workspace.api import WorkspaceApi

# Get the workspace URL and API token from the notebook context
context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
url = context['extraContext']['api_url']
token = context['extraContext']['api_token']

api_client = ApiClient(
    host=url,
    token=token
)

repo_url = "https://[email protected]/your_repo_url"  # same as the one you use to clone
repos_path = "/Repos/your_repo/"
repos_api = ReposApi(api_client)
workspace_api = WorkspaceApi(api_client)

# 1. Create the parent folder if it doesn't exist
workspace_api.mkdirs(repos_path)

# 2. If the repo already exists, delete it, so you are sure to get the branch you want
try:
    repo_id = repos_api.get_repo_id(repos_path + "your_repo")
    repos_api.delete(repo_id)
except RuntimeError:
    pass

# 3. Clone the repo again and check out the desired branch
repos_api.create(url=repo_url, path=repos_path + "your_repo", provider='azureDevOpsServices')
repos_api.update(repo_id=repos_api.get_repo_id(repos_path + "your_repo"),
                 branch='master', tag=None)

What it does:

  1. First connects using the notebook context.
  2. Then deletes the target folder if it exists.
  3. Creates it again and updates it to the desired branch (the update is probably redundant).

I am deleting the existing folder to avoid conflicts with local changes. If someone made changes in the target Repos folder and you just update, you pull the changes from the origin but it doesn't remove the changes already there. With delete and create, it's like resetting the folder.

This way, you can execute a script from another repo.
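
For instance, assuming the repo landed under the placeholder path used above, a quick sketch of running one of its notebooks (the path and timeout are hypothetical):

# Run a notebook from the freshly checked-out repo (placeholder path)
result = dbutils.notebook.run("/Repos/your_repo/your_repo/tests/integration_test",
                              timeout_seconds=3600)
print(result)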

Alternatively, another way to do this is to create a job in Databricks and use the Databricks API to run it. However, you will have to create a different job for each notebook to be executed.
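
As a rough sketch of that approach, assuming you have already created a job for the notebook (the job ID below is a placeholder), you could trigger it with the Jobs run-now endpoint:

import json
import requests

ctx = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
host = ctx['extraContext']['api_url']
token = ctx['extraContext']['api_token']

# Trigger an existing job by its ID (placeholder)
resp = requests.post(f"{host}/api/2.1/jobs/run-now",
                     headers={"Authorization": f"Bearer {token}"},
                     json={"job_id": 123456})
resp.raise_for_status()
print(resp.json()["run_id"])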