Training time seems excessive for a Stable Diffusion v1-4 model, given the hardware and hyperparameters


I am training a model using stable-diffusion-v1-4 with a training dataset of around 4,900 examples. I use the following to create a pipe. It runs on CUDA with an 8 GB Nvidia GPU, and even when I tried it on Google Colab, the time to train one batch of size 16 with num_inference_steps=25 is around 7 minutes (much longer if I increase that number). One full epoch therefore takes around 34 hours, which seems very excessive. Am I doing something wrong here, or is this the best I can hope for given the hardware, an 8 GB GPU in my personal computer? All the delay comes from this line of code:

        with autocast():
            outputs = pipe(combined_meanings, tokenized_prompts=batch_tokenized_prompts, num_inference_steps=num_steps)["images"]

I need to keep the model as sd-v1-4.
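
For scale, here is the back-of-the-envelope arithmetic behind that 34-hour figure (a rough sketch using the numbers above; actual timings vary):

    # Back-of-the-envelope epoch time from the figures reported above
    dataset_size = 4900
    batch_size = 16
    minutes_per_batch = 7  # observed time per batch at num_inference_steps=25

    batches_per_epoch = -(-dataset_size // batch_size)  # ceiling division -> 307
    hours_per_epoch = batches_per_epoch * minutes_per_batch / 60
    print(f"{batches_per_epoch} batches/epoch, ~{hours_per_epoch:.1f} h/epoch")
    # -> 307 batches/epoch, ~35.8 h/epoch, in line with the ~34 h I am seeing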

    from diffusers import StableDiffusionPipeline

    model_id = "CompVis/stable-diffusion-v1-4"
    pipe = StableDiffusionPipeline.from_pretrained(model_id, cache_dir=".../models/ldm/stable-diffusion-v1/")

I then run training through pipe and define the associated functions as follows:

import time

from torch.cuda.amp import autocast
from transformers import CLIPTokenizer


# Define function to pre-tokenize prompts
def pre_tokenize_prompts(train_dataset, tokenizer):
    """
    Pre-tokenize prompts for training.

    Args:
        train_dataset: The training dataset.
        tokenizer: The CLIP tokenizer.

    Returns:
        list: One list of tokens per dataset entry, in dataset order.
    """
    max_length = 77  # CLIP's maximum sequence length
    tokenized_prompts = []
    for entry in train_dataset:
        meaning = str(entry['meaning'])
        # Tokenize the whole prompt and truncate to CLIP's window
        tokens = tokenizer.tokenize(meaning)[:max_length]
        tokenized_prompts.append(tokens)
    return tokenized_prompts
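
As a quick sanity check, this is how the function behaves on a made-up two-entry dataset (the 'meaning' strings here are purely illustrative):

    from transformers import CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    toy_dataset = [{'meaning': 'a red fox in the snow'},
                   {'meaning': 'a lighthouse at dusk'}]
    tokenized = pre_tokenize_prompts(toy_dataset, tokenizer)
    print(len(tokenized))    # 2 -> one token list per entry
    print(tokenized[0][:4])  # first few CLIP tokens of the first prompt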
        


# Define function to train model
def train(epochs, batch_size, lr, num_steps, train_dataset, device, dataset_length):
    """
    Train the model.

    Args:
        epochs (int): Number of epochs.
        batch_size (int): Batch size.
        lr (float): Learning rate.
        num_steps (int): Number of inference steps.
        train_dataset: The training dataset.
        device: The device for computations.
        dataset_length (int): Length of the training dataset.
    """
    # Instantiate the CLIP tokenizer
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    # Pre-tokenize prompts
    tokenized_prompts = pre_tokenize_prompts(train_dataset, tokenizer)
    # Define the optimizer (create_optimizer is a helper defined elsewhere in my code)
    optimizer = create_optimizer(pipe, lr)
    for epoch in range(epochs):
        for i in range(0, len(train_dataset), batch_size):
            batch = train_dataset[i:i + batch_size]
            meanings = [str(entry['meaning']) for entry in batch]
            images = [entry['image'] for entry in batch]

            # Convert list of meanings to a single string
            combined_meanings = ' '.join(meanings)

            # Zero the gradients
            optimizer.zero_grad()

            # Use pre-tokenized prompts: one token list per entry, so this
            # slice mirrors the dataset slice that built the batch above
            batch_tokenized_prompts = tokenized_prompts[i:i + batch_size]

            # Print the meanings for each batch
            print('combined meanings: ', combined_meanings)
            
            # Forward pass
            start_time = time.time()
            
            with autocast():
                outputs = pipe(combined_meanings, tokenized_prompts=batch_tokenized_prompts, num_inference_steps=num_steps)["images"]
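
For reference, I invoke it roughly like this (the learning rate is a placeholder; batch size and step count are the values quoted above):

    import torch

    # Illustrative call; lr here is a placeholder value
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train(epochs=1, batch_size=16, lr=1e-5, num_steps=25,
          train_dataset=train_dataset, device=device,
          dataset_length=len(train_dataset))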
            

There is 1 answer below.

Answer from kevbuntu:

Make sure the pipe() call and all tensors reside on the GPU by moving them with .to(device), where device is defined as device = torch.device('cuda' if torch.cuda.is_available() else 'cpu').
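
A minimal sketch of that device placement (half precision and attention slicing are optional additions that also help on an 8 GB card):

    import torch
    from diffusers import StableDiffusionPipeline

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Load the weights in half precision on GPU to cut memory use and latency
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16 if device.type == 'cuda' else torch.float32,
    )
    pipe = pipe.to(device)  # moves the UNet, VAE, and text encoder

    # Optional: slice the attention computation to lower peak memory
    pipe.enable_attention_slicing()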