I am training a model using stable-diffusion-v1-4 with a training dataset of around 4,900 examples. I use the following to create a pipe. It runs on CUDA with an 8 GB NVIDIA GPU (I also tried Google Colab). With a batch size of 16 and num_inference_steps=25, one batch takes around 7 minutes to train, and much longer if I increase the number of steps, so one full epoch takes around 34 hours (4900 / 16 ≈ 307 batches at about 7 minutes each), which seems excessive. Am I doing something wrong here, or is this the best I can hope for given the hardware, an 8 GB GPU on my personal computer? All the delay comes from this line of code:
with autocast():
    outputs = pipe(combined_meanings, tokenized_prompts=batch_tokenized_prompts, num_inference_steps=num_steps)["images"]
I need to keep the model as sd-v1-4.
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, cache_dir=".../models/ldm/stable-diffusion-v1/")
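For comparison, a variant I have been considering, based on the diffusers documentation, loads the pipeline in half precision and moves it to the GPU up front, with attention slicing enabled to fit the 8 GB card; I am not sure whether this is the right approach for my setup:

import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
# Load the weights in float16 and keep the whole pipeline on the GPU
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    cache_dir=".../models/ldm/stable-diffusion-v1/",
)
pipe = pipe.to("cuda")
# Computes attention in slices; slightly slower but uses much less VRAM
pipe.enable_attention_slicing()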
I then run training with pipe and define the associated functions as follows:
import time

import torch
from torch.cuda.amp import autocast
from transformers import CLIPTokenizer

# Define function to pre-tokenize prompts
def pre_tokenize_prompts(train_dataset, tokenizer):
    """
    Pre-tokenize prompts for training.

    Args:
        train_dataset: The training dataset.
        tokenizer: The CLIP tokenizer.

    Returns:
        list: Pre-tokenized prompts.
    """
    tokenized_prompts = []
    for entry in train_dataset:
        meanings = [str(entry['meaning'])]
        max_length = 77
        stride = 5
        for start in range(0, len(meanings), stride):
            end = min(start + max_length, len(meanings))
            segment = ' '.join(meanings[start:end])
            tokens = tokenizer.tokenize(segment)
            tokenized_prompts.extend(tokens)
    return tokenized_prompts
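For illustration, this is how I call it on a couple of toy entries (the 'meaning' strings here are made up, just to show that the result is a single flat list of CLIP sub-word tokens):

toy_dataset = [
    {'meaning': 'a red bicycle leaning against a wall', 'image': None},
    {'meaning': 'a bowl of oranges on a wooden table', 'image': None},
]
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokens = pre_tokenize_prompts(toy_dataset, tokenizer)
print(len(tokens), tokens[:5])  # one flat list of tokens across all entries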
# Define function to train model
def train(epochs, batch_size, lr, num_steps, train_dataset, device, dataset_length):
    """
    Train the model.

    Args:
        epochs (int): Number of epochs.
        batch_size (int): Batch size.
        lr (float): Learning rate.
        num_steps (int): Number of inference steps.
        train_dataset: The training dataset.
        device: The device for computations.
        dataset_length (int): Length of the training dataset.
    """
    # Instantiate the CLIP tokenizer
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    # Pre-tokenize prompts
    tokenized_prompts = pre_tokenize_prompts(train_dataset, tokenizer)
    # Define the optimizer
    optimizer = create_optimizer(pipe, lr)
    for epoch in range(epochs):
        for i in range(0, len(train_dataset), batch_size):
            batch = train_dataset[i:i + batch_size]
            meanings = [str(entry['meaning']) for entry in batch]
            images = [entry['image'] for entry in batch]
            # Convert list of meanings to a single string
            combined_meanings = ' '.join(meanings)
            # Zero the gradients
            optimizer.zero_grad()
            # Use pre-tokenized prompts
            start_idx = i * len(tokenized_prompts)
            end_idx = (i + batch_size) * len(tokenized_prompts)
            batch_tokenized_prompts = tokenized_prompts[start_idx:end_idx]
            # Print the meanings for each batch
            print('combined meanings: ', combined_meanings)
            # Forward pass
            start_time = time.time()
            with autocast():
                outputs = pipe(combined_meanings, tokenized_prompts=batch_tokenized_prompts, num_inference_steps=num_steps)["images"]
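For what it is worth, this is roughly how I timed a single pipe() call in isolation to confirm that the delay really is in the generation (torch.cuda.synchronize() is there because CUDA kernels run asynchronously, so time.time() alone can be misleading; I drop the tokenized_prompts argument here to time just the standard text-to-image path):

torch.cuda.synchronize()   # make sure earlier GPU work has finished
start_time = time.time()
with autocast():
    outputs = pipe(combined_meanings, num_inference_steps=num_steps)["images"]
torch.cuda.synchronize()   # wait for the generation to actually complete
print(f"one batch took {time.time() - start_time:.1f} s")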
I make sure the pipe() operations and the tensors reside on the GPU by calling .to(device), with device defined as device = torch.device('cuda' if torch.cuda.is_available() else 'cpu'); a minimal sketch of what I mean is below.
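Assuming the pipeline is the only object that needs to be moved (nothing else in my loop creates tensors directly), this is roughly what that looks like:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Move all of the pipeline's sub-modules (UNet, VAE, text encoder) to the GPU once
pipe = pipe.to(device)
print(pipe.device)  # should report cuda:0 when the GPU is actually being used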