I wanted to use the script below to generate embeddings. It worked fine on a small amount of data, but after I loaded a CSV with 300,000 records, the embedding step has been running for 40 minutes and still hasn't finished.
The script:
import os
import numpy as np
import openai
import pandas as pd
from dotenv import load_dotenv
from langchain.embeddings import OpenAIEmbeddings
from openai.embeddings_utils import get_embedding

load_dotenv('.env')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
openai.api_key = OPENAI_API_KEY
model = OpenAIEmbeddings()  # not used further below

dataset = pd.read_csv('keywords.csv', encoding='ISO-8859-1')
# one API call per keyword -- this is the slow part with 300,000 rows
dataset['embedding'] = dataset['keyword'].apply(
    lambda x: get_embedding(x, engine='text-embedding-ada-002')
)
dataset['embedding'] = dataset['embedding'].apply(np.array)

keyword = input('Input:')
keywordVector = get_embedding(
    keyword, engine="text-embedding-ada-002"
)
print(keywordVector)
How can I optimize this?
Instead of calling the API for each keyword separately, batch multiple keywords into a single request. The embeddings endpoint accepts a list of strings as the input, so with the pre-1.0 openai Python SDK you can pass a whole chunk of keywords to openai.Embedding.create in one call rather than issuing 300,000 individual requests. A large part of your runtime is per-request overhead (network round trips), so batching alone should give a substantial speedup.
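Here is a minimal sketch of that approach, assuming the pre-1.0 openai package your script already uses; the embed_in_batches helper and the batch size of 1000 are illustrative, not part of the API:

import numpy as np
import openai
import pandas as pd

def embed_in_batches(texts, batch_size=1000, engine='text-embedding-ada-002'):
    # Illustrative helper: embed the keywords in chunks instead of one request per keyword.
    # Check the API docs for the current limits on inputs and tokens per request.
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = openai.Embedding.create(input=batch, engine=engine)
        # Each result carries an 'index' field; sort on it to preserve input order.
        ordered = sorted(response['data'], key=lambda item: item['index'])
        embeddings.extend(item['embedding'] for item in ordered)
    return embeddings

dataset = pd.read_csv('keywords.csv', encoding='ISO-8859-1')
dataset['embedding'] = embed_in_batches(dataset['keyword'].tolist())
dataset['embedding'] = dataset['embedding'].apply(np.array)

With a batch size of 1000, 300,000 keywords become roughly 300 requests instead of 300,000. You may still need to handle rate limits (for example by retrying on openai.error.RateLimitError) and make sure no keyword is an empty string, since the endpoint rejects empty inputs.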