I am working on a project where I need sentence-level embeddings for Sindhi. I am using the available pretrained Word2vec model, as described in its sample code. That sample code only produces word-level embeddings, whereas I want an embedding for the entire sentence; any pooling strategy, such as averaging, would be fine. However, I am facing issues with my pipeline:
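import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, SentenceEmbeddings

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")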
# Load the pretrained Sindhi word vectors; pretrained() is provided by WordEmbeddingsModel
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sd") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
# Use SentenceEmbeddings for obtaining sentence embeddings.
# The second input column must match the word embeddings stage's output column ("embeddings").
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, word_embeddings, sentence_embeddings])
data = spark.createDataFrame([["مون کي اسپارڪ اين ايل پي سان پيار آهي"]]).toDF("text")
result = pipeline.fit(data).transform(data)
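# Extract the final embeddings: the float vectors are in the "embeddings" field
# of each annotation ("result" holds the text), one vector per document
sentence_vectors = result.select("sentence_embeddings.embeddings").first()[0]
print(sentence_vectors)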
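
To be clear about what I mean by an "Average" strategy, here is a minimal sketch in plain Python, independent of Spark NLP. It assumes the sentence vector is just the element-wise mean of the token vectors, which is my understanding of what average pooling does. The function name `average_pool` and the toy 3-dimensional vectors are made up for illustration; the real model outputs 300-dimensional vectors.

```python
def average_pool(token_vectors):
    """Return the element-wise mean of a list of equal-length token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # Average each dimension across all tokens in the sentence
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy vectors standing in for two tokens of one sentence (hypothetical values)
toy_tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
sentence_vector = average_pool(toy_tokens)
print(sentence_vector)
```

So for a sentence with N tokens I would expect a single vector of the same dimension as the word vectors, not N separate vectors.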
