Finding embedding of a molecule dataset


I am computing embeddings of a drug dataset using MACAW (Molecular AutoenCoding Auto-Workaround), an accessible tool for molecular embedding and inverse molecular design. I then convert the embeddings into a pandas DataFrame and save it as a .csv file that also includes the class labels from the original dataset.

But when I apply SMOTE before training an MLP or logistic regression classifier, the classification metrics (precision, recall, F1 score) stay the same, which means there is no improvement after applying SMOTE.

So I suspect there is a problem in how I compute the embeddings. Please help.

The code I used, the dataset, and the paper that gave me the idea are given below.

My source code:

import sys

import pandas as pd
from google.colab import files

sys.path.append('../')  # make a local copy of MACAW importable
import macaw
from macaw import MACAW

print(macaw.__version__)

# Upload BBBP.csv to the Colab session, then load it
files.upload()
df = pd.read_csv("BBBP.csv")
smiles = df["smiles"]
print(len(smiles))

# Fit MACAW on the SMILES strings and embed the same molecules
mcw = MACAW(random_state=42)
mcw.fit(smiles)
BBBP_embedding = mcw.transform(smiles)
print(BBBP_embedding)

# Attach the class label (p_np) to the embedding; rows are matched by index
embedding_df = pd.DataFrame(BBBP_embedding)
embedding_df = embedding_df.join(df["p_np"])

# index=False keeps the row index from being saved as an extra column
embedding_df.to_csv("BBBP_embedding.csv", index=False)
files.download("BBBP_embedding.csv")
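Before blaming the classifier or SMOTE, it may help to sanity-check the saved embedding table itself. The sketch below is a hypothetical helper (summarize_embedding is not part of MACAW or pandas) that reports rows with missing values, missing labels, constant feature columns, and class counts. It runs on a small made-up table so that it is self-contained; the real table would be loaded with pd.read_csv("BBBP_embedding.csv") instead.

```python
import numpy as np
import pandas as pd


def summarize_embedding(table: pd.DataFrame, label_col: str = "p_np") -> dict:
    """Return basic health checks for an embedding table with a label column."""
    features = table.drop(columns=[label_col])
    return {
        "n_rows": len(table),
        "n_features": features.shape[1],
        # Rows where at least one embedding dimension is missing
        "rows_with_nan": int(features.isna().any(axis=1).sum()),
        # Labels lost during the join (for example, misaligned indices)
        "missing_labels": int(table[label_col].isna().sum()),
        # Feature columns that carry no information
        "constant_columns": int((features.nunique(dropna=True) <= 1).sum()),
        # Class balance, which determines whether SMOTE has anything to do
        "class_counts": table[label_col].value_counts().to_dict(),
    }


# Made-up stand-in for the real embedding table (illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(6, 3)))
demo["p_np"] = [1, 1, 1, 1, 0, 0]
demo.iloc[2, 0] = np.nan  # simulate one molecule with a failed embedding

report = summarize_embedding(demo)
print(report)
```

Rows with missing values or labels would need to be dropped or repaired before resampling, and a table whose class counts are already close to balanced leaves little room for oversampling to change the metrics.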

Dataset link: https://moleculenet.org/

Paper link: https://pubs.acs.org/doi/10.1021/acs.jcim.2c00229

I hope someone can spot the mistake in the code and help me correct it. Thanks!
