I am writing a pre-processing function in natural language processing for LSA (Latent Semantic Analysis). All the other functions, such as tfidf and remove_stopwords, pass the unit tests that I created. However, the LSA function keeps raising the following error when I test it:
"Expected 2D array, got 1D array instead: array=['I ate dinner at Olive Garden', 'we are buying a house', 'I did not eat dinner at Olive Garden', 'our neighbors are buying a house']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
Here is my code for the LSA function:
import pandas as pd
import nltk
import string
import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
def LSA(data, tfidf = True, remove_stopwords=True):
# done with stop word removal and tf-idf weighting keeping the 100 most common concepts
text = data.iloc[:,-1] #isolate text column
#Define the LSA function
vectors = sklearn.decomposition.TruncatedSVD(n_components = 2, algorithm = 'randomized', n_iter = 100, random_state = 100)
vectors.fit(text.tolist())
svd_matrix = vectors.fit_transform(text.tolist())
svd_matrix = Normalizer(copy=False).fit_transform(text.tolist())
dense = svd_matrix.todense()
denselist = dense.tolist()
data["cleaned_vectorized_document"] = denselist
return data
Here is the test code I am using, which throws the error:
p = pd.DataFrame({'two':[1,2,3,4],'test':['I ate dinner at Olive Garden', 'we are buying a house',
'I did not eat dinner at Olive Garden', 'our neighbors are buying a house']})
print(LSA(p))
I am not sure if this is your only issue, but the array without commas in the message is a red herring: that is just how numpy prints arrays. The real problem is that TruncatedSVD expects a 2D document-term matrix, and you are calling fit on the raw 1D list of strings; the text has to be vectorized (e.g. with TfidfVectorizer) before the SVD step. Two more bugs will surface once you fix that: Normalizer is also fitted on the raw text instead of the SVD output, and TruncatedSVD's fit_transform already returns a dense ndarray, so the later .todense() call will fail.
Try something like this instead:
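(A minimal sketch; I am assuming sklearn's built-in TfidfVectorizer/CountVectorizer with the 'english' stop-word list can stand in for your own tfidf and remove_stopwords helpers, so swap those back in if your preprocessing needs to stay custom.)

import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

def LSA(data, tfidf=True, remove_stopwords=True):
    text = data.iloc[:, -1]  # isolate the text column

    # TruncatedSVD needs a 2D document-term matrix, not raw strings,
    # so vectorize first. Using sklearn's vectorizers here is my
    # assumption, substitute your own tfidf/remove_stopwords helpers
    # if they already produce a matrix.
    stop = 'english' if remove_stopwords else None
    vectorizer = TfidfVectorizer(stop_words=stop) if tfidf else CountVectorizer(stop_words=stop)
    term_matrix = vectorizer.fit_transform(text.tolist())

    # Reduce the document-term matrix to 2 latent concepts (as in your code).
    svd = TruncatedSVD(n_components=2, algorithm='randomized',
                       n_iter=100, random_state=100)
    svd_matrix = svd.fit_transform(term_matrix)  # dense ndarray, shape (n_docs, 2)

    # Normalize the SVD output, not the raw text; no .todense() needed
    # because fit_transform above already returned a dense array.
    svd_matrix = Normalizer(copy=False).fit_transform(svd_matrix)

    data["cleaned_vectorized_document"] = svd_matrix.tolist()
    return data

Run against your test DataFrame, this should return the frame with a new cleaned_vectorized_document column holding one 2-component vector per document, instead of raising the reshape error.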