Getting an "Expected a 2D array, got a 1D array" error when computing LSA

I am writing a natural language pre-processing function for LSA (Latent Semantic Analysis). My other functions, such as tfidf and remove_stopwords, pass the unit tests I created. However, the LSA function keeps raising the following error when I test it:

"Expected 2D array, got 1D array instead: array=['I ate dinner at Olive Garden', 'we are buying a house', 'I did not eat dinner at Olive Garden', 'our neighbors are buying a house']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."

Here is my code for the LSA function and the test code:

import pandas as pd
import nltk
import string
import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer

def LSA(data, tfidf = True, remove_stopwords=True):
    # done with stop word removal and tf-idf weighting keeping the 100 most common concepts
    text = data.iloc[:,-1] #isolate text column
    
     
    #Define the LSA function
    vectors = sklearn.decomposition.TruncatedSVD(n_components = 2, algorithm = 'randomized', n_iter = 100, random_state = 100)

    vectors.fit(text.tolist())
    svd_matrix = vectors.fit_transform(text.tolist())
    svd_matrix = Normalizer(copy=False).fit_transform(text.tolist())

    dense = svd_matrix.todense()
    denselist = dense.tolist()
    
    data["cleaned_vectorized_document"] = denselist
    return data

Here is the test code that I am using that throws the error:

p = pd.DataFrame({'two':[1,2,3,4],'test':['I ate dinner at Olive Garden', 'we are buying a house',
'I did not eat dinner at Olive Garden', 'our neighbors are buying a house']})

print(LSA(p))

There is 1 solution below

chefhose On

I am not sure whether this is your issue, but your array is missing commas between items, which raises at least this error:

ValueError: arrays must all be same length

Try this instead:

p = pd.DataFrame({'two':[1,2,3,4],'test':['I ate dinner at Olive Garden', 'we are buying a house', 'I did not eat dinner at Olive Garden', 'our neighbors are buying a house']})
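
Separately, the "Expected 2D array, got 1D array" message itself most likely comes from passing the raw list of strings straight to TruncatedSVD, which expects a numeric 2D matrix (documents by terms). Below is a minimal sketch of one way to restructure the function, assuming the text should first go through TfidfVectorizer (which the question already imports). How the tfidf and remove_stopwords flags map onto use_idf and stop_words is my assumption, not something the question specifies. Note also that TruncatedSVD.fit_transform returns a dense NumPy array, so the .todense() call is not needed, and that Normalizer should be applied to the SVD output, not to the raw text.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer


def LSA(data, tfidf=True, remove_stopwords=True):
    # Isolate the text column (assumed to be the last column, as in the question).
    text = data.iloc[:, -1]

    # Convert the raw strings into a 2D sparse document-term matrix.
    # Assumption: `tfidf` toggles IDF weighting and `remove_stopwords`
    # toggles scikit-learn's built-in English stop word list.
    vectorizer = TfidfVectorizer(
        use_idf=tfidf,
        stop_words="english" if remove_stopwords else None,
    )
    term_matrix = vectorizer.fit_transform(text.tolist())

    # Reduce the term matrix to 2 latent components (same settings as the question).
    svd = TruncatedSVD(n_components=2, algorithm="randomized",
                       n_iter=100, random_state=100)
    svd_matrix = svd.fit_transform(term_matrix)  # dense ndarray, shape (n_docs, 2)

    # Scale each document vector to unit length.
    svd_matrix = Normalizer(copy=False).fit_transform(svd_matrix)

    # Work on a copy so the caller's DataFrame is not modified.
    result = data.copy()
    result["cleaned_vectorized_document"] = pd.Series(
        svd_matrix.tolist(), index=result.index
    )
    return result
```

With the test DataFrame p from above, print(LSA(p)) should then show one two-element vector per row in the new column.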