Is there a way to reduce the size of the sklearn 20newsgroups dataset?


I am in process of learning the basics of NLP and I am trying to code the kNN classifier.

In the data preparation stage, I am trying to reduce the set size down to a certain dimension but I am confused about how to do that.

Can anyone help me out?

I have written the code below for getting the training dataset

from sklearn.datasets import fetch_20newsgroups

trainingData = fetch_20newsgroups(subset="train", categories=allCategories)

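As an aside, if the goal is fewer documents rather than fewer features, `fetch_20newsgroups` accepts a `categories` list, so fetching only a subset of categories is the simplest way to shrink the dataset itself. A minimal sketch, using two example category names:

```python
from sklearn.datasets import fetch_20newsgroups

# Example category names (any subset of the 20 newsgroup labels works);
# fetching fewer categories shrinks the number of documents returned.
someCategories = ["sci.space", "rec.autos"]
trainingData = fetch_20newsgroups(subset="train", categories=someCategories)

print(len(trainingData.data))          # number of documents fetched
print(len(trainingData.target_names))  # 2 categories
```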
1 Answer

Answer by meti:

What you're trying to do is known as dimensionality reduction, which comes in many variants; in the broadest sense it is divided into supervised and unsupervised methods. Either flavor can be implemented with the sklearn API as below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.manifold import TSNE

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print('Original Features: ', X.shape[1])

# Unsupervised: t-SNE embeds the samples into 2 dimensions
# (note: perplexity must be smaller than the number of samples)
X_unsupervised = TSNE(n_components=2, learning_rate='auto', init='random', perplexity=3).fit_transform(X)
print('Features after Unsupervised Dimension Reduction: ', X_unsupervised.shape[1])

# Supervised: chi-squared feature selection keeps the 2 features
# most associated with the class labels y
y = [1, 0, 0, 1]
X_supervised = SelectKBest(chi2, k=2).fit_transform(X, y)
print('Features after Supervised Dimension Reduction: ', X_supervised.shape[1])

output:

Original Features:  9
Features after Unsupervised Dimension Reduction:  2
Features after Supervised Dimension Reduction:  2
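Since the question is ultimately about feeding the reduced features into a kNN classifier, here is a minimal end-to-end sketch on the same toy corpus. `TruncatedSVD` and `KNeighborsClassifier` are stand-ins for whichever reducer/classifier you choose; TruncatedSVD (LSA) is often a better fit than t-SNE in a pipeline because it works on the sparse tf-idf matrix directly and can transform unseen documents at prediction time:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
y = [1, 0, 0, 1]

# Vectorize -> reduce to 2 dimensions -> classify with kNN.
# Unlike t-SNE, TruncatedSVD can be applied to new documents,
# which is what predict() needs.
clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    KNeighborsClassifier(n_neighbors=1),
)
clf.fit(corpus, y)
print(clf.predict(['This is a document.']))
```

On the full 20newsgroups data, the same pipeline applies unchanged; you would only raise `n_components` (a few hundred is a common starting point for LSA on text).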