Is there a way to reduce the size of the sklearn 20newsgroups dataset?


I am in process of learning the basics of NLP and I am trying to code the kNN classifier.

In the data preparation stage, I am trying to reduce the set size down to a certain dimension but I am confused about how to do that.

Can anyone help me out?

I have written the code below for getting the training dataset

from sklearn.datasets import fetch_20newsgroups

trainingData = fetch_20newsgroups(subset="train", categories=allCategories)

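As an aside, if the goal is fewer documents rather than fewer features, `fetch_20newsgroups` accepts a `categories` list, so fetching only a subset of categories is the simplest way to shrink the dataset itself. A minimal sketch, using two example category names:

```python
from sklearn.datasets import fetch_20newsgroups

# Example category names (any subset of the 20 newsgroup labels works);
# fetching fewer categories shrinks the number of documents returned.
someCategories = ["sci.space", "rec.autos"]
trainingData = fetch_20newsgroups(subset="train", categories=someCategories)

print(len(trainingData.data))          # number of documents fetched
print(len(trainingData.target_names))  # 2 categories
```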
1 Answer

Answer by meti:

What you're trying to do is known as dimensionality reduction, which comes in many variants; in the broadest sense it is divided into supervised and unsupervised methods. Either flavor can be implemented with the sklearn API as below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.manifold import TSNE

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print('Original Features: ', X.shape[1])

# Unsupervised: t-SNE embeds the samples into 2 dimensions
# (note: perplexity must be smaller than the number of samples)
X_unsupervised = TSNE(n_components=2, learning_rate='auto', init='random', perplexity=3).fit_transform(X)
print('Features after Unsupervised Dimension Reduction: ', X_unsupervised.shape[1])

# Supervised: chi-squared feature selection keeps the 2 features
# most associated with the class labels y
y = [1, 0, 0, 1]
X_supervised = SelectKBest(chi2, k=2).fit_transform(X, y)
print('Features after Supervised Dimension Reduction: ', X_supervised.shape[1])

output:

Original Features:  9
Features after Unsupervised Dimension Reduction:  2
Features after Supervised Dimension Reduction:  2
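Since the question is ultimately about feeding the reduced features into a kNN classifier, here is a minimal end-to-end sketch on the same toy corpus. `TruncatedSVD` and `KNeighborsClassifier` are stand-ins for whichever reducer/classifier you choose; TruncatedSVD (LSA) is often a better fit than t-SNE in a pipeline because it works on the sparse tf-idf matrix directly and can transform unseen documents at prediction time:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
y = [1, 0, 0, 1]

# Vectorize -> reduce to 2 dimensions -> classify with kNN.
# Unlike t-SNE, TruncatedSVD can be applied to new documents,
# which is what predict() needs.
clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    KNeighborsClassifier(n_neighbors=1),
)
clf.fit(corpus, y)
print(clf.predict(['This is a document.']))
```

On the full 20newsgroups data, the same pipeline applies unchanged; you would only raise `n_components` (a few hundred is a common starting point for LSA on text).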