Document classification using pretrained models like BERT

472 Views Asked by npatil At 09 February 2021 at 22:09

I am looking for methods to classify documents. For example, I have a collection of text documents and I want to label each one according to whether it belongs to sports, food, politics, etc. Can I use BERT for this (the documents are longer than 500 words), or are there other models that handle this task efficiently?


There is 1 best solution below

justanyphil On 10 February 2021 at 06:36

BERT has a maximum sequence length of 512 tokens (note that this is usually far fewer than 500 words, because BERT's tokenizer splits many words into several subword tokens), so you cannot feed a whole document to BERT at once. If you still want to use the model for this task, I would suggest that you:

1. split each document into chunks that BERT can process (e.g. 512 tokens or fewer)
2. classify all document chunks individually
3. classify the whole document according to the most frequently predicted label among its chunks, i.e. take a majority vote
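The chunk-and-vote steps above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: `split_into_chunks`, `classify_document`, and `toy_classifier` are hypothetical names, the document is assumed to be tokenized already, and `toy_classifier` is a keyword-counting stand-in for a fine-tuned BERT classifier. The default chunk size of 510 is also an assumption; it leaves room for the `[CLS]` and `[SEP]` tokens that BERT adds to each input.

```python
from collections import Counter
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[str], max_len: int = 510) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most max_len tokens."""
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def classify_document(
    tokens: Sequence[str],
    classify_chunk: Callable[[List[str]], str],
    max_len: int = 510,
) -> str:
    """Classify every chunk and return the most frequently predicted label."""
    chunks = split_into_chunks(tokens, max_len)
    if not chunks:
        raise ValueError("document has no tokens")
    votes = Counter(classify_chunk(chunk) for chunk in chunks)
    # most_common(1) returns the label with the highest count;
    # ties go to the label that was encountered first.
    return votes.most_common(1)[0][0]


def toy_classifier(chunk: List[str]) -> str:
    """Hypothetical stand-in for a BERT chunk classifier (keyword counting)."""
    return "sports" if chunk.count("goal") > chunk.count("vote") else "politics"


# A 1300-token toy document: two chunks vote "sports", one votes "politics".
document = ["goal"] * 1000 + ["vote"] * 300
label = classify_document(document, toy_classifier)
```

A plain majority vote treats every chunk equally. If the chunk classifier returns probabilities, averaging them over the chunks is a common variation.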

In this case, the only modification you have to make is to add a fully connected layer on top of BERT that maps its output to your document labels.

This approach might be quite expensive, though. An alternative is to represent the text documents as bag-of-words (BOW) vectors and then train a classifier on them. If you are not familiar with BOW, the Wikipedia entry on it is a good starting point. A BOW vector can serve as the feature vector for all kinds of classifiers; I would suggest you try an SVM or kNN.
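To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. It counts words per document and classifies a new text by a k-nearest-neighbour vote under cosine similarity. The function names and the tiny `TRAIN` set are made up for illustration. In practice you would more likely use a library such as scikit-learn, with proper tokenization and term weighting.

```python
import math
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Map a text to its word counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(text: str, training_data: List[Tuple[str, str]], k: int = 3) -> str:
    """Return the majority label among the k training texts most similar to text."""
    query = bag_of_words(text)
    ranked = sorted(
        training_data,
        key=lambda item: cosine_similarity(query, bag_of_words(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up toy training set: (document text, label) pairs.
TRAIN = [
    ("the team scored a late goal in the match", "sports"),
    ("the striker missed a penalty in the final match", "sports"),
    ("parliament passed the budget after a long vote", "politics"),
    ("the senator lost the election vote", "politics"),
    ("bake the bread and serve with soup", "food"),
    ("the recipe needs butter and fresh bread", "food"),
]

prediction = knn_predict("who scored the winning goal in the match", TRAIN)
```

Because the vector length depends on the vocabulary rather than on the document, this representation has no 512-token limit. In exchange, it discards word order.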
