I am currently building a multi-class document classifier that has to assign each document to one of 3 known classes ("Financial Report", "Insurance_Sheet", "Endorsement") or to 1 unknown class, "Random Doc". I have tried the following methods, but none gave good results: quite a number of random documents were classified as one of the known classes.
- Method 1: TF-IDF + Linear SVC
- Method 2: Word2Vec for word embeddings, then average those word embeddings to get an embedding vector for each document, then feed it to a classification model.
- Method 3: Doc2Vec to get an embedding vector for each document, then feed it to a classification model.
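To make the averaging step in Method 2 concrete, here is a minimal sketch of what it does. The embedding table, its values, and the name `document_vector` are made up for illustration; they stand in for a trained Word2Vec model and are not my actual pipeline.

```python
from typing import Dict, List

# Toy embedding table standing in for a trained Word2Vec model
# (hypothetical 2-dimensional vectors, for illustration only).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "premium": [1.0, 0.0],
    "policy": [0.0, 1.0],
    "revenue": [1.0, 1.0],
}


def document_vector(
    tokens: List[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of in-vocabulary tokens.

    Tokens missing from the table are skipped (an assumption of this sketch);
    if no token is known, a zero vector is returned.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


doc_tokens = "premium policy unknownword".split()
print(document_vector(doc_tokens, TOY_EMBEDDINGS, dim=2))  # [0.5, 0.5]
```

The resulting vector is what gets fed to the classification model. Note that in this sketch a document made up entirely of out-of-vocabulary words collapses to the zero vector.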
Can you suggest a good approach for this case? Thanks a lot.