How to separate code (specifically JS) into "terms" for use with TF-IDF

70 Views Asked by Lee Morgan At 06 September 2023 at 08:36

I am looking to implement TF-IDF on some files; however, they are JS files, and I am unsure how exactly to break up a JS file into terms. With regular text it is generally pretty straightforward, as you can use words separated by spaces. With code this is less straightforward, as 'terms' could be separated by any number of characters.

I would like to use it to find instances of plagiarism across many code bases and to compare files for how similar they are. I think comments would be important for identifying similar or identical files.

Is there some standard way of doing this? Or does anybody have any ideas on how to separate code into terms for use with TF-IDF?
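One possible starting point, sketched here in Python rather than offered as a standard method, is a small regex-based lexer that treats identifiers, keywords, and numbers as terms and splits comments and string literals into lowercase words, so that comment text still contributes to the comparison. The names TOKEN_RE, WORD_RE, and js_terms are made up for illustration. The pattern is an approximation: it does not recognise JS regex literals or template-string interpolation, and multi-character operators such as === come out as single characters. A real JS parser would be more robust.

```python
import re

# Hypothetical token pattern for JS source. Order matters: comments and
# strings are tried first so their contents are not lexed as code.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'|`(?:\\.|[^`\\])*`)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<punct>[^\sA-Za-z0-9_$])
    """,
    re.VERBOSE | re.DOTALL,
)

# Words inside comments and string literals.
WORD_RE = re.compile(r"[A-Za-z0-9_$]+")


def js_terms(source, keep_punct=False):
    """Split JS source text into a flat list of terms (approximate)."""
    terms = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind in ("comment", "string"):
            # Treat free text like ordinary prose: lowercase words.
            terms.extend(word.lower() for word in WORD_RE.findall(text))
        elif kind == "punct":
            # Punctuation is mostly noise for TF-IDF, so it is optional.
            if keep_punct:
                terms.append(text)
        else:
            # Identifiers, keywords, and numbers are kept verbatim.
            terms.append(text)
    return terms
```

Whether to keep punctuation, split camelCase identifiers, or drop very common keywords is a design choice that depends on how the plagiarised files were altered; renamed variables, for example, would defeat identifier-based terms.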

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
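To make the definition above concrete, here is a minimal sketch of the multiplication it describes, assuming each file has already been split into a list of terms. The function names tf_idf and cosine are illustrative, and the weighting used (term count divided by document length, times log(N / document frequency)) is only one common unsmoothed variant; libraries often apply smoothing or different normalisation. Cosine similarity is added as one plausible way to compare two files, not as the only option.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a file name to its list of terms.

    Returns a dict mapping each file name to {term: weight}.
    """
    n_docs = len(docs)
    # Document frequency: in how many files does each term occur?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    weights = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        total = len(terms) or 1  # avoid dividing by zero for empty files
        weights[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this variant, a term that appears in every file gets weight 0, so boilerplate shared by all code bases does not contribute to the similarity score.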
