I am looking to implement TF-IDF*) on some files; however, they are JS files, and I am unsure how exactly to break up a JS file into terms. With regular text it is generally pretty straightforward, as you can use words, which are just separated by spaces. However, this is not as straightforward when working with code, as 'terms' could be separated by any number of different characters (operators, brackets, dots, and so on).
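To illustrate the problem, here is a minimal Python sketch. The JS line is a made-up example, and treating identifier-like runs as terms is only one assumption about what a 'term' could be, not an established standard.

```python
import re

# Hypothetical one-line JS snippet, used only for illustration.
js_line = "var total=items.map(function(x){return x.price*qty;});"

# Splitting on whitespace, as one would for prose, leaves most of the
# line glued together in a few large chunks.
whitespace_terms = js_line.split()

# Assumption: treat identifier-like runs (letters, digits, _ and $) as terms.
identifier_terms = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", js_line)

print(whitespace_terms)
print(identifier_terms)
```

The whitespace split yields only three chunks, while the regex pulls out the individual names and keywords. It still ignores operators, string contents, and comments, which is part of what I am unsure about.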
I would like to use it to search for instances of plagiarism across many code bases and to compare files for how similar they are. I think comments would be important for identifying similar or identical files.
Is there some standard way of doing this? Or does anybody have any ideas on how to separate code into terms for use with TF-IDF?
*)
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
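As a rough sketch of that multiplication, the Python below uses one common variant: term frequency as the count divided by document length, and inverse document frequency as log(N / df). Other weightings exist, and the function name `tf_idf` and the toy token lists are assumptions made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # tf = count / document length; idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Toy, already-tokenized "files" (hypothetical data).
docs = [
    ["var", "total", "total", "price"],
    ["var", "count", "price"],
    ["var", "total", "name"],
]
scores = tf_idf(docs)
print(scores[0])
```

With this variant, a term that appears in every document (here `var`) scores 0, while a term unique to one document scores highest there.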