I am trying to get all of the terms and their related postings (called Terms) from a field of a Lucene document (see: How to calculate term frequency in Lucene?). According to the documentation, there is a method to do that:
public final Terms getTermVector(int docID, String field) throws IOException
Retrieve term vector for this document and field, or null if term vectors were not indexed. The returned Fields instance acts like a single-document inverted index (the docID will be 0).
The method takes a parameter int docID. What is this? For a given document, what is its ID, and how does Lucene determine it?
Following Lucene's documentation, I have used a StringField as the id, and it is not an int.
import org.apache.lucene.document.*;
Document doc = new Document();
Field idField = new StringField("id",post.Id,Field.Store.YES);
Field bodyField = new TextField("body", post.Body, Field.Store.YES);
doc.add(idField);
doc.add(bodyField);
I have five questions:
- How does Lucene recognize that the id field is used as the docID for this document? Or does Lucene even do that?
- I used a String for the id, but this method takes an int. Does that cause a problem?
- Is there any appropriate method to get postings?
- I have used TextField. Is there any way to retrieve the term vector (Terms) of that field? I don't want to re-index my docs as explained here, because the index is too large (35 GB).
- Is there any way to get the term count and each term's frequency from a TextField?
To calculate term frequency we can use IndexReader.getTermVector(int docID, String field). int docID is a parameter that refers to the internal document ID assigned by Lucene. You can retrieve the docID with the following code:

Each termVector object holds the terms and frequencies for one document field, and you can retrieve them with the following code:

Note: Don't forget to enable storing term vectors, as I explained here (or in this one), when you index your documents. If you index a document without storing term vectors, getTermVector will return null. All of Lucene's predefined Field types have this option disabled by default, so you need to set it yourself.
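As a rough sketch of the steps above (not the original snippets), assuming Lucene 8.x or 9.x on the classpath, where IndexReader.getTermVector(int, String) is still available: the class name TermVectorSketch, the helper termFrequencies, and the sample values are made up for illustration. It indexes one document with term vectors enabled on the body field, finds Lucene's internal docID by searching on the id field, and then reads each term and its frequency from the term vector.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, for illustration only.
class TermVectorSketch {

    // Indexes one document in memory and returns term -> frequency for its "body" field.
    static Map<String, Long> termFrequencies(String id, String body) {
        // Same as TextField with Store.YES, but with term vectors switched on.
        FieldType bodyType = new FieldType(TextField.TYPE_STORED);
        bodyType.setStoreTermVectors(true);
        bodyType.freeze();

        Map<String, Long> freqs = new LinkedHashMap<>();
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("id", id, Field.Store.YES));
                doc.add(new Field("body", body, bodyType));
                writer.addDocument(doc);
            } // closing the writer commits the document

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                // The String "id" field is our own key; Lucene's int docID comes from the hit.
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits =
                    searcher.search(new TermQuery(new Term("id", id)), 1).scoreDocs;
                if (hits.length == 0) {
                    return freqs;
                }
                int docID = hits[0].doc;

                // Null when the field was indexed without term vectors.
                Terms vector = reader.getTermVector(docID, "body");
                if (vector == null) {
                    return freqs;
                }
                TermsEnum termsEnum = vector.iterator();
                BytesRef term;
                while ((term = termsEnum.next()) != null) {
                    // Within a term vector, totalTermFreq() is the count in this one document.
                    freqs.put(term.utf8ToString(), termsEnum.totalTermFreq());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return freqs;
    }
}
```

For example, TermVectorSketch.termFrequencies("42", "lucene term vector lucene") should report lucene with frequency 2 and term and vector with frequency 1 each. The number of distinct terms is the size of the returned map.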