I am using DataStax Enterprise 6.8. This is my Solr schema:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
    <fieldType class="org.apache.solr.schema.TextField" name="NameField">
      <analyzer type="index">
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field indexed="true" multiValued="false" name="nama" type="StrField"/>
    <field indexed="true" multiValued="false" name="nama_copy" type="NameField"/>
  </fields>
  <uniqueKey>(nama)</uniqueKey>
  <copyField dest="nama_copy" source="nama"/>
</schema>
One of my rows has this field value: batamindo v
Then I ran this query:
http://my_ip_address:8983/solr/search.form/select?wt=json&indent=true&fl=nama&q=nama_copy:batamindo\ v
I got a very nice result:
{
"responseHeader":{
"status":0,
"QTime":8},
"response":{"numFound":579,"start":0,"docs":[
{
"nama":"BATAMINDO V "},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"},
{
"nama":"BATAMINDO V"}]
}}
But when I ran this query:
http://my_ip_address:8983/solr/search.form/select?wt=json&indent=true&fl=nama&q=nama_copy:batamindo\ vi
My search results were very bad:
{
"responseHeader":{
"status":0,
"QTime":14},
"response":{"numFound":602,"start":0,"docs":[
{
"nama":"MV. VINCA"},
{
"nama":"MV. VINASHIP PEARL"},
{
"nama":"MV. VINASHIP PEARL"},
{
"nama":"MV. VINCENT TRADER"},
{
"nama":"MV. MEGHNA VICTORY"},
{
"nama":"MV. MEGHNA VICTORY"},
{
"nama":"NAVI SUNNY"},
{
"nama":"MV. MEGHNA VICTORY"},
{
"nama":"MT. GOLDEN VIOLET"},
{
"nama":"MT. GOLDEN VIOLET"}]
}}
What is happening here?
What you are seeing is expected behaviour.
The NGramFilterFactory class tokenises strings into grams of N size. In your case, the strings are broken up into grams of 2 to 15 characters based on your schema definition of:
<filter class="solr.NGramFilterFactory" maxGramSize="15" minGramSize="2"/>
For an input string like cassandra, the N-gram filter generates the following grams:
ca as ss sa an nd dr ra
cas ass ssa san and ndr dra
cass assa ssan sand andr ndra
For search term ss, the Solr query will get a match for ss, ass, ssa, assa, ssan and so on.
In your case where the search term is vi, it is expected to match vinca, vinaship, vincent, victory, navi, violet and so on.
For more information, see Document Analysis in Solr. Cheers!
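To make the effect concrete, here is a minimal Python sketch of the gram generation described above. It is a simplified stand-in for the analyzer chain, not Lucene's actual implementation: the function names ngrams and analyze are hypothetical, the tokenizer is approximated by lowercasing and splitting on anything other than a-z, and the stop word and ASCII folding filters are ignored. It assumes that a token shorter than minGramSize produces no grams at all, which would explain why the single letter v adds nothing to the first query while vi adds a gram shared by many unrelated names.

```python
import re


def ngrams(token, min_gram=2, max_gram=15):
    """Return every substring of token with length min_gram..max_gram."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


def analyze(text, min_gram=2, max_gram=15):
    """Rough approximation of the analyzer: lowercase, split on non-letters, n-gram each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = set()
    for token in tokens:
        grams.update(ngrams(token, min_gram, max_gram))
    return grams


query_v = analyze("batamindo v")    # "v" is shorter than min_gram, so it adds no grams
query_vi = analyze("batamindo vi")  # "vi" adds exactly one extra gram

print(sorted(query_vi - query_v))      # ['vi']
print("vi" in analyze("MV. VINCA"))    # True
print("vi" in analyze("NAVI SUNNY"))   # True
```

Under these assumptions, the only difference between the two queries is the extra gram vi, and every indexed name containing those two letters becomes a candidate match.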