Ektorp CouchDb: Query for pattern with multiple contains

867 Views Asked by At

I want to query multiple candidates for a search string which could look like "My sear foo". Now I want to look for documents which have a field that contains one (or more) of the entered strings (seen as splitted by whitespaces).

I found some code which allows me to do a search by pattern:

@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();

    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");

    return db.queryView(viewQuery, DeviceEntityCouch.class);
}

which works quite nice for looking just for one pattern. But how do I have to modify my code to get a multiple contains on doc.serialNumber?

EDIT: This is the current workaround, but there must be a better way i guess. Also there is only an OR logic. So an entry fits term1 or term2 to be in the list.

@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();

    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }

    String[] split = trim.split(" ");

    List<DeviceEntityCouch> list = new ArrayList<>();
    for (String s : split) {
        ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
        list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
    }

    return list;
}
1

There are 1 best solutions below

0
seb On

Looks like you are implementing a full text search here. That's not going to be very efficient in CouchDB (I guess same applies to other databases).

Correct me if I am wrong but from looking at your code looks like you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for. Otherwise you must fetch every single record and perform a string comparison on it.

The only way I can think of to optimize this in CouchDB would be the something like the following (with assumptions):

  1. Your serial numbers are not very long (say 20 chars?)
  2. You force the search to be always 5 characters
  3. Generate view that emits every single 5 char long substring from your serial number - more or less this (could be optimized and not sure if I got the in):

    ...
    for (var i = 0; doc.serialNo.length > 5 && i < doc.serialNo.length - 5; i++) {
        emit([doc.serialNo.substring(i, i + 5), doc._id]);
    }
    ...
    
  4. Use _count reduce function

Now the following url:

http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true

Will return a list of documents with a hit count for a key of 01234. If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits. Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for the information about complex keys lookups.

I am not sure how efficient couchdb is in terms of updating that view. It depends on how many records you will have and how many new entries appear between view is being queried (I understand couchdb rebuilds the view's b-tree on demand).

I have generated a view like that that splits doc ids into 5 char long keys. Out of over 1K docs it generated over 30K results - id being 32 char long, simple maths really: (serialNo.length - searchablekey.length + 1) * docscount). Generating the view took a while but the lookups where fast.

You could generate keys of multiple lengths, etc. All comes down to your records count vs speed of lookups.