How can I find the stop and start index for a Java vector?

108 Views Asked by At

I have a vector that looks like this:

y =

 Columns 1 through 19:

   1   1   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2

 Columns 20 through 38:

   2   2   2   2   3   3   3   3   3   3   3   3   3   3   3   4   4   4   4

 Columns 39 through 57:

   4   4   4   4   4   4   4   5   5   5   5   5   5   5   5   5   5   5   6

 Columns 58 through 67:

   6   6   6   6   6   6   6   6   6   6

The vector y is always start at 1 and be counted up. You see that there are lots of same numbers. It's the classes for the samples.

Here we have 1 1 1 1 1 1 1 1 1 1 1 1 = 12 samples for class number 1.

We have 2 2 2 2 2 2 2 2 2 2 2 = 11 samples for class number 2.

My problem here is that I want to find start and stop for every class. For example: Class 1 begins always at index 0 and ends, in this case, at index 11.

Class 2 begins directly after class 1 ends.

Question:

I'm using EJML (Effient Java Matrix Library) and I'm planning to use this function:

C = A.extractMatrix(1,4,2,8) 

Which is equal to this MATLAB code:

C = A(2:4,3:8) 

But I need to find the start and stop indexes from this y vector. In what index does e.g class 3 stops and starts? Do you have any smart ideas how to do that?

Sure, I could use a for-loop, to do this, but for-loops in Java is quite slow because I'm going to have a very very large y vector.

Suggestions?

Edit:

Here is an suggestion. Is that good, or could it be done better?

private void startStopIndex(SimpleMatrix y, int c, Integer[] startStop) {
    int column = y.numCols();
    startStop[0] = startStop[1] + 1; // Begin at the next class
    for(int i = startStop[0]; i < column; i++) {
        if(y.get(i) != c) {
            break;
        }else {
            startStop[1] = i;
        }
    }

}

Assuming that we are calling the method from:

        Integer[] startStop = new Integer[2];
        for(int i = 0; i < c; i++) {
            startStopIndex(y, c, startStop);
        }
3

There are 3 best solutions below

0
On BEST ANSWER

If you want to do it faster then binary search is your friend. Threw this together really quick and it does things in O(log n) time, where as a linear search does it in O(n). It's pretty basic and assumes your data looks pretty much like you describe it. Feed it weird data and it will break.:

int[] breakPoints(int[] arr, int low, int high){
    int[] rtrn = new int[high];
    for(int i=low;i<high;i++){
        rtrn[i]=binarySearch(arr, i, 0, arr.length-1);
    }
    return rtrn;
}

int binarySearch(int[] arr, int k, int start, int end){
    int mid = (start+end)/2;
    if(mid==arr.length){
        return -1;
    }
    if(arr[mid]==k && arr[mid+1]==k+1){
        return mid+1; //or just mid if you want before breakpoint
    }
    if(arr[mid]<=k){
        return binarySearch(arr, k, mid+1, end);
    }
    return binarySearch(arr, k, start, mid-1);
}

You'd call it like this:

int[] data = {1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,5,5,6,6,6,6};
int[] bp = breakPoints(data,1,6);
//return 0, 3, 8, 13, 16, 18 
4
On

The below is in MATLAB. the for loop will go through each unique value stored in x1 and then find the first and last occurrence of that value.

x = [ 1 1 1 2 2 3 3 3 3 3 4 4 4 4 5 5 5 ]
x1 = unique(x)'

for k1 = 1:length(x1)
    x1(k1,2:3) = [find(x == x1(k1,1),1,"first"), find(x == x1(k1,1),1,"last")];
end

the above code yields x1 to be a 3 column matrix

 1     1     3
 2     4     5
 3     6    10
 4    11    14
 5    15    17
8
On

I think there is a name for this, but I can't remember what it might be, but you start looking for the next boundary with an accelerating search, and use a binary search after that.

You know the numbers are in ascending order, and there are potentially a lot of the same number, so you start by checking the next element. But instead of keep going 1 step at a time, you accelerate and step 2, 4, 8, 16, ... until you find a higher number.

Once you've found a higher number, you've gone too far, but the last step had the initial number, so you know the boundary is somewhere between the last two steps, and you then apply a binary search for the boundary.

Once you've fund the boundary, you start over stepping 1, 2, 4, ... for the next boundary.

If you expect most numbers to have about the same number of occurrences, you could keep a running average count, and make the first step with that average, to get a running start.

I'll leave it to you to actually code this.