I followed the instructions here to run Octave with NVBLAS. I have CUDA Toolkit 7.5 installed and a Tesla K40c GPU. To start Octave with NVBLAS, I used LD_PRELOAD=libnvblas.so octave. I then ran the following simple code:
N = 256
A = rand(N,N)
B = rand(N,N)
A*B
which produces a matrix with reasonable values. However, if I increase N to 512 or anything larger, I get back all zeros (or very small numbers) as the result.
If I use OpenBLAS instead, this does not happen. The matrices should easily fit in the card's 12 GB of memory. Any idea why this might happen?
Note: If I make A and B identity matrices this does not happen, but it still happens with A = B = ones(N,N).
Sorry the question is somewhat stale, but I tried this on an Amazon AWS EC2 p2.xlarge instance with a K80 GPU and it seems to have worked.
I was getting similar results to yours (lots of zeros) when I had the default "NVBLAS_GPU_LIST 0 1" setting in nvblas.conf, which seems to refer to two GPUs; the p2.xlarge has only one, so I changed it to list just one GPU and it worked. Complete file below:
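For reference, a minimal nvblas.conf in that spirit looks like the sketch below. The library path and tile size are assumptions for illustration; NVBLAS_CPU_BLAS_LIB must point at a real CPU BLAS (e.g. your own OpenBLAS build) on your system.

```
# nvblas.conf -- minimal sketch; paths and tile size are illustrative
NVBLAS_LOGFILE nvblas.log
# CPU BLAS used as a fallback for routines NVBLAS does not offload
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
# List only the GPUs actually present (the default "0 1" assumes two)
NVBLAS_GPU_LIST 0
NVBLAS_TILE_DIM 2048
NVBLAS_AUTOPIN_MEM_ENABLED
```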
Here is the program (t1.m), slightly modified from the NVIDIA link to count the number of non-zeros in the output matrix:
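A minimal version of that program (a sketch, not the exact file from the answer; the size N and the printed format are assumptions) could look like:

```octave
% t1.m -- time an N x N matrix multiply and count non-zeros in the result
N = 16384;
A = rand(N, N);
B = rand(N, N);
tic;
C = A * B;
t = toc;
printf("non-zeros: %d of %d\n", nnz(C), N * N);
printf("elapsed: %.2f s (~%.1f GFLOPS)\n", t, 2 * N^3 / t / 1e9);
```

With a correctly configured BLAS every entry of C should be non-zero, since all entries of A and B are positive; a large count of zeros reproduces the misconfiguration described above.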
FYI, here is the nvidia-smi output while it was running as above (it peaked at 172 MiB of GPU memory with N = 16384):
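To watch usage live while the multiply runs, a standard nvidia-smi polling loop in a second terminal works (this exact invocation is an assumption about your setup, but the flags are standard nvidia-smi options):

```shell
# Print GPU memory use and utilization once per second until interrupted
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1
```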
Here are the NVIDIA and CUDA packages I'd previously installed:
I see a speed-up of about 8.6x, with about 55 GFLOPS from plain Octave and 478 GFLOPS from the GPU version.
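As a quick sanity check on those figures (the 2*N^3 flop count is the usual convention for a dense N x N matrix multiply; 55 and 478 GFLOPS are the measurements quoted above):

```octave
% Conventional GFLOPS for an N x N multiply completed in t seconds
gflops = @(N, t) 2 * N^3 / t / 1e9;
% Speed-up implied by the two measured rates
speedup = 478 / 55;   % roughly 8.7, consistent with the ~8.6 observed
```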