I am trying to run multiple instances of a shortest-path algorithm for independent pairs of nodes in a graph.
The algorithm internally uses the bfs method, which is "sandboxed" (self-contained; it writes no variables outside the method). The graph has 600,000 nodes and 2,000,000 edges.
I ran the parallelized and the sequential versions for 14k nodes (from which the independent pairs of nodes were chosen): the sequential run takes 30 minutes, while the OpenMP run takes 50 minutes.

I would expect OpenMP to be faster: there are no critical sections or complicated bottlenecks.

int bfs(int src, int dst, int thread) {
    const int _VERTEX = 600000;
    std::stringstream msg;
    msg << "Finding shortest path using bfs from : " << src << " to " << dst
        << " using thread " << thread << std::endl;

    std::cout << msg.str();

    // Per-call (and therefore per-thread) state: nothing outside bfs is written.
    std::vector<int> dist(_VERTEX, INT_MAX);
    std::vector<bool> visited(_VERTEX, false);

    std::queue<int> q;
    q.push(src);
    visited[src] = true;
    dist[src] = 0;

    // Standard level-by-level BFS.
    while (!q.empty()) {
        int size = q.size();
        while (size--) {
            int curr = q.front();
            q.pop();
            for (std::vector<int>::iterator it = _ADJLST[curr].begin(); it != _ADJLST[curr].end(); ++it) {
                if (visited[*it]) { continue; }

                if (dist[*it] > dist[curr] + 1) {
                    dist[*it] = dist[curr] + 1;
                    q.push(*it);
                }
                // if (curr == dst) { return dist[dst]; }
                visited[*it] = true;
            }
        }
    }
    return dist[dst];
}

I know bfs can be optimized further by returning dist as soon as dst is found.
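For reference, here is a minimal sketch of that early exit (same variables as in bfs above; the check fires when dst is dequeued, at which point its distance is final):

    // Early-exit variant of the main loop: stop once dst is dequeued,
    // since BFS dequeues nodes in non-decreasing distance order.
    while (!q.empty()) {
        int curr = q.front();
        q.pop();
        if (curr == dst) { return dist[dst]; }
        for (std::vector<int>::iterator it = _ADJLST[curr].begin(); it != _ADJLST[curr].end(); ++it) {
            if (!visited[*it]) {
                visited[*it] = true;
                dist[*it] = dist[curr] + 1;
                q.push(*it);
            }
        }
    }
    return dist[dst];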

And here is my main function:


int main() {

    int i, tid, nthreads;

    // #pragma omp parallel for private(i,tid,nthreads) schedule(dynamic,1)
    #pragma omp parallel for private(i,tid,nthreads)
    for (i = 0; i < _PAIRLST.size(); i++) {
        int src = _PAIRLST[i].first;
        int dst = _PAIRLST[i].second;

        tid = omp_get_thread_num();
        bfs(src, dst, tid);
    }
}

_PAIRLST holds the 14k node pairs to be processed. _ADJLST is an unordered_map of vectors storing the adjacency list.
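For clarity, the globals look roughly like this (a sketch; the exact type of _PAIRLST and the population code are assumptions):

std::unordered_map<int, std::vector<int>> _ADJLST;  // node -> neighbour list
std::vector<std::pair<int, int>> _PAIRLST;          // ~14k (src, dst) pairs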

Compiled with gcc 8.2.0 using -Ofast.

I also experimented with OpenMP's static scheduling and with leaving the number of threads at its default, and I got a similar result.
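Concretely, those variants looked like this (sketches of the pragmas only; the loop body is unchanged):

// static scheduling
#pragma omp parallel for private(i,tid,nthreads) schedule(static)
// default thread count, no explicit clauses
#pragma omp parallel for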

Is there something I am missing here?

My hardware:
| Key | Value |
|:----------------- |:------------------------------------------ |
|Architecture | x86_64 |
|CPU op-mode(s) | 32-bit, 64-bit |
|Byte Order | Little Endian |
|CPU(s) | 48 |
|On-line CPU(s) list| 0-47 |
|Thread(s) per core | 2 |
|Core(s) per socket | 12 |
|Socket(s) | 2 |
|NUMA node(s) | 2 |
|Vendor ID | GenuineIntel |
|CPU family | 6 |
|Model | 85 |
|Model name | Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz|
|Stepping | 4 |
|CPU MHz | 2300.000 |
|BogoMIPS | 4600.00 |
|Virtualization | VT-x |
|L1d cache | 32K |
|L1i cache | 32K |
|L2 cache | 1024K |
|L3 cache | 16896K |
|NUMA node0 CPU(s) | 0-11,24-35 |
|NUMA node1 CPU(s) | 12-23,36-47 |

Linux 3.10.0-1160.80.1.el7.x86_64 #1 SMP Sat Oct 8 18:13:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

As per Hari's request, I modified the bfs method to allocate less memory:

int bfs(int src, int dst, int thread) {
    std::stringstream msg;
    msg << "Finding shortest path using bfs from : " << src << " to " << dst
        << " using thread " << thread << std::endl;

    std::cout << msg.str() << std::flush;

    // Sparse replacements for the dense per-call arrays:
    // std::vector<int>  dist(_VERTEX, INT_MAX);
    // std::vector<bool> visited(_VERTEX, false);
    std::unordered_set<int> visited;
    std::unordered_map<int, int> dist;

    std::queue<int> q;
    q.push(src);
    visited.insert(src);
    dist[src] = 0;

    while (!q.empty()) {
        int size = q.size();
        while (size--) {
            int curr = q.front();
            q.pop();
            for (std::vector<int>::const_iterator it = _ADJLST[curr].begin(); it != _ADJLST[curr].end(); ++it) {
                if (visited.find(*it) != visited.end()) { continue; }

                if (dist.find(*it) == dist.end()) { dist[*it] = INT_MAX; }

                if (dist[*it] > dist[curr] + 1) {
                    dist[*it] = dist[curr] + 1;
                    q.push(*it);
                }
                // if (curr == dst) { return dist[dst]; }
                visited.insert(*it);
            }
        }
    }
    return dist[dst];
}

Edit:

Here are some of my logs:

Finding shortest path using bfs from : 9661 to 0 using thread 11
Finding shortest path using bfs from : 1787 to 0 using thread 2
Finding shortest path using bfs from : 9661 to 0 using thread 11

I am aware of the missing optimizations in the bfs algorithm itself.

Is it possible that the thread numbers given by omp_get_thread_num() are just dummies?
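A quick way to check would be to print the actual team size once inside a parallel region (a hypothetical check, not part of my code above):

#pragma omp parallel
{
    #pragma omp single
    std::cout << "team size: " << omp_get_num_threads() << std::endl;
}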

Hari's answer:

The goal is to find the shortest distance between pairs of nodes given in a list, using BFS. Since each pair is independent, the BFS that finds the shortest path for one pair runs independently of the other pairs. No variables are shared by the threads, i.e., there is no coordination between the threads, so there will be no race conditions.

The code below runs on a toy graph (4 nodes, 5 edges, and six pairs of source and target nodes). With such a light load the total amount of work is very small, so to test the effect of parallelization an artificial delay is added by calling sleep for 5 seconds in the for loop. When the number of threads is set to 1, the total time taken is around 30 seconds; when it is set to 6, the total time is around 5 seconds. The code was run on Ubuntu and time was used to get the timing information. Thus, the speed-up happens as expected, and the shortest distances computed are correct.

Possible reasons for the slowdown:

  • Could it be the data structures dist and visited, which are present in each thread? Less likely: each is of the order of the number of nodes, so they together occupy only a few MB per thread (e.g. a dist of 600,000 ints is about 2.4 MB; see the discussion in the comments).
  • Could it be the threads accessing the adjacency list data structure in an uncoordinated manner? If one thread could re-use the information loaded into the cache by another thread, that would be better. This could be investigated; a sketch follows this list.
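One concrete way to investigate the second point is to rebuild the adjacency list once into a contiguous, CSR-style layout that every thread then shares read-only. This is only a sketch: it assumes node IDs are dense in [0, 600000), and offsets/neighbours are illustrative names, not part of the original code.

// Build once before the parallel region; threads only read it afterwards.
std::vector<int> offsets(600000 + 1, 0);   // 600000 = _VERTEX in the question
std::vector<int> neighbours;
for (int v = 0; v < 600000; ++v) {
    auto it = _ADJLST.find(v);             // find() avoids operator[], which inserts missing keys
    int deg = (it == _ADJLST.end()) ? 0 : (int)it->second.size();
    offsets[v + 1] = offsets[v] + deg;
    if (deg > 0) {
        neighbours.insert(neighbours.end(), it->second.begin(), it->second.end());
    }
}
// The neighbour loop in bfs then becomes:
// for (int j = offsets[curr]; j < offsets[curr + 1]; ++j) { int next = neighbours[j]; ... }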

Note: As per the answer here, the steps used below to set the number of threads in OpenMP ensure that exactly that many threads are created.

#include <climits>

#include <iostream>
#include <queue>
#include <sstream>
#include <vector>

#include <unistd.h>

#include "omp.h"

const int kVertex = 4;

const std::vector<std::vector<int>> kAdjList{{1, 2, 3}, {0, 2}, {0, 1, 3}, {0, 2}};

const std::vector<std::pair<int, int>> kPairList{{0, 1}, {0, 2}, {0, 3},
                                                 {1, 2}, {1, 3},
                                                 {2, 3}};

int bfs(int src, int dst, int thread) {
    std::vector<int> dist(kVertex, INT_MAX);
    std::vector<bool> visited(kVertex, false);

    std::queue<int> q;
    q.push(src);
    visited[src] = true;
    dist[src] = 0;

    while (!q.empty()) {
        int size = q.size();
        while (size--) {
            int curr = q.front();
            q.pop();
            for (const auto &it : kAdjList[curr]) {
                if (visited[it]) { continue; }

                if (dist[it] > dist[curr] + 1) {
                    dist[it] = dist[curr] + 1;
                    q.push(it);
                }
                // if (curr == dst) { return dist[dst]; }
                visited[it] = true;
            }
        }
    }

    return dist[dst];
}

int main() {
    // Force OpenMP to use a given number of threads.
    omp_set_dynamic(0);
    omp_set_num_threads(6);

    #pragma omp parallel for
    for (int i = 0; i < (int)kPairList.size(); ++i) {
        sleep(5);  // artificial delay so the parallel speed-up is visible

        std::stringstream op_str;
        op_str << "distance between " << kPairList[i].first
               << " " << kPairList[i].second
               << " " << bfs(kPairList[i].first, kPairList[i].second, omp_get_thread_num())
               << std::endl;
        std::cout << op_str.str();
    }

    return 0;
}
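To reproduce the timing test, the program can be compiled with OpenMP enabled and timed from the shell (bfs_omp.cpp is an illustrative file name):

g++ -fopenmp -O2 bfs_omp.cpp -o bfs_omp
time ./bfs_omp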