Why does computation on contiguous slices of tensors take less time than doing it directly on the tensor?

I have a test that performs an element-wise multiplication and stores the result back in place, in three variants: directly on a large tensor, on a contiguous sliced view of an identical tensor, and on a noncontiguous sliced view of a chip of a larger tensor holding the same values:

#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/CXX11/Tensor>
#include <chrono>
#include <numeric>
#include <vector>
#include <tuple>
#include <iomanip>

using namespace Eigen;
using namespace std;
using namespace std::chrono;

template <typename T>
double performTestAndMeasureTime(T& tensor) {
    auto start = high_resolution_clock::now();
    tensor = tensor * tensor;
    auto end = high_resolution_clock::now();
    duration<double, std::milli> elapsed = end - start;
    return elapsed.count();
}

int main() {
    cout << std::fixed << std::setprecision(16);

    const int size = 3000;
    const int numTests = 10;
    for (int j = 0; j < 10; ++j) {
      Tensor<double, 2> tensor(size, size);
      tensor.setConstant(j);

      Tensor<double, 2> tensor_(size, size);
      tensor_.setConstant(j);
      auto tensorView1 = tensor_.slice(Eigen::array<Index, 2>{0, 0}, Eigen::array<Index, 2>{size, size});

      Tensor<double, 3> largerTensor(100, size * 2, size * 2);
      auto tensorView2_ = largerTensor.chip(50, 0);
      auto tensorView2 = tensorView2_.slice(Eigen::array<Index, 2>{0, 0}, Eigen::array<Index, 2>{size, size});
      tensorView2.setConstant(j);

      vector<tuple<double, double, double>> results;

      for (int i = 0; i < numTests; ++i) {
          double res1 = performTestAndMeasureTime(tensor);
          double res2 = performTestAndMeasureTime(tensorView1);
          double res3 = performTestAndMeasureTime(tensorView2);
          results.emplace_back(res1, res2, res3);
      }

      double sum1 = 0, sum2 = 0, sum3 = 0;
      for (const auto& [res1, res2, res3] : results) {
          sum1 += res1;
          sum2 += res2;
          sum3 += res3;
      }

      cout << "Results for tensor constant = " << j << endl;
      cout << "  Average Normal:        " << sum1 / numTests << " ms" << endl;
      cout << "  Average Sliced View:   " << sum2 / numTests << " ms" << endl;
      cout << "  Average Noncontiguous: " << sum3 / numTests << " ms" << endl;
    }
    return 0;
}

This prints out

Results for tensor constant = 0
  Average Normal:        109.6084899999999891 ms
  Average Sliced View:   86.6384500000000060 ms
  Average Noncontiguous: 559.3868599999999560 ms
Results for tensor constant = 1
  Average Normal:        107.4885000000000019 ms
  Average Sliced View:   86.0359700000000061 ms
  Average Noncontiguous: 562.5502300000000560 ms
Results for tensor constant = 2
  Average Normal:        106.0686999999999927 ms
  Average Sliced View:   85.6165200000000084 ms
  Average Noncontiguous: 539.1068299999999454 ms
Results for tensor constant = 3
  Average Normal:        105.9735300000000109 ms
  Average Sliced View:   85.8508899999999926 ms
  Average Noncontiguous: 554.8193299999999226 ms
Results for tensor constant = 4
  Average Normal:        112.9430100000000010 ms
  Average Sliced View:   86.7122600000000006 ms
  Average Noncontiguous: 559.5556199999999762 ms
Results for tensor constant = 5
  Average Normal:        107.4511799999999795 ms
  Average Sliced View:   87.0094900000000138 ms
  Average Noncontiguous: 552.7255600000000868 ms
Results for tensor constant = 6
  Average Normal:        115.6903399999999920 ms
  Average Sliced View:   90.0774900000000116 ms
  Average Noncontiguous: 590.9229000000000269 ms
Results for tensor constant = 7
  Average Normal:        110.9844900000000081 ms
  Average Sliced View:   87.2257299999999844 ms
  Average Noncontiguous: 563.0328299999999899 ms
Results for tensor constant = 8
  Average Normal:        108.2359100000000183 ms
  Average Sliced View:   85.5443699999999865 ms
  Average Noncontiguous: 555.7776999999999816 ms
Results for tensor constant = 9
  Average Normal:        107.4942699999999860 ms
  Average Sliced View:   84.7212200000000024 ms
  Average Noncontiguous: 544.9521099999999478 ms

It's understandable that the noncontiguous view takes much longer, since the processor's caches are used far less effectively. What I don't understand is why, in a high-performance library like Eigen, the computation on a contiguous view of a tensor can be this much faster than the same operation performed on the tensor directly. If anything I would have expected the opposite, since with the slice the result of a mapping expression has to be evaluated before we get the final value. Why does the computation finish more quickly on the slice?

Compile command: g++ -std=c++20 -I/usr/include/eigen3/ test.cpp -o tensor_test && ./tensor_test

Eigen Version: 3.4.0

2 Answers

Answer by Yes:

Using

g++ -std=c++20 -I/usr/include/eigen3/ test.cpp -o tensor_test -Ofast && ./tensor_test

changed the output to the expected ordering (a note on alternative optimization flags follows the results):

Results for tensor constant = 0
  Average Normal:        5.1356099999999998 ms
  Average Sliced View:   6.0342300000000009 ms
  Average Noncontiguous: 289.1670799999999986 ms
Results for tensor constant = 1
  Average Normal:        5.1476300000000004 ms
  Average Sliced View:   5.8024499999999994 ms
  Average Noncontiguous: 261.0361199999999826 ms
Results for tensor constant = 2
  Average Normal:        5.0972700000000000 ms
  Average Sliced View:   5.8569899999999997 ms
  Average Noncontiguous: 259.1227200000000153 ms
Results for tensor constant = 3
  Average Normal:        5.1468300000000005 ms
  Average Sliced View:   5.7064599999999999 ms
  Average Noncontiguous: 260.6153199999999970 ms
Results for tensor constant = 4
  Average Normal:        5.0209299999999999 ms
  Average Sliced View:   5.4007199999999997 ms
  Average Noncontiguous: 260.8348000000000297 ms
Results for tensor constant = 5
  Average Normal:        5.7401500000000008 ms
  Average Sliced View:   6.1950500000000002 ms
  Average Noncontiguous: 364.8314100000000053 ms
Results for tensor constant = 6
  Average Normal:        5.1002900000000002 ms
  Average Sliced View:   5.7117500000000003 ms
  Average Noncontiguous: 260.0248899999999708 ms
Results for tensor constant = 7
  Average Normal:        4.9162100000000004 ms
  Average Sliced View:   5.7174899999999997 ms
  Average Noncontiguous: 257.0656500000000051 ms
Results for tensor constant = 8
  Average Normal:        4.8135900000000005 ms
  Average Sliced View:   5.5151299999999992 ms
  Average Noncontiguous: 256.0223799999999983 ms
Results for tensor constant = 9
  Average Normal:        4.8785400000000001 ms
  Average Sliced View:   5.5409099999999993 ms
  Average Noncontiguous: 255.9427200000000084 ms
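
As a side note on the flags: -Ofast implies -ffast-math, which relaxes IEEE floating-point semantics. If that matters for your computation, plain -O3 should give a similar picture, since what the comparison really needs is for optimization (inlining and vectorization of Eigen's expression templates) to be enabled at all. Something along these lines, with -march=native to let the compiler use the host CPU's SIMD instructions and -DNDEBUG to compile out Eigen's runtime assertions, is a reasonable alternative to try:

g++ -std=c++20 -O3 -march=native -DNDEBUG -I/usr/include/eigen3/ test.cpp -o tensor_test && ./tensor_test
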
Answer by gnasher729:

What costs the time is accessing memory. Whatever additional instructions the slicing takes are irrelevant and mostly optimised away anyway.

Typically, reading from L1 cache takes 4 cycles, L2 cache 20, and RAM around 100 (this is highly variable between processors, but the principle holds). So if slicing means you perform many operations out of L1 cache instead of going to L2 cache or RAM, you can get dramatic speed improvements.
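
To make the memory-access point concrete, here is a minimal sketch (my own illustration, not from the original test, and with smaller dimensions so it fits comfortably in memory): it gathers the noncontiguous view into a contiguous temporary, squares that, and scatters the result back. The elementwise work itself then runs over packed, cache-friendly data; only the two copies pay the strided-access cost.

#include <unsupported/Eigen/CXX11/Tensor>

int main() {
    const int size = 1000;

    // A noncontiguous view, built the same way as in the question:
    // chip a rank-3 tensor, then slice the resulting rank-2 expression.
    Eigen::Tensor<double, 3> largerTensor(10, size * 2, size * 2);
    auto chipped = largerTensor.chip(5, 0);
    auto view = chipped.slice(Eigen::array<Eigen::Index, 2>{0, 0},
                              Eigen::array<Eigen::Index, 2>{size, size});
    view.setConstant(2.0);

    Eigen::Tensor<double, 2> tmp = view;  // gather: strided reads, contiguous writes
    tmp = tmp * tmp;                      // elementwise square on contiguous data
    view = tmp;                           // scatter: contiguous reads, strided writes
    return 0;
}

Whether the round trip through the temporary actually beats operating on the view in place depends on the sizes and the hardware, so it is worth measuring rather than assuming.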