I have a test, which performs an element wise multiplication operation on a large tensor and stores the result back into it, on an identical sliced view of it and on a noncontiguous sliced chipped view of a larger tensor with the same values:
#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/CXX11/Tensor>
#include <chrono>
#include <numeric>
#include <vector>
#include <tuple>
#include <iomanip>
using namespace Eigen;
using namespace std;
using namespace std::chrono;
template <typename T>
double performTestAndMeasureTime(T& tensor) {
auto start = high_resolution_clock::now();
tensor = tensor * tensor;
auto end = high_resolution_clock::now();
duration<double, std::milli> elapsed = end - start;
return elapsed.count();
}
int main() {
cout << std::fixed << std::setprecision(16);
const int size = 3000;
const int numTests = 10;
for (int j = 0; j < 10; ++j) {
Tensor<double, 2> tensor(size, size);
tensor.setConstant(j);
Tensor<double, 2> tensor_(size, size);
tensor_.setConstant(j);
auto tensorView1 = tensor_.slice(Eigen::array<Index, 2>{0, 0}, Eigen::array<Index, 2>{size, size});
Tensor<double, 3> largerTensor(100, size * 2, size * 2);
auto tensorView2_ = largerTensor.chip(50, 0);
auto tensorView2 = tensorView2_.slice(Eigen::array<Index, 2>{0, 0}, Eigen::array<Index, 2>{size, size});
tensorView2.setConstant(j);
vector<tuple<double, double, double>> results;
for (int i = 0; i < numTests; ++i) {
double res1 = performTestAndMeasureTime(tensor);
double res2 = performTestAndMeasureTime(tensorView1);
double res3 = performTestAndMeasureTime(tensorView2);
results.emplace_back(res1, res2, res3);
}
double sum1 = 0, sum2 = 0, sum3 = 0;
for (const auto& [res1, res2, res3] : results) {
sum1 += res1;
sum2 += res2;
sum3 += res3;
}
cout << "Results for tensor constant = " << j << endl;
cout << " Average Normal: " << sum1 / numTests << " ms" << endl;
cout << " Average Sliced View: " << sum2 / numTests << " ms" << endl;
cout << " Average Noncontiguous: " << sum3 / numTests << " ms" << endl;
}
return 0;
}
This prints out
Results for tensor constant = 0
Average Normal: 109.6084899999999891 ms
Average Sliced View: 86.6384500000000060 ms
Average Noncontiguous: 559.3868599999999560 ms
Results for tensor constant = 1
Average Normal: 107.4885000000000019 ms
Average Sliced View: 86.0359700000000061 ms
Average Noncontiguous: 562.5502300000000560 ms
Results for tensor constant = 2
Average Normal: 106.0686999999999927 ms
Average Sliced View: 85.6165200000000084 ms
Average Noncontiguous: 539.1068299999999454 ms
Results for tensor constant = 3
Average Normal: 105.9735300000000109 ms
Average Sliced View: 85.8508899999999926 ms
Average Noncontiguous: 554.8193299999999226 ms
Results for tensor constant = 4
Average Normal: 112.9430100000000010 ms
Average Sliced View: 86.7122600000000006 ms
Average Noncontiguous: 559.5556199999999762 ms
Results for tensor constant = 5
Average Normal: 107.4511799999999795 ms
Average Sliced View: 87.0094900000000138 ms
Average Noncontiguous: 552.7255600000000868 ms
Results for tensor constant = 6
Average Normal: 115.6903399999999920 ms
Average Sliced View: 90.0774900000000116 ms
Average Noncontiguous: 590.9229000000000269 ms
Results for tensor constant = 7
Average Normal: 110.9844900000000081 ms
Average Sliced View: 87.2257299999999844 ms
Average Noncontiguous: 563.0328299999999899 ms
Results for tensor constant = 8
Average Normal: 108.2359100000000183 ms
Average Sliced View: 85.5443699999999865 ms
Average Noncontiguous: 555.7776999999999816 ms
Results for tensor constant = 9
Average Normal: 107.4942699999999860 ms
Average Sliced View: 84.7212200000000024 ms
Average Noncontiguous: 544.9521099999999478 ms
It's understandable that the noncontiguous tensor takes much longer because the caching mechanism of the processor is not working as well. However, I don't understand why in a high-performing library like eigen, the computation on the view of the same tensor can be this much faster than when we perform the same operation on a tensor directly. If anything I would have expected the opposite since with the slice, we need to first evaluate the result of a mapping expression before we get the final value. Why does the computation finish more quickly on the slice?
Compile command: g++ -std=c++20 -I/usr/include/eigen3/ test.cpp -o tensor_test && ./tensor_test
Eigen Version: 3.4.0
using
changed the output to the expected