Fastest Pixel Access (for loop vs. visitors)

Hi,

with :cvb: 14 we released visitors, a CVB++ pattern to implement pixel/element based algorithms independently from the memory layout and the the data type. This alone makes visitors a very convenient and useful feature. But what about performance? What would be a good performance anyway? And how to optimize them if possible?

We all write performant code, do we? Let’s check some benchmarks!

First of all define your system and compiler (newer compilers may optimize differently and lead to totally different results). Be aware that optimizing your code for Windows does not have to be an improvement on Linux either.

System & Complier
  • Windows 10
  • VC 14.2 (Visual Studio 2019)
  • i5-6600
  • Catch2 Benchmark

Secondly, find a optimal reference as a target. In this case I use three vectors and will simply loop over those vectors and and multiply the elements in a with the corresponding elements in b and store the result in c.

// enough data to get reliable results within a few seconds
const size_t ContainerSize = 128 * 1024 * 1024;
std::vector<uint8_t> a(ContainerSize);
std::vector<uint8_t> b(ContainerSize);
std::vector<uint8_t> c(ContainerSize);

for (size_t i = 0; i < c.size(); ++i)
       c[i] = a[i] * b[i];

This seems pretty optimal. The standard library is usually a good point to start and the vectors guarantee contiguous memory. The algorithm is super simple and the element access is also pretty straight forward. Or in other words: I would expect this to be approved during code review ;-).

Now let’s compare this to visitors in CVB++. As visitors do not work on standard containers but image planes we have to add some complexity by wrapping images around the containers data first.

auto imageA = Cvb::WrappedImage::FromGrey8Pixels(a.data(), 16 * 1024, 8 * 1024);
auto imageB = Cvb::WrappedImage::FromGrey8Pixels(b.data(), 16 * 1024, 8 * 1024);
auto imageC = Cvb::WrappedImage::FromGrey8Pixels(c.data(), 16 * 1024, 8 * 1024);

Cvb::Visit([func](auto blockA, auto blockB, auto blockC)
         {
            for (int y = 0; y < blockC.Height(); ++y)
              for (int x = 0; x < blockC.Width(); ++x)
                blockC(x, y) = blockA(x, y) * blockB(x, y);       
          }, imageA->Plane(0), imageB->Plane(0), imageC->Plane(0));

Now the question is how fast do we get with the visitor pattern. Let’s have a look at the results:
(I’ve blurred the result so you can test your intuition about this benchmark)

classic for loop: 193.443 ms
CVB++ visitor:     73.948 ms

It came as quite a surprise to me that the visitors take less 40% of the time the classic for loop requires. So the CVB++ visitors are much faster. Clearly there must be something wrong with my reference. Of course I double checked that they both produce the same results.
Anyway, it seems that CVB++ visitors are really really fast.

Please leave a comment or :heart: if you want me to go into the performance details.

Cheers

Andreas

4 Likes