Fastest Pixel Access (for loop vs. visitors)

Hi,

with :cvb: 14 we released visitors, a CVB++ pattern to implement pixel/element-based algorithms independently of the memory layout and the data type. This alone makes visitors a very convenient and useful feature. But what about performance? What would be good performance anyway? And how could we optimize them, if possible?

We all write performant code, don't we? Let's check some benchmarks!

First of all, define your system and compiler (newer compilers may optimize differently and lead to totally different results). Be aware that an optimization for Windows is not necessarily an improvement on Linux either.

System & Compiler
  • Windows 10
  • VC 14.2 (Visual Studio 2019)
  • i5-6600
  • Catch2 Benchmark

Secondly, find an optimal reference as a target. In this case I use three vectors, simply loop over them, multiply the elements in a with the corresponding elements in b and store the result in c.

// enough data to get reliable results within a few seconds
const size_t ContainerSize = 128 * 1024 * 1024;
std::vector<uint8_t> a(ContainerSize);
std::vector<uint8_t> b(ContainerSize);
std::vector<uint8_t> c(ContainerSize);

for (size_t i = 0; i < c.size(); ++i)
       c[i] = a[i] * b[i];

This seems pretty optimal. The standard library is usually a good starting point and the vectors guarantee contiguous memory. The algorithm is super simple and the element access is also pretty straightforward. Or in other words: I would expect this to be approved during code review ;-).
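
The timings below are taken with Catch2 (see the setup list above). As a rough sketch of how the reference loop could sit inside such a benchmark (assuming Catch2 v2 with CATCH_CONFIG_ENABLE_BENCHMARKING defined; the test and benchmark names are just illustrative):

#define CATCH_CONFIG_ENABLE_BENCHMARKING
#define CATCH_CONFIG_MAIN
#include <catch2/catch.hpp>

#include <cstddef>
#include <cstdint>
#include <vector>

TEST_CASE("pixel multiplication")
{
  // enough data to get reliable results within a few seconds
  const size_t ContainerSize = 128 * 1024 * 1024;
  std::vector<uint8_t> a(ContainerSize, 2);
  std::vector<uint8_t> b(ContainerSize, 3);
  std::vector<uint8_t> c(ContainerSize);

  BENCHMARK("classic for loop")
  {
    for (size_t i = 0; i < c.size(); ++i)
      c[i] = a[i] * b[i];
    return c.front(); // return a value so the compiler cannot drop the work entirely
  };
}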

Now let’s compare this to visitors in CVB++. As visitors do not work on standard containers but on image planes, we have to add some complexity by wrapping images around the containers’ data first.

auto imageA = Cvb::WrappedImage::FromGrey8Pixels(a.data(), 16 * 1024, 8 * 1024);
auto imageB = Cvb::WrappedImage::FromGrey8Pixels(b.data(), 16 * 1024, 8 * 1024);
auto imageC = Cvb::WrappedImage::FromGrey8Pixels(c.data(), 16 * 1024, 8 * 1024);

Cvb::Visit([](auto blockA, auto blockB, auto blockC)
  {
    for (int y = 0; y < blockC.Height(); ++y)
      for (int x = 0; x < blockC.Width(); ++x)
        blockC(x, y) = blockA(x, y) * blockB(x, y);
  }, imageA->Plane(0), imageB->Plane(0), imageC->Plane(0));

Now the question is: how fast do we get with the visitor pattern? Let’s have a look at the results:

classic for loop: 193.443 ms
CVB++ visitor:     73.948 ms

It came as quite a surprise to me that the visitors take less than 40% of the time the classic for loop requires. So the CVB++ visitors are much faster. Clearly there must be something wrong with my reference. Of course I double-checked that both produce the same results.
Anyway, it seems that CVB++ visitors are really, really fast.
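
As a side note, that sanity check can be as simple as letting the visitor write into a second buffer and comparing element-wise. A minimal sketch (cVis is just an illustrative name for a second output vector, wrapped with Cvb::WrappedImage::FromGrey8Pixels just like c above):

#include <algorithm>
#include <cassert>

// c was filled by the classic for loop, cVis through the CVB++ visitor
assert(std::equal(c.begin(), c.end(), cVis.begin()));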

Please leave a comment or :heart: if you want me to go into the performance details.

Cheers

Andreas


Hi,

It has been some time, but I did not forget that I promised some more details.

However, even after some more experiments I did not find a compelling answer as to why:

for (size_t i = 0; i < c.size(); ++i)
       c[i] = a[i] * b[i];

is so much slower than visitors. I got a small improvement by caching the size.

auto size = c.size();
for (size_t i = 0; i < size; ++i)
       c[i] = a[i] * b[i];

But this speedup was not nearly enough to explain the gap. Anyway, I would have expected the compiler to optimize that for me. Modern compilers are supposed to be smart, right? So I decided to have a look and configured Visual Studio to output the assembly code.

C/C++ → Assembler Output

  • /FA Assembly code; .asm
  • /FAc Machine and assembly code; .cod
  • /FAs Source and assembly code; .asm (my choice, as I’m not good at reading assembly code)
  • /FAcs Machine, source, and assembly code; .cod

This finally gave me an answer. For whatever reason the compiler unrolls some of the loops inside the Catch2 benchmark, producing significantly more instructions than the visitors. So I implemented a comparison without Catch2 and had a look at the assembly. The unrolling was gone and the simple for loop was slightly faster than the visitors.
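
The standalone comparison does not need much: two lambdas and std::chrono. A minimal sketch of roughly how it could look (the helper Measure is just an illustrative name; the buffers and wrapped images are the ones from the first post, and the two calls go inside main):

#include <chrono>
#include <iostream>

// runs f once and prints the elapsed wall-clock time in milliseconds
template <typename Func>
void Measure(const char* name, Func f)
{
  auto start = std::chrono::steady_clock::now();
  f();
  auto stop = std::chrono::steady_clock::now();
  std::cout << name << ": "
            << std::chrono::duration<double, std::milli>(stop - start).count() << " ms\n";
}

Measure("classic for loop", [&]
  {
    for (size_t i = 0; i < c.size(); ++i)
      c[i] = a[i] * b[i];
  });

Measure("CVB++ visitor", [&]
  {
    Cvb::Visit([](auto blockA, auto blockB, auto blockC)
      {
        for (int y = 0; y < blockC.Height(); ++y)
          for (int x = 0; x < blockC.Width(); ++x)
            blockC(x, y) = blockA(x, y) * blockB(x, y);
      }, imageA->Plane(0), imageB->Plane(0), imageC->Plane(0));
  });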

Now my worldview is back in order. However, I’m a little disappointed with Catch2 or the compiler, or both. On the other hand, it is another reason to use visitors, as the compiler seems to handle them more reliably.
