Fastest Pixel Access (for loop vs. visitors)

Andreas · July 5, 2022, 8:11pm

Hi,

with 14 we released visitors, a CVB++ pattern to implement pixel/element based algorithms independently from the memory layout and the the data type. This alone makes visitors a very convenient and useful feature. But what about performance? What would be a good performance anyway? And how to optimize them if possible?

We all write performant code, do we? Let’s check some benchmarks!

First of all define your system and compiler (newer compilers may optimize differently and lead to totally different results). Be aware that optimizing your code for Windows does not have to be an improvement on Linux either.

System & Complier

Windows 10
VC 14.2 (Visual Studio 2019)
i5-6600
Catch2 Benchmark

Secondly, find a optimal reference as a target. In this case I use three vectors and will simply loop over those vectors and and multiply the elements in a with the corresponding elements in b and store the result in c.

// enough data to get reliable results within a few seconds
const size_t ContainerSize = 128 * 1024 * 1024;
std::vector<uint8_t> a(ContainerSize);
std::vector<uint8_t> b(ContainerSize);
std::vector<uint8_t> c(ContainerSize);

for (size_t i = 0; i < c.size(); ++i)
       c[i] = a[i] * b[i];

This seems pretty optimal. The standard library is usually a good point to start and the vectors guarantee contiguous memory. The algorithm is super simple and the element access is also pretty straight forward. Or in other words: I would expect this to be approved during code review ;-).

Now let’s compare this to visitors in CVB++. As visitors do not work on standard containers but image planes we have to add some complexity by wrapping images around the containers data first.

auto imageA = Cvb::WrappedImage::FromGrey8Pixels(a.data(), 16 * 1024, 8 * 1024);
auto imageB = Cvb::WrappedImage::FromGrey8Pixels(b.data(), 16 * 1024, 8 * 1024);
auto imageC = Cvb::WrappedImage::FromGrey8Pixels(c.data(), 16 * 1024, 8 * 1024);

Cvb::Visit([func](auto blockA, auto blockB, auto blockC)
         {
            for (int y = 0; y < blockC.Height(); ++y)
              for (int x = 0; x < blockC.Width(); ++x)
                blockC(x, y) = blockA(x, y) * blockB(x, y);       
          }, imageA->Plane(0), imageB->Plane(0), imageC->Plane(0));

Now the question is how fast do we get with the visitor pattern. Let’s have a look at the results:
(I’ve blurred the result so you can test your intuition about this benchmark)

classic for loop: 193.443 ms
CVB++ visitor:     73.948 ms

It came as quite a surprise to me that the visitors take less 40% of the time the classic for loop requires. So the CVB++ visitors are much faster. Clearly there must be something wrong with my reference. Of course I double checked that they both produce the same results.
Anyway, it seems that CVB++ visitors are really really fast.

Please leave a comment or if you want me to go into the performance details.

Cheers

Andreas

Andreas · October 13, 2022, 6:48am

Hi,

It has been some, but I did not forget that I promised some more details.

However, after some more experiments I did not find a compelling answer why:

for (size_t i = 0; i < c.size(); ++i)
       c[i] = a[i] * b[i];

is so much slower than visitors. I got a small improvement by caching the size.

auto size =  c.size();
for (size_t i = 0; i < size; ++i)
       c[i] = a[i] * b[i];

But this speedup was not nearly enough to explain the gap. Anyway I would have expected the compiler to optimize that for me. Modern compilers are supposed to be smart right? So I decided to have a look and configured Visual Studio so it outputs the assembly code.

C/C++ → Assembler Output

/FA Assembly code; .asm
/FAc Machine and assembly code; .cod
/FAs Source and assembly code; .asm (my choice as I’m not good in reading assembly code)
/FAcs Machine, source, and assembly code; .cod

This finally gave me an answer. For whatever reason the compiler unrolls some of loops inside the Catch2 benchmark producing significantly more instruction than the visitors. So I implemented a comparison without Catch2 and had a look at the assembly. The unrolling was gone and the simple for loop was slightly faster than the visitors.

No my worldview is back in order. However I’m a little disappointed with Catch2 or the compiler, or both. On the other hand another reason to use visors as the compiler can handle them more reliably it seems.

Topic		Replies	Views
Lib Bridge \| CVB to OpenCV C++ API opencv	8	2168	February 6, 2020
About the C++ API category C++ API	1	1744	December 19, 2017
Image Display Performance C++ API display	3	2246	February 13, 2018
Image Saving Time Considerations General Questions	2	332	June 21, 2022
CVuEye.vin - Access violation when accessing pixel data C-style API	5	608	May 4, 2020

Fastest Pixel Access (for loop vs. visitors)

Related topics