Quantcast
Viewing latest article 5
Browse Latest Browse All 12

Answer by OldCurmudgeon for Why are elementwise additions much faster in separate loops than in a combined loop?

Imagine you are working on a machine where n was just the right value for it only to be possible to hold two of your arrays in memory at one time, but the total memory available, via disk caching, was still sufficient to hold all four.

Assuming a simple LIFO caching policy, this code:

for(int j=0;j<n;j++){
    a[j] += b[j];
}
for(int j=0;j<n;j++){
    c[j] += d[j];
}

would first cause a and b to be loaded into RAM and then be worked on entirely in RAM. When the second loop starts, c and d would then be loaded from disk into RAM and operated on.

the other loop

for(int j=0;j<n;j++){
    a[j] += b[j];
    c[j] += d[j];
}

will page out two arrays and page in the other two every time around the loop. This would obviously be much slower.

You are probably not seeing disk caching in your tests but you are probably seeing the side effects of some other form of caching.


There seems to be a little confusion/misunderstanding here so I will try to elaborate a little using an example.

Say n = 2 and we are working with bytes. In my scenario we thus have just 4 bytes of RAM and the rest of our memory is significantly slower (say 100 times longer access).

Assuming a fairly dumb caching policy of if the byte is not in the cache, put it there and get the following byte too while we are at it you will get a scenario something like this:

  • With

    for(int j=0;j<n;j++){
     a[j] += b[j];
    }
    for(int j=0;j<n;j++){
     c[j] += d[j];
    }
    
  • cache a[0] and a[1] then b[0] and b[1] and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.

  • set a[1] = a[1] + b[1] in cache. Cost = 1 + 1.
  • Repeat for c and d.
  • Total cost = (100 + 100 + 1 + 1) * 2 = 404

  • With

    for(int j=0;j<n;j++){
     a[j] += b[j];
     c[j] += d[j];
    }
    
  • cache a[0] and a[1] then b[0] and b[1] and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.

  • eject a[0], a[1], b[0], b[1] from cache and cache c[0] and c[1] then d[0] and d[1] and set c[0] = c[0] + d[0] in cache. Cost = 100 + 100.
  • I suspect you are beginning to see where I am going.
  • Total cost = (100 + 100 + 100 + 100) * 2 = 800

This is a classic cache thrash scenario.


Viewing latest article 5
Browse Latest Browse All 12

Trending Articles