To get the same work complexity without the storage cost, you can do what Halide does and use the "sliding window" method (Halide is a domain specific programming language designed for high performance image processing).
The sliding window method computes values when they are needed and stores them until they are no longer needed.
In this particular example we could do something like:
float tmp_buff[3];
for (int j = 0; j < HEIGHT; j++) {
int i = 0;
// initialize the tmp buffer
for (int ii = 0; ii <3; ii++) {
float tmp = 0.f;
for (int jj = 0; jj < 3; jj++) {
tmp += input[(j + jj) * (WIDTH + 2) + i+ii];
}
tmp_buff[ii] = tmp;
}
// process patches horizontally
while (i < WIDTH) {
float tmp = 0.f;
float patch_val = 0.f;
for (int ii= 0;ii < 3; ii++)
patch_val += weights[ii] * tmp[ii];
output[j*WIDTH + i] = patch_val;
// slide over to next 3x3 patch and evict leftmost tmp_buff value
i++;
float new_val = 0.f;
for (int jj = 0; jj < 3; jj++)
new_val += input[(j+jj) * (WIDTH+2) + i+2];
// shift tmp_buff values over
tmp_buff[0] = tmp_buff[1];
tmp_buff[1] = tmp_buff[2];
tmp_buff[2] = new_val;
}
}
So here we only store 3 values in the temp buffer.
For more info see slide from 15-418.
kayvonf
@lucida. Nice. One quick clarification: I wouldn't say "Halide does" the optimization you described above. Halide makes it easier to write the program you wrote above by means of its schedule definition API, but the schedule used is a decision by the programmer, not by the Halide compiler.
lucida
@kayvonf right, bad working on my part! What I meant by "Halide does" is, that with Halide, the programmer does not have to be concerned with the nitty gritty details of how to implement the sliding window, the programmer just needs to tell Halide to use this type of scheduler.
To get the same work complexity without the storage cost, you can do what Halide does and use the "sliding window" method (Halide is a domain specific programming language designed for high performance image processing).
The sliding window method computes values when they are needed and stores them until they are no longer needed.
In this particular example we could do something like:
So here we only store 3 values in the temp buffer. For more info see slide from 15-418.
@lucida. Nice. One quick clarification: I wouldn't say "Halide does" the optimization you described above. Halide makes it easier to write the program you wrote above by means of its schedule definition API, but the schedule used is a decision by the programmer, not by the Halide compiler.
@kayvonf right, bad working on my part! What I meant by "Halide does" is, that with Halide, the programmer does not have to be concerned with the nitty gritty details of how to implement the sliding window, the programmer just needs to tell Halide to use this type of scheduler.