So in a scenario where memory is very constrained, like on mobile devices, is it preferable to choose the seven-for-loop implementation over the matrix-matrix multiplication one?
This is just a trade-off between time and space. With the loops we avoid the overhead of materializing the matrix, but then we can't use the highly optimized dense matrix multiplication for efficient computation. Given how much data reuse convolution involves, it's hard to improve the time efficiency of the loop version.
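To make the trade-off concrete, here is a minimal NumPy sketch of both approaches. It assumes stride 1, no padding, and that the seven loops run over batch, output channel, output row, output column, input channel, filter row, and filter column. The names `conv_direct`, `im2col`, and `conv_gemm` are made up for illustration and are not from the lecture.

```python
import numpy as np

def conv_direct(x, w):
    """Direct convolution with seven nested loops; no extra matrix is built."""
    n, c_in, h, wd = x.shape
    c_out, _, k, _ = w.shape
    h_out, w_out = h - k + 1, wd - k + 1      # stride 1, no padding (assumption)
    y = np.zeros((n, c_out, h_out, w_out), dtype=x.dtype)
    for b in range(n):                         # 1: image in the batch
        for o in range(c_out):                 # 2: output channel
            for i in range(h_out):             # 3: output row
                for j in range(w_out):         # 4: output column
                    for c in range(c_in):      # 5: input channel
                        for p in range(k):     # 6: filter row
                            for q in range(k): # 7: filter column
                                y[b, o, i, j] += x[b, c, i + p, j + q] * w[o, c, p, q]
    return y

def im2col(x, k):
    """Materialize every k-by-k patch as one row of a dense matrix."""
    n, c_in, h, wd = x.shape
    h_out, w_out = h - k + 1, wd - k + 1
    cols = np.empty((n, h_out * w_out, c_in * k * k), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, :, i:i + k, j:j + k]            # shape (n, c_in, k, k)
            cols[:, i * w_out + j, :] = patch.reshape(n, -1)
    return cols

def conv_gemm(x, w):
    """Same convolution expressed as one dense matrix multiplication."""
    n, c_in, h, wd = x.shape
    c_out, _, k, _ = w.shape
    h_out, w_out = h - k + 1, wd - k + 1
    cols = im2col(x, k)                  # (n, h_out*w_out, c_in*k*k)
    w_mat = w.reshape(c_out, -1)         # (c_out, c_in*k*k)
    y = cols @ w_mat.T                   # (n, h_out*w_out, c_out)
    return y.transpose(0, 2, 1).reshape(n, c_out, h_out, w_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3, 8, 8))    # one 3-channel 8x8 input
w = rng.standard_normal((4, 3, 3, 3))    # four 3x3 filters
print(np.allclose(conv_direct(x, w), conv_gemm(x, w)))
print(x.size, im2col(x, 3).size)         # input elements vs. materialized matrix elements
```

Both functions compute the same output, but the materialized matrix repeats each input value in every patch that covers it, so it has up to k*k times as many elements as the input. That duplication is the space cost being weighed against the speed of the dense multiply.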
@jerryzh Based on the lecture, it seems the best approach would be to prune out unnecessary (near-zero) filter values, apply compression to the filters to further reduce their space usage, and reduce the precision of filter values from 64-bit to 8-bit or, at the extreme, possibly even 1-bit.
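As a rough sketch of the pruning and precision-reduction steps (not the compression step), something like the following could work. The threshold value, the symmetric per-tensor scaling scheme, and the function names `prune`, `quantize_int8`, and `dequantize` are all assumptions made for illustration, not the method from the lecture.

```python
import numpy as np

def prune(w, threshold):
    """Zero out weights whose magnitude is below the threshold."""
    return np.where(np.abs(w) < threshold, 0.0, w)

def quantize_int8(w):
    """Symmetric linear quantization to int8 using a single scale for the whole tensor."""
    max_abs = np.max(np.abs(w))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights from the int8 values."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3, 3, 3))    # 64-bit float filters
w_pruned = prune(w, threshold=0.1)       # hypothetical threshold
q, scale = quantize_int8(w_pruned)
print(w.nbytes, q.nbytes)                # storage in bytes before and after
print(np.max(np.abs(dequantize(q, scale) - w_pruned)))  # worst-case rounding error
```

Going from 64-bit to 8-bit values cuts storage by 8x on its own. The zeros introduced by pruning only save space once the filters are stored in a sparse or compressed format, which this sketch does not do.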