Our group is investigating all aspects of efficient, low-power in-sensor ML. This includes network architectures, model compilation, compilation to target-specific backends, and various hardware acceleration paradigms (traditional hardwired accelerators, processing-in-memory, ISA extensions, and dataflow). We believe that the diversity of network topologies, applications, and power-performance-area tradeoffs calls for the ability to study the vast design space efficiently and to make relevant design choices.
Our group has investigated how memory-side processing can improve performance and minimize the bookkeeping data brought into the processor's caches. For example, we explored programmable macros (called gestures) that can be executed by memory-side logic (possibly implemented as processing-in-memory). Such a gesture could improve the performance of depth-first applications: the memory engine traverses the structure and returns only the relevant node, eliminating the cache pollution that results from bringing intermediate nodes to the CPU. We have also explored using memory-side logic to achieve bounded-time memory allocation and garbage collection.
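To make the idea of a traversal gesture concrete, the following is a minimal software sketch, not the hardware design from the cited papers. The names `Node`, `dfs_gesture`, and `is_relevant` are hypothetical and chosen only for illustration. The function models logic near memory that walks a tree depth-first and hands back only the matching node, while counting how many nodes it touched on the memory side.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    # Hypothetical tree node; a real structure would carry a payload too.
    key: int
    children: List["Node"] = field(default_factory=list)


def dfs_gesture(root: Optional[Node],
                is_relevant: Callable[[Node], bool]) -> Tuple[Optional[Node], int]:
    """Software model of a traversal performed by memory-side logic.

    Returns the first node, in depth-first pre-order, for which
    is_relevant is true, together with the number of nodes touched.
    Only the returned node would need to travel to the CPU caches.
    """
    touched = 0
    stack = [root] if root is not None else []
    while stack:
        node = stack.pop()
        touched += 1
        if is_relevant(node):
            return node, touched
        # Push children right-to-left so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return None, touched


# Small tree: node 1 has children 2 and 3; node 2 has children 4 and 5.
tree = Node(1, [Node(2, [Node(4), Node(5)]), Node(3)])
found, touched = dfs_gesture(tree, lambda n: n.key == 5)
```

In this model the engine touches four nodes (1, 2, 4, 5) to locate key 5, but only one node crosses to the CPU side; the intermediate nodes never enter the processor's caches.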
More recently, we have been investigating how memory-side logic (or a specialized Load/Store Unit) can be used to handle sparse matrix/vector data and to perform scatter/gather operations. The data can be buffered for use by processing engines (such as GPUs or vector units) without those engines having to resolve array indexes.
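As an illustration of the gather idea, here is a minimal sketch that assumes the sparse matrix is stored in compressed sparse row (CSR) form; the storage format and the function names `gather` and `spmv_with_gather` are assumptions made for this example, not details taken from our design. The gather step stands in for the memory-side logic: it resolves the column indexes and produces a contiguous buffer, so the compute loop only multiplies aligned streams.

```python
from typing import List, Sequence


def gather(dense: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Stand-in for memory-side logic: resolve indexes into a packed buffer."""
    return [dense[i] for i in indices]


def spmv_with_gather(values: Sequence[float],
                     col_idx: Sequence[int],
                     row_ptr: Sequence[int],
                     x: Sequence[float]) -> List[float]:
    """Sparse matrix-vector product y = A @ x with A in CSR form.

    All index resolution happens once, in gather(); the per-row loop
    below consumes two contiguous streams and never looks at col_idx.
    """
    gathered = gather(x, col_idx)
    y = []
    for row in range(len(row_ptr) - 1):
        lo, hi = row_ptr[row], row_ptr[row + 1]
        y.append(sum(v * g for v, g in zip(values[lo:hi], gathered[lo:hi])))
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
y = spmv_with_gather(values, col_idx, row_ptr, x)
```

A scatter operation would be the mirror image: the processing engine emits a packed result buffer and the memory-side logic writes each element back to its indexed location.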
6. S. Adavally, K. Kavi, N. Gulur. "A technique for improving performance of moderately sparse matrix algorithms", submitted for publication.
5. M. Rezaei and K. M. Kavi. "Intelligent memory manager: Reducing cache pollution due to memory management functions", Journal of Systems Architecture, Vol. 52, No. 1, pp 207-219 (Jan. 2006).
4. Wentong Li, Saraju Mohanty and Krishna Kavi. "Page-based software-hardware co-design of a dynamic memory allocator", IEEE Computer Architecture Letters, 2006.
3. L.M. Fox, C.R. Hill, R.K. Cytron and K.M. Kavi. "Optimization of storage-referencing gestures" Proceedings of the Workshop on Compilers and Tools for Constrained Embedded Systems (CTES-2003), held in conjunction with Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES-2003), Oct. 29, 2003, San Jose, CA.
2. S. Donahue, M.P. Hampton, R. Cytron, M. Franklin and K.M. Kavi. "Hardware support for fast and bounded time storage allocation", Proceedings of the Workshop on Memory Processor Interfaces (WMPI), in conjunction with the International Symposium on Computer Architecture, May 2002, Anchorage, Alaska.
1. S.M. Donahue, M.P. Hampton, M. Deters, J.M. Nye, R.K. Cytron and K.M. Kavi. "Storage Allocation for real-time, embedded systems", Proceedings of the First International Workshop on Embedded Software (Washington, DC, May 2001), Springer Verlag, pp 131-147
Dataflow Computational Models
Our group has investigated dataflow architectures for more than two decades. This work has resulted in the design of the Scheduled Dataflow Architecture.
More recently, we have investigated reconfigurable dataflow graphs as accelerators. A dataflow graph is configured using coarse-grained functional units to represent computational kernels. Once configured, the accelerator performs the computation as if an ASIC had been created for the kernel, thus eliminating instruction fetch, decode, and issue operations.
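The configure-once, fire-on-data behavior can be sketched in software as follows. This is a toy model under stated assumptions, not our accelerator: the class `DataflowGraph` and its methods `add` and `run` are hypothetical names, and ordinary Python functions stand in for the coarse-grained functional units. Each node fires as soon as all of its operands are available, and no per-operation instruction stream is involved once the graph is built.

```python
import operator
from typing import Callable, Dict, Tuple


class DataflowGraph:
    """Toy dataflow graph: nodes fire when all their inputs hold a value."""

    def __init__(self) -> None:
        # node name -> (functional unit, names of the operands it consumes)
        self.nodes: Dict[str, Tuple[Callable, Tuple[str, ...]]] = {}

    def add(self, name: str, unit: Callable, *inputs: str) -> None:
        """Configuration step: place a functional unit and wire its inputs."""
        self.nodes[name] = (unit, inputs)

    def run(self, **inputs: float) -> Dict[str, float]:
        """Execution step: stream one set of inputs through the graph."""
        tokens = dict(inputs)
        pending = dict(self.nodes)
        while pending:
            ready = [name for name, (_, ins) in pending.items()
                     if all(i in tokens for i in ins)]
            if not ready:
                raise ValueError("some nodes never receive all their operands")
            for name in ready:
                unit, ins = pending.pop(name)
                tokens[name] = unit(*(tokens[i] for i in ins))
        return tokens


# Configure the kernel out = (a + b) * (a - b) once...
graph = DataflowGraph()
graph.add("sum", operator.add, "a", "b")
graph.add("diff", operator.sub, "a", "b")
graph.add("out", operator.mul, "sum", "diff")

# ...then reuse the same configuration for every new set of operands.
first = graph.run(a=5, b=3)["out"]
second = graph.run(a=10, b=4)["out"]
```

In this sketch "sum" and "diff" become ready in the same step, which mirrors how independent functional units in a configured graph can operate in parallel.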
1. Charles Shelor, Krishna Kavi, "Reconfigurable Dataflow Graphs For Processing-In-Memory", IEEE 20th International Conference on Distributed Computing and Networking (ICDCN-2019), Bengaluru, India, Jan. 4-7, 2019.
2. C. Shelor and K. Kavi. "Dataflow based near data computing achieves excellent energy efficiency", International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2017), Bochum, Germany, June 7-9, 2017.
Value Prediction and Redundant Computations
In another project, we have investigated how to eliminate the execution of side-effect-free functions by caching previous results. The cache is indexed using the address of the function; if the argument(s) of the current call match the argument(s) of the previous call, the function invocation is suppressed and the result from the cache is inserted at the appropriate point in the processing pipeline.
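The lookup described above can be modeled in a few lines of software. This is a sketch under stated assumptions rather than the hardware mechanism: the class name `ResultCache` is hypothetical, Python's `id(func)` stands in for the function's address, and each entry holds only the most recent arguments and result for that function.

```python
from typing import Any, Callable, Dict, Tuple


class ResultCache:
    """Software model of a result cache for side-effect-free functions."""

    def __init__(self) -> None:
        # function "address" -> (arguments of the previous call, its result)
        self.entries: Dict[int, Tuple[Tuple[Any, ...], Any]] = {}
        self.hits = 0

    def call(self, func: Callable, *args: Any) -> Any:
        key = id(func)  # stand-in for the function's address
        entry = self.entries.get(key)
        if entry is not None and entry[0] == args:
            # Arguments match the previous call: suppress the invocation
            # and reuse the cached result.
            self.hits += 1
            return entry[1]
        result = func(*args)
        self.entries[key] = (args, result)
        return result


def square(x: int) -> int:
    return x * x


cache = ResultCache()
r1 = cache.call(square, 4)  # executes square
r2 = cache.call(square, 4)  # same argument as the previous call: cache hit
r3 = cache.call(square, 5)  # different argument: executes and replaces entry
```

Keeping a single entry per function keeps the model close to the description above; a design that remembers several argument sets per function would trade more storage for a higher hit rate. The scheme is only safe for functions whose result depends solely on their arguments.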
P. Chen, K. Kavi and R. Akl. "Performance enhancement by eliminating redundant function execution", Proceedings of the IEEE 39th Annual Simulation Conference, Huntsville, AL, April 2-6, 2006, pp 143-150.