Batching is effective for GPU inference because it exploits hardware parallelism. Implemented request aggregation and batching in llama.cpp to explore whether batching also benefits CPU inference; the optimizations reduced request response time by 20%.
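A minimal sketch of the request-aggregation idea, not the actual llama.cpp patch: incoming prompts are collected from a queue until either a batch-size cap or a short timeout is hit, so one forward pass can serve many requests. The function name and parameters (`batch_requests`, `max_batch`, `timeout_s`) are illustrative assumptions.

```python
import queue
import time

def batch_requests(req_queue, max_batch=8, timeout_s=0.01):
    """Collect up to max_batch requests, waiting at most timeout_s
    after the first arrival, so one inference pass serves many prompts.
    (Illustrative sketch; not the real llama.cpp scheduler.)"""
    batch = [req_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(req_queue.get(timeout=remaining))
        except queue.Empty:
            break  # timeout expired; ship a partial batch
    return batch
```

The timeout bounds the latency cost of waiting for a fuller batch, which is the usual throughput/latency trade-off in request aggregation.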
Implemented several Oblivious RAM (ORAM) techniques that mitigate memory-access-pattern side channels, and measured their performance overhead relative to an unprotected baseline.
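The simplest access-pattern defense, useful as a baseline before full ORAM, is a linear scan: touch every element on each read so the memory trace is independent of the secret index. A toy sketch (the function name `oblivious_read` is an assumption; a real implementation would also avoid the data-dependent comparison with constant-time bit tricks):

```python
def oblivious_read(arr, secret_index):
    """Linear-scan read: every element is accessed regardless of
    secret_index, so the address trace leaks nothing about it.
    (Baseline sketch; tree-based ORAMs amortize this O(n) cost.)"""
    result = 0
    for i, v in enumerate(arr):
        mask = 1 if i == secret_index else 0  # select without skipping accesses
        result += mask * v
    return result
```

Schemes like Path ORAM reduce the per-access cost from O(n) to polylogarithmic, which is the overhead gap such measurements typically quantify.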
Built a simulator in C++ and Python to analyze memory access patterns and improve the performance of disaggregated server architectures by exploiting the predictable access patterns of data-oblivious algorithms.
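A toy model of the underlying idea, with assumed names and latency numbers (`simulate`, `prefetch_depth`, 100 ns local vs 1000 ns remote): because a data-oblivious algorithm's access sequence is known in advance, the memory node can prefetch upcoming addresses and turn remote misses into local hits.

```python
def simulate(access_seq, prefetch_depth, local_hit_ns=100, remote_ns=1000):
    """Toy disaggregated-memory cost model: charge remote latency on a
    miss, local latency on a hit, and prefetch the next prefetch_depth
    addresses from the (predictable) oblivious schedule."""
    cache = set()
    total_ns = 0
    for t, addr in enumerate(access_seq):
        total_ns += local_hit_ns if addr in cache else remote_ns
        cache.add(addr)
        # prefetch future addresses revealed by the fixed schedule
        for future in access_seq[t + 1 : t + 1 + prefetch_depth]:
            cache.add(future)
    return total_ns
```

With `prefetch_depth=0` every access pays the remote cost; any positive depth hides the remote latency for all but the first access, which is the win predictability buys.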
Implemented a dataset expander, generator, and robustness checker based on reverse nearest neighbors, using augmentation techniques such as distance averaging and random noise addition.
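A small sketch of the two named augmentations on 2-D points, with assumed names (`augment`, `noise_scale`): "distance averaging" is rendered here as the midpoint of consecutive point pairs, and "random noise addition" as a Gaussian-jittered copy of each point.

```python
import random

def augment(points, noise_scale=0.01, seed=0):
    """Expand a list of (x, y) points with two simple augmentations:
    (1) the midpoint of each consecutive pair (distance averaging),
    (2) a Gaussian-jittered copy of each point (noise addition)."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    out = list(points)
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        out.append(((x1 + x2) / 2, (y1 + y2) / 2))  # averaged sample
    for (x, y) in points:
        out.append((x + rng.gauss(0, noise_scale),
                    y + rng.gauss(0, noise_scale)))  # noisy copy
    return out
```

A robustness check would then verify that model predictions (or reverse-nearest-neighbor sets) stay stable across the original and augmented copies.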