
Lower Energy, High Performance LLM On FPGA Without Matrix Multiplication


A new technical paper titled “Scalable MatMul-free Language Modeling” was published by researchers at UC Santa Cruz, Soochow University, UC Davis, and LuxiTech.

Abstract

“Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full-precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model’s memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at this https URL.”
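To give a sense of how MatMul can be eliminated, the sketch below is a minimal, illustrative NumPy example (not the authors’ implementation) assuming BitNet-style ternary weights: when every weight is constrained to {-1, 0, +1}, each multiply-accumulate in a dense layer collapses into an addition, a subtraction, or a skip, which is exactly the kind of lightweight operation the paper’s GPU kernels and FPGA accelerator are built around.

```python
# Illustrative sketch only: a dense layer with ternary weights, so the usual
# MatMul reduces to additions and subtractions of input elements.
import numpy as np

def ternary_quantize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Round full-precision weights to {-1, 0, +1} with a per-tensor scale
    (absmean scaling, as in BitNet-style quantization; an assumption here)."""
    scale = float(np.mean(np.abs(w))) + 1e-8
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary.astype(np.int8), scale

def matmul_free_linear(x: np.ndarray, w_ternary: np.ndarray, scale: float) -> np.ndarray:
    """Compute x @ W^T using only additions/subtractions over ternary weights."""
    out = np.zeros((x.shape[0], w_ternary.shape[0]), dtype=x.dtype)
    for j, row in enumerate(w_ternary):        # one output channel per ternary row
        plus = x[:, row == 1].sum(axis=1)      # add inputs where the weight is +1
        minus = x[:, row == -1].sum(axis=1)    # subtract inputs where the weight is -1
        out[:, j] = plus - minus               # weights of 0 contribute nothing
    return out * scale                         # fold the quantization scale back in

# Tiny usage example with hypothetical shapes
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)   # full-precision weights
x = rng.normal(size=(2, 8)).astype(np.float32)   # a batch of activations
w_t, s = ternary_quantize(w)
print(np.allclose(matmul_free_linear(x, w_t, s), x @ (w_t.T * s)))  # True: same result, no MatMul in the loop
```

The loop over output channels is deliberately explicit to show that no multiplications are needed once the weights are ternary; the paper’s actual contribution includes fused GPU kernels and a custom FPGA datapath that exploit this structure at billion-parameter scale.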

Find the technical paper here (preprint). Published June 2024. The university’s news summary is here.

Zhu, Rui-Jie, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason K. Eshraghian. “Scalable MatMul-free Language Modeling.” arXiv preprint arXiv:2406.02528 (2024).
