Intel and AMD define ACE extensions for CPU-based AI
Intel and AMD have released the full specification for ACE, a new set of CPU extensions designed to make AI workloads easier and more power-efficient on x86 processors. The approach targets smaller models, single-user latency-sensitive tasks, and systems where a GPU is unavailable or limited.
ACE builds on existing AVX10 registers while adding silicon dedicated to matrix multiplication, a core operation in AI workloads. The design takes the same 512-bit vector inputs as AVX10, which simplifies integration with current CPU designs. For the same number of input vectors, ACE can perform 16 times as many operations as AVX10, though actual speed gains will depend on each implementation.
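One way to see where a 16x figure could come from is to count multiply-accumulates. The sketch below rests on two assumptions that the announcement does not spell out: that each 512-bit input holds 16 FP32 lanes, and that the matrix unit accumulates an outer product of its two input vectors. The function names are illustrative, not ACE instructions.

```python
LANES = 512 // 32  # FP32 elements per 512-bit vector (assumed element width)

def elementwise_fma(acc, a, b):
    """Vector-style FMA: one multiply-accumulate per lane."""
    return [c + x * y for c, x, y in zip(acc, a, b)]

def outer_product_accumulate(acc, a, b):
    """Assumed matrix-style step: every lane of a is multiplied by every lane of b."""
    return [[acc[i][j] + a[i] * b[j] for j in range(len(b))]
            for i in range(len(a))]

a = [float(i) for i in range(LANES)]
b = [1.0] * LANES

vec = elementwise_fma([0.0] * LANES, a, b)
tile = outer_product_accumulate([[0.0] * LANES for _ in range(LANES)], a, b)

vector_macs = len(vec)                       # one result per lane
matrix_macs = sum(len(row) for row in tile)  # one result per lane pair
ratio = matrix_macs // vector_macs
print(vector_macs, matrix_macs, ratio)  # prints: 16 256 16
```

Under these assumptions, the same two input vectors feed 256 multiply-accumulates instead of 16. Narrower element types would change the lane count, so the ratio here is illustrative rather than a statement about any particular implementation.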
The extensions are intended to reduce instruction overhead, improve power efficiency, and potentially make better use of RAM bandwidth. ACE is also implementation-agnostic, giving machine learning frameworks and libraries such as PyTorch and TensorFlow a consistent code path instead of requiring multiple variations for different AVX support levels.
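To illustrate what a single code path could mean for a library, here is a minimal sketch of kernel dispatch. The `avx2`, `avx512f`, and `avx512_vnni` strings follow the CPU feature-flag names Linux reports, but the `ace` flag and every kernel name are hypothetical placeholders, not identifiers from the specification or from PyTorch or TensorFlow.

```python
# Hypothetical kernel names, ordered from most to least capable AVX level.
AVX_KERNELS = (
    ("avx512_vnni", "gemm_avx512_vnni"),
    ("avx512f", "gemm_avx512"),
    ("avx2", "gemm_avx2"),
)

def pick_kernel_today(features):
    """Per-AVX-level dispatch: one kernel variant for each support level."""
    for flag, kernel in AVX_KERNELS:
        if flag in features:
            return kernel
    return "gemm_scalar"

def pick_kernel_with_ace(features):
    """With an implementation-agnostic extension, one check covers all ACE CPUs."""
    if "ace" in features:  # hypothetical feature flag
        return "gemm_ace"  # hypothetical single kernel
    return pick_kernel_today(features)  # fallback for older hardware

print(pick_kernel_with_ace({"avx2", "avx512f", "ace"}))
print(pick_kernel_with_ace({"avx2"}))
```

The point of the sketch is the shape of the dispatch logic: the per-level table still exists for older CPUs, but it no longer has to grow for each new ACE-capable design.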
ACE natively supports common ML data types including INT8, INT32, FP8, FP16, FP32, and BF16, along with Open Compute Project’s MX block-scaled formats. That could let developers move some NPU-specific workloads back to the CPU when they need a faster, more consistent target across x86 hardware.
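For readers unfamiliar with block-scaled formats, the idea is that a small block of values shares one power-of-two scale while each element is stored in a narrow type. The sketch below is a simplified illustration of that idea using 32-element blocks and 8-bit integer elements. It is not the exact OCP MX encoding and says nothing about how ACE handles these formats; the helper names are made up for this example.

```python
import math

BLOCK = 32  # elements sharing one scale (the block size used by the MX formats)

def quantize_block(values, elem_max=127):
    """Return (shared_exponent, integer elements) for one block of floats."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale for which peak / scale fits within elem_max.
    exp = math.ceil(math.log2(peak / elem_max))
    scale = 2.0 ** exp
    # Clamp guards against rounding pushing an element just past the limit.
    elems = [max(-elem_max, min(elem_max, round(v / scale))) for v in values]
    return exp, elems

def dequantize_block(exp, elems):
    """Rebuild approximate floats from the shared exponent and the elements."""
    scale = 2.0 ** exp
    return [e * scale for e in elems]

values = [0.5 * i for i in range(BLOCK)]
exp, elems = quantize_block(values)
restored = dequantize_block(exp, elems)
print(exp, max(elems))
```

Storing one exponent per block instead of one per element is what keeps the memory footprint close to that of plain 8-bit data while still covering a wide dynamic range across blocks.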