Maximiliano Levi

Writing a CPU-only text-to-speech engine

How I ported VITS, an end-to-end neural TTS model, to C++ with ggml so it can run on mobile devices, and what it took to make it fast enough to be usable.

For my undergraduate project I decided to port VITS, an end-to-end text-to-speech model, to C++ so it could run on mobile devices without a GPU. The result is vits.cpp, built on top of ggml, the same tensor library used by llama.cpp.

The goal was simple to state: take a modern neural TTS model and make it run fast enough to be useful on a phone.

Why VITS?

VITS stands for Variational Inference with adversarial learning for end-to-end Text-to-Speech. Compared to older two-stage pipelines that required a separate acoustic model and vocoder, VITS is trained end-to-end and produces noticeably more natural-sounding output.

It was also a practical choice. The pretrained models are openly available, the architecture is well documented, and the original PyTorch implementation is clean enough to be a useful reference when building something from scratch.

Model architecture

VITS is composed of three main submodels that work together in sequence.

Prior encoder. This part of the model takes phonemes as input and encodes them into latent vectors. Working in phoneme space rather than raw text makes the representation more regular and language-agnostic.

Stochastic duration predictor. Given the encoded phonemes, this model estimates how long each phoneme should last in the output audio. This is what gives the synthesized speech its natural rhythm and prosody rather than sounding mechanically even.

Decoder (HiFi-GAN). The decoder takes the outputs of the prior encoder and the duration predictor and produces raw audio. Under the hood this is the HiFi-GAN V1 generator, a model designed specifically for high-fidelity waveform generation.

The stochastic part matters. VITS samples from the learned distribution rather than always picking the most likely output, which is part of why the resulting speech sounds varied and natural rather than robotic.

Building on ggml

Rather than writing a tensor library from scratch, I built on top of ggml. ggml handles memory management, quantization, and low-level math operations, which meant I could focus on mapping the VITS architecture to operations the library already understood.

The approach was similar to how llama.cpp works: define the model as a graph of ggml operations, load pretrained weights into the graph, and then run inference by executing that graph. The model weights were exported from the original PyTorch implementation using a small Python script.

This was mostly straightforward. The trickier parts were a few operations that ggml did not have built-in, which had to be implemented manually, and making sure the numerics matched the reference implementation closely enough to produce recognizable speech.

Convolutions: the bottleneck

Once a basic working version was running, profiling made the problem obvious: the decoder was consuming almost all of the inference time, and inside the decoder, convolutions were the culprit.

HiFi-GAN works by repeatedly upsampling the data. At each step it applies a Conv1dTranspose to increase resolution, followed by a stack of Conv1d operations for feature mixing. By the time the model reaches the final audio output, it is handling tensors with hundreds of millions of floating point values — even for short phrases.

This pattern is expensive on the CPU for reasons that go beyond raw compute.

CPUs suck at convolutions

The access patterns for Conv1d and Conv1dTranspose are awkward on CPU. A naive implementation ends up making many small, scattered reads from memory as it slides the kernel across the input. At the sizes VITS produces during decoding, this creates a lot of cache pressure and the CPU spends a disproportionate amount of time waiting for memory rather than doing arithmetic.

GPUs handle this well because they have enormous memory bandwidth and their hardware is designed around exactly this kind of operation. A CPU inference engine does not have that luxury.

Solution: im2col

The standard solution to this is im2col, which rearranges the input tensor into a form where convolution becomes a single matrix multiplication. Instead of sliding a kernel over scattered memory locations, the input patches are first collected into a dense matrix, and then the convolution is computed as one large matrix multiply.

The advantage is that matrix multiplication is something CPUs are much better at. Good implementations use SIMD instructions and are designed to keep data in cache across a large computation, which is exactly the access pattern that benefits from the hardware.

A Conv1dTranspose can be expressed as a Conv1d with modified padding and stride, so the same technique applies to both operations. Once both are reformulated as matrix multiplications, they benefit from the same optimized kernel.

Dot product benchmarks

As part of optimizing the matrix multiplications that underpin everything else, I ran benchmarks comparing different dot product implementations:

Implementation   Mean time (ms)   Std dev (ms)   Mean error   Std dev error
Naive float32    13.13            0.23           1029.44      2.10
SIMD float32     29.35            0.25           11665.7      7.30
GGML float32     13.05            0.31           1029.44      2.10
GGML float16     6.56             0.13           248975       22.47
Naive double     26.45            0.66           0            0

A few things stand out. The SIMD float32 implementation was actually slower than the naive float32 version. SIMD should in theory be faster, but for this particular access pattern the vectorization overhead and the lack of a cache-friendly layout meant it did not help. This matches the general observation that memory bandwidth, not raw compute, is the real bottleneck.

GGML float32 matched the naive float32 speed while integrating much better with the rest of the graph. GGML float16 was about twice as fast at the cost of a large increase in numerical error. For TTS there is real tolerance for numerical error: the output is audio, and some loss of precision does not necessarily produce audible degradation. Whether float16 is acceptable depends on the specific model and use case.

The naive double implementation was roughly twice as slow as naive float32 (its zero error suggests it served as the reference), which confirmed that staying in float32 or float16 was the right call.

Final thoughts

Getting VITS running on CPU at a usable speed was a good exercise in understanding where time goes in neural inference. The model architecture itself is not the bottleneck. The bottleneck is always somewhere in the data movement, and the solution is usually to find a way to reformulate the computation so it fits the hardware better.

The im2col approach for convolutions is a well-known technique for a reason: it converts an awkward memory access pattern into something the CPU can handle efficiently. The benchmark results for dot products reinforce the same idea from a different angle — raw compute is rarely the limiting factor, and the naive implementation is often competitive precisely because it avoids the overhead of abstractions that were not designed for the specific problem.

The full code and pretrained model weights are available at github.com/maxilevi/vits.cpp.