SpeechCombine brings instruction following to speech models

28 June 2026, 01:15 · 1 min read

SpeechCombine is an instruction-following speech language model designed to avoid the complexity and scale demands of conventional speech instruction tuning. The work targets a core challenge in speech language models: adapting a text-based LLM to a new modality while supporting speech-specific instructions, without relying on the large synthetic datasets often used in text LLM training pipelines.

The method starts with a text LLM base model and uses continuous pre-training on speech utterances to produce a speech-adapted model. It then combines that model’s weights with the weight difference between the instruction-tuned and base versions of the original text LLM, transferring instruction-following capabilities directly into the speech domain.
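The weight combination described above can be illustrated with a small sketch. This is not IBM's implementation: it assumes each model is represented as a plain mapping from parameter name to a flat list of values, and the function name `combine_weights`, the `scale` factor, and the handling of speech-only parameters are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened parameter values.
Weights = Dict[str, List[float]]


def combine_weights(speech: Weights, instruct: Weights, base: Weights,
                    scale: float = 1.0) -> Weights:
    """Return speech + scale * (instruct - base), parameter by parameter.

    `scale` is an assumption of this sketch; the article does not mention one.
    """
    combined: Weights = {}
    for name, speech_values in speech.items():
        if name not in instruct or name not in base:
            # Assumption: parameters that exist only in the speech-adapted
            # model have no text-side difference, so they are copied unchanged.
            combined[name] = list(speech_values)
            continue
        inst_values, base_values = instruct[name], base[name]
        if not (len(speech_values) == len(inst_values) == len(base_values)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Add the instruction-tuning difference to the speech-adapted weights.
        combined[name] = [
            s + scale * (i - b)
            for s, i, b in zip(speech_values, inst_values, base_values)
        ]
    return combined


# Toy usage with made-up numbers.
base = {"w": [1.0, 2.0]}
instruct = {"w": [1.5, 1.0]}
speech = {"w": [2.0, 2.0], "speech_embed": [0.25]}
merged = combine_weights(speech, instruct, base)
```

In this toy case the instruction-tuning difference for `w` is `[0.5, -1.0]`, so the merged `w` is `[2.5, 1.0]`, while the speech-only parameter is left untouched. A real model would apply the same arithmetic to full tensors rather than short lists.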

IBM Research reports that SpeechCombine can be trained with only a single round of speech pre-training on as little as 30k hours of speech data. Results indicate that the approach preserves the knowledge and capabilities of the original text LLM while extending them to speech, pointing to a training path that reduces dependence on massive speech datasets.

Originally reported by research.ibm.com
