Hume launches text-to-speech model Octave that generates emotive, adjustable AI voices on-demand based on your prompts


New York City startup Hume AI emerged from stealth two years ago and has since raised millions of dollars in funding on the strength of its technology, which creates emotive AI voices for use in enterprise applications.

Today, it is taking its offerings a step further with a new large-language and speech model called the “Omni-capable text and voice engine,” or Octave for short, designed to produce lifelike, emotionally nuanced speech for use across different forms of content, from audiobooks to prerecorded video game character dialog and film/TV/video.

Hume claims Octave is the first text-to-speech system powered by a large language model (LLM) trained not only on text but also on speech and emotion tokens, enabling it to understand words in context and adjust tone, rhythm, and cadence accordingly. Users can also adjust the delivery at the sentence level with text prompts.

“We’re launching the first LLM for text-to-speech—a model that understands words in context, predicting the right emotions, rhythm, cadence, and emphasis, making speech sound more human than ever before,” said Alan Cowen, Hume AI’s co-founder and CEO, in a video call interview with VentureBeat.

Octave’s capabilities go beyond basic voice generation. It can interpret character traits and style from a script alone, adjusting vocal inflections to match implied emotions. A sarcastic remark will be spoken sarcastically, a panicked sentence will sound urgent, and a whispered secret will be hushed—all without needing explicit direction.

In addition, if the user doesn’t like the generated voice or wants to adjust it, they can do so granularly in natural language by typing a text instruction to Octave, such as “happier,” “sadder,” “more frustrated,” “angrier,” “more sarcastic,” or “more sincere.”

“You can describe a character—like a sarcastic medieval peasant—and the model will instantly create that voice, adjusting emotions like anger, sadness, or happiness based on your instructions,” Cowen added.

While the current release focuses on English-language speech, Octave also supports Spanish and is expected to expand its language capabilities in the near future.

Tailored for content creation

Octave is tailored for content creators and media production, offering applications in audiobooks, podcasts, video game characters, and video voiceovers.

“This new model is designed for offline text-to-speech—perfect for audiobooks, podcasts, video voiceovers, and video game characters—where creators need realistic, character-specific voices,” Cowen explained.

However, users must access it through Hume’s website, either on its Projects page or through an application programming interface (API). “Offline” here means that the model is designed to produce discrete audio files that can be added to projects such as videos or audiobooks. It is not designed to carry on real-time conversation, though that could theoretically be achieved by piping text queries to the website.

Hume’s API allows developers to send up to 50 requests per minute to the new Octave model, with a maximum text length of 5,000 characters and descriptions capped at 1,000 characters. Each request can generate up to five outputs, and the supported audio formats include MP3, WAV, and PCM.
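As a rough illustration of those limits, a client could check a request locally before sending it. The sketch below is not Hume's SDK: the `TtsRequest` class, its field names, and the helper functions are hypothetical, and only the numeric limits and format names come from the figures reported above.

```python
from dataclasses import dataclass

# Limits as reported in this article. All identifiers here are hypothetical
# and are not taken from Hume's API or SDK.
MAX_TEXT_CHARS = 5_000
MAX_DESCRIPTION_CHARS = 1_000
MAX_OUTPUTS_PER_REQUEST = 5
MAX_REQUESTS_PER_MINUTE = 50
SUPPORTED_FORMATS = {"mp3", "wav", "pcm"}


@dataclass
class TtsRequest:
    text: str                  # the script to be spoken
    description: str = ""      # voice description, e.g. a character sketch
    num_outputs: int = 1       # how many alternative takes to generate
    audio_format: str = "mp3"


def validate_request(req: TtsRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if len(req.text) > MAX_TEXT_CHARS:
        problems.append(f"text has {len(req.text)} chars (max {MAX_TEXT_CHARS})")
    if len(req.description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description has {len(req.description)} chars (max {MAX_DESCRIPTION_CHARS})"
        )
    if not 1 <= req.num_outputs <= MAX_OUTPUTS_PER_REQUEST:
        problems.append(f"num_outputs must be between 1 and {MAX_OUTPUTS_PER_REQUEST}")
    if req.audio_format.lower() not in SUPPORTED_FORMATS:
        problems.append(f"unsupported audio format: {req.audio_format}")
    return problems


def min_seconds_between_requests() -> float:
    """Spacing that keeps evenly paced calls within the per-minute request cap."""
    return 60 / MAX_REQUESTS_PER_MINUTE
```

Under these assumptions, a client that spaces its calls about 1.2 seconds apart stays within the 50-requests-per-minute cap.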

Hume’s prior EVI series of models, which allows for streaming, real-time, back-and-forth interactions, remains available and will continue to be developed.

Hume AI offers subscription-based pricing with tiers ranging from a free option to a custom-priced Enterprise plan.

Here’s a concise breakdown of the offerings:

  • Free ($0/month) – 10,000 characters of text-to-speech per month (~10 minutes) with unlimited custom voices.
  • Starter ($3/month) – 30,000 characters (~30 minutes) plus support for up to 20 projects.
  • Creator ($10/month) – 100,000 characters (~100 minutes), usage-based pricing for extra characters ($0.20/1,000), and support for up to 1,000 projects.
  • Pro ($50/month) – 500,000 characters (~500 minutes), lower usage-based pricing ($0.15/1,000), and support for up to 3,000 projects.
  • Scale ($150/month) – 2,000,000 characters (~2,000 minutes), further reduced usage-based pricing ($0.13/1,000), and support for up to 10,000 projects.
  • Business ($900/month) – 10,000,000 characters (~10,000 minutes), even lower usage-based pricing ($0.10/1,000), and support for up to 20,000 projects.
  • Enterprise (Custom price) – Unlimited usage, custom legal terms, security assurances, significantly discounted bulk pricing, and priority support.
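To make the tier arithmetic concrete, the following sketch estimates a monthly bill from the figures listed above. It rests on assumptions the article does not confirm: that overage is billed pro rata per 1,000 characters, and that the Free and Starter tiers cannot exceed their quota because no overage rate is listed for them. The function and table names are illustrative, and the custom-priced Enterprise tier is left out.

```python
# (base price in USD, included characters, overage USD per 1,000 chars or None)
# Figures are those listed in the article; None means no overage rate is stated.
PLANS = {
    "Free": (0, 10_000, None),
    "Starter": (3, 30_000, None),
    "Creator": (10, 100_000, 0.20),
    "Pro": (50, 500_000, 0.15),
    "Scale": (150, 2_000_000, 0.13),
    "Business": (900, 10_000_000, 0.10),
}


def monthly_cost(plan: str, chars_used: int) -> float:
    """Estimate one month's cost, assuming pro-rata overage billing."""
    base, included, overage_rate = PLANS[plan]
    extra = max(0, chars_used - included)
    if extra == 0:
        return float(base)
    if overage_rate is None:
        # Assumption: tiers without a listed overage rate cannot exceed quota.
        raise ValueError(f"{plan} has no listed overage pricing")
    return round(base + extra / 1_000 * overage_rate, 2)


def cheapest_plan(chars_used: int) -> tuple[str, float]:
    """Return the (plan, cost) pair with the lowest estimated cost for this usage."""
    options = []
    for name in PLANS:
        try:
            options.append((name, monthly_cost(name, chars_used)))
        except ValueError:
            continue  # this tier cannot cover the requested usage
    if not options:
        raise ValueError("no listed plan covers this usage")
    return min(options, key=lambda pair: pair[1])
```

For example, under these assumptions 150,000 characters on Creator would cost $10 plus 50 × $0.20, or $20, while at 400,000 characters the Pro tier ($50) becomes cheaper than Creator with overage ($70).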

Hume emphasized that its Octave TTS pricing is around half that of competing AI voice startup ElevenLabs, a sign of intensifying competition in the text-to-speech space.

In addition, Hume AI conducted a blind comparison study with 180 human raters to benchmark Octave against ElevenLabs. The results showed that Octave was preferred in terms of audio quality (71.6% of trials), naturalness (51.7% of trials), and how well the speech matched descriptions of the desired voice (57.7% of trials), across 120 diverse prompts.

To further evaluate its performance, Hume AI has also launched the Expressive TTS Arena, a public benchmark designed to test how well AI models handle longer, expressive speech—an area that previous TTS benchmarks have largely overlooked.

Tens of trillions of language tokens

Unlike traditional text-to-speech systems that rely on limited speech datasets, Octave TTS is built on an LLM trained on tens of trillions of language tokens.

“Traditional text-to-speech models are trained on limited speech data, but ours is built on an LLM trained on tens of trillions of tokens, enabling it to reason, think, and infer emotions from text,” Cowen said.

The model was trained using millions of hours of public, long-form speech data and Hume AI’s proprietary datasets of new voices recorded by survey participants.

“We collected data from people recording themselves through webcams, reacting naturally to videos, telling stories, and talking to others, including friends and family, to capture a wide range of emotional expressions,” Cowen said.

This extensive training allows the model to infer emotional context and follow detailed instructions, creating voices that match specific character descriptions and attributes.

The model, available today through Hume AI’s platform and API, offers sentence-level emotional control, with some flexibility within sentences.

“Voice modulation works at the sentence level, but you can also adjust parts of a sentence, instructing the model to convey nuanced emotions like slight frustration mixed with humor or exasperation,” Cowen noted. The model also considers context beyond individual sentences. “Unlike traditional models that process text word by word, our model considers entire paragraphs, capturing context to deliver more natural and emotionally accurate speech,” he explained.

Consistent character voices and limitations

Octave TTS maintains consistent character voices across long-form content.

“With our platform, you can generate unique voices for each character in an audiobook—like a middle-aged orc—and maintain that character’s voice throughout the story,” Cowen said.

This capability is supported by Hume AI’s “Projects” page, which handles long-form content like audiobooks by automatically chunking text while preserving character consistency and context across chapters.
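Hume has not published how the Projects page chunks text, so the following is only a generic sketch of the idea: split a long manuscript at sentence boundaries and pack the sentences into pieces that fit the 5,000-character request limit mentioned earlier. The function name and the regular expression are assumptions for illustration, and the sketch does not attempt the cross-chunk voice and context preservation the article describes.

```python
import re


def chunk_text(text: str, max_chars: int = 5_000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split as a fallback.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries rather than at a fixed character offset avoids cutting a sentence in half, which matters for a model that reads context to choose emphasis and tone.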

Hume has technical guardrails built into its website and API prohibiting the creation of realistic children’s voices and imitations of specific individuals, but other than that, it is open to use across a wide range of content and subject matter, including potentially not-safe-for-work scenes such as those in popular romance novels.

“We give developers freedom, allowing content across a broad range of human experiences, though we restrict the creation of realistic children’s voices and imitations of specific individuals,” Cowen explained.

In addition, Cowen said that the company could adjust these guardrails for specific clients upon request, such as a children’s book publisher looking to create voices for children’s audiobooks.

Hume AI is also working on a forthcoming Voice Cloning feature, which will allow users to replicate a voice from as little as five seconds of audio. The company is developing safeguards to ensure ethical use before rolling out the feature publicly.

With its combination of contextual awareness, emotional expression, and character customization, Octave TTS aims to provide content creators with more control and flexibility, delivering voices that sound both realistic and emotionally engaging.


