NVIDIA unveils ‘Fugatto’, an AI model that generates audio using only text

NVIDIA announced that it has developed a generative AI model called ‘Fugatto’ (Foundational Generative Audio Transformer Opus 1), which can control audio output using only text.


Fugatto, developed by NVIDIA’s generative AI research team, is more sophisticated than existing AI models that compose songs or modify voices, the company explained. It is a foundational generative transformer model that builds on the team’s previous work in areas such as speech modeling, audio vocoding, and audio understanding.

Fugatto can use a combination of text and audio files to create or transform any mix of music, voice, and sound described in a prompt. For example, it can generate musical snippets from text prompts, remove or add instruments in an existing song, and change the accent or emotion of a voice. It can even produce sounds that have never been heard before.

“We wanted to create a model that understands and produces sound like a human,” said Rafael Valle, manager of applied audio research at NVIDIA and co-creator of Fugatto.

NVIDIA said Fugatto, which supports a wide variety of audio creation and transformation tasks, exhibits emergent properties arising from the interaction of its individually trained capabilities, and can combine free-form instructions.

Valle added, “Fugatto is the first step toward a future where unsupervised multi-task learning in audio synthesis and transformation is possible at any data and model scale.”

Various examples of using Fugatto

Music producers can use Fugatto to quickly prototype or edit song ideas, trying out different styles, voices, and instruments along the way. They can also add effects and improve the overall audio quality of existing tracks. Advertising agencies can apply Fugatto to quickly adapt existing campaigns for different regions or contexts, applying different accents and emotions to voiceovers.

Valle pointed to the “avocado chair” popularized by image-generation models as an example of how one model can use language in many different ways; likewise, Fugatto can create anything the user describes, such as making a trumpet bark like a dog or a saxophone meow like a cat. Unlike most other models, which can only reproduce the data they were trained on, Fugatto can be used to create soundscapes that have never been heard before, such as a thunderstorm fading into dawn accompanied by birdsong.

Precise sound control

Fugatto uses a technology called ComposableART to combine instructions that were only learned individually. For example, by combining the two instructions ‘sad emotion’ and ‘French accent’, a user can request speech delivered with a sad French accent. The model’s ability to interpolate between instructions gives users fine-grained control over attributes such as the strength of an accent or the degree of sadness.
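NVIDIA has not published Fugatto’s interface, but the blending idea the article describes can be illustrated abstractly: each text instruction maps to a conditioning vector, and a weighted combination of those vectors controls how strongly each attribute is expressed. The sketch below is purely hypothetical — `embed_instruction`, the vector size, and the seeded-RNG stand-in for a real text encoder are all illustrative assumptions, not part of any released API.

```python
import zlib

import numpy as np

# Hypothetical sketch only: in a real system a trained text encoder
# would map each instruction to a conditioning vector; here a
# deterministic seeded RNG stands in for that encoder.
def embed_instruction(instruction: str) -> np.ndarray:
    seed = zlib.crc32(instruction.encode("utf-8"))
    rng = np.random.default_rng(seed)
    return rng.standard_normal(8)

def blend_conditions(weights: dict[str, float]) -> np.ndarray:
    """Weighted sum of instruction embeddings: the weight on each
    instruction dials that attribute up or down independently."""
    return np.sum(
        [w * embed_instruction(text) for text, w in weights.items()],
        axis=0,
    )

# Mostly sad, mildly French-accented speech as one condition vector.
cond = blend_conditions({"sad emotion": 0.9, "French accent": 0.4})
```

Varying a weight continuously between 0 and 1 is what “interpolating between instructions” means in this picture: the attribute fades in or out rather than switching on or off.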

“We wanted to allow users to combine attributes in subjective or artistic ways, and to choose how much to emphasize each attribute,” explains Rohan Badlani, an AI researcher at NVIDIA who designed this aspect of Fugatto.

Additionally, Fugatto provides a ‘temporal interpolation’ function that generates sounds that change over time. For example, thunder can swell and then fade away, creating the sound of a storm moving through an area. Users have precise control over how the soundscape evolves.
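Again as a hypothetical sketch rather than Fugatto’s actual mechanism, temporal interpolation can be thought of as varying a control value smoothly across the clip instead of holding it fixed. The envelope function and its parameters below are illustrative assumptions: a loudness-like control rises and falls to mimic a storm passing.

```python
import numpy as np

def temporal_envelope(n_frames: int, peak_at: float = 0.5) -> np.ndarray:
    """Smooth rise-then-fall control curve in (0, 1]: the controlled
    attribute (e.g. thunder loudness) peaks at fraction `peak_at`
    of the clip. Gaussian shape and width are illustrative choices."""
    t = np.linspace(0.0, 1.0, n_frames)
    width = 0.25  # hypothetical spread of the peak
    return np.exp(-((t - peak_at) ** 2) / (2 * width**2))

env = temporal_envelope(100)  # one control value per frame
# A per-frame conditioning strength could then scale an instruction
# embedding, e.g.: cond_t = env[:, None] * thunder_vector
```

The same curve could modulate any attribute over time, which is the sense in which users get “precise control over how the soundscape evolves.”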

The full version of Fugatto uses 2.5 billion parameters and was trained on an NVIDIA DGX system equipped with 32 NVIDIA H100 Tensor Core GPUs. The company explained that the participation of people from around the world, including India, Brazil, China, Jordan, and Korea, helped strengthen the model’s multi-accent and multilingual capabilities.
editor@itworld.co.kr

Source: www.itworld.co.kr