Pipeline: Audio Embedding using VGGish
Authors: Jael Gu
Overview
We recommend upgrading Towhee to >=1.1.1 and using https://towhee.io/towhee/audio-embedding instead.
This pipeline extracts features from a given audio file using a VGGish model implemented in PyTorch. It is a supervised model pre-trained on AudioSet, which contains over 2 million sound clips.
Interface
Input Arguments:
- audio_path:
  - the input audio in .wav
  - supported types: str (path to the audio)
  - the audio should be at least 1 second long
Pipeline Output:
The pipeline returns a list of named tuples NamedTuple('AudioOutput', [('vec', 'ndarray')]) containing the following fields:
- vec:
  - embedding(s) of the input audio; each item in the output list represents the embedding(s) for one audio clip, whose length and timestamps depend on the time-window settings in the YAML file (you can modify time_range_sec & time_step_sec to change how the audio is split)
  - data type: numpy.ndarray
  - shape: (num_clips, 128)
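As a sketch of how this output might be post-processed (assuming the (num_clips, 128) shape above; the random array below merely stands in for a real pipeline result), per-clip embeddings are often mean-pooled and L2-normalized into a single track-level vector for similarity search:

```python
import numpy as np

# Stand-in for the pipeline output: embeddings for 5 audio clips.
vec = np.random.rand(5, 128).astype(np.float32)

# Mean-pool the per-clip embeddings into one track-level vector,
# then L2-normalize it so dot products equal cosine similarity.
track_vec = vec.mean(axis=0)
track_vec /= np.linalg.norm(track_vec)

print(track_vec.shape)  # (128,)
```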
How to use
- Install Towhee
$ pip3 install towhee
You can refer to Getting Started with Towhee for more details. If you have any questions, you can submit an issue to the towhee repository.
- Run it with Towhee
>>> from towhee import pipeline
>>> embedding_pipeline = pipeline('towhee/audio-embedding-vggish')
>>> outs = embedding_pipeline('/path/to/your/audio')
>>> embeds = outs[0][0]
How it works
This pipeline includes two main operator types: audio-decode & audio-embedding.
By default, the pipeline uses towhee/audio-decoder to load the audio at the given path as a list of audio frames in ndarray.
Then the time-window operator combines the audio frames into a list of ndarrays, each of which represents an audio clip of fixed length.
Finally, the towhee/torch-vggish operator generates an audio embedding for each clip.
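The time-window step can be illustrated with a minimal sliding-window sketch in plain NumPy (assuming a 16 kHz sample rate and illustrative time_range_sec/time_step_sec values; this is not the actual Towhee operator):

```python
import numpy as np

def split_clips(samples, sr, time_range_sec, time_step_sec):
    """Slide a fixed-length window over raw samples and collect clips."""
    win = int(time_range_sec * sr)   # window length in samples
    hop = int(time_step_sec * sr)    # hop between window starts
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, hop)]

sr = 16000
audio = np.zeros(sr * 3, dtype=np.float32)  # 3 seconds of silence
clips = split_clips(audio, sr, time_range_sec=1.0, time_step_sec=1.0)
print(len(clips), clips[0].shape)  # 3 clips, each 1 second long
```

Each clip would then be fed to the embedding operator, yielding the (num_clips, 128) output described above.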
More Resources
- [Exploring Multimodal Embeddings with FiftyOne and Milvus - Zilliz blog](https://zilliz.com/blog/exploring-multimodal-embeddings-with-fiftyone-and-milvus): This post explored how multimodal embeddings work with Voxel51 and Milvus.
- About Towhee team | Zilliz: Towhee is an open-source machine learning pipeline that helps you encode your unstructured data into embeddings.
- How to Get the Right Vector Embeddings - Zilliz blog: A comprehensive introduction to vector embeddings and how to generate them with popular open-source models.
- The guide to clip-vit-base-patch32 | OpenAI: clip-vit-base-patch32: a CLIP multimodal model variant by OpenAI for image and text embedding.
- Audio Retrieval Based on Milvus - Zilliz blog: Create an audio retrieval system using Milvus, an open-source vector database. Classify and analyze sound data in real time.
- Vector Database Use Case: Audio Similarity Search - Zilliz: Building agile and reliable audio similarity search with Zilliz vector database (fully managed Milvus).
- Making Machine Learning More Accessible for Developers - Zilliz blog: Learn how Towhee, an open-source embedding pipeline, supercharges the app development that requires embeddings and other ML tasks.
- An Introduction to Vector Embeddings: What They Are and How to Use Them - Zilliz blog: In this blog post, we will understand the concept of vector embeddings and explore its applications, best practices, and tools for working with embeddings.