The audio embedding operator converts input audio into dense vectors that represent the semantics of the audio.
Each vector represents an audio clip with a fixed length of around 0.9s.
This operator is built on top of [VGGish](https://github.com/tensorflow/models/tree/master/research/audioset/vggish) with PyTorch.
The model is a [VGG](https://arxiv.org/abs/1409.1556) variant pre-trained on the large-scale audio dataset [AudioSet](https://research.google.com/audioset).
As suggested, it is suitable for extracting high-level features or for warming up a larger model.
@ -81,6 +82,7 @@ An audio embedding operator generates vectors in numpy.ndarray given an audio fi
The audio path or link as a string,
or audio input data as towhee audio frames.
The input data should represent an audio clip longer than 0.9s.
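
As a rough illustration of the length requirement, the sketch below estimates how many embedding vectors a clip of a given duration would yield. It is not part of the operator: the function names are hypothetical, the window length of 0.96s is an assumption based on the VGGish default (the text above only says "around 0.9s"), and non-overlapping windows and the 128-dimensional output size are likewise assumed.

```python
import math

# Assumption: VGGish's default example window. The README only says "around 0.9s".
WINDOW_SECONDS = 0.96
# Assumption: VGGish produces 128-dimensional embeddings.
EMBEDDING_DIM = 128


def count_embeddings(duration_seconds: float,
                     window_seconds: float = WINDOW_SECONDS) -> int:
    """Return how many full, non-overlapping windows fit in the clip."""
    if duration_seconds < 0 or window_seconds <= 0:
        raise ValueError("duration must be >= 0 and window must be > 0")
    # A small epsilon guards against floating-point error at exact multiples.
    return int(math.floor(duration_seconds / window_seconds + 1e-9))


def expected_shape(duration_seconds: float) -> tuple:
    """Return the (num_vectors, dim) shape expected under the assumptions above."""
    n = count_embeddings(duration_seconds)
    if n == 0:
        raise ValueError("audio is shorter than one window; no embedding possible")
    return (n, EMBEDDING_DIM)
```

For instance, under these assumptions a 10-second clip gives `expected_shape(10.0) == (10, 128)`, while a 0.5-second clip raises an error because it is shorter than one window.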