
init the operator.

Signed-off-by: wxywb <xy.wang@zilliz.com>
main · wxywb · 3 years ago · commit 8513384c3c
2 changed files:

  1. README.md (103 lines changed)
  2. data2vec_audio.py (16 lines changed)

README.md

# Audio Embedding with data2vec
*author: David Wang*
<br />
## Description
This operator extracts features from audio with [data2vec](https://arxiv.org/abs/2202.03555). The core idea is to predict latent representations of the full input data based on a masked view of the input, in a self-distillation setup, using a standard Transformer architecture.
<br />
## Code Example
Generate embeddings for the audio "test.wav".
*Write the pipeline in the simplified style:*
```python
import towhee

(
    towhee.glob('test.wav')
          .audio_decode.ffmpeg()
          .runas_op(func=lambda x: [y[0] for y in x])
          .towhee.data2vec_audio()
          .show()
)
```
<img src="https://towhee.io/towhee/data2vec-vision/raw/branch/main/result1.png" alt="result1" style="height:20px;"/>
*Write the same pipeline with explicit input/output name specifications:*
```python
import towhee

(
    towhee.glob['path']('test.wav')
          .audio_decode.ffmpeg['path', 'frames']()
          .runas_op['frames', 'frames'](func=lambda x: [y[0] for y in x])
          .towhee.data2vec_audio['frames', 'vecs'](model_name="facebook/data2vec-audio-base-960h")
          .show()
)
```
<img src="https://towhee.io/towhee/data2vec-vision/raw/branch/main/result2.png" alt="result2" style="height:60px;"/>
<br />
## Factory Constructor
Create the operator via the following factory method:

***data2vec_audio(model_name='facebook/data2vec-audio-base-960h')***
**Parameters:**
***model_name***: *str*
The model name as a string.
The default value is "facebook/data2vec-audio-base-960h".
Supported model names:
- facebook/data2vec-audio-base-960h
- facebook/data2vec-audio-large-960h
- facebook/data2vec-audio-base
- facebook/data2vec-audio-base-100h
- facebook/data2vec-audio-base-10m
- facebook/data2vec-audio-large
- facebook/data2vec-audio-large-100h
- facebook/data2vec-audio-large-10m
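For example, to use one of the larger checkpoints listed above, pass its name to the factory call; everything else in the pipeline from the Code Example section stays the same:

```python
import towhee

(
    towhee.glob('test.wav')
          .audio_decode.ffmpeg()
          .runas_op(func=lambda x: [y[0] for y in x])
          .towhee.data2vec_audio(model_name='facebook/data2vec-audio-large-960h')
          .show()
)
```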
<br />
## Interface
An audio embedding operator that generates vectors in numpy.ndarray given towhee audio frames.
**Parameters:**
***data:*** *List[towhee.types.audio_frame.AudioFrame]*

Input audio data is a list of towhee audio frames. The input data should represent an audio clip longer than 0.9s.
**Returns:** *numpy.ndarray*

The audio embedding extracted by the model.
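For reference, the feature extraction the operator performs can be sketched directly with the underlying Hugging Face models (see the `data2vec_audio.py` diff below). This is a minimal, self-contained sketch, assuming one second of 16 kHz audio; the zero-filled signal stands in for real waveform samples:

```python
import numpy as np
import torch
from transformers import Data2VecAudioModel, Wav2Vec2Processor

model_name = 'facebook/data2vec-audio-base-960h'
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Data2VecAudioModel.from_pretrained(model_name)

# One second of 16 kHz audio (longer than the 0.9s minimum noted above);
# zeros stand in for real samples.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Keep the last time step of the last hidden state, as the operator does.
feat = outputs.last_hidden_state[:, -1, :].flatten().numpy()
print(feat.shape)  # (768,) for the base model
```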

data2vec_audio.py

```diff
@@ -11,13 +11,13 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import numpy
+import numpy as np
 import torch
 import towhee

-from PIL import Image as PILImage
-from transformers import BeitFeatureExtractor, Data2VecVisionForImageClassification
+from transformers import Data2VecAudioModel, Wav2Vec2Processor
 from towhee.operator.base import NNOperator


 class Data2VecAudio(NNOperator):
@@ -28,12 +28,10 @@ class Data2VecAudio(NNOperator):
     def __call__(self, data):
         audio = np.hstack(data).reshape(1, -1)
         audio = audio.astype(np.float32, order='C') / 32768.0
-        inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
+        sampling_rate = data[0]._sample_rate
+        inputs = self.processor(audio.flatten(), sampling_rate=sampling_rate, return_tensors="pt")
         with torch.no_grad():
-            outputs = model(**inputs)
+            outputs = self.model(**inputs)
         last_hidden_states = outputs.last_hidden_state
-        return last_hidden_states
+        feat = last_hidden_states[:, -1, :].flatten().detach().cpu().numpy()
+        return feat
```
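The hunk above references `self.processor` and `self.model`, whose construction falls outside the diff context. A plausible constructor consistent with the new imports and the README's default model name would look like the sketch below; this is an assumption, since the actual `__init__` is not shown in this commit:

```python
from transformers import Data2VecAudioModel, Wav2Vec2Processor
from towhee.operator.base import NNOperator


class Data2VecAudio(NNOperator):
    def __init__(self, model_name='facebook/data2vec-audio-base-960h'):
        # Hypothetical reconstruction: the real __init__ is not part of this diff.
        super().__init__()
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.model = Data2VecAudioModel.from_pretrained(model_name)
        self.model.eval()
```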
