From 0627c26aaf2058d9403f18e14ad750a8e9b5c630 Mon Sep 17 00:00:00 2001
From: Jael Gu
Date: Wed, 2 Mar 2022 21:48:44 +0800
Subject: [PATCH] update

Signed-off-by: Jael Gu
---
 README.md | 43 ++++++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index ec3a213..b574033 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,18 @@
-# Operator: nlp-longformer
+# NLP embedding: Longformer Operator

-Author: Kyle He, Jael Gu
+Authors: Kyle He, Jael Gu

 ## Overview

+This operator uses Longformer to convert long text to embeddings.

+The Longformer model was presented in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan[1].
+**Longformer** models were proposed in “[Longformer: The Long-Document Transformer][2].
+
+Transformer-based models are unable to process long sequences due to their self-attention
+operation, which scales quadratically with the sequence length. To address this limitation,
+we introduce the Longformer with an attention mechanism that scales linearly with sequence
+length, making it easy to process documents of thousands of tokens or longer[2].

 ## Interface

@@ -12,40 +20,45 @@ Author: Kyle He, Jael Gu
 __init__(self, model_name: str, framework: str = 'pytorch')
 ```

-Args:
+**Args:**

 - model_name:
   - the model name for embedding
-  - supported types: str, for example 'xxx' or 'xxx'
+  - supported types: `str`, for example 'allenai/longformer-base-4096' or 'allenai/longformer-large-4096'
 - framework:
   - the framework of the model
-  - supported types: str, default is 'pytorch'
+  - supported types: `str`, default is 'pytorch'

 ```python
-__call__(self, call_arg_1: xxx)
+__call__(self, txt: str)
 ```

-Args:
+**Args:**

-- txt:
-  - input text in words, sentences, or paragraphs
+ txt:
+  - the input text content
   - supported types: str

-Returns:
-The Operator returns a tuple Tuple[('feature_vector', numpy.ndarray)] containing following fields:
+**Returns:**
+
+The Operator returns a tuple `Tuple[('feature_vector', numpy.ndarray)]` containing following fields:

 - feature_vector:
   - the embedding of the text
-  - data type: numpy.ndarray
-  - shape: (x, dim) where x is number of vectors and dim is dimension of vector depending on model_name
+  - data type: `numpy.ndarray`
+  - shape: (dim,)

 ## Requirements

-
+You can get the required python package by [requirements.txt](./requirements.txt).

 ## How it works

-
+The `towhee/nlp-longformer` Operator implements the conversion from text to embedding, which can add to the pipeline.

 ## Reference
+
+[1].https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig
+
+[2].https://arxiv.org/pdf/2004.05150.pdf
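
The Overview text added by this patch contrasts self-attention that scales quadratically with sequence length against Longformer attention that scales linearly. The difference can be sketched roughly by counting query-key pairs. The function names, the `window` parameter, and the simplification to a pure sliding window (ignoring any global-attention tokens) are assumptions made for illustration; none of this is taken from the operator's code.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention: every token attends to every token."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Query-key pairs when each token attends only to tokens within
    window // 2 positions on either side, clipped at the sequence ends.

    `window` is a hypothetical parameter for this sketch.
    """
    half = window // 2
    total = 0
    for i in range(seq_len):
        lo = max(0, i - half)            # leftmost position token i can see
        hi = min(seq_len - 1, i + half)  # rightmost position token i can see
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Doubling the sequence length quadruples the full-attention count,
    # but only roughly doubles the sliding-window count.
    for n in (1024, 2048, 4096):
        print(n, full_attention_pairs(n), sliding_window_pairs(n, 512))
```

For sequences much longer than the window, each additional token adds about `window + 1` pairs in the sliding-window case, which is the linear growth the Overview refers to.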