# NLP embedding: Longformer Operator

Authors: Kyle He, Jael Gu

## Overview

This operator uses Longformer to convert long text to embeddings.

The **Longformer** model was proposed in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan[1][2]. Transformer-based models are unable to process long sequences because their self-attention operation scales quadratically with the sequence length. To address this limitation, the authors introduce Longformer, whose attention mechanism scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer[2].

## Interface

```python
__init__(self, model_name: str, framework: str = 'pytorch')
```

**Args:**

- model_name:
  - the model name for embedding
  - supported types: `str`, for example 'allenai/longformer-base-4096' or 'allenai/longformer-large-4096'
- framework:
  - the framework of the model
  - supported types: `str`, default is 'pytorch'

```python
__call__(self, txt: str)
```

**Args:**

- txt:
  - the input text content
  - supported types: `str`

**Returns:**

The Operator returns a tuple `Tuple[('feature_vector', numpy.ndarray)]` containing the following fields:

- feature_vector:
  - the embedding of the text
  - data type: `numpy.ndarray`
  - shape: (dim,)

## Requirements

The required Python packages are listed in [requirements.txt](./requirements.txt).

## How it works

The `towhee/nlp-longformer` Operator converts text to an embedding and can be added to a pipeline.

## Reference

[1]. https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2]. https://arxiv.org/pdf/2004.05150.pdf
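
## Example: attention cost

The Overview says that full self-attention scales quadratically with sequence length, while Longformer's attention scales linearly. The sketch below is a rough illustration of that claim, not the operator's or the model's implementation. It counts how many (query, key) pairs are scored under each scheme, assuming a simple sliding window in which each token attends only to nearby tokens. The function names and the window size of 512 are illustrative assumptions, and the sketch ignores any other attention pattern the model may use.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Count (query, key) pairs scored by full self-attention."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Count (query, key) pairs when each token attends to itself and to
    neighbours at most window // 2 positions away on either side."""
    half = window // 2
    total = 0
    for i in range(seq_len):
        lo = max(0, i - half)            # clip the window at the start
        hi = min(seq_len - 1, i + half)  # clip the window at the end
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    window = 512  # illustrative assumption, not taken from this README
    for n in (1024, 2048, 4096):
        print(n, full_attention_pairs(n), sliding_window_pairs(n, window))
```

Doubling the sequence length quadruples the full-attention count, whereas the sliding-window count grows by a fixed amount per extra token once the sequence is longer than the window.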