# NLP embedding: Longformer Operator

Authors: Kyle He, Jael Gu

## Overview

This operator uses Longformer to convert long text to embeddings.

The **Longformer** model was presented in "Longformer: The Long-Document Transformer" by Iz Beltagy, Matthew E. Peters, and Arman Cohan[1][2]. From the paper:

> Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.[2]
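To make the scaling contrast concrete, here is a small illustrative sketch (not part of the operator; the window width `w` below is an arbitrary choice) that counts attended token pairs under full self-attention versus a Longformer-style sliding-window attention:

```python
def full_attention_pairs(n: int) -> int:
    # Full self-attention: every token attends to every token, O(n^2).
    return n * n

def sliding_window_pairs(n: int, w: int) -> int:
    # Sliding-window attention: each token attends to at most
    # 2*w + 1 tokens (w on each side plus itself), O(n * w).
    return n * min(n, 2 * w + 1)

# Doubling the sequence length quadruples the full-attention cost
# but only doubles the sliding-window cost.
for n in (4096, 8192):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, w=256))
```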
		
	
		
			
## Interface

```python
__init__(self, model_name: str, framework: str = 'pytorch')
```

**Args:**

- model_name:
  - the model name for embedding
  - supported types: `str`, for example 'allenai/longformer-base-4096' or 'allenai/longformer-large-4096'
- framework:
  - the framework of the model
  - supported types: `str`, default is 'pytorch'
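As a rough sketch of the constructor contract described above (a hypothetical stand-in class, not the operator's actual implementation, which also loads the model weights):

```python
class LongformerEmbeddingInit:
    """Hypothetical sketch of the operator's __init__ contract."""

    def __init__(self, model_name: str, framework: str = 'pytorch'):
        # Only the PyTorch framework is listed as supported here.
        if framework != 'pytorch':
            raise ValueError(f'unsupported framework: {framework!r}')
        self.model_name = model_name
        self.framework = framework

op = LongformerEmbeddingInit('allenai/longformer-base-4096')
print(op.model_name, op.framework)
```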
		
	
		
			
```python
__call__(self, txt: str)
```

**Args:**

- txt:
  - the input text content
  - supported types: `str`
**Returns:**

The Operator returns a tuple `Tuple[('feature_vector', numpy.ndarray)]` containing the following fields:

- feature_vector:
  - the embedding of the text
  - data type: `numpy.ndarray`
  - shape: `(dim,)`, where `dim` is the embedding dimension determined by `model_name`
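The `(dim,)` shape comes from pooling the model's per-token outputs into a single vector. A minimal numpy sketch with toy values (mean pooling is one common choice, shown here for illustration; `dim` is 768 for `allenai/longformer-base-4096`):

```python
import numpy as np

# Toy stand-in for the model's last hidden state: one vector per token.
seq_len, dim = 6, 768
last_hidden_state = np.random.rand(seq_len, dim)

# Pool over the token axis to get a single text embedding.
feature_vector = last_hidden_state.mean(axis=0)
print(feature_vector.shape)  # (768,)
```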
		
	
		
			
		
	
		
			
## Requirements

You can install the required Python packages from [requirements.txt](./requirements.txt).

## How it works

The `towhee/nlp-longformer` Operator implements the conversion from text to embedding and can be added to a pipeline.

## Reference

[1] https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/longformer#transformers.LongformerConfig

[2] https://arxiv.org/pdf/2004.05150.pdf