The dominant approaches to text representation in natural language rely on learning, from massive corpora, embeddings that have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which makes it possible to analyze the points far from the bulk of the distribution using the framework of multivariate extreme value theory. In particular, we obtain a classifier dedicated to the tails of the proposed embedding; it exhibits a scale invariance property, which we exploit in a novel text generation method for label-preserving dataset augmentation. Experiments on synthetic and real text data show the relevance of the proposed framework and confirm that this method generates meaningful sentences with controllable attributes, e.g. positive or negative sentiment.

Heavy-tailed Representations, Text Polarity Classification & Data Augmentation / Jalalzai, Hamid; Colombo, Pierre; Clavel, Chloé; Gaussier, Eric; Varni, Giovanna; Vignon, Emmanuel; Sabourin, Anne. - (2020). (34th Conference on Neural Information Processing Systems, NeurIPS 2020, virtual event, December 6-12, 2020).

Heavy-tailed Representations, Text Polarity Classification & Data Augmentation

Varni, Giovanna;
2020-01-01

Year: 2020
Published in: Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
Country: Canada
Publisher: Neural Information Processing Systems Foundation, Inc.
ISBN: 9781713829546
Authors: Jalalzai, Hamid; Colombo, Pierre; Clavel, Chloé; Gaussier, Eric; Varni, Giovanna; Vignon, Emmanuel; Sabourin, Anne
Files in this item:
NeurIPS-Varni_2020.pdf

Access: open access
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 668.99 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/365629