Natural language inference
Natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise".
|Premise||Label||Hypothesis|
|A man inspects the uniform of a figure in some East Asian country.||contradiction||The man is sleeping.|
|An older and younger man smiling.||neutral||Two men are smiling and laughing at the cats playing on the floor.|
|A soccer game with multiple males playing.||entailment||Some men are playing a sport.|
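To make the task format concrete, the three rows above can be held in a small record type. This is a minimal sketch: the names `Label` and `NLIExample` are chosen here for illustration and do not come from any dataset's official loader.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The three NLI classes described above (illustrative naming)."""
    ENTAILMENT = "entailment"
    CONTRADICTION = "contradiction"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class NLIExample:
    """One premise/hypothesis pair with its gold label (hypothetical type)."""
    premise: str
    hypothesis: str
    label: Label


# The three example rows from the table above.
examples = [
    NLIExample(
        "A man inspects the uniform of a figure in some East Asian country.",
        "The man is sleeping.",
        Label.CONTRADICTION,
    ),
    NLIExample(
        "An older and younger man smiling.",
        "Two men are smiling and laughing at the cats playing on the floor.",
        Label.NEUTRAL,
    ),
    NLIExample(
        "A soccer game with multiple males playing.",
        "Some men are playing a sport.",
        Label.ENTAILMENT,
    ),
]

for ex in examples:
    print(f"{ex.label.value:>13}: {ex.premise!r} -> {ex.hypothesis!r}")
```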
The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.
State-of-the-art results can be seen on the SNLI website.
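Accuracy here is the fraction of pairs whose predicted label matches the gold label. The following is a minimal sketch of that computation. The function name `accuracy` and the toy labels are assumptions for illustration, not part of any official evaluation script.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data (made up for illustration): the second prediction is wrong.
gold = ["contradiction", "neutral", "entailment"]
predicted = ["contradiction", "entailment", "entailment"]

print(f"accuracy = {accuracy(gold, predicted):.3f}")  # 2 of 3 labels match
```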
The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. The data can be downloaded from the MultiNLI website.
|Model||Matched||Mismatched||Paper / Source||Code|
|XLNet-Large (ensemble) (Yang et al., 2019)||90.2||89.8||XLNet: Generalized Autoregressive Pretraining for Language Understanding||Official|
|MT-DNN-ensemble (Liu et al., 2019)||87.9||87.4||Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding||Official|
|Snorkel MeTaL (ensemble) (Ratner et al., 2018)||87.6||87.2||Training Complex Models with Multi-Task Weak Supervision||Official|
|Finetuned Transformer LM (Radford et al., 2018)||82.1||81.4||Improving Language Understanding by Generative Pre-Training|
|Multi-task BiLSTM + Attn (Wang et al., 2018)||72.2||72.1||GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding|
|GenSen (Subramanian et al., 2018)||71.4||71.3||Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning|
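The Matched and Mismatched columns above report accuracy on two separate evaluation sets: one drawn from the same genres as the training data, and one drawn from genres not seen in training. A minimal sketch of scoring each split separately might look like this. The record layout `(split, gold, predicted)`, the function name `accuracy_by_split`, and the toy data are assumptions for illustration. Scores are expressed as percentages on the assumption that this matches the table's scale.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_split(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute percentage accuracy per split from (split, gold, predicted) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for split, gold, predicted in records:
        total[split] += 1
        correct[split] += int(gold == predicted)
    return {split: 100.0 * correct[split] / total[split] for split in total}


# Toy records (made up for illustration, not real MultiNLI data).
toy_records = [
    ("matched", "entailment", "entailment"),
    ("matched", "neutral", "contradiction"),
    ("mismatched", "contradiction", "contradiction"),
    ("mismatched", "neutral", "neutral"),
]

scores = accuracy_by_split(toy_records)
for split, score in scores.items():
    print(f"{split}: {score:.1f}")
```

Keeping the two splits separate, rather than pooling them, is what makes the cross-genre gap visible.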
The SciTail entailment dataset consists of around 27k hypothesis/premise pairs. In contrast to SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist "in the wild". Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.
|Model||Accuracy||Paper / Source|
|Finetuned Transformer LM (Radford et al., 2018)||88.3||Improving Language Understanding by Generative Pre-Training|
|Hierarchical BiLSTM Max Pooling (Talman et al., 2018)||86.0||Natural Language Inference with Hierarchical BiLSTM Max Pooling|
|CAFE (Tay et al., 2018)||83.3||A Compare-Propagate Architecture with Alignment Factorization for Natural Language Inference|