Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
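The evaluation the abstract describes can be illustrated with a small sketch. The examples, the keyword classifier, and the `contrast_sets` structure below are all hypothetical stand-ins (the paper's actual datasets and models differ); the sketch only shows how accuracy on original instances, accuracy on perturbed instances, and a stricter all-or-nothing contrast consistency would be computed.

```python
# Minimal sketch of contrast-set evaluation. All examples and the toy
# keyword classifier are hypothetical illustrations, not the paper's data.

def toy_sentiment_model(text: str) -> str:
    # Naive decision rule: a positive keyword predicts "positive".
    # Such shortcuts can score well in-distribution yet fail on contrasts.
    return "positive" if "great" in text.lower() else "negative"

# Each contrast set pairs an original test instance with small,
# meaningful perturbations that flip the gold label.
contrast_sets = [
    {
        "original": ("The acting was great.", "positive"),
        "perturbed": [("The acting was great until the final act.", "negative")],
    },
    {
        "original": ("A dull, lifeless film.", "negative"),
        "perturbed": [("A film that is anything but dull or lifeless.", "positive")],
    },
]

def accuracy(examples):
    # Fraction of (text, gold_label) pairs the model classifies correctly.
    return sum(toy_sentiment_model(t) == y for t, y in examples) / len(examples)

originals = [cs["original"] for cs in contrast_sets]
perturbed = [ex for cs in contrast_sets for ex in cs["perturbed"]]

# Contrast consistency: fraction of sets where the model is correct on
# the original AND on every perturbation of it.
consistency = sum(
    all(toy_sentiment_model(t) == y for t, y in [cs["original"], *cs["perturbed"]])
    for cs in contrast_sets
) / len(contrast_sets)

print(accuracy(originals), accuracy(perturbed), consistency)  # → 1.0 0.0 0.0
```

The gap between the first two numbers mirrors the drop the authors report on real contrast sets: the toy model's shortcut covers the original instances but collapses on minimally perturbed ones near the decision boundary.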
|Title of host publication||Findings of the Association for Computational Linguistics: Findings of ACL|
|Subtitle of host publication||EMNLP 2020|
|Publisher||Association for Computational Linguistics (ACL)|
|Number of pages||17|
|State||Published - 2020|
|Event||Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020 - Virtual, Online|
Duration: 16 Nov 2020 → 20 Nov 2020
|Name||Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2020|
|Conference||Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020|
|Period||16/11/20 → 20/11/20|
Bibliographical note
Funding Information:
We thank the anonymous reviewers for their helpful feedback on this paper, as well as many others who gave constructive comments on a publicly-available preprint. Various authors of this paper were supported in part by ERC grant 677352, NSF grant 1562364, NSF grant IIS-1756023, NSF CAREER 1750499, ONR grant N00014-18-1-2826 and DARPA grant N66001-19-2-403.
©2020 Association for Computational Linguistics