TCAB

The Text Classification Attack Benchmark

What is TCAB?


TCAB is a set of benchmark datasets for analyzing, understanding, detecting, and labeling adversarial attacks against text classifiers. TCAB includes over 1.5 million successful attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English.


Why was TCAB created?


A common defense strategy against adversarial attacks is to make classifiers more robust, so most evaluation frameworks focus on model robustness. However, these defenses are often computationally expensive or reduce accuracy. Additionally, benchmarks that evaluate these defenses require carefully controlled and expensive human filtering, which typically yields only a small number of evaluation examples.

TCAB comprises a large collection of fully automated attacks, enabling new tasks such as attack labeling (automatically determining the adversarial attacks, if any, used to generate a given piece of text), attack localization, and attack characterization. As a complement to model robustness, TCAB facilitates research that helps defenders learn more about their attackers and subsequently develop appropriate defenses.


How was TCAB created?


After training a target model on a particular domain dataset, TCAB generates adversarial examples by attacking instances in the test set using TextAttack or OpenAttack, two open-source toolchains that provide fully automated, off-the-shelf attacks.

TCAB contains attacks from methods that cover a wide range of design choices and assumptions, such as model access level (e.g., white/gray/black box), perturbation level (e.g., char/word/token), and linguistic constraints that make attacked text harder to distinguish from the original.
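The generation loop described above can be sketched in a few lines. This is a toy illustration, not the TCAB pipeline: the classifier, the character-level attack, and the record fields are all hypothetical stand-ins, and the real attacks come from TextAttack or OpenAttack. The sketch only shows the overall shape: run each attack on each test instance and keep the cases where the prediction changes.

```python
from typing import Callable, Dict, List, Optional

Predict = Callable[[str], int]
Attack = Callable[[str, Predict], Optional[str]]


def toy_classifier(text: str) -> int:
    """Hypothetical target model: predicts 1 if the word 'good' appears."""
    return 1 if "good" in text.lower().split() else 0


def toy_char_attack(text: str, predict: Predict) -> Optional[str]:
    """Hypothetical character-level attack (not a real TextAttack recipe).

    Replaces 'o' with '0' in one word at a time and returns the first
    perturbed text whose prediction differs from the original, or None.
    """
    original_pred = predict(text)
    words = text.split()
    for i, word in enumerate(words):
        candidate = " ".join(words[:i] + [word.replace("o", "0")] + words[i + 1:])
        if candidate != text and predict(candidate) != original_pred:
            return candidate
    return None


def generate_attack_instances(
    test_set: List[str], predict: Predict, attacks: Dict[str, Attack]
) -> List[dict]:
    """Run every attack on every test instance; keep only successful attacks."""
    records = []
    for text in test_set:
        for name, attack in attacks.items():
            perturbed = attack(text, predict)
            if perturbed is not None:
                records.append({
                    "attack": name,
                    "original": text,
                    "perturbed": perturbed,
                    "original_pred": predict(text),
                    "perturbed_pred": predict(perturbed),
                })
    return records


test_set = ["a good movie", "a dull movie"]
records = generate_attack_instances(
    test_set, toy_classifier, {"toy_char_swap": toy_char_attack}
)
for record in records:
    print(record)
```

Only the first sentence yields a record here, because the toy attack cannot change the prediction for the second one. Storing the attack name alongside each perturbed text is what makes a task like attack labeling possible later.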


Human Evaluation


We use crowdsourcing to label a portion of the adversarial examples in TCAB to get a sense of how often perturbed instances preserve their original labels. In total, 5,581 adversarial examples were annotated by workers from Amazon Mechanical Turk, with each instance labeled by five different workers. We observe that, on average, 51% and 81% of the adversarial instances' labels were preserved for the sentiment analysis and abuse detection datasets, respectively.
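As a rough illustration of how a label-preservation rate could be computed from such annotations, the sketch below aggregates the five worker labels per instance by majority vote and compares the result with the original label. The majority-vote rule, the field names, and the example data are assumptions made for illustration; the text above does not specify how worker labels were aggregated.

```python
from collections import Counter
from typing import List


def majority_label(worker_labels: List[int]) -> int:
    """Return the most common label among the workers (assumed aggregation rule)."""
    return Counter(worker_labels).most_common(1)[0][0]


def preservation_rate(annotations: List[dict]) -> float:
    """Fraction of instances whose majority label equals the original label."""
    preserved = sum(
        1
        for item in annotations
        if majority_label(item["worker_labels"]) == item["original_label"]
    )
    return preserved / len(annotations)


# Made-up annotations: five binary worker labels per adversarial instance.
annotations = [
    {"original_label": 1, "worker_labels": [1, 1, 1, 0, 1]},  # preserved
    {"original_label": 1, "worker_labels": [0, 0, 1, 0, 0]},  # not preserved
    {"original_label": 0, "worker_labels": [0, 0, 0, 1, 1]},  # preserved
    {"original_label": 0, "worker_labels": [1, 1, 0, 1, 0]},  # not preserved
]

rate = preservation_rate(annotations)
print(f"label preservation rate: {rate:.2f}")
```

With an odd number of workers and binary labels, a majority always exists, so no tie-breaking rule is needed in this sketch.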


Getting Started


Download a copy of the dataset hosted on Zenodo (distributed under a CC-BY 4.0 license). The dataset contains two files: a training set and a validation set.


Baseline Models


We provide a baseline approach for attack detection and labeling that combines contextualized embeddings from a fine-tuned BERT model with hand-crafted text, language model, and target model properties. On average, our baseline approach achieves 91.7% accuracy for attack detection and 66.7% accuracy for attack labeling. Code for these baselines is in the TCAB Benchmark repository on GitHub.
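To make the idea of combining learned and hand-crafted features concrete, here is a minimal sketch that concatenates an embedding vector with a few surface-level text properties. It is not the baseline code from the repository: the embedding function is a stub standing in for a fine-tuned BERT encoder, and the four text properties are hypothetical examples chosen for illustration.

```python
from typing import Callable, List


def text_properties(text: str) -> List[float]:
    """Hypothetical hand-crafted text properties (illustrative only)."""
    n_chars = max(len(text), 1)  # avoid division by zero on empty input
    return [
        float(len(text)),                                            # character count
        float(len(text.split())),                                    # whitespace token count
        sum(not (c.isalnum() or c.isspace()) for c in text) / n_chars,  # punctuation ratio
        sum(c.isdigit() for c in text) / n_chars,                    # digit ratio
    ]


def combine_features(text: str, embed: Callable[[str], List[float]]) -> List[float]:
    """Concatenate the embedding with the hand-crafted properties."""
    return list(embed(text)) + text_properties(text)


def stub_embedding(text: str) -> List[float]:
    """Placeholder for a contextualized embedding; returns a fixed 3-d vector."""
    return [0.0, 0.0, 0.0]


features = combine_features("a g00d movie!", stub_embedding)
print(len(features), features)
```

The combined vector could then be passed to any standard classifier, with a binary target for attack detection or one class per attack method for attack labeling.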


Extending TCAB


TCAB is designed to be extended with additional datasets as new attack methods and text classifiers are developed. To add a new domain dataset or attack, follow the instructions in the TCAB Generation repository on GitHub.


Who created TCAB?


TCAB was created by researchers at the University of Oregon and the University of California, Irvine. Please direct questions to lowd@cs.uoregon.edu or sameer@uci.edu.


Reference


Additional details regarding TCAB are in our paper.