Research

Simplistic Collection and Labeling Practices Limit the Generalizability of Benchmark Datasets for Twitter Bot Detection

Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near-perfect accuracy on existing datasets, making it seem that bot detection is accurate, reliable, and fit for use in downstream applications. We argue this is not the case, providing evidence that high performance is attributable to limitations in dataset collection and labeling rather than to the sophistication of the tools. Specifically, we show that simple decision rules (shallow decision trees trained on a small number of features) achieve near state-of-the-art performance on most available datasets, and that bot detection datasets, even when combined, do not generalize well to out-of-sample datasets. The classifiers reveal that outcomes are highly dependent on each dataset’s collection and labeling procedures rather than on fundamental differences between bots and humans. Our findings have important implications both for transparency in sampling and labeling procedures and for potential biases in work that uses existing tools for pre-processing.
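To make the notion of a "simple decision rule" concrete, the following is a minimal sketch of a depth-1 decision tree (a single threshold on a single feature), fit on one toy dataset and evaluated on another. Everything here is an assumption made for illustration: the feature name `tweets_per_day`, the helper functions, and the hand-written toy values are hypothetical and are not drawn from the benchmark datasets or classifiers discussed above.

```python
# Illustrative sketch only: toy data and a hypothetical feature,
# not the benchmark datasets or the classifiers evaluated in the paper.

def fit_stump(xs, ys):
    """Fit a depth-1 decision tree: one threshold on one feature.

    Returns (threshold, polarity). The rule predicts bot (1) when
    polarity * x > polarity * threshold, and human (0) otherwise.
    """
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for t in candidates:
        for polarity in (1, -1):
            preds = [1 if polarity * x > polarity * t else 0 for x in xs]
            acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
            # Keep the first rule that strictly improves training accuracy.
            if best is None or acc > best[0]:
                best = (acc, t, polarity)
    return best[1], best[2]


def predict(stump, xs):
    t, polarity = stump
    return [1 if polarity * x > polarity * t else 0 for x in xs]


def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


# Toy dataset A (assumed): the collection procedure happened to gather
# only high-volume bots, so posting rate separates the classes perfectly.
tweets_per_day_a = [2, 5, 8, 3, 120, 200, 150, 90]
labels_a = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = human, 1 = bot

# Toy dataset B (assumed): a different collection procedure, where the
# labeled bots post rarely and the labeled humans post often.
tweets_per_day_b = [40, 60, 75, 55, 1, 2, 4, 3]
labels_b = [0, 0, 0, 0, 1, 1, 1, 1]

stump = fit_stump(tweets_per_day_a, labels_a)
in_sample = accuracy(predict(stump, tweets_per_day_a), labels_a)
out_of_sample = accuracy(predict(stump, tweets_per_day_b), labels_b)
print(f"rule: bot if tweets_per_day > {stump[0]}")
print(f"in-sample accuracy: {in_sample}, out-of-sample accuracy: {out_of_sample}")
```

In this contrived setting, a one-line rule fits the training dataset perfectly yet fails on a dataset gathered differently, because the rule encodes how the accounts were collected rather than a general property of bots.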