DIT SMS Spam Dataset

The DIT SMS Spam Dataset is a corpus of 1,353 unique spam SMS messages scraped from two UK public consumer complaints websites. Each message is stamped with the date on which it was reported, and the corpus covers the period from late 2003 to mid-2010. All messages were originally received by UK mobile users, so the data comes from a single linguistic region.

Exact duplicates have been removed from the data, but many of the remaining messages may still be near-duplicates of one another, since SMS spam is characterised by obfuscation. The data is in XML format.
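
Because near-duplicates remain in the corpus, users may want to flag them before training or evaluating a filter. The following is a minimal sketch of one possible approach, not the method used to build the dataset: it normalises each message and compares pairs with Python's standard difflib module. The function names, the 0.9 threshold and the sample messages are illustrative assumptions, and loading the messages from the XML file is left out because the schema is not described here.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lowercase and collapse each run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def near_duplicates(messages, threshold=0.9):
    """Return (i, j, ratio) for every pair whose normalised similarity is >= threshold.

    The threshold is an assumed value; tune it on a sample of the real data.
    """
    norm = [normalise(m) for m in messages]
    pairs = []
    for i in range(len(norm)):
        for j in range(i + 1, len(norm)):
            ratio = SequenceMatcher(None, norm[i], norm[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, ratio))
    return pairs


# Made-up example messages, not taken from the dataset.
sample = [
    "FREE entry! Txt WIN to 80000 now",
    "F.R.E.E entry!! txt WIN to 80000 now",
    "Your parcel is waiting for collection",
]

for i, j, ratio in near_duplicates(sample):
    print(i, j, round(ratio, 2))
```

The pairwise comparison is quadratic in the number of messages, which is acceptable for a corpus of this size (under a million pairs) but would need a faster scheme for much larger collections.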

Information on how the dataset was collected, together with an analysis of the types of messages it contains, is available in the following paper:

S.J. Delany, M. Buckley & D. Greene (2012) "SMS spam filtering: methods and data", Expert Systems with Applications 39, pp. 9899-9908

Please reference this paper if you use this dataset. 

Download the SMS spam dataset 
