Enron Dataset - Individual .PST Datasets


About this Resource

The resources available from this site constitute some of the practical research completed for a dissertation project at the University of Surrey. The project investigates the identification of intellectual property in emails using cloud computing. As such, this resource demonstrates how a large, readily available email corpus can be stored in, accessed from, and downloaded from a computing cloud. This particular resource is an implementation of a Windows Azure storage project. Windows Azure is Microsoft’s public cloud computing operating system, which is currently in development.

Downloads

The 148 individual mailboxes from the Enron CALO dataset can be downloaded (in zip format).

The URL format for each mailbox is as follows:
http://enrondata.blob.core.windows.net/pst/pst/<mailbox_name>.zip
where <mailbox_name> is the custodian name of the user.

For example:  
http://enrondata.blob.core.windows.net/pst/pst/allen-p.zip
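
The URL format above can be applied programmatically. The following is a minimal Python sketch, not part of the original resource: the function names mailbox_url and download_mailbox are hypothetical, and the download step assumes the storage endpoint is reachable and serves the file over plain HTTP as shown in the example.

```python
import shutil
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the URL format given on this page.
BASE_URL = "http://enrondata.blob.core.windows.net/pst/pst/"


def mailbox_url(mailbox_name: str) -> str:
    """Return the download URL for one custodian's zipped mailbox.

    mailbox_name is the custodian name, e.g. "allen-p".
    (Hypothetical helper, named here for illustration only.)
    """
    return f"{BASE_URL}{quote(mailbox_name)}.zip"


def download_mailbox(mailbox_name: str, dest_dir: str = ".") -> Path:
    """Stream one mailbox archive into dest_dir and return the local path.

    Assumes the endpoint is reachable; raises urllib.error.URLError
    (or its subclass HTTPError) if it is not.
    """
    target = Path(dest_dir) / f"{mailbox_name}.zip"
    # Copy in chunks so that larger archives are not held in memory.
    with urlopen(mailbox_url(mailbox_name)) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


if __name__ == "__main__":
    # Only builds and prints the URL; call download_mailbox("allen-p")
    # to fetch the archive itself.
    print(mailbox_url("allen-p"))
```

Building the URL is kept separate from fetching it, so a list of custodian names can be turned into URLs and checked before any download is started.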

To view a complete list of Enron custodians please click here.

Note that mailbox size varies by user. Some downloads are very small (<1MB) whilst others are much larger (150+MB).

Alternatively, the complete dataset belonging to all 148 users may be downloaded from here (1.73GB).

Related Publications

Neil Cooke and Lee Gillam (2008) "Distributions and Distributional Lexical Semantics for Stop Words". Workshop on Corpus Profiling for Information Retrieval and Natural Language Processing, in conjunction with Information Interaction in Context (IIiX) 2008. London 18 October 2008. BCS eWiC Series. Download paper from this link (PDF)

Lee Gillam and Neil Cooke (2008) "Intellectual property escaped with the email? Press F1 for help". Journal of Information Assurance and Security 3(1): 16-26, March 2008. Download paper from this link (PDF)

Neil Cooke, Lee Gillam, and Ahmet Kondoz (2007) "IP Protection: Detecting Email-based Breaches of Confidence". 3rd International Symposium on Information Assurance and Security, Manchester 29-31 August. IEEE Computer Society Press (DOI Bookmark). Download paper from this link (PDF)

Neil Cooke, Lee Gillam, and Ahmet Kondoz (2007) "The Best Kept Secrets with Corpus Linguistics". 4th Corpus Linguistics Conference 2007, Birmingham 27-30 July. Download paper from this link (PDF)

External Links

EnronData.org

Microsoft Azure

Department of Computing - University of Surrey

Last updated - 22nd July 2009
Contact - cs41js@surrey.ac.uk