Homework 3: Email Analysis
Due Date: April 11
Email continues to be an essential component of conducting business. This assignment will introduce you to email
forensics by having you write a tool that can search for terms in the Enron email data set and, for any search results,
return the sender’s email address and the date and time the message was sent by using the
Do the following before you get started programming:
- Download the Enron email data set from https://www.cs.cmu.edu/~enron/. Be sure to get the May 7, 2015 version (the
file will be named
- Go to https://github.com/lintool/Enron2mbox and follow the instructions to convert the data to the mbox format.
- For whatever language you plan to use for this assignment, find an mbox library.
If you plan to use Python, check out the following links:
- Take note that the mbox class, because it is a subclass of the Mailbox class, is iterable, meaning you can use it in a for loop like this:
for message in my_mbox_class_instance:
- Take note that mailbox.mboxMessage is a subclass of mailbox.Message, which is a subclass of email.message.Message. You will need to understand the email.message.Message class’s API for accessing the email’s payload. Take note that
email.message.Message is not a subclass of
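As a minimal sketch of the iteration and payload access described above, assuming Python 3 and the standard mailbox module: the helper names body_text and iter_messages are hypothetical, and treating multipart messages by concatenating the decoded text of every part is one possible choice, not something the assignment prescribes.

```python
import mailbox


def body_text(message):
    """Return the text of a message's payload as one string."""
    if message.is_multipart():
        # A multipart payload is a list of sub-messages; recurse into each part.
        return "\n".join(body_text(part) for part in message.get_payload())
    payload = message.get_payload(decode=True)  # bytes, or None if there is no payload
    if payload is None:
        return ""
    charset = message.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # The header named a charset Python does not know; fall back to UTF-8.
        return payload.decode("utf-8", errors="replace")


def iter_messages(mbox_path):
    """Yield (sender, date, body) for every message in the mbox file."""
    for message in mailbox.mbox(mbox_path):
        yield message["From"], message["Date"], body_text(message)
```

Header lookups such as message["From"] return None when the header is absent, so callers may want to handle that case.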
Write a program that satisfies the description above and conforms to the following usage specifications:
enron_search term [term ...]
term A word to search for in the data set. The search will be case-insensitive, but exact, meaning neither fuzzy matching nor partial matching is performed. When more than one term is given, only emails with ALL terms in the body will be returned.
Your program should ignore duplicate terms and term order, so that the following are equivalent:
enron_search the The THE money
enron_search MONEY tHe monEY
enron_search the money
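One way to make those invocations equivalent is to lowercase every term and store the terms in a set, which discards both duplicates and order. The sketch below assumes Python; the function name normalize_terms is hypothetical.

```python
def normalize_terms(raw_terms):
    """Lowercase the terms and collect them in a set.

    A set drops duplicate terms and has no ordering, so argument order
    and repetition on the command line stop mattering.
    """
    return frozenset(term.lower() for term in raw_terms)
```

In a script, the raw terms would typically come from sys.argv[1:].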
The exclusion of fuzzy matching means that the term cash will not match the string money, although they are semantically similar. Exact matching (no partial matching) means
the will not match the string
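Exact, case-insensitive matching can be sketched by splitting the body into whole words and testing set membership. The assignment does not define what counts as a word, so the tokenization here (maximal runs of letters, digits, and underscores, via the regular expression \w+) is an assumption, and the names body_words and matches_all are hypothetical.

```python
import re

# Assumption: a "word" is a maximal run of letters, digits, or underscores.
WORD_RE = re.compile(r"\w+")


def body_words(body):
    """Return the set of lowercased words that occur in the body."""
    return {word.lower() for word in WORD_RE.findall(body)}


def matches_all(terms, body):
    """True only if every term occurs in the body as a whole word."""
    wanted = {term.lower() for term in terms}
    return wanted <= body_words(body)
```

Because the comparison is between whole words, a term never matches a longer word that merely contains it. Note that this tokenization splits contractions such as "don't" into two words.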
For each email with a message body (payload) that matches all the terms given by the user, you should capture and output the sender (using the From: header field) and the date the email was sent (using the Date: header field). Your program should number the results and display the total number of results found when the search completes.
It is totally fine for your program to output both the sender’s email address and the date/time sent in the same formats as they are stored in the email headers.
The following examples are notional (i.e., made up) and are not from the Enron data set. Lines starting with $ are what the user enters into the command line. The other lines are the program’s output.
$ enron_search hide all the evidence
1. Guy Incharge <[email protected]> Mon, 18 Mar 1995 14:47:38 -0500
2. Peon Smith <[email protected]> Tue, 19 Mar 1995 14:47:38 -0500
3. Guy Incharge <[email protected]> Wed, 20 Mar 1995 14:47:38 -0500
4. Peon Smith <[email protected]> Thu, 21 Mar 1995 14:47:38 -0500
Results found: 4
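The numbered output in the example can be produced with a small helper like the one below. This is a sketch in Python: the name print_results is hypothetical, it assumes each result is a (sender, date) pair holding the raw header values, and the exact spacing between fields is inferred from the notional example rather than specified.

```python
def print_results(results):
    """Print numbered (sender, date) pairs, then the total count."""
    count = 0
    for count, (sender, date) in enumerate(results, start=1):
        print(f"{count}. {sender} {date}")
    # count stays 0 when there were no results at all.
    print(f"Results found: {count}")
    return count
```

Since results can be any iterable, matches can be printed as they are found instead of being collected in a list first.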
Your program must work on Ubuntu 18.04 64-bit with the default packages installed. You may find it helpful to set up a virtual machine to do your development. VirtualBox is a free and open-source VM system.
If you wish to use packages that are not installed on Ubuntu 18.04 64-bit by default, please submit a file named packages with your code, listing the packages that you would like installed before make is called. Each line of packages must be a valid package name, one package per line. The submission system will automatically install all the dependencies of the listed packages.
We’ve created a test script called
test.sh to help you test your program before compiling.
- Download test.sh to the directory where your code lives (including
- Ensure that
chmod +x test.sh
You will need to submit your source code, along with a Makefile and README, on Gradescope. The Makefile must create your executable, called enron_search, when the command make is run. Your README file must be plain text and should contain your name, ASU ID, and a description of how your program works.
Do NOT submit the Enron data set with your code! There’s no need to upload it since we will add it to the Autograder’s files. The path to the file will be stored in the ENRON_FILE environment variable before your program runs. If that environment variable isn’t available, your program should fall back to looking for the data set in the same directory as your executable file.
For those programming in Python, more information on accessing environment variables is available here.
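The lookup and fallback might be sketched as follows in Python. The fallback file name enron.mbox is a hypothetical placeholder (substitute whatever name your converted file actually has), and the function name data_file_path is likewise an assumption.

```python
import os
import sys

# Hypothetical placeholder; use the name of the file your conversion produced.
FALLBACK_NAME = "enron.mbox"


def data_file_path():
    """Return the data set path from ENRON_FILE, else a path next to the executable."""
    env_path = os.environ.get("ENRON_FILE")
    if env_path:
        return env_path
    # sys.argv[0] is the script as invoked; resolve it to find its directory.
    exe_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
    return os.path.join(exe_dir, FALLBACK_NAME)
```

Using os.environ.get rather than os.environ["ENRON_FILE"] avoids a KeyError when the variable is unset.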
A prior TA compiled some resources on how to write a Makefile which might be helpful: