Homework 3: Email Analysis
- Due Date: April 11
- Done By: Individuals
- Submission: On Gradescope
Email continues to be an essential component of conducting business. This assignment will introduce you to email
forensics by having you write a tool that can search for terms in the Enron email data set and, for any search results,
return the sender’s email address and the date and time the message was sent by using the From: and Date: headers, respectively.
Pre-Assignment
Do the following before you get started programming:
Download the Enron email data set from https://www.cs.cmu.edu/~enron/. Be sure to get the May 7, 2015 version (the file will be named enron_mail_20150507.tar.gz).
Go to https://github.com/lintool/Enron2mbox and follow the instructions to convert the data to the mbox format.
For whatever language you plan to use for this assignment, find an mbox library.
If you plan to use Python, check out the following links:
https://docs.python.org/3/library/mailbox.html#mbox
Take note that the mbox class is, because it is a subclass of the Mailbox class, iterable, meaning you can use it in a for loop like this: for message in my_mbox_class_instance:
https://docs.python.org/3/library/mailbox.html#mailbox.mboxMessage
mailbox.mboxMessage is a subclass of mailbox.Message, which is a subclass of email.message.Message. You will need to understand the email.message.Message class’s API for accessing the email’s payload. Take note that email.message.Message is not a subclass of email.message.EmailMessage.
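As a rough illustration of the two points above, the following sketch loops over an mbox file and pulls out each message's text using only the email.message.Message API. The helper names body_text and iter_bodies are made up for this example, and the choice to search only text parts and to fall back to UTF-8 when no charset is declared is an assumption, not a requirement of the assignment.

```python
import mailbox
from email.message import Message


def body_text(message: Message) -> str:
    """Return the text of a message, joining the text parts of multipart mail."""
    chunks = []
    for part in message.walk():
        # Multipart containers hold other parts; only leaf parts have a body.
        if part.is_multipart():
            continue
        # Assumption: attachments and other non-text parts are not searched.
        if part.get_content_maintype() != "text":
            continue
        raw = part.get_payload(decode=True)  # bytes, or None if nothing to decode
        if raw is None:
            continue
        # Assumption: fall back to UTF-8 when the part declares no charset.
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(raw.decode(charset, errors="replace"))
        except LookupError:  # the header named a charset Python does not know
            chunks.append(raw.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def iter_bodies(path: str):
    """Yield (From, Date, body) for every message in the mbox file at `path`."""
    # create=False keeps a mistyped path from silently creating an empty file.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            yield message["From"], message["Date"], body_text(message)
    finally:
        box.close()
```

Header lookups such as message["From"] return None when the header is missing, so callers may want to handle that case.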
The Program
Write a program that satisfies the description above and conforms to the following usage specifications:
enron_search term [term ...]
term
A word to search for in the data set. The search will be case-insensitive, but exact, meaning neither fuzzy matching nor partial matching is performed. When more than one term is given, only emails with ALL terms in the body will be returned.
Your program should ignore duplicate terms and term order, so that the following are equivalent:
enron_search the The THE money
enron_search MONEY tHe monEY
enron_search the money
The exclusion of fuzzy matching means that the term "cash" will not match the string "money", although they are
semantically similar. Exact matching (no partial matching) means that the term "the" will not match the string "them".
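One possible way to implement these matching rules is to split the body into words and compare sets. The sketch below is only an illustration: the names normalize_terms and matches_all are hypothetical, and the definition of a "word" (a run of letters, digits, or apostrophes) is an assumption that the assignment does not spell out.

```python
import re

# Assumption: a "word" is a maximal run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def normalize_terms(terms):
    """Lowercase the terms and drop duplicates; a set also ignores order."""
    return frozenset(term.lower() for term in terms)


def matches_all(body: str, terms) -> bool:
    """True if every term appears in `body` as a whole word, ignoring case."""
    words = set(WORD_RE.findall(body.lower()))
    # Subset test: every search term must be one of the words in the body.
    return normalize_terms(terms) <= words
```

With these definitions, matches_all("Give them the money", ["the", "MONEY"]) is True, while matches_all("Give them money", ["the"]) is False because "them" is a different word from "the".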
For each email with a message body (payload) that matches all the terms given by the user, you should capture and output
the sender (using the From: header field) and the date the email was sent (using the Date: header field). Your
program should number the results and display the total number of results found when the search completes.
It is totally fine for your program to output both the sender’s email address and the date/time sent in the same formats as they are stored in the email headers.
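The command-line parsing and output format described above might be sketched as follows. The function names parse_terms and format_results are hypothetical, and using argparse is just one option; the header values are passed through unchanged, as the assignment allows.

```python
import argparse


def parse_terms(argv=None):
    """Parse `enron_search term [term ...]`; at least one term is required."""
    parser = argparse.ArgumentParser(prog="enron_search")
    parser.add_argument("term", nargs="+", help="a word to search for")
    return parser.parse_args(argv).term


def format_results(hits):
    """Turn (sender, date) pairs into numbered lines plus a final count line."""
    lines = []
    count = 0  # stays 0 when there are no hits
    for count, (sender, date) in enumerate(hits, start=1):
        # Sender and date are printed exactly as stored in the headers.
        lines.append(f"{count}. {sender} {date}")
    lines.append(f"Results found: {count}")
    return lines
```

Printing each returned line in order gives the numbered list followed by the total; an empty search produces only the line Results found: 0.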
Example
The following examples are notional (i.e., made up) and are not from the Enron data set. Lines starting with $ are
what the user enters into the command line. The other lines are the program’s output.
$ enron_search affair
1. Creepy Dude <[email protected]> Mon, 18 Mar 1995 15:45:04 -0500
2. Creepy Dude <[email protected]> Sun, 17 Mar 1995 14:47:38 -0500
Results found: 2
$ enron_search hide all the evidence
1. Guy Incharge <[email protected]> Mon, 18 Mar 1995 14:47:38 -0500
2. Peon Smith <[email protected]> Tue, 19 Mar 1995 14:47:38 -0500
3. Guy Incharge <[email protected]> Wed, 20 Mar 1995 14:47:38 -0500
4. Peon Smith <[email protected]> Thu, 21 Mar 1995 14:47:38 -0500
Results found: 4
Implementation
Your program must work on Ubuntu 18.04 64-bit with the default packages installed. You may find it helpful to set up a virtual machine to do your development. VirtualBox is a free and open-source VM system.
If you wish to use packages that are not installed on Ubuntu 18.04 64-bit by default, please submit a file with your
code named packages, with a list of packages that you would like installed before calling make. Each line of
packages must be a valid package name, one package per line. The submission
system will automatically install all the dependencies that the package lists.
For example, if you were going to write your assignment in Haskell, you could install the
GHC compiler with the following packages file:
ghc
ghc-dynamic
We’ve created a test script called test.sh to help you test your program before submitting.
Download test.sh to the directory where your code lives (including README and Makefile).
Ensure that test.sh is executable: chmod +x test.sh
Run: ./test.sh
Submission Instructions
You will need to submit your source code, along with a Makefile and README, on Gradescope. The Makefile must create
your executable, called enron_search, when the command make is run. Your README file must be plain text and
should contain your name, ASU ID, and a description of how your program works.
Warning
Do NOT submit the Enron data set with your code! There’s no need to upload it since we will add it to the
Autograder’s files. The path to the file will be stored in the ENRON_FILE environment variable before your
program runs. If that environment variable isn’t available, your program should fall back to use enron.mbox in
the same directory as your executable file.
For those programming in Python, environment variables can be read through os.environ in the standard library’s os module.
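A minimal sketch of the fallback rule in Python is shown here. The function name mbox_path is made up for this example, and two details are assumptions: an empty ENRON_FILE value is treated the same as an unset one, and sys.argv[0] is taken to point at the executable that was launched.

```python
import os
import sys


def mbox_path() -> str:
    """Return the ENRON_FILE path, or enron.mbox beside the running program."""
    from_env = os.environ.get("ENRON_FILE")
    # Assumption: an empty value counts as "not available".
    if from_env:
        return from_env
    # Assumption: sys.argv[0] is the path of the executable being run.
    here = os.path.dirname(os.path.abspath(sys.argv[0]))
    return os.path.join(here, "enron.mbox")
```

Resolving the fallback relative to the executable rather than the current working directory keeps the program working when it is started from another directory.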
A prior TA compiled some resources on how to write a Makefile which might be helpful:
https://www.cs.swarthmore.edu/~newhall/unixhelp/howto_makefiles.html