Tuesday, June 12, 2012

Parsing Email and Fixing Timestamps in Python

I decided to POP all my Yahoo mail into my Google Apps account so I could stop paying for Yahoo's "premium" service (WTF, it's 2012, and POP is a paid feature--and there's no IMAP?).  I then have fetchmail download all of my messages, which get post-processed by procmail and re-served by dovecot.  Since a bunch of really old messages had just been downloaded by fetchmail, they appeared to be "new" from dovecot's perspective.  This is because the names of the message files stored in the Maildir format used by dovecot start with a number representing when each message was downloaded.  So years-old messages that were just downloaded have a very recent timestamp encoded in their filenames.

The file names look like this:

1339506150.22834_0.hoth:2,Sb
1339506889.22952_0.hoth:2,Sb
1339507621.23058_0.hoth:2,Sb
1339509572.27344_0.hoth:2,Sb
1339510487.386_0.hoth:2,Sb
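
As a quick illustration of what that leading number encodes, here is a minimal sketch that decodes it, assuming it is a Unix timestamp in seconds (the usual Maildir convention). The helper name maildir_delivery_time is made up for this example and is not part of the script below.

```python
from datetime import datetime, timezone


def maildir_delivery_time(filename):
    """Return the leading number of a Maildir file name as a UTC datetime.

    Assumes the part before the first "." is a Unix timestamp in seconds.
    """
    seconds, _, _ = filename.partition(".")
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


print(maildir_delivery_time("1339506150.22834_0.hoth:2,Sb").isoformat())
# 2012-06-12T13:02:30+00:00
```

The first file name in the list decodes to the day of the download, not the day the message was originally delivered.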

To fix this, I wrote a little Python program to parse out the date each message was originally received and rename the message file accordingly.  It also updates the file system's atime and mtime timestamps.  Since several new "Received:" headers were added to each message when Google POPed it from Yahoo, and again when fetchmail downloaded it from Google, I needed to figure out which ones to disregard.  I decided to compare the date the message was originally sent to each of the Received: dates, and use the most recent Received: header that was no more than 24 hours after the message was sent.
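
The selection rule and the rename step can be sketched roughly as follows. This is an illustration written for Python 3 with the standard email module, not the script itself; the function names are hypothetical. It also makes a few assumptions of its own: dates without a usable time zone are treated as UTC, the Date: value is used as a fallback when no Received: header qualifies, and nothing checks whether the new file name already exists.

```python
import os
from datetime import timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def parse_date(value):
    """Parse an RFC 2822 date string into an aware datetime, or None."""
    try:
        dt = parsedate_to_datetime(value.strip())
    except (TypeError, ValueError):
        return None
    # Assumption: treat zone-less dates as UTC so comparisons never mix
    # naive and aware datetimes.
    return dt if dt.tzinfo is not None else dt.replace(tzinfo=timezone.utc)


def original_received(msg, window=timedelta(hours=24)):
    """Latest Received: date that is at most `window` after the Date: header."""
    sent = parse_date(msg.get("Date", ""))
    if sent is None:
        return None
    candidates = []
    for header in msg.get_all("Received", []):
        # The timestamp follows the last semicolon of a Received: header.
        received = parse_date(header.rpartition(";")[2])
        if received is not None and received - sent <= window:
            candidates.append(received)
    # Assumption: fall back to the sent date if no header qualifies.
    return max(candidates, default=sent)


def apply_timestamp(path, when):
    """Set atime/mtime and swap the leading number in the file name."""
    ts = int(when.timestamp())
    directory, name = os.path.split(path)
    _, sep, rest = name.partition(".")
    new_path = os.path.join(directory, f"{ts}{sep}{rest}")
    os.utime(path, (ts, ts))   # (atime, mtime)
    os.rename(path, new_path)  # does not check for an existing target
    return new_path
```

For a single file this would be used by reading the message text, passing it through message_from_string, and handing the result of original_received to apply_timestamp along with the file's path.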

I made a backup of ~/Maildir/cur/ first.  From inside that directory, you can loop over the message files like this:

$ for file in *; do ~/bin/timestamp.py "$file"; done

I'm not really sure what the implications are of mucking around in ~/Maildir while dovecot is running, so I stopped both dovecot and fetchmail while doing it.

Here's the script.  Use at your own risk: