WIP: Add Gmail takeout mbox import #5

Open · wants to merge 3 commits into master

Conversation

@UtahDave (Author)

WIP

This PR adds the ability to import emails from a Gmail mbox export from Google Takeout.

This is my first PR to a datasette/dogsheep repo. I've tested this on my personal Google Takeout mbox with ~520,000 emails going back to 2004. It took around 20 minutes to process.

To provide some feedback on the progress of the import I added the "rich" Python module. I'm happy to remove that if adding a dependency is discouraged; however, I think it makes a nice addition, giving feedback on the progress of a long import.

Do we want to log emails that have errors when trying to import them?

Dealing with email encodings is a bit tricky. I'm very open to feedback on how to handle them better, as well as any other suggestions for improvement.

@UtahDave mentioned this pull request Feb 22, 2021
@UtahDave (Author)

Also, @simonw, I created a test based on the existing tests. I think it's working correctly.

@UtahDave (Author)

I noticed that @simonw is using black for formatting. I ran black on my additions in this PR.

@simonw (Collaborator) commented Feb 26, 2021

Thanks!

I requested my Gmail export from Takeout - once that arrives I'll test it against this and then merge the PR.

@simonw (Collaborator) commented Mar 4, 2021

Wow, my mbox is a 10.35 GB download!

@simonw (Collaborator) commented Mar 4, 2021

The Rich-powered progress bar is pretty:

[screenshot: Rich progress bar]

@simonw (Collaborator) commented Mar 4, 2021

The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:

mbox = mailbox.mbox(mbox_file)
print("Processing {} emails".format(len(mbox)))

I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.
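For illustration, the count can drive a Rich progress bar roughly like this (a sketch, not the PR's actual loop - the file path and loop body are placeholders):

    import mailbox
    from rich.progress import track

    mbox_file = "gmail.mbox"  # hypothetical path
    mbox = mailbox.mbox(mbox_file)

    # len(mbox) forces the full-file scan discussed above, but it gives
    # track() a total so Rich can render a real progress bar
    for message in track(mbox, total=len(mbox), description="Importing emails"):
        pass  # convert the message to a row and upsert it here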

@simonw (Collaborator) commented Mar 4, 2021

I'm not sure if it would work, but there is an alternative pattern for showing a progress bar against a really large file that I've used in healthkit-to-sqlite - you set the progress bar size to the size of the file in bytes, then update a counter as you read the file.

https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/cli.py#L24-L57 and https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/utils.py#L4-L19 (the progress_callback() bit) is where that happens.

It can be a bit of a convoluted pattern, and I'm not at all sure it would work for mbox files since it looks like that library has other reasons it needs to do a file scan rather than streaming it through one chunk of bytes at a time. So I imagine this would not work here.
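For reference, the general shape of that byte-based pattern looks roughly like this (a sketch of the idea, not the actual healthkit-to-sqlite code; handle_chunk is a hypothetical callback):

    import os
    from rich.progress import Progress

    def process_with_byte_progress(path, handle_chunk):
        # Size the progress bar by the file size in bytes, then advance it
        # by however many bytes each read consumes
        total = os.path.getsize(path)
        with Progress() as progress, open(path, "rb") as fp:
            task = progress.add_task("Importing", total=total)
            while True:
                chunk = fp.read(1024 * 1024)
                if not chunk:
                    break
                handle_chunk(chunk)
                progress.update(task, advance=len(chunk))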

@simonw (Collaborator) commented Mar 4, 2021

I got 9 warnings that look like this:

Errors: 1
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 103, in get_mbox
    message["date"] = get_message_date(email.get("Date"), email.get_from())
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 167, in get_message_date
    datetime_tuple = email.utils.parsedate_tz(mail_date)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz
    data = data.split()
AttributeError: 'Header' object has no attribute 'split'

It would be useful if those warnings told me the message ID (or similar) of the affected message so I could grep for it in the mbox and see what was going on.

@simonw (Collaborator) commented Mar 4, 2021

It looks like the body is being loaded into a BLOB column - so in Datasette by default it looks like this:

[screenshot: mbox_emails table, 753,446 rows, body shown as a BLOB]

If I run datasette install datasette-render-binary and then try again I get this:

[screenshot: mbox_emails table, 753,446 rows, rendered with datasette-render-binary]

It would be great if we could store the body as unicode text instead. May have to do something clever to decode it based on some kind of charset header?
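Something along these lines might work - a sketch that decodes using the part's declared charset and falls back to UTF-8 (decode_body is a hypothetical helper, not code from this PR):

    def decode_body(message):
        # get_payload(decode=True) returns bytes; decode them using the
        # charset declared on the part, falling back to UTF-8 with
        # replacement characters for anything undecodable
        payload = message.get_payload(decode=True)
        if payload is None:
            return None
        charset = message.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown or garbled charset name in the header
            return payload.decode("utf-8", errors="replace")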

@simonw (Collaborator) commented Mar 4, 2021

Confirmed: removing the len() call does not speed things up, so it's reading through the entire file for some other purpose too.

@simonw (Collaborator) commented Mar 4, 2021

Looks like you're doing this:

    elif message.get_content_type() == "text/plain":
        body = message.get_payload(decode=True)

So presumably that decodes to a unicode string?

I imagine the reason the column is a BLOB for me is that sqlite-utils determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.

"""
Import Gmail mbox from google takeout
"""
db["mbox_emails"].upsert_all(
Copy link
Collaborator

@simonw simonw Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fix for the problem I had where my body column ended up being a BLOB rather than text would be to explicitly create the table first.

You can do that like so:

if not db["mbox_emails"].exists():
    db["mbox_emails"].create({
        "id": str,
        "X-GM-THRID": str,
        "X-Gmail-Labels": str,
        "From": str,
        "To": str,
        "Subject": str,
        "when": str,
        "body": str,
    }, pk="id")

I had to upgrade to the latest sqlite-utils for this to work, because prior to sqlite-utils 2.0 table.exists was a boolean property rather than a method.

@UtahDave (Author) commented Mar 4, 2021

The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:

mbox = mailbox.mbox(mbox_file)
print("Processing {} emails".format(len(mbox)))

I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.

The wait is from Python loading the mbox file. This happens regardless of whether you're getting the length of the mbox. The mbox module is on the slow side. It is possible to do one's own parsing of the mbox, but I kind of wanted to avoid doing that.

@UtahDave (Author) commented Mar 4, 2021

Looks like you're doing this:

    elif message.get_content_type() == "text/plain":
        body = message.get_payload(decode=True)

So presumably that decodes to a unicode string?

I imagine the reason the column is a BLOB for me is that sqlite-utils determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.

Ah, that's good to know. I think explicitly creating the tables will be a great improvement. I'll add that.

Also, I noticed after I opened this PR that message.get_payload() is being deprecated in favor of message.get_content() or something like that. I'll see if that handles the decoding better, too.
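For what it's worth, the newer API is EmailMessage.get_content(), available when the message is parsed with policy=email.policy.default; it returns an already-decoded str for text parts. Roughly (a sketch, with a hypothetical helper name):

    import email
    from email import policy

    def get_plain_text_body(raw_bytes):
        # Parsing with policy.default yields an EmailMessage, whose
        # get_body()/get_content() handle charset decoding for us
        msg = email.message_from_bytes(raw_bytes, policy=policy.default)
        body_part = msg.get_body(preferencelist=("plain",))
        return body_part.get_content() if body_part is not None else None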

Thanks for the feedback. I should have time tomorrow to put together some improvements.

@simonw (Collaborator) commented Mar 4, 2021

I added this code to output a message ID on errors:

             print("Errors: {}".format(num_errors))
             print(traceback.format_exc())
+            print("Message-Id: {}".format(email.get("Message-Id", "None")))
             continue

Having found a message ID that had an error, I ran this command to see the context:

rg --text --context 20 '44F289B0.000001.02100@SCHWARZE-DWFXMI' ~/gmail.mbox

This was for the following error:

  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 102, in get_mbox
    message["date"] = get_message_date(email.get("Date"), email.get_from())
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 178, in get_message_date
    datetime_tuple = email.utils.parsedate_tz(mail_date)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz
    data = data.split()
AttributeError: 'Header' object has no attribute 'split'

Here's what I spotted in the ripgrep output:

177133570:Message-Id: <44F289B0.000001.02100@SCHWARZE-DWFXMI>
177133571-Date: Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit)
177133572-X-Mailer: IncrediMail (5002253)

So it could be that _parsedate_tz is having trouble with that Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit) string.

@simonw (Collaborator) commented Mar 4, 2021

A solution could be to pre-process that string by splitting on ( and dropping everything after it, assuming that the (...) bit isn't necessary for correctly parsing the date.
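Roughly, assuming the value is first coerced to a plain string (a sketch with a hypothetical helper name, not the project's actual get_message_date()):

    import email.utils

    def parse_date_header(mail_date):
        # Coerce Header objects to str, then drop everything from the first
        # "(" onwards (the timezone-name comment) before parsing
        cleaned = str(mail_date).split("(")[0].strip()
        return email.utils.parsedate_tz(cleaned)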

@simonw (Collaborator) commented Mar 4, 2021

I imported my 10GB mbox with 750,000 emails in it, ran this tool (with a hacked fix for the blob column problem) - and now a search that returns 92 results takes 25.37ms! This is fantastic.

@simonw (Collaborator) commented Mar 4, 2021

I'm not sure why but my most recent import, when displayed in Datasette, looks like this:

[screenshot: mbox_emails table, 753,446 rows, blank rows listed first]

Sorting by id in the opposite order gives me the data I would expect - so it looks like a bunch of null/blank messages are being imported at some point and showing up first due to ID ordering.

@maxhawkins

I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.

Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?

@UtahDave (Author) commented Mar 5, 2021

I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.

Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?

@maxhawkins a limitation of the Python mbox module is that it loads the entire mbox into memory. I did find another approach that skipped the built-in mbox module and used a generator, so the whole mbox never had to be loaded into memory. I was hoping to stick to standard library modules, but this might be a good reason to investigate that approach a bit more. My worry is making sure a custom parser handles all the ins and outs of the mbox format correctly.

Hm. As I'm writing this, I thought of something: I think I can split out each message one at a time and then load each one using the Python mbox module. That way the mbox module still deals with the specifics of the mbox format, but I can use a generator.

I'll give that a try. Thanks for the feedback, @maxhawkins and @simonw.

@simonw can we hold off on merging this until I can test this new approach?

@maxhawkins

Any updates?

@maxhawkins commented Jul 22, 2021

How does this commit look? maxhawkins@72802a8

It seems that Takeout's mbox format is pretty simple, so we can get away with just splitting the file on lines beginning with From . My commit splits the file every time a line starts with From and uses email.message_from_bytes to parse each chunk.

I was able to load a 12GB takeout mbox without the program using more than a couple hundred MB of memory during the import process. It does make us lose the progress bar, but maybe I can add that back in a later commit.
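For readers following along, the splitting approach described above looks roughly like this (a simplified sketch of the idea, not the linked commit itself; it relies on mbox escaping body lines that start with "From "):

    import email

    def parse_mbox(path):
        # Buffer lines until the next mbox "From " separator, then parse
        # the buffered message; only one message is held in memory at a time
        buffer = []
        with open(path, "rb") as fp:
            for line in fp:
                if line.startswith(b"From "):
                    if buffer:
                        yield email.message_from_bytes(b"".join(buffer))
                    buffer = []
                    continue  # skip the separator line itself
                buffer.append(line)
        if buffer:
            yield email.message_from_bytes(b"".join(buffer))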

@maxhawkins

One thing I noticed is that this importer doesn't save attachments along with the bodies of the emails. It would be nice if those got stored as blobs in a separate attachments table so attachments could be included when fetching search results.

@maxhawkins

I added a follow-up commit that deals with emails that don't have a Date header: maxhawkins@4bc7010

@UtahDave (Author)

Hi @maxhawkins, I'm sorry, I haven't had any time to work on this. I'll have some time tomorrow to test your commits. I think they look great, and I'm fine with your commits superseding my initial attempt here.

@maxhawkins

I'm not sure why but my most recent import, when displayed in Datasette, looks like this:

[screenshot: mbox_emails table, 753,446 rows, blank rows listed first]

I did some investigation into this issue and made a fix here. The problem was that some messages (like gchat logs) don't have a Message-Id and we need to use X-GM-THRID as the pkey instead.
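In other words, the row id can fall back to the Gmail thread id when the Message-Id header is missing - something like this (an illustrative sketch with a hypothetical helper name):

    def message_pkey(message):
        # Prefer Message-Id, but fall back to the Gmail thread id for
        # messages (such as chat logs) that don't have one
        return message.get("Message-Id") or message.get("X-GM-THRID")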

@simonw While looking into this I found something unexpected about how sqlite_utils handles upserts if the pkey column is None. When the pkey is NULL I'd expect the function to either use rowid or throw an exception. Instead, it seems upsert_all creates a row where all columns are NULL instead of using the values provided as parameters.
