WIP: Add Gmail takeout mbox import #5

Open · wants to merge 3 commits into master

Conversation

@UtahDave (Author)

WIP

This PR adds the ability to import emails from a Gmail mbox export from Google Takeout.

This is my first PR to a datasette/dogsheep repo. I've tested this on my personal Google Takeout mbox with ~520,000 emails going back to 2004. It took around 20 minutes to process.

To provide some feedback on the progress of the import I added the "rich" Python module. I'm happy to remove that if adding a dependency is discouraged; however, I think it makes a nice addition, giving feedback on the progress of a long import.

Do we want to log emails that have errors when trying to import them?

Dealing with email encodings is a bit tricky. I'm very open to feedback on how to handle them better, as well as any other suggestions for improvement.

@UtahDave mentioned this pull request Feb 22, 2021
@UtahDave (Author)

Also, @simonw, I created a test based on the existing tests. I think it's working correctly.

@UtahDave (Author)

I noticed that @simonw is using black for formatting. I ran black on my additions in this PR.

@simonw (Collaborator) commented Feb 26, 2021

Thanks!

I requested my Gmail export from Takeout - once that arrives I'll test it against this and then merge the PR.

@simonw (Collaborator) commented Mar 4, 2021

Wow, my mbox is a 10.35 GB download!

@simonw (Collaborator) commented Mar 4, 2021

The Rich-powered progress bar is pretty:

[screenshot: Rich progress bar]

@simonw (Collaborator) commented Mar 4, 2021

The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:

mbox = mailbox.mbox(mbox_file)
print("Processing {} emails".format(len(mbox)))

I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.
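For illustration, the count can drive a Rich progress bar roughly like this (a sketch, not the PR's actual loop - the file path and loop body are placeholders):

    import mailbox
    from rich.progress import track

    mbox_file = "gmail.mbox"  # hypothetical path
    mbox = mailbox.mbox(mbox_file)

    # len(mbox) forces the full-file scan discussed above, but it gives
    # track() a total so Rich can render a real progress bar
    for message in track(mbox, total=len(mbox), description="Importing emails"):
        pass  # convert the message to a row and upsert it here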

@simonw (Collaborator) commented Mar 4, 2021

I'm not sure if it would work, but there is an alternative pattern for showing a progress bar against a really large file that I've used in healthkit-to-sqlite - you set the progress bar size to the size of the file in bytes, then update a counter as you read the file.

https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/cli.py#L24-L57 and https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/utils.py#L4-L19 (the progress_callback() bit) is where that happens.

It can be a bit of a convoluted pattern, and I'm not at all sure it would work for mbox files since it looks like that library has other reasons it needs to do a file scan rather than streaming it through one chunk of bytes at a time. So I imagine this would not work here.
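For reference, the general shape of that byte-based pattern looks roughly like this (a sketch of the idea, not the actual healthkit-to-sqlite code; handle_chunk is a hypothetical callback):

    import os
    from rich.progress import Progress

    def process_with_byte_progress(path, handle_chunk):
        # Size the progress bar by the file size in bytes, then advance it
        # by however many bytes each read consumes
        total = os.path.getsize(path)
        with Progress() as progress, open(path, "rb") as fp:
            task = progress.add_task("Importing", total=total)
            while True:
                chunk = fp.read(1024 * 1024)
                if not chunk:
                    break
                handle_chunk(chunk)
                progress.update(task, advance=len(chunk))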

@simonw (Collaborator) commented Mar 4, 2021

I got 9 warnings that look like this:

Errors: 1
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 103, in get_mbox
    message["date"] = get_message_date(email.get("Date"), email.get_from())
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 167, in get_message_date
    datetime_tuple = email.utils.parsedate_tz(mail_date)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz
    data = data.split()
AttributeError: 'Header' object has no attribute 'split'

It would be useful if those warnings told me the message ID (or similar) of the affected message so I could grep for it in the mbox and see what was going on.

@simonw (Collaborator) commented Mar 4, 2021

It looks like the body is being loaded into a BLOB column - so in Datasette by default it looks like this:

[screenshot: mbox_emails table, 753,446 rows, body shown as a BLOB]

If I run datasette install datasette-render-binary and then try again I get this:

[screenshot: mbox_emails table, 753,446 rows, rendered with datasette-render-binary]

It would be great if we could store the body as unicode text instead. May have to do something clever to decode it based on some kind of charset header?
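Something along these lines might work - a sketch that decodes using the part's declared charset and falls back to UTF-8 (decode_body is a hypothetical helper, not code from this PR):

    def decode_body(message):
        # get_payload(decode=True) returns bytes; decode them using the
        # charset declared on the part, falling back to UTF-8 with
        # replacement characters for anything undecodable
        payload = message.get_payload(decode=True)
        if payload is None:
            return None
        charset = message.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown or garbled charset name in the header
            return payload.decode("utf-8", errors="replace")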

@simonw (Collaborator) commented Mar 4, 2021

Confirmed: removing the len() call does not speed things up, so it's reading through the entire file for some other purpose too.

@simonw (Collaborator) commented Mar 4, 2021

Looks like you're doing this:

    elif message.get_content_type() == "text/plain":
        body = message.get_payload(decode=True)

So presumably that decodes to a unicode string?

I imagine the reason the column is a BLOB for me is that sqlite-utils determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.

"""
Import Gmail mbox from google takeout
"""
db["mbox_emails"].upsert_all(
Copy link
Collaborator

@simonw simonw Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fix for the problem I had where my body column ended up being a BLOB rather than text would be to explicitly create the table first.

You can do that like so:

if not db["mbox_emails"].exists():
    db["mbox_emails"].create({
        "id": str,
        "X-GM-THRID": str,
        "X-Gmail-Labels": str,
        "From": str,
        "To": str,
        "Subject": str,
        "when": str,
        "body": str,
    }, pk="id")

I had to upgrade to the latest sqlite-utils for this to work, because prior to sqlite-utils 2.0 table.exists was a boolean property rather than a method.

@UtahDave (Author) commented Mar 4, 2021

The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:

mbox = mailbox.mbox(mbox_file)
print("Processing {} emails".format(len(mbox)))

I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.

The wait is from Python loading the mbox file. This happens regardless of whether you're getting the length of the mbox. The mbox module is on the slow side. It is possible to do one's own parsing of the mbox, but I kind of wanted to avoid doing that.

@UtahDave (Author) commented Mar 4, 2021

Looks like you're doing this:

    elif message.get_content_type() == "text/plain":
        body = message.get_payload(decode=True)

So presumably that decodes to a unicode string?

I imagine the reason the column is a BLOB for me is that sqlite-utils determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.

Ah, that's good to know. I think explicitly creating the tables will be a great improvement. I'll add that.

Also, I noticed after I opened this PR that message.get_payload() is being deprecated in favor of message.get_content() or something like that. I'll see if that handles the decoding better, too.
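For what it's worth, the newer API is EmailMessage.get_content(), available when the message is parsed with policy=email.policy.default; it returns an already-decoded str for text parts. Roughly (a sketch, with a hypothetical helper name):

    import email
    from email import policy

    def get_plain_text_body(raw_bytes):
        # Parsing with policy.default yields an EmailMessage, whose
        # get_body()/get_content() handle charset decoding for us
        msg = email.message_from_bytes(raw_bytes, policy=policy.default)
        body_part = msg.get_body(preferencelist=("plain",))
        return body_part.get_content() if body_part is not None else None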

Thanks for the feedback. I should have time tomorrow to put together some improvements.

@simonw (Collaborator) commented Mar 4, 2021

I added this code to output a message ID on errors:

             print("Errors: {}".format(num_errors))
             print(traceback.format_exc())
+            print("Message-Id: {}".format(email.get("Message-Id", "None")))
             continue

Having found a message ID that had an error, I ran this command to see the context:

rg --text --context 20 '44F289B0.000001.02100@SCHWARZE-DWFXMI' ~/gmail.mbox

This was for the following error:

  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 102, in get_mbox
    message["date"] = get_message_date(email.get("Date"), email.get_from())
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 178, in get_message_date
    datetime_tuple = email.utils.parsedate_tz(mail_date)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz
    data = data.split()
AttributeError: 'Header' object has no attribute 'split'

Here's what I spotted in the ripgrep output:

177133570:Message-Id: <44F289B0.000001.02100@SCHWARZE-DWFXMI>
177133571-Date: Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit)
177133572-X-Mailer: IncrediMail (5002253)

So it could be that _parsedate_tz is having trouble with that Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit) string.

@simonw (Collaborator) commented Mar 4, 2021

A solution could be to pre-process that string by splitting on ( and dropping everything after it, assuming that the (...) bit isn't necessary for correctly parsing the date.
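Roughly, assuming the value is first coerced to a plain string (a sketch with a hypothetical helper name, not the project's actual get_message_date()):

    import email.utils

    def parse_date_header(mail_date):
        # Coerce Header objects to str, then drop everything from the first
        # "(" onwards (the timezone-name comment) before parsing
        cleaned = str(mail_date).split("(")[0].strip()
        return email.utils.parsedate_tz(cleaned)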

@simonw (Collaborator) commented Mar 4, 2021

I imported my 10GB mbox with 750,000 emails in it, ran this tool (with a hacked fix for the blob column problem) - and now a search that returns 92 results takes 25.37ms! This is fantastic.

@simonw (Collaborator) commented Mar 4, 2021

I'm not sure why but my most recent import, when displayed in Datasette, looks like this:

[screenshot: mbox_emails table, 753,446 rows, blank rows listed first]

Sorting by id in the opposite order gives me the data I would expect - so it looks like a bunch of null/blank messages are being imported at some point and showing up first due to ID ordering.

@maxhawkins

I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.

Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?

@UtahDave (Author) commented Mar 5, 2021

I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.

Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?

@maxhawkins a limitation of the Python mbox module is that it loads the entire mbox into memory. I did find another approach that skipped the built-in mbox module and used a generator, so the whole mbox never had to be loaded into memory. I was hoping to stick to standard library modules, but this might be a good reason to investigate that approach a bit more. My worry is making sure a custom parser handles all the ins and outs of the mbox format correctly.

Hm. As I'm writing this, I thought of something: I think I can split out each message one at a time and then load each one using the Python mbox module. That way the mbox module still deals with the specifics of the mbox format, but I can use a generator.

I'll give that a try. Thanks for the feedback, @maxhawkins and @simonw.

@simonw can we hold off on merging this until I can test this new approach?

@maxhawkins

Any updates?

@maxhawkins commented Jul 22, 2021

How does this commit look? maxhawkins@72802a8

It seems that Takeout's mbox format is pretty simple, so we can get away with just splitting the file on lines beginning with From . My commit splits the file every time a line starts with From and uses email.message_from_bytes to parse each chunk.

I was able to load a 12GB takeout mbox without the program using more than a couple hundred MB of memory during the import process. It does make us lose the progress bar, but maybe I can add that back in a later commit.
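For readers following along, the splitting approach described above looks roughly like this (a simplified sketch of the idea, not the linked commit itself; it relies on mbox escaping body lines that start with "From "):

    import email

    def parse_mbox(path):
        # Buffer lines until the next mbox "From " separator, then parse
        # the buffered message; only one message is held in memory at a time
        buffer = []
        with open(path, "rb") as fp:
            for line in fp:
                if line.startswith(b"From "):
                    if buffer:
                        yield email.message_from_bytes(b"".join(buffer))
                    buffer = []
                    continue  # skip the separator line itself
                buffer.append(line)
        if buffer:
            yield email.message_from_bytes(b"".join(buffer))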

@maxhawkins

One thing I noticed is that this importer doesn't save attachments along with the bodies of the emails. It would be nice if those got stored as blobs in a separate attachments table so attachments could be included when fetching search results.

@maxhawkins

I added a follow-up commit that deals with emails that don't have a Date header: maxhawkins@4bc7010

@UtahDave (Author)

Hi @maxhawkins, I'm sorry, I haven't had any time to work on this. I'll have some time tomorrow to test your commits. I think they look great, and I'm fine with your commits superseding my initial attempt here.

@maxhawkins

I'm not sure why but my most recent import, when displayed in Datasette, looks like this:

[screenshot: mbox_emails table, 753,446 rows, blank rows listed first]

I did some investigation into this issue and made a fix here. The problem was that some messages (like gchat logs) don't have a Message-Id and we need to use X-GM-THRID as the pkey instead.
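In other words, the row id can fall back to the Gmail thread id when the Message-Id header is missing - something like this (an illustrative sketch with a hypothetical helper name):

    def message_pkey(message):
        # Prefer Message-Id, but fall back to the Gmail thread id for
        # messages (such as chat logs) that don't have one
        return message.get("Message-Id") or message.get("X-GM-THRID")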

@simonw While looking into this I found something unexpected about how sqlite_utils handles upserts if the pkey column is None. When the pkey is NULL I'd expect the function to either use rowid or throw an exception. Instead, it seems upsert_all creates a row where all columns are NULL instead of using the values provided as parameters.
