Faster email parser #5

@rth

The default email.Parser (which converts the raw email text to a structured dict) is written in pure Python in the standard library and is somewhat slow. As a result, email parsing is the performance bottleneck when threading emails. Here is a benchmark for a dataset of 5,000 emails:

  • email.Parser: 33.329 s
  • converting to jwthreading.Message format: 0.121 s
  • the JWZ threading algorithm: 0.031 s
  • sorting of threads: 0.002 s

A solution could be to:

  • use a MIME parser from https://github.com/mailgun/flanker (though it has no PY3 support for the moment and pulls in a lot of additional dependencies)
  • adapt https://github.com/jkr/pygmime (no PY3 support either, and cross-platform support would be difficult)
  • write a custom simplified email.Parser (for the JWZ algorithm we only require the References:, In-Reply-To: and Subject: header fields)
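
To make the last option more concrete, here is a minimal sketch of what a header-only extractor might look like. It is not an existing API: the function name `extract_threading_headers` and the `WANTED` set are hypothetical, and the sketch assumes the raw message is already available as a `str`. It reads only the header block (everything before the first blank line), unfolds continuation lines, and keeps just the three fields listed above.

```python
import re

# Hypothetical default: the three fields named in this issue (lower-cased).
WANTED = frozenset({"references", "in-reply-to", "subject"})

# The header block ends at the first empty line (LF or CRLF line endings).
_HEADER_END = re.compile(r"\r?\n\r?\n")


def extract_threading_headers(raw, wanted=WANTED):
    """Return a dict of the wanted header fields found in a raw message.

    Sketch only: the body is never scanned, and RFC 2047 encoded words
    in Subject are left undecoded.
    """
    header_block = _HEADER_END.split(raw, maxsplit=1)[0]
    headers = {}
    current = None  # name of the wanted field being read, if any
    for line in header_block.splitlines():
        if not line:
            continue
        if line[0] in " \t":
            # Folded continuation line: append it to the current field.
            if current is not None:
                headers[current] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        name = name.strip().lower()
        if sep and name in wanted:
            current = name
            headers[current] = value.strip()  # a repeated field overwrites
        else:
            current = None  # skip this field and its continuation lines
    return headers


if __name__ == "__main__":
    sample = (
        "From: a@example.com\r\n"
        "Subject: Re: hello\r\n"
        " world\r\n"
        "In-Reply-To: <1@example.com>\r\n"
        "\r\n"
        "Body text\r\n"
    )
    print(extract_threading_headers(sample))
```

Whether this is actually faster than email.Parser on the 5,000-email dataset would need to be measured. The sketch also skips everything the full parser handles (MIME structure, encoded words, malformed headers), which may or may not be acceptable for threading.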
