Extracting text from email messages with JavaMail

This post is about email processing, mostly how to extract the plain text from an email message. There is a lot of buzz around (and, eventually, good use of) unstructured data processing, often referred to as Big Data processing. In our case, we have a pilot project aimed at automatic classification of documents and emails or, put abstractly, any textual content. For that purpose, our server needs to “read” the emails it receives.

I am not very familiar with historical protocols and formats, but today almost all emails are delivered as MIME messages (see MIME on Wikipedia). From a practical point of view, all delivered content carries metadata that says what it is and how it should be read.
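To make the metadata point concrete: the Content-Type header of a part names the type of the content and, for text, the charset needed to turn its bytes into characters. The sketch below is only an illustration in plain Java, not JavaMail code. The class and method names are made up for this post, it assumes the body bytes have already been transfer-decoded (base64 or quoted-printable), and it does not handle quoted parameter values that contain semicolons. JavaMail does this work for you when getContent() returns a String.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper: reads the charset parameter of a Content-Type value. */
final class CharsetFromContentTypeSketch {

    /** Returns the declared charset, or US-ASCII (the MIME default for text) if none is given. */
    static Charset charsetOf(String contentType) {
        String[] pieces = contentType.split(";");
        // pieces[0] is the type itself, e.g. "text/plain"; parameters follow
        for (int i = 1; i < pieces.length; i++) {
            String piece = pieces[i].trim();
            int eq = piece.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = piece.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = piece.substring(eq + 1).trim();
            // the value may be quoted: charset="UTF-8"
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            // throws an unchecked exception if the JVM does not know the charset
            return Charset.forName(value);
        }
        return StandardCharsets.US_ASCII;
    }

    /** Decodes already transfer-decoded body bytes using the declared charset. */
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=\"ISO-8859-1\""));
        System.out.println(charsetOf("text/plain"));
        byte[] body = "Hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, "text/plain; charset=UTF-8"));
    }
}
```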

Just as an introduction, this is how to read email messages inside a J2EE container. For now we ignore message count, windowing, etc., and we assume the reader is skilled with Java and J2EE:

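// needs: javax.naming.Context, javax.naming.InitialContext, javax.mail.*
Context ctx = new InitialContext();
Session mailSession = (javax.mail.Session) ctx.lookup("java:comp/env/mail/Session");
// IMAP_STORE and INBOX_FOLDER are constants defined elsewhere (protocol and folder name)
Store store = mailSession.getStore(IMAP_STORE);
store.connect();
Folder inbox = store.getFolder(INBOX_FOLDER);
inbox.open(Folder.READ_ONLY);

Message[] messages = inbox.getMessages();
// do something with the messages
for (Message m : messages) {
    this.content = EmailUtils.extractClearText(m, null);
}
inbox.close(false); // false: do not expunge deleted messages
store.close();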

About MIME parts: every message is either a multipart container or holds its content directly. Most current email clients send a multipart message in which one part is an HTML message and a second part is a plain-text version for email clients that do not support HTML or have it disabled. This means that a MimeMessage may (and will) contain several parts (BodyPart objects). In this post we won’t discuss all message types. For the purpose already stated (extracting the plain text from a human-written email sent by a common email client), we will assume there is a plain-text part and, alternatively, an HTML part. Let’s pretend that other parts can be happily ignored. Let’s find and read the text.

For most messages, getContent() returns an instance of the Multipart class. A Multipart contains parts, which are BodyPart objects. To check what a part is, we use the isMimeType() method.
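As the JavaMail documentation for Part.isMimeType() describes it, the comparison looks only at the primary type and the subtype: parameters such as the charset are ignored, and a subtype of * matches any subtype. The plain-Java sketch below mimics that rule so it can be tried without a mail server. It is not JavaMail's implementation, the class and method names are invented for illustration, and it compares case-insensitively because MIME type names are case-insensitive.

```java
import java.util.Locale;

/** Illustrative re-implementation of the matching rule; not JavaMail code. */
final class MimeTypeMatchSketch {

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the rest. */
    static String baseType(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns true if the Content-Type value matches a pattern like "text/plain" or "text/*". */
    static boolean matches(String contentType, String pattern) {
        String actual = baseType(contentType);
        String wanted = baseType(pattern);
        if (wanted.endsWith("/*")) {
            // wildcard subtype: compare only the primary type, slash included
            return actual.startsWith(wanted.substring(0, wanted.length() - 1));
        }
        return actual.equals(wanted);
    }

    public static void main(String[] args) {
        System.out.println(matches("text/plain; charset=UTF-8", "text/plain"));           // true
        System.out.println(matches("TEXT/HTML", "text/*"));                               // true
        System.out.println(matches("multipart/alternative; boundary=abc", "text/plain")); // false
    }
}
```

This is why the check for "text/plain" in the code of this post still succeeds when the part declares a charset.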

Generally: if we find a text/plain part in the email message, we use it; otherwise we use the text/html part. If the email was sent as a simple text message, we use its text directly.

        // needs: javax.mail.BodyPart, javax.mail.Multipart, javax.mail.internet.MimeMessage,
        //        java.util.logging.Level, org.jsoup.Jsoup
        if(message instanceof MimeMessage)
        {
            MimeMessage m = (MimeMessage)message;
            Object contentObject = m.getContent();
            if(contentObject instanceof Multipart)
            {
                BodyPart clearTextPart = null;
                BodyPart htmlTextPart = null;
                Multipart content = (Multipart)contentObject;
                int count = content.getCount();
                for(int i=0; i<count; i++)
                {
                    BodyPart part = content.getBodyPart(i);
                    if(part.isMimeType("text/plain"))
                    {
                        clearTextPart = part;
                        break; // plain text is preferred, stop searching
                    }
                    else if(part.isMimeType("text/html"))
                    {
                        // no break here: a text/plain part may still follow
                        htmlTextPart = part;
                    }
                }

                if(clearTextPart!=null)
                {
                    result = (String) clearTextPart.getContent();
                }
                else if (htmlTextPart!=null)
                {
                    // strip the tags and keep only the text
                    String html = (String) htmlTextPart.getContent();
                    result = Jsoup.parse(html).text();
                }

            }
            else if (contentObject instanceof String) // a simple text message
            {
                result = (String) contentObject;
            }
            else // content is neither a Multipart nor a String
            {
                logger.log(Level.WARNING,"not a mime part or multipart {0}",message.toString());
                result = null;
            }
        }

As you can see, we used the Jsoup project for parsing HTML. It is a simple and effective way to handle HTML and get rid of all the tags.
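One thing the loop above does not cover: it inspects only the top-level parts. Messages with attachments are commonly structured as multipart/mixed whose first part is itself a multipart/alternative, so the text parts sit one level deeper. Below is a hedged sketch of a recursive search. It uses a hypothetical SketchPart type instead of the JavaMail classes so that it runs without a mail server; none of these names come from JavaMail.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch: prefer text/plain, fall back to text/html, searching nested multiparts. */
final class TextPartFinderSketch {

    /** Hypothetical stand-in for a MIME part: a leaf with text, or a container with children. */
    static final class SketchPart {
        final String mimeType;           // e.g. "text/plain" or "multipart/alternative"
        final String text;               // body of a leaf part, null for containers
        final List<SketchPart> children; // empty for leaf parts

        private SketchPart(String mimeType, String text, List<SketchPart> children) {
            this.mimeType = mimeType;
            this.text = text;
            this.children = children;
        }

        static SketchPart leaf(String mimeType, String text) {
            return new SketchPart(mimeType, text, Collections.<SketchPart>emptyList());
        }

        static SketchPart multipart(String subtype, SketchPart... children) {
            return new SketchPart("multipart/" + subtype, null, Arrays.asList(children));
        }
    }

    /** Depth-first search for the first leaf of the given type; returns null if there is none. */
    static SketchPart findFirst(SketchPart part, String mimeType) {
        if (part.children.isEmpty()) {
            return part.mimeType.equalsIgnoreCase(mimeType) ? part : null;
        }
        for (SketchPart child : part.children) {
            SketchPart found = findFirst(child, mimeType);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Prefers text/plain, falls back to text/html, returns null if neither exists. */
    static SketchPart chooseTextPart(SketchPart root) {
        SketchPart plain = findFirst(root, "text/plain");
        return plain != null ? plain : findFirst(root, "text/html");
    }

    public static void main(String[] args) {
        SketchPart message = SketchPart.multipart("mixed",
                SketchPart.multipart("alternative",
                        SketchPart.leaf("text/html", "<p>Hello</p>"),
                        SketchPart.leaf("text/plain", "Hello")),
                SketchPart.leaf("application/pdf", null));
        SketchPart chosen = chooseTextPart(message);
        System.out.println(chosen.mimeType + ": " + chosen.text); // prints "text/plain: Hello"
    }
}
```

With JavaMail, the same recursion would call getContent() on each part and descend whenever the result is a Multipart. Real code would probably also skip parts whose getDisposition() equals Part.ATTACHMENT, so that an attached text file is not mistaken for the message body.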

This blog serves as my personal notepad, but if anybody finds a good use for it, feel free to share, link or comment.

The author is a senior consultant at Apogado.


  1. #1 by Willy Ristanto on October 5, 2012 - 17:28

    Thanks….Very Nice !! It’s really helpful 🙂

  2. #2 by Arkden on December 6, 2013 - 10:46

    Thank you very much! It is really helpful!! 😀

  3. #3 by Pablo on March 26, 2014 - 14:54

    Work for unread messages?

    • #4 by Gabriel on March 26, 2014 - 15:12

      Hello Pablo. Indeed, it uses the same protocol as your email client under the hood. So once read, they are marked as read.

  4. #5 by Pablo on March 26, 2014 - 15:34

    Man, what you mean with: Object contentObject = m.getContent();

    Object, contentObject ???

    I having a exception here;

    Exception in thread “main” java.io.IOException: No MimeMessage content

  5. #6 by Pablo on March 26, 2014 - 15:46

    In this code, where he print the content message??

  6. #7 by joe7pak on April 9, 2014 - 17:24

    Thanks for this … much needed.

    A comment …. did you forget to put a ‘break;’ after finding the html part? You did that after finding the text part.

