Extracting text from email messages with JavaMail

This post is about email processing, mostly about how to extract the clear text from an email message. There is a lot of buzz around (and eventually some good use for) processing unstructured data, often referred to as Big Data processing. In my case, we have a pilot project aimed at automatic classification of documents and emails or, more abstractly, any textual content. For that purpose our server needs to “read” the received emails.

I am not very familiar with historical protocols and formats, but today almost all emails are delivered as MIME messages (see MIME on Wikipedia). From a practical point of view, all delivered content carries metadata that says what it is and how it should be read.
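
To make the metadata idea concrete: the Content-Type header of each part carries a type/subtype pair plus optional parameters such as the charset. JavaMail parses this for you, so the following is only a minimal sketch in plain Java of what such a check involves. The class and method names (ContentTypeSketch, baseType, isType) are made up for illustration and are not JavaMail API, and real header parsing has more corner cases than this handles.

```java
import java.util.Locale;

public class ContentTypeSketch {

    // Returns the lower-cased "type/subtype" portion of a Content-Type value,
    // dropping parameters such as charset or boundary. (Hypothetical helper.)
    static String baseType(String headerValue) {
        int semicolon = headerValue.indexOf(';');
        String base = semicolon >= 0 ? headerValue.substring(0, semicolon) : headerValue;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Case-insensitive comparison of the base type against an expected MIME type.
    static boolean isType(String headerValue, String expected) {
        return baseType(headerValue).equals(expected.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(baseType("text/plain; charset=UTF-8"));
        System.out.println(isType("Multipart/Alternative; boundary=\"abc\"", "multipart/alternative"));
    }
}
```

In the real code further down, the isMimeType() method plays the role of isType here.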

Just as an introduction, this is how to read email messages inside a J2EE container. For now we will ignore message counts, windowing and so on, and we assume the reader is familiar with Java and J2EE:

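import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Store;
import javax.naming.Context;
import javax.naming.InitialContext;

// Look up the mail session configured in the container
Context ctx = new InitialContext();
Session mailSession = (javax.mail.Session) ctx.lookup("java:comp/env/mail/Session");
Store store = mailSession.getStore(IMAP_STORE);   // protocol name constant, defined elsewhere
store.connect();
Folder inbox = store.getFolder(INBOX_FOLDER);     // folder name constant, defined elsewhere
inbox.open(Folder.READ_ONLY);

Message[] messages = inbox.getMessages();
for (Message m : messages) {
    // do something with each message; here every message overwrites the previous content
    this.content = EmailUtils.extractClearText(m, null);
}
inbox.close(false);   // false = do not expunge deleted messages
store.close();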

About MIME parts: every message either is a multipart container or holds its content directly. Most current email clients send a multipart message in which one part is an HTML message and a second part is a plain text message for email clients that do not support HTML or have it disabled. This means that a MimeMessage may (and usually will) contain several parts (BodyPart objects). In this post we won’t discuss all message types. For the purpose already stated (extracting the clear text from a human-written email sent by a common email client), we will assume there is a plain text part and, alternatively, an HTML part, and pretend that all other parts can be happily ignored. Let’s find and read the text.

The content of most messages is an instance of the Multipart class. A Multipart contains parts, which are BodyPart objects. To check what a part is, we will use the isMimeType() method.

Generally, if we find a text/plain part in the email message, we will use it; otherwise we will use the text/html part. If the email was sent as a simple text message, we will use its text directly.

        if(message instanceof MimeMessage)
        {
            MimeMessage m = (MimeMessage)message;
            Object contentObject = m.getContent();
            if(contentObject instanceof Multipart)
            {
                BodyPart clearTextPart = null;
                BodyPart htmlTextPart = null;
                Multipart content = (Multipart)contentObject;
                int count = content.getCount();
                for(int i=0; i<count; i++)
                {
                    BodyPart part =  content.getBodyPart(i);
                    if(part.isMimeType("text/plain"))
                    {
                        clearTextPart = part;
                        break;
                    }
                    else if(part.isMimeType("text/html"))
                    {
                        htmlTextPart = part;
                    }
                }

                if(clearTextPart!=null)
                {
                    result = (String) clearTextPart.getContent();
                }
                else if (htmlTextPart!=null)
                {
                    String html = (String) htmlTextPart.getContent();
                    result = Jsoup.parse(html).text();
                }

            }
            else if (contentObject instanceof String) // a simple text message
            {
                result = (String) contentObject;
            }
            else // content is neither a multipart nor a plain string
            {
                logger.log(Level.WARNING,"not a mime part or multipart {0}",message.toString());
                result = null;
            }
        }
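
One case the loop above does not cover: it only inspects the top-level parts. A message with an attachment is often a multipart/mixed container whose first part is itself a multipart/alternative, so neither text part is found at the top level. Below is a minimal sketch of a recursive search for that situation. To keep it runnable without the JavaMail jar, it is written against a hypothetical SimplePart interface that only mimics the idea of isMimeType() and getContent(); SimplePart, Leaf, Container and findFirst are illustrative names, not JavaMail API. With real JavaMail I would expect the same recursion to apply wherever getContent() of a part returns a Multipart.

```java
import java.util.Arrays;
import java.util.List;

public class NestedPartsSketch {

    // Hypothetical stand-in for a MIME part; not part of JavaMail.
    interface SimplePart {
        boolean isMimeType(String mimeType);
        Object getContent(); // a String for text parts, a List of parts for containers
    }

    // A leaf part holding text.
    static final class Leaf implements SimplePart {
        private final String type;
        private final String text;
        Leaf(String type, String text) { this.type = type; this.text = text; }
        public boolean isMimeType(String mimeType) { return type.equalsIgnoreCase(mimeType); }
        public Object getContent() { return text; }
    }

    // A multipart container holding child parts.
    static final class Container implements SimplePart {
        private final String type;
        private final List<SimplePart> children;
        Container(String type, SimplePart... children) {
            this.type = type;
            this.children = Arrays.asList(children);
        }
        public boolean isMimeType(String mimeType) { return type.equalsIgnoreCase(mimeType); }
        public Object getContent() { return children; }
    }

    // Depth-first search for the first text part of the wanted type; null if none exists.
    static String findFirst(SimplePart part, String wantedType) {
        Object content = part.getContent();
        if (part.isMimeType(wantedType) && content instanceof String) {
            return (String) content;
        }
        if (content instanceof List) {
            for (Object child : (List<?>) content) {
                String found = findFirst((SimplePart) child, wantedType);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SimplePart message = new Container("multipart/mixed",
                new Container("multipart/alternative",
                        new Leaf("text/plain", "Hello"),
                        new Leaf("text/html", "<p>Hello</p>")),
                new Leaf("application/pdf", "(binary data)"));

        // Prefer plain text, fall back to HTML, same rule as in the loop above.
        String text = findFirst(message, "text/plain");
        if (text == null) {
            text = findFirst(message, "text/html");
        }
        System.out.println(text); // prints "Hello"
    }
}
```

The preference rule stays the same as before (plain text first, HTML second); only the search goes deeper than the first level.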

As you can see, we used the Jsoup project to parse the HTML. It is a simple and effective way to process HTML and get rid of all the tags.

This blog serves as my personal notepad, but if anybody finds it useful, feel free to share, link or comment.

The author is a senior consultant at Apogado.
