
Email Classification – Supervised Learning

Background:  Learning to Data Crunch

The first challenge I set for myself was email classification.

While spam classification might seem like beating a dead horse, my reasons for choosing it are as follows:

  • It had to be based on text processing, as that is what most of my personal projects are based on.
  • It is an ideal candidate for a beginner because it is a well-studied problem, so getting a huge corpus with truckloads of tagged messages was very easy.
To make it more interesting and challenging, this experiment aims to compare the performance of supervised learning with that of semi-supervised learning.

If you find what I have written below to be a bit incoherent, that’s because it was written as I progressed in solving the problem, rather than after it was done.

Corpus Stats

  1. The corpus is from TREC 2005 Spam Public Corpora.
  2. It has a total of 92,189 messages.
  3. The total number of spam and ham messages is 52,790 and 39,399 respectively.

 

The Idea

2000 messages are selected at random for testing (1000 spam, 1000 ham) and another 2000 messages are selected at random for validation. I’m going to use the same training corpus to run both a supervised learning process and a semi-supervised learning algorithm, and see how the semi-supervised algorithm measures up to the supervised one.

Supervised learning: the remaining 88,189 messages are used as training data.

Semi-supervised learning: treat an initial batch of 5000 messages as tagged, then work through the rest of the data as unlabelled batches of 5000 each.
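As a rough illustration of the split described above, here is a minimal sketch. It is not the code from the repository; the function names, the `labels` dictionary and the seed are assumptions made for the example, and it assumes the validation set is drawn at random from whatever is left after the balanced test set is taken out.

```python
import random


def split_corpus(labels, n_test_per_class=1000, n_validation=2000, seed=42):
    """Split message ids into training, validation and test lists.

    `labels` is a hypothetical dict mapping a message id to "spam" or "ham".
    """
    rng = random.Random(seed)
    spam = [mid for mid, label in labels.items() if label == "spam"]
    ham = [mid for mid, label in labels.items() if label == "ham"]
    rng.shuffle(spam)
    rng.shuffle(ham)

    # Balanced test set: the same number of spam and ham messages.
    test = spam[:n_test_per_class] + ham[:n_test_per_class]

    # Everything else is shuffled together; validation is drawn from it.
    rest = spam[n_test_per_class:] + ham[n_test_per_class:]
    rng.shuffle(rest)
    validation = rest[:n_validation]
    training = rest[n_validation:]
    return training, validation, test


def semi_supervised_batches(training, batch_size=5000):
    """Treat the first batch as labelled and the rest as unlabelled batches."""
    labelled = training[:batch_size]
    unlabelled = [training[i:i + batch_size]
                  for i in range(batch_size, len(training), batch_size)]
    return labelled, unlabelled
```

With 88,189 training messages and a batch size of 5000, this gives one labelled batch followed by 17 unlabelled batches, the last of which holds the remaining 3189 messages.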

The features I plan to use, besides the bag of words method:

  1. html or plaintext
  2. attached images count
  3. attached videos count
  4. attached applications count
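The non-text features listed above could be pulled out of a raw message with Python's standard `email` package. This is only a sketch of one possible approach, not what `process_email.py` necessarily does; the function name and the dictionary keys are made up for the example, and it treats any `text/html` part (including an HTML attachment) as making the message "html".

```python
from email import message_from_string


def structural_features(raw_message):
    """Count the structural features of one raw RFC 822 message string."""
    msg = message_from_string(raw_message)
    features = {"is_html": 0, "images": 0, "videos": 0, "applications": 0}
    # walk() visits the message itself and every MIME part inside it.
    for part in msg.walk():
        main_type = part.get_content_maintype()
        if part.get_content_type() == "text/html":
            features["is_html"] = 1
        elif main_type == "image":
            features["images"] += 1
        elif main_type == "video":
            features["videos"] += 1
        elif main_type == "application":
            features["applications"] += 1
    return features
```

These four numbers can then be appended as extra columns next to the bag-of-words counts.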

Implementing Supervised Learning Algorithm

The initial number of features was greater than 4.4 lakh (0.44 million). These were mostly bag-of-words terms, besides the number of images, applications or videos attached to the message.

I reduced the features further by removing any word that appeared in only one document.
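One way to do this kind of pruning is to count, for each word, the number of documents it occurs in and keep only words seen in at least two. The sketch below assumes the emails are already tokenized into lists of words; the function name and the `min_docs` parameter are illustrative, not taken from the repository.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_docs=2):
    """Return the set of words that occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() so a word repeated inside one document counts only once.
        doc_freq.update(set(tokens))
    return {word for word, n_docs in doc_freq.items() if n_docs >= min_docs}
```

Note that a word repeated many times inside a single email is still dropped, because it tells the classifier nothing about any other message.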

The number of emails = 88189 and the number of features = 190562. Assuming I need 1 byte to store a feature per email, a dense matrix would take 16805472218 bytes, or roughly 16.8 GB :)
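Working through the multiplication for a dense matrix, with one byte per cell as assumed in the text:

```python
n_emails = 88189
n_features = 190562
bytes_per_cell = 1  # assumption from the text: one byte per feature per email

dense_bytes = n_emails * n_features * bytes_per_cell
dense_gb = dense_bytes / 10**9  # decimal gigabytes

print(dense_bytes)         # 16805472218
print(round(dense_gb, 1))  # 16.8
```

That is far too much to hold comfortably in memory on a personal machine, even before using a wider data type than one byte.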

Refining further, let’s remove any email address, URL or number from the vocabulary. This removes 58201, 7446 and 6417 terms respectively. Thus the terms come down to 123286.

Sparse Matrix to the rescue!

Using a sparse matrix from scipy greatly improves the performance. The classifier I have chosen is a Support Vector Classifier with the RBF kernel.
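The last two steps, dropping email addresses, URLs and numbers from the vocabulary and then storing the counts sparsely, might look roughly like the sketch below. The regular expressions are deliberately simple guesses at what counts as an address, URL or number, and the function names are invented for the example; the repository may do this differently. Only `scipy.sparse.csr_matrix` is a real library call here.

```python
import re

from scipy.sparse import csr_matrix

# Simplified, assumed patterns; real-world addresses and URLs are messier.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
URL = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
NUMBER = re.compile(r"\d+([.,]\d+)*")


def keep_term(term):
    """True if the term is not an email address, URL or number."""
    return not (EMAIL.fullmatch(term)
                or URL.fullmatch(term)
                or NUMBER.fullmatch(term))


def build_sparse_counts(tokenized_docs):
    """Build a documents x terms count matrix that stores only non-zero cells."""
    vocab = {}
    rows, cols, vals = [], [], []
    for row, tokens in enumerate(tokenized_docs):
        for term in tokens:
            if not keep_term(term):
                continue
            col = vocab.setdefault(term, len(vocab))
            rows.append(row)
            cols.append(col)
            vals.append(1)
    # Repeated (row, col) pairs are summed, which yields the term counts.
    matrix = csr_matrix((vals, (rows, cols)),
                        shape=(len(tokenized_docs), len(vocab)))
    return matrix, vocab
```

Because a single email uses only a tiny fraction of the vocabulary, the number of stored cells (`matrix.nnz`) stays far below the number of emails times the number of terms.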

Execution

The only input required is datadir, the location of the untarred dataset: /path/to/trec05p-1

python process_email.py $datadir

process_email.py parses the email and stores the extracted data in the database for later processing.

python create_data_sets.py $datadir

create_data_sets.py splits the email_index into training (supervised and semi-supervised), validation and testing sets.

python supervisor_learn.py $datadir

supervisor_learn.py trains the supervised model.

The code is hosted at https://github.com/sharmi/emailspam_ml

The supervised model is trained with C=1 and gamma=0.001. The error rate on predicting the test set is 6.35%. I’m using the error rate directly as the measure, since the spam and ham messages are nearly equal in distribution.
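For reference, the error rate is simply the fraction of test messages that were labelled wrongly. If the test set is the 2000 messages described earlier, 6.35% works out to 127 misclassified messages. The sketch below uses made-up predictions purely to show the calculation; which messages were actually misclassified is not known from this post.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    wrong = sum(1 for truth, guess in zip(y_true, y_pred) if truth != guess)
    return wrong / len(y_true)


# Balanced test set: 1000 spam and 1000 ham.
y_true = ["spam"] * 1000 + ["ham"] * 1000

# Hypothetical predictions with 127 mistakes (placement is arbitrary).
y_pred = list(y_true)
for i in range(127):
    y_pred[i] = "ham"

print(f"{error_rate(y_true, y_pred):.2%}")  # prints 6.35%
```

On a heavily imbalanced test set this number would be misleading, which is why the balanced split matters here.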

 

I suppose these are good results. The grid search over combinations of C and gamma is currently in progress. At the rate it is going, it will be days before it completes. Once it does, I will post the results.
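For anyone curious what such a search looks like, here is a small sketch using scikit-learn's `GridSearchCV` on random toy data. I am assuming scikit-learn here; the grid values (other than C=1 and gamma=0.001, which are the ones used above), the toy data and the 2-fold cross-validation are all illustrative choices, not the actual setup.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in for the real feature matrix: 40 "emails", 10 "terms".
rng = np.random.RandomState(0)
X = csr_matrix(rng.randint(0, 3, size=(40, 10)).astype(float))
y = np.array([0, 1] * 20)  # alternating ham/spam labels

# Every combination of C and gamma is trained and cross-validated.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=2)
search.fit(X, y)

best = search.best_params_
print(best, search.best_score_)
```

Each extra value in the grid multiplies the number of models to train, and training an RBF-kernel SVM slows down quickly as the number of samples grows, so a search over tens of thousands of emails can easily run for days.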

Learning to data crunch – My foray in Machine Learning

Those who know me well will know that I took a break from my job for my pregnancy and am now enjoying my son’s company. But my hands itched to code and to create. I have been fleshing out my pet project and experimenting with crawling and natural language processing. The project is highly ambitious and may take several man-years to materialize. Some of the side projects, like decruft, were good enough to give back to the community. Yet most of them are still works in progress, and the more I do, the more I learn.

Working on the project helped me identify areas in which to improve my knowledge. So when ‘Artificial Intelligence’ and ‘Machine Learning’ were offered as online courses, I jumped at the chance to take them. At the time I did not have any clear notion of what would be taught in machine learning, except that it would turn out to be useful to me. Now, after the classes, it feels like whole new worlds have been thrown open to me. Now it is up to me to take the baby steps. I have decided to explore the world of machine learning. The magic of emerging patterns where none seemed to exist. Creating crystal balls to predict what the next input might do. Swishing wands that coax meaning out of random numbers.

As those who took the machine learning classes will know, they were very lucid and had lots of quizzes and exercises. On the other hand, the programming exercises had lots of hand-holding. That was very convenient then, when I was constrained for time, but it has its disadvantages. The best way to understand anything intimately is to build it yourself, from the ground up. With that in mind, I’m getting my hands dirty solving my first machine learning problem. I have decided to blog all my exercises, as this will help me get feedback easily from the community and also track my progress over time. It will be apparent to you that this is not the most updated blog around. Hopefully that will improve in the months to come. So without more ado, let’s move on to the next post to see the first challenge that I have set for myself.

The story of Sugru

Today I came upon a link on Hacker News. When I clicked on the link, I was not expecting much, except that there might be something interesting, considering that it was on Hacker News’ front page. What I saw blew me away. It’s the story of how Sugru was dreamt about in February 2003, and how it evolved through conceptualisation, fleshing out the product, funding, branding and what not to where it is today, serving customers in all 7 continents.

The page is so beautiful, light and well laid out, kinda like a beautifully told fairy tale that we know will have a happy/magical ending. But the reality would have been anything but that. Years of labour, sweat, empty pockets and uncertainty about what the future holds. Yet they persevered, and powered on through. In today’s era of entrepreneurs, where everyone seems to make it big overnight, where an idea flashes into the mind and immediately everything falls into place, Sugru shows the real grit and determination that is essential to make great products. You know, there is some stuff that does take half a decade to build! And every minute of it is totally worth it! Kudos, team!

So why do I blog about it? I don’t think my blog would have even an infinitesimal effect on their online presence. So what’s the ‘head fake’ in this, as Randy Pausch would call it? This is not for the world or Sugru. This is for me. This story affected me so much that it is not enough to bookmark it or tweet it. It needs to be etched in stone, or in something as close to it as possible. So on some rainy day, when everything seems cold, dark and cloudy, I will know where the fireplace is.

 

 

Follow the Cheshire Cat!

Penning My Thoughts

From my notebook – 4th August 2011

Today I’ve been consumed by a great urge to write. Write into a notebook. I dunno why, but I’ve never been very successful writing/typing into my laptop. One reason is that I’ve never found good software with the right balance of online and offline support to act like an always-available notebook. Another is all those open windows that are so distracting. So anyways, I felt it would be fun to yield to this wish and see where it goes.

I used to do a lot of writing while at school. Now, though, I usually avoid writing of all kinds unless it has a ‘real’ purpose. This is because my time is split between managing my house, caring for my son and working on my pet project. But today I will just let my intuition take over. Of late I find that my intuition is most often right. What is most surprising is that the moment I took my pen in my hand, I felt so refreshed, and the tiredness that seemed to be clogging my mind evaporated. Maybe I should write more. Maybe I should listen to my intuition a little more, and that might really help me handle the restlessness I occasionally feel.

We’ve been doing a pet project for some time now (monstrous in resources though!). It’s like holding a candle in the dark. I know I need to push forward, but I see only a few paces at a time. But I wholly enjoy it and my intuition says it will work. Time will validate that!

Now I feel like blog posting all that I wrote and so here it is!

PS: The feel of my pen moving smoothly on paper, tracing out my feelings as text, is so amazing!