Archive | January, 2012

Email Classification – Supervised Learning

Background:  Learning to Data Crunch

The first challenge I set for myself was email classification.

While spam classification might seem like beating a dead horse, here are some of my reasons for choosing it:

  • It had to be based on text processing, as that is what most of my personal projects are based on.
  • It is the ideal candidate for a beginner, as it is a well-studied problem. Hence getting a huge corpus with truckloads of tagged messages was very easy.

To make it more interesting and challenging, this experiment is aimed at comparing the performance of supervised learning with semi-supervised learning.

If you find what I have written below to be a bit incoherent, that’s because it was written as I progressed in solving the problem, rather than after it was done.

Corpus Stats

  1. The corpus is from TREC 2005 Spam Public Corpora.
  2. It has a total of 92,189 Messages.
  3. The total numbers of spam and ham messages are 52,790 and 39,399 respectively.

 

The Idea

2000 messages are selected at random for testing (1000 spam, 1000 ham) and 2000 messages are selected at random for validation.  I’m going to use the same training corpus to run a supervised learning algo and a semi-supervised learning algo, and see how the performance of the semi-supervised one measures up to the supervised one.

Supervised Learning: 88,189 messages as training data.

Semi-supervised Learning: treat an initial batch of 5000 as tagged, and then work through the rest of the data as unlabelled batches of 5000 each.
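The split described above can be sketched roughly as follows. This is only an illustration: the function names, the `labels` dictionary and the fixed seed are assumptions made for the example, not the code in the repository.

```python
import random


def split_corpus(labels, n_test_per_class=1000, n_validation=2000, seed=0):
    """Split message ids into test, validation and training lists.

    `labels` maps a message id to "spam" or "ham" (hypothetical structure).
    """
    rng = random.Random(seed)
    spam = [m for m, y in labels.items() if y == "spam"]
    ham = [m for m, y in labels.items() if y == "ham"]
    rng.shuffle(spam)
    rng.shuffle(ham)

    # Balanced test set: the same number of spam and ham messages.
    test = spam[:n_test_per_class] + ham[:n_test_per_class]

    # The remaining messages are pooled; validation is drawn at random.
    rest = spam[n_test_per_class:] + ham[n_test_per_class:]
    rng.shuffle(rest)
    validation = rest[:n_validation]
    training = rest[n_validation:]
    return test, validation, training


def semi_supervised_batches(training, batch_size=5000):
    """Return the first batch (treated as tagged) and the unlabelled batches."""
    batches = [training[i:i + batch_size]
               for i in range(0, len(training), batch_size)]
    return batches[0], batches[1:]
```

With the corpus sizes quoted above, this leaves 88,189 training messages: one tagged batch of 5000 followed by 17 unlabelled batches, the last of which is smaller than 5000.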

The features I plan to use besides the bag of words method:

  1. html or plaintext
  2. attached images count
  3. attached videos count
  4. attached applications count
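As a rough sketch, the four extra features could be pulled out of a raw message with Python's standard `email` package. The function name and the feature keys are assumptions made for this example; the actual extraction in process_email.py may differ.

```python
from email import message_from_string


def extra_features(raw_message):
    """Count the non-bag-of-words features for one raw email (a sketch)."""
    msg = message_from_string(raw_message)
    features = {"is_html": 0, "images": 0, "videos": 0, "applications": 0}

    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        main_type = part.get_content_maintype()
        if part.get_content_type() == "text/html":
            features["is_html"] = 1
        elif main_type == "image":
            features["images"] += 1
        elif main_type == "video":
            features["videos"] += 1
        elif main_type == "application":
            features["applications"] += 1
    return features
```

A message with no HTML part and no attachments simply yields zeros for all four features.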

Implementing Supervised Learning Algorithm

The initial number of features was greater than 4.4L (0.44M). This was mostly bag of words, besides the number of images, applications or videos attached in the content.

I reduced the features further by removing any word that did not appear in more than one document.
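One simple way to do this kind of pruning is to count, for each word, the number of documents it appears in and keep only the words seen in at least two. The sketch below assumes the emails are already tokenized into lists of words; the function name and the `min_docs` parameter are made up for the example.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_docs=2):
    """Keep only the words that occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() makes sure a word is counted once per document,
        # no matter how often it is repeated inside that document.
        doc_freq.update(set(tokens))
    return {word for word, n in doc_freq.items() if n >= min_docs}


docs = [
    ["free", "money", "free"],
    ["money", "meeting"],
    ["meeting", "agenda"],
]
vocabulary = prune_vocabulary(docs)
```

Here "free" is dropped even though it occurs twice, because both occurrences are in the same document.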

Number of emails = 88189, number of features = 190562. Assuming I need 1 byte to store a feature per email, a dense matrix would take 16805472218 bytes, or 16.8 GB approx :)
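The arithmetic behind that estimate, assuming a dense matrix with exactly one byte per (email, feature) cell:

```python
n_emails = 88189
n_features = 190562

# One byte per cell of a dense emails x features matrix.
dense_bytes = n_emails * n_features
dense_gb = dense_bytes / 1e9  # decimal gigabytes

print(dense_bytes)         # 16805472218
print(round(dense_gb, 1))  # 16.8
```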

Refining further, let’s remove any email address, URL or number from the vocabulary. This removes 58201, 7446 and 6417 terms respectively. Thus the terms come down to 123286. Sparse Matrix to the rescue!
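A minimal sketch of the vocabulary clean-up just described, assuming each term is a single whitespace-free token. The regular expressions are deliberately simple and are assumptions made for this example, not the patterns used in the repository.

```python
import re

# Illustrative patterns only; real-world emails and URLs are messier.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
NUMBER_RE = re.compile(r"^[+-]?\d+([.,]\d+)*$")


def is_noise(term):
    """True if the term looks like an email address, a URL or a number."""
    return bool(EMAIL_RE.match(term)
                or URL_RE.match(term)
                or NUMBER_RE.match(term))


def clean_vocabulary(vocabulary):
    """Drop email addresses, URLs and numbers from a set of terms."""
    return {term for term in vocabulary if not is_noise(term)}


cleaned = clean_vocabulary(
    {"hello", "bob@example.com", "http://x.com/a", "12345", "3.14", "viagra"}
)
```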

Using a sparse matrix from scipy greatly improves the performance. The classifier I have chosen is a Support Vector Classifier, and it uses the RBF kernel.
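To give an idea of how the pieces fit together, here is a small sketch that builds a scipy CSR matrix from (email, term, count) triples and fits scikit-learn's `SVC` with the RBF kernel on it. The toy data is invented for the example, and using scikit-learn here is an assumption; C=1 and gamma=0.001 are the values used for the trained model in this post.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import SVC

# Toy term counts: one (row = email, column = term, count) triple per
# non-zero cell. Only these cells are stored, unlike a dense matrix.
rows = np.array([0, 0, 1, 1, 2, 2, 3, 3])
cols = np.array([0, 1, 0, 2, 3, 4, 3, 5])
counts = np.array([3, 1, 2, 1, 2, 1, 1, 2])
X = csr_matrix((counts, (rows, cols)), shape=(4, 6), dtype=np.float64)
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham (toy labels)

# SVC accepts scipy sparse matrices directly.
clf = SVC(kernel="rbf", C=1, gamma=0.001)
clf.fit(X, y)
predictions = clf.predict(X)
```

The matrix above stores 8 values instead of 24; on a vocabulary of over a hundred thousand terms, where each email uses only a tiny fraction of them, the saving is far larger.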

Execution

The only input required is datadir, the location of the untarred dataset /path/to/trec05p-1.

python process_email.py $datadir

process_email.py parses the email and stores the extracted data in the database for later processing.

python create_data_sets.py $datadir

create_data_sets.py splits the email_index into training sets (supervised and semi-supervised), a validation set and a testing set.

python supervisor_learn.py $datadir

supervisor_learn.py trains the supervised model.

The code is hosted at https://github.com/sharmi/emailspam_ml

The supervised model is trained with C=1 and gamma=0.001. The error rate on predicting the test set is 6.35%. I’m directly using the error rate as the measure, as the spam and ham messages are nearly equal in distribution.
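The error rate here is simply the fraction of test messages that were misclassified. A small sketch, with a made-up function name and synthetic labels; if the test set is the 2000-message one described earlier, 6.35% works out to 127 wrong predictions:

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Synthetic example: 2000 labels with the first 127 predictions flipped.
y_true = [1] * 1000 + [0] * 1000
y_pred = [1 - label for label in y_true[:127]] + y_true[127:]
rate = error_rate(y_true, y_pred)  # 127 / 2000
```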

 

I suppose these are good results. Currently the grid search for combinations of C and gamma is in progress. At the rate it is going, it will be days before it completes. Once it completes, I will post the results.
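For readers unfamiliar with grid search, the idea is to train one model per (C, gamma) combination and keep the best-scoring one. Below is a tiny sketch using scikit-learn's `GridSearchCV` on synthetic data. The parameter grid, the synthetic data set and the use of 3-fold cross-validation (instead of a separate validation set) are all assumptions made for the example, not the settings of the search that is running.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic data set standing in for the email features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid: every combination of C and gamma is tried.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.0001, 0.001, 0.01]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_        # e.g. {"C": ..., "gamma": ...}
best_error = 1.0 - search.best_score_    # mean cross-validated error rate
```

Each grid point means training the SVM again (here three times, once per fold), which is why a search over the full corpus can take days.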

Learning to data crunch – My foray in Machine Learning

Those who know me well would know that I had taken a break from my job for my pregnancy and am now enjoying my son’s company.  But my hands itched to code and to create. I have been fleshing out my pet project and experimenting in crawling and natural language processing.  The project is highly ambitious and may take several man-years to materialize. Some of the side projects, like decruft, were good enough to give back to the community. Yet most of them are still works in progress, and the more I do, the more I learn.

Working on the project helped me identify areas to improve my knowledge in.  So when ‘Artificial Intelligence’ and ‘Machine Learning’ were offered as online courses, I jumped at the chance to take them.  At the time I did not have any clear notion of what would be taught in machine learning, except that it would turn out to be useful to me. Now, after the classes, it feels like whole new worlds have been thrown open to me.  Now it is up to me to take the baby steps. I have decided to explore the world of machine learning.  The magic of emerging patterns where none seemed to exist. Creating crystal balls to predict what that next input might do.  Swishing wands that coax meaning out of random numbers.

As those who have taken the machine learning classes would know, the lectures were very lucid and had lots of quizzes and exercises.  On the other hand, the programming exercises had lots of hand holding.  That was very convenient then, when I was constrained for time, but it has its disadvantages.  The best way to understand anything intimately is to build it up yourself, from the ground up.  With that in mind, I’m getting my hands dirty solving my first machine learning problem.  I have decided on blogging all my exercises, as this would help me get feedback easily from the community and also track my progress over time.  It would be apparent to you that this is not the most updated blog around.  Hopefully it will improve in the months to come.  So without more ado, let’s move on to the next post to see the first challenge that I have set for myself.

iTunes on Linux Mint – TunesViewer

Well,

The title is kinda misleading, as no one has yet produced a satisfactorily working installation of iTunes on Linux, not even with Wine.  I tried, and failed.  To put it in context, a few days back I came across the article “Harvard Statistics 110: Introduction to Probability, on iTunes” on Hacker News.  I was looking at options to learn prob and stats properly and this course offered the best coverage I had ever seen.  I wanted to use the resource.  I tried installing Wine and then iTunes on Wine.  But it did not work.  Most search results were dead ends.  So I went back to the Hacker News post and looked at the comments section for clues. http://news.ycombinator.com/item?id=3469393

That is where I came across the post by iamabhi9 recommending TunesViewer.  TunesViewer, in their own words, turns out to be “a small, easy to use program to access itunes-university media and podcasts in Linux.” Caveat: This will not let you connect to iTunes store accounts or buy anything.

I do not care about iTunes store accounts as I am only interested in the university content.  Once installed, it blew my mind.  Everything from arts, history, brain evolution, statistics, design and culinary skills to network security is there. It also provides the video links so that you can download the videos and view them at your leisure. I’m attaching a screen shot of the app at the end.  There is also an Android version for those who are interested. http://sourceforge.net/projects/tunesviewer/files%2FAndroid/

This is a shout-out to the TunesViewer developers, who have done an excellent job.