Decruft: Arc90’s Readability in Python

One of the pressing problems in web data extraction is separating the meaningful content on a page from cruft such as navigation links, ads, footnotes, and sidebars promoting links to other pages within the site or on other sites. This is crucial as the page would be processed for the right content and the signal/noise ratio …

Chardet: Detecting Unknown String Encodings

Have you ever worked with data extracted from an arbitrary source, such as an unknown website? This can become a nightmare for developers because the encoding may be impossible to determine. Further text processing without the correct encoding is error-prone. Let's see how to handle the different situations where encoding is known …

How to SSH in Python using Paramiko?

If you have ever agonized over connecting to and communicating with a remote machine in Python, give Paramiko a go. Paramiko is most helpful when one needs to communicate and exchange data securely, execute commands on remote machines, handle connection requests from remote machines, or access SSH services like SFTP. As described in the …