One of the pressing problems in web data extraction is separating the meaningful content of a page from cruft such as navigation links, ads, footnotes, and sidebars that promote links to other pages within the site or on other sites. This is crucial because the page will be processed for the right content and the signal/noise ratio … Continue reading Decruft: Arc90’s Readability in Python
Have you ever worked with data extracted from a random source, such as an unknown website? This can become a nightmare for developers when the encoding is not known. Further text processing without the correct encoding is error-prone. Let's see how to handle the different situations where encoding is known … Continue reading Chardet: Detecting Unknown String Encodings
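A minimal sketch of the two situations the excerpt above alludes to: decoding when the encoding is known and when it is not. It uses only the standard library and simply tries a short list of candidate encodings, which is a much cruder technique than the statistical detection Chardet performs. The helper name `decode_bytes` and the candidate list are assumptions made for illustration.

```python
def decode_bytes(raw, encoding=None, candidates=("utf-8", "cp1252")):
    """Decode raw bytes to str (hypothetical helper, not part of Chardet)."""
    # Known encoding: decode directly and let errors surface.
    if encoding is not None:
        return raw.decode(encoding)
    # Unknown encoding: try each candidate in order, strictly.
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte, so it never fails,
    # but the result may be wrong for text in another encoding.
    return raw.decode("latin-1")


known = decode_bytes("caf\u00e9".encode("utf-8"), encoding="utf-8")
guessed = decode_bytes("caf\u00e9".encode("cp1252"))
```

Trying UTF-8 first is a deliberate choice: it rejects most byte sequences that are not valid UTF-8, whereas single-byte encodings accept almost anything and so belong at the end of the list.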
GUID is a term that was bandied about in my office to signify any unique ID that we used to identify our database records. I never gave it a second thought until I heard UUID mentioned in the context of CouchDB, as something that enables distributed data storage. This piqued … Continue reading Creating Universally Unique ID in Python
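As a brief illustration of the topic named in the title above, here is a sketch using Python's standard `uuid` module. The variable names and the `example.com` name are placeholders chosen for this example, not taken from the post.

```python
import uuid

# Version 4: generated from random bits, so no coordination
# between machines is needed to avoid collisions in practice.
random_id = uuid.uuid4()

# Version 5: derived from a namespace and a name with SHA-1,
# so the same inputs always produce the same UUID.
name_id = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")

# The canonical text form is 32 hex digits in five hyphenated groups.
record_key = str(random_id)
```

A random (version 4) ID suits records created independently on different nodes, while a name-based (version 5) ID is useful when the same logical entity must map to the same key everywhere.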
If you have ever agonized over connecting to and communicating with a remote machine in Python, give Paramiko a go. Paramiko is most helpful when you need to securely communicate and exchange data, execute commands on remote machines, handle connection requests from remote machines, or access SSH services like SFTP. As described in the … Continue reading How to ssh in python using Paramiko?