Open Source world often provides us treasure troves of golden libraries. Only you might have to spend some time mining for the nuggets. WordNet is such a nugget, if there ever was one. So many words and their senses are catalogued often along with their inherent structure. A definition for each word sense and plenty of examples are also provided. Only what is missing is a well rounded algorithm for similarity comparison for words.
One of the pressing problems in web data extraction is separating meaningful content pertaining to the subject from cruft like navigation links, ads, footnotes and sidebar contents promoting links to other pages with the site or other sites.