by Erik Rose
If you've ever wanted to get started with parsers, here's your chance for a ground-floor introduction. A harebrained spare-time project gives birth to a whirlwind journey from basic algorithms to Python libraries and, at last, to a parser for one of the craziest syntaxes out there: the MediaWiki grammar that drives Wikipedia.
Some languages were designed to be parsed. The most obvious example is Lisp and its relatives which are practically parsed when they hit the page. However, many others—including most wiki grammars—grow organically and get turned into HTML by sedimentary strata of regular expressions, all backtracking and warring with one another, making it difficult to output other formats or make changes to the language.
We will explore the tools and techniques necessary to attack one of the hairiest lingual challenges out there: MediaWiki syntax. Join me for an introduction to the general classes of parsing algorithms, from the birth of the field to the state of the art. Learn how to pick the right one. Have a comparative look at a dozen different Python parsing toolkits. And finally, learn some optimization tricks to get a grammar going at a reasonable clip.
by Jeff Elmore
Many of you are probably familiar with NLTK, the wonderful Natural Language Toolkit for Python. You may not be familiar with Linkgrammar, which is a sentence parsing system created at Carnegie Melon university. Linkgrammar is quite robust and works "out of the box" in a way that NLTK does not for sentence parsing.
NLTK is a fantastic library with broad capabilities. But often I find that I want something that will just do what I want without my having to figure out all of the details. An example of this is sentence parsing. A quick google search for parsing sentences with NLTK returns a number of articles describing how to write your own grammar and define a parser based on that grammar and parse sentences. This is great for toy problems and education, but if you actually need to parse sentences "from the wild," writing your own grammar is a huge undertaking.
Enter Linkgrammar. Linkgrammar was developed at Carnegie Melon university and is now maintained by the developers of Abiword as the basis for their grammar checking capabilities. It works nicely out of the box and is tolerant of irregularities found in authentic text.
7th–15th March 2012