Languages for computational linguistics
Languages for doing computational work can be divided into two broad categories: declarative languages and procedural languages. Most work in industry is done in Perl and C++; Java can be expected to have a growing role as time goes on.
Perl
Perl is popular for its phenomenal support for string-handling tasks of all kinds. The best books for learning Perl are Learning Perl and Perl for Dummies.
Use them together: Learning Perl will help you understand Perl, and Perl for Dummies will let you actually be effective in it. If you're an impoverished graduate student and can only afford one, buy Perl for Dummies.
There's also a new book by Michael Hammond, Programming for Linguists: Perl for Language Researchers. I'm not familiar with it, one way or the other.
Perl really is quite wonderful for linguists. For some examples of what you can do with it, see:
- Perl version of the Brill tagger, with source code and a demo of the POS tagger in action. (When you're done being amazed and impressed, try giving it the input "the orange saw saw the orange".)
- A language identifier written in Perl
- Chris Barker's thing--see note in conference proceedings for URL
- Many things by I. Dan Melamed--scroll down to find links to Perl programs
Java
Java shows some promise for replacing C++ for building some linguistic applications. Although it doesn't quite match the ease of string-handling that Perl delivers, it is certainly easier to perform many string manipulations in Java than in C++. Also, a good Java regular expression engine is available, which is a big help in bringing Java closer to the level of Perl. Update: Java 1.4 includes a regular expression package, java.util.regex. The best documentation of it is probably in David Flanagan's Java In A Nutshell, 4th ed. Also see the second edition of Jeffrey Friedl's Mastering Regular Expressions (be sure to get the second edition; Java isn't covered in the first) and Sun's pages on the package.
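As a rough illustration of what the built-in regular expression package allows, here is a minimal sketch that uses java.util.regex to pull word tokens out of a sentence and to strip a suffix. The class name RegexSketch, the token pattern, and the toy "-ing" rule are assumptions made up for this example; they are not taken from any of the books mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not from any of the cited books.
class RegexSketch {
    // A word is a run of letters, optionally followed by an apostrophe and more letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)?");

    // Toy rule (an assumption): a stem of at least three letters followed by a final "ing".
    private static final Pattern ING = Pattern.compile("^(\\p{L}{3,})ing$");

    // Returns the word tokens of the text, in order, skipping punctuation and spaces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Strips a final "ing" if the toy rule matches; otherwise returns the word unchanged.
    static String stripIng(String word) {
        Matcher m = ING.matcher(word);
        return m.matches() ? m.group(1) : word;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The orange saw saw the orange."));
        System.out.println(stripIng("parsing")); // stem is long enough, so the suffix is removed
        System.out.println(stripIng("sing"));    // stem too short, so the word is left alone
    }
}
```

The patterns are compiled once and reused, which is the usual way to work with Pattern and Matcher. A real stemmer would of course need far more than one rule.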
Java is very strong on handling non-ASCII character sets; if you need Unicode to work with your language of interest, you should definitely check out Java.
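To give a feel for this, here is a minimal sketch that counts the letters in a string containing non-ASCII characters. The class name UnicodeSketch and the sample strings are assumptions chosen for illustration. Note also that the code-point methods used here (codePointAt, charCount) were added in Java 5, so this particular sketch will not compile on Java 1.4.

```java
import java.util.Locale;

// Hypothetical illustration class; the sample words are arbitrary.
class UnicodeSketch {
    // Counts the Unicode letters in a string, iterating by code point
    // so that characters outside the Basic Multilingual Plane count once.
    static int countLetters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // A Polish city name with three diacritic letters, followed by digits.
        // Unicode escapes keep this source file pure ASCII.
        String polish = "\u0141\u00f3d\u017a, 1423";
        System.out.println(countLetters(polish));            // letters only; digits and punctuation are skipped
        System.out.println(polish.toLowerCase(Locale.ROOT)); // case mapping also works beyond ASCII

        // U+10330 GOTHIC LETTER AHSA needs two UTF-16 code units but is a single letter.
        String gothic = "\uD800\uDF30";
        System.out.println(gothic.length());
        System.out.println(countLetters(gothic));
    }
}
```

The point of iterating by code point rather than by char is that length() counts UTF-16 code units, which is not always the same as the number of characters a linguist would count.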
Michael Hammond has a new book, Programming for Linguists: Java Technology for Language Researchers, that you should check out. The Java is somewhat out of date; see the book for why.
I provide some links here to various documents that address linguistic functionality in Java. I also have some squibs that I've written that serve as tutorials on the sorts of things that a linguist would want to do with Java. Note that these tutorials assume that you already have some basic familiarity with Java. If that's not you, see Java in a Nutshell, by David Flanagan. If the Flanagan book is more than you can handle at this point, try Java2 Fast and Easy Web Development or Elizabeth Castro's book (sorry, no URL).
- Java Speech grammar format specification
- Java Yacc-able grammar
- Java grammar extensions
- Notes on string manipulation in Java
- Notes on regular expressions in Java using the Jakarta ORO Perl5Util package
- Building Parsers With Java. You need to understand Java and object-oriented design before you tackle this one. It's not oriented towards parsers that apply a separately-specified grammar, as a linguist would expect, but rather towards building parsers that implement a grammar directly. Hence, for every alternative grammar, you write a different parser--the parser and the grammar are one and the same, sorta. Still, it's an interesting book.
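To make the contrast in the last item concrete, here is a minimal sketch of a parser that implements its grammar directly, in the recursive-descent style. It is not taken from Building Parsers With Java and does not use that book's code; the class name DirectParser, the three-rule toy grammar, and the tiny lexicon are all assumptions made up for illustration. Each grammar rule becomes one method, so changing the grammar means changing the parser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy parser. The grammar exists only in the method bodies:
//   S  -> NP VP
//   NP -> Det N
//   VP -> V NP
class DirectParser {
    // Tiny made-up lexicon; "saw" is deliberately both a noun and a verb.
    private static final Set<String> DET = new HashSet<>(Arrays.asList("the", "a"));
    private static final Set<String> NOUN = new HashSet<>(Arrays.asList("orange", "saw", "linguist"));
    private static final Set<String> VERB = new HashSet<>(Arrays.asList("saw", "ate"));

    private final List<String> words;
    private int pos = 0;

    private DirectParser(List<String> words) {
        this.words = words;
    }

    // Returns a bracketed parse, or null if the sentence is not covered by the grammar.
    static String parse(String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        DirectParser p = new DirectParser(Arrays.asList(tokens));
        String tree = p.sentence();
        // Succeed only if a tree was built and every word was consumed.
        return (tree != null && p.pos == p.words.size()) ? tree : null;
    }

    // S -> NP VP
    private String sentence() {
        String np = nounPhrase();
        if (np == null) return null;
        String vp = verbPhrase();
        if (vp == null) return null;
        return "[S " + np + " " + vp + "]";
    }

    // NP -> Det N
    private String nounPhrase() {
        String det = word(DET, "Det");
        if (det == null) return null;
        String n = word(NOUN, "N");
        if (n == null) return null;
        return "[NP " + det + " " + n + "]";
    }

    // VP -> V NP
    private String verbPhrase() {
        String v = word(VERB, "V");
        if (v == null) return null;
        String np = nounPhrase();
        if (np == null) return null;
        return "[VP " + v + " " + np + "]";
    }

    // Consumes the next word if it belongs to the given category; otherwise returns null.
    private String word(Set<String> category, String label) {
        if (pos < words.size() && category.contains(words.get(pos))) {
            return "[" + label + " " + words.get(pos++) + "]";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("the orange saw the saw"));
        System.out.println(parse("the orange saw saw the orange")); // not covered by this grammar
    }
}
```

With this toy grammar, "the orange saw the saw" gets a parse, while "the orange saw saw the orange" is rejected, because nothing here allows a noun-noun compound like "orange saw". Covering it would mean adding another NP rule, which in this style means editing the parser itself rather than a separate grammar file.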
Declarative languages
Your basic choices here are LISP or Prolog. A good beginning LISP book: The Little LISPer. Look for it before you actually need it, because most places will have to special-order it for you. To really learn Lisp, get Paul Graham's ANSI Common Lisp. When you're ready for linguistic programming, check out Gazdar et al.'s Natural language processing in Lisp: an introduction to computational linguistics and Peter Norvig's Paradigms of artificial intelligence programming: case studies in Common Lisp.
Sundry LISP sites:
Some Prolog sites: