
Some Persian NLP projects of mine




Here's how many of the Persian programs fit together:



Start with an input text like "من کتاب‌های تو را نمی‌بینم", encoded as Unicode HTML decimal entities.

Orthography

You can convert the text from one character-set encoding to another. Supported encodings include Romanized, ArabTeX, Windows-1256, ISIRI 3342, UTF-8, and Unicode HTML numeric entities.
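
To make the "Unicode HTML numeric entities" form concrete, here is a minimal Python sketch of a round trip between UTF-8 text and decimal entities. It is not how Perstem does the conversion; the function names to_unihtml and from_unihtml are made up for this illustration, and it only uses Python's standard html module.

```python
import html

def to_unihtml(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric entity."""
    return "".join(f"&#{ord(ch)};" if ord(ch) > 127 else ch for ch in text)

def from_unihtml(entities: str) -> str:
    """Decode HTML character references (numeric and named) back to text."""
    return html.unescape(entities)

# Hypothetical sample: the first two words of the example sentence.
sample = "من کتاب"
encoded = to_unihtml(sample)
print(encoded)
print(from_unihtml(encoded) == sample)
```

ASCII characters such as spaces and HTML tags pass through unchanged, so only the Arabic-script characters (and the Zero Width Non-Joiner) become entities. Note that html.unescape also decodes named references such as &amp;, so the round trip is only exact for text that does not already contain entity-like sequences.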

Morphology

Syntax

Further information on Perstem

The command "perl perstem.pl --help" gives the following usage information:

Usage:    perl perstem.pl [options] < input > output

Function: Stemmer and morphological analyzer for the Persian language (Farsi).
          Inflexional morphemes are separated from their roots.

Options:
  -d, --nostem           Don't stem -- mostly for character-set conversion
  -h, --help             Print usage
  -i, --input <type>     Input character encoding type {cp1256,isiri3342,utf8,unihtml}
  -l, --links            Show morphological links
  -n, --noroman          Delete all non-Arabic script characters (eg. HTML tags)
  -o, --output <type>    Output character encoding type {arabtex,cp1256,isiri3342,utf8,unihtml}
  -p, --pos              Tag words for parts of speech
  -r, --recall           Increase recall by parsing ambiguous affixes
  -t, --tokenize         Tokenize punctuation
  -u, --unvowel          Remove short vowels
  -v, --version          Print version
  -w, --root             Return only word roots
  -z, --zwnj             Insert Zero Width Non-Joiners where they should be
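
As a rough sketch of how the options above might be combined from a script, the following Python wrapper builds a perstem.pl command line for character-set conversion only. The helper names build_perstem_command and run_perstem are hypothetical, and the script path "perstem.pl" is an assumption about where the file lives. Only flags listed in the usage text are used.

```python
import subprocess

# Encoding names copied from the usage text above.
INPUT_TYPES = {"cp1256", "isiri3342", "utf8", "unihtml"}
OUTPUT_TYPES = {"arabtex", "cp1256", "isiri3342", "utf8", "unihtml"}

def build_perstem_command(input_type, output_type, nostem=False,
                          script="perstem.pl"):
    """Build the argument list for one perstem.pl run (hypothetical helper)."""
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unsupported input type: {input_type}")
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output type: {output_type}")
    cmd = ["perl", script, "--input", input_type, "--output", output_type]
    if nostem:
        # --nostem skips stemming, leaving only the encoding conversion.
        cmd.append("--nostem")
    return cmd

def run_perstem(data: bytes, cmd) -> bytes:
    """Pipe raw bytes through perstem.pl, which reads stdin and writes stdout.

    Assumes perl and perstem.pl are available; not called in this sketch.
    """
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout

print(" ".join(build_perstem_command("utf8", "unihtml", nostem=True)))
# prints: perl perstem.pl --input utf8 --output unihtml --nostem
```

Passing bytes in and out keeps the wrapper neutral about encodings, since the input and output bytes may be Windows-1256 or ISIRI 3342 rather than UTF-8.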