Some Persian NLP projects of mine
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl --nostem -i roman -o utf8
and the output:
من کتاب‌های تو را نمی‌بینم
echo "من کتاب‌های تو را نمی‌بینم" | perl perstem.pl -d -i unihtml -o arabtex
and the output:
mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm
This could then be inserted into a LaTeX document as:
\documentclass{article}
\usepackage{arabtex}
\begin{document}
\setfarsi
\novocalize
\< mn ktAb\hspace{0ex}hAI tU rA nmI\hspace{0ex}bInm >
\end{document}
echo "من کتاب‌های تو را نمی‌بینم" | perl perstem.pl -l -i unihtml
and the output:
mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -l
and the output:
mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -l -p
and the output:
mn ktAb_+-hA_+e/N+PL+EZ tu rA n+_mi-+_bin_+m/V+NEG+DUR+1S
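In the tagged output above, a part-of-speech tag follows the analysed word after a slash, as in ktAb_+-hA_+e/N+PL+EZ, while words such as mn carry no tag. The following is a minimal sketch for post-processing such a line, assuming the slash always separates the analysis from its tag. That assumption is inferred from this one sample only, not from perstem's documentation. The function name split_tags is hypothetical, and the input is the sample line copied from above rather than the result of running perstem.pl.

```shell
# Sample line copied from the -l -p output above; perstem.pl is not invoked here.
tagged='mn ktAb_+-hA_+e/N+PL+EZ tu rA n+_mi-+_bin_+m/V+NEG+DUR+1S'

# split_tags (hypothetical helper): read tagged text on stdin and print one
# token per line as "analysis<TAB>tag", using "-" when a token has no tag.
# Assumption: "/" appears only as the analysis/tag separator.
split_tags() {
  tr ' ' '\n' | awk -F'/' 'NF { print $1 "\t" (NF > 1 ? $2 : "-") }'
}

printf '%s\n' "$tagged" | split_tags
```

Each output line then holds one word, so the result can be filtered with standard tools, for example to keep only tokens whose tag begins with V.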
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl -w
and the output:
mn ktAb tu rA bin
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl
and the output:
mn ktAb hA e tu rA n mi bin m
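Comparing the samples, the default output looks like the -l output with each link marker replaced by a space. The sketch below reproduces that relationship for the sample sentence, assuming the markers are exactly _+ and +_, each optionally adjoined by a hyphen. This is an inference from the sample lines on this page, not documented behaviour, and the function name links_to_morphemes is hypothetical.

```shell
# Sample line copied from the -l output above; perstem.pl is not invoked here.
linked='mn ktAb_+-hA_+e tu rA n+_mi-+_bin_+m'

# links_to_morphemes (hypothetical helper): replace every link marker
# ("_+" or "+_", with an optional "-" on either side) by a single space.
# Assumption: no other marker shapes occur in the link output.
links_to_morphemes() {
  sed -E 's/-?(\+_|_\+)-?/ /g'
}

printf '%s\n' "$linked" | links_to_morphemes
```

For the sample line this prints mn ktAb hA e tu rA n mi bin m, the same string as the default output shown above.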
source persianlg.sh
echo "mn ktAb-hAi tu rA nmi-binm" | perl perstem.pl | persianlg
or the short form:
persianparse.sh "mn ktAb-hAi tu rA nmi-binm"
and the output of either input is:
    +--------------------------------Wi-------------------------------+
    |       +---------------------------Spn1--------------------------+
    |       |      +---------------------On---------------------+     |
    |       |      +-----------PA-----------+                   |     |
    |       |      +-----EZ-----+           |     +----VMneg----+     |
    |       |      +-NMSp-+     +--M-+      |     |      +-VMdur+-VMP-+
    |       |      |      |     |    |      |     |      |      |     |
LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc n.vmn mi.vmd bin.vp m.vmp
source persianlg.sh
echo "من کتاب‌های تو را نمی‌بینم" | perl perstem.pl -i unihtml | persianlg
which will result in the same output:
    +--------------------------------Wi-------------------------------+
    |       +---------------------------Spn1--------------------------+
    |       |      +---------------------On---------------------+     |
    |       |      +-----------PA-----------+                   |     |
    |       |      +-----EZ-----+           |     +----VMneg----+     |
    |       |      +-NMSp-+     +--M-+      |     |      +-VMdur+-VMP-+
    |       |      |      |     |    |      |     |      |      |     |
LEFT-WALL mn.pn ktAb.n hA.nms e.ez tu.pn rA.acc n.vmn mi.vmd bin.vp m.vmp
Usage:    perl perstem.pl [options] < input > output
Function: Stemmer and morphological analyzer for the Persian language (Farsi).
          Inflexional morphemes are separated from their roots.
Options:
  -d, --nostem          Don't stem -- mostly for character-set conversion
  -h, --help            Print usage
  -i, --input <type>    Input character encoding type {cp1256,isiri3342,utf8,unihtml}
  -l, --links           Show morphological links
  -n, --noroman         Delete all non-Arabic script characters (eg. HTML tags)
  -o, --output <type>   Output character encoding type {arabtex,cp1256,isiri3342,utf8,unihtml}
  -p, --pos             Tag words for parts of speech
  -r, --recall          Increase recall by parsing ambiguous affixes
  -t, --tokenize        Tokenize punctuation
  -u, --unvowel         Remove short vowels
  -v, --version         Print version
  -w, --root            Return only word roots
  -z, --zwnj            Insert Zero Width Non-Joiners where they should be