Corpora
Iranian Persian / فارسی
VOA Corpus
These files are in the public domain
Combined years (7.9 million words):
Kayhan Corpus
These files are in the public domain in most countries outside Iran
- 2005 Transliterated - xz (~24MB) - messy, 19 million words
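Because the archive is xz-compressed and large, it can be read as a stream instead of being unpacked first. The following is a minimal sketch, assuming the decompressed file is UTF-8 plain text with whitespace-separated words; the function name `count_tokens` is hypothetical and not part of the corpus distribution.

```python
import lzma
from collections import Counter


def count_tokens(path, encoding="utf-8"):
    """Count whitespace-separated tokens in an xz-compressed text file.

    Assumption: the file decompresses to plain text in the given encoding.
    Lines are read one at a time, so the whole corpus is never held in memory.
    """
    counts = Counter()
    with lzma.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            # str.split() with no arguments splits on any run of whitespace.
            counts.update(line.split())
    return counts
```

A call such as `count_tokens("kayhan-2005.txt.xz").most_common(20)` (the file name is a placeholder) would list the twenty most frequent tokens. Since the corpus is described as messy, the counts may include stray markup or punctuation attached to words.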
English
Similar to the CMU Pronouncing Dictionary, but the transcriptions are based on a speech corpus rather than on intuition. Includes occurrence counts and mean length of utterance.
Dari / دری
VOA Corpus (small)
This corpus is in the public domain
Combined years (82k words):
Pashto / پښتو
VOA Corpus (small)
This corpus is in the public domain
Combined years (62k words):
Urdu / اردو
VOA Corpus
This corpus is in the public domain
Combined years (4 million words):