Corpora

Iranian Persian / فارسی

VOA Corpus

These files are in the public domain
Combined years (7.9 million words):

Kayhan Corpus

These files are in the public domain in most countries outside Iran

English

Buckeye Pronunciation Dictionary

Similar to the CMU Pronouncing Dictionary, but the transcriptions are based on a speech corpus rather than on intuition. Includes occurrence counts and mean length of utterance.

Dari / دری

VOA Corpus (small)

This corpus is in the public domain
Combined years (82k words):

Pashto / پښتو

VOA Corpus (small)

This corpus is in the public domain
Combined years (62k words):

Urdu / اردو

VOA Corpus

This corpus is in the public domain
Combined years (4 million words):