The แจ่มไพบูลย์/แรช Frequency List for Thai Learners v2.4
A Thai word frequency list of ~20k words used in primary and secondary school textbooks for Thai children.
The first 2,500-2,700 words roughly correspond to primary school level; the whole list corresponds to secondary school level.
index - the index in the source thesis (19,494 words)
word - the Thai word
Measures of complexity (from the original work)
senses - text block where each line is a type followed by the English meaning, e.g. Prep. To
examples - most entries have one or more samples. (I personally have a strong dislike of Anki and the like; I prefer to learn in context.)
classifiers - one or more classifiers with their transliteration
ipa - phonetics (mostly IPA, a few in ThaiPhon)
translit - Paiboon-esque transliteration with tone marks
audio - 1 if an audio file exists (sorry, not shareable at this stage)
audio_urls - one or several URLs of audio files on Wikimedia
num_senses - number of senses/meanings
weight - 0 to 4, the number of source dictionaries the word was found in
quality - quality markers, including NO_SENSE, NO_IPA and NO_TRANSLIT
cumfreq - the cumulative frequency in the original study, e.g. words 1 to 110 account for 50% of the occurrences in the 3M-word corpus
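As a rough illustration of how the `senses` and `quality` columns described above might be read, here is a minimal sketch. It assumes the list is available as CSV; the sample rows, the delimiter, and the separator between quality markers are all assumptions made for the example, and only the column names come from the description above.

```python
import csv
import io

# Invented two-row sample. The real file name, delimiter and field contents
# are assumptions; only the column names come from the column list.
SAMPLE = '''index,word,senses,num_senses,weight,quality
1,ที่,"Prep. At
N. Place",2,4,
2,ตัวอย่าง,,0,1,NO_SENSE NO_IPA
'''

def parse_senses(block):
    """Split a senses text block into (type, meaning) pairs, one per line."""
    pairs = []
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed layout: a type label, one space, then the English meaning.
        kind, _, meaning = line.partition(" ")
        pairs.append((kind, meaning.strip()))
    return pairs

def parse_quality(field):
    """Return the set of quality markers (assumed comma- or space-separated)."""
    return set(field.replace(",", " ").split())

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    row["senses"] = parse_senses(row["senses"])
    row["quality"] = parse_quality(row["quality"])

# Keep only the entries that carry at least one sense.
usable = [r for r in rows if "NO_SENSE" not in r["quality"]]
```

Filtering on the `quality` markers first is a convenient way to skip incomplete entries before building study material.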
Stats: 19,494 words, of which 1,169 are repeat-words; two-thirds of the words have examples. 110 words account for half the corpus, and slightly fewer than 2,100 account for 90%. On the other hand, 13,600 words are present in 3 or all 4 of the source dictionaries (see section 'sources'), so they form a 'hard' core of the Thai language.
About 60% of the words have audio available. Audio caveat: the links to Wikimedia work, but have not been verified one by one. I have not yet received authorisation to share the files behind the 'audio' column (value = 1); I will update here if and when I do. In the meantime, those files are not available.
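The coverage figures above can be recomputed from the `cumfreq` column, and the 'hard' core from `weight`. The sketch below uses toy values, not the real data, and assumes that `cumfreq` is stored in rank order and expressed in percent; the unit in the actual file may differ.

```python
from bisect import bisect_left

def words_needed(cumfreq, target):
    """Return how many top-ranked words are needed to reach `target` coverage.

    `cumfreq` must be in rank order (non-decreasing) and use the same unit
    as `target` (percent here, which is an assumption about the real column).
    """
    pos = bisect_left(cumfreq, target)
    if pos == len(cumfreq):
        raise ValueError("target coverage is never reached")
    return pos + 1

# Toy values, NOT taken from the real list.
toy_cumfreq = [20.0, 35.0, 45.0, 52.0, 58.0, 100.0]
toy_weight = [4, 4, 3, 2, 0, 1]

half_corpus = words_needed(toy_cumfreq, 50.0)

# Ranks of words found in 3 or all 4 source dictionaries.
hard_core = [rank for rank, w in enumerate(toy_weight, start=1) if w >= 3]
```

On the real list, the same call with a 50% target should land near rank 110, and with a 90% target slightly below 2,100, if the column behaves as described.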
Normalized Frequency (NF) - difficulty along several dimensions (ความยากง่ายในมิติต่าง ๆ) - learning levels 1 to 4
Original work | List v2.4 for Thai L2 |
---|---|
คำเนื้อหา-คำไวยากรณ์ (C/F): C = คำเนื้อหา, F = คำไวยากรณ์, CF = คำที่เป็นได้ทั้งคำเนื้อหาและคำไวยากรณ์ | role - is it a content word (C), a grammar word (F), or both (CF)? |
การสร้าง-การประกอบคำ (Morpho): เดี่ยว = คำเดี่ยว, ประสาน = คำประสาน, ประสม = คำประสม, ซ้อน = คำซ้อน, ซ้ำ = คำซ้ำ, ทับE = คำทับศัพท์ภาษาอังกฤษ | morpho - word formation: single = single word, combined = combined word, compound = compound word, complex = คำซ้อน (tentative rendering), repeated word = repeated word, loan Eng. = word transliterated from English |
จำนวนพยางค์ (Syl): 1 = คำที่มี 1 พยางค์, 2 = คำที่มี 2 พยางค์, 3 = คำที่มีตั้งแต่ 3 พยางค์ขึ้นไป | syllables - 1, 2, or 3-and-more syllables |
รูปเขียนกับการออกเสียง (F/S): 1 – ไม่จำกัด, ตัวเลขที่แสดงแทนจำนวนความเป็นไปได้ในการออกเสียงรูปศัพท์นั้น ๆ | spelling and pronunciation - the number of possible pronunciations of the written form, from 1 with no upper limit |
ความทึบใสทางความหมาย (Seman): 0 = คำเดี่ยวและคำทับศัพท์ภาษาอังกฤษ, 1 = คำใส, 2 = คำก้ำกึ่ง ไม่ทึบไม่ใส, 3 = คำทึบ | semantics - semantic opacity: 0 = single words and English transliterations, 1 = transparent, 2 = ambiguous (neither opaque nor transparent), 3 = opaque |
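For readers cross-referencing the thesis, the correspondence between the original codes and the labels used in this list can be written as lookup tables. This is a minimal sketch: the function name is hypothetical, and the pairing of ซ้อน with "complex" is inferred by elimination from the category counts, so treat it as an assumption.

```python
# Thesis codes mapped to the labels used in this list.
ROLE = {"C": "content", "F": "grammar", "CF": "both"}

MORPHO = {
    "เดี่ยว": "single",
    "ประสาน": "combined",
    "ประสม": "compound",
    "ซ้อน": "complex",        # inferred pairing, see the note above
    "ซ้ำ": "repeated word",
    "ทับE": "loan Eng.",
}

SEMANTICS = {0: "single", 1: "transparent", 2: "ambiguous", 3: "opaque"}

def to_list_labels(role_code, morpho_code, syllable_count, seman_code):
    """Translate one thesis entry's codes into this list's labels (hypothetical helper)."""
    return {
        "role": ROLE[role_code],
        "morpho": MORPHO[morpho_code],
        # The thesis groups everything with 3 or more syllables under 3.
        "syllables": min(syllable_count, 3),
        "semantics": SEMANTICS[seman_code],
    }

example = to_list_labels("C", "ประสม", 4, 1)
```

Unknown codes raise a `KeyError`, which is deliberate here: it surfaces unexpected values instead of silently mislabelling them.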
role | count |
---|---|
both | 102 |
content | 19052 |
grammar | 340 |
morpho | count |
---|---|
combined | 301 |
complex | 1642 |
compound | 7057 |
loan Eng. | 551 |
repeated word | 1169 |
single | 8774 |
spelling | count |
---|---|
1 | 9470 |
2 | 4732 |
3 | 1540 |
4-990 | 3752 |
semantics | count |
---|---|
ambiguous | 1289 |
opaque | 1102 |
single | 9325 |
transparent | 7778 |
schooling levels vs 3k-bins - words in the list, grouped into bins of 3,000 words, compared with school levels
morphology vs semantics
number of words in dictionaries
The thesis [https://www.arts.chula.ac.th/~ling/TTC/] is, as far as I can tell, in the public domain.
Lexitron v2: [https://opend-portal.nectec.or.th/en/prepare/lexitron-2-0] NECTEC licence.
Wiktionary [https://th.wiktionary.org/] is licenced under CC BY-SA 4.0 (Attribution-Share Alike 4.0 International)
Volubilis v. 25.2 [https://belisan-volubilis.blogspot.com/], also under CC BY-SA 4.0.
The Royal Institute Dictionary 1999 is also under the NECTEC licence.
This product is created by the adaptation of LEXiTRON developed by NECTEC. This frequency list is shared under CC BY-SA 4.0, including the mention above, as a derivative work of a NECTEC product.