newthai

Textbooks Frequency List v2

Overview

The original frequency list is the 2016 work of Dr. Tantong Champaiboon (Ph.D. from Chulalongkorn University, Linguistics Department). She studied a corpus of textbooks for Thai students aged 3 to 16. The list is organised along several dimensions: measures of vocabulary complexity, and comparisons across 4 age ranges and 4 historical and current curricula.

The แจ่มไพบูลย์/แรช Frequency List for Thai Learners v2 is an enhanced version of the list, adapted for (English-speaking) learners of Thai. See the Reddit r/learnthai post for v2.4; v1 was posted in the same sub.

Major caveat

The original study is useful to us adult Thai learners because of its domain: school textbooks. Its small size, however, is an issue (only around 3 million words). As you go down the index (first column), the probability that a word holds that rank in real life drops rapidly, and not linearly. In other words: words 1 to roughly 9,000-10,000 are highly likely to be among the 20,000 most used words IRL; but for word number 16,000, say, all you can assert is that it is likely among the 50,000 most used words. The index indicates rank within the corpus, not a strict real-world rank, so take it with a pinch of salt. If your preferred domain for learning Thai is lakorn or news, แล้วแต่คุณ (it's up to you).

How many words do we need?

Do we need all 19,494 words? No. 110 words account for half the corpus, and slightly fewer than 2,100 account for 90%. With, say, 6,000-7,000 words you could read any of the textbooks at Extensive Reading level (95-98% coverage; Paul Nation, 2005): 95% cumulative frequency is first reached at rank 3,856, and 98% at rank 8,361. On the other hand, 13,600 words are present in 3 or all 4 of the source dictionaries (see section ‘Sources & licences’), so they form a ‘hard’ core of the Thai language (see the hexagon-based chart in the doc).
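To make the cumulative-frequency idea concrete, here is a minimal sketch of how a "rank at X% coverage" figure can be computed from per-word counts. The function name and the toy counts are made up for illustration; they are not taken from the sheet, and this is not necessarily how the figures above were produced.

```python
from itertools import accumulate

def rank_at_coverage(counts, target):
    """Return the 1-based rank at which cumulative frequency first reaches `target` (a share between 0 and 1)."""
    ordered = sorted(counts, reverse=True)  # most frequent word first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    for rank, running in enumerate(accumulate(ordered), start=1):
        if running >= target * total:
            return rank
    return len(ordered)

# Toy corpus of 100 tokens spread over 9 words (invented numbers).
toy_counts = [50, 20, 10, 8, 5, 3, 2, 1, 1]

for share in (0.5, 0.9, 0.95):
    print(share, rank_at_coverage(toy_counts, share))
```

On real data the same shape shows up: a handful of words covers half the corpus, and each extra percentage point of coverage costs many more words than the previous one.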

Furthermore, if you want to produce a list of 2,000 words with complex spelling, or 3,000 compound words that are more than the sum of their parts (see section ‘Examples of usage’), you need more than 2,000-3,000 words overall. So this long list gives us learners the flexibility we need, depending on individual goals.
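As a rough sketch of what cutting a custom list could look like once the sheet is exported to CSV: the column names below (`index`, `word`, `is_compound`) and the sample rows are placeholders I made up for illustration. Check the ‘Notice’ tab for the actual column names and values.

```python
import csv
import io

# Stand-in for a CSV export of the sheet; columns and rows are hypothetical.
SAMPLE = """index,word,is_compound
1,กิน,0
2,น้ำ,0
3,โรงเรียน,1
4,แม่น้ำ,1
5,น้ำแข็ง,1
"""

def top_n_where(rows, column, value, n):
    """Return the first n rows, in index order, whose `column` equals `value`."""
    matching = [r for r in rows if r[column] == value]
    matching.sort(key=lambda r: int(r["index"]))
    return matching[:n]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
compounds = top_n_where(rows, "is_compound", "1", 2)
print([r["word"] for r in compounds])
```

Because the filter runs before the cut-off, a 3,000-word compound list will reach well past index 3,000 in the full list, which is why the long list is needed.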

For a description of all columns and their possible values, see the ‘Notice’ tab in the sheet, or the full docs on GitHub. We will highlight the key changes from v1. More dimensions have been added in this version (see below).

Stats: 19,494 words, 1,169 repeat-words; two-thirds of the words have examples. About 60% have audio available. Audio caveat: the links to Wikimedia work, but have not been verified one by one. I have not yet received authorisation to share the files for the ‘audio’ column (value=1); I will update here if and when I do. Don’t bother DM-ing to ask for the files.

Key changes from v1

A note on meanings/senses: Why are all senses of a word aggregated? Can you not emphasise the most frequent meaning? One of the key findings of the original thesis is that when a word is introduced to children at a given level, all senses/facets of this word are also introduced, i.e. they are not developed over time.

DETAILS ON COLUMNS AND DISTRIBUTION

Examples of usage

430 grammar words have a sense, and most have one or more examples: a good way to find out which you already know, and which you should research or ask your teacher about. Note that most rank pretty high in frequency, which figures.

Sources & licences

The thesis (link) is, as far as I can tell, in the public domain.
Lexitron v2: (link) NECTEC licence.
Wiktionary (link) is licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0 International).
Volubilis v. 25.2 (link), also under CC BY-SA 4.0.
The Royal Institute Dictionary 1999 is also under NECTEC licence.

“This product is created by the adaptation of LEXiTRON developed by NECTEC.”
This frequency list is shared under CC BY-SA 4.0, including the mention above as a derivative work of a NECTEC product.

Google Sheets

If you have suggestions, the sheet is now not only public but also open for comments. However, if you disagree with some of the meanings, you should probably take it up with the authors of the corresponding dictionary. I welcome any constructive criticism.

The other link: GitHub docs

TLDR

A Thai word frequency list of ~20k words used in primary and secondary school textbooks, with various dimensions for slicing and dicing custom lists.