To specify codepoints, use plain hexadecimal such as "0041". Separate distinct codepoints with ",". Commas with nothing in between (e.g. ",,") specify empty cells. Contiguous ranges can increase (e.g. "0041-005A") or decrease (e.g. "005A-0041"). Use "|" to break rows.
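As a rough illustration of the range syntax described above, the following Python sketch parses such a string into rows of codepoints. The function name `parse_ranges` and the use of `None` for empty cells are assumptions made for this example; it is not the code behind the actual interface.

```python
def parse_ranges(spec):
    """Parse a spec such as "0041-0043,,005A|0043-0041" into rows of codepoints.

    Returns a list of rows. Each row is a list of ints, with None marking
    an empty cell (hypothetical convention chosen for this sketch).
    """
    rows = []
    for row_text in spec.split("|"):            # "|" breaks rows
        row = []
        for cell in row_text.split(","):        # "," separates codepoints
            cell = cell.strip()
            if not cell:
                row.append(None)                # nothing between commas: empty cell
            elif "-" in cell:
                start_text, end_text = cell.split("-", 1)
                start, end = int(start_text, 16), int(end_text, 16)
                step = 1 if end >= start else -1    # ranges may also decrease
                row.extend(range(start, end + step, step))
            else:
                row.append(int(cell, 16))
        rows.append(row)
    return rows


print(parse_ranges("0041-0043,,005A|0043-0041"))
# [[65, 66, 67, None, 90], [67, 66, 65]]
```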
Table of Contents
- Character Sets
- Alphabets, Abjads and Abugidas
- Other Blocks
Unicode is a standardised mapping between numbers and characters. The numbers, or "codepoints", are typically expressed in hexadecimal, like "U+0041". This codepoint (65 in decimal) corresponds to the capital letter "A" in Unicode:
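The same correspondence can be checked in a few lines of Python. This is only a small sketch using the standard `chr`, `ord` and `unicodedata.name` functions; the variable names are arbitrary.

```python
import unicodedata

codepoint = 0x0041                      # hexadecimal 41 is decimal 65
char = chr(codepoint)                   # codepoint -> character

print(codepoint)                        # 65
print(f"U+{codepoint:04X}")             # U+0041
print(char)                             # A
print(unicodedata.name(char))           # LATIN CAPITAL LETTER A
print(ord("A") == codepoint)            # True (character -> codepoint)
```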
Version 14.0.0 of the Unicode standard contains over 144,000 characters. Visual representations of characters are called glyphs. Some characters have more than one glyph associated with them (e.g. "variations"). There is no single outline font that covers all the Unicode characters. Notoverse is an attempt at a bitmap font that does. It is limited to one glyph per character, but is perfectly adequate for illustrating codepoints.
Hovering the mouse over the glyph above zooms in to the bitmap. Clicking it jumps to a simple user interface for rendering ranges.
Before Unicode, there were many ways of encoding character sets.
The first 128 codepoints of Unicode map one-to-one to the characters of 7-bit ASCII (1963).
Coded Character Sets, History and Development by Charles E. Mackenzie (1980) has an exhaustive history of ASCII development.
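The one-to-one relationship between ASCII and the first 128 Unicode codepoints can be sketched in Python by decoding every 7-bit value with the built-in "ascii" codec and comparing the result with the byte value. The helper name `ascii_matches_unicode` is made up for this example.

```python
def ascii_matches_unicode():
    """Check that each 7-bit ASCII value decodes to the same-numbered codepoint."""
    ascii_bytes = bytes(range(128))
    text = ascii_bytes.decode("ascii")
    return all(ord(ch) == byte for ch, byte in zip(text, ascii_bytes))


print(ascii_matches_unicode())          # True
print("A".encode("ascii"))              # b'A' (byte value 0x41, same as U+0041)
```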
Contemporaneous with ASCII was IBM's EBCDIC. Below is the character set of EBCDIC code page 037. A few control codes have no Unicode equivalent.
EBCDIC is surprisingly still in use today.
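Python ships a codec for this code page under the name "cp037", so the difference from ASCII is easy to see. The sketch below only relies on that standard codec; the sample bytes are chosen for illustration.

```python
# Capital "A" is 0x41 in ASCII but 0xC1 in EBCDIC code page 037.
print("A".encode("ascii").hex())        # 41
print("A".encode("cp037").hex())        # c1

# Decoding EBCDIC bytes gives ordinary Unicode text.
ebcdic_bytes = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])
print(ebcdic_bytes.decode("cp037"))     # Hello
```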
Beyond the realm of business, the explosion of home computers in the 1980s led to a parallel explosion of character sets.
The PETSCII character set was used on Commodore's PET, VIC-20 and C64 machines. Below are the unshifted (left) and shifted (right) variations.
Sinclair ZX80 (1980)
The ZX80 only had 64 printable characters (along with their inverse forms), but their organisation meant hexadecimal encoding/decoding was trivial.
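To see why this layout helps, here is a minimal Python sketch. It assumes that "0" has code 0x1C in the ZX80 character set and that "1" to "9" and then "A" to "Z" follow contiguously; those code values and all function names are assumptions of this example rather than something stated on this page.

```python
# Assumption: "0" is code 0x1C, with "1"-"9" and then "A"-"Z" following
# contiguously (so "A" would be 0x26).
ZX80_ZERO = 0x1C


def zx80_code_for_nibble(value):
    """Return the (assumed) ZX80 character code for a hex digit value 0-15."""
    if not 0 <= value <= 15:
        raise ValueError("nibble out of range")
    return ZX80_ZERO + value            # no special case between "9" and "A"


def nibble_for_zx80_code(code):
    """Return the hex digit value 0-15 for an (assumed) ZX80 character code."""
    value = code - ZX80_ZERO
    if not 0 <= value <= 15:
        raise ValueError("not a hexadecimal digit")
    return value


def ascii_code_for_nibble(value):
    """The ASCII equivalent needs a branch to skip the gap after "9"."""
    return ord("0") + value if value < 10 else ord("A") + value - 10


for v in (0, 9, 10, 15):
    print(v, hex(zx80_code_for_nibble(v)), chr(ascii_code_for_nibble(v)))
```

With digits and letters adjacent, conversion in either direction is a single addition or subtraction, whereas ASCII needs a comparison to jump over the seven characters between "9" and "A".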
Sinclair ZX81 (1981)
ZX81 had the same characters as the ZX80, but in a slightly different order.
Acorn BBC Micro (1981)
The character set for modes 0 to 6 was essentially ASCII but with backquote replaced with a pound sign:
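A conversion to Unicode for these modes could therefore be sketched as an ASCII decode with one substitution. The function name `bbc_to_unicode` is hypothetical, and the sketch assumes the input contains only 7-bit printable codes.

```python
def bbc_to_unicode(data):
    """Decode BBC Micro (modes 0 to 6) text bytes into a Unicode string.

    Assumes 7-bit input; the only difference from ASCII handled here is
    code 0x60, which displays as a pound sign instead of a backquote.
    """
    return data.decode("ascii").replace("`", "\u00A3")


print(bbc_to_unicode(b"Price: \x60100"))    # Price: £100
```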
In mode 7, the Latin G0 (English) Teletext character set is followed when the most significant bit (MSB) is set. When the MSB is clear (the upper grid below), the hash, pound and underscore characters are shuffled to align more closely with ASCII.
When Teletext graphics mode is toggled on, this shuffling wreaks havoc with bit-to-pixel mapping. For this reason, it is advisable to set the MSB when rendering block graphics.
Sinclair ZX Spectrum (1982)
The printable ZX Spectrum character set was very close to ASCII. Only the "↑", "£" and "©" characters were swapped in.
ISO/IEC 8859-1 (1987), sometimes erroneously named "8-bit ASCII", is the basis for the first 256 codepoints of Unicode:
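Python's built-in "latin-1" codec implements this mapping, so the relationship can be checked directly: every byte value decodes to the codepoint with the same number. This is a small sketch; the variable names are arbitrary.

```python
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")

# Every byte value N decodes to codepoint U+00NN.
print(all(ord(ch) == byte for ch, byte in zip(text, all_bytes)))   # True

# "é" is U+00E9, so it is the single byte 0xE9 in ISO/IEC 8859-1.
print("\u00E9".encode("latin-1").hex())                            # e9
```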
Alphabets, Abjads and Abugidas
Here is a chronological list of writing systems according to ISO-15924 (which is also maintained by the Unicode Consortium).
|ISO-15924 Name||Code||Classification||Region||Since||Until||Unicode Block(s)||Codepoints|
The 26 letters of the basic Latin alphabet are used for English and French:
The German alphabet adds four letters:
The story of the capital Eszett "ẞ" is quite interesting.
Until 2010, the Spanish alphabet officially treated "CH" and "LL" as separate letters. Now, only "Ñ" is treated as an additional twenty-seventh letter:
The Dutch alphabet also has an additional letter, "Ĳ":
The modern Italian alphabet has only 21 letters:
The Polish alphabet consists of 32 letters:
Icelandic also has 32 letters, albeit very different:
Esperanto has 28 letters:
The Latin alphabet has also been the basis for supranational alphabets such as the International African Alphabet (1928) with 36 letters:
This was developed into the World Orthography (1948) alphabet with 31 letters:
The 33 letters of the basic Cyrillic alphabet are used for Russian:
The Belarusian Cyrillic alphabet has 32 letters:
The Ukrainian alphabet has 33 letters:
The Bulgarian alphabet has 30 letters:
The Serbian Cyrillic alphabet has 30 letters:
The Armenian alphabet consists of 38 letters:
Caucasian Albanian Alphabet
The Caucasian Albanian alphabet has 52 letters but no case distinction:
The Elbasan alphabet had 40 letters and was used in Albanian religious texts:
The extended Avestan alphabet has 38 consonants and 16 vowels:
The Coptic alphabet has uppercase and lowercase letters:
This Slavic alphabet also has uppercase and lowercase:
The Phase G (1918) Bamum script has 80 characters:
Bassa Vah Alphabet
The abandoned Bassa Vah alphabet of Liberia had 23 consonants and 7 vowels:
The Greek alphabet consists of 24 letters with uppercase and lowercase forms. Sigma also has a word-final form:
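The word-final form matters for case conversion, because a capital sigma can map to either lowercase form depending on its position. Python's `str.lower()` takes this into account, as the following sketch shows; the sample word is chosen for illustration.

```python
word = "\u039F\u0394\u03A5\u03A3\u03A3\u0395\u03A5\u03A3"   # ΟΔΥΣΣΕΥΣ
lowered = word.lower()

print(lowered)                          # οδυσσευς
print(f"U+{ord(lowered[3]):04X}")       # U+03C3 (sigma inside the word)
print(f"U+{ord(lowered[-1]):04X}")      # U+03C2 (final sigma at the end)

# Both lowercase forms map back to the same capital letter.
print("\u03C3".upper() == "\u03C2".upper() == "\u03A3")     # True
```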
Based on the Greek alphabet with additions for the Gothic language:
The Carian alphabet from Kaunos is thought to be the most complete version:
There are four forms of the modern 33-letter Georgian alphabet:
- Asomtavruli is the oldest form, dating from the fifth century A.D.
- Nuskhuri dates from the ninth century A.D.
- Mkhedruli is the current Georgian script
- Mtavruli is the uppercase version of Mkhedruli
Modern Hangul is written using 24 basic letters (14 consonants and 10 vowels):
These are organised into jamo (19 compound consonants and 21 compound vowels):
Additionally, precomposed Hangul syllables are encoded as individual Unicode codepoints.
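The precomposed syllables are laid out arithmetically, so a syllable's codepoint can be computed from its jamo. The sketch below uses the constants from the Unicode Standard's Hangul composition algorithm (19 leading consonants, 21 vowels and 27 trailing consonants plus "no trailing consonant"); the function name `compose_syllable` is made up for this example.

```python
import unicodedata

S_BASE = 0xAC00     # first precomposed syllable
L_BASE = 0x1100     # first leading consonant jamo
V_BASE = 0x1161     # first vowel jamo
T_BASE = 0x11A7     # one before the first trailing consonant jamo
V_COUNT = 21        # number of vowels
T_COUNT = 28        # 27 trailing consonants plus "none"


def compose_syllable(lead, vowel, tail=None):
    """Combine conjoining jamo into one precomposed Hangul syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(tail) - T_BASE if tail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


han = compose_syllable("\u1112", "\u1161", "\u11AB")
print(han, f"U+{ord(han):04X}")                                     # 한 U+D55C

# NFC normalisation performs the same composition.
print(unicodedata.normalize("NFC", "\u1112\u1161\u11AB") == han)    # True
```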
The Hebrew abjad is a right-to-left script. So when rendered as text below, the first letter appears on the right-hand side:
Unicode includes three special format characters:
- U+202D LEFT-TO-RIGHT OVERRIDE (LRO)
- U+202E RIGHT-TO-LEFT OVERRIDE (RLO)
- U+202C POP DIRECTIONAL FORMATTING (PDF)
Like many non-Latin scripts, Hebrew does not distinguish between uppercase and lowercase, but it does have variant forms for letters that appear at the end of words.
Arabic script is also written right-to-left but in a cursive (joined) form:
The Adlam alphabet for the Fulani language was invented in 1990 by two young brothers:
Brahmic abugidas are not true alphabets. They typically use diacritics to represent some vowels. Consequently, the following usually only list the consonants.
The 24 consonants of the extinct Ahom language:
The 33 pure consonants of the Assamese language:
The 33 consonants of the Balinese language:
The 19 basic characters (surat) of the Karo variant of the Batak script:
The 15 consonants of Tagalog from the Philippines:
The 31 consonants of the Bengali (Bangla) language:
Bhaiksuki has 33 consonants and was used around the turn of the first millennium for writing Sanskrit:
The original Brahmi script had 34 consonants:
The 15 consonants of the Buhid language:
The 34 consonants of the Burmese language:
The 32 consonants of the Chakma language:
The 35 consonants of the Cham language:
The 33 consonants of Devanagari:
Dhives Akuru was used to write the Maldivian language up until the 20th century:
Dogra Akkhar was used to write Dogri:
Grantha is used in traditional Vedic schools to write Sanskrit:
The 34 consonants of the Gujarati language:
Used for writing the Adilabad dialect of the Gondi language:
Used for writing the Punjabi language:
Used for writing the Hanunó'o language:
The 20 consonants of the Javanese script in hanacaraka order:
Historically used for writing legal, administrative, and private records:
Used for writing Kannada, Konkani, Tulu, Badaga, Kodava, Beary and others:
Used for writing Kayah languages
Used by the Khasa, Saka, and Yuezhi peoples:
There are 35 consonants in the Khmer language, though two ("ឝ" and "ឞ") are obsolete:
The Khojki script was used by the Khoja community for Muslim religious literature:
Khudawadi, also known as Khudabadi, is used for writing the Sindhi language:
There are 27 consonants of the modern Lao language:
The name pairs "FO TAM"/"FO SUNG" and "LO LING"/"LO LOOT" were accidentally and irrevocably switched when Lao was added to Unicode.
Used for writing the Lepcha (Róng) language:
Used for writing the Limbu language:
Used for writing the Buginese language:
Historically used in northern India for writing accounts and financial records.
Used in South Sulawesi, Indonesia for writing the Makassarese language:
Used for writing the Malayalam language:
Used in the Tibetan Bön tradition to write the extinct Zhang-Zhung language:
Used for writing Gondi but based on Brahmi characters:
Used for the Meitei language:
Used to write the Marathi language:
Used to write the Multani language:
Historically used to write Sanskrit in southern India:
New Tai Lue
The 44 consonants of the Tai Lü language come in pairs to denote two tonal registers (high and low):
The Odia (or Oriya) script is used for writing the Odia language:
There are 41 basic letters in the ʼPhags-pa script, historically used during the Mongol Yuan dynasty:
Used to write Sanskrit, Nepali, Hindi, Bengali, and Maithili languages:
The Rejang language is mostly obsolete:
The Saurashtra language is also mostly obsolete:
Used for writing Sanskrit and Kashmiri:
Used for writing Sanskrit
In addition to 18 consonants, the modern Sinhala language has 12 independent vowels:
Developed by the monk and scholar Zanabazar in 1686 to write Mongolian:
The national symbol for Mongolia derives from this script:
Modern Sundanese has 18 main consonants and 7 independent vowels:
There are 27 main consonants of the Sylheti language:
There are 13 consonants in the Tagbanwa languages:
Used for writing the Tai Nüa language:
There are 47 consonants in the full Tai Tham (Lanna) script:
Used for writing the Tai Dam language with 48 consonants split into high and low forms:
Used for writing Chambeali and other languages:
There are 18 basic consonants in the Tamil language:
The 35 main consonants of the Telugu language:
The Thai script has 44 consonants:
Tibetan has 30 basic consonants:
Historically used for the Maithili language with 33 consonants:
Used to write Mongolian, Tibetan and Sanskrit:
The Geʽez abjad was used in Ethiopia until the advent of Christianity and had 26 consonants. Vowels were not indicated.
Since about 350 A.D. Geʽez has been written as an abugida (alphasyllabary). However, instead of using diacritics to denote vowels, Geʽez uses different letter forms:
Bopomofo Phonetic Script
Bopomofo (Zhuyin fuhao) is an official transliteration system in Taiwan.
Unified Canadian Aboriginal Syllabics
Canadian syllabic scripts are abugidas where vowels are denoted by rotation of the consonants instead of by diacritics:
The Cherokee syllabary has 86 letters (including the archaic "Ᏽ"):
The Deseret alphabet was an attempt at spelling reform by the Mormons in the mid-nineteenth century:
Developed in the early 20th century by missionary James Fraser, with 30 consonants and 10 vowels:
The English Braille alphabet:
Developed in 1974 by Valerie Sutton for writing sign languages. It was based on her experience two years earlier developing a system for writing down dance movements.
Pahawh Hmong Script
Nyiakeng Puachue Hmong Script
Old Hungarian Alphabet
Khitan Small Script
Mende Kikakui Syllabary
Meroitic Cursive Alphabet
Traditionally written vertically, from top to bottom, Mongolian script is typically rendered horizontally on devices:
Old North Arabian Alphabet
Old South Arabian Alphabet
The story behind the Ogham space mark is itself interesting.
Ol Chiki Alphabet
Old Turkic Alphabets
Old Uyghur Alphabet
Pau Cin Hau Alphabet
Old Permic Alphabet
Pollard Miao Abugida
Hanifi Rohingya Alphabet
Named after George Bernard Shaw who posthumously funded a competition for English language writing reform. It was ultimately "won" by Kingsley Read:
Old Sogdian Abjad
Sora Sompeng Alphabet
There is no single uppercase codepoint for 'ǰ' (U+01F0), but 'J' followed by a combining caron can be used instead: 'J̌' (U+004A U+030C).
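Python's `str.upper()` applies Unicode's full case mappings, so it produces exactly this two-codepoint sequence. A short sketch, with arbitrary variable names:

```python
lower = "\u01F0"                # LATIN SMALL LETTER J WITH CARON
upper = lower.upper()

print([f"U+{ord(c):04X}" for c in upper])   # ['U+004A', 'U+030C']
print(len(lower), len(upper))               # 1 2

# Lowercasing again gives "j" plus a combining caron, not the single codepoint.
print([f"U+{ord(c):04X}" for c in upper.lower()])   # ['U+006A', 'U+030C']
```

One consequence is that converting case can change the length of a string, which is worth keeping in mind when indexing into converted text.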
Although pressed into clay with a pointed stick, its thirty symbols were unrelated to Sumero-Akkadian cuneiform.
Old Persian Semisyllabary
A semi-alphabetic cuneiform script loosely inspired by Sumero-Akkadian cuneiform.
Warang Citi Alphabet
Modern Yezidi is a Kurdish alphabet without ligatures.
Modern Yi Syllabary
Old Italic Alphabets
Several Old Italic alphabets shared the same Unicode codepoints. It is assumed that font character variations are used to display the glyphs slightly differently, where appropriate.
From the Marsiliana d'Albegna tablet of the 7th century BCE:
Used from 7th century BCE to 5th century BCE:
Used from 4th century BCE to 1st century BCE:
Used from 5th century BCE to 1st century CE:
Used from 7th century BCE to 1st century BCE:
Used from 7th century BCE to 2nd century BCE:
From only four inscriptions from about 650 BCE:
Used from 6th century BCE to 4th century BCE:
Used from 6th century BCE to 1st century BCE:
Used from 5th century BCE to 1st century CE:
Also known as the Lepontic alphabet. Used from 550 BCE to 100 CE:
Used from 2nd to 8th centuries CE:
Used from 5th to 11th centuries CE:
Younger Futhark (long-branch)
Used in Denmark from 8th to 12th centuries CE:
Younger Futhark (short-twig)
Used in Sweden and Norway from 8th to 12th centuries CE:
Used from 12th to 15th centuries CE:
A runic alphabet created by J.R.R. Tolkien for "The Hobbit" to transliterate English.
These are scripts that use characters (often pictorial) to represent words or morphemes. This includes hieroglyphs and Chinese characters.
Originally used for the Sumerian language, cuneiform was also used for Akkadian (Assyrian/Babylonian), Eblaite, Amorite, Elamite, Hattic, Hurrian, Urartian, Hittite and other languages.
See also Ugaritic abjad.
Still undeciphered but assumed to be syllabic.
Deciphered by Michael Ventris in 1952.
Han (Hanzi, Kanji, Hanja)
A syllabic script created and used exclusively by women in Hunan Province, China. Women were forbidden formal education there and developed the script in order to communicate with one another.
Used for writing the extinct Tangut language of the Western Xia dynasty, China.
The "Aegean Numbers" Aegean_Numbers block includes symbols for units (1-9, first row), tens (10-90, second row), hundreds, thousands and ten-thousands used in Linear A, Linear B and the Cypriot syllabary.
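Each of these symbols carries a numeric value in the Unicode Character Database, which Python exposes through `unicodedata.numeric`. The sketch below sums the values of the symbols in a string, on the assumption that the notation is purely additive; the function name `aegean_value` is made up for this example.

```python
import unicodedata


def aegean_value(text):
    """Sum the numeric values of Aegean number symbols (assumes additive notation)."""
    return sum(int(unicodedata.numeric(ch)) for ch in text)


# U+10122 (one thousand), U+10119 (one hundred), U+10110 (ten), U+10107 (one)
number = "\U00010122\U00010119\U00010110\U00010107"
print(aegean_value(number))                         # 1111
print(unicodedata.name("\U0001012B"))               # AEGEAN NUMBER TEN THOUSAND
```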
The "Alchemical Symbols" Alchemical block:
Alphabetic Presentation Forms
The "Alphabetic Presentation Forms" Alphabetic_PF block includes Latin, Armenian and Hebrew ligatures:
Ancient Greek Musical Notation
The "Ancient Greek Musical Notation" Ancient_Greek_Music block:
Ancient Greek Numbers
The "Ancient Greek Numbers" Ancient_Greek_Numbers block:
The "Ancient Symbols" Ancient_Symbols block:
Arabic Mathematical Alphabetic Symbols
The "Arabic Mathematical Alphabetic Symbols" Arabic_Math block:
The "Arrows" Arrows|Sup_Arrows_A|Sup_Arrows_B|Sup_Arrows_C blocks:
The "Block Elements" Block_Elements block:
The "Box Drawing" Box_Drawing block:
Byzantine Musical Symbols
The "Byzantine Musical Symbols" Byzantine_Music block:
The "Chess Symbols" Chess_Symbols block (when combined with standard chess pieces U+2654-265F from the "Miscellaneous Symbols" block):
The "CJK Compatibility" CJK_Compat block includes symbols for hours of the day, days of the month and various Latin abbreviations for units:
CJK Compatibility Forms
The "CJK Compatibility Forms" CJK_Compat_Forms block:
The "CJK Strokes" CJK_Strokes block:
CJK Symbols and Punctuation
The "CJK Symbols and Punctuation" CJK_Symbols block:
The "Control Pictures" Control_Pictures block:
Coptic Epact Numbers
The "Coptic Epact Numbers" Coptic_Epact_Numbers block:
Counting Rod Numerals
The "Counting Rod Numerals" Counting_Rod block:
The "Currency Symbols" Currency_Symbols block:
Combining Diacritical Marks
The "Combining Diacritical Marks" Diacriticals|Diacriticals_Ext|Diacriticals_Sup|Diacriticals_For_Symbols|Half_Marks blocks:
For example, to compose the missing uppercase 'J' with caron, we use a combining caron (U+004A U+030C):
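Normalisation shows the difference between a sequence that has a precomposed equivalent and one that does not. In this sketch, using Python's standard `unicodedata` module, "e" plus a combining acute accent collapses to a single codepoint under NFC, while "J" plus a combining caron stays as two.

```python
import unicodedata

e_acute = "e\u0301"     # "e" + COMBINING ACUTE ACCENT
j_caron = "J\u030C"     # "J" + COMBINING CARON

# A precomposed "é" (U+00E9) exists, so NFC composes the pair.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", e_acute)])   # ['U+00E9']

# No precomposed capital J with caron exists, so the pair is left alone.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", j_caron)])   # ['U+004A', 'U+030C']

# A non-zero combining class marks U+030C as a combining character.
print(unicodedata.combining("\u030C"))      # 230
```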
The "Dingbats" Dingbats block:
The "Domino Tiles" Domino block:
The "Emoticons" Emoticons block:
The "Enclosed Alphanumerics" Enclosed_Alphanum block:
The "Enclosed Alphanumerics Supplement" Enclosed_Alphanum_Sup block includes twenty-six "Regional indicator symbols" which can be paired together to produce regional flags with the right font support. In this case, I'm using the BabelStone Flags webfont:
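The pairing works by simple offset: the regional indicator for "A" is U+1F1E6, and the other letters follow in order. A minimal sketch, with a hypothetical helper name `flag` that takes a two-letter region code:

```python
REGIONAL_INDICATOR_A = 0x1F1E6      # REGIONAL INDICATOR SYMBOL LETTER A


def flag(region_code):
    """Turn a two-letter region code such as "GB" into a regional indicator pair."""
    if len(region_code) != 2 or not (region_code.isascii() and region_code.isalpha()):
        raise ValueError("expected two ASCII letters")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in region_code.upper()
    )


pair = flag("GB")
print([f"U+{ord(c):04X}" for c in pair])    # ['U+1F1EC', 'U+1F1E7']
```

Whether the pair is drawn as a flag or as two boxed letters depends entirely on the font and renderer.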
Enclosed CJK Letters and Months
The "Enclosed CJK Letters and Months" Enclosed_CJK block:
Enclosed Ideographic Supplement
The "Enclosed Ideographic Supplement" Enclosed_Ideographic_Sup block includes six symbols from Chinese folk religion: "luck", "prosperity", "longevity", "happiness", "double happiness" and "wealth":
The "Geometric Shapes" Geometric_Shapes block:
Geometric Shapes Extended
The "Geometric Shapes Extended" Geometric_Shapes_Ext block:
Halfwidth and Fullwidth Forms
The "Halfwidth and Fullwidth Forms" Half_And_Full_Forms block includes fullwidth versions of the ASCII characters for use alongside ideographic glyphs.
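The fullwidth forms of the printable ASCII characters "!" to "~" sit at a fixed offset of 0xFEE0 from their ASCII counterparts, and NFKC normalisation maps them back. The sketch below illustrates both directions; the function name `to_fullwidth` is made up for this example, and the space character is deliberately left untouched.

```python
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0       # U+FF01 minus U+0021


def to_fullwidth(text):
    """Replace printable ASCII ("!" to "~") with its fullwidth form."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if "!" <= ch <= "~" else ch
        for ch in text
    )


wide = to_fullwidth("Abc!")
print([f"U+{ord(c):04X}" for c in wide])    # ['U+FF21', 'U+FF42', 'U+FF43', 'U+FF01']

# Compatibility normalisation folds the fullwidth forms back to ASCII.
print(unicodedata.normalize("NFKC", wide))  # Abc!
```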
Ideographic Description Characters
The "Ideographic Description Characters" IDC block:
Indic Number Forms
The "Common Indic Number Forms" Indic_Number_Forms block:
Indic Siyaq Numbers
The "Indic Siyaq Numbers" Indic_Siyaq_Numbers block:
International Phonetic Alphabet
The "IPA Extensions", "Phonetic Extensions" and "Phonetic Extensions Supplement" IPA_Ext|Phonetic_Ext|Phonetic_Ext_Sup blocks:
The "Kanbun" Kanbun block:
The block name in Unicode 1.0 was "CJK Miscellaneous" and its codepoint range was defined differently, including the then-unallocated space now occupied by "Bopomofo Extended", "CJK Strokes" and "Katakana Phonetic Extensions".
The "Letterlike Symbols" Letterlike_Symbols block:
The "Mahjong Tiles" Mahjong block:
Mathematical Alphanumeric Symbols
The "Mathematical Alphanumeric Symbols" Math_Alphanum block:
The "Mathematical Operators" Math_Operators|Sup_Math_Operators blocks:
The "Mayan Numerals" Mayan_Numerals block:
Miscellaneous Mathematical Symbols
The "Miscellaneous Mathematical Symbols" Misc_Math_Symbols_A|Misc_Math_Symbols_B blocks:
The "Miscellaneous Symbols" Misc_Symbols block:
Miscellaneous Symbols and Arrows
The "Miscellaneous Symbols and Arrows" Misc_Arrows block:
Miscellaneous Symbols and Pictographs
The "Symbols and Pictographs" Misc_Pictographs|Sup_Symbols_And_Pictographs|Symbols_And_Pictographs_Ext_A blocks:
The "Miscellaneous Technical" Misc_Technical block:
The "Spacing Modifier Letters" Modifier_Letters block:
Modifier Tone Letters
The "Modifier Tone Letters" Modifier_Tone_Letters block:
The "Musical Symbols" Music block:
The "Number Forms" Number_Forms block:
Optical Character Recognition
The "Optical Character Recognition" OCR block:
The "Ornamental Dingbats" Ornamental_Dingbats block:
Ottoman Siyaq Numbers
The "Ottoman Siyaq Numbers" Ottoman_Siyaq_Numbers block:
The "Phaistos Disc" Phaistos block:
The "Playing Cards" Playing_Cards block:
Private Use Areas
The blocks PUA|Sup_PUA_A|Sup_PUA_B are reserved for private use. They can hold characters not covered by the Unicode Standard, e.g. Klingon.
The ConScript Unicode Registry maintains a list of Private Use codepoints allocated for constructed/artificial scripts. The text below is rendered with the "Klingon pIqaD HaSta" font.
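Private-use codepoints can be recognised programmatically by their general category, "Co". The following sketch lists the three private-use ranges and checks a codepoint against the category; the names `PRIVATE_USE_RANGES` and `is_private_use` are made up for this example.

```python
import unicodedata

# Inclusive ranges of private-use codepoints in the three blocks.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),           # Private Use Area
    (0xF0000, 0xFFFFD),         # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),       # Supplementary Private Use Area-B
]


def is_private_use(char):
    """Return True if the character has general category Co (private use)."""
    return unicodedata.category(char) == "Co"


print(is_private_use("\uE000"))             # True
print(is_private_use("A"))                  # False
print(sum(end - start + 1 for start, end in PRIVATE_USE_RANGES))    # 137468
```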
The "General Punctuation" Punctuation block:
The "Supplemental Punctuation" Sup_Punctuation block:
Rumi Numeral Symbols
The "Rumi Numeral Symbols" Rumi block:
Shorthand Format Controls
The "Shorthand Format Controls" Shorthand_Format_Controls block:
Sinhala Archaic Numbers
The "Sinhala Archaic Numbers" Sinhala_Archaic_Numbers block:
Small Form Variants
The "Small Form Variants" Small_Forms block contains small punctuation characters for compatibility with the Chinese National Standard 11643.
The "Specials" Specials block:
Superscripts and Subscripts
The "Superscripts and Subscripts" Super_And_Sub block:
The blocks High_Surrogates|High_PU_Surrogates|Low_Surrogates are reserved for surrogate codepoints, which UTF-16 uses in pairs to encode codepoints above U+FFFF.
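The arithmetic behind surrogates is short enough to sketch in Python: a codepoint above U+FFFF is reduced by 0x10000 and its remaining 20 bits are split between a high and a low surrogate. The function name `to_surrogates` is made up for this example; the result is compared with Python's own UTF-16 encoder.

```python
def to_surrogates(codepoint):
    """Split a supplementary codepoint into its UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need surrogates")
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")                      # D83D DE00
print("\U0001F600".encode("utf-16-be").hex())       # d83dde00

# Supplementary private-use codepoints use the "High Private Use Surrogates".
print(f"{to_surrogates(0xF0000)[0]:04X}")           # DB80
```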
Symbols for Legacy Computing
The "Symbols for Legacy Computing" Symbols_For_Legacy_Computing block:
The "Tags" Tags block:
Tai Xuan Jing Symbols
The "Tai Xuan Jing Symbols" Tai_Xuan_Jing block:
Transport and Map Symbols
The "Transport and Map Symbols" Transport_And_Map block:
The "Variation Selectors" VS|VS_Sup blocks:
Yijing Hexagram Symbols
The "Yijing Hexagram Symbols" Yijing block:
Znamenny Musical Notation
The "Znamenny Musical Notation" Znamenny_Music block:
Many of these topics are covered by my "Unicode Trivia" blog posts.
Development of the English Alphabet
The following should be taken with a pinch of salt. In particular, the changes of positions of letters and the re-allocation of sounds are completely ignored. But according to sidebars for individual letters in Wikipedia, the development of the majuscules of the English alphabet was as follows:
- Phoenician alphabet (c.1050 BCE)
- Ancient Greek alphabet (c.750 BCE)
- Etruscan alphabet (c.700 BCE)
- Archaic Latin alphabet (c.600 BCE)
- Old Latin alphabet (c.250 BCE)
- Classical Latin alphabet (c.50 CE)
- Old English alphabet (c.750 CE)
- Modern English alphabet (c.1550 CE)
This is a list of all blocks in Unicode 14.0: