What is COLLATE in a Database and What Does It Do?

What is COLLATE?
When creating a Database or Table, we define the CHARACTER SET and COLLATE. For example, in MariaDB, utf8mb4_unicode_ci
consists of:
Character set
utf8mb4
means using a character set encoded in full 4-byte UTF-8 (older utf8
was 3-byte UTF-8, which doesn’t support full emojis and Unicode).
Collation Algorithm
unicode
means sorting (the value of) characters according to Unicode.
Diacritical sensitivity
ai
(ancient insensitive) e.g., a == ᾰas
(ancient sensitive) e.g., a != ᾰci
(case insensitive) e.g., A == acs
(case sensitive) e.g., A != a
Usecase
The simplest example is probably searching for names, whether people or places, like:
SELECT
_id,
first_name,
last_name
FROM
users
WHERE
LOWER(first_name) = 'kawin';
Result:
[
{
"_id": 5,
"first_name": "KAWIN",
"last_name": "eiei"
},
{
"_id": 4,
"first_name": "kawin",
"last_name": "naja"
},
{
"_id": 3,
"first_name": "Kawin",
"last_name": "Viriyaprasopsook"
}
]
We get Kawin
, kawin
, KAWIN
, which seems normal. But if we try to EXPLAIN it:
EXPLAIN FORMAT=JSON
SELECT
_id,
first_name,
last_name
FROM
users
WHERE
LOWER(first_name) = 'kawin';
We will find that our query reads all 1000 rows in the search.
{
"query_block": {
"select_id": 1,
"nested_loop": [
{
"table": {
"table_name": "users",
"access_type": "index",
"key": "udx_full_name",
"key_length": "518",
"used_key_parts": ["first_name", "last_name"],
"rows": 1000,
"filtered": 100,
"attached_condition": "lcase(users.first_name) = 'kawin'",
"using_index": true
}
}
]
}
}
This is called index obscure, meaning the database sees the WHERE
clause as fn(column)
and has to iterate through the entire table to apply LOWER(). If the table/column was already created as case-insensitive, we can just use =
.
EXPLAIN FORMAT=JSON
SELECT
_id,
first_name,
last_name
FROM
users
WHERE
first_name = 'kawin';
Which will EXPLAIN to this, reading only 3 rows instead of the entire table, and the result is the same as using LOWER().
{
"query_block": {
"select_id": 1,
"nested_loop": [
{
"table": {
"table_name": "users",
"access_type": "ref",
"possible_keys": ["udx_full_name"],
"key": "udx_full_name",
"key_length": "259",
"used_key_parts": ["first_name"],
"ref": ["const"],
"rows": 3,
"filtered": 100,
"attached_condition": "users.first_name = 'kawin'",
"using_index": true
}
}
]
}
}
UNIQUE
constraints, as the database will consider “Kawin” to be a duplicate of “kawin”.