UTF-8 aware string.sub
The built-in Lua function string.sub() indexes strings by byte, so it does not work correctly with the UTF-8 strings that are pervasive in non-US clients for World of Warcraft. For example, run the following code in WoW:
/run print(string.sub("Gnøppix", 1, 3))
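WoW will show Gn?, which is not what we would expect. This is because ø is a multi-byte UTF-8 character (it is encoded as the two bytes "\195\184"), so the first three bytes of the string are "Gn" followed by only the first half of ø. Luckily, the UTF-8 encoding is very regular and well-defined: the first byte of every character tells you how many bytes the character occupies. The utf8sub function in the Snippet section below uses this to take a substring of a UTF-8 string.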
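Before turning to that function, the byte layout described above can be checked directly. This is a minimal sketch in plain Lua; the variable names are illustrative only, and ø is written as byte escapes so the result does not depend on how the source file is encoded.

```lua
-- "Gnøppix" with ø spelled out as its two UTF-8 bytes.
local name = "Gn\195\184ppix"

print(#name)                    -- 8: seven characters, but eight bytes
print(string.byte(name, 3, 4))  -- 195 and 184: the two bytes that encode ø

-- string.sub counts bytes, so asking for 3 "characters" cuts ø in half
-- and keeps only its first byte.
local broken = string.sub(name, 1, 3)
print(#broken)                  -- 3
print(string.byte(broken, 3))   -- 195: a lead byte with nothing after it
```

The dangling byte 195 is not a complete character on its own, which is why the client renders it as ?.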
Example usage:
print(utf8sub("Gnøppix", 1, 2)) -- Gn
print(utf8sub("Gnøppix", 1, 3)) -- Gnø
print(utf8sub("Gnøppix", 2, 3)) -- nøp
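Note that the arguments differ from those of string.sub: the second argument is the position of the first character to return, and the third argument is the number of characters to return rather than an end index. Both are counted in codepoints, not bytes. Note also that the function does not validate its input, so it will return an incorrect result if the string is not well-formed UTF-8.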
Snippet
-- UTF-8 Reference:
-- 0xxxxxxx - 1 byte UTF-8 codepoint (ASCII character)
-- 110xxxxx - First byte of a 2 byte UTF-8 codepoint
-- 1110xxxx - First byte of a 3 byte UTF-8 codepoint
-- 11110xxx - First byte of a 4 byte UTF-8 codepoint
-- 10xxxxxx - Inner byte of a multi-byte UTF-8 codepoint

-- Returns the size in bytes of the codepoint whose first byte is char.
local function chsize(char)
    if not char then
        return 0
    elseif char >= 240 then -- 11110xxx
        return 4
    elseif char >= 224 then -- 1110xxxx
        return 3
    elseif char >= 192 then -- 110xxxxx
        return 2
    else
        return 1
    end
end

-- This function can return a substring of a UTF-8 string, properly handling
-- UTF-8 codepoints. Rather than taking a start index and optionally an end
-- index, it takes the string, the starting character, and the number of
-- characters to select from the string.
local function utf8sub(str, startChar, numChars)
    -- Skip over the characters before startChar to find its byte index.
    local startIndex = 1
    while startChar > 1 do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + chsize(char)
        startChar = startChar - 1
    end

    -- Advance over numChars characters, stopping at the end of the string.
    local currentIndex = startIndex
    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        numChars = numChars - 1
    end

    return str:sub(startIndex, currentIndex - 1)
end
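The lead-byte rules in the reference table can be sanity-checked against characters of known length. The sketch below is an illustration only: leadByteSize is a hypothetical stand-alone helper written for this check, not part of the snippet, and the sample characters are spelled as byte escapes so the file encoding does not matter.

```lua
-- Hypothetical helper: size in bytes of the UTF-8 sequence that starts
-- with the given lead byte, following the bit patterns in the table.
local function leadByteSize(byte)
    if byte >= 240 then return 4      -- 11110xxx
    elseif byte >= 224 then return 3  -- 1110xxxx
    elseif byte >= 192 then return 2  -- 110xxxxx
    else return 1 end                 -- 0xxxxxxx; continuation bytes also land here
end

-- Each entry: the character's bytes, and how many bytes it should occupy.
local samples = {
    { "A",                1 },
    { "\195\184",         2 },  -- ø (U+00F8)
    { "\226\130\172",     3 },  -- € (U+20AC)
    { "\240\159\152\128", 4 },  -- U+1F600
}

for _, sample in ipairs(samples) do
    local text, expected = sample[1], sample[2]
    local size = leadByteSize(string.byte(text, 1))
    -- Prints the real byte length, the computed size, and whether they agree.
    print(#text, size, size == expected)
end
```

The boundary values matter here: 240 is itself a valid 4-byte lead byte and 224 a valid 3-byte lead byte, so the comparisons need to include them.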