UTF-8 aware string.sub
The built-in Lua function string.sub() operates on bytes rather than characters, so it can cut multi-byte characters in half. This is a problem for the UTF-8 strings that are pervasive in non-US clients for World of Warcraft. For example, run the following code in WoW:
/run print(string.sub("Gnøppix", 1, 3))
WoW will show Gn?, which is not what we would expect. This is because ø is a multi-byte UTF-8 character: it is encoded as the two bytes "\195\184", and string.sub() returns only the first of them. Luckily, the UTF-8 encoding is very regular and well-defined. The utf8sub function in the Snippet section below takes a substring of a UTF-8 string without splitting characters.
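To see the two bytes directly, the following minimal sketch inspects the string with string.byte. The variable names are only illustrative, and ø is written as explicit byte escapes so the result does not depend on how the source file is saved.

```lua
-- "Gnøppix" with ø spelled out as its two UTF-8 bytes
local name = "Gn\195\184ppix"

print(#name)                   -- 8: the length is counted in bytes, not characters
print(string.byte(name, 3, 4)) -- 195 184: the two bytes that make up ø

-- string.sub cuts after the third byte, leaving half of ø behind
local cut = string.sub(name, 1, 3)
print(string.byte(cut, 1, -1)) -- 71 110 195
```

The dangling byte 195 is not a complete character on its own, which is why the client displays a placeholder instead of ø.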
Example usage:
print(utf8sub("Gnøppix", 1, 2)) -- Gn
print(utf8sub("Gnøppix", 1, 3)) -- Gnø
print(utf8sub("Gnøppix", 2, 3)) -- nøp
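Note that both numeric parameters are counted in code points, not bytes: the second argument is the 1-based index of the first character to return, and the third argument is the number of characters to return. Note also that this function assumes the string is valid UTF-8. It will return an incorrect result for malformed input, for example a string that begins in the middle of a multi-byte sequence rather than on a code point boundary.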
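Whether a given byte position falls on a code point boundary can be checked by testing for the 10xxxxxx continuation pattern, which covers the byte values 128 to 191. The following is a minimal sketch; isCodepointStart is a hypothetical helper name and is not part of the snippet below.

```lua
-- Hypothetical helper: returns true if byte position i in str begins a code point.
-- Continuation bytes have the form 10xxxxxx, i.e. values 128..191.
local function isCodepointStart(str, i)
    local b = string.byte(str, i)
    if not b then
        return false -- i lies outside the string
    end
    return b < 128 or b > 191
end

local name = "Gn\195\184ppix" -- "Gnøppix", with ø written as explicit bytes
print(isCodepointStart(name, 3)) -- true: \195 is the lead byte of ø
print(isCodepointStart(name, 4)) -- false: \184 is a continuation byte
```

Plain comparisons are used instead of bit operations because Lua 5.1 has no built-in bitwise operators.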
Snippet
-- UTF-8 Reference:
-- 0xxxxxxx - 1 byte UTF-8 codepoint (ASCII character)
-- 110yyyxx - First byte of a 2 byte UTF-8 codepoint
-- 1110yyyy - First byte of a 3 byte UTF-8 codepoint
-- 11110zzz - First byte of a 4 byte UTF-8 codepoint
-- 10xxxxxx - Inner byte of a multi-byte UTF-8 codepoint
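-- Returns the number of bytes in the codepoint that starts with the given
-- lead byte, or 0 if there is no byte (past the end of the string).
local function chsize(char)
    if not char then
        return 0
    elseif char >= 240 then -- 11110zzz
        return 4
    elseif char >= 224 then -- 1110yyyy
        return 3
    elseif char >= 192 then -- 110yyyxx
        return 2
    else
        return 1
    end
end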
-- This function can return a substring of a UTF-8 string, properly handling
-- UTF-8 codepoints. Rather than taking a start index and optionally an end
-- index, it takes the string, the starting character, and the number of
-- characters to select from the string.
local function utf8sub(str, startChar, numChars)
    local startIndex = 1
    while startChar > 1 do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + chsize(char)
        startChar = startChar - 1
    end

    local currentIndex = startIndex

    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        numChars = numChars - 1
    end
    return str:sub(startIndex, currentIndex - 1)
end
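Because utf8sub takes a character count, callers often also need the length of a string in characters rather than bytes. One possible companion, sketched here under the same assumption of valid UTF-8 input, counts every byte that is not a continuation byte. The name utf8len is hypothetical and is not part of the original snippet.

```lua
-- Hypothetical companion: counts codepoints by skipping continuation
-- bytes (10xxxxxx, values 128..191). Assumes str is valid UTF-8.
local function utf8len(str)
    local count = 0
    for i = 1, #str do
        local b = string.byte(str, i)
        if b < 128 or b > 191 then
            count = count + 1
        end
    end
    return count
end

print(#"Gn\195\184ppix")         -- 8 (bytes)
print(utf8len("Gn\195\184ppix")) -- 7 (characters)
print(utf8len("\226\130\172"))   -- 1 (the euro sign is three bytes)
```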