UTF-8 aware string.sub

The built-in Lua function string.sub() indexes strings by byte, so it does not work correctly with the UTF-8 strings that are pervasive in non-US clients for World of Warcraft: a multi-byte character can be cut in half. For example, run the following code in WoW:

/run print(string.sub("Gnøppix", 1, 3))

WoW will show Gn?, which is not what we would expect. This is because ø is a multi-byte UTF-8 character (it is built of the two bytes "\195\184"), and string.sub() stopped after the first of them. Luckily, the UTF-8 encoding is very regular and well-defined. The utf8sub function in the Snippet section below takes a substring of a UTF-8 string without splitting characters.
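As an aside, the byte layout can be confirmed in any Lua interpreter. This is a minimal sketch, separate from the substring function itself; it writes ø as the escape sequence "\195\184" so that it does not depend on how the file is saved, and the variable names are only illustrative:

```lua
-- "Gnøppix" with ø spelled out as its two UTF-8 bytes.
local name = "Gn\195\184ppix"

-- The # operator counts bytes, not characters: 7 characters, 8 bytes.
print(#name)                            -- 8

-- string.byte exposes the two bytes that encode ø.
print(string.byte("\195\184", 1, -1))   -- 195 184

-- Bytes 1..3 are "G", "n", and only the first byte of ø, which is why
-- string.sub(name, 1, 3) ends in an incomplete character.
local cut = string.sub(name, 1, 3)
print(#cut, string.byte(cut, 3))        -- 3 195
```

The trailing byte 195 on its own is not a valid character, so the client has nothing sensible to draw for it.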

Example usage:

print(utf8sub("Gnøppix", 1, 2)) -- Gn
print(utf8sub("Gnøppix", 1, 3)) -- Gnø
print(utf8sub("Gnøppix", 2, 3)) -- nøp

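Note that, in the snippet below, the start parameter is a 1-based position counted in codepoints (not a byte offset), and the last parameter is a count of codepoints rather than an end index. Note also that this function assumes well-formed UTF-8: it will return an incorrect result if a step does not begin on a codepoint boundary, for example when the string contains stray or truncated bytes.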
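Byte offsets and codepoint counts are easy to confuse, so it can help to make the distinction explicit. The following is a minimal sketch, assuming well-formed UTF-8 input; isContinuation, utf8len, and isBoundary are hypothetical helper names introduced here for illustration and are not part of the snippet or of the WoW API:

```lua
-- Continuation bytes have the bit pattern 10xxxxxx, i.e. values 128..191.
local function isContinuation(byte)
    return byte ~= nil and byte >= 128 and byte <= 191
end

-- Count codepoints by counting every byte that is not a continuation byte.
local function utf8len(str)
    local count = 0
    for i = 1, #str do
        if not isContinuation(string.byte(str, i)) then
            count = count + 1
        end
    end
    return count
end

-- A byte offset is a codepoint boundary when the byte there starts a character.
local function isBoundary(str, byteIndex)
    local byte = string.byte(str, byteIndex)
    return byte ~= nil and not isContinuation(byte)
end

local name = "Gn\195\184ppix"   -- "Gnøppix" with ø written as its two bytes
print(#name, utf8len(name))     -- 8   7
print(isBoundary(name, 3))      -- true  (first byte of ø)
print(isBoundary(name, 4))      -- false (second byte of ø)
```

The helpers use plain comparisons instead of bit operations so that they run on a stock Lua 5.1 interpreter as well.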

Snippet

-- UTF-8 Reference:
-- 0xxxxxxx - 1 byte UTF-8 codepoint (ASCII character)
-- 110yyyxx - First byte of a 2 byte UTF-8 codepoint
-- 1110yyyy - First byte of a 3 byte UTF-8 codepoint
-- 11110zzz - First byte of a 4 byte UTF-8 codepoint
-- 10xxxxxx - Inner byte of a multi-byte UTF-8 codepoint

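-- Returns the number of bytes in the codepoint that starts with the given
-- lead byte, or 0 if there is no byte (past the end of the string).
local function chsize(char)
    if not char then
        return 0
    elseif char >= 240 then
        -- 11110zzz: lead bytes 240 and up
        return 4
    elseif char >= 224 then
        -- 1110yyyy: lead bytes 224..239
        return 3
    elseif char >= 192 then
        -- 110yyyxx: lead bytes 192..223
        return 2
    else
        -- ASCII, or a continuation byte in malformed input
        return 1
    end
end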

-- This function can return a substring of a UTF-8 string, properly handling
-- UTF-8 codepoints.  Rather than taking a start index and optionally an end
-- index, it takes the string, the starting character, and the number of
-- characters to select from the string.

local function utf8sub(str, startChar, numChars)
    -- Walk forward one codepoint at a time to find the byte index of startChar.
    local startIndex = 1
    while startChar > 1 do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + chsize(char)
        startChar = startChar - 1
    end

    -- Advance past numChars codepoints, stopping at the end of the string.
    local currentIndex = startIndex

    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        numChars = numChars - 1
    end
    return str:sub(startIndex, currentIndex - 1)
end
Posted by jnwhiteh at Mon, 27 Apr 2009 10:49:05 +0000