UTF-8 + Character Counting

A note on some code I wrote to try counting the characters in a UTF-8 string in C++.

That is the number of characters, not the number of bytes.

  
#include <stdio.h>  
#include <string.h>  
  
int strlen_utf8( const char *buff )  
{  
    if( buff == NULL ) return 0;  
      
    int count = 0;  
    int pos = 0;  
    int max_bytes = strlen( buff );  
      
    // Skip the BOM (EF BB BF) if present  
    if( max_bytes >= 3 )  
    {  
        if( static_cast<unsigned char>( buff[0] ) == 0xEF &&  
            static_cast<unsigned char>( buff[1] ) == 0xBB &&  
            static_cast<unsigned char>( buff[2] ) == 0xBF )  
        {  
            pos += 3;  
        }  
    }  
      
    while( pos < max_bytes )  
    {  
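        ++count; // count one character  
          
        if( ( buff[pos] & 0x80 ) == 0 )  
        {  
            ++pos; // single-byte character  
        }  
        else  
        {  
            // Advance one byte for each leading 1 bit in the first byte  
            for( unsigned char tmp = buff[pos] & 0xfc; (tmp & 0x80); tmp = tmp << 1 )  
            {  
                ++pos; // multi-byte character  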
            }  
        }  
    }  
    return count;  
}  
  
int main()  
{  
    const char *str = "あ過ｻタな";  
    int cnt = strlen_utf8( str );  
      
    printf( "String: %s\n", str );  
    printf( "Count: %d\n", cnt );  
  
    return 0;  
}

Result

  
String: あ過ｻタな  
Count: 5
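
As a cross-check on this result, another commonly used way to count UTF-8 characters is to count every byte that is not a continuation byte (continuation bytes always have the bit pattern 10xxxxxx). The following is only a sketch and not the code from this memo: `count_utf8_code_points` is a hypothetical name, it assumes well-formed input, and unlike `strlen_utf8` it does not skip a leading BOM.

```cpp
#include <cstddef>

// Hypothetical alternative to strlen_utf8 (not from the original memo).
// Counts the bytes that start a character, i.e. every byte that is not
// a continuation byte of the form 10xxxxxx. Assumes well-formed UTF-8.
std::size_t count_utf8_code_points( const char *s )
{
    if( s == nullptr ) return 0;

    std::size_t count = 0;
    for( ; *s != '\0'; ++s )
    {
        // Cast to unsigned char so the mask behaves the same whether
        // plain char is signed or unsigned on this platform.
        if( ( static_cast<unsigned char>( *s ) & 0xC0 ) != 0x80 )
        {
            ++count;
        }
    }
    return count;
}
```

Strictly speaking, both this sketch and `strlen_utf8` count code points, so a character built from several code points (for example a base letter plus a combining mark) is counted more than once.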

That's all.

Because the first byte of a UTF-8 sequence indicates how many bytes follow it, this was easy to implement.
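
The relationship between the first byte and the sequence length can be written out as a small helper. This is an illustrative sketch rather than the code used above: `utf8_sequence_length` is a hypothetical name, and it follows the current four-byte limit of UTF-8, whereas the `0xfc` mask in the loop above also allows for the older five- and six-byte forms. It does not reject overlong or out-of-range lead bytes.

```cpp
// Hypothetical helper (not from the original memo): returns the total
// length in bytes of a UTF-8 sequence, judged from its first byte alone.
// Returns 0 if the byte cannot start a sequence.
int utf8_sequence_length( unsigned char lead )
{
    if( ( lead & 0x80 ) == 0x00 ) return 1; // 0xxxxxxx: single byte (ASCII)
    if( ( lead & 0xE0 ) == 0xC0 ) return 2; // 110xxxxx: one byte follows
    if( ( lead & 0xF0 ) == 0xE0 ) return 3; // 1110xxxx: two bytes follow
    if( ( lead & 0xF8 ) == 0xF0 ) return 4; // 11110xxx: three bytes follow
    return 0;                               // 10xxxxxx (continuation) or 11111xxx
}
```

For example, the first byte of あ in UTF-8 is 0xE3, which matches the 1110xxxx pattern, so the character occupies three bytes.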

Studying this sort of thing turns out to be surprisingly fun.

Reference: UTF-8, Wikipedia