Class: ActiveSupport::Multibyte::Chars

Chars lets you work transparently with UTF-8 text in the Ruby String class without extensive knowledge of the encoding. A Chars object accepts a string upon initialization and proxies String methods in an encoding-safe manner. All the normal String methods are also implemented on the proxy.

String methods are proxied through the Chars object, and can be accessed through the mb_chars method. Methods which would normally return a String object now return a Chars object so methods can be chained.

  "The Perfect String  ".mb_chars.downcase.strip.normalize #=> "the perfect string"

Chars objects are interchangeable with String objects as long as no explicit class checks are made. If a method does explicitly check the class, call to_s before you pass a Chars object to it.

  bad.explicit_checking_method "T".mb_chars.downcase.to_s
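
The proxying, chaining, and to_s behaviour described above can be illustrated with a small stand-alone sketch. MiniChars and explicit_checking_method below are hypothetical names invented for this illustration; the code is not ActiveSupport's implementation and does nothing encoding-specific. It only shows how a wrapper can forward calls, re-wrap String results, and unwrap itself on request.

```ruby
# Hypothetical stand-in for the proxy idea; this is not ActiveSupport's code.
class MiniChars
  attr_reader :wrapped_string

  def initialize(string)
    @wrapped_string = string
  end

  # Forward unknown methods to the wrapped String and re-wrap String
  # results so that calls can be chained.
  def method_missing(name, *args, &block)
    return super unless @wrapped_string.respond_to?(name)

    result = @wrapped_string.public_send(name, *args, &block)
    result.is_a?(String) ? self.class.new(result) : result
  end

  def respond_to_missing?(name, include_private = false)
    @wrapped_string.respond_to?(name, include_private) || super
  end

  # Unwrap to a plain String for code that checks the class explicitly.
  def to_s
    @wrapped_string
  end
end

# Hypothetical method that insists on a real String.
def explicit_checking_method(value)
  raise TypeError, "String expected" unless value.is_a?(String)
  value.length
end

# Each call in the chain returns another MiniChars, not a String.
chars = MiniChars.new("The Perfect String  ").downcase.strip

# Passing the proxy directly would raise TypeError; to_s avoids that.
explicit_checking_method(chars.to_s)
```

Because the proxy is not a String subclass in this sketch, an is_a?(String) check fails on it, which is why the to_s call is needed before handing the value to class-checking code.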

The default Chars implementation assumes that the string is encoded as UTF-8. To handle a different encoding, write your own multibyte string handler and configure it through ActiveSupport::Multibyte.proxy_class.

  class CharsForUTF32
    def size
      @wrapped_string.size / 4
    end

    def self.accepts?(string)
      string.length % 4 == 0
    end
  end

  ActiveSupport::Multibyte.proxy_class = CharsForUTF32
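
The division by four in the handler above reflects the fact that UTF-32 stores every code point in exactly four bytes. The following sketch checks that arithmetic in plain Ruby. It assumes Ruby 1.9 or later, where String#encode and String#bytesize are available; the handler above is written for byte-oriented strings, where size and length count bytes.

```ruby
# Assumes Ruby 1.9 or later (String#encode, String#bytesize).
text  = "h\u00E9llo"                 # five characters, six bytes in UTF-8
utf32 = text.encode("UTF-32BE")      # big-endian UTF-32 without a BOM

# Four bytes per code point, so the character count is bytesize / 4.
char_count = utf32.bytesize / 4

# A byte length that is not a multiple of four cannot be whole UTF-32 data.
well_formed = (utf32.bytesize % 4).zero?
```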

Methods

+
<=>
=~
[]
[]=
acts_like_string?
capitalize
center
compose
compose_codepoints
consumes?
decompose
decompose_codepoints
downcase
g_length
g_pack
g_unpack
in_char_class?
include?
index
insert
length
ljust
lstrip
method_missing
new
normalize
ord
reorder_characters
respond_to?
reverse
rindex
rjust
rstrip
size
slice
slice!
split
strip
tidy_bytes
tidy_bytes
u_unpack
upcase
wants?

Included Modules

Comparable

Constants

HANGUL_SBASE	=	0xAC00
	Hangul character boundaries and properties
HANGUL_LBASE	=	0x1100
HANGUL_VBASE	=	0x1161
HANGUL_TBASE	=	0x11A7
HANGUL_LCOUNT	=	19
HANGUL_VCOUNT	=	21
HANGUL_TCOUNT	=	28
HANGUL_NCOUNT	=	HANGUL_VCOUNT * HANGUL_TCOUNT
HANGUL_SCOUNT	=	11172
HANGUL_SLAST	=	HANGUL_SBASE + HANGUL_SCOUNT
HANGUL_JAMO_FIRST	=	0x1100
HANGUL_JAMO_LAST	=	0x11FF
UNICODE_WHITESPACE	=	[
	  (0x0009..0x000D).to_a, # White_Space # Cc [5] <control-0009>..<control-000D>
	  0x0020,                # White_Space # Zs SPACE
	  0x0085,                # White_Space # Cc <control-0085>
	  0x00A0,                # White_Space # Zs NO-BREAK SPACE
	  0x1680,                # White_Space # Zs OGHAM SPACE MARK
	  0x180E,                # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
	  (0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
	  0x2028,                # White_Space # Zl LINE SEPARATOR
	  0x2029,                # White_Space # Zp PARAGRAPH SEPARATOR
	  0x202F,                # White_Space # Zs NARROW NO-BREAK SPACE
	  0x205F,                # White_Space # Zs MEDIUM MATHEMATICAL SPACE
	  0x3000,                # White_Space # Zs IDEOGRAPHIC SPACE
	].flatten.freeze
	All the Unicode whitespace code points
UNICODE_LEADERS_AND_TRAILERS	=	UNICODE_WHITESPACE + [65279]
	The BOM (byte order mark) can also be treated as whitespace: it is a non-rendering character used to distinguish between little-endian and big-endian byte order. Byte order is not an issue in UTF-8, so the BOM is ignored.
UNICODE_TRAILERS_PAT	=	/(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\Z/
UNICODE_LEADERS_PAT	=	/\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/
UTF8_PAT	=	ActiveSupport::Multibyte::VALID_CHARACTER['UTF-8']

Attributes

wrapped_string	[R]

Public Class methods

compose_codepoints(codepoints)

Compose decomposed characters to the composed form.
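
For Hangul, composition is pure arithmetic over the HANGUL_* constants listed under Constants. The sketch below follows the syllable composition algorithm from the Unicode Standard and is offered only as an illustration; it is not the body of compose_codepoints, and the helper names compose_hangul and decompose_hangul are hypothetical.

```ruby
# Values copied from the Constants section above.
HANGUL_SBASE  = 0xAC00
HANGUL_LBASE  = 0x1100
HANGUL_VBASE  = 0x1161
HANGUL_TBASE  = 0x11A7
HANGUL_VCOUNT = 21
HANGUL_TCOUNT = 28
HANGUL_NCOUNT = HANGUL_VCOUNT * HANGUL_TCOUNT # 588

# Hypothetical helper: combine a leading consonant, a vowel, and an
# optional trailing consonant (all code points) into one syllable.
def compose_hangul(lead, vowel, trail = HANGUL_TBASE)
  l_index = lead  - HANGUL_LBASE
  v_index = vowel - HANGUL_VBASE
  t_index = trail - HANGUL_TBASE # 0 means "no trailing consonant"
  HANGUL_SBASE + (l_index * HANGUL_VCOUNT + v_index) * HANGUL_TCOUNT + t_index
end

# Hypothetical helper: the inverse, splitting a syllable into jamo.
def decompose_hangul(syllable)
  s_index = syllable - HANGUL_SBASE
  lead    = HANGUL_LBASE + s_index / HANGUL_NCOUNT
  vowel   = HANGUL_VBASE + (s_index % HANGUL_NCOUNT) / HANGUL_TCOUNT
  trail   = HANGUL_TBASE + s_index % HANGUL_TCOUNT
  trail == HANGUL_TBASE ? [lead, vowel] : [lead, vowel, trail]
end

# HIEUH (0x1112) + A (0x1161) + NIEUN (0x11AB) compose to U+D55C.
composed   = compose_hangul(0x1112, 0x1161, 0x11AB)
decomposed = decompose_hangul(composed)
```

Characters outside the Hangul range need lookup tables from the Unicode character database, which this sketch does not attempt to cover.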