Java.Nio.Charset.Charset Class

A charset is a named mapping between Unicode characters and byte sequences.

See Also: Charset

Syntax

[Android.Runtime.Register("java/nio/charset/Charset", DoNotGenerateAcw=true)]
public abstract class Charset : Object, IComparable, IDisposable

Remarks

A charset is a named mapping between Unicode characters and byte sequences. Every Charset can decode, converting a byte sequence into a sequence of characters, and some can also encode, converting a sequence of characters into a byte sequence. Use the method Charset.CanEncode to find out whether a charset supports both.
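As a minimal sketch of the encode and decode directions described above, the following round-trips a string through a charset. It is written against the underlying java.nio.charset API, so method names are lower camel case (canEncode rather than CanEncode); the class and helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripSketch {
    // Encodes text with the given charset, then decodes the bytes back.
    static String roundTrip(Charset charset, String text) {
        // Every charset can decode, but only some can encode.
        if (!charset.canEncode()) {
            throw new IllegalArgumentException(charset.name() + " is decode-only");
        }
        ByteBuffer bytes = charset.encode(text);   // characters -> bytes
        CharBuffer chars = charset.decode(bytes);  // bytes -> characters
        return chars.toString();
    }

    public static void main(String[] args) {
        // "na\u00EFve caf\u00E9" uses escapes to stay independent of source encoding.
        System.out.println(roundTrip(StandardCharsets.UTF_8, "na\u00EFve caf\u00E9"));
    }
}
```

The round trip is lossless only when the charset can represent every character in the input, which is why UTF-8 is used here.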

Characters

In the context of this class, character always refers to a Java character: a Unicode code point in the range U+0000 to U+FFFF. (Java represents supplementary characters using surrogates.) Not all byte sequences will represent a character, and not all characters can necessarily be represented by a given charset. The method Charset.Contains(Charset) can be used to determine whether every character representable by one charset can also be represented by another (meaning that a lossless transformation is possible from the contained to the container).
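The sketch below illustrates both points from this paragraph, assuming a standard Java runtime: a supplementary character occupies two Java characters (a surrogate pair), and the containment check is not symmetric. It uses the underlying Java method name contains (Contains in this binding); the class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CharacterSketch {
    // U+1F600 lies outside the range U+0000 to U+FFFF, so Java stores it as a surrogate pair.
    static final String SUPPLEMENTARY = new String(Character.toChars(0x1F600));

    // Number of Java characters (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount(SUPPLEMENTARY));      // 2
        System.out.println(codePointCount(SUPPLEMENTARY)); // 1
        // Every US-ASCII character is representable in UTF-8, but not the reverse.
        System.out.println(StandardCharsets.UTF_8.contains(StandardCharsets.US_ASCII));
        System.out.println(StandardCharsets.US_ASCII.contains(StandardCharsets.UTF_8));
    }
}
```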

Encodings

There are many possible ways to represent Unicode characters as byte sequences. See for detailed discussion.

The most important mappings capable of representing every character are the Unicode Transformation Format (UTF) charsets. Of those, UTF-8 and the UTF-16 family are the most common. UTF-8 (described in ) encodes a character using 1 to 4 bytes. UTF-16 uses exactly 2 bytes per character (potentially wasting space, but allowing efficient random access into BMP text), and UTF-32 uses exactly 4 bytes per character (trading off even more space for efficient random access into text that includes supplementary characters).
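To make the size trade-off concrete, this sketch measures how many bytes a single code point occupies in UTF-8 and in UTF-16BE (chosen so that no BOM is counted). The class and helper names are hypothetical. A supplementary code point takes 4 bytes in UTF-16 because it is stored as two 2-byte Java characters.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLengthSketch {
    // Builds a one-code-point string; supplementary code points become a surrogate pair.
    private static String of(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    static int utf8Length(int codePoint) {
        return of(codePoint).getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(int codePoint) {
        return of(codePoint).getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600}; // A, e-acute, euro sign, emoji
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8=%d bytes, UTF-16=%d bytes%n",
                    cp, utf8Length(cp), utf16Length(cp));
        }
    }
}
```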

UTF-16 and UTF-32 encode characters directly, using their code point as a two- or four-byte integer. This means that any given UTF-16 or UTF-32 byte sequence is either big- or little-endian. To assist decoders, Unicode includes a special byte order mark (BOM) character U+FEFF used to determine the endianness of a sequence. The corresponding byte-swapped code point U+FFFE is guaranteed never to be assigned. If a UTF-16 decoder sees 0xfe, 0xff, for example, it knows it's reading a big-endian byte sequence, while 0xff, 0xfe would indicate a little-endian byte sequence.

UTF-8 can contain a BOM, but since the UTF-8 encoding of a character always uses the same byte sequence, there is no information about endianness to convey. Seeing the bytes corresponding to the UTF-8 encoding of U+FEFF (0xef, 0xbb, 0xbf) would only serve to suggest that you're reading UTF-8. Note that BOMs are decoded as the U+FEFF character, and will appear in the output character sequence. A disadvantage of including a BOM in UTF-8 is therefore that most applications that use UTF-8 do not expect to see one. (This is also a reason to prefer UTF-8: it's one less complication to worry about.)

Because a BOM indicates how the data that follows should be interpreted, a BOM should occur as the first character in a character sequence.
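The following sketch shows a UTF-8 BOM surviving decoding as a leading U+FEFF character, and one possible way for an application to drop it. The stripLeadingBom helper is hypothetical, not part of this class, and only looks at the first character, in line with the rule that a BOM belongs at the start of a sequence.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BomSketch {
    // Plain UTF-8 decoding; a leading BOM is kept as the character U+FEFF.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: removes a single leading U+FEFF, if present.
    static String stripLeadingBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // 0xef 0xbb 0xbf is the UTF-8 encoding of U+FEFF, followed by "hi".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String decoded = decodeUtf8(withBom);
        System.out.println(decoded.length());                  // 3: BOM + 'h' + 'i'
        System.out.println(stripLeadingBom(decoded).length()); // 2
    }
}
```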

See the for more about dealing with BOMs.

Endianness and BOM behavior

The following tables show the endianness and BOM behavior of the UTF-16 variants.

This table shows what the encoder writes. "BE" means that the byte sequence is big-endian, "LE" means little-endian. "BE BOM" means a big-endian BOM (that is, 0xfe, 0xff).

Charset     Encoder writes
UTF-16BE    BE, no BOM
UTF-16LE    LE, no BOM
UTF-16      BE, with BE BOM
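The encoder table can be checked with a short sketch that encodes the single character "A" (U+0041) with each variant and prints the bytes as hex. The class and helper names are hypothetical, and the expected values assume a runtime that follows the table above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncoderBomSketch {
    // Formats bytes as lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    static String encodeHex(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        System.out.println(encodeHex("A", StandardCharsets.UTF_16BE)); // 0041
        System.out.println(encodeHex("A", StandardCharsets.UTF_16LE)); // 4100
        System.out.println(encodeHex("A", StandardCharsets.UTF_16));   // feff0041
    }
}
```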

The next table shows how each variant's decoder behaves when reading a byte sequence. The exact meaning of "failure" in the table is dependent on the CodingErrorAction supplied to CharsetDecoder.MalformedInputAction, so "BE, failure" means "the byte sequence is treated as big-endian, and a little-endian BOM triggers the malformedInputAction".

The phrase "includes BOM" means that the output includes the U+FEFF byte order mark character.

Charset     BE BOM              LE BOM              No BOM
UTF-16BE    BE, includes BOM    BE, failure         BE
UTF-16LE    LE, failure         LE, includes BOM    LE
UTF-16      BE                  LE                  BE
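A sketch of the non-failure cells of the decoder table follows, decoding "A" with and without a BOM. It assumes a runtime that behaves as the table describes; the "failure" cells are left out because their outcome depends on the configured CodingErrorAction. The class and constant names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecoderBomSketch {
    static final byte[] BE_BOM_A = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
    static final byte[] LE_BOM_A = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
    static final byte[] NO_BOM_A = {0x00, 0x41};

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // True if the decoded output begins with the U+FEFF byte order mark character.
    static boolean startsWithBom(String s) {
        return !s.isEmpty() && s.charAt(0) == '\uFEFF';
    }

    public static void main(String[] args) {
        // UTF-16 uses the BOM to pick the byte order and does not emit it.
        System.out.println(decode(BE_BOM_A, StandardCharsets.UTF_16));
        System.out.println(decode(LE_BOM_A, StandardCharsets.UTF_16));
        System.out.println(decode(NO_BOM_A, StandardCharsets.UTF_16));
        // The fixed-endian variants keep a matching BOM in the output.
        System.out.println(startsWithBom(decode(BE_BOM_A, StandardCharsets.UTF_16BE)));
        System.out.println(startsWithBom(decode(LE_BOM_A, StandardCharsets.UTF_16LE)));
    }
}
```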

Charset names

A charset has a canonical name, returned by Charset.Name. Most charsets will also have one or more aliases, returned by Charset.Aliases. A charset can be looked up by canonical name or any of its aliases using Charset.ForName(String).
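As an illustration of name lookup, this sketch resolves a name or alias to its canonical name and checks that every alias of a charset resolves back to the same charset. It uses the underlying Java method names (forName, name, aliases), which correspond to ForName, Name and Aliases in this binding; the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetNameSketch {
    // Looks up a charset by canonical name or alias and returns its canonical name.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    // True if each alias reported by the charset resolves to that same charset.
    static boolean aliasesResolveToSelf(Charset charset) {
        for (String alias : charset.aliases()) {
            if (!Charset.forName(alias).equals(charset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Charset names are not case-sensitive.
        System.out.println(canonicalName("utf-8")); // UTF-8
        System.out.println(StandardCharsets.UTF_8.aliases());
        System.out.println(aliasesResolveToSelf(StandardCharsets.UTF_8));
    }
}
```

The set of aliases differs between runtimes, so the sketch prints them rather than assuming particular values.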

Guaranteed-available charsets

The following charsets are available on every Java implementation:

  • ISO-8859-1
  • US-ASCII
  • UTF-16
  • UTF-16BE
  • UTF-16LE
  • UTF-8

All of these charsets support both decoding and encoding. The charsets whose names begin "UTF" can represent all characters, as mentioned above. The "ISO-8859-1" and "US-ASCII" charsets can only represent small subsets of these characters. Except when required to do otherwise for compatibility, new code should use one of the UTF charsets listed above. The platform's default charset is UTF-8. (This is in contrast to some older implementations, where the default charset depended on the user's locale.)
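The difference between the UTF charsets and the smaller ones can be sketched by asking an encoder whether it can represent a given string. The canRepresent helper is hypothetical; the euro sign (U+20AC) is outside both ISO-8859-1 and US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RepresentableSketch {
    // True if every character of text can be encoded by the charset.
    static boolean canRepresent(Charset charset, String text) {
        // A fresh encoder is created per call because encoders are stateful.
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        System.out.println(canRepresent(StandardCharsets.UTF_8, euro));      // true
        System.out.println(canRepresent(StandardCharsets.ISO_8859_1, euro)); // false
        System.out.println(canRepresent(StandardCharsets.US_ASCII, euro));   // false
        // String.getBytes silently substitutes the charset's replacement byte ('?').
        System.out.println(euro.getBytes(StandardCharsets.ISO_8859_1)[0] == '?');
    }
}
```

Because getBytes replaces unrepresentable characters without reporting an error, checking representability first (or using a UTF charset) avoids silent data loss.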

Most implementations will support hundreds of charsets. Use Charset.AvailableCharsets or Charset.IsSupported(String) to see what's available. If you intend to use the charset if it's available, just call Charset.ForName(String) and catch the exceptions it throws if the charset isn't available.
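The "just call it and catch the exceptions" advice might look like the sketch below. The forNameOrDefault helper and the charset name used to trigger the failure are hypothetical. In the underlying Java API, forName throws IllegalCharsetNameException for a syntactically illegal name and UnsupportedCharsetException for a legal name with no available charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

final class CharsetLookupSketch {
    // Returns the named charset, or the fallback if the name is illegal or unsupported.
    static Charset forNameOrDefault(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A guaranteed-available charset resolves normally.
        System.out.println(forNameOrDefault("UTF-16LE", StandardCharsets.UTF_8));
        // A made-up name falls back to UTF-8.
        System.out.println(forNameOrDefault("x-no-such-charset-hypothetical", StandardCharsets.UTF_8));
    }
}
```

This avoids a separate isSupported call followed by forName, which would perform the lookup twice.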

Additional charsets can be made available by configuring one or more charset providers through provider configuration files. Such files are always named "java.nio.charset.spi.CharsetProvider" and are located in the "META-INF/services" directory of one or more classpaths. The files should be encoded in "UTF-8". Each line of their content specifies the class name of a charset provider which extends CharsetProvider. A line should end with '\r', '\n' or '\r\n'. Leading and trailing whitespace is trimmed. Blank lines are ignored, as are lines that (after trimming) start with "#", which are treated as comments. Duplicates of names already found are also ignored. Both the configuration files and the provider classes will be loaded using the thread context class loader.
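The line rules for a provider configuration file can be sketched as a small parser. This is an illustration of the rules stated above, not the platform's actual loader; the class name, the helper, and the com.example provider names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ProviderConfigSketch {
    // Extracts provider class names from the text of a configuration file.
    static List<String> providerClassNames(String fileContent) {
        Set<String> names = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        // Lines may end with "\r\n", "\r" or "\n".
        for (String line : fileContent.split("\r\n|\r|\n")) {
            String trimmed = line.trim();
            // Blank lines and "#" comment lines are ignored.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            names.add(trimmed);
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        String content = "# hypothetical providers\r\n"
                + "com.example.FooProvider \n"
                + "\n"
                + "  com.example.BarProvider\r"
                + "com.example.FooProvider\n";
        System.out.println(providerClassNames(content));
    }
}
```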

Although this class is thread-safe, the CharsetDecoder and CharsetEncoder instances it returns are inherently stateful.
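One way to respect that statefulness is to create a new decoder for each operation instead of sharing one between threads, as in this sketch. It also shows how the CodingErrorAction chosen for malformed input changes the outcome. The class and helper names are hypothetical, and the method names are those of the underlying Java API (newDecoder, onMalformedInput).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecoderPerCallSketch {
    // Builds a fresh UTF-8 decoder; the shared Charset itself is safe to reuse.
    private static CharsetDecoder newUtf8Decoder(CodingErrorAction action) {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
    }

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            newUtf8Decoder(CodingErrorAction.REPORT).decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes UTF-8, substituting U+FFFD for malformed input.
    static String decodeLenient(byte[] bytes) {
        try {
            return newUtf8Decoder(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE; rethrown unchecked to keep the sketch simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xff never occurs in valid UTF-8
        System.out.println(isWellFormedUtf8(bad));            // false
        System.out.println(decodeLenient(bad).charAt(1) == '\uFFFD'); // true
    }
}
```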

[Android Documentation]

Requirements

Namespace: Java.Nio.Charset
Assembly: Mono.Android (in Mono.Android.dll)
Assembly Versions: 0.0.0.0
Since: Added in API level 1

The members of Java.Nio.Charset.Charset are listed below.

See Also: Object

Protected Constructors

A constructor used when creating managed representations of JNI objects; called by the runtime.
Constructs a Charset object.

Public Properties

[read-only]
IsRegistered : Boolean
Returns true if this charset is known to be registered in the IANA Charset Registry.

Protected Properties

[read-only]
override
ThresholdClass : IntPtr
This API supports the Mono for Android infrastructure and is not intended to be used directly from your code.
[read-only]
override
ThresholdType : Type
This API supports the Mono for Android infrastructure and is not intended to be used directly from your code.

Public Methods

Aliases() : ICollection<string>
Returns an unmodifiable set of this charset's aliases.
static
AvailableCharsets() : IDictionary<string, Java.Nio.Charset.Charset>
Returns an immutable case-insensitive map from canonical names to Charset instances.
CanEncode() : Boolean
Returns true if this charset supports encoding, false otherwise.
CompareTo(Charset) : Int32
Compares this charset with the given charset.
abstract
Contains(Charset) : Boolean
Determines whether this charset is a superset of the given charset.
Decode(ByteBuffer) : CharBuffer
Returns a new CharBuffer containing the characters decoded from buffer.
static
DefaultCharset() : Charset
Returns the system's default charset.
DisplayName() : String
Returns the name of this charset for the default locale.
DisplayName(Locale) : String
Returns the name of this charset for the specified locale.
Encode(CharBuffer) : ByteBuffer
Returns a new ByteBuffer containing the bytes encoding the characters from buffer.
Encode(String) : ByteBuffer
Returns a new ByteBuffer containing the bytes encoding the characters from s.
override
Equals(Object) : Boolean
Determines whether this charset is equal to the given object.
static
ForName(String) : Charset
Returns a Charset instance for the named charset.
override
GetHashCode() : Int32
Gets the hash code of this charset.
static
IsSupported(String) : Boolean
Determines whether the specified charset is supported by this runtime.
Name() : String
Returns the canonical name of this charset.
abstract
NewDecoder() : CharsetDecoder
Returns a new instance of a decoder for this charset.
abstract
NewEncoder() : CharsetEncoder
Returns a new instance of an encoder for this charset.
override
ToString() : String
Gets a string representation of this charset.

Explicitly Implemented Interface Members