class Poco::UTF8Encoding
Overview
UTF-8 text encoding, as defined in RFC 2279.
#include <UTF8Encoding.h>

class UTF8Encoding: public Poco::TextEncoding
{
public:
    // methods

    virtual const char* canonicalName() const;
    virtual bool isA(const std::string& encodingName) const;
    virtual const CharacterMap& characterMap() const;
    virtual int convert(const unsigned char* bytes) const;

    virtual int convert(
        int ch,
        unsigned char* bytes,
        int length
        ) const;

    virtual int queryConvert(
        const unsigned char* bytes,
        int length
        ) const;

    virtual int sequenceLength(
        const unsigned char* bytes,
        int length
        ) const;

    static bool isLegal(
        const unsigned char* bytes,
        int length
        );
};
Inherited Members
public:
    // typedefs

    typedef SharedPtr<TextEncoding> Ptr;
    typedef int CharacterMap[256];

    // enums

    enum
    {
        MAX_SEQUENCE_LENGTH = 6,
    };

    // fields

    static const std::string GLOBAL;

    // methods

    virtual const char* canonicalName() const = 0;
    virtual bool isA(const std::string& encodingName) const = 0;
    virtual const CharacterMap& characterMap() const = 0;
    virtual int convert(const unsigned char* bytes) const;

    virtual int queryConvert(
        const unsigned char* bytes,
        int length
        ) const;

    virtual int sequenceLength(
        const unsigned char* bytes,
        int length
        ) const;

    virtual int convert(
        int ch,
        unsigned char* bytes,
        int length
        ) const;

    static TextEncoding& byName(const std::string& encodingName);
    static TextEncoding::Ptr find(const std::string& encodingName);
    static void add(TextEncoding::Ptr encoding);

    static void add(
        TextEncoding::Ptr encoding,
        const std::string& name
        );

    static void remove(const std::string& encodingName);
    static TextEncoding::Ptr global(TextEncoding::Ptr encoding);
    static TextEncoding& global();

protected:
    // methods

    static TextEncodingManager& manager();
Detailed Documentation
UTF-8 text encoding, as defined in RFC 2279.
Methods
virtual const char* canonicalName() const
Returns the canonical name of this encoding, e.g. “ISO-8859-1”. Encoding name comparisons are case insensitive.
virtual bool isA(const std::string& encodingName) const
Returns true if the given name is one of the names of this encoding.
For example, the “ISO-8859-1” encoding is also known as “Latin-1”.
Encoding name comparisons are case insensitive.
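The case-insensitive matching described for isA() can be illustrated with a small standalone sketch. It does not use the library; the helper names equalsIgnoreCase and isUtf8Name are hypothetical, and the alias list ("UTF-8", "UTF8") is an assumption, since this page does not list the names the class actually accepts.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: ASCII case-insensitive equality of two names.
bool equalsIgnoreCase(const std::string& a, const std::string& b)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        int ca = std::tolower(static_cast<unsigned char>(a[i]));
        int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return false;
    }
    return true;
}

// Hypothetical isA()-style check. The alias table is an assumption
// made for illustration, not taken from the class itself.
bool isUtf8Name(const std::string& encodingName)
{
    static const char* const names[] = { "UTF-8", "UTF8" };
    for (const char* name : names)
    {
        if (equalsIgnoreCase(encodingName, name)) return true;
    }
    return false;
}
```

With this sketch, "utf-8" and "Utf8" both match, while "ISO-8859-1" does not.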
virtual const CharacterMap& characterMap() const
Returns the CharacterMap for the encoding.
The CharacterMap should be kept in a static member. As characterMap() can be called frequently, it should be implemented in such a way that it just returns a static map. If the map is built at runtime, this should be done in the constructor.
virtual int convert(const unsigned char* bytes) const
The convert function is used to convert multibyte sequences; bytes will point to a byte sequence of n bytes where sequenceLength(bytes, length) == -n, with length >= n.
The convert function must return the Unicode scalar value represented by this byte sequence or -1 if the byte sequence is malformed. The default implementation returns (int) bytes[0].
virtual int convert( int ch, unsigned char* bytes, int length ) const
Transform the Unicode character ch into the encoding’s byte sequence.
The method returns the number of bytes used. The method must not use more than length bytes. bytes and length can also be null, in which case only the number of bytes required to represent ch is returned. If the character cannot be converted, 0 is returned and the byte sequence remains unchanged. The default implementation simply returns 0.
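As an illustration of this contract, the following standalone sketch encodes a code point into UTF-8 without using the library. The function name encodeUtf8 is hypothetical. Two behaviours are assumptions rather than documented facts: surrogate code points and values above U+10FFFF are treated as not convertible, and when the buffer is too small the sketch writes nothing and still reports the required size.

```cpp
// Hypothetical sketch of the encode contract: returns the number of bytes
// needed for ch, writing them only if a large enough buffer is supplied.
int encodeUtf8(int ch, unsigned char* bytes, int length)
{
    int needed;
    if (ch < 0) return 0;                    // not a code point
    else if (ch <= 0x7F) needed = 1;
    else if (ch <= 0x7FF) needed = 2;
    else if (ch <= 0xFFFF)
    {
        // Assumption: surrogates are reported as not convertible.
        if (ch >= 0xD800 && ch <= 0xDFFF) return 0;
        needed = 3;
    }
    else if (ch <= 0x10FFFF) needed = 4;
    else return 0;                           // beyond the Unicode range

    // Query mode, or buffer too small: report the size, write nothing.
    if (bytes == nullptr || length < needed) return needed;

    switch (needed)
    {
    case 1:
        bytes[0] = static_cast<unsigned char>(ch);
        break;
    case 2:
        bytes[0] = static_cast<unsigned char>(0xC0 | (ch >> 6));
        bytes[1] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
        break;
    case 3:
        bytes[0] = static_cast<unsigned char>(0xE0 | (ch >> 12));
        bytes[1] = static_cast<unsigned char>(0x80 | ((ch >> 6) & 0x3F));
        bytes[2] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
        break;
    default:
        bytes[0] = static_cast<unsigned char>(0xF0 | (ch >> 18));
        bytes[1] = static_cast<unsigned char>(0x80 | ((ch >> 12) & 0x3F));
        bytes[2] = static_cast<unsigned char>(0x80 | ((ch >> 6) & 0x3F));
        bytes[3] = static_cast<unsigned char>(0x80 | (ch & 0x3F));
        break;
    }
    return needed;
}
```

A caller can first pass a null buffer to learn the required size, then call again with a buffer of at least that many bytes.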
virtual int queryConvert( const unsigned char* bytes, int length ) const
The queryConvert function is used to convert single byte characters or multibyte sequences; bytes will point to a byte sequence of length bytes.
The queryConvert function must return the Unicode scalar value represented by this byte sequence, or -1 if the byte sequence is malformed, or -n if length is shorter than the sequence, where n is the number of bytes requested for the sequence. The length of the sequence might not be determined by the first byte, in which case the conversion becomes an iterative process: the first call with length == 1 might return -2, a second call with length == 2 might return -4, and a third call with length == 4 should then return either a Unicode scalar value or -1 if the byte sequence is malformed. The default implementation returns (int) bytes[0].
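The three possible results (scalar value, -1, or -n) can be illustrated with a standalone UTF-8 decoder. This is a sketch written for this page, not the library's implementation; the name utf8QueryConvert is hypothetical, and the choice to reject overlong forms, surrogates, and sequences longer than four bytes is an assumption.

```cpp
// Hypothetical sketch of the queryConvert contract for UTF-8:
//   >= 0 : decoded Unicode scalar value
//   -1   : malformed sequence
//   -n   : n bytes are required, but fewer were supplied (n >= 2)
int utf8QueryConvert(const unsigned char* bytes, int length)
{
    if (bytes == nullptr || length < 1) return -1;   // nothing to decode

    unsigned char lead = bytes[0];
    if (lead < 0x80) return lead;                    // single-byte (ASCII)

    int n;          // total sequence length implied by the lead byte
    int ch;         // payload bits accumulated so far
    int minValue;   // smallest value that legitimately needs n bytes
    if (lead >= 0xC2 && lead <= 0xDF)      { n = 2; ch = lead & 0x1F; minValue = 0x80; }
    else if ((lead & 0xF0) == 0xE0)        { n = 3; ch = lead & 0x0F; minValue = 0x800; }
    else if (lead >= 0xF0 && lead <= 0xF4) { n = 4; ch = lead & 0x07; minValue = 0x10000; }
    else return -1;                                  // invalid lead byte

    if (length < n) return -n;                       // ask for more bytes

    for (int i = 1; i < n; ++i)
    {
        if ((bytes[i] & 0xC0) != 0x80) return -1;    // not a continuation byte
        ch = (ch << 6) | (bytes[i] & 0x3F);
    }

    if (ch < minValue) return -1;                    // overlong encoding
    if (ch >= 0xD800 && ch <= 0xDFFF) return -1;     // surrogate
    if (ch > 0x10FFFF) return -1;                    // beyond Unicode range
    return ch;
}
```

For UTF-8 the lead byte already fixes the sequence length, so at most one retry is needed: a call with length == 1 on a three-byte sequence returns -3, and the follow-up call with length == 3 returns the scalar value or -1.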
virtual int sequenceLength( const unsigned char* bytes, int length ) const
The sequenceLength function is used to get the length of the sequence pointed to by bytes.
The length parameter should be greater than or equal to the length of the sequence.
The sequenceLength function must return the length of the sequence represented by this byte sequence, or a negative value -n if length is shorter than the sequence, where n is the number of bytes requested to determine the length of the sequence. The length of the sequence might not be determined by the first byte, in which case the conversion becomes an iterative process as long as the result is negative: the first call with length == 1 might return -2, a second call with length == 2 might return -4, and a third call with length == 4 should then return 4. The default implementation returns 1.
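The iterative protocol can be sketched as a small driver loop that keeps supplying more bytes while the result is negative. Everything here is standalone and hypothetical: utf8SequenceLength stands in for an encoding's sequenceLength(), probeSequenceLength is the driver, and reporting invalid lead bytes as length 1 (so a caller can consume and reject them) is an assumption.

```cpp
// Hypothetical UTF-8 stand-in for sequenceLength(): the lead byte alone
// determines the length, so one byte is always enough to answer.
int utf8SequenceLength(const unsigned char* bytes, int length)
{
    if (length < 1) return -1;                 // one byte is requested
    unsigned char lead = bytes[0];
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 1;                                  // assumption: invalid lead byte
}

typedef int (*SequenceLengthFn)(const unsigned char* bytes, int length);

// Hypothetical driver: calls fn with a growing prefix until it reports a
// non-negative length. Returns 0 if more than `available` bytes would be
// needed, or if fn stops making progress.
int probeSequenceLength(SequenceLengthFn fn, const unsigned char* bytes, int available)
{
    int have = 1;
    while (have <= available)
    {
        int n = fn(bytes, have);
        if (n >= 0) return n;                  // length is now known
        if (-n <= have) return 0;              // defensive: no progress
        have = -n;                             // supply the requested bytes
    }
    return 0;
}
```

The driver is written against the general contract, so it would also work with an encoding whose length is not known from the first byte alone.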
static bool isLegal( const unsigned char* bytes, int length )
Utility routine to tell whether a sequence of bytes is legal UTF-8.
This must be called with the length pre-determined by the first byte. The sequence is illegal right away if there aren’t enough bytes available. If presented with a length > 4, this function returns false. The Unicode definition of UTF-8 goes up to 4-byte sequences.
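A legality check of this kind can be sketched as follows. This is a standalone illustration under the rules of the Unicode definition of well-formed UTF-8, not the adapted ConvertUTF.c routine itself; the name isLegalUtf8 is hypothetical, and the caller is assumed to pass the length implied by the first byte.

```cpp
// Hypothetical sketch: returns true if bytes[0..length) is one well-formed
// UTF-8 sequence whose length matches its lead byte.
bool isLegalUtf8(const unsigned char* bytes, int length)
{
    if (bytes == nullptr || length < 1 || length > 4) return false;

    unsigned char lead = bytes[0];
    switch (length)
    {
    case 1:
        return lead < 0x80;
    case 2:
        if (lead < 0xC2 || lead > 0xDF) return false;       // excludes overlong C0/C1
        break;
    case 3:
        if ((lead & 0xF0) != 0xE0) return false;
        if (lead == 0xE0 && bytes[1] < 0xA0) return false;  // overlong
        if (lead == 0xED && bytes[1] > 0x9F) return false;  // surrogates
        break;
    default: // 4
        if (lead < 0xF0 || lead > 0xF4) return false;
        if (lead == 0xF0 && bytes[1] < 0x90) return false;  // overlong
        if (lead == 0xF4 && bytes[1] > 0x8F) return false;  // above U+10FFFF
        break;
    }

    // Every trailing byte must be a continuation byte (10xxxxxx).
    for (int i = 1; i < length; ++i)
    {
        if ((bytes[i] & 0xC0) != 0x80) return false;
    }
    return true;
}
```

Checking the lead byte and the restricted second-byte ranges first means overlong forms and surrogates are rejected before the generic continuation-byte loop runs.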
Adapted from ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c Copyright 2001-2004 Unicode, Inc.