HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
    use HTML::Encoding 'encoding_from_http_message';
    use LWP::UserAgent;
    use Encode;

    my $resp = LWP::UserAgent->new->get('http://www.example.org');
    my $enco = encoding_from_http_message($resp);
    my $utf8 = decode($enco => $resp->content);
The interface and implementation are guaranteed to change before this module reaches version 1.00! Please send feedback to the author of this module.
HTML::Encoding helps to determine the encoding of HTML and XML/XHTML documents...
Most routines need to know some suspected character encodings, which can be provided through the encodings option. This option always defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference, which means the following encodings are considered by default:
  * ISO-8859-1
  * UTF-16LE
  * UTF-16BE
  * UTF-32LE
  * UTF-32BE
  * UTF-8
If you change the values or pass custom values to the routines, note that the Encode module must support them in order for this module to work correctly.
encoding_from_xml_document, encoding_from_html_document, and encoding_from_http_message return the encoding source and the encoding name in list context. Possible encoding sources are:
  * protocol         (Content-Type: text/html;charset=encoding)
  * bom              (leading U+FEFF)
  * xml              (<?xml version='1.0' encoding='encoding'?>)
  * meta             (<meta http-equiv=...)
  * default          (default fallback value)
  * protocol_default (protocol default)
Routines exported by this module at user option. By default, nothing is exported.
encoding_from_content_type($content_type)
Extracts the charset parameter from a Content-Type header value and returns its value, or undef (an empty list in list context) if there is no such value. Only the first component is examined (HTTP/1.1 allows only one component). Backslash escapes in strings are unescaped, leading and trailing quote marks and white-space characters are removed, runs of white-space are collapsed to a single space, empty charset values are ignored, and no case folding is performed.
Examples:
  +-----------------------------------------+-----------+
  | encoding_from_content_type(...)         | returns   |
  +-----------------------------------------+-----------+
  | "text/html"                             | undef     |
  | "text/html,text/plain;charset=utf-8"    | undef     |
  | "text/html;charset="                    | undef     |
  | "text/html;charset=\"\\u\\t\\f\\-\\8\"" | 'utf-8'   |
  | "text/html;charset=utf\\-8"             | 'utf\\-8' |
  | "text/html;charset='utf-8'"             | 'utf-8'   |
  | "text/html;charset=\" UTF-8 \""         | 'UTF-8'   |
  +-----------------------------------------+-----------+
If you pass a string with the UTF-8 flag turned on, the string is converted to bytes before it is passed to HTTP::Headers::Util. The return value will thus never have the UTF-8 flag turned on (this might change in future versions).
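The normalisation rules for encoding_from_content_type can be approximated with a short, core-only sketch. This is not how the module does it (the text says it relies on HTTP::Headers::Util); the helper name charset_from_content_type and the regular expression are assumptions made for illustration, and the comma split is deliberately naive.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; it is not part of HTML::Encoding.
sub charset_from_content_type {
    my ($value) = @_;

    # Only the first comma-separated component is examined.
    # Naive: a comma inside a quoted string would also split here.
    my ($first) = split /,/, $value, 2;
    return unless defined $first;

    # Either a double-quoted string (with backslash escapes) or a bare token.
    return unless $first =~ /;\s*charset\s*=\s*(?:"((?:[^"\\]|\\.)*)"|([^;]*))/i;

    my $charset;
    if (defined $1) {
        ($charset = $1) =~ s/\\(.)/$1/g;    # unescape inside quoted strings only
    }
    else {
        $charset = $2;
    }

    $charset =~ s/^[\s"']+|[\s"']+$//g;     # strip surrounding quotes and white-space
    $charset =~ s/\s+/ /g;                  # collapse inner white-space
    return length $charset ? $charset : (); # empty values are ignored
}

for my $header ('text/html', 'text/html;charset="\u\t\f\-\8"', 'text/html;charset=" UTF-8 "') {
    my $charset = charset_from_content_type($header);
    printf "%-35s => %s\n", $header, defined $charset ? "'$charset'" : 'undef';
}
```

Case is left untouched, matching the "no case folding" rule, so callers that compare encoding names would need to fold case themselves.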
The result can be ambiguous, for example qq(\xFF\xFE\x00\x00) could be either a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a U+0000 character. It is also possible that $octets starts with something that looks like a byte order mark but actually is not.
encoding_from_byte_order_mark sorts the list of possible encodings by the length of their BOM octet sequence. In scalar context it returns only the encoding with the longest match; in list context it returns all matching encodings, ordered by the length of their BOM octet sequence.
Examples:
  +-------------------------+------------+-----------------------+
  | Input                   | Encodings  | Result                |
  +-------------------------+------------+-----------------------+
  | "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE)          |
  | "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE UTF-16LE) |
  | "\xEF\xBB\xBF"          | default    | qw(UTF-8)             |
  | "Hello World!"          | default    | undef                 |
  | "\xDD\x73\x66\x73"      | default    | undef                 |
  | "\xDD\x73\x66\x73"      | UTF-EBCDIC | qw(UTF-EBCDIC)        |
  | "\x2B\x2F\x76\x38\x2D"  | default    | undef                 |
  | "\x2B\x2F\x76\x38\x2D"  | UTF-7      | qw(UTF-7)             |
  +-------------------------+------------+-----------------------+
Note however that for UTF-7 it is in theory possible that the U+FEFF combines with other characters in which case such detection would fail, for example consider:
  +--------------------------------------+-----------+-----------+
  | Input                                | Encodings | Result    |
  +--------------------------------------+-----------+-----------+
  | "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | default   | undef     |
  | "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | UTF-7     | undef     |
  +--------------------------------------+-----------+-----------+
This might change in future versions, although it is not very relevant for most applications, as there should never be a need to use UTF-7 in the encoding list for existing documents.
If no BOM can be found it returns undef in scalar context and an empty list in list context. This routine should not be used with strings with the UTF-8 flag turned on.
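The longest-match ordering described above can be sketched with the core Encode module alone. The helper bom_candidates is hypothetical and is not the module's implementation; it assumes a BOM is simply U+FEFF encoded in each candidate encoding, and that encodings unable to represent U+FEFF should be skipped.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Hypothetical helper for illustration; it is not part of HTML::Encoding.
sub bom_candidates {
    my ($octets, @encodings) = @_;
    my @hits;
    for my $name (@encodings) {
        # Fresh copy each time: encode() may modify its input when a
        # CHECK value such as FB_CROAK is given.
        my $char = "\x{FEFF}";
        # Encodings that cannot represent U+FEFF (e.g. ISO-8859-1) croak
        # and are skipped.
        my $bom = eval { encode($name, $char, FB_CROAK) };
        next unless defined $bom && length $bom;
        push @hits, [$name, length $bom] if index($octets, $bom) == 0;
    }
    # Longest BOM octet sequence first.
    my @sorted = map { $_->[0] } sort { $b->[1] <=> $a->[1] } @hits;
    return wantarray ? @sorted : $sorted[0];
}

my @default = qw(ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8);
my @all  = bom_candidates("\xFF\xFE\x00\x00", @default);
my $best = bom_candidates("\xFF\xFE\x00\x00", @default);
print "list:   @all\n";
print "scalar: $best\n";
```

For the ambiguous input qq(\xFF\xFE\x00\x00) this sketch lists UTF-32LE before UTF-16LE, which corresponds to the first two rows of the table.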
encoding_from_xml_declaration($declaration)
Examples:
  +-------------------------------------------+---------+
  | encoding_from_xml_declaration(...)        | Result  |
  +-------------------------------------------+---------+
  | "<?xml version='1.0' encoding='utf-8'?>"  | 'utf-8' |
  | "<?xml encoding='utf-8'?>"                | 'utf-8' |
  | "<?xml encoding=\"utf-8\"?>"              | 'utf-8' |
  | "<?xml foo='bar' encoding='utf-8'?>"      | 'utf-8' |
  | "<?xml encoding='a' encoding='b'?>"       | 'a'     |
  | "<?xml encoding=' a b '?>"                | 'a b'   |
  | "<?xml-stylesheet encoding='utf-8'?>"     | undef   |
  | " <?xml encoding='utf-8'?>"               | undef   |
  | "<?xml encoding =\x{2028}'utf-8'?>"       | 'utf-8' |
  | "<?xml version='1.0' encoding=utf-8?>"    | undef   |
  | "<?xml x='encoding=\"a\"' encoding='b'?>" | 'a'     |
  +-------------------------------------------+---------+
Note that encoding_from_xml_declaration() determines the encoding even if the XML declaration is not well-formed or violates other requirements of the relevant XML specification as long as it can find an encoding pseudo-attribute in the provided string. This means XML processors must apply further checks to determine whether the entity is well-formed, etc.
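A lenient scan of this kind can be sketched with a single regular expression. The helper xml_decl_encoding and its pattern are assumptions chosen to agree with the example table; the module's actual pattern may differ.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; it is not part of HTML::Encoding.
sub xml_decl_encoding {
    my ($decl) = @_;

    # Anchored at the start of the string; "<?xml" must be followed by
    # white-space. The first quoted encoding="..." found wins, even if it
    # sits inside another pseudo-attribute's value.
    return unless $decl =~ /^<\?xml\s.*?encoding\s*=\s*(["'])(.*?)\1/s;

    my $name = $2;
    $name =~ s/^\s+|\s+$//g;    # trim
    $name =~ s/\s+/ /g;         # collapse inner white-space
    return $name;
}

for my $decl ("<?xml version='1.0' encoding='utf-8'?>",
              "<?xml-stylesheet encoding='utf-8'?>",
              "<?xml x='encoding=\"a\"' encoding='b'?>") {
    my $enc = xml_decl_encoding($decl);
    printf "%-42s => %s\n", $decl, defined $enc ? "'$enc'" : 'undef';
}
```

Because nothing here checks well-formedness, a caller that needs a conforming XML processor would still have to validate the declaration separately.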
This is useful to distinguish e.g. UTF-16LE from UTF-8 if the byte string starts with neither a byte order mark nor an XML declaration (e.g. if the document is an HTML document). It yields at least a base encoding which can be used to decode enough of the document to find <meta> elements using encoding_from_meta_element. $options{whitespace} defaults to qw/CR LF SP TB/. In list context it returns the matching encodings in order of the number of octets matched; in scalar context it returns the best match. Returns nothing if unsuccessful.
Examples:
  +---------------+----------+---------------------+
  | String        | Encoding | Result              |
  +---------------+----------+---------------------+
  | '<!DOCTYPE '  | UTF-16LE | UTF-16LE            |
  | ' <!DOCTYPE ' | UTF-16LE | UTF-16LE            |
  | '...'         | UTF-16LE | undef               |
  | '...<'        | UTF-16LE | undef               |
  | '<'           | UTF-8    | ISO-8859-1 or UTF-8 |
  | "<!--\xF6-->" | UTF-8    | ISO-8859-1 or UTF-8 |
  +---------------+----------+---------------------+
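One way to picture this kind of first-character matching is sketched below using only the core Encode module. It is a simplification under stated assumptions, not the module's algorithm: the helper first_char_candidates is hypothetical, it hard-codes CR, LF, SP and TAB as white-space, and it only looks for a leading "<".

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper for illustration; it is not part of HTML::Encoding.
sub first_char_candidates {
    my ($octets, @encodings) = @_;
    my @hits;
    for my $name (@encodings) {
        # CR, LF, SP and TAB as octet sequences in the candidate encoding.
        my @ws = map { encode($name, $_) } "\x0D", "\x0A", "\x20", "\x09";
        my $lt = encode($name, "<");

        # Skip leading white-space as it would appear in this encoding.
        my $pos = 0;
        SKIP: while (1) {
            for my $w (@ws) {
                if (substr($octets, $pos, length $w) eq $w) {
                    $pos += length $w;
                    next SKIP;
                }
            }
            last;
        }

        # The next unit must be "<" in this encoding.
        next unless substr($octets, $pos, length $lt) eq $lt;
        push @hits, [$name, $pos + length $lt];
    }
    # Most octets matched first.
    return map { $_->[0] } sort { $b->[1] <=> $a->[1] } @hits;
}

my @default = qw(ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8);
my @guess = first_char_candidates(encode('UTF-16LE', ' <!DOCTYPE '), @default);
print "candidates: @guess\n";
```

A single-byte "<" matches both ISO-8859-1 and UTF-8 equally well in this sketch, which illustrates why the table reports "ISO-8859-1 or UTF-8" for such input.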
  * </head>
  * encoding errors
  * the end of the input
  * ... (see todo)
If relevant <meta> elements, i.e. something like
    <meta http-equiv=Content-Type content='...'>
are found, uses encoding_from_content_type to extract the charset parameter. In list context it returns all such encodings it could find, in document order; in scalar context it returns the first encoding (it will currently look for others regardless of calling context). It returns nothing if that fails for some reason.
Note that there are many edge cases where this does not yield ``proper'' results, depending on the capabilities of the HTML::Parser version and the options you pass for it, for example,
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
      <!ENTITY content_type "text/html;charset=utf-8">
    ]>
    <meta http-equiv="Content-Type" content="&content_type;">
    <title></title>
    <p>...</p>
This would likely not detect the utf-8 value if HTML::Parser does not resolve the entity. This should however only be a concern for documents specifically crafted to break the encoding detection.
Examples:
  +----------------------------+----------+-----------+----------+
  | Input                      | Encoding | Encodings | Result   |
  +----------------------------+----------+-----------+----------+
  | "<?xml?>"                  | UTF-16   | default   | UTF-16BE |
  | "<?xml?>"                  | UTF-16LE | default   | undef    |
  | "<?xml encoding='utf-8'?>" | UTF-16LE | default   | utf-8    |
  | "<?xml encoding='utf-8'?>" | UTF-16   | default   | UTF-16BE |
  | "<?xml encoding='cp37'?>"  | CP37     | default   | undef    |
  | "<?xml encoding='cp37'?>"  | CP37     | CP37      | cp37     |
  +----------------------------+----------+-----------+----------+
Lacking a return value from this routine and higher-level protocol information (such as protocol encoding defaults) processors would be required to assume that the document is UTF-8 encoded.
Note however that the return value depends on the set of suspected encodings you pass to it. For example, by default, EBCDIC encodings would not be considered and thus for
    <?xml version='1.0' encoding='cp37'?>
this routine would return the undefined value. You can modify the list of suspected encodings using $options{encodings}.
Returns nothing if no declaration could be found. Otherwise it returns the winning declaration in scalar context, and a list of encoding source and encoding name in list context; see ENCODING SOURCES.
...
Other problems arise from differences between HTML and XHTML syntax and encoding detection rules, for example, the input could be
    Content-Type: text/html

    <?xml version='1.0' encoding='utf-8'?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <meta http-equiv = "Content-Type"
             content = "text/html;charset=iso-8859-2">
    <title></title>
    <p>...</p>
This is a perfectly legal HTML 4.01 document, and implementations might be expected to consider the document ISO-8859-2 encoded, as XML rules for encoding detection do not apply to HTML documents. This module attempts to avoid deciding which rules apply for a specific document and would thus by default return 'utf-8' for this input.
On the other hand, if the input omits the encoding declaration,
    Content-Type: text/html

    <?xml version='1.0'?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <meta http-equiv = "Content-Type"
             content = "text/html;charset=iso-8859-2">
    <title></title>
    <p>...</p>
it would return 'iso-8859-2'. Similar problems would arise from other differences between HTML and XHTML, for example consider
    Content-Type: text/html

    <?foo >
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html ...
    ?>
    ...
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
    ...
If this is processed using HTML rules, the first > ends the processing instruction and the XHTML document type declaration is the relevant declaration for the document. If it is processed using XHTML rules, the ?> ends the processing instruction and the HTML document type declaration is the relevant declaration.
In other words, an application would need to assume a certain character encoding (family) to process enough of the document to determine whether it is XHTML or HTML, and the result of this detection would depend on which processing rules are assumed in order to process it. It is thus in essence not possible to write a ``perfect'' detection algorithm, which is why this routine attempts to avoid making any decisions on this matter.
$HTML::Encoding::DEFAULT_ENCODINGS.
qr{^text/html$}i.
qr{^.+/(?:.+\+)?xml$}i.
qr{^text/(?:.+\+)?xml$}i. This will only be checked if is_xml matches, too.
ISO-8859-1.
UTF-8.
undef, in which case the default is ignored. This should be set to US-ASCII if desired, as this module is by default inconsistent with RFC 3023, which requires that US-ASCII is assumed for text/xml documents without a charset parameter in the HTTP header. This requirement is inconsistent with RFC 2616 (HTTP/1.1), which requires assuming ISO-8859-1, has been widely ignored, and is thus disabled by default.
1.
1.
This is further possibly inconsistent with XML MIME types that differ in other ways from application/xml, for example if the MIME type does not allow for a charset parameter, in which case applications might be expected to ignore the charset parameter if erroneously provided.
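The three content-type patterns quoted in this section can be tried out in isolation. The sketch below is only an illustration: the helper classify_media_type, its return labels, and the order of the checks are assumptions, not the module's logic. It does follow the stated rule that the text/xml pattern is consulted only when the general XML pattern matches as well.

```perl
use strict;
use warnings;

# The three default patterns quoted in this section.
my $html_re     = qr{^text/html$}i;
my $xml_re      = qr{^.+/(?:.+\+)?xml$}i;
my $text_xml_re = qr{^text/(?:.+\+)?xml$}i;

# Hypothetical helper for illustration; it is not part of HTML::Encoding.
sub classify_media_type {
    my ($type) = @_;
    return 'html' if $type =~ $html_re;
    if ($type =~ $xml_re) {
        # Only checked once the general XML pattern has matched.
        return $type =~ $text_xml_re ? 'text_xml' : 'xml';
    }
    return 'other';
}

for my $type (qw(text/html application/xhtml+xml text/xml image/svg+xml text/plain)) {
    printf "%-24s => %s\n", $type, classify_media_type($type);
}
```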
By default, this module does not support EBCDIC encodings. To enable support for EBCDIC encodings you can either change the $HTML::Encoding::DEFAULT_ENCODINGS array reference or pass the encodings to the routines you use through the encodings option, for example
    my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
    my $enc = encoding_from_xml_document($doc, encodings => \@try);
Note that there are some subtle differences between various EBCDIC encodings, for example ! is mapped to 0x5A in posix-bc and to 0x4F in cp500; these differences might affect processing in yet undetermined ways.
  * bundle with test suite
  * optimize some routines to give up once successful
  * avoid transcoding for HTML::Parser if e.g. ISO-8859-1
  * http://www.w3.org/TR/REC-xml/#charencoding
  * http://www.w3.org/TR/REC-xml/#sec-guessing
  * http://www.w3.org/TR/xml11/#charencoding
  * http://www.w3.org/TR/xml11/#sec-guessing
  * http://www.w3.org/TR/html4/charset.html#h-5.2.2
  * http://www.w3.org/TR/xhtml1/#C_9
  * http://www.ietf.org/rfc/rfc2616.txt
  * http://www.ietf.org/rfc/rfc2854.txt
  * http://www.ietf.org/rfc/rfc3023.txt
  * perlunicode
  * Encode
  * HTML::Parser
Copyright (c) 2004 Bjoern Hoehrmann <bjoern@hoehrmann.de>. This module is licensed under the same terms as Perl itself.