Copyright | (c) 2020 Sam May |
---|---|
License | MPL-2.0 |
Maintainer | ag.eitilt@gmail.com |
Stability | experimental |
Portability | portable |
Safe Haskell | Safe-Inferred |
Language | Haskell98 |
Web.Willow.Common.Encoding.Sniffer
Description
In an ideal internet, every server would declare the binary encoding with which
it is transmitting a file (actually, the true ideal would be for it to always
be Utf8, but there are still a lot of legacy documents out there). However,
that's not always the case.
A good fallback would be for every document to declare for itself which encoding it
has been saved in. However, not every document does, and the ones that do may still
get it wrong (take, for instance, the case of a server which does translate
everything it sends to Utf8, leaving any other encoding declared inside the
document out of date).
And so, the HTML standard describes an
algorithm for guessing the proper bytes-to-text translation to use in
decode. While this does therefore assume some
HTML syntax and specific tags, none of the semantics should cause an issue for
other filetypes.
Synopsis
- data Encoding
- = Utf8
- | Utf16be
- | Utf16le
- | Big5
- | EucJp
- | EucKr
- | Gb18030
- | Gbk
- | Ibm866
- | Iso2022Jp
- | Iso8859_2
- | Iso8859_3
- | Iso8859_4
- | Iso8859_5
- | Iso8859_6
- | Iso8859_7
- | Iso8859_8
- | Iso8859_8i
- | Iso8859_10
- | Iso8859_13
- | Iso8859_14
- | Iso8859_15
- | Iso8859_16
- | Koi8R
- | Koi8U
- | Macintosh
- | MacintoshCyrillic
- | ShiftJis
- | Windows874
- | Windows1250
- | Windows1251
- | Windows1252
- | Windows1253
- | Windows1254
- | Windows1255
- | Windows1256
- | Windows1257
- | Windows1258
- | Replacement
- | UserDefined
- data Confidence
- data ReparseData = ReparseData {}
- emptyReparseData :: ReparseData
- sniff :: SnifferEnvironment -> ByteString -> Confidence
- data SnifferEnvironment = SnifferEnvironment {}
- emptySnifferEnvironment :: SnifferEnvironment
- sniffDecoderState :: SnifferEnvironment -> ByteString -> DecoderState
- decoderConfidence :: DecoderState -> Confidence
- confidenceEncoding :: Confidence -> Encoding
- extractEncoding :: ByteString -> Maybe Encoding
Types
Encoding: encoding
All character encoding schemes supported by the HTML standard, defined as a
bidirectional map between characters and binary sequences. Utf8 is
strongly encouraged for new content (including all encoding purposes), but
the others are retained for compatibility with existing pages.
Note that none of these mappings is a complete function: each leaves some inputs unmapped, to one degree or another, and no guarantee is made that a mapping round-trips.
Constructors
Utf8 | The UTF-8 encoding for Unicode. |
Utf16be | The UTF-16 encoding for Unicode, in big endian order. No encoder is provided for this scheme. |
Utf16le | The UTF-16 encoding for Unicode, in little endian order. No encoder is provided for this scheme. |
Big5 | Big5, primarily covering traditional Chinese characters. |
EucJp | EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212. |
EucKr | EUC-KR, primarily covering Hangul. |
Gb18030 | The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters. Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization. |
Gbk | GBK, primarily covering simplified Chinese characters. In practice, this is just |
Ibm866 | DOS and OS/2 code page for Cyrillic characters. |
Iso2022Jp | A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana. |
Iso8859_2 | Latin-2 (Central European). |
Iso8859_3 | Latin-3 (South European and Esperanto). |
Iso8859_4 | Latin-4 (North European). |
Iso8859_5 | |
Iso8859_6 | |
Iso8859_7 | Latin/Greek (modern monotonic). |
Iso8859_8 | Latin/Hebrew (visual order). |
Iso8859_8i | Latin/Hebrew (logical order). |
Iso8859_10 | Latin-6 (Nordic). |
Iso8859_13 | Latin-7 (Baltic Rim). |
Iso8859_14 | Latin-8 (Celtic). |
Iso8859_15 | Latin-9 (revision of ISO 8859-1 Latin-1, Western European). |
Iso8859_16 | Latin-10 (South-Eastern European). |
Koi8R | KOI-8 specialized for Russian Cyrillic. |
Koi8U | KOI-8 specialized for Ukrainian Cyrillic. |
Macintosh | |
MacintoshCyrillic | Mac OS Cyrillic (as of Mac OS 9.0). |
ShiftJis | The Windows variant (code page 932) of Shift JIS. |
Windows874 | ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots. Note that this encoding is always used instead of pure Latin/Thai. |
Windows1250 | The Windows extension and rearrangement of ISO 8859-2 Latin-2. |
Windows1251 | |
Windows1252 | The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-1. |
Windows1253 | Windows Greek (modern monotonic). |
Windows1254 | The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-5. |
Windows1255 | The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew. |
Windows1256 | |
Windows1257 | |
Windows1258 | |
Replacement | The input is reduced to a single No encoder is provided for this scheme. |
UserDefined | Non-ASCII bytes ( |
Instances
Bounded Encoding # | |
Enum Encoding # | |
Defined in Web.Willow.Common.Encoding.Common | |
Eq Encoding # | |
Ord Encoding # | |
Defined in Web.Willow.Common.Encoding.Common | |
Read Encoding # | |
Show Encoding # | |
Hashable Encoding # | |
Defined in Web.Willow.Common.Encoding.Common |
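Since Encoding has Bounded and Enum instances, every supported scheme can be listed without naming the constructors by hand. The following is a minimal sketch, assuming the package providing this module is installed; allEncodings is a name made up for the example, not part of the library.

```haskell
import Web.Willow.Common.Encoding.Sniffer (Encoding)

-- | Hypothetical helper: every scheme the type can name, from 'minBound'
-- to 'maxBound' in 'Enum' order.
allEncodings :: [Encoding]
allEncodings = [minBound .. maxBound]

main :: IO ()
main = do
    putStrLn ("Known encodings: " ++ show (length allEncodings))
    mapM_ print allEncodings
```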
data Confidence #
HTML: confidence
How likely the specified encoding is to be the actual stream encoding.
The spec names a third confidence level, irrelevant, to be used when the
stream doesn't depend on any particular encoding scheme (i.e. it is
composed directly of Chars rather than parsed from a binary stream). This
has not been included in the sum type, as it makes little sense to have that
as a parameter of the decoding stage. Use Maybe DecoderState to
represent it instead.
Constructors
Tentative Encoding ReparseData | The binary stream is likely the named encoding, but more data may prove it to be something else. In the latter case, the |
Certain Encoding | The binary stream is confirmed to be of the given encoding. |
Instances
Eq Confidence # | |
Defined in Web.Willow.Common.Encoding.Common | |
Read Confidence # | |
Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS Confidence # readList :: ReadS [Confidence] # readPrec :: ReadPrec Confidence # readListPrec :: ReadPrec [Confidence] # | |
Show Confidence # | |
Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> Confidence -> ShowS # show :: Confidence -> String # showList :: [Confidence] -> ShowS # |
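The two constructors can be told apart with an ordinary pattern match. The sketch below uses only the constructors and values documented on this page; describe is a hypothetical helper, and the exact text produced by show depends on the library's Show instance.

```haskell
import Web.Willow.Common.Encoding.Sniffer
    ( Confidence (..), Encoding (..), emptyReparseData )

-- | Hypothetical helper: a human-readable summary of a guess.
describe :: Confidence -> String
describe (Certain enc)     = "confirmed as " ++ show enc
describe (Tentative enc _) = "probably " ++ show enc ++ ", pending more data"

main :: IO ()
main = do
    putStrLn (describe (Certain Utf8))
    putStrLn (describe (Tentative Windows1252 emptyReparseData))
```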
data ReparseData #
HTML: change the encoding
The data required to determine if a new encoding would produce output
identical to what the current one has already produced, and to restart the parsing
with the new one if the two are incompatible. Values may be easily
initialized via emptyReparseData.
Constructors
ReparseData | |
Fields
|
Instances
Eq ReparseData # | |
Defined in Web.Willow.Common.Encoding.Common | |
Read ReparseData # | |
Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS ReparseData # readList :: ReadS [ReparseData] # readPrec :: ReadPrec ReparseData # readListPrec :: ReadPrec [ReparseData] # | |
Show ReparseData # | |
Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> ReparseData -> ShowS # show :: ReparseData -> String # showList :: [ReparseData] -> ShowS # |
emptyReparseData :: ReparseData #
The collection of data which would indicate nothing has yet been parsed.
The Algorithm
sniff :: SnifferEnvironment -> ByteString -> Confidence #
HTML: encoding sniffing algorithm
Given a stream and related metadata, try to determine what encoding may have been used to write it.
Will resolve and/or wait for the number of bytes requested by prescanDepth
to be available in the stream (or for the end of the stream, if that comes
sooner), if they have not yet been produced.
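A call might look like the following sketch. It assumes the package providing this module is installed, and it relies on OverloadedStrings so that the literal becomes whichever ByteString type sniff expects; sample is a name invented for the example. The markup is made up, and the result printed depends on what the sniffer finds in it.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Web.Willow.Common.Encoding.Sniffer

-- | Hypothetical input: a short page declaring its own encoding.
sample :: Confidence
sample = sniff emptySnifferEnvironment
    "<!DOCTYPE html><meta charset=\"windows-1251\"><title>Example</title>"

main :: IO ()
main =
    case sample of
        Certain enc     -> putStrLn ("certain: " ++ show enc)
        Tentative enc _ -> putStrLn ("tentative: " ++ show enc)
```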
data SnifferEnvironment #
Various datapoints which may indicate a document's binary encoding, to be
fed into the sniff algorithm. Values may be easily instantiated as
updates to emptySnifferEnvironment.
Constructors
SnifferEnvironment | |
Fields
|
Instances
Eq SnifferEnvironment # | |
Defined in Web.Willow.Common.Encoding.Sniffer Methods (==) :: SnifferEnvironment -> SnifferEnvironment -> Bool # (/=) :: SnifferEnvironment -> SnifferEnvironment -> Bool # | |
Read SnifferEnvironment # | |
Defined in Web.Willow.Common.Encoding.Sniffer Methods readsPrec :: Int -> ReadS SnifferEnvironment # readList :: ReadS [SnifferEnvironment] # | |
Show SnifferEnvironment # | |
Defined in Web.Willow.Common.Encoding.Sniffer Methods showsPrec :: Int -> SnifferEnvironment -> ShowS # show :: SnifferEnvironment -> String # showList :: [SnifferEnvironment] -> ShowS # |
emptySnifferEnvironment :: SnifferEnvironment #
A neutral set of parameters to pass to the sniff algorithm: no accessory
data, and a prescanDepth limit of 1024 bytes.
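Other configurations can then be written as record updates on top of that neutral value. This sketch assumes that prescanDepth is an exported record field of SnifferEnvironment with a numeric type, which the field listing on this page does not confirm; deepScan is a hypothetical name.

```haskell
import Web.Willow.Common.Encoding.Sniffer
    ( SnifferEnvironment (..), emptySnifferEnvironment )

-- | Hypothetical configuration: scan twice as far as the default 1024 bytes.
-- Assumes 'prescanDepth' is a numeric record field.
deepScan :: SnifferEnvironment
deepScan = emptySnifferEnvironment { prescanDepth = 2048 }

main :: IO ()
main = print deepScan
```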
sniffDecoderState :: SnifferEnvironment -> ByteString -> DecoderState #
Guess what encoding may be in use by the binary stream, and generate a collection of data based on that which results in the behaviour described by the decoding algorithm at the start of the stream.
Auxiliary
decoderConfidence :: DecoderState -> Confidence #
The encoding scheme currently in use by the parser, along with how likely that scheme is to actually represent the binary stream.
confidenceEncoding :: Confidence -> Encoding #
Extract the underlying encoding scheme from the wrapping data.
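The auxiliary functions compose with sniffDecoderState, so a caller who only wants the guessed scheme can unwrap it in two steps. This is a minimal sketch, assuming the package providing this module is installed; guessedEncoding is a hypothetical name, and OverloadedStrings supplies the ByteString literal.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Web.Willow.Common.Encoding.Sniffer

-- | Hypothetical example value: the scheme guessed for a tiny document,
-- obtained by unwrapping the initial decoder state.
guessedEncoding :: Encoding
guessedEncoding =
    confidenceEncoding
        (decoderConfidence
            (sniffDecoderState emptySnifferEnvironment "<meta charset=\"utf-8\">"))

main :: IO ()
main = print guessedEncoding
```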
extractEncoding :: ByteString -> Maybe Encoding #
HTML: algorithm for extracting a character encoding from a meta element
Find the first occurrence of the ASCII-encoded string charset in the
stream, and try to parse its attribute-style value into an Encoding.
Returns Nothing if the stream does not contain charset followed by =,
or if the value cannot be successfully parsed as an encoding label.
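The search described above can be sketched without the library. The code below is a simplified illustration of the HTML standard's steps, not this module's implementation: it works on strict Data.ByteString.Char8 values purely for convenience, stops at the raw label, and leaves out the final step of mapping that label to an Encoding. The names charsetLabel, isAsciiSpace and asciiLower are invented for the example.

```haskell
import qualified Data.ByteString.Char8 as B

-- | ASCII whitespace as used by the HTML standard: tab, LF, FF, CR, space.
isAsciiSpace :: Char -> Bool
isAsciiSpace c = c `elem` " \t\n\f\r"

-- | Lower-case ASCII letters only, leaving every other byte untouched.
asciiLower :: Char -> Char
asciiLower c
    | c >= 'A' && c <= 'Z' = toEnum (fromEnum c + 32)
    | otherwise            = c

-- | Hypothetical helper: the raw label after the first usable @charset=@.
charsetLabel :: B.ByteString -> Maybe B.ByteString
charsetLabel input = search (B.map asciiLower input) input
  where
    needle = B.pack "charset"

    -- 'lower' and 'orig' always cover the same byte range.
    search lower orig =
        case B.breakSubstring needle lower of
            (_, found) | B.null found -> Nothing
            (before, _) ->
                let offset    = B.length before + B.length needle
                    afterName = B.dropWhile isAsciiSpace (B.drop offset orig)
                in  case B.uncons afterName of
                        Just ('=', value) -> readValue (B.dropWhile isAsciiSpace value)
                        -- No '=' here: keep looking for a later "charset".
                        _ -> search (B.drop offset lower) (B.drop offset orig)

    readValue value =
        case B.uncons value of
            Nothing -> Nothing
            Just (q, rest)
                | q == '"' || q == '\'' ->
                    -- Quoted: the closing quote must exist.
                    let (label, close) = B.break (== q) rest
                    in  if B.null close then Nothing else Just label
                | otherwise ->
                    -- Unquoted: runs to whitespace, ';', or end of input.
                    Just (B.takeWhile (\c -> not (isAsciiSpace c) && c /= ';') value)

main :: IO ()
main = do
    print (charsetLabel (B.pack "text/html; charset=utf-8"))       -- Just "utf-8"
    print (charsetLabel (B.pack "CHARSET = 'Shift_JIS' ignored"))  -- Just "Shift_JIS"
    print (charsetLabel (B.pack "charset without an equals sign")) -- Nothing
```

Searching a lower-cased copy while slicing the original keeps the match case-insensitive without altering the label that is returned.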