Copyright | (c) 2020 Sam May |
---|---|
License | MPL-2.0 |
Maintainer | ag.eitilt@gmail.com |
Stability | experimental |
Portability | portable |
Safe Haskell | Safe-Inferred |
Language | Haskell98 |
Web.Willow.Common.Encoding
Description
This module and the internal branch it heads implement the Encoding specification for translating text to and from UTF-8 and a selection of less-favoured but grandfathered encoding schemes. Because the standard's authors have prioritized security, followed closely by compatibility with existing web pages, the algorithms described and the names associated with them do not perfectly match the descriptions given in the original specifications of the various encodings.
Synopsis
- data Encoding
- = Utf8
- | Utf16be
- | Utf16le
- | Big5
- | EucJp
- | EucKr
- | Gb18030
- | Gbk
- | Ibm866
- | Iso2022Jp
- | Iso8859_2
- | Iso8859_3
- | Iso8859_4
- | Iso8859_5
- | Iso8859_6
- | Iso8859_7
- | Iso8859_8
- | Iso8859_8i
- | Iso8859_10
- | Iso8859_13
- | Iso8859_14
- | Iso8859_15
- | Iso8859_16
- | Koi8R
- | Koi8U
- | Macintosh
- | MacintoshCyrillic
- | ShiftJis
- | Windows874
- | Windows1250
- | Windows1251
- | Windows1252
- | Windows1253
- | Windows1254
- | Windows1255
- | Windows1256
- | Windows1257
- | Windows1258
- | Replacement
- | UserDefined
- data DecoderState
- decoderEncoding :: DecoderState -> Encoding
- decoderRemainder :: DecoderState -> ShortByteString
- data ReparseData
- data EncoderState
- initialDecoderState :: Encoding -> DecoderState
- setEncodingCertain :: Encoding -> DecoderState -> DecoderState
- setRemainder :: ShortByteString -> DecoderState -> DecoderState
- initialEncoderState :: Encoding -> EncoderState
- decode :: DecoderState -> ByteString -> ([Either ShortByteString String], DecoderState)
- decode' :: DecoderState -> ByteString -> (Text, DecoderState)
- byteOrderMark :: ByteString -> (Maybe Encoding, ByteString)
- finalizeDecode :: DecoderState -> [Either ShortByteString String]
- finalizeDecode' :: DecoderState -> Text
- decodeUtf8 :: ByteString -> ([Either ShortByteString String], DecoderState)
- decodeUtf8NoBom :: ByteString -> ([Either ShortByteString String], DecoderState)
- decodeUtf8' :: ByteString -> (Text, DecoderState)
- decodeUtf8NoBom' :: ByteString -> (Text, DecoderState)
- encode :: EncoderState -> Text -> ([Either Char ShortByteString], EncoderState)
- encode' :: EncoderState -> Text -> (ByteString, EncoderState)
- encodeUtf8 :: Text -> (ByteString, EncoderState)
- decodeStep :: DecoderState -> ByteString -> (Maybe (Either ShortByteString String), DecoderState, ByteString)
- encodeStep :: EncoderState -> Text -> Maybe (Either Char ShortByteString, EncoderState, Text)
- decodeStep' :: DecoderState -> ByteString -> (Maybe String, DecoderState, ByteString)
- encodeStep' :: EncoderState -> Text -> Maybe (ShortByteString, EncoderState, Text)
- data InnerDecoderState
- data InnerEncoderState
Types
Encoding: encoding
All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (including all encoding purposes), but the others are retained for compatibility with existing pages.
Note that none of these mappings is a total function (each leaves some characters or byte sequences unmapped, to one degree or another), and that no guarantee is made that the mapping round-trips.
Constructors
Utf8 | The UTF-8 encoding for Unicode. |
Utf16be | The UTF-16 encoding for Unicode, in big endian order. No encoder is provided for this scheme. |
Utf16le | The UTF-16 encoding for Unicode, in little endian order. No encoder is provided for this scheme. |
Big5 | Big5, primarily covering traditional Chinese characters. |
EucJp | EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212. |
EucKr | EUC-KR, primarily covering Hangul. |
Gb18030 | The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters. Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization. |
Gbk | GBK, primarily covering simplified Chinese characters. In practice, this is just |
Ibm866 | DOS and OS/2 code page for Cyrillic characters. |
Iso2022Jp | A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana. |
Iso8859_2 | Latin-2 (Central European). |
Iso8859_3 | Latin-3 (South European and Esperanto). |
Iso8859_4 | Latin-4 (North European). |
Iso8859_5 | |
Iso8859_6 | |
Iso8859_7 | Latin/Greek (modern monotonic). |
Iso8859_8 | Latin/Hebrew (visual order). |
Iso8859_8i | Latin/Hebrew (logical order). |
Iso8859_10 | Latin-6 (Nordic). |
Iso8859_13 | Latin-7 (Baltic Rim). |
Iso8859_14 | Latin-8 (Celtic). |
Iso8859_15 | Latin-9 (revision of ISO 8859-1 Latin-1, Western European). |
Iso8859_16 | Latin-10 (South-Eastern European). |
Koi8R | KOI-8 specialized for Russian Cyrillic. |
Koi8U | KOI-8 specialized for Ukrainian Cyrillic. |
Macintosh | |
MacintoshCyrillic | Mac OS Cyrillic (as of Mac OS 9.0). |
ShiftJis | The Windows variant (code page 932) of Shift JIS. |
Windows874 | ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots. Note that this encoding is always used instead of pure Latin/Thai. |
Windows1250 | The Windows extension and rearrangement of ISO 8859-2 Latin-2. |
Windows1251 | |
Windows1252 | The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-1. |
Windows1253 | Windows Greek (modern monotonic). |
Windows1254 | The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-5. |
Windows1255 | The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew. |
Windows1256 | |
Windows1257 | |
Windows1258 | |
Replacement | The input is reduced to a single \xFFFD replacement character. No encoder is provided for this scheme. |
UserDefined | Non-ASCII bytes ( |
Instances
Bounded Encoding # | |
Enum Encoding # | |
Defined in Web.Willow.Common.Encoding.Common | |
Eq Encoding # | |
Ord Encoding # | |
Defined in Web.Willow.Common.Encoding.Common | |
Read Encoding # | |
Show Encoding # | |
Hashable Encoding # | |
Defined in Web.Willow.Common.Encoding.Common |
data DecoderState #
All the data which needs to be tracked for correct behaviour in decoding a binary stream into readable text.
Instances
Eq DecoderState # | |
Defined in Web.Willow.Common.Encoding.Common | |
Read DecoderState # | |
Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS DecoderState # readList :: ReadS [DecoderState] # | |
Show DecoderState # | |
Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> DecoderState -> ShowS # show :: DecoderState -> String # showList :: [DecoderState] -> ShowS # |
decoderEncoding :: DecoderState -> Encoding #
Retrieve the encoding scheme currently used by the decoder to decode the binary document stream.
decoderRemainder :: DecoderState -> ShortByteString #
Any leftover bytes at the end of the binary stream, which require further input to be processed in order to correctly map to a character or error value.
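To illustrate why a decoder needs to carry leftover bytes between chunks, here is a minimal standalone sketch for UTF-8 only. It is not the library's implementation: `expectedLength`, `isContinuation` and `splitRemainder` are hypothetical names, plain `[Word8]` lists stand in for `ByteString`, and no validation is attempted beyond recognising lead and continuation bytes.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- | How many bytes a UTF-8 sequence starting with this byte should span;
-- 'Nothing' for bytes that cannot start a sequence.  (Hypothetical helper;
-- it does not reject overlong or out-of-range lead bytes.)
expectedLength :: Word8 -> Maybe Int
expectedLength b
  | b < 0x80 = Just 1
  | b .&. 0xE0 == 0xC0 = Just 2
  | b .&. 0xF0 == 0xE0 = Just 3
  | b .&. 0xF8 == 0xF0 = Just 4
  | otherwise = Nothing

-- | Continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- | Split a chunk into the part that can be decoded now and a trailing,
-- still-incomplete sequence that has to wait for more input.
splitRemainder :: [Word8] -> ([Word8], [Word8])
splitRemainder bytes =
  case break (not . isContinuation) (reverse bytes) of
    (contsRev, lead : restRev)
      | Just n <- expectedLength lead
      , length contsRev + 1 < n ->
          (reverse restRev, lead : reverse contsRev)
    _ -> (bytes, [])
```

For instance, a chunk ending in the first two bytes of a three-byte character would be split so that those two bytes are held back until the next chunk arrives.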
data ReparseData #
HTML: change the encoding
The data required to determine if a new encoding would produce an identical output to what the current one has already done, and to restart the parsing with the new one if the two are incompatible. Values may be easily initialized via emptyReparseData.
Instances
Eq ReparseData # | |
Defined in Web.Willow.Common.Encoding.Common | |
Read ReparseData # | |
Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS ReparseData # readList :: ReadS [ReparseData] # readPrec :: ReadPrec ReparseData # readListPrec :: ReadPrec [ReparseData] # | |
Show ReparseData # | |
Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> ReparseData -> ShowS # show :: ReparseData -> String # showList :: [ReparseData] -> ShowS # |
data EncoderState #
All the data which needs to be tracked for correct behaviour in encoding readable text into a binary stream.
Instances
Eq EncoderState # | |
Defined in Web.Willow.Common.Encoding.Common | |
Read EncoderState # | |
Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS EncoderState # readList :: ReadS [EncoderState] # | |
Show EncoderState # | |
Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> EncoderState -> ShowS # show :: EncoderState -> String # showList :: [EncoderState] -> ShowS # |
Initialization
Decoding
initialDecoderState :: Encoding -> DecoderState #
The collection of data which, for any given encoding scheme, results in behaviour according to the vanilla decoder before any bytes have been read.
setEncodingCertain :: Encoding -> DecoderState -> DecoderState #
Instruct the decoder that the binary document stream is known with certainty to be in the given encoding.
setRemainder :: ShortByteString -> DecoderState -> DecoderState #
Encoding
initialEncoderState :: Encoding -> EncoderState #
The collection of data which, for any given encoding scheme, results in behaviour according to the vanilla encoder before any characters have been read.
Transformations
Decoding
The standard decode and decode' functions (and therefore the similar but higher-level functions which build on them) defer to a byte-order mark over the argument encoding. If this behaviour isn't desired (i.e., you want to force the parser to use the encoding, even if it's not appropriate), try to explicitly parse byteOrderMark first:
    (_, input') = byteOrderMark input
    (output, state') = decode (initialDecoderState enc) input'
decode :: DecoderState -> ByteString -> ([Either ShortByteString String], DecoderState) #
Encoding: run an encoding's decoder with error mode fatal
Given a character encoding scheme, transform a dependent ByteString into portable Chars. If any byte sequences are meaningless or illegal, they are returned verbatim for error reporting; a Left should not be parsed further.
See decodeStep to decode only a minimal section, or decode' for simple error replacement. Call finalizeDecode on the returned DecoderState if no further bytes will be added to the document stream.
decode' :: DecoderState -> ByteString -> (Text, DecoderState) #
Encoding: decode
Given a character encoding scheme, transform a dependent ByteString into a portable Text. If any byte sequences are meaningless or illegal, they are replaced with the Unicode replacement character \xFFFD.
See decodeStep' to decode only a minimal section, or decode if the original data should be retained for custom error reporting. Call finalizeDecode' on the returned DecoderState if no further bytes will be added to the document stream.
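The relationship between the fatal and replacement error modes can be sketched as a small post-processing step. This is only an illustration under the assumption that each rejected byte sequence becomes exactly one \xFFFD; `replaceErrors` is a hypothetical name, and `[Word8]` stands in for `ShortByteString` to keep the sketch free of dependencies.

```haskell
import Data.Word (Word8)

-- | Collapse output in the "fatal" shape (rejected bytes in 'Left', decoded
-- text in 'Right') into plain text, substituting U+FFFD for every error.
-- Assumption: one replacement character per rejected sequence.
replaceErrors :: [Either [Word8] String] -> String
replaceErrors = concatMap (either (const "\xFFFD") id)
```

Keeping the Left values around, as decode does, is mainly useful when the caller wants to report which bytes were rejected.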
byteOrderMark :: ByteString -> (Maybe Encoding, ByteString) #
Encoding: BOM sniff
Checks for a "byte-order mark" signature character in various encodings. If present, returns both the encoding found and the remainder of the stream, otherwise returns the input unchanged.
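As a rough standalone sketch of the idea (not the library's code), the check amounts to matching the start of the stream against the three signatures listed by the Encoding standard. `BomEncoding` and `sniffBom` are hypothetical names, and `[Word8]` is used in place of `ByteString`.

```haskell
import Data.List (stripPrefix)
import Data.Word (Word8)

-- | Hypothetical stand-in for the encodings a byte-order mark can signal.
data BomEncoding = BomUtf8 | BomUtf16be | BomUtf16le
  deriving (Eq, Show)

-- | Look for a byte-order mark at the head of the stream.  On a match,
-- return the signalled encoding and the bytes after the mark; otherwise
-- return the input unchanged.
sniffBom :: [Word8] -> (Maybe BomEncoding, [Word8])
sniffBom input = go signatures
  where
    signatures =
      [ ([0xEF, 0xBB, 0xBF], BomUtf8)
      , ([0xFE, 0xFF], BomUtf16be)
      , ([0xFF, 0xFE], BomUtf16le)
      ]
    go [] = (Nothing, input)
    go ((mark, enc) : rest) = case stripPrefix mark input of
      Just remainder -> (Just enc, remainder)
      Nothing -> go rest
```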
finalizeDecode :: DecoderState -> [Either ShortByteString String] #
Explicitly indicate that the input stream will not contain any further bytes, and perform any finalization processing based on that.
See finalizeDecode' for simple error replacement.
finalizeDecode' :: DecoderState -> Text #
Explicitly indicate that the input stream will not contain any further bytes, and perform any finalization processing based on that.
See finalizeDecode if the original data should be retained for custom error reporting.
UTF-8
decodeUtf8 :: ByteString -> ([Either ShortByteString String], DecoderState) #
Read a binary stream of UTF-8 encoded text. If the stream begins with a UTF-8 byte-order mark, it's silently dropped (any other BOM is ignored but remains in the output). Fails (returning a Left) if the stream contains byte sequences which don't represent any character, or which encode a surrogate character.
See decodeUtf8' for simple error replacement, or decodeUtf8NoBom if the BOM should always be retained.
decodeUtf8NoBom :: ByteString -> ([Either ShortByteString String], DecoderState) #
Encoding: UTF-8 decode without BOM or fail
Read a binary stream of UTF-8 encoded text. If the stream begins with a byte-order mark, it is kept as the first character of the output. Fails (returning a Left) if the stream contains byte sequences which don't represent any character, or which encode a surrogate character.
See decodeUtf8NoBom' for simple error replacement, or decodeUtf8' if a redundant UTF-8 BOM should be dropped.
decodeUtf8' :: ByteString -> (Text, DecoderState) #
Encoding: UTF-8 decode
Read a binary stream of UTF-8 encoded text. If the stream begins with a UTF-8 byte-order mark, it's silently dropped (any other BOM is ignored but remains in the output). Any surrogate characters or invalid byte sequences are replaced with the Unicode replacement character \xFFFD.
See decodeUtf8 if the original data should be retained for custom error reporting, or decodeUtf8NoBom' if the BOM should always be retained.
decodeUtf8NoBom' :: ByteString -> (Text, DecoderState) #
Encoding: UTF-8 decode without BOM
Read a binary stream of UTF-8 encoded text. If the stream begins with a byte-order mark, it is kept as the first character of the output. Any surrogate characters or invalid byte sequences are replaced with the Unicode replacement character \xFFFD.
See decodeUtf8NoBom if the original data should be retained for custom error reporting, or decodeUtf8' if a redundant UTF-8 BOM should be dropped.
Encoding
encode :: EncoderState -> Text -> ([Either Char ShortByteString], EncoderState) #
Encoding: run an encoding's encoder with error mode fatal
Given a character encoding scheme, transform a portable Text into a sequence of bytes representing those characters. If the encoding scheme does not define a binary representation for a character in the input, the original Char is returned unchanged for custom error reporting.
See encodeStep to encode only a minimal section, or encode' for escaping with HTML-style character codes.
encode' :: EncoderState -> Text -> (ByteString, EncoderState) #
Encoding: encode
Given a character encoding scheme, transform a portable Text into a sequence of bytes representing those characters. If the encoding scheme does not define a binary representation for a character in the input, it is replaced with an HTML-style escape (e.g. "&#65533;").
See encodeStep' to encode only a minimal section, or encode if the original data should be retained for custom error reporting.
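The HTML-style escaping can be sketched independently of the library. The decimal `&#…;` form below follows the "html" error mode of the Encoding standard; whether this module produces byte-for-byte the same output is an assumption. `htmlEscape` and `escapeErrors` are hypothetical names, and `[Word8]` stands in for `ShortByteString`.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Spell an unencodable character as a decimal numeric character
-- reference, e.g. U+FFFD becomes the ASCII bytes of "&#65533;".
htmlEscape :: Char -> [Word8]
htmlEscape c = map (fromIntegral . ord) ("&#" ++ show (ord c) ++ ";")

-- | Collapse output in the "fatal" shape (unencodable characters in 'Left',
-- encoded bytes in 'Right') into a single byte sequence.
escapeErrors :: [Either Char [Word8]] -> [Word8]
escapeErrors = concatMap (either htmlEscape id)
```

Because the escape consists only of ASCII characters, it can be written out unchanged in any ASCII-compatible encoding.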
encodeUtf8 :: Text -> (ByteString, EncoderState) #
Encoding: UTF-8 encode
Transform a portable Text into a sequence of bytes according to the UTF-8 encoding scheme.
Continuations
decodeStep :: DecoderState -> ByteString -> (Maybe (Either ShortByteString String), DecoderState, ByteString) #
Encoding: run an encoding's decoder with error mode fatal
Read the smallest number of bytes from the head of the ByteString which would leave the decoder in a re-enterable state. If any byte sequences are meaningless or illegal, they are returned verbatim for error reporting; a Left should not be parsed further.
See decode to decode the entire string at once, or decodeStep' for simple error replacement.
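One way to picture how a whole-input function relates to a step function of this shape is a simple driver loop. The sketch below is standalone and makes two assumptions: that a `Nothing` result means no further output can be produced from the current input, and that the returned state should simply be threaded into the next call. `asciiStep` is a toy ASCII-only decoder and `runSteps` a hypothetical driver; neither is part of the library.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | Toy step function: consume one byte, yielding it as text if it is
-- ASCII, or returning it verbatim in a 'Left' otherwise.  The unit state
-- stands in for a real decoder state.
asciiStep :: () -> [Word8] -> (Maybe (Either [Word8] String), (), [Word8])
asciiStep st [] = (Nothing, st, [])
asciiStep st (b : rest)
  | b < 0x80 = (Just (Right [chr (fromIntegral b)]), st, rest)
  | otherwise = (Just (Left [b]), st, rest)

-- | Repeatedly apply a step function, threading the state through, until
-- it reports that it has nothing more to emit.
runSteps
  :: (s -> [Word8] -> (Maybe a, s, [Word8]))
  -> s
  -> [Word8]
  -> ([a], s)
runSteps step = go
  where
    go st input = case step st input of
      (Nothing, st', _) -> ([], st')
      (Just out, st', rest) ->
        let (outs, final) = go st' rest
        in (out : outs, final)
```

Stepping manually is mostly worthwhile when the caller may want to stop early or switch state partway through the stream.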
encodeStep :: EncoderState -> Text -> Maybe (Either Char ShortByteString, EncoderState, Text) #
Encoding: run an encoding's encoder with error mode fatal
Read the smallest number of characters from the head of the Text which would leave the encoder in a re-enterable state. If the encoding scheme does not define a binary representation for a character in the input, the original Char is returned unchanged for custom error reporting.
See encode to encode the entire string at once, or encodeStep' for simple error replacement.
decodeStep' :: DecoderState -> ByteString -> (Maybe String, DecoderState, ByteString) #
Encoding: run an encoding's decoder with error mode replacement
Read the smallest number of bytes from the head of the ByteString which would leave the decoder in a re-enterable state. Any byte sequences which are meaningless or illegal are replaced with the Unicode replacement character \xFFFD.
See decode' to decode the entire string at once, or decodeStep if the original data should be retained for custom error reporting.
encodeStep' :: EncoderState -> Text -> Maybe (ShortByteString, EncoderState, Text) #
Encoding: run an encoding's encoder with error mode html
Read the smallest number of characters from the head of the Text which would leave the encoder in a re-enterable state. If the encoding scheme does not define a binary representation for a character in the input, it is replaced with an HTML-style escape (e.g. "&#65533;").
See encode' to encode the entire string at once, or encodeStep if the original data should be retained for custom error reporting.
Internal
These types will almost certainly not be useful for anyone using the library, and are exported purely for internal usage. They can be safely ignored. Note, however, that they may be removed without warning.
data InnerDecoderState #
The union of all state variables tracked by the bytes-to-Char decoding algorithm of a single encoding scheme.
Instances
Eq InnerDecoderState # | |
Defined in Web.Willow.Common.Encoding Methods (==) :: InnerDecoderState -> InnerDecoderState -> Bool # (/=) :: InnerDecoderState -> InnerDecoderState -> Bool # | |
Read InnerDecoderState # | |
Defined in Web.Willow.Common.Encoding Methods readsPrec :: Int -> ReadS InnerDecoderState # readList :: ReadS [InnerDecoderState] # | |
Show InnerDecoderState # | |
Defined in Web.Willow.Common.Encoding Methods showsPrec :: Int -> InnerDecoderState -> ShowS # show :: InnerDecoderState -> String # showList :: [InnerDecoderState] -> ShowS # |
data InnerEncoderState #
The union of all state variables tracked by the Char-to-bytes encoding algorithm of a single encoding scheme.
Instances
Eq InnerEncoderState # | |
Defined in Web.Willow.Common.Encoding Methods (==) :: InnerEncoderState -> InnerEncoderState -> Bool # (/=) :: InnerEncoderState -> InnerEncoderState -> Bool # | |
Read InnerEncoderState # | |
Defined in Web.Willow.Common.Encoding Methods readsPrec :: Int -> ReadS InnerEncoderState # readList :: ReadS [InnerEncoderState] # | |
Show InnerEncoderState # | |
Defined in Web.Willow.Common.Encoding Methods showsPrec :: Int -> InnerEncoderState -> ShowS # show :: InnerEncoderState -> String # showList :: [InnerEncoderState] -> ShowS # |