The IETF Layer

This layer implements modules that deal with the kinds of data that make up Internet protocol messages (as described in the RFCs). For the purposes of HTTP, this includes headers, status messages and MIME-encoded entities.

The Entity Module

This module implements the Entity type for all HTML pages, images etc. that are transferred over HTTP. It also includes a simple bit of MIME type handling. The functions for transferring entities around the server are in here. This includes the transfer protocol (the XferProto type) and the producer/consumer system. See the section called Entities, Producers and Consumers in Chapter 8 for an overview.

First here is the MIME type interface.

datatype MType =
        MType of {
            mtype:      string,
            msubtype:   string,
            mparams:    (string * string) list
            }

    |   MTypeUnknown

val formatType:     MType -> TextFrag.Text

(*  This creates a simple type e.g. text/plain.
*)
val simpleType:     string -> string -> MType

(*  This works out a Mime type for a file. It only
    looks at the file name.
*)
val getMimeType:    string -> MType

This just declares the type and provides some utility functions. The declaration is needed here for the entity info. The parsing of MIME types in headers is taken care of in the HTTPHeader module (see the section called The HTTPHeader Module). The formatType function converts the type back to the text format suitable for a header. The getMimeType function maps the file name extension to a MIME type using the types file specified by the TypesConfig configuration parameter (see the section called The Server Parameters in Chapter 8). Most of the work for this is done in the Config module in the section called The Config Module - Interface. These functions are simple enough to not need further explanation here.
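
For example (a sketch; the file name here is hypothetical), a handler can build a simple type directly or ask for one by file name:

val plain = simpleType "text" "plain"

val mtype = getMimeType "/htdocs/logo.png"
    (*  image/png if the types file maps the png extension,
        otherwise MTypeUnknown *)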

The Encoding type is treated similarly to MType above. The parsing of encodings is done in the HTTPHeader module. The type is declared here for the entity info. The formatEncoding function converts an encoding value back to the text format for a header. I won't discuss it further here. The server never takes notice of the encoding. Entities passing through are never decoded, and when the server has to generate an entity, such as a message or a fancy index for a directory, it is never encoded. I don't recognise compression in disk files.

As explained in the section called Entities, Producers and Consumers in Chapter 8 the interface for an entity is abstract. The data is delivered by a producer function. The entity body itself is represented by a function that can create a producer function.

datatype Entity =
        Entity of {
            info:   Info,
            body:   MKProducer
            }
    |   None

and Info = Info of {
        etype:      MType option,
        encoding:   Encoding option,
        length:     int option,
        last_mod:   Date.date option
        }

The Info type contains information about the entity. This corresponds to the Content-Type, Content-Encoding, Content-Length and Last-Modified HTTP headers. Not all of the Info fields are used by all of the different kinds of producer function. For example, if the entity is stored in a disk file then the length and last modification date are taken from the file instead of the Info. The MIME type is derived from the file name extension but the encoding is ignored. Properly, the encoding should be deduced from the extension or detected from the first few bytes of the file, but ignoring the encoding will be enough for this simple server at the moment.
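
As a sketch, the Info for a generated HTML message might carry only the MIME type and leave the rest unset:

val msg_info = Info {
            etype       = SOME (simpleType "text" "html"),
            encoding    = NONE,
            length      = NONE,
            last_mod    = NONE
            }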

Here are the interface declarations for the transfer protocol and producer/consumer system. The protocol is described in more detail in the section called Entities, Producers and Consumers in Chapter 8.

(*  A producer sends messages of this type to its consumer. *)
and XferProto = 
        XferInfo  of Info               (* send this first *)
    |   XferBytes of Word8Vector.vector (* then lots of these *)
    |   XferDone                        (* then one of these *)
    |   XferAbort                       (* or else one of these *)
withtype Consumer = XferProto CML.chan
     and MKProducer = Abort.Abort -> Info -> Consumer -> CML.thread_id

(*  This creates a producer for an entity. *)
val startProducer:  Abort.Abort -> Entity -> Consumer -> CML.thread_id
val textProducer:   TextFrag.Text -> MKProducer
val tmpProducer:    TmpFile.TmpFile -> MKProducer
val fileProducer:   string -> MKProducer

(*  Beware that process producers are one-shot. 
    The holder is closed after the entity has been produced.
*)
val procProducer:   ExecReader.Holder -> MKProducer

To create an Entity value that represents a disk file you would use the fileProducer function. This returns an MKProducer function which in turn can be used to make multiple concurrent producers. Each producer delivers the contents of the file using the transfer protocol.
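
Here is a sketch, with a hypothetical path. The empty Info is harmless because, as described below, fileProducer regenerates the fields from the file each time a producer is spawned.

val mk_body = fileProducer "/htdocs/index.html"

val file_entity = Entity {
            info = Info {etype = NONE, encoding = NONE,
                         length = NONE, last_mod = NONE},
            body = mk_body
            }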

The startProducer function starts the delivery process.

and startProducer abort (Entity {info, body}) consumer = body abort info consumer

|   startProducer abort None consumer =
(
    CML.spawn (fn () => CML.send(consumer, XferDone))
)

All it does is call the producer function. It handles the case of a non-existent entity by starting a producer thread that just sends the XferDone message.
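
For illustration, a minimal consumer might loop over the protocol like this sketch, where the write function standing in for the socket output is hypothetical:

fun consume (chan: Consumer) write =
let
    fun loop() =
    (
        case CML.recv chan of
          XferInfo _    => loop()       (* format the headers here *)
        | XferBytes vec => (write vec; loop())
        | XferDone      => ()
        | XferAbort     => ()
    )
in
    loop()
end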

Here is the fileProducer function.

and fileProducer name abort old_info consumer =
let
    fun producer() =
    let
        (*  All of the info fields are regenerated from the
            file at the time we send it.
        *)
        val opt_len = FileIO.fileSize name
        val modt = Option.map Date.fromTimeUniv (FileIO.modTime name)

        val info = Info {
                    etype       = SOME(getMimeType name),
                    encoding    = NONE,
                    length      = opt_len,
                    last_mod    = modt
                    }
    in
        CML.send(consumer, XferInfo info);

        case opt_len of     (* NONE if can't access the file *)
          NONE     => CML.send(consumer, XferDone)
        | SOME len => send_file()
    end


    and send_file() =
    let
        (*  Record the open file so that it can be finalised if
            the consumer is aborted e.g. due to a connection timeout.
        *)
        fun loop strm =
        (
            if Abort.aborted abort
            then
                CML.send(consumer, XferAbort)
            else
            let
                val chunk = BinIO.inputN(strm, file_chunk)
            in
                if Word8Vector.length chunk = 0
                then
                    CML.send(consumer, XferDone)
                else
                (
                    CML.send(consumer, XferBytes chunk);
                    loop strm
                )
            end
        )
    in
        case BinIOReader.openIt abort name of
          NONE   => ()
        | SOME h => (loop (BinIOReader.get h); BinIOReader.closeIt h)
    end
    handle x => (Log.logExn x; ())

in
    CML.spawn producer
end

I am using currying here. The function call (fileProducer "foo.html") returns a function that takes abort, Info and consumer arguments; this returned function has the type MKProducer. When the remaining arguments are eventually supplied it spawns a thread running its producer function and returns the thread id. New entity info is derived from the file each time a producer thread is spawned. This allows changes to the file length and modification time to be noticed. There is no safety check for a file changing as it is being delivered. If this happens then the Content-Length header won't match the amount of data actually sent.

If the file is accessible, so that its length is known, its contents are sent by the send_file function. This does some binary I/O to read the file in chunks and deliver them in XferBytes messages. The BinIOReader module takes care of waiting for free file descriptors and closing the file on an abort. (See the section called The Open File Manager). I also need to check for the abort condition while sending the file. The CML library has no function like "inputNEvt" which returns an event for when data is ready from a BinIO.instream, so I have to poll for the abort condition before each file read. This is a case where the server may end up trying to send data to an aborted connection. This will be caught when an attempt is made to write to a closed connection socket.

The tmpProducer function delivers from a temporary file. This is just a particular case of fileProducer. The textProducer function delivers from a TextFrag in memory. The length is obtained from the length of the text and the other Info fields must be supplied. Each fragment of the text is sent as a separate XferBytes message.

and textProducer frag abort einfo consumer =
let
    val len = TF.length TF.UseCrLf frag

    fun producer() =
    (
        CML.send(consumer, XferInfo(update_length einfo len));
        TF.apply TF.UseCrLf send frag;
        CML.send(consumer, XferDone)
    )

    and send str = CML.send(consumer, XferBytes(Byte.stringToBytes str))
in
    CML.spawn producer
end
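
For example (a sketch, assuming TF.L builds a fragment from a list of strings, as it is used elsewhere in the server), an error page entity could be assembled like this. The length field can be left as NONE since the producer computes it.

val page = TF.L ["<html><body>", "Not Found", "</body></html>"]

val err_entity = Entity {
            info = Info {etype    = SOME (simpleType "text" "html"),
                         encoding = NONE,
                         length   = NONE,
                         last_mod = NONE},
            body = textProducer page
            }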

The procProducer function delivers an entity from a pipe that is reading from a CGI script. This makes it a bit different from the other producers in that it can only work once. The Info for the entity is obtained from the headers returned by the CGI script. See the section called The CGI Node Handler for more details.

and procProducer (holder: ExecReader.Holder) abort einfo consumer =
let
    val opened as (proc, _) = ExecReader.get holder
    val (strm, _) = Unix.streamsOf proc

    fun producer() =
    (
        CML.send(consumer, XferInfo einfo);
        send_file();
        ExecReader.closeIt holder;
        ()
    )


    and send_file () =
    (
        (*  See send_file above
            PROBLEM: CML timeouts don't seem to interrupt the inputN
            operation.
        *)
        if Abort.aborted abort
        then
        (
            CML.send(consumer, XferAbort)
        )
        else
        let
            val chunk = TextIO.inputN(strm, pipe_chunk)
        in
            if chunk = ""
            then
            (
                CML.send(consumer, XferDone)
            )
            else
            (
                CML.send(consumer, XferBytes(Byte.stringToBytes chunk));
                send_file()
            )
        end
        handle x => (Log.logExn x; ())
    )
in
    CML.spawn producer
end

The producer function is straightforward enough: send the info, then send the file, then close the holder. Sending the file consists of a loop that reads chunks from the pipe and delivers them in XferBytes messages. As usual I need to check for an abort condition each time around.

The HTTPHeader Module

The HTTPHeader module handles the parsing of the header lines in messages. This isn't rocket science, just a lot of string handling, so I won't go through all of the code in detail. I'll just describe the overall layout.

Here is the type for a header.

datatype Header = 
        HdrDate of Date.date
    |   HdrPragma of string

    |   HdrAuthorization of Authorization
    |   HdrFrom of string
    |   HdrIfModified of Date.date
    |   HdrReferer of string
    |   HdrUserAgent of string

    |   HdrConEnc of Entity.Encoding(* content encoding *)
    |   HdrConLen of int            (* content length *)
    |   HdrConType of Entity.MType  (* mime type *)
    |   HdrLastModified of Date.date

    |   HdrChallenge of AuthChallenge

    (*  These can appear in CGI script output. *)
    |   HdrStatus of HTTPStatus.Status
    |   HdrLocation of URL.URL

    |   HdrExt of (string * string) (* extensions *)
    |   HdrBad of string            (* unparsable junk *)


and Authorization =
        AuthBasic of (string option * string)  (* user id and password *)

and AuthChallenge =
        ChallBasic of string    (* the realm *)

The well-known headers are separated out. Anything that isn't recognised is thrown into the extension category (HdrExt) and left as a pair of strings for the header name and value. The status header is included since it appears in the CGI protocol. (See the section called The CGI Node Handler). Any header that cannot be parsed is thrown into the bad category (HdrBad) for later error reporting.

Here is the header interface.

val readAllHeaders: (unit -> string option) -> Header list

val parseHeader:    string -> Header

val formatHeader: Header -> TextFrag.Text

(*  These functions retrieve well-known headers. *)

val getContentLength:   Header list -> int option
val getContentType:     Header list -> Entity.MType option
val getContentEnc:      Header list -> Entity.Encoding option
val getContentMod:      Header list -> Date.date option
val getAuth:            Header list -> Authorization option
val getStatus:          Header list -> HTTPStatus.Status option
val getLocation:        Header list -> URL.URL option

(*  This extracts the relevant headers to build the entity info
    record.
*)
val toEntityInfo:       Header list -> Entity.Info

(*  This overrides one set of headers with another. *)
val overrideHeaders:    Header list -> Header list -> Header list

(*  This excludes a set of headers.  The excluded set is
    demonstrated by sample headers in the first list.
*)
val excludeHeaders:     Header list -> Header list -> Header list

The readAllHeaders function reads and parses the entire header section of a message. It stops after the blank line that ends the header section. The argument is a function for reading lines as strings from a data source. The lines must have any trailing CR-LF trimmed off. The Connect.readLine function matches this requirement.

The readAllHeaders function uses parseHeader to parse each header. This function can be called separately. The formatHeader function restores a header to text form as a TextFrag.
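
As a sketch (not the server's code), a suitable reading function over a TextIO stream would trim the line terminator itself. This assumes the Basis version of TextIO.inputLine that returns a string option.

fun reader strm () =
(
    case TextIO.inputLine strm of
      NONE      => NONE
    | SOME line => SOME (String.translate
            (fn #"\r" => "" | #"\n" => "" | c => str c) line)
)

(*  val headers = readAllHeaders (reader strm)  *)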

Next come a group of utility functions which fetch particular headers from a list. The toEntityInfo function extracts those headers relevant to the contents of an entity and builds an Info value (see the section called The Entity Module). The override and exclude functions allow merging groups of headers. They aren't actually used anymore.

The readAllHeaders function has this general scheme.

fun readAllHeaders readLine : Header list =
let
... omitted material ...
    val lines = loop []
    val hdr_lines = merge lines [] []
    val headers = map parseHeader hdr_lines
in
    (* show_lines hdr_lines; *)
    headers
end

The loop function reads in all of the header lines into a list. The merge function merges continuation lines. If a line starts with white space then it is a continuation of the previous line. The leading white space of the continuation line is stripped off. Then each line is parsed.
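
The merge function itself is omitted above. A minimal sketch of the idea (illustration only, not the server's two-accumulator version, using the SS abbreviation for Substring as in the module) is:

fun merge_sketch []            = []
|   merge_sketch (first::rest) =
let
    fun strip s = SS.string (SS.dropl Char.isSpace (SS.all s))

    fun go prev []      = [prev]
    |   go prev (l::ls) =
        if size l > 0 andalso Char.isSpace (String.sub(l, 0))
        then go (prev ^ " " ^ strip l) ls       (* continuation line *)
        else prev :: (go l ls)
in
    go first rest
end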

The parseHeader function has more meat in it.

and parseHeader line : Header =
let

    val dispatch = [
        ("DATE",                parse_date HdrDate),
        ("PRAGMA",              parse_pragma),
        ("AUTHORIZATION",       parse_auth),
        ("FROM",                parse_from),
        ("IF-MODIFIED-SINCE",   parse_if_modified),
        ("REFERER",             parse_referer),
        ("USER-AGENT",          parse_useragent),
        ("CONTENT-ENCODING",    parse_cont_encoding),
        ("CONTENT-LENGTH",      parse_cont_length),
        ("CONTENT-TYPE",        parse_cont_type),
        ("LAST-MODIFIED",       parse_date HdrLastModified),
        ("WWW-AUTHENTICATE",    parse_challenge),
        ("LOCATION",            parse_location),
        ("STATUS",              parse_status)
        ]

    (*  The value has the leading and trailing white space removed. *)
    fun parse sstoken ssvalue =
    let
        val value = (SS.string(SS.dropl Char.isSpace
                                    (SS.dropr Char.isSpace ssvalue)))
        val token  = SS.string sstoken
        val utoken = upperCase token
    in
        case List.find (fn (n, _) => n = utoken) dispatch of
          NONE => HdrExt (token, value)

        | SOME (n, f) => f value
    end


    (*  The common characters are caught early for speed. *)
    fun is_token c = Char.isAlphaNum c orelse c = #"-" orelse
                     Char.contains "!#$%&'*+.^_`|~" c orelse
                     (ord c >= 128)


    val (name, rest) = SS.splitl is_token (SS.all line)
in
    (*  Expect a token, colon and more parts. *)
    if not (SS.isEmpty name) andalso SS.sub(rest, 0) = #":"
    then
        parse name (SS.triml 1 rest)
    else
        HdrBad line
end

At the bottom I first separate out the header name, which is a "token" in the IETF terminology. Splitting uses the Substring type to avoid copying parts of strings. The parse function converts the token to upper case and looks it up in the dispatch table. The dispatch functions are passed the value of the header line after the colon, with the leading and trailing white space stripped. These functions must return a Header value, possibly HdrBad.

The date parsing dispatch function is shared by the different headers. The header's constructor is passed as an argument to the function. Remember that a constructor in a datatype is equivalent to a function that constructs the type. See the parse_date function below.
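
That is, the constructor can be treated as an ordinary function value (an illustration):

val mk_date_hdr : Date.date -> Header = HdrDate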

Some header values consist of multiple tokens that need further parsing. This is described in more detail in the section called The IETF_Line and IETF_Part Modules. The result is a list of parts that are described by the following type in the IETF_Part module. (In the code I abbreviate IETF_Part to IP and IETF_Line to IETF.)

datatype Part =
        Token of string     (* including quoted strings *)
    |   TSpec of char
    |   TWh of string       (* the white space *)
    |   TBad  of char       (* invalid character *)
    |   TEOF

Here is a simple header parsing function for the Pragma header.

and parse_pragma value =
let
    val hparts = IETF.split value
in
    case strip_ws hparts of
      [IP.Token s] =>
    (
        if field_match s "no-cache"
        then
            HdrPragma "no-cache"
        else
            HdrBad value
    )

    | _ => HdrBad value
end

It splits the header value into parts and then checks that this results in exactly one token. The only token that is recognised is no-cache. The field_match function does case-insensitive matching of two strings.
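
The field_match function isn't shown here; a plausible sketch of case-insensitive matching (an assumption about its definition) is:

fun field_match_sketch a b =
    String.map Char.toUpper a = String.map Char.toUpper b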

The most complicated parsing function is for dates. There are three different date formats that are allowed in date headers. See the section called The Date Header in Chapter 8 for more details. Here is the top-level of the function.

and parse_date (constr: Date.date -> Header) value =
let
    val hparts = IETF.split value
... omitted material ...
in
    (* print "looking at the date parts ";
       IETF.dump parts; print "\n"; *)

    case strip_ws hparts of
      [IP.Token wkday,
       IP.TSpec #",",
       IP.Token day,
       IP.Token month,
       IP.Token year,
       IP.Token hh,
       IP.TSpec #":",
       IP.Token mm,
       IP.TSpec #":",
       IP.Token ss,
       IP.Token "GMT"] => build wkday day month year hh mm ss

    | [IP.Token wkday,
       IP.TSpec #",",
       IP.Token dmy,        (* hyphen isn't special *)
       IP.Token hh,
       IP.TSpec #":",
       IP.Token mm,
       IP.TSpec #":",
       IP.Token ss,
       IP.Token "GMT"] =>
        (
            case String.fields (fn c => c = #"-") dmy of
              [day, month, year] =>
                  build wkday day month ("19"^year) hh mm ss

            | _ => HdrBad value
        )

    | [IP.Token wkday,
       IP.Token month,
       IP.Token day,
       IP.Token hh,
       IP.TSpec #":",
       IP.Token mm,
       IP.TSpec #":",
       IP.Token ss,
       IP.Token year] => build wkday day month year hh mm ss

    | _ => HdrBad value
end

The date value is split into tokens and then a big case expression matches it against each of the date formats. The compiler will be able to optimise these cases to efficient code. The second format is trickier since the hyphen character is not considered a token separator. The day-month-year field must be split again into fields on the hyphen character. The build function assembles the field values into a Date.date value. This involves recognising month and weekday names. I'll skip describing that here.
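
As a flavour of what build must do, month recognition might look like this sketch (the real function is not shown, so the details are assumptions):

fun month_of "Jan" = SOME Date.Jan
|   month_of "Feb" = SOME Date.Feb
... omitted material ...
|   month_of "Dec" = SOME Date.Dec
|   month_of _     = NONE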

The formatHeader function converts each header back to text as fragments. This avoids all of the copying that would result from concatenating strings.

and formatHeader (HdrDate date) =
(
    format_date "Date: " date
)

|   formatHeader (HdrPragma pragma) =
(
    TF.L ["Pragma: ", IETF.quoteField pragma]
)
... omitted material ...

The quoteField function reintroduces quoting for special characters as described in the section called HTTP Requests in Chapter 8.

The remaining functions in this module are simple utility functions that need no further explanation.

The IETF_Line and IETF_Part Modules

The IETF_Line module contains the code for splitting a string into tokens and special characters according to the syntax in the section called HTTP Requests in Chapter 8. The result is a list of parts described by this type in the IETF_Part module.

datatype Part =
        Token of string     (* including quoted strings *)
    |   TSpec of char
    |   TWh of string       (* the white space *)
    |   TBad  of char       (* invalid character *)
    |   TEOF

To help recognise the tokens and special characters I've used a lexer generated by the ML-Lex utility (which is part of the SML/NJ distribution). ML-Lex is similar to the standard Unix lex utility for the C language. You provide a specification of regular expressions for the various parts you want to recognise and it builds a lexer for these expressions. Here is the body of the specification from the ietf.lex file.

%structure IETFLex
%full

ctl=[\000-\031\127];
ws=[\ \t];
tokn=[!#$%&'*+.0-9A-Z^_`a-z|~\h-];
str=[^\000-\031\127"];

%%

{ws}+           => (TWh yytext);
{tokn}+         => (Token yytext);
\"{str}*\"      => (fix_str yytext);

"("             => (TSpec (String.sub(yytext, 0)));
")"             => (TSpec (String.sub(yytext, 0)));
"<"             => (TSpec (String.sub(yytext, 0)));
">"             => (TSpec (String.sub(yytext, 0)));
"@"             => (TSpec (String.sub(yytext, 0)));
","             => (TSpec (String.sub(yytext, 0)));
";"             => (TSpec (String.sub(yytext, 0)));
":"             => (TSpec (String.sub(yytext, 0)));
"\\"            => (TSpec (String.sub(yytext, 0)));
"\""            => (TSpec (String.sub(yytext, 0)));
"/"             => (TSpec (String.sub(yytext, 0)));
"["             => (TSpec (String.sub(yytext, 0)));
"]"             => (TSpec (String.sub(yytext, 0)));
"?"             => (TSpec (String.sub(yytext, 0)));
"="             => (TSpec (String.sub(yytext, 0)));
"{"             => (TSpec (String.sub(yytext, 0)));
"}"             => (TSpec (String.sub(yytext, 0)));

.               => (TBad (String.sub(yytext, 0)));

The generated SML file will contain a structure named IETFLex. This contains these declarations (among others).

structure IETFLex=
struct
    structure UserDeclarations =
    struct
        open IETF_Part
        type lexresult = Part

        fun eof() = TEOF

        (*      Strip off the surrounding quotes. *)
        fun fix_str s = Token(String.substring(s, 1, size s - 2))
    end

    fun makeLexer yyinput = ...
    ...
end

The contents of the UserDeclarations structure are copied in from the top part of the ietf.lex file. The lexresult declaration is required. It gives the type of the part that is returned by the lexer. The right-hand side of a regular expression specification must be an expression of this type. As in C lex, a variable named yytext is available containing the matched string. The eof function is also required. It will be called at the end of the lexer's input.

The makeLexer function returns a lexer function that can be called successively to get each part. So the lexer function is imperative. The yyinput argument to makeLexer is a function that the lexer can call to fetch chunks of the input string. It takes an integer argument for the preferred chunk length, which you can ignore if you like. The end of the input is indicated when yyinput returns the empty string.

Here is the IETF_Line.split function that operates the lexer.

fun split str : IP.Part list =
let
    val done = ref false
    fun input n = if !done then "" else (done := true; str)

    val lexer = IETFLex.makeLexer input

    fun read toks =
    (
        case lexer() of
          IP.TEOF => rev toks
        | t       => read (t::toks)
    )
in
    read []
end

I pass the string to be split to the lexer in a single chunk. I have to arrange for the second call to the input function to return an empty string, which requires a kludge with a state variable. The imperative nature of the lexer tends to poison all of the code that interacts with it in this way. The read function is a simple loop that keeps getting parts from the lexer until the end-of-file part is found. A list of the parts is returned.

Complementing the split function is the join function. This converts a list of parts back into a string. At the moment it is only used by the quoteField function. (In an earlier version of the server I used the join function in more places.)

and join hparts =
let
    fun to_str []         rslt = concat(rev rslt)
    |   to_str [IP.TWh _] rslt = to_str [] rslt    (* trailing ws *)
    |   to_str ((IP.Token s1)::r) rslt = to_str r ((quote s1)::rslt)
    |   to_str ((IP.TWh s)::r)    rslt = to_str r (s :: rslt)
    |   to_str ((IP.TSpec c)::r)  rslt = to_str r ((str c) :: rslt)
    |   to_str ((IP.TBad  c)::r)  rslt = to_str r rslt
    |   to_str (IP.TEOF::r)       rslt = to_str r rslt


    and quote str =
    let
        (* If there are unsafe characters then right won't be empty.
        *)
        val (_, right) = SS.splitl safe (SS.all str)
    in
        if SS.isEmpty right
        then
            str
        else
            strip_dq str
    end


    and safe c = not (Char.isCntrl c orelse
                      Char.isSpace c orelse
                      Char.contains "()<>@,;:\\\"/[]?={}" c)

    and strip_dq str =
    let
        val fields = SS.fields (fn c => c = #"\"") (SS.all str)
    in
        concat("\"" :: ((map SS.string fields) @ ["\""]))
    end

in
    to_str hparts []
end

The to_str function is the main loop that converts each part to a string, building a list of strings. Then this list is concatenated. Trailing white space is deleted. The text of tokens is quoted if they contain unsafe characters. I use the Substring.splitl function as a simple way to search for a character that matches a predicate. If there are any unsafe characters then the whole token is enclosed in double quotes. Since the HTTP v1.0 specification does not allow double quote characters inside quoted strings I just delete them, for want of a better solution. There shouldn't be any of them inside tokens anyway.

Now I can implement the quoteField function as just a split followed by a join.

and quoteField field = join(split field)
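
For example, a bare token passes through unchanged while a value that arrived as a quoted string is re-quoted on the way out (my reading of the code above):

val p = quoteField "no-cache"       (* unchanged *)
val q = quoteField "\"a b\""        (* the space is unsafe, so the
                                       surrounding quotes are restored *)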

The HTTPStatus Module

This module provides a simple abstraction for status codes. The codes are classified by severity and protocol version. The text description of the code can be generated.

There's not much to say about this. Each code is made into an exported value. The type is abstract.

datatype Severity =
    Info | Success | Redirect | ClientError | ServerError

type Status

val OK:             Status          (* 200 *)
val Created:        Status          (* 201 *)
val Accepted:       Status          (* 202 *)
val NoContent:      Status          (* 204 *)
... omitted material ...
val formatStatus:   Status -> string

val severity:       Status -> Severity
val code:           Status -> int

val isV11:          Status -> bool
val same:           Status -> Status -> bool

val fromInt:        int -> Status

(*  This tests if the response needs a body according to the
    status code.  See section 7.2 of RFC1945.
*)
val needsBody:      Status -> bool
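
A small usage sketch (the behaviour on each code is my reading of the interface):

val not_found = HTTPStatus.fromInt 404

val sev  = HTTPStatus.severity  not_found   (* ClientError *)
val body = HTTPStatus.needsBody not_found   (* true: a 404 response
                                               carries an error body *)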

The HTTPMsg Module

This module defines the Request and Response types that pass between the HTTP protocol section and the resource store. For more information see the section called Requests and Responses in Chapter 8.