The web has certainly become one of the most ubiquitous and important publishing spaces around.
What defines the web, technically, are HTML and HTTP: HTML is the publishing format and HTTP is the
transport protocol.
Let's consider HTTP. The protocol is simple -- some would say simplistic -- and it certainly has its share of detractors.
While it may not be the most sophisticated protocol around, it gets the job done.
It's a client/server protocol and, like many internet standards, is text-based. The client sends an HTTP request and the server
sends back a reply. There are only a few commands in the repertoire: 'GET', 'PUT', 'POST', and a few others.
See RFC 2616 for all the details.
The bytes sent across the network are composed of a header and the data (content). The header is nothing more than
a few lines of simple text. The first line contains the command; the remaining lines contain 'key: value' pairs.
If either the server or the client doesn't understand a particular key, it ignores it -- this leaves quite a bit of leeway for fun.
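
Those 'key: value' lines are trivial to pick apart. Here's a minimal sketch (a hypothetical helper, not ufetch's actual code) that splits one header line at the colon:

#include <stdio.h>
#include <string.h>

/* Split a "Key: value" header line in place.
   A hypothetical helper -- not ufetch's actual code. */
static int split_header_line(char *line, char **key, char **value)
{
    char *colon = strchr(line, ':');
    if (colon == NULL)
        return -1;              /* not a key/value line */
    *colon = '\0';
    *key = line;
    *value = colon + 1;
    while (**value == ' ')      /* skip space after the colon */
        (*value)++;
    return 0;
}

int main(void)
{
    char line[] = "Content-Type: text/html";
    char *key, *value;

    if (split_header_line(line, &key, &value) == 0)
        printf("key='%s'  value='%s'\n", key, value);
    return 0;
}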
Normally, you never see these headers. As a counter-example, consider email. Most email clients let you view the
message headers if you want, which makes the mail protocol much more approachable. But most web clients,
such as web browsers, never let you see this stuff. Too bad -- it can be interesting and informative.
I wrote a simple web client for just that purpose. Called ufetch, it's a command line utility that fetches
data from web servers. For example, in a Terminal window type:

ufetch www.bebits.com

This will download the home page of BeBits and put it in a file called f.data. As it runs, it spits out
various status info to the screen.
You may be familiar with a similar Unix utility called wget. wget is actually more powerful,
as it will download from ftp servers as well. But ufetch is simpler, both in its design and its source code.
I think this makes it a spiffy tool for learning about various details of the HTTP protocol and web client/server communication.
ufetch was inspired by an old BeOS Newsletter article by Benoit Schillings called Mining the Net.
Benoit created a sample C++ program called site_getter for fetching URL resources.
I took the code, converted it to C, removed stuff I didn't need, added other stuff, tweaked, coddled, and massaged
the code to my heart's content. It is so completely modified that I don't think there's one line of Benoit's
original code remaining in ufetch. But it certainly was inspired by his work and his comments.
It's really not very hard to implement a web client. The simple text format of the headers makes them trivial to deal with.
Most of the work in ufetch involves establishing connections to the web servers and sending/receiving data.
Even this, however, is pretty simple because the sockets interface handles all the low-level grunge. If you are a member
of the OpenBeOS networking team, then you have the task of implementing the sockets interface. But as a network programmer,
you needn't be concerned with the details and only need to know how to use the sockets themselves.
The sockets interface was originally designed by Unix programmers at Berkeley, which is why sockets are often referred
to as "Berkeley sockets". This interface has been ported to other platforms such as Windows, often with many changes
and alterations. The BeOS sockets interface is very close to the BSD model, but varies slightly (most notably in that
sockets are not true file descriptors).
The semantics of socket operations are similar to file operations. You create a socket and then bind or connect it to a
network address (similar to 'open' for a file). While connected, you can send and receive data
(like 'read' and 'write' for files). When finished, you close the socket. You are required to know the IP address
of a remote socket in order to connect, but there are database functions for determining the IP address when given
a host name.
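
In outline, the whole conversation looks something like this -- a bare-bones sketch using the standard BSD calls, with error handling omitted (and note that on BeOS the final close differs, since sockets there are not file descriptors):

#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Bare-bones socket lifecycle; error handling omitted. */
void talk_to_server(void)
{
    struct hostent *host = gethostbyname("www.bebits.com");
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);     /* like open()  */
    char buffer[4096];

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    memcpy(&addr.sin_addr, host->h_addr_list[0], host->h_length);

    connect(s, (struct sockaddr *)&addr, sizeof(addr));

    send(s, "HEAD / HTTP/1.0\r\n\r\n", 19, 0);   /* like write() */
    recv(s, buffer, sizeof(buffer), 0);          /* like read()  */

    close(s);   /* like close(); on BeOS this would be closesocket() */
}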
Walking through an example
OK, let's see how this works in practice. Consider the sample command line:
ufetch www.bebits.com
First, the URL is split into (protocol, host, port, resource). There is no "http://" in the URL, so 'http' will be
assumed for the protocol. The host is 'www.bebits.com'. No port is specified, so it defaults to the standard web port 80.
No resource is specified either, so it defaults to '/'.
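
A hypothetical sketch of that splitting logic, handling just the simple cases described:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Hypothetical sketch of the URL splitting described above;
   it only handles the simple cases ufetch cares about. */
static void split_url(const char *url, char *host, int *port, char *resource)
{
    const char *p = url;
    const char *slash;
    char *colon;

    if (strncmp(p, "http://", 7) == 0)  /* protocol defaults to http */
        p += 7;

    *port = 80;                         /* default web port */
    strcpy(resource, "/");              /* default resource */

    slash = strchr(p, '/');
    if (slash == NULL) {
        strcpy(host, p);
    } else {
        strncpy(host, p, slash - p);
        host[slash - p] = '\0';
        strcpy(resource, slash);
    }

    colon = strchr(host, ':');          /* explicit host:port form */
    if (colon != NULL) {
        *colon = '\0';
        *port = atoi(colon + 1);
    }
}

int main(void)
{
    char host[256], resource[256];
    int port;

    split_url("www.bebits.com", host, &port, resource);
    printf("host=%s port=%d resource=%s\n", host, port, resource);
    return 0;
}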
Next, the IP address of host 'www.bebits.com' is looked up using the standard network function gethostbyname().
In this case, it returns 28.245.212.78 (in dotted-decimal format).
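
The lookup is a single library call plus a conversion to dotted form (a minimal, self-contained sketch):

#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct hostent *host;
    struct in_addr addr;

    host = gethostbyname("www.bebits.com");
    if (host == NULL) {
        fprintf(stderr, "host lookup failed\n");
        return 1;
    }
    memcpy(&addr, host->h_addr_list[0], sizeof(addr));
    printf("%s\n", inet_ntoa(addr));    /* dotted-decimal form */
    return 0;
}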
A socket is created using the socket() function. Then the socket is connected to the web server with
the connect() function using port=80 and IP=28.245.212.78. If it's unable to connect, an error message
is sent to the screen and the program exits.
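
Those two steps, with the bail-out-on-error behavior just described, might look like this (a sketch; the function name is mine, not ufetch's):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Create a socket and connect to port 80 at the given dotted
   address, bailing out on failure. Function name is illustrative. */
static int connect_to(const char *dotted_ip)
{
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0) {
        fprintf(stderr, "could not create socket\n");
        exit(1);
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    addr.sin_addr.s_addr = inet_addr(dotted_ip);

    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        fprintf(stderr, "could not connect to %s\n", dotted_ip);
        exit(1);
    }
    return s;
}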
The following request header is generated:
GET / HTTP/1.1
Host: www.bebits.com
User-Agent: ufetch
Accept: */*
Connection: close
The first line is the request line: 'GET' is the command and '/' is the requested resource, followed by the
HTTP version. The remaining lines are standard header tags. The 'Accept: */*' line says to accept anything (ufetch
is not picky).
The 'Connection: close' tag was added after some real world testing: HTTP 1.1 supports persistent connections (unlike 1.0),
so you need this tag to avoid a delay in terminating the connection.
This request header is sent using the send() function.
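
In code, generating and sending that request is little more than assembling one string (a sketch; the function and its parameters are illustrative):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Assemble and send the request shown above. 'sock', 'host' and
   'resource' come from the earlier steps; names are illustrative. */
static void send_request(int sock, const char *host, const char *resource)
{
    char request[1024];

    snprintf(request, sizeof(request),
             "GET %s HTTP/1.1\r\n"
             "Host: %s\r\n"
             "User-Agent: ufetch\r\n"
             "Accept: */*\r\n"
             "Connection: close\r\n"
             "\r\n",                    /* blank line ends the header */
             resource, host);
    send(sock, request, strlen(request), 0);
}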
Then recv() is called to receive the reply.
A block of memory is allocated to hold the incoming data. The reply header will be the first part of this data,
followed by the data bytes for the resource. It's difficult (impossible?) to know exactly how big the header will
be, but the end of the header is always easy to find -- the first blank line in the incoming stream marks it.
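
One way to code the receive loop and the header hunt (a sketch under the same assumptions as the earlier snippets, not ufetch's actual code):

#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Read the whole reply into one growing buffer, then locate the
   blank line (CRLF CRLF) separating header from content. */
static char *read_reply(int sock, long *total_len, long *header_len)
{
    long size = 64 * 1024, used = 0, n;
    char *buf = malloc(size);
    char *blank;

    while ((n = recv(sock, buf + used, size - used - 1, 0)) > 0) {
        used += n;
        if (size - used < 4096) {       /* grow when nearly full */
            size *= 2;
            buf = realloc(buf, size);
        }
    }
    buf[used] = '\0';
    *total_len = used;

    blank = strstr(buf, "\r\n\r\n");    /* first blank line */
    *header_len = (blank != NULL) ? (blank - buf) + 4 : used;
    return buf;
}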
Here's the header received from the BeBits server for this example:
HTTP/1.1 200 OK
Date: Fri, 14 Dec 2001 19:29:07 GMT
Server: Apache/1.3.9 (Unix) TARP/0.42-alpha PHP/4.0.4pl1 secured_by_Raven/1.4.2
X-Powered-By: PHP/4.0.4pl1
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
A web client might want to parse this header into all the different tags and make use of the info. For the most
part, ufetch doesn't bother. The one exception is for redirects. Often a particular URL simply redirects to another
URL. In this case, the response code is 301 or 302 and there's a tag called 'Location:' that identifies the URL to
redirect to. This is the only header tag that ufetch cares about.
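
Spotting a redirect amounts to reading the response code off the first line and scanning for the 'Location:' tag. A hypothetical sketch:

#include <stdio.h>
#include <string.h>

/* Detect a 301/302 reply and copy out the 'Location:' value.
   Hypothetical helper; 'header' is the NUL-terminated reply header. */
static int find_redirect(const char *header, char *location, size_t max)
{
    int code = 0;
    const char *tag;
    size_t i = 0;

    sscanf(header, "HTTP/%*d.%*d %d", &code);
    if (code != 301 && code != 302)
        return 0;                       /* not a redirect */

    tag = strstr(header, "Location:");
    if (tag == NULL)
        return 0;
    tag += 9;                           /* skip "Location:" */
    while (*tag == ' ')
        tag++;

    while (i < max - 1 && tag[i] != '\r' && tag[i] != '\n' && tag[i] != '\0') {
        location[i] = tag[i];           /* copy up to end of line */
        i++;
    }
    location[i] = '\0';
    return 1;
}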
The header is sent to the screen so that it can be viewed along with the status info. The data bytes, however, are
written out to a file called f.data. This makes it easy to find, but it also means that each run of the program
will clobber the output of the previous run.
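
Saving the content is then just a matter of skipping past the header bytes. A sketch that pairs with the receive loop above (opening with "w" truncates f.data, which is what clobbers the previous run):

#include <stdio.h>

/* Write everything past the header to f.data. Opening with "w"
   truncates the file, so each run replaces the last one's output. */
static void save_content(const char *buf, long total_len, long header_len)
{
    FILE *f = fopen("f.data", "w");

    if (f != NULL) {
        fwrite(buf + header_len, 1, total_len - header_len, f);
        fclose(f);
    }
}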
ufetch is certainly limited in what it can do. But it's loads of fun to use. You can see all kinds of interesting
info being returned by web servers. Try it and see how many web sites are sending cookies you didn't know about. Or snoop on
just what software the server is running. It might even be useful as a way to debug connections to certain troublesome servers.
Expanding ufetch in any number of ways would not take too much effort. Have fun.
Source Code:
ufetch.zip