The most rewarding thing for me as a contributor to the Be Developer Newsletter is the chance to explore applications and technologies that aren't on any schedule or project master plan. It's an opportunity to write code just for the fun of it, and share the knowledge and experience with others. One long-time interest of mine is Internet search engines and Internet robots. I thought it would be cool to scan the Web to retrieve particular items and then process the returns using my own rules and conditions. I know there are services out there to do just that, but I wanted something that I could control, so for this article I built a simple Internet search engine.
I'd like to thank Benoît Schillings for his article Be Engineering Insights: Mining The Net... (Part 1 of N). In it, Benoît lays out a "site_getter" class and goes into detail about the Internet search process. I also want to thank James Clark for his awesome SGML parser and normalizer, NSGMLS. These tools are definitely worth checking out.
Currently, most people use large-scale search engines such as Lycos, Yahoo, Alta Vista, and others. These have huge databases that hold Web links and other information that users can access. Robots—also called spiders—roam the Web and feed the local databases, trying to keep an up-to-date representation of the Web. I wanted to call my project "the embryonic beginnings of a Web 'bot," but I can't, because as you'll see below, my sample engine is really a meta-search engine.
Meta-search engines query primary search engines (like those named above) to gather query results. They maintain no local database of their own and often do pre- and post-processing on the query and the returned results. A good example of a meta-search engine is MetaCrawler (http://www.metacrawler.com/); it can span multiple search engines and claims to provide users with a more consistent interface. Another popular meta-search engine is Sherlock (http://www.apple.com/sherlock/). Sherlock ships with Mac OS 8.5, and replaces the system Find application as a generalized search program. In addition to local searches, it does Web-based searches.
Both Sherlock and MetaCrawler can span multiple search engines, but unlike MetaCrawler, Sherlock processes the query results locally through configurable plug-ins. Well heck, if all a meta-search engine needs to do is contact the main search engines, and have them return the goods, that seems easy enough—so it's time to start coding.
In the sample below, a TCP connection is made to http://www.altavista.com, server port 80 (HTTP). I send a GET request with the path of the CGI script to execute, and the parameters to pass to the script. The main parameter of interest is the search keyword, in this case "primus"—a locally popular rock band. I end the request with HTTP/1.0 and a blank line to terminate the request. I receive back from the server an HTML page with a list of hits matching "primus." The page is sent to standard out.
#include <stdio.h>
#include <socket.h>
#include <netdb.h>
#include <string.h>
#include <iostream.h>

class mysearch
{
public:
    mysearch();
};

mysearch::mysearch()
{
    int sd;
    long count;
    char buf[BUFSIZ + 1];

    // Resolve host name to IP address
    struct in_addr **pptr;
    struct hostent *hp;
    hp = gethostbyname("www.altavista.com");
    pptr = (struct in_addr **)hp->h_addr_list;

    // Server-side socket address
    struct sockaddr_in dest;
    memset(&dest, 0, sizeof(dest));
    dest.sin_family = AF_INET;
    memcpy(&dest.sin_addr, *pptr, sizeof(struct in_addr));
    dest.sin_port = htons(80);

    sd = socket(AF_INET, SOCK_STREAM, 0);
    connect(sd, (sockaddr *)&dest, sizeof(dest));

    // Blank line terminates HTTP requests
    sprintf(buf, "GET /cgi-bin/query?pg=q&kl=XX&q=primus&search=Search HTTP/1.0\n\n");
    send(sd, buf, strlen(buf), 0);

    // NULL-terminate each received chunk before printing it
    while ((count = recv(sd, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[count] = '\0';
        cout << buf;
    }
}

int main()
{
    mysearch mysrch;
    exit(EXIT_SUCCESS);
}
Why do it this way? Since searching and filtering tasks are quite disparate in nature, I've put the search mechanism in one binary and the query parser in another. I hope this makes the basic operations clearer and easier to follow. I use standard in and out to communicate between the binaries. I haven't tried to make this robust or of production quality, as the code serves only to demonstrate my topic. For the search engine, much of the code is just getting a client connection set up, and sending and receiving HTTP data. The calls gethostbyname(), socket(), connect(), send(), and recv() are all part of the BSD sockets API and are similar across socket-based TCP/IP clients, so I hope they're not overly distracting.
The above sample goes out to Alta Vista and looks for "primus." However, it's easy to query other search engines as well. Just change the host name, the CGI path, and the parameter string. For example:
To search using Alta Vista:
Host Name: www.altavista.com
Parameters: /cgi-bin/query?pg=q&kl=XX&q=primus&search=Search
To search using Excite:
Host Name: search.excite.com
Parameters: /search.gw?search=beegees
To search using Yahoo:
Host Name: search.yahoo.com
Parameters: /bin/search?search=beegees
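To make the host/parameter swapping concrete, here's a minimal sketch of the same socket code parameterized by host name and request string. do_search() is a hypothetical helper I'm using for illustration, with error checking omitted just as in the sample above.

#include <stdio.h>
#include <socket.h>
#include <netdb.h>
#include <string.h>
#include <iostream.h>

// Hypothetical helper: the same client setup as mysearch, with the
// host name and CGI request string passed in as arguments.
void do_search(const char *host, const char *request)
{
    long count;
    char buf[BUFSIZ + 1];

    struct hostent *hp = gethostbyname(host);
    struct in_addr **pptr = (struct in_addr **)hp->h_addr_list;

    struct sockaddr_in dest;
    memset(&dest, 0, sizeof(dest));
    dest.sin_family = AF_INET;
    memcpy(&dest.sin_addr, *pptr, sizeof(struct in_addr));
    dest.sin_port = htons(80);

    int sd = socket(AF_INET, SOCK_STREAM, 0);
    connect(sd, (sockaddr *)&dest, sizeof(dest));

    // Blank line terminates the HTTP request
    sprintf(buf, "GET %s HTTP/1.0\n\n", request);
    send(sd, buf, strlen(buf), 0);

    while ((count = recv(sd, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[count] = '\0';
        cout << buf;
    }
}

// For example:
//   do_search("search.excite.com", "/search.gw?search=beegees");
//   do_search("search.yahoo.com", "/bin/search?search=beegees");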
Notice that I changed the Excite and Yahoo queries to search for "beegees" rather than "primus." To query other search engines I haven't listed, such as Infoseek or Lycos, go to the site with your browser and enter your search keyword, then look at the URL that is sent to the Web server (in the location field at the top of the browser window). It should be easy to see what the host name and the CGI parameters are.
Be aware that there are several ways to send parameters to Web servers; if the simple method described above doesn't work, a deeper understanding of CGI and HTTP may be required. Dumping the HTML page as text is also a great way to see what's going on. Another way to understand how CGI and HTTP work is to set up a Web server such as Apache and write some scripts and forms. I recommend doing this before you use the samples and blast away at the main search services.
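For instance, some forms submit their parameters in a POST body rather than on the GET line. Here's a minimal sketch of building such a request; build_post_request() is a hypothetical helper of my own, not part of the samples above.

#include <stdio.h>
#include <string.h>

// Hypothetical variant of the request code in mysearch: send the CGI
// parameters in a POST body instead of on the GET line. (HTTP header
// lines formally end in \r\n; servers of this era generally accept
// bare \n, as the GET sample above assumes.)
int build_post_request(char *request, const char *path, const char *params)
{
    return sprintf(request,
                   "POST %s HTTP/1.0\n"
                   "Content-Type: application/x-www-form-urlencoded\n"
                   "Content-Length: %ld\n"
                   "\n"
                   "%s", path, (long)strlen(params), params);
}

// Usage, in place of the sprintf()/send() pair in mysearch:
//   build_post_request(buf, "/cgi-bin/query",
//                      "pg=q&kl=XX&q=primus&search=Search");
//   send(sd, buf, strlen(buf), 0);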
If the query was successful, an HTML page listing the search results is returned. To confirm that things are working, you can save the page to a file and then open it in the browser. It should appear just as if the browser had launched the query.
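For example, assuming the search binary is called mysearch (as it is below):

$ mysearch > results.html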
Now we're at the heart of things, and it's time to get at the goods! The problem with this phase is that it's very application specific; what I'm after may not be what you're after. Also, each search site has its own results format, which may even change from query to query. Sherlock uses pattern matching on the returned HTML document to get at items of interest. OK, but we can do better. I use James Clark's NSGMLS application to transform the document to a simpler and more rigid format called ESIS (Element Structure Information Set). The ESIS format makes getting at HTML elements easy. NSGMLS is a command-line utility that uses James's SP library to parse SGML, as well as HTML and XML. James provides nice HTML documentation and sources for his parser utilities at http://www.jclark.com/sp.html. The port to BeOS was trivial. For example, the following HTML document is transformed to ESIS using NSGMLS:
<html>
<body>
Your rock candy
</body>
</html>
(HTML
(BODY
-Your rock candy
)BODY
)HTML

In ESIS every element is on a separate line, and the first character on the line is known as the command. The command can be used to decide how to parse the line. In this simple sample that's no big deal, but the results returned from the search engines are more complex.
Now we're getting somewhere. If we run the search engine and pipe the output to NSGMLS, the ESIS syntax will be sent to standard out. However, it's the whole page, and I just want to get at specific elements within it. Time to slap down a bit more code.
#include <stdio.h>
#include <iostream.h>
#include <string.h>
#include <stdlib.h>

class myfilter
{
public:
    myfilter();
    friend istream &operator >>(istream &s, myfilter &myfltr);
private:
    int GetNextFieldStr(char *instring, char *outstring);
};

myfilter::myfilter()
{
}

istream &operator >>(istream &s, myfilter &myfltr)
{
    char buf[BUFSIZ + 1];
    char cmpbuf[BUFSIZ + 1];

    while (s) {
        s.getline(buf, sizeof(buf));

        // Parse the lines using ESIS syntax: 'A' marks an attribute line
        if (buf[0] == 'A') {
            myfltr.GetNextFieldStr(buf, cmpbuf);
            if (strcmp(cmpbuf, "AHREF") != 0)
                continue;
            myfltr.GetNextFieldStr(buf, cmpbuf);
            if (strcmp(cmpbuf, "CDATA") != 0)
                continue;
            myfltr.GetNextFieldStr(buf, cmpbuf);
            if (strncmp(cmpbuf, "http", 4) != 0)
                continue;
            // Found something of interest
            cout << buf << endl;
        }
    }
    return s;
}

// Copy the first space-delimited field of instring into outstring and
// shift the remainder to the front of instring. Returns 0 when no
// separator is left (the whole string is the last field), 1 otherwise.
int myfilter::GetNextFieldStr(char *instring, char *outstring)
{
    char *ptr;
    char *ptr1;
    char *ptr2;

    ptr = instring;
    if ((ptr1 = strchr(ptr, ' ')) == NULL) {
        strcpy(outstring, instring);
        return (0);
    }
    *ptr1 = '\0';
    ptr2 = ++ptr1;

    // Back up over any stray spaces before the separator (with a
    // single-space separator this loop exits immediately)
    for (int n = BUFSIZ; n >= 0; n--) {
        ptr1--;
        if (*ptr1 == ' ')
            *ptr1 = '\0';
        else
            break;
    }

    strcpy(outstring, ptr);
    // The regions overlap, so use memmove rather than strcpy
    memmove(instring, ptr2, strlen(ptr2) + 1);
    return (1);
}

int main()
{
    myfilter myfltr;

    // Overloaded extraction operator: read ESIS from standard in
    cin >> myfltr;
    exit(EXIT_SUCCESS);
}
The sample above goes through the page and grabs URLs beginning with http. The URLs are located using the "AHREF CDATA" syntax of ESIS. So if the search engine binary is called mysearch, and the filter above is called myfilter, then after running the following, only URLs starting with http will be sent to standard out.
$ mysearch | nsgmls | myfilter
Looking at the URLs that come out, I know that further filtering is needed to really nail down the "primus" search. This should be easy: I'll just look at the ESIS output and add some more parsing code to the filter operation.
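As a sketch of what that extra parsing might look like, the fragment below rejects URLs that point back at the search service rather than at real hits. The host strings here are hypothetical stand-ins for whatever the actual ESIS output reveals.

#include <string.h>

// Hypothetical refinement for myfilter: reject URLs that point back
// at the search service itself rather than at genuine query hits.
// The hosts listed are assumptions; inspect the ESIS output to see
// what actually needs to be weeded out.
static int interesting_url(const char *url)
{
    static const char *noise[] = {
        "http://www.altavista.com",
        "http://ad.doubleclick.net",
        NULL
    };

    for (int i = 0; noise[i] != NULL; i++) {
        if (strncmp(url, noise[i], strlen(noise[i])) == 0)
            return 0;
    }
    return 1;
}

// In operator >>, call interesting_url(cmpbuf) on the extracted URL
// before printing the line.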
I hope you can see from these samples that there's a world of possibilities to explore. Once URL links are returned, they can be further filtered to really customize the results of the query. Also, I go after only http URLs, but within the page returned from the main search engines there's often hit-rating information and other goodies. I'd like to turn the command line utilities into real applications, and search the different engines in parallel using threads, and cross-reference the results. It would also be great to plug the meta-search engine into Be's Find application, and turn the results into NetPositive Bookmarks. I didn't make it quite that far in this article, but there's always next time.
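I didn't write the threaded version, but a minimal sketch of parallel queries under the BeOS Kernel Kit might look like this; query_engine() and the engine table are hypothetical stand-ins for the search code above.

#include <OS.h>

// Hypothetical thread body: query one engine and filter its results,
// along the lines of the mysearch/myfilter code above.
static int32 query_engine(void *data)
{
    const char *host = (const char *)data;
    // ... connect to host, send the query, filter the results ...
    return 0;
}

static const char *engines[] =
    { "www.altavista.com", "search.excite.com", "search.yahoo.com" };

void search_all_engines()
{
    thread_id threads[3];
    status_t result;

    // Launch one query thread per engine, then wait for them all;
    // the collected results could then be cross-referenced.
    for (int i = 0; i < 3; i++) {
        threads[i] = spawn_thread(query_engine, "query",
                                  B_NORMAL_PRIORITY, (void *)engines[i]);
        resume_thread(threads[i]);
    }
    for (int i = 0; i < 3; i++)
        wait_for_thread(threads[i], &result);
}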
I recently implemented a client for the "Distributed Fish Protocol" (DFP). The protocol defines a common environment for transferring simple data objects ("fish") between machines ("tanks"). The protocol was designed as a demonstration of some operating systems created by members of the ACM chapter at UIUC. The entire system is extremely small; the whole thing can be implemented using only broadcast UDP packets. Because UDP is somewhat unreliable, steps are taken to make sure that fish are not lost. For relevant RFCs, check out
http://www.acm.uiuc.edu/sigops/eoh98/dfp/fish.html
To handle incoming packets, my DFP client implementation defines the "FishPortal" class. A FishPortal object binds and listens to a UDP port (in a separate thread). When it gets a Fish Protocol Packet, the packet is repackaged as a BMessage, which is sent to and handled by the client's BWindow object.
There are two types of messages that need to be handled: The "Tank" message lets a tank locate its neighbors, and the "Transfer" message lets a tank transfer a fish to a neighboring tank.
A tank lives in a 256x256 tank array called "TankSpace." To see if its neighbors are "alive," a tank broadcasts a PING message to its nearest neighbors in TankSpace (top, bottom, left, right), and then listens for a responsive PONG. In my DFP client, I use the new BMessageRunner class to regularly broadcast the PING message. Before every PING, my FishPortal object marks all four neighbors as dead, and then marks them alive if they respond. (A more complex system could be a bit smarter about tracking the identity and states of the other tanks.)
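Here's roughly what setting up that heartbeat looks like; the message constant and StartHeartbeat() are hypothetical names for illustration, not the actual ones from my client.

#include <Message.h>
#include <MessageRunner.h>
#include <Messenger.h>
#include <Window.h>

// Hypothetical message code for the periodic PING trigger.
const uint32 MSG_PING_NEIGHBORS = 'ping';

BMessageRunner *StartHeartbeat(BWindow *window)
{
    BMessage ping(MSG_PING_NEIGHBORS);

    // Deliver a copy of the trigger to the window every two seconds
    // (the interval is in microseconds). The window's MessageReceived()
    // then marks all four neighbors dead and broadcasts PING packets.
    return new BMessageRunner(BMessenger(window), &ping, 2000000);
}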
Whenever the status of a neighboring tank changes, a BMessage is sent so that the client can update any indicators it uses. My client displays green and red dots to indicate the states of the other tanks. Still another neat thing for a client to do would be to recognize when it is in the same space as another tank and move itself to another location to avoid tank collisions.
When a fish swims to the edge of a tank, the client checks to make sure the neighboring tank in that direction is alive, and, if it is, it sends the tank a SEND_FISH message that includes information about the relevant fish. The receiver responds with an ACK_FISH to acknowledge that it will take the fish, or a NACK_FISH to refuse. The fish supplier responds with a final SYNC_FISH packet, which formally passes responsibility for the fish to the receiving tank.
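A sketch of dispatching on those packet types might look like the following; the packet structure and the enum values are assumptions for illustration, not the actual DFP wire format (see the spec at the URL above).

#include <SupportDefs.h>

// Hypothetical packet layout and type codes; the real wire format
// is defined in the DFP documents.
enum { SEND_FISH, ACK_FISH, NACK_FISH, SYNC_FISH };

struct FishPacket {
    int32 type;
    // ... fish position, direction, and so on ...
};

void HandleTransferPacket(const FishPacket *packet)
{
    switch (packet->type) {
        case SEND_FISH:
            // A neighbor offers us a fish: answer ACK_FISH if we can
            // take it, NACK_FISH to refuse.
            break;
        case ACK_FISH:
        case NACK_FISH:
            // Answers to our own SEND_FISH: on ACK, send SYNC_FISH and
            // release the fish; on NACK, throw it back in our tank.
            break;
        case SYNC_FISH:
            // The sender has committed: the fish is now ours to display.
            break;
    }
}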
If a neighboring tank isn't alive, the fish is thrown back in the same tank and swims off in a new direction. Either way, the client need only send a message to the FishPortal, which takes care of all the details of passing the fish along or throwing it back in the tank.
The FishPortal class also handles a list of "pending" fish, since fish spend some time in limbo while the fish-passing messages are being handled. Currently, there's no provision for other clients denying reception (through a NACK_FISH), or not responding. A smarter client would include a time-out system to make sure that pending fish don't get lost between the display layer and the network.
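Such a time-out might look something like this; the PendingFish record and ReapPendingFish() are made-up stand-ins for whatever bookkeeping FishPortal actually keeps.

#include <OS.h>

// Hypothetical pending-fish record: when the hand-off started and
// whether the fish is still in limbo.
struct PendingFish {
    bigtime_t sent_at;    // system_time() when SEND_FISH went out
    bool      in_limbo;
};

const bigtime_t FISH_TIMEOUT = 5000000;  // five seconds, in microseconds

// Called periodically: any fish that never got an answer within the
// time-out is thrown back into the local tank instead of being lost.
void ReapPendingFish(PendingFish *pending, int count)
{
    bigtime_t now = system_time();

    for (int i = 0; i < count; i++) {
        if (pending[i].in_limbo && now - pending[i].sent_at > FISH_TIMEOUT) {
            pending[i].in_limbo = false;
            // ... hand the fish back to the display layer here ...
        }
    }
}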
Note that I didn't spend much time fine-tuning the display. As a result, my fish swim backwards and flicker.
I know of one bug in the code, and I'm sure I will get lots of mail pointing out the ones that got away. If anyone out there has any interesting additions or suggestions (or even new clients), I'm always interested.
The supplied client defaults to a tank located at (3,3). To specify a different location, use the command-line arguments -x <int> and -y <int>. For example, if you have two machines on the same network, just starting a client on one machine with the default location and the other with the line "Tank -x 4 -y 3" should put them next to each other in TankSpace.
Of all the functions the BeOS provides to programmers, one of the least understood and most widely debated is is_computer_on(). Documentation has been sketchy, and developers have debated the best use for this function. This week, we'll investigate is_computer_on() in depth.
Some developers claim that this function always returns 1, and say that it's a sham. They say their tests have produced significant empirical evidence to this effect. Others offer circumstantial evidence, such as "programs can't run when the computer's off," which doesn't really explain anything.
This is a lot like trying to figure out if the light stays on in the refrigerator when you close the door. How can you be sure that is_computer_on() isn't returning some other value when the computer's off?
Let's look at a test program:
#include <OS.h>
#include <stdio.h>

int main(void)
{
    int i;

    printf("#     Result\n");
    printf("----- ------\n");
    for (i = 1; i <= 25; i++) {
        printf("%-5d %ld\n", i, is_computer_on());
    }
    return 0;
}
If you run this program, you get the following output:
#     Result
----- ------
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
(and so on)
Now try running the program and shutting off your computer while it's running. Unfortunately, printf() doesn't work reliably in this case; you don't see your output, because the computer's video hardware becomes unreliable when turned off. You might try customizing this code to write the output to disk, but unfortunately, hard drives also become unreliable when power isn't applied to them throughout their operation. Networking and serial ports likewise have a high failure rate when power is turned off.
In other words, the empirical evidence is a little flimsy. Due to failures beyond Be's control, test programs such as the one above can't present their full results. I'm not saying that is_computer_on() returns some other value when the computer isn't running, but this is a prime example of the difficulty encountered when testing a theory through code. However, Be guarantees that is_computer_on() will always return 1 if the computer is running.
As is often the case, this column is a response to recurring discussions I've had or witnessed, in person or electronically. Today's topic is what type of developers we're looking for—a subject that appears to have been re-energized by recent announcements around Release 4 and Comdex.
When major, "brand name" developers such as MGI or Steinberg announce their support of the BeOS platform, what does it mean for emerging developers who lack the resources and market muscle of more well-established companies? Are we abandoning the little guy and cozying up to the establishment?
I'll say at the outset that these questions address very real, permanent issues. The need to give critical mass to the platform is one issue; the threat of established developers muscling smaller ones out of the ecosystem is another; and Be's role, if any, in maintaining balance in the system yet another.
Meeting the first need—giving critical mass to the platform—causes trouble and ambivalence. On one hand, seeing a major player on the nascent platform is reassuring, since it legitimizes it and strengthens the self-fulfilling prophecy: the platform will take because respectable, established people believe it will. On the other hand, these established companies may become ogres that devour fledgling developers. This is a real issue, but it's only part of the story. The establishment may be powerful, but it is also conservative. It rarely shows the most innovation and is often muscle-bound.
A new platform is like a new musical instrument. You can transpose existing music for it, or write new music that exploits its unique expressive power. The saxophone, for example, solved two problems: it provided better coupling and impedance between the reed and the air mass; and the mechanics offered better, faster fingering and a wider range of notes. You could play Bach fugues on a saxophone (rather nicely) or explore the then-emerging genre of jazz.
Likewise, with a new operating system, you can write a PhotoShop clone, or take a more innovative approach to image creation and processing. If this sounds like "all you need to do to lose weight is eat less and exercise more," my apologies. So instead of offering simplistic advice, perhaps I should address what we can do. For one thing, we're already working to make BeDepot an effective, inexpensive distribution channel for Be developers. We still have much work to do, but our intent is clear: this is equally open to all, this is a level playing field.
And then there is what we shouldn't do. I still remember the day in 1985 when Bill Gates threatened to stop developing applications for the Macintosh if a certain look-and-feel license wasn't renewed. Apple's executive staff met and decided not to go along. One dinner later, the license was renewed. We can only speculate what would have happened if other developers had been allowed to take the room on the Mac platform that Microsoft now occupies. The lesson in this story for Be is that we know we shouldn't make decisions that limit innovation and tie down the platform.