The most rewarding thing for me as a contributor to the Be Developer Newsletter is the chance to explore applications and technologies that aren't on any schedule or project master plan. It's an opportunity to write code just for the fun of it, and share the knowledge and experience with others. One long-time interest of mine is Internet search engines and Internet robots. I thought it would be cool to scan the Web to retrieve particular items and then process the returns using my own rules and conditions. I know there are services out there to do just that, but I wanted something that I could control, so for this article I built a simple Internet search engine.
I'd like to thank Benoît Schillings for his article Be Engineering Insights: Mining The Net... (Part 1 of N). In it, Benoît lays out a "site_getter" class and goes into detail about the Internet search process. I also want to thank James Clark for his awesome SGML parser and normalizer, NSGMLS. These tools are definitely worth checking out.
Currently, most people use large-scale search engines such as Lycos, Yahoo, Alta Vista, and others. These have huge databases that hold Web links and other information that users can access. Robots—also called spiders—roam the Web and feed the local databases, trying to keep an up-to-date representation of the Web. I wanted to call my project "the embryonic beginnings of a Web 'bot," but I can't, because as you'll see below, my sample engine is really a meta-search engine.
Meta-search engines query primary search engines (like those named above) to gather query results. They maintain no local database of their own and often do pre- and post-processing on the query and the returned results. A good example of a meta-search engine is MetaCrawler (http://www.metacrawler.com/); it can span multiple search engines and claims to provide users with a more consistent interface. Another popular meta-search engine is Sherlock (http://www.apple.com/sherlock/). Sherlock ships with Mac OS 8.5, and replaces the system Find application as a generalized search program. In addition to local searches, it does Web-based searches.
Both Sherlock and MetaCrawler can span multiple search engines, but unlike MetaCrawler, Sherlock processes the query results locally through configurable plug-ins. Well heck, if all a meta-search engine needs to do is contact the main search engines, and have them return the goods, that seems easy enough—so it's time to start coding.
In the sample below, a TCP connection is made to http://www.altavista.com, server port 80 (HTTP). I send a GET request with the path of the CGI script to execute, and the parameters to pass to the script. The main parameter of interest is the search keyword, in this case "primus"—a locally popular rock band. I end the request with HTTP/1.0 and a blank line to terminate the request. I receive back from the server an HTML page with a list of hits matching "primus." The page is sent to standard out.
#include <stdio.h>
#include <socket.h>
#include <netdb.h>
#include <string.h>
#include <iostream.h>

class mysearch
{
public:
    mysearch();
};

mysearch::mysearch()
{
    int sd;
    long count;
    char buf[BUFSIZ + 1];

    // Resolve host name to IP address
    struct in_addr **pptr;
    struct hostent *hp;
    hp = gethostbyname("www.altavista.com");
    pptr = (struct in_addr **)hp->h_addr_list;

    // Server-side socket address
    struct sockaddr_in dest;
    memset(&dest, 0, sizeof(dest));
    dest.sin_family = AF_INET;
    memcpy(&dest.sin_addr, *pptr, sizeof(struct in_addr));
    dest.sin_port = htons(80);

    sd = socket(AF_INET, SOCK_STREAM, 0);
    connect(sd, (sockaddr *)&dest, sizeof(dest));

    // Blank line terminates HTTP requests
    sprintf(buf, "GET /cgi-bin/query?pg=q&kl=XX&q=primus&search=Search HTTP/1.0\n\n");
    send(sd, buf, strlen(buf), 0);

    // NULL-terminate each received chunk before printing it
    while ((count = recv(sd, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[count] = '\0';
        cout << buf;
    }
}

int main()
{
    mysearch mysrch;
    exit(EXIT_SUCCESS);
}
Why do it this way? Since searching and filtering tasks are quite disparate in nature, I've put the search mechanism in one binary and the query parser in another. I hope this makes the basic operations clearer and easier to follow. I use standard in and out to communicate between the binaries. I haven't tried to make this robust or of production quality, as the code serves only to demonstrate my topic. For the search engine, much of the code is just getting a client connection set up, and sending and receiving HTTP data. The calls gethostbyname(), socket(), connect(), send(), and recv() are all part of the BSD sockets API and are similar across socket-based TCP/IP clients, so I hope they're not overly distracting.
The above sample goes out to Alta Vista and looks for "primus." However, it's easy to query other search engines as well. Just change the host name, the CGI path, and the parameter string. For example:
To search using Alta Vista:
Host Name: www.altavista.com
Parameters: /cgi-bin/query?pg=q&kl=XX&q=primus&search=Search
To search using Excite:
Host Name: search.excite.com
Parameters: /search.gw?search=beegees
To search using Yahoo:
Host Name: search.yahoo.com
Parameters: /bin/search?search=beegees
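To make the host/parameter swapping concrete, here's a minimal sketch of the same socket code parameterized by host name and request string. do_search() is a hypothetical helper I'm using for illustration, with error checking omitted just as in the sample above.

#include <stdio.h>
#include <socket.h>
#include <netdb.h>
#include <string.h>
#include <iostream.h>

// Hypothetical helper: the same client setup as mysearch, with the
// host name and CGI request string passed in as arguments.
void do_search(const char *host, const char *request)
{
    long count;
    char buf[BUFSIZ + 1];

    struct hostent *hp = gethostbyname(host);
    struct in_addr **pptr = (struct in_addr **)hp->h_addr_list;

    struct sockaddr_in dest;
    memset(&dest, 0, sizeof(dest));
    dest.sin_family = AF_INET;
    memcpy(&dest.sin_addr, *pptr, sizeof(struct in_addr));
    dest.sin_port = htons(80);

    int sd = socket(AF_INET, SOCK_STREAM, 0);
    connect(sd, (sockaddr *)&dest, sizeof(dest));

    // Blank line terminates the HTTP request
    sprintf(buf, "GET %s HTTP/1.0\n\n", request);
    send(sd, buf, strlen(buf), 0);

    while ((count = recv(sd, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[count] = '\0';
        cout << buf;
    }
}

// For example:
//   do_search("search.excite.com", "/search.gw?search=beegees");
//   do_search("search.yahoo.com", "/bin/search?search=beegees");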
Notice that I changed the Excite and Yahoo queries to search for "beegees" rather than "primus." To query other search engines I haven't listed, such as Infoseek or Lycos, go to the site with your browser and enter your search keyword, then look at the URL that is sent to the Web server (in the location field at the top of the browser window). It should be easy to see what the host name and the CGI parameters are.
Be aware that there are several ways to send parameters to Web servers; if the simple method described above doesn't work, a deeper understanding of CGI and HTTP may be required. Dumping the HTML page as text is also a great way to see what's going on. Another way to understand how CGI and HTTP work is to set up a Web server such as Apache and write some scripts and forms. I recommend doing this before you use the samples and blast away at the main search services.
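For instance, some forms submit their parameters in a POST body rather than on the GET line. Here's a minimal sketch of building such a request; build_post_request() is a hypothetical helper of my own, not part of the samples above.

#include <stdio.h>
#include <string.h>

// Hypothetical variant of the request code in mysearch: send the CGI
// parameters in a POST body instead of on the GET line. (HTTP header
// lines formally end in \r\n; servers of this era generally accept
// bare \n, as the GET sample above assumes.)
int build_post_request(char *request, const char *path, const char *params)
{
    return sprintf(request,
                   "POST %s HTTP/1.0\n"
                   "Content-Type: application/x-www-form-urlencoded\n"
                   "Content-Length: %ld\n"
                   "\n"
                   "%s", path, (long)strlen(params), params);
}

// Usage, in place of the sprintf()/send() pair in mysearch:
//   build_post_request(buf, "/cgi-bin/query",
//                      "pg=q&kl=XX&q=primus&search=Search");
//   send(sd, buf, strlen(buf), 0);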
If the query was successful, an HTML page listing the search results is returned. To confirm that things are working, you can save the page to a file and then open it in the browser. It should appear just as if the browser had launched the query.
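For example, assuming the search binary is called mysearch (as it is below):

$ mysearch > results.html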
Now we're at the heart of things, and it's time to get at the goods! The problem with this phase is that it's very application specific; what I'm after may not be what you're after. Also, each search site has its own results format, which may even change from query to query. Sherlock uses pattern matching on the returned HTML document to get at items of interest. OK, but we can do better. I use James Clark's NSGMLS application to transform the document to a simpler and more rigid format called ESIS (Element Structure Information Set). The ESIS format makes getting at HTML elements easy. NSGMLS is a command-line utility that uses James's SP library to parse SGML, as well as HTML and XML. James provides nice HTML documentation and sources for his parser utilities at http://www.jclark.com/sp.html. The port to BeOS was trivial. For example, the following HTML document is transformed to ESIS using NSGMLS:
<html>
<body>
Your rock candy
</body>
</html>
(HTML
(BODY
-Your rock candy
)BODY
)HTML

In ESIS every element is on a separate line, and the first character on the line is known as the command. The command can be used to decide how to parse the line. In this simple sample that's no big deal, but the results returned from the search engines are more complex.
Now we're getting somewhere. If we run the search engine and pipe the output to NSGMLS, the ESIS syntax will be sent to standard out. However, it's the whole page, and I just want to get at specific elements within it. Time to slap down a bit more code.
#include <stdio.h>
#include <iostream.h>
#include <string.h>
#include <stdlib.h>

class myfilter
{
public:
    myfilter();
    friend istream &operator >>(istream &s, myfilter &myfltr);
private:
    int GetNextFieldStr(char *instring, char *outstring);
};

myfilter::myfilter()
{
}

istream &operator >>(istream &s, myfilter &myfltr)
{
    char buf[BUFSIZ + 1];
    char cmpbuf[BUFSIZ + 1];

    while (s) {
        s.getline(buf, sizeof(buf));

        // Parse the lines using ESIS syntax: 'A' marks an attribute line
        if (buf[0] == 'A') {
            myfltr.GetNextFieldStr(buf, cmpbuf);
            if (strcmp(cmpbuf, "AHREF") != 0)
                continue;
            myfltr.GetNextFieldStr(buf, cmpbuf);
            if (strcmp(cmpbuf, "CDATA") != 0)
                continue;
            myfltr.GetNextFieldStr(buf, cmpbuf);
            if (strncmp(cmpbuf, "http", 4) != 0)
                continue;
            // Found something of interest
            cout << buf << endl;
        }
    }
    return s;
}

// Copy the first space-delimited field of instring into outstring and
// shift the remainder to the front of instring. Returns 0 when no
// separator is left (the whole string is the last field), 1 otherwise.
int myfilter::GetNextFieldStr(char *instring, char *outstring)
{
    char *ptr;
    char *ptr1;
    char *ptr2;

    ptr = instring;
    if ((ptr1 = strchr(ptr, ' ')) == NULL) {
        strcpy(outstring, instring);
        return (0);
    }
    *ptr1 = '\0';
    ptr2 = ++ptr1;

    // Back up over any stray spaces before the separator (with a
    // single-space separator this loop exits immediately)
    for (int n = BUFSIZ; n >= 0; n--) {
        ptr1--;
        if (*ptr1 == ' ')
            *ptr1 = '\0';
        else
            break;
    }

    strcpy(outstring, ptr);
    // The regions overlap, so use memmove rather than strcpy
    memmove(instring, ptr2, strlen(ptr2) + 1);
    return (1);
}

int main()
{
    myfilter myfltr;

    // Overloaded extraction operator: read ESIS from standard in
    cin >> myfltr;
    exit(EXIT_SUCCESS);
}
The sample above goes through the page and grabs URLs beginning with http. The URLs are located using the "AHREF CDATA" syntax of ESIS. So if the search engine binary is called mysearch, and the filter above is called myfilter, then after running the following, only URLs starting with http will be sent to standard out.
$ mysearch | nsgmls | myfilter
Looking at the URLs that come out, I know that further filtering is needed to really nail down the "primus" search. This should be easy: I'll just look at the ESIS output and add some more parsing code to the filter operation.
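As a sketch of what that extra parsing might look like, the fragment below rejects URLs that point back at the search service rather than at real hits. The host strings here are hypothetical stand-ins for whatever the actual ESIS output reveals.

#include <string.h>

// Hypothetical refinement for myfilter: reject URLs that point back
// at the search service itself rather than at genuine query hits.
// The hosts listed are assumptions; inspect the ESIS output to see
// what actually needs to be weeded out.
static int interesting_url(const char *url)
{
    static const char *noise[] = {
        "http://www.altavista.com",
        "http://ad.doubleclick.net",
        NULL
    };

    for (int i = 0; noise[i] != NULL; i++) {
        if (strncmp(url, noise[i], strlen(noise[i])) == 0)
            return 0;
    }
    return 1;
}

// In operator >>, call interesting_url(cmpbuf) on the extracted URL
// before printing the line.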
I hope you can see from these samples that there's a world of possibilities to explore. Once URL links are returned, they can be further filtered to really customize the results of the query. Also, I go after only http URLs, but within the page returned from the main search engines there's often hit-rating information and other goodies. I'd like to turn the command line utilities into real applications, and search the different engines in parallel using threads, and cross-reference the results. It would also be great to plug the meta-search engine into Be's Find application, and turn the results into NetPositive Bookmarks. I didn't make it quite that far in this article, but there's always next time.
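I didn't write the threaded version, but a minimal sketch of parallel queries under the BeOS Kernel Kit might look like this; query_engine() and the engine table are hypothetical stand-ins for the search code above.

#include <OS.h>

// Hypothetical thread body: query one engine and filter its results,
// along the lines of the mysearch/myfilter code above.
static int32 query_engine(void *data)
{
    const char *host = (const char *)data;
    // ... connect to host, send the query, filter the results ...
    return 0;
}

static const char *engines[] =
    { "www.altavista.com", "search.excite.com", "search.yahoo.com" };

void search_all_engines()
{
    thread_id threads[3];
    status_t result;

    // Launch one query thread per engine, then wait for them all;
    // the collected results could then be cross-referenced.
    for (int i = 0; i < 3; i++) {
        threads[i] = spawn_thread(query_engine, "query",
                                  B_NORMAL_PRIORITY, (void *)engines[i]);
        resume_thread(threads[i]);
    }
    for (int i = 0; i < 3; i++)
        wait_for_thread(threads[i], &result);
}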
I recently implemented a client for the "Distributed Fish Protocol" (DFP). The protocol defines a common environment for transferring simple data objects ("fish") between machines ("tanks"). The protocol was designed as a demonstration of some operating systems created by members of the ACM chapter at UIUC. The entire system is extremely small; the whole thing can be implemented using only broadcast UDP packets. Because UDP is somewhat unreliable, steps are taken to make sure that fish are not lost. For relevant RFCs, check out
http://www.acm.uiuc.edu/sigops/eoh98/dfp/fish.html
To handle incoming packets, my DFP client implementation defines the "FishPortal" class. A FishPortal object binds and listens to a UDP port (in a separate thread). When it gets a Fish Protocol Packet, the packet is repackaged as a BMessage, which is sent to and handled by the client's BWindow object.
There are two types of messages that need to be handled: The "Tank" message lets a tank locate its neighbors, and the "Transfer" message lets a tank transfer a fish to a neighboring tank.
A tank lives in a 256x256 tank array called "TankSpace." To see if its neighbors are "alive," a tank broadcasts a PING message to its nearest neighbors in TankSpace (top, bottom, left, right), and then listens for a responsive PONG. In my DFP client, I use the new BMessageRunner class to regularly broadcast the PING message. Before every PING, my FishPortal object marks all four neighbors as dead, and then marks them alive if they respond. (A more complex system could be a bit smarter about tracking the identity and states of the other tanks.)
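Here's roughly what setting up that heartbeat looks like; the message constant and StartHeartbeat() are hypothetical names for illustration, not the actual ones from my client.

#include <Message.h>
#include <MessageRunner.h>
#include <Messenger.h>
#include <Window.h>

// Hypothetical message code for the periodic PING trigger.
const uint32 MSG_PING_NEIGHBORS = 'ping';

BMessageRunner *StartHeartbeat(BWindow *window)
{
    BMessage ping(MSG_PING_NEIGHBORS);

    // Deliver a copy of the trigger to the window every two seconds
    // (the interval is in microseconds). The window's MessageReceived()
    // then marks all four neighbors dead and broadcasts PING packets.
    return new BMessageRunner(BMessenger(window), &ping, 2000000);
}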
Whenever the status of a neighboring tank changes, a BMessage is sent so that the client can update any indicators it uses. My client displays green and red dots to indicate the states of the other tanks. Still another neat thing for a client to do would be to recognize when it is in the same space as another tank and move itself to another location to avoid tank collisions.
When a fish swims to the edge of a tank, the client checks to make sure the neighboring tank in that direction is alive, and, if it is, it sends the tank a SEND_FISH message that includes information about the relevant fish. The receiver responds with an ACK_FISH to acknowledge that it will take the fish, or a NACK_FISH to refuse. The fish supplier responds with a final SYNC_FISH packet, which formally passes responsibility for the fish to the receiving tank.
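A sketch of dispatching on those packet types might look like the following; the packet structure and the enum values are assumptions for illustration, not the actual DFP wire format (see the spec at the URL above).

#include <SupportDefs.h>

// Hypothetical packet layout and type codes; the real wire format
// is defined in the DFP documents.
enum { SEND_FISH, ACK_FISH, NACK_FISH, SYNC_FISH };

struct FishPacket {
    int32 type;
    // ... fish position, direction, and so on ...
};

void HandleTransferPacket(const FishPacket *packet)
{
    switch (packet->type) {
        case SEND_FISH:
            // A neighbor offers us a fish: answer ACK_FISH if we can
            // take it, NACK_FISH to refuse.
            break;
        case ACK_FISH:
        case NACK_FISH:
            // Answers to our own SEND_FISH: on ACK, send SYNC_FISH and
            // release the fish; on NACK, throw it back in our tank.
            break;
        case SYNC_FISH:
            // The sender has committed: the fish is now ours to display.
            break;
    }
}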
If a neighboring tank isn't alive, the fish is thrown back in the same tank and swims off in a new direction. Either way, the client need only send a message to the FishPortal, which takes care of all the details of passing the fish along or throwing it back in the tank.
The FishPortal class also handles a list of "pending" fish, since fish spend some time in limbo while the fish-passing messages are being handled. Currently, there's no provision for other clients denying reception (through a NACK_FISH), or not responding. A smarter client would include a time-out system to make sure that pending fish don't get lost between the display layer and the network.
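Such a time-out might look something like this; the PendingFish record and ReapPendingFish() are made-up stand-ins for whatever bookkeeping FishPortal actually keeps.

#include <OS.h>

// Hypothetical pending-fish record: when the hand-off started and
// whether the fish is still in limbo.
struct PendingFish {
    bigtime_t sent_at;    // system_time() when SEND_FISH went out
    bool      in_limbo;
};

const bigtime_t FISH_TIMEOUT = 5000000;  // five seconds, in microseconds

// Called periodically: any fish that never got an answer within the
// time-out is thrown back into the local tank instead of being lost.
void ReapPendingFish(PendingFish *pending, int count)
{
    bigtime_t now = system_time();

    for (int i = 0; i < count; i++) {
        if (pending[i].in_limbo && now - pending[i].sent_at > FISH_TIMEOUT) {
            pending[i].in_limbo = false;
            // ... hand the fish back to the display layer here ...
        }
    }
}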
Note that I didn't spend much time fine-tuning the display. As a result, my fish swim backwards and flicker.
I know of one bug in the code, and I'm sure I will get lots of mail pointing out the ones that got away. If anyone out there has any interesting additions or suggestions (or even new clients), I'm always interested.
The supplied client defaults to a tank located at (3,3). To specify a different location, use the command-line arguments -x <int> and -y <int>. For example, if you have two machines on the same network, just starting a client on one machine with the default location and the other with the line "Tank -x 4 -y 3" should put them next to each other in TankSpace.
Of all the functions the BeOS provides to programmers, one of the least understood and most widely debated is is_computer_on(). Documentation has been sketchy, and developers have debated the best use for this function. This week, we'll investigate is_computer_on() in depth.
Some developers claim that this function always returns 1, and say that it's a sham. They say their tests have produced significant empirical evidence to this effect. Others offer circumstantial evidence, such as "programs can't run when the computer's off," which doesn't really explain anything.
This is a lot like trying to figure out if the light stays on in the refrigerator when you close the door. How can you be sure that is_computer_on() isn't returning some other value when the computer's off?
Let's look at a test program:
#include <OS.h>
#include <stdio.h>

int main(void)
{
    int i;

    printf("#     Result\n");
    printf("----- ------\n");
    for (i = 1; i <= 25; i++) {
        printf("%-5d %ld\n", i, is_computer_on());
    }
    return 0;
}
If you run this program, you get the following output:
#     Result
----- ------
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
(and so on)
Now try running the program and shutting off your computer while it's running. Unfortunately, printf() doesn't work reliably in this case; you don't see your output, because the computer's video hardware becomes unreliable when turned off. You might try customizing this code to write the output to disk, but unfortunately, hard drives also become unreliable when power isn't applied to them throughout their operation. Networking and serial ports likewise have a high failure rate when power is turned off.
In other words, the empirical evidence is a little flimsy. Due to failures beyond Be's control, test programs such as the one above can't present their full results. I'm not saying that is_computer_on() returns some other value when the computer isn't running, but this is a prime example of the difficulty encountered when testing a theory through code. However, Be guarantees that is_computer_on() will always return 1 if the computer is running.
As is often the case, this column is a response to recurring discussions I've had or witnessed, in person or electronically. Today's topic is what type of developers we're looking for—a subject that appears to have been re-energized by recent announcements around Release 4 and Comdex.
When major, "brand name" developers such as MGI or Steinberg announce their support of the BeOS platform, what does it mean for emerging developers who lack the resources and market muscle of more well-established companies? Are we abandoning the little guy and cozying up to the establishment?
I'll say at the outset that these questions address very real, permanent issues. The need to give critical mass to the platform is one issue; the threat of established developers muscling smaller ones out of the ecosystem is another; and Be's role, if any, in maintaining balance in the system yet another.
Meeting the first need—giving critical mass to the platform—causes trouble and ambivalence. On one hand, seeing a major player on the nascent platform is reassuring, since it legitimizes it and strengthens the self-fulfilling prophecy: the platform will take because respectable, established people believe it will. On the other hand, these established companies may become ogres that devour fledgling developers. This is a real issue, but it's only part of the story. The establishment may be powerful, but it is also conservative. It rarely shows the most innovation and is often muscle-bound.
A new platform is like a new musical instrument. You can transpose existing music for it, or write new music that exploits its unique expressive power. The saxophone, for example, solved two problems: it provided better coupling and impedance between the reed and the air mass; and the mechanics offered better, faster fingering and a wider range of notes. You could play Bach fugues on a saxophone (rather nicely) or explore the then-emerging genre of jazz.
Likewise, with a new operating system, you can write a PhotoShop clone, or take a more innovative approach to image creation and processing. If this sounds like "all you need to do to lose weight is eat less and exercise more," my apologies. So instead of offering simplistic advice, perhaps I should address what we can do. For one thing, we're already working to make BeDepot an effective, inexpensive distribution channel for Be developers. We still have much work to do, but our intent is clear: this is equally open to all, this is a level playing field.
And then there is what we shouldn't do. I still remember the day in 1985 when Bill Gates threatened to stop developing applications for the Macintosh if a certain look-and-feel license wasn't renewed. Apple's executive staff met and decided not to go along. One dinner later, the license was renewed. We can only speculate what would have happened if other developers had been allowed to take the room on the Mac platform that Microsoft now occupies. The lesson in this story for Be is that we know we shouldn't make decisions that limit innovation and tie down the platform.