[wordup] How the Wayback Machine Works
Adam Shand
adam at personaltelco.net
Thu Jan 24 19:06:28 EST 2002
This is kinda cool. Brewster is also the founder of SFLan
(www.sflan.com), one of the first community networks on the west
coast.
adam.
From: http://www.oreillynet.com/lpt/a//webservices/2002/01/18/brewster.html
How the Wayback Machine Works
by Richard Koman
01/21/2002
The Internet Archive made headlines back in November with the release of
the Wayback Machine, a Web interface to the Archive's five-year,
100-terabyte collection of Web pages. The archive is the result of the
efforts of its director, Brewster Kahle, to capture the ephemeral pages
of the Web and store them in a publicly accessible library. In addition
to the millions of other Web pages you can find in the Wayback Machine,
it has direct pointers to some of the pioneer sites from the early days
of the Web, including the NCSA What's New page, The Trojan Room Coffee
Pot, and Feed magazine.
How big is 100 terabytes? Kahle, who serves as archive director and
president of Alexa Internet, a wholly-owned subsidiary of Amazon.com,
says it's about five times as large as the Library of Congress, with its
20 million books.
"What we have on the Web is phenomenal," Kahle says. "There are more
than 10 million people's voices evidenced on the Web. It's the people's
medium, the opportunity for people to publish about anything -- the
great, the noble, the absolute picayune, and the profane."
The existence of such an archive suggests all kinds of possibilities for
research and scholarship, but in Kahle's vision, all of the streams of
research commingle into a single purpose: "The idea is to build a
library of everything, and the opportunity is to build a great library
that offers universal access to all of human knowledge. That may sound
laughable, but I'd suggest that the Internet is going exactly in that
direction, so if we shoot directly for it, we should be able to get to
universal access to human knowledge."
If the goal sounds lofty, the Wayback Machine itself may be the crudest
imaginable tool for data-mining a 100-terabyte database. At the
Archive's Web site, simply enter a URL and the Wayback Machine gives you
a list of dates for which the site is available.
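For example, a query for all captures of a site, and then for one dated
capture, uses URLs of this form (the timestamp here is illustrative,
not a real capture date):

    http://web.archive.org/web/*/http://www.webreview.com/
    http://web.archive.org/web/19961201000000/http://www.webreview.com/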
Clicking on an old site is like time travel. I visited a December 1996
issue of Web Review (webreview.com) and found a cover story on
"Christmas Cookies " an article dismissing privacy concerns about the
new-fangled Web technology. A report from Internet World featured the
hottest and most promising technology of the day: Push.
But that report, and the other articles I looked at in the Wayback
Machine, were truncated; links to subsequent pages and many graphics
were missing. Kahle concedes the Web interface does not show the full
glory of the archive, but he says it wasn't meant to. "This is a
browsing interface, a wow-isn't-this-cool interface ... It's a first
step, but it's technically rather interesting because it's such a huge
collection."
While the Wayback Machine has received plenty of press, we were
interested in going deeper into the technical workings of this audacious
project. We sat down with Kahle (who previously worked at the late
supercomputer maker Thinking Machines and founded WAIS, Inc.) at the
Archive's offices in San Francisco's Presidio.
Consider the hardware: a computer system with close to 400 parallel
processors, 100 terabytes of disk space, hundreds of gigs of RAM, all
for under a half-million dollars. As you'll read in this interview,
the folks at the Archive have turned clusters of PCs into a single
parallel computer running the biggest database in existence -- and wrote
their own operating system, P2, which allows programmers with no
expertise in parallel systems to program the system.
Richard Koman: So how much stuff do you have here?
Brewster Kahle: In the Wayback Machine, currently there are 10 billion
Web pages, collected over five years. That amounts to 100 terabytes,
which is 100 million megabytes. So if a book is a megabyte, which is
about what it is, and the Library of Congress has 20 million books,
that's 20 terabytes. This is 100 terabytes. At that size, this is the
largest database ever built. It's larger than Walmart's, American
Express', the IRS. It's the largest database ever built. And it's
receiving queries -- because every page request when people are surfing
around is a query to this database -- at the rate of 200 queries per
second. It's a fairly fast database engine. And it's built on commodity
PCs, so we can do this cost-effectively. It's just using clusters of
Linux machines and FreeBSD machines.
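Spelled out, the arithmetic behind Kahle's comparison (the average page
size at the end is derived from his figures, not stated in the
interview) is:

\[
20{,}000{,}000 \text{ books} \times 1\ \mathrm{MB} = 20\ \mathrm{TB},
\qquad
\frac{100\ \mathrm{TB}}{20\ \mathrm{TB}} = 5\times \text{the Library of Congress},
\qquad
\frac{100\ \mathrm{TB}}{10^{10}\ \text{pages}} \approx 10\ \mathrm{KB/page}.
\]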
Koman: How many machines?
Kahle: Three hundred, we may be up to 400 machines now. When we first
came out, we didn't architect it for the load we wound up with, so we
had to throw another 20 to 30 machines at serving the index.
Koman: You just throw more PCs at the problem?
Kahle: You can build amazing systems out of these bricks that cost only
a couple hundred dollars each, and you just throw more bricks at the
problem to give it more computer power, more RAM, more disk, more
network bandwidth, whatever it is you need. So we build massive database
systems by striping the index over tens of machines. And it's a very
cost-effective system.
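As a rough sketch of what striping an index over many machines can look
like (the host list and hashing scheme here are illustrative
assumptions, not the Archive's actual code):

    #!/usr/bin/perl
    # Sketch of striping: hash each URL to one of N index machines.
    use strict;
    use warnings;

    # Illustrative host list; the real cluster striped over dozens of boxes.
    my @index_hosts = map { "index$_.archive.example" } 0 .. 11;

    # A simple, stable string hash (djb2); any such hash spreads URLs evenly.
    sub stripe_for {
        my ($url) = @_;
        my $h = 5381;
        $h = (($h * 33) + ord($_)) % 2**32 for split //, $url;
        return $index_hosts[ $h % @index_hosts ];
    }

    my $url = "http://www.webreview.com/";
    print "lookup for $url goes to ", stripe_for($url), "\n";

Because the hash is stable, every front-end machine computes the same
stripe for a given URL without any central lookup table.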
Koman: What kind of performance do you get?
Kahle: We're getting exceptional performance. Basically, building a
100 TB database costs -- in hardware -- less than $400,000, including all
the network equipment, all the redundancy, all the backup systems. We've
had to do it based on necessity, because there's not a lot of money in
the library trade. Where the Library of Congress has a budget of $450
million a year, you can be sure we don't.
Koman: How does it work technically?
Kahle: How the archive works is just with stacks and stacks of computers
running Solaris on x86, FreeBSD, and Linux, all of which have serious
flaws, so we need to use different operating systems for different
functions. The crawling machines are running Solaris; there are a dozen
of them, possibly more.
Koman: What are the crawlers written in?
Kahle: Combinations of C and Perl. Almost everything we can, we do in
Perl -- for ease of portability, maintainability, flexibility. Because
there's so much horsepower we don't really require a tight system. The
crawlers record pages into 100MB files in a standard archive file
format, and then store it on one of the storage machines. Those are just
normal PCs with four IDE hard drives, and it just writes along until
it's filled up and then it goes to the next one. It goes through a
couple of these machines a day: hundreds of gigabytes a day. The total
gathering speed when everything is moving is about 10 terabytes a month,
or half a Library of Congress a month.
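A minimal sketch of that write-until-full pattern (the file names,
the 100 MB cap, and the record framing are assumptions for
illustration; the Archive used its own archive file format):

    #!/usr/bin/perl
    # Sketch: append fetched pages to an archive file, rolling over to a
    # fresh file once it reaches ~100 MB, the way the crawlers write.
    use strict;
    use warnings;

    my $MAX_BYTES = 100 * 1024 * 1024;    # 100 MB per archive file
    my ($fh, $written, $seq) = (undef, 0, 0);

    sub open_next_file {
        close $fh if $fh;
        $seq++;
        open $fh, '>', sprintf("crawl-%05d.arc", $seq) or die "open: $!";
        $written = 0;
    }

    sub store_page {
        my ($url, $body) = @_;
        open_next_file() if !$fh or $written + length($body) > $MAX_BYTES;
        # One simple length-prefixed record per page (illustrative framing,
        # not the Archive's actual format).
        print {$fh} "$url ", length($body), "\n$body\n";
        $written += length($body);
    }

    store_page("http://example.com/", "<html>hello</html>");
    close $fh;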
Then they're indexed onto another set of machines -- there's a whole
hierarchical indexing structure for the Wayback Machine, and that is
kept up to date on an hourly basis. So when people come to the Wayback
Machine, there's a load balancer that goes and distributes those queries
to 12 or 20 machines that operate the front end, and those query another
dozen or so machines that hold a striped version of the index, and that
index allows queries to answer which pages are available for any
particular URL. So if you were to click on one of those pages, it goes
back to that index machine, finds out where it is among the hundreds of
machines, retrieves that document, changes the links in it so that they
point back into the archive, and then hands it back to the user. And it does
that at a couple hundred per second.
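The link rewriting Kahle mentions might look roughly like this (a naive
regex sketch under assumed inputs; a real rewriter must handle far more
cases):

    #!/usr/bin/perl
    # Sketch: rewrite absolute links in an archived page so they point
    # back into the archive instead of out to the live Web.
    use strict;
    use warnings;

    my $timestamp = "19961201000000";    # illustrative capture date
    my $html = '<a href="http://www.webreview.com/toc.html">contents</a>';

    # Naive rewrite; a real one must handle relative links, BASE tags,
    # quoting variants, and embedded scripts.
    $html =~ s{href="(http://[^"]+)"}{href="/web/$timestamp/$1"}gi;

    print $html, "\n";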
What's amazing to me is the fact that the hardware is free. For doing
things even in the hundreds of terabytes, it costs in the hundreds of
thousands of dollars. When you talk to most people in IT departments,
they spend a couple hundred thousand dollars just on a CPU, much less a
terabyte of disk storage. You can buy a terabyte from EMC for maybe
$300,000. That's just the storage for 1 TB. We can buy 100 TBs with 250
CPUs to work on it, all on a high-speed switch with redundancy built in.
Something has changed by using these modern constructs that are heavily
used at Google, Hotmail, here, Transmeta. There's a whole sector of
companies that are more cost-constrained than, say, banks, which just
buy Oracle and Sun and EMC.
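Written out with Kahle's round numbers, the gap he is pointing at is:

\[
\frac{\$400{,}000}{100\ \mathrm{TB}} = \$4{,}000/\mathrm{TB}
\quad\text{versus}\quad
\$300{,}000/\mathrm{TB}\ \text{(EMC)},
\]

roughly a 75-to-1 difference per terabyte, with 250 CPUs included on
the cheap side.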
Koman: You mentioned Perl, Linux, and FreeBSD. Do you use exclusively
open source software?
Kahle: We use as much open source software as we can; we make as much of
our software as we can open because we're a library. The idea is to help
people make sense of the Net and we leverage all the open tools. Alexa
put up a television archive called tvarchive.org, which is television
news from around the world from Sept. 11 to Sept. 18. Twenty channels in
Chinese, Russian, Japanese, Iraqi. Iraqi television is really
interesting. So in three weeks, Alexa took all these recordings from
tape, massaged them, put them online, and converted them into several
different formats. The only way to do this is to cross-cluster hundreds
of commodity Linux boxes and use freeware tools, all of which barely
work.
Koman: This all takes a lot of brain cells; you have to have some smart
people working on this.
Kahle: Yes, this is not for the faint of heart. If you're going to run
100TB databases and support hundreds of queries per second, it's going
to take good folks. But on the other hand, there are good folk doing a
whole lot less than that. The archive is a real vindication that you can
do new and different things with these open tools. Because these open
tools can be used in ways different from those for which they were
originally designed, it becomes possible to strive for the biggest
collection of information ever assembled.
Koman: Does the fact you can do this at this scale suggest new
possibilities for the private sector, that businesses can operate on a
scale not previously imagined?
Kahle: Having the capital cost of equipment drop to effectively zero
allows you to think bigger. You start thinking about the whole thing.
For instance, the gutsy maneuver of saying "let's index it all," which
was the breakthrough of Altavista. Altavista in 1995 was an astonishing
achievement, not because of the hardware -- yes, that was interesting
and important from a technical perspective -- but because of the
mindset. "Let's go index every document in the world." And once you have
that sort of mindset, you can get really far.
So if all books are 20 TBs, and 20 TBs are $80,000, that's the Library
of Congress. Then something big has changed. All music? It's tiny. It
looks like there're only one million records that have been produced
over the last century. That's tiny. All movies? All theatrical releases
have been estimated at 100,000, and most of those from India. If you
take all the rest of ephemeral films, that's on the order of a couple
hundred thousand. It's just not that big. It allows you to start
thinking about the whole thing.
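That $80,000 figure squares with the hardware cost he cites earlier in
the interview:

\[
20\ \mathrm{TB} \times \$4{,}000/\mathrm{TB} = \$80{,}000.
\]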
It will also change the relationships of corporations to their IT
departments. IT spends a lot of money on this stuff; they spend
millions. And if they really understood that it doesn't have to cost
millions, it could cost hundreds of thousands of dollars, and they could
hire a few smart people rather than large numbers of people to maintain
all this equipment, we might be able to make some big steps forward. It
would open it up to smaller companies to do bigger things. Where people
used to think that warehouses full of mainframes were an asset, that may
not be the case.
Koman: How do you mine all this stuff?
Kahle: That's where the fun begins. Datamining these materials is great
fun. What Alexa does in its free toolbar is create a related-links
service, and it does it based on the collaborative filtering of "other
people who went to this page went to these other pages." We use the link
structure of the Net and the usage trails from the Alexa users to be
able to compute this. And all of these techniques require tens if not
hundreds of machines to process the data.
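A toy version of that collaborative-filtering computation (the trail
format and scoring are illustrative assumptions; the real computation
runs striped across tens of machines):

    #!/usr/bin/perl
    # Toy co-occurrence count over per-user surf trails: pages that appear
    # in the same trail as a given page become its "related links".
    use strict;
    use warnings;

    # Illustrative trails: user => ordered list of pages visited.
    my %trail = (
        alice => [qw(/a /b /c)],
        bob   => [qw(/a /c /d)],
    );

    my %cooccur;
    for my $pages (values %trail) {
        for my $i (0 .. $#$pages) {
            for my $j ($i + 1 .. $#$pages) {
                $cooccur{ $pages->[$i] }{ $pages->[$j] }++;
                $cooccur{ $pages->[$j] }{ $pages->[$i] }++;
            }
        }
    }

    # Related links for /a, most-shared trails first.
    my $rel = $cooccur{'/a'};
    print join(", ", sort { $rel->{$b} <=> $rel->{$a} } keys %$rel), "\n";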
Because there are only a couple hundred gigabytes for every processor
and the processor and RAM are very closely tied to the disks, you can
operate this cluster as a large parallel computer. It's very inexpensive
to do. We program the computer using a technology called P2, which we'll
be putting out as open source for other people to be able to operate
parallel clusters of Linux or FreeBSD or Solaris boxes.
Koman: What is P2?
Kahle: P2 is a Perl script that takes commands and runs them on remote
boxes, splits up data to be able to run on them, and then brings back
and correlates the data.
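In the spirit of that description, a stripped-down P2-like dispatcher
might look like this (the hostnames and merge step are assumptions;
P2 itself did much more, including splitting input data across nodes):

    #!/usr/bin/perl
    # Sketch: run one command on every node over ssh and gather the
    # line-oriented results back on the controlling machine.
    use strict;
    use warnings;

    my @nodes = map { "node$_.cluster.example" } 1 .. 4;  # illustrative hosts
    my $cmd = shift @ARGV or die "usage: $0 'command'\n";

    my @results;
    for my $node (@nodes) {
        open my $out, '-|', 'ssh', $node, $cmd or die "ssh $node: $!";
        push @results, map { "$node: $_" } <$out>;
        close $out;
    }
    print @results;

A shell-literate user could then type something like
./p2 'grep -c GET access.log' and get per-node counts back, which is
the short learning curve Kahle describes next.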
Koman: It's an operating system for a parallel cluster?
Kahle: But it sits on top. You can take people who know how to do shell
scripts or Perl scripts on normal Unix boxes and within two weeks, they
can be world-class parallel data miners. That's a huge step past the
problems we've had with parallel computing, where you had to learn a
whole new methodology. This is: no new methodology, no rocket science,
no magic. And it's only because it's straightforward that we've been
able to leverage normal programmers' expertise to be able to run
programs on hundreds of machines.
Koman: It sounds quite simple.
Kahle: We've been at it for years. The first company I worked at was
Thinking Machines. And we blew it. We built the fastest computer in the
world that very few people could program. It required people to think in
a new way. What a horrible thing to have to do to be able to attract
customers. The idea is to be able to think the same and be able to do
more. I think we've cracked the parallel computer problem for a very
large set of problems, which is fundamentally data-mining and
database-type operations.
Koman: So will people looking for more than the Wayback Machine be able
to mine the Archive?
Kahle: The idea is to try to allow people to use a Web interface --
clunky, but you can step through it -- but then it would show you the
command that's going to be run across the cluster. But if you say,
"Yeah, that's kind of what I want, but instead of this I want to be able
to go in and put in my own Perl script," then we'll allow people to do
it.
We're going to try to expose what we do internally, but first put an
easy interface to at least get something done, and then an easy path
from novice to expert. But you'll need to know things like Perl. And
then our challenge will be how to manage, say, 10 to 20 programs running
at the same time over the data sets and not have people clobber each
other. Kind of timesharing, but at the hundreds-of-computers level.
Koman: You have several other collections besides the Web. The ephemeral
films and the television archives are not content from the Web, but
content you're putting on the Web.
Kahle: We've put 1,000 films up online for people to download and use in
any way that they want. What we really want is for people to make their
own movies. But these, they're pretty wild films: educational films,
government films, propaganda films, industrial films. They're all
available for download in MPEG-2, which is DVD quality, for people to do
anything they want. People have made some really terrific films, and
some of them are on the site as well. I really recommend "The ABCs of
Happiness" and "The Consequence of War." Awesome films.
Koman: You wouldn't think with 100 terabytes of stuff already that you
would need to encourage the creation of more content.
Kahle: We're trying to show how people can do it themselves. We're
trying to encourage everyone to take their old content that's not online
and put it online. A professor at UC Berkeley said that students use the
Web as the resource of first resort, which is a huge change. But that's
a little dangerous if the Web doesn't have the good stuff on it, and
many people complain it doesn't. Instead of trying to whip students to
go back to the physical library, let's put the good stuff on the Net.
Otherwise, we could have a whole generation learning from ephemeral
content collections, as opposed to learning from the books of the
ancients. And a lot of materials are not there yet.
Koman: Are you working with the great libraries on digitization?
Kahle: Yes, we're working with the Library of Congress on some of these
Web collections and starting to work with them on digitizing different
parts of their print collections. The Prelinger Archives is digitizing
films. We're working with different researchers on automatic
transcription of the television materials, so we can get that to be a
referenceable resource. These are the sort of things we have to get to,
and get to very soon. Every year that passes, we have more and more
students using not the best we have to offer, and that is a tragedy. We
are the establishment. We should be making tools that allow children and
students to have access to it all. And we're letting them down so far.
Koman: What about the question of rights? I just wrote about Lawrence
Lessig's book on intellectual property. Surely the publishers and the
television networks and the record companies aren't willing to let you
keep a copy of all of their stuff?
Kahle: All we collect for the Web archive are sites that are publicly
accessible for free, and if there's any indication from the site owner
that they don't want it in the archive, we take it out. If there's a
robot exclusion, it's removed from the Wayback Machine. Over the years,
people would notice these things in their logs and would say, what are
you doing? And we'd explain what we're doing -- building this archive
and donating a copy to the Library of Congress, etc., etc., and 90% of
the time they say, "Oh, that's cool, you're crazy, but go ahead." About
10% of the time, they'd say, "I don't want any part of it," and we
instruct them on how to use a robot exclusion and they're taken out of
history. That seems to work for everybody at this point. People are
really excited about this future that we're building together.
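The robot exclusion he refers to is the standard robots.txt mechanism;
at the time, a site owner could exclude Alexa's crawler, and thus the
Wayback Machine, with an entry like:

    User-agent: ia_archiver
    Disallow: /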
Koman: The dot-bomb hasn't disillusioned you at all?
Kahle: I never predicted the capital market in the first place. I don't
know where that came from, but wow, there was a lot of money there for
a while. But I love the era of dreams. I loved it when people were trying
to make services whose only constraint was to be popular. They didn't
have to make money, they just had to do something people liked. It was
amazing, the ideas… I'm glad they're captured in some way, because it's
those dreams, when the medium is new, before you realize all its faults
and foibles. The Internet is going to disappoint; it's going to be good
at a few things and not good at everything else. But at least those
dreams are something we should try to live up to the next time. As we
refine technologies and come up with the next thing, let's see if we can
live up to a few more of those dreams, not just making a million
dollars, but having the ability to get your words out, to reinvent
government, whatever it is. If it doesn't happen this time, let's
remember it, so that next time we can give it another good shot.
Am I disillusioned? No. Is it depressing to see a lot of my friends out
of work? Yes! But the goal of universal access to human knowledge is in
many ways an original goal of the Net. It's a tremendous goal. It makes
me want to jump out of bed in the morning and try to get this thing
done. People working on digital divide issues want to join in, advocates
for children's literacy programs want to join in. It's not about driving
slick cars, it's about using this technology for the betterment of
education and people. I'll take that any day over random stock option
grants.
Richard Koman is a freelance writer and the editor of several O'Reilly
titles on Web design and JavaScript.