19 October 2004

Google & the Semantic Web

Ftrain.com is the website of Paul Ford and his pseudonyms. More than two years ago he published an article about Google and its relationship to the Semantic Web. Considering Google's relevance and position in our lives today, it's certainly worth a read, and an enlightening one at that.

. . . . . . . . . . . .

Friday, July 26, 2002

August 2009: How Google beat Amazon and Ebay to the Semantic Web

By Paul Ford

A work of fiction. A Semantic Web scenario. A short feature from a business magazine published in 2009.

Please note that this story was written in 2002.



Googlebot Controls the Earth! Illustration by Rebecca Dravos.

It's hard to believe Google - which is now the world's largest single online marketplace - came on the scene only a little more than 8 years ago, back in the days when Amazon and Ebay reigned supreme. So how did Google become the world's single largest marketplace?

Well, the short answer is "the Semantic Web" (whatever that is - more in a moment). While Amazon and Ebay continue to have average quarterly profits of $1 billion and $1.8 billion, respectively, and are successes by any measure, the $17 billion per annum Google Marketplace is clearly the most impressive success story of what used to be called, pre-crash, "The New Economy."

Amazon and Ebay both worked as virtual marketplaces: they outsourced as much inventory as possible (in Ebay's case, of course, that was all the inventory, but Amazon also kept as little stock on hand as it could). Then, through a variety of methods, each brought together buyers and sellers, taking a cut of every transaction.

For Amazon, that meant selling new items, or allowing thousands of users to sell them used. For Ebay, it meant bringing together auctioneers and auction buyers. Once you got everything started, this approach was extremely profitable. It was fast. It was managed by phone calls, emails, and database applications. It worked.

Enter Google. By 2002, it was the search engine, and its ad sales were picking up. At the same time, the concept of the "Semantic Web," which had been around since 1998 or so, was gaining a little traction, and the attention of an increasing circle of people.

So what's the Semantic Web? At its heart, it's just a way to describe things in a way that a computer can "understand." Of course, what's going on is not understanding, but logic, like you learn in high school:

If A is a friend of B, then B is a friend of A.

Jim has a friend named Paul.

Therefore, Paul has a friend named Jim.

Using a markup language called RDF (an acronym that's here to stay, so you might as well learn it - it stands for Resource Description Framework), you could put logical statements like these on the Internet, "spiders" could collect them, and the statements could be searched, analyzed, and processed. What makes this different from regular search is that the statements can be combined. So if I find a statement on Jim's web site that says "Jim is a friend of Paul" and someone does a search for Paul's friends, even if Paul's web site doesn't have a mention of Jim on it, we know Jim considers himself a friend of Paul.
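The friend example above can be sketched in a few lines of Python. This is only an illustration of combining statements, not real RDF: the tuple layout, the FRIEND_OF label, and the function names are all assumptions made for the sketch.

```python
# Hypothetical names throughout; plain Python tuples stand in for RDF statements.
FRIEND_OF = "is a friend of"

def collect(statements):
    """Pool (subject, predicate, object) statements gathered from many sites."""
    return set(statements)

def friends_of(person, store):
    """Return everyone linked to `person` by a friend statement, in either direction."""
    found = set()
    for subject, predicate, obj in store:
        if predicate != FRIEND_OF:
            continue
        if subject == person:
            found.add(obj)
        elif obj == person:
            # Symmetry rule: if A is a friend of B, then B is a friend of A.
            found.add(subject)
    return found

# The only statement lives on Jim's site; Paul's site says nothing about Jim.
store = collect([("Jim", FRIEND_OF, "Paul")])
print(friends_of("Paul", store))
```

A search for Paul's friends turns up Jim even though the statement was published on Jim's side only, which is the whole point of pooling the statements before querying them.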

Other things we might know for sure? That Car Seller A is selling Miatas for 10% less than Car Seller B. That Jan Hammer played keyboards on the Mahavishnu Orchestra's albums in the 1970s. That dogs have paws. That your specific model of computer requires a new motherboard and a faster bus before it can be upgraded to a Pentium 18. The Semantic Web isn't about pages and links, it's about relationships between things - whether one thing is a part of another, or how much a thing costs, or when it happened.

The Semweb was originally supposed to give the web the "smarts" it lacked - and much of the early work on it was in things like calendaring and scheduling, and in expressing relationships between people. By late 2003, when Google began to seriously experiment with the Semweb (after two years of experiments at their research labs), it was still a slow-growing technology that almost no one understood and very few people used, except for academics with backgrounds in logic, computer science, or artificial intelligence. The learning curve was as steep as a cliff, and there wasn't a great incentive for new coders to climb it and survey the world from their new vantage.

The Semweb, it was promised, would make it much easier to schedule a dentist's appointment, update your computer, check the train schedule, and coordinate shipments of car parts. It would make searching for things easier. All great stuff, stuff to make millions of dollars from, perhaps. But not exactly sexy to the people who write the checks, especially after they'd been burnt 95 times over by the dot-com bust. All they saw was the web - the same web that had lined a few pockets and emptied a few million - with the word "semantic" in front of it.

. . . . .

Semantics vs. Syntax, Fight at 9

The semantics of something is the meaning of it. Nebulous stuff, but in the world of AI, the goal has long been getting semantics out of syntax. See, the trillion dollar question is, when you have a whole lot of stuff arranged syntactically, in a given structure that the computer can chew up, how do you then get meaning out of it? How does syntax become semantics? Human brains are really good at this, but computers are dreadful. They're whizzes at syntax. You can tell them anything, if you tell it in a structured way, but they can't make sense of it; they keep deciding that "The flesh is willing but the spirit is weak" in English translates to "The meat is full of stars but the vodka is made of pinking shears" or suchlike in Russian.

So the guess has always been that you need a whole lot of syntactically stable statements in order to come up with anything interesting. In fact, you need a whole brain's worth - millions. Now, no one has proved this approach works at all, and the #1 advocate for this approach was a man named Doug Lenat of the CYC corporation, who somehow ended up on President Ashcroft's post-coup blacklist as a dangerous intellectual and hasn't been seen since. But the basic, overarching idea with the Semweb was - and still is, really - to throw together so much syntax from so many people that there's a chance to generate meaning out of it all.

As you know, computers still aren't listening to us as well as we'd like, but in the meantime the Semweb technology matured, and all of a sudden centralized databases - and Amazon and Ebay were prime examples of centralized databases with millions of items each - could suddenly be spread out through the entire web. Everyone could own their little piece of the database, their own part of the puzzle. It was easy to publish the stuff. But the problem was that there was no good way to bring it all together. And it was hard to create RDF files, even for some programmers - so we're back to that steep learning curve.

That all changed - surprisingly slowly - in late 2004, when, with little fanfare, Google introduced three services: Google Marketplace Search, Google Personal Agent, and Google Verification Manager, and a software product, Google Marketplace Manager.

. . . . .

Google Marketplace Search

Marketplace Search is a search feature built on top of the Google Semantic Search feature, and it's likely nearly everyone reading will have used it at least once. You simply enter:

sell:martin guitar

to see a list of people buying Martin-brand acoustic guitars, and

buy:martin guitar

to see a list of sellers. Google asked for, and remembered, your postal code, and you could use easy sort controls inside the page to organize the resulting list of guitars by price, condition, model number, new/used, and proximity. The pages drew from Google's "classic," non-Semantic-Web search tools, long considered the best on the Web, to link to information on Martin models and buyer's guides, as well as from Google's Usenet News archive. Links to sites like Epinions filled in the gaps.

So where did Google Marketplace Search get its information? The same way Google got all of its information - by crawling through the entire web and indexing what it found. Except now it was looking for RDDL files, which pointed to RDF files, which contained logical statements like these:

(Scott Rahin) lives in Zip Code (11231).
(Scott Rahin) has the email address (ford@ftrain.com).
(Scott Rahin) has a (Martin Guitar).
[Scott's] (Martin Guitar) is a model (245).
[Scott's] (Martin Guitar) can be seen at (http://ftrain.com/picture/martin.jpg).
[Scott's] (Martin Guitar) costs ($900).
[Scott's] (Martin Guitar) is in condition (Good).
[Scott's] (Martin Guitar) can be described as "Well cared for, and played rarely (sadly!). Beautiful, mellow sound and a spare set of strings. I'll be glad to show it to anyone who wants to stop by, or deliver it anywhere within the NYC area."

What's important to understand is that the things in parentheses and brackets above are not just words, they're pointers. (Scott Rahin) is a pointer to http://ftrain.com/people/Scott. (Martin Guitar) is a pointer to a URL that in turn refers to a special knowledge database that has other logical statements, like these:

(Martin Guitar) is an (Acoustic Guitar).
(Acoustic Guitar) is a (Guitar).
(Guitar) is an (Instrument).

Which means that if someone searches for guitar, or acoustic guitar, all Martin Guitars can be included in the search. And that means that Scott can simply say he has a Martin, or a Martin guitar, and the computers figure the rest out for him.
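As a rough sketch of how those three statements widen a search, the Python below walks the "is a" chain upward from a listing's category. The dictionary, the listing tuple, and the function names are hypothetical, and matching is on exact category names for simplicity.

```python
# The three "is a" statements from the text, stored as child -> parent.
IS_A = {
    "Martin Guitar": "Acoustic Guitar",
    "Acoustic Guitar": "Guitar",
    "Guitar": "Instrument",
}

# One hypothetical listing: (seller, category, price in dollars).
LISTINGS = [("Scott Rahin", "Martin Guitar", 900)]

def ancestors(kind):
    """Follow the is-a chain upward and return every broader category in order."""
    chain = []
    seen = {kind}
    while kind in IS_A:
        kind = IS_A[kind]
        if kind in seen:  # guard against an accidental cycle in the data
            break
        seen.add(kind)
        chain.append(kind)
    return chain

def search(query):
    """Return listings whose category is the query or a narrower kind of it."""
    return [
        listing for listing in LISTINGS
        if query == listing[1] or query in ancestors(listing[1])
    ]

print(search("Guitar"))
```

Scott only ever said he has a Martin Guitar, yet a search for "Guitar" or "Instrument" still finds his listing because the category chain supplies the rest.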

Actually, I just lied to you - it doesn't work exactly that way, and there's a lot of trickery with the pointers, and even the verb phrases are pointers, but rather than spout out a few dozen ugly terms like namespaces, URIs, prefixes, serialization, PURLs, and the like, we'll skip that part and just focus on the essential fact: everything on the Semantic Web describes something that has a URL. Or a URI. Or something like that. What that really means is that RDF is data about web data - or metadata. Sometimes RDF describes other RDF. So do you see how you take all those syntactic statements and hope to build a semantic web, one that can figure things out for itself? Combining the statements like that? Do you? Come on now, really? Yeah, well no one does.

So Google connects everyone by spidering RDF and indexing it. Of course, connecting anonymous buyers and sellers isn't enough. There needs to be accountability. Enter the "Web Accountability and Rating Framework." There were many competing frameworks for accountability, but this one was certified, finally, by the World Wide Web Consortium (before the nuclear accident at MIT) and by ECMA, and it's now the standard. How does it work? Well:

On Kara Dobbs's site, we find this statement:

[Kara Dobbs] says (Scott Rahin) is (Trustworthy).

On James Drevin's site, we find this statement:

[James Drevin] says (Scott Rahin) is (Trustworthy).

And so forth. Fine - but how do you know how to trust any of these people in the first place? Stay with me:

On Citibank's site:

[Citibank] says (Scott Rahin) is (Trustworthy).

On Mastercard's site:

[Mastercard] says (Scott Rahin) is (Trustworthy).

And inside Google:

[Google Verification Service] says (Scott Rahin) is (Trustworthy).

and if

[Citibank] says (Kara Dobbs, etc) is (Trustworthy).

then you start to see how it can all fit together, and you can actually get a pretty good sense of whether someone is the least bit dishonest or not. Now, this raises a billion problems about accountability and the nature of truth and human behavior and so forth, but we don't have the requisite 30 trillion pages, so just accept that it works for now. And that a lot of other stuff of this ilk is coming down the pike, like:
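One way to picture how those statements fit together is a walk outward from a few institutions you already trust. The sketch below is an assumption-laden toy, not the framework described in the story: the ROOTS set, the pair layout, and the function names are invented for illustration, and "Kara Dobbs, etc" is encoded as Kara Dobbs alone.

```python
from collections import deque

# (who says it, who is called Trustworthy), taken from the statements above.
VOUCHES = [
    ("Kara Dobbs", "Scott Rahin"),
    ("James Drevin", "Scott Rahin"),
    ("Citibank", "Scott Rahin"),
    ("Mastercard", "Scott Rahin"),
    ("Google Verification Service", "Scott Rahin"),
    ("Citibank", "Kara Dobbs"),
]

# Assumption: these institutions are trusted from the start.
ROOTS = {"Citibank", "Mastercard", "Google Verification Service"}

def trusted_parties(vouches, roots):
    """Breadth-first walk: anyone vouched for by a trusted party becomes trusted."""
    trusted = set(roots)
    queue = deque(roots)
    while queue:
        voucher = queue.popleft()
        for source, subject in vouches:
            if source == voucher and subject not in trusted:
                trusted.add(subject)
                queue.append(subject)
    return trusted

def credible_vouchers(person, vouches, roots):
    """List the people vouching for `person` who are themselves trusted."""
    trusted = trusted_parties(vouches, roots)
    return sorted(
        source for source, subject in vouches
        if subject == person and source in trusted
    )

print(credible_vouchers("Scott Rahin", VOUCHES, ROOTS))
```

In this toy data Kara's word counts because Citibank vouches for her, while James Drevin's does not, since nobody trusted has vouched for him yet.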

[The United States Social Security Administration] says (Pete Jefferson) was born in (1992).

Which means that Pete Jefferson can download smutty videos and "adult" video games from the Internet, since he's 19 and has a Social Security number. That's what the Safe Access for Minors bill says should happen, anyway. And don't forget the civil liberty ramifications of statements like these:

[The Sheriff's Department of Dallas, Texas] says (Martin Chalbarinstik) is a (Repeat Sexual Offender).

[The Sheriff's Department of Dallas, Texas] says (Dave Trebuchet) has (Bounced Checks).

[The Green Party, USA] says (Susan Petershaw) is a (Member).

Databases are powerful, and as much as they bring together data, they can intrude on privacy, but rather than giving the author permission to become a frothing mess lamenting the total destruction of our civil liberties at the hand of cruel machines, let's move on.

Anyway, when you think about it, you can see why Google was a natural to put it all together. Google already searched the entire Web. Google already had a distributed framework with thousands of independent machines. Google already looked for the links between pages, the way they fit together, in order to build its index. Google's search engine solved equations with millions of variables. Semantic Web content, in RDF, was just another search problem, another set of equations. The major problem was getting the information in the first place. And figuring out what to do with it. And making a profit from all that work. And keeping it updated....

. . . . .

Google Marketplace Manager

Well, first you need the information. Asking people to simply throw it on a server seemed like chaos - so enter Google Marketplace Manager, a small piece of software for Windows, Unix, and Macintosh (this is before Apple bought Spain and renamed it the Different-thinking Capitalist Republic of Information). The Marketplace Manager, or MM, looked like a regular spreadsheet and allowed you to list information about yourself, what you wanted to sell, what you wanted to buy, and so forth. MM was essentially a "logical statement editor," disguised as a spreadsheet. People entered their names, addresses, and other relevant information about themselves, then they entered what they were selling, and MM saved RDF-formatted files to the server of their choice - and sent a "ping" to Google which told the search engine to update its index.

When it came out, the MM was a little bit magical. Let's say you wanted to sell a book. You entered "Book" in the category and MM queried the Open Product Taxonomy, then came back and asked you to identify whether it was a hardcover book, softcover, used, new, collectible, and so forth. The Open Product Taxonomy is a structured thesaurus, essentially, of product types, and it's quickly becoming the absolute standard for representing products for sale.

Then you enter an ISBN number from the back of the book, hit return, and the MM automatically fills in the author, copyright, number of pages, and a field for notes - it just queries a server for the RDF, gets it, chews it up, and gives it to you. If you were a small publishing house, you could list your catalog. If you had a first edition Grapes of Wrath you could describe it and give it a lowest acceptable price, and it'd appear in Google Auctions. Most of the smarts in the MM were actually on the server, as Google interpreted what was entered and adapted the spreadsheet around it. If you entered car, it asked for color. If you entered wine, it asked for vintage, vineyard, number of bottles. Then, when someone searched for 1998 Merlot, your bottle was high on the list.
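The way the spreadsheet adapts to what you type can be sketched as a lookup from category to follow-up questions. In the story that knowledge lives on Google's server; here a local dictionary stands in for it, and the table, field names, and function name are hypothetical.

```python
# Hypothetical prompt table; the categories and fields are the ones the story mentions.
FOLLOW_UP_FIELDS = {
    "car": ["color"],
    "wine": ["vintage", "vineyard", "number of bottles"],
}

def missing_fields(row):
    """Return the follow-up fields the sheet should still ask about for this row."""
    category = row.get("category", "").strip().lower()
    wanted = FOLLOW_UP_FIELDS.get(category, [])
    return [field for field in wanted if not row.get(field)]

# A wine row with only the vintage filled in so far.
print(missing_fields({"category": "Wine", "vintage": 1998}))
```

Once every follow-up field is filled, the row carries enough detail for a search such as 1998 Merlot to match it.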

You could also buy advertisements on Google right through the Manager for high-volume or big ticket items, and track how those advertisements were doing; it all updated and refreshed in a nice table. You could see the same data on the Web at any time, but the MM was sweet and fast and optimized. When you bought something, it was listed in your "purchases" column, organized by type of purchase - easy to print out for your accountant, nice for your records.

So, as we've said, Google allowed you to search for buyers and sellers, and then, using a service shamelessly copied from the then-ubiquitous PayPal, handled the transaction for a 1.75% charge. Sure, people could send checks or contact one another and avoid the 1.75%, but for most items Google's service was your best bet - fast and cheap. 1.75% plus advertising and a global reach, and you can count on millions flowing smoothly through your accounts.

Amazon and Ebay - remember them? - doubtless saw the new product and realized they were in a bind. They would have to "cannibalize their own business" in order to go the Google path - give up their databases to the vagaries of the Web. So, in classic big-company style, they hedged their bets and did nothing.

Despite their inaction, before long all manner of competing services popped up, spidering the same data as Google and offering a cheaper transaction rate. But Google had the brand and the trust, and the profits.

It took 2 years for over a million individuals to accept and begin using the new, Semweb-based shopping. During that time, Google had about $300 million in volume - for a net of $4.5 million on transactions. But, just as Ebay and Amazon had once compelled consumers to bring their business to the web, the word-of-mouth began to work its magic. Since it was easy to search for things to buy, and easy to download the MM and get started, the number of people actively looking through Google Marketplace grew to 10 million by 2006.

. . . . .

Google Personal Agent

Now, search is not enough. You need service. You need the computer to help you. So Google also rolled out the Personal Agent - a small piece of software that, in essence, simply queried Google on a regular basis and sent you email when it found what you were looking for on the Semweb.

Want cheap phone rates? Ask the agent. Want to know when Wholand, the Who-based theme park, opens outside of London? Ask the agent. Or when your wife updates her web-based calendar, or when the price of MSFT goes up three bucks, or when stories about Ghanaian politics hit the wire. You could even program it to negotiate for you - if it found a first-edition Paterson in good condition for less than $2000, offer $500 below the asking price and work up from there. It's between you and the seller, anonymously, perhaps even tax-free if you have the right account number, no one takes a cut. Not using it to buy items began to be considered backwards. Just as the regular Google search negotiated the logical propositions of the Semweb, the Personal Agent did the same - it just did it every few minutes, and on its own, according to pre-set rules.
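The negotiating rule in the Paterson example can be written down as a small pre-set rule. The sketch below assumes a fixed $100 raise between bids, which the story does not specify; the function names and parameters are likewise invented for illustration.

```python
def opening_offer(asking_price, ceiling=2000, discount=500):
    """Return the first bid, or None when the listing is at or above the ceiling."""
    if asking_price >= ceiling:
        return None
    return max(asking_price - discount, 0)

def offer_sequence(asking_price, step=100, ceiling=2000, discount=500):
    """Yield bids that start below the asking price and work up to it.

    `step` is an assumed raise between bids; the story gives no figure.
    """
    bid = opening_offer(asking_price, ceiling, discount)
    if bid is None:
        return
    while bid < asking_price:
        yield bid
        bid += step
    yield asking_price

# A first edition listed at $1800 qualifies, so the agent opens $500 lower.
print(list(offer_sequence(1800)))
```

A real agent would stop as soon as the seller accepts a bid; the generator simply lists the bids it would be willing to make, in order.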

. . . . .

Google Verification Service

Finally, Google realized they could grab a cut on the "Web of Trust" idea by offering their own verification and rating service: $15 a year to answer a questionnaire, have your credit checked, and fill in some bank account information. But people signed up, because Google was the marketplace; the Google seal of approval meant more than the government's.

. . . . .

A Jury of Your Peer-to-Peers

Since all the information was already in RDF format, Google's own strategy came back to bite it. Free clones of Google Marketplace Manager began to appear, and other search engines began to aggregate without the 1.75% cut, trying to find other revenue models. The Peer-to-Peer model, long the favorite of MP3 and OGG traders, came back to include real-time sales data aggregation, spread over hundreds of thousands of volunteer machines - the same model used by Google, but decentralized among individuals. Amazon and Ebay began automatically including RDF-spidered data on their sites, fitting it right in with existing auctions and items for sale, taking whatever cuts they could find or force out of the situation.

In 2006, Citibank introduced Drop Box Accounts for $100/month, then $30, then $15, and $5/month for checking account holders. The Drop Box account is identified by a single number, and can only receive deposits, which can then be transferred into a checking or savings account. They were even URL-addressable, and hosted using the Finance Transfer Protocol. Simply point your browser to account://382882-2838292-29-1939 and enter the amount you want to deposit. There's no risk in giving out a secure drop box number, and no fee for deposits. Banks held the account information of depositors in federally supervised escrow accounts. Suddenly everyone could simply publish their bank account number and sell their goods without any middleman at all.

Feeling the pressure, and concerned, just as the music companies had been years before, that their lead would slip to the peer-to-peer market, Google dropped its fees to 1%, allowed MM users to use Drop Box accounts, and began to charge $25 a year for the MM software and service for sellers, while still making it free for users. After a nervous few months, Google found that the majority of users who sold more than 10 items per year - the volume users - were glad to buy a working product with a brand name behind it; the peer-to-peer networks were considered less trustworthy, and the connection to Google advertising. Google also realized that they could offer Drop Box accounts, and tie them to stock and money-market trading accounts, which opened a can of worms that we'll skip over here. If you're interested, you can read The Dragon in the Chicken Coop, by Tom Rawley.

Google's financials can, of course, be automatically inserted into your MM stock ticker; right now they're trading at 25,000 times earnings, heralding news of the "New New New New Economy." You'll get no such heralding here; while they've pulled it off once, the competition is fierce. Google was the dream company for a little less than the last decade, but they're finally slowing down, and it's high time for a new batch of graduate students too itchy to finish their Ph.D.'s to get on the ball. And I'm sure they will.

. . . . .

A Semantically Terrifying Future?

The cultural future of the Semantic Web is a tricky one. Privacy is a huge concern, but too much privacy is unnerving. Remember those taxonomies? Well, a group of people out of the Cayman Islands came up with a "ghost taxonomy" - a thesaurus that seemed to be a listing of interconnected yacht parts for a specific brand of yacht, but in truth the yacht-building company never existed except on paper - it was a front for a money-laundering organization with ties to arms and drug smuggling. When someone said "rigging" they meant high powered automatic rifles. Sailcloth was cocaine. And an engine was weapons-grade plutonium.

So, you're a small African republic in the midst of a revolution with a megalomaniac leader, an expatriate Russian scientist in your employ, and 6 billion in heroin profits in your bank account, and you need to buy some weapons-grade plutonium. Who does it for you? Google Personal Agent, your web-based pal, ostensibly buying a new engine for your yacht, a little pricey for $18 million, sure. But you're selling aluminum coffeemakers through the Home Products Unlimited (Barbados) Ghost Taxonomy - or nearly pure heroin, you might say - so you'll make up the difference.

Suddenly one of the biggest problems of being a criminal mastermind - finding a seller who won't sell you out - is gone. With so many sellers, you can even bargain. Selling plutonium is as smooth and easy and anonymous (now that you can get Free Republic of Christian Ghana Drop Boxes) as selling that Martin guitar. Couldn't happen? Some people say it can, which explains the Mandatory Metadata Review bill on its way through Congress right now, where all RDF must be referenced to a public taxonomy approved by a special review board. Like the people say, may you live in interesting times. Which people? Look it up on Google.


Googlebot Controls the Moon! Illustration by Rebecca Dravos.

. . . . . . . . . . . .


