Fabrice Canel of the Live Search Crawling Team announced on Tuesday (February 12, 2008) significant updates to MSNBot, the Live Search crawler. The updates significantly improve how efficiently it crawls and indexes websites. The two biggest changes are improvements to HTTP compression and conditional GET.
We support conditional get as defined by RFC 2616 (Section 14.25), generally we will not download the page unless it has changed since the last time we crawled it. As per the standard, our crawler will include the "If-Modified-Since" header & time of last download in the GET request and when available, our crawler will include the "If-None-Match" header and the ETag value in the GET request. If the content hasn't changed the web server will respond with a 304 HTTP response.
They also updated their user agent.
In addition to these two features there are many more improvements in performance that should help further optimize our crawling. As a result, we've also upgraded our user agent to reflect the changes, it is now "msnbot/1.1". If you think you are experiencing any issues with MSNbot, or have any questions about the updates, please use our Crawler Feedback & Discussion form.
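To make the conditional GET exchange described in the announcement more concrete, here is a minimal sketch of the request headers a crawler might send and how it could react to the response. Only the header names, the 304 status and the "msnbot/1.1" user agent come from the announcement; the function names and the Accept-Encoding value are my own assumptions for illustration, not MSNBot's actual implementation.

```python
from email.utils import formatdate


def conditional_headers(last_crawl_epoch, etag=None):
    """Build headers for a conditional GET (illustrative sketch only).

    last_crawl_epoch: time of the previous download, in seconds since the epoch.
    etag: the ETag value saved from the previous response, if there was one.
    """
    headers = {
        "User-Agent": "msnbot/1.1",
        # Assumed value: asks the server for a compressed response.
        "Accept-Encoding": "gzip, deflate",
        # HTTP dates are expressed in GMT.
        "If-Modified-Since": formatdate(last_crawl_epoch, usegmt=True),
    }
    if etag is not None:
        # Sent only when the server gave us an ETag last time.
        headers["If-None-Match"] = etag
    return headers


def needs_download(status_code):
    """A 304 means the page is unchanged, so there is nothing to fetch."""
    return status_code != 304


if __name__ == "__main__":
    print(conditional_headers(0, etag='"abc123"'))
    print(needs_download(304))
```

For site owners, the practical upshot is that a server that answers these headers with a 304 saves bandwidth on every repeat visit from the crawler.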
When Nathan Buggia, Lead Program Manager at Live Search Webmaster Center, first brought this news to my attention, one of the first questions I asked him was about the respect (or disrespect) of robots.txt. Some people have the impression that MSNBot doesn't respect robots.txt, because they often see their content in the Live Search index even though they've specifically requested that it not be crawled. Nathan replied with:
We do read and respect the robots.txt file, however, if there is a link on a 3rd party site, that points to a page blocked by the REP on your site, we may still put that link (& associated anchor text) into our index. And we may surface that link (and anchor text) in our search results if it appears to be relevant, but we still won't go and crawl/index the actual page.
This is something that we spend a lot of time debating about internally, I would love to hear your thoughts on this.
I thought that was interesting. The fact that they occasionally include links in their SERPs to pages they don't actually crawl might be the reason for all of the confusion. For me, it's a foreign concept that a search engine result would contain links to pages that the engine hasn't even crawled. However, I can see the case for it if many trusted websites keep referencing a resource and the destination URL is judged to be an appropriate search result, regardless of whether or not the destination content has been crawled or analyzed.
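The distinction Nathan describes is easier to see with a quick check. The sketch below uses Python's standard robots.txt parser to show what a robots.txt rule actually controls: whether a crawler may fetch a URL. It says nothing about whether a link to that URL found on a third-party site can appear in an index. The robots.txt contents, the example.com URLs and the may_crawl helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for example.com.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /private/
"""


def may_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Blocked: a compliant crawler will not download this page.
    print(may_crawl("msnbot/1.1", "http://example.com/private/report.html"))
    # Allowed: no rule matches this path.
    print(may_crawl("msnbot/1.1", "http://example.com/public/index.html"))
```

A crawler that honors the first result never sees the page's content, yet it can still learn the URL and its anchor text from other sites, which is exactly the case Nathan describes.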
Yesterday I wrote about the threat of vertical search engines to horizontal search engines like Google. I created a simple mockup of how Google could easily update their steadfast and simple homepage to accommodate vertical search. Marios Alexandrou mentioned in the comments of that blog entry that Ask.com now prompts users to help them refine their search. I gave it a try, but it still didn't help with the search example I gave in my original entry.

After checking out Ask.com, Scott Holdren was perusing Marketing Pilgrim and decided to click on this ad:

That ad took him to Quintura — a search engine that uses a keyword cloud to help users instantly refine their search. My mockup had the user specifying a subject upfront, but Quintura waits for the initial search results before it provides refinement options. Searching for "reasons why I should run" not only gave me much more relevant search results, it also provided appropriate keywords in the refinement cloud. I hovered my cursor over the keyword "exercise" and it instantly changed the search results to display links related to that keyword. Surprisingly, it made all of the search results relevant to my search query.

Quintura also searches images, videos and products on Amazon.com. Not surprisingly, it uses its own Amazon.com affiliate code in those results. Quintura has a great deal of promise. The search cloud may take some time to get used to, but you'll find yourself using it a lot to quickly refine your search results. Overall, I think Quintura does an excellent job of tackling the horizontal-vertical search engine dilemma. Too bad the name is hard to remember and difficult to spell. If I were them, I would change the name or simply hope that one of the major search engines buys and integrates their technology.
Update: Embed a search cloud anywhere. This is cool. Here's an example:
Sramana Mitra recently wrote about Google's Achilles Heel. In her article, she suggested that vertical search engines are Google's worst vulnerability.
Google has so far stayed focused on horizontal, generic search with a simple, one-bar user interface. And it has brought them a remarkably long way.
However, as users get more sophisticated, they are discovering brands that offer richer user experiences customized to the dynamics of the vertical.
Here's an example. Let's say that I've been considering a new exercise routine. I'm trying to decide whether I should start running, jogging or walking, so I go to Google and search for "reasons why I should run." Instead of getting what I'm looking for, I get an article I recently wrote, "Five Reasons Why You Should Run a 'Do Follow' Blog." This of course is fantastic for our website, but it's not what I was searching for. It's not until the third result that I see something related, "10 reasons why women should run." That result is certainly closer to what I'm looking for, but I would still be hesitant to click on it, because I'm not a woman.

The remaining results are related to politics — none of which are related to the intention of my search query. That "unrelatedness," along with close matches that are exclusive (I'm not a woman), is why vertical search engines are the way of the future. Basically, to get better results, I would want to use a vertical search engine. A good example would be a search engine that only focuses on exercise or just running.
However, the problem with vertical search engines is that there are too many of them (or there soon will be), which will create the need for a search engine to find vertical search engines — ridiculous, I know. That problem takes us full circle and back to "do everything" search engines like Google, Yahoo! and Live. So, what's the solution?
I think the solution is relatively easy, at least from a user interface (UI) perspective. Adding a category/tag input field could go a long way toward returning much more relevant results. This of course would wreak havoc on the SEO industry, because it would become even harder (or possibly easier in some cases) to target and track search engine result pages (SERPs). I would envision something as simple as this Google mockup I put together. The first image shows the "hint" language that would go in each input field before the user enters any text, while the second image is an example of the actual search term and subject(s) I would use.


If Google could correctly apply "subjects" (aka categories, tags, etc.) to individual Web pages and entire websites, which I believe they can and already do in some respect, then updating their interface to work similarly to my mockup may save them from the proliferation of vertical search engines.
There's been a lot of misinformation going around regarding an algorithm tweak by Google. The tweak affects subdomains and subdirectories, which for an SEO specialist can be quite alarming. Fortunately, there's not much bite to this update.
The algorithm tweak was relatively minor and was only intended to clean up some SERPs that were returning too many results from a single domain that uses one or more subdomains. Matt Cutts described the subdomains and subdirectories change like this:
This change doesn't apply across the board; if a particular domain is really relevant, we may still return several results from that domain. For example, with a search query like [ibm] the user probably likes/wants to see several results from ibm.com. Note that this is a pretty subtle change, and it doesn't affect a majority of our queries. In fact, this change has been live for a couple weeks or so now and no one noticed. The only reason I talked about the subject at PubCon at all was because someone asked for my advice on subdomains vs. subdirectories.
Just to be doubly sure that this update was truly minor, I asked Matt if subdomains and their root domains were still treated as separate, unique entities (AKA virtual properties), and this was his response:
Our policies for how subdomain.example.com vs. example.com relate to each other haven't changed.
One interesting thing that came out of the comment thread on that post was about subdomains and multilingual websites. Matt actually listed his preferences for how to set up multilingual sites. As with most of Matt's preferences, it can be assumed, unless he states otherwise, that they are also the preferences of Google's algorithm. In order, his preferences were:
- ccTLDs such as example.fr or example.de
- otherwise, subdomains such as fr.example.com or de.example.com
- otherwise, subdirectories such as example.com/fr/ or example.com/de/
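To see the three layouts side by side, here is a small sketch that builds the URL for each option in Matt's order of preference. The localized_urls function is hypothetical and exists only for illustration. It also assumes the same two-letter code works as both the ccTLD and the subdomain or subdirectory label, which holds for "fr" and "de" but not for every country and language pair.

```python
def localized_urls(domain, code):
    """Return the three URL layouts for a localized site, best first.

    domain: the main site, e.g. "example.com".
    code: two-letter code such as "fr" or "de" (assumed to be usable
          as a ccTLD and as a subdomain or subdirectory label).
    """
    # Drop the existing top-level domain to build the ccTLD variant.
    name = domain.rsplit(".", 1)[0]
    return [
        f"http://{name}.{code}/",    # 1. ccTLD
        f"http://{code}.{domain}/",  # 2. subdomain
        f"http://{domain}/{code}/",  # 3. subdirectory
    ]


if __name__ == "__main__":
    for url in localized_urls("example.com", "fr"):
        print(url)
```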
The year is almost at an end, and while many of us are still out fighting the crowds at the local stores searching for gifts (or at PubCon), Yahoo! has gotten the jump on Google and Live Search by releasing its report on the top search trends of 2007. Yahoo! has separated the results into nine different categories.
Of course, the statistics speak for themselves, but it's always interesting as an SEO geek to take a look at the figures for the year and to see trends in search.
Top 10 News Stories
- Saddam Hussein
- Iran
- Iraq
- President George W. Bush
- Oil and Gas Prices
- Barack Obama
- Hillary Rodham Clinton
- San Diego Fires
- Afghanistan
- Virginia Tech
You can see the 8 other categories at Yahoo's top search trends for 2007.