Monday, September 28, 2009

Starting Findory: Acquisition talks

[I wrote a draft of this post nearly two years ago as part of my Starting Findory series, but did not publish it at the time; it seemed inappropriate given my position at Microsoft and the economic downturn. Recently, Google and Microsoft both announced ([1] [2]) that they intend to make 12-15 acquisitions a year, which makes this much more timely.]

At various points when I was running Findory, I approached or was approached by other firms about acquisition. For the most part, these talks went well, but, as with many experiences at Findory, my initial expectations proved naive. It gave me much to contemplate, both for what startups should think about when entering these talks and changes bigger companies might consider in how they do acquisitions.

For a startup, acquisition talks can be a major distraction. They take time away from building features for customers. They run up legal bills that increase burn rate. They fill the team with nervous flutters over an uncertain future, potential distant payouts, and the complexity of a move.

Acquisition talks also can be dangerous for a startup. Some companies might start due diligence, extract all the information they can, then decide to try to build it themselves.

There is some disagreement on this last point. For example, Y Combinator's Paul Graham wrote, "What protects little companies from being copied by bigger competitors is ... the thousand little things the big company will get wrong if they try." Paul is claiming that big companies execute so poorly that the danger of telling them everything is low.

However, big companies systematically underestimate the risk of failure and cost of increased time to market. For internal teams, which often already are jealous of the supposedly greener grass of the startup life, the perceived fun of trying to build it themselves means they are unreasonably likely to try to do so. Paul is right that a big company likely will get it wrong when they try, but they also are likely to try, which means the startup got nothing from their talks but a distraction.

There are other things to watch out for in acquisition talks. At big companies, acquisitions of small startups often are channeled into the same slow, bureaucratic process as an acquisition of a 300-person company. Individual incentives at large firms often reward lack of failure more than success, creating a bias toward doing nothing over doing something. In fact, acquiring companies usually feel little sense of urgency until an executive is spooked by an immediate competitive threat, at which point they panic like a wounded beast, suddenly motivated by fear.

Looking back, my biggest surprise was that companies show much less interest than I expected in seeking out very small, early-stage companies to acquire.

As Paul Graham argued in his essay, "Hiring is Obsolete", for the cost of what looks like a large signing bonus, the companies get experience, passion, and proven ability to deliver. By acquiring early, a company buys only talent and technology. There is no overhead, no markups for financiers, and no investment in building a brand.

Moreover, as the business literature shows, it is the small acquisitions that usually bring value to a company and the large acquisitions that destroy value. On average, companies should prefer doing 100 $2-5M acquisitions over one large $200-500M one, but business development groups at large companies are not set up that way.
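The argument is as much about variance as expected value. As a toy illustration, with entirely made-up success probabilities and payoffs rather than numbers from any study, a quick simulation in Python shows why a portfolio of small bets is safer even when the expected values are identical:

import random

def portfolio_value(num_deals, cost_per_deal, p_success, payoff_multiple):
    # Net value of a portfolio of independent acquisition bets.
    value = 0.0
    for _ in range(num_deals):
        if random.random() < p_success:
            value += cost_per_deal * payoff_multiple
    return value - num_deals * cost_per_deal

random.seed(0)
trials = 10000

# Hypothetical: 100 small $3M deals, each a 20% chance of a 10x payoff.
small = [portfolio_value(100, 3, 0.20, 10) for _ in range(trials)]
# Hypothetical: one big $300M deal with the same 20% chance of a 10x payoff.
big = [portfolio_value(1, 300, 0.20, 10) for _ in range(trials)]

mean = lambda xs: sum(xs) / len(xs)
print("small deals: mean net $%.0fM, lost money %.1f%% of the time"
      % (mean(small), 100.0 * sum(v < 0 for v in small) / trials))
print("one big deal: mean net $%.0fM, lost money %.1f%% of the time"
      % (mean(big), 100.0 * sum(v < 0 for v in big) / trials))

Under these assumptions the two strategies have the same mean outcome, but the single large bet loses money roughly 80% of the time while the diversified portfolio almost never does.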

There is a missed opportunity here. Bigger companies could treat startups like external R&D, letting those that fail fail at no cost, scooping up the talent that demonstrates ingenuity, passion, and ability to execute. It would be a different way of doing acquisitions, one that looks more like hiring than a merging of equals, but also one that is likely to yield much better results.

For more on that, see also Paul Graham's essay, "The Future of Web Startups", especially his third point under the header "New Attitudes to Acquisition".

Thursday, September 17, 2009

Book review: Search User Interfaces

UC Berkeley Professor Marti Hearst has a great new book out, "Search User Interfaces".

The book is a survey of recent work in search, but with an unusual focus on how interface design shapes searchers' perceptions of the quality and usefulness of the search results.

Marti writes with the opinionated authority of an expert in the field, usefully pointing at techniques which have shown promise while dismissing others as consistently confusing to users. Her book is a guide to what works and what does not in search, warning of paths that likely lead into the weeds and counseling us toward better opportunities.

To see what I mean, here are some extended excerpts. First, on why web search result pages still are so simple and spartan in design:
[The] search results page from Google in 2007 [and] ... Infoseek in 1997 ... are nearly identical. Why is the standard interface so simple?

Search is a means towards some other end, rather than a goal in itself. When a person is looking for information, they are usually engaged in some larger task, and do not want their flow of thought interrupted ... The fewer distractions while reading, the more usable the interface.

Almost any feature that a designer might think is intuitive and obvious is likely to be mystifying to a significant proportion of Web users.
On the surprising importance of small and subtle design tweaks:
Small [design] details can make [a big] difference ... For example, Franzen and Karlgren, 2000 found that showing study participants a wider entry form encouraged them to type longer queries.

[In] another example ... [in] an early version of the Google spelling suggestions ... searchers generally did not notice the suggestion at the top of the page of results ... Instead, they focused on the search results, scrolling down to the bottom of the page scanning for a relevant result but seeing only the very poor matches to the misspelled words. They would then give up and complain that the engine did not return relevant results ... [The solution] was to repeat the spelling suggestion at the bottom of the page.

[More generally,] Hotchkiss, 2007b attributed [higher satisfaction on Google] not to the quality of the search results, but rather to [the] design ... A Google VP confirmed that the Web page design is the result of careful usability testing of small design elements.

Hotchkiss, 2007b also noted that Google is careful to ensure that all information in ... the upper left hand corner ... where users tend to look first for search results ... is of high relevance ... He suggested that even if the result hits for other search engines are equivalent in quality to Google's, they sometimes show ads that are not relevant at the top of the results list, thus degrading the user experience.
Speedy search results are important for staying on task, getting immediate feedback, and rapidly iterating:
Rapid response time is critical ... Fast response time for query reformulation allows the user to try multiple queries rapidly. If the system responds with little delay, the user does not feel penalized for [experimenting] ... Research suggests that when rapid responses are not available, search strategies change.
So how do people search? Marti summarizes many models, but here are excerpts on my favorites, berry-picking and information foraging:
The berry-picking model of information seeking ... [assumes] the searchers' information needs, and consequently their queries, continually shift.

Information encountered at one point in a search may lead in a new, unanticipated direction. The original goal may become partly fulfilled, thus lowering the priority of one goal in favor of another ... [The] searchers' information needs are not satisfied by a single, final retrieved set of documents, but rather by a series of selections and bits of information found along the way.

[Similarly,] information foraging theory ... assumes that search strategies evolve toward those that maximize the ratio of valuable information gained to unit of cost for searching and reading.

The berry-picking model is supported by a number of observational studies (Ellis, 1989, Borgman, 1996b ... O'Day and Jeffries, 1993) .... A commonly-observed search strategy is one in which the information seeker issues a quick, imprecise query in the hopes of getting into approximately the right part of the information space, and then doing a series of local navigation operations to get closer to the information of interest (Marchionini, 1995, Bates, 1990).

One part of ... information foraging theory discusses the notion of information scent: cues that provide searchers with concise information about content that is not immediately perceptible. Pirolli, 2007 notes that small perturbations in the accuracy of information scent can cause qualitative shifts in the cost of browsing; improvements in information scent are related to more efficient foraging .... Search results listings must provide the user with clues about which results to click.
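For the mathematically inclined, the core of information foraging is often summarized with a simple rate-of-gain equation. This is a rough paraphrase of Pirolli and Card's patch model, with the notation simplified:

R = \frac{G}{T_B + T_W}

where R is the rate of valuable information gained, G is the total gain, T_B is the time spent between patches of information (searching, navigating), and T_W is the time spent within patches (scanning, reading). The theory predicts that searchers adapt toward strategies that increase R, which is why stronger information scent, by cutting time wasted in low-value patches, makes foraging more efficient.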
What can we do to help people forage for information? Let's start with providing strong information scent in our search result snippets:
[It is important to] display ... a summary that takes the searcher's query terms into account. This is referred to as keyword-in-context (KWIC) extractions.

[It] is different than a standard abstract, whose goal is to summarize the main topics of the document but might not contain references to the terms within the query. A query-oriented extract shows sentences that summarize the ways the query terms are used within the document.

Visually highlighting query terms ... helps draw the searcher's attention to the parts of the document most likely to be relevant to the query, and to show how closely the query terms appear to one another in the text. However, it is important not to highlight too many terms, as the positive effects of highlighting will be lost.

The prevalence of query-biased summaries is relatively recent ... [when] Google began storing full text of documents, making them visible in their cache and using their content for query-biased summaries. Keyword-in-context summaries [now] have become the de facto standard for web search engine result displays.

There is an inherent tradeoff between showing long, informative summaries and minimizing the screen space required by each search hit. There is also a tension between showing fragments of sentences that contain all or most of the query terms and showing coherent stretches of text containing only some of the query terms. Research is mixed about how and when chopped-off sentences are preferred and when they harm usability (Aula, 2004, Rose et al., 2007). Research also shows that different result lengths are appropriate depending on the type of query and expected result type (Lin et al., 2003, Guan and Cutrell, 2007, Kaisser et al., 2008), although varying the length of results has not been widely adopted in practice.
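To make the keyword-in-context idea concrete, here is a minimal sketch of query-biased snippet generation in Python. It assumes a naive sentence splitter and scores sentences by query-term overlap; a production engine would be far more sophisticated about tokenization, ranking, and truncation:

import re

def kwic_snippet(document, query, max_sentences=2):
    # Score sentences by how many distinct query terms they contain,
    # then stitch the best ones together in document order.
    query_terms = set(query.lower().split())
    sentences = re.split(r'(?<=[.!?])\s+', document)
    def score(sentence):
        words = set(re.findall(r'\w+', sentence.lower()))
        return len(query_terms & words)
    best = sorted(sentences, key=score, reverse=True)[:max_sentences]
    best.sort(key=sentences.index)
    snippet = ' ... '.join(best)
    # Highlight query terms; asterisks stand in for the bolding a
    # real result page would use.
    for term in query_terms:
        snippet = re.sub(r'\b(%s)\b' % re.escape(term), r'*\1*',
                         snippet, flags=re.IGNORECASE)
    return snippet

print(kwic_snippet(
    "Berry picking is a model of search. The berry-picking model assumes "
    "information needs shift. Many studies support it.",
    "berry picking model"))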
Next, let's help people iterate on their searches:
Roughly 50% of search sessions involve some kind of query reformulation .... Term suggestion tools are used roughly 35% of the time that they are offered to users.

Usability studies are generally positive as to the efficacy of term suggestions when users are not required to make relevance judgements and do not have to choose among too many terms ... Negative results... seem to stem from problems with the presentation interface.

[When used,] search results should be shown immediately after the initial query, alongside [any] additional search aids .... A related recent development in rapid and effective user feedback is an interface that suggests a list of query terms dynamically, as the user types the query.

10-15% of queries contain spelling or typographical errors .... [Searchers] may prefer [a spelling] correction to be made automatically to avoid the need for an extra click ... [perhaps] with [the] guess of the correct spelling interwoven with others that contain the original, most likely incorrect spelling.

Web search ... query spelling correction [is] a harder problem than traditional spelling correction because of the prevalence of proper names, company names, neologisms, multi-word phrases, and very short contexts ... A key insight for improving spelling suggestions on the Web was that query logs often show not only the misspelling, but also the corrections that users make in subsequent queries.
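That query-log insight lends itself to a sketch. The toy Python miner below pairs consecutive queries within a session and keeps rewrites that are within a small edit distance; the session log is invented for illustration:

from collections import Counter

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def mine_corrections(sessions, max_distance=2):
    # Consecutive query pairs that differ only slightly are likely
    # (misspelling, correction) pairs.
    pairs = Counter()
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second and edit_distance(first, second) <= max_distance:
                pairs[(first, second)] += 1
    return pairs

sessions = [["briteny spears", "britney spears"],
            ["briteny spears", "britney spears", "britney spears albums"],
            ["weather seattle"]]
for (wrong, right), count in mine_corrections(sessions).most_common():
    print("%r -> %r seen %d times" % (wrong, right, count))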
Avoid the temptation to prioritize anything other than very fast delivery of very relevant results into the top 3 positions. That is all most people will see. Beyond that, we've likely missed our shot, and we probably should focus on helping people iterate:
Searchers rarely look beyond the first page of search results. If the searcher does not find what they want in the first page, they usually either give up or reformulate their query ... Web searchers expect the best answer to be among the top one or two hits in the results listing.
The book has much advice on designs and interfaces that appear to be helpful as well as those that do not. Here is some of the advice on what appears to be helpful:
Numerous studies show that an important search interface design principle is to show users some search results immediately after their initial query ... This helps searchers understand if they are on the right track or not, and also provides them with suggestions of related words that they might use for query reformulation. Many experimental systems make the mistake of requiring the user to look at large amounts of helper information, such as query refinement suggestions or category labels, before viewing results directly.

Taking [query term] order and proximity into account ... [in ranking can] improve the results without confusing users despite the fact that they may not be aware of or understand those transformations .... In general, proximity information can be quite effective at improving precision of searches (Hearst, 1996, Clarke et al., 1996, Tao and Zhai, 2007).

Research shows that people are highly likely to revisit information they have viewed in the past and to re-issue queries that they have written in the past (Jones et al., 2002, Milic-Frayling et al., 2004) .... A good search history interface should substantially improve the search experience for users.

Recent work has explored how to use implicit relevance judgements from multiple users to improve search results rankings ... [For example] Joachims et al., 2005 conducted experiments to assess the reliability of clickthrough data ... [and found] several new effective ways for generating relative signals from this implicit information ... Agichtein et al., 2006b built upon this work and showed even more convincingly that clickthrough and other forms of implicit feedback are useful when gathered across large numbers of users.

A general rule of thumb for search usability is to avoid showing the user empty results sets .... eBay introduced an interesting ... [technique where] when no results are found for a query ... the searcher is shown a view indicating how many results would be brought back if only k out of n terms were included in the query.

The wording above the text box can influence the kind of information that searchers type in .... A hint [in] the search box [can] indicate what kind of search the user should do .... Short query forms lead to short queries.

It is important not to force the user to make selections before offering a search box.

Graphical or numerical displays of relevance scores have fallen out of favor ... [Studies] tend to find that users do not prefer them.

Stemming is useful, but removing stopwords can be hazardous.

Diversity... [of] the first few results displayed ... [is] important for ambiguous queries.
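To make the clickthrough excerpt above concrete, here is a minimal sketch of the well-known "clicked > skipped above" heuristic from Joachims' work, which turns a click log into relative relevance preferences; the log here is invented:

def preference_pairs(ranked_results, clicked):
    # A clicked result is taken as preferred over every unclicked
    # result that was ranked above it.
    pairs = []
    for position, result in enumerate(ranked_results):
        if result in clicked:
            for above in ranked_results[:position]:
                if above not in clicked:
                    pairs.append((result, above))
    return pairs

ranking = ["doc_a", "doc_b", "doc_c", "doc_d"]
clicks = {"doc_c"}  # the user skipped doc_a and doc_b, then clicked doc_c
for preferred, skipped in preference_pairs(ranking, clicks):
    print(preferred, "preferred over", skipped)

Pairs like these, aggregated across many users, are exactly the kind of relative signal that can train a ranker without any explicit relevance judgments.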
Three ideas -- universal search including multimedia, faceted search, and implicit personalization -- appear to be helpful only in some cases:
Web search engines are increasingly blending search results from multiple information sources ... Multimedia results [are] best placed a few positions down in the search results list ... When placed just above the "fold" (above where scrolling is needed) they can increase clickthrough.

Eye-tracking studies suggest that even when placed lower down, an image often attracts the eye first (Hotchkiss et al., 2007). It is unclear if information-rich layouts ... are desirable or if this much information is too overwhelming for users on a daily basis.

Hierarchical faceted metadata ... [allows] users to browse information collections according to multiple categories simultaneously ... [by selecting] a set of category hierarchies, each of which corresponds to a different facet (dimension or feature type) ... Most documents discuss several different topics simultaneously ... Faceted metadata provides a usable solution to the problems with navigation of strict hierarchies ... A disadvantage of category systems is that they require the categories to be assigned by hand or by an algorithm.

Usability results suggest that this kind of interface is highly usable for navigation of information collections with somewhat homogeneous content (English et al., 2001, Hearst et al., 2002, Yee et al., 2003). [People] like and are successful using hierarchical faceted metadata for navigating information collections, especially for browsing tasks ... [but] there are some deficiencies ... If the facets do not reflect a user's mental model of the space, or if items are not assigned facet labels appropriately, the interface will suffer ... The facets should not be too wide nor too deep ... and the interface must be designed very carefully to avoid clutter, dead ends, and confusion. This kind of interface is heavily used on Web sites today, including shopping and specialized product sites, restaurant guides, and online library catalogs.

One site that had some particularly interesting design choices is the eBay Express online shopping interface ... The designers determined in advance which subset of facets were of most interest to most users for each product type (shoes, art, etc.), and initially exposed only [those] ... After the user selected a facet, one of the compressed facets from the list below was expanded and moved up ... [They also] had a particularly interesting approach to handling keyword queries. The system attempted to map the user-entered keywords into the corresponding facet label, and simply added that label to the query breadcrumb. For example, a search on "Ella Fitzgerald" created a query consisting of the Artists facet selected with the Ella Fitzgerald label. Search within results was accomplished by nesting an entry form within the query region.

Most personalization efforts make use of preference information that is implicit in user actions .... A method of gathering implicit preference information that seems especially potent is recording which documents the user examines while trying to complete an extended search task ... The information "trails" that users leave behind them as a side-effect of doing their tasks have been used to suggest term expansions (White et al., 2005, White et al., 2007), automatically re-rank search results (Teevan et al., 2005b), predict next moves (Pitkow and Pirolli, 1999), make recommendations of related pages (Lieberman, 1995), and determine user satisfaction (Fox et al., 2005) .... Individual-based personalized rankings seem to work best on highly ambiguous queries.
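As a concrete sketch of the faceted navigation mechanics described above: one minimal representation stores each facet as a map from labels to sets of item ids and intersects the user's selections. The tiny catalog below is invented, echoing the Ella Fitzgerald example:

facets = {
    "Artist": {"Ella Fitzgerald": {1, 2}, "Louis Armstrong": {2, 3}},
    "Format": {"CD": {1, 3}, "Vinyl": {2}},
    "Decade": {"1950s": {1, 2}, "1960s": {3}},
}
all_items = {1, 2, 3}

def browse(selections):
    # Return item ids matching every (facet, label) selection.
    matching = set(all_items)
    for facet, label in selections:
        matching &= facets[facet][label]
    return matching

def facet_counts(matching):
    # Preview how many of the current results each label would keep,
    # the counts faceted interfaces show beside each link.
    return {facet: {label: len(ids & matching)
                    for label, ids in labels.items()}
            for facet, labels in facets.items()}

results = browse([("Artist", "Ella Fitzgerald"), ("Format", "Vinyl")])
print(results)                # {2}
print(facet_counts(results))  # counts guide the next refinement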
Several search interfaces, despite being popular in the literature, repeatedly have been shown to either hurt or yield no improvement in usability when applied to web search. Here are brief excerpts on boolean queries, thumbnails of result pages, clustering, pseudo-relevance feedback, explicit personalization, and visualizations of query refinements and search results:
Studies have shown time and again that most users have difficulty specifying queries in Boolean format and often misjudge what the results will be ... Boolean queries ... strict interpretation tends to yield result sets that are either too large, because the user includes many terms in a disjunct, or are empty, because the user conjoins terms in an effort to reduce the result set.

This problem occurs in large part because the user does not know the contents of the collection or the role of terms within the collection ... Most people find the basic semantics counter-intuitive. Many English-speaking users assume everyday meanings are associated with Boolean operators when expressed using the English words AND and OR, rather than their logical equivalents ... Most users are not familiar with the use of parentheses for nested evaluation, nor with the notions associated with operator precedence.

Despite the generally poor usability of Boolean operators, most search engines support [the] notation.

One frequently suggested idea is to show search results as thumbnail images ... but [no attempts] have shown a proven advantage for search results viewing .... The downsides of thumbnails are that the text content in the thumbnails is difficult to see, and text-heavy pages can be difficult to distinguish from one another. Images also take longer to generate and download than text .... The extreme sensitivity of searchers to delays of even 0.5 seconds suggests that such highly interactive and visual displays need to have a clear use-case advantage over simple text results before they will succeed.

In document clustering, similarity is typically computed using associations and commonalities among features, where features are usually words and phrases (Cutting et al., 1992); the centroids of the clusters determine the themes in the collections .... Clustering methods ... [are] fully automatable, and thus applicable to any text collection, but ... [have poor] consistency, coherence, and comprehensibility.

Despite its strong showing in artificial or non-interactive search studies, the use of classic relevance feedback in search engine interfaces is still very rare (Croft et al., 2001, Ruthven and Lalmas, 2003), suggesting that in practice it is not a successful technique. There are several possible explanations for this. First, most of the earlier evaluations assumed that recall was important, and relevance feedback's strength mainly comes from its ability to improve recall. High recall is no longer the standard assumption when designing and assessing search results; in more recent studies, the ranking is often assessed on the first 10 search results. Second, relevance feedback results are not consistently beneficial; these techniques help in many cases but hurt results in other cases (Cronen-Townsend et al., 2004, Marchionini and Shneiderman, 1988, Mitra et al., 1998a). Users often respond negatively to techniques that do not produce results of consistent quality. Third, many of the early studies were conducted on small text collections. The enormous size of the Web makes it more likely that the user will find relevant results with fewer terms than is the case with small collections. And in fact there is evidence that relevance feedback results do not significantly improve over web search engine results (Teevan et al., 2005b).

But probably the most important reason for the lack of uptake of relevance feedback is that the method requires users to make relevance judgements, which is an effortful task (Croft et al., 2001, Ruthven and Lalmas, 2003) ... Users often struggle to make relevance judgements (White et al., 2005), especially when they are unfamiliar with the domain (Vakkari, 2000b, Vakkari and Hakala, 2000, Spink et al., 1998) ... The evidence suggests it is more cognitively taxing to mark a series of relevance judgements than to scan a results listing and type in a reformulated query.

The evidence suggests that manual creation of ... [personalization] profiles does not work very well. [In] Yang and Jeh, 2006 ... [most] participants ... [said it] required too much effort ... Several researchers have studied ... [instead] allowing users to modify ... profiles after they are created by or augmented by machine learning algorithms. Unfortunately, the outcome of these studies tends to be negative. For example ... Ahn et al., 2007 examined whether allowing users to modify a machine-generated profile for news recommendations could improve the results ... [and] found that allowing them to adjust the profiles significantly worsened the results.

Applying visualization to textual information is quite challenging ... When reading text, one is focused on that task; it is not possible to read and visually perceive something else at the same time. Furthermore, the nature of text makes it difficult to convert it to a visual analogue.

Many text visualizations have been proposed that place icons representing documents on a 2-dimensional or 3-dimensional spatial layout ... Adjacency on maps like these is meant to indicate semantic similarity along an abstract dimension, but this dimension does not have a spatial analogue that is easily understood. Usability results for such displays tend not to be positive.

Nodes-and-link diagrams, also called network graphs, can convey relationships ... [but] do not scale well to large sizes -- the nodes become unreadable and the links cross into a jumbled mess. Another potential problem with network graphs is that there is evidence that lay users are not particularly comfortable with nodes-and-links views (Viégas and Donath, 2004) ... [and they] have not been shown to work well to aid the standard search process .... Kleiboemer et al., 1996 [for example found] that graphical depictions (representing clusters with circles and lines connecting documents) were much harder to use than textual representations ... [and] Swan and Allan, 1998 implement ... [a] node-and-link networks based on inter-document similarity ... [but] results of a usability study were not positive.

Applications of visualization to general search have not been widely accepted to date, and few usability results are positive. For example, Chen and Yu, 2000 conducted a meta-analysis of information visualization usability studies ... [and] found no evidence that visualization improved search performance. This is not to say that advanced visual representations cannot help improve search; rather that there are few proven successful ideas today.
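For reference, the clustering approach the book describes can be sketched in a few lines: group documents by word overlap, then label each cluster by its centroid's top terms. The toy documents and greedy one-pass algorithm below are illustrative only; real systems use tf-idf weighting and stronger algorithms:

from collections import Counter

docs = ["jaguar car speed engine", "engine car road speed",
        "jaguar cat jungle prey", "cat jungle animal prey"]

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm_u = sum(c * c for c in u.values()) ** 0.5
    norm_v = sum(c * c for c in v.values()) ** 0.5
    return dot / (norm_u * norm_v)

# Greedy one-pass clustering: join the most similar existing cluster
# if it is similar enough, else start a new cluster.
clusters = []  # each cluster is {"centroid": Counter, "docs": [str]}
for doc in docs:
    v = Counter(doc.split())
    best = max(clusters, key=lambda c: cosine(v, c["centroid"]), default=None)
    if best is not None and cosine(v, best["centroid"]) > 0.3:
        best["docs"].append(doc)
        best["centroid"] += v  # centroid accumulates term counts
    else:
        clusters.append({"centroid": Counter(v), "docs": [doc]})

for cluster in clusters:
    theme = [word for word, _ in cluster["centroid"].most_common(2)]
    print(theme, cluster["docs"])

The centroids' top terms become the cluster "themes", which is exactly where the consistency and comprehensibility problems the book mentions tend to appear.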
Finally, three ideas -- social search, dialogue-based interfaces, and sensemaking -- may be fertile ground:
[An] idea that has been investigated numerous times is that of allowing users to explicitly comment on or change the ranking produced by the search engine ... [For example] Google has recently introduced SearchWiki, which allows the user to move a search hit to the top of the rankings, remove it from the rankings, and comment on the link, and the actions are visible to other users of the system ... Experimental results on this kind of system have not been strongly positive in the past (Teevan et al., 2005b), but have not been tried on a large scale in this manner.

Another variation on the idea of social ranking is to promote web pages that people in one's social network have rated highly in the past, as seen in the Yahoo MyWeb system ... Small studies have suggested that using a network of one's peers can act as a kind of personalization to bias some search results to be more effective (Joachims, 2002, Mislove et al., 2006).

There [also] is an increasing trend in HCI to examine how to better support collaboration among users of software systems, and this has recently extended to collaborative or cooperative search. At least three studies suggest that people often work together when performing searches, despite the limitations of existing software (Twidale et al., 1997, Morris, 2008, Evans and Chi, 2008) ... Pickens et al., 2008 ... found much greater gains in collaboration on difficult tasks than on simple ones.

Dialogue-based interfaces have been explored since the early days of information retrieval research, in an attempt to mimic the interaction provided by a human search intermediary (e.g., a reference librarian) ... Dialogue-style interactions have not yet become widely used, most likely because they are still difficult to develop for robust performance.

[We can] divide the entire information access process into two main components: information retrieval through searching and browsing, and analysis and synthesis of results. This [second] process is often referred to ... as sensemaking.

The standard Web search interface does not do a good job of supporting the sensemaking process .... A more supportive search tool would ... help [people] keep track of what they had already viewed ... suggest what to look for next ... find additional documents similar to those already found ... allow for aliasing of terms and concepts ... [and] flexibly arrange, re-arrange, group, and name and re-name groups of information ... A number of research and commercial tools have been developed that attempt to mimic physical arrangement of information items in a virtual representation.
I only touched on the material in the book here. There is an interesting section on mobile search, more surveys of past academic work, screen shots from various clever but crazy visualization interfaces, and many other worthwhile goodies in the full text.

Marti was kind enough to make her book available free online -- a great resource -- but this is too good a book for a casual skim. I'd recommend picking up a copy so you can sit down for a more thorough read.

If you think you might enjoy Marti's new book, I also can't recommend strongly enough the recently published "Introduction to Information Retrieval". Earlier this year, I posted a review of that book with extended excerpts.

[Full disclosure: I offered comments on a draft of one of the chapters in Marti's book prior to publication; I have no other involvement.]

Monday, September 14, 2009

Experiments and performance at Google and Microsoft

Although people from Google and Microsoft frequently appear together at conferences, it is fairly rare to see them publicly debate technology and technique. A recent talk on A/B testing at Seattle Tech Startups is a fun exception.

In the barely viewable video of the talk, the action starts in the Q&A around 1:28:00. The two presenters, Googler Sandra Cheng and Microsoft's Ronny Kohavi, aggressively debate the importance of performance when running weblabs, with others chiming in as well. Oddly, it appears to be Microsoft, not Google, arguing for faster performance.

Making this even more amusing is that both Sandra and Ronny cut their experimenting teeth at Amazon.com. Sandra Cheng now is the product manager in charge of Google Website Optimizer. Ronny Kohavi now runs the experimentation team at Microsoft. Amazon is Experimentation U, it seems.

By the way, if you have not seen it, the paper "Online Experimentation at Microsoft" (PDF) that was presented at a workshop at KDD 2009 has great tales of experimentation woe at the Redmond giant. Section 7 on "Cultural Challenges" particularly is worth a read.
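For readers who have not run weblabs, the mechanism at the heart of these systems is simple. Here is a sketch of deterministic bucketing, hashing a user id so each user consistently sees one variant; this is a common approach in general, not a description of either company's actual system:

import hashlib

def assign(user_id, experiment, treatment_percent=50):
    # Hash (experiment, user) so assignment is stable across visits
    # and independent across experiments.
    digest = hashlib.md5(("%s:%s" % (experiment, user_id)).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_percent else "control"

print(assign("user_12345", "faster_results_page"))  # stable across calls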

Friday, September 11, 2009

Google AdWords now personalized

It has been a long time coming, but Google finally started personalizing their AdWords search advertising to the past behavior of searchers:
When determining which ads to show on a Google search result page, the AdWords system evaluates some of the user's previous queries during their search session as well as the current search query. If the system detects a relationship, it will show ads related to these other queries, too.

It works by generating similar terms for each search query based on the content of the current query and, if deemed relevant, the previous queries in a user's search session.
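Mechanically, that description suggests something like the sketch below, pooling terms from the current query with recent queries in the same session when matching ads. This is my guess at a simplified version of the idea, not Google's actual system:

# Hypothetical ad keyword sets, invented for illustration.
ads = {
    "running shoes sale": {"running", "shoes", "sneakers"},
    "marathon training plan": {"marathon", "training", "running"},
}

def select_ads(session_queries, current_query, min_overlap=2):
    # Pool terms from the current query and a few recent session queries.
    terms = set(current_query.lower().split())
    for past_query in session_queries[-3:]:
        terms |= set(past_query.lower().split())
    return [ad for ad, keywords in ads.items()
            if len(keywords & terms) >= min_overlap]

session = ["running gear reviews"]
print(select_ads(session, "best shoes"))
# The earlier "running gear" query lets the running-shoes ad match,
# even though "best shoes" alone would not have.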
There have been hints of this coming for some time. Last year, there were suggestions that this feature was being A/B tested. Earlier, Google did a milder form of personalized ad targeting if the immediately previous query could be found in the referrer. Now, they finally have rolled out real personalized advertising using search history to everyone.

For more on Google's personalized advertising efforts, you might be interested in my earlier post, "Google launches personalized advertising", on the interest-based behavioral targeted advertising Google recently launched for AdSense.

Please see also my July 2007 post, "What to advertise when there is no commercial intent?"

Thursday, September 10, 2009

Rapid releases and rethinking software engineering

I have a new post up at blog@CACM, "Frequent releases change software engineering", on why software companies should consider deploying software much more frequently than they usually do.

Here is an excerpt, the last two paragraphs, as a teaser:
Frequent releases are desirable because of the changes they force in software engineering. They discourage risky, expensive, large projects. They encourage experimentation, innovation, and rapid iteration. They reduce the cost of failure while also minimizing the risk of failure. Releasing frequently is a better way to build software.

The constraints on software deployment have changed. Our old assumptions on the cost, consistency, and speed of software deployments no longer hold. It is time to rethink how we do software engineering.
This CACM article expands on some of the discussion in an earlier post, "The culture at Netflix", on this blog and in the comments to that post. By the way, if you have not yet seen Reed Hastings' slides on the culture at Netflix, they are worth a look.