Context Clusters and Query Suggestions at Google
A new patent application from Google tells us about how the search engine may use context to find query suggestions before a searcher has completed typing in a full query. After seeing this patent, I’ve been thinking about previous patents I’ve seen from Google that have similarities.
It’s not the first time I’ve written about a Google Patent involving query suggestions. I’ve written about a couple of other patents that were very informative, in the past:
- 6/10/2016 – Google Entity Search Suggestions Patent (Associating an entity with a search query)
- 5/26/2010 – How a Search Engine Might Identify Possible Query Suggestions (Generating query suggestions using contextual information)
In both of those, the inclusion of entities in a query impacted the suggestions that were returned. This patent takes a slightly different approach, by also looking at context.
Context Clusters in Query Suggestions
We’ve been seeing the word “context” spring up in Google patents recently. We have seen context terms from knowledge bases appearing on pages that focus on the same query term with different meanings, and we have also seen pages about specific people handled with a disambiguation approach. While these were recent, I did blog about a paper in 2007 that discusses query context and has an author from Yahoo. The paper was Using Query Contexts in Information Retrieval. The abstract from the paper provides a good glimpse into what it covers:
User query is an element that specifies an information need, but it is not the only one. Studies in literature have found many contextual factors that strongly influence the interpretation of a query. Recent studies have tried to consider the user’s interests by creating a user profile. However, a single profile for a user may not be sufficient for a variety of queries of the user. In this study, we propose to use query-specific contexts instead of user-centric ones, including context around query and context within query. The former specifies the environment of a query such as the domain of interest, while the latter refers to context words within the query, which is particularly useful for the selection of relevant term relations. In this paper, both types of context are integrated in an IR model based on language modeling. Our experiments on several TREC collections show that each of the context factors brings significant improvements in retrieval effectiveness.
The Google patent doesn’t take a user-based approach either, but does look at some user contexts and interests. It sounds like searchers might be offered a chance to select a context cluster before being shown query suggestions:
In some implementations, a set of queries (e.g., movie times, movie trailers) related to a particular topic (e.g., movies) may be grouped into context clusters. Given a context of a user device for a user, one or more context clusters may be presented to the user when the user is initiating a search operation, but prior to the user inputting one or more characters of the search query. For example, based on a user’s context (e.g., location, date and time, indicated user preferences and interests), when a user event occurs indicating the user is initiating a process of providing a search query (e.g., opening a web page associated with a search engine), one or more context clusters (e.g., “movies”) may be presented to the user for selection input prior to the user entering any query input. The user may select one of the context clusters that are presented and then a list of queries grouped into the context cluster may be presented as options for a query input selection.
I often look up the inventors of patents to get a sense of what else they may have written, and worked upon. I looked up Jakob D. Uszkoreit in LinkedIn, and his profile doesn’t surprise me. He tells us there of his experience at Google:
Previously I started and led a research team in Google Machine Intelligence, working on large-scale deep learning for natural language understanding, with applications in the Google Assistant and other products.
This passage reminded me of the search results being shown to me by the Google Assistant, which are based upon interests that I have shared with Google over time, and that Google allows me to update from time to time. If the inventor of this patent worked on Google Assistant, that doesn’t surprise me. I haven’t been offered context clusters yet, and wouldn’t know what those might look like if Google did offer them. I suspect that if Google does start offering them, I will recognize them when they are offered to me.
Like many patents do, this one tells us what is “innovative” about it. It looks at:
…query data indicating query inputs received from user devices of a plurality of users, the query data also indicating an input context that describes, for each query input, an input context of the query input that is different from content described by the query input; grouping, by the data processing apparatus, the query inputs into context clusters based, in part, on the input context for each of the query inputs and the content described by each query input; determining, by the data processing apparatus, for each of the context clusters, a context cluster probability based on respective probabilities of entry of the query inputs that belong to the context cluster, the context cluster probability being indicative of a probability that at least one query input that belongs to the context cluster and provided for an input context of the context cluster will be selected by the user; and storing, in a data storage system accessible by the data processing apparatus, data describing the context clusters and the context cluster probabilities.
It also tells us that it will calculate probabilities that certain context clusters might be requested by a searcher. So how does Google know what to suggest as context clusters?
Each context cluster includes a group of one or more queries, the grouping being based on the input context (e.g., location, date and time, indicated user preferences and interests) for each of the query inputs, when the query input was provided, and the content described by each query input. One or more context clusters may be presented to the user for input selection based on a context cluster probability, which is based on the context of the user device and respective probabilities of entry of the query inputs that belong to the context cluster. The context cluster probability is indicative of a probability that at least one query input that belongs to the context cluster will be selected by the user. Upon selection of one of the context clusters that is presented to the user, a list of queries grouped into the context cluster may be presented as options for a query input selection. This advantageously results in individual query suggestions for query inputs that belong to the context cluster but that alone would not otherwise be provided due to their respectively low individual selection probabilities. Accordingly, users’ informational needs are more likely to be satisfied.
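To make that last point concrete, here is a minimal Python sketch of how a cluster could be surfaced even when none of its queries would be suggested on its own. The patent only says that the context cluster probability is based on the probabilities of the queries that belong to the cluster. The independence assumption, the 0.3 threshold, the function name, and the example probabilities below are my own illustrative assumptions, not figures or methods from the patent.

```python
from math import prod

def cluster_probability(query_probs):
    """Probability that at least one query in the cluster is selected,
    assuming (hypothetically) that the queries are independent."""
    return 1.0 - prod(1.0 - p for p in query_probs.values())

# Hypothetical per-query selection probabilities for one user context.
movies_cluster = {
    "movie times": 0.15,
    "movie trailers": 0.10,
    "movie reviews": 0.08,
    "movies near me": 0.12,
}

THRESHOLD = 0.3  # assumed cut-off for showing a suggestion

# No single query clears the threshold on its own...
individually_shown = [q for q, p in movies_cluster.items() if p >= THRESHOLD]

# ...but the cluster as a whole does.
cluster_p = cluster_probability(movies_cluster)
cluster_shown = cluster_p >= THRESHOLD
```

With these made-up numbers, no individual query would be suggested, yet the “movies” cluster would be, which is the advantage the passage above describes.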
The patent application is:
(US20190050450) Query Composition System
Publication Number: 20190050450
Publication Date: February 14, 2019
Applicants: Google LLC
Inventors: Jakob D. Uszkoreit
Methods, systems, and apparatus for generating data describing context clusters and context cluster probabilities, wherein each context cluster includes query inputs based on the input context for each of the query inputs and the content described by each query input, and each context cluster probability indicates a probability that at a query input that belongs to the context cluster will be selected by the user, receiving, from a user device, an indication of a user event that includes data indicating a context of the user device, selecting as a selected context cluster, based on the context cluster probabilities for each of the context clusters and the context of the user device, a context cluster for selection input by the user device, and providing, to the user device, data that causes the user device to display a context cluster selection input that indicates the selected context cluster for user selection.
What are Context Clusters as Query Suggestions?
The patent tells us that context clusters might be triggered when someone is starting a query on a web browser. I tried it out, starting a search for “movies” and got a number of suggestions that were combinations of queries, or what seem to be context clusters:
The patent says that context clusters would appear before someone began typing, based upon topics and user information such as location. So, if I were at a shopping mall that had a movie theatre, I might see Search suggestions for movies like the ones shown here:
One of those clusters involved “Movies about Business”, which I selected, and it showed me a carousel, and buttons with subcategories to also choose from. This seems to be a context cluster:
This seems to be a pretty new idea, and may be something that Google would announce as an available option if and when it launches, much like they did with the Google Assistant. I usually check through the news from my Google Assistant at least once a day. If it starts offering search suggestions based upon things like my location, it could potentially be very interesting.
User Query Histories
The patent tells us that context clusters selected to be shown to a searcher might be based upon previous queries from a searcher, and provides the following example:
Further, a user query history may be provided by the user device (or stored in the log data) that includes queries and contexts previously provided by the user, and this information may also factor into the probability that a user may provide a particular query or a query within a particular context cluster. For example, if the user that initiates the user event provides a query for “movie show times” many Friday afternoons between 4 PM-6 PM, then when the user initiates the user event on a Friday afternoon in the future between these times, the probability associated with the user inputting “movie show times” may be boosted for that user. Consequentially, based on this example, the corresponding context cluster probability of the context cluster to which the query belongs may likewise be boosted with respect to that user.
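The Friday afternoon example could be sketched in Python along these lines. The patent does not say how the boost is calculated, so the additive rule, the one-hour window, the weight, the cap, and both function names are hypothetical choices of mine, used only to show the idea of a history-based boost.

```python
from datetime import datetime

def history_boost(history, query, now, window_hours=1, weight=0.05, cap=0.5):
    """Return an additive boost for `query` based on how often the user
    issued it on the same weekday around the same hour (hypothetical rule)."""
    matches = sum(
        1
        for past_query, ts in history
        if past_query == query
        and ts.weekday() == now.weekday()
        and abs(ts.hour - now.hour) <= window_hours
    )
    return min(cap, weight * matches)

def boosted_probability(base_prob, boost):
    """Apply the boost and keep the result a valid probability."""
    return min(1.0, base_prob + boost)

# Hypothetical log: "movie show times" on three Friday afternoons.
history = [
    ("movie show times", datetime(2019, 2, 1, 16, 30)),
    ("movie show times", datetime(2019, 2, 8, 17, 5)),
    ("movie show times", datetime(2019, 2, 15, 16, 45)),
    ("weather", datetime(2019, 2, 12, 9, 0)),
]

now = datetime(2019, 2, 22, 17, 0)  # a later Friday, 5 PM
boost = history_boost(history, "movie show times", now)
p = boosted_probability(0.10, boost)
```

Because the boosted query belongs to a context cluster, a higher value for `p` would also raise the probability for that cluster, as the passage notes.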
It’s not easy to tell whether the examples I provided about movies above are related to this patent or if they are tied more closely to the search results that appear in Google Assistant. It’s worth reading through the patent and thinking about potential experimental searches to see if they might influence the results that you may see. It is interesting that Google may attempt to anticipate what it shows us as query suggestions, after showing us search results based upon what it believes are our interests, drawn from searches that we have performed or interests that we have identified for Google Assistant.
The context cluster may be related to the location and time at which someone accesses the search engine. The patent provides an example of what might be seen by the searcher like this:
In the current example, the user may be in the location of MegaPlex, which includes a department store, restaurants, and a movie theater. Additionally, the user context may indicate that the user event was initiated on a Friday evening at 6 PM. Upon the user initiating the user event, the search system and/or context cluster system may access the content cluster data 214 to determine whether one or more context clusters is to be provided to the user device as an input selection based at least in part on the context of the user. Based on the context of the user, the context cluster system and/or search system may determine, for each query in each context cluster, a probability that the user will provide that query and aggregate the probability for the context cluster to obtain a context cluster probability.
In the current example, there may be four queries grouped into the “Movies” cluster, four queries grouped into the “Restaurants” cluster, and three queries grouped into the “Dept. Store” cluster. Based on the analysis of the content cluster data, the context cluster system may determine that the aggregate probability of the queries in each of the “Movies” cluster, “Restaurant” cluster, and “Dept. Store” cluster have a high enough likelihood (e.g., meet a threshold probability) to be input by the user, based on the user context, that the context clusters are to be presented to the user for selection input in the search engine web site.
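Here is a rough Python sketch of that MegaPlex example, with four, four, and three queries in the clusters the patent names. The probabilities, the 0.2 threshold, the “at least one query” aggregation under an independence assumption, and the extra “Car Repair” cluster are all invented by me for illustration; the patent only says the probabilities are aggregated and compared to a threshold.

```python
from math import prod

def aggregate(probs):
    # Assumed aggregation: chance that at least one query is entered,
    # treating the queries as independent.
    return 1.0 - prod(1.0 - p for p in probs)

def clusters_to_present(clusters, threshold):
    """Return (name, probability) pairs meeting the threshold, best first."""
    scored = {name: aggregate(probs) for name, probs in clusters.items()}
    kept = [(name, p) for name, p in scored.items() if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical query probabilities for the context "MegaPlex, Friday, 6 PM".
clusters = {
    "Movies": [0.20, 0.15, 0.10, 0.10],
    "Restaurants": [0.15, 0.10, 0.10, 0.05],
    "Dept. Store": [0.10, 0.10, 0.05],
    "Car Repair": [0.01, 0.01],  # invented cluster, expected to be filtered out
}

presented = clusters_to_present(clusters, threshold=0.2)
names = [name for name, _ in presented]
```

With these numbers the three clusters from the patent’s example clear the threshold and would be offered for selection, while the unlikely one is dropped.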
I could see running such a search at a shopping mall, to learn more about the location I was at, and what I could find there, from dining places to movies being shown. That sounds like it could be the start of an interesting adventure.
Copyright © 2019 SEO by the Sea ⚓. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at may be guilty of copyright infringement. Please contact SEO by the Sea, so we can take appropriate action immediately.
Red flags and “disputed” tags just entrenched people’s views about suspicious news articles, so Facebook is hoping to give readers a wide array of info so they can make their own decisions about what’s misinformation. Facebook will try showing links to a journalist’s Wikipedia entry, other articles, and a follow button to help users make up their mind about whether they’re a legitimate source of news. The test will show up to a subset of users in the U.S. if the author’s publisher has implemented Facebook’s author tags.
Meanwhile, Facebook is rolling out to everyone in the U.S. its test from October that gives readers more context about publications by showing links to their Wikipedia pages, related articles about the same topic, how many times the article has been shared and where, and a button for following the publisher. Facebook will also start to show whether friends have shared the article, and a snapshot of the publisher’s other recent articles.
Since much of this context can be algorithmically generated rather than relying on human fact checkers, the system could scale much more quickly to different languages and locations around the world.
These moves are designed to feel politically neutral to prevent Facebook from being accused of bias. After former contractors reported that they suppressed conservative Trending topics on Facebook in 2016, Facebook took a lot of heat for supposed liberal bias. That caused it to hesitate when fighting fake news before the 2016 Presidential election…and then spend the next two years dealing with the backlash for allowing misinformation to run rampant.
Facebook’s partnerships with outside fact checkers that saw red Disputed flags added to debunked articles actually backfired. Those sympathetic to the false narrative saw the red flag as a badge of honor, clicking and sharing anyway rather than allowing someone else to tell them they’re wrong.
That’s why today’s rollout and new test never confront users directly about whether an article, publisher, or author is propagating fake news. Instead Facebook hopes to build a wall of evidence as to whether a source is reputable or not.
If other publications have similar posts, the publisher or author has well-established Wikipedia articles to back up their integrity, and the publisher’s other articles look legit, users could draw their own conclusion that they’re worth believing. But if there are no Wikipedia links, other publications are contradicting them, no friends have shared it, and a publisher or author’s other articles look questionable too, Facebook might be able to incept the idea that the reader should be skeptical.
Context is everything when dealing with dialog systems. We humans take for granted how complex even our simplest conversations are. That’s part of the reason why dialog systems can’t live up to their human counterparts. But with an interactive learning approach and some open source love, Berlin-based Rasa is hoping to help enterprises solve their conversational AI problems.
Enterprise – TechCrunch
One of the limitations of information on the Web is that it is organized differently at each site on the Web. As a newly granted Google patent notes, there is no official catalog of information available on the internet, and each site has its own organizational system. Search engines exist to index information, but they have issues, as described in this new patent that make finding information challenging.
Limitations on Conventional Keyword-Based Search Engines
The patent granted to Google, in September of 2016, discusses a way to organize information on the Web in a manner which can help to better organize and index that information, using context vectors to better understand how words are being used. The patent describes limitations of search engines that are based upon indexing content using keywords, such as:
- A search engine working with conventional keyword searching will return all instances of the keyword being searched for, regardless of how that word is used on a site. This can be a lot of results.
- Conventional search engines may return only the home page of a site that contains the keyword. Finding where the keyword is used on the site could be difficult.
- Often a conventional search engine will return a list of URLs in response to a keyword search that may be difficult to modify or search further in a meaningful manner.
- Information obtained through a search can become dated quickly. Such information may need to be rechecked.
The patent tells us about those limitations and also points out some of the limitations of directories that could also be used to help find information. It then goes on to provide a possible solution to this problem, with a “data extraction tool” capable of providing many of the benefits of both search engines and directories, without the drawbacks that this patent points out.
Is this The Google Search Engine with RankBrain Inside?
A search engine based on a data extraction tool like the one described in the patent would be an improvement over most search engines. Is this Google’s search engine with RankBrain applied to it? It’s possible that it is, though the patent doesn’t use the word RankBrain.
The Bloomberg introduction to RankBrain, Google Turning Its Lucrative Web Search Over to AI Machines, provides information about the algorithm used in RankBrain, and it tells us:
RankBrain uses artificial intelligence to embed vast amounts of written language into mathematical entities — called vectors — that the computer can understand.
This new patent refers to what it calls Context Vectors to index content about words found on the Web. To put it clearly, the patent tells us:
In view of the foregoing, in accordance with the invention as embodied and broadly described herein, a method and apparatus are disclosed in one embodiment of the present invention for determining contexts of information analyzed. Contexts may be determined for words, expressions, and other combinations of words in bodies of knowledge such as encyclopedias. Analysis of use provides a division of the universe of communication or information into domains and selects words or expressions unique to those domains of subject matter as an aid in classifying information. A vocabulary list is created with a macro-context (context vector) for each, dependent upon the number of occurrences of unique terms from a domain, over each of the domains. This system may be used to find information or classify information by subsequent inputs of text, in calculation of macro-contexts, with ultimate determination of lists of micro-contests including terms closely aligned with the subject matter.
When a searcher submits a query to a search engine, we are told that the search engine may draw upon “other queries from the same user, the query associated with other information or query results from the same user, or other inputs related to that user” to give it more context.
The patent is:
User-context-based search engine
Inventors: David C. Taylor
Application Date: 09/04/2012
Grant Number: 09449105
Grant Date: 09/20/2016
A method and apparatus for determining contexts of information analyzed. Contexts may be determined for words, expressions, and other combinations of words in bodies of knowledge such as encyclopedias. Analysis of use provides a division of the universe of communication or information into domains and selects words or expressions unique to those domains of subject matter as an aid in classifying information. A vocabulary list is created with a macro-context (context vector) for each, dependent upon the number of occurrences of unique terms from a domain, over each of the domains. This system may be used to find information or classify information by subsequent inputs of text, in calculation of macro-contexts, with ultimate determination of lists of micro-contests including terms closely aligned with the subject matter.
When RankBrain was first announced, I found a patent that was co-invented by one of the members of the team that was working on it, that described how Google might provide substitutions for some query terms, based upon an understanding of the context of those terms and the other words used in a query. I wrote about that patent in the post, Investigating Google RankBrain and Query Term Substitutions. I think reading the patent that post is about, and the one that this post is about can be helpful in understanding some of the ideas behind a process such as RankBrain.
This patent does provide a lot of insights in explaining the importance of context and how helpful that can be to a system that may be attempting to extract data from a source and index that data in a way which makes it easier to locate. I liked this passage in particular:
Interestingly, some words in the English language, and other languages pertain to many different areas of subject matter. Thus, one may think of the universe of communication as containing numerous domains of subject matter. For example, the various domains in FIG. 2 refer to centers of meaning or subject matter areas. These domains are represented as somewhat indistinct clouds, in that they may accumulate a vocabulary of communication elements about them that pertain to them or that relate to them. Nevertheless, some of those same communication elements may also have application elsewhere. For example, a horse to a rancher is an animal. A horse to a carpenter is an implement of work. A horse to a gymnast is an implement on which to perform certain exercises. Thus, the communication element that we call “horse” belongs to, or pertains to, multiple domains.
A search engine that can identify the domains or contexts that a word might fit within may be able to better index such words, as described in this patent:
In an apparatus and method in accordance with the invention, a search engine process is developed that provides a deterministic method for establishing context for the communication elements submitted in a query. Thus, it is possible for a search engine now to determine to which domain or domains a communication element is “attracted.” Since few things are absolute, domains may actually overlap or be very close such that they may share certain communication elements. That is, communication elements do not “belong” to any domain, they are attracted to or have an affinity for various domains, and may have differing degrees of affinity for differing domains. One may think of this affinity as perhaps a goodness of fit or a best alignment or quality alignment with the subject matter of a particular domain.
Contextually Rewarding Search Results
The patent tells us that a search engine that works well is one that provides a searcher with information in response to a query that is “comparatively close related”. The best result is information that is exactly what has been sought. Next best is information that is close to what has been sought and is still useful. What would be “contextually unrewarding” is information that shares a word with the query but uses it in a completely different and useless context.
Words might be related to a wide range of particular fields or subject matter domains. The patent describes how these might be used:
Typically, a domain list of about 40 to 50 terms have been found to be effective. Some domain lists have been operated successfully in an apparatus and method in accordance with the invention with as few as 10 terms. Some domain lists may contain a few hundreds of individual terms. For example, some domains may justify about 300 terms. Although the method is deterministic, rather than statistical, it is helpful to have about 40 to 50 terms in the domain list in order to improve the efficiency of the calculations and determinations of the method.
The domain lists have utility in quickly identifying the particular domain to which their members pertain. This results from the lack of commonality of the terms and the lack of ambiguity as to domains to which they may have utility. By the same token, a list as small as the domain lists are necessarily limited when considering the overall vocabulary of communication elements available in any language. Thus, the terms in domain lists do not necessarily arise with the frequency that is most useful for rapid searching. That is, a word that is unique to a particular subject matter domain, but infrequently used, may not arise in very many queries submitted to a search engine.
A process for creating a vocabulary list of a substantial universe or a substantial portion of a universe of communication elements may be performed by identifying a body or corpus of information organized by topical entries. Thereafter, the text of each of those entries identified may be subjected to a counting process in which occurrences of terms from the domain list occur within each of the topical entries. Ultimately, a calculation of a macro context may be made for each of the topical entries. This calculation is based on the domain lists, and the domains represented thereby.
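The counting step described above might look something like this minimal Python sketch. The three domain lists are hypothetical and heavily shortened (the patent suggests about 40 to 50 terms each), the sample entry text is invented, and the function name `macro_context` is my own label for the calculation, not an identifier from the patent.

```python
import re
from collections import Counter

# Hypothetical, heavily shortened domain lists.
DOMAIN_LISTS = {
    "ranching": {"saddle", "pasture", "bridle", "herd"},
    "carpentry": {"lumber", "sawdust", "chisel", "plank"},
    "gymnastics": {"pommel", "vault", "dismount", "routine"},
}

def macro_context(entry_text, domain_lists=DOMAIN_LISTS):
    """Count occurrences of each domain's terms in a topical entry."""
    tokens = Counter(re.findall(r"[a-z]+", entry_text.lower()))
    return {
        domain: sum(tokens[term] for term in terms)
        for domain, terms in domain_lists.items()
    }

# Invented text standing in for one topical entry from a corpus.
entry = (
    "The pommel horse is a gymnastics apparatus. A routine on the pommel "
    "horse ends with a dismount. A sawhorse, by contrast, supports a plank."
)
vector = macro_context(entry)
```

The resulting dictionary is a simple stand-in for a context vector: one count per domain, showing that this entry leans toward gymnastics while still touching carpentry.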
This is where this patent enters into the world of the Semantic Web. The places where different subject matter domains may be identified for different words could be in knowledge bases or online encyclopedias. Such collections of what is referred to as public knowledge might be called a “corpus”. This kind of corpus of information could be used to create a context vector used to index different meanings of words.
When a different meaning is found, it might then be counted from that information corpus. The patent tells us that terms found in such a place could be “individual words, terms, expressions, phrases, and so forth.”
The patent attempts to put this into context for us with this statement:
One may think of a topical entry as a vocabulary term. That is, every topical entry is a vocabulary word, expression, place, person, etc. that will be added to the overall vocabulary. That is, for example, the universe may be divided into about 100 to 120 domains for convenient navigation. Likewise, the domain lists may themselves contain from about 10 to about 300 select terms each. By contrast, the topical entries that may be included in the build of a vocabulary list may include the number of terms one would find in a dictionary such as 300 to 800,000. Less may be used, and conceivably more. Nevertheless, unabridged dictionaries and encyclopedias typically have on this order of numbers of entries.
Contexts as Vectors
When RankBrain first came out, there was a post published that looked at some information that might make it a little more understandable; it included some information about Geoffrey Hinton’s Thought Vectors, and there’s more about those in this post from Jennifer Slegg: RankBrain: Everything We Know About Google’s AI Algorithm.
There is a Google Open Source Blog post on Word Vectors which is closely related, titled Learning the meaning behind words, written by Tomas Mikolov, Ilya Sutskever, and Quoc Le. Ilya Sutskever was a student of Geoffrey Hinton. Tomas Mikolov worked on a number of papers about word vectors while with the Google Brain team, including Efficient Estimation of Word Representations in Vector Space.
The patent spends a fair amount of time describing what it considers context vectors to be: the different domains which a word might fall into, and the number of occurrences or weights for those words within those domains. It’s worth drilling down into the patent and reading about how a search engine might assign context vectors to terms.
When a searcher enters a query into a search engine to be searched, the query may be classified within contexts, to help in selecting information in response to that query.
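One simple way to picture classifying a query within contexts is to add up the context vectors of its terms and see which domain scores highest. The Python sketch below does that. The vocabulary weights are invented, and summing the vectors is my own assumption about how the classification could work, since the patent passage above doesn’t spell out the calculation.

```python
# Hypothetical context vectors: term -> weight per domain.
VOCABULARY = {
    "horse": {"ranching": 5, "carpentry": 2, "gymnastics": 3},
    "saddle": {"ranching": 6, "carpentry": 0, "gymnastics": 0},
    "vault": {"ranching": 0, "carpentry": 0, "gymnastics": 7},
}

def query_context(query, vocabulary=VOCABULARY):
    """Sum the context vectors of the query terms found in the vocabulary."""
    totals = {}
    for term in query.lower().split():
        for domain, weight in vocabulary.get(term, {}).items():
            totals[domain] = totals.get(domain, 0) + weight
    return totals

def best_domain(query, vocabulary=VOCABULARY):
    """Return the highest-scoring domain, or None if no term is known."""
    totals = query_context(query, vocabulary)
    return max(totals, key=totals.get) if totals else None
```

In this toy setup, `best_domain("horse saddle")` points to ranching and `best_domain("horse vault")` points to gymnastics, so the same word “horse” is pulled toward different domains by the words around it.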
Using a Browser Helper Object
The patent describes how it might identify different domains that might be associated with specific terms. It tells us that this might be done:
By compiling a list of domain-specific questions, it is possible to (1) specify differences between very similar domains with great precision, and (2) create a rapid way to prototype a domain that does not require many hours of an expert’s time, and can be expanded by relatively inexperienced people.
The patent also describes the use of a BHO (Browser Helper Object) in this manner:
Another slightly more complex implementation is something like a Browser Helper Object (BHO) that runs on the user’s machine and watches/categorizes all surfing activity. With this system, even non-participating sites can contribute to the picture of the user, and any clicking the user does to ad sites served by certified clicks will pick up a much more comprehensive picture.
The patent provides more details on how this contextual vector based system might work, and how data might be extracted from web pages. It is highly recommended reading if you want to get a better sense of how a context-based system might be used to index the web and to make specific information on the Web easier to find, improving upon most conventional keyword-based search engines.
Copyright © 2016 SEO by the Sea. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at may be guilty of copyright infringement. Please contact SEO by the Sea, so we can take appropriate action immediately.