
Quality Scores for Queries: Structured Data, Synthetic Queries and Augmentation Queries

July 31, 2018

Augmentation Queries

In general, the subject matter of this specification relates to identifying or generating augmentation queries, storing the augmentation queries, and identifying stored augmentation queries for use in augmenting user searches. An augmentation query can be a query that performs well in locating desirable documents identified in the search results. The performance of the query can be determined by user interactions. For example, if many users that enter the same query often select one or more of the search results relevant to the query, that query may be designated an augmentation query.

In addition to actual queries submitted by users, augmentation queries can also include synthetic queries that are machine generated. For example, an augmentation query can be identified by mining a corpus of documents and identifying search terms for which popular documents are relevant. These popular documents can, for example, include documents that are often selected when presented as search results. Yet another way of identifying an augmentation query is mining structured data, e.g., business telephone listings, and identifying queries that include terms of the structured data, e.g., business names.

These augmentation queries can be stored in an augmentation query data store. When a user submits a search query to a search engine, the terms of the submitted query can be evaluated and matched to terms of the stored augmentation queries to select one or more similar augmentation queries. The selected augmentation queries, in turn, can be used by the search engine to augment the search operation, thereby obtaining better search results. For example, search results obtained by a similar augmentation query can be presented to the user along with the search results obtained by the user query.

This past March, Google was granted a patent that involves giving quality scores to queries (the quote above is from that patent). The patent refers to high-scoring queries as augmentation queries. It is interesting to see that searcher selection is one of the signals that might be used to determine the quality of queries. So, when someone searches, Google may compare the SERPs returned for the original query against augmentation query results based upon previous searches using the same query terms, or upon synthetic queries. This evaluation against augmentation queries is based upon which search results have received more clicks in the past. Google may decide to add results from an augmentation query to the results for the original query to improve the overall search results.

How does Google find augmentation queries? One place to look for those is in query logs and click logs. As the patent tells us:

To obtain augmentation queries, the augmentation query subsystem can examine performance data indicative of user interactions to identify queries that perform well in locating desirable search results. For example, augmentation queries can be identified by mining query logs and click logs. Using the query logs, for example, the augmentation query subsystem can identify common user queries. The click logs can be used to identify which user queries perform best, as indicated by the number of clicks associated with each query. The augmentation query subsystem stores the augmentation queries mined from the query logs and/or the click logs in the augmentation query store.
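Translated into code, that log-mining step might look something like the following minimal sketch. The log formats, the function name, and the thresholds are all assumptions of mine for illustration; the patent does not spell out an implementation:

```python
from collections import Counter

def mine_augmentation_queries(query_log, click_log, min_queries=100, min_clicks=50):
    """Identify common queries from a query log, then keep the ones that the
    click log shows performing well. Thresholds here are hypothetical."""
    query_counts = Counter(query_log)  # how often each query was issued
    click_counts = Counter(entry["query"] for entry in click_log)  # clicks credited to each query
    return {
        query: click_counts[query]
        for query, count in query_counts.items()
        if count >= min_queries and click_counts[query] >= min_clicks
    }
```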

This doesn’t mean that Google is using clicks to directly determine rankings, but it is deciding which augmentation queries might be worth using to provide SERPs that people may be satisfied with.

There are other things that Google may look at to decide which augmentation queries to use in a set of search results. The patent points out some other factors that may be helpful:

In some implementations, a synonym score, an edit distance score, and/or a transformation cost score can be applied to each candidate augmentation query. Similarity scores can also be determined based on the similarity of search results of the candidate augmentation queries to the search query. In other implementations, the synonym scores, edit distance scores, and other types of similarity scores can be applied on a term by term basis for terms in search queries that are being compared. These scores can then be used to compute an overall similarity score between two queries. For example, the scores can be averaged; the scores can be added; or the scores can be weighted according to the word structure (nouns weighted more than adjectives, for example) and averaged. The candidate augmentation queries can then be ranked based upon relative similarity scores.
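As a rough sketch of how those term-by-term scores might be rolled up into an overall similarity score between two queries, consider the following. The synonym weight, the edit-distance normalization, and the optional per-term weights (nouns weighted more than adjectives, for example) are illustrative guesses on my part; the patent doesn't give formulas, and I've left out the transformation cost score:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein edit distance between two terms."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def term_similarity(t1: str, t2: str, synonyms: dict) -> float:
    """Per-term score in [0, 1]: exact match, synonym match, or edit-distance based."""
    if t1 == t2:
        return 1.0
    if t2 in synonyms.get(t1, ()):
        return 0.9  # synonym score; 0.9 is an illustrative weight
    return max(0.0, 1.0 - edit_distance(t1, t2) / max(len(t1), len(t2)))

def query_similarity(query: str, candidate: str, synonyms: dict, weights=None) -> float:
    """Average the best per-term scores, optionally weighting terms by importance."""
    terms1, terms2 = query.split(), candidate.split()
    total, total_weight = 0.0, 0.0
    for i, t1 in enumerate(terms1):
        best = max((term_similarity(t1, t2, synonyms) for t2 in terms2), default=0.0)
        w = weights[i] if weights else 1.0  # e.g. weight nouns more than adjectives
        total += w * best
        total_weight += w
    return total / total_weight if total_weight else 0.0
```

Candidate augmentation queries could then be ranked by a score like this, along the lines the patent describes.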

I’ve seen white papers from Google before mentioning synthetic queries, which are queries performed by the search engine itself instead of by human searchers. It makes sense for Google to explore query spaces in a manner like this, to see what results look like, and to use information such as structured data as a source of those synthetic queries. I’ve written about synthetic queries at least a couple of times before, including in the post Does Google Search Google? How Google May Create and Use Synthetic Queries.

Implicit Signals of Query Quality

It is an interesting patent in that it talks about things such as long clicks and short clicks, and about ranking web pages on the basis of them. The patent refers to these as “implicit signals of query quality.” More about that from the patent:

In some implementations, implicit signals of query quality are used to determine if a query can be used as an augmentation query. An implicit signal is a signal based on user actions in response to the query. Example implicit signals can include click-through rates (CTR) related to different user queries, long click metrics, and/or click-through reversions, as recorded within the click logs. A click-through for a query can occur, for example, when a user of a user device, selects or “clicks” on a search result returned by a search engine. The CTR is obtained by dividing the number of users that clicked on a search result by the number of times the query was submitted. For example, if a query is input 100 times, and 80 persons click on a search result, then the CTR for that query is 80%.

A long click occurs when a user, after clicking on a search result, dwells on the landing page (i.e., the document to which the search result links) of the search result or clicks on additional links that are present on the landing page. A long click can be interpreted as a signal that the query identified information that the user deemed to be interesting, as the user either spent a certain amount of time on the landing page or found additional items of interest on the landing page.

A click-through reversion (also known as a “short click”) occurs when a user, after clicking on a search result and being provided the referenced document, quickly returns to the search results page from the referenced document. A click-through reversion can be interpreted as a signal that the query did not identify information that the user deemed to be interesting, as the user quickly returned to the search results page.

These example implicit signals can be aggregated for each query, such as by collecting statistics for multiple instances of use of the query in search operations, and can further be used to compute an overall performance score. For example, a query having a high CTR, many long clicks, and few click-through reversions would likely have a high-performance score; conversely, a query having a low CTR, few long clicks, and many click-through reversions would likely have a low-performance score.
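As a back-of-the-envelope illustration of that aggregation, a performance score might be computed along these lines. The specific weights and the linear combination are my own guesses; the patent only tells us that a high CTR and many long clicks push the score up, while many click-through reversions push it down:

```python
def performance_score(stats: dict) -> float:
    """Combine aggregated implicit signals for one query into a single score.
    The 0.5 / 0.4 / 0.3 weights are illustrative, not from the patent."""
    ctr = stats["clicks"] / stats["submissions"]  # e.g. 80 clicks / 100 submissions = 0.8
    long_click_rate = stats["long_clicks"] / max(stats["clicks"], 1)
    reversion_rate = stats["reversions"] / max(stats["clicks"], 1)
    # High CTR and many long clicks raise the score; reversions lower it.
    return 0.5 * ctr + 0.4 * long_click_rate - 0.3 * reversion_rate

stats = {"submissions": 100, "clicks": 80, "long_clicks": 50, "reversions": 10}
print(performance_score(stats))  # ≈ 0.61: high CTR, many long clicks, few reversions
```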

The reasons for the process behind the patent are explained in its description section, where we are told:

Often users provide queries that cause a search engine to return results that are not of interest to the users or do not fully satisfy the users’ need for information. Search engines may provide such results for a number of reasons, such as the query including terms having term weights that do not reflect the users’ interest (e.g., in the case when a word in a query that is deemed most important by the users is attributed less weight by the search engine than other words in the query); the queries being a poor expression of the information needed; or the queries including misspelled words or unconventional terminology.

A quality signal for a query can be defined in this way:

the quality signal being indicative of the performance of the first query in identifying information of interest to users for one or more instances of a first search operation in a search engine; determining whether the quality signal indicates that the first query exceeds a performance threshold; and storing the first query in an augmentation query data store if the quality signal indicates that the first query exceeds the performance threshold.
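In code, that claim might reduce to a simple threshold test, reusing the performance_score sketch above; the threshold value and the shape of the data store are hypothetical:

```python
def maybe_store_augmentation_query(query: str, stats: dict, store: dict, threshold: float = 0.5) -> bool:
    """Store the query in the augmentation query data store only if its
    quality signal exceeds the performance threshold (hypothetical value)."""
    score = performance_score(stats)  # the implicit-signal score sketched earlier
    if score > threshold:
        store[query] = score
        return True
    return False
```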

The patent can be found at:

Query augmentation
Inventors: Anand Shukla, Mark Pearson, Krishna Bharat and Stefan Buettcher
Assignee: Google LLC
US Patent: 9,916,366
Granted: March 13, 2018
Filed: July 28, 2015

Abstract

Methods, systems, and apparatus, including computer program products, for generating or using augmentation queries. In one aspect, a first query stored in a query log is identified and a quality signal related to the performance of the first query is compared to a performance threshold. The first query is stored in an augmentation query data store if the quality signal indicates that the first query exceeds a performance threshold.

References Cited about Augmentation Queries

These are a number of references cited by the applicants of the patent that looked interesting, so I looked them up to read them and to share them here.

  1. Boyan, J. et al., “A Machine Learning Architecture for Optimizing Web Search Engines,” School of Computer Science, Carnegie Mellon University, May 10, 1996, pp. 1-8.
  2. Brin, S. et al., “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Science Department, 1998.
  3. Sahami, M. and Heilman, T. D., “A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets,” in Proceedings of the 15th International Conference on World Wide Web (WWW ’06), Edinburgh, Scotland, May 23-26, 2006, ACM Press, New York, NY, pp. 377-386.
  4. Baeza-Yates, Ricardo A. et al., “The Intention Behind Web Queries,” SPIRE 2006, pp. 98-109.
  5. Smith et al., “Leveraging the Structure of the Semantic Web to Enhance Information Retrieval for Proteomics,” vol. 23, Oct. 7, 2007, 7 pages.
  6. Robertson, S.E., “On Term Selection for Query Expansion” (documentation note), Journal of Documentation, 46(4), Dec. 1990, pp. 359-364.
  7. Abdessalem, Talel, Bogdan Cautis, and Nora Derouiche, “ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data,” Proc. VLDB Endow. 3, 1-2, Sep. 2010.
  8. Hsu, Jane Yung-jen and Wen-tau Yih, “Template-based Information Mining from HTML Documents,” in Proceedings of AAAI’97/IAAI’97, AAAI Press, 1997, pp. 256-262.
  9. Agarwal, Ganesh, Govind Kabra, and Kevin Chen-Chuan Chang, “Towards Rich Query Interpretation: Walking Back and Forth for Mining Query Templates,” in Proceedings of the 19th International Conference on World Wide Web (WWW ’10), ACM, New York, NY, USA, 2010, pp. 1-10. DOI: 10.1145/1772690.1772692, http://doi.acm.org/10.1145/1772690.1772692.

This is a Second Look at Augmentation Queries

This is a continuation patent, which means an earlier version was granted before with the same description, and this version has new claims. When that happens, it can be worth looking at the old claims and the new claims to see how they have changed. I like that the new version seems to focus more strongly upon structured data. It tells us that structured data from sites that appear for queries might be used as synthetic queries, and if those meet the performance threshold, results from them may be added to the search results for the original queries. The new claims do focus a little more on structured data as synthetic queries, but they haven’t changed enough to be worth publishing side by side for comparison.

What Google Has Said about Structured Data and Rankings

Google spokespeople had been telling us that structured data doesn’t impact rankings directly, but what they have been saying seems to have changed somewhat recently. In the Search Engine Roundtable post Google: Structured Data Doesn’t Give You A Ranking Boost But Can Help Rankings, we are told that just having structured data on a site doesn’t automatically boost the rankings of a page; but if the structured data for a page is used as a synthetic query, and it meets the performance threshold as an augmentation query, it might be shown in rankings, thus helping in rankings (as this patent tells us).

Note that this isn’t new, and the continuation patent’s claims don’t appear to have changed that much: structured data is still being used as synthetic queries, and those are checked to see whether they work as augmentation queries. This does seem to be a really good reason to make sure you are using the appropriate structured data for your pages.




Etleap scores $1.5 million seed to transform how we ingest data

April 24, 2018

Etleap is a play on words for a common set of data practices: extract, transform and load. The startup is trying to place these activities in a modern context, automating what they can and in general speeding up what has been a tedious and highly technical practice. Today, they announced a $1.5 million seed round.

Investors include First Round Capital, SV Angel, Liquid2, BoxGroup and other unnamed investors. The startup launched five years ago as a Y Combinator company. It spent a good 2.5 years building out the product, says CEO and founder Christian Romming. They haven’t required additional funding up until now because they have been working with actual customers. Those include Okta, PagerDuty and Mode, among others.

Romming started out at ad tech startup VigLink, and while there he encountered a problem that was hard to solve. “Our analysts and scientists were frustrated. Integration of the data sources wasn’t always a priority and when something broke, they couldn’t get it fixed until a developer looked at it.” That lack of control slowed things down and made it hard to keep the data warehouse up to date.

He saw an opportunity in solving that problem and started Etleap. While there were (and continue to be) legacy solutions like Informatica, Talend and Microsoft SQL Server Integration Services, he said that when he studied these at a deeply technical level, he found they required a great deal of help to implement. He wanted to simplify ETL as much as possible, putting data integration into the hands of much less technical end users, rather than relying on IT and consultants.

One of the problems with traditional ETL is that the data analysts who make use of the data tend to get involved very late, after the tools have already been chosen, and Romming says his company wants to change that. “They get to consume whatever IT has created for them. You end up with a bread line where analysts are at the mercy of IT to get their jobs done. That’s one of the things we are trying to solve. We don’t think there should be any engineering at all to set up ETL pipeline,” he said.

Etleap is delivered as managed SaaS or you can run it within your company’s AWS accounts. Regardless of the method, it handles all of the managing, monitoring and operations for the customer.

Romming emphasizes that the product is really built for cloud data warehouses. For now, they are concentrating on the AWS ecosystem, but have plans to expand beyond that down the road. “We want to help more enterprise companies make better use of their data, while modernizing data warehousing infrastructure and making use of cloud data warehouses,” he explained.

The company currently has 15 employees, but Romming plans to at least double that number in the next 12-18 months, mostly growing the engineering team to help further build out the product and create more connectors.




Using Ngram Phrase Models to Generate Site Quality Scores

September 30, 2017
[Image: Scrabble game in progress. Source: https://commons.wikimedia.org/wiki/File:Scrabble_game_in_progress.jpg; photographer: McGeddon; Creative Commons Attribution 2.0 Generic license.]

Navneet Panda, after whom the Google Panda update was named, is co-inventor of a newly granted patent that focuses on site quality scores. It’s worth studying to understand how it determines the quality of sites.

Back in 2013, I wrote the post Google Scoring Gibberish Content to Demote Pages in Rankings, about Google using ngrams from sites and building language models from them to determine whether those sites were filled with gibberish or spammy content. I was reminded of that post when I read this patent.

Rather than explaining what ngrams are in this post (which I did in the gibberish post), I’m going to point to an example of ngrams at the Google n-gram viewer, which shows Google indexing phrases in scanned books. This article published by the Wired site also focused upon ngrams: The Pitfalls of Using Google Ngram to Study Language.

An ngram phrase could be a 2-gram, a 3-gram, a 4-gram, or a 5-gram: pages broken down into two-word, three-word, four-word, or five-word phrases. If a body of pages is broken down into ngrams, those ngrams can be used to create language models or phrase models to compare against other pages.
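To make that concrete, here is a tiny sketch of breaking text into n-grams; the function name and the lowercasing are my own choices:

```python
def ngrams(text: str, n: int) -> list[str]:
    """Break text into n-word phrases (n-grams)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

page = "quality scores for queries and structured data"
print(ngrams(page, 2))  # ['quality scores', 'scores for', 'for queries', ...]
print(ngrams(page, 5))  # the same text broken into five-word phrases
```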

Language models, like the ones that Google used to create gibberish scores for sites, could also be used to determine the quality of sites, if example sites were used to generate those language models. That seems to be the idea behind the new patent granted this week. The summary section of the patent tells us about this use of the process it describes and protects:

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining baseline site quality scores for a plurality of previously-stored sites; generating a phrase model for a plurality of sites including the plurality of previously-scored sites, wherein the phrase model defines a mapping from phrase-specific relative frequency measures to phrase-specific baseline site quality scores; for a new site, the new site not being one of the plurality of previously-scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of the plurality of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.

The newly granted patent from Google is:

Predicting site quality
Inventors: Navneet Panda and Yun Zhou
Assignee: Google
US Patent: 9,767,157
Granted: September 19, 2017
Filed: March 15, 2013

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for predicting a measure of quality for a site, e.g., a web site. In some implementations, the methods include obtaining baseline site quality scores for multiple previously scored sites; generating a phrase model for multiple sites including the previously scored sites, wherein the phrase model defines a mapping from phrase specific relative frequency measures to phrase specific baseline site quality scores; for a new site that is not one of the previously scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.

In addition to generating ngrams from the text of sites, some implementations of this patent generate ngrams from the anchor text of links pointing to pages of those sites. Building a phrase model involves calculating the relative frequency of each n-gram on a site, “based on the count of pages divided by the number of pages on the site.”
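A minimal sketch of that calculation, and of using a phrase model to predict a site quality score, might look like the following (reusing the ngrams helper above). Representing the phrase model as a plain dict from phrase to baseline score, and averaging those scores weighted by relative frequency, are simplifying assumptions on my part; the patent describes the mapping more abstractly:

```python
from collections import defaultdict

def relative_frequencies(site_pages: list[str], n: int = 2) -> dict[str, float]:
    """Relative frequency of each n-gram on a site: the count of pages
    containing the phrase divided by the number of pages on the site."""
    page_counts = defaultdict(int)
    for page in site_pages:
        for phrase in set(ngrams(page, n)):  # count each phrase once per page
            page_counts[phrase] += 1
    return {phrase: count / len(site_pages) for phrase, count in page_counts.items()}

def predicted_site_quality(site_pages: list[str], phrase_model: dict, default: float = 0.5) -> float:
    """Aggregate the baseline scores the phrase model maps phrases to,
    weighted by each phrase's relative frequency on the new site."""
    freqs = relative_frequencies(site_pages)
    if not freqs:
        return default
    weighted = sum(f * phrase_model.get(phrase, default) for phrase, f in freqs.items())
    return weighted / sum(freqs.values())
```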

The patent tells us that site quality scores can impact the rankings of pages from those sites:

Obtain baseline site quality scores for a number of previously-scored sites. The baseline site quality scores are scores used by the system, e.g., by a ranking engine of the system, as signals, among other signals, to rank search results. In some implementations, the baseline scores are determined by a backend process that may be expensive in terms of time or computing resources, or by a process that may not be applicable to all sites. For these or other reasons, baseline site quality scores are not available for all sites.




Apttus scores $55M as it closes in on an IPO

September 13, 2017

Apttus, the unicorn quote-to-cash vendor built on the Salesforce platform, announced a $55 million round, which is likely its final private investment on the way to an IPO. While CEO Kirk Krappe wouldn’t definitively confirm the company was going public, he did say that today’s round was about gaining the confidence of future investors. “We decided we needed a certain amount…


Will Google Start Giving People Social Media Influencer Scores?

April 29, 2017

[Image: Social Media Influencer Scores — a screenshot from the patent, showing influencers and topic scores.]

A patent granted to Google this week tells us about social media influencer scores developed at Google that sound very much like the scores at Klout. In the references section of the patent, Klout is referred to a couple of times as well, with a link to the Wikipedia page about Klout and to the Klout FAQ page. We aren’t given a name for these influencer scores in Google’s patent, but it does talk about topic-based influencer scores and advertisers.

Many patents are published that give the inventors behind them a right to the technology they describe, but the decision to move ahead with the processes described in those patents is often based upon business matters, such as whether or not there might be value in pursuing the patent. When I read this patent, I was reminded of an earlier patent from Google, from a couple of years ago, that described an advertising model called AdHeat that used social media influencers and their interests. That patent was AdHeat Advertisement Model for Social Network. A whitepaper that gives us a little more in-depth information about that process was AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads. One of the authors/inventors, Edward Chang, left Google after the paper came out to join HTC as their Vice President of Research and Innovation.

This new patent was originally filed on May 29, 2012. Edward Chang left Google for HTC in July 2012. I don’t know if those events are related, but the idea of using social media influencers in advertising is an interesting one. The patent doesn’t pinpoint specific social media platforms that would be used, the way that Klout does. Interestingly, Klout does use Google+ as one of the social media networks it uses to generate Klout Scores.

I like seeing what Google patents say about things on the Web. Their introduction to social media and to influencer scores was interesting:

Social media is pervasive in today’s society. Friends keep in contact throughout the day on social networks. Fans can follow their favorite celebrities and interact on blogs, micro-blogs, and the like. Such media are referred to as “social media,” which can be considered media primarily, but not exclusively, for social interaction, and which can use highly accessible and scalable communication techniques. Brands and products mentioned on such sites can reflect customers’ interests and feedback.

Some technologies have been developed to analyze social media. For example, some systems allow users to discover their “influence scores” on various social media. An influence score is a metric to measure a user’s impact in social media.

The patent tells us about the role of the process it defines:

…one aspect of the subject matter described in this specification can be embodied in methods that include the actions of identifying a user in a community; determining an influence score to be associated with the user in the community for a particular topic including determining a reach of one or more communications that relate to the particular topic that have been distributed from the user in the community; evaluating the reach as compared to one or more other users in the community for the particular topic; and storing the influence score in association with the user.

This new patent tells us about the following steps (a rough code sketch of them follows the lists below):

  1. Identifying a user in a community;
  2. Determining an influence score to be associated with the user in the community for a particular topic;
  3. Determining the reach of communications that relate to the particular topic distributed from the user to other users in the community;
  4. Evaluating that reach by comparing it to the reach of communications from other users in the community for the particular topic; and
  5. Storing the influence score in association with the user.

The patent also tells us that the following are advantages to be gained from the use of the process described in the patent:

(1) The subject matter can be used to attribute viral growth to certain individuals or a selected group.
(2) Such attribution can be used for targeted advertising to the selected group, or even to the individuals or other individuals that are influenced by the individual or group.
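As a rough illustration of steps 1 through 5, reach and topic-specific influence scores might be computed like this. The shape of a communication record and the normalization against the community's widest reach are assumptions I've made for the sketch:

```python
def influence_scores(communications: list[dict], topic: str) -> dict[str, float]:
    """For one topic, measure each user's reach (the distinct users their
    on-topic communications were distributed to), then score that reach
    relative to the widest reach in the community."""
    reach: dict[str, set] = {}
    for comm in communications:  # e.g. {"author": "u1", "topic": "economics", "recipients": ["u2", "u3"]}
        if comm["topic"] != topic:
            continue
        reach.setdefault(comm["author"], set()).update(comm["recipients"])
    max_reach = max((len(r) for r in reach.values()), default=0)
    if max_reach == 0:
        return {}
    return {user: len(r) / max_reach for user, r in reach.items()}
```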

The patent is:

Determining influence in a social community
Inventors: Emily K. Moxley, Vinod Anupam, Hobart Sze, Dani Suleman, Khanh B. Nguyen
Assignee: Google Inc.
US Patent: 9,632,972
Granted: April 25, 2017
Filed: May 29, 2012

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining influence in a social community. In one aspect, a method includes identifying a user in a community; determining an influence score to be associated with the user in the community for a particular topic, including: determining a reach of one or more communications that relate to the particular topic that have been distributed from the user to other users in the community, and evaluating the reach as compared to the reach of one or more communications distributed from other users in the community for the particular topic; and storing the influence score in association with the user.

The patent is worth reading in full. It contains some interesting insights, including some hints regarding whether Google might engage in this type of social media advertising (see the screenshot from the patent that starts this post, showing influencers and the topic scores given to them, which is described in a little more detail in the patent).

I also liked this quote from the patent, and wanted to make sure that I shared it, because it raises a good point:

Every community has individuals who influence that community. From a prominent economist’s advice on economics to a celebrity buying the latest designer bag, thousands of people pay attention to what influential individuals are doing within their field. However, less attention is paid when an influential individual opines on a topic outside their field. For example, the thousands of individuals that pay attention to the economists on economics would be unlikely to pay attention to the economist’s latest jacket purchase.

These social media influencer scores do seem very similar to what Klout is doing. Would Google venture into such territory?

