
PageRank Updated

April 26, 2018

A popular search engine developed by Google Inc. of Mountain View, Calif. uses PageRank.RTM. as a page-quality metric for efficiently guiding the processes of web crawling, index selection, and web page ranking. Generally, the PageRank technique computes and assigns a PageRank score to each web page it encounters on the web, wherein the PageRank score serves as a measure of the relative quality of a given web page with respect to other web pages. PageRank generally ensures that important and high-quality web pages receive high PageRank scores, which enables a search engine to efficiently rank the search results based on their associated PageRank scores.

~ Producing a ranking for pages using distances in a web-link graph

A continuation of an updated PageRank patent was granted this week. The original patent was filed in 2006, and it reminded me a lot of Yahoo’s TrustRank (which the patent’s applicants cite among the large number of documents this new version builds upon).

I first wrote about this patent in the post titled Recalculating PageRank. The first claim in the original version read like this (note the mention of “Seed Pages”):

What is claimed is:

1. A method for producing a ranking for pages on the web, comprising: receiving a plurality of web pages, wherein the plurality of web pages are inter-linked with page links; receiving n seed pages, each seed page including at least one outgoing link to a respective web page in the plurality of web pages, wherein n is an integer greater than one; assigning, by one or more computers, a respective length to each page link and each outgoing link; identifying, by the one or more computers and from among the n seed pages, a kth-closest seed page to a first web page in the plurality of web pages according to the lengths of the links, wherein k is greater than one and less than n; determining a ranking score for the first web page from a shortest distance from the kth-closest seed page to the first web page; and producing a ranking for the first web page from the ranking score.

The first claim in the newer version of this continuation patent is:

What is claimed is:

1. A method, comprising: obtaining data identifying a set of pages to be ranked, wherein each page in the set of pages is connected to at least one other page in the set of pages by a page link; obtaining data identifying a set of n seed pages that each include at least one outgoing link to a page in the set of pages, wherein n is greater than one; accessing respective lengths assigned to one or more of the page links and one or more of the outgoing links; and for each page in the set of pages: identifying a kth-closest seed page to the page according to the respective lengths, wherein k is greater than one and less than n, determining a shortest distance from the kth-closest seed page to the page; and determining a ranking score for the page based on the determined shortest distance, wherein the ranking score is a measure of a relative quality of the page relative to other pages in the set of pages.

Producing a ranking for pages using distances in a web-link graph
Inventors: Nissan Hajaj
Assignee: Google LLC
US Patent: 9,953,049
Granted: April 24, 2018
Filed: October 19, 2015

Abstract

One embodiment of the present invention provides a system that produces a ranking for web pages. During operation, the system receives a set of pages to be ranked, wherein the set of pages are interconnected with links. The system also receives a set of seed pages which include outgoing links to the set of pages. The system then assigns lengths to the links based on properties of the links and properties of the pages attached to the links. The system next computes shortest distances from the set of seed pages to each page in the set of pages based on the lengths of the links between the pages. Next, the system determines a ranking score for each page in the set of pages based on the computed shortest distances. The system then produces a ranking for the set of pages based on the ranking scores for the set of pages.

Under this newer version of PageRank, we see how it might resist manipulation by building trust into the link graph:

One possible variation of PageRank that would reduce the effect of these techniques is to select a few “trusted” pages (also referred to as the seed pages) and discover other pages which are likely to be good by following the links from the trusted pages. For example, the technique can use a set of high quality seed pages (s_1, s_2, . . . , s_n), and for each seed page i=1, 2, . . . , n, the system can iteratively compute the PageRank scores for the set of the web pages P using the formulae:

∀ p ∈ P, p ≠ s_i:  R_i(p) = Σ_{q→p} [ w(q→p) / Σ_{q→r} w(q→r) ] · R_i(q)

where R_i(s_i) = 1, and w(q→p) is an optional weight given to the link q→p based on its properties (with the default weight of 1).

Generally, it is desirable to use a large number of seed pages to accommodate the different languages and a wide range of fields which are contained in the fast growing web contents. Unfortunately, this variation of PageRank requires solving the entire system for each seed separately. Hence, as the number of seed pages increases, the complexity of computation increases linearly, thereby limiting the number of seeds that can be practically used.

Hence, what is needed is a method and an apparatus for producing a ranking for pages on the web using a large number of diversified seed pages without the problems of the above-described techniques.
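To make that per-seed computation concrete, here is a minimal Python sketch of the seeded, iterative PageRank variation described above, under one reading of the (partly garbled) formula: each non-seed page receives score from the pages linking to it in proportion to the link weights, while the seed page’s own score stays pinned at 1. The graph, weights, and iteration count are invented for illustration, and the whole system is solved once per seed, which is exactly why the cost grows linearly with the number of seed pages.

```python
# A minimal sketch of the per-seed PageRank variation quoted above.
# The graph, link weights, and iteration count are invented for illustration.

def seeded_pagerank(pages, out_links, weights, seed, iterations=20):
    """Propagate a score out from a single seed page; the seed's score stays at 1."""
    scores = {p: (1.0 if p == seed else 0.0) for p in pages}
    for _ in range(iterations):
        new_scores = {p: 0.0 for p in pages}
        for q in pages:
            out = out_links.get(q, [])
            total_w = sum(weights.get((q, r), 1.0) for r in out)
            if total_w == 0:
                continue
            share = scores[q] / total_w
            for p in out:
                new_scores[p] += share * weights.get((q, p), 1.0)
        new_scores[seed] = 1.0          # R_i(s_i) is pinned to 1
        scores = new_scores
    return scores

pages = ["s1", "a", "b", "c"]
out_links = {"s1": ["a", "b"], "a": ["c"], "b": ["c"]}
weights = {}                            # empty: every link keeps the default weight of 1

# One full computation per seed, so the work grows linearly with the
# number of seed pages -- the limitation the patent points out.
seeds = ["s1"]
per_seed_scores = {s: seeded_pagerank(pages, out_links, weights, s) for s in seeds}
print(per_seed_scores)
```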

The summary of the patent describes it like this:

One embodiment of the present invention provides a system that ranks pages on the web based on distances between the pages, wherein the pages are interconnected with links to form a link-graph. More specifically, a set of high-quality seed pages are chosen as references for ranking the pages in the link-graph, and shortest distances from the set of seed pages to each given page in the link-graph are computed. Each of the shortest distances is obtained by summing lengths of a set of links which follows the shortest path from a seed page to a given page, wherein the length of a given link is assigned to the link based on properties of the link and properties of the page attached to the link. The computed shortest distances are then used to determine the ranking scores of the associated pages.
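Here is a small sketch of the distance-based approach the summary describes. The link lengths, the choice of k, and the final score transform are hypothetical; for clarity it simply runs Dijkstra's algorithm once from each seed and then takes each page's distance to its kth-closest seed, whereas the patent is aimed at computing this more efficiently for a large set of seeds.

```python
import heapq

def dijkstra(source, out_links, length):
    """Shortest distance from one seed page to every reachable page."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, q = heapq.heappop(heap)
        if d > dist.get(q, float("inf")):
            continue
        for p in out_links.get(q, []):
            nd = d + length.get((q, p), 1.0)   # default link length of 1
            if nd < dist.get(p, float("inf")):
                dist[p] = nd
                heapq.heappush(heap, (nd, p))
    return dist

def kth_seed_distance_scores(pages, out_links, length, seeds, k=2):
    # For clarity this runs Dijkstra once per seed; the patent describes
    # computing the same quantity more efficiently for many seeds.
    per_seed = [dijkstra(s, out_links, length) for s in seeds]
    scores = {}
    for p in pages:
        dists = sorted(d.get(p, float("inf")) for d in per_seed)
        kth = dists[k - 1] if len(dists) >= k else float("inf")
        # A hypothetical transform: closer to the kth-closest seed means a higher score.
        scores[p] = 0.0 if kth == float("inf") else 1.0 / (1.0 + kth)
    return scores
```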

The patent discusses the importance of a diversity of topics covered by seed sites, and the value of a large set of seed sites. It also gives us a summary of the crawling, ranking, and searching processes, like this:

Crawling Ranking and Searching Processes

FIG. 3 illustrates the crawling, ranking and searching processes in accordance with an embodiment of the present invention. During the crawling process, web crawler 304 crawls or otherwise searches through websites on web 302 to select web pages to be stored in indexed form in data center 308. In particular, web crawler 304 can prioritize the crawling process by using the page rank scores. The selected web pages are then compressed, indexed and ranked in 305 (using the ranking process described above) before being stored in data center 308.

During a subsequent search process, a search engine 312 receives a query 313 from a user 311 through a web browser 314. This query 313 specifies a number of terms to be searched for in the set of documents. In response to query 313, search engine 312 uses the ranking information to identify highly-ranked documents that satisfy the query. Search engine 312 then returns a response 315 through web browser 314, wherein the response 315 contains matching pages along with ranking information and references to the identified documents.

I’m thinking about looking up the many articles cited in the patent, and providing links to them, because they seem to be tremendous resources about the Web. I’ll likely publish those soon.




3 Ways Query Stream Ontologies Change Search

March 8, 2018

What are query stream ontologies, and how might they change search?

Search engines trained us to use keywords when we searched – to guess which words or phrases might be the best ones to use to find something we are interested in, or that we have a situational or informational need to learn more about. Keywords were an important and essential part of SEO – trying to get pages to rank highly in search results for certain keywords found in queries that people would search for. SEOs still optimize pages for keywords, hoping to use a combination of information retrieval relevance scores and link-based PageRank scores to get pages to rank highly in search results.

With Google moving towards a knowledge-based attempt to find “things” rather than “strings”, we are seeing patents that focus upon returning results that provide answers to questions in search results. One of those, from January, describes how query stream ontologies might be created from searchers’ queries, and how they can be used to respond to fact-based questions using information about attributes of entities.

There is a white paper from Google, co-authored by the inventors of this patent and published around the time the patent was filed in 2014, that is worth spending time reading through. The paper is titled Biperpedia: An Ontology for Search Applications.

The patent (and paper) both focus upon the importance of structured data. The summary for the patent tells us this:

Search engines often are designed to recognize queries that can be answered by structured data. As such, they may invest heavily in creating and maintaining high-precision databases. While conventional databases in this context typically have a relatively wide coverage of entities, the number of attributes they model (e.g., GDP, CAPITAL, ANTHEM) is relatively small.

The patent is:

Identifying entity attributes
Inventors: Alon Yitzchak Halevy, Fei Wu, Steven Euijong Whang and Rahul Gupta
Assignee: Google Inc. (Mountain View, CA)
US Patent: 9,864,795
Granted: January 9, 2018
Filed: October 28, 2014

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating an ontology of entity attributes. One of the methods includes extracting a plurality of attributes based upon a plurality of queries; and constructing an ontology based upon the plurality of attributes and a plurality of entity classes.

The paper echoes sentiments in the patent, with statements such as this one:

For the first time in the history of the Web, structured data is a first-class citizen among search results. The main search engines make significant efforts to recognize when a user’s query can be answered using structured data.

To cut right to the heart of what this patent covers, it’s worth pulling out the first claim from the patent that expresses how much of an impact this patent may have from a knowledge-based approach to collecting data and indexing information on the Web. Like most patent language, it’s a long passage that tends to run on, but it is very detailed about the process that this patent covers:

1. A method comprising: generating an ontology of class-attribute pairs, wherein each class that occurs in the class-attribute pairs of the ontology is a class of entities and each attribute occurring in the class-attribute pairs of the ontology is an attribute of the respective entities in the class of the class-attribute pair in which the attribute occurs, wherein each attribute in the class-attribute pairs has one or more domains of instances to which the attribute applies and a range that is either a class of entities or a type of data, and wherein generating the ontology comprises: obtaining class-entity data representing a set of classes and, for each class, entities belonging to the class as instances of the class; obtaining a plurality of entity-attribute pairs, wherein each entity-attribute pair identifies an entity that is represented in the class-entity data and a candidate attribute for the entity; determining a plurality of attribute extraction patterns from occurrences of the entities identified by the entity-attribute pairs with the candidate attributes identified by the entity-attribute pairs in text of documents in a collection of documents, wherein determining the plurality of attribute extraction patterns comprises: identifying an occurrence of the entity and the candidate attribute identified by a first entity-attribute pair in a first sentence from a first document in the collection of documents; generating a candidate lexical attribute extraction pattern from the first sentence; generating a candidate parse attribute extraction pattern from the first sentence; and selecting the candidate lexical attribute extraction pattern and the candidate parse attribute extraction pattern as attribute extraction patterns if the candidate lexical attribute pattern and the candidate parse attribute extraction patterns were generated using at least a predetermined number of unique entity-attribute pairs; and applying the plurality of attribute extraction patterns to the documents in the collection of documents to determine entity-attribute pairs, and from the entity-attribute pairs and the class-entity data, for each of one or more entity classes represented in the class-entity data, attributes possessed by entities belonging to the entity class.

Rather than making this post just the claims of this patent (which are worth going through if you can tolerate the legalese), I’m going to pull out some information from the description which describes some of the implications of the process behind the patent. This first one tells us of the benefit of crowdsourcing an ontology, by building it from the queries of many searchers, and how that may mean that focusing upon matching keywords in queries with keywords in documents becomes less important than responding to queries with answers to questions:

Extending the number of attributes known to a search engine may enable the search engine to answer more precisely queries that lie outside a “long tail,” of statistical query arrangements, extract a broader range of facts from the Web, and/or retrieve information related to semantic information of tables present on the Web.

This patent provides a lot of information about how such an ontology might be used to assist search:

The present disclosure provides systems and techniques for creating an ontology of, for example, millions of (class, attribute) pairs, including 100,000 or more distinct attribute names, which is up to several orders of magnitude larger than available conventional ontologies. Extending the number of attributes “known” to a search engine may provide several benefits. First, additional attributes may enable the search engine to more precisely answer “long-tail” queries, e.g., brazil coffee production. Second, additional attributes may allow for extraction of facts from Web text using open information extraction techniques. As another example, a broad repository of attributes may enable recovery of the semantics of tables on the Web, because it may be easier to recognize attribute names in column headers and in the surrounding text.
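As a rough illustration of the early stages this passage describes, here is a toy sketch that mines candidate (entity, attribute) pairs from a query stream with a single “attribute of entity” pattern and rolls them up into (class, attribute) pairs. The queries, the entity-to-class data, and the support threshold are all invented; the real system uses many extraction patterns learned from both queries and Web text.

```python
import re
from collections import defaultdict

ENTITY_CLASSES = {"brazil": "country", "france": "country", "google": "company"}
QUERY_PATTERN = re.compile(r"^(?:what is the )?(?P<attr>[\w ]+?) of (?P<entity>[\w ]+)$")

def mine_class_attributes(queries, min_support=2):
    """Return (class, attribute) pairs supported by enough distinct entities."""
    support = defaultdict(set)          # (class, attribute) -> supporting entities
    for q in queries:
        m = QUERY_PATTERN.match(q.lower().strip())
        if not m:
            continue
        entity, attr = m.group("entity"), m.group("attr")
        cls = ENTITY_CLASSES.get(entity)
        if cls:
            support[(cls, attr)].add(entity)
    # Requiring support from several entities is a crude precision filter.
    return {pair: ents for pair, ents in support.items() if len(ents) >= min_support}

queries = [
    "coffee production of brazil",
    "what is the coffee production of france",
    "capital of brazil",
    "capital of france",
]
print(mine_class_attributes(queries))
```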

Answering Queries with Attributes

I wrote about the topic of How Knowledge Base Entities can be Used in Searches to describe how Google might search a data store of attributes about entities such as movies to return search results by asking about facts related to a movie, such as “What is the movie where Robert Duvall loves the smell of Napalm in the morning?” Building up a detailed ontology that includes many facts can mean a search engine can answer many questions quickly. This may be how featured snippets are responded to in the future, but the patent that describes this approach returns SERPs filled with links to web documents, rather than answers to questions.

Open Information Extraction

That mention of open information extraction methods in the patent reminded me of an acquisition that Google made a few years ago, when it acquired Wavii in April of 2013. Wavii did research on open information extraction, described in a number of papers.

A video that might be helpful to learn about how Open Information Extraction works is this one:

Open Information Extraction at Web Scale

An Ontology created from a query stream can lead to this kind of open information extraction

Semantics from Tables on the Web

Google has been running a Webtables project for a few years, and has released a follow-up that describes how the project has been going. Semantics from tables is mentioned in this patent, so the Webtables papers are worth reading for more information, if you hadn’t come across them before.

Query Stream Ontologies

The process in the patent involves using a query stream to create an ontology. I enjoyed the statements in this patent about what an ontology is and how one works to help search. I recommend clicking through and reading the description in the patent along with the Biperpedia paper. This really is a transformation of search, moving it beyond keywords toward a better understanding of entities. It appears to be a very real future of search:

Systems and techniques disclosed herein may extract attributes from a query stream, and then use extractions to seed attribute extraction from other text. For every attribute a set of synonyms and text patterns in which it appears is saved, thereby enabling the ontology to recognize the attribute in more contexts. An attribute in an ontology as disclosed herein includes a relationship between a pair of entities (e.g., CAPITAL of countries), between an entity and a value (e.g., COFFEE PRODUCTION), or between an entity and a narrative (e.g., CULTURE). An ontology as disclosed herein may be described as a “best-effort” ontology, in the sense that not all the attributes it contains are equally meaningful. Such an ontology may capture attributes that people consider relevant to classes of entities. For example, people may primarily express interest in attributes by querying a search engine for the attribute of a particular entity or by using the attribute in written text on the Web. In contrast to a conventional ontology or database schema, a best-effort ontology may not attach a precise definition to each attribute. However, it has been found that such an ontology still may have a relatively high precision (e.g., 0.91 for the top 100 attributes and 0.52 for the top 5000 attributes).

The ontologies that are created from query streams expressly to assist search applications are different from more conventional manually generated ontologies in a number of ways:

Ontologies as disclosed herein may be particularly well-suited for use in search applications. In particular, tasks such as parsing a user query, recovering the semantics of columns of Web tables, and recognizing when sentences in text refer to attributes of entities, may be performed efficiently. In contrast, conventional ontologies tend to be relatively inflexible or brittle because they rely on a single way of modeling the world, including a single name for any class, entity or attribute. Hence, supporting search applications with a conventional ontology may be difficult because mapping a query or a text snippet to the ontology can be arbitrarily hard. An ontology as disclosed herein may include one or more constructs that facilitate query and text understanding, such as attaching to every attribute a set of common misspellings of the attribute, exact and/or approximate synonyms, other related attributes (even if the specific relationship is not known), and common text phrases that mention the attribute.
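A small sketch of what a single entry in such a best-effort ontology might carry, based on the constructs listed in the passage above. The field names and example values are mine, not Google’s schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AttributeEntry:
    name: str                                  # canonical attribute name
    domain_class: str                          # class of entities it applies to
    range_type: str                            # class of entities or a type of data
    synonyms: List[str] = field(default_factory=list)
    misspellings: List[str] = field(default_factory=list)
    related_attributes: List[str] = field(default_factory=list)
    mention_phrases: List[str] = field(default_factory=list)

coffee_production = AttributeEntry(
    name="coffee production",
    domain_class="country",
    range_type="quantity",
    synonyms=["coffee output"],
    misspellings=["cofee production"],
    related_attributes=["coffee exports"],
    mention_phrases=["produces {value} of coffee each year"],
)
```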

The patent does include more about ontologies and schema and data sources and query patterns.

This is a direction that search is traveling in, and if you want to understand or do SEO, it’s worth learning about. SEO is changing, just as it has many times in the past.

I’ve also written a followup to this post on the Go Fish Digital blog at: SEO Moves From Keywords to Ontologies and Query Patterns




Related Questions are Joined by ‘People Also Search For’ Refinements; Now Using a Question Graph

February 22, 2018

meyer lemon tree related questions

I recently bought a lemon tree and wanted to learn how to care for it. I started asking about it at Google, which provided me with other questions and answers related to caring for a lemon tree. As I clicked upon some of those, others were revealed that gave me more helpful information.

Last March, I wrote a post about Related Questions at Google, Google’s Related Questions Patent or ‘People Also Ask’ Questions.

As Barry Schwartz noted recently at Search Engine Land, Google is now also showing alternative query refinements as ‘People Also Search For’ listings, in the post, Google launches new look for ‘people also search for’ search refinements. That was enough to have me look to see if the original “Related Questions” patent was updated by Google. It was. A continuation patent was granted in June of last year, with the same name, but updated claims.

The older version of the patent can be found at Generating related questions for search queries.

It doesn’t say anything about the wording of “Related Questions” changing. Some “people also search for” results don’t necessarily take the form of questions, either (so “people also ask” may be very appropriate, and continue to be something we see in the future). But the claims from the new patent contain some new phrases and language that weren’t in the old patent. The new patent is at:

Generating related questions for search queries
Inventors: Yossi Matias, Dvir Keysar, Gal Chechik, Ziv Bar-Yossef, and Tomer Shmiel
Assignee: Google Inc.
US Patent: 9,679,027
Granted: June 13, 2017
Filed: December 14, 2015

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying related questions for a search query is described. One of the methods includes receiving a search query from a user device; obtaining a plurality of search results for the search query provided by a search engine, wherein each of the search results identifies a respective search result resource; determining one or more respective topic sets for each search result resource, wherein the topic sets for the search result resource are selected from previously submitted search queries that have resulted in users selecting search results identifying the search result resource; selecting related questions from a question database using the topic sets; and transmitting data identifying the related questions to the user device as part of a response to the search query.

The first claim brings a new concept, a “question graph,” into the world of related questions and answers:

1. A method performed by one or more computers, the method comprising: generating a question graph that includes a respective node for each of a plurality of questions; connecting, with links in the question graph, nodes for questions that are equivalent, comprising: identifying selected resources for each of the plurality of questions based on user selections of search results in response to previous submissions of the question as a search query to a search engine; identifying pairs of questions from the plurality of questions, wherein the questions in each identified pair of questions have at least a first threshold number of common identified selected resources; and for each identified pair, connecting the nodes for the questions in the identified pair with a link in the question graph; receiving a new search query from a user device; obtaining an initial ranking of questions that are related to the new search query; generating a modified ranking of questions that are related to the new search query, comprising, for each question in the initial ranking: determining whether the question is equivalent to any higher-ranked questions in the initial ranking by determining whether a node for the question is connected by a link to any of the nodes for any of the higher-ranked questions in the question graph; and when the question is equivalent to any of the higher-ranked questions, removing the question from the modified ranking; selecting one or more questions from the modified ranking; and transmitting data identifying the selected questions to the user device as part of a response to the new search query.

A question graph would be a semantic approach towards asking and answering questions that are related to each other in meaningful ways.
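Here is a minimal sketch of that idea: questions become connected in the graph when the resources users selected for them overlap by at least a threshold, and a ranking of related questions is then pruned so that any question equivalent to a higher-ranked one is dropped. The click data and the threshold are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_question_graph(selected_resources, threshold=2):
    """selected_resources: question -> set of resources users clicked for it."""
    edges = defaultdict(set)
    for q1, q2 in combinations(selected_resources, 2):
        if len(selected_resources[q1] & selected_resources[q2]) >= threshold:
            edges[q1].add(q2)
            edges[q2].add(q1)
    return edges

def dedupe_ranking(initial_ranking, edges):
    """Remove any question equivalent to a higher-ranked one."""
    kept = []
    for question in initial_ranking:
        if any(question in edges.get(higher, set()) for higher in kept):
            continue
        kept.append(question)
    return kept

clicks = {
    "how often to water a lemon tree": {"a.com", "b.com", "c.com"},
    "lemon tree watering schedule": {"a.com", "b.com", "d.com"},
    "best soil for lemon trees": {"e.com", "f.com"},
}
graph = build_question_graph(clicks)
print(dedupe_ranking(list(clicks), graph))
```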

In addition to the “question graph” that is mentioned in that first claim, we are also told that Google is keeping an eye upon how often people select these related questions, watching how often people click upon and read them.

The descriptions and the images in the patent are from the original version of the patent, so there aren’t any that reflect what a question graph might look like. For a while, Facebook offered graph search as a feature you could use to search on Facebook, and it used questions that were related to each other. I found a screenshot that shows some of those off, and such related questions could be considered to come from a question graph. It isn’t quite the same thing as what Google is doing with related questions, but the idea of showing questions that may be related to an initial query, and keeping an eye on whether people spend time looking at them, makes sense. I’ve been seeing a lot of related questions in search results and have been using them. Here are the Facebook graph search questions:

Facebook Graph Search Related questions

As you can see, those questions share some facts, and are considered to be related to each other because they do. This makes them similar to the related questions found in a question graph, which suggests they could be of interest to a searcher who asks the first query. It is interesting that the new patent claims ask whether or not the related questions being shown are being clicked upon, which tells Google if searchers have any interest in continuing to see related questions. I’ve been finding them easy to click upon, and interesting.

Are you working questions and answers into your content?




Google’s Mobile Location History

January 30, 2018

Google Location History

If you use Google Maps to navigate from place to place, or if you have agreed to be a local guide for Google Maps, there is a chance that you have seen Google Mobile Location history information. There is a Google Account Help page about how to Manage or delete your Location History. The location history page starts off by telling us:

Your Location History helps you get better results and recommendations on Google products. For example, you can see recommendations based on places you’ve visited with signed-in devices or traffic predictions for your daily commute.

You may see this history as your timeline, and there is a Google Help page to View or edit your timeline. This page starts out by telling us:

Your timeline in Google Maps helps you find the places you’ve been and the routes you’ve traveled. Your timeline is private, so only you can see it.

Mobile Location history has been around for a while, and I’ve seen it mentioned in a few Google patents. It may be referred to as a “Mobile location history” because it appears to contain information collected by your mobile device. I’ve written three posts about patents that mention location history and describe processes that depend upon it.

An interesting article that hints at some possible aspects of location history just came out on January 24th, in the post, If you’re using an Android phone, Google may be tracking every move you make.

The timing of the article about location history is interesting given that Google was granted a patent on user location histories the day before that article was published. It focuses upon telling us how location history works:

The present disclosure relates generally to systems and methods for generating a user location history. In particular, the present disclosure is directed to systems and methods for analyzing raw location reports received from one or more devices associated with a user to identify one or more real-world location entities visited by the user.

Techniques that could be used to attempt to determine a location associated with a device can include GPS, IP Addresses, Cell-phone triangulation, Proximity to Wifi Access points, and maybe even power line mapping using device magnetometers.

The patent has an interesting way of looking at location history, which sounds reasonable. I don’t know the latitudes and longitudes of places I visit:

Thus, human perceptions of location history are generally based on time spent at particular locations associated with human experiences and a sense of place, rather than a stream of latitudes and longitudes collected periodically. Therefore, one challenge in creating and maintaining a user location history that is accessible for enhancing one or more services (e.g. search, social, or an API) is to correctly identify particular location entities visited by a user based on raw location reports.

The location history process looks like it involves collecting data from mobile devices in a way that allows it to gather information about places visited, with scores for each of those locations. I have had Google Maps ask me to verify some of the places that I have visited, as if the score it had for those places may not have been sufficient (not high enough of a level of confidence) for it to believe that I had actually been at those places.

The location history patent is:

Systems and methods for generating a user location history
Inventors: Daniel Mark Wyatt, Renaud Bourassa-Denis, Alexander Fabrikant, Tanmay Sanjay Khirwadkar, Prathab Murugesan, Galen Pickard, Jesse Rosenstock, Rob Schonberger, and Anna Teytelman
Assignee: Google LLC
US Patent: 9,877,162
Granted: January 23, 2018
Filed: October 11, 2016

Abstract

Systems and methods for generating a user location history are provided. One example method includes obtaining a plurality of location reports from one or more devices associated with the user. The method includes clustering the plurality of location reports to form a plurality of segments. The method includes identifying a plurality of location entities for each of the plurality of segments. The method includes determining, for each of the plurality of segments, one or more feature values associated with each of the location entities identified for such segment. The method includes determining, for each of the plurality of segments, a score for each of the plurality of location entities based at least in part on a scoring formula. The method includes selecting one of plurality of locations entities for each of the plurality of segments.

Why generate a location history?

A couple of reasons stand out in the patent’s extended description.

1) The generated user location history can be stored and then later accessed to provide personalized location-influenced search results.
2) As another example, a system implementing the present disclosure can provide the location history to the user via an interactive user interface that allows the user to view, edit, and otherwise interact with a graphical representation of her mobile location history.

I like the interactive user interface that shows times and distances traveled.

This statement from the patent was interesting, too:

According to another aspect of the present disclosure, a plurality of location entities can be identified for each of the plurality of segments. As an example, map data can be analyzed to identify all location entities that are within a threshold distance from a segment location associated with the segment. Thus, for example, all businesses or other points of interest within 1000 feet of the mean location of all location reports included in a segment can be identified.
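A small sketch of that candidate-identification step, using a haversine distance filter around a segment’s mean location. The helper names are mine, and 304.8 meters stands in for the 1000-foot example in the quote.

```python
import math

def haversine_meters(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in meters."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def candidate_entities(segment_lat, segment_lon, entities, threshold_m=304.8):
    """entities: list of (name, lat, lon); 304.8 m is roughly 1000 feet."""
    return [
        name for name, lat, lon in entities
        if haversine_meters(segment_lat, segment_lon, lat, lon) <= threshold_m
    ]
```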

Google may track information about locations that appear in that history, such as popularity features, which may include, “a number of social media mentions associated with the location entity being valued; a number of check-ins associated with the location entity being valued; a number of requests for directions to the location entity being valued; and/or and a global popularity rank associated with the location entity being valued.”

Personalization features may also be collected, which describe previous interactions between the user and the location entity, such as the following (a small scoring sketch follows this list):

1) a number of instances in which the user performed a map click with respect to the location entity being valued;
2) a number of instances in which the user requested directions to the location entity being valued;
3) a number of instances in which the user has checked-in to the location entity being valued;
4) a number of instances in which the user has transacted with the location entity as evidenced by data obtained from a mobile payment system or virtual wallet;
5) a number of instances in which the user has performed a web search query with respect to the location entity being valued.
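Here is the scoring sketch promised above: each candidate location entity for a segment gets a score from a weighted combination of popularity features, personalization features, and distance, and the highest-scoring candidate is selected for the segment. The feature names, weights, and distance term are hypothetical, not the patent’s actual formula.

```python
POPULARITY_WEIGHTS = {"social_mentions": 0.2, "check_ins": 0.3, "direction_requests": 0.5}
PERSONAL_WEIGHTS = {"map_clicks": 1.0, "direction_requests": 1.5, "check_ins": 2.0,
                    "transactions": 2.5, "web_searches": 1.0}

def score_entity(popularity, personal, distance_m):
    pop = sum(POPULARITY_WEIGHTS.get(k, 0.0) * v for k, v in popularity.items())
    per = sum(PERSONAL_WEIGHTS.get(k, 0.0) * v for k, v in personal.items())
    # Closer candidates get a boost; a simple inverse-distance term stands in
    # for whatever distance feature the real scoring formula uses.
    return pop + per + 100.0 / (1.0 + distance_m)

def select_entity_for_segment(candidates):
    """candidates: list of (name, popularity_features, personal_features, distance_m)."""
    return max(candidates, key=lambda c: score_entity(c[1], c[2], c[3]))[0]
```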

Other benefits of location history

This next potential feature, querying location history, was one that I tested to see if it was working. It didn’t seem to be active at this point:

For example, a user may enter a search query that references the user’s historical location (e.g. “Thai restaurant I ate at last Thursday”). When it is recognized that the search query references the user’s location history, then the user’s location history can be analyzed in light of the search query. Thus, for example, the user location history can be analyzed to identify any Thai restaurants visited on a certain date and then provide such restaurants as results in response to the search query.

The patent refers to a graphical representation of mobile location history, which is available:

As an example, in some implementations, a user reviewing a graphical representation of her location history can indicate that one of the location entities included in her location history is erroneous (e.g. that she did not visit such location). In response, the user can be presented with one or more of the location entities that were identified for the segment for which the incorrect location entity was selected and can be given an opportunity to select a replacement location.

A Location History Timeline Interface

In addition to the timeline interface, you can also see a map of places you may have visited:

Timeline with Map Interface

You can see in my screenshot of my timeline that I took a photo of a kumquat tree I bought yesterday. It gives me a chance to see the photos I took, so that I can edit them if I would like. The patent tells us this about the user interface:

In other implementations, opportunities to perform other edits, such as deleting, annotating, uploading photographs, providing reviews, etc., can be provided in the interactive user interface. In such fashion, the user can be provided with an interactive tool to explore, control, share, and contribute to her location history.

The patent tells us that it tracks activities that you may have engaged in at specific locations:

In further embodiments of the present disclosure, a location entity can be associated with a user action within the context of a location history. For example, the user action can be making a purchase (e.g. with a digital wallet) or taking a photograph. In particular, in some embodiments, the user action or an item of content generated by the user action (e.g. the photograph or receipt) can be analyzed to assist in identifying the location entity associated with such user action. For example, the analysis of the user action or item of content can contribute to the score determined for each location entity identified for a segment.

I have had the Google Maps application ask me if I would like to contribute photos that I have taken at specific locations, such as at the sunset at Solana Beach. I haven’t used a digital wallet, so I don’t know if that is potentially part of my location history.

The patent describes the timeline feature and the Map feature that I included screenshots from above.

The patent interestingly tells us that location entities may be referred to by the common names of the places, and refers to those as “semantic identifiers”:

Each location entity can be designated by a semantic identifier (e.g. the common “name” of restaurant, store, monument, etc.), as distinguished from a coordinate-based or location-based identifier. However, in addition to a name, the data associated with a particular location entity can further include the location of the location entity, such as longitude, latitude, and altitude coordinates associated with the location entity.

It’s looking like location history could get smarter:

As an example, an interaction evidenced by search data can include a search query inputted by a user that references a particular location entity. As another example, an interaction evidenced by map data 218 can include a request for directions to a particular location entity or a selection of an icon representing the particular location entity within a mapping application. As yet another example, an interaction evidenced by email data 220 can include flight or hotel reservations to a particular city or lodging or reservations for dinner at a particular restaurant. As another example, an interaction evidenced by social media data 222 can include a check-in, a like, a comment, a follow, a review, or other social media action performed by the user with respect to a particular location entity.

Tracking these interactions is being done under the name “user/location entity interaction extraction,” and it may calculate statistics about such interactions:

Thus, user/location entity interaction extraction module 212 can analyze available data to extract interactions between a user and a location entity. Further, interaction extraction module 212 can maintain statistics regarding aggregate interactions for a location entity with respect to all users for which data is available.

It appears that to get the benefit of being able to access information such as this, you would need to give Google the ability to collect such data.

The patent provides more details about location history, popularity and other features, and even a little more about personalization. Many aspects of location history have been implemented, while some look like they might have yet to be developed. As can be seen from the three posts I have written about patents that use information from location history, it is possible that location history may be used in other processes at Google.

How do you feel about mobile location history from Google?




Google Targeted Advertising, Part 1

January 28, 2018

Google Targeted Advertisements

One of the inventors of the newly granted patent I am writing about, Ross Koningstein, was behind one of the most visited Google patents I’ve written about, which I posted about under the title The Google Rank-Modifying Spammers Patent. It described a social engineering approach to stop site owners from using spammy tactics to raise the rankings of pages.

This new patent is about targeted advertising at Google in paid search, which I haven’t written too much about here. I did write one post about paid search, which I called Google’s Second Most Important Algorithm? Before Google’s Panda, there was Phil. I started that post with a quote from Steven Levy, the author of the book In the Plex, which goes like this:

They named the project Phil because it sounded friendly. (For those who required an acronym, they had one handy: Probabilistic Hierarchical Inferential Learner.) That was bad news for a Google Engineer named Phil who kept getting emails about the system. He begged Harik to change the name, but Phil it was.

What this showed us was that Google did not build paid search on the AdSense algorithm from Applied Semantics, the company it acquired in 2003. But it’s been interesting seeing Google achieve so much based on a business model that relies upon advertising, because they seemed so dead set against advertising when they first started the search engine. For instance, there is a passage in an early paper about the search engine they developed that has an appendix about advertising.

If you read through The Anatomy of a Large-Scale Hypertextual Web Search Engine, you learn a lot about how the search engine was intended to work. But the section about advertising is really interesting. There, they tell us:

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine, one of the top results for cellular phone is “The Effect of Cellular Phone Use Upon Driver Attention”, a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

So, when Google was granted a patent on December 26, 2017, that provides more depth on how targeted advertising might work at Google, it made interesting reading. This is a continuation patent, which means the description ideally should be approximately the same as in the original patent, but the claims should be updated to reflect how the search engine might be using the processes described in a newer manner. The older version of the patent was filed on December 30, 2004, but it wasn’t granted under the earlier claims. It may be possible to dig up those earlier claims, but it is interesting to look at the description that accompanies the newest version of the patent to get a sense of how it works. Here is a link to the newest version of the patent, with claims that were updated in 2015:

Associating features with entities, such as categories of web page documents, and/or weighting such features
Inventors: Ross Koningstein, Stephen Lawrence, and Valentin Spitkovsky
Assignee: Google Inc.
US Patent: 9,852,225
Granted: December 26, 2017
Filed: April 23, 2015

Abstract

Features that may be used to represent relevance information (e.g., properties, characteristics, etc.) of an entity, such as a document or concept for example, may be associated with the document by accepting an identifier that identifies a document; obtaining search query information (and/or other serving parameter information) related to the document using the document identifier, determining features using the obtained query information (and/or other serving parameter information), and associating the features determined with the document. Weights of such features may be similarly determined. The weights may be determined using scores. The scores may be a function of one or more of whether the document was selected, a user dwell time on a selected document, whether or not a conversion occurred with respect to the document, etc. The document may be a Web page. The features may be n-grams. The relevance information of the document may be used to target the serving of advertisements with the document.
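As a rough illustration of the abstract, here is a sketch that turns the queries which led users to a document into weighted n-gram features, with the weight boosted by clicks, dwell time, and conversions. The weighting scheme and the log format are invented; the patent describes the concept rather than this arithmetic.

```python
from collections import Counter

def ngrams(text, n_max=2):
    words = text.lower().split()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def feature_weights_for_document(query_log):
    """query_log: list of dicts with 'query', 'clicked', 'dwell_seconds', 'converted'."""
    weights = Counter()
    for entry in query_log:
        score = 0.0
        if entry.get("clicked"):
            score += 1.0
        score += min(entry.get("dwell_seconds", 0) / 60.0, 2.0)   # cap the dwell contribution
        if entry.get("converted"):
            score += 3.0
        for gram in ngrams(entry["query"]):
            weights[gram] += score
    return weights

log = [
    {"query": "meyer lemon tree care", "clicked": True, "dwell_seconds": 90, "converted": False},
    {"query": "buy meyer lemon tree", "clicked": True, "dwell_seconds": 300, "converted": True},
]
print(feature_weights_for_document(log).most_common(5))
```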

I will continue with details about how this patent describes targeting advertising at Google in part 2 of this post.




Does Google Use Latent Semantic Indexing?

January 26, 2018
Railroad Turntable Sign
Technology evolves and changes over time.

There was a park in the town in Virginia where I used to live, built along a railroad track that had been turned into a walking path. At one place near that track was a historic turntable where cargo trains might be unloaded so that the cars could be added to later trains or trains headed in the opposite direction. This is a technology that is no longer used, but it is an example of how technology changes and evolves over time.

There are people who write about SEO who have insisted that Google uses a technology called Latent Semantic Indexing to index content on the Web, but they make those claims without any proof to back them up. I thought it might be helpful to explore that technology and its sources in more detail. It is a technology that was invented before the Web was around, to index the contents of document collections that don’t change much. LSI might be like the railroad turntables that used to be used on railroad lines.

There is also a website which offers “LSI keywords” to searchers but doesn’t provide any information about how it generates those keywords or uses LSI technology to generate them, or any proof that they make a difference in how a search engine such as Google might index content that contains those keywords. How is using “LSI keywords” different from the keyword stuffing that Google tells us not to do? Google tells us that we should:

Focus on creating useful, information-rich content that uses keywords appropriately and in context.

Where does LSI come from?

One of Microsoft’s researchers and search engineers, Susan Dumais, was an inventor behind a technology referred to as Latent Semantic Indexing, which she worked on developing at Bell Labs. There are links on her home page to many of the technologies that she worked upon while performing research at Microsoft, which are very informative and provide many insights into how search engines perform different tasks. Spending time with them is highly recommended.

Before joining Microsoft, she performed earlier research at Bell Labs, including writing about Indexing by Latent Semantic Analysis. She was also granted a patent as a co-inventor on the process. Note that this patent was filed in September of 1988 and granted in June of 1989. The World Wide Web didn’t go live until August 1991. The LSI patent is:

Computer information retrieval using latent semantic structure
Inventors: Scott C. Deerwester, Susan T. Dumais, George W. Furnas, Richard A. Harshman, Thomas K. Landauer, Karen E. Lochbaum, and Lynn A. Streeter
Assigned to: Bell Communications Research, Inc.
US Patent: 4,839,853
Granted: June 13, 1989
Filed: September 15, 1988

Abstract

A methodology for retrieving textual data objects is disclosed. The information is treated in the statistical domain by presuming that there is an underlying, latent semantic structure in the usage of words in the data objects. Estimates to this latent structure are utilized to represent and retrieve objects. A user query is recouched in the new statistical domain and then processed in the computer system to extract the underlying meaning to respond to the query.

The problem that LSI was intended to solve:

Because human word use is characterized by extensive synonymy and polysemy, straightforward term-matching schemes have serious shortcomings–relevant materials will be missed because different people describe the same topic using different words and, because the same word can have different meanings, irrelevant material will be retrieved. The basic problem may be simply summarized by stating that people want to access information based on meaning, but the words they select do not adequately express intended meaning. Previous attempts to improve standard word searching and overcome the diversity in human word usage have involved: restricting the allowable vocabulary and training intermediaries to generate indexing and search keys; hand-crafting thesauri to provide synonyms; or constructing explicit models of the relevant domain knowledge. Not only are these methods expert-labor intensive, but they are often not very successful.

The summary section of the patent tells us that there is a potential solution to this problem. Keep in mind that this was developed before the World Wide Web grew to become the very large source of information that it is today:

These shortcomings, as well as other deficiencies and limitations of information retrieval, are obviated, in accordance with the present invention, by automatically constructing a semantic space for retrieval. This is effected by treating the unreliability of observed word-to-text object association data as a statistical problem. The basic postulate is that there is an underlying latent semantic structure in word usage data that is partially hidden or obscured by the variability of word choice. A statistical approach is utilized to estimate this latent structure and uncover the latent meaning. Words, the text objects and, later, user queries are processed to extract this underlying meaning and the new, latent semantic structure domain is then used to represent and retrieve information.

To illustrate how LSI works, the patent provides a simple example, using a set of 9 documents (much smaller than the web as it exists today). The example includes documents that are about human/computer interaction topics. It really doesn’t discuss how a process such as this could handle something the size of the Web because nothing that size had quite existed yet at that point in time. The Web contains a lot of information and goes through changes frequently, so an approach that was created to index a known document collection might not be ideal. The patent tells us that an analysis of terms needs to take place, “each time there is a significant update in the storage files.”
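For readers who want to see the mechanics, here is a toy reconstruction of the LSI approach on a tiny document collection: build a term-document matrix, keep a truncated SVD, and compare a folded-in query to the documents in the reduced space. The documents and the choice of two latent dimensions are illustrative only.

```python
import numpy as np

docs = [
    "human computer interaction interface",
    "user interface response time",
    "graph trees minors survey",
]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

k = 2                                    # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
doc_vecs = (np.diag(sk) @ Vtk).T         # each row: a document in the latent space

def fold_in_query(query):
    """Project a query's term counts into the same latent space."""
    q = np.array([query.split().count(w) for w in vocab], dtype=float)
    return q @ Uk

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

q_vec = fold_in_query("human interface")
print([cosine(q_vec, d) for d in doc_vecs])
```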

There has been a lot of research and development of technology that can be applied to a set of documents the size of the Web. We learned from Google that they are using a word vector approach developed by the Google Brain team, which was described in a patent granted in 2017. I wrote about that patent and linked to resources that it used in the post Citations behind the Google Brain Word Vector Approach. If you want to get a sense of the technologies that Google may be using to index content and understand words in that content, it has advanced a lot since the days just before the Web started. There are links to papers cited by the inventors of that patent within it. Some of those may be related in some ways to Latent Semantic Indexing, since it could be called their ancestor. The LSI technology that was invented in 1988 contains some interesting approaches, and if you want to learn a lot more about it, this paper is really insightful: A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. There are mentions of Latent Semantic Indexing in patents from Google, where it is used as an example indexing method:

Text classification techniques can be used to classify text into one or more subject matter categories. Text classification/categorization is a research area in information science that is concerned with assigning text to one or more categories based on its contents. Typical text classification techniques are based on naive Bayes classifiers, tf-idf, latent semantic indexing, support vector machines and artificial neural networks, for example.

~ Classifying text into hierarchical categories




Google Giving Less Weight to Reviews of Places You Stop Visiting?

December 19, 2017

Google-Timeline-Reviews

I don’t consider myself paranoid, but after reading a lot of Google patents, I’ve been thinking of my phone as my Android tracking device. It’s looking like Google thinks of phones similarly, paying a lot of attention to things such as a person’s location history. After reading a recent patent, I’m fine with Google continuing to look at my location history, and the reviews that I might write, even though there may not be any financial benefit to me. When I write a review of a business at Google, it’s normally because I’ve either really liked that place or disliked it, and wanted to share my thoughts about it with others.

A Google patent application, filed and published by the search engine but not yet granted, is about reviews of businesses.

It tells us about how reviews can benefit businesses:

Furthermore, once a review platform has accumulated a significant number of reviews it can be a useful resource for users to identify new entities or locales to visit or experience. For example, a user can visit the review platform to search for a restaurant at which to eat, a store at which to shop, or a place to have drinks with friends. The review platform can provide search results based on location, quality according to the reviews, pricing, and/or keywords included in textual reviews.

But, there are problems with reviews that this patent sets out to address and assist with:

However, one problem associated with review platforms is collecting a significant number of reviews. For example, a large majority of people do not take the time to visit the review platform and contribute a review for each point of interest they visit throughout a day.

Furthermore, even after a review is contributed by a user, the user’s opinion of the point of interest may change, rendering the contributed review outdated and inaccurate. For example, a restaurant for which the user previously provided a positive review may come under new ownership or experience a change in kitchen staff that causes the quality of the restaurant to decrease. As such, the user may cease visiting the restaurant or otherwise decrease a frequency of visits. However, the user may not take the time to return to the review platform and update their review.

The patent does have a solution for reviews that never get written or updated. If a person stops going to a place that they have reviewed in the past, the review that they submitted may be treated as a diminished review:

Thus, a location history associated with a user can provide one or more signals that indicate an implied review of points of interest. Therefore, systems and methods for using user location information to provide reviews are needed. In particular, systems and methods for providing a diminished review for a point of interest when a frequency of visits by one or more users to the point of interest decreases are desirable.

The pending patent application is:

User Location History Implies Diminished Review
Inventors: Daniel Victor Klein and Dean Kenneth Jackson
US Patent Application 20170358015
Published: December 14, 2017
Filed: April 7, 2014

Abstract

Systems and methods for providing reviews are provided. One example system includes one or more computing devices. The system includes one or more non-transitory computer-readable media storing instructions that, when executed by the one or more computing devices, cause the one or more computing devices to perform operations. The operations include identifying, based on a location history associated with a user, a first signal. The first signal comprises a frequency of visits by the user to a first point of interest over a first time period. The operations include identifying, based on the location history associated with the user, a change in the first signal after the first time period. The operations include providing a diminished review for the user with respect to the first point of interest when the identified change comprises a decrease in the frequency of visits by the user to the first point of interest.

Some highlights from the patent description:

1. Location updates can be received from more than one mobile device associated with a user, to create a location history over time.

2. Points of interest can be tracked and can cover a really wide range of place types; a larger place, such as a shopping mall, may be treated as a single point of interest.

3. A person may control what information is collected about their location, and may be given a chance to modify or update that information.

4. Not visiting a particular place may lead to an assumption that a “user’s opinion of the point of interest has diminished or otherwise changed.”

5. A diminished review might be a negative review or a lowering of a review score.

6. A reviewer may also be asked to “confirm or edit/elaborate on the previously contributed review,” if they don’t return to a place they have reviewed in a while.

7. User-contributed reviews could be said to have a decay period, in which their influence on search or rating systems wanes.

8. Other factors besides a change of opinion about a place may be considered, such as a change of residence or workplace to a new location, or an overall change in visitation patterns for all points of interest. These types of changes may not lead to a diminished review.

9. Aggregated frequencies of visits from many people may be considered; if many people still continue to visit a place, then a change by one person may not be used to reduce its overall score. If visits by many people show a decrease, then an assumption that something has changed with the point of interest could affect the overall score. (A rough sketch of this visit-frequency logic follows this list.)
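Here is a hedged sketch, in Python, of the diminished-review idea summarized above: if a user’s visit frequency to a reviewed place drops sharply, the weight of their review is reduced. The threshold, the floor value, the function names, and the data shapes are my assumptions for illustration; they are not details taken from the patent application.

    # Sketch of the diminished-review idea: compare a user's visit frequency
    # to a place during a baseline period against a later period, and reduce
    # the review's weight if visits drop off. The 0.5 threshold and the 0.1
    # floor are invented for illustration, not taken from the patent.
    from datetime import date

    def visit_frequency(visits, start, end):
        """Visits per day to a point of interest within [start, end)."""
        days = (end - start).days or 1
        return sum(1 for v in visits if start <= v < end) / days

    def review_weight(visits, baseline, recent, overall_change=1.0):
        """Multiplier for a review's influence on a place's rating.

        overall_change captures shifts across all of a user's points of
        interest (e.g., moving to a new city), so a general decline in
        visits is not read as a changed opinion about this one place.
        """
        before = visit_frequency(visits, *baseline)
        after = visit_frequency(visits, *recent)
        if before == 0:
            return 1.0
        relative = (after / before) / overall_change
        return 1.0 if relative >= 0.5 else max(relative, 0.1)

    visits = [date(2017, 1, 5), date(2017, 1, 19), date(2017, 2, 2),
              date(2017, 2, 16), date(2017, 3, 2)]
    weight = review_weight(
        visits,
        baseline=(date(2017, 1, 1), date(2017, 4, 1)),
        recent=(date(2017, 10, 1), date(2018, 1, 1)),
    )
    print(weight)  # no recent visits, so the review weight drops to the 0.1 floor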







Does Tomorrow Deliver Topical Search Results at Google?

November 14, 2017 No Comments
The Oldest Pepper Tree in California

At one point in time, search engines such as Google learned about topics on the Web from sources such as Yahoo! and the Open Directory Project, which provided categories of sites, within directories that people could skim through to find something that they might be interested in.

Those listings of categories included hierarchical topics and subtopics, but they were managed by human beings, and both directories have since closed down.

In addition to learning about categories and topics from such places, search engines used to use such sources to do focused crawls of the web, to make sure that they were indexing as wide a range of topics as they could.

It’s possible that we are seeing those sites replaced by sources such as Wikipedia, Wikidata, Google’s Knowledge Graph, and the Microsoft Concept Graph.

Last year, I wrote a post called, Google Patents Context Vectors to Improve Search. It focused upon a Google patent titled User-context-based search engine.

In that patent we learned that Google was using information from knowledge bases (sources such as Yahoo Finance, IMDB, Wikipedia, and other data-rich and well organized places) to learn about words that may have more than one meaning.

An example from that patent was that the word “horse” has different meanings in different contexts.

To an equestrian, a horse is an animal. To a carpenter, a horse is a work tool used when doing carpentry. To a gymnast, a horse is a piece of equipment upon which they perform maneuvers during competitions with other gymnasts.

A context vector takes these different meanings from knowledge bases, along with the number of times each meaning is mentioned in those places, to catalogue how often the word is used in each context.
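As a rough illustration (not Google’s implementation), a context vector built that way could be nothing more than normalized counts of how often each sense of a term shows up across knowledge-base sources. The senses and counts below are invented for the example.

    # Toy context vector for "horse": normalized counts of how often each
    # sense appears in knowledge-base sources. All numbers are invented.
    from collections import Counter

    sense_mentions = Counter({
        "animal": 1200,                 # encyclopedia and equestrian entries
        "carpentry tool": 90,           # e.g., sawhorse references
        "gymnastics apparatus": 160,    # pommel horse, vaulting horse
    })

    total = sum(sense_mentions.values())
    context_vector = {sense: count / total for sense, count in sense_mentions.items()}
    print(context_vector)
    # roughly {'animal': 0.83, 'carpentry tool': 0.06, 'gymnastics apparatus': 0.11}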

I thought knowing about context vectors was useful for doing keyword research, but I was excited to see another patent from Google appear in which the word “context” plays a featured role. When you search for something such as a “horse,” the search results you receive are going to be mixed with horses of different types, depending upon the meaning. As this new patent tells us about such search results:

The ranked list of search results may include search results associated with a topic that the user does not find useful and/or did not intend to be included within the ranked list of search results.

If I were searching for a horse of the animal type, I might include another word in my query to better identify the context of my search. The inventors of this new patent seem to have had a similar idea. The patent mentions:

In yet another possible implementation, a system may include one or more server devices to receive a search query and context information associated with a document identified by the client; obtain search results based on the search query, the search results identifying documents relevant to the search query; analyze the context information to identify content; and generate a group of first scores for a hierarchy of topics, each first score, of the group of first scores, corresponding to a respective measure of relevance of each topic, of the hierarchy of topics, to the content.

From the pictures that accompany the patent, it looks like this context information takes the form of headings that appear above groups of search results, identifying the context that those results fit within. Here’s a drawing from the patent showing off topical search results (rock/music and geology/rocks):

Search Results in Context
Different types of ‘rock’ on a search for ‘rock’ at Google

This patent does remind me of the context vector patent, and the processes in the two patents look like they could work together. The patent is:

Context-based filtering of search results
Inventors: Sarveshwar Duddu, Kuntal Loya, Minh Tue Vo Thanh and Thorsten Brants
Assignee: Google Inc.
US Patent: 9,779,139
Granted: October 3, 2017
Filed: March 15, 2016

Abstract

A server is configured to receive, from a client, a query and context information associated with a document; obtain search results, based on the query, that identify documents relevant to the query; analyze the context information to identify content; generate first scores for a hierarchy of topics, that correspond to measures of relevance of the topics to the content; select a topic that is most relevant to the context information when the topic is associated with a greatest first score; generate second scores for the search results that correspond to measures of relevance, of the search results, to the topic; select one or more of the search results as being most relevant to the topic when the search results are associated with one or more greatest second scores; generate a search result document that includes the selected search results; and send, to a client, the search result document.
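Reading the abstract as a two-stage scoring process, here is a minimal sketch of the idea: score each topic in a hierarchy against the context information, pick the highest-scoring topic, then re-score the search results against that topic. The simple word-overlap measure and the toy topics below are my assumptions for illustration; the patent does not specify this particular scoring.

    # Two-stage scoring sketch: first scores rank topics against the context,
    # second scores rank results against the chosen topic. The word-overlap
    # measure is a stand-in chosen for brevity.
    def overlap_score(text_a, text_b):
        a, b = set(text_a.lower().split()), set(text_b.lower().split())
        return len(a & b) / (len(a | b) or 1)

    def filter_by_context(results, topics, context_text):
        # First scores: relevance of each topic to the context content.
        best_topic = max(topics, key=lambda t: overlap_score(topics[t], context_text))
        # Second scores: relevance of each result to the selected topic.
        ranked = sorted(results, key=lambda r: overlap_score(r, topics[best_topic]),
                        reverse=True)
        return best_topic, ranked

    topics = {
        "music/rock": "rock band guitar album concert",
        "geology/rocks": "rock mineral granite sediment formation",
    }
    results = [
        "granite rock formation along the coast",
        "classic rock band releases new album",
    ]
    topic, ranked = filter_by_context(results, topics, "guitar tabs and concert tickets")
    print(topic)   # music/rock, because the context mentions guitar and concert
    print(ranked)  # the music result is promoted above the geology result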

It will be exciting to see topical search results start appearing at Google.







Semantic Keyword Research and Topic Models

November 14, 2017 No Comments

Seeing Meaning

I went to the Pubcon 2017 Conference this week in Las Vegas, Nevada, and gave a presentation about Semantic Search topics based upon white papers and patents from Google. My focus was on things such as Context Vectors and Phrase-Based Indexing.

I promised in social media that I would post the presentation on my blog so that I could answer questions if anyone had any.

I’ve been doing keyword research like this for years: I look at other pages that rank well for the keyword terms I want to use, identify phrases and terms that tend to appear on those pages, and include them on the pages that I am trying to optimize. It made a lot of sense to start doing that after reading about phrase-based indexing in 2005 and later.
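A small sketch of that co-occurrence style of keyword research: gather the text of pages that already rank well for a term and count which phrases show up on more than one of them. The page texts below are placeholders; fetching and parsing real ranking pages is left out.

    # Count two-word phrases that appear on several top-ranking pages.
    from collections import Counter
    from itertools import islice

    def two_word_phrases(text):
        words = text.lower().split()
        return {" ".join(pair) for pair in zip(words, islice(words, 1, None))}

    top_ranking_pages = [
        "semantic search relies on phrase based indexing and related phrases",
        "phrase based indexing looks at related phrases that predict a topic",
        "related phrases often co-occur on pages about semantic search",
    ]

    counts = Counter()
    for page in top_ranking_pages:
        counts.update(two_word_phrases(page))  # each phrase counted once per page

    # Phrases that appear on more than one of the top-ranking pages.
    print([phrase for phrase, n in counts.most_common() if n > 1])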

Some of the terms I see when I search for Semantic Keyword Research include things such as “improve your rankings,” “conducting keyword research,” and “smarter content.” I’m also seeing phrases that I’m not a fan of, such as “LSI Keywords,” which has about as much scientific credibility as keyword density, which is to say next to none. Researchers from Bellcore (Bell Communications Research) wrote a paper about Latent Semantic Indexing in 1990, and it was something that was used with small (fewer than 10,000 documents) and static collections of documents; the web is constantly changing and hasn’t been that small for a long time.

There are many people who call themselves SEOs who tout “LSI keywords” as keywords that have related meanings to other words; unfortunately, that has nothing to do with the LSI that was developed in 1990.

If you are going to present research or theories about things such as LSI, it really pays to do a little research first. Here’s my presentation. It includes links to the patents and white papers that the ideas within it are based upon. I do look forward to questions.







Using Ngram Phrase Models to Generate Site Quality Scores

September 30, 2017 No Comments
Scrabble-phrases
Source: https://commons.wikimedia.org/wiki/File:Scrabble_game_in_progress.jpg
Photographer: McGeddon
Creative Commons License: Attribution 2.0 Generic

Navneet Panda, after whom the Google Panda update is named, has co-invented a new patent that focuses on site quality scores. It’s worth studying to understand how it determines the quality of sites.

Back in 2013, I wrote the post Google Scoring Gibberish Content to Demote Pages in Rankings, about Google using ngrams from sites and building language models from them to determine whether those sites were filled with gibberish or spammy content. I was reminded of that post when I read this patent.

Rather than explaining what ngrams are in this post (which I did in the gibberish post), I’m going to point to an example of ngrams at the Google n-gram viewer, which shows Google indexing phrases in scanned books. This article published by the Wired site also focused upon ngrams: The Pitfalls of Using Google Ngram to Study Language.

An ngram phrase could be a 2-gram, a 3-gram, a 4-gram, or a 5-gram; pages are broken down into two-word, three-word, four-word, or five-word phrases. If a body of pages is broken down into ngrams, those ngrams can be used to create language models or phrase models to compare against other pages.
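As a minimal illustration (using a plain whitespace split, rather than whatever tokenization Google actually uses), here is how page text can be broken into 2-gram through 5-gram phrases:

    # Break page text into n-gram phrases of sizes 2 through 5.
    def ngrams(text, n):
        words = text.lower().split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    page_text = "site quality can be predicted from the phrases a site uses"
    for n in range(2, 6):
        print(n, ngrams(page_text, n)[:3])  # show the first few n-grams of each size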

Language models, like the ones that Google used to create gibberish scores for sites, could also be used to determine the quality of sites, if example sites were used to generate those language models. That seems to be the idea behind the new patent granted this week. The summary section of the patent tells us about this use of the process it describes and protects:

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining baseline site quality scores for a plurality of previously-stored sites; generating a phrase model for a plurality of sites including the plurality of previously-scored sites, wherein the phrase model defines a mapping from phrase-specific relative frequency measures to phrase-specific baseline site quality scores; for a new site, the new site not being one of the plurality of previously-scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of the plurality of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.

The newly granted patent from Google is:

Predicting site quality
Inventors: Navneet Panda and Yun Zhou
Assignee: Google
US Patent: 9,767,157
Granted: September 19, 2017
Filed: March 15, 2013

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for predicating a measure of quality for a site, e.g., a web site. In some implementations, the methods include obtaining baseline site quality scores for multiple previously scored sites; generating a phrase model for multiple sites including the previously scored sites, wherein the phrase model defines a mapping from phrase specific relative frequency measures to phrase specific baseline site quality scores; for a new site that is not one of the previously scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.

In addition to generating ngrams from the text on sites, some implementations described in this patent also generate ngrams from the anchor text of links pointing to pages on those sites. Building a phrase model involves calculating the relative frequency of each n-gram on a site “based on the count of pages divided by the number of pages on the site.”
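Here is a rough sketch of that relative frequency measure and of mapping it through a phrase model to a predicted site quality score. The toy phrase model, the bigram-only phrases, and the frequency-weighted-average aggregation are my assumptions for illustration; the patent allows for other phrase sizes and mappings.

    # Relative frequency of each phrase on a site: the count of pages that
    # contain the phrase divided by the number of pages on the site. The
    # phrase model maps phrases to baseline quality scores (values invented).
    from collections import defaultdict

    def bigrams(text):
        words = text.lower().split()
        return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}

    def relative_frequencies(site_pages):
        pages_containing = defaultdict(int)
        for page in site_pages:
            for phrase in bigrams(page):
                pages_containing[phrase] += 1
        return {p: c / len(site_pages) for p, c in pages_containing.items()}

    phrase_model = {"site quality": 0.9, "cheap pills": 0.1}  # hypothetical scores

    def predicted_site_quality(site_pages):
        freqs = relative_frequencies(site_pages)
        known = {p: f for p, f in freqs.items() if p in phrase_model}
        if not known:
            return None
        total = sum(known.values())
        # Aggregate: frequency-weighted average of the baseline scores.
        return sum(phrase_model[p] * f for p, f in known.items()) / total

    print(predicted_site_quality([
        "site quality matters for rankings",
        "improving site quality over time",
    ]))  # -> 0.9, since only "site quality" appears in the toy phrase model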

The patent tells us that site quality scores can affect the rankings of pages from those sites:

Obtain baseline site quality scores for a number of previously-scored sites. The baseline site quality scores are scores used by the system, e.g., by a ranking engine of the system, as signals, among other signals, to rank search results. In some implementations, the baseline scores are determined by a backend process that may be expensive in terms of time or computing resources, or by a process that may not be applicable to all sites. For these or other reasons, baseline site quality scores are not available for all sites.




