How Google might Identify Primary Versions of Duplicate Pages

October 15, 2018

I came across this statement on the Web earlier this week, and wondered about it, and decided to investigate more:

If there are multiple instances of the same document on the web, the highest authority URL becomes the canonical version. The rest are considered duplicates.

~ Link inversion, the least known major ranking factor.


I read the article from Dejan SEO and thought it was worth exploring further. As I was looking around at Google patents that include the word "authority," I found this patent, which doesn't say quite the same thing that Dejan does, but is interesting in that it finds ways to distinguish between duplicate pages on different domains based upon priority rules; that bears on determining which duplicate page might be the highest-authority URL for a document.

The patent is:

Identifying a primary version of a document
Inventors: Alexandre A. Verstak and Anurag Acharya
Assignee: Google Inc.
US Patent: 9,779,072
Granted: October 3, 2017
Filed: July 31, 2013

Abstract

A system and method identifies a primary version out of different versions of the same document. The system selects a priority of authority for each document version based on a priority rule and information associated with the document version and selects a primary version based on the priority of authority and information associated with the document version.

Since the claims of a patent are what examiners at the USPTO review when prosecuting a patent and deciding whether or not it should be granted, I thought it would be worth looking at the claims to see whether they encapsulate what the patent covers. The first claim captures some aspects worth thinking about: it describes different versions of particular documents, and how the metadata associated with a document might be examined to determine which is the primary version of that document:

What is claimed is:

1. A method comprising: identifying, by a computer system, a plurality of different document versions of a particular document; identifying, by the computer system, a first type of metadata that is associated with each document version of the plurality of different document versions, wherein the first type of metadata includes data that describes a source that provides each document version of the plurality of different document versions; identifying, by the computer system, a second type of metadata that is associated with each document version of the plurality of different document versions, wherein the second type of metadata describes a feature of each document version of the plurality of different document versions other than the source of the document version; for each document version of the plurality of different document versions, applying, by the computer system, a priority rule to the first type of metadata and the second type of metadata, to generate a priority value; selecting, by the computer system, a particular document version, of the plurality of different document versions, based on the priority values generated for each document version of the plurality of different document versions; and providing, by the computer system, the particular document version for presentation.

This doesn’t advance the claim that the primary version of a document is considered the canonical version of that document, and all links pointed to that document are redirected to the primary version.

There is another patent, sharing an inventor with this one, that refers to one of the duplicate URLs being chosen as a representative page, though it doesn't use the word "canonical." From that patent:

Duplicate documents, sharing the same content, are identified by a web crawler system. Upon receiving a newly crawled document, a set of previously crawled documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.

In some embodiments, a method for selecting a representative document from a set of duplicate documents includes: selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score, where each respective document in the plurality of documents has a fingerprint that identifies the content of the respective document, the fingerprint of each respective document in the plurality of documents indicating that each respective document in the plurality of documents has substantially identical content to every other document in the plurality of documents, and a first document in the plurality of documents is associated with the query-independent score. The method further includes indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and with respect to the plurality of documents, including only the indexed first document in a document index.
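The representative-document selection these passages describe can be sketched roughly as follows. This is only an illustration of the idea: the field names, scores, and URLs are hypothetical, not taken from the patent.

```python
from collections import defaultdict

def select_representatives(docs):
    """Group documents by content fingerprint, then keep only the
    best-scoring document of each duplicate set for the index."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["fingerprint"]].append(doc)
    index = []
    for dup_set in groups.values():
        # Pick the document with the highest query-independent score
        # (something PageRank-like) as the representative of its set.
        index.append(max(dup_set, key=lambda d: d["score"]))
    return index

docs = [
    {"url": "https://a.example/paper", "fingerprint": "f1", "score": 0.9},
    {"url": "https://mirror.example/paper", "fingerprint": "f1", "score": 0.4},
    {"url": "https://b.example/other", "fingerprint": "f2", "score": 0.7},
]
print([d["url"] for d in select_representatives(docs)])
```

Only one document per fingerprint survives into the index, matching the patent's "only the indexed first document is included in a document index."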

This other patent is:

Representative document selection for a set of duplicate documents
Inventors: Daniel Dulitz, Alexandre A. Verstak, Sanjay Ghemawat and Jeffrey A. Dean
Assignee: Google Inc.
US Patent: 8,868,559
Granted: October 21, 2014
Filed: August 30, 2012

Abstract

Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query independent score, the first document thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.

Regardless of whether the primary version of a set of duplicate documents is treated as the representative document as suggested in this second patent (whatever that may mean exactly), I think it’s important to get a better understanding of what a primary version of a document might be.

The primary version patent provides some reasons why one of them might be considered a primary version:

(1) Including different versions of the same document does not provide additional useful information, and it does not benefit users.
(2) Search results that include different versions of the same document may crowd out diverse contents that should be included.
(3) Where there are multiple different versions of a document present in the search results, the user may not know which version is most authoritative, complete, or best to access, and thus may waste time accessing the different versions in order to compare them.

Those are the three reasons this duplicate document patent says it is ideal to identify a primary version from different versions of a document that appears on the Web. The search engine also wants to furnish “the most appropriate and reliable search result.”

How does it work?

The patent tells us that one method of identifying a primary version is as follows.

The different versions of a document are identified from a number of different sources, such as online databases, websites, and library data systems.

For each document version, a priority of authority is selected based on:

(1) The metadata information associated with the document version, such as

  • The source
  • Exclusive right to publish
  • Licensing right
  • Citation information
  • Keywords
  • Page rank
  • The like

(2) As a second step, the document versions are then checked for length qualification using a length measure. The version with a high priority of authority and a qualified length is deemed the primary version of the document.

If none of the document versions has both a high priority and a qualified length, then the primary version is selected based on the totality of information associated with each document version.
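The two-step selection described above might look something like this sketch. The priority rule, length threshold, and fallback are my own stand-ins for the patent's language (using the best overall score as a simple proxy for "the totality of information"); none of the names or values come from the patent.

```python
def select_primary_version(versions, priority_rule, min_length):
    """Step 1: score each version with the priority rule.
    Step 2: require a qualified length. If no version passes both
    tests, fall back to the best-scoring version overall."""
    scored = [(priority_rule(v), v) for v in versions]
    qualified = [(p, v) for p, v in scored if v["length"] >= min_length]
    pool = qualified if qualified else scored
    return max(pool, key=lambda pv: pv[0])[1]

# A toy source-priority rule: higher numbers mean more authoritative sources.
SOURCE_PRIORITY = {"publisher.example": 3, "repository.example": 2, "scraper.example": 1}

def priority_rule(version):
    return SOURCE_PRIORITY.get(version["source"], 0)

versions = [
    {"source": "publisher.example", "length": 4200},
    {"source": "repository.example", "length": 5000},
    {"source": "scraper.example", "length": 300},
]
print(select_primary_version(versions, priority_rule, min_length=1000)["source"])
```

Here the publisher copy wins: it has both the highest priority of authority and a qualified length.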

The patent tells us that scholarly works are particularly well suited to the process it describes:

Because works of scholarly literature are subject to rigorous format requirements, documents such as journal articles, conference articles, academic papers and citation records of journal articles, conference articles, and academic papers have metadata information describing the content and source of the document. As a result, works of scholarly literature are good candidates for the identification subsystem.

Metadata that might be looked at during this process could include such things as:

  • Author names
  • Title
  • Publisher
  • Publication date
  • Publication location
  • Keywords
  • Page rank
  • Citation information
  • Article identifiers such as Digital Object Identifier, PubMed Identifier, SICI, ISBN, and the like
  • Network location (e.g., URL)
  • Reference count
  • Citation count
  • Language
  • So forth

The patent goes into more depth about the methodology behind determining the primary version of a document:

The priority rule generates a numeric value (e.g., a score) to reflect the authoritativeness, completeness, or best to access of a document version. In one example, the priority rule determines the priority of authority assigned to a document version by the source of the document version based on a source-priority list. The source-priority list comprises a list of sources, each source having a corresponding priority of authority. The priority of a source can be based on editorial selection, including consideration of extrinsic factors such as reputation of the source, size of source’s publication corpus, recency or frequency of updates, or any other factors. Each document version is thus associated with a priority of authority; this association can be maintained in a table, tree, or other data structures.

The patent includes a table illustrating the source-priority list.

The patent includes some alternative approaches as well. It tells us that “the priority measure for determining whether a document version has a qualified priority can be based on a qualified priority value.”

A qualified priority value is a threshold to determine whether a document version is authoritative, complete, or easy to access, depending on the priority rule. When the assigned priority of a document version is greater than or equal to the qualified priority value, the document is deemed to be authoritative, complete, or easy to access, depending on the priority rule. Alternatively, the qualified priority can be based on a relative measure, such as given the priorities of a set of document versions, only the highest priority is deemed as qualified priority.
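Both variants of "qualified priority" from that passage, the absolute threshold and the relative measure where only the top priority qualifies, can be sketched in a few lines (the function and values are illustrative, not from the patent):

```python
def qualified_priorities(priorities, threshold=None):
    """Return, for each version's priority value, whether it is 'qualified':
    either by an absolute threshold, or, if no threshold is given, by the
    relative measure where only the highest priority qualifies."""
    if threshold is not None:
        return [p >= threshold for p in priorities]
    top = max(priorities)
    return [p == top for p in priorities]

print(qualified_priorities([3, 2, 1], threshold=2))  # absolute threshold
print(qualified_priorities([3, 2, 1]))               # relative measure
```

With the absolute threshold, several versions may qualify at once; with the relative measure, only the single highest-priority version does.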

Takeaways

I was in a Google Hangout on Air within the last couple of years in which a number of other SEOs (Ammon Johns, Eric Enge, Jennifer Slegg) and I asked John Mueller and Andrey Lipattsev some questions about duplicate content. It seems to be something that still raises questions among SEOs.

The patent goes into more detail about determining which duplicate document might be the primary document. We can't tell whether that primary document might be treated as the canonical URL for all of the duplicate documents, as suggested in the Dejan SEO article linked at the start of this post, but it is interesting to see that Google has a way of deciding which version of a document might be the primary version. I didn't go into much depth about qualified lengths being used to help identify the primary document, but the patent does spend some time going over that.

Is this a little-known ranking factor? The Google patent on identifying a primary version of duplicate documents does seem to find some importance in identifying what it believes to be the most important version among many duplicate documents. I’m not sure if there is anything here that most site owners can use to help them have their pages rank higher in search results, but it’s good seeing that Google may have explored this topic in more depth.


Copyright © 2018 SEO by the Sea ⚓. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at may be guilty of copyright infringement. Please contact SEO by the Sea, so we can take appropriate action immediately.
Plugin by Taragana

The post How Google might Identify Primary Versions of Duplicate Pages appeared first on SEO by the Sea ⚓.


SEO by the Sea ⚓


Quality Scores for Queries: Structured Data, Synthetic Queries and Augmentation Queries

July 31, 2018

Augmentation Queries

In general, the subject matter of this specification relates to identifying or generating augmentation queries, storing the augmentation queries, and identifying stored augmentation queries for use in augmenting user searches. An augmentation query can be a query that performs well in locating desirable documents identified in the search results. The performance of the query can be determined by user interactions. For example, if many users that enter the same query often select one or more of the search results relevant to the query, that query may be designated an augmentation query.

In addition to actual queries submitted by users, augmentation queries can also include synthetic queries that are machine generated. For example, an augmentation query can be identified by mining a corpus of documents and identifying search terms for which popular documents are relevant. These popular documents can, for example, include documents that are often selected when presented as search results. Yet another way of identifying an augmentation query is mining structured data, e.g., business telephone listings, and identifying queries that include terms of the structured data, e.g., business names.

These augmentation queries can be stored in an augmentation query data store. When a user submits a search query to a search engine, the terms of the submitted query can be evaluated and matched to terms of the stored augmentation queries to select one or more similar augmentation queries. The selected augmentation queries, in turn, can be used by the search engine to augment the search operation, thereby obtaining better search results. For example, search results obtained by a similar augmentation query can be presented to the user along with the search results obtained by the user query.

This past March, Google was granted a patent that involves giving quality scores to queries (the quote above is from that patent). The patent refers to high-scoring queries as augmentation queries. It is interesting to see that searcher selection is one way that might be used to determine the quality of queries. So, when someone searches, Google may compare the SERPs for the original query against augmentation-query results based upon previous searches using the same query terms, or upon synthetic queries. This evaluation against augmentation queries is based upon which search results have received more clicks in the past. Google may decide to add results from an augmentation query to the results for the query searched for, to improve the overall search results.

How does Google find augmentation queries? One place to look for those is in query logs and click logs. As the patent tells us:

To obtain augmentation queries, the augmentation query subsystem can examine performance data indicative of user interactions to identify queries that perform well in locating desirable search results. For example, augmentation queries can be identified by mining query logs and click logs. Using the query logs, for example, the augmentation query subsystem can identify common user queries. The click logs can be used to identify which user queries perform best, as indicated by the number of clicks associated with each query. The augmentation query subsystem stores the augmentation queries mined from the query logs and/or the click logs in the augmentation query store.
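That mining step, common queries from the query log, performance from the click log, might be sketched like this. The thresholds and log format are made-up illustrations; the patent doesn't specify them.

```python
from collections import Counter

def mine_augmentation_queries(query_log, click_log, min_count, min_clicks):
    """Identify common queries from the query log, then keep only those
    whose click log suggests they perform well enough to store."""
    frequency = Counter(query_log)   # how often each query was submitted
    clicks = Counter(click_log)      # how often each query led to a click
    return {
        q for q, n in frequency.items()
        if n >= min_count and clicks[q] >= min_clicks
    }

query_log = ["best seo tools"] * 5 + ["seo tols"] * 5 + ["rare query"]
click_log = ["best seo tools"] * 4 + ["seo tols"] * 1
print(mine_augmentation_queries(query_log, click_log, min_count=3, min_clicks=2))
```

The misspelled query is submitted often but rarely clicked, so it never makes it into the augmentation query store.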

This doesn’t mean that Google is using clicks to directly determine rankings But it is deciding which augmentation queries might be worth using to provide SERPs that people may be satisfied with.

There are other things that Google may look at to decide which augmentation queries to use in a set of search results. The patent points out some other factors that may be helpful:

In some implementations, a synonym score, an edit distance score, and/or a transformation cost score can be applied to each candidate augmentation query. Similarity scores can also be determined based on the similarity of search results of the candidate augmentation queries to the search query. In other implementations, the synonym scores, edit distance scores, and other types of similarity scores can be applied on a term by term basis for terms in search queries that are being compared. These scores can then be used to compute an overall similarity score between two queries. For example, the scores can be averaged; the scores can be added; or the scores can be weighted according to the word structure (nouns weighted more than adjectives, for example) and averaged. The candidate augmentation queries can then be ranked based upon relative similarity scores.
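Of the scores that passage lists, the edit distance score is the easiest to illustrate. The sketch below normalizes Levenshtein distance into a 0..1 similarity and ranks candidate augmentation queries by it; the synonym and transformation-cost scores the patent also mentions are omitted, and the queries themselves are invented examples.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(q1, q2):
    """Normalize edit distance into a 0..1 similarity score."""
    longest = max(len(q1), len(q2)) or 1
    return 1.0 - edit_distance(q1, q2) / longest

candidates = ["cheap flights", "cheap flight deals", "hotel rooms"]
ranked = sorted(candidates, key=lambda c: similarity("cheap flights", c), reverse=True)
print(ranked)
```

In a fuller implementation, this score would be averaged or weighted together with the other similarity signals before the candidates are ranked.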

I’ve seen white papers from Google before mentioning synthetic queries, which are queries performed by the search engine instead of human searchers. It makes sense for Google to be exploring query spaces in a manner like this, to see what results are like, and using information such as structured data as a source of those synthetic queries. I’ve written about synthetic queries before at least a couple of times, and in the post Does Google Search Google? How Google May Create and Use Synthetic Queries.

Implicit Signals of Query Quality

It is an interesting patent in that it talks about things such as long clicks and short clicks, and ranking web pages on the basis of them. The patent refers to these as "implicit signals of query quality." More about that from the patent here:

In some implementations, implicit signals of query quality are used to determine if a query can be used as an augmentation query. An implicit signal is a signal based on user actions in response to the query. Example implicit signals can include click-through rates (CTR) related to different user queries, long click metrics, and/or click-through reversions, as recorded within the click logs. A click-through for a query can occur, for example, when a user of a user device, selects or “clicks” on a search result returned by a search engine. The CTR is obtained by dividing the number of users that clicked on a search result by the number of times the query was submitted. For example, if a query is input 100 times, and 80 persons click on a search result, then the CTR for that query is 80%.

A long click occurs when a user, after clicking on a search result, dwells on the landing page (i.e., the document to which the search result links) of the search result or clicks on additional links that are present on the landing page. A long click can be interpreted as a signal that the query identified information that the user deemed to be interesting, as the user either spent a certain amount of time on the landing page or found additional items of interest on the landing page.

A click-through reversion (also known as a “short click”) occurs when a user, after clicking on a search result and being provided the referenced document, quickly returns to the search results page from the referenced document. A click-through reversion can be interpreted as a signal that the query did not identify information that the user deemed to be interesting, as the user quickly returned to the search results page.

These example implicit signals can be aggregated for each query, such as by collecting statistics for multiple instances of use of the query in search operations, and can further be used to compute an overall performance score. For example, a query having a high CTR, many long clicks, and few click-through reversions would likely have a high-performance score; conversely, a query having a low CTR, few long clicks, and many click-through reversions would likely have a low-performance score.
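Those three signals, CTR, long clicks, and click-through reversions, could combine into an overall performance score along these lines. The weights below are purely illustrative; the patent gives the signals but not a formula.

```python
def performance_score(impressions, clicks, long_clicks, reversions):
    """Toy performance score from the patent's implicit signals:
    reward CTR and long clicks, penalize click-through reversions.
    The 0.5 weights are illustrative assumptions, not from the patent."""
    if impressions == 0 or clicks == 0:
        return 0.0
    ctr = clicks / impressions
    long_rate = long_clicks / clicks
    reversion_rate = reversions / clicks
    return ctr + 0.5 * long_rate - 0.5 * reversion_rate

# A query with high CTR, many long clicks, and few reversions...
good = performance_score(impressions=100, clicks=80, long_clicks=60, reversions=5)
# ...versus one with low CTR, few long clicks, and many reversions.
poor = performance_score(impressions=100, clicks=10, long_clicks=1, reversions=8)
print(good > poor)
```

As the patent predicts, the first query earns a high performance score and the second a low (here negative) one.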

The reasons for the process behind the patent are explained in the description section of the patent where we are told:

Often users provide queries that cause a search engine to return results that are not of interest to the users or do not fully satisfy the users’ need for information. Search engines may provide such results for a number of reasons, such as the query including terms having term weights that do not reflect the users’ interest (e.g., in the case when a word in a query that is deemed most important by the users is attributed less weight by the search engine than other words in the query); the queries being a poor expression of the information needed; or the queries including misspelled words or unconventional terminology.

A quality signal for a query term can be defined in this way:

the quality signal being indicative of the performance of the first query in identifying information of interest to users for one or more instances of a first search operation in a search engine; determining whether the quality signal indicates that the first query exceeds a performance threshold; and storing the first query in an augmentation query data store if the quality signal indicates that the first query exceeds the performance threshold.

The patent can be found at:

Query augmentation
Inventors: Anand Shukla, Mark Pearson, Krishna Bharat and Stefan Buettcher
Assignee: Google LLC
US Patent: 9,916,366
Granted: March 13, 2018
Filed: July 28, 2015

Abstract

Methods, systems, and apparatus, including computer program products, for generating or using augmentation queries. In one aspect, a first query stored in a query log is identified and a quality signal related to the performance of the first query is compared to a performance threshold. The first query is stored in an augmentation query data store if the quality signal indicates that the first query exceeds a performance threshold.

References Cited about Augmentation Queries

These are the references cited by the applicants of the patent that looked interesting, so I looked them up to read and share here.

  1. Boyan, J. et al., "A Machine Learning Architecture for Optimizing Web Search Engines," School of Computer Science, Carnegie Mellon University, May 10, 1996, pp. 1-8.
  2. Brin, S. et al., "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Science Department, 1998.
  3. Sahami, M. et al., "A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets," in Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23-26, 2006), WWW '06, ACM Press, New York, NY, pp. 377-386.
  4. Baeza-Yates, R.A. et al., "The Intention Behind Web Queries," SPIRE 2006, pp. 98-109.
  5. Smith et al., "Leveraging the Structure of the Semantic Web to Enhance Information Retrieval for Proteomics," Vol. 23, Oct. 7, 2007, 7 pages.
  6. Robertson, S.E., "Documentation Note on Term Selection for Query Expansion," J. of Documentation, 46(4), Dec. 1990, pp. 359-364.
  7. Abdessalem, T., Cautis, B., and Derouiche, N., "ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data," Proc. VLDB Endow. 3, 1-2, Sep. 2010.
  8. Hsu, J.Y. and Yih, W., "Template-based Information Mining from HTML Documents," in Proceedings of AAAI '97/IAAI '97, AAAI Press, 1997, pp. 256-262.
  9. Agarwal, G., Kabra, G., and Chang, K.C.-C., "Towards Rich Query Interpretation: Walking Back and Forth for Mining Query Templates," in Proceedings of WWW '10, ACM, New York, NY, 2010, pp. 1-10. DOI=10.1145/1772690.1772692

This is a Second Look at Augmentation Queries

This is a continuation patent, which means an earlier version was granted with the same description, and this one has new claims. When that happens, it can be worth comparing the old claims with the new to see how they have changed. I like that the new version seems to focus more strongly upon structured data. It tells us that structured data on sites that appear for queries might be used as synthetic queries, and if those meet the performance threshold, results for them may be added to the search results for the original queries. That said, the claims haven't changed enough to be worth publishing side by side and comparing.

What Google Has Said about Structured Data and Rankings

Google spokespeople have been telling us that structured data doesn't impact rankings directly, but what they have been saying seems to have changed somewhat recently. In the Search Engine Roundtable post Google: Structured Data Doesn't Give You A Ranking Boost But Can Help Rankings, we are told that just having structured data on a site doesn't automatically boost a page's rankings. But if the structured data for a page is used as a synthetic query, and it meets the performance threshold as an augmentation query, results for it might be shown in rankings, thus helping rankings (as this patent tells us).

Note that this isn’t new, and the continuation patent’s claims don’t appear to have changed that much so that structured data is still being used as synthetic queries, and is checked to see if they work as augmented queries. This does seem to be a really good reason to make sure you are using the appropriate structured data for your pages.



Learning to Rank

July 17, 2018

My last post was Five Years of Google Ranking Signals, and I started that post by saying that there are other posts about ranking signals that have some issues. But I don't want to turn people away from looking at one recent post that did contain a lot of useful information.

Cyrus Shepard recently published a post about Google Success Factors on Zyppy.com, which I would recommend that you also check out.

Cyrus also did a video with Ross Hudgins of Siege Media, called Google Ranking Factors with Cyrus Shepard, in which he talked about those ranking signals. I'm keeping this post short on purpose, to make that discussion of ranking the focus and the star. There is some really good information in the video and in Cyrus's post. He takes a different approach to writing about ranking signals from mine, but it's worth the time to visit, listen, and watch.

And have fun learning to rank.



Five Years of Google Ranking Signals

June 24, 2018


Organic Search Ranking Signals

1. Domain Age and Rate of Linking
2. Use of Keywords
3. Related Phrases
4. Keywords in Main Headings, Lists, and Titles
5. Page Speed
6. Watch Times for a Page
7. Context Terms on a Page
8. Language Models Using Ngrams
9. Gibberish Content
10. Authoritative Results
11. How Well Database Answers Match Queries
12. Suspicious Activity to Increase Rankings
13. Popularity Scores for Events
14. The Amount of Weight from a Link is Based upon the Probability that someone might click upon it
15. Biometric Parameters while Viewing Results
16. Click-Throughs
17. Site Quality Scores
18. Disambiguating People
19. Effectiveness and Affinity
20. Quotes
21. Category Duration Visits
22. Repeat Clicks and Visit Durations
23. Environmental Information
24. Traffic Producing Links
25. Freshness
26. Media Consumption History
27. Geographic Coordinates
28. Low Quality
29. Television Viewing
30. Quality Rankings

Semantic Search Ranking Signals

31. Searches using Structured Data
32. Related Entities
33. Nearby Locations
34. Attributes of Entities
35. Natural Language Search Results

Local Search Ranking Signals

36. Travel Time for Local Results
37. Reverse Engineering of Spam Detection in Local Results
38. Surprisingness in Business Names in Local Search
39. Local Expert Reviews
40. Similar Local Entities
41. Distance from Mobile Location History
42. What People Search for at Locations Searched
43. Semantic Geotokens

Voice Search Ranking Signals

44. Stressed Words

News Search Ranking Signals

45. Originality

Conclusion

Google Ranking Signals

There are some other pages about Google Ranking Signals that don’t consider up-to-date information or sometimes use questionable critical thinking to argue that some of the signals that they include are actually something that Google considers. I’ve been blogging about patents from Google, Yahoo, Microsoft, and Apple since 2005, and have been exploring what those might say are ranking signals for over a decade.

Representatives from Google have stated that "Just because we have a patent on something, doesn't mean we are using it." The first time I heard them say that was after Go Daddy started advertising domain registrations of up to 10 years, because one Google patent (Information Retrieval Based on Historical Data) said that Google might look at the length of domain registration as a ranking signal, based on the thought that a "spammer would likely only register a domain for a period of one year." (In reality, many people register domains for one year and have their registrations on auto-renewal, so a one-year registration is not evidence that the registrant is a spammer.)

I’ve included some ranking signals that are a little older, but most of the things I’ve listed are from the past five years, often with blog posts I’ve written about them, and patents that go with them. This list is a compilation of blog posts that I have been working on for years, taking many hours of regular searching through patent filings, and reading blog posts from within the Search and SEO industries, and reading through many patents that I didn’t write about, and many that I have. If you have questions about any of the signals I’ve listed, please ask about them in the comments.

Some of the patents I have blogged about have not been implemented by Google yet, but could be. A company such as Google files a patent to protect the intellectual property behind its ideas, the work that its search engineers and testing teams put into those ideas. It is worth reading and understanding many of these patents because they provide insights into ideas that Google may have explored when developing ranking signals, and they may give you ideas of things you may want to explore, and questions to keep in mind when you are optimizing a site. Patents are made public to inspire people to innovate, invent, and understand new ideas and inventions.

Organic Search Ranking Signals

1. Domain Age and Rate of Linking

Google does have a patent called Document scoring based on document inception date, in which they tell us that they will often use the date that they first crawl a site, or the first time they see a document referenced on another site, as the age of that site. The patent also tells us that Google may look at the links pointing to a site, calculate the average rate at which links are added to it, and use that information in ranking the site.

2. Use of Keywords

Matt Cutts wrote a newsletter for librarians in which he explained how Google crawled the web, making an inverted index of the Web from the terms found in documents, which it would match up with query terms when people performed searches. It shows us the importance of keywords in queries, and how finding documents that contain those keywords is an important part of performing searches. A copy of that newsletter can be found here: https://www.analistaseo.es/wp-content/uploads/2014/09/How-Google-Index-Rank.pdf
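To make the idea concrete, here is a minimal sketch of an inverted index in Python. The documents and queries are made-up examples, not anything from Google’s actual systems:

```python
from collections import defaultdict

# Hypothetical documents standing in for crawled web pages.
docs = {
    "doc1": "google crawls the web",
    "doc2": "the web is made of pages",
    "doc3": "google ranks pages",
}

# Build the inverted index: each term maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return the documents containing every query term (a boolean AND match)."""
    results = set(docs)
    for term in query.split():
        results &= index.get(term, set())
    return sorted(results)

print(search("google pages"))  # ['doc3']
```

Matching happens against the index rather than the documents themselves, which is what makes lookup fast at web scale.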

3. Related Phrases

Google recently updated its first phrase-based indexing patent, which tells us in its claims that pages with more related phrases on them rank higher than pages with fewer related phrases. That patent is: Phrase-based searching in an information retrieval system. Related phrases are meaningful, complete phrases that can predict the topic of the page they appear upon. Google might look at the queries that a page is optimized for, look at the highest-ranking pages for those query terms, and see which meaningful complete phrases frequently occur (or co-occur) on those high-ranking pages.

I wrote about the updating of this patent in the post Google Phrase-Based Indexing Updated. Google tells us about how they are indexing related phrases in an inverted index (like the term-based inverted index from #2) in the patent Index server architecture using tiered and sharded phrase posting lists.

4. Keywords in Main Headings, Lists, and Titles

Semantic closeness illustrated

I wrote the post Google Defines Semantic Closeness as a Ranking Signal after reading the patent, Document ranking based on semantic distance between terms in a document. The Abstract of this patent tells us that:

Techniques are disclosed that locate implicitly defined semantic structures in a document, such as, for example, implicitly defined lists in an HTML document. The semantic structures can be used in the calculation of distance values between terms in the documents. The distance values may be used, for example, in the generation of ranking scores that indicate a relevance level of the document to a search query.

If a list on a page has a heading above it, the items in that list are all considered an equal distance away from that heading. The words contained under the main heading on a page are all considered an equal distance away from that main heading. All of the words on a page are considered an equal distance away from the page’s title. So, a page titled “Ford” which has the word “motors” on it is considered relevant for the phrase “Ford Motors.” Here is an example of how that semantic closeness works with a heading and a list:
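Those equal-distance rules can be sketched roughly in Python. The page structure and the specific distance values are illustrative assumptions; the patent does not publish its exact distance calculations:

```python
# Hypothetical page structure: a title, plus a heading with a list beneath it.
page = {
    "title": "Ford",
    "sections": {"Motors": ["Mustang", "F-150", "Bronco"]},
}

def semantic_distance(page, a, b):
    """Illustrative distances: the title is one hop from every word on the
    page; a heading is one hop from each item in its list; items sharing a
    heading are two hops apart (up to the heading and back down)."""
    if a == page["title"] or b == page["title"]:
        return 1
    for heading, items in page["sections"].items():
        if (a == heading and b in items) or (b == heading and a in items):
            return 1
        if a in items and b in items:
            return 2
    return None  # no structural relationship found

print(semantic_distance(page, "Ford", "motors"))     # 1
print(semantic_distance(page, "Motors", "Mustang"))  # 1
print(semantic_distance(page, "Mustang", "F-150"))   # 2
```

Smaller distances between a page’s terms and a query’s terms would then feed into higher relevance scores.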

5. Page Speed

Google has announced repeatedly that they consider Page Speed to be a ranking signal, including in the Google Blog post: Using site speed in web search ranking, and also in a patent that I wrote about in the post, Google’s Patent on Site Speed as a Ranking Signal.

The patent assigned to Google about Page Speed is Using resource load times in ranking search results. The patent tells us that this load time signal may be based upon measures of how long it takes a page to load on a range of devices:

The load time of an online resource can be based on a statistical measure of a sample of load times for a number of different types of devices that the page or resource might be viewed upon.
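As a small illustration of a “statistical measure of a sample of load times,” a median taken over samples from several device types might look like this. The device categories, sample values, and choice of median are all assumptions for illustration:

```python
import statistics

# Hypothetical load-time samples (in seconds) from different device types.
load_times = {
    "desktop": [1.2, 1.4, 1.1],
    "phone": [2.8, 3.1, 2.9],
    "tablet": [2.0, 2.2],
}

# Pool the samples and take a median as one possible "statistical measure";
# the patent does not say which statistic is actually used.
all_samples = [t for samples in load_times.values() for t in samples]
print(round(statistics.median(all_samples), 2))  # 2.1
```

A median is less sensitive to a few very slow outlier devices than a mean would be, which is one reason it is a plausible choice here.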

6. Watch Times for a page

While it may sound as if it applies only to videos, there is a Google patent that tells us pages may rank higher if they are watched for longer periods of time than other pages. The post I wrote about this patent is Google Watch Times Algorithm For Rankings?, and the patent it covers is Watch time based ranking.

A page may contain video or images or audio, and a watch time for those may make a difference too. Here’s a screenshot from the patent showing some examples:

Watch Time for a Page

7. Context Terms on a Page

I wrote the post Google Patents Context Vectors to Improve Search, about the patent User-context-based search engine.

The patent tells us that it may look at words that have more than one meaning in knowledge bases (such as a bank, which could mean a building money is stored in, the ground on one side of a river, or what a plane does when it turns in the air). The search engine may take terms from that knowledge base that show which meaning was intended, collect them as “context terms,” and look for those context terms when indexing the pages those words appear on, so that it indexes the correct meaning.

8. Language Models Using Ngrams

Google may give pages quality scores based upon language models created from those pages when it looks at the ngrams on the pages of a site. This is similar to the Google Book Ngram Viewer.

I wrote about this in the post Using Ngram Phrase Models to Generate Site Quality Scores based upon the patent Predicting site quality

The closer the quality score for a page is to a high-quality page from a training set, the higher the page may rank.

9. Gibberish Content

This may sound a little like #8 above. Google may use ngrams to tell if the words on a page are gibberish, and reduce the ranking of a page. I wrote about this in a post titled, Google Scoring Gibberish Content to Demote Pages in Rankings?, about the patent Identifying gibberish content in resources.

Here is an ngram analysis using a well-known phrase, with 5 words in it:

The quick brown fox jumps
quick brown fox jumps over
brown fox jumps over the
fox jumps over the lazy
jumps over the lazy dog

Ngrams from a complete page might be collected like that, and from a collection of good pages and bad pages, to build language models (and Google has done that with a lot of books, as we see from the Google Ngram Viewer covering a very large collection of books.) It would be possible to tell which pages are gibberish from such a set of language models. This Gibberish content patent also mentions a keyword stuffing score that it would try to identify.
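The 5-grams shown above can be generated with a short sliding-window sketch:

```python
def ngrams(text, n=5):
    """Slide an n-word window across the text, one word at a time."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
for gram in ngrams(sentence):
    print(gram)
# Prints the five 5-grams listed above, from
# "The quick brown fox jumps" to "jumps over the lazy dog"
```

Ngrams collected this way from known good and known gibberish pages would supply the training data for the language models the patent describes.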

10. Authoritative Results

In the post Authoritative Search Results in Google Searches?, I wrote about the patent Obtaining authoritative search results, which tells us that Google might look at the results of a search, and if none of the Pages in the SERPs that appear are authoritative enough, it might search upon one of the query refinements that are listed with those results to see if they return any authoritative results.

If they do, the authoritative results may be merged into the original results. The way it describes authoritative results:

In general, an authoritative site is a site that the search system has determined to include particularly trusted, accurate, or reliable content. The search system can distinguish authoritative sites from low-quality sites that include resources with shallow content or that frequently include spam advertisements. Whether the search system considers a site to be authoritative will typically be query-dependent. For example, the search system can consider the site for the Centers for Disease Control, “cdc.gov,” to be an authoritative site for the query “cdc mosquito stop bites,” but may not consider the same site to be authoritative for the query “restaurant recommendations”. A search result that identifies a resource on a site that is authoritative for the query may be referred to as an authoritative search result.

11. How Well Databases Answers Match Queries

This patent doesn’t seem to have been implemented yet. But it might, and is worth thinking about.

I wrote the post How Google May Rank Websites Based Upon Their Databases Answering Queries, based upon the patent Resource identification from organic and structured content. It tells us that Google might look at searches on a site, and how a site might answer them, to see if they are similar to the queries that Google receives from searchers.

If they are, it might rank results from those sites higher. The patent also shows us that it might include the database results from such sites within Google Search results. If you start seeing that happening, you will know that Google decided to implement this patent. Here is the screenshot from the patent:

example search results showing database information

12. Suspicious Activity to Increase Rankings

Another time that Google publicly stated that “just because we have a patent doesn’t mean we use it” came shortly after I wrote about a patent in a post I called The Google Rank-Modifying Spammers Patent, based upon the patent Ranking documents.

It tells us about a transition rank that Google may assign to a site where they see activity that might be suspicious, such as keyword stuffing. Instead of improving the ranks of pages, they might decrease them, or rerank them randomly. The motivation appears to be to provoke the people making those changes into doing something more drastic, revealing themselves. The patent tells us:

Implementations consistent with the principles of the invention may rank documents based on a rank transition function. The ranking based on the rank transition function may be used to identify documents that are subjected to rank-modifying spamming. The rank transition may provide confusing indications of the impact on rank in response to rank-modifying spamming activities. Implementations consistent with the principles of the invention may also observe spammers’ reactions to rank changes to identify documents that are actively being manipulated.

13. Popularity Scores for Events

Might Google rank pages about events higher based upon how popular it might perceive that event to be? I wrote the post Ranking Events in Google Search Results about the patent Ranking events which told us about popularity of an event being something that would make a difference. The following Screenshot from the patent shows some of the signals that go into determining a popularity score for an event:

signal Scores for an event

Some patents provide a list of the “Advantages” of following a process in the patent, as does this one:

The following advantages are described by the patent in following the approach it describes.

  1. Events in a given location can be ranked so that popular or interesting events can be easily identified.
  2. The ranking can be adjusted to ensure that highly-ranked events are diverse and different from one another.
  3. Events matching a variety of event criteria can be ranked so that popular or interesting events can be easily identified.
  4. The ranking can be provided to other systems or services that can use the ranking to enhance the user experience. For example, a search engine can use the ranking to identify the most popular events that are relevant to a received search query and present the most popular events to the user in response to the received query.
  5. A recommendation engine can use the ranking to provide information identifying popular or interesting events to users that match the users’ interests.

14. The Amount of Weight from a Link is Based upon the Probability of Clicks On It

I came across an update to the reasonable surfer patent, which focused more upon anchor text used in links than the earlier version of the patent, and told us that the amount of weight (PageRank) that might pass through a link was based upon the likelihood that someone might click upon that link.

The post is Google’s Reasonable Surfer Patent Updated based upon this patent Ranking documents based on user behavior and/or feature data. Since this is a continuation patent, it is worth looking at the claims in the patent to see what they say it is about. They do mention how ranking is affected, including the impact of anchor text and words before and after a link.

identifying: context relating to one or more words before or after the links, words in anchor text associated with the links, and a quantity of the words in the anchor text, the weight being determined based on whether the particular feature data corresponds to the stored feature data associated with the one or more links or whether the particular feature data corresponds to the stored feature data associated with the one or more other links, the rank being generated based on the weight; identifying, by the one or more devices, documents associated with a search query, the documents, associated with the search query, including the particular document; and providing, by the one or more devices, information associated with the particular document based on: the search query, and the generated rank.

15. Biometric Parameters while Viewing Results

This patent is one I wondered whether Google would implement, and I suspect that many people would be upset if they did. I wrote about it in Satisfaction a Future Ranking Signal in Google Search Results?, based upon Ranking Query Results Using Biometric Parameters. Google may watch through a smartphone’s front-facing camera to see the reaction of someone looking at results in response to a query, and if they appear unsatisfied with the results, those results may be demoted in future search results.

how satisfaction might be used with Search Results Pages

16. Click-Throughs

We’ve been told by Google spokespeople that click-throughs are too noisy to use as a ranking signal, and yet a patent came out which describes how they might be used, with some thresholds, such as clicks not counting until after the first 100, or until a certain amount of time passes. The post I wrote about it was Google Patents Click-Through Feedback on Search Results to Improve Rankings, based upon Modifying search result ranking based on a temporal element of user feedback.

Rand Fishkin sent me a message saying that his experience has been that clicks were counting as ranking signals, but he was also seeing thresholds of around 500 clicks before clicks would make a difference. It’s difficult to tell with some signals, especially when Google makes statements about them not being signals in use.

Rand's tweet in response to my post, about his experiment.
Rand’s tweet in response to my post, about his experiment.

And Rand responded about what I said in the post about thresholds as well:

Threshold on click rates tweet.

17. Site Quality Scores

If you search for “seobythesea named entities,” it is a signal that you expect to find information about named entities on the site seobythesea.com.

If you do a site-operator search such as “site:http://www.seobythesea.com named entities,” you are again showing that you expect to find information about a particular topic on that site. These are considered queries that refer to a particular site.

Referring queries are compared against queries that are merely considered to be associated with a particular site. If there are more referring queries than associated queries, the quality score for the site is higher; if there are fewer referring queries than associated queries, the quality score is lower. The post I wrote about this was How Google May Calculate Site Quality Scores (from Navneet Panda), based upon the patent Site quality score. A lower site quality score can mean a lower rank, as the patent tells us:

The site quality score for a site can be used as a signal to rank resources or to rank search results that identify resources, that are found in one site relative to resources found in another site.
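As a very loose sketch, the relationship between referring and associated queries might be modeled as a ratio. The actual formula in the patent is more involved; this function and its inputs are illustrative assumptions:

```python
def site_quality_score(referring_queries, associated_queries):
    """Illustrative only: the patent describes comparing queries that refer
    to a site by name against queries merely associated with it; the exact
    formula is not public, so a simple ratio stands in for it here."""
    if associated_queries == 0:
        return 0.0
    return referring_queries / associated_queries

# More referring queries than associated queries -> higher score.
print(site_quality_score(300, 100))  # 3.0
# Fewer referring queries than associated queries -> lower score.
print(site_quality_score(50, 100))   # 0.5
```

The point of the sketch is the direction of the signal: people asking for a site by name counts for it, while only showing up for generic queries counts against it.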

18. Disambiguating People

Like the patent about terms with more than one meaning being disambiguated by context terms on their pages: when you write about people who may share a name with someone else, and those people have disambiguated entries on sites such as Wikipedia, make sure you include context terms on your page that make it easier to tell which person you are writing about.

The post I covered this in was Google Shows Us Context is King When Indexing People, based upon the patent Name disambiguation using context terms.

19. Effectiveness and Affinity

If you search for something such as a song on a phone, and you have a music app on that phone that contains that song, Google may tell you what the song you are searching for is, and that you can access it on the app you have loaded on your phone.

Social network affinities seem to be related to this. If you ask a question that might involve someone whom you might be connected to on a social network, they might be pointed out to you. See Effectiveness and Affinity as Search Ranking Signals (Better Search Experiences) about Ranking search results.

20. Quotes

quotes-ranking-signals

Google seems to know who said what and has a patent on it.

See Google Searching Quotes of Entities on the patent Systems and methods for searching quotes of entities using a database.

21. Category Duration Visits

Could visits to specific categories of a site have a positive effect on the rankings of those visited sites? We know that people from Google have said that user behavior signals like this tend to be noisy; but what are you to think when the patent I was writing about describes ways to reduce noise from such signals?

The post is A Panda Patent on Website and Category Visit Durations, and it is about a patent co-authored by Navneet Panda titled Website duration performance based on category durations.

22. Repeat Clicks and Visit Durations

I want to believe Google spokespeople when they say that Google doesn’t use click data to rank pages, but I keep seeing patents from Navneet Panda, whom Google’s Panda update was named after, that describe user behavior that may have an impact.

The post is Click a Panda: High Quality Search Results based on Repeat Clicks and Visit Duration, and the patent it is about is one called Ranking search results

23. Environmental Information

Google can listen to a television playing, and respond to a question such as “Who is starring in this movie I am watching?”

I wrote about it in Google to Use Environmental Information in Queries, and the post is based upon the patent Answering questions using environmental context.

24. Traffic Producing Links

Google might attempt to estimate how much traffic links to a site might bring to that site. If it believes that the links aren’t bringing much traffic, it may discount the value of those links.

I wrote about this in the post Did the Groundhog Update Just Take Place at Google? It is about the patent Determining a quality measure for a resource.

25. Freshness

I wrote a post about this called New Google Freshness-Based Ranking Patent.

There I wrote about how a search engine might try to determine that a query is of particular recent interest by looking to see if there has been a number of occurrences of the query:

  1. Being received within a recent time period
  2. On blog web pages within a recent time period
  3. On news web pages within a recent time period
  4. On social network web pages within a recent time period
  5. Requesting news search results within a recent time period
  6. Requesting news search results within a recent time period versus requesting web search results within the time period
  7. User selections of news search results provided in response to the query or
  8. More user selections of news search results versus user selections of web search results within the time period

The patent that this one was from is:

Freshness based ranking
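A toy version of the first occurrence signal, counting how often a query was received within a recent time window, might look like this (the window length and threshold are made-up values):

```python
from datetime import datetime, timedelta

RECENT_WINDOW = timedelta(days=7)  # made-up window length
THRESHOLD = 3                      # made-up occurrence count

def query_is_fresh(occurrence_times, now):
    """Treat a query as being of recent interest if it occurred at least
    THRESHOLD times within RECENT_WINDOW of now."""
    recent = [t for t in occurrence_times if now - t <= RECENT_WINDOW]
    return len(recent) >= THRESHOLD
```

The patent lists several such occurrence counts (on blogs, news pages, social networks, and so on); each could feed a check along these lines.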

26. Media Consumption History

If a person has a history of interaction with specific media, such as watching a particular movie or video or listening to a specific song, their searches may be influenced by that media, as I described in Google Media Consumption History Patent Filed.

That is based upon this patent, Query Response Using Media Consumption History. It is one of a series of patents which I wrote more about in How Google May Track the Media You Consume to Influence Search Results.

27. Geographic Coordinates

A patent called Determining geographic locations for place names in a fact repository was updated in a continuation patent, which I wrote about in Google Changes How they Understand Place Names in a Knowledge Graph.

The claims from the patent were updated to include many mentions of “geographic coordinates,” which indicates that including latitude and longitude information in Schema markup for a site might not be a bad idea. It’s impossible to say, based upon the patent, whether they use those signals on ordinary websites that aren’t knowledge-base sites like Wikipedia, IMDB, or Yahoo Finance. But it seems reasonable to believe that if they hope to see information in that form in those places, it wouldn’t hurt on websites that are concerned about their locations as well (especially since knowledge bases seem to be the source of facts for many sites in places such as knowledge panels).

28. Low Quality

Google may discount links pointed to a site from places such as the footers of other sites, and links from sites that tend to be redundant, which it may not count more than once. I wrote about this in the post How Google May Classify Sites as Low-Quality Sites.

It is based upon the patent Classifying sites as low quality sites.

29. Television Watching

flow chart from patent on television watching as a ranking signal

Google may try to track what is playing on television where you are located, and watch for queries which look like they might be based upon those television shows, which I wrote about in Google Granted Patent on Using What You Watch on TV as a Ranking Signal.

It is based upon the patent System and method for enhancing user search results by determining a television program currently being displayed in proximity to an electronic device

30. Quality Rankings

Quality Raters Flowchart

We know that Google uses human raters to evaluate sites. Their ratings of pages may influence the rankings of those pages, which I wrote about in the post How Google May Rank Web Sites Based on Quality Ratings. The post identifies and explains a few quality signals that might be included in raters’ evaluations, such as whether a site has a broad appeal or a niche appeal, and what its click rate, blog subscription rate, or PageRank score might be.

The patent this ranking signal is based upon is Website quality signal generation

Semantic Search Ranking Signals

31. Searches using Structured Data

Google recently published a patent which showed how Structured data in the form of JSON-LD might be used on a page and might cause Google to search for values of attributes of entities described in that structured data, such as what book was published by a certain author during a specific time period. The patent explained how Google could search through the structured data to find answers to a query like that. My post is Google Patent on Structured Data Focuses upon JSON-LD, and the patent it covers is Storing semi-structured data.

32. Related Entities

An entity may be findable in search results through a property or attribute that is not its most noteworthy one, but that is still known. In a post about this, I used the example query “Where was George Washington a Surveyor?” since he is most well known for having been President. The post is Related Entity Scores in Knowledge-Based Searches, based on the patent Providing search results based on sorted properties.

33. Nearby Locations

I stood in front of a statue in my town and asked my phone what the name of the statue in front of me was. It didn’t give me an answer, but I suspect we may see answers to questions like this in the future (and information about stores and restaurants that we might be standing in front of as well). I wrote about how this might work in the post How Google May Interpret Queries Based on Locations and Entities (Tested). It is based upon the patent Interpreting User Queries Based on Nearby Locations. This is worth testing again; I am traveling to Italy in November, and I’m hoping it works for my trip then, so I can ask for reviews of restaurants I am standing in front of while there.

34. Attributes of Entities

Asking questions about facts from entities such as movies or books, and Google being able to answer such queries is a good reason to make sure Google understands the entities that exist on your web pages. I wrote about such searches in the post How Knowledge Base Entities can be Used in Searches.

It is based upon the patent Identifying entities using search results

35. Natural Language Search Results

Example of search results showing natural language answers to questions.

Featured Snippets may be answered from high-authority pages (ranking on the first page for a query) that contain the natural-language question to be answered, and a good answer to that question. The questions are ones that follow a common pattern for questions asked on the web, such as “What is a good treatment for X?” I wrote about such search results in the post Direct Answers – Natural Language Search Results for Intent Queries.

It is based on the patent at Natural Language Search Results for Intent Queries

Local Search Ranking Signals

36. Travel Time for Local Results

How far someone may be willing to travel to a place may be a reason why Google might increase the ranking of a business in local search results. I wrote about this in the post Ranking Local Businesses Based Upon Quality Measures including Travel Time, based upon the patent Determining the quality of locations based on travel time investment.

Would you drive an hour away for a slice of pizza? If so, it must be pretty good pizza. The abstract from the patent tells us this:

…the quality measure of a given location may be determined based on the time investment a user is willing to make to visit the given location. For example, the time investment for a given location may be based on a comparison of one or more actual distance values to reach the given location to one or more anticipated distance values to reach the given location.

37. Reverse Engineering of Spam Detection in Local Results

In the post How Google May Respond to Reverse Engineering of Spam Detection, I wrote about the patent Reverse engineering circumvention of spam detection algorithms. I remembered how Google responded when people brought up the Google Rank-Modifying Spammers Patent, which I wrote about in #12, telling people that just because they had a patent doesn’t mean they necessarily use it.

This patent is slightly different from the Rank modifying spammer’s patent, in that it only applies to local search, and it may keep a spamming site from appearing at all, or appearing if continued activity keeps on setting off flags. As the patent abstract tells us:

A spam score is assigned to a business listing when the listing is received at a search entity. A noise function is added to the spam score such that the spam score is varied. In the event that the spam score is greater than a first threshold, the listing is identified as fraudulent and the listing is not included in (or is removed from) the group of searchable business listings. In the event that the spam score is greater than a second threshold that is less than the first threshold, the listing may be flagged for inspection. The addition of the noise to the spam scores prevents potential spammers from reverse engineering the spam detecting algorithm such that more listings that are submitted to the search entity may be identified as fraudulent and not included in the group of searchable listings.
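The two-threshold scheme with added noise from the abstract can be sketched like this (the threshold values and the noise magnitude are assumptions for illustration):

```python
import random

FIRST_THRESHOLD = 0.8   # above this, the listing is treated as fraudulent
SECOND_THRESHOLD = 0.5  # above this, the listing is flagged for inspection

def classify_listing(spam_score, rng=random.random):
    """Add noise to the spam score before thresholding, so that submitters
    cannot reverse engineer the exact cutoff from repeated submissions."""
    noisy = spam_score + (rng() - 0.5) * 0.1  # small symmetric perturbation
    if noisy > FIRST_THRESHOLD:
        return "removed"
    if noisy > SECOND_THRESHOLD:
        return "flagged"
    return "listed"

# With zero noise (rng pinned at 0.5) the thresholds apply directly:
print(classify_listing(0.9, rng=lambda: 0.5))  # removed
print(classify_listing(0.6, rng=lambda: 0.5))  # flagged
print(classify_listing(0.2, rng=lambda: 0.5))  # listed
```

Because the perturbation differs on every submission, a spammer probing the system sees inconsistent outcomes near the cutoffs and cannot pin down where they are.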

38. Surprisingness in Business Names in Local Search

Another patent about spam in local search is one I wrote about in the post Google Fights Keyword Stuffed Business Names Using a Surprisingness Value, about the patent Systems and methods of detecting keyword-stuffed business titles.

This patent targets keyword stuffed business names that include prominent business names to try to confuse the search engine. Examples include such names as “Locksmith restaurant,” and “Courtyard 422 Y st Marriott.”

39. Local Expert Reviews

I’ve been hearing people suggest that reviews can help a local search result rank higher, and I have seen reviews considered equivalent to a mention in the Google patent on Location Prominence. But I’ve now also seen a Google patent which tells us that a review from a local expert might also increase the rankings of a local entity in local results. My post was At Google Local Expert Reviews May Boost Local Search Results, on the patent Identifying local experts for local search.

40. Similar Local Entities

When you search for a local coffeehouse, Google may decide that it wants to show you similar local businesses, and may include some other coffee houses or other similar results in what you see also. I wrote a post on this called How Google May Determine Similar Local Entities, from the patent Detection of related local entities.

41. Distance from Mobile Location History

Google keeps track of places that you may visit using a mobile device such as a phone. It returns results on searches based upon distance from you, the relevance of a business name to your search, and the location prominence of a local entity to its location. The distance used to be from where you were searching, but it may now be based upon a distance from your location history, as I wrote about in Google to Use Distance from Mobile Location History for Ranking in Local Search

This is based upon a patent called Ranking Nearby Destinations Based on Visit Likelihood and Predicting Future Visits to Places From Location History

42. What People Search for at Locations Searched

Leo Carillo Ranch Query Refinements

Search for a place that you might visit, and the query refinements you see may be based upon what people searched for while they were at that place. The “Leo Carrillo” example above is for a ranch that was converted into a state park where many people get married, and chances are the queries shown come from people searching from that park.

This doesn’t affect the rankings of the results you see, but instead the query refinements that you are shown. See Local Query Suggestions Based Upon Where People Search based on Local query suggestions.

43. Semantic Geotokens

A semantic geotoken is “a standardized representation for the geographic location including one or more location-specific terms for the geographic location.” My post about geotokens provides details on how much of an impact they might have when shown in different ways, at Better Organic Search Results at Google Involving Geographic Location Queries.

These are based on a patent named Semantic geotokens

Voice Search Ranking Signals

44. Stressed Words in Spoken Queries

This may not be something you can optimize a page for, but it does show that Google is paying attention to voice search and where that might take us. In the post Google and Spoken Queries: Understanding Stressed Pronouns based upon the patent Resolving pronoun ambiguity in voice queries, we see that Google may be listening for our voices to emphasize certain words when we ask for something. Here is an example from the patent:

A voice query asks: “Who was Alexander Graham Bell’s father?”
The answer: “Alexander Melville Bell”
A followup voice query: “What is HIS birthday?”
The answer to the follow-up query: “Alexander Melville Bell’s birthday is 3/1/1819”

News Search Ranking Signals

45. Originality in News Search

Google has a few patents that focus specifically upon ranking news results. They have updated some of those patents with continuation patents that have rewritten claims in them. I came across one that used to once focus upon geography as a very important signal but appears to pay much more attention to originality now. I wrote about that change in the post Originality Replaces Geography as Ranking Signal in Google News

The updated patent is Methods and apparatus for ranking documents

Ranking Signals Conclusion

I have mostly focused upon ranking signals that I have written about in posts going back five years. It’s quite possible that I missed some, but I wanted to provide a list of signals that I have written about and can point to patents about. I’ve mentioned that Google spokespeople have sometimes said that “Just because Google has a patent on something doesn’t mean that they are using it.” That is good advice, but I do want to urge you to keep in mind that Google found these ideas important enough to write out in legal documents that exclude others from using the processes described in them, so a fair amount of effort went into the patents I point to in this post.

I will be thinking about going back more than five years to cover some other signals that I have written about, including posts about factors that search engines may use when they rerank search results.

I do look forward to hearing your thoughts about the ranking signals that I have covered in this post.


Copyright © 2018 SEO by the Sea ⚓. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at may be guilty of copyright infringement. Please contact SEO by the Sea, so we can take appropriate action immediately.
Plugin by Taragana

The post Five Years of Google Ranking Signals appeared first on SEO by the Sea ⚓.




Google Patent on Structured Data Focuses upon JSON-LD

June 18, 2018 No Comments

Ernest Hemingway Structure Data

A search engine that answers questions based upon crawling and indexing facts found within structured data on a site works differently from one that looks at the words used in a query and tries to return documents that contain the same words as the ones in the query, hoping that such a matching of strings might answer the informational need that inspired the query in the first place. Search using structured data works a little differently, as seen in this flowchart from a 2017 Google patent:

Flow Chart Showing Structured Data in a Search

In Schema, Structured Data, and Scattered Databases such as the World Wide Web, I talked about the DIPRE Algorithm in a patent from Sergey Brin, which I also described in the post Google’s First Semantic Search Invention was Patented in 1999. That patent and algorithm described how the web might be crawled to collect pattern and relation information about specific facts; in that case, about books. In the Google patent on structured data, we see how Google might look for factual information set out in semi-structured data such as JSON-LD, to be able to answer queries about facts, such as “What book by Ernest Hemingway was published between 1948 and 1952?”

This newer patent tells us that it might solve that book search in this manner:

In particular, for each encoded data item associated with a given identified schema, the system searches the locations in the encoded data item identified by the schema as storing values for the specified keys to identify encoded data items that store values for the specified keys that satisfy the requirements specified in the query. For example, if the query is for semi-structured data items that have a value “Ernest Hemingway” for an “author” key and that have values in a range of “1948-1952” for a “year published” key, the system can identify encoded data items that store a value corresponding to “Ernest Hemingway” in the location identified in the schema associated with the encoded data item as storing the value for the “author” key and that store a value in the range from “1948-1952” in the location identified in the schema associated with the encoded data item as storing the value for the “year published” key. Thus, the system can identify encoded data items that satisfy the query efficiently, i.e., without searching encoded data items that do not include values for each key specified in the received query and without searching locations in the encoded data items that are not identified as storing values for the specified keys.
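
As a rough illustration of the lookup described in that passage, here is a minimal Python sketch. The schema-as-index layout and the function names are my own, not from the patent; a schema here simply maps each key to the location where its value is stored, so only items whose schema covers every queried key are searched, and only at the identified locations.

```python
# Hypothetical sketch: each schema maps keys to storage locations in its
# encoded items, so the search skips schemas (and locations) that cannot
# possibly satisfy the query.

def find_matching_items(schemas, query):
    """schemas: list of (schema, items); query: {key: predicate}."""
    results = []
    for schema, items in schemas:
        # Skip schemas that do not identify a location for every queried key.
        if not all(key in schema for key in query):
            continue
        for item in items:
            # Read only the locations the schema identifies for the queried keys.
            values = {key: item[schema[key]] for key in query}
            if all(pred(values[key]) for key, pred in query.items()):
                results.append(values)
    return results

# The Hemingway example from the patent text:
schema = {"author": 0, "year published": 1, "title": 2}
items = [
    ("Ernest Hemingway", 1952, "The Old Man and the Sea"),
    ("Ernest Hemingway", 1940, "For Whom the Bell Tolls"),
]
query = {
    "author": lambda v: v == "Ernest Hemingway",
    "year published": lambda v: 1948 <= v <= 1952,
}
matches = find_matching_items([(schema, items)], query)
```

Only the first item satisfies both requirements, so it is the only match returned.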

It was interesting seeing Google come out with a patent about searching semi-structured data which focused upon the use of JSON-LD. We see them providing an example of JSON on one of the Google Developer’s pages at: Introduction to Structured Data

As it tells us on that page:

This documentation describes which fields are required, recommended, or optional for structured data with special meaning to Google Search. Most Search structured data uses schema.org vocabulary, but you should rely on the documentation on developers.google.com as definitive for Google Search behavior, rather than the schema.org documentation. Attributes or objects not described here are not required by Google Search, even if marked as required by schema.org.

The page then points us to the Structured Data Testing Tool, to be used as you prepare pages for use with Structured Data. It also tells us that for checking on Structured Data after it has been set up, the Structured Data Report in Google Search Console can be helpful, and is what I usually look at when doing site audits.

The Schema.org website has had a lot of JSON-LD examples added to it, and it was interesting to see this patent focus upon it. As they tell us about it in the patent, it seems that they like it:

Semi-structured data is self-describing data that does not conform to a static, predefined format. For example, one semi-structured data format is JavaScript Object Notation (JSON). A JSON data item generally includes one or more JSON objects, i.e., one or more unordered sets of key/value pairs. Another example semi-structured data format is Extensible Markup Language (XML). An XML data item generally includes one or more XML elements that define values for one or more keys.

I’ve used the analogy of how XML sitemaps are machine-readable, compared to HTML sitemaps; in the same way, JSON-LD presents facts on a site in a machine-readable form, as opposed to the content in HTML format. As the patent tells us, that is its purpose:
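
For reference, markup of this kind typically looks something like the following minimal schema.org example in JSON-LD (an illustration of the format, not an excerpt from the patent):

```json
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "The Old Man and the Sea",
  "author": {
    "@type": "Person",
    "name": "Ernest Hemingway"
  },
  "datePublished": "1952"
}
```

The key/value pairs (“author,” “datePublished”) are exactly the kind of semi-structured facts the patent describes searching.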

In general, this specification describes techniques for extracting facts from collections of documents.

The patent discusses schemas that might be on a site, and key/value pairs that could be searched, and details about such a search of semi-structured data on a site:

The aspect further includes receiving a query for semi-structured data items, wherein the query specifies requirements for values for one or more keys; identifying schemas from the plurality of schemas that identify locations for values corresponding to each of the one or more keys; for each identified schema, searching the encoded data items associated with the schema to identify encoded data items that satisfy the query; and providing data identifying values from the encoded data items that satisfy the query in response to the query. Searching the encoded data items associated with the schema includes: searching, for each encoded data item associated with the schema, the locations in the encoded data item identified by the schema as storing values for the specified keys to identify whether the encoded data item stores values for the specified keys that satisfy the requirements specified in the query.

The patent providing details of the use of JSON-LD to provide a machine readable set of facts on a site can be found here:

Storing semi-structured data
Inventors: Martin Probst
Assignee: Google Inc.
US Patent: 9,754,048
Granted: September 5, 2017
Filed: October 6, 2014

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for storing semi-structured data. One of the methods includes maintaining a plurality of schemas; receiving a first semi-structured data item; determining that the first semi-structured data item does not match any of the schemas in the plurality of schemas; and in response to determining that the first semi-structured data item does not match any of the schemas in the plurality of schemas: generating a new schema, encoding the first semi-structured data item in the first data format to generate the first new encoded data item in accordance with the new schema, storing the first new encoded data item in the data item repository, and associating the first new encoded data item with the new schema.
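
The flow in that abstract can be sketched roughly in Python, using an item’s sorted set of keys as a stand-in for the patent’s schema matching (a simplification of my own, not the patent’s actual encoding):

```python
# Rough sketch of the storage flow from the abstract: match an incoming
# semi-structured item to an existing schema, or generate a new schema,
# then encode the item so its values sit at known locations.

class SemiStructuredStore:
    def __init__(self):
        self.schemas = {}  # schema signature -> list of encoded items

    def add(self, item):
        signature = tuple(sorted(item))  # the key set identifies the "schema"
        if signature not in self.schemas:
            self.schemas[signature] = []  # no match: generate a new schema
        # Encode in a fixed key order so each value has a known location.
        encoded = tuple(item[key] for key in signature)
        self.schemas[signature].append(encoded)
        return signature

store = SemiStructuredStore()
sig = store.add({"author": "Ernest Hemingway", "title": "A Moveable Feast"})
store.add({"author": "Ernest Hemingway", "title": "In Our Time"})
store.add({"publisher": "Scribner"})  # new key set, so a new schema is created
```

Two items share the (author, title) schema, and the third triggers creation of a second schema, mirroring the “does not match any of the schemas” branch of the claim.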

Take Aways

By using structured data, such as Schema.org vocabulary in JSON-LD format, you provide precise facts in key/value pairs as an alternative to the HTML-based content on the pages of a site. Make sure that you follow the Structured Data General Guidelines from Google when you add it to a site. That page tells us that pages that don’t follow the guidelines may not rank as highly, or may become ineligible for rich results in Google SERPs.

And if you are optimizing a site for Google, it also helps to optimize the same site for Bing, and it is good to see that Bing seems to like JSON-LD too. It has taken a while for Bing to get there (see Aaron Bradley’s post, An Open Letter to Bing Regarding JSON-LD). It appears that Bing has listened a little, adding some capacity to check on JSON-LD after it is deployed: Bing announces Bing AMP viewer & JSON-LD support in Bing Webmaster Tools. The Bing Markup Validator does not yet help with JSON-LD, but Bing Webmaster Tools now helps with debugging JSON-LD. I like using this Structured Data Linter myself.






Schema, Structured Data, and Scattered Databases such as the World Wide Web

June 16, 2018 No Comments

I spoke at SMX Advanced this week on Schema markup and Structured Data, as part of an introduction to its use at Google.

I had the chance to visit Seattle, and tour some of it. I took some photos, but would like to go back sometime and take a few more, and see more of the city.

One of the places that I did want to see was Pike Place market. It was a couple of blocks away from the Hotel I stayed at (the Marriott Waterfront.)

It is a combination fish and produce market, and is home to one of the earliest Starbucks.

pike-place-market-entrance

I could see living near the market and shopping there regularly. It has a comfortable feel to it.

Pike Place Farmers Market

This is a view of the Farmers Market from the side. I wish I had the chance to come back later in the day, and see what it was like other than in the morning.

Victor Steinbrueck Park

This was a nice little park next to Pike Place Market, which looked like a place to take your dog for a walk while in the area, and had a great view of Elliot Bay (the central part of Puget Sound.)

A view of Puget Sound

This is a view of the waterfront from closer to the conference center.

Mount Rainier

You can see Mount Rainier from the top of the Conference Center.

My presentation for SMX Advanced 2018:

Schema, Structured Data & Scattered Databases Such as the World Wide Web. My role in this session is to introduce Schema and Structured Data and how Google is using them on the Web.

Google is possibly best known for the PageRank algorithm invented by founder Lawrence Page, after whom it is named. What looks like the second patent filed by someone at Google was the DIPRE (Dual Iterative Pattern Relation Expansion) patent, invented and filed by Sergey Brin. He didn’t name it after himself (BrinRank) like Page did with PageRank.

The provisional patent filing for this invention was the whitepaper, “Extracting Patterns and Relations from Scattered Databases such as the World Wide Web.” The process is set out in the paper, and it starts with a list of five books: their titles, authors, publishers, and years published. Unlike PageRank, it doesn’t involve crawling webpages and indexing links and anchor text from page to page. Instead, it collects facts: when it finds pages that contain the properties and attributes of those five books, it collects similar facts about other books on the same sites. Once it has finished, it moves on to other sites, looks for those same five books, and collects more books. The idea is to eventually know where all the books on the Web are, along with facts about those books that could be used to answer questions about them.
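
The loop described above might be sketched like this; a toy Python illustration of the DIPRE idea, with a naive regex-based pattern learner standing in for Brin’s actual pattern representation, and short strings standing in for web pages:

```python
# Toy DIPRE loop: learn patterns from occurrences of known (author, title)
# facts, then apply those patterns to harvest new facts from other "pages".
import re

def dipre(seed_facts, pages, rounds=2):
    facts = set(seed_facts)
    for _ in range(rounds):
        patterns = set()
        # 1. Learn separator patterns from occurrences of known facts.
        for author, title in facts:
            for page in pages:
                m = re.search(re.escape(title) + r"(.{1,10})" + re.escape(author), page)
                if m:
                    patterns.add(m.group(1))
        # 2. Apply each learned pattern to extract new (author, title) pairs.
        for sep in patterns:
            for page in pages:
                for title, author in re.findall(
                        r"([A-Z][\w' ]+)" + re.escape(sep) + r"([A-Z][\w. ]+)", page):
                    facts.add((author.strip(), title.strip()))
    return facts

pages = [
    "The Old Man and the Sea, by Ernest Hemingway, was published in 1952.",
    "For Whom the Bell Tolls, by Ernest Hemingway, appeared in 1940.",
]
seeds = {("Ernest Hemingway", "The Old Man and the Sea")}
found = dipre(seeds, pages)
```

Starting from one seed book, the loop learns the “, by ” pattern from the first page and uses it to discover the second book, which is the bootstrapping behavior the whitepaper describes.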

This is where we see Google being concerned about structured data on the web, and how helpful knowing about it could be.

When I first started out doing in-house SEO, it was for a Delaware incorporation business, and geography was an important part of the queries that my pages were found for. I had started looking at patents, and ones such as this one on “Generating Structured Data” caught my attention. It focused on collecting data about local entities, or local businesses, and properties related to those. It was built by the team led by Andrew Hogue, who was in charge of the annotation framework at Google; that team was responsible for “The Fact Repository,” an early version of Google’s Knowledge Graph.

If you’ve heard of NAP consistency, and of mentions being important to local search, it is because Local search was focusing on collecting structured data that could be used to answer questions about businesses. Patents about location prominence followed, which told us that a link counted as a mention, and a patent on local authority, which determined which Website was the authoritative one for a business. But, it seemed to start with collecting structured data about businesses at places.

The DIPRE Algorithm focused upon crawling the web to find facts, and Google Maps built that into an approach that could be used to rank places and answer questions about them.

If you haven’t had a chance to use Google’s experimental table search, it is worth trying out. It can find answers from data tables across the web to questions such as “what is the longest wooden pier in California,” which is the one in Oceanside, a town next to the one I live in. It comes from a WebTables project at Google.

Database fields are sometimes referred to as schema, and table headers, which tell us what kind of data is in a table column, may also be referred to as “schema.” A data-based web table could be considered a small structured database, and Google’s WebTables project found that there was a lot of information in web tables on the Web.

Try out the first link above (the WebTables Project Slide) when you get the chance, and do some searches on Google’s table search. The second paper is one that described the WebTables project when it first started out, and the one that follows it describes some of the things that Google researchers learned from the Project. We’ve seen Structured Snippets like the one above grabbing facts to include in a snippet (in this case from a data table on the Wikipedia page about the Oceanside Pier.)

When a data table column contains the same data that another table contains, and the first doesn’t have a table header label, it might learn a label from the second table (and this is considered a way to learn semantics, or meaning, from tables). These are truly scattered databases across the World Wide Web, but through the use of crawlers, that information can be collected and become useful, as the DIPRE Algorithm described.
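
That label-borrowing idea could be sketched like this; a simplified Python illustration where the overlap scoring and threshold are my own assumptions, not details from the WebTables papers:

```python
# Sketch of label recovery: if an unlabeled column shares enough values
# with a labeled column from another web table, borrow that label.

def borrow_labels(unlabeled_cols, labeled_tables, threshold=0.5):
    """unlabeled_cols: list of value lists; labeled_tables: {label: values}."""
    guesses = []
    for col in unlabeled_cols:
        best_label, best_overlap = None, 0.0
        for label, values in labeled_tables.items():
            # Fraction of the unlabeled column's values seen under this label.
            overlap = len(set(col) & set(values)) / len(set(col))
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        guesses.append(best_label if best_overlap >= threshold else None)
    return guesses

labeled = {"city": ["Oceanside", "Carlsbad", "Vista"],
           "state": ["California", "Oregon"]}
guesses = borrow_labels([["Carlsbad", "Oceanside"], ["1942", "1987"]], labeled)
```

The first unlabeled column overlaps entirely with the labeled “city” column and inherits that label; the second overlaps with nothing and stays unlabeled.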

In 2005, the Official Google Blog published this short story, which told us about Google sometimes answering direct questions in response to queries at the top of Web results. I don’t remember when these first started appearing, but do remember Definition results about a year earlier, which you could type out “Define:” and a word or ask “What is” before a word and Google would show a definition, and there was a patent that described how they were finding definitions from glossary pages, and how to ideally set up those glossaries, so that your definitions might be the ones that end up as responses.

In 2012, Google introduced the Knowledge Graph, which told us that they would be focusing upon learning about specific people, places and things, and answering questions about those instead of just continuing to match keywords in queries to keywords in documents. They told us that this was a move to things instead of strings. Like the books in Brin’s DIPRE or Local Entities in Google Maps.

We could start using the Web as a scattered database, with questions and answers from places such as Wikipedia tables helping to answer queries such as “What is the capital of Poland”

And Knowledge bases such as Wikipedia, Freebase, IMDB and Yahoo Finance could be the sources of facts about properties and attributes about things such as movies and actors and businesses where Google could find answers to queries without having to find results that had the same keywords in the document as the query.

In 2011, The Schema.org site was launched as a joint project from Google, Yahoo, Bing, and Yandex, that provided machine-readable text that could be added to web pages. This text is provided in a manner that is machine readable only, much like XML sitemaps are intended to be machine-readable, to provide an alternative channel of information to search engines about the entities pages are about, and the properties and attributes on those pages.

While Schema.org was introduced in 2011, it was built to be extendable, and to let subject matter experts add new schema, like this extension from GS1 (the inventors of barcodes in brick and mortar stores). If you haven’t tried out this demo from them, it is worth getting your hands on to see what is possible.

In 2014, Google published their Biperpedia paper, which tells us how they might create ontologies from query streams (sessions about specific topics) by finding terms to extract data about from the Web. At one point in time, search engines would do focused crawls of the web starting at sources such as DMOZ, so that the index of the Web they were constructing contained pages about a wide range of categories. By using query stream information, they are crowdsourcing the choice of topics to build ontologies about. The paper tells us that Biperpedia enabled them to build ontologies larger than what they had developed through Freebase, which may be partially why Freebase was replaced by Wikidata.

The Google+ group I’ve linked to above on the Schema Resources Page has members who work on Schema from Google, such as Dan Brickley, who is the head of schema for Google. Learning about extensions is a good idea, especially if you might consider participating in building new ones, and the community group has a mailing list, which lets you see and participate in discussions about the growth of Schema.






Google to Offer Combined Content (Paid and Organic) Search Results

June 12, 2018 No Comments

Combined Content Search Results

Google Introduces Combined Content Results

This new patent is about “combined content.” What does that mean exactly? When Google patents talk about paid search, they refer to those paid results as “content” rather than as advertisements. This patent is about how Google might combine paid search results with organic results in certain instances.

The recent patent from Google (Combining Content with Search Results) tells us about how Google might identify when organic search results might be about specific entities, such as brands. It may also recognize when paid results are about the same brands, whether they might be products from those brands.

In the event that a set of search results contains high ranking organic results from a specific brand, and a paid search result from that same brand, the process described in the patent might allow for the creation of a combined content result of the organic result with the paid result.

Merging Local and Organic Results in the Past

When I saw this new patent, it brought back memories of when Google found a way to merge organic search results with local search results. The day after I wrote about that, in the following post, I received a call from a co-worker who asked me if I had any idea why a top ranking organic result for a client might have disappeared from Google’s search results.

I asked her what the query term was, and who the client was. I performed the search, and noticed that our client was ranking highly for that query term in a local result, but their organic result had disappeared. I pointed her to the blog post I wrote the day before, about Google possibly merging local and organic results, with the organic result disappearing, and the local result getting boosted in rankings. It seemed like that is what happened to our client, and I sent her a link to my post, which described that.

How Google May Diversify Search Results by Merging Local and Web Search Results

Google did merge that client’s organic listing with their local listing, but it appeared that was something that they ended up not doing too often. I didn’t see them do that too many more times.

I am wondering: will Google start merging paid search results with organic search results? If they would do that for local and organic results, which rank things in different ways, it is possible that they might do the same with organic and paid results. The patent describes how.

The newly granted patent does tell us about how paid search works in Search results at Google:

Content slots can be allocated to content sponsors as part of a reservation system, or in an auction. For example, content sponsors can provide bids specifying amounts that the sponsors are respectively willing to pay for presentation of their content. In turn, an auction can be run, and the slots can be allocated to sponsors according, among other things, to their bids and/or the relevance of the sponsored content to content presented on a page hosting the slot or a request that is received for the sponsored content. The content can be provided to a user device such as a personal computer (PC), a smartphone, a laptop computer, a tablet computer, or some other user device.

Combined Content – Combining Paid and Organic Results

Here is the process behind this new patent involving merging paid results (content) and organic results:

  1. A search query is received.
  2. Search results responsive to the query are returned, including one associated with a brand.
  3. Content items (paid search results), identified based at least in part on the query, are returned for delivery along with the search results responsive to the query.
  4. This approach includes looking to see if eligible content items are associated with a same brand as the brand associated in the organic search results.
  5. If there is a paid result and an organic result that are associated with each other, it may combine the organic search result and the eligible content item into a combined content item, and provide the combined content item as a search result responsive to the request.
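
The steps above might be sketched like this; an illustrative Python outline where the field names and matching logic are mine, not the patent’s:

```python
# Sketch of the combining step: when a top organic result and an eligible
# paid item resolve to the same brand, merge them into one combined result.

def combine_results(organic_results, paid_items):
    serp = []
    used_paid = set()
    for result in organic_results:
        # Find an eligible paid item associated with the same brand.
        match = next((p for p in paid_items
                      if p["brand"] == result["brand"]
                      and id(p) not in used_paid), None)
        if match:
            used_paid.add(id(match))
            serp.append({
                "type": "combined",
                "vis_url": result["url"],          # the VisURL shown to users
                "title": result["title"],
                "ad_text": match["text"],          # would be flagged as sponsored
                "landing_page": match["landing_page"],
            })
        else:
            serp.append({"type": "organic", **result})
    return serp

organic = [{"brand": "Acme", "url": "acme.example", "title": "Acme Widgets"}]
paid = [{"brand": "Acme", "text": "Sale on widgets",
         "landing_page": "acme.example/sale"}]
serp = combine_results(organic, paid)
```

Here the single organic result and the single paid item share a brand, so they collapse into one combined entry rather than appearing twice on the page.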

When Google decides whether the eligible content item is associated with the same brand as an organic result, it is a matter of determining that one content item is sponsored by an owner of the brand.

A combined result (of the paid and the organic results covering the same brand) includes what the patent refers to as “a visual universal resource locator (VisURL).”

That combined item would include:

  • A title
  • Text from the paid result
  • A link to a landing page from the paid result into the combined content item
  • The combined item may also include other information associated with the brand, such as:

  • A map to retail locations associated with brand retail presence.
  • Retail location information associated with the brand.

In addition to the brand owner, the organic result that could be combined might be from a retailer associated with the brand.

It can involve designating content from the sponsored item that is included in the combined content item as sponsored content (so it may show that content from the paid result as being an ad.)

It may also include “monetizing interactions with material that is included from the at least one eligible content item that is included in the combined content item based on user interactions with the material.” Additional items shown could include an image or logo associated with the brand, or one or more products associated with the brand, or combine additional links relevant to the result.

Additional Brand Content in Search Results

The patent behind this approach of combining paid and organic results was this one, granted in April:

Combining content with a search result
Inventors: Conrad Wai, Christopher Souvey, Lewis Denizen, Gaurav Garg, Awaneesh Verma, Emily Kay Moxley, Jeremy Silber, Daniel Amaral de Medeiros Rocha and Alexander Fischer
Assignee: Google LLC
US Patent: 9,947,026
Granted: April 17, 2018
Filed: May 12, 2016

Abstract

Methods, systems, and apparatus include computer programs encoded on a computer-readable storage medium, including a method for providing content. A search query is received. Search results responsive to the query are identified, including identifying a first search result in a top set of search results that is associated with a brand. Based at least in part on the query, one or more eligible content items are identified for delivery along with the search results responsive to the query. A determination is made as to when at least one of the eligible content items is associated with a same brand as the brand associated with the first search result. The first search result and one of the determined at least one eligible content items are combined into a combined content item and providing the combined content item as a search result responsive to the request.

The patent does include details on things such as an “entity/brand determination engine,” which can be used to compare paid results with organic results, to see if they cover the same brand. This is one of the changes that indexing things instead of strings is bringing us.

The patent does have many other details, and until Google announces that they are introducing this, I suspect we won’t hear more details from them about it. Then again, they didn’t announce officially that they were merging organic and local results when they started doing that. Don’t be surprised if this becomes available at Google.






PageRank Updated

April 26, 2018 No Comments

A popular search engine developed by Google Inc. of Mountain View, Calif. uses PageRank® as a page-quality metric for efficiently guiding the processes of web crawling, index selection, and web page ranking. Generally, the PageRank technique computes and assigns a PageRank score to each web page it encounters on the web, wherein the PageRank score serves as a measure of the relative quality of a given web page with respect to other web pages. PageRank generally ensures that important and high-quality web pages receive high PageRank scores, which enables a search engine to efficiently rank the search results based on their associated PageRank scores.

~ Producing a ranking for pages using distances in a web-link graph

A continuation patent of an updated PageRank was granted today. The original patent was filed in 2006, and reminded me a lot of Yahoo’s Trustrank (which is cited by the patent’s applicants as one of a large number of documents that this new version of the patent is based upon.)

I first wrote about this patent in the post titled, Recalculating PageRank. It was originally filed in 2006, and the first claim in the patent read like this (note the mention of “Seed Pages”):

What is claimed is:

1. A method for producing a ranking for pages on the web, comprising: receiving a plurality of web pages, wherein the plurality of web pages are inter-linked with page links; receiving n seed pages, each seed page including at least one outgoing link to a respective web page in the plurality of web pages, wherein n is an integer greater than one; assigning, by one or more computers, a respective length to each page link and each outgoing link; identifying, by the one or more computers and from among the n seed pages, a kth-closest seed page to a first web page in the plurality of web pages according to the lengths of the links, wherein k is greater than one and less than n; determining a ranking score for the first web page from a shortest distance from the kth-closest seed page to the first web page; and producing a ranking for the first web page from the ranking score.

The first claim in the newer version of this continuation patent is:

What is claimed is:

1. A method, comprising: obtaining data identifying a set of pages to be ranked, wherein each page in the set of pages is connected to at least one other page in the set of pages by a page link; obtaining data identifying a set of n seed pages that each include at least one outgoing link to a page in the set of pages, wherein n is greater than one; accessing respective lengths assigned to one or more of the page links and one or more of the outgoing links; and for each page in the set of pages: identifying a kth-closest seed page to the page according to the respective lengths, wherein k is greater than one and less than n, determining a shortest distance from the kth-closest seed page to the page; and determining a ranking score for the page based on the determined shortest distance, wherein the ranking score is a measure of a relative quality of the page relative to other pages in the set of pages.
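
The distance-based scoring in that claim can be sketched with ordinary shortest-path computations; an illustrative Python sketch using Dijkstra's algorithm, where the toy graph and link lengths are made up and a shorter kth-seed distance stands in for a better ranking score:

```python
# Sketch of the claim: compute shortest distances from every seed page,
# then score each page by its distance from the kth-closest seed.
import heapq

def seed_distances(graph, seed):
    """graph: {node: {neighbor: link_length}}; returns shortest distances."""
    dist = {seed: 0.0}
    heap = [(0.0, seed)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, length in graph.get(node, {}).items():
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def kth_seed_distance(graph, seeds, page, k):
    # Distance from the kth-closest seed (k > 1 and k < n in the claim).
    dists = sorted(seed_distances(graph, s).get(page, float("inf"))
                   for s in seeds)
    return dists[k - 1]

graph = {"s1": {"a": 1.0}, "s2": {"a": 3.0}, "s3": {"a": 5.0}}
score_a = kth_seed_distance(graph, ["s1", "s2", "s3"], "a", k=2)
```

With three seeds at distances 1, 3, and 5 from page "a", the 2nd-closest seed is at distance 3, which is the quantity the ranking score would be derived from. Using the kth-closest seed rather than the single closest one makes it harder to manipulate a page's score by getting one link from near a single seed.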

Producing a ranking for pages using distances in a web-link graph
Inventors: Nissan Hajaj
Assignee: Google LLC
US Patent: 9,953,049
Granted: April 24, 2018
Filed: October 19, 2015

Abstract

One embodiment of the present invention provides a system that produces a ranking for web pages. During operation, the system receives a set of pages to be ranked, wherein the set of pages are interconnected with links. The system also receives a set of seed pages which include outgoing links to the set of pages. The system then assigns lengths to the links based on properties of the links and properties of the pages attached to the links. The system next computes shortest distances from the set of seed pages to each page in the set of pages based on the lengths of the links between the pages. Next, the system determines a ranking score for each page in the set of pages based on the computed shortest distances. The system then produces a ranking for the set of pages based on the ranking scores for the set of pages.

Under this newer version of PageRank, we see how it might avoid manipulation by building trust into a link graph like this:

One possible variation of PageRank that would reduce the effect of these techniques is to select a few “trusted” pages (also referred to as the seed pages) and discovers other pages which are likely to be good by following the links from the trusted pages. For example, the technique can use a set of high quality seed pages (s.sub.1, s.sub.2, . . . , s.sub.n), and for each seed page i=1, 2, . . . , n, the system can iteratively compute the PageRank scores for the set of the web pages P using the formulae:

∀ p ∈ P, p ≠ sᵢ: Rᵢ(p) = Σ over links q→p of w(q→p) · Rᵢ(q) / |O(q)|, where Rᵢ(sᵢ) = 1, w(q→p) is an optional weight given to the link q→p based on its properties (with the default weight of 1), and |O(q)| is the number of outgoing links on page q.

Generally, it is desirable to use a large number of seed pages to accommodate the different languages and a wide range of fields which are contained in the fast growing web contents. Unfortunately, this variation of PageRank requires solving the entire system for each seed separately. Hence, as the number of seed pages increases, the complexity of computation increases linearly, thereby limiting the number of seeds that can be practically used.

Hence, what is needed is a method and an apparatus for producing a ranking for pages on the web using a large number of diversified seed pages without the problems of the above-described techniques.
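The per-seed computation described above can be sketched in a few lines of Python. This is my own illustrative reading of the patent's formula, not code from the patent: each seed's score is pinned at 1, and every other page accumulates score from its in-links, scaled by each linking page's out-degree and an optional link weight.

```python
def seeded_pagerank(inlinks, out_degree, seed, iterations=50):
    """Iteratively compute R_seed(p) for every page p.

    inlinks: dict mapping page p -> list of (q, weight) pairs, one per link q -> p
    out_degree: dict mapping page q -> number of outgoing links on q
    seed: the trusted seed page, whose score stays pinned at 1.0
    """
    scores = {p: 0.0 for p in out_degree}
    scores[seed] = 1.0
    for _ in range(iterations):
        # Recompute every non-seed page from the previous iteration's scores.
        updated = {
            p: sum(w * scores[q] / out_degree[q] for q, w in links)
            for p, links in inlinks.items()
            if p != seed
        }
        scores.update(updated)
    return scores
```

As the patent complains, this whole computation has to be run once per seed, so the cost grows linearly with the number of seeds, which is what motivates the distance-based approach of the granted claims.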

The summary of the patent describes it like this:

One embodiment of the present invention provides a system that ranks pages on the web based on distances between the pages, wherein the pages are interconnected with links to form a link-graph. More specifically, a set of high-quality seed pages are chosen as references for ranking the pages in the link-graph, and shortest distances from the set of seed pages to each given page in the link-graph are computed. Each of the shortest distances is obtained by summing lengths of a set of links which follows the shortest path from a seed page to a given page, wherein the length of a given link is assigned to the link based on properties of the link and properties of the page attached to the link. The computed shortest distances are then used to determine the ranking scores of the associated pages.
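The granted claims replace per-seed PageRank with shortest distances. Here is a hedged sketch of one way that might look: run Dijkstra's algorithm from each seed over the weighted link graph, then score each page by its distance to its kth-closest seed. The link lengths, the value of k, and the distance-to-score transform are all left open by the patent; negating the distance below is just one illustrative choice.

```python
import heapq

def shortest_distances(graph, source):
    """Dijkstra over a weighted link graph: dict page -> list of (target, length)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, page = heapq.heappop(heap)
        if d > dist.get(page, float('inf')):
            continue  # stale heap entry
        for target, length in graph.get(page, []):
            nd = d + length
            if nd < dist.get(target, float('inf')):
                dist[target] = nd
                heapq.heappush(heap, (nd, target))
    return dist

def ranking_scores(graph, seeds, k):
    """Score each page by its shortest distance to its kth-closest seed."""
    per_seed = [shortest_distances(graph, s) for s in seeds]
    pages = set(graph) | {t for links in graph.values() for t, _ in links}
    scores = {}
    for page in pages:
        dists = sorted(d.get(page, float('inf')) for d in per_seed)
        scores[page] = -dists[k - 1]  # shorter distance -> higher score
    return scores
```

Using the kth-closest seed rather than the single closest one makes the score harder to manipulate: a page needs short paths from several independent trusted seeds, not just one.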

The patent discusses the importance of a diversity of topics covered by seed sites, and the value of a large set of seed sites. It also gives us a summary of crawling and ranking and searching like this:

Crawling, Ranking and Searching Processes

FIG. 3 illustrates the crawling, ranking and searching processes in accordance with an embodiment of the present invention. During the crawling process, web crawler 304 crawls or otherwise searches through websites on web 302 to select web pages to be stored in indexed form in data center 308. In particular, web crawler 304 can prioritize the crawling process by using the page rank scores. The selected web pages are then compressed, indexed and ranked in 305 (using the ranking process described above) before being stored in data center 308.

During a subsequent search process, a search engine 312 receives a query 313 from a user 311 through a web browser 314. This query 313 specifies a number of terms to be searched for in the set of documents. In response to query 313, search engine 312 uses the ranking information to identify highly-ranked documents that satisfy the query. Search engine 312 then returns a response 315 through web browser 314, wherein the response 315 contains matching pages along with ranking information and references to the identified documents.

I’m thinking about looking up the many articles cited in the patent, and providing links to them, because they seem to be tremendous resources about the Web. I’ll likely publish those soon.


Copyright © 2018 SEO by the Sea ⚓. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at may be guilty of copyright infringement. Please contact SEO by the Sea, so we can take appropriate action immediately.
Plugin by Taragana

The post PageRank Updated appeared first on SEO by the Sea ⚓.


SEO by the Sea ⚓


3 Ways Query Stream Ontologies Change Search

March 8, 2018 No Comments

What are query stream ontologies, and how might they change search?

Search engines trained us to use keywords when we searched – to guess which words or phrases might best surface whatever we had a situational or informational need to learn more about. Keywords were an important and essential part of SEO – trying to get pages to rank highly in search results for certain keywords found in the queries that people searched for. SEOs still optimize pages for keywords, hoping that a combination of information retrieval relevance scores and link-based PageRank scores will get pages to rank highly in search results.

With Google moving towards a knowledge-based attempt to find “things” rather than “strings,” we are seeing patents that focus upon returning answers to questions in search results. One of those, from January, describes how query stream ontologies might be created from searchers’ queries and then used to respond to fact-based questions with information about the attributes of entities.

There is a white paper from Google, co-authored by the inventors of this patent and published around the time the patent was filed in 2014, that is worth spending time reading through. The paper is titled Biperpedia: An Ontology for Search Applications.

The patent (and paper) both focus upon the importance of structured data. The summary for the patent tells us this:

Search engines often are designed to recognize queries that can be answered by structured data. As such, they may invest heavily in creating and maintaining high-precision databases. While conventional databases in this context typically have a relatively wide coverage of entities, the number of attributes they model (e.g., GDP, CAPITAL, ANTHEM) is relatively small.

The patent is:

Identifying entity attributes
Inventors: Alon Yitzchak Halevy, Fei Wu, Steven Euijong Whang and Rahul Gupta
Assignee: Google Inc. (Mountain View, CA)
US Patent: 9,864,795
Granted: January 9, 2018
Filed: October 28, 2014

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating an ontology of entity attributes. One of the methods includes extracting a plurality of attributes based upon a plurality of queries; and constructing an ontology based upon the plurality of attributes and a plurality of entity classes.

The paper echoes sentiments in the patent, with statements such as this one:

For the first time in the history of the Web, structured data is a first-class citizen among search results. The main search engines make significant efforts to recognize when a user’s query can be answered using structured data.

To cut right to the heart of what this patent covers, it’s worth pulling out the first claim from the patent that expresses how much of an impact this patent may have from a knowledge-based approach to collecting data and indexing information on the Web. Like most patent language, it’s a long passage that tends to run on, but it is very detailed about the process that this patent covers:

1. A method comprising: generating an ontology of class-attribute pairs, wherein each class that occurs in the class-attribute pairs of the ontology is a class of entities and each attribute occurring in the class-attribute pairs of the ontology is an attribute of the respective entities in the class of the class-attribute pair in which the attribute occurs, wherein each attribute in the class-attribute pairs has one or more domains of instances to which the attribute applies and a range that is either a class of entities or a type of data, and wherein generating the ontology comprises: obtaining class-entity data representing a set of classes and, for each class, entities belonging to the class as instances of the class; obtaining a plurality of entity-attribute pairs, wherein each entity-attribute pair identifies an entity that is represented in the class-entity data and a candidate attribute for the entity; determining a plurality of attribute extraction patterns from occurrences of the entities identified by the entity-attribute pairs with the candidate attributes identified by the entity-attribute pairs in text of documents in a collection of documents, wherein determining the plurality of attribute extraction patterns comprises: identifying an occurrence of the entity and the candidate attribute identified by a first entity-attribute pair in a first sentence from a first document in the collection of documents; generating a candidate lexical attribute extraction pattern from the first sentence; generating a candidate parse attribute extraction pattern from the first sentence; and selecting the candidate lexical attribute extraction pattern and the candidate parse attribute extraction pattern as attribute extraction patterns if the candidate lexical attribute pattern and the candidate parse attribute extraction patterns were generated using at least a predetermined number of unique entity-attribute pairs; and applying the plurality of attribute extraction patterns to the documents in the collection of documents to determine entity-attribute pairs, and from the entity-attribute pairs and the class-entity data, for each of one or more entity classes represented in the class-entity data, attributes possessed by entities belonging to the entity class.

Rather than making this post just the claims of this patent (which are worth going through if you can tolerate the legalese), I’m going to pull out some information from the description which describes some of the implications of the process behind the patent. This first one tells us of the benefit of crowdsourcing an ontology, by building it from the queries of many searchers, and how that may mean that focusing upon matching keywords in queries with keywords in documents becomes less important than responding to queries with answers to questions:

Extending the number of attributes known to a search engine may enable the search engine to answer more precisely queries that lie outside a “long tail,” of statistical query arrangements, extract a broader range of facts from the Web, and/or retrieve information related to semantic information of tables present on the Web.

This patent provides a lot of information about how such an ontology might be used to assist search:

The present disclosure provides systems and techniques for creating an ontology of, for example, millions of (class, attribute) pairs, including 100,000 or more distinct attribute names, which is up to several orders of magnitude larger than available conventional ontologies. Extending the number of attributes “known” to a search engine may provide several benefits. First, additional attributes may enable the search engine to more precisely answer “long-tail” queries, e.g., brazil coffee production. Second, additional attributes may allow for extraction of facts from Web text using open information extraction techniques. As another example, a broad repository of attributes may enable recovery of the semantics of tables on the Web, because it may be easier to recognize attribute names in column headers and in the surrounding text.
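As a toy illustration of mining attributes from a query stream (my own sketch, far simpler than the lexical and parse extraction patterns the claims describe), candidate (class, attribute) pairs can be pulled from queries of the form “&lt;attribute&gt; of &lt;entity&gt;”, using a small entity-to-class table in place of Google’s class-entity data:

```python
import re
from collections import Counter

# Matches queries like "capital of france"; the attribute match is
# non-greedy so multi-word attributes such as "coffee production" work.
QUERY_PATTERN = re.compile(r'^(?P<attr>[\w ]+?) of (?P<entity>[\w ]+)$')

def extract_class_attributes(queries, entity_to_class, min_count=2):
    """Return (class, attribute) pairs supported by at least min_count queries."""
    counts = Counter()
    for query in queries:
        match = QUERY_PATTERN.match(query.strip().lower())
        if not match:
            continue
        entity_class = entity_to_class.get(match.group('entity'))
        if entity_class:  # keep only entities whose class we know
            counts[(entity_class, match.group('attr'))] += 1
    return {pair for pair, n in counts.items() if n >= min_count}
```

For example, the queries “capital of france” and “capital of brazil”, with both entities mapped to the class “country”, would crowdsource the pair (“country”, “capital”).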

Answering Queries with Attributes

I wrote about the topic of How Knowledge Base Entities can be Used in Searches, describing how Google might search a data store of attributes about entities such as movies to answer fact-based questions like “What is the movie where Robert Duvall loves the smell of Napalm in the morning?” Building up a detailed ontology that includes many facts means a search engine can answer many questions quickly. This may be how featured snippets are answered in the future, but the patent that describes this approach returns SERPs filled with links to web documents, rather than answers to questions.

Open Information Extraction

That mention of open information extraction methods in the patent reminded me of an acquisition Google made a few years ago, when it acquired Wavii in April of 2013. Wavii did research about open information extraction, as described in these papers:

A video that might be helpful to learn about how Open Information Extraction works is this one:

Open Information Extraction at Web Scale

An Ontology created from a query stream can lead to this kind of open information extraction

Semantics from Tables on the Web

Google has been running a Webtables project for a few years, and has released a followup that describes how the project has been going. Semantics from Tables is mentioned in this patent, so it’s worth including some papers about the Webtables project to give you more information about them, if you hadn’t come across them before:

Query Stream Ontologies

The process in the patent involves using a query stream to create an ontology. I enjoyed the statements in this patent about what an ontology is and how one works to help search. I recommend clicking through and reading the description in the patent along with the Biperpedia paper. This really is a transformation that takes search beyond keywords, toward a better understanding of entities. It appears to be a very real future of search:

Systems and techniques disclosed herein may extract attributes from a query stream, and then use extractions to seed attribute extraction from other text. For every attribute a set of synonyms and text patterns in which it appears is saved, thereby enabling the ontology to recognize the attribute in more contexts. An attribute in an ontology as disclosed herein includes a relationship between a pair of entities (e.g., CAPITAL of countries), between an entity and a value (e.g., COFFEE PRODUCTION), or between an entity and a narrative (e.g., CULTURE). An ontology as disclosed herein may be described as a “best-effort” ontology, in the sense that not all the attributes it contains are equally meaningful. Such an ontology may capture attributes that people consider relevant to classes of entities. For example, people may primarily express interest in attributes by querying a search engine for the attribute of a particular entity or by using the attribute in written text on the Web. In contrast to a conventional ontology or database schema, a best-effort ontology may not attach a precise definition to each attribute. However, it has been found that such an ontology still may have a relatively high precision (e.g., 0.91 for the top 100 attributes and 0.52 for the top 5000 attributes).

The ontologies that are created from query streams expressly to assist search applications are different from more conventional manually generated ontologies in a number of ways:

Ontologies as disclosed herein may be particularly well-suited for use in search applications. In particular, tasks such as parsing a user query, recovering the semantics of columns of Web tables, and recognizing when sentences in text refer to attributes of entities, may be performed efficiently. In contrast, conventional ontologies tend to be relatively inflexible or brittle because they rely on a single way of modeling the world, including a single name for any class, entity or attribute. Hence, supporting search applications with a conventional ontology may be difficult because mapping a query or a text snippet to the ontology can be arbitrarily hard. An ontology as disclosed herein may include one or more constructs that facilitate query and text understanding, such as attaching to every attribute a set of common misspellings of the attribute, exact and/or approximate synonyms, other related attributes (even if the specific relationship is not known), and common text phrases that mention the attribute.
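The construct described in that passage – attaching misspellings, synonyms, and common phrases to every attribute – can be pictured as a simple surface-form index. The data shapes and field names here are my own invention, not the patent’s:

```python
from dataclasses import dataclass, field

@dataclass
class AttributeEntry:
    name: str                 # canonical attribute name, e.g. "capital"
    domain_class: str         # class it applies to, e.g. "country"
    synonyms: set = field(default_factory=set)
    misspellings: set = field(default_factory=set)

def build_surface_index(entries):
    """Map every surface form a query might use back to its canonical attribute."""
    index = {}
    for entry in entries:
        for form in {entry.name} | entry.synonyms | entry.misspellings:
            index[form] = (entry.domain_class, entry.name)
    return index
```

A query parser backed by such an index can look up “capitol” and still recover the canonical attribute “capital” for the class “country” – exactly the kind of flexibility a conventional, single-name schema lacks.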

The patent does include more about ontologies and schema and data sources and query patterns.

This is a direction that search is traveling towards, and if you want to know or do SEO, it’s worth learning about. SEO is changing, just as it has many times in the past.

I’ve also written a followup to this post on the Go Fish Digital blog at: SEO Moves From Keywords to Ontologies and Query Patterns



The post 3 Ways Query Stream Ontologies Change Search appeared first on SEO by the Sea ⚓.




Related Questions are Joined by ‘People Also Search For’ Refinements; Now Using a Question Graph

February 22, 2018 No Comments

meyer lemon tree related questions

I recently bought a lemon tree and wanted to learn how to care for it. I started asking about it at Google, which provided me with other questions and answers related to caring for a lemon tree. As I clicked upon some of those, others were revealed that gave me more information that was helpful.

Last March, I wrote a post about Related Questions at Google, Google’s Related Questions Patent or ‘People Also Ask’ Questions.

As Barry Schwartz noted recently at Search Engine Land, Google is now also showing alternative query refinements as ‘People Also Search For’ listings, in the post Google launches new look for ‘people also search for’ search refinements. That was enough to have me look to see whether the original “Related Questions” patent had been updated by Google. It was: a continuation patent was granted in June of last year, with the same name but updated claims.

The older version of the patent can be found at Generating related questions for search queries

It doesn’t say anything about the change in wording away from “Related Questions.” Some “people also search for” results don’t necessarily take the form of questions, either (so “people also ask” may be very appropriate, and continue to be something we see in the future). But the claims from the new patent contain some new phrases and language that weren’t in the old patent. The new patent is at:

Generating related questions for search queries
Inventors: Yossi Matias, Dvir Keysar, Gal Chechik, Ziv Bar-Yossef, and Tomer Shmiel
Assignee: Google Inc.
US Patent: 9,679,027
Granted: June 13, 2017
Filed: December 14, 2015

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying related questions for a search query is described. One of the methods includes receiving a search query from a user device; obtaining a plurality of search results for the search query provided by a search engine, wherein each of the search results identifies a respective search result resource; determining one or more respective topic sets for each search result resource, wherein the topic sets for the search result resource are selected from previously submitted search queries that have resulted in users selecting search results identifying the search result resource; selecting related questions from a question database using the topic sets; and transmitting data identifying the related questions to the user device as part of a response to the search query.

The first claim brings a new concept into the world of related questions and answers, which I will highlight in it:

1. A method performed by one or more computers, the method comprising: generating a question graph that includes a respective node for each of a plurality of questions; connecting, with links in the question graph, nodes for questions that are equivalent, comprising: identifying selected resources for each of the plurality of questions based on user selections of search results in response to previous submissions of the question as a search query to a search engine; identifying pairs of questions from the plurality of questions, wherein the questions in each identified pair of questions have at least a first threshold number of common identified selected resources; and for each identified pair, connecting the nodes for the questions in the identified pair with a link in the question graph; receiving a new search query from a user device; obtaining an initial ranking of questions that are related to the new search query; generating a modified ranking of questions that are related to the new search query, comprising, for each question in the initial ranking: determining whether the question is equivalent to any higher-ranked questions in the initial ranking by determining whether a node for the question is connected by a link to any of the nodes for any of the higher-ranked questions in the question graph; and when the question is equivalent to any of the higher-ranked questions, removing the question from the modified ranking; selecting one or more questions from the modified ranking; and transmitting data identifying the selected questions to the user device as part of a response to the new search query.

A question graph would be a semantic approach towards asking and answering questions that are related to each other in meaningful ways.
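A minimal sketch of how I read that first claim: link two questions as equivalent when past searchers clicked at least a threshold number of the same results for both, then drop any question whose node links to a higher-ranked one. The threshold and data shapes below are illustrative, not from the patent.

```python
from itertools import combinations

def build_question_graph(clicked_resources, threshold=2):
    """clicked_resources: dict question -> set of resources users selected for it.
    Returns an adjacency map linking questions judged equivalent."""
    graph = {q: set() for q in clicked_resources}
    for q1, q2 in combinations(clicked_resources, 2):
        if len(clicked_resources[q1] & clicked_resources[q2]) >= threshold:
            graph[q1].add(q2)
            graph[q2].add(q1)
    return graph

def prune_equivalents(ranked_questions, graph):
    """Remove each question that is equivalent to any higher-ranked question."""
    kept = []
    for i, question in enumerate(ranked_questions):
        higher_ranked = ranked_questions[:i]
        if not any(question in graph.get(h, set()) for h in higher_ranked):
            kept.append(question)
    return kept
```

The effect is that a searcher sees a diverse list of related questions rather than several rewordings of the same one.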

In addition to the “question graph” mentioned in that first claim, we are also told that Google is keeping an eye on how often people are selecting these related questions, and how often they are clicking upon and reading them.

The descriptions and the images in the patent are from the original version, so none of them reflect what a question graph might look like. For a while, Facebook offered graph search as a feature you could use to search on Facebook, and it used questions that were related to each other. I found a screenshot that shows some of those off, and such related questions could be considered to come from a question graph. It isn’t quite the same thing as what Google is doing with related questions, but the idea of showing questions that may be related to an initial query, and keeping an eye on whether people spend time looking at them, makes sense. I’ve been seeing a lot of related questions in search results and have been using them. Here are the Facebook graph search questions:

Facebook Graph Search Related questions

As you can see, those questions share some facts, and are considered related to each other because they do. That makes them similar to the related questions found from a question graph, which might be of interest to a searcher who asks the first query. It is interesting that the new patent claims check whether the related questions being shown are clicked upon, which tells Google whether searchers have any interest in continuing to see related questions. I’ve found them easy to click upon, and interesting.

Are you working questions and answers into your content?



The post Related Questions are Joined by ‘People Also Search For’ Refinements; Now Using a Question Graph appeared first on SEO by the Sea ⚓.

