Does Tomorrow Deliver Topical Search Results at Google?

October 16, 2017
The Oldest Pepper Tree in California

At one point in time, search engines such as Google learned about topics on the Web from sources such as Yahoo! and the Open Directory Project, which provided categories of sites, within directories that people could skim through to find something that they might be interested in.

Those listings of categories included hierarchical topics and subtopics; but they were managed by human beings and both directories have closed down.

In addition to learning about categories and topics from such places, search engines used to use such sources to do focused crawls of the web, to make sure that they were indexing as wide a range of topics as they could.

It’s possible that we are seeing those sites replaced by sources such as Wikipedia and Wikidata, Google’s Knowledge Graph, and the Microsoft Concept Graph.

Last year, I wrote a post called, Google Patents Context Vectors to Improve Search. It focused upon a Google patent titled User-context-based search engine.

In that patent we learned that Google was using information from knowledge bases (sources such as Yahoo Finance, IMDB, Wikipedia, and other data-rich and well organized places) to learn about words that may have more than one meaning.

An example from that patent was that the word “horse” has different meanings in different contexts.

To an equestrian, a horse is an animal. To a carpenter, a horse is a work tool. To a gymnast, a horse is a piece of equipment upon which they perform maneuvers during competitions.

A context vector takes these different meanings from knowledge bases, and the number of times they are mentioned in those places to catalogue how often they are used in which context.
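As a rough sketch of that idea, counting sense-labeled mentions across sources and normalizing the counts yields a simple context vector. The context labels and counts below are invented for illustration, not taken from the patent:

```python
from collections import Counter

def context_vector(mentions):
    """Build a normalized context vector from sense-labeled mentions.

    `mentions` is a list of context labels observed for a term across
    knowledge-base entries (the labels here are hypothetical). Returns
    a dict mapping each context to its share of all mentions.
    """
    counts = Counter(mentions)
    total = sum(counts.values())
    return {context: count / total for context, count in counts.items()}

# "horse" mentioned 6 times as an animal, 3 times as a gymnastics
# apparatus, and once as a carpentry tool across a toy corpus:
vec = context_vector(["animal"] * 6 + ["gymnastics"] * 3 + ["carpentry"])
# vec["animal"] is 0.6, reflecting the dominant sense
```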

I thought knowing about context vectors was useful for doing keyword research, but I was excited to see another patent from Google appear in which the word “context” plays a featured role. When you search for something such as “horse,” the search results you receive will mix horses of different types, depending upon the meaning. As this new patent tells us about such search results:

The ranked list of search results may include search results associated with a topic that the user does not find useful and/or did not intend to be included within the ranked list of search results.

If I were searching for a horse of the animal type, I might include another word in my query to better identify the context of my search. The inventors of this new patent seem to have had a similar idea. The patent mentions:

In yet another possible implementation, a system may include one or more server devices to receive a search query and context information associated with a document identified by the client; obtain search results based on the search query, the search results identifying documents relevant to the search query; analyze the context information to identify content; and generate a group of first scores for a hierarchy of topics, each first score, of the group of first scores, corresponding to a respective measure of relevance of each topic, of the hierarchy of topics, to the content.

From the pictures that accompany the patent, it looks like this context information takes the form of headings above each search result that identify the context those results fit within. Here’s a drawing from the patent showing topical search results (rock/music and geology/rocks):

Search Results in Context
Different types of ‘rock’ on a search for ‘rock’ at Google

This patent reminds me of the context vector patent, and the processes in the two patents look like they could work together. The patent is:

Context-based filtering of search results
Inventors: Sarveshwar Duddu, Kuntal Loya, Minh Tue Vo Thanh and Thorsten Brants
Assignee: Google Inc.
US Patent: 9,779,139
Granted: October 3, 2017
Filed: March 15, 2016

Abstract

A server is configured to receive, from a client, a query and context information associated with a document; obtain search results, based on the query, that identify documents relevant to the query; analyze the context information to identify content; generate first scores for a hierarchy of topics, that correspond to measures of relevance of the topics to the content; select a topic that is most relevant to the context information when the topic is associated with a greatest first score; generate second scores for the search results that correspond to measures of relevance, of the search results, to the topic; select one or more of the search results as being most relevant to the topic when the search results are associated with one or more greatest second scores; generate a search result document that includes the selected search results; and send, to a client, the search result document.
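The sequence of steps in that abstract can be sketched as a toy filter: score each topic against the context content (the first scores), pick the highest-scoring topic, then keep the results most relevant to it (the second scores). All names, data, and the simple overlap scoring below are my own illustrative assumptions, not details from the patent:

```python
def filter_by_context(results, context_terms, topic_keywords):
    """Pick the topic whose keywords best overlap the page context,
    then keep the search results that belong to that topic."""
    # First scores: a measure of relevance of each topic to the
    # content identified in the context information.
    first_scores = {
        topic: len(set(context_terms) & set(keywords))
        for topic, keywords in topic_keywords.items()
    }
    # Select the topic associated with the greatest first score.
    best_topic = max(first_scores, key=first_scores.get)
    # Second scores (here: simple topic membership) select the
    # results most relevant to the chosen topic.
    kept = [title for title, topic in results if topic == best_topic]
    return best_topic, kept

topic, kept = filter_by_context(
    results=[("Igneous rocks explained", "geology"),
             ("Classic rock playlists", "music")],
    context_terms=["mineral", "sediment", "volcano"],
    topic_keywords={"geology": ["mineral", "volcano", "erosion"],
                    "music": ["guitar", "band", "album"]},
)
```

With a context full of geology terms, the “rock” results about music are filtered out of the group that is shown under the geology heading.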

It will be exciting to see topical search results start appearing at Google.


Copyright © 2017 SEO by the Sea ⚓. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at may be guilty of copyright infringement. Please contact SEO by the Sea, so we can take appropriate action immediately.
Plugin by Taragana

The post Does Tomorrow Deliver Topical Search Results at Google? appeared first on SEO by the Sea ⚓.


SEO by the Sea ⚓


Using Ngram Phrase Models to Generate Site Quality Scores

September 30, 2017
Scrabble-phrases
Source: https://commons.wikimedia.org/wiki/File:Scrabble_game_in_progress.jpg
Photographer: McGeddon
Creative Commons License: Attribution 2.0 Generic

Navneet Panda, after whom the Google Panda update is named, is a co-inventor of a newly granted patent that focuses on site quality scores. It’s worth studying to understand how it determines the quality of sites.

Back in 2013, I wrote the post Google Scoring Gibberish Content to Demote Pages in Rankings, about Google using ngrams from sites and building language models from them to determine if those sites were filled with gibberish, or spammy content. I was reminded of that post when I read this patent.

Rather than explaining what ngrams are in this post (which I did in the gibberish post), I’m going to point to an example of ngrams at the Google n-gram viewer, which shows Google indexing phrases in scanned books. This article published by the Wired site also focused upon ngrams: The Pitfalls of Using Google Ngram to Study Language.

An ngram phrase could be a 2-gram, 3-gram, 4-gram, or 5-gram phrase, where pages are broken down into two-, three-, four-, or five-word phrases. If a body of pages is broken down into ngrams, those ngrams can be used to create language models or phrase models to compare with other pages.
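Generating those phrases is straightforward; a minimal sketch of n-gram extraction might look like this:

```python
def ngrams(text, n):
    """Break text into overlapping n-word phrases (n-grams)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# 2-grams (two-word phrases) of a short sentence:
bigrams = ngrams("the quick brown fox", 2)
# → ['the quick', 'quick brown', 'brown fox']
```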

Language models, like the ones Google used to create gibberish scores for sites, could also be used to determine the quality of sites, if example sites were used to generate those language models. That seems to be the idea behind the patent granted this week. The summary section of the patent tells us about this use of the process it describes and protects:

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining baseline site quality scores for a plurality of previously-stored sites; generating a phrase model for a plurality of sites including the plurality of previously-scored sites, wherein the phrase model defines a mapping from phrase-specific relative frequency measures to phrase-specific baseline site quality scores; for a new site, the new site not being one of the plurality of previously-scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of the plurality of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.

The newly granted patent from Google is:

Predicting site quality
Inventors: Navneet Panda and Yun Zhou
Assignee: Google
US Patent: 9,767,157
Granted: September 19, 2017
Filed: March 15, 2013

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for predicating a measure of quality for a site, e.g., a web site. In some implementations, the methods include obtaining baseline site quality scores for multiple previously scored sites; generating a phrase model for multiple sites including the previously scored sites, wherein the phrase model defines a mapping from phrase specific relative frequency measures to phrase specific baseline site quality scores; for a new site that is not one of the previously scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.

In addition to generating ngrams from text on sites, some implementations of this patent generate ngrams from the anchor text of links pointing to pages of the sites. Building a phrase model involves calculating the frequency of n-grams on a site “based on the count of pages divided by the number of pages on the site.”
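That page-count calculation can be sketched directly from the patent’s wording; the site content below is invented for illustration:

```python
def relative_frequency(pages, phrase):
    """Relative frequency of a phrase on a site: the count of pages
    containing the phrase divided by the number of pages on the site."""
    return sum(phrase in page for page in pages) / len(pages)

# A toy site of three pages; "buy now" appears on two of them:
site = ["best cheap widgets buy now",
        "widget reviews and comparisons",
        "buy now and save on widgets"]
freq = relative_frequency(site, "buy now")  # 2 of 3 pages
```

A phrase model would then map relative frequencies like this one to the baseline site quality scores of previously scored sites, letting the system predict a quality score for a new site from its own phrase frequencies.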

The patent tells us that site quality scores can impact the rankings of pages from those sites:

Obtain baseline site quality scores for a number of previously-scored sites. The baseline site quality scores are scores used by the system, e.g., by a ranking engine of the system, as signals, among other signals, to rank search results. In some implementations, the baseline scores are determined by a backend process that may be expensive in terms of time or computing resources, or by a process that may not be applicable to all sites. For these or other reasons, baseline site quality scores are not available for all sites.



The post Using Ngram Phrase Models to Generate Site Quality Scores appeared first on SEO by the Sea ⚓.




Google’s Project Jacquard: Textile-Based Device Controls

September 16, 2017

Textile Devices with Controls Built into them

I remember my father building some innovative plastics blow molding machines where he added a central processing control device to the machines so that all adjustable controls could be changed from one place. He would have loved seeing what is going on at Google these days, and the hardware that they are working on developing, which focuses on building controls into textiles and plastics.

This is outside of Google’s search efforts, but it is interesting to see what else the company may get involved in, since its work is beginning to cover a wider and wider range of things, from self-driving cars to glucose-analyzing contact lenses.

This morning I tweeted an article I saw in the Sun, from the UK that was kind of interesting: Seating Plan Google’s creating touch sensitive car seats that will switch on air con, sat-nav and change music with a BUM WIGGLE

It made me curious whether I could find patents related to Google’s Project Jacquard, so I went to the USPTO website and searched, and a couple came up.

Attaching Electronic Components to Interactive Textiles
Inventors: Karen Elizabeth Robinson, Nan-Wei Gong, Mustafa Emre Karagozler, Ivan Poupyrev
Assignee: Google
US Patent Application: 20170232538
Published: August 17, 2017
Filed: May 3, 2017

Abstract

This document describes techniques and apparatuses for attaching electronic components to interactive textiles. In various implementations, an interactive textile that includes conductive thread woven into the interactive textile is received. The conductive thread includes a conductive wire (e.g., a copper wire) that that is twisted, braided, or wrapped with one or more flexible threads (e.g., polyester or cotton threads). A fabric stripping process is applied to the interactive textile to strip away fabric of the interactive textile and the flexible threads to expose the conductive wire in a window of the interactive textile. After exposing the conductive wires in the window of the interactive textile, an electronic component (e.g., a flexible circuit board) is attached to the exposed conductive wire of the conductive thread in the window of the interactive textile.

Interactive Textiles
Inventors: Ivan Poupyrev
Assignee: Google Inc.
US Patent Application: 20170115777
Published: April 27, 2017
Filed: January 4, 2017

Abstract

This document describes interactive textiles. An interactive textile includes a grid of conductive thread woven into the interactive textile to form a capacitive touch sensor that is configured to detect touch input. The interactive textile can process the touch-input to generate touch data that is useable to control various remote devices. For example, the interactive textiles may aid users in controlling volume on a stereo, pausing a movie playing on a television, or selecting a web page on a desktop computer. Due to the flexibility of textiles, the interactive textile may be easily integrated within flexible objects, such as clothing, handbags, fabric casings, hats, and so forth. In one or more implementations, the interactive textiles may be integrated within various hard objects, such as by injection molding the interactive textile into a plastic cup, a hard casing of a smart phone, and so forth.

The drawings that accompanied this patent were interesting because they showed off how gestures used on controls might be used:

Controls in action

textile controller
Here is a look at the textile controller.
double tap
A double tap on the controller is possible.
two finger touch
A two finger touch on the controller is also possible.
swipe up
You can swipe up on textile controllers
extruder
An Extruder showing plastics materials being heated up to send to a mold
molded devices
The patent shows off plastic molder devices with controls built into them.

My father would have gotten a kick out of seeing a plastics extruder in a Google Patent (I know I did.)

It will be interesting seeing textile and plastics controls come out as described in these patents.



The post Google’s Project Jacquard: Textile-Based Device Controls appeared first on SEO by the Sea ⚓.




Citations behind the Google Brain Word Vector Approach

September 2, 2017

Cardiff-Tidal-pools

In October of 2015, a new algorithm was announced by members of the Google Brain team, described in this post from Search Engine Land: Meet RankBrain: The Artificial Intelligence That’s Now Processing Google Search Results. Gregory S. Corrado, one of the Google Brain team members who gave Bloomberg News a long interview on RankBrain, is a co-inventor on a patent granted this August along with other members of the Google Brain team.

In the SEM Post article RankBrain: Everything We Know About Google’s AI Algorithm, we are told that RankBrain uses concepts from Geoffrey Hinton involving thought vectors. The summary in the description of the patent tells us how a word vector approach might be used in such a system:

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Unknown words in sequences of words can be effectively predicted if the surrounding words are known. Words surrounding a known word in a sequence of words can be effectively predicted. Numerical representations of words in a vocabulary of words can be easily and effectively generated. The numerical representations can reveal semantic and syntactic similarities and relationships between the words that they represent.

By using a word prediction system having a two-layer architecture and by parallelizing the training process, the word prediction system can be can be effectively trained on very large word corpuses, e.g., corpuses that contain on the order of 200 billion words, resulting in higher quality numeric representations than those that are obtained by training systems on relatively smaller word corpuses. Further, words can be represented in very high-dimensional spaces, e.g., spaces that have on the order of 1000 dimensions, resulting in higher quality representations than when words are represented in relatively lower-dimensional spaces. Additionally, the time required to train the word prediction system can be greatly reduced.

So, an incomplete or ambiguous query could use the words it does contain to predict missing words that may be related. Those predicted words could then be used to return search results that the original words might have had difficulty returning. The patent that describes this prediction process is:

Computing numeric representations of words in a high-dimensional space

Inventors: Tomas Mikolov, Kai Chen, Gregory S. Corrado and Jeffrey A. Dean
Assignee: Google Inc.
US Patent: 9,740,680
Granted: August 22, 2017
Filed: May 18, 2015

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for computing numeric representations of words. One of the methods includes obtaining a set of training data, wherein the set of training data comprises sequences of words; training a classifier and an embedding function on the set of training data, wherein training the embedding function comprises obtained trained values of the embedding function parameters; processing each word in the vocabulary using the embedding function in accordance with the trained values of the embedding function parameters to generate a respective numerical representation of each word in the vocabulary in the high-dimensional space; and associating each word in the vocabulary with the respective numeric representation of the word in the high-dimensional space.
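A toy sketch can illustrate how such numerical representations reveal similarities between words. The three-dimensional vectors below are invented values for illustration only, whereas the patent describes spaces on the order of 1000 dimensions trained on corpuses of billions of words:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors: 1.0 means the same
    direction in the embedding space, i.e., very similar words."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 3-dimensional embeddings for a tiny vocabulary:
vectors = {
    "horse":  [0.9, 0.8, 0.1],
    "pony":   [0.85, 0.75, 0.15],
    "hammer": [0.1, 0.2, 0.9],
}

def nearest(word):
    """Return the other vocabulary word closest to `word` in the space."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```

Here `nearest("horse")` returns "pony", because their vectors point in nearly the same direction; this is the kind of semantic relationship the numeric representations are said to capture.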

One of the things I found really interesting about this patent is that it includes a number of citations from the applicants. They looked worth reading: many are co-authored by inventors of this patent, by people well known in the field of artificial intelligence, or by people from Google. I hunted down copies of them on the Web and thought it would be helpful to share those links, which was the idea behind this post. It may be helpful to read as many of these as possible before tackling the patent. If anything stands out to you, let us know what you’ve found interesting.

Bengio and LeCun, “Scaling learning algorithms towards AI,” Large-Scale Kernel Machines, MIT Press, 41 pages, 2007. cited by applicant.

Bengio et al., “A neural probabilistic language model,” Journal of Machine Learning Research, 3:1137-1155, 2003. cited by applicant .

Brants et al., “Large language models in machine translation,” Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, 10 pages, 2007. cited by applicant .

Collobert and Weston, “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” International Conference on Machine Learning, ICML, 8 pages, 2008. cited by applicant .

Collobert et al., “Natural Language Processing (Almost) from Scratch,” Journal of Machine Learning Research, 12:2493-2537, 2011. cited by applicant .

Dean et al., “Large Scale Distributed Deep Networks,” Neural Information Processing Systems Conference, 9 pages, 2012. cited by applicant .

Elman, “Finding Structure in Time,” Cognitive Science, 14, 179-211, 1990. cited by applicant .

Huang et al., “Improving Word Representations via Global Context and Multiple Word Prototypes,” Proc. Association for Computational Linguistics, 10 pages, 2012. cited by applicant .

Mikolov and Zweig, “Linguistic Regularities in Continuous Space Word Representations,” submitted to NAACL HLT, 6 pages, 2012. cited by applicant .

Mikolov et al., “Empirical Evaluation and Combination of Advanced Language Modeling Techniques,” Proceedings of Interspeech, 4 pages, 2011. cited by applicant .

Mikolov et al., “Extensions of recurrent neural network language model,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528-5531, May 22-27, 2011. cited by applicant .

Mikolov et al., “Neural network based language models for highly inflective languages,” Proc. ICASSP, 4 pages, 2009. cited by applicant .

Mikolov et al., “Recurrent neural network based language model,” Proceedings of Interspeech, 4 pages, 2010. cited by applicant .

Mikolov et al., “Strategies for Training Large Scale Neural Network Language Models,” Proc. Automatic Speech Recognition and Understanding, 6 pages, 2011. cited by applicant .

Mikolov, “RNNLM Toolkit,” Faculty of Information Technology (FIT) of Brno University of Technology [online], 2010-2012 [retrieved on Jun. 16, 2014]. Retrieved from the Internet: < URL: http://www.fit.vutbr.cz/.about.imikolov/rnnlm/>, 3 pages. cited by applicant .

Mikolov, “Statistical Language Models based on Neural Networks,” PhD thesis, Brno University of Technology, 133 pages, 2012. cited by applicant .

Mnih and Hinton, “A Scalable Hierarchical Distributed Language Model,” Advances in Neural Information Processing Systems 21, MIT Press, 8 pages, 2009. cited by applicant .

Morin and Bengio, “Hierarchical Probabilistic Neural Network Language Model,” AISTATS, 7 pages, 2005. cited by applicant .

Rumelhart et al., “Learning representations by back-propagating errors,” Nature, 323:533-536, 1986. cited by applicant .

Turian et al., “MetaOptimize / projects / wordreprs /” Metaoptimize.com [online], captured on Mar. 7, 2012. Retrieved from the Internet using the Wayback Machine: < URL: http://web.archive.org/web/20120307230641/http://metaoptimize.com/project- s/wordreprs>, 2 pages. cited by applicant .

Turian et al., “Word Representations: A Simple and General Method for Semi-Supervised Learning,” Proc. Association for Computational Linguistics, 384-394, 2010. cited by applicant .

Turney, “Measuring Semantic Similarity by Latent Relational Analysis,” Proc. International Joint Conference on Artificial Intelligence, 6 pages, 2005. cited by applicant .

Zweig and Burges, “The Microsoft Research Sentence Completion Challenge,” Microsoft Research Technical Report MSR-TR-2011-129, 7 pages, Feb. 20, 2011. cited by applicant.



The post Citations behind the Google Brain Word Vector Approach appeared first on SEO by the Sea ⚓.




Personalizing Search Results at Google

August 19, 2017

document sets at Google

One thing most SEOs are aware of is that search results at Google are sometimes personalized for searchers, but it’s not something I’ve seen much written about. So when I came across a patent about personalizing search results, I wanted to dig in and see whether it could give us more insights.

The patent is an updated continuation patent, and I love to look at those, because it is possible to compare the claims with those of an older version to see whether they reveal how the processes described in the patent have changed. Sometimes the changes are spelled out in great detail, and sometimes they shift focus to concepts that were in the original version but weren’t emphasized as much.

One of the last continuation patents I looked at was one from Navneet Panda, in the post Click a Panda: High Quality Search Results based on Repeat Clicks and Visit Duration. In that one, we saw a shift in focus toward more user behavior data, such as repeat clicks by the same user on a site and the duration of a visit to a site.

Personalizing search results
Inventors: Paul Tucker
Assignee: GOOGLE INC.
US Patent: 9,734,211
Granted: August 15, 2017
Filed: February 27, 2015

Abstract

A system receives a search query from a user and performs a search of a corpus of documents, based on the search query, to form a ranked set of search results. The system re-ranks the set of search results based on preferences of the user, or a group of users, and provides the re-ranked search results to the user.

The older version of the patent is Personalizing search results, which was filed on September 16, 2013, and was granted on March 10, 2015.

A continuation patent has its claims rewritten to reflect changes in how the patented process works, while keeping the filing date of the original version of the patent.

I like comparing the claims, since that is what usually changes in continuation patents. I noticed some significant changes from the older version to this newer version.

There is a lot more emphasis on “high quality” sites and “distrusted sites” in the new version of the patent, which can be seen in the first claim of the patent. It’s worth putting the old and the new first claim one after the other, and comparing the two.

The Old First Claim

1. A method comprising: identifying, by at least one of one or more server devices, a first set of documents associated with a user, documents, in the first set of documents, being assigned weights that reflect a relative quantification of an interest of the user in the documents in the first set of documents; receiving, by at least one of the one or more server devices, a search query from a client device associated with the user; identifying, by at least one of the one or more server devices and based on the search query, a second set of documents, each document from the second set of documents having a respective score; determining, by at least one of the one or more server devices, that a particular document, from the second set of documents, matches or links to one of the documents in the first set of documents; adjusting, by at least one of the one or more server devices, the respective score of the particular document, to form an adjusted score, based on the weight assigned to the one of the documents in the first set of documents; forming, by at least one of the one or more server devices, a list of documents in which documents from the second set of documents are ranked based on the respective scores, the particular document being ranked in the list based on the adjusted score; and providing, by at least one of the one or more server devices, the list of documents to the client device.

The New First Claim

This is newly granted this week:

1. A method, comprising: determining, by at least one of one or more server devices, preferences of a user or a group of users, wherein the preferences indicate a document bias set and weights assigned to the documents, wherein the weights include distrusted document weights; determining, by the at least one of the one or more server devices, a high quality document set obtained from a document ranking algorithm; creating, by at least one of the one or more server devices, an intersection set of documents which includes documents in both the document bias set and the high quality document set; receiving, by at least one of the one or more server devices, a search query from the user; performing, by at least one of the one or more server devices, a search of a corpus of documents, based on the search query, to form a ranked set of search result documents; determining, by at least one of the one or more server devices, at least one link from the intersection set of documents to at least one document in the ranked set of search result documents, the at least one document not in the intersection set of documents; re-ranking, by at least one of the one or more server devices, the set of search result documents based on the preferences of the user or the group of users, wherein re-ranking the set of search results comprises: identifying a link of the set of links from the intersection set of documents to the document of the set of search result documents, and based on identifying the link, adjusting a rank of the search result document based on the weight assigned to the document in the document bias set from where the identified link originated from; and providing, by at least one of the one or more server devices, the re-ranked search results to the user.

The changes I see between these two first claims involve what are called “distrusted document weights” from a “document bias set,” and the showing of pages from “a high quality document set.” The newer claim makes it clearer that personalized results come from these two different sets of results. It’s possible that this doesn’t change how personalization actually works, but the increased clarity is good to see.

The Purpose of these Personalizing Search Results Patents

We are told that some sites are favored more than others, and some are disliked more than others, and that these preferences are collected from a query or browser history to generate a document bias set:

FIG. 1 illustrates an overview of the re-ranking of search results based on a user’s or group’s document or site preferences. In accordance with this aspect of the invention, a document bias set F 105 may be generated that indicates the user’s or group’s preferred and/or disfavored documents. Bias set F 105 may be automatically collected from a query or browser history of a user. Bias set F 105 may also be generated by human compilation, or editing of an automatically generated set. Bias set F 105 may include a set of documents shared, or developed, by a group that may further include a community of users of common interest. Document bias set F 105 may include one or more designated documents (e.g., documents a, b, x, y and z) with associated weights (e.g. w.sup.a.sub.F, w.sup.b.sub.F, w.sup.x.sub.F, w.sup.y.sub.F and w.sup.z.sub.F). The weights may be assigned to each document (e.g., documents a, b, x, y and z) based on a user’s, or group’s, relative preferences among documents of bias set F 105. For example, bias set F 105 may include a user’s personal most-respected, or most-distrusted, document list, with the weights being assigned to each document in bias set F 105 based on a relative quantification of the user’s preference among each of the documents of the set.

This document bias set mention appears in both the older, and the newer version of the patent.

Both patents also refer to a high quality document set, which is described in a way that places a lot of attention on PageRank or a Hubs and Authorities approach to ranking:

A high quality document set L 110 may be obtained from any existing document ranking algorithm. Such document ranking algorithms may include a link-based ranking algorithm, such as, for example, Google’s PageRank algorithm, or Kleinberg’s Hubs and Authorities ranking algorithm. The document ranking algorithm may provide a global ranking of document quality that may be used for ranking the results of searches performed by search engines. High quality document set L 110 may be derived from the highest-ranking documents in the web as ranked by an existing document ranking algorithm. In one implementation, for example, set L 110 may include the top percentage of the documents globally ranked by an existing document ranking algorithm (e.g., the highest ranked 20% of documents). In an implementation using PageRank, set L 110 may include documents having PageRank scores higher than a threshold value (e.g., documents with PageRank scores higher than 10,000,000). Set L 110 may include multiple documents (e.g., documents m, n, o, p, x, y and z) with associated weights (e.g., weights W.sup.m.sub.L, W.sup.n.sub.L, W.sup.o.sub.L, W.sup.p.sub.L, W.sup.x.sub.L, W.sup.y.sub.L and W.sup.Z.sub.L). The weights may be assigned to each document (e.g., documents m, n, o, p, x, y and z) based on a relative ranking of “quality” between the different documents of set L 110 produced by the document ranking algorithm.

Personalized results served to a searcher are results that come from both the document bias set, and the high quality document set (as the patent says, from an “intersection” between the two sets).
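A rough sketch of how those two sets might combine: results appearing in both the bias set and the high quality set (the "intersection") get their scores adjusted by both weights. The scoring formula here is my own invention for illustration; the patent doesn't spell out this exact math.

```python
# Re-rank search results by boosting (or demoting) documents that appear in
# both the user's bias set and the global high quality set. The combination
# formula is an assumption, not the patent's actual scoring.

def personalize(results, bias_set, quality_set):
    """results: list of (doc, base_score); returns a re-ranked list."""
    def score(item):
        doc, base = item
        if doc in bias_set and doc in quality_set:   # the "intersection"
            return base * (1.0 + bias_set[doc] * quality_set[doc])
        return base
    return sorted(results, key=score, reverse=True)

results = [("m.example", 0.9), ("a.example", 0.8), ("z.example", 0.7)]
quality = {"a.example": 0.9, "z.example": 0.8, "m.example": 0.5}
bias = {"a.example": 1.0, "z.example": -0.625}
print(personalize(results, bias, quality))
```

Note how a preferred document rises above a document with a higher base score, while a distrusted one sinks, which matches the behavior the patent describes.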

If you are interested in how personalized search may work at Google, spending some time with this new patent may provide some insights. Knowing about how two different sets of documents are involved in returning results is a good starting point.


Copyright © 2017 SEO by the Sea ⚓. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at may be guilty of copyright infringement. Please contact SEO by the Sea, so we can take appropriate action immediately.
Plugin by Taragana

The post Personalizing Search Results at Google appeared first on SEO by the Sea ⚓.


SEO by the Sea ⚓


Learn SEO Through Forums

July 2, 2017

Solana Beach Farmer's Market

I had someone who was reading my previous entries in my Learning SEO series ask about using forums to learn SEO. I promised that I would write a post about the value of forums in learning SEO.

Back in 1998, I became a moderator of a couple of forums on small business and website promotion on Yahoo Groups. Those led to my becoming a moderator at Cre8asiteforums, joining forum owner Kim Krause Berg along with a number of other moderators, such as Ammon Johns and Jill Whalen.

Cre8asiteforums was (and still is) a tremendous place to talk about SEO, web design, usability, and accessibility. One of my favorite individual forums on the site was one called The Website Hospital, where people would bring their site’s URL and concerns about it, and ask questions. That was where I learned a lot about auditing sites, seeing what worked well on them and what might need some help. This thread is a good introduction to it: Getting Started in the Website Hospital.

Here’s a thread I started in November of 2005 that was an interesting read, on SEO Myths.

Another forum that I have gotten a lot of value from over the years is one called WebmasterWorld. Most of the members of this forum are practicing SEOs or site owners who enjoy sharing their experiences. It reminds me of a weather vane, in that people are often open with information about changes that they experience in rankings and traffic to their sites. You can see changes taking place on the Web from what they write.

Another place that can be informative about how search works is the Google Webmaster Help Forum. If you experience problems with a site, it is often a good place to search to see if anyone else has experienced something similar – it is possible that someone has, and the answers they received may help you as well.

There are other forums on the Web that focus upon SEO and Search. I’ve included the ones that I am most familiar with. There were some others that I participated on, that aren’t very active anymore. It doesn’t hurt to start off as a lurker, and learn about the customs and culture of a forum before you start participating in it. You may find some that you enjoy participating in very much.

When I started going to conferences and events after being involved in forums for a few years, I finally had a chance to meet in real life many people whom I had only met in discussions at forums. It was nice getting a chance to do so.

You can learn a lot through forums.



The post Learn SEO Through Forums appeared first on SEO by the Sea ⚓.




Google Patents Extracting Facts from the Web

June 20, 2017

When Google crawls the Web, it collects information about the content on the pages it finds, as well as the links on those pages. How much information does it collect about facts on the Web? Microsoft showed off an object-based search about 10 years ago, in the paper Object-Level Ranking: Bringing Order to Web Objects.

The team from Microsoft Research Asia tells us in that paper:

Existing Web search engines generally treat a whole Web page as the unit for retrieval and consuming. However, there are various kinds of objects embedded in the static Web pages or Web databases. Typical objects are products, people, papers, organizations, etc. We can imagine that if these objects can be extracted and integrated from the Web, powerful object-level search engines can be built to meet users’ information needs more precisely, especially for some specific domains.

This patent from Google focuses upon extracting factual information about entities on the Web. It’s an approach that goes beyond making the Web index that we know Google for because it collects more information that is related to each other. The patent tells us:

Information extraction systems automatically extract structured information from unstructured or semi-structured documents. For example, some information extraction systems that exist extract facts from collections of electronic documents, with each fact identifying a subject entity, an attribute possessed by the entity, and the value of the attribute for the entity.

I’m reminded of an early Google provisional patent that Sergey Brin came up with in the 1990s. I titled my post about that patent Google’s First Semantic Search Invention was Patented in 1999. The patent it covers was titled Extracting Patterns and Relations from Scattered Databases Such as the World Wide Web (pdf) (skip ahead to the third page, where it becomes much more readable). This was published as a paper on the Stanford website. It describes Sergey Brin taking some facts about some books and searching for those books on the Web; once they are found, patterns about the locations of those books are gathered, and information about other books is collected as well. That approach sounds much like the one from this patent granted the first week of this month:

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining a plurality of seed facts, wherein each seed fact identifies a subject entity, an attribute possessed by the subject entity, and an object, and wherein the object is an attribute value of the attribute possessed by the subject entity; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse, wherein a dependency parse of a text portion corresponds to a directed graph of vertices and edges, wherein each vertex represents a token in the text portion and each edge represents a syntactic relationship between tokens represented by vertices connected by the edge, wherein each vertex is associated with the token represented by the vertex and a part of speech tag, and wherein a dependency pattern corresponds to a sub-graph of a dependency parse with one or more of the vertices in the sub-graph having a token associated with the vertex replaced by a variable; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.
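The bootstrapping loop Brin described, which this patent builds upon with dependency patterns, can be sketched in a toy form: start from a seed fact, learn the text pattern around it, and apply that pattern to harvest new facts. The corpus and the pattern shape below are invented for illustration, and real systems use dependency parses rather than flat text matching:

```python
import re

# Toy bootstrapping: learn the literal text that joins a known (author, title)
# pair, then reuse it to extract new pairs from the same corpus.

corpus = ("Isaac Asimov wrote the book Foundation. "
          "Frank Herbert wrote the book Dune. "
          "Ursula K. Le Guin wrote the book The Dispossessed.")

seeds = [("Isaac Asimov", "Foundation")]

def learn_pattern(author, title, text):
    """Return the literal text appearing between a known author and title."""
    m = re.search(re.escape(author) + r"(.+?)" + re.escape(title), text)
    return m.group(1) if m else None

def apply_pattern(middle, text):
    """Find new (author, title) pairs joined by the learned middle text."""
    pat = r"([A-Z][\w. ]+?)" + re.escape(middle) + r"([A-Z][\w ]+?)\."
    return re.findall(pat, text)

middle = learn_pattern(*seeds[0], corpus)
print(middle)                        # ' wrote the book '
print(apply_pattern(middle, corpus))
```

The patent's dependency patterns generalize this idea: instead of matching literal strings, they match sub-graphs of a dependency parse, with tokens replaced by variables.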

The patent breaks the process it describes into a number of “Advantages” that are worth keeping in mind, because it sounds a lot like how people talking about the Semantic Web describe the Web as a web of data. These are the Advantages that the patent brings us:

(1) A fact extraction system can accurately extract facts, i.e., (subject, attribute, object) triples, from a collection of electronic documents to identify values of attributes, i.e., “objects” in the extracted triples, that are not known to the fact extraction system.

(2) In particular, values of long-tail attributes that appear infrequently in the collection of electronic documents relative to other, more frequently occurring attributes can be accurately extracted from the collection. For example, given a set of attributes for which values are to be extracted from the collection, the attributes in the set can be ordered by the number of occurrences of each of the attributes in the collection and the fact extraction system can accurately extract attribute values for the long-tail attributes in the set, with the long-tail attributes being the attributes that are ranked below N in the order, where N is chosen such that the total number of appearances of attributes ranked N and above in the ranking equals the total number of appearances of attributes ranked below N in the ranking.

(3) Additionally, the fact extraction system can accurately extract facts to identify values of nominal attributes, i.e., attributes that are expressed as nouns.
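The head/tail split described in advantage (2) can be sketched simply: order attributes by how often they appear, then choose N so that the head (attributes ranked N and above) accounts for about the same number of total appearances as the tail (attributes ranked below N). The attribute counts here are made up:

```python
# Split a set of attributes into "head" and "long-tail" groups, choosing the
# cut point so the two groups account for roughly equal total appearances.

def head_tail_split(attr_counts):
    """attr_counts: dict of attribute -> occurrences. Returns (head, tail)."""
    ranked = sorted(attr_counts, key=attr_counts.get, reverse=True)
    total = sum(attr_counts.values())
    running, n = 0, 0
    for attr in ranked:
        if running >= total / 2:      # head already covers half the mass
            break
        running += attr_counts[attr]
        n += 1
    return ranked[:n], ranked[n:]

counts = {"population": 50, "capital": 30, "motto": 10, "anthem": 6, "tree": 4}
head, tail = head_tail_split(counts)
print(head)  # ['population'] -- 50 appearances
print(tail)  # ['capital', 'motto', 'anthem', 'tree'] -- 50 appearances
```

In this toy data the split is exact: the single most frequent attribute accounts for as many appearances as all the long-tail attributes combined.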

The patent is:

Extracting facts from documents
Inventors: Steven Euijong Whang, Rahul Gupta, Alon Yitzchak Halevy, and Mohamed Yahya
Assignee: Google Inc.
US Patent: 9,672,251
Granted: June 6, 2017
Filed: September 29, 2014

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for extracting facts from a collection of documents. One of the methods includes obtaining a plurality of seed facts; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.

The patent contains a list of “other references” that were cited by the applicants. These are worth spending some time with because they contain a lot of hints about the direction that Google appears to be moving towards.

The patent tells us that entities identified by this extraction process may be stored in an entity database, and it points at the old Freebase site (which used to be run by Google).

They give us some insights into how the information extracted from the Web might be used by Google in a fact repository (which is the term they used to refer to an early version of their knowledge graph):

Once extracted, the fact extraction system may store the extracted facts in a facts repository or provide the facts for use for some other purpose. In some cases, the extracted facts may be used by an Internet search engine in providing formatted answers in response to search queries that have been classified as seeking to determine the value of an attribute possessed by a particular entity. For example, a received search query “who is the chief economist of example organization?” may be classified by the search engine as seeking to determine the value of the “Chief Economist” attribute for the entity “Example Organization.” By accessing the fact repository, the search engine may identify that the fact repository includes a (Example Organization, Chief Economist, Example Economist) triple and, in response to the search query, can provide a formatted presentation that identifies “Example Economist” as the “Chief Economist” of the entity “Example Organization.”
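The lookup described in that passage amounts to storing (subject, attribute, object) triples keyed by subject and attribute, so that a query classified as attribute-seeking becomes a simple lookup. This toy sketch reuses the patent's own example names:

```python
# A minimal fact repository: (subject, attribute) -> object triples, queried
# to produce the kind of formatted answer the patent describes.

facts = {
    ("Example Organization", "Chief Economist"): "Example Economist",
    ("Example Organization", "CEO"): "Example Executive",
}

def answer(subject, attribute):
    """Return a formatted answer for an attribute-seeking query, or None."""
    value = facts.get((subject, attribute))
    return f"{value} is the {attribute} of {subject}" if value else None

print(answer("Example Organization", "Chief Economist"))
# Example Economist is the Chief Economist of Example Organization
```

The hard part, of course, is the query classification step that maps "who is the chief economist of example organization?" onto the (subject, attribute) pair in the first place; this sketch takes that step as given.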

The patent tells us about how they use patterns to identify additional facts:

The system selects additional facts from among the candidate additional facts based on the scores (step 212). For example, the system can select each candidate additional fact having a score above a threshold value as an additional fact. As another example, the system can select a predetermined number of highest-scoring candidate additional facts as additional facts. The system can store the selected additional facts in a fact repository, e.g., the fact repository of FIG. 1, or provide the selected additional facts to an external system for use for some immediate purpose.

The patent also describes the process that might be followed to score candidate additional facts.

This fact extraction process does appear to be aimed towards building a repository that might be capable of answering a lot of questions, using a machine learning approach and the kind of semantic vectors that the Google Brain team may have used to develop Google’s RankBrain approach.



The post Google Patents Extracting Facts from the Web appeared first on SEO by the Sea ⚓.




Learning SEO through Books

June 16, 2017

A Path through the Batiquitos Lagoon

There are a few books and courses online that are free and really helpful when it comes to learning some of the things that will make you a better SEO. Knowledge can make a difference, and having an idea of how search engines work can possibly give you a competitive advantage over others who haven’t had a chance to learn about such resources. I’ve come across a few books that are online and free, and worth spending time with, and thought it might not be a bad idea to share them.

The first two volumes I found focus on one of the important ways that search engines understand the content of web pages: rating them based upon information retrieval scores. Having an idea of how a search engine might rank a page, beyond just an understanding of how PageRank works, can be really helpful.

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze

Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto

In the last couple of years, we have seen significant changes to how search results look at search engines. This book doesn’t cover knowledge panels, featured snippets, structured snippets, or sitelinks, but it does leave lots to think about when it comes to how search results are organized:

Search User Interfaces by Marti Hearst

Having a sense of how HTML, CSS, and JavaScript work can be helpful to anyone involved in SEO. This book covers the latest version of HTML, which hasn’t been adopted everywhere on the Web yet, but it is still worth digging into:

Dive into HTML5 by Mark Pilgrim

Online Courses and Presentations

Google has been going through a number of transformations in the past 5 years or so, and Andrew Hogue was involved in many of the changes that took place. His presentation on them is insightful:

The Structured Search Engine by Andrew Hogue:

Last September, Jeff Dean introduced us to a Google that was going to be incorporating Machine Learning into what they do – hearing some of the details behind this movement is like peeking behind the curtain:

Jeff Dean Talks Google Brain and Brain Residency:

These tutorials were pointed out to me on Twitter a couple of days ago. I started watching them, and decided quickly that they were worth sharing:

Dan Jurafsky & Chris Manning: Natural Language Processing and Lecture Slides (h/t to Victor Gras)



The post Learning SEO through Books appeared first on SEO by the Sea ⚓.




Learning SEO, Summer 2017

June 10, 2017

The Omni La Costa Resort

In January, I wrote a post titled, Advice Given to an Aspiring 14 Year-Old Entrepreneur Wanting to Learn SEO. I included in that post links to a number of pages that I thought might be helpful to someone learning SEO. On my walk this morning past the Omni La Costa Resort, I was thinking about it, and decided that it might not be a bad idea to create a Learning SEO category here, and provide more resources to help people who are learning SEO access some of the resources I come across that might help them understand more.

One that I was thinking might be really helpful is this video (SMX West 2016 – How Google Works: A Google Ranking Engineer’s Story) with Google Engineer Paul Haahr:

I’ve written about more than one patent that Paul Haahr co-invented, and he has been involved in many important aspects of how Google operates. His insights into ranking at Google are eye-opening.

Google keeps a careful eye on the quality of its search results, and has human beings who review those results and provide feedback on them. These people are known as human quality raters, and they are provided a set of guidelines, which Google has started sharing with the public. If you want to be an SEO, having an idea of what those guidelines contain can be helpful; they can give you some ideas on what you might want to include on a web site. The most recent version of the quality rater guidelines came out May 11, 2017:

Search Quality Evaluator Guidelines

Many people perform searches at Google every day, entering many queries every second into a Google search box. Could we learn something from what they search for, and what words they use when they search? I wrote about that idea about 4 years ago in a post called How Google Might Use Query Logs to Find Locations for Entities. What if Google tried to learn even more from query logs? They have, and they wrote about what they’ve built from query log information in a paper titled:

Biperpedia: An Ontology for Search Applications

One of the authors of that paper is Alon Halevy, who is the head of Structured Data at Google (the folks responsible for rich snippets, knowledge graphs, question answering, structured snippets, and table search at Google).

A tool that I have been using on almost every audit that I do is Screaming Frog, and if it isn’t in your toolbelt, it should be something that you should consider adding. It is really useful, and this page from Seer Interactive is helpful in learning how to use it effectively:

Screaming Frog Guide to Doing Almost Anything

Google is not the only search engine, and if you aren’t looking at what Microsoft is doing with Bing, you may be surprised.

I was surprised to see Microsoft come out with a really powerful knowledge graph that covers a lot of concepts in September of last year:

Microsoft Concept Graph Preview For Short Text Understanding

I will be keeping an eye out for other pages that I think would be good resources. If you have specific questions about SEO, contact me, and I will try to add answers to them in future posts in this category (thanks!)



The post Learning SEO, Summer 2017 appeared first on SEO by the Sea ⚓.




How Does Google Look for Authoritative Search Results?

May 29, 2017
A NASA Android that Voyaged to Space

If you’ve done any SEO for a site, you may recognize some of the steps involved in working towards making a website authoritative:

  1. Conduct keyword research to find appropriate terms and phrases for your industry and audience
  2. Review the use of keywords on the pages of your site, to make sure those terms appear in prominent places on those pages
  3. Map out pages on a site to place keywords in meaningful places
  4. The meaningful places on your pages are determined by information retrieval scores for HTML elements such as Titles and Headings and Lists
  5. The placement of keywords in prominent and important places on your pages can make your pages more relevant for those keywords
  6. Research the topics your pages are about, and make sure they answer questions that your audience may have about those topics in trustworthy and meaningful ways
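The idea behind step 4 above can be sketched as a simple weighted count: occurrences of a keyword are worth more in some HTML elements than in others. The weights below are invented for illustration; real ranking systems tune such values empirically:

```python
# Score a keyword against a page by weighting occurrences according to the
# HTML element they appear in. Element weights here are assumptions.

ELEMENT_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "li": 1.2, "p": 1.0}

def keyword_score(keyword, elements):
    """elements: list of (element_name, text). Sums weighted occurrences."""
    score = 0.0
    for name, text in elements:
        hits = text.lower().count(keyword.lower())
        score += hits * ELEMENT_WEIGHTS.get(name, 1.0)
    return score

page = [("title", "Pepper Tree Care Guide"),
        ("h1", "Caring for a Pepper Tree"),
        ("p", "A pepper tree needs full sun. Water the pepper tree weekly.")]
print(keyword_score("pepper tree", page))  # 3.0 + 2.0 + 2*1.0 = 7.0
```

A page with the keyword in its title and headings outscores one that only mentions it in body text, which is the intuition behind placing keywords in "meaningful places."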

Focus on Authoritative Search Results

A patent granted to Google this week focuses upon authoritative search results. It describes how Google might surface authoritative results for queries, and for query revisions, when there might not be results that meet a threshold of authoritativeness for the initial query. Reading through it was like looking at a mirror image of the efforts I usually go through to try to build authoritative results for a search engine to surface. In addition to using some of the same language that I use to describe how I build authoritative pages, the patent also defines what an authoritative site is, in terms that I might find myself using too:

In general, an authoritative site is a site that the search system has determined to include particularly trusted, accurate, or reliable content. The search system can distinguish authoritative sites from low-quality sites that include resources with shallow content or that frequently include spam advertisements. Whether the search system considers a site to be authoritative will typically be query-dependent. For example, the search system can consider the site for the Centers for Disease Control, “cdc.gov,” to be an authoritative site for the query “cdc mosquito stop bites,” but may not consider the same site to be authoritative for the query “restaurant recommendations”. A search result that identifies a resource on a site that is authoritative for the query may be referred to as an authoritative search result. The search system can determine whether to obtain an authoritative search result in response to a query in a variety of ways, which will be described below.
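The query-dependent notion of authoritativeness quoted above might be sketched as a table of topics each site is trusted on, where a site only counts as authoritative for queries touching those topics. The topic table and matching rule here are invented for illustration:

```python
# Query-dependent authoritativeness: a site is authoritative only for
# queries that overlap the topics it is trusted on. Data is illustrative.

AUTHORITATIVE_FOR = {
    "cdc.gov": {"disease", "mosquito", "vaccine"},
    "imdb.com": {"movie", "actor"},
}

def is_authoritative(site, query):
    """True if any query word falls within the site's trusted topics."""
    topics = AUTHORITATIVE_FOR.get(site, set())
    return any(word in topics for word in query.lower().split())

print(is_authoritative("cdc.gov", "cdc mosquito stop bites"))      # True
print(is_authoritative("cdc.gov", "restaurant recommendations"))   # False
```

This mirrors the patent's own example: cdc.gov is authoritative for a mosquito-bite query but not for restaurant recommendations.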

This definition seems to tell us that authoritative sites are high quality sites. The timing of a couple of other actions at Google seems to fit in well with the granting of this patent. One is the publication of a blog post by longtime Google search engineer Ben Gomes (who joined Google in 1999) on steps they have taken to improve the quality of results at Google, titled Our latest quality improvements for Search. In that post, Ben points out that Google has published a brand new set of Search Quality Rater Guidelines – May 11, 2017, publicly, so that they are shared with the world instead of just with Google’s search quality raters.

One of the named inventors on this patent was also an inventor on another patent I wrote about, which focused upon high quality sites as well. That patent is worth reading about together with this one, in the post I wrote named Google’s High Quality Sites Patent. As I said there, it describes its purpose in this way:

This patent identifies pages that rank well for certain queries, and looks at the quality of those pages. If a threshold amount of those ranking pages are low quality pages, the search engine might use an alternative query to find a second set of search results that include pages from high quality sites. Those search results from the first query might then be merged with the results from the alternative query, with the pages from the low quality sites removed so that the search results include a greater percentage of pages from high quality sites.

So the aim of this new patent is to surface results from higher quality sites. Google does seem to be targeting higher quality pages these days with the results it shows.

Google sets a fairly high bar with search results, telling us in the description to this new patent:

Internet search engines aim to identify resources, e.g., web pages, images, text documents, multimedia content, e.g., videos, that are relevant to a user’s information needs and to present information about the resources in a manner that is most useful to the user.

In the summary section for this patent, the objective of the patent is identified to us as finding authoritative answers:

This specification describes how a system can improve search result sets by including at least one authoritative search result that identifies a resource on an authoritative site for a query. The system can include an authoritative search result, for example, when scores of an initial first search result set are low or when the query itself indicates that the user seeks resources from an authoritative site.

What this Patent Does

A search engine doesn’t choose the query terms that someone uses to perform a search, but it can identify query refinements based upon the initial query. If the original query doesn’t return an authoritative result, Google might insert into the results it shows an authoritative result for one of those query refinements, and it might show that authoritative result at the top of the search results it returns. This means that Google will be more likely to return high quality sites at the top of search results, rather than results from sites that might not be seen as authoritative.

The patent that was granted this week is:

Obtaining authoritative search results
Inventors: Trystan Upstill, Yungchun Wan, and Alexandre Kojoukhov
Assignee: Google Inc.
US Patent: 9,659,064
Granted: May 23, 2017
Filed: March 15, 2013

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for obtaining authoritative search results. One of the methods includes receiving a first search query. First search results responsive to the first search query are obtained. Based on the first search query or the first search results, an authoritative search result that identifies a resource on a site that is authoritative for the first search query is obtained. A ranking of the authoritative search result and the one or more first search results is generated, and the ranking of the authoritative search result and the one or more first search results is provided in response to the first search query.

There were some really interesting points raised in the patent, which makes the whole thing worth spending time reading carefully:

1. Google might maintain a “keyword-to-authoritative site database” which it can refer to when someone performs a query.
2. The patent described “Mapping” keywords on pages on the Web as sources of information for that authoritative site database.
3. Google may also maintain “topic keyword and category keyword mappings to authoritative sites”.
4. Google may calculate confidence scores, which represent a likelihood that a keyword, if received in a query, refers to a specific authoritative site.
5. The patent talks about Mapping revised queries, like this: “The system can also analyze user query refinements to generate additional topic or category keyword mappings or refine existing ones.”
6. Interestingly, the patent talks about revisions in queries as being substitute terms that might be used “aggressively to generate revised search queries.” I’ve written about substitute terms before in How Google May Rewrite Your Search Terms.
7. If the original query and the replacement query used to surface an authoritative result are similar enough (based upon a similarity score used to compare them), that authoritative result may be demoted in the results shown to a searcher.
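The points above can be sketched as a small lookup flow, with invented data: a keyword-to-authoritative-site table carrying confidence scores, plus a fallback that tries a substitute-term revision when the original query yields no confident match:

```python
# A toy version of the keyword-to-authoritative-site database with
# confidence scores, and the revised-query fallback the patent describes.
# All data and the threshold are assumptions for illustration.

SITE_DB = {            # keyword -> (site, confidence)
    "mosquito bites": ("cdc.gov", 0.92),
    "flu symptoms": ("cdc.gov", 0.88),
}
REFINEMENTS = {"bug bites": "mosquito bites"}   # substitute-term rewrites

def authoritative_site(query, threshold=0.8):
    """Return a confidently authoritative site for the query, if any."""
    hit = SITE_DB.get(query)
    if hit and hit[1] >= threshold:
        return hit[0]
    revised = REFINEMENTS.get(query)
    if revised:                      # fall back to the revised query
        return authoritative_site(revised, threshold)
    return None

print(authoritative_site("bug bites"))   # cdc.gov (via the revised query)
print(authoritative_site("best pizza"))  # None
```

The confidence threshold stands in for the patent's confidence scores (point 4), and the rewrite table stands in for its substitute-term query revisions (point 6).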



The post How Does Google Look for Authoritative Search Results? appeared first on SEO by the Sea ⚓.

