While it may not bear the name of Navneet Panda, the Google engineer after whom the Google Panda update is allegedly named, the US patent application DETECTING AND REJECTING ANNOYING DOCUMENTS definitely contains verbiage that looks eerily similar to what we’ve seen and heard about the Panda update. In my opinion, what we’re looking at in this patent application is a part of the technology that Google may be using to “pandalize” a portion of the web.
The abstract reads:
“A system and method for evaluating documents for approval or rejection and/or rating. The method comprises comparing the document to one or more criteria determining whether the document contains an element that is substantially identical to one or more of a visual element, an audio element or a textual element that is determined to be displeasing.”
Personally, I like to print out the patent in its entirety, images and all. For those who feel the same way, here is a link: http://www.freepatentsonline.com/20110219300.pdf. The image I found the most interesting is FIG. 3a, and here is the description of the drawing:
“FIG. 3a is a flow chart illustrating an exemplary method for approving or rejecting an electronic document based on the characteristics of the electronic document according to an embodiment of the invention.”
Basically, a document is identified and then run through a bunch of tests, if you will, to determine whether the document, aka webpage, is acceptable or not. Sounds like Panda, right?
The Field of the Invention gives us a pretty good idea of the problem Google is trying to solve with this patent:
“The present inventions relate generally to detecting undesirable characteristics of a document such as an advertisement and rejecting such document for distribution.”
The detailed description gives us a good overview of what the technology in the patent is capable of:
“The embodiments described herein solve many problems with existing systems and methods. One problem facing Internet content providers is evaluating a large number of documents (such as advertisements presented through its services) to determine whether each is annoying or otherwise displeasing for a wide variety of different users. Embodiments described herein overcome these and other problems by processing a document to determine whether the document is annoying or otherwise displeasing by identifying annoying or displeasing parameters and comparing the document to the parameters (e.g., offensive language or flashing action). The processing may occur automatically, i.e., by a machine-implemented process and/or without human input or intervention.”
“The embodiments described herein enable Flash and animated image documents (e.g., advertisements). Some of these types of ads are annoying. An embodiment of the present invention provides for uploading a document such as an advertisement and comparing the document to specified parameters. The document can be compared to the parameters by a document processor (e.g., automatically by an image processor). The processor may process images, sound files, and other data to identify text, images (as well as spoken words and other data), and actions in the ad. For instance, text may be identified in an image using optical character recognition (OCR) technology. By comparing the document to specified parameters, characteristics can be identified in and associated with the document, and the document can be accordingly rated and approved or rejected based on these characteristics and the status of the ratings of the comparison parameters.”
This smells like the Panda algorithms that have been unleashed on the web, wouldn’t you agree?
Let’s get to the nuts and bolts though… What does the patent actually claim?
“A computer-implemented method of approving a document, the method comprising: analyzing content of a first document to identify one or more first portions, wherein the first portions are visual, textual, or audio portions: identifying one or more second documents that are similar to the first document… determining whether any of the first portions are substantially identical to the second portions that have been predetermined to be unacceptable; and approving the first document only if none of the first portions are substantially identical to the second portions that have been predetermined to be unacceptable.”
This is the “if it walks like a duck and talks like a duck” approach: set up a group of seed documents or webpages, label them “unacceptable,” and then run an algorithm that finds other documents similar to those seeds in order to identify “low-quality” pages.
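To make the seed-document idea concrete, here is a minimal sketch of how an "approve only if nothing matches the bad seeds" check might look. This is my own illustration, not code from the patent: it handles only text portions, it skips the step of first finding similar documents, and the function names and the 0.9 cutoff for "substantially identical" are assumptions I made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff for "substantially identical"; the patent names no value.
SIMILARITY_CUTOFF = 0.9


def substantially_identical(a: str, b: str, cutoff: float = SIMILARITY_CUTOFF) -> bool:
    """Return True when two text portions are near-duplicates of each other."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= cutoff


def approve(portions: list[str], unacceptable_portions: list[str]) -> bool:
    """Approve only if no portion matches a portion already flagged as unacceptable."""
    return not any(
        substantially_identical(portion, bad)
        for portion in portions
        for bad in unacceptable_portions
    )


# Seed portion that a human (or a machine) has already marked unacceptable.
seed_bad = ["Click here now to win a free prize!!!"]

page_a = ["How to repot a houseplant", "Click here now to win a free prize!!"]
page_b = ["How to repot a houseplant", "Water once a week in summer."]

print(approve(page_a, seed_bad))  # page_a reuses a flagged portion almost verbatim
print(approve(page_b, seed_bad))  # page_b shares nothing with the seed
```

A real system would presumably compare images, audio and code as well as text, but the shape of the decision is the same: one match against a known-bad portion is enough to reject.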
“A computer-implemented method of approving a document,… where the second portions have been determined to be unacceptable based on human input or automatically by a computer.”
I think it’s safe to say we’ve all read posts and articles about Google using human input to determine what a “trustworthy” page does and doesn’t look like. More on human input later…
“determining whether the first document contains computer code that may generate an action that is substantially identical to one or more actions determined to be unacceptable.”
Panda was unleashed to rid the web of low-quality results, and I think this is where more human input may have been used to identify undesirable aspects of a webpage.
“determining whether the first document comprises computer code that downloads one or more video or audio documents without initiation by the user.”
More of the same… Google identifying aspects of a webpage that it considers to provide a poor user experience and then nuking everything that looks like it.
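As a rough illustration of what checking for media that starts "without initiation by the user" could look like, here is a small sketch that scans a page's HTML for video or audio tags carrying the standard autoplay attribute. The class and function names are hypothetical, and treating autoplay as the signal is my assumption; the patent does not say how such code would be detected.

```python
from html.parser import HTMLParser


class AutoplayDetector(HTMLParser):
    """Collects <video> and <audio> tags that carry the autoplay attribute."""

    def __init__(self):
        super().__init__()
        self.autoplay_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare attribute has value None.
        if tag in ("video", "audio") and any(name == "autoplay" for name, _ in attrs):
            self.autoplay_tags.append(tag)


def has_autoplay_media(html: str) -> bool:
    """Return True if the page starts video or audio without a user action."""
    detector = AutoplayDetector()
    detector.feed(html)
    return bool(detector.autoplay_tags)


noisy_page = '<html><body><video src="clip.mp4" autoplay></video></body></html>'
quiet_page = '<html><body><video src="clip.mp4" controls></video></body></html>'

print(has_autoplay_media(noisy_page))
print(has_autoplay_media(quiet_page))
```

Media started by a script would slip past a check this simple, so think of it as the easiest case of the pattern the claim describes.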
“A computer-implemented method of rating a document, comprising: identifying one or more first documents that have been determined by human means to include a visual element, an audio element or a textual element that is unacceptable; automatically, by a processing device, comparing a second document to the one or more first documents; determining whether the second document contains an element that is substantially identical to one or more of a visual element, an audio element or a textual element that has previously been determined to be unacceptable; and rating the second document based on the determining.”
In my opinion this was the biggest part of the Panda update. Google figured out a way to take human input and automate it. Automation ftw.
I should note that the first four claims of the patent refer to approving or rejecting a document, while the fifth claim refers to “rating” a document. Panda essentially drew a line in the sand and put websites on either the good side of the line or the bad side (pandalized), and this patent seems to be all about identifying possible low-quality results and then running them through the wringer to determine whether they are acceptable or unacceptable.
Claims 6 and 7 make reference to rating documents and approving or rejecting them as well.
The last claim in this patent that I find interesting is claim #11. Here’s what it says:
“A computer-implemented method comprising: analyzing content of a first document to identify one or more first portions; identifying one or more second documents that are similar to the first document, wherein the one or more second documents have second portions that have been predetermined to be unacceptable where a determination of unacceptability is based on a threshold and where the threshold relates to a presentation parameter associated with the one or more second documents; determining whether any of the first portions are substantially identical to the second portions that have been predetermined to be unacceptable; and rating the document as either acceptable or unacceptable based upon the determining.”
When I read the words “presentation parameter” and “threshold,” the first thing I thought of was ads. When Panda was first released, one of the first things I looked at as a possible factor was the total number of ads on a webpage. Could this be the presentation parameter discussed in this claim, and does the threshold refer to the total number of ads on a webpage?
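If my guess about ads is right, the threshold check itself would be trivial. Here is a sketch of the idea, with the big caveat that everything in it is speculation on my part: the patent never says the presentation parameter is an ad count, and the limit of 5, the Page class and the example URLs are all invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical limit; neither the patent nor Google states a number.
MAX_ADS_PER_PAGE = 5


@dataclass
class Page:
    url: str
    ad_count: int  # the assumed "presentation parameter"


def is_unacceptable(page: Page, threshold: int = MAX_ADS_PER_PAGE) -> bool:
    """Flag a page whose presentation parameter exceeds the threshold."""
    return page.ad_count > threshold


pages = [
    Page("https://example.com/a", ad_count=2),
    Page("https://example.com/b", ad_count=9),
]

# Pages over the threshold would become the "unacceptable" seeds in the claim.
flagged = [page.url for page in pages if is_unacceptable(page)]
print(flagged)
```

In the claim, the threshold decides which seed documents count as unacceptable; other pages are then compared against those seeds.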
In my opinion this is one of the more interesting patent applications to come out recently. This patent is all about rating, accepting and rejecting documents based on comparisons to other seed documents which have been determined to be unacceptable. When Panda was first released it was described as an algorithmic update designed to combat low-quality content and results. This patent is full of different parameters Google can use to not only identify possibly low-quality results but also to automatically compare the identified results with other seed documents that Google has already predetermined to be unacceptable. Does this not sound like Panda to you?