On the internet and the World Wide Web, there has been a significant push to make data machine readable and machine understandable.
Early days of the Web: While separation between content and formatting has long been a standard design principle of the Web, this was mostly true on the server side; when information was published, content and formatting were combined - e.g. the data was embedded within HTML. Services such as search engines and price comparison engines had access only to this HTML; application developers had to build custom scrapers to extract the relevant data from it. When a website's format changed, the scraper had to be recoded. While this works when scraping a few websites, it quickly became unscalable, fragile and error prone as the Web grew exponentially. It became clear that consuming structured data would be critical for such applications; it would also be important for web developers to make structured data readily available to ensure proper indexing and to promote discovery of their sites.
During this time various standards were developed for data models, including RDF, RDFS and OWL. It was also a period of intense competition among search engines, which led to new features such as showing short pieces of structured data next to each search result. This encouraged website owners to mark up some of the data in their webpages so that it was easy for search engines to consume. However, there was no global standard for either the syntax or the semantics to use. A few verticals (e.g. events) had, by then, a somewhat widely accepted standard; for the thousands of other verticals / domains, however, there was no clear guidance. Because of this, many website owners did not add any markup to their sites, and those that did often used incorrect syntax or vocabulary, forcing search engines to compensate by building complex parsers that could ingest erroneous markup.
To improve this situation, companies such as Google, Microsoft, Yahoo and Yandex came together to create schema.org. The goal was to create a common vocabulary and structure that website owners could use; search engines from different providers could then consume the same data, and website owners did not need to republish the information in vocabularies / formats specific to each search engine. Adoption was quickly visible in the first few years after schema.org launched - in Google search snippets, email messages, various intelligent assistants, etc.
Over the last few years, the use of schema.org has taken off, driven by a number of factors: the importance of search engine optimization, the ability of most search engines to ingest the schema.org vocabulary, and the extension of web search to a variety of device formats including tablets, smartphones, watches and cars.
Marking up their sites with schema.org has become table stakes for website owners. Proven benefits of including structured data include:
an increase in the number of visits to the page
users spending more time on the page
a higher interaction rate on the page
a higher click-through rate
In addition to manually authoring structured data on a web page, a number of tools make this easy: extensions for popular content management systems (WordPress, Drupal), website building tools with built-in structured data generators, browser-based tools that can analyze a website to generate or improve structured data, and LLMs that can generate structured data. Formats for the structured data include JSON-LD (recommended by Google), Microdata and RDFa.
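To make this concrete, the sketch below shows what schema.org markup might look like in the JSON-LD format: a small script block embedded in a page's HTML that describes an event. The event name, dates, prices and URL here are invented for illustration; a crawler can read this block directly instead of scraping the surrounding HTML.

```html
<!-- A minimal, hypothetical example of schema.org markup in JSON-LD.
     All event details below are illustrative, not real data. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "Jazz in the Park",
  "startDate": "2024-07-21T19:00",
  "location": {
    "@type": "Place",
    "name": "Central Park Bandshell",
    "address": "New York, NY"
  },
  "offers": {
    "@type": "Offer",
    "price": "25.00",
    "priceCurrency": "USD",
    "url": "https://example.com/tickets"
  }
}
</script>
```

Because the vocabulary (Event, Place, Offer and their properties) is shared, any consumer that understands schema.org can interpret the same block, without per-site scraping logic.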
The Web has billions of pages and tens of millions of authors / publishers, yet publishers and service providers (e.g. search engines) have found a way to exchange data by following machine understandable standards. If such a decentralized system is possible on the open Web, enterprises should consider how to enable machine understandable data within their own walls in an equally decentralized way.