AI Search and New Technical Standards for the Future Web

While regulators battle over rules, another fight is happening at the technical level. New standards, protocols, and licensing frameworks are beginning to emerge that could influence how websites control AI crawling and how marketers approach SEO, GEO, and content strategy.  

Technical Standards 

Technical standards are another interesting battleground. Several attempts have been made so far to establish new technical standards for controlling how LLMs access and use content, but none has yet achieved wide adoption or proven effective.

The old-fashioned robots.txt file (still the only true standard, but in need of updating)

The current standard for a website to control crawling of its content by bots is the robots.txt file. Supported by most major, reputable crawl-bots, robots.txt lets websites control which pages and files can be accessed by which bots. 

However, robots.txt has shortcomings. It only addresses whether a bot may crawl; it says nothing about how the content may be used once crawled, and it lacks granularity. A major problem is the all-or-nothing dilemma: if a website blocks the bot of a major search engine or LLM, it loses almost all visibility in that channel. This is especially an issue with Google, which offers no way to prevent content from being used in Google's generative AI results without also blocking Google Search.
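For illustration, selective blocking in robots.txt looks roughly like the sketch below. GPTBot and CCBot are the published user-agent tokens for OpenAI's and Common Crawl's crawlers; confirm the current token names in each company's documentation before relying on them.

    # Block OpenAI's training crawler from the entire site
    User-agent: GPTBot
    Disallow: /

    # Block Common Crawl, whose corpus is widely used for AI training
    User-agent: CCBot
    Disallow: /

    # Allow all other crawlers, including search engine bots
    User-agent: *
    Allow: /

Note that nothing in this file can separate Google Search crawling from Google's AI answers; blocking Googlebot removes a site from both, which is the all-or-nothing dilemma described above.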

Still, the idea of blocking bots in robots.txt has gained steam. In January, Buzzstream analyzed the top 100 news sites and found that 79% blocked at least one AI training bot. However, for the typical business or services website, we would not advise blocking LLM bots in the robots.txt file.

Markdown 

Markdown is a lightweight markup language that serves as a simpler alternative to HTML, and some have argued it is the best format for AI bots. Google and Bing, however, have said they do not recommend serving separate Markdown versions to LLMs, and it generally does not make sense to force anything other than the HTML that users see onto LLM bots.

That said, in February Cloudflare introduced a new Markdown for Agents feature that is noteworthy. Two elements of Cloudflare's approach are worth highlighting:

  • It only serves Markdown to bots that explicitly request it (see the sketch after this list)
  • Cloudflare customers can turn it on with a simple toggle switch, so if you are a Cloudflare customer, this may be worth testing
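As a rough sketch of the idea (the Accept header below is our assumption about how an agent would signal its preference; check Cloudflare's documentation for the exact mechanics), this is classic content negotiation: the Markdown version is returned only when a client explicitly asks for it, and everyone else gets the normal HTML.

    # Hypothetical agent request asking for a Markdown representation of a page
    curl -H "Accept: text/markdown" https://www.example.com/services

    # A normal request without that header still receives the regular HTML page
    curl https://www.example.com/services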

Really Simple Licensing 

Really Simple Licensing (RSL) is an open content licensing standard that launched in September 2025, championed by a collective of major publishers, content platforms, and CDNs including Reddit, Yahoo, Medium, Vox Media, Akamai, and Cloudflare. RSL is designed to “protect content rights in the AI era” and “enable AI and other technology companies to license content at internet scale.” It is a powerful concept that is especially important to large content sites with ad-driven and/or subscription-based revenue models. However, it remains unclear how cooperative AI companies will be and how the legal gray areas will be enforced.
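Conceptually, RSL attaches machine-readable license terms to content so that crawlers can discover what uses are permitted and on what terms. The sketch below is purely hypothetical and does not reproduce the actual RSL schema (see the RSL specification for the real syntax); it is only meant to show the kinds of terms such a standard can express.

    <!-- Hypothetical, simplified illustration; not the real RSL schema -->
    <license>
      <content path="/articles/" />
      <permits usage="search" terms="free" />
      <permits usage="ai-training" terms="paid" />
      <attribution required="true" />
      <contact>licensing@example.com</contact>
    </license>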

llms.txt

The llms.txt file is a proposed standard, introduced in 2024, as a means to control and guide crawling by AI bots. It is like a combination of a robots.txt file and an XML sitemap, but aimed at LLM bots rather than search engine spiders. However, Google has stated repeatedly that it does not use or endorse llms.txt, which makes it dead in the water. Further, there is no research indicating that llms.txt impacts AI visibility or citations.
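For reference, the proposal (llmstxt.org) calls for a plain Markdown file served at /llms.txt: an H1 title, a short blockquote summary, and sections of annotated links, roughly as sketched below. The URLs and descriptions here are placeholders.

    # Example Company
    > A short summary of what the site offers and who it serves.

    ## Services
    - [Consulting](https://www.example.com/consulting): overview of consulting offerings
    - [Pricing](https://www.example.com/pricing): current plans and rates

    ## Optional
    - [Company history](https://www.example.com/about/history): background material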

IETF for robots.txt expansions (Worth monitoring) 

In January, the Internet Engineering Task Force (IETF) launched the AI Preferences Working Group (AIPREF) with the intent to create new standards for how AI models should use Internet content. 

The IETF is a body to take seriously, for two reasons in particular: 

  1. The IETF has a history of producing major standards, including robots.txt (formalized as RFC 9309) as well as the Internet protocol suite, aka TCP/IP
  2. Google is on board. Google is an active participant in the IETF and its AIPREF work, and it is difficult to see any new standard becoming the standard unless it is endorsed and used by Google.

The AIPREF Working Group is currently focused on extending the robots.txt file to address at least some of its shortcomings in the AI age, including allowing sites to distinguish between search and AI training. Hopefully, we will see formal updates roll out this year.
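Nothing is finalized, but the kind of search-versus-training distinction under discussion could end up looking something like the hypothetical robots.txt extension below. The directive and values shown are illustrative only, not the working group's actual syntax.

    # Hypothetical future syntax; the AIPREF drafts define the real vocabulary
    User-agent: *
    Allow: /
    # Express a usage preference: fine for search, not for AI training
    Content-Usage: search=y, train-ai=n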

Implications for Digital Marketers 

It’s a lot, right? There are countless battles being fought to regulate, standardize, and litigate, and we’ve only scratched the surface. But here’s what this may mean for us.  

Standards and regulations will shape the economics of SEO, GEO, and AEO, which in turn has a major impact on broad digital marketing strategy in a given market. They will also affect tactics and technical implementation.

SEO is more than Google. The more regulation there is, the more that will be the case.

AI has finally introduced meaningful competition to Google. At this point in time, Google still dominates the landscape, but impactful antitrust suits and regulations increase market competitiveness, diversity, and complexity. Indeed, wise digital marketers should already be preparing for a broader-than-Google AEO/GEO landscape. 

Europe will lead on AI search regulation, starting with Google.  

Through direct regulation, as well as the threat of it, Google and Big Tech are being forced to play more fairly. What that looks like exactly is being determined now in Europe, which will then influence the global arena.  

Regulations and standards impact the rate of SEO traffic cannibalization.

The more generative AI search results and answer engines cannibalize website traffic, the less attractive it becomes to invest in SEO, GEO, and AEO. Regulations and standards have a major impact on shaping cannibalization levels, for example by influencing attribution and competition in query results. 

Attribution requirements? (Hopefully) 

Mandatory attribution in AI results would be great and would increase the benefits of SEO and GEO, especially if attribution comes in the form of links.

But is it wishful thinking? Our guess is yes in the U.S. for the near future, but we'll see what happens in Europe and how regulatory matters shape the behavior of Google and Big Tech globally.

More opt-out options are likely coming  

Look out for changes to robots.txt, especially from the IETF. In general, expect more options for websites to opt out of AI scraping.

But think before you opt out. Opting out only makes sense for a minority of websites, particularly large sites whose primary product is content. For e-commerce and services businesses, the costs of opting out generally outweigh the benefits.

Pay-to-scrape changes the dynamics.

Will it become common to see LLMs paying websites to access and use their content? If so, this could impact digital marketing in a huge variety of ways, including: 

  • A potential income source, especially for major content publishers. For most business websites that make their revenue from products and services, however, we would anticipate that it will make more sense to continue providing free access than to go pay-to-scrape.
  • Shaping competition levels in query results as LLMs decide which websites to include. This could significantly decrease competition in Google and LLM platforms for some websites that retain free access, increasing the benefit of SEO and GEO.
  • Helping determine the fates of many major content sites, which could also change the long-term SEO/GEO competitive landscape in some industries.

Many of these regulatory battles and technical standards are still taking shape. The outcomes will determine how AI systems access, attribute, and compete for web content in the years ahead. For digital marketers, staying aware of these shifts will remain important as the landscape continues to evolve.