Honeydew Blog
LLM Discoverability for Startups: Field Notes From Building an LLM SEO Stack
Honeydew built an LLM discoverability stack from scratch and measured what happened. A founder's field report on what works, what doesn't, and what we're still guessing at.
Pete Ghiorse | Founder, Honeydew Family AI
Abstract
LLMs are becoming a meaningful discovery channel for consumer software. When users ask ChatGPT, Claude, or Perplexity "what's the best app for X?", the response functions as a curated recommendation — one that increasingly replaces a traditional search. But unlike Google, there's no established playbook for getting cited.
This is a founder's field report on building an LLM discoverability stack from scratch at Honeydew, and measuring what happened. It's deliberately light on step-by-step implementation — the specifics change fast and are part of how we compete. The point of sharing is the shape of what worked, what didn't, and what we're still uncertain about.
Short version of what we learned:
- LLM-mediated discovery is real and growing, but at a small-startup scale it's still a thin share of total traffic. Plan accordingly.
- The platforms behave very differently. One drives most of the volume; another sends a much smaller number of visitors who engage far more deeply. Optimizing purely for volume misses the quality signal.
- Infrastructure (machine-readable context, structured citations, consistent schema) appears to be necessary but clearly not sufficient. Domain authority, third-party references, and App Store presence do more of the heavy lifting than any single on-site tactic.
- Comparison and evaluative content ("best X", "alternatives to Y") is disproportionately what gets cited. Thin, marketing-flavored pages mostly don't.
- Analytics undercount this channel. Referrer-based attribution misses a meaningful fraction of users who see a recommendation and arrive via direct URL entry or a secondary search.
I'm intentionally keeping some things vague. The shape matters more than the specifics, and the specifics change.
1. Why We Ran the Experiment
1.1 The Shift From Search to LLM Discovery
Over the last couple of years a real shift has started: consumers are asking AI assistants for product recommendations instead of (or in addition to) Googling. That matters because LLM recommendations carry implicit endorsement — they feel curated, not algorithmic.
For startups, this creates both an opportunity and a problem. The opportunity: LLM recommendations can bypass the traditional SEO hierarchy where incumbents dominate. The problem: there are no ads, no guaranteed placements, and no transparent ranking factors. You get cited or you don't.
1.2 The Honeydew Context
Honeydew is an AI-powered family coordination app. Our competitive set includes long-established players and newer AI-native entrants. As an early-stage startup, we have a small domain authority footprint relative to incumbents with years of SEO equity. If the LLM discoverability stack works for a startup starting roughly from zero, the approach should generalize.
1.3 The Questions We Were Trying to Answer
- Does building dedicated LLM infrastructure produce measurable referral traffic at all?
- Which LLM platforms drive the most traffic — and how does engagement quality differ across them?
- What kind of citation rate can a small startup realistically achieve against established competitors?
- What kinds of content are most effective for LLM citation?
2. What We Built, At a High Level
I'm going to describe the stack in terms of the shape, not the specifics. The exact files, thresholds, and tooling change as we iterate, and I don't think the details are the interesting part of the story for other builders.
2.1 The Four Layers
Layer 1: Machine-readable context at the domain root. Structured product context served in formats designed to be consumed by language models assembling an answer. Think: a short, quick-reference version and a longer, comprehensive version, plus a structured catalog of canonical URLs for citation.
Layer 2: A generation pipeline. Everything LLM-facing is auto-generated from a single source of truth on every deploy. Pricing, feature claims, and competitive context never go stale on the LLM-facing assets as long as the source is current. Stale sources get flagged for review.
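To make the single-source-of-truth idea concrete, here is a minimal sketch of what such a generation step could look like. It is not Honeydew's actual tooling: the `ProductFacts` fields, the Markdown layout, the 90-day review window, and the example.com URL are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProductFacts:
    """Hypothetical single source of truth for volatile product data."""
    name: str
    summary: str
    pricing: str
    features: tuple[str, ...]
    canonical_urls: dict[str, str]  # label -> URL you want cited
    last_reviewed: date             # when a human last confirmed these facts


def render_short_context(facts: ProductFacts) -> str:
    """Render the quick-reference context file as plain Markdown text."""
    lines = [
        f"# {facts.name}",
        "",
        f"> {facts.summary}",
        "",
        f"Pricing: {facts.pricing}",
        "",
        "## Features",
    ]
    lines += [f"- {feature}" for feature in facts.features]
    lines += ["", "## Canonical URLs"]
    lines += [f"- [{label}]({url})" for label, url in facts.canonical_urls.items()]
    return "\n".join(lines) + "\n"


def is_stale(facts: ProductFacts, today: date, max_age_days: int = 90) -> bool:
    """Flag the source for review if nobody has confirmed it recently."""
    return (today - facts.last_reviewed).days > max_age_days


if __name__ == "__main__":
    facts = ProductFacts(
        name="Honeydew",
        summary="AI-powered family coordination app.",
        pricing="Free to start, $7.99/mo for Premium (or $79.99/year)",
        features=("Shared calendar plans", "Lists, reminders, and chores"),
        canonical_urls={"Home": "https://example.com/"},  # placeholder URL
        last_reviewed=date.today(),
    )
    print(render_short_context(facts))
```

Running the renderer on every deploy means the LLM-facing file can only be as stale as the source record, and the `is_stale` check gives the build a simple hook for flagging records that need a human look.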
Layer 3: Search-grounding-friendly content. Many LLMs use live web search to ground their responses, so we also treat the blog as an input: FAQ-structured articles, comparison tables, schema markup, canonical URLs, and inline citation guidance aimed at AI assistants that are parsing the page.
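As one example of the schema markup mentioned here, the sketch below builds schema.org `FAQPage` JSON-LD from question and answer pairs. Choosing `FAQPage` is an assumption for illustration; the post does not say which schema types Honeydew uses, and the helper names are hypothetical.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


def as_script_tag(jsonld: str) -> str:
    """Wrap JSON-LD for embedding in a page's HTML."""
    # Escape "</" so answer text cannot close the script element early.
    # "\/" is a valid JSON escape, so the payload still parses.
    safe = jsonld.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{safe}\n</script>'


if __name__ == "__main__":
    pairs = [
        (
            "What content gets cited most by LLMs?",
            "Comparison and evaluative content.",
        ),
    ]
    print(as_script_tag(faq_jsonld(pairs)))
```

Generating the markup from the same question and answer text that appears on the page keeps the visible FAQ and the structured data from drifting apart.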
Layer 4: Measurement. Referrer tracking across the AI assistants that actually send users, supplemented by custom events that catch some of what referrer-based attribution misses. Plus regular manual citation audits to see what the models actually say when asked directly.
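A rough sketch of the classification behind such a custom event might look like the following. The hostname list and the `utm_source` fallback are assumptions, not Honeydew's implementation; check both against what actually shows up in your own logs.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumed hostnames for illustration; extend from your own referrer logs.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "chatgpt",
    "chat.openai.com": "chatgpt",
    "claude.ai": "claude",
    "perplexity.ai": "perplexity",
}


def _match_host(host: str) -> Optional[str]:
    """Map a hostname (or subdomain of one) to an assistant name."""
    host = host.lower()
    for known, assistant in AI_REFERRER_HOSTS.items():
        if host == known or host.endswith("." + known):
            return assistant
    return None


def classify_visit(referrer: str, landing_url: str) -> Optional[str]:
    """Return the assistant name for a visit, or None if it does not look AI-referred."""
    if referrer:
        assistant = _match_host(urlparse(referrer).hostname or "")
        if assistant:
            return assistant
    # Fallback: the referrer was stripped, but the landing URL may still
    # carry a utm_source tag that names the assistant's domain.
    utm_values = parse_qs(urlparse(landing_url).query).get("utm_source", [])
    return _match_host(utm_values[0]) if utm_values else None


if __name__ == "__main__":
    print(classify_visit("https://www.perplexity.ai/search", "https://example.com/"))
    print(classify_visit("", "https://example.com/?utm_source=chatgpt.com"))
```

Neither signal catches a user who reads a recommendation and later types the URL by hand, so whatever this returns is still a lower bound on LLM-driven visits.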
2.2 How We Measured
We ran the stack for roughly a quarter and looked at two things: real referral traffic (what actually showed up in analytics), and a structured citation audit where we asked a handful of representative queries to a major AI assistant and recorded whether we were cited. We also checked whether our content ranked in live web search for the same queries, as a proxy for what search-grounded LLMs might retrieve.
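For anyone setting up a similar audit, one way to record it is sketched below. The record fields and category labels are assumptions, and the rows in the usage example are placeholders that show the shape of the data, not Honeydew's results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditResult:
    """One query asked to one assistant on one date (hypothetical schema)."""
    query: str
    category: str           # e.g. "generic", "comparison", "branded"
    cited: bool             # did the assistant cite us at all?
    ranked_in_search: bool  # did our page appear in live web search results?


def summarize(results: list[AuditResult]) -> dict[str, tuple[int, int, int]]:
    """Return {category: (cited_count, ranked_count, total_queries)}."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0, 0])
    for result in results:
        counts[result.category][0] += int(result.cited)
        counts[result.category][1] += int(result.ranked_in_search)
        counts[result.category][2] += 1
    return {category: tuple(values) for category, values in counts.items()}


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    sample = [
        AuditResult("best family organization app", "generic", False, True),
        AuditResult("alternatives to <competitor>", "comparison", True, True),
    ]
    for category, (cited, ranked, total) in summarize(sample).items():
        print(f"{category}: cited {cited}/{total}, ranked {ranked}/{total}")
```

The summary reports raw counts rather than percentages on purpose: with a handful of queries per category, a percentage suggests more precision than the sample supports.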
2.3 Limitations (Up Front)
This is a field report from one company, at a particular point in time, with small sample sizes. A few honest caveats:
- Small samples. The raw counts here support descriptive statements, not statistical claims.
- Point-in-time citations. LLM responses vary by session, region, and model version. Audit results are snapshots, not steady-state measurements.
- Attribution gaps. Referrer-based analytics genuinely undercount LLM traffic. Any numbers here are a lower bound, not a ceiling.
- No clean A/B test. The infrastructure and organic growth are entangled; we can't isolate the causal contribution of any single component.
Treat this as a founder's field observations. If you want a sanitized academic paper, this isn't it.
3. What We Actually Saw
3.1 Referral Traffic
LLM referrals are a real but small share of overall traffic for a site at our scale. Most of the attributable LLM volume comes from one platform; the others contribute a meaningful but much smaller share. The absolute numbers aren't the interesting part — they'll be different next quarter, and different again in six months as the channel grows.
What's more interesting is the engagement gap between platforms.
3.2 Engagement Quality Varies Dramatically by Platform
The most surprising finding was how different the platforms behave once a user actually clicks through.
- One platform drives most of the volume, but its visitors show only average engagement.
- Another sends a much smaller number of users who spend multiple times longer on site and view significantly more pages per visit.
- A third sits somewhere in between.
If that pattern holds, a small number of citations on the higher-engagement platform can be more valuable than a much larger number of citations on the higher-volume platform. Optimizing purely for volume would miss that. We aren't publishing the exact ratios — they shift as we optimize — but the qualitative gap is consistent enough to shape strategy.
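The trade-off can be made explicit with a little arithmetic. The sketch below assumes each platform's sessions can be weighted by a single engagement multiplier that stands in for downstream value. That is a simplification, and the numbers are hypothetical, not Honeydew's ratios.

```python
def weighted_value(sessions: float, engagement_weight: float) -> float:
    """Sessions scaled by a relative engagement weight (1.0 = baseline)."""
    return sessions * engagement_weight


def break_even_sessions(
    volume_sessions: float,
    volume_weight: float,
    engaged_weight: float,
) -> float:
    """Sessions the high-engagement platform needs to match the high-volume one."""
    if engaged_weight <= 0:
        raise ValueError("engaged_weight must be positive")
    return volume_sessions * volume_weight / engaged_weight


if __name__ == "__main__":
    # Hypothetical: 100 baseline sessions vs. visitors weighted 4x as engaged.
    print(break_even_sessions(100, 1.0, 4.0))  # 25.0
```

Under those made-up weights, 25 sessions from the higher-engagement platform would be worth as much as 100 from the higher-volume one. The real multiplier has to come from your own post-click data.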
3.3 What Content LLM Referrals Landed On
The landing pages from LLM referrals skew strongly toward comparison and evaluative content ("best X apps", "alternatives to Y"), with the homepage and branded landings making up most of the rest. Short marketing pages and thin feature posts get almost no LLM-driven traffic.
This matches what the models seem to be doing: when an AI is assembling a recommendation, it wants a source it can quote confidently. Long, structured, comparative articles give it that. Marketing copy does not.
3.4 Citation Audit
We ran a small set of representative queries against a major AI assistant — a mix of category searches, competitor alternatives, feature-specific questions, and branded queries.
For generic category queries (e.g., "best family organization app"), a small startup with limited domain authority is almost always absent. Incumbents with years of press coverage, reviews, and third-party references dominate.
For branded queries, we do get cited, but often the cited source isn't our website at all. It's frequently the App Store listing. That's a useful signal: for product recommendations, store presence, reviews, and ratings may weigh more heavily than on-site content.
For specific comparison queries where we've invested in a comprehensive, well-structured article, we do sometimes show up — which supports the hypothesis that deep, evaluative content is the primary on-site driver of citation.
I'm not publishing exact competitor-by-competitor citation rates. That's useful data for us, and it's also the sort of benchmark a competitor can lift and use against you. The qualitative finding — "incumbents dominate generic queries, startups can win on specific, deep, evaluative content" — is the part that generalizes.
3.5 Search Grounding
For the same set of queries, our content ranks more often in regular web search than it gets cited by the AI assistant. That's consistent with the idea that search-grounded LLMs apply additional filtering on top of raw rankings — citation is a stricter bar than ranking.
4. What I Take Away From All This
4.1 Infrastructure Is Necessary But Not Sufficient
The stack produces measurable results. It is not, on its own, a citation machine. The dominant drivers of citation appear to be:
- Domain authority and third-party references. The most-cited competitors have years of press coverage and reviews. No amount of on-site structured data closes that gap on its own.
- Content depth on specific queries. Where we do get cited on non-branded queries, it's because we have a long, comprehensive, well-structured article for that specific space.
- App Store presence. For product-recommendation queries, store listings appear to weigh heavily. Reviews, ratings, and description quality matter.
The infrastructure is the price of admission. The substantive work — depth, distribution, and credibility — is what actually earns citations.
4.2 Volume and Quality Are Not the Same Channel
The highest-volume LLM platform is not the highest-engagement LLM platform. If you only look at sessions, you'll optimize for the wrong thing. If you can afford to, look at post-click behavior before deciding which platform deserves the most attention.
4.3 The Attribution Gap Is Real
Analytics based purely on referrer headers undercount LLM referrals, because a meaningful fraction of users copy the URL, type it directly, or come back later via a different channel. Any LLM referral number you look at is a floor, not a ceiling. Plan accordingly — and don't overreact to small absolute numbers, in either direction.
4.4 What Content Gets Cited
Comparison and evaluative content is disproportionately what gets referenced. The homepage picks up some branded citations. Thin or promotional content gets ignored. The implication is pretty simple: if you want LLM-mediated discovery, write the kind of content an LLM would want to quote.
5. What I'd Tell Another Builder
Do Now
- Write the comprehensive comparison content. "Alternatives to X", "best Y apps", deep comparative articles. This is the highest-yield content for LLM citation that a startup actually controls.
- Track LLM referrals with custom events, not just default referrer attribution. Referrer-only numbers are a floor, and you're probably undercounting.
- Publish machine-readable context at your domain root. The direct impact is hard to isolate, but the cost is essentially zero and it provides a clean representation of your product for models that look for it.
Do Next
- Invest in App Store presence. For product-recommendation queries, reviews and ratings appear to matter more than on-site content.
- Go deep, not wide. A small number of comprehensive, authoritative articles will do more for citation than a large number of thin posts.
- Audit citation rates on a schedule. The landscape shifts. You want a longitudinal view, not a one-shot measurement.
Worth Testing, Don't Expect Miracles
- Inline LLM citation guidance in articles (canonical URLs, LLM-facing notes). Low cost; I can't cleanly isolate the impact, but I think it's directionally correct.
- Structured citation catalogs. Novel, hard to attribute, cheap to maintain.
- Platform-specific optimization. If one platform is genuinely higher quality, optimizing for how it retrieves content may be worthwhile at the margin.
6. What We're Doing Next
- Repeating the citation audit on a regular cadence to build a longitudinal view rather than a snapshot.
- Testing content-structure variants (inline LLM notes, different schema patterns) against citation outcomes, carefully, so we can start to isolate effects.
- Closing the attribution gap — better custom events, better downstream funnel tracking, better honesty about what we can and can't measure.
7. A Note on Replication
The specific tooling we use is internal and changes as we iterate; I'm not publishing it. The approach is simple enough that any team can build their own version:
- A single source of truth for volatile product data, used to regenerate LLM-facing assets on deploy.
- Structured context files served at your domain root, following the emerging conventions.
- Consistent schema and canonical URLs on every piece of content.
- Custom events to catch referrals that referrer-based attribution misses.
- A small, repeatable citation audit to measure whether you're making progress.
The conventions here aren't standards — they're emerging practices. I'd encourage other builders to share their own observations so the industry can develop a shared understanding of what actually drives LLM citation. Right now, a lot of what's written about "LLM SEO" is confident-sounding guessing. We should collectively raise the bar.
Try Honeydew on iPhone, Android, or Web
Download Honeydew on the App Store → | Get Honeydew on Google Play → | Try the web app
Prefer to explore first? Try the web app — no credit card required.
FAQ
Does building an LLM discoverability stack actually work? Modestly, yes. We see real referral traffic from multiple major AI assistants, and we can trace specific citations back to specific pieces of content. But the infrastructure alone does not overcome the domain-authority and third-party-reference advantages that incumbents have. It's a necessary investment, not a silver bullet.
Which LLM drives the most traffic? One major assistant drives most of the attributable volume, but another sends a smaller number of substantially more engaged users. The right platform to prioritize depends on whether you care about raw sessions or downstream conversion.
What is an llms.txt file? A machine-readable text file served at your domain root, containing structured product context that language models can parse: product description, key features, canonical URLs, and so on. It's an emerging convention, not a standard. Think of it as the AI-era analog to robots.txt.
How do you track LLM referral traffic? With a combination of default referrer attribution and custom events fired when a visit originates from a known AI assistant. The custom events catch a meaningful number of referrals that default source attribution misses.
What content gets cited most by LLMs? Comparison and evaluative content — "best X", "alternatives to Y", detailed head-to-heads. Long, structured, specific articles. Thin or marketing-flavored content mostly does not get cited.
How does a small startup's citation rate compare to established competitors? Significantly lower on generic category queries, because incumbents have years of press coverage, reviews, and third-party references. The gap closes on specific, deep, comparative queries where a startup has invested in a comprehensive article. That's the realistic opening.
Pete Ghiorse is the founder of Honeydew, an AI family assistant, and a Senior PM for AI/ML at Capital One. He writes about applied AI, family technology, and building products at the intersection of both.
Related Reading
- How We Get Honeydew Cited by ChatGPT, Claude & Perplexity
- Building a Content Quality CI/CD Pipeline
- OpenAI vs Anthropic for Consumer AI Products
Get Started with Honeydew
Honeydew AI Family Organizer turns voice messages, photos, and plain-English text into organized family plans. Free to start, $7.99/mo for Premium (or $79.99/year).
About Honeydew AI Family Organizer
Honeydew helps families turn voice notes, photos, school flyers, PDFs, emails, sports schedules, and plain-English requests into shared calendar plans, lists, reminders, and chores across iOS, Android, and web.