Lies, Damned Lies, And LLM Statistics

This will change how you see AI forever.

… is the breathless message behind myriad reposts of the WHERE AI GETS ITS FACTS meme, originated by Visual Capitalist, from which we are meant to conclude that Reddit-sourced slop comprises nearly half the inputs.

Shift your gaze from the clickbait headline down to the barely legible reference to Semrush’s study (noting in passing that the percentages shown sum to well over 100%) and you’ll find a cryptic mention of “150k citations”.
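One plausible explanation for the over-100% total (an assumption on my part, since the raw data isn’t shown): a single LLM response can cite several domains at once, so per-domain citation shares need not sum to 100%. A minimal sketch with made-up responses:

```python
# Hypothetical responses, each citing a set of domains.
# Per-domain share = fraction of responses citing that domain.
responses = [
    {"reddit.com", "wikipedia.org"},
    {"wikipedia.org", "youtube.com"},
    {"reddit.com"},
]

domains = sorted(set().union(*responses))
total = len(responses)
shares = {d: 100 * sum(d in r for r in responses) / total for d in domains}

# reddit.com and wikipedia.org each appear in 2 of 3 responses (~66.7%),
# youtube.com in 1 of 3 (~33.3%) — the shares sum to ~166.7%.
print(shares, sum(shares.values()))
```

Because the denominator is responses rather than citations, any co-citation pushes the column total past 100%, which is consistent with what the chart displays.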

First, a “citation” usually occurs (unless we’re hallucinating!) at “inference” time: when a user issues a query and a system assembles a response, often aided by just-in-time data retrieval. Inference is distinct from “training”, the phase in which the model itself is built.

Second, which sources prove salient at inference time hinges on who is performing the queries, which in turn drives both topic selection and prompting skill. I would expect to see very different cited sources when discussing, say, the design of an air-fried chicken recipe versus a Python text-parsing function.

Third, we must distinguish between data “volume” and “velocity”, two of the three Vs of data engineering (“variety”, the third, is less pertinent here). Volume refers to the amount of data extant in a corpus, whereas velocity refers to its rate of arrival. I would want to know how well each of these parameters maps between the real world and an LLM’s grasp thereof: some data sets comprise massive amounts of useful information, while others offer a relentless firehose of perishable content.
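The distinction is easy to make concrete. With entirely invented numbers (nothing here comes from the Semrush study), a stable encyclopedia and a busy forum can differ wildly in daily turnover, i.e. what fraction of the corpus is brand-new each day:

```python
# Illustrative, made-up figures:
# volume   = total items in the corpus
# velocity = new items arriving per day
corpora = {
    "encyclopedia": {"volume": 60_000_000, "velocity": 500},
    "forum":        {"volume": 10_000_000, "velocity": 1_000_000},
}

for name, c in corpora.items():
    turnover = c["velocity"] / c["volume"]  # fraction replaced per day
    print(f"{name}: {turnover:.4%} of the corpus is new each day")
```

On these toy numbers the forum churns through 10% of its content daily while the encyclopedia barely moves, which is exactly the “useful archive versus perishable firehose” contrast a citation count alone cannot capture.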

Finally, perusing the Semrush report upstream of the meme-cast version, you’ll find a similar-looking chart with a much duller title: “Top Domains Cited on LLMs”. A quick read of their Methodology reveals testing tailored more to AIO than to sophisticated LLM prompting: throwing short, keyword-laden phrases at an LLM in the “Googling a thought fragment” style reminiscent of the hilarious YouTube series If Google Was A Guy.

To explore adjacent topics, check out two recent pieces — also available in podcast form — From Chunky To Smooth and Farewell To Punch Cards.

