Where does AI training data come from? You may have been told that Large Language Models (LLMs) are trained on vast amounts of text data scraped from the public web. This is true, but it isn’t the whole story. Simply scraping the web and dumping data into a database isn’t enough to produce a high-performing LLM. For that, the models need domain-specific knowledge, which, at the moment, mostly comes from employing subject matter experts to evaluate the data via a process known as annotation. Let’s review.
The Web is Full of Bad Data
Web-scraped data can be noisy, irrelevant, or low-quality, all of which hurt the performance and accuracy of LLMs. It can also be imbalanced, with some topics or styles overrepresented, which further compromises their effectiveness. And, of course, web scraping raises ethical concerns such as copyright infringement, privacy violations, and biases in the data. Annotators help address all these issues.
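As a rough illustration of the kind of cleanup that scraped text typically needs, the sketch below applies two simple heuristics: it drops documents that are very short or mostly non-alphabetic, and it removes exact duplicates after normalizing whitespace and case. The function names and the thresholds (`min_words`, `min_alpha_ratio`) are assumptions chosen for illustration, not values from any particular pipeline.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def looks_usable(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical quality heuristic: enough words, mostly letters or spaces."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def clean_corpus(docs: list[str]) -> list[str]:
    """Keep the first copy of each usable document, in original order."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        key = normalize(doc)
        if not looks_usable(doc) or key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept


scraped = [
    "Annotation adds context and meaning to raw text data.",
    "annotation adds   context and meaning to raw text data.",  # duplicate after normalizing
    "Click here!!!",                                            # too short
    "$$$ 100% #1 >>> 4 u 2 c 0 1 2 3 4 5 6 7 8 9 <<<",          # mostly symbols and digits
]
cleaned = clean_corpus(scraped)
print(cleaned)
```

Heuristics like these only catch the obvious junk. Judging relevance, balance, and bias is where human review comes in.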
Humans To The Rescue
Annotation is a meticulous process in which human evaluators review text data and mark it up to provide context and meaning. The process begins with data selection, where a large dataset of text is curated based on the specific use case and the domain expertise required. Clear guidelines and instructions are then developed to ensure consistency and quality in the annotation process. Human evaluators then perform various annotation tasks, such as labeling, tagging, describing, and rating, to provide a deeper understanding of the text content.
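One common way to check whether guidelines are producing consistent labels is to have two annotators label the same items and measure how often they agree beyond what chance alone would predict. The sketch below computes Cohen’s kappa for that purpose. The metric is not named in this article; it is offered here as one standard option, and the labels and data are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A value near 1 suggests the guidelines are being applied consistently. A value near 0 means the annotators agree about as often as they would by chance, which is usually a sign that the instructions need tightening.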
Can’t AI Do This By Itself?
AI models can’t entirely replace human annotation, but several techniques reduce how much of it is needed. Self-supervised learning, weak supervision, active learning, and transfer learning are among the methods used to supplement it. However, high-quality human annotation is still essential for achieving optimal AI model performance, especially in critical applications like healthcare or finance.
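To make one of these ideas concrete, here is a minimal sketch of active learning by uncertainty sampling: a model scores unlabeled items, and only the ones it is least sure about are sent to human annotators. The item IDs, the probabilities, and the `select_for_annotation` helper are all hypothetical, and entropy is just one of several possible uncertainty measures.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions: dict[str, list[float]], k: int) -> list[str]:
    """Return the IDs of the k items the model is most uncertain about."""
    ranked = sorted(predictions, key=lambda item: entropy(predictions[item]), reverse=True)
    return ranked[:k]


# Hypothetical model outputs: item ID -> [P(negative), P(positive)]
model_predictions = {
    "doc-1": [0.98, 0.02],  # confident
    "doc-2": [0.55, 0.45],  # unsure
    "doc-3": [0.10, 0.90],  # fairly confident
    "doc-4": [0.50, 0.50],  # maximally unsure
}
to_annotate = select_for_annotation(model_predictions, k=2)
print(to_annotate)
```

The human effort goes where the model needs it most, instead of being spread evenly across items it already handles well.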
Machines Who Need People
Human evaluators, often called annotators, typically perform annotations. These individuals are usually experts in the specific domain or task, possessing the necessary knowledge and understanding to accurately label and categorize the data. Annotators may be in-house teams, freelancers, crowdsourced workers, or specialized annotation companies. Wouldn’t it be great to know who these people actually are? Are they credentialed in the areas they purport to be credentialed in? Are they truly subject matter experts? Was the expertise crowdsourced? Good questions. Good luck getting them answered.
Recruiting Annotators
Companies recruit annotators through various strategies, including job postings, professional networks, crowdsourcing platforms, training programs, and partnerships. I’ve seen postings on LinkedIn, and a few of my clients have used crowdsourcing platforms or services like Amazon Mechanical Turk.
How Annotators Annotate
If you’ve read this far, you might be wondering how annotators actually annotate. There are various techniques, including Named Entity Recognition (NER), Part-of-Speech (POS) Tagging, Dependency Parsing, and Sentiment Analysis. NER involves identifying and labeling specific entities like names, locations, and organizations. POS Tagging identifies the grammatical categories of words like nouns, verbs, and adjectives. Dependency Parsing analyzes sentence structure and relationships between words. Sentiment Analysis determines the emotional tone or sentiment behind the text. To learn more, just ask your favorite LLM about any of this. You’ll get a pretty good answer.
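To show what the output of these four tasks might look like as data, here is a small sketch that annotates a single sentence by hand. The record layout, the `mark_entity` helper, and the tag names (loosely modeled on Universal Dependencies-style labels) are assumptions for illustration. Real annotation projects define their own schemas and tools, and the dependency analysis shown is only one plausible reading.

```python
from dataclasses import dataclass


@dataclass
class EntitySpan:
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity
    label: str   # e.g. PERSON, ORG, LOC


def mark_entity(text: str, phrase: str, label: str) -> EntitySpan:
    """Locate the first occurrence of phrase in text and wrap it as a labeled span."""
    start = text.index(phrase)
    return EntitySpan(start, start + len(phrase), label)


text = "Ada Lovelace worked with Charles Babbage in London."
tokens = ["Ada", "Lovelace", "worked", "with", "Charles", "Babbage", "in", "London", "."]

annotation = {
    # NER: labeled character spans
    "entities": [
        mark_entity(text, "Ada Lovelace", "PERSON"),
        mark_entity(text, "Charles Babbage", "PERSON"),
        mark_entity(text, "London", "LOC"),
    ],
    # POS tagging: one tag per token
    "pos": ["PROPN", "PROPN", "VERB", "ADP", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"],
    # Dependency parsing: (dependent index, head index, relation); -1 marks the root
    "dependencies": [
        (0, 2, "nsubj"), (1, 0, "flat"), (2, -1, "root"),
        (3, 4, "case"), (4, 2, "obl"), (5, 4, "flat"),
        (6, 7, "case"), (7, 2, "obl"), (8, 2, "punct"),
    ],
    # Sentiment analysis: one label for the whole text
    "sentiment": "neutral",
}

for span in annotation["entities"]:
    print(span.label, "->", text[span.start:span.end])
```

Storing entities as character offsets instead of copied strings keeps each label tied to an exact position in the source text, which matters when the same name appears more than once.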
Creating Competitive Advantage with AI
Professional annotation plays a vital role in training large foundational models, and the same holds true for small proprietary models. At the moment, the process of building best-in-class models requires a heavy human hand. This rarely discussed fact is probably worthy of deeper exploration.
Author’s note: This is not a sponsored post. I am the author of this article and it expresses my own opinions. I am not, nor is my company, receiving compensation for it. This work was created with the assistance of various generative AI models.
ABOUT SHELLY PALMER
Shelly Palmer is the Professor of Advanced Media in Residence at Syracuse University’s S.I. Newhouse School of Public Communications and CEO of The Palmer Group, a consulting practice that helps Fortune 500 companies with technology, media and marketing. He covers tech and business and is a regular commentator on CNN. He is also the creator of a popular, free online course.