In future, we’ll see fewer generic AI chatbots like ChatGPT and more specialised ones that are tailored to our needs

Training AI systems with more focused data sets can target them to a specific use…

AI technology is developing rapidly. ChatGPT has become the fastest-growing online service in history. Google and Microsoft are integrating generative AI into their products. And world leaders are excitedly embracing AI as a tool for economic growth.

As we move beyond ChatGPT and Bard, we’re likely to see AI chatbots become less generic and more specialised. AIs are limited by the data they are exposed to in order to make them better at what they do – in this case mimicking human speech and providing users with useful answers.

Training often casts the net wide, with AI systems absorbing thousands of books and web pages. But a more select, focused set of training data could make AI chatbots even more useful for people working in particular industries or living in certain areas.

The value of data
An important factor in this evolution will be the growing costs of amassing training data for advanced large language models (LLMs), the type of AI that powers ChatGPT. Companies know data is valuable: Meta and Google make billions from selling adverts targeted with user data. But the value of data is now changing. Meta and Google sell data “insights”; they invest in analytics to transform many data points into predictions about users.

Data is valuable to OpenAI – the developer of ChatGPT – in a subtly different way. Imagine a tweet: “The cat sat on the mat.” This tweet is not valuable for targeted advertisers. It says little about a user or their interests. Maybe, at a push, it could suggest interest in cat food and Dr Seuss.

But for OpenAI, which is building LLMs to produce human-like language, this tweet is valuable as an example of how human language works. A single tweet cannot teach an AI to construct sentences, but billions of tweets, blog posts, Wikipedia entries, and so on, certainly can. For instance, the advanced LLM GPT-4 was probably built using data scraped from X (formerly Twitter), Reddit, Wikipedia and beyond.

The AI revolution is changing the business model for data-rich organisations. Companies like Meta and Google have been investing in AI research and development for several years as they try to exploit their data resources.

Organisations like X and Reddit have begun to charge third parties for API access, the system used to scrape data from these websites. Data scraping costs companies like X money, as they must spend more on computing power to fulfil data queries.

Moving forward, as organisations like OpenAI look to build more powerful versions of their LLMs, they will face greater costs for getting hold of data. One solution to this problem might be synthetic data.

Going synthetic
Synthetic data is created from scratch by AI systems to train more advanced AI systems – so that they improve. It is designed to perform the same task as real training data but is generated by AI.

It’s a new idea, but it faces many problems. Good synthetic data needs to be different enough from the original data it’s based on in order to tell the model something new, while similar enough to tell it something accurate. This can be difficult to achieve. Where synthetic data is just convincing copies of real-world data, the resulting AI models may struggle with creativity, entrenching existing biases.

Another problem is the “Hapsburg AI” problem. This suggests that training AI on synthetic data will cause a decline in the effectiveness of these systems – hence the analogy using the infamous inbreeding of the Hapsburg royal family. Some studies suggest this is already happening with systems like ChatGPT.
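The feedback loop behind this worry can be sketched with a toy simulation. This is an illustration under strong assumptions, not a description of how ChatGPT or any real LLM is trained: the "model" here is just a bell curve summarised by a mean and a spread, and every name and parameter (`simulate_generations`, `sample_size`, and so on) is hypothetical. Each generation is fitted only to samples drawn from the previous generation, with no fresh real-world data added.

```python
import random
import statistics


def simulate_generations(n_generations=300, sample_size=10, seed=0):
    """Toy model of training on synthetic data (illustrative assumption only).

    Each "model" is a normal distribution. Every new generation is fitted
    solely to samples produced by the generation before it.
    Returns the spread (standard deviation) recorded at each generation.
    """
    rng = random.Random(seed)
    mean, stdev = 0.0, 1.0  # generation 0 stands in for the real-world data
    history = [stdev]
    for _ in range(n_generations):
        # The current model generates a small synthetic data set...
        samples = [rng.gauss(mean, stdev) for _ in range(sample_size)]
        # ...and the next model is fitted to that synthetic data alone.
        mean = statistics.fmean(samples)
        stdev = statistics.pstdev(samples)
        history.append(stdev)
    return history


if __name__ == "__main__":
    spreads = simulate_generations()
    for generation in (0, 50, 100, 200, 300):
        print(f"generation {generation}: spread = {spreads[generation]:.3g}")
```

In this toy setting the spread tends to shrink over successive generations, because each fit loses a little of the variety in the data it was given and nothing restores it. Real language models are far more complex, so this shows only the general shape of the concern.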

One reason ChatGPT is so good is that it uses reinforcement learning from human feedback (RLHF), where people rate its outputs in terms of accuracy. If synthetic data generated by an AI has inaccuracies, AI models trained on this data will themselves be inaccurate. So the demand for human feedback to correct these inaccuracies is likely to increase.

However, while most people would be able to say whether a sentence is grammatically accurate, fewer would be able to comment on its factual accuracy – especially when the output is technical or specialised. Inaccurate outputs on specialist topics are less likely to be caught by RLHF. If synthetic data means there are more inaccuracies to catch, the quality of general-purpose LLMs may stall or decline even as these models “learn” more.

Little language models
These problems help explain some emerging trends in AI. Google engineers have revealed that there is little preventing third parties from recreating LLMs like GPT-3 or Google’s LaMDA AI. Many organisations could build their own internal AI systems, using their own specialised data, for their own objectives. These will probably be more valuable for these organisations than ChatGPT in the long run.

Recently, the Japanese government noted that developing a Japan-centric version of ChatGPT is potentially worthwhile to their AI strategy, as ChatGPT is not sufficiently representative of Japan. The software company SAP has recently launched its AI “roadmap” to offer AI development capabilities to professional organisations. This will make it easier for companies to build their own, bespoke versions of ChatGPT.

Consultancies such as McKinsey and KPMG are exploring the training of AI models for “specific purposes”. Guides on how to create private, personal versions of ChatGPT can be readily found online. Open source systems, such as GPT4All, already exist.

As development challenges – coupled with potential regulatory hurdles – mount for generic LLMs, it is possible that the future of AI will be many specific little – rather than large – language models. Little language models might struggle if they are trained on less data than systems such as GPT-4.

But they might also have an advantage in terms of RLHF, as little language models are likely to be developed for specific purposes. Employees who have expert knowledge of their organisation and its objectives may provide much more valuable feedback to such AI systems, compared with generic feedback for a generic AI system. This may overcome the disadvantages of less data.

This article is authored by Stuart Mills, Assistant Professor of Economics, University of Leeds. It is republished from The Conversation under a Creative Commons license. Read the original article.
