Building AI products

Published on June 8, 2024 by Benedict Evans

I will fly to India on Monday for a brief trip, and so I just spent an hour struggling through a very buggy online visa application process. Once I’d finished, since I now know what’s involved, I asked ChatGPT 4o about it. Most of these points are partially or completely wrong.

This is an ‘unfair’ test. It’s a good example of a ‘bad’ way to use an LLM. These are not databases. They do not produce precise factual answers to questions, and they are probabilistic systems, not deterministic. LLMs today cannot give me a completely and precisely accurate answer to this question. The answer might be right, but you can’t guarantee that.

There is something of a trend for people (often drawing parallels with crypto and NFTs) to presume that this means these things are useless. That is a misunderstanding. Rather, a useful way to think about generative AI models is that they are extremely good at telling you what a good answer to a question like that would probably look like. There are some use-cases where ‘looks like a good answer’ is exactly what you want, and there are some where ‘roughly right’ is ‘precisely wrong’.

Indeed, pushing this a little further, one could suggest that exactly the same prompt and exactly the same output could be a good or bad result depending on why you wanted it.

Be that as it may, in this case I do need a precise answer, and ChatGPT cannot, in principle, be relied on to give me one - and indeed it gave me a wrong answer. I asked it for something it can’t do, so this is an unfair test, but it’s a relevant test. The answer is still wrong.

There are two ways to try to solve this. One is to treat it as a science problem - this is early, and the models will get better. You could say ‘RAG’ and ‘multi-agentic’ a lot. The models certainly will get better, but how much better? You could spend weeks of your life watching YouTube videos of machine learning scientists arguing about this, and learn only that they don’t really know. Really, this is a version of the ‘will LLMs produce AGI?’ argument, since a model that could answer ‘any’ question completely correctly sounds like a good definition of at least one kind of AGI to me (again, though, no-one knows).

But the other path is to treat this as a product problem. How do we build useful mass-market products around models that we should presume will be getting things ‘wrong’?

A stock reaction of AI people to examples like mine is to say “you’re holding it wrong” - I asked the wrong kind of question, and I asked it in the wrong way. I should have done a bunch of prompt engineering! But the message of the last 50 years of consumer computing is that you do not move adoption forward by making the users learn command lines - you have to move towards the users.

Early prompt engineering (WordPerfect cardboard keyboard overlays). This was not the future.

I think we could break this apart further, into two kinds of product problem.

On one hand, the product design in the screenshot is communicating certainty when the model itself is inherently uncertain. Google gives you (mostly) ten blue links, which communicates ‘it’s probably one of these’, but here we are given one ‘right’ answer. This misleads a lot of people, especially since the text generation (as distinct from the actual answer) is pretty much perfect. Indeed, this fascinating survey from Deloitte suggests that people are more likely to be misled by this apparent certainty once they’ve used these systems.

But the other half of the problem is that the product isn’t telling me what I can ask even before I get to an ‘answer’. I gave it a ‘bad’ query (one that it can’t really answer well) but nothing in the product tells me that. Instead, this is presented to me as a general purpose tool. If the product has to try to answer anything, that makes it a lot harder for the model to be right, but it also makes it a lot harder for the interface to communicate what good questions might be.

I made the slide below, for the presentation that I will give in India, to try to capture the alternatives suggested by this.

The most radical approach is the completely general purpose chatbot-as-product, the challenges of which I’ve just discussed. But there are also at least two other approaches.

The first is to contain the product to a narrow domain, so that you can create a custom UI around the input and output that communicates what it can and can’t do and what you can ask, and perhaps also focus the model itself (hence RAG). This gets us the coding assistants and marketing tools that have exploded in the last 12 months, as well as the first attempts at knowledge management tools. WPP has built an internal dashboard that lets its staff steer models toward particular brand tone of voice or target demographics. Hence “ask this tool to suggest 50 ideas for product X from brand Y for demo Z - don’t ask it if you have appendicitis.” You wrap the prompt in buttons and UI - in product.
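To make that concrete, here is a minimal sketch (in Python, and purely illustrative - the client call, the model name and the function are my own assumptions, not WPP’s actual tool) of what ‘wrapping the prompt in buttons and UI’ might look like: the user fills in a few structured fields and presses a button, and never sees or writes a prompt at all.

```python
# Illustrative sketch only: a narrow-domain tool where the prompt is assembled
# behind the scenes from structured UI inputs, rather than typed by the user.
from openai import OpenAI  # assumed dependency; any LLM client would do

client = OpenAI()

PROMPT_TEMPLATE = (
    "You are a copywriter for the brand '{brand}'. Using that brand's tone of "
    "voice, suggest {n} campaign ideas for '{product}', aimed at {demographic}."
)

def suggest_ideas(brand: str, product: str, demographic: str, n: int = 50) -> str:
    """Called from the UI's 'Suggest ideas' button - the only question this
    product can ask, which is exactly the point."""
    prompt = PROMPT_TEMPLATE.format(
        brand=brand, product=product, demographic=demographic, n=n
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The design choice is that the interface itself communicates the scope: dropdowns for brand and demographic tell you what the tool is for, in a way that an empty chat box never can.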

But the other approach is that the user never sees the prompt or the output, or knows that this is generative AI at all, and both the input and the output are abstracted away as functions inside some other thing. The model enables some capability, or it makes it quicker and easier to build that capability even if you could have done it before. This is how most of the last wave of machine learning was absorbed into software: there are new features, or features that work better or can be built faster and cheaper, but the user never knows they’re ‘AI’ - they aren’t purple and there are no clusters of little stars. Hence the old joke that AI is whatever doesn’t work yet, because once it works it’s just software.
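Again, a rough sketch of what that might look like - a hypothetical email feature that calls a model to suggest a subject line, where the calling code (and the user) never sees a prompt and has no idea that generative AI is involved. The names and the API call are illustrative assumptions, not any particular product.

```python
# Illustrative sketch of the 'invisible' pattern: the rest of the application
# just calls a feature function; the prompt, the model and the generative step
# are implementation details it never sees.
from openai import OpenAI

_client = OpenAI()

def summarise_thread(messages: list[str]) -> str:
    """Return a one-line subject for an email thread.

    To the caller this is just another utility function - it could equally be
    a heuristic or a small classifier, and nothing in the interface says 'AI'.
    """
    joined = "\n".join(messages[-20:])  # keep the context small
    response = _client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Summarise this email thread as a subject line of at "
                       "most eight words:\n" + joined,
        }],
    )
    return response.choices[0].message.content.strip()
```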

Looking at this on another axis: with any new technology, we begin by trying to make it fit the problems we already have, while the incumbents try to make it a feature (hence Google and Microsoft spraying LLMs all over their products in the last year). Then startups use it to unbundle the incumbents (to unbundle search, Oracle or Email), but meanwhile, other startups try to work out what we could build that would be truly native to the new technology. That comes in stages. First, Flickr had an iPhone app, but then Instagram used the smartphone camera, and used local computing to add filters, and further on again, Snap and TikTok used the touch screen, video and location to make something truly native to the platform. So, what native experiences do we build with this, that aren’t the chatbot itself, or where the ‘error rate’ doesn’t matter, but abstracts this new capability in some way?

This of course is proposing a paradox, that I’ve talked about before: here we have a general-purpose technology, and yet the way to deploy it is to unbundle it into single-purpose tools and experiences. However, seeing this as a paradox might just be misplacing the right level of abstraction. Electric motors are a general-purpose technology, but you don’t buy a box of electric motors from Home Depot - you buy a drill, a washing machine and a blender. A general-purpose technology is instantiated as use-cases. PCs and smartphones are general-purpose tools that replaced single-purpose tools - they replaced typewriters, calculators, voice recorders and music players - but each of those functions is achieved through a piece of single-purpose software: most people don’t use Excel as a word processor. One reason that some people are so excited about LLMs is that they might not follow that pattern: they might move up through all of those levels of abstraction to the top. That would leave no room for ‘thin GPT wrappers’. Yet I don’t think they can really do that yet, and so everything I’ve just written is really just wondering what you can build to change the world even if that never happens.