Maximize Profit, Not Intelligence

December 22, 2023

Picture this: you're at the helm of a startup, brimming with ideas, ready to leverage the marvels of AI, particularly the genius of GPT 4. You set out looking for opportunities in text classifiers: the kind of thing you might use to moderate posts on social media, route customers in a phone maze, direct customer emails to support agents, rate transcripts of call center recordings, etc. Sounds straightforward, but it's tricky. Language is complex.

You went out and did your business development magic and worked out a deal! You can build a profitable venture if you can deliver an automated classifier that labels sentences as being about Spanish immigration law or not, with at least 95% accuracy, at a price of less than $200 per million classifications.

No problem! GPT 4's function-calling feature can generate structured responses! You jump into VS Code and use GPT 4 in GitHub Copilot to craft a text classifier based on GPT 4, and it's 97% accurate when you evaluate it. AWESOME, it works!
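Here's a minimal sketch of what such a classifier might look like, assuming the openai v1 Python client and an OPENAI_API_KEY in the environment. The function schema and names are illustrative, not the exact code in the repository linked later in this post:

```python
# Sketch: a GPT 4 yes/no classifier that uses function calling to force a
# structured answer instead of free-form text.
import json
from openai import OpenAI

client = OpenAI()

TOOL = {
    "type": "function",
    "function": {
        "name": "record_classification",
        "description": "Record whether the sentence is about Spanish immigration law.",
        "parameters": {
            "type": "object",
            "properties": {"is_spanish_immigration_law": {"type": "boolean"}},
            "required": ["is_spanish_immigration_law"],
        },
    },
}

def classify(sentence: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Classify the user's sentence."},
            {"role": "user", "content": sentence},
        ],
        tools=[TOOL],
        # Force the model to call the function so we always get structured output.
        tool_choice={"type": "function", "function": {"name": "record_classification"}},
    )
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return args["is_spanish_immigration_law"]

print(classify("¿Necesito un visado para trabajar en Madrid?"))  # expect True
```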

But there's a problem: You check the costs on your evaluation run, and wow, GPT 4 tokens are expensive! It costs just over $0.40 to do 240 classifications, which means that it would cost... almost $1,700 per million classifications. The venture would not be profitable.

This is the story of 2023: We have amazing new tools, but there's a difference between a cool proof-of-concept demo and a viable business venture. Now, as 2024 unfolds, the narrative shifts from mere intelligence to profitability. Once you find a use case with business value and prove the concept, you can maximize profit by reducing your costs: find the least sophisticated model that will suit your task.

It's kind of like how you set the optimal price: It's the maximum price that the market will bear. In AI applications, the path to profitability is to use the dumbest model that the problem will bear.

Make it cheaper

One easy way to make a GPT 4 solution cheaper is to use GPT 3.5 instead. It's a really good model. And it's a great value compared to GPT 4, which is amazing but expensive.
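Continuing the earlier sketch, the swap can be as small as changing the model parameter in the chat-completions call (shown here with the openai v1 client; your code may differ):

```python
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # was "gpt-4"
    messages=messages,
    tools=[TOOL],
    tool_choice={"type": "function", "function": {"name": "record_classification"}},
)
```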

You make that simple tweak to swap the models and run your evaluation again to check the cost. It's much better! GPT 3.5 is an order of magnitude cheaper:

Cost of GPT 4 vs GPT 3.5 classifiers.

The best part here is that the cost per million will be below your break-even point of $200 per million. You're profitable! Yay!

Profit per million AI text classifications with GPT 4 versus GPT 3.5.

But is it as accurate? Yes, it's about the same accuracy, and that's enough. In the evaluation phase you measure things like precision, recall, and accuracy, and you get about the same results.

Precision and recall for GPT 4 versus GPT 3.5 LLM text classifiers.

The accuracy score is actually a little higher for GPT 3.5, but that might be related to the training data, which was generated with GPT 3.5.

Accuracy for GPT 4 versus GPT 3.5 LLM text classifiers.

Now you have a profitable venture. But your margins are razor-thin.


You want to optimize for profit, right? And in software we optimize through iteration. Let's try to cut costs.

Machine-learning text classifiers

This kind of algorithm that you're selling at a profit is called a "classifier", and there are a lot of ways to build one, including ways that don't use large language models. Let's look into making a classifier with machine learning.

One way to do it is to use a different model from OpenAI that isn't a generative text model. It's an embeddings model, for turning text into strings of numbers. There are similar embeddings models from other companies, like AWS.

You can use those numbers to classify the text in any way that you want if you have a lot of examples that you can use for finding patterns in the data. Here's how:

Step One: Crafting Numerical Fingerprints

Computers don't understand words, only numbers. To process words with computers we have to convert them to numbers. You could just assign every word a number: "cat" is 1, "dog" is 2, and so on. But there are more useful ways.

You can take a sentence and transform it into a numerical pattern, an 'embedding'. Imagine these embeddings as secret codes that capture the essence of each sentence numerically. We have tools that measure how 'similar' these embeddings are, assigning a number to represent the degree of resemblance between any two sentences. So the distance between "cat" and "dog" might be smaller than the distance between "cat" and "suspension bridge".
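For instance, here's a hedged sketch of that comparison using cosine similarity, one common resemblance measure, and OpenAI's Ada 2 embeddings model, which shows up again later in this post (any embeddings model would do):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    """Return the Ada 2 embedding for a piece of text as a numpy vector."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: closer to 1.0 means more similar."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, bridge = embed("cat"), embed("dog"), embed("suspension bridge")
print(cosine(cat, dog))     # expected: relatively high
print(cosine(cat, bridge))  # expected: lower
```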

While embeddings offer rich, condensed representations of text, they don't inherently categorize sentences. How can we build a classifier out of this?

Step Two: The Learning Machine

Enter the machine-learning model. Its job is to learn from examples: to recognize which numerical patterns correspond to our topic of interest. By examining known questions about immigration law and comparing them to unrelated sentences, the model learns a boundary that separates the two. It's akin to teaching someone to recognize a specific tune among various songs.

For that, we used logistic regression, a statistical method that's like a skilled goldsmith, discerning which nuggets of text match our sought-after criteria. It operates on a simple yet powerful principle: calculating the likelihood that a new sentence belongs to one of two categories, a binary decision of 'yes' or 'no'.

Picture logistic regression as a finely tuned scale. It weighs the numerical fingerprints of sentences, our carefully crafted vector embeddings, and decides how 'heavy' they are in terms of relevance to our categories. By adjusting this scale, logistic regression finds the precise balance point, the exact threshold of similarity, that tips the scale towards a 'yes' or a 'no'. This isn't just a guessing game; it's a calculated decision based on the distinct patterns and relationships our embeddings reveal.

It's a simple yet effective tool in our arsenal, designed to take an array of "features" as inputs and produce a binary yes/no classification. That's exactly what we need: a model that takes the string of numbers from a vector embedding and predicts whether it represents a sentence about Spanish immigration law.
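In scikit-learn, that classifier is only a few lines. Here's a minimal sketch; the file names are placeholders for however you store your precomputed embeddings and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("embeddings.npy")  # assumed: (n_sentences, embedding_dim) array of embeddings
y = np.load("labels.npy")      # assumed: 1 = about Spanish immigration law, 0 = not

# Hold out a test split so we can evaluate on sentences the model never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("held-out accuracy:", clf.score(X_test, y_test))
```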

Method

We generated a list of about 1,200 fake sentences using GPT 3.5, where each is labeled yes or no: Does the sentence represent a question about Spanish immigration law? We asked for sentences in English, Spanish and Spanglish. You can see those in the data/labeled.csv file in the GitHub repository for the experiment.

We used the Python notebook in the repository to load that data and generate an embedding for each sentence in the file using the OpenAI Ada 2 embeddings model, roughly along the lines of the sketch below.
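A sketch of that embedding step (the column names here are assumptions; the notebook's actual code may batch requests or differ in other ways):

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
df = pd.read_csv("data/labeled.csv")

def embed(sentence: str) -> list[float]:
    # One API call per sentence, using the Ada 2 embeddings model.
    result = client.embeddings.create(model="text-embedding-ada-002", input=sentence)
    return result.data[0].embedding

# Assumed column name "sentence"; adjust to match the CSV's actual header.
df["embedding"] = df["sentence"].apply(embed)
```

With an embedding for each labeled sentence, the logistic-regression classifier from the earlier sketch can be trained and evaluated. How did it perform?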

Accuracy across GPT 4, GPT 3.5 and Ada 2 text classifiers.

The accuracy improved! Amazing!

Cost reduction

Even more amazing is the order-of-magnitude cost reduction. The Ada 2 model costs only $0.0001 / 1K tokens for input and nothing for output, compared with GPT 3.5, which costs $0.0010 / 1K tokens for input and $0.0020 / 1K tokens for output. The cost has dropped so low that you can't even see it on the visualization next to the cost of GPT 4.

Cost across GPT 4, GPT 3.5 and Ada 2 text classifiers.

Wow! Let's scale that to a million classifications:

Cost per million across GPT 4, GPT 3.5 and Ada 2 text classifiers.
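As a back-of-the-envelope sanity check on those per-million numbers: the token counts below are illustrative assumptions, not measurements, and the GPT 4 rates are its late-2023 list prices rather than figures from this post.

```python
# Rough cost per million classifications, assuming ~50 prompt tokens and a few
# completion tokens per sentence (illustrative, not measured).
PROMPT_TOKENS, COMPLETION_TOKENS = 50, 3

def per_million(input_price_per_1k: float, output_price_per_1k: float) -> float:
    per_call = (PROMPT_TOKENS / 1000) * input_price_per_1k \
             + (COMPLETION_TOKENS / 1000) * output_price_per_1k
    return 1_000_000 * per_call

print(f"GPT 4:   ${per_million(0.03, 0.06):,.2f}")      # roughly $1,680
print(f"GPT 3.5: ${per_million(0.0010, 0.0020):,.2f}")  # roughly $56
print(f"Ada 2:   ${per_million(0.0001, 0.0):,.2f}")     # roughly $5
```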

And we know why that's important. An order-of-magnitude increase in profit:

Profit across GPT 4, GPT 3.5 and Ada 2 text classifiers.

Speed improvement

Another incredible thing is how fast machine-learning classifiers are compared with LLMs that generate responses one token at a time. A machine-learning classifier shifts the time required to a one-time training step and is virtually instant at inference time. GPT 3.5 and GPT 4 don't require any training since they're "pre-trained", but when the job at hand is text classification, generating text one token at a time is slow compared to a machine-learning classifier.

Time breakdown for training and inference across GPT 4, GPT 3.5 and Ada 2 text classifiers.

More!

Can we make it even cheaper? Can we eliminate the OpenAI bill entirely?

Yes, we can. OpenAI doesn't have a monopoly on vector embeddings. There are lots of ways to do it. We could use AWS, but currently the pricing of the Titan Embeddings model is the same as the price of the OpenAI Ada 2 model, so that won't help us with costs.

But there are algorithms we can run ourselves. One of them is BERT.

BERT stands for Bidirectional Encoder Representations from Transformers. It advanced the understanding of context in language models by reading text in both directions: BERT processes words in relation to all the other words in a sentence, rather than one by one in order. It's a common embeddings model, so let's try it.
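Computing BERT embeddings locally is only a few lines with the Hugging Face transformers library. Here's a hedged sketch that mean-pools the final hidden states of the English bert-base-uncased checkpoint; the repository may pool differently (e.g. the [CLS] token) or use a multilingual variant:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single sentence vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

print(embed("Do I need a visa to work in Madrid?").shape)  # torch.Size([768])
```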

Accuracy

BERT does just as well as Ada 2 at this task, despite being free. Good to know!

Accuracy across GPT 4, GPT 3.5, Ada 2 and BERT text classifiers.

Speed

Computing the embeddings using BERT took about the same amount of time as outsourcing the job to OpenAI; in fact, it took a little less.

And, like the model based on OpenAI embeddings, it computes classifications virtually instantly. All the time is in a one-time training step.

Time for training and inference across GPT 4, GPT 3.5, Ada 2 and BERT text classifiers.

Cost

This classifier completely eliminates the OpenAI bill, which was already so small with Ada 2 that you can't even see it on this visualization next to the cost of GPT 4 and GPT 3.5. But now the cost is zero.

Cost per million classifications across GPT 4, GPT 3.5, Ada 2 and BERT text classifiers.

Profit

And that increases our profit! Our key performance indicator!

Profit across GPT 4, GPT 3.5, Ada 2 and BERT text classifiers.

More!

We can't really improve the price of "free".

And we can't improve on a perfect accuracy score.

What can we improve, training time? Sure, okay, let's try Word2Vec since it's simpler than BERT. Maybe it will be faster.

Word2Vec is a group of models that produce word embeddings by capturing the context of a word in a document. It uses a shallow neural network and operates on the principle that words appearing in similar contexts have similar meanings.
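One common way to get sentence embeddings out of Word2Vec is to train a model on your own corpus with gensim and average the word vectors in each sentence. The following is an illustrative sketch, not the repository's exact settings:

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical stand-in for the labeled sentences from data/labeled.csv.
sentences = [
    "Do I need a visa to work in Madrid?",
    "What toppings go on a margherita pizza?",
    "How do I apply for Spanish residency as a student?",
]
corpus = [simple_preprocess(s) for s in sentences]

# Train a small Word2Vec model on the corpus itself (parameters are illustrative).
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)

def embed(sentence: str) -> np.ndarray:
    """Average the word vectors of the in-vocabulary tokens in the sentence."""
    tokens = [t for t in simple_preprocess(sentence) if t in w2v.wv]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

print(embed("Can my spouse join me in Spain on a work permit?").shape)  # (100,)
```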

Is it faster? Yes! It's so much faster at computing embeddings that you can't even see the blip on this visualization, next to the times that the other models require. There's a tiny bar there on the bottom right if you squint.

Time for training and inference across GPT 4, GPT 3.5, Ada 2, BERT and Word2Vec text classifiers.

Does it work, though?

Accuracy across GPT 4, GPT 3.5, Ada 2, BERT and Word2Vec text classifiers.

Oops. No, it's significantly less accurate, at 97%. That's accurate enough for our requirements, but it doesn't seem worth the drop in accuracy to save a few seconds in a one-time pre-training step. It won't significantly boost our profit, and that's what matters.

Okay, let's stop while we're ahead and go with BERT. You can't know until you try it. Now we know.

Our goal is to use the dumbest model that the problem will bear. To find it, we have to keep going until we hit a model that's just not good enough, and now we have found that point. It's just like finding the market price for something: you raise the price until the market tells you it's too high, then lower it to the point where you get sales. We can do the same thing with AI solutions to find the best value.

Try it yourself

Don't take our word for it: you can try it for yourself for free. It's amazing that you can train and run the classifiers from a free Google Colab notebook. To use it, just go there and set up two things:

  1. Use the "Download raw file" feature in GitHub to get the data file and then upload that as labeled.csv to the Colab environment.
  2. Set a secret called openai and set the value to your OpenAI API token; it's used for calling the Ada 2 API, as in the snippet below.
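A sketch of how a notebook can read that secret inside Colab and pass it to the OpenAI client (variable names here are illustrative):

```python
from google.colab import userdata
from openai import OpenAI

client = OpenAI(api_key=userdata.get("openai"))
```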

You can also clone the GitHub repository if you want to run it in SageMaker or on your own machine or whatever. It should run just about anywhere, with no special GPU power or anything.

You can also see an example of a text classifier built using function calling in this Colab notebook or on GitHub.

Evaluation metrics

We want to see how well the three different embeddings techniques work for our classifier. We used these metrics:

Accuracy: This is the simplest measure, telling us the proportion of total predictions our model got right. It's a quick way to gauge overall effectiveness.

Precision and Recall: These metrics offer a more nuanced view. Precision shows us what fraction of the sentences identified as questions about Spanish immigration law actually were. Recall, on the other hand, tells us what fraction of the actual legal questions our model successfully identified.

ROC-AUC Score: This metric helps us understand the trade-offs between true positive rate and false positive rate. The AUC (Area Under the Curve) quantifies the model's ability to distinguish between classes: a higher AUC means better discrimination.

ROC-AUC curve analysis.

Confusion Matrices: These are tables that lay out the successes and failures of our predictions in detail. They show four types of outcomes: true positives (correctly identified legal questions), false positives (non-legal questions wrongly identified as legal), true negatives (correctly identified non-legal questions), and false negatives (legal questions missed by the model). This matrix helps us pinpoint areas where our model might be overconfident or too cautious, guiding us in fine-tuning its performance.
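All of these come straight from scikit-learn. Here's a minimal sketch, given a fitted classifier `clf` and a held-out test split (names are illustrative):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, confusion_matrix)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the "yes" class

print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))  # rows: actual class, columns: predicted class
```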

Confusion matrix for BERT.

Maximize profit, not intelligence

You can do amazing things with OpenAI models, but you don't always need them. BERT is a widely available, open algorithm that's so cheap and easy to run that you can compute 1,500 sentence embeddings from a free Google Colab notebook in a minute or two. And it performs just as well as the OpenAI embeddings that you have to pay for. Word2Vec embeddings, even cheaper to compute since the process is less computationally intense, also perform really well. Not good enough for this example, but good enough for lots of things.

As we analyze these results, a broader business lesson emerges. In the AI arena, more powerful doesn't always mean more valuable. Just as BERT held its ground against mightier models in our experiment, businesses need to gauge the cost-effectiveness of their AI solutions. With just a couple of iterations we turned a non-viable proof of concept into a viable business and then improved the profit by orders of magnitude. The profit is the thing that matters, and that's what our iterations need to focus on.
