Gen AI and Measurement (2024)

2024 is the year of Taylor Swift and Gen AI/ML: it's mainstream, it's popular, and everyone is talking about it. I can't claim to be a Swiftie*, so I will stick to the latter topic.


Machine Learning models and principles are finding their way into a variety of products, from consumer-focused conversational modules (example: AI-based chat for customer service) to enterprise use cases (data classification and search/retrieval). While I see numerous articles and papers covering the fundamentals of what Gen AI or LLM models can and cannot do, I have been digging into the next obvious topic: measurement.

Investing in ML-based product solutions is typically resource intensive, yet ML-forward approaches are not business critical (yet) and are still considered a "good to have" in most cases. For example, consider this use case: developers spend time troubleshooting bugs, parsing through help center articles or code libraries. An LLM-based Gen AI module could generate the relevant information in seconds, saving developers time and leading to faster diagnostics. Is this a more efficient and relevant way to get to faster diagnostics? Yes. But is it business critical? Probably not, because there are alternative/manual ways of achieving the same outcome.

This is why any AI-based product approach needs a clear story to justify the investment in product, engineering, and computing resources:

  • Why do we need an AI approach? Are there alternate ways to achieve the same outcome? Another way to think about this: is there an actual use case or benefit, or is this a vanity/resume-building project?
  • What are the success metrics post implementation? Is it time saved? Is it directly tied to business metrics, for example revenue generated or new user acquisition?
  • Is there a strategic goal or benefit to implementing an ML solution?

Fundamentally, the quality of the Gen AI, LLM, or ML technique implemented is directly tied to the end user experience. For example: a paralegal needs to consume hundreds of pages of text and manually file documents into themes (class action lawsuits, appeals, etc.). The ML use case here is building a data classifier which would (a) parse text from uploaded documents and (b) categorize and label/theme the documents without manual intervention. The end user experience depends on how accurately the model categorizes the documents. If the model isn't able to categorize accurately, the paralegal has to validate and re-label the documents, leading to a less than ideal user experience.
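
As a rough illustration of this use case (not tied to any specific product), a minimal document classifier could be sketched with a standard text-classification pipeline. The categories and training snippets below are hypothetical placeholders; a real system would be trained on a much larger labeled corpus.

```python
# Minimal sketch of the paralegal document classifier described above.
# Categories and training snippets are hypothetical; a real deployment would
# use far more labeled data and likely a stronger model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Plaintiffs seek certification of a nationwide class of consumers ...",
    "Appellant challenges the lower court's ruling and requests review ...",
    "The parties stipulate to the following settlement terms ...",
]
train_labels = ["class action", "appeal", "settlement"]

# (a) parse/vectorize the text, (b) categorize and label without manual intervention.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(train_texts, train_labels)

new_doc = "Notice of appeal filed on behalf of the defendant ..."
predicted_label = classifier.predict([new_doc])[0]
confidence = classifier.predict_proba([new_doc]).max()

# Low-confidence predictions can be routed to the paralegal for review
# instead of being filed automatically.
print(predicted_label, round(float(confidence), 2))
```

The design point is exactly the one above: accuracy drives the experience, so low-confidence predictions should fall back to human review rather than being silently mislabeled.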

The quality of Gen AI applications can be evaluated broadly using the following parameters (a rough sketch of how these could be recorded follows the list):

a. Accuracy: Consider this input: "Who are the US presidential candidates in 2024?" The output needs to be 100% factual, which makes the response relevant and accurate. (No, Kanye West is not a factual response to this query.)

b. Anticipatory or Conversational: A good quality response will typically factor in the intent of the query and the conversation history. The response generated should read like a "human-like" conversational dialog. A less than ideal response does not factor in context from past queries.

c. Useful: Does the response provide any value to the user? For example, if a developer wants to quickly find the relevant source code for debugging, does the system accurately source and display the relevant code? Or does the developer have to parse through code libraries to find the information they are looking for?

d. Speed (Performance): I recently tested a crypto chatbot; while it was fun to read the crypto-specific responses, it quickly got a little tiring because response generation was taking > 60 seconds for each query. Speed of response is critical for establishing a good user experience.

e. Safe: Every model needs to operate within established guardrails. If a consumer-facing Gen AI product generates images based on text inputs, the images need to be not only brand safe for the business but also adhere to age/sensitive-category guidelines. Another example: any information deemed confidential should be filtered out or kept out of bounds by the model.
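
One lightweight way to operationalize these parameters is to record a per-response evaluation checklist. The sketch below is illustrative only; the field names and the latency threshold are assumptions I am making, not an industry standard.

```python
# Illustrative per-response evaluation record covering the parameters above.
# Field names and the 5-second latency threshold are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class ResponseEvaluation:
    factual: bool            # (a) Accuracy: is the response factually correct?
    uses_context: bool       # (b) Anticipatory/Conversational: did it use the conversation history?
    useful: bool             # (c) Useful: did it answer what the user actually needed?
    latency_seconds: float   # (d) Speed: time taken to generate the response
    passed_guardrails: bool  # (e) Safe: brand/age/confidentiality checks passed

    def acceptable(self, max_latency: float = 5.0) -> bool:
        """A response is acceptable only if every check passes and it arrives fast enough."""
        return (self.factual and self.uses_context and self.useful
                and self.passed_guardrails and self.latency_seconds <= max_latency)

print(ResponseEvaluation(True, True, True, 2.3, True).acceptable())  # True
```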

Human evaluation is necessary to evaluate the quality of the query as well as the response generation. What you feed in as a query is tied to the response generated, which is why it's critical to evaluate the quality of the input or user query:

  • Is the user intent or input well defined and clear? (Example: "How do I troubleshoot my device? It isn't working with error message '400 input cable'.")
  • Is there refinement needed in the user input? (Example: "My device isn't working")

Measuring the quality of response generation can be done by:

A. Evaluating the responses based on:

  • Factuality (indicated by Yes/No), based on whether the response is supported by key facts/evidence
  • Relevance (indicated by Yes relevant/Partially relevant/Not relevant)
  • Usefulness, or whether the response answers the question (indicated by Yes/Partial/No, where No means critical information is missing)

B. Using custom metrics to score, for example: quality score = weighted score based on factuality, relevancy, accuracy, and speed of response (one illustrative version is sketched below).
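
As an example of what such a custom metric could look like, here is one possible weighted quality score in Python. The weights, rating scales, and speed normalization are assumptions I am making for illustration, not an industry-standard formula.

```python
# Hypothetical weighted quality score combining human ratings and response speed.
# The weights and scales below are illustrative assumptions, not a benchmark.
def quality_score(factual: bool, relevance: str, useful: str, latency_seconds: float) -> float:
    relevance_points = {"yes": 1.0, "partial": 0.5, "no": 0.0}[relevance]
    useful_points = {"yes": 1.0, "partial": 0.5, "no": 0.0}[useful]
    # Speed component: full credit under 2 seconds, no credit beyond 10 seconds, linear in between.
    speed_points = max(0.0, min(1.0, (10.0 - latency_seconds) / 8.0))
    weights = {"factuality": 0.4, "relevance": 0.25, "usefulness": 0.25, "speed": 0.1}
    return (
        weights["factuality"] * (1.0 if factual else 0.0)
        + weights["relevance"] * relevance_points
        + weights["usefulness"] * useful_points
        + weights["speed"] * speed_points
    )

# A factual, relevant, partially useful response returned in 3 seconds scores ~0.86.
print(quality_score(factual=True, relevance="yes", useful="partial", latency_seconds=3.0))
```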

The "how to measure" piece is, IMO, the area still in flux, without any industry standards or benchmarks. There is a lot to be defined in this space, and in my mind it will come together fairly quickly, about as quickly as Taylor Swift keeps releasing double albums.

*Swiftie: loyal Taylor Swift fans who can make or break the internet


