The latest and greatest trend in technology is AI, and it's everywhere. AI is becoming increasingly relevant in everyday life, from new versions of existing technology like the best Samsung phones to new services like ChatGPT. It isn't all good news, though. There are several questions about how ethical, fair, and accurate AI models are. Stanford researchers have devised a method to evaluate AI models to provide better transparency around each model and how it performs in various situations.

Stanford Holistic Evaluation of Language Models: a brief introduction

The Stanford Holistic Evaluation of Language Models (HELM) was developed by a team of 50 researchers at Stanford University's Center for Research on Foundation Models. HELM consists of three main elements:

Broad coverage and recognition of incompleteness: Test as many scenarios as possible, but recognize that not every scenario can be tested and state which ones aren't being evaluated.

Multi-metric measurement: Evaluate AI models on several different metrics to test how well-rounded each model is.

Standardization: Test all AI models in a standardized way, under conditions that are as similar as possible.
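To make the three elements concrete, here is a minimal Python sketch of a scenario-by-metric grid that also records which cells were never tested. It is an illustration only, not the HELM codebase or its API: the scenario names, metric names, the Report class, the helper functions, and the scores are all hypothetical placeholders.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; this is not the HELM codebase or its API.
SCENARIOS = ["question_answering", "summarization", "sentiment"]
METRICS = ["accuracy", "robustness", "fairness"]


@dataclass
class Report:
    model: str
    scores: dict = field(default_factory=dict)   # (scenario, metric) -> score
    missing: list = field(default_factory=list)  # cells that were not evaluated


def build_report(model, raw_scores):
    """Fill the scenario-by-metric grid and list every cell left untested."""
    report = Report(model=model)
    for scenario in SCENARIOS:          # same scenarios for every model
        for metric in METRICS:          # several metrics, not accuracy alone
            key = (scenario, metric)
            if key in raw_scores:
                report.scores[key] = raw_scores[key]
            else:
                report.missing.append(key)  # state what is not covered
    return report


def coverage(report):
    """Fraction of the grid that was actually evaluated."""
    return len(report.scores) / (len(SCENARIOS) * len(METRICS))


# Made-up placeholder scores, for illustration only.
example = build_report(
    "model-a",
    {
        ("question_answering", "accuracy"): 0.71,
        ("summarization", "accuracy"): 0.40,
    },
)
print(f"coverage: {coverage(example):.2f}")
print("not evaluated:", example.missing)
```

Reporting the missing cells alongside the scores is the point of the sketch: a reader can see what was measured and what was left out.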


A primary value of the team behind HELM is transparency. The HELM website makes the scenarios, predictions, prompts, and code available for anyone to see and review. HELM provides a standard way to evaluate language models and test them against each other, with the goal of serving as an industry-standard benchmark. The researchers aim to keep running and refining HELM as AI models progress. The research is funded by Google, among others.

What were the initial findings of HELM?

In 2022, HELM researchers published a paper after the first iteration of the benchmark ran more than 4,900 evaluations of 30 models, totaling more than 12 billion tokens. There were five main findings from this paper:

Using human feedback to tune AI models, called instruction tuning, is an effective way to train models and helps smaller models compete with larger ones.

Closed and limited-access AI models (such as those from Microsoft and OpenAI) outperform openly available AI models (such as those from Meta).

The accuracy, robustness, and fairness of AI models are closely correlated in normal situations. Accuracy decreases when prompts are written in less grammatical English or in other dialects of English.

The way a model is prompted has a significant effect on how the model responds. This point demonstrates a need for strict standardization in evaluating these models.
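One way to picture this kind of standardization is a single prompt template shared by every model under test, so that differences in output come from the models and not from the prompt wording. The Python sketch below is a hypothetical illustration, not the procedure or code the HELM team uses: the template text, the in-context examples, and the function names are assumptions.

```python
# Hypothetical few-shot template; the wording and examples are placeholders.
IN_CONTEXT_EXAMPLES = [
    ("The movie was wonderful.", "positive"),
    ("I want my money back.", "negative"),
]


def build_prompt(text, examples=IN_CONTEXT_EXAMPLES):
    """Render the same instructions and examples for every input."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in examples:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {text}")
    lines.append("Sentiment:")
    return "\n".join(lines)


def prompts_for_models(text, model_names):
    """Give every model an identical prompt for the same input."""
    prompt = build_prompt(text)
    return {name: prompt for name in model_names}


prompts = prompts_for_models("Great phone.", ["model-a", "model-b"])
print(prompts["model-a"])
```

Because the prompt is built once and reused, changing the template changes it for all models at the same time, which keeps comparisons between them fair.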

There is a need for better summarization datasets to train AI models. Human evaluators rate the models' summaries well, but they rate the datasets used to train those models less positively. Some models are good at creating headlines for disinformation but perform poorly when asked to write messages that push someone to act on misinformation. This capability should be evaluated continually.

What's next for HELM?

A lot of work is still being done on HELM. The initial paper outlined some conclusions and next steps:

The researchers would like access to more AI models and more information about how models are trained. This will help meet their goal of transparency.

Models will get better over time. It's imperative to continue to refine the evaluation process and continually test new and refined models.

The HELM team maintains a website with the latest findings and tests, a paper that offers more detail, and a GitHub repository with the HELM code that can be used for further research.

Transparency in AI is important

AI is making its way into more facets of our lives. To make sure it's used safely, we need to constantly evaluate it and take the necessary steps to make it better for everyone. One way to do this is to create open source AI models so that users have better insight into the models they're using.