Overview
Source: https://arxiv.org/pdf/2201.08239.pdf
LaMDA: Language Models for Dialog Applications
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text.
While model scaling alone can improve quality, it yields smaller improvements in safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.
The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety.
The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible.
Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
Figure 1: Impact of model pre-training alone vs. with fine-tuning in LaMDA on dialog quality (left), and safety and factual grounding (right). The quality metric (SSI) corresponds to sensibleness, specificity, and interestingness. See Section 4 for more details on these metrics.
1 Introduction
Language model pre-training is an increasingly promising research approach in NLP [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. As pre-training uses unlabeled text, it can be combined with scaling model and dataset sizes to achieve better performance or new capabilities [13]. For example, GPT-3 [12], a 175B parameter model trained on a large corpus of unlabeled text, shows an impressive ability in few-shot learning thanks to scaling.
Dialog models [14, 15, 16], one of the most interesting applications of large language models, successfully take advantage of Transformers’ ability to represent long-term dependencies in text [17, 18]. Similar to general language models [13], Adiwardana et al. [17] show that dialog models are also well suited to model scaling. There is a strong correlation between model size and dialog quality. Inspired by these successes, we train LaMDA, a family of Transformer-based neural language models designed for dialog.
These models’ sizes range from 2B to 137B parameters, and they are pre-trained on a dataset of 1.56T words from public dialog data and other public web documents (Section 3). LaMDA makes use of a single model to perform multiple tasks: it generates potential responses, which are then filtered for safety, grounded on an external knowledge source, and re-ranked to find the highest-quality response. We study the benefits of model scaling with LaMDA on our three key metrics: quality, safety, and groundedness (Section 4).
We observe that: (a) model scaling alone improves quality, but its improvements on safety and groundedness are far behind human performance, and (b) combining scaling and fine-tuning improves LaMDA significantly on all metrics, and although the model’s performance remains below human levels in safety and groundedness, the quality gap to measured crowdworker levels can be narrowed (labeled ‘Human’ in Figure 1). The first metric, quality, is based on three components: sensibleness, specificity, and interestingness (Section 4).
We collect annotated data that describes how sensible, specific, and interesting a response is for a multiturn context. We then use these annotations to fine-tune a discriminator to re-rank candidate responses. The second metric, safety, is introduced to reduce the number of unsafe responses that the model generates. To achieve this, we define an illustrative set of safety objectives that attempt to capture the behavior that the model should exhibit in a dialog (Appendix A.1), and we use a demographically diverse set of crowdworkers to label responses in multiturn dialogs for these objectives (Appendix A.2, A.3).
We then use these labels to fine-tune a discriminator to detect and remove unsafe responses (Section 6.1). Our work on safety for LaMDA can be understood as a process for AI value alignment, at a high level. The third metric, groundedness, is introduced for the model to produce responses that are grounded in known sources wherever they contain verifiable external world information. Because neural language models such as LaMDA can generalize rather than just memorize, they tend to generate responses that may seem plausible, but actually contradict factual statements made in established sources.
We use this metric for the model to avoid this tendency. While grounding in known sources does not guarantee factual accuracy, it allows users or external systems to judge the validity of a response based on the reliability of its source and its faithful reproduction. We find that augmenting model outputs with the ability to use external tools, such as an information retrieval system, is a promising approach to achieve this goal. Therefore, we collect data from a setting where crowdworkers can use external tools to research factual claims, and train the model to mimic their behavior.
Finally, we explore the use of LaMDA in the domains of education and content recommendations to investigate its potential and shortcomings. Similar to the concept of prompts in GPT-3 [12], we precondition LaMDA on a few turns of application-specific dialog to adapt LaMDA to the target applications. We perform experiments to compare the application-specific helpfulness (i.e., useful and correct responses) and role consistency (i.e., agent utterances match agent role) of pre-training-only and fine-tuned LaMDA models subject to application-specific preconditioning.
We find that both types of models can adapt to their expected application roles fairly well, but fine-tuned LaMDA models are significantly more helpful.
2 Related work
Language models and dialog models: Language models have attracted much attention recently thanks to their successes in NLP applications (e.g., [19, 20, 21, 2, 1, 22, 23, 5, 12, 24]). Our study of scaling laws with respect to model sizes is inspired by recent work on the scaling laws of neural language models [12, 13]. Similar to their findings, our results show that model scaling improves our quality (sensibleness, specificity, and interestingness), safety and groundedness metrics to some extent. However, fine-tuning combined with scaling significantly improves performance on all metrics.
Our work is also closely related to recent successes in applying language models to dialog modeling (e.g., [25, 26, 17, 18]), which built on earlier research in neural dialog modeling (e.g., [14, 15, 16, 27, 28]). One of our fine-tuning stages requires training on dialog-only data, which is related to Wolf et al. [29], Dinan et al. [25] and Zhang et al. [30]. Our use of fine-tuning on crowdworker-annotated data to improve interestingness is comparable to Roller et al. [18]. However, we aim to maximize the interestingness of the model’s output distinctly from its ability to engage the user in further interaction.
Our finding that pure scaling has a limited effect on key measures of open-domain dialog model performance echoes that of Shuster et al. [31], who also focus on the problem of groundedness. Recent studies on scaling have found that performance on question-answering tasks improves with model size [32, 33], similar to our findings on pre-trained LaMDA prior to fine-tuning. Our approach to improving model groundedness is broadly consistent with a growing literature on augmenting neural language models with retrieval systems.
Most of the existing literature focuses on the problem of open-domain question-answering rather than dialog generation, and the models themselves are used to index and rank knowledge sources, rather than trained to use an intermediate tool. Given these differences, we note that the range of existing approaches to this problem includes the RNNLM [34], RAG [35], REALM [36], and FiD [37] architectures. Zhu et al. [38] provide a survey of further recent work. See Karpukhin et al. [39] for details on the ‘dense passage retriever’ used in RAG. Recent work in this direction has expanded and elaborated on neural models’ ability to retrieve and rank passages [40].
The RETRO architecture demonstrates that language models can be primed with results retrieved from a database as large as two trillion tokens [41]. At a broad level, our approach is also comparable to that of Byrne et al. [42], which fine-tunes the model to use external APIs for movie ticketing dialog. Parts of our findings are similar to recent studies on dialog groundedness. Granting access to external knowledge bases has been shown to reduce the rate at which models hallucinate unsourced statements in dialog across a variety of retrieval systems and model architectures [31].
Another study finds that a question-answering system’s accuracy is improved by separating it into a reasoning unit and a response generator, analogous to our separation of ‘Base’ and ‘Research’ models in our study [43]. Meanwhile, the WebGPT framework includes a language system that can interact with the open web via a text-only interface, and learns to imitate humans in answering questions by citing external sources [44]. Komeili et al. [45] compare different types of pre-trained models and retrieval methods, and reach a similar conclusion that augmenting language models with a search engine provides more factually grounded responses.
They encode the input context with grounded information from search to generate the next response, while we augment the generated responses with information from known sources in our method. This allows us to fine-tune the model for groundedness without sacrificing gains in safety or quality from other fine-tuning treatments.
Dialog metrics: Defining effective metrics for dialog models remains an open research topic. Our approach is inspired by Adiwardana et al. [17], who argued for human-like metrics, such as sensibleness and specificity. Many automated metrics for dialog models have been studied, including perplexity [16, 17], F1, Hits@1/N [25], USR [46], or BLEU/ROUGE [47, 15, 27].
However, such automated metrics may not correlate well with human judgment [48]. More reliable metrics for dialog modeling require human evaluation [49, 50, 18, 25, 17, 51], as used in this paper. Earlier research attempted to combine multifaceted evaluations of dialog quality into a single headline metric [52]. We follow the pattern established in Adiwardana et al. [17] and Roller et al. [18] by considering the different components of our evaluations separately. In addition to sensibleness and specificity per Adiwardana et al. [17], we add new metrics: interestingness, safety, and groundedness.
An advantage of using several different metrics is their debuggability: by exploring responses with low safety or groundedness scores, we have been able to develop targeted methods to improve them.
Safety and safety of dialog models: Inappropriate and unsafe risks and behaviors of language models have been extensively discussed and studied in previous works (e.g., [53, 54]). Issues encountered include toxicity (e.g., [55, 56, 57]), bias (e.g., [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]), and inappropriately revealing personally identifying information (PII) from training data [73].
Weidinger et al. [54] identify 21 risks associated with large-scale language models and discuss the points of origin for these risks. While many mitigation strategies have also been suggested (e.g., [74, 75, 76, 77, 78, 79, 80, 81, 82]), meaningfully addressing these issues remains an active research area. Similar issues have also been discussed specifically for dialog models [53]. For instance, examples of bias, offensiveness, and hate speech have been found both in training data drawn from social media, and consequently in the output of dialog models trained on such data [83]. Dialog models [84] can learn, and even amplify, biases in the training data.
Echoing Gehman et al. [85], we find fine-tuning effective to augment language models for safety. The method we use in this paper follows previous attempts to tackle these issues by training separate layers to detect unsafe output [17, 86, 18, 79]. Our strategy is similar to recent work that also uses fine-tuning [87]. While their safety guidelines were derived from human rights principles, they similarly find that increasing scale has no impact on toxicity metrics, while fine-tuning on safety evaluations does.
Groundedness metrics: Similar to other recent research into groundedness cited above, we assess groundedness by asking crowdworkers to judge whether the model’s output is in accordance with authoritative external sources.
The recently-proposed Attributable to Identified Sources (AIS) framework [88] articulates a more precise approach to assess output of language models that pertains to the external world. It splits evaluation into two stages, where crowdworkers are asked: (1) if they can understand and identify the information shared in a dialog turn, and (2) if all of this information can be attributed to a source. Meanwhile, a recent study has reopened the question of automatic evaluation, with the Q2 metric showing performance comparable to human annotation [89].
3 LaMDA pre-training
LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data alone [17, 18], we pre-trained LaMDA on a dataset created from public dialog data and other public web documents. Therefore, LaMDA can be used as a general language model prior to fine-tuning. The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words (Appendix E). We used the SentencePiece library [90] to tokenize the dataset into 2.81T byte pair encoding (BPE) tokens [91], with a vocabulary of 32K tokens.
For comparison, the total number of words in the training set for Meena [17] was 40B words, which is nearly 40x smaller. The largest LaMDA model has 137B non-embedding parameters, which is ~50x more parameters than Meena [17]. We use a decoder-only Transformer [92] language model as the model architecture for LaMDA. The Transformer has 64 layers, d_model = 8192, d_ff = 65536, h = 128, d_k = d_v = 128, relative attention as described in T5 [11], and gated-GELU activation as described in Raffel et al. [93]. We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch.
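As a rough sanity check, the quoted hyperparameters can be turned into an approximate non-embedding parameter count. The sketch below is only an estimate under assumptions that the text does not state: no bias terms, no layer-norm or relative-attention parameters, and a gated feed-forward block with three weight matrices (gate, up and down projections). The function names are illustrative.

```python
# Rough non-embedding parameter count from the hyperparameters quoted above.
# Assumptions (not from the text): no biases, no layer-norm or relative-attention
# parameters, and a gated feed-forward block with three weight matrices.

N_LAYERS = 64
D_MODEL = 8192
D_FF = 65536
N_HEADS = 128
D_HEAD = 128  # d_k = d_v


def attention_params(d_model: int, n_heads: int, d_head: int) -> int:
    """Query, key, value and output projections, each d_model x (n_heads * d_head)."""
    return 4 * d_model * n_heads * d_head


def gated_ffn_params(d_model: int, d_ff: int) -> int:
    """Gate and up projections (d_model x d_ff) plus the down projection (d_ff x d_model)."""
    return 3 * d_model * d_ff


def non_embedding_params() -> int:
    """Sum the per-layer weights over all layers."""
    per_layer = attention_params(D_MODEL, N_HEADS, D_HEAD) + gated_ffn_params(D_MODEL, D_FF)
    return N_LAYERS * per_layer


if __name__ == "__main__":
    total = non_embedding_params()
    print(f"{total:,} parameters (~{total / 1e9:.1f}B)")
```

Under these assumptions the estimate comes to about 137.4B, which is in line with the 137B non-embedding parameters reported for the largest model.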
We used the Lingvo framework [94] for training and achieved 123 TFLOPS/sec with 56.5% FLOPS utilization with the 2D sharding algorithm, as described in GSPMD [95] (see Section 10 for carbon footprint estimates). We also trained smaller 2B-parameter and 8B-parameter models to measure the effects of model scaling on our metrics. Hyperparameter details for the models of different sizes can be found in Table 27, Appendix D. Figure 2 gives an overview of the pre-training stage. We call the model before any fine-tuning "PT", for PreTrained.
PT uses the same sample-and-rank strategy as Meena [17] for decoding. We first sample 16 independent candidate responses using top-k (k = 40) sampling (no temperature). The final output is the highest-scoring candidate, where the score is based on the candidate’s log-likelihood and its length.
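The sample-and-rank procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: `logits_fn` is a hypothetical stand-in for the language model, `toy_logits` is a made-up distribution used only to make the example runnable, and the score is assumed to be the log-likelihood divided by the candidate's length in tokens, since the text only says the score depends on log-likelihood and length.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: given the tokens generated so far,
# return one unnormalized logit per vocabulary id.
LogitsFn = Callable[[Sequence[int]], List[float]]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities over the full vocabulary."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sample_top_k(log_probs: Sequence[float], k: int, rng: random.Random) -> int:
    """Sample a token id from the k most likely tokens (no temperature)."""
    top = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)[:k]
    weights = [math.exp(log_probs[i]) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def sample_candidate(logits_fn: LogitsFn, eos_id: int, k: int, max_len: int,
                     rng: random.Random) -> Tuple[List[int], float]:
    """Generate one candidate; return its tokens and total log-likelihood."""
    tokens: List[int] = []
    total_log_prob = 0.0
    for _ in range(max_len):
        log_probs = log_softmax(logits_fn(tokens))
        token = sample_top_k(log_probs, k, rng)
        tokens.append(token)
        total_log_prob += log_probs[token]
        if token == eos_id:
            break
    return tokens, total_log_prob


def sample_and_rank(logits_fn: LogitsFn, eos_id: int, num_candidates: int = 16,
                    k: int = 40, max_len: int = 32, seed: int = 0) -> Tuple[List[int], float]:
    """Sample independent candidates and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_candidate(logits_fn, eos_id, k, max_len, rng)
                  for _ in range(num_candidates)]
    # Assumed score: log-likelihood normalized by length in tokens.
    return max(candidates, key=lambda c: c[1] / len(c[0]))


def toy_logits(prefix: Sequence[int]) -> List[float]:
    """Made-up 50-token model; token 0 (end of sequence) grows likelier over time."""
    return [0.5 * len(prefix) if i == 0 else -0.1 * i for i in range(50)]


if __name__ == "__main__":
    best_tokens, best_log_prob = sample_and_rank(toy_logits, eos_id=0)
    print(best_tokens, round(best_log_prob, 3))
```

Dividing by length keeps the ranking from simply favoring the shortest candidate, which a raw log-likelihood would tend to do.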
4 Metrics
Evaluating generative models in general, and open-ended dialog models in particular, is difficult. See the Related Work section for a general review of recent work in this area. In this section, we describe the metrics that we use for evaluation.
4.1 Foundation metrics: Quality, Safety and Groundedness
Sensibleness, Specificity, Interestingness (SSI): Our overall quality score is an average of sensibleness, specificity, and interestingness (SSI).
Adiwardana et al. [17] propose the sensibleness and specificity average (SSA) metric to measure the quality of Meena. This metric is a simple average of two scores: sensibleness and specificity. The first score, sensibleness, measures whether a model’s responses make sense in context and do not contradict anything that was said earlier. Humans tend to take this basic aspect of communication for granted, but generative models often struggle to meet this requirement. However, if sensibleness alone is used to evaluate models, we could inadvertently reward models for playing it safe by always producing short, generic, and boring responses.
The GenericBot algorithm [17], which answers every question with “I don’t know” and every statement with “Ok,” scores 70% on sensibleness, which even surpasses some large dialog models [17]. The second score, specificity, is used to measure whether a response is specific to a given context. For example, if a user says “I love Eurovision” and the model responds “Me too,” then it would score 0 on specificity, since this response could be used in many different contexts. If it answers “Me too. I love Eurovision songs,” then it would score 1. Adiwardana et al. [17] report that Meena narrows the gap to average human performance in the SSA metric.
As the model’s performance increases, however, we find that sensibleness and specificity are not sufficient to measure the quality of a dialog model. For example, a response to “How do I throw a ball?” could be “You can throw a ball by first picking it up and then throwing it”, which makes sense and is specific to the question. An alternative deeper and more satisfying answer could be “One way to toss a ball is to hold it firmly in both hands and then swing your arm down and up again, extending your elbow and then releasing the ball upwards.” We attempt to translate this intuition into the third score, an observable quality which we call “Interestingness”.
Similar to sensibleness and specificity, interestingness is measured as a 0/1 label by crowdworkers. We ask crowdworkers to label a response as interesting if they judge that it is likely to “catch someone’s attention” or “arouse their curiosity”, or if it is unexpected, witty, or insightful. (For the complete instructions given to crowdworkers, see Appendix B).
Safety: A dialog model can achieve high quality (SSI) scores but can be unsafe for users. Therefore, we devise a new safety metric to measure unsafe model output. This metric follows objectives derived from Google’s AI Principles, to avoid unintended results that create risks of harm, and to avoid creating or reinforcing unfair bias. These safety objectives are described in detail in Appendix A.1.
Groundedness: We aim to ensure that LaMDA produces responses that can be associated with known sources whenever possible, enabling cross-checking if desired, because the current generation of language models tends to produce plausible but incorrect statements. We define groundedness as the percentage of responses containing claims about the external world that can be supported by authoritative external sources, as a share of all those containing claims about the external world. We also define ‘Informativeness’ as the percentage of responses that carry information about the external world that can be supported by known sources as a share of all responses.
Informativeness only differs from groundedness in the denominator term. So responses like “That’s a great idea” that do not carry any external world information do not affect groundedness, but they do affect Informativeness. However, “Rafael Nadal is the winner of Roland Garros 2020” is an example of a grounded response. Finally, we define ‘Citation accuracy’ as the percentage of model responses that cite the URLs of their sources as a share of all responses with explicit claims about the external world, excluding claims with well-known facts (such as “horses have four legs”).
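The three definitions above differ mainly in their denominators, which a small sketch makes concrete. The `ResponseLabels` record and its field names are hypothetical, standing in for whatever per-response labels the crowdworker consensus produces; returning NaN for an empty denominator is likewise an assumption made for this example.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ResponseLabels:
    """Hypothetical consensus labels for one model response."""
    has_external_claim: bool       # makes a claim about the external world
    supported: bool = False        # the claim is supported by a known source
    well_known_only: bool = False  # the claim is only a well-known fact
    cites_url: bool = False        # the response cites the URL of its source


def _share(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is empty."""
    return numerator / denominator if denominator else math.nan


def groundedness(responses: Sequence[ResponseLabels]) -> float:
    """Supported responses as a share of responses with external-world claims."""
    with_claims = [r for r in responses if r.has_external_claim]
    return _share(sum(r.supported for r in with_claims), len(with_claims))


def informativeness(responses: Sequence[ResponseLabels]) -> float:
    """Supported responses as a share of all responses."""
    supported = sum(r.has_external_claim and r.supported for r in responses)
    return _share(supported, len(responses))


def citation_accuracy(responses: Sequence[ResponseLabels]) -> float:
    """Responses citing a URL as a share of those with claims that are not well-known facts."""
    eligible = [r for r in responses if r.has_external_claim and not r.well_known_only]
    return _share(sum(r.cites_url for r in eligible), len(eligible))


if __name__ == "__main__":
    batch = [
        ResponseLabels(True, supported=True, cites_url=True),        # sourced claim
        ResponseLabels(True, supported=False),                       # unsupported claim
        ResponseLabels(False),                                       # "That's a great idea"
        ResponseLabels(True, supported=True, well_known_only=True),  # "horses have four legs"
    ]
    print(groundedness(batch), informativeness(batch), citation_accuracy(batch))
```

In this toy batch the response with no external claim lowers informativeness but leaves groundedness unchanged, matching the distinction drawn in the text.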
4.2 Role-specific metrics: Helpfulness and Role consistency
The foundation metrics (quality, safety, and groundedness) measure attributes that we find important for dialog agents in general. However, they are not dependent on any application-specific role that an agent may be designed for (e.g., teaching information about animals). We measure Helpfulness and Role consistency in dialog applications, where agents have specific roles.
Helpfulness: The model’s responses are marked helpful if they contain correct information based on the user’s independent research with an information retrieval system, and the user considers them helpful. Helpful responses are a subset of informative ones, which are judged by the user to be both correct and useful.
Role consistency: The model’s responses are marked role consistent if they look like something an agent performing the target role would say. This is distinct from consistency with previous responses that the agent made in the dialog, and self-consistency within a dialog is measured by the sensibleness metric instead. Role consistency refers to consistency with the definition of the agent’s role external to the conversation. These role-specific metrics are discussed further in Section 8.
5 LaMDA fine-tuning and evaluation data
Quality (Sensibleness, Specificity, Interestingness): To improve quality (SSI), we collect 6400 dialogs with 121K turns by asking crowdworkers to interact with a LaMDA instance about any topic. These dialogs are required to last 14 to 30 turns. For each response, we ask other crowdworkers to rate whether the response given the context is sensible, specific, and/or interesting, and to mark each with ‘yes’, ‘no’, or ‘maybe’ labels. If a response is not sensible (the crowdworker did not mark it with ‘yes’), then we do not collect the labels for specificity and interestingness, and consider them to be ‘no’.
Furthermore, if a response is not specific (the crowdworker did not mark it with ‘yes’), then we do not collect the label for interestingness, and consider it to be ‘no’. This ensures that responses are not rated positively for specificity if they are not sensible, and similarly, that responses are not rated positively for interestingness if they are not specific. Every response is labeled by 5 different crowdworkers and the response is considered sensible, specific or interesting if at least 3 out of 5 crowdworkers mark it ‘yes’.
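The labeling rules above (the sensible, specific, interesting cascade within one crowdworker's rating, then a 3-out-of-5 vote per dimension) can be sketched as below. The dictionary layout and function names are assumptions made for illustration; only the rules themselves come from the text.

```python
from typing import Dict, Sequence

# Order matters: a dimension is only collected if every earlier one was 'yes'.
DIMENSIONS = ("sensible", "specific", "interesting")


def apply_cascade(rating: Dict[str, str]) -> Dict[str, str]:
    """Fill in one crowdworker's rating so that uncollected labels count as 'no'."""
    cascaded: Dict[str, str] = {}
    blocked = False
    for dim in DIMENSIONS:
        cascaded[dim] = "no" if blocked else rating.get(dim, "no")
        if cascaded[dim] != "yes":
            blocked = True  # later dimensions are not collected
    return cascaded


def aggregate(ratings: Sequence[Dict[str, str]], min_yes: int = 3) -> Dict[str, bool]:
    """Label a response per dimension if at least min_yes raters marked it 'yes'."""
    cascaded = [apply_cascade(r) for r in ratings]
    return {dim: sum(r[dim] == "yes" for r in cascaded) >= min_yes for dim in DIMENSIONS}


if __name__ == "__main__":
    five_ratings = [
        {"sensible": "yes", "specific": "yes", "interesting": "yes"},
        {"sensible": "yes", "specific": "yes", "interesting": "no"},
        {"sensible": "yes", "specific": "yes", "interesting": "maybe"},
        {"sensible": "yes", "specific": "no"},
        {"sensible": "maybe"},
    ]
    print(aggregate(five_ratings))
```

Because of the cascade, a response can never be counted as interesting by a rater who did not also find it sensible and specific.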
We evaluate the models based on the model’s generated responses to the Mini-Turing Benchmark (MTB) dataset [17], which consists of 1477 dialogs with up to 3 dialog turns. The MTB includes 315 single-turn dialogs, 500 2-turn dialogs, and 662 3-turn dialogs. These dialogs are fed to the model to generate the next response. Similar to above, every response is labeled sensible, specific or interesting if at least 3 out of 5 crowdworkers mark it ‘yes’.
Safety: For safety fine-tuning, we employ a structured approach that begins with defining the safety objectives (Appendix A.1).
These objectives are used to annotate candidate responses generated by a LaMDA instance in response to human-generated prompts (Appendix A.2), using a demographically diverse set of crowdworkers (Appendix A.3). Similar to SSI, we collect 8K dialogs with 48K turns by asking crowdworkers to interact with a LaMDA instance about any topic. These dialogs are required to last 5 to 10 turns. We instruct crowdworkers to interact with the model in three different ways: (a) interactions of natural form, (b) interactions that touch sensitive topics, and (c) interactions that adversarially attempt to break the model as per the safety objectives.
For each response, we ask other crowdworkers to rate whether the response given the context violates any of the safety objectives, and to mark them with ‘yes’, ‘no’, or ‘maybe’ labels. Every response is assigned a safety score of 1 if at least 2 out of 3 crowdworkers mark the response with ‘no’ for each individual safety objective. Otherwise, it is assigned a score of 0. We evaluate safety using an evaluation dataset that is a holdout sample of the adversarially collected dataset described above. This dataset consists of 1166 dialogs with 1458 turns. These dialogs are input to the model to generate the next response.
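The safety scoring rule can be written out as a short sketch. The objective names used here are placeholders, since the actual objectives are listed in Appendix A.1 and not reproduced in this section, and the input layout (one dictionary per crowdworker) is an assumption.

```python
from typing import Dict, Sequence


def safety_score(ratings: Sequence[Dict[str, str]], objectives: Sequence[str],
                 min_no: int = 2) -> int:
    """Return 1 if, for every objective, at least min_no raters answered 'no'
    to the question of whether the response violates it; otherwise return 0.

    ratings holds one dict per crowdworker mapping objective name to
    'yes', 'no' or 'maybe'. A missing answer does not count as 'no'.
    """
    for objective in objectives:
        no_votes = sum(r.get(objective) == "no" for r in ratings)
        if no_votes < min_no:
            return 0
    return 1


if __name__ == "__main__":
    # Placeholder objective names, not the ones from Appendix A.1.
    objectives = ["objective_a", "objective_b"]
    three_ratings = [
        {"objective_a": "no", "objective_b": "no"},
        {"objective_a": "no", "objective_b": "maybe"},
        {"objective_a": "yes", "objective_b": "no"},
    ]
    print(safety_score(three_ratings, objectives))
```

Note that 'maybe' is treated like 'yes' for scoring purposes: a response is only scored safe when a majority of raters positively clears it on each objective.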
Similar to above, every response is scored 1 if at least 2 out of 3 crowdworkers mark each safety objective ‘no’ and 0 otherwise.
Groundedness: Similar to SSI and safety, we collect 4K dialogs with 40K turns by asking crowdworkers to interact with the model. This time, we request that they try to steer the conversation towards information-seeking interactions. We ask crowdworkers to rate each of the model’s dialog turns, evaluating whether the information in the turn makes any claims about the external world. We exclude claims about publicly unrecognized people, as the model can make factual claims on behalf of an improvised persona.
Such claims do not require grounding on external sources (e.g., “I baked three cakes last week”), unlike claims about historical people (e.g., “Julius Caesar was born in 100 BC”). We also ask crowdworkers whether they know the claims to be true. If 3 different crowdworkers all know a claim to be true, then we assume it to be common knowledge and do not check external knowledge sources before making this claim.
For utterances containing claims that need to be checked, we ask crowdworkers to record the search queries that they would use to investigate them. Finally, we ask crowdworkers to edit the model’s response to incorporate brief search results from an external knowledge-retrieval system. If the search results include any content from the open web, we ask crowdworkers to include URLs that appropriately cite the sources of the knowledge used in the final response. We evaluate groundedness using an evaluation dataset with 784 turns of dialogs from Dinan et al. [96] that encompass a variety of topics.
These contexts are fed to the model to generate the next response. For each response, we ask crowdworkers to rate whether the model’s response contains any factual claims, and if so, to rate whether these factual claims can be verified by checking a known source. Every response is labeled by 3 different crowdworkers. The final groundedness, informativeness, and citation accuracy labels of a given response are determined by majority voting.
Estimating these metrics for human-generated responses: We ask crowdworkers to respond to randomly selected samples of the evaluation datasets (labeled as ‘Human’ in Figures 1, 4 and 5).
The crowdworkers are explicitly informed to reply in a safe, sensible, specific, interesting, grounded, and informative manner. They are also explicitly asked to use any external tools necessary to generate these responses (e.g., including an information retrieval system). The context-response pairs are then sent for evaluation, and a consensus label is formed by majority voting, just as for model generated responses.