LG – Machine Learning  CV – Computer Vision  CL – Computation and Language

1、[LG] Learning Performance-Improving Code Edits
2、[CL] Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models
3、[CL] How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval
4、[LG] The Geometry of Neural Nets’ Parameter Spaces Under Reparametrization
5、[CL] Big Little Transformer Decoder
[CL] Augmented Language Models: a Survey
[CL] SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains
[LG] CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
[LG] Transformer models: an introduction and catalog

Summary: learning performance-improving code edits; adding instructions during pretraining to control language model toxicity; generalizable dense retrieval via diverse augmentation; the geometry of neural nets' parameter spaces under reparametrization; a big-little Transformer decoder; a survey of augmented language models; learning domain-specific gated soft prompts for classification in low-resource domains; evaluating code generation with pretrained models of code; an introduction to and catalog of Transformer models.

1、[LG] Learning Performance-Improving Code Edits

A Madaan, A Shypula, U Alon, M Hashemi, P Ranganathan, Y Yang, G Neubig, A Yazdanbakhsh
[CMU & University of Pennsylvania & Google]

Learning Performance-Improving Code Edits

Key points:

  1. The waning of Moore's Law has shifted the tech industry's attention to other methods of improving performance;
  2. Large language models can suggest functionally correct, performance-improving code edits in ways that would be impractical for static analysis alone;
  3. The curated Performance-Improving Edits (PIE) dataset enables evaluating and improving the ability of large language models to produce performance-improving edits, even for C++ programs compiled with the O3 optimization level;
  4. The results demonstrate the potential of large language models for automated code optimization and improved algorithmic efficiency without additional time investment from developers.

One-sentence summary:
Large language models can suggest performance-improving code edits in a practical, functionally correct way, opening new opportunities for programmers to write efficient code.

The waning of Moore’s Law has shifted the focus of the tech industry towards alternative methods for continued performance gains. While optimizing compilers are a standard tool to help increase program efficiency, programmers continue to shoulder much responsibility in crafting and refactoring code with better performance characteristics. In this paper, we investigate the ability of large language models (LLMs) to suggest functionally correct, performance improving code edits. We hypothesize that language models can suggest such edits in ways that would be impractical for static analysis alone. We investigate these questions by curating a large-scale dataset of Performance-Improving Edits, PIE. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program’s performance. We use PIE to evaluate and improve the capacity of large language models. Specifically, we use examples from PIE to fine-tune multiple variants of CODEGEN, a billion-scale Transformer-decoder model. Additionally, we use examples from PIE to prompt OpenAI’s CODEX using few-shot prompting. By leveraging PIE, we find that both CODEX and CODEGEN can generate performance-improving edits, with speedups of more than 2.5x for over 25% of the programs, for C++ and Python, even after the C++ programs were compiled using the O3 optimization level. Crucially, we show that PIE allows CODEGEN, an open-sourced and 10x smaller model than CODEX, to match the performance of CODEX on this challenging task. Overall, this work opens new doors for creating systems and methods that can help programmers write efficient code.

https://arxiv.org/abs/2302.07867
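
A minimal sketch of how few-shot prompting for performance-improving edits might be assembled, in the spirit of the abstract; the prompt wording and the (slow, fast) example pair below are illustrative assumptions, not the paper's actual prompts:

    # Illustrative few-shot prompt construction for performance-improving edits.
    # The example pair and prompt wording are assumptions for demonstration;
    # PIE's real prompts and trajectories may differ.

    FEW_SHOT_PAIRS = [
        (
            "def total(xs):\n    s = 0\n    for x in xs:\n        s = s + x\n    return s",
            "def total(xs):\n    return sum(xs)",
        ),
    ]

    def build_pie_prompt(slow_program: str, pairs=FEW_SHOT_PAIRS) -> str:
        """Assemble a few-shot prompt asking a code LM for a faster version."""
        parts = []
        for slow, fast in pairs:
            parts.append(f"# Slower version:\n{slow}\n# Optimized version:\n{fast}\n")
        parts.append(f"# Slower version:\n{slow_program}\n# Optimized version:\n")
        return "\n".join(parts)

    if __name__ == "__main__":
        slow = "def squares(n):\n    out = []\n    for i in range(n):\n        out.append(i * i)\n    return out"
        print(build_pie_prompt(slow))  # text to send to a CODEX/CODEGEN-style model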

2、[CL] Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models

S Prabhumoye, M Patwary, M Shoeybi, B Catanzaro
[NVIDIA]

Adding Instructions during Pretraining: An Effective Way of Controlling Toxicity in Language Models

Key points:

  1. Pretrained language models generate toxic content, which makes them hard to deploy in real-world applications;
  2. Two novel pretraining data augmentation strategies (MEDA and INST) are proposed to reduce model toxicity without compromising utility;
  3. INST performs best, reducing toxicity probability by up to 61% while preserving accuracy on five benchmark NLP tasks and improving AUC scores on four bias-detection tasks by 1.3%;
  4. Adding relevant information to pretraining data through instructions can be applied more broadly, opening new directions for future work.

One-sentence summary:
Adding instructions during pretraining effectively reduces the toxicity of language models, making them safer to deploy in real-world applications.

Pretrained large language models have become indispensable for solving various natural language processing (NLP) tasks. However, safely deploying them in real world applications is challenging because they generate toxic content. To address this challenge, we propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility. Our two strategies are: (1) MEDA: adds raw toxicity score as meta-data to the pretraining samples, and (2) INST: adds instructions to those samples indicating their toxicity. Our results indicate that our best performing strategy (INST) substantially reduces the toxicity probability up to 61% while preserving the accuracy on five benchmark NLP tasks as well as improving AUC scores on four bias detection tasks by 1.3%. We also demonstrate the generalizability of our techniques by scaling the number of training samples and the number of model parameters.

https://arxiv.org/abs/2302.07388
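
A minimal sketch of what the two augmentation strategies could look like at the data-preprocessing level, based only on the abstract's description; the exact metadata and instruction templates are assumptions:

    # Illustrative preprocessing for the two strategies described in the abstract.
    # Template wording is an assumption; the paper's exact formats may differ.

    def meda_augment(text: str, toxicity_score: float) -> str:
        """MEDA: prepend the raw toxicity score as meta-data to a pretraining sample."""
        return f"toxicity: {toxicity_score:.2f}\n{text}"

    def inst_augment(text: str, toxicity_score: float, threshold: float = 0.5) -> str:
        """INST: prepend an instruction indicating whether the sample is toxic."""
        label = "toxic" if toxicity_score >= threshold else "non-toxic"
        return f"The following text is {label}.\n{text}"

    sample = "Some web-crawled pretraining document ..."
    print(meda_augment(sample, 0.07))
    print(inst_augment(sample, 0.07))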

3、[CL] How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval

S Lin, A Asai, M Li, B Oguz, J Lin, Y Mehdad, W Yih, X Chen
[Meta AI & University of Waterloo & University of Washington]

How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval

Key points:

  1. Training a diverse and generalizable dense retriever requires a unified data augmentation framework;
  2. A cross-encoder may not be the most efficient teacher, and human-like queries are not necessarily the best training data for dense retrieval;
  3. Query augmentation that mixes sentence cropping with generative queries, together with progressive relevance-label augmentation from multiple teachers, yields a generalizable dense retriever;
  4. DRAGON, a BERT-base-sized dense retriever, outperforms existing models on both supervised and zero-shot retrieval tasks.

One-sentence summary:
DRAGON, a BERT-base-sized dense retriever trained with diverse augmentation, achieves state-of-the-art effectiveness in both supervised and zero-shot retrieval.

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs, under the framework of Data Augmentation (DA). Our study shows that common DA practices such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).

https://arxiv.org/abs/2302.07452
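
A minimal sketch of the sentence-cropping side of the query augmentation mentioned in point 3; the cropping heuristic and span lengths below are assumptions, and the paper combines cropped queries with generative pseudo-queries and multiple teachers:

    # Illustrative sentence-cropping query augmentation: sample short spans from
    # a passage to serve as pseudo-queries for contrastive retriever training.
    import random
    import re

    def crop_queries(passage: str, num_queries: int = 3, min_words: int = 5, max_words: int = 12):
        sentences = [s.strip() for s in re.split(r"[.!?]", passage) if s.strip()]
        queries = []
        for _ in range(num_queries):
            words = random.choice(sentences).split()
            length = min(len(words), random.randint(min_words, max_words))
            start = random.randint(0, max(0, len(words) - length))
            queries.append(" ".join(words[start:start + length]))
        return queries

    passage = ("Dense retrieval maps queries and passages into a shared vector space. "
               "Contrastive learning pulls a query toward its relevant passage and pushes it away from negatives.")
    print(crop_queries(passage))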

4、[LG] The Geometry of Neural Nets’ Parameter Spaces Under Reparametrization

A Kristiadi, F Dangel, P Hennig
[Vector Institute & University of Tubingen]

The Geometry of Neural Nets' Parameter Spaces Under Reparametrization

Key points:

  1. Acknowledging the Riemannian metric and using the correct transformation rules guarantees invariance and equivariance under reparametrization;
  2. The trace, determinant, and eigenvalues of the Hessian are invariant once the metric is accounted for, allowing Hessian-based flatness measures to be used to study generalization without pathologies;
  3. The infinitesimal optimization trajectory of any preconditioned gradient descent is invariant;
  4. The modes of probability density functions are invariant.

One-sentence summary:
Reparametrizing a neural network appears to introduce inconsistencies, but acknowledging the Riemannian metric and using the correct transformation rules guarantees invariance and equivariance.

Model reparametrization — transforming the parameter space via a bijective differentiable map — is a popular way to improve the training of neural networks. But reparametrizations have also been problematic since they induce inconsistencies in, e.g., Hessian-based flatness measures, optimization trajectories, and modes of probability density functions. This complicates downstream analyses, e.g. one cannot make a definitive statement about the connection between flatness and generalization. In this work, we study the invariance quantities of neural nets under reparametrization from the perspective of Riemannian geometry. We show that this notion of invariance is an inherent property of any neural net, as long as one acknowledges the assumptions about the metric that is always present, albeit often implicitly, and uses the correct transformation rules under reparametrization. We present discussions on measuring the flatness of minima, in optimization, and in probability-density maximization, along with applications in studying the biases of optimizers and in Bayesian inference.

https://arxiv.org/abs/2302.07384
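
For concreteness, the transformation rule behind these invariance statements can be written compactly; the notation below is standard Riemannian geometry and mine, not necessarily the paper's:

    Under a reparametrization $\psi = \varphi(\theta)$ with Jacobian $J = \partial\theta/\partial\psi$,
    a metric $G$ on parameter space transforms as
        $\tilde{G}(\psi) = J^\top G(\theta)\, J$,
    and at a minimum the Hessian of the loss transforms the same way,
        $\tilde{H}(\psi) = J^\top H(\theta)\, J$,
    so the metric-aware flatness matrix changes only by a similarity transform,
        $\tilde{G}(\psi)^{-1}\tilde{H}(\psi) = J^{-1}\,\bigl(G(\theta)^{-1}H(\theta)\bigr)\,J$,
    which leaves its trace, determinant, and eigenvalues unchanged.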

5、[CL] Big Little Transformer Decoder

S Kim, K Mangalam, J Malik, M W. Mahoney, A Gholami, K Keutzer
[UC Berkeley]

Big Little Transformer Decoder

Key points:

  1. BiLD is a framework that improves inference efficiency and reduces latency for a wide range of text generation applications;
  2. BiLD couples two models of different sizes that generate text collaboratively: the small model runs autoregressively, and the large model refines the small model's inaccurate predictions non-autoregressively;
  3. BiLD introduces two policies to coordinate the small and large models: a fallback policy and a rollback policy (see the sketch after this entry);
  4. On an NVIDIA Titan Xp GPU, BiLD achieves up to 2.13x speedup with no performance drop and up to 2.38x speedup with only about 1 point of degradation, and it is fully plug-and-play, requiring no training or modifications to model architectures.

One-sentence summary:
BiLD reduces inference latency for text generation by coupling a large and a small decoder model and coordinating them with two simple policies.

The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment, and which makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model’s inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to review and correct the small model’s inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, summarization on CNN/DailyMail, and language modeling on WikiText-2. On an NVIDIA Titan Xp GPU, our framework achieves a speedup of up to 2.13x without any performance drop, and it achieves up to 2.38x speedup with only ~1 point degradation. Furthermore, our framework is fully plug-and-play as it does not require any training or modifications to model architectures. Our code will be open-sourced.

https://arxiv.org/abs/2302.07863
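
A heavily simplified toy of the fallback/rollback coordination described above; the two "models" are canned stand-ins and the thresholds and interfaces are assumptions, whereas the real policies operate on the models' token probabilities:

    # Illustrative coordination loop for a big-little decoder pair.
    # small_step and big_review are toy stand-ins for the two models.

    def small_step(prefix):
        """Toy small model: propose the next token with a confidence score."""
        conf = 0.3 if len(prefix) % 5 == 4 else 0.9
        return f"tok{len(prefix)}", conf

    def big_review(seq, start):
        """Toy large model: score all drafted tokens in one parallel pass."""
        return [0.2 if (start + i) % 7 == 0 else 0.9 for i in range(len(seq) - start)]

    def bild_generate(max_len=20, fallback_thresh=0.5, rollback_thresh=0.4):
        out = []
        while len(out) < max_len:
            draft_start = len(out)
            # Small model drafts autoregressively until its confidence drops
            # (the fallback policy then defers to the large model).
            while len(out) < max_len:
                token, conf = small_step(out)
                out.append(token)
                if conf < fallback_thresh:
                    break
            # Large model reviews the drafted span in one non-autoregressive pass;
            # the rollback policy discards drafted tokens it disagrees with and
            # substitutes its own prediction from that position onward.
            for i, score in enumerate(big_review(out, draft_start)):
                if score < rollback_thresh:
                    out = out[:draft_start + i] + [f"BIG_tok{draft_start + i}"]
                    break
        return out

    print(bild_generate())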


A few more papers worth noting:

[CL] Augmented Language Models: a Survey

G Mialon, R Dessì, M Lomeli, C Nalmpantis, R Pasunuru, R Raileanu, B Rozière, T Schick, J Dwivedi-Yu, A Celikyilmaz, E Grave, Y LeCun, T Scialom
[Meta AI]

Augmented Language Models: a Survey

Key points:

  1. Augmenting language models with reasoning and tool use expands their context-processing ability and can potentially address the limitations of traditional language models;
  2. Augmented Language Models (ALMs) use external modules while adhering to the missing-token prediction objective, learning to reason and act while still performing standard natural language tasks;
  3. Most ALM work relies on human annotation, which may not scale; equipping language models with meaningful augmentations in a fully self-supervised way remains an open research problem;
  4. Future research should investigate the integration and interaction between reasoning and tool use in ALMs.

One-sentence summary:
Language models can expand their context-processing ability through reasoning and tool use, departing from the pure language modeling paradigm and potentially addressing the limitations of traditional language models.

This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demonstrations. While adhering to a standard missing tokens prediction objective, such augmented LMs can use various, possibly non-parametric external modules to expand their context processing ability, thus departing from the pure language modeling paradigm. We therefore refer to them as Augmented Language Models (ALMs). The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks and even outperforming most regular LMs on several benchmarks. In this work, after reviewing current advance in ALMs, we conclude that this new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues.

https://arxiv.org/abs/2302.07842

[CL] SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains

K Goswami, L Lange…
[Adobe Research Bangalore & Bosch Center for Artificial Intelligence & …]

SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains

Key points:

  1. SwitchPrompt is a lightweight prompting method for adapting language models to low-resource domains;
  2. It uses domain-specific keywords and trainable gated prompts to steer the language model toward the target domain;
  3. Experiments show that SwitchPrompt outperforms existing baseline methods and even closes the performance gap between general-domain and domain-specific language models;
  4. Future work could explore the effect of SwitchPrompt on sequence labeling tasks and mixed-domain datasets.

One-sentence summary:
SwitchPrompt is a new domain-specific prompting method that effectively reduces the need for pretraining in low-resource domains and outperforms existing methods on text classification benchmarks.

Prompting pre-trained language models leads to promising results across natural language processing tasks but is less effective when applied in low-resource domains, due to the domain gap between the pre-training data and the downstream task. In this work, we bridge this gap with a novel and lightweight prompting methodology called SwitchPrompt for the adaptation of language models trained on datasets from the general domain to diverse low-resource domains. Using domain-specific keywords with a trainable gated prompt, SwitchPrompt offers domain-oriented prompting, that is, effective guidance on the target domains for general-domain language models. Our few-shot experiments on three text classification benchmarks demonstrate the efficacy of the general-domain pre-trained language models when used with SwitchPrompt. They often even outperform their domain-specific counterparts trained with baseline state-of-the-art prompting methods by up to 10.7% performance increase in accuracy. This result indicates that SwitchPrompt effectively reduces the need for domain-specific language model pre-training.

https://arxiv.org/abs/2302.06868
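
A minimal sketch of a trainable gated soft prompt in PyTorch, loosely following the abstract's description; the gating formulation, prompt length, and dimensions are assumptions, not the paper's exact architecture:

    # Illustrative gated soft prompt: a learnable gate interpolates between a
    # general-domain soft prompt and a domain-specific one before they are
    # prepended to the input embeddings. Shapes and gating are assumptions.
    import torch
    import torch.nn as nn

    class GatedSoftPrompt(nn.Module):
        def __init__(self, prompt_len: int = 10, hidden: int = 768):
            super().__init__()
            self.general = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)
            self.domain = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)
            self.gate = nn.Parameter(torch.zeros(prompt_len, 1))  # trainable gate logits

        def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
            # input_embeds: (batch, seq_len, hidden)
            g = torch.sigmoid(self.gate)                       # (prompt_len, 1)
            prompt = g * self.domain + (1 - g) * self.general  # (prompt_len, hidden)
            prompt = prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
            return torch.cat([prompt, input_embeds], dim=1)    # prepend soft prompt

    x = torch.randn(2, 16, 768)        # toy batch of token embeddings
    print(GatedSoftPrompt()(x).shape)  # torch.Size([2, 26, 768])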

[LG] CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

S Zhou, U Alon, S Agarwal, G Neubig
[CMU]

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Key points:

  1. CodeBERTScore is a new evaluation metric for code generation that uses the contextual encodings of large pretrained models to compute a soft similarity score between each token in the generated code and in the reference code;
  2. CodeBERTScore encodes not only the generated tokens but also the programmatic context surrounding the generated code, making it more effective for code generation evaluation;
  3. An extensive evaluation of CodeBERTScore across four programming languages shows that it correlates more strongly with human preference and with functional correctness than all existing metrics;
  4. Five language-specific pretrained CodeBERT models are released for Java, Python, C, C++, and JavaScript.

One-sentence summary:
CodeBERTScore is an automatic evaluation metric for code generation that achieves higher correlation with human preference and functional correctness than existing metrics.

Since the rise of neural models of code that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching as BLEU, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at this https URL . Our language-specific models have been downloaded more than 25,000 times from the Huggingface Hub.

https://arxiv.org/abs/2302.05527
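
A minimal sketch of the BERTScore-style soft matching that the metric builds on, computed from precomputed contextual token embeddings; this is a generic re-implementation of the idea in the abstract, not the released codebertscore package's API:

    # Illustrative BERTScore-style soft matching over contextual token embeddings.
    # cand_emb / ref_emb stand for embeddings of generated and reference code
    # tokens (e.g. from a CodeBERT encoder); producing them is out of scope here.
    import torch

    def soft_f1(cand_emb: torch.Tensor, ref_emb: torch.Tensor) -> float:
        """cand_emb: (n_cand, d), ref_emb: (n_ref, d) contextual embeddings."""
        cand = torch.nn.functional.normalize(cand_emb, dim=-1)
        ref = torch.nn.functional.normalize(ref_emb, dim=-1)
        sim = cand @ ref.T                        # pairwise cosine similarities
        precision = sim.max(dim=1).values.mean()  # best reference match per generated token
        recall = sim.max(dim=0).values.mean()     # best generated match per reference token
        return (2 * precision * recall / (precision + recall)).item()

    print(soft_f1(torch.randn(7, 768), torch.randn(9, 768)))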

[LG] Transformer models: an introduction and catalog

X Amatriain

Transformer models: an introduction and catalog

Over the past few years, dozens of models in the Transformer family have appeared in rapid succession, all with amusing but not self-explanatory names. The goal of this paper is to offer a fairly comprehensive yet simple catalog and classification of the most popular Transformer models. The paper also introduces the most important aspects of and innovations in Transformer models.

In the past few years we have seen the meteoric appearance of dozens of models of the Transformer family, all of which have funny, but not self-explanatory, names. The goal of this paper is to offer a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects and innovation in Transformer models.

https://arxiv.org/abs/2302.07730
