LG – Machine Learning  CV – Computer Vision  CL – Computation and Language


1. [LG] Symbolic Discovery of Optimization Algorithms
2. [LG] Energy Transformer
3. [LG] A modern look at the relationship between sharpness and generalization
4. [LG] Simple Hardware-Efficient Long Convolutions for Sequence Modeling
5. [LG] Guiding Pretraining in Reinforcement Learning with Large Language Models
[LG] In Search for a Generalizable Method for Source Free Domain Adaptation
[LG] Geometric Clifford Algebra Networks
[CV] 3D-aware Blending with Generative NeRFs
[CL] Level Generation Through Large Language Models

Summary: symbolic discovery of optimization algorithms; Energy Transformer; a modern look at the relationship between sharpness and generalization; simple hardware-efficient long convolutions for sequence modeling; guiding pretraining in reinforcement learning with large language models; in search of a generalizable method for source-free domain adaptation; Geometric Clifford Algebra Networks; 3D-aware blending with generative NeRFs; game level generation through large language models.

1. [LG] Symbolic Discovery of Optimization Algorithms

X Chen, C Liang, D Huang, E Real, K Wang, Y Liu, H Pham, X Dong, T Luong, C Hsieh, Y Lu, Q V. Le
[Google]

Symbolic discovery of optimization algorithms

Key points:

  1. Proposes to formulate algorithm discovery as program search and applies it to discover optimization algorithms for deep neural network training, yielding Lion, a simple and effective optimizer;
  2. Lion is more memory-efficient than Adam and Adafactor, since it only keeps track of momentum and uses a uniform update magnitude for each parameter via the sign operation;
  3. Lion outperforms widely used optimizers on a variety of tasks, including image classification, vision-language contrastive learning, diffusion models, and language modeling;
  4. On autoregressive modeling, masked language modeling, and fine-tuning, Lion performs similarly to or better than Adam.

One-sentence summary:
Proposes a program-search method for discovering optimization algorithms for deep neural network training, which yields Lion, an effective and memory-efficient optimizer that outperforms widely used optimizers such as Adam and Adafactor across a variety of tasks.

We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% zero-shot and 91.1% fine-tuning accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. The implementation of Lion is publicly available.
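
To make the update concrete, here is a minimal NumPy sketch of a single Lion step, based on the rule described above (sign of an interpolation between momentum and gradient, plus decoupled weight decay); the hyperparameter names and default values are conventional placeholders, not the paper's exact notation:

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    """One Lion update. The only optimizer state is the momentum m;
    the sign operation gives every parameter the same update magnitude."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # interpolate, then take sign
    param = param - lr * (update + wd * param)        # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad                # momentum tracks the gradient
    return param, m
```

Note the smaller learning rate than one would use with Adam: as the abstract points out, the sign-based update has a larger norm, so Lion typically needs a smaller lr.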

https://arxiv.org/abs/2302.06675

2. [LG] Energy Transformer

B Hoover, Y Liang, B Pham, R Panda, H Strobelt, D H Chau, M J. Zaki, D Krotov
[IBM Research & RPI & Georgia Tech]

Energy Transformer

Key points:

  1. The Energy Transformer (ET) replaces the sequence of feedforward transformer blocks with a single large Associative Memory model;
  2. ET is purposely designed to minimize an energy function that represents the relationships between tokens;
  3. Compared with conventional attention, the attention mechanism in ET contains an additional term;
  4. ET obtains strong quantitative results on the graph anomaly detection task.

One-sentence summary:
The Energy Transformer (ET) is a new transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model, designed to minimize a specifically engineered energy function representing the relationships between tokens.

Transformers have become the de facto models of choice in machine learning, typically leading to impressive performance on many applications. At the same time, the architectural development in the transformer world is mostly driven by empirical findings, and the theoretical understanding of their architectural building blocks is rather limited. In contrast, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, but have not yet demonstrated truly impressive practical results. We propose a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model. Our novel architecture, called Energy Transformer (or ET for short), has many of the familiar architectural primitives that are often used in the current generation of transformers. However, it is not identical to the existing architectures. The sequence of transformer layers in ET is purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. As a consequence of this computational principle, the attention in ET is different from the conventional attention mechanism. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection task.
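
As a toy illustration of the computational principle (an assumption for exposition, not the paper's exact energy, which also includes a Hopfield memory term and layer normalization), token representations can be updated by gradient descent on a Hopfield-style log-sum-exp attention energy:

```python
import torch

def attention_energy(x, Wq, Wk, beta=1.0):
    """E = -(1/beta) * sum_i log sum_{j != i} exp(beta * q_i . k_j),
    a log-sum-exp energy over pairwise token similarities."""
    q, k = x @ Wq, x @ Wk
    scores = beta * (q @ k.T)
    diag = torch.eye(x.shape[0], dtype=torch.bool)
    scores = scores.masked_fill(diag, float("-inf"))  # exclude self-matches
    return -(1.0 / beta) * torch.logsumexp(scores, dim=1).sum()

x = torch.randn(8, 16, requires_grad=True)   # 8 tokens, 16 dimensions
Wq, Wk = torch.randn(16, 16), torch.randn(16, 16)
for _ in range(5):                           # "layers" = descent steps on the energy
    (g,) = torch.autograd.grad(attention_energy(x, Wq, Wk), x)
    x = (x - 0.1 * g).detach().requires_grad_(True)
```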

https://arxiv.org/abs/2302.07253

3. [LG] A modern look at the relationship between sharpness and generalization

M Andriushchenko, F Croce, M Müller, M Hein, N Flammarion
[EPFL & Tubingen AI Center]

A modern look at the relationship between sharpness and generalization

Key points:

  1. Evaluates several reparametrization-invariant sharpness measures in settings ranging from training from scratch on ImageNet to fine-tuning transformers on MNLI;
  2. Sharpness does not correlate well with generalization, but rather with some training parameters, such as the learning rate;
  3. In multiple cases, sharpness shows a consistent negative correlation with out-of-distribution error, implying that sharper minima can generalize better;
  4. The right sharpness measure is highly data-dependent.

One-sentence summary:
For modern deep neural networks, reparametrization-invariant sharpness may not be a good indicator of generalization.

Sharpness of minima is a promising quantity that can correlate with generalization in deep networks and, when optimized during training, can improve generalization. However, standard sharpness is not invariant under reparametrizations of neural networks, and, to fix this, reparametrization-invariant sharpness definitions have been proposed, most prominently adaptive sharpness (Kwon et al., 2021). But does it really capture generalization in modern practical settings? We comprehensively explore this question in a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers for which little is known in terms of sharpness despite their widespread usage. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup. Interestingly, in multiple cases, we observe a consistent negative correlation of sharpness with out-of-distribution error implying that sharper minima can generalize better. Finally, we illustrate on a simple model that the right sharpness measure is highly data-dependent, and that we do not understand well this aspect for realistic data distributions. The code of our experiments is available at this https URL.
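
As a rough illustration of adaptive sharpness (the worst-case loss increase under an elementwise-scaled perturbation; the paper's full protocol, e.g. m-sharpness averaged over batches, differs), a one-step SAM-style estimate in PyTorch might look like:

```python
import torch

def adaptive_sharpness(model, loss_fn, x, y, rho=2e-4):
    """Approximate max over |d_i| <= rho * |w_i| of L(w + d) - L(w),
    using a single sign-of-gradient ascent step to the L_inf corner."""
    loss0 = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss0, list(model.parameters()))
    with torch.no_grad():
        deltas = [rho * p.abs() * g.sign() for p, g in zip(model.parameters(), grads)]
        for p, d in zip(model.parameters(), deltas):
            p.add_(d)                        # perturb weights
        loss1 = loss_fn(model(x), y)
        for p, d in zip(model.parameters(), deltas):
            p.sub_(d)                        # restore weights
    return (loss1 - loss0).item()
```

The elementwise |w_i| scaling is what makes the measure invariant to reparametrizations that rescale weights without changing the network function.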

https://arxiv.org/abs/2302.07011

4. [LG] Simple Hardware-Efficient Long Convolutions for Sequence Modeling

D Y. Fu, E L. Epstein, E Nguyen, A W. Thomas, M Zhang, T Dao, A Rudra, C Ré
[Stanford University]

Simple hardware-efficient long convolutions for sequence modeling

Key points:

  1. For long-sequence modeling, long convolutions are an effective alternative to state space models;
  2. Keeping the convolution kernels smooth (e.g. by squashing the kernel weights) is the key to achieving high performance with long convolutions;
  3. FlashButterfly is an IO-aware algorithm that improves the runtime of long convolutions by reducing GPU memory IO and increasing FLOP utilization;
  4. By exploiting the connection to Butterfly matrices, long convolutions can be trained 1.8x faster than state space models.

One-sentence summary:
Long convolutions are a hardware-efficient alternative to state space models for sequence modeling, requiring only simple interventions such as squashing the kernel weights for high quality, together with the IO-aware FlashButterfly algorithm for runtime efficiency.

State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions–such as squashing the kernel weights–result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up convolutions by 2.2×, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2× faster than prior work. Lastly, we introduce an extension to FlashButterfly that learns the coefficients of the Butterfly decomposition, increasing expressivity without increasing runtime. Using this extension, we outperform a Transformer on WikiText103 by 0.2 PPL with 30% fewer parameters.
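
The core primitive is simple to state: an O(L log L) FFT convolution over the whole sequence, with the kernel weights "squashed" to keep the kernel smooth. A minimal sketch; the soft-threshold form of the squash operator is our reading of the intervention, and the threshold value is a placeholder:

```python
import torch

def long_conv(u, k):
    """Global convolution y = u * k computed via FFT in O(L log L).
    u: (batch, L) input sequences; k: (L,) learned long kernel."""
    L = u.shape[-1]
    n = 2 * L                                # zero-pad to avoid circular wrap-around
    y = torch.fft.irfft(torch.fft.rfft(u, n=n) * torch.fft.rfft(k, n=n), n=n)
    return y[..., :L]

def squash(k, lam=1e-3):
    """Soft-threshold the kernel weights: shrink small weights to zero,
    which empirically yields smooth kernels."""
    return torch.sign(k) * torch.relu(k.abs() - lam)
```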

https://arxiv.org/abs/2302.06646

5. [LG] Guiding Pretraining in Reinforcement Learning with Large Language Models

Y Du, O Watkins, Z Wang, C Colas, T Darrell, P Abbeel, A Gupta, J Andreas
[UC Berkeley & University of Washington & MIT]

Guiding pretraining in reinforcement learning with large language models

Key points:

  1. Proposes ELLM, a new intrinsically motivated reinforcement learning method that uses a pretrained LLM to guide exploration;
  2. ELLM biases exploration toward common-sense, plausibly useful behaviors;
  3. During pretraining, ELLM-trained agents show better coverage of useful behaviors, and when fine-tuned on downstream tasks they match or outperform the baselines;
  4. However, ELLM helps little in environments where the space of goal-based exploration is small, or where state information is not naturally encoded as natural-language strings.

One-sentence summary:
Large pretrained language models can guide reinforcement learning agents toward useful, human-meaningful behaviors.

Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped reward function. Intrinsically motivated exploration methods address this limitation by rewarding agents for visiting novel states or transitions, but these methods offer limited benefits in large environments where most discovered novelty is irrelevant for downstream tasks. We describe a method that uses background knowledge from text corpora to shape exploration. This method, called ELLM (Exploring with LLMs) rewards an agent for achieving goals suggested by a language model prompted with a description of the agent’s current state. By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop. We evaluate ELLM in the Crafter game environment and the Housekeep robotic simulator, showing that ELLM-trained agents have better coverage of common-sense behaviors during pretraining and usually match or improve performance on a range of downstream tasks.
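
The reward loop this describes can be sketched as follows; `suggest_goals` and `embed` are hypothetical stand-ins for the prompted language model and a sentence encoder, and the similarity threshold is an assumed placeholder:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ellm_reward(state_desc, transition_caption, suggest_goals, embed, thresh=0.5):
    """Reward the agent when a caption of what it just did is close to one of
    the goals the LLM suggested for the current state description."""
    goals = suggest_goals(state_desc)        # e.g. ["chop a tree", "drink water"]
    c = embed(transition_caption)
    sims = [cosine(c, embed(g)) for g in goals]
    best = max(sims, default=0.0)
    return best if best > thresh else 0.0
```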

https://arxiv.org/abs/2302.06692


A few more papers worth noting:

[LG] In Search for a Generalizable Method for Source Free Domain Adaptation

M Boudiaf, T Denton, B v Merriënboer, V Dumoulin, E Triantafillou
[Google Research & ETS Montreal]

In search of a generalizable method for source-free domain adaptation

Key points:

  1. SFDA methods developed for vision tasks may be co-adapted to that domain, hindering progress on other modalities and problems;
  2. Given the absence of target labels and of a well-defined validation set, generalizability is critical for SFDA methods;
  3. The proposed bioacoustics task is an important open testbed, demonstrating the gains from broadening SFDA evaluation in terms of modality and distribution shift;
  4. The proposed NOTELA method outperforms existing methods on the bioacoustics shifts while exhibiting strong performance on vision datasets.

One-sentence summary:
Existing SFDA methods are not as generalizable as previously thought, and considering diverse modalities is useful for designing more robust models.

Source-free domain adaptation (SFDA) is compelling because it allows adapting an off-the-shelf model to a new domain using only unlabelled data. In this work, we apply existing SFDA techniques to a challenging set of naturally-occurring distribution shifts in bioacoustics, which are very different from the ones commonly studied in computer vision. We find existing methods perform differently relative to each other than observed in vision benchmarks, and sometimes perform worse than no adaptation at all. We propose a new simple method which outperforms the existing methods on our new shifts while exhibiting strong performance on a range of vision datasets. Our findings suggest that existing SFDA methods are not as generalizable as previously thought and that considering diverse modalities can be a useful avenue for designing more robust models.

https://arxiv.org/abs/2302.06658

[LG] Geometric Clifford Algebra Networks

D Ruhe, J K. Gupta, S d Keninck, M Welling, J Brandstetter
[University of Amsterdam & Microsoft]

Geometric Clifford Algebra Networks

Key points:

  1. Proposes GCANs, which represent and manipulate geometric transformations via symmetry group transformations;
  2. Introduces group action layers and refinable geometric templates;
  3. Excels at modeling rigid body transformations;
  4. Scales well to large fluid dynamics simulations.

One-sentence summary:
Geometric Clifford Algebra Networks (GCANs) use symmetry group transformations to represent and manipulate geometric transformations, offering theoretical advantages over traditional methods and excelling at modeling rigid body transformations and large-scale fluid dynamics simulations.

We propose Geometric Clifford Algebra Networks (GCANs) that are based on symmetry group transformations using geometric (Clifford) algebras. GCANs are particularly well-suited for representing and manipulating geometric transformations, often found in dynamical systems. We first review the quintessence of modern (plane-based) geometric algebra, which builds on isometries encoded as elements of the Pin(p,q,r) group. We then propose the concept of group action layers, which linearly combine object transformations using pre-specified group actions. Together with a new activation and normalization scheme, these layers serve as adjustable geometric templates that can be refined via gradient descent. Theoretical advantages are strongly reflected in the modeling of three-dimensional rigid body transformations as well as large-scale fluid dynamics simulations, showing significantly improved performance over traditional methods.
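
To give a flavor of the group-action-layer idea, here is a toy sketch using unit quaternions (the rotor subalgebra of Cl(3,0)) as the pre-specified group actions; the paper works in the more general plane-based algebra with Pin(p,q,r) elements, so this is an illustrative simplification rather than the paper's layer:

```python
import numpy as np

def quat_rotate(q, v):
    """Apply the sandwich product q v q^{-1}: rotate 3-vector v
    by the unit quaternion q = (w, x, y, z)."""
    w, u = q[0], q[1:]
    t = 2.0 * np.cross(u, v)
    return v + w * t + np.cross(u, t)

def group_action_layer(x, quats, weights):
    """Linearly combine pre-specified group actions applied to a point x,
    with mixing weights that can be refined by gradient descent."""
    return sum(w * quat_rotate(q, x) for q, w in zip(quats, weights))
```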

https://arxiv.org/abs/2302.06594

[CV] 3D-aware Blending with Generative NeRFs

H Kim, G Lee, Y Choi, J Kim, J Zhu
[NAVER AI Lab & CMU]

基于生成式NeRF的3D感知混合

Key points:

  1. Proposes a 3D-aware image blending method based on generative NeRFs;
  2. Its two key components are 3D-aware alignment and 3D-aware blending;
  3. Outperforms existing 2D baselines in photorealism and faithfulness to the input images;
  4. Advantages include color-geometry disentanglement and multi-view-consistent blending.

One-sentence summary:
Proposes a new 3D-aware blending method based on generative Neural Radiance Fields (NeRFs) that outperforms existing 2D baselines, with advantages including color-geometry disentanglement and multi-view-consistent blending.

Image blending aims to combine multiple images seamlessly. It remains challenging for existing 2D-based methods, especially when input images are misaligned due to differences in 3D camera poses and object shapes. To tackle these issues, we propose a 3D-aware blending method using generative Neural Radiance Fields (NeRF), including two key components: 3D-aware alignment and 3D-aware blending. For 3D-aware alignment, we first estimate the camera pose of the reference image with respect to generative NeRFs and then perform 3D local alignment for each part. To further leverage 3D information of the generative NeRF, we propose 3D-aware blending that directly blends images on the NeRF’s latent representation space, rather than raw pixel space. Collectively, our method outperforms existing 2D baselines, as validated by extensive quantitative and qualitative evaluations with FFHQ and AFHQ-Cat.

https://arxiv.org/abs/2302.06608

[CL] Level Generation Through Large Language Models

G Todd, S Earle, M U Nasir, M C Green, J Togelius
[New York University Tandon & University of the Witwatersrand]

Level generation through large language models

Key points:

  1. Large language models (LLMs) can generate functional levels for the game Sokoban;
  2. LLM level-generation performance depends heavily on data availability, and LLMs may help overcome the general scarcity of game-level data;
  3. Simple prompting suffices to control observable level characteristics, but more work is needed to control more complex metrics;
  4. Level generation with LLMs is promising and may open a new path for procedural content generation via machine learning.

One-sentence summary:
Large language models (LLMs), although trained only on natural language, can generate functional video game levels given sufficient data and training.

Large Language Models (LLMs) are powerful tools, capable of leveraging their training on natural language to write stories, generate code, and answer questions. But can they generate functional video game levels? Game levels, with their complex functional constraints and spatial relationships in more than one dimension, are very different from the kinds of data an LLM typically sees during training. Datasets of game levels are also hard to come by, potentially taxing the abilities of these data-hungry models. We investigate the use of LLMs to generate levels for the game Sokoban, finding that LLMs are indeed capable of doing so, and that their performance scales dramatically with dataset size. We also perform preliminary experiments on controlling LLM level generators and discuss promising areas for future work.
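
A plausible sketch of the sample-and-filter pipeline such a study implies; `lm_sample` and `is_solvable` are hypothetical placeholders for a language model fine-tuned on ASCII Sokoban levels and a Sokoban solver, not the paper's code:

```python
def generate_levels(lm_sample, is_solvable, n=10, max_tries=1000):
    """Sample candidate ASCII levels from the LM and keep only
    functional (solvable) ones."""
    levels, tries = [], 0
    while len(levels) < n and tries < max_tries:
        tries += 1
        level = lm_sample()          # one ASCII level, e.g. '#'=wall, '$'=box
        if is_solvable(level):       # functional constraint check
            levels.append(level)
    return levels
```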

https://arxiv.org/abs/2302.05817
