LG – Machine Learning  CV – Computer Vision  CL – Computation and Language  RO – Robotics


1、[RO] Language-Driven Representation Learning for Robotics
2、[CV] Decoupling Human and Camera Motion from Videos in the Wild
3、[LG] On the Training Instability of Shuffling SGD with Batch Normalization
4、[RO] Leveraging Jumpy Models for Planning and Fast Learning in Robotic Domains
5、[CL] Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback
[LG] Physics-Constrained Deep Learning for Climate Downscaling
[CV] Modulating Pretrained Diffusion Models for Multimodal Image Synthesis
[LG] Edgeformers: Graph-Empowered Transformers for Representation Learning on Textual-Edge Networks
[CV] ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Summary: language-driven representation learning for robotics; decoupling human and camera motion from in-the-wild videos; the training instability of shuffling SGD with batch normalization; leveraging jumpy models for planning and fast learning in robotic domains; improving large language models with external knowledge and automated feedback; physics-constrained deep learning for climate downscaling; modulating pretrained diffusion models for multimodal image synthesis; graph-empowered Transformers for representation learning on textual-edge networks; combining relative and metric depth for zero-shot transfer.

1、[RO] Language-Driven Representation Learning for Robotics

S Karamcheti, S Nair, A S. Chen, T Kollar, C Finn, D Sadigh, P Liang
[Stanford University]

Language-driven representation learning for robotics

Key points:

  1. Existing visual representation learning methods for robotics yield inconsistent results across robot learning problems beyond control;
  2. Voltron is a language-driven visual representation learning framework that balances language-conditioned reconstruction and generation, capturing both low-level visual patterns and high-level semantics;
  3. Voltron outperforms existing methods on a range of robot learning problems, especially those requiring high-level features;
  4. The paper introduces a new evaluation suite spanning five distinct robot learning problem domains for holistically evaluating visual representations.

One-sentence summary:
A language-driven visual representation learning framework for robotics that balances conditioning and generation to capture both low-level visual patterns and high-level semantics, outperforming existing methods across diverse robot learning problems.

Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. Leveraging methods such as masked autoencoding and contrastive learning, these representations exhibit strong transfer to policy learning for visuomotor control. But, robot learning encompasses a diverse set of problems beyond control including grasp affordance prediction, language-conditioned imitation learning, and intent scoring for human-robot collaboration, amongst others. First, we demonstrate that existing representations yield inconsistent results across these tasks: masked autoencoding approaches pick up on low-level spatial features at the cost of high-level semantics, while contrastive learning approaches capture the opposite. We then introduce Voltron, a framework for language-driven representation learning from human videos and associated captions. Voltron trades off language-conditioned visual reconstruction to learn low-level visual patterns, and visually-grounded language generation to encode high-level semantics. We also construct a new evaluation suite spanning five distinct robot learning problems – a unified platform for holistically evaluating visual representations for robotics. Through comprehensive, controlled experiments across all five problems, we find that Voltron’s language-driven representations outperform the prior state-of-the-art, especially on targeted problems requiring higher-level features.

https://arxiv.org/abs/2302.12766

2、[CV] Decoupling Human and Camera Motion from Videos in the Wild

V Ye, G Pavlakos, J Malik, A Kanazawa
[UC Berkeley]

Decoupling human and camera motion from in-the-wild videos

Key points:

  1. The proposed method recovers global 3D human trajectories in challenging in-the-wild videos by leveraging relative camera estimates and learned human motion priors;
  2. The method is robust, outperforming existing approaches on the Egobody dataset and producing plausible trajectories for scenes with multiple people and challenging camera motion;
  3. The camera trajectory optimization is consistent with scene parallax and the 2D reprojections of real-world human trajectories, allowing the method to operate on complex human videos;
  4. The recovered camera scale allows reasoning about the motion of multiple people in a shared coordinate frame, improving downstream tracking performance on PoseTrack.

One-sentence summary:
A method that decouples camera and human motion, using relative camera estimates and learned human motion priors to recover human trajectories in the world coordinate frame from challenging moving-camera videos.

We propose a method to reconstruct global human trajectories from videos in the wild. Our optimization method decouples the camera and human motion, which allows us to place people in the same world coordinate frame. Most existing methods do not model the camera motion; methods that rely on the background pixels to infer 3D human motion usually require a full scene reconstruction, which is often not possible for in-the-wild videos. However, even when existing SLAM systems cannot recover accurate scene reconstructions, the background pixel motion still provides enough signal to constrain the camera motion. We show that relative camera estimates along with data-driven human motion priors can resolve the scene scale ambiguity and recover global human trajectories. Our method robustly recovers the global 3D trajectories of people in challenging in-the-wild videos, such as PoseTrack. We quantify our improvement over existing methods on 3D human dataset Egobody. We further demonstrate that our recovered camera scale allows us to reason about motion of multiple people in a shared coordinate frame, which improves performance of downstream tracking in PoseTrack. Code and video results can be found at this https URL.
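At its core, the decoupling idea is a composition of two transforms: the estimated camera-to-world motion and the per-frame human pose in camera coordinates. A minimal numpy sketch (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def to_world_frame(T_world_cam, T_cam_human):
    """Place per-frame human poses into a shared world frame by composing
    the estimated camera-to-world transform with the human pose expressed
    in camera coordinates (both as 4x4 homogeneous matrices per frame)."""
    return [Twc @ Tch for Twc, Tch in zip(T_world_cam, T_cam_human)]
```

Because the camera estimates are only relative, the paper additionally uses learned human motion priors to resolve the scene scale before such a composition is meaningful.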

https://arxiv.org/abs/2302.12827

3、[LG] On the Training Instability of Shuffling SGD with Batch Normalization

D X. Wu, C Yun, S Sra
[UC Berkeley & KAIST & MIT]

On the training instability of shuffling SGD with batch normalization

Key points:

  1. Training with Single Shuffle and batch normalization can lead to divergence and slower convergence;
  2. Random Reshuffle with batch normalization yields a much more stable evolution of the training loss;
  3. Single Shuffle and Random Reshuffle converge to distinct global optima that are "distorted" away from those of gradient descent;
  4. Random Reshuffle avoids distortion and divergence, and typically converges faster than Single Shuffle.

One-sentence summary:
Shuffling SGD with batch normalization can exhibit undesirable training behavior, but in terms of stability and convergence, Random Reshuffle is a better choice than Single Shuffle.

We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) — two widely used variants of SGD — interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are “distorted” away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can, and cannot occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results by confirming them empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice.
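The two SGD variants under study differ only in how the epoch's data order is drawn; a minimal sketch of that difference (function names are illustrative):

```python
import random

def single_shuffle(n, epochs, seed=0):
    # Single Shuffle (SS): draw one permutation and reuse it every epoch.
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return [list(order) for _ in range(epochs)]

def random_reshuffle(n, epochs, seed=0):
    # Random Reshuffle (RR): draw a fresh permutation at each epoch.
    rng = random.Random(seed)
    orders = []
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        orders.append(order)
    return orders
```

The paper's finding is that this seemingly minor difference interacts with batch normalization's dependence on batch composition: the fixed batches of SS bias training toward distorted optima (and can diverge in classification), while RR's fresh batches avoid both effects.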

https://arxiv.org/abs/2302.12444

4、[RO] Leveraging Jumpy Models for Planning and Fast Learning in Robotic Domains

J Zhang, J T Springenberg, A Byravan, L Hasenclever, A Abdolmaleki, D Rao, N Heess, M Riedmiller
[DeepMind]

Leveraging jumpy models for planning and fast learning in robotic domains

Key points:

  1. Proposes learning a multi-step dynamics prediction model (a jumpy model) in a skill embedding space, offline from unlabeled experience;
  2. Jumpy models that incorporate temporal abstraction can facilitate planning in long-horizon tasks where standard dynamics models fail;
  3. The learned components (jumpy model and skill embedding space) can be harnessed in various ways, via reinforcement learning or planning, to solve newly encountered tasks;
  4. Experiments in the RGB-stacking environment validate the efficacy of the approach and demonstrate the benefits of jumpy planning in the learned latent skill space.

One-sentence summary:
Learning jumpy models and a skill embedding space from unlabeled experience enables zero-shot generalization to new tasks and speeds up policy training via reinforcement learning.

In this paper we study the problem of learning multi-step dynamics prediction models (jumpy models) from unlabeled experience and their utility for fast inference of (high-level) plans in downstream tasks. In particular we propose to learn a jumpy model alongside a skill embedding space offline, from previously collected experience for which no labels or reward annotations are required. We then investigate several options of harnessing those learned components in combination with model-based planning or model-free reinforcement learning (RL) to speed up learning on downstream tasks. We conduct a set of experiments in the RGB-stacking environment, showing that planning with the learned skills and the associated model can enable zero-shot generalization to new tasks, and can further speed up training of policies via reinforcement learning. These experiments demonstrate that jumpy models which incorporate temporal abstraction can facilitate planning in long-horizon tasks in which standard dynamics models fail.
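In skill space, planning with a jumpy model amounts to chaining multi-step predictions, so a long horizon costs only one model call per skill. A hypothetical sketch (the model and skill objects are stand-ins, not the paper's API):

```python
def jumpy_rollout(state, skill_sequence, jumpy_model):
    """Roll a candidate plan forward with a multi-step ("jumpy") model:
    each call predicts the state k environment steps ahead under one skill,
    so the rollout needs only len(skill_sequence) model calls."""
    trajectory = [state]
    for z in skill_sequence:
        state = jumpy_model(state, z)  # one call spans k low-level steps
        trajectory.append(state)
    return trajectory
```

A planner would score many such rollouts against a task objective and execute the best skill sequence, or use the rollouts to speed up model-free RL.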

https://arxiv.org/abs/2302.12617

5、[CL] Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

B Peng, M Galley, P He, H Cheng, Y Xie, Y Hu, Q Huang, L Liden, Z Yu, W Chen, J Gao
[Microsoft Research]

Check your facts and try again: improving large language models with external knowledge and automated feedback

Key points:

  1. LLM-Augmenter augments a black-box LLM with plug-and-play modules for external knowledge and automated feedback;
  2. It significantly reduces hallucinations without sacrificing the fluency and informativeness of responses;
  3. Empirical validation shows it is effective in task-oriented dialog and open-domain question answering scenarios;
  4. LLM-Augmenter improves ChatGPT's factuality score by grounding its responses in consolidated external knowledge and automated feedback.

One-sentence summary:
The LLM-Augmenter framework improves black-box LLMs such as ChatGPT by incorporating external knowledge and automated feedback, raising their factuality scores.

Large language models (LLMs), such as ChatGPT, are able to generate human-like, fluent responses for many downstream tasks, e.g., task-oriented dialog and question answering. However, applying LLMs to real-world, mission-critical applications remains challenging mainly due to their tendency to generate hallucinations and inability to use external knowledge. This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules. Our system makes the LLM generate responses grounded in consolidated external knowledge, e.g., stored in task-specific databases. It also iteratively revises LLM prompts to improve model responses using feedback generated by utility functions, e.g., the factuality score of an LLM-generated response. The effectiveness of LLM-Augmenter is empirically validated on two types of mission-critical scenarios, task-oriented dialog and open-domain question answering. LLM-Augmenter significantly reduces ChatGPT’s hallucinations without sacrificing the fluency and informativeness of its responses. We make the source code and models publicly available.
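The retrieve-score-revise loop the abstract describes can be sketched in a few lines; all callables and prompt formats here are caller-supplied stand-ins, not the paper's actual modules:

```python
def respond_with_feedback(query, llm, retrieve, utility,
                          max_iters=3, threshold=0.8):
    """Sketch of an LLM-Augmenter-style loop: ground the prompt in retrieved
    evidence, score the candidate answer with a utility function (e.g. a
    factuality score), and revise the prompt with the feedback until the
    score clears a threshold or the iteration budget runs out."""
    evidence = retrieve(query)
    prompt = f"Evidence: {evidence}\nQuestion: {query}"
    response = llm(prompt)
    for _ in range(max_iters):
        score, feedback = utility(response, evidence)
        if score >= threshold:
            break
        prompt += f"\nFeedback: {feedback}\nRevise your answer."
        response = llm(prompt)
    return response
```

The key property is that the LLM stays a black box: only its prompt changes between iterations, driven by the external knowledge and the utility function's feedback.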

https://arxiv.org/abs/2302.12813


Other papers worth noting:

[LG] Physics-Constrained Deep Learning for Climate Downscaling

P Harder, V Ramesh, A Hernandez-Garcia, Q Yang, P Sattigeri, D Szwarcman, C Watson, D Rolnick
[Fraunhofer ITWM & Mila Quebec AI Institute & IBM Research]

Physics-constrained deep learning for climate downscaling

Key points:

  1. Proposes a new approach to incorporating physics-based constraints into neural network architectures for climate downscaling;
  2. The approach improves the predictive performance of different deep learning architectures on a variety of climate datasets while guaranteeing physical constraints such as conservation of mass and energy;
  3. It also improves super-resolution accuracy in other domains (e.g. standard images and satellite imagery), and introduces a new deep learning architecture for downscaling along both spatial and temporal dimensions;
  4. The constraint layers help address common problems of deep learning applied to downscaling: the coastal effect can be suppressed, errors in critical regions shrink, out-of-distribution generalization improves, and training becomes more stable.

One-sentence summary:
A new approach that incorporates physics-based constraints into neural network architectures for climate downscaling, improving predictive performance while guaranteeing conservation laws.

The availability of reliable, high-resolution climate and weather data is important to inform long-term decisions on climate adaptation and mitigation and to guide rapid responses to extreme events. Forecasting models are limited by computational costs and, therefore, often generate coarse-resolution predictions. Statistical downscaling, including super-resolution methods from deep learning, can provide an efficient method of upsampling low-resolution data. However, despite achieving visually compelling results in some cases, such models frequently violate conservation laws when predicting physical variables. In order to conserve physical quantities, we develop methods that guarantee physical constraints are satisfied by a deep learning downscaling model while also improving their performance according to traditional metrics. We compare different constraining approaches and demonstrate their applicability across different neural architectures as well as a variety of climate and weather datasets. Besides enabling faster and more accurate climate predictions, we also show that our novel methodologies can improve super-resolution for satellite data and standard datasets.
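One simple way to guarantee conservation in downscaling is a renormalization layer: rescale each output block so it averages back to the coarse input it was upsampled from. The paper compares several such constraint layers; this numpy sketch shows only a multiplicative variant, with illustrative names:

```python
import numpy as np

def conserve_mass(hr_pred, lr_input, factor):
    """Illustrative multiplicative constraint layer: rescale each
    factor x factor block of the predicted high-resolution field so its
    mean equals the corresponding low-resolution pixel, guaranteeing that
    block-averaging the output recovers the input exactly."""
    out = hr_pred.copy()
    h, w = lr_input.shape
    for i in range(h):
        for j in range(w):
            block = out[i*factor:(i+1)*factor, j*factor:(j+1)*factor]
            mean = block.mean()
            if mean != 0:
                block *= lr_input[i, j] / mean   # in-place, via numpy view
    return out
```

Appended after a super-resolution network, such a layer makes conservation a hard guarantee rather than a soft penalty, which is what allows it to hold even out of distribution.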

https://arxiv.org/abs/2208.05424

[CV] Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

C Ham, J Hays, J Lu, K K Singh, Z Zhang, T Hinz
[Georgia Institute of Technology & Adobe Research]

Modulating pretrained diffusion models for multimodal image synthesis

Key points:

  1. Proposes multimodal conditioning modules (MCM), a new method that enables conditional image synthesis with pretrained diffusion models, avoiding the computationally expensive process of training networks from scratch or fine-tuning pretrained ones;
  2. MCM is a small trained module that modulates the diffusion network's predictions during sampling, using 2D modalities unseen during the diffusion model's original training;
  3. The method lets users control the spatial layout of the image and increases control over the image generation process while preserving high image quality;
  4. MCM is roughly 100x smaller than the original diffusion model and is trained with only a limited number of paired examples of the new target modality, making it cheaper and less memory-hungry than training from scratch or fine-tuning a large model.

One-sentence summary:
Pretrained diffusion models can be modulated with multimodal conditioning modules (MCM) for cheap and effective conditional image synthesis.

We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network’s parameters. MCM is a small module trained to modulate the diffusion network’s predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion net, consists of only ∼1% of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs.
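The general pattern is a frozen base network whose sampling-time prediction is adjusted by a small trained module. A sketch under stated assumptions: only the "no updates to the diffusion network's parameters" property comes from the abstract; the additive combination rule and all names here are illustrative guesses, not the paper's actual formulation:

```python
def modulated_prediction(x_t, t, cond, diffusion_net, mcm):
    """Illustrative sketch of prediction modulation during sampling:
    the pretrained diffusion network stays frozen, and a small conditioning
    module contributes a correction based on the new 2D modality (e.g. a
    segmentation map or sketch). The additive form is an assumption."""
    base = diffusion_net(x_t, t)    # frozen pretrained network
    delta = mcm(x_t, t, cond)       # small trained module (~1% of base params)
    return base + delta
```

Since gradients never flow through `diffusion_net`'s parameters, training the module is far cheaper than fine-tuning the full model.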

https://arxiv.org/abs/2302.12764

[LG] Edgeformers: Graph-Empowered Transformers for Representation Learning on Textual-Edge Networks

B Jin, Y Zhang, Y Meng, J Han
[University of Illinois at Urbana-Champaign]

Edgeformers: graph-empowered Transformers for representation learning on textual-edge networks

Key points:

  1. Existing network representation learning models lack designs for exploiting the text semantics of edges, even though edges in real-world social/information networks are often associated with rich text;
  2. Edgeformers is a new graph-empowered Transformer framework that deeply couples network and text information in a contextualized way for edge and node representation learning;
  3. Edgeformers outperform a range of baselines, including node-centric GNNs, edge-aware GNNs, and PLM-GNN cascaded architectures, on five public datasets from three domains, demonstrating their advantage on both edge-level and node-level tasks;
  4. Future directions include exploring other variants of injecting network signals into Transformer text encoding and applying the framework to more network-related tasks such as recommendation and text-rich social network analysis.

One-sentence summary:
Edgeformers is a new graph-empowered Transformer framework that models the text on network edges for edge and node representation learning, outperforming state-of-the-art baselines on both edge-level and node-level tasks.

Edges in many real-world social/information networks are associated with rich text information (e.g., user-user communications or user-product reviews). However, mainstream network representation learning models focus on propagating and aggregating node attributes, lacking specific designs to utilize text semantics on edges. While there exist edge-aware graph neural networks, they directly initialize edge attributes as a feature vector, which cannot fully capture the contextualized text semantics of edges. In this paper, we propose Edgeformers, a framework built upon graph-enhanced Transformers, to perform edge and node representation learning by modeling texts on edges in a contextualized way. Specifically, in edge representation learning, we inject network information into each Transformer layer when encoding edge texts; in node representation learning, we aggregate edge representations through an attention mechanism within each node’s ego-graph. On five public datasets from three different domains, Edgeformers consistently outperform state-of-the-art baselines in edge classification and link prediction, demonstrating the efficacy in learning edge and node representations, respectively.
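The node-level step the abstract describes, aggregating edge representations through attention within a node's ego-graph, can be sketched as a single softmax-attention pooling (shapes and names here are illustrative, not the paper's architecture):

```python
import numpy as np

def aggregate_node(edge_reps, node_query):
    """Form a node representation as the attention-weighted sum of the
    representations of edges in the node's ego-graph.
    edge_reps: (num_edges, dim) array; node_query: (dim,) array."""
    scores = edge_reps @ node_query            # (num_edges,) attention logits
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ edge_reps                 # (dim,) node representation
```

The edge representations themselves come from Transformer layers into which network information is injected while encoding the edge text, which is the "graph-empowered" half of the framework.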

https://arxiv.org/abs/2302.11050

[CV] ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S F Bhat, R Birkl, D Wofk, P Wonka, M Müller
[KAUST & Intel]

ZoeDepth: combining relative and metric depth for zero-shot transfer

Key points:

  1. ZoeDepth is the first approach to combine relative and metric depth, achieving excellent generalization performance while maintaining metric scale;
  2. Its flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth, significantly improving the state of the art;
  3. ZoeDepth is the first model that can be jointly trained on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance, achieving unprecedented zero-shot generalization to eight unseen datasets from both indoor and outdoor domains;
  4. ZoeDepth bridges the gap between relative and metric depth estimation and can be further improved with finer-grained domain and metric fine-tuning on more datasets.

One-sentence summary:
ZoeDepth combines relative and metric depth in a new approach that achieves excellent generalization performance while maintaining metric scale.

This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at this https URL .
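The per-image routing the abstract describes, a latent classifier sending each input to the right domain-specific metric head, reduces to a dictionary dispatch; a sketch with caller-supplied stand-ins (not the actual ZoeDepth API):

```python
def predict_depth(image, backbone, classifier, heads):
    """Sketch of ZoeDepth-style inference routing: shared features from the
    relative-depth backbone are classified into a domain, and the matching
    lightweight metric head (the paper's "metric bins module") produces the
    metric depth map. All components here are illustrative callables."""
    feats = backbone(image)
    domain = classifier(feats)      # e.g. "indoor" or "outdoor"
    return heads[domain](feats)
```

This is what lets one jointly trained model serve both NYU Depth v2 (indoor) and KITTI (outdoor) without a significant performance drop.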

https://arxiv.org/abs/2302.12288
