Safety & Alignment

Why responsible AI development needs cooperation on safety

Artificial Intelligence, Safety & Alignment

We’ve written a policy research paper identifying four strategies that can be used today to improve the likelihood of long-term industry cooperation on safety norms in AI: communicating risks and benefits, technical collaboration, increased transparency, and incentivizing standards. Our analysis shows that industry cooperation on safety will be instrumental in ensuring that AI systems are […]

Transfer of adversarial robustness between perturbation types

Artificial Intelligence, Safety & Alignment

Introducing Activation Atlases

Artificial Intelligence, Safety & Alignment

We’ve created activation atlases (in collaboration with Google researchers), a new technique for visualizing what interactions between neurons can represent. As AI systems are deployed in increasingly sensitive contexts, having a better understanding of their internal decision-making processes will let us identify weaknesses and investigate failures.

Faulty reward functions in the wild

Artificial Intelligence, Safety & Alignment

Reinforcement learning algorithms can break in surprising, counterintuitive ways. In this post we’ll explore one such failure mode: a misspecified reward function.

Semi-supervised knowledge transfer for deep learning from private training data

Artificial Intelligence, Safety & Alignment

Adversarial training methods for semi-supervised text classification

Artificial Intelligence, Safety & Alignment

Concrete AI safety problems

Artificial Intelligence, Safety & Alignment

We (along with researchers from Berkeley and Stanford) are co-authors on today’s paper led by Google Brain researchers, Concrete Problems in AI Safety. The paper explores many research problems around ensuring that modern machine learning systems operate as intended.

Learning from human preferences

Artificial Intelligence, Safety & Alignment

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can […]

Attacking machine learning with adversarial examples

Artificial Intelligence, Safety & Alignment

Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines. In this post we’ll show how adversarial examples work across different mediums, and will discuss why securing systems against them can be difficult.
