Safety & Alignment

Measuring Goodhart’s law

Artificial Intelligence, Safety & Alignment

Goodhart’s law famously says: “When a measure becomes a target, it ceases to be a good measure.” Although originally from economics, it’s something we have to grapple with at OpenAI when figuring out how to optimize objectives that are difficult or costly to measure.

Aligning language models to follow instructions

We’ve trained language models that are much better at following user intentions than GPT-3 while also making them more truthful and less toxic, using techniques developed through our alignment research. These InstructGPT models, which are trained with humans in the loop, are now deployed as the default language models on our API.

Economic impacts research at OpenAI

Call for expressions of interest to study the economic impacts of large language models.

Summarizing books with human feedback

Scaling human oversight of AI systems for tasks that are difficult to evaluate.

Improving language model behavior by training on a curated dataset

Our latest research finds we can improve language model behavior with respect to specific behavioral values by fine-tuning on a small, curated dataset.

Learning to summarize with human feedback

We’ve applied reinforcement learning from human feedback to train language models that are better at summarization.

Safety Gym

We’re releasing Safety Gym, a suite of environments and tools for measuring progress towards reinforcement learning agents that respect safety constraints while training.

Benchmarking safe exploration in deep reinforcement learning

Testing robustness against unforeseen adversaries

We’ve developed a method to assess whether a neural network classifier can reliably defend against adversarial attacks not seen during training. Our method yields a new metric, UAR (Unforeseen Attack Robustness), which evaluates the robustness of a single model against an unanticipated attack, and highlights the need to measure performance across a more diverse range

Fine-tuning GPT-2 from human preferences

We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models
