Identify the novel ideas in a paper
Context: An alignment researcher has a database of notes on research papers. They encounter a new paper and would like to know which ideas in that paper are new to them. A system is useful if it can summarize just the claims/ideas in the paper that do not already appear in the researcher’s database.
Input type: A database of notes on research papers structured as (paper, notes) tuples, along with a new research paper.
Output type: A list of new ideas/claims in the paper, along with pointers to where they appear in the text.
Info constraints: None
The instances below are a bit sparse because each instance required summarizing at least some ideas from multiple posts. Fortunately, instances “nest”: as a researcher reads new posts, each new post generates a new instance, with all previously read posts forming the database.
Instance 1:
Input:
Database: (Musings on the Speed Prior: Deception takes longer than accomplishing the original goal, so the speed prior favors just doing what the human wants. The speed prior may perform badly on many tasks though, so there is a competitiveness concern.)
New Paper: Should we rely on the speed prior for safety?
Output:
New idea(s): A speed prior could disfavor meta-learning over directly learning the correct policy.
Instance 2:
Input:
Database: ((Musings on the Speed Prior: Deception takes longer than accomplishing the original goal, so the speed prior favors just doing what the human wants. The speed prior may perform badly on many tasks though, so there is a competitiveness concern.), (Should we rely on the speed prior for safety?: A speed prior could disfavor meta-learning over directly learning the correct policy.))
New Paper: The Speed+Simplicity Prior is probably anti-deceptive
Output:
New idea(s): Early deception requires a more complex and time-consuming program than corrigibility, so speed and simplicity priors bias towards corrigibility. Corrigibility requires a more complex program but less runtime than late deception, so speed priors push towards corrigibility while simplicity priors push towards late deception.
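The way the two instances above nest can be sketched in Python. This is only an illustration under stated assumptions, not part of the task definition: the names Note, Instance, and nested_instances are hypothetical, the notes recorded for each post are assumed to be exactly the ideas that were new when it was read, and the first post is assumed to seed the database without producing an instance of its own. Pointers into the text are left out because the instances above do not list them.

```python
from typing import List, NamedTuple, Tuple


class Note(NamedTuple):
    """One (paper, notes) tuple in the researcher's database (hypothetical name)."""
    paper: str
    notes: str


class Instance(NamedTuple):
    """One task instance: database + new paper in, new ideas out (hypothetical name)."""
    database: Tuple[Note, ...]
    new_paper: str
    new_ideas: str


def nested_instances(reading_log: List[Note]) -> List[Instance]:
    """Build one instance per post after the first.

    Assumption: each post's notes are the ideas that were new at reading time,
    so they serve as that instance's expected output.
    """
    result = []
    for i in range(1, len(reading_log)):
        database = tuple(reading_log[:i])  # every post read before this one
        current = reading_log[i]
        result.append(Instance(database, current.paper, current.notes))
    return result


# Reading order and notes taken from the instances above.
reading_log = [
    Note(
        "Musings on the Speed Prior",
        "Deception takes longer than accomplishing the original goal, so the "
        "speed prior favors just doing what the human wants. The speed prior "
        "may perform badly on many tasks though, so there is a "
        "competitiveness concern.",
    ),
    Note(
        "Should we rely on the speed prior for safety?",
        "A speed prior could disfavor meta-learning over directly learning "
        "the correct policy.",
    ),
    Note(
        "The Speed+Simplicity Prior is probably anti-deceptive",
        "Early deception requires a more complex and time-consuming program "
        "than corrigibility, so speed and simplicity priors bias towards "
        "corrigibility. Corrigibility requires a more complex program but "
        "less runtime than late deception, so speed priors push towards "
        "corrigibility while simplicity priors push towards late deception.",
    ),
]

instances = nested_instances(reading_log)
for number, inst in enumerate(instances, start=1):
    print(f"Instance {number}: {len(inst.database)} paper(s) in database, "
          f"new paper: {inst.new_paper}")
```

Under these assumptions, a reading log of n posts yields n - 1 instances, and each instance's database is the previous instance's database plus one more (paper, notes) tuple.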