ozziegooen(Ozzie Gooen)

Karma: 3,222

I’m currently researching forecasting and epistemics as part of the Quantified Uncertainty Research Institute.

ozziegooen 6 Sep 2023 2:04 UTC
69 points
57
on: What I would do if I wasn’t at ARC Evals
Ajeya Cotra is currently the only evaluator for technical AIS grants.
This situation seems really bizarre to me. I know they have multiple researchers in-house investigating these issues, like Joseph Carlsmith. I’m really curious what’s going on here.

I know they’ve previously had what seemed to me like talented people join and leave that team. The fact that it’s so small now, given the complexity and importance of the topic, is something I have trouble grappling with.

My guess is that there are some key reasons for this that aren’t obvious externally.

I’d assume that it’s really important for this team to become really strong, but would obviously flag that when things are that strange, it’s likely difficult to fix, unless you really understand why the situation is the way it is now. I’d also encourage people to try to help here, but I just want to flag that it might be more difficult than it might initially seem.

ozziegooen 2 Apr 2022 22:26 UTC
65 points
on: MIRI announces new “Death With Dignity” strategy
For what it’s worth, I think I prefer the phrase,
”Failing with style”

ozziegooen 21 Apr 2021 5:33 UTC
27 points
in reply to: ozziegooen’s comment on: The Scout Mindset—read-along
(I originally posted this to Goodreads)
TL;DR: A good book with mass appeal to help people care more about being accurate. Fairly easy to read, which makes it easy to recommend to many people.
I’ve met Julia a few times and am friendly with her. I’d be happy if this book does well, and expect that to lead to a (slightly) more reasonable world.
That said, in the interest of having a Scout Mindset, I want to be honest about my impression.
The Scout Mindset is the sort of book I’m both happy with and frustrated by. I’m frustrated because this is a relatively casual overview of what I wish were a thorough Academic specialty. I felt similarly with The Life You Can Save when that was released.
Another way of putting this is that I was sort of hoping for an academic work, but instead, think of this more as a journalistic work. It reminds me of Vice Documentaries (which I like a lot) and Malcolm Gladwell (in a nice way), instead of Superforecasting or The Elephant in the Brain. That said, journalistic works have their unique contributions in the literature, it’s just a very different sort of work.
I just read through the book on Audible and don’t have notes. To write a really solid review would take more time than I have now, so instead, I’ll leave scattered thoughts.
1. The main theme of the book is the dichotomy of “The Scout Mindset” vs. “The Soldier Mindset”, and more specifically, why the Scout Mindset is (almost always?) better than the Soldier Mindset. Put differently, we have a bunch of books about “how to think accurately”, but surprisingly few on “you should even try thinking accurately.” Sadly, this latter part has to be stated, but that’s how things are.
2. I was expecting a lot of references to scientific studies, but there seemed to be a lot more text on stories and a few specific anecdotes. The main studies I recall were a very few seemingly small psychological studies, which at this point I’m fairly suspect of. One small note: I found it odd that Elon Musk was described multiple times as something like an exemplar of honesty. I agree with the particular examples pointed to, but I believe Elon Musk is notorious for making explicit overconfident statements.
3. Motivated reasoning is a substantial and profound topic. I believe it already has many books detailing not only that it exists, but why it’s beneficial and harmful in different settings. The Scout Mindset didn’t seem to engage with much of this literature. It argued that “The Scout Mindset is better than the Soldier Mindset”, but that seems like an intense simplification of the landscape. Lies are a much more integral part of society than I think they are given credit for here, and removing them would be a very radical action. If you could go back in time and strongly convince particular people to be atheistic, that could be fatal.
4. The most novel part to me was the last few chapters, on “Rethinking Identity”. This section seems particularly inspired by the blog post Keep Your Identity Small by Paul Graham, but of course, goes into more detail. I found the mentioned stories to be a solid illustration of the key points and will dwell on these more.
5. People close to Julia’s work have heard much of this before, but maybe half or so seemed rather new to me.
6. As a small point, if the theme of the book is about the benefits of always being honest, the marketing seemed fairly traditionally deceiving. I wasn’t sure what to expect from the cover and quotes. I could easily see potential readers getting the wrong impression looking at the marketing materials, and there seems to be little work to directly make the actual value of the book more clear. There’s nothing up front that reads, “This book is aiming to achieve X, but doesn’t do Y and Z, which you might have been expecting.” I guess that Julia didn’t have control over the marketing.

ozziegooen 15 Dec 2021 2:45 UTC
26 points
on: Zvi’s Thoughts on the Survival and Flourishing Fund (SFF)
I liked this post a lot, though of course, I didn’t agree with absolutely everything.
These seemed deeply terrible. If you think the best use of funds, in a world in which we already have billions available, is to go trying to convince others to give away their money in the future, and then hoping it can be steered to the right places, I almost don’t know where to start. My expectation is that these people are seeking money and power,
I’m hesitant about this for a few reasons.
1. Sure, we have a few billion available, and we’re having trouble donating that right now. But we’re also not exactly doing a ton of work to donate our money yet. (This process gave out $10 Million, with volunteers). In the scheme of important problems, a few (~40-200) billion really doesn’t seem like that much to me. Marginal money, especially lots of money, still seems pretty good.
2. My expectation is that these people are seeking money and power → I don’t know which specific groups applied or their specific details. I can say that, in my impression, lots of EAs really just don’t know what else to do. It’s tough to enter research, and we just don’t have that much in terms of “these interventions would be amazing, please someone do them” for longtermism. I’ve seen a lot of orgs get created with something like, “This seems like a pretty safe strategy, it will likely come into use later on, and we already have the right connections to make it happen.” This, combined with a general impression that marginal money is still useful in the long-term, I think could present a more sympathetic take than what you describe.
The default strategy for lots of non-EA entrepreneurs I know has been something like, “Make a ton of money/influence, then try to figure out how to use it for good. Because people won’t listen to me or fund my projects on my own”. I wish more of these people would do direct work (especially in the last few years, when there’s been more money), but can sympathize with that strategy. Arguably, Elon Musk is much better off having started with “less ambitious” ventures like Zip2 and Paypal; it’s not clear if he would have been funded to start with SpaceX/Tesla when he was younger.

All that said, the fact that EAs have so little idea of what exactly is useful seems like a pretty burning problem to me. (This isn’t unique to EAs, to be clear). On the margin, it seems safe to heavily emphasize “figuring stuff out” instead of “making more money, in hopes that we’ll eventually figure stuff out” However, “figuring stuff out” is pretty hard and not nearly as tractable as we’d like it to be.

“I would hire assistance to do at least the following”
I’ve been hoping that the volunteer funders (EA Funds, SFF) would do this for a while now. Seems valuable to at least try out for a while. In general, “funding work” seems really bottlenecked to me, and I’d like to see anything that could help unblock it.

definitely a case of writing a longer letter
I’m impressed by just how much you write on things like this. Do you have any posts outlining your techniques? Is there anything special, like speech-to-text, or do you spend a lot of time on it, or are you just really fast?
What links here?
- Zach Stein-Perlman's comment on Zvi’s Thoughts on the Survival and Flourishing Fund (SFF) by Zvi (EA Forum; 15 Dec 2021 13:30 UTC; 2 points)

ozziegooen 3 Dec 2020 4:50 UTC
26 points
in reply to: habryka’s comment on: The LessWrong 2018 Book is Available for Pre-order
Thanks for the reasoning here. I also don’t want to deter people from purchasing these books; I imagine that if people really wanted, they could write the dates on them manually.

That said—

To better explain my intuitions here:

In 5 years from now, I care about whether the essays came out in 2018 or in 2017 if I am trying to find a particular one in a book, or recommend one to another person. Ordering is really simple to remember compared to other kinds of naming one could use. When going between different books the date is particularly relevant because names and concepts will change over time. I’d hope that 10 years from now much of the 2018 content will look antiquated and old.
If you’re just aiming for “timeless and good quality posts” (this sounds like the value proposition for the readers you are referring to), then I don’t understand the need to only choose ones from 2018. Many good ones came out before 2018 that I imagine would be interesting to readers. That said, if you plan on releasing them on yearly intervals later I’d imagine some restriction might be necessary. Or, it could be that whenever a few topics seem to have come full circle or be in a good place for a book, you publish a book focused on those topics.
I agree that “LessWrong Review 2018” sounds strange, but there are other phrases that could have 2018 in them. Many academic periodicals (including things like Philosophy, which are at least as timeless as LessWrong content) have yearly collections. With those, I don’t assume I need to read all of the old ones before reading the current year; that would take quite a while (it becomes more obvious after a few are out). I imagine the name could be something like “LessWrong Highlighted Content: 2018” or “The Best of LessWrong: 2018”.

It’s very possible that there’s kind of a “free pass” for the first 1-3 years, if this is a repeating thing, and then you could start adding the year. It’s not that big a deal if there are just 2-3 of these, but I imagine it will get to be annoying if there are 5+ (and by that time it will be more obvious if it’s an issue or not)

ozziegooen 16 Oct 2019 18:56 UTC
25 points
on: Introducing Foretold.io: A New Open-Source Prediction Registry
There are a few curated communities you can join and begin predicting in now. Note you must log in to Foretold before accessing these pages.

Amplifying Spot-Checks
Instructions Document

Elizabeth Van Nostrand will be evaluating several statements from the book The Unbound Prometheus. Predict how she will judge these statements. You can earn up to $65 per predicted question.

EA Survey 2019&2020
Instructions Document
Predict questions about the upcoming EA surveys. There are two rounds, with multiple cash prizes each.

Apple Inc. Updates
Predict things about Apple’s new product announcements and stock price.

Slate Star Codex 2019
Scott Alexander made several predictions in the beginning of 2019. Even though 2019 is mostly over, there’s still some uncertainty left.

LessWrong
Forecast the karma of this post, and several other things. Feel free to make new questions for posts or parameters you may be interested in.

ozziegooen 14 Oct 2021 14:32 UTC
24 points
in reply to: Dustin’s comment on: Zoe Curzi’s Experience with Leverage Research
As someone part of the social communities, I can confirm that Leverage was definitely a topic of discussion for a long time around Rationalists and Effective Altruists. That said, often the discussion went something like, “What’s up with Leverage? They seem so confident, and take in a bunch of employees, but we have very little visibility.” I think I experienced basically the same exact conversation about them around 10 times, along these lines.

As people from Leverage have said, several Rationalists/EAs were very hostile around the topic of Leverage, particularly in the last ~4 years or so. (I’ve heard stories of people getting shouted at just for saying they worked at Leverage at a conference). On the other hand, they definitely had support by a few rationalists/EA orgs and several higher-ups of different kinds.

They’ve always been secretive, and some of the few public threads didn’t go well for them, so it’s not too surprising to me that they’ve had a small LessWrong/EA Forum presence.

I’ve personally very much enjoyed mostly staying away from the controversy, though very arguably I made a mistake there.

(I should also note that I had friends who worked at or worked close to Leverage, I attended like 2 events there early on, and I applied to work from there around 6 years ago)

ozziegooen 31 Aug 2019 23:03 UTC
24 points
on: ozziegooen’s Shortform
Questions around Making Reliable Evaluations

Most existing forecasting platform questions are for very clearly verifiable questions:
- “Who will win the next election”
- “How many cars will Tesla sell in 2030?”
But many of the questions we care about are much less verifiable:
- “How much value has this organization created?”
- “What is the relative effectiveness of AI safety research vs. bio risk research?”
One solution attempt would be to have an “expert panel” assess these questions, but this opens up a bunch of issues. How could we know how much we could trust this group to be accurate, precise, and understandable?

The topic of, “How can we trust that a person or group can give reasonable answers to abstract questions” is quite generic and abstract, but it’s a start.

I’ve decided to investigate this as part of my overall project on forecasting infrastructure. I’ve recently been working with Elizabeth on some high-level research.

I believe that this general strand of work could be useful both for forecasting systems and also for the more broad-reaching evaluations that are important in our communities.

Early concrete questions in evaluation quality

One concrete topic that’s easily studiable is evaluation consistency. If the most respected philosopher gives wildly different answers to “Is moral realism true” on different dates, it makes you question the validity of their belief. Or perhaps their belief is fixed, but we can determine that there was significant randomness in the processes that determined it.

Daniel Kahneman apparently thinks a version of this question is important enough to be writing his new book on it.

Another obvious topic is in the misunderstanding of terminology. If an evaluator understands “transformative AI” in a very different way to the people reading their statements about transformative AI, they may make statements that get misinterpreted.

These are two specific examples of questions, but I’m sure there are many more. I’m excited about understanding existing work in this overall space more, and getting a better sense of where things stand and what the next right questions are to be asking.
What links here?
- ozziegooen's comment on How Can People Evaluate Complex Questions Consistently? by Elizabeth (2 Sep 2019 10:50 UTC; 3 points)

ozziegooen 3 Dec 2020 4:08 UTC
23 points
in reply to: habryka’s comment on: The LessWrong 2018 Book is Available for Pre-order
If the main thing that separates this book from the 2019 and 2020 books is that it’s the collection of posts from 2018, it’s counterintuitive to me that that’s not the prominent feature of the title here. Other “journals of the year” often make the year really prominent.

I feel like 5 years from now I’m going to have trouble remembering that “A Map That Reflects the Territory” refers to the 2018 edition, and some other equally elegant but abstract name refers to the 2019 edition.

If you do go with really premium books especially, I’d recommend considering making the date the prominent bit. Honestly I expect to memorize the “lesswrong”ness from the branding (which is distinct), so the year seems like the most important part to me.

That said, I feel like I’m not exactly in the target audience (generally don’t prefer physical books), so it would come down to the preferences of others.

I realize you’ve probably thought about this a lot and have reasons, just giving my 2 cents.

ozziegooen 18 Aug 2020 17:29 UTC
23 points
on: Why haven’t we celebrated any major achievements lately?
I found this article interesting:
https://www.thegentlemansjournal.com/25-iconic-moments-that-define-the-21st-century-thus-far/
It lists several events that caused large celebrations. However, you can notice a pattern:
2008 — Barack Obama wins the 2008 election, becoming the first African American President
2011 — Commandos conduct a raid in Pakistan, which ends with the killing of Osama bin Laden
2012 — The US rover, Curiosity, takes a selfie on Mars
2014 — Malala Yousafzai becomes the youngest ever recipient of a Nobel Prize
2015 — Same-sex marriage is legalised across all fifty states in the USA

Almost all were political or nontechnical.

Personally, I think that most kinds of modern technology are highly incremental, and recently they have been treated with suspicion.

I also could imagine that real technology change has slowed down a fair bit (especially outside of AI), as has been discussed extensively.

ozziegooen 14 Oct 2021 14:42 UTC
22 points
in reply to: ChristianKl’s comment on: Zoe Curzi’s Experience with Leverage Research
As someone who’s been close to these, some had a few related issues, but Leverage seemed much more extreme in many of these dimensions to me.

However, now there are like 50 small EA/rationalist groups out there, and I am legitimately worried about quality control.

ozziegooen 8 Oct 2021 17:38 UTC
22 points
in reply to: prevlev-anon’s comment on: Common knowledge facts about Leverage Research 1.0
+1 for the detail. Right now there’s very little like this explained publicly (or accessible in other ways to people like myself). I found this really helpful.

I agree that the public discussion on the topic has been quite poor.

ozziegooen 18 Oct 2021 20:18 UTC
21 points
in reply to: romeostevensit’s comment on: My experience at and around MIRI and CFAR (inspired by Zoe Curzi’s writeup of experiences at Leverage)
There’s an “EA Mental Health Navigator” now to help people connect to the right care.
https://eamentalhealth.wixsite.com/navigator

I don’t know how good it is yet. I just emailed them last week, and we set up an appointment for this upcoming Wednesday. I might report back later, as things progress.

ozziegooen 7 Sep 2021 19:07 UTC
20 points
in reply to: frontier64’s comment on: I read “White Fragility” so you don’t have to (but maybe you should)
I’m really sorry if I hurt or offended you. I assumed that a brief description of where I was at would be preferred to not replying at all. I clearly was incorrect about that.
I disagree with some of your specific implications. I’m fairly sure though that you’d disagree with my responses. I could easily imagine that you’ve already predicted them, well enough, and wouldn’t find them very informative, particularly for what I could write in a few sentences.
This isn’t unusual for me. I try to stay out of almost all online discussion. I have things to do, I’m sure you have things to do as well. Online discussion is costly, and it’s especially costly when people know very little about each other[1], and the conversation topic (White Fragility) is as controversial as this one is.
[1]: I know almost nothing about you. I feel like I’d have a very difficult time feeling comfortable saying things in ways I can predict you’d be receptive to, or things that you wouldn’t actively attack me for. I find that I’ve had a difficult time modeling people online; particularly people who I barely know. This could easily lead to problems of several different kinds. It’s very, very possible that none of this applies to you, but it would take a fair amount of discussion for me to find that out and feel safe with my impressions of you. This also applies for all the other people I don’t know, but who might be watching this conversation or jump in at any point.

ozziegooen 16 Oct 2021 6:29 UTC
17 points
in reply to: Viliam’s comment on: Zoe Curzi’s Experience with Leverage Research
I very much agree about the worry. My original comment was to make the easiest case quickly, but I think more extensive cases apply too. For example, I’m sure there have been substantial problems even in the other notable orgs, and in expectation we should expect there to continue to be so. (I’m not saying this based on particular evidence about these orgs, more that the base rate for similar projects seems bad, and these orgs don’t strike me as absolutely above these issues.)

One solution (of a few) that I’m in favor of is to just have more public knowledge about the capabilities and problems of orgs.

I think it’s pretty easy for orgs of about any quality level to seem exciting to new people and recruit them or take advantage of them. Right now, some orgs have poor reputations among those “in the know” (generally for producing poor quality output), but this isn’t made apparent publicly.[1] One solution is to have specialized systems that actually present negative information publicly; this could be public rating or evaluation systems.

This post by Nuno was partially meant as a test for this:

https://forum.effectivealtruism.org/posts/xmmqDdGqNZq5RELer/shallow-evaluations-of-longtermist-organizations

Another thing to do, of course, would be to just do some amounts of evaluation and auditing of all these efforts, above and beyond what even those currently “in the know” have. I think that in the case of Leverage, there really should have been some deep investigation a few years ago, perhaps after a separate setup to flag possible targets of investigation. Back then things were much more disorganized and more poorly funded, but now we’re in a much better position for similar efforts going forward.

[1] I don’t particularly blame them, consider the alternative.

ozziegooen 3 Jan 2020 16:19 UTC
17 points
on: CFAR Participant Handbook now available to all
I’m really happy to see this become public! Personally, I find PDFs nicer than paper books for multiple reasons (can listen to, can annotate and keep easier).

Was there anything in particular that convinced the team to make it public at this point?

ozziegooen 24 Dec 2019 0:13 UTC
16 points
on: ozziegooen’s Shortform
Experimental predictability and generalizability are correlated

A criticism of having people attempt to predict the results of experiments is that this will be near impossible. The idea is that experiments are highly sensitive to parameters and these would need to be deeply understood in order for predictors to have a chance at being more accurate than an uninformed prior. For example, in a psychological survey, it would be important that the predictors knew the specific questions being asked, details about the population being sampled, many details about the experimenters, et cetera.

One counter-argument may not be to say that prediction will be easy in many cases, but rather that if these experiments cannot be predicted in a useful fashion without very substantial amounts of time, then these experiments probably aren’t going to be very useful anyway.

Good scientific experiments produce results that are generalizable. For instance, a study on the effectiveness of a malaria intervention on one population should give us useful information (probably for use with forecasting) about its effectiveness on other populations. If it doesn’t, then its value would be limited. It would really be more of a historic statement than a scientific finding.

Possible statement from a non-generalizable experiment:

“We found that intervention X was beneficial within statistical significance for a population of 2,000 people. That’s interesting if you’re interested in understanding the histories of these 2,000 people. However, we wouldn’t recommend inferring anything about this to other groups of people, or to understanding anything about these 2,000 people going forward.”

Formalization

One possible way of starting to formalize this a bit is to imagine experiments (assuming internal validity) as mathematical functions. The inputs would be the parameters and details of how the experiment was performed, and the results would be the main findings that the experiment found.

$\text{experiment}_{n}(\text{inputs}) = \text{findings}$

If the experiment has internal validity, then observers should predict that if an identical (but subsequent) experiment were performed, it would result in identical findings. $p\big(\text{experiment}_{n+1}(\text{inputs}_{i}) = \text{findings}_{i} \mid \text{experiment}_{n}(\text{inputs}_{i}) = \text{findings}_{i}\big) = 1$

We could also say that if we took a probability distribution of the chances of every possible set of findings being true, the differential entropy of that distribution would be 0, as smart forecasters would recognize that $\text{findings}_{i}$ is correct with ~100% probability. $H\big(\text{experiment}_{n+1}(\text{inputs}_{i}) \mid \text{experiment}_{n}(\text{inputs}_{i}) = \text{findings}_{i}\big) \approx 0$
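As a rough illustration of the entropy statement above, here is a minimal Python sketch. It assumes a discrete set of three possible findings and uses Shannon entropy in bits (rather than differential entropy, which applies to continuous distributions). The probabilities are hypothetical numbers chosen for illustration, not data from any experiment.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy (in bits) of a discrete distribution over possible findings."""
    total = sum(probabilities)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing to the entropy, so they are skipped.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical forecasts over three possible findings for experiment n+1.
# Before seeing experiment n, forecasters are maximally unsure.
before_seeing_experiment_n = [1 / 3, 1 / 3, 1 / 3]
# After seeing experiment n (same inputs), nearly all mass sits on findings_i.
after_seeing_experiment_n = [0.998, 0.001, 0.001]

entropy_before = shannon_entropy(before_seeing_experiment_n)
entropy_after = shannon_entropy(after_seeing_experiment_n)

print(f"entropy before: {entropy_before:.3f} bits")
print(f"entropy after:  {entropy_after:.3f} bits")
```

In this toy setup, internal validity corresponds to the second distribution collapsing onto a single outcome, which drives the entropy towards zero.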

Generalizability

Now, for an experiment to be generalizable, we would hope to be able to perturb the inputs in a minor way and still have the entropy be low. Note that the important thing is not that the outputs stay unchanged, but rather that they remain predictable. For instance, a physical experiment that describes the basics of mechanical velocity may be performed on data with velocities of 50-100 miles/hour. This experiment would be useful not only if future experiments also described situations with similar velocities, but more broadly if future experiments on velocity could be better predicted, no matter the specific velocities used.

We can describe a perturbation of $\text{inputs}_{i}$ to be $\text{inputs}_{i} + \delta$.

Thus, hopefully, the following will be true for low values of $\delta$.

$H\big(\text{experiment}_{n+1}(\text{inputs}_{i} + \delta) \mid \text{experiment}_{n}(\text{inputs}_{i}) = \text{findings}_{i}\big) \approx 0$

So, perhaps generalizability can be defined something like,

Generalizability is the ability for predictors to better predict the results of similar experiments upon seeing the results of a particular experiment, for increasingly wide definitions of “similar”.
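One way to make this definition a bit more concrete is the toy sketch below. It assumes (purely for illustration) that a forecaster's predictive distribution for the perturbed experiment is Gaussian, and that its standard deviation grows linearly with the size of the perturbation. The function names and the `base_sigma` and `sensitivity` parameters are hypothetical. Note that differential entropy can be negative, so the comparison between experiments matters more than the absolute values.

```python
import math


def gaussian_differential_entropy(sigma):
    """Differential entropy (in nats) of a normal distribution with std `sigma`."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def predictive_entropy(delta, base_sigma, sensitivity):
    """Entropy of the forecast for experiment n+1 run at inputs_i + delta.

    Assumed toy model: forecaster uncertainty grows linearly with |delta|.
    """
    sigma = base_sigma + sensitivity * abs(delta)
    return gaussian_differential_entropy(sigma)


# Increasingly wide definitions of "similar".
deltas = [0.0, 0.5, 1.0, 2.0]

# A generalizable experiment: uncertainty barely grows as inputs are perturbed.
generalizable = [predictive_entropy(d, base_sigma=0.1, sensitivity=0.05) for d in deltas]
# A fragile experiment: uncertainty grows quickly as inputs are perturbed.
fragile = [predictive_entropy(d, base_sigma=0.1, sensitivity=2.0) for d in deltas]

for d, g, f in zip(deltas, generalizable, fragile):
    print(f"delta={d:>4}: generalizable={g:+.2f} nats, fragile={f:+.2f} nats")
```

Under these assumptions, both experiments are equally predictable at $\delta = 0$, but only the generalizable one keeps entropy low as the perturbation widens.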

Predictability and Generalizability

I could definitely imagine trying to formalize predictability better in this setting, or more specifically, formalize the concept of “do forecasters need to spend a lot of time understanding the parameters of an experiment.” In this case, that could look something like modeling how the amount of uncertainty forecasters have about the inputs correlates with their uncertainty about the outputs.

The general combination of predictability and generality would look something like adding an additional assumption:

If forecasters require a very high degree of information on the inputs to an experiment in order to predict its outputs, then it’s less likely they can predict (with high confidence) the results of future experiments with significant changes, once they see the results of said experiment.

Admittedly, this isn’t using the definition of predictability that people are likely used to, but I imagine it correlates well enough.
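The idea of modeling how uncertainty about the inputs relates to uncertainty about the outputs could be sketched with a small Monte Carlo simulation. Everything here is an assumption for illustration: the two "experiments" are hypothetical linear functions, and forecaster uncertainty about the inputs is modeled as a Gaussian.

```python
import random
import statistics


def output_spread(experiment, input_mean, input_sd, n_samples=10_000, seed=0):
    """Standard deviation of findings when the inputs are only known up to input_sd."""
    rng = random.Random(seed)  # fixed seed so both experiments see the same input draws
    outputs = [experiment(rng.gauss(input_mean, input_sd)) for _ in range(n_samples)]
    return statistics.stdev(outputs)


def robust_experiment(x):
    # Hypothetical experiment whose findings barely depend on the exact inputs.
    return 1.0 + 0.1 * x


def sensitive_experiment(x):
    # Hypothetical experiment whose findings depend strongly on the exact inputs.
    return 1.0 + 10.0 * x


for input_sd in (0.1, 1.0):
    robust = output_spread(robust_experiment, input_mean=0.0, input_sd=input_sd)
    sensitive = output_spread(sensitive_experiment, input_mean=0.0, input_sd=input_sd)
    print(f"input sd={input_sd}: robust spread={robust:.3f}, sensitive spread={sensitive:.3f}")
```

In this sketch, forecasters of the sensitive experiment need far more precise knowledge of the inputs to reach the same output uncertainty, which is the situation the assumption above describes.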

Final Thoughts

I’ve been experimenting more with trying to formalize concepts like this. As such, I’d be quite curious to get any feedback on this work. I am a bit torn; on one hand I appreciate formality, but on the other this is decently messy and I’m sure it will turn off many readers.

ozziegooen 14 Jun 2018 2:27 UTC
16 points
on: On the Chatham House Rule
I’m curious, what kinds of events follow Chatham House Rules? I’ve never heard of them until now. Is it just official ones from the Chatham House itself, or have other organizations been using them?

ozziegooen 2 Nov 2021 8:52 UTC
15 points
in reply to: habryka’s comment on: Zoe Curzi’s Experience with Leverage Research
A few quick thoughts:

1) This seems great, and I’m impressed by the agency and speed.
2) From reading the comments, it seems like several people were actively afraid of how Leverage could retaliate. I imagine similar for accusations/whistleblowing for other organizations. I think this is both very, very bad, and unnecessary; as a whole, the community is much more powerful than individual groups, so it seems poorly managed when the community is scared of a specific group. Resources should be spent to cancel this out.
In light of this, if more money were available, it seems easy to justify a fair bit more. Or even better could be something like, “We’ll help fund lawyers in case you’re attacked legally, or anti-harassing teams if you’re harassed or trolled”. This is similar to how the EFF helps with cases from small people/groups being attacked by big companies.
I don’t mean to complain; I think any steps here, especially ones taken so quickly, are fantastic.
3) I’m afraid this will get lost in this comment section. I’d be excited about a list of “things to keep in mind” like this to be repeatedly made prominent somehow. For example, I could imagine that at community events or similar, there could be necessary papers like, “Know your rights, as a Rationalist/EA”, which flags how individuals can report bad actors and behavior.
4) Obviously a cash prize can encourage lying, but I think this can be decently managed. (It’s a small community, so if there’s good moderation, $15K would be very little compared to the social stigma that would come if you were found out to have destructively lied for $15K.)

ozziegooen 2 Dec 2020 18:48 UTC
15 points
on: The LessWrong 2018 Book is Available for Pre-order
The books look very pretty, nice work.

Is this content from 2018 specifically, or is it taken from all of historic LessWrong? My impression was that this was from the 2018 review, but I don’t see anything about that in the description above.

If it is from the 2018 review, do you have ideas on how you will differentiate the 2019/2020/etc versions?
