# DavidHolmes

Karma: 32
• Sure, in the end we only really care about what comes top, as that’s the thing we choose. My feeling is that information on (relative) strengths of preferences is often available, and when it is available it seems to make sense to use it (e.g. allowing circumvention of Arrow’s theorem).

In particular, I worry that, when we only have ordinal preferences, the outcome of attempts to combine various preferences will depend heavily on how finely we divide up the world; by using information on strengths of preferences we can mitigate this.

• (actually, my formula doubles the numbers you gave)

Are you sure? Suppose we take with , , then , so the values for should be as I gave them. And similarly for , giving values . Or else I have misunderstood your definition?

> I’d simply see that as two separate partial preferences

Just to be clear, by “separate partial preference” you mean a separate preorder, on a set of objects which may or may not have some overlap with the objects we considered so far? Then somehow the work is just postponed to the point where we try to combine partial preferences?

EDIT (in reply to your edit): I guess e.g. keeping conditions 1, 2, 3 the same and instead minimising

where is proportional to the reciprocal of the strength of the preference? Of course there are lots of variants on this!
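In case it helps to have the lost display written out: the objective might look something like the following (this is purely my guess at a variant consistent with the description above; the notation $s(x,y)$ for the strength of the preference $x \prec y$ is mine, not from the post):

$$\min_u \; \sum_{x \prec y} \frac{1}{s(x,y)} \bigl(u(y) - u(x) - 1\bigr)^2,$$

so that strongly held preferences are penalised less for stretching the gap $u(y)-u(x)$ beyond one unit.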

• This seems really neat, but it seems quite sensitive to how one defines the worlds under consideration, and to whether one counts slightly different worlds as actually distinct. Let me try to illustrate this with an example.

Suppose we have a consisting of 7 worlds, , with preferences
and no other non-trivial preferences. Then (from the ‘sensible case’), I think we get the following utilities:

.

Suppose now that I create two new copies , of the world which each differ by the position of a single atom, so as to give me (extremely weak!) preferences , so all the non-trivial preferences in the new are now summarised as

Then the resulting utilities are (I think):

.

In particular, before adding in these ‘trivial copies’ we had , and now we get . Is this a problem? It depends on the situation, but to me it suggests that, if using this approach, one needs to be careful about how the worlds are specified, and the ‘fine-grainedness’ needs to be roughly the same everywhere.

• Thanks! I like the way your optimisation problem handles non-closed cycles.

I think I’m less comfortable with how it treats disconnected components: as I understand it, you just translate each separately to have ‘centre of mass’ at 0. If one wants to get a utility function out at the end, one has to make some kind of choice in this situation, and the choice you make is probably the best one, so in that sense it seems very good.

But for example it seems vulnerable to creating ‘virtual copies’ of worlds in order to shift the centre of mass and push connected components one way or the other. That was what started me thinking about including strength of preference: if one adds to your setup a bunch of virtual copies of a world between which one is ‘almost indifferent’, then it seems it will shift the centre of mass, and thus the utility relative to some other chain. Of course, if one is actually indifferent then the ‘virtual copies’ will be collapsed to a single point in your , but if they are just extremely close then it seems it will affect the utility relative to some other chain. I’ll try to explain this more clearly in a comment to your post.
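To make the ‘virtual copies’ worry concrete, here is a toy sketch (my own reconstruction, not the formulas from the post: I assume each strict preference in a chain contributes one utility unit, and each connected component is then translated to have ‘centre of mass’ 0):

```python
def chain_utilities(chains):
    """Toy version of the centre-of-mass normalisation.

    Each chain lists worlds from least to most preferred.  Only the
    ordering is known, so consecutive worlds sit one utility unit
    apart; each chain is then translated so its mean utility is 0.
    """
    utilities = {}
    for chain in chains:
        mean_rank = (len(chain) - 1) / 2
        for rank, world in enumerate(chain):
            utilities[world] = rank - mean_rank
    return utilities

# Two disconnected components: w1 < w2 < w3 and w4 < w5 < w6 < w7.
before = chain_utilities([["w1", "w2", "w3"],
                          ["w4", "w5", "w6", "w7"]])

# Add two virtual copies of w7, each preferred by an extremely weak
# margin (w7 < w7a < w7b); ordinally this just lengthens the chain.
after = chain_utilities([["w1", "w2", "w3"],
                         ["w4", "w5", "w6", "w7", "w7a", "w7b"]])

print(before["w7"] - before["w3"])  # 0.5: w7 beats w3...
print(after["w7"] - after["w3"])    # -0.5: ...now w3 beats w7
```

The near-indifferent copies drag the centre of mass of the second component, flipping the cross-component comparison between w7 and w3 even though no genuine preference changed.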

• Thanks for the comment, Charlie.

> If I am indifferent to a gamble with a probability of ice cream, and a probability 0.8 of chocolate cake and 0.2 of going hungry

To check I understand correctly, you mean the agent is indifferent between the gambles (probability of ice cream) and (probability 0.8 of chocolate cake, probability 0.2 of going hungry)?

If I understand correctly, you’re describing a variant of Von Neumann–Morgenstern where, instead of giving preferences among all lotteries, you’re specifying a certain collection of pairs of lotteries of a special type between which the agent is indifferent, together with a sign to say in which ‘direction’ things become preferred? It seems likely to me then that the data you give can be used to reconstruct preferences between all lotteries...

If one is given information in the form you propose, but only for an ‘incomplete’ set of special triples (c.f. ‘weak preferences’ above), then one can again ask whether and in how many ways it can be extended to a complete set of preferences. It feels to me as if there is an extra ambiguity coming in with your description: for example, if the set of possible outcomes has 6 elements and I am given the value of the Betterness function on two disjoint triples, then to generate a utility function I have to choose not only a ‘translation’ between the two triples, but also a scaling. But maybe this is better/more realistic!

By ‘special types’, I mean indifference between pairs of gambles of the form
(probability of A) vs (probability of B and probability of C)
for some , and possible outcomes A, B, C. Then the sign says that I prefer a higher probability of B (say).
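As a sketch of how such data could pin down the preferences (hypothetical code and outcome names of my own, using Charlie’s ice-cream example): each indifference of the special type gives a linear constraint of the shape u(A) = p·u(B) + (1−p)·u(C), and with enough triples the utilities are determined once the affine freedom (the ‘translation’ and ‘scaling’ above) is fixed.

```python
import numpy as np

def utilities_from_indifferences(outcomes, triples, anchors):
    """Recover a utility function from special-type indifference data.

    Each triple (A, p, B, C) encodes the indifference
        (A for sure)  ~  (B with probability p, C with probability 1-p),
    i.e. the linear constraint u(A) - p*u(B) - (1-p)*u(C) = 0.
    `anchors` removes the affine freedom by pinning two utilities.
    """
    idx = {o: i for i, o in enumerate(outcomes)}
    rows, rhs = [], []
    for a, p, b, c in triples:
        row = np.zeros(len(outcomes))
        row[idx[a]] += 1.0
        row[idx[b]] -= p
        row[idx[c]] -= 1.0 - p
        rows.append(row)
        rhs.append(0.0)
    for o, value in anchors.items():
        row = np.zeros(len(outcomes))
        row[idx[o]] = 1.0
        rows.append(row)
        rhs.append(value)
    u, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return dict(zip(outcomes, u))

# Charlie's example: (ice cream for sure) ~ (0.8 cake, 0.2 hungry),
# with the translation/scaling fixed by u(hungry)=0, u(cake)=1.
u = utilities_from_indifferences(
    ["hungry", "ice cream", "cake"],
    [("ice cream", 0.8, "cake", "hungry")],
    {"hungry": 0.0, "cake": 1.0},
)
print(round(u["ice cream"], 6))  # 0.8
```

With fewer triples than needed to connect all outcomes, the least-squares system is underdetermined, which is exactly the translation-plus-scaling ambiguity described above.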

• Thanks for pointing me to this updated version :-). This seems a really neat trick for writing down a utility function that is compatible with the given preorder. I thought a bit more about when/to what extent such a utility function will be unique, in particular if you are given not only the data of a preorder, but also some information on the strengths of the preferences. This ended up a bit too long for a comment, so I wrote a few things in outline here:

https://www.lesswrong.com/posts/7ncFy84ReMFW7TDG6/categorial-preferences-and-utility-functions

It may be quite irrelevant to what you’re aiming for here, but I thought it was maybe worth writing down just in case.

# Categorial preferences and utility functions

9 Aug 2019 21:36 UTC
9 points
• Hi Stuart,

I’m working my way through your ‘Research Agenda v0.9’ post, and am therefore going through various older posts to understand things. I wonder if I could ask some questions about the definition you propose here?

First, that be contained in for some seems not so relevant; can I just assume X, Y and Z are some manifolds ( for some )? And we are given some partial order on X, so that we can refer to being a ‘better world’?

Then, as I understand it, your definition says the following:

Fix X, and Z. Let Y be a manifold and , . Given a local homeomorphism , we say that is partially preferred to if for all , we have .

I’m not sure which inequalities should be strict, but this seems non-essential for now. On the other hand, the dependence of this definition on the choice of Y seems somewhat subtle and interesting. I will try to illustrate this in what follows.

First, let us make a new definition. Fix X, , and Z as before. Let , a two-element set equipped with the discrete topology, and let be an immersion of -manifolds. We say that is weakly partially preferred to if for all , we have .

First, it is clear that partial preference implies weak partial preference. More formally:

Claim 1: Fix X, and Z. Suppose we have a manifold Y, points , , and a local homeomorphism such that is partially preferred to . Setting with the subspace topology from (i.e. discrete), and taking to be the restriction of from to , we have that is weakly partially preferred to .

Proof: obvious. ∎

However, the converse can fail if Z is not contractible. First, let’s prove that the concepts are equivalent for Z contractible:

Claim 2: Fix X, and Z, and assume that Z is contractible. Suppose we have a two-element set and a map making weakly partially preferred to . Then there exist a manifold Y, an injection , and a local homeomorphism whose restriction to is , making partially preferred to .

Proof: Let’s assume for simplicity of notation that X is equidimensional, say of dimension , and write for the dimension of Z. Let Y be the disjoint union of two open balls of dimension , with the inclusion of the centres of the balls. Then take an -neighbourhood of Z in X; it is diffeomorphic to since the normal bundle to Z in X is trivialisable (c.f. https://math.stackexchange.com/questions/857784/product-neighborhood-theorem-with-boundary). ∎

If we want examples where weak partial preference and partial preference don’t coincide, we should look for an example where Z is not contractible and its normal bundle in X is not trivialisable.

Example 3: Let X be the disjoint union of two Möbius bands, and let Z be a circle. Note that including Z along the centre of either band gives a submanifold whose tubular neighbourhood is not a product. Assume that is such that one component of X is preferred to the other (and is indifferent within each connected component). Then take , and to be the inclusion of the two circles along the centres of the two Möbius bands, such that ends up in the preferred band. This yields a situation where is weakly partially preferred to , but the conclusion of Claim 2 fails, i.e. this cannot be extended to a partial preference for over .

What conclusion should we draw from this? To me, it suggests that the notion of partial preference is not yet quite as one would want. In the setting of Example 3, where X consists of two Möbius strips, one of which is preferred to the other, landing in the preferred strip should surely be preferred to landing in the un-preferred strip?! And yet the ‘local homeomorphism from a product’ condition gets in the way. This example is obviously quite artificial, and maybe analogous things cannot occur in reality. But I’m not so happy with this as an answer, since our approaches to AI safety should be (so far as possible) robust against the flaws in our understanding of physics.

Apologies for the overly long comment, and for the imperfect LaTeX (I’ve not used this type of form much before).

• Thanks for the reply, Zack.

> The reason this objection doesn’t make the post completely useless...

Sorry, I hope I didn’t suggest I thought that! You make a good point about some variables being more natural in given applications. I think it’s good to keep in mind that sometimes it’s just a matter of coordinate choice, and other times the points may be separated, but not in a linear way.

• Hi Zack,

Can you clarify something? In the picture you draw, there is a codimension-1 linear subspace separating the parameter space into two halves, with all red points to one side and all blue points to the other. Projecting onto any 1-dimensional subspace orthogonal to this (there is a unique one through the origin) will thus yield a ‘variable’ which cleanly separates the points into the red and blue categories. So in the illustrated example, it looks just like a problem of bad coordinate choice.

On the other hand, one can easily have much more pathological situations; for example, the red points could all lie inside a certain sphere and the blue points outside it. Then no choice of linear coordinates will illustrate this, and one has to use more advanced analysis techniques to pick up on it (e.g. persistent homology).
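A minimal sketch of this second situation (toy data of my own, not from the post): red points on a sphere of radius 0.5, blue points on a sphere of radius 1.5. The nonlinear feature ‘distance from the origin’ separates them perfectly, while no projection onto a single direction can.

```python
import numpy as np

# Toy data: red points on a sphere of radius 0.5, blue on radius 1.5,
# so red lies 'inside' blue and no hyperplane separates the classes.
axes = np.eye(3)
red = np.concatenate([0.5 * axes, -0.5 * axes])
blue = np.concatenate([1.5 * axes, -1.5 * axes])

def linearly_separable(a, b, direction):
    """Does some threshold on the projection onto `direction` split a from b?"""
    pa, pb = a @ direction, b @ direction
    return pa.max() < pb.min() or pb.max() < pa.min()

# The nonlinear feature 'distance from the origin' separates cleanly:
print(np.linalg.norm(red, axis=1).max()
      < np.linalg.norm(blue, axis=1).min())  # True

# But no projection onto a line separates: blue surrounds red.
for d in [np.array([1.0, 0.0, 0.0]),
          np.array([1.0, 1.0, 1.0]),
          np.array([0.3, -0.7, 0.2])]:
    print(linearly_separable(red, blue, d))  # False each time
```

Detecting this kind of nested structure automatically is exactly where tools like persistent homology come in.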

So, to my vague question: do you have only the first situation in mind, or are you also considering the general case, but made the illustrated example extra-simple?

Perhaps this is clarified by your numerical example; I’m afraid I’ve not checked.