For the orthogonal decomposition, don't you need two scalars, i.e. $x = ay + bz$? For example, in $\mathbb{R}^2$, let $x=(2,2)$, $y=(0,1)$, $z=(1,0)$. Then $x = 2y + 2z$, and there's no way to write $x$ as $y + az$.
My favorite book, by far, is Functional Programming in Scala. This book has you derive most of the concepts from scratch, to the point where even complex abstractions feel like obvious consequences of things you’ve already built.
If you want something more Haskell-focused, a good choice is Programming in Haskell.
I didn’t downvote, but I agree that this is a suboptimal meme – though the prevailing mindset of “almost nobody can learn Calculus” is much worse.
As a datapoint, it took me about two weeks of obsessive, 15-hour/day study to learn Calculus well enough to test out of the first two courses when I was 16. And I think it's fair to say I was unusually talented and unusually motivated. I would not expect the vast majority of people to be able to grok Calculus within a week, though obviously people on this site are not a representative sample.
A good exposition of the related theorems is in Chapter 6 of Understanding Machine Learning (https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ref=sr_1_1?crid=2MXVW7VOQH6FT&keywords=understanding+machine+learning+from+theory+to+algorithms&qid=1562085244&s=gateway&sprefix=understanding+machine+%2Caps%2C196&sr=8-1)
There are several related theorems. Roughly:
1. The error on real data will be close to the error on the training set, within an epsilon that shrinks roughly like the square root of (VC dimension / number of datapoints). This is the one I linked above.
2. The error on real data will be close to the error of the best hypothesis in the hypothesis class, with a similar epsilon.
3. Special case of 2: if the true hypothesis is in the hypothesis class, then the absolute error will be < epsilon (since the absolute error is just the difference from the true, best hypothesis).
3 is probably the one you’re thinking of, but you don’t need the hypothesis to be in the class.
Yes, roughly speaking, if you multiply the VC dimension by n, then you need about n times as much training data to achieve the same performance. (More precise statement here: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension#Uses) There are also a few other bounds you can get based on VC dimension. In practice these bounds are far too loose to be useful, but an algorithm with a much higher VC dimension will generally overfit more.
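As a rough illustration, here's a minimal sketch of how one standard form of the VC bound scales. The exact constants and log factors vary by textbook, so treat this as the classic Vapnik-style shape rather than a precise quote of any one source:

```python
import math

def vc_generalization_gap(n, d, delta=0.05):
    """One standard VC bound: with probability >= 1 - delta, the gap between
    test error and training error is at most roughly
    sqrt((d * (ln(2n/d) + 1) + ln(4/delta)) / n).
    Constants differ across sources; the scaling in d and n is the point."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

# Doubling the VC dimension roughly requires doubling the data
# to get back to the same guaranteed gap:
print(vc_generalization_gap(n=10_000, d=50))   # ~0.19
print(vc_generalization_gap(n=10_000, d=100))  # ~0.25 (looser)
print(vc_generalization_gap(n=20_000, d=100))  # ~0.19 again
```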
A different view is to look at the search process for the models, rather than the model itself. If model A is found from a process that evaluates 10 models, and model B is found from a process that evaluates 10,000, and they otherwise have similar results, then A is much more likely to generalize to new data points than B.
The formalization of this concept is called VC dimension and is a big part of Machine Learning Theory (although arguably it hasn’t been very helpful in practice): https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension
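A quick way to see the search-process effect is to simulate it on pure noise (a hypothetical toy setup, not anyone's actual pipeline). Every "model" below has true accuracy 50%, but the winner of a larger search looks much better in-sample:

```python
import random

def best_backtest_score(num_models, num_points=100, seed=0):
    """Each 'model' predicts pure-noise labels, so its in-sample score is
    just luck. Return the best in-sample score found by the search."""
    rng = random.Random(seed)
    labels = [rng.random() < 0.5 for _ in range(num_points)]
    best = 0.0
    for _ in range(num_models):
        preds = [rng.random() < 0.5 for _ in range(num_points)]
        score = sum(p == y for p, y in zip(preds, labels)) / num_points
        best = max(best, score)
    return best

# True out-of-sample accuracy is 50% for every model, yet:
print(best_backtest_score(10))      # ~0.55-0.60, depending on seed
print(best_backtest_score(10_000))  # ~0.65-0.70, depending on seed
```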
It’s a combination. The point is to throw out algorithms/parameters that do well on backtests even when their underlying assumptions are violated, because those are much more likely to be overfit.
As an example, consider a strategy like “on Wednesdays, the market is more likely to have a large move, and signal XYZ predicts big moves accurately.” You can encode that as an algorithm: trade signal XYZ on Wednesdays. But the algorithm might make money on backtests even if the assumptions are wrong! By examining the individual components rather than just whether the algorithm made money, we get a better idea of whether the strategy works.
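A hedged sketch of what examining the individual components might look like in code; the function names, data shapes, and threshold are all made up for illustration:

```python
import statistics

def wednesdays_move_more(daily_moves, weekdays):
    """Assumption 1: average absolute move on Wednesdays exceeds other days."""
    wed = [abs(m) for m, d in zip(daily_moves, weekdays) if d == "Wed"]
    rest = [abs(m) for m, d in zip(daily_moves, weekdays) if d != "Wed"]
    return statistics.mean(wed) > statistics.mean(rest)

def signal_fires_before_big_moves(signal, daily_moves, threshold):
    """Assumption 2: XYZ is stronger, on average, on days with large moves."""
    big = [s for s, m in zip(signal, daily_moves) if abs(m) >= threshold]
    small = [s for s, m in zip(signal, daily_moves) if abs(m) < threshold]
    return statistics.mean(big) > statistics.mean(small)

# Only keep the combined strategy if each assumption validates on its own;
# a profitable backtest whose components fail is a red flag for overfitting.
```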
Yes, avoiding overfitting is the key problem, and you should expect almost anything to be overfit by default. We spend a lot of time on this (I work w/Alexei). I’m thinking of writing a longer post on preventing overfitting, but these are some key parts:
Theory. Something that makes economic sense, or has worked in other markets, is more likely to work here
Components. A strategy made of 4 components, each of which can be independently validated, is a lot more likely to keep working than one black box
Measuring strategy complexity. If you explore 1,000 possible parameter combinations, the winner is less likely to keep working than if you explore 10 (see the sketch after this list).
Algorithmic decision making. Any manual part of the process introduces a lot of possibilities for overfitting.
Abstraction & reuse. The more you reuse things, the fewer degrees of freedom you have with each idea, and therefore the lower your chance of overfitting.
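On the strategy-complexity point, here's a minimal sketch of one way to formalize it, a plain Bonferroni-style correction for the number of configurations searched (illustrative, not necessarily what we use):

```python
def adjusted_significance(p_value, configs_tried):
    """Bonferroni-style correction: a backtest that looks significant after
    trying many parameter combinations must clear a much stricter bar."""
    return min(1.0, p_value * configs_tried)

# The same in-sample p = 0.01 survives a search over 10 configurations
# but is meaningless after a search over 1,000:
print(adjusted_significance(0.01, 10))     # 0.1
print(adjusted_significance(0.01, 1_000))  # 1.0
```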
I think the prior that bubbles usually pop is incorrect. We tend to call something a bubble in retrospect, after it’s popped.
But if you try to define bubbles with purely forward-looking measures, like a period of unusually high growth, such periods are more frequently followed by periods of unusually slow growth than by rapid decline. For example, Amazon's stock would pass just about any test of a bubble at most points in its history.
I expect something similar with education: spending will likely remain high but grow more slowly than it did over the last 20 years. That's especially true because of the structure of student loans: people can't really just default.
But to answer the more direct question: assuming that there is a rapid drop in education spending, how could we profit from it? Vocational schools seem like the most obvious bet, e.g. training programmers, dental assistants, massage therapists, electricians, and so on.
Certification services that manage to develop a reputation will benefit as well; e.g. Salesforce certifications are pretty valuable.
You could directly short lenders such as Sallie Mae.
Recruitment agencies that specialize in placing recent college graduates will likely suffer.
Management consulting firms rely heavily on college graduates, and so do hedge funds to a lesser extent.
Assume WLOG $f(x) \ge x$. Then by monotonicity, we have $x \le f(x) \le f^2(x) \le \dots \le f^{|P|}(x)$. If all of these inequalities were strict, we would have $|P|+1$ distinct elements. Thus there must be some $k$ such that $f^k(x) = f^{k+1}(x)$. By induction, $f^{n+1}(x) = f^n(x) = f^k(x)$ for all $n > k$.
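A tiny computational illustration of the finite case, using a made-up monotone map on the chain $\{0, \dots, 10\}$:

```python
def iterate_to_fixed_point(f, x):
    """Iterate a monotone f from a point x with f(x) >= x; on a finite
    poset the chain x <= f(x) <= f^2(x) <= ... must stabilize."""
    while f(x) != x:
        x = f(x)
    return x

# f(n) = min(n + 2, 7) is monotone on {0, ..., 10} and f(0) >= 0;
# iterating from 0 reaches the fixed point 7 in a few steps.
print(iterate_to_fixed_point(lambda n: min(n + 2, 7), 0))  # 7
```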
Assume $f(x) \ge x$ and construct a chain similarly to (6), indexed by the elements of $\alpha$. If all the inequalities were strict, we would have an injection from $\alpha$ to $L$.
Let $F$ be the set of fixed points. Any subset $S$ of $F$ must have a least upper bound $x$ in $L$. If $x$ is a fixed point, we're done. Otherwise, consider $f^\alpha(x)$, which must be a fixed point by (7). For any $q \in S$, we have $q \le x \Rightarrow f^\alpha(q) \le f^\alpha(x) \Rightarrow q \le f^\alpha(x)$, since $f^\alpha(q) = q$ for a fixed point. Thus $f^\alpha(x)$ is an upper bound of $S$ in $F$. To see that it is the least upper bound, assume we have some other upper bound $b$ of $S$ in $F$. Then $x \le b \Rightarrow f^\alpha(x) \le f^\alpha(b) = b$.
To get the greatest lower bound, note that we can flip the inequalities in $L$ and still have a complete lattice.
$P(A)$ clearly forms a complete lattice, where the least upper bound of any set of subsets is their union and the greatest lower bound is their intersection.
To see that an injection gives a monotonic map on subsets, assume $A_0 \subseteq A_1$ and $f$ is an injection. For any function, $f(A_0) \subseteq f(A_1)$. For strictness: if $a \notin A_0$ and $f(a) \in f(A_0)$, that implies $f(a) = f(a')$ for some $a' \in A_0$, which is impossible since $f$ is injective. Thus $S \mapsto f(S)$ is (strictly) monotonic.
Now $h := g \circ f$ is an injection $A \to A$. Let $X$ be the set of all points not in the image of $g$, and let $A' = X \cup h(X) \cup h^2(X) \cup \dots$. Note that $h(A') = h(X) \cup h^2(X) \cup h^3(X) \cup \dots = A' - X$, since no element of $X$ is in the image of $h$. Then $g(B - f(A')) = g(B) - h(A') = g(B) - (A' - X) = (g(B) - A') \cup (g(B) \cap X) = g(B) - A'$, since $g(B) \cap X = \emptyset$. On one hand, every element of $A$ not contained in $g(B)$ is in $A'$ by construction, so $A - A' \subseteq g(B)$, and hence $A - A' \subseteq g(B) - A'$. On the other, clearly $g(B) \subseteq A$, so $g(B) - A' \subseteq A - A'$. Thus $g(B - f(A')) = A - A'$. QED.
We form two bijections using the sets from (9): one between $A'$ and $B'$, the other between $A - A'$ and $B - B'$.
Any injection is a bijection between its domain and its image. Since $B' = f(A')$ and $f$ is an injection, $f$ is a bijection between $A'$ and $B'$, where we assign each element $b' \in B'$ to the $a' \in A'$ such that $f(a') = b'$. Similarly, $g$ is a bijection between $B - B'$ and $A - A'$. Combining them, we get a bijection on the full sets.
Thanks! Edited. Yeah, I specifically focused on variance because of how Bayesian updates combine Normal distributions.
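For reference, the standard conjugate update when combining two Normal distributions (precisions, i.e. inverse variances, add, and the means are precision-weighted):

```latex
\frac{1}{\sigma_{\mathrm{post}}^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2},
\qquad
\mu_{\mathrm{post}} = \sigma_{\mathrm{post}}^2 \left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right)
```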
I’m specifically giving up games that encourage many short check-ins, e.g. most phone games and idle games. Binges aren't a big issue for me; they tend to give me joy and renewal. But frequent check-in games make me less happy and less productive.
“Prefer a few large, systematic decisions to many small ones.”
Pick what percentage of your portfolio you want in various assets, and rebalance quarterly, rather than making regular buying/selling decisions
Prioritize once a week, and by default do whatever’s next on the list when you complete a task.
Set up recurring hangouts with friends at whatever frequency you enjoy (e.g. weekly). Cancel or reschedule on an ad-hoc basis, rather than scheduling ad-hoc
Rigorously decide how you will judge the results of experiments, then run a lot of them cheaply. Machine Learning example: pick one evaluation metric (it might be a composite of several sub-metrics and rules), then automatically run lots of different models and do a deeper dive into the 5 that perform particularly well (see the sketch after this list).
Make a packing checklist for trips, and use it repeatedly
Figure out what criteria would make you leave your current job, and only take interviews that plausibly meet those criteria
Pick a routine for your commute, e.g. listening to podcasts. Test new ideas at the routine level (e.g. podcasts vs books)
Find a specific method for deciding what to eat—for me, this is querying system 1 to ask how I would feel after eating certain foods, and picking the one that returns the best answer
Accept every time a coworker asks for a game of ping-pong, as a way to get exercise, unless I am about to enter a meeting
Always suggest the same small set of places for coffee or lunch meetings
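For the Machine Learning item above, a minimal sketch of "commit to the metric first, then search widely"; the grid and the toy metric are hypothetical:

```python
def top_candidates(models, evaluate, k=5):
    """evaluate(model) -> float is fixed before the search; higher is better.
    Run every cheap experiment, then return only the k best for a deep dive."""
    return sorted(models, key=evaluate, reverse=True)[:k]

# Toy usage: a small grid of hypothetical hyperparameter settings.
grid = [{"lr": lr, "depth": d} for lr in (0.01, 0.03, 0.1) for d in (2, 4, 8)]
fake_metric = lambda m: -abs(m["lr"] - 0.03) - 0.01 * m["depth"]  # stand-in score
for model in top_candidates(grid, fake_metric):
    print(model)
```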