Hi everyone,
I’m a humanities PhD who’s been reading Eliezer for a few years, and who’s been checking out LessWrong for a few months. I’m well-versed in the rhetorical dark arts, due to my current education, but I also have a BA in Economics (though math is still my weakest suit). The point is, I like facts, despite the deconstructionist tendency of the humanities since the eighties. Now is a good time for hard-data approaches to the humanities, and I want to join that party. My heart’s desire is to workshop research methods with the LW community.
It may break protocol, but I’d like to offer a preview of my project in this introduction.
I’m interested in associating the details of print production with an unnamed aesthetic object, which we’ll presently call the Big Book, and which is the source of all of our evidence. The Big Book had multiple unknown sites of production, which we’ll call Print Shop(s) [1-n]. I’m interested in pinning down which parts of the Big Book were made in which Print Shop. Print Shop 1 has Tools (1), and those Tools (1) leave unintended Marks in the Big Book. Likewise with Print Shop 2 and their Tools (2). Unfortunately, people in the present don’t know which Print Shop had which Tools. Even worse, multiple sets of Tools can leave similar Marks.
The most obvious solution that I can see is
to catalog all Marks in the Big Book by sheet (a unit of print production, as opposed to the page), then
sort sheets by patterns of Marks, then
make some associations between the patterns of Marks and Print Shops, and then
propose Print Shops [x,y,z] to be the sites of production for the Big Book.
If nothing else, this method can determine n, the number of Print Shops responsible for the Big Book.
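For concreteness, the catalog-and-sort steps above could start as simply as grouping sheets whose Mark sets overlap heavily. A minimal sketch (all sheet IDs, mark names, and the similarity threshold are invented for illustration):

```python
# Hypothetical sketch: group sheets by similarity of their Mark patterns.
# Sheet IDs and mark names are invented for illustration.

def jaccard(a, b):
    """Similarity between two sets of marks (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_sheets(sheet_marks, threshold=0.5):
    """Greedily cluster sheets whose mark sets are at least
    `threshold` similar to the first sheet of an existing group."""
    groups = []  # each group: list of sheet ids
    for sheet, marks in sheet_marks.items():
        for group in groups:
            rep = sheet_marks[group[0]]  # group representative
            if jaccard(marks, rep) >= threshold:
                group.append(sheet)
                break
        else:
            groups.append([sheet])
    return groups

# Toy catalog: marks observed per sheet (invented data).
catalog = {
    "sheet1": {"bent_rule", "cracked_A"},
    "sheet2": {"bent_rule", "cracked_A", "smudge"},
    "sheet3": {"worn_W"},
    "sheet4": {"worn_W", "chipped_T"},
}
groups = group_sheets(catalog)  # groups sheet1 with sheet2, and sheet3 with sheet4
```

The greedy pass is only a stand-in; any off-the-shelf clustering method over mark-pattern vectors would serve the same role in step 2.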
The Bayesian twist on the obvious solution is to add some testing onto the associations, above. Specifically,
find some books strongly associated with Print Shops [x,y,z], in order to
assign probability of patterns of Marks to each Print Shop, then
revise initial associations between Print Shops [x,y,z] and the Big Book proportionally.
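The Bayesian step above can be sketched as a naive posterior update: start from a prior over candidate Print Shops, estimate per-shop Mark probabilities from securely attributed books, multiply, and renormalize. A toy sketch, assuming (unrealistically) that Marks occur independently given the shop; all shops, marks, and numbers are invented:

```python
# Hypothetical sketch of the Bayesian step. All shops, marks, and
# probabilities below are invented for illustration.

def posterior(prior, likelihoods, observed_marks):
    """P(shop | marks) ∝ P(shop) * Π P(mark | shop), naively assuming
    that marks occur independently given the shop."""
    unnorm = {}
    for shop, p in prior.items():
        like = 1.0
        for mark in observed_marks:
            # Small floor for marks never seen in the attributed sample.
            like *= likelihoods[shop].get(mark, 0.01)
        unnorm[shop] = p * like
    z = sum(unnorm.values())
    return {shop: v / z for shop, v in unnorm.items()}

prior = {"X": 1 / 3, "Y": 1 / 3, "Z": 1 / 3}  # initially indifferent
likelihoods = {                               # P(mark | shop), from attributed books
    "X": {"bent_rule": 0.6, "worn_W": 0.1},
    "Y": {"bent_rule": 0.2, "worn_W": 0.5},
    "Z": {"bent_rule": 0.2, "worn_W": 0.2},
}
post = posterior(prior, likelihoods, ["bent_rule", "worn_W"])  # Y comes out most probable
```

The "revise proportionally" step in the list is exactly the renormalization at the end: each shop's initial association is scaled by how well its mark probabilities explain the observed pattern.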
I’m far from an expert in Bayesian methods, but it seems already that there’s something missing here. Is there some stage where I should take a control sample? Also, how can I find a logical basis for the initial association step, when there are many potential Print Shops? Lastly, how can I account for the decay of Tools, thus increasing Marks, over time?
I’m interested in associating the details of print production with an unnamed aesthetic object, which we’ll presently call the Big Book, and which is the source of all of our evidence.
It’s the Bible, isn’t it.
Print Shop 1 has Tools (1), and those Tools (1) leave unintended Marks in the Big Book. Likewise with Print Shop 2 and their Tools (2). Unfortunately, people in the present don’t know which Print Shop had which Tools. Even worse, multiple sets of Tools can leave similar Marks.
How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.
I’m far from an expert in Bayesian methods, but it seems already that there’s something missing here.
Have you considered googling for previous work? ‘Bayesian inference in phylogeny’ and ‘Bayesian stylometry’ both seem like reasonable starting points.
‘No free lunches’, right? If you’re getting anything out of your unsupervised methods, that just means they’re making some sort of assumptions and proceeding based on those.
Sorry to interrupt a perfectly lovely conversation. I just have a few things to add:
I may have overstated the case in my first post. We have some information about print shops. Specifically, we can assign very small books to print shops with a high degree of confidence. (The catch is that small books don’t tend to survive very well. The remaining population is rare and intermittent in terms of production date.)
There are some hypotheses that could be treated as priors, but they’re very rarely quantified (projects like this are rare in today’s humanities).
How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.
We have minimal information about Print Shops. I wouldn’t say the existing data are garbage, just mostly unquantified.
Have you considered googling for previous work?
Yes, but thanks to you I know the shibboleth of “Bayesian stylometry.” Makes sense, and I’ve already read some books in a similar vein, but there are some problems. Most fundamentally, I have trouble translating the methods to a different type of data: from textual data like word length to the aforementioned Marks. Otherwise, my understanding was that most stylometric analysis favors frequentist methods. Can you clear any of this up?
EDIT: I have a follow-up question regarding GIGO: How can you tell what data are garbage? Are the degrees of certainty based on significant digits of measurement, or what?
Most fundamentally, I have trouble translating the methods to a different type of data: from textual data like word length to the aforementioned Marks.
Have to define your features somehow.
Otherwise, my understanding of most stylometric analysis was that it favors frequentist methods.
Really? I was under the opposite impression, that stylometry was, since the ’60s or so with the Bayesian investigation of Mosteller & Wallace into the Federalist papers, one of the areas of triumph for Bayesianism.
I have a follow-up question regarding GIGO: How can you tell what data are garbage? Are the degrees of certainty based on significant digits of measurement, or what?
No, not really. I think I would describe GIGO in this context as ‘data which is equally consistent with all theories’.
I have just such a thing, referred to as “Marks.” I haven’t yet included that in the code, because I wanted to explore the viability of the method first. So to retreat to the earlier question, why does my proposal strike you as a GIGO situation?
So to retreat to the earlier question, why does my proposal strike you as a GIGO situation?
You claimed to not know what printers there were, how many there were, and what connection they had to ‘Marks’. In such a situation, what on earth do you think you can infer at all? You have to start somewhere: ‘we have good reason to believe there were not more than 20 printers, and we think the London printer usually messed up the last page. Now, from this we can start constructing these phylogenetic trees indicating the most likely printers for our sample of books...’ There is no view from nowhere, you cannot pick yourself up by your bootstraps, all observation is theory-laden, etc.
This all sounds good to me. In fact, I believe that researchers in the humanities are especially (perhaps overly) sensitive to the reciprocal relationship between theory and observation.
I may have overstated the ignorance of the current situation. The scholarly community has already made some claims connecting the Big Book to Print Shops [x,y,z]. The problem is that those claims are either made on non-quantitative bases (eg, “This mark seems characteristic of this Print Shop’s status.”) or on a very naive frequentist basis (eg, “This mark comes up N times, and that’s a big number, so it must be from Print Shop X”). My project would take these existing claims as priors. Is that valid?
I have no idea. If you want answers like that, you should probably go talk to a statistician at sufficient length to convey the domain-specific knowledge involved or learn statistics yourself.
This is a problem that machine learning can tackle. Feel free to contact me by PM for technical help.
To make sure I understand your problem:
We have many copies of the Big Book. Each copy is a collection of many sheets. Each sheet was produced by a single tool, but each tool produces many sheets. Each shop contains many tools, but each tool is owned by only one shop.
Each sheet has information in the form of marks. Sheets created by the same tool at similar times have similar marks. It may be the case that the marks monotonically increase until the tool is repaired.
Right now, we have enough to take a database of marks on sheets and figure out how many tools we think there were, how likely it is each sheet came from each potential tool, and to cluster tools into likely shops. (Note that a ‘tool’ here is probably only one repair cycle of an actual tool, if they are able to repair it all the way to freshness.)
We can either do this unsupervised, and then compare to whatever other information we can find (if we have a subcollection of sheets with known origins, we can see how well the estimated probabilities did), or we can try to include that information for supervised learning.
I’m glad you mentioned the repair cycle of tools. There are some tools that are regularly repaired (let’s just call them “Big Tools”) and some that aren’t (“Little Tools”). Both are expensive to acquire and to repair, but it seems the Print Shops chose to repair Big Tools because they were subject to breakage that significantly reduced performance.
I should add another twist since you mentioned sheets of known origins: Assume that we can only decisively assign origins to single sheets. There are two problems stemming from this assumption: first, not all relevant Marks are left on such sheets; second, very few single sheet publications survive. Collations greater than one sheet are subject to all of the problems of the Big Book.
I’m most interested in the distinction between unsupervised and supervised learning. And I will very likely PM you to learn more about machine learning. Again, thanks for your help!
EDIT: I just noticed a mistake in your summary. Each sheet is produced by a set of tools, not a single tool. Each mark is produced by a single tool.
I just noticed a mistake in your summary. Each sheet is produced by a set of tools, not a single tool. Each mark is produced by a single tool.
Okay. Are the classes of marks distinct by tool type- that is, if I see a mark on a sheet, I know whether it came from tool type X or tool type Y- or do we need to try and discover what sort of marks the various tools can leave?
Any time you are doing statistical analysis, you always want a sample of data that you don’t use to tune the model and where you know the right answer. (a ‘holdout’ sample)
In this case, you should have several books related to the various print shops that you don’t feed into your Bayesian algorithm. You can then assess the algorithm by seeing if it gets these books correct.
To account for the decay of the tools, you need books that you know came from Print Shop x, y, or z, and for which you also know how old the tools were that made those books. Either that, or you’d have to have some understanding of how the tools decay from a theoretical model.
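A minimal sketch of the holdout idea: reserve a fraction of the attributed books before any tuning, and score the attribution rule only on that reserved set. The split fraction, the book data, and the trivial predictor here are all invented for illustration:

```python
import random

# Hypothetical sketch of a holdout evaluation. Books, shops, and
# marks below are invented for illustration.

def split_holdout(books, frac=0.25, seed=0):
    """Shuffle and reserve `frac` of the attributed books for testing."""
    books = books[:]
    random.Random(seed).shuffle(books)
    k = max(1, int(len(books) * frac))
    return books[k:], books[:k]  # (tuning set, holdout set)

def accuracy(predict, holdout):
    """Fraction of holdout books whose predicted shop is correct."""
    hits = sum(1 for marks, shop in holdout if predict(marks) == shop)
    return hits / len(holdout)

# Toy usage: each book is (observed marks, known shop).
attributed = [("bent_rule", "X"), ("worn_W", "Y"), ("smudge", "X"), ("worn_W", "Z")]
tuning, holdout = split_holdout(attributed)
score = accuracy(lambda marks: "X", holdout)  # score a toy rule on held-out books only
```

The essential discipline is that the holdout books never touch the model-fitting step; otherwise the accuracy estimate is optimistic.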
Very helpful points, thanks. The scholarly community already has a pretty good working knowledge of the Tools, and thus a theoretical model of Tool breakage (“breakage” may be more accurate than “decay,” since the decay is non-incremental and stochastic). We know the order in which parts of the Tools break, and we have some hypotheses correlating breakage to gross usage. The twist is that we don’t know when any Print Shops produced the Big Book, so we can only extrapolate a relative timeline based on Tool breakage.
Can you say more about the holdout sample? Should the holdout sample be a randomly selected sample of data, or something suspected to be associated with Print Shops [x,y,z]? Or with Print Shops [a,b,c]?
To account for the decay of the tools, you need books that you know came from Print Shop x, y, or z, and for which you also know how old the tools were that made those books. Either that, or you’d have to have some understanding of how the tools decay from a theoretical model.
If you assume that the marks result from defects in the tool that accumulate, it should be relatively easy to build (and test) a monotonic model. Suppose we have an unordered collection of sheets, with some variable number of defects per sheet. If the defects are repeated (i.e. we can recognize defect A whenever we see it, as well as B, and so on), then we can piece together paths- all of the sheets without defects pointing towards all of the sheets with just defect A, then defect A and B, and so on. There should be divergence- if we never see sheets with both defect A and C, then we can conclude the 0-A-B path is one tool (with only some of the 0-defect sheets coming from that tool, obviously), the 0-C-D-E path is another tool, and the 0-F-G path is a third tool. (Noting that here ‘tool’ refers to one repair cycle, not the entire lifecycle.)
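A toy sketch of this path-building idea, assuming defects strictly accumulate within one repair cycle: treat each sheet as a set of recognizable defects and greedily merge sheets into chains of nested sets, each chain suggesting one tool. Defect labels are invented:

```python
# Hypothetical sketch of the path-building idea: if defects only
# accumulate, sheets from the same tool (one repair cycle) have
# nested defect sets (e.g. {A} ⊂ {A, B}). Group the non-blank
# sheets into chains of nested sets; each chain suggests one tool.
# Defect labels are invented for illustration.

def comparable(a, b):
    """True if one defect set extends the other (same accumulation path)."""
    return a <= b or b <= a

def tool_chains(defect_sets):
    """Greedily merge defect sets into chains of mutually nested sets."""
    chains = []
    for ds in sorted(defect_sets, key=len):  # fewest defects first
        for chain in chains:
            if all(comparable(ds, other) for other in chain):
                chain.append(ds)
                break
        else:
            chains.append([ds])
    return chains

sheets = [{"A"}, {"A", "B"}, {"C"}, {"C", "D"}, {"C", "D", "E"}, {"F"}]
chains = tool_chains(sheets)  # three chains: A then A,B; C then C,D then C,D,E; F alone
```

The divergence test in the comment above falls out for free: sheets with both A and C would make those sets non-nested, so they could never land in the same chain.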
If you assume that the marks result from defects in the tool that accumulate, it should be relatively easy to build (and test) a monotonic model
The first assumption seems bad to me- I would assume defects accumulate only until equipment is reset or repaired, which is why I think you’d want some actual data.
The first assumption seems bad to me- I would assume defects accumulate only until equipment is reset or repaired, which is why I think you’d want some actual data.
That looks to me like it agrees with my assumption; I suspect my grammar is somehow unclear. (Note the last line of the grandparent.)
I dunno, I find the complexity-hiding capitalized nouns thing strangely attractive. Maybe there should be more capitalized nouns. Why isn’t Sheets capitalized?
This is probably coming back to my fascination with graph theory, which has similar but even more exotic terminology. “A spider is a subdivision of a star, which is a kind of tree made up only of leaves and a root; a star with three arcs is called a claw.”
I was openly warned by a professor (who will likely be on the dissertation committee) not to talk about this project widely.
The capitalized nouns are to highlight key terms. I believe the current description is specific enough to describe the situation accurately and without misleading people, but not too specific to break my professor’s (correct) advice.
Have I broken LW protocol? Obviously, I’m new here.
Yes. He said that I should be careful about sharing my project because, otherwise, I’ll be reading about it in a journal in a few months. His warning may exaggerate the likelihood of a rival researcher and mis-value the expansion of knowledge, but I’m deferring to him as a concession of my ignorance, especially regarding rules of the academy.
This is heavily context-dependent. Many fields are idea-rich and implementation-poor, in which case you do have to ram ideas down people’s throats, because there’s a glut of other ideas you have to compete against. But in fields that are implementation-rich and idea-poor, ideas should be guarded until you’ve implemented them. There are no doubt academic fields where the latter case applies.
But in fields that are implementation-rich and idea-poor, ideas should be guarded until you’ve implemented them. There are no doubt academic fields where the latter case applies.
I’ve been privately told of several such cases in high-energy physics. Below is an excerpt from Politzer’s Nobel lecture. He discovered asymptotic freedom (roughly, that quarks behave as if connected by miniature rubber bands which have no tension when the quarks are close to each other).
I slowly and carefully completed a calculation of the Yang-Mills beta function. I happen to be ambidextrous and mildly dyslexic. So I have trouble with left/right, in/out, forward/backward, etc. Hence, I derived each partial result from scratch, paying special attention to signs and conventions. It did not take long to go from dismay over the final minus sign (it was indeed useless for studying low energy phenomena) to excitement over the possibilities. I phoned Sidney Coleman. He listened patiently and said it was interesting. But, according to Coleman, I had apparently made an error because David Gross and his student had completed the same calculation, and they found it was plus. Coleman seemed to have more faith in the reliability of a team of two, which included a seasoned theorist, than in a single, young student. I said I’d check it yet once more. I called again about a week later to say I could find nothing wrong with my first calculation. Coleman said yes, he knew because the Princeton team had found a mistake, corrected it, and already submitted a paper to Physical Review Letters.
He does not explicitly say that Gross was tipped off, but it’s easy to read between the lines. The rest of his lecture, titled “The Dilemma of Attribution,” is also worth reading.
I cannot speak to your private examples, but I think you may be reading that into what Politzer said. He previously mentions the existence of ‘multiples’:
And the neat, linear progress, as outlined by the sequence of gleaming gems recognized by Nobel Prizes, is a useful fiction. But a fiction it is. The truth is often far more complicated. Of course, there are the oft-told priority disputes, bickering over who is responsible for some particular idea. But those questions are not only often unresolvable, they are often rather meaningless. Genuinely independent discovery is not only possible, it occurs all the time.
And shortly after your passage, he says
On learning of the Gross-Wilczek-Politzer result, [Nobelist] Ken Wilson, who might have thought of its impossibility along the same lines as I attributed to [Nobelist] Schwinger, above, knew who to call to check the result. He realized that there were actually several people around the world who had done the calculation, en passant as it were, as part of their work on radiative corrections to weak interactions in the newly-popular Weinberg-Salam model. They just never thought to focus particularly on this aspect. But they could quickly confirm for Wilson by looking in their notebooks that the claimed result was, indeed, correct....[Nobelist] Steve Weinberg and [Nobelist] Murray Gell-Mann were among those to instantly embrace non-Abelian color SU(3) gauge theory as the theory of the strong interactions. In Gell-Mann’s case, it was in no small part because he had already invented it (!) with Harald Fritzsch and christened it QCD... I’d only heard of Gell-Mann and Fritzsch’s work second hand, from [Nobelist] Shelly Glashow, and he seemed to think it shouldn’t be taken too seriously. I only later realized it was more Glashow’s mode of communication than his serious assessment of the plausibility of the proposal. In any case, I had completely lost track of Gell-Mann and Fritzsch’s QCD.
I cannot speak to your private examples, but I think you may be reading that into what Politzer said.
Not me. This tip-off story had been talked about in the community for a long time, just never publicly until Politzer decided to carefully and tactfully state what he knew personally and avoid speculating on what might have transpired. The result itself, of course, was ripe for discovery, and indeed was discovered but glossed over by others before him. I mentioned this particular story because it’s one of the most famous and most public ones. Of course, it might all be rumors and in reality there was no issue.
‘When you hear hoofbeats, think horses, not zebras.’ I see here, by Politzer’s testimony, a multiple discovery of at least 3 (Gell-Mann and the more-than-one persons implied by ‘several’), and you ask me to believe that a fourth instance is not yet another multiple but rather plagiarism/theft, based solely on your saying it was being talked about. It’s not exactly a convincing case.
The general narrative sounds very similar to cases in my own field, but I’d rather not talk about it. I’ve been cautioned not to speak about my current projects with certain people, on account of this.
David Gross and his student had completed the same calculation, and they found it was plus.
A week after Politzer shared his calculation:
the Princeton team had found a mistake, corrected it, and already submitted a paper to Physical Review Letters.
Why would they decide to redo the calculation (not a very hard one, but rather laborious back then, though it’s a standard one in any grad QFT course now) at exactly the same time?
Anyway, no point in further speculations without new data.
It may be more precise to say there are academic groups to which that description applies, and that discretion is worthwhile in their proximity. Examples of those still living will remain private for obvious reasons.
I’ve been privately told of several such cases in high-energy physics. Some even allege that the main reason David Gross got a share of the Nobel Prize for asymptotic freedom is that he was a referee, or maybe a journal editor, for Politzer’s paper, and managed to hasten his own group’s somewhat lagging research to get it published at the same time. No idea if the story has any truth in it.
Yep. It’s not the Bible. I suspect that there are already good stats compiled on the Q-source, etc.
In a way it’s not only futile but limiting to play the guessing game. There are lots of possible applications of Bayesian methods to the humanities. Maybe this discussion will help more projects than my own.
That was my first thought too; there’s a huge textual analysis tradition relating to the Bible and what I know of it maps pretty closely to the summary, although it’s also mature enough that there wouldn’t be much reason to obfuscate it like this. But it’s not implausible that it applies to some other body of literature. I understand there are some similar things going on in classics, for example.
The specifics shouldn’t matter too much, though some types of mark are going to be a lot more machine-distinguishable than others, and that’s going to affect the kinds of analysis you can do: differences in spelling and grammar, for example, are far machine-friendlier than differences in letterforms in a manuscript.
Thanks for the feedback. I actually cleared up the technical language considerably. I don’t think there’s any need to get lost in the weeds of the specifics while I’m still hammering out the method.
How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.
Not quite. You can get quite a bit of insight out of unsupervised clustering.
‘No free lunches’, right? If you’re getting anything out of your unsupervised methods, that just means they’re making some sort of assumptions and proceeding based on those.
Right, but this isn’t a free lunch so much as “you can see a lot by looking.”
Interesting feedback.
Ha, I wish. No, it’s more specific to literature.
Have to define your features somehow.
I don’t understand what this means. Can you say more?
http://en.wikipedia.org/wiki/Feature_%28machine_learning%29 A specific concrete variable you can code up, like ‘total number of commas’.
That’s a hell of a summary, thanks!
Okay. Are the classes of marks distinct by tool type- that is, if I see a mark on a sheet, I know whether it came from tool type X or tool type Y- or do we need to try and discover what sort of marks the various tools can leave?
Fortunately, we know which tool types leave which marks. We also have a very strong understanding of the ways in which tools break and leave marks.
Thanks again for entertaining this line of inquiry.
Good point!
Also yay combining multiple fields of knowledge and expertise! applause
Seriously though, the world does need more of it, and I felt the need to explicitly reward and encourage this.
Thanks! I feel explicitly encouraged.
Any time you are doing statistical analysis, you always want a sample of data that you don’t use to tune the model and where you know the right answer. (a ‘holdout’ sample)
In this case, you should have several books related to the various print shops that you don’t feed into your Bayesian algorithm. You can then assess the algorithm by seeing if it gets these books correct.
To account for the decay of the books, you need books that you know not only came from Print Shops x, y, or z, but also you’d need to know how old the tools were that made those books. Either that, or you’d have to have some understanding of how the tools decay from a theoretical model.
Very helpful points, thanks. The scholarly community already has a pretty good working knowledge of the Tools, and thus a theoretical model of Tool breakage (“breakage” may be more accurate than “decay,” since the decay is non-incremental and stochastic). We know the order in which parts of the Tools break, and we have some hypotheses correlating breakage to gross usage. The twist is that we don’t know when any Print Shops produced the Big Book, so we can only extrapolate a timeline based on Tool breakage.
Can you say more about the holdout sample? Should the holdout sample be a randomly selected sample of data, or something suspected to be associated with Print Shops [x,y,z]? Or with Print Shops [a,b,c]?
If you assume that the marks result from defects in the tool that accumulate, it should be relatively easy to build (and test) a monotonic model. Suppose we have an unordered collection of sheets, with some variable number of defects per sheet. If the defects are repeatable (i.e. we can recognize defect A whenever we see it, and likewise B, and so on), then we can piece together paths: all of the sheets without defects point towards all of the sheets with just defect A, then defect A and B, and so on. There should be divergence: if we never see sheets with both defect A and C, then we can conclude the 0-A-B path is one tool (with only some of the zero-defect sheets coming from that tool, obviously), the 0-C-D-E path is another tool, and the 0-F-G path is a third tool. (Note that here ‘tool’ refers to one repair cycle, not the entire lifecycle.)
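A minimal Python sketch of that path-building idea, using made-up defect sets that match the 0-A-B / 0-C-D-E / 0-F-G example (and assuming, as above, that defects only accumulate within one repair cycle):

```python
# Toy sketch: each sheet is summarized by the set of repeatable defects
# visible on it. If defects accumulate, sheets from one tool (one repair
# cycle) should form a nested chain of defect sets.
sheets = [
    set(), {"A"}, {"A", "B"},                    # path 0 -> A -> AB
    set(), {"C"}, {"C", "D"}, {"C", "D", "E"},   # path 0 -> C -> CD -> CDE
    {"F"}, {"F", "G"},                           # path 0 -> F -> FG
]

distinct = {frozenset(s) for s in sheets if s}
# A maximal defect set is contained in no other observed set; each one
# marks the endpoint of a path, i.e. one tool's repair cycle.
maximal = [s for s in distinct if not any(s < t for t in distinct)]

paths = []
for end in maximal:
    # Every observed subset of the endpoint lies on its path; sorting by
    # size recovers the order in which the defects accumulated.
    chain = sorted((s for s in distinct if s <= end), key=len)
    paths.append([set()] + [set(s) for s in chain])

for p in sorted(paths, key=lambda p: sorted(p[-1])):
    print(" -> ".join("".join(sorted(s)) or "0" for s in p))
# Prints:
# 0 -> A -> AB
# 0 -> C -> CD -> CDE
# 0 -> F -> FG
```

Real data would be messier (a sheet could belong to more than one candidate path, and repairs reset the defect sets), so this only illustrates the divergence argument, not a finished method.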
The first assumption seems bad to me: I would assume defects accumulate only until equipment is reset or repaired, which is why I think you’d want some actual data.
That looks to me like it agrees with my assumption; I suspect my grammar is somehow unclear. (Note the last line of the grandparent.)
Yes, I see an accord between your statement and Vaniver’s. As I said below, most tools have very slow repair cycles.
How about talking clearly about whatever you are currently hinting at?
I dunno, I find the complexity-hiding capitalized-nouns thing strangely attractive. Maybe there should be more capitalized nouns. Why isn’t Sheets capitalized?
This is probably coming back to my fascination with graph theory, which has similar but even more exotic terminology. “A spider is a subdivision of a star, which is a kind of tree made up only of leaves and a root; a star with three arcs is called a claw.”
I was openly warned by a professor (who will likely be on the dissertation committee) not to talk about this project widely.
The capitalized nouns are to highlight key terms. I believe the current description is specific enough to describe the situation accurately and without misleading people, but not too specific to break my professor’s (correct) advice.
Have I broken LW protocol? Obviously, I’m new here.
Did they say why?
Yes. He said that I should be careful about sharing my project because, otherwise, I’ll be reading about it in a journal in a few months. His warning may exaggerate the likelihood of a rival researcher and mis-value the expansion of knowledge, but I’m deferring to him as a concession of my ignorance, especially regarding rules of the academy.
“Don’t worry about people stealing your ideas. If your ideas are any good, you’ll have to ram them down people’s throats.”
This is heavily context-dependent. Many fields are idea-rich and implementation-poor, in which case you do have to ram ideas down people’s throats, because there’s a glut of other ideas you have to compete against. But in fields that are implementation-rich and idea-poor, ideas should be guarded until you’ve implemented them. There are no doubt academic fields where the latter case applies.
Can you name any?
I’ve been privately told of several such cases in high-energy physics. Below is an excerpt from Politzer’s Nobel lecture. He discovered asymptotic freedom (roughly, that quarks behave as if connected by miniature rubber bands which have no tension when the quarks are close to each other).
He does not explicitly say that Gross was tipped off, but it’s easy to read between the lines. The rest of his lecture, titled The Dilemma of Attribution, is also worth reading.
I cannot speak to your private examples, but I think you may be reading that into what Politzer said. He previously mentions the existence of ‘multiples’:
And shortly after your passage, he says
Not me. This tip-off story had been talked about in the community for a long time, just never publicly until Politzer decided to carefully and tactfully state what he knew personally and avoid speculating on what might have transpired. The result itself, of course, was ripe for discovery, and indeed was discovered but glossed over by others before him. I mentioned this particular story because it’s one of the most famous and most public ones. Of course, it might all be rumors and in reality there was no issue.
‘When you hear hoofbeats, think horses, not zebras’. I see here by Politzer’s testimony a multiple discovery of at least 3 (Gell-Mann and the more-than-one persons implied by ‘several’), and you ask me to believe that a fourth multiple is not yet another multiple but rather a plagiarism/theft, based solely on your saying it was being talked about. It’s not exactly a convincing case.
The general narrative sounds very similar to cases in my own field, but I’d rather not talk about it. I’ve been cautioned not to speak about my current projects with certain people, on account of this.
A week after Politzer shared his calculation:
Why would they decide to redo the calculation (not a very hard one, but rather laborious back then, though it’s a standard one in any grad QFT course now) at exactly the same time?
Anyway, no point in further speculations without new data.
It may be more precise to say there are academic groups to which that description applies, and that discretion is worthwhile in their proximity. Examples of those still living will remain private for obvious reasons.
Yup, some specific people steal. This definitely happens (but I will not mention names for obvious reasons).
I’ve been privately told of several such cases in high-energy physics. Some even allege that the main reason David Gross got a share of the Nobel Prize for asymptotic freedom was that he was a referee, or maybe a journal editor, for Politzer’s paper, and managed to hasten his own group’s somewhat lagging research to get it published at the same time. No idea if the story has any truth in it.
I think Gwern’s right on this.
But Humanities has rejected that!
Yep. It’s not the Bible. I suspect that there are already good stats compiled on the Q-source, etc.
In a way it’s not only futile but limiting to play the guessing game. There are lots of possible applications of Bayesian methods to the humanities. Maybe this discussion will help more projects than my own.
Ah, OK. They hadn’t when I wrote it.
That was my first thought too; there’s a huge textual analysis tradition relating to the Bible and what I know of it maps pretty closely to the summary, although it’s also mature enough that there wouldn’t be much reason to obfuscate it like this. But it’s not implausible that it applies to some other body of literature. I understand there are some similar things going on in classics, for example.
The specifics shouldn’t matter too much, though some types of mark are going to be a lot more machine-distinguishable than others, and that’s going to affect the kinds of analysis you can do: differences in spelling and grammar, for example, are far machine-friendlier than differences in letterforms in a manuscript.
Thanks for the feedback. I actually cleared up the technical language considerably. I don’t think there’s any need to get lost in the weeds of the specifics while I’m still hammering out the method.