Fixing Moral Hazards In Business Science
I’m a LW reader, two time CFAR alumnus, and rationalist entrepreneur.
Today I want to talk about something insidious: marketing studies.
Until recently I considered studies of this nature merely unfortunate, funny even. However, my recent experiences have caused me to realize the situation is much more serious than this. Product studies are the public’s most frequent interaction with science. By tolerating (or worse, expecting) shitty science in commerce, we are undermining the public’s perception of science as a whole.
The good news is this appears fixable. I think we can change how startups perform their studies immediately, and use that success to progressively expand.
Product studies have three features that break the assumptions of traditional science: (1) few if any follow up studies will be performed, (2) the scientists are in a position of moral hazard, and (3) the corporation seeking the study is in a position of moral hazard (for example, the filing cabinet bias becomes more of a “filing cabinet exploit” if you have low morals and the budget to perform 20 studies).
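To put a rough number on the “filing cabinet exploit”, here is a minimal sketch. It assumes each study uses a 0.05 significance threshold, the studies are independent, and the product has no real effect. Those assumptions and the function name are mine, not part of the proposal.

```python
def prob_at_least_one_false_positive(n_studies: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent studies of a
    no-effect product crosses the significance threshold by luck."""
    return 1.0 - (1.0 - alpha) ** n_studies


print(round(prob_at_least_one_false_positive(20), 3))  # 0.642
```

Under those assumptions, a sponsor who runs 20 studies and publishes only the best one has roughly a two in three chance of a publishable "positive" result for a product that does nothing.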
I believe we can address points 1 and 2 directly, and overcome point 3 by appealing to greed.
Here’s what I’m proposing: we create a webapp that acts as a high quality (though less flexible) alternative to a Contract Research Organization. Since it’s a webapp, the cost of doing these less flexible studies will approach the cost of the raw product to be tested. For most web companies, that’s $0.
If we spend the time to design the standard protocols well, it’s quite plausible any studies done using this webapp will be in the top 1% in terms of scientific rigor.
With the cost low, and the quality high, such a system might become the startup equivalent of citation needed. Once we have a significant number of startups using the system, and as we add support for more experiment types, we will hopefully attract progressively larger corporations.
Is anyone interested in helping? I will personally write the webapp and pay for the security audit if we can reach quorum on the initial protocols.
Companies who have expressed interest in using such a system if we build it:
Complice (disclosure: the CEO, Malcolm, is a friend of mine)
General Biotics (disclosure: the CEO, David, is me)
(I sent out my inquiries at 10pm yesterday, and every one of these companies got back to me by 3am. I don’t believe “startups love this idea” is an overstatement.)
So the question is: how do we do this right?
Here are some initial features we should consider:
Data will be collected by a webapp controlled by a trusted third party, and will only be editable by study participants.
The results will be computed by software decided on before the data is collected.
Studies will be published regardless of positive or negative results.
Studies will have mandatory general-purpose safety questions. (web-only products likely exempt)
Follow up studies will be mandatory for continued use of results in advertisements.
All software/contracts/questions used will be open sourced (MIT) and creative commons licensed (CC BY), allowing for easier cross-product comparisons.
Any placebos used in the studies must be available for purchase as long as the results are used in advertising, allowing for trivial study replication.
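One way the "software decided on before the data is collected" feature could be enforced is to publish a digest of the analysis code and protocol when the study is registered. This is only a sketch under my own assumptions: the function names are hypothetical, and SHA-256 over a JSON payload is just one reasonable choice.

```python
import hashlib
import hmac
import json


def commit_analysis(analysis_source: str, protocol: dict) -> str:
    """Return a digest to publish when the study is registered."""
    payload = json.dumps(
        {"analysis": analysis_source, "protocol": protocol}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_analysis(analysis_source: str, protocol: dict, published: str) -> bool:
    """Check that the code run after data collection matches the commitment."""
    return hmac.compare_digest(commit_analysis(analysis_source, protocol), published)
```

Anyone can recompute the digest from the open-sourced code, so a sponsor cannot quietly swap in a different analysis after seeing the data.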
Significant contributors will receive:
Co-authorship on the published paper for the protocol.
(Through the paper) an Erdos number of 2.
The satisfaction of knowing you personally helped restore science’s good name (hopefully).
I’m hoping that if a system like this catches on, we can get an “effective startups” movement going :)
So how do we do this right?
I am having trouble visualizing it. Could you tell a story that is a use case?
Max L.
Thanks for pointing this out.
Let’s use Beeminder as an example. When I emailed Daniel he said this: “we’ve talked with the CFAR founders in the past about setting up RCTs for measuring the effectiveness of beeminder itself and would love to have that see the light of day”.
Which is a little open ended, so I’m going to arbitrarily decide that we’ll study Beeminder for weight loss effectiveness.
Story* as follows:
Daniel goes to (our thing).com and registers a new study. He agrees to the terms, and tells us that this is a study which can impact health—meaning that mandatory safety questions will be required. Once the trial is registered it is viewable publicly as “initiated”.
He then takes whatever steps we decide on to locate participants. Those participants are randomly assigned to two groups: (1) act normal, and (2) use Beeminder to track exercise and food intake. Every day the participants are sent a text message with a URL where they can log that day’s data. They do so.
After two weeks, the study completes and both Daniel and the world are greeted with the results. Daniel can now update Beeminder.com to say that Beeminder users lost XY pounds more than the control group… and when a rationalist sees such claims they can actually believe them.
Note that this story isn’t set in stone—just a sketch to aid discussion
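The random assignment step in the story could look something like the sketch below. The arm names, the function name, and the idea of recording a seed so the assignment can be audited are my assumptions.

```python
import random
from typing import Dict, Iterable, Optional, Sequence


def assign_groups(
    participant_ids: Iterable[str],
    arms: Sequence[str] = ("control", "beeminder"),
    seed: Optional[int] = None,
) -> Dict[str, str]:
    """Shuffle participants, then deal them into arms round-robin
    so the group sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: arms[i % len(arms)] for i, pid in enumerate(shuffled)}
```

Because the webapp does the shuffling, the sponsor never gets to decide who lands in which group.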
These kinds of studies suffer from the Hawthorne effect. It is better to assign the control group to do virtually anything instead of nothing. In this case I’d suggest to have them simply monitor their exercise and food intake without any magical line and/or punishment.
Thank you. I had forgotten about that.
So let’s say the two groups were, as you suggest:
Tracking food & exercise on Beeminder
Tracking food & exercise in a journal
Do you have any thoughts on what questions we should be asking about this product? Somehow the data collection and analysis once we have the timeseries data doesn’t seem so hard… but the protocol and question design seems very difficult to me.
I wonder if there should be a group where they still get Beeminder’s graph, but they don’t pay anything for going off their road. (In order to test whether the pledge system is actually necessary.)
Yes, it should be a task that has a comparable amount of effort behind it.
Thanks for the example. It leads me to questions:
For more complicated propositions, who does the math and statistics? The application apparently gathers the data, but it is still subject to interpretation.
Is the data (presumably anonymized) made publicly available, so that others can dispute the meaning?
If the sponsoring company does its own math and stats, must it publicly post its working papers before making claims based on the data? Does anyone review that to make sure it passes some light smell test, and isn’t just pictures of cats?
What action does the organization behind the app take if a sponsor publicly misrepresents the data or, more likely, its meaning? If the organization would take action, does it take the same action if the statement is merely misleading, rather than factually incorrect?
What do the participants get? Is that simply up to the sponsor? If so, who reviews it to assure that the incentive does not distort the data? If no one, will you at least require that the incentive be reported as part of the trial?
Does a sponsor have any recourse if it designed the trial badly, leading to misleading results? Or is its remedy really to design a better trial and publicize that one?
Can sponsors do a private mini-trial to test its trial design before going full bore (presumably, with their promise not to publicize the results)?
Have you considered some form of reputation system, allowing commenters to build a reputation for debunking badly supported claims and affirming well-supported claims? (Or perhaps some other goodie?) I can imagine it becoming a pastime for grad students, which would be a Good Thing (TM).
I imagine these might all be very basic questions that arise out of my ignorance of such studies. If so, please spend your time on people with more to contribute than ignorance!
Max L.
This is an awesome idea. I had not considered this until you posted it. This sounds great.
This is a hard one. I anticipate that at least initially only Good People will be using this protocol. These are people who spent a lot of time creating something to (hopefully) make the world better. Not cool to screw them if they make a mistake, or if v1 isn’t as awesome as anticipated.
A related question is: what can we do to help a company that has demonstrated its effectiveness?
This is exactly the moral hazard companies face with the normal procedure too.
The main advantage I see is that the webapp approach is much cheaper, allowing companies to do it early and thus reducing the moral hazard.
At minimum the code used should be posted publicly and open-source licensed (otherwise there can be no scrutiny or replication). I also think paying to have a third party review the code isn’t unreasonable.
That was the initial plan, yes! Beltran (my co-founder at GB) is worried that will result in either HIPAA issues or something like this, so I’m ultimately unsure. Putting structures in place so the science is right the first time seems better.
The privacy issue here is interesting.
It makes sense to guarantee anonymity. Participants recruited personally by company founders may be otherwise unwilling to report honestly (for example). For health related studies, privacy is an issue for insurance reasons, etc.
However, for follow-up studies, it seems important to keep earlier records including personally identifiable information so as to prevent repeatedly sampling from the same population.
That would imply that your organization/system needs to have a data management system for securely storing the personal data while making it available in an anonymized form.
However, there are privacy risks associated with ‘anonymized’ data as well, since this data can sometimes be linked with other data sources to make inferences about participants. (For example, if participants provide a zip code and certain demographic information, that may be enough to narrow it down to a very few people.) You may want to consider differential privacy solutions or other kinds of data perturbation.
http://en.wikipedia.org/wiki/Differential_privacy
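To make the differential privacy suggestion concrete, here is a minimal sketch of the Laplace mechanism applied to a published mean. The assumptions are mine: values are clipped to a known range, the number of participants is public, and the function names are hypothetical. A real deployment would want a vetted library rather than this.

```python
import random
from typing import Optional, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(
    values: Sequence[float],
    lower: float,
    upper: float,
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Release a mean with epsilon-differential privacy (participant count assumed public)."""
    if not values:
        raise ValueError("need at least one value")
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    # Changing one participant's value moves the clipped mean by at most this much.
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)
```

Smaller values of epsilon add more noise and give stronger privacy, so the site would need to pick a value per study and publish it along with the result.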
I hadn’t. I like the idea, but am less able to visualize it than the rest of this stuff. Grad students cleaning up marketing claims does indeed sound like a Good Thing...
I was thinking something like the karma score here. People could comment on the data and the math that leads to the conclusions, and debunk the ones that are misleading. A problem would be that, If you allow endorsers, rather than just debunkers, you could get in a situation where a sponsor pays people to publicly accept the conclusions. Here are my thoughts on how to avoid this.
First, we have to simplify the issue down to a binary question: does the data fairly support the conclusion that the sponsor claims? The sponsor would offer $x for each of the first Y reviewers with a reputation score of at least Z. They have to pay regardless of what the reviewer’s answer to the question is. If the reviewers are unanimous, then they all get small bumps to their reputation. If they are not unanimous, then they see each others’ reviews (anonymously and non-publicly at this point) and can change their positions one time. After that, those who are in the final majority and did not change their position get a bump up in reputation, but only based on the number of reviewers who switched to be in the final majority. (I.e. we reward reviewers who persuade others to change their position.) The reviews are then opened to a broader number of people with positive reputations, who can simply vote yes or no, which again affects the reputations of the reviewers. Again, voting is private until complete, then people who vote with the majority get small reputation bumps. At the conclusion of the process, everyone’s work is made public.
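Here is a rough sketch of the reputation rule for the first reviewer round only, as I read the description above. The point values, the names, and the choice to award nothing on a tie are my assumptions; the later open voting stage is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Review:
    reviewer: str
    initial: bool  # first answer: does the data fairly support the claim?
    final: bool    # answer after seeing the other reviews (one change allowed)


def reputation_bumps(
    reviews: List[Review], small_bump: int = 1, per_convert: int = 2
) -> Dict[str, int]:
    """Reward reviewers who held the final majority position from the start,
    in proportion to how many others switched to join them."""
    if len({r.initial for r in reviews}) == 1:
        # Unanimous on the first pass: everyone gets a small bump.
        return {r.reviewer: small_bump for r in reviews}
    yes = sum(1 for r in reviews if r.final)
    no = len(reviews) - yes
    if yes == no:
        # Assumption: a tie has no majority, so nobody is rewarded.
        return {r.reviewer: 0 for r in reviews}
    majority = yes > no
    converts = sum(1 for r in reviews if r.initial != majority and r.final == majority)
    return {
        r.reviewer: per_convert * converts
        if (r.initial == majority and r.final == majority)
        else 0
        for r in reviews
    }
```

Note that a reviewer in the majority earns nothing if no one switched, which matches the "only based on the number of reviewers who switched" wording.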
I’m sure that there are people who have thought about reputation systems more than I have. But I have mostly seen reputation systems as a mechanism for creating a community where certain standards are upheld in the absence of monetary incentives. A reputation system that is robust against gaming seems difficult.
Max L.
I’m very glad I asked for more clarification. I’m going to call this system The Reviewer’s Dilemma; it’s a very interesting solution for allowing non-software analysis to occur in a trusted manner. I am somewhat worried about a laziness bias (it’s much easier to agree than disprove), but I imagine that if there is a similar bounty for overturning previous results, this might be handled.
I’ll do a little customer development with some friends, but the possibility of reviewers being added as co-authors might also act as a nice incentive (both to reduce laziness, and as additional compensation).
We need to design rules governing participant compensation.
At a minimum I think all compensation should be reported (it’s part of what’s needed for replication), and of course not related to the results a participant reports. Ideally we create a couple defined protocols for locating participants, and people largely choose to go with a known good solution.
StackOverflow et al are also free and offer no compensation except for points and awards and reputation. Maybe it can be combined. Points for regular participation, prominent mention somewhere and awards being real rewards. The downside is that this may pose moral hazards of some kind.
Oh, interesting.
I had been assuming that participants needed to be drawn from the general population. If we don’t think there’s too much hazard there, I agree a points system would work. Some portion of the population would likely just enjoy the idea of receiving free product to test.
I would worry about sampling bias due to selection based on, say, enjoying points.
For studies in which people have to actively involve themselves and consent to participate, I believe that there is always going to be some sampling bias. At best we can make it really really small, at worst, we should state clearly what we believe are those biases in our population.
At worst, we will have a better understanding of what goes into the results.
Also, for some studies, the sampled population might, by necessity, be a subset of the population.
I imagined similar actions as the Free Software Foundation takes when a company violates the GPL: basically a lawsuit and press release warning people. For template studies, ideally what claims can be made would be specified by the template (ie “Our users lost XY more pounds over Z time”.)
One option is simply to report it to the Federal Trade Commission for investigation, along with a negative publicity statement. That externalizes the cost.
If you would like assistance drafting the agreements, I am a lawyer and would be happy to help. I have deep knowledge about technology businesses, intellectual property licensing, and contracting, mid-level knowledge about data privacy, light knowledge about HIPAA, and no knowledge about medical testing or these types of protocols. I’m also more than fully employed, so you’d have the constraint of taking the time I could afford to donate.
Max L.
FTC is so much better than lawsuit. I don’t know a single advertiser that isn’t afraid of the FTC. It looks like enforcement is tied to complaint numbers, so the press release should include information about how to personally complain (and go out to a mailing list as well).
I would love assistance with the agreements. It sounds like you would be more suited to the Business <> Non-Profit agreements than the Participant <> Business agreements. How do I maximize the value of your contribution? Are you more suited to the high-level term sheet, or the final wording?
This problem can be reduced in size by having the webapp give out blinded data, and only reveal group names after the analysis has been publicly committed to. If participating companies are unhappy with the existing modules, they could perhaps hire “statistical consultants” to add a module, permanently improving the site for everyone.
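A minimal sketch of the blinding idea: analysts receive opaque group labels, and the key that maps labels back to arm names is only released once the analysis has been committed. The label format and function names are hypothetical.

```python
import secrets
from typing import Dict, Tuple


def blind_groups(assignments: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Replace arm names with shuffled opaque labels.
    Returns (blinded assignments, secret key mapping label -> arm name)."""
    arms = sorted(set(assignments.values()))
    labels = [f"group_{i}" for i in range(len(arms))]
    secrets.SystemRandom().shuffle(labels)
    key = dict(zip(labels, arms))
    label_for = {arm: label for label, arm in key.items()}
    blinded = {pid: label_for[arm] for pid, arm in assignments.items()}
    return blinded, key


def unblind(
    blinded: Dict[str, str], key: Dict[str, str], analysis_committed: bool
) -> Dict[str, str]:
    """Reveal arm names, but only after the analysis is publicly committed."""
    if not analysis_committed:
        raise PermissionError("analysis must be publicly committed before unblinding")
    return {pid: key[label] for pid, label in blinded.items()}
```

The webapp would hold the key itself, so neither the sponsor nor a hired statistical consultant can tell which group is the product group while choosing the analysis.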
This could be related to your #8 as well :)
I think I get your meaning. You mean that the webapp itself would carry out the testing protocol. I was thinking that it would be designed by the sponsor using standardized components. I think what you are saying is that it would be more rigid than that. This would allow much more certainty in the meaning of the result. Your example of “using X resulted in average weight loss of Y compared to a control group” would be a case that could be standardized, where “average weight loss” is a configurable data element.
Max L.
Yes. I think if we can manage it, requiring data-analysis to be pre-declared is just better. I don’t think science as a whole can do this, because not all data is as cheap to produce as product testing data.
Now that I’ve heard your reply to question #8, I need to consider this again. Perhaps we could have some basic claims done by software, while allowing for additional claims such as “those over 50 show twice the results” to be verified by grad students. I will think about this.
Thank you! This is exactly the kind of discussion I was hoping for.
The general answer to your questions is: I want to build whatever LessWrong wants me to build. If it’s debated in the open, and agreed as the least-worst option, that’s the plan.
I’ll post answers to each question in a separate thread, since they raise a lot of questions I was hoping for feedback on.
Even if the group assignments are random, the prior step of participant sampling could lead to distorted effects. For example, the participants could be just the friends of the person who created the study who are willing to shill for it.
The studies would be more robust if your organization took on the responsibility of sampling itself. There is non-trivial scientific literature on the benefits and problems of using, for example, Mechanical Turk and Facebook ads for this kind of work. There is extra value added for the user/client here, which is that the participant sampling becomes a form of advertising.
Yeah, this is a brutal point. I wish I knew a good answer here.
Is there a gold standard approach? Last I checked even the state of the art wasn’t particularly good.
Facebook / Google / StumbleUpon ads sound promising in that they can be trivially automated, and if only ad respondents could sign up for the study, then the friend issue is moot. Facebook is the most interesting of those, because of the demographic control it gives.
How bad is the bias? I performed a couple google scholar searches but didn’t find anything satisfying.
To make things more complicated, some companies will want to test highly targeted populations. For example, Apptimize is only suitable for mobile app developers—and I don’t see a facebook campaign working out very well for locating such people.
A tentative solution might be having the company wishing to perform the test supply a list of websites they feel caters to good participants. This is even worse than facebook ads from a biasing perspective though. At minimum it sounds like disclosing how participants were located prominently will be important.
There are people in my department who do work in this area. I can reach out and ask them.
I think Mechanical Turk gets used a lot for survey experiments because it has a built-in compensation mechanism and there are ways to ask questions in ways that filter people into precisely what you want.
I wouldn’t dismiss Facebook ads so quickly. I bet there is a way to target mobile app developers on that.
My hunch is that like survey questions, sampling methods are going to need to be tuned case-by-case and patterns extracted inductively from that. Good social scientific experiment design is very hard. Standardizing it is a noble but difficult task.
I sincerely hope that study plan would not pass muster. Doesn’t there need to be a more reasonable placebo?
In general, who will review proposed studies for things like suitable placebo decisions?
Can you provide an example of what you’d like to see pass muster?
Roughly speaking: “Act normal” vs “Use Beeminder” vs “some alternative intervention”. Basically, I expect to see “do something different” produce results, at least for a little while, for almost any value of “something different”. Literally anything at all that didn’t make it clear to the placebo group that they were the placebo group. Maybe some non-Beeminder exercise and intake tracking. Maybe a prescribed simple exercise routine + non-Beeminder tracking.
I’m glad you’re here. My background is in backend web software, and stats once the data has been collected. I read “Measuring the Weight of Smoke” in college, but that’s not really a sufficient background to design the general protocol. That’s a lot of my motivation behind posting this to LW—there seem to be protocol experts here, with great critiques of the existing ones.
My hope is we can create a “getting started testing” document that gets honest companies on the right track. Searching around the web I’m finding things like this rather than serious guides to proper placebo creation.
I’m hoping either registered statistical consultants or grad students. Hopefully this can be streamlined by a good introductory guide.
I posted a link to this discussion on the Quantified Self forums.
To work well, I think it needs a good name. In terms of long term social dynamics, creating a meta-brand that helps smaller brands seems essential. Like when people initially see the “tested by X” logo they won’t know what it means.
Assuming the web app works as intended, and assuming any significant fraction of the population just stops believing any of the classes of claims that might be tested this way and lack the logo, then the process should gain more and more credibility over the course of months and years. The transition from an unknown logo to a trusted logo will be tricky for the larger institutional hack to work, and the name itself might be key to the logic of acceptance at the beginning.
I ground through various options at the command line with $ whois $OPTION | grep "[A-Z].COM"… trying to find things that get the right idea and aren’t already registered.
DoesItWork .com (taken)
justtestit .com (taken)
efficacy .com (taken)
forrealz .com (taken)
proveitforreal .com (available!)
simplytested .com (taken)
quickproofs .com (taken)
openproducttesting .com (available!)
opensourcetesting .com (taken)
tested .com (taken)
testedclaim .com (available!)
thirdpartytested .com (available!)
3rdpartytested .com (available!)
Namespace is huge and finding a good name seems key. The names I looked for may be too boring or too long or too easy to misspell? Please comment in response to this comment, one name suggestion per comment, and then find the 3 best suggestions from other people (assuming that there are lots to choose from) and vote them up :-)
Edited to add: I’m seeing lots of votes and no suggestions. Also, ProveItForReal seems to be winning but I think that works better in a {{citation needed}} context (ie you say {{prove it for real}} to dubious claims) but it works less well for logos on products. Imagine a logo that is worked into product’s packaging that says: “TestedClaim: X gives benefit Y in Z% of users”… that seems good in that context, but {{this needs to be a tested claim}} is awkward. Surely something better is possible than either of these?
proveitforreal.com
Acronym: PIFR.
Used like {{needs citation}} it really shines as {{prove it for real}} …but how does it look on a product label?
The karma has spoken. I’ve registered proveitforreal.com. Thank you!
I think a trademarked “proved” image will do nicely for use on labels :)
simpleproofoftruth.com
Acronym: SPOT, sounds kind of neat with “SPOT test” or “SPOT tested”.
It also works as a potential prod like citation needed… {{ simple proof of truth needed }}
One thing that slightly bothers me is that it relies on an older and less precise sense of the word “proof” that comes more from english common law than from mathematics, and the mathematical sense of the word “proof” is dear to my heart.
3rdpartytested.com
Acronym: 3PT is distinctive. TPT less so.
As a past tense claim, it really shines if you imagine what the logo could do for a product on a website. The link says “3rd Party Tested” and you click on it and it takes you to the open study. Simple and clean.
Downside: if the name overruns a pre-existing phrase because “third party tested” means something already, then you get confusing semantic collisions if someone has third party tested products that weren’t tested by Third Party Tested (the unique tool).
openproducttesting.com
Acronym: OPT (kinda cool… “opt in”?)
Openness has good connotations: honesty, transparency, etc. Very clear mission statement if it turns into an organization and the organization runs into questions about what to do next.
testedclaim.com
Acronym: TC but it is so short and simple the acronym is less likely to be used I think.
This is my personal favorite among the names I suggested, from the perspective of a logo on a bottle or a website. Very clean :-)
Apptimize (YC S13) is also interested. (disclosure: the CEO and CTO, Nancy and Jeremy, are friends of mine, and I worked there as a programmer).
If anyone else would like to be included, please reply here.
In the Reproducibility Initiative, PLoS and a few partners came together to improve the quality of science.
I would suggest asking all the people listed as advisors in the Reproducibility Initiative whether they are interested in your project. PLoS would be a good trusted third party with an existing brand.
Thank you. I had not seen the reproducibility initiative. Link very much appreciated, I’ll start the conversation tonight. PLoS hosting the application would be ideal.
I think this could be moved to Main.
Nope. The cost of doing less flexible studies will be the cost of losing that flexibility. For companies which expect a particular result from a study this cost can be considerable.
Do you mean that companies might see the opportunity cost of their ability to (amorally) rig a study as a bad thing? Or do you mean that companies might (legitimately) need complex study design to show a real but qualified positive effect?
If the former, then this proposal seems already to have taken that into account as a chief reason for creating the institution in the first place. If the latter then it might be possible to add options for important and epistemically useful variations in study design to the web app.
I mean both.
However now that I’ve looked at the OP more carefully, it seems to imply that a “webapp” will do research for multiple businesses at zero cost. I don’t think I understand how that’s supposed to work.
In my experience, startups want to demonstrate efficacy about basic things: weight removed, increased revenue, personal productivity, product safety, etc.
This kind of research lends itself extremely well to protocol templates: a standardized sequence of steps to locate the participants, collect the data, and decide the results. These steps could be performed by a website. I’ve posted a story of how that might work here.
Without such a project, founders have two options:
Perform the study themselves. The scientific background and time required to design a study well is non-trivial, as is the expertise necessary to create participant waivers. There are also significant time costs in setting up the data collection, and performing the analysis. At the end, the study will be greeted somewhat skeptically.
Pay a University or Contract Research Organization to perform the study. This is expensive, which I believe is why more founders aren’t doing it.
This project creates a third option:
Use a standard template then get back to work.
Which may be imperfect, but is still a pretty appealing value proposition.
This is not at all self-evident to me. How, for example, would you demonstrate product safety (for a diverse variety of products) via a standard template?
I don’t see how a “template” frees one from
Research costs money and requires competent people. If it were possible to do meaningful research on the cheap just by reusing the same template, don’t you think it would be a very popular opinion already?
Templates not template. I think if you know roughly which bodily systems a product is likely to affect, the questions are not so diverse.
My background is not in question selection (it’s ML and webapp programming), but here goes some general question ideas for edible products:
I have/have not felt sick to my stomach in the last 24 hours. (standardized 1-7 to rate severity)
I have/have not felt dizzy in the last 24 hours. (standardized 1-7 to rate severity)
Bristol stool scale score
The mandatory questions are intended to give LessWrong / everyone a say in what startups will test their products for—NOT to provide a 100% guarantee of general safety (the FDA already handles that). We should use these questions to learn about unanticipated side effects.
I’m hoping it will do something akin to what Google Translate did for translation: lower the cost for modest use cases. If you want a high quality translation (poetry) you still need to hire a good translator. However, if you are willing to accept a reasonably good level of translation quality, it’s now free.
I agree it’s weird that somebody else hasn’t noticed. testifiable.com is the closest I’ve found. I’ve already spoken with Testifiable’s founders and invited them to this thread.
There is a critical difference: Google Translate does not guarantee the quality of results and, in fact, often generates something close to garbage. It may produce a “reasonably good level of translation quality” or it may not and that’s fine because it made zero promises about its capabilities.
You are planning to set yourself up as a standard of research which means you must generate better than adequate results every single time.
P.S. Oh, and a random thought. What do you think 4Chan will do with your “webapp”? X-)
Hmm. I’m confused. Let’s look at something slightly more extreme than what we’re talking about and see if that helps.
Level 0: Imagine we make a product study as good as possible, then allow anyone to perform the same study with a different product. Some products “shouldn’t” be tested that way, but I don’t see how a protocol like that will produce garbage (it will merely establish “no effect”).
Level 1: We broaden to support more companies, and allow anyone to perform those studies as well.
Level 2: After a sufficient number of companies have had their experiment created, we take the protocols and create a “build your own experiment constructor kit” which allows for an increasingly large number of products to receive a good test.
Level 3: As we add more and more products to the adaptable system, we reach the point where most product claims have a community ratified question for them, and the protocol is stable. You might not be able to test 20% of the things that you’d like, but for the other 80% you can test those just fine.
Please let me know where you believe that plan breaks. Actual plan will likely differ of course, but we need something concrete to talk about or I’m going to keep not understanding what sounds like potentially constructive methodology advice.
Perform some hilarious experiments! Hopefully we get publicity from them :D
Well, I still don’t understand how it’s supposed to work.
Let’s take a specific example. For example there is a kitchen gadget, Sansaire sous vide circulator, that was successfully funded on Kickstarter. Let’s pretend that I’m one of the people behind the project and I’m interested in (1) whether people find that product useful; (2) whether the product is safe.
How would the product study go in this particular example? Who does what in which sequence?
I described an overview in a different thread, but that was before a lot of discussion happened.
I’ll use this as an opportunity to update the script based on what has been discussed. This is of course still non-final.
The creator of the sous vide machine (Tester) would register his study and agree to the terms.
The Tester would register this as a food-related study, automatically adding required safety questions.
The Tester would perform a search of our questions database and locate customer satisfaction related questions.
The Tester would click “start the experiment”.
Our application would post an ad seeking participants.
The participants would register for the study. Once a critical mass was reached, our app would create a new instance of the data collection webapp.
Once the study period is complete, the data collection app signs and transfers the participant data back to our main app. The analysis is performed, the study is posted publicly, and the Tester is notified of the results via email.
Okay, I hope this (and the link to my earlier user story) helps make things clearer. If you see issues with this, please do bring them up; finding and fixing issues early is the reason I started this thread.
You’re describing just the scaffolding, but what actually happens? All the important stuff is between points 6 and 7 and you don’t even allocate space for it :-/
The link I gave to the data collection webapp describes the data collection in more depth, which I believe is what you are asking about between 6 and 7.
From that url:
Core function:
Every day an SMS/email is sent to participants with a securely generated one time URL.
The participant visits this URL and is greeted with a list of questions to answer.
Potential changes to this story:
If the URL is not used within 16 hours, it expires forever.
If a participant does not enter data for more than 3 days, they are automatically removed from the study.
If a participant feels that they need to be removed from the study, they may do so at any time. They will be prompted to provide details on their reasons for doing so. These reasons will be communicated to the study organizer.
The study organizer may halt the study for logistical or ethical reasons at any time.
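The expiry and dropout rules in this story can be sketched in a few lines. This is only an illustration, not code from the actual data collection webapp: the function names and the dictionary layout are made up for the example, and only the 16 hour and 3 day limits come from the story above.

```python
import secrets
from datetime import datetime, timedelta

URL_LIFETIME = timedelta(hours=16)  # from the story: unused links expire after 16 hours
MAX_SILENCE = timedelta(days=3)     # from the story: more than 3 days without data means removal


def issue_token(now):
    """Create a hard-to-guess one-time token and record when it was issued (hypothetical layout)."""
    return {"token": secrets.token_urlsafe(32), "issued": now, "used": False}


def redeem(record, now):
    """Return True exactly once, and only while the link is inside its lifetime."""
    if record["used"] or now - record["issued"] > URL_LIFETIME:
        return False
    record["used"] = True
    return True


def should_remove(last_entry, now):
    """A participant is dropped after more than 3 days without entering data."""
    return now - last_entry > MAX_SILENCE
```

A real deployment would store the token records in a database and send the URL by SMS or email; both are left out here.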
No, not really. Recall the setting—I am about to produce a sous vide circulator and am interested (1) whether people find that product useful; and (2) whether the product is safe. I see nothing in your post which indicates how the process of answering my questions will work.
By the way, shipping a product to random people and asking them “Is it useful?” and “Did you kill yourself at any point during the last 24 hours?” is not likely to produce anything useful at all, never mind a proper scientific study.
I see. Right now the system doesn’t have any defined questions. I believe that suitable questions will be found so I’m focusing on the areas I have a solid background in.
If a project is unsafe in a literal way, shipping the product to consumers (or offering it for sale) is of course illegal. However, when considering a sous vide cooker in the past I have always worried about the dangers of potentially eating undercooked food (eg. diarrhea, nausea, and light headedness), which was how I took your meaning previously. “Product is safe for use, but accidental use might lead to undesirable outcomes”. As I mentioned in our discussion here this project is not intended to be a replacement for the FDA.
I agree that “is it useful” is not a particularly useful question to ask, but I don’t see any harm in supporting it. If you are looking for a better question, “80% of users used the product twice a week or more three months after receiving it” sounds like information that would personally help me make a buying decision. (Have you used the product today?)
So perhaps frequency of use might be a better question? I wasn’t haggling over what questions to ask because it was your example.
I think rigor in data collection and data processing are what make something scientific. For an example, you could do a rigorous study on “do you think the word turtle is funny?”.
Sorry, I don’t find this idea, at least in its present form, useful. However, I’ve certainly been wrong before and will be wrong in the future, so it’s quite possible I’m wrong now as well :-) There doesn’t seem to be much point in my playing the “Yes, but” game, so I’ll just tap out.
I think you overrate the quality of Google Translate. That pitch doesn’t sound right to me.
Ahh, okay. That one goes on the scrap heap.
I think if you change the price of something by an order of magnitude you get a fundamental change in what it’s used for. The examples that jump to mind are letters → email, hand copied parchment → printing press → blogs, and SpaceX. If you increase the quality at the same time you (at least sometimes) get a mini-revolution.
I think a better example might be online courses. It can be annoying that you can’t ask the professor any questions (customize the experience), but they are still vastly better than nothing.
Another example is the use of steel. If it’s expensive, it’s used for needles and watch springs. If it’s cheap, it’s used for girders.
Email is not only cheaper than letters but also much faster.
The online courses example sounds reasonable, but I’m still not sure whether that’s the best marketing strategy. Having a seal for following good science processes like preregistration might have its own value.
Prior Art: http://genomera.com/
Thanks so much! I didn’t know about them, it’s a good datapoint. Do you happen to know if they are active?
Awesome, great link. Example study here.
I think the needs for this project are still substantially different. Genomera trusts the scientists, which is usually a fine thing to do. I’ve applied for a beta invite, but don’t have access. Based on the example study I’ve linked it seems like they are more focused on assisting in data gathering (which based on my recent experience seems like the easiest thing we are considering).
One issue that seems more likely to be problematic when the web application is being created and launched than later on, is whether the questions are well designed. There’s a whole area of expertise that goes into creating scales that are reliable, valid, and discriminative. One possibility is to construct them from scratch from first principles, and then make them publicly available, but another possibility is to find the best of what exists already that is open sourced.
For General Biotics and Mealsquares it seems like some measure of “not having a happy tummy” is a relevant thing to measure. If Soylent gets in on the process they might have a similar interest?
A little bit of googling turned up the Gastrointestinal Symptom Rating Scale. It has 15 items (which might be too many?) and it is interview based (so hard to fit into an automated system). The really nice thing was that I could find a PDF and it all looked pretty basic.
A 2006 paper by van Zanten tipped me off to the existence of:
The Glasgow Dyspepsia Severity Scale
The Leeds Dyspepsia Questionnaire (public domain, with a Mandarin version!)
The Severity of Dyspepsia Assessment
The Nepean Dyspepsia Index
I’m feeling like in this situation, I can safely say “I love standards, there are so many to choose from”! One of the things that turned up in my searches that seems like a really useful “meta find” is the Proqolid Clinical Outcomes Assessment database, but it requires membership to use the internal search function, and I need to pause to grab some dinner.
Thank you for posting this!
Getting a list of LessWrong approved questions would be awesome. Both because I think the LW list will be higher quality than a lot of what’s out there, and because I feel question choice is one of the free variables we shouldn’t leave in the hands of the corporation performing the test.
I am confused. Shouldn’t the questions depend on the content of the study being performed? Which would depend (very specifically) on the users/clients? Or am I missing something?
I am hopeful that at minimum we can create guidelines for selecting questions.
I also think that some standardized health & safety questions by product category would be good (for nutritional supplements I would personally be interested in seeing data for nausea, diarrhea, weight change, stress/mood, and changes in sleep quality).
For productivity solutions I’d be curious about effects on social relationships, and other changes in relaxation activities.
Within a given product category, I’m also hopeful we can reuse a lot of questions. Soylent’s test and Mealsquares’ test shouldn’t require significantly different questions.
I’m not sure whether Lesswrong approval is the way to go. In the outside world few people care about Lesswrong.
I think if the project is in a later stage it might make sense to mail a bunch of domain experts and ask them for guidance.
I could also imagine using biology.stackexchange as a platform to discuss which tests should be used for a specific issue.
These requirements are higher than the average study in social sciences could fulfill. [Citation needed]
That being said, I would put more faith in this startup if it targeted more professional research first and thus made itself more compatible with traditional papers. In a first step it would require researchers to announce a study and then publish the results regardless of the outcome (as is already done by some journals, as far as I know.) In a second step, require the results to be analysed by code published in advance under some kind of open content / open source licence. In a third step require there to be a replication under the same conditions for the claim to be published “officially”. And so on.
I’ll think about it some more, but the whole thing seems like it has been discussed on LW before.
Thank you. Help considering the methodology and project growth prospects is very much appreciated.
I agree that compatibility with traditional papers is important. It was not stated explicitly, but I do want the results to be publishable in traditional journals. I plan on publishing the results for my company’s product. It seemed to me like being overly rigorous might be a selling point initially—“sure we did the study cheap / didn’t use a university, but look how insanely rigorous we were”
Going after professional researchers seems much harder. They actually know how to perform the research, so the value proposition is much weaker—they are already trusted, and know how to use R :p
These are just initial thoughts. I’ll think about this more.
The thing is, this seems like an ab-initio approach to doing research by people who are not researchers by trade. The vast majority of tech startups are led by engineers, not researchers, though there is no visible line between the two.
By the principle of comparative advantage researchers should be willing to delegate some of their work to a third party, so look for the repetitive parts that could be automated by either protocol or program. If, for example, the journal requires a replication before the full study is published, the original researcher(s) might have an incentive to plan for a replication from another party.
My idea for you would be to follow the same line most other improvements on traditional procedures follow: Automate the parts that can be automated, standardise the parts that can be standardised and continue. Designing a whole system tends to fail from my reading of history.
A two-pronged approach might even be more favourable: Work with a traditional journal that has the “perfect” scientific standards so the requirements infect traditional science and meanwhile fill the journal with the papers generated from the program.
I’ll have to think about this some more.
The Beeminder and HabitRPG links both point to less wrong blank pages.
Thanks, fixed.
Hi David,
This is a worthwhile initiative. All the very best to you.
I would advise that this data be maintained on a blockchain-like data structure. It will be highly redundant and very difficult to corrupt, which I think is one of the primary concerns here.
http://storj.io, http://metadisk.org/
Interesting. I’m hoping that by getting a trustworthy non-profit to host the site (and paying for a security audit) we can largely side step the issues.
I spent a long time trying to create a way not to need the trusted third party, but I kept hitting dead ends. The specific dead end that hurt the most was blinding of physical product shipments.
If we can figure out a way to ship both products and placebos to people without knowing who’s getting what, I think we can do this :)
I have been thinking about a lot of incentivized networks and was almost coming to the same conclusion, that the extra cost and the questionable legality in certain jurisdictions may not be worth the payoff, and then the Nielsen scandal showed up on my newsfeed. I think there is a niche; I’m just not sure where it would be most profitable. Incidentally, Steve Waldman also had a recent post on this: social science data being maintained in a neutral blockchain.
About the shipping of products and placebos to people, I see a physical way of doing it, but it is definitely not scalable.
Let’s say there is a typical batch of identical products to be tested. They’ve been moved to the final inventory sub-inventory, but not yet to the staging area where they are to be shipped out. The people from the testing service arrive with a bunch of duplicate labels for the batch and the placebos, and replace half the quantity with placebo. Now only the testing service knows which item is placebo and which is product.
This requires two things from the system: the ability to trace individual products and the ability to print duplicate labels. The latter should be common, except for places which might have legal issues with continuous numbering. The ability to trace individual products exists in a lot of discrete manufacturing, but many process manufacturing industries only have traceability by batch/lot.
Your approach to blinding makes sense, and works. I thought we were trying for a zero third party approach though?
I was giving more thought to a distributed solution during dinner, and I think I see how to solve the physical shipments problem in a scalable way. I’m still not 100% sold on it, but consider these two options:
You ship both a placebo package and a non-placebo package to the participant, and have them flip a coin to decide which one to use. They either throw away or disregard the other package for the duration of the study.
You ship N packages to Total/N participants. Each participant who receives N packages then randomly assigns themselves one package and randomly distributes the remaining (N-1) packages to other participants.
They both require trusting the participant with assignment. Which feels wrong to me, but I’m not sure why...
I have thought a bit more about the blinding issue.
One question that comes to mind: How do we trust that the placebo is a real placebo and substantially different than the drug? The company producing the product wants to show a difference. Therefore they have no incentive to give both parties the same product.
On the other hand the company could mix some slight poison into the placebo. Even an ineffective drug beats a poison.
Therefore the placebo has to be produced or purchased by a trusted organisation and that organisation has to package the placebo in the same box that it’s packaging the drug.
This is awesomely paranoid. Thank you for pointing this out.
I’m a little worried a solution here will call for whoever controls the webapp to also be an expert at creating placebos for every product type. (If we trust contract manufacturers to be honest, then the issue of adding poisons to a placebo can be handled by having them ship directly to the third party for mailing… but I think that’s already the default case).
Perhaps poisons can be discovered by looking at other products which performed the same protocol? “This experiment has to be re-done because the control group mysteriously got sick” doesn’t seem like a good solution though...
I’ll wrestle with this. Maybe something with MaxL’s answer to #8 might be possible?
How about… company with product type X suggests placebo Y. Webapp/process owner confirms suitability of placebo Y with unaffiliated/blinded subject matter expert in the field of product X. If confirmed as suitable, placebo is produced by unaffiliated external company (who doesn’t know what the placebo is intended for, only the formulation of requested items).
Alternately, the webapp/process owner could produce the confirmed placebo, but I’m not sure if this makes sense cost-wise, and also it may open the company up to accusations of corruption, because the webapp/process owner is not blinded to who the recipient company is, and therefore might collude.
That’s possible but it means that you double your product costs. The advantage would be that you can do crossover trials.
In the case of your probiotic a crossover trial might be worthwhile even without this reason.
In this case you could encrypt a list of product IDs for placebos and non-placebos before you ship your product. The participant has no way to know which of the two products is the placebo.
When he starts the study the participant puts the product ID of the product he decides to use into the webapp. He also puts the ID of the product he doesn’t want to use in the webapp.
The participant has no way to decide between the two packages or know which one is the placebo and which one is the real thing so he doesn’t need to go through the process of flipping a real coin.
Once the study is finished you release the decryption key that can be used to distinguish placebos from real products. You can give the decryption key to a trusted third-party organisation so that your company can’t prevent the results of the study from being published.
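The publish-now, reveal-later idea can be sketched with the Python standard library. Note that this swaps the encrypted list for a keyed hash commitment, which is a related but different technique: the commitment proves the mapping was fixed before shipping, but the mapping itself still has to be stored alongside the key and handed over at the end. The function names and the example product IDs are assumptions for illustration.

```python
import hashlib
import hmac
import json
import secrets


def commit(assignments):
    """Return (public_commitment, secret_key) for a product-ID -> arm mapping."""
    key = secrets.token_bytes(32)  # kept secret until the study ends
    payload = json.dumps(assignments, sort_keys=True).encode()
    digest = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return digest, key


def verify(assignments, key, commitment):
    """Check that a revealed mapping and key match the published commitment."""
    payload = json.dumps(assignments, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, commitment)
```

The commitment would be published when the study starts. After the study, whoever holds the key releases it together with the mapping, and anyone can run `verify` to confirm that the placebo list was not rewritten after the data came in.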
That means the participants have to do the work of going to the post office and remailing packages. Some of them will require additional time to remail and it might produce complications.
Maybe you can win a company such as Amazon as a partner for distribution. Amazon warehouses can store both products and placebos and ship randomly.
For Amazon it should be relatively little work and it could be good PR.
Labels don’t have to be printed on the bottle. Amazon has the capabilities of adding a paper with study instructions to a shipment and that paper can contain a unique ID for the product. Amazon could again publish an encrypted version of the ID at the beginning of the study and release the decryption key at the end of the study.
I am worried that standardized shipping will come with standardized package layout, and I’m guessing “preference of left vs right identical thing” correlates with something the system will eventually test. Having thought about it more, this is the real issue with allowing customers to choose which product they’ll use: that decision has to be purely random if you want the math to be simple / understandable. I agree people are unlikely to actually flip a coin :/
Thankfully the fix is easy: you have the testing webapp decide for the participant. They receive the product, enter the numbers online, and are told which to use.
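A minimal sketch of that step, assuming the participant types in the two product IDs printed on the packages (the function name and return format are made up for the example):

```python
import secrets


def assign_package(id_a, id_b):
    """Pick which of the two entered product IDs the participant should use."""
    if id_a == id_b:
        raise ValueError("the two product IDs must differ")
    ids = sorted([id_a, id_b])  # ignore the order the participant typed them in
    use = secrets.choice(ids)   # randomness comes from the OS, not from the participant
    set_aside = ids[0] if use == ids[1] else ids[1]
    return {"use": use, "set_aside": set_aside}
```

Sorting before choosing means the result cannot depend on which package the participant happened to pick up or enter first, which is the left-versus-right preference worry raised above.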
For non-crossover trials I agree this needlessly increases the cost. It’s almost surely better to use a trusted third party.
Agree. I think having a trusted third party handle the shipments is cleaner at the moment. I’m still curious what blogospheroid’s thread comes to. It seems like the paranoia of cryptoland is helping us see some more holes in modern experiment design (ie your thread on poisonous placebos).
Thanks for saying this. Looking closer, I actually think their existing Fulfillment APIs would just work for this (ie the webapp controls an Amazon fulfillment account, the person seeking a test ships two pallets of physical product there, the webapp says where to send them).
You are right, if we already have hosted our webapp at trust-place we should be able to use the existing Amazon API.
If the company whose product is tested simply ships additional copies to the Amazon warehouse, those copies could be archived by the trusted organisation. If anybody doubts that the products are real, the trusted organisation has copies that they can analyse. If the whole thing scales, the trusted organisation can also randomly inspect products to see if they contain what they should contain.
Yes cryptoparanoia is always fun ;) The web app could regularly publish hashes of the data of specific studies to a public block chain. That way any tampering that happens afterwards can be detected, and you only need to trust that the web app is tamper-proof at the moment the data gets transmitted.
This is a great point. Maybe community members could bet karma on the outcome of a tox screening? This could create a prioritized list.
One problem with my earlier suggestion is that some companies will want narrowly selected participant pools. These will necessarily differ from the population at large, and might create data that looks like a poison placebo is being used. I see two possible solutions to this problem:
Log baseline data before the treatment is used. If people do worse on the placebo, that would be very suspicious.
Include an additional group of testers that do something different not related to the placebo/product. “Eat an apple every day for the next week”. If the placebo group did worse than the apple group, that would be very suspicious.
I feel like #2 from above is unsatisfying though, if we think it works then why are we using normal placebos?
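For the first option, one way to put a number on “very suspicious” is a sign-flip permutation test on each placebo participant’s change from baseline. This is just one possible check, not something the thread settled on. It assumes paired scores where a higher value means worse symptoms, and the function name and defaults are made up for the example.

```python
import random
from statistics import mean


def worsening_p_value(baseline, followup, n_permutations=10000, seed=0):
    """One-sided sign-flip permutation test on paired scores.

    Assumes higher scores mean worse symptoms. A small return value says the
    group got worse by more than chance alone would explain.
    """
    diffs = [f - b for b, f in zip(baseline, followup)]
    observed = mean(diffs)
    rng = random.Random(seed)  # fixed seed so the check is reproducible
    hits = 0
    for _ in range(n_permutations):
        # Under "no real change", each difference is equally likely to be + or -.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

A low p-value in the placebo arm would not prove the placebo was poisoned, since narrowly selected participant pools can drift for other reasons, but it would flag the study for a closer look.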
This would actually be really easy to implement. (Not the block chain portion, the per-study rolling checksums).
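As a rough sketch of what per-study rolling checksums could look like (not the code in the repo; the record format and function names are assumptions), each new record is hashed together with the previous digest, so changing any earlier record changes every digest after it:

```python
import hashlib
import json

GENESIS = "0" * 64  # starting value for a study with no records yet


def chain_digest(prev_digest, record):
    """Hash one record together with the digest of everything before it."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_digest.encode() + payload).hexdigest()


def rolling_checksums(records):
    """Return the running digest after each record, in order."""
    digests = []
    prev = GENESIS
    for record in records:
        prev = chain_digest(prev, record)
        digests.append(prev)
    return digests


def first_tampered_index(records, published):
    """Return the index of the first record that no longer matches, or None."""
    for i, (expected, actual) in enumerate(zip(published, rolling_checksums(records))):
        if expected != actual:
            return i
    return None
```

Publishing only the latest digest somewhere public, at regular intervals, is enough for an outsider to later confirm that the stored records up to that point were not edited.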
Any updates on this?
Okay, sorry I’ve been away from the thread for a while. I spent the last half day hacking together a rough version of the data collection webapp. This seemed reasonable because I haven’t heard any disagreement on that part of the project, and I hope that having some working code will excite us :)
The models are quite good and well tested at this point, but the interface is still a proof of concept. I’ll have some more time tomorrow evening, which will hopefully be enough time to finish off the question rendering and SMS sending. I think with those two features added, we will have a reasonable v1.
We will still need to create
The main project page & study creation interfaces
Questions for use in our initial experiments
Participant location and screening criteria
Data analysis routines
Legal contracts
Paper describing what we did (Erdős numbers don’t grow on trees :p)
Repo is: https://github.com/GeneralBiotics/GLaDyS (I’ll move it away from the GB github once we finalize a project name).
Update: Question rendering now works, demo app can be viewed at http://gladys-example.herokuapp.com/
Cool work! I think it might make sense to create a new top level post to point to the progress and solicit more feedback. Comments down here aren’t going to get enough eyeballs, and now that you’re getting into the prototyping stage more eyeballs (and associated feedback) would be useful I think.
One thing I wonder, maybe you could have a dummy study set up to test the efficacy of something incredibly simple, like whether “apple eating increases prescience” as a control group for experiments. Some people get the apple condition. Some people get the no apple condition. Every evening participants get a question “Did you eat an apple today?” and “How did your secret coin toss come out, heads or tails?” To be in the study people have to go through a screening process and agree that if randomly assigned to do so, they will buy some apples and eat at least one apple a day, every day for a week.
That way LWers could sign up for it and interact with the software, and workflows like user signup and data collection and whatnot could be experienced and iterated to get feedback on the little practical things before the Serious Science begins in earnest :-)
Could you please link to examples of the kind of marketing studies that you are talking about? I’d especially like to see examples of those that you consider good vs. those you consider bad.
I did a poor job at the introduction. I’m assuming the studies exist, because if they don’t that’s full on false advertising.
Not to pick on anyone in particular here are some I recently encountered:
mtailor.com (“Scientifically proven to measure you 20% more accurately than a professional tailor.”—no details are provided on how this was measured, hard to believe claim, YC company)
Nightwave Sleep Assistant—list of effects, no source.
Basically anything in Whole Foods :p
The probiotics section at Whole Foods (and my interactions with customers who believed those claims or were skeptical of my claims given the state of the supplement market) was what finally caused me to post this thread.
As a perplexing counterbalance to Whole Foods, there are companies which don’t advertise any effects whatsoever, even though you’d expect they would.
List of companies where a lack of studies/objective claims caught my imagination:
unbounce.com & optimizely.com: these are huge companies doing science stuff. Why don’t they have “Optimizely users make X% more revenue after 9 months” rather than testimonials?
The five companies I did customer development with (Beeminder, HabitRPG, Mealsquares, Complice, Apptimize)
This is an interesting project!
An obvious relevant model is Gwern’s self-experimentation (http://www.gwern.net/Nootropics)
The key difference being, of course, that you are interested in group differences.
A key step will be offering power calculations so that the sample size can be estimated prior to performing the test. (Also, so that post hoc you can understand how big an effect your study should have been able to detect.)
There are already some web apps that perform this, however. How will your app improve over those, or will yours offer an integrated solution and therefore be more valuable?
Eg., see http://www.statisticalsolutions.net/pss_calc.php
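For a sense of what such a calculation involves, here is a minimal sketch for the simplest case: a two-arm study comparing means, using the standard normal approximation. The function name and defaults are made up for the example, and the effect size is the standardized difference (Cohen’s d), which requires baseline mean and variance data to estimate.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Participants needed per arm for a two-sided, two-sample comparison of means.

    Normal approximation; effect_size is the difference in means divided by
    the standard deviation (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


print(sample_size_per_group(0.5))  # medium effect: 63 per group
print(sample_size_per_group(1.0))  # large effect: 16 per group
```

The required sample grows with the inverse square of the effect size, so halving the effect you want to detect roughly quadruples the number of participants. An exact t-test based calculation gives slightly larger numbers than this approximation.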
Thanks, that’s a great point.
I’m worried that a statistical calculator will throw off founders who would otherwise test their products with us (specifically YC founders, an abnormally influential group), so as much as possible I’d like to keep sample sizes in the “Advanced Menu” section. (This is not to say this is an unimportant issue—I’m saying this is a more important issue because many people won’t be customizing the default values).
I also think there are three unique features of product studies that can help simplify defining good default values here:
Startups are going to be interested in talking about big improvements (small sample sizes needed).
Startups will likely view study participation as advertising, allowing for a generous margin of error on sample size.
Consumers are skeptical of low-sample size studies, even when they shouldn’t be.
What do you suggest we do? It sounds like getting baseline mean and variance data for the questions we include with the app is basically a requirement.
What is the awesome version of handling this issue? :p