Year 4 Computer Science student
find me anywhere in linktr.ee/papetoast
Which GPT? The paper mentioned that GPT-5{,-mini,-nano} has only a ~5% success rate. I tried it with o3 and got 2/3.
There may be something to find in how distributed systems do leader election; the elected leader could then serve as a majority-trusted overseer. Note that I don't really understand leader election.
https://www.lesswrong.com/posts/WLQspe83ZkiwBc2SR/double-crux
(and hover over the reactions on your comment)
When females flirted, it was picked up 18% of the time. And while I do not think one should take these numbers very seriously, I will note that female flirting was also detected 17% of the time when the female did not flirt, so at least according to these numbers… female flirting gave basically zero signal above baseline noise.
This seems wrong. What the data implies is that when female flirting is detected, the detector is right ~50% of the time. My guess for the baseline probability that a female is flirting is <2%. There is indeed signal here.
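For what it's worth, the arithmetic behind the ~50% figure, assuming the study had a roughly even split of flirting and non-flirting interactions (an assumption about the study design):

```python
# Bayes' rule: probability a "flirting detected" call is actually correct.
p_detect_given_flirt = 0.18     # detection rate when she flirted
p_detect_given_no_flirt = 0.17  # false-positive rate when she did not
prior_flirt = 0.5               # assumed even split in the study

posterior = (p_detect_given_flirt * prior_flirt) / (
    p_detect_given_flirt * prior_flirt
    + p_detect_given_no_flirt * (1 - prior_flirt)
)
print(round(posterior, 3))  # ~0.514: detections are right about half the time
```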
You may have misunderstood. I am asking about why your reply got −5 agreement vote despite it seeming correct to me, nothing related to the other comments.
Can someone explain why they disagree? I don’t see a particularly obvious reason.
Reading through all the responses, the one thing that sticks out is that Gemini-2.5 really, really wants to write the first character in caps.
what it produces in these unmonitored runs takes more work for me to clean up than just iterating with Claude directly
It may be that the type of work we are doing differs, then.
Also, you don't seem too bothered that running Claude Code implies a responsibility to review the code soon-ish (or let your local codebase get increasingly messy). The fact that I don't need to worry about state with PR agents means it is more affordable to spin up more attempts, and because more attempts can run simultaneously, each individual attempt can be of lower quality, as long as the best attempt is good. Deciding that the code is garbage and not worth any time cleaning up is much faster than cleaning it up, so in general I don't find the initial read-through of the n attempts to take that much time. In the end I still only spin up Codex on desktop if I think the task has a reasonable chance of being done well, which really depends on the specific task's size/difficulty/type (bug fix, refactor, additions). It's also likely that Claude Code works better for you because you're more experienced and can basically tell Claude exactly what to do when it's stuck.
I like to use the PR agents in some cases. (But I still manually check out those branches and rebase, split the commits, or rewrite some stuff.)
spin off tasks when I’m on mobile
it is easier to do multiple parallel attempts on the same task when I know the output will probably suck. And, not gonna lie, OpenAI's Codex cloud has a very lenient compute limit, so I also feel like I'm saving money this way.
they live in (other people's) containers, so I don't need to worry about multiple agents colliding with each other. I know git worktrees exist, but juggling which worktree is on which branch turns out to be somewhat annoying too.
They are good for queueing up tasks that I don't expect to have the bandwidth to start working on today. I can have the agents open the PRs today and forget about them until a few days later.
LW uses GraphQL. You can follow the guide below for querying if you're unfamiliar with it.
https://www.lesswrong.com/posts/LJiGhpq8w4Badr5KJ/graphql-tutorial-for-lesswrong-and-effective-altruism-forum (For step 3 it seems like you now want to hover over output_type instead of input)
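If you prefer scripting it, here is a minimal sketch of a query against the LW GraphQL endpoint using only the standard library. The field names in the query are illustrative; check the actual schema in the explorer first:

```python
import json
import urllib.request

# Hypothetical example query; the view and field names may differ from the
# current LW schema, so verify them in the GraphiQL explorer.
query = """
{
  comments(input: {terms: {view: "postCommentsTop", limit: 3}}) {
    results {
      _id
      postedAt
    }
  }
}
"""

payload = json.dumps({"query": query}).encode()
req = urllib.request.Request(
    "https://www.lesswrong.com/graphql",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# Uncomment to actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```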
How I use AI for coding.
I wrote this in like 10 minutes for quick sharing.
I am not a full-time coder; I am a student who codes like 15-20 hours a week.
Investing too much time in writing good prompts makes little sense. I go with the defaults and add nudges as needed. (See one of my AGENTS.md files at the end.)
Mainly codex (cloud) and Cursor. Claude Code works, but being able to easily revert is helpful, so Cursor is better.
I still try out Claude Code for small edits, but it doesn't feel worth it.
I have no idea why people like Claude Code so much; a CLI is inferior to a GUI.
Using Cursor means I don't need multiple git worktrees, one per agent, as long as I get the agents to work on different parts of the codebase.
Mobile coding is real and very convenient with codex (cloud), but I still review and edit on desktop.
Using multiple agents is possible, but usually it's one big feature plus multiple smaller background edits.
Or multiple big features using codex cloud, and delay review to a later time.
Codex cloud is good, but it only generates one commit per PR; often I need to manually split it up. I am eyeing other cloud-agent solutions but haven't tried them seriously yet.
Current prompt for one of the python projects
## Code Style
- 120-character lines
- Type hints are a must
- **Don't use Python 3.8 typings**: Never import `List`, `Tuple` or other deprecated classes from `typing`, use `list`, `tuple` etc. instead, or import from `collections.abc`
- Do not use `from __future__ import annotations`, use forward references in type hints instead. `TYPE_CHECKING` should be used only for imports that would cause circular dependencies.
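A minimal sketch of the two rules above; the `myproject.models` module and `User` class are illustrative:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for type checking, e.g. to break a circular dependency.
    from myproject.models import User  # hypothetical module

class UserRepo:
    # String forward reference instead of `from __future__ import annotations`.
    def get(self, user_id: int) -> "User":
        ...
```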
## Documentation and Comments
Add code comments sparingly. Focus on why something is done, especially for complex logic, rather than what is done. Only add high-value comments if necessary for clarity or if requested by the user. Do not edit comments that are separate from the code you are changing. NEVER talk to the user or describe your changes through comments.
### Using a new environment variable
When using a new environment variable, add it to `.env.example` with a placeholder value, and optionally a comment describing its purpose. Also add it to the `Environment Variables` section in `README.md`.
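For instance (variable name and comment are illustrative):

```
# .env.example
# Base URL of the metrics backend (hypothetical)
METRICS_API_URL=https://example.com/api
```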
### Using deal
We only use the exception handling features of deal. Use `@deal.raises` to document expected exceptions for functions/methods. Do not use preconditions/postconditions/invariants.
Additionally, we assume `AssertionError` is never raised, so `@deal.raises(AssertionError)` is not allowed.
## Testing Guidelines
To be expanded.
Mocking is heavily discouraged. Use test databases, test files, and other real resources instead of mocks wherever possible.
Allowed pytest markers:
- `@pytest.mark.integration`
- `@pytest.mark.slow`
- `@pytest.mark.docker`
- builtin ones like `skip`, `xfail`, `parametrize`, etc.
We do not use
- `@pytest.mark.unit`: all tests are unit tests by default
- `@pytest.mark.asyncio`: we use `pytest-asyncio` which automatically handles async tests
- `@pytest.mark.anyio`: we do not use `anyio`
### Running Tests
Use `uv run pytest …` instead of simply `pytest …` so that the virtual environment is activated for you.
## Asking for Help
- Refactoring:
As a command-line only tool, you do not have access to helpful IDE features like “Refactor > Rename Symbol”. Instead, you can ask the user to rename variables, functions, classes, or other symbols by providing the current name and the new name. It is important that you don’t rename public variables yourself, as you might miss some occurrences of the symbol across the codebase.
## Information
Finding dependencies: we use `pyproject.toml`, not `requirements.txt`. Use `uv add <package>` to add new dependencies.
(Note that the Asking for Help section is basically useless. It was experimental, and I never got asked, lol.)
I don’t doubt the conclusion, but I think we would be buying (life expectancy − age) life-years instead of 1 life.
Are you guys talking about tin foil for small lights that some appliances emit? For windows I don’t understand why not just use a curtain.
It is a bit unintuitive to me that hallucinations are made-up inputs, but it does make sense.
Hard to tell from just what you said; mind sharing the conversation?
I apologize. I should have searched before talking.
Side note: This seems like a completely different topic from your top level comment. Kind of weird to start a mostly tangential argument inside an unresolved argument thread.
You’d still be better off creating the 1M x 100 world than the (1M + 1) x (100 - ε) world.
Where does (1M + 1) come from?
In the post Ben mentions the manufacturer doing hundreds of experiments, not millions. Of course, in the limiting case the smallest quality drop can and will be observed, but I believe Ben is not talking about that.
Even if we use the 1M base figure, it doesn't explain why it is +1 rather than, e.g., +1000.
You are assuming that the ice cream manufacturer is trying to maximise aggregate utility, which seems obviously false to me.
Alternatively, what about matching people by browser history? If there is a way to avoid data security and privacy concerns (ha!) then there are actually a lot of advantages.
I have recently learned that Fully Homomorphic Encryption (doing calculations on encrypted data) 1. exists and 2. is usable in a small scale.
https://bozmen.io/fhe
https://bozmen.io/fhe-current-apps (FHE Real-world Applications)
Current FHE has 1,000x to 10,000x computational overhead compared to plaintext operations, and on the storage side ciphertexts can be 40 to 1,000 times larger than the original. It's like the internet in 1990: technically awesome, but limited in practice.
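To make "calculations on encrypted data" concrete, here is a toy Paillier scheme, which is only additively homomorphic, not fully homomorphic: multiplying two ciphertexts adds the underlying plaintexts. Real FHE schemes (BGV, CKKS, TFHE) also support multiplication and use far larger parameters:

```python
import math

# Toy Paillier key generation; real keys use 1024+ bit primes.
p, q = 11, 13
n = p * q                # public modulus
n2 = n * n
lam = math.lcm(p - 1, q - 1)
g = n + 1                # standard generator choice
mu = pow(lam, -1, n)     # with g = n + 1, mu is simply lam^-1 mod n

def encrypt(m: int, r: int) -> int:
    # r must be coprime to n; it randomizes the ciphertext.
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return ((x - 1) // n * mu) % n

c1 = encrypt(7, r=5)
c2 = encrypt(8, r=9)
# Multiplying ciphertexts adds plaintexts: the "computation on encrypted data".
assert decrypt((c1 * c2) % n2) == 7 + 8
```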
Based on vibes, I find it more probable that the function from hard-to-oversee → easy-to-oversee is not 1-to-1 and thus not invertible. It feels more like a projection, so when you get simple alignment and try to un-project it, you just get a really high-dimensional space in which advanced alignment is a negligible region.