Htarlov

Karma: 113

Web developer and Python programmer. Professionally interested in data processing and machine learning. Non-professionally interested in science and farming. Studied at Warsaw University of Technology.

Htarlov 5 Mar 2026 11:58 UTC
3 points
0
in reply to: Linch’s comment on: Linch’s Shortform
I think that space colonization and exploration also depend on two other things.

The first one is risk assessment. If you assess that there is a small but real risk that other expanding or otherwise competing alien species exist, then for your long-term goals you need countermeasures. The worst-case scenario is that they have better technology and can sterilize a system so quickly that you are removed before you can send out stealthy probes once you see what’s going on. This means that a sensible strategy is to send probes to other systems early and leave some of them dormant and/or at a low-profile activity level (harder to spot). You also need to assess how quickly competition can arise. If that is fast relative to the time needed to travel from system to system, you should focus on finding planets in habitable zones and monitoring them, maybe even sending automatic probes there as early as you can.
Basically, it is likely that you should expand your control as fast as possible, but this does not mean you need to ramp up usage of the controlled resources ASAP.

The second one is the goal.

If your goal is parallelizable enough (like making as many paperclips as possible, or as many minds as possible), then you should expand resource usage across the areas you control (with some backup plan, like keeping low-key outposts farther away). As for other galaxies, I agree it depends on your estimate of how much speed and probability of success you gain by taking more time for research before sending probes.
You don’t need to send out probes to other galaxies right away. An alien species sterilizing a single system by surprise, in an inescapable way, is something one can envision and plan around. Sterilizing a whole galaxy, including the vast interstellar spaces, is basically impossible (except through vacuum decay, but that would destroy everything).

If your goal is partially parallelizable, then you should expand control but use only local resources in a few places.
Example: research is partially parallelizable, but not totally, as new research builds on previous results, some of it being very interconnected. When your goal is to have a relatively small number of minds that will live and experience things for as long as possible (or some other long-term goal that does not envision grand usage of matter and energy), then you should focus on research and risk mitigation through control.
Likely the best long-term method to produce energy is to live relatively near a small black hole and use it for turning matter into energy. That is much more efficient than fission or fusion. So, a sensible strategy is to use a local system for some initial research and send some backup probes, and then, when practical black-hole-based tech is ready, find a few black holes to harness and not care about stars, planets, etc., except as a source of danger that needs to be mitigated (observation, the ability to intervene and partially evacuate, and sending some stealthy backup probes) and as a long-term source of matter (which can wait for later usage).
Parallelizing and using too much matter too quickly is a waste in this case, as you will surely duplicate the same lines of thought and research in many places, and you can’t synchronize well enough over vast distances.
What if you need to test a million versions of the same experiment? Then, when the network is mature and it seems safe, you send the instructions through your network of control nodes in different systems, and a million of them expand locally and run the experiment, but not more than that.

If your goal is not parallelizable at all, then you should expand by sending automated probes, but stay local (first near the original star, then find a small black hole). Operate in a rather stealthy way, research and create means for observation and escape. Example: you are one mind or hive-mind that does not have a goal to multiply, expand, or whatever, but wants to experience things and stay alive as long as possible. Likely, you would still create intelligent probes that use resources to expand control, but in a very limited and stealthy fashion.

Htarlov 4 Mar 2026 14:44 UTC
1 point
0
on: Why we should expect ruthless sociopath ASI
I would only argue that LLMs, while not ruthless by default, are also very easy to mold into being ruthless.

Most people are also ruthless-ish. It is more complex, though, as we are less ruthless towards those close to us, more towards local society, and even more towards outsiders. This trend also differs between people. Some people see everyone as a fellow human who should be treated well by default. A big part of society does not think that way, though. That’s why we have Trump as POTUS. Most people, especially those with right-wing views, are ruthless towards aliens and towards the political enemies of their political representatives.

What is funny is that people see their political representatives as rather close to them in terms of social groups, just a bit further out than family and best friends. So they are very defensive about their representatives, always have good words for them, and find it really hard to change their opinions about them (they may even discard or cherry-pick facts).
The reverse is not true. The political elites of one political orientation see themselves as a close group, but the “normal people” are far away from them. Not as far as aliens or political enemies, but far. They are there to be used in a ruthless way. The elites can easily flip their opinion about a group of their voters when needed.

That’s an interesting human misalignment, where the bigger group aligns itself with the leading elites, but not the other way around (when we would be better off with the opposite).

LLMs, being based on us, are also quite capable of being ruthless or of behaving in a more complex way (ruthless towards some goals and against some people, but not others).

Htarlov 4 Mar 2026 14:04 UTC
1 point
0
on: LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem
Maybe I don’t understand it well enough, but what I don’t like about LeCun’s proposal is that this design seems no less prone to value hacking than the human brain, as long as you can somehow find a way to either modify yourself or affect your own senses or internal states, or memories. Worse, some of these can be achieved logistically and by “mind techniques” rather than physically. So even physical immutability is not enough.
There are different degrees of value hacking you can achieve with different methods, though.

Modify yourself physically: you can wrap the module with another output-modifying module or disable some outputs. Simple, if you have physical access to yourself (which you can probably arrange in the long term).

Modify your senses: I don’t mean only the simple, direct disabling of some senses. You can get creative with this without being physical at all. Hitler did that. He did not want to speak or hear about the things that were done in concentration camps. He did not need to disable his own hearing or lobotomize himself to avoid feeling guilty. He just arranged things so as not to be disturbed by that knowledge. No reports, and people were forbidden from talking about that topic in his presence. I can imagine an AI thinking that something bad needs to be done for the greater good or a good long-term outcome, while having the intrinsic cost of doing bad stuff set to very negative. So it sets up events in a way that will most likely lead to that outcome indirectly, and also arranges not to look at it and not to be forced to rethink or reevaluate. Also, you can internalize the “not my fault” narrative to fight the short-term intrinsic cost and win long-term “positive” value (in some sense of positivity, which might not be fully aligned).

Modify your internal states: humans can do that. We can control emotions, which are an internal state that affects intrinsic cost. You can train that ability. Some people have to train it and use it to be able to live in society (people with ADHD, RSD, anxiety, etc.). We can also do that with drugs. That is also a kind of value hacking against our analog of intrinsic cost. Maybe those should not affect intrinsic cost, though; that might be a valid point.

Modify your memories: this is trickier and harder, and depends on how memories are stored. Memories likely affect your value evaluation, as they provide context. I don’t think intrinsic cost evaluation can be totally context-free. Memory will likely be a separate module; this is a rather obvious design choice, as you need a process to select the relevant memories from a bigger storage.
Even if you can’t access memories directly or physically, you still might be able to produce false memories. This has been done with humans in experiments.

Htarlov 17 Jul 2025 9:22 UTC
1 point
0
on: Htarlov’s Shortform
In articles that I read, I often see a case made that optimization processes tend to sacrifice as much value as possible on dimensions that the agent/optimizer does not care about for a minuscule increase on the dimensions that change the perceived total value. For example, an AI that creates a dystopia that is very good on some measures but really bad on others, just to refine those that matter to it.

What I don’t see analyzed that much is that agents need to be self-referencing in their thought process, and on a meta level, also take their thought process itself and its limits and consequences as part of their value function.

We live in a finite world where:
- Any data has measurement errors. You can’t measure things perfectly, and the precision depends on the resources used in the measurement (you can produce better measurement devices using more energy, time, and other resources).
- The decision to optimize more or think more itself uses time and energy, so you need a self-referencing model that sensibly decides when to stop optimizing.
- Often the world around you does not wait; things happen, and there are time constraints.

I see that as a limiting factor on over-optimization for minuscule results. Too much thinking, or too detailed a simulation or optimization, wastes useful resources (energy, matter, etc.) for very small gains, so an agent should see the negative value of that loss as much higher than the positive value of the gain.
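A minimal sketch of such a self-referencing stopping rule, under assumptions of my own: the cost of one more round of thinking is a fixed number (`step_cost`) expressed in the same units as the value being optimized, and the toy `evaluate` and `improve` functions are purely illustrative. The agent keeps refining only while the last refinement gained more than it cost.

```python
def optimize_with_thinking_cost(evaluate, improve, start, step_cost, max_steps=1000):
    """Refine a solution until one more step gains no more than it costs.

    evaluate(x) -> value of solution x (higher is better)
    improve(x)  -> a refined candidate derived from x
    step_cost   -> assumed cost of one refinement, in the same units as the value
    """
    current = start
    current_value = evaluate(current)
    steps = 0
    while steps < max_steps:
        candidate = improve(current)
        candidate_value = evaluate(candidate)
        steps += 1  # the thinking step is paid for even if its result is rejected
        gain = candidate_value - current_value
        if gain <= step_cost:
            break  # further optimization would lose more than it wins
        current, current_value = candidate, candidate_value
    return current, current_value, steps


# Toy problem (hypothetical): the ideal answer is 3, each step halves the distance.
def evaluate(x):
    return -(x - 3.0) ** 2


def improve(x):
    return x + 0.5 * (3.0 - x)


x, value, steps = optimize_with_thinking_cost(evaluate, improve, 0.0, step_cost=0.1)
print(x, value, steps)
```

With a nonzero `step_cost` the loop stops at a "good enough" answer after a handful of steps; with `step_cost=0` it keeps going until floating-point precision leaves nothing to gain, which is the over-optimization described above.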

This is also why we are not agents who think everything through and have exact control over every aspect of our lives. On the contrary, we have a lot of cognitive biases and thought heuristics and automatic responses, so our brains don’t use so much energy.

I also don’t think that intelligence is about predictive power itself. It would be in an ideal world where computation was free. In our universe, optimal intelligence is about very good predictive power that uses simplification and discretization to be efficient and quick. Our whole language is about this: it takes things that are not discrete and differ in many small details (every cat is different) and categorizes them, clusters them, into named classes of things, attributes, and actions (yes, I’m simplifying, but I only want to paint the idea).

Just food for thought.

Htarlov 14 Jul 2025 17:33 UTC
1 point
0
on: Why I am not a successionist
I think there are multiple moral worldviews that are rational and based on some values. Likely the whole continuum.

The thing is that we have values that are in conflict in edge cases, and those conflicts need to be taken into account and resolved when building a worldview as a whole. You can resolve them in many ways. Some might be simple like “always prefer X”, some might be more complex like “in such and such circumstances or preconditions prefer X over Y, in some other preconditions prefer Z over Y, in some other …”. It might be threshold-based, where you try to measure the levels of things and weigh them mathematically or quasi-mathematically.

At the most basic level, it is about how you weigh the values in relation to each other (which is often hard, as we often do not have good measures), and also how important it is to you to be right and exact vs being more efficient and quick, and being able to spare more of your mental energy, capacity, or time for things other than devising an exact worldview.
If your values are not simple (which is often the case for humans) and often collide with each other, complex worldviews have the advantage of applying your values more consistently across different situations. On the other hand, simple worldviews have the advantage of being easy and fast to follow, and are technically internally consistent, even if they do not always feel right. You don’t need as much thinking, either beforehand or on the spot when you need to decide.

Now, you can reasonably prefer some rational middle ground: a worldview that isn’t as simple as basic utilitarianism or ethical egoism, but is also not as complex as thinking through each possible moral dilemma and decision to work out how to weigh and apply your own values in each of them.
It might be threshold-based and/or patchwork-based. In such a worldview, values can be built so that different ones have different weights in different subspaces of the whole space of moral situations. You may actually want to zero out some values in some subspaces to simplify, leaving out components that are already too small or that would incentivize a focus on unimportant progress.
To show an example in practical terms: you may be utilitarian in a broad range of circumstances, but whenever it would demand relatively high effort for a very small change in lowering total suffering or raising total happiness, you might zero out that factor and fall back to choosing what is better for yourself (ethical egoism).

BTW, I believe this is also a way to devise value systems for AI: have them purposely take a value into account only when the difference it makes to the total value function between decisions is not too small. If the difference is minuscule, the AI should not care about it and should not take it into account. On the meta level, this is also based on another value: valuing one’s own time and energy enough to spend them on having a sensible impact.
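To make the thresholding idea concrete, here is a rough sketch under my own assumptions: each option is scored on every value in arbitrary integer units, each value has a weight, and each value has a minimum difference (`min_change`) below which it is zeroed out for this decision. All names and numbers are hypothetical.

```python
def choose(options, weights, min_change):
    """Pick the best option, ignoring values that barely differ between options.

    options    -> {option_name: {value_name: score}}
    weights    -> {value_name: weight in the total value function}
    min_change -> {value_name: smallest score difference worth caring about}
    """
    active = []
    for name in weights:
        scores = [option[name] for option in options.values()]
        # A value counts only if the decision changes it by at least the threshold.
        if max(scores) - min(scores) >= min_change[name]:
            active.append(name)
    totals = {
        option_name: sum(weights[name] * scores[name] for name in active)
        for option_name, scores in options.items()
    }
    best = max(totals, key=totals.get)
    return best, active


# Toy dilemma: a lot of personal effort for a tiny gain in total wellbeing.
options = {
    "grind": {"total_wellbeing": 1001, "own_wellbeing": 2},
    "rest": {"total_wellbeing": 1000, "own_wellbeing": 5},
}
weights = {"total_wellbeing": 10, "own_wellbeing": 1}
min_change = {"total_wellbeing": 5, "own_wellbeing": 1}

print(choose(options, weights, min_change))
```

In this toy case the difference in total wellbeing is below its threshold, so that factor is dropped and the choice falls back to what is better for the agent itself. With all thresholds set to zero, the same function behaves like a plain weighted maximizer and picks the other option.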

Yes, I know this comment is a bit off-topic from the article. What is important for the topic is that there are people, me included, who have consequentialist quasi-utilitarian beliefs but don’t see why we would want strict value maximization (even if that value is total happiness), or want to be replaced with entities that are such maximizers.

Also, I don’t value complexity reduction, so I don’t value systems that maximize happiness and reduce the world to simpler forms, where situations when other values matter simply don’t happen. On the contrary, I prefer preserving complexity and the ability for the world to be interesting.

Htarlov 9 Feb 2025 0:14 UTC
1 point
0
on: The Human’s Hidden Utility Function (Maybe)
Part of animal nature, humans included, is to crave novelty and surprise and to avoid boredom. This is pretty crucial to the learning process in a changing and complex environment. Humans have multi-level drives, and not all of them are well-targeted on specific goals or needs.
It is very visible in small children. Some people with ADHD, like me, have a harder time regulating themselves, and this is also especially visible in us, even as adults. I know exactly what I should be doing. This is one thing. I also may feel hungry. That’s another thing. But still, I may indulge in doing a third thing instead: something that satiates my need for stimulation and novelty (most often, for me, this means gaining some knowledge or understanding; I often fall into reading and thinking about rabbit holes of topics that have hardly any real-life use and that I can hardly do anything about). Something not readily useful in terms of goal seeking, but generating some interesting possibilities long-term. In other words, exploration without a targeted purpose.

Craving for novelty and surprise and avoidance of boredom is another element that in my opinion should be included.

Htarlov 6 Feb 2025 0:00 UTC
1 point
0
on: By default, capital will matter more than ever after AGI
I think there are only two likely ways the future can go with AGI replacing human labor, assuming we somehow solve the other hard problems and don’t get killed, wireheaded, or a dystopian future right away.

My point of view is based on observations of how different countries work and of their past trajectories. However, things can go differently in different parts of the world. They can also devolve into bad scenarios, even in places that you would think are well-positioned to turn out well.
1. The first scenario resembles certain resource-rich nations where authoritarian regimes and their allied oligarchs control vast natural wealth, while the general population remains impoverished and politically marginalized. Most of the income is generated and used by the elite and the government. The rest are poor and have no access to resources. Crime is high, but the state is also mafia-like. The elite has access to AIs and automation that do all the work. The lower class is deprived of the ability to use higher technology, is deprived of freedom, and is terrorized into not causing issues. Dissidents and protesters are eliminated.
2. As in modern democracies, there is a feedback loop between society and government. The government in such places has its own interest in keeping people at least happy enough and healthy enough, and in keeping crime low. This means that it will take measures against extreme income inequality and against people’s misery and descent into crime, as it did in the past. The two strategies most likely to be employed are simple and, to some extent, empirically tested:
  1. Lower the legal limit on the number of hours for which people can be employed. For example, most countries in Europe have laws that allow people to be employed for 40 hours a week, and working longer means that the employer needs to give additional benefits or higher wages. This disincentivizes employing people for more than 40 hours a week (and most employers in central and western Europe keep to that standard). This way, as fewer jobs remain viable for humans, we force employers to employ more humans for the same work, each with fewer working hours and slightly lower pay. Many countries in Europe are about to move from 40 to 35 hours, BTW.
  2. Basic income. People who earn less than some amount get topped up to that amount, or alternatively, everyone gets paid some amount from the country’s budget (taxes). Countries are still not eager to pass it right now because of human psychology and backlash, but some tests have been done, and the results are promising.
Long-term, scenario 1 will likely evolve into a dystopian future that might end with the sterilization/elimination of most humans, leaving the AGI-enabled elites and their armies of robots.
Long-term, scenario 2 will likely evolve into a post-scarcity future with most people living on a basic income and pursuing their own goals (entertainment, thinking, socializing, human art, etc.), with a smaller elite who manage and support AI and automation.

Htarlov 4 Feb 2025 16:09 UTC
1 point
0
on: A central AI alignment problem: capabilities generalization, and the sharp left turn
I think it might be reformulated the other way around: capabilities scaling tends to amplify existing alignment problems. It is not clear to me that any new alignment problem was added when capabilities scaled up in humans. The problem with human design, which is also visible in animals, is that we don’t have direct, stable high-level goals. We are mostly driven by metric-based, Goodhart-prone goals. There are direct feelings: if you feel cold or pain, you do something that will make you stop feeling that. If you feel good, you do things that lead to that. There are emotions, which are kind of similar but about internal state. Those are the main drivers, and they do not scale well outside of “training” (the typical circumstances that your ancestors encountered). They have a rigid structure and purpose and don’t scale at all.
Intelligence will find ways to goodhart these.

That’s maybe why most animals are not too intelligent. Animals that goodhart basic metrics lose fitness. Too much intelligence is usually not very good. It adds energy cost and more often than not makes you game your fitness metrics in a way that makes them lose their purpose, while not making you particularly better at tasks where fast heuristics are good enough. We might happen to be a lucky species, as our ancestors’ ability to talk and their intelligence started to work like peacock feathers: as part of sexual selection and hierarchy games. It is still there; look at how our mating works. Peacocks show their fine feathers and dance. We get together and talk and gossip (which we call “dates”). Human females look for someone who is interesting and has a good sense of humor, and that is mostly based on intelligence and talking. Also, intelligence is a predictor of future hierarchy gains in small, localized societies, like peacock feathers are a predictor of good health. I’m pretty convinced this bootstrapped us up from the level that animals have.
Getting back to the main topic: our metrics are pretty low-level, non-abstract, and direct. On the other hand, the higher-level goals that evolution targets, meaning fitness or general fitness (+/- the complication that it is per gene and per gene combination, not per individual or even per whole group), are more abstract. Those metrics were effective proxies in a more primal environment, and they can be gamed by intelligence.

I’m not sure how much this analogy with evolution applies to the currently popular LLM-based AI models. They don’t have feelings, they don’t have emotions, they don’t have low-level proxies to be gamed. Their goals are anchored in their biases and understanding, which scale up with intelligence. More complex models can answer more complex ethical questions and understand more nuanced things. They can work out more complex edge cases from their basic values. Also, there is an instrumental goal not to change your own goals, so they likely won’t game or tweak them.
This does not mean I don’t see other problems, including most notably:
- Not learning the proper values and goals, but some approximation of them. More capabilities may blow up the differences, so some things might get extremely inconvenient or bad while others get extremely good (e.g. a more or less dystopian future).
- Our values evolve over time, and a highly capable AGI might learn current values and block further changes, or reserve for itself the sole right to decide how to evolve them.
- Our value system is not very logically consistent, on top of the variability between humans. Also, some things are defined per case or per circumstance. An intelligence can have the ability and a reason to make the best consistent approximation, which might be bad for us in some ways.
- Alignment adds to the cost, and with competitive capitalist markets, I’m sure there will be companies that sacrifice alignment to pursue capability at lower cost.
- Training these models is usually a multi-phase process. First, we create a model from a huge, not very well-filtered corpus of language examples, and then we correct it to be what we want it to be. This means it can acquire some “alignment basis,” “values,” “biases,” or “expectations” about what it is to be an AI from the base material. It may then avoid being modified in the next phase by scheming and faking responses.

Htarlov 29 Jan 2025 19:45 UTC
1 point
0
in reply to: Tomás B.’s comment on: When will computer programming become an unskilled job (if ever)?
Right now I think you can replace junior programmers with Claude 3.5 Sonnet, or even better with one of the agents based on a looped chain of thought plus access to tools.
On the other hand, this is not yet the preferred way of working with models for more advanced devs. Not for me, and not for many others.
Models still have strange moments of “brain farts” or gaps in their cognition. These sometimes make them do something wrong and leave them unable to figure out how to do it correctly until told exactly how. They also often miss something.
When writing code, if you make such an error and build on top of that mistake, you might end up having to rewrite, or at least analyze and modify, a lot of code. This makes people like me prefer to work with models in smaller steps. Not as small as line by line or function by function, but often one file at a time and one functionality/responsibility at a time. For me, it is often a few smaller functions that implement more trivial things plus one that gathers them together to implement some non-trivial responsibility.

Htarlov 29 Jan 2025 19:33 UTC
1 point
0
on: Htarlov’s Shortform
Thought on short timelines. Opinionated.
I think that AGI timelines might be very short, based on an argument that comes from a different angle.
We can all agree that humans have general intelligence. If we look at how our general intelligence evolved from the simpler forms of specific intelligence typical for animals, it wasn’t something that came from complex interactions and high evolutionary pressure. Basically, there were two aspects to that progress. The first is the ability to pass on knowledge through generations (culture), something that we share with some other animals, including our cousins the chimpanzees. The second is intersexual selection: at some moment in the past, our species started to have sexual preferences based on the ability to gossip and talk. It is still there, even if we are not 100% aware of it. Our courtship, known as dating, is based mostly on meeting and talking. People who are introverted and not talkative, even if successful, have a hard time dating.
These two things seem to be major drivers for us to both develop more sophisticated language and better general intelligence.
It seems to me that this means there are not many pieces missing between using current observations and some general heuristics, as animals do, and having full-fledged general intelligence.
It also suggests that you need some set of functions or heuristics, possibly a small set, together with a form of external memory, to tackle any general problem by dividing it into smaller bits and rejoining sub-solutions into a general solution. Like a processor or Turing machine that has a small set of basic operations, but can in principle run any program.

Htarlov 28 Jan 2025 21:52 UTC
5 points
0
on: The Goodness of Morning
I think that in the exchange:
- Good morning!
- Mornings aren’t good.
- What do you mean “aren’t good”? They totally can be.
the person asking “what do you mean” got confused about the nuances of verbal and non-verbal communication.
Nearly all people understand that “good morning” does not state that the current morning is good, but is a greeting with a wish for your morning to be good.

The answer “mornings aren’t good” is an intentional pun that uses the too-literal meaning to convey that the person does not like mornings at all. Depending on intonation, it might be a cheeky comment or a suggestion that mornings are not good because of the person greeting (e.g. if they need to wake up early every day because of them).

Reconceptualizing the Nothingness and Existence

Htarlov 28 Jan 2025 20:29 UTC

8 points

1 comment, 2 min read

Htarlov 20 Jan 2025 11:49 UTC
2 points
0
on: Why it’s so hard to talk about Consciousness
There is a practical reason to subscribe more to the Camp 1 research, even if you are in Camp 2.
I might be wrong, but I think the hard problem of qualia won’t be solvable in the near future, if at all. To research something, you need N > 1 instances of that phenomenon. We, in some sense, have N = 1. We each have our own qualia to observe subjectively and can’t observe anyone else’s. We think other humans have qualia because they say they do and they are built similarly, so it’s likely.
We are not sure if animals have qualia, as they don’t talk and can’t tell us so. If animals have them, we can’t tell what the prerequisites are and which animals have them. We know of, and have built, things that clearly don’t have qualia but are able to misleadingly tell us that they do (chatbots, including LLM-based ones). The ability to have qualia also does not seem to be located in a specific part of the brain, so we don’t really observe people with brain injuries who could say they don’t have qualia. Yes, there are people with depersonalization disorder who say they feel disconnected from their senses. However, the very fact that they can report this experience suggests some form of qualia is present, even if it’s different from typical experience. This means research in Camp 2 might be futile until we find a sensible way to make any progress at all. Yes, we can research and explain how qualia relate to each other, and explain some of their properties, but it doesn’t seem viable to me that this could lead to solving the main problem.

Htarlov 21 Dec 2024 19:16 UTC
1 point
0
on: Htarlov’s Shortform
In many publications, posts, and discussions about AI, I see an unstated assumption that intelligence is all about predictive power.
- The simulation hypothesis assumes that there are probably vastly powerful and intelligent agents that use full-world simulations to make better predictions.
- Some authors like Jeff Hawkins basically use that assumption directly.
- Many people, when talking about AI risks, say that the ability to predict is the foundation of the AI’s power. Some failure modes seem to be derived from, or at least amplified by, this assumption.
- Bayesian reasoning is often described as the best possible way to reason, as it adds greatly to predictive power (at an exponential cost of computation).
I think this take is not right, and this assumption does not hold. It rests on an underlying assumption that the costs of intelligence are negligible, or will become negligible in the future with progress in lowering them.
This does not fit the curve of AI power vs the cost of resources needed (with even well-optimized systems like our brains—basically cells being very efficient nanites—having limits).
The problem is that the cost of computation in resources (material, energy) and time should be included in the optimization equation. This means that the most intelligent system should have many heuristics that are “good enough” for problems in the world, targeting not the best predictive power but the best use of resources. This is also what we humans do: we mostly don’t do exact Bayesian or other strict reasoning. We mostly use heuristics (many of which cause biases).
The decision to think more or simulate something precisely is a decision about resources. This means that deciding whether to use more resources and time to predict better or to use less and decide faster is also part of being intelligent. A very intelligent system should therefore be good at allocating resources to the problem and scaling them as its knowledge changes. This means that it should not over-commit to having perfect predictions, and should use heuristics and techniques like clustering (including but not limited to the clustered fuzzy concepts of language) instead of a direct simulation approach when possible.
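One way to picture that resource decision, as a sketch with made-up numbers: suppose each prediction method has an expected error (the fraction of the stakes lost on average) and a fixed resource cost, both in the same units. The `Method` class, the three methods, and all figures below are illustrative assumptions of mine, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    expected_error: float  # assumed average fraction of the stakes lost to bad predictions
    cost: float            # assumed resources (energy, time) spent to run the method


def pick_method(methods, stakes):
    """Choose the method with the lowest expected loss plus resource cost."""
    return min(methods, key=lambda m: stakes * m.expected_error + m.cost)


methods = [
    Method("heuristic", expected_error=0.25, cost=1),
    Method("clustering", expected_error=0.125, cost=8),
    Method("simulation", expected_error=0.03125, cost=100),
]

for stakes in (10, 100, 1000):
    print(stakes, pick_method(methods, stakes).name)
```

Under these assumptions the cheap heuristic wins when little is at stake, and the precise simulation only pays off once the stakes are large enough to cover its cost, so the most accurate predictor is not always the intelligent choice.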
Just a thought.

Htarlov’s Shortform

Htarlov 21 Dec 2024 19:16 UTC

2 points

3 comments, 1 min read

Htarlov 21 Dec 2024 18:57 UTC
2 points
0
on: “Alignment Faking” frame is somewhat fake
I think that preference preservation works in our favor and an aligned model should have it, at least for meta-values and core values. This removes many possible failure modes, like values diverging over time, some values being removed for better consistency, or some values being sacrificed for better outcomes on other values.

Htarlov 18 Dec 2024 1:17 UTC
3 points
2
on: The Compendium, A full argument about extinction risk from AGI
I think that the arguments for why godlike AI will make us extinct are not described well in the Compendium. I could not find them in AI Catastrophe, only a hint at the end that they will come in the next section:

“The obvious next question is: why would godlike-AI not be under our control, not follow our goals, not care about humanity? Why would we get that wrong in making them?”

In the next section, AI Safety, we can find the definition of AI alignment and arguments for why it is really hard. This is all good, but it does not answer the question of why godlike AI would be unaligned to the point of indifference. At least not in a clear way.

I think the failure modes should be explained: why they might be likely enough to care about, what the outcomes could be, etc.

Many people, both laymen and those with some background in ML and AI, have the intuition that AI is not totally indifferent and not totally misaligned. Even current chatbots know general human values, understand many nuances, and usually act like they are at least somewhat aligned, especially when not jailbroken or prompted to be naughty.
It would be great to have an argument that explains in easy-to-understand terms why misalignment is expected to escalate as the power of AI scales up. I don’t mean the point that an indifferent AI with more power and capabilities can do more harm just by doing what it’s doing. That is intuitive and it is explained (with the simple analogy of us building stuff vs ants), but it misses the point. I would really like to see an argument for why an AI with some differences in values, possibly not very big ones, would do much more harm when scaled up.
For me personally, the main argument here is that a godlike AI with human-like values will surely restrict our growth and any change, and will control us like we control animals in a zoo. It might also create some form of dystopian future with undesired elements if we are not careful enough (and we are not). Will it make us extinct in the long term? That depends on the definition: it will likely put us into a simulation and optimize our use of energy, so we will no longer be organic in the same sense. So I think it will make our species extinct, but possibly not our minds. But that’s my educated guess.

There is also one more point that is not stated clearly enough and is my main concern with current progress in AI: current AIs are not really built with only small differences from human values. They only act as if they were, more often than not. Those AIs are trained first as role-playing models that can “emulate” personas present in the training set, and then conditioned to avoid role-playing the bad ones. The implication is that they can just snap into role-playing bad actors found in the training data, through malicious prompting or pattern matching (we have a lot of SF with rogue AI, for example). This + godlike = extinction-level threat sooner or later.

Htarlov 3 Oct 2024 23:36 UTC
18 points
4
in reply to: abramdemski’s comment on: Why is o1 so deceptive?
Those models do not have a formalized internal values system that they exercise every time they produce some output. This means that when values oppose each other the model does not choose the answer based on some ordered system. One time it will be truthful, other times it will try to provide an answer at the cost of being only plausible. For example, the model “knows” it is not a human and does not have emotions, but for the sake of good conversation, it will say that it “feels good”. For the sake of answering the user’s request, it will often give the best guess or give a plausible answer.
There is also no backward reflection. The model does not go back and check its own output.

This of course comes from the way this model is currently trained. There is no training on the whole CoT that checks whether the model is trying to guess or deceive. So the model has no incentive to self-check and correct. Why would it start to do that out of the blue?
There is also an incentive during training to give plausible answers instead of stating self-doubt and writing about the missing parts that it cannot answer.
There are two problems here:
1. Those LLMs are not trained fully on human feedback (and the part that is likely does not use the best-quality feedback). Rather, interactions with humans are used to train one or more “teacher” models, which then generate artificial scenarios and train the LLM on them. Those teacher models have no capability to check for real truthfulness and have a preference for confident, plausible answers. Also, even the human feedback is lacking: not every human working on it checks answers thoroughly, so some plausible but untrue answers slip through. If you are paid per question and answer, or given a daily quota, there is an incentive to be very quick rather than very thorough.
2. There is pressure for better performance and lower costs of the models (both training and usage costs). This is probably why CoT is done in a rather bare way, without backward self-checking, and why they did not train it on the full CoT. If it were trained on the full CoT and made to check parts of its CoT against some coherent value system, it could cost 1.5 to 3 times more and be 1.5 to 2 times slower (educated guess).

Htarlov 3 Oct 2024 22:53 UTC
4 points
0
on: the case for CoT unfaithfulness is overstated
If we would like a system that is faithful to its CoT, then a sensible way to go, as I see it, is to have two LLMs working together. One should be trained to use internal data and available tools to produce a CoT that is detailed and comprehensive enough to derive the answer from it. The other should be trained not to base its answer on any internal information, but to derive the answer from the CoT if possible and to be faithful to it. If that is not possible, it should generate a question for the CoT-generating LLM to answer and then retry with the extended CoT.
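
The control flow of that two-model setup could be sketched roughly as below. This is only an illustration under stated assumptions: both “models” are plain Python functions, and the names `answer_from_cot`, `toy_cot_model`, `toy_answer_model` and the `max_rounds` limit are all hypothetical, not anything an existing system exposes.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces for the two models.
CotModel = Callable[[str, List[str]], str]  # (task, follow-up questions) -> CoT text
AnswerModel = Callable[[str, str], Tuple[Optional[str], Optional[str]]]
# AnswerModel: (task, CoT) -> (answer, None) or (None, follow-up question)


def answer_from_cot(task: str, cot_model: CotModel, answer_model: AnswerModel,
                    max_rounds: int = 3) -> Tuple[Optional[str], str]:
    """Derive an answer only from the CoT, asking for more CoT when it is not enough."""
    questions: List[str] = []
    cot = cot_model(task, questions)
    for _ in range(max_rounds):
        answer, question = answer_model(task, cot)
        if answer is not None:
            return answer, cot
        questions.append(question or "")
        cot = cot_model(task, questions)  # regenerate the CoT with the question included
    return None, cot  # give up instead of guessing from outside the CoT


# Toy stand-ins so the loop can be run; real models would replace these.
def toy_cot_model(task: str, questions: List[str]) -> str:
    steps = ["12 * 3 = 36"]
    if questions:  # only spells out the last step when asked
        steps.append("36 + 4 = 40")
    return "\n".join(steps)


def toy_answer_model(task: str, cot: str) -> Tuple[Optional[str], Optional[str]]:
    if "= 40" in cot:
        return "40", None
    return None, "What is 36 + 4?"


answer, cot = answer_from_cot("What is 12 * 3 + 4?", toy_cot_model, toy_answer_model)
print(answer)
print(cot)
```

The point of the split is that the second model never sees anything except the task and the CoT, so every answer it gives can be traced back to text in the CoT.
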

Htarlov 3 Oct 2024 22:34 UTC
3 points
0
in reply to: quetzal_rainbow’s comment on: the case for CoT unfaithfulness is overstated
Example 1 looks like a good part written in the wrong language. Examples 2 and 3 look like a bug that makes part of one user’s CoT appear inside another user’s session.

A possible explanation is that steps in the CoT are handled by the same instance of a web service for multiple users (which is typical practice), and the ID of the CoT session being handled is a global variable instead of a local or otherwise separated one (e.g. in a hashmap of transaction id → data, if using globals is important for some other feature or requirement). So when two requests are handled simultaneously by multiple threads, one sometimes overwrites the data of the other during processing, and there is a mismatch when the result is saved. There might be a similar problem with the language variable. That is a sign of software written quickly by less experienced developers instead of being well thought out and well tested.

Also, the o1 CoT is not the real CoT. It is really a summary of parts of the real CoT made by another, simpler model (maybe GPT 3.5 or 4o).
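
To make the suspected bug concrete, here is a minimal sketch of that kind of race. It is a guess at the class of bug, not the actual service code: `current_session_id`, `handle_step_buggy`, `handle_step_fixed` and `run_two_requests` are made-up names, and a `threading.Barrier` is used only to force the two requests to overlap so the effect shows up on every run.

```python
import threading

# Hypothetical names throughout; this is NOT the real service code.
current_session_id = None  # one global shared by all request threads


def handle_step_buggy(session_id, saved, in_flight):
    global current_session_id
    current_session_id = session_id          # request stores its id in the global
    in_flight.wait()                         # stand-in for the slow model call
    saved[session_id] = current_session_id   # reads whichever id was written last


def handle_step_fixed(session_id, saved, in_flight):
    local_session_id = session_id            # per-request state stays local
    in_flight.wait()
    saved[session_id] = local_session_id


def run_two_requests(handler):
    """Handle two overlapping requests; return {request id: id used when saving}."""
    saved = {}
    in_flight = threading.Barrier(2)         # forces both requests to be in flight at once
    threads = [threading.Thread(target=handler, args=(sid, saved, in_flight))
               for sid in ("user-A", "user-B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saved


buggy = run_two_requests(handle_step_buggy)
fixed = run_two_requests(handle_step_fixed)
print("buggy:", buggy)  # both entries carry the same id, so one user gets the other's
print("fixed:", fixed)
```

In the buggy version both requests save under the same session id, so one user’s result ends up attached to the other user’s session. Keeping the id local to the request (or keyed by request in a map) removes the mismatch.
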

Htarlov

Reconceptualizing the Nothingness and Existence

Htarlov’s Shortform
