Beware technological wonderland, or, why text will dominate the future of communication and the Internet

VipulNaik 13 Apr 2014 17:34 UTC

20 points

Disclaimer: The views expressed here are speculative. I don’t have a claim to expertise in this area. I welcome pushback and anticipate there’s a reasonable chance I’ll change my mind in light of new considerations.

One of the interesting ways that many 20th century forecasts made of the future went wrong is that they posited huge physical changes in the way life was organized. For instance, they posited huge changes in these dimensions:

The home living arrangements of people. Smart homes and robots were routinely foreseen over time horizons where progress towards those ends would later turn out to be negligible.
Overoptimistic as well as overpessimistic scenarios of energy sources merged in strange ways. People believed the world would run out of oil by now, but at the same time envisioned nuclear-powered flight and home electricity.
Overoptimistic visions of travel: People thought humans would be sending out regular manned missions to the solar system planets, and space colonization would be on the agenda by now.
The types of products that would be manufactured. New products ranging from synthetic meat to room temperature superconductors were routinely prophesied to happen in the near future. Some of them may still happen, but they’ll take a lot longer than people had optimistically expected.

At the same time, they underestimated to quite an extent the informational changes in the world:

With the exception of forecasters specifically studying computing trends, most missed the dramatic growth of computing and the advent of the Internet and World Wide Web.
Most people didn’t appreciate the extent of the information and communication revolution and how it would coexist with a world that looked physically indistinguishable from the world of 30 years ago. Note that I’m looking here at the most advanced First World places, and ignoring the point that many places (particularly in China) have experienced huge physical changes as a result of catch-up growth.

My LessWrong post on megamistakes discusses these themes somewhat in #1 (the technological wonderland and timing point) and #2 (the exceptional case of computing).

What about predictions within the informational realm? I detect a similar bias. It seems that prognosticators and forecasters tend to give undue weight to heavyweight technologies (such as 3D videoconferencing) and ignore the fact that the bulk of the production and innovation has been focused on text (with a little bit of imagery to augment and interweave with the text) and, to a somewhat lesser extent, on images in their own right. In this article, I lay out the pro-text position. I don’t have high confidence in the views expressed here, and I look forward to critical pushback that changes my mind.

Text: easier to produce

One great thing about text is its lower production costs. To the extent that content production is small in volume and dominated by a few big players, high-quality video and audio play an important role. But as the Internet “democratizes” content production, it’s a lot easier for a lot of people to contribute text than to contribute audio or video content.

Some advantages of text from the creation perspective:

It’s far easier to edit and refine. This is a particularly big issue because with audio and video, you need to rehearse, do retakes, or do heavy editing in order to make something coherent come out. The barriers to text are lower.
It’s easier to upload and store. Text takes less space. Uploading it to a network or sending it to a friend takes less bandwidth.
People are (rightly or wrongly) less concerned about putting their best foot forward with text. People often spend a lot of time selecting their very best photos, even for low-stakes situations like social networks. With text, they are relatively less inhibited, because no individual piece of text represents them as persons as much as they consider their physical appearance or mannerisms to. This allows people to create a lot more text. Note that Snapchat may be an exception that proves the rule: people flocked to it because its impermanence made them less inhibited about sharing. But its impermanence also means it does not add to the stock of Internet content. And it’s still images, not videos.
It’s easy to copy and paste.
As an ergonomic matter, typing all day long, although fatiguing, consumes less energy than talking all day long.
Text can be created in fits and bursts. An audio or video needs to be recorded more or less in a continuous sitting.
You can’t play background music while having a video conversation or recording audio or video content.
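
To put a rough number on the storage and bandwidth point above, here is a back-of-the-envelope sketch in Python. Every figure in it (speaking rate, bytes per word, audio and video bitrates) is an assumption I picked for illustration, not a measurement, and the function names are hypothetical. It compares the size of one minute of speech stored as plain text, as compressed audio, and as modest-quality video.

```python
# Rough size comparison for one minute of spoken content.
# All constants are illustrative assumptions, not measurements.

WORDS_PER_MINUTE = 150   # assumed conversational speaking rate
BYTES_PER_WORD = 6       # assumed: about 5 letters plus a space, one byte each
AUDIO_KBPS = 64          # assumed compressed speech bitrate, kilobits per second
VIDEO_KBPS = 1000        # assumed modest-quality video bitrate, kilobits per second


def text_bytes(minutes: float) -> float:
    """Bytes needed to store the transcript as plain text."""
    return minutes * WORDS_PER_MINUTE * BYTES_PER_WORD


def stream_bytes(minutes: float, kbps: float) -> float:
    """Bytes needed for a constant-bitrate stream of the given length."""
    # kilobits/s -> bits/s (x1000), then bits -> bytes (/8)
    return minutes * 60 * kbps * 1000 / 8


def size_ratios(minutes: float = 1.0) -> dict:
    """Text size in bytes, plus how many times larger audio and video are."""
    t = text_bytes(minutes)
    return {
        "text_bytes": t,
        "audio_over_text": stream_bytes(minutes, AUDIO_KBPS) / t,
        "video_over_text": stream_bytes(minutes, VIDEO_KBPS) / t,
    }


if __name__ == "__main__":
    for name, value in size_ratios().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions, a minute of audio is roughly 500 times the size of its transcript, and a minute of video is roughly 8,000 times the size. Different bitrates would move the ratios, but probably not the orders of magnitude.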

Text: easier to consume and share

Text is also easier to consume and share.

Standardization of format and display methods makes the consumption experience similar across devices.
Low storage and bandwidth costs make it easy to consume over poor Internet connections and on a range of devices.
Text can be read at the user’s own pace. People who are slow at grasping the content can take time. People who are fast can read very quickly.
Text can be copied, pasted, modified, and reshared with relative ease.
Text is easier to search (this refers both to searching within a given piece of text and to locating a text based on some part of it or some attributes of it).
You can’t play background music while consuming audio-based content, but you can do it while consuming text.
Text can more easily be translated to other languages.

On the flip side, reading text requires you to have your eyes glued to the screen, which reduces your flexibility of movement. But because you can take breaks at your will, it’s not a big issue. Audiobooks do offer the advantage that you can move around (e.g., cook in the kitchen) while listening, and some people who work from home are quite fond of audiobooks for that purpose. In general, the benefits of text seem to outweigh the costs.

Text generates more flow-through effects

Holding willingness to pay on the part of consumers the same, text-based content is likely to generate greater flow-through effects because of its ability to foster more discussion and criticism and to be modified and reused for other purposes. This is related to the point that video and audio consumption on the Internet generally tends to substitute for TV and cinema trips, which are largely pure consumption rather than intermediate steps to further production. Text, on the other hand, has a bigger role in work-related stuff.

Augmented text

When I say that text plays a major role, I don’t mean that long ASCII strings are the be-all-and-end-all of computing and the Internet. Rather, interweaving a richer set of expressive and semantically powerful symbols into text, in more creative and innovative ways, is very important to harnessing its full power. It really is a lot different to read The New York Times in HTML than it would be to read the plain text of the article on a monochrome screen. The presence of hyperlinks, share buttons, the occasional image, sidebars with more related content, etc. adds a lot of value.

Consider Facebook posts. These are text-based, but they allow text to be augmented in many ways:

Inline weblinks are automatically hyperlinked when you submit the post (though at present it’s not possible to edit the anchor text to show something different from the weblink).
Hashtags can be used, and link to auto-generated Facebook pages listing recent uses of the hashtag.
One can tag friends and Facebook groups and pages, subject to some restrictions. For friends tagged, the anchor text can be shortened to any one word in their name.
One can attach links, photos, and files of some types. By default, the first weblink that one uses in the post is automatically attached, though this setting can be overridden. The attached link includes a title, summary, and thumbnail.
One can set a location for the post.
One can set the timing of publication of a post.
Smileys are automatically rendered when the post is published.
It’s possible to edit the post later and make changes (except to attachments?). People can see the entire edit history.
One can promote one’s own post at a cost.
One can delete the post.
One can decide who is allowed to view the post (and also restrict who can comment on the post).
One can identify who one is with at the time of posting.
One can add a rich set of “verbs” to specify what one is doing.

Consider the actions that people reading the posts can perform:

Like the post.
Comment on the post. Comments automatically include link previews, and they can also be edited later (with edit histories available). Comments can also be used to share photos.
Share the post.
Select the option to get notifications on updates (such as further comments) on the post.
Like comments on the post.
Report posts or mark them as spam.
View the edit history of the post and comments.
For posts with restrictions on who can view them, see who can view the post.
View a list of others who re-shared the post.

If you think about it, this system, although it basically relies on text, has augmented text in a lot of ways with the intent of facilitating more meaningful communication. You may find some of the augmentations of little use to you, but each feature probably has at least a few hundred thousand people who greatly benefit from it. (If nobody uses a feature, Facebook axes it.)

I suspect that the world ten years from now will feature text that is as richly augmented relative to today’s text as today’s text is relative to what it was back in 2006. Unfortunately, I can’t predict any very specific innovations (if I could, I’d be busy programming them, not writing a post on LessWrong). And it might very well be the case that the low-hanging fruit with respect to augmenting text is already taken.

Why didn’t all the text augmentation happen at once? None of the augmentations are hard to program in principle. The probable reasons are:

Training users: The augmented text features need a loyal userbase that supports and implements them. So each augmentation needs to be introduced gradually in order to give users onboarding time. Even if Facebook in 2006 knew exactly what features they would eventually have in 2014, and even if they could code all the features in 2006, introducing them all at once might scare users because of the dramatic increase in complexity.
Deeper insight into what features are actually desirable: One can come up with a huge list of features and augmentations of text that might in principle be desirable, but only a small fraction of them pass a cost-benefit analysis (where the cost is the increased complexity of user interface). Discovering what features work is often a matter of trial and error.
Performance in terms of speed and reliability: Each augmentation adds an extra layer of code, reducing the performance in terms of speed and reliability. As computers and software have gotten faster and more powerful, and the Internet companies’ revenue has increased (giving them more leeway to spend more for server space), investments in these have become more worthwhile.
Focus on userbase growth: Companies were spending their resources in growing their userbase rather than adding features. Note that this is the main point that is likely to change soon: the userbase is within an order of magnitude of being the whole world population.

Images

Images play an important role along with text. Indeed, websites such as 9GAG rely on images, and others like Buzzfeed heavily mix texts and images.

I think images will continue to grow in importance on the Internet. But the vision of images as it is likely to unfold is probably quite different from the vision that futurists generally envisage. We’re not talking of a future dominated by professionally done (or even amateurly done) 16 megapixel photography. Rather, we’re talking of images that are used to convey basic information or make a memetic point. Consider that many of the most widely shared images are the standard images for memes. The number of standard base images is much smaller than the number of memes made from them. Meme creators just use a standard image and their own contribution is the text at the top and bottom of the meme. Thus, even while the Internet uses images, the production at the margin largely involves text. The picture is scaffolding. Webcomics (I’m personally most familiar with SMBC and XKCD, but there are other more popular ones) are at the more professional end, but they too illustrate a similar point: it’s often the value of the ideas being creatively expressed, rather than the realism of the imagery, that delivers value.

One trend that was big in the early days of the Internet, then died down, and now seems to be reviving is the animated GIF. Animated GIFs allow people to convey simple ideas that cannot be captured in still images, without having to create a video. They also use a lot less bandwidth for consumers and web hosts than videos. Again, we see that the future is about economically using simple representations to convey ideas or memes rather than technologically awesome photography.

Quantitative estimates

Here’s what Martin Hilbert wrote in How Much Information is There in the “Information Society” (p. 3):

It is interesting to observe that the kind of content has not changed significantly since the analog age: despite the general perception that the digital age is synonymous with the proliferation of media-rich audio and videos, we find that text and still images capture a larger share of the world’s technological memories than before the digital age. In the early 1990s, video represented more than 80 % of the world’s information stock (mainly stored in analog VHS cassettes) and audio almost 15 % (audio cassettes and vinyl records). By 2007, the share of video in the world’s storage devices decreased to 60 % and the share of audio to merely 5 %, while text increased from less than 1 % to a staggering 20 % (boosted by the vast amounts of alphanumerical content on internet servers, hard-disks and databases). The multi-media age actually turns out to be an alphanumeric text age, which is good news if you want to make life easy for search engines.

I had come across this quote as part of a preliminary investigation for MIRI into the world’s distribution of computation (though I had not highlighted the quote in the investigation since it was relatively less important to the investigation). As another data point, Facebook claims that it needed 700 TB (as of October 2013) to store all the text-based status updates and comments plus relevant semantic information on users that would be indexed by Facebook Graph Search once it was extended to posts and comments. Contrast this with a few petabytes of storage needed for all their photos (see also here), despite the fact that one photo takes up a lot more space than one text-based update.

Beautiful text

The Internet looks a lot more beautiful today than it did ten years ago. Why? Small, incremental changes in the way that text is displayed have played a role. New fonts, new WordPress themes, a new Wikipedia or Facebook layout, all conspire to provide a combination of greater usability and greater aesthetic appeal. Also, as processors and bandwidth have improved, some layouts that may have been impractical earlier have been made possible. The block tile layout for websites has caught on quite a bit, inspired by an attempt to create a unified smooth browsing experience across a range of different devices (from small iPhone screens to large monitors used by programmers and data analysts).

Notice that it’s the versatility of text that allowed it to be upgraded. Videos created an old way would have to be redone in order to avail of new display technologies. But since text is stored as text, it can be rendered in a new font easily.

The wonders of machine learning

I’ve noticed personally, and some friends have remarked to me, that Google Search, GMail, and Facebook have gotten a lot better in recent years in many small incremental ways despite no big leaps in the overall layout and functioning of the services. Facebook shows more relevant ads, makes better friend suggestions, and has a much more relevant news feed. Google Search is scarily good at autocompletion. GMail search is improving at autocompletion too, and the interface continues to improve. Many of these improvements are the results of continuous incremental improvement, but there’s some reason to believe that the more recent changes are driven in part by application of the wonders of machine learning (see here and here for instance).

Futurists tend to think of the benefits of machine learning in terms of qualitatively new technologies, such as image recognition, video recognition, object recognition, audio transcription, etc. And these are likely to happen, eventually. But my intuition is that futurists underestimate the proportion of the value from machine learning that is intermediated through improvement in the existing interfaces that people already use (and that high-productivity people use more than average), such as their Facebook news feed or GMail or Google Search.

A place for video

Video will continue to be good for many purposes. The watching of movies will continue to migrate from TV and the cinema hall to the Internet, and the quantity watched may also increase because people have to spend less in money and time costs. Educational and entertainment videos will continue to be watched in increasing numbers. Note that these effects are largely a matter of substituting one medium for another, plus a raw increase in quantity, rather than paradigm shifts in the nature of people’s activities.

Video chatting, through tools such as Skype or Google Talk/Hangouts, will probably continue to grow. These will serve as important complements to text-based communication. People do want to see their friends’ faces from time to time, even if they carry out the bulk of their conversation in text. As Internet speeds improve around the world, the trivial inconveniences in the way of video communication will reduce.

But these will not drive the bulk of people’s value-added from having computing devices or being connected to the Internet. And they will in particular be an even smaller fraction of the value-added for the most productive people or the activities with maximum flow-through effects. Simply put, video just doesn’t deliver higher information per unit bandwidth and human inconvenience.

Progress in video may be similar to progress in memes and animated GIFs: there may be more use of animation to quickly create videos expressing simple ideas. Animated video hasn’t taken off yet. Xtranormal shut down. The RSA Animate style made waves in some circles, but hasn’t caught on widely. It may be that the code for simple video creation hasn’t yet been cracked. Or it may be that if people are bothering to watch video, they might as well watch something that delivers video’s unique benefits, and animated video offers little advantage over text, memes, animated GIFs, and webcomics. This remains to be seen. I’ve also heard of Vine (a service owned by Twitter for sharing very short videos), and that might be another direction for video growth, but I don’t know enough about Vine to comment.

What about 3D video?

High definition video has made good progress in relative terms, as cameras, Internet bandwidth, and computer video playing abilities have improved. It’ll be increasingly common to watch high definition videos on one’s computer screen or (for those who can afford it) on a large flatscreen TV.

What about 3D video? If full-blown 3D video could magically appear all of a sudden with a low-cost implementation for both creators and consumers, I believe it would be a smashing success. In practice, however, the path to getting there would be more tortuous. And the relevant question is whether intermediate milestones in that direction would be rewarding enough to producers and consumers to make the investments worth it. I doubt that they would, which is why it seems to me that, despite the fact that a lot of 3D video stuff is technically feasible today, it will still probably take several decades (I’m guessing at least 20 years, probably more than 30 years) to become one of the standard methods of producing and consuming content. For it to even begin, it’s necessary that improvements in hardware continue apace to the point that initial big investments in 3D video start becoming worthwhile. And then, once started, we need an ever-growing market to incentivize successive investments in improving the price-performance tradeoff (see #4 in my earlier article on supply, demand, and technological progress). Note also that there may be a gap of a few years, perhaps even a decade or more, between 3D video becoming mainstream for big budget productions (such as movies) and 3D video being common for Skype or Google Hangouts or their equivalent in the later era.

Fractional value estimates

I recently asked my Facebook friends for their thoughts on the fraction of the value they derived from the Internet that was attributable to the ability to play and download videos. I received some interesting comments there that helped confirm initial aspects of my hypothesis. I would welcome thoughts from LessWrongers on the question.

Thanks to some of my Facebook friends who commented on the thread and offered their thoughts on parts of this draft via private messaging.

CAE_Jones 13 Apr 2014 23:10 UTC
22 points
(my reply wound up over 8kb long, but I don’t think it’s general enough to turn into a discussion article.)

Reading this and its comments immediately made me think of the current status of braille, where technology has completely failed to keep up with the mainstream, and now many people are claiming that braille is outdated and they’ll just use text-to-speech for everything. (Disclaimer: I was taught braille starting from kindergarten, and picked it up fast and thoroughly. A lot of anti-braille people appear to have had a very hard time learning it and can’t actually read any quicker than I could read large print when my vision was at its best. So I have to acknowledge some kinda privilege when talking about the subject. I’ll also acknowledge that mastery of braille and financial/academic/etc success are positively correlated among blind Americans, according to all the not-incredibly-transparent sources I’ve found.)

Some of the points made here about text in general apply to braille, some are just the opposite, and some depend entirely on the audio/video/tactual affinities of the specific user. For example:

•You can’t play background music while having a video conversation or recording audio or video content.

Is one of the arguments I use in favor of braille whenever the subject comes up, and it’s easily extended to consumption: noise and reading, noise and writing, privacy, the need/lack for headphones, all the different environments in which one can work, etc.

This, though:

•Low storage and bandwidth costs make it easy to consume over poor Internet connections and on a range of devices.

Is only technically true for braille, since braille technology is so far behind that devices are almost always bulky, expensive, cumbersome, in addition to the mainstream device to which they connect, and only the most expensive and bulkiest models display more than 40 characters at a time (so like half of one print line… and most people will get an even smaller model, because even 40 characters is bulky, expensive, and generally more than trivially inconvenient to use on anything but a dedicated device like the PACMate Omni). The base format, digitally, is the same and can be transmitted and stored easily, but everyone who can hear but not see is going to convert it to speech anyway.

•Text can be read at the user’s own pace. People who are slow at grasping the content can take time. People who are fast can read very quickly.

Applies to text-to-speech as well; most TTS has adjustable speaking rates, and it’s possible to go back and reread things (however, checking the spelling of words is a trivial inconvenience which almost no one that relies on a TTS ever uses. Ever seen me misspell something? That’ll be why. On the other hand, what types of errors count as obvious visually vs obvious audibly differ, so blind and sighted text speak tend to differ simply for readability’s sake.).

•Text is easier to search (this refers both to searching within a given piece of text and to locating a text based on some part of it or some attributes of it).

Even people who don’t care for braille in general will agree that it helps loads with math and programming for exactly this reason. If hooking a braille display to a laptop were not so bloody inconvenient (and did not require so much desk-space), I’d have one connected pretty much all the time for this alone.

•You can’t play background music while consuming audio-based content, but you can do it while consuming text.

People usually reply to this with “Use the Windows volume mixer.” (I disagree with said reply under most conditions. If I have to have music quieter than a screen reader, then a lot of the impact is reduced. And screenreader + conversation is just plain impractical.)

On the flip side, reading text requires you to have your eyes glued to the screen, which reduces your flexibility of movement. But because you can take breaks at your will, it’s not a big issue.

Depending on the device, the opposite can be true for braille; a small display, or something smaller than a novel (braille novels are enormous) can be read while walking without compromising one’s awareness of one’s surroundings (especially if one can read one-handed). Audio is dependent on the device as well, however; walking around with bulky headphones on is a terrible idea (compare texting while driving), but external speakers while going about other business in the same room is fine.

The presence of hyperlinks, share buttons, the occasional image, sidebars with more related content, etc. add a lot of value.

Braille does not do formatting well, but neither does audio, and I’ve never had access to a braille device that can actually perform the equivalent to clicking or tapping a hyperlink. This is an improvement I thought of the first time I actually had a braille display for more than 5 minutes: every braille display I’ve ever seen includes cursor-routing keys, which are basically buttons above each cell that will move the cursor to that position when clicked. The obvious thing to do is to double-click one of those to simulate a mouse-click at that spot, yet I’ve never heard of this being implemented.

There’s also such a thing as 8-dot braille, which is typically used for unicode characters, to indicate capitalization, or to indicate the position of the cursor or highlighted text. Even most braille-using techies don’t learn 8-dot unicode (and from what I can tell, that isn’t even standardized, so it’d only matter for the specific hardware/software combination that one studied with), so it’s a little disappointing that using the two extra dots for formatting or HTML effects hasn’t really caught on.

(As an example of how braille and screen readers handle HTML elements, we have links: a screen reader reads Lesswrong.com as “link Lesswrong dot com”, and on a braille display, it shows up as “lnk Lesswrong.com”. I consider the latter to be more problematic, in that it costs 4 whole cells, which is anywhere from 5% to 33% of the display!)

•One can tag friends and Facebook groups and pages, subject to some restrictions. For friends tagged, the anchor text can be shortened to any one word in their name.

Side complaint: Facebook accessibility is mixed, and blind people tend to use the mobile site, where in-line friend-tagging is not possible. (Yes, the main Facebook page is bad enough that this is more than a reasonable tradeoff.)

•Training users: The augmented text features need a loyal userbase that supports and implements them. So each augmentation needs to be introduced gradually in order to give users onboarding time.

This is so obviously applicable to anything accessibility-related that I momentarily considered not including it here.

•Performance in terms of speed and reliability: Each augmentation adds an extra layer of code, reducing the performance in terms of speed and reliability. As computers and software have gotten faster and more powerful, and the Internet companies’ revenue has increased (giving them more leeway to spend more for server space), investments in these have become more worthwhile.

Referring back to m.facebook.com vs. facebook.com: it’s very hard for accessibility technology, an extremely tiny market with little funding and lots of coordination problems due to size, to keep pace with all these augmentations. The more powerful stuff on Facebook.com gives me lag that ends in me querying my brain for incidents in the early 2000s to try and find something comparable.

For another example: Lesswrong is usually pretty responsive to screen readers, but if a post has a large number of comments (I’ve noticed that 80 or more tends to be a good predictor), there might be enough lag in reading or loading to be inconvenient, and there is a particular feature that is actually annoying: occasionally, while reading comments, I’ll be notified of a comment’s percent positive karma, at which point the screen reader takes a whole second to get back to reading, adds more spoken formatting information (“clickable”, mostly; bold/italics/font size are almost never spoken, but screen readers are getting better about those), and once this happens once, it will almost definitely repeat if I keep scrolling. (My solution so far has been to switch to “just read everything from the cursor down” if this happens. How more or less convenient this method is depends on the screen reader. And I’m using the free one, because I’d rather not incentivize charging $800 for a screen reader.)

However, when I’ve tried using a braille display and text-to-speech simultaneously, I’ve found that, frequently, a page that will take several seconds to get a response from TTS will start displaying braille much more quickly. Considering that the screen reader is managing both, this is a little bizarre; it’d imply that the lag is in the TTS program, rather than the screen reader itself, yet different screen readers seem to render speech faster or slower on the same websites.
- VipulNaik 13 Apr 2014 23:54 UTC
  5 points
  Thanks for commenting! This is an insightful perspective.
Kaj_Sotala 13 Apr 2014 18:50 UTC
11 points

People are (rightly or wrongly) less concerned about putting their best foot forward with text.

As an ergonomic matter, typing all day long, although fatiguing, consumes less energy than talking all day long.

Text can be created in fits and bursts. An audio or video needs to be recorded more or less in a continuous sitting.

Note that many people seem to prefer e.g. Skype calls over text chats because (to these people) voice chat requires less energy than writing, and feels like just having a normal conversation and thus effortless, whereas writing is something that requires actually thinking about what you say and thus feels much more laborious.

A lot of people also seem to find audio easier to consume than text: podcasts would be a lot less popular otherwise. (I never understood podcasts at first. Why not just write? Finally I realized that non-nerds actually find listening easier than reading.)

You can’t play background music while having a video conversation

Headphones and a good call quality together fix this, I think? Haven’t tried, though.
- kalium 13 Apr 2014 20:14 UTC
  10 points
  0
  Parent
  Audio is easier to consume when full attention isn’t available. It’s not easy to read a book while driving, jogging, or knitting. I think that’s enough to fully explain podcasts’ popularity without any claim that audio is overall easier to consume than text for any substantial population.
- ColtInn 13 Apr 2014 20:10 UTC
  4 points
  0
  Parent
  
  Finally I realized that non-nerds actually find listening easier than reading.
  
  A lot of nerds listen to podcasts. I’d estimate 80% plus of my communication is textual and that includes with family, and I’m a father.
  
  I listen to several hours of podcasts per week. Podcasts aren’t two-way communication. Like text, they can be left alone and returned to at will. They can be educational and/or entertaining. My favourite podcasts are those which are mostly other people conversing with each other over some debatable ideas. Audiobooks can be good too. I mostly listen to these during routine work, walking or cycling.
  
  seem to prefer e.g. Skype calls over text chats because (to these people) voice chat requires less energy than writing, and feels like just having a normal conversation and thus effortless, whereas writing is something that requires actually thinking about what you say and thus feels much more laborious
  
  I think most people prefer these modes to text and that we’re the exception. There is some positive emotional payoff to hearing and seeing friends and loved ones. I’ve noticed that many people will Skype or phone when they want to share positive things, but email or text when the message is a negative or confrontational one. Lots of breakups happen via text but I imagine very few marriage proposals do.
- gwern 18 Apr 2014 21:49 UTC
  1 point
  0
  Parent
  
  Note that many people seem to prefer e.g. Skype calls over text chats because (to these people) voice chat requires less energy than writing, and feels like just having a normal conversation and thus effortless, whereas writing is something that requires actually thinking about what you say and thus feels much more laborious.
  
  This would explain why I have been contacted by a number of people such as journalists interested in talking or interviewing me and proposing use of Skype, only to never reply when I say I don’t use Skype and we can just chat on IRC.
- itaibn0 18 Apr 2014 21:23 UTC
  0 points
  0
  Parent
  Personally I prefer speaking to writing but I prefer reading to listening. I believe part of the reason is that I set myself higher standards when I write. For instance, in a conversation I would be satisfied to finish this comment with just the first sentence, but here I want to elaborate.
- VipulNaik 13 Apr 2014 18:53 UTC
  0 points
  0
  Parent
  For short conversations, video/voice may be more effective because it’s slightly faster.
  
  However, spending the bulk of the day in video/voice conversations is a lot more fatiguing than spending it using text-based communication [EDIT: I’m quite likely mistaken about this, see the followup comments].
  
  I think the people who’re not used to text-based communication generally just end up spending less time communicating, and/or work in group environments in physical proximity to others where one can talk occasionally.
  - benkuhn 13 Apr 2014 20:04 UTC
    1 point
    0
    Parent
    Do you have actual data on this? Otherwise I’m very tempted to call typical mind.
    - VipulNaik 13 Apr 2014 20:45 UTC
      3 points
      0
      Parent
      On the claim:
      
      spending the bulk of the day in video/voice conversations is a lot more fatiguing than spending it using text-based communication.
      
      It seems that I was wrong.
      
      The following sources contradict me: http://calorielab.com/burned/?mo=se&gr=09&ti=miscellaneous+activities&q=&wt=150&un=lb&kg=68 and http://www.my-calorie-counter.com/Calories_Burned/
      
      Some random Internet comments corroborate me. For instance, scott preston writes at http://radar.oreilly.com/2011/03/stephan-spencer-seo-future-search.html: “In fact speaking takes a lot more energy to than typing does.”
      
      I’ll look this up more and update if I find more reliable information.
      - benkuhn 13 Apr 2014 21:20 UTC
        6 points
        0
        Parent
        I would expect the relevant factor to be mental, not physical, exertion. Unfortunately that’s a lot harder to measure.
        - VipulNaik 13 Apr 2014 21:41 UTC
          2 points
          0
          Parent
          btw, I think I can both talk and type for far longer durations than the median world resident. But my typing stamina may be substantially greater than my talking stamina, so I may be expressing typical mind fallacy in the proportional angle.
VipulNaik 4 Jun 2026 22:11 UTC
2 points
0
Here’s Claude’s review of the article in light of changes over the past 12 years:

This is a rich piece and holds up remarkably well in its core claims. Let me give you a structured assessment.

What Has Stood the Test of Time

The central thesis — that text would dominate over video as the marginal medium of the internet — was basically correct for the period 2014–2022. Text did continue to dominate information production, search infrastructure, professional work, and knowledge accumulation. The argument about production costs was prescient: the explosion of Twitter, Reddit, newsletters, Discord, Slack, and (ironically) LLM chatbots all represent text continuing to dominate the high-value-density layer of the internet.

The “augmented text” framing aged extremely well. Rich text environments — Notion, Substack, Obsidian, Linear, GitHub — are exactly the evolution you described: not plain ASCII, but semantically dense text augmented with structure, embeds, formatting, and links. The vision of text getting progressively richer without jumping to video was accurate.

The meme/GIF point was spot-on. The visual internet that actually exploded was image macros, reaction GIFs, and short-form images — scaffold for text, not replacement for it.

The 3D video skepticism was well-calibrated. You said “at least 20 years, probably 30.” Oculus/Meta has sunk tens of billions into VR and AR with essentially zero mass-market traction for communication or content consumption. The Quest 3 is a technical marvel that almost nobody uses daily. Apple Vision Pro is a $3500 curiosity. The intermediate milestones haven’t been rewarding enough, exactly as you predicted.

The machine learning section is one of the best-aged passages in the piece. You specifically argued that ML’s value would be mostly intermediated through improvements to existing text interfaces rather than through flashy new modalities. This turned out to be deeply right — for nearly a decade, the biggest ML wins were ranking, recommendations, autocomplete, spam filtering, and search quality. The LLM era is almost an extreme version of this thesis.

The flow-through effects argument (text generating more downstream production) has been validated. Text on the internet has compounded — every piece of text is linkable, quotable, indexable, trainable-on. Video mostly doesn’t compound in the same way.

What Has Been Proven False or Significantly Complicated

The strongest refutation: short-form video. TikTok is the clearest counterexample to your thesis. You acknowledged video’s role in entertainment/movies but treated it as substitution rather than paradigm shift. TikTok represents something genuinely new: video as the primary discovery and communication medium for a large population cohort, replacing not just TV but search, social feeds, and even text tutorials. The “democratization of production” you cited as a text advantage turned out to apply to video too — partly because smartphones commoditized decent video production, and partly because TikTok’s algorithm is forgiving enough of rough production values that the inhibition barrier dropped substantially. YouTube Shorts, Instagram Reels, and now even LinkedIn video are real. For Gen Z, the video-first internet is the default.

Podcasts grew much larger than your framing suggested. You mention audio briefly and essentially dismiss it. But podcasts became a multi-billion dollar industry, became the dominant long-form interview format, and arguably became more culturally influential than comparable text media. The asynchronous mobility advantage (listening while exercising, driving, cooking) turned out to be a stronger force than you credited.

The inhibition argument partially inverted for video. You argued people would be more inhibited about video than text, reducing video production. But the smartphone + front camera + selfie culture + TikTok’s ephemeral/low-stakes aesthetic created a generation that is less inhibited about video than text. For many young people, typing out a thought feels more labored than filming themselves saying it.

Image platforms became more sophisticated than the meme-scaffold model. Instagram became a major driver of culture, commerce, and communication in ways that went beyond images-as-text-scaffolding. The visual identity economy (influencer culture, brand aesthetics) is genuinely image/video-native in ways that don’t reduce to text.

How LLMs Change the Text vs. Video Equation

This is where it gets most interesting, and I think the LLM era dramatically amplifies your core thesis while also introducing some genuinely new wrinkles.

LLMs are the apotheosis of text supremacy. The entire edifice of foundation models is built on text. The reason LLMs are so capable is precisely that the internet’s accumulated text was a dense, searchable, linkable, structured knowledge base — exactly the properties you argued for. If the internet had been primarily video since 2000, we would not have GPT-4. In a sense, the world bet on text as the training substrate, and it paid off in a way that creates further lock-in to text.

Text generation is now free. One of your core arguments was that text is cheaper to produce. LLMs make this even more true — to the point of near-zero marginal cost for competent text. This should further shift the balance toward text, since the production-cost advantage of video erodes (it still takes effort to make a good video), while the production-cost advantage of text collapses entirely in the other direction.

LLMs as text-to-everything interfaces. Rather than video replacing text for search and information retrieval, we’re now seeing text interfaces that generate code, images, analysis, and even video (Sora, Runway, etc.). Text has become the universal command layer — you prompt in text and get any modality back. This is a stronger form of text dominance than you argued for: text isn’t just winning the communication war, it’s becoming the control plane for all modalities.

But: video-to-text pipelines are getting very good. Auto-transcription (Whisper-quality and above), video summarization, multimodal models that can “read” video — these reduce one of text’s competitive advantages (searchability). If every YouTube video is automatically transcribed, indexed, and queryable, the search advantage of text partially transfers to video. This is a genuine new development that could modestly weaken your case.

The LLM UI is pure augmented text. Claude, ChatGPT, Perplexity — these are the dominant new computing interfaces of the 2020s, and they are quintessentially text-native. The “augmented text” future you described arrived in the form of conversational AI rather than the richer Facebook-style social annotations you imagined, but the basic prediction was correct.

A subtle reversal for LLM-generated content: one concern is that as text generation becomes cheap and ubiquitous, human-produced video may become a stronger authenticity signal. If you can’t tell whether a text was written by a person or an LLM, the value of video as proof-of-person might actually increase. This is speculative but worth watching.

Overall verdict: I’d say roughly 70% of the piece has aged well, with the main blind spot being the TikTok-style video revolution and the underestimation of podcast culture. The LLM era has, if anything, been a late-stage vindication of the core thesis — text dominated long enough and densely enough to become the training substrate for the most transformative technology in decades, and that technology is now making text even more dominant as a control interface. Your machine learning passage in particular reads almost prophetically given where things ended up.
ChristianKl 14 Apr 2014 10:22 UTC
1 point
0
I think roughly a month ago I had a discussion on LW about using Anki to learn biology data. The person complained about the perceived limitation of Anki being text only. He would rather learn using things like Venn diagrams because they are better at displaying information than pure text.

The problem is that it’s not straightforward to simply create a Venn diagram while creating an Anki card or while discussing on LessWrong. It takes extra time. With a bit of smart UI design we might have a UI that makes it easy to make points via diagrams. Of course that means we need to think about how to create good diagrams for a bunch of other semantic constructs.

Especially if your default medium of data entry isn’t a keyboard but a multitouch device, having a bunch of diagrams might be better than text. Text developed in an environment where space was expensive. Today keyboards are simply amazing technology that makes text entry very easy.

I could imagine that the necessary technology won’t be developed in consumer applications like Facebook but in a field like biology where it’s very important to express complex ideas in an easy to understand manner. A series of big diagrams might just perform better than a bunch of long and convoluted sentences.

It’s easier to upload and store. Text takes less space. Uploading it to a network or sending it to a friend takes less bandwidth.

Today that might be a concern. I don’t think it will be in 20 years. I think a large part of why Google Wave failed was because it was just too slow.

Text is easier to search (this refers both to searching within a given piece of text and to locating a text based on some part of it or some attributes of it).

Speech to text technology should make this easier in the future.

You can’t play background music while consuming audio-based content, but you can do it while consuming text.

I think you can play low volume music in the background of a podcast.
- wubbles 31 Aug 2015 1:34 UTC
  0 points
  0
  Parent
  I can consume text at a rate sometimes as high as 26 words a second. I cannot do that with audio. If we had speech-to-text, I would use it for turning audio into text, and consuming the text. Or the author could use it and produce text, which they could then edit. Frequently when talking we do all sorts of things we don’t do when writing: repeat ourselves, use funny turns of phrase, search for words, etc. The bandwidth advantage to the consumer of a small amount of work for the producer makes text continue to be valuable.
  
  As far as diagrams go in technical areas, there are some famous pictures in mathematics. These pictures inevitably mean nothing without text. Transmitting abstract ideas, and in particular transmitting subtle variations in how solid something is, doesn’t seem compatible with diagrams. Diagrams are good for some concepts, but it’s still an art to get good ones. Creating them is expensive, and sometimes they don’t work. On the other hand it’s hard to beat a good graph for communicating numerical data easily and letting the viewer draw appropriate inferences.
- kalium 14 Apr 2014 16:15 UTC
  0 points
  0
  Parent
  I think you mean speech to text technology.
  - ChristianKl 14 Apr 2014 16:17 UTC
    0 points
    0
    Parent
    Fixed.
Yosarian2 16 Apr 2014 0:06 UTC
0 points
0
It depends what you mean by 3D video, but now that Facebook has put 2 billion dollars into Oculus Rift, and other tech companies like Sony are talking about similar kinds of VR devices, I expect we’re going to see a significant amount of money invested in them over the next few years from several major tech companies, and we’re probably going to see some high-quality consumer devices appear. How popular they will be, of course, is anyone’s guess, but I think the odds are good.

I don’t think that negates your main point, though; text is still the dominant medium of the internet, and will probably continue to be so. Another big advance in text in recent years is Google Translate; the fact that someone can post a news article in Russian on Reddit and I can read it easily without any extra effort on my part is a huge advance.
Richard_Kennaway 14 Apr 2014 14:57 UTC
0 points
0

What about 3D video? If full-blown 3D video could magically appear all of a sudden with a low-cost implementation for both creators and consumers, I believe it would be a smashing success. In practice, however, the path to getting there would be more tortuous. And the relevant question is whether intermediate milestones in that direction would be rewarding enough to producers and consumers to make the investments worth it.

I think 3D video is not so technologically far off, and that the real problem of VR is that no-one has really worked out what to do with it. There are all sorts of visions of it in science fiction, but that’s all fictional evidence, and actually building a VR platform that everyone will want to use is a hard problem. VR systems go back at least to the 80s, and there has been steady technological progress in that time, but the most successful VR social platform, Second Life, has never obtained a mass userbase.
polarix 14 Apr 2014 14:39 UTC
0 points
0
Essentially, this sounds like temporal sampling bias. The points about ease of recombination and augmentation bespeak a lack of infrastructure investment in post-text media, not a fundamental property. Yes, communication mediums begin with text. But the low emotional bandwidth (and low availability of presence in real-time interactions) concretely limits the kinds of transmissions that can be made.

Your writing, however, does raise a spectacular question.

How can we increase the bandwidth of text across the machine/brain barrier?
Gunnar_Zarncke 14 Apr 2014 10:26 UTC
0 points
0
One reason text has a higher potential to be the main form of communication even when full image recognition is available is that it can be combined more freely. This can best be seen in GUIs which despite their (often) intuitive nature cannot nearly match the ‘unlimited power’ of command line interfaces. As long as no way is found to do the same via graphical means (and that goes way beyond image “recognition”), I don’t see a reason why text shouldn’t indeed dominate images. Images are a nice add-on associated with a text node but don’t form nodes of their own.

This might change once we get sufficiently smart image composition software that basically realizes fantasy/visual imagination, so that the elements could compose as freely as text.
- Richard_Kennaway 14 Apr 2014 12:53 UTC
  0 points
  0
  Parent
  
  This can best be seen in GUIs which despite their (often) intuitive nature cannot nearly match the ‘unlimited power’ of command line interfaces.
  
  It depends what software you’re talking about. Here are three examples: Photoshop (2D raster image processing), Blender (3D modelling and animation), or Maya (ditto). As far as I know, none of these have command-line interfaces.[1] How would you use a command-line interface to paint a picture, or model a 3D character?
  
  I could add Illustrator (2D object-oriented image processing) and COMSOL (finite element engineering calculations) to that list as well. GUI and API, but no CLI beyond the needs of batch processing.
  
  [1] This needs some amplification. All of them have programming interfaces, but that is something different. Blender (and I expect Maya as well, but I’m less familiar with it) can be invoked from the command line, with options to say what you want it to do, but that’s only useful for batch-type tasks like final-quality renders of complex scenes and movies. Everything you can do in the CLI you can do in the GUI, but most of what you can do in the GUI cannot be done from the CLI.
  - Gunnar_Zarncke 14 Apr 2014 13:53 UTC
    0 points
    0
    Parent
    Gimp has a scripting language and Imagemagick is entirely scripted.
    
    I agree that some tasks—esp. selecting an image part—are (currently) most easily done via pointing—because image recognition isn’t far enough yet.
    
    Many CAD systems have a command language.
    
    Specialized ‘graphical’ applications like circuit layout used to be done by hand but moved to specialized languages.
    
    I’d guess that sooner or later you’d rather use speech to compose most parts of the image and use (force feedback) motion for specialized painting actions and transformations.
    - Richard_Kennaway 14 Apr 2014 14:40 UTC
      0 points
      0
      Parent
      
      I’d guess that sooner or later you’d rather use speech to compose most parts of the image and use (force feedback) motion for specialized painting actions and transformations.
      
      I’m rather baffled by how I would use speech to paint into a Photoshop window. Force feedback motion already exists for 2D painting—graphics tablets are standard equipment for artists.
      
      There are things in 3D animation that can be usefully expressed as text, but the only examples I know of are scripted procedural animation, in which the possibility of textual expression arises from limitations imposed on the repertoire of available movement. The example I’m most familiar with is deaf sign language, and the HamNoSys notation in particular (because I’ve worked with it and written software to translate it into animation data).
      
      I agree with the original point that text is an essential medium that is not going away, but I think that GUIs vs CLIs is not the issue. Each has uses not easily replicated by the other. CLIs are more scalable, but GUIs provide memory cues and physical interaction. The main reason is just that words, spoken or written, are what people use to communicate with each other, whether via a computer or not. And only the written word is easily accessible for re-use.
      - Gunnar_Zarncke 14 Apr 2014 16:05 UTC
        0 points
        0
        Parent
        
        I’m rather baffled by how I would use speech to paint into a Photoshop window. Force feedback motion already exists for 2D painting—graphics tablets are standard equipment for artists.
        
        You wouldn’t “paint into a Photoshop window”. I’d imagine saying e.g. “put a circular animation of growing fern around the center of the pulsating ball” and then tweaking via force feedback some of the parameters of the fern or its growing.
