Spitballing here, but how about designing the language in tandem with a ML model for it? I see multiple benefits to that:
First is that current English language models spend an annoyingly large amount of power on reasoning about what specific words mean in context. For “I went to the store” and “I need to store my things”, store is the same token in both, so the network has to figure out from context which meaning is intended. For a constructed language, that task can be made much easier.
English has way too many words to make each of them their own token, so language models preprocess text by splitting it up into smaller units. For a logical language you can have significantly fewer tokens, and each token can be a unique word with a unique meaning. With the proper morphology you also no longer need to tokenize spaces, which cuts down on the size of the input (and thus complexity).
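To make that concrete, here’s a toy sketch of what one-token-per-word tokenization over spaceless text could look like. Everything in it (the vocabulary, the word shapes) is invented for illustration; the real point is that a self-segmenting morphology lets you recover word boundaries from the text itself, so neither spaces nor subword pieces need their own tokens.

```python
# Hypothetical closed lexicon -- in a real loglang the morphology would
# guarantee that greedy longest-match segmentation has a unique parse.
VOCAB = ["ba", "keto", "mi", "sona"]
TOKEN_ID = {w: i for i, w in enumerate(VOCAB)}
BY_LENGTH = sorted(VOCAB, key=len, reverse=True)

def tokenize(text: str) -> list[int]:
    """Greedy longest-match segmentation over the closed vocabulary,
    with no space characters anywhere in the input."""
    tokens, i = [], 0
    while i < len(text):
        for word in BY_LENGTH:
            if text.startswith(word, i):
                tokens.append(TOKEN_ID[word])
                i += len(word)
                break
        else:
            raise ValueError(f"no word of the lexicon matches at position {i}")
    return tokens
```

So `tokenize("misonaba")` comes back as `[2, 3, 0]` — three whole-word tokens from eight spaceless characters, with no subword splitting needed.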
Language models such as GPT-3 work by spitting out a value for each possible output token, representing the likelihood that it will be the next in the sequence. For a half-written sentence in a logical language, it would be possible to reliably filter out words that are known to be ungrammatical, which means the model doesn’t have to learn all of that itself.
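The filtering itself is cheap, something like the sketch below. The grammar is hand-waved here — in practice the `allowed` set would come from a parser tracking which tokens can legally continue the half-written sentence — but the mechanism is just masking logits before the softmax:

```python
import math

def softmax(logits):
    # Standard numerically-stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_next_token(logits, allowed):
    """Set grammatically forbidden tokens to -inf before the softmax,
    so they get exactly zero probability and the remaining mass
    renormalizes over the legal continuations automatically."""
    masked = [x if i in allowed else float("-inf")
              for i, x in enumerate(logits)]
    probs = softmax(masked)
    return max(range(len(probs)), key=probs.__getitem__)
```

Even if the model would have preferred an ungrammatical token, the mask makes it unpickable — the model only ever has to choose among sentences the grammar permits.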
The benefits of doing this wouldn’t be limited to the ML model. You’d get a tool that’s useful for developing the language, too:
Let’s say you want to come up with a lexicon, and you have certain criteria like “two words that mean similar things should not sound similar, so as to make them easy to differentiate while speaking”. Simply inspect the ML model and see which parts of the network are affected by the two tokens. The more similar that is, presumably the closer the tokens are conceptually. You can then use that distance to programmatically generate the entire lexicon, using whatever criteria you want.
If the language has features to construct sentences that would be complicated for an English speaker to think, the model might start outputting those. By human-guided use of the model itself for creating a text corpus, it might be possible to converge on interestingly novel and alien thoughts and concepts.
Typically the input text is pre-processed with a secondary model (such as BERT) which somewhat improves the situation.
Except for proper nouns, I suppose; those you’d still need to split.
Yeah, x seems the most appropriate candidate. It’s sufficiently rare in English not to trip people up too much, from a cursory glance at Wikipedia it’s at least used for that purpose in Pirahã, and it even looks like a little pictographic “stop” symbol.
Edit: Oh, apologies, I completely misunderstood the part where “ņ” was actually written with the letter “q”. Nevermind that part!
Phonology should be significantly optimized for aesthetics, as long as the loglangishness doesn’t suffer. The sheer ugliness of Lojban is IMO a big reason why it’s not as popular as it should be. As a second point on the “optimize for popularity” topic, if there’s ever a conflict between ease of pronounceability for English speakers versus any other language, err on the side of English.
Having any character not in the a–z range has two major drawbacks – the first is that it’s going to be really annoying to type for a vast number of people. Typing “Ņ” with my Swedish keyboard requires me to do the awkward hand movement of pressing AltGr+, then releasing the keys to press Shift-N. I’d rather have it be any other character that’s available.
Secondly, and this applies to the apostrophe too, a lot of things that have to do with computers don’t deal with those characters very well. Anything with e.g. an apostrophe will be hard to google for, will often need escaping if it’s inserted in a string, and likely won’t be usable as a token (e.g. a variable name in programming languages, a browser user-agent, computer usernames...) – and even in the cases where it is usable, it requires ugly hacks to get working (domain names, filenames).
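The apostrophe problem in miniature, using a made-up word (in Python, but the same friction shows up in most languages):

```python
# A hypothetical loglang word containing an apostrophe.
word = 'ņa\'i'   # escaping is already required in single-quoted strings
same = "ņa'i"    # ...or you have to switch quote styles to avoid it
assert word == same

# And it can never be a variable name: identifiers allow letters like
# "ņ" in Python 3, but never apostrophes.
assert not "ņa'i".isidentifier()
```

Amusingly, it’s the apostrophe and not the “ņ” that kills it as an identifier here — which is exactly the asymmetry between “annoying to type” and “breaks tooling”.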