jefftk comments on Lightweight minimal speech recognition?

jefftk 13 Jul 2023 13:26 UTC
2 points
0
Thanks! Your model of the problem I’m trying to solve is good.

It’s true that I have limited control over the acoustic environment, and a noisy stage can be unpredictable. On the other hand, my mouth is right next to a high quality microphone, which gives me a very good signal to noise ratio. So, for example, whistling to control a bass synthesizer has worked well for me.
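
A minimal sketch of why a close-miked whistle is easy to track: it is nearly a pure tone, so counting rising zero crossings already gives a usable pitch. This is an illustration under assumed parameters (44.1 kHz sample rate, a four-octave downward shift, hypothetical function names), not the code actually used on stage.

```python
import math

def estimate_frequency(samples, sample_rate):
    """Estimate the pitch of a near-sinusoidal signal (such as a whistle)
    from the spacing of its rising zero crossings."""
    crossings = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev < 0 <= cur:
            # Linearly interpolate the sub-sample position of the crossing.
            crossings.append(i - 1 + (-prev) / (cur - prev))
    if len(crossings) < 2:
        return None  # not enough signal to estimate a pitch
    periods = len(crossings) - 1
    return periods * sample_rate / (crossings[-1] - crossings[0])

def whistle_to_bass_note(freq_hz, octaves_down=4):
    """Nearest MIDI note for freq_hz, shifted down into bass range.
    The four-octave shift is an assumed default, not a known setting."""
    midi = 69 + 12 * math.log2(freq_hz / 440.0)
    return round(midi) - 12 * octaves_down

# Synthetic "whistle": a pure 1760 Hz tone (A6), 2048 samples at 44.1 kHz.
SAMPLE_RATE = 44100
tone = [math.sin(2 * math.pi * 1760 * n / SAMPLE_RATE) for n in range(2048)]
freq = estimate_frequency(tone, SAMPLE_RATE)
note = whistle_to_bass_note(freq)  # A6 is MIDI 93; four octaves down is 45 (A2)
```

A real signal would need at least a level gate and some smoothing before this, since stage noise adds spurious crossings whenever the whistle is quiet.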

Thoughts on your suggestions:
- Eye tracking: I think this is plausible, though it would definitely need to still let me look around the stage when I wasn’t actively giving a command. Possibly some pattern of looking at ~four different spots in order might be enough? Some stages are dark, though, which makes me nervous about anything visual.
- Muscle sensing: I can’t really think of a good place to put one. I’m already using my hands and arms to play the piano or mandolin and my legs and feet to play drums. Something on my face would be possible, but kind of intrusive?
- Accelerometer and gyro on the head: I built one of these early on in my explorations here and it does work. I stopped using it, though, because it would give me a sore neck.
- I think range finding runs into the same issue as muscle sensing: all the obvious candidates are in use.
- The array of fans sounds interesting. I could put them in a ring around my microphone and blow into/on them. It’s rare that I play in windy environments, and it’s okay if this is a component I can’t use there. Slightly nervous about reliability, since this seems kind of fragile?
- Loads: this is something I already do a lot, and I’m pretty happy with. For example, I can switch my foot pedals between drums, bass, or a bit of both. But while it lets me do more different things, it doesn’t let me do more things at once, which is important when trying to make a full sound as a duo.
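
The "looking at ~four different spots in order" idea from the eye-tracking item could be sketched as a small sequence detector. This assumes some eye tracker already reports which named zone is being looked at (or `None` when looking elsewhere, such as around the stage); the zone names, the class, and the one-second gap limit are all hypothetical.

```python
class GazeSequenceDetector:
    """Fires when the gaze visits a fixed sequence of zones in order.

    Assumes consecutive entries in `pattern` differ, so that lingering on
    the current spot can be told apart from moving to the next one.
    """

    def __init__(self, pattern, max_gap_s=1.0):
        self.pattern = list(pattern)
        self.max_gap_s = max_gap_s  # longest allowed pause between spots
        self.progress = 0           # how many spots have been matched so far
        self.last_time = None       # when the most recent spot was matched

    def update(self, zone, timestamp):
        """Feed the current zone (or None); return True when the pattern completes."""
        if self.progress > 0 and timestamp - self.last_time > self.max_gap_s:
            self.progress = 0  # took too long: abandon the partial pattern
        if zone is None:
            return False  # looking elsewhere neither advances nor cancels
        if zone == self.pattern[self.progress]:
            self.progress += 1
            self.last_time = timestamp
            if self.progress == len(self.pattern):
                self.progress = 0
                return True
            return False
        if self.progress > 0 and zone == self.pattern[self.progress - 1]:
            return False  # still lingering on the spot just matched
        # Wrong spot: start over, counting this glance if it is the first spot.
        if zone == self.pattern[0]:
            self.progress = 1
            self.last_time = timestamp
        else:
            self.progress = 0
        return False

detector = GazeSequenceDetector(["top_left", "top_right", "bottom_right", "bottom_left"])
glances = [("top_left", 0.0), ("top_left", 0.1), (None, 0.2),
           ("top_right", 0.4), ("bottom_right", 0.8), ("bottom_left", 1.2)]
fired = [detector.update(zone, t) for zone, t in glances]
```

Requiring an ordered sequence within a time limit is what lets ordinary looking around pass without triggering anything; it does nothing for the dark-stage problem, which is a limitation of the tracker itself.
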
- nim 13 Jul 2023 15:55 UTC
  3 points
  0
  Thanks for explaining!
  
  Eye tracking could also mean face/expression tracking. I figure there are probably some areas (stage, audience) where it’s important for you to look without issuing commands, and other areas (floor? above audience?) where you won’t gain useful data by looking. It’s those not-helpful-to-look areas where I’m wondering if you could get enough precision to essentially visualize a matrix of buttons, look at the position of the imagined button you want to “select” it, blink or do a certain mouth movement to “click” it, etc.
  
  Your confidence in the quality of your mic updates my hope that audio processing might actually be feasible. The lazy approach I’d take to finding music-ish noises which can be picked out of an audio stream from that mic would be to play some appropriate background noise and then kinda freestyle beatbox into the mic in a way that feels compatible with the music, while recording. I’d then throw that track into whatever signal processing software I was already using to see whether it already had any filters that could garner a level of meaning from the music-compatible mouth-noises. A similar process could be to put on background music and rap music-compatible nonsense syllables to it, and see what speech-to-text can do with the result.
  
  (As a listener, I’m also selfish in proposing nonsense noises/sounds over English words, because my brain insists on parsing all language in music that I hear. This makes me expect that some portion of your audience would have a worse time listening to you if the music you’re trying to play was mixed with commands that the listeners would be meant to ignore.)
  
  I expect that by brute forcing the “what can this software hear clearly and easily?” problem in this way, you’ll discover that the systems you’re using do well at discerning certain noises and poorly at discerning others. It’s almost like working with an animal that has great hearing in some ranges that we consider normal and poor hearing in others. When my family members who farm with working dogs need to name a puppy, they actually test lists of monosyllabic names in a similar way, to make sure that no current dog will confuse the puppy’s name for its own, before teaching the puppy what its name is.
  
  After building your alphabet of easy-to-process sounds, you can map combinations of those sounds to commands in any way that you like, and never have to worry about stumbling across a word that the speech-to-text just can’t handle in the noisy context.
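
  A minimal sketch of that two-step process, assuming you have logged (intended sound, recognized sound) pairs from test recordings. The sound labels, command names, and the 90% threshold are made up for illustration.

```python
from collections import Counter
from itertools import product

def reliable_sounds(trials, min_accuracy=0.9):
    """Return the sounds the recognizer got right at least `min_accuracy`
    of the time. `trials` is a list of (intended, recognized) pairs."""
    total = Counter()
    correct = Counter()
    for intended, heard in trials:
        total[intended] += 1
        if heard == intended:
            correct[intended] += 1
    return sorted(s for s in total if correct[s] / total[s] >= min_accuracy)

def build_command_map(alphabet, commands, length=2):
    """Assign each command its own sequence of `length` sounds."""
    if len(alphabet) ** length < len(commands):
        raise ValueError("not enough sound combinations for the commands")
    codes = product(alphabet, repeat=length)
    return {code: command for code, command in zip(codes, commands)}

# Hypothetical test log: "mm" is often misheard as "nn", so it gets dropped.
trials = ([("ka", "ka")] * 10 + [("ts", "ts")] * 9 + [("ts", "ch")]
          + [("mm", "mm")] * 5 + [("mm", "nn")] * 5 + [("po", "po")] * 10)
alphabet = reliable_sounds(trials)
command_map = build_command_map(alphabet, ["drums", "bass", "both", "mute"])
```

  This only checks how often each sound is recognized as itself. A stricter version, closer to the puppy-name test, would also reject sounds that other sounds get mistaken for.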
  
  The less lazy way, of course, would be to choose your vocabulary of commands and then customize the software until it can handle them. That’s valid and arguably cooler; it just strikes me as a potentially unbounded amount of work.
  - jefftk 13 Jul 2023 16:14 UTC
    2 points
    0
    
    > I’m wondering if you could get enough precision to essentially visualize a matrix of buttons, look at the position of the imagined button you want to “select” it, blink or do a certain mouth movement to “click” it, etc.
    
    Maybe! This would definitely be nice if it worked. Probably better for switching the system between modes than triggering sounds in real time, though?
    
    > This makes me expect that some portion of your audience would have a worse time listening to you if the music you’re trying to play was mixed with commands that the listeners would be meant to ignore.
    
    When using the mic in this mode I wouldn’t be sending it out to the hall. It wouldn’t be audible offstage.
    
    > see what speech-to-text can do with the result
    
    I do think that’s worth doing, though only if I get far enough along to have speech-to-text running at all. Right now I think I probably am just trying to use hardware that isn’t up to the task.