textfiles/programming/gesture.txt

281 lines
17 KiB
Plaintext

Power Glove Gesture Recognition
Foreword:
Perhaps everyone who has used the Nintendo Power Glove, whether it be for
entertainment or development purposes, has gained insight into its usefulness as a gesture
sensing device - something that "knows" what you are trying to say with your body. Of
course, even the Power Glove is limited to hand and wrist input, but it is clearly a far more
natural input device than the keyboard, mouse, or even the touch screen. To be more
quantitative, the glove offers fully 8 channels of completely independent input, as opposed
to 4 for the mouse or joystick (X, Y, and 2 buttons) and (essentially) 1 for the keyboard.
It is as though you are playing an 8 string guitar that your computer can "hear" perfectly,
(although some of the strings only have 4 frets.) What I want to implement is a program
to take the flood of glove input and boil it down to easily-interpreted commands like
"punch", "move left", "move forward", "twist left", "twist right", "fist", and so on and so
forth. It is much easier to write a useful application when this filtering has already been
taken care of. I see the first and most important application of this implementation to be
virtual reality. Simply put, VR is the attempt to make a computer "look and feel" more
like real life and less like a stupid machine. The glove puts people in closer contact with
computer mechanics, and perhaps more importantly, it puts the machine in closer contact
with the human user! This much power will probably find other applications as well, and I
list some potential ones later in the text.
This paper gives a brief description of how I will implement an excellent gesture
recognition system. At this point I'm not asking for input on what you think of this, so if
you don't like it, ignore it. The discussion is very technical and contains a lot of "stream of
consciousness" writing. If this bothers you, I'm sorry, I don't have a lot of time to polish it
at this point. I need to start coding as quickly as possible. If it bugs you that I'm releasing
this before the code is done, well, read the afterword, then go back to your private, selfish
excuse for a life. The rest of us are getting down to work.
Section I - How simple should recognized gestures to be?
Fingers: Thumb, Index, Middle, Ring.
Location: X, Y, Z, and Rotation.
Gesture consists of finger "delta" and/or location "delta" ("delta" = "change in glove
input"). More subtle observation - we may wish to ignore the delta of any of the eight
parameters. Do finger deltas need to be handled differently? (My answer turns out to be
an emphatic "NO".) Big issue is one of gesture "combination". Should two identical rapid
gestures be interpreted as a "double-click"? I think so, (though later I'll reject this.) The
concept can be extended like so: Say the gestures listed in the 1st paragraph can be
sensed. If several gestures occur rapidly, there might be a gesture in the event queue like
"punch-twirl-punch-punch" which would be distinguished from "punch", "twirl", "punch",
"punch" (4 separate queue events) by the delay between the gestures. Clearly a queue of
gestures would be required, similar to an event queue for mouse clicks.
Another possibility is the "shifting" of gestures. That is, if a gesture does not use a
parameter delta (in other words "an input channel is ignored in a particular gesture"), that
parameter could be used as a kind of "shift key". The most logical extension would work
like this:
- User picks a "shift" gesture, such as holding the thumb close to the palm.
- The shift gesture parameter (in this case the thumb delta) cannot be used in any of the
"regular" gestures. (This is to prevent "overlap" see below for more.)
- Now any regular gesture can be shifted, giving twice as many fundamental gestures.
- The # of fundamental gestures could in fact be doubled a few more times by adding
additional "shifting" gestures in the manner of the <Ctrl> and <Alt> keys on most PC
keyboards. (One idea I had was to change the meaning of gestures based on the
"quadrant" that the user is pointing at - upper left, upper right, lower left and lower right.
E.g. "fist" can be broken into "upper left fist", "lower right fist", etc.) HOWEVER, using
shifting gestures encourages modality in the client application, which is usually very
frustrating to the end-user learning an application. The human mind, really, only operates
in one "mode". I imagine that this is disturbing to some of you. User interfaces have
progressed past the "insert mode," "delete mode", "change mode" type. Equally offensive
to me are the "just hold down the shift to key to modify commands in such-and-such a
way." (E.g. shift-left-mouse-button to copy, ctrl-left-mouse-button to move, or Shift-F10
to print and Ctrl-F10 for print preview.) The human mind can't handle layers like that, but
the computer will store them forever. How many times have you moved a file when you
wanted to copy it, or printed a file when all you wanted was a preview? This happens
most frequently in beginning users, and users who have not used a program in a while. So
those of you who think that "shifting" gestures is neat had better grow up. If I choose to
implement them, it will be for a better reason than that
- Another basic problem is that to be effective, the shifting gestures must lock out an
entire channel of input which may limit the variety and intuitiveness of the "unshifted"
gestures. (What's the difference between a "shifted-fist" and an (unshifted) "fist" if the
shifting gesture is the thumb tucked into the palm?) More and more I'm thinking one shift
gesture is probably too much. Might want to make it optional for certain applications.
It would be useful for the gesture recognition system to do automatic "globbing".
E.g. the "punch-twirl-punch-punch" command would automatically be shortened to
something like "knockout". This is different from gesture combination in that it does not
directly involve time. Example - "punch" followed later by another "punch" could be
globbed to "twopunch". This way, "punch", "knockout", and "twopunch" would all be
separate gestures, even though they all involve a rapid thrust toward the screen. The
client app would not need to do any special interpretation to receive them correctly.
There is more about shifting, combination, and globbing later in this article.
Section II - Uses of this implementation
The intended implementation would allow the end user to define the gestures
he/she is most comfortable with. What the computer is supposed to do when a gesture is
received is not the subject of this article. There are a myriad of applications that could use
gesture receipt to trigger a function. Ideas:
Virtual Reality
Communicating between multiple users
Constructing rooms & objects
Movement within the world
"Training" of autonomous moving objects
Disabled persons
Simplified communication (speech synthesis, text generation)
Therapy, muscular exercise
Windowing/User Interfaces
Sizing, moving, selecting windows and data chunks
Education
Training for sign language
Visual computer programming
Introductions to computers
Multimedia applications
Moving between subjects.
Games
Control of unusual peripherals (robotic arms)
Perhaps with the glove and good gesture recognition, we can break out of "flat"
computer interfaces and put computers within the reach of more people. Keep your
fingers crossed!
Section III - The implementation itself
We need a way to "name" the gestures without using the keyboard. One way is to
have a built-in gesture set designed for entering alphabetic characters (similar to the way
hearing-impaired people do proper names). Another way is to just have a long list of
gestures that can be assigned, and you just select the gesture you want to define menu-
style. I think we want a combination of the above - I'll supply a list but users can also add
their own. Keep in mind that these ASCII names do not go into the event queue, they are
there solely for the convenience of the application. I imagine that once the gestures are
loaded from disk and any modifications are made by the end user, the names can be
removed from memory. They only need to be reloaded if the user wants to change the
definition. The app refers to the gestures via a pointer (or "handle"). (Side note: I've read
some discussion on Compuserve about whether to use #defines or constant strings to refer
to the gestures. NEITHER OF THESE METHODS IS FLEXIBLE ENOUGH.) Clearly
I need a way of saving/loading gestures/gesture-sets to/from disk. I will avoid "binary
dumps"; however, don't expect them to easily interpretable. (Although you never know!)
Okay. I am assuming that the glove will be sampled at regular intervals. The
length of this interval is not terribly important, but most applications are going to need
real-time input. When I refer to a "click" or a "time click" below, I mean the time between
glove samples. I suppose you could still do gesture recognition if the glove is sampled
irregularly, as long as the time between each sample is recorded, or maybe a time stamp on
each sample or something like that. It would still be a pain in the butt compared to regular
sampling. Basic objects are shown below:
Sample:
One UNSIGNED byte for each of the following X, Y, Z, Rotation, Thumb, Index,
Middle, Ring. (Practically speaking, the straight glove data.)
Gesture:
1 word pointer to a Recognizable.
Recognizable:
One SIGNED byte for each of the following: delta X, delta Y, delta Z, delta
Rotation, delta Thumb, delta Index, delta Middle, delta Ring
8 bit value with each bit signifying whether the corresponding "channel" is to be
used or ignored for the gesture.
One size_t time (in clicks) for the length of the gesture.
EventQueue:
A queue of Gestures.
SampleBuffer:
A random access array of Samples (updated after each click.) Must hold as many
samples as the largest Recognizable in the GestureList.
GestureList:
A sorted list of Recognizable objects. The sorting key is the length of the gesture.
The main engine works like so:
1. A new Sample is added to the SampleBuffer.
2. Step through the GestureList:
a. Compute a delta vector for the current Recognizable, (if it is not the same as
the last one.) This is done by subtracting each of the current Sample values from the
Sample values N clicks ago (which will be present in the SampleBuffer), where N is the
length of the current Recognizable.
b. See of the delta vector lies within Epsilon of the current Recognizable's delta
vector. Epsilon is a constant vector to allow for "close enough" user input. I will supply
the appropriate Epsilon vector for the Power Glove after experimenting.
c. If 2b. is true, add the address of the Recognizable to the event queue. The
address of the Recognizable is in fact a Gesture.
3. Continue forever!
Hmmm. This isn't quite right. Depending on how quickly the user makes a
gesture, it might make it to the queue several times, instead of just once as hoped. We
need to amend the data structures.
Recognizable:
Same as above, plus an 8 byte "work vector" and a 1 bit flag to say whether the
gesture is currently being sensed (the "sensed bit").
Gesture:
Same as above, plus a 1 bit value that is TRUE when the gesture is first
recognized, and FALSE when the gesture has been released.
In case you can't tell, event queue messages now take the form "punch(sense)" followed
later by "punch(release)". This is similar to the way mouse "click-hold-and-drag"
operations - one message is sent when the gesture is first "seen" and another is sent when
the gesture is finally released. Hmmm, rather than setting and resetting a bit to indicate
whether the gesture is turning on or off, why not just have the client app assume that
identical queue messages will be sent. The first one will always be the "turning on"
message, and it must be followed by another identical "turning off" message at a later
time. But the messages may not occur one right after another. Does that put too much
responsibility on the client? Hmmm. It would be nice to have just a single pointer in the
queue, but I'm leaning toward keeping the extra bit. Many client apps will want to ignore
the "release" message, and they would just have to check that bit to distinguish the
irrelevant messages. Otherwise they'll have to keep track...
I'll modify the engine as follows:
2. Step through the GestureList:
a. Compute a delta vector for the current Recognizable, (if it is not the same as
the last one.) This is done by subtracting each of the current Sample values from the
Sample values N clicks ago, where N is the length of the current Recognizable. If the
"sensed bit" is set, subtract the current values from the work vector (rather than from the
sample N clicks ago.).
b. See of the delta vector lies within Epsilon of the current Recognizable's delta
vector. Epsilon is a constant vector to allow for "close enough" user input. I will supply
the appropriate Epsilon vector for the Power Glove after experimenting.
c. If 2b. is true, and the "sensed bit" is not set, then set the "sensed bit" and add
the address of the Recognizable to the event queue, making sure it is a "turning on"
gesture. Also, copy the Sample from N clicks ago into the work vector!
d. If 2b is false, and the "sensed bit" is set, then send a "turning off" queue
message.
3. Continue forever and ever!
It should be plainly obvious how users can create their own gestures using JUST
THE GLOVE! They'll hit one of the glove buttons to start a gesture, make the gesture,
then hit the glove button again. All you need to store are the sample deltas and the length
of the gesture. What could be easier?
Glove gestures as I have defined them are rather "elemental". It is not possible to
define a gesture like "move hand up, wiggle your index finger, move toward the screen
and twist your wrist left" However, you COULD define 4 or 5 elemental gestures that
would be "added up" by the client application and interpreted as a single gesture! I will
include one or more relatively simple examples of this with the code, but I'm not sure if
they'll be the combination or globbing type.
That about covers it. I'm still not sure about shifting, combination, or globbing. I
have some deeper questions about these concepts which I have not really addressed here.
If you have understood everything I said, you probably have the same questions! I'll
probably use Borland's C++ CLASSLIB library to handle the queue, buffer, and array in
the first cut. I'll probably release that mainly to show beginning OOP programmers how a
class library is used. (I know there's a lot of you out there.) But the real version will not
use CLASSLIB for three reasons: 1) Portability, 2) Execution speed, and 3) Memory
usage. Good reasons don't you think! Only disadvantage is perhaps maintainability. But
this is a REAL TIME application. Compactness in both speed and size are penultimate. I
will promise you that it will be object oriented as well. I really want this thing to reach a
wide audience. If you doubt my credentials read MTP.BIO in the COMART forum on
CompuServe.
Afterword:
The system will be freeware with the provision that you give me credit if you use
any part of the source code for any purpose. It will be copyrighted. It is my opinion that
good software is only of value to the intellectual community when it is accessible. I am
making every effort to put this tool in a position where it will be used. Please use it! In
the spirit of Richard Stallman's work, I am permitting you to take advantage of my ideas.
If you make any significant improvements to the engine, please let me know.
If my gesture recognition system sounds good to you, here's how you can pay me
back. Think about what gesture sets will be useful to you. When the code is done, use it
to create the gesture sets you want. But I WANT TO SEE WHAT YOU HAVE DONE.
This is only fair. You may sell your application if you must, but you still owe it to me to
let me have your gesture sets. It's the only way I can see the effect of my effort. It will
encourage me to stay focused on this important area. There's obviously no way I can
force you to do this without investing big $$$, which I don't have. I just want you to "Be
Nice to Me" as Todd Rundgren so aptly put it (in 1971).
Also, if you're willing to hire me to work on Virtual Reality construction tools or
let me help in constructing actual Virtual Worlds, please contact me (see below). My
present job is very boring, and this kind of stuff is one of the few things which stimulate
me.
Mark T. Pflaging
Home:
7651 S. Arbory Lane
Laurel, MD 20707
(301)-498-5840
Work:
Cambridge Scientific Abstracts
7200 Wisconsin Avenue
Bethesda, MD 20814