Power Glove Gesture Recognition

Foreword:

Perhaps everyone who has used the Nintendo Power Glove, whether it be for entertainment or development purposes, has gained insight into its usefulness as a gesture-sensing device - something that "knows" what you are trying to say with your body. Of course, even the Power Glove is limited to hand and wrist input, but it is clearly a far more natural input device than the keyboard, mouse, or even the touch screen. To be more quantitative, the glove offers fully 8 channels of completely independent input, as opposed to 4 for the mouse or joystick (X, Y, and 2 buttons) and (essentially) 1 for the keyboard. It is as though you are playing an 8-string guitar that your computer can "hear" perfectly (although some of the strings only have 4 frets). What I want to implement is a program to take the flood of glove input and boil it down to easily interpreted commands like "punch", "move left", "move forward", "twist left", "twist right", "fist", and so on and so forth. It is much easier to write a useful application when this filtering has already been taken care of. I see the first and most important application of this implementation to be virtual reality. Simply put, VR is the attempt to make a computer "look and feel" more like real life and less like a stupid machine. The glove puts people in closer contact with computer mechanics, and perhaps more importantly, it puts the machine in closer contact with the human user! This much power will probably find other applications as well, and I list some potential ones later in the text.

This paper gives a brief description of how I will implement an excellent gesture recognition system. At this point I'm not asking for input on what you think of this, so if you don't like it, ignore it. The discussion is very technical and contains a lot of "stream of consciousness" writing. If this bothers you, I'm sorry; I don't have a lot of time to polish it at this point. I need to start coding as quickly as possible. If it bugs you that I'm releasing this before the code is done, well, read the afterword, then go back to your private, selfish excuse for a life. The rest of us are getting down to work.

Section I - How simple should recognized gestures be?

Fingers: Thumb, Index, Middle, Ring.
Location: X, Y, Z, and Rotation.

A gesture consists of a finger "delta" and/or a location "delta" ("delta" = "change in glove input"). A more subtle observation: we may wish to ignore the delta of any of the eight parameters. Do finger deltas need to be handled differently? (My answer turns out to be an emphatic "NO".) A big issue is that of gesture "combination". Should two identical rapid gestures be interpreted as a "double-click"? I think so (though later I'll reject this). The concept can be extended like so: say the gestures listed in the first paragraph can be sensed. If several gestures occur rapidly, there might be a gesture in the event queue like "punch-twirl-punch-punch", which would be distinguished from "punch", "twirl", "punch", "punch" (4 separate queue events) by the delay between the gestures. Clearly a queue of gestures would be required, similar to an event queue for mouse clicks.
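
To make the "combination" idea concrete, here is a minimal sketch in C. Everything in it (GestureEvent, COMBINE_WINDOW, the combine() function) is my own hypothetical illustration, not part of the engine described below:

    #include <stdio.h>
    #include <string.h>

    #define COMBINE_WINDOW 5    /* max clicks between gestures to combine */

    struct GestureEvent {
        char name[64];          /* e.g. "punch" or "punch-twirl-punch" */
        long click;             /* click count when the gesture completed */
    };

    /* Merge 'next' into 'cur' if it follows quickly enough; return 1 if merged. */
    int combine(struct GestureEvent *cur, const struct GestureEvent *next)
    {
        if (next->click - cur->click > COMBINE_WINDOW)
            return 0;                       /* too slow: two separate events */
        strcat(cur->name, "-");
        strcat(cur->name, next->name);      /* "punch" + "twirl" -> "punch-twirl" */
        cur->click = next->click;
        return 1;
    }

    int main(void)
    {
        struct GestureEvent a = { "punch", 10 };
        struct GestureEvent b = { "twirl", 13 };
        if (combine(&a, &b))
            printf("combined gesture: %s\n", a.name);   /* punch-twirl */
        return 0;
    }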

Another possibility is the "shifting" of gestures. That is, if a gesture does not use a parameter delta (in other words, an input channel is ignored in a particular gesture), that parameter could be used as a kind of "shift key". The most logical extension would work like this (there is a small sketch after the list):

- User picks a "shift" gesture, such as holding the thumb close to the palm.
- The shift gesture parameter (in this case the thumb delta) cannot be used in any of the "regular" gestures. (This is to prevent "overlap"; see below for more.)
- Now any regular gesture can be shifted, giving twice as many fundamental gestures.
- The number of fundamental gestures could in fact be doubled a few more times by adding additional "shifting" gestures, in the manner of the <Ctrl> and <Alt> keys on most PC keyboards. (One idea I had was to change the meaning of gestures based on the "quadrant" the user is pointing at - upper left, upper right, lower left, and lower right. E.g. "fist" can be broken into "upper left fist", "lower right fist", etc.) HOWEVER, using shifting gestures encourages modality in the client application, which is usually very frustrating to the end user learning an application. The human mind, really, only operates in one "mode". I imagine that this is disturbing to some of you. User interfaces have progressed past the "insert mode", "delete mode", "change mode" type. Equally offensive to me are the "just hold down the shift key to modify commands in such-and-such a way" interfaces. (E.g. shift-left-mouse-button to copy, ctrl-left-mouse-button to move, or Shift-F10 to print and Ctrl-F10 for print preview.) The human mind can't handle layers like that, but the computer will store them forever. How many times have you moved a file when you wanted to copy it, or printed a file when all you wanted was a preview? This happens most frequently with beginning users, and with users who have not used a program in a while. So those of you who think that "shifting" gestures are neat had better grow up. If I choose to implement them, it will be for a better reason than that.
- Another basic problem is that to be effective, the shifting gestures must lock out an entire channel of input, which may limit the variety and intuitiveness of the "unshifted" gestures. (What's the difference between a "shifted fist" and an (unshifted) "fist" if the shifting gesture is the thumb tucked into the palm?) More and more I'm thinking even one shift gesture is probably too much. I might want to make it optional for certain applications.
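
For what it's worth, here is how the doubling arithmetic works out in code. This is purely my own illustration (SHIFT_THUMB, SHIFT_QUAD_X, SHIFT_QUAD_Y, and shifted_id() are hypothetical names); each independent shift gesture contributes one bit, so N shifts multiply the base gesture set by 2 to the N:

    #include <stdio.h>

    #define SHIFT_THUMB  0x01   /* thumb held close to the palm */
    #define SHIFT_QUAD_X 0x02   /* pointing at the right half of the screen */
    #define SHIFT_QUAD_Y 0x04   /* pointing at the upper half of the screen */

    /* A full gesture id is the base gesture number plus the shift bits. */
    unsigned shifted_id(unsigned base, unsigned shifts)
    {
        return (base << 3) | (shifts & 0x07);
    }

    int main(void)
    {
        unsigned fist = 4;   /* some base gesture number */
        printf("upper right fist id: %u\n",
               shifted_id(fist, SHIFT_QUAD_X | SHIFT_QUAD_Y));
        return 0;
    }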

It would be useful for the gesture recognition system to do automatic "globbing". E.g. the "punch-twirl-punch-punch" command would automatically be shortened to something like "knockout". This is different from gesture combination in that it does not directly involve time. Example: "punch" followed later by another "punch" could be globbed to "twopunch". This way, "punch", "knockout", and "twopunch" would all be separate gestures, even though they all involve a rapid thrust toward the screen. The client app would not need to do any special interpretation to receive them correctly. There is more about shifting, combination, and globbing later in this article.
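
Globbing amounts to a lookup table from sequences to single names. A minimal sketch, with a hypothetical glob() function and table entries taken from the examples above:

    #include <stdio.h>
    #include <string.h>

    struct Glob {
        const char *sequence;   /* elemental gestures, dash-separated */
        const char *name;       /* the single gesture reported instead */
    };

    static const struct Glob glob_table[] = {
        { "punch-twirl-punch-punch", "knockout" },
        { "punch-punch",             "twopunch" },
    };

    /* Return the globbed name for a sequence, or the sequence itself. */
    const char *glob(const char *sequence)
    {
        size_t i;
        for (i = 0; i < sizeof(glob_table) / sizeof(glob_table[0]); i++)
            if (strcmp(glob_table[i].sequence, sequence) == 0)
                return glob_table[i].name;
        return sequence;
    }

    int main(void)
    {
        printf("%s\n", glob("punch-twirl-punch-punch"));   /* prints "knockout" */
        return 0;
    }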

Section II - Uses of this implementation

The intended implementation would allow the end user to define the gestures he/she is most comfortable with. What the computer is supposed to do when a gesture is received is not the subject of this article. There are a myriad of applications that could use gesture receipt to trigger a function. Ideas:

Virtual Reality
    Communicating between multiple users
    Constructing rooms & objects
    Movement within the world
    "Training" of autonomous moving objects
Disabled persons
    Simplified communication (speech synthesis, text generation)
    Therapy, muscular exercise
Windowing/User Interfaces
    Sizing, moving, selecting windows and data chunks
Education
    Training for sign language
    Visual computer programming
    Introductions to computers
Multimedia applications
    Moving between subjects
Games
Control of unusual peripherals (robotic arms)

Perhaps with the glove and good gesture recognition, we can break out of "flat" computer interfaces and put computers within the reach of more people. Keep your fingers crossed!

Section III - The implementation itself

We need a way to "name" the gestures without using the keyboard. One way is to have a built-in gesture set designed for entering alphabetic characters (similar to the way hearing-impaired people spell out proper names). Another way is to just have a long list of gestures that can be assigned, and you select the gesture you want to define menu-style. I think we want a combination of the above - I'll supply a list, but users can also add their own. Keep in mind that these ASCII names do not go into the event queue; they are there solely for the convenience of the application. I imagine that once the gestures are loaded from disk and any modifications are made by the end user, the names can be removed from memory. They only need to be reloaded if the user wants to change the definition. The app refers to the gestures via a pointer (or "handle"). (Side note: I've read some discussion on CompuServe about whether to use #defines or constant strings to refer to the gestures. NEITHER OF THESE METHODS IS FLEXIBLE ENOUGH.) Clearly I need a way of saving/loading gestures/gesture-sets to/from disk. I will avoid "binary dumps"; however, don't expect the files to be easily interpretable. (Although you never know!)
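
To show what I mean by a "handle" being more flexible than a #define or a constant string, here is a sketch. The names (find_gesture(), gesture_list) are hypothetical; the point is that the string is used exactly once, at load time, and the app compares pointers afterwards:

    #include <stdio.h>
    #include <string.h>

    typedef struct Recognizable *Gesture;   /* a "handle" is just a pointer */

    struct Recognizable {
        char name[32];   /* ASCII name, discardable after lookup */
        /* ... deltas, channel mask, gesture length ... */
    };

    static struct Recognizable gesture_list[] = { {"punch"}, {"fist"} };

    Gesture find_gesture(const char *name)
    {
        size_t i;
        for (i = 0; i < sizeof(gesture_list) / sizeof(gesture_list[0]); i++)
            if (strcmp(gesture_list[i].name, name) == 0)
                return &gesture_list[i];
        return NULL;
    }

    int main(void)
    {
        Gesture punch = find_gesture("punch");   /* once, after loading */
        /* later, for each queue entry g:  if (g == punch) ... */
        printf("punch handle: %p\n", (void *)punch);
        return 0;
    }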

Okay. I am assuming that the glove will be sampled at regular intervals. The length of this interval is not terribly important, but most applications are going to need real-time input. When I refer to a "click" or a "time click" below, I mean the time between glove samples. I suppose you could still do gesture recognition if the glove is sampled irregularly, as long as the time between each sample is recorded, or maybe a time stamp is put on each sample or something like that. It would still be a pain in the butt compared to regular sampling. Basic objects are shown below:

Sample:
    One UNSIGNED byte for each of the following: X, Y, Z, Rotation, Thumb, Index, Middle, Ring. (Practically speaking, the straight glove data.)

Gesture:
    One word: a pointer to a Recognizable.

Recognizable:
    One SIGNED byte for each of the following: delta X, delta Y, delta Z, delta Rotation, delta Thumb, delta Index, delta Middle, delta Ring.
    One 8-bit value, with each bit signifying whether the corresponding "channel" is to be used or ignored for the gesture.
    One size_t time (in clicks) for the length of the gesture.

EventQueue:
    A queue of Gestures.

SampleBuffer:
    A random-access array of Samples (updated after each click). Must hold as many samples as the largest Recognizable in the GestureList.

GestureList:
    A sorted list of Recognizables. The sorting key is the length of the gesture.
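
In C, as I currently picture them (field names are my own shorthand; the comments for EventQueue, SampleBuffer, and GestureList stand in for whatever containers the final code uses):

    #include <stddef.h>

    #define CHANNELS 8   /* X, Y, Z, Rotation, Thumb, Index, Middle, Ring */

    struct Sample {
        unsigned char ch[CHANNELS];    /* the straight glove data */
    };

    struct Recognizable {
        signed char   delta[CHANNELS]; /* change over the gesture's length */
        unsigned char mask;            /* bit set = channel used, clear = ignored */
        size_t        length;          /* length of the gesture, in clicks */
    };

    typedef struct Recognizable *Gesture;   /* the one-word pointer */

    /* EventQueue:   a FIFO of Gesture values.                   */
    /* SampleBuffer: the most recent Samples, one per click.     */
    /* GestureList:  Recognizables sorted by 'length'.           */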

The main engine works like so:

1. A new Sample is added to the SampleBuffer.
2. Step through the GestureList:
    a. Compute a delta vector for the current Recognizable (if it is not the same as the last one). This is done by subtracting each of the current Sample values from the Sample values N clicks ago (which will be present in the SampleBuffer), where N is the length of the current Recognizable.
    b. See if the delta vector lies within Epsilon of the current Recognizable's delta vector. Epsilon is a constant vector to allow for "close enough" user input. I will supply the appropriate Epsilon vector for the Power Glove after experimenting.
    c. If 2b is true, add the address of the Recognizable to the event queue. The address of the Recognizable is in fact a Gesture.
3. Continue forever!

Hmmm. This isn't quite right. Depending on how quickly the user makes a gesture, it might make it to the queue several times, instead of just once as hoped. We need to amend the data structures.

Recognizable:
    Same as above, plus an 8-byte "work vector" and a 1-bit flag to say whether the gesture is currently being sensed (the "sensed bit").

Gesture:
    Same as above, plus a 1-bit value that is TRUE when the gesture is first recognized, and FALSE when the gesture has been released.

In case you can't tell, event queue messages now take the form "punch(sense)" followed later by "punch(release)". This is similar to the way mouse "click-hold-and-drag" operations work - one message is sent when the gesture is first "seen" and another is sent when the gesture is finally released. Hmmm, rather than setting and resetting a bit to indicate whether the gesture is turning on or off, why not just have the client app assume that identical queue messages will be sent in pairs? The first one will always be the "turning on" message, and it must be followed by another identical "turning off" message at a later time. But the messages may not occur one right after another. Does that put too much responsibility on the client? Hmmm. It would be nice to have just a single pointer in the queue, but I'm leaning toward keeping the extra bit. Many client apps will want to ignore the "release" message, and they would only have to check that bit to distinguish the irrelevant messages. Otherwise they'll have to keep track...

I'll modify the engine as follows:

2. Step through the GestureList:
    a. Compute a delta vector for the current Recognizable (if it is not the same as the last one). This is done by subtracting each of the current Sample values from the Sample values N clicks ago, where N is the length of the current Recognizable. If the "sensed bit" is set, subtract the current values from the work vector (rather than from the Sample N clicks ago).
    b. See if the delta vector lies within Epsilon of the current Recognizable's delta vector. Epsilon is a constant vector to allow for "close enough" user input. I will supply the appropriate Epsilon vector for the Power Glove after experimenting.
    c. If 2b is true, and the "sensed bit" is not set, then set the "sensed bit" and add the address of the Recognizable to the event queue, making sure it is a "turning on" gesture. Also, copy the Sample from N clicks ago into the work vector!
    d. If 2b is false, and the "sensed bit" is set, then clear the "sensed bit" and send a "turning off" queue message.
3. Continue forever and ever!
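
To check my own understanding, here is the amended loop sketched in C, building on the structures above. The Epsilon values, the ring-buffer details, and all the names are placeholders of mine, and I compute deltas as current-minus-old, which is fine as long as recording uses the same convention:

    #include <stdlib.h>

    #define CHANNELS   8
    #define MAX_LENGTH 64                 /* longest gesture, in clicks */

    struct Sample { unsigned char ch[CHANNELS]; };

    struct Recognizable {
        signed char   delta[CHANNELS];    /* recorded gesture deltas */
        unsigned char mask;               /* bit i set = channel i used */
        size_t        length;             /* N, in clicks (< MAX_LENGTH) */
        struct Sample work;               /* the work vector */
        int           sensed;             /* the "sensed bit" */
    };

    static const int epsilon[CHANNELS] = { 4, 4, 4, 4, 2, 2, 2, 2 };  /* placeholder */

    static struct Sample buffer[MAX_LENGTH];  /* SampleBuffer as a ring */
    static size_t now;                        /* index of the newest Sample */

    static struct Sample *sample_ago(size_t n)
    {
        return &buffer[(now + MAX_LENGTH - n) % MAX_LENGTH];
    }

    /* One pass of step 2 for one Recognizable, run once per click.
       Returns 1 for a "turning on" event, -1 for "turning off", 0 for none. */
    int step(struct Recognizable *r)
    {
        struct Sample *old = r->sensed ? &r->work : sample_ago(r->length);
        struct Sample *cur = sample_ago(0);
        int i, match = 1;

        for (i = 0; i < CHANNELS; i++) {          /* steps 2a and 2b */
            int d;
            if (!(r->mask & (1 << i)))
                continue;                         /* channel ignored */
            d = (int)cur->ch[i] - (int)old->ch[i];
            if (abs(d - r->delta[i]) > epsilon[i])
                match = 0;                        /* outside Epsilon */
        }
        if (match && !r->sensed) {                /* step 2c */
            r->sensed = 1;
            r->work = *sample_ago(r->length);     /* remember the start pose */
            return 1;                             /* queue "turning on" */
        }
        if (!match && r->sensed) {                /* step 2d */
            r->sensed = 0;
            return -1;                            /* queue "turning off" */
        }
        return 0;
    }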

It should be plainly obvious how users can create their own gestures using JUST THE GLOVE! They'll hit one of the glove buttons to start a gesture, make the gesture, then hit the glove button again. All you need to store are the sample deltas and the length of the gesture. What could be easier?
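
Recording, then, is one subtraction per channel. A sketch, reusing the hypothetical types from the engine sketch (and assuming, per the Recognizable definition, that every delta fits in a signed byte):

    #include <stddef.h>

    #define CHANNELS 8
    struct Sample       { unsigned char ch[CHANNELS]; };
    struct Recognizable { signed char delta[CHANNELS];
                          unsigned char mask; size_t length; };

    /* Build a new Recognizable from the Sample at the button-down click
       and the Sample at the button-up click, plus the elapsed clicks. */
    struct Recognizable record(const struct Sample *start,
                               const struct Sample *end, size_t clicks)
    {
        struct Recognizable r;
        int i;
        for (i = 0; i < CHANNELS; i++)
            r.delta[i] = (signed char)(end->ch[i] - start->ch[i]);
        r.mask   = 0xFF;     /* default: every channel used */
        r.length = clicks;   /* gesture length, in clicks */
        return r;
    }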

Glove gestures as I have defined them are rather "elemental". It is not possible to define a gesture like "move hand up, wiggle your index finger, move toward the screen, and twist your wrist left". However, you COULD define 4 or 5 elemental gestures that would be "added up" by the client application and interpreted as a single gesture! I will include one or more relatively simple examples of this with the code, but I'm not sure if they'll be of the combination or globbing type.

That about covers it. I'm still not sure about shifting, combination, or globbing. I have some deeper questions about these concepts which I have not really addressed here. If you have understood everything I said, you probably have the same questions! I'll probably use Borland's C++ CLASSLIB library to handle the queue, buffer, and array in the first cut. I'll probably release that mainly to show beginning OOP programmers how a class library is used. (I know there's a lot of you out there.) But the real version will not use CLASSLIB, for three reasons: 1) portability, 2) execution speed, and 3) memory usage. Good reasons, don't you think? The only disadvantage is perhaps maintainability. But this is a REAL TIME application. Compactness in both speed and size is paramount. I promise that it will be object oriented as well. I really want this thing to reach a wide audience. If you doubt my credentials, read MTP.BIO in the COMART forum on CompuServe.

Afterword:

The system will be freeware, with the provision that you give me credit if you use any part of the source code for any purpose. It will be copyrighted. It is my opinion that good software is only of value to the intellectual community when it is accessible. I am making every effort to put this tool in a position where it will be used. Please use it! In the spirit of Richard Stallman's work, I am permitting you to take advantage of my ideas. If you make any significant improvements to the engine, please let me know.

If my gesture recognition system sounds good to you, here's how you can pay me back. Think about what gesture sets will be useful to you. When the code is done, use it to create the gesture sets you want. But I WANT TO SEE WHAT YOU HAVE DONE. This is only fair. You may sell your application if you must, but you still owe it to me to let me have your gesture sets. It's the only way I can see the effect of my effort. It will encourage me to stay focused on this important area. There's obviously no way I can force you to do this without investing big $$$, which I don't have. I just want you to "Be Nice to Me", as Todd Rundgren so aptly put it (in 1971).

Also, if you're willing to hire me to work on Virtual Reality construction tools or let me help in constructing actual Virtual Worlds, please contact me (see below). My present job is very boring, and this kind of stuff is one of the few things that stimulate me.

Mark T. Pflaging

Home:
7651 S. Arbory Lane
Laurel, MD 20707
(301) 498-5840

Work:
Cambridge Scientific Abstracts
7200 Wisconsin Avenue
Bethesda, MD 20814