A common belief is that interfaces to VR applications should be as natural as possible: it is assumed that the most realistic system will also be the most useful one. Even though issuing commands seems inherently unnatural, there are physical-world parallels. Since most of these involve giving spoken orders to others, it is not surprising that one popular method of issuing commands in VR is voice recognition.
The most serious problem of voice recognition is the nearly unlimited degree of freedom in such systems.
A single human voice produces such a variety of sounds, not to mention the differences between the voices of different users, that fast and accurate speech recognition is very difficult to achieve. Algorithms are improving, but the most accurate ones still require training by each individual user before actual use. This is unacceptable for systems with a large number of occasional and first-time users.
Secondly, once the computer (the application) is trained to the voice of an individual user, the user in turn has to be trained to know which commands are valid, i.e. which words (in what context) can be used to interact with the application.
Of course the application cannot know every word the user knows, let alone interpret each one correctly; the program understands only a few commands in a specific context.
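Such a restricted, context-dependent vocabulary can be sketched as a simple lookup: a recognized word is dispatched only if it is a valid command in the current interaction context. The command words and action names below are hypothetical examples, not taken from any particular application.

```python
# Hypothetical sketch: each interaction context accepts only a small
# set of command words, mapped to application actions.
COMMANDS = {
    "navigation": {"forward": "move_forward", "back": "move_back", "stop": "halt"},
    "selection":  {"grab": "pick_object", "drop": "release_object"},
}

def interpret(word, context):
    """Return the action for a recognized word, or None if the word
    is not a valid command in the current context."""
    return COMMANDS.get(context, {}).get(word)
```

A word that is perfectly valid in one context (e.g. "forward" while navigating) simply maps to nothing in another, which is exactly the behaviour the user has to be trained to anticipate.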
Another problem occurs when the user is not alone inside the CAVE (which is a multi-user VR system). An application has no way to distinguish between words addressed to the program and words said to the other users. This becomes especially problematic when words spoken during a conversation accidentally trigger commands in the application.
A solution to some of these problems are voice-recognition menus, which show the user which commands are currently valid (and what function each one will trigger).
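A voice-recognition menu of this kind can be sketched as a table of the currently valid command words together with a short description of each, rendered for the user and consulted before dispatching recognized words. The menu entries here are invented for illustration.

```python
# Hypothetical menu: the currently valid voice commands and what they do.
MENU = {
    "fly":    "start fly-through navigation",
    "select": "enter object-selection mode",
    "quit":   "leave the application",
}

def render_menu(menu):
    """Produce the text shown to the user inside the virtual environment."""
    return "\n".join(f"say '{word}' -> {desc}" for word, desc in menu.items())

def is_valid(word, menu):
    """A recognized word only triggers an action if it appears in the menu."""
    return word in menu
```

Because the menu both documents the vocabulary and acts as the filter for recognized words, the user no longer has to memorize which commands exist, and out-of-vocabulary speech is discarded rather than misinterpreted.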