Multi-feature speech/music discrimination system
2010-03-29 00:00:00
FIGS. 3a and 3b are histograms of the spectral centroid for speech and music signals, respectively;
FIGS. 4a and 4b are histograms of the spectral flux for speech and music signals, respectively;
FIGS. 5a and 5b are histograms of the zero-crossing rate for speech and music signals, respectively;
FIGS. 6a and 6b are histograms of the spectral roll-off for speech and music signals, respectively;
FIGS. 7a and 7b are histograms of the cepstral resynthesis residual magnitude for speech and music signals, respectively;
FIG. 7c is a graph showing the power spectra for voiced speech and a smoothed version of the speech signal;
FIGS. 8a and 8b are graphs depicting variances between speech and music signals, in general;
FIGS. 9a and 9b are histograms of the variation in spectral flux for speech and music signals, respectively;
FIGS. 10a and 10b are histograms of the proportion of low energy frames for speech and music signals, respectively;
FIG. 11 is a block diagram of a speech modulation detector;
FIGS. 12a and 12b are histograms of the 4 Hz modulation energy for speech and music signals, respectively;
FIG. 13 is a block diagram of a circuit for determining the pulse metric of signals, along with corresponding signal graphs for two bands at each stage of the circuit;
FIGS. 14a and 14b are histograms of the pulse metric for speech and music signals, respectively;
FIG. 15 is a graph illustrating the probability distributions of two measured features;
FIG. 16 is a more detailed block diagram of a discriminator; and
FIG. 17 is a graph illustrating an example of speech/music decisions for a sequence of frames.
DETAILED DESCRIPTION
In the following discussion, various embodiments of the invention are described in the context of a speech/music discriminator. In other words, all input sounds are considered to fall within one of the two classes of speech or music. In practice, of course, other components can also be present within an audio signal, such as noise, silence or simultaneous speech and music. In some situations where these other types of data are present in the audio signal, it might be more desirable to employ the invention as a speech detector or a music detector. A speech detector differs from a speech/music discriminator in that its output is not labeled as speech or music. Rather, the audio signal is classified as either "speech" or "non-speech", in which the latter class consists of music, noise, silence and any other audio-related component that is not classified as speech per se. Such a detector may be useful, for example, in an automatic speech recognition context.
The general construction of a speech/music discriminator in accordance with the present invention is illustrated in block diagram form in FIG. 1. An audio signal 10 to be classified is fed to a feature detector 12. If the audio signal is in analog form, for example a radio signal or the output signal from a microphone, it is first converted into a digital format. Within the feature detector, the digital signal is analyzed to measure various quantifiable components that characterize the signal. The individual components, or features, are described in detail hereinafter. Preferably, the audio signal is analyzed on a frame-by-frame basis. Referring to FIG. 2, for example, an audio signal 10 is divided into a plurality of overlapping frames. In the preferred embodiment illustrated therein, each frame has a length of about 40 milliseconds, and adjacent frames overlap one another by one-half of a frame, i.e., 20 milliseconds. Each feature is measured over the duration of each full frame. In addition, for some of the features, the variation of that feature's value over several frames is determined.
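The framing scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the feature detector 12: the function names `split_into_frames` and `zero_crossing_rate` are hypothetical, the 8 kHz sample rate in the usage note is an assumption, and the zero-crossing definition used here (fraction of adjacent sample pairs that change sign) is one common formulation rather than the one the specification necessarily uses.

```python
def split_into_frames(samples, sample_rate, frame_ms=40.0, overlap=0.5):
    """Split a sample sequence into fixed-length overlapping frames.

    frame_ms=40.0 and overlap=0.5 follow the preferred embodiment
    (40 ms frames, adjacent frames overlapping by half a frame).
    Trailing samples that do not fill a whole frame are dropped
    (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)
```

With an assumed sample rate of 8000 Hz, each frame holds 320 samples and consecutive frames start 160 samples apart. A per-frame feature such as `zero_crossing_rate` can then be computed for every frame, and its variation over several frames derived from the resulting sequence of values.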
After the values for all of the features have been determined for a given frame, or series of frames, they are presented to a selector 14. Depending upon the particular application, certain combinations of features may provide more accurate results than others. In this regard, it is not necessarily the case that the classification accuracy increases with the number of features that are analyzed. Rather, the data that is provided with respect to some features may decrease overall performance, and therefore it is preferable to eliminate the data of those features from the classification process. Furthermore, by reducing the total number of features that are analyzed, the amount of data to be interpreted is reduced, thereby increasing the speed of the classification process. The best set of features to employ is empirically determined for different situations, and is discussed in detail hereina...
Method for encoding music printing information in a MIDI message
2010-03-10 00:00:00
messages may contain any number of data bytes and can be terminated either by an end of exclusive or any other status byte, with the exception of Real-Time messages. An end of exclusive should always be sent at the end of a system exclusive message. System exclusive messages always include a manufacturer's identification code. If a receiver does not recognize the identification code, it will ignore the following data.
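The termination and identification rules just described can be illustrated with a small parser. This is a hedged sketch rather than the method of the invention: the function names `parse_sysex` and `accept_sysex` are hypothetical, and the byte values rely on the published MIDI 1.0 conventions (status byte 0xF0 for System Exclusive, 0xF7 for end of exclusive, 0xF8 through 0xFF for Real-Time messages, and a manufacturer ID of one byte, or three bytes when the first is 0x00).

```python
SYSEX_START = 0xF0  # System Exclusive status byte
SYSEX_END = 0xF7    # End of Exclusive (EOX)


def is_real_time(byte):
    """Real-Time messages occupy status bytes 0xF8 through 0xFF."""
    return 0xF8 <= byte <= 0xFF


def parse_sysex(stream):
    """Return (manufacturer_id, data) for the first sysex message in stream."""
    it = iter(stream)
    for b in it:
        if b == SYSEX_START:
            break
    else:
        raise ValueError("no System Exclusive status byte found")

    body = []
    for b in it:
        if is_real_time(b):
            continue  # Real-Time bytes may be interleaved and do not terminate
        if b & 0x80:
            break     # EOX or any other status byte ends the message
        body.append(b)

    if not body:
        raise ValueError("message has no manufacturer identification code")
    id_len = 3 if body[0] == 0x00 else 1  # 0x00 introduces a three-byte ID
    if len(body) < id_len:
        raise ValueError("truncated manufacturer identification code")
    return tuple(body[:id_len]), bytes(body[id_len:])


def accept_sysex(stream, known_ids):
    """Return the data bytes, or None when the ID is not recognized."""
    manufacturer_id, data = parse_sysex(stream)
    return data if manufacturer_id in known_ids else None
```

For example, `parse_sysex([0xF0, 0x7D, 0x01, 0xF8, 0x02, 0xF7])` skips the interleaved Real-Time byte 0xF8 and yields the one-byte ID `(0x7D,)` with data bytes 0x01 and 0x02. A receiver that does not list that ID in `known_ids` gets `None` back from `accept_sysex`, mirroring the rule that unrecognized messages are ignored.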
As those skilled in the art will appreciate upon reference to the foregoing, musical compositions may be encoded utilizing the MIDI standard and stored and/or transmitted utilizing substantially less data than would otherwise be required. The MIDI standard permits the transmittal of a serial listing of program status messages and channel messages, such as "note on" and "note off", which as a consequence requires substantially less digital data to encode than the straightforward digitization of an analog musical signal. Using the MIDI system provides additional advantages including the ability to transmit the MIDI signals to a variety of MIDI-compatible devices to allow simultaneous translation of a single signal for multiple purposes and also to allow mixing signals to a variety of devices on a single signalling connection.
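A rough calculation makes the size difference concrete. The sketch below is illustrative only: the helper names `note_on`, `note_off` and `pcm_bytes` are hypothetical, and the digitized-audio figures (44.1 kHz, 16-bit, stereo, uncompressed) are assumptions chosen for comparison, not values taken from the specification. The three-byte layout of the "note on" and "note off" channel messages follows the MIDI 1.0 convention of a status byte followed by note number and velocity.

```python
def note_on(channel, note, velocity):
    """Three-byte Note On message: status 0x9n, note number, velocity."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])


def note_off(channel, note, velocity=0):
    """Three-byte Note Off message: status 0x8n, note number, velocity."""
    return bytes([0x80 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])


def pcm_bytes(seconds, sample_rate=44100, bytes_per_sample=2, channels=2):
    """Size of uncompressed digitized audio (assumed format, see lead-in)."""
    return int(seconds * sample_rate * bytes_per_sample * channels)


# One note held for one second: a Note On followed later by a Note Off.
midi_size = len(note_on(0, 60, 100)) + len(note_off(0, 60))
audio_size = pcm_bytes(1.0)
```

Under these assumptions the single note costs 6 bytes as MIDI messages, against 176,400 bytes for one second of digitized audio. The comparison ignores timing information and the sound-generating device needed to render the MIDI messages, so it indicates scale rather than an exact ratio.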
Extensions to MIDI
The MIDI standard was originally designed for communication between electronic instruments. About the time it was being developed, however, it was also becoming clear to many people that not only could one musical instrument be used to control another, but musical instruments themselves could be controlled by computer. Put another way, a sequence of electronic (MIDI) commands that might be generated (in a performance) on a musical keyboard could also be generated by computer. The receiving instrument has no way to know what the origin of the commands is (computer vs. live performance). In a similar manner, the "sending" instrument knows nothing about the nature of the "receiving" instrument under the MIDI standard. It is therefore possible for the receiving instrument to be a computer, which is actually recording (receiving and storing) the MIDI signals from the sender. In this way, a musical performance (defined as a series of physical gestures on an electronic keyboard), can be recorded (received and stored) on a computer and later played back on (sent to) the same keyboard or some other sound generating device which "understands" MIDI commands.
With the advent of "MIDI" recordings and simulated recordings compiled by software, the need arose to find a way to pass this data from one computer to another. Also there were several descriptive aspects of the...