In perception, auditory scene analysis (ASA) is the process by which the auditory system constructs meaningful perceptual components from sound. The term was coined by psychologist Albert Bregman, whose 1990 book summarized contemporary research and proposed conceptual foundations for the field.[1][2] The study of ASA has traditionally concerned the perception of multiple, distinct sound sources. More recently, the term has also been used to encompass perception related to other factors in sound generation, such as reverberation due to the environment.[3] Computational auditory scene analysis (CASA) involves the implementation of ASA in computational systems, which has contributed both to the development of formal theories of ASA in human listeners[4] and to building systems for machine perception. The interaction of auditory scene analysis with the classic psychological concept of attention is exemplified by the cocktail party problem.[5]

Background

The soundwave received by the ear is often composed of a mixture of sounds produced by different sources. For instance, different instruments in a musical ensemble each produce their own distinct vibrations, but those vibrations combine in the air before reaching the ear. Despite receiving only this single soundwave, a listener typically experiences several streams of sound, each of which may appear to arise from a separate source. Auditory scene analysis describes the process by which multiple meaningful entities (e.g., sources such as musical instruments) are perceived from the single soundwave received at the ear. Determining these entities from the observed soundwave alone is ill-posed: there are infinitely many combinations of arbitrary source soundwaves that could physically produce the observed soundwave. Because the source structure is therefore not inherent in the sound alone, the auditory system itself must embody principles by which it arrives at a particular perceptual organization.
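
The ambiguity can be made explicit with a simple additive mixing model (an illustrative formulation, not drawn from the cited sources). If the observed soundwave x(t) is the sum of N source soundwaves,

$$x(t) = \sum_{i=1}^{N} s_i(t),$$

then for any candidate set of sources, adding an arbitrary signal to one source and subtracting the same signal from another yields a different set of sources that produces exactly the same x(t); the observation by itself cannot distinguish between these alternatives.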

Hypothesized principles have typically been formulated as heuristics which govern whether vibrations of varying frequencies across time should be grouped (as belonging to a single source) or segregated, somewhat analogous to Gestalt principles of perceptual organization in vision.[6] In CASA, such principles have also been formulated in terms of constraints on the relationship between distinct sounds comprising a mixture (e.g., statistical independence) as well as assumptions about the properties of isolated sources.[7] In the latter case, ASA is seen as a process of Bayesian inference, in which probable sources are inferred given the observed soundwave and assumptions (i.e., prior beliefs) about single sources.
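
This inferential view can be summarized with Bayes' rule (written here in a generic form chosen for illustration, rather than notation taken from the cited works):

$$p(s_1, \ldots, s_N \mid x) \propto p(x \mid s_1, \ldots, s_N)\, p(s_1, \ldots, s_N),$$

where x is the observed soundwave, the prior p(s_1, ..., s_N) encodes assumptions about the properties of individual sources, the likelihood p(x | s_1, ..., s_N) encodes how sources combine to produce the observed soundwave, and the perceived organization corresponds to sources with high posterior probability.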

In line with thinking on natural scene statistics, Bregman proposed that organisms' auditory systems have adapted through evolution to regularities in their natural sonic environments, and that this adaptation is the basis for many ASA principles.[8] That is, because sources do not produce arbitrary sounds, organisms are able to internalize this structure over evolutionary time. He also suggested that perceptual learning within an organism's lifetime could shape how it hears auditory scenes.

History

In his treatise On the Sensations of Tone (1895 English edition), Hermann von Helmholtz described how a note played by a musical instrument is composed of multiple, harmonically related pure tones (each consisting of a single frequency).[9] He further described how a listener could influence whether they perceived such a sound as a single note or as its constituent pure tones, which he termed synthetic versus analytic listening. In contrast to the work on visual perceptual grouping carried out by the Gestalt school in the early 20th century, seminal work on auditory perceptual organization did not begin in earnest until the 1950s. Examples include Colin Cherry's 1953 research on the cocktail party problem[10] and Broadbent and Ladefoged's 1957 work on the grouping of different frequencies in vowel sounds.[11]

Studies on auditory perceptual organization were continued through the 1960s–1980s by researchers such as Richard Warren, Chris Darwin, Albert Bregman, and others, and this body of work was summarized in Bregman's 1990 book.[1] A small amount of CASA research had begun prior to the publication of Bregman's book, mainly on speech separation systems.[4][12][13]

Bregman's book further motivated computationally inclined researchers to attempt to instantiate human auditory grouping principles in computational systems, particularly in the style set out by David Marr in his 1982 book Vision.[14][15]

Research on ASA continues to the present, but according to a 2016 review, the field lacks a comprehensive account of human ASA.[16] Furthermore, a 2014 review noted a lack of ecologically relevant research on human ASA, as most research has involved relatively simple laboratory stimuli rather than the sounds that people hear in their everyday environments.[17] In CASA, machine learning approaches such as non-negative matrix factorization and deep learning are being applied to ASA problems in specific applications, such as speech separation.[18][19][20]
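
As an illustration of the non-negative matrix factorization approach, the following sketch (a toy example with synthetic data and hypothetical parameter choices, not drawn from the cited papers) factors a mixture magnitude spectrogram into spectral templates and time-varying activations, from which per-source spectrograms can be estimated:

```python
# Toy sketch of NMF-based source separation on a synthetic magnitude "spectrogram".
# All data and parameter choices here are illustrative assumptions.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Two synthetic sources, each with a fixed spectral template and
# time-varying activations, mixed additively.
n_freqs, n_frames = 64, 200
true_spectra = np.abs(rng.normal(size=(n_freqs, 2)))       # one spectral template per source
true_activations = np.abs(rng.normal(size=(2, n_frames)))  # per-frame gains
mixture = true_spectra @ true_activations                   # observed mixture spectrogram

# Factor the mixture as W @ H with non-negative W (spectral templates) and
# H (activations). Each rank-1 term outer(W[:, k], H[k, :]) is an estimate
# of one source's contribution to the mixture.
model = NMF(n_components=2, init="nndsvda", max_iter=500)
W = model.fit_transform(mixture)
H = model.components_
source_estimates = [np.outer(W[:, k], H[k, :]) for k in range(2)]

print(source_estimates[0].shape)  # (64, 200)
```

In practice the factorization would typically be applied to the magnitude of a short-time Fourier transform of real audio, and the estimated source spectrograms would be converted back to waveforms (for example, by masking the mixture), but the toy example conveys the basic decomposition.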

Experimental methods

To study ASA, it is necessary to examine which components (sometimes called streams) are perceived in an auditory scene. Experiments therefore typically involve manipulating aspects of a sound mixture that are hypothesized to affect perceptual organization and assessing the consequences for what listeners perceive. Such experiments have used a variety of methods to measure listeners' perception, including:

  • collecting listeners' direct subjective reports of their perception; for instance, whether they hear one or two concurrent sounds,[21] or how clearly they hear a melody within a mixture of tones.[22]
  • measuring a listener's ability to recognize whether a familiar melody is present in a set of interleaved tones,[23] which may be difficult if the notes of the melody are perceptually segregated into different streams.
  • measuring a listener's ability to make temporal judgments; for instance, about the order of sounds or the duration of silences between sounds. Listeners tend to be better at making judgments about the temporal relationship between two sounds when they subjectively report that the two sounds group into the same stream, and tend to be worse when they report that the two sounds are segregated into different streams.[24][25][26][27][28][29]
  • measuring psychophysical thresholds, for instance, the amplitude at which a tone embedded in a noisy background is detectable.[30]
  • having a listener adjust an isolated "comparison" stimulus until it sounds like some component in a mixture.[31]
  • measuring a listener's classification of stimuli; for instance, whether a stimulus is heard as one vowel or another can depend on whether specific sound frequencies are grouped together.[4]

Perceptual phenomena

Researchers have identified several phenomena in which listeners tend to perceive specific types of source structure in ambiguous auditory scenes, or in which the manipulation of specific acoustic parameters leads to the perception of distinct sources. Bregman broadly characterized these phenomena as involving grouping over time ("sequential grouping") or grouping in frequency ("simultaneous grouping"). He also distinguished "primitive segregation" from "schema-based" scene analysis, which depends on the recognition of learned patterns such as a particular melody or language.

A further class of ASA phenomena concerns not the grouping of distinct elements but what is perceived when sounds overlap in time and frequency; these are referred to here as "filling-in" phenomena.

Sequential grouping

When sounds occur in succession, the auditory system must determine which sets of sounds were produced by the same source. Musicians can exploit the principles by which the auditory system achieves this in order to create the perception of a single melody that is actually played by different musicians, or of several melodies actually played by a single musician. For example, in the interlocking xylophone music discussed by Bregman, notes alternating between two players can fuse into a single perceived melody, while in some of Bach's solo works rapid alternation between registers on a single instrument creates the impression of several concurrent melodic lines.

Sequential grouping has most often been studied using sequences of tones. One commonly used sequence is the "ABA sequence", first introduced by van Noorden in 1975. The ABA sequence consists of two types of tones (A and B), which may differ in a number of acoustic attributes such as frequency or amplitude. The three-tone set is repeated with an intervening silence (ABA_ABA_ABA_). To test the effect of the varied acoustic parameters on perceptual grouping, listeners are typically asked whether they hear the sequence as an integrated "galloping" rhythm involving both tones (ABA_ABA_ABA_), or whether the sequence "splits" into two segregated isochronous rhythms, one fast (A_A_A_A_) and one slow (B___B___B___).[32] Other measures of perceptual grouping, such as the ability to make temporal judgments between the A and B tones, may also be used. For some parameter settings the sequence is bistable, meaning that perception alternates spontaneously between the integrated and segregated organizations over the course of listening.
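
A minimal sketch of how such a stimulus might be synthesized is shown below (the sample rate, tone durations, and frequencies are hypothetical values chosen for illustration, not parameters from the cited studies):

```python
# Sketch: synthesize a repeating ABA_ triplet sequence of pure tones.
# All parameter values below are illustrative assumptions.
import numpy as np

sr = 44100                 # sample rate (Hz)
tone_dur = 0.05            # duration of each tone and of the silent gap (s)
f_a, f_b = 500.0, 700.0    # frequencies of the A and B tones (Hz)
n_repeats = 20             # number of ABA_ triplets

def pure_tone(freq, dur, sr):
    """Pure tone with short raised-cosine onset/offset ramps to avoid clicks."""
    t = np.arange(int(dur * sr)) / sr
    x = np.sin(2 * np.pi * freq * t)
    n_ramp = int(0.005 * sr)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    env = np.ones_like(x)
    env[:n_ramp] = ramp
    env[-n_ramp:] = ramp[::-1]
    return x * env

silence = np.zeros(int(tone_dur * sr))
triplet = np.concatenate([pure_tone(f_a, tone_dur, sr),
                          pure_tone(f_b, tone_dur, sr),
                          pure_tone(f_a, tone_dur, sr),
                          silence])                      # A B A _
sequence = np.tile(triplet, n_repeats)
# Increasing |f_b - f_a| or shortening tone_dur (i.e., speeding up the
# sequence) makes listeners more likely to hear two segregated streams.
```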

Parameters that affect the tendency of the ABA sequence to split perceptually include the frequency and timing of the tones. When the difference between the frequencies of the A and B tones is small, listeners tend to hear an integrated sequence. When the time between the onsets of successive tones is shorter (i.e., the overall sequence is faster), the sequence tends to split. The musical examples cited above take advantage of these perceptual effects. By using more complex tone sequences, Tougas and Bregman showed that tones tend to group so as to create streams spanning relatively narrow frequency ranges, even when alternative perceptual organizations would produce equally sized frequency differences between adjacent tones. Furthermore, even when the absolute frequency difference between two tones is held constant, they can group or segregate depending on their surrounding context. Other parameters that affect sequential grouping include spectral similarity, onset similarity, and overall sound amplitude. For instance, it can be easier to hear a quiet voice mixed with a loud voice than to hear two loud voices mixed together.

Repetition also affects sequential grouping. The more times that the ABA segment is repeated, the more likely listeners are to report that the sequence splits into two streams. One actively pursued hypothesis, associated with the work of Bendixen and colleagues, is that the relative predictability of tone sequences influences how they are perceptually organized.

Repetition

Repeating a tone can "capture" it into a stream: a sequence of repeated tones of similar frequency can draw a target tone out of the grouping it would otherwise form with other sounds, an effect that may also operate in simultaneous grouping. The build-up of streaming described above is also cumulative, developing over several seconds of listening and resetting after a change or pause in the sequence. In addition, repetition embedded in changing mixtures can itself support segregation: a sound that occurs repeatedly, mixed each time with different sounds, can come to be heard as a distinct source even if it is never presented in isolation.

Spatial hearing

Differences in the apparent spatial location of sounds, conveyed by cues such as interaural time and level differences, can also promote the segregation of sound sequences into separate streams.

Simultaneous grouping

Onset and offset

Frequency components that begin at the same time tend to be grouped together, whereas a component whose onset is asynchronous with the rest of a complex tends to be heard as a separate sound. Sudden amplitude changes are more effective at signaling a new sound than gradual ones, which tend to be interpreted as a change in an ongoing source. Offset asynchronies appear to matter much less than onset asynchronies. Bregman summarized part of this behavior as the "old-plus-new heuristic": when the spectrum suddenly changes, the auditory system interprets the result as the continuation of the old sound plus a newly added sound.

Harmonic mistuning

Harmonic mistuning is one of the most commonly studied simultaneous grouping phenomena. The frequency components of many natural sounds, such as voiced speech and musical instrument notes, are harmonics, i.e., integer multiples of a common fundamental frequency. If one component of a harmonic complex tone is mistuned from its harmonic frequency by a sufficient amount, it tends to be heard as a separate tone standing apart from the remainder of the complex.

Common frequency modulation

Frequency components that are modulated in frequency together tend to be grouped into a single perceived sound. However, there does not appear to be an additional increase in segregation when components are frequency-modulated out of phase with each other.

Spatial hearing

The spatial cues that convey a sound's location, such as interaural time and level differences, can also influence whether simultaneous frequency components are grouped together, although their effect on simultaneous grouping is often weaker than that of cues such as harmonicity and common onset.

Perceptual "filling-in" edit

One example is the continuity effect (or continuity illusion): a sound that is interrupted by a brief, louder noise burst may be heard as continuing through the noise, provided the noise could have masked the sound had it actually continued. A related phenomenon is spectral completion, in which listeners appear to perceptually fill in spectral regions of a sound that are masked by a concurrent sound.[31] Similar effects occur with speech: a phoneme that has been removed and replaced by noise can be perceptually restored, a phenomenon known as phonemic restoration.

Schema-based

Schema-based segregation draws on learned knowledge of particular sound patterns. For example, familiarity with a melody can make it easier to hear out when its notes are interleaved with other tones, as studied in work on melody perception by Diana Deutsch, W. J. Dowling, and others. Knowledge of language can also play a role: Billig and colleagues found that lexical knowledge influences how sequences of speech sounds are perceptually organized into streams.[33] Listeners can also acquire new schemas, with recently learned recurring patterns aiding the segregation of sound mixtures.

Auditory scene analysis across species

Different species may possess different ASA mechanisms specific to their ecology. For instance, starlings are able to detect the presence of a familiar birdsong when it is mixed with several previously unheard songs, whereas humans are incapable of this task even after training.

ASA-like abilities have also been studied in a range of other species, including bats, marine mammals, owls, and crickets. One notable difference between humans and some other animals is the use of echoes: echolocating species such as bats and dolphins analyze their environments using the echoes of their own vocalizations, a strategy that humans generally do not employ, although some blind people have learned to echolocate.

References

  1. ^ a b Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press.
  2. ^ Szabó, B. T., Denham, S. L., & Winkler, I. (2016). Computational Models of Auditory Scene Analysis: A Review. Frontiers in Neuroscience, 10, 524. http://doi.org/10.3389/fnins.2016.00524
  3. ^ Traer, James; McDermott, Josh H. (2016-11-29). "Statistics of natural reverberation enable perceptual separation of sound and space". Proceedings of the National Academy of Sciences. 113 (48): E7856–E7865. doi:10.1073/pnas.1612524113. ISSN 0027-8424. PMID 27834730.
  4. ^ a b c Cooke, M., & Ellis, D. P. (2001). The auditory organization of speech and other sources in listeners and computational models. Speech communication, 35(3-4), 141-177.
  5. ^ McDermott, Josh H. (2009). "The cocktail party problem". Current Biology. 19 (22): R1024–R1027. doi:10.1016/j.cub.2009.09.005. ISSN 0960-9822.
  6. ^ Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press. p. 24.
  7. ^ Ellis, Daniel P. W. (2006). "Model-Based Scene Analysis". In Wang, DeLiang; Brown, Guy J. (eds.). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press. p. 115.
  8. ^ Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press. pp. 38–39.
  9. ^ von Helmholtz, Hermann (1895). On the sensations of tone as a physiological basis for the theory of music. Longmans, Green, and Co.
  10. ^ Cherry, E. Colin (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears". The Journal of the Acoustical Society of America. 25 (5): 975–79. doi:10.1121/1.1907229. ISSN 0001-4966
  11. ^ Broadbent, D. E., & Ladefoged, P. (1957). On the fusion of sounds reaching different sense organs. The Journal of the Acoustical Society of America, 29(6), 708-710.
  12. ^ Parsons, T. W. (1976). Separation of speech from interfering speech by means of harmonic selection. The Journal of the Acoustical Society of America, 60(4), 911-918.
  13. ^ Weintraub, M. (1985). A theory and computational model of auditory monaural sound separation (Doctoral dissertation, Stanford University). https://www.ee.columbia.edu/~dpwe/papers/Weintraub85-phd.pdf
  14. ^ Brown, G. J. (1992). Computational auditory scene analysis: a representational approach (Doctoral dissertation, University of Sheffield). http://etheses.whiterose.ac.uk/2982/1/DX202847.pdf
  15. ^ Ellis, D. P. W., & Rosenthal, D. F. (1995). Mid-level representations for computational auditory scene analysis. Perceptual Computing Section, Media Laboratory, Massachusetts Institute of Technology. https://pdfs.semanticscholar.org/6592/7488156fe0e84ff9635c24256bd6b9180181.pdf
  16. ^ Szabó, Beáta T.; Denham, Susan L.; Winkler, István (2016). "Computational Models of Auditory Scene Analysis: A Review". Frontiers in Neuroscience. 10: 524. doi:10.3389/fnins.2016.00524. ISSN 1662-4548. PMC 5108797. PMID 27895552.
  17. ^ Deike, Susann; Denham, Susan L.; Sussman, Elyse (12 September 2014). "Probing auditory scene analysis". Frontiers in Neuroscience. doi:10.3389/fnins.2014.00293.
  18. ^ Smaragdis, P., Fevotte, C., Mysore, G. J., Mohammadiha, N., & Hoffman, M. (2014). Static and dynamic source separation using nonnegative factorizations: A unified view. IEEE Signal Processing Magazine, 31(3), 66-75. http://web.stanford.edu/class/stats253/IEEE_SPM.pdf
  19. ^ Isik, Y., Roux, J. L., Chen, Z., Watanabe, S., & Hershey, J. R. (2016). Single-channel multi-speaker separation using deep clustering. arXiv preprint arXiv:1607.02173. http://www.merl.com/publications/docs/TR2016-073.pdf
  20. ^ Wang, D., & Chen, J. (2018). Supervised speech separation based on deep learning: an overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://arxiv.org/ftp/arxiv/papers/1708/1708.07524.pdf
  21. ^ Popham, S., Boebinger, D., Ellis, D. P., Kawahara, H., & McDermott, J. H. (2018). Inharmonic speech reveals the role of harmonicity in the cocktail party problem. Nature communications, 9(1), 2122.
  22. ^ Tougas, Y., & Bregman, A. S. (1985). Crossing of auditory streams. Journal of Experimental Psychology: Human Perception and Performance, 11(6), 788.
  23. ^ Dowling, W. J. (1973). The Perception of Interleaved Melodies. Cognitive Psychology, 5, 322-337. https://www.utdallas.edu/research/mpac/publications/pdf/1973-2.pdf
  24. ^ Warren, R. M.; Obusek, C. J.; Farmer, R. M.; Warren, R. P. (1969-05-02). "Auditory sequence: confusion of patterns other than speech or music". Science (New York, N.Y.). 164 (3879): 586–587. ISSN 0036-8075. PMID 4888106.
  25. ^ Bregman, A. S., & Campbell, J. (1971). Primary auditory stream segregation and perception of order in rapid sequences of tones. Journal of Experimental Psychology, 89, 244-249.
  26. ^ Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press. p.157-163.
  27. ^ Vliegen J., Moore B.C.J., Oxenham A.J. The role of spectral and periodicity cues in auditory stream segregation, measured using a temporal discrimination task. J. Acoust. Soc. Am. 1999;106:938–945.
  28. ^ Roberts B., Glasberg B.R., Moore B.C.J. Effects of the build-up and resetting of auditory stream segregation on temporal discrimination. J. Exp. Psychol. Hum. Percept. Perform. 2008;34:992–1006.
  29. ^ Thompson S.K., Carlyon R.P., Cusack R. An objective measurement of the build-up of auditory streaming and of its modulation by attention. J. Exp. Psychol. Hum. Percept. Perform. 2011;37:1253–1262.
  30. ^ Hall, J. W., Haggard, M. P., & Fernandes, M. A. (1984). Detection in noise by spectro‐temporal pattern analysis. The Journal of the Acoustical Society of America 76, 50. https://doi.org/10.1121/1.391005
  31. ^ McDermott, J. H., & Oxenham, A. J. (2008). Spectral completion of partially masked sounds. Proceedings of the National Academy of Sciences, 105(15), 5939-5944.
  32. ^ Van Noorden, L. S. (1975). Temporal coherence in the perception of tone sequences. PhD thesis, Eindhoven University of Technology. pg. 8
  33. ^ Billig, Alexander J.; Davis, Matthew H.; Deeks, John M.; Monstrey, Jolijn; Carlyon, Robert P. (2013-08-19). "Lexical Influences on Auditory Streaming". Current Biology. 23 (16): 1585–1589. doi:10.1016/j.cub.2013.06.042. ISSN 0960-9822. PMC 3748342. PMID 23891107.