Brain Soup | Part III

Do You Feel That?

Recent generative Machine Learning (ML) tools could skyrocket our creativity with language and visual art, but have yet to ‘solve’ the challenge of creating halfway decent music. Here’s the current state of the art from a friend of mine and part-time jazz pianist currently working on his PhD in Signal Processing at Cambridge University.

This article contrasts 3 lenses for approaching music creation:

  1. How musicians see it

    Pitch, Rhythm, and Texture

  2. How machine learning engineers see it

    Spectrograms and tokenization

  3. How neuroscientists see it

    Consciousness and brain networks

In particular, I’d like to raise questions about the bearing a rigorous analytical approach to music has on our experience of it as creators and appreciators.

1. How Musicians See It

Just like band members responding to one another's performances, machine learning architectures run on the principle of prediction. For example, a Large Language Model (LLM) predicts the answer to “write me an essay on the relationship between music, consciousness, and machine learning” using the entirety of its ‘self-model.’ See Part I of this series for more detail on prediction as a strategy for cognition. 

Music is a highly versatile and immediately effective way to ‘design’ emotion by toying with this human tendency to predict. At the same time, it’s guided by a balance of rules and creative freedom, which has made it far harder to generate computationally than language.

Let’s use our intuition as musicians to understand why this is the case.

We can understand music as operating on two axes - the vertical axis for pitch and horizontal axis for time. This gives us a simple graph:

This organisation into space (pitch) and time (rhythm) is used to notate music in score and production (though this varies outside of Western traditions).

Pitch

While in an absolute sense, pitch results from the continuous stretching and squashing of frequencies - a long glissando between the lowest and highest perceptible pitches - we nonetheless split it up into steps.

We have devised music theory language to describe the structures which arise from combining these steps - each a fixed frequency ratio - across space and time: intervals, melodies, chords, progressions, keys, and so on.
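To put numbers on those steps, here’s a minimal Python sketch of 12-tone equal temperament, assuming the common A4 = 440 Hz reference: each semitone multiplies the frequency by 2^(1/12), so twelve steps complete an octave (a doubling).

```python
# A minimal sketch of equal temperament: the continuous glissando of pitch is
# carved into twelve steps per octave, each a fixed ratio of 2**(1/12).
A4 = 440.0  # reference pitch in Hz (a common convention, not the only one)

def pitch_hz(semitones_from_a4: int) -> float:
    """Frequency of the note a given number of semitones above (or below) A4."""
    return A4 * 2 ** (semitones_from_a4 / 12)

# A C major triad above A4: C5 (+3 semitones), E5 (+7), G5 (+10)
for name, offset in [("C5", 3), ("E5", 7), ("G5", 10)]:
    print(f"{name}: {pitch_hz(offset):.1f} Hz")
```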

Rhythm

Outside the context of music, organising time into discrete units led to a neat calendar and a financial system which stays in time to the millisecond. In the world of music we use more intuitive time windows that allow us to experience groove. 

Again, we have language to help us make sense of and create music - describing note lengths, time signatures, phrases, and sections. 
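As a quick illustration of how that language maps onto clock time, here’s a small sketch, assuming a simple feel in which the quarter note carries the beat; the tempo of 120 BPM is hypothetical, chosen just for the example.

```python
# A small sketch of note lengths as clock time: at a given tempo, a quarter
# note lasts 60 / BPM seconds, and other note values scale from there.
def note_seconds(bpm: float, beats: float) -> float:
    """Duration in seconds of a note lasting `beats` quarter-note beats."""
    return 60.0 / bpm * beats

tempo = 120  # hypothetical tempo in beats per minute
for name, beats in [("whole", 4), ("half", 2), ("quarter", 1), ("eighth", 0.5)]:
    print(f"{name} note at {tempo} BPM: {note_seconds(tempo, beats):.3f} s")
```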

However, machine learning algorithms decoding and generating music don’t have any of that. Instead, they are given a long series of frequencies and amplitudes to make sense of. This is where the field of signal processing comes in.

Let’s take a look at a layer of structure which emerges from combining pitch and rhythm together.

Texture

Texture is what happens when we take spatial structures and distribute them using time structures. It’s also where we start needing to make sense of long-range patterns and structures that haven’t made their way into traditional music theory.

Just as only six elements - Carbon, Hydrogen, Nitrogen, Oxygen, Phosphorus, and Sulphur - organise into structures which layer into the untold variety of life on earth, combinations of this simple set of musical elements explode in variety to give us every discernible musical structure.

For example, rather than playing block chords on fixed beats of the bar, we can split them by arpeggiating, oom-cha-ing, adding extensions, articulation etc. etc. to go from a simple pop song to an expressive jazz improvisation such as those Timothy transcribes.

The next visualisation introduces shape and colour to display four metrics at once, aiming to capture the intrinsic complexity of various textural structures.

Here, space and time have been replaced by spatial and temporal spread, showing the range occupied by the structure. 

  • High spatial spread = B0 → A6; higher y-axis value

  • High temporal spread = structure spread out across time rather than delivered as a single chord voicing/change; higher x-axis value

Harmonic and rhythmic complexity are shown with colours and boxes, respectively. 

  • High harmonic complexity = many extensions/notes outside the chord tones; magenta

  • High rhythmic complexity = many offbeat notes; solid border

Temporal spread refers both to the internal texture of individual chords (arpeggio & walking bass are both high spread) and to the span required to establish that texture (funky bassline & repeated note/chord are also high spread).
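To make those four metrics concrete, here’s a rough Python sketch of one way they could be computed for a snippet of notes. The note format, chord-tone set, and ‘offbeat’ rule are illustrative assumptions of mine, not the definitions behind the diagram.

```python
# A rough, illustrative calculation of the four texture metrics for a short
# snippet of notes. The note format, chord-tone set, and "offbeat" rule are
# assumptions for this sketch, not the definitions used in the diagram.
from dataclasses import dataclass

@dataclass
class Note:
    midi: int      # MIDI pitch number (60 = middle C)
    onset: float   # onset time in beats from the start of the snippet

def texture_metrics(notes: list[Note], chord_tones: set[int]) -> dict:
    pitches = [n.midi for n in notes]
    onsets = [n.onset for n in notes]
    return {
        # spatial spread: how much of the pitch range the texture occupies
        "spatial_spread": max(pitches) - min(pitches),
        # temporal spread: how long the texture takes to unfold, in beats
        "temporal_spread": max(onsets) - min(onsets),
        # harmonic complexity: share of notes outside the given chord tones
        "harmonic_complexity": sum(n.midi % 12 not in chord_tones for n in notes) / len(notes),
        # rhythmic complexity: share of notes that miss the main beats
        "rhythmic_complexity": sum(n.onset % 1 != 0 for n in notes) / len(notes),
    }

# One bar of a C major arpeggio with an offbeat passing D
snippet = [Note(48, 0.0), Note(55, 1.0), Note(60, 2.0), Note(62, 2.5), Note(64, 3.0)]
print(texture_metrics(snippet, chord_tones={0, 4, 7}))  # C, E, G pitch classes
```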

By combining these in series into a structural organisation of choice, we can build a feel or genre.

  • Consider how classical music often maintains textural consistency - for example the multi-bar passages of arpeggio in Rachmaninoff’s Preludes, or extended oom-cha patterns in Chopin waltzes

  • Jazz improvisers tend to use less ongoing textural consistency, relying on the rhythm section to keep the beat, or on the intrinsic musicality of their listeners

Here, even small 1-2 bar snippets contain a huge amount of information - far more than a piece of text that would take you a similar amount of time to read.

This is the core problem for those who want to generate convincing music with AI. We need:

a) a huge amount more data, so we can expect our machine learning model to notice the musical elements, and

b) a model which can also handle the long-range dependencies like phrasing and structure over the course of minutes. 

Now we’ve got a structured view of how music can be broken down from the intuitive end of things. Let’s move on to how signal processing engineers start with a complex waveform and help their machine learning architecture figure all of this out from scratch.

2. How Machine Learning Engineers See It

Spectrograms are the language of signal processing, and help us visualise the structure of data which needs to be parsed by a deep learning architecture in order to produce music.

Take a look at (and listen to) this representation of a short improvisation, which reconstructs the notes being played using the overtones in the original signal.
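For the curious, here’s a minimal Python sketch of the general idea behind a spectrogram: a synthetic two-note ‘improvisation’ (A4 then E5, each with one overtone added) is run through a short-time Fourier transform so that energy can be plotted against both time and frequency. The tones and parameters are made up for illustration, not taken from the recording above.

```python
# A minimal spectrogram sketch: two synthetic tones (each with one overtone)
# run through a short-time Fourier transform (STFT), which slides a window
# along the signal and takes the Fourier transform of each slice.
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 22050  # sample rate in Hz

def tone(freq_hz: float, seconds: float) -> np.ndarray:
    """A sine wave at freq_hz plus a quieter first overtone (2 * freq_hz)."""
    t = np.arange(0, seconds, 1 / fs)
    return np.sin(2 * np.pi * freq_hz * t) + 0.3 * np.sin(2 * np.pi * 2 * freq_hz * t)

audio = np.concatenate([tone(440.0, 1.0), tone(659.26, 1.0)])  # A4 then E5

freqs, times, Zxx = signal.stft(audio, fs=fs, nperseg=2048)

plt.pcolormesh(times, freqs, np.abs(Zxx), shading="gouraud")
plt.ylim(0, 2000)
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Fundamentals and first overtones")
plt.show()
```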

Both brains and machine learning algorithms need to interpret their sensory inputs as a series of tiny units, then process them with a large and complex architecture to create words, language, music etc.

In fact, just like the brain, machine learning architectures use prediction - in this case to estimate the next token in the sequence. 
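To make ‘predict the next token’ concrete, here’s a toy sketch using made-up chord tokens: it simply counts which token tends to follow which, then predicts the most frequent successor. Real models replace these counts with billions of learned parameters, but the objective is the same flavour.

```python
# A toy next-token predictor: count which token follows which in a sequence,
# then predict the most frequent successor for any given token.
from collections import Counter, defaultdict

sequence = ["C", "F", "G", "C", "F", "G", "C", "Am", "F", "G", "C"]  # made-up chord tokens

follows = defaultdict(Counter)
for current, nxt in zip(sequence, sequence[1:]):
    follows[current][nxt] += 1

def predict_next(token: str) -> str:
    """Most frequent continuation seen in the training sequence."""
    return follows[token].most_common(1)[0][0]

print(predict_next("G"))  # -> "C": the counts have picked up a cadence of sorts
```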

While it’s unclear exactly what strategy the brain uses for each of its sensory modalities, machine learning has a family of strategies for this called ‘tokenization’, which chop inputs up into smaller units. With text, for example, this means splitting words into subword pieces - separating ‘playing’ into ‘play’ and ‘ing’, or a compound like ‘photograph’ into ‘photo’ and ‘graph’.

Tokenization in music is comparatively very difficult, since sound is far more ‘informationally dense’ than text: it carries information about tone and timbre as well as structures like pitch, melody, and rhythm.

This density means that you need a whole lot more data to make a halfway decent generative AI model for music.
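Before moving on, here’s a deliberately crude sketch of what tokenizing audio might look like: chop the waveform into short frames and snap each one onto its nearest entry in a small codebook, so the continuous signal becomes a sequence of integers a model can predict. Real music tokenizers (neural audio codecs, MIDI-style event vocabularies) are far more sophisticated - the random data and codebook here are purely illustrative.

```python
# A crude audio-tokenization sketch: split the waveform into 10 ms frames and
# assign each frame the index of its nearest codebook entry, turning a
# continuous signal into a sequence of integer tokens. Both the "audio" and
# the codebook are random stand-ins for the sake of illustration.
import numpy as np

rng = np.random.default_rng(0)
fs, frame_len = 16000, 160              # 16 kHz audio, 160 samples per frame
audio = rng.standard_normal(fs)         # stand-in for one second of audio

# Chop the signal into frames (one token per frame)
n_frames = len(audio) // frame_len
frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)

# A random codebook stands in for one learned from data
codebook = rng.standard_normal((256, frame_len))

# Token = index of the nearest codebook entry to each frame
distances = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
tokens = distances.argmin(axis=1)

print(tokens[:20])                      # the "sentence" a music model would learn to predict
print(f"{len(audio)} samples -> {len(tokens)} tokens")
```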

3. How Neuroscientists See It

The experience of having consciousness - the thing which is modulated by hearing music, eating tasty food, feeling strong emotions, and just about every engagement with the world around you - is known in the academic literature as ‘phenomenological experience’. The question of how a wrinkly mass of cells smaller than a loaf of bread can produce this remains a chin-scratcher for Philosophers of Mind. 

Experimental approaches to consciousness haven’t received a whole lot of attention in the literature until recently, but nonetheless there are a handful of theories - almost none of which are able to make meaningful falsifiable predictions.

However, we can get a grasp on this using the encompassing paradigm of hierarchical predictive coding from Part I. This approaches the brain as a system which begins with completely unconscious processing of sensory data. That data is fed upwards through a hierarchy of self-regulating networks until it reaches a set of areas - broadly overlapping with the cortex, the outer layer of the onion - any of which could correspond with conscious experience if the activity is of the right sort. The particular interactions between the neurons of the cortex currently ‘engaged in’ consciousness produce the totally unique experience you perceive as taking place (e.g. surprisingly spicy chords from your dentist’s waiting room playlist).

It’s the ambiguous line between conscious and subconscious which interests me personally - and, I hope, all of us as musicians.

A musician coordinates intricate action plans in a fraction of a second, using the motor system, the cerebellum, and countless other overlapping systems and networks, without even needing to know they’re there. By experimenting with our own minds and practice process, we can come to understand the extent of our own conscious experience, and start to notice the effect that, say, prolonged practice has on how we think as lower-level regions are told what to do.

This process of monitoring our own thoughts is referred to as metacognition, and it’s something we use incredibly frequently as musicians to do effective practice.

Over time we build up a relationship between the elements over which we have cognitive control from within our working memories, and the intricate processing that is relegated to the subconscious.

This also explains how we only have partial control over the music we create in the moment, and can have a sense of it ‘appearing’ out of the blue. 

Conclusion

Hopefully I’ve offered some reasons to be curious about both the brain and machine learning architectures, as well as an understanding of where the two diverge.

We can get useful insights as musicians from applying the analytical lens - perhaps creating new musical ideas to experiment with, noticing holes in our practice, or getting an understanding of where musical features we stumbled across ourselves fit into the whole. 

And perhaps when a new wave of generative music architectures arrives and threatens the jobs of musicians, it’ll be helpful to understand how they’re limited so we can keep making music that’s new.

I’ve very much enjoyed writing this series and am very happy that Timothy has used the texture diagram in his new course on Arranging and Reharmonization!

As of publishing this article, I’m also between academic institutes. If you have questions, thoughts, or ideas you’d like to share please feel free to get in touch via email: jethro.reeve@gmail.com

Jethro Reeve

Jethro is a neuroscience graduate from the UK, and wrote his dissertation on Cognitive Neuroscience and Music Teaching. He is a jazz pianist and teacher with a performance diploma from Trinity College London. He now studies Interdisciplinary practice in Culture & Complexity while continuing to teach piano during weekends, and looks to reapply theoretical principles of the brain to the world at large.
https://www.linkedin.com/in/jethroreeve
