Neural Network Ensembles in London and Representing Collaborative Interaction by Charles Martin

I recently had the chance to present a paper about my "Neural iPad Ensemble" at the Audio Mostly conference in London. The paper discusses how machine learning can help to model and create free-improvised music on new interfaces, where the rules of music theory may not fit. I described the Recurrent Neural Network (RNN) design that I used to produce an AI iPad ensemble that responds to a "lead" human performer. In the demonstration session, I set up the iPads and RNN and had lots of fun jamming with the conference attendees.

Many of those at the conference were very curious about making music with AI systems and the practical implications of using deep learning in concert. Some had appealing, enigmatic, and sometimes confused assumptions about musical AI. For example, I was often asked whether I could recognise the personalities of the performers in the output. This isn't possible in my RNN because, as in all machine learning systems, the output reflects only the training data, not the context that we humans see around it.

A limitation of my system is that it learns musical interactions as sequences of high-level gestures. These measurements are made once every second on the raw touchscreen data and describe it as, for example, "fast taps" or "small swirls". In the performance system, a synthesiser replays chunks of performances that correspond to the desired gestures. This simplification makes it easier to design the neural network, but means that the RNN isn't trained on the low-level data that we might consider to contain the nuanced "personality" of a performance.
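
To make that simplification concrete, here's a minimal sketch of the kind of once-per-second summary I'm describing. The features, thresholds, and most of the label names are invented for illustration; the real classifier learns its mapping from labelled performance data rather than hand-written rules.

```python
import math

def summarise_second(events):
    """Collapse one second of raw touch events into a coarse gesture label.

    `events` is a list of (x, y) screen positions in the order they arrived.
    The features and thresholds here are purely illustrative; the real system
    learns this mapping from labelled performance data.
    """
    if not events:
        return "nothing"
    n = len(events)
    # total distance moved between consecutive touch points
    movement = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(events, events[1:])
    )
    if movement < 10:       # hardly any movement: discrete taps
        return "fast taps" if n > 10 else "slow taps"
    elif movement < 300:    # small continuous movement
        return "small swirls"
    else:                   # large continuous movement
        return "big swipes"

# e.g. twenty touches at the same point within one second look like "fast taps"
print(summarise_second([(100, 200)] * 20))
```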

Another point that is easy to get confused about is that the human performer isn't really the "leader". In the interest of making the most of a limited data set, every performer in every training example is permuted into the role of "leader" as well as each of the three "ensemble" performers. In fact, few (if any) performances in the corpus had a leader at all. In my system, the human performer is really just one of four equal performers, so it's unreasonable to expect that you could "control" the RNN performers by performing in a certain way, or by doing something unexpected.
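
As a rough sketch of what I mean by permuting the data (the real preprocessing code may organise this differently), a data-augmentation step over one quartet performance could look like this:

```python
from itertools import permutations

def augment_quartet(performance):
    """Expand one quartet performance into many (lead, ensemble) training examples.

    `performance` is a list of per-second gesture states, each a tuple of four
    gesture labels (one per player). Every player takes a turn as the 'lead',
    and the remaining three appear in every ordering, so no single human is
    privileged as the leader in the training data.
    """
    examples = []
    for lead_idx in range(4):
        others = [i for i in range(4) if i != lead_idx]
        for order in permutations(others):
            lead_seq = [state[lead_idx] for state in performance]
            ens_seq = [tuple(state[i] for i in order) for state in performance]
            examples.append((lead_seq, ens_seq))
    return examples

# A toy two-second performance by four players (gesture labels 0-8):
toy = [(1, 3, 0, 5), (2, 3, 0, 5)]
print(len(augment_quartet(toy)))  # 4 leads x 3! orderings = 24 examples
```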

A practical issue when performing with the neural iPad ensemble is starting and stopping the music! In a human ensemble, the performers use cues like counting in, an audible breath, or a look, to bring in the ensemble and to signal the end of an improvisation. With my RNN ensemble, cues have no effect; sometimes the group starts playing, sometimes it doesn't. Just when you think the performance might be over, the group starts up again without you! Thinking about the training data, this behaviour makes sense. Each training example is 120 seconds long -- too short to capture the long-term curve of a performance, including starting and ending. The examples do contain plenty of instances where one performer plays alone, or three play while one lays out, so there's precedent in the data for completely ignoring the "leader" starting and stopping!

As with many AI systems, these limitations reveal that humans are so good at combining different data sources and contexts that we forget we're doing it. A truly "natural" iPad ensemble might have to be trained on much more than just high-level gestures in order to keep up with "obvious" musical cues and reproduce musical personalities.

While this system has limitations, it is still fun to play with and useful as a reflection on "creative" AI. Of course, there are many ways to improve it; one of the most important would be to train the RNN on the lowest-level data available, in this case the raw touch event data from the iPad screens. A promising way to approach this is with Mixture Density RNNs (more on that later). I'm looking forward to more chances to perform with and talk about musical AI soon!

MicroJam at Boost by Charles Martin

We presented MicroJam this week at the Boost Technology and Equality in Music Conference at Sentralen, Oslo. The conference arranged a Tech Showcase session in Hvelvet, Sentralen's old bank vault, with developers of music apps, synthesisers, robots, and education software. I was joined by two master's students from the UiO Department of Informatics, Benedikte and Henrik, who helped to demonstrate MicroJam to the many participants - thanks guys!

It was great to meet so many wonderful teachers, musicians, and developers from around the region and the world. I'm looking forward to getting the app out to them as soon as we can!

Music Tech at IFI by Charles Martin

We recently hosted a music technology event at the Department of Informatics to gather together researchers and students from the University of Oslo to see performances and demonstrations of current research.

Luckily, Maria Finkelmeier happened to be swinging through the area from Boston, so we were able to present some new percussion and touch-screen works together, and hear Maria's new live version of #improvAday. Christina Hopgood and Maria joined me for three iPad ensemble pieces, including a new experiment performing live ensemble music with MicroJam --- exactly the opposite of how that app was designed to be used! It was also great to have Kristian Nymoen demonstrate the Xsens full-body motion tracking system, and to have other demos from the Departments of Musicology and Informatics.

The event featured the following projects:

  • Ensemble Metatone: new music for touch-screen and percussion
  • #improvADay with Maria Finkelmeier (USA)
  • MuMyo: muscle sensing music from IMV
  • PhaseRings for iPad Ensemble and Ensemble Director Agent
  • Xsens Motion Music: making music with full-body motion tracking
  • Prototype music interfaces from the DESIGN group.

It was great to have so many engaged researchers visit IFI and to perform in Escape, the student pub in IFI's basement! Thanks to the cybernetics student society for their help - I hope to perform down in Escape again soon!

Performing with a Neural Touch-Screen Ensemble by Charles Martin

Since about 2011, I've been performing music with various kinds of touch-screen devices in percussion ensembles, new music groups, improvisation workshops, and installations, as well as with my dedicated iPad group, Ensemble Metatone. Most of these events were recorded, and detailed touch and gestural information was collected, including classifications of each ensemble member's gesture every second during each performance. Since moving to Oslo, however, I don't have an iPad band! This leads to the question: Given all this performance data, can I make an artificial touch-screen ensemble using deep neural networks?

I've collected a lot of data from four years of touch-screen ensemble concerts (left). Now, I've used it to train an artificial neural network (right) to interact in a similar way!

As it turns out, the answer is yes! To make this neural touch-screen ensemble, I've used a large collection of collaborative touch-screen performance data to model ensemble interactions and to synthesise ensemble performances. These performances were free improvisations: gestural explorations of synthesised sounds, samples, and field recordings by tightly interacting performers. In this context, the music theory of melody and harmony doesn't help much to understand what is going on; a data-driven strategy for musical representation is required. Machine learning (ML) is an ideal approach, as ML algorithms can learn from example, rather than from theory.

In this article, I'll explain the parts of this system but first, here's a demonstration of what it looks like:

Live Interaction with a Neural Network

The rough idea of the neural touch-screen ensemble is this: one human improvises music on a touch-screen app and an ensemble of computer-generated 'musicians' reacts with their own improvisation. This system works as follows:

  1. First, a human performer plays PhaseRings on an iPad. Their touch-data is classified into one of a small number of touch gestures using a system called Metatone Classifier.
  2. Next, a recurrent neural network, Gesture-RNN, takes this lead gesture and predicts how an ensemble might respond in terms of their own gestures; this is described in more detail below.
  3. The touch-synthesiser then searches the corpus of performance data for sequences of touches that match these gestures and sends them to the other iPads which also run PhaseRings.
  4. Finally, the ensemble iPads 'perform' the sound (and visuals) from these sequences of touches, as if a human were tapping on their screens.
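
To make the data flow concrete, here's a rough sketch of that once-per-second loop. The function bodies are stand-ins for the real components (Metatone Classifier, Gesture-RNN and the touch-synthesiser), not their actual APIs:

```python
import random
import time

GESTURES = list(range(9))  # the nine touch-gesture labels used in the corpus

def classify_gesture(recent_touches):
    # Stand-in for Metatone Classifier: raw touches -> one gesture label.
    return random.choice(GESTURES)

def predict_ensemble_response(lead_gesture, previous_ensemble):
    # Stand-in for Gesture-RNN: lead gesture + previous ensemble state -> three gestures.
    return tuple(random.choice(GESTURES) for _ in range(3))

def lookup_touch_sequence(gesture):
    # Stand-in for the touch-synthesiser's corpus search: gesture -> recorded touch chunk.
    return [("touch", gesture)]

def send_to_ipad(ipad_id, touches):
    # Stand-in for replaying the touches on one ensemble iPad running PhaseRings.
    print(f"iPad {ipad_id} replays {len(touches)} touches")

ensemble = (0, 0, 0)                            # an arbitrary starting state
for second in range(10):                        # ten seconds of 'performance'
    lead = classify_gesture(recent_touches=[])  # 1. classify the human's last second
    ensemble = predict_ensemble_response(lead, ensemble)  # 2. RNN predicts the response
    for ipad_id, gesture in enumerate(ensemble):
        send_to_ipad(ipad_id, lookup_touch_sequence(gesture))  # 3-4. replay the touches
    time.sleep(1.0)                             # the whole loop runs once per second
```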

One cool thing about this system is that the 'fake' ensemble members sound quite authentic, as their touches are taken directly from human-recorded touch data. The totality of these components is a system for co-creative interaction between neural network and human performer. The neural net responds to the human gestures, and in turn, the live performer responds to the sound of the generated ensemble iPads. This system is currently used for in-lab demonstrations and we're hoping to show it off at a few events soon!

Learning Gestural Interaction

The most complex part of this system is the Gesture-RNN at the centre. This artificial neural network is trained on hundreds of thousands of excerpts from ensemble performances to predict appropriate gestural responses for the ensemble.

In improvising touch-screen ensembles, the musicians often work as gestural explorers. Patterns of interaction with the instruments and between the musicians are the most important aspect of the performances. Touch-screen improvisations have been previously categorised in terms of nine simple touch-gestures, and a large corpus of collaborative touch-screen performances is freely available. Classified performances consist of sequences of gesture labels (numbers between 0 and 8) for each player in the group - similar to the sequences of characters that are often used as training data in text-generating neural nets.

Like other creative neural nets, such as folkRNN and charRNN, Gesture-RNN is a recurrent neural network (RNN) with long short-term memory (LSTM) cells. These LSTM cells preserve information inside the network, acting as a kind of memory, and help the network to predict structure in sequences of multiple time-steps. The difference between character-level RNNs and this system is that Gesture-RNN is trained to predict how an ensemble would react to a single performer, not what that lead performer might do next.
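
For readers who want to picture this, a minimal network of that shape can be written in a few lines. This sketch uses the present-day tf.keras API rather than the original code, treats each combined (lead + previous ensemble) gesture state as a single input symbol, and uses illustrative layer sizes -- not necessarily those of the real Gesture-RNN:

```python
import tensorflow as tf

NUM_GESTURES = 9               # gesture labels 0-8, as in the corpus
VOCAB_IN = NUM_GESTURES ** 4   # lead gesture plus three previous ensemble gestures
VOCAB_OUT = NUM_GESTURES ** 3  # the three ensemble gestures to predict

model = tf.keras.Sequential([
    # each combined gesture state is embedded like a 'character' in a char-RNN
    tf.keras.layers.Embedding(VOCAB_IN, 64),
    # LSTM cells carry memory across the one-second time steps
    tf.keras.layers.LSTM(256, return_sequences=True),
    # a distribution over the ensemble's next combined gesture state
    tf.keras.layers.Dense(VOCAB_OUT, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

In this toy version, each training input sequence of combined states would be paired with the sequence of the ensemble's next states as targets.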

Training data for Gesture-RNN consists of time series of gestural classifications for each member of the group at one-second intervals. The network is designed to predict the ensemble response to a single 'lead' sequence of gestures. So in the case of a quartet, one player is taken to be the leader, and the network is trained to predict the reaction of the other three players.

In this case, the input for the network is the lead player's current gesture, along with the previous gestures of the other ensemble members. The output of the network is the ensemble members' predicted reaction. This output is then fed back into the network at the next time-step.
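
Here's a hedged sketch of that feedback loop at generation time, assuming the toy encoding from the model sketch above (gestures as integers 0-8, flattened into single input and output symbols); the real Gesture-RNN code may organise this differently:

```python
import numpy as np

NUM_GESTURES = 9

def encode_state(lead, ensemble):
    # Flatten one lead gesture and three ensemble gestures into a single input symbol.
    code = lead
    for g in ensemble:
        code = code * NUM_GESTURES + g
    return code

def decode_ensemble(code):
    # Recover three ensemble gestures from a single output symbol.
    gestures = []
    for _ in range(3):
        gestures.append(code % NUM_GESTURES)
        code //= NUM_GESTURES
    return tuple(reversed(gestures))

def generate_response(model, lead_sequence):
    """Predict the ensemble's gestures for each step of a lead performance."""
    ensemble = (0, 0, 0)          # an arbitrary starting state
    inputs, responses = [], []
    for lead in lead_sequence:
        # the previous predicted ensemble state is fed back in with the new lead gesture
        inputs.append(encode_state(lead, ensemble))
        x = np.array([inputs])                      # shape: (1, steps so far)
        probs = model.predict(x, verbose=0)[0, -1]  # distribution at the latest step
        ensemble = decode_ensemble(int(np.argmax(probs)))
        responses.append(ensemble)
    return responses
```

Re-running the growing input sequence at each step keeps the network's memory of the whole performance without needing a stateful model, and taking the argmax is just the simplest choice here; sampling from the predicted distribution would give more varied responses.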

Here's an example output from Gesture-RNN. In these plots, a real lead performance (in red) was used as the input and the ensemble performers (other colours) were generated by the neural net. Each level on the y-axis in these plots represents a different musical gesture performed on the touch-screens.

Gesture-RNN is implemented in Tensorflow and Python. It's tricky to learn how to structure Tensorflow code, and the following blog posts and resources were helpful: WildML: RNNs in Tensorflow, a practical guide; R2RT: Recurrent Neural Networks in Tensorflow; AI Codes: Tensorflow Best Practices; Géron: Hands-On Machine Learning with Scikit-Learn and Tensorflow.

Recurrent Neural Networks and Creativity

Gesture-RNN uses a similar neural network architecture to other creative machine learning systems, such as folkRNN, Magenta's musical RNNs, and charRNN. It has recently become apparent that recurrent neural networks, which can be equipped with "memory" cells to learn long sequences of temporally-related information, can be unreasonably effective. Creative neural network systems are beginning to be a bit of a party trick, like the amusingly scary NN-generated Christmas carol. In the case of high-level ensemble interactions, we don't have tools (like music theory) to help us understand and compose them, so a data-driven approach using RNNs could be much more useful!

The neural touch-screen ensemble is a unique way for a human performer to interact with a creative neural network. We're using this system in the EPEC (Engineering Predictability with Embodied Cognition) project at the University of Oslo to evaluate how a predictive RNN can be engaged in co-creation with a human performer. In our current application, the synthesised touch-performances are played back through separate iPads which embody the "fake" ensemble members. In future, this system could also be integrated within a single touch-screen app, which might allow individual users to experience a kind of collaborative music-making. It might also be possible to condition Gesture-RNN to produce certain styles of responses that model particular users or performance situations.

The code for this system is available online: Gesture-RNN, Metatone Classifier, PhaseRings. While there are lots of creative applications of recurrent neural networks out there, there aren't too many examples of interactive and collaborative RNN systems. It would be great to see more creative and interactive systems using these and other neural net designs!