Matrix Revisited: reinforcement learning

Originally written on February 23, 2009.

Revised for this blog...Content now written in the past tense to account for the closing of the Matrix Online.

Distri is simulating a response to a primary stimulus. However, her behavior goes unrewarded on a ludic level although a lot of positive narrative feedback can result from repeatedly engaging in this sort of behavior.

I – Core Behavioral Mechanics

In terms of the core positive and negative reinforcers, primary (survival) reinforcers do not really exist in the Matrix Online, since a character does not need to eat, sleep or go to the bathroom (i.e. waste resource management) to survive in the game, even though the player him/herself is rewarded in the form of a “bio-break” outside of the game. Admittedly, one can argue that a primal “avoidance of pain” is psychologically prevalent in the game, since the headaches caused by the tedium of watching the NPC combat (which the player does not directly or manually engage in) cause a reinforced desire to explore the rest of the story-world instead. In other words, any pleasure derived from engaging in combat-oriented choreography is usually systematically extinguished.

II – Fixed and Variable Reward Schedules

“In a ratio schedule, reinforcement is one where the behavior must be performed X times before it is reinforced. X can be a fixed or variable number.” (Butcher, year unknown)

Here is a screenshot of Distri looting a bit of information (money) and a nice ring from an opponent NPC.

In Figure 2, we see a good example of a ratio schedule where Distri must kill several NPCs (and transform their virtual bodies into corpses) and loot each one over and over to see if any quantifiable rewards such as money (“information”) or other treasures can be extracted.

It is presently unclear whether or not this looting ritual precisely corresponds to a Fixed Ratio or Variable Ratio since a measurable pattern of a fixed number of kills/loots translating into a predictable display of information and treasure(s) has not yet been revealed with any consistency. Although the player needs a fixed amount of rewards to “cash-in” and be rewarded for their behavior, the way in which the loot is inserted into each NPC prior to being killed appears to be variably assigned.
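To make the Fixed Ratio versus Variable Ratio distinction concrete, here is a minimal sketch in Python. It is not the game's actual loot logic, which was never revealed; the function names, the ratio of 3 and the random seed are all assumptions chosen purely for illustration.

```python
import random

def fixed_ratio(n):
    """Reinforce every n-th response (FR-n)."""
    count = 0
    def respond():
        nonlocal count
        count += 1
        if count == n:
            count = 0
            return True
        return False
    return respond

def variable_ratio(mean_n, rng):
    """Reinforce after a random number of responses that averages mean_n (VR)."""
    def draw():
        # Uniform over 1 .. 2*mean_n - 1, so the mean requirement is mean_n.
        return rng.randint(1, 2 * mean_n - 1)
    remaining = draw()
    def respond():
        nonlocal remaining
        remaining -= 1
        if remaining == 0:
            remaining = draw()
            return True
        return False
    return respond

def loot_corpses(schedule, kills):
    """Return the 1-based kill numbers on which loot (hypothetically) dropped."""
    return [k for k in range(1, kills + 1) if schedule()]

# Illustrative values only: 12 kills, a ratio of 3, an arbitrary seed.
fr_drops = loot_corpses(fixed_ratio(3), 12)
vr_drops = loot_corpses(variable_ratio(3, random.Random(0)), 12)
print("FR-3 drops on kills:", fr_drops)
print("VR-3 drops on kills:", vr_drops)
```

Under the fixed schedule the drops land on a predictable beat, whereas under the variable schedule the gaps between drops change from one reward to the next. A player who sees no regular beat after many kills, as Distri did, has some reason to suspect the variable case.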

III – Interval Schedules...

The Matrix Online has successfully remediated the operant conditioning ritual of using an elevator via expectations derived from the “real world”.

This century-old ritual of using elevators – in “real” or “virtual” life – is a perfect example of a “duration schedule” where the behavior of button pressing must be performed throughout the waiting interval. Similar to triggering a pedestrian traffic light for a cross-walk, the player sometimes presses a floor-button more than once to see whether or not the virtual elevator actually works. In terms of audio feedback, the button sounds do not accurately give a sense that the button is pressed, but there is an elevator drone that occurs shortly after (bandwidth latency determines the drone length) which indicates a transition between the floors. Because of this latency, the length of the drone did not directly correlate with the distance travelled between floors, since the sound did not always sync up. Also, there was no audio or visual feedback (neither light nor sound) provided for a floor that was inaccessible or locked. As a result, Distri spent many minutes wondering whether there was a glitch in the system (the world crashed) or whether a desired floor was locked.

Within Metro World, some of the elevators do not work and, in addition, not all the floors of each building can be accessed. Once again, it should be emphasized that since there is no feedback provided via a sound indicating a successful arrival onto a given floor, the ritual of using a virtual elevator in the Matrix is not properly reinforced on a Pavlovian level since there is no “bell” to indicate access-success. On the contrary, the elevator behavior in the Matrix is far closer to Operant Conditioning methodology since one engages in a mindless and anguishing repetition of the stimulus (button-pressing of multiple floors) in order to hopefully get access to a given floor.

In this sense, the elevator in the Matrix is one of the closest things to a “Skinner Box” (Skinner, 1948) except that the player him/herself functions as an impatient homing pigeon. For the continued usage of the elevator to persist (instead of merely jumping from building to building), the button-pressing behavior must be performed throughout the lifting/descending interval. In other words, as the behavior is learned (and remediated from everyday elevator experience), the frequency of the reward (experiencing a new floor) is reduced as this process is similar to our everyday habits.

In terms of any sort of Fixed Ratio (FR) schedule, the collection of information onto virtual floppy discs can be seen as corresponding to a fixed schedule since a fixed quantity of discs is required before “cashing in” at the locally available hard-line (phone booth) and graduating to the next level. Unlike World of Warcraft, however, there is no visible progress bar that is used for gradual leveling. As a result, there is not really any incentive to level because the process seems obscured or, at worst, totally arbitrary. As is more common with generic MMORPGs, most of the reinforcers are given after a specified number of “correct” responses (Butcher, year unknown) and therefore conform to a Variable Ratio (VR) scheduling system.

Distri is in the process of buying some inventory from her local neighborhood vendor NPC, Meek.

Unlike the corpse-looting, the ritual of buying from an NPC vendor is much more reliable and more clearly shows the balance between Fixed and Variable scheduling ratios. In this instance, the prices of the items are fixed (at least in this current game-play iteration) and this NPC is always present in the same spatial location - this continuity of spatial proximity guarantees a stable form of visual-feedback. However, the player’s bank account and current inventory are always variable, so the stimulus-response feedback chain is variable to the point where the player can re-visit the same NPC after a given time-interval without becoming too bored with the available inventory selection that the NPC possesses.

In this case, the deployment of static vending NPCs assists in providing one of the few narrative-bound aspects of the game that successfully implement a Resistance-To-Extinction (RTE) behavioral schema. Since one’s inventory is always in flux (variable), the usefulness of the same item list in the NPC’s inventory can become interesting for different reasons at different time intervals. This aspect is more fore-grounded in the “merchant inventory” pop-up window where the “Hide Unusable Objects” option can provide the player with a way to organize their current reward preferences at any given time. Also, the merchant inventory structurally acts as a narrative fore-shadowing tool where the player is gradually conditioned through the variable availability of “Usable Objects” to eventually pick a behavioral archetype. In this case, Distri will be re-consulting this vendor known as “Meek” to see if her newfound objects can help advance her path as a “Code Shaper” along the growing ability tree.

In the examples given, any sort of noticeable “Fixed Interval” (FI) scheduling process never seemed to occur in the game. Even in the case of the consistent positive visual feedback that results from spawning and re-spawning, corpses do disappear after looting but no re-spawning was evident at the same vantage point as the corpse. In other words, the corpse might re-spawn in a different position, such as behind the player or after the player has left the battleground area. Objects and treasure seem to appear at fixed intervals but that is not clear as the treasure might be placed onto an NPC randomly.

This screenshot shows an example of a Variable Interval (VI) that is rarely used in the game but does exist.

The example above, however, provides a unique case where context determines the level of Fixation or Variability on a given schedulable Interval. Due to the lack of other players – i.e. “friends” – of Distri, this particular email message to the author’s own ALT (alternate identity) on another Matrix Online (MXO) server has been completely fabricated for the sake of showing the communication interface and what a “Variable Interval” might look like in a more active community – as the original designers had intended.

If this were a real in-world email, there would be a variable interval between the time this message was sent and the time another PC (Player-Character) would reply. In theory, this emailing system would have provided the player with a sense of sustained fellowship and a conditioned variable dependency on other players, as has proven successful in the in-world mailbox system of World of Warcraft, for example. Other examples of Variable Intervals (VIs) involve waiting for opponent NPCs to re-spawn but there was no evidence of consistent spatial re-spawning. As this email was never sent (since Distri has no real friends), it was unclear the degree to which overt audio or visual feedback would be provided via the interface.

Here is a screenshot of the loading screen as Distri enters Metro City.

On the subject of spawning, since Distri died during the last game-play session, she was able to pick which “hardline” (phone booth) to use for re-spawning. Distri picked Mara (Richmond) as it is the most social area in the game. Even a remote opportunity for social interaction with other players was a sufficient reward to condition Distri’s behavior towards picking this particular phone-booth as her preferred entry point, even though she eventually acquired a few phone booths to choose from. Those other booths might actually launch her back into areas where she might progress at a much faster ludic rate, but Distri’s main anticipated response is the hope of social interaction, so she is very much following an Operant rather than a Classical conditioning paradigm. She is intentionally choosing a Stimulus in the hopes of receiving a response rather than responding to an external stimulus.

Without lag, this situation would likely have been originally scheduled by the game’s designers according to a rigid Fixed Interval Limited Hold (FI-LH) scheme but because of the inevitable client-server lag in most MMORPGs (esp. MXO), the Interval is actually determined through a fluid, yet frustrating variable duration.

The first response to this variable interval (loading the game-play textures into the client/server cache memory) would usually involve the greeting ritual of the other players towards Distri (which is a cultural form of positive visual feedback). Certainly that was the expected response that conditioned Distri’s behavior to select this particular phone booth. However, Distri was not rewarded with any greeting from the players once she had finally arrived into the lobby space. Perhaps, due to lag, Distri may have appeared un-responsive and rude (i.e. unable to provide contextual feedback via her body language) to other players while entering and they returned the favor with an equal measure of silence. It is possible then that Distri’s client lag thwarted the other PCs’ (Player-Characters) Variable Interval Limited Hold (VI-LH) protocols of engagement.
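The way lag can defeat a limited hold can be sketched as follows. In a limited-hold schedule the reinforcer becomes available once the interval elapses and stays available only for a short window; a response that arrives after the window closes goes unrewarded. The timings and the function names below are assumptions made up for illustration, not measurements from the game.

```python
def limited_hold_reinforced(response_time, interval, hold):
    """True if the response lands inside the window [interval, interval + hold]."""
    return interval <= response_time <= interval + hold

def arrival_outcomes(intended_time, lags, interval, hold):
    """Apply each (hypothetical) lag to the intended response time and test the window."""
    return [
        limited_hold_reinforced(intended_time + lag, interval, hold)
        for lag in lags
    ]

# Illustrative numbers only: the greeting window opens at 10 s and closes at 12 s.
outcomes = arrival_outcomes(
    intended_time=10.0,
    lags=[0.0, 1.5, 4.0],  # seconds of client lag, assumed
    interval=10.0,
    hold=2.0,
)
print(outcomes)
```

With these assumed values the first two arrivals fall inside the window and the third, delayed by four seconds, misses it, which is one way of reading Distri's unreturned greeting.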

Before moving on towards discussions of Operant and Classical conditioning paradigms, it should be noted that all of the tutorial and questing sessions in the Matrix would qualify as being part of a Fixed Duration schedule. However, in this particular segment of game-play, no quest-givers were encountered, although some opponent NPCs cunningly disguised themselves as quest-givers as we will see in the next example...

IV – Operant and Classical Conditioning...

Agent Brady impersonating an allied Quest-Giver. This is Operant Conditioning at work....

In one particular instance of encountering what appeared from a distance to be a quest-giving NPC, a situation arose where one’s socially conditioned behaviors were being put to a test – the kind of test that measured degrees of voluntary (Operant) behavior rather than following a primal unconscious impulse (Classical). This is perhaps the first instance in the game-play where my character was confronted with the entertaining possibility of flirting with moral ambiguity within the story-world.

I was expecting the Agent to give me a quest that conformed to a fixed duration schedule since machines are meant to be very precise and methodical. Usually, the positive visual feedback triggered through the placement of the “i” halo-icon means that the NPCs are quest-givers. Also, the Matrix Online designers claim that quests allow for the player to eventually choose their ethical orientation. Therefore, it was assumed that if I had completed a quest with the machines/agents, I would then become gradually aligned with their faction. However, all “Agent Brady” wanted to do was tell me that they were watching me and that I should consider joining the machines for philosophical reasons. Unfortunately, no actual fixed duration quest was handed out after repeatedly talking with him. This indicates a kind of placebo effect where my own conditioned expectations about the kind of scheduled feedback I receive from a quest-giver (stimulus) produced what BF Skinner would call a “superstitious behavior” (Skinner, 1948). There have been cases in this game-play segment, however, where “suspicious behavior” from an NPC led towards a potential extinction of ludic reinforcers.

Here is Distri being forced to wait by the system before engaging in her next combat move.

As illustrated above, this method of forcing the player to wait to fight while the opponent NPC does nothing in return makes conventional turn-based combat look like a highly embodied real-time experience. As shown in this screenshot, the “Blackwood Goof” NPC goofily walks in circles while engaging in the next round of combat. Although being a “goof” is part of this NPC’s character, there is no indication in the game that this NPC should be completely clueless once engaged in combat – especially if he is given a competitive advantage over the player through a loop-hole in the game mechanical bias. Since both the player AND the NPC are left to aimlessly await combat orders, this example clearly illustrates a really lame use of a Variable Duration schedule and therefore in this case, the visual feedback evident in the “you must wait longer...” text string ultimately results in a downward spiral of negative visual feedback. It should be noted that there were no audio feedback clues given to enhance the waiting message and so, this only confused the player more as to the expected duration of waiting to be re-engaged in combat choreography.

The act of crossing intimate melee zones and instigating initial combat moves should provide a constant flow of actionable-interchanges despite a decrease in weapons and/or energy. Therefore, this system-activity is actually a dis-incentive (negative reinforcement) to engaging in combat. The frequency of the combat being disengaged by the system (via rapid intervals of thwarted close-combat) in favor of an NPC dancing in circles for no perceivable reason caused an unintended ratio strain that began to encourage the fighting/looting behavior involving the engagement of opponent NPCs towards a rapidly implemented schedule of extinction. This trajectory towards ludic extinction gradually leads the player down towards an even more morally ambiguous path that is manifested in the form of a kind of anarchic and disrespectful behavior.

This screenshot shows a clear example where the lack of intelligent agency in each NPC within the Matrix Online positively reinforces the player towards gradually subverting one’s “real life” conditioned social protocols.

In the example shown above, Distri does not feel any ethical dissonance from directly interrupting an automated church service by budging directly (rudely) in front of the semi-intelligent Pastor NPC. This new kind of obtrusive behavior can be paralleled with the kind of behavior seen in Grand Theft Auto where one gradually becomes desensitized towards the unethical act of breaking into cars and stealing them. In this case, Distri is gradually becoming desensitized to the NPCs in the game and begins to lose respect for them as autonomous entities. As a result of the neutrality with regards to receiving meaningful social feedback, Distri’s attitude towards NPCs becomes increasingly manipulative and disrespectful. Because this space is populated almost entirely with half-comatose NPCs, the original cultural context of a church as being a sacred space for divine contemplation has been reduced to that of a non-ludic situation that exists simply to relieve combat-boredom.

Another example of operant conditioning can be seen in the routine behavioral patterns remediated from “real” life such as exploring a world through the obsessive-compulsive opening and closing of doors...

Here is a screenshot of Distri getting into the conditioned routine of habitually opening doors and entering them.

In the game, one has to right-click on a door in order to trigger some positive visual feedback in the form of a tiny pop-up window that says “door open”. If not, then it is assumed that the door is just a prop and is not an active part of the exploratory component of the game. In other words, there is a conditioned expectation remediated from similar expectations derived from the “real world” that behind each of these doors is a space one can enter into. Thus far, there seems to be an appropriate reward for every stimulus, and this might even function as a form of Resistance-To-Extinction (RTE).

Similar acts of Operant Conditioned behavior include turning a light-switch on and off and pressing the “talk” button when engaging NPCs (where hopefully, some inane dialogue will be provided as positive feedback). As is the case with the Skinner box, an expected response usually occurs after the stimulus and if not, superstitious behavior emerges.

Regardless of the level in which suspicious behavior could be stimulated (and simulated as mentioned in a previous critique), there has been a single recorded encounter so far of a behavioral interaction that may in fact be influenced by Classical Conditioning.

This screenshot shows Distri trying to pose and dance at a popular nightclub while her mobile phone incessantly rings in the background.

In a non-ludic instance of the game-play (and there were many since the ludic modes of experiencing the game were not very pleasurable) the phone was trying to act as one of the rare examples of Classical Conditioning in the game. The phone ring acts as a Pavlovian-bell and tries to stimulate the user into answering the phone in order to receive a quest update (i.e. “nag”) from one of the allied NPCs. However, the nightclub was much more entertaining than the phone call so the phone was only answered once to see who it was. When the phone tried to call again, the ring was ignored. From the game designer’s perspective, the mobile phone was attempting to employ audio feedback to catch the attention of the player. However, the player still has free-will and does not need to respond to that audio feedback. If the phone is answered, there is a clicking sound and a pop-up window (visual feedback) appears showing the quest-giving NPC.

V – Reinforcement chaining...

In the Matrix Online, the most predictable and observable form of reinforcement chaining can be found in the way in which the modular combat moves are sequenced into a choreographic whole. What is interesting about these automated combat sequences – besides their aesthetic and voyeuristic appeal – is that the player can be conditioned either through a forward or backward chaining process (Sharpsteen, Brown & Patrick, 2005) of learning the most effective moves for fighting each opponent NPC. The player can oscillate between forward and backward modes of learning the best combat moves based on the very subjective way in which they perceive the mechanics behind the automated actions.

Since the Matrix Online is a MMORPG with variable schedules and intervals, it is useful to note that the chaining process is somewhat variable as well. For example, the player can place moves together in a chained sequence to learn how each move affects the opponent, OR, the player can randomly pair “learned” combat modules together and reverse-engineer the sequence of successfully triggered combat moves through an intense process of character-observation and introspection. From seeing the result of the chained moves against a random dice roll, the player can gradually work backwards through the chain and, through the gradual assimilation of secondary reinforcers, figure out which moves on one’s archetypal ability tree should be gradually phased out (i.e. “extinguished”). If this combat-choreography had modular moves that could be chained together with more than two combat moves at a time, the player might achieve a state of total task presentation (Ibid.) but there is no way for the player to “accomplish the entire series of responses in each learning trial” (Ibid.).
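The reverse-engineering described above, where two-move chains are observed against a dice roll until the weakest move stands out, can be sketched as a simple averaging exercise. Everything in this sketch is hypothetical: the move names, their hidden base values and the six-sided die are stand-ins, not data taken from the game.

```python
import random
from itertools import combinations
from statistics import mean

# Hypothetical moves with hidden base effectiveness (not taken from the game).
BASE = {"kick": 5, "throw": 3, "sweep": 1}

def fight(pair, rng):
    """Outcome of one two-move chain: summed base values plus a d6 roll."""
    return sum(BASE[m] for m in pair) + rng.randint(1, 6)

def weakest_move(trials, rng):
    """Average the observed outcomes per move and return the lowest-scoring move."""
    seen = {m: [] for m in BASE}
    for _ in range(trials):
        for pair in combinations(BASE, 2):
            result = fight(pair, rng)
            for m in pair:
                # The player only sees the chain's result, so credit both moves.
                seen[m].append(result)
    averages = {m: mean(v) for m, v in seen.items()}
    return min(averages, key=averages.get), averages

move, averages = weakest_move(trials=200, rng=random.Random(1))
print("Candidate for extinction:", move)
print(averages)
```

With only a handful of trials the dice roll can mask the difference between moves, which is one way of seeing why the random roll dilutes the feedback; with enough repetitions the weakest move separates from the rest.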

In either perspective, the conditioned selection responses are chained so one can observe the combat as a sequence of “learned” autonomous behaviors rather than having to go through a manual trial-and-error process gleaned from direct combat. If one chooses to develop this chain of learned behaviors, one triggers some corresponding visual feedback in the form of a gradually lengthening schematic map.

As mentioned in previous critiques, however, the extra variable of the random dice roll does dilute the ways in which this sequential chaining might become disentangled from the expected ludic behavior of the player, since any sort of clearly associative positive visual and audio feedback is neutralized and thereby, the player may quickly become bored with this method of systematic combat-analysis as theory, rather than practice (praxis).

In other words, the game as a “game” is ultimately not very fun unless the MMORPG were highly populated, since developing archetypal abilities and tedious questing/grinding would be rewarded by the prestige and fellowship gained amongst peers as in World of Warcraft. In the Matrix Online, however, the servers are not populated enough to activate any of the intended conditioning strategies used to immerse the player in a controlled ludic behavior and so, one quickly learns that the environment is too de-populated for meaningful game-play. Instead, the player learns through constant negative reinforcement via the boring and tedious combat that experiencing the diverse and reality remediating world of Metro Town itself provides a sufficient reward and positive reinforcement for returning to the game and maintaining a subscription.

During this segment of play, none of the feedback mechanisms involved seemed to possess a bullet-proof Resistance-To-Extinction (RTE) and as a result, the game-play was only engaged for analysis/critique purposes where earning a good grade was the only real reward for conducting any sort of ludic behavior. There were many more exploratory rewards, however, and so those pre-conditioned towards non-gaming experiences were inadvertently rewarded while gamer-centric behavior was so extinguishable, engaging in the game-play almost qualified as a negative reinforcer (punishment). Ironically, the desire and pleasure derived from exploring the story-world acted as an extended post-reinforcement pause that occurred from experiencing the negative reinforcing stimulus of traditional game-play activity. Also, there was no evidence of a Premack Principle being used properly in this segment of the game-play since the high frequency activity of exploring the Matrix world rarely provides opportunities (which could be enabled through the NPCs and other players – if enough enthusiastic ones were around) where the player feels any tendency to return to the intended game-play – as this game-behavior seems to occur at a very low frequency.

Even the other frequent “gamers” that were encountered throughout the game mainly hung out at the Mara (Richmond) lobby for chat and socialization and occasionally gossiped via the local-area text chat about the POTENTIAL for engaging in combat when opponent player-characters (PCs) also wandered around the SAME shared lobby. However, none of them felt conditionally compelled to transmute their preferred high-frequency activity of socializing and gossiping into any sort of perceivable low-frequency combat with either other players or opponent NPCs. If anything, the high-frequency activity of socializing is encouraged through the negative reinforcement that occurs when one engages in combat. This might be due to server lag where fighting with other PCs is too laggy to make the experience rewarding... In addition, the NPCs’ behavior does not seem intelligent enough to simulate a positively reinforced ludic outcome.

REFERENCES

Butcher, Sean. “A Behavioral Approach to Video Game Design.” http://www.betabunny.com/behaviorism/Conditioning.htm

Leblanc, Marc. “Tools for Creating Dynamic Game Systems” (1999), pgs. 438-460 in Salen, Katie & Zimmerman, Eric (ed.). Rules of Play: Game Design Fundamentals. MIT Press, 2003.

MATRIX ONLINE WIKIPEDIA ENTRY - http://en.wikipedia.org/wiki/The_Matrix_Online

Sharpsteen Don, J., Brown, Karen & Patrick, Tia. AP Psychology – 7th Edition. P. 93-94. ISBN-10: 07386-1209 New Jersey: Research and Education Association, 2005.

Skinner, BF. “‘Superstition’ in the Pigeon”. (1948). First published in the Journal of Experimental Psychology, 38, 168-172. Re-published online as part of “Classics in the History of Psychology” by Christopher D. Green (York University, Toronto, Ontario) . http://psychclassics.yorku.ca/Skinner/Pigeon/

Matrix Revisited

Tuesday, April 26, 2011

Academatrix - The Post-Reinforcement Pause That Refreshes...