Name Mangling and Serialization in PHP

Serialization in PHP takes a snapshot of an object’s state and converts it to a storable string from which the object can be reconstructed later. It is a powerful tool, but it has pitfalls that can cause problems if you’re not careful. For a serialized object to be unserialized, its class must be defined in the context where unserialization happens, so that PHP knows how to reconstruct the object. Unexpected behavior can arise when the class definition changes between the time the object was serialized and the time it is unserialized, for example when new code is deployed in between.
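As a minimal sketch of why the class has to be defined, the snippet below unserializes a hand-written payload for a class that does not exist in the script. The class name Missing and the payload itself are made up for illustration. With default settings, PHP falls back to a placeholder class instead of a real instance.

```php
<?php
// Hypothetical payload: an object of a class named "Missing" with no
// properties. No class of that name is defined in this script.
$payload = 'O:7:"Missing":0:{}';

$restored = unserialize($payload);

// PHP cannot rebuild a real instance, so it hands back a placeholder object.
$restoredClass = get_class($restored);
echo $restoredClass, "\n"; // prints "__PHP_Incomplete_Class"
```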

Consider the following simple class:

[gist id=faed8c1fcf71e57e099e]

We instantiate this class, serialize the object, and push the result onto a job queue. A worker later pulls it off the queue and unserializes it, except that in the worker’s context the $bar property has been renamed to $baz:

[gist id=c63eb4af83f9a76868d7]

Calling var_dump() on the object produces the following output:

[gist id=1041bb571d5726171df3]

The object now appears to be a combination of the original and new class definitions. If we attempt to echo $foo->bar, we would expect a fatal error, since $bar is a protected property. Instead, the result is “PHP Notice: Undefined property: Foo::$bar”.
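The scenario can be approximated in a single script, since one script cannot hold both the old and the new definition of Foo. This is a sketch under assumptions: the payload is written by hand to look like what the old definition (with a protected $bar) would have produced, and the values 'default' and 'old' are invented. A custom error handler collects diagnostics so they can be inspected. On newer PHP versions the same message is reported as a warning rather than a notice, and PHP 8.2+ may report an additional dynamic-property deprecation during unserialize().

```php
<?php
// The "new" definition of Foo: the protected property is now called $baz.
class Foo
{
    protected $baz = 'default';
}

// Hand-written stand-in for the queued payload. Double quotes are required
// so that \0 becomes a real NUL byte; the key "\0*\0bar" is 6 bytes long.
$payload = "O:3:\"Foo\":1:{s:6:\"\0*\0bar\";s:3:\"old\";}";

// Collect diagnostics instead of letting PHP print them.
$messages = [];
set_error_handler(function ($severity, $message) use (&$messages) {
    $messages[] = $message;
    return true;
});

$foo = unserialize($payload);
$value = $foo->bar; // read from outside the class

restore_error_handler();

var_dump($value);   // NULL: the read found nothing under the key "bar"
print_r($messages); // includes "Undefined property: Foo::$bar"
```

Note that no visibility error is raised: from the new class definition’s point of view, $bar is not a protected property at all, just a name it has never heard of.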

To understand why this happens, we need to dive into the internals of PHP, specifically the Zend Engine, which powers the PHP runtime. The Zend Engine is written in C, so some basic knowledge of the language is helpful but not strictly necessary. The PHP LXR website is an excellent way to explore the PHP source, with many useful features to help you find your way around.

The code snippets here are from the PHP 5.3 branch. Newer branches add other data structures that optimize storage of declared properties, but the relevant components are the same. First, a quick introduction to how the Zend Engine represents objects internally:

[gist id=f6012ac6b59d5456c0e8]

The zend_object consists of three pointers: one to the declaring class (*ce), one to a hash table of the object’s properties (*properties), and one to a hash table of recursion guards (*guards). Guards are not relevant to our topic so we’ll disregard them. The class definition (*ce) contains all of the information about the class such as its constants, functions, and properties.

When you access a property of an object in PHP, the Zend Engine internally calls zend_std_read_property.

[gist id=f47614c2488949dc9a9d]

zend_std_read_property calls zend_get_property_info, which looks up the property by name in the object’s class definition (*ce), verifies that the property can be accessed from the current scope, and returns a zend_property_info struct. If the property has not been declared in the class definition, it instead returns a default zend_property_info that describes a public property with the unmangled name. This is why properties that aren’t explicitly declared default to public visibility.
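The public default for undeclared properties can be observed from PHP itself. The sketch below uses stdClass, which accepts undeclared properties on every PHP version; the property name undeclared and the value 42 are placeholders.

```php
<?php
// stdClass accepts undeclared (dynamic) properties without complaint.
$obj = new stdClass();
$obj->undeclared = 42; // never declared in any class definition

// Readable from outside any class scope, as a public property would be.
$readBack = $obj->undeclared;

// Casting to an array exposes the raw keys of the properties table.
$keys = array_keys((array) $obj);

// Reflection reports the visibility PHP assigned to the property.
$isPublic = (new ReflectionObject($obj))->getProperty('undeclared')->isPublic();

var_dump($readBack, $keys, $isPublic);
```

The array key is the plain name with no NUL bytes, which is what an unmangled public property looks like in the properties table.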

[gist id=b834a43eadfcacaa8c9e]

zend_property_info contains information about the property such as its visibility, hash key, and name. The Zend Engine uses a technique known as name mangling to resolve the property names of a class. The *name pointer is the mangled name of the property. To see how the Zend Engine mangles property names, let’s take a look at a snippet of code from zend_declare_property_ex, which is called when a property is declared on a class.

[gist id=2869a3d28549eafdde6e]

zend_declare_property_ex builds the stored name according to the declared visibility: private and protected properties go through zend_mangle_property_name with different parameters, while public properties keep their plain name. For our example class “Foo” with a property “bar”, the stored name is “[NUL]Foo[NUL]bar” if the property is declared private, “[NUL]*[NUL]bar” if it is declared protected, and simply “bar” if it is declared public, where [NUL] is the null byte. If you examine the output of a serialized object, you can actually see the mangled names of all of the object’s properties.
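To make the mangled names visible, the following sketch serializes a small class and replaces each NUL byte with the printable marker [NUL]. The class body and the property names pub, prot, and priv are illustrative only.

```php
<?php
class Foo
{
    public $pub = 1;
    protected $prot = 2;
    private $priv = 3;
}

// Replace NUL bytes with a visible marker so the output is printable.
$serialized = str_replace("\0", '[NUL]', serialize(new Foo()));

echo $serialized, "\n";
// O:3:"Foo":3:{s:3:"pub";i:1;s:7:"[NUL]*[NUL]prot";i:2;s:9:"[NUL]Foo[NUL]priv";i:3;}
```

The length prefixes (s:7 and s:9) count the raw bytes of the mangled names, with each NUL counted as one byte, which is why they look too short once the markers are substituted in.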

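Why do we need mangled names at all? Because a single object can hold several properties with the same name. A parent class and a subclass can each declare their own private $bar, for example; these are distinct properties that live in the same properties hash table, and mangling gives each one a unique key.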

[gist id=bde457d69ac9893273d1]
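A short sketch of the same idea, with hypothetical class names Base and Child: both declare a private $bar, and the array cast shows that the object carries two separate entries, one per declaring class.

```php
<?php
class Base
{
    private $bar = 'from Base';

    public function baseBar()
    {
        return $this->bar; // resolves to the private $bar declared in Base
    }
}

class Child extends Base
{
    private $bar = 'from Child';

    public function childBar()
    {
        return $this->bar; // resolves to the private $bar declared in Child
    }
}

$child = new Child();

// The array cast exposes the mangled keys of the properties table.
$table = (array) $child;

foreach ($table as $key => $value) {
    echo str_replace("\0", '[NUL]', $key), ' => ', $value, "\n";
}

echo $child->baseBar(), "\n";
echo $child->childBar(), "\n";
```

Each method sees the property that belongs to the class it was defined in, because the class name is part of the key.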

Let’s get back to our example. In the new context, the class entry defines a single property, $baz, while the object’s properties hash table contains two items: $baz, stored under the mangled name “[NUL]*[NUL]baz”, and $bar, stored under the mangled name “[NUL]*[NUL]bar”. When zend_get_property_info looks up $bar in the class entry, it finds nothing, because $bar no longer exists in the new class definition, so it falls back to treating $bar as a public property with the unmangled name “bar”. As we saw earlier, however, the properties table holds the value under “[NUL]*[NUL]bar”. The name derived from the class definition does not match the key stored in the object, so the lookup finds no entry and PHP raises the undefined property notice.
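The mismatch can be inspected directly by casting the unserialized object to an array, which exposes the raw keys of the properties table. As before, this is a sketch: the payload is hand-written to mimic the old class definition, and the values are invented. The @ operator only silences the dynamic-property deprecation that PHP 8.2+ may report during unserialize().

```php
<?php
// The new definition of Foo, in which $bar has been renamed to $baz.
class Foo
{
    protected $baz = 'default';
}

// Hand-written payload, as if serialized while Foo still had a protected $bar.
$payload = "O:3:\"Foo\":1:{s:6:\"\0*\0bar\";s:3:\"old\";}";

$foo = @unserialize($payload);
$table = (array) $foo;

// The key PHP derives for an undeclared property is the plain name...
$hasPlain = array_key_exists('bar', $table);

// ...but the value was stored under the protected-style mangled key.
$hasMangled = array_key_exists("\0*\0bar", $table);
$recovered = $table["\0*\0bar"];

var_dump($hasPlain, $hasMangled, $recovered);
```

The old value is still present in the object; it is simply unreachable through ordinary property access because no lookup will ever produce that key.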

As we’ve seen, serialization can be a bit tricky, but used correctly it is an invaluable part of the PHP toolset. When deploying code that changes the definition of a class whose serialized instances may still be in flight, take care to avoid changes, such as renaming a non-public property, that cause these mismatches between class definition and object.