Let's break CPython together, for fun and mischief

I promise that nothing we do here will be useful.

But I promise it will be entertaining, and (hopefully) educational.

Clarification

Before we proceed to hackage, let me make sure it's clear what I'm talking about when I say "CPython internals". CPython is the reference implementation of Python, and it's what most people use. It's what comes as standard on any system I've ever used.

A Python implementation includes the interpreter, the built-in types and the standard library. In CPython, all of this is written in C, apart from much of the standard library, which is written in Python. There are other implementations, such as PyPy, Jython and IronPython.

Everything we do here is exploiting the specific implementation details of CPython.

YMMV

Please bear in mind that Python was not designed to do the things we're going to do, and some of the fun things that worked with the version of Python I used here, my operating system, &c., might end up segfaulting for you. Running stuff in IPython rather than the standard REPL is also likely to cause more problems once things are hacked.

To whet your appetite

Let's have a look at the Python language reference. The first two sentences of the data model say this:

Objects are Python’s abstraction for data. All data in a Python program is represented by objects or by relations between objects.

In CPython a Python object is defined in the PyObject struct:

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

(The first bit here, _PyObject_HEAD_EXTRA, is only valid when compiling Python with a special tracing debugging feature, so don't worry about it.)

We have the reference count ob_refcnt, which is used for memory management and tells us how many references to this object currently exist. When the reference count of an object drops to zero, CPython can free its memory and resources straight away.
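To watch the reference count move, here is a minimal sketch using sys.getrefcount from the standard library. The absolute numbers vary between CPython versions (the call itself may hold a temporary reference), so only the difference between the two readings is meaningful here.

```python
import sys

obj = object()

# Only the difference between two readings matters: the absolute value
# includes temporary references and differs between CPython versions.
before = sys.getrefcount(obj)
alias = obj  # bind a second name to the same object
after = sys.getrefcount(obj)

print(after - before)
```

Binding one more name to the object should raise the count by exactly one.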

We also have the type information, ob_type, which tells us how to interact with the object, what its behaviour is, what data it contains.

Going back to the data model:

Every object has an identity, a type and a value. An object’s identity never changes once it has been created; you may think of it as the object’s address in memory. The ‘is’ operator compares the identity of two objects; the id() function returns an integer representing its identity.

CPython implementation detail: For CPython, id(x) is the memory address where x is stored.
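As a small illustration of identity versus value (the variable names are mine, not from the docs), two names bound to the same list share an id, while an equal but separately built list does not:

```python
first = [1, 2, 3]
second = first        # another name for the same object
third = [1, 2, 3]     # equal value, but a separate object

same_object = first is second
equal_but_distinct = (first == third) and (first is not third)

# `is` gives the same answer as comparing id() values while both objects are alive.
ids_agree = (first is second) == (id(first) == id(second))

print(same_object, equal_but_distinct, ids_agree)
```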

So what I'd expect CPython to do is dynamically allocate memory for a new PyObject each time we create a new object.

Let's test this out with some integers:

>>> x = 500
>>> y = 500
>>> x is y
False

That makes sense: a new PyObject has been allocated for each variable we've made here, and so they are at different places in memory. But what if we use smaller integers?

>>> x = 5
>>> y = 5
>>> x is y
True

How surprising! Let's have a look in the CPython source to see why this might be:

#ifndef NSMALLPOSINTS
#define NSMALLPOSINTS           257
#endif
#ifndef NSMALLNEGINTS
#define NSMALLNEGINTS           5
#endif

/* Small integers are preallocated in this array so that they
   can be shared.
   The integers that are preallocated are those in the range
   -NSMALLNEGINTS (inclusive) to NSMALLPOSINTS (not inclusive).
*/
static PyLongObject small_ints[NSMALLNEGINTS + NSMALLPOSINTS];

So it seems integers between -5 and 256 inclusive are statically allocated in a big old array! This is an optimisation that CPython has chosen to do -- the idea is that these integers are going to be used a lot, and it would be time consuming to allocate new memory every time.
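If you want to probe the cache on your own interpreter without touching any memory, a rough sketch like the following should do. The helper name is_cached is my own, and the exact range may differ on other CPython versions, so treat the printed bounds as something to check rather than a guarantee.

```python
def is_cached(n):
    # Build the integer at runtime (str -> int) so the compiler cannot
    # fold two equal literals into one shared constant.
    a = int(str(n))
    b = int(str(n))
    return a is b


cached = [n for n in range(-10, 300) if is_cached(n)]
if cached:
    print(min(cached), max(cached))
```

With the source shown above, I would expect the bounds to come out as -5 and 256.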

But...if that means some integers have a defined place in memory, can we...corrupt that memory?

import ctypes


Most good CPython shenanigans begin with importing ctypes, Python's standard C foreign function interface (FFI). An FFI allows different languages to interoperate. ctypes provides C-compatible data types and allows calling functions from shared libraries and such.

The ctypes docs tell us about the function memmove:

ctypes.memmove(dst, src, count)

Same as the standard C memmove library function: copies count bytes from src to dst. dst and src must be integers or ctypes instances that can be converted to pointers.
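Before pointing memmove at anything precious, here is a harmless sketch of the same call on two buffers we own ourselves. The buffer contents are just illustrative.

```python
import ctypes

src = ctypes.create_string_buffer(b"six")   # 4 bytes: 's', 'i', 'x', NUL
dst = ctypes.create_string_buffer(b"five")  # 5 bytes, so src fits inside

# Copy every byte of src (including its trailing NUL) over the start of dst.
ctypes.memmove(dst, src, ctypes.sizeof(src))

copied = dst.value  # bytes up to the first NUL
print(copied)
```

The destination is big enough for the copy here. Nothing checks that for you, which is exactly what makes the next step possible.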

So what if we copied the memory where 6 is to where 5 is?

>>> import ctypes
>>> import sys
>>> ctypes.memmove(id(5), id(6), sys.getsizeof(5))
>>> 5 + 5
12

What fun! But this is small fry stuff. We can do more. We have ambition.

Ambition

I don't want to change one integer. I want to change ALL the integers.

What if we changed what happens when you add integers together? What if we made it subtract instead?


The way operator resolution works in Python is that the corresponding "magic method" or "dunder method" (for double underscores) is called. For example x + y will become x.__add__(y). So the int.__add__ method is going to be our target for mischievous hackage.

>>> def fake_add(x, y):
...     return x - y
...
>>> int.__add__ = fake_add
TypeError: can't set attributes of built-in/extension type 'int'

Annoying, but unsurprising. Python is permissive in the sense that it doesn't have access modifiers like C++ or Java – you can't really define private attributes of a class. But you can't do just anything, and patching built-ins like this is one of the things Python prevents us from doing – unless we try very hard.
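For contrast, here is a small sketch of the same patch on an ordinary class, where Python lets it through without complaint. Money is a made-up class for illustration only.

```python
class Money:
    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):
        return Money(self.amount + other.amount)


def fake_add(x, y):
    return Money(x.amount - y.amount)


Money.__add__ = fake_add  # allowed: Money is an ordinary Python class
patched_total = (Money(10) + Money(4)).amount

try:
    int.__add__ = lambda x, y: x - y  # refused for the built-in type
    builtin_patched = True
except TypeError:
    builtin_patched = False

print(patched_total, builtin_patched)
```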

So what can we try instead? All attribute resolution comes down to looking up attribute names in an object's dictionary. For example, x.y would resolve to x.__dict__["y"] [1]. What if we try accessing int.__add__ that way?

>>> int.__dict__["__add__"] = fake_add
TypeError: 'mappingproxy' object does not support item assignment

Tarnation. But of course, we knew it would not be as easy as this. Perhaps lesser programmers would give up here. "It's not allowed," they might say. But we are strong and we are determined.

What is this mappingproxy the interpreter speaks of?

Read-only proxy of a mapping.
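The same type is exposed in the standard library as types.MappingProxyType, so we can play with one over a dictionary of our own. This is only a sketch of the behaviour, with made-up keys:

```python
import types

backing = {"answer": 42}
proxy = types.MappingProxyType(backing)

try:
    proxy["answer"] = 0  # writing through the proxy is refused
    write_allowed = True
except TypeError:
    write_allowed = False

# The proxy is a live view: changes to the real dict show through it.
backing["question"] = "unknown"
seen_through_proxy = proxy["question"]

print(write_allowed, seen_through_proxy)
```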

Ok, so this is just some cast over the actual dictionary. If we can cast it to a dictionary, we can assign to it. But doing this with a Python cast is just creating a copy:

>>> dict(int.__dict__)["__add__"] = fake_add
>>> 1 + 5
6
>>> (1).__add__(5)
6
>>> int.__add__ == fake_add
False

We need to go deeper. Let's look at the CPython source for the mappingproxy type.

typedef struct {
    PyObject_HEAD
    PyObject *mapping;
} mappingproxyobject;

static PyMappingMethods mappingproxy_as_mapping = {
    (lenfunc)mappingproxy_len,                  /* mp_length */
    (binaryfunc)mappingproxy_getitem,           /* mp_subscript */
    0,                                          /* mp_ass_subscript */
};

The PyMappingMethods of a type tell us how it behaves as a mapping: what does x[key] do (mp_subscript)? What does x[key] = y do (mp_ass_subscript)?

What this is telling us is that the mapping proxy is basically a wrapper around a normal dictionary with the function pointer to the subscript assignment method set to NULL.
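A rough Python-level analogue, assuming a made-up ReadOnlyBox class: define __getitem__ but leave out __setitem__, and item assignment fails in the same way.

```python
class ReadOnlyBox:
    """Supports len() and box[key], but deliberately not box[key] = value."""

    def __init__(self, data):
        self._data = dict(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        return self._data[key]


box = ReadOnlyBox({"a": 1})
value = box["a"]

try:
    box["a"] = 2
    assigned = True
    message = ""
except TypeError as exc:
    assigned = False
    message = str(exc)

print(value, assigned, message)
```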

We can use ctypes to cast this and reveal the underlying dictionary.

import ctypes


# Declared empty first so the struct can hold a pointer to its own type.
class PyObject(ctypes.Structure):
    pass


PyObject._fields_ = [
    ('ob_refcnt', ctypes.c_ssize_t),
    ('ob_type', ctypes.POINTER(PyObject))
]


# Inherits the PyObject header fields, then adds the wrapped mapping.
class MappingProxy(PyObject):
    _fields_ = [('dict', ctypes.POINTER(PyObject))]
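The "declare the class empty, fill in _fields_ afterwards" pattern is how ctypes copes with a struct that points at its own type. Here is the same pattern on a toy linked-list node (Node is my own example, nothing to do with CPython), where it is safe to poke around:

```python
import ctypes


class Node(ctypes.Structure):
    pass


# Node must already exist as a name before it can appear in its own fields.
Node._fields_ = [
    ("value", ctypes.c_int),
    ("next", ctypes.POINTER(Node)),
]

tail = Node(2, None)                  # None becomes a NULL pointer
head = Node(1, ctypes.pointer(tail))

second_value = head.next.contents.value
tail_has_next = bool(tail.next)       # NULL pointers are falsy

print(second_value, tail_has_next)
```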

The trouble is, once we have the dict as a PyObject pointer, how do we get it back to being a plain old Python dict? It's no good doing this:

>>> MappingProxy.from_address(id(int.__dict__)).dict
<LP_PyObject at 0x7f6e98c8e7b8>

if we have no way to interpret this as a dict. We can use a pleasing wee trick from the CPython API, courtesy of Armin Ronacher [2]: put the pointer as a value into another existing dictionary, where it will be interpreted the same as any other object, and then extract it!

def pyobj_cast(obj):
    # Reinterpret a Python object's address as a PyObject pointer.
    return ctypes.cast(id(obj), ctypes.POINTER(PyObject))


def get_dict(proxy):
    dict_as_pyobj = MappingProxy.from_address(id(proxy)).dict
    fence = {}
    # Store the raw pointer as a value; reading it back gives a real dict.
    ctypes.pythonapi.PyDict_SetItem(
            pyobj_cast(fence),
            pyobj_cast("victory"),
            dict_as_pyobj)
    return fence["victory"]

int_dict = get_dict(int.__dict__)
int_dict["__add__"] = fake_add

Have we done it???

>>> 1 + 1
2

D'oh! But wait a minute...

>>> (1).__add__(1)
0

What! But the data model says:

to evaluate the expression x + y, where x is an instance of a class that has an __add__() method, x.__add__(y) is called

We've been lied to...this is clearly not true! It seems CPython has some shortcut in place.
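You can see a tamer version of this gap without any hacking. In the sketch below (Number is a made-up class), the + operator consults the type and ignores an __add__ stuck onto the instance, while an explicit attribute call finds the instance's one first:

```python
class Number:
    def __add__(self, other):
        return "type-level __add__"


n = Number()
n.__add__ = lambda other: "instance-level __add__"

via_operator = n + n           # operator lookup goes through the type
via_attribute = n.__add__(n)   # ordinary attribute lookup finds the instance attribute

print(via_operator)
print(via_attribute)
```

So x + y and x.__add__(y) are already not quite the same thing in plain Python, which hints at the separate machinery underneath.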

To be fair, they probably didn't think we'd ever find out this "lie" by performing these shenanigans. We need to go yet deeper still to fulfill our pointless quest. We will have control of the builtins.

Full type mappings

Back to the CPython source. What is in this type information that's in the PyObject struct we looked at earlier? The answer: lots of stuff that I am not going to put here, but the most interesting parts for our purposes are the method suites:

typedef struct _typeobject {
    ...
    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;
    ...
} PyTypeObject;

This struct contains other structs defining the behaviour of the type via function pointers. We're specifically interested in this tp_as_number member. Its first member, nb_add, is the function pointer to the add method. This is what we want to overwrite. This is our new target.

typedef struct {
    binaryfunc nb_add;
    binaryfunc nb_subtract;
    binaryfunc nb_multiply;
    ...
} PyNumberMethods;
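The C API function that dispatches through these slots is PyNumber_Add, and ctypes lets us call it directly. This is a minimal sketch, not part of the hack itself. Declaring the argument and return types as py_object is my choice, so that ctypes handles the reference counting and error checking.

```python
import ctypes

number_add = ctypes.pythonapi.PyNumber_Add
number_add.argtypes = [ctypes.py_object, ctypes.py_object]
number_add.restype = ctypes.py_object

int_sum = number_add(3, 4)        # resolved via the int type's nb_add slot
list_sum = number_add([1], [2])   # no nb_add on list, so it falls back to concatenation

print(int_sum, list_sum)
```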

So, like we made the ctypes mappings before, I want to do it for this entire PyTypeObject struct. Which is big...so I'm not putting it all here!

import ctypes


class PyObject(ctypes.Structure):
    pass


class PyTypeObject(ctypes.Structure):
    pass


Py_ssize_t = ctypes.c_ssize_t
binaryfunc = ctypes.CFUNCTYPE(
        ctypes.POINTER(PyObject),
        ctypes.POINTER(PyObject),
        ctypes.POINTER(PyObject))


class PyNumberMethods(ctypes.Structure):
    _fields_ = [
            ("nb_add", binaryfunc),
            ("nb_subtract", binaryfunc),
            ("nb_multiply", binaryfunc),
            ...
    ]


PyTypeObject._fields_ = [
        ...
        ("tp_as_number", ctypes.POINTER(PyNumberMethods)),
        ...
]


PyObject._fields_ = [
        ("ob_refcnt", Py_ssize_t),
        ("ob_type", ctypes.POINTER(PyTypeObject))]

So here we've basically made a Python mapping of the structs we have in C. If we cast our Python int type to the equivalent type struct, we'll reveal the secrets usually hidden from us.

>>> PyLong_Type = ctypes.cast(id(int), ctypes.POINTER(PyTypeObject)).contents
>>> PyLong_Type.tp_as_number.contents.nb_add = PyLong_Type.tp_as_number.contents.nb_subtract
>>> 10 + 4
6
>>> 1 + 1
0
>>> 1 + 3
-2

We did it!!! Incredible.

But now we know how to patch built-ins...what if we went further? What if we added functionality that wasn't there before, rather than altering existing functionality?

Nice immutable string you got there. It would be a shame if something should...happen to it 😏

In Python, strings are immutable. You can't go in and change one of the characters – you have to create a new string object. When you add characters to an existing string variable, a new string object is created. [3]
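A quick sketch of what that looks like in practice (variable names are mine): appending to a string rebinds the name to a new object, and any other name still sees the old one.

```python
original = "hack"
alias = original             # second name for the same string object

original += " the planet"    # builds a new string and rebinds `original`

alias_unchanged = alias == "hack"
rebound_to_new_object = original is not alias

print(alias_unchanged, rebound_to_new_object)
```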

What if we made strings mutable?

Let's have a look at PyUnicode_Type in the CPython source...and then swiftly look away from all 16k lines of it, cos it's distressingly complex: it has to handle Unicode and all its complexities as well as ASCII. Good times. We want to find the tp_as_mapping member of the PyUnicode_Type struct:

static PyMappingMethods unicode_as_mapping = {
    (lenfunc)unicode_length,        /* mp_length */
    (binaryfunc)unicode_subscript,  /* mp_subscript */
    (objobjargproc)0,               /* mp_ass_subscript */
};

We want to create a new function to point to from the mp_ass_subscript member. Here's my extremely hacked one which wouldn't handle every case, not at all. But I think it's going to allow us to do what we want.

static int
unicode_ass_subscript(PyUnicodeObject* self, PyObject* item, PyObject* value)
{
    Py_ssize_t i = ((PyLongObject*)(item))->ob_digit[0];
    unsigned int kind = ((PyASCIIObject*)(self))->state.kind;
    char* data = ((char*)((PyASCIIObject*)(self) + 1));
    char* new_data = ((char*)((PyASCIIObject*)(value) + 1));
    *(data + kind * i) = *new_data;
    return 0;
}

static PyMappingMethods unicode_as_mapping = {
    (lenfunc)unicode_length,                 /* mp_length */
    (binaryfunc)unicode_subscript,           /* mp_subscript */
    (objobjargproc)unicode_ass_subscript,    /* mp_ass_subscript */
};

I don't want to just change the source code of the Python binary I'm using. That is cheating. I want to break Python from the inside.

But I can use this to get the machine code that I want to replace the subscript assignment function with. (This is now just stupid compared to what we were doing before...and not at all portable. But we are doing it.)

First, I built CPython with this new source code in there. I can retrieve the machine code generated by our new function using objdump:

objdump -d Objects/unicodeobject.o

00000000000000f4 <unicode_ass_subscript>:
      f4:   8b 4e 18                mov    0x18(%rsi),%ecx
      f7:   0f b6 47 20             movzbl 0x20(%rdi),%eax
      fb:   c0 e8 02                shr    $0x2,%al
      fe:   83 e0 07                and    $0x7,%eax
     101:   48 0f af c1             imul   %rcx,%rax
     105:   0f b6 52 30             movzbl 0x30(%rdx),%edx
     109:   88 54 07 30             mov    %dl,0x30(%rdi,%rax,1)
     10d:   b8 00 00 00 00          mov    $0x0,%eax
     112:   c3                      retq
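Turning that listing into bytes is just a matter of gluing the hex columns together. Here's a small sketch that does it and sanity-checks the length against the addresses in the listing. The list is copied by hand from the objdump output above, one entry per instruction.

```python
# Hex column of each instruction, in listing order.
hex_columns = [
    "8b 4e 18",
    "0f b6 47 20",
    "c0 e8 02",
    "83 e0 07",
    "48 0f af c1",
    "0f b6 52 30",
    "88 54 07 30",
    "b8 00 00 00 00",
    "c3",
]

payload = bytes.fromhex(" ".join(hex_columns))

# Addresses of the first and last byte of the function in the listing.
first_byte, last_byte = 0xF4, 0x112
expected_length = last_byte - first_byte + 1

print(len(payload), expected_length)
```

The final byte is the ret instruction, so if the lengths agree we have the whole function and nothing more.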

Then, in a standard un-patched Python session, let's copy the machine code we've just compiled from the C, and make a new function pointer that points to it:

import ctypes
import mmap

# PyTypeObject and objobjargproc come from the full ctypes mappings
# described above (not reproduced here).
PyUnicode_Type = ctypes.cast(id(str), ctypes.POINTER(PyTypeObject)).contents

payload = (
        b"\x8b\x4e\x18\x0f\xb6\x47\x20\xc0\xe8\x02\x83\xe0\x07\x48\x0f\xaf"
        b"\xc1\x0f\xb6\x52\x30\x88\x54\x07\x30\xb8\x00\x00\x00\x00\xc3")
# Anonymous mapping that is readable, writable and executable.
buf = mmap.mmap(
        -1,
        len(payload),
        prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(payload)
# The address of the buffer is the address of our machine code.
fpointer = ctypes.c_void_p.from_buffer(buf)
bad_boi = objobjargproc(ctypes.addressof(fpointer))
PyUnicode_Type.tp_as_mapping.contents.mp_ass_subscript = bad_boi

Before we scored this righteous hack, this would happen:

>>> x = "hack the planet"
>>> x[1] = "4"
TypeError: 'str' object does not support item assignment

but now...

>>> x = "hack the planet"
>>> x[1] = "4"
>>> print(x)
h4ck the planet


And so our pointless quest is over. I hope you had fun. Sometimes it is good to remember that computers can be just for fun. 😊


  1. More information about this is in the Data Model. In short, there is a defined resolution order for class hierarchies. So if x.__dict__ didn't have the key "y", then we'll next look in base classes of x, &c. 

  2. I first saw Armin Ronacher do this on Twitter, so a lot of credit is due to him for this great trick. Someone has uploaded his code here (I can't find the original tweet).

  3. There is a good explanation as to why here.