Let's break CPython together, for fun and mischief

I promise that nothing we do here will be useful.

But I promise it will be entertaining, and (hopefully) educational:


Before we proceed to hackage, let me make sure it's clear what I'm talking about when I say "CPython internals". CPython is the reference implementation of Python, and it's what most people use. It's what comes as standard on any system I've ever used.

A Python implementation includes the interpreter, the built-in types and the standard library. In CPython this is all written in C, apart from much of the standard library, which is in Python. There are other implementations, such as PyPy, Jython and IronPython.

Everything we do here is exploiting the specific implementation details of CPython.


Please bear in mind that Python was not designed to do the things we're going to do. Some of the fun things that worked with the version of Python I used here, on my operating system, &c., might end up segfaulting for you. Running this in IPython rather than the standard REPL is also likely to cause extra problems once things have been hacked.

To whet your appetite

Let's have a look at the Python language reference. The first two sentences of the data model say this:

Objects are Python’s abstraction for data. All data in a Python program is represented by objects or by relations between objects.

In CPython a Python object is defined in the PyObject struct:

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

(The first bit here, _PyObject_HEAD_EXTRA, only expands to anything when Python is compiled with a special tracing debug feature, so don't worry about it.)

We have the reference count ob_refcnt, which is used for memory management: it counts how many references to this object currently exist. When an object's reference count drops to zero, CPython can deallocate the object and reclaim its memory.

We also have the type information, ob_type, which tells us how to interact with the object, what its behaviour is, what data it contains.

Going back to the data model:

Every object has an identity, a type and a value. An object’s identity never changes once it has been created; you may think of it as the object’s address in memory. The ‘is’ operator compares the identity of two objects; the id() function returns an integer representing its identity.

CPython implementation detail: For CPython, id(x) is the memory address where x is stored.

So what I'd expect CPython to do is dynamically allocate memory for a new PyObject each time we create a new object.

Let's test this out with some integers:

>>> x = 500
>>> y = 500
>>> x is y
False

That makes sense: a new PyObject has been allocated for each variable we've made here, and so they are at different places in memory. But what if we use smaller integers?

>>> x = 5
>>> y = 5
>>> x is y
True

How surprising! Let's have a look in the CPython source to see why this might be:

#ifndef NSMALLPOSINTS
#define NSMALLPOSINTS           257
#endif
#ifndef NSMALLNEGINTS
#define NSMALLNEGINTS           5
#endif

/* Small integers are preallocated in this array so that they
   can be shared.
   The integers that are preallocated are those in the range
   -NSMALLNEGINTS (inclusive) to NSMALLPOSINTS (not inclusive).
*/
static PyLongObject small_ints[NSMALLNEGINTS + NSMALLPOSINTS];

So it seems integers between -5 and 256 inclusive are statically allocated in a big old array! This is an optimisation CPython has chosen to make – the idea is that these integers are going to be used a lot, and it would be time-consuming to allocate new memory every time.
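We can see the cache boundary from Python. The ints are built at runtime with int() here because, within a single script, the compiler can deduplicate equal literal constants and spoil the demonstration:

```python
a, b = int("256"), int("256")   # inside the cache: one shared object
c, d = int("257"), int("257")   # outside it: two fresh allocations
print(a is b, c is d)           # True False
```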

But...if that means some integers have a defined place in memory, can we...corrupt that memory?

import ctypes


Most good CPython shenanigans begin with importing ctypes, Python's standard C foreign function interface (FFI). An FFI allows different languages to interoperate: ctypes provides C-compatible data types and lets us call functions from shared libraries and such.
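As a quick taste of what ctypes does, here's a call straight into libc – assuming a Unix-like system, where passing None to CDLL loads the running process itself, which has libc in it:

```python
import ctypes

# On Unix, CDLL(None) gives us the symbols of the current process,
# including the C standard library it is linked against
libc = ctypes.CDLL(None)
print(libc.abs(-42))  # C's abs(), not Python's
```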

The ctypes docs tell us about the function memmove:

ctypes.memmove(dst, src, count)

Same as the standard C memmove library function: copies count bytes from src to dst. dst and src must be integers or ctypes instances that can be converted to pointers.
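Before we aim memmove at the interpreter's own memory, a harmless warm-up on buffers we allocate ourselves:

```python
import ctypes

src = ctypes.create_string_buffer(b"hello")  # NUL-terminated C buffer
dst = ctypes.create_string_buffer(8)         # zero-filled, 8 bytes
ctypes.memmove(dst, src, 5)                  # copy 5 bytes, C-style
print(dst.value)  # b'hello'
```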

So what if we copied the memory where 6 is to where 5 is?

>>> import ctypes
>>> import sys
>>> ctypes.memmove(id(5), id(6), sys.getsizeof(5))
>>> 5 + 5
12

What fun! But this is small fry stuff. We can do more. We have ambition.


I don't want to change one integer. I want to change ALL the integers.

What if we changed what happens when you add integers together? What if we made it subtract instead?


The way operator resolution works in Python is that the corresponding "magic method" or "dunder method" (for double underscores) is called. For example, x + y will become x.__add__(y). So the int.__add__ method is going to be our target for mischievous hackage.
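To see that equivalence first with a harmless toy class (Money is invented for this sketch):

```python
class Money:
    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):
        return Money(self.amount + other.amount)

a, b = Money(2), Money(3)
print((a + b).amount)       # the + operator...
print(a.__add__(b).amount)  # ...is sugar for this call
```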

>>> def fake_add(x, y):
...     return x - y
...
>>> int.__add__ = fake_add
TypeError: can't set attributes of built-in/extension type 'int'

Annoying, but unsurprising. Python is permissive in the sense that it doesn't have access modifiers like C++ or Java – you can't really define private attributes of a class. But you can't do just anything, and patching built-ins like this is one of the things Python prevents us from doing – unless we try very hard.
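For contrast, an ordinary user-defined class has no such protection – the very same patch lands happily on a heap type (Num is invented for this sketch):

```python
class Num:
    def __init__(self, n):
        self.n = n

    def __add__(self, other):
        return Num(self.n + other.n)

def fake_add(x, y):
    return Num(x.n - y.n)

Num.__add__ = fake_add  # no TypeError here: heap types are fair game
print((Num(10) + Num(4)).n)  # 6, not 14
```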

So what can we try instead? All attribute resolution comes down to looking up attribute names in an object's dictionary. For example, x.y would resolve to x.__dict__["y"][1]. What if we try accessing int.__add__ that way?

>>> int.__dict__["__add__"] = fake_add
TypeError: 'mappingproxy' object does not support item assignment

Tarnation. But of course, we knew it would not be as easy as this. Perhaps lesser programmers would give up here. "It's not allowed," they might say. But we are strong and we are determined.

What is this mappingproxy the interpreter speaks of?

Read-only proxy of a mapping.

Ok, so this is just a read-only wrapper over the actual dictionary. If we can get at the dictionary underneath, we can assign to it. But converting it with dict() just creates a copy:

>>> dict(int.__dict__)["__add__"] = fake_add
>>> 1 + 5
6
>>> (1).__add__(5)
6
>>> int.__add__ == fake_add
False

We need to go deeper. Let's look at the CPython source for the mappingproxy type.

typedef struct {
    PyObject_HEAD
    PyObject *mapping;
} mappingproxyobject;

static PyMappingMethods mappingproxy_as_mapping = {
    (lenfunc)mappingproxy_len,                  /* mp_length */
    (binaryfunc)mappingproxy_getitem,           /* mp_subscript */
    0,                                          /* mp_ass_subscript */
};

The PyMappingMethods of a type tell us how it behaves as a mapping: what does x[key] do (mp_subscript)? What does x[key] = y do (mp_ass_subscript)?

What this is telling us is that the mapping proxy is basically a wrapper around a normal dictionary with the function pointer to the subscript assignment method set to NULL.
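Incidentally, these three C slots are exactly what the Python-level mapping protocol maps onto – a toy mapping (Box is invented here) wires up all of them:

```python
class Box:
    def __init__(self):
        self._data = {}

    def __len__(self):                  # fills the mp_length slot
        return len(self._data)

    def __getitem__(self, key):         # fills the mp_subscript slot
        return self._data[key]

    def __setitem__(self, key, value):  # fills the mp_ass_subscript slot
        self._data[key] = value

box = Box()
box["a"] = 1
print(box["a"], len(box))  # 1 1
```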

We can use ctypes to cast this and reveal the underlying dictionary.

import ctypes

class PyObject(ctypes.Structure):
    pass

PyObject._fields_ = [
    ('ob_refcnt', ctypes.c_ssize_t),
    ('ob_type', ctypes.POINTER(PyObject)),
]

class MappingProxy(PyObject):
    _fields_ = [('dict', ctypes.POINTER(PyObject))]

The trouble is, once we have the dict as a PyObject pointer, how do we get it back to being a plain old Python dict? It's no good doing this:

>>> MappingProxy.from_address(id(int.__dict__)).dict
<LP_PyObject at 0x7f6e98c8e7b8>

if we have no way to interpret this as a dict. But we can use a pleasing wee trick with the CPython API, courtesy of Armin Ronacher[2]: use PyDict_SetItem to insert the pointer as a value into another, ordinary dictionary, where it will be interpreted the same as any other object – then we can extract it!

def pyobj_cast(obj):
    return ctypes.cast(id(obj), ctypes.POINTER(PyObject))

def get_dict(proxy):
    dict_as_pyobj = MappingProxy.from_address(id(proxy)).dict
    fence = {}
    ctypes.pythonapi.PyDict_SetItem(
            pyobj_cast(fence),
            pyobj_cast("victory"),
            dict_as_pyobj)
    return fence["victory"]

int_dict = get_dict(int.__dict__)
int_dict["__add__"] = fake_add

Have we done it???

>>> 1 + 1
2

D'oh! But wait a minute...

>>> (1).__add__(1)
0

What! But the data model says:

to evaluate the expression x + y, where x is an instance of a class that has an __add__() method, x.__add__(y) is called

We've been lied to...this is clearly not true! It seems CPython has some shortcut in place: for a built-in type, the + operator doesn't actually go through a dictionary lookup at all.

To be fair, they probably didn't think we'd ever find out this "lie" by performing these shenanigans. We need to go yet deeper still to fulfill our pointless quest. We will have control of the builtins.
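You can watch this shortcut in action without any ctypes at all: operators are dispatched through the type, never the instance, so even a per-instance __add__ is ignored by + (class name invented for this sketch):

```python
class C:
    def __add__(self, other):
        return "from the class"

c = C()
c.__add__ = lambda other: "from the instance"  # attach to the instance
print(c + c)          # the operator goes via the type: "from the class"
print(c.__add__(c))   # explicit lookup finds the instance attribute
```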

Full type mappings

Back to the CPython source. What is in this type information that's in the PyObject struct we looked at earlier? The answer: lots of stuff that I am not going to put here, but the most interesting parts for our purposes are the method suites:

typedef struct _typeobject {
    ...
    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;
    ...
} PyTypeObject;

This struct contains other structs defining the behaviour of the type via function pointers. We're specifically interested in this tp_as_number member. Its first member, nb_add, is the function pointer to the add method. This is what we want to overwrite. This is our new target.

typedef struct {
    binaryfunc nb_add;
    binaryfunc nb_subtract;
    binaryfunc nb_multiply;
    ...
} PyNumberMethods;

So, just as we made the ctypes mappings before, we want to do the same for this entire PyTypeObject struct. It's big...so I'm not putting it all here!

import ctypes

class PyObject(ctypes.Structure):
    pass

class PyTypeObject(ctypes.Structure):
    pass

Py_ssize_t = ctypes.c_ssize_t
binaryfunc = ctypes.CFUNCTYPE(
        ctypes.POINTER(PyObject),
        ctypes.POINTER(PyObject),
        ctypes.POINTER(PyObject))

class PyNumberMethods(ctypes.Structure):
    _fields_ = [
            ("nb_add", binaryfunc),
            ("nb_subtract", binaryfunc),
            ("nb_multiply", binaryfunc),
            ...]

PyTypeObject._fields_ = [
        ...
        ("tp_as_number", ctypes.POINTER(PyNumberMethods)),
        ...]

PyObject._fields_ = [
        ("ob_refcnt", Py_ssize_t),
        ("ob_type", ctypes.POINTER(PyTypeObject))]

So here we've basically made a Python mapping of the structs we have in C. If we cast our Python int type to the equivalent type struct, we'll reveal the secrets usually hidden from us.

>>> PyLong_Type = ctypes.cast(id(int), ctypes.POINTER(PyTypeObject)).contents
>>> PyLong_Type.tp_as_number.contents.nb_add = PyLong_Type.tp_as_number.contents.nb_subtract
>>> 10 + 4
6
>>> 1 + 1
0
>>> 1 + 3
-2

We did it!!! Incredible.
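If you'd like your interpreter back afterwards, stash the original pointer before clobbering it. Here's a self-contained sketch of the whole dance; the leading PyTypeObject fields are written out only far enough to reach tp_as_number, and the layout is an assumption that holds for a standard (non-free-threaded) CPython 3.8+ build:

```python
import ctypes

class PyTypeObject(ctypes.Structure):
    pass

binaryfunc = ctypes.CFUNCTYPE(
        ctypes.py_object, ctypes.py_object, ctypes.py_object)

class PyNumberMethods(ctypes.Structure):
    # truncated: only the slots we touch need declaring
    _fields_ = [
            ("nb_add", binaryfunc),
            ("nb_subtract", binaryfunc),
            ("nb_multiply", binaryfunc)]

# Leading fields of PyTypeObject, just far enough to reach tp_as_number
# (every entry before it is pointer- or Py_ssize_t-sized)
PyTypeObject._fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        ("tp_name", ctypes.c_char_p),
        ("tp_basicsize", ctypes.c_ssize_t),
        ("tp_itemsize", ctypes.c_ssize_t),
        ("tp_dealloc", ctypes.c_void_p),
        ("tp_vectorcall_offset", ctypes.c_ssize_t),
        ("tp_getattr", ctypes.c_void_p),
        ("tp_setattr", ctypes.c_void_p),
        ("tp_as_async", ctypes.c_void_p),
        ("tp_repr", ctypes.c_void_p),
        ("tp_as_number", ctypes.POINTER(PyNumberMethods))]

PyLong_Type = ctypes.cast(id(int), ctypes.POINTER(PyTypeObject)).contents
nm = PyLong_Type.tp_as_number.contents
saved_add = nm.nb_add            # stash the real long_add pointer
nm.nb_add = nm.nb_subtract       # chaos reigns
a, b = 10, 4                     # variables, so 10 + 4 isn't constant-folded
broken = a + b                   # 6!
nm.nb_add = saved_add            # order restored
print(broken, a + b)
```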

But now we know how to patch built-ins...what if we went further? What if we added functionality that wasn't there before, rather than altering existing functionality?

Nice immutable string you got there. It would be a shame if something should...happen to it 😏

In Python, strings are immutable. You can't reach in and change one of the characters – you have to create a new string object. Even when you append characters to an existing string variable, a new string object is created behind the scenes.[3]
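A quick demonstration of that immutability, and of the new-object-per-change behaviour:

```python
s = "immutable"
before = id(s)
try:
    s[0] = "I"                 # not allowed on a str...
except TypeError as e:
    print(e)                   # 'str' object does not support item assignment
s = "I" + s[1:]                # ...so we build a whole new string instead
print(s, id(s) != before)     # the old object is still alive, so: True
```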

What if we made strings mutable?

Let's have a look at PyUnicode_Type in the CPython source...and then swiftly look away from all 16k lines of it, cos it's distressingly complex – it has to handle Unicode and all its intricacies as well as ASCII. Good times. We want to find the tp_as_mapping member of the PyUnicode_Type struct:

static PyMappingMethods unicode_as_mapping = {
    (lenfunc)unicode_length,        /* mp_length */
    (binaryfunc)unicode_subscript,  /* mp_subscript */
    (objobjargproc)0,               /* mp_ass_subscript */
};

We want to create a new function to point to from the mp_ass_subscript member. Here's my extremely hacky version, which by no means handles every case. But I think it will let us do what we want.

static int
unicode_ass_subscript(PyUnicodeObject* self, PyObject* item, PyObject* value)
{
    Py_ssize_t i = ((PyLongObject*)(item))->ob_digit[0];
    unsigned int kind = ((PyASCIIObject*)(self))->state.kind;
    char* data = ((char*)((PyASCIIObject*)(self) + 1));
    char* new_data = ((char*)((PyASCIIObject*)(value) + 1));
    *(data + kind * i) = *new_data;
    return 0;
}

static PyMappingMethods unicode_as_mapping = {
    (lenfunc)unicode_length,                 /* mp_length */
    (binaryfunc)unicode_subscript,           /* mp_subscript */
    (objobjargproc)unicode_ass_subscript,    /* mp_ass_subscript */
};

I don't want to just change the source code of the Python binary I'm using. That is cheating. I want to break Python from the inside.

But I can use this to get the machine code that I want to replace the subscript assignment function with. (This is now just stupid compared to what we were doing before...and not at all portable. But we are doing it.)

First, I built CPython with this new source code in there. I can retrieve the machine code generated by our new function using objdump:

objdump -d Objects/unicodeobject.o

00000000000000f4 <unicode_ass_subscript>:
      f4:   8b 4e 18                mov    0x18(%rsi),%ecx
      f7:   0f b6 47 20             movzbl 0x20(%rdi),%eax
      fb:   c0 e8 02                shr    $0x2,%al
      fe:   83 e0 07                and    $0x7,%eax
     101:   48 0f af c1             imul   %rcx,%rax
     105:   0f b6 52 30             movzbl 0x30(%rdx),%edx
     109:   88 54 07 30             mov    %dl,0x30(%rdi,%rax,1)
     10d:   b8 00 00 00 00          mov    $0x0,%eax
     112:   c3                      retq

Then, in a standard un-patched Python session, let's copy the machine codes we've just compiled from the C, and make a new function pointer that points to them:

import ctypes
import mmap

# PyObject and PyTypeObject are the ctypes mappings we built earlier;
# objobjargproc is the C signature of mp_ass_subscript
objobjargproc = ctypes.CFUNCTYPE(
        ctypes.c_int,
        ctypes.POINTER(PyObject),
        ctypes.POINTER(PyObject),
        ctypes.POINTER(PyObject))

PyUnicode_Type = ctypes.cast(id(str), ctypes.POINTER(PyTypeObject)).contents

payload = (
        b"\x8b\x4e\x18\x0f\xb6\x47\x20\xc0\xe8\x02\x83\xe0\x07\x48\x0f\xaf"
        b"\xc1\x0f\xb6\x52\x30\x88\x54\x07\x30\xb8\x00\x00\x00\x00\xc3")
buf = mmap.mmap(
        -1,
        len(payload),
        prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(payload)
fpointer = ctypes.c_void_p.from_buffer(buf)  # shares memory with buf
bad_boi = objobjargproc(ctypes.addressof(fpointer))
PyUnicode_Type.tp_as_mapping.contents.mp_ass_subscript = bad_boi

Before we scored this righteous hack, this would happen:

>>> x = "hack the planet"
>>> x[1] = "4"
TypeError: 'str' object does not support item assignment

but now...

>>> x = "hack the planet"
>>> x[1] = "4"
>>> print(x)
h4ck the planet


And so our pointless quest is over. I hope you had fun. Sometimes it is good to remember that computers can be just for fun. 😊

  1. More information about this is in the Data Model. In short, there is a defined resolution order for class hierarchies: if x.__dict__ doesn't have the key "y", we next look in the base classes of x, &c. 

  2. I first saw Armin Ronacher do this on Twitter, so a lot of credit is due to him for this great trick. Someone has uploaded his code here (I can't find the original tweet) 

  3. There is a good explanation as to why here