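Whenever the Python interpreter executes a Python program file, the first thing it does is compile the Python source code in that file. The main result of compilation is a sequence of Python byte code, which is then handed to the Python virtual machine (Virtual Machine). The virtual machine executes the byte code instructions one by one, in order, and thereby runs the program.
For the Python compiler, the PyCodeObject is the real result of compilation; a pyc file is just the on-disk representation of that object. The two are different forms of the same thing: the result of compiling a source file.
While the program runs, the compiled result lives in memory as a PyCodeObject. For an imported module, the compiled result is also written to a pyc file, which remains after Python exits. The next time the same program runs, Python builds the in-memory PyCodeObject directly from the result recorded in the pyc file, without compiling the source file again.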
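With a clear picture of the whole flow, it is entirely possible to write a tool that parses pyc files generated by Python 3.7. Organized as JSON, the content of a pyc file looks like the figure below:
The only purpose of writing the tool is to understand the flow better. In practice, Python's dis module can print more detailed and clearer output, as in the figure below: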
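As a rough illustration of what the dis module provides, here is a minimal sketch. The source string and the `<demo>` filename are made up for the example, and the exact instructions printed depend on the Python version in use.

```python
import dis

# Illustrative source, not taken from the article.
source = "a = 1\nprint(a)\n"
code = compile(source, "<demo>", "exec")

# dis.get_instructions yields one Instruction object per bytecode instruction.
opnames = [ins.opname for ins in dis.get_instructions(code)]

# dis.dis prints a human-readable listing of the same instructions to stdout.
dis.dis(code)
```

The listing shows offsets, opcode names and arguments, which is already more than the JSON tool in this article extracts.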
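// code.h
typedef struct {
    PyObject_HEAD
    int co_argcount;            /* #arguments, except *args */
    int co_kwonlyargcount;      /* #keyword only arguments */
    int co_nlocals;             /* #local variables */
    int co_stacksize;           /* #entries needed for evaluation stack */
    int co_flags;               /* CO_..., see below */
    int co_firstlineno;         /* first source line number */
    PyObject *co_code;          /* instruction opcodes */
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    PyObject *co_varnames;      /* tuple of strings (local variable names) */
    PyObject *co_freevars;      /* tuple of strings (free variable names) */
    PyObject *co_cellvars;      /* tuple of strings (cell variable names) */
    Py_ssize_t *co_cell2arg;
    PyObject *co_filename;      /* unicode (where it was loaded from) */
    PyObject *co_name;          /* unicode (name, for reference) */
    PyObject *co_lnotab;        /* string (encoding addr<->lineno mapping) */
    void *co_zombieframe;       /* for optimization only */
    PyObject *co_weakreflist;   /* to support weakrefs to code objects */
    void *co_extra;
} PyCodeObject;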
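Most of these C fields are visible from Python as attributes of a code object. The sketch below uses a hypothetical function named `demo` to show a few of them; it is only meant to connect the struct to something that can be inspected interactively.

```python
def demo(a, b, *, c=0):
    # Hypothetical function used only to inspect its code object.
    x = a + b + c
    return x

co = demo.__code__

fields = {
    "co_argcount": co.co_argcount,              # positional parameters: a, b
    "co_kwonlyargcount": co.co_kwonlyargcount,  # keyword-only parameters: c
    "co_nlocals": co.co_nlocals,                # a, b, c and x
    "co_varnames": co.co_varnames,
    "co_name": co.co_name,
}

for name, value in fields.items():
    print(name, "=", value)
```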
# test.py
class A:
    pass

def Fun():
    pass

a = A()
Fun()

# pyc_generator.py
import imp
import sys

def generate_pyc(name):
    fp, pathname, description = imp.find_module(name)
    try:
        imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()

if __name__ == '__main__':
    generate_pyc(sys.argv[1])
Running the following command on the command line generates the pyc file:

$ ./python3.7 pyc_generator.py test
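The imp module used by pyc_generator.py was deprecated and has been removed in Python 3.12. On newer interpreters a pyc file can be produced with the standard py_compile module instead. The following is only a sketch of that alternative, not what the article itself uses; the temporary directory and output filename are assumptions made for the example.

```python
import os
import py_compile
import tempfile

# Write a copy of test.py into a scratch directory (assumed location).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "test.py")
with open(src_path, "w") as f:
    f.write("class A:\n    pass\n\ndef Fun():\n    pass\n\na = A()\nFun()\n")

# Compile without importing or executing the module; returns the pyc path.
pyc_path = py_compile.compile(
    src_path,
    cfile=os.path.join(workdir, "test.pyc"),
    doraise=True,
)
print(pyc_path)
```

Unlike the import-based approach, py_compile does not execute the module body, so it does not go through the load_module call chain described next.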
Starting from imp.load_module in pyc_generator.py above, the call sequence is as follows:

// imp.py
load_module
=> load_source
// _bootstrap.py
[1] => _load
=> _load_unlocked
// _bootstrap_external.py
=> exec_module
=> get_code

get_code calls source_to_code to produce the PyCodeObject, calls _code_to_timestamp_pyc to turn the PyCodeObject into binary data, and calls _cache_bytecode to write that data to a file.
Note that a real Python interpreter does not call the _load function in _bootstrap.py (marked [1] in the call sequence above). In Lib/importlib/__init__.py:

# __init__.py
try:
    import _frozen_importlib as _bootstrap
except ImportError:
    from . import _bootstrap
    _bootstrap._setup(sys, _imp)
else:
    # do sth

try:
    import _frozen_importlib_external as _bootstrap_external
except ImportError:
    from . import _bootstrap_external
    _bootstrap_external._setup(_bootstrap)
    _bootstrap._bootstrap_external = _bootstrap_external
else:
    # do sth
So what actually gets called is _load in _frozen_importlib rather than _load in _bootstrap. The contents of that library are defined in Python/importlib.h.
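This can be checked from a running interpreter. The sketch below assumes CPython, where the frozen module is registered in sys.modules under the name `_frozen_importlib`; `importlib._bootstrap` is a private attribute and other implementations may behave differently.

```python
import importlib
import sys

# On CPython the frozen copy is registered under this name at startup.
frozen = sys.modules.get("_frozen_importlib")

# True when importlib is using the frozen module rather than Lib/importlib/_bootstrap.py.
is_frozen_bootstrap = frozen is not None and importlib._bootstrap is frozen
print(is_frozen_bootstrap)
```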
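I do not fully understand why it is done this way, but for the purposes of this walkthrough I swapped in _bootstrap, which makes the source easier to read. The following sections look in detail at how the PyCodeObject is produced, how it is converted into binary data, and how that data is written to a file.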
// _bootstrap_external.py
source_to_code
// _bootstrap.py
=> _call_with_frames_removed
// bltinmodule.c
=> builtin_compile_impl

The C source of builtin_compile_impl is as follows:

// bltinmodule.c
static PyObject *
builtin_compile_impl(PyObject *module, PyObject *source, PyObject *filename,
                     const char *mode, int flags, int dont_inherit,
                     int optimize)
{
    PyObject *source_copy;
    const char *str;
    int compile_mode = -1;
    int is_ast;
    PyCompilerFlags cf;
    int start[] = {Py_file_input, Py_eval_input, Py_single_input};
    PyObject *result;

    cf.cf_flags = flags | PyCF_SOURCE_IS_UTF8;

    if (flags &
        ~(PyCF_MASK | PyCF_MASK_OBSOLETE | PyCF_DONT_IMPLY_DEDENT | PyCF_ONLY_AST))
    {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): unrecognised flags");
        goto error;
    }
    /* XXX Warn if (supplied_flags & PyCF_MASK_OBSOLETE) != 0? */

    if (optimize < -1 || optimize > 2) {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): invalid optimize value");
        goto error;
    }

    if (!dont_inherit) {
        PyEval_MergeCompilerFlags(&cf);
    }

    if (strcmp(mode, "exec") == 0)
        compile_mode = 0;
    else if (strcmp(mode, "eval") == 0)
        compile_mode = 1;
    else if (strcmp(mode, "single") == 0)
        compile_mode = 2;
    else {
        PyErr_SetString(PyExc_ValueError,
                        "compile() mode must be 'exec', 'eval' or 'single'");
        goto error;
    }

    is_ast = PyAST_Check(source);
    if (is_ast == -1)
        goto error;
    if (is_ast) {
        // do sth.
    }

    str = source_as_string(source, "compile", "string, bytes or AST",
                           &cf, &source_copy);
    if (str == NULL)
        goto error;

    result = Py_CompileStringObject(str, filename, start[compile_mode],
                                    &cf, optimize);
    Py_XDECREF(source_copy);
    goto finally;

error:
    result = NULL;
finally:
    Py_DECREF(filename);
    return result;
}
Within it, Py_CompileStringObject is defined as follows:

// pythonrun.c
PyObject *
Py_CompileStringObject(const char *str, PyObject *filename, int start,
                       PyCompilerFlags *flags, int optimize)
{
    PyCodeObject *co;
    mod_ty mod;
    PyArena *arena = PyArena_New();
    if (arena == NULL)
        return NULL;

    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod == NULL) {
        PyArena_Free(arena);
        return NULL;
    }
    if (flags && (flags->cf_flags & PyCF_ONLY_AST)) {
        PyObject *result = PyAST_mod2obj(mod);
        PyArena_Free(arena);
        return result;
    }
    co = PyAST_CompileObject(mod, filename, flags, optimize, arena);
    PyArena_Free(arena);
    return (PyObject *)co;
}

PyParser_ASTFromStringObject builds the syntax tree, and PyAST_CompileObject turns it into a PyCodeObject. Parsing and compilation are not analyzed further here.
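The same two stages can be driven from Python with the standard ast module and the built-in compile(). This is a sketch of the idea rather than a trace of the C code; the source string and the `<demo>` filename are invented for the example.

```python
import ast

source = "result = 6 * 7\n"

# Stage 1: source text -> syntax tree.
tree = ast.parse(source, filename="<demo>", mode="exec")

# Stage 2: syntax tree -> code object.
code = compile(tree, "<demo>", "exec")

namespace = {}
exec(code, namespace)
print(namespace["result"])

# With PyCF_ONLY_AST, compile() stops after parsing and returns the tree,
# matching the PyCF_ONLY_AST branch in Py_CompileStringObject.
only_ast = compile(source, "<demo>", "exec", flags=ast.PyCF_ONLY_AST)
```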
_code_to_timestamp_pyc converts the PyCodeObject into binary data. Its source is:

# _bootstrap_external.py
def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
    "Produce the data for a timestamp-based pyc."
    data = bytearray(MAGIC_NUMBER)
    data.extend(_w_long(0))
    data.extend(_w_long(mtime))
    data.extend(_w_long(source_size))
    data.extend(marshal.dumps(code))
    return data
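A pyc file therefore consists of several parts: the magic number, a flags word (0 for a timestamp-based pyc), the modification time of the source file, the size of the source file, and the marshalled PyCodeObject.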
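The layout produced by _code_to_timestamp_pyc can be reproduced and read back with public modules. The helper names `build_timestamp_pyc` and `split_pyc` below are hypothetical, and the 16-byte header is assumed to match the interpreter running the code (Python 3.7 or later).

```python
import importlib.util
import marshal
import struct

def build_timestamp_pyc(code, mtime=0, source_size=0):
    """Hypothetical helper mirroring the layout of _code_to_timestamp_pyc."""
    data = bytearray(importlib.util.MAGIC_NUMBER)         # 4 bytes: magic + b'\r\n'
    data += struct.pack("<I", 0)                          # flags, 0 = timestamp-based
    data += struct.pack("<I", mtime & 0xFFFFFFFF)         # source mtime
    data += struct.pack("<I", source_size & 0xFFFFFFFF)   # source size
    data += marshal.dumps(code)                           # the code object itself
    return bytes(data)

def split_pyc(data):
    """Hypothetical helper: split pyc bytes into header fields and code object."""
    magic = data[:4]
    flags, mtime, size = struct.unpack("<III", data[4:16])
    code = marshal.loads(data[16:])
    return magic, flags, mtime, size, code

source = "answer = 40 + 2\n"
pyc = build_timestamp_pyc(compile(source, "<demo>", "exec"),
                          mtime=1, source_size=len(source))
magic, flags, mtime, size, code = split_pyc(pyc)

ns = {}
exec(code, ns)
print(flags, mtime, size, ns["answer"])
```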
marshal.dumps calls marshal_dumps_impl:

// marshal.c
static PyObject *
marshal_dumps_impl(PyObject *module, PyObject *value, int version)
/*[clinic end generated code: output=9c200f98d7256cad input=a2139ea8608e9b27]*/
{
    return PyMarshal_WriteObjectToString(value, version);
}

The source of PyMarshal_WriteObjectToString is:

// marshal.c
PyObject *
PyMarshal_WriteObjectToString(PyObject *x, int version)
{
    WFILE wf;

    memset(&wf, 0, sizeof(wf));
    wf.str = PyBytes_FromStringAndSize((char *)NULL, 50);
    if (wf.str == NULL)
        return NULL;
    wf.ptr = wf.buf = PyBytes_AS_STRING((PyBytesObject *)wf.str);
    wf.end = wf.ptr + PyBytes_Size(wf.str);
    wf.error = WFERR_OK;
    wf.version = version;
    if (w_init_refs(&wf, version)) {
        Py_DECREF(wf.str);
        return NULL;
    }
    w_object(x, &wf);
    w_clear_refs(&wf);
    if (wf.str != NULL) {
        char *base = PyBytes_AS_STRING((PyBytesObject *)wf.str);
        if (wf.ptr - base > PY_SSIZE_T_MAX) {
            Py_DECREF(wf.str);
            PyErr_SetString(PyExc_OverflowError,
                            "too much marshal data for a bytes object");
            return NULL;
        }
        if (_PyBytes_Resize(&wf.str, (Py_ssize_t)(wf.ptr - base)) < 0)
            return NULL;
    }
    if (wf.error != WFERR_OK) {
        Py_XDECREF(wf.str);
        if (wf.error == WFERR_NOMEMORY)
            PyErr_NoMemory();
        else
            PyErr_SetString(PyExc_ValueError,
              (wf.error==WFERR_UNMARSHALLABLE)?"unmarshallable object"
               :"object too deeply nested to marshal");
        return NULL;
    }
    return wf.str;
}
The key function here is w_object, which calls w_complex_object. The actual conversion of a PyCodeObject into binary data happens inside w_complex_object:

// marshal.c
static void
w_complex_object(PyObject *v, char flag, WFILE *p)
{
    // do sth.
    else if (PyCode_Check(v)) {
        PyCodeObject *co = (PyCodeObject *)v;
        W_TYPE(TYPE_CODE, p);
        w_long(co->co_argcount, p);
        w_long(co->co_kwonlyargcount, p);
        w_long(co->co_nlocals, p);
        w_long(co->co_stacksize, p);
        w_long(co->co_flags, p);
        w_object(co->co_code, p);
        w_object(co->co_consts, p);
        w_object(co->co_names, p);
        w_object(co->co_varnames, p);
        w_object(co->co_freevars, p);
        w_object(co->co_cellvars, p);
        w_object(co->co_filename, p);
        w_object(co->co_name, p);
        w_long(co->co_firstlineno, p);
        w_object(co->co_lnotab, p);
    }
    // do sth.
}
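The effect of W_TYPE(TYPE_CODE, p) can be observed from Python: the first byte of a marshalled code object is the type code 'c', possibly combined with the FLAG_REF bit. The sketch below only inspects that first byte; the fields that follow it differ between Python versions, so it does not attempt to decode them.

```python
import marshal

FLAG_REF = 0x80  # same bit value as FLAG_REF in marshal.c

code = compile("x = 1\n", "<demo>", "exec")
data = marshal.dumps(code)

type_byte = data[0] & ~FLAG_REF          # strip the reference flag
has_ref_flag = bool(data[0] & FLAG_REF)  # whether the object was registered as shareable

print(chr(type_byte), has_ref_flag)
```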
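In the parsed output, co_filename is stored in full only for the first code object (class A); the later ones (Fun and the module itself) only hold a reference to it:

// class A
"co_filename": {
  "type": "unicode",
  "size": 49,
  "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}
// def Fun
"co_filename": {
  "type": "ref",
  "ref": 6
}
// test.py
"co_filename": {
  "type": "ref",
  "ref": 6
}

This is implemented by w_ref, whose source is shown below. It keeps a hash table whose keys are object addresses and whose values are indices. If an object with the same address is already in the table, only TYPE_REF and the index are written, which saves space.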
// marshal.c
static int
w_ref(PyObject *v, char *flag, WFILE *p)
{
    _Py_hashtable_entry_t *entry;
    int w;

    if (p->version < 3 || p->hashtable == NULL) {
        return 0; /* not writing object references */
    }

    /* if it has only one reference, it definitely isn't shared */
    if (Py_REFCNT(v) == 1) {
        return 0;
    }

    entry = _Py_HASHTABLE_GET_ENTRY(p->hashtable, v);
    if (entry != NULL) {
        /* write the reference index to the stream */
        _Py_HASHTABLE_ENTRY_READ_DATA(p->hashtable, entry, w);
        /* we don't store "long" indices in the dict */
        assert(0 <= w && w <= 0x7fffffff);
        w_byte(TYPE_REF, p);
        w_long(w, p);
        return 1;
    } else {
        size_t s = p->hashtable->entries;
        /* we don't support long indices */
        if (s >= 0x7fffffff) {
            PyErr_SetString(PyExc_ValueError, "too many objects");
            goto err;
        }
        w = (int)s;
        Py_INCREF(v);
        if (_Py_HASHTABLE_SET(p->hashtable, v, w) < 0) {
            Py_DECREF(v);
            goto err;
        }
        *flag |= FLAG_REF;
        return 0;
    }
err:
    p->error = WFERR_UNMARSHALLABLE;
    return 1;
}
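The space saving is easy to observe through the public marshal API. In the sketch below the same string object appears twice in a tuple; with marshal version 4 the second occurrence should be written as a reference, while with version 2 the `p->version < 3` check makes w_ref return immediately and the string is written twice. The payload is invented for the example.

```python
import marshal

# Build the string at run time so both tuple slots hold the very same object.
shared = "".join(["py"] * 50)   # 100 ASCII characters
payload = (shared, shared)

with_refs = marshal.dumps(payload, 4)      # version >= 3: references enabled
without_refs = marshal.dumps(payload, 2)   # version < 3: no references

print(len(with_refs), len(without_refs))

# Loading resolves the reference back to a single shared object.
restored = marshal.loads(with_refs)
```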
The reverse process is implemented as follows. If flag is non-zero, the actual value is appended to a list. If the type is TYPE_REF, the real value is fetched from that list using the index that was read.

static PyObject *
r_object(RFILE *p)
{
    PyObject *v, *v2;
    Py_ssize_t idx = 0;
    long i, n;
    int type, code = r_byte(p);
    int flag, is_interned = 0;
    PyObject *retval = NULL;

    if (code == EOF) {
        PyErr_SetString(PyExc_EOFError,
                        "EOF read where object expected");
        return NULL;
    }

    p->depth++;

    if (p->depth > MAX_MARSHAL_STACK_DEPTH) {
        p->depth--;
        PyErr_SetString(PyExc_ValueError, "recursion limit exceeded");
        return NULL;
    }

    flag = code & FLAG_REF;
    type = code & ~FLAG_REF;

#define R_REF(O) do{\
    if (flag) \
        O = r_ref(O, flag, p);\
} while (0)

    switch (type) {
    // do sth.
    case TYPE_REF:
        n = r_long(p);
        if (n < 0 || n >= PyList_GET_SIZE(p->refs)) {
            if (n == -1 && PyErr_Occurred())
                break;
            PyErr_SetString(PyExc_ValueError,
                            "bad marshal data (invalid reference)");
            break;
        }
        v = PyList_GET_ITEM(p->refs, n);
        if (v == Py_None) {
            PyErr_SetString(PyExc_ValueError,
                            "bad marshal data (invalid reference)");
            break;
        }
        Py_INCREF(v);
        retval = v;
        break;
    // do sth.
    }
}
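The reader side can be modelled with a toy decoder. The token format below, a list of (type, flag, payload) triples, is an invention for this sketch and is not the real marshal byte format; it only mirrors how the refs list is filled and consulted.

```python
FLAG_REF = 0x80

def read_stream(tokens):
    """Toy reader over (type, flag, payload) triples, not real marshal bytes."""
    refs = []   # plays the role of p->refs in r_object
    out = []
    for type_, flag, payload in tokens:
        if type_ == "ref":
            # payload is an index into the list of previously flagged values
            value = refs[payload]
        else:
            value = payload
            if flag & FLAG_REF:
                refs.append(value)   # remember it for later TYPE_REF lookups
        out.append(value)
    return out

tokens = [
    ("str", FLAG_REF, "test.py"),   # flagged: stored at index 0
    ("str", 0, "Fun"),              # not flagged: never referenced again
    ("ref", 0, 0),                  # resolves to the value at index 0
]
decoded = read_stream(tokens)
print(decoded)
```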
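One open question remains: why does w_ref not use the value of flag to decide which objects are written into the hash table, the way r_object does? I have not figured this out yet.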
_cache_bytecode is responsible for writing the binary data to a file. Its source is:

# _bootstrap_external.py
def _cache_bytecode(self, source_path, bytecode_path, data):
    # Adapt between the two APIs
    mode = _calc_mode(source_path)
    return self.set_data(bytecode_path, data, _mode=mode)

The source of set_data is:

def set_data(self, path, data, *, _mode=0o666):
    """Write bytes data to a file."""
    parent, filename = _path_split(path)
    path_parts = []
    # Figure out what directories are missing.
    while parent and not _path_isdir(parent):
        parent, part = _path_split(parent)
        path_parts.append(part)
    # Create needed directories.
    for part in reversed(path_parts):
        parent = _path_join(parent, part)
        try:
            _os.mkdir(parent)
        except FileExistsError:
            # Probably another Python process already created the dir.
            continue
        except OSError as exc:
            # Could be a permission error, read-only filesystem: just forget
            # about writing the data.
            _bootstrap._verbose_message('could not create {!r}: {!r}',
                                        parent, exc)
            return
    try:
        _write_atomic(path, data, _mode)
        _bootstrap._verbose_message('created {!r}', path)
    except OSError as exc:
        # Same as above: just don't write the bytecode.
        _bootstrap._verbose_message('could not create {!r}: {!r}', path,
                                    exc)

The key function for writing the file is _write_atomic, shown below. It writes to a temporary file and then renames it, which guarantees that either an exception occurs and the target file is not produced, or no exception occurs and a file with the requested name is produced.

def _write_atomic(path, data, mode=0o666):
    """Best-effort function to write data to a path atomically.
    Be prepared to handle a FileExistsError if concurrent writing of the
    temporary file is attempted."""
    # id() is used to generate a pseudo-random filename.
    path_tmp = '{}.{}'.format(path, id(path))
    fd = _os.open(path_tmp,
                  _os.O_EXCL | _os.O_CREAT | _os.O_WRONLY, mode & 0o666)
    try:
        # We first write data to a temporary file, and then use os.replace() to
        # perform an atomic rename.
        with _io.FileIO(fd, 'wb') as file:
            file.write(data)
        _os.replace(path_tmp, path)
    except OSError:
        try:
            _os.unlink(path_tmp)
        except OSError:
            pass
        raise
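The same write-then-rename pattern can be tried outside importlib using only the public os module. The function name `write_atomic` and the scratch file below are assumptions for this sketch; it follows the structure of _write_atomic but is not the library's code.

```python
import os
import tempfile

def write_atomic(path, data, mode=0o666):
    """Standalone sketch of the write-then-rename pattern."""
    tmp_path = "{}.{}".format(path, id(path))
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    flags = os.O_EXCL | os.O_CREAT | os.O_WRONLY | getattr(os, "O_BINARY", 0)
    fd = os.open(tmp_path, flags, mode & 0o666)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except OSError:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise

target = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_atomic(target, b"first")
write_atomic(target, b"second")   # replaces the existing file in one step

with open(target, "rb") as f:
    content = f.read()

# No temporary file should be left behind next to the target.
leftovers = [n for n in os.listdir(os.path.dirname(target)) if n != "demo.bin"]
print(content, leftovers)
```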
With the pyc generation flow understood, the tool mentioned in section 8.1 can be implemented. Its source is:
# -*- coding:utf-8 -*-
import json
import datetime
import sys

FLAG_REF = 0x80
TYPE_CODE = ord('c')
TYPE_STRING = ord('s')
TYPE_SMALL_TUPLE = ord(')')
TYPE_INT = ord('i')
TYPE_SHORT_ASCII = ord('z')
TYPE_SHORT_ASCII_INTERNED = ord('Z')
TYPE_REF = ord('r')
TYPE_NONE = ord('N')

# index -> value of every object that was written with FLAG_REF set
REFS_HASH = {}


def parse_code(fp):
    code = int.from_bytes(fp.read(1), 'little')
    code_type = code & ~FLAG_REF
    code_flag = code & FLAG_REF
    # Reserve the slot before parsing children, as r_object does.
    idx = len(REFS_HASH)
    if code_flag:
        REFS_HASH[idx] = None
    code_dict = {}
    if code_type == TYPE_CODE:
        code_dict['type'] = 'code'
        code_dict['co_argcount'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_kwonlyargcount'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_nlocals'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_stacksize'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_flags'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_code'] = parse_code(fp)
        code_dict['co_consts'] = parse_code(fp)
        code_dict['co_names'] = parse_code(fp)
        code_dict['co_varnames'] = parse_code(fp)
        code_dict['co_freevars'] = parse_code(fp)
        code_dict['co_cellvars'] = parse_code(fp)
        code_dict['co_filename'] = parse_code(fp)
        code_dict['co_name'] = parse_code(fp)
        code_dict['co_firstlineno'] = int.from_bytes(fp.read(4), 'little')
        code_dict['co_lnotab'] = parse_code(fp)
    elif code_type == TYPE_STRING:
        code_dict['type'] = 'string'
        length = int.from_bytes(fp.read(4), 'little')
        code_dict['length'] = length
        # todo
        value = fp.read(length)
        code_dict['value'] = str(value)
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SMALL_TUPLE:
        code_dict['type'] = 'tuple'
        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size
        items = []
        for _ in range(size):
            items.append(parse_code(fp))
        code_dict['items'] = items
        if code_flag:
            REFS_HASH[idx] = code_dict['items']
    elif code_type == TYPE_INT:
        code_dict['type'] = 'long'
        # TYPE_INT is a signed 32-bit value
        value = int.from_bytes(fp.read(4), 'little', signed=True)
        code_dict['value'] = value
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type in (TYPE_SHORT_ASCII, TYPE_SHORT_ASCII_INTERNED):
        code_dict['type'] = 'unicode'
        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size
        code_dict['value'] = fp.read(size).decode()
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_REF:
        code_dict['type'] = 'ref'
        code_dict['ref'] = int.from_bytes(fp.read(4), 'little')
        code_dict['value'] = REFS_HASH[code_dict['ref']]
    elif code_type == TYPE_NONE:
        code_dict['type'] = 'none'
    else:
        # Unsupported type code: report it
        print(code_type)
    return code_dict


def parse_pyc(file_name):
    pyc_dict = {}
    with open(file_name, 'rb') as fp:
        magic_number = int.from_bytes(fp.read(2), 'little')
        # Python 3.7 magic numbers run from 3390 (3.7a1) to 3394 (3.7 final)
        if 3390 <= magic_number <= 3394:
            pyc_dict['version'] = 'Python 3.7'
        else:
            sys.exit('only support Python 3.7')
        _ = fp.read(2)  # b'\r\n' that follows the magic number
        _ = fp.read(4)  # flags word
        timestamp = int.from_bytes(fp.read(4), 'little')
        pyc_dict['modified'] = str(datetime.datetime.fromtimestamp(timestamp))
        source_size = int.from_bytes(fp.read(4), 'little')
        pyc_dict['size'] = source_size
        pyc_dict['code'] = parse_code(fp)
    return pyc_dict


if __name__ == '__main__':
    file_name = sys.argv[1]
    print(json.dumps(parse_pyc(file_name), indent=2))
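Analyzing test.py gives the following result:
TYPE_REF is handled. The value field below is resolved by the tool and is not actually present in the binary data:

"co_filename": {
  "type": "ref",
  "ref": 6,
  "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}

The instruction set is not processed yet.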
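As a possible starting point for that missing step, the raw co_code bytes can be split into instructions using the opcode table from the dis module. The sketch assumes the two-bytes-per-instruction layout used since Python 3.6, runs against the current interpreter's opcode table rather than a fixed 3.7 table, and does not fold EXTENDED_ARG prefixes into the following instruction. The function name `decode_wordcode` is hypothetical.

```python
import dis

def decode_wordcode(co_code):
    """Split raw bytecode into (offset, opname, arg) triples, two bytes each."""
    result = []
    for offset in range(0, len(co_code), 2):
        opcode = co_code[offset]
        arg = co_code[offset + 1]
        result.append((offset, dis.opname[opcode], arg))
    return result

code = compile("a = 1\n", "<demo>", "exec")
instructions = decode_wordcode(code.co_code)

for offset, name, arg in instructions:
    print(offset, name, arg)
```

To decode a 3.7 pyc from a different interpreter version, the opcode numbers would have to come from a table for 3.7 instead of the running interpreter's dis.opname.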
Reposted from: http://rlutx.baihongyu.com/