8. Python3 Source Code: Code Objects and pyc Files
Published: 2019-06-19


8.1. How a Python Program Is Executed

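Whenever the Python interpreter runs a Python program file, the first thing it does is compile the source code in that file. The main product of compilation is a sequence of Python byte code. The result is then handed to the Python virtual machine, which executes the byte code one instruction at a time and thereby runs the program.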

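From the compiler's point of view, the PyCodeObject is the real compilation result; a pyc file is only the on-disk form of that object. The two are different representations of the same thing: the result of compiling a source file.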

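While the program is running, the compilation result lives in memory as a PyCodeObject. For imported modules, Python also saves the result to a pyc file. The next time the same module is loaded, Python rebuilds the in-memory PyCodeObject directly from the pyc file instead of compiling the source again.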
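As a small illustration of where that cached result lives, the sketch below asks the import machinery which pyc path corresponds to a given source path. The file name test.py is only an example, and the exact directory and tag depend on the interpreter and its configuration.

```python
import importlib.util
import os

# Map a source path to the pyc path the import system would use for it.
source_path = os.path.abspath("test.py")  # the file does not need to exist
pyc_path = importlib.util.cache_from_source(source_path)
print(pyc_path)  # typically .../__pycache__/test.<cache tag>.pyc

# The reverse mapping recovers the source path from the pyc path.
round_trip = importlib.util.source_from_cache(pyc_path)
print(round_trip)
```

The cache tag in the file name (for example cpython-37) keeps pyc files from different interpreter versions apart.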

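Once the overall flow is clear, it is entirely possible to write a tool that parses a pyc file generated by Python 3.7 and presents its contents as JSON:

[Figure: contents of a pyc file rendered as JSON]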

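The only purpose of writing such a tool is to understand the flow better. In practice, Python's dis module prints more detailed and more readable output:

[Figure: output of the dis module]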
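A minimal sketch of using dis on the example program from section 8.2 follows. The source text is embedded as a string so the snippet is self-contained; the exact instructions printed differ between Python versions.

```python
import dis

SOURCE = """\
class A:
    pass

def Fun():
    pass

a = A()
Fun()
"""

# Compile the source text into a module-level code object.
module_code = compile(SOURCE, "test.py", "exec")

# dis.dis prints one line per bytecode instruction. On CPython 3.7 and
# later it also recurses into nested code objects (the bodies of A and Fun).
dis.dis(module_code)
```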

8.2. PyCodeObject Source Code

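// code.h
typedef struct {
    PyObject_HEAD
    int co_argcount;            /* number of arguments, except *args */
    int co_kwonlyargcount;      /* number of keyword-only arguments */
    int co_nlocals;             /* number of local variables */
    int co_stacksize;           /* entries needed for the evaluation stack */
    int co_flags;               /* CO_... flags */
    int co_firstlineno;         /* first source line number */
    PyObject *co_code;          /* instruction opcodes */
    PyObject *co_consts;        /* constants used */
    PyObject *co_names;         /* names used */
    PyObject *co_varnames;      /* local variable names */
    PyObject *co_freevars;      /* free variable names */
    PyObject *co_cellvars;      /* cell variable names */
    Py_ssize_t *co_cell2arg;
    PyObject *co_filename;      /* where the code was loaded from */
    PyObject *co_name;          /* name, for reference */
    PyObject *co_lnotab;        /* address-to-line-number mapping */
    void *co_zombieframe;
    PyObject *co_weakreflist;
    void *co_extra;
} PyCodeObject;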
  • Code Block:
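    When the Python compiler compiles source code, it creates one PyCodeObject for each Code Block in that code. Entering a new namespace, in other words a new scope, means entering a new Code Block. The code below, for example, has three code blocks: one for the whole test.py file, one for class A, and one for def Fun.
# test.py
class A:
    pass

def Fun():
    pass

a = A()
Fun()
  • Namespace:
    A namespace is the context in which a symbol is interpreted; what a symbol means depends on its namespace. More concretely, the value that a variable name refers to is not fixed in Python but is resolved through a namespace. Each Code Block corresponds to one namespace and to one PyCodeObject.
  • The code object in Python:
    Python exposes an object that corresponds to the C-level PyCodeObject: the code object. It is a thin wrapper around the PyCodeObject, and through it we can read the individual fields of the PyCodeObject.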
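The three code blocks can be observed from Python itself. The sketch below compiles the test.py source from a string and walks the nested code objects stored in co_consts. The helper name iter_code_objects is made up for this illustration.

```python
import types

SOURCE = "class A:\n    pass\n\ndef Fun():\n    pass\n\na = A()\nFun()\n"


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


module_code = compile(SOURCE, "test.py", "exec")
names = [c.co_name for c in iter_code_objects(module_code)]
print(names)

# The co_* attributes mirror the fields of the C struct.
for c in iter_code_objects(module_code):
    print(c.co_name, c.co_argcount, c.co_nlocals, c.co_firstlineno)
```

The module-level code object holds the code objects for A and Fun among its constants, which is the nesting relationship described later for the pyc file.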

8.3. Generating a pyc File

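# pyc_generator.py
# Note: the imp module is deprecated since Python 3.4 and was removed in 3.12.
import imp
import sys


def generate_pyc(name):
    # Importing the module makes the import system write its pyc file.
    fp, pathname, description = imp.find_module(name)
    try:
        imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()


if __name__ == '__main__':
    generate_pyc(sys.argv[1])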

Running the following command in a shell generates the pyc file:

$ ./python3.7 pyc_generator.py test
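On interpreters where imp is unavailable, the standard py_compile module is one alternative way to produce a pyc without importing the module. This is a sketch rather than what the article's script does: it compiles without executing the module, and it uses a temporary directory so that nothing is left behind.

```python
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "test.py")
    with open(source_path, "w") as f:
        f.write("class A:\n    pass\n\ndef Fun():\n    pass\n\na = A()\nFun()\n")

    # py_compile.compile returns the path of the pyc file it wrote.
    pyc_path = py_compile.compile(source_path, doraise=True)
    pyc_exists = os.path.exists(pyc_path)
    print(pyc_path, pyc_exists)
```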

8.3.1. Call Flow for Generating the PyCodeObject and the pyc File

Starting from imp.load_module in pyc_generator.py above, the call sequence is:

// imp.py
load_module
=> load_source
// _bootstrap.py [1]
=> _load
=> _load_unlocked
// _bootstrap_external.py
=> exec_module
=> get_code

Inside get_code, source_to_code creates the PyCodeObject, _code_to_timestamp_pyc converts the PyCodeObject to binary data, and _cache_bytecode writes that binary data to a file.
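The same three steps can be reproduced by hand with public APIs. This is a rough sketch and not the code path importlib takes: it assumes the 16-byte timestamp header of Python 3.7 and later, uses a hypothetical demo.py file name, and writes mtime as 0 for simplicity.

```python
import importlib.util
import marshal
import os
import struct
import tempfile

SOURCE = "x = 6 * 7\n"

# Step 1: source text -> code object (source_to_code does this via compile()).
code = compile(SOURCE, "demo.py", "exec")

# Step 2: code object -> pyc bytes.
# Assumed layout: magic, flags word, source mtime, source size, payload.
mtime, source_size = 0, len(SOURCE.encode("utf-8"))
data = bytearray(importlib.util.MAGIC_NUMBER)
data += struct.pack("<III", 0, mtime, source_size)
data += marshal.dumps(code)

# Step 3: pyc bytes -> file, then read it back to check the round trip.
with tempfile.TemporaryDirectory() as tmp:
    pyc_path = os.path.join(tmp, "demo.pyc")
    with open(pyc_path, "wb") as f:
        f.write(data)
    with open(pyc_path, "rb") as f:
        loaded = marshal.loads(f.read()[16:])  # skip the 16-byte header

namespace = {}
exec(loaded, namespace)
print(namespace["x"])
```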

Note that a real Python interpreter does not call the _load function in _bootstrap.py (marked [1] in the call sequence above). Lib/importlib/__init__.py contains:

# __init__.py
try:
    import _frozen_importlib as _bootstrap
except ImportError:
    from . import _bootstrap
    _bootstrap._setup(sys, _imp)
else:
    ...  # do sth

try:
    import _frozen_importlib_external as _bootstrap_external
except ImportError:
    from . import _bootstrap_external
    _bootstrap_external._setup(_bootstrap)
    _bootstrap._bootstrap_external = _bootstrap_external
else:
    ...  # do sth

So the function that actually runs is _load in _frozen_importlib, not _load in _bootstrap. The contents of that frozen library are defined in Python/importlib.h.

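I do not fully understand why it is done this way, but for the analysis of the overall flow I substitute _bootstrap here, which makes the source easier to read.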
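A likely reason for the frozen copy is bootstrapping: the import machinery has to be available before the interpreter can import anything from the file system, so CPython embeds a precompiled copy of _bootstrap.py. The sketch below checks, on CPython only (an assumption; other implementations may differ), that importlib._bootstrap is the same module object as _frozen_importlib, which is why reading _bootstrap.py is a fair substitute.

```python
import importlib
import sys

# The frozen module is registered in sys.modules during interpreter startup.
frozen = sys.modules["_frozen_importlib"]

# On CPython the importlib package aliases the frozen module under the
# name importlib._bootstrap instead of importing _bootstrap.py from disk.
same_module = importlib._bootstrap is frozen
print(same_module)
print(frozen.__name__)
```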

The following sections look in detail at creating the PyCodeObject, converting it to binary data, and writing that binary data to a file.

8.3.2. Creating the PyCodeObject

// _bootstrap_external.py
source_to_code
// _bootstrap.py
=> _call_with_frames_removed
// bltinmodule.c
=> builtin_compile_impl

The C source of builtin_compile_impl is:

// bltinmodule.c
static PyObject *
builtin_compile_impl(PyObject *module, PyObject *source, PyObject *filename,
                     const char *mode, int flags, int dont_inherit,
                     int optimize)
{
    PyObject *source_copy;
    const char *str;
    int compile_mode = -1;
    int is_ast;
    PyCompilerFlags cf;
    int start[] = {Py_file_input, Py_eval_input, Py_single_input};
    PyObject *result;

    cf.cf_flags = flags | PyCF_SOURCE_IS_UTF8;

    if (flags &
        ~(PyCF_MASK | PyCF_MASK_OBSOLETE | PyCF_DONT_IMPLY_DEDENT | PyCF_ONLY_AST))
    {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): unrecognised flags");
        goto error;
    }
    /* XXX Warn if (supplied_flags & PyCF_MASK_OBSOLETE) != 0? */

    if (optimize < -1 || optimize > 2) {
        PyErr_SetString(PyExc_ValueError,
                        "compile(): invalid optimize value");
        goto error;
    }

    if (!dont_inherit) {
        PyEval_MergeCompilerFlags(&cf);
    }

    if (strcmp(mode, "exec") == 0)
        compile_mode = 0;
    else if (strcmp(mode, "eval") == 0)
        compile_mode = 1;
    else if (strcmp(mode, "single") == 0)
        compile_mode = 2;
    else {
        PyErr_SetString(PyExc_ValueError,
                        "compile() mode must be 'exec', 'eval' or 'single'");
        goto error;
    }

    is_ast = PyAST_Check(source);
    if (is_ast == -1)
        goto error;
    if (is_ast) {
        // do sth.
    }

    str = source_as_string(source, "compile", "string, bytes or AST", &cf, &source_copy);
    if (str == NULL)
        goto error;

    result = Py_CompileStringObject(str, filename, start[compile_mode], &cf, optimize);
    Py_XDECREF(source_copy);
    goto finally;

error:
    result = NULL;
finally:
    Py_DECREF(filename);
    return result;
}

In this function:

  • source_as_string obtains the source of test.py above as a C string in memory.
  • Py_CompileStringObject creates the PyCodeObject:
// pythonrun.c
PyObject *
Py_CompileStringObject(const char *str, PyObject *filename, int start,
                       PyCompilerFlags *flags, int optimize)
{
    PyCodeObject *co;
    mod_ty mod;
    PyArena *arena = PyArena_New();
    if (arena == NULL)
        return NULL;

    mod = PyParser_ASTFromStringObject(str, filename, start, flags, arena);
    if (mod == NULL) {
        PyArena_Free(arena);
        return NULL;
    }
    if (flags && (flags->cf_flags & PyCF_ONLY_AST)) {
        PyObject *result = PyAST_mod2obj(mod);
        PyArena_Free(arena);
        return result;
    }
    co = PyAST_CompileObject(mod, filename, flags, optimize, arena);
    PyArena_Free(arena);
    return (PyObject *)co;
}

PyParser_ASTFromStringObject builds the syntax tree, and PyAST_CompileObject turns it into the PyCodeObject. Parsing and compilation themselves are not analysed further here.
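The two stages are visible from Python through the ast module and the built-in compile(). The sketch below is an illustration at the Python level, not a trace of the C code: it parses a one-line source string into an AST, then compiles that AST into a code object.

```python
import ast

SOURCE = "a = 1 + 2\n"

# Stage 1: source text -> AST (PyParser_ASTFromStringObject at the C level).
tree = ast.parse(SOURCE, filename="test.py", mode="exec")

# The PyCF_ONLY_AST flag makes compile() stop after stage 1, which mirrors
# the early return in Py_CompileStringObject.
tree_via_flag = compile(SOURCE, "test.py", "exec", ast.PyCF_ONLY_AST)

# Stage 2: AST -> code object (PyAST_CompileObject at the C level).
code = compile(tree, "test.py", "exec")

namespace = {}
exec(code, namespace)
print(type(tree).__name__, type(code).__name__, namespace["a"])
```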

8.3.3. Converting the PyCodeObject to Binary Data

_code_to_timestamp_pyc converts the PyCodeObject to binary data. Its source is:

# _bootstrap_external.py
def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
    "Produce the data for a timestamp-based pyc."
    data = bytearray(MAGIC_NUMBER)
    data.extend(_w_long(0))
    data.extend(_w_long(mtime))
    data.extend(_w_long(source_size))
    data.extend(marshal.dumps(code))
    return data

This shows that a pyc file consists of the following parts:

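  • MAGIC_NUMBER: each version of the Python implementation defines its own MAGIC_NUMBER (for example 3392 for a Python 3.7 alpha and 3360 for Python 3.6a0), which prevents loading an incompatible pyc file;
  • 0: a flags word, added in Python 3.7 by PEP 552; the value 0 marks a timestamp-based pyc, while nonzero values mark hash-based pycs;
  • mtime: the last-modification time of the py file; if the source file's modification time still matches this value, the pyc is up to date and does not need to be regenerated;
  • source_size: the size of the source file;
  • marshal.dumps(code): the PyCodeObject serialized as a byte stream.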
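The layout listed above can be checked by reading the first 16 bytes of a freshly generated pyc. The sketch assumes Python 3.7 or later and explicitly requests a timestamp-based pyc so that the flags word is 0; the one-line source file is only an example.

```python
import os
import py_compile
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "test.py")
    with open(source_path, "w") as f:
        f.write("a = 1\n")
    pyc_path = py_compile.compile(
        source_path,
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    with open(pyc_path, "rb") as f:
        header = f.read(16)

# Bytes 0-1: magic number, bytes 2-3: b"\r\n", then three 32-bit
# little-endian words: flags, source mtime, source size.
magic_number = int.from_bytes(header[:2], "little")
flags, mtime, source_size = struct.unpack("<III", header[4:16])
print(magic_number, header[2:4], flags, mtime, source_size)
```

Everything after these 16 bytes is the marshal payload described next.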

marshal.dumps calls marshal_dumps_impl:

// marshal.c
static PyObject *
marshal_dumps_impl(PyObject *module, PyObject *value, int version)
/*[clinic end generated code: output=9c200f98d7256cad input=a2139ea8608e9b27]*/
{
    return PyMarshal_WriteObjectToString(value, version);
}

The source of PyMarshal_WriteObjectToString is:

// marshal.c
PyObject *
PyMarshal_WriteObjectToString(PyObject *x, int version)
{
    WFILE wf;

    memset(&wf, 0, sizeof(wf));
    wf.str = PyBytes_FromStringAndSize((char *)NULL, 50);
    if (wf.str == NULL)
        return NULL;
    wf.ptr = wf.buf = PyBytes_AS_STRING((PyBytesObject *)wf.str);
    wf.end = wf.ptr + PyBytes_Size(wf.str);
    wf.error = WFERR_OK;
    wf.version = version;
    if (w_init_refs(&wf, version)) {
        Py_DECREF(wf.str);
        return NULL;
    }
    w_object(x, &wf);
    w_clear_refs(&wf);
    if (wf.str != NULL) {
        char *base = PyBytes_AS_STRING((PyBytesObject *)wf.str);
        if (wf.ptr - base > PY_SSIZE_T_MAX) {
            Py_DECREF(wf.str);
            PyErr_SetString(PyExc_OverflowError,
                            "too much marshal data for a bytes object");
            return NULL;
        }
        if (_PyBytes_Resize(&wf.str, (Py_ssize_t)(wf.ptr - base)) < 0)
            return NULL;
    }
    if (wf.error != WFERR_OK) {
        Py_XDECREF(wf.str);
        if (wf.error == WFERR_NOMEMORY)
            PyErr_NoMemory();
        else
            PyErr_SetString(PyExc_ValueError,
              (wf.error==WFERR_UNMARSHALLABLE)?"unmarshallable object"
               :"object too deeply nested to marshal");
        return NULL;
    }
    return wf.str;
}

The key function here is w_object, which calls w_complex_object. The actual conversion of a PyCodeObject to binary data happens in w_complex_object:

// marshal.c
static void
w_complex_object(PyObject *v, char flag, WFILE *p)
{
    // do sth.
    else if (PyCode_Check(v)) {
        PyCodeObject *co = (PyCodeObject *)v;
        W_TYPE(TYPE_CODE, p);
        w_long(co->co_argcount, p);
        w_long(co->co_kwonlyargcount, p);
        w_long(co->co_nlocals, p);
        w_long(co->co_stacksize, p);
        w_long(co->co_flags, p);
        w_object(co->co_code, p);
        w_object(co->co_consts, p);
        w_object(co->co_names, p);
        w_object(co->co_varnames, p);
        w_object(co->co_freevars, p);
        w_object(co->co_cellvars, p);
        w_object(co->co_filename, p);
        w_object(co->co_name, p);
        w_long(co->co_firstlineno, p);
        w_object(co->co_lnotab, p);
    }
    // do sth.
}
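Two details of this C code can be observed through the marshal module. The sketch below checks that a marshalled code object starts with the TYPE_CODE byte 'c' (possibly with the FLAG_REF bit set) and that a small int is written as a type byte plus four little-endian bytes. Format version 2 is passed explicitly for the int because it predates references, which keeps the type byte unflagged.

```python
import marshal

FLAG_REF = 0x80  # high bit of the type byte, as in marshal.c

code = compile("a = 1\n", "test.py", "exec")
data = marshal.dumps(code)

# The first byte is the type code; mask off FLAG_REF before comparing.
type_byte = data[0] & ~FLAG_REF
print(chr(type_byte))

# TYPE_INT ('i') followed by the four bytes that w_long writes.
int_data = marshal.dumps(5931, 2)
print(int_data)
```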

From this we can see:

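  • The type of a PyCodeObject is TYPE_CODE. The test.py file from section 8.2 produces three PyCodeObject objects, with one PyCodeObject containing the other two;
  • Fields such as co_argcount and co_kwonlyargcount are written with w_long (which calls w_byte to write four bytes), while fields such as co_code and co_consts are written with w_object (which in turn calls w_long, w_string and so on). The meaning of each field will be analysed in depth later;
  • There is one special type to be aware of: TYPE_REF, which saves storage space. Take co_filename, the full path of the py file, as an example. These are the co_filename values in the pyc file generated for test.py:
// class A
"co_filename": {
    "type": "unicode",
    "size": 49,
    "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}
// def Fun
"co_filename": {
    "type": "ref",
    "ref": 6
}
// test.py
"co_filename": {
    "type": "ref",
    "ref": 6
}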

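This is implemented by w_ref, whose source follows. It keeps a hash table whose key is the object's address and whose value is an index. If an object with the same address is already in the table, only TYPE_REF and the index are written, which saves space.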

// marshal.c
static int
w_ref(PyObject *v, char *flag, WFILE *p)
{
    _Py_hashtable_entry_t *entry;
    int w;

    if (p->version < 3 || p->hashtable == NULL) {
        return 0; /* not writing object references */
    }

    /* if it has only one reference, it definitely isn't shared */
    if (Py_REFCNT(v) == 1) {
        return 0;
    }

    entry = _Py_HASHTABLE_GET_ENTRY(p->hashtable, v);
    if (entry != NULL) {
        /* write the reference index to the stream */
        _Py_HASHTABLE_ENTRY_READ_DATA(p->hashtable, entry, w);
        /* we don't store "long" indices in the dict */
        assert(0 <= w && w <= 0x7fffffff);
        w_byte(TYPE_REF, p);
        w_long(w, p);
        return 1;
    } else {
        size_t s = p->hashtable->entries;
        /* we don't support long indices */
        if (s >= 0x7fffffff) {
            PyErr_SetString(PyExc_ValueError, "too many objects");
            goto err;
        }
        w = (int)s;
        Py_INCREF(v);
        if (_Py_HASHTABLE_SET(p->hashtable, v, w) < 0) {
            Py_DECREF(v);
            goto err;
        }
        *flag |= FLAG_REF;
        return 0;
    }
err:
    p->error = WFERR_UNMARSHALLABLE;
    return 1;
}
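The effect of TYPE_REF can be seen by marshalling a tuple that holds the same string object twice, once with a format version that supports references and once with one that does not. This is a sketch; the path string is a made-up example, and the exact byte counts are left out on purpose.

```python
import marshal

s = "/Users/someone/project/test.py"  # hypothetical path, shared twice below
pair = (s, s)

with_refs = marshal.dumps(pair, 4)     # version 4 writes TYPE_REF entries
without_refs = marshal.dumps(pair, 2)  # version 2 writes the string twice
print(len(with_refs), len(without_refs))

# Reading a TYPE_REF entry returns the object already stored in the
# reader's list, so both tuple items are the same object again.
restored = marshal.loads(with_refs)
print(restored[0] is restored[1])
```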

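The reverse process is implemented as follows. If flag is nonzero, the actual value is appended to a list. If the type is TYPE_REF, the real value is fetched from that list using the index read from the stream.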

// marshal.c
static PyObject *
r_object(RFILE *p)
{
    PyObject *v, *v2;
    Py_ssize_t idx = 0;
    long i, n;
    int type, code = r_byte(p);
    int flag, is_interned = 0;
    PyObject *retval = NULL;

    if (code == EOF) {
        PyErr_SetString(PyExc_EOFError,
                        "EOF read where object expected");
        return NULL;
    }

    p->depth++;

    if (p->depth > MAX_MARSHAL_STACK_DEPTH) {
        p->depth--;
        PyErr_SetString(PyExc_ValueError, "recursion limit exceeded");
        return NULL;
    }

    flag = code & FLAG_REF;
    type = code & ~FLAG_REF;

#define R_REF(O) do{\
    if (flag) \
        O = r_ref(O, flag, p);\
} while (0)

    switch (type) {
    // do sth.
    case TYPE_REF:
        n = r_long(p);
        if (n < 0 || n >= PyList_GET_SIZE(p->refs)) {
            if (n == -1 && PyErr_Occurred())
                break;
            PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
            break;
        }
        v = PyList_GET_ITEM(p->refs, n);
        if (v == Py_None) {
            PyErr_SetString(PyExc_ValueError, "bad marshal data (invalid reference)");
            break;
        }
        Py_INCREF(v);
        retval = v;
        break;
    // do sth.
    }
}

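One open question remains: why does w_ref not decide which objects go into the hash table based on the value of flag, the way r_object does on the reading side? I have not worked this out yet. One thing the code above does show is that w_ref sets FLAG_REF on exactly the objects it inserts into the hash table, so the indices assigned by the writer line up with the positions in the reader's list.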

8.3.4. Writing the Binary Data to a File

_cache_bytecode writes the binary data to a file. Its source is:

# _bootstrap_external.py
def _cache_bytecode(self, source_path, bytecode_path, data):
    # Adapt between the two APIs
    mode = _calc_mode(source_path)
    return self.set_data(bytecode_path, data, _mode=mode)

The source of set_data is:

def set_data(self, path, data, *, _mode=0o666):
    """Write bytes data to a file."""
    parent, filename = _path_split(path)
    path_parts = []
    # Figure out what directories are missing.
    while parent and not _path_isdir(parent):
        parent, part = _path_split(parent)
        path_parts.append(part)
    # Create needed directories.
    for part in reversed(path_parts):
        parent = _path_join(parent, part)
        try:
            _os.mkdir(parent)
        except FileExistsError:
            # Probably another Python process already created the dir.
            continue
        except OSError as exc:
            # Could be a permission error, read-only filesystem: just forget
            # about writing the data.
            _bootstrap._verbose_message('could not create {!r}: {!r}',
                                        parent, exc)
            return
    try:
        _write_atomic(path, data, _mode)
        _bootstrap._verbose_message('created {!r}', path)
    except OSError as exc:
        # Same as above: just don't write the bytecode.
        _bootstrap._verbose_message('could not create {!r}: {!r}', path,
                                    exc)

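The function that actually writes the file is _write_atomic, shown below. It writes to a temporary file and then renames it, which guarantees one of two outcomes: either an exception occurs and no file is produced, or no exception occurs and the file with the requested name is complete.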

def _write_atomic(path, data, mode=0o666):
    """Best-effort function to write data to a path atomically.
    Be prepared to handle a FileExistsError if concurrent writing of the
    temporary file is attempted."""
    # id() is used to generate a pseudo-random filename.
    path_tmp = '{}.{}'.format(path, id(path))
    fd = _os.open(path_tmp,
                  _os.O_EXCL | _os.O_CREAT | _os.O_WRONLY, mode & 0o666)
    try:
        # We first write data to a temporary file, and then use os.replace() to
        # perform an atomic rename.
        with _io.FileIO(fd, 'wb') as file:
            file.write(data)
        _os.replace(path_tmp, path)
    except OSError:
        try:
            _os.unlink(path_tmp)
        except OSError:
            pass
        raise
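The same write-then-rename pattern can be written with public modules only. The sketch below is an illustration of the technique, not importlib's code: the helper name write_atomic is made up, and it uses tempfile.mkstemp for the temporary name instead of id().

```python
import os
import tempfile


def write_atomic(path, data):
    """Write bytes to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace renames over an existing target in one step.
        os.replace(tmp_path, path)
    except OSError:
        # Clean up the temporary file, then re-raise the original error.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.pyc")
    write_atomic(target, b"first")
    write_atomic(target, b"second")
    with open(target, "rb") as f:
        content = f.read()
    leftovers = os.listdir(tmp)
print(content, leftovers)
```

Keeping the temporary file in the target's directory matters because a rename is only atomic within one file system.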

8.4. References

  • Python源码剖析 (Python Source Code Analysis)

8.5. Appendix

With the pyc generation flow understood, the tool mentioned in section 8.1 can be implemented. Its source is:

# -*- coding:utf-8 -*-
import json
import datetime
import sys

FLAG_REF = 0x80
TYPE_CODE = ord('c')
TYPE_STRING = ord('s')
TYPE_SMALL_TUPLE = ord(')')
TYPE_INT = ord('i')
TYPE_SHORT_ASCII = ord('z')
TYPE_SHORT_ASCII_INTERNED = ord('Z')
TYPE_REF = ord('r')
TYPE_NONE = ord('N')

# Maps a reference index to the value of the object that was flagged with it.
REFS_HASH = {}


def read_long(fp):
    """Read a 4-byte little-endian unsigned integer."""
    return int.from_bytes(fp.read(4), 'little')


def parse_code(fp):
    code = int.from_bytes(fp.read(1), 'little')
    code_type = code & ~FLAG_REF
    code_flag = code & FLAG_REF

    # Reserve the slot first so nested objects receive later indices.
    idx = len(REFS_HASH)
    if code_flag:
        REFS_HASH[idx] = None

    code_dict = {}
    if code_type == TYPE_CODE:
        code_dict['type'] = 'code'
        code_dict['co_argcount'] = read_long(fp)
        code_dict['co_kwonlyargcount'] = read_long(fp)
        code_dict['co_nlocals'] = read_long(fp)
        code_dict['co_stacksize'] = read_long(fp)
        code_dict['co_flags'] = read_long(fp)
        code_dict['co_code'] = parse_code(fp)
        code_dict['co_consts'] = parse_code(fp)
        code_dict['co_names'] = parse_code(fp)
        code_dict['co_varnames'] = parse_code(fp)
        code_dict['co_freevars'] = parse_code(fp)
        code_dict['co_cellvars'] = parse_code(fp)
        code_dict['co_filename'] = parse_code(fp)
        code_dict['co_name'] = parse_code(fp)
        code_dict['co_firstlineno'] = read_long(fp)
        code_dict['co_lnotab'] = parse_code(fp)
    elif code_type == TYPE_STRING:
        code_dict['type'] = 'string'
        length = read_long(fp)
        code_dict['length'] = length
        # todo
        value = fp.read(length)
        code_dict['value'] = str(value)
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_SMALL_TUPLE:
        code_dict['type'] = 'tuple'
        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size
        items = []
        for _ in range(size):
            items.append(parse_code(fp))
        code_dict['items'] = items
        if code_flag:
            REFS_HASH[idx] = code_dict['items']
    elif code_type == TYPE_INT:
        code_dict['type'] = 'long'
        # marshal stores TYPE_INT as a signed 32-bit value.
        value = int.from_bytes(fp.read(4), 'little', signed=True)
        code_dict['value'] = value
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type in (TYPE_SHORT_ASCII, TYPE_SHORT_ASCII_INTERNED):
        code_dict['type'] = 'unicode'
        size = int.from_bytes(fp.read(1), 'little')
        code_dict['size'] = size
        code_dict['value'] = fp.read(size).decode()
        if code_flag:
            REFS_HASH[idx] = code_dict['value']
    elif code_type == TYPE_REF:
        code_dict['type'] = 'ref'
        code_dict['ref'] = read_long(fp)
        code_dict['value'] = REFS_HASH[code_dict['ref']]
    elif code_type == TYPE_NONE:
        code_dict['type'] = 'none'
    else:
        # An unknown type leaves the stream position undefined, so stop here.
        raise ValueError('unsupported marshal type: %r' % chr(code_type))
    return code_dict


def parse_pyc(file_name):
    pyc_dict = {}
    with open(file_name, 'rb') as fp:
        magic_number = int.from_bytes(fp.read(2), 'little')
        # Range taken from the original tool; adjust it to the magic
        # number of the interpreter build that produced the pyc.
        if 3390 <= magic_number <= 3392:
            pyc_dict['version'] = 'Python 3.7'
        else:
            print('only support Python 3.7')
            sys.exit(1)
        _ = fp.read(2)  # the b'\r\n' that completes MAGIC_NUMBER
        _ = fp.read(4)  # the 4-byte field written as 0
        timestamp = read_long(fp)
        pyc_dict['modified'] = str(datetime.datetime.fromtimestamp(timestamp))
        source_size = read_long(fp)
        pyc_dict['size'] = source_size
        pyc_dict['code'] = parse_code(fp)
    return pyc_dict


if __name__ == '__main__':
    file_name = sys.argv[1]
    print(json.dumps(parse_pyc(file_name), indent=2))

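When the tool is run on the pyc generated for test.py, TYPE_REF entries are resolved. The value field shown below is filled in by the tool and is not present in the actual binary:

"co_filename": {
    "type": "ref",
    "ref": 6,
    "value": "/Users/l.wang/Documents/pythonindepth/bin/test.py"
}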

The tool does not yet decode the instruction set.
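As a possible starting point for that missing piece, the sketch below splits co_code into (opcode name, argument) pairs using the opcode table in the dis module. It relies on the two-bytes-per-instruction layout used since CPython 3.6 and is deliberately incomplete: it does not combine EXTENDED_ARG prefixes or treat the inline cache entries of newer versions specially.

```python
import dis

code = compile("a = 1\n", "test.py", "exec")
raw = code.co_code

# Each instruction occupies two bytes: the opcode, then its argument.
pairs = [(dis.opname[raw[i]], raw[i + 1]) for i in range(0, len(raw), 2)]
for name, arg in pairs:
    print(name, arg)
```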

Reposted from: http://rlutx.baihongyu.com/
