How to Extract Numbered Lists Inside Tables from a Word Document with python-docx

Source: AI视频音频 | Author: 三上悠亚 | Title: Internet blogger


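To extract numbered lists inside tables from a Word document with python-docx, load the document, iterate over its tables, walk the cells of each table, and pick out the paragraphs that are numbered-list items. Because the whole process is scripted, the same code can be run over many documents in a batch instead of copying list items out by hand.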

Environment Setup

First install the python-docx library with pip (the package is named python-docx on PyPI and is imported as docx):

pip install python-docx

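Structure of Word Tables and Numbered Lists

In the underlying XML, a table is a <w:tbl> element that contains rows (<w:tr>), and each row contains cells (<w:tc>). A numbered list is not a separate object in python-docx: each list item is an ordinary paragraph. Whether a paragraph is a list item is determined either by its style (the style attribute) or by a <w:numPr> element inside the paragraph properties (<w:pPr>) of the paragraph's underlying XML node, which is reachable through _p.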

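Implementation Steps

1. Load the Word document and get its tables

Load the target document with the Document class, then read its tables property. This property returns the top-level tables in the document body; a table nested inside a cell of another table is not included.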

from docx import Document

def get_all_tables(file_path):
    # Load the Word document
    doc = Document(file_path)
    # Return the top-level table objects
    return doc.tables

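2. Walk the table cells and identify numbered-list paragraphs

Iterate over the rows and cells of each table and check whether each paragraph in a cell belongs to a numbered list. Such paragraphs usually use a style whose name contains "List" (for example "List Number"), and the paragraph properties in the _p node can also be checked for a <w:numPr> element. The style-name test is only a heuristic: it also matches bullet styles such as "List Bullet".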

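def is_numbered_list_paragraph(paragraph):
    # Heuristic: treat any style whose name contains "List" as a list style
    # (this also matches bullet styles such as "List Bullet")
    if paragraph.style and "List" in paragraph.style.name:
        return True
    # Read <w:pPr> without creating it, so the document is not modified
    p_pr = paragraph._p.pPr
    if p_pr is None:
        return False
    # A <w:numPr> child means numbering is applied directly to the paragraph
    num_pr = p_pr.find("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}numPr")
    return num_pr is not None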

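3. Extract the list content

For each paragraph identified as a list item, extract its text in document order. Note that para.text holds only the item text; the rendered number ("1.", "2.", and so on) is generated by Word at display time and is not stored in the paragraph text.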

def extract_table_numbered_lists(file_path):
    tables = get_all_tables(file_path)
    result = []
    # Iterate over all tables
    for table_idx, table in enumerate(tables):
        table_data = {
            "table_index": table_idx,
            "numbered_lists": []
        }
        # Iterate over all rows of the table
        for row_idx, row in enumerate(table.rows):
            # Iterate over all cells in the row
            for col_idx, cell in enumerate(row.cells):
                cell_lists = []
                # Iterate over all paragraphs in the cell
                for para in cell.paragraphs:
                    if is_numbered_list_paragraph(para):
                        # Extract the paragraph text and strip surrounding whitespace
                        list_text = para.text.strip()
                        if list_text:
                            cell_lists.append(list_text)
                if cell_lists:
                    table_data["numbered_lists"].append({
                        "row_index": row_idx,
                        "col_index": col_idx,
                        "lists": cell_lists
                    })
        if table_data["numbered_lists"]:
            result.append(table_data)
    return result

Complete Example and Test

Below is a complete runnable example. It assumes a Word document named test.docx whose tables contain numbered lists:

from docx import Document

def get_all_tables(file_path):
    doc = Document(file_path)
    return doc.tables

def is_numbered_list_paragraph(paragraph):
    if paragraph.style and "List" in paragraph.style.name:
        return True
    # Read <w:pPr> without creating it, so the document is not modified
    p_pr = paragraph._p.pPr
    if p_pr is None:
        return False
    num_pr = p_pr.find("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}numPr")
    return num_pr is not None

def extract_table_numbered_lists(file_path):
    tables = get_all_tables(file_path)
    result = []
    for table_idx, table in enumerate(tables):
        table_data = {
            "table_index": table_idx,
            "numbered_lists": []
        }
        for row_idx, row in enumerate(table.rows):
            for col_idx, cell in enumerate(row.cells):
                cell_lists = []
                for para in cell.paragraphs:
                    if is_numbered_list_paragraph(para):
                        list_text = para.text.strip()
                        if list_text:
                            cell_lists.append(list_text)
                if cell_lists:
                    table_data["numbered_lists"].append({
                        "row_index": row_idx,
                        "col_index": col_idx,
                        "lists": cell_lists
                    })
        if table_data["numbered_lists"]:
            result.append(table_data)
    return result

if __name__ == "__main__":
    # Replace with the path to your Word document
    file_path = "test.docx"
    extracted_data = extract_table_numbered_lists(file_path)
    for table_info in extracted_data:
        print(f"Table index: {table_info['table_index']}")
        for list_info in table_info["numbered_lists"]:
            print(f"  Row: {list_info['row_index']}, column: {list_info['col_index']}")
            print(f"  Numbered list items: {list_info['lists']}")

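Notes

Some custom numbering styles have names that do not contain "List", so the "List" in paragraph.style.name check will miss them, and the same check also matches bullet styles. In those cases, inspect the numbering properties (<w:numPr>) in the paragraph's underlying XML, or in its style definition, for a more precise test.
If a cell contains both plain text and a numbered list, the code above extracts only the list items and ignores the plain text.
When processing large Word documents, test the extraction on a few tables first and confirm the logic meets your needs before running it in bulk.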

Tags: Python, python-docx, Word, table extraction, numbered list extraction
Last modified: 2026-07-01 11:45:41
