Converting PDF to DOCX on Ubuntu

A quick note on how to convert PDF to Word from the command line on Ubuntu 22.04.

PDF-to-Word conversion falls into two cases:

  1. The PDF is pure text, not scanned images.
  2. The PDF mixes text and images, or contains only images.

1

The first case is easy to handle: pdf2docx alone is enough.

# Install
pip install pdf2docx

# Usage
NAME
    pdf2docx - Command line interface for ``pdf2docx``.

SYNOPSIS
    pdf2docx COMMAND | -

DESCRIPTION
    Command line interface for ``pdf2docx``.

COMMANDS
    COMMAND is one of the following:

     convert
       Convert pdf file to docx file.

     debug
       Convert one PDF page and plot layout information for debugging.

     gui
       Simple user interface.

     table
       Extract table content from pdf pages.

# Practical usage; a page range can be specified
Convert all pages:

$ pdf2docx convert test.pdf test.docx

Convert pages from the second to the end:

$ pdf2docx convert test.pdf test.docx --start=1

Convert pages from the first to the third (index=2):

$ pdf2docx convert test.pdf test.docx --end=3

Convert second and third pages:

$ pdf2docx convert test.pdf test.docx --start=1 --end=3

Convert the first and second pages with zero-based indexing turned off:

$ pdf2docx convert test.pdf test.docx --start=1 --end=3 --zero_based_index=False

By page numbers
Convert the first, third and 5th pages:

$ pdf2docx convert test.pdf test.docx --pages=0,2,4

Multi-Processing
Turn on multi-processing with the default CPU count:

$ pdf2docx convert test.pdf test.docx --multi_processing=True

Specify the count of CPUs:

$ pdf2docx convert test.pdf test.docx --multi_processing=True --cpu_count=4
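The zero-based `--start` and `--end` flags are easy to get wrong by one. Below is a minimal sketch of a helper that maps a 1-based, inclusive page range to those flags. `page_flags` is a hypothetical name, not part of pdf2docx, and the sketch assumes `--end` is exclusive, as the examples above suggest (`--start=1 --end=3` yields the second and third pages).

```shell
#!/bin/sh
# page_flags FIRST LAST
# Hypothetical helper (not part of pdf2docx): turns a 1-based, inclusive
# page range into the zero-based --start/--end flags.
page_flags() {
    first=$1
    last=$2
    # --start is zero-based, so subtract 1 from the first page.
    # --end is assumed to be exclusive, so the 1-based last page number
    # can be passed through unchanged.
    echo "--start=$((first - 1)) --end=$last"
}

# Example: convert pages 2 through 3 of test.pdf
# pdf2docx convert test.pdf test.docx $(page_flags 2 3)
```

The command substitution is left unquoted in the example on purpose, so that the two flags are passed as separate arguments.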

Reference: https://pdf2docx.readthedocs.io/en/latest/index.html

2

For the second case, first run OCR so that the PDF carries a text layer, then convert the result to Word with pdf2docx.

  • OCR to text, tool: ocrmypdf
# Install
apt install ocrmypdf

# Usage
ocrmypdf --skip-text orang.pdf orang_ocr.pdf
  • PDF to Word, tool: pdf2docx
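The two steps can be chained in a small wrapper. This is only a sketch: `pdf_to_docx`, the `OCRMYPDF` and `PDF2DOCX` override variables, and the `_ocr.pdf` naming are illustrative assumptions, while the two underlying commands are the ones shown above.

```shell
#!/bin/sh
# pdf_to_docx INPUT.pdf OUTPUT.docx
# Hypothetical wrapper: OCR the PDF first, then convert the OCR'd copy.
# OCRMYPDF and PDF2DOCX may be overridden, e.g. with "echo" for a dry run.
pdf_to_docx() {
    src=$1
    dst=$2
    # Intermediate file: orang.pdf -> orang_ocr.pdf
    ocr="${src%.pdf}_ocr.pdf"
    # --skip-text leaves pages that already contain text untouched.
    # The conversion only runs if the OCR step succeeds.
    "${OCRMYPDF:-ocrmypdf}" --skip-text "$src" "$ocr" &&
        "${PDF2DOCX:-pdf2docx}" convert "$ocr" "$dst"
}

# Example:
# pdf_to_docx orang.pdf orang.docx
```

Setting `OCRMYPDF=echo PDF2DOCX=echo` before the call prints the arguments of both commands instead of running them, which is a quick way to check the file names.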

Reference: https://github.com/ocrmypdf/OCRmyPDF

Updated: 2024-04-17