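A quick note on how to convert PDF to Word from the command line on Ubuntu 22.04.
PDF-to-Word conversion falls into two cases:
- The PDF is pure text, not scanned images.
- The PDF mixes text and images, or contains only images.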
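One way to tell which case a given file falls into is to check how much text each page actually exposes. The sketch below is only a heuristic and rests on assumptions not in the original note: it uses PyMuPDF (`fitz`), which pdf2docx normally pulls in as a dependency, and the function names and the `min_chars` threshold are made up for illustration. A page that has a text layer but also contains text inside images will still be counted as "has text".

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return zero-based indices of pages with (almost) no extractable text.

    min_chars is an arbitrary, hypothetical threshold; tune it for your files.
    """
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the text layer of every page with PyMuPDF (assumed to be installed)."""
    import fitz  # PyMuPDF; imported lazily so pages_needing_ocr has no dependency

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

For example, `pages_needing_ocr(extract_page_texts("test.pdf"))` returning an empty list suggests the file belongs to the first case; any indices it returns are pages that probably need OCR first.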
The first case (pure text) is easy to handle: pdf2docx does the job.
# Install
pip install pdf2docx
# Usage
NAME
pdf2docx - Command line interface for ``pdf2docx``.
SYNOPSIS
pdf2docx COMMAND | -
DESCRIPTION
Command line interface for ``pdf2docx``.
COMMANDS
COMMAND is one of the following:
convert
Convert pdf file to docx file.
debug
Convert one PDF page and plot layout information for debugging.
gui
Simple user interface.
table
Extract table content from pdf pages.
# Practical usage; a page range can be specified
Convert all pages:
$ pdf2docx convert test.pdf test.docx
Convert pages from the second to the end:
$ pdf2docx convert test.pdf test.docx --start=1
Convert pages from the first to the third (index=2):
$ pdf2docx convert test.pdf test.docx --end=3
Convert second and third pages:
$ pdf2docx convert test.pdf test.docx --start=1 --end=3
Convert the first and second pages with zero-based index turn off:
$ pdf2docx convert test.pdf test.docx --start=1 --end=3 --zero_based_index=False
By page numbers
Convert the first, third and 5th pages:
$ pdf2docx convert test.pdf test.docx --pages=0,2,4
Multi-Processing
Turn on multi-processing with default count of CPU:
$ pdf2docx convert test.pdf test.docx --multi_processing=True
Specify the count of CPUs:
$ pdf2docx convert test.pdf test.docx --multi_processing=True --cpu_count=4
Reference: https://pdf2docx.readthedocs.io/en/latest/index.html
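The same conversion can also be driven from Python instead of the shell. The following is a minimal sketch, assuming the `Converter` class and its `convert(..., pages=...)` argument behave as described in the pdf2docx documentation. The helper names `to_zero_based` and `convert_pages` are hypothetical and exist only to make the zero-based page indexing explicit.

```python
from typing import Iterable, List


def to_zero_based(page_numbers: Iterable[int]) -> List[int]:
    """Turn human page numbers (1, 2, 3, ...) into pdf2docx's zero-based indices."""
    pages = []
    for n in page_numbers:
        if n < 1:
            raise ValueError(f"page numbers start at 1, got {n}")
        pages.append(n - 1)
    return pages


def convert_pages(pdf_path: str, docx_path: str, page_numbers: Iterable[int]) -> None:
    """Convert only the given pages (1-based) of pdf_path into docx_path."""
    from pdf2docx import Converter  # imported lazily; requires `pip install pdf2docx`

    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path, pages=to_zero_based(page_numbers))
    finally:
        cv.close()  # release the underlying PDF file handle
```

Calling `convert_pages("test.pdf", "test.docx", [1, 3, 5])` should then correspond to the `--pages=0,2,4` command shown above.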
For the second case, run OCR first so the image content becomes recognizable text, then convert the result to Word with pdf2docx.
# Install
apt install ocrmypdf
# Usage
ocrmypdf --skip-text orang.pdf orang_ocr.pdf
Reference: https://github.com/ocrmypdf/OCRmyPDF
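The two steps for the second case can be chained in a small script. This is a sketch under stated assumptions rather than a tested pipeline: it shells out to the `ocrmypdf` and `pdf2docx` commands exactly as used above, the function names are hypothetical, and the `_ocr.pdf` / `.docx` output naming simply follows the `orang.pdf` example. How good the resulting Word file is depends on the OCR result.

```python
import subprocess
from pathlib import Path
from typing import List


def ocr_then_convert_cmds(pdf_path: str) -> List[List[str]]:
    """Build the two commands: OCR the PDF, then convert the OCR'd PDF to docx."""
    src = Path(pdf_path)
    ocr_pdf = src.with_name(src.stem + "_ocr.pdf")  # e.g. orang.pdf -> orang_ocr.pdf
    docx = src.with_suffix(".docx")                 # e.g. orang.pdf -> orang.docx
    return [
        # --skip-text leaves pages that already have a text layer untouched
        ["ocrmypdf", "--skip-text", str(src), str(ocr_pdf)],
        ["pdf2docx", "convert", str(ocr_pdf), str(docx)],
    ]


def run_pipeline(pdf_path: str) -> None:
    """Run both steps in order; stop at the first command that fails."""
    for cmd in ocr_then_convert_cmds(pdf_path):
        subprocess.run(cmd, check=True)
```

`run_pipeline("orang.pdf")` would run the OCR step from the example above and then write `orang.docx` next to the input file.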