Building a PaddleOCR Docker image from scratch

I needed to deploy an OCR service on an internal network. After comparing several options, Baidu's PaddlePaddle looked solid on all fronts.

https://www.paddlepaddle.org.cn/hubdetail?name=ch_pp-ocrv3&en_category=TextRecognition

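My scenario is a containerized deployment on an internal network, and the base image is fixed: it has to be CentOS (CentOS 8).

So I tried building the PaddleOCR image from scratch.

Here are the configuration files first: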

Dockerfile

FROM centos:8.3.2011


# CentOS 8 is end-of-life, so point yum at vault.centos.org instead of the retired mirrors.
RUN sed -i -e "s|mirrorlist=|#mirrorlist=|g" /etc/yum.repos.d/CentOS-*
RUN sed -i -e "s|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g" /etc/yum.repos.d/CentOS-*

RUN yum update -y
RUN yum install gcc make python3-devel openssl-devel mesa-libGL bzip2-devel libffi-devel zlib-devel xz-devel sqlite-devel unzip -y


COPY Python-3.9.6.tgz /Python-3.9.6.tgz
RUN tar -zxvf /Python-3.9.6.tgz
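# A cd in its own RUN does not carry over to later instructions,
# so cd, configure and make share a single RUN here.
RUN cd /Python-3.9.6 && ./configure --enable-loadable-sqlite-extensions --enable-optimizations && make altinstall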

RUN pip3.9 install --upgrade pip

COPY PaddleOCR-release-2.7.zip /PaddleOCR-release-2.7.zip
RUN unzip /PaddleOCR-release-2.7.zip -d /
RUN mv /PaddleOCR-release-2.7 /PaddleOCR
# RUN mkdir /PaddleOCR/inference
# ADD ch_ppocr_mobile_v2.0_cls_infer.tar /PaddleOCR/inference/
# ADD ch_ppocr_mobile_v2.0_det_infer.tar /PaddleOCR/inference/
# ADD ch_ppocr_mobile_v2.0_rec_infer.tar /PaddleOCR/inference/
# RUN mv /PaddleOCR/inference/ch_ppocr_mobile_v2.0_det_infer /PaddleOCR/inference/ch_PP-OCRv3_det_infer
# RUN mv /PaddleOCR/inference/ch_ppocr_mobile_v2.0_rec_infer /PaddleOCR/inference/ch_PP-OCRv3_rec_infer

WORKDIR /PaddleOCR

RUN pip install backports.lzma
RUN pip install PyMuPDF==1.18.0


RUN pip install -r /PaddleOCR/requirements.txt
RUN pip install paddlepaddle
RUN pip install paddlehub

# RUN hub install deploy/hubserving/ocr_system/
# RUN hub install deploy/hubserving/ocr_cls/
# RUN hub install deploy/hubserving/ocr_det/
# RUN hub install deploy/hubserving/ocr_rec/

RUN mkdir -p /root/.paddlehub/modules
ADD ch_pp_ocrv3_1.2.0.zip /tmp/
ADD ch_pp_ocrv3_det_1_1_0.zip /tmp
RUN unzip /tmp/ch_pp_ocrv3_1.2.0.zip -d /root/.paddlehub/modules/
RUN unzip /tmp/ch_pp_ocrv3_det_1_1_0.zip -d /root/.paddlehub/modules/


RUN pip install waitress tornado
ADD html /PaddleOCR/html
ADD html/app_compat.py /usr/local/lib/python3.9/site-packages/paddlehub/serving/app_compat.py

RUN yum install epel-release -y
RUN yum install supervisor -y
ADD supervisord.conf /supervisord.conf


EXPOSE 9000

# CMD ["/bin/bash","-c","hub serving start --modules ocr_system ocr_cls ocr_det ocr_rec -p 9000"]
# CMD ["/bin/bash","-c","hub serving start --modules ch_pp-ocrv3 -p 9000"]
# CMD ["/bin/bash","-c","python3.9 html/main.py"]
CMD ["supervisord","-c","/supervisord.conf"]

supervisord.conf

[supervisord]
logfile=/tmp/supervisord.log ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB ; (max main logfile bytes b4 rotation;default 50MB)
logfile_backups=10 ; (num of main logfile rotation backups;default 10)
loglevel=info ; (log level;default info; others: debug,warn,trace)
pidfile=/var/run/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
nodaemon=true ; (start in foreground if true;default false) needed when supervisord is the container's CMD
minfds=1024 ; (min. avail startup file descriptors;default 1024)
minprocs=200 ; (min. avail process descriptors;default 200)

[program:paddlehub]
autorestart=True      ; restart the program automatically if it exits unexpectedly
autostart=True        ; start the program automatically when supervisord starts
process_name=paddlehub
command=python3.9 /PaddleOCR/html/main.py
directory=/PaddleOCR
user=admin
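
Because the Dockerfile starts supervisord as the container's CMD, supervisord has to stay in the foreground; if it daemonizes, the container's main process exits. The sketch below is one way to check a config file for this before building. It uses only the standard library, and the function names `load_supervisor_config` and `container_warnings` are made up for this example, not part of supervisor.

```python
import configparser
from typing import List


def load_supervisor_config(text: str) -> configparser.ConfigParser:
    """Parse supervisord.conf text; ';' starts an inline comment."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=(";",),
        interpolation=None,  # keep values such as %(ENV_X)s untouched
    )
    parser.read_string(text)
    return parser


def container_warnings(parser: configparser.ConfigParser) -> List[str]:
    """Return human-readable warnings for settings that are risky in a container."""
    warnings = []
    nodaemon = parser.get("supervisord", "nodaemon", fallback="false")
    if nodaemon.strip().lower() != "true":
        warnings.append("nodaemon is not true: supervisord will daemonize "
                        "and the container's main process will exit")
    for section in parser.sections():
        if section.startswith("program:") and not parser.has_option(section, "command"):
            warnings.append(f"[{section}] has no command")
    return warnings
```

To use it, read supervisord.conf from the build context, pass the text to `load_supervisor_config`, and print whatever `container_warnings` returns.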

main.py

# encoding: utf-8
# author:alisen
# time: 2020/4/28 14:54
import os
import json
import base64
from tornado.options import define, options
import tornado.web
import tornado.gen
import tornado.template
import tornado.httpserver
import tornado.ioloop

from paddlehub.serving.model_service.base_model_service import cv_module_info
from paddlehub.serving.model_service.base_model_service import nlp_module_info
from paddlehub.serving.model_service.base_model_service import v2_module_info

from paddlehub.utils import log

import paddlehub as hub
import numpy as np
import cv2

ocr = hub.Module(name="ch_pp-ocrv3", enable_mkldnn=True)  

def package_result(status: str, msg: str, data: dict):
    '''
    Package message of response.

    Args:
        status(str): Error code
            ========   ==============================================================================================
            Code       Meaning
            --------   ----------------------------------------------------------------------------------------------
            '000'      Return results normally
            '101'      An error occurred in the predicting method
            '111'      Module is not available
            '112'      Use outdated and abandoned HTTP protocol format
            ========   ===============================================================================================
        msg(str): Detailed info for error
        data(dict): Result of predict api.

    Returns:
        dict: Message of response

    Examples:
        .. code-block:: python

            data = {'result': 0.002}
            package_result(status='000', msg='', data=data)
    '''
    return {"status": status, "msg": msg, "results": data}

class NpEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super(NpEncoder, self).default(obj)

class Index(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self, *args, **kwargs):
        self.render('/PaddleOCR/html/index.html')
        

class Predict(tornado.web.RequestHandler):

    @tornado.gen.coroutine
    def post(self):
        '''
        Http api for predicting.

        Args:
            module_name(str): Module name for predicting.

        Returns:
            Result of predicting after packaging.
        '''
        self.set_header('content-type', 'application/json')

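        # The request body is JSON of the form {"images": ["<base64 image>", ...]};
        # only the first image is processed.
        try:
            img_b64 = json.loads(self.request.body)['images'][0]
        except (ValueError, KeyError, IndexError, TypeError):
            img_b64 = None

        if img_b64 is None:
            message = "This usage is out of date, please use 'application/json' as content-type to post to /predict"
            self.set_status(400)
            self.finish(json.dumps(package_result("112", message, ""), cls=NpEncoder))
            return

        # Decode base64 -> raw bytes -> BGR image array.
        im_bytes = base64.b64decode(img_b64)
        im_arr = np.frombuffer(im_bytes, dtype=np.uint8)
        source = cv2.imdecode(im_arr, flags=cv2.IMREAD_COLOR)

        result = ocr.recognize_text(images=[source])

        self.set_status(200)
        # NpEncoder converts any NumPy scalars or arrays in the result to plain JSON types.
        self.finish(json.dumps(package_result("000", "", result), ensure_ascii=False, cls=NpEncoder))
        return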

def make_app():
    # from backend.webInterface import tr_run
    return tornado.web.Application([
        (r"/predict/ch_pp-ocrv3", Predict),
        (r"/", Index)
    ])


if __name__ == "__main__":
    define("port", default=9000, type=int, help='Port to listen on')
    tornado.options.parse_command_line()
    port = options.port

    app = make_app()

    server = tornado.httpserver.HTTPServer(app)
    server.bind(port)
    # Fork 20 worker processes that all accept connections on the bound port.
    server.start(20)
    print(f'Server is running: http://0.0.0.0:{port}')
    tornado.ioloop.IOLoop.current().start()
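
For reference, here is a minimal client sketch for the service in main.py. It posts a base64-encoded image as JSON to /predict/ch_pp-ocrv3, which is the request shape the Predict handler reads. The address http://127.0.0.1:9000 assumes the container's port 9000 is published on the local machine, and the helper names `build_request` and `recognize` are made up for this example.

```python
import base64
import json
import urllib.request


def build_request(image_bytes: bytes,
                  base_url: str = "http://127.0.0.1:9000") -> urllib.request.Request:
    """Build the POST request: {"images": ["<base64 of the image file>"]}."""
    payload = {"images": [base64.b64encode(image_bytes).decode("ascii")]}
    return urllib.request.Request(
        url=f"{base_url}/predict/ch_pp-ocrv3",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recognize(image_path: str, base_url: str = "http://127.0.0.1:9000") -> dict:
    """Send one image file to the OCR service and return the decoded JSON reply."""
    with open(image_path, "rb") as f:
        request = build_request(f.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `recognize("sample.jpg")` (a hypothetical file name) should return the dictionary built by package_result, with the keys "status", "msg" and "results".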

I ran into quite a few pitfalls along the way, but got everything working in the end.

Packages you will need:

https://bj.bcebos.com/paddlehub/paddlehub_dev/ch_pp_ocrv3_det_1_1_0.zip

https://bj.bcebos.com/paddlehub/paddlehub_dev/ch_pp_ocrv3_1.2.0.zip

https://www.python.org/ftp/python/3.9.6/Python-3.9.6.tgz

https://codeload.github.com/PaddlePaddle/PaddleOCR/zip/refs/heads/release/2.7
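
The Dockerfile copies these four files from the build context under fixed names, so they have to be downloaded on a machine with internet access first. The sketch below is one way to fetch them with the standard library and save them under the names the COPY and ADD lines expect. The helper names `missing_artifacts` and `download_missing` are made up for this example.

```python
import os
import urllib.request
from typing import Dict, List

# File name expected by the Dockerfile -> download URL listed above.
ARTIFACTS: Dict[str, str] = {
    "ch_pp_ocrv3_det_1_1_0.zip": "https://bj.bcebos.com/paddlehub/paddlehub_dev/ch_pp_ocrv3_det_1_1_0.zip",
    "ch_pp_ocrv3_1.2.0.zip": "https://bj.bcebos.com/paddlehub/paddlehub_dev/ch_pp_ocrv3_1.2.0.zip",
    "Python-3.9.6.tgz": "https://www.python.org/ftp/python/3.9.6/Python-3.9.6.tgz",
    "PaddleOCR-release-2.7.zip": "https://codeload.github.com/PaddlePaddle/PaddleOCR/zip/refs/heads/release/2.7",
}


def missing_artifacts(context_dir: str) -> List[str]:
    """List the expected files that are not yet present in the build context."""
    return [name for name in ARTIFACTS
            if not os.path.isfile(os.path.join(context_dir, name))]


def download_missing(context_dir: str) -> List[str]:
    """Download every missing file into context_dir and return their names."""
    os.makedirs(context_dir, exist_ok=True)
    fetched = []
    for name in missing_artifacts(context_dir):
        urllib.request.urlretrieve(ARTIFACTS[name], os.path.join(context_dir, name))
        fetched.append(name)
    return fetched
```

Running `download_missing(".")` in the directory that holds the Dockerfile fetches whatever is not there yet; the directory can then be transferred to the internal network for the build.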

References:

A sample supervisor config file (gist.github.com)

https://stackoverflow.com/questions/70963985/error-failed-to-download-metadata-for-repo-appstream-cannot-prepare-internal
