import base64
import copy
import ctypes
import gc
import hashlib
import inspect
import multiprocessing
import os
import random
import traceback
from glob import glob, iglob
import threading
import time
import urllib.parse
import urllib.request
import psutil
import requests
import json
import sys
from multiprocessing import Process, Pool

__dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.abspath(os.path.join(__dir__, '..')))

from format_convert.convert import convert
from ocr.ocr_interface import ocr, OcrModels
from otr.otr_interface import otr, OtrModels
from format_convert.judge_platform import get_platform


class myThread(threading.Thread):
    # Worker thread that calls test_convert() in a loop and prints the elapsed time of each run.
    def __init__(self, threadName):
        threading.Thread.__init__(self)
        self.threadName = threadName

    def run(self):
        while True:
            start_time = time.time()
            test_convert()
            print(self.threadName, "finish!", time.time()-start_time)


class myThread_appendix(threading.Thread):
    # Worker thread that runs test_appendix_downloaded() once over the given list of directories.
    def __init__(self, threadName, _list):
        threading.Thread.__init__(self)
        self.threadName = threadName
        self._list = _list

    def run(self):
        start_time = time.time()
        test_appendix_downloaded(self._list)
        print(self.threadName, "finish!", time.time()-start_time)


def test_ocr():
    # Send a base64-encoded image to the local OCR service.
    with open("test_files/开标记录表3_page_0.png", "rb") as f:
        base64_data = base64.b64encode(f.read())
        # print(base64_data)
    url = local_url + ":15011" + '/ocr'
    # url = 'http://127.0.0.1:15013/ocr'
    r = requests.post(url, data=base64_data, timeout=2000)
    # print("test:", r.content.decode("utf-8"))


def test_otr():
    # Send a base64-encoded image to the local OTR (table recognition) service.
    with open("test_files/开标记录表3_page_0.png", "rb") as f:
        base64_data = base64.b64encode(f.read())
        # print(base64_data)
    url = local_url + ":15017" + '/otr'
    # url = 'http://127.0.0.1:15013/ocr'
    r = requests.post(url, data=base64_data, timeout=2000)
    # print("test:", r.content.decode("utf-8"))


def test_convert():
    # Send a single file to the /convert endpoint and print the returned pages.
    # path = "开标记录表3.pdf"
    # path = "test_files/开标记录表3_page_0.png"
    # path = "test_files/1.docx"
    # path = '光明食品(集团)有限公司2017年度经审计的合并及母公司财务报表.pdf'
    # path = '光明.pdf'
    # path = 'D:/BIDI_DOC/比地_文档/Oracle11g学生成绩管理系统.docx'
    # path = "C:\\Users\\Administrator\\Desktop\\1600825332753119.doc"
    # path = "temp/complex/8.png"
    # path = "合同备案.doc"
    # path = "1.png"
    # path = "1.pdf"
    # path = "(清单)衢州市第二人民医院二期工程电缆采购项目.xls"
    # path = "D:\\Project\\format_conversion\\appendix_test\\temp\\00fb3e52bc7e11eb836000163e0ae709" + \
    #        "\\00fb43acbc7e11eb836000163e0ae709.png"
    # path = "D:\\BIDI_DOC\\比地_文档\\8a949486788ccc6d017969f189301d41.pdf"
    # path = "be8a17f2cc1b11eba26800163e0857b6.docx"
    # path = "江苏省通州中等专业学校春节物资采购公 告.docx"
    # path = "test_files/1.zip"
    # path = "C:\\Users\\Administrator\\Desktop\\33f52292cdad11ebb58300163e0857b6.zip"
    path = "C:\\Users\\Administrator\\Desktop\\Test_Interface\\1623392355541.zip"
    with open(path, "rb") as f:
        base64_data = base64.b64encode(f.read())
        # print(base64_data)

    url = _url + '/convert'
    # url = 'http://127.0.0.1:15014/convert'
    # headers = {'Content-Type': 'application/json'}
    headers = {
        'Connection': 'keep-alive'
    }
    data = urllib.parse.urlencode({"file": base64_data, "type": path.split(".")[-1]}).encode('utf-8')
    req = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(req) as response:
        _dict = eval(response.read().decode("utf-8"))
        result = _dict.get("result")
        is_success = _dict.get("is_success")
        print("is_success", is_success)
        print("len(result)", len(result))
        for i in range(len(result)):
            print("=================")
            print(result[i])
            print("-----------------")
    # print(len(eval(r.content.decode("utf-8")).get("result")))
    # print(r.content)


def test_appendix_downloaded(_list):
    # Use already-downloaded attachment files directly.
    i = 0
    # for docid_file in glob("/mnt/html_files/*"):
    for docid_file in _list:
        if i % 100 == 0:
            print("Loop", i)
        # print(docid_file)
        for file_path in iglob(docid_file + "/*"):
            print(file_path)
            with open(file_path, "rb") as f:
                base64_data = base64.b64encode(f.read())
            url = _url + '/convert'
            # print(url)
            try:
                # headers = {
                #     'Connection': 'keep-alive'
                # }
                # data = urllib.parse.urlencode({"file": base64_data, "type": file_path.split(".")[-1]}).encode('utf-8')
                # req = urllib.request.Request(url, data=data, headers=headers)
                # with urllib.request.urlopen(req, timeout=2000) as response:
                #     _dict = eval(response.read().decode("utf-8"))
                # timeout=2000
                r = requests.post(url,
                                  data={"file": base64_data, "type": file_path.split(".")[-1]},
                                  timeout=2000)
                _dict = eval(r.content.decode("utf-8"))
                print("is_success:", _dict.get("is_success"))
            except Exception as e:
                print("docid " + str(docid_file) + " time out!", e)
        i += 1


def test_convert_maxcompute():
    # Call convert() directly with local OCR/OTR models instead of going through the HTTP service.
    try:
        ocr_model = OcrModels().get_model()
        otr_model = OtrModels().get_model()

        path_list = []
        path_suffix = "未命名4.pdf"
        if get_platform() == "Windows":
            path_prefix = "C:\\Users\\Administrator\\Desktop\\Test_ODPS\\"
            # path_prefix = "C:\\Users\\Administrator\\Desktop\\"
            path_list.append(path_prefix + path_suffix)
        else:
            path_list.append(path_suffix)

        result_list = []
        for path in path_list:
            with open(path, "rb") as f:
                base64_data = base64.b64encode(f.read())
                # print("------------")
                # print(base64_data)
                # print('------------')
            data = {"file": base64_data, "type": path.split(".")[-1]}
            result_dict = convert(data, ocr_model, otr_model)

            print("garbage object num:%d" % (len(gc.garbage)))
            _unreachable = gc.collect()
            print("unreachable object num:%d" % (_unreachable))
            print("garbage object num:%d" % (len(gc.garbage)))
            result_list.append(result_dict)

        for result_dict in result_list:
            result = result_dict.get("result_text")
            is_success = result_dict.get("is_success")
            for i in range(len(result)):
                print("=================", "is_success", is_success, i, "in", len(result))
                # _dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..")
                # _dir = os.path.abspath(_dir) + os.sep
                # if i == 0:
                #     with open(_dir + "result.html", "w") as ff:
                #         ff.write(result[i])
                # else:
                #     with open(_dir + "result.html", "a") as ff:
                #         ff.write("