createtempfile (Python development essentials: a deep dive into the tempfile module)

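A temporary directory is a short-lived folder for data that does not need to be kept. When you are done, it is deleted along with everything inside it, and the file system stays clean.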

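Temporary directories pay off in several common development situations:

You need scratch space for intermediate results or temporary files. Unit tests need to simulate file operations and clean up automatically afterwards. Downloaded or extracted data does not need to be kept. User uploads need a staging area before the final result is saved. Automated build pipelines must leave no trace behind.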

Basic usage of the tempfile module

    import tempfile
    import os

    # Create a temporary directory
    with tempfile.TemporaryDirectory() as temp_dir:
        print(f"Temporary directory created at: {temp_dir}")

        # Create a temporary file inside the directory
        file_path = os.path.join(temp_dir, "sample.txt")
        with open(file_path, "w") as f:
            f.write("Hello, Temporary World!")

        # Read back the file
        with open(file_path, "r") as f:
            print(f.read())

    # At this point, the directory and its contents are deleted automatically
    print("Temporary directory cleaned up automatically.")

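The key point is that the directory and its files are deleted automatically when the with block exits, so there is no need to call os.remove() or shutil.rmtree() by hand.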

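Manually controlling a temporary directory's lifetime

When you create the directory yourself, for example with tempfile.mkdtemp(), cleanup is your responsibility, so remember to delete the directory when you are done.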
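As a minimal sketch of manual cleanup, the example below assumes the directory is created with tempfile.mkdtemp() and removed with shutil.rmtree(); the file name scratch.txt is only a placeholder.

```python
import os
import shutil
import tempfile

# mkdtemp() creates the directory and returns its path; nothing deletes it for us.
temp_dir = tempfile.mkdtemp()
try:
    file_path = os.path.join(temp_dir, "scratch.txt")  # placeholder file name
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("intermediate data")
    existed_during_use = os.path.exists(file_path)
finally:
    # Manual cleanup: remove the directory and everything inside it,
    # even if the work above raised an exception.
    shutil.rmtree(temp_dir)

print(os.path.exists(temp_dir))
```

Putting the cleanup in a finally block is one way to make sure the directory is removed even when the processing step fails.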

Customizing the name and location of a temporary directory
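The prefix, suffix, and dir arguments control how the directory is named and where it is created. The sketch below assumes a prefix of "myapp_" and a suffix of "_data", inferred from the sample path in this section, and uses the system temporary directory as the parent; substitute any writable directory.

```python
import os
import tempfile

# prefix/suffix shape the directory name; dir selects the parent directory.
with tempfile.TemporaryDirectory(
    prefix="myapp_",
    suffix="_data",
    dir=tempfile.gettempdir(),  # assumption: any writable directory works here
) as temp_dir:
    print(f"Created: {temp_dir}")
    dir_name = os.path.basename(temp_dir)
```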

The output looks something like this:

Created: /tmp/myapp_abcd1234_data

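Choosing a custom location comes in handy when the system's default temporary directory has insufficient permissions or too little free space.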

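Practical example: handling ZIP files safely

When the whole process finishes, the extraction folder is deleted automatically and no junk files are left on disk.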
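One possible shape for this flow, as a sketch rather than the original code: extract the archive into a TemporaryDirectory, do the work while the files exist, and return only the result. The function name summarize_zip and the demo archive are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def summarize_zip(zip_path):
    """Extract a ZIP into a temporary directory and return {relative_path: size_in_bytes}."""
    with tempfile.TemporaryDirectory() as extract_dir:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)
        root = Path(extract_dir)
        summary = {
            p.relative_to(root).as_posix(): p.stat().st_size
            for p in root.rglob("*")
            if p.is_file()
        }
    # The extraction directory and its files no longer exist here.
    return summary


# Demo: build a small archive (itself inside a temp dir) and summarize it.
with tempfile.TemporaryDirectory() as work_dir:
    demo_zip = Path(work_dir) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("a.txt", "hello")
        archive.writestr("docs/b.txt", "temporary")
    result = summarize_zip(demo_zip)

print(result)
```

For archives from untrusted sources, checking the total uncompressed size before extracting is a reasonable extra precaution that this sketch leaves out.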

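Practical example: generating reports on the fly

The generated report can be uploaded, emailed, or read directly, and no copy is kept locally.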
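A minimal sketch of this pattern, assuming the report is a small CSV file; the function name build_report, the file name report.csv, and the column names are illustrative only.

```python
import csv
import tempfile
from pathlib import Path


def build_report(rows):
    """Write rows to a CSV file in a temporary directory and return its text and former path."""
    with tempfile.TemporaryDirectory() as temp_dir:
        report_path = Path(temp_dir) / "report.csv"
        with report_path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["name", "score"])
            writer.writerows(rows)
        # Upload, email, or read the file here, while it still exists.
        content = report_path.read_text(encoding="utf-8")
    # The file is gone now; only the in-memory content survives.
    return content, report_path


report_text, old_path = build_report([("alice", 90), ("bob", 85)])
print(report_text)
```

Anything that needs the file on disk, such as an upload call, has to run inside the with block, because the path is invalid once the block exits.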

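Practical example: file operations in unit tests

Each test case runs in its own temporary environment, so tests do not interfere with one another and no manual cleanup is needed.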
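One way to get this isolation with unittest is to create a fresh TemporaryDirectory in setUp and register its cleanup() with addCleanup. The test class and file names below are hypothetical.

```python
import tempfile
import unittest
from pathlib import Path


class FileWriteTest(unittest.TestCase):
    def setUp(self):
        # A new directory for every test method; cleanup runs even if the test fails.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.work_dir = Path(self._tmp.name)

    def test_write_and_read(self):
        target = self.work_dir / "out.txt"
        target.write_text("data", encoding="utf-8")
        self.assertEqual(target.read_text(encoding="utf-8"), "data")

    def test_starts_empty(self):
        # Files written by other tests are not visible here.
        self.assertEqual(list(self.work_dir.iterdir()), [])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FileWriteTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the last two lines would normally be replaced by the usual test runner invocation.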

Nested temporary directories

In a multi-stage data processing pipeline, each stage can have its own isolated sandbox.
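A sketch of the idea, assuming a two-stage pipeline in which each stage gets its own directory inside a shared root. The stage names and the sorting task are invented for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory(prefix="pipeline_") as root_dir:
    # Stage 1: write and parse raw input in its own sandbox.
    with tempfile.TemporaryDirectory(dir=root_dir, prefix="stage1_") as stage1_dir:
        raw = Path(stage1_dir) / "raw.txt"
        raw.write_text("3\n1\n2\n", encoding="utf-8")
        numbers = [int(line) for line in raw.read_text(encoding="utf-8").split()]
    stage1_gone = not Path(stage1_dir).exists()  # stage 1 files are already removed

    # Stage 2: works only from the in-memory result of stage 1.
    with tempfile.TemporaryDirectory(dir=root_dir, prefix="stage2_") as stage2_dir:
        out = Path(stage2_dir) / "sorted.txt"
        out.write_text("\n".join(str(n) for n in sorted(numbers)), encoding="utf-8")
        final_text = out.read_text(encoding="utf-8")

root_gone = not Path(root_dir).exists()
print(final_text)
```

Because each inner directory is removed as soon as its stage ends, peak disk usage stays close to the size of the largest single stage.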

Notes on using temporary directories

To get the path of the system temporary directory:

    import tempfile

    print(tempfile.gettempdir())

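Avoid tempfile.mktemp(): it is deprecated, because it returns only a path name and another process may create a file at that path before you do.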
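A race-free alternative to tempfile.mktemp() is tempfile.mkstemp(), which creates and opens the file in a single step. A minimal sketch follows; the .txt suffix is an arbitrary choice.

```python
import os
import tempfile

# mkstemp() returns an open OS-level file descriptor and the file's path.
fd, path = tempfile.mkstemp(suffix=".txt")
try:
    # Wrap the descriptor in a file object; closing it closes the descriptor.
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("safe temp file")
    with open(path, encoding="utf-8") as f:
        content = f.read()
finally:
    # mkstemp() does not clean up after itself.
    os.remove(path)

print(content)
```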

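The code below shows how to use temporary directories in a PDF processing project. The pipeline converts PDF pages to images, converts each image to Markdown, and finally merges the pages into one complete document: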

    import os
    import io
    import tempfile
    from pathlib import Path
    from typing import Iterable, Optional, Callable, Tuple

    # Requires: pip install pymupdf pillow
    import fitz  # PyMuPDF
    from PIL import Image


    def process_pdfs_to_markdown(
        pdf_paths: Iterable[str | os.PathLike],
        output_dir: str | os.PathLike,
        *,
        page_image_dpi: int = 200,
        image_format: str = "PNG",
        llm_page_markdown_fn: Optional[Callable[[Path], str]] = None,
    ) -> Tuple[list[Path], list[Path]]:
        """
        Convert each input PDF into page images using a temporary workspace,
        run an LLM on each page image to get Markdown, save one MD per page
        (still in a temp workspace), then merge the per-PDF Markdown into a
        single non-temporary Markdown file per PDF in `output_dir`.

        Non-temp file handling is kept simple (write final merged .md into
        `output_dir`), while the heavy lifting uses temp directories that
        auto-clean on success or error.

        Parameters
        ----------
        pdf_paths : Iterable[str | PathLike]
            Paths to PDF files to process.
        output_dir : str | PathLike
            Directory where FINAL merged Markdown files (non-temp) will be written.
        page_image_dpi : int, optional
            Rendering resolution for converting PDF pages to images.
            Higher DPI → sharper (default 200).
        image_format : str, optional
            Image format for page renders (e.g., "PNG", "JPEG"). Default "PNG".
        llm_page_markdown_fn : Callable[[Path], str], optional
            A callable that takes a Path to a page image and returns Markdown
            text for that page. If not provided, a placeholder stub will be used.

        Returns
        -------
        Tuple[list[Path], list[Path]]
            A tuple (final_markdown_files, per_page_markdown_files_flattened)
            - final_markdown_files: list of merged Markdown file paths written
              in output_dir (non-temp)
            - per_page_markdown_files_flattened: flattened list of all per-page
              MD files (in temp, ephemeral)
              (Returned for inspection/logging; these will be deleted when
              temp dir goes away.)

        Notes
        -----
        - Uses a single top-level TemporaryDirectory for the whole batch to
          keep structure neat.
        - For each PDF, creates `/tmp/.../<pdf_stem>/images` and
          `/tmp/.../<pdf_stem>/md`.
        - Each page is rendered to an image file named `page-<index>.<ext>`.
        - Each page's Markdown is saved to `page-<index>.md`.
        - Finally, merges all page MDs for that PDF into
          `<output_dir>/<pdf_stem>.md` (non-temp).
        - Replace `llm_stub_markdown_from_image` with your actual LLM call
          (OpenAI, local VLM, etc.).

        Pseudocode hint for real LLM integration
        ----------------------------------------
        def llm_page_markdown_fn(img_path: Path) -> str:
            # pseudo:
            # bytes = img_path.read_bytes()
            # resp = my_llm_client.vision_to_md(image=bytes, system_prompt="Extract content as Markdown.")
            # return resp.markdown
            pass
        """
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)

        # --- Local helper: default LLM stub (replace this with your LLM call) ---
        def llm_stub_markdown_from_image(img_path: Path) -> str:
            # This is a placeholder. Swap with a real LLM/VLM call to convert the image to Markdown.
            # You can pass the image bytes and ask the model to produce clean Markdown
            # with headings, tables, lists, etc.
            return (
                f"# Page extracted (stub)\n\n_Image: {img_path.name}_\n\n"
                "> Replace this with real LLM Markdown output."
            )

        # Choose the LLM function (user-supplied or stub)
        llm_to_md = llm_page_markdown_fn or llm_stub_markdown_from_image

        final_markdown_files: list[Path] = []
        per_page_markdown_files_flattened: list[Path] = []

        # Top-level temp root for the entire run
        with tempfile.TemporaryDirectory(prefix="pdf2img-md_") as temp_root:
            temp_root = Path(temp_root)

            for pdf_path in map(Path, pdf_paths):
                if not pdf_path.exists() or pdf_path.suffix.lower() != ".pdf":
                    # Skip invalid entries gracefully; alternatively raise ValueError
                    continue

                pdf_stem = pdf_path.stem
                pdf_temp_dir = temp_root / pdf_stem
                images_dir = pdf_temp_dir / "images"
                md_dir = pdf_temp_dir / "md"
                images_dir.mkdir(parents=True, exist_ok=True)
                md_dir.mkdir(parents=True, exist_ok=True)

                # --- 1) Render pages to images in temp ---
                # Using PyMuPDF: fast, no external poppler dependency
                pages_rendered: list[Path] = []
                with fitz.open(pdf_path) as doc:
                    # scale based on DPI (PyMuPDF normally uses zoom factors; convert DPI to zoom)
                    # base DPI ~72; zoom = target_dpi / 72
                    zoom = page_image_dpi / 72.0
                    mat = fitz.Matrix(zoom, zoom)

                    for page_index in range(doc.page_count):
                        page = doc.load_page(page_index)
                        pix = page.get_pixmap(matrix=mat, alpha=False)  # no alpha for standard formats
                        img_bytes = pix.tobytes(output=image_format.lower())
                        img_name = f"page-{page_index + 1}.{image_format.lower()}"
                        img_path = images_dir / img_name

                        # Save via PIL to ensure consistent headers/metadata if needed
                        with Image.open(io.BytesIO(img_bytes)) as im:
                            im.save(img_path, format=image_format)
                        pages_rendered.append(img_path)

                # --- 2) For each page image, call LLM to get Markdown; save per-page MD in temp ---
                page_md_files: list[Path] = []
                for img_path in pages_rendered:
                    md_text = llm_to_md(img_path)  # <-- your real LLM call here
                    md_path = md_dir / (img_path.stem + ".md")
                    md_path.write_text(md_text, encoding="utf-8")
                    page_md_files.append(md_path)
                    per_page_markdown_files_flattened.append(md_path)

                # --- 3) Merge per-page MD into a FINAL non-temp Markdown file (one per PDF) ---
                final_md_path = output_dir / f"{pdf_stem}.md"

                # If you want sophisticated merging rules, implement here (e.g., front matter, TOC).
                # Pseudocode for richer post-processing could be:
                # combined = render_front_matter(pdf_path) + "\n" + concatenate_markdown(page_md_files) + "\n" + add_toc()
                # final_md_path.write_text(combined, encoding="utf-8")
                with final_md_path.open("w", encoding="utf-8") as fout:
                    fout.write(f"<!-- Source PDF: {pdf_path.name} -->\n")
                    fout.write(f"# {pdf_stem}\n\n")
                    # page_md_files is already in page order. Sorting by file name
                    # would place page-10 before page-2, so iterate it as-is.
                    for i, md_file in enumerate(page_md_files, start=1):
                        fout.write(f"\n\n---\n\n<!-- Page {i} -->\n\n")
                        fout.write(md_file.read_text(encoding="utf-8"))

                final_markdown_files.append(final_md_path)

        # NOTE:
        # All temp content (images & per-page MDs) is automatically cleaned up on exit.
        return final_markdown_files, per_page_markdown_files_flattened

In real use, replace llm_stub_markdown_from_image with an actual LLM call (for example OpenAI's Vision API or a local vision model) to get a complete PDF processing pipeline.

Summary

Sravanth

