How to Read a File in Python: The Complete, No-Shortcuts Guide

Reading a file is one of the first real-world tasks every Python developer encounters. It sounds simple -- and the basic syntax is simple -- but there is a surprising depth beneath that surface. The built-in open() function, the with statement, encoding handling, buffered I/O layers, the modern pathlib module, memory-mapped files, seeking behavior, concurrent access, security implications, and asynchronous alternatives all have origin stories and design decisions that shape what your code actually does at runtime.

Understanding those layers is the difference between code that works on your machine and code that works everywhere. This article covers every major approach to reading files in Python, explains the engineering decisions behind them, connects each technique back to the PEPs and core developer discussions that made it possible, and addresses the questions that other guides leave unanswered -- including what happens to your file position after a read, when to reach for mmap, how to handle files with unknown encodings, what security risks file reading introduces, and what concurrent reads actually guarantee.

The Foundation: Python's Built-In open() Function

Every file-reading operation in Python begins with open(). In its simplest form:

file = open("data.txt", "r")
content = file.read()
file.close()

The "r" mode opens the file for reading in text mode, which is the default. Python returns a file object -- technically, an io.TextIOWrapper -- and you call .read() to pull the entire contents into a string. Then you close the file.

Critical Flaw

This pattern works, but if an exception occurs between open() and close(), the file handle leaks. The operating system has a finite number of file descriptors available (often 1024 by default on Linux systems), and leaking them can eventually crash your application. This is why, in 2005, Guido van Rossum authored PEP 343 to introduce the with statement.

A subtle point that matters: open() in Python 3 is actually io.open(). They are the same function. In Python 2, these were different -- io.open() returned the newer I/O objects while open() returned the legacy file type. If you ever encounter legacy code or documentation referencing io.open() explicitly, understand that in Python 3 this distinction is gone. The official Python documentation confirms that the built-in open() is an alias for io.open().

PEP 343: The with Statement and Why It Matters

PEP 343, titled "The 'with' Statement," was authored by Guido van Rossum and later updated by Alyssa Coghlan (then publishing as Nick Coghlan). It was accepted for Python 2.5 and remains one of the most consequential PEPs for everyday Python programming.

The PEP's history reveals a pivotal design moment. Before PEP 343, Guido had proposed PEP 340 -- "Anonymous Block Statements" -- a more complex mechanism that could conceal retries, loops, and suppressed exceptions inside seemingly innocent abstractions. In PEP 343, Guido referenced Raymond Chen's blog post about hidden flow control, which argued that concealing control flow inside macros makes code impossible to reason about. That argument resonated with Python's core design philosophy: code should be readable, and control flow should be visible. Guido withdrew PEP 340 and proposed the simpler with statement instead.

The with statement was designed to factor out the standard try/finally pattern into something readable and foolproof:

with open("data.txt", "r") as file:
    content = file.read()
# file is automatically closed here, even if an exception occurred

Under the hood, the with statement calls the file object's __enter__() method when entering the block and __exit__() when leaving it -- regardless of whether the block completed normally or raised an exception. For file objects, __exit__() calls close().
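The expansion is roughly equivalent to the try/finally you would otherwise write by hand. Here is a runnable sketch of that desugaring (the sample file is created first so the snippet is self-contained):

```python
# Set up a sample file so the sketch is runnable.
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("hello\n")

# Roughly what `with open("data.txt", "r") as file:` expands to:
manager = open("data.txt", "r", encoding="utf-8")
file = manager.__enter__()        # for file objects, returns the file itself
try:
    content = file.read()
finally:
    # The real with statement passes exception info here; on a normal
    # exit all three arguments are None. __exit__ calls close().
    manager.__exit__(None, None, None)

print(file.closed)   # True
print(content)       # hello
```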

Best Practice

This is the Pythonic way to read files. There is virtually no good reason to use the manual open()/close() pattern in modern Python.

Reading Methods Within a with Block

Once you have a file object open inside a with block, you have several methods available.

.read() -- Read the entire file at once:

with open("config.json", "r") as f:
    full_text = f.read()

This loads the complete file contents into a single string. For small to medium files (configuration files, JSON payloads, short text documents), this is perfectly fine. For files measured in gigabytes, it will consume an equivalent amount of memory.

.readline() -- Read one line at a time:

with open("server.log", "r") as f:
    first_line = f.readline()
    second_line = f.readline()

Each call to .readline() advances the file position by one line. The returned string includes the trailing newline character (\n), which you will usually want to strip.

.readlines() -- Read all lines into a list:

with open("names.txt", "r") as f:
    all_lines = f.readlines()
    # ['Alice\n', 'Bob\n', 'Charlie\n']

This is convenient but has the same memory concern as .read() -- the entire file ends up in memory.

Iterating directly -- The memory-efficient approach:

with open("large_dataset.csv", "r") as f:
    for line in f:
        process(line.strip())

This is the gold standard for reading large files. The file object itself is an iterator, yielding one line at a time without loading the entire file into memory. Python's I/O buffering ensures this is efficient despite making many small reads at the Python level.

The walrus operator pattern (Python 3.8+):

For chunked binary reading, the walrus operator (:=) introduced in PEP 572 provides a cleaner alternative to the while True / break pattern:

with open("large_archive.bin", "rb") as f:
    while chunk := f.read(8192):
        process_chunk(chunk)

This is semantically identical to the while True loop with a break, but the intent is expressed in fewer lines with no redundancy. The assignment expression evaluates to the assigned value, and an empty bytes object is falsy, so the loop terminates naturally at end-of-file.

PEP 3116: The I/O Architecture You Never See

Many Python developers never think about what happens between calling open() and getting text back. The answer is a three-layer I/O stack, formalized in PEP 3116, titled "New I/O." This PEP was authored by Daniel Stutzbach, Guido van Rossum, and Mike Verdone, and it shipped with Python 3.0 in 2008.

PEP 3116 established that Python's I/O library consists of three layers, each defined by an abstract base class:

  1. Raw I/O (RawIOBase) -- Thin wrappers around operating system calls. FileIO is the concrete implementation. Each call maps to roughly one system call.
  2. Buffered I/O (BufferedIOBase) -- Adds buffering on top of raw I/O. BufferedReader, BufferedWriter, and BufferedRandom reduce the number of actual system calls by reading or writing in chunks (the default buffer size is typically 8192 bytes, though Python uses the file's block size when available).
  3. Text I/O (TextIOBase) -- Handles encoding and decoding between bytes and strings. TextIOWrapper is the implementation that sits on top of the buffered layer.

When you write open("data.txt", "r"), Python constructs all three layers: a FileIO object at the bottom, a BufferedReader in the middle, and a TextIOWrapper on top. You can verify this yourself:

with open("data.txt", "r") as f:
    print(type(f))                  # <class '_io.TextIOWrapper'>
    print(type(f.buffer))           # <class '_io.BufferedReader'>
    print(type(f.buffer.raw))       # <class '_io.FileIO'>

Understanding these layers matters in practice. If you need to read raw bytes with no buffering overhead -- say, when implementing a custom binary protocol -- you can open in binary mode with buffering disabled:

with open("raw_stream.bin", "rb", buffering=0) as f:
    chunk = f.read(4096)  # Directly from the OS, no intermediate buffer

It also explains a behavior that confuses many developers: newline translation. In text mode, the Text I/O layer translates \r\n (Windows line endings) and lone \r to \n during reads on every platform; on writes, \n is translated back to the platform's native line separator (so \r\n on Windows). This is why reading a Windows text file on any platform gives you clean \n characters -- and why opening a binary file in text mode will silently corrupt it. The translation happens in the TextIOWrapper layer, and binary mode skips that layer entirely.
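You can observe the translation directly by writing raw Windows line endings in binary mode and reading them back both ways (a sketch; the file name is a throwaway):

```python
# Write two lines with explicit Windows (\r\n) endings in binary mode,
# so nothing is translated on the way in.
with open("crlf_demo.txt", "wb") as f:
    f.write(b"first\r\nsecond\r\n")

# Text mode: TextIOWrapper turns \r\n into \n on read, on any platform.
with open("crlf_demo.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: the bytes come back exactly as stored on disk.
with open("crlf_demo.txt", "rb") as f:
    raw = f.read()

print(repr(text))  # 'first\nsecond\n'
print(raw)         # b'first\r\nsecond\r\n'
```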

The Encoding Problem: PEP 597 and PEP 686

Here is where many Python developers -- particularly those working exclusively on macOS or Linux -- unknowingly write buggy code. When you open a file in text mode without specifying an encoding, Python does not automatically use UTF-8. It uses the locale encoding, which is platform-dependent. On many modern Linux and macOS systems, the locale encoding happens to be UTF-8, which masks the problem. On Windows, it might be cp1252, cp936, or something else entirely.

Cross-Platform Bug

Code that works perfectly on your development machine can silently produce corrupted text -- or crash outright -- on a colleague's Windows machine when encoding is not specified explicitly.

PEP 597, authored by Inada Naoki and accepted for Python 3.10, directly addressed this. It introduced EncodingWarning, a new warning category emitted when the encoding argument to open() is omitted and the default locale-specific encoding is used. Naoki's research, documented in the PEP, found that of the 4,000 packages with the highest download counts on PyPI, 489 used non-ASCII characters in their README files, and 82 of those could not be installed from source on systems with non-UTF-8 locales because their build scripts read those files without specifying an encoding.

PEP 686, also authored by Inada Naoki and accepted for Python 3.15, makes this change permanent by enabling UTF-8 mode by default. As of March 2026, PEP 686 has already been implemented in the Python 3.15 alpha releases (the latest being 3.15.0a6, released February 2026), and Python 3.15 is on track for its final release in October 2026. Once Python 3.15 ships as a stable release, developers will no longer need to worry about locale-dependent encoding defaults -- UTF-8 will simply be the baseline everywhere. The PEP notes that UTF-8 has become the de facto standard text encoding: JSON, TOML, YAML, and text editors including VS Code and Windows Notepad all use it by default.

Until you are running Python 3.15 or later, the best practice is explicit:

with open("readme.md", "r", encoding="utf-8") as f:
    content = f.read()

"Explicit is better than implicit." — Tim Peters, PEP 20 (The Zen of Python)

Steve Dower, a CPython core developer and Windows platform expert, contributed to the PEP 597 discussion with a principle that should inform every developer's approach to file I/O: if you do not know the encoding of a file, you cannot reliably read that file. But what happens when you genuinely do not know the encoding? That is a real scenario, and we cover it in the Detecting Unknown Encodings section below.

Reading Binary Files

Not all files are text. Images, compiled executables, serialized data, and compressed archives are all binary. Reading them in text mode would trigger encoding/decoding that corrupts the data. Use binary mode instead:

with open("photo.jpg", "rb") as f:
    image_data = f.read()  # Returns bytes, not str

The "rb" mode skips the Text I/O layer entirely. You get a BufferedReader wrapping a FileIO, and .read() returns a bytes object -- no encoding, no newline translation, no surprises.

For processing binary files in chunks, which is critical for large files:

def read_in_chunks(filepath, chunk_size=8192):
    with open(filepath, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

import hashlib

hasher = hashlib.sha256()
for chunk in read_in_chunks("database_backup.sql.gz"):
    hasher.update(chunk)

Note

The chunk_size of 8192 bytes aligns with the default buffer size in Python's I/O stack as specified in PEP 3116, making this pattern both memory-efficient and performant. You can check the actual buffer size on your system at runtime: import io; print(io.DEFAULT_BUFFER_SIZE).

PEP 428: pathlib and the Modern Approach

For years, file path manipulation in Python meant juggling os.path.join(), os.path.dirname(), os.path.exists(), and similar functions -- all operating on plain strings. PEP 428, authored by Antoine Pitrou, proposed the pathlib module as an object-oriented alternative. It was included in the standard library starting with Python 3.4.

The PEP's rationale explains that the case for dedicated path classes mirrors the case for other stateless value objects -- dates, times, IP addresses -- where dedicated types enable cleaner APIs, enable desirable default behaviors (like Windows path case-insensitivity), and allow Python to move away from replicating raw C language APIs toward providing more helpful, language-level abstractions.

pathlib provides convenience methods for reading files that eliminate even the with block:

from pathlib import Path

# Read entire text file
content = Path("config.toml").read_text(encoding="utf-8")

# Read entire binary file
data = Path("image.png").read_bytes()

Under the hood, Path.read_text() opens the file, reads its contents, closes it, and returns the string -- all in one call. Use Path.read_text() or Path.read_bytes() when you need the full contents of a file and want clean, concise code. Use open() with a with block when you need line-by-line iteration, partial reads, seeking within the file, or any operation that requires keeping the file handle open.

from pathlib import Path

# pathlib excels at path construction and quick reads
data_dir = Path.home() / "projects" / "analysis" / "data"
for csv_file in data_dir.glob("*.csv"):
    raw = csv_file.read_text(encoding="utf-8")
    process(raw)

Seeking, Positions, and Partial Reads

Every file object in Python maintains a current position -- a byte offset from the beginning of the file. Many tutorials skip this entirely, which leads to a specific class of bug: calling .read() twice on the same file object and getting an empty string the second time.

with open("config.toml", "r", encoding="utf-8") as f:
    first_read = f.read()      # Returns the full file
    second_read = f.read()     # Returns "" -- position is at end
    
    f.seek(0)                  # Reset to the beginning
    third_read = f.read()      # Returns the full file again

The .tell() method returns the current position, and .seek(offset, whence) moves it. The whence argument accepts three values: 0 (absolute position, the default), 1 (relative to the current position), and 2 (relative to the end of the file). In binary mode, .tell() is a plain byte offset; in text mode it is an opaque cookie, and seeking is restricted -- you may only seek to values returned by .tell(), back to the start with seek(0), or to the end with seek(0, 2). Arbitrary byte-offset seeks in text mode are unreliable because multi-byte encoding sequences mean there is no 1:1 mapping between bytes and characters.
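In text mode, treat .tell() values as opaque cookies: store them and hand them back to .seek(), but do no arithmetic on them. A sketch with a throwaway file:

```python
# Create a small sample file.
with open("seek_demo.txt", "w", encoding="utf-8") as f:
    f.write("alpha\nbeta\ngamma\n")

with open("seek_demo.txt", "r", encoding="utf-8") as f:
    first = f.readline()      # 'alpha\n'
    cookie = f.tell()         # opaque cookie: valid only as a seek() target
    rest = f.read()           # 'beta\ngamma\n'
    f.seek(cookie)            # legal: the value came from tell()
    again = f.read()

print(again == rest)  # True
```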

with open("large_file.bin", "rb") as f:
    f.seek(-4, 2)            # 4 bytes before end of file
    trailer = f.read(4)      # Read the last 4 bytes (e.g., for format magic numbers)
    
    position = f.tell()      # Record current position
    f.seek(0)                # Back to start
    header = f.read(8)       # Read first 8 bytes

For structured binary formats -- ZIP files, PNG images, SQLite databases -- seek patterns like this are not edge cases. They are the entire architecture. Understanding that open() returns a stateful cursor, not just a snapshot of content, is foundational to working with any non-trivial file format.
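As a concrete sketch of the header-check pattern: the 8-byte signature below is the real PNG magic number, but the sample file is fabricated here so the snippet runs standalone.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # the first 8 bytes of every PNG file

# Fabricate a minimal stand-in file that starts with the signature.
with open("sample.png", "wb") as f:
    f.write(PNG_SIGNATURE + b"...rest of file...")

def looks_like_png(path):
    """Check a file's magic number without reading the whole file."""
    with open(path, "rb") as f:
        return f.read(8) == PNG_SIGNATURE

print(looks_like_png("sample.png"))  # True
```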

Mental Model

Think of a file object as a cursor in a text editor, not a variable holding data. Calling .read() moves the cursor to the end. Calling .read() again reads nothing because the cursor has nothing ahead of it. Calling .seek(0) is like pressing Ctrl+Home to jump back to the start.

Memory-Mapped Files: When read() Is Not Enough

For very large files that you need to access non-sequentially -- log archives, binary databases, genome data files -- even line-by-line iteration becomes limiting. The standard library's mmap module offers a different model: instead of reading bytes into Python memory, it maps the file's pages directly into the process's virtual address space. The operating system handles paging on demand, and you access the file as if it were a bytes object.

import mmap

with open("genome_database.bin", "rb") as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        # mm behaves like a bytes-like object
        # The OS loads only the pages you actually touch
        position = mm.find(b'\x00MARKER\x00')
        if position != -1:
            record = mm[position:position + 256]

The length=0 argument tells mmap to map the entire file. ACCESS_READ creates a read-only mapping, which prevents accidental writes and allows the OS to share pages between processes reading the same file. The critical performance difference: with .read(), you always pay the cost of copying bytes into Python's heap. With mmap, if your access pattern only touches 2% of a 10 GB file, you only load 2% of the data from disk.

The trade-off is complexity and platform constraints. Memory-mapped files can raise OSError on systems with limited virtual address space (a real concern on 32-bit systems or when mapping multiple large files simultaneously). They also cannot be used with files on certain virtual filesystems. For typical files under a few hundred megabytes accessed sequentially, the generator pattern with for line in f is simpler and performs comparably. Reserve mmap for the cases where you genuinely need random access into a very large file without loading it entirely.

Handling Errors Gracefully

Real-world file reading requires handling the cases where things go wrong:

from pathlib import Path

filepath = Path("user_data.json")

try:
    content = filepath.read_text(encoding="utf-8")
except FileNotFoundError:
    print(f"File not found: {filepath}")
except PermissionError:
    print(f"Permission denied: {filepath}")
except UnicodeDecodeError:
    print(f"Encoding error: file is not valid UTF-8")

The UnicodeDecodeError case is particularly insidious. If someone hands you a file that claims to be UTF-8 but contains bytes from a different encoding, Python will raise this exception at the point where it encounters the invalid byte sequence -- which might be deep into a multi-gigabyte file. For maximum resilience, you can use the errors parameter:

with open("messy_data.txt", "r", encoding="utf-8", errors="replace") as f:
    content = f.read()  # Invalid bytes become U+FFFD replacement characters

Other useful error handlers include "ignore" (silently drops invalid bytes) and "backslashreplace" (inserts Python backslash escape sequences). The "replace" option is generally safest for exploratory reading because it preserves all valid data while marking problems visibly.
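The handlers are easy to compare on a single invalid byte (0xE9 is "é" in Latin-1 but an incomplete multi-byte sequence in UTF-8):

```python
raw = b"caf\xe9"   # Latin-1 bytes for "café"; 0xE9 alone is invalid UTF-8

print(repr(raw.decode("utf-8", errors="replace")))           # 'caf\ufffd'
print(repr(raw.decode("utf-8", errors="ignore")))            # 'caf'
print(repr(raw.decode("utf-8", errors="backslashreplace")))  # 'caf\\xe9'
```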

There is a fourth error handler that deserves mention: "surrogateescape". This handler, introduced in PEP 383, encodes invalid bytes as surrogate characters in the Unicode range U+DC80 through U+DCFF. The advantage is that these surrogate characters can be round-tripped back to the original bytes when writing the file back to disk. This is the default error handler for the operating system interface (file names, environment variables) and is invaluable for processing files where you need to preserve undecodable bytes without data loss:

with open("mixed_encoding.log", "r", encoding="utf-8", errors="surrogateescape") as f:
    content = f.read()  # Bad bytes become surrogates, preserving original data

with open("output.log", "w", encoding="utf-8", errors="surrogateescape") as f:
    f.write(content)  # Surrogates are converted back to original bytes

Detecting Unknown Encodings

Steve Dower's principle -- if you do not know the encoding, you cannot reliably read the file -- is technically correct. But in practice, developers regularly receive files with no encoding metadata: legacy CSVs from enterprise systems, web scraping output, logs from third-party services, data exports from older databases. You need a strategy.

The charset-normalizer library (which replaced the older chardet as the recommended encoding detection tool, and is now used internally by the requests library) uses statistical analysis to detect the encoding of arbitrary byte sequences:

from charset_normalizer import from_bytes
from pathlib import Path

raw_bytes = Path("mystery_file.csv").read_bytes()
results = from_bytes(raw_bytes)

best_match = results.best()
if best_match:
    detected_encoding = best_match.encoding
    text = str(best_match)  # Decoded text using detected encoding
    print(f"Detected: {detected_encoding}")
else:
    print("Could not detect encoding")

Encoding detection is probabilistic, not deterministic. No tool can guarantee correctness because the same byte sequence can be valid in multiple encodings. For example, the byte 0xE9 is the character "é" in Latin-1 and the lead byte of a three-byte sequence in UTF-8. The best approach is to layer defenses: try the expected encoding first, fall back to detection, and always validate the result against domain-specific expectations (are the decoded characters reasonable for this type of data?).

Note

A practical heuristic: if a file is valid UTF-8, it is almost certainly UTF-8. Invalid UTF-8 sequences are extremely common in files encoded with single-byte encodings like Latin-1 or Windows-1252, because many of those encodings use byte values (0x80-0xFF) that are not valid UTF-8 lead bytes. Try encoding="utf-8" first; if it raises UnicodeDecodeError, then investigate further.
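That heuristic translates into a small fallback reader. A sketch -- the Latin-1 fallback and the helper name are illustrative choices, not a standard recipe:

```python
from pathlib import Path

def read_text_lenient(path):
    """Try UTF-8 first; fall back to Latin-1, which accepts any byte."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a character, so this never
        # fails -- but the result may be wrong if the true encoding differs.
        return raw.decode("latin-1"), "latin-1"

# Fabricate a Latin-1 sample so the sketch is runnable.
Path("legacy.csv").write_bytes(b"name;caf\xe9\n")
text, used = read_text_lenient("legacy.csv")
print(used)   # latin-1
print(text)   # name;café
```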

Reading Specific File Formats

Python's standard library and ecosystem provide specialized tools for structured file formats. Using them instead of raw .read() is almost always the right call:

import json
import csv
from pathlib import Path

# JSON
with open("api_response.json", "r", encoding="utf-8") as f:
    data = json.load(f)  # Returns dict/list, not raw string

# CSV
with open("sales_report.csv", "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["revenue"])

# Reading structured config (Python 3.11+)
import tomllib
with open("pyproject.toml", "rb") as f:  # Note: binary mode for TOML
    config = tomllib.load(f)

Note the newline="" parameter when reading CSV files. This is specified in the csv module documentation and ensures that the CSV reader handles line endings correctly across platforms, preventing issues with fields that contain embedded newlines. Also note that tomllib.load() requires binary mode ("rb") -- TOML files are defined as UTF-8, and the tomllib module handles decoding internally rather than relying on the system's text I/O layer. The tomllib module was added to the standard library in Python 3.11 via PEP 680; for earlier Python versions, install the tomli package from PyPI, which shares the same API.

Security: What File Reading Can Expose

File reading is a trust boundary. Every time your program reads a file, it is accepting input from an external source -- and that input may be hostile, oversized, or structurally malicious. This section covers security considerations that many file I/O guides overlook entirely.

Path traversal. If a user or external system provides a filename, never trust it blindly. A filename like ../../etc/shadow could escape your intended directory:

from pathlib import Path

SAFE_DIR = Path("/app/uploads").resolve()

def safe_read(user_filename: str) -> str:
    """Read a file only if it resides within the safe directory."""
    requested = (SAFE_DIR / user_filename).resolve()
    if not requested.is_relative_to(SAFE_DIR):
        raise ValueError(f"Path traversal detected: {user_filename}")
    return requested.read_text(encoding="utf-8")

The .resolve() call collapses .. segments into absolute paths, and .is_relative_to() (Python 3.9+) verifies that the resolved path still falls under the intended directory. Without this check, an attacker can read any file the process has permission to access.

Denial of service through oversized files. If your code calls .read() on user-provided files, an attacker can provide a multi-gigabyte file and exhaust your process's memory. Always limit what you read:

MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB

def read_with_limit(filepath: Path, max_size: int = MAX_FILE_SIZE) -> str:
    """Read a file, refusing to load anything larger than max_size."""
    file_size = filepath.stat().st_size
    if file_size > max_size:
        raise ValueError(f"File too large: {file_size} bytes (limit: {max_size})")
    return filepath.read_text(encoding="utf-8")

Symlink attacks. On POSIX systems, a symbolic link can point anywhere -- including files outside the expected directory. If your application reads from a directory that untrusted users can write to, an attacker can create a symlink that redirects your read to a sensitive file. Use Path.resolve() before reading to follow symlinks and validate the true target, or use os.open() with the O_NOFOLLOW flag to refuse to open symlinks at all.

TOCTOU (Time-of-check to time-of-use). There is an inherent race condition between checking a file's properties and reading it. Between calling filepath.exists() and filepath.read_text(), the file can be replaced by another process. In security-sensitive contexts, it is better to attempt the read inside a try/except block rather than checking first. The Python documentation for os.access() explicitly warns about this pattern.
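The EAFP ("easier to ask forgiveness than permission") version of that advice looks like this -- a sketch in which the helper name and the missing-file path are stand-ins:

```python
from pathlib import Path

# LBYL (racy): the file can vanish between exists() and read_text().
#   if path.exists():
#       data = path.read_text(encoding="utf-8")

# EAFP: attempt the read; handle failure at the only moment that matters.
def try_read(path: Path):
    try:
        return path.read_text(encoding="utf-8")
    except (FileNotFoundError, PermissionError):
        return None

print(try_read(Path("definitely_missing_1234.txt")))  # None
```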

What Happens When Multiple Processes Read the Same File

Many tutorials cover the single-process case and stop there. But in production systems -- web servers, data pipelines, logging infrastructure -- multiple processes frequently read (and write) the same files at the same time. Python's open() gives you no concurrency guarantees by default.

For pure reads, concurrent access is generally safe on POSIX systems. A file opened for reading gets a separate file descriptor with its own position pointer, so two processes iterating the same log file independently will not interfere with each other's position. The OS handles this at the kernel level.

The problem arises when one process is writing while another is reading. Without synchronization, a reader can observe a partially-written file -- seeing some new data but not yet the complete record. The Pythonic solutions range in complexity:

import fcntl
import os

def read_with_lock(filepath):
    """Read a file safely while another process may be writing to it."""
    with open(filepath, "r", encoding="utf-8") as f:
        # Request a shared (read) lock. Multiple readers can hold this
        # simultaneously, but it blocks if a writer holds an exclusive lock.
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

Platform Note

fcntl is POSIX-only and unavailable on Windows. The cross-platform alternative is the filelock library, which implements advisory file locking that works on both POSIX and Windows by using a separate lock file alongside the target file.

There is also a subtler question: atomicity at the read level. If you are reading a configuration file that another process replaces by writing to a temporary file and then renaming it (the atomic rename pattern), you may read either the old or new file in its entirety -- but you will never read a half-written file, because the rename syscall itself is atomic on POSIX. This is a common and robust pattern for safely updating configuration that other processes are reading.
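Here is the writer's side of that pattern, sketched with os.replace(), which performs an atomic rename on POSIX when source and destination are on the same filesystem (the function name and file names are stand-ins):

```python
import os
import tempfile

def atomic_write_text(path, text, encoding="utf-8"):
    """Write text to a temporary file, then atomically rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename cannot
    # cross filesystems (which would make it non-atomic).
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())       # ensure the bytes reach the disk
        os.replace(tmp_path, path)     # atomic swap: readers see old or new
    except BaseException:
        os.unlink(tmp_path)
        raise

atomic_write_text("config.json", '{"version": 2}')
with open("config.json", encoding="utf-8") as f:
    print(f.read())  # {"version": 2}
```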

Asynchronous File I/O

Python's asyncio does not provide built-in asynchronous file operations, and for good reason: the underlying POSIX file I/O APIs are fundamentally synchronous. Calling open() and read() on a regular file will block the calling thread until the OS completes the operation, and there is no portable, non-blocking mechanism for disk I/O the way there is for network sockets.

In an async application -- a web server, a chat bot, a data pipeline -- blocking on file I/O can stall the entire event loop. The aiofiles library solves this by running file operations in a thread pool behind the scenes:

import aiofiles
import asyncio

async def read_config():
    async with aiofiles.open("config.json", "r", encoding="utf-8") as f:
        content = await f.read()
    return content

# From synchronous code, drive the coroutine with asyncio.run():
config = asyncio.run(read_config())

The API mirrors the synchronous open() / with pattern, but each I/O call is awaitable and runs on a separate thread. This keeps the event loop responsive even when reading large files.

An alternative for simpler cases is asyncio.to_thread() (Python 3.9+), which delegates any synchronous function to the default thread pool executor:

import asyncio
from pathlib import Path

async def load_data():
    content = await asyncio.to_thread(
        Path("large_dataset.json").read_text, encoding="utf-8"
    )
    return content

This is less ergonomic for line-by-line processing but is a zero-dependency solution when you only need to offload a single blocking read.

Putting It All Together: A Real-World Pattern

Here is a practical pattern that incorporates every best practice discussed in this article -- explicit encoding, proper error handling, memory-efficient reading, security validation, and pathlib for clean path management:

from pathlib import Path
from typing import Iterator


SAFE_BASE = Path("/var/log/myapp").resolve()
MAX_LOG_SIZE = 500 * 1024 * 1024  # 500 MB


def read_log_entries(log_path: Path) -> Iterator[dict]:
    """
    Read a structured log file, yielding parsed entries.
    
    Each line is expected to be: TIMESTAMP | LEVEL | MESSAGE
    Validates path safety, file size, and encoding resilience.
    """
    resolved = log_path.resolve()
    if not resolved.is_relative_to(SAFE_BASE):
        raise ValueError(f"Path outside allowed directory: {log_path}")
    
    if not resolved.exists():
        raise FileNotFoundError(f"Log file not found: {resolved}")
    
    file_size = resolved.stat().st_size
    if file_size > MAX_LOG_SIZE:
        raise ValueError(f"Log file too large: {file_size} bytes")
    
    with resolved.open("r", encoding="utf-8", errors="replace") as f:
        for line_number, raw_line in enumerate(f, start=1):
            line = raw_line.strip()
            if not line:
                continue
            
            parts = line.split(" | ", maxsplit=2)
            if len(parts) != 3:
                continue  # Skip malformed lines
            
            yield {
                "line": line_number,
                "timestamp": parts[0],
                "level": parts[1],
                "message": parts[2],
            }


# Usage
log_file = SAFE_BASE / "application.log"
for entry in read_log_entries(log_file):
    if entry["level"] == "ERROR":
        print(f"[{entry['timestamp']}] {entry['message']}")

This function reads a potentially large file line-by-line (memory-efficient), specifies encoding explicitly (cross-platform safe), uses errors="replace" (resilient to encoding corruption), leverages pathlib for path handling (clean and readable), validates the resolved path against a safe base directory (security-conscious), checks file size before reading (DoS-resistant), and yields results lazily via a generator (composable with other processing pipelines).

Quick Reference: Which Method to Use When

  1. Quick one-shot read: Use Path.read_text(encoding="utf-8") -- cleanest and most concise.
  2. Line-by-line processing of large files: Iterate directly over the file object inside a with block.
  3. Chunked binary reading: Use the walrus operator pattern: while chunk := f.read(8192).
  4. Binary data: Use "rb" mode and .read() with a chunk size.
  5. Non-sequential access within a file: Use .seek() and .tell() -- in binary mode for arbitrary positions, text mode only for positions returned by .tell().
  6. Very large binary files requiring random access: Use mmap for demand-paged reading without loading the full file into memory.
  7. Structured formats (JSON, CSV, TOML): Use the dedicated standard library modules.
  8. Unknown encodings: Use charset-normalizer for statistical detection, then validate results.
  9. Async applications: Use aiofiles or asyncio.to_thread() to avoid blocking the event loop.
  10. User-provided file paths: Always resolve and validate paths against a safe directory before reading.
  11. Concurrent reads with possible concurrent writes: Use fcntl.flock() on POSIX or the filelock library for cross-platform advisory locking.
  12. Always: Specify your encoding explicitly. Until Python 3.15 ships as a stable release, the default is locale-dependent, not UTF-8.

Guido van Rossum has consistently emphasized that Python code should be written primarily for the humans who will read it. His contributions across numerous PEPs, python-dev discussions, and his foreword to Mark Lutz's Programming Python share a common thread: clarity of intent matters more than brevity of expression.

The difference between open("data.txt") and open("data.txt", encoding="utf-8") is eighteen characters -- but those eighteen characters communicate your intent, prevent cross-platform bugs, and demonstrate that you understand what your code is actually doing. That is the difference between copying a tutorial and actually comprehending the language.
