A word frequency analyzer reads a piece of text and tells you how often each word appears. Building one from scratch is one of the best beginner Python projects because it puts dictionaries, loops, string methods, and sorting all into practice at the same time on a problem you can see and understand right away.
Word frequency analysis shows up in many real applications: search engines rank pages partly by keyword frequency, data scientists analyze text corpora, and security analysts scan logs for anomalous patterns. The Python version you will build here is small enough to fit in one screen but complete enough to run on any text you hand it.
What the Analyzer Will Do
The finished program accepts a string of text, cleans it so that capitalization and punctuation do not interfere with counting, splits it into individual words, counts each word using a dictionary, and prints a ranked list from the most frequent word down to the least frequent. You will also learn how to filter out common words like "the" and "a" that inflate counts without adding meaning.
Here is the complete program you will build step by step throughout this tutorial. Read through it now so you have the whole picture before the individual pieces are explained.
import string

def count_words(text, top_n=10):
    # Normalize: lowercase and remove punctuation
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Split into individual words
    words = text.split()

    # Count each word
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1

    # Sort by frequency, highest first
    sorted_words = sorted(freq.items(), key=lambda item: item[1], reverse=True)

    # Print the top N results
    print(f"Top {top_n} words:")
    for word, count in sorted_words[:top_n]:
        print(f" {word}: {count}")

sample = """
Python is an easy to learn language. Python is also a powerful language.
Many programmers choose Python because Python syntax is clean and readable.
"""

count_words(sample)
No third-party libraries are required. Everything used here is part of Python's standard library. You can run this code in any Python 3.6 or later environment, including the browser-based Python shells at python.org/shell and repl.it.
The Python Tools You Will Use
Before writing any code it helps to know what each built-in tool does and why it is the right choice for this problem.
Dictionaries as counters
A Python dictionary maps keys to values. When counting words, each unique word is a key and its running total is the value. Dictionaries allow you to look up and update a count in constant time, which makes them the natural choice for frequency analysis over, say, a sorted list that would require scanning on every lookup.
# The .get(key, default) pattern is central to the counting loop.
# If the word exists, return its current count.
# If it does not exist yet, return 0 so we can add 1 to it.
freq = {}
freq['python'] = freq.get('python', 0) + 1 # first time: 0 + 1 = 1
freq['python'] = freq.get('python', 0) + 1 # second time: 1 + 1 = 2
print(freq) # {'python': 2}
String methods: lower, split, and translate
Three string methods do the cleanup work. .lower() converts the entire text to lowercase so "Python" and "python" are counted as the same word. .split() breaks the string into a list of substrings at every whitespace boundary, which gives you the individual words. .translate() paired with str.maketrans() removes every punctuation character in a single pass without a loop.
import string
text = "Hello, World! Hello."
text = text.lower() # "hello, world! hello."
text = text.translate(str.maketrans('', '', string.punctuation)) # "hello world hello"
words = text.split() # ['hello', 'world', 'hello']
print(words)
sorted() with a lambda key
After building the frequency dictionary, you have an unordered collection of word-count pairs. sorted() takes an iterable and returns a new sorted list. The key argument accepts a function that is called on each element to produce a sort value. Using lambda item: item[1] tells Python to sort each (word, count) tuple by its second element — the count. Setting reverse=True puts the highest counts first.
freq = {'python': 4, 'language': 2, 'easy': 1, 'is': 3}
# .items() returns (key, value) pairs as a view
# sorted() returns a new list — the original dict is unchanged
sorted_words = sorted(freq.items(), key=lambda item: item[1], reverse=True)
print(sorted_words)
# [('python', 4), ('is', 3), ('language', 2), ('easy', 1)]
Three ways of counting are worth comparing before you settle on one.

The dictionary .get() pattern
- What it does: Retrieves the current count (or 0 if absent) and increments by 1, storing the result back under the same key.
- When to use: The standard beginner pattern. Explicit, readable, and requires no imports.

collections.defaultdict(int)
- What it does: A dictionary subclass from collections that automatically inserts 0 for missing keys, so you can write freq[word] += 1 without a KeyError.
- When to use: When you want slightly shorter loop code and are already importing from collections.

collections.Counter
- What it does: Counts all elements in an iterable in a single call and returns a Counter object. Has a built-in .most_common(n) method that replaces the manual sort.
- When to use: When you need a production-grade solution in minimal lines. Learn the manual approach first so you understand what Counter automates.
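The two collections helpers described above can be sketched side by side. The words list here is a small stand-in for what .split() would produce:

```python
from collections import defaultdict, Counter

# A small stand-in for the list produced by text.split()
words = ['python', 'is', 'easy', 'python', 'is', 'fun']

# Option 1: defaultdict(int) supplies 0 for any missing key
freq = defaultdict(int)
for word in words:
    freq[word] += 1              # no .get() and no KeyError
print(dict(freq))                # {'python': 2, 'is': 2, 'easy': 1, 'fun': 1}

# Option 2: Counter performs the whole counting loop in one call
counts = Counter(words)
print(counts.most_common(2))     # [('python', 2), ('is', 2)]
```

All three approaches produce the same counts; they differ only in how much of the bookkeeping Python does for you.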
To recap the counting line: freq[word] = freq.get(word, 0) + 1 reads the current count (defaulting to 0 if the word is new), adds 1, and writes the updated value back under the same key. The = sign is assignment, not comparison, and using - instead of + would decrement counts rather than increment them.
Cleaning the Input Text
Raw text rarely arrives in a clean state. Consider the sentence "Python, is great! Python." — a naive split produces ['Python,', 'is', 'great!', 'Python.']. The trailing punctuation makes "Python," and "Python." look like different words, which produces two separate keys in the frequency dictionary when you want one. Two preprocessing steps fix this.
The first step is case normalization. Calling .lower() on the full string before splitting means "Python", "PYTHON", and "python" all map to the same key 'python'. This is always correct for general-purpose frequency analysis, though you might skip it for case-sensitive tasks like analyzing source code identifiers.
The second step is punctuation removal. Python's string module provides a pre-built constant string.punctuation that contains all 32 standard ASCII punctuation characters. Calling str.maketrans('', '', string.punctuation) creates a translation table that maps every punctuation character to None, meaning delete it. Passing that table to text.translate() strips all punctuation from the string in one operation without a loop.
import string
raw = "Python, is great! Python."
# Step 1: normalize case
lowered = raw.lower()
print(lowered) # "python, is great! python."
# Step 2: strip punctuation
table = str.maketrans('', '', string.punctuation)
clean = lowered.translate(table)
print(clean) # "python is great python"
# Now split produces clean, consistent tokens
words = clean.split()
print(words) # ['python', 'is', 'great', 'python']
Always clean before you split, not after. If you split first, you have to loop through every word and strip punctuation from each one individually. Cleaning the full string first means a single .translate() call handles the entire text regardless of length.
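To make the ordering concrete, here is a sketch contrasting the two orders on the same sentence. Both happen to produce identical tokens for this input, but the split-first version needs a pass over every word, and .strip() only removes punctuation from the edges of each word, so internal characters such as apostrophes would survive it:

```python
import string

raw = "Python, is great! Python."
table = str.maketrans('', '', string.punctuation)

# Clean first, then split: one translate() call for the whole text
clean_first = raw.lower().translate(table).split()

# Split first, then clean: a separate strip() for every single word
split_first = [w.strip(string.punctuation) for w in raw.lower().split()]

print(clean_first)  # ['python', 'is', 'great', 'python']
print(split_first)  # ['python', 'is', 'great', 'python']
```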
Filtering stop words
Once you have a working frequency dictionary, you may notice that the top positions are dominated by words like "the", "a", "is", and "to". These are called stop words. They appear constantly in English but convey nothing specific about the content being analyzed. A small set defined in your code is enough for beginner projects. For each word, you check if it belongs to the stop word set before adding it to the frequency dictionary, which keeps those words out of the counts entirely.
STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'were',
              'and', 'or', 'but', 'in', 'on', 'at', 'to',
              'of', 'for', 'with', 'it', 'this', 'that'}

freq = {}
for word in words:
    if word not in STOP_WORDS:  # skip stop words
        freq[word] = freq.get(word, 0) + 1
Use a set, not a list, for your stop words. Checking membership in a set — word not in STOP_WORDS — runs in constant time. Checking membership in a list requires scanning each element, so it slows down proportionally as the list grows.
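The difference is easy to measure with the standard-library timeit module. This sketch uses a deliberately oversized collection of made-up stop words so the gap is visible; the exact timings will vary by machine:

```python
import timeit

# 10,000 fake stop words -- far more than a real list, for demonstration
stop_list = [f"word{i}" for i in range(10_000)]
stop_set = set(stop_list)

# Look up a word near the end of the collection, 1,000 times each way
list_time = timeit.timeit(lambda: "word9999" in stop_list, number=1_000)
set_time = timeit.timeit(lambda: "word9999" in stop_set, number=1_000)

print(f"list membership: {list_time:.4f}s")
print(f"set membership:  {set_time:.4f}s")  # typically far smaller
```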
A classic mistake when wrapping the counting loop in a function: the loop builds the freq dictionary correctly, but the function ends with return {} instead of return freq, so every call returns empty results. The loop populates freq, but the function discards it by returning a new empty dictionary literal instead of the variable that was just filled.
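Here is a minimal sketch of that bug and its fix; the function names are hypothetical:

```python
def count_words_broken(text):
    """Builds the counts correctly, then throws them away."""
    freq = {}
    for word in text.lower().split():
        freq[word] = freq.get(word, 0) + 1
    return {}        # BUG: returns a new empty dict, not freq

def count_words_fixed(text):
    """Identical loop, but returns the dictionary it built."""
    freq = {}
    for word in text.lower().split():
        freq[word] = freq.get(word, 0) + 1
    return freq      # FIX: return the populated dictionary

print(count_words_broken("to be or not to be"))  # {}
print(count_words_fixed("to be or not to be"))   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```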
How to Build a Word Frequency Analyzer in Python
The following five steps walk through the complete construction of the word frequency analyzer, from receiving raw text to printing a ranked output.
1. Define the input text and clean it. Store your text in a variable. Call .lower() to normalize capitalization, then call .translate(str.maketrans('', '', string.punctuation)) to remove all punctuation. Both operations must happen before splitting so that every word token is consistent.

2. Split the text into a list of words. Call .split() on the cleaned string. With no arguments, .split() divides on any whitespace sequence and discards empty strings, giving you a list where each element is one word. Assign the result to a variable named words.

3. Count each word using a dictionary. Create an empty dictionary freq = {}. Loop over the words list. Inside the loop, write freq[word] = freq.get(word, 0) + 1. This retrieves the current count for each word (or zero on first encounter) and increments it by one, storing the updated value back under the same key.

4. Sort the results by frequency. Call sorted(freq.items(), key=lambda item: item[1], reverse=True). The .items() method returns each key-value pair as a tuple. The key argument tells sorted() to rank by the second element of each tuple, which is the count. Setting reverse=True places the highest count first.

5. Print the top results. Loop over the sorted list and unpack each tuple into word and count variables, then print them. Use list slicing, sorted_words[:10], to limit output to the top ten words. You can make this limit a parameter so callers can request as many or as few results as they need.
Putting all five steps together with the stop-word filter produces the extended version of the analyzer shown below. Notice that the only structural difference from the first version is the addition of the STOP_WORDS set and the conditional inside the loop.
import string

STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'were',
              'and', 'or', 'but', 'in', 'on', 'at', 'to',
              'of', 'for', 'with', 'it', 'this', 'that'}

def count_words(text, top_n=10, filter_stops=True):
    # Step 1: clean
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Step 2: split
    words = text.split()

    # Step 3: count
    freq = {}
    for word in words:
        if filter_stops and word in STOP_WORDS:
            continue
        freq[word] = freq.get(word, 0) + 1

    # Step 4: sort
    sorted_words = sorted(freq.items(), key=lambda item: item[1], reverse=True)

    # Step 5: print top N
    print(f"\nTop {top_n} words:")
    for word, count in sorted_words[:top_n]:
        bar = '#' * count
        print(f" {word:<20} {count:>3} {bar}")

sample = """
Python is an easy to learn language. Python is also a powerful language.
Many programmers choose Python because Python syntax is clean and readable.
Learning Python opens doors to data science, web development, and automation.
"""

count_words(sample, top_n=8)
Running this on the sample text produces output similar to the following, with "python" appearing five times and the remaining content words ranked below it. Words with equal counts appear in the order they were first encountered, because Python's sort is stable.

Top 8 words:
 python                 5 #####
 language               2 ##
 easy                   1 #
 learn                  1 #
 also                   1 #
 powerful               1 #
 many                   1 #
 programmers            1 #
"Simple is better than complex." — The Zen of Python (PEP 20)
Python Learning Summary Points
- A Python dictionary is the right data structure for counting because it gives you constant-time lookup and update. The pattern freq[word] = freq.get(word, 0) + 1 handles both new words and existing words without any conditional branching on your part.
- Always normalize text before splitting. Calling .lower() and .translate() on the full string is more efficient than processing individual words after the fact, and it ensures your tokens are consistent from the start.
- sorted() with a lambda key is a general pattern you will use throughout Python whenever you need to rank a collection by a derived value. In this project you sort by count, but the same pattern works for sorting objects by any attribute.
- Stop word filtering is controlled by a set membership check. Using a set keeps the check at constant time regardless of how many stop words you define, which matters once your stop word list grows beyond a handful of entries.
- The collections.Counter class in Python's standard library automates the entire counting loop into a single call. Learning the manual dictionary approach first gives you a clear mental model of what Counter does, making it easier to use and debug when you adopt it.
The word frequency analyzer is a compact program but it exercises the core skills of Python data manipulation: reading and transforming strings, building and querying dictionaries, iterating with for loops, and sorting sequences. Every project you write after this one will call on those same patterns.
Frequently Asked Questions
What is a word frequency analyzer?
A word frequency analyzer is a program that reads a piece of text and counts how many times each unique word appears. In Python, this is done by splitting the text into a list of words, then using a dictionary to keep a running count for each word.

What data structure should I use to count word frequencies?
A Python dictionary is the standard data structure for counting word frequencies. Each unique word becomes a key, and its count becomes the value. Python also provides collections.Counter, which is a specialized dictionary subclass built for exactly this task.

How do I split text into words in Python?
Call the .split() method on a string. By default, .split() splits on whitespace and removes empty strings. For example, 'hello world'.split() returns ['hello', 'world']. To handle punctuation, call .lower() and strip punctuation before splitting.

What does dict.get(key, default) do?
dict.get(key, default) returns the value for the key if it exists, or the default value if it does not. When counting words, freq.get(word, 0) returns the current count for a word or zero if the word has not been seen yet, which lets you increment the count safely without a KeyError.

How do I sort a dictionary by its values?
Use sorted() with a key argument. For a dictionary named freq, sorted(freq.items(), key=lambda item: item[1], reverse=True) returns a list of (word, count) tuples sorted from highest to lowest count. The lambda extracts the second element of each tuple, which is the count.

What is collections.Counter?
collections.Counter is a dictionary subclass in Python's standard library that automates word counting. Pass it an iterable and it returns a Counter object with each element as a key and its count as the value. It also provides a .most_common(n) method. Beginners benefit from learning the manual dictionary approach first so they understand what Counter is doing under the hood.

How do I handle capitalization and punctuation?
Convert the entire text to lowercase with text.lower() before splitting, so "Python" and "python" are counted together. Remove punctuation by using str.translate() with str.maketrans('', '', string.punctuation), which strips every punctuation character in a single operation.

What are stop words and why filter them out?
Stop words are common function words like "the", "a", "is", and "in" that appear very frequently but carry little meaning. Filtering them out lets a word frequency analyzer surface the words that characterize the content. Filter them by checking if each word belongs to a set of stop words before adding it to the frequency dictionary.