A word frequency analyzer reads a piece of text and tells you how often each word appears. Building one from scratch is one of the best beginner Python projects because it puts dictionaries, loops, string methods, and sorting all into practice at the same time on a problem you can see and understand right away.
Word frequency analysis shows up in many real applications: search engines rank pages partly by keyword frequency, data scientists analyze text corpora, and security analysts scan logs for anomalous patterns. The Python version you will build here is small enough to fit in one screen but complete enough to run on any text you hand it.
What the Analyzer Will Do
The finished program accepts a string of text, cleans it so that capitalization and punctuation do not interfere with counting, splits it into individual words, counts each word using a dictionary, and prints a ranked list from the most frequent word down to the least frequent. You will also learn how to filter out common words like "the" and "a" that inflate counts without adding meaning.
Here is the complete program you will build step by step throughout this tutorial. Read through it now so you have the whole picture before the individual pieces are explained.
import string

def count_words(text, top_n=10):
    # Normalize: lowercase and remove punctuation
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Split into individual words
    words = text.split()

    # Count each word
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1

    # Sort by frequency, highest first
    sorted_words = sorted(freq.items(), key=lambda item: item[1], reverse=True)

    # Print the top N results
    print(f"Top {top_n} words:")
    for word, count in sorted_words[:top_n]:
        print(f" {word}: {count}")

sample = """
Python is an easy to learn language. Python is also a powerful language.
Many programmers choose Python because Python syntax is clean and readable.
"""

count_words(sample)
No third-party libraries are required. Everything used here is part of Python's standard library. You can run this code in any Python 3.6 or later environment, including the browser-based Python shells at python.org/shell and repl.it.
The Python Tools You Will Use
Before writing any code it helps to know what each built-in tool does and why it is the right choice for this problem.
Dictionaries as counters
A Python dictionary maps keys to values. When counting words, each unique word is a key and its running total is the value. Dictionaries allow you to look up and update a count in constant time, which makes them the natural choice for frequency analysis over, say, a sorted list that would require scanning on every lookup.
# The .get(key, default) pattern is central to the counting loop.
# If the word exists, return its current count.
# If it does not exist yet, return 0 so we can add 1 to it.
freq = {}
freq['python'] = freq.get('python', 0) + 1 # first time: 0 + 1 = 1
freq['python'] = freq.get('python', 0) + 1 # second time: 1 + 1 = 2
print(freq) # {'python': 2}
String methods: lower, split, and translate
Three string methods do the cleanup work. .lower() converts the entire text to lowercase so "Python" and "python" are counted as the same word. .split() breaks the string into a list of substrings at every whitespace boundary, which gives you the individual words. .translate() paired with str.maketrans() removes every punctuation character in a single pass without a loop.
import string
text = "Hello, World! Hello."
text = text.lower() # "hello, world! hello."
text = text.translate(str.maketrans('', '', string.punctuation)) # "hello world hello"
words = text.split() # ['hello', 'world', 'hello']
print(words)
sorted() with a lambda key
After building the frequency dictionary, you have an unordered collection of word-count pairs. sorted() takes an iterable and returns a new sorted list. The key argument accepts a function that is called on each element to produce a sort value. Using lambda item: item[1] tells Python to sort each (word, count) tuple by its second element — the count. Setting reverse=True puts the highest counts first.
freq = {'python': 4, 'language': 2, 'easy': 1, 'is': 3}
# .items() returns (key, value) pairs as a view
# sorted() returns a new list — the original dict is unchanged
sorted_words = sorted(freq.items(), key=lambda item: item[1], reverse=True)
print(sorted_words)
# [('python', 4), ('is', 3), ('language', 2), ('easy', 1)]
Three ways of counting are worth comparing before you settle on one.

The dictionary .get() pattern
- What it does: Retrieves the current count (or 0 if absent) and increments by 1, storing the result back under the same key.
- When to use: The standard beginner pattern. Explicit, readable, and requires no imports.

collections.defaultdict(int)
- What it does: A dictionary subclass from collections that automatically inserts 0 for missing keys, so you can write freq[word] += 1 without a KeyError.
- When to use: When you want slightly shorter loop code and are already importing from collections.

collections.Counter
- What it does: Counts all elements in an iterable in a single call and returns a Counter object. Has a built-in .most_common(n) method that replaces the manual sort.
- When to use: When you need a production-grade solution in minimal lines. Learn the manual approach first so you understand what Counter automates.
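The two collections helpers described above can be sketched side by side. The words list here is a small stand-in for what .split() would produce:

```python
from collections import defaultdict, Counter

# A small stand-in for the list produced by text.split()
words = ['python', 'is', 'easy', 'python', 'is', 'fun']

# Option 1: defaultdict(int) supplies 0 for any missing key
freq = defaultdict(int)
for word in words:
    freq[word] += 1              # no .get() and no KeyError
print(dict(freq))                # {'python': 2, 'is': 2, 'easy': 1, 'fun': 1}

# Option 2: Counter performs the whole counting loop in one call
counts = Counter(words)
print(counts.most_common(2))     # [('python', 2), ('is', 2)]
```

All three approaches produce the same counts; they differ only in how much of the bookkeeping Python does for you.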
To recap the counting line: freq[word] = freq.get(word, 0) + 1 reads the current count (defaulting to 0 if the word is new), adds 1, and writes the updated value back under the same key. The = sign is assignment, not comparison, and using - instead of + would decrement counts rather than increment them.
Cleaning the Input Text
Raw text rarely arrives in a clean state. Consider the sentence "Python, is great! Python." — a naive split produces ['Python,', 'is', 'great!', 'Python.']. The trailing punctuation makes "Python," and "Python." look like different words, which produces two separate keys in the frequency dictionary when you want one. Two preprocessing steps fix this.
The first step is case normalization. Calling .lower() on the full string before splitting means "Python", "PYTHON", and "python" all map to the same key 'python'. This is always correct for general-purpose frequency analysis, though you might skip it for case-sensitive tasks like analyzing source code identifiers.
The second step is punctuation removal. Python's string module provides a pre-built constant string.punctuation that contains all 32 standard ASCII punctuation characters. Calling str.maketrans('', '', string.punctuation) creates a translation table that maps every punctuation character to None, meaning delete it. Passing that table to text.translate() strips all punctuation from the string in one operation without a loop.
import string
raw = "Python, is great! Python."
# Step 1: normalize case
lowered = raw.lower()
print(lowered) # "python, is great! python."
# Step 2: strip punctuation
table = str.maketrans('', '', string.punctuation)
clean = lowered.translate(table)
print(clean) # "python is great python"
# Now split produces clean, consistent tokens
words = clean.split()
print(words) # ['python', 'is', 'great', 'python']
Always clean before you split, not after. If you split first, you have to loop through every word and strip punctuation from each one individually. Cleaning the full string first means a single .translate() call handles the entire text regardless of length.
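To make the ordering concrete, here is a sketch contrasting the two orders on the same sentence. Both happen to produce identical tokens for this input, but the split-first version needs a pass over every word, and .strip() only removes punctuation from the edges of each word, so internal characters such as apostrophes would survive it:

```python
import string

raw = "Python, is great! Python."
table = str.maketrans('', '', string.punctuation)

# Clean first, then split: one translate() call for the whole text
clean_first = raw.lower().translate(table).split()

# Split first, then clean: a separate strip() for every single word
split_first = [w.strip(string.punctuation) for w in raw.lower().split()]

print(clean_first)  # ['python', 'is', 'great', 'python']
print(split_first)  # ['python', 'is', 'great', 'python']
```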
Filtering stop words
Once you have a working frequency dictionary, you may notice that the top positions are dominated by words like "the", "a", "is", and "to". These are called stop words. They appear constantly in English but convey nothing specific about the content being analyzed. A small set defined in your code is enough for beginner projects. For each word, you check if it belongs to the stop word set before adding it to the frequency dictionary, which keeps those words out of the counts entirely.
STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'were',
              'and', 'or', 'but', 'in', 'on', 'at', 'to',
              'of', 'for', 'with', 'it', 'this', 'that'}

freq = {}
for word in words:
    if word not in STOP_WORDS:  # skip stop words
        freq[word] = freq.get(word, 0) + 1
Use a set, not a list, for your stop words. Checking membership in a set — word not in STOP_WORDS — runs in constant time. Checking membership in a list requires scanning each element, so it slows down proportionally as the list grows.
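The difference is easy to measure with the standard-library timeit module. This sketch uses a deliberately oversized collection of made-up stop words so the gap is visible; the exact timings will vary by machine:

```python
import timeit

# 10,000 fake stop words -- far more than a real list, for demonstration
stop_list = [f"word{i}" for i in range(10_000)]
stop_set = set(stop_list)

# Look up a word near the end of the collection, 1,000 times each way
list_time = timeit.timeit(lambda: "word9999" in stop_list, number=1_000)
set_time = timeit.timeit(lambda: "word9999" in stop_set, number=1_000)

print(f"list membership: {list_time:.4f}s")
print(f"set membership:  {set_time:.4f}s")  # typically far smaller
```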
A classic mistake when wrapping the counting loop in a function: the loop builds the freq dictionary correctly, but the function ends with return {} instead of return freq, so every call returns empty results. The loop populates freq, but the function discards it by returning a new empty dictionary literal instead of the variable that was just filled.
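Here is a minimal sketch of that bug and its fix; the function names are hypothetical:

```python
def count_words_broken(text):
    """Builds the counts correctly, then throws them away."""
    freq = {}
    for word in text.lower().split():
        freq[word] = freq.get(word, 0) + 1
    return {}        # BUG: returns a new empty dict, not freq

def count_words_fixed(text):
    """Identical loop, but returns the dictionary it built."""
    freq = {}
    for word in text.lower().split():
        freq[word] = freq.get(word, 0) + 1
    return freq      # FIX: return the populated dictionary

print(count_words_broken("to be or not to be"))  # {}
print(count_words_fixed("to be or not to be"))   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```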
How to Build a Word Frequency Analyzer in Python
The following five steps walk through the complete construction of the word frequency analyzer, from receiving raw text to printing a ranked output.
1. Define the input text and clean it. Store your text in a variable. Call .lower() to normalize capitalization, then call .translate(str.maketrans('', '', string.punctuation)) to remove all punctuation. Both operations must happen before splitting so that every word token is consistent.

2. Split the text into a list of words. Call .split() on the cleaned string. With no arguments, .split() divides on any whitespace sequence and discards empty strings, giving you a list where each element is one word. Assign the result to a variable named words.

3. Count each word using a dictionary. Create an empty dictionary freq = {}. Loop over the words list. Inside the loop, write freq[word] = freq.get(word, 0) + 1. This retrieves the current count for each word (or zero on first encounter) and increments it by one, storing the updated value back under the same key.

4. Sort the results by frequency. Call sorted(freq.items(), key=lambda item: item[1], reverse=True). The .items() method returns each key-value pair as a tuple. The key argument tells sorted() to rank by the second element of each tuple, which is the count. Setting reverse=True places the highest count first.

5. Print the top results. Loop over the sorted list and unpack each tuple into word and count variables, then print them. Use list slicing, sorted_words[:10], to limit output to the top ten words. You can make this limit a parameter so callers can request as many or as few results as they need.
Putting all five steps together with the stop-word filter produces the extended version of the analyzer shown below. Notice that the only structural difference from the first version is the addition of the STOP_WORDS set and the conditional inside the loop.
import string

STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'were',
              'and', 'or', 'but', 'in', 'on', 'at', 'to',
              'of', 'for', 'with', 'it', 'this', 'that'}

def count_words(text, top_n=10, filter_stops=True):
    # Step 1: clean
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Step 2: split
    words = text.split()

    # Step 3: count
    freq = {}
    for word in words:
        if filter_stops and word in STOP_WORDS:
            continue
        freq[word] = freq.get(word, 0) + 1

    # Step 4: sort
    sorted_words = sorted(freq.items(), key=lambda item: item[1], reverse=True)

    # Step 5: print top N
    print(f"\nTop {top_n} words:")
    for word, count in sorted_words[:top_n]:
        bar = '#' * count
        print(f" {word:<20} {count:>3} {bar}")

sample = """
Python is an easy to learn language. Python is also a powerful language.
Many programmers choose Python because Python syntax is clean and readable.
Learning Python opens doors to data science, web development, and automation.
"""

count_words(sample, top_n=8)
Running this on the sample text produces output similar to the following, with "python" appearing five times and the remaining content words ranked below it. Words with equal counts appear in the order they were first encountered, because Python's sort is stable.

Top 8 words:
 python                 5 #####
 language               2 ##
 easy                   1 #
 learn                  1 #
 also                   1 #
 powerful               1 #
 many                   1 #
 programmers            1 #
"Simple is better than complex." — The Zen of Python (PEP 20)
Python Learning Summary Points
- A Python dictionary is the right data structure for counting because it gives you constant-time lookup and update. The pattern freq[word] = freq.get(word, 0) + 1 handles both new words and existing words without any conditional branching on your part.
- Always normalize text before splitting. Calling .lower() and .translate() on the full string is more efficient than processing individual words after the fact, and it ensures your tokens are consistent from the start.
- sorted() with a lambda key is a general pattern you will use throughout Python whenever you need to rank a collection by a derived value. In this project you sort by count, but the same pattern works for sorting objects by any attribute.
- Stop word filtering is controlled by a set membership check. Using a set keeps the check at constant time regardless of how many stop words you define, which matters once your stop word list grows beyond a handful of entries.
- The collections.Counter class in Python's standard library automates the entire counting loop into a single call. Learning the manual dictionary approach first gives you a clear mental model of what Counter does, making it easier to use and debug when you adopt it.
The word frequency analyzer is a compact program but it exercises the core skills of Python data manipulation: reading and transforming strings, building and querying dictionaries, iterating with for loops, and sorting sequences. Every project you write after this one will call on those same patterns.
Frequently Asked Questions
What is a word frequency analyzer?
A word frequency analyzer is a program that reads a piece of text and counts how many times each unique word appears. In Python, this is done by splitting the text into a list of words, then using a dictionary to keep a running count for each word.

What data structure should I use to count word frequencies?
A Python dictionary is the standard data structure for counting word frequencies. Each unique word becomes a key, and its count becomes the value. Python also provides collections.Counter, which is a specialized dictionary subclass built for exactly this task.

How do I split text into words in Python?
Call the .split() method on a string. By default, .split() splits on whitespace and removes empty strings. For example, 'hello world'.split() returns ['hello', 'world']. To handle punctuation, call .lower() and strip punctuation before splitting.

What does dict.get(key, default) do?
dict.get(key, default) returns the value for the key if it exists, or the default value if it does not. When counting words, freq.get(word, 0) returns the current count for a word or zero if the word has not been seen yet, which lets you increment the count safely without a KeyError.

How do I sort a dictionary by its values?
Use sorted() with a key argument. For a dictionary named freq, sorted(freq.items(), key=lambda item: item[1], reverse=True) returns a list of (word, count) tuples sorted from highest to lowest count. The lambda extracts the second element of each tuple, which is the count.

What is collections.Counter?
collections.Counter is a dictionary subclass in Python's standard library that automates word counting. Pass it an iterable and it returns a Counter object with each element as a key and its count as the value. It also provides a .most_common(n) method. Beginners benefit from learning the manual dictionary approach first so they understand what Counter is doing under the hood.

How do I handle capitalization and punctuation?
Convert the entire text to lowercase with text.lower() before splitting, so "Python" and "python" are counted together. Remove punctuation by using str.translate() with str.maketrans('', '', string.punctuation), which strips every punctuation character in a single operation.

What are stop words and why filter them out?
Stop words are common function words like "the", "a", "is", and "in" that appear very frequently but carry little meaning. Filtering them out lets a word frequency analyzer surface the words that characterize the content. Filter them by checking if each word belongs to a set of stop words before adding it to the frequency dictionary.