Nancy Wairimu Nyambura: Outreachy Update:Understanding and Improving def-extractor.py
news.movim.eu / PlanetGnome • 25 July • 2 minutes
Introduction
Over the past couple of weeks, I have been working on understanding and improving def-extractor.py, a Python script that processes dictionary data from Wiktionary to generate word lists and definitions in structured formats. My main task has been to refactor the script to use configuration files instead of hardcoded values, making it more flexible and maintainable.
In this blog post, I’ll explain:
- What the script does
- How it works under the hood
- The changes I made to improve it
- Why these changes matter
What Does the Script Do?
At a high level, this script processes huge JSONL (JSON Lines) dictionary dumps, like the ones from Kaikki.org, and filters them down into clean, usable formats.
The def-extractor.py script takes raw dictionary data (from Wiktionary) and processes it into structured formats like:
- Filtered word lists (JSONL)
- GVariant binary files (for efficient storage)
- Enum tables (for parts of speech & word tags)
It was originally designed to work with specific word lists (Wordnik, Broda, and a test list), but my goal is to make it configurable so that it can support any word list with a simple config file.
How It Works (Step by Step)
1. Loading the Word List
The script starts by loading a word list (e.g., Wordnik’s list of common English words). It filters out invalid words (too short, containing numbers, etc.) and stores the rest in a hash table for quick lookup.
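The loading step above can be sketched roughly as follows. This is an illustrative reconstruction, not the script’s actual code: the names `is_valid_word` and `load_word_list`, and the default length limits, are my own.

```python
def is_valid_word(word, min_len=2, max_len=20):
    """Reject words that are too short, too long, or not purely alphabetic."""
    return min_len <= len(word) <= max_len and word.isalpha()

def load_word_list(path, min_len=2, max_len=20):
    """Load one word per line into a set (a hash table) for O(1) lookups."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            w = line.strip()
            if is_valid_word(w, min_len, max_len):
                words.add(w)
    return words
```

Using a set here means the later filtering pass can test membership in constant time, which matters when the Wiktionary dump has millions of entries.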

2. Filtering Raw Wiktionary Data
Next, it processes a massive raw-wiktextract-data.jsonl file (the Wiktionary dump) and keeps only entries that:
- Match words from the loaded word list
- Are in the correct language (e.g., English)
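The filtering pass above could look something like this sketch. I am assuming the wiktextract JSONL layout, where each line is a JSON object with `word` and `lang_code` fields; the function name `filter_entries` is my own.

```python
import json

def filter_entries(jsonl_path, words, lang_code="en"):
    """Yield only entries whose word is in the loaded list
    and whose language matches the target language code."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("lang_code") == lang_code and entry.get("word") in words:
                yield entry
```

Because it is a generator, the full dump never has to fit in memory; entries stream through one line at a time.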

3. Generating Structured Outputs
After filtering, the script creates:
- Enum tables (JSON files listing parts of speech & word tags)
- GVariant files (binary files for efficient storage and fast lookup)
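As a rough idea of what building an enum table involves, the sketch below collects every distinct part-of-speech value across the filtered entries and assigns each a stable integer index. This is a simplified illustration; `build_enum_table` is a hypothetical name, and the real script also handles word tags and writes GVariant output.

```python
def build_enum_table(entries, key="pos"):
    """Map each distinct value of `key` (e.g. part of speech)
    to a stable integer index, sorted for reproducibility."""
    values = sorted({e[key] for e in entries if key in e})
    return {v: i for i, v in enumerate(values)}
```

Storing small integers instead of repeated strings is what makes the binary GVariant files compact and fast to look up.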

What Changes Have I Made?
1. Added Configuration Support
Originally, the script used hardcoded paths and settings. I modified it to read from .config files, allowing users to define:
- Source word list file
- Output directory
- Word validation rules (min/max length, allowed characters)
Before (Hardcoded):
WORDNIK_LIST = "wordlist-20210729.txt"
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
After (Configurable):
[Word List]
Source = my-wordlist.txt
MinLength = 2
MaxLength = 20
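A config file like this can be read with Python’s standard-library configparser. The snippet below is a minimal sketch of that idea, assuming the section and key names shown above; the actual script’s loading code may differ.

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[Word List]
Source = my-wordlist.txt
MinLength = 2
MaxLength = 20
""")

# Section names may contain spaces; getint() converts values for us.
source = config["Word List"]["Source"]
min_len = config["Word List"].getint("MinLength")
max_len = config["Word List"].getint("MaxLength")
```

In the real script, `config.read_string` would be replaced by `config.read(path)` pointing at the user’s .conf file.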
2. Improved File Path Handling
Instead of hardcoding paths, the script now constructs them dynamically:
output_path = os.path.join(config.word_lists_dir, f"{config.id}-filtered.jsonl")
Why Do These Changes Matter?
- Flexibility – now supports any word list via config files.
- Maintainability – no more editing code to change paths or rules.
- Scalability – easier to add new word lists or languages.
- Consistency – all settings live in config files.
Next Steps?
1. Better Error Handling
I am working on adding checks for:
- Missing config fields
- Invalid word list files
- Incorrectly formatted data
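One simple way to catch a missing config field early is to validate the section up front and fail with a clear message. This is a hypothetical sketch of what I have in mind, not code that exists in the script yet; `validate_config` and the required-field names are assumptions.

```python
def validate_config(section):
    """Raise a clear error listing every required field that is absent."""
    required = ("Source", "MinLength", "MaxLength")
    missing = [k for k in required if k not in section]
    if missing:
        raise ValueError(f"Missing config fields: {', '.join(missing)}")
```

Failing fast here is friendlier than letting a missing key surface later as an obscure KeyError deep inside the filtering pass.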
2. Unified Word Loading Logic
There are currently separate functions, load_wordnik() and load_broda(). I want to merge them into a single load_words(config) that works for any word list.
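A unified loader might look like the sketch below, where the file path and validation rules come from the config section rather than being baked into per-list functions. This is a design sketch under my own assumptions about the config keys, not the final implementation.

```python
def load_words(config):
    """One loader for any word list: the source path and the
    validation rules are read from the config, not hardcoded."""
    min_len = int(config.get("MinLength", 2))
    max_len = int(config.get("MaxLength", 20))
    words = set()
    with open(config["Source"], encoding="utf-8") as f:
        for line in f:
            w = line.strip()
            if min_len <= len(w) <= max_len and w.isalpha():
                words.add(w)
    return words
```

With this shape, adding a new word list means writing a new .conf file, not a new load_*() function.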
3. Refactor legacy code for better structure
Try It Yourself
- Download the script: [ wordlist-Gitlab ]
- Create a .conf config file
- Run:
python3 def-extractor.py --config my-wordlist.conf filtered-list
Happy coding!