    Nancy Wairimu Nyambura: Outreachy Update: Understanding and Improving def-extractor.py


    Introduction

    Over the past couple of weeks, I have been working on understanding and improving def-extractor.py, a Python script that processes dictionary data from Wiktionary to generate word lists and definitions in structured formats. My main task has been to refactor the script to use configuration files instead of hardcoded values, making it more flexible and maintainable.

    In this blog post, I’ll explain:

    1. What the script does
    2. How it works under the hood
    3. The changes I made to improve it
    4. Why these changes matter

    What Does the Script Do?

    At a high level, this script processes huge JSONL (JSON Lines) dictionary dumps, like the ones from Kaikki.org, and filters them down into clean, usable formats.

    The def-extractor.py script takes raw dictionary data (from Wiktionary) and processes it into structured formats like:

    • Filtered word lists (JSONL)
    • GVariant binary files (for efficient storage)
    • Enum tables (for parts of speech & word tags)

    It was originally designed to work with specific word lists (Wordnik, Broda, and a test list), but my goal is to make it configurable so it can support any word list with a simple config file.

    How It Works (Step by Step)

    1. Loading the Word List

    The script starts by loading a word list (e.g., Wordnik's list of common English words). It filters out invalid words (too short, containing numbers, etc.) and stores them in a hash table for quick lookup.
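
    Here is a minimal sketch of that step (the regular expression and the length limits are illustrative defaults, not the script's actual values):

    import re

    def load_word_list(path, min_len=2, max_len=20):
        # A set behaves like the hash table described above:
        # O(1) membership checks when filtering the dump later.
        words = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                # Skip words that are too short, too long,
                # or contain anything other than letters.
                if min_len <= len(word) <= max_len and re.fullmatch(r"[A-Za-z]+", word):
                    words.add(word)
        return words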

    2. Filtering Raw Wiktionary Data

    Next, it processes a massive raw-wiktextract-data.jsonl file (the Wiktionary dump) and keeps only entries that (see the sketch after this list):

    • Match words from the loaded word list
    • Are in the correct language (e.g., English)
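
    Roughly, that filtering loop could look like this (a sketch, assuming the wiktextract schema in which each JSONL line is an object with "word" and "lang_code" keys):

    import json

    def filter_entries(dump_path, out_path, words, lang_code="en"):
        # Stream the huge dump line by line instead of loading it whole.
        with open(dump_path, encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                entry = json.loads(line)
                # Keep the entry only if it passes both filters.
                if entry.get("lang_code") == lang_code and entry.get("word") in words:
                    dst.write(line)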

    3. Generating Structured Outputs

    After filtering, the script creates the following (the enum-table step is sketched after the list):

    • Enum tables (JSON files listing parts of speech & word tags)
    • GVariant files (binary files for efficient storage and fast lookup)
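
    The GVariant serialization is specific to the script, but the enum-table side can be sketched in plain Python (the output layout here is an assumption for illustration, not the script's exact schema):

    import json

    def write_pos_enum_table(entries, out_path):
        # Assign each distinct part of speech a stable integer value,
        # so consumers can store a small enum instead of a string.
        pos_values = sorted({e["pos"] for e in entries if "pos" in e})
        table = {pos: i for i, pos in enumerate(pos_values)}
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(table, f, indent=2)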

    What Changes Have I Made?

    1. Added Configuration Support

    Originally, the script used hardcoded paths and settings. I modified it to read from .config files, allowing users to define:

    • Source word list file
    • Output directory
    • Word validation rules (min/max length, allowed characters)

    Before (Hardcoded):

    WORDNIK_LIST = "wordlist-20210729.txt"
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    After (Configurable):

    [Word List]
    Source = my-wordlist.txt
    MinLength = 2
    MaxLength = 20
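
    Assuming the script reads this with Python's standard configparser (a natural fit for this ini-style format; the section and key names mirror the example above), the loading side might look like:

    import configparser

    config = configparser.ConfigParser()
    config.read("my-wordlist.conf")

    source = config["Word List"]["Source"]                # "my-wordlist.txt"
    min_length = config["Word List"].getint("MinLength")  # 2
    max_length = config["Word List"].getint("MaxLength")  # 20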

    2. Improved File Path Handling

    Instead of hardcoding paths, the script now constructs them dynamically:

    # Builds e.g. "<word_lists_dir>/<id>-filtered.jsonl" from the config values.
    output_path = os.path.join(config.word_lists_dir, f"{config.id}-filtered.jsonl")

    Why Do These Changes Matter?

    • Flexibility – Now supports any word list via config files.
    • Maintainability – No more editing code to change paths or rules.
    • Scalability – Easier to add new word lists or languages.
    • Consistency – All settings are in config files.

    Next Steps?

    1. Better Error Handling

    I am working on adding checks (sketched below) for:

    • Missing config fields
    • Invalid word list files
    • Incorrectly formatted data
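
    A sketch of what that validation could look like (the required keys are assumptions based on the config example earlier):

    import os

    REQUIRED_KEYS = ("Source", "MinLength", "MaxLength")  # hypothetical field names

    def validate_config(config):
        # Fail early with a clear message instead of crashing mid-run.
        if "Word List" not in config:
            raise ValueError("config is missing the [Word List] section")
        for key in REQUIRED_KEYS:
            if key not in config["Word List"]:
                raise ValueError(f"[Word List] is missing the '{key}' field")
        if not os.path.exists(config["Word List"]["Source"]):
            raise FileNotFoundError(
                f"word list file not found: {config['Word List']['Source']}")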

    2. Unified Word Loading Logic

    There are currently separate loading functions, load_wordnik() and load_broda(). I want to merge them into a single load_words(config) that works for any word list.
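
    A sketch of what that unified loader could look like (the fallback values and the isalpha() rule are assumptions standing in for the per-list rules):

    def load_words(config):
        # One loader for every word list: the config supplies the file
        # and the validation rules that used to be hardcoded per list.
        section = config["Word List"]
        min_len = section.getint("MinLength", fallback=1)
        max_len = section.getint("MaxLength", fallback=50)
        words = set()
        with open(section["Source"], encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                if min_len <= len(word) <= max_len and word.isalpha():
                    words.add(word)
        return words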

    3. Refactoring Legacy Code for Better Structure

    Try It Yourself

    1. Download the script: [wordlist-Gitlab]
    2. Create a .conf config file
    3. Run: python3 def-extractor.py --config my-wordlist.conf filtered-list

    Happy coding!