    Nancy Wairimu Nyambura: Outreachy Update: Understanding and Improving def-extractor.py


    Introduction

    Over the past couple of weeks, I have been working on understanding and improving def-extractor.py, a Python script that processes dictionary data from Wiktionary to generate word lists and definitions in structured formats. My main task has been to refactor the script to use configuration files instead of hardcoded values, making it more flexible and maintainable.

    In this blog post, I’ll explain:

    1. What the script does
    2. How it works under the hood
    3. The changes I made to improve it
    4. Why these changes matter

    What Does the Script Do?

    At a high level, this script processes huge JSONL (JSON Lines) dictionary dumps, like the ones from Kaikki.org, and filters them down into clean, usable formats.

    The def-extractor.py script takes raw dictionary data (from Wiktionary) and processes it into structured formats like:

    • Filtered word lists (JSONL)
    • GVariant binary files (for efficient storage)
    • Enum tables (for parts of speech & word tags)

    It was originally designed to work with specific word lists (Wordnik, Broda, and a test list), but my goal is to make it configurable so it can support any word list with a simple config file.

    How It Works (Step by Step)

    1. Loading the Word List

    The script starts by loading a word list (e.g., Wordnik's list of common English words). It filters out invalid words (too short, containing numbers, etc.) and stores them in a hash table for quick lookup.
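
    Here is a minimal sketch of that step (the regular expression and the length limits are illustrative defaults, not the script's actual values):

    import re

    def load_word_list(path, min_len=2, max_len=20):
        # A set behaves like the hash table described above:
        # O(1) membership checks when filtering the dump later.
        words = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                # Skip words that are too short, too long,
                # or contain anything other than letters.
                if min_len <= len(word) <= max_len and re.fullmatch(r"[A-Za-z]+", word):
                    words.add(word)
        return words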

    2. Filtering Raw Wiktionary Data

    Next, it processes a massive raw-wiktextract-data.jsonl file (the Wiktionary dump) and keeps only entries that (see the sketch after this list):

    • Match words from the loaded word list
    • Are in the correct language (e.g., English)
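
    Roughly, that filtering loop could look like this (a sketch, assuming the wiktextract schema in which each JSONL line is an object with "word" and "lang_code" keys):

    import json

    def filter_entries(dump_path, out_path, words, lang_code="en"):
        # Stream the huge dump line by line instead of loading it whole.
        with open(dump_path, encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                entry = json.loads(line)
                # Keep the entry only if it passes both filters.
                if entry.get("lang_code") == lang_code and entry.get("word") in words:
                    dst.write(line)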

    3. Generating Structured Outputs

    After filtering, the script creates the following (the enum-table step is sketched after the list):

    • Enum tables (JSON files listing parts of speech & word tags)
    • GVariant files (binary files for efficient storage and fast lookup)
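
    The GVariant serialization is specific to the script, but the enum-table side can be sketched in plain Python (the output layout here is an assumption for illustration, not the script's exact schema):

    import json

    def write_pos_enum_table(entries, out_path):
        # Assign each distinct part of speech a stable integer value,
        # so consumers can store a small enum instead of a string.
        pos_values = sorted({e["pos"] for e in entries if "pos" in e})
        table = {pos: i for i, pos in enumerate(pos_values)}
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(table, f, indent=2)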

    What Changes Have I Made?

    1. Added Configuration Support

    Originally, the script used hardcoded paths and settings. I modified it to read from .config files, allowing users to define:

    • Source word list file
    • Output directory
    • Word validation rules (min/max length, allowed characters)

    Before (Hardcoded):

    WORDNIK_LIST = "wordlist-20210729.txt"
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    After (Configurable):

    [Word List]
    Source = my-wordlist.txt
    MinLength = 2
    MaxLength = 20
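
    Assuming the script reads this with Python's standard configparser (a natural fit for this ini-style format; the section and key names mirror the example above), the loading side might look like:

    import configparser

    config = configparser.ConfigParser()
    config.read("my-wordlist.conf")

    source = config["Word List"]["Source"]                # "my-wordlist.txt"
    min_length = config["Word List"].getint("MinLength")  # 2
    max_length = config["Word List"].getint("MaxLength")  # 20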

    2. Improved File Path Handling

    Instead of hardcoding paths, the script now constructs them dynamically:

    # Builds e.g. "<word_lists_dir>/<id>-filtered.jsonl" from the config values.
    output_path = os.path.join(config.word_lists_dir, f"{config.id}-filtered.jsonl")

    Why Do These Changes Matter?

    • Flexibility – Now supports any word list via config files.
    • Maintainability – No more editing code to change paths or rules.
    • Scalability – Easier to add new word lists or languages.
    • Consistency – All settings are in config files.

    Next Steps?

    1. Better Error Handling

    I am working on adding checks (sketched below) for:

    • Missing config fields
    • Invalid word list files
    • Incorrectly formatted data
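
    A sketch of what that validation could look like (the required keys are assumptions based on the config example earlier):

    import os

    REQUIRED_KEYS = ("Source", "MinLength", "MaxLength")  # hypothetical field names

    def validate_config(config):
        # Fail early with a clear message instead of crashing mid-run.
        if "Word List" not in config:
            raise ValueError("config is missing the [Word List] section")
        for key in REQUIRED_KEYS:
            if key not in config["Word List"]:
                raise ValueError(f"[Word List] is missing the '{key}' field")
        if not os.path.exists(config["Word List"]["Source"]):
            raise FileNotFoundError(
                f"word list file not found: {config['Word List']['Source']}")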

    2. Unified Word Loading Logic

    There are currently separate loading functions, load_wordnik() and load_broda(). I want to merge them into a single load_words(config) that works for any word list.
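
    A sketch of what that unified loader could look like (the fallback values and the isalpha() rule are assumptions standing in for the per-list rules):

    def load_words(config):
        # One loader for every word list: the config supplies the file
        # and the validation rules that used to be hardcoded per list.
        section = config["Word List"]
        min_len = section.getint("MinLength", fallback=1)
        max_len = section.getint("MaxLength", fallback=50)
        words = set()
        with open(section["Source"], encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                if min_len <= len(word) <= max_len and word.isalpha():
                    words.add(word)
        return words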

    3. Refactoring Legacy Code for Better Structure

    Try It Yourself

    1. Download the script: [wordlist-Gitlab]
    2. Create a .conf config file
    3. Run: python3 def-extractor.py --config my-wordlist.conf filtered-list

    Happy coding!