The experimental Acrostic Generator has
finally landed
inside the Crossword editor and is currently tagged as
BETA
.
I’d classify this as one of the trickiest and most interesting projects I’ve worked on.
Here’s how an acrostic puzzle loaded inside Crossword editor looks like:
In my previous
blog post
(published about a year ago), I explained one part of the generator. Since then, there have been many improvements.
I won’t go into detail about what an acrostic puzzle is, as I’ve covered that in multiple previous posts already.
If you’re unfamiliar, please check out my earlier post for a brief idea.
Coming to the Acrostic Generator
, I’ll begin by showing an illustration that shows the input and the corresponding output generated by it. After that, I’ll walk through the implementation and challenges I faced.
Let’s take the quote: “
CATS ALWAYS TAKE NAPS
” whose author is a “
CAT
”.

Here’s what the Acrostic Generator essentially does
It generates answers like
“CATSPAW”
,
“ALASKAN”
and
“TYES”
which, as you can probably guess from the color coding, are made up of letters from the original quote.
Core Components
Before explaining how the Acrostic generator works, I want to briefly explain some of the key components involved.
1.
Word list
The word list is an important part of Crosswords. It provides APIs to efficiently search for words. Refer to the
documentation
to understand how it works.
2.
IpuzCharset
The performance of the Acrostic Generator heavily depends on IpuzCharset, which is essentially a HashMap that stores characters and their frequencies.
We perform numerous ipuz_charset_add_text and ipuz_charset_remove_text operations on the
QUOTE
charset. I'd especially like to highlight ipuz_charset_remove_text, which used to be computationally very slow. Last year, charset was rewritten in Rust by Federico. Compared to the earlier implementation in C using a GTree, the Rust version turned out to be quite faster.
Here’s F
ederico’s blog post
on rustifying libipuz’s charset.
Why is ipuz_charset_remove_text latency so important? Let's consider the following example:
QUOTE: "CARNEGIE VISITED PRINCETON AND TOLD WILSON WHAT HIS YOUNG MEN NEEDED WAS NOT A LAW SCHOOL BUT A LAKE TO ROW ON IN ADDITION TO BEING A SPORT THAT BUILT CHARACTER AND WOULD LET THE UNDERGRADUATES RELAX ROWING WOULD KEEP THEM FROM PLAYING FOOTBALL A ROUGHNECK SPORT CARNEGIE DETESTED"
SOURCE: "DAVID HALBERSTAM THE AMATEURS"
In this case, the total number of maximum ipuz_charset_remove_text operations required in the worst case would be:
73205424239083486088110552395002236620343529838736721637033364389888000000
…which is a
lot
.
Terminology
I’d also like you guys to take a note of a few things.
1.
Answers
and
Clues
refer to the same thing, they are the
solutions
generated by the Acrostic Generator. I’ll be using them interchangeably throughout.
2. We’ve set two constants in the engine: MIN_WORD_SIZE = 3 and MAX_WORD_SIZE = 20. These make sure the answers are not too short or too long and help stop the engine from running indefinitely.
3. Leading characters here are all the characters of source. Each one is the first letter of corresponding answer.
Setting up things
Before running the engine, we need to set up some data structures to store the results.
typedef struct {
/* Representing a answer */
gunichar leading_char;
const gchar *letters;
guint word_length;
/* Searching the answer */
gchar *filter;
WordList *word_list;
GArray *rand_offset;
} ClueEntry;
We use a ClueEntry structure to store the answer for each clue. It holds the leading character (from the source), the letters of the answer, the word length, and some additional word list information.
Oh wait, why do we need the word length since we are already storing letters of the answer?
Let’s backtrack. Initially, I wrote the following brute-force recursive algorithm:
void
acrostic_generator_helper (AcrosticGenerator *self,
gchar nth_source_char)
{
// Iterate from min_word_size to max_word_size for every answer
for (word_length = min_word_size; word_length <= max_word_size; word_length++)
{
// get list of words starting from `nth_source_char`
// and with length equal to word_length
word_list = get_word_list (starting_letter = nth_source_char, word_length);
// Iterate throught the word list
for (guint i = 0; i < word_list_get_n_items (word_list); i++)
{
word = word_list[i];
// check if word is present in the quote charset
if (ipuz_charset_remove_text (quote_charset, word))
{
// if present we forward to the next source char
acrostic_generator_helper (self, nth_source_char + 1)
}
}
}
}
The problem with this approach is that it is too slow. We were iterating from MIN_WORD_SIZE to MAX_WORD_SIZE and trying to find a solution for every possible size. Yes, this would work and eventually we’ll find a solution, but it would take a lot of time. Also, many of the answers for the initial source characters would end up having length equal to MIN_WORD_SIZE .
To quantify this, compared to the latest approach (which I’ll discuss shortly), we would be performing roughly
20 times
the current number (7.3 × 10⁷³) of ipuz_charset_remove_text operations.
To fix this, we added randomness by calculating and assigning random lengths to clue answers before running the engine.
To generate these random lengths, we break a number equal to the length of the quote string into
n
parts (where
n
is the number of source characters), each part having a random value.
static gboolean
generate_random_lengths (GArray *clues,
guint number,
guint min_word_size,
guint max_word_size)
{
if ((clues->len * max_word_size) < number)
return FALSE;
guint sum = 0;
for (guint i = 0; i < clues->len; i++)
{
ClueEntry *clue_entry;
guint len;
guint max_len = MAX (min_word_size,
MIN (max_word_size, number - sum));
len = rand() % (max_len - min_word_size + 1) + min_word_size;
sum += len;
clue_entry = &(g_array_index (clues, ClueEntry, i));
clue_entry->word_length = len;
}
return sum == number;
}
I have been continuously researching ways to generate random lengths that help the generator find answers as quickly as possible.
What I concluded is that the Acrostic Generator performs best when the word lengths follow a
right-skewed distribution
.
static void
fill_clue_entries (GArray *clues,
ClueScore *candidates,
WordListResource *resource)
{
for (guint i = 0; i < clues->len; i++)
{
ClueEntry *clue_entry;
clue_entry = &(g_array_index (clues, ClueEntry, i));
// Generate filter in order to get words with starting letter nth char of source string
// For eg. char = D, answer_len = 5
// filter = "D????"
clue_entry->filter = generate_individual_filter (clue_entry->leading_char,
clue_entry->word_length);
// Load all words with starting letter equal to nth char in source string
clue_entry->word_list = word_list_new ();
word_list_set_resource (clue_entry->word_list, resource);
word_list_set_filter (clue_entry->word_list, clue_entry->filter, WORD_LIST_MATCH);
candidates[i].index = i;
candidates[i].score = clue_entry->word_length;
// Randomise the word list which is sorted by default
clue_entry->rand_offset = generate_random_lookup (word_list_get_n_items (clue_entry->word_list));
}
Now that we have random lengths, we fill up the ClueEntry data structure.
Here, we generate individual filters for each clue, which are used to set the filter on each word list. For example, the filters for the example illustrated above are C??????, A??????, and T??? .
We also maintain a separate word list for each clue entry. Note that we do not store the huge word list individually for every clue. Instead, each word list object refers to the same memory-mapped word list resource.
Additionally, each clue entry contains a random offsets array, which stores a randomized order of indices. We use this to traverse the filtered word list in a random order. This randomness helps fix the problem where many answers for the initial source characters would otherwise end up with length equal to MIN_WORD_SIZE.
The advantage of pre-calculating all of this before running the engine is that the main engine loop only performs the heavy operations: ipuz_charset_remove_text and ipuz_charset_add_text.
static gboolean
acrostic_generator_helper (AcrosticGenerator *self,
GArray *clues,
guint index,
IpuzCharsetBuilder *remaining_letters,
ClueScore *candidates)
{
ClueEntry *clue_entry;
if (index == clues->len)
return TRUE;
clue_entry = &(g_array_index (clues, ClueEntry, candidates[index].index));
for (guint i = 0; i < word_list_get_n_items (clue_entry->word_list); i++)
{
const gchar *word;
g_atomic_int_inc (self->count);
// traverse based on random indices
word = word_list_get_word (clue_entry->word_list,
g_array_index (clue_entry->rand_offset, gushort, i));
clue_entry->letters = word;
if (ipuz_charset_builder_remove_text (remaining_letters, word + 1))
{
if (!add_or_skip_word (self, word) &&
acrostic_generator_helper (self, clues, index + 1, remaining_letters, candidates))
return TRUE;
clean_up_word (self, word);
ipuz_charset_builder_add_text (remaining_letters, word + 1);
clue_entry->letters = NULL;
}
}
clue_entry->letters = NULL;
return FALSE;
}
The approach is quite simple. As you can see in the code above, we perform ipuz_charset_remove_text many times, so it was crucial to make the ipuz_charset_remove_text operation efficient.
When all the characters in the charset have been used/removed and the index becomes equal to number of clues, it means we have found a solution. At this point, we return, store the answers in an array, and continue our search for new answers until we receive a stop signal.
We also maintain a skip list that is updated whenever we find an clue answer and is cleaned up during backtracking. This makes sure there are no duplicate answers in the answers list.
Performance Improvements
I compared the performance of the acrostic generator using the current Rust charset implementation against the previous C GTree implementation. I have used the following quote and source strings with the same RNG seed for both implementations:
QUOTE: "To be yourself in a world that is constantly trying to make you something else is the greatest accomplishment."
SOURCE: "TBYIWTCTMYSEGA"
Results:
+-----------------+--------------------+
| Implementation | Time taken(secs) |
+-----------------+--------------------+
| C GTree | 74.39 |
| Rust HashMap | 17.85 |
+-----------------+--------------------+
The Rust HashMap implementation is nearly 4 times faster than the original C GTree version for the same random seed and traversal order.
I have also been testing the generator to find small performance improvements. Here are some of them:
-
When searching for answers, looking for answers for clues with longer word lengths first helps find solutions faster
-
We switched to using nohash_hasher for the hashmap because we are essentially storing {char: frequency} pairs. Trace reports showed that significant time and resources were spent computing hash using Rust’s default SipHash implementation which was unnecessary.
MR
-
Inside ipuz_charset_remove_text, instead of cloning the original data, we use a rollback mechanism that tracks all modifications and rolls back in case of failure.
MR
I also remember running the generator on some quote and source input back in the early days. It ran continuously for
four hours
and still couldn’t find a single solution. We even overflowed the gint counter which tracks number of words tried. Now, the same generator
can return 10 solutions in under 10 seconds
. We’ve come a long way! 😀
Crossword Editor
Now that I’ve covered the engine, I’ll talk about the UI part.
We started off by sketching potential designs on paper. @
jrb
came up with a good design and we decided to move forward with it, making a few tweaks to it.
First, we needed to display a list of the generated answers.
For this, I implemented my own list model where each item stores a string for the answer and a boolean indicating whether the user wants to apply that answer.
To allow the user to run, stop the generator and then apply answers, we reused the compact version of the original autofill component used in normal crosswords. The answer list gets updated whenever the slider is moved.
We have tried to reuse as much code as possible for acrostics, keeping most of the code common between acrostics and normal crosswords.
Here’s a quick demo of the acrostic editor in action:
We also maintain a cute little histogram on the right side of the bottom panel to summarize clue lengths.
You can also try out the Acrostic Generator using our CLI app, which I originally wrote to quickly test the engine. To use the binary, you’ll need to build
Crosswords Editor
locally. Example usage:
$ ./_build/src/acrostic-generator -q "For most of history, Anonymous was a woman. I would venture to guess that Anon, who wrote so many poems without signing them, was often a woman. And it is for this reason that I would implore women to write all the more" -s "Virginia wolf"
Starting acrostic generator. Press Ctrl+C to cancel.
[ VASOTOMY ] [ IMFROMMISSOURI ] [ ROMANIANMONETARYUNIT ] [ GREATFEATSOFSTRENGTH ] [ ITHOUGHTWEHADADEAL ] [ NEWSSHOW ] [ INSTITUTION ] [ AWAYWITHWORDS ] [ WOOLSORTERSPNEUMONIA ] [ ONEWOMANSHOWS ] [ LOWMANONTHETOTEMPOLE ] [ FLOWOUT ]
[ VALOROUSNESS ] [ IMMUNOSUPPRESSOR ] [ RIGHTEOUSINDIGNATION ] [ GATEWAYTOTHEWEST ] [ IWANTYOUTOWANTME ] [ NEWTONSLAWOFMOTION ] [ IMTOOOLDFORTHISSHIT ] [ ANYONEWHOHADAHEART ] [ WOWMOMENT ] [ OMERS ] [ LAWUNTOHIMSELF ] [ FORMATWAR ]
Plans for the future
To begin with, we’d really like to improve the overall design of the Acrostic Editor and make it more user friendly. Let us know if you have any design ideas, we’d love to hear your suggestions!
I’ve also been thinking about different algorithms for generating answers in the Acrostic Generator. One idea is to use a divide-and-conquer approach, where we recursively split the quote until we find a set of sub-quotes that satisfy all constraints of answers.
To conclude, here’s an
acrostic
for you all to solve, created using the
Acrostic Editor
! You can load the file in
Crosswords
and start playing.
Thanks for reading!