# Word suggestion algorithm

**Status**: Draft | 2025

**Author**: Victor Ma


## Terminology

| Term                       | Definition |
| -------------------------- | ---------- |
| Current cell               | The cell that the cursor is on.
| Current slot               | The slot that the cursor is on.
| Current filter             | The filter for the current slot.
| Secondary slot             | The slot that intersects the current slot, at the current cell.
| Secondary filter           | The filter for the secondary slot.
| Intersecting slot          | A slot that intersects a given slot.
| Intersecting slots         | All the slots that intersect a given slot.
| Word suggestion list       | The UI element that shows a list of words that fit the current slot.
| Word suggestion algorithm  | An algorithm that generates word suggestions for a given slot or grid.
| Dead end word              | A word that results in an unfillable grid.
| Grid fill                  | A one-to-one mapping from the slots of a grid to words.

![Word suggestion list](word-suggestion-list.png)<br>
*The word suggestion list.*


## Our current algorithm

Our current word suggestion algorithm works like this, where $n$ is the offset
of the current cell in the current slot, and $m$ is the offset of the current
cell in the secondary slot:
1. **(Phase 1)** Get the set of possible letters for the current cell,
   constrained by the current filter.
    1. Get the set of words that match the current filter.
    1. Return the set of letters that appear as the $n^{\text{th}}$ letter, in
       the set of words.
1. **(Phase 2)** Get the set of possible words for the secondary slot, and the
   set of possible letters for the current cell---both constrained by the
   secondary filter and the current filter.
    1. Get the set of words that satisfy both these constraints:
        * The word matches the secondary filter.
        * The $m^{\text{th}}$ letter of the word is in the set of letters from
          Step 1.
    1. Get the set of letters that appear as the $m^{\text{th}}$ letter, in the
       set of words.
    1. Return the set of words and the set of letters.
1. **(Phase 3)** Get the set of possible words for the current slot, constrained
   by the current filter and secondary filter.
    1. Get the set of words that satisfy both these constraints:
        * The word matches the current filter.
        * The $n^{\text{th}}$ letter of the word is in the set of letters from
          Step 2.
    1. Return the set of words.
1. **(Return)** Return the set of words from Phase 2 and Phase 3.

The [`word_list_find_intersection ()`](https://gitlab.gnome.org/jrb/crosswords/-/blob/0.3.15/src/word-list.c?ref_type=tags#L1372) function implements our current word suggestion algorithm.

### Limitations

Our current algorithm only considers two constraints:
* The filter for the current slot.
* The filter for the secondary slot.

This means that the algorithm is unaware of the other intersecting slots---and
the constraints that they impose on the current slot.

Consider the following grid:
```
+---+---+---+---+
|   |   |   | Z |
+---+---+---+---+
|   |   |   | E |
+---+---+---+---+
|   |   |   | R |
+---+---+---+---+
| W | O | R |   | < current slot
+---+---+---+---+
```
4-Down begins with *ZER*, so the only word it can be is *ZERO*. This constrains
the bottom-right cell to the letter *O*.

4-Across starts with *WOR*. We know that the bottom-right cell must be *O*, so
that means 4-Across must be *WORO*. But *WORO* is not a word. So, this grid is
unfillable, which means that 4-Across and 4-Down are also unfillable.

Now, suppose that the current slot is 4-Across, and the current cell is the
bottom-right cell. Then, our word suggestion algorithm checks the filters for
4-Across and 4-Down. It correctly detects that the two slots are unfillable, and
returns the empty set.

But what if the current cell is one of the other cells in 4-Across? Then, the
algorithm never checks the filter for 4-Down, and so it doesn't know about the
constraint that it imposes on the current slot. As a result, the algorithm
returns all the words that match the filter *WOR?*, like *WORD* and
*WORM*---even though they don't actually fit in the slot.


### Consequences

There are two undesirable consequences of this:
* Our algorithm can generate dead end words. These are frustrating for the user.
  The user might enter a word from the word suggestions list, reach a dead end,
  and then need to undo all their progress since adding the suggested word.
* Our algorithm's output can vary based on the current cell. This is
  semantically incorrect, because the word suggestions for a slot should not
  change based on the cursor's position.

So, we should change our algorithm, in order to fix these issues.


## Lookahead-based algorithm

One possible solution is to add *lookahead* to our word suggestion algorithm.
The lookahead-based algorithm takes into account the intersecting slots of the
current slot, when generating the word suggestions. Here's how it works:

1. Iterate through the intersecting slots of the current slot.
    * For each intersecting slot, determine the constraint that it imposes on
      the current slot.
1. Return the set of words that meet these constraints:
    * The word meets the constraints imposed by the intersecting slots.
    * The word matches the current filter.

Consider this grid from before:
```
+---+---+---+---+
|   |   |   | Z |
+---+---+---+---+
|   |   |   | E |
+---+---+---+---+
|   |   |   | R |
+---+---+---+---+
| W | O | R |   | < current slot
+---+---+---+---+
  ^ current cell
```
Suppose that the current slot is 4-Across, and the current cell is the
bottom-left cell. Then, the lookahead-based algorithm checks the filter for
4-Down. The algorithm sees that the filter constrains the bottom-right cell to
the letter *O*. This makes the current filter be `WORO`, which doesn't match any
word. So, the lookahead-based algorithm returns the empty set.

### Problem solved?

This approach solves the problem of our word suggestion algorithm producing
different outputs for the same slot, based on the cursor's position. Now, the
current cell has no impact on the output.

This approach also significantly reduces the amount of dead end words in the
output. However, it does not eliminate them altogether.

### Still broken
Consider this grid:
```
+---+---+---+---+
|   |   |   | N |
+---+---+---+---+
| Q | U | I |   |
+---+---+---+---+
|   |   |   | X |
+---+---+---+---+
| W | E | S |   | < current slot
+---+---+---+---+
```

2-Across starts with *QUI*, so the only word it can be is *QUIZ*. This makes the
filter for 4-Down be *NZX?*. No word matches this filter, and so, 4-Down is
unfillable. This means that 4-Across is also unfillable, because it intersects
with 4-Down.

Now, suppose that the current slot is 4-Across. Then, the lookahead-based
algorithm (as currently defined) does not detect that the current slot is
unfillable.

The algorithm sees that the filter for 4-Down is *N?X?*. But it does not see
that 2-Across forces 4-Down to be *NZX?*. So, the algorithm thinks that 4-Down
can be *NEXT*, and that 4-Across can be *WEST*---which is incorrect.

This failure is because the lookahead-based algorithm checks the intersecting
slots of the current slot---but it does not check the intersecting slots of the
intersecting slots of the current slot. In other words, the lookahead does not
go deep enough to discover that the current slot unfillable.

### Additional levels of lookahead

To handle this grid properly, we need to add another level of lookahead. To do
this, we make the algorithm recursively perform the lookahead process on all the
slots from the first level of lookahead. This gives the algorithm a more
accurate set of possible words for the intersecting slots of the current slot.
That, in turn, lets the algorithm more accurately constrain the current slot.

For each additional level of lookahead that we want to add, we increase the
recursion depth by one.

| Level | Constraints considered |
| ----- | ---------------------- |
| 0     | Current filter.
| 1     | Current filter, intersecting filters of the current slot.
| 2     | Current filter, intersecting filters of the current slot, intersecting filters of the intersecting slots of the current slot.
| ...   |

The lookahead-based algorithm with two levels does handle the example grid
properly. But it's clear that we can always create a grid that trips up the
algorithm---unless the algorithm has enough levels to visit every slot on the
grid.

### Unsolveable?

Consider the following grid. Every slot is unfillable. This is because 4-Across
(*ZXCVBN?*) is unfillable, and that "unfillability" propagates to every other
slot.
```
+---+---+---+---+---+---+---+
|   | U | A | L | I | T |   |
+---+---+---+---+---+---+---+
| U | ■ | ■ | ■ | ■ | ■ | O |
+---+---+---+---+---+---+---+
| I | ■ | A | X |   | ■ | G |
+---+---+---+---+---+---+---+
| E | ■ | ■ | ■ | B | ■ | U |
+---+---+---+---+---+---+---+
|   | H | U | M |   | ■ | R |
+---+---+---+---+---+---+---+
| ■ | ■ | ■ | ■ | ■ | ■ | T |
+---+---+---+---+---+---+---+
| Z | X | C | V | B | N |   |
+---+---+---+---+---+---+---+
```

Now, suppose the current slot is 2-Across (*AX?*). Then, our algorithm needs at
least six levels of lookahead to properly detect that the grid is unfillable.
Otherwise, it thinks that the grid can be filled with *AXE*, *EBB*, *THUMB*,
*QUIET*, *QUALITY*, and *YOGURTS* (starting with 2-Across and spiralling out).

However, it's not practical to add more than one or two levels of lookahead. The
number of intersections that the algorithm checks is roughly $s^n$, where $s$ is
the average slot size, and $n$ is the number of levels.

So practically, the limit is one or two levels. This means that we cannot
completely prevent dead end words with the lookahead-based algorithm.


## AC-3-based algorithm

In order for our word suggestion algorithm to visit all the slots on the
grid---and thus eliminate all dead end words---we need to use a different
algorithm entirely. For this, we look to the field of
[constraint satisfaction problems (CSPs)](https://en.wikipedia.org/wiki/Constraint_satisfaction_problem).

### Express grid filling as a CSP

<!--TODO: fix latex overflow-->

For a given grid and word list, we can express the problem, *Find a valid grid
fill for this grid*, as a CSP. What follows is one possible CSP model for this
problem, though
[other models exist](https://cs.uwaterloo.ca/~vanbeek/Publications/cai01a.pdf).

Let $V$ be the set of variables for the CSP, and

$$
V = \{a_1, a_2, \dots, a_m\} \cup \{d_1, d_2, \dots, d_n\},
$$

where $a_i$ is the word in the $i$-Across slot and $m$ is the number of across
slots in the grid; and $d_j$ is the word in the $j$-Down slot and $n$ is the
number of down slots in the grid. So, there is one variable for each slot, and
each variable is set to the chosen word for that slot.

Let $D$ be the set of domains for the CSP, and

$$
D = \{D_v \mid v \in V\}.
$$

Let $F_v$ be the set of words that match the filter for the slot that $v$
represents, and that are in the word list, and let

$$
D_v = F_v.
$$

In other words, the domain of a variable is the set of word-list words that
match the variable's slot's filter.

Let $I$ be the set of intersection constraints for the CSP, and

$$
I = \{I_{v_1v_2} \mid v_1, v_2 \in V \land \text{The slots for $v_1$ and $v_2$ intersect each other} \land \text{The slot for $v_1$ appears before the slot for $v_2$ in the clue list}\},
$$

where

$$
\begin{aligned}
  I_{v_1v_2} ={} &\text{The values of $v_1$ and $v_2$ have the same letter} \\
  &\text{at the offsets where their slots intersect}.
\end{aligned}
$$

Let $Q$ be the set of unique word constraints, and

$$
Q = \{Q_{v_1v_2} \mid  v_1, v_2 \in V \land v_1 \neq v_2 \land
    \text{The slots for $v_1$ and $v_2$ have the same length} \land
    \text{The slot for $v_1$ appears before the slot for $v_2$ in the clue list}\},
$$

where

$$
Q_{v_1v_2} = \text{The value of } v_1 \neq \text{the value of } v_2.
$$

Let $C$ be the set of constraints for the CSP, and

$$
C = I \cup Q.
$$

### Solution to the CSP

A solution to this CSP is a valid grid fill for the given grid and word list.
Suppose the CSP has a standard word list, and the following grid:
```
+---+---+---+---+
| ■ | ■ | ■ | N |
+---+---+---+---+
| T | I | M |   |
+---+---+---+---+
| ■ | ■ | ■ | X |
+---+---+---+---+
| W | E | S |   |
+---+---+---+---+
```
Then, the unique solution is
\begin{align*}
  a_1 &= \text{TIME} \\
  a_2 &= \text{WEST} \\
  d_1 &= \text{NEXT}.\\
\end{align*}

### Arc consistency

However, our goal is not to find a solution to the CSP. Our goal is to get the
set of possible words for a slot.

To do this, we make every variable
*[arc consistent](https://en.wikipedia.org/wiki/Local_consistency#Arc_consistency)*
with every other variable. To do this, we can use the
[AC-3 algorithm](https://en.wikipedia.org/wiki/AC-3_algorithm).

### Resources

#### Learning resources

Resources for learning about CSPs and the AC-3 algorithm:
* [UIUC lecture slides](https://courses.grainger.illinois.edu/ece448/sp2022/slides/lec18.pdf)
* [FIT lecture slides](https://cs.fit.edu/~dmitra/ArtInt/lectures/constraint.pdf)
* [UWaterloo lecture slides](https://cs.uwaterloo.ca/~jhoey/teaching/cs486/lecture4-nup.pdf)
* [Textbook chapters on arc consistency](https://www.sciencedirect.com/topics/computer-science/arc-consistency)

<!-- Videos -->

#### Other resources

Miscellaneous resources related to CSPs and crosswords:
* [Paper that compares crossword CSP models](https://cs.uwaterloo.ca/~vanbeek/Publications/cai01a.pdf)
* [Paper on solving crossword CSP](https://web.archive.org/web/20230202063946/https://web.stanford.edu/~jduchi/projects/crossword_writeup.pdf)
* [Crossword construction tool](https://www.crosswordconstruction.com/static/files/poster.pdf)


## Algorithms comparison

The following is a comparison between the lookahead-based algorithm, and the
AC-3-based algorithm. Assume that the lookahead-based algorithm uses a single
level of lookahead.

<!--
|   | Lookahead | AC-3 |
| - | --------- | ---- |
| **Dead end words** | Uncommon. | None.
| **Crossword aids** | Possible, but requires conversion to grid-level algorithm. | Fully compatible.
| **Benefit to autofill** | Possible, but requires conversion to grid-level algorithm. | Yes.
| **Speed** | Fast. | Slow.
| **Implementation difficulty** | Simple | Complex.
-->


### Dead end words

The AC-3-based algorithm eliminates all dead end words. The lookahead-based
algorithm does not eliminate all dead end words.

However, the lookahead-based algorithm *does* eliminate most of the dead end
words. The reasoning for this is:
1. The lookahead-based algorithm accounts for the slots closest to the current slot.
1. The slots closest to the current slot impose most constraints on the current slot.
1. Therefore, the lookahead-based algorithm eliminates most of the dead end words.

#### Closer slots impose more constraints

Closer slots impose more constraints on the current slot, because they are fewer
steps removed from the current slot.

Suppose $i_1$ is an intersecting slot of the current slot. Then, $i_1$ has a
large influence on the current slot, because $i_1$ directly constrains the
character set for one of the current slot's cells.

Now, suppose $i_2$ is an intersecting slot of $i_1$. Then, the only way that
$i_2$ can constrain the current slot is by constraining $i_1$. And the only way
it can constrain $i_1$ is by constraining the character set for one of $i_1$'s
cells. So, $i_2$ must constrain one of $i_1$'s cells to the point that $i_1$
becomes constrained enough to add an additional constraint on the current slot.

And for a slot $i_3$ that intersects $i_2$, the constraining effect is another
order of magnitude smaller, and so on, for each additional step removed a slot
is.

So, the impact that a slot has on the current slot grows weaker, the further
away it is. This means that the lookahead-based algorithm eliminates most of the
dead end words.

### Crossword aids

Here are some crossword aids that would be good to implement:

| Crossword aid | Description |
| ------------- | ----------- |
| Fillability indicator | An indicator for the fillability of the grid. If at least one slot is unfillable, then the grid is unfillable.
| Cell constraint heat map | A heat map overlaid on top of the grid that indicates how constrained each cell is. A cell that can be any letter from *A* to *Z* is completely unconstrained. A cell that can only be a single letter is maximally constrained.
| Most-constrained-slot button | A button that moves the cursor to the most constrained slot. The most constrained slot is the slot that has the smallest set of possible words---and that is not fully filled.

The AC-3-based algorithm functions on the grid level. Its input is the entire
grid, and its output is the set of possible words for every slot on the grid.
This makes it well suited for implementing the crossword aids.

The lookahead-based algorithm functions on the slot level. In order to implement
crossword aids with it, we need to turn the algorithm into a grid-level
algorithm, like this:
```text
For each slot in the grid:
  Run the lookahead-based algorithm
```

However, the crossword aids are less accurate with the lookahead-based
algorithm. This is because the algorithm's outputs may contain dead end words,
which pollute the data that the crossword aids rely on. The more dead end words
there are, the less accurate the crossword aids become.

<!--
### Benefit to autofill

The AC-3-based algorithm can benefit the autofill algorithm by providing it with the possible word sets for every slot.

The lookahead-based algorithm can also do this, if we transform it into a grid-level algorithm.

==manual button==
==hybrid (switch automatically when enough filled)==
==use lookahead while ac3 runs==
-->

### Speed

The lookahead-based algorithm takes [$n$ times longer to run](#implementation)
than our current word suggestion algorithm, where $n$ is the average slot size.

Our current algorithm takes less than 16 ms to run (because we are able to hit
60 fps). So, even for the maximum slot size of 21 (extra-large grid), the
lookahead-based algorithm runs in under a second.

<!--
To estimate the speed of the AC-3-based algorithm, we can look at existing crossword editors that use a grid-level word suggestion algorithm. For these editors, their algorithms' speed looks like this:
* If most of the grid is filled, the algorithm runs in under a second

==regen each time?==
-->

### Implementation difficulty

* The lookahead-based algorithm is easy to implement.
* The AC-3-based algorithm is difficult to implement.

## Which algorithm?

We should implement the lookahead-based algorithm. It significantly reduces dead
end words, it runs quickly, and it's easy to implement.

<!--
The AC-3-based algorithm has the potential to improve our editor a lot---

### Both algorithms?

button, option
-->


## Prior Art

The following is true about most crossword editors:
* They use a grid-level word suggestion algorithm.
* It takes a while for the word suggestion list to populate.
* They have a fillability indicator.
* They have a cell constraint heat map.
* They have a most-constrained-slot button.

Open source editors:
* [Ingrid's word suggestion algorithm](https://github.com/rf-/ingrid_core/blob/f00da637452fff46e1b6255a29a553aa8f2da5c2/src/backtracking_search.rs#L1-L5)
  is well-documented, but complex.
* Exet's word suggestion algorithm simply looks for words that match the current filter.

<!--caching?-->

(implementation)=
## Implementation

The following is a possible implementation for the lookahead-based word
suggestion algorithm. It runs the intersection function on each cell of the
current slot, and then returns the set intersection of all the outputs.
```
lookahead_based_algorithm (current_slot)
{
  final_words_set = NULL;

  for (cell : current_slot)
    {
      if (cell != "")
        continue;

      intersecting_slot = get_int_slot (current_slot, cell);
      words_set = intersection_func (current_slot, intersecting_slot);

      if (final_words_set == NULL)
        final_words_set = words_set;
      else
        final_words_set = (final_words_set) ∩ (words_set);

      if (final_words_set == {})
        return final_words_set;
    }

  return final_words_set;
}
```

### Must be asynchronous

This algorithm runs the intersection function $n$ times, where $n$ is the size
of the current slot. The intersection function itself takes a few milliseconds
to run. This means that we cannot hit our target of 16 ms per frame, if we run
the lookahead-based algorithm in the main thread. So, we must implement the
algorithm asynchronously.


<!--
###
==only run if less than N open slots to avoid lag on open grid...===

==future async==
autofill csp
-->

<!--TODO: add section about caching somewhere, probably before the implementation section.-->

## Next steps

1. Finish the WIP parts of this design doc.
1. Implement the lookahead-based algorithm synchronously.
1. Discuss and decide next steps:
    * Implement the lookahead-based algorithm asynchronously?
    * Implement AC-3-based algorithm?
    * Hybrid approach?

## Open questions

## Areas for improvement