EXPLAIN EXTENDED

How to create fast database queries

A good first word for Wordle

with 8 comments

Ok, I gave in to the fad and took up Wordle.

For those who have been living under a rock for the past few weeks, Wordle is a relatively new online word game which has become viral. It is a variation of Bulls and Cows.

You have six tries to guess a five-letter word. Each try reveals letters that the target word has. If your guess has the same letter on the same spot as the target word, the letter is colored green. If the letter is there, but on the wrong spot, it's colored yellow. Finally, if the letter is not in the target word at all, it's colored black. There is a little twist when it comes to the words with repeating letters. If your guess has a repeating letter, it will be colored green or yellow, but only as many times as the target word has it (more about it later).

You can only use dictionary words to make your guesses, the game won't allow you to post garbage and reveal positions of the letters.

Every new try reveals some information about the target word. But when the playfield is empty, we don't have any information at all. We need a good word to start the game of Wordle. And, since the words are (supposedly) chosen randomly, and at the beginning of the game we are absolutely blind, it makes sense to use the same word to start every new game.

Let's use SQL to find this word!

The word list

I don't know what word list Wordle is using, but it's probably the same one that is being used by different variations of Scrabble, or close enough to it. I used SOWPODS which I found on Github: https://github.com/jesstess/Scrabble/blob/master/scrabble/sowpods.txt. This list contains exactly 12478 five-letter words, from "aahed" (which means "made an aah sound") to "zymic" (which means "pertaining to fermentation"). I uploaded them all into a PostgreSQL table.

The best worst case

Every time we make a move, the game reveals some information about the target word.

How much information does it reveal? It depends on the answer. If we entered the word and guessed four letters correctly, we would be happy: just one letter to go! If we guessed none of the letters right, we would probably not be so happy, although we did learn something new about the target word (namely, that it has none of the letters we used). Somehow, it feels that guessing letters right gives us more information about the words than not guessing them.

This feeling is (mostly) right, and it comes with a measure. There were 12478 potential candidates for the first in the list before our first guess. After our first guess, there will be fewer candidates.

If we entered the word RUBUS and got GBGGG (green, black, green, green, green) back, we know that the answer is REBUS: there is just one single word compatible with this coloring in the word list.

If we entered XYLAN, and got BBBBB back (no letters), then we would know that TUXES, or YAWED, or many other words could not be the answers (TUXES has "X" and YAWED has "Y"). But BIJOU, or FORTH, or PROBE, or yet many other words could, because they don't have any letters in common with XYLAN indeed. In fact, for the word XYLAN, there is 1 word compatible with the answer BYGGG, and 3303 words compatible with the answer BBBBB. This means that BYGGG gives us a lot more information about the answer than BBBBB does.

It would be nice to always get a lot of green letters in the answer, but the words are chosen randomly. This means that for the word XYLAN, the answer BBBBB narrows the pool of potential candidates down to 26% of the original pool. If we do the math for the other color combinations, we'll see that BBBYB narrows it down to 2049 potential candidates, BBBBY to 972 and so on. This also means that BBBBB is what we are likely to get the most.

It's not always the case, by the way, not for every word. For instance, if we enter the word YORES, we are more likely to get BBBBG (guess the last letter on its spot) than BBBBB (guess no letters at all). This is because so many letters in the list end in "s". For YORES, 1084 words in the list would give back BBBBG, vs. 969 for BBBBB. It means that at least for the first step, BBBBB would actually be a better answer for us, because it does a better job of narrowing down the pool of answers.

If you ever played the comparison game, where you should guess the number from "yes" or "no" answers to the questions "is it more than N", you would know that the optimal strategy for winning this game, in the long run, is always splitting the interval of possible values in two equal parts. If you pick uneven intervals, every now and then you will get lucky and guess the number faster, but most of the times, the target number will be in the larger interval (because of the trivial fact that the larger interval is, well, larger, and the random number is more likely to be there), and you will just have wasted a try. Every new answer should, on average, bring the same amount of information.

This means that a good word for Wordle, on every step, would be the word which would break the pool of candidate answers into more or less equal bins. Of course, unlike the number guessing game, it's not always possible to make these bins exactly equal: four green letters cannot possibly leave you more than 25 candidates, even in theory. But what we can do is pick the word which will give us the smallest pool of answers in the worst case. I call this strategy "the best worst case". If we get lucky, all the better. But even if we don't (and the larger pool of candidates the answer gives us, the more likely are we to get this answer), we still don't end up with too large a pool.

Formulation of the problem

We need to find the word which, in the worst case, would limit the pool of answers to the number that is less than any other word would. If the word XYLAN gives us 3033 potential candidates in the worst case, and the word YORES gives 1084, then YORES is a better candidate for the first guess than XYLAN. We need to find the word with the best worst case, so that we are guaranteed to have more information on the first attempt.

If we look into the potential answers that we can get, we can see that the first letter can be black, yellow or green; the second letter can be black, yellow or green and so on. Every letter can have three colors, and there are five of them, which gives us 35 = 273 possible combinations. Some combinations, though, are impossible, like four green letters and one yellow. This means that every guess divides the list of potential candidates into 273 (possibly empty) bins, some of them larger, some smaller. We want the bins to be as even as possible. We want to find the word for which the largest bin is smaller than the largest bin for every other word.

Implementation in SQL

I have created a table called wordle and loaded the list of 12478 five-letter words there. All the words are uppercase.

To encode the answer, we will be using ternary encoding. For the first letter, black is 0, yellow is 1, green is 2. For the second letter, black is 0×3 = 0, yellow is 1×3 = 3, green is 2×3 = 6 and so on. An answer like BGYGB would be encoded as 0×1 + 2×3 + 1×9 + 2×27 + 0×81 = 69.

Let's create a function to find the answer a target would give for a certain guess. It's not that straightforward because we need to handle repeating letters correctly (kudos to very astute @rcv from the comments section for pointing this out to me!).

First of all, let's notice that in case of an exact match on a spot, this spot will always be green. We will handle the green first, by replacing the matching letters with a period in both words.

With the greens out of the way, we need to calculate yellows and blacks. We don't have exact matches anymore in either word, because we got rid of them on the previous step. Now, if we have two of the same letter in the guess, but only one in the target word, we want only the first one of them to be yellow. We will evaluate the guess from left to right and notice how many occurrences of the current letter we've had so far. If it's less or equal to the number of occurrences of the same letter in the target word, then we color it yellow, otherwise black.

Let's see it on an example. Say, we are guessing ADDED for DUVET.

First of all, we handle the exact matches: ADD.D and DUV.T

Then we need to go over all spots in the guess, from 1 to 5, and count how many occurrences of the current letter we've had before (and including) this spot:

  1. 'A': current letter 'A', 1 occurrence in the guess so far, 0 in the target. 1 is more than 0, it's a black
  2. 'AD': current letter 'D', 1 occurrence in the guess so far, 1 in the target. It's a yellow
  3. 'ADD': current letter 'D', 2 occurrences in the guess so far, 1 in the target. 2 is more than 1, it's a black.
  4. 'ADD.': current letter '.'. It's an exact match which had been handled before, so it's a green.
  5. 'ADD.D': current letter 'D', 3 occurrences in the guess so far, 1 in the target. 3 is more than 1, it's a black.

The answer, therefore, would be 'BYBGB'.

We will create some helper functions to implement this logic in SQL.

CREATE OR REPLACE FUNCTION fn_fix_green(one TEXT, other TEXT, ix INTEGER)
RETURNS TEXT
IMMUTABLE PARALLEL SAFE
AS
$$

SELECT  COALESCE(NULLIF(SUBSTRING(one FROM ix FOR 1), SUBSTRING(other FROM ix FOR 1)), '.')

$$
LANGUAGE 'sql';


CREATE OR REPLACE FUNCTION fn_fix_green(guess TEXT, target TEXT)
RETURNS TEXT
IMMUTABLE PARALLEL SAFE
AS
$$

SELECT  fn_fix_green(guess, target, 1) ||
        fn_fix_green(guess, target, 2) ||
        fn_fix_green(guess, target, 3) ||
        fn_fix_green(guess, target, 4) ||
        fn_fix_green(guess, target, 5)

$$
LANGUAGE 'sql';


CREATE OR REPLACE FUNCTION fn_count_character(word TEXT, letter TEXT)
RETURNS INTEGER
IMMUTABLE PARALLEL SAFE
AS
$$

SELECT  LENGTH(word) - LENGTH(REPLACE(word, letter, ''))

$$
LANGUAGE 'sql';


CREATE OR REPLACE FUNCTION fn_fix_other(guess TEXT, target TEXT, ix INTEGER)
RETURNS INT
IMMUTABLE PARALLEL SAFE
AS
$$

SELECT  CASE
        WHEN SUBSTRING(guess FROM ix FOR 1) = '.' THEN
                2
        WHEN fn_count_character(SUBSTRING(guess FROM 1 FOR ix), SUBSTRING(guess FROM ix FOR 1)) <=
                fn_count_character(target, SUBSTRING(guess FROM ix FOR 1)) THEN
                1
        ELSE
                0
        END

$$
LANGUAGE 'sql';

CREATE OR REPLACE FUNCTION fn_fix_other(guess TEXT, target TEXT)
RETURNS INT
IMMUTABLE PARALLEL SAFE
AS
$$
SELECT  fn_fix_other(guess, target, 1) +
        fn_fix_other(guess, target, 2) * 3 +
        fn_fix_other(guess, target, 3) * 9 +
        fn_fix_other(guess, target, 4) * 27 +
        fn_fix_other(guess, target, 5) * 81
$$
LANGUAGE 'sql';


CREATE OR REPLACE FUNCTION fn_match(guess text, target text)
RETURNS INT
IMMUTABLE PARALLEL SAFE
AS
$$

SELECT  fn_fix_other(fn_fix_green(guess, target), fn_fix_green(target, guess))

$$
LANGUAGE 'sql';


CREATE OR REPLACE FUNCTION fn_match_letter(color INT)
RETURNS TEXT
IMMUTABLE PARALLEL SAFE
AS
$$

SELECT  CASE color WHEN 0 THEN 'B' WHEN 1 THEN 'Y' ELSE 'G' END

$$
LANGUAGE 'sql';


CREATE OR REPLACE FUNCTION fn_match_text(colors INT)
RETURNS TEXT
IMMUTABLE PARALLEL SAFE
AS
$$

SELECT  fn_match_letter(colors % 3) ||
        fn_match_letter((colors / 3) % 3) ||
        fn_match_letter((colors / 9) % 3) ||
        fn_match_letter((colors / 27) % 3) ||
        fn_match_letter((colors / 81) % 3)

$$
LANGUAGE 'sql';

I could have just combined these expressions in the SELECT list of the query, but it would be hard to read. I am not using any kind of aggregates or FROM clause here, because this function is on the hot path and it works much faster this way.

Let's unit test these functions. I'm using examples from the articles I found online which explain how Wordle handles duplicate letters. We will compare the actual output of the function to the expected value. If there will be no mismatches, the query will come back empty:

SELECT  target, guess, fn_match_text(fn_match(guess, target))
FROM    (
        VALUES
        ('AAABB', 'BBAAA'),
        ('GHOTI', 'GHOTI'),
        ('REBUS', 'RUBUS'),
        ('BANAL', 'ANNAL'),
        ('BANAL', 'UNION'),
        ('BANAL', 'ALLOY'),
        ('BANAL', 'BANAL'),
        ('ABBEY', 'ABBEY'),
        ('ABBEY', 'ABYSS'),
        ('ABBEY', 'KEBAB'),
        ('ABBEY', 'BABES'),
        ('ABBEY', 'OPENS'),
        ('DUVET', 'ADDED')
        ) AS q (target, guess)
EXCEPT
VALUES
        ('AAABB', 'BBAAA', 'YYGYY'),
        ('GHOTI', 'GHOTI', 'GGGGG'),
        ('REBUS', 'RUBUS', 'GBGGG'),
        ('BANAL', 'ANNAL', 'YBGGG'),
        ('BANAL', 'UNION', 'BYBBB'),
        ('BANAL', 'ALLOY', 'YYBBB'),
        ('BANAL', 'BANAL', 'GGGGG'),
        ('ABBEY', 'ABBEY', 'GGGGG'),
        ('ABBEY', 'ABYSS', 'GGYBB'),
        ('ABBEY', 'KEBAB', 'BYGYY'),
        ('ABBEY', 'BABES', 'YYGGB'),
        ('ABBEY', 'OPENS', 'BBYBB'),
        ('DUVET', 'ADDED', 'BYBGB')
target guess fn_match_text

The output of the query is empty, which means that all our unit tests are passing.

Now, we need to split the guesses into the bins and find the one with the best worst case. This guess will have the largest bin which is smaller, than the largest bins of all other guesses.

Since we need to evaluate every possible guess and compare it with every possible target word, we will have to cross join the word list with itself and find the answer for every combination of every pair of words from the list. This means that the database will have to make whopping 155,700,484 comparisons and process them. Fortunately, modern computers are very fast.

SELECT  guess.word AS guess, target.word AS target, fn_match(guess.word, target.word) AS colors
FROM    wordle guess
CROSS JOIN
        wordle target

Now that we have the answer for every possible combination of words, we will need to group them by the guess and the answer, to know how large is the bin that every possible answer for every possible word gives us. For some of the guesses, some combination of colors will never pop up, because there would be no words in the list which could give these combinations as an answer. Yet, most of them, or at least many enough of them, would come back not empty. For every possible guess, we will have no more than 273 rows with different combinations of the colors, and the number of answers which would give this combination.

SELECT  guess, colors, COUNT(*) matches
FROM    (
        SELECT  guess.word AS guess, target.word AS target, fn_match(guess.word, target.word) AS colors
        FROM    wordle guess
        CROSS JOIN
                wordle target
        ) q
GROUP BY
        guess, colors

Then, we would need to take the worst case (the largest bin) for every answer. Since we want to return not just the number of the candidates in the largest bin, but also the answer which would give us this bin, we would use a window function to order these bins and give every one of them its rank from the worst to the best:

SELECT  guess, colors, matches,
        ROW_NUMBER() OVER (PARTITION BY guess ORDER BY matches DESC) AS rn
FROM    (
        SELECT  guess, colors, COUNT(*) matches
        FROM    (
                SELECT  guess.word AS guess, target.word AS target, fn_match(guess.word, target.word) AS colors
                FROM    wordle guess
                CROSS JOIN
                        wordle target
                ) q
        GROUP BY
                guess, colors
        ) q
) q

Finally, we want to select just the worst bins for each guess (they would have rn = 1), and, among those, select the best, ordering them by the number of matches:

SELECT  *
FROM    (
        SELECT  guess, colors, matches,
                ROW_NUMBER() OVER (PARTITION BY guess ORDER BY matches DESC) AS rn
        FROM    (
                SELECT  guess, colors, COUNT(*) matches
                FROM    (
                        SELECT  guess.word AS guess, target.word AS target, fn_match(guess.word, target.word) AS colors
                        FROM    wordle guess
                        CROSS JOIN
                                wordle target
                        ) q
                GROUP BY
                        guess, colors
                ) q
        ) q
WHERE   rn = 1
ORDER BY
        matches
guess colors matches rn
SERAI 27 659 1
SOARE 9 724 1
PASEO 27 747 1
AEROS 1 749 1
STOAE 27 781 1
NARES 0 785 1
NEARS 0 785 1
REANS 0 785 1
KAIES 0 788 1
REALS 0 791 1
CIVIC 0 7333 1
COCCO 0 7491 1
JUGUM 0 7499 1
YUKKY 0 7538 1
BUBBY 0 7551 1
IMMIX 0 7581 1
FUZZY 0 7592 1
XYLYL 0 7772 1
FUFFY 0 7862 1
GYPPY 0 7880 1

This query ran for 76 minutes for my laptop.

So, it looks like a good first word for Wordle would be serai (a Turkish palace). Even in the worst case, it would narrow the list of candidate answers down to 659 entries.

It might not be the optimal word in the strict sense. There is a good chance that two-level or deeper analysis would bring a better first guess, that would not be optimal on the first step, but would be on the second, even in the worst case. But implementing it in SQL would require a triple cartesian self-join of a 12k records table, which would take weeks to complete and would probably run out of space on my laptop.

A whole game or Wordle in SQL

Let's solve yesterday's Wordle (#221) completely in SQL!

We will be playing by the hardest rules: every green letter should be on its spot, every yellow letter should be used elsewhere, no black letters should be used. Ironically, this is easiest to implement in SQL: we will just need to make sure that every next guess is in the pool of all the previous guesses. This will keep pool of potential guesses and potential answers shorter and make the query run faster.

We will create a materialized CTE called valid to calculate the pool. It will be only populated by the words which are in line with the answer Wordly gave us.

Here's the game. We will start with SERAI, our top pick. Wordle gives us back BBBYB (27).

SERAI

Let's find the next best candidate for the guess:

WITH    guesses (guess, colors) AS
        (
        VALUES
        ('SERAI', 27)
        ),
        valid AS MATERIALIZED
        (
        SELECT  *
        FROM    wordle
        WHERE   (
                SELECT  BOOL_AND(fn_match(guess, word) = colors)
                FROM    guesses
                )
        )
SELECT  guess, matches
FROM    (
        SELECT  guess, colors, matches,
                ROW_NUMBER() OVER (PARTITION BY guess ORDER BY matches DESC) AS rn
        FROM    (
                SELECT  guess, colors, COUNT(*) matches
                FROM    (
                        SELECT  guess.word AS guess, target.word AS target, fn_match(guess.word, target.word) AS colors
                        FROM    valid guess
                        CROSS JOIN
                                valid target
                        ) q
                GROUP BY
                        guess, colors
                ) q
        ) q
WHERE   rn = 1
ORDER BY
        matches
LIMIT 1
guess matches
MANLY 74

Alright, the next best candidate is NYALA. We put it into Wordle, and it gives us back BBGBB (18).

NYALA

What's the next word?

WITH    guesses (guess, colors) AS
        (
        VALUES
        ('SERAI', 27),
        ('NYALA', 18)
        ),
        valid AS MATERIALIZED
        (
        SELECT  *
        FROM    wordle
        WHERE   (
                SELECT  BOOL_AND(fn_match(guess, word) = colors)
                FROM    guesses
                )
        )
SELECT  guess, matches
FROM    (
        SELECT  guess, colors, matches,
                ROW_NUMBER() OVER (PARTITION BY guess ORDER BY matches DESC) AS rn
        FROM    (
                SELECT  guess, colors, COUNT(*) matches
                FROM    (
                        SELECT  guess.word AS guess, target.word AS target, fn_match(guess.word, target.word) AS colors
                        FROM    valid guess
                        CROSS JOIN
                                valid target
                        ) q
                GROUP BY
                        guess, colors
                ) q
        ) q
WHERE   rn = 1
ORDER BY
        matches
LIMIT 1
guess matches
COAPT 6

The next word is COAPT (which would narrow us down to just 6 candidates, worst case). What does Wordle think about it?

COAPT

It's YBGBB (19). On to the next one.

WITH    guesses (guess, colors) AS
        (
        VALUES
        ('SERAI', 27),
        ('NYALA', 18),
        ('COAPT', 19)
        ),
        valid AS MATERIALIZED
        (
        SELECT  *
        FROM    wordle
        WHERE   (
                SELECT  BOOL_AND(fn_match(guess, word) = colors)
                FROM    guesses
                )
        )
SELECT  guess, matches
FROM    (
        SELECT  guess, colors, matches,
                ROW_NUMBER() OVER (PARTITION BY guess ORDER BY matches DESC) AS rn
        FROM    (
                SELECT  guess, colors, COUNT(*) matches
                FROM    (
                        SELECT  guess.word AS guess, target.word AS target, fn_match(guess.word, target.word) AS colors
                        FROM    valid guess
                        CROSS JOIN
                                valid target
                        ) q
                GROUP BY
                        guess, colors
                ) q
        ) q
WHERE   rn = 1
ORDER BY
        matches
LIMIT   1
guess matches
WHACK 2

The next one is WHACK, with just two candidates. Let's put it into Wordle and… we win!

WHACK

Have fun!

Written by Quassnoi

January 27th, 2022 at 11:00 pm

Posted in PostgreSQL

Tagged with , , , ,

8 Responses to 'A good first word for Wordle'

Subscribe to comments with RSS

  1. I have done exactly 7 worldle’s. So, please don’t consider me an expert. However, in the 4th paragraph below “The Best Worst Case”, you suggest RUBUS could be scored with GYGGG.

    I disagree. The score would be GBGGG. You can never be scored 4 greens and 1 yellow, because that says all letters are correct, but one letter is out of place. Of course, that’s not possible.

    Scoring for WORDLE seems to be similar to scoring for Mastermind. However, WORLDLE gives you precise position information, whereas Mastermind effectively gives you the count of G’s (black pegs in Mastermind) and the count of Y’s (yellow pegs in Mastermind).

    I’m not sure I want to delve more deeply when there is a fundamental misstatement of how the game is scored. Cheers!

    rcv

    29 Jan 22 at 20:33

  2. @rcv: you would be right had Wordle not allowed repeating letters (which is what I initially thought), but they do. In RUBUS for REBUS, the letter U is in the target word, but it’s not on the second spot. Hence, the second spot is yellow.

    Quassnoi

    29 Jan 22 at 20:40

  3. OK. I did delve more deeply. The scoring error noted above is carried through to your fn_match function.

    In my post, I means the “white pegs” in Mastermind, not the “yellow pegs”.

    3^5 = 243.

    Having done 7 of these, every final solution has been drawn from a list of relatively common words. But, the dictionary of acceptable words has been significantly larger.

    If you don’t want to “cheat” and obtain the same dictionaries used by WORDLE, I would suggest you incorporate word frequency/probability into your tables. So, rather than treating RUBUS and WHACK as equally probably, weight the words in some way. Then don’t just tally the “number of words” that fit a pattern, tally the “weighted probaility” that one of the eligible words will fit the pattern.

    Among the most common words are THERE, OTHER, WHICH, ABOUT, WOULD, WRITE. I would suggest underweighting these words, because it is likely WORDLE is using a dictionary containing several hundred to a couple of thousand words. Perhaps something as simple as weighting all of the most common 2000 words at the same value as the word ranked 2000. (Or whatever cutoff you choose.)

    I also think it wouild be practical to find the “Best Worst Case” to 2 levels if you limited your dictionary for the first 2 levels to 600 (600^2 = 36,000) or 2000 (2000^2 = 4,000,000) of the most common words. This won’t take “weeks”. After 2 levels, your algorithm looks OK, provided it takes word frequency/probability into consideration. Cheers!

    rcv

    29 Jan 22 at 21:25

  4. I think you are mistaken. Without giving much up on today’s WORDLE, if I start with the word QUEUE, it is score BYBBB. I think you would score it BYBYB.

    rcv

    29 Jan 22 at 21:35

  5. (Obviously, if you accept my actual result from today’s official WORDLE Web Site, they could have scored QUEUE as either BYBBB or BBBYB. I’m sure this makes your scoring functions quite a bit more complicated.)

    rcv

    29 Jan 22 at 21:46

  6. @rcv: you’re totally right, I got it wrong. I’ll fix the post later today. Thanks for the heads up!

    Quassnoi

    29 Jan 22 at 22:46

  7. I might suggest your dictionary may be too broad – of those 12,000 words, how many are likely to appear. I’ve been solving wordle for just a short while using a SQL words database I acquired 20+ years ago that’s just under 7,000 words with 5 characters. I’ve been trying to make my SQL query “easier” to write out, and take inspiration from your code. FYI – I was given the tip to start with “SIREN” and “OCTAL” to cover the most common letters.

    JOC

    7 Feb 22 at 19:38

  8. @JOC: Wordle, actually, uses two word lists: one for the answers (they are common words) and another one for the possible guesses. The latter is very close to the one I was using. I didn’t know it used two lists at the time I was writing the article.

    Quassnoi

    7 Feb 22 at 20:04

Leave a Reply