Wikipedia:Reference desk/Archives/Language/2022 February 3

Language desk


February 3

Word list

I'm trying my hand at replicating Wordle in Excel. To my surprise, the difficult part has been obtaining a decent list of 5-letter words. The stuff I've found online is either really truncated (eliminating a lot of pretty basic 5-letter English words) or spread across dozens of pages. What I'd like to find at this point is just a plain old corpus of 5-letter words that I can paste in. I've resigned myself to creating the actual puzzle word list by hand, but I do still need a more comprehensive list to serve as a check that the user is entering a valid word (and not just entering AEIOU for their first word). Any suggestions? If possible, it should be a "legal" word list such as what Scrabble might accept, though it needn't be an exact match. I'm just trying to avoid lists with made-up stuff like adrad in them. Matt Deres (talk) 02:38, 3 February 2022 (UTC)

Couldn't you download a Scrabble dictionary and then run a script to filter out all the accepted five-letter words? By the way, adrad doesn't appear to be made up; it's from Chaucer's English and generally obsolete... 惑乱 Wakuran (talk) 03:27, 3 February 2022 (UTC)
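The filtering step suggested above could look roughly like this in Python. This is only a sketch: it assumes the dictionary has been saved as a plain-text file with one word per line, and the file names scrabble_words.txt and five_letter_words.txt are hypothetical placeholders, not real downloads.

```python
from pathlib import Path


def five_letter_words(lines, length=5):
    """Return the sorted, de-duplicated alphabetic words of the given length."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        # Keep only purely alphabetic ASCII entries of the requested length.
        if len(word) == length and word.isascii() and word.isalpha():
            words.add(word)
    return sorted(words)


if __name__ == "__main__":
    # Hypothetical file name; substitute whatever word list was downloaded.
    source = Path("scrabble_words.txt")
    if source.exists():
        lines = source.read_text(encoding="utf-8").splitlines()
        result = five_letter_words(lines)
        # One word per line, which pastes straight into a spreadsheet column.
        Path("five_letter_words.txt").write_text(
            "\n".join(result) + "\n", encoding="utf-8"
        )
        print(f"wrote {len(result)} words")
```

The output file has one word per line, so it can be pasted into a single Excel column and used as a lookup range for validating guesses.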
@Matt Deres: Bill the Farmer has two English*.java for his Gurgle. --Error (talk) 12:27, 3 February 2022 (UTC)
What about importing a large bunch of public domain books and parsing out the unique five-letter words? Hayttom (talk) 17:57, 3 February 2022 (UTC)
I'd guess that approach would produce a bunch of fluff: proper names, place names, random coinages, etc... 惑乱 Wakuran (talk) 19:53, 3 February 2022 (UTC)
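For what it's worth, the book-parsing idea and the fluff objection can both be sketched in a few lines of Python. The heuristics here are assumptions for illustration, not a tested method: capitalised tokens are ignored (which drops most names, at the cost of also ignoring sentence-initial occurrences), and a word must appear at least min_count times in lower case (which drops one-off coinages). The function name candidate_words and its parameters are made up for this example.

```python
import re
from collections import Counter

# Runs of ASCII letters; apostrophes and hyphens split tokens.
WORD_RE = re.compile(r"[A-Za-z]+")


def candidate_words(text, length=5, min_count=2):
    """Return sorted words of the given length seen in lower case at least min_count times."""
    counts = Counter()
    for token in WORD_RE.findall(text):
        # Only all-lowercase tokens are counted, so "Paris" never qualifies,
        # while "their" still does even though "Their" is skipped.
        if len(token) == length and token.islower():
            counts[token] += 1
    return sorted(word for word, n in counts.items() if n >= min_count)
```

A list built this way would still want checking against a dictionary, since common misspellings and dialect spellings can pass both filters.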
The list used by Wordle is included in plain text in its source code, and is reasonably easy to extract from there. (Actually there are two lists, of common and less-common words.) AndrewWTaylor (talk) 22:15, 3 February 2022 (UTC)
Check these files for the Wordle lists in text format sorted alphabetically. Apparently the source files from the official website are in order, so they contain spoilers. - Lindert (talk) 22:58, 3 February 2022 (UTC)
Thanks, Lindert; that did the trick (though it does have adrad in there :-P). AndrewWTaylor, your link returned a 404 error. Just an FYI; I've got what I need. Thank you everyone who replied! Matt Deres (talk) 03:33, 4 February 2022 (UTC)
According to Wiktionary, adrad is Middle English for "afraid", so maybe not "made up" exactly. --Trovatore (talk) 23:07, 5 February 2022 (UTC)
  • Just as an aside, 3Blue1Brown just posted a video on the Wordle game and information theory. Here. It may have a lot of information about the word lists that would be useful for the OP. --Jayron32 17:12, 7 February 2022 (UTC)
From what I've read about the game, the main point is that there's a large set of words (including ones such as adrad) that are acceptable as five-letter "guesses" by players, but only a much smaller subset of these (consisting of widely known words) is used for actual solutions. See this, for example. Deor (talk) 17:24, 7 February 2022 (UTC)
Yes, that is correct; 3b1b's video goes into that, and talks about how to optimize play using a combination of the words in the lists and knowledge of how common the various words on them are. --Jayron32 17:32, 7 February 2022 (UTC)
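The two-list arrangement described above (a small solution list inside a larger set of acceptable guesses) might be modelled like this in Python. It is a sketch only: the function names make_game and is_valid_guess and the tiny word lists in the usage note are illustrative assumptions, not Wordle's actual code or data.

```python
import random


def make_game(solutions, extra_guesses, rng=None):
    """Pick an answer from the solution list and build the set of allowed guesses."""
    if rng is None:
        rng = random.Random()
    solutions = [word.lower() for word in solutions]
    # Every solution is also a legal guess; the extra list only widens the set.
    allowed = set(solutions) | {word.lower() for word in extra_guesses}
    answer = rng.choice(solutions)
    return answer, allowed


def is_valid_guess(guess, allowed):
    """Return True if the guess, ignoring case and surrounding spaces, is in the allowed set."""
    return guess.strip().lower() in allowed
```

With solutions ["crane", "slate"] and extra guesses ["adrad"], a guess of ADRAD would be accepted but could never be the answer, while AEIOU would be rejected because it is in neither list.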