Tuesday, December 21, 2010

Machine Learning for Human Memorization

Posted by Danny Tarlow
Here's a machine learning problem I find interesting but haven't been able to satisfactorily solve.

Background: As some of you know, I used to play Scrabble somewhat seriously. Most Tuesdays in middle school, I would go to the local Scrabble club meetings and play 4 games against the best Scrabble players in the area (actually, it was usually 3 games, because the 4th game started past my bedtime). It's not your family game of Scrabble: to begin to be competitive, you need to know all of the two letter words, most of the threes, and you need to have some familiarity with a few of the other high-priority lists (e.g., vowel dumps; short q, z, j, and x words; at least a few of the bingo stems). See here for a good starting point.

Anyway, I recently went to the Toronto Scrabble Club meeting and had a great time. I think I'll start going with more regularity. As a busy machine learning researcher, though, I don't have the time or the mental capacity to memorize long lists of words anymore: for example, there are 972 legal three letter words and 3902 legal four letter words.

So I'm looking for an alternative to memorization. Typically during play, there will be a board position that could yield a high-scoring word, but it requires that XXX or XXXX be a word. It would be very helpful if I could spend a minute or so of pen and paper computation time, then arrive at an answer like, "this is a word with 90% probability". So what I really need is just a binary classifier that maps a word to probability of label "legal".

Problem description: In machine learning terms, it's a somewhat unusual problem (from what I can tell). We're not trying to build a classifier that generalizes well, because the set of 3 (or 4) letter words is fixed: we have all inputs, and they're all labeled. At first glance, you might think this is an easy problem: just choose a model with high capacity, overfit the training data, and be done. There's no need for regularization if we don't care about overfitting, right? Well, not exactly. By this logic, we should just use a nearest neighbors classifier; but in order for me to run a nearest neighbors algorithm in my head, I'd need to memorize the entire training set!

There is some related work. Model selection criteria like the Bayesian information criterion (BIC) penalize a model for having many parameters. This is somewhat close, but not all sets of parameters are equally difficult for me to memorize. For example, if somewhere in the algorithm we used the first 26 prime numbers, I could sit down at the beginning of each game and write them out pretty quickly, with no memorization needed ahead of time. Similarly, I know all of the two letter words, so it would be cheap to include a feature indicating whether a substring is a valid two letter word. The measure of complexity I want is something like Kolmogorov complexity, but modified, because there are certain sets of letters or numbers that might have high Kolmogorov complexity but which I already know. Let's call this me complexity. In the case of me, Danny complexity.
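
To make this concrete, here is a minimal sketch of a few Danny-cheap features (just an illustration; TWOS stands in for the set of legal two letter words, which I already know):

VOWELS = set("AEIOU")

def cheap_features(word, TWOS):
    # TWOS: the set of legal two letter words (assumed already memorized)
    return {
        "prefix_is_word": word[:2] in TWOS,   # e.g. the AX in AXE
        "suffix_is_word": word[1:] in TWOS,   # e.g. the YE in DYE
        "num_vowels": sum(c in VOWELS for c in word),
        "starts_with_vowel": word[0] in VOWELS,
        "ends_with_vowel": word[-1] in VOWELS,
    }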

So here's my admittedly still vague description of the problem: find the feature set and binary classifier that best balances high training set accuracy with low Danny complexity. In particular, the Danny complexity should be much smaller than that of the naive memorization strategy.

My first attempt at this for three letter words was to define a feature mapping from each letter x to integer value f(x), then to search over integer values of c_1, c_2, c_3, and thresh for a linear classifier c_1 * f(x_1) + c_2 * f(x_2) + c_3 * f(x_3) > thresh. The Danny complexity isn't bad, depending on the form of f(x), but it hasn't been very successful in the accuracy department.

Any ideas? Can you do better? I'm attaching my code, along with the list of three letter words. letter_val(letter) defines f(x), and hash_fn maps a full word into some discrete bin (0 or 1 for binary classification, but you could imagine having more bins). I tried a couple different ways of defining f(x), which you can see in the code.
# A hand-tuned 1D ordering of the letters (vowels first), used as the embedding f(x).
LETTER_EMBEDDING_1D = [
    'A', 'O', 'E', 'I', 'U',
    'Y', 'R', 'H', 'L', 'M',
    'N', 'C', 'K', 'T', 'P',
    'Q', 'W', 'S', 'B', 'D',
    'F', 'G', 'X', 'V', 'J', 'Z'
    ]

LETTERMAP = dict(zip(LETTER_EMBEDDING_1D, range(26)))

"""
LETTERMAP = {
    'A' : 2,
    'B' : 3,
    'C' : 5,
    'D' : 7,
    'E' : 11,
    'F' : 13,
    'G' : 17,
    'H' : 19,
    'I' : 23,
    'J' : 29,
    'K' : 31,
    'L' : 37,
    'M' : 41,
    'N' : 43,
    'O' : 47,
    'P' : 53,
    'Q' : 59,
    'R' : 61,
    'S' : 67,
    'T' : 71,
    'U' : 73,
    'V' : 79,
    'W' : 83,
    'X' : 89,
    'Y' : 97,
    'Z' : 101
    }
"""


def letter_val(letter):
    return LETTERMAP[letter]
    #return ord(letter) - ord('A') + 1

def hash_fn(word, c1, c2, c3, m):
    if (c1*letter_val(word[0]) + c2*letter_val(word[1]) + c3*letter_val(word[2])) > m:
        return 1
    else:
        return 0

    #return (c1*letter_val(word[0]) + c2*letter_val(word[1]) + c3*letter_val(word[2])) % m


f = open("three_letter_words.txt")
GOODWORDS = []
for line in f:
    GOODWORDS.extend(line.split())
f.close()
print GOODWORDS  # sanity check: should print all 972 legal three letter words

LETTERS = [chr(65+i) for i in range(26)]
ALLWORDS = [a + b + c for a in LETTERS for b in LETTERS for c in LETTERS]

m = 2  # number of hash bins (binary classification)

# Brute-force search over small integer coefficients and thresholds.
for thresh in range(-1, 2):
    for c1 in range(-2,3):
        for c2 in range(-2,3):
            for c3 in range(-2,3):
                
                good_hashes = []
                all_hashes = []

                for w in GOODWORDS:
                    good_hashes.append(hash_fn(w, c1, c2, c3, thresh))

                for w in ALLWORDS:
                    all_hashes.append(hash_fn(w, c1, c2, c3, thresh))

                num_good = m*[0]
                num_all = m*[0]

                for h in good_hashes:
                    num_good[h] += 1
                
                for h in all_hashes:
                    num_all[h] += 1

                # Score each bin by the fraction of all 26^3 strings landing there
                # that are actually legal words.
                scores = m*[0]

                for i in range(m):
                    if num_all[i] > 0:
                        scores[i] = float(num_good[i]) / num_all[i]

                if max(scores) > .3:
                    print thresh,"\t", c1, c2, c3,":\t",
                    print scores
                    print
Save this list as three_letter_words.txt. If you're feeling ambitious, you can get the four letter words here.
AAH AAL AAS ABA ABO ABS ABY ACE ACT ADD ADO ADS ADZ AFF AFT AGA AGE AGO AHA AID
AIL AIM AIN AIR AIS AIT ALA ALB ALE ALL ALP ALS ALT AMA AMI AMP AMU ANA AND ANE 
ANI ANT ANY APE APT ARB ARC ARE ARF ARK ARM ARS ART ASH ASK ASP ASS ATE ATT AUK 
AVA AVE AVO AWA AWE AWL AWN AXE AYE AYS AZO 
BAA BAD BAG BAH BAL BAM BAN BAP BAR BAS BAT BAY BED BEE BEG BEL BEN BET BEY BIB 
BID BIG BIN BIO BIS BIT BIZ BOA BOB BOD BOG BOO BOP BOS BOT BOW BOX BOY BRA BRO 
BRR BUB BUD BUG BUM BUN BUR BUS BUT BUY BYE BYS 
CAB CAD CAM CAN CAP CAR CAT CAW CAY CEE CEL CEP CHI CIS COB COD COG COL CON COO 
COP COR COS COT COW COX COY COZ CRY CUB CUD CUE CUM CUP CUR CUT CWM 
DAB DAD DAG DAH DAK DAL DAM DAP DAW DAY DEB DEE DEL DEN DEV DEW DEX DEY DIB DID 
DIE DIG DIM DIN DIP DIS DIT DOC DOE DOG DOL DOM DON DOR DOS DOT DOW DRY DUB DUD 
DUE DUG DUI DUN DUO DUP DYE 
EAR EAT EAU EBB ECU EDH EEL EFF EFS EFT EGG EGO EKE ELD ELF ELK ELL ELM ELS EME 
EMF EMS EMU END ENG ENS EON ERA ERE ERG ERN ERR ERS ESS ETA ETH EVE EWE EYE 
FAD FAG FAN FAR FAS FAT FAX FAY FED FEE FEH FEM FEN FER FET FEU FEW FEY FEZ FIB 
FID FIE FIG FIL FIN FIR FIT FIX FIZ FLU FLY FOB FOE FOG FOH FON FOP FOR FOU FOX 
FOY FRO FRY FUB FUD FUG FUN FUR 
GAB GAD GAE GAG GAL GAM GAN GAP GAR GAS GAT GAY GED GEE GEL GEM GEN GET GEY GHI 
GIB GID GIE GIG GIN GIP GIT GNU GOA GOB GOD GOO GOR GOT GOX GOY GUL GUM GUN GUT 
GUV GUY GYM GYP 
HAD HAE HAG HAH HAJ HAM HAO HAP HAS HAT HAW HAY HEH HEM HEN HEP HER HES HET HEW 
HEX HEY HIC HID HIE HIM HIN HIP HIS HIT HMM HOB HOD HOE HOG HON HOP HOT HOW HOY 
HUB HUE HUG HUH HUM HUN HUP HUT HYP 
ICE ICH ICK ICY IDS IFF IFS ILK ILL IMP INK INN INS ION IRE IRK ISM ITS IVY 
JAB JAG JAM JAR JAW JAY JEE JET JEU JEW JIB JIG JIN JOB JOE JOG JOT JOW JOY JUG 
JUN JUS JUT 
KAB KAE KAF KAS KAT KAY KEA KEF KEG KEN KEP KEX KEY KHI KID KIF KIN KIP KIR KIT 
KOA KOB KOI KOP KOR KOS KUE 
LAB LAC LAD LAG LAM LAP LAR LAS LAT LAV LAW LAX LAY LEA LED LEE LEG LEI LEK LET 
LEU LEV LEX LEY LEZ LIB LID LIE LIN LIP LIS LIT LOB LOG LOO LOP LOT LOW LOX LUG 
LUM LUV LUX LYE 
MAC MAD MAE MAG MAN MAP MAR MAS MAT MAW MAX MAY MED MEL MEM MEN MET MEW MHO MIB 
MID MIG MIL MIM MIR MIS MIX MOA MOB MOC MOD MOG MOL MOM MON MOO MOP MOR MOS MOT 
MOW MUD MUG MUM MUN MUS MUT 
NAB NAE NAG NAH NAM NAN NAP NAW NAY NEB NEE NET NEW NIB NIL NIM NIP NIT NIX NOB 
NOD NOG NOH NOM NOO NOR NOS NOT NOW NTH NUB NUN NUS NUT 
OAF OAK OAR OAT OBE OBI OCA ODD ODE ODS OES OFF OFT OHM OHO OHS OIL OKA OKE OLD
OLE OMS ONE ONS OOH OOT OPE OPS OPT ORA ORB ORC ORE ORS ORT OSE OUD OUR OUT OVA 
OWE OWL OWN OXO OXY 
PAC PAD PAH PAL PAM PAN PAP PAR PAS PAT PAW PAX PAY PEA PEC PED PEE PEG PEH PEN 
PEP PER PES PET PEW PHI PHT PIA PIC PIE PIG PIN PIP PIS PIT PIU PIX PLY POD POH 
POI POL POM POP POT POW POX PRO PRY PSI PUB PUD PUG PUL PUN PUP PUR PUS PUT PYA
PYE PYX 
QAT QUA 
RAD RAG RAH RAJ RAM RAN RAP RAS RAT RAW RAX RAY REB REC RED REE REF REG REI REM 
REP RES RET REV REX RHO RIA RIB RID RIF RIG RIM RIN RIP ROB ROC ROD ROE ROM ROT 
ROW RUB RUE RUG RUM RUN RUT RYA RYE 
SAB SAC SAD SAE SAG SAL SAP SAT SAU SAW SAX SAY SEA SEC SEE SEG SEI SEL SEN SER 
SET SEW SEX SHA SHE SHH SHY SIB SIC SIM SIN SIP SIR SIS SIT SIX SKA SKI SKY SLY 
SOB SOD SOL SON SOP SOS SOT SOU SOW SOX SOY SPA SPY SRI STY SUB SUE SUM SUN SUP 
SUQ SYN 
TAB TAD TAE TAG TAJ TAM TAN TAO TAP TAR TAS TAT TAU TAV TAW TAX TEA TED TEE TEG
TEL TEN TET TEW THE THO THY TIC TIE TIL TIN TIP TIS TIT TOD TOE TOG TOM TON TOO 
TOP TOR TOT TOW TOY TRY TSK TUB TUG TUI TUN TUP TUT TUX TWA TWO TYE 
UDO UGH UKE ULU UMM UMP UNS UPO UPS URB URD URN USE UTA UTS 
VAC VAN VAR VAS VAT VAU VAV VAW VEE VEG VET VEX VIA VIE VIG VIM VIS VOE VOW VOX 
VUG 
WAB WAD WAE WAG WAN WAP WAR WAS WAT WAW WAX WAY WEB WED WEE WEN WET WHA WHO WHY 
WIG WIN WIS WIT WIZ WOE WOG WOK WON WOO WOP WOS WOT WOW WRY WUD WYE WYN 
XIS 
YAH YAK YAM YAP YAR YAW YAY YEA YEH YEN YEP YES YET YEW YID YIN YIP YOB YOD YOK 
YOM YON YOU YOW YUK YUM YUP 
ZAG ZAP ZAX ZED ZEE ZEK ZIG ZIN ZIP ZIT ZOA ZOO

14 comments:

gwern said...

> As a busy machine learning researcher, though, I don't have the time or the mental capacity to memorize long lists of words anymore: for example, there are 972 legal three letter words and 3902 legal four letter words.

This is dodging the main thrust of your post, but actually, that isn't too bad.

5k words, most of which you almost surely already know? You could memorize that easily over a year or so, using something like spaced repetition:

- http://en.wikipedia.org/wiki/Spaced_repetition
- http://www.gwern.net/Mnemosyne.html

Danny Tarlow said...

Thanks for the interesting links, gwern.

You're probably right that just the three letter words, and maybe the four letter words, could be manageable. But there will always be more words to memorize. So maybe I should rephrase the question: given that I'm willing to spend X hours over the next year on memorization, what strategy should I use in order to maximize the number of words that I can learn?

It seems likely to me that doing some computation ahead of time could give a win over the strategy of independently memorizing every word I don't know.

Yaroslav said...

Here's an idea -- find small sets of words so that any legal word is close to a member of that set under some Danny-computable distance function. This can be turned into a dominating set problem. I tried Hamming distance with an off-the-shelf solver for dominating set, but it was too slow to go beyond 100 words. The first 100 three-letter words are within Hamming distance 2 of the words "ABO" and "BAA".
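
A rough greedy sketch of that covering idea (a heuristic approximation, not the exact dominating set computation, and using plain Hamming distance), with GOODWORDS as loaded in the code above:

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_cover(words, radius):
    # Greedy approximation: repeatedly pick the word that covers the most
    # still-uncovered legal words within the given Hamming radius.
    # Quadratic per pick, which is tolerable for ~1,000 words but not fast.
    uncovered = set(words)
    anchors = []
    while uncovered:
        best = max(words, key=lambda w: sum(hamming(w, u) <= radius for u in uncovered))
        anchors.append(best)
        uncovered = set(u for u in uncovered if hamming(best, u) > radius)
    return anchors

# anchors = greedy_cover(GOODWORDS, radius=1)

Note that this only covers the legal words: an illegal word can also fall within the radius of an anchor, so the anchors give a candidate filter rather than a full classifier.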

gwern said...

Well, I include time estimates in my Mnemosyne article of 30-40 seconds per item per 3 years; approximate this down to a year by assuming an even distribution of expended time (1/3), and that's 10 seconds per item per year.

(This is an underestimate since SR is by nature 'front-loaded', but the original number is an overestimated average over all cards - even hard ones, not exclusively easy ones like 3 or 5 letter English words. Hopefully they cancel out somewhat; my intuition says 10 seconds is probably too high.)

So that's 6 items per minute, 60 minutes per hour or 6*60 items, or 360 items per hour.

That's a total investment in the first year of 3 hours to memorize the 3 letter words; and 11 hours for the 4 letter words. (So roughly 2.3 minutes per day.)

I don't have any idea what the time requirements for your calculation strategy might be.

Danny Tarlow said...

I like that framing of the problem, Yaroslav. It seems like a distance function that gives smaller distance to pairs of words where a vowel is swapped for a vowel or a consonant is swapped for a consonant could be a good balance of Danny computability and accuracy. We could go further and divide the letters into clusters (e.g., vowels, type 1 consonants, ..., type k consonants), and do something similar.

And the upside is that we don't actually have to solve the dominating set problem exactly. A suboptimal solution just means a few more words for me to memorize.

Very nice. Thanks!

Yaroslav said...

Actually, the result in the previous comment is an instance of maximum independent set rather than minimum dominating set. Hamming distance may not be a good example; it seems that ignoring the special structure of Hamming distance makes the problem a lot harder. Here's what the Hamming distance-1 graph looks like for the first 65 words: http://yaroslavvb.com/upload/save/a-graph.png

One of the cliques you see there is "ACE", "AGE", "ALE", "ANE", "APE", "ARE", "ATE", "AVE", "AWE"

Yaroslav said...

Possibly relevant, here's a mention of two recent results on the problem of minimum dominating set
http://11011110.livejournal.com/209641.html

Chris said...

Interesting problem! I saw this link from reddit.

I like how you relax the constraints and are willing to build a classifier that merely gives you a high probability of a word being a legit word, since you're ultimately concerned with your real-world gain.

Likewise, you could try ranking the words by how many points their letters sum up to, in hopes of producing a smaller candidate list of words to memorize rules for or to train a classifier. Additionally, you could add an element that factors in the likelihood of actually being able to use a word -- for example, if ZQZ were a high point-value word, but realistically you would rarely get to play it because those letters are rarely already laid out on the board, then we should rank ZQZ lower on our candidate list.
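
A quick sketch of the point-ranking part, using the standard tile values (the playability weighting mentioned above is not modeled), with GOODWORDS as loaded in the code earlier:

TILE_VALUES = {
    'A': 1, 'B': 3, 'C': 3, 'D': 2, 'E': 1, 'F': 4, 'G': 2, 'H': 4,
    'I': 1, 'J': 8, 'K': 5, 'L': 1, 'M': 3, 'N': 1, 'O': 1, 'P': 3,
    'Q': 10, 'R': 1, 'S': 1, 'T': 1, 'U': 1, 'V': 4, 'W': 4, 'X': 8,
    'Y': 4, 'Z': 10,
    }

def word_points(word):
    return sum(TILE_VALUES[c] for c in word)

# ranked = sorted(GOODWORDS, key=word_points, reverse=True)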

With this potential candidate list, if there seem to be noticeable patterns, such as a large percentage of the words (1) containing an X, (2) containing a Y, or (3) having a vowel in the middle position, then maybe you could handcraft a few boolean attributes like this and build a decision tree from them. That way, given some decision tree, you can memorize a few of the top-level edges and know what to look for... ah, nvm, I guess you already kind of know what to look for; it's just the quick validity-checking that you need. Nonetheless, maybe reducing your original set to a smaller set might be useful?

Danny Tarlow said...

Hi Chris,

Thanks for commenting!

I like your idea of more accurately taking into account how valuable it will be to memorize a particular word. I actually think this is one of the more interesting aspects of Scrabble. In some sense, though, this is the point of the beginner word lists: they're roughly the most valuable to start memorizing, but you're right that the value calculation could be done at a finer-grained level.

Unfortunately--and this requires some Scrabble experience--it's not as simple as just counting up the score of a word. First, you would also want to weigh the words by frequency of the letters. The word PIZAZZ, for example, would be worth a lot of points by a naive measure of counting letter values, but it will only be possible to play if you have the Z and both blanks, because there is only one Z in the game.

Second, and more importantly, short words are usually used as hooks rather than as the primary point scorer. What this means is that often I'll have a high-valued 7 letter word on my rack, but I'll need a place to play it. The three letter words are useful to know because you can often find a two letter word to extend in one direction while you play the long word in the other direction. So most of the points will end up coming from the long word, but you wouldn't have had anywhere to play it if you didn't know that you could, say, extend AE by putting a W in front of it.

Scott Turner said...

Is there an online archive of expert-level Scrabble games available? You could datamine that to determine the most valuable words more easily than trying to discern heuristics.

Steve said...

Here's a possible approach for 3 letter words:
1) Find the most common two-letter clusters, dividing them into those that are and aren't real words. Call these four groups ATx, xAT, AZx, xAZ (AT is a word, AZ isn't...)
2) For each pair, find groups that have a similar pool of letters that could fill the blank.
3) Focus on the intersection of those pools. Say TO_, _ET and _GO each have more than 10 possible blank fillers (making this up), but there's a core set of 7 letters that work in all three.
4) Form mnemonics with those groups of "like" pairs, and with the pool of letters that fill the gaps.
5) ???
6) Profit
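
A rough sketch of steps 1-3, taking TO_, _ET, and _GO as the (made-up) example frames and GOODWORDS as loaded in the code above:

GOOD = set(GOODWORDS)
LETTERS = [chr(65 + i) for i in range(26)]

def fill_pool(frame):
    # Letters that can replace '_' in a frame like 'TO_' or '_ET' to make a legal word.
    return set(L for L in LETTERS if frame.replace('_', L) in GOOD)

pools = [fill_pool(f) for f in ('TO_', '_ET', '_GO')]
core = pools[0] & pools[1] & pools[2]   # letters that work in all three frames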

Keith said...

A character bigram model might offer reasonable accuracy for guessing whether a word is real or not. Just be sure to include an extra symbol for the beginning and ending of the word (because beginnings/endings have other restrictions). If you want to apply it as a human, you'd have to (at least implicitly) remember 27x27 probabilities. Realistically, you can chunk many of them together or instead reject a postulated word if you find any low-probability transition. (For example, reject words with "tz" in them)
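
A minimal sketch of that bigram idea, using '^' and '$' as the start and end symbols and GOODWORDS (loaded in the code above) as the training list:

from collections import defaultdict

counts = defaultdict(int)
for w in GOODWORDS:
    padded = '^' + w + '$'          # '^' and '$' mark word start and end
    for a, b in zip(padded, padded[1:]):
        counts[(a, b)] += 1

def looks_legal(word, min_count=1):
    # Reject a word if any of its transitions was never (or rarely) seen in the list.
    padded = '^' + word + '$'
    return all(counts[(a, b)] >= min_count for a, b in zip(padded, padded[1:]))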

You could consider alternate encoding methods for the word also. I'm sure we all already encode vocabulary as base + suffix, leading to the prevalence of students learning Latin for SATs/etc. You might also consider enumerating the list of valid English syllables and representing words in that way. Or spellings of valid English phonemes. Given such an encoding, you might be able to make a more compact ngram model. Another more ad hoc way of looking at it is that many players already chunk letters together in their memory (such as "qu").

If you really want to use ML, you might consider transformation-based learning for the problem. Based on the way it learns, it's already set up for you to find a comfortable balance between accuracy and Danny complexity.

Danny Tarlow said...

Thanks for the comments, Keith. What's the best reference to read for an introduction to transformation-based learning?

George said...

A decision tree might have a very low Danny complexity if you restrict the set of questions an internal node is allowed to ask. You should train a decision tree with a few low Danny complexity questions and see how accurate it is and how large a tree gets produced.
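
A rough sketch of that suggestion using scikit-learn (not part of the original post); the boolean features are just illustrative guesses at low Danny complexity questions, and ALLWORDS and GOODWORDS come from the code above:

from sklearn.tree import DecisionTreeClassifier

VOWELS = set("AEIOU")
GOOD = set(GOODWORDS)

def features(word):
    # Illustrative guesses at questions that are cheap to answer at the board.
    return [
        word[1] in VOWELS,               # vowel in the middle?
        word[0] in VOWELS,               # starts with a vowel?
        word[-1] in "SDY",               # ends with a common last letter?
        any(c in "JQXZ" for c in word),  # contains a rare letter?
    ]

X = [features(w) for w in ALLWORDS]
y = [int(w in GOOD) for w in ALLWORDS]

clf = DecisionTreeClassifier(max_depth=4).fit(X, y)   # small depth keeps the tree memorizable
print(clf.score(X, y))   # training accuracy is the only accuracy that matters here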