Tuesday, December 21, 2010

Machine Learning for Human Memorization

Posted by Danny Tarlow
Here's a machine learning problem I find interesting but haven't been able to satisfactorily solve.

Background: As some of you know, I used to play Scrabble somewhat seriously. Most Tuesdays in middle school, I would go to the local scrabble club meetings and play 4 games against the best Scrabble players in the area (actually, it was usually 3 games, because the 4th game started past my bedtime). It's not your family game of Scrabble: to begin to be competitive, you need to know all of the two letter words, most of the threes, and you need to have some familiarity with a few of the other high-priority lists (e.g., vowel dumps; short q, z, j, and x words; at least a few of the bingo stems). See here for a good starting point.

Anyway, I recently went to the Toronto Scrabble Club meeting and had a great time. I think I'll start going with more regularity. As a busy machine learning researcher, though, I don't have the time or the mental capacity to memorize long lists of words anymore: for example, there are 972 legal three letter words and 3902 legal four letter words.

So I'm looking for an alternative to memorization. Typically during play, there will be a board position that could yield a high-scoring word, but it requires that XXX or XXXX be a word. It would be very helpful if I could spend a minute or so of pen and paper computation time, then arrive at an answer like, "this is a word with 90% probability". So what I really need is just a binary classifier that maps a word to probability of label "legal".

Problem description: In machine learning terms, it's a somewhat unique problem (from what I can tell). We're not trying to build a classifier that generalizes well, because the set of 3 (or 4) letter words is fixed: we have all inputs, and they're all labeled. At first glance, you might think this is an easy problem, because we can just choose a model with high model capacity, overfit the training data, and be done. There's no need for regularization if we don't care about overfitting, right? Well, not exactly. By this logic, we should just use a nearest neighbors classifier; but in order for me to run a nearest neighbors algorithm in my head, I'd need to memorize the entire training set!

There is some related work. Regularizers like the Bayesian information criterion (BIC) penalize a model for having many parameters. This is somewhat close, but not all sets of parameters are equally difficult for me to memorize. For example, if somewhere in the algorithm we used the first 26 prime numbers, then I could sit down at the beginning of each game and write out the numbers pretty quickly, without any memorization ahead of time needed. Similarly, I know all of the two letter words, so it would be cheap to include a feature about whether a substring is a valid two letter word. The measure of complexity is something like Kolmogorov complexity, but it's modified because there are certain sets of letters or numbers that might have high Kolmogorov complexity but which I already know. Let's call this me complexity. In the case of me, Danny complexity.

So here's my admittedly still vague description of the problem: find the feature set and binary classifier that best balances high training set accuracy with low Danny complexity. In particular, the Danny complexity should be much smaller than that of the naive memorization strategy.

My first attempt at this for three letter words was to define a feature mapping from each letter x to integer value f(x), then to search over integer values of c_1, c_2, c_3, and thresh for a linear classifier c_1 * f(x_1) + c_2 * f(x_2) + c_3 * f(x_3) > thresh. The Danny complexity isn't bad, depending on the form of f(x), but it hasn't been very successful in the accuracy department.

Any ideas? Can you do better? I'm attaching my code, along with the list of three letter words. letter_val(letter) defines f(x), and hash_fn maps a full word into some discrete bin (0 or 1 for binary classification, but you could imagine having more bins). I tried a couple different ways of defining f(x), which you can see in the code.
LETTER_EMBEDDING_1D = [
    'A', 'O', 'E', 'I', 'U',
    'Y', 'R', 'H', 'L', 'M',
    'N', 'C', 'K', 'T', 'P',
    'Q', 'W', 'S', 'B', 'D',
    'F', 'G', 'X', 'V', 'J', 'Z'
    ]

LETTERMAP = dict(zip(LETTER_EMBEDDING_1D, range(26)))

"""
LETTERMAP = {
    'A' : 2,
    'B' : 3,
    'C' : 5,
    'D' : 7,
    'E' : 11,
    'F' : 13,
    'G' : 17,
    'H' : 19,
    'I' : 23,
    'J' : 29,
    'K' : 31,
    'L' : 37,
    'M' : 41,
    'N' : 43,
    'O' : 47,
    'P' : 53,
    'Q' : 59,
    'R' : 61,
    'S' : 67,
    'T' : 71,
    'U' : 73,
    'V' : 79,
    'W' : 83,
    'X' : 89,
    'Y' : 97,
    'Z' : 101
    }
"""


def letter_val(letter):
    return LETTERMAP[letter]
    #return ord(letter) - ord('A') + 1

def hash_fn(word, c1, c2, c3, m):
    if (c1*letter_val(word[0]) + c2*letter_val(word[1]) + c3*letter_val(word[2])) > m:
        return 1
    else:
        return 0

    #return (c1*letter_val(word[0]) + c2*letter_val(word[1]) + c3*letter_val(word[2])) % m


f = open("three_letter_words.txt")
GOODWORDS = []
for line in f:
    GOODWORDS.extend(line.split())
print GOODWORDS

LETTERS = [chr(65+i) for i in range(26)]
ALLWORDS = [a + b + c for a in LETTERS for b in LETTERS for c in LETTERS]

m = 2
for thresh in range(-1, 2):
    for c1 in range(-2,3):
        for c2 in range(-2,3):
            for c3 in range(-2,3):
                
                good_hashes = []
                all_hashes = []

                for w in GOODWORDS:
                    good_hashes.append(hash_fn(w, c1, c2, c3, thresh))

                for w in ALLWORDS:
                    all_hashes.append(hash_fn(w, c1, c2, c3, thresh))

                num_good = m*[0]
                num_all = m*[0]

                for h in good_hashes:
                    num_good[h] += 1
                
                for h in all_hashes:
                    num_all[h] += 1

                scores = m*[0]

                for i in range(m):
                    if num_all[i] > 0:
                        scores[i] = float(num_good[i]) / num_all[i]

                if max(scores) > .3:
                    print thresh,"\t", c1, c2, c3,":\t",
                    print scores
                    print
Save this list as three_letter_words.txt. If you're feeling ambitious, you can get the four letter words here.
AAH AAL AAS ABA ABO ABS ABY ACE ACT ADD ADO ADS ADZ AFF AFT AGA AGE AGO AHA AID
AIL AIM AIN AIR AIS AIT ALA ALB ALE ALL ALP ALS ALT AMA AMI AMP AMU ANA AND ANE 
ANI ANT ANY APE APT ARB ARC ARE ARF ARK ARM ARS ART ASH ASK ASP ASS ATE ATT AUK 
AVA AVE AVO AWA AWE AWL AWN AXE AYE AYS AZO 
BAA BAD BAG BAH BAL BAM BAN BAP BAR BAS BAT BAY BED BEE BEG BEL BEN BET BEY BIB 
BID BIG BIN BIO BIS BIT BIZ BOA BOB BOD BOG BOO BOP BOS BOT BOW BOX BOY BRA BRO 
BRR BUB BUD BUG BUM BUN BUR BUS BUT BUY BYE BYS 
CAB CAD CAM CAN CAP CAR CAT CAW CAY CEE CEL CEP CHI CIS COB COD COG COL CON COO 
COP COR COS COT COW COX COY COZ CRY CUB CUD CUE CUM CUP CUR CUT CWM 
DAB DAD DAG DAH DAK DAL DAM DAP DAW DAY DEB DEE DEL DEN DEV DEW DEX DEY DIB DID 
DIE DIG DIM DIN DIP DIS DIT DOC DOE DOG DOL DOM DON DOR DOS DOT DOW DRY DUB DUD 
DUE DUG DUI DUN DUO DUP DYE 
EAR EAT EAU EBB ECU EDH EEL EFF EFS EFT EGG EGO EKE ELD ELF ELK ELL ELM ELS EME 
EMF EMS EMU END ENG ENS EON ERA ERE ERG ERN ERR ERS ESS ETA ETH EVE EWE EYE 
FAD FAG FAN FAR FAS FAT FAX FAY FED FEE FEH FEM FEN FER FET FEU FEW FEY FEZ FIB 
FID FIE FIG FIL FIN FIR FIT FIX FIZ FLU FLY FOB FOE FOG FOH FON FOP FOR FOU FOX 
FOY FRO FRY FUB FUD FUG FUN FUR 
GAB GAD GAE GAG GAL GAM GAN GAP GAR GAS GAT GAY GED GEE GEL GEM GEN GET GEY GHI 
GIB GID GIE GIG GIN GIP GIT GNU GOA GOB GOD GOO GOR GOT GOX GOY GUL GUM GUN GUT 
GUV GUY GYM GYP 
HAD HAE HAG HAH HAJ HAM HAO HAP HAS HAT HAW HAY HEH HEM HEN HEP HER HES HET HEW 
HEX HEY HIC HID HIE HIM HIN HIP HIS HIT HMM HOB HOD HOE HOG HON HOP HOT HOW HOY 
HUB HUE HUG HUH HUM HUN HUP HUT HYP 
ICE ICH ICK ICY IDS IFF IFS ILK ILL IMP INK INN INS ION IRE IRK ISM ITS IVY 
JAB JAG JAM JAR JAW JAY JEE JET JEU JEW JIB JIG JIN JOB JOE JOG JOT JOW JOY JUG 
JUN JUS JUT 
KAB KAE KAF KAS KAT KAY KEA KEF KEG KEN KEP KEX KEY KHI KID KIF KIN KIP KIR KIT 
KOA KOB KOI KOP KOR KOS KUE 
LAB LAC LAD LAG LAM LAP LAR LAS LAT LAV LAW LAX LAY LEA LED LEE LEG LEI LEK LET 
LEU LEV LEX LEY LEZ LIB LID LIE LIN LIP LIS LIT LOB LOG LOO LOP LOT LOW LOX LUG 
LUM LUV LUX LYE 
MAC MAD MAE MAG MAN MAP MAR MAS MAT MAW MAX MAY MED MEL MEM MEN MET MEW MHO MIB 
MID MIG MIL MIM MIR MIS MIX MOA MOB MOC MOD MOG MOL MOM MON MOO MOP MOR MOS MOT 
MOW MUD MUG MUM MUN MUS MUT 
NAB NAE NAG NAH NAM NAN NAP NAW NAY NEB NEE NET NEW NIB NIL NIM NIP NIT NIX NOB 
NOD NOG NOH NOM NOO NOR NOS NOT NOW NTH NUB NUN NUS NUT 
OAF OAK OAR OAT OBE OBI OCA ODD ODE ODS OES OFF OFT OHM OHO OHS OIL OKA OKE OLD
OLE OMS ONE ONS OOH OOT OPE OPS OPT ORA ORB ORC ORE ORS ORT OSE OUD OUR OUT OVA 
OWE OWL OWN OXO OXY 
PAC PAD PAH PAL PAM PAN PAP PAR PAS PAT PAW PAX PAY PEA PEC PED PEE PEG PEH PEN 
PEP PER PES PET PEW PHI PHT PIA PIC PIE PIG PIN PIP PIS PIT PIU PIX PLY POD POH 
POI POL POM POP POT POW POX PRO PRY PSI PUB PUD PUG PUL PUN PUP PUR PUS PUT PYA
PYE PYX 
QAT QUA 
RAD RAG RAH RAJ RAM RAN RAP RAS RAT RAW RAX RAY REB REC RED REE REF REG REI REM 
REP RES RET REV REX RHO RIA RIB RID RIF RIG RIM RIN RIP ROB ROC ROD ROE ROM ROT 
ROW RUB RUE RUG RUM RUN RUT RYA RYE 
SAB SAC SAD SAE SAG SAL SAP SAT SAU SAW SAX SAY SEA SEC SEE SEG SEI SEL SEN SER 
SET SEW SEX SHA SHE SHH SHY SIB SIC SIM SIN SIP SIR SIS SIT SIX SKA SKI SKY SLY 
SOB SOD SOL SON SOP SOS SOT SOU SOW SOX SOY SPA SPY SRI STY SUB SUE SUM SUN SUP 
SUQ SYN 
TAB TAD TAE TAG TAJ TAM TAN TAO TAP TAR TAS TAT TAU TAV TAW TAX TEA TED TEE TEG
TEL TEN TET TEW THE THO THY TIC TIE TIL TIN TIP TIS TIT TOD TOE TOG TOM TON TOO 
TOP TOR TOT TOW TOY TRY TSK TUB TUG TUI TUN TUP TUT TUX TWA TWO TYE 
UDO UGH UKE ULU UMM UMP UNS UPO UPS URB URD URN USE UTA UTS 
VAC VAN VAR VAS VAT VAU VAV VAW VEE VEG VET VEX VIA VIE VIG VIM VIS VOE VOW VOX 
VUG 
WAB WAD WAE WAG WAN WAP WAR WAS WAT WAW WAX WAY WEB WED WEE WEN WET WHA WHO WHY 
WIG WIN WIS WIT WIZ WOE WOG WOK WON WOO WOP WOS WOT WOW WRY WUD WYE WYN 
XIS 
YAH YAK YAM YAP YAR YAW YAY YEA YEH YEN YEP YES YET YEW YID YIN YIP YOB YOD YOK 
YOM YON YOU YOW YUK YUM YUP 
ZAG ZAP ZAX ZED ZEE ZEK ZIG ZIN ZIP ZIT ZOA ZOO