T shirts, feminism, parenting, and data science

Post on 28-Oct-2014

A talk I gave at the Seattle DAML Meetup, 10/24/2013

T-Shirts, Feminism, Parenting, and Data Science

Joel Grus

Chief Scientist, VoloMetrix

@joelgrus

About Me

• Chief Scientist at VoloMetrix
• Have a 2-year-old daughter
• Did not take me long to discover that “boys” clothing is fun, “girls” clothing kind of sucks

Typical “Toddler Girls” Shirt

Typical “Toddler Boys” Shirt

Obvious to us, but can a computer figure it out?

The Data

• Downloaded image of every “toddler boys” and “toddler girls” t-shirt from
  • Carters
  • Children’s Place
  • Crazy 8
  • Gap Kids
  • Gymboree
  • Old Navy
  • Target

• 616 images of boys shirts and 446 images of girls shirts
• The goal: to build a model that predicts “boy shirt” or “girl shirt” just based on the images!

Attempt #1: Colors

• Each image is a collection of RGB pixels
• There are 256 * 256 * 256 ≈ 17 million possible colors (too many)
• Bucket each of R, G, B into [0,85), [85,170), or [170,256)
• This gives 3 * 3 * 3 = 27 possible colors
• Use features “does image contain at least one pixel of color j?”
• Train logistic regression model on 80% of shirts, test on other 20%
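A minimal sketch of the color features described above, not the code from the talk. It assumes an image is already available as a list of (R, G, B) tuples with values from 0 to 255. The names bucket, color_index, and color_features are hypothetical, and the top bucket is taken to include 255.

```python
BUCKET_WIDTH = 85  # buckets cover 0-84, 85-169, and 170-255


def bucket(value):
    """Map one channel value (0..255) to bucket 0, 1, or 2."""
    # 255 // 85 is 3, so clamp it into the top bucket.
    return min(value // BUCKET_WIDTH, 2)


def color_index(pixel):
    """Map an (R, G, B) pixel to one of 3 * 3 * 3 = 27 coarse colors."""
    r, g, b = pixel
    return 9 * bucket(r) + 3 * bucket(g) + bucket(b)


def color_features(pixels):
    """Return 27 binary features: is there at least one pixel of color j?"""
    present = {color_index(p) for p in pixels}
    return [1 if j in present else 0 for j in range(27)]


# Toy "image" with one black pixel and one white pixel (made-up data).
features = color_features([(0, 0, 0), (255, 255, 255)])
print(features)
```

Each shirt then becomes one 27-element vector, which is the kind of input a logistic regression model would be trained on.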

Color Model Performance

P(girl shirt | “girl shirt”) = 75%
P(boy shirt | “boy shirt”) = 77%
P(“girl shirt” | girl shirt) = 63%
P(“boy shirt” | boy shirt) = 86%
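The four numbers above can be read as precision and recall, taking the quoted label as the model's prediction and the unquoted label as the truth. A minimal sketch of that calculation follows. The function name and the four-shirt example are made up for illustration and are not the talk's data.

```python
def precision_recall(actual, predicted, label):
    """Return (precision, recall) for one label.

    precision = P(actually label | predicted label)
    recall    = P(predicted label | actually label)
    """
    true_positives = sum(
        1 for a, p in zip(actual, predicted) if a == label and p == label
    )
    predicted_count = sum(1 for p in predicted if p == label)
    actual_count = sum(1 for a in actual if a == label)
    precision = true_positives / predicted_count if predicted_count else 0.0
    recall = true_positives / actual_count if actual_count else 0.0
    return precision, recall


# Made-up labels for four shirts, purely to exercise the function.
actual = ["girl", "girl", "boy", "boy"]
predicted = ["girl", "boy", "boy", "boy"]
print(precision_recall(actual, predicted, "girl"))
print(precision_recall(actual, predicted, "boy"))
```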

[Figure: # of shirts (boys, girls) by “confidence score” (> 0 “boy shirt”, < 0 “girl shirt”), on a scale from “girlier” to “boyier”]

[Figure: scale from “girlier” to “boyier”, annotated “less colorful” and “more colorful”]

Attempt #2: Eigenshirts

• To compare images, rescale all of them to 138 x 138
  • Chose this size because many were 138 x 138 already
  • Others mostly bigger
• Using R, G, B as coordinates for each pixel, think of each image as a point in 138 * 138 * 3 = 57,132-dimensional space
• Obviously, with 57k features and only 1,000 shirts, this will overfit
• Use dimensionality reduction to find the 10 most “interesting” dimensions, project shirts into 10-d subspace, build model there
• Each subspace dimension determines a (Platonic ideal) “eigenshirt”
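The slides do not say which dimensionality reduction was used. The sketch below assumes principal component analysis, which the name “eigenshirt” suggests, and finds the components by power iteration with deflation in plain Python instead of a library SVD. All function names are hypothetical, and the three-feature rows are made-up stand-ins for flattened 57,132-value images.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else v


def center(rows):
    """Subtract the per-feature mean (the "average shirt") from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return mean, [[r[j] - mean[j] for j in range(d)] for r in rows]


def top_component(centered, iterations=200, seed=0):
    """Power iteration for the direction of greatest variance."""
    rng = random.Random(seed)
    d = len(centered[0])
    w = normalize([rng.gauss(0, 1) for _ in range(d)])
    for _ in range(iterations):
        # Multiply by X^T X without ever forming the d x d matrix.
        scores = [dot(row, w) for row in centered]
        w = normalize(
            [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
        )
    return w


def remove_component(centered, w):
    """Deflation: strip each row's part along w before the next search."""
    return [[x - dot(row, w) * wj for x, wj in zip(row, w)] for row in centered]


def pca(rows, k):
    """Return the mean and the top k components (the "eigenshirts")."""
    mean, residual = center(rows)
    components = []
    for _ in range(k):
        w = top_component(residual)
        components.append(w)
        residual = remove_component(residual, w)
    return mean, components


def project(row, mean, components):
    """Coordinates of one image in the k-dimensional subspace."""
    diff = [x - m for x, m in zip(row, mean)]
    return [dot(diff, w) for w in components]


# Made-up data: four "images" with three features each.
rows = [[1, 1, 0], [2, 2, 0.1], [3, 3, -0.1], [4, 4, 0]]
mean, components = pca(rows, k=2)
print(project(rows[0], mean, components))
```

With k = 10, each shirt would be reduced to 10 numbers, and the classifier is then built on those instead of the raw pixels. Pure Python is far too slow for roughly 1,000 images of 57,132 values each, so a numerical library would be the practical choice at that scale.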

What does projection look like?

Almost all miscategorized shirts have weak predictions (overall 93% accuracy)

[Figure: scale from “girlier” to “boyier”]

Future Directions

• Look at text on shirt (but too lazy to transcribe it)
• Try to make images same size / background color
• Build model to predict how “fun” a shirt is (but will require tedious hand-labeling)
• ??

More info

• Code (but not data) is on https://github.com/joelgrus/shirts
• Two blog posts on joelgrus.com, both linked from the GitHub README (or Google them, they have the same title as this talk)
• Follow me on Twitter: @joelgrus