T shirts, feminism, parenting, and data science

T-Shirts, Feminism, Parenting, and Data Science Joel Grus Chief Scientist, VoloMetrix @joelgrus


A talk I gave at the Seattle DAML Meetup, 10/24/2013

Transcript of T shirts, feminism, parenting, and data science

Page 1: T shirts, feminism, parenting, and data science

T-Shirts, Feminism, Parenting, and Data Science

Joel Grus
Chief Scientist, VoloMetrix

@joelgrus

Page 2: T shirts, feminism, parenting, and data science

About Me

• Chief Scientist at VoloMetrix
• Have a 2-year-old daughter
• Did not take me long to discover that “boys” clothing is fun, “girls” clothing kind of sucks

Page 3: T shirts, feminism, parenting, and data science

Typical “Toddler Girls” Shirt

Typical “Toddler Boys” Shirt

Page 4: T shirts, feminism, parenting, and data science

Obvious to us, but can a computer figure it out?

Page 5: T shirts, feminism, parenting, and data science

The Data

• Downloaded image of every “toddler boys” and “toddler girls” t-shirt from:
  • Carters
  • Children’s Place
  • Crazy 8
  • Gap Kids
  • Gymboree
  • Old Navy
  • Target
• 616 images of boys shirts and 446 images of girls shirts
• The goal: to build a model that predicts “boy shirt” or “girl shirt” just based on the images!

Page 6: T shirts, feminism, parenting, and data science

Attempt #1: Colors

• Each image is a collection of RGB pixels
• There are 256 * 256 * 256 ~ 17 million possible colors (too many)
• Bucket each of R, G, B into [0,85), [85,170), or [170,256)
• This gives 3 * 3 * 3 = 27 possible colors
• Use features “does image contain at least one pixel of color j?”
• Train logistic regression model on 80% of shirts, test on other 20%
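
The feature step above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk's repository: the function names `bucket` and `color_features` and the four-pixel "image" are made up for the example, and the logistic regression step is left out.

```python
from itertools import product

BUCKET_WIDTH = 85  # 0-84, 85-169, 170-255 (255 is folded into the top bucket)

def bucket(channel):
    """Map a 0-255 channel value to bucket 0, 1, or 2."""
    return min(channel // BUCKET_WIDTH, 2)

# All 3 * 3 * 3 = 27 bucketed colors, in a fixed order.
ALL_COLORS = list(product(range(3), repeat=3))

def color_features(pixels):
    """Return 27 binary features: 1 if the image has at least one pixel of that bucketed color."""
    present = {(bucket(r), bucket(g), bucket(b)) for r, g, b in pixels}
    return [1 if color in present else 0 for color in ALL_COLORS]

# Hypothetical 4-pixel "image": white, off-white, pure red, dark gray.
pixels = [(255, 255, 255), (250, 240, 245), (255, 0, 0), (40, 40, 40)]
features = color_features(pixels)

print(256 ** 3)                      # 16777216 raw colors
print(len(features), sum(features))  # prints: 27 3
```

White and off-white land in the same bucket, so the four pixels produce only three distinct colors. Each shirt then becomes a 27-element 0/1 vector that a classifier can be trained on.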

Page 7: T shirts, feminism, parenting, and data science

Color Model Performance

P(girl shirt | “girl shirt”) = 75%
P(boy shirt | “boy shirt”) = 77%
P(“girl shirt” | girl shirt) = 63%
P(“boy shirt” | boy shirt) = 86%

Histogram: # of shirts (boys, girls) by “Confidence Score” ( > 0 “boy shirt”, < 0 “girl shirt”)
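
Reading the quoted labels on this slide as the model's prediction and the unquoted ones as the actual category (an interpretation, not something the slide states), the first two numbers are precision and the last two are recall. The sketch below shows how all four are computed from test-set results. The counts are made up for illustration and are not the talk's data, and the helper name `rates` is hypothetical.

```python
def rates(pairs, label):
    """Return (precision, recall) for `label`, given (actual, predicted) pairs."""
    true_pos = sum(1 for actual, predicted in pairs
                   if actual == label and predicted == label)
    predicted_pos = sum(1 for _, predicted in pairs if predicted == label)
    actual_pos = sum(1 for actual, _ in pairs if actual == label)
    return true_pos / predicted_pos, true_pos / actual_pos

# Made-up test results, NOT the talk's data: (actual, predicted).
pairs = (
    [("girl", "girl")] * 6 + [("girl", "boy")] * 4 +
    [("boy", "boy")] * 12 + [("boy", "girl")] * 2
)

girl_precision, girl_recall = rates(pairs, "girl")  # P(girl | "girl"), P("girl" | girl)
boy_precision, boy_recall = rates(pairs, "boy")     # P(boy | "boy"),  P("boy" | boy)

print(f"P(girl shirt | 'girl shirt') = {girl_precision:.0%}")
print(f"P(boy shirt | 'boy shirt') = {boy_precision:.0%}")
print(f"P('girl shirt' | girl shirt) = {girl_recall:.0%}")
print(f"P('boy shirt' | boy shirt) = {boy_recall:.0%}")
```

Precision asks "when the model says girl shirt, how often is it right?", while recall asks "of the actual girl shirts, how many did the model catch?".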

Page 8: T shirts, feminism, parenting, and data science

“girlier”

“boyier”

Page 9: T shirts, feminism, parenting, and data science

“girlier”

“boyier”

less colorful

more colorful

Page 10: T shirts, feminism, parenting, and data science

Attempt #2: Eigenshirts

• To compare images, rescale all of them to 138 x 138
  • Chose this size because many were 138 x 138 already
  • Others mostly bigger
• Using R, G, B as coordinates for each pixel, think of each image as a point in 138 * 138 * 3 = 57,132-dimensional space
• Obviously, with 57k features and only 1,000 shirts, this will overfit
• Use dimensionality reduction to find the 10 most “interesting” dimensions, project shirts into 10-d subspace, build model there
• Each subspace dimension determines a (Platonic ideal) “eigenshirt”
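
The slide does not name the dimensionality reduction method. The sketch below assumes principal component analysis, which the name “eigenshirt” suggests by analogy with eigenfaces. It finds only the first principal component, using power iteration on a tiny made-up data set of three "pixels" per image. All function names are hypothetical, and a real run would use a linear algebra library and keep ten components.

```python
import math
import random

def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def top_component(rows, iterations=200, seed=0):
    """Power iteration for the top eigenvector of X^T X (rows must be centered)."""
    rng = random.Random(seed)
    w = normalize([rng.gauss(0, 1) for _ in rows[0]])
    for _ in range(iterations):
        # Compute X^T (X w) without forming the dims x dims matrix.
        scores = [dot(row, w) for row in rows]
        w = normalize([sum(s * row[j] for s, row in zip(scores, rows))
                       for j in range(len(w))])
    return w

print(138 * 138 * 3)  # 57132 dimensions per full-size image

# Made-up "images" with 3 features each; they vary only along (1, 1, 0).
images = [[1, 1, 5], [2, 2, 5], [3, 3, 5], [4, 4, 5]]
centered = center(images)
eigenshirt = top_component(centered)                       # direction of greatest variance
projections = [dot(row, eigenshirt) for row in centered]   # 1-d coordinates
print(eigenshirt, projections)
```

Up to sign, the recovered direction is roughly (0.707, 0.707, 0), and each image is reduced to a single coordinate along it. Keeping ten components would mean repeating this after removing each found direction from the data, or using an SVD routine instead.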

Page 11: T shirts, feminism, parenting, and data science
Page 12: T shirts, feminism, parenting, and data science

What does projection look like?

Page 13: T shirts, feminism, parenting, and data science
Page 14: T shirts, feminism, parenting, and data science

Almost all miscategorized shirts have weak predictions (overall 93% accuracy)

Page 15: T shirts, feminism, parenting, and data science

“girlier”

“boyier”

Page 16: T shirts, feminism, parenting, and data science

Future Directions

• Look at text on shirt (but too lazy to transcribe it)
• Try to make images same size / background color
• Build model to predict how “fun” a shirt is (but will require tedious hand-labeling)
• ??

Page 17: T shirts, feminism, parenting, and data science

More info

• Code (but not data) is on https://github.com/joelgrus/shirts
• Two blog posts on joelgrus.com, both linked from the github README (or Google them, they have the same title as this talk)
• Follow me on twitter: @joelgrus