(7) Data, Privacy and Metrics


(7) Data and Metrics

The Networked Economy: Information Management, Strategy, and Innovation
(7) Data, Privacy and Metrics

Andreas S. Weigend, Ph.D.
people & data | www.weigend.com | +1 650 906-5906 | +49 174 906-5906 | +86 138 1818 3800

Agenda
- Role of data in decision making
- Size and cost of storage

River Nile

Notre Dame

From Faith to Data

The Era of Faith
- Massive investments into cathedrals etc.
- Unclear ROI (Return on Investment)
- No feedback, or l_o_n_g feedback cycle

The Era of Data
- Massive investments into measuring, networking, communication, storage
- ROI measurable
- Short feedback cycle
- Experiments

Characteristics of our Era

What do we do with data?
- Gather data
- Explore data
- Exploit data
- Publish data
- Archive data

DATA
- Too much of it
- What does it mean?
- Will it integrate with my systems?
- How can I act on it?
Opportunities and challenges for marketers, publishers, agencies

Measuring information and storage

Name       Symbol  Bytes  Equals           Example                                  Length comparison
Byte*      B       10^0                    One character                            1 nm (atom: 0.1 nm)
Kilobyte   kB      10^3   1,000 bytes      A short email message                    1 µm (hair: 50 µm thick)
Megabyte   MB      10^6   1 million bytes  Text of a book; high-resolution image    1 mm (floppy disk: 1 mm thick)
Gigabyte   GB      10^9   1,000 MB         A CD                                     1 m
Terabyte   TB      10^12  1,000 GB         Storage on laptops in this room          1 km (Mt Everest/Qomolangma: 8.8 km high)
Petabyte   PB      10^15  1,000 TB         Size of web                              1,000 km
Exabyte    EB      10^18  1 million TB                                              10^6 km (to moon: 0.4 x 10^6 km)
Zettabyte  ZB      10^21                                                            10^9 km (to sun: 0.2 x 10^9 km)

* Relationship of Byte (B) and bit (b): 1 Byte = 8 bits

Web

Surface web = static pages
- 10 billion pages
- 10 kB to 100 kB / page
- 100 TB to 1 PB total storage

Deep web
- 10x size of surface web

Internet email
- 3 billion email accounts
- 10 emails / day / account
- 30 billion emails / day
- 1 kB / email
- 30 TB traffic per day
- 100 petabyte / year

Storage cost (2008, ASW)
- 1 petabyte = USD 100k

Usenet
- 73 terabytes of Usenet per year
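The estimates above are simple products, so they can be rechecked in a few lines. The sketch below is only an illustration: the helper names are made up for this example, units are decimal as in the table, and the 10 US cents per GB price is the 2008 figure quoted later in the deck. Note that 30 TB per day times 365 days gives roughly 11 petabyte per year, so the slide's yearly figure of 100 petabyte presumably rests on assumptions that are not shown on the slide.

```python
# Decimal (SI) byte units, matching the table in the slides.
KB, MB, GB, TB, PB = 10**3, 10**6, 10**9, 10**12, 10**15


def total_bytes(items, bytes_per_item):
    """Total volume for a number of items of a given average size."""
    return items * bytes_per_item


def storage_cost_usd(n_bytes, cents_per_gb):
    """Cost of storing n_bytes at a price quoted in US cents per GB."""
    return (n_bytes // GB) * cents_per_gb / 100


# Surface web: 10 billion pages at 10 kB to 100 kB per page.
web_low = total_bytes(10 * 10**9, 10 * KB)
web_high = total_bytes(10 * 10**9, 100 * KB)

# Email: 3 billion accounts, 10 emails per day each, 1 kB per email.
emails_per_day = 3 * 10**9 * 10
email_day = total_bytes(emails_per_day, 1 * KB)
email_year = email_day * 365

print(f"surface web: {web_low // TB} TB to {web_high // PB} PB")
print(f"email: {email_day // TB} TB/day, {email_year / PB:.2f} PB/year")
print(f"1 PB at 10 cents/GB: USD {storage_cost_usd(PB, 10):,.0f}")
```

Changing a single input (for example the average email size) shows how sensitive the yearly totals are to each assumption.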

Turning behavior into data

Revealed preferences
- Music
- Search
- Online trading
- Online dating

Everything can and will become data

Additional sources of data about people
- Movement: mobile phones, GPS
- Brain activity: neuromarketing, fMRI analysis of response to stimuli
- RFIDs (Radio Frequency Identifiers): unique identifiers for objects, bridging physical and digital

Comment (Liu Jun): There are 3 billion base pairs, not just one billion, in the human genome.

RFIDs and e-business

Facts
- Price: 2 US cents
- Size: 2 mm
- It will happen: big business

Opportunities
- Inventory systems, supply chain
- Wal-Mart saves USD 8 billion per year by using RFIDs
- Shipping screw-ups: 1 in 20
- Personalization

Fears
- Loss of privacy
- Abuse of data
- Consumers need to be educated to make informed, conscious decisions about their data
- This level of transparency is native in e-business


Aspects of privacy
- Information: name, address, hobbies, ...
- Communication: phone calls, e-mail, SMS, ...
- Territory: privacy of your office, home, bedroom, ...
- Bodily privacy: strip searches, drug tests, ...


Some privacy concerns
- Collection and storage: Extensive amounts of personally identifiable data are collected and stored.
- Unauthorized secondary use: Information collected from individuals for one purpose is used for another purpose without authorization from the individuals.
- Improper access: Data about individuals are available to people not authorized to view or work with these data.
- Combining data: Personal data in disparate databases may be combined into larger databases.

*Source: Smith, H.J., Milberg, S.J., Burke, S.J., Information Privacy: Measuring Individuals' Concerns About Organizational Practices, MIS Quarterly, June 1996

Errors in personal data
- People worry that protections against errors in personal data are inadequate
- Errors can be accidental or deliberate
- People increasingly demand access to their personal information
- Revealed preferences often differ from stated preferences
- Perception often matters more than objective facts

Question: Describe processes by which people can correct errors in data about themselves.

Different people have different privacy concerns

[Chart: identity revelation vs. profile revelation, each on a 1-5 scale running from "Never" through "Under certain conditions" to "Always"]
- Marginally concerned: 24%
- Profiling averse: 26%
- Identity concerned: 20%
- Privacy fundamentalist: 30%

Privacy becoming increasingly relevant
- Personal information becomes ubiquitous with electronic transactions
- Personal information is at the core of privacy
- Privacy is a fundamental right that has been recognized by democratic societies across centuries and across geographies
- Privacy is a proven customer concern
- Privacy breaches increasingly become a relevant social cost (including for companies)
- As companies have begun to treat customer information as an asset, people learn to consider their information as an asset

People trust less in the way companies deal with their data

Survey statement: "Most businesses handle the personal information they collect about consumers in a proper and confidential way."

Strongly/Somewhat Disagree
- 1999: 34%
- 2000: 43%
- 2001: 56%

*Source: Ernst & Young, Privacy: What Consumers Want, January 2003

People feel increasingly less protected

Survey statement: "Existing laws and organizational practices provide a reasonable level of protection for consumer privacy today."

Strongly/Somewhat Disagree
- 1999: 38%
- 2000: 47%
- 2001: 62%

*Source: Ernst & Young, Privacy: What Consumers Want, January 2003

Privacy backlash could have a considerable impact on a company's bottom line

Survey question: "If you were to hear or read that a company with which you were a customer was collecting, sharing or using customers' personal information in a way you did not think was proper, which one of the following best describes what you would do?"

[Chart: response shares of 83%, 16% and 1%; the answer labels did not survive extraction]

*Source: Ernst & Young, Privacy: What Consumers Want, January 2003

Professionals are significantly less concerned about privacy issues when they are asked as professionals than when they are asked in private.*

Scale: 1 = unimportant, 7 = extremely important

*Source: Esrock, S.L., Ferré, J.P., A Dichotomy of Privacy: Personal and Professional Attitudes of Marketers, Business and Society Review, 104:1, 1999, pp. 107-120

Privacy Principles by US Federal Trade Commission (FTC)
- Notice/Awareness: Detailed advice to visitors on policies w.r.t. the personal data you process: what data is collected by whom, shared with whom, used for what, and the consequences of refusal to provide data.
- Choice/Consent: Giving consumers options as to how information collected may be used, esp. w.r.t. secondary uses; opt-in versus opt-out debate and granularity of privacy choices given.
- Access/Participation: Letting people about whom you have information find out what that information is, and contest its accuracy and completeness if they believe it's wrong.
- Enforcement/Redress: Comply with the privacy laws in a country, subscribe to an industry code of practice or participate in a privacy seal program, ...
- Integrity/Security: Data must be accurate and secure. Data collectors must use only reputable sources of data and cross-reference data against multiple sources, provide consumer access to data, and destroy untimely data or convert it into anonymous form.

Storage is free
- Dramatic drop in price (2008: 1 GB costs 10 US cents)
- Exponential increase in storage

[Chart: growth of storage; surviving axis labels: 10 exabyte, PB]

Or: Money makes the world go round

[Screenshot: sina.com on Oct 8, 1997 (web.archive.org)]

Internet Archive
- http://web.archive.org
- Stores versions of surface web since 1996
- Collected via opt-out
- 1 TB / day raw data
- 1 petabyte stored total

Why now?

Malthus's Law of Information:
- New information content is doubling every year
- Time spent on information consumption is constant

[Chart: cost of storage and communication over time, 1990-2010]
[Chart: data collected per day over time, 1990-2010, explicit (surveys etc.) vs. implicit (clicks etc.)]
- Data collected implicitly: dramatic growth over time
- Data collected explicitly: amount constant over time

What a great opportunity for marketing!

Explicit: rate items
Self-personalize: myyahoo: only 20%
Plus processing power
Bottleneck used to be data and algorithms; now we need constraints!

What ACTIONS are possible? What is trivial? What is useless? What is valuable?

Communication: fast feedback loop
So, what's hard now? To make sense out of it!

TIME SCALES
Even rabbits take a while
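The two statements of Malthus's Law of Information combine into a simple consequence: if new information doubles every year while consumption time stays constant, the share of new information anyone can consume halves every year. The following is a minimal sketch of that arithmetic; the function name and the starting share of 100% are assumptions made for illustration, not figures from the slides.

```python
def coverage_by_year(years, initial_coverage=1.0, growth_factor=2.0):
    """Share of each year's new information that a constant time budget covers.

    Year 0 starts at initial_coverage; the volume of new information is
    multiplied by growth_factor every year while consumption stays flat.
    """
    return [initial_coverage / growth_factor**t for t in range(years + 1)]


coverage = coverage_by_year(10)
for year, share in enumerate(coverage):
    print(f"year {year:2d}: {share:.4%} of new information consumed")
```

After ten doublings the constant time budget covers about a thousandth of what it covered at the start, which is one way to read the slide's point that the hard part is now making sense of the data.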

Voice over IP (VoIP)
- IP := Internet Protocol
- Traditional phones are on their way out
- Example: Skype
  - Skype to Skype: free
  - Skype to phone: 1c / min
  - Concurrent users (3/06): 5M
- Why is it so inexpensive?

[Chart: Nr of words transmitted vs cost of transmission (US 1960-1980)]

Large e-business company: Amount of data created per year

Level               New data per year
Customer            100 MB
Orders              10 GB
Session aggregates  1 TB
Clicks              100 TB

Same as google logs: 100G / day
Largest lab for people data
Vision summary
E.g., to compute convergences
Information age
Interaction effects
Visit level: could talk about visit level (daily aggregates)
More: Site instrumentation, JavaScript (mouse movement, scrolling)

The iterative process of modeling and decision making
1. Define: Business metrics, objectives and baselines
2. Measure: Collect, store, manage data
3. Describe: Exploratory data analysis
4. Predict and evaluate: Probabilistic models
5. Decide, act, and evaluate

[Diagram labels: Re-Design, Analyze, Generalize]

1) Baseline: what does it mean to do well?
Instrument the site
CONTROL
Me obsessed with evaluation
Learn

1. Business metrics and objectives
- Stock price
- Profit
- Number of items sold
- Number of visits
- Conversion rate
- Customer acquisition
- Customer retention
- Customer satisfaction

Trade-off

Own inventory? Marketplace? New categories?
vs
Writing papers
To increase transparency of business (and, of course, return on investment)

DISCUSS: Number of clicks per visit

CUSTOMER DELIGHT

2. Measure
- Orders
- Overall use of the site: buying vs. selling, searching vs. browsing
- Engagement: reviews, etc.
- Customer service contacts: e-mail, phone
- Surveys: satisfaction, intentions and goals
- Customer service response: resolution (free replacement, refund)
- Delivery date: actual vs. promised
- Number of items returned in a search
- E-mail campaigns and responses

[Diagram labels: Customer Behavior, Customer-Company Interactions, Company Behavior]

Think what your company can collect!
Different sources! Touchpoints
More: competitors' prices

Why is it hard?
- Even simple behavioral analysis requires significant infrastructure
- Reporting
- Behavioral analysis, predictive modeling and action (e.g., recommendations)
- Cost center / Profit center

and store
Amazon's data production rate is comparable to that of satellite television

Business questions
- How many people are coming to my site?
- Who are they?
- Where are they coming from?
- What are they doing?
- Who's coming back and how frequently?
- How is all of this changing over time?
- What is the impact of a recent site change?

Twyman's Law
- Any statistic that appears interesting is almost certainly a mistake
- Validate "amazing" discoveries in different ways
- They are usually the result of a business process
  - 5% of customers were born on the exact same day (including year): 11/11/11 is the easiest way to satisfy the mandatory birth date field
  - For US Web sites, there will be a small sales increase on Oct 4, 2008; for European sites, on Nov xx, 2008

http://webexhibits.org/daylightsaving/b.html
For Oct 29, 2006 it's both Europe and the US. For starting DST, the dates are different.
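The birth date example suggests a cheap sanity check before trusting any statistic: look at how much of a column is taken up by its single most common value. Below is a minimal sketch with made-up data; the function name, the reading of 11/11/11 as 11 November 1911, and the 1% alert threshold are all assumptions for illustration.

```python
from collections import Counter
from datetime import date


def dominant_value_share(values):
    """Return the most common value and the fraction of rows it accounts for."""
    counts = Counter(values)
    value, count = counts.most_common(1)[0]
    return value, count / len(values)


# Hypothetical customer table: 5 of 100 customers typed 11/11/11 into the
# mandatory birth date field; the other 95 dates are all distinct.
birth_dates = [date(1911, 11, 11)] * 5 + [
    date(1950 + i, 1 + i % 12, 1 + i % 28) for i in range(95)
]

value, share = dominant_value_share(birth_dates)
SUSPICIOUS_SHARE = 0.01  # assumed threshold; tune per field
if share > SUSPICIOUS_SHARE:
    print(f"check the data entry process: {share:.0%} of rows share {value}")
```

A flagged value does not prove an error; it is a prompt to look at the business process that produced the field, as the slide recommends.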

Don't forget to change your batteries: More than 90 percent of homes in the United States have smoke detectors, but one-third are estimated to have dead or missing batteries.

Some experiences
- Synchronize clocks from all data collection points
  - Example: Some servers were set to GMT and others to Pacific time, leading to strange anomalies
  - Even being a few minutes off can cause add-to-carts to appear prior to the search
- Remove test data
  - QA organizations constantly test the system
  - Make sure the data can be identified and removed from analysis
- Remove robots/bots/spiders
  - 5-40% of e-commerce site traffic is generated by crawlers from search engines (and students doing problem sets)
  - These significantly skew results unless removed
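The three cleaning steps above can be sketched as a single pass over raw events. Everything specific here is hypothetical: the event fields, the test account name, and the user-agent markers are placeholders, and Pacific time is modelled as a fixed UTC-8 offset (ignoring daylight saving) to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

PACIFIC = timezone(timedelta(hours=-8))    # assumption: fixed offset, no DST
BOT_MARKERS = ("bot", "crawler", "spider")  # hypothetical user-agent markers
TEST_USERS = {"qa_tester"}                  # hypothetical QA account


def to_utc(naive_ts, server_tz):
    """Attach the recording server's time zone and convert to UTC."""
    return naive_ts.replace(tzinfo=server_tz).astimezone(timezone.utc)


def clean(events):
    """Drop test and robot traffic, then order events on a common UTC clock."""
    kept = []
    for event in events:
        if event["user"] in TEST_USERS:
            continue
        if any(marker in event["agent"].lower() for marker in BOT_MARKERS):
            continue
        kept.append({**event, "ts": to_utc(event["ts"], event["server_tz"])})
    return sorted(kept, key=lambda event: event["ts"])


events = [
    # Logged on a Pacific-time server: looks earlier than the search below.
    {"user": "alice", "agent": "Mozilla/4.0", "action": "add_to_cart",
     "ts": datetime(2006, 6, 29, 13, 40), "server_tz": PACIFIC},
    # Logged on a GMT server.
    {"user": "alice", "agent": "Mozilla/4.0", "action": "search",
     "ts": datetime(2006, 6, 29, 21, 38), "server_tz": timezone.utc},
    {"user": "qa_tester", "agent": "Mozilla/4.0", "action": "search",
     "ts": datetime(2006, 6, 29, 21, 39), "server_tz": timezone.utc},
    {"user": "anon", "agent": "ExampleBot/1.0", "action": "view",
     "ts": datetime(2006, 6, 29, 21, 37), "server_tz": timezone.utc},
]

cleaned = clean(events)
print([event["action"] for event in cleaned])
```

Sorting the raw timestamps would place the add-to-cart before the search; once both are on UTC the order is search, then add-to-cart, which is the anomaly the slide describes.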

[Some people bought fairly expensive products for less than 5 cents. Note this is an example of a multi-variate anomaly: it is OK for some products (e.g. gum) to cost 5 cents, but not for other products. There were 26 different ways of spelling Mitsubishi! Use drop-down lists instead of free text fields.]

Picking the right visualization is key to seeing patterns
- Traffic by day
  - Easy to see weekends
  - Difficult to see other patterns
- Heat map
  - Shows traffic colored from green to yellow to red
  - Utilizes cyclical nature of the week
  - Note 9/3 (Labor Day) and 9/11
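A heat map of this kind starts from a simple reshaping step: daily counts are laid out with one row per week and one column per weekday, so that the weekly cycle lines up vertically and unusual days stand out. The sketch below shows only that reshaping, using made-up traffic numbers; the year 2007 and the function name are assumptions (the slide gives only 9/3 and 9/11), and coloring the cells is left to whatever plotting tool is at hand.

```python
from datetime import date, timedelta


def weekly_grid(daily_counts):
    """Arrange {date: count} into rows keyed by ISO (year, week), Mon..Sun."""
    grid = {}
    for day, count in daily_counts.items():
        iso_year, iso_week, iso_weekday = day.isocalendar()
        row = grid.setdefault((iso_year, iso_week), [None] * 7)
        row[iso_weekday - 1] = count
    return dict(sorted(grid.items()))


# Hypothetical traffic: 100 on weekdays, 60 on weekends, 70 on Labor Day.
start = date(2007, 8, 27)  # a Monday
daily_counts = {}
for offset in range(21):
    day = start + timedelta(days=offset)
    daily_counts[day] = 60 if day.weekday() >= 5 else 100
daily_counts[date(2007, 9, 3)] = 70

grid = weekly_grid(daily_counts)
print("week      Mon Tue Wed Thu Fri Sat Sun")
for (year, week), row in grid.items():
    print(f"{year}-W{week:02d}  " + " ".join(f"{count:3d}" for count in row))
```

In the grid, the weekend dip appears as two low columns on the right, and the holiday shows up as a single low cell in the Monday column.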


Explain the heatmap. Note that Fridays are generally weaker. The next version of Office (Office 2007) has heatmaps.

Business-level lessons
- Collect operational business data
  - Data usually not in web logs: searches, response times to return results, shopping cart events, registration forms
- External events
  - Marketing promotions
  - Site changes
- Choose to collect as much data as you realistically can because you do not know what might be relevant for a future question.
- Consider privacy issues
  - Often aggregated or anonymous data suffices

Collection example: Form errors

Here is a good example of data collection that was introduced without knowing a priori whether it would help: form errors.
- If a web form is filled in and a field does not pass validation, log the field and the value
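Logging form errors can be as small as appending a record whenever validation fails. The sketch below is an illustration only: the form name, the deliberately simple e-mail pattern, and the in-memory log are assumptions, and a real site would also have to weigh the privacy implications of storing raw field values.

```python
import re
from collections import Counter

# Deliberately simple e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

form_error_log = []  # in a real system this would go to persistent storage


def validate_signup(form_id, email_value):
    """Validate an e-mail sign-up field; log field and value on failure."""
    is_valid = EMAIL_RE.match(email_value) is not None
    if not is_valid:
        form_error_log.append(
            {"form": form_id, "field": "email", "value": email_value}
        )
    return is_valid


# Hypothetical submissions to a home page sign-up box.
for submitted in ["ann@example.com", "red shoes", "leather jacket",
                  "bob@example.org", "handbags"]:
    validate_signup("home_signup", submitted)

errors_by_field = Counter((e["form"], e["field"]) for e in form_error_log)
print(errors_by_field.most_common())
print([e["value"] for e in form_error_log])
```

Counting errors per form and field shows where a page misbehaves; reading the logged values shows why.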

This was the Bluefly home page when they went live.
Looking at form errors, we saw thousands of errors every day on this page.
Any guesses?

People typed search keywords into the e-mail box that says "sign up for e-mail". Easy to fix. BTW, search has to be on the home page. Amazon also made this mistake when it went live: there was no search box on the home page.

Summary
- Think about the problem end-to-end
  - Collection
  - Transformations
  - Reporting
  - Visualizations
  - Modeling
  - Taking action
- Agree on terminology
  - How do you define a session?
  - How do you define a customer? (E.g., did every customer make a purchase?)
- Beware of hidden variables when concluding causality
  - E.g., Simpson's paradox
- Conduct controlled experiments (A/B tests) when possible: our intuition is poor
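Simpson's paradox is easiest to see with numbers. In the made-up example below, page variant A converts better than variant B among returning visitors and among new visitors, yet B looks better overall because the two variants were shown to very different mixes of visitors. All counts and names are hypothetical; a randomized A/B test avoids the problem by giving both variants the same mix.

```python
# Hypothetical (conversions, visits) per variant and visitor segment.
data = {
    "A": {"returning": (90, 100), "new": (120, 400)},
    "B": {"returning": (340, 400), "new": (25, 100)},
}


def segment_rates(variant):
    """Conversion rate within each visitor segment."""
    return {seg: conv / visits for seg, (conv, visits) in data[variant].items()}


def overall_rate(variant):
    """Conversion rate with all segments pooled together."""
    conversions = sum(conv for conv, _ in data[variant].values())
    visits = sum(visits for _, visits in data[variant].values())
    return conversions / visits


for variant in data:
    rates = segment_rates(variant)
    print(
        f"{variant}: returning {rates['returning']:.0%}, "
        f"new {rates['new']:.0%}, overall {overall_rate(variant):.0%}"
    )
```

The hidden variable here is the visitor segment: variant B was shown mostly to returning visitors, who convert at a high rate regardless of the page.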

Weblog entry

209.209.111.59 - - [29/Jun/2006:13:38:50 -0700] "GET / HTTP/1.1" 200 17497 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

Fields
- Client IP address: 209.209.111.59
- Timestamp: [29/Jun/2006:13:38:50 -0700]
- Request: "GET / HTTP/1.1"
- Status code: 200
- Bytes sent: 17497
- Referrer: "-"
- User agent: "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
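The entry above follows the widely used combined log format, so it can be split into its fields with one regular expression. This is a minimal sketch, not a production parser: the field names are chosen for this example, and the month abbreviation is parsed with `%b`, which assumes an English (C) locale.

```python
import re
from datetime import datetime

# One named group per field of a combined-format log line.
LOG_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


line = (
    '209.209.111.59 - - [29/Jun/2006:13:38:50 -0700] "GET / HTTP/1.1" '
    '200 17497 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)
record = parse_line(line)
print(record["ip"], record["time"].isoformat(), record["status"], record["size"])
```

Because the timestamp keeps its UTC offset, entries from servers in different time zones can be compared directly after parsing.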

APPENDIX

Books and Libraries
- 30 million books
- 1 MB per book (text)
  - 100 characters per line
  - 100 lines per page
  - 100 pages per book
- 30 TB for text

1 petabyte = $1 million

Are books the right medium for archiving?
- Digital storage cheap: $10k for all
- Scanning expensive: $10 per book
- But: $300M in one shot, then done forever!
- Scanning all books is half a year of the Library of Congress's budget

Books in print: 3.2 million
Books sold in US in 1999: 1.1 billion

FITS IN A BOX
GETS YOU A HOUSE (well, a garage in SF)
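The book estimates can be reproduced from the per-line, per-page and per-book figures. The sketch below assumes one byte per character and uses the appendix's price of $1 million per petabyte; variable names are illustrative. With these inputs, storing the text comes to about $30k, the same order of magnitude as the slide's $10k, while scanning dominates at $300M.

```python
TB, PB = 10**12, 10**15

books = 30 * 10**6
chars_per_book = 100 * 100 * 100   # characters/line x lines/page x pages/book
bytes_per_book = chars_per_book    # assumption: 1 byte per character

text_bytes = books * bytes_per_book
storage_cost_usd = text_bytes * 1_000_000 // PB  # at $1 million per petabyte
scanning_cost_usd = books * 10                   # at $10 per book

print(f"text of all books: {text_bytes // TB} TB")
print(f"storage: ${storage_cost_usd:,}  scanning: ${scanning_cost_usd:,}")
```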

Web

Surface web = static pages
- 10 billion pages
- 10 kB to 100 kB / page
- 100 TB to 1 PB total storage

Deep web = underlying databases
- 10x size of surface web
- 1 to 10 petabyte

Internet email
- 1 billion email accounts
- 10 emails / day / account
- 10 billion emails / day
- 1 kB / email
- 10 TB traffic per day
- 30 petabyte / year

Comparison (banner ads)
- 4 billion ads / day served by DoubleClick

Usenet
- 73 terabytes of Usenet per year

Information production

Surveillance: 30 exabyte / year
- 30M cameras
- 3 frames / sec -> 100M pics / sec
- 10 kB / pic -> 1 TB / sec
- 100k secs / day -> 100 petabyte / day
- One day of production of surveillance cameras
  = 1 year of all email traffic
  = 100+ years of data stored by Amazon.com

Mail (US only)
- 197 billion pieces
- 733 pieces per year per person
- 150 petabytes per year (counting dups)
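The surveillance estimate chains four rounded factors. Redoing it without the intermediate rounding, as in the sketch below, gives about 0.9 TB per second, about 78 petabyte per day, and about 28 exabyte per year, which agrees with the slide's headline of 30 exabyte per year. Variable names are illustrative, and the only added assumptions are 86,400 seconds per day and 365 days per year.

```python
GB, TB, PB, EB = 10**9, 10**12, 10**15, 10**18

cameras = 30 * 10**6
frames_per_second = 3
bytes_per_picture = 10 * 10**3   # 10 kB
seconds_per_day = 24 * 60 * 60   # 86,400 (the slide rounds to 100k)

pictures_per_second = cameras * frames_per_second
bytes_per_second = pictures_per_second * bytes_per_picture
bytes_per_day = bytes_per_second * seconds_per_day
bytes_per_year = bytes_per_day * 365

print(f"{pictures_per_second / 10**6:.0f}M pics/sec")
print(f"{bytes_per_second / TB:.1f} TB/sec")
print(f"{bytes_per_day / PB:.0f} PB/day")
print(f"{bytes_per_year / EB:.1f} EB/year")
```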

Information production
- 20 exabyte / year flow through telephone, internet, radio, TV
- Telephone
  - 10^13 minutes per year (worldwide, 2005 estimate)
  - 10 exabyte per year
- 5 exabyte / year of new data was produced and stored in 2002
  - Corresponds to 1 GB per person (worldwide) per year
  - Corresponds to 10 meters of printed books per person per year
- After removing duplicates, 1 exabyte of new information per year
  - Corresponds to one million new libraries per year, or one large library per minute
- US is 40% of world total

Information
- Digital: 80-95%
  - Mainly magnetic (hard disks)
- Non-digital: 5-20%
  - Mainly film
  - Little paper (0.01%)
- Very little CD, DVD (optical)

Most information is produced by individuals
- Most information is created by individuals, not institutions
- Telephone calls, email, printouts, photos
- We don't know how to organize it

Note: Paper consumption is growing, but most is printed off digital media
- Office documents and mail outnumber books, newspapers and journals
- North Americans consume 24 reams (11,916 sheets) of paper annually; the European Union consumes 15 reams (7,280 sheets); the world average is 1,500 sheets each.
- In the US, at least half of this paper is used to produce office documents, mostly computer printouts

Film
- Film is less than 10% of total
- Estimates based on sales of film materials

Photos
- 80 billion shots per year
- 2700 shots per second
- 80 percent of US households have a camera
- 15 percent of Chinese households
- China is 2nd largest market
- 70% of US purchasers say they will buy digital next time

Movies
- 4,250 movies per year worldwide

X-rays
- 2 billion X-rays per year

Paper
- Paper is less than 0.01% of total
- Office documents are mainly printouts of digital: growing!
- Books: 1 million titles per year (UNESCO)
- Newspapers: 23,000 published per year (25 terabytes)
- Scholarly journals (Ulrich's): 40,000 published per year
- Magazines (Ulrich's): 80,000 published per year