Adding Semantics to Enterprise Search
Workshop
Tom Reamy, Chief Knowledge Architect
KAPS Group
Program Chair – Text Analytics World
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
Introduction – What is Wrong with Enterprise Search?
Solution: Adding Semantics to Enterprise Search
– Infrastructure Solution – Taxonomy, Metadata, Information Technology
– Hybrid Solutions – Text Analytics
Development – Taxonomy, Categorization, Faceted Metadata Search and Search-based Applications
– Integration with Search and ECM
– Platform for Information Applications
Questions / Discussions
Introduction: KAPS Group
Knowledge Architecture Professional Services – Network of Consultants
Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
Services:
– Strategy – IM & KM – Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Quick Start – Audit, Evaluation, Pilot
– Social Media: Text-based applications – design & development
Partners – Smart Logic, Expert Systems, SAS, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics
Clients – Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc.
Presentations, Articles, White Papers – www.kapsgroup.com
Enterprise Search Workshop: Introduction
What is Wrong with Enterprise Search? Everything!
It is the wrong technology
– Index vs. section headings & summaries
It is the wrong approach
– Technology is not the answer
– Need semantics, context, articulated infrastructure
Leads to the Enterprise Search Dance
– Every 2-5 years, buy a new search engine
– And repeat the same mistakes
– 2-5 years later the complaints start again
Enterprise Search Workshop: Introduction: What is Wrong with Enterprise Search?
The Google Solution? A Great Answer to the Wrong Question
Outside the enterprise, Google works great
– Link Algorithm – the most popular answer is treated as the best
– Secret Sauce – 1,000's of editors & analysts doing millions of "Best Bets" (and selling to the highest bidder – more best bets)
Inside the enterprise – just another AltaVista
– Link Algorithm doesn't work
– Looking for THE document, not the most popular
Enterprise Search Workshop: Introduction: What is Wrong with Enterprise Search?
The "Automatic" Solution? Variety of claimants
– Autonomy et al. – just point us at content and magic happens
– NLP, Latent Semantic Indexing, Training sets
Semantic Web – trillions of triples
– Applications still mostly missing – how are triples structured?
Nothing is automatic – the question is where resources are put – programming or library science or…?
One question – how well does "Find Similar" work?
No easy answer – why search still is not working
Enterprise Search Workshop: Introduction: What is Wrong with Enterprise Search?
The Right Answer – look beyond search
Need a Different Technology:
– Semantics, language, meaning
– Aboutness of documents
Beyond Technology: Context(s):
– Purpose, business function of information
– Self-knowledge is the highest form of knowledge
Beyond IT
– Library, business groups, data wizards – predictive analytics
What is new in search?
Enterprise Search Workshop: Information Environment
Elements of a Solution: Semantic Infrastructure
Semantic Layer = Taxonomies, Metadata, Vocabularies + Text Analytics – adding cognitive science, structure to unstructured content
Modeling users/audiences
Technology Layer
– Search, Content Management, SharePoint, Intranets
Publishing process, multiple users & info needs
– SharePoint – taxonomies, but:
• Folksonomies – still a bad idea
Infrastructure – not an Application
– Business / Library / KM / EA – not IT
Building on the Foundation
– Info Apps (Search-based Applications)
Enterprise Search Workshop
Semantic Infrastructure: People
Communities / Tribes
– Different languages
– Different cultures
– Different models of knowledge
Two needs – support silos and inter-silo communication
Types of Communities
– Formal and informal
– Variety of subject matters – vaccines, research, sales
– Variety of communication channels and information behaviors
Individual People – tacit knowledge / information behaviors
– Consumers and producers of information – in depth
– Map major types
Enterprise Search Workshop
People: Central Team
Central Team supported by software and offering services
– Creating, acquiring, evaluating taxonomies, metadata standards, vocabularies, categorization taxonomies
– Input into technology decisions and design – content management, portals, search
– Socializing the benefits of metadata, creating a content culture
– Evaluating metadata quality, facilitating author metadata
– Analyzing the results of using metadata, how communities are using it
– Researching metadata theory, user-centric metadata
– Facilitating knowledge capture in projects, meetings
Enterprise Search Workshop
People: Location of Team
KM/KA Dept. – cross-organizational, interdisciplinary
Balance of dedicated and virtual staff, partners
– Library, Training, IT, HR, Corporate Communication
Balance of central and distributed
Industry variation
– Pharmaceutical – dedicated department, major place in the organization
– Insurance – small central group with partners
– Beans – a librarian and part-time functions
Which design? – knowledge architecture audit
Enterprise Search Workshop
Resources: Technology
Text Mining
– Both a structure technology – taxonomy development
– And an application
Search-Based Applications
– Portals, collaboration, business intelligence, CRM
– Semantics add intelligence to individual applications
– Semantics add the ability to communicate between applications
Creation – content management, innovation, communities of practice (CoPs)
– When, who, how, and how much structure to add
– Workflow with meaning, distributed subject matter experts (SMEs) and centralized teams
Enterprise Search Workshop
Business Processes
Platform for a variety of information behaviors & needs
– Research, administration, technical support, etc.
– Types of content, questions
Subject Matter Experts – info structure amateurs
Web Analytics – feedback for maintenance & refinement
Enhance Basic Processes – integrated workflow
– Enhance both efficiency and quality
Enhance support processes – education, training
Develop new processes and capabilities
– External content – text mining, smarter categorization
Enterprise Search Workshop
Knowledge Structures
List of Keywords (Folksonomies)
Controlled Vocabularies, Glossaries
Thesaurus
Browse Taxonomies (Classification)
Formal Taxonomies
Faceted Classifications
Semantic Networks / Ontologies
Categorization Taxonomies
Topic Maps
Knowledge Maps
Enterprise Search Workshop
A Framework of Knowledge Structures
Level 1 – keywords, glossaries, acronym lists, search logs
– Resources, inputs into upper levels
Level 2 – Thesaurus, Taxonomies
– Semantic resource – foundation for applications, metadata
Level 3 – Facets, Ontologies, Semantic Networks, Topic Maps, Categorization Taxonomies
– Applications
Level 4 – Knowledge Maps
– Strategic resource
Enterprise Search Workshop
Enterprise Taxonomies: Wrong Approach
Very difficult to develop – $100,000's
Even more difficult to apply
– Teams of librarians or authors/SMEs
– Cost versus quality
Problems with maintenance
Cost rises in proportion to granularity
Difficulty of representing the user perspective
Social media requires a framework – it doesn't create one
– Wisdom of Crowds OR
– Tyranny of the majority, madness of crowds
Enterprise Search Workshop: Information Environment
Metadata – Tagging
How do you bridge the gap – taxonomy to documents?
Tagging documents with taxonomy nodes is tough
– And expensive – central or distributed
Library staff – experts in categorization, not subject matter
– Too limited, narrow bottleneck
– Often don't understand business processes and business uses
Authors – experts in the subject matter, terrible at categorization
– Intra- and inter-author inconsistency, "intertwingleness"
– Choosing tags from a taxonomy – complex task
– Folksonomy – almost as complex, wildly inconsistent
– Resistance – not their job, cognitively difficult = non-compliance
Text Analytics is the answer(s)!
Enterprise Search Workshop: Information Environment
Mind the Gap – Manual-Automatic-Hybrid
All require human effort – the issue is where, and how effective
Manual – human effort is tagging (difficult, inconsistent)
– Small, high-value document collections, trained taggers
Automatic – human effort is prior to tagging – auto-categorization rules and/or NLP algorithm effort
Hybrid Model – before (like automatic) and after
– Build on expertise – librarians on categorization, SMEs on subject terms
Facets – require a lot of metadata
– Entity extraction feeds facets – more automatic, feedback by design
Manual – Hybrid – Automatic is a spectrum – depends on context
Enterprise Search Workshop
Content Structures: New Approach
Simple Subject Taxonomy structure
– Easy to develop and maintain
Combined with categorization capabilities
– Added power and intelligence
Combined with Faceted Metadata
– Dynamic selection of simple categories
– Allow multiple user perspectives
• Can't predict all the ways people think
• Monkey, Banana, Panda
Combined with ontologies and semantic data
– Multiple applications – text mining to search
– Combine search and browse
Enterprise Search Workshop
Benefits – Why Semantic Infrastructure
Unstructured content = 90% or more of all content
Only way to get value is adding structure
Only way to add useful structure is deep research into the information environment
What is the justification for this approach?
– How many new search engines do you need to buy and do the dance in another 5 years?
– Not as expensive or time consuming as it seems (just unfamiliar to IT)
Enterprise Search Workshop
Benefits – Infrastructure vs. Projects
Strategic foundation vs. short term
Integrated solution – CM and Search and Applications
– Better results
– Avoid duplication
Semantics
– Small comparative cost
– Needed to get full value from all of the above
ROI – asking the wrong question
– What is the ROI of having an HR department?
– What is the ROI of organizing your company?
Enterprise Search Workshop
Costs and Benefits
IDC study – quantify the cost of bad search
Three areas:
– Time spent searching
– Recreation of documents
– Bad decisions / poor-quality work
Costs:
– 50% of search time is bad search = $2,500 a year per person
– Recreation of documents = $5,000 a year per person
– Bad quality (harder) = $15,000 a year per person
Per 1,000 people = $22.5 million a year
– 30% improvement = $6.75 million a year
– Add your own stories – especially the cost of bad information
– Human measure – # of FTEs, savings passed on to customers, etc.
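The arithmetic behind these figures can be checked in a few lines (a sketch; the per-person dollar amounts are the slide's estimates, not measured data):

```python
# Per-person annual cost estimates from the IDC-style analysis (slide assumptions)
bad_search_time = 2_500   # 50% of search time wasted
doc_recreation = 5_000    # recreating documents that could not be found
bad_decisions = 15_000    # poor-quality work from bad information

per_person = bad_search_time + doc_recreation + bad_decisions  # per person, per year
org_cost = per_person * 1_000                                  # per 1,000 people
savings = org_cost * 0.30                                      # 30% improvement

print(per_person)  # 22500
print(org_cost)    # 22500000
print(savings)     # 6750000.0
```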
Enterprise Search Workshop
Benefits – Selling the Benefits
CTO, CFO, CEO
– Doesn't understand – wrong language
– Semantics is extra – harder work will overcome
– Not business critical
– Not tangible – accounting bias
– Does not believe the numbers
– Believes he/she can do it
Need stories and figures that will connect
Need to understand their world – every case is different
Need to educate them – semantics is tough and needed
Enterprise Search Workshop
Benefits of Text Analytics
Why Text Analytics?
– Enterprise search has failed to live up to its potential
– Enterprise content management has failed to live up to its potential
– Taxonomy has failed to live up to its potential
– Adding metadata, especially keywords, has not worked
What is missing?
– Intelligence – human-level categorization, conceptualization
– Infrastructure – integrated solutions, not technology/software
Text Analytics can be the foundation that (finally) drives success – search, content management, and much more
Enterprise Search Workshop
Introduction: Text Analytics
History – academic research, focus on NLP
Inxight – out of Xerox PARC
– Moved TA from academic NLP to auto-categorization, entity extraction, and search metadata
Explosion of companies – many based on Inxight extraction with analytical-visualization front ends
– Half from 2008 are gone – the lucky ones got bought
Focus on enterprise text analytics – shift to sentiment analysis – easier to do, obvious payoff (customers, not employees)
– Backlash – real business value?
Enterprise search down – 10 years of effort for what?
– Need Text Analytics to make it work
Text Analytics is slowly growing – time for a jump?
Enterprise Search Workshop
Current State of Text Analytics
Current Market: 2012 – exceeded $1 billion for text analytics (10% of total analytics)
Growing 20% a year
Search is 33% of the total market
Other major areas:
– Sentiment and Social Media Analysis, Customer Intelligence
– Business Intelligence, range of text-based applications
Fragmented marketplace – full platform, low-level, specialty
– Embedded in content management, search; no clear leader
Big Data – Big Text is bigger; text into data, data for text
– Watson – ensemble methods, pun module
Enterprise Search Workshop
Current State of Text Analytics: Vendor Space
Taxonomy Management – SchemaLogic, PoolParty
From Taxonomy to Text Analytics
– Data Harmony, MultiTes
Extraction and Analytics
– Linguamatics (Pharma), Temis, a whole range of companies
Business Intelligence – ClearForest, Inxight
Sentiment Analysis – Attensity, Lexalytics, Clarabridge
Open Source – GATE
Stand-alone text analytics platforms – IBM, SAS, SAP, Smart Logic, Expert System, Basis, Open Text, Megaputer, Temis, Concept Searching
Embedded in Content Management, Search
– Autonomy, FAST, Endeca, Exalead, etc.
Enterprise Search Workshop
What is Text Analytics?
Text Mining – NLP, statistical, predictive, machine learning
Semantic Technology – ontology, fact extraction
Extraction – entities – known and unknown, concepts, events
– Catalogs with variants, rule-based
Sentiment Analysis – objects/products and phrases
– Statistics, catalogs, rules – positive and negative
Auto-categorization – Training sets, Terms, Semantic Networks
– Rules: Boolean – AND, OR, NOT
– Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
– Disambiguation – identification of objects, events, context
– Build rule-based, not simply a Bag of Individual Words
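A minimal sketch of how such Boolean/proximity categorization rules might be evaluated (the operator names come from the slide; the evaluator and the "Vaccines" rule are illustrative assumptions, not any vendor's engine):

```python
import re

def dist(text, term_a, term_b, max_words):
    """DIST(#)-style proximity: True if term_a and term_b occur
    within max_words words of each other (order-independent)."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= max_words for a in pos_a for b in pos_b)

def categorize(text, rules):
    """Apply Boolean categorization rules; each rule is a predicate on the text."""
    return [name for name, rule in rules.items() if rule(text)]

# Hypothetical rule: 'Vaccines' fires when 'vaccine' AND 'trial' appear
# within 8 words, and 'veterinary' is absent (AND / NOT / DIST combined).
rules = {
    "Vaccines": lambda t: dist(t, "vaccine", "trial", 8)
                          and "veterinary" not in t.lower(),
}

doc = "The new vaccine enters a phase three trial next month."
print(categorize(doc, rules))  # ['Vaccines']
```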
Enterprise Search Workshop
Need for a Quick Start
Text Analytics is weird, a bit academic, and not very practical
• It involves language and thinking and really messy stuff
On the other hand, it is really difficult to do right (rocket science)
Organizations don't know what text analytics is and what it is for
TAW Survey shows the need for two things:
• Strategic vision of text analytics in the enterprise
– Business value, problems solved, information overload
– Text Analytics as platform for information access
• Real-life functioning program showing value and demonstrating an understanding of what it is and does
Quick Start – Strategic Vision – Software Evaluation – POC / Pilot
Enterprise Search Workshop
Text Analytics Vision & Strategy
Strategic Questions – why, what value from text analytics, how are you going to use it?
– Platform or Applications?
What are the basic capabilities of Text Analytics?
What can Text Analytics do for Search?
– After 10 years of failure – get search to work?
What can you do with smart search-based applications?
– RM, PII, Social
ROI for effective search – difficulty of believing
– Problems with metadata, taxonomy
Enterprise Search Workshop
Quick Start Step One – Knowledge Audit
Ideas – Content and Content Structure
– Map of content – tribal language silos
– Structure – articulate and integrate
– Taxonomic resources
People – Producers & Consumers
– Communities, users, central team
Activities – Business processes and procedures
– Semantics, information needs and behaviors
– Information Governance Policy
Technology – CMS, search, portals, text analytics
– Applications – BI, CI, Semantic Web, Text Mining
Enterprise Search Workshop
Quick Start Step One – Knowledge Audit
Info Problems – what, how severe
Formal Process – Knowledge Audit
– Contextual & Information interviews, content analysis, surveys, focus groups, ethnographic studies, Text Mining
Informal for smaller organizations, specific applications
Category modeling – Cognitive Science – how people think
– Panda, Monkey, Banana
Natural-level categories mapped to communities, activities
• Novices prefer higher levels
• Balance of informativeness and distinctiveness
Strategic Vision – Text Analytics and Information/Knowledge Environment
Quick Start Step Two – Software Evaluation
Varieties of Taxonomy / Text Analytics Software
Software is more important to text analytics
– No spreadsheets for semantics
Taxonomy Management – extraction
Full Platform
– SAS, SAP, Smart Logic, Concept Searching, Expert System, IBM, Linguamatics, GATE
Embedded – Search or Content Management
– FAST, Autonomy, Endeca, Vivisimo, NLP, etc.
– Interwoven, Documentum, etc.
Specialty / Ontology (other semantic)
– Sentiment Analysis – Attensity, Lexalytics, Clarabridge, lots more
– Ontology – extraction, plus ontology
Quick Start Step Two – Software Evaluation
A Different Kind of Software Evaluation
Traditional Software Evaluation – Start
– Filter One – ask experts – reputation, research – Gartner, etc.
• Market strength of vendor, platforms, etc.
• Feature scorecard – minimum, must-have; filter to top 6
– Filter Two – technology filter – match to your overall scope and capabilities – a filter, not a focus
– Filter Three – in-depth demo – 3-6 vendors; reduce to 1-3 vendors
Vendors have different strengths in different environments
– Millions of short, badly typed documents; build application
– Library of 200-page PDFs; enterprise & public search
Quick Start Step Two – Software Evaluation
Design of the Text Analytics Selection Team
IT – experience with software purchases, needs assessment, budget
– Search/categorization is unlike other software – needs a deeper look
Business – understands the business, focuses on business value
They can get executive sponsorship, support, and budget
– But don't understand information behavior, semantic focus
Library, KM – understand information structure
Experts in search experience and categorization
– But don't understand business or technology
Interdisciplinary Team, headed by Information Professionals
Much more likely to make a good decision
Create the foundation for implementation
Quick Start Step Three – Proof of Concept / Pilot Project
POC use cases – basic features needed for initial projects
Design – real-life scenarios, categorization with your content
Preparation:
– Preliminary analysis of content and users' information needs
• Training & test sets of content, search terms & scenarios
– Train taxonomist(s) on software(s)
– Develop taxonomy if none available
Four-week POC – 2 rounds of develop, test, refine / not OOB (out of the box)
Need SMEs as test evaluators – also to do an initial categorization of content
Majority of time is on auto-categorization
Enterprise Search Workshop
POC Design: Evaluation Criteria & Issues
Basic Test Design – categorize a test set
– Score – by file name, human testers
Categorization & Sentiment – Accuracy 80-90%
– Effort level per accuracy level
Combination of scores and report
Operators (DIST, etc.), relevancy scores, markup
Development Environment – usability, integration
Issues:
– Quality of content & initial human categorization
– Normalize among different test evaluators
– Quality of taxonomy – structure, overlapping categories
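Scoring by file name against a human gold standard can be sketched like this (the files and categories are hypothetical):

```python
def accuracy(gold, predicted):
    """Fraction of test files whose assigned category matches
    the human gold-standard label (scored by file name)."""
    correct = sum(1 for f, cat in gold.items() if predicted.get(f) == cat)
    return correct / len(gold)

# Hypothetical gold standard and engine output for a 5-file test set
gold = {"a.pdf": "Finance", "b.pdf": "HR", "c.pdf": "Finance",
        "d.pdf": "Legal", "e.pdf": "HR"}
pred = {"a.pdf": "Finance", "b.pdf": "HR", "c.pdf": "Legal",
        "d.pdf": "Legal", "e.pdf": "HR"}

print(accuracy(gold, pred))  # 0.8 -- at the low end of the 80-90% target
```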
Quick Start for Text Analytics
Proof of Concept – Value of POC
Selection of best product(s)
Identification and development of infrastructure elements – taxonomies, metadata – standards and publishing process
Training by doing – SMEs learning categorization, library/taxonomists learning business language
Understand effort level for categorization, application
Test suitability of existing taxonomies for a range of applications
Explore application issues – example: how accurate does categorization need to be for that application – 80-90%?
Develop resources – categorization taxonomies, entity extraction catalogs/rules
Enterprise Search Workshop
POC and Early Development: Risks and Issues
CTO Problem – this is not a regular software process
Semantics is messy, not just complex
– 30% accuracy isn't 30% done – could be 90%
Variability of human categorization
Categorization is iterative, not "the program works"
– Need realistic budget and flexible project plan
"Anyone can do categorization"
– Librarians often overdo, SMEs often get lost (keywords)
Meta-language issues – understanding the results
– Need to educate IT and business in their language
Text Analytics Development: Categorization Process
Start with Taxonomy and Content
Starter Taxonomy
– If no taxonomy, develop (steal) an initial high level
• Textbooks, glossaries, Intranet structure
• Organization structure – facets, not taxonomy
Analysis of taxonomy – suitable for categorization?
– Structure – not too flat, not too large
– Orthogonal categories
Content Selection
– Map of all anticipated content
– Selection of training sets – if possible
– Automated selection of training sets – taxonomy nodes as first categorization rules – apply and get content
Enterprise Search Workshop
Text Analytics Development: Categorization Process
First Round of Categorization Rules
Term building – from content – basic set of terms that appear often / are important to the content
Add terms to rule, apply to broader set of content
Repeat for more terms – get recall-precision "scores"
Repeat, refine, repeat, refine, repeat
Get SME feedback – formal process – scoring
Get SME feedback – human judgments
Test against more, new content
Repeat until "done" – 90%?
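The recall-precision "scores" driving each refinement round are just set comparisons against human judgments (the document IDs here are hypothetical):

```python
def precision_recall(relevant, retrieved):
    """Precision and recall for one category: 'relevant' is the set of
    documents a human judged in-category, 'retrieved' is what the rule tagged."""
    relevant, retrieved = set(relevant), set(retrieved)
    tp = len(relevant & retrieved)          # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical round: the rule tags docs 1-4; humans say docs 2-5 are in-category
p, r = precision_recall(relevant={2, 3, 4, 5}, retrieved={1, 2, 3, 4})
print(p, r)  # 0.75 0.75
```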
Enterprise Search Workshop
Text Analytics Development: Entity Extraction Process
Facet Design – from Knowledge Audit, Knowledge Map
Find and Convert catalogs:
– Organization – internal resources
– People – corporate yellow pages, HR
– Include variants
– Scripts to convert catalogs – programming resource
Build initial rules – follow categorization process
– Differences – scale, threshold – application dependent
– Recall-Precision balance set by application
– Issue – disambiguation – Ford: company, person, car
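A catalog-with-variants extractor might look like this minimal sketch (the catalog entries are illustrative; real catalogs would come from the conversion scripts mentioned above):

```python
import re

# Hypothetical catalog: canonical entity name -> known variants
CATALOG = {
    "International Business Machines": ["IBM", "I.B.M.", "Big Blue"],
    "Environmental Protection Agency": ["EPA", "E.P.A."],
}

# Invert the catalog so every variant maps to its canonical form
VARIANTS = {v: canon for canon, vs in CATALOG.items() for v in vs}

def extract_entities(text):
    """Catalog-based entity extraction: find known variants in the text
    and return the canonical names they resolve to."""
    found = set()
    for variant, canon in VARIANTS.items():
        # word-ish boundaries so 'EPA' does not match inside another token
        if re.search(r"(?<!\w)" + re.escape(variant) + r"(?!\w)", text):
            found.add(canon)
    return sorted(found)

print(extract_entities("The EPA cited an IBM report."))
# ['Environmental Protection Agency', 'International Business Machines']
```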
Enterprise Search Workshop
Case Study – Background
Inxight Smart Discovery
Multiple Taxonomies
– Healthcare – first target
– Travel, Media, Education, Business, Consumer Goods
Content – 800+ Internet news sources
– 5,000 stories a day
Application – Newsletters – editors using categorized results
– Easier than full automation
Enterprise Search Workshop
Case Study – Approach
Initial High-Level Taxonomy
– Auto-generation – very strange – not usable
– Editors' high level – sections of newsletters
– Editors & taxonomy pros – broad categories & refine
Develop Categorization Rules
– Multiple test collections
– Good stories, bad stories – close misses – terms
Recall and Precision Cycles
– Refine and test – taxonomists – many rounds
– Review – editors – 2-3 rounds
Repeat – about 4 weeks
Enterprise Search Workshop
Case Study – Issues & Lessons
Taxonomy Structure: aggregate vs. independent nodes
– Children nodes – subset – rare
Trade-off of depth of taxonomy and complexity of rules
No best answer – taxonomy structure, format of rules
– Need custom development
– Recall more important than precision – editors' role
Combination of SME and taxonomy pros
– Combination of features – entity extraction, terms, Boolean, filters, facts
Training sets and find-similar are weakest
Plan for ongoing refinement
Enterprise Search Workshop
Enterprise Environment – Case Studies
A Tale of Two Taxonomies – It was the best of times, it was the worst of times
Basic Approach
– Initial meetings – project planning
– High-level knowledge map – content, people, technology
– Contextual and Information Interviews
– Content Analysis
– Draft Taxonomy – validation interviews, refine
– Integration and Governance Plans
Enterprise Search Workshop
Enterprise Environment – Case One – Taxonomy, 7 facets
Taxonomy of Subjects / Disciplines:
– Science > Marine Science > Marine microbiology > Marine toxins
Facets:
– Organization > Division > Group
– Clients > Federal > EPA
– Facilities > Division > Location > Building X
– Content Type – Knowledge Asset > Proposals
– Instruments > Environmental Testing > Ocean Analysis > Vehicle
– Methods > Social > Population Study
– Materials > Compounds > Chemicals
Enterprise Search Workshop
Enterprise Environment – Case One – Taxonomy, 7 facets
Project Owner – KM department – included RM, business process
Involvement of library – critical
Realistic budget, flexible project plan
Successful interviews – build on context
– Overall information strategy – where taxonomy fits
Good draft taxonomy and extended refinement
– Software, process, team – train library staff
– Good selection and number of facets
Developed broad categorization and one deep – Chemistry
Final plans and hand-off to client
Enterprise Search Workshop
Enterprise Environment – Case Two – Taxonomy, 4 facets
Taxonomy of Subjects / Disciplines:
– Geology > Petrology
Facets:
– Organization > Division > Group
– Process > Drill a Well > File Test Plan
– Assets > Platforms > Platform A
– Content Type > Communication > Presentations
Enterprise Environment – Case Two – Taxonomy, 4 facets
Environment & Project Issues
Value of taxonomy understood, but not the complexity and scope
– Under-budgeted, under-staffed
Location – not KM – tied to RM and software
– A solution looking for the right problem
Importance of an internal library staff
– Difficulty of merging internal expertise and taxonomy
Project mindset – not infrastructure
– Rushing to meet deadlines doesn't work with semantics
Importance of integration – with team, company
– Project plan treated as more important than results
Enterprise Environment – Case Two – Taxonomy, 4 facets
Research and Design Issues
Research Issues
– Not enough research – and the wrong people
– Misunderstanding of research – wanted tinker-toy connections
• Interview 1 leads to taxonomy node 2
Design Issues
– Not enough facets
– Wrong set of facets – business, not information
– Ill-defined facets – too complex an internal structure
Enterprise Environment – Case Two – Taxonomy, 4 facets
Conclusion: Risk Factors
Political-Cultural-Semantic Environment
– Not simple resistance – more subtle
• Re-interpretation of specific conclusions and their sequence / relative importance of specific recommendations
Access to content and people
– Enthusiastic access
Importance of a unified project team
– Working communication as well as weekly meetings
Enterprise Search Workshop
Building on the Foundation
Text Analytics: Create the Platform – CM & Search
– New Electronic Publishing Process
• Use text analytics to tag; new hybrid workflow
– New Enterprise Search
• Build faceted navigation on metadata, extraction
Enhance Information Access in the Enterprise – InfoApps
– Governance, Records Management, Doc duplication, Compliance
– Applications – Business Intelligence, CI, Behavior Prediction
– eDiscovery, litigation support, Fraud detection
– Productivity / Portals – spider and categorize, extract
Enterprise Search Workshop
Information Platform: Content Management
Hybrid Model – Internal Content Management
– Publish document -> text analytics analysis -> suggestions for categorization, entities, metadata -> present to author
– Cognitive task is simple -> react to a suggestion instead of selecting from one's head or from a complex taxonomy
– Feedback – if the author overrides -> suggestion for a new category
External Information – human effort is prior to tagging
– More automated, human input as a specialized process – periodic evaluations
– Precision usually more important
– Target usually more general
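The hybrid suggest-react-feedback loop can be sketched as follows (the rules, tags, and review callback are hypothetical stand-ins for the text analytics engine and the authoring UI):

```python
def suggest_tags(text, rules):
    """Stand-in for the text analytics step: return suggested categories."""
    return [cat for cat, terms in rules.items()
            if any(t in text.lower() for t in terms)]

def publish(text, rules, author_review, feedback_log):
    """Hybrid workflow: suggest tags, let the author react,
    and log any overrides as candidates for new taxonomy categories."""
    suggested = suggest_tags(text, rules)
    final = author_review(suggested)        # author accepts / edits suggestions
    for tag in final:
        if tag not in suggested:
            feedback_log.append(tag)        # override -> review for taxonomy update
    return final

rules = {"Benefits": ["401k", "pension"], "Travel": ["per diem", "itinerary"]}
log = []
tags = publish("Updated 401k pension matching policy.", rules,
               author_review=lambda s: s + ["Policy"], feedback_log=log)
print(tags, log)  # ['Benefits', 'Policy'] ['Policy']
```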
Text Analytics and Search
Multi-dimensional and Smart
Faceted Navigation has become the basic norm
– Facets require huge amounts of metadata
– Entity / noun-phrase extraction is fundamental
– Automated with disambiguation (through categorization)
Taxonomy – two roles – subject/topics and facet structure
– Complex facets and faceted taxonomies
Clusters and Tag Clouds – discovery & exploration
Auto-categorization – aboutness, subject facets
– This is still fundamental to the search experience
– InfoApps are only as good as the fundamentals of search
People – tagging, evaluating tags, fine-tuning rules and taxonomy
Integrated Facet Application
Design Issues – General
What is the right combination of elements?
– Dominant dimension or equal facets
– Browse topics and filter by facet, search box
– How many facets do you need?
Scale requires more automated solutions
– More sophisticated rules
Issue of disambiguation:
– Same person, different name – Henry Ford, Mr. Ford, Henry X. Ford
– Same word, different entity – Ford and Ford
Number of entities and thresholds per result set / document
– Usability, audience needs
Relevance Ranking – number of entities, rank of facets
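Context-window scoring is one simple way to attack the "Ford and Ford" problem (the senses and cue-word lists are invented for illustration; real systems disambiguate through categorization, as noted above):

```python
import re

# Hypothetical context cues for disambiguating the surface form "Ford"
CONTEXT_CUES = {
    "Ford (company)": {"motor", "shares", "ceo", "automaker"},
    "Ford (person)":  {"henry", "mr", "founder", "born"},
    "Ford (car)":     {"drove", "model", "sedan", "mileage"},
}

def disambiguate(text, window=5):
    """Pick the sense whose cue words co-occur most often near 'ford'."""
    words = re.findall(r"\w+", text.lower())
    if "ford" not in words:
        return None
    i = words.index("ford")
    nearby = set(words[max(0, i - window): i + window + 1])
    scores = {sense: len(cues & nearby) for sense, cues in CONTEXT_CUES.items()}
    return max(scores, key=scores.get)

print(disambiguate("Mr. Henry Ford, the founder, was born in 1863."))
# Ford (person)
```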
Enterprise Search Workshop
Thinking Fast and Slow – Daniel Kahneman
System 1 and System 2
System 1 – fast and automatic – little conscious control
Represents categories as prototypes – stereotypes
– Norms for immediate detection of anomalies – distinguish the surprising from the normal
– Fast detection of simple differences, detect hostility in a voice, find the best chess move (if a master)
– Priming / Anchoring – susceptible to systematic errors
• Temperature example
– Biased to believe and confirm
– Focuses on existing evidence (ignores what is missing – WYSIATI)
Enterprise Search Workshop
Thinking Fast and Slow
System 2 – complex, effortful judgments and calculations
– System 2 is the only one that can follow rules, compare objects on several attributes, and make deliberate choices
– Understand complex sentences
– Check the validity of a complex logical argument
– Focus attention – can make people blind to all else – Invisible Gorilla
Similar to traditional dichotomies – Tacit vs. Explicit, etc.
Basic Design – System 1 is basic to most experiences, and System 2 takes over when things get difficult – conscious control
Text Analysis and Text Mining / Auto-Cat and TA Cat
Enterprise Search Workshop
System 1 & 2 – and Text Analytics Approaches
"Automatic Categorization" – System 1 prototypes
– Limited value – only works in simple environments
– Shallow categories with large differences
– Not open to conscious control
System 2 categories – complex, minute differences, deep categories
Together:
– Choose one or the other for some contexts
– Combine both – need to develop new kinds of categories and/or new ways to combine?
Enterprise Search Workshop
Text Mining and Text Analytics
Text Analytics and Big Data enrich each other
– Data tells you what people did, TA tells you why
Text Analytics – pre-processing for TM
– Discover additional structure in unstructured text
– Behavior Prediction – adding depth to individual documents
– New variables for Predictive Analytics, Social Media Analytics
– New dimensions – 90% of information, 50% using Twitter analysis
Text Mining for TA
– Semi-automated taxonomy development
– Apply data methods, predictive analytics to unstructured text
– New models – Watson ensemble methods, reasoning apps
Extraction – smarter extraction – sections of documents, Boolean, advanced rules – drug names, adverse events – major mention
Enterprise Search Workshop
Integration of Text and Data Analytics
Expertise Location Case Study: Data and Text
Data Sources:
– HR Information: Geography, Title-Grade, years of experience, education, projects worked on, hours logged, etc.
Text Sources:
– Documents authored (major and minor authors) – data and/or text
– Documents associated (teams, themes) – categorized to a taxonomy
– Experience descriptions – extract concepts, entities
Self-reported expertise – requires normalization, quality control
Complex judgments:
– Faceted application
– Ensemble methods – combine evaluations
Enterprise Search Workshop: Building on the Platform – Expertise Analysis
Expertise characterization for individuals, communities, documents, and sets of documents
Experts prefer lower, subordinate levels
– Novices & generalists – high and basic levels
Experts' language structure is different
– Focus on procedures over content
Applications:
– Business & Customer intelligence – add expertise to sentiment– Deeper research into communities, customers– Expertise location- Generate automatic expertise
characterization based on documents
80
Enterprise Search Workshop
New Approaches – Applied Watson
Key concept: multiple approaches are required – and a way to combine them (confidence score)
Aim = 85% accuracy on 50% of questions (Ken Jennings – 92% on 62%)
Used a combination of structured and text search
Massive parallelism, many experts, pervasive confidence estimation, integration of shallow and deep knowledge
Key step – fast filtering to get to the top 100 candidates (System 1)
Then – intense analysis to evaluate them (System 2) – multiple scoring
81
Enterprise Search Workshop
New Approaches – Applied Watson
Multiple sources – taxonomies, ontologies, etc.
Special modules – temporal and spatial reasoning, anomalies
Scorers: Taxonomic, Geospatial, Temporal, Source Reliability, Gender, Name Consistency, Relational, Passage Support, Theory Consistency, etc.
Merge answer scores before ranking
3 years, 20 researchers of all types
Got to 70% of 70% – in two hours
More difficult answers / more complete questions
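The "merge answer scores before ranking" step can be sketched as each scoring module assigning every candidate answer a confidence, with the per-module confidences merged into one score per candidate before a final sort. A toy sketch only; the scorer names echo the slide but the values are invented.

```python
# Watson-style answer merging: several scorers each rate every candidate;
# merged confidences (here a simple mean) drive the final ranking.

candidates = {
    "Toronto": {"taxonomic": 0.2, "geospatial": 0.1, "temporal": 0.4},
    "Chicago": {"taxonomic": 0.7, "geospatial": 0.8, "temporal": 0.6},
}

def merged_score(scores):
    """Merge per-module confidences into one answer score."""
    return sum(scores.values()) / len(scores)

ranked = sorted(candidates,
                key=lambda a: merged_score(candidates[a]),
                reverse=True)
print(ranked[0])  # Chicago
```

The real system learned weights for the merge rather than averaging, but the pipeline shape – score, merge, then rank – is the same.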
82
Enterprise Search Workshop: Applications
Social Media: Beyond Simple Sentiment
Beyond Good and Evil (positive and negative)
– Social media is approaching the next stage (growing up)
– Where is the value? How to get better results?
Importance of context – around positive and negative words
– Rhetorical reversals – “I was expecting to love it”
– Issues of sarcasm (“Really Great Product”), slanguage
Granularity of application
– Early categorization – politics or sports
Limited value of positive and negative
– Degrees of intensity, complexity of emotions and documents
Addition of focus on behaviors – why someone calls a support center – and likely outcomes
83
Enterprise Search Workshop: Applications
Social Media: Beyond Simple Sentiment
Two basic approaches [limited accuracy, depth]
– Statistical signature of a bag of words
– Dictionary of positive & negative words
Essential – need full categorization and concept extraction
New taxonomies – Appraisal Groups – adjectives and modifiers – “not very good”
– Supports more subtle distinctions than positive or negative
Emotion taxonomies – Joy, Sadness, Fear, Anger, Surprise, Disgust
– New complex emotions – pride, shame, confusion, skepticism
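The appraisal-group idea – scoring an adjective together with its chain of modifiers and negators rather than in isolation – can be sketched crudely as below. The lexicon values are invented, and real appraisal analysis handles negation far more subtly (e.g. "not very good" is mildly negative, not strongly so); this only shows the mechanics of modifier chains.

```python
# Crude appraisal-group scorer: modifiers scale an adjective's polarity,
# negators flip it. Lexicon values are illustrative, not from a real tool.

POLARITY = {"good": 1.0, "bad": -1.0}
MODIFIERS = {"very": 1.5, "slightly": 0.5}
NEGATORS = {"not"}

def appraisal_score(phrase):
    score, flip, strength = 0.0, 1, 1.0
    for word in phrase.lower().split():
        if word in NEGATORS:
            flip = -1
        elif word in MODIFIERS:
            strength *= MODIFIERS[word]
        elif word in POLARITY:
            score = flip * strength * POLARITY[word]
    return score

print(appraisal_score("very good"))      # 1.5
print(appraisal_score("not very good"))  # -1.5
```

Even this toy version distinguishes "good", "very good", and "not very good" – distinctions a plain positive/negative dictionary collapses.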
84
Enterprise Search Workshop: Applications
Behavior Prediction – Telecom Customer Service
Problem – distinguish customers likely to cancel from mere threats
Basic rule:
– (START_20, (AND, (DIST_7, "[cancel]", "[cancel-what-cust]"),
– (NOT, (DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
Examples (verbatim call-center notes):
– “customer called to say he will cancell his account if the does not stop receiving a call from the ad agency.”
– “cci and is upset that he has the asl charge and wants it off or her is going to cancel his act”
More sophisticated analysis of text and context in text
Combine text analytics with predictive analytics and traditional behavior monitoring for new applications
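The operators in the rule above (START_20 = look only at the opening 20 words, DIST_7/DIST_10 = proximity windows, NOT = exclude hedged threats) can be re-implemented in plain Python. This is one reading of the rule, not the vendor's engine; the concrete term lists stand in for the bracketed concept classes like "[cancel]" and "[cancel-what-cust]".

```python
import re

# Illustrative re-implementation of the slide's Boolean cancel rule:
# "cancel" near an account term in the opening words, with no hedging
# word ("if", "restore") within 10 words of "cancel".
CANCEL = {"cancel", "cancelled", "cancell"}   # includes a common misspelling
TARGET = {"account", "act", "service", "line"}
HEDGES = {"if", "restore"}

def cancel_risk(note, start=20, dist_target=7, dist_hedge=10):
    tokens = re.findall(r"\w+", note.lower())[:start]   # START_20
    c_pos = [i for i, t in enumerate(tokens) if t in CANCEL]
    t_pos = [i for i, t in enumerate(tokens) if t in TARGET]
    h_pos = [i for i, t in enumerate(tokens) if t in HEDGES]
    near_target = any(abs(c - t) <= dist_target          # DIST_7
                      for c in c_pos for t in t_pos)
    near_hedge = any(abs(c - h) <= dist_hedge            # NOT DIST_10
                     for c in c_pos for h in h_pos)
    return near_target and not near_hedge

# The second note hedges with "if", so it is a threat, not a likely cancel.
print(cancel_risk("upset about the asl charge, wants it off or going to cancel his act"))  # True
print(cancel_risk("will cancell his account if the calls do not stop"))                    # False
```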
85
Enterprise Search Workshop: Applications
Variety of New Applications
Essay evaluation software – apply to expertise characterization
– Avoid gaming the system – multi-syllabic nonsense
• Model levels of chunking, procedure words over content
Legal review
– Significant trend – computer-assisted review (manual review = too many documents)
– TA – categorize and filter to a smaller, more relevant set
– Payoff is big – one firm with 1.6M docs saved $2M
Financial services
– Trend – using text analytics with predictive analytics – risk and fraud
– Combine unstructured text (why) and structured transaction data (what)
– Customer relationship management, fraud detection
– Stock market prediction – Twitter, impact of articles
86
Enterprise Search Workshop: Applications
Pronoun Analysis: Fraud Detection – Enron Emails
Patterns of “function” words reveal a wide range of insights
Function words = pronouns, articles, prepositions, conjunctions, etc.
– Used at a high rate; short and hard to detect; very social; processed in the brain differently than content words
Areas: sex, age, power-status, personality – for individuals and groups
Lying / fraud detection: documents with lies have
– Fewer and shorter words, fewer conjunctions, more positive emotion words
– More use of “if, any, those, he, she, they, you”; less “I”
– More social and causal words, more discrepancy words
Current research – 76% accuracy in some contexts
Text analytics can improve accuracy and utilize new sources
Data analytics (standard AML) can improve accuracy
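The basic measurement behind this research – the rate of function words in a text – is simple to compute. A toy sketch with a deliberately tiny word list; real studies (e.g. LIWC-style analyses) use curated dictionaries of hundreds of function words and per-category rates.

```python
import re

# Toy function-word profiler: the kind of rate (pronouns, articles,
# conjunctions per total words) that deception research tracks.
FUNCTION_WORDS = {"i", "he", "she", "they", "you", "if", "any", "those",
                  "the", "a", "an", "and", "but", "of", "in"}

def function_word_rate(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [t for t in tokens if t in FUNCTION_WORDS]
    return len(hits) / len(tokens) if tokens else 0.0

sample = "I reviewed the numbers and they look fine to me."
print(round(function_word_rate(sample), 2))  # 0.4
```

In practice the signal comes from comparing such rates across categories ("I" vs. "they", discrepancy words, conjunctions) between suspect and baseline documents, not from a single overall rate.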
87
Enterprise Search Workshop
Conclusions
Enterprise search is broken
Search requires semantics (what is non-semantic search?)
Adding semantics requires an infrastructure approach
– People, technology, processes, content & content structure
Text analytics can change the game – in conjunction with other infrastructure elements
Semantic search as a platform for search-based applications – the payoff is enormous
Want to learn more? Come to Text Analytics World in San Francisco in March!
– Early-bird registration – www.textanalyticsworld.com
Questions?
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
89
Resources
Books
– Women, Fire, and Dangerous Things
• George Lakoff
– Knowledge, Concepts, and Categories
• Koen Lamberts and David Shanks
– Formal Approaches in Categorization
• Ed. Emmanuel Pothos and Andy Wills
– The Mind
• Ed. John Brockman
• Good introduction to a variety of cognitive science theories, issues, and new ideas
– Any cognitive science book written after 2009
90
Resources
Conferences – Web Sites
– Text Analytics World – all aspects of text analytics
• March 17-19, San Francisco
• http://www.textanalyticsworld.com
– Semtech
• http://www.semanticweb.com
91
Resources
Blogs
– SAS – http://blogs.sas.com/text-mining/
Web Sites
– Taxonomy Community of Practice: http://finance.groups.yahoo.com/group/TaxoCoP/
– LinkedIn – Text Analytics Summit Group – http://www.LinkedIn.com
– Whitepaper – CM and Text Analytics – http://www.textanalyticsnews.com/usa/contentmanagementmeetstextanalytics.pdf
– Whitepaper – Enterprise Content Categorization strategy and development – http://www.kapsgroup.com
92
Resources
Articles
– Malt, B. C. 1995. Category coherence in cross-cultural perspective. Cognitive Psychology 29, 85-148.
– Rifkin, A. 1985. Evidence for a basic level in event taxonomies. Memory & Cognition 13, 538-56.
– Shaver, P., J. Schwartz, D. Kirson, C. O'Connor 1987. Emotion knowledge: further explorations of a prototype approach. Journal of Personality and Social Psychology 52, 1061-1086.
– Tanaka, J. W. & M. E. Taylor 1991. Object categories and expertise: is the basic level in the eye of the beholder? Cognitive Psychology 23, 457-82.