Addressing scalability challenges in peer-to-peer search
![Page 1: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/1.jpg)
Addressing scalability challenges in peer-to-peer search
PhD seminar 4 Feb, 2014
Harisankar H, PhD scholar,
DOS lab, Dept. of CSE
Advisor: Prof. D. Janakiram
http://harisankarh.wordpress.com
![Page 2: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/2.jpg)
Outline
• Issues with centralized search
– Can peer-to-peer search help?
• Scalability challenges in peer-to-peer search
• Proposed architectural extensions
– Two-layered architecture for peer-to-peer concept search
– Cloud-assisted approach to handle query spikes
![Page 3: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/3.jpg)
Centralized search scenario
• Scenario
– Search engines crawl available content, then index and maintain it in data centers
– User queries are directed to data centers, processed internally, and results are sent back
– Centrally managed by a single company
[Figure: content and end users connected to datacenters]
![Page 4: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/4.jpg)
Some issues with centralized search
– Privacy concerns
  • All user queries accessible from a single location
– Centralized control
  • Individual companies decide what to index (or not), how to rank, etc.
– Transparency
  • Complete details of ranking, pre-processing, etc. are not made publicly available
  • Concerns of censorship and doctoring of results
![Page 5: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/5.jpg)
Some issues with centralized search (contd.)
• Uses mostly syntactic search techniques
– Based on words or multi-word phrases
– Low quality of results due to ambiguity of natural language
• Issues with centralized semantic search
– Difficult to capture the long tail of users' niche interests
  • Requires rich human-generated knowledge bases in numerous niche areas
![Page 6: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/6.jpg)
Peer-to-peer search approach
• Edge nodes in the internet participate in providing and using the search service
• Search as a collaborative service
• Crawling, indexing and search distributed across the peers
![Page 7: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/7.jpg)
How could peer-to-peer search help?
• Each user query can be sent to a different peer among millions
– Obtaining query logs in a single location is difficult
– Reduced privacy concerns
• Distributed control across numerous peers
– Avoids centralized control
• Search application available with all peers
– Better transparency in ranking etc.
• Background knowledge of peers can be utilized for effective semantic search
– Can help improve quality of results
• Has led to a lot of academic research in the area as well as real-world p2p search engines*
* e.g., faroo.com, yacy.net; YacyPi Kickstarter project
![Page 8: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/8.jpg)
Realizing peer-to-peer search
• Distribution of search index
– Term partitioning
  • Responsibility for individual terms assigned to different peers
  • E.g., peer1 is currently responsible for the term “computer”
• Term-to-peer mapping achieved through a structured overlay (e.g., DHT)
Image src: http://wwarodomfr.blogspot.in/2008/09/chord-routing.html
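As a rough illustration of term-to-peer mapping, the sketch below places peers and terms on a hash ring and assigns each term to the first peer at or after its hash, the way Chord-style DHTs do. The peer names, the use of SHA-1, and the 32-bit ring are assumptions made for the example; a real DHT resolves the lookup with routed messages rather than a local sorted list.

```python
import bisect
import hashlib

RING_BITS = 32  # assumed ring size for this sketch


def ring_id(key: str) -> int:
    """Hash a string onto the identifier ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << RING_BITS)


class TermRing:
    """Maps each term to the peer whose id is the term's successor on the ring."""

    def __init__(self, peers):
        self._ids = sorted((ring_id(p), p) for p in peers)

    def responsible_peer(self, term: str) -> str:
        keys = [i for i, _ in self._ids]
        pos = bisect.bisect_left(keys, ring_id(term))
        return self._ids[pos % len(self._ids)][1]  # wrap around the ring

    def add_peer(self, peer: str) -> None:
        bisect.insort(self._ids, (ring_id(peer), peer))


# Hypothetical peers; "computer" is the example term from the slide.
ring = TermRing(["peer1", "peer2", "peer3", "peer4"])
owner = ring.responsible_peer("computer")
```

When a peer joins, only the terms that fall between the new peer and its predecessor change owner, which is why this kind of mapping tolerates peers joining and leaving.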
![Page 9: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/9.jpg)
Scalability challenges in peer-to-peer search
• Peers share only idle resources
• Peers join/leave autonomously
• Limited individual resources
• No SLA
• These lead to
– Peer bandwidth bottleneck during query processing
  • Particularly for queries involving multiple terms (index transfer between multiple peers)
– Instability during query spikes
• Knowledge management issues at large scale
– Difficult to reach consensus at large scale
– Needs wide understanding and has to meet the requirements of a large, diverse group
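To make the bandwidth bottleneck for multi-term queries concrete, the sketch below intersects posting lists that live on different peers and counts how many postings must be shipped between them. Starting from the shortest list is a common heuristic and an assumption here, not necessarily the scheme used in the talk; the terms and document ids are made up.

```python
def intersect_with_transfer(posting_lists):
    """Intersect posting lists held on different peers.

    Returns (matching doc ids, number of postings shipped between peers).
    """
    ordered = sorted(posting_lists.values(), key=len)
    current = set(ordered[0])
    transferred = 0
    for postings in ordered[1:]:
        transferred += len(current)  # ship the running result to the next peer
        current &= set(postings)
    return sorted(current), transferred


# Hypothetical index: each term's posting list is stored on a different peer.
index = {
    "peer": [1, 2, 3, 5, 8, 13],
    "search": [2, 3, 5, 7],
    "scalability": [3, 5],
}
docs, cost = intersect_with_transfer(index)
```

With realistic posting lists of millions of entries, the shipped count is what saturates a peer's uplink.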
![Page 10: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/10.jpg)
Two-layered architecture for peer-to-peer concept search*
• Peers organized as communities based on common interest
• Each community maintains its own background knowledge to use in semantic search
– Maintained in a distributed manner
• A global layer with aggregate information to facilitate search across communities
• Background knowledge bases extend the minimal, universally accepted knowledge in the upper layer
• Search, indexing and knowledge management proceed independently in each community
*Joint work with Prof. Fausto Giunchiglia and Uladzimir Kharkevich, Univ. of Trento
![Page 11: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/11.jpg)
Two-layered architecture for peer-to-peer concept search
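[Figure: three communities (Community-1, Community-2, Community-3), each with its own background knowledge (BK-1 to BK-3) and document index (doc index-1 to doc index-3), under a GLOBAL layer holding universal knowledge (UK) and a community index]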
![Page 12: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/12.jpg)
Two-layered architecture
• Global layer
– retrieves relevant communities for query based on universal knowledge
• Community layer
– retrieves relevant documents for query based on background knowledge of community
![Page 13: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/13.jpg)
Overcoming the shortcomings of single-layered approaches
• Search can be scoped to only the communities relevant to a query
– Results in fewer bandwidth-related issues
• Two layers make knowledge management scalable and interoperable
– Niche interests supported by community-level background knowledge bases
– Minimal universal knowledge for interoperability
• Search within a community is based on the community’s background knowledge
– Focused interest of the community helps in better term-to-concept mapping
![Page 14: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/14.jpg)
Two-layered approach
• Index partitioning
– Uses partition-by-term
  • Posting list for each term stored on a different peer
– Uses a Distributed Hash Table (DHT) to realize dynamic term-to-peer mapping
  • O(log N) hops for each lookup
• Overlay network
– Communities and global layer maintained using a two-layered overlay
  • Based on our earlier work on computational grids*
– O(log N) hops for lookup even with two layers
*M.V. Reddy, A.V. Srinivas, T. Gopinath, and D. Janakiram, “Vishwa: A reconfigurable P2P middleware for Grid Computations,” in ICPP'06
![Page 15: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/15.jpg)
Two-layered approach
• Community management
– Similar to public communities in Flickr, Facebook, etc.
• Search within community
– Uses Concept Search* as the underlying semantic search scheme
  • Extends syntactic search with available knowledge to realize semantic search
  • Falls back to syntactic search when no knowledge is available
*Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu, “Concept search”, ESWC 2009
![Page 16: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/16.jpg)
Two-layered approach
• Knowledge representation
– Term -> concept mapping
– Concept hierarchy
  • Concept relations expressed as subsumption relations
• Concepts in documents/queries extracted
– By analyzing words and natural language phrases
– Noun phrases translated into conjunctions of atomic concepts (complex concepts)
  • E.g., small-1 ∧ dog-2
– Documents/queries represented as enumerated sequences of complex concepts
  • E.g., 1: small-1 ∧ dog-2, 2: big-1 ∧ animal-3
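One possible in-memory form of this representation is sketched below: an atomic concept is a string such as "dog-2", a complex concept is a set of atomic concepts read as a conjunction, and a document or query is a numbered sequence of complex concepts. The type names and helper functions are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Sequence, Tuple

AtomicConcept = str                        # e.g. "dog-2": word plus sense number
ComplexConcept = FrozenSet[AtomicConcept]  # conjunction of atomic concepts


def complex_concept(*atoms: AtomicConcept) -> ComplexConcept:
    """Build the conjunction of the given atomic concepts."""
    return frozenset(a.lower() for a in atoms)


def enumerate_concepts(
    noun_phrases: Sequence[Sequence[AtomicConcept]],
) -> List[Tuple[int, ComplexConcept]]:
    """Represent a document or query as a numbered sequence of complex concepts."""
    return [
        (position, complex_concept(*atoms))
        for position, atoms in enumerate(noun_phrases, start=1)
    ]


# The slide's example: position 1 is small-1 AND dog-2, position 2 is big-1 AND animal-3.
doc = enumerate_concepts([("small-1", "dog-2"), ("big-1", "animal-3")])
```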
![Page 17: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/17.jpg)
Two-layered approach
• Relevance model
– Documents having more specific concepts than the query concepts are considered relevant
  • E.g., poodle-1 is relevant when searching for dog-2
– Ranking done by extending the tf-idf relevance model
  • Also incorporates term-concept and concept-concept similarities
• Distributed knowledge maintenance
– Each atomic concept indexed on the DHT by its id
– The node responsible for each atomic concept id also stores the ids of
  • All immediate more specific atomic concepts
  • All concepts on the path from the atomic concept to the root
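The subsumption check and the per-concept record can be sketched as follows. The toy hierarchy (including cat-1 and the parent links) is hypothetical; only poodle-1 and dog-2 come from the slide. In the described system each record would live on the peer responsible for that concept id rather than in one dictionary.

```python
# Hypothetical toy hierarchy: child concept -> immediately more general concept.
PARENT = {
    "poodle-1": "dog-2",
    "dog-2": "animal-3",
    "cat-1": "animal-3",
    "animal-3": None,
}


def path_to_root(concept):
    """Ids of all more general concepts, from the parent up to the root."""
    path = []
    node = PARENT[concept]
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def concept_record(concept):
    """What the peer responsible for `concept` would store in this sketch."""
    return {
        "children": sorted(c for c, p in PARENT.items() if p == concept),
        "path_to_root": path_to_root(concept),
    }


def is_relevant(doc_concept, query_concept):
    """A document concept matches if it equals or is more specific than the query concept."""
    return doc_concept == query_concept or query_concept in path_to_root(doc_concept)
```

Storing the path to the root with each concept lets a peer answer the subsumption test locally, without walking the hierarchy across the network.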
![Page 18: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/18.jpg)
Two-layered approach
• Document indexing and search
– Concepts mapped to peers using the DHT
– Query routed to the peers responsible for the query concepts and related concepts
– Results from multiple peers combined to give the final results
• Global search
– The popularity (document frequency) of each concept is indexed in the upper layer
– Tf-idf extended with universal knowledge to search for communities
– Combined score of doc = (score of community) * (score of doc within community)
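The combined-score rule can be sketched as a two-step ranking: pick the best-scoring communities in the global layer, then multiply each document's within-community score by its community's score. The community names, scores, and the `top_communities` cut-off are made-up values for illustration.

```python
def global_search(community_scores, docs_by_community, top_communities=2, k=3):
    """Rank documents by (score of community) * (score of doc within community)."""
    ranked = sorted(community_scores.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for community, c_score in ranked[:top_communities]:
        # Only the selected communities are searched at all.
        for doc, d_score in docs_by_community[community].items():
            results.append((doc, c_score * d_score))
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:k]


# Hypothetical scores from the global layer and from each community's own search.
community_scores = {"pets": 0.9, "cars": 0.2, "science": 0.5}
docs_by_community = {
    "pets": {"d1": 0.8, "d2": 0.4},
    "science": {"d3": 0.9},
    "cars": {"d4": 1.0},
}
top = global_search(community_scores, docs_by_community)
```

Note that d4 never appears even though it has the highest within-community score: its community is not selected, so no postings are fetched from it.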
![Page 19: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/19.jpg)
Experiments
• Single layer, syntactic vs semantic: TREC ad-hoc, TREC8 (simulated with 10,000 peers)
– WordNet as knowledge base
• Single vs 2 layer
– 18 communities (doc: categories in dMoz*)
  • 18*1000 = 18,000 peers simulated
– UK = domain-independent concepts and relations from WordNet
– BK = UK + WordNet domains + YAGO
– BK mapped to communities
– Queries selected as directory path to a specific subdirectory
– Standard result: documents in that subdirectory
*http://www.dmoz.org/
![Page 20: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/20.jpg)
Experiments
• Tools
– GATE (NLP), Lucene (search library), PeerSim (peer-to-peer system simulator)
• Performance metrics
– Quality
  • Precision@10, precision@20
  • Mean average precision (MAP)
– Network bandwidth
  • Average number of postings transferred
– Response time
  • s-postings, s-hops
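For reference, the quality metrics listed above can be computed as in the following sketch, which uses the standard definitions of precision@k and mean average precision. The ranked lists and relevance sets are invented examples, not data from the experiments.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with their ranked results and relevant documents.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
p_at_2 = precision_at_k(runs[0][0], runs[0][1], 2)
map_score = mean_average_precision(runs)
```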
![Page 21: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/21.jpg)
Results (1 layer syntactic vs semantic)
• Quality improved
• But, cost also increased
![Page 22: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/22.jpg)
Results (1 layer vs 2 layer)
• Quality improved
• Cost decreased
– 94% decrease in posting transfer for opt. case
![Page 23: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/23.jpg)
Two-layered approach results
• Proposed approach gives better quality and performance than single-layered approaches
– Performance can be further improved using optimizations like early termination
• But the issue of query spikes remains
![Page 24: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/24.jpg)
Query spikes in peer-to-peer search
• Query spikes can lead to instability
– Replication/caching insufficient due to high document creation rate*
E.g., the rate of queries related to “Bin Laden” on Google increased 10,000-fold within one hour on May 1, 2011, after Operation Geronimo.
![Page 25: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/25.jpg)
Some background
• Term-partitioned search
– Responsibility for each term/popular query is assigned to an individual peer
  • Updates and queries are sent to the responsible peer, which processes them
– Term -> peer mapping done using a Distributed Hash Table (DHT)
[Figure label: top-k result list of q]
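A minimal sketch of what the responsible peer does for one query q, assuming it keeps a score per document and answers with the current top-k list. The class name, the scoring values, and the use of a plain dictionary are assumptions for illustration; the slides do not specify the data structure.

```python
import heapq


class ResponsiblePeer:
    """Holds the results for one term/popular query and serves its top-k list."""

    def __init__(self, k: int):
        self.k = k
        self._scores = {}  # doc id -> latest score for this query

    def update(self, doc_id: str, score: float) -> None:
        """Apply an update (new or re-scored document) sent to this peer."""
        self._scores[doc_id] = score

    def query(self):
        """Return the current top-k (doc id, score) pairs, best first."""
        return heapq.nlargest(self.k, self._scores.items(), key=lambda kv: kv[1])


peer = ResponsiblePeer(k=2)
for doc_id, score in [("d1", 0.3), ("d2", 0.9), ("d3", 0.5), ("d1", 0.95)]:
    peer.update(doc_id, score)
top_k = peer.query()
```

Because every update and every query for q reaches this single peer, a spike in either one lands entirely on its limited bandwidth and CPU.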
![Page 26: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/26.jpg)
Cloud-assisted p2p search(CAPS)
• Offload responsibilities of spiking queries to public cloud
![Page 27: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/27.jpg)
Issues in realizing CAPS
• Maintaining a full index copy in the cloud is very expensive
– Storage alone will cost more than 5 million dollars per month*
• Approach: transfer only the relevant index portion to the cloud
– Needs to be performed fast, considering the effect on user experience (result quality, response time)
• Effect on the desirable properties of peer-to-peer search
– Privacy, transparency, decentralized control, etc.
![Page 28: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/28.jpg)
CAPS components
• Switching decision maker
– Decide when to switch
– Can be a simple rule, e.g., “switch when query rate increases by X% within last Y seconds”
• Switching implementor
– Switching algorithm to seamlessly transfer index partition
– Dynamic creation of cloud instances
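The example rule for the switching decision maker could be implemented along these lines. This sketch interprets "increases by X% within last Y seconds" as comparing the query count in the most recent Y-second window with the count in the Y-second window before it; that interpretation, the class name, and the choice not to switch without a baseline are all assumptions.

```python
from collections import deque


class SwitchingDecisionMaker:
    """Decides to switch when the query rate grows by at least X% between windows."""

    def __init__(self, x_percent: float, y_seconds: float):
        self.x = x_percent
        self.y = y_seconds
        self._times = deque()  # arrival times of recent queries

    def record_query(self, t: float) -> None:
        self._times.append(t)
        # Keep only the two most recent windows' worth of timestamps.
        while self._times and self._times[0] <= t - 2 * self.y:
            self._times.popleft()

    def should_switch(self, now: float) -> bool:
        recent = sum(1 for t in self._times if now - self.y < t <= now)
        previous = sum(
            1 for t in self._times if now - 2 * self.y < t <= now - self.y
        )
        if previous == 0:
            return False  # assumption: no baseline yet, so do not switch
        return (recent - previous) / previous * 100 >= self.x


# Hypothetical trace: 5 queries in the first 10 s, then 20 queries in the next 10 s.
spiking = SwitchingDecisionMaker(x_percent=200, y_seconds=10)
for t in [2, 4, 6, 8, 10]:
    spiking.record_query(t)
for i in range(1, 21):
    spiking.record_query(10 + 0.5 * i)
decision = spiking.should_switch(now=20)
```

Here the rate rises from 5 to 20 queries per window, a 300% increase, so the rule fires with X = 200.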
![Page 29: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/29.jpg)
CAPS Switching algorithm
• Ensures that result quality is not affected
• Controlled bandwidth usage at peer
![Page 30: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/30.jpg)
Addressing additional concerns
• Transparency
– Index resides both among peers and in the cloud
• Centralized control
– Queries can be switched back to peers or to other clouds
• Privacy
– Only spiking queries (less revealing) are forwarded to the cloud
• Cost
– Cloud used only transiently for spiking queries
• Cloud payment model
– Anonymous keyword-based advertising model*
![Page 31: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/31.jpg)
CAPS Evaluation
• Experimental setup
– Target system consists of millions of peers
– Implemented the relevant components in a realistic network
  • Responsible peer, preceding peers, cloud instance
• Datasets
– Real datasets on queries and corresponding updates (rates) are not publicly available
– Used synthetic queries and updates with expected query/update rates and ratio
![Page 32: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/32.jpg)
Experimental setup
• 6 heterogeneous workstations with 4-6 cores, 8-16GB RAM used
![Page 33: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/33.jpg)
Experiments
• Two sets of experiments
1. Demonstrate effect of query spike with and without cloud-assistance
2. Effect of switching on user experience
• Response time and result quality
• Switching time
![Page 34: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/34.jpg)
Results-1
[Figure panels: with cloud assistance; without cloud assistance]
![Page 35: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/35.jpg)
Results-2(effect of switching on user experience)
• Result freshness
• Response time
![Page 36: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/36.jpg)
Switching time
![Page 37: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/37.jpg)
Conclusions
• Peer-to-peer search has many advantages by design compared to centralized search
• But, peer-to-peer search approaches have scalability issues
• Two-layered approach to peer-to-peer search can improve efficiency and result quality of peer-to-peer search
• Offloading queries to the cloud can be an effective method to handle query spikes
– Desirable properties of p2p systems are not lost
![Page 38: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/38.jpg)
Publications
• Janakiram Dharanipragada and Harisankar Haridas, “Stabilizing peer-to-peer systems using public cloud: A case study of peer-to-peer search”, in the 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012), held at Munich, Germany.
• Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas and Uladzimir Kharkevich, “Two-layered architecture for peer-to-peer concept search”, in the 4th International Semantic Search Workshop located at the 20th Int. World Wide Web Conference (WWW 2011), held at Hyderabad, India.
• Harisankar Haridas, Sriram Kailasam, Prateek Dhawalia, Prateek Shrivastava, Santosh Kumar and Janakiram Dharanipragada, “V-cloud: A Peer-to-peer Video Storage-Compute Cloud”, in the 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2012), held at Delft, The Netherlands [Poster].
![Page 39: Addressing scalability challenges in peer-to-peer search](https://reader034.fdocument.pub/reader034/viewer/2022051609/54620e3fb1af9f936c8b4d37/html5/thumbnails/39.jpg)
THANK YOU
Questions/Suggestions
harisankarh[ at ]gmail.com