Putting Open Data on Open Source Platforms: Development Experience with CKAN, an Open Source Data Portal Platform
Putting Open Data on Open Source Platforms: Development Experience with CKAN, an Open Source Data Portal Platform
@ FOSS and Project Collaboration (Spring 2015)
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Taiwan License.
Presenter: 李承錱 Cheng-Jen Lee (Sol)
Email: cjlee AT iis.sinica.edu.tw
2
About Me
● Sol, @u10313335
● Institute of Information Science, Academia Sinica
● https://about.me/SolLee
● Python / R / Java
● Focused Areas
– CMS
– Data Repository
– Open Data
– *nix System Administration
3
Agenda
● Open Data and Open Data Portals
● About CKAN
● CKAN and 5 Open Data★
● Experiences
● Contribution: What and How?
4
Open Data and Open Data Portals
● Open Data
– The idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control [1].
● Open Data Portals
– Facilitate access to and re-use of public sector information [2].
– “Infrastructure” of open data
1. Wikipedia: Open data, https://en.wikipedia.org/wiki/Open_data
2. Open Data Portals, Digital Agenda for Europe, http://ec.europa.eu/digital-agenda/en/open-data-portals
5
About CKAN
6
CKAN
● The Comprehensive Knowledge Archive Network
● A powerful data management system
– Publishing
– Sharing
– Finding
– Using Data
7
Screenshot
8
The Most Popular Platform for Open Data
116 instances around the world in March 2015
http://ckan.org/instances
9
The Most Popular Platform for Open Data
● Widely used in government data portals
– In EU member states, 30% of open data portals had adopted CKAN (OpenDataMonitor [1], March 2015)
● Workflow support for publishing data
● Data Visualization
● 100+ Extensions
● Powerful APIs
● Open-sourced (AGPLv3)
1. http://www.opendatamonitor.eu
10
United Kingdom: DATA.GOV.UK
11
United States: DATA.GOV
12
Japan: DATA.GO.JP
13
European Union: PUBLICDATA.EU
14
Tainan City: DATA.TAINAN.GOV.TW
15
Nantou County: DATA.NANTOU.GOV.TW
16
Hsinchu City: OPENDATA.HCCG.GOV.TW
17
Taipei City: DATA.TAIPEI
18
Taijiang Inner Sea Research Datasets (台江內海研究資料集): TAIJIANG.TW
20
Publish Datasets
① Add Dataset Information
21
Publish Datasets
② Add Data under the Dataset
22
Find Datasets
By Keyword
23
Find Datasets
By Location
24
Find Datasets
By filters
25
Data Preview and Visualization
recline_view (csv, xls): Grid
26
Data Preview and Visualization
recline_view (csv, xls): Graph
27
Data Preview and Visualization
recline_view (csv, xls): Lat/Long fields
28
Data Preview and Visualization
wms_preview
29
Data Preview and Visualization
geojson_preview
30
Data Preview and Visualization
● Docs: recline_view, text_view, json_view, pdf_view, webpage_view, officedocs_view...
● Pics: image_view
● And more!
31
Authorization
organization
http://opendata.hccg.gov.tw/organization
32
Data Exchange
Harvest and Federation
33
CKAN and 5 Open Data★ [1]
1. Tim Berners-Lee, “Linked Data”, http://www.w3.org/DesignIssues/LinkedData.html
34
CKAN and 5 Open Data★
● ★ Make your stuff available on the Web (whatever format) under an open license
– Customizable licenses
35
CKAN and 5 Open Data★
● ★★ Make it available as structured data (e.g., Excel instead of image scan of a table)
● ★★★ Use non-proprietary formats (e.g., CSV instead of Excel)
– Upload any data format
– Data API
  ● Get records from structured data
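As a rough illustration of the Data API bullet, the sketch below builds a request URL for CKAN's `datastore_search` action and pulls the records out of a response. The site URL, the resource ID, and the sample response are placeholders made up for this example, and no network call is made.

```python
import json
from urllib.parse import urlencode


def datastore_search_url(site_url, resource_id, limit=5):
    """Build a URL for CKAN's datastore_search action (Action API v3)."""
    query = urlencode({"resource_id": resource_id, "limit": limit})
    return "{0}/api/3/action/datastore_search?{1}".format(site_url.rstrip("/"), query)


def extract_records(response_text):
    """Return the list of records from a datastore_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    return payload["result"]["records"]


# Hypothetical site and resource ID, for illustration only.
url = datastore_search_url("http://demo.example.org/", "hypothetical-resource-id", limit=2)

# Hand-written sample shaped like a datastore_search response.
sample = '{"success": true, "result": {"records": [{"_id": 1, "name": "A"}, {"_id": 2, "name": "B"}]}}'
records = extract_records(sample)
```

In a real client, the URL would be fetched with an HTTP library and the response body passed to `extract_records`.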
36
CKAN and 5 Open Data★
● ★★★★ Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
● ★★★★★ Link your data to other data to provide context
– Built-in RDF exporting capabilities
– Expose or consume metadata from other catalogs using RDF (DCAT) documents [1]
● ckanext-qa [2]: Check the openness of datasets or resources
1. Supported by the ckanext-dcat extension
2. https://github.com/ckan/ckanext-qa
37
Experiences
38
System Architecture
39
Installation
● Official Documents:
– http://docs.ckan.org/en/latest/
● Installation Notes (In Chinese):
– https://ckan-docs-tw.readthedocs.org/
40
Customizations for Taijiang.tw
● Custom Metadata
● Data Visualization
● Custom filters
● Harvest
● Localization
● Source Code Released under AGPLv3 (On GitHub: u10313335)
– ckanext-taijiang
– ckanext-spatial
– taijiang-ckan-translations
– taijiang-bulk-uploader
41
Custom Metadata
● Extension ckanext-scheming [1]
– Configure and share CKAN schemas using a JSON schema description.
– Custom template snippets for editing and displaying fields.
Template Name          Function
text.html              a simple text field for free-form text
large_text.html        a larger text field
date.html              a date widget
markdown.html          a markdown field
select.html            a select box
multiple_choice.html   a group of checkboxes
repeating.html         a repeating field
1. https://github.com/open-data/ckanext-scheming, only for CKAN 2.3+
42
Custom Metadata – Example
select preset:
{
  "field_name": "data_type",
  "label": {"en": "Data Type", "zh_TW": "資料類型"},
  "preset": "select",
  "form_attrs": {"data-module": "autocomplete"},
  "choices": [{"value": "statistics", "label": "Statistics"}]
}
repeating_text preset:
{
  "field_name": "ref",
  "preset": "repeating_text",
  "label": {"en": "Reference", "zh_TW": "參考來源"},
  "form_blanks": 3
}
43
Validator and Converter
● Ensure data quality
44
Validator and Converter
● Validator
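– Validate user inputs
– Ex. json_validator
    import json
    from ckan.plugins.toolkit import Invalid

    def json_validator(value, context):
        # Empty input is accepted unchanged.
        if value == '':
            return value
        # Reject anything that does not parse as JSON.
        try:
            json.loads(value)
        except ValueError:
            raise Invalid('Invalid JSON')
        return value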
45
Validator and Converter
● Converter
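– Convert data for storage
– Ex. duplicate_validator
    import json

    def duplicate_validator(key, data, errors, context):
        # Skip fields that already failed validation.
        if errors[key]:
            return
        value = json.loads(data[key])
        # Drop duplicate items (set() does not preserve order).
        unduplicated = list(set(value))
        data[key] = json.dumps(unduplicated)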
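The converter's core logic can be tried outside CKAN. The sketch below is a standalone variant, not the code used on taijiang.tw: `dedupe_json_list` and its `keep_order` flag are names invented for this example. It also shows an order-preserving alternative, since `list(set(...))` may reorder the items.

```python
import json


def dedupe_json_list(raw, keep_order=True):
    """Remove duplicate items from a JSON-encoded list of hashable values."""
    values = json.loads(raw)
    if keep_order:
        # dict.fromkeys keeps the first occurrence of each item, in order.
        unique = list(dict.fromkeys(values))
    else:
        # set() also removes duplicates but does not guarantee order.
        unique = list(set(values))
    return json.dumps(unique)


result = dedupe_json_list('["a", "b", "a", "c", "b"]')
```

Whether order matters depends on the field; for something like a list of references, keeping the order the user typed is usually the less surprising choice.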
46
Data Visualization
● There is no viewer for some GIS formats
– WMTS services
– ESRI Shapefile (*.shp and *.dbf)
● Do It Ourselves!
– wmts_view
– shp_view
47
Write a CKAN Plugin
● PyUtilib Component Architecture (PCA)
● Inherits from
– ckan.plugins.SingletonPlugin
● Implements
– one (or several) ckan.plugins.* interfaces
48
To Build a "viewer"
● We need more…
– View template (Jinja template engine)
– JavaScript module
● Ex. Shapefile preview includes shp2geojson.js [1].
1. http://gipong.github.io/shp2geojson.js/ (Released under MIT license)
49
Example: Plugin for SHP Preview
Python Plugin
    from ckan import plugins as p

    class SHPView(p.SingletonPlugin):
        p.implements(p.IResourceView, inherit=True)

        # self.SHP, self.same_domain and self.proxy_is_enabled are not defined on the slide.
        def info(self):
            return {'name': 'shp_view',
                    'title': 'shp',
                    'icon': 'map-marker',
                    'iframed': True,
                    'default_title': 'SHP',
                    }

        def can_view(self, data_dict):
            resource = data_dict['resource']
            format_lower = resource['format'].lower()
            if format_lower in self.SHP:
                return self.same_domain or self.proxy_is_enabled
            return False

        def view_template(self, context, data_dict):
            return 'dataviewer/shp.html'
View Template (shp.html)
    <div data-module="shppreview" id="data-preview" data-module-map_config="{{ h.dump_json(map_config) }}"></div>
JS Module (shp_view.js)
    // shapefile preview module
    ckan.module('shppreview', function (jQuery, _) {
      return {
        initialize: function () { … },
        showPreview: function (url, data) { … }
      };
    });
50
Result
http://taijiang.tw/dataset/tainangis-wmts
wmts_view
51
Result
shp_view QGIS
http://taijiang.tw/dataset/proj4-29
shp_view
52
Custom Filters
● Find Datasets by
– Time period
– Self-defined categories
● A New Plugin
– For Time Search
  ● Implement IPackageController.before_search
– For Self-defined Categories
  ● Implement IPackageController.before_index and IFacets.dataset_facets
– Both need new definitions in the Solr schema
53
Example: Plugin for Time Search
Python Plugin
    from ckan import plugins as p

    class TaijiangDatasets(p.SingletonPlugin):
        p.implements(p.IPackageController, inherit=True)
        p.implements(p.IFacets)

        def before_search(self, search_params):
            …
            begin = parse_date(search_params['extras']['ext_begin_date'])
            end = parse_date(search_params['extras']['ext_end_date'])
            ...
            query = ("(start_time: [* TO {0}Z] AND end_time: [{0}Z TO *]) OR "
                     "(start_time: [{0}Z TO {1}Z] AND end_time: [{0}Z TO *])")
            query = query.format(begin.isoformat(), end.isoformat())
            search_params['q'] = query
            return search_params

        def dataset_facets(self, facets_dict, package_type):
            facets_dict['date_facet'] = p.toolkit._('Date of Dataset')
            return facets_dict
Solr Schema
    <dynamicField name="*_time" type="date" indexed="true" stored="true" multiValued="false"/>
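To see what the time-search query looks like once the dates are filled in, the string-building step can be run on its own. This is a minimal sketch that reuses the query template from the slide; the function name `build_time_query` and the example dates are made up for illustration.

```python
from datetime import datetime


def build_time_query(begin, end):
    """Build the Solr range query used for the time-period search."""
    template = (
        "(start_time: [* TO {0}Z] AND end_time: [{0}Z TO *]) OR "
        "(start_time: [{0}Z TO {1}Z] AND end_time: [{0}Z TO *])"
    )
    return template.format(begin.isoformat(), end.isoformat())


query = build_time_query(datetime(2015, 1, 1), datetime(2015, 3, 31))
```

The first clause matches datasets that start on or before the search window opens and are still running at that point; the second matches datasets that start inside the window.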
54
Result
55
Harvest
● ckanext-harvest
– Remote harvesting extension
– https://github.com/okfn/ckanext-harvest
● Source Type
– CKAN
– CSW* (Catalog Service for the Web)
– WAF* (Web Accessible Folder)
– Custom (csv/xls/website, etc.)
*Provided by ckanext-spatial
56
Harvest: Job Dashboard
57
Harvest: Background Process
● Manually
– (pyenv) $ paster --plugin=ckanext-harvest harvester gather_consumer/fetch_consumer/run -c /etc/ckan/default/production.ini
● Automatically
– Supervisor (for gather & fetch consumer)
– Cron (for run)
58
Harvest: The Harvesting Interface
    from ckanext.harvest.harvesters.base import HarvesterBase

    class SRDAHarvester(HarvesterBase):
        def _set_config(self, config_str): ...
        def info(self): ...
        def gather_stage(self, harvest_job): ...
        def fetch_stage(self, harvest_object): ...
        def import_stage(self, harvest_object): ...
See http://goo.gl/ZMnND7 for details.
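The division of labour between the three stages can be sketched without CKAN. The toy class below only mimics the flow (gather identifiers, fetch raw content, import it as a dataset) using plain dictionaries; `ToyHarvester`, `run_job`, and the record layout are invented for this example and are not part of ckanext-harvest.

```python
class ToyHarvester:
    """Stand-in that mimics the gather/fetch/import stages with plain dicts."""

    def __init__(self, remote_catalog):
        # remote_catalog plays the role of the remote source: {id: raw_record}
        self.remote_catalog = remote_catalog
        self.local_datasets = {}

    def gather_stage(self, harvest_job):
        # Collect the identifiers of everything that should be harvested.
        return sorted(self.remote_catalog)

    def fetch_stage(self, harvest_object):
        # Retrieve the raw content for one identifier.
        harvest_object["content"] = self.remote_catalog[harvest_object["guid"]]
        return True

    def import_stage(self, harvest_object):
        # Turn the fetched content into a local dataset.
        content = harvest_object["content"]
        self.local_datasets[harvest_object["guid"]] = {"title": content["title"].strip()}
        return True


def run_job(harvester):
    """Drive the three stages in order and count successful imports."""
    imported = 0
    for guid in harvester.gather_stage(harvest_job=None):
        harvest_object = {"guid": guid}
        if harvester.fetch_stage(harvest_object) and harvester.import_stage(harvest_object):
            imported += 1
    return imported


harvester = ToyHarvester({"b": {"title": " Survey B "}, "a": {"title": "Survey A"}})
count = run_job(harvester)
```

In the real extension the stages are driven by the gather and fetch queue consumers rather than a single loop, which is why they run as separate background processes.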
59
Localization
● Translation for UI
– Gettext Style i18n
– Babel (*.po & *.mo)
  ● In Python: p.toolkit._('String')
  ● In Jinja Template: {{ _('String') }}
  ● Transifex: Open Knowledge / CKAN
– Jed (For JavaScript Modules)
  ● _('String')
60
Localization
● Translation for Extensions
– opendatatrentino/ckan-custom-translations (GitHub)
● Translation for Metadata
– Defined in JSON Schema
– "label": {"en": "Data Type", "zh_TW": "資料類型"}
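A per-language label like the one above is just a mapping from locale to text. The sketch below shows one plausible way to resolve it with a fallback; `localized_label` is a hypothetical helper written for this example, not the lookup that ckanext-scheming itself performs.

```python
def localized_label(field, locale, fallback="en"):
    """Pick the label for `locale` from a schema field, falling back to `fallback`."""
    label = field.get("label", field["field_name"])
    if isinstance(label, dict):
        return label.get(locale) or label.get(fallback) or field["field_name"]
    return label  # plain string labels are returned unchanged


field = {"field_name": "data_type", "label": {"en": "Data Type", "zh_TW": "資料類型"}}
zh = localized_label(field, "zh_TW")
ja = localized_label(field, "ja")  # no Japanese label, so the English one is used
```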
61
Localization
● Chinese Search
– Solr + mmseg4j [1] (a Java tokenizer)
– Maximum Matching Algorithm [2] (by Dr. Chih-Hao Tsai)
– Copy to Solr folder and modify Solr schema
– Ref: http://is.gd/2Vpzgb
1. https://github.com/chenlb/mmseg4j-solr (Released under Apache 2.0 license)
2. http://technology.chtsai.org/mmseg/
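The idea behind maximum matching can be shown in a few lines. The sketch below is a simple forward maximum matching segmenter with a toy dictionary chosen for this example; it is not mmseg4j's implementation, and the full MMSEG algorithm adds further disambiguation rules on top of this.

```python
def forward_max_match(text, dictionary):
    """Segment `text` greedily, taking the longest dictionary word at each position."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


toy_dictionary = {"開放", "資料", "開放資料", "入口", "平台"}
tokens = forward_max_match("開放資料入口平台", toy_dictionary)
```

Without a tokenizer of this kind, a search engine has no word boundaries to work with in Chinese text, which is why the Solr schema has to be changed to use one.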
62
Contribution: What and How?
63
What to Contribute?
● CKAN Core Features
– Time and spatial search for private datasets
– Publish datasets as a catalogue service (e.g., CSW)
– Web interface for bulk uploads
– A simplified deployment process
– Issues on GitHub: https://github.com/ckan/ckan/issues
– More ideas: https://github.com/ckan/ideas-and-roadmap
64
What to Contribute?
● i18n
– Non-ASCII filenames
– Translate JS modules (e.g., Recline.js)
– UI translation (Transifex)
65
What to Contribute?
● More Functions for Using Data in Web Browser
– Audio & video playback (e.g., integrate plyr.io)
– Link to third-party services [1], like Shiny [2] (R-based) or IPython Notebook (Python-based)
1. http://www.data.gov/meta/open-apps/
2. https://github.com/ckan/ideas-and-roadmap/issues/35
66
What to Contribute?
● Rebuild data.g0v.tw with CKAN?
● data.g0v.tw (零時資料中心, the g0v data center)
– Built with DKAN (a CKAN clone for Drupal)
● Problems of DKAN
– Development is much slower than CKAN's
– Lacks features introduced in later versions of CKAN
  ● Ex. Multiple persistent views of data (in CKAN 2.3)
– Most gov sites in TW use (or will use) CKAN instead of DKAN
67
How to Contribute?
● CKAN Core: ckan/ckan (GitHub)
● Most plugins are also available on GitHub
– http://extensions.ckan.org/
● Development Discussions (Mailing List)
– https://lists.okfn.org/mailman/listinfo/ckan-dev
● Contributing Guide
– http://docs.ckan.org/en/latest/contributing/index.html
68
Thanks for your attention!
Any Q?
Email: cjlee AT iis.sinica.edu.tw
Profile: http://about.me/sollee
Google Groups: CKAN Taiwan Interest Group