Yahoo! is promoting an important initiative toward the democratization of research with big data. The initiative, named Webscope, gives academic researchers access to a collection of datasets, all of which have been “reviewed to conform to Yahoo!’s data protection standards” on privacy.
Among the available datasets, considerable space is given to language and graph data, but some datasets also cover important topics such as advertising, marketing, and ratings.
More information about this initiative is available on the Webscope Website.
SeCo is organizing the First International Workshop on Searching and Integrating New Web Data Sources (VLDS 2011), which will take place on September 2nd as a satellite event of VLDB 2011 in Seattle, WA, USA.
The goal of the workshop is to gather researchers and practitioners from the diverse fields related to data integration and search applications on the Web, with the aim of discussing innovative strategies for combining search facilities with integration aspects for Web data sources.
The workshop proceedings are now available online. You can download them as a single PDF file (size: 5 MB) from here:
VLDS 2011 proceedings
Prof. Zicari interviewed Dr. Alon Y. Halevy, head of the Structured Data Group at Google Research, on Google Fusion Tables and the importance of large-scale data management tools.
The full transcript of the interview is available on the ODBMS.org Web site.
Continuum, a project developed and maintained under the Apache umbrella, is a continuous integration server that is fully integrated with many popular build systems (most notably Maven 2) and supports automated building, testing, and releasing of applications. Continuum can be deployed either as a stand-alone server or inside an application container; this tutorial focuses on the latter scenario, since it involves some non-trivial preparation.
The objective is to deploy Continuum inside Tomcat 6 and set it up to build and test our project at every change.
The deployment environment is the following:
The packages mentioned above can be installed and set up automatically using aptitude. Continuum, however, is not packaged and needs to be installed manually. In this tutorial we use Continuum 1.4 beta (the WAR distribution; the tar.gz archive will also come in handy during deployment).
Before setting up the web application, we need to set up the workspace for Continuum. Tomcat, in Debian, runs as a separate user (tomcat6) and is not able to write outside its own directories. To host the Continuum configuration files, databases, work area, and Maven local repository, we need a directory that Tomcat can write to:
mkdir -p /var/lib/continuum
chown -R tomcat6:tomcat6 /var/lib/continuum
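The workspace preparation can also be wrapped in a small script so that it is repeatable and easy to check. The following is only a sketch: the function names and the subdirectory names (conf, data, work, m2-repository) are illustrative assumptions, not Continuum defaults, and the ownership change requires root privileges.

```shell
#!/bin/sh
# Sketch of a repeatable workspace setup. The subdirectory names below are
# assumptions chosen for illustration; Continuum does not mandate them.

# prepare_workspace ROOT [OWNER]
#   ROOT  : workspace directory, e.g. /var/lib/continuum
#   OWNER : user:group to own the tree, e.g. tomcat6:tomcat6
#           (leave empty to skip the ownership change)
prepare_workspace() {
    root="$1"
    owner="$2"
    for sub in conf data work m2-repository; do
        mkdir -p "$root/$sub" || return 1
    done
    if [ -n "$owner" ]; then
        chown -R "$owner" "$root" || return 1
    fi
}

# check_writable DIR
#   Succeeds if DIR exists and the current user can write to it.
check_writable() {
    [ -d "$1" ] && [ -w "$1" ]
}
```

Run as root, `prepare_workspace /var/lib/continuum tomcat6:tomcat6` would create the tree and hand it over to Tomcat. To confirm that the Tomcat user can really write there, the check has to run as that user, for example with `sudo -u tomcat6 test -w /var/lib/continuum`.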
Examples of the services provided by the toolkit are:
- Street Address to Coordinates conversion: calculates the latitude/longitude coordinates for a postal address. Currently restricted to the US and UK.
- File to Text conversion: extracts text from PDFs, Word documents, and Excel spreadsheets. It also recovers text from JPEG, PNG, or TIFF images of scanned documents.
- Coordinates to Political Areas conversion: returns the country, region, state, county, constituencies, and neighborhood that contain a given point.
- GeoDict: pulls country, city, and region names from unstructured English text and returns their coordinates.
- IP Address to Coordinates conversion: calculates the country, state, city, and latitude/longitude coordinates for an IP address.
The toolkit also contains services for text analysis, such as the Text To People and the Text To Time services.
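To give an idea of how such conversion services might be called from the command line, here is a minimal sketch. It assumes that the toolkit exposes each service over HTTP at a path of the form `<base>/<service>/<argument>`, with service names such as `street2coordinates` and `ip2coordinates`; the base URL, the service names, and the helper functions `dstk_encode` and `dstk_url` are assumptions made for illustration, so check them against the toolkit's own documentation before relying on them.

```shell
#!/bin/sh
# Base URL of a Data Science Toolkit server. This value is an assumption;
# override DSTK_BASE to point at your own deployment.
DSTK_BASE="${DSTK_BASE:-http://www.datasciencetoolkit.org}"

# Minimal encoding for a path argument: spaces become '+', commas '%2c'.
# This is NOT a general-purpose URL encoder.
dstk_encode() {
    printf '%s' "$1" | sed -e 's/ /+/g' -e 's/,/%2c/g'
}

# dstk_url SERVICE ARGUMENT
#   Prints the request URL for a (hypothetical) service name and argument.
dstk_url() {
    printf '%s/%s/%s\n' "$DSTK_BASE" "$1" "$(dstk_encode "$2")"
}
```

Under these assumptions, a lookup would be a plain GET request, for example `curl "$(dstk_url ip2coordinates 67.169.73.113)"`. The sketch only builds the URL; the response format is not shown here because it is not described in this post.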
The latest version is marked as 0.35 and was released on April 17th, 2011. The Data Science Toolkit was assembled by Pete Warden, and the source code is available at http://github.com/petewarden/dstk
Researchers from the Search Computing project attended the 11th International Conference on Web Engineering (ICWE 2011), which took place in Paphos (Cyprus) on June 20-24.
Several works were presented at the conference:
- A keynote from Stefano Ceri: The Anatomy of a Multi-Domain Search Infrastructure;
- A research paper about Multi-way rank join with parallel access;
- A live demonstration of the SeCo system.
The conference also featured a SeCo-sponsored event: the First International Workshop on Search, Exploration and Navigation of Web Data Sources (ExploreWeb 2011).
Researchers from the Search Computing project attended the 2011 ACM SIGMOD Conference, which took place in Athens (Greece) on June 12-16.
A novel, live demonstration of the SeCo Execution Engine and Workbench environment was presented at a dedicated booth.
Search Computing: Multi-domain Search on Ranked Data, authored by Alessandro Bozzon, Daniele Braga, Marco Brambilla, Stefano Ceri, Francesco Corcoglioniti, Piero Fraternali, Salvatore Vadacca
After organizing two workshops in Como, the Search Computing project decided to go “on the road”. Several workshops have been accepted at conferences such as VLDB, ISWC, ICWE, and ECOWS. More details here, or on the workshops’ Websites.
- At ICWE 2011 in Paphos, Cyprus (June) we organize the ExploreWeb Workshop, chaired by Brambilla, Fraternali, and Schwabe, see:
- At VLDB, we organize the “Very Large Data Search” Workshop, chaired by M. Brambilla, F. Casati, and S. Ceri, with Hector Garcia-Molina and Alon Halevy as keynote speakers, see: http://vlds2011.search-computing.net/
- Also at VLDB, we sponsor the DBRank workshop, chaired by Chakrabarti and Martinenghi, with Jan Chomicki as keynote speaker, see:
- At ECOWS 2011 in Lugano, Switzerland (September) we organize the DATAVIEW Workshop, chaired by Bozzon, Comai, and Norrie, see:
- At ISWC 2011 in Bonn, Germany (October) we organize the OrdRing Workshop, chaired by Della Valle, Horrocks, and Bozzon, see: