froscon2009 - 1.0

FrOSCon
Free and Open Source Software Conference

Speakers
Isabel Drost
Schedule
Date Day 1 (2009-08-22)
Room HS3
Start 15:15
Duration 01:00
Info
ID 343
Event type Talk
Track Cloud Computing
Event language English

From data to information

Large scale data analysis.

Today it is easy and comparatively cheap to buy hardware capable of storing terabytes of data. Now we need a means to process and analyze that data.

In recent years several open source projects have set out to solve problems that developers of highly scalable applications need to deal with. The Hadoop framework handles distributed computations on large amounts of data. Several domain-specific languages have been designed to make writing Hadoop jobs easier. There are also data storage solutions and projects that focus on data serialization.
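As a rough illustration of the map/reduce programming model that Hadoop jobs follow, the sketch below counts words in a handful of text records. It is a single-process simulation, not Hadoop code: it does not use any Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example. On a real cluster the framework would run the map and reduce steps in parallel on many machines and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical, in-process only)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data needs big clusters", "data becomes information"]
    print(word_count(docs))
```

Because the map step looks at one record at a time and the reduce step looks at one key at a time, both can in principle be spread across machines without changing the logic, which is the property the model relies on.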

The talk gives a brief summary of the Hadoop ecosystem. It shows how to leverage some of the open source software to build an application that analyzes large amounts of raw, unstructured data and generates valuable information from it.

In the recent past, it has become very easy for people to create and publish new information in digital form. The amount of digital, unstructured data has increased exponentially over the last few years. Extracting information from these sources of unstructured or semi-structured data has become vital.

More and more engineers turn to projects that facilitate easy parallel processing to cope with the ever-growing amount of digital data. One of the most successful frameworks for parallel processing is Apache Hadoop. Growing out of Lucene, it became a top-level project only last year. Today there is a large number of sub- and sister projects that deal with tasks such as data serialization, data storage, domain-specific languages for data processing, and easier administration.

The talk starts with an overview of the Hadoop ecosystem. It shows how to integrate a selection of projects for data storage, processing and analysis. The focus is on integrating data mining facilities in the processing pipeline.