Description: Commerce and research are being transformed by data-intensive discovery and prediction. Skills required to analyze massive, noisy, and heterogeneous data -- data science -- include scalable data management, practical statistics and machine learning, parallel algorithms, and human-centered visualization and design. These skills span a variety of disciplines and are not easy to obtain through conventional curricula. We will tour a variety of systems, concepts, and techniques in data science, explaining the complex landscape of big data systems that has emerged in recent years, which includes parallel databases, MapReduce-based systems, NoSQL systems, and more recent ideas that focus on iterative algorithms and scalable machine learning. We will also tour important practical algorithms in scalable data mining (e.g., locality-sensitive hashing) and machine learning (e.g., random forests), along with an introduction to statistical experimental design and the resampling and permutation methods one can use to answer questions about statistical significance computationally. Finally, we will talk about the principles of visualization and their role in modern data science, as well as some related issues in the "human side" of big data.
In a hands-on section, students will process a terabyte-scale dataset using one of several big data systems and compare results across various systems in terms of performance, expressiveness, and simplicity.
Learning goals: The goal of this course is to provide students with a working understanding of the tools, techniques, and terminology of data science as it is practiced today.
Prerequisites: A general background in computer science and general familiarity with database management, as can be achieved through an undergraduate database course, is expected. Participants who have taken graduate data management courses will benefit from this additional background.
Organizer: Professor Torben Bach Pedersen, tbp@cs.aau.dk
Lecturer: Associate Director Bill Howe, eScience Institute, University of Washington
ECTS: 2
Time: 22-24 June, 2015
Place: Aalborg University
22 June, 14:00-17:00 hours: Selma Lagerlöfs Vej 300, room 01.95
23 June, 9:00-17:00 hours: Niels Jernes Vej 8, room A-0.01
24 June, 9:00-17:00 hours: Selma Lagerlöfs Vej 300, room 01.95
Zip code: 9220
City: Aalborg
Number of seats: 15
Deadline: 1 June 2015
- Teacher: Torben Bach Pedersen