Tutorials (July 10, 2017)



Full Day Tutorial

A Practical Introduction to Spatial Datasets and Urban Applications   (Abstract)

Presented by:

Bruno Gonçalves

New York University

Desislava Hristova

University of Cambridge

Anastasios Noulas

New York University

The proliferation of the mobile web and the availability of large-scale digital datasets have enabled a new wave of research studies that are largely driven by these new types of data generated in urban environments. This tutorial aims to offer an overview of the opportunities and challenges posed by geolocated datasets, with a particular emphasis on their use in urban data science, guiding participants through the entire process, from mining such datasets to using them to analyze different aspects of urban science with a theory-backed approach. We will provide an extensive overview of some of the theory underlying the study of urban systems, followed in the second part of the day by a practical introduction to using several different datasets and APIs. Inspired by a fusion of computational approaches and complex systems, this tutorial will integrate elements from geography, computer science, urban studies, sociology, physics and complex systems. This will involve the description of methodologies for the collection of geo-referenced and spatial datasets, techniques for the analysis and modeling of geographic data and mobility, network science as a tool to understand cities, machine learning as a medium to solve optimization problems and define prediction tasks in urban environments, and finally, ways to visualize raw datasets and corresponding outputs on maps.


Morning Tutorials

Topic modeling European political debates with the EUSpeech dataset   (Abstract)

Presented by:

Tatjana Scheffler

University of Potsdam

Damian Trilling

University of Amsterdam

Cornelius Puschmann

Hans Bredow Institute for Media Research

The tutorial provides a concise and hands-on introduction to topic modeling, an increasingly popular method in computational social science (Blei, 2012; DiMaggio, 2015; Puschmann & Scheffler, 2016). While a growing number of packages in widely used programming languages are at the disposal of researchers, there are a number of caveats to consider when deploying topic models in the research process, both on the technical level and in terms of research design, such as which algorithm to rely on, how to set parameters, and in which ways to preprocess data. Interpreting the output generated by popular algorithms such as Latent Dirichlet Allocation (LDA; Blei, Ng & Jordan, 2003) and evaluating the validity of topic model statistics are key challenges for social scientists interested in using topic modeling, as is the question of successfully embedding topic models in a research design in a fashion that allows the testing of concrete hypotheses. In a series of eight compact segments, we will provide a user-friendly description of the workflow in both Python and R for the practical application of topic models to our example data, the EUSpeech corpus (Schumacher et al., 2016), and discuss the conceptual basis of topic models along with approaches for the evaluation of topic model fit and the validity of the results generated with them. We expect an audience of both social scientists and computational researchers, mostly at the PhD student and postdoctoral level. Our approach will be research-oriented, involving an overview of relevant packages in both Python and R, in addition to our own scripts (also in both languages), which will be shared via GitHub. We aim to be both highly practical and language-agnostic by focusing on what topic models do, how they do it, and what questions can be studied effectively using them.
We expect some familiarity with programming for those participants who want to apply topic modeling in their own research, but the segments on how to interpret topic models should be equally relevant for those with and without programming knowledge.
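As a rough illustration of the kind of preprocessing choice mentioned above, the following Python sketch prunes a vocabulary by document frequency before building bag-of-words counts. It is not taken from the presenters' scripts: the toy sentences, the function names, and the `min_df` and `max_df_ratio` thresholds are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for speech transcripts (not EUSpeech data).
speeches = [
    "The union must reform the budget this year.",
    "The budget debate is about migration and the union.",
    "Migration policy needs the common union approach.",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def prune_vocabulary(token_lists, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and in at most
    max_df_ratio of all documents (both thresholds are illustrative)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    }

token_lists = [tokenize(s) for s in speeches]
vocab = prune_vocabulary(token_lists)

# Bag-of-words counts restricted to the pruned vocabulary.
bows = [Counter(t for t in tokens if t in vocab) for tokens in token_lists]

print(sorted(vocab))
print(bows)
```

Terms that occur in every document (here "the" and "union") and terms that occur only once are dropped, so only mid-frequency terms are left for a topic model to work with. Different thresholds can change the resulting topics noticeably, which is one reason preprocessing counts as a research design decision.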


Social Media for Health Research   (Abstract)

Presented by:

Ingmar Weber

Qatar Computing Research Institute

Yelena Mejova

Qatar Computing Research Institute

The use of social media in the health domain started in the late 1990s with the appearance of email lists and online forums. Since then, both general purpose and dedicated social media have been used to monitor the health and wellbeing of thousands of individuals online. The scale, reach, and real-time nature of the internet allow for epidemiological studies, tracking seasonal illnesses like flu, as well as understanding the context of some conditions, such as anorexia and bulimia. Such passive monitoring allows for nowcasting disease, that is, estimating its prevalence at the moment it happens. Other efforts bring the technology back to the individual, enabling better health data gathering, and eventually feedback and actual intervention for behavior change and improving health. In this tutorial we showcase the latest achievements in the use of social media for health research. We also address data quality issues, involving not only selection bias but also truthfulness, and show how external data ranging from public health data, to quantified self data, to electronic healthcare records can be used for validation. Further, we address a suite of interesting and important ethical issues raised by the use of social media in research, including the need for and scope of informed consent, the challenges raised by database, registry and software design and use, the need for some studies to include attention to social groups to achieve community-based participation, and the many challenges to privacy and confidentiality.


Discovering and Mitigating Algorithmic Discrimination   (Abstract)

Presented by:

Sara Hajian

Eurecat Technology Centre of Catalonia

Carlos Castillo

Eurecat Technology Centre of Catalonia

Algorithms and decision-making based on Big Data are becoming pervasive and essential tools in personal finance, health care, hiring, housing, education, and policy-making. Researchers in many computing areas, including computational social science, have gradually moved from the analysis of online phenomena to the application of data and algorithms to determine the media we consume, the stories we read, the people we meet, the places we visit, but also whether we get a job or whether our loan request is approved. It is therefore of societal and ethical importance to ask whether these algorithms can be discriminatory on grounds such as gender, ethnicity, marital or health status. The answer is yes, as several high-profile cases of algorithmic discrimination have been described in recent years. Algorithmic discrimination exists even without the intent to discriminate. It is sometimes inherent to data containing historical patterns of discrimination, and sometimes it is the result of correlations existing in the data with legally protected attributes such as gender and race. The aim of this tutorial is to present different aspects of algorithmic bias. The tutorial will start with a comprehensive survey of cases in which algorithmic bias has been found. Then, it will cover two complementary approaches: computational methods for discrimination discovery, and discrimination prevention by means of fairness-aware algorithms. We will conclude by summarizing the most promising paths for future research.


Introduction to GIS using Google Maps and Google Earth  (Abstract)

Presented by:

Toni Rouhana

University of California Santa Cruz

This tutorial is an introduction to GIS practices with a focus on the use of Google GIS tools including Google Maps, Google Earth and Fusion Tables. Data visualization and mapping are increasingly becoming a critical part of most social sciences. Geographic Information Systems (GIS) are not a new phenomenon in the social sciences, but they have become a crucial tool for the analysis and the visualization of spatial relationships around the world. In this workshop, we lay out some available software and introduce the Google Maps APIs. The session is hands-on: participants will learn various methods to analyze geodata and create map mashups, that is, multilayered visualizations that combine data from different datasets. By the end of this tutorial participants will have a working customized map and will understand the possibilities and limitations of the Google Maps APIs.


Afternoon Tutorials

A Practical Introduction to Latent Semantic Analysis for Text  (Abstract)

Presented by:

Jacob Miller

Drexel University

David Gefen

Drexel University

Jorge Fresneda

Drexel University

In this session, we present a tutorial on the use of latent semantic analysis (LSA) in social science research. LSA is a useful method for large-scale text analysis. Like Bayesian topic models and semantic network analysis, LSA is a bag-of-words approach, and the initial stages of the seminar use code and theory that are applicable to these methods as well. Like Word2Vec, LSA is a vector space approach, and so analyses in the latter part of the tutorial are relevant to that method. We walk attendees through a small example, with provided text and R code, to jump start their own projects. We cover the theoretical basis for LSA, and discuss practical methodological choices, as we work through the code. We create a small semantic space, and conduct some initial analyses with visualizations. We conclude with discussions of potential applications of LSA to social science problems. This tutorial is intended to give attendees practical experience in computational text analysis, and an understanding of how it could be applied to social science problems.
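To make the idea of a "small semantic space" concrete, here is a minimal sketch in Python rather than the R code used in the tutorial. It builds a term-document count matrix from a hypothetical four-sentence corpus and obtains a rank-2 truncated SVD by power iteration on the document Gram matrix. The corpus, the stopword list, the function names, and the choice of two dimensions are all assumptions for illustration; a real analysis would normally use a linear algebra library and a term weighting scheme.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: two documents about pets, two about finance.
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks and bonds",
]
STOPWORDS = {"and", "are"}

def tokenize(text):
    """Lowercase, keep runs of letters, drop stopwords (bag of words)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))
# Term-document matrix A: one row per term, one column per document.
A = [[c[t] for c in counts] for t in vocab]

n = len(docs)
# Gram matrix A^T A; its eigenvectors are the right singular vectors of A.
gram = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def top_eigenpairs(G, k, iters=1000):
    """Power iteration with deflation: the k largest eigenpairs of symmetric G."""
    G = [row[:] for row in G]  # work on a copy
    size = len(G)
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(size)] * size
        for _ in range(iters):
            w = matvec(G, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(G, v)))
        pairs.append((lam, v))
        # Deflate: subtract the component that was just found.
        for i in range(size):
            for j in range(size):
                G[i][j] -= lam * v[i] * v[j]
    return pairs

pairs = top_eigenpairs(gram, k=2)
# Document coordinates in the latent space: singular value times eigenvector entry.
doc_vectors = [
    [math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs] for j in range(n)
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print("pets vs pets:   ", cosine(doc_vectors[0], doc_vectors[1]))
print("pets vs finance:", cosine(doc_vectors[0], doc_vectors[2]))
```

In the raw count space the two pet documents share only two of their three terms, whereas in the two-dimensional latent space they end up pointing in nearly the same direction and nearly orthogonal to the finance documents. This is the basic effect that the choice of dimensionality in LSA trades on.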


Computational Sociolinguistics  (Abstract)

Presented by:

Dong Nguyen

Alan Turing Institute

Research in the field of computational linguistics has so far primarily focused on language as a means to convey information. However, language is one of the main instruments by which people construct their identity and manage their social network. With the rise of social media and the increasing interest in studying social phenomena through large-scale text analysis, there has recently been a surge of interest in analyzing and modeling the social dimension of language using computational approaches. This tutorial provides a comprehensive overview of the emerging field of Computational Sociolinguistics, which studies the relation between language and society from a computational perspective. In the tutorial, topics such as the relation between language and social identity, language use in social interaction, and multilingual communication will be discussed. Moreover, the goal of this tutorial is to demonstrate how the large-scale data-driven methods that are widely used in computational linguistics can be used to study the social dimension of language, and how insights from sociolinguistics and the social sciences can inform and challenge the methods and assumptions employed in computational linguistics research. The tutorial will conclude with a discussion of open challenges in this emerging research area.

Tutorial resources can be found here.

Crowdcomputing and Citizen Science for Large-Scale Experiments  (Abstract)

Presented by:

Snehalkumar ‘Neil’ S. Gaikwad

Massachusetts Institute of Technology

Sohan Dsouza

Massachusetts Institute of Technology

Oana Vuculescu

Aarhus University

Andrew Mao

Microsoft Research

Iyad Rahwan

Massachusetts Institute of Technology

Historically, scientific experiments have been conducted at a small scale, either in artificial environments or with the expertise of a limited number of scientists. While the social science literature investigates very deep questions to understand human behavior, many experiments are limited by the number of participants and the duration of a study. In contrast, the computer science literature exploits advanced computational techniques to crunch voluminous datasets, but research designs are generally not experimental, which limits the opportunity to generate causal inferences. In this tutorial we demonstrate how crowdcomputing can enable computational social scientists to engage with millions of users on the Internet and study human behavior at scale over a longer time. We showcase pitfalls and lessons learned from various crowdcomputing and citizen science projects. Furthermore, we provide insights about how to build a sustainable citizen science community to scale science beyond the traditional laboratories. We envisage this tutorial will help computational social scientists effectively use crowdcomputing to investigate deep research questions and longitudinally validate their hypotheses in large-scale experiments.


Digital Demography  (Abstract)

Presented by:

Ingmar Weber

Qatar Computing Research Institute

Bogdan State

Stanford University

Demography is the science of human populations and, at its most basic, focuses on the processes of (i) fertility, (ii) mortality and (iii) mobility. Whereas modern states are typically in a reasonable position to keep records on both fertility and mortality, through birth and death registrations, as well as through censuses, measuring the mobility of populations represents a particular challenge due to reasons ranging from inconsistencies in official definitions across countries, to the difficulty of quantifying illegal migration. At the same time, mere numbers, whether on births, deaths or migration events, shed little light on the underlying causes, hence providing insufficient information to policy makers.

The use of digital methods and data sources, ranging from social media data to web search logs, offers possibilities to address some of the challenges of traditional demography by (i) improving existing statistics or helping to create new ones, and (ii) enriching statistics by providing context related to the drivers of demographic changes. This tutorial will help to familiarize participants with research in this area. First, we will give an overview of fundamental concepts in demographic research including the population equation. We also showcase traditional data collection and analysis methods such as census microdata, the construction of a basic life table, panel datasets and survival analysis.
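The population equation mentioned above is commonly written as the demographic balancing equation, P(t+1) = P(t) + B - D + I - E, where B and D are births and deaths and I and E are immigrants and emigrants over the period. The short Python sketch below applies it year by year; the function name and all of the counts are invented for illustration and do not describe any real population.

```python
def next_population(population, births, deaths, immigrants, emigrants):
    """Demographic balancing equation: P(t+1) = P(t) + B - D + I - E."""
    return population + births - deaths + immigrants - emigrants

# Hypothetical yearly counts for an invented region (not real data).
flows = [
    {"births": 12_000, "deaths": 9_000, "immigrants": 4_000, "emigrants": 2_500},
    {"births": 11_500, "deaths": 9_200, "immigrants": 3_000, "emigrants": 3_800},
]

population = 1_000_000
trajectory = [population]
for year in flows:
    population = next_population(population, **year)
    trajectory.append(population)

print(trajectory)
```

Births and deaths usually come from registration systems, while the two migration terms are the hard ones to measure, which is where the digital data sources discussed in this tutorial may help.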

In the second part, we present a number of studies that have tried to overcome limitations of traditional approaches by using innovative methods and data sources ranging from geo-tagged tweets [14, 42] to online genealogy. We will put particular emphasis on (i) methodological challenges such as issues related to bias, as well as on (ii) how to collect open data from the World Wide Web.

The slides and other material for this tutorial are available at https://sites.google.com/site/digitaldemography/.


A brief introduction to Exponential Random Graph Models for social networks  (Abstract)

Presented by:

András Vörös

ETH Zürich

Multivariate statistical models for social networks are becoming increasingly popular among social scientists. More and more papers applying these methods appear in top sociology, political science and other social science journals. In the meantime, network models themselves are rapidly developing, and it can be difficult for substantive researchers to keep track of the advances without a solid understanding of the basics of the models. This tutorial offers a theoretical introduction to statistical models for social networks in general, and a practical guide to applying Exponential Random Graph Models (ERGMs) in particular. Participants will get an overview of the issues that make network models indispensable and of the model families that are currently highly popular. Following the theoretical lecture, a practical session will ensure that attendees are able to run simple ERGMs on real-life datasets. At the end of the tutorial, we discuss how to further develop network-modeling skills, how ERGMs and other models may be applied to large-scale network data, how one can seek help when working with network models, and how to choose between the various available models based on the research question and data. The general goal of the tutorial is to help train scholars who are able to use the most advanced statistical network methods and have a good understanding of the challenges and directions of this dynamic field.