Mapping 1: Beginners intro to ArcGIS Storymaps

Discover the power of storytelling through maps! This beginner-friendly workshop introduces participants to ArcGIS Story Maps, a platform that combines maps, multimedia, and narratives to create engaging, interactive stories. Learn the basics of mapping, explore features for integrating text, images, and videos, and create your own Story Map by the end of the session. Perfect for anyone new to mapping or looking to visually communicate research, ideas, or projects. No prior experience with GIS or mapping tools is required. Join us to turn your data into a compelling story!

Registration for workshops is limited to students, faculty, and staff who are affiliated with the CUNY Graduate Center. Please note that you might need to log in to CUNY Academic Commons to be able to RSVP.

January 27, 2025

Introduction to TEI (Text Encoding Initiative)

This workshop is an introduction to the theory and practice of encoding electronic texts for the humanities. It is designed for students who are interested in the transcription and digitization of manuscripts and print-based texts into diplomatic, digital formats. In this workshop, participants will understand what TEI is and why it’s used, recognize the basic structure of a TEI document, and practice using a specific subset of TEI guidelines to encode a manuscript.

December 13, 2024

Have patience, think creatively: reflections on building digital identity and communications capacity for @gcsocialwelfare

This post is written by Ian G. Williams, PhD student in Social Welfare and Program Social Media Fellow. Connect with Ian G. Williams on Bluesky@igraywill.bsky.social; Instagram@igraywill; LinkedIn http://linkedin.com/in/iangraywilliams

Over the last two and a half years, I’ve had the pleasure of working as the Program Social Media Fellow (PSMF) for The Ph.D. Program in Social Welfare. I manage scholarly communications and our social media accounts. PSMF is a three-year position, intended to support programs at The Graduate Center build their communications infrastructure and social media presence. Every program (and center) at The Graduate Center is unique; the PSMF team reports directly to their program or center, and also works with GCDI.

The Social Welfare Program started in 1974 as a professional, three-year Doctor of Social Welfare (DSW), and restructured in 2001 into a research-focused Ph.D. It was excluded from the Graduate Center Fellowship’s relatively uniform funding structure, and most of its students worked outside of school – often in nonprofit and government human services. In 2021, Dr. Harriet Goodman, the program’s previous Executive Officer, retired after over 10 years at the helm. Though the degree was always granted by The Graduate Center, it had a long-standing affiliation with what is now the Silberman School of Social Work at Hunter College. Informally, its branding and identity were associated with Hunter, and students took their required classes at that campus. Our current Executive Officer, Dr. Barbra Teater—tenured at The College of Staten Island—was the first EO from a different CUNY school. This triggered some reorganization. In Fall 2021 Social Welfare physically moved into The Graduate Center, while classes were still being held remotely due to the COVID-19 pandemic. When hybrid classes started in Spring 2022, they were held at the GC campus. Dr. Teater started multiple projects and initiatives, including a newsletter, “In The Loop”, which I took over as managing editor when I came onboard (read the December 2024 issue here).

When I started as a PSMF in Fall 2022, I was in my second year. I had no experience managing public-facing media. I had ample experience with nonprofit and public sector administration and operations in my career as a social worker. This primed me to the importance of learning the contours and rhythms of our program. To me, learning the contours and rhythms of our program was as essential to getting started as figuring out how to create a LinkedIn profile and to schedule posts with Fedica. That semester I was the incoming program representative for the Doctoral and Graduate Students’ Council. Social Welfare was also finishing up a once-every-10-years self-study, which involved an external review by the site reviewers, who conducted focus groups with students and staff before producing a final report. These other processes helped me, crucially, to identify audiences within our school’s communities, key stakeholders we were trying to build relationships with, and gave me ample messaging and language to work with. As a PSMF, I focused on internal capacity building to get the needs of our program known and on the agenda for upcoming years. We had a home, but we needed resources – namely, dedicated space and funding parity with other social science programs, which we are now working towards.

Start-up wasn’t as simple as creating accounts, announcing “hello, world!” across the Internet. Communications had to fit with our existing operations (often using legacy technologies, fitting the personalities and work habits of administrative staff), and had to make sense for our needs (which were often practical and modest). The marketing and business growth oriented metrics inherent in social media platforms and analytics were largely unhelpful – some degree of quantitative data, like number of followers or post engagements, were useful, but for an academic program of specialists, we’re not the ideal user for mainstream platforms. Figuring out and putting together the program’s communications infrastructure required asking a lot of questions.

Each year had an overarching theme. The first year was learning the rhythms and cycles of our program at The Graduate Center: mapping out strategy, setting up accounts (and reclaiming the inactive ones), creating a CUNY Academic Commons page, working with our APO to set up a program listserv, playing around with graphic design software (Canva and Adobe), developing a production cycle for the newsletter, and attending all program-related events such as faculty meetings and open houses. The second year focused on refining our processes, and brought on an MSW student intern focusing on community organizing at the Silberman School of Social Work to help with some of our community-building and advocacy efforts, the newsletter, and organizing our program’s first student symposium. I also sought to increase our documentation of and visibility at conferences, making us known as a distinct Graduate Center program not just within CUNY, but also within the professional and epistemic circles in which our students participate. In my third year, I’m refining these processes and increasing the quality of our visuals (see examples below), gathering more stories from students on their conference experiences, and working with another student intern on some student-led initiatives that help . I am also writing documentation and procedure manuals for the work, so the various tasks and processes can be distributed among students and permanent staff for our program.

Working as a Program Social Media Fellow is a very interesting and rewarding job. It has its own challenges and constraints, but it’s been pretty amazing to help build out our program in ways that I am confident will have a lasting positive impact. It’s extremely rewarding to cultivate a sense of community among my fellow students and promote their accomplishments. It’s also been a great way to learn about how a program at The Graduate Center functions. As someone who researches the politics of emerging technologies and digital life, it’s also been fascinating to be behind the scenes in the social media and scholarly communications world during the demise of Academic Twitter and diffusion into other platforms and spaces.

Want to learn more about Social Welfare? Check out our LinkTree and follow our socials from there.

December 6, 2024

Racialized Aspects of Data Collection & Data Use

There is a common misconception that data is neutral, an objective truth. However, the data that we use to build computer programs, to conduct research, and to inform policy cannot exist outside of the systems of oppression that permeate our society. For example, studies show that facial recognition software is least reliable for people of color, women, and nonbinary individuals (Buolamwini & Gebru, 2018; Costanza-Chock, 2020); risk assessment algorithms are more likely to falsely flag black defendants as future criminals as opposed to white defendants (Angwin et al., 2016); and racialized data creates real barriers for minority groups’ access to housing and employment (Williams and Rucker, 2000). Given this, we—as researchers, practitioners, and educators—have a responsibility to consider how systemic inequities impact our data practices.

Optimization & Standardization

White, male heteronormativity is often technologically privileged by ideas of optimization and standardization. The ‘database revolution’ of the 1960s, characterized by the need to create more streamlined processes for how large amounts of data is organized, arranged, and managed, emphasized the importance of optimizing databases for usability and to provide “natural” representations of data. However, Stevens, Hoffman, and Florini (2021) argue that “database optimization efforts [help] reproduce and sustain white racial dominance, in part, by making it easier for dominant actors in government and business to both conceive of and organize the social world in ways that served white interests” (p. 114). Some of the most prominent works to emerge from the database revolution take up whiteness as a kind of implicit optimum, the norm for which anything outside becomes a deviation.

Therefore, it is critical for us to assess how data is collected and constructed in ways that reinforce the matrix of domination (Collins, 1990). The matrix of domination is a “conceptual model that helps us think about how power, oppression, resistance, privilege, penalties, benefits, and harms are systematically distributed,” and, when we think about data as a reflection of existing power dynamics, it is imperative that we consider the ways in which databases can serve to enshrine inequities (Costanza-Chock, 2020, p. 20). The most vulnerable populations should have both access to and control over their data, and are entitled to informed consent and transparency about how their data is being used.

Data Collection

“The decisions people make about which data matter, what means and methods to use to collect them, and how to analyze and share them are important but silent factors that reflect the interests, assumptions, and biases of the people involved” (Gaddy & Scott, 2020, 1). Racial and gender equity need to be considered during the entire data life cycle, including in planning, collection, access, use of statistical tools, analysis, and dissemination. Oftentimes, disparities are overlooked or ignored due to a simple lack of data on certain populations. For example, Boston University’s Center for Antiracist Research assisted race and ethnicity data collection efforts during the COVID-19 pandemic. They found that state-reported data suffered from deficiencies that led to errors and underestimations of racial and ethnic inequalities, including incomplete datasets, failing to account for the ways that race and ethnicity can intersect, and defining race and ethnicity in overly broad ways that obscure experiences of racism and subordination (Khoshkhoo et al., 2022). These limitations hindered evidence-based responses to the pandemic for already-marginalized groups. Remember, when collecting data, it is important to not only consider what data is available, but also what data is missing and why. Data collection practices that fail to consider race as a critical factor pose tangible harms to individuals of color.

Data Use

The University of Pennsylvania’s Actionable Intelligence for Social Policy created a toolkit for Centering Racial Equity Throughout Data Integration. Here, they uphold the work of BU’s Center by encouraging researchers to practice ethical data use with a racial equity lens “that supports power sharing and building across agencies and community members” (AISP, 2022, p. 1). They shine a light on the risks and benefits of civic data use and suggest that, while cross-sector data can often give us a more holistic view of the individuals who are ‘datafied,’ it can also reinforce legacies of racist policies and promote problematic practices. The toolkit states that:

Incorporating a racial equity lens during data analysis includes incorporating individual, community, political, and historical contexts of race to inform analysis, conclusions and recommendations. Solely relying on statistical outputs will not necessarily lead to insights without careful consideration during the analytic process, such as ensuring data quality is sufficient and determining appropriate statistical power. (AISP, 2022, p. 28)

Data should be used in ways that benefit the communities from which the data comes from.

Conclusion

Data reflects our social world, meaning that race—as well as gender, class, sexuality—is a powerful mediator for how we use and interpret it. Using some of the insights and resources from this blog, I am hopeful that readers will take some time to consider how they incorporate antiracist methodologies into their data work. Remember that all data have limits, and it’s both incorrect and harmful to assume that something technological is automatically objective and neutral. Nuanced identities and circumstances exist as much in the digital world as they do in the physical world, and they require our attention.

References & Resources:

Actionable Intelligence for Social Policy. (2022). A Toolkit for Centering Racial Equity Throughout Data Integration. https://aisp.upenn.edu/wp-content/uploads/2022/07/AISP-Toolkit_5.27.20.pdf

Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of Machine Learning Research, 81, 1-15.

Costanza-Chock, S. (2020). Design Justice: Community-led Practices to Build the Worlds We Need. MIT Press.

Williams, D. R., & Rucker, T. D. (2000). Understanding and addressing racial disparities in health care. Health care financing review, 21(4), 75–90.

Stevens, N., Hoffmann, A.L., & Florini, S. (2021) The unremarked optimum: whiteness, optimization, and control in the database revolution. Review of Communication, 21(2), 113-128.

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. (2016). Machine Bias. ProPublica, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Collins, P.H. (2002). Black Feminist Thought: Knowledge, Consciousness, and the Politics of Empowerment. New York: Routledge.

Gaddy, M. & Scott, K. (2020). Principles for Advancing Equitable Data Practice. Urban Institute. https://www.urban.org/sites/default/files/publication/102346/principles-for-advancing-equitable-data-practice.pdf

Khoshkhoo, N.A., Schwarz, A.G., Puig, L.G., Glass, C., Holtzman, G.S., Nsoesie, E.O., and Rose, J.B.G. (2022). Toward Evidence-Based Antiracist Policymaking: Problems and Proposals for Better Racial Data Collection and Reporting. BU Center for Antiracist Research. https://www.bu.edu/antiracism-center/files/2022/06/Toward-Evidence-Based-Antiracist-Policymaking.pdf

November 27, 2024

Join us at the GC Digital Research Institute

The GC Digital Research Institute (GCDRI)

Jan. 21st – 24th

Application due: ~~December 3rd, 11:59pm~~ December 6 @ noon / 12:00 PM

https://gcdri.commons.gc.cuny.edu/

Are you interested in using digital tools in your research, but don’t know where to start? Do you want to learn more about digital technologies alongside a supportive group of your peers?

The GC Digital Research Institute (DRI) is an introduction to core digital research skills that can be applied widely to different kinds of projects. This institute is catered towards building foundational skills and will be paced accordingly to ensure those who are new will leave feeling more confident in their skills. DRI is designed for beginning learners—those who have little or no prior experience working with a programming language.

Participants will attend a series of remote sessions from Jan 21st-24th, 2025 to develop basic digital research skills and connect with others in an interdisciplinary context.

To accommodate a remote-learning environment, the 2025 DRI will consist of four Zoom sessions. Our first day of the institute takes place on Tuesday, January 21, at 9:30am, which will begin with a kick-off event and consist of various sessions held between 9:30am and 4:30pm from January 21st to January 24th, with breaks scheduled throughout. Participants will learn Python, HTML/CSS, digital mapping, and more through dedicated workshops and practices. Beyond our Zoom meetings, DRI will also include space for informal co-working sessions and opportunities to build with peers. By applying, you agree to participate in all activities.

Participation is free, and open to graduate students, faculty, and staff at the Graduate Center; however, applications are required. Applications are due by ~~11:59 pm on Tuesday, December 3rd, 2024~~ noon (12:00 PM) December 6. More information about previous institutes and what to expect, as well as a link to the application can be found on the GC DRI website: https://gcdri.commons.gc.cuny.edu/.

If you are interested and would like more information, here’s a link to our FAQ.

Looking forward to seeing you there!

November 26, 2024

EXTENDED DATE: Apply by 2/7/2025 for the Provost’s Digital Innovation Grants

Provost’s Digital Innovation Grants Call for Proposals

Deadline: Friday, February 7, 2025 at 11:59 PM

Provost’s Digital Innovation Grants (PDIGs), a recurring GC Digital Initiatives program, provide financial support to doctoral students at the CUNY Graduate Center as they design and develop digital projects that contribute to the GC’s research, teaching, and service missions. Since 2012, PDIGs have supported a wide range of inventive projects across the disciplines, such as an online, open-access, crowdsourced database of mentor relationships within the field of writing studies; an app to support street medics and promote health and safety among activist communities; and a computational analysis of Cold War diplomatic history; and many others.

ELIGIBILITY

Graduate Center doctoral students at any level who are in good academic standing and currently enrolled are eligible to apply.

TYPES OF PROPOSALS

Projects at any stage of development are eligible for PDIG awards. Proposals may also include the initial development of a digital project or the ongoing development, growth, and deployment of established individual or team digital projects. Such projects may require additional resources to make a tool presentable to an academic audience or to improve the design of an early prototype based on feedback and evaluation. Proposals should describe how they address a challenge or problem in the applicant’s scholarly field.

Successful applicants will be asked to share a description on the Provost’s Digital Innovation Grant website and to write a white paper upon completion of the grant that will also be published on our website. Additionally, grantees will be expected to present publicly on their work in progress during the academic year, including presenting at the 2025 Digital GC Showcase on Monday, May 13th at 6:30 PM and participating in occasional collaborative meetings and discussions with current and past grantees.

Projects that use open-source tools and that focus on making work publicly accessible are strongly encouraged.

BUDGETS

Up to $300 for travel or workshop registration fees toward learning digital skills for which training is not already available at The Graduate Center or CUNY.
Up to $2,000 for projects at all stages of development to support project development, the purchase of equipment, time, digital services, or technical support. Applicants may use the budget to support their own time working on the project.

HOW TO APPLY

Complete the form at https://forms.gle/NBLdEnBapi9YeqT6A, which includes the following:

Contact information for the project lead (one primary point of contact must be designated).
Contact information for project team members (if any).
A short abstract of no more than 100 words.
A 300 word justification for travel/training awards or a 750-1000 word project narrative for project support.
CVs for the project lead and any co-investigators.
A budget spreadsheet including the total amount of funding requested and itemized list of expenses with item names and descriptions (eg. salary, webcamera, cloud storage).

Applications for Training Grants include the following sections:

Applicant information: name and full contact information, including email, mailing address, and phone number where we can reach you.
Type of training you propose to attend: title, website, time, date, and location of the training or activity to be attended.
Abstract: a one-paragraph abstract summarizing the type of training or other activity that the applicant would like to use funding to participate in and how it will be useful to the student’s continued pursuit of research goals.
Narrative: A 300-word justification (about 1800 characters) of the training or other activity that the applicant would like to participate, an explication of how the skills will further the applicant’s research goals, why the training could be useful for scholars in the applicant’s particular field, and ways in which the applicant’s participation could be made useful to other students at the GC. Please also describe any further research activities, papers, or scholarly work that would be made possible by participating in the proposed activity.
Budget (max $300): a detailed account of travel, registration, housing, or other expenses related to attending the event.
A 1-page CV

Applications for Project Grants must include the following sections:

Applicant information: name and full contact information of the project lead (who must be a doctoral student at the GC in good academic standing) including email, mailing address, and phone number where we can reach you.
Abstract: a one-paragraph abstract summarizing the innovative contributions of the project.
List of participants: a list of participants involved in the project (include title/affiliation for each participant).
Narrative: a 750 – 1000 word description of the nature and goals of the project and the work that has already been completed (if any).
Work plan: a brief roadmap of planned activities with a timetable tied to project goals.
Budget (max $2,000): an explanation of how and why funds will be spent on particular activities, services, or purchases (funds can be used for any aspect of the project, but must be justified in this section).
Appendices: 1-page CVs of major project participants and any ancillary material.
Faculty reference: Name and contact information for a faculty mentor or advisor who knows about and can speak to its merits.

EVALUATION CRITERIA

Proposals will be evaluated by a review committee according to the following criteria:

Scholarly excellence and innovation of the project;
Contribution of the project to the development and promotion of the mission of the CUNY Graduate Center;
Contribution of the project to the larger scholarly community and to the public;
Experience of the project staff;
Likelihood that work can be accomplished within the proposed budget and time period.
Given the GC Digital Initiatives’ strong commitment to open-access scholarship and free software platforms, preference is given to projects that use open-source tools and that focus on making work publicly accessible.

Please direct any questions to Lisa Rhody, Deputy Director of Digital Initiatives at lrhody at gc.cuny.edu.

Proposals are due Friday, February 7, 2025 @ 11:59 p.m.

November 25, 2024

Digital Humanities in the Anthropocene

The Anthropocene is the age marked by the substantial consequences that human activities have caused to the Earth system(the different natural processes and their interactions). What are some challenges of doing Digital Humanities in the Anthropocene?

November 15, 2024

What is metadata, and why does it matter?

Metadata is “data about data”. In other words, metadata is information that describes one or more aspects of data. It makes working with data easier, and it allows machines to interpret and display data. One example of metadata is the catalog card, which was used to locate books in a library prior to the creation of digital catalogs. From the catalog card below, we can learn over a dozen data points about the one book represented by the card: its title, author, the author’s lifespan, the original language of publication, the translator, the publisher and year of publication, the number of pages, the physical size of the book, and its call number (or location in the library). We can even surmise that this catalog card was created before 1960, since Camus’ death date is written in by hand.

Catalog card for Albert Camus’ The Stranger. Image credit: Rachel Mattson.

As humans, we are able to interpret these different data points, and understand their relative significance and meaning. For a computer to interpret this catalog card, it is necessary for the data to be structured and tagged. For digital library catalog to function, the book’s title must be indicated as the title; publisher as the publisher, and number of pages as the number of pages, and so forth. In libraries and archives, this is often accomplished through standards such as the Standard Generalized Markup Language or XML. Furthermore, different metadata standards and vocabularies are used to ensure that resources are being described in similar ways by different people and at different institutions. Commonly used metadata standards include MARC and Dublin Core (which will be familiar to users of Omeka). A vocabulary that all library users will have encountered is Library of Congress Subject Headings (LCSH), which is used to indicate the subject matter of items in library catalogs.

MARC record for Albert Camus’ The Stranger. The Library of Congress Subject headings appear in MARC field 650.

Metadata standards and vocabularies allow digital catalogs to return appropriate results to library users. However, these standards reflect the biases of both the creators of these standards, and how they are applied to individual resources. LCSH, which are used across the United States and around the world, are an excellent example. In 1980, the subject heading, or topical term, “Aliens, Illegal” was established by the Library of Congress to describe materials on the topic of unauthorized immigration. This term reflected a longstanding nativist framing of immigrants as both illegal, and fundamentally “alien” to the American body politic.[1]

In 2014, Dartmouth University students and staff submitted an official proposal to the Library of Congress to change the subject heading; two years later the Library announced the term would be replaced with “noncitizens” and/or “unauthorized immigration”. Shortly after this announcement, the House of Representatives ordered the Library of Congress to continue using the term “illegal aliens” (this was the first time Congress intervened in a LCSH change). Many libraries individually opted to stop using this term, which remained in official usage until 2021. The current subject heading is “illegal immigration,” thus only partial resolving the demands of librarians and student activists. You can learn more about the campaign to change this subject heading by watching the documentary Change the Subject or reading the thesis of Grad Center librarian Silvia Cho on this topic.

More recently, librarians and their allies have successfully advocated to add new subject headings pertaining to the historical existence of Palestine, including “Palestine question (to 1948),” which was largely unrecognized within the LCSH prior to 2023.

[1] Ngai, Mae M. Impossible Subjects: Illegal Aliens and the Making of Modern America. Princeton: University Press, 2004.

November 8, 2024

Tidying Data Using tidyverse in R

Introduction

When researchers collect quantitative data, they often store it in a CSV, TSV, or TXT file for cleaning and analysis. This data can then be loaded into R for visualization and statistical processing. However, before data can be effectively visualized or analyzed, it often requires cleaning and reshaping to be in the correct format. The tidyverse, a popular collection of R packages, provides tools to make this process more efficient. The tidyverse includes packages like readr (for reading data files), dplyr (for data transformation), tidyr (for reshaping data), and ggplot2 (for data visualization), along with tibble, stringr, purrr, and forcats. In this post, I’ll walk you through some essential functions in these packages for tidying your data, with exercises included to help you practice programming in R.

Prerequisites

Download and install R and RStudio on your computer. Then, open the RStudio and create a new R script so you can save your work for later. To get started, install the tidyverse package and load it into your script:

install.packages("tidyverse") 
library(tidyverse)

Let’s Create a Hypothetical Dataset!

Let’s create a hypothetical dataset that records students’ IDs, genders, and their mid-term scores in English, Biology, Maths, and Physics.

student_scores <- data.frame( 
    Student_ID = 1:11, 
    English = c(85, 78, 92, 67, 88, 76, 95, 80, 72, 90, 100), 
    Biology = c(95, 87, 90, 79, 94, 96, 93, 82, 89, 97, 105), 
    Maths = c(90, 82, 58, 74, 89, 91, 88, 77, 84, 92, 100), 
    Physics = c(78, 85, 89, 80, 90, 76, 83, 91, 87, 79, 100), 
    Gender = c("f", "f", "m", "m", "f", "m", 
               "m", "f", "m", "m", "non-binary"))

In this dataset, each column is a variable, each row is an observation, and each cell is a single value. You can view the dataset in RStudio using the following lines of code. The first line opens a new window and display the entire dataset, while the second line shows just the first six rows by default:

view(student_scores) 
head(student_scores)

The dataset looks like this:

Extracting the Rows You Need: `filter()`

If you want to work specifically with certain rows, you can use the filter() function. You can combine it with logical or/and boolean operators to extract the rows you want. Commonly used logical operators with filter() include:

== (is equal to)
!= (is not equal to)
> (greater than)
< (less than)
>= (greater than or equal to)
<= (less than or equal to)

For example, to create a new dataset (called tmp) that only includes scores for female students ("f" represents female while "m" represents male), you can use ==:

tmp <- student_scores %>% 
  filter(Gender == "f")

The %>% operator, known as the pipe, passes the object (in this case, our dataset) on its left as the input argument to the function to its right (in this case, the function below).

Or, if you want scores for students who are NOT female students, use !=:

tmp <- student_scores %>% 
  filter(Gender != "f")

You can also create a dataset that includes only students with Maths scores of 90 or higher:

tmp <- student_scores %>% 
  filter(Maths >= 90)

To add a bit more complexity, you might want to include only students with Maths scores of 90 or higher and Physics scores above 76. In this case, use the & operator (a boolean operator meaning “and”) to combine these two conditions:

tmp <- student_scores %>% 
  filter(Maths >= 90 & Physics > 76)

Other commonly used (boolean) operators with the filter() function include:

- | (or)
- !is.na() (is not an NA value, used to filter out missing values coded as NA in the dataset

Exercise 1: Try creating a dataset that includes only students with ID numbers from 1 to 5. Hint: use the %in% operator along with filter(). If you’re unsure what the %in% operator does, take a few minutes to look it up — it is a useful tool often combined with the filter()!

Selecting Specific Columns: `select()`

Sometimes a dataset can have many columns, but we only want to select some of it to play with. We can use select() to get the specific columns we need:

tmp <- student_scores %>% 
  select(one_of(c("Student_ID", "Maths", "Physics", "Gender")))

Alternatively, you can get the same dataset by deleting the unnecessary column using the - operator:

tmp <- student_scores %>% 
  select(-c("English", "Biology"))

You can also use the : operator to select a range of columns. For example, if you want to select Student_ID, English, Biology, and Maths (since they are next to each other in our dataset), you don’t need to type each column name individually. Instead, you can simply use:

tmp <- student_scores %>% 
  select(Student_ID:Maths)

Creating a New Column Based on an Existing Column: `mutate()`

Sometimes, you may need to add a new column to your dataset based values in an existing column. For example, if we know whether students failed in the Maths, you can create a new column (let’s name it as fail_in_maths) that marks a failed score with 1 and others with 0. You can use mutate() combined with ifelse() to achieve this:

tmp <- student_scores %>% 
  mutate(fail_in_maths = ifelse(Maths < 60, 1, 0))

If you’re still unsure of what ifelse() does, try asking RStudio for help:

?ifelse()

Running the code adds fail_in_maths as a new column, where 1 represents a failing Maths score and 0 represents a passing score.

Exercise 2: Create a dataset that includes only students with a passing Maths score.

Hint 1: use both mutate() and filter().
Hint 2: the %>% operator can be used multiple times to chain functions together.

Reshape the Dataset: `pivot_longer()` and `pivot_wider()`

Some functions that I will introduce soon, such as group_by(), summarise(), require the dataset to be in a specific shape. Currently, our dataset has each course score in separate columns, which isn’t ideal for calculating descriptive statistics like mean, maximum, minimum, and standard deviation. You can reshape the dataset to list all the scores in one column, with an additional column showing the course corresponding to each score. Since you are going to need this dataset for descriptive statistics, let’s name this new dataset as scores_reshape.

To do this, we can use pivot_longer(), which rearranges multiple columns into two new columns: one for the course names and one for the scores:

scores_reshape <- student_scores %>% 
  pivot_longer( 
    cols = English:Physics,    # columns to reshape: from English to Physics
    names_to = "courses",      # new column name for the course names
    values_to = "scores"       # new column name for the scores
  )

After running this, your dataset will have a courses column listing each course name, and a scores column showing each student’s score for each course.

Exercise 3: use pivot_wider() to convert scores_reshape back to its original form, where each course has a separate column for its scores.

Last but Not Least: `group_by()` and `summarise()`

The group_by() function allows you to group your dataset by a specific variable so that you can perform further operations, like summarise()within each group. For example, if you want to calculate the average scores of each course, you can first group the data by course with group_by(). Then, use summarise()to create a new dataset that calculates the mean for each group (in this case, each course is a group), with each group represented in a separate row.

To calculate the average scores, use mean() function as shown below:

mean_scores <- scores_reshape %>% 
  group_by(courses) %>% 
  summarise(ave_scores = mean(scores))
mean_scores

This code will produce a new dataset mean_scores, with two columns: courses and ave_scores. It will contain four rows, each representing the average score for one of the four courses: English, Biology, Maths, and Physics.

Note: summarize() and summarise() are synonyms. They can be used interchangeably.

Exercise 4: Create a new dataset that calculates the mean and standard deviation of scores for female and male students.

Hint 1: Use filter() to exclude the non-binary student.
Hint 2: Use sd() to calculate the standard deviation.

Additional resources

If you find this tutorial helpful and want to explore more functionalities of R for data analysis, here are some online books you can use for self-learning:

Data Analytics in Digital Research with R (an online book by our former Digital Fellow, Yuxiao Luo)
R for Data Science (second edition)
Learning Statistical Models Through Simulation in R

If you’d like to see the answer keys to exercises in this tutorial, or need help working through them, feel free to join our R User Group on Mattermost! I will be posting the answer keys there.

November 1, 2024

Web Scraping with Python and the Reddit API

Introduction

In this guide, we will be using Python to scrape data from Reddit. Reddit is a social news aggregation and forum-style discussion website. Registered members submit content to the site in the form of links, text posts, images, and videos, which are then “upvoted” or “downvoted” by other members. Posts are organized by subject into user-created boards called “subreddits”, which cover a variety of topics including news, science, movies, video games, music, books, and almost anything else you can think of.

What is Web Scraping?

Web scraping is the process of extracting information from websites. This can be done manually by a human user or automatically by a computer program. Web scraping is a powerful tool for data collection and analysis, and it has many applications in various academic and non-academic fields.

Why Scrape Reddit?

Reddit is a popular website with a large and diverse user base from around the world. It contains a vast amount of data on a wide range of topics, making it a valuable resource for data analysis. By scraping Reddit, you can collect data on user behavior, trends, opinions, and more. This data can be used for market research, sentiment analysis, content analysis, cultural analysis, and other purposes.

Many websites allow scraping of their data through APIs (Application Programming Interfaces), which provide a structured way to access and retrieve data. Reddit has its own API that allows you to access its data programmatically, and is probably the most efficient way to scrape Reddit data. It is important to note, however, that not all websites allow scraping, and some may have restrictions on how their data can be used. Always be sure to read and understand the terms of service of any website you plan to scrape.

Prerequisites

Before we get started, you will need to have Python installed on your computer. You can download Python from the official website.

You will also need to install the following Python library:

PRAW: The Python Reddit API Wrapper (PRAW) is a Python package that allows you to access Reddit’s API. You can install it using “pip” with the following command in your terminal: pip install praw

Lastly, you will want a text editor to write your code in. I recommend using VSCode.

Getting Started

To get started with web scraping on Reddit, you will need to create a Reddit account and obtain API credentials. Here are the steps to do this:

1. Go to the Reddit website and create an account if you don’t already have one.

2. Go to the Reddit Apps page and click on the “Create App” or “Create Another App” button.

3. Fill in the required fields. For name and description, you can enter anything you like, e.g. “reddit scraper” and “This app scrapes recent subreddit titles”, respectively. You can keep about URL blank. For the redirect URI, you can enter http://localhost:8080. In the app type field, select “script”, since this is a personal use script.

4. Click on the “Create App” button to create your app.

After creating your app, you will see a page with your app’s “client ID” (the string of characters underneath the app title and personal use script text) and “client secret.” You will need these credentials to authenticate your app when accessing the Reddit API.

Building the Reddit Scraper

Now that you have your Reddit account and API credentials, you can start writing a Python script to scrape data from Reddit. As an example, let’s create a scraper that retrieves the most recent post titles from a specific subreddit.

Create a new Python script (e.g., reddit_scraper.py) and import the praw library first:

import praw

Next, we want to create a new function that utilizes our credentials to access the Reddit API:

def connect_to_reddit():
    reddit = praw.Reddit(
        client_id='your_client_id',      
        client_secret='your_client_secret',
        user_agent='u/your_username'    
    )
    return reddit

This function creates a new Reddit instance using your client ID, client secret, and user agent. Make sure to replace the client ID and secret with the ones Reddit provided you. The user agent is a unique identifier that helps Reddit determine the source of the API request, and for this you can simply use your Reddit username.

Now, let’s create a function that retrieves the most recent post titles from a specific subreddit:

def get_recent_post_titles(subreddit_name, post_limit=10):
    reddit = connect_to_reddit()  
    subreddit = reddit.subreddit(subreddit_name)  # Choose the subreddit

    recent_posts = subreddit.new(limit=post_limit)

    post_titles = [post.title for post in recent_posts]
    return post_titles

This function takes the name of a subreddit and an optional post limit as input, connects to Reddit using our credentials, retrieves the most recent posts from the specified subreddit, and returns a list of post titles.

Next, we’ll want to call this function with the desired subreddit name and post limit:

if __name__ == "__main__":
    subreddit_name = 'dogs'  
    post_limit = 5

    titles = get_recent_post_titles(subreddit_name, post_limit)
    print(f"Most recent {post_limit} post titles from r/{subreddit_name}:")

    for idx, title in enumerate(titles, 1):
        print(f"{idx}. {title}")

Let’s break this last part down. First, we specify the name of the subreddit we want to scrape (subreddit_name) and the number of recent posts we want to retrieve (post_limit). We then call the get_recent_post_titles function with these parameters and store the returned list of post titles in the titles variable. Finally, we print out the post titles with their corresponding index numbers using a for loop.

When you run the code in this example, you should see the 5 most recent post titles from the “dogs” subreddit printed to the console.

What Next?

This is just a simple example of how you can scrape data from Reddit using Python. There are many other ways to interact with the Reddit API and extract different types of data. You can explore the PRAW documentation to learn more about the capabilities of the library and how to use it effectively.

In addition, there are many other Python libraries and tools available for web scraping like Beautiful Soup and Scrapy.

To make your results more interesting than just printing text to the console, you can also combine web scraping with data analysis libraries like Pandas and visualization libraries like Matplotlib or Seaborn to gain insights from the data you collect.

To learn more Python tips and tricks or discuss your own projects with fellow students, feel free to join the Python User’s Group (PUG) on the CUNY Commons.

Happy scraping!

Digital Initiatives at the Grad Center

Mapping 1: Beginners intro to ArcGIS Storymaps

Introduction to TEI (Text Encoding Initiative)

Have patience, think creatively: reflections on building digital identity and communications capacity for @gcsocialwelfare

Racialized Aspects of Data Collection & Data Use

Optimization & Standardization

Data Collection

Data Use

Conclusion

Join us at the GC Digital Research Institute

The GC Digital Research Institute (GCDRI)

EXTENDED DATE: Apply by 2/7/2025 for the Provost’s Digital Innovation Grants

Provost’s Digital Innovation Grants Call for Proposals

Digital Humanities in the Anthropocene

What is metadata, and why does it matter?

Tidying Data Using tidyverse in R

Introduction

Prerequisites

Let’s Create a Hypothetical Dataset!

Extracting the Rows You Need: `filter()`

Selecting Specific Columns: `select()`

Creating a New Column Based on an Existing Column: `mutate()`

Reshape the Dataset: `pivot_longer()` and `pivot_wider()`

Last but Not Least: `group_by()` and `summarise()`

Web Scraping with Python and the Reddit API

Introduction

What is Web Scraping?

Why Scrape Reddit?

Prerequisites

Getting Started

Building the Reddit Scraper

What Next?

Need help with the Commons?

Optimization & Standardization

Data Collection

Data Use

Conclusion

The GC Digital Research Institute (GCDRI)

Provost’s Digital Innovation Grants Call for Proposals

Introduction

Prerequisites

Let’s Create a Hypothetical Dataset!

Extracting the Rows You Need: filter()

Selecting Specific Columns: select()

Creating a New Column Based on an Existing Column: mutate()

Reshape the Dataset: pivot_longer() and pivot_wider()

Last but Not Least: group_by() and summarise()

Introduction

What is Web Scraping?

Why Scrape Reddit?

Prerequisites

Getting Started

Building the Reddit Scraper

What Next?

Extracting the Rows You Need: `filter()`

Selecting Specific Columns: `select()`

Creating a New Column Based on an Existing Column: `mutate()`

Reshape the Dataset: `pivot_longer()` and `pivot_wider()`

Last but Not Least: `group_by()` and `summarise()`