Data Mining Methods – class and research projects – Digital Initiatives at the Grad Center

Prof. Paul Attewell’s Data Mining Methods is a hands-on workshop aimed at exploring data mining theory. It is co-taught with computer scientist Prof. Robert Haralick. See the syllabus below.

Prof. Paul Attewell
Soc. 81900 – Data Mining Methods {17537}
Wednesdays, 4:15 – 6:15 p.m. Room TBA, 3 credits

“Data mining is the name given to an eclectic collection of statistical techniques that are already widely used in marketing and business and are likely to appear in social science research in the near future, but are rarely found in academic social science research at present. The list of techniques includes: partitioning or tree models; boosted trees, forests, and boosted forests; neural networks; linear and nonlinear manifold clustering; and partial least squares regression (aka ‘soft modeling’).

Data mining is especially well suited for analyzing very large datasets with many variables and/or cases, or where there might be many interactions or much heterogeneity in the data that is unknown to the researcher. Data mining tends to be ‘computationally intensive’ because it sometimes uses brute computer power, trying out many potential solutions or models, or trying to discover ‘hidden’ interactions between variables, before deciding which solution or model best fits the data. However, data mining software is now available for PCs (with plenty of RAM) running under Windows. From one perspective, data mining provides a partial automation of data analysis, with the computer rather than a human analyst deciding upon a statistical model to test, or which model is the most predictive. From another perspective, some of these techniques avoid the kinds of parametric assumptions that underlie more conventional econometric and statistical models, and are prized because of that.

This course will take a workshop format. Most of the class time will be devoted to learning to use data mining techniques, discovering their strengths and limitations, and trying to make sense of complicated data. Each student will be expected to pick a dataset or research problem and then apply these techniques to that problem, with much advice and help from the instructors. Students can bring their own data or problem, but we will also have various datasets from which students can choose. The class will take place in a computer lab at the GC. We have a license for a Windows-based data mining software suite called JMP Pro, and registered students will use this software.

This is an exploratory class – this is the first time it will be taught at the GC – and the class will be taught by Professor Robert Haralick, a computer scientist, and Professor Paul Attewell, a sociologist. We will be learning as we go, and the work will be hands-on, so do not take this course if you seek a well-structured, highly organized experience. But if you enjoy exploring new techniques and “learning by doing,” then this course may appeal to you. You should already have some familiarity with statistics, at least up to OLS regression and logistic regression, but this course will not be highly mathematical or technical. Course grades will stress attendance, participation, and project work. A paper will not be required.”