Research Thrusts (archived): Statistics Living-Learning Community

Research Thrust A: Atmospheric/Earth Science
  1. (Dr. Michael Baldwin) My research group is focused on one of the most important challenges in the atmospheric sciences: improving the understanding and prediction of high-impact weather events. These weather events (e.g., tornadoes, droughts, flooding precipitation) affect public safety as well as many sectors of the economy, such as agriculture, energy, water resources, and the insurance industry; their costs can be severe and wide-ranging. My group has focused on both the short-term prediction problem and the longer-term challenge of understanding the effects of global climate change on high-impact weather systems. More specifically, we have made substantial contributions to the understanding of weather systems through the development and application of automated analysis procedures that identify and analyze such systems in meteorological data. The algorithms developed within my research group have allowed for the rapid analysis of massive data sets, such as multi-decade downscaled climate simulations and evaluation of high-resolution forecasts covering periods of multiple years. My research group recently developed a prototype, real-time prediction system to directly measure the characteristics of precipitating weather systems in high-resolution model forecasts. Through the application of image processing techniques, weather features are automatically identified, characterized, classified, and tracked over time; we apply appropriate statistical models, and we evaluate the resulting predictions.

  2. (Dr. Sonia Lasher-Trapp) I am very excited to have the opportunity to introduce students in Statistics to real research problems in Atmospheric Science, where we make extensive use of statistical concepts and tools to evaluate trends, variability, and correlations, for example, for very large data sets. The data sets used in my research most often include time series of airborne observations of cloud and precipitation development (using a variety of instruments mounted on the aircraft) and/or 3D radar scans of cumulus congestus clouds (the precursors to thunderstorms), and output from high-resolution 3D numerical simulations of these clouds and the precipitation processes occurring within them. When studying clouds and precipitation, we are always struggling with issues of data representativeness, missing data, limited sampling, etc., and the few statisticians working in our field have made significant advances using different statistical models.

  3. (Dr. Robert J. Trapp) We make extensive use of statistical concepts and tools to evaluate trends, variability, and correlations in large data sets. In particular, I use time series of tornado and severe-storm occurrences, Doppler weather radar observations of tornadoes and tornadic storms, and output from numerical model simulations of such storms. In our research on severe weather, we constantly struggle with issues of data representativeness, biases, sampling, etc. Statistical models have helped resolve some of these issues, but fresh minds with a strong statistical background are needed to develop further models and otherwise help advance the science.

  4. (Dr. Wen-wen Tung) Our laboratory specializes in studying the dynamical predictability of Earth and atmospheric systems and related phenomena on a variety of temporal and spatial scales. Some of our datasets are drawn directly from the United States Geological Survey database. For example, recently we have found a particular interest in time series data related to the flows of rivers. For many rivers, especially those in long-developed locations, we have access to reliable records of daily river flow measurements spanning well over 100 years. These complete and relatively lengthy records are an excellent starting point for analysis. Multiscale analysis of geophysical time series is one of our lab's specialties.

  5. (Dr. Hao Zhang) Many projects are possible using Climate Data Online from the U.S. National Climatic Data Center (NCDC), which provides daily, monthly, and annual precipitation and temperature data at thousands of weather stations. Students can build an understanding of time series analysis, spatial correlation and interpolation, extreme value theory, etc. They can practice model fitting and forecasting. They will learn skills to manage data, e.g., editing, merging, and splitting data sets. Students will also be introduced to topics not taught in undergraduate classes, e.g., spatial interpolation and extreme value theory.

  6. (Dr. Frederi Viens) Students can work with Viens and with former Ph.D. student L. Barboza (now at the University of Costa Rica) and other Ph.D. students and colleagues at Purdue and elsewhere, to quantify temperature changes, including uncertainty evaluation, over the last 1,000 years regionally and globally. Viens's group's current research draws on global data; a possible new specific focus could be parts of Africa, because that is where climate change will have the biggest impact on populations, and where some of the most effective solutions may reside. Viens recently served as a Franklin Fellow (2010--2011) for the Africa Bureau at the U.S. Department of State, where he advised U.S. diplomacy on environmental challenges facing sub-Saharan Africa. His background is in probability theory and stochastic processes; he works on theoretical topics in stochastic analysis and applications to mathematical finance, mathematical statistics, and environmental modeling.

  7. (Dr. Yutian Wu) Our research group aims at understanding the dynamical processes in the large-scale circulation of the atmosphere and how these processes respond to anthropogenic climate change. One current research project focuses on the fastest-warming region on the globe: the Arctic. We are particularly interested in questions such as: What are the processes that cause Arctic warming? How does Arctic warming affect the weather and climate in North America? Are we going to suffer more extreme weather events in the future? The project will be of both scientific and societal importance for better understanding and predicting the future climate in North America. The project will use both observational datasets and state-of-the-art global climate model simulations. Analysis techniques such as time series analysis, spectral decomposition, and maximum covariance analysis will be utilized.

Research Thrust B: Biostatistics
  1. (Dr. Ruben Claudio Aguilar) My cell biology laboratory is particularly interested in basic cellular mechanisms, with an emphasis on vesicle trafficking (e.g., intracellular protein transport). Every day we produce enormous data sets from our morphometric analysis of microscopy-generated cell images. The analysis of these (and similar) result collections will be valuable to the students and useful to us. We expect that, following initial training, the students will be able to propose and discuss the advantages and disadvantages of different analytical approaches and to actively participate in the experimental design. In the past, I have successfully recruited undergraduate students from the biology courses I teach. In addition, I participate in the NSF-backed Louis Stokes Alliance for Minority Participation (LSAMP) program and the Purdue Summer Research Opportunity Program (SROP). In our lab, undergraduate students receive scientific training and are given the opportunity to pursue independent research sub-projects. In addition, our undergraduates participate in lab meetings (where they are encouraged to contribute and ask questions), and they are trained in the good practices of scientific presentation. Indeed, our students have been very successful in their research endeavors; they have earned multiple awards for poster presentations at undergraduate research events and several paper authorships.

  2. (Dr. Hyonho Chun) Recent advances in high-throughput sequencing technology produce massive data for revealing DNA sequence composition, finding transcription factor binding, and quantifying gene expression levels; each assay produces 2--3 GB, and with multiple assays (replicates), this is truly ``Big Data''. A sequencing machine reveals the bases of millions of short segments of DNA or RNA in a massively parallel way. The resulting reads need to be mapped back to the genome. This can be done with many free, open-source software tools such as Bowtie and SOAP. One needs to summarize the mapping results, called the pile-up step, to see whether there is a base pair change in DNA (SNP discovery), whether the transcription factor binding occurs (peak calling), and to measure how genes are expressed (transcribed). Afterwards, one can perform statistical analysis. Since the sequencing techniques are new, most analyses are based on very simple statistical methods, and should be understandable to undergraduates with appropriate guidance and discussion. The students will benefit from working with Chun and Ward on Condor for parallel computational analysis.

  3. (Dr. Laszlo Csonka) We have two potential projects that could involve sophomore students. One of these involves comparing the rates of evolution of "meaningless" non-coding sequences and gene-coding sequences in Escherichia coli, Salmonella enterica, and other closely related Enterobacteriaceae.

  4. (Dr. Laszlo Csonka) The second one consists of investigation of the conservation of gene order (synteny) in distant species of bacteria. Both of these projects require analyses of very large DNA sequence data sets, and therefore would be ideal for computer-savvy statistics majors. It will be a great learning experience for them to be exposed to the data, vocabulary, and way of thinking of biologists.

  5. (Dr. Rebecca Doerge) Trainees will study an epigenetic modification called DNA methylation, which plays a role in cellular differentiation and cancer development. ``Next-Generation Sequencing'' (NGS) technologies yield discrete count data, at single-base resolution, across the entire genome. With sodium bisulfite treatment (which causes changes to the DNA based on individual cytosine methylation status), NGS can be used to investigate DNA methylation. Students can perform Fisher's exact test for differences in methylation levels at every genomic cytosine. Using start/stop locations, students can essentially test every gene for differences in methylation levels. The dichotomy between cytosine-level and gene-level testing allows students to experience statistical issues such as data quality, variability, and multiple testing in large-data applications.

  6. (Dr. George Moore) Our research and collaborations involve veterinary medical and veterinary public health data generated from Purdue's Veterinary Teaching Hospital, large veterinary practices, or commercial veterinary diagnostic laboratories. Projects for student involvement will include practical applications of medical dataset structure, handling missing patient data, appropriate statistical methods, and presentation of data/findings for veterinary clinical audiences and publication.

  7. (Dr. Doraiswami Ramkrishna) In my research group, we have been developing mathematical models to describe metabolism since the 1980s. In doing so, we have developed our own theory to describe the metabolic behavior of cells. Our main goal is to compare our model predictions with high throughput bioinformatic data that represent intricate intracellular processes on a genomic level. A variety of technologies are equipped with the power to provide the needed quantity of data including microarrays, RNA-seq, and protein mass-spectroscopy. The overall goal of this project is the validation of a metabolic theory by means of extracting patterns from data. Looking for trends in the differential expression of genes in volumes of omic data--and comparing them with model predictions--presents the opportunity for the authentication of this model at the genome level. Approaches for analyzing high throughput bioinformatic data are diverse and extend to data mining, Bayesian statistics, and Markov Chain Monte Carlo analysis.

  8. (Dr. Maria Sepulveda) Water quality has a huge influence on fish physiology. Marine fish of course are healthier when raised in high salinity (30 ppt) water. However, it is cheaper and easier to raise marine fish in low salinity (< 5 ppt) conditions. This is important because we are relying more and more on hatchery raised fish for our consumption since most marine fish stocks have been depleted. We raised Florida Pompano, a marine fish, under low and high salinity conditions and noticed that some of the fish raised under low salinity conditions did very well while others got sick and died. We collected tissues responsible for osmoregulation (kidneys, liver, gills and gastrointestinal tract) from healthy and sick fish and conducted Next Generation Sequencing to determine differentially expressed genes in these two groups of fish. Specific objectives of this project include: 1) establish transcriptome libraries for gill, liver, kidney and gastrointestinal tract of Florida pompano reared in high and low salinities; 2) identify gene transcripts for osmoregulatory genes, key metabolic enzymes and stress response; 3) compare gene transcript abundance between Florida pompano reared in high and low salinities; and 4) discover unique sequences that may play key roles in the adaptability of marine fish to low salinity.

  9. (Dr. Lyudmila Slipchenko) We are developing a new polarizable force field, BioEFP, for modeling processes in biology, biomedicine and materials. Potential applications of BioEFP are in drug design, cancer research, bioimaging and photovoltaics. BioEFP is based on ideas derived from quantum mechanics and does not contain parameters fitted to experiment. Instead, parameters are obtained from electronic structure calculations on chemical fragments. The accuracy of the BioEFP force field is superior to the accuracy of common classical force fields. One of the main shortcomings of BioEFP is that the parameters are not readily available but have to be computed a priori. To overcome this obstacle, we propose to create an online repository of pre-computed fragment parameters and develop a similarity search algorithm that would assign each fragment of a biological or materials macromolecule to a pre-defined fragment. As a longer-term goal, we propose to interface a high performance computing (HPC) cluster with a web interface such that missing parameters could be computed on-the-fly. We expect the fragment database will contain several thousand chemically unique fragments; the amount of data associated with each fragment ranges from several Kb to several Mb.

  10. (Dr. Jun Xie) Students will learn about statistical methods for large-scale genomic data analysis. Nowadays whole-genome genetic information is commonly available in disease studies and clinical trials. For example, genome-wide association studies analyze a large number of common genetic variants, i.e., single nucleotide polymorphisms (SNPs), in individuals to examine whether any genetic variants are associated with a disease. Another example is pharmacogenomics research, which uses whole-genome information to predict individuals' drug response. Students will learn about modern statistical methodology developed for these types of big data, including multiple testing rules, variable selection and dimension reduction methods. Students can gain hands-on experience through statistical analysis of specific data sets from the databases of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH).
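
The cytosine-level testing described in item 5 above, together with the multiple-testing concern raised in items 5 and 10, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any group's actual pipeline: the read counts are invented for demonstration, the function names are hypothetical, and the Bonferroni rule is just one simple multiple-testing correction chosen for brevity.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that remain significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical read counts per cytosine (illustrative only):
    # (methylated_A, unmethylated_A, methylated_B, unmethylated_B)
    sites = [(18, 2, 3, 17), (10, 10, 11, 9), (5, 15, 6, 14)]
    p_values = [fisher_exact_two_sided(*s) for s in sites]
    for site, p, keep in zip(sites, p_values, bonferroni_significant(p_values)):
        print(site, f"p={p:.4g}", "significant" if keep else "not significant")
```

With millions of cytosines, dividing the significance level by the number of tests becomes very conservative, which is one reason the choice of multiple-testing rule is itself a topic worth discussing with students.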

Research Thrust C: Healthcare Engineering and Healthcare/Biomedical Analytics
  1. (Dr. Azza Ahmed) Ahmed's research is focused on developing and testing interventions that support and improve breastfeeding outcomes among vulnerable populations, specifically, preterm infants and low-income mother/infant dyads. She has been collaborating with the Indiana WIC program to study breastfeeding outcomes among late preterm and early term infants in a longitudinal study. She designed LACTOR, an interactive web-based breastfeeding monitoring system. She just finalized a randomized controlled trial to test the effect of LACTOR on breastfeeding outcomes with a large online dataset. Dr. Ahmed is also collaborating with Purdue Animal Sciences, Purdue Statistics, and Eskenazi Health in a longitudinal study to test the effect of sleep quality during pregnancy on breastfeeding outcomes. She is also collecting data on peripartum depression, night eating habits and obesity.

  2. (Dr. Ulrike Dydak) Dydak's area of expertise is in Magnetic Resonance Imaging and Magnetic Resonance Spectroscopy. She also maintains a research lab at the Indiana Institute for Biomedical Imaging Sciences (IIBIS) at the Indiana University School of Medicine. She is currently working with colleagues in Biostatistics, Neurology, Toxicology and Psychiatry, designing and implementing clinical MRI/MRS studies. For instance, at present, they are working on finding significant effects in datasets that contain measurements of metabolite concentrations from different brain regions, and they correlate those measurements with biological measures, diagnostic groups, levels of environmental exposure and other measures.

  3. (Dr. Haslyn Hunte) As Assistant Director of the Center on Poverty and Health Inequities (COPHI) at Purdue University, I work on reducing poverty-related inequities through partnerships with local communities. We study trends and problems such as insufficient access to food, barriers to treatment and health care inequalities, and inequities in policies, especially with regard to the poorer segments of populations. We believe that students benefit from seeing the full scope of the data analysis and policy research that we work on. For example, I believe that some of the sophomore participants will appreciate (and perhaps even relate to) the variables I study as a part of a funded project on the health care safety net population. The specific aims of the research project are to provide insights into strategies that will 1) improve cost-effective healthcare delivery and 2) reduce disparities in health outcomes for vulnerable populations by establishing new methods for the planning and operation of the safety net system. The research objectives are: 1) Identify and map the locations of where individuals live and where they receive care within the core safety net provider system. 2) Determine whether bypass behavior is exhibited by patients for each episode of care they seek. 3) Determine the association among sociodemographic variables and care-seeking behavior. To achieve our objectives we will utilize a data set with 69 million patient encounters over a five-year period from the Indianapolis Metropolitan Statistical Area.

  4. (Dr. Haslyn Hunte) I am also engaged in more traditional social epidemiology research that would provide opportunities for one or more mentees. Using several large datasets, I am interested in the following research questions: 1) What is the association between experiences of interpersonal discrimination and health behaviors and health outcomes? 2) To what extent, if any, does racial/ethnic discrimination explain any of the observed racial/ethnic disparities in health-related outcomes like tobacco and alcohol use/abuse, obesity and elevated blood pressure? 3) What is the relationship between positive psychological functioning and psychological challenges such as discrimination, and how do they interact to produce the absence or presence of poor health? 4) Does the heterogeneity within the US Black population explain any of the observed Black-White disparities in health outcomes?

  5. (Dr. Nan Kong) Feature Selection in Efficient and Effective Analysis of Photoacoustic Imaging Data for Plaque Vulnerability Characterization in Acute Cardiovascular Syndrome: Of all pathological features, lipid rich core, thin fibrous cap, and infiltration of inflammatory cells are considered as three major hallmarks of acute cardiovascular syndrome. Current imaging modalities have limited abilities to characterize plaque vulnerability. In this project, we will analyze the PA microscopy spectral data collected by a phantom made of clusters of cholesterol ester (CE) and cholesterol crystal (CC). For the spectral data analysis, an important aspect is to identify and select features with which we can efficiently and effectively distinguish CEs and CCs. The funded student is expected to work closely with Dr. Kong and periodically participate in Dr. Ji-xin Cheng's intravascular photoacoustic research team meeting when data analysis is discussed.

  6. (Dr. Nan Kong) Fall Detection using 3D Accelerometer Data: Falls are a common problem for the elderly, often resulting in hospitalization. Despite extensive preventive efforts, falls continue to be a major source of morbidity and mortality among the elderly. Real-time detection of falls may enable rapid medical assistance, thus increasing the sense of security of the elderly and reducing some of the negative consequences of falls. In this project, we will analyze 3D accelerometer data collected on simulated falls performed by healthy volunteers. The objective of the project is to develop fall detection algorithms and conduct comparative studies. The funded student is expected to work closely with Dr. Kong and periodically meet with Dr. Babak Zaire, Professor of Electrical and Computer Engineering, and Dr. Shirley Rietdyk, Professor of Health and Kinesiology.

  7. (Dr. Nan Kong) Feature Selection in Biomedical Image Analysis. Advanced imaging modalities (e.g., functional MRI and label-free imaging) are increasingly used for structural, functional, metabolic, as well as biological image analyses. Underpinning much of the research is the need to develop new methodologies that can extract useful information from very large databases. Methodological advances in biomedical data mining are expected to revolutionize many clinical specialties. Among the various developments, an important task is to extract features that can be used to differentiate/label subjects with respect to identified structural, functional, metabolic, or biological features. In this project, we will investigate feature extraction tasks for two types of data, fMRI and photoacoustic data. The funded students are expected to work closely with Dr. Kong and periodically meet with two BME professors: Dr. Ji-xin Cheng, an expert on optical spectroscopy, and Dr. Zhongming Liu, an expert on fMRI.

  8. (Dr. Nan Kong) Fall Detection using Time-Series Data. Falls are a common problem for the elderly, often resulting in hospitalization. Despite extensive preventive efforts, falls continue to be a major source of morbidity and mortality among the elderly. Real-time detection of falls may enable rapid medical assistance, thus increasing the sense of security of the elderly and reducing some of the negative consequences of falls. In this project, we will analyze time series of 3D accelerometer data collected on simulated falls performed by healthy volunteers. The objective of the project is to develop fall detection algorithms and conduct comparative studies. The funded student is expected to work closely with Dr. Kong and periodically meet with Dr. Shirley Rietdyk, Professor of Health and Kinesiology.

  9. (Dr. Mark Lawley) The first project involves a large data set with 30 years of patient data on outpatient appointments, emergency department usage, hospitalizations, and laboratory results. The students would need to first understand the nuances of working with de-identified medical data, confidentiality, HIPAA guidelines, etc. The intended outcome of the project would be a set of models for predicting the cost and health impacts of no-show behavior (failing to attend a scheduled medical appointment) for chronically ill patients. Our past work has shown a strong correlation between no-show behavior and increased use of hospital resources, but we need additional work to better explore and understand these relationships. Because we are all users of the U.S. healthcare system, this is an important problem to which the students can easily relate. Further, it introduces them to a number of important statistical techniques in a practical, concrete way in a context that they can appreciate.

  10. (Dr. Mark Lawley) Another project involves diabetes. Students should, once again, relate to the context of this problem since they will almost certainly have relatives or close acquaintances afflicted by diabetes. The management of diabetes requires a careful balancing act. Patients that have chronically high glucose levels (hyperglycemia) risk long term damage to the kidneys, heart, eyes, and feet. On the other hand, over-control of glucose levels can lead to short term glucose levels that are far too low (hypoglycemia), which can cause dizziness, incoherence, fainting, and other problems. We are in the process of obtaining a large data set on diabetic patient glucose levels which we will use to study this problem of short and long term risk balancing. The students could help with time series analysis and learn about some of the simulation and optimization techniques we intend to use in the work.

  11. (Dr. Laura Prouty Sands) My training in multivariate modeling and psychometric analysis of survey instruments, combined with my 25 years of health outcomes research, gives me the content expertise to effectively mentor undergraduates interested in learning how to analyze and interpret databases relevant to healthcare practice and policy. My research is focused on determining optimal care pathways for vulnerable older adults. Currently I am funded by two NIH grants. The first assesses risks for post-operative cognitive decline among older surgical patients. My role on that project is to develop the methods for detecting post-operative cognitive decline and to supervise analyses of project data.

  12. (Dr. Laura Prouty Sands) The second project is directed toward determining health outcomes of unmet need for disabilities among older adults using survey and Medicare claims data. I have mentored eight Ph.D. students from the Department of Statistics, as well as two undergraduate Statistics students. I have access to a wide range of datasets related to healthcare practice and policy, e.g., the Inter-university Consortium for Political and Social Research (ICPSR), as well as data from the Centers for Medicare and Medicaid and the Centers for Disease Control and Prevention (CDC).

  13. (Dr. Cleveland Shields) Every semester, we have 4 to 6 undergraduate students working in our Relationships and Healthcare Lab, which I co-direct with Dr. Melissa Franks. Thus, we have considerable experience integrating undergraduates into research projects. We can involve students in analyzing data on three projects. First, students can work on data analysis for a project that Dr. Franks and I are conducting on hospital readmission of patients with diabetes, examining discharge planning and family involvement.

  14. (Dr. Cleveland Shields) Second, colleagues and I in the Regenstrief Center for Healthcare Engineering (RCHE) are conducting a study of health services utilization to identify geographic locations producing high utilization and costs, using longitudinal data from medical records from St. Vincent Hospital Systems in the Indianapolis area. Students could help design and conduct the analysis for this project.

  15. (Dr. Cleveland Shields) Finally, I am conducting a field experiment examining physician-patient communication. We will be gathering 240 audio recordings of interactions between physicians and actors who will be portraying a patient role. Dr. Sharon Christ serves as the statistician on this project. We will be conducting psychometric analyses to understand the underlying constructs in the measurement of communication in these medical encounters. This presents a context that is surprising and is likely to pique students' interests.

  16. (Dr. Lingsong Zhang) Zhang is working with Lawley (Biomedical Engineering) and Sands (Nursing) on statistical modeling of patient ``no-show''. They are collaborating with Alliance of Chicago, using scheduling data and electronic medical records from 7 clinics over 3 years. The focus is on diabetic patients, who visit regularly. They want to involve students in using scheduling history and demographic factors (payer class, income, education, age) to predict no-show probability and uncertainty.

  17. (Dr. Lingsong Zhang) Zhang is also working with S. Witz and K. Musselman (from the Regenstrief Center for Healthcare Engineering, RCHE), H. Wan from Purdue Industrial Engineering, and J. Castro from University of South Florida, on analysis of hospital readmission characteristics and prediction. RCHE is working with BayCare Health System (Tampa, FL) and St. Vincent Health (Indianapolis, IN), using multiple-year discharge data and patient characteristics, to identify (1) readmission probability upon discharge, (2) clinical/demographic factors associated with readmission, and (3) performance comparisons of hospital readmissions.

Research Thrust D: Probability; Theoretical Statistics; Image Processing; Financial Modeling
  1. (Dr. Guang Cheng) Students will learn resampling using bootstrap methods. The bootstrap is widely applicable for inference in massive data; however, it is computationally demanding. Kleiner et al. introduce the Bag of Little Bootstraps (BLB): a robust, computationally efficient means of assessing the quality of estimators; it combines the results of bootstrapping multiple small subsets on parallel computing architectures. G. Cheng proposes to investigate with students whether applying the m out of n bootstrap or subsampling within each subset of the BLB will overcome the inconsistency in the bootstrap. Longer term: he wants to study BLB under the settings of M-estimation with a student, e.g., its consistency, asymptotics and computational efficiency in dealing with massive data.

  2. (Dr. Raghu Pasupathy) Motivated by contexts such as air quality measurement using cheap sensors, energy monitoring through smart meters in large buildings, and tracking stock tickers on mobile devices, we ask: Is existing statistical and simulation methodology adequate for online "big data" contexts? How should methods for estimating traditional statistical measures, e.g., quantiles, conditional value-at-risk adapt to the online context? Are there low-storage, fast-compute versions of function estimators, e.g., kernel densities, stochastic kriging, that are just as accurate as existing estimators? Students will help to construct and analyze O-estimators --- a new class of estimators characterized by (provably) minimal storage and update complexities, and having convergence rates matching those of analogous traditional statistical objects.

  3. (Dr. Ilya Pollak) An important area of image processing that I work in is segmentation, i.e., developing computer algorithms for automated detection of object boundaries in images. This is a critical image analysis step in problems arising in many areas, such as biomedical imaging, computer vision, and microscopy of materials. The analysis of such algorithms requires statistical comparisons of the algorithms' outputs on a large image database with ground-truth segmentations. Constructing ground-truth segmentations and writing basic utilities for such comparisons (in R or in Matlab) would be a great sophomore research project.

  4. (Dr. Ilya Pollak) In the area of quantitative finance, there are a number of recent papers on the analysis of so-called technical indicators. A very interesting sophomore research project would be to read one of these papers, implement (again, in R or Matlab) several indicators described therein and conduct statistical analysis of forecasting performance of these indicators on real market data.

  5. (Dr. Xiao Wang) Wang is currently working with Professors Chuanhai Liu and Lingsong Zhang on projects related to statistical computing, spatial statistics and image analysis. Dr. Wang has ongoing work with one undergraduate student, Yuxi Yang, on statistical modeling of ozone, nitric oxide and nitrogen dioxide levels in California. He wants to include more undergraduate students in this project.

  6. (Dr. Xiao Wang) Wang is currently working on projects related to statistical computing, spatial statistics and image analysis. Specifically, Dr. Wang is developing deep learning methods for neuroimaging data. For example, one study examines the predictive value of ultra-high dimensional imaging data and/or other scalar predictors (e.g., cognitive scores) for clinical outcomes, including diagnostic status and response to treatment, in neurodegenerative and neuropsychiatric diseases such as Alzheimer's disease (AD). The growing public threat of AD has raised the urgency to discover and validate prognostic biomarkers that may identify subjects at greatest risk for future cognitive decline and accelerate the testing of preventive strategies. In this regard, prior studies of subjects at risk for AD have examined the utility of various individual biomarkers, such as cognitive tests, fluid markers, imaging measurements, and some individual genetic markers (e.g., ApoE4 gene), to capture the heterogeneity and multifactorial complexity of AD (reviewed in Weiner et al. 2012). He wants to include more undergraduate students in this project.

  7. (Dr. Mark Daniel Ward) Students in Ward's research group will analyze asymptotic properties of randomly-generated sequences and trees using probabilistic generating functions, and simulations in R, as well as some Maple, for solving recurrences and deriving asymptotics. Undergraduates can also work with Ward on stochastic leader election algorithms.
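
  The Bag of Little Bootstraps idea described in item 1 of this thrust can be sketched in a few lines of Python. This is only a minimal illustration, not the reference implementation of Kleiner et al.: the function name `blb_standard_error`, its parameters, the choice of the sample mean as the estimator, and the subset-size exponent `gamma` are all assumptions made for the example. Each small subset is resampled up to the full sample size n, a standard error is computed per subset, and the per-subset results are averaged.

```python
import random
import statistics
from collections import Counter


def blb_standard_error(data, n_subsets=5, gamma=0.7, n_resamples=50, seed=0):
    """Estimate the standard error of the sample mean with a BLB-style scheme.

    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    # Size of each small subset, b = n**gamma, kept within [2, n].
    b = min(n, max(2, int(n ** gamma)))
    per_subset_se = []
    for _ in range(n_subsets):
        # Draw a small subset of b points without replacement.
        subset = rng.sample(data, b)
        estimates = []
        for _ in range(n_resamples):
            # Resample n points from the b subset points; only the
            # multiplicity of each distinct point needs to be stored.
            counts = Counter(rng.choices(range(b), k=n))
            weighted_mean = sum(subset[i] * c for i, c in counts.items()) / n
            estimates.append(weighted_mean)
        # Quality assessment for this subset: spread of the resampled means.
        per_subset_se.append(statistics.stdev(estimates))
    # Combine the subsets by averaging their quality assessments.
    return statistics.fmean(per_subset_se)
```

  In this sketch the subsets are processed in a simple loop; because each subset is handled independently, the outer loop is the part that could be distributed across parallel workers.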

Research Thrust E: Human Development and Family Studies
  1. (Dr. Edward Bartlett) Bartlett's research area is sensory neurophysiology. His lab records neural signals at the population level, where activity is averaged over thousands of neurons, at the single neuron level in live animals in response to sounds, and during intracellular patch-clamp recordings of single neurons. The research focus is to dissect the neural circuits involved in the representation of sound features across the lifespan, from early development before hearing onset through adulthood and age-related decline. Each of these neural data streams produces large amounts of continuous time-series data and discrete time-series data. Although simple analyses of these data can produce interesting and informative results, somewhat more advanced analyses, such as multiple linear regression, interval distributions, or correlations, would yield significant insights. Our lab would benefit greatly from mentoring one or more of these students.

  2. (Dr. Edward Bartlett) Bartlett's research area is sensory neurophysiology. The research focus is to dissect the neural circuits involved in the neural coding of sound features across the lifespan, from early development through adulthood and age-related decline. Neural data are obtained from recordings of single neurons and neural populations in response to speech-like and simple sounds. In addition, realistic computational models of single neurons or small groups of neurons are constructed to understand the data.

  3. (Dr. Sharon Christ) Christ can guide students on analysis of large survey data collected from people. One sample is representative of the children involved with Child Protective Services in the U.S. and the other is representative of all non-institutionalized adults in the U.S. These data are longitudinal and involve complex sampling features such as clustering (non-independence) and unequal selection probabilities. As a result, trainees will learn how to apply probability-weighted estimation and variance estimates that are robust to clustering. In addition, these surveys suffer from missing data and measurement errors due to the self-reported nature of the data collection. Trainees will learn modeling approaches used to avoid biases due to missing data and measurement error. The statistical analysis will be applied to the study of the effects of maltreatment on adolescent development and the effects of work and working conditions on health and well-being of youth and young adults.

  4. (Dr. Sharon Christ) Christ will work with Weber-Fox and her students on the sample of children observed in her audiology lab. In this study, they will work on modeling changes in stuttering over time, and what characteristics are correlated with persistent versus desisted stuttering.

  5. (Dr. Sharon Christ) For another study, students can use the National Health Interview Survey (NHIS) to evaluate how occupations are related to smoking, alcohol use, exercise, asthma, heart disease, etc. NHIS is the national data set used to survey the U.S. adult population with respect to health.

  6. (Dr. Sharon Christ) Christ can guide students on analysis of large survey data collected from people. Data sets are representative samples of the general population of children and adolescents in the U.S. and the population of children involved with Child Protective Services (CPS) collected through surveys and observation. Trainees will work with Dr. Christ to study the effects of exposure to maltreatment on adolescent development and the effects of work and working conditions on health and well-being of adolescents and young adults. Trainees will learn how to overcome the common features of survey data of human populations, including missing data, nesting/clustering, and selection bias.

  7. (Dr. Sharon Christ) For another study, students can use time-series analyses to study sleep patterns in children, especially children diagnosed with autism spectrum disorders.

  8. (Dr. Elliot Friedman) Some of our work involves the use of data from large, nationally representative survey-based studies, and a perpetual issue with such studies is missing data. In some cases these data are missing at random (e.g., people skipped a question by accident), and in other cases the missingness may not be random (e.g., possible reluctance to answer questions about income or more sensitive topics). Students will have the opportunity to look for patterns of missing data to determine whether they are random or systematic. They will also devise appropriate strategies for imputing missing values in order to increase the power and reliability of analyses based on these data.

  9. (Dr. Lisa Goffman) I study specific language impairment (SLI) in children. Children with SLI show cognitive abilities within normal levels, but significantly impaired language abilities. Although their cognition is typical, it has recently been found that these children also often show impairments in their gross and fine motor skills. Because children with SLI are at risk for long-term social and academic difficulties, there is a critical need to understand the factors underlying their language and motor deficits and to develop efficacious approaches to treatment. In my NIH funded research, we are presently conducting a longitudinal study of children with SLI to better understand how their language and related motor skills change from the preschool to the school age years. We include standard language and motor measures as well as direct recordings of speech and limb movement. Our goal is to better understand how language and motor domains develop in these children, and how they change over the critical early school years.

  10. (Dr. Christine Weber-Fox) My work is in neural systems for language processing in typical development and in those with communication disorders such as stuttering or language impairment. I also have clinical experience working in both hospital (outpatient, inpatient, and acute care) and school settings. The motivation to study language processing and its connections to stuttering is apropos for sophomore students, who can readily understand the reasoning and context for why this topic is important. (The recent success, for instance, of The King's Speech demonstrates that this is a topic of broad concern and interest.) My work also focuses on how neural subsystems may differ in speakers with different language experiences and communication skills, as brain processes for language vary even across individuals with 'normal' language abilities. The types of data to be analyzed in our research group include behavioral/clinical measures from children, such as cognitive test scores, including nonverbal IQ and working memory, as well as detailed measures of their speech and language performance. In addition, our data set includes physiological measures of brain activity (Event-related Brain Potentials, ERPs). As the student becomes familiar with the research goals and methods, the expected outcome for a statistics sophomore student is for them to help manage and analyze large data sets that cross domains (behavioral, electrophysiological) and span longitudinally from 4-9 years of age.

  11. (Dr. Ellen Wells) The Deep Green and Healthy Homes project was sponsored by the nonprofit Environmental Health Watch in Cleveland, Ohio (PI: Stuart Greenberg). It compares two standards of energy efficiency renovations in low-to-moderate income housing in Cleveland. Six homes were renovated to standard energy efficiency recommendations (~50% energy savings); six additional homes were renovated to a stricter standard (~75-90% energy savings) and included mechanical ventilation systems to help preserve air quality. Homes were monitored just after renovation and for ~1 year following renovation using new indoor air quality technology, and home visits were made every three months to conduct visual inspections, take indoor air quality measurements with field instruments, and collect data from participants via questionnaire. The new indoor air quality monitoring technology incorporates low-cost gas/temperature/relative humidity sensors into a single platform, which wirelessly transmits data from the field site to our servers twice per minute. Six parameters are measured by the sensors: temperature, relative humidity, CO, CO2, NOx, and total VOCs. We developed calibration equations which incorporate data from all sensors within the unit (each sensor will also respond, somewhat more weakly, to a gas similar in structure to its target gas). For most homes we collected more than 1 million rows of data. Potential projects using these data include: further methodological analysis of calibration/data transmission from the new technology; correlation of data from the new technology with data from standard field instruments; comparing the two renovation types with regard to air quality or energy use; description of continuous data patterns from remote monitors on a daily/weekly/etc. scale; description of data before/during/after an event which would affect air quality (e.g., ventilation system not working, dispersal of an enormous amount of mothballs, etc.); analysis of how occupant behavior may affect energy use/indoor air quality.

Research Thrust F: Statistical Consulting Service
  1. (Dr. Bruce Craig and Ms. Ce-Ce Furtner) The SCS has over 200 research consultations/year, serving clients from every College in the University. Any Purdue faculty, staff member, or student can be a client, for free, to receive statistical consulting and advice. Consultants help with proposal preparation, design of studies, data import/export, data analysis, and interpretation and presentation of results. For funding reasons, the SCS only employs grad students, but Director Craig is willing to involve undergraduates from this MCTP project. C. Furtner (Manager), a former UG academic advisor, knows what is feasible for undergraduate students. Consulting will: lead UGs to apply for graduate study in Applied Statistics; boost communication skills; and sometimes result in papers with clients. Undergraduate students will attend meetings---led by grad student consultants---and will help with the data analysis. Listening at meetings, UGs will get an early, tangible understanding of how modeling, time series, design of experiments, etc., are used in practice.

Research Thrust G: Coastal Margin Observation & Prediction
  1. (Dr. Tawnya Peterson and Dr. António Baptista) (Please note that the summer component of this thrust is in Portland, Oregon, and would require summer travel.) The NSF Science and Technology Center for Coastal Margin Observation & Prediction (CMOP) is dedicated to the study of estuaries as bioreactors that deliver unique ecosystem services, including the filtering of land inputs into the ocean. We use the Columbia River estuary as our long-term testbed, and we support our research through continuous high-resolution observations and simulations of a vast array of multi-disciplinary variables. Diverse opportunities for statistical analysis of the data are available for undergraduate students, in association with understanding physical and ecological processes, assessment and control of the quality of observations, and assessment and improvement of computational models. These opportunities are available during the summer or---by special arrangement---throughout the year. Because of the inter-institutional and inter-disciplinary nature of CMOP, students can work with leading scientists at three universities: Oregon Health & Science University, Oregon State University and University of Washington.

Research Thrust H: Saving Nature with Statistics
  1. (Dr. Songlin Fei) Forests provide a wide variety of vital services such as timber and clean water, but they are challenged by the changing climate. Our lab strives to understand the impact of climate change on forests and the resulting impact on future climate. We use continent-wide, long-term (1980-present) data collected by the US Forest Service to address a set of questions such as: Are trees migrating and at what rates? How are recruitment and growth of trees affected? What are the consequences of climate-induced species composition changes? Students can learn how to explore and analyze large data sets in a spatial and temporal context.

  2. (Dr. Songlin Fei) Invasion of exotic plant species has caused serious ecological degradation and economic losses. Our lab is working to build predictive models to understand regional invasion patterns and processes that will advance the discipline of invasion ecology and assist in effective management policy and control practices to combat invasive species. We are interested in a set of questions including: (1) Why are certain exotics more invasive? (2) Why are certain ecosystems more prone to invasion? (3) What are the main factors facilitating invasion? Students interested in this topic can use continent-wide invasion databases (or subsets of them) to explore these or related questions. Students will learn how to manage large datasets and practice model fitting, multivariate analyses, spatial analysis, etc.

  3. (Dr. Songlin Fei) Hellbenders, a gigantic, aquatic salamander species found in North America, are declining throughout their range. In Indiana, hellbenders are now confined to a single watershed. In an effort to aid hellbender conservation and management in the state, we are developing local habitat models for hellbenders. This project will involve using classification techniques on a large volume of sonar data to develop a substrate map of the study river. The student will work with a graduate student and a pre-collected data set to come up with novel statistical classification techniques to map river bottom substrate; the mapped substrate classes will then be used as predictor variables within hellbender habitat models.

  4. (Dr. Rob Swihart) Wildlife populations and communities are subjected to human influence in innumerable ways, including activities (e.g., hunting) that have direct effects and others (e.g., timber harvest, agriculture) for which effects may occur primarily due to changes in the availability or quality of habitat. My group seeks to understand how wild vertebrates respond to human activities, as this knowledge can be important to minimizing adverse influences. We have conducted work in the Upper Wabash Ecosystem Project and the Hardwood Ecosystem Experiment, which has resulted in large data sets on dozens of wild species (mostly mammals, but also birds, amphibians, and reptiles) and associated covariates for habitat and landscape features. These data are used to address questions such as: (1) How does the intensity of human disturbance affect population abundance and species composition? (2) What makes some species more sensitive than others to human disturbance? (3) What role does spatial scale play in determining wildlife responses? Students can learn how to conduct exploratory analyses and test competing hypotheses using general and generalized linear models.

  5. (Dr. Rob Swihart) Successful management and conservation of wildlife depends on understanding the factors (e.g., extreme droughts, disease epidemics, or habitat change) that drive variation in survival and reproduction. Unfortunately, a factor's importance often appears only infrequently or slowly, which requires long-term data sets. Wild mammals are difficult to study, so long-term data sets are rare. For species of game mammals, long-term data sets from unexploited populations are even rarer, despite the fact that an understanding of population dynamics in the absence of hunting is essential. I have inherited a data set collected by students at the Purdue Wildlife Area on a non-hunted population of eastern cottontail rabbits that spans 33 years. These data will be used to ask questions such as: (1) How does climate change influence density and survival of cottontails? (2) Have changes in the plant community over time had measurable impacts on the cottontail population? (3) What influence has the increase in abundance of coyotes, an important predator, had on cottontail numbers? Students can learn how to conduct exploratory analyses and test competing hypotheses using general and generalized linear models.

  6. (Dr. Bryan Pijanowski) Work in our Center for Global Soundscapes focusses on the use of long-term soundscape recordings to assess the health of ecosystems around the world. Recently featured in Science as a new area of big data research, soundscape ecology has mushroomed into one of the fastest growing ecological sciences, using advanced sensor and sensor network arrays that combine acoustic information, 3D landscape profiles using LiDAR (light detection and ranging) along with companion time-lapse photography/4K video imaging to characterize the dynamics of a variety of ecosystems around the world. Large-scale soundscape and remote sensing databases exist for exotic places like Borneo (paleotropics), Costa Rica (neotropics), Sonoran Desert (Arizona), Midwestern temperate forests (Indiana, Chicago and Wisconsin), estuaries (Maine) and the subarctic (Alaska). Students could work on any of the following projects mentored by both a graduate student and postdoc: (1) analyze over 70 TB of soundscape data from different ecosystems, comparing the spatial-temporal dynamics of these systems; (2) develop multi-media web components for use in citizen science and K-12 learning of sound, ecology, mathematics and technology (enhancing our site); (3) develop new soundscape ecological metrics that quantify diversity of sounds in files using principles of entropy; and (4) develop new techniques for data mining and pattern recognition using novel statistical tools.

  7. (Dr. Patrick Zollner) Our lab's research efforts focus primarily on the ecology of mammals. One approach we use is to deploy infrared remote cameras to capture pictures of and gather data about the mammals of interest. Each such camera typically records thousands of photos at each location, which creates challenges associated with a large volume of data. Ongoing research in our lab is collecting such photo data on the occurrence of carnivores (otter, mink, raccoon, etc.) at different locations along rivers in Indiana. These cameras record species activity with both a spatial and a temporal component, and the resulting data provide opportunities to investigate numerous potential projects of interest. For example, these data can provide a basis for developing models of competition between species as a function of spatial and/or daily activity patterns. Alternatively, data from these cameras can be used to examine how species activity patterns and interactions vary as a function of environmental variables (e.g., temperature, precipitation, moon phase or the presence of invasive Asian carp in rivers at some sites). We will work with a student interested in this project to define a feasible and unique question they can investigate using this data set.

  8. (Dr. Patrick Zollner) White-nose syndrome is a disease, caused by an invasive fungus new to North America, that has killed more than 90% of the individuals of several species of cave-hibernating bats throughout the Midwestern US. Our lab is collecting and analyzing acoustic monitoring data on the occurrence of these now threatened and endangered bat species to use in modelling summer habitat needs of these species. The acoustic bat detectors we are using record large volumes of bat echolocation calls, and we have access to such data from several regions of Indiana both before and after the arrival of the destructive fungus. A student working on this project would use these acoustic records to evaluate and validate the suitability of models we have developed from similar but independent observations. The improved habitat models resulting from this work will have important applications in the conservation of these bat species as well as the management of Indiana's forests.

  9. (Dr. Michael Saunders) What effects does timber harvesting have on forest ecosystems? - The Hardwood Ecosystem Experiment (HEE) in southern Indiana investigates the influence of forest management on various plant and animal communities within oak-dominated ecosystems. The HEE maintains a large geospatial database with repeat inventories of trees and shrubs, terrestrial vertebrates (see also "topic 4" from Dr. Swihart), and insects (see also "topic 13" from Dr. Holland). Initially, this project would evaluate the effects of forest harvesting on tree and shrub communities, but could be extended to work on other taxonomic groups. There would be opportunities for travel to the sites and to present the results at regional conferences.

  10. (Dr. Michael Saunders) How do trees grow wood? - Production ecology has been generally well studied in conifer tree species, but not in hardwood tree species. Theoretical relationships developed in conifer-dominated forest stands may or may not apply to our Indiana hardwood-dominated forests. This project will use a vast database of tree heights, diameters, and stem taper to model how walnut plantations grow. We will investigate relationships between the amount of leaves a tree displays and the amount of wood that tree produces each year. We will also study how manipulations of leaf area through pruning affect growth. There will be opportunities for the work to be extended to American chestnut, oak and other hardwood species.

  11. (Dr. Tomas Höök) Research in our lab focuses on aquatic ecology and, in particular, the dynamics of the Laurentian Great Lakes. Given that each of the Great Lakes is quite large, they are governed by almost oceanic scale physical processes and characterized by high spatial variability of physical features and biotic factors. Spatial description of variable biotic factors (e.g., fish distributions, growth rates) can contribute to hypothesis development of processes structuring biotic dynamics, while spatially comparing such biotic factors with other physical, chemical or biotic variables can evaluate hypotheses and potentially help identify mechanistic linkages. The objective of this research would be to a) describe spatial patterns of fish distributions and growth in Lake Michigan and b) relate these patterns to physical (e.g., satellite-derived surface temperature and water clarity) and biotic (e.g., chlorophyll concentrations, zooplankton densities) factors.

  12. (Dr. Tomas Höök) Northern Indiana contains ~450 natural lakes that provide diverse services, from boating to swimming to fishing to flood control. However, these services are often at odds with land-use practices and human activities. In particular, nutrient runoff (primarily phosphorus) from row crop agriculture to surface waters contributes to eutrophication, including harmful algal blooms, hypoxia, and local extinctions of sensitive species. We have access to a wealth of data from GIS databases and state and university monitoring programs related to fish community composition, water quality, lake morphometrics and land use on the land draining into Indiana's natural lakes. The objective of this research would be to quantitatively model linkages among these different types of variables; for example, evaluating how agricultural practices on lands draining into glacial lakes influence water quality, habitat conditions and resulting fish biodiversity within these lakes.

  13. (Dr. Jeffrey D. Holland) Students in my laboratory study how the pattern of land use and habitat in different landscapes influences ecological processes involving insects such as individual dispersal and exotic species invasion, ecosystem services (e.g., pollination, predation of pests, decomposition), and maintenance of biodiversity. We simultaneously study the impact of local habitat and human activities and the larger-scale landscape context. To study the landscape-insect link, we make use of extensive field surveys of insects and habitat combined with satellite and aerial data, geographical information systems, and spatial and multivariate statistics. A sample of projects students could become involved in includes: examining the impact of silvicultural regimes on the functional diversity of forest beetles, spatial analysis of aquatic insect communities, simulation modeling of insect movement, and statistical approaches to optical character recognition for capturing specimen data.

  14. (Dr. Esteban Fernandez-Juricic) Birds and airplanes collide regularly at airports across the world. These bird strikes are a source of mortality for many species of conservation concern as well as a safety and financial concern for the airline industry. In the US, the Federal Aviation Administration has compiled a database of bird strikes since 1990. We use this large database to answer key questions to better understand the environmental conditions that enhance the occurrence of bird strikes: (1) Are airports close to biodiversity hot-spots more likely to have a higher frequency of bird strikes involving species of conservation concern? (2) Does landscape composition around airports influence the probability/frequency of bird strikes? (3) Does habitat structure within airports influence the probability of bird strikes? (4) What is the role of regional and local bird densities in affecting bird strike frequency? (5) Do the color, speed, and shape of commercial airplanes influence the probability of bird strikes? Students will learn how to manage large databases and use general and generalized linear mixed models as well as multivariate statistics. The answers to these questions have widespread management implications for reducing the frequency of bird strikes.

  15. (Dr. Esteban Fernandez-Juricic) Bird feeders are used by multiple species of birds throughout North America. However, bird feeders are not necessarily built to attract birds (for instance, to hold seeds some use Plexiglas, which blocks the ultra-violet portion of the spectrum that many bird species use to find food visually). We have developed a behavioral assay to test, in aviary conditions, novel bird feeders (different shapes, colors, etc.) designed with the avian visual system in mind, which is very different from our own visual system. Through these assays, we have collected data to determine the combination of features that would increase the chances of bird visitation and seed consumption. Students will learn how to run these behavioral assays and use general and generalized linear mixed models to analyze the data. The results of this project have implications for increasing bird diversity in urbanized landscapes.

This material is based upon work supported by the National Science Foundation under Grant No. 1246818. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.