•
Monday Oct. 31, 2011
10AM
Powell-Booth 100
Clustering of Genome-wide Chromatin Mark Data Using Self-Organizing Maps
Shirley Pepke, Ali Mortazavi, and Barbara Wold
Genome-wide sequence-based assays such as ChIP-seq offer unprecedented opportunities to characterize and predict regulatory sequence regions such as those corresponding to enhancers. The ENCODE Project, in particular, has made available the results of hundreds of high-throughput assays spanning multiple cell types. The amount and complexity of these data make them a rich source for knowledge mining; however, they also present significant computational challenges. We have used Self-Organizing Maps (SOMs) as part of a pipeline to integratively analyze ENCODE Tier1 and Tier2 cell type data by performing a fine-grained clustering of genome segments based upon the vector of ChIP-seq signal levels within each segment. Because SOMs generate a topographic mapping of the input data onto a grid of prototype vectors such that the proximity of two vectors on the map indicates their similarity, a key advantage for interpretability is the embedding of higher-level relationships (relationships between clusters) within the maps. The input vectors we use are constructed from experimental data for a large number of histone marks (plus some transcription factors as well as open chromatin assays); thus the SOM prototype vectors underlying the 2D mapping are high-dimensional, and some care is required in interpreting the map landscape. Here we look at different techniques for clustering the SOM prototype vectors in order to discriminate visually observed patterns at different levels of detail. We discuss implications of the clustering paradigm for biological interpretability and for determining functional relationships of genomic segments.
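As a rough illustration of the SOM idea described in the abstract (a grid of prototype vectors in which neighboring units end up with similar prototypes), here is a minimal NumPy sketch. It is not the authors' pipeline: the function names `train_som` and `best_matching_units`, the grid size, and the learning-rate and neighborhood schedules are all assumptions chosen for brevity, and the input is treated as a generic matrix with one signal vector per row.

```python
import numpy as np

def train_som(data, rows=4, cols=4, n_iter=500, lr0=0.5,
              sigma0=2.0, sigma_end=0.5, seed=0):
    """Train a small online SOM (illustrative sketch; parameters are assumptions).

    data: array of shape (n_samples, dim), one signal vector per row.
    Returns (protos, grid): prototype vectors and their 2D grid coordinates.
    """
    rng = np.random.default_rng(seed)
    n, _ = data.shape
    # 2D grid coordinates of every map unit, in row-major order.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    # Initialize prototypes from randomly chosen input vectors.
    protos = data[rng.integers(0, n, size=rows * cols)].astype(float).copy()
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                        # linearly decaying learning rate
        sigma = sigma0 * (sigma_end / sigma0) ** frac  # shrinking neighborhood width
        x = data[rng.integers(0, n)]
        # Best-matching unit: the prototype closest to the sample.
        bmu = np.argmin(((protos - x) ** 2).sum(axis=1))
        # Gaussian neighborhood measured on the grid, not in data space.
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        # Pull the BMU and its grid neighbors toward the sample.
        protos += lr * h[:, None] * (x - protos)
    return protos, grid

def best_matching_units(data, protos):
    """Index of the closest prototype for each row of data."""
    d = ((data[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

Assigning each input vector to its best-matching unit gives the fine-grained clustering; the prototype vectors themselves can then be clustered further, which is the step the talk examines.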
•
Position Description:
The Center for Advanced Computing Research at the California Institute of Technology is seeking a highly motivated individual to engineer scientific software to make effective use of accelerators, particularly general-purpose graphics processing units (GPGPUs). There are applications in several research groups, including geophysics, solid mechanics, chemistry, and biology. The initial responsibility will be optimizing codes to exploit a large new hybrid (CPU/GPGPU) cluster in the Division of Geological and Planetary Sciences. Specific applications include Bayesian models of fault slip during large earthquakes, inverse models of the Earth’s interior structure, large-scale remotely sensed image processing and models for use in rapid tsunami early warning systems.
Job Duties:
- Engineer scientific codes in multiple disciplines to make effective use of accelerators.
- Document work and train students and staff in accelerator programming.
- Serve as a campus-wide subject matter expert on accelerator programming.
- Collaborate with experimental and scientific teams to deploy scientific software and respond to research challenges.
- Contribute to the writing of papers and grant proposals.
- Other duties as requested.
Requirements & Qualifications:
- B.S. in computer science, physics, applied mathematics or a closely related field.
- Must have a minimum of 2 years of experience programming accelerators for scientific or closely related applications.
- Thorough knowledge of C/C++, OpenCL, and CUDA.
Caltech is an equal opportunity/affirmative action employer. Women, minorities, veterans and disabled persons are encouraged to apply.
•
“Modern Time Series Analysis of Three Cycles of Solar Chromospheric Activity”
Jeff Scargle
NASA Ames; Distinguished Visiting Scholar, Keck Institute for Space Studies
Thursday Oct 6
11AM
100 Powell-Booth
Astronomical programs such as NASA’s Kepler Mission, Caltech’s Catalina Real-Time Transient Survey and Palomar Transient Factory, plus many other all-sky photometric surveys — past, present and future — demand efficient, automatic methods for extracting information from time series data. I will describe algorithms for standard and novel analysis for:
* any data mode (events, counts in bins, point measurements with errors, etc.)
* time, frequency, and time-frequency domains
* auto- and cross- modes for single and multiple time series
Selected application examples will focus on three and a half decades of data from the NSO/AFRL/Sac Peak K-line monitoring program. Power spectrum and time-frequency analyses elucidate the solar cycle and an underlying random process, and reveal a new periodicity possibly connected with internal solar MHD activity.
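For readers unfamiliar with power spectrum estimation on unevenly sampled point measurements, the following is a minimal sketch of the classical Lomb-Scargle periodogram in NumPy. It is a standard textbook technique offered only as background, not necessarily one of the algorithms presented in the talk. The function name `lomb_scargle` and the synthetic test signal are assumptions made for illustration.

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle periodogram (illustrative sketch).

    t: sample times (may be unevenly spaced); y: measurements at those times;
    freqs: positive trial frequencies in cycles per unit time.
    Returns the unnormalized power at each trial frequency.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                      # remove the mean before fitting sinusoids
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        # Time offset tau that makes the sine and cosine terms orthogonal.
        tau = np.arctan2(np.sin(2 * w * t).sum(),
                         np.cos(2 * w * t).sum()) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

# Usage on a synthetic signal: a 0.1-cycle-per-unit sinusoid sampled at random times.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 100.0, 200))
y = np.sin(2 * np.pi * 0.1 * t) + 0.1 * rng.standard_normal(200)
freqs = np.linspace(0.01, 0.5, 491)
power = lomb_scargle(t, y, freqs)
peak_freq = freqs[np.argmax(power)]       # expected to lie close to 0.1
```

The periodogram needs no resampling onto a regular time grid, which is why it suits survey and monitoring data with gaps.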