Characteristics of Energy Frontier Analysis Workflows
This page describes the stages of a physics analysis in terms of "big picture" questions, viewed from a procedural perspective. Each step typically involves multiple passes through data and/or Monte Carlo (MC) samples to create subsamples to study, followed by many passes through the subsamples.
1) Define the Analysis
- Decide on the type of analysis (search, precision measurement).
- Determine the variables that best characterize the chosen signal process, allowing either direct rejection of background (cuts) or statistical separation (fits).
- Validate the selection of variables using simulated data (see the sketch below).
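As a toy illustration of the validation step, the following sketch (plain Python with NumPy; the Gaussian signal and exponential background are invented stand-ins for simulated samples) ranks a candidate variable by a histogram-based separation figure of merit:

    import numpy as np

    # Invented stand-ins for one candidate variable (e.g. an invariant mass)
    # in simulated signal and background samples; in a real analysis these
    # would be read from the simulation output.
    rng = np.random.default_rng(42)
    signal = rng.normal(loc=120.0, scale=5.0, size=10_000)
    background = rng.exponential(scale=60.0, size=100_000)

    def separation(sig, bkg, bins=50, lo=0.0, hi=200.0):
        # Histogram-based separation <S^2> = 1/2 * Int (s-b)^2/(s+b) dx,
        # a standard figure of merit for ranking candidate variables
        # (0 = no separation, 1 = perfect separation).
        s, edges = np.histogram(sig, bins=bins, range=(lo, hi), density=True)
        b, _ = np.histogram(bkg, bins=bins, range=(lo, hi), density=True)
        width = edges[1] - edges[0]
        nonzero = (s + b) > 0
        return 0.5 * np.sum(width * (s[nonzero] - b[nonzero]) ** 2
                            / (s[nonzero] + b[nonzero]))

    print(f"separation = {separation(signal, background):.3f}")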
Characteristics
This step has many of the same characteristics as (5) below, and should be similarly factored into several scenarios.
- Computational intensity: variable
- Data intensity: variable
- Fixed functionality:
- Jobs processing "generic" production simulated data normally execute on the CMS Tier 2 sites
- Streaming: no
- Modification while executing: no
- Provenance collected:
- CMSSW framework job configuration for simulation and analysis jobs
- Entire configuration stored in DBS
- Individual objects annotated with production history
- Output logical file names (LFNs)
2) Assemble Raw Data
- Choose triggers
- Assemble the raw data set
- Strip the data set to "AOD" (analysis object data) equivalent (a skim configuration sketch follows this list)
- Ideally, at this point the entire data set can be reprocessed in ~24 hours
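A minimal cmsRun configuration sketch of such a trigger-selection-plus-AOD skim, assuming the standard HLTHighLevel filter and the standard AOD event-content definitions; the trigger path names and file names are placeholders:

    import FWCore.ParameterSet.Config as cms

    process = cms.Process("SKIM")
    process.load("Configuration.EventContent.EventContent_cff")

    # Placeholder input; in practice the CMS workload management tools
    # supply the file list for the chosen runs.
    process.source = cms.Source("PoolSource",
        fileNames = cms.untracked.vstring("file:rawdata.root"))

    # Keep only events that fired one of the chosen trigger paths
    # (the path names here are placeholders).
    process.triggerSelection = cms.EDFilter("HLTHighLevel",
        TriggerResultsTag = cms.InputTag("TriggerResults", "", "HLT"),
        HLTPaths = cms.vstring("HLT_Mu9", "HLT_Ele15_SW_L1R"),
        eventSetupPathsKey = cms.string(""),
        andOr = cms.bool(True),   # accept if ANY listed path fired
        throw = cms.bool(False))  # do not abort on paths absent from the menu

    process.selection = cms.Path(process.triggerSelection)

    # Write the standard AOD event content for selected events only.
    process.out = cms.OutputModule("PoolOutputModule",
        fileName = cms.untracked.string("skim_AOD.root"),
        outputCommands = process.AODEventContent.outputCommands,
        SelectEvents = cms.untracked.PSet(SelectEvents = cms.vstring("selection")))

    process.outpath = cms.EndPath(process.out)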
Characteristics
- Computational intensity: 10,000 CPU-Hours
- Data intensity: 10-1000 TB input, 1-100 TB output
- Fixed functionality:
- usually executes at CMS Tier 1 site(s), jobs move to the data
- execution managed by CMS workload management tools
- output automatically staged to a selected CMS Tier 2 site
- Streaming: no
- Modification while executing: no (strongly discouraged)
- Provenance collected:
- CMSSW framework job configuration
- Entire configuration stored in DBS
- Individual objects annotated with production history
- Input and output logical file names (LFNs)
3) Preliminary Event Selection
To maximize the significance of the measurement (greatest precision) or to
obtain the best upper limit on a process: what are the optimal selection
criteria, based on the variables in (1), for direct rejection, and what is
the optimal technique (which variables, which method) for the final
extraction of the signal? These choices are typically based on
- a specific model for the signal
- either a specific background model or an independent (unbiased?) background sample
To answer this, perform a preliminary event selection, writing out the subset of passed events in AOD format. A toy sketch of the cut optimization follows.
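The sketch below scans a single cut and picks the value maximizing a simple expected significance s/sqrt(s+b); the samples, yields, and weights are invented placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    # Invented per-event values of one discriminating variable; the weights
    # normalize each simulated sample to its expected yield in data.
    sig_vals = rng.normal(120.0, 5.0, size=10_000)
    bkg_vals = rng.exponential(60.0, size=100_000)
    sig_w = 50.0 / sig_vals.size     # expect ~50 signal events
    bkg_w = 5000.0 / bkg_vals.size   # expect ~5000 background events

    best_z, best_cut = 0.0, None
    for cut in np.linspace(80.0, 140.0, 61):
        s = np.sum(sig_vals > cut) * sig_w
        b = np.sum(bkg_vals > cut) * bkg_w
        if s + b > 0:
            z = s / np.sqrt(s + b)   # simple expected-significance estimate
            if z > best_z:
                best_z, best_cut = z, cut

    print(f"optimal cut > {best_cut:.1f} with expected s/sqrt(s+b) = {best_z:.2f}")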
Characteristics
- Computational intensity: 1,000 CPU-Hours
- Data intensity: 1-100 TB input, 0.1-10 TB output
- Fixed functionality:
- usually executes at the CMS Tier 2 site selected in (2)
- execution managed by CMS workload management tools
- output stored at the selected Tier 2 site, or automatically staged to a local Tier 3
- Streaming: no
- Modification while executing: no (strongly discouraged)
- Provenance collected:
- CMSSW framework job configuration
- Entire configuration stored in DBS
- Individual objects annotated with production history
- Input and output logical file names (LFNs)
4) Evaluate Statistics
What is the "answer" and the associated statistical uncertainty for our given signal and background model? (For a blind search, do not look at the signal data yet.) A toy counting-experiment sketch follows.
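For a simple counting experiment, the expected statistical uncertainty and sensitivity can be estimated with toy pseudo-experiments; a sketch with invented placeholder yields:

    import numpy as np

    rng = np.random.default_rng(1)
    s_exp, b_exp = 12.0, 40.0   # invented expected yields after selection

    # Expected statistical uncertainty on the extracted signal for a simple
    # counting measurement: sigma_s ~ sqrt(s+b) when b is known.
    print(f"expected stat. uncertainty on s ~ {np.sqrt(s_exp + b_exp):.1f} events")

    # Background-only pseudo-experiments: how often would a fluctuation of
    # the background alone reach the signal+background expectation?
    toys = rng.poisson(b_exp, size=1_000_000)
    p = np.mean(toys >= s_exp + b_exp)
    print(f"expected background-only p-value ~ {p:.1e}")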
5) Agreement with Data
How well does our signal + background model agree with the data?
- what variations do we need to make to achieve agreement?
- test modeling (see the comparison sketch after this list)
- control regions, isolating each major background
- adapt modeling and selection to improve S/B
- this step involves many small iterations on subsets of the data before going back to (3)
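A minimal sketch of the modeling test in one control region: a Pearson chi-square comparison of binned data against the model prediction (the yields are invented placeholders; SciPy is assumed for the p-value):

    import numpy as np
    from scipy import stats

    # Invented binned yields in one control region: observed data vs. the
    # summed signal+background model prediction per bin.
    data  = np.array([102.0, 95.0, 110.0, 88.0, 97.0, 84.0, 70.0, 65.0])
    model = np.array([100.0, 98.0, 105.0, 90.0, 95.0, 80.0, 72.0, 60.0])

    # Pearson chi-square, taking the model prediction as the Poisson
    # variance in each bin.
    chi2 = np.sum((data - model) ** 2 / model)
    ndf = data.size
    p_value = stats.chi2.sf(chi2, ndf)
    print(f"chi2/ndf = {chi2:.1f}/{ndf}, p-value = {p_value:.2f}")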
Characteristics
CMS Framework job
- Computational intensity: 100 CPU-Hours
- Data intensity: 0.1 - 10 TB input, 0.01 - 1 TB output
- Fixed functionality:
- Executes at Tier 2 or large Tier 3 site
- Streaming: no
- Modification while executing: no
- Provenance collected:
- CMSSW framework job configuration
- Entire configuration stored in DBS
- Individual objects annotated with production history
- Input and output logical file names (LFNs)
FWLite/ROOT job
- Computational intensity: 1 - 100 CPU-Hours
- Data intensity: 0.1 - 10 TB input
- Fixed functionality:
- Streaming: yes
- Modification while executing: yes
- Provenance collected:
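An FWLite/ROOT job of this kind might be a short Python event loop run interactively, which is why it can stream over files and be modified while executing. A minimal sketch, assuming a CMSSW environment; the input file name is a placeholder and "muons" is the standard reco::Muon collection label:

    import ROOT
    from DataFormats.FWLite import Events, Handle

    # Stream over the skimmed AOD output of the previous step
    # (the file name is a placeholder).
    events = Events("skim_AOD.root")
    muons = Handle("std::vector<reco::Muon>")

    hist = ROOT.TH1F("mupt", "muon pT;pT [GeV];events", 50, 0.0, 100.0)
    for event in events:
        event.getByLabel("muons", muons)   # standard reco muon collection
        for mu in muons.product():
            hist.Fill(mu.pt())
    hist.Draw()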
Analysis Environment
- Computational intensity: 0.0001 - 100 CPU-Hours
- Data intensity: 0.1 - 10 TB input
- Fixed functionality:
- Streaming: yes
- Modification while executing: yes
- Provenance collected:
- How a plot was made: everything necessary to reproduce a plot from the inputs (a sketch of such a record follows this list)
- selection criteria
- data used as input
- transformations/calculations/etc.
- Provenance associated with cached intermediate results
- Provenance available/utilized:
- Framework provenance from previous workflows
- Analysis Environment provenance from previous sessions
- File/data parentage relationships (local DBS, registered by CMS workload management tools)
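A sketch of the kind of record the analysis environment could keep alongside each plot; the JSON schema here is invented for illustration and is not an existing CMS tool:

    import json
    import time

    # Invented provenance record saved next to a plot: enough information
    # to reproduce it from its inputs (selection, input LFNs, transformations).
    provenance = {
        "plot": "mupt.png",
        "created": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "inputs": ["/store/user/analysis/skim_AOD.root"],  # placeholder LFN
        "selection": "muon pT > 20 GeV && |eta| < 2.1",
        "transformations": ["fill TH1F(50, 0, 100) with muon pT"],
        "upstream": {"framework_config": "as registered in DBS"},
    }
    with open("mupt.provenance.json", "w") as f:
        json.dump(provenance, f, indent=2)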
6) Biases
What effects can potentially bias our "answer", i.e., what are the systematic uncertainties?
- what samples can we use to limit such biases
- repeat analysis with "wrong sign" combinations or other variations on the signal mode
- test sensitivity to parameter variations with real and simulated data (a toy sketch follows this list)
- is the systematic uncertainty acceptable; if not, how can it be reduced?
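A toy sketch of the parameter-variation test: rerun the measurement with one modeling parameter shifted up and down and take the spread as the systematic. The analysis response and the size of the shift are invented placeholders:

    def measure(jes):
        # Placeholder "analysis": the extracted signal yield as a function
        # of one nuisance parameter (e.g. a jet energy scale factor); the
        # linear response is invented for illustration.
        return 50.0 * (1.0 + 2.5 * (jes - 1.0))

    nominal = measure(1.00)
    up, down = measure(1.03), measure(0.97)   # hypothetical +/-3% shift
    syst = max(abs(up - nominal), abs(down - nominal))
    print(f"signal = {nominal:.1f} +/- {syst:.1f} (JES systematic)")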
7) Look at the Result
Look at the result once all of the above are satisfied:
- open the box, look in the signal region
- write the PRL, collect the Nobel Prize
Repeat from (3) until the modeling and S/B are sufficient. Occasionally it will be necessary to go back to (1), either to a) incorporate new calibrations, alignments, etc., or b) find additional handles to improve the S/B or the modeling.
--
DanRiley - 15 Feb 2008