Characteristics of Energy Frontier Analysis Workflows

This page describes the stages of a physics analysis in terms of "big picture" questions, viewed from a procedural perspective. Each step typically involves multiple passes through data and/or Monte Carlo (MC) samples to create subsamples to study, followed by many passes through those subsamples.

1) Define the Analysis

  • Decide on the type of analysis (search or precision measurement).
  • Determine the variables that best characterize the chosen signal process, allowing for either direct rejection of background (cuts) or statistical separation (fits).
    • Validate the selection of variables using simulated data (see the sketch below)
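
For example, a quick way to validate a candidate variable is to compare its distribution in signal and background MC. A minimal PyROOT sketch, assuming hypothetical flat ntuples signal.root and background.root that each contain a tree "Events" with a branch "mt" for the candidate variable:

    import ROOT

    # Hypothetical ntuples derived from the simulated samples.
    sig_file = ROOT.TFile.Open("signal.root")
    bkg_file = ROOT.TFile.Open("background.root")
    sig_tree = sig_file.Get("Events")
    bkg_tree = bkg_file.Get("Events")

    h_sig = ROOT.TH1F("h_sig", "candidate variable;m_{T} [GeV];fraction", 50, 0.0, 200.0)
    h_bkg = ROOT.TH1F("h_bkg", "candidate variable;m_{T} [GeV];fraction", 50, 0.0, 200.0)

    # Fill the candidate variable for each sample.
    sig_tree.Draw("mt>>h_sig")
    bkg_tree.Draw("mt>>h_bkg")

    # Normalize to unit area and overlay to judge the separation.
    for h in (h_sig, h_bkg):
        if h.Integral() > 0:
            h.Scale(1.0 / h.Integral())

    canvas = ROOT.TCanvas("c", "variable validation")
    h_sig.SetLineColor(ROOT.kRed)
    h_bkg.SetLineColor(ROOT.kBlue)
    h_sig.Draw("hist")
    h_bkg.Draw("hist same")
    canvas.SaveAs("mt_signal_vs_background.png")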

Characteristics

This step has many of the same characteristics as (5) below, and should be similarly factored into several scenarios.

  • Computational intensity: variable
    • Distributed: yes
  • Data intensity: variable
  • Fixed functionality:
    • Jobs processing "generic" production simulated data normally execute on the CMS Tier 2 sites
  • Streaming: no
  • Modification while executing: no
  • Provenance collected:
    • CMSSW framework job configuration for simulation and analysis jobs
      • Entire configuration stored in DBS (the Dataset Bookkeeping System)
      • Individual objects annotated with production history
    • Output logical file names (LFNs)

2) Assemble Raw Data

  • choose triggers
  • assemble the raw data set
  • strip the data set to "AOD" (analysis object data) equivalent (see the configuration sketch below)
    • hopefully at this point the entire data set can be reprocessed in ~24 hours
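
A trigger-based skim like this is steered by a CMSSW framework job configuration. The fragment below is a minimal sketch only: the dataset path and trigger names are placeholders, and the HLTHighLevel filter parameters vary between CMSSW releases.

    import FWCore.ParameterSet.Config as cms

    process = cms.Process("SKIM")

    # Placeholder input: the assembled raw data set.
    process.source = cms.Source("PoolSource",
        fileNames = cms.untracked.vstring("/store/data/PLACEHOLDER/file.root"))
    process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(-1))

    # Keep only events that fired one of the chosen triggers (placeholder names).
    process.triggerSelection = cms.EDFilter("HLTHighLevel",
        TriggerResultsTag = cms.InputTag("TriggerResults", "", "HLT"),
        HLTPaths = cms.vstring("HLT_Mu9", "HLT_Ele15"),
        andOr = cms.bool(True),    # logical OR of the listed paths
        throw = cms.bool(False))   # tolerate paths absent from a menu
    process.selection = cms.Path(process.triggerSelection)

    # Write the selected events, stripped to AOD-equivalent content.
    process.load("Configuration.EventContent.EventContent_cff")
    process.out = cms.OutputModule("PoolOutputModule",
        fileName = cms.untracked.string("skim_AOD.root"),
        outputCommands = process.AODEventContent.outputCommands,
        SelectEvents = cms.untracked.PSet(SelectEvents = cms.vstring("selection")))
    process.end = cms.EndPath(process.out)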

Characteristics

  • Computational intensity: 10,000 CPU-Hours
    • Distributed: yes
  • Data intensity: 10-1000 TB input, 1-100 TB output
  • Fixed functionality:
    • usually executes at CMS Tier 1 site(s), jobs move to the data
    • execution managed by CMS workload management tools
    • output automatically staged to a selected CMS Tier 2 site
  • Streaming: no
  • Modification while executing: no (strongly discouraged)
  • Provenance collected:
    • CMSSW framework job configuration
      • Entire configuration stored in DBS
      • Individual objects annotated with production history
    • Input and output logical file names (LFNs)

3) Preliminary Event Selection

To maximize the significance of a measurement (greatest precision), or to obtain the best upper limit on a process, what are the optimal selection criteria, based on the variables in (1), for direct rejection, and what is the optimal technique (which variables, which method) for the final extraction of the signal? These are typically based on

  • a specific model for the signal
  • either a specific background model or an independent (ideally unbiased) background sample

To answer these questions, perform a preliminary event selection (a figure-of-merit sketch follows this list), writing out the subset of passed events in AOD format for

  • data
  • background samples
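
The "optimal selection criteria" question is often first attacked with a simple figure of merit before any sophisticated technique is tried. A toy Python sketch with made-up yields, using S/sqrt(S+B) as the figure of merit:

    import math

    # Hypothetical expected yields versus the cut value on one variable,
    # e.g. taken from the signal and background subsamples.
    # Each entry: (cut value, expected signal S, expected background B)
    scan = [
        (20.0, 95.0, 4000.0),
        (40.0, 80.0, 900.0),
        (60.0, 60.0, 150.0),
        (80.0, 35.0, 30.0),
    ]

    def figure_of_merit(s, b):
        # S/sqrt(S+B): a simple proxy for the expected significance.
        return s / math.sqrt(s + b) if s + b > 0 else 0.0

    for cut, s, b in scan:
        print(f"cut > {cut:5.1f}: S={s:6.1f} B={b:7.1f} S/sqrt(S+B)={figure_of_merit(s, b):5.2f}")
    best = max(scan, key=lambda row: figure_of_merit(row[1], row[2]))
    print(f"best cut in this scan: > {best[0]}")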

Characteristics

  • Computational intensity: 1,000 CPU-Hours
    • Distributed: yes
  • Data intensity: 1-100 TB input, 0.1-10 TB output
  • Fixed functionality:
    • usually executes at the CMS Tier 2 site selected in (1)
    • execution managed by CMS workload management tools
    • output stored at the selected Tier 2 site, or automatically staged to a local Tier 3
  • Streaming: no
  • Modification while executing: no (strongly discouraged)
  • Provenance collected:
    • CMSSW framework job configuration
      • Entire configuration stored in DBS
      • Individual objects annotated with production history
    • Input and output logical file names (LFNs)

4) Evaluate Statistics

What is the "answer" and the associated statistical uncertainty for our given signal and background model? (do not look at signal data yet, if search and blind analysis)

5) Agreement with Data

How well does our signal + background model agree with the data?

  • what variations do we need to make to achieve agreement?
  • test the modeling
    • control regions, isolating each major background (a comparison sketch follows this list)
  • adapt the modeling and selection to improve S/B (signal over background)
    • in this step we do lots of small iterations on subsets of the data before going back to (3)
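
A typical small iteration is a data-versus-model comparison in one control region. A PyROOT sketch, assuming hypothetical files data.root and mc.root that each contain a pre-filled control-region histogram named "h_met":

    import ROOT

    # Hypothetical inputs: a control variable (e.g. missing ET) in a
    # background-dominated control region.
    data_file = ROOT.TFile.Open("data.root")
    mc_file = ROOT.TFile.Open("mc.root")
    h_data = data_file.Get("h_met")
    h_mc = mc_file.Get("h_met")

    # Normalize the model to the data yield in the control region.
    if h_mc.Integral() > 0:
        h_mc.Scale(h_data.Integral() / h_mc.Integral())

    # Quantify the shape agreement.
    ks = h_data.KolmogorovTest(h_mc)
    chi2 = h_data.Chi2Test(h_mc, "UW CHI2/NDF")  # unweighted data vs. weighted MC
    print(f"KS probability: {ks:.3f}  chi2/ndf: {chi2:.2f}")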

Characteristics

CMS Framework job

  • Computational intensity: 100 CPU-Hours
    • Distributed: yes
  • Data intensity: 0.1 - 10 TB input, 0.01 - 1 TB output
  • Fixed functionality:
    • Executes at Tier 2 or large Tier 3 site
  • Streaming: no
  • Modification while executing: no
  • Provenance collected:
    • CMSSW framework job configuration
      • Entire configuration stored in DBS
      • Individual objects annotated with production history
    • Input and output logical file names (LFNs)

FWLite/ROOT job

  • Computational intensity: 1 - 100 CPU-Hours
    • Distributed: variable
  • Data intensity: 0.1 - 10 TB input
  • Fixed functionality:
  • Streaming: yes
  • Modification while executing: yes
  • Provenance collected:
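
Jobs at this level are often interactive PyROOT loops over the skimmed files using the FWLite Python bindings. A minimal sketch; the input file, collection label, and cut are placeholders:

    import ROOT
    from DataFormats.FWLite import Events, Handle

    events = Events("skim_AOD.root")          # e.g. output of the skim in (3)
    muons = Handle("std::vector<reco::Muon>")

    h_pt = ROOT.TH1F("h_pt", "muon p_{T};p_{T} [GeV];muons", 50, 0.0, 100.0)

    for event in events:
        event.getByLabel("muons", muons)
        for mu in muons.product():
            if mu.pt() > 20.0:                # placeholder selection
                h_pt.Fill(mu.pt())

    out = ROOT.TFile("muon_pt.root", "RECREATE")
    h_pt.Write()
    out.Close()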

Analysis Environment

  • Computational intensity: 0.0001 - 100 CPU-Hours
    • Distributed: variable
  • Data intensity: 0.1 - 10 TB input
  • Fixed functionality:
  • Streaming: yes
  • Modification while executing: yes
  • Provenance collected:
    • How a plot was made: everything necessary to reproduce the plot from the inputs (a sketch follows this list)
      • selection criteria
      • data used as input
      • transformations/calculations/etc.
    • Provenance associated with cached intermediate results
  • Provenance available/utilized:
    • Framework provenance from previous workflows
    • Analysis Environment provenance from previous sessions
    • File/data parentage relationships (local DBS, registered by CMS workload management tools)
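
There is no single standard tool at this layer; purely as an illustration, the kind of record implied above can be captured as a small sidecar file written next to each plot:

    import json
    import time

    def record_plot_provenance(plot_name, inputs, selection, transformations):
        # Illustrative sketch, not a CMS tool: the fields mirror the items
        # listed above (selection criteria, input data, transformations).
        record = {
            "plot": plot_name,
            "created": time.strftime("%Y-%m-%d %H:%M:%S"),
            "inputs": inputs,                   # LFNs or local file names
            "selection": selection,             # cut string or criteria
            "transformations": transformations, # calculations applied to the inputs
        }
        with open(plot_name + ".provenance.json", "w") as f:
            json.dump(record, f, indent=2)

    record_plot_provenance(
        "mt_control_region",
        inputs=["skim_AOD.root"],
        selection="mu_pt > 20 && met < 30",
        transformations=["MC normalized to data yield in control region"])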

6) Biases

What effects can potentially bias our "answer", i.e. what are the systematic uncertainties?

  • what samples can we use to limit such biases?
    • repeat the analysis with "wrong sign" combinations or other variations on the signal mode
    • test sensitivity to parameter variations with real and simulated data (a sketch follows this list)
  • is the systematic uncertainty acceptable, and if not, how can it be reduced?
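
Parameter-variation systematics are commonly estimated by repeating the selection with a scaled input and quoting the shift in the result as the uncertainty. A toy Python sketch; the 3% jet energy scale shift and the jet values are placeholders:

    # Repeat the event selection with the jet energy scale shifted up and
    # down, and quote the change in the selected yield as the systematic.
    jet_pts = [18.0, 29.5, 30.5, 41.0, 55.0]   # placeholder jet pT values (GeV)
    threshold = 30.0                            # selection cut (GeV)

    def selected_yield(pts, scale):
        # Count jets passing the cut after an energy-scale factor.
        return sum(1 for pt in pts if pt * scale > threshold)

    nominal = selected_yield(jet_pts, 1.00)
    up = selected_yield(jet_pts, 1.03)          # +3% jet energy scale
    down = selected_yield(jet_pts, 0.97)        # -3% jet energy scale

    print(f"nominal yield: {nominal}")
    print(f"JES systematic: +{up - nominal} / {down - nominal}")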

7) Look at the Result

Look at the result once all of the above are satisfied

  • open the box, look in the signal region
  • write the PRL, collect the Nobel prize

Repeat from (3) until the modeling and S/B are sufficient. On occasion it will be necessary to go back to (1), either to a) incorporate new calibrations, alignments, etc., or b) find more handles to improve the S/B and modeling.

-- DanRiley - 15 Feb 2008
