Analysis Dataset APIs
- Create From Processed Dataset
  - Selection of files
    - Entire Processed Dataset
    - Existing Analysis Dataset(s)
    - DBS query (most recently processed version of all the events, with no duplication)
    - List of files
  - Selection of Luminosity Sections in the selected files
    - All the Luminosity Sections in the selected files
    - List of luminosity sections to include or exclude
    - DBS query (data quality)
- An Analysis Dataset is supposed to be a snapshot, so we do not expect to modify it.
- Will depend on changes to CMSSW.
- May clear dataset from Local DBS when it is in Global DBS.
A processed dataset can contain multiple luminosity sections processed from the same set of input events, so the Analysis Dataset specifies a set of luminosity sections, with no duplication of events, to use for later processing. The set of lumi sections to keep is chosen according to some criteria. We think of those criteria as a query on the Processed Dataset, and we would like to store that query for future use in constructing new Analysis Datasets once the input Processed Dataset has grown. What is this query?
The lumi sections to keep may be chosen because they were calculated better in some way:
- Detector conditions are more correct
- Software revision indicates the analysis was better for a particular use of the data. (For instance, low energy muon data is good enough, while other data in the events may be in error.)
- "The most recent reprocessings of each file." Any reprocessing must replace entire files.
- "But don't give me bad lumi sections." There will always be an implicit query to exclude lumi sections marked as bad.
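The last two criteria above can be sketched as a small filter. This is only an illustration, not the DBS implementation: the `LumiRecord` fields, the integer `version` counter, and the per-record `bad` flag are all assumptions made for the example. It also assumes that a lumi section whose most recent reprocessing is marked bad is dropped rather than replaced by an older version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LumiRecord:
    """Hypothetical record: one lumi section as stored in one file."""
    run: int
    lumi: int
    version: int      # assumed reprocessing counter; higher means more recent
    lfn: str
    bad: bool = False  # assumed data quality flag


def select_lumis(records):
    """Keep the most recent reprocessing of each lumi section, then drop bad ones."""
    latest = {}
    for rec in records:
        key = (rec.run, rec.lumi)
        if key not in latest or rec.version > latest[key].version:
            latest[key] = rec
    # The implicit "no bad lumi sections" query is applied last.
    good = [rec for rec in latest.values() if not rec.bad]
    return sorted(good, key=lambda rec: (rec.run, rec.lumi))


records = [
    LumiRecord(run=1, lumi=1, version=1, lfn="/store/a_v1.root"),
    LumiRecord(run=1, lumi=1, version=2, lfn="/store/a_v2.root"),
    LumiRecord(run=1, lumi=2, version=1, lfn="/store/a_v1.root", bad=True),
]
selected = select_lumis(records)
```

Because each (run, lumi) pair appears at most once in the result, no event is selected twice.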
We want to record the meaning of this query so that we know why certain lumi sections were chosen for an analysis dataset. We would also like to be able to re-run a query against the processed dataset after it has grown. The same query may select different events from even the older parts of the processed dataset if the detector conditions have been updated.
How do you implement this? We assume there will be properties on the lumi sections in the future so that you can query them. There are also plans for a detector configuration database. We expect this to look like a database query.
What do we need to specify when creating an analysis dataset from a processed dataset? We select a set of files from a Processed Dataset, and a set of Luminosity Sections from those files. This could be accomplished as a sequence of steps in the client, except we want to capture the queries used to define the Analysis Dataset so that the query can be re-run later. If any of the queries is not one of the standard or previously executed queries, there should be a description for documentation and browsing purposes.
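One way to picture "capturing the queries" is a definition record that stores both queries together with their descriptions and can be replayed against a larger Processed Dataset. The sketch below is hypothetical: `StoredQuery`, `AnalysisDatasetDefinition`, the dictionary row layout, and the dataset path are invented for illustration, and Python predicates stand in for whatever query language the DBS ends up using.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StoredQuery:
    """Hypothetical stored query: a name, a description, and a row predicate."""
    name: str
    description: str  # kept for documentation and browsing
    predicate: Callable[[Dict], bool]

    def run(self, rows: List[Dict]) -> List[Dict]:
        return [row for row in rows if self.predicate(row)]


@dataclass
class AnalysisDatasetDefinition:
    """Hypothetical definition: which queries produced an Analysis Dataset."""
    name: str
    processed_dataset: str
    file_query: StoredQuery
    lumi_query: StoredQuery

    def snapshot(self, files: List[Dict], lumis: List[Dict]) -> Dict:
        """Run both queries against the current contents and freeze the result."""
        lfns = {f["lfn"] for f in self.file_query.run(files)}
        candidates = [ls for ls in lumis if ls["lfn"] in lfns]
        kept = self.lumi_query.run(candidates)
        return {
            "lfns": sorted(lfns),
            "lumis": sorted((ls["run"], ls["lumi"]) for ls in kept),
        }


definition = AnalysisDatasetDefinition(
    name="example-analysis",
    processed_dataset="/Primary/Processed/RECO",  # made-up path
    file_query=StoredQuery("all-files", "Every file in the Processed Dataset",
                           lambda f: True),
    lumi_query=StoredQuery("good-quality", "Lumi sections with quality 'good'",
                           lambda ls: ls["quality"] == "good"),
)
```

Calling `definition.snapshot(...)` again after the Processed Dataset has grown yields a new snapshot from the same recorded queries, while earlier snapshots stay unchanged.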
Selection of files--one of:
- Specify a list of LFNs
- Give a query on the DBS to be executed to select the files.
- Select all the files in a single processed dataset.
- Select all the files in the union of a list of analysis datasets
- Select all the files with the most recent reprocessed versions of the events in the dataset
- Replay a previously executed query
After selection, the constraint that all the files selected belong to the same Processed Dataset should be checked.
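The constraint check could look like the following sketch. It assumes, purely for illustration, that the selection step yields (LFN, Processed Dataset name) pairs; the function name and the choice to reject an empty selection are also assumptions.

```python
def check_single_processed_dataset(selected_files):
    """Return the common Processed Dataset name, or raise ValueError.

    `selected_files` is an iterable of (lfn, processed_dataset) pairs,
    a hypothetical shape for the output of the file selection step.
    """
    datasets = {dataset for _lfn, dataset in selected_files}
    if not datasets:
        raise ValueError("no files were selected")
    if len(datasets) > 1:
        raise ValueError(
            "selected files span several Processed Datasets: "
            + ", ".join(sorted(datasets))
        )
    return datasets.pop()


parent = check_single_processed_dataset([
    ("/store/a.root", "/Primary/Processed/RECO"),
    ("/store/b.root", "/Primary/Processed/RECO"),
])
```

Running the check after every selection method, including a union of Analysis Datasets or a replayed query, keeps the rule in one place.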
Selection of Luminosity Sections starts with the list of Luminosity Sections from the selected files:
- Specify a list of luminosity sections to exclude
- Give a query on the DBS to be executed to select the Luminosity Sections
  - Typically a selection on the data quality attributes of the luminosity sections.
- Replay a previously executed query
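A minimal sketch of this step, assuming a hypothetical mapping from LFN to the (run, lumi) pairs each file contains. The function name, the `exclude` argument, and the `quality_ok` predicate (standing in for a DBS data quality query) are illustrative only.

```python
def select_lumi_sections(file_lumis, selected_lfns, exclude=(), quality_ok=None):
    """Start from the lumi sections of the selected files, then narrow down.

    file_lumis:    dict mapping LFN -> iterable of (run, lumi) pairs (assumed layout)
    selected_lfns: LFNs chosen by the file selection step
    exclude:       explicit (run, lumi) pairs to leave out
    quality_ok:    optional predicate standing in for a data quality query
    """
    candidates = set()
    for lfn in selected_lfns:
        candidates.update(file_lumis.get(lfn, ()))
    candidates -= set(exclude)
    if quality_ok is not None:
        candidates = {ls for ls in candidates if quality_ok(ls)}
    return sorted(candidates)


file_lumis = {
    "/store/a.root": [(1, 1), (1, 2)],
    "/store/b.root": [(1, 3)],
    "/store/c.root": [(2, 1)],
}
kept = select_lumi_sections(
    file_lumis,
    selected_lfns=["/store/a.root", "/store/b.root"],
    exclude=[(1, 2)],
)
```

Lumi sections from files that were not selected never enter the candidate set, so the file selection always bounds the lumi selection.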
Prodagent could accept an Analysis Dataset name as easily as it currently accepts a Processed Dataset name. The user might not tell Prodagent which kind of name was specified, but for an Analysis Dataset the DBS will return a list of lumi sections to skip in addition to the LFNs it returns now.
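The response described here could be derived as in the sketch below. Nothing in it is the actual DBS or Prodagent interface: the function name, the LFN-to-lumi mapping, and the convention that `kept_lumis=None` means "plain Processed Dataset" are assumptions for illustration.

```python
def resolve_for_prodagent(file_lumis, kept_lumis=None):
    """Return (lfns, lumis_to_skip) for a dataset name lookup.

    file_lumis: dict mapping LFN -> iterable of (run, lumi) pairs (assumed layout)
    kept_lumis: lumi sections the Analysis Dataset keeps, or None when the
                name refers to a Processed Dataset (assumed convention)
    """
    lfns = sorted(file_lumis)
    if kept_lumis is None:
        # Processed Dataset: same answer as today, nothing to skip.
        return lfns, []
    all_lumis = set().union(*file_lumis.values())
    return lfns, sorted(all_lumis - set(kept_lumis))


lfns, skip = resolve_for_prodagent(
    {"/store/a.root": [(1, 1), (1, 2)], "/store/b.root": [(1, 3)]},
    kept_lumis=[(1, 1), (1, 3)],
)
```

Returning an empty skip list for a Processed Dataset lets the caller handle both kinds of name with the same code path.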
When migrating Analysis Datasets from Local to Global scope, or vice versa, Analysis Datasets will need a different API from Processed Datasets.
Don't delete anything relating to the Processed Dataset when deleting the Analysis Dataset.
- 05 Dec 2006
- 21 Dec 2006