The goal of this proposal is to reduce the amount of high-level human intervention required in the EventStore data life cycle by automatically generating the specific version name (SVN) at processing time and recording the SVN and its parents in the version graph in the output file. The solution should be quick to implement and need a minimum of cdj involvement, but not be so dirty that we cringe at the sight of it.

A new data type, VersionInfo, will be added to the begin run record. VersionInfo will own several strings, recording:

  • the value of $C3_LIB (could be just the end of the path)
  • the specific version name (SVN) for the data produced in this job
  • an md5 hash of the suez configuration for this job
  • a count of the number of VersionInfo objects in the frame when the job ran (this count will include the current job)
  • a list of this node's immediate parents in the dependency graph

All fields will be stored as strings (an inefficient representation for a counter, but easier to parse out of a pdf file). So long as begin run records are not explicitly filtered, all the VersionInfos in the input files will be copied into the output file, so the output file will automatically accumulate most of the information needed to construct the SVN GraphPath. The SVN GraphPath can be fully reconstructed if every VersionInfo records the list of its immediate parents; a new child can take as its parents any nodes in its input that are not already listed as the parent of another node. This does not accurately capture which files were actually opened, but it does capture the dependencies and defines a canonical form of the dependency graph.
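The parent rule above can be sketched as follows. This is only an illustration in Python, not the proposed implementation: the `VersionInfo` class here is a reduced stand-in holding just an SVN and a parent list, the helper names are hypothetical, and the SVN values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class VersionInfo:
    """Reduced, hypothetical stand-in for the proposed VersionInfo record."""
    svn: str
    parents: list = field(default_factory=list)


def infer_parents(infos_in_input):
    """Return SVNs of input nodes that no other input node lists as a parent."""
    already_parents = {p for info in infos_in_input for p in info.parents}
    return sorted(info.svn for info in infos_in_input
                  if info.svn not in already_parents)


def build_graph(infos):
    """Map each SVN to its recorded immediate parents."""
    return {info.svn: list(info.parents) for info in infos}


# VersionInfos copied from the input files (names are made up).
inputs = [
    VersionInfo("pass2-A", []),
    VersionInfo("postpass2-A", ["pass2-A"]),
    VersionInfo("dskim-A", ["postpass2-A"]),
]

# The new job's parents are the input nodes that are not parents already.
new_node = VersionInfo("UserAnalysis-B", infer_parents(inputs))

# The count field includes the current job and is stored as a string.
count = str(len(inputs) + 1)

graph = build_graph(inputs + [new_node])
```

Because every node carries its own parent list, the full graph can be rebuilt from the output file alone, whatever order the records are read in.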

VersionInfo will be produced by VersionInfoProd, which will take these parameters:

  • the name of the processing phase (pass2, postpass2, dskim, etc.); the production tag of the VersionInfo produced will be set to this value, and it will also be part of the SVN. This defaults to "UserAnalysis" + the username.
  • an md5 hash; if set, this must be equal to the md5 hash of the suez configuration calculated at run time; if it is not, VersionInfoProd will print the two hash values and abort the job
  • an optional comment (e.g., fixEbeam, or data32vs22), which will be added to the end of the SVN
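The md5 parameter's behavior can be sketched roughly as below. This is a Python illustration under assumptions, not VersionInfoProd's actual code: the function names are hypothetical, the configuration text is a placeholder, and treating an empty string as "not set" is an assumption.

```python
import hashlib
import sys


def md5_hex(text):
    """Return the md5 digest of a text string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def check_config_hash(expected, computed):
    """Return True if no hash was supplied or the two hashes agree.

    On a mismatch, print both values and return False so the caller can abort.
    """
    if not expected:  # assumption: an empty parameter means "not set"
        return True
    if expected != computed:
        print(f"expected md5: {expected}", file=sys.stderr)
        print(f"computed md5: {computed}", file=sys.stderr)
        return False
    return True


# Placeholder for whatever the run-time configuration hash is computed from.
computed = md5_hex("hypothetical configuration contents")

ok_unset = check_config_hash("", computed)            # parameter not set
ok_match = check_config_hash(computed, computed)      # hashes agree
ok_mismatch = check_config_hash("0" * 32, computed)   # prints both, fails

# A real job would abort on failure, for example with sys.exit(1).
```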

The SVN stored is constructed from the processing phase name + software release + configuration date + comment.
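As a rough Python sketch of that construction: the proposal does not specify a separator or a date format, so the "-" separator and the YYYYMMDD date below are assumptions, as are the function name `build_svn` and the placeholder release name `REL_X`.

```python
import getpass


def build_svn(release, config_date, phase=None, comment=None,
              username=None, sep="-"):
    """Join phase, release, configuration date and optional comment.

    The separator and the date format are assumptions, not part of the proposal.
    """
    if phase is None:
        # Default phase name: "UserAnalysis" + the username.
        user = username if username is not None else getpass.getuser()
        phase = "UserAnalysis" + user
    parts = [phase, release, config_date]
    if comment:
        parts.append(comment)  # the comment goes at the end of the SVN
    return sep.join(parts)


official = build_svn("REL_X", "20050408", phase="pass2", comment="fixEbeam")
personal = build_svn("REL_X", "20050408", username="someuser")
```

Here `official` is "pass2-REL_X-20050408-fixEbeam" and `personal` is "UserAnalysissomeuser-REL_X-20050408".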

Issues:

Do we want a "dummy" string in VersionInfo so that an additional string can be added without creating a new version of the type? [No. Store the list of parents as a vector, and everything else as a second vector; new items can be added to the end of the vector]
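The two-vector answer might look something like this Python sketch. The field order, the `FIELD_NAMES` table and the `get` accessor are illustrative assumptions; the point is only that a reader looks fields up by position and tolerates short vectors, so old records stay readable after a field is appended.

```python
# Assumed field order for the second vector; new names are only ever appended.
FIELD_NAMES = ["c3_lib", "svn", "config_md5", "count"]


class VersionInfo:
    """Hypothetical layout: one vector of parents, one vector of other strings."""

    def __init__(self, parents, fields):
        self.parents = list(parents)
        self.fields = list(fields)

    def get(self, name, default=""):
        """Look a field up by name; fields missing from old records get a default."""
        index = FIELD_NAMES.index(name)
        return self.fields[index] if index < len(self.fields) else default


# A record written with the original four fields (values are made up).
old_record = VersionInfo(["pass2-A"],
                         ["REL_X", "postpass2-A", "0" * 32, "2"])

# Later, a fifth field is added to the end without changing the type.
FIELD_NAMES.append("extra")

svn = old_record.get("svn")
count = int(old_record.get("count"))   # the counter is stored as a string
extra = old_record.get("extra")        # absent in the old record
```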

Where does the md5 hash of the suez configuration come from?
  • a hash of some set of configuration files; this could be externally calculated and provided as a parameter to VersionInfoProd, or calculated within VersionInfoProd. The latter would require configuring which files to include in the inputs. Both schemes can be trivially circumvented, and this could happen by accident or honest mistake
  • calculate from suez's internal state (e.g. list of producers, list of processors, list of sources, etc.) in VersionInfoProd's init or beginrun method. Can be circumvented by making modifications after "go 1", which in turn could be caught by overriding "go". suez's internal state can be approximated by looking at the lists of processors, producers, etc. available through JobControl; this could be used to create a hash which is insensitive to trivial changes like changing comments.
  • hybrid: primary hash is the internal state, but a parameter can feed in externally calculated state

[Following comments from Valentin, I think the hash stored should be the one calculated solely from suez's internal state, which will make it possible to later verify that a configuration checked out of cvs matches what was used to process the data; if external inputs are desired, that should go into a separate hash instead of getting mixed into the internal state hash]
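A hash of the internal state, kept separate from any external inputs, could be sketched as below. This Python illustration makes several assumptions: the internal state is approximated by three ordered lists of module names, list order is treated as significant, and all names other than VersionInfoProd are made up. How the lists would actually be obtained from JobControl is not shown.

```python
import hashlib


def internal_state_hash(producers, processors, sources):
    """Hash the ordered lists of loaded modules.

    Only names enter the hash, so comments and whitespace in the
    configuration scripts cannot change it.
    """
    hasher = hashlib.md5()
    for label, names in (("producers", producers),
                         ("processors", processors),
                         ("sources", sources)):
        hasher.update(label.encode("utf-8"))
        for name in names:
            # A separator byte keeps ["ab"] distinct from ["a", "b"].
            hasher.update(b"\0" + name.encode("utf-8"))
        hasher.update(b"\n")
    return hasher.hexdigest()


def external_hash(texts):
    """Separate hash for externally supplied inputs, never mixed with the above."""
    hasher = hashlib.md5()
    for text in texts:
        hasher.update(text.encode("utf-8") + b"\0")
    return hasher.hexdigest()


state = internal_state_hash(["VersionInfoProd", "ExampleProd"],
                            ["ExampleProc"],
                            ["example_source"])
extra = external_hash(["contents of some external file"])
```

Keeping the two digests apart means the internal-state hash can later be recomputed from a configuration checked out of cvs and compared directly.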

The md5 hash will be copied (by hand) into a configuration file, which will be read by the job control tcl scripts (for a file based hash, the hash must be in a file that is not part of the configuration set to avoid the chicken/egg problem). (Should "Official" jobs set a dummy md5 hash to avoid the default behavior when no md5 hash is specified? Is the default behavior the wrong choice? Should this be a flag?)

Where does the date come from?
  • stat the file with the md5 hash to find the last time it was changed; this requires that suez be able to find this file
  • set a parameter from a cvs $Date$ tag in the file with the md5 hash. This does require that the file be reliably checked into cvs after any significant change (which may require making a trivial change to the file with the $Date$ tag, if the change didn't change the md5 hash); we could make checking the cvs status of this file part of the job control scripts.
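Both date sources can be sketched in a few lines of Python. The function names and the YYYYMMDD output format are assumptions; the regular expression matches the form cvs gives an expanded keyword, such as `$Date: 2005/06/23 13:09:02 $`.

```python
import os
import re
import time

# Form of an expanded cvs Date keyword: $Date: YYYY/MM/DD hh:mm:ss $
CVS_DATE = re.compile(
    r"\$Date: (\d{4})/(\d{2})/(\d{2}) \d{2}:\d{2}:\d{2} \$")


def date_from_cvs_keyword(text):
    """Return YYYYMMDD from an expanded $Date$ keyword, or None if absent."""
    match = CVS_DATE.search(text)
    if match is None:
        return None  # keyword missing or never expanded by cvs
    return "".join(match.group(1, 2, 3))


def date_from_mtime(path):
    """Return YYYYMMDD (UTC) of the file's last modification time."""
    return time.strftime("%Y%m%d", time.gmtime(os.path.getmtime(path)))


config_date = date_from_cvs_keyword("# $Date: 2005/06/23 13:09:02 $")
```

The keyword route depends only on the file's contents, while the stat route depends on the file system and changes whenever the file is copied or touched.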

-- DanRiley - 08 Apr 2005
Topic revision: r4 - 23 Jun 2005, GregorySharp