CSC/ECE 517 Fall 2014/ch1a 14 yz: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
 
(87 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Refactoring Software Reengineering Toolkit ==
== Refactoring Software Reengineering Toolkit ==


The Software Reengineering Toolkit is a proprietary set of program transformation tools available for automating custom source program analysis, modification, translation or generation of software systems for arbitrary mixtures of source languages for large scale software systems.
Design Maintenance system (DMS) is a software engineering environment that supports the incremental engineering and maintenance of large application systems, driven by domain knowledge, semantics, captured designs and automation.The DMS Software Reengineering Toolkit is a proprietary set of program transformation tools available for automating custom source program analysis, modification, translation or generation of software systems for arbitrary mixtures of source languages for large scale software systems <ref>[http://portal.acm.org/citation.cfm?id=999466&dl=GUIDE&coll=GUIDE&CFID=55567354&CFTOKEN=76359207 ''DMS: Program Transformations for Practical Scalable Software Evolution''. Proceedings International Conference on Software Engineering 2004] [http://www.semanticdesigns.com/Company/Publications/DMS-for-ICSE2004-reprint.pdf Reprint]</ref>


DMS has been used to implement a wide variety of practical tools, include domain-specific languages (such as code generation for factory control), test coverage and profiling tools, clone detection, language migration tools, C++ component reengineering., and for research into difficult topic such as refactoring C++ reliably.
DMS has been used to implement a wide variety of practical tools, include [https://en.wikipedia.org/wiki/Domain-specific_language ''domain-specific language''] (such as code generation for factory control), test coverage<ref>[http://www.semanticdesigns.com/Company/Publications/TestCoverage.pdf Branch Coverage for Arbitrary Languages Made Easy]</ref> and profiling tools, [https://en.wikipedia.org/wiki/Duplicate_code ''Clone detection''],<ref>[http://www.computer.org/portal/web/csdl/doi/10.1109/ICSM.1998.738528 ''clone Detection Using Abstract Syntax Trees''. Proceedings International Conference on Software Maintenance 1998]</ref> language migration tools, C++ component reengineering.,<ref>[http://linkinghub.elsevier.com/retrieve/pii/S0950584906001856 ''Case study: Re-engineering C++ component models via automatic program transformation'' Information and Software Technology 2007]</ref> and for research into difficult topics such as refactoring C++ reliably.<ref>[http://www.sbir.gov/sbirsearch/detail/373168 ''Small Business Innovation Research (DoE): Refactor++'']</ref>


----
----
Line 9: Line 9:
__TOC__
__TOC__


== Introduction to the DMS Software Reengineering Toolkit ==
The DMS Software Reengineering Toolkit is a set of tools for automating customized source program analysis, modification or translation or generation of software systems containing arbitrary mixtures of languages ("domains"). The term "software" for DMS is very broad and covers any formal notation, including programming languages, markup languages, hardware description languages, design notations, data descriptions, and domain-specific languages. This toolkit is the first step towards the implementation of the Design Maintenance System®, an ambitious vision of a 21st Century software engineering environment that supports the incremental construction and maintenance of large application systems, driven by semantics and captured designs.


== Introduction to the DMS Software Reengineering Toolkit ==
DMS is additionally unusual in being implemented in a parallel programming language, PARLANSE <ref>[http://www.semdesigns.com/products/parlanse/index.html ''A Parallel Language for Symbolic Expression for 80x86 Symmetric Multiprocessors under Windows ''] </ref>, which uses symmetric multiprocessors available on commodity workstations. This enables DMS to provide faster answers for large system analyses and conversions.


The DMS Software Reengineering Toolkit is a set of tools for automating customized source program analysis, modification or translation or generation of software systems, containing arbitrary mixtures of languages ("domains"). The term "software" for DMS is very broad and covers any formal notation, including programming languages, markup languages, hardware description languages, design notations, data descriptions, and domain-specific languages. This toolkit is the first step towards the implementation of the Design Maintenance System®, an ambitious vision of a 21st Century software engineering environment that supports the incremental construction and maintenance of large application systems, driven by semantics and captured designs.
A very simple model of DMS is an extremely generalized compiler, having a parser, a semantic analyzer, a program transformation engine (to do code generation and optimization), and final output formatting components (producing source code rather than binary code). It is particularly important that the analyzer output can be used to choose the desired transforms. Unlike a conventional compiler, in which each component is specific to its task of translating one source language to one target machine language, each DMS component is highly parameterized, enabling a stunningly wide variety of effects. This means one can change the input language, change the analysis, change the transforms, and change the output in arbitrary ways. Also, unlike a conventional compiler, DMS can process thousands of files from multiple languages at the same moment, allowing analyses and/or consistent code changes across complex systems of files. (An interesting property is that DMS reads formal descriptions of languages, analyses and transforms, and is consequently used to support itself.) The following diagram shows the process flow and different components of a DMS model.


A very simple model of DMS is that of an extremely generalized compiler, having a parser, a semantic analyzer, a program transformation engine (to do code generation and optimization), and final output formatting components (producing source code rather than binary code). It is particularly important that the analyzer output can be used to choose the desired transforms. Unlike a conventional compiler, in which each component is specific to its task of translating one source language to one target machine language, each DMS component is highly parameterized, enabling a stunningly wide variety of effects. This means one can change the input language, change the analysis, change the transforms, and change the output in arbitrary ways. Also, unlike a conventional compiler, DMS can process thousands of files from multiple languages at the same moment, allowing analyses and/or consistent code changes across complex systems of files. (An interesting property is that DMS reads formal descriptions of languages, analyses and transforms, and is consequently used to support itself.)
[[File:DMS.png]]


Many program analysis and transformation tools are limited to ASCII or Western European character sets such as ISO-8859; DMS can handle these as well as UTF-8, UTF-16, EBCDIC, Shift-JIS and a variety of Microsoft character encodings.
Many program analysis and transformation tools are limited to ASCII or Western European character sets such as ISO-8859; DMS can handle these as well as UTF-8, UTF-16, EBCDIC, Shift-JIS and a variety of Microsoft character encodings.


DMS uses GLR parsing technology, enabling it to handle all practical context-free grammars. Semantic predicates extend this capability to interesting non-context-free grammars (Fortran requires matching of multiple DO loops with shared CONTINUE statements by label; GLR with semantic predicates enables the DMS Fortran parser to produce ASTs for correctly nested loops as it parses).
DMS uses [http://en.wikipedia.org/wiki/GLR_parser ''GLR parsing technology], enabling it to handle all practical context-free grammars. Semantic predicates extend this capability to interesting non-context-free grammars (Fortran requires matching of multiple DO loops with shared CONTINUE statements by label; GLR with semantic predicates enables the DMS Fortran parser to produce ASTs for correctly nested loops as it parses).
 


----
----
Line 24: Line 28:
== DMS Capabilities ==
== DMS Capabilities ==


DMS provides a large set of robust, integrated facilities for building analysis and modification tools:
DMS provides a large set of robust, integrated facilities for building analysis and modification tools<ref>[http://www.semanticdesigns.com/Company/Publications/DMS-for-ICSE2004-reprint.pdf ''Program Transformations for Practical Scalable Software Evolution''] </ref>:


Full UNICODE-based parser and lexer generation with automatic error recovery. Accepts/generates files encoded in UTF-8 and UTF-16, 7 bit ASCII (ISO-646-US), 8 bit ASCII (ISO-8859-1 thru -16), EBCDIC (CP-37, CP-500), a number of Microsoft code pages (CP-1250 thru -1258), and Japanese Shift-JIS (CP-932 and JIS-0208). Standard support is included for reading multiple source files to enable INCLUDE file management and construct suitable preprocessors. The parser technology is based on GLR, and can handle any context-free language, even with ambiguities (much stronger than YACC/LALR). Proven on dozens of real languages.
*Full UNICODE-based '''parser and lexer generation''' with automatic error recovery. Accepts/generates files encoded in UTF-8 and UTF-16, 7 bit ASCII (ISO-646-US), 8 bit ASCII (ISO-8859-1 thru -16), EBCDIC (CP-37, CP-500), a number of Microsoft code pages (CP-1250 thru -1258), and Japanese Shift-JIS (CP-932 and JIS-0208). Standard support is included for reading multiple source files to enable INCLUDE file management and construct suitable preprocessors. The parser technology is based on GLR, and can handle any context-free language, even with ambiguities (much stronger than YACC/LALR).


Automatic construction of abstract (not concrete) syntax trees (non-value-carrying terminals and unit productions are suppressed; syntax-lists are converted into AST list nodes). Literal values (numbers, escaped strings) are converted to native, normalized binary values for fast internal manipulation. Source comments are captured and attached to AST nodes.
*Automatic construction of '''abstract''' (not concrete) '''syntax trees''' (non-value-carrying terminals and unit productions are suppressed; syntax-lists are converted into AST list nodes). Literal values (numbers, escaped strings) are converted to native, normalized binary values for fast internal manipulation. Source comments are captured and attached to AST nodes.


Pretty-printer generation converts ASTs back to nicely formatted legal source file form, according to a specified layout information, including source comments. In fidelity-printing mode, comments, spacing and lexical formatting information of unchanged code is preserved. Customizing allows generation of source code HTML form, or even as obfuscated source text. Trees may be output directly in XML format.
*'''Pretty-printer generation''' converts ASTs back to nicely formatted legal source file form, according to a specified layout information, including source comments. In fidelity-printing mode, comments, spacing and lexical formatting information of unchanged code is preserved. Customizing allows generation of source code HTML form, or even as obfuscated source text. Trees may be output directly in XML format.


Multi-pass attribute-evaluator generation from grammar, to allow arbitrary analysis (including name/type analysis procedures) to be specified in terms of the concrete grammar provided.
*Multi-pass '''attribute-evaluator generation''' from grammar, to allow arbitrary analysis (including name/type analysis procedures) to be specified in terms of the concrete grammar provided.


Sophisticated symbol-table construction facilities for global, local, inherited, overloaded and other language-dependent name lookup and namespace management rules.
*Sophisticated '''symbol-table construction''' facilities for global, local, inherited, overloaded and other language-dependent name lookup and namespace management rules.


Control-flow graph construction including traditional entry/exit/action/condition nodes, but also fork/join nodes to model parallelism and/or indeterminate order (e.g., C sequence points). There predefined analyzers for constructing (post) dominators, and inducing structured control-flow regions. Additional machinery can compute compilation-unit local and system/global call graphs.
*'''Control-flow graph construction''' including traditional entry/exit/action/condition nodes, but also fork/join nodes to model parallelism and/or indeterminate order (e.g., C sequence points). There predefined analyzers for constructing (post) dominators, and inducing structured control-flow regions. Additional machinery can compute compilation-unit local and system/global call graphs.


Data flow analysis framework, to allow data-flow analysis problems to be posed and answered, including predefined analyzers for constructing use-def and def-use chains. (See sample control and data flow graphs)
*'''Data flow analysis framework''', to allow data-flow analysis problems to be posed and answered, including predefined analyzers for constructing use-def and def-use chains. (See sample control and data flow graphs)


Points-to analysis for computing local or global points-to data, tested on systems of 13+ million lines of code.
*'''Points-to analysis''' for computing local or global points-to data, tested on systems of 13+ million lines of code.


Symbolic Range Analysis computing range constraints on program variables in terms of other variables. This is useful to detecting array-access errors, determining which switch case is selected, ... in conjunction with other analyses.
*'''Symbolic Range Analysis''' computing range constraints on program variables in terms of other variables. This is useful to detecting array-access errors, determining which switch case is selected, ... in conjunction with other analyses.


Multiple domains (notations/languages) can be represented at the same time. This enables processing or generating systems composed of parts from more than one domain (COBOL and JCL, C and Makefiles, etc.), and/or translation from one domain language to another.
*'''Multiple domains (notations/languages)''' can be represented at the same time. This enables processing or generating systems composed of parts from more than one domain (COBOL and JCL, C and Makefiles, etc.), and/or translation from one domain language to another.


Transforms and patterns can be written directly in surface-to-surface domain syntax form. Patterns can be matched against syntax trees and return bindings for parameter subtrees. Alternatively, procedural code can implement transforms, or refer to existing transforms and patterns to enable construction of very sophisticated transforms.
*'''Transforms''' and '''patterns''' can be written directly in surface-to-surface domain syntax form. Patterns can be matched against syntax trees and return bindings for parameter subtrees. Alternatively, procedural code can implement transforms, or refer to existing transforms and patterns to enable construction of very sophisticated transforms.


A full Associative/Commutative rewrite engine that operates on trees and DAGs, which can be used to apply sets of transforms.
*A full '''Associative/Commutative rewrite engine''' that operates on trees and DAGs, which can be used to apply sets of transforms.


A metaprogramming language, XCL, provides the ability to control the sequencing of the application of transforms and sets of transforms. (Future Release)
*An '''algebraic specification subsystem''' can be used to specify arbitrary algebras (this is just a DMS domain!). The axioms can be treated as a set of rewrite rules. (This allows one to code arbitrary simplification procedures.


An algebraic specification subsystem can be used to specify arbitrary algebras (this is just a DMS domain!). The axioms can be treated as a set of rewrite rules. (This allows one to code arbitrary simplification procedures. (We have done simplification on boolean equations that are essentially 1 million terms in size; we have also modeled optimization of transistor [not gates!] circuits this way).
The features of DMS and how each of the features work in a DMS model is shown in the diagram below.  


[[File:DMS 2.png]]
----
----


== Avaiable Languages ==
== Available Languages ==


DMS has a variety of predefined language front ends, which allows the construction of custom compilers, analysis tools, or source transformation tools<ref>[http://www.semdesigns.com/Products/FrontEnds/index.html?Home=DMSDomains ''Language/Compiler Front Ends'']</ref>. Predefined languages enable customizers to immediately focus on their reengineering task rather than on the details of the languages to be processed.


The list of language modules includes:
The list of major language modules includes:


Ada 83/95
*[http://en.wikipedia.org/wiki/C_(programming_language) ''C''](ANSI, GNU, C99, Microsoft dialects), with intelligently managed include and preprocessor directives, full name and type resolution, control and data flow analysis, system-wide call graph, system-wide points-to analysis


C (ANSI, GNU, C99, Microsoft dialects), with intelligently managed include and preprocessor directives, full name and type resolution, control and data flow analysis, system-wide call graph, system-wide points-to analysis
*[https://en.wikipedia.org/wiki/C%2B2B ''C++''] (ANSI, GNU, Microsoft dialects), including C++0x with intelligently managed include and preprocessor directives, and full name and type resolution


C++ (ANSI, GNU, Microsoft dialects), including C++0x with intelligently managed include and preprocessor directives, and full name and type resolution
*[http://en.wikipedia.org/wiki/C_Sharp_(programming_language) ''C#''] (Microsoft's .NET language) versions 1.2, 2.0, 3.0 and 4.0


C# (Microsoft's .NET language) versions 1.2, 2.0, 3.0 and 4.0
*[http://en.wikipedia.org/wiki/COBOL ''COBOL''] 85 ("ANSI"), IBM VS COBOL II, IBM Enterprise COBOL with preprocessor, COPYLIB management and Report Writer, having full name and type resolution, and control and data flow analysis


COBOL 85 ("ANSI"), IBM VS COBOL II, IBM Enterprise COBOL with preprocessor, COPYLIB management and Report Writer, having full name and type resolution, and control and data flow analysis
*[http://en.wikipedia.org/wiki/Fortran ''Fortran''] 95/90/77


CFEngine v3, a system configuration and policy enforcement aid
*[http://en.wikipedia.org/wiki/Java_(programming_language) ''Java''] 1.1,1.2,1.3,1.4,1.5,1.6, with full name and type resolution including the JDK and class files, control flow analysis and call graph construction.


Delphi 6 ("Borland ObjectPascal")
*MATLAB M-files and Simulink


ECMAScript (ECMA-262, JavaScript [Microsoft and Netscape dialects], ActionScript, ActionScript2, ActionScript3, ASP, JSP, HTML and XML scripts)
*[http://en.wikipedia.org/wiki/Perl ''Perl''] 5


EGL and VAGen (IBM)
*[http://en.wikipedia.org/wiki/PHP ''PHP'']: PHP3, PHP4 and PHP5


FORTRAN 95/90/77
*[http://en.wikipedia.org/wiki/Python_(programming_language) ''Python''] 2.6 and 3.1


HLASM (IBM)
*Verilog 1995, 2001 and SystemVerilog 3.1a


HTML 4.0, XHTML, plus IE dialect
*VHDL 1993


IDL (Corba 2.3)
*Visual Basic (VB.net, VB6, VBScript+ASP)


IEC 61131-3 Structured Text (Industrial Automation Control)
----


Java 1.1,1.2,1.3,1.4,1.5,1.6, with full name and type resolution including the JDK and class files, control flow analysis and call graph construction.
== Compare to other Analysis and Transformation Technology ==
 
JCL (IBM)
 
JOVIAL (Legacy military embedded systems language) with full name and type analysis
 
Mathematica
 
MATLAB M-files and Simulink
 
Motorola M6800/M6801/M6805/M6808/M6809/M6811/M6812 Assembly language
 
PARLANSE
 
Natural (Mainframe application programming language) and Adabas DDMs
 
Pascal (ISO 7185)
 
Perl 5
 
PHP3, PHP4 and PHP5
 
Pick Data Basic (Universe dialect)
 
PL/1 (IBM)
 
PL/SQL (Oracle Database programming language)
 
Progress aka OpenEdge (a 4GL)
 
Python 2.6 and 3.1
 
Rational Rose UML (.MDL files)
 
SQL ANSI 2011
 
SystemC 2.1
 
Verilog 1995, 2001 and SystemVerilog 3.1a
 
VHDL 1993


Visual Basic (VB.net, VB6, VBScript+ASP)
The DMS Software Reengineering Toolkit ("DMS") is used to automate program analyses and transformations. It consequently shares a number of properties with other compiler technologies, but goes beyond them often in a variety of ways. This section explores other technologies that are available in practice and compares DMS to them. Only the C++ Front ends and the compiler infrastructure will be shown, for a complete list, please refer to the DMS website. <Ref>[http://www.semanticdesigns.com/Products/DMS/DMSComparison.html#CppFrontEnds ''comparison of DMS to other well-known types of compiler technologies'']</ref>


XML
C++ parsers ("front ends") are intended to read C++ source code and build compiler data structures representing that code, with the expectation that a custom tool will be build to process the compiler data structures, for analysis or C++ transformation.
----


{| class="wikitable"
|-
! C++ Front Ends !! Status !! C/C++ Dialects Available || Scale || Performance || Implementation Language
|-
| DMS || Production; Commercial; Active Enhancement ||
Production;
C & C++;
C++0x;
ANSI,GCC,
MSVC6,
MS Studio 2005,
SystemC;
|| Tens of thousands of source files; Multilingual processing || Parallel implementation || PARLANSE
|-
| GCC || Production; Open Source; Active Enhancement || GCC; ANSI||One source plus includes || Fast, hand tuned || GCC C
|-
| EDG || Stable; Academic || Most C/C++ dialects || One Source file plus includes|| Fast hand tuned || C
|-
|OpenC++|| Stable; No preprocessor; Open Source || Most of ANSI C++; Limited template handling ||Single file plus includes || Unknown || C++
|-
|Columbus || No preprocessor || ANSI C++ || Single plus includes || unknown || C++
|}


== Compare to other Analysis and Transformation Technology ==
Compiler infrastructures are intended to make it easier to construct compilers or compiler-like tools such as source code instrumenters.


Compiler
{| class="wikitable"
Infrastructures Status Languages
|-
Available Post-Parsing
! Compiler Infrastructure !! Status !! Languages Available || Scale || Performance || Implementation Language
Customization
|-
Capability Scale Performance Implementation
| DMS || Production; Commercial; Active Enhancement ||
Language
C/C++0x,
DMS Production;
Commercial;
Active Enhancement Production
preprocessor/C/C++0x,
C#, Java1.5,
C#, Java1.5,
Fortran 77/95,
Fortran 77/95,
Line 159: Line 142:
SQL 2011,PLSQL,
SQL 2011,PLSQL,
Matlab M, Simulink,
Matlab M, Simulink,
Verilog, VHDL, SystemC;
Verilog, VHDL, SystemC;  
Extensible to others
|| Tens of thousands of source files; Multilingual processing || Parallel implementation || PARLANSE
(e.g. DSP assembly);
|-
Easily extended to include DSLs for graphs, code generation, etc. Procedural tree processing;
| GCC || Production; Open Source; Active Enhancement || GCC; ANSI||One source plus includes || Fast, hand tuned || GCC C
Name/Type resolution with or without preprocessing;
|-
Source-to-Source Transformation;
| LCC || Stable; Academic || ANSI C || One Source file plus includes|| Fast hand tuned || C
Attribute Evaluation;
|}
Regeneration of source with comments and preprocessor directives;
Control, dataflow, and global points-to analyzers;
Used to build a wide variety of source-analysis or instrumentation tools, including code generators and vectorizers. Tens of
thousands
of source files;
Multilingual processing Parallel implementation PARLANSE (parallel)
GCC Production;
Open Source;
Active Enhancement GCC; ANSI Arbitrary procedural analysis and transformation;
Code-generation focused flow analysis;
Tangled in "wants to be a compiler for every possible machine" One source file plus includes Fast, hand tuned GCC C
LCC Stable;
Academic; ANSI C Arbitrary procedureal analysis and transformation;
Tangled in "wants to be a compiler" One source file plus includes Fast, hand-tuned C
SUIF Stable;
Academic;
No preprocessor C and FORTRAN 90? Procedural tree processing;
Name/Type Resolution after preprocessing;Data flow analyses;
Reneration of preprocessed source? Multiple files? Unknown C++?
Open64 Stable;
Open Source ANSI C, C++?;
FORTRAN 90?; Arbitrary procedural analysis and transformation;
Data flow analysis;
Reneration of preprocessed source?
Tangled in "wants to be an SGI compiler" Single source file plus definitions from includes? Likely fast, used as production compiler? C++
CLang/LLVM Stable;
Open Source;
Active enhancement;
Integrated preprocessor ANSI C++;
GCC dialect? Name/type resolution while parsing;
Procedural tree processing;
Regeneration of source (w/ comments? preprocessor?)
Native x86 code generation via LLVM;
Varying flow analyses via LLVM;
Custom analyzers in C++ Single preprocessed file? Fast,hand tuned C++


----
----
Line 210: Line 158:
DMS is designed to work on large scale source systems
DMS is designed to work on large scale source systems


-with up to several million lines of source code or specification
*with up to several million lines of source code or specification
 
*across tens of thousands of source files
 
*having multiple languages at the same time
 
DMS is implemented using parallel language, PARLANSE, to provide computational horsepower consistent with this scale. While DMS runs on a single processor system at unit speed, it also runs on symmetric multiple processor workstations with enhanced performance. As an example, the attribute evaluation process is automatically parallelized, and can often provide a linear speedup on an N-way SMP system.


-across tens of thousands of source files
DMS is hosted on Windows (any version) using Intel or AMD single or symmetric multiprocessing (SMP) hardware. Using Wine, DMS runs on Linux, Solaris and MAC OS X<ref>[https://www.youtube.com/watch?v=C-_dw9iEzhA ''Google Tech talk on DMS''] </ref>.


-having multiple languages at the same time
----


DMS is implemented using our parallel language, PARLANSE, to provide computational horsepower consistent with this scale. While DMS runs on a single processor system at unit speed, it also runs on symmetric multiple processor workstations with enhanced performance. As an example, the attribute evaluation process is automatically parallelized, and can often provide a linear speedup on an N-way SMP system.
== References ==


DMS is hosted on Windows (any version) using Intel or AMD single or symmetric multiprocessing (SMP) hardware. Using Wine, DMS runs on Linux, Solaris and MAC OS X.
<references/>


----
----

Latest revision as of 03:40, 7 October 2014

Refactoring Software Reengineering Toolkit

Design Maintenance system (DMS) is a software engineering environment that supports the incremental engineering and maintenance of large application systems, driven by domain knowledge, semantics, captured designs and automation.The DMS Software Reengineering Toolkit is a proprietary set of program transformation tools available for automating custom source program analysis, modification, translation or generation of software systems for arbitrary mixtures of source languages for large scale software systems <ref>DMS: Program Transformations for Practical Scalable Software Evolution. Proceedings International Conference on Software Engineering 2004 Reprint</ref>

DMS has been used to implement a wide variety of practical tools, include domain-specific language (such as code generation for factory control), test coverage<ref>Branch Coverage for Arbitrary Languages Made Easy</ref> and profiling tools, Clone detection,<ref>clone Detection Using Abstract Syntax Trees. Proceedings International Conference on Software Maintenance 1998</ref> language migration tools, C++ component reengineering.,<ref>Case study: Re-engineering C++ component models via automatic program transformation Information and Software Technology 2007</ref> and for research into difficult topics such as refactoring C++ reliably.<ref>Small Business Innovation Research (DoE): Refactor++</ref>


Introduction to the DMS Software Reengineering Toolkit

The DMS Software Reengineering Toolkit is a set of tools for automating customized source program analysis, modification or translation or generation of software systems containing arbitrary mixtures of languages ("domains"). The term "software" for DMS is very broad and covers any formal notation, including programming languages, markup languages, hardware description languages, design notations, data descriptions, and domain-specific languages. This toolkit is the first step towards the implementation of the Design Maintenance System®, an ambitious vision of a 21st Century software engineering environment that supports the incremental construction and maintenance of large application systems, driven by semantics and captured designs.

DMS is additionally unusual in being implemented in a parallel programming language, PARLANSE <ref>A Parallel Language for Symbolic Expression for 80x86 Symmetric Multiprocessors under Windows </ref>, which uses symmetric multiprocessors available on commodity workstations. This enables DMS to provide faster answers for large system analyses and conversions.

A very simple model of DMS is an extremely generalized compiler, having a parser, a semantic analyzer, a program transformation engine (to do code generation and optimization), and final output formatting components (producing source code rather than binary code). It is particularly important that the analyzer output can be used to choose the desired transforms. Unlike a conventional compiler, in which each component is specific to its task of translating one source language to one target machine language, each DMS component is highly parameterized, enabling a stunningly wide variety of effects. This means one can change the input language, change the analysis, change the transforms, and change the output in arbitrary ways. Also, unlike a conventional compiler, DMS can process thousands of files from multiple languages at the same moment, allowing analyses and/or consistent code changes across complex systems of files. (An interesting property is that DMS reads formal descriptions of languages, analyses and transforms, and is consequently used to support itself.) The following diagram shows the process flow and different components of a DMS model.

Many program analysis and transformation tools are limited to ASCII or Western European character sets such as ISO-8859; DMS can handle these as well as UTF-8, UTF-16, EBCDIC, Shift-JIS and a variety of Microsoft character encodings.

DMS uses GLR parsing technology, enabling it to handle all practical context-free grammars. Semantic predicates extend this capability to interesting non-context-free grammars (Fortran requires matching of multiple DO loops with shared CONTINUE statements by label; GLR with semantic predicates enables the DMS Fortran parser to produce ASTs for correctly nested loops as it parses).



DMS Capabilities

DMS provides a large set of robust, integrated facilities for building analysis and modification tools<ref>Program Transformations for Practical Scalable Software Evolution </ref>:

  • Full UNICODE-based parser and lexer generation with automatic error recovery. Accepts/generates files encoded in UTF-8 and UTF-16, 7 bit ASCII (ISO-646-US), 8 bit ASCII (ISO-8859-1 thru -16), EBCDIC (CP-37, CP-500), a number of Microsoft code pages (CP-1250 thru -1258), and Japanese Shift-JIS (CP-932 and JIS-0208). Standard support is included for reading multiple source files to enable INCLUDE file management and construct suitable preprocessors. The parser technology is based on GLR, and can handle any context-free language, even with ambiguities (much stronger than YACC/LALR).
  • Automatic construction of abstract (not concrete) syntax trees (non-value-carrying terminals and unit productions are suppressed; syntax-lists are converted into AST list nodes). Literal values (numbers, escaped strings) are converted to native, normalized binary values for fast internal manipulation. Source comments are captured and attached to AST nodes.
  • Pretty-printer generation converts ASTs back to nicely formatted legal source file form, according to a specified layout information, including source comments. In fidelity-printing mode, comments, spacing and lexical formatting information of unchanged code is preserved. Customizing allows generation of source code HTML form, or even as obfuscated source text. Trees may be output directly in XML format.
  • Multi-pass attribute-evaluator generation from grammar, to allow arbitrary analysis (including name/type analysis procedures) to be specified in terms of the concrete grammar provided.
  • Sophisticated symbol-table construction facilities for global, local, inherited, overloaded and other language-dependent name lookup and namespace management rules.
  • Control-flow graph construction including traditional entry/exit/action/condition nodes, but also fork/join nodes to model parallelism and/or indeterminate order (e.g., C sequence points). There predefined analyzers for constructing (post) dominators, and inducing structured control-flow regions. Additional machinery can compute compilation-unit local and system/global call graphs.
  • Data flow analysis framework, to allow data-flow analysis problems to be posed and answered, including predefined analyzers for constructing use-def and def-use chains. (See sample control and data flow graphs)
  • Points-to analysis for computing local or global points-to data, tested on systems of 13+ million lines of code.
  • Symbolic Range Analysis computing range constraints on program variables in terms of other variables. This is useful to detecting array-access errors, determining which switch case is selected, ... in conjunction with other analyses.
  • Multiple domains (notations/languages) can be represented at the same time. This enables processing or generating systems composed of parts from more than one domain (COBOL and JCL, C and Makefiles, etc.), and/or translation from one domain language to another.
  • Transforms and patterns can be written directly in surface-to-surface domain syntax form. Patterns can be matched against syntax trees and return bindings for parameter subtrees. Alternatively, procedural code can implement transforms, or refer to existing transforms and patterns to enable construction of very sophisticated transforms.
  • A full Associative/Commutative rewrite engine that operates on trees and DAGs, which can be used to apply sets of transforms.
  • An algebraic specification subsystem can be used to specify arbitrary algebras (this is just a DMS domain!). The axioms can be treated as a set of rewrite rules. (This allows one to code arbitrary simplification procedures.

The features of DMS and how each of the features work in a DMS model is shown in the diagram below.


Available Languages

DMS has a variety of predefined language front ends, which allows the construction of custom compilers, analysis tools, or source transformation tools<ref>Language/Compiler Front Ends</ref>. Predefined languages enable customizers to immediately focus on their reengineering task rather than on the details of the languages to be processed.

The list of major language modules includes:

  • C(ANSI, GNU, C99, Microsoft dialects), with intelligently managed include and preprocessor directives, full name and type resolution, control and data flow analysis, system-wide call graph, system-wide points-to analysis
  • C++ (ANSI, GNU, Microsoft dialects), including C++0x with intelligently managed include and preprocessor directives, and full name and type resolution
  • C# (Microsoft's .NET language) versions 1.2, 2.0, 3.0 and 4.0
  • COBOL 85 ("ANSI"), IBM VS COBOL II, IBM Enterprise COBOL with preprocessor, COPYLIB management and Report Writer, having full name and type resolution, and control and data flow analysis
  • Java 1.1,1.2,1.3,1.4,1.5,1.6, with full name and type resolution including the JDK and class files, control flow analysis and call graph construction.
  • MATLAB M-files and Simulink
  • PHP: PHP3, PHP4 and PHP5
  • Verilog 1995, 2001 and SystemVerilog 3.1a
  • VHDL 1993
  • Visual Basic (VB.net, VB6, VBScript+ASP)

Compare to other Analysis and Transformation Technology

The DMS Software Reengineering Toolkit ("DMS") is used to automate program analyses and transformations. It consequently shares a number of properties with other compiler technologies, but goes beyond them often in a variety of ways. This section explores other technologies that are available in practice and compares DMS to them. Only the C++ Front ends and the compiler infrastructure will be shown, for a complete list, please refer to the DMS website. <Ref>comparison of DMS to other well-known types of compiler technologies</ref>

C++ parsers ("front ends") are intended to read C++ source code and build compiler data structures representing that code, with the expectation that a custom tool will be build to process the compiler data structures, for analysis or C++ transformation.

C++ Front Ends Status C/C++ Dialects Available Scale Performance Implementation Language
DMS Production; Commercial; Active Enhancement

Production; C & C++; C++0x; ANSI,GCC, MSVC6, MS Studio 2005, SystemC;

Tens of thousands of source files; Multilingual processing Parallel implementation PARLANSE
GCC Production; Open Source; Active Enhancement GCC; ANSI One source plus includes Fast, hand tuned GCC C
EDG Stable; Academic Most C/C++ dialects One Source file plus includes Fast hand tuned C
OpenC++ Stable; No preprocessor; Open Source Most of ANSI C++; Limited template handling Single file plus includes Unknown C++
Columbus No preprocessor ANSI C++ Single plus includes unknown C++

Compiler infrastructures are intended to make it easier to construct compilers or compiler-like tools such as source code instrumenters.

Compiler Infrastructure Status Languages Available Scale Performance Implementation Language
DMS Production; Commercial; Active Enhancement

C/C++0x, C#, Java1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net,VBScript,VB6, Python 2/3, HTML, XML, SQL 2011,PLSQL, Matlab M, Simulink, Verilog, VHDL, SystemC;

Tens of thousands of source files; Multilingual processing Parallel implementation PARLANSE
GCC Production; Open Source; Active Enhancement GCC; ANSI One source plus includes Fast, hand tuned GCC C
LCC Stable; Academic ANSI C One Source file plus includes Fast hand tuned C

Industrial Scale

These foundations have been used for an amazing variety of industrial tasks, including quality analysis, restructuring, automated porting, pretty printing and highly optimized code generation.

DMS is designed to work on large scale source systems

  • with up to several million lines of source code or specification
  • across tens of thousands of source files
  • having multiple languages at the same time

DMS is implemented using parallel language, PARLANSE, to provide computational horsepower consistent with this scale. While DMS runs on a single processor system at unit speed, it also runs on symmetric multiple processor workstations with enhanced performance. As an example, the attribute evaluation process is automatically parallelized, and can often provide a linear speedup on an N-way SMP system.

DMS is hosted on Windows (any version) using Intel or AMD single or symmetric multiprocessing (SMP) hardware. Using Wine, DMS runs on Linux, Solaris and MAC OS X<ref>Google Tech talk on DMS </ref>.


References

<references/>