CSC/ECE 517 Fall 2009/wiki3 4 ashi4

From Expertiza_Wiki
Latest revision as of 02:27, 19 November 2009

DRY Principle for Data


Introduction

Don't Repeat Yourself (DRY), also known as Duplication Is Evil (DIE), is one of the most fundamental principles of good programming. It was formulated by Andrew Hunt and David Thomas in their book The Pragmatic Programmer. The principle states that "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system." It has been widely accepted by developers across the globe and is considered one of the best coding practices for building efficient, flexible code that is easy both to understand and to maintain. A parallel concept to DRY is Orthogonality, which means making the components of a system as functionally independent as possible. It is well accepted that DRY and Orthogonality together lead to well-structured code.

If we look at the DRY principle more carefully, however, it is easy to see that it applies not only to code but to any piece of knowledge that can be represented in some way. One such form of knowledge representation is data; in fact, the terms data and knowledge are often used more or less interchangeably. The purpose of this document is to emphasize that the DRY principle applies to data as well as to code.

Some Basic Terminologies

  • DRY - Don't Repeat Yourself, which means that any piece of knowledge should occur once and only once.
  • DIE - Duplication Is Evil, which is another name for the DRY principle.
  • Data Duplication - Multiple instances of the same data.
  • Database Anomaly - An inconsistency in a database caused by redundancy.
  • Functional dependency - When an attribute X functionally determines another attribute Y, Y is said to be functionally dependent on X.

Why DRY?

The question should really be "Why not DRY?". When building or extending a software system, it is tempting to simply write the logic, make sure the code works, and check that the overall system functions properly, without worrying about whether the code follows best practices or is easy to maintain. Applying the DRY principle is nevertheless considered good technique, for the following reasons:

1. Duplicating code for convenience causes a lot of overhead in terms of time and space.
   An Example :

Suppose that we need to develop software which queries the database very often. If a query is static, that is, its
input values do not change and it returns the same result set no matter how many times it is executed, it is good
practice to execute the query once and store the result set where the different parts of the source code can reuse it.
This saves execution time, and keeping one shared result set instead of several saves memory.

2. The system might function really well, but it becomes extremely difficult to maintain once the source code has turned clumsy.
   An Example :

The responsibility of a software developer or a software firm does not end with building the software, nor can the user
be expected to stay satisfied with the product as delivered. Changes or an upgrade might be necessary at any time. If the
code is full of duplication and redundancy, understanding and extending it becomes almost impossible.

3. It is said that a good test of whether code follows best practices is whether it needs extensive documentation to be understood.
   An Example :

How much to document is left to the programmer's discretion. Suppose there is a variable X which represents a person's
salary. The salary is computed within the code and stored in X. Every method that uses X then needs a comment saying that
X represents the salary. It would have been a lot easier if the variable had simply been named 'salary'.

4. It becomes very difficult to debug the code in case of anomalies.
   An Example :

Suppose there is a method M which calculates salary and that, for the programmer's convenience, this method is duplicated
in a different class. Assume that the salary has to be revised for some set of employees, and the programmer changes M in
one class but forgets to do so in the other. The person debugging the code would need a lot of time and effort to find out
why the salary is not being revised in certain cases.

5. It is always good practice to have a single point of reference.
   An Example :

The example given in point 4 above shows why we should have just one method M that calculates the salary.
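
The salary example from points 4 and 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the names compute_salary, Employee, Contractor, and RAISE_RATE are hypothetical and do not come from any particular system. The point it shows is that both classes call one shared rule, so a salary revision is made in exactly one place.

```python
from dataclasses import dataclass

# Hypothetical revision rate; changing it here revises every salary at once.
RAISE_RATE = 0.05


def compute_salary(base_pay: float, bonus: float = 0.0) -> float:
    """The single authoritative salary rule (the method M of point 4)."""
    return round((base_pay + bonus) * (1 + RAISE_RATE), 2)


@dataclass
class Employee:
    name: str
    base_pay: float
    bonus: float = 0.0

    def salary(self) -> float:
        return compute_salary(self.base_pay, self.bonus)


@dataclass
class Contractor:
    name: str
    base_pay: float

    def salary(self) -> float:
        # Reuses the shared rule instead of keeping a private copy of it.
        return compute_salary(self.base_pay)
```

Had Contractor carried its own copy of the formula, a change to the rule would have to be repeated there, which is exactly the situation in which one copy gets missed.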

Why Data Duplication is harmful

The previous section discussed why DRY is important by showing how ignoring it results in clumsy code. It is a misconception that DRY applies to code only. It applies to data just as much, as this section shows by analysing how harmful data duplication can be.
The following figure illustrates how data duplication violates DRY.

Taken from http://tagschema.com/blogs/tagschema/



Let us consider some of the many possible problems of data duplication:

  • Stale Data: The main problem with having multiple copies of the data is that one of the copies may become stale because it was not updated. This can lead to undefined system behavior.
  • More Memory: Storing multiple copies increases the amount of memory required by the system.
  • Causes anomalies: Redundant copies can become inconsistent with one another when one copy is changed and the others are not. Please refer to point 4 of the previous section.
  • Too many references to change: Please refer to point 5 of the previous section.
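
The stale-data problem can be made concrete with a small Python sketch. The records, field names, and helper functions below (orders_dup, customers, stale_copies, email_for) are invented for illustration only. The first layout copies a customer's email into every order; the second stores it once and lets orders refer to the customer by key.

```python
# Duplicated layout: every order carries its own copy of the email.
orders_dup = [
    {"order_id": 1, "customer": "alice", "email": "alice@old.example"},
    {"order_id": 2, "customer": "alice", "email": "alice@old.example"},
]
# An update that reaches only one copy leaves the other one stale.
orders_dup[0]["email"] = "alice@new.example"


def stale_copies(orders):
    """Return the customers whose duplicated emails disagree."""
    first_seen = {}
    stale = set()
    for order in orders:
        email = first_seen.setdefault(order["customer"], order["email"])
        if email != order["email"]:
            stale.add(order["customer"])
    return stale


# Single-source layout: the email is stored once, orders hold only the key.
customers = {"alice": {"email": "alice@old.example"}}
orders = [
    {"order_id": 1, "customer": "alice"},
    {"order_id": 2, "customer": "alice"},
]


def email_for(order):
    """Look the email up through the one authoritative record."""
    return customers[order["customer"]]["email"]


# One update is visible through every order.
customers["alice"]["email"] = "alice@new.example"
```

In the duplicated layout the inconsistency has to be detected after the fact; in the single-source layout it cannot arise, because there is only one copy to update.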

When Data Duplication wins over DRY...

Although data duplication is harmful, as discussed in the previous section, it is sometimes required. Some examples are listed below.

  • Caches in Computers: A cache maintains a copy of data which is present in main memory. This is good for the system as a whole because it improves the overall performance of the computer: the processor does not have to fetch the data or instruction from main memory on every access.
  • Buffer Manager in DBMS: The buffer manager is an intermediate layer in the DBMS which fetches DB pages from the disk and buffers them for the upper-layer components to use. It creates copies of the DB pages and stores them in the buffer pool.
  • Source Control: Source control software creates a copy of the source files for each project to ensure that the properly working source is not disturbed.
  • Read-copy-update (RCU) [5]: RCU is an operating system kernel technology for improving performance on computers with more than one CPU. It allows you to read a shared data structure as if no other CPU were accessing it. To update the data structure, you change a copy, switch the global pointer to the new version, and keep the old copy until all the threads that were executing inside the kernel at that point have finished. Because new readers follow the updated pointer, no CPU then holds a reference to the old copy, and it can be deleted.
  • Global Data: Global data can be a cause for concern because multiple components of the system share the same data, especially in a multi-threaded application. Instead, each component could have its own locally accessible copy of the data.
  • Prototypes: Prototypes are another example where DRY is violated. When a new object has to be created, we clone the prototype, thereby creating multiple copies of the same thing in the system.
  • Documentation: The documentation of code can repeat what the code itself says. Where the code is simple enough to need no explanation, the comments can be omitted on the basis of DRY, but code that is large and clumsy needs documentation in every possible place, even if that means repeating the same thing.
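
As an illustration of duplication that pays off, here is a minimal read-through cache in Python. It is a sketch under assumptions, not a model of any real hardware cache or DBMS buffer manager: SlowStore stands in for whatever is expensive to read (main memory, a disk page, a database), and both class names are hypothetical. The cache deliberately keeps a second copy of each value, and drops that copy on every write so that it cannot go stale.

```python
class SlowStore:
    """Stand-in for an expensive backing store; counts how often it is read."""

    def __init__(self, data):
        self._data = dict(data)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self._data[key]

    def write(self, key, value):
        self._data[key] = value


class CachedStore:
    """Keeps a duplicate of each value it has read to avoid slow reads."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def read(self, key):
        if key not in self._cache:
            # First access: fetch from the slow store and keep a copy.
            self._cache[key] = self._store.read(key)
        return self._cache[key]

    def write(self, key, value):
        self._store.write(key, value)
        # Invalidate the duplicate so a later read fetches the new value.
        self._cache.pop(key, None)
```

Reading the same key several times through CachedStore touches the slow store only once. The price of the duplicate is the invalidation step in write, which is what keeps the stale-data problem from the previous section under control.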

Conclusion

DRY is one of the important principles to apply while designing a system. For code, DRY is definitely a must, but as shown above there are cases where DRY can be violated for data. There can be no definite guidelines for when to break the rule, but violating DRY for data should be considered in the following cases:

  • If the performance of the system is significantly improved. For example, adding layers of cache improves the performance of a computer significantly.
  • If there is a need for data redundancy. For example, source control software makes sure that if one version of a file is damaged, there are always previous versions to fall back on.
  • If violating DRY reduces overhead. For example, making local copies of global data removes the overhead of adding logic for mutual exclusion.
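
The third case can be sketched in Python as follows. The function name and parameters are hypothetical, and the example assumes a simple counting task: instead of having every thread update one shared global counter under a lock, each worker keeps a private copy and the copies are merged once at the end.

```python
import threading


def count_with_local_copies(num_threads: int, increments: int) -> int:
    """Count in parallel using one private counter per thread."""
    partials = [0] * num_threads  # one result slot per worker

    def worker(index: int) -> None:
        local = 0  # private copy, so no lock is needed while counting
        for _ in range(increments):
            local += 1
        partials[index] = local  # each worker writes only its own slot

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # The copies are combined only after every worker has finished.
    return sum(partials)
```

The duplicated counters cost a little extra memory, but no mutual-exclusion logic is needed in the loop, because no two threads ever write to the same location.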

References

1. Andrew Hunt and David Thomas, "The Pragmatic Programmer: From Journeyman to Master", Addison-Wesley, October 1999.

2. A discussion with the Authors of the Pragmatic Programmer

3. A good introduction to DRY

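4. Importance of DRY, http://stevesmithblog.com/blog/don-rsquo-t-repeat-yourself/

5. Read-Copy-Update, http://lse.sourceforge.net/locking/rcu/HOWTO/intro.html#WHATIS

6. A review on DRY in Social Communication, http://tagschema.com/blogs/tagschema/

7. The Abstraction Principle (Good background to DRY), http://en.wikipedia.org/wiki/Abstraction_principle_%28programming%29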