CSC/ECE 517 Fall 2011/ch1 1b sa: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
No edit summary
No edit summary
 
(78 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This page is about various collection frameworks in programming languages. A brief description of collections and collection framework is discussed and then, the history and evolution of collections is discussed. Later, we discuss how these collections are retrofitted into programming languages. And then, we discuss the advantages of [http://en.wikipedia.org/wiki/Java_collections_framework Collections Framework in Java] over its competing frameworks such as [http://en.wikipedia.org/wiki/Standard_Template_Library STL] and [http://www.cs.duke.edu/csl/docs/jgl/user/Overview.html#TOP JGL]. And then, we showcase some of the positives of having collections being built into the language than having them as class libraries.
== Collection ==
A collection<ref>[http://en.wikipedia.org/wiki/Collection_%28computing%29 All about Collections]</ref> is a unit that groups multiple data elements into a single unit. These collections are used for storing, retrieving, manipulating and communicating aggregate data. A dictionary (maps words to their meanings), playing cards (collection of cards) can be considered as examples of collections.
== Collection Framework in Programming languages ==
Collection is a framework that provides interfaces that are well defined for storing and manipulating groups of data as a single unit. These interfaces implement the data structures which are reusable. Though it is a framework, it works as a library. It makes sure that the programmer does not have to create too many interfaces.
== History ==
Collections were first introduced into programming languages, as a part of [http://en.wikipedia.org/wiki/Smalltalk SmallTalk] in the year 1980<ref>[http://web.cecs.pdx.edu/~harry/musings/SmalltalkOverview.html Smalltalk Overview]</ref>. This collection consisted of sets, bags, maps (called as dictionaries), lists with fixed (arrays) and variable size.<ref>[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.7900&rep=rep1&type=pdf William R. Cook, Interfaces and Specifications for the Smalltalk-80 Collection Classes]</ref>
Collections were later introduced into C++ as Standard Template Library (STL) in the year 1994 by Stepanov and Lee. The collections in this had containers (vectors, maps, lists, stacks, queues and sets), algorithms, iterators, functors.<ref>[http://en.wikipedia.org/wiki/Standard_Template_Library Standard Template Library]</ref>
Collections were then introduced into Java initially with the Java 2 platform, Standard Edition, version 1.2, which had synchronized versions of [http://en.wikipedia.org/wiki/Hash_table hash table] and vector. Later, an updated version of these concurrency utilities was included in JDK 5.0 as of JSR 166. This was named as Collections Framework. This framework, which basically works like a library, provides a convenient [http://en.wikipedia.org/wiki/Application_programming_interface Application Programming Interface (API)] to many of the [http://en.wikipedia.org/wiki/Abstract_data_type abstract data types] from computer science [http://en.wikipedia.org/wiki/Data_structure data structure] curriculum: maps, sets, lists, trees, arrays, hash tables and other collections.
Collections were introduced into Ruby in mid 1990s. These collections were built directly into the programming language. These collections have sets, arrays, hash tables and iterators.
== Retrofitting of Collections into Programming Languages ==
Now, let us look at how these collections were fitted into the programming languages. A brief description of how various data structures were implemented in programming languages is given below:
=== Sets ===
In this interface, the data need not be in sequence. But we have to note that duplicate elements are not allowed in [http://en.wikipedia.org/wiki/Set_(computer_science) sets]. In many of the languages sets are supported directly. In the rest they can be implemented by a hash table with dummy values. And we use only the keys in the hash table for representing the set. Some of the operations performed on sets are addition, removal or deletion and searching of elements.
=== Lists ===
The next data type is the [http://en.wikipedia.org/wiki/List_(computing) list]. In lists, the order of data or elements of the list is important and we can have duplicate data elements. We have two special types of lists- the queues and the stacks. If addition and deleting of elements in done from either end of the list, then it can be termed as a queue and if the operations are performed at the same end, then they can be termed as stacks. We have to note that, the order of the elements is not changed here. Some of the operations that can be performed on the lists are, adding and deleting data elements present at a particular location in the list, searching for a particular data element in the list and sorting of elements.
=== Bags ===
A [http://en.wikipedia.org/wiki/Multiset bag] is similar to set. There is no order of sequence for the data elements in bags. These are usually referred to as multi-sets. The difference between a bag and a set is that, duplicate data elements are allowed in bags. Some of the examples of operations performed on bags are adding and removing of data elements, finding out the count of a particular data element in the bags etc. By performing sorting operation, we can transform a bag into a list.
=== Trees ===
A [http://en.wikipedia.org/wiki/Tree_(data_structure) tree] basically has a root node and some elements or data related to them and those elements and data are further related to more data or elements. The relation among these elements is parent-child relation. So, every element of data has none or more children and every element except the root element, will have a parent element. Examples of operations on trees are the addition of data items so as to maintain a specific property of the tree to perform sorting, etc. and traversals to visit data items in a specific sequence. A tree used for sorting operations is usually called a heap. In most of the frameworks there is direct support for trees and in others, they can be implemented using lists and arrays. 
=== Maps ===
[http://en.wikipedia.org/wiki/Associative_array Maps] are special kind of sets. A map is a set of pairs, each pair representing a one-directional "mapping" from one element to another. These maps can be finite or infinite. Mathematically speaking, a map is just a collection of pairs. In the Collections Framework, however, the interfaces Map and Collection are distinct with no lineage in the hierarchy. Some of the operations performed by these Maps include altering, querying, and providing alternative views.
==Advantages of Collections in Java vs C++==
==Advantages of Collections in Java vs C++==
Let us see some of the features that makes Collections framework to rule over other competing frameworks such as STL and JGL.
Let us see some of the features that makes Collections framework to rule over other competing frameworks such as [http://www.sgi.com/tech/stl/stl_introduction.html STL] and [http://www.cs.duke.edu/csl/docs/jgl/user/Overview.html#TOP JGL].
Collections framework is a core Java framework used to provide various data structures and algorithms and an API to access them. STL is a similar framework which is supports C++ and JGL is the decedent of STL in Java.  
Collections framework is a core Java framework used to provide various data structures and algorithms and an API to access them. STL is a similar framework which is supports C++ and JGL is the decedent of STL in Java.  
[[File:collection_archi.gif]]        [[File:jgl_archi.gif]]
===Package Organization===
===Package Organization===
Packaging is one of the important factor that reveals the quality and amount of thought that was put on the developing the library.
 
[[File:package_structure.png]]      [[File:package_stats.png]]
 
 
Packaging is one of the important factor that reveals the quality and amount of thought that was put in developing a library.
The difference in the number of classes shows the major advantage of the Collection framework over the STL/JGL which makes the task simpler with less number of classes to know about.
The difference in the number of classes shows the major advantage of the Collection framework over the STL/JGL which makes the task simpler with less number of classes to know about.
===Iterator interface hierarchies===
===Iterator interface hierarchies===
[[File:jgliterator.png]]  [[File:coll_iterator.png]]
Table clearly shows the interface hierarchy in Collections is lot simpler than STL/JGL. The complex approach of STL often leads to developer’s headache due to numerous permutations of incompatible iterators.
Table clearly shows the interface hierarchy in Collections is lot simpler than STL/JGL. The complex approach of STL often leads to developer’s headache due to numerous permutations of incompatible iterators.
====Range view Selection====
 
Iterators in JGL point to elements. JGL uses these pointers to select a sub range of a container. When we add or remove an element, any iterator pointing to it gets invalidated. So, clearly, JGL doesn't let its iterators remove or add elements because such a capability would sabotage most of its range-based algorithms. More than half of the algorithms doesn’t support the operations on sub range of container instead they work on entire container wasting their time.
=====Range view Selection=====
In contrast iterators in Collections point between elements and have an easier approach for selecting sub range in a collection using indices of elements.  
Iterators in JGL point to elements. JGL uses these pointers to select a sub range of a container. When we add or remove an element, any iterator pointing to it gets invalidated. So, clearly, JGL doesn't let its iterators remove or add elements because such a capability would sabotage most of its range-based algorithms. More than half of the algorithms doesn’t support the operations on sub range of container instead they work on entire container wasting their time.<ref>[http://www.javaworld.com/javaworld/jw-01-1999/jw-01-jglvscoll.html?page=2 Laurence Vanhelsuw, The battle of container frameworks]</ref>
In contrast iterators in Collections point between elements and have an easier approach for selecting sub range in a collection using indices of elements.
 
===Synchronization===
===Synchronization===
In the view of multithreaded environment, all the STL/JGL containers are synchronized which means operations cannot be parallelized and is a certain issue of overhead and performance. Whereas Collections are not synchronized by default but if required provides a thread safe wrapper around the unsynchronized collection.
In the view of multi threaded environment, all the STL/JGL containers are [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 synchronized] which means operations cannot be paralleled and is a certain issue of overhead and performance. Whereas Collections are not synchronized by default but if required provides a [http://en.wikipedia.org/wiki/Thread_safety thread safe] wrapper around the unsynchronized collection.
 
===Implementation of Collection/Container entities===
===Implementation of Collection/Container entities===
====ArrayList====
====ArrayList====
ArrayList is a resizable form of an Array. Equivalent of ArrayList in STL is vector. When ArrayList constructor ArrayList(int n) is invoked, it creates an empty container and pre allocated space for n elements. So when size method is invoked at this point it returns 0. When a similar constructor of Vector is invoked it creates a container of n elements initialized to 0. So when a size method is invoked it returns n. Adding an element at the beginning in ArrayList takes less time than vector because STL’s vector must shift all the n elements before adding a new element.
 
C++ vectors does not support in-place reallocation of memory i.e upon reallocation memory held by vector is copied to a new location by the element’s copy constructor and later releasing the memory which is an inefficient process.
ArrayList is a resizable form of an Array. Equivalent of ArrayList in STL is vector. When ArrayList constructor ArrayList(int n) is invoked, it creates an empty container and pre allocated space for n elements. So when size method is invoked at this point it returns 0. When a similar constructor of [http://en.wikipedia.org/wiki/Vector_%28C%2B%2B%29 Vector] is invoked it creates a container of n elements initialized to 0. So when a size method is invoked it returns n. Adding an element at the beginning in ArrayList takes less time than vector because STL’s vector must shift all the n elements before adding a new element.
*example*
C++ vectors does not support in-place reallocation of memory i.e upon reallocation memory held by vector is copied to a new location by the element’s copy constructor and later releasing the memory which is an inefficient process.<ref>[http://en.wikipedia.org/wiki/Vector_%28C%2B%2B%29 Vectors in C++]</ref>
 
[[File:empty.png]]
 
ArrayList in Collections Framework after default [http://en.wikipedia.org/wiki/Constructor_%28object-oriented_programming%29 constructor] call
 
[[File:all_filled_stl.png]]
 
vector in STL after default constructor call
 
[[File:1_filling_stl.png]]
 
Adding new element 1 by shifting all elements in vector
 
[[File:1_filled_stl.png]]
 
After adding 1 in vector
 
====Maps====
====Maps====
Maps in Collections framework are internally implemented using arrays. So time taken for accessing, removing a key or checking whether the key is present in the map would be done in the order O(1) i.e in constant time. Initially maps are initiated to a desired size or default size of 10, so until the resizing of the map it would take constant time i.e O(1) and if size is exceeded it would take linear time in the order of input to resize the map i.e O(n).   
Maps in Collections framework are internally implemented using arrays. So time taken for accessing, removing a key or checking whether the key is present in the map would be done in the [http://en.wikipedia.org/wiki/Big_O_notation order O(1)] i.e in constant time. Initially maps are initiated to a desired size or default size of 10, so until the resizing of the map it would take constant time i.e O(1) and if size is exceeded it would take linear time in the order of input to resize the map i.e O(n).<ref>[http://www.sharmanj.com/Collections.pdf Sharad Ballepu,Collections – What Happens Within]</ref>  
Maps in STL/JGL are internally implemented using self-balancing binary search trees (AVL or Red Black trees). So time taken for accessing, adding, removing a key or checking whether the key is present in the map would be done in the order O(log n).  
Maps in STL/JGL are internally implemented using [http://en.wikipedia.org/wiki/Self-balancing_binary_search_tree self-balancing binary search trees] (AVL or Red Black trees). So time taken for accessing, adding, removing a key or checking whether the key is present in the map would be done in the order O(log n).<ref>[http://en.wikipedia.org/wiki/Map_%28C%2B%2B%29 Maps in C++]</ref>
*Table here *
====Sets====
Sets are internally represented using binary search trees in STL so the operations such as adding, removing an element, checking whether an element is there or not can be performed in the order of O(log n).
http://en.wikipedia.org/wiki/Set_%28computer_science%29
Sets in CF are implemented as HashSet, TreeSet, SortedSet through maps, binary search trees. HashSet implementation of set can perform set operations such as adding, removing an element, checking whether an element is there or not can be performed in the order of O(1).
http://www.cis.usouthal.edu/~hain/general/Publications/oopsla-paper6.pdf
Set operations such as set union, difference and intersection in STL takes more space and time than in Collections framework. Because when these operations are performed on two sets, result is stored in the first set in CF whereas a new set is created and stored in it.
// Create the sets
Set set1 = new HashSet();
Set set2 = new HashSet();


// Add elements to the sets ...
{| class="wikitable"
|-
! Operation !! STL      !! Collections
|-
| Searching for an element || O(log n)  || O(1)
|-
| Inserting a new element || O(log n)  || O(1)**
|-
| Incrementing/decrementing an iterator || O(log n)  || O(h/n)*
|-
| Removing a single map element || O(log n)  || O(1)
|-
|-
| Copying an entire map || O(n)  || O(n)
|-
| Iterating through all elements || O(n)  || O(n)
|}


// Copy all the elements from set2 to set1 (set1 += set2)
* Where h is table size.
// set1 becomes the union of set1 and set2
** It takes O(n) when map size need to be resized.
set1.addAll(set2);


// Remove all the elements in set1 from set2 (set1 -= set2)
====Sets====
// set1 becomes the asymmetric difference of set1 and set2
Sets are internally represented using binary search trees in STL so the operations such as adding, removing an element, checking whether an element is there or not can be performed in the order of O(log n).<ref>[http://en.wikipedia.org/wiki/Set_%28computer_science%29 Sets implementation in C++]</ref>
set1.removeAll(set2);


// Get the intersection of set1 and set2
Sets in CF are implemented as [http://download.oracle.com/javase/6/docs/api/java/util/HashSet.html HashSet], [http://download.oracle.com/javase/6/docs/api/java/util/TreeSet.html TreeSet], [http://download.oracle.com/javase/6/docs/api/java/util/SortedSet.html SortedSet] through maps, binary search trees. HashSet implementation of set can perform set operations such as adding, removing an element, checking whether an element is there or not can be performed in the order of O(1).<ref>[http://download.oracle.com/javase/tutorial/collections/implementations/set.html Sets implementation in Java]</ref>
// set1 becomes the intersection of set1 and set2
set1.retainAll(set2);


// Remove all elements from a set
Set operations such as set union, difference and intersection in STL takes more space and time than in Collections framework. Because when these operations are performed on two sets, result is stored in the first set in CF whereas a new set is created and stored in it.<ref>[http://www.cis.usouthal.edu/~hain/general/Publications/oopsla-paper6.pdf Sean Wentworth, David Langan, Thomas Hain,An Empirical Analysis of the Java Collections Framework Versus the C++ Standard Template Library]</ref>
set1.clear();


http://www.exampledepot.com/egs/java.util/coll_SetOps.html
// Create the sets
Set set1 = new HashSet();
Set set2 = new HashSet();
// Add elements to the sets ...
// Copy all the elements from set2 to set1 (set1 += set2)
// set1 becomes the union of set1 and set2
set1.addAll(set2);
// Remove all the elements in set1 from set2 (set1 -= set2)
// set1 becomes the asymmetric difference of set1 and set2
set1.removeAll(set2);
// Get the intersection of set1 and set2
// set1 becomes the intersection of set1 and set2
set1.retainAll(set2);
// Remove all elements from a set
set1.clear();


===Compilation and executable size===
===Compilation and executable size===
STL is majorly dependent on templates. As compilers generate additional amount of code for each template type, indiscriminate usage of templates leads to code bloat which results in generation of large executable files.  
STL is majorly dependent on templates. As compilers generate additional amount of code for each template type, indiscriminate usage of templates leads to [http://en.wikipedia.org/wiki/Code_bloat code bloat] which results in generation of large executable files.<ref>[http://en.wikipedia.org/wiki/Template_(programming) Templates in C++]</ref>
 
===Readability===
===Readability===
Code tends to lose its readability when there is a need of lot of new classes to perform a task and if this task involves templates it even makes the code complex and virtually unmaintainable.
Code tends to lose its readability when there is a need of lot of new classes to perform a task and if this task involves templates it even makes the code complex and virtually unmaintainable.
In contrast readability in Java using Collections framework is far better as it performs the task with a minimal set of classes and interfaces.
In contrast readability in Java using Collections framework is far better as it performs the task with a minimal set of classes and interfaces.
===Debugging===
===Debugging===
STL has far less support for debugging when compared to Java’s Collection framework. The error messages generated by the template code are far from being readable; one has to get acquainted with these to understand them.  
STL has far less support for debugging when compared to Java’s Collection framework. The error messages generated by the template code are far from being readable; one has to get acquainted with these to understand them.
 
===It's either in the core, or it isn't===
===It's either in the core, or it isn't===
One of the big advantages of Collections framework over its competing frameworks like STL, JGL which would dwarf any advantage is it is a product of Sun and has been given core status. So any new user of one of these frameworks would certainly go for the one which has “brand” tag attached it.
One of the big advantages of Collections framework over its competing frameworks like STL, JGL which would dwarf any advantage is it is a product of Sun and has been given core status. So any new user of one of these frameworks would certainly go for the one which has “brand” tag attached to it.
 
== Advantages of Collections in Ruby over Collections Frameworks in Java ==
This column basically discusses about the advantages of having collections built into the languages, like in Ruby<ref>[http://en.wikipedia.org/wiki/Ruby_%28programming_language%29 Introduction to Ruby(Programming Language)]</ref>, over languages where collections are used as [http://en.wikipedia.org/wiki/Library_(computing) class libraries], like [http://en.wikipedia.org/wiki/Java_collections_framework Java Collections Framework] in Java.
 
The advantage of having collections built into programming language like in Ruby over Java, where the collections are present as class libraries is that, we can write the equivalent code in fewer lines. This will lead to further advantages, both in terms of speed of development and in terms of bug-fixing.
An example of this is eliminating duplicates in a list, which is discussed below.
 
Since the collections are inbuilt in Ruby, all the methods of class libraries can be directly used as operators with the objects. Whereas in Java, where collections are defined in class libraries, we need to create an instance of the collections we want to use in our class and call the methods defined in the collections class to perform any operation related to that collection.
So, for example, if we want to find unique elements in an array, in Ruby, we can directly use an operator ‘uniq’ as collections are built into the language. In Java, we can achieve this task either by checking through the array manually or we can create an instance of Set and use it to find the unique elements from an arraylist. 
 
  In Ruby:
  a = ["a", "a", "b", "b", "c"]
  a.uniq  #=> ["a", "b", "c"]
 
 
  In Java: 
  public static <T> void removeDuplicates(ArrayList<T> list) {
  int size = list.size();
  int out = 0;
  final Set<T> encountered = new HashSet<T>();
  for (int in = 0; in < size; in++) {
          final T t = list.get(in);
          final boolean first = encountered.add(t);
          if (first) {
              list.set(out++, t);
          }
      }
      while (out < size) {
          list.remove(--size);
      }
  }
 
Speaking in terms of a programmer’s knowledge, Ruby’s inbuilt collections feature is advantageous when compared to Java. For instance, in Java, to achieve a particular functionality, depending on the functionality, sometimes, we need to create instances of other class objects. In that case, we need to understand the functionality of those classes. But in case of Ruby, all the operations and methods are inbuilt, so they can be directly used as operators of objects. For example, in Java, if we want to sort elements in an array, the programmer needs to have idea about the class which is to be implemented here i.e., the ArrayList, since arrays in Java are not sorted and it has to be done by calling another class(ArrayList) where this method is implemented. In case of Ruby, since the collections are inbuilt, the arrays can be sorted just by calling the method ‘sort’.
 
  a = ["d", "a", "e", "c", "b"]
  a.sort                    #=> ["a", "b", "c", "d", "e"]
  a.sort {|x, y| y <=> x}  #=> ["e", "d", "c", "b", "a"]
 
 
One more advantage of having in-built collections in a programming language like Ruby is that it handles the dependency problem. Because, if we have a situation where the software depended on library X, and library X depended on library Y and your software also depended on library Y, but then, it needs a different version of Y, then in such a scenario, Ruby is smart enough to keep those dependencies separate.
 
Another advantage with Ruby in terms of having in-built collections is that, we get better checking and [http://en.wikipedia.org/wiki/Code_refactoring refactoring] support, than what we get from static types and compilation.
 
=== Implementation of Collection ===
Here, we are going to compare the implementation of the following collection entities in Ruby with their implementation in Java.
 
==== Sets ====
The advantage of Sets (which is a collection) implemented in Ruby over sets used in Java is that, in Ruby, we can add any type of data element to the set<ref>[http://www.ruby-doc.org/stdlib/libdoc/set/rdoc/classes/Set.html Sets implementation in Ruby]</ref>. Whereas in Java, we need to add only data elements of the type defined while defining the set. Let us look at the following example:
 
  In Ruby:
  require 'set'
  s1 = Set.new [1, 2]                  # -> #<Set: {1, 2}>
  s1.add("abc")                        # -> #<Set: {1, 2, "abc"}>
 
So, from this example we can see that, we can add data elements of different type to a set in Ruby. Now, let us look at the example of a set in Java.
 
  In Java:
  public static void main(String[] args)
    {
      Set<String>  someSet = new HashSet<String>();
          someSet.add(asd);
          someSet.add(bar);
    }
So here, we can observe that, while defining a set, we are also mentioning the type, so here we are creating a set which can hold data elements that are strings. So, if we try to add non-string elements, they will be considered as ineligible elements.
 
==== Arrays ====
Having collections like arrays built into the language can be advantageous. For instance, in Ruby, we can increase the size of the array dynamically and automatically<ref>[http://www.troubleshooters.com/codecorn/ruby/basictutorial.htm#_Arrays Implementation of Arrays in Ruby]</ref>. For instance, let us consider the following example. Consider an array a = [1, 2, 3];
So, we can add another element in a[3] without incrementing the size of the array initially. That is,
a = [1, 2, 3];  #starting with an array
a[3] = 4; #Add a fourth element to it: a is [1, 2, 3, 4].
This is not the case in Java. In Java, we cannot change the size of an array. An alternative is to instantiate an object of ArrayList from Collections Framework, which can contain only non-primitive elements or objects.
 
==== Iterators ====
Ruby implements [http://en.wikipedia.org/wiki/Iterator iterators] quite differently; all iterations are done by means of passing [http://en.wikipedia.org/wiki/Callback_%28computer_programming%29 callback] closures to container methods - this way, Ruby not only implements basic iteration but also several patterns of iteration like function mapping, filters and reducing. Ruby also supports an alternative syntax for the basic iterating method ‘each’. This is because, Ruby has collections built into the language itself.<ref>[http://en.wikipedia.org/wiki/Iterator#Ruby Iterators in Ruby]</ref>
One of the major advantages of collections in Ruby over collections frameworks used in Java is the Iterators. Iterators in Ruby is one of the most known and powerful feature of the language. It works efficiently with blocks and the yield method. We have many variants and options available, the most important fact is that you can pass a block of code to be executed as a callback. No need for interfaces or template methods like in Java.
 
== References ==
<references/>
 
 
== Further Reading ==
[http://java.sun.com/developer/onlineTraining/collections/Collection.html 1: Collections framework API]
 
[http://www.ruby-doc.org/ 2: Documentation about Ruby]
 
[http://books.google.com/books?hl=en&lr=&id=jcUbTcr5XWwC&oi=fnd&pg=PR5&dq=ruby+programming&ots=fIBlwi3wcz&sig=a1KD3qk6S6WfX33q6t3e4YWiJYM#v=onepage&q&f=false 3: The Ruby Programming Language, Authors: David Flanagan & Yukihiro Matsumoto]

Latest revision as of 23:55, 25 September 2011

This page is about various collection frameworks in programming languages. A brief description of collections and collection framework is discussed and then, the history and evolution of collections is discussed. Later, we discuss how these collections are retrofitted into programming languages. And then, we discuss the advantages of Collections Framework in Java over its competing frameworks such as STL and JGL. And then, we showcase some of the positives of having collections being built into the language than having them as class libraries.

Collection

A collection<ref>All about Collections</ref> is a unit that groups multiple data elements into a single unit. These collections are used for storing, retrieving, manipulating and communicating aggregate data. A dictionary (maps words to their meanings), playing cards (collection of cards) can be considered as examples of collections.

Collection Framework in Programming languages

Collection is a framework that provides interfaces that are well defined for storing and manipulating groups of data as a single unit. These interfaces implement the data structures which are reusable. Though it is a framework, it works as a library. It makes sure that the programmer does not have to create too many interfaces.

History

Collections were first introduced into programming languages, as a part of SmallTalk in the year 1980<ref>Smalltalk Overview</ref>. This collection consisted of sets, bags, maps (called as dictionaries), lists with fixed (arrays) and variable size.<ref>William R. Cook, Interfaces and Specifications for the Smalltalk-80 Collection Classes</ref> Collections were later introduced into C++ as Standard Template Library (STL) in the year 1994 by Stepanov and Lee. The collections in this had containers (vectors, maps, lists, stacks, queues and sets), algorithms, iterators, functors.<ref>Standard Template Library</ref> Collections were then introduced into Java initially with the Java 2 platform, Standard Edition, version 1.2, which had synchronized versions of hash table and vector. Later, an updated version of these concurrency utilities was included in JDK 5.0 as of JSR 166. This was named as Collections Framework. This framework, which basically works like a library, provides a convenient Application Programming Interface (API) to many of the abstract data types from computer science data structure curriculum: maps, sets, lists, trees, arrays, hash tables and other collections. Collections were introduced into Ruby in mid 1990s. These collections were built directly into the programming language. These collections have sets, arrays, hash tables and iterators.

Retrofitting of Collections into Programming Languages

Now, let us look at how these collections were fitted into the programming languages. A brief description of how various data structures were implemented in programming languages is given below:

Sets

In this interface, the data need not be in sequence. But we have to note that duplicate elements are not allowed in sets. In many of the languages sets are supported directly. In the rest they can be implemented by a hash table with dummy values. And we use only the keys in the hash table for representing the set. Some of the operations performed on sets are addition, removal or deletion and searching of elements.

Lists

The next data type is the list. In lists, the order of data or elements of the list is important and we can have duplicate data elements. We have two special types of lists- the queues and the stacks. If addition and deleting of elements in done from either end of the list, then it can be termed as a queue and if the operations are performed at the same end, then they can be termed as stacks. We have to note that, the order of the elements is not changed here. Some of the operations that can be performed on the lists are, adding and deleting data elements present at a particular location in the list, searching for a particular data element in the list and sorting of elements.

Bags

A bag is similar to set. There is no order of sequence for the data elements in bags. These are usually referred to as multi-sets. The difference between a bag and a set is that, duplicate data elements are allowed in bags. Some of the examples of operations performed on bags are adding and removing of data elements, finding out the count of a particular data element in the bags etc. By performing sorting operation, we can transform a bag into a list.

Trees

A tree basically has a root node and some elements or data related to them and those elements and data are further related to more data or elements. The relation among these elements is parent-child relation. So, every element of data has none or more children and every element except the root element, will have a parent element. Examples of operations on trees are the addition of data items so as to maintain a specific property of the tree to perform sorting, etc. and traversals to visit data items in a specific sequence. A tree used for sorting operations is usually called a heap. In most of the frameworks there is direct support for trees and in others, they can be implemented using lists and arrays.

Maps

Maps are special kind of sets. A map is a set of pairs, each pair representing a one-directional "mapping" from one element to another. These maps can be finite or infinite. Mathematically speaking, a map is just a collection of pairs. In the Collections Framework, however, the interfaces Map and Collection are distinct with no lineage in the hierarchy. Some of the operations performed by these Maps include altering, querying, and providing alternative views.


Advantages of Collections in Java vs C++

Let us see some of the features that makes Collections framework to rule over other competing frameworks such as STL and JGL. Collections framework is a core Java framework used to provide various data structures and algorithms and an API to access them. STL is a similar framework which is supports C++ and JGL is the decedent of STL in Java.

Package Organization


Packaging is one of the important factor that reveals the quality and amount of thought that was put in developing a library. The difference in the number of classes shows the major advantage of the Collection framework over the STL/JGL which makes the task simpler with less number of classes to know about.

Iterator interface hierarchies

Table clearly shows the interface hierarchy in Collections is lot simpler than STL/JGL. The complex approach of STL often leads to developer’s headache due to numerous permutations of incompatible iterators.

Range view Selection

Iterators in JGL point to elements. JGL uses these pointers to select a sub range of a container. When we add or remove an element, any iterator pointing to it gets invalidated. So, clearly, JGL doesn't let its iterators remove or add elements because such a capability would sabotage most of its range-based algorithms. More than half of the algorithms doesn’t support the operations on sub range of container instead they work on entire container wasting their time.<ref>Laurence Vanhelsuw, The battle of container frameworks</ref> In contrast iterators in Collections point between elements and have an easier approach for selecting sub range in a collection using indices of elements.

Synchronization

In the view of multi threaded environment, all the STL/JGL containers are synchronized which means operations cannot be paralleled and is a certain issue of overhead and performance. Whereas Collections are not synchronized by default but if required provides a thread safe wrapper around the unsynchronized collection.

Implementation of Collection/Container entities

ArrayList

ArrayList is a resizable form of an Array. Equivalent of ArrayList in STL is vector. When ArrayList constructor ArrayList(int n) is invoked, it creates an empty container and pre allocated space for n elements. So when size method is invoked at this point it returns 0. When a similar constructor of Vector is invoked it creates a container of n elements initialized to 0. So when a size method is invoked it returns n. Adding an element at the beginning in ArrayList takes less time than vector because STL’s vector must shift all the n elements before adding a new element. C++ vectors does not support in-place reallocation of memory i.e upon reallocation memory held by vector is copied to a new location by the element’s copy constructor and later releasing the memory which is an inefficient process.<ref>Vectors in C++</ref>

ArrayList in Collections Framework after default constructor call

vector in STL after default constructor call

Adding new element 1 by shifting all elements in vector

After adding 1 in vector

Maps

Maps in Collections framework are internally implemented using arrays. So time taken for accessing, removing a key or checking whether the key is present in the map would be done in the order O(1) i.e in constant time. Initially maps are initiated to a desired size or default size of 10, so until the resizing of the map it would take constant time i.e O(1) and if size is exceeded it would take linear time in the order of input to resize the map i.e O(n).<ref>Sharad Ballepu,Collections – What Happens Within</ref> Maps in STL/JGL are internally implemented using self-balancing binary search trees (AVL or Red Black trees). So time taken for accessing, adding, removing a key or checking whether the key is present in the map would be done in the order O(log n).<ref>Maps in C++</ref>

Operation STL Collections
Searching for an element O(log n) O(1)
Inserting a new element O(log n) O(1)**
Incrementing/decrementing an iterator O(log n) O(h/n)*
Removing a single map element O(log n) O(1)
Copying an entire map O(n) O(n)
Iterating through all elements O(n) O(n)
* Where h is table size.
** It takes O(n) when map size need to be resized.

Sets

Sets are internally represented using binary search trees in STL so the operations such as adding, removing an element, checking whether an element is there or not can be performed in the order of O(log n).<ref>Sets implementation in C++</ref>

Sets in CF are implemented as HashSet, TreeSet, SortedSet through maps, binary search trees. HashSet implementation of set can perform set operations such as adding, removing an element, checking whether an element is there or not can be performed in the order of O(1).<ref>Sets implementation in Java</ref>

Set operations such as set union, difference and intersection in STL takes more space and time than in Collections framework. Because when these operations are performed on two sets, result is stored in the first set in CF whereas a new set is created and stored in it.<ref>Sean Wentworth, David Langan, Thomas Hain,An Empirical Analysis of the Java Collections Framework Versus the C++ Standard Template Library</ref>

// Create the sets
Set set1 = new HashSet();
Set set2 = new HashSet();
// Add elements to the sets ...
// Copy all the elements from set2 to set1 (set1 += set2)
// set1 becomes the union of set1 and set2
set1.addAll(set2);
// Remove all the elements in set1 from set2 (set1 -= set2)
// set1 becomes the asymmetric difference of set1 and set2
set1.removeAll(set2);
// Get the intersection of set1 and set2
// set1 becomes the intersection of set1 and set2
set1.retainAll(set2);
// Remove all elements from a set
set1.clear();

Compilation and executable size

STL is majorly dependent on templates. As compilers generate additional amount of code for each template type, indiscriminate usage of templates leads to code bloat which results in generation of large executable files.<ref>Templates in C++</ref>

Readability

Code tends to lose its readability when there is a need of lot of new classes to perform a task and if this task involves templates it even makes the code complex and virtually unmaintainable. In contrast readability in Java using Collections framework is far better as it performs the task with a minimal set of classes and interfaces.

Debugging

STL has far less support for debugging when compared to Java’s Collection framework. The error messages generated by the template code are far from being readable; one has to get acquainted with these to understand them.

It's either in the core, or it isn't

One of the big advantages of Collections framework over its competing frameworks like STL, JGL which would dwarf any advantage is it is a product of Sun and has been given core status. So any new user of one of these frameworks would certainly go for the one which has “brand” tag attached to it.

Advantages of Collections in Ruby over Collections Frameworks in Java

This column basically discusses about the advantages of having collections built into the languages, like in Ruby<ref>Introduction to Ruby(Programming Language)</ref>, over languages where collections are used as class libraries, like Java Collections Framework in Java.

The advantage of having collections built into programming language like in Ruby over Java, where the collections are present as class libraries is that, we can write the equivalent code in fewer lines. This will lead to further advantages, both in terms of speed of development and in terms of bug-fixing. An example of this is eliminating duplicates in a list, which is discussed below.

Since the collections are inbuilt in Ruby, all the methods of class libraries can be directly used as operators with the objects. Whereas in Java, where collections are defined in class libraries, we need to create an instance of the collections we want to use in our class and call the methods defined in the collections class to perform any operation related to that collection. So, for example, if we want to find unique elements in an array, in Ruby, we can directly use an operator ‘uniq’ as collections are built into the language. In Java, we can achieve this task either by checking through the array manually or we can create an instance of Set and use it to find the unique elements from an arraylist.

 In Ruby:
 a = ["a", "a", "b", "b", "c"]
 a.uniq   #=> ["a", "b", "c"]
 
 In Java:  
 public static <T> void removeDuplicates(ArrayList<T> list) {
 int size = list.size(); 
 int out = 0;
 final Set<T> encountered = new HashSet<T>();
 for (int in = 0; in < size; in++) {
          final T t = list.get(in);
          final boolean first = encountered.add(t);
          if (first) {
              list.set(out++, t);
          }
     }
     while (out < size) {
          list.remove(--size);
     }
 }

Speaking in terms of a programmer’s knowledge, Ruby’s inbuilt collections feature is advantageous when compared to Java. For instance, in Java, to achieve a particular functionality, depending on the functionality, sometimes, we need to create instances of other class objects. In that case, we need to understand the functionality of those classes. But in case of Ruby, all the operations and methods are inbuilt, so they can be directly used as operators of objects. For example, in Java, if we want to sort elements in an array, the programmer needs to have idea about the class which is to be implemented here i.e., the ArrayList, since arrays in Java are not sorted and it has to be done by calling another class(ArrayList) where this method is implemented. In case of Ruby, since the collections are inbuilt, the arrays can be sorted just by calling the method ‘sort’.

  a = ["d", "a", "e", "c", "b"]
  a.sort                    #=> ["a", "b", "c", "d", "e"]
  a.sort {|x, y| y <=> x}   #=> ["e", "d", "c", "b", "a"]


One more advantage of having in-built collections in a programming language like Ruby is that it handles the dependency problem. Because, if we have a situation where the software depended on library X, and library X depended on library Y and your software also depended on library Y, but then, it needs a different version of Y, then in such a scenario, Ruby is smart enough to keep those dependencies separate.

Another advantage with Ruby in terms of having in-built collections is that, we get better checking and refactoring support, than what we get from static types and compilation.

Implementation of Collection

Here, we are going to compare the implementation of the following collection entities in Ruby with their implementation in Java.

Sets

The advantage of Sets (which is a collection) implemented in Ruby over sets used in Java is that, in Ruby, we can add any type of data element to the set<ref>Sets implementation in Ruby</ref>. Whereas in Java, we need to add only data elements of the type defined while defining the set. Let us look at the following example:

 In Ruby: 
 require 'set'
 s1 = Set.new [1, 2]                   # -> #<Set: {1, 2}>
 s1.add("abc")                         # -> #<Set: {1, 2, "abc"}>

So, from this example we can see that, we can add data elements of different type to a set in Ruby. Now, let us look at the example of a set in Java.

 In Java:
 public static void main(String[] args)
   {
     Set<String>  someSet = new HashSet<String>();
          someSet.add(asd);
          someSet.add(bar);
   }

So here, we can observe that, while defining a set, we are also mentioning the type, so here we are creating a set which can hold data elements that are strings. So, if we try to add non-string elements, they will be considered as ineligible elements.

Arrays

Having collections like arrays built into the language can be advantageous. For instance, in Ruby, we can increase the size of the array dynamically and automatically<ref>Implementation of Arrays in Ruby</ref>. For instance, let us consider the following example. Consider an array a = [1, 2, 3]; So, we can add another element in a[3] without incrementing the size of the array initially. That is, a = [1, 2, 3]; #starting with an array a[3] = 4; #Add a fourth element to it: a is [1, 2, 3, 4]. This is not the case in Java. In Java, we cannot change the size of an array. An alternative is to instantiate an object of ArrayList from Collections Framework, which can contain only non-primitive elements or objects.

Iterators

Ruby implements iterators quite differently; all iterations are done by means of passing callback closures to container methods - this way, Ruby not only implements basic iteration but also several patterns of iteration like function mapping, filters and reducing. Ruby also supports an alternative syntax for the basic iterating method ‘each’. This is because, Ruby has collections built into the language itself.<ref>Iterators in Ruby</ref> One of the major advantages of collections in Ruby over collections frameworks used in Java is the Iterators. Iterators in Ruby is one of the most known and powerful feature of the language. It works efficiently with blocks and the yield method. We have many variants and options available, the most important fact is that you can pass a block of code to be executed as a callback. No need for interfaces or template methods like in Java.

References

<references/>


Further Reading

1: Collections framework API

2: Documentation about Ruby

3: The Ruby Programming Language, Authors: David Flanagan & Yukihiro Matsumoto