Chapter 1
Introduction 1.1 Introduction 1.2 Basic Terminology: Elementary Data Organization 1.3 Types of Data structutes 1.4 Data Structure operations 1.5 Algorithm Complexity and 1.6 Time-Space trade-off 1.1 Introduction: In a beginning programming course, variables are introduced that store a single datum. In all but trivial programs, however, we need to store multiple related data items for convenience. For example, data items like students records many need to be stored under a single name for ease of processing. Such convenient structuring of data is called data organization: a container for the data is called data structure. A data structure is a means of organizing data in primary memory in a form that is convenient to process by a program. In the definition of data structure, structure means a set of rules that holds the data together. In other words, if we take a group of data and fit them into a structure such that we can define its relating rules, we have made a data structure. Data is stored either in main memory or in secondary memory. In order to represent data we need some models. The different models (logical or mathematical) to represent/organize/store data in the main memory are together referred to as data structures. The data structure is collection of data organized in a specific manner in computer’s main memory. The data structure is a way Abstract Data Type (ADT) can be implemented. In the beginning of computer programming, if we wanted to read a file , we wrote the code to read the physical file device. It did not take long to realize that we were writing the same code over and over again. So, we created what is known as abstract data type(ADT). We wrote the code to read the file and placed in a library for all programmers to use. An abstract data type is a data declaration packaged together with the operations that are meaningful for the data type. In other words, ADT is the encapsulation of the data and the operations on the data in such a way that they are hidden from the users. ADT consists of a set of definitions that allow programmers to use functions while hiding the implementation. With an ADT, users are not concerned with how the task is done but rather with what it can do. Suppose we have to solve a problem that requires queue data structure. The queue data structure is not available in programming languages. There are two ways to solve this problem: One way is to write a program that simulates the queue we need. But this is good only for one application. Another way is to write a queue ADT that can be used to solve any queue problem. If we choose the latter way, the other programmers writing other applications can concentrate on the applications rather than writing the queue data structure. The different models (logical or mathematical) to represent/organize/store data in the secondary memory are called file structures or data storage structures. A data structure could be a programming construct provided in a language or it can be defined by the programmer. Example of data structure include: Arrays, Linked lists, Stacks, Queues, Trees, Graphs. Data structures are applied in sorting, searching, hash tables, graph algorithms, pattern matching, data compressing etc. The content of this course has not changed much in the last two decades except for inclusion of some new algorithms. Good programming and problem solving requires knowledge of data structures. Without a sufficient understanding of data structures, the level at which a programmer can work will be severely limited. Motivation for Data Storage Structures: • The main purpose for computer systems is to execute programs. • These programs together with the data they act upon, must be stored in main memory during execution. • Ideally, we would want the programs and data to reside in the main memory permanently. • However, this is not possible because • the main memory is usually too small to store all needed data permanently. • the main memory is a volatile storage device that looses its content when power is turned off or lost. • Since large data sets cannot fit into main memory, secondary storage structures are necessary to handle such data. • Magnetic disk is the most common form of secondary storage in computer systems. • With storage structures, a small portion of the data is kept in primary storage, and additional data is read from secondary storage as needed. • A data storage structures is a means of organizing data as blocks in secondary memory that optimizes I/O read/write operations. • Data storage structures include: Sequential files, Random access files, Indexed files, B-Trees, B*-Trees. 1.2 Basic Terminology: Elementary data organization Data: The term data means a value or set of values. For example, marks of students, figures obtained during exit polls etc. Data item:- A data item means a single unit of values. For example, roll number, name, address etc. Entity: Entity is something that has certain qualities, characteristics, properties or attributes that may contain some values. For example, Student is an entity. The attributes of student may be roll number, name, address, etc,. The values of these attributes may be 100, Ram, House No:133-A, Pragati Vihar, Dehradun. Entity Set: An entity set is a group of or set of similar entities. For example, employees of an organization, students of a class etc. Information: When the data are processed by applying certain rules, new processed data is called information. The data are not useful for decision marking whereas information is useful for decision making. Field: File is a single elementary unit of information representing an attribute of an entity. For example, 1, 2, 3, 4….etc are represented by a single unit called roll number field. Record: Record is a collection of field values of a given entity. For example, roll number, name, address etc of a particular student File: File is a collection of records of the entities in a given entity set. For example, file containing records of students of a particular class. Key: A key is one or more field(s) in a record that take(s) unique values and can be used to distinguish one record from the others. There are three cases: Case 1: When more than one field may have unique values. In that case, there exist multiple keys, but at a time we use only one field as a key, called primary key. The other key(s) are called as alternate key(s). Case 2: There is no field that has unique values. Then a combination of two or more fields can be used to form a key. Such a key is called composite key. Case 3: There is no possibility of forming a key from the within the record. Then an extra field can be added to the record that can be used as a key. 1.3: Types of data structures (i) Linear vs Non-Linear Data structures:- Linear data structures: Data items form a sequence. For examples:-arrays, linked lists, stacks, queues. Non-Linear Data structures: Data items do not form a sequence. For examples:- trees, graphs. (ii) Homogeneous vs Non-Homogeneous Data structures Homogeneous Data structures: All data items are of same type.For example, arrays. Non-Homogeneous:- All data items are may or may not be similar. For example, record or structure. 1.4 Data structure operations:- The common data structures that can be performed on different data structures are:- (i) Traversal: Accessing each data item of data structure in order to process it is called traversal. (ii) Searching: Finding the location of a given data item. (iii) Insertion: Adding new data item to the data structure. (iv) Deletion: Removing an existing data item from a data structure. (v) Sorting:- Arranging the data items in some logical order(Ascending or descending order). (vi) Merging:- Combining the elements of two similar sorted data structures into a single structure. 1.5: Algorithm Complexity: The time-space trade-off Algorithm A well-defined computational procedure that takes some value, or a set of values, as input and produces some value, or a set of values, as output. Sequence of computational steps that transform the input into the output. Sequence of unambiguous instructions for solving a problem, i.e., for obtaining a required output for any legitimate input in a finite amount of time. An algorithm can be expressed in three ways:- (i) in any natural language such as English, called pseudocode. (ii) in a programming language or (iii) in the form of a flowchart. Complexity of algorithm is a function of size of input of a given problem instance which determines how much running time/memory space is needed by the algorithm in order to run to completion. Time Complexity: Time complexity of an algorithm is the amount of time it needs in order to run to completion. Space Complexity: Space Complexity of an algorithm is the amount of space it needs in order to run to completion. There are two points which we should consider about computer programming:- (i) An appropriate data structure and (ii) An appropriate algorithm. There may be more than one algorithm to solve a particular problem. E.g.: sorting n numbers: 106 numbers Friend’s computer = 109 instructions/second Friend’s algorithm = 2n2 instructions Your computer = 107 instructions/second Your algorithm = 50nlgn instructions Your friend = You = 20 times better!! Time Space Trade-off The best algorithm to solve a given problem is one that requires less memory space and less time to run to completion. But in practice, it is not always possible to obtain both of these objectives. One algorithm may require less memory space but may take more time to complete its execution. On the other hand, the other algorithm may require more memory space but may take less time to run to completion. Thus, we have to sacrifice one at the cost of other. In other words, there is Space-Time trade-off between algorithms. If we need an algorithm that requires less memory space, then we choose the first algorithm at the cost of more execution time. On the other hand if we need an algorithm that requires less time for execution, then we choose the second algorithm at the cost of more memory space.