##Table of Contents
TODO:
##Summary
In computer science, a data structure is a data organization, management and storage format that enables efficient access and modification. A data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.
An individual data piece (i.e. a record) is usually part of a bigger collection of data. Each record has a key (e.g. a value to be sorted) and the rest of the record has satellite data (data carried around with the key).
So what goes on behind the scenes? When we move data around, we apply a sorting algorithm; usually there’s a large amount of satellite data that makes it slow to physically move (e.g. erase from hard drive, write to another place in hard drive) so instead we normally just move our pointer to the records.
A sorting algorithm is the method by which we determine the sorted order (regardless of individual fields or large satellite data). There are various algorithms we can apply and various ways we can store our data.
An abstract data type is a mathematical model for data types, where a data type is defined by its behavior from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations.
We call these 'abstract' because they give an implementation-independent view (hiding the details of the implementation). For example, we have data types like `int`, `float`, and `char` with the knowledge of what values can be assigned and what operations can be performed, but without knowing the implementation details.
The issue with Python compared to other programming languages is naming. In, say, Java, the naming scheme of a standard data structure is pretty straightforward: there isn't just a `list`; it's either a `LinkedList` or an `ArrayList`. In Python, we have a more 'human' naming scheme, meaning we just call it a `list`, but really don't know if this is a linked list or a dynamic array.
What?
So let's try to break this down a bit. Say we swapped out 'data type' with vehicle. What would be an 'abstract vehicle'? You would know some of the actions you could do with it (e.g. drive the vehicle, sit in the vehicle), but it doesn't say exactly how you would do that (whether it's an automatic or manual vehicle, whether you need to open a door because it's a car rather than a motorcycle). Some of the data stored might be, say, the color of the vehicle, the number of wheels, etc. The idea is that we get an implementation-independent view. We hide the details so that it's an abstraction. We don't care about details like the number of wheels, the number of doors (if any), or how to apply an action (e.g. drive stick or automatic); it's all a black box. We do this to create an encapsulation around the data.
So we might have an Abstract Data Type like an Integer (e.g. -2, -1, 0, 1, 2) where we can do operations like addition, subtraction, multiplication, division, and comparisons like equals, greater than, less than. We don’t care how the computer moves the data around, whether it’s a binary number or a binary-coded decimal. We simply say ‘hey, add these two numbers together and give me the result’.
Some common abstract data types include lists, stacks, queues, sets, maps (dictionaries), trees, and graphs.
User vs Implementor
Imagine you're the user that has to interact with the interface, using the operations that have been specified by the abstract data type. You might call a `get_max` method or a `delete_last` method, but not be concerned about how it's implemented, since there can be many different ways to implement an abstract data type; i.e. we create an implementation-independent view of the data.
The implementation of an abstract data type is often called the data structure.
Programming Specific
In a programming language like Python, we have objects like Abstract Base Class. Here we see what methods are available, but they’re not yet implemented.
#####Fundamental Abstract Data Type (ADT)
There are a few fundamental abstract data types:
Containers
Containers are abstract data types that hold stuff. We use the container to hold data and retrieve data items independently of content. Some popular types of containers are:
Dictionaries
Dictionaries are a form of container that gives you access to data (values) by keys.
Binary Search Trees
A binary search tree is a binary tree where each node contains at most two children, each child identified as either a left or right child, with keys arranged so that a node's left subtree holds smaller keys and its right subtree holds larger keys.
Priority Queues
Priority Queues are a list with a priority (i.e. items are sorted inside). Priority queue operations can be implemented in O(log n) time by representing the queue with a heap or a binary search tree.
####Abstract Data Types and Classes
So how does an Abstract Data Type work with a programming language's Class? We define a class to be a description of what a data item's state looks like (its data) and what it can do (its methods).
Data items are called objects in this object-oriented paradigm. An object is an instance of a class.
Python specific Classes
Python has a number of built-in classes, from numeric types like `int` and `float` to collections like `str`, `list`, `set`, and `dict`.
####Using Lists to build data structures
A lot of data structures can be implemented with the help of arrays and lists. In Python, arrays are essentially lists, and lists serve as the building blocks for custom data structure classes.
You might see the following methods for the data structure of `alist`.
Operations include:
* `[]` to access an element of a sequence
* `+` to combine sequences together
* `*` to concatenate a repeated number of times (good for initializing a list)
* `in` to check if an item is in a sequence
* `len` to return the number of items in the sequence
* `[:]` to extract a part of a sequence (up to, but not including)

Methods include:
* `append` - `alist.append(item)` - add a new item to the end of a list
* `insert` - `alist.insert(i, item)` - insert an item at the `i`th position in a list
* `pop` - `alist.pop()` - remove and return the last item in a list
* `pop` - `alist.pop(i)` - remove and return the `i`th item in a list
* `sort` - `alist.sort()` - modify a list to be sorted (modifies in place)
* `reverse` - `alist.reverse()` - modify a list to be in reversed order (modifies in place)
* `del` - `del alist[i]` - delete the item in the `i`th position (modifies in place)
* `index` - `alist.index(item)` - returns the index of the first occurrence of item
* `count` - `alist.count(item)` - returns the number of occurrences of item
* `remove` - `alist.remove(item)` - removes the first occurrence of item
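To make these concrete, here's a quick sketch of the operations above in action (values are made up for illustration):

```python
alist = [3, 1] * 2                     # repetition: [3, 1, 3, 1]
alist.append(7)                        # [3, 1, 3, 1, 7]
alist.insert(1, 9)                     # [3, 9, 1, 3, 1, 7]
print(alist[0], alist[1:3])            # indexing and slicing: 3 [9, 1]
print(7 in alist, len(alist))          # True 6
alist.sort()                           # in place: [1, 1, 3, 3, 7, 9]
print(alist.pop())                     # removes and returns 9
del alist[0]                           # [1, 3, 3, 7]
print(alist.index(3), alist.count(3))  # 1 2
alist.remove(3)                        # removes first occurrence: [1, 3, 7]
```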
####<a id="stringfordatastructures">Using Strings to build data structures</a>
You might see the following methods for the data structure of `astring`.
Methods include:
* `center` - `astring.center(w)` - returns a string of size `w` that is center-justified (e.g. `'  hey there  '`)
* `count` - `astring.count(item)` - returns the number of occurrences of `item` in the string
* `ljust` - `astring.ljust(w)` - returns a string of size `w` that is left-justified
* `lower` - `astring.lower()` - returns a string in all lowercase
* `rjust` - `astring.rjust(w)` - returns a string of size `w` that is right-justified
* `find` - `astring.find(item)` - returns the index of the first occurrence of `item`
* `split` - `astring.split(schar)` - splits a string into substrings at `schar`
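A quick sketch of these string methods (outputs shown in the comments):

```python
astring = 'hey there'
print(astring.center(13))   # '  hey there  '
print(astring.count('e'))   # 3
print(astring.ljust(13))    # 'hey there    '
print(astring.rjust(13))    # '    hey there'
print('Hey There'.lower())  # 'hey there'
print(astring.find('e'))    # 1
print(astring.split(' '))   # ['hey', 'there']
```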
####<a id="setsfordatastructures">Using Sets to build data structures</a>
Use sets with:

```python
engineers = set(['John', 'Jane', 'Jack', 'Janice'])
```
Operations include:
* `in` - checks for set membership (e.g. `'John' in set(['John', 'Jane'])`)
* `len` - returns the cardinality of the set (e.g. `len(set(['John', 'John', 'Jane']))` is 2)
* `|` - `aset | otherset` returns a new set with all elements from both sets
* `&` - `aset & otherset` returns a new set with only those elements common to both sets
* `-` - `aset - otherset` returns a new set with all items from the first set not in the second set
* `<=` - `aset <= otherset` asks (Boolean) whether all elements of the first set are in the second set

Methods include:
* `union` - `aset.union(otherset)` - returns a new set with all elements from both sets
* `intersection` - `aset.intersection(otherset)` - returns a new set with only those elements common to both sets
* `difference` - `aset.difference(otherset)` - returns a new set with all items from the first set not in `otherset`
* `issubset` - `aset.issubset(otherset)` - asks whether all elements of one set are in the other set
* `add` - `aset.add(item)` - adds `item` to the set (in place)
* `remove` - `aset.remove(item)` - removes `item` from the set (in place)
* `pop` - `aset.pop()` - removes an arbitrary element from the set
* `clear` - `aset.clear()` - removes all elements from the set
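A minimal sketch of the set operations above (the `managers` set is made up for illustration):

```python
engineers = set(['John', 'Jane', 'Jack', 'Janice'])
managers = set(['Jane', 'Zack'])

print('John' in engineers)    # True
print(len(engineers))         # 4
print(engineers | managers)   # union of both sets
print(engineers & managers)   # {'Jane'}
print(engineers - managers)   # engineers who are not managers
print(managers <= engineers)  # False, 'Zack' is not an engineer
engineers.add('Amy')          # in place
engineers.remove('Jack')      # in place
```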
####<a id="dictsfordatastructures">Using Dictionaries to build data structures</a>
Operators include:
* `[]` - `mydict[k]` returns the value for key `k`, otherwise raises an error
* `in` - `mykey in mydict` - returns True if the key is in the dictionary, otherwise False
* `del` - `del mydict[key]` - removes the entry from the dictionary

Methods include:
* `keys` - `mydict.keys()` - returns the keys of the dictionary in a `dict_keys` object
* `values` - `mydict.values()` - returns the values of the dictionary in a `dict_values` object
* `items` - `mydict.items()` - returns the key-value pairs in a `dict_items` object
* `get` - `mydict.get(mykey)` - returns the value associated with key `mykey`, `None` otherwise
* `get` - `mydict.get(mykey, altvalue)` - returns the value associated with key `mykey`, `altvalue` otherwise
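And a short sketch of the dictionary operators and methods (values are made up):

```python
mydict = {'a': 1, 'b': 2}
print(mydict['a'])          # 1 (raises KeyError for a missing key)
print('b' in mydict)        # True
print(mydict.keys())        # dict_keys(['a', 'b'])
print(mydict.values())      # dict_values([1, 2])
print(mydict.items())       # dict_items([('a', 1), ('b', 2)])
print(mydict.get('z'))      # None
print(mydict.get('z', 99))  # 99, the fallback value
del mydict['a']             # removes the entry
```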
#####Modeling a Problem
Modeling is the art of formulating your application in terms of precisely described, well understood problems. Modeling is the key to applying algorithmic design techniques to real-world problems. It’s a way of relating your application to what has been done before.
Most algorithms are designed to work on a rigorously defined abstract structure, such as a permutation or graph or set. To utilize algorithms to their fullest extent, you have to be able to describe your problem abstractly in terms of procedures on fundamental structures.
What this means is that if you’re in a company building software for “widgets”, you can’t look up a solution to a “widget optimization problem”. Instead, you have to describe how your widget fits into common data structures (to then leverage the common methods of those data structures).
#####Fundamental Data Structures
So with your widget, you can see if it fits any of these fundamental data structures:
These fundamental data structures are important because they provide the language to use to model applications. Study the input and output of each problem and you’ll know where to look when the problem arises in your application.
Modeling is only the first step in designing an algorithm for a problem, but it's also the most important step towards a solution. Make sure you model the problem correctly before trying to solve it.
###Contiguous vs Linked Data Structures
Data structures can be either contiguous or linked, depending on whether they are based on arrays or pointers.
####Comparison: Linked Lists versus static Arrays
Advantages of linked lists over arrays:
* Overflow on linked structures can never occur unless memory is actually full
* Insertions and deletions are simpler than for contiguous (array) structures
* With large records, moving pointers is easier and faster than moving the items themselves

Advantages of arrays over linked lists:
* Linked structures require extra space for storing pointer fields
* Linked lists do not allow efficient random access to items
* Arrays allow better memory locality and cache performance than random pointer jumping
An algorithm is a procedure to accomplish a specific task. An algorithm must solve a general, well-specified problem. An algorithmic problem is specified by describing the complete set of instances it must work on and of its output after running on one of these instances.
We take an input (i.e. some value(s)) and produce an output (some value(s)). For example, we might encounter a sorting problem where we want to sort a sequence of numbers into nondecreasing order. An instance of sorting might be an array of names like `['Mike', 'Bob', 'Sally', 'Jill', 'Jan']`, whereas an algorithm takes any of the possible input instances and transforms it to the desired output.
You can express algorithms in three common forms:
The best notation depends on which method you are most comfortable with. Usually I prefer to express the ideas of an algorithm in English, then move to a more formal programming language like Python. If your idea is not clearly revealed when you express an algorithm, then you are using too low-level a notation to describe it.
Before even looking at an algorithm, ask yourself: do you have a careful description of the problem that the algorithm is trying to solve? Problem specifications usually have two parts: (1) the set of allowed input instances and (2) the required properties of the algorithm's output.
It is impossible to prove the correctness of an algorithm for a fuzzily-stated problem. Ask the wrong problem and you will get the wrong answer.
It's very important that you try to narrow the set of allowable instances until there is a correct and efficient algorithm.
For example, what is the best route between these two places? We need to define what ‘best’ means, whether that is shortest in distance, time, or minimizing the number of turns.
###<a id="thinkrecursively">Recursive Objects</a>
When you're trying to understand a problem, learning to think recursively is important so that you know how to look for big things that are made from smaller things of exactly the same type as the big thing. Think of houses as sets of rooms; add or delete a room and you still have a house. If we look at our fundamental data structures, most are recursive: remove an item from a structure of `n` things and you're left with a structure of `n-1` things. These recursive descriptions of objects require both decomposition rules and basis cases, namely the specification of the smallest and simplest objects where the decomposition stops.
###What makes a good algorithm?
In a good algorithm, you want to look for:
Sometimes you can’t achieve all three.
An algorithm is correct if every input instance halts with the correct output. A single input is an instance of a problem. For example:
```
// an instance of a sorting problem
Input:  sequence of {31, 41, 59, 26, 41, 58}
Output: sequence of {26, 31, 41, 41, 58, 59}
```
It is usually not obvious whether a given algorithm correctly solves a given problem.
Correct algorithms usually come with a proof of correctness, which is an explanation of why we know that the algorithm must take every instance of the problem to the desired result. We can’t just say “it’s obvious”, because usually it’s not obvious.
There’s a fundamental difference between algorithms, which always produce a correct result, and heuristics, which may usually do a good job, but without providing any guarantee.
#####<a id="disprovealgorithms">Disproving algorithms</a>
Searching for counterexamples is the best way to disprove the correctness of a heuristic.
Good counterexamples have all unnecessary details boiled away. Make it clear why the proposed algorithm fails. Try to:
#####<a id="provealgorithms">Proving algorithms</a>
A proof or demonstration of correctness is needed. Usually this is done through mathematical induction. So what is it? Think of mathematical induction like recursion. You break a problem up by solving a base case and then any smaller pieces.
####Is the algorithm efficient?
There are different algorithms to solve the same problem and they are often drastically different in efficiency. We normally measure efficiency by looking at running time, which is mainly affected by the size of input. However, Big O notation is also used to check space complexity. See Big O for further details.
####Is a sorting algorithm stable?
For sorting algorithms, you might hear that one type is ‘stable’ while others are ‘unstable’.
##Walking through an Algorithm Problem
Mentally record any unique information in the problem. For example:
Ask yourself if you’ve used all the information in the problem, especially if you get stuck.
Do not skip this step. Draw a good example, and watch out for the following:
Make sure you really understand what you’re trying to solve. You can’t come up with the right solution if you’re solving the wrong problem!
Your example should help reinforce your understanding of the problem and whether any algorithms that you apply will work.
After the example, state a brute force answer (even if it's obvious to you). It's okay that the initial solution is terrible. Explain the space and time complexity, then improve it. It's a great starting point for optimizations and it helps you wrap your head around the problem.
###Optimize
After the brute force algorithm, you’ll need to come up with an optimized algorithm.
###Walk Through
After you’ve gotten an optimal algorithm, don’t dive into coding. Take a moment to solidify your understanding of the algorithm.
Whiteboard coding is very slow; same with testing and fixing code. Make sure you have code that is close to perfect in the beginning.
If you don’t understand exactly what you’re about to write, you’ll struggle to code it. It will take longer to finish the code and you’ll make major errors.
###Implement
When you have an optimal algorithm and know exactly what you're going to write, then go ahead and implement it. Start coding in the top left corner, avoid "line creep" (where each line of code is at a slant), and remember spacing.
Write beautiful code, meaning:
* Use modularized code - e.g. call a helper like `init_matrix(size)` first instead of filling in its details inline
* Use other classes/structs where appropriate (e.g. a `StartEndPair`); you can fill in the details for that new class later, and it will allow you to expand on that dataset

###Test
Don’t submit code in an interview without testing it.
When you find bugs, don’t just make the first correction you can think of. Analyze why the bug occurred and ensure your fix is the best one.
##Optimize and Solve Technique 1: BUD
Look for BUD: Bottlenecks, Unnecessary work, and Duplicated work.
###Bottlenecks
A bottleneck is a part of your algorithm that slows down the overall runtime. This can occur when:
##Optimize and Solve Technique 2: DIY
Try to solve the problem manually with a real-life example: instead of abstractly thinking "create a sorting algorithm using binary search", think about how you would locate a specific student's paper in a pile.
##Optimize and Solve Technique 3: Simplify and Generalize
We'll implement a multi-step approach:
##Optimize and Solve Technique 4: Base Case and Build
We solve the problem first for a base case (e.g. `n=1`) and then try to build up from there. When we get to more complex cases (e.g. `n=3`, `n=4`), we try to build those using the prior solutions. Once we understand the pattern, we can try, say, a recursive algorithm.
##Optimize and Solve Technique 5: Data Structure Brainstorm
Run through the list of data structures and try to apply each one. This might help if you’re stuck on a problem and realize that if you used say a tree, it’s a trivial problem.
Use data structures generously!!!
##Best Conceivable Runtime (BCR)
Consider the best conceivable runtime and what that solution might look like in Big O.
Try to write good code, meaning:
##Big O
We use Big O notation to give an estimated running time based on the input size (as it tends toward infinity). Big O is basically a math formula that counts how many steps of computations get executed given a specified input size; the more computational steps the slower it is. We characterize these functions according to their rate of growth (aka growth rate), which is represented using the O Notation. We also care about space (so see Space Complexity below).
Note: We also use Big O for measuring space complexity (not just time)!
When we say calculate Big O, we often have some scenarios we want to look at: the best case, the worst case, and the expected (average) case.
Note: Access vs Search - Access is like if I knew your mailbox, I would go there directly. Search would be: I know your mailbox is somewhere on this block, so I start searching through them. You might not always have constant Access (e.g. a linked list might take you through multiple items before you find the one you're looking for).
####Big O Example
Say you want to get a hard drive of information to your friend across the country. We can do an ‘electronic transfer’ through email or a ‘physical transfer’ by flying out there with the hard drive. Each method has a different time complexity:
* 'Electronic Transfer' can be described as `O(s)` where `s` is the size of the file (for this example we say this is a linear function, but it can have a different rate of growth)
* 'Airplane Transfer' can be described as `O(1)` where `1` is a constant that simply means that as the data gets larger, it doesn't affect our flight time

If we compare these two scenarios, the 'Airplane Transfer' time is slower for small files, but at some point it beats out the 'Electronic Transfer'. The general idea is that the running time of an algorithm is a function of the size of its inputs.
Big O is used to compare the efficiency of algorithms without implementing them. We have a couple tools for this:
1. The RAM model of computation
2. Asymptotic analysis of worst-case complexity
Machine-independent algorithm design depends upon a hypothetical computer called the Random Access Machine (RAM). This computer strikes a balance of capturing essential behavior of computers while being simple to work with.
Under the RAM model:
* Each simple operation (`+`, `*`, `-`, `=`, an if, a call) takes exactly one time step
* Loops and subroutines are not simple operations; they are the composition of many single-step operations
* Each memory access takes exactly one time step

We measure run time by counting up the number of steps an algorithm takes on a given problem. Our imaginary machine runs a given number of steps per second.
With RAM, we can count how many steps our algorithm takes on any given input instance by executing it. We look at the best-case, average-case, and worst-case complexity. Usually we use the worst-case complexity.
Common O notation runtimes are below (from the fastest first to slowest last):
* `O(1)` - constant; e.g. check if a number is even or odd (uses a constant-size lookup or hash table)
* `O(log N)` - logarithmic; e.g. find an item in a sorted array with a binary search (because we split in half)
* `O(N)` - linear; e.g. find an item in an unsorted list
* `O(N log N)` - loglinear; e.g. heapsort, quicksort (best and avg case), merge sort
* `O(N^2)` - quadratic; e.g. selection sort, insertion sort, worst case for bubble sort, quicksort (worst case)
* `O(2^N)` - exponential; e.g. finding exact solution to traveling salesman problem using dynamic programming
* `O(N!)` - factorial; e.g. solving traveling salesman problem via brute-force search

A good cheatsheet is here.
When evaluating time, we can look at the O Notation a few different ways.
What really happens is that we're interested in the asymptotic runtime: how the algorithm runtime scales for very large inputs, not the minute details or small inputs. This means we:
* Drop the constants (e.g. `O(2N)` is actually `O(N)`)
* Drop the non-dominant terms (e.g. `O(N^2 + N^2)` is actually `O(N^2)`; `O(N + log N)` becomes `O(N)`; `O(N^2 + N)` becomes `O(N^2)`)
* Consider the runtime as `N` approaches infinity
Add or Multiply
If we do code chunk A completely and then start code chunk B, the notation should look like `O(A + B)`. If code chunk B runs inside code chunk A (e.g. a for-loop in a for-loop), the notation should look like `O(A * B)`.
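A sketch of the difference, where `process` is a made-up stand-in for some O(1) work over two arbitrary lists `A` and `B`:

```python
def process(*args):
    pass  # stand-in for O(1) work

A, B = [1, 2, 3], [4, 5]

# O(A + B): chunk A runs completely, then chunk B runs
for a in A:
    process(a)
for b in B:
    process(b)

# O(A * B): chunk B runs inside chunk A (for-loop in a for-loop)
for a in A:
    for b in B:
        process(a, b)
```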
####Amortized Time
Amortized time applies when you do an operation, say, a million times. You don't care about the worst case or best case of a single call that happens every once in a while; you only care about the total time taken to do all million operations.
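The classic example is appending to a dynamic array: most appends take O(1), but when the array is full it doubles its capacity and copies everything over, which is O(N). Averaged over many appends, each one is still amortized O(1). A rough sketch of the idea (not Python's actual list implementation):

```python
class DynamicArray:
    """ Rough sketch of amortized O(1) appends via capacity doubling """
    def __init__(self):
        self.capacity = 1
        self.length = 0
        self.data = [None]

    def append(self, item):
        if self.length == self.capacity:
            self.capacity *= 2                # occasional O(N) doubling...
            new_data = [None] * self.capacity
            for i in range(self.length):      # ...with a full copy
                new_data[i] = self.data[i]
            self.data = new_data
        self.data[self.length] = item         # the usual O(1) write
        self.length += 1
```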
####Log N Runtimes
`O(log N)` is a common runtime because of the 'elements being halved' scenario. For example, in a binary search for `x` in an N-element sorted array, the options are:
1. If `x == middle`, return middle
2. If `x < middle`, search the left half
3. If `x > middle`, search the right half

The runtime looks like this (say, given an N-element array where N is 16): we go from 16 elements to 8, then 4, then 2, and finally 1.
We could reverse this to ask: how many times can we multiply by 2 until we get N? That gives `2^k = N`, which is what `log` expresses. For example: `2^4 = 16`, which is `logBase(16, 2) = 4` where the form is `logBase(value, base)`; in general `logBase(N, 2) = k` means `2^k = N`.
When you have a recursive function that makes multiple calls, the runtime will often (not always) look like `O(branches^depth)`, where 'branches' is the number of times each recursive call branches.
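For example, the naive Fibonacci implementation makes two recursive calls per level (branches = 2) to a depth of roughly `n`, giving about `O(2^n)`:

```python
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)  # 2 branches, depth ~n => O(2^n)
```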
####Space Complexity
Besides time, other considerations in real life (that we won't consider for now) are things like space (RAM, hard drive), bandwidth speed, caches, and parallelism (single vs multiple cores). The general rule is that if we need to create an array of size `n`, we require `O(n)` space.
A program updates variables in memory according to code. Variables come in types, a classification of data saying what the possible values are and what operations can be performed on it (e.g. add a number, check the length of a string). In Python, everything is an object, including different types like:
A bit is atomic, the smallest unit of storage, storing just a `0` or `1`. If we have just one bit, the pattern can be either `0` or `1`. If we have two bits, we can have the patterns `00`, `01`, `10`, `11`. If we have three bits, we can have the patterns `000`, `001`, `010`, etc.
With each additional bit, we can double the number of previous patterns
```
1 bit  = 2 patterns
2 bits = 4 patterns
3 bits = 8 patterns
4 bits = 16 patterns
5 bits = 32 patterns
6 bits = 64 patterns
7 bits = 128 patterns
8 bits = 256 patterns (one byte)
```
Mathematically: n bits yield `2^n` patterns.
A byte is a collection of 8 bits. One byte can store one character (e.g. ‘A’, ‘$’)
* Kilobyte (KB) = 1 thousand bytes
* Megabyte (MB) = 1 million bytes
* Gigabyte (GB) = 1 billion bytes
* Terabyte (TB) = 1 trillion bytes
Examples of ASCII values:
* 'A' is 65
* 'B' is 66
* space is 32
Bitwise operators act on operands as if they were strings of binary digits. They operate bit by bit, meaning we have `O(1)` computation per bit and a time complexity of `O(n)` for an n-bit word. Common bitwise operators in Python are shown below (the examples assume `x = 10`, i.e. `0000 1010`, and `y = 4`, i.e. `0000 0100`):

```python
&   # bitwise AND, e.g. x & y = 0 (0000 0000)
|   # bitwise OR, e.g. x | y = 14 (0000 1110)
~   # bitwise NOT, e.g. ~x = -11
^   # bitwise XOR, e.g. x ^ y = 14 (0000 1110)
>>  # bitwise right shift, e.g. x >> 2 = 2 (0000 0010)
<<  # bitwise left shift, e.g. x << 2 = 40 (0010 1000)
```
So what exactly does a bitwise operator do?
These can appear as truth tables:

```
AND | 0 1    OR | 0 1    XOR | 0 1    NOT | 0 1
----+----    ---+----    ----+----    ----+----
 0  | 0 0     0 | 0 1     0  | 0 1        | 1 0
 1  | 0 1     1 | 1 1     1  | 1 0
```
An example would be if you only wanted the lower 4 bits of an integer: you AND it with 15 (binary `1111`). The zero bits act as a filter, forcing the bits in the result to be zero as well.

```
    201: 1100 1001
AND  15: 0000 1111
------------------
IS    9: 0000 1001
```
In addition, `>>` and `<<` are often included as bitwise operators; they "shift" a value right and left, respectively, by a certain number of bits, throwing away bits that roll off the end you're shifting towards and feeding in zero bits at the other end.

```
1001 0101 >> 2 gives 0010 0101
1111 1111 << 4 gives 1111 0000
```
Note that with Python, a left shift simply expands the width of the number for the extra bits (whereas other languages drop off the extra bits).
So why is this important? Think of when you `chmod` a file to give permissions. Each file and folder has permission bits that control access: `000` means no permissions. To read and write would be 6 (4+2). To read, write, and execute would be 7 (4+2+1). Common permissions are:
* `755` - common on web servers; the owner has all permissions, everyone else can read and execute, but not make changes to the file
* `777` - everyone can do everything
* `644` - only the owner can read and write, everyone else can only read; no one can execute
* `655` - only the owner can read and write, but not execute the file; everyone else can read and execute, but cannot modify the file

Key methods for numeric types are:
* `abs(-35)` - absolute value
* `math.ceil(2.17)` - ceiling
* `math.floor(3.14)` - floor
* `min(x, -4)` - returns the minimum number of these
* `max(3.14, y)` - returns the maximum number of these
* `pow(2.71, 3.14)` - same as `2.71 ** 3.14`
* `math.sqrt(225)` - square root
* `float('inf')` - floats are not infinite precision; use this to refer to infinity
* `float('-inf')` - use this to refer to negative infinity
* `math.isclose()` - for comparing floating point values, returns True or False

It's good to know what the max size is on your machine using `sys.maxsize` and `sys.float_info`:

```
>>> import sys
>>> sys.maxsize
9223372036854775807
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308,
min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
```
###Computing the Parity of a Word
The parity of a binary word is 1 if the number of 1's in the word is odd; otherwise, it is 0. Parity checks are used to detect single-bit errors in data storage and communication. A useful trick:

```
x & (x - 1)  # equals x with its lowest set bit erased
```

e.g. if `x = (00101100)`, then `x - 1 = (00101011)` and:

```
  00101100    x
& 00101011    x-1
----------
  00101000    result
```
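Using that trick, a short parity sketch: repeatedly erase the lowest set bit and flip the result each time, so the loop runs once per 1-bit:

```python
def parity(x):
    result = 0
    while x:
        result ^= 1
        x &= x - 1  # erase the lowest set bit
    return result

print(parity(0b1011))  # 1 (three 1-bits: odd)
print(parity(0b1001))  # 0 (two 1-bits: even)
```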
Python has a few built-in types. This is for Python 3.6:
https://docs.python.org/3.6/library/index.html
###Algorithms
###Concepts
Use this table below for questions about scalability or memory limitations. For example, if we have a bit vector mapping of every 32-bit integer to a boolean value, we need 2^32 bits (or 2^29 bytes) to store this mapping; that's approximately 512 MB of memory. Remember that 1 bit = .125 bytes (8 bits = 1 byte).

```
Power of 2 | Exact Value (X)  | Approx Value | X Bytes in KB, MB, GB, TB
        7  | 128              |              |
        8  | 256              |              |
       10  | 1024             | 1 thousand   | 1 KB
       16  | 65536            |              | 64 KB
       20  | 1048576          | 1 million    | 1 MB
       30  | 1073741824       | 1 billion    | 1 GB
       32  | 4294967296       |              | 4 GB
       40  | 1099511627776    | 1 trillion   | 1 TB
```
###Arrays
The array is the simplest data structure; it is a contiguous block of memory, usually used to represent sequences. If we have an array `A`, then we can say that `A[i]` is the `(i + 1)`th object stored in the array (because indexing starts at 0). Retrieving and updating `A[i]` takes `O(1)` time.
####Array Advantages and Disadvantages
Advantages of contiguously-allocated arrays (i.e. a 'list' in Python) are:
Disadvantages are:
We can efficiently enlarge arrays as we need them through dynamic arrays. As soon as we need a larger array, we double the size and copy the contents of the smaller array into the larger array. The primary thing lost using dynamic arrays is the guarantee that each array access takes constant time in the worst case. All queries will be fast, except those relatively few queries triggering array doubling.
The time for operations really depends on whether an array is sorted or unsorted.
####Sorted Arrays
Inserting and Deleting
For a sorted array, deleting an element means moving all successive elements one over to the left to fill the vacated space. For example, if the array is `[2, 3, 5, 7, 9, 11, 13, 17]`, then deleting the element at index 4 results in `[2, 3, 5, 7, 11, 13, 17, 0]`. The time complexity of deleting an element at index `i` from an array of length `n` is `O(n - i)`. The same is true for inserting a new element (as opposed to updating an existing entry), because we need to move elements over.
####Array Problem 1
Say we have an input array consisting of integers. Your job is to reorder its entries so that the even entries appear first. We can do this easily with `O(n)` additional space, where `n` is the length of the array. However, it is more difficult to solve without allocating additional storage.
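One possible `O(1)`-space sketch: partition in place with two pointers, growing an 'evens' region from the front and an 'odds' region from the back:

```python
def even_odd(A):
    """ Reorder A in place so even entries appear first: O(n) time, O(1) space """
    next_even, next_odd = 0, len(A) - 1
    while next_even < next_odd:
        if A[next_even] % 2 == 0:
            next_even += 1
        else:
            A[next_even], A[next_odd] = A[next_odd], A[next_even]
            next_odd -= 1

A = [3, 2, 5, 8, 7, 6]
even_odd(A)
print(A)  # [6, 2, 8, 7, 5, 3] - evens first, odds last
```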
Hash tables are a very practical way to maintain a dictionary. They exploit the fact that looking up an item in an array takes constant time once you have its index. A hash function is a mathematical function that maps keys to integers. We use the value of our hash function as an index into an array, and store our item at that position. Here are the steps: (1) compute the hash of the key, (2) take the modulus by the table size to get an index, and (3) store the item at that index.
##Hashing
We do hashing because of its speed. If we know information about the structure of our data (e.g. whether a list is ordered or not), we can use this knowledge to do efficient searches (e.g. in `O(log N)` runtime using a binary search instead of `O(N)`). We can get even faster with a technique called hashing.
If we know where every item should be, then our search can do a single comparison to find the item. This concept is called hashing and it has a really efficient runtime. The idea is that when we input a key (e.g. Will) through a hash function, this returns a map to where the data is stored (aka the hash values, hash codes, hash sums, hashes).
####Hash Table
A hash table (aka hash map) is a collection of items that are stored in a way where we can find the items very quickly. Each position in the hash table is called a slot (sometimes the entire slot is referred to as a bucket) and can hold an item.
What is a real life example of this? Say a teacher sorts their student’s papers in categories of A-Z based on their first names (or specific buckets like W-Z if there are not a lot of names with those characters, as long as we get about an even distribution). If there are multiple people with the same first letter (e.g. Laura, Lisa), we have a hash collision that can be fixed with a few different methods, with chaining being the simplest (i.e. multiple names are added to the same category).
####Hash Function
A hash function is the function that distributes key/value pairs across an array of slots. A sample function would look like `index = f(key, array_size)` with `hash = hashfunc(key)`. The goals of the hash function are to distribute keys uniformly across the slots (minimizing collisions) while being cheap to compute.
A perfect hash function is the case where every item is mapped to a unique slot, which then runs in `O(1)` time. We don't need a perfect hash function in order for the hash to be efficient. For the worst-case scenario (a high number of collisions), the worst-case runtime is `O(N)` where `N` is the number of keys. Another implementation of the hash table is with a balanced binary search tree, which gives us `O(log N)` lookup time. This will use less space since we no longer allocate a large array.
####Hash Example
Say we have a small hash table represented by a list of 11 empty slots (`m=11`) that have the Python value of `None`. Note that the number of slots should ideally be prime (so we have an easier way of resolving hash collisions, more on that later).

```
slots   0    1    2    3    4    5    6    7    8    9    10
values  None None None None None None None None None None None
```
Say we need to put in a set of integer items, like this: 54, 26, 93, 73. We can create a hash function that takes an item in the collection and returns an integer that tells what slot (between `0` and `m-1`) to place the item. A common hash function is the remainder method, where we take an item and divide it by the table size, returning the remainder as the hash value.

```
54 % 11 = remainder of 10
26 % 11 = remainder of 4
93 % 11 = remainder of 5
73 % 11 = remainder of 7
```
After we calculate the hash values, we enter them into the hash table. The load factor is `number_of_items / table_size` (e.g. 4 items / 11 slots). Conceptually, if the load factor is small, there are potentially fewer collisions; if the load factor is large, there are potentially more collisions (which also means collision resolution is more difficult). Usually a load factor of `.8` or higher is where linear probing's performance really degrades and you should definitely do chaining. Here we have our updated hash table:

```
slots   0    1    2    3    4    5    6    7    8    9    10
values  None None None None 26   93   None 73   None None 54
```
####Hash Collision
So what happens in our example if we add a new item of `71`? We would get a remainder of `5` (`71 % 11 = 5`), which would create a hash collision with `93`; this means at least two values are trying to fit into the same bucket (in our example, the values `93` and `71`). We have two goals:
As we look at different ways to minimize hash collision, we should keep in mind that you want a function to uniformly distribute hash values. These can be tested using statistical tests (e.g. Pearson’s chi-squared test for discrete uniform distributions)
####Minimizing Hash Collision with the Folding Method
One way of trying to minimize hash collision is with the folding method.
1. Given a phone number like `436-555-4601`, we take the numbers and divide them into groups of 2, like `43, 65, 55, 46, 01`.
2. We then add all the pieces together to get a resulting hash value, e.g. `43+65+55+46+01 = 210`.
3. Given the hash value (e.g. `210`), we take the modulus to get a remainder of `1` (`210 % 11 = 1`) and place the value `210` into position `1`.

Some folding methods reverse every other piece, so `43+65+55+46+01` turns into `43+56+55+64+01 = 219`, and `219 % 11 = 10`.
####Minimizing Hash Collision with the Mid-Square Method
Another way of trying to minimize hash collision is with the mid-square method.
1. For a given value like `44`, we calculate `44^2 = 1936`.
2. We then extract the middle two digits (e.g. `93` from `1936`) and take the modulus to get the remainder (`93 % 11 = 5`).

####Minimizing Hash Collision when Hashing Characters
We can create hash functions for character-based items like strings. For example, for the word 'cat' we have `99 + 97 + 116 = 312`, then `312 % 11 = 4`.

```python
ord('c')  # 99
ord('a')  # 97
ord('t')  # 116
```

If we use a hash function that simply adds up all the characters and takes the modulus, we end up with hash collisions for anagrams (e.g. 'dad' and 'add'). We can fix this by adding a weighted position to each character (e.g. multiply by 1 for the first char, by 2 for the second char, etc). For example:

```python
def hash(astring, tablesize):
    total = 0
    for position in range(len(astring)):
        total += ord(astring[position]) * (position + 1)
    return total % tablesize
```
Collision resolution is the method for resolving hash collision, which means what do we do when two or more items are hashed to the same slot in the hash table. There are many ways of attempting to address this including open addressing and separate chaining (chaining).
####Resolving Hash Collision with Open Addressing
Open Addressing (aka closed hashing) is a method to resolve hash collisions by trying to find the next open slot in the hash table through some probe sequence, which systematically visits each slot until an open slot is found. The name comes from the fact that the location (‘address’) of the item is not determined by its ‘hash value’. The downside of open addressing is that the number of stored entries cannot exceed the number of slots.
There are many probe sequences, which has different takes on trying to minimize clustering of hash values. These probing sequences include:
Deletion in an open addressing scheme could get ugly since removing one element might break a chain of insertions. We have no alternative except to reinsert all the items in the run following the new hole.
With `n` items and `m` buckets, hash table runtimes are:

```
            Hash Table   Hash Table
            (Expected)   (Worst Case)
Search      O(n/m)       O(n)
Insert      O(1)         O(1)
Delete      O(1)         O(1)
Successor   O(n + m)     O(n + m)
Predecessor O(n + m)     O(n + m)
Minimum     O(n + m)     O(n + m)
Maximum     O(n + m)     O(n + m)
```
####Resolving Hash Collision with Open Addressing and Linear Probing
When we visit each bucket one at a time, we are using a technique called linear probing. With the given example above, if we add `71`, we get a hash collision since `93` is already in position `5` (`71 % 11 = 5` and `93 % 11 = 5`). We call it rehashing when we try to find a new slot for `71`.

```
slots   0    1    2    3    4    5    6    7    8    9    10
values  None None None None 26   93   None 73   None None 54
```
####Resolving Hash Collision with Open Addressing and Quadratic Probing
Instead of making the skip a constant value (e.g. 1, 2), we can use a rehash function that increments the hash value by 1, 3, 5, 7, etc. This means that if the first hash value is `h`, the successive hash values are `h+1`, `h+4`, `h+9`, `h+16`, `h+25`, etc. This 'skip' value is the sequence of successive perfect squares.
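A small sketch of that probe sequence, assuming an initial hash `h` and table size `m` (the function name is just illustrative):

```python
def quadratic_probes(h, m, attempts=5):
    """ Yield the slots h, h+1, h+4, h+9, ... modulo the table size """
    for i in range(attempts):
        yield (h + i * i) % m

print(list(quadratic_probes(h=5, m=11)))  # [5, 6, 9, 3, 10]
```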
####Resolving Hash Collision with Separate Chaining
Separate chaining (aka chaining) allows many items to exist at the same location in the hash table. When a hash collision happens, the item does not do probing and instead is placed in the proper slot. This is the easiest resolution, but devotes a considerable amount of memory to pointers. This is space that could be used to make the table larger and the ‘lists’ smaller.
When we search for an item, we use the hash function to generate the slot where it should be. Each slot has a collection so we use a searching technique to decide if the item is present. The advantage is that on average, there are likely to be very few items in each slot so the search should be more efficient. If there are many items in each slot, we have a few ways to handle this:
Here is some sample code for a hash table implementation:
HashTable implementation:

```python
class SlotItem(object):
    """ An item in a hash table slot (a key-value pair) """
    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __str__(self):
        return str(self.key)


class HashTable(object):
    """
    Implementation of a hash table using chaining instead of
    open addressing. Each slot has a list of items that are
    appended when there is a hash collision.
    """
    def __init__(self, size):
        self.size = size
        self.table = [[] for _ in range(self.size)]  # independent lists

    def hash_function(self, key):
        """
        Return modulus (i.e. remainder) and handle two scenarios:
        1 - numbers only (int); just take the modulus
        2 - characters involved; create hash from the ordinal value of
            each character, weighted by position (to handle anagrams)
        """
        if type(key) is int:  # numbers only
            return key % self.size
        total = 0  # characters involved
        for position in range(len(key)):
            total += ord(key[position]) * (position + 1)
        return total % self.size

    def set(self, key, value):
        """
        Find slot with hash_function, then save the key-value pair.
        If there is an existing key, we overwrite that value.
        """
        index = self.hash_function(key)  # find correct slot
        for slot in self.table[index]:  # look inside slot for key
            if slot.key == key:  # key found, replace current value
                slot.value = value
                return
        self.table[index].append(SlotItem(key, value))  # key not found, add
        self.check_load_factor(upper_limit=.8, resize=2)

    def get(self, key):
        """ Find slot with hash_function, return slot value or else None """
        index = self.hash_function(key)
        for slot in self.table[index]:  # look inside slot for key
            if slot.key == key:  # key found, return current value
                return slot.value
        return None  # no key found

    def remove(self, key):
        """ Given a key, remove the key-value pair """
        index = self.hash_function(key)
        for i, slot in enumerate(self.table[index]):
            if slot.key == key:  # key found, delete it
                del self.table[index][i]
                return

    def check_load_factor(self, upper_limit=.8, resize=2):
        """
        Check the load factor; if the limit is reached, warn that the
        table should be resized to a larger one
        """
        load = 0
        for i in range(self.size):
            for slot in self.table[i]:
                if slot:
                    load += 1
        load_factor = float(load) / float(self.size)
        if load_factor > upper_limit:  # need to resize for larger hash table
            print("Load Factor is past upper limit, you should resize")
            # TODO: Create deepcopy, dynamically resize for high and low limit
        else:
            pass  # load_factor is in acceptable limits
```
Tests for Hashtable Implementation
```python
""" Implement a hashmap (with amortized constant time look-ups)
without using a hashmap primitive. Includes an executable testing
framework for the data structure """

import unittest
from nose.tools import assert_equal, assert_true, assert_false


class TestUnitHashTable(unittest.TestCase):
    """ Unit Tests - Test each piece in isolation """

    def setUp(self):
        self.ht = HashTable(11)
        self.a = SlotItem('hello', 1112223333)  # char, int scenario
        self.b = SlotItem(1, 1)                 # int, int scenario
        self.c = SlotItem('f', 'morestuff')     # char, char scenario
        self.ht.table = [[self.a], [self.b], [], [self.c],
                         [], [], [], [], [], [], []]

    def test_keys_can_be_integers(self):
        try:
            self.ht.set(11, 100)
        except TypeError:
            self.fail("UT: Hashing func cannot handle key with ints")

    def test_keys_can_be_characters(self):
        try:
            # hello = 104*1 + 101*2 + 108*3 + 108*4 + 111*5 = 1617 => %11=0
            self.ht.set('hello', 'world')  # pos: 'hello'(0), value: 'world'
        except TypeError:
            self.fail("UT: Hashing func cannot handle key with chars")

    def test_keys_can_be_spaces(self):
        try:
            self.ht.set(' ', 'somevalue')
        except TypeError:
            self.fail("UT: Hashing func cannot handle key with spaces")

    def test_keys_can_be_mix_characters_and_integers(self):
        try:
            self.ht.set('a1b3e', 'somevalue')
        except TypeError:
            self.fail("UT: Hashing func cannot handle key with ints & chars")

    def test_hashes_anagram_keys_to_different_buckets(self):
        """ Ensure hashing of keys with anagrams turns out unique """
        self.ht.set('elvis', 'samestuff')  # 1666 => %11=5
        self.ht.set('lives', 'samestuff')  # 1651 => %11=1
        bucket5 = self.ht.table[5]  # elvis bucket goes to slot 5
        bucket1 = self.ht.table[1]  # lives bucket goes to slot 1
        for item in bucket5:
            if item.key == 'elvis':
                value5 = item.value
        for item in bucket1:
            if item.key == 'lives':
                value1 = item.value
        assert_equal(value5, value1)

    def test_identify_empty_hash_table_index(self):
        # Check if list in this bucket is empty
        bucket2 = self.ht.table[2]
        assert_false(bucket2)

    def test_identify_filled_hash_table_index(self):
        # Check if list in this bucket is filled
        bucket1 = self.ht.table[1]
        assert_true(bucket1)

    def test_get_existing_hash_table_item(self):
        value = self.ht.get('hello')
        assert_equal(value, 1112223333)

    def test_get_nonexisting_hash_table_item(self):
        value = self.ht.get('nothere')
        assert_equal(value, None)

    def test_set_data_on_empty_hash_table_slot(self):
        assert_equal(self.ht.get(16), None)
        self.ht.set(16, 'abc')
        assert_equal(self.ht.get(16), 'abc')

    def test_set_data_on_existing_hash_table_slot(self):
        assert_equal(self.ht.get('f'), 'morestuff')
        self.ht.set('f', 'differentstuff')
        assert_equal(self.ht.get('f'), 'differentstuff')

    def test_remove_data_on_existing_key_with_value(self):
        assert_equal(self.ht.get('hello'), 1112223333)
        self.ht.remove('hello')
        assert_equal(self.ht.get('hello'), None)

    def test_remove_data_on_nonexisting_key(self):
        try:
            self.ht.remove('notthere')
        except:
            print("Issue with trying to remove a nonexisting key")
            raise

    def test_set_load_factor_high(self):
        assert_equal(self.ht.size, 11)
        self.ht.table = [[self.a], [self.b], [self.a], [self.c], [self.a],
                         [self.c], [], [self.a], [self.a], [self.b], []]
        self.ht.set(10, 'here')
        # TODO: if more time, trigger auto resize
        #assert_equal(self.ht.size, 22)
        #assert_equal(self.ht.get(10), 'here')  # check value after resize
        # TODO: check hash value not too high
        # TODO: check uniform distribution of values


class TestFunHashTable(object):
    """ Functional Tests - Test all pieces together from end to end """

    def test_end_to_end(self):
        hash_table = HashTable(11)
        print("FT: Created Hash Table of size,", hash_table.size)
        print("FT: Default value type at slot 0 is None")
        assert_equal(hash_table.get(0), None)
        print("FT: Setting value in slot 13 (i.e. 13%11=2) with value 'dee'")
        hash_table.set(13, 'dee')
        assert_equal(hash_table.get(13), 'dee')
        print("FT: Adding value in slot 3 with value dee")
        hash_table.set(3, 'dee')
        assert_equal(hash_table.get(3), 'dee')
        print("FT: Checking that two values are the same")
        assert_equal(hash_table.get(13), hash_table.get(3))
        print("FT: Deleting value in slot 2")
        hash_table.remove(2)
        assert_equal(hash_table.get(2), None)


if __name__ == '__main__':
    # run unit tests
    suite = unittest.TestLoader().loadTestsFromTestCase(TestUnitHashTable)
    unittest.TextTestRunner(verbosity=2).run(suite)
    A = TestFunHashTable()
    A.test_end_to_end()
```
###String Matching via Hashing
Strings are sequences of characters where the order of the characters matters. The primary data structure for representing strings is an array of characters, which gives us constant-time access to the `i`th character of the string.
You'll often run into a substring search. There are different approaches, including:
* The straightforward approach: check for the pattern `p` in text `t` at every position in the text
* Hashing (e.g. Rabin-Karp): compute the hash of the pattern `p` and the hash of the `m`-character substring starting at the `i`th position of `t`. If these two strings are different, the hash values will almost certainly be different. False positives are so rare that we can easily spend `O(m)` time to check when the hashes match.
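A minimal sketch of that idea (using Python's built-in `hash` on slices for brevity, rather than a true rolling hash that updates in O(1) per shift):

```python
def find_substring(p, t):
    """ Return the first index where pattern p occurs in text t, else -1 """
    m = len(p)
    target = hash(p)
    for i in range(len(t) - m + 1):
        if hash(t[i:i + m]) == target:  # hashes match: almost certainly equal
            if t[i:i + m] == p:         # spend O(m) to rule out a false positive
                return i
    return -1

print(find_substring('lo w', 'hello world'))  # 3
```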
###Duplicate Detection Via Hashing
The key idea of hashing is to represent a large object (e.g. a key, a string, a substring) using a single number. The point of this representation is to be able to manipulate the large object in constant time.
Hashing is used most often in speeding up search, including:
Hashing is a fundamental idea in randomized algorithms, yielding linear expected-time algorithms.
##<a id="specializeddatastructures">Specialized Data Structures</a>
We have basic data structures that represent an unstructured set of items so we can run retrieval operations.
There’s also another set of data structures that represent more structured or specialized kinds of objects, like points in space, strings, and graphs. These include:
###Linked Lists
So remember that a list is a mutable data structure that can hold different data types in the same list (e.g. an int, then a str), while an array is a mutable data structure that can only hold the same data type in the same array (e.g. only strings). With an array, memory allocation is done for all of its elements as one block of memory (easier since it's all the same data type). A linked list instead does memory allocation so that each element (aka linked list element, or node) gets its own memory block. The linked list gets its overall structure by using pointers to connect all its nodes together.
So what are some examples of linked lists?
In terms of implementation, we can search for an item in a linked list either iteratively or recursively.
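A minimal sketch of a singly linked list node with both search styles:

```python
class Node:
    def __init__(self, data, next=None):
        self.data = data  # the item (key plus any satellite data)
        self.next = next  # pointer to the next node, or None at the end

def search_iterative(head, target):
    while head is not None:
        if head.data == target:
            return head
        head = head.next
    return None

def search_recursive(head, target):
    if head is None or head.data == target:
        return head
    return search_recursive(head.next, target)

head = Node(1, Node(2, Node(3)))
print(search_iterative(head, 2).data)  # 2
print(search_recursive(head, 3).data)  # 3
```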
###Pointers
Before we get too in-depth with linked list/structures, we need to get into detail about pointers. Pointers are the connections that hold the pieces of linked structures together.
Pointers represent the address of a location in memory. A variable that stores a pointer to a given data item can provide more freedom than storing a copy of the item itself.
A real life example might be a cell-phone number can be thought of as a pointer to its owner as they move around.
In programming languages like C, we have:
* `p` is assumed to give the address in memory where a chunk of data is located
* `*p` denotes the item that is pointed to by pointer `p`
* `&x` denotes the address of (i.e. a pointer to) a particular variable `x`
* a special `NULL` pointer value is used to denote structure-terminating or unassigned pointers

Each node of a linked list contains two fields: a 'data' field holding the value and a 'next' pointer to the following node.
A container is a data structure that lets you store and retrieve data items independent of content. Containers are in contrast to dictionaries, which are abstract data types that retrieve based on key values or content. There are a few types of containers, and they're identified by the retrieval order. The two most important are:
* Stacks - support retrieval by last-in, first-out (LIFO) order. The `put` and `get` operations for stacks are called `push` and `pop`. An example is a subway car, where the people that come in first have the hardest time getting out during busy hours.
* Queues - support retrieval by first-in, first-out (FIFO) order. The `get` and `put` operations in queues are usually called `enqueue` (insert item x at the back of the queue q) and `dequeue` (return and remove the front item from queue q). Queues are the fundamental data structure controlling breadth-first searches in graphs.

You can implement a stack or queue using either arrays or linked lists.
###Stacks and Queues in Python
In Python, you can use a list as a stack very easily: just run `mylist.append(myitem)` and `mylist.pop()`. If you need a queue, use `collections.deque`. You can use a list as a queue, but it's not efficient (inserts or pops from the beginning of a list force all the other elements to shift by one). With a deque, you can run `get` and `put` with `myqueue.popleft()` and `myqueue.append(myitem)`.
The dictionary data type allows access to data items by content. The idea is that you stick an item into a dictionary so you can find it later. The primary operations on a dictionary are:
* `Search(D, k)` - given a dictionary `D` and a search key `k`, return a pointer to the element in the dictionary whose key value is `k` (if one exists)
* `Insert(D, x)` - given a dictionary `D` and a data item `x`, add it to the set in the dictionary
* `Delete(D, x)` - given a dictionary `D` and a pointer to a given data item `x`, remove the data item from the dictionary

A lot of common data processing tasks take dictionaries. For example, if we want to remove all duplicate names from a mailing list, we can initialize an empty dictionary and, for each name, search to see if it's already there, inserting it only if it's absent.
When we define a problem in terms of an abstract dictionary’s operations, we can avoid the details of the data structure’s representation and instead focus on the task
There are various dictionaries based on:
Remember that a data representation can have some efficient operations while other operations are more expensive. A lot of these operation costs depend on whether the data structure is sorted or unsorted.
####Dictionary Runtimes (Sorted Array and Unsorted Arrays)
Dictionary runtimes as a sorted array and unsorted array are:

```
            Unsorted array   Sorted array
Search      O(n)             O(log n)
Insert      O(1)             O(n)
Delete      O(1)*            O(n)
Successor   O(n)             O(1)
Predecessor O(n)             O(1)
Minimum     O(n)             O(1)
Maximum     O(n)             O(1)
```
You can see that the implementation of a dictionary using a sorted array completely reverses our notations of what is easy and what is hard.
In a sorted array, insertion and deletion become more expensive because making room for a new item or filling a hole requires moving around a lot of items. While in a sorted array, get methods like search and successor and minimum/maximum are quick operations.
When designing data structures, make sure to balance all the different operations it supports. The fastest data structure to support both operations A and B might not be the fastest data structure to support either operation A or B.
####Dictionary Runtimes (Lists)
A dictionary can be implemented as sorted list or unsorted list. A dictionary can also be implemented as a singly-linked list or doubly-linked list.
We then have:
Dictionary runtimes as the above are:

```
            Singly    Doubly    Singly   Doubly
            Unsorted  Unsorted  Sorted   Sorted
Search      O(n)      O(n)      O(n)     O(n)
Insert      O(1)      O(1)      O(n)     O(n)
Delete      O(n)*     O(1)      O(n)*    O(1)
Successor   O(n)      O(n)      O(1)     O(1)
Predecessor O(n)      O(n)      O(n)*    O(1)
Minimum     O(n)      O(n)      O(1)     O(1)
Maximum     O(n)      O(n)      O(1)*    O(1)
```
Same as unsorted arrays, search operations are slow when we have unsorted while maintenance operations are fast.
###Dictionary Runtimes (Binary Search Trees)
So far we’ve seen data structures that allow either fast search OR flexible updates, but not both fast search AND flexible updates.
Binary search requires fast access to two elements: the median elements above and below the given node. Combining a 'linked list' with two pointers per node gives us the binary search tree.
A rooted binary tree is recursively defined as either empty, or consisting of a root node together with a left and a right subtree (each of which is itself a rooted binary tree).
A binary search tree labels each node in a binary tree with a single key (say `x`). The labeling scheme is special:
* All nodes in the left subtree of `x` have keys < `x`
* All nodes in the right subtree of `x` have keys > `x`
* For any binary tree on `n` nodes, and any set of `n` keys, there is exactly one labeling that makes it a binary search tree

####Binary Search Tree Examples
For example, on a three node tree, there are five distinct binary search trees:

```
    3     3     2     1     1
   /     /     / \     \     \
  2     1     1   3     3     2
 /       \             /       \
1         2           2         3
```
####Binary Search Tree Implementation and Operations
Binary tree nodes have a key field, pointers to their left and right children, and (in many implementations) a pointer to their parent.
Binary trees have the following basic operations (see the sketch after this list):
* Searching - the binary search tree labeling scheme makes this easy: either we find the key `x` at the current node, or we proceed left (`< x`) or right (`> x`). This runs in `O(h)` time, where `h` is the height of the tree. Remember that both the left and right subtrees of a binary search tree are also binary search trees
* Finding the minimum and maximum elements - the smallest key is in the leftmost subtree of the root, and the largest key is in the rightmost subtree of the root
* Traversal - visiting all the nodes takes `O(n)` time, where `n` is the number of nodes in the tree
* Insertion - insert `x` into the tree `T` at the position where it will be found again; insertion runs in `O(h)` time
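A minimal binary search tree sketch with `O(h)` search and insert, following the labeling scheme above:

```python
class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None   # subtree with keys < key
        self.right = None  # subtree with keys > key

def insert(root, key):
    if root is None:
        return TreeNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicates are ignored

def search(root, key):
    if root is None or root.key == key:
        return root
    if key < root.key:
        return search(root.left, key)
    return search(root.right, key)

root = None
for k in [7, 3, 9, 5]:
    root = insert(root, k)
print(search(root, 5).key)  # 5
```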
####Binary Search Tree Runtimes
All three dictionary operations take `O(h)` time, where `h` is the height of the tree. The smallest height we can hope for occurs when the tree is perfectly balanced, where `h = log n`. The issue is that the tree has to be perfectly balanced.
With binary search trees, the order in which we insert data greatly affects the runtime (since that's the shape/height we create for our tree). For example, if we insert data that has already been sorted, we get a tree of height `n` for `n` list items. Say we have a list of `[5, 9, 11, 12]`. This creates a binary tree of:

```
5
 \
  9
   \
    11
     \
      12
```

So with the above example, we have `O(n)` runtime because we only have right insertions, creating a skinny tree. Our implementation depends a lot on randomization, hoping for the best results based on the input.
There are different implementations of binary search trees called balanced binary search trees, such as red-black trees and splay trees, that guarantee the height is always `O(log n)` (by adjusting the tree a little after each insertion, keeping it close enough to balanced).
Many algorithms process items in a specific order. Priority Queues are data structures that provide more flexibility than simple sorting; instead they allow new elements to enter a system at arbitrary intervals. It’s cost-effective to insert a new job into a priority queue than to re-sort everything on each new item’s arrival. So the idea is that you can add items at any time, but only the item with the highest priority can be removed first.
Priority Queues are often implemented with heaps since we’re sure that the value of a node is less than all the values in its subtrees. The smallest element is the one with the highest priority.
We have three primary operations:
* `Insert(Q, x)` - given an item `x` with key `k`, insert it into the priority queue `Q`
* `Find-Minimum(Q)` or `Find-Maximum(Q)` - return a pointer to the item whose key value is smaller (larger) than any other key in `Q`
* `Delete-Minimum(Q)` or `Delete-Maximum(Q)` - remove the item from `Q` whose key is minimum (maximum)

Examples of priority queues could be:
When the data structure is an unsorted array, a sorted array, or a balanced search tree, the runtimes are:

```
            Unsorted   Sorted   Balanced
            Array      Array    Tree
Insert      O(1)       O(n)     O(log n)
Find Min    O(1)       O(1)     O(1)
Delete Min  O(n)       O(1)     O(log n)
```
Notice how find minimum is in constant time for each of these data structures. The trick is to use an extra variable to store the pointer/index to the minimum entry in each of these structures so we can return this value whenever we’re asked to find the minimum.
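In Python, a priority queue is commonly sketched with the built-in `heapq` module (a binary min-heap under the hood):

```python
import heapq

pq = []
heapq.heappush(pq, (2, 'write code'))  # (priority, item)
heapq.heappush(pq, (1, 'fix bug'))
heapq.heappush(pq, (3, 'ship it'))

print(pq[0])              # (1, 'fix bug') - find-min peeks in O(1)
print(heapq.heappop(pq))  # (1, 'fix bug') - delete-min is O(log n)
```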
We briefly cover the structure of data, then go into a couple of design approaches with incremental and divide and conquer, which are opposites of each other.
* The incremental approach repeats a block of statements, e.g. with `for` or `while` loops.
* The divide and conquer approach breaks the problem into several subproblems that are similar to the original problem, but smaller in size. This is used in algorithms like merge sort.

Incremental is the repetition of a block of statements. An example is:
```python
a = 0
for i in range(10):  # i = 0, 1, 2, ..., 8, 9
    a += 1
print(a)  # 10
```
####loop invariant
As we create our loops, we need to be aware of loop invariants (aka invariants): a condition that holds true before and after each iteration of the loop. Stating the invariant (and checking that it holds at initialization, is maintained by each iteration, and gives the desired result at termination) is how we reason that a loop is correct.
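As a small illustration (my own example, not from the source text), here's a loop whose invariant we can state and even check with an `assert`:

```python
# Invariant: at the top of each iteration, total == sum of mylist[0:i]
mylist = [3, 1, 4, 1, 5]
total = 0
for i in range(len(mylist)):
    assert total == sum(mylist[:i])  # the invariant holds every time through
    total += mylist[i]
# Termination: i has reached len(mylist), so total == sum(mylist)
print(total)  # 14
```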
####Approach: Divide and Conquer
The divide and conquer approach is to break apart a large problem into several subproblems that are similar to the original problem but smaller in size. We solve the subproblems recursively (i.e. they call themselves) until they reach the base case, and then combine these solutions to create a solution to the original problem.
####recursion
Recursion is a method where the solution to a problem depends on solutions to smaller instances of the same problem. You can visually see this as a recursion tree, which is a tree diagram of recursive calls. Recursion has to obey three laws:

1. A recursive algorithm must have a base case.
2. A recursive algorithm must change its state and move toward the base case.
3. A recursive algorithm must call itself, recursively.
Recursive example of calculating a Fibonacci number:

```python
def fib(n):
    """ Return the nth Fibonacci number """
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)
```
Recursive example of removing duplicate letters that are next to each other:

```python
def removeDups(word):
    """ Remove adjacent duplicate letters """
    if len(word) <= 1:
        return word
    elif word[0] == word[1]:
        return removeDups(word[1:])
    else:
        return word[0] + removeDups(word[1:])

word = 'aaaabbbbbcccdd'
print(removeDups(word))  # abcd
```
Other examples of recursive problems include:
Sorting is the basic building block that many other algorithms are built around. By really understanding sorting, we can then understand how to solve other problems. A lot of the interesting ideas in the design of algorithms appear in the context of sorting (e.g. divide-and-conquer, data structures, randomized algorithms).
There are several fundamental sorting algorithms, including:
###Sorting Algorithms - Comparison Sorts
A basic computational problem is the sorting problem, where you sort a sequence of n numbers (aka keys). We apply the above general approaches (insertion, divide and conquer) using different types of algorithms. The following algorithms (insertion sort, bubble sort, selection sort, merge sort, heapsort, and quicksort) are all comparison sorts (i.e. they determine the order of an input array by comparing elements).
How to use Sorting:
A really important algorithm design technique is to use sorting as a basic building block because many other problems become easy once a set of items is sorted.
Clever sorting algorithms exist that run in `O(n log n)`, whereas naive sorting algorithms are really slow with `O(n^2)`. Let's look at some sample numbers for n:

| n      | n^2/4      | n lg n  |
|--------|------------|---------|
| 10     | 25         | 33      |
| 100    | 2500       | 664     |
| 1000   | 250000     | 9965    |
| 10000  | 25000000   | 132877  |
| 100000 | 2500000000 | 1660960 |
Some sample applications of sorting are:

* *Searching* - binary search lets you test whether an item is in a dictionary in `O(log n)` time, providing the keys are all sorted.
* *Closest pair* - given a set of `n` numbers, how do you find the pair of numbers that have the smallest difference between them? Once the numbers are sorted, the closest pair of numbers must lie next to each other somewhere in sorted order. A linear-time scan through the sorted numbers completes the job for a total of `O(n log n)` time, including the sorting.
* *Element uniqueness* - are there any duplicates in a given set of `n` items? This is a special case of the above closest-pair problem, where the pair is separated by a gap of zero. The most efficient algorithm sorts the numbers and then does a linear scan through, checking all adjacent pairs.
* *Frequency distribution* - given a set of `n` items, which element occurs the largest number of times in the set? If the items are sorted, we can sweep from left to right and count them, since all identical items will be lumped together during sorting.
* *Selection* - what is the `k`th largest item in an array? If the keys are placed in sorted order, the `k`th largest can be found in constant time by simply looking at the `k`th position of the array.
* *Convex hulls* - what is the polygon of smallest area that contains a given set of `n` points in two dimensions? Think of the convex hull as a rubber band stretched over the points in the plane and then released. Basically, the convex hull gives a nice presentation of the shape of the points and is needed to build more complicated geometric algorithms. Once you have the points sorted by the `x` coordinate, the points can be inserted from left to right into the hull. Since we know that the right-most point is always on the boundary, we know the point will appear in the hull. Adding a new point might cause another point to be deleted, but even then the total time is linear after the sorting has been done.

Say you have two sets (size `m` and size `n`) and need to determine if they're disjoint (i.e. have no element in common). Analyze the worst-case complexity in terms of `m` and `n` if `m` is substantially smaller than `n`. You have a few different algorithms that you can use, including (a sketch of the second option follows this list):

* Sort the big set in `O(n log n)`, then do a binary search with each of the `m` elements in the small set, looking to see if it exists in the big set. Total time is `O((n+m) log n)`.
* Sort the small set in `O(m log m)`, then do a binary search with each of the `n` elements in the big set, looking to see if it exists in the small one. Total time is `O((n+m) log m)`.
* Sort both sets, then do a linear merge-style scan through them together, looking for a common element. Total time is `O(n log n + m log m + n + m)`.
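Here is a minimal sketch of the second option (sort the small set, then binary search it for each element of the big set), using Python's `bisect` module; the function name is my own:

```python
import bisect

def disjoint(small, big):
    """ Return True if the two collections share no elements.
    Sorting the small set costs O(m log m); each of the n binary
    searches costs O(log m), for O((n+m) log m) total. """
    small = sorted(small)
    for x in big:
        i = bisect.bisect_left(small, x)
        if i < len(small) and small[i] == x:
            return False
    return True

print(disjoint([3, 7], [1, 2, 4, 8, 9]))  # True
print(disjoint([3, 7], [1, 7, 9]))        # False
```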
So now that we know the importance of sorting and how to start sorting, we need to ask ourselves what order we want our items sorted in. Some questions to ask are:
Again, you might wonder why we want to learn the different ways of sorting when you're usually better off not implementing them yourself and should instead use a built-in library function. The answer is that these design techniques are very important for other algorithm problems that you will encounter.
####Incremental: insertion sort
Insertion sort is a simple sorting algorithm based on the incremental approach; it is efficient at sorting a small number of elements in place and is useful when items are mostly sorted already. The idea is to always maintain a sorted sublist in the lower portions of the list; what this means is we will have a 'sorted' and an 'unsorted' group. See example steps:
```
sorted|unsorted
0|5,3,-1
0,5|3,-1
0,3,5|-1
-1,0,3,5|
```
So what are we doing? We move items one at a time from the 'unsorted' group to the 'sorted' group (this is where the new item is 'inserted'). Notice we have a base case where the first item is sorted (because it is the only item in the sorted group). When we move an item from unsorted to sorted, we compare it against the sorted items, walking backwards until we find one that is smaller and inserting just after it. Basically, we move this invisible line that divides the 'sorted' and 'unsorted' groups one item at a time.
For example, if we want to sort a hand of playing cards:
Example Code:
""" Insertion Sort """
def insertionSort(mylist):
for index in range(1, len(mylist)):
print "Index is ", index # 1, 2, 3, 4, 5, 6, 7, 8; this is the outer loop
# setup first case (only one item) currentvalue =
mylist[index] position = index
# this is the inner loop, loops through the sorted list backwards and compares values
while position > 0 and mylist[position-1] > currentvalue:
mylist[position] = mylist[position-1] position = position - 1
# found spot in inner sorted loop to place item
mylist[position] = currentvalue
if __name__ == '__main__':
mylist = [54,26,93,17,77,31,44,55,20]
print "Original: ", mylist
insertionSort(mylist)
print "Insertion Sorted: ", mylist
So why would you want one of the O(n^2) algorithms when there are O(n log n) algorithms? Well, insertion sort is good when:
Bubble sort (aka sinking sort, ripple sort) is a simple but inefficient sorting algorithm that repeatedly goes through the list to be sorted, compares each pair of adjacent items, and swaps them if they are in the wrong order.
Here is a sample run:

```
# first pass
5 1 4 2 8  # original
1 5 4 2 8  # swap the first pair (5, 1)
1 4 5 2 8  # swap the next pair (5, 4)
1 4 2 5 8  # swap 5 and 2
1 4 2 5 8  # check the next pair (5, 8), but no swap needed (5 < 8)

# second pass
1 4 2 5 8  # state after the first pass
1 2 4 5 8  # swap 2 and 4
1 2 4 5 8  # check the next pair (4, 5), but no swap needed (4 < 5)
1 2 4 5 8  # check the next pair (5, 8), but no swap needed (5 < 8)

# third pass checks through each pair, but no swaps are needed since it's sorted
```
For example, say we were sorting scrabble tiles into alphabetical order. The name bubble sort comes from the way elements tend to move up into the correct order like bubbles rising to the surface, and you see a rippling effect for the ones that are not in the correct order. After each pass of the bubble sort, one item is definitely sorted; it takes a total of `n-1` passes to sort `n` items. Big O runtime is `O(n^2)`.
Example Code:

```python
""" Bubble Sort """
def bubbleSort(mylist):
    # pass backwards (n-1, n-2, ..., 2, 1) b/c after each pass
    # the items at the end of the list are already sorted
    for passnum in range(len(mylist)-1, 0, -1):
        for i in range(passnum):
            if mylist[i] > mylist[i+1]:  # compare neighbors
                mylist[i], mylist[i+1] = mylist[i+1], mylist[i]  # swap

if __name__ == '__main__':
    mylist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
    print("Original: ", mylist)
    bubbleSort(mylist)
    print("Bubble Sorted: ", mylist)
```
####Incremental: selection sort
Selection sort improves on bubble sort by making only one exchange for every pass through the list. The selection sort finds the largest value as it makes its pass and after completing the pass, places it in the correct location/order. What happens is that there is a ‘swap’ (where we put the largest value into the largest index; the item that was previously in the largest index is swapped over to where the previous largest value was).
Here is an example run (this variant selects the smallest item each pass):

```
# sorted | unsorted
64 25 12 22 11   # initial state
11|25 12 22 64   # find the smallest value (11) and swap it with the first element (64)
11 12|25 22 64   # find the next smallest unsorted value (12), swap with the second element (25)
11 12 22|25 64   # find the next smallest unsorted value (22), swap with the third element (25)
11 12 22 25|64   # find the next smallest unsorted value (25), already in place, no swap needed
11 12 22 25 64|  # all sorted
```
Similar to bubble sort, after the initial pass, the largest item appears in place. The final item is in place after `n-1` passes to sort `n` items. This is slightly faster than bubble sort since we don't have to do as many exchanges. Big O runtime is still `O(n^2)`.
""" Selection Sort from largest to smallest """
def selectionSort(mylist):
for fillslot in range(len(mylist)-1, 0, -1):
#print fillslot
# backwords (8,7,6,..2,1) b/c other items are already sorted
positionOfMax = 0 for i in range(1, fillslot+1):
if mylist[i] > mylist[positionOfMax]: # is value greater than value at max
positionOfMax = i
# to move the largest value to the largest index,
# we 'swap' the item # currently in the largest index
position mylist[fillslot], mylist[positionOfMax] = mylist[positionOfMax], mylist[fillslot]
if __name__ == '__main__':
mylist = [54,26,93,17,77,31,44,55,20]
print "Original: ", mylist
selectionSort(mylist)
print "Selection Sorted: ", mylist
The algorithm can be changed to swap out the smallest item instead of the largest item (depending on which direction you want to sort). If so, the idea is the same; we select the smallest unsorted item (instead of the largest) and then swap it with the item in the next position to be filled.
We look through the entire array for the smallest element, once you find it you swap it (smallest element with the first element of the array). Then you look for the smallest element in the remaining array (the array without the first element) and swap with the second element, etc.
Realistically, you wouldn’t normally need selection sort because it is O(n^2) and so is not an optimal solution for large lists.
####Divide and Conquer: merge sort
Merge sort is a recursive algorithm that uses the divide-and-conquer approach to keep splitting a list in half until there are pairs of individual items. Don't worry if the list has an odd number of items; just add the last item anywhere. We then go in reverse (instead of splitting), combining the items back together using a merge. The merge compares the pairs and determines which is smaller/larger, which sorts the smaller list. The trick is that since all of our sublists are already sorted, when we combine pairs of lists, we only need to look at the first element of each list.
Use merge sort when quicksort is impractical (i.e. if you can't risk quicksort's worst-case runtime of `O(n^2)`). Merge sort is also better suited to parallel computing.
Here’s a chart:
Algorithm Time Complexity | Space Complexity
Best Average Worst | Worst
Quicksort Ω(n log(n)) Θ(n log(n)) O(n^2) | O(log(n))
Mergesort Ω(n log(n)) Θ(n log(n)) O(n log(n)) | O(n)
Here is an example run:

```
# position 0 1 2 3 4 5 6 7  (the code below tracks indexes i, j, k)
9 7 3 8 4 5 6 2                # original values in list
9 7 3 8 | 4 5 6 2              # start splitting up (divide by two) into two lists
9 7 | 3 8 | 4 5 | 6 2          # divide by two again into four lists
9 | 7 | 3 | 8 | 4 | 5 | 6 | 2  # divide until all items are separated
7 9 | 3 8 | 4 5 | 2 6          # now merge back, with each merged pair sorted
3 7 8 9 | 2 4 5 6              # merge again; the key is we only compare the first
                               # item from each list (since they are sorted)
2 3 4 5 6 7 8 9                # combine again, get the final sorted list
```
For example, if we want to sort a hand of playing cards:
Merge sort is good for data that is too big to fit into memory at once because the pattern of storage is regular. It is especially good for data stored as linked lists.
Example Code:
""" Merge Sort """
def mergeSort(mylist):
print "Splitting", mylist
if len(mylist) > 1:
mid = len(mylist) // 2 lefthalf = mylist[:mid] righthalf = mylist[mid:]
mergeSort(lefthalf) mergeSort(righthalf)
# below code merges the two smaller sorted lists to larger
sorted list i = 0 # left half index j = 0
# right half
index k = 0 # main / large sorted list
while i < len(lefthalf) and j < len(righthalf):
# take the smallest value from either left or right half
if lefthalf[i] < righthalf[j]:
mylist[k] = lefthalf[i] # smaller value on lefthalf
i += 1
else:
mylist[k] = righthalf[j] # smaller value on righthalf
j += 1
k += 1
# insert remaining values from lefthalf
while i < len(lefthalf):
mylist[k] = lefthalf[i] i += 1 k += 1
# insert remaining values from righthalf
while j < len(righthalf):
mylist[k] = righthalf[j] j += 1 k += 1
print "Merging", mylist
if __name__ == '__main__':
mylist = [54,26,93,17,77,31,44,55,20]
print "Original: ", mylist
mergeSort(mylist)
print "Merge Sorted: ", mylist
####Divide and Conquer: quick sort
Quick sort is an efficient algorithm that does a sort in place by splitting the array into two smaller arrays, one with lower value elements and one with higher value elements based off a ‘pivot’ element.
Quicksort is my go-to sorting algorithm. This is a really good algorithm to know offhand. It performs fast on most real-world data (because its inner loop can be efficiently implemented on most architectures).
Example Run:
```
6 5 1 3 8 4 6 9 5   # initial list
# we pick a pivot (say for this example it is the last element of the list,
# which is 5; it can also be any other item, e.g. the first item of the list, or the median)
|6 5 1 3 8 4 6 9 5
# we set up the pivot wall (separated by `|`); items to the left of the wall
# should be smaller than the pivot, items to the right of the wall are larger
|6 5 1 3 8 4 6 9 5
# we compare the pivot element 5 with the first element to the right of the wall (6);
# 6 > 5 so no swap is needed
|6 5 1 3 8 4 6 9 5
# we compare the pivot element 5 with the next element to the right of the wall (5);
# 5 == 5 so no swap is needed
1|5 6 3 8 4 6 9 5
# our current element 1 is smaller than the pivot element 5, so we swap the current
# element (1) with the lowest-index item on the right side of the wall (6) and move the wall
```
A good video of quicksort is here: https://www.youtube.com/watch?v=kUon6854joI
It’s similar to merge sort in that it does the divide and conquer approach. The advantage is that you do not need as much storage over merge sort, but the performance could possibly be diminished (depending on if the pivot value selected is near the middle).
Example Code:
""" Quick Sort in Python3
Quick sort uses divide and conquer to gain the same advantages as merge sort,
with the benefit of using less storage, but at the cost of a worse worst case runtime
O(n^2) if the pivot values are bad.
"""
import pdb
from typing import List
def quickSort(mylist):
""" Initialize our recursive function """
quickSortHelper(mylist, 0, len(mylist)-1)
def quickSortHelper(mylist, first, last):
""" Recursive function to split up """
if first < last: # check if need to sort still
splitpoint = partition(mylist, first, last)
# now that we know our splitpoint, we can then recursively run quicksort on the list's bottom half and top half
quickSortHelper(mylist, first, splitpoint-1)
quickSortHelper(mylist, splitpoint+1, last)
def partition(mylist, first, last):
""" Partition Process, made up of:
* Pick a pivot value (i.e. what we'll compare our unsorted numbers to)
Based off this value, we'll compare our unsorted values and either move
our items to the left of the pivot or to the right of the pivot.
* """
pivotvalue = mylist[first] # get the first value as pivotvalue
leftmark = first + 1
rightmark = last
done = False
while not done:
# Go from leftmost side onwards (to right) and try to find a value
# that is greater than the pivot value (i.e. left side of pivot should be
# smaller values than pivot value, if we found one that is greater, we
# stop at leftmark, saying we need to do a swap to the right side)
while leftmark <= rightmark and mylist[leftmark] <= pivotvalue:
leftmark += 1
# Go from rightmost side inwards (to left) and try to find a value
# that is less than the pivot value (i.e. right side of pivot should be
# greater values than pivot value, if we found one that is smaller, we
# stop at rightmark, saying we need to do a swap to the left side)
while rightmark >= leftmark and mylist[rightmark] >= pivotvalue:
rightmark -= 1
if rightmark < leftmark:
done = True # we're done sorting through this list because we've crossed
else:
# we have a swap between a value in the left list and a value in the right list
mylist[leftmark], mylist[rightmark] = mylist[rightmark], mylist[leftmark]
# Once rightmark is less than leftmark, then rightmark is now the split point.
# That means what we picked as the pivot value can now be exchanged with the
# contents of the split point and the pivot value is now in the correct place
# Note: remember that our pivot value was the first value in our list
mylist[first], mylist[rightmark] = mylist[rightmark], mylist[first]
return rightmark
if __name__ == '__main__':
mylist = [54, 26, 93, 17, 77, 31, 44, 55, 20]
print("Original: ", mylist)
quickSort(mylist)
print("Quick Sorted: ", mylist)
The quickselect algorithm is a selection algorithm, whereas the quicksort algorithm is a sorting algorithm. It's similar to quicksort except that we only need to recursively call one side of the partition to be re-partitioned. Instead of quicksort's average runtime of `O(n log n)`, we get an average of `O(n)`.

What this solves is that if we get an unordered list of items and want to find the k-th smallest or k-th largest item, we can get the item in `O(n)`.
""" Python 3 implementation of a quickselect algorithm """
from typing import List
import unittest
import random
class Solution:
def quickselect(self, items, item_index):
if items is None or len(items) < 1:
return None
if item_index < 0 or item_index > len(items) - 1:
raise IndexError()
return self.select(items, 0, len(items) - 1, item_index)
def select(self, lst, l, r, index):
# base case
if r == l:
return lst[l]
# choose random pivot
pivot_index = random.randint(l, r)
# move pivot to beginning of list
lst[l], lst[pivot_index] = lst[pivot_index], lst[l]
# partition
i = l
for j in range(l+1, r+1):
if lst[j] < lst[l]:
i += 1
lst[i], lst[j] = lst[j], lst[i]
# move pivot to correct location
lst[i], lst[l] = lst[l], lst[i]
# recursively partition one side only
if index == i:
return lst[i]
elif index < i:
return self.select(lst, l, i-1, index)
else:
return self.select(lst, i+1, r, index)
class SolutionTest(unittest.TestCase):
def test_quickselect(self):
s = Solution()
response = s.quickselect([12, 2, 4, 3, 5], 2)
assert response == 4
response = s.quickselect([12, 2, 4, 3, 5], 0)
assert response == 2
####Divide and Conquer: heap sort
Heap sort takes the best properties of merge sort (the run time) and the efficiency of insertion sort's ability to sort in place (i.e. only a constant number of elements are stored outside the input array at any time). What makes heapsort unique is that it uses a data structure called a heap to help manage information (instead of a linear-time search); heapsort divides its input into a sorted and an unsorted region, and it iteratively shrinks the unsorted region by extracting the largest element and moving that to the sorted region. The heap is also really efficient at managing priority queues.
Really, heapsort is just an implementation of selection sort using the right data structure.
Note: In programming languages like Java or Lisp, heap refers to ‘garbage-collected storage’; this is not what we’re talking about here.
Given an array `A` that represents a heap, we can look at the two attributes (`length`, `heap-size`) and determine what part is valid (i.e. the correctly sorted region) and what is still the heap (i.e. unsorted):

* `A.length` gives us the number of elements in the array
* `A.heap-size` gives us how many elements in the heap are stored within array `A`
* The heap would then be calculated as `A[1 ... A.heap-size]` where `0 <= A.heap-size <= A.length`
#####Heaps
Heaps are a simple data structure for efficiently working with the priority queue operations insert and extract-min. They work by maintaining a partial order on the set of elements that is weaker than the sorted order (i.e. efficient to maintain), yet stronger than random order (so finding the minimum element is quick).
Power in any hierarchically-structured organization is reflected by a tree, where each node represents a person. Each edge `(x, y)` implies that `x` directly supervises `y` (aka `x` *dominates* `y`). The top person is at the root of the heap (aka the 'top of the heap').
There are a couple of variations for a heap-labeled tree, where we have a binary tree in which the key labeling of each node dominates the key labeling of each of its children.
A binary tree might look like:

```
          1492
         /    \
     1783      1776
     /  \      /  \
 1804  1865  1945  1963
   /
 1918
```
This binary tree might be represented as nodes with pointers to their two children. However, the memory used by the pointers can easily outweigh the size of the keys.

The heap is a data structure that will let us represent binary trees without using any pointers. We store the data as an array of keys, and the important difference is that we use the position of the keys to implicitly satisfy the role of the pointers.
We store the root of the tree in the first position of the array. In the second position, we have the left child; in the third position, we have the right child. The positions can actually be calculated: for a key at position `k` (1-based indexing), the left child sits at position `2k`, the right child sits at `2k+1`, and the parent of `k` sits at position `k/2`. We can then move around without any pointers.
```
position  key
1         1492
2         1783
3         1776
4         1804
5         1865
6         1945
7         1963
8         1918
```
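The position arithmetic can be sketched directly (assuming 1-based indexing, with a placeholder at index 0 of the Python list):

```python
def left(k):   return 2 * k      # left child of position k
def right(k):  return 2 * k + 1  # right child of position k
def parent(k): return k // 2     # parent of position k

# pad index 0 so the root sits at position 1
heap = [None, 1492, 1783, 1776, 1804, 1865, 1945, 1963, 1918]
k = 2                                 # position of 1783
print(heap[left(k)], heap[right(k)])  # 1804 1865 (its children)
print(heap[parent(k)])                # 1492 (its parent)
```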
####Heap Issues
Space

We can store any binary tree in an array without pointers, but the issue is that we might get a tree that is sparse. If we were given a tree with height `h` that is sparse (number of nodes `n < 2^h`), then all the missing internal nodes still take up space in our structure, since we need to represent the full binary tree to maintain the positional mapping between parents and children. Space efficiency thus demands that we not allow holes in our tree (i.e. that each level be packed as much as it can be).
Search for a particular key
We can't search for a particular key in a heap because we don't know anything about the relative order of the n/2 leaf elements in a heap; we have to do a linear search through them. In order to do a binary search (i.e. we have a sorted list, then divide by 2 to see if we're looking at the upper or lower half, etc.), we need a binary search tree. We're missing the order of elements in the heap.
Heaps are constructed incrementally, by inserting each new element into the left-most open spot in the array, namely the (n+1)st position of a previously n-element heap. We get a balanced shape out of a heap-labeled tree, but note that we don’t necessarily maintain the dominance ordering of the keys.
The new key might be less than its parent in a min-heap or greater than its parent in a max-heap. We'll need to swap any dissatisfied element with its parent. We do this process repeatedly, essentially bubbling up the new key to its proper position. The swap process takes constant time. With an `n` element heap, we have a height of `lg n`, which makes each insertion take at most `O(log n)` time. That means an initial construction of the heap takes `O(n log n)` time.
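A minimal min-heap insert with bubble-up might look like this (my own sketch, again with 1-based indexing and a placeholder at index 0):

```python
def bubble_up(heap, i):
    """ Swap a dissatisfied element with its parent until the parent dominates it """
    while i > 1 and heap[i // 2] > heap[i]:
        heap[i // 2], heap[i] = heap[i], heap[i // 2]
        i = i // 2

def insert(heap, key):
    """ Place the new key in the left-most open spot, then bubble up: O(log n) """
    heap.append(key)
    bubble_up(heap, len(heap) - 1)

heap = [None]  # index 0 is a placeholder so children sit at 2k and 2k+1
for key in [1783, 1492, 1918, 1776]:
    insert(heap, key)
print(heap)  # [None, 1492, 1776, 1918, 1783]
```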
We can make faster heaps using the `bubble_down` procedure, which will create heaps in linear time instead of `O(n log n)` time. However, also note that faster construction time does not help improve worst-case sorting performance.
####Extracting the Minimum from a Heap
We can identify the dominant element of a heap since it sits at the top of the heap, in the first position of the array. Removing the top element leaves a hole in the array. We can fill the hole by moving the element from the right-most leaf (at the `n`th position of the array) into the first position.
Heapify with bubble_down

The shape of the tree is restored, but the labeling of the root might not satisfy the heap property (e.g. in a min-heap, the new root's key might be greater than a child's key, i.e. the new root may be dominated by both of its children). We can fix this by bubbling the element down the heap until it dominates all its children; this percolate-down operation is called heapify because it merges two heaps (the subtrees below the original root) with a new key.
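A matching sketch of extract-min with `bubble_down` (heapify), continuing the 1-indexed min-heap from the insert sketch above:

```python
def bubble_down(heap, i, n):
    """ Swap the element at i down until it dominates both of its children """
    smallest = i
    for child in (2 * i, 2 * i + 1):
        if child <= n and heap[child] < heap[smallest]:
            smallest = child
    if smallest != i:
        heap[i], heap[smallest] = heap[smallest], heap[i]
        bubble_down(heap, smallest, n)

def extract_min(heap):
    """ Take the root, fill the hole with the right-most leaf, then bubble down """
    minimum = heap[1]
    heap[1] = heap[-1]
    heap.pop()
    if len(heap) > 1:
        bubble_down(heap, 1, len(heap) - 1)
    return minimum

heap = [None, 1492, 1776, 1918, 1783]
print(extract_min(heap))  # 1492
print(heap)               # [None, 1776, 1783, 1918]
```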
####Heapsort
When we exchange the maximum element with the last element and call heapify repeatedly, we get an `O(n log n)` sorting algorithm called heapsort. Heapsort is easy to program, runs an in-place sort (i.e. uses no extra memory), and has a worst case of `O(n log n)` time, which is the best that you can expect from any comparison-based sorting algorithm.
There are some other algorithms that are slightly faster in practice, but you can’t go wrong with heapsort for sorting data that sits in your computer’s memory.
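For a quick sketch (not the classic in-place array version), Python's `heapq` expresses the same idea: build a heap, then repeatedly extract the dominant element:

```python
import heapq

def heapsort(iterable):
    """ O(n log n): linear-time bulk heap construction, then n extract-mins """
    heap = list(iterable)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heapsort([54, 26, 93, 17, 77, 31, 44, 55, 20]))
```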
####(Binary) Heap
A (binary) heap data structure is an array object that we can view as a binary tree. Think of this algorithm as two parts: first, we have an unsorted list (e.g. `[6, 5, 3, 1, 8, 7, 2, 4]`) that we use to create the heap, a data structure that looks like a binary tree; second, we repeatedly pull off the dominant element to produce the sorted output. A heap is a complete binary tree in which each node is less than all the values in its subtrees (or greater if reversed). An example looks like:
```
      3
    /   \
   9     4
  / \   / \
 26 10 18 20
```
As we're building this binary tree, the heap swaps elements depending on the type (e.g. min or max) of the binary heap (e.g. sorting smallest to largest, larger nodes don't stay below smaller node parents and end up swapping; `8` can't be below `5` on the heap). Once the binary tree is built, we have a tree where each array index represents a node and also encodes the index of the node's parent, left child branch, and right child branch.
####Priority Queue
As mentioned earlier, heap sort is great for creating priority queues, which are data structures for maintaining a set S of elements, each with an associated value called a key. There are max-priority queues (e.g. used to schedule jobs on a server and set their relative priorities) and min-priority queues (e.g. used for discrete event-driven simulation models, like determining how many patients can be served when changing from 8 to 9 hours of operation when the average surgery takes, say, 4 hours).
##Sorting Algorithms with Linear Time (Decision Tree Models)
The previously mentioned algorithms are comparison sorts, which determine the sort order based only on comparisons between the input elements. If we make the assumption that all the input elements are distinct, we can model comparison sorts with decision trees, which are full binary trees that represent the comparisons between elements made by a sorting algorithm (say elements `a_i` and `a_j`). The decision-tree model gives a lower bound of `Ω(n log n)` comparisons for any comparison sort; to sort in linear time, the algorithms below instead gather information about the input (e.g. the range of the keys) without comparing elements against each other.
####Counting Sort
Placeholder for smart stuff.
####Radix Sort
Placeholder for smart stuff.
####Bucket Sort
Placeholder for smart stuff.
Probabilistic analysis is the use of probability in the analysis of problems. We can use this in analyzing the running time of an algorithm or we can use it to analyze other quantities, such as who to hire. We have some examples:
For this example, we want to hire an office assistant. We interview candidates and determine if they are better than the current assistant (if so, replace the current assistant right then and there). There is a cost to hiring and firing someone. We want to find the expected cost after interviewing everyone (which is a fixed n candidates).
Say we have a list of candidates and rank them into an ordered list of best possible candidates: `[rank1, rank2, rank3, ..., rankn]`. Saying that applicants come in a random order is equivalent to saying there are `n!` permutations of the numbers 1 through n. We call this a uniform random permutation, which means that each of the possible `n!` permutations appears with equal probability.
We first assume (or make sure, by randomly permuting the list) that candidates arrive in a random order. We can then check probabilities and expectations using an indicator random variable. For example, if we flip a coin, we compare the number of times it actually comes up heads (say, using a random number generator) to what we expect.
How many people must there be in a room before there is a 50% chance that two of them are born on the same day of the year? We have:

* `k` = number of people in the room; we index the people in the room with integers (e.g. 1, 2, 3, ..., k)
* `n` = 365 days; we ignore leap years
* Assume that birthdays are uniformly distributed across `n` days and are independent.

####Balls and Bins
If we randomly toss identical balls into `b` bins (1 through `b`), and assuming the tosses are independent with an equal chance of ending up in any bin, the probability that a tossed ball lands in any given bin is `1/b` (where success is falling into the given bin). The ball tossing can be seen as a sequence of Bernoulli trials (i.e. binomial trials, random experiments where there are exactly two possible outcomes: success and failure). This answers questions like:
####Streaks
If you flip a fair coin ‘n’ times, what is the longest streak of consecutive heads that you expect to see?
Graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph is made up of vertices (aka nodes; note that a node is called a leaf if it has no children) and lines that connect them are called edges.
Dijkstra's shortest path algorithm is commonly used to find the shortest paths on a graph from a source to all vertices.
The A* (A Star) algorithm is used in pathfinding and graph traversal, which is the process of finding a path between multiple points (aka nodes). It's used due to its performance and accuracy, but in practical travel-routing systems it is generally outperformed by algorithms that can pre-process the graph to attain better performance.
####Prim’s Algorithm for Minimum Spanning Tree
Prim's Algorithm for Minimum Spanning Tree is a greedy algorithm that builds a spanning tree (all vertices must be connected) by repeatedly adding the cheapest edge that connects the two disjoint subsets of vertices (those already in the tree and those not yet in it).
TODO: How this algorithm works
####Trees
Trees are a type of graph and they're described as:

* Each tree has a root node
* The root node has zero or more child nodes
* Each child node has zero or more child nodes, and so on.
The tree cannot contain cycles. The nodes may or may not be in a particular order, they could have any data type as values, and they may or may not have links back to their parent nodes.
Real life examples of trees:

* *HTML documents* - the root is the `<html>` node. The next level has the `<head>` node and `<body>` node; `<head>` splits into `<meta>` and `<title>`, while `<body>` splits into `<h1>` and `<h2>`.
* *Filesystems* - the root is `/`. The next level has `dev/`, `etc/`, `usr/`, `var/`, etc.

Pieces of Trees

Take an example tree like a Windows filesystem:

```
C:
-> Users\
   -> WLiu
      -> My Documents
```

* Parent - node A is the parent node of B if A has outgoing edges that connect to B.

How to approach trees
We can think of trees two different ways, as a large set of rules or as a smaller recursive set of rules:

Types of Trees

There are different types of trees that can take on different properties:

* *Binary search tree* - `all left descendants <= n < all right descendants` holds for every node `n`. Note that under some definitions a binary search tree cannot have duplicates; in others, the duplicate values sit on the right or can be on either side. All are valid definitions.
* *Balanced tree* - a tree balanced enough to ensure `O(log n)` run times for `insert` and `find` operations.

Traversing Trees

You can traverse binary trees using a variety of methods (the most common being 'in-order' traversal):
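For instance, here's a minimal in-order traversal sketch (the node class is my own; on a binary search tree this visits the keys in sorted order):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def in_order(node):
    """ Visit the left subtree, then the node itself, then the right subtree """
    if node is not None:
        yield from in_order(node.left)
        yield node.key
        yield from in_order(node.right)

root = Node(2, Node(1), Node(3))
print(list(in_order(root)))  # [1, 2, 3]
```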
Tries (prefix trees)
A trie (aka prefix tree) is a special kind of tree; a trie is like an n-ary tree in which characters are stored at each node, and each path down the tree may represent a word.

There are `*` nodes (aka null nodes) that are used to indicate complete words (e.g. `Many*` means 'many' is a complete word, or `Ma*` means there are lots of words that start with `Ma`).
A trie is commonly used to store the entire English language for quick prefix lookups. A hash table can quickly look up whether a string is a valid word, but it cannot tell us if a string is a prefix of any valid words.
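A minimal trie sketch (class names are my own), with an `is_word` flag playing the role of the `*` marker and a `starts_with` check that a hash table can't provide:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # plays the role of the '*' (complete word) marker

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def starts_with(self, prefix):
        """ True if any inserted word starts with this prefix """
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True

t = Trie()
t.insert('many')
print(t.starts_with('ma'))  # True
print(t.starts_with('mo'))  # False
```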
#####Types of Trees
We have different types of trees including:
####Example Trees
Trees belong to the category of directed acyclic graphs (DAGs). Git is a great example of a DAG; another example is an Airflow scheduled job.
####Graphs
A tree is a type of graph, but not all graphs are trees. A tree is a connected graph without cycles. A graph is a collection of nodes with edges between (some of) them. Graphs are used to model pairwise relations between objects.
####Real Life examples of Graphs
Computers can use graphs to find the shortest, quickest, or easiest path from one place to another. Some real life scenarios include:
####Classic examples of Graphs
####Vocabulary and Definitions for Graphs
A graph can be represented by `G = (V, E)`, where for the graph `G`, `V` is a set of vertices and `E` is a set of edges. Each edge is a tuple `(v, w)` where `v, w ∈ V`. Optionally, we can add a third component to the edge tuple to represent a weight.
####Graph Abstract Data Type (ADT)
This is how to setup a graph abstract data type (ADT) that we can build off of:
* `Graph()` creates a new, empty graph
* `addVertex(vert)` adds an instance of `Vertex` to the graph
* `addEdge(fromVert, toVert)` adds a new, directed edge to the graph that connects two vertices
* `addEdge(fromVert, toVert, weight)` adds a new, weighted, directed edge to the graph that connects two vertices
* `getVertex(vertKey)` finds the vertex in the graph named `vertKey`
* `getVertices()` returns a list of all vertices in the graph
* `in` returns `True` for the statement `vertex in graph` if the given vertex is in the graph, `False` otherwise

####Ways to store a graph (adjacency list and matrix)
You can store graphs as an adjacency list (the most common way) or as an adjacency matrix. When two vertices are connected by an edge, they are adjacent, thus the names of the list and matrix. There are advantages and disadvantages to both.
* *Adjacency Matrix* - an `N*N` boolean matrix (where `N` is the number of nodes). A value at `matrix[v][w]` (row `v`, column `w`) indicates an edge from node `v` to node `w`.

Example adjacency matrix:

```
     v0  v1  v2  v3  v4  v5
v0        5               2
v1            4
v2                9
v3                    7   3
v4    1
v5            1       8
```

* *Adjacency List* - store a `Vertex` class as a dictionary or a list (the sample below is as a dict where the keys are the vertices and the values are the weights).

Example adjacency list:

```
v0: id=v0, adj={v1:5, v5:2}
v1: id=v1, adj={v2:4}
v2: id=v2, adj={v3:9}
v3: id=v3, adj={v4:7, v5:3}
v4: id=v4, adj={v0:1}
v5: id=v5, adj={v2:1, v4:8}
```
####Adjacency List Implementation
To implement an ‘adjacency list’, we need to implement:
Graph
, which is a master list of the vertices; this contains a
dictionary that maps vertex names to vertex objects. * Vertex
, which
represents each vertex in the graph. Each vertex uses a dictionary to
keep track of the vertices to which it is connected and the weight of
each edge.After constructing a graph. The two most common ways to search a graph are depth-first search and breadth-first search.
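Before getting to searching, here's a minimal sketch of those two classes (modeled on the ADT above; the method and attribute names are illustrative):

```python
class Vertex:
    """ Each vertex tracks the vertices it connects to and each edge's weight """
    def __init__(self, key):
        self.id = key
        self.connectedTo = {}  # maps a neighbor Vertex to the edge weight

    def addNeighbor(self, nbr, weight=0):
        self.connectedTo[nbr] = weight

class Graph:
    """ Master list of vertices: maps vertex names to Vertex objects """
    def __init__(self):
        self.vertList = {}

    def addVertex(self, key):
        self.vertList[key] = Vertex(key)
        return self.vertList[key]

    def addEdge(self, fromVert, toVert, weight=0):
        if fromVert not in self.vertList:
            self.addVertex(fromVert)
        if toVert not in self.vertList:
            self.addVertex(toVert)
        self.vertList[fromVert].addNeighbor(self.vertList[toVert], weight)

g = Graph()
g.addEdge('v0', 'v1', 5)
g.addEdge('v0', 'v5', 2)
print(len(g.vertList))  # 3 vertices so far: v0, v1, v5
```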
In depth-first search (DFS), we start at the root (or an arbitrarily selected node) and explore each branch completely before moving on to the next branch (thus the name depth-first) before we go wide. This is usually a simpler approach.
In breadth-first search (BFS), we start at the root (or an arbitrarily selected node) and explore each edge/neighbor before going on to any of their children (thus the name breadth-first) before we go deep. To keep track of which vertices have been visited, we color them:

* *white* - all vertices are initialized as white when constructed; this means the vertex is undiscovered (i.e. not visited yet)
* *gray* - when a vertex is initially discovered it is colored gray. When a vertex is colored gray, there may be some white vertices adjacent to it (indicating that there are still additional vertices to explore)
* *black* - when a vertex is completely explored, it is colored black. Once a vertex is black, there are no white vertices adjacent to it.

The heart of a BFS is a `Queue`, which decides which vertex to explore next.
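A minimal BFS sketch over a plain dict-of-lists graph (my own example; it collapses the gray/black bookkeeping into a single visited set):

```python
from collections import deque

def bfs(graph, start):
    """ Explore all neighbors of a vertex before going any deeper """
    visited = {start}
    queue = deque([start])  # the queue decides which vertex to explore next
    order = []
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        for neighbor in graph[vertex]:
            if neighbor not in visited:  # still 'white', i.e. undiscovered
                visited.add(neighbor)
                queue.append(neighbor)
    return order

graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(bfs(graph, 'A'))  # ['A', 'B', 'C', 'D']
```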
####Trees vs Graphs
You can think of a tree as just a restricted form of a graph. The main difference is that a tree is set up so that each node has only one parent, while graphs can have multiple predecessors (i.e. we don't say 'parent' for graphs because there can be multiple connections). Trees have a root node while graphs don't.
Trees also don’t contain cycles. Trees fit into a category called Directed Acyclic Graphs (DAG), meaning that there are no cycles (i.e. acyclic). Graphs can be cyclic or acyclic (remember that a tree is always a graph, but a graph is not always a tree).
An example of a tree would be a scheduled job that depended on previous jobs. An example of a graph would be a map with cities being connected.
NP stands for nondeterministic polynomial time. For an NP-Complete problem, there is no known way to find a solution quickly (i.e. the time required to solve the problem using any currently known algorithm increases rapidly as the size of the problem grows), though a given solution can be verified quickly. Note: by quickly, it means in polynomial time.
Some famous NP-Complete problems are:
An NP-Hard problem is one that is at least as hard as the hardest problems in NP; unlike NP-Complete problems, an NP-Hard problem is not required to have solutions that are verifiable in polynomial time.
##Gotchas
Remember that whiteboarding is a bit different than coding because you can't just run your code. Some things to watch out for include:

* Off-by-one errors with ranges (e.g. `range(10)` will be `0` to `9`, but not `10`)