If you’re looking into a NoSQL database, Cassandra is a solid choice. I’m barely a month into using Cassandra with Django, but these are my notes on what to look for (and hopefully pitfalls to avoid).
Cassandra is pretty stable, but that doesn’t mean you should install the latest version (which is v3 at the time of this writing). Check version compatibility; that is, check the versions of the programs and libraries you’ll be using together (hint: it’s usually not the latest version). For example, if you’re doing Spark Streaming and need to write to Cassandra immediately, look at the versions of the components below and find the highest version they all support:
Spark + Spark Streaming library
Cassandra
Spark Streaming library connector to Cassandra (https://github.com/datastax/spark-cassandra-connector#version-compatibility)
Speaking of which, DataStax is the company that supports a lot of Cassandra’s tools; the connector library mentioned above is one they open sourced. They’ll be your go-to source for many Cassandra-related topics.
I learned the above install lesson the hard way by doing the installs first, writing some Scala code, and then realizing the libraries I needed didn’t support the version of Cassandra I had. Some common systems used with Cassandra include Spark, Spark Streaming, and Kafka.
Also, most of the features in Cassandra v3 aren’t needed; I only ran into one issue in v3 that wasn’t in v2 (QuerySet filtering with IN on a partition key), but that was an inefficient query and should have been redesigned anyway. Some of the new features in v3 (like materialized views) kind of go against the philosophy of NoSQL databases (more on that later; see Data Modeling further down).
You can download Cassandra and run cassandra from bash to start it, or if you want, you can try Docker:
docker pull cassandra
docker run --name cassandra -p 9042:9042 -d cassandra   # expose the CQL port
docker exec -it cassandra cqlsh                          # open a CQL shell in the container
docker exec -it cassandra bash                           # or a regular shell
I want to mention DataStax again because they are the go-to resource for Cassandra. They offer great short videos that cover the basics and design philosophy of Cassandra. A lot of tech videos are boring, but these really just get to the point.
The main driver we’ll be using (and what other Django libraries base their connectors on) is the DataStax Python driver (https://github.com/datastax/python-driver). In particular, the documentation is great, and you’ll mainly use their Object Mapper, which has Models and Queries very similar to Django’s (http://datastax.github.io/python-driver/api/index.html#object-mapper).
Here are some notes about why we want to use Cassandra and its advantages/weaknesses. At the very least, watch the intro video (https://academy.datastax.com/resources/ds101-introduction-cassandra) about what Cassandra does.
DataStax’s data modeling course (https://academy.datastax.com/courses/ds220-data-modeling/) covers Cassandra’s main data modeling ideas: design tables around your queries, denormalize instead of joining, and spread data evenly across partitions. Architecturally, Cassandra is a masterless, peer-to-peer ring: every node can accept reads and writes, rows are placed by hashing the partition key, and data is replicated according to the replication factor.
So you’ve installed and started running Cassandra locally (i.e. ran cassandra from bash), and can look around the system using the DevCenter GUI or the cqlsh command line. You can write some CQL (Cassandra’s version of SQL) to look around or create some data; CQL syntax is mostly the same as SQL, but you can look over the CQL reference for the details. Now how do you read and write data programmatically with Django?
Let’s look at the libraries (i.e. how to access data without the cql shell or GUI):
pip install cassandra-driver          # the DataStax Python driver
pip install django-cassandra-engine   # Django integration built on top of the driver
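Before wiring up Django, it’s worth a quick sanity check that the driver can reach your local node. Here’s a minimal sketch using cassandra-driver directly (the cassdb keyspace name is just an example, matching the settings further down):

from cassandra.cluster import Cluster

# Connect to the local node (the Docker run above maps port 9042)
cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect()

# Run CQL directly, same as you would in cqlsh
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS cassdb
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
row = session.execute("SELECT release_version FROM system.local").one()
print(row.release_version)
cluster.shutdown()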
Let’s talk a bit more about the Django Cassandra Engine. Here’s what to look out for:
For Data Modeling, we replace the Django Models with the Cassandra Models.
There are additional DataStax notes about data modeling and the definitions of keys worth reading. So how do keys affect things like indexes? There are primary and secondary indexes: the primary key (the partition key plus any clustering columns) determines which node a row lives on and how rows sort within a partition, while a secondary index lets you query a regular column at some performance cost.
Example Model:

import uuid
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class ExampleModel(Model):
    # Partition key: determines which node the row lives on
    some_category = columns.Text(primary_key=True, partition_key=True)
    # Clustering key: orders rows within a partition (clustering_order
    # belongs on a clustering column, not on the partition key)
    example_id = columns.UUID(primary_key=True, clustering_order="DESC", default=uuid.uuid4)
    # Secondary index on a regular column
    example_type = columns.Integer(index=True)
    created_at = columns.DateTime()
    description = columns.Text(required=False)
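Once the model is defined, here’s a minimal sketch of syncing and querying it with the Object Mapper (django-cassandra-engine normally handles the connection setup; the manual connection.setup() below is only needed outside Django, and assumes the cassdb keyspace exists):

from cassandra.cqlengine import connection
from cassandra.cqlengine.management import sync_table

# Manual setup when outside Django
connection.setup(['127.0.0.1'], 'cassdb', protocol_version=3)

sync_table(ExampleModel)  # creates/alters the table to match the model

ExampleModel.create(some_category='news', example_type=1, description='hello')

# Query by partition key; rows come back ordered by the clustering key
for row in ExampleModel.objects(some_category='news'):
    print(row.example_id, row.description)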
DRF needs some modifications to use:
def create
method must return Model.objects.create(**validated_data)
)generics.RetrieveUpdateDestroyAPIView
get_object
method (i.e. get_object
method must return `Model.objects(pk==self.kwargs[‘pk’]) # or whatever the object selection isCassandra didn’t work well with DRF (e.g. updating the built-in documentation on DRF on a Model update) unless we explicitly declared the options. This setting worked:
from cassandra import ConsistencyLevel  # needed for the consistency option below

DATABASES = {
'sqlite': {
'ENGINE': 'django.db.backends.sqlite3',
'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
},
'default': { # Run 'manage.py sync_cassandra'
'ENGINE': 'django_cassandra_engine',
'NAME': 'cassdb',
'USER': 'test',
'PASSWORD': 'test',
'TEST_NAME': 'test_test',
'HOST': 'localhost',
'OPTIONS': {
'replication': {
'strategy_class': 'SimpleStrategy',
'replication_factor': 1
},
'connection': {
'consistency': ConsistencyLevel.LOCAL_ONE,
'retry_connect': True,
#'port': 9042,
# + All connection options for cassandra.cluster.Cluster()
},
'session': {
'default_timeout': 10,
'default_fetch_size': 10000
# + All options for cassandra.cluster.Session()
}
}
}
}
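For reference, here’s roughly what those DRF overrides look like. This is a minimal sketch assuming the ExampleModel above; ExampleSerializer and its field list are hypothetical, and the get_object lookup should match your own keys:

from rest_framework import generics, serializers

from myapp.models import ExampleModel  # hypothetical app path

class ExampleSerializer(serializers.Serializer):
    # Hypothetical fields mirroring ExampleModel
    example_id = serializers.UUIDField(read_only=True)
    some_category = serializers.CharField()
    example_type = serializers.IntegerField(required=False)
    description = serializers.CharField(required=False)

    def create(self, validated_data):
        # DRF expects the created instance back
        return ExampleModel.objects.create(**validated_data)

    def update(self, instance, validated_data):
        # cqlengine model instances support .update(**kwargs)
        instance.update(**validated_data)
        return instance

class ExampleDetail(generics.RetrieveUpdateDestroyAPIView):
    serializer_class = ExampleSerializer

    def get_object(self):
        # Or whatever the object selection is; with a compound primary
        # key you'd typically filter by the partition key as well
        return ExampleModel.objects.get(example_id=self.kwargs['pk'])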
Some things to know about Django with databases: having multiple databases in the same project (e.g. the sqlite and Cassandra entries above) worked.
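For example, regular Django ORM models can be pointed at the sqlite alias while cqlengine-backed models go through the Cassandra engine. A tiny sketch, assuming a hypothetical SomeDjangoModel:

from myapp.models import ExampleModel, SomeDjangoModel  # SomeDjangoModel is hypothetical

# Django ORM model against the 'sqlite' alias from DATABASES above
rows = SomeDjangoModel.objects.using('sqlite').all()

# cqlengine model goes through the Cassandra connection ('default')
events = ExampleModel.objects(some_category='news')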
You can run CQL in the command line using cqlsh (install it with pip install cql if needed):

$cqlsh
Apache Spark is a cluster computing system; can be used to process live data streams (e.g. sources like Kafka) and push this data out (e.g. to files, databases, or dashboards)
brew install scala
brew install apache-spark

Run $spark-shell to use Scala, or $pyspark to use Python. I added the Kafka assembly jar under: /Users/williamliu/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar
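For the Spark 1.6-era setup above, here’s a minimal PySpark Streaming sketch that counts words from the 'test' Kafka topic (created in the Kafka section below). It assumes you launch with the assembly jar, e.g. spark-submit --jars /Users/williamliu/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar script.py:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Read (key, message) pairs from the 'test' topic via Zookeeper
stream = KafkaUtils.createStream(ssc, "localhost:2181", "my-consumer-group", {"test": 1})
counts = (stream.map(lambda kv: kv[1])
                .flatMap(lambda line: line.split(" "))
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()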
Apache Kafka is a high-throughput distributed messaging system (can handle lots of reads and writes from many clients). Zookeeper is used to coordinate processes of distributed applications like group messaging, shared registers, distributed lock servers. This is done using an efficient replicated state machine to guarantee that updates to nodes are ordered.
brew install kafka
$zkServer start   # Zookeeper must be running before Kafka starts; use zkCli to connect to it
$kafka-server-start /usr/local/etc/kafka/server.properties
$kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
$kafka-topics --list --zookeeper localhost:2181
Kafka settings (e.g. for Producer and Consumer) are saved in: /usr/local/etc/kafka/ e.g. under consumer.properties and producer.properties
We have producers that send data to topics and consumers that read from them:
* Create data from the standard input and put it into a Kafka topic: $kafka-console-producer --broker-list localhost:9092 --topic test (nothing appears yet; start typing in some messages to produce them)
* Read data from a Kafka topic: $kafka-console-consumer --zookeeper localhost:2181 --topic test --from-beginning (this consumes the messages produced above)
* Kafka is a distributed publish-subscribe messaging system meant to scale
* Kafka has feeds of messages in topics. Producers write data to topics. Consumers read from topics. Topics are partitioned
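The same produce/consume flow from Python; a minimal sketch assuming the kafka-python package (pip install kafka-python):

from kafka import KafkaConsumer, KafkaProducer

# Producer: write a message to the 'test' topic
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('test', b'hello from python')
producer.flush()

# Consumer: read the topic from the beginning, give up after 5s idle
consumer = KafkaConsumer('test',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)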
Amazon SNS (Simple Notification Service) can deliver push messages to applications and/or users.
kafkacat, a command-line Kafka producer/consumer: https://github.com/edenhill/kafkacat
To check whether something is listening on a port:
* netcat, e.g. nc -z 127.0.0.1 8080
* telnet 127.0.0.1 8080
* list any process listening to a specific port (e.g. 8080): $lsof -i:8080
In IntelliJ, use ‘Project Structure’ to add new dependencies (e.g. jar files).
Get Twitter Examples https://dev.twitter.com/oauth/tools/signature-generator/6799311?nid=813
sed "s/$/,/g” myfilename.csv > newfilename.csv // add comma to end of each line
brew switch cassandra 3.5   # switch between installed Cassandra versions
A timeuuid encodes its creation time; e.g. datetime.datetime.now() gives 2016-05-06 13:53:07.387654, which corresponds to the UUID fdd0ba00-13b2-11e6-88a9-0002a5d5c51b.
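Cassandra’s timeuuid is a version 1 (time-based) UUID, so the timestamp can be recovered from it. A small sketch:

import datetime
import uuid

u = uuid.uuid1()  # time-based UUID, like Cassandra's timeuuid
# UUID v1 time counts 100-ns intervals since 1582-10-15;
# 0x01b21dd213814000 is the offset to the Unix epoch
seconds = (u.time - 0x01b21dd213814000) / 1e7
print(u)
print(datetime.datetime.fromtimestamp(seconds))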
COPY imports/exports data from CSV, e.g.:
COPY table1 (column1, column2, column3) FROM 'table1data.csv' WITH HEADER=true;
Don’t mess with the primary key; you can’t change a table’s primary key after creation (you’d need a new table).
TRUNCATE removes all rows from a table, e.g. TRUNCATE table1;
Be explicit about what you’re partitioning by, then clustering by (order matters: we first order by added_year, then video_id; if we don’t include video_id, we’ll have duplicates), e.g. PRIMARY KEY ((tag), added_year, video_id)) WITH CLUSTERING ORDER BY (added_year DESC);
Remember that writes are UPSERTS: inserting a row whose primary key already exists updates it, which is how you update, say, a table with new column data.
Plain INTs have concurrency issues as counters, so use the dedicated counter type, e.g. UPDATE moo_counts SET moo_count = moo_count + 8 (counter updates also require the full primary key in the WHERE clause).
SOURCE executes a file containing CQL statements, e.g. SOURCE './myscript.cql';