William Liu

Cassandra

Summary

If you’re looking into a NoSQL database, Cassandra is a solid choice. I’m barely a month into using Cassandra with Django, but these are my notes on what to look for (and hopefully pitfalls to avoid).

Install

Cassandra is pretty stable, but that doesn’t mean you should install the latest version (which is v3 at the time of this writing). Check version compatibility first: look at the versions supported by the programs and libraries you’ll be using (hint: it’s usually not the latest version). For example, if you’re doing Spark Streaming and need to write to Cassandra immediately, look at the versions of the pieces below and find the highest combination that all of them support:

Spark + Spark Streaming library
Cassandra
Spark Streaming library connector to Cassandra (https://github.com/datastax/spark-cassandra-connector#version-compatibility)
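
To make the "find the highest matching version" step concrete, here is a minimal sketch. The rows in COMPATIBILITY are placeholders I made up for illustration (they are not copied from the connector's table), and best_connector is a hypothetical helper, not part of any library.

```python
# Placeholder rows for illustration only; copy the real values from the
# connector's version-compatibility table before relying on them.
COMPATIBILITY = [
    {"connector": "1.4", "spark": {"1.4"}, "cassandra": {"2.1"}},
    {"connector": "1.5", "spark": {"1.5", "1.6"}, "cassandra": {"2.1", "2.2", "3.0"}},
    {"connector": "1.6", "spark": {"1.6"}, "cassandra": {"2.1", "2.2", "3.0"}},
]


def version_key(version):
    """Turn '1.10' into (1, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def best_connector(rows, spark, cassandra):
    """Return the highest connector version that supports both, or None."""
    matches = [
        row["connector"]
        for row in rows
        if spark in row["spark"] and cassandra in row["cassandra"]
    ]
    return max(matches, key=version_key) if matches else None


print(best_connector(COMPATIBILITY, spark="1.6", cassandra="2.2"))  # 1.6
print(best_connector(COMPATIBILITY, spark="1.4", cassandra="3.0"))  # None
```

A None result is the situation I ran into: the versions already installed had no connector that supported both.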

Speaking of which, DataStax is the company that supports a lot of Cassandra’s tools. DataStax open sourced the connector library mentioned above. They’ll be your go-to source for many Cassandra-related topics.

I learned the above install lesson the hard way by doing installs first, writing some Scala code, then realizing the libraries I needed didn’t support the version of Cassandra I had. Some common systems used with Cassandra include Spark, Spark Streaming, and Kafka.

Also, most of the features that are in Cassandra v3 aren’t needed; I only ran into one feature that was in v3 but not in v2 (Queryset filtering with IN on a partition key), and that was an inefficient query that should have been redesigned. Some of the new features in v3 (like materialized views) kind of go against the philosophy of NoSQL databases (more on that later; see Data Modeling further down).

Install (and run with Docker)

You can download Cassandra and run it with bash cassandra, or, if you prefer, you can try Docker.

DataStax

I want to mention DataStax again because they are the go-to resource for Cassandra. They offer great short videos that cover the basics and design philosophy of Cassandra. A lot of tech videos are boring, but these really just get to the point.

The main driver we’ll be using (and what other Django libraries base their connectors on) is the DataStax Python driver. In particular, the documentation is great, and you’ll mainly use their Object Mapper, which has Models and Queries that are very similar to Django’s (http://datastax.github.io/python-driver/api/index.html#object-mapper).

Cassandra Concepts

Here are some notes about why we want to use Cassandra and its advantages/weaknesses. At least watch this intro video (https://academy.datastax.com/resources/ds101-introduction-cassandra) about what Cassandra does.

Cassandra Data Modeling

Cassandra’s basics of data modeling (https://academy.datastax.com/courses/ds220-data-modeling/) cover the following main ideas:

Cassandra Architecture

Cassandra’s Architecture is:

Django Integration with Cassandra

So you’ve installed and started running Cassandra locally (i.e. bash cassandra), and can look around the system using the DevCenter GUI or through the cqlsh command line. You can write some CQL (their version of SQL) to look around or create some data (CQL syntax is mostly the same as SQL, but check the CQL documentation for the details). Now how do you read and write data programmatically with Django?

Let’s look at the libraries (i.e. how to access data without the cql shell or GUI):

Let’s talk a bit more about the Django Cassandra Engine. Here’s what to look out for:

Data Modeling with Django and Cassandra

For Data Modeling, we replace the Django Models with the Cassandra Models.

Additional Details about Cassandra Data Modeling - Keys

Additional notes about data modeling and the definitions of keys:

So how do keys affect things like indexes? There are primary and secondary indexes.

Example Model:

import uuid
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class ExampleModel(Model):
    # Partition key: decides which node stores the row
    some_category = columns.Text(partition_key=True)
    # Clustering key: orders rows within a partition (clustering_order only applies to clustering keys)
    example_id    = columns.UUID(primary_key=True, default=uuid.uuid4, clustering_order="DESC")
    # Secondary index, so this non-key column can be filtered on
    example_type  = columns.Integer(index=True)
    created_at    = columns.DateTime()
    description   = columns.Text(required=False)

Django REST Framework

DRF needs some modifications to work with Cassandra:

Serializers

Views

Database

Cassandra didn’t work well with DRF (e.g. updating the built-in documentation on DRF on a Model update) unless we explicitly declared the options. This setting worked:

import os

from cassandra import ConsistencyLevel

# BASE_DIR is assumed to be defined earlier in settings.py, as in a default Django project
DATABASES = {
    'sqlite': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    },

    'default': {  # Run 'manage.py sync_cassandra'
        'ENGINE': 'django_cassandra_engine',
        'NAME': 'cassdb',
        'USER': 'test',
        'PASSWORD': 'test',
        'TEST_NAME': 'test_test',
        'HOST': 'localhost',
        'OPTIONS': {
            'replication': {
                'strategy_class': 'SimpleStrategy',
                'replication_factor': 1
            },
            'connection': {
                'consistency': ConsistencyLevel.LOCAL_ONE,
                'retry_connect': True,
                #'port': 9042,
                # + All connection options for cassandra.cluster.Cluster()
            },
            'session': {
                'default_timeout': 10,
                'default_fetch_size': 10000
                # + All options for cassandra.cluster.Session()
            }
        }
    }
}

Some things to know about Django with databases: having multiple databases in the same project worked.

Random Notes

Apache Cassandra / CQL

You can run CQL from the command line using cqlsh: $ cqlsh

Apache Spark

Apache Spark is a cluster computing system; it can be used to process live data streams (e.g. from sources like Kafka) and push the results out (e.g. to files, databases, or dashboards).

Added jar file under: /Users/williamliu/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar

Apache Kafka

Apache Kafka is a high-throughput distributed messaging system (can handle lots of reads and writes from many clients). Zookeeper is used to coordinate processes of distributed applications like group messaging, shared registers, distributed lock servers. This is done using an efficient replicated state machine to guarantee that updates to nodes are ordered.

Kafka settings (e.g. for Producer and Consumer) are saved in: /usr/local/etc/kafka/ e.g. under consumer.properties and producer.properties

We have producers that send data to consumers through topics:

 * Create data from the standard input and put it into a Kafka topic: $ kafka-console-producer --broker-list localhost:9092 --topic test (nothing appears yet; you can start typing in some messages)
 * Read data from a Kafka topic: $ kafka-console-consumer --zookeeper localhost:2181 --topic test --from-beginning (this then consumes the messages you typed)

 * Kafka is a distributed publish-subscribe messaging system meant to scale
 * Kafka has feeds of messages in topics. Producers write data to topics. Consumers read from topics. Topics are partitioned 
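
As a rough mental model of the bullets above, here is a toy in-memory topic. This is not the Kafka client API: ToyTopic is a made-up class, and crc32 is only a stand-in for however Kafka actually hashes keys to partitions.

```python
import zlib


class ToyTopic:
    """Toy stand-in for a partitioned topic (not the real Kafka API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list of messages
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append value to the partition chosen by hashing the key."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        return index

    def consume(self, partition, offset=0):
        """Read a partition from an offset; offset 0 is like --from-beginning."""
        return self.partitions[partition][offset:]


topic = ToyTopic("test", num_partitions=3)
first = topic.produce("user-1", "hello")
second = topic.produce("user-1", "world")

# Same key, same partition, so the order of these two messages is preserved
print(first == second)       # True
print(topic.consume(first))  # ['hello', 'world']
```

The point of the sketch is that ordering only holds within a partition, which is why messages that must stay in order should share a key.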

Apache Zookeeper

AWS SNS

Amazon SNS (Simple Notification Service) delivers push messages to applications and/or users.

Kafka Tool (kafkacat)

https://github.com/edenhill/kafkacat

IP/Port

Check whether something is listening on an IP/port:

 * netcat (e.g. nc -z 127.0.0.1 8080)
 * telnet 127.0.0.1 8080
 * list any process listening on a specific port (e.g. 8080): $ lsof -i:8080
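
The same check can be done from Python using only the standard library, which is handy inside a script that should wait for Cassandra or Kafka to come up. This is a minimal sketch; port_is_open is my own helper name, and it only tests whether a TCP connection succeeds (roughly what nc -z does).

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout, or bad host
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(port_is_open("127.0.0.1", 8080))
```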

Scala

‘Project Structure’ to add new dependencies

Get Twitter Examples https://dev.twitter.com/oauth/tools/signature-generator/6799311?nid=813

CQL

sed 's/$/,/' myfilename.csv > newfilename.csv  # add a comma to the end of each line

brew switch cassandra 3.5

datetime.datetime.now()
2016-05-06 13:53:07.387654
UUID: fdd0ba00-13b2-11e6-88a9-0002a5d5c51b
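
The UUID above looks like a version 1 (time-based) UUID, the kind Cassandra uses for timeuuid columns, so the timestamp can be read back out of it. Here is a minimal sketch using only the standard library; uuid1_to_datetime is my own helper name, and the result is in UTC.

```python
import datetime
import uuid

# 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch
GREGORIAN_OFFSET = 0x01B21DD213814000


def uuid1_to_datetime(value):
    """Return the UTC datetime embedded in a version 1 UUID."""
    parsed = uuid.UUID(str(value))
    if parsed.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    # parsed.time counts 100 ns ticks, so divide by 10 to get microseconds
    microseconds = (parsed.time - GREGORIAN_OFFSET) // 10
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    return epoch + datetime.timedelta(microseconds=microseconds)


print(uuid1_to_datetime("fdd0ba00-13b2-11e6-88a9-0002a5d5c51b"))
# 2016-05-06 17:50:16.480000+00:00
```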

Imports/Exports data from CSV

COPY table1 (column1, column2, column3) FROM 'table1data.csv' WITH HEADER=true;

Notes about Cassandra Keys and Clustering

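 * Don’t mess with the primary key (it can’t be changed after the table is created)
 * TRUNCATE to remove all rows from a table (DROP TABLE removes the table itself)
 * Cannot nest a collection in a collection
 * DESCRIBE to get info about the table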

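Be explicit about what you’re partitioning on and then what you’re clustering on. Order matters: within each tag partition, rows are ordered first by added_year, then by video_id. If we don’t have video_id in the key, we’ll have duplicates: two videos with the same tag and year would share a primary key, and the later write would overwrite the earlier one. E.g. PRIMARY KEY ((tag), added_year, video_id) ) WITH CLUSTERING ORDER BY (added_year DESC);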
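
To make the key layout above concrete, here is a toy in-memory sketch of how a partition key plus clustering columns behaves. ToyTable and the sample videos are made up for illustration; this mimics the behaviour as I understand it and is not how Cassandra is implemented.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of PRIMARY KEY ((partition cols), clustering cols)."""

    def __init__(self, partition_cols, clustering):
        self.partition_cols = partition_cols  # e.g. ["tag"]
        self.clustering = clustering          # e.g. [("added_year", True)], True = DESC
        self.partitions = defaultdict(dict)

    def upsert(self, row):
        pkey = tuple(row[c] for c in self.partition_cols)
        ckey = tuple(row[c] for c, _ in self.clustering)
        # Same full primary key: the new row silently replaces the old one
        self.partitions[pkey][ckey] = dict(row)

    def select(self, *pkey):
        rows = list(self.partitions.get(pkey, {}).values())
        # Stable sorts applied from the last clustering column to the first
        for col, descending in reversed(self.clustering):
            rows.sort(key=lambda r, col=col: r[col], reverse=descending)
        return rows


videos = [
    {"tag": "cassandra", "added_year": 2015, "video_id": "a1"},
    {"tag": "cassandra", "added_year": 2016, "video_id": "b2"},
    {"tag": "cassandra", "added_year": 2016, "video_id": "c3"},
]

with_id = ToyTable(["tag"], [("added_year", True), ("video_id", False)])
without_id = ToyTable(["tag"], [("added_year", True)])
for video in videos:
    with_id.upsert(video)
    without_id.upsert(video)

print([r["video_id"] for r in with_id.select("cassandra")])     # ['b2', 'c3', 'a1']
print([r["video_id"] for r in without_id.select("cassandra")])  # ['c3', 'a1']
```

Without video_id in the key, the second 2016 video replaces the first, which is the collision the key design is guarding against.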

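Remember that writes are upserts: an INSERT or UPDATE on an existing primary key overwrites the columns you supply, so use one to fill in new column data on existing rows.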

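Counters: INTs have concurrency issues (a read-then-write increment can lose updates), so we’ll need a ‘counter’ column, e.g. UPDATE moo_counts SET moo_count = moo_count + 8 (plus a WHERE clause on the primary key)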
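
Here is a toy sketch of why a plain integer column loses updates while a counter does not. ToyRow is made up for illustration and the interleaving is hard-coded; it only models "last write wins" versus "apply a delta", not Cassandra's actual counter machinery.

```python
class ToyRow:
    """Toy row with a plain int column and a counter column."""

    def __init__(self):
        self.int_value = 0
        self.counter_value = 0

    def write_int(self, value):
        # Plain column: the client sends an absolute value, last write wins
        self.int_value = value

    def increment_counter(self, delta):
        # Counter column: the client sends a delta, applied where the data lives
        self.counter_value += delta


row = ToyRow()

# Two clients both read the int before either one writes
seen_by_a = row.int_value
seen_by_b = row.int_value
row.write_int(seen_by_a + 8)
row.write_int(seen_by_b + 8)  # overwrites A's update

# The same two increments expressed as counter deltas
row.increment_counter(8)
row.increment_counter(8)

print(row.int_value, row.counter_value)  # 8 16
```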

SOURCE executes a file containing CQL statements, e.g. SOURCE './myscript.cql';