Modelling Azure Cosmos DB

azure data-modeling azure-cosmosdb

564 views

2 replies

3155 reputation

I'm planning on saving records from CoinMarketCap API for own purpose. I would like to save price information on the top 100 crypto coins every 15 minutes, and I would like to save it in an Azure Cosmos DB.

Since I'm new to the whole concept of document DBs, I need some help modelling the documents.

First I started out with this model.

[
    {
        "id": "bitcoin",
        "name": "Bitcoin",
        "symbol": "BTC",
        "rank": "1",
        "price_usd": "573.137",
        "price_btc": "1.0",
        "24h_volume_usd": "72855700.0",
        "market_cap_usd": "9080883500.0",
        "available_supply": "15844176.0",
        "total_supply": "15844176.0",
        "percent_change_1h": "0.04",
        "percent_change_24h": "-0.3",
        "percent_change_7d": "-0.57",
        "last_updated": "1472762067"
    },
    {
        "id": "ethereum",
        "name": "Ethereum",
        "symbol": "ETH",
        "rank": "2",
        "price_usd": "12.1844",
        "price_btc": "0.021262",
        "24h_volume_usd": "24085900.0",
        "market_cap_usd": "1018098455.0",
        "available_supply": "83557537.0",
        "total_supply": "83557537.0",
        "percent_change_1h": "-0.58",
        "percent_change_24h": "6.34",
        "percent_change_7d": "8.59",
        "last_updated": "1472762062"
    },
    ...
]

But since the id did not change between writes, the existing records were simply updated (upserted) rather than appended. I guess this was as expected.

So, to make sure each reading accumulated instead of overwriting the previous one, I rewrote the model to this.

[
    {
        "id": "<timestamp>_bitcoin",
        "identifier": "bitcoin",
        "name": "Bitcoin",
        "symbol": "BTC",
        "rank": "1",
        "price_usd": "573.137",
        "price_btc": "1.0",
        "24h_volume_usd": "72855700.0",
        "market_cap_usd": "9080883500.0",
        "available_supply": "15844176.0",
        "total_supply": "15844176.0",
        "percent_change_1h": "0.04",
        "percent_change_24h": "-0.3",
        "percent_change_7d": "-0.57",
        "last_updated": "1472762067"
    },
    {
        "id": "<timestamp>_ethereum",
        "identifier": "ethereum",
        "name": "Ethereum",
        "symbol": "ETH",
        "rank": "2",
        "price_usd": "12.1844",
        "price_btc": "0.021262",
        "24h_volume_usd": "24085900.0",
        "market_cap_usd": "1018098455.0",
        "available_supply": "83557537.0",
        "total_supply": "83557537.0",
        "percent_change_1h": "-0.58",
        "percent_change_24h": "6.34",
        "percent_change_7d": "8.59",
        "last_updated": "1472762062"
    },
    ...
] 

Here I added a separate id composed of a timestamp plus the old id, in order to make each document unique.

This is working okay, but there is some duplicated data (e.g. name and symbol), which I think looks bad stored twice. But maybe this is just how it is in the document DB world?
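For illustration, here is a minimal Python sketch of how the second model's unique ids could be built from an API record. The helper name `to_timeseries_doc` is my own invention, not part of the CoinMarketCap API or the Cosmos DB SDK:

```python
import time

def to_timeseries_doc(record, timestamp=None):
    """Turn one CoinMarketCap record into a Cosmos DB document with a
    unique id, so each 15-minute write creates a new document instead
    of upserting over the previous one."""
    ts = timestamp if timestamp is not None else int(time.time())
    doc = dict(record)
    doc["identifier"] = record["id"]    # keep the original coin id
    doc["id"] = f"{ts}_{record['id']}"  # unique per reading
    return doc

doc = to_timeseries_doc(
    {"id": "bitcoin", "symbol": "BTC", "price_usd": "573.137"},
    timestamp=1472762067,
)
print(doc["id"])  # 1472762067_bitcoin
```

Writing each such document with the SDK's upsert/create call then appends a new record every 15 minutes rather than replacing the old one.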

I also thought about a model like this.

[
    {
        "id": "<timestamp>_bitcoin",
        "identifier": "bitcoin",
        "name": "Bitcoin",
        "symbol": "BTC",
        "rank": "1",
        "price_history": [{
            "price_usd": "573.137",
            "price_btc": "1.0",
            "24h_volume_usd": "72855700.0",
            "market_cap_usd": "9080883500.0",
            "available_supply": "15844176.0",
            "total_supply": "15844176.0",
            "percent_change_1h": "0.04",
            "percent_change_24h": "-0.3",
            "percent_change_7d": "-0.57",
            "last_updated": "1472762067"
        }]
    },
    {
        "id": "<timestamp>_ethereum",
        "identifier": "ethereum",
        "name": "Ethereum",
        "symbol": "ETH",
        "rank": "2",
        "price_history": [{
            "price_usd": "12.1844",
            "price_btc": "0.021262",
            "24h_volume_usd": "24085900.0",
            "market_cap_usd": "1018098455.0",
            "available_supply": "83557537.0",
            "total_supply": "83557537.0",
            "percent_change_1h": "-0.58",
            "percent_change_24h": "6.34",
            "percent_change_7d": "8.59",
            "last_updated": "1472762062"
        }]
    },
    ...
] 

But since there is no way to append new entries to price_history without rewriting the whole document, this would not be a good idea. Also, since the number of entries in price_history would potentially grow without limit, the document would get really big and hard to handle.

Next I thought of splitting the data into separate documents, but I'm not sure that is the way to go either. So I'm a bit lost at the moment.

Any suggestions?

Author: Martin, posted 27 December 2017

2 answers


4

4442 reputation

Accepted answer

Here's the fact portion:

  1. Having a single document per crypto-coin and embedding pricing information within the document in 15 minute intervals is not workable. Cosmos DB has a 2MB document size limit. You will blow through that size limit if you pursue an embed strategy. Also, you are right that larger documents are more difficult to work with and incur higher RU charges to retrieve.

  2. In NoSQL in general, duplicating data is not necessarily a cardinal sin. You need to think about how data will be retrieved and what information you need to work with. This is especially important since there are no joins across documents the way there are in relational DBs.
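To put the 2 MB limit in concrete numbers, here is a rough back-of-the-envelope estimate. The 500-byte reading size is an assumption, loosely based on the documents shown in the question:

```python
# Rough estimate: how long before an embedded price_history array
# hits Cosmos DB's 2 MB document size limit.
BYTES_PER_READING = 500          # assumed size of one embedded price entry
READINGS_PER_DAY = 24 * 4        # one reading every 15 minutes
LIMIT_BYTES = 2 * 1024 * 1024    # 2 MB Cosmos DB document limit

max_readings = LIMIT_BYTES // BYTES_PER_READING
days_until_full = max_readings / READINGS_PER_DAY
print(max_readings, round(days_until_full, 1))  # 4194 readings, ~43.7 days
```

Under these assumptions a per-coin document would overflow in about six weeks, so embedding is ruled out regardless of the exact reading size.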

Now for the purely opinion-based part:

  • Consider a crypto-coin document that contains general information about each coin you need to track. You may not actually even need this document.

  • Store time series data as separate documents. You actually have to go this way due to the aforementioned document size limitation, and that there is no upper bound on the number of timestamp readings.

  • For your 1 hour, 24 hour, and 7 day lookback aggregations stored with each timestamp, you can simply query with aggregate functions and set these properties on each new entry as you write it. Given that you're only storing 100 different crypto-currencies and your timestamps arrive every 15 minutes, this is feasible.
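The lookback idea above can be sketched as follows: look up the stored reading closest to now-minus-window, compute the change yourself, and write it onto the new document. The helper is hypothetical; only the `percent_change_*` field semantics come from the question:

```python
def percent_change(current_price, past_price):
    """Percent change between a past reading and the current one,
    matching the percent_change_* fields in the question's documents."""
    return (current_price - past_price) / past_price * 100.0

# e.g. assumed BTC price one hour ago vs. the current 573.137 reading
change_1h = percent_change(573.137, 572.908)
print(round(change_1h, 2))  # 0.04
```

A Cosmos DB SQL query filtering on `last_updated` between the window bounds would fetch the past reading; since there are only 100 coins, one small point query per coin per write is cheap.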

There's a great video on Channel 9 by Ryan CrawCour and David Makogon that deals with modeling data in Cosmos DB that I found very helpful when getting my head around this.

Author: Rob Reagan, posted 27 December 2017

-1

792 reputation

Here is a good article summarizing best practices for data modeling in the non-relational world: https://docs.microsoft.com/en-us/azure/cosmos-db/modeling-data

HTH

Author: Rafat Sarosh, posted 27 December 2017