Timeseries with open ports

Hi,

I'm trying to save information about host scanning: IP addresses and the open ports found on them. Right now I can save the results of a single scan:
ip1 open port1
ip2 open port1
etc

In the end I get a convenient picture with ports and hosts.

But I get new results every day, and if I add a date node I can't figure out how to save the results without duplicating the IP nodes and port nodes. Or, the other way round, different date nodes will point to the same IP node, and then it becomes impossible to tell which port node belongs to which ip-node → date-node path.

Have you tried using an upsert? You can query for the existing node based on the IP address or port number, and then you can use the queried UIDs to run a mutation based on the existing nodes.

https://docs.dgraph.io/mutations/#upsert-block
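
For illustration, a minimal sketch of such an upsert block, assuming an ip predicate with an exact index (as in the schema further down); it creates the host node only when no node with that IP exists yet:

upsert {
  query {
    # look up an existing host node by its IP address
    h as var(func: eq(ip, "10.0.0.1"))
  }

  mutation @if(eq(len(h), 0)) {
    set {
      _:host <ip> "10.0.0.1" .
    }
  }
}

If a node with that IP already exists, the condition is false and nothing is written; a second mutation block guarded with @if(gt(len(h), 0)) can then attach new edges to uid(h).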

Yes, I know about upserts, but I mean something different:
I have a schema that unites IPs, open ports and datacenters:
ip: string @index(exact) @upsert .
port: string @index(exact) @upsert .
dc: string @index(exact) @upsert .
opens: [uid] @reverse .
indc: [uid] @reverse .

type Host {
  ip
  opens: [Port]
  indc: [DC]
}
type Port {
  port
}
type DC {
  dc
}

And I get a picture like that with IPs, open ports and datacenters. Now it is easy to drag some port (for example 21) and see all the IPs and DCs that have that port open.
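
For reference, a query along these lines (a sketch against the schema above; the result name is made up) walks from a port node to the hosts and datacenters through the reverse edge on opens:

{
  hosts_with_port(func: eq(port, "21")) {
    port
    # hosts that open this port, reached via the reverse edge
    ~opens {
      ip
      indc {
        dc
      }
    }
  }
}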

But I can't figure out how to add a date node so that I can also see all the dates on which that port was detected, or find the dates on which certain ports on a certain IP were open.
If I add date nodes I get a picture with several date nodes pointing at a single IP node (and further on to the port nodes), and I can't build the path from a port back to a date:
one day: date1 → ip1 → has open ports 21 and 443
next day: date2 → ip1 → has open port 22

In the picture I will have ip1 connected to 21, 22 and 443, and I can't distinguish which port belongs to which date.

Any ideas?

In case my explanation wasn't clear:

I'm scanning my network every day and getting results in the form: ip - open ports.
I'm trying to understand whether it is possible to use Dgraph to store the results so that I could later ask things like:

  • what ports were open on an IP on date1 (or for several dates, or a range)
  • which hosts had a given port (e.g. 21) open on any date (or within a range of dates)

Edit:
I only really understood what your problem is after whiteboarding it. :slight_smile:
What is the relationship between data centers, IPs and ports?
Can a single IP have multiple ports open to multiple data centers?

Please ignore below.

Not sure, but could this work:
Add
date: dateTime @index(day) @upsert .

IP on DATE opens PORT in DC
Not sure if it's the best way, but it is one way.
It may not be date-range friendly.
Reverse edges may also be useful.

Not quite. For simplicity, let's forget about data centers and focus just on IPs, ports and dates. The real problem (at least for me) is that after a few days of scanning I'll get nodes like this:
on Date1, IP1 was found with Port22
on Date2, IP1 was found with Port80

(Date1, IP1, Port22, etc. are nodes; the connections between them are edges)

And now I want to query IP1 for the ports that were open on it on Date1. But the response will contain both ports (22 and 80), because in the current schema there is no connection between a Date and a Port. So I'm trying to come up with a schema that allows such queries.

Edit: Just saw your post @Sergo. Will ignore DC. Removed DC column from image.

Actually, it would be good to get some feedback from someone who may have already done this. Read: not me. I am new to Dgraph and graph DBs.

Assuming a simple spreadsheet:
[image: example spreadsheet of scan results (date, IP, port)]

How can we create a schema that accurately represents that without duplication?
@Sergo Your data looks like that, correct?

Would a graph DB be good for such a use case with timeseries data?

Correct: yes, I want to do this without duplication, and I'm trying to understand whether graph DBs are suitable for such tasks.

So I believe this question belongs more to graph modelling in general than to Dgraph specifically.
I will try to ask the Neo4j community.

They don't have a good solution either. They suggested removing the Date node and instead using an additional Date1 property on the relation from the IP node to the Port node.

But that doesn't answer what to do when I have several similar results for different days:
IP1 Open (Facet Date1) Port22
IP1 Open (Facet Date2) Port22

Is it possible to create several edges with the same name but different facet values between the same two nodes? Neo4j can do that.

So I found a similar problem and a suggested solution in another task.

And one of the solutions there, from Neo4j (again), is the TimeTree graph.

It would be great to hear something about this from the Dgraph folks. And I'm still interested in whether it is possible to build several link edges between the same nodes that differ only in their facet values.

I think you have three choices, depending on what your data looks like, given that Dgraph should be able to handle thousands of predicates easily:

  1. If you have a limited number of IPs, I would create a predicate for each IP
  2. If you have a limited number of ports, I would create a predicate for each port
  3. If you have no additional information about the data, then I don’t see any option other than data duplication.

Let me know which case it is and I will be happy to help you further with your data modelling.

I have about 5-8 thousand IPs every day, and almost 100% of them are the same as on the previous day. Even if they change over the years, they come from a fairly stable pool of about 10,000 IPs. Every IP can have 1-10 open ports. Mostly they are the same ports, and in total there can't be more than roughly 2-4 hundred distinct ports.

As far as I understand, in the case of duplication the advantage of a graph database over a relational database becomes less obvious?

In that case, I'd recommend that you keep ports as your predicates. This is how your schema would look:

date: dateTime @index(day) @upsert .
ip: string @index(exact) @upsert .
port_22: [uid] .
port_22_reverse: [uid] .
port_80: [uid] .
port_80_reverse: [uid] .

Now, for the data below, this is how your mutation would look:
on Date1, IP1 was found with Port22
on Date2, IP1 was found with Port80

_:ip1 <port_22> _:date1 .
_:date1 <port_22_reverse> _:ip1 .
_:ip1 <ip> "10.0.0.1" .
_:date1 <date> "2010-01-01" .

_:ip1 <port_80> _:date2 .
_:date2 <port_80_reverse> _:ip1 .
_:date2 <date> "2010-01-02" .

You will have to use upserts to avoid duplicating the data, so that you can reuse the dates and the IPs again and again. Let me know if you need help creating the upsert queries for the same input.
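
As a rough sketch (the IP and date are made up, not a complete recipe): once the IP node and the date node exist, each created with a conditional upsert like the one sketched earlier in the thread, an upsert along these lines links them for a given port. N-Quads that reference an empty uid() variable are simply dropped, so nothing happens if either node is missing:

upsert {
  query {
    # look up the existing IP node and date node
    i as var(func: eq(ip, "10.0.0.1"))
    d as var(func: eq(date, "2010-01-01"))
  }

  mutation @if(gt(len(i), 0) AND gt(len(d), 0)) {
    set {
      uid(i) <port_22> uid(d) .
      uid(d) <port_22_reverse> uid(i) .
    }
  }
}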

I have created separate predicates for the forward and reverse edges so that you can query in both directions, dates to IPs or IPs to dates. If you know the list of ports beforehand, you could instead use the @reverse directive and Dgraph would take care of creating the reverse edges itself.
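
With the @reverse variant the schema shrinks a little:

date: dateTime @index(day) @upsert .
ip: string @index(exact) @upsert .
port_22: [uid] @reverse .
port_80: [uid] @reverse .

and a query roughly like the following (a sketch; the IP and the date range are made up) answers "on which dates in January 2010 did this IP have port 22 open":

{
  q(func: eq(ip, "10.0.0.1")) {
    ip
    # date nodes reachable over port_22, restricted to January 2010
    port_22 @filter(ge(date, "2010-01-01") AND lt(date, "2010-02-01")) {
      date
    }
  }
}

The opposite direction, which IPs had port 22 open on a given day, starts from the date node and walks the ~port_22 reverse edge.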

Let me know if you have more questions.

With that variant I'll get duplication with ports as edges instead of having ports as nodes. I was trying to find a way to represent IPs, ports and dates all as nodes, or at least IPs and ports as nodes with the date in facets.
But in any case thanks for the help, I will keep researching.

@Sergo

Are you sure this is actually a graph problem in the first place?

Have you considered a TimeSeries DB such as Influx?

I think Influx has a free tier in its cloud service that lets you do a quick test run for free to decide whether that's a path you want to explore further. For a number of reasons, the upcoming version 2 is the way to go when deploying it yourself because, among other things, it unifies several tools into one single binary, so setup and getting started are much easier now.

Personally, I am a big fan of graph databases whenever they make sense, but this use case really looks a lot like a time series problem to me. Yes, you can solve it with a graph, but don't expect miracle performance…


Hi,

I have heard about timeseries DBs, and about Influx of course too. But the samples above are just a small part of a bigger task: I was thinking about a flexible schema like a graph because I have a lot of different sources:

  • from Nessus: IPs, ports, open ports, CVE vulnerabilities;
  • from my quick scan: IPs and ports;
  • from the CMDB: networks, datacenters, host names, net-groups, OS;
  • from inventory: package names with versions and host names;
    and more…

Right now I have to pull it from different places, normalize it, and come up with tables and joins. And at any moment I can get new sources, because we have a fairly complex production infrastructure (a few thousand servers, a couple of petabytes of data).

@Sergo

Have you tried Kafka for data ingestion?

It looks like you are building some kind of domain-specific knowledge graph, and in that case you have to make do with some degree of data duplication. I am not aware of how you could avoid it.

A nice property of Dgraph is that the underlying storage engine only holds its keys in memory, which means that even with multiple TB of data on disk, the key tree still fits in memory and gives you relatively fast query results. Technically, some degree of data duplication shouldn't be much of an issue, but I would run some benchmarks just to be sure.


We are using Kafka a lot, but mainly as a dumb, fast (and "fat") data bus / queue between different layers, for example when we need to deliver about 800k events/s to a DB.

I also tend to think that it will not work without duplication. Otherwise I will only keep the current view, without history.

@Sergo

Keep ports as your predicates.

When duplicating data, always start by duplicating the item with the smallest footprint, since the total impact on storage is then usually insignificant. Whenever possible, I encode duplicates as enums, because that usually translates to just one additional integer value per row.

For your IP numbers, you might even create a centrally maintained lookup table. If I understand it correctly, you have about ~10k IP addresses per day with relatively low variance, in the sense that you only get a few new ones relative to the previous day. If you make the table immutable and append-only, with a mem-cache on top of it, you get close to the fastest performance possible.

There are only some 65k ports in total, with at most a few hundred ports in actual use, so technically all you need is integer encoding directly as a property. I don't think that's worth normalizing.
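
In Dgraph terms that could look roughly like this (a sketch with hypothetical predicate names), with one scan-record node per date and IP holding the open ports as a plain integer list:

scanned_on: dateTime @index(day) .
ip: string @index(exact) .
open_ports: [int] .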

Then, all you need is a table or graph that contains a timestamp, a key referring to the IP address in the lookup table, the actual open port, and some kind of foreign key as a reference to external data in case the IP doesn't do the trick. I would compare queries against a table that stores the IP directly vs. a table that retrieves the IP from a K/V store. It makes sense to run the K/V store on the same machine as the DB to avoid a network hop. If the difference in performance is negligible, the K/V store for IP addresses is unbelievably storage-efficient.

I think I would put all of that into a timeseries DB with additional foreign keys to any external data stored in a graph, another DB, a CMS, or whatever. In many ways it makes sense to separate time-discrete data from time-invariant data and join them ad hoc at query time, while caching all frequently accessed time-invariant data to improve query latency and performance.

As long as we don't get QuantX/3D XPoint converged storage-memory into standard machines, you have to cache the crap out of your memory to sustain performance. That would require 3D XPoint pricing to fall below DRAM, but with DRAM prices already dropping for quite some time, this isn't going to happen for a very long time, and Intel already sells at a loss. Conversely, at this point in time it is actually cheaper to buy more RAM and cache more.

That said, I actually use Hasura to converge data & compute services from various sources. With it, I replace the foreign key inside a relational table with a remote join in Hasura over the primary key that is already there. That works across Postgres, a graph, and several web services. It works, but it could be better.

A few days ago I had a conversation with some of the Hasura engineers, and they said they are working on a universal database integration layer that should be released in late Q1.
I am certainly keeping an eye on this because, for us, it would reduce integration time substantially, mainly because you can query across all remote data sources through the unified API exposed by Hasura.
That's a very big deal for us.
