![apache flume apache flume](https://www.tutorialandexample.com/wp-content/uploads/2019/07/Apache-Flume.jpg)
During my evaluation of NoSQL solutions, the biggest hurdle I had, by far, was loading data into Hadoop. The easiest way I found to do this was using Apache Flume, but it took me a long time to figure out that my approach to data loading was wrong. As an RDBMS data architect, I was biased towards loading Hadoop with the same techniques I would use to load a SQL Server.

I tried scripting first (which is how most NoSQL solutions have you load their sample data, but it tends not to work well with your "real" data), but the learning curve was too high for a proof-of-concept. I then tried ETL tools like Informatica and had better success, but it was still too cumbersome. So I began thinking like a NoSQL guy and decided to use HiveQL. With that I had better success getting the data into Hadoop, but now I had to get the data out of my RDBMS in a NoSQL-optimized format that I could quickly run HiveQL against.

As my journey continued, I thought about how we would actually get data into Hadoop if we ever deployed it as a real RDBMS complement. We would probably write something that "listened" for "interesting" data on the production system and then put it into Hadoop. Why not just point Flume at our Service Broker events and scrape the data that way? I had that up and running in a few hours.

Flume works by having an agent running on a JVM listen for events (such as Service Broker messages delivered to a JMS system). The events are queued up into a "channel," and the channel produces new outgoing events to Hadoop/HDFS/HBase (or whatever you use for persistence). So, why use a channel in the middle? Flexibility and asynchronicity. The channels have disaster-recovery mechanisms built in, and as your throughput needs change you can configure your agents to do fan-in (more agents talking to fewer channels) or fan-out (fewer agents talking to more channels). The former is great if you have, for instance, multiple listeners that only need to talk to one Hadoop instance. The latter is good if you have one source system that needs to talk to multiple Hadoop instances, or if you need to route messages to a standby channel while you do maintenance on your Hadoop instance. This means that Flume is very scalable and can handle constant data streams effectively without worrying about data loss.
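The source → channel → sink pipeline described above is wired together in a single properties file. As a sketch, here is what a minimal agent might look like, assuming a JMS broker at `tcp://mq-host:61616` with a queue named `BROKER_EVENTS` and an HDFS namenode at `hdfs://namenode:8020` (the agent name `agent1` and all hostnames/paths are illustrative, not from the original article):

```properties
# Hypothetical agent "agent1": one JMS source, one durable file channel, one HDFS sink.
agent1.sources = jms-source
agent1.channels = file-channel
agent1.sinks = hdfs-sink

# JMS source: listens for messages arriving on a queue (e.g. relayed Service Broker events).
agent1.sources.jms-source.type = jms
agent1.sources.jms-source.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent1.sources.jms-source.connectionFactory = GenericConnectionFactory
agent1.sources.jms-source.providerURL = tcp://mq-host:61616
agent1.sources.jms-source.destinationType = QUEUE
agent1.sources.jms-source.destinationName = BROKER_EVENTS
agent1.sources.jms-source.channels = file-channel

# File channel: spools events to disk, so queued events survive an agent crash
# (this is the built-in disaster-recovery behavior; a memory channel is faster but volatile).
agent1.channels.file-channel.type = file
agent1.channels.file-channel.checkpointDir = /var/flume/checkpoint
agent1.channels.file-channel.dataDirs = /var/flume/data

# HDFS sink: drains the channel into date-partitioned directories in HDFS.
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel = file-channel
```

You would start this with `flume-ng agent --conf conf --conf-file agent1.conf --name agent1`. Fan-in and fan-out are just variations on this file: multiple sources can list the same channel, and a source can list several channels (with a channel selector) to replicate or route events, for example to a standby channel during maintenance.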