Design Patterns Page

I decided to create a Design Patterns page to keep notes on the many articles out there, since a few have disappointed me with the inaccuracies they contain. I want to avoid repeating those mistakes, so I will try to list the patterns and create code examples that are correct.

Design Patterns


Apache Spark DataFrame Numeric Column (Again)

There is nothing like getting something working to motivate finding better ways to do it.

I found a .cast() method on columns that converts them to a numeric type, which avoids writing a UDF for the transformation.

I now prefer this way… until I find another, simpler one.

package com.cinq.experience;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;

import java.io.UnsupportedEncodingException;

public class Session {

	public static void main(String[] args) throws UnsupportedEncodingException {
		SparkConf conf = new SparkConf().setAppName("SparkExperience").setMaster("local");
		JavaSparkContext jsc = new JavaSparkContext(conf);
		SQLContext sqlContext = new SQLContext(jsc);

		DataFrame df = sqlContext.read()
				.format("com.databricks.spark.csv")
				.option("header", "true")
				.load("session.csv")
				.cache();

		DataFrame crazy = df.select(df.col("x-custom-a"),
				df.col("x-custom-count").cast(DataTypes.LongType).alias("x-custom-count"));
		crazy.groupBy(crazy.col("x-custom-a")).avg("x-custom-count").show();
	}
}

Apache Spark DataFrame Average

We had some trouble doing the math on a DataFrame column even though the method is readily available.

We kept getting an error that the column was not a numeric value.

After a bit of reading I figured out that I needed a UDF to transform the string column into a numeric one.

package com.cinq.experience;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;


public class DataFrameAvg {

	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("DataFrameAvg").setMaster("local");
		JavaSparkContext jsc = new JavaSparkContext(conf);
		SQLContext sqlContext = new SQLContext(jsc);

		DataFrame df = sqlContext.read()
				.format("com.databricks.spark.csv")
				.option("header", "true")
				.load("numericdata.csv");

		df.registerTempTable("allData");
		df.show();
		sqlContext.udf().register("toInt", new UDF1<String, Integer>() {
			public Integer call(String s) throws Exception {
				System.out.println("Parsing: " + s);
				return Integer.parseInt(s);
			}
		}, DataTypes.IntegerType);

		DataFrame withNumber = sqlContext.sql("SELECT toInt(number) AS number FROM allData");
		withNumber.groupBy().avg("number").show();
	}
}

The content of numericdata.csv is very simple:

name,number
person1,53
person2,42
person3,27
person4,15
person5,24
person6,30
person7,33
person8,36
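As a sanity check on what the DataFrame average should come out to, the same computation can be done over the number column in plain Java (the values below are copied from the CSV above):

```java
public class AvgCheck {
	public static void main(String[] args) {
		// the "number" column values from numericdata.csv
		int[] numbers = { 53, 42, 27, 15, 24, 30, 33, 36 };
		int sum = 0;
		for (int n : numbers) {
			sum += n;
		}
		// average as a double, matching what avg() should report
		double avg = (double) sum / numbers.length;
		System.out.println("avg(number) = " + avg);  // 32.5
	}
}
```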

Maven Generate

Why did I only discover this lately?

Because archetype:create was deprecated as of Maven 3.0.5, and archetype:generate should be used from now on. A bit odd to do this in a .0.5 release; I must be missing something about the reasoning behind the change.

So from now on when I need the default directory structure:
mvn archetype:generate -DgroupId=com.cinq.example -DartifactId=example1 -DinteractiveMode=false
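If I remember correctly, the default archetype (maven-archetype-quickstart) produces a layout along these lines; the App and AppTest class names come from the archetype itself, not from anything I chose:

```
example1/
├── pom.xml
└── src/
    ├── main/java/com/cinq/example/App.java
    └── test/java/com/cinq/example/AppTest.java
```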

Minimal log4j.xml

Too often I copy my log4j.xml from one project to another, so I figured I would post it here as a template.


<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">
  <appender name="default.console" class="org.apache.log4j.ConsoleAppender">
    <param name="target" value="System.out" />
    <param name="threshold" value="debug" />
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{ISO8601} %-5p [%c{1}] - %m%n" />
    </layout>
  </appender>
  <logger name="com.halogensoftware.hosting" additivity="false">
    <level value="debug" />
    <appender-ref ref="default.console" />
  </logger>
  <root>
    <priority value="info" />
    <appender-ref ref="default.console" />
  </root>
</log4j:configuration>
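For reference, the %d{ISO8601} token in the ConversionPattern renders timestamps as yyyy-MM-dd HH:mm:ss,SSS (note the comma before the milliseconds). A quick plain-Java sketch of that same format, using SimpleDateFormat rather than log4j itself:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class Iso8601Demo {
	public static void main(String[] args) {
		// log4j's ISO8601 date format, reproduced with SimpleDateFormat
		SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
		fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
		System.out.println(fmt.format(new Date(0L)));  // 1970-01-01 00:00:00,000
	}
}
```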

HDFS file listing

Using only the Hadoop libraries, I can list all the files in a subdirectory with this:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 import java.io.IOException;

 // list all sites we have data for
 FileSystem fs = FileSystem.get(new Configuration());
 FileStatus[] status = fs.listStatus(new Path("hdfs:///dir/subdir/"));
 for ( FileStatus s : status ) {
     try {
         FileStatus[] metricFile = fs.listStatus(new Path(s.getPath().toString() + "/file.json"));
         logger.info("File: " + metricFile[0].getPath().toString());
     } catch ( IOException e ) {
         // there is no metric file in this directory
     }
 }
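The same walk-the-subdirectories-and-probe-for-a-file pattern can be sketched against a local filesystem with java.nio instead of the Hadoop FileSystem API; the directory names here are made up for the example:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalListing {
	public static void main(String[] args) throws IOException {
		// stand-in for hdfs:///dir/subdir/ on the local disk
		Path base = Files.createTempDirectory("subdir");
		Files.createDirectories(base.resolve("site1"));
		Files.createDirectories(base.resolve("site2"));
		Files.createFile(base.resolve("site1/file.json"));  // only site1 has a metric file

		try (DirectoryStream<Path> sites = Files.newDirectoryStream(base)) {
			for (Path site : sites) {
				Path metricFile = site.resolve("file.json");
				if (Files.exists(metricFile)) {
					System.out.println("File: " + metricFile);
				}
			}
		}
	}
}
```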

Since I use Spark for most of my applications, I prefer this way of dealing with it:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

SparkConf sc = new SparkConf().setAppName("Learning");
JavaSparkContext jsc = new JavaSparkContext(sc);
JavaPairRDD<String, String> allMetricFiles = jsc.wholeTextFiles("hdfs:///dir/subdir/*/file.json");
for ( Tuple2<String, String> each : allMetricFiles.collect() ) {
	logger.info("Only metric file: " + each._1);
}