No web.xml

Servlet 3.1 allows you to build web apps without a web.xml, but Maven was giving me errors when trying to package the application.

I had to add this to my pom.xml:

<plugin>
  <artifactId>maven-war-plugin</artifactId>
  <version>2.6</version>
  <configuration>
    <failOnMissingWebXml>false</failOnMissingWebXml>
  </configuration>
</plugin>
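
For completeness, here is a minimal sketch of what such an application can look like: the servlet registers itself through an annotation, so no web.xml entry is needed. The class name and URL pattern are made up for illustration, and this assumes the Servlet 3.1 API is on the classpath:

package com.cinq.example;

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// @WebServlet replaces the <servlet> and <servlet-mapping> entries of web.xml
@WebServlet("/hello")
public class HelloServlet extends HttpServlet {

	@Override
	protected void doGet(HttpServletRequest req, HttpServletResponse resp)
			throws ServletException, IOException {
		resp.getWriter().println("Hello without web.xml");
	}
}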

Apache Spark DataFrame Numeric Column (Again)

There is nothing like already having a working solution to help you find more of them.

I found a .cast() method for the columns I want to use as numeric values, which avoids using a UDF to transform them.

I now prefer this way… until I find another, simpler…

package com.cinq.experience;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;

public class Session {

	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("SparkExperience").setMaster("local");
		JavaSparkContext jsc = new JavaSparkContext(conf);
		SQLContext sqlContext = new SQLContext(jsc);

		// read the CSV (with header) through spark-csv and cache it for reuse
		DataFrame df = sqlContext.read()
				.format("com.databricks.spark.csv")
				.option("header", "true")
				.load("session.csv")
				.cache();

		// cast the count column to a long and alias it so avg() can find it by name
		DataFrame crazy = df.select(df.col("x-custom-a"), df.col("x-custom-count").cast(DataTypes.LongType).alias("x-custom-count"));
		crazy.groupBy(crazy.col("x-custom-a")).avg("x-custom-count").show();
	}
}

Apache Spark DataFrame Average

We had some trouble doing the math on a column with DataFrames, even though the method is readily available.

We kept getting an error that the column was not a numeric value.

After a bit of reading I figured out that I needed a UDF to transform the string column into a numeric one.

package com.cinq.experience;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;


public class DataFrameAvg {

	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("DataFrameAvg").setMaster("local");
		JavaSparkContext jsc = new JavaSparkContext(conf);
		SQLContext sqlContext = new SQLContext(jsc);

		DataFrame df = sqlContext.read()
				.format("com.databricks.spark.csv")
				.option("header", "true")
				.load("numericdata.csv");

		df.registerTempTable("allData");
		df.show();
		// register a UDF that parses the string column into an integer
		sqlContext.udf().register("toInt", new UDF1<String, Integer>() {
			public Integer call(String s) throws Exception {
				System.out.println("Parsing: " + s);
				return Integer.parseInt(s);
			}
		}, DataTypes.IntegerType);

		// alias the converted column so avg() can reference it by name
		DataFrame withNumber = sqlContext.sql("SELECT toInt(number) AS number FROM allData");
		withNumber.groupBy().avg("number").show();
	}
}

The content of numericdata.csv is very simple:

name,number
person1,53
person2,42
person3,27
person4,15
person5,24
person6,30
person7,33
person8,36

Maven Generate

Why did I discover this only recently?

Because archetype:create was deprecated in Maven 3.0.5, and you should use archetype:generate from now on. It seems a bit odd to do that in a .0.5 release; I must be missing something about the reasoning behind the change.

So from now on, when I need the default directory structure:
mvn archetype:generate -DgroupId=com.cinq.example -DartifactId=example1 -DinteractiveMode=false
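
Since no archetypeArtifactId is given, Maven falls back to the default quickstart archetype (maven-archetype-quickstart), which should produce a layout roughly like this, with the package directories following the groupId:

example1/
  pom.xml
  src/main/java/com/cinq/example/App.java
  src/test/java/com/cinq/example/AppTest.java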

Minimal log4j.xml

Too often I copy my log4j.xml from one project to another, so I figured I'd post it here as a template.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">
  <appender name="default.console" class="org.apache.log4j.ConsoleAppender">
    <param name="target" value="System.out" />
    <param name="threshold" value="debug" />
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{ISO8601} %-5p [%c{1}] - %m%n" />
    </layout>
  </appender>
  <logger name="com.halogensoftware.hosting" additivity="false">
    <level value="debug" />
    <appender-ref ref="default.console" />
  </logger>
  <root>
    <priority value="info" />
    <appender-ref ref="default.console" />
  </root>
</log4j:configuration>
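
To go with it, the usual boilerplate for getting a logger; the class below is a made-up example, but anything under the com.halogensoftware.hosting package would log at debug through the logger defined above, while everything else falls back to the root logger at info:

package com.halogensoftware.hosting;

import org.apache.log4j.Logger;

public class Example {

	// resolves to the "com.halogensoftware.hosting" logger configured above
	private static final Logger logger = Logger.getLogger(Example.class);

	public static void main(String[] args) {
		logger.debug("visible: this package logs at debug");
		logger.info("also visible on the console");
	}
}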

HDFS file listing

Using only the Hadoop libraries, I can list all the files in a subdirectory with this:

// list all sites we have data for
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] status = fs.listStatus(new Path("hdfs:///dir/subdir/"));
for (FileStatus s : status) {
    try {
        // listStatus() throws if file.json does not exist in this directory
        FileStatus[] metricFile = fs.listStatus(new Path(s.getPath(), "file.json"));
        logger.info("File: " + metricFile[0].getPath().toString());
    } catch (IOException e) {
        // there is no metric file for this site
    }
}

Since I use Spark for most of the applications I write, I prefer this way of dealing with it:

SparkConf sc = new SparkConf().setAppName("Learning");
JavaSparkContext jsc = new JavaSparkContext(sc);
// one glob does the walk: only the matching metric files come back
JavaPairRDD<String, String> allMetricFiles = jsc.wholeTextFiles("hdfs:///dir/subdir/*/file.json");
for (Tuple2<String, String> each : allMetricFiles.collect()) {
	logger.info("Only metric file: " + each._1);
}