1. What is CRISP-DM?
Answers:
• Microsoft's linear regression algorithm
• A six phase method for predicting
e-commerce buying habits
• A decision tree developed in the 1980's
but almost entirely replaced by the CART method today
• A cross-industry
standard process for data mining
2. Which of the following is valid XML?
Answers:
• <valid>This
One</valid>
• All
are valid
• <body
answer="valid">This One</body>
•
<valid>"This One"</valid>
3. Which of these is an example of a sequential
pattern relationship?
Answers:
• Placing two frequently
purchased items next to each other on the shelf
• Reorganizing your
basketball team's starting lineup based on an analysis of performance
• Using business
experience and gut instinct to design a new floorplan in a grocery store
•
Predicting the likelihood of a backpack being purchased based on a consumer's
purchase of sleeping bags and hiking shoes
4. Sharding refers to:
Answers:
• none of the above
• simultaneously
accessing multiple object databases over SSH
•
partitioning a database for distribution across different servers
• a measure of the noise
in a database's contents
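Note: a minimal sketch of how records might be assigned to shards, assuming simple hash-based partitioning. The function name `shard_for`, the key format, and the shard count are hypothetical; real systems often use range-based or consistent-hashing schemes instead.

```python
import hashlib

def shard_for(key, n_shards):
    """Map a record key to a shard index in [0, n_shards)."""
    # A stable hash (unlike Python's built-in hash()) gives the same
    # shard for the same key across processes and restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

# Hypothetical keys, distributed across 4 servers.
keys = ["user:1", "user:2", "user:42", "order:7"]
placement = {k: shard_for(k, 4) for k in keys}
```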
5. Which of the following is most appropriate
for finding the shortest chain of friends linking two people in a social graph
who are not friends with each other?
Answers:
• Neural Networks
• k-means algorithm
•
Dijkstra's algorithm
• Markov chains
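Note: a minimal sketch of Dijkstra's algorithm applied to a friendship graph, assuming every friendship has weight 1 (in which case a plain breadth-first search would give the same answer). The graph, the names, and the function `shortest_chain` are made up for illustration.

```python
import heapq

def shortest_chain(graph, start, goal):
    """Return the shortest list of people linking start to goal, or None."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour in graph.get(node, ()):
            nd = d + 1  # every friendship edge costs 1
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                prev[neighbour] = node
                heapq.heappush(heap, (nd, neighbour))
    if goal not in dist:
        return None
    # Walk the predecessor links back from goal to start.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

friends = {
    "ann": ["bob", "cy"],
    "bob": ["ann", "dee"],
    "cy": ["ann", "dee"],
    "dee": ["bob", "cy", "eve"],
    "eve": ["dee"],
}
chain = shortest_chain(friends, "ann", "eve")
```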
6. What is a genetic algorithm?
Answers:
• An algorithm that
estimates how well a particular pattern (a model and its parameters) meets the
criteria of the KDD process. Evaluation of predictive accuracy (validity) is
based on cross validation. Evaluation of descriptive quality involves
predictive a
• A classic algorithm
for frequent item set mining and association rule learning over transactional
databases. It proceeds by identifying the frequent individual items in the
database and extending them to larger and larger item sets as long as those
item s
• A
search algorithm that enables us to locate an optimal binary string by processing
an initial random population of binary strings by performing operations such as
artificial mutation, crossover and selection.
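Note: a minimal sketch of a genetic algorithm over binary strings, assuming a toy fitness function (the count of 1 bits). The population size, mutation rate, tournament selection, and single-point crossover are illustrative choices, not a canonical configuration.

```python
import random

def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Evolve binary strings toward maximum fitness; return (best, history)."""
    rng = random.Random(seed)
    fitness = sum  # toy fitness: number of 1 bits
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: carry the best string over unchanged
        while len(next_pop) < pop_size:
            # Selection: best of three randomly drawn individuals, twice.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Crossover: splice the parents at a random cut point.
            cut = rng.randint(1, length - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness), history

best, history = evolve()
```

Because the best string is always copied into the next generation, the best fitness recorded in `history` never decreases.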
7. Which XPath selector expression captures all
link elements of the form 'http://example.com/profile/12345' in an HTML page
while excluding all links of the form 'http://example.com/casenumber/12345'?
Answers:
• //a[contains(@href,
"profile")]/@href
•
//a[contains(@href, "profile")]
• //href/profile
• //a/profile
8. Which of the following is not valid JSON?
Answers:
•
{["answer": "this one"]}
• {"answer":
["this one"]}
• All are valid
• {"answer":
"this one"}
9. Which industry can benefit from data mining?
Answers:
• Manufacturing
• Retail
• Finance/Banking
• All of
these
10. In predictive models, the values or classes
to be predicted are called the:
Answers:
• All of
these
• Response
• Target variables
• Dependent
11. Data items grouped into relationships and
preferences are known as:
Answers:
•
Clusters
• Predictable Sets
• Functional
Organizations
• Degrees of Fit
12. True or False? Economic indicators are
external data factors.
Answers:
• True
• False
13. What is a KDD Process?
Answers:
• Knoop-hardness
measured through high-impact dimension
• Differential
Decryption
• K-mean Data Discovery
•
Knowledge Discovery in Databases
14. Which of the following disciplines overlaps
Data Mining?
Answers:
• All of
the above
• Statistics
• Artificial
Intelligence
• Linguistics
15. Which are popular data mining methods?
Answers:
• Probabilistic
Graphical Dependency Models
• Relational Learning
Models
• All of
these
• Decision Trees and
Rules
16. Which of these are NOT types of analytical
software:
Answers:
• Machine learning
• Neural network
• Statistical
• All
are valid types
17. What is data visualization?
Answers:
• A structured and
developed prediction of data results
• The visual
interpretation of complex relationships in multidimensional data
• The technical term for
the act of data being stored in a server
18. Which of the following is not a relational
database?
Answers:
• Google Big Table
• MongoDB
• Apache Cassandra
• All of
the above
19. Decision trees are able to handle missing
values without using any impute transformation. True or False?
Answers:
• False
• True
20. Which of the following is valid XML?
Answers:
• <valid>This
One</valid>
•
<valid>"This One"</valid>
• <body
answer="valid">This One</body>
• All
are valid
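Note: a minimal sketch for checking the candidate snippets with Python's standard XML parser. With no schema or DTD given, it tests well-formedness only, which is assumed to be what "valid" means in this question.

```python
import xml.etree.ElementTree as ET

candidates = [
    '<valid>This One</valid>',
    '<valid>"This One"</valid>',
    '<body answer="valid">This One</body>',
]

def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

results = [is_well_formed(c) for c in candidates]
```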
21. A(n) _____ algorithm creates rules that
describe how often events have occurred together.
Answers:
•
associative
• pruning
• CHAID
• artificial
22. Changes to parts of a code could lead to the
problem of ______________ data.
Answers:
•
inconsistent
• dirty
• granular
• nonintegrated
23. What are decision trees?
Answers:
•
Structures that generate rules for the classification of a dataset
• Hierarchical
dimensions that can be created with a hyper cube browser
• Data not collected by
the organization, such as data available from a reference book
• Complex reports
generated by a qualified data scientist
24. The annual revenue of an international
company is correlated with other attributes like advertisement, exchange rate,
inflation rate etc. Having these values (or their reliable estimations for the
next year), the company has to calculate its expected revenue for the next
year. Choose the appropriate data mining task for this business problem.
Answers:
• Segmentation
•
Regression
• Classification
25. You are a credit risk manager of a retail
bank. Some information about customers is available to analytics. Based on
this data you have to decide whether a person will be a good or bad customer.
Choose the appropriate data mining task for this business problem.
Answers:
• Regression
• Segmentation
•
Classification
26. What is CRISP-DM?
Answers:
• A
cross-industry standard process for data mining
• A six phase method for
predicting e-commerce buying habits
• Microsoft's linear
regression algorithm
• A decision tree
developed in the 1980's but almost entirely replaced by the CART method today
27. In a neural net, to what does topology
refer?
Answers:
• The
number of layers and the number of nodes in each layer
• The graphical
visualization of the data
• The number of nodes
utilized
• The range of variables
in a set
28. What is the measure of how much two random
variables change together?
Answers:
• stochastic inertia
•
covariance
• polyconvergence
• binary standard deviation
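Note: a minimal sketch of the sample covariance, assuming the usual n - 1 denominator. The function name and the data are illustrative.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of deviations from the means.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

rising = covariance([1, 2, 3], [2, 4, 6])   # variables move together
falling = covariance([1, 2, 3], [6, 4, 2])  # variables move in opposition
```

A positive result means the two variables tend to increase together; a negative result means one tends to fall as the other rises.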
29. Which of the following clustering algorithms
can find clusters of arbitrary shape?
Answers:
• Single-Link
• DBSCAN
• None of these
• Both
of these
30. A function used by a node in a neural net to
transform input data from any domain of values into a finite range of values is
known as a(n):
Answers:
•
Activation Function
• Chi-square
• Confusion matrix
• Antecedent
31. True or False? Loose coupling data mining
architecture is mainly for memory-based data mining systems that do not
require high scalability and high performance.
Answers:
• False
• True
32. Data not collected by the organization, such
as data from a proprietary database, that is combined with the organization’s
own data is known as:
Answers:
• Non-applicable data
• Noise
•
Overlay
• Overfitting
33. With which of these layers does a neural
network start?
Answers:
• Input
layer
• Hidden Layer
• Output Layer
• Transparent layer
34. Suppose that the company's marketing
department collects data from customers and wants to form customer groups so
that each offer can be targeted at the most appropriate group. Choose the
appropriate data mining task for this business problem.
Answers:
•
Segmentation
• Classification
• Regression
35. What is the front end layer of data mining
architecture?
Answers:
• An
intuitive and user friendly user interface
• The team of
programmers who designed the software utilized in a particular mining project
• The hardware designed
specifically for storage of massive amounts of data
• Firewalls established
to protect data from malicious sources
36. To increase the confidence of your estimate of
classification performance on the entire population, you should:
Answers:
• Increase the size of
the training dataset
• Decrease the size of
the test dataset
•
Increase the size of the test dataset
• Decrease the size of
the training dataset
37. Which data mining technique organizes sets
of data into predefined groups?
Answers:
• Sequential Patterning
• Clustering
•
Classification
• Gamification
38. In the association between two variables,
what is the difference between the antecedent and the consequent?
Answers:
• The
antecedent is on the left, the consequent on the right
• Nothing, they are
interchangeable
• The antecedent is
always a very complex variable
• The antecedent is on
the right, the consequent is on the left.
39. A hyperplane is a
Answers:
• non-terminating error
condition
• variant of the C4.5
algorithm
• collection of linked
hypertext files
•
decision boundary separating classes of data
40. Which of these are NOT considered internal
data factors?
Answers:
• Staff Skills
•
Economic downturns
• Product Positioning
• Price
41. The level of the model that specifies (often
graphically) which variables are locally dependent on each other.
Answers:
•
Structural Level
• Quantitative Level
• Qualitative Level
• Primary Level
42. The algorithm powering the Google search
engine is:
Answers:
• AdaBoost
• The Brin-Page Method
•
PageRank
• GoogleCrawler
43. Which of these is NOT a common description
of layers?
Answers:
•
Functional
• Input
• Hidden
• Output
44. Support Vector Machines have an advantage
over Neural Networks because SVMs are
Answers:
• more
resistant to local minima convergence
• parametric
• none of the above
• easier to train via
online learning
45. Which of these is an example of a sequential
pattern relationship?
Answers:
• Using business
experience and gut instinct to design a new floorplan in a grocery store
• Reorganizing your
basketball team's starting lineup based on an analysis of performance
• Placing two frequently
purchased items next to each other on the shelf
•
Predicting the likelihood of a backpack being purchased based on a consumer's
purchase of sleeping bags and hiking shoes
46. What is Change and Deviation Detection?
Answers:
• The process of finding
a model which describes significant dependencies between variables
• Methods for finding a
compact description for a subset of data.
• A task which consists
of techniques for estimating, from data, the joint multi-variate probability
density function of all of the variables/fields in the database.
• A task
focusing on discovering the most significant changes in the data from
previously measured or normative values
47. In the analysis of time-series data, the
mean value over a given time period (usually some interval in the past up to
the present) is called a(n)
Answers:
• unbiased mean
• partial average
• compounded mean
• moving
average
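Note: a minimal sketch of a simple (unweighted) moving average over a fixed window. The function name and the sample series are illustrative.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages

smoothed = moving_average([1, 2, 3, 4, 5], 3)
```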
48. Sharding refers to:
Answers:
• none of the above
•
partitioning a database for distribution across different servers
• a measure of the noise
in a database's contents
• simultaneously
accessing multiple object databases over SSH
49. What is Dependency Modeling?
Answers:
• The
process of finding a model which describes significant dependencies between
variables
• A multi-step process
involving data preparation, pattern searching, knowledge evaluation, and
refinement with iteration after modification.
• A task which consists
of techniques for estimating, from data, the joint multi-variate probability
density function of all of the variables/fields in the database.
• Learning a function
that maps a data item into one of several predefined groups or clusters.
50. What is Regression?
Answers:
•
Learning a function that maps a data item to a real-valued prediction variable.
• An expression E in a
language L describing facts in a subset FE of F.
• Learning a function
that maps a data item into one of several predefined groups.
• A descriptive task
where one seeks to identify a finite set of categories to describe the data.
51. Which of the following storage solutions is
most appropriate for a semi-structured dataset whose members do not all have
the same attributes?
Answers:
• MariaDB
•
MongoDB
• SQLite
• MySQL
52. In order to estimate classification
performance on an entire population, you need _______
Answers:
•
disjoint training and test datasets
• Disjoint training
• (None of these)
• Test Datasets
53. What is the type of data mining that drives
the Amazon.com recommendation system?
Answers:
•
Association Learning
• Anomaly Detection
• Clustering Algorithms
• Fuzzy Logic
54. Which of the following algorithms is
generally suitable for unsupervised learning tasks?
Answers:
• Restricted Boltzmann
machine
• k-nearest neighbor
• info-fuzzy networks
•
k-means algorithm
55. True or False? Tests in CART are always
Binary.
Answers:
• True
• False
56. Which of these are evolutionary
computational methods?
Answers:
• Heuristic algorithms
• Bayesian inference
algorithms
•
Genetic algorithms
• Clustering algorithms
57. Generalization error is a consequence of
Answers:
• Poorly defined
Chernoff Bound
• Underfit
• Parametric analysis
•
Overfit
58. A technique that classifies each record in a
dataset based on a combination of the classes of the k record(s) most similar
to it in a historical dataset is:
Answers:
• Decision Treeing
•
Nearest Neighbor
• Association Model
Query
• Logistic Regression
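Note: a minimal sketch of k-nearest-neighbor classification, assuming Euclidean distance and a simple majority vote. The historical records, the labels, and the function name `knn_classify` are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(history, query, k=3):
    """Label a query point by majority vote among its k closest records."""
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, Counter returns the label that was seen first.
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (features, class).
history = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high"),
]
label_near_origin = knn_classify(history, (1.1, 1.0))
label_far = knn_classify(history, (5.0, 5.1))
```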
59. What is the extraction of useful if-then
rules from data based on statistical significance?
Answers:
• Dynamic Information
Inference
• Preliminary Method Mapping
• Rule
Induction
• Fuzzy Logic
Application
60. What is a genetic algorithm?
Answers:
• A classic algorithm
for frequent item set mining and association rule learning over transactional
databases. It proceeds by identifying the frequent individual items in the
database and extending them to larger and larger item sets as long as those
item s
• An algorithm that
estimates how well a particular pattern (a model and its parameters) meets the
criteria of the KDD process. Evaluation of predictive accuracy (validity) is
based on cross validation. Evaluation of descriptive quality involves
predictive a
• A
search algorithm that enables us to locate an optimal binary string by processing
an initial random population of binary strings by performing operations such as
artificial mutation, crossover and selection.
61. In the MapReduce model, Map and Reduce
functions act directly on which kind of data structure?
Answers:
•
key-value pair
• linked lists
• MySQL matrices
• relational databases
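Note: a minimal in-memory sketch of the MapReduce data flow, using word counting as the example. It only illustrates how map and reduce functions consume and emit key-value pairs; it is not Hadoop's API, and the function names are illustrative.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse the values for one key into a single (key, total) pair."""
    return (key, sum(values))

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count(["a b a", "b c"])
```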
62. What is Interestingness?
Answers:
• A multi-step process
involving data preparation, pattern searching, knowledge evaluation, and
refinement with iteration after modification.
• An expression E in a
language L describing facts in a subset FE of F.
• A discovered pattern
that is true on new data with some degree of certainty, and generalizes to
other data.
• An
overall measure of pattern value, combining validity, novelty, usefulness, and
simplicity.
63. Which of the following is most appropriate
for finding the shortest chain of friends linking two people in a social graph
who are not friends with each other?
Answers:
• Markov chains
• Neural Networks
•
Dijkstra's algorithm
• k-means algorithm
64. True or False? The MARS algorithm cannot
produce rules.
Answers:
• True
• False
65. In which type of analysis is a Kohonen
feature map typically employed?
Answers:
•
Cluster analysis
• Exploratory data
analysis
• Descriptive modeling
analysis
• Predictive analysis
66. What is Classification?
Answers:
•
Learning a function that maps a data item into one of several predefined
groups.
• Methods for finding a
compact description for a subset of data.
• A discovered pattern
that is true on new data with some degree of certainty, and generalizes to
other data.
• A descriptive task
where one seeks to identify a finite set of categories to describe the data.
67. Which of the following is NOT a common
source system?
Answers:
• Node
• SAP source
• DB Connect
• UDC
68. A DBMS reduces data redundancy and
inconsistency by
Answers:
• Utilizing a data dictionary
•
Enforcing referential integrity
• uncoupling program and
data
• Minimizing isolated
files with repeated data
69. Which of the following clustering
algorithms can optimize an objective
function?
Answers:
• k-means only
• Subspace Clustering
Algorithms
• DBSCAN and Single
Link
•
k-means and CLARANS
70. Which of the following is not a common goal
of the KDD Process:
Answers:
• Prediction
•
Performance
• Description
71. What is Clustering?
Answers:
• A
descriptive task where one seeks to identify a finite set of categories to
describe the data.
• A task which consists
of techniques for estimating, from data, the joint multi-variate probability
density function of all of the variables/fields in the database.
• Learning a function
that maps a data item into one of several predefined groups or clusters.
• The process of finding
a model which describes significant dependencies between variables
72. Which of the following is NOT a function of
data warehouses?
Answers:
• Extracting data
•
Cleaning dirty data
• Storing purchased data
• Cleaning data
73. In Natural Language Processing, what is the
role of a lexical analyzer?
Answers:
• processes the parse
tree for semantic meaning
• generates a
context-free grammar
• checks the validity of
a token
• splits
the stream of input characters into tokens
74. Which of the following properties is a
constraint on a RESTful application?
Answers:
•
stateless
• linearly separable
• returns JSON output
• stateful
75. What is Summarization?
Answers:
• A descriptive task
where one seeks to identify a finite set of categories to describe the data.
•
Methods for finding a compact description for a subset of data.
• A task focusing on
discovering the most significant changes in the data from previously measured
or normative values
• The process of finding
a model which describes significant dependencies between variables
76. Which of the following is NOT a method of
combining multiple models into an ensemble model?
Answers:
•
Bootstrapping
• Averaging
• Stacking
• Voting
77. The component of the Hadoop Distributed
Filesystem responsible for storing metadata is called the
Answers:
• Datanode
• FS Shell
•
Namenode
• DFSAdmin
78. Converted information to provide insights
about historical patterns and future trends is known as:
Answers:
• Clustering
• Linear regression
• Meta-data
•
Knowledge
79. Which of the following properties applies to
Single-Layer Perceptrons?
Answers:
• continuous output
• random
initialization of weights
• backpropagation
• able to learn
non-linear separations
80. Which of the following applications are
usually used to classify students' performances?
Answers:
• Market-basket analysis
• Regression analysis
• Cluster analysis
•
If...then... analysis
81. The authentication protocol used by many
significant web APIs is called:
Answers:
• OAuth
• HTTPS
• SSL
• PGP
82. In any numerical data set with a meaningful
mean value, what is the minimum fraction of data that will fall within n
standard deviations of the mean?
Answers:
• 1/n^2
•
1-1/n^2
• 1/n
• 1/2n
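Note: a minimal sketch that compares the observed fraction of data within n standard deviations against the bound 1 - 1/n^2 from Chebyshev's inequality. The data set is made up, and the population standard deviation is assumed.

```python
import statistics

def fraction_within(data, n):
    """Fraction of values lying within n standard deviations of the mean."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(1 for x in data if abs(x - mu) <= n * sigma) / len(data)

def chebyshev_bound(n):
    """Minimum fraction guaranteed within n standard deviations."""
    return 1 - 1 / n**2

# Deliberately skewed data: the bound holds for any distribution.
data = [1, 2, 3, 4, 100]
checks = [(n, fraction_within(data, n), chebyshev_bound(n)) for n in (1.5, 2, 3)]
```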
83. What is CURL?
Answers:
• A
command-line tool for retrieving files
• A methodology for
classifying hidden features of data
• The part of HTTP that
specifies access permission
• Combinatorial
Unsupervised Recursive Learning algorithm
84. Which of these is a possible architecture of
a data mining system?
Answers:
• Transitive coupling
• Quickstart coupling
•
No-coupling
• Magnetic coupling
85. Which xpath selector expression captures all
link elements of the form 'http://example.com/profile/12345' in an html page
while excluding all links of the form 'http://example.com/casenumber/12345'?
Answers:
• //a/profile
•
//a[contains(@href, "profile")]
• //a[contains(@href,
"profile")]/@href
• //href/profile
86. What is the first step in the business
understanding phase?
Answers:
• Create data mining
goals to achieve the business objectives
• Firmly
grasp business objectives and needs
• Create a list of all
relevant algorithms to be applied to the task
• Assess the current
situation by finding out the resources, assumptions, constraints etc.
87. Taking multiple random samples of data and
building a classification model for each is known as:
Answers:
• Binning
• Fuzzy Sampling
•
Boosting
• Clustering
88. What is Pig?
Answers:
• A programming language
that enables Hadoop to operate as a data warehouse.
• A
programming language that simplifies the common tasks of working with Hadoop.
• None of these
89. A commonly used continuous alternative to
the step function in multi-layered neural network output is the
Answers:
• logarithmic function
• hyperbolic function
•
logistic function
• multi-layered NN
cannot compute continuous output
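Note: a minimal sketch of the logistic (sigmoid) function, written to avoid overflow for large negative inputs. The function name is illustrative.

```python
import math

def logistic(x):
    """Smooth, continuous squashing of any real x into the range 0..1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this form cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)

midpoint = logistic(0)
```

Unlike the step function, the output changes gradually around zero, which is what makes gradient-based training possible.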
90. Which of the following algorithms produces
decision trees?
Answers:
• DBSCAN
• ID3
• none of the above
• logistic regression
91. Which of these is not a step in the KDD
process?
Answers:
• Data Mining
• Data Cleaning
• Data Integration
• Data
Quantification
92. "In 2% of the purchases at the hardware
store, both a pick and a shovel were bought," is an example of:
Answers:
• Supervised learning
• Validation
•
Support
• Topology
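Note: a minimal sketch of how support is computed for an item set, assuming each transaction is a set of items. The basket data is made up so that the pick-and-shovel pair appears in 1 of 50 purchases (2%).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

# Hypothetical hardware-store baskets: 50 purchases in total.
baskets = (
    [{"pick", "shovel"}]
    + [{"pick"}] * 9
    + [{"shovel"}] * 4
    + [{"nails"}] * 36
)
pair_support = support(baskets, {"pick", "shovel"})
```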
93. Apriori is a seminal algorithm for finding
frequent item sets using:
Answers:
• Normal mixture models
• Overfitting methods
•
Candidate generation
• None of these
94. If more than one value occurs the same
number of times, the data is:
Answers:
• Multivariated
• Multi-faceted
•
Multi-modal
• Multi-leafed
95. The level of the model that specifies the
strengths of the dependencies using some numerical scale.
Answers:
• Numeric Level
• Primary Level
•
Quantitative Level
• Dependency Level
96. Which of the following methods can be used
for modeling a categorical target variable?
Answers:
• Non-Linear Regression
• All of the Above
• Regression
•
Logistic Regression
• ARIMA
97. Which of the following is not a primary
phase of a Hadoop Reducer?
Answers:
• Shuffle
• Reduce
• Map
• Sort
98. The measured differences between a model and
its predictions are known as:
Answers:
• Outliers
• Range
• Non-applicable data
• Noise
99. Which decision tree method performs
multi-level splits when computing classification trees?
Answers:
• C4.5 algorithm
• ID3 (Iterative
Dichotomiser 3)
• CHAID
(Chi Square Automatic Interaction Detection)
• CART (Classification
and Regression Trees)
100. True or False? Artificial neural networks
are linear predictive models.
Answers:
• False
• True
101. Which of the following is not an
appropriate tool for harvesting data from a website that accesses its database
through Javascript/AJAX calls?
Answers:
• PhantomJS
• wget
• Selenium
• All of the above are
appropriate
102. What is the advantage of the k-Medoids
Clustering Algorithm over the k-Means Clustering (Lloyd's) Algorithm?
Answers:
• represents clusters by
center
• all of the above
• more
resistant to outliers
• uses iterative
refinement
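Note: a minimal one-dimensional sketch of why a medoid can be less sensitive to outliers than a mean. It compares the two cluster representatives on made-up data; it is not a full k-medoids implementation.

```python
def mean_center(points):
    """k-means style representative: the arithmetic mean."""
    return sum(points) / len(points)

def medoid(points):
    """k-medoids style representative: the actual data point with the
    smallest total distance to all other points."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

cluster = [1, 2, 3, 4, 100]  # 100 is an outlier
center_mean = mean_center(cluster)
center_medoid = medoid(cluster)
```

Here the mean is dragged toward the outlier, while the medoid stays among the bulk of the data.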
103. Which of the following is not valid JSON?
Answers:
• All are valid
•
{["answer": "this one"]}
• {"answer":
["this one"]}
• {"answer":
"this one"}
104. Which of the following is part of a retail
customer data mining strategy?
Answers:
• holiday sale
• customer testimonials
•
loyalty cards
• money-back guarantee
105. The two major functions of BI servers are:
Answers:
•
Management and delivery
• Processing and
management
• Source and results
• Application and
delivery
106. How do you measure interestingness in
association patterns?
Answers:
•
measure lift
• measure accuracy
• measure variance
• measure relevance
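Note: a minimal sketch of lift for an association rule, assuming the common definition support(A and C) / (support(A) * support(C)). The transactions and item names are made up.

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_c = sum(1 for t in transactions if c <= set(t))
    count_both = sum(1 for t in transactions if (a | c) <= set(t))
    if count_a == 0 or count_c == 0:
        raise ValueError("antecedent and consequent must each occur")
    return (count_both / n) / ((count_a / n) * (count_c / n))

baskets = [{"bread", "butter"}, {"bread", "butter"}, {"bread"}, {"milk"}]
rule_lift = lift(baskets, {"bread"}, {"butter"})
```

A lift above 1 suggests the items occur together more often than they would if they were independent; a lift near 1 suggests no association.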
107. Where can a website operator generally find
data on her customers' IP addresses?
Answers:
• all of the above
• HTTP request headers
• cookies
• server
logfiles
108. Hash-based technique, Transaction
Reduction, Partitioning, Sampling, and Dynamic Itemset Counting are all examples of
what?
Answers:
• Method to repeatedly
scan the database and check a large set of candidates by pattern
matching.
•
Techniques to improve the efficiency of an Apriori algorithm
• Methods of generating
frequent item sets without candidate generation.
• Methods for finding a
compact description for a subset of data.
109. Data mining provides a link between:
Answers:
• Parallel processing
and RAID
• Separate
transactional and analytical systems
• Online analytical
processing and dynamic information
• Genetic algorithms and
logistic regression
110. A descriptive approach to exploring data
that can help identify relationships among values in a database is:
Answers:
• Function activation
• Predictive analysis
• Clustering
• Link
analysis
111. What is Hive?
Answers:
• Hive
enables Hadoop to operate as a data warehouse.
• Hive is a programming
language that simplifies the common tasks of working with Hadoop.
• Both of these
112. What is the purpose of the Hadoop
Distributed File System (HDFS)?
Answers:
• Creating a context in
which there are no restrictions on the data, enabling it to be unstructured and
schemaless.
• All of these.
• Ensuring that data is
replicated with redundancy across the cluster.
• To
enable computation to take place by allowing each server to have access to the
data.
113. The silhouette coefficient can be used to
determine the natural number of clusters for ________.
Answers:
• Density Based
Algorithms
• Hierarchical
Algorithms
• Subspace Clustering
Algorithms
•
Partitioning Algorithms
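Note: a minimal sketch of the mean silhouette coefficient, assuming Euclidean distance and the convention that a point alone in its cluster scores 0. The points and labelings are made up; in practice one would compute this for several candidate cluster counts and prefer the count with the highest score.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient of a labeled clustering."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette(points, [0, 1, 0, 1])   # mixed-up grouping
```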